Automagic epi curve plotting: part I

As of 24 May 2018, the underlying data schema of the Github repo from which the epi curve plotter draws its data has changed. Therefore, a lot of the code had to be adjusted. The current code can be found here on Github. This also plots a classical epi curve.

One of the best resources during the 2013-16 West African Ebola outbreak was Caitlin RiversGithub repo, which was probably one of the best ways to stay up to date on the numbers. For the current outbreak, she has also set up a Github repo, with really frequent updates straight from the WHO’s DON data and the information from DRC Ministry of Public Health (MdlS) mailing list.1 Using R, I have set up a simple script that I only have to run every time I want a pre-defined visualisation of the current situation. I am usually doing this on a remote RStudio server, which makes matters quite easy for me to quickly generate data on the fly from RStudio.

Obtaining the most recent data

Using the following little script, I grab the latest from the ebola-drc Github repo:

# Fetch most recent DRC data.
library(magrittr)
library(curl)
library(readr)
library(dplyr)

current_drc_data <- Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%
  paste("raw_data/drc/", "drc-", ., ".csv", sep = "") %T>%
  curl_fetch_disk("https://raw.githubusercontent.com/cmrivers/ebola_drc/master/drc/data.csv", .) %>%
  read_csv()

This uses curl (the R implementation) to fetch the most recent data and save it as a timestamped2 file in the data folder I set up just for that purpose.3 Simply sourcing this script (source("fetch_drc_data.R")) should then load the current DRC dataset into the environment.4

Data munging

We need to do a little data munging. First, we melt down the data frame using reshape2‘s melt function. Melting takes ‘wide’ data and converumnts it into ‘long’ data – for example, in our case, the original data had a row for each daily report for each health zone, and a column for the various combinations of confirmed/probable/suspected over cases/deaths. Melting the data frame down creates a variable type column (say, confirmed_deaths and a value column (giving the value, e.g. 3). Using lubridate,5 the dates are parsed, and the values are stored in a numeric format.

library(magrittr)
library(reshape2)
library(lubridate)

current_drc_data %<>% melt(value_name = "value", measure.vars = c("confirmed_cases", "confirmed_deaths", "probable_cases", "probable_deaths", "suspect_cases", "suspect_deaths", "ruled_out"))
current_drc_data$event_date <- lubridate::ymd(current_drc_data$event_date)
current_drc_data$report_date <- lubridate::ymd(current_drc_data$report_date)
current_drc_data$value <- as.numeric(current_drc_data$value)

Next, we drop ruled_out cases, as they play no significant role for the current visualisation.

current_drc_data <- current_drc_data[current_drc_data$variable != "ruled_out",]

We also need to split the type labels into two different columns, so as to allow plotting them as a matrix. Currently, data type labels (the variable column) has both the certainty status (confirmed, suspected or probable) and the type of indicator (cases vs deaths) in a single variable, separated by an underscore. We’ll use stringr‘s str_split_fixed to split the variable names by underscore, and join them into two separate columns, suspicion and mm, the latter denoting mortality/morbidity status.

current_drc_data %<>% cbind(., str_split_fixed(use_series(., variable), "_", 2)) %>% 
                 subset(select = -c(variable)) %>% 
                 set_colnames(c("event_date", "report_date", "health_zone", "value", "suspicion", "mm"))

Let’s filter out the health zones that are being observed but have no relevant data for us yet:

relevant_health_zones <- current_drc_data %>% 
                         subset(select = c("health_zone", "value")) %>% 
                         group_by(health_zone) %>% 
                         summarise(totals = sum(value, na.rm=TRUE)) %>% 
                         dplyr::filter(totals > 0) %>% 
                         use_series(health_zone)

This gives us a vector of all health zones that are currently reporting cases. We can filter our DRC data for that:

current_drc_data %<>% dplyr::filter(health_zone %in% relevant_health_zones)

This whittles down our table by a few rows. Finally, we might want to create a fake health zone that summarises all other health zones’ respective data:

totals <- current_drc_data %>% group_by(event_date, report_date, suspicion, mm) 
                           %>% summarise(value = sum(value), health_zone=as.factor("DRC total"))

# Reorder totals to match the core dataset
totals <- totals[,c(1,2,6,5,3,4)]

Finally, we bind these together to a single data frame:

current_drc_data %<>% rbind.data.frame(totals)

Visualising it!

Of course, all this was in pursuance of cranking out a nice visualisation. For this, we need to do a couple of things, including first ensuring that “DRC total” is treated separately and comes last:

regions <- current_drc_data %>% use_series(health_zone) %>% unique()
regions[!regions == "DRC total"]
regions %<>% c("DRC total")

current_drc_data$health_zone_f <- factor(current_drc_data$health_zone, levels = regions)

I normally start out by declaring the colour scheme I will be using. In general, I tend to use the same few colour schemes, which I keep in a few gists. For simple plots, I prefer to use no more than five colours:

colour_scheme <- c(white = rgb(238, 238, 238, maxColorValue = 255),
                   light_primary = rgb(236, 231, 216, maxColorValue = 255),
                   dark_primary = rgb(127, 112, 114, maxColorValue = 255),
                   accent_red = rgb(240, 97, 114, maxColorValue = 255),
                   accent_blue = rgb(69, 82, 98, maxColorValue = 255))

With that sorted, I can invoke the ggplot method, storing the plot in an object, p. This is so as to facilitate later retrieval by the ggsave method.

p <- ggplot(current_drc_data, aes(x=event_date, y=value)) +

  # Title and subtitle
  ggtitle(paste("Daily EBOV status", "DRC", Sys.Date(), sep=" - ")) +
  labs(subtitle = "(c) Chris von Csefalvay/CBRD (cbrd.co) - @chrisvcsefalvay") +
  
  # This facets the plot based on the factor vector we created ear 
  facet_grid(health_zone_f ~ suspicion) +
  geom_path(aes(group = mm, colour = mm, alpha = mm), na.rm = TRUE) +
  geom_point(aes(colour = mm, alpha = mm)) +

  # Axis labels
  ylab("Cases") +
  xlab("Date") +

  # The x-axis is between the first notified case and the last
  xlim(c("2018-05-08", Sys.Date())) +
  scale_x_date(date_breaks = "7 days", date_labels = "%m/%d") +

  # Because often there's an overlap and cases that die on the day of registration
  # tend to count here as well, some opacity is useful.
  scale_alpha_manual(values = c("cases" = 0.5, "deaths" = 0.8)) +
  scale_colour_manual(values = c("cases" = colour_scheme[["accent_blue"]], "deaths" = colour_scheme[["accent_red"]])) +

  # Ordinarily, I have these derive from a theme package, but they're very good
  # defaults and starting poinnnnnntsssssts
  theme(panel.spacing.y = unit(0.6, "lines"), 
        panel.spacing.x = unit(1, "lines"),
        plot.title = element_text(colour = colour_scheme[["accent_blue"]]),
        plot.subtitle = element_text(colour = colour_scheme[["accent_blue"]]),
        axis.line = element_line(colour = colour_scheme[["dark_primary"]]),
        panel.background = element_rect(fill = colour_scheme[["white"]]),
        panel.grid.major = element_line(colour = colour_scheme[["light_primary"]]),
        panel.grid.minor = element_line(colour = colour_scheme[["light_primary"]]),
        strip.background = element_rect(fill = colour_scheme[["accent_blue"]]),
        strip.text = element_text(colour = colour_scheme[["light_primary"]])
  )
DRC EBOV outbreak, 22 May 2018. The data has some significant gaps, owing to monitoring and recording issues, but some clear trends already emerge from this simple illustration.

The end result is a fairly appealing plot, although if the epidemic goes on, one might want to consider getting rid of the point markers. All that remains is to insert an automatic call to the ggsave function to save the image:

Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%
  paste("DRC-EBOV-", ., ".png", sep="") %>%
  ggsave(plot = p, device="png", path="visualisations/drc/", width = 8, height = 6)

Automation

The cronR package has a built-in cron scheduler add-in for RStudio, allowing you to manage all your code scheduling needs.

Of course, being a lazy epidemiologist, this is the kind of stuff that just has to be automated! Since I run my entire RStudio instance on a remote machine, it would make perfect sense to regularly run this script. cronR package comes with a nice widget, which will allow you to simply schedule any task. Old-school command line users can, of course, always resort to ye olde command line based scheduling and execution. One important caveat: the context of cron execution will not necessarily be the same as of your R project or indeed of the R user. Therefore, when you source a file or refer to paths, you may want to refer to fully qualified paths, i.e. /home/my_user/my_R_stuff/script.R rather than merely script.R. cronR is very good at logging when things go awry, so if the plots do not start to magically accumulate at the requisite rate, do give the log a check.

The next step is, of course, to automate uploads to Twitter. But that’s for another day.

References   [ + ]

1. Disclaimer: Dr Rivers is a friend, former collaborator and someone I hold in very high esteem. I’m also from time to time trying to contribute to these repos.
2. My convention for timestamps is the military DTG convention of DDHHMMSSMONYEAR, so e.g. 7.15AM on 21 May 2018 would be 21071500MAY2018.
3. It is, technically, bad form to put the path in the file name for the curl::curl_fetch_disk() function, given that curl::curl_fetch_disk() offers the path parameter just for that. However, due to the intricacies of piping data forward, this is arguably the best way to do it so as to allow the filename to be reused for the read_csv() function, too.
4. If for whatever reason you would prefer to keep only one copy of the data around and overwrite it every time, quite simply provide a fixed filename into curl_fetch_disk().
5. As an informal convention of mine, when using the simple conversion functions of lubridate such as ymd, I tend to note the source package, hence lubridate::ymd over simply ymd.

Installing RStudio Server on Debian 9

Oh, wouldn’t it be just wonderful if you could have your own RStudio installation on a server that you could then access from whatever device you currently have, including an iPad? It totally would. Except it’s some times far from straightforward. Here’s how to do it relatively painlessly.

Step 1: Get a server

Choose a suitable (and affordable) server, pick a location near you in the drop-down menu on the bottom, and press Add this Linode! to set up your first Linode.

I use Linode,1 and in general, their 4096 server is pretty good. Linodes can be very easily resized, so this should not be a worry.

Step 2: Set up the server

Once your Linode is up, it will turn up on your dashboard with a random name (linode1234567, typically). If you click on it, you will see your Linode is ‘Brand New’, which means you need to configure it. I usually keep them in groups depending on purpose: blog servers, various processing servers, hosts, research servers. Each of them then gets a name. Choose whatever nomenclature fits your needs best.

Give your Linode a name you can recognise it by, and assign it to a category.

Next, click on the Rebuild tab, and configure the root password, the operating system (we’ll be using Debian 9), the swap disk size (for R, it’s a good idea to set this as large as you can) and finally, set the deployment disk size. I usually set that for 66%.

Step 3: Install R

The Remote Access field shows the SSH access command, complete with the root user, as well as public IPs, default gateways and other networking stuff.

SSH into your Linode and log in as root. You will find your Linode’s IP address and other interesting factoids about it under the Remote access tab. I have obscured some information as I don’t want you scallywags messing around with my server, but the IP address is displayed both in the SSH link (the one that goes ssh [email protected] or something along these lines) and below under public IPs.

Using vim, open /etc/apt/sources.list and add the following source:

# CRAN server for Debian stretch (R and related stuff)
deb http://cran.rstudio.com/bin/linux/debian stretch-cran34/

Save the file, and next install dirmngr (sudo apt-get install dirmngr). Then, add the requisite GPG key for CRAN, update the repository and install r-base:

sudo apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'

sudo apt update
sudo apt install r-base

At this point, you can enter R to test if your R installation works, and try to install a package, like ggplot2. It should work, but may ask you to select an installation server. If all is well, this is what you should be getting:

[email protected]:~# R

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

In the prompt, enter install.packages("ggplot2"). Select a server if requested. Otherwise, watch ggplot2 (and a bazillion other packages) install. Then, quit R by calling q().

Step 4: Downgrade libssl

For some inscrutable reason, RStudio is currently set up to work with libssl1.0.0, whereas Debian 9 comes with libssl1.1.0 out of the box. Clearly that’s not going to work, so we’ll have to roll back our libssl. We’ll do so by creating a file called /etc/apt/sources.list.d/jessie.list using vim. And we’ll fill it with the following:

deb http://httpredir.debian.org/debian jessie main contrib non-free
deb-src http://httpredir.debian.org/debian jessie main contrib non-free

deb http://security.debian.org/ jessie/updates main contrib non-free
deb-src http://security.debian.org/ jessie/updates main contrib non-free

In case you’re curious: this creates a sources list called jessie, and allows you to draw from Debian Jessie. So let’s have apt get with the program (sudo apt update), and install libssl1.0.0 (sudo apt install libssl1.0.0). Ift may be prudent to also install the openssl tool corresponding to the libssl version (sudo apt install openssl/jessie).

Step 5: Time to install RStudio Server!

First, install gDebi, a package installer, by typing sudo apt-get install gdebi. Next, we’ll be grabbing the latest RStudio version using wget, and installing it using gDebi. Make sure you either do this in your home directory or in /tmp/, ideally. Note that the versions of RStudio tend to change – 1.1.442 was the current version, released 12 March 2018, at the time of writing this post, but may by now have changed. You can check the current version number here.

wget https://download2.rstudio.org/rstudio-server-1.1.442-amd64.deb
sudo gdebi rstudio-server-1.1.442-amd64.deb

If all is well and you said yes to the dress question about whether you actually want to install RStudio Server (no, you’ve been going through this whole pain in the rear for excrement and jocularity, duh) , you should see something like this:

(Reading database ... 55651 files and directories currently installed.)
Preparing to unpack rstudio-server-1.1.442-amd64.deb ...
Unpacking rstudio-server (1.1.442) ...
Setting up rstudio-server (1.1.442) ...
groupadd: group 'rstudio-server' already exists
rsession: no process found
Created symlink /etc/systemd/system/multi-user.target.wants/rstudio-server.service → /etc/systemd/system/rstudio-server.service.
● rstudio-server.service - RStudio Server
   Loaded: loaded (/etc/systemd/system/rstudio-server.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2018-03-31 21:59:25 UTC; 1s ago
  Process: 16478 ExecStart=/usr/lib/rstudio-server/bin/rserver (code=exited, status=0/SUCCESS)
 Main PID: 16479 (rserver)
    Tasks: 3 (limit: 4915)
   CGroup: /system.slice/rstudio-server.service
           └─16479 /usr/lib/rstudio-server/bin/rserver

Mar 31 21:59:25 localhost systemd[1]: Starting RStudio Server...
Mar 31 21:59:25 localhost systemd[1]: Started RStudio Server.

Now, if all goes well, you can navigate to your server’s IP at port 8787, and you should behold something akin to this:

The login interface. Technically, you could simply log in with your root password and root as the username. But just don’t yet.

There are a few more things you wish to install at this point – these are libraries that will help with SSL and other functionality from within R. Head back to the terminal and install the following:

sudo apt-get install libssl-dev libcurl4-openssl-dev libssh2-1-dev

Step 6: Some finishing touches

The login screen will draw on PAM, Unix’s own authentication module, in lieu of a user manager. As such, to create access, you will have to create a new Unix user with adduser, and assign a password to it. This will grant it a directory of its own, and you the ability to fine-tune what they should, and what they shouldn’t, have access to. Win-win! This will allow you to share a single installation among a range of people.

Step 7: To reverse proxy, or to not reverse proxy?

There are diverging opinions as to whether a reverse proxy such as NGINX carries any benefit. In my understanding, they do not, and there’s a not entirely unpleasant degree of security by obscurity in having a hard to guess port (which, by the way, you can change). It also makes uploads, on which you will probably rely quite a bit, more difficult. On the whole, there are more arguments against than in favour of a reverse proxy, but I may add specific guidance on reverse proxying here if there’s interest.

References   [ + ]

1. Using this link gives me a referral bonus of $20 as long as you remain a customer for 90 days. If you do not wish to do so, please use this link. It costs the same either way, and I would be using Linode anyway as their service is superbly reliable.