Automagic epi curve plotting: part I

As of 24 May 2018, the underlying data schema of the Github repo from which the epi curve plotter draws its data has changed. Therefore, a lot of the code had to be adjusted. The current code can be found here on Github. This also plots a classical epi curve.

One of the best resources during the 2013-16 West African Ebola outbreak was Caitlin RiversGithub repo, which was probably one of the best ways to stay up to date on the numbers. For the current outbreak, she has also set up a Github repo, with really frequent updates straight from the WHO’s DON data and the information from DRC Ministry of Public Health (MdlS) mailing list.1 Using R, I have set up a simple script that I only have to run every time I want a pre-defined visualisation of the current situation. I am usually doing this on a remote RStudio server, which makes matters quite easy for me to quickly generate data on the fly from RStudio.

Obtaining the most recent data

Using the following little script, I grab the latest from the ebola-drc Github repo:

# Fetch most recent DRC data.
library(magrittr)
library(curl)
library(readr)
library(dplyr)

current_drc_data <- Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%
  paste("raw_data/drc/", "drc-", ., ".csv", sep = "") %T>%
  curl_fetch_disk("https://raw.githubusercontent.com/cmrivers/ebola_drc/master/drc/data.csv", .) %>%
  read_csv()

This uses curl (the R implementation) to fetch the most recent data and save it as a timestamped2 file in the data folder I set up just for that purpose.3 Simply sourcing this script (source("fetch_drc_data.R")) should then load the current DRC dataset into the environment.4

Data munging

We need to do a little data munging. First, we melt down the data frame using reshape2‘s melt function. Melting takes ‘wide’ data and converumnts it into ‘long’ data – for example, in our case, the original data had a row for each daily report for each health zone, and a column for the various combinations of confirmed/probable/suspected over cases/deaths. Melting the data frame down creates a variable type column (say, confirmed_deaths and a value column (giving the value, e.g. 3). Using lubridate,5 the dates are parsed, and the values are stored in a numeric format.

library(magrittr)
library(reshape2)
library(lubridate)

current_drc_data %<>% melt(value_name = "value", measure.vars = c("confirmed_cases", "confirmed_deaths", "probable_cases", "probable_deaths", "suspect_cases", "suspect_deaths", "ruled_out"))
current_drc_data$event_date <- lubridate::ymd(current_drc_data$event_date)
current_drc_data$report_date <- lubridate::ymd(current_drc_data$report_date)
current_drc_data$value <- as.numeric(current_drc_data$value)

Next, we drop ruled_out cases, as they play no significant role for the current visualisation.

current_drc_data <- current_drc_data[current_drc_data$variable != "ruled_out",]

We also need to split the type labels into two different columns, so as to allow plotting them as a matrix. Currently, data type labels (the variable column) has both the certainty status (confirmed, suspected or probable) and the type of indicator (cases vs deaths) in a single variable, separated by an underscore. We’ll use stringr‘s str_split_fixed to split the variable names by underscore, and join them into two separate columns, suspicion and mm, the latter denoting mortality/morbidity status.

current_drc_data %<>% cbind(., str_split_fixed(use_series(., variable), "_", 2)) %>% 
                 subset(select = -c(variable)) %>% 
                 set_colnames(c("event_date", "report_date", "health_zone", "value", "suspicion", "mm"))

Let’s filter out the health zones that are being observed but have no relevant data for us yet:

relevant_health_zones <- current_drc_data %>% 
                         subset(select = c("health_zone", "value")) %>% 
                         group_by(health_zone) %>% 
                         summarise(totals = sum(value, na.rm=TRUE)) %>% 
                         dplyr::filter(totals > 0) %>% 
                         use_series(health_zone)

This gives us a vector of all health zones that are currently reporting cases. We can filter our DRC data for that:

current_drc_data %<>% dplyr::filter(health_zone %in% relevant_health_zones)

This whittles down our table by a few rows. Finally, we might want to create a fake health zone that summarises all other health zones’ respective data:

totals <- current_drc_data %>% group_by(event_date, report_date, suspicion, mm) 
                           %>% summarise(value = sum(value), health_zone=as.factor("DRC total"))

# Reorder totals to match the core dataset
totals <- totals[,c(1,2,6,5,3,4)]

Finally, we bind these together to a single data frame:

current_drc_data %<>% rbind.data.frame(totals)

Visualising it!

Of course, all this was in pursuance of cranking out a nice visualisation. For this, we need to do a couple of things, including first ensuring that “DRC total” is treated separately and comes last:

regions <- current_drc_data %>% use_series(health_zone) %>% unique()
regions[!regions == "DRC total"]
regions %<>% c("DRC total")

current_drc_data$health_zone_f <- factor(current_drc_data$health_zone, levels = regions)

I normally start out by declaring the colour scheme I will be using. In general, I tend to use the same few colour schemes, which I keep in a few gists. For simple plots, I prefer to use no more than five colours:

colour_scheme <- c(white = rgb(238, 238, 238, maxColorValue = 255),
                   light_primary = rgb(236, 231, 216, maxColorValue = 255),
                   dark_primary = rgb(127, 112, 114, maxColorValue = 255),
                   accent_red = rgb(240, 97, 114, maxColorValue = 255),
                   accent_blue = rgb(69, 82, 98, maxColorValue = 255))

With that sorted, I can invoke the ggplot method, storing the plot in an object, p. This is so as to facilitate later retrieval by the ggsave method.

p <- ggplot(current_drc_data, aes(x=event_date, y=value)) +

  # Title and subtitle
  ggtitle(paste("Daily EBOV status", "DRC", Sys.Date(), sep=" - ")) +
  labs(subtitle = "(c) Chris von Csefalvay/CBRD (cbrd.co) - @chrisvcsefalvay") +
  
  # This facets the plot based on the factor vector we created ear 
  facet_grid(health_zone_f ~ suspicion) +
  geom_path(aes(group = mm, colour = mm, alpha = mm), na.rm = TRUE) +
  geom_point(aes(colour = mm, alpha = mm)) +

  # Axis labels
  ylab("Cases") +
  xlab("Date") +

  # The x-axis is between the first notified case and the last
  xlim(c("2018-05-08", Sys.Date())) +
  scale_x_date(date_breaks = "7 days", date_labels = "%m/%d") +

  # Because often there's an overlap and cases that die on the day of registration
  # tend to count here as well, some opacity is useful.
  scale_alpha_manual(values = c("cases" = 0.5, "deaths" = 0.8)) +
  scale_colour_manual(values = c("cases" = colour_scheme[["accent_blue"]], "deaths" = colour_scheme[["accent_red"]])) +

  # Ordinarily, I have these derive from a theme package, but they're very good
  # defaults and starting poinnnnnntsssssts
  theme(panel.spacing.y = unit(0.6, "lines"), 
        panel.spacing.x = unit(1, "lines"),
        plot.title = element_text(colour = colour_scheme[["accent_blue"]]),
        plot.subtitle = element_text(colour = colour_scheme[["accent_blue"]]),
        axis.line = element_line(colour = colour_scheme[["dark_primary"]]),
        panel.background = element_rect(fill = colour_scheme[["white"]]),
        panel.grid.major = element_line(colour = colour_scheme[["light_primary"]]),
        panel.grid.minor = element_line(colour = colour_scheme[["light_primary"]]),
        strip.background = element_rect(fill = colour_scheme[["accent_blue"]]),
        strip.text = element_text(colour = colour_scheme[["light_primary"]])
  )
DRC EBOV outbreak, 22 May 2018. The data has some significant gaps, owing to monitoring and recording issues, but some clear trends already emerge from this simple illustration.

The end result is a fairly appealing plot, although if the epidemic goes on, one might want to consider getting rid of the point markers. All that remains is to insert an automatic call to the ggsave function to save the image:

Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%
  paste("DRC-EBOV-", ., ".png", sep="") %>%
  ggsave(plot = p, device="png", path="visualisations/drc/", width = 8, height = 6)

Automation

The cronR package has a built-in cron scheduler add-in for RStudio, allowing you to manage all your code scheduling needs.

Of course, being a lazy epidemiologist, this is the kind of stuff that just has to be automated! Since I run my entire RStudio instance on a remote machine, it would make perfect sense to regularly run this script. cronR package comes with a nice widget, which will allow you to simply schedule any task. Old-school command line users can, of course, always resort to ye olde command line based scheduling and execution. One important caveat: the context of cron execution will not necessarily be the same as of your R project or indeed of the R user. Therefore, when you source a file or refer to paths, you may want to refer to fully qualified paths, i.e. /home/my_user/my_R_stuff/script.R rather than merely script.R. cronR is very good at logging when things go awry, so if the plots do not start to magically accumulate at the requisite rate, do give the log a check.

The next step is, of course, to automate uploads to Twitter. But that’s for another day.

References   [ + ]

1.Disclaimer: Dr Rivers is a friend, former collaborator and someone I hold in very high esteem. I’m also from time to time trying to contribute to these repos.
2.My convention for timestamps is the military DTG convention of DDHHMMSSMONYEAR, so e.g. 7.15AM on 21 May 2018 would be 21071500MAY2018.
3.It is, technically, bad form to put the path in the file name for the curl::curl_fetch_disk() function, given that curl::curl_fetch_disk() offers the path parameter just for that. However, due to the intricacies of piping data forward, this is arguably the best way to do it so as to allow the filename to be reused for the read_csv() function, too.
4.If for whatever reason you would prefer to keep only one copy of the data around and overwrite it every time, quite simply provide a fixed filename into curl_fetch_disk().
5.As an informal convention of mine, when using the simple conversion functions of lubridate such as ymd, I tend to note the source package, hence lubridate::ymd over simply ymd.