Structuring R projects

There are some things that I call Smith goods:1 things I want, nay, require, but hate doing. A clean room is one of these – I have a visceral need to have some semblance of tidiness around me, I just absolutely hate tidying, especially in the summer.2 Starting and structuring packages and projects is another of these things, which is why I’m so happy things like cookiecutter exist that do it for you in Python.

I am famously laid back about structuring R projects – my chill attitude is only occasionally compared to the Holy Inquisition, the other Holy Inquisition and Gunny R. Lee Ermey’s portrayal of Drill Sgt. Hartman, and it’s been months since I last gutted an intern for messing up namespaces.3 So while I don’t like structuring R projects, I keep doing it, because I know it matters. That’s a pearl of wisdom that came occasionally at a great price, some of which I am hoping to save you by this post.

Five principles of structuring R projects

Every R project is different. Therefore, when structuring R projects, there has to be a lot more adaptability than usual. Even so, I try to follow five overarching principles.

  1. The project determines the structure. In a small exploratory data analysis (EDA) project, you might have some leeway as to structural features that you might not have when writing safety-critical or autonomously running code. This variability in R – reflective of the diversity of its use – means that it’s hard to devise a boilerplate that’s universally applicable to all kinds of projects.
  2. Structure is a means to an end, not an end in itself. The reason why gutting interns, scalping them or yelling at them Gunny style is inadvisable is not just the additional paperwork it creates for HR. Rather, the point of the whole exercise is to create people who understand why the rules exist and organically adopt them, understanding how they help.
  3. Rules are good, tools are better. When tools are provided that take the burden of adherence – linters, structure generators like cookiecutter, IDE plugins, &c. – off the developer, adherence is both more likely and simpler.
  4. Structures should be interpretable to a wide range of collaborators. Even if you have no collaborators, think from the perspective of an analyst, a data scientist, a modeller, a data engineer and, most importantly, the client who will at the very end receive the overall product.
  5. Structures should be capable of evolution. Your project may change objectives, it may evolve, it may grow. What was a pet project might become a client product. What was designed to be a massive, error-resilient superstructure might have to scale down. And most importantly, your single-player adventure may end up turning into an MMORPG. Your structure has to be able to roll with the punches.

A good starting structure

Pretty much every R project can be imagined as a sort of process: data gets ingested, magic happens, then the results – analyses, processed data, and so on – get spit out. The absolute minimum structure reflects this:

.
└── my_awesome_project
    ├── src
    ├── output
    ├── data
    │   ├── raw
    │   └── processed
    ├── README.md
    ├── run_analyses.R 
    └── .gitignore

In this structure, we see this reflected by having a data/ folder (a source), a folder for the code that performs the operations (src/) and a place to put the results (output/). The root analysis file (the sole R file on the top level) is responsible for launching and orchestrating the functions defined in the src/ folder’s contents.

The data folder

The data folder is, unsurprisingly, where your data goes. In many cases, you may not have any file-formatted raw data (e.g. where the raw data is accessed via a *DBC connection to a database), and you might even keep all intermediate data there (in the database, that is), although that’s pretty uncommon on the whole, and might not make you the local DBA’s favourite (not to mention data protection issues). So while the raw/ subfolder might be dispensed with, you’ll most definitely need a data/ folder.

When it comes to data, it is crucial to make a distinction between source data and generated data. Rich Fitzjohn puts it best when he says to treat

  • source data as read-only, and
  • generated data as disposable.

The preferred implementation I have adopted is to have

  • a data/raw/ folder, which is usually symlinked to a folder that is write-only to clients but read-only to the R user,4
  • a data/temp/ folder, which contains temp data, and
  • a data/output/ folder, if warranted.

The src folder

Some call this folder R – I find this a misleading practice, as you might have C++, bash and other non-R code in it, but the name is unfortunately enforced by R if you want to structure your project as a valid R package, which I advocate in some cases. I am a fan of structuring the src/ folder, usually according to the logical function of its contents. There are two systems of nomenclature that have worked really well for me and people I work with:

  • The library model: in this case, the root folder of src/ holds individual .R scripts that when executed will carry out an analysis. There may be one or more such scripts, e.g. for different analyses or different depths of insight. Subfolders of src/ are named after the kind of scripts they contain, e.g. ETL, transformation, plotting. The risk with this structure is that sometimes it’s tricky to remember what’s where, so descriptive file names are particularly important.
  • The pipeline model: in this case, there is a main runner script, or potentially a small number of them. These go through scripts in a sequence. It is a sensible idea in such a case to establish sequential subfolders, or sequentially numbered scripts, that are executed in order. Typically, this model performs better if there are at most a handful of distinct pipelines.

Whichever approach you adopt, a crucial point is to keep function definition and application separate. This means that only the pipeline or the runner scripts are allowed to execute (apply) functions, and other files are merely supposed to define them. Typically, folder level segregation works best for this:

  • keep all function definitions in subfolders of src/, e.g. src/data_engineering, and have the directly-executable scripts directly under src/ (this works better for larger projects), or
  • keep function definitions in src/, and keep the directly executable scripts in the root folder (this is more convenient for smaller projects, where perhaps the entire data engineering part is not much more than a single script).
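As a minimal sketch of the first of these arrangements – all file and function names below are purely illustrative – the runner script applies functions, and nothing but the runner script applies functions:

# src/run_sales_analysis.R – the runner: the only place where functions are applied.
# Function definitions live in subfolders such as src/data_engineering/.

# Source every definition file (definitions only, no side effects)
for (f in list.files("src/data_engineering", pattern = "\\.R$", full.names = TRUE)) {
  source(f)
}

# Apply the (hypothetical) functions defined above
raw       <- load_raw_data("data/raw/sales.csv")
processed <- clean_sales_data(raw)
saveRDS(processed, "data/processed/sales_clean.rds")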

output and other output folders

Output may mean a range of things, depending on the nature of your project. It can be anything from a whole D.Phil thesis written in a LaTeX-compliant form to a brief report to a client. There are a couple of conventions with regard to output folders that are useful to keep in mind.

Separating plot output

It is common to have a separate folder for plots (usually called figs/ or plots/), so that they can be used for various purposes. My personal preference is that plot output folders should be subfolders of output/, rather than top-tier folders, unless the plots themselves are the objective of the project. That is the case, for instance, where the project is intended to create a particular plot on a regular basis – as it was with the CBRD project, whose purpose was to regularly generate daily epicurves for the DRC Zaire ebolavirus outbreak.

With regard to maps, in general, the principle that has worked best for teams I ran was to treat static maps as plots. However, dynamic maps (e.g. LeafletJS apps), tilesets, layers or generated files (e.g. GeoJSON files) tend to deserve their own folder.

Reports and reporting

Not every project needs a reporting folder, but for business users, having a nice, pre-written reporting script that can be run automatically and produces a beautiful PDF report every day can be priceless. A large organisation I worked for in the past used this very well to monitor their Amazon AWS expenditure.5 A team of over fifty data scientists worked on a range of EC2 instances, and runaway spending – from provisioning instances that were too big, leaving instances on, and data transfer charges resulting from misconfigured instances6 – was rampant. So the client wanted daily, weekly, monthly and 10-day rolling usage nicely plotted in a report, by user, highlighting people who would go on the naughty list. This was very well accomplished by an RMarkdown template that was ‘knit’ every day at 0600 and uploaded as an HTML file onto an internal server, so that every user could see who’s been naughty and who’s been nice. EC2 usage costs went down by almost 30% within a few weeks, and that was without having to dismember anyone!7

Probably the only structural rule to keep in mind is to keep reports and reporting code separate. Reports are client products, reporting code is a work product and therefore should reside in src/.
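By way of illustration – the paths and file names here are hypothetical – such a reporting runner, invoked daily by cron or a similar scheduler, might be as simple as:

# Render the RMarkdown reporting template kept in src/ and drop the
# client-facing report into output/ – reporting code and report kept apart.
rmarkdown::render(
  input       = "src/reporting/usage_report.Rmd",
  output_file = paste0("usage_report_", Sys.Date(), ".html"),
  output_dir  = "output/reports"
)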

Requirements and general settings

I am, in general, not a huge fan of outright loading whole packages to begin with. Too many users of R don’t realise that

  • you do not need to attach (library(package)) a package in order to use a function from it – as long as the package is available to R, you can simply call the function as package::function(arg1, arg2, ...), and
  • attaching a package using library(package) puts every single exported function from that package into the namespace, shadowing by default any existing entries of the same name. This means that in order to deterministically know what any given symbol means, you would have to know, at all times, the order of package imports. Needless to say, there is enough stuff to keep in one’s mind when coding in R without having to worry about this as well.

However, some packages might be useful to import, and sometimes it’s useful to have an initialisation script. This may be the case in three particular scenarios:

  • You need a particular locale setting, or a particularly crucial environment setting.
  • It’s your own package and you know you’re not going to shadow already existing symbols.
  • You are not using packrat or some other package management solution, and definitely need to ensure some packages are installed, but prefer not to put the clunky install-if-not-present code in every single thing.

In these cases, it’s sensible to have a file you would source before every top-level script – in an act of shameless thievery from Python, I tend to call this requirements.R, and it includes some fundamental settings I like to rely on, such as setting the locale appropriately. It also includes a CRAN install check script, although I would very much advise the use of Packrat over it, since a simple install check is not version-sensitive.
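A minimal sketch of such a requirements.R – the locale and the package names are, of course, illustrative:

# requirements.R – sourced at the top of every top-level script.

# Fundamental session settings
Sys.setlocale("LC_ALL", "en_GB.UTF-8")

# Naive install-if-not-present check. Not version-sensitive, so prefer
# Packrat (or another dependency manager) where reproducibility matters.
required_packages <- c("data.table", "ggplot2")
missing_packages  <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}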

Themes, house style and other settings

It is common, in addition to all this, to keep some general settings. If your institution has a ‘house style’ for ggplot2 (as, for instance, a ggthemr file), this could be part of your project’s config. But where does this best go?

It would normally be perfectly fine to keep your settings in a config.R file at root level, but a config/ folder is much preferred as it prevents clutter if you derive any of your configurations from a git submodule. I’m a big fan of keeping house styles and other things intended to give a shared appearance to code and outputs (e.g. linting rules, text editor settings, map themes) in separate – and very, very well managed! – repos, as this ensures consistency across the board over time. As a result, most of my projects do have a config folder instead of a single configuration file.
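As a hedged illustration – the theme file and the theme_house() function are hypothetical – the runner might then pull the shared appearance in from config/:

# Shared appearance settings live in config/, which may itself be a git
# submodule pointing at the well-managed house-style repo.
source("config/ggplot_theme.R")    # defines, say, theme_house()
ggplot2::theme_set(theme_house())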

It is paramount to separate project configuration and runtime configuration:

  • Project configuration pertains to the project itself, its outputs, schemes, the whole nine yards. For instance, the paper size to use for generated LaTeX documents would normally be a project configuration item. Your project configuration belongs in your config/ folder.
  • Runtime configuration pertains to parameters that relate to individual runs. In general, you should aspire to have as few of these, if any, as possible – and if you do, you should keep them as environment variables. But if you do decide to keep them as a file, it’s generally a good idea to keep them at the top level, and store them not as R files but as e.g. JSON files. There are a range of tools that can programmatically edit and change these file formats, while changing R files programmatically is fraught with difficulties.

Keeping runtime configuration editable

A few years ago, I worked on a viral forecasting tool where a range of model parameters to build the forecast from were hardcoded as R variables in a runtime configuration file. It was eventually decided to create a Python-based web interface on top of it, which would allow users to see the results as a dashboard (reading from a database where forecast results would be written) and make adjustments to some of the model parameters. The problem was, that’s really not easy to do with variables in an R file.

On the other hand, Python can easily read a JSON file into memory, change values as requested and export them onto the file system. So instead of that, the web interface would store the parameters in a JSON file, from which R would then read them and execute accordingly. Worked like a charm. Bottom line – configurations are data, and using code to store data is bad form.
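A minimal sketch of that pattern, assuming the jsonlite package and an illustrative parameters.json at the top level:

# parameters.json – runtime configuration, editable by the web interface
# or any other tool that can write JSON:
#
#   { "r0": 1.8, "serial_interval": 14, "forecast_horizon": 28 }

params <- jsonlite::fromJSON("parameters.json")

# Hand the runtime parameters to the model run (run_forecast() is hypothetical)
forecast <- run_forecast(
  r0              = params$r0,
  serial_interval = params$serial_interval,
  horizon         = params$forecast_horizon
)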

Dirty little secrets

Everybody has secrets. In all likelihood, your project is no different: passwords, API keys, database credentials, the works. The first rule, of course, is never to hardcode credentials in code. But you will need to work out how to make your project work, including via version control, while also not divulging credentials to the world at large. My preferred solutions, in order of preference, are:

  1. the keyring package, which interacts with OS X’s keychain, Windows’s Credential Store and the Secret Service API on Linux (where supported),
  2. using environment variables,
  3. using a secrets file that is .gitignored,
  4. using a config file that’s .gitignored,
  5. prompting the user.

Let’s take these – except the last one, which you should consider only as a measure of desperation, as it relies on RStudio and your code should aspire to run without it – in turn.

Using keyring

keyring is an R package that interfaces with the operating system’s keychain management solution, and works without any additional software on OS X and Windows.8 Using keyring is delightfully simple: it conceives of an individual key as belonging to a keyring and identified by a service. By reference to the service, it can then be retrieved easily once the user has authenticated to access the keychain. It has two drawbacks to be aware of:

  • It’s an interactive solution (it has to get access permission for the keychain), so if what you’re after is R code that runs quietly without any intervention, this is not your best bet.
  • A key can only contain a username and a password, so it cannot store more complex credentials, such as four-part secrets (e.g. in OAuth, where you may have a consumer and a publisher key and secret each). In that case, you could split them into separate keyring keys.

However, for most interactive purposes, keyring works fine. This includes single-item secrets, e.g. API keys, where you can use some junk as your username and hold only on to the password.

By default, the operating system’s ‘main’ keyring is used, but you’re welcome to create a new one for your project. Note that users may be prompted for a keychain password at call time, and it’s helpful if they know what’s going on, so be sure you document your keyring calls well.

To set a key, simply call keyring::key_set(service = "my_awesome_service", username = "my_awesome_user"). This will launch a dialogue using the host OS’s keychain handler to request authentication to access the relevant keychain (in this case, the system keychain, as no keychain is specified), and you can then retrieve

  • the username: using keyring::key_list("my_awesome_service")[1,2], and
  • the password: using keyring::key_get("my_awesome_service").
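Putting it together – the service, user and keyring names are illustrative – a typical sequence looks like this:

# One-off, interactive: create a project-specific keyring and store a credential.
keyring::keyring_create("my_project")
keyring::key_set(service  = "my_awesome_service",
                 username = "my_awesome_user",
                 keyring  = "my_project")

# In the analysis code, retrieve the credential at the point of use (see
# 'Whatever you do…' below for why it should not linger in the global scope):
keyring::key_list("my_awesome_service", keyring = "my_project")[1, 2]  # username
keyring::key_get("my_awesome_service", keyring = "my_project")         # password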

Using environment variables

Using environment variables to hold certain secrets has become extremely popular, especially for Dockerised implementations of R code, as envvars can be very easily set using Docker. The thing to remember about environment variables is that they’re ‘relatively private’: they’re not part of the codebase, so they will definitely not accidentally get committed to the VCS, but everyone who has access to the particular user session will be able to read them. This may be an issue when e.g. multiple people are sharing the ec2-user account on an EC2 instance. The other drawback of envvars is that if there’s a large number of them, setting them can be a pain. R has a little workaround for that: if you create an envfile called .Renviron in the working directory, R will read it at the start of the session and set the environment variables accordingly. So, for instance, the following .Renviron file will bind an API key and a username:

api_username = "my_awesome_user"
api_key = "e19bb9e938e85e49037518a102860147"

So when you then call Sys.getenv("api_username"), you get the correct result. It’s worth keeping in mind that the .Renviron file is sourced once, and once only: at the start of the R session. Thus, obviously, changes made after that will not propagate into the session until it ends and a new session is started. It’s also rather clumsy to edit, although most tools and APIs accustomed to ini-style files will, with the occasional grumble, digest a .Renviron.
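A hedged sketch of putting those values to use – assuming the httr package and an illustrative internal endpoint – reads the variables inside the function, so the secrets never linger in the global scope:

fetch_usage_report <- function(endpoint = "https://awesome_api.internal/report") {
  # Credentials are read from the environment at the point of use only
  httr::GET(
    endpoint,
    httr::authenticate(Sys.getenv("api_username"), Sys.getenv("api_key"))
  )
}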

Needless to say, committing the .Renviron file to the VCS is what is sometimes referred to as making a chocolate fireman in the business, and is generally a bad idea.

Using a .gitignored config or secrets file

config is a package that allows you to keep a range of configuration settings outside your code, in a YAML file, then retrieve them. For instance, you can create a default configuration for an API:

default:
    my_awesome_api:
        url: 'https://awesome_api.internal'
        username: 'my_test_user'
        api_key: 'e19bb9e938e85e49037518a102860147'

From R, you could then access this using the config::get() function:

my_awesome_api_configuration <- config::get("my_awesome_api")

This would then allow you to e.g. refer to the URL as my_awesome_api_configuration$url, and to the API key as my_awesome_api_configuration$api_key. As long as the configuration YAML file is kept out of the VCS, all is well. The problem is that not everything in such a configuration file is supposed to be secret. For database access, for instance, it makes sense to keep the other connection details DBI::dbConnect() needs available to collaborators, and only keep the password private. So .gitignoring a config file wholesale is not a good idea.

A somewhat better idea is a dedicated secrets file. This file can be safely .gitignored, because it definitely only contains secrets. As previously noted, definitely create it using a format that can be widely written (JSON, YAML).9 For reasons noted in the next subsection, the thing you should definitely not do is create a secrets file that consists of R variable assignments, however convenient an idea that may appear at first. Because…

Whatever you do…

One of the best ways to mess up is creating a fabulous way of keeping your secret credentials truly secret… then loading them into the global scope. Never, ever assign credentials. Ever.

You might have seen code like this:

dbuser <- Sys.getenv("dbuser")
dbpass <- Sys.getenv("dbpass")

conn <- DBI::dbConnect(odbc::odbc(), UID = dbuser, PWD = dbpass)

This will work perfectly, except once it’s done, it will leave the password and the user name, in unencrypted plaintext (!), in the global scope, accessible to any code. That’s not just extremely embarrassing if, say, your wife of ten years discovers that your database password is your World of Warcraft character’s first name, but also a potential security risk. Never put credentials into any environment if possible, and if it has to happen, at least make it happen within a function so that they don’t end up in the global scope. The correct way to do the above would be more akin to this:

create_db_connection <- function() {
    DBI::dbConnect(odbc::odbc(), UID = Sys.getenv("dbuser"), PWD = Sys.getenv("dbpass"))
}
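Called as conn <- create_db_connection(), the credentials exist only for the duration of the call: they are read from the environment, handed straight to DBI::dbConnect(), and never assigned to anything in the global scope.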

Concluding remarks

Structuring R projects is an art, not just a science. Many best practices are highly domain-specific, and learning these generally happens by trial and error. In many ways, it’s the bellwether of an R developer’s skill trajectory, because it shows whether they possess the tenacity and endurance it takes to do meticulous, fine and often rather boring work in pursuance of future success – or at the very least, an easier time debugging things in the future. Studies show that one of the greatest predictors of success in life is being able to tolerate deferred gratification, and structuring R projects is a pure exercise in that discipline.

At the same time, a well-executed structure can save valuable developer time, prevent errors and allow data scientists to focus on the data rather than debugging, trying to find where that damn snippet of code is, or scratching their heads trying to figure out what a particularly obscurely named function does. What might feel like an utter waste of time has enormous potential to create value for the individual, the team and the organisation alike.

I’m sure there are many aspects of structuring R projects that I have omitted or ignored – in many ways, it is my own experiences that inform and motivate these commentaries on R. Some of these observations are echoed by many authors, others diverge greatly from what’s commonly held wisdom. As with all concepts in development, I encourage you to read widely, get to know as many different ideas about structuring R projects as possible, and synthesise your own style. As long as you keep in mind why structure matters and what its ultimate aims are, you will arrive at a form of order out of chaos that will be productive, collaborative and useful – not just for your own development, but for others’ work as well.

My last commentary on defensive programming in R has spawned a vivid and exciting debate on Reddit, and many have made extremely insightful comments there. I’m deeply grateful for all who have contributed there. I hope you will also consider posting your observations in the comment section below. That way, comments will remain together with the original content.

References

1.As in, Adam Smith.
2.It took me years to figure out why. It turns out that I have ZF alpha-1 antitrypsin deficiency. As a consequence, even minimal exposure to small particulates and dust can set off violent coughing attacks and impair breathing for days. Symptoms tend to be worse in hot weather due to impaired connective tissue something-or-other.
3.That’s a joke. I don’t gut interns – they’re valuable resources, HR shuns dismembering your coworkers, it creates paperwork and I liked every intern I’ve ever worked with – but most importantly, once gutted like a fish, they are not going to learn anything new. I prefer gentle, structured discussions on the benefits of good package structure. Please respect your interns – they are the next generation, and you are probably one of their first examples of what software development/data science leadership looks like. The waves you set into motion will ripple through generations, well after you’re gone. You better set a good example.
4.Such a folder is often referred to as a ‘dropbox’, and the typical corresponding octal setting, 0422, guarantees that the R user will not accidentally overwrite data.
5.The organisation consented to me telling this story but requested anonymity, a request I honour whenever legally possible.
6.In case you’re unfamiliar with AWS: it’s a cloud service where elastic computing instances (EC2 instances) reside in ‘regions’, e.g. us-west-1a. There are (small but nonzero) charges for data transfer between regions. If you’re in one region but you configure the yum repo server of another region as your default, there will be costs, and, eventually, tears – provision ten instances with a few GBs worth of downloads, and there’ll be yelling. This is now more or less impossible to do except on purpose, but one must never underestimate what users are capable of from time to time!
7.Or so I’m told.
8.Linux users will need libsecret 0.16 or above, and sodium.
9.XML is acceptable if you’re threatened with waterboarding.

Assignment in R: slings and arrows

Having recently shared my post about defensive programming in R on the r/rstats subreddit, I was blown away by the sheer number of comments as much as I was blown away by the insight many displayed. One particular comment by u/guepier struck my attention. In my previous post, I came out quite vehemently against using the = operator to effect assignment in R. u/guepier’s made a great point, however:

But point 9 is where you’re quite simply wrong, sorry:

never, ever, ever use = to assign. Every time you do it, a kitten dies of sadness.

This is FUD, please don’t spread it. There’s nothing wrong with =. It’s purely a question of personal preference. In fact, if anything <- is more error-prone (granted, this is a very slight chance but it’s still higher than the chance of making an error when using =).

Now, assignment is no doubt a hot topic – a related issue, assignment expressions, has recently led to Python’s BDFL being forced to resign – so I’ll have to tread carefully. A surprising number of people have surprisingly strong feelings about assignment and assignment expressions. In R, this is complicated by its unusual assignment structure, involving two assignment operators that are just different enough to be trouble.

A brief history of <-

This is the IBM Model M SSK keyboard. The APL symbols are printed on it in somewhat faint yellow.

There are many ways in which <- in R is anomalous. For starters, it is rare to find a binary operator that consists of two characters – which is an interesting window on the R <- operator’s history.

The <- operator, apparently, stems from a day long gone by, when keyboards existed for the programming language eldritch horror that is APL. When R’s precursor, S, was conceived, APL keyboards and printing heads existed, and these could print a single ← character. It was only after most standard keyboard assignments ended up eschewing this all-important symbol that R and S accepted the digraphic <- as a substitute.

OK, but what does it do?

In the Brown Book, the underscore was actually an alias for the arrow assignment operator.
In the Brown Book (Richard A. Becker and John M. Chambers (1984). S: An Interactive Environment for Data Analysis and Graphics), the underscore was actually an alias for the arrow assignment operator! Thankfully, this did not make it into R.
<- is one of the first operators anyone encounters when familiarising themselves with the R language. The general idea is quite simple: it is a directionally unambiguous assignment, i.e. it indicates quite clearly that the right-hand side value (rhs, in the following) will replace the left-hand side variable (lhs), or be assigned to the newly created lhs if it has not yet been initialised. Or that, at the very least, is the short story.

Because quite peculiarly, there is another way to accomplish a simple assignment in R: the equality sign (=). And because on the top level, a <- b and a = b are equivalent, people have sometimes treated the two as being quintessentially identical. Which is not the case. Or maybe it is. It’s all very confusing. Let’s see if we can unconfuse it.

The Holy Writ

The Holy Writ, known to uninitiated as the R Manual, has this to say about assignment operators and their differences:

The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.

If this sounds like absolute gibberish, or you cannot think of what would qualify as not being on the top level or a subexpression in a braced list of expressions, welcome to the squad – I’ve had R experts scratch their head about this for an embarrassingly long time until they realised what the R documentation, in its neutron starlike denseness, actually meant.

If it’s in (parentheses) rather than {braces}, = and <- are going to behave weird

To translate the scriptural words quoted above into human speak, this means = cannot be used in the conditional part (the part enclosed by (parentheses) as opposed to {curly braces}) of control structures, among others. This is less an issue between <- and = than an issue between = and ==. Consider the following example:

x = 3

if(x = 3) 1 else 0
# Error: unexpected '=' in "if(x ="

So far so good: you should not use a single equality sign as an equality test operator. The right way to do it is:

if(x == 3) 1 else 0
# [1] 1

But what about arrow assignment?

if(x <- 3) 1 else 0
# [1] 1

Oh, look, it works! Or does it?

if(x <- 4) 1 else 0
# [1] 1

The problem is that an assignment, if successful, returns the assigned value – and any non-zero number evaluates as true in a condition. So instead of comparing x to 4, it assigned 4 to x, then happily informed us that the condition is indeed true.

The bottom line is not to use = as comparison operator, and <- as anything at all in a control flow expression’s conditional part. Or as John Chambers notes,

Disallowing the new assignment form in control expressions avoids programming errors (such as the example above) that are more likely with the equal operator than with other S assignments.

Chain assignments

One example of where  <- and = behave differently (or rather, one behaves and the other throws an error) is a chain assignment. In a chain assignment, we exploit the fact that R assigns from right to left. The sole criterion is that all except the rightmost members of the chain must be capable of being assigned to.

# Chain assignment using <-
a <- b <- c <- 3

# Chain assignment using =
a = b = c = 3

# Chain assignment that will, unsurprisingly, fail
a = b = 3 = 4
# Error in 3 = 4 : invalid (do_set) left-hand side to assignment

So we’ve seen that as long as the chain assignment is logically valid, it’ll work fine, whether it’s using <- or =. But what if we mix them up?

a = b = c <- 1
# Works fine...

a = b <- c <- 1
# We're still great...

a <- b = c = 1
# Error in a <- b = c = 1 : could not find function "<-<-"
# Oh.

The bottom line from the example above is that where <- and = are mixed, the leftmost assignment has to be carried out using =, and cannot be by <-. In that one particular context, = and <- are not interchangeable.

A small note on chain assignments: many people dislike chain assignments because they’re ‘invisible’ – their value is returned invisibly, so nothing is printed. If that is an issue, you can surround your chain assignment with parentheses – regardless of whether it uses <-, = or a (valid) mixture thereof:

a = b = c <- 3
# ...
# ... still nothing...
# ... ... more silence...

(a = b = c <- 3)
# [1] 3

Assignment and initialisation in functions

This is the big whammy – one of the most important differences between <- and =, and a great way to break your code. If you have paid attention until now, you’ll be rewarded by, hopefully, some interesting knowledge.

= is a pure assignment operator. It does not necessarily initialise a variable in the global scope. <-, on the other hand, always creates a variable, named after the lhs, in the environment in which it is evaluated – which, when used in a function call from the top level, is the global scope. This becomes quite prominent when using it in functions.

Traditionally, when invoking a function, we are supposed to bind its arguments in the format parameter = argument.1 And as we know from how functions work, the parameter’s scope is restricted to the function block. To demonstrate this:

add_up_numbers <- function(a, b) {
    return(a + b)
}

add_up_numbers(a = 3, b = 5)
# [1] 8

a + b
# Error: object 'a' not found

This is expected: a (as well as b, but that didn’t even make it far enough to get checked!) doesn’t exist in the global scope, it exists only in the local scope of the function add_up_numbers. But what happens if we use <- assignment?

add_up_numbers(a <- 3, b <- 5)
# [1] 8

a + b
# [1] 8

Now, a and b still only exist in the local scope of the function add_up_numbers. However, using the assignment operator, we have also created new variables called a and b in the global scope. It’s important not to confuse it with accessing the local scope, as the following example demonstrates:

add_up_numbers(c <- 5, d <- 6)
# [1] 11

a + b
# [1] 8

c + d
# [1] 11

In other words, a + b gave us the sum of the values a and b had in the global scope. When we invoked add_up_numbers(c <- 5, d <- 6), the following happened, in order:

  1. A variable called c was initialised in the global scope. The value 5 was assigned to it.
  2. A variable called d was initialised in the global scope. The value 6 was assigned to it.
  3. The function add_up_numbers() was called on positional arguments c and d.
  4. c was assigned to the variable a in the function’s local scope.
  5. d was assigned to the variable b in the function’s local scope.
  6. The function returned the sum of the variables a and b in the local scope.

It may sound more than a little tedious to think about this function in this way, but it highlights three important things about <- assignment:

  1. In a function call, <- assignment to a keyword name is not the same as using =, which simply binds a value to the keyword.
  2. <- assignment in a function call affects the global scope, using = to provide an argument does not.
  3. Outside this context, <- and = have the same effect, i.e. they assign, or initialise and assign, in the current scope.

Phew. If that sounds like absolute confusing gibberish, give it another read and try playing around with it a little. I promise, it makes some sense. Eventually.

So… should you or shouldn’t you?

Which raises the question that launched this whole post: should you use = for assignment at all? Quite a few style guides, such as Google’s R style guide, have outright banned the use of = as assignment operator, while others have encouraged the use of ->. Personally, I’m inclined to agree with them, for three reasons.

  1. Because of the existence of ->, assignment is best structured in a way that shows clearly what is assigned to which side. a -> b and b <- a have a directional clarity that a = b does not have.
  2. Good code is unambiguous even if the language isn’t. This way, -> and <- always mean assignment, = always means argument binding and == always means comparison.
  3. Many argue that <- is ambiguous, as x<-3 may be mistyped as x<3 or x-3, or alternatively may be (visually) parsed as x < -3, i.e. compare x to -3. In reality, this is a non-issue. RStudio has a built-in shortcut (Alt/⎇ + -) for <-, and automatically inserts a space before and after it. And if one adheres to sound coding principles and surrounds operators with whitespace, this is not an issue that arises.

Like with all coding standards, consistency is key. Consistently used suboptimal solutions are superior, from a coding perspective, to an inconsistent mixture of right and wrong solutions.

References

1.A parameter is an abstract ‘slot’ where you can put in values that configure a function’s execution. Arguments are the actual values you put in. So add_up_numbers(a,b) has the parameters a and b, and add_up_numbers(a = 3, b = 5) has the arguments 3 and 5.

The Ten Rules of Defensive Programming in R

The topic of defensive programming in R is, admittedly, a little unusual. R, while fun and powerful, is not going to run defibrillators, nuclear power plants or spacecraft. In fact, much – if not most! – R code is actually executed interactively, where small glitches don’t really matter. But where R code is integrated into a pipeline, runs autonomously or is embedded into a larger analytical solution, writing code that fails well is going to be crucial. So below, I have collected my top ten principles of defensive programming in R. I have done so with an eye to users who do not come from the life critical systems community and might not have encountered defensive programming before, so some of these rules apply to all languages.

What is defensive programming?

The idea of defensive programming is not to write code that never fails. That’s an impossible aspiration. Rather, the fundamental idea is to write code that fails well. To me, ‘failing well’ means five things:

  1. Fail fast: your code should ensure all criteria are met before it embarks upon operations, especially if those are computationally expensive or might irreversibly affect data.
  2. Fail safe: where there is a failure, your code should ensure that it relinquishes all locks, does not acquire any new ones, does not write files, and so on.
  3. Fail conspicuously: when something is broken, it should return a very clear error message, and give as much information as possible to help unbreak it.
  4. Fail appropriately: failure should have appropriate effects. For every developer, it’s a judgment call to decide whether a particular issue should be a debug/info item, a warning or an error (which by definition means halting execution). Failures should be handled appropriately.
  5. Fail creatively: not everything needs to be a failure. It is perfectly legitimate to handle problems. One example is repeating an HTTP request that has timed out: there’s no need to immediately error out, because quite frankly, that sort of stuff happens. Equally, it’s legitimate to look for a parameter, then check for a configuration file if none was provided, and finally try checking the arguments with which the code was invoked before raising an error.1

And so, without further ado, here are the ten ways I implement these in my day-to-day R coding practice – and I encourage you to do so yourself. You will thank yourself for it.

The Ten Commandments of Defensive Programming in R

  1. Document your code.
  2. In God we trust, everyone else we verify.
  3. Keep functions short and sweet.
  4. Refer to external functions explicitly.
  5. Don’t use require() to import packages into the namespace.
  6. Aggressively manage package and version dependencies.
  7. Use a consistent style and automated code quality tools.
  8. Everything is a package.
  9. Power in names.
  10. Know the rules and their rationale, so that you know when to break them.

ONE: Document your code.

It’s a little surprising to even see this – I mean, shouldn’t you do this stuff anyway? Yes, you should, except some people think that because so much of R is done in the interpreter anyway, rules do not apply to them. Wrong! They very much do.

A few months ago, I saw some code written by a mentee of mine. It was infinitely long – over 250 standard lines! –, had half a dozen required arguments and did everything under the sun. This is, of course, as we’ll discuss later, a bad idea, but let’s put that aside. The problem is, I had no idea what the function was doing! After about half an hour of diligent row-by-row analysis, I figured it out, but that could have been half an hour spent doing something more enjoyable, such as a root canal without anaesthetic while listening to Nickelback. My friend could have saved me quite some hair-tearing by quite simply documenting his code. In R, the standard for documenting the code is called roxygen2, it’s got a great parser that outputs the beautiful LaTeX docs you probably (hopefully!) have encountered when looking up a package’s documentation, and it’s described in quite a bit of detail by Hadley Wickham. What more could you wish for?

Oh. An example. Yeah, that’d be useful. We’ll be documenting a fairly simple function, which calculates the Hamming distance between two strings of equal length, and throws something unpleasant in our face if they are not. Quick recap: the Hamming distance is the number of characters that do not match among two strings. Or mathematically put,

H(s, t) = \sum_{k=1}^{\mathcal{l}(s)} D(s_k, t_k) \mid \mathcal{l}(s) = \mathcal{l}(t)

where \mathcal{l}() is the length function and D(p, q) is the dissimilarity function, which returns 1 if two letters are not identical and 0 otherwise.

So, our function would look like this:

hamming <- function(s1, s2) {
  s1 <- strsplit(s1, "")[[1]]
  s2 <- strsplit(s2, "")[[1]]
  
  return(sum(s1 != s2))
}

Not bad, and pretty evident to a seasoned R user, but it would still be a good idea to point out a thing or two. One of these would be that the result of this code will be inaccurate (technically) if the two strings are of different lengths (we could, and will, test for that, but that’s for a later date). The Hamming distance is defined only for equal-length strings, and so it would be good if the user knew what they have to do – and what they’re going to get. Upon pressing Ctrl/Cmd+Shift+Alt+R, RStudio helpfully whips us up a nice little roxygen2 skeleton:

#' Title
#'
#' @param s1 
#' @param s2 
#'
#' @return
#' @export
#'
#' @examples
hamming <- function(s1, s2) {
  s1 <- strsplit(s1, "")[[1]]
  s2 <- strsplit(s2, "")[[1]]
  
  return(sum(s1 != s2))
}

So, let’s populate it! Most of the fields are fairly self-explanatory. roxygen2, unlike JavaDoc or RST-based Python documentation, does not require formal specification of types – it’s all free text. Also, since it will be parsed into LaTeX someday, you can go wild. A few things deserve mention.

  • You can document multiple parameters. Since s1 and s2 are both going to be strings, you can simply write @param s1,s2 The strings to be compared.
  • Use \code{...} to typeset something as fixed-width.
  • To create links in the documentation, you can use \url{https://chrisvoncsefalvay.com} to link to a URL, \code{\link{someotherfunction}} to refer to the function someotherfunction in the same package, and \code{\link[adifferentpackage]{someotherfunction}} to refer to the function someotherfunction in the adifferentpackage package. Where your function has necessary dependencies outside the current script or the base packages, it is prudent to note them here.
  • You can use \seealso{} to refer to other links or other functions, in this package or another, worth checking out.
  • Anything you put under the examples will be executed as part of testing and building the documentation. If your intention is to give an idea of what the code looks like in practice, and you don’t want the result or even the side effects, you can surround your example with a \dontrun{...} environment.
  • You can draw examples from a file. In this case, you use @example instead of @examples, and specify the path, relative to the file in which the documentation is, to the script: @example docs/examples/hamming.R would be such a directive.
  • What’s that @export thing at the end? Quite simply, it tells roxygen2 to export the function to the NAMESPACE file, making it accessible for reference by other documentation files (that’s how, when you use \code{\link[somepackage]{thingamabob}}, roxygen2 knows which package to link to).

With that in mind, here’s what a decent documentation to our Hamming distance function would look like that would pass muster from a defensive programming perspective:

#' Hamming distance
#'
#' Calculates the Hamming distance between two strings of equal length.
#'
#' @param s1,s2 The strings to be compared.
#'
#' @return The Hamming distance between the two strings \code{s1} and \code{s2}, provided as an integer.
#'
#' @section Warning:
#' 
#' For a Hamming distance calculation, the input strings must be of equal length. This code does NOT reject input strings of different lengths. 
#'
#' @examples
#' hamming("AAGAGTGTCGGCATACGTGTA", "AAGAGCGTCGGCATACGTGTA")  
#'  
#' @export
hamming <- function(s1, s2) {
  s1 <- strsplit(s1, "")[[1]]
  s2 <- strsplit(s2, "")[[1]]
  
  return(sum(s1 != s2))
}

The .Rd file generated from the hamming() function’s roxygen2 docstring: an intermediary format, resembling LaTeX, from which R can build a multitude of documentation outputs.

This little example shows all that a good documentation does: it provides what to supply the function with and in what type, it provides what it will spit out and in what format, and adequately warns of what is not being checked. It’s always better to check input types, but warnings go a long way.2 From this file, R generates an .Rd file, which is basically a LaTeX file that it can parse into various forms of documentation (see left.) In the end, it yields the documentation below, with the adequate warning – a win for defensive programming!

The finished documentation, rendered as it would be in an online documentation pane in RStudio

TWO: In God we trust, everyone else we verify.

In the above example, we have taken the user at face value: we assumed that his inputs will be of equal length, and we assumed they will be strings. But because this is a post on defensive programming, we are going to be suspicious, and not trust our user. So let’s make sure we fail early and check what the user supplies us with. In many programming languages, you would be using various assertions (e.g. the assert keyword in Python), but all R has, as far as built-ins are concerned, is stopifnot(). stopifnot() does as the name suggests: if the condition is not met, execution stops with an error message. However, on the whole, it’s fairly clunky, and it most definitely should not be used to check for user inputs. For that, there are three tactics worth considering.

  1. assertthat is a package by Hadley Wickham (who else!), which implements a range of assert clauses. Most importantly, unlike stopifnot(), assertthat::assert_that() does a decent job at trying to interpret the error message. Consider our previous Hamming distance example: instead of gracelessly falling on its face, a test using assert_that(nchar(s1) == nchar(s2)) would politely inform us that nchar(s1) not equal to nchar(s2). That’s worth it for the borderline Canadian politeness alone.
  2. Consider the severity of the user input failure. Can it be worked around? For instance, a function requiring an integer may, if it is given a float, try to coerce it to an integer. If you opt for this solution, make sure that you 1) always issue a warning, and 2) always allow the user to specify to run the function in ‘strict’ mode (typically by setting the strict parameter to TRUE), which will raise a fatal error rather than try to logic its way out of this pickle.
  3. Finally, make sure that it’s your code that fails, not the system code. Users should have relatively little insight into the internals of the system. And so, if at some point there’ll be a division by an argument foo, you should test whether foo == 0 at the outset and inform the user that foo cannot be zero. By the time the division operation is performed, the variable might have been renamed baz, and the user will not get much actionable intelligence out of the fact that ‘division by zero’ has occurred at some point, and baz was the culprit. Just test early for known incompatibilities, and stop further execution. The same goes, of course, for potentially malicious code.

In general, your code should be strict as to what it accepts, and you should not be afraid to reject anything that doesn’t look like what you’re looking for. Consider for this not only types but also values, e.g. if the value provided for a timeout in minutes is somewhere north of the lifetime of the universe, you should politely reject such an argument – with a good explanation, of course.
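As a hedged sketch of that kind of strictness – the assertthat package is assumed, and the set_timeout() function and its bounds are purely illustrative:

set_timeout <- function(timeout_minutes) {
  assertthat::assert_that(
    is.numeric(timeout_minutes),
    length(timeout_minutes) == 1,
    timeout_minutes > 0,
    timeout_minutes <= 60 * 24,   # politely reject anything north of a day
    msg = "timeout_minutes must be a single number between 0 and 1440."
  )
  options(my_project.timeout = timeout_minutes)
}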

Update: After posting this on Reddit, u/BillWeld pointed out a great idiom for checking user inputs that’s most definitely worth reposting here:

f <- function(a, b, c)
{
    if (getOption("warn") > 0) {
        stopifnot(
            is.numeric(a),
            is.vector(a),
            length(a) == 1,
            is.finite(a),
            a > 0,
            is.character(b)
            # Other requirements go here
            )
    }

    # The main body of the function goes here
}

I find this a great and elegant idiom, although it is your call, as the programmer, to decide which deviations and what degree of incompatibility should cause the function to fail as opposed to merely emit a warning.

THREE: Keep functions short and sweet.

Rule #4 of the Power of Ten states

No function should be longer than what can be printed on a single sheet of paper in a standard reference format with one line per statement and one line per declaration. Typically, this means no more than about 60 lines of code per function.

Rationale: Each function should be a logical unit in the code that is understandable and verifiable as a unit. It is much harder to understand a logical unit that spans multiple screens on a computer display or multiple pages when printed. Excessively long functions are often a sign of poorly structured code.

In practice, with larger screen sizes and higher resolutions, much more than a measly hundred lines fit on a single screen. However, since many users view R code in a quarter-screen window in RStudio, an appropriate figure would be about 60-80 lines. Note that this does not include comments and whitespaces, nor does it penalise indentation styles (functions, conditionals, etc.).

Functions should represent a logical unity. Therefore, if a function needs to be split for compliance with this rule, you should do so in a manner that creates logical units. Typically, one good way is to split functions by the object they act on.

FOUR: Refer to external functions explicitly.

In R, there are two ways to invoke a function, yet most people don’t tend to be aware of this. Even in many textbooks, the library(package) function is treated as quintessentially analogous to, say, import in Python. This is a fundamental misunderstanding.

In R, packages do not need to be imported in order to be able to invoke their functions, and that’s not what the library() function does anyway. library() attaches a package to the current namespace.

What does this mean? Consider the following example. The foreign package is one of my favourite packages. In my day-to-day practice, I get data from all sorts of environments, and foreign helps me import them. One of its functions, read.epiinfo(), is particularly useful as it imports data from CDC’s free EpiInfo toolkit. Assuming that foreign is in a library accessible to my instance of R,4 I can invoke the read.epiinfo() function in two ways:

  • I can directly invoke the function using its canonical name, of the form package::function() – in this case, foreign::read.epiinfo().
  • Alternatively, I can attach the entire foreign package to the namespace of the current session using library(foreign). This has three effects, of which the first tends to be well-known, the second less so and the third altogether ignored.
    1. Functions in foreign will be directly available. Regardless of the fact that it came from a different package, you will be able to invoke it the same way you invoke, say, a function defined in the same script, by simply calling read.epiinfo().
    2. If there was already a function of identical name to any function in foreign in the namespace, that function will be ‘shadowed’, i.e. masked. The namespace will always refer to the most recent function, and the older function will only be available by explicit invocation.
    3. When you invoke a function from the namespace, it will not be perfectly clear from a mere reading of the code what the function actually is, or where it comes from. Rather, the user or maintainer will have to guess what a given name will represent in the namespace at the time the code is running the particular line.

Controversially, my suggestion is

  • to eschew the use of library() altogether, and
  • to always invoke functions explicitly, except for those that are in the namespace at startup.

This is not common advice, and many will disagree. That’s fine. Not all code needs to be safety-critical, and importing ggplot2 with library() for a simple plotting script is fine. But where you want code that’s easy to analyse, easy to read and can be reliably analysed as well, you want explicit invocations. Explicit invocations give you three main benefits:

  1. You will always know what code will be executed. filter may mean dplyr::filter, stats::filter, and so on, whereas specifically invoking dplyr::filter is unambiguous (see the sketch after this list). You know what the code will be (simply invoking dplyr::filter without parentheses or arguments returns the source), and you know what that code is going to do.
  2. Your code will be more predictable. When someone – or something – analyses your code, they will not have to spend so much time trying to guess what a particular identifier refers to within the namespace at any given point.
  3. There is no risk that as a ‘side effect’ various other functions you seek to rely on will be removed from the namespace. In interactive coding, R usually warns you and lists all shadowed functions upon importing functions with the same name into the namespace using library(), but for code intended to be bulk executed, this issue has caused a lot of headache.
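A small, hedged illustration of the first point – the data is made up, but both functions are real – shows how explicit invocation removes any doubt as to which filter() is meant:

x      <- ts(rnorm(100))                      # a toy time series
people <- data.frame(name = c("Ann", "Bob"), age = c(34, 17))

smoothed <- stats::filter(x, rep(1 / 3, 3))   # moving-average filter from stats
adults   <- dplyr::filter(people, age >= 18)  # row filtering from dplyr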

Obviously, all of this applies to require() as well, although, as the next rule explains, require() should generally not be used at all.

FIVE: Don’t use require() to import a package into the namespace.

Even seasoned R users sometimes forget the difference between library() and require(). The difference is quite simple: while both functions attempt to attach the package argument to the namespace,

  • require() returns FALSE, with a warning, if the import failed, while
  • library() simply loads the package and raises an error if the import failed.

Just about the only legitimate use for require() is writing an attach-or-install function. In any other case, as Yihui Xie points out, require() is almost definitely the wrong function to use.
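For completeness, a hedged sketch of that one legitimate use – the helper name and the package it attaches are, of course, up to you:

# Attach a package, installing it first if it is not yet available.
attach_or_install <- function(pkg) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

attach_or_install("data.table")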

SIX: Aggressively manage package and version dependencies.

Packrat is one of those packages that have changed what R is like – for the better. Packrat gives every project a package library, akin to a private /lib/ folder. This is not the place to document the sheer awesomeness of Packrat – you can see that for yourself by doing the Packrat walkthrough. But seriously, use it. Your coworkers will love you for it.

Equally important is to make sure that you specify the version of R that your code is written against. This is best accomplished on a higher level of configuration, however.

SEVEN: Use a consistent style and automated code quality tools.

This should be obvious – we’re programmers, which means we’re constitutionally lazy. If it can be solved by code faster than manually, then code it is! Two tools that help you with this are lintr and styler.

  • lintr is an amazingly widely supported (from RStudio through vim to Sublime Text 3, I hear a version for microwave ovens is in the works!) linter for R code. Linters improve code quality primarily by enforcing good coding practices rather than good style. One big perk of lintr is that it can be injected rather easily into the Travis CI workflow, which is a big deal for those who maintain multi-contributor projects and use Travis to keep the cats appropriately herded.
  • styler was initially designed to help code adhere to the Tidyverse Style Guide, which in my humble opinion is one of the best style guides that have ever existed for R. It can now take any custom style files and reformat your code, either as a function or straight from an RStudio add-in.

So use them.
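For instance, from the console or as part of a CI step (the file name is, of course, a placeholder):

lintr::lint("01-load_data_from_db.R")   # flag questionable constructs in a single file
lintr::lint_package()                   # or lint an entire package in one go

styler::style_file("01-load_data_from_db.R", style = styler::tidyverse_style)
styler::style_pkg()                     # restyle every .R file in the package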

EIGHT: Everything is a package.

Whether you’re writing R code for fun, profit, research or the creation of shareholder value, your coworkers and your clients – rightly! – expect a coherent piece of work product that has everything in one neat package, preferably version controlled. Sending around single R scripts might have been appropriate at some point in the mid-1990s, but it isn’t anymore. And so, your work product should always be structured like a package. As a minimum, this should include:

  1. A DESCRIPTION and NAMESPACE file.
  2. The source code, including comments.
  3. Where appropriate, data mappings and other ancillary data to implement the code. These go normally into the data/ folder. Where these are large, such as massive shape files, you might consider using Git LFS.
  4. Dependencies, preferably in a packrat repo.
  5. The documentation, helping users to understand the code and in particular, if the code is to be part of a pipeline, explaining how to interact with the API it exposes.
  6. Where the work product is an analysis rather than a bit of code intended to carry out a task, the analysis as vignettes.

To understand the notion of analyses as packages, two outstanding posts by Robert M. Flight are worth reading: part 1 explains the ‘why’ and part 2 explains the ‘how’. Robert’s work is getting a little long in the tooth, and tools like knitr and RMarkdown have somewhat displaced classical vignettes as analytical outputs, but the principles remain the same. Inasmuch as it is possible, an analysis in R should be a self-contained package, with all the dependencies and data either linked or included. From the perspective of the user, all that should be left for them to do is to execute the analysis.
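One possible way to scaffold most of the pieces listed above is with usethis and Packrat; the project name and paths are purely illustrative:

usethis::create_package("~/projects/my_analysis")  # DESCRIPTION, NAMESPACE and R/

# Then, with the new project active:
usethis::use_readme_md()           # a README for human beings
usethis::use_vignette("analysis")  # the analysis-as-vignette skeleton
usethis::use_data_raw()            # a home for raw data and the code that processes it
packrat::init()                    # a private library for the dependencies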

NINE: Power in names.

There are only two hard things in Computer Science: cache invalidation and naming things.

Phil Karlton

In general, R has a fair few idiosyncrasies in naming things. For starters, dots/periods . are perfectly permitted in variable names (and thus function names), whereas in most languages the dot is a binary operator that retrieves from its first operand (an object) the method named by its second operand:

a.b(args) = dot(a, b, args) = a_{Method: b}(args)

For instance, in Python, wallet.pay(arg1, arg2) means ‘invoke the method pay of the object wallet with the arguments arg1 and arg2‘. In R, on the other hand, the dot is a character like any other, and therefore there is no special meaning attached to it – you can even have a variable containing multiple dots, or in fact a variable whose name consists entirely of dots5 – in R, .......... is a perfectly valid variable name. It is also a typical example of the fact that just because you can do something doesn’t mean you also should do so.

A few straightforward rules for variable names in R are worth abiding by:

  1. Above all, be consistent. That’s more important than whatever you choose.
  2. Some style guides, including Google’s R style guide, treat variables, functions and constants as different entities in respect of naming. This is, in my not-so-humble opinion, a blatant misunderstanding of the fact that functions are variables of the type function, and not some distinct breed of animal. Therefore, I recommend using a unitary schema for all variables, callable or not.
  3. In order of my preference, the following are legitimate options for naming:
    • Underscore separated: average_speed
    • Dot separated: average.speed
    • JavaScript style lower-case CamelCase: averageSpeed
  4. Things that don’t belong into identifiers: hyphens, non-alphanumeric characters, emojis (🤦🏼‍♂️) and other horrors.
    • Where identifiers are hierarchical, it is better to represent them as hierarchical objects rather than assigning them to different variables. For example, instead of monthly_forecast_january, monthly_forecast_february and so on, it is better to have a list (R’s associative array) called forecasts in which the forecasts are keyed by month name, and can then be retrieved using the $name or the [["name"]] accessors, as the snippet below shows. If your naming has half a dozen components, maybe it’s time to think about structuring your data better.
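For instance (the figures are made up):

# One object keyed by month, instead of monthly_forecast_january, monthly_forecast_february, ...
forecasts <- list(
  january  = c(102, 108, 111),
  february = c(97, 104, 99)
)

forecasts$january        # dollar-sign accessor
forecasts[["february"]]  # double-bracket accessor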

Finally, in some cases, the same data may be represented by multiple formats – for instance, data about productivity is first imported as a text file, and then converted into a data frame. In such cases, Hungarian notation may be legitimate, e.g. txt_productivity or productivity.txt vs df_productivity or productivity.df. This is more or less the only case in which Hungarian notation is appropriate.6

And while we’re at variables: never, ever, ever use = to assign. Every time you do it, a kitten dies of sadness.

For file naming, some sensible rules have served me well, and will be hopefully equally useful for you:

  1. File names should be descriptive, but no longer than 63 characters.
  2. File names should be all lower case, separated by underscores, and end in .R. That’s a capital R. Not a lower-case R. EveR.
  3. Where there is a ‘head’ script that sequentially invokes (sources) a number of subsidiary scripts, it is common for the head script to be called 00-.R, and the rest to be given a sequential prefix corresponding to the order in which they are sourced, plus a descriptive name, e.g. 01-load_data_from_db.R, 02-transform_data_and_anonymise_records.R and so on.
  4. Where there is a core script, but it does not invoke other files sequentially, it is common for the core script to be called 00-base.R or main.R. As long as it’s somewhere made clear to the user which file to execute, all is fair.
  5. The injunction against emojis and other nonsense holds for file names, too.

TEN: Know the rules and their rationale, so that you know when to break them.

It’s important to understand why style rules and defensive programming principles exist. How else would we know which rules we can break, and when?

The reality is that there are no rules, in any field, that do not ever permit of exceptions. And defensive programming rules are no exception. Rules are tools that help us get our work done better and more reliably, not some abstract holy scripture. With that in mind, when can you ignore these rules?

  • You can, of course, always ignore these rules if you’re working on your own, and most of your work is interactive. You’re going to screw yourself over, but that’s your right and privilege.
  • Adhering to common standards is more important than doing what some dude on the internet (i.e. my good self) thinks is good R coding practice. Coherence and consistency are crucial, and you’ll have to stick to your team’s style over your own ideas. You can propose to change those rules, you can suggest that they be altogether redrafted, and link them to this page. But don’t go out and start following a style of your own just because you think it’s better (even if you’re right).
  • It’s always a good idea to appoint a suitable individual – with lots of experience and little ego, ideally! – as code quality standards coordinator (CQSC). They will then centrally coordinate adherence to standards, train on defensive coding practices, review operational adherence, manage tools and configurations and onboard new people.
  • Equally, having an ‘editor settings repo’ is pretty useful. This should cover, at the very least, settings for RStudio, lintr and styler.
  • Some prefer to have the style guide as a GitHub or Confluence wiki – I generally advise against that, as that cannot be tracked and versioned as well as, say, a bunch of RST files that are collated together using Sphinx, or some RMarkdown files that are rendered automatically upon update using a GitHub webhook.

Conclusion

Always code as if your life, or that of your loved ones, depended on the code you write – because it very well may someday.
In the end, defensive programming may not be critical for you at all. You may never need to use it, and even if you do, the chances that, as an R programmer, your code will have to live up to defensive programming rules and requirements are much lower than, say, for an embedded programmer. But algorithms, including those written in R, create an increasing amount of data that is used to support major decisions. What your code spits out may decide someone’s career, whether someone can get a loan, whether they can be insured or whether their health insurance will be dropped. It may even decide a whole company’s fate. This is an inescapable consequence of the algorithmic world we now live in.

A few years ago, at a seminar on coding to FDA CDRH standards, the instructor finished by giving us his overriding rule: in the end, always code as if your life, or that of your loved ones, depended on the code you write (because it very well may someday!). This may sound dramatic in the context of R, which is primarily a statistical programming language, but the point stands: code has real-life consequences, and we owe it to those whose lives, careers or livelihoods depend on our code to give them the kind of code that we wish to rely on: well-tested, reliable and stable.

References

1.One crucial element here: keep in mind the order of parameters – because the explicit should override the implied, your code should look at the arguments first, then the environment variables, then the configuration file!
2.Eagle-eyed readers might spot the mutation in the example. This is indeed the T310C mutation on the TNFRSF13B gene (17p11.2), which causes autosomal dominant type 2 Common Variable Immune Deficiency, a condition I have, but manage through regular IVIG, prophylactic antibiotics and aggressive monitoring.
3.Holzmann, G.J. (2006). The Power of 10: Rules for developing safety-critical code. IEEE Computer 39(6):95-99.
4.R stores packages in libraries, which are basically folders where the packages reside. There are global packages as well as user-specific packages. For a useful overview of packages and package installation strategies, read this summary by Nathan Stephens.
5.The exception is ..., which is not a valid variable name as it collides with R’s dots (ellipsis) construct.
6.In fact, the necessity of even this case is quite arguable. A much better practice would be simply importing the text as productivity, then transform it into a data frame, if there’s no need for both versions to continue coexisting.

Is Senomyx putting baby parts into your fruit juice? The answer may (not) surprise you.

I know a great number of people who oppose abortion, and who are therefore opposed to the biomedical use of tissue or cells that have been derived from foetuses aborted for that very purpose, although most of them do not oppose the use of foetal tissue from foetuses aborted for some other reason.1 Over the last week or so, many have sent me a story involving a company named Senomyx, wondering if it’s true. In particular, the story claimed that Senomyx uses foetal cells, or substances derived from foetal cell lines, to create artificial flavourings. One article I have been referred to is straight from the immediately credible-sounding Natural News:2 3

Every time you purchase mass-produced processed “food” from the likes of Kraft, PepsiCo, or Nestle, you’re choosing, whether you realize it or not, to feed your family not only genetically engineered poisons and chemical additives, but also various flavoring agents manufactured using the tissue of aborted human babies.

It’s true: A company based out of California, known as Senomyx, is in the business of using aborted embryonic cells to test fake flavoring chemicals, both savory and sweet, which are then added to things like soft drinks, candy and cookies. And Senomyx has admittedly partnered with a number of major food manufacturers to lace its cannibalistic additives into all sorts of factory foods scarfed down by millions of American consumers every single day.

Needless to say, this is total bullcrap, and in the following, we’re going on a ride in the Magic Schoolbus through this fetid swamp of lies, half-truths and quarter-wits.

In the beginning was the cell culture…

About ten years before I was born, a healthy foetus was (legally) aborted in Leiden, the Netherlands, and samples were taken from the foetus’s kidney. They were much like the cells in your kidney. Like the cells in your kidney (hopefully!), they had a beginning and an end. That end is known as the Hayflick limit or ‘replicative senescence limit’. Cells contain small ‘caps’ at the end of their chromosomes, known as telomeres.4 At every mitosis, these shorten a little, like a countdown clock, showing how many divisions the cell has left. And once the telomeres are all gone, the cell enters a stage called cellular senescence, and is permanently stuck in the G1 phase of the cell cycle, unable to move on to phase S. This is a self-preservative mechanism: with every mitosis, the cell line ages, and becomes more likely to start suffering from errors it passes on to its descendants. Experimentally, we know that the Hayflick limit is around 40-60 divisions.

Kadir Nelson, The Mother of Modern Medicine. Oil on linen, 59 1/2 x 49 1/2. Image courtesy of the National Museum of African American History & Culture, Washington, DC.

But every rule has an exception. Most obviously, there are the cells that just won’t die – this is the case in cancers. And so, some cancer cells have this senescence mechanism disabled, and can divide an unlimited number of times. They have become practically immortal. This is, of course, a nuisance if they are inside a human, because proliferation and division of those cells causes unpleasant things like tumours. For researchers, however, they offer something precious: a cell line that will, as long as it’s fed and watered, live and divide indefinitely. This is called an immortal cell line. The most famous of them, HeLa, began life as cervical epithelial cells of a woman named Henrietta Lacks. Then something dreadful happened, and the cells’ cell cycle regulation was disrupted – Ms Lacks developed aggressive cervical cancer, and died on October 4, 1951. Before that, without her permission or consent, samples were taken from her cervix, and passed onto George Otto Gey, a cell biologist who realised what would be Ms Lacks’s most lasting heritage: her cells would divide and divide, well beyond the Hayflick limit, with no end in sight. Finally, cell biologists had the Holy Grail: a single cell line that produced near-exact copies of itself, all descendants of a single cell from the cervix of a destitute African-American woman, who died without a penny to her name, but whose cells would for decades continue to save lives – among others, Salk’s polio vaccine was cultured in HeLa cells.5

HeLa cells were immortal ‘by nature’ – they underwent changes that have rendered them cancerous/immortal, depending on perspective. But it’s also possible to take regular mortal cells and interfere with their cell cycle to render them immortal. And this brings us back to the kidney cells of the foetus aborted in the 1970s, a life that was never born yet will never die, but lives on as HEK 293. A cell biologist, Frank Graham, working in Alex van der Eb’s lab at the University of Leiden,6 used the E1 gene from adenovirus 5 to ‘immortalise’ the cell line, effectively rewriting the cell’s internal regulation mechanism to allow it to continue to divide indefinitely rather than enter cellular senescence.7 This became the cell culture HEK 293,8 one of my favourite cell cultures altogether, with an almost transcendental beauty in symmetry.

HEK 293 cells, immunofluorescence photomicrograph.

HEK 293 is today a stock cell line, used for a range of experimental and production processes. You can buy it from Sigma-Aldrich for a little north of £350 a vial. The cells you’re getting will be exact genetic copies of the initial cell sample, and its direct genetic descendants. That’s the point of an immortalised cell line: you can alter a single cell to effectively divide indefinitely as long as the requisite nutrients, space and temperature are present. They are immensely useful for two reasons:

  • You don’t need to take new cell samples every time you need a few million cells to tinker with.
  • All cells in a cell line are perfectly identical.9 It goes without saying just how important for reproducible science it is to have widely available, standardised, reference cell lines.

Admittedly, this was a whistle-stop tour through cell cultures and immortal(ised) cell lines, but the basics should be clear. Right? I mean… right?

Unless you’re Sayer Ji.

Until about 1900 hours today, I had absolutely no idea who Sayer Ji was, and I would lie if I asserted my life was drastically impoverished by that particular ignorance. Sayer Ji runs a website (which I am not going to link to, but I am going to link to RationalWiki’s entry on it, as well as Orac’s collected works on the man, the myth and the bullshit), and he describes himself as a ‘thought leader’. That alone sets off big alarms. Mr Ji, to the best of my knowledge, has no medical credentials whatsoever, nor any accredited credentials that would allow him to make the grandiose statements that he seems to indulge in with the moderation of Tony Montana when it comes to cocaine:

There is not a single disease ever identified cause by a lack of a drug, but there are diseases caused by a lack of vitamins, minerals and nutrients. Why, then, do we consider the former – chemical medicine – the standard of care and food-as-medicine as quackery? Money and power is the obvious answer.

Mr Ji’s ‘article’ rests on the fundamental assertion that ‘abortion tainted vaccines’ constitute cannibalism. He is, of course, entirely misguided, and does not understand how cell line derived vaccines work.
You may question why I even engage with someone whose cognition is operating on this level, and you might be justified in doing so. Please ascribe it to my masochistic tendencies, or consider it a sacrifice for the common good. Either way, I pushed through a particular article of his which is cited quite extensively in the context of Senomyx, titled Biotech’s Dark Promise: Involuntary Cannibalism for All.10 In short, Mr Ji’s ‘article’ rests on the fundamental assertion that ‘abortion tainted vaccines’, among which he ranks anything derived from HEK 293 (which he just refers to as 293),11 constitute cannibalism. He is, of course, entirely misguided, and does not understand how cell line derived vaccines work:

Whereas cannibalism is considered by most modern societies to be the ultimate expression of uncivilized or barbaric behavior, it is intrinsic to many of the Western world’s most prized biotechnological and medical innovations. Probably the most ‘taken for granted’ example of this is the use of live, aborted fetus cell lines from induced abortions to produce vaccines. Known as diploid cell vaccines (diploid cells have two (di-) sets of chromosomes inherited from human mother and father), they are non-continuous (unlike cancer cells), and therefore must be continually replaced, i.e. new aborted, live fetal tissue must be harvested periodically.

For the time being, the VRBPAC has mooted but not approved tumour cell line derived vaccines, and is unlikely to do so anytime soon. However, the idea that diploid cell vaccines need a constant influx of cells is completely idiotic, and reveals Mr Ji’s profound ignorance.12 WI-38, for instance, is a diploid human cell line, and perfectly ‘continuous’ (by which he means immortal). There is no new “aborted, live fetal tissue” that “must be harvested periodically”.

Vaccines do not typically contain cells from the culture, but only the proteins, VLPs or virions expressed by the cells. There’s no cannibalism if no cells are consumed.
Equally, Mr Ji seems unaware that vaccines do not typically contain cells from the culture, but only the proteins, VLPs or virions expressed by the cells. There’s no cannibalism if no cells are consumed, and the denaturation process (the attenuation part of attenuated vaccines) is already breaking whatever cellular parts there are to hell and back. On the infinitesimal chance that a whole cell has made it through, it will be blasted into a million little pieces by the body’s immune system – being a foreign cell around a bunch of adjuvant is like breaking into a bank vault right across from the local police station while expressly alerting the police you’re about to crack the safe and, just to be sure, providing them with a handy link for a live stream.13

From cell lines to Diet Coke: Senomyx and high throughput receptor ligand screening

If you have soldiered on until now, good job. This is where the fun part comes – debunking the fear propaganda against Senomyx by a collection of staggering ignorami.

Senomyx is a company with a pretty clever business model. Instead of using human probands to develop taste enhancers (aka ‘flavourings’) and scents, Senomyx uses a foetal cell line, specifically HEK 293, to express certain receptors that mimic the taste receptors in the human body.

To reiterate the obvious: no foetal cells ever get anywhere near your food.
Then, it tests a vast number of candidate compounds on them, and sees which elicit a particular reaction. That product (which typically is a complex organic chemical but has nothing to do with the foetal cell!) is then patented and goes into your food. To reiterate the obvious: no foetal cells ever get anywhere near your food.

The kerfuffle around this is remarkably stupid because this is basically the same as High Throughput Screening (HTS), a core component of drug development today.14 Let’s go through how a drug is developed, with a fairly simple and entirely unrealistic example.15 We know that low postsynaptic levels of monoamines, especially of serotonin (but also dopamine and norepinephrine), correlate with low mood. One way to try to increase postsynaptic levels is by inhibiting the breakdown of monoamines, which happens by way of an enzyme called monoamine oxidase (MAO). But how do we find out which of the several thousand promising MAO inhibitors that our computer model spat out will actually work best? We can’t run a clinical trial for each. Not even an animal test. But we can run a high throughput screen. Here’s a much simplified example of how that could be done (it’s not how it’s actually done anymore, but it gets the idea across).

  1. We take a microtitre plate (a flat plate with up to hundreds of little holes called ‘wells’ that each take about half a millilitre of fluid), and fill it with our favourite monoamine neurotransmitters. Mmm, yummy serotonin! But because we’re tricky, we label them with a fluorescent tag or fluorophore, a substance that gives off light if excited by light of a particular wavelength, but only as long as they’re not oxidised by MAO.
  2. We add a tiny amount of each of our putative drugs to a different well of the microtitre plate.
  3. Then, we add some monoamine oxidase to each well.
  4. When illuminated by the particular wavelength of light that excites the fluorophores, some wells will light up pretty well, others will be fairly dark. The bright wells indicate that the candidate drug in that well has largely inhibited monoamine oxidase, and thus the monoamine neurotransmitter remained intact. A dark well indicates that most or all of the monoamines were oxidised and as such no longer give off light. This helps us whittle down thousands of candidate molecules to hundreds or even dozens.

What Senomyx does is largely similar. While their process is proprietary, it is evidently largely analogous to high throughput screening. The human cell lines are modified to express receptors analogous to those in taste buds. These are exposed to potential flavourings, and the degree to which they stimulate the receptor can be quantified and even compared, so that, e.g., you can gauge what quantity of a new flavouring would offset 1 gram of sugar. The candidate substances are then tested on real people, too. Some of the substances are not sweeteners themselves but taste intensifiers, which interact to intensify the sweet taste sensation of sugar. It’s a fascinating technology, and a great idea – and it has huge potential to reduce the amount of sugars, salts and other harmful dietary flavourings in many meals.

S2227, a menthol-like flavouring manufactured by Senomyx. Can you spot the aborted cells? Nope, neither can I.

The end product is a flavouring – a molecule that has nothing to do with aborted cells (an example, the potent cooling sensate S2227, is depicted to the left – I dare anyone to show me where the aborted foetal cells are!). Soylent green, it turns out, is not people babies after all.

Conclusion

Now, two matters are outstanding. One is the safety of these substances. That, however, is irrelevant to how they were isolated. The product of a high-throughput screen is no more or less likely to be toxic than something derived from nature. The second point is a little more subtle.

A number of critics of Senomyx point out that this is, in a sense, deceiving the customer, and with pearl-clutching that would have won them awards in the 1940s, point out that companies want one thing, and it’s absolutely filthy:16

Companies like PepsiCo and Nestle S.A. seek to gain over competitors. To do so, they boast products like “reduced sugar”, “reduced sodium”, “no msg”, etc. Sales profits and stocks increase when consumers believe they eat or drink a healthier product.

The Weston A Price Foundation carries skepticism. They believe that the bottom line is what’s important. Companies only want to decrease the cost of goods for increased profits. Shareholders only care about stock prices and investment potential.

Err… and you expected what? There is absolutely no doubt that something that gives your body the taste of salt without the adverse effects of a high-sodium diet is A Good Thing – so consumers do not merely think they’re getting a healthier product, they’re getting a product with the same taste they prefer, but without the adverse dietary consequences (in other words: a healthier product).

To people who have to adhere to a strict diet, flavour enhancers can give back a craved-for flavour experience, improve quality of life and increase diet adherence.
To people who have to adhere to a strict low-sodium diet due to kidney disease, heart disease/hypertension or other health issues, this could well give back a craved-for flavour and improve quality of life. To people struggling with obesity, losing weight without having to say no to their favourite drink can result in better health outcomes. People with peanut allergies can enjoy a Reese’s Cup with a synthetic and chemically different protein that creates the same taste sensation, without risking anaphylaxis. In the end, these are things that matter, and should matter more than the fact that – shock horror! – someone is making money out of this.

All Senomyx does is what drug companies have been doing for decades, and an increasing number of companies will. But yet again, the cynics who see lizardoid conspiracies and corporate deceit behind every wall know the price of everything, but haven’t thought about the value of it for a second, have seized upon another talking point. They did so exploiting a genuine pro-life sentiment so many hold dear, intentionally misrepresenting or recklessly misunderstanding the fact that no aborted tissue will ever make its way into your Coke, nor will there be a need for a stream of abortions to feed a burgeoning industry for artificial taste bud cell lines. If anybody here is exploitative, it’s not Senomyx – it’s those who seize upon the universal human injunction against cannibalism and infanticide to push a scientifically incorrect, debunked agenda against something they themselves barely understand.

References

1.This is not a post about the politics of abortion, Roe v Wade, pro-life vs pro-choice, religion vs science or any of the rest. It is about laying a pernicious lie to rest. My position regarding abortion is quite irrelevant to this, as is theirs, but it deserves mention that all of the people who got in touch hold very genuine and consistent views about the sanctity of life. Please let’s not make it about something it isn’t about.
2.Needless to say, this is not my friends’ usual fare, they too have seen it on social media and were quite dubious.
3.In line with our linking policy, we do not link to pages that endorse violence, hate or discrimination.
4.From Greek τέλος, ‘end’.
5.Finally, Rebecca Skloot’s amazing book, The Immortal Life of Henrietta Lacks, paid a long overdue tribute to the mother of modern medicine in 2010. Her book is a must-read to anyone who wants to understand the ethical complexities of immortal cell lines, as well as the touching story of a woman whom we for so long have known by initials only, yet owe such a debt to.
6.Disclosure: the University of Leiden is my alma mater, I have spent a wonderful year there, and received great treatment at LUMC, the university hospital. I am not, and have never been, in receipt of funds from the university.
7.As such, these cells represent an immortalised cell line, as opposed to an immortal cell line like HeLa, where the change to the cell cycle regulation has already occurred.
8.Human, Embryonic, Kidney, from experiment #293.
9.Sort of. Like all human processes, cell culturing is not perfect. One risk is a so-called ‘contaminated cell line’, and the classical case study for that is the Chang Liver cell line, which turned out to be all HeLa, all the way. This is not only a significant problem, it is also responsible for huge monetary loss and wasted research money. You can read more about this, and what scientists are doing to combat the problem, here.
10.For ethical reasons, this blog refuses to link to, and thus create revenue for, quacks, extremists and pseudoscientists. However, where the source material is indispensable, an archival service is used to obtain a snapshot of the website, so that you, too, can safely peruse Mr Ji’s nonsense without making him any money. GreenMedInfo has a whole tedious page on how to cite their nonsense, which I am going to roundly ignore because a) it looks and reads like it was written by someone who flunked out of pre-law in his sophomore year, b) 17 U.S.C. §107 explicitly guarantees fair use rights for scholarship, research and criticism.
11.Curiously, he does not mention WI-38 and MRC-5, both cell lines derived from the lung epithelial cells of aborted foetuses, which are widely used in vaccine production…
12.The word ‘profound’ is especially meet in this context.
13.Which sounds like something someone MUST have done already. Come on. It’s 2018.
14.This article is a fantastic illustration of just how powerful this technique is!
15.Unrealistic, because we know all there is to know about monoamine oxidase inhibitors, and there’s no point in researching them much more – but it illustrates the point well!
16.Yes, I brought THAT meme in here!

On dealing with failed trials

To survive this moment, you need to be able to look everyone in the eye and remind them that the trial was designed to rigorously answer an important question. If this is true, then you have likely just learned something useful, regardless of whether the intervention worked or not. Further, if the study was well designed, other people will be more likely to trust your result and act on it. In other words, there is no such thing as a “negative” trial result – the answer given by a well-designed trial is always useful, whether the intervention worked or not. So you simply remind everyone of this. People will still be disappointed of course – we’d be fools to test treatments if we didn’t think they worked, and it’s natural to hope. But at least we know we ran a good trial and added knowledge to the world – that’s not nothing.

I rarely post quotes, but this is important enough to have it here. Go read the whole thing, now.

I tend to say that for scientists, there are no failed trials. Failed trials are investor-speak: and that’s not to say anything bad about investors or their perspective. They’re in the business to make money, and to them, a trial that means the drug candidate they just blew a billion dollars to develop will never make it to the market is, in a profound sense, a failure. But to us as scientists, it’s just another step on the journey of understanding more and more about this world (and as any scientist knows, failure is largely the bread and butter of experimental work). We’ve found another way that doesn’t work. Maybe we’ve saved lives, even – failed experiments have often stemmed the tide of predatory therapies or found toxicities that explain certain reactions with other drugs as well. Maybe we’ve understood something that may be useful somewhere else. And at worst, we’ve learned which road not to go down.

And to a scientist, that should not ever feel like a failure.

Bayesian reasoning in clinical diagnostics: a primer.

We know, from the source of eternal wisdom that is Saturday Morning Breakfast Cereal, that insufficient math education is the basis of the entire Western economy.1 This makes Bayesian logic and reasoning about probabilities almost like a dark art, a well-kept secret that only a few seem to know (and it shouldn’t be… but that’s a different story). This weird-wonderful argument, reflecting a much-reiterated meme about vaccines and vaccine efficacy, is a good example:

The argument, here, in case you are not familiar with the latest in anti-vaccination fallacies, is that vaccines don’t work, and they have not reduced the incidence of vaccine-preventable diseases. Rather, if a person is vaccinated for, say, measles, then despite displaying clinical signs of measles, he will be considered to have a different disease, and therefore all disease statistics proving the efficacy of vaccines are wrong. Now, that’s clearly nonsense, but it highlights one interesting point, one that has a massive bearing on computational systems drawing conclusions from evidence: namely, the internal Bayesian logic of the diagnostic process.

Which, incidentally, is the most important thing that they didn’t teach you in school. Bayesian logic, that is. Shockingly, they don’t even teach much of it in medical school unless you do research, and even there it’s seen as a predictive method, not a tool to make sense of the analytical process. Which is a pity. The reason why idiotic arguments like the above by @Cattlechildren proliferate is that physicians have been taught how to diagnose well, but never how to explain and reason about the diagnostic process. This was true for the generations before me, and is more or less true for those still in med school today. What is often covered up with nebulous concepts like ‘clinical experience’ is in fact solid Bayesian reasoning. Knowing the mathematical fundamentals of the thought process you use day to day, and which helps you make the right decisions in the clinic, allows you to reason about it, find weak points, answer challenges and respond to them. For this reason, my highest hope is that as many MDs, epidemiologists, med students, RNs, NPs and other clinical decision-makers as possible will engage with this topic, even if it’s a little long. I promise, it’s worth it.

Some basic ideas about probability

In probability, an event, usually denoted with a capital and customarily starting at A (I have no idea why, as it makes things only more confusing!), is any outcome or incidence that we’re interested in – as long as they’re binary, that is, they either happen or don’t happen, and discrete, that is, there’s a clear definition for it, so that we can decide if it’s happened or not – no half-way events for now.2 In other words, an event can’t happen and not happen at the same time. Or, to get used to the notation of conditionality, p(A \mid \neg A) = 0,3 that is: a thing cannot be both true and false.

Now, we may be interested in how likely it is for an event to happen if another event happens: how likely is A if B holds true? This is denoted as p(A|B), and for now, the most important thing to keep in mind about it is that it is not necessarily the same as p(B|A)!4

Bayesian logic deals with the notion of conditional probabilities – in other words, the probability of one event given another.5 It is one of the most widely misunderstood parts of probability, yet it is crucial to understanding the way we reason about things.

Just to understand how important this is, let us consider a classic example.


Case study 1: speed cameras

Your local authority is broke. And so, it does what local authorities do when they’re broke: play poker with the borough credit card… I mean, set up a bunch of speed cameras and fine drivers. Over this particular stretch of road, the speed limit is 60mph.

According to the manufacturer, the speed cameras are very sensitive, but not very specific. In other words, they never falsely indicate that a driver was below the speed limit, but they may falsely indicate that the driver was above it, in about 3% of the cases (the false positive rate).

One morning, you’re greeted by a message in your postbox, notifying you that you’ve driven too fast and fining you a rather respectable amount of cash. What is the probability that you have indeed driven too fast?

You may feel inclined to blurt out 97%. That, in fact, is wrong.


Explanation

It’s rather counter-intuitive at first to understand why, until we consider the problem in formal terms. We know the probability p(A|\neg B), that is, the probability of being snapped (A) even though you were not speeding (\neg B). But what the question asks is the likelihood that you were, in fact, speeding (B) given the fact that you were snapped (A). And as we have learned, the conditional probability operator is not commutative, that is, p(A|B) is not necessarily the same as p(B|A).

Why is that the case? Because base rates matter. In other words, the probabilities of A and B, in and of themselves, are material. Consider, for a moment, the unlikely scenario of living in that mythical wonderland of law-abiding citizens where nobody speeds. Then, it does not matter how many drivers are snapped – all of them are false positives, and thus p(B|A), the probability of speeding (B) given that one got snapped by a speed camera (A), is actually zero.

In other words, if we want to reverse the conditional operator, we need to make allowances for the ‘base frequency’, the ordinary frequency with which each event occurs on its own. To overcome base frequency neglect,6 we have a mathematical tool, courtesy of the good Revd. Thomas Bayes, who sayeth that, verily,

p(B \mid A) = \frac{p(A \mid B) p(B)}{p(A)}

Or, in words: if you want to reverse the probabilities, you will have to take the base rates of each event into account. If what we know is the likelihood of being snapped even though you were not speeding, and what we’re interested in is the likelihood that someone who got snapped was indeed speeding, we’ll need to know a few more things.


Case study 1: Speed cameras – continued

  • We know that the speed cameras have a Type II (false negative) error rate of zero – in other words, if you are speeding (B), you are guaranteed to get snapped (A) – thus, p(A \mid B) = 1.
  • We also know from the Highway Authority, who were using a different and more accurate measurement system, that approximately one in 1,000 drivers is speeding (p(B) = 0.001).
  • Finally, we know that of 1,000 drivers, 31 will be snapped – the one speeder, plus roughly 30 false positives (3% of the remaining 999) – yielding p(A) = 0.031.

Putting that into our equation,

p(B|A) = \frac{p(A \mid B) p(B)}{p(A)} = \frac{1 \cdot 0.001}{0.031} = 0.032

In other words, the likelihood that we indeed did exceed the speed limit is just barely north of 3%. That’s a far cry from the ‘intuitive’ answer of 97% (quite accidentally, it’s almost the inverse).
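If you would rather let R do the arithmetic, the whole example fits in a few lines:

p_snapped_given_speeding <- 1      # the cameras never miss a speeder
p_speeding               <- 0.001  # 1 driver in 1,000 is speeding
p_snapped                <- 0.031  # 31 drivers in 1,000 get snapped

p_snapped_given_speeding * p_speeding / p_snapped  # 0.03225806, i.e. just over 3%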


Diagnostics, probabilities and Bayesian logic

The procedure of medical diagnostics is ultimately a relatively simple algorithm:

  1. create a list of possibilities, however remote (the process of differential diagnostics),
  2. order them in order of likelihood,
  3. update priors as you run tests.7

From a statistical perspective, this is implemented as follows.

  1. We begin by running a number of tests, specifically m of them. It is assumed that the tests are independent of each other, i.e. the value of one does not affect the value of another. Let R_j denote the result of test j \leq m.
    1. For each test, we need to iterate over all our differentials D_{1 \ldots n}, and determine the probability of each in light of the new evidence, i.e. p(D_i \mid R_j).
    2. So, let’s take the results of test j that yielded the results R_j, and the putative diagnosis D_i. What we’re interested in is p(D_i \mid R_j), that is, the probability of the putative diagnosis given the new evidence. Or, to use Bayesian lingo, we are updating our prior: we had a previous probability assigned to D_i, which may have been a uniform probability or some other probability, and we are now updating it – seeing how likely it is given the new evidence, getting what is referred to as a posterior.8
    3. To calculate the posterior P(D_i | R_j), we need to know three things – the sensitivity and specificity of the test j (I’ll call these S^+_j and S^-_j, respectively), the overall incidence of D_i,9 and the overall incidence of the particular result R_j.
    4. Plugging these variables into our beloved Bayesian formula, we get p(D_i \mid R_j) = \frac{p(R_j \mid D_i) p(D_i)}{p(R_j)}.
    5. We know that p(R_j \mid D_i), that is, the probability that someone will test a particular way if they do have the condition D_i, is connected to the sensitivity: if R_j is a positive result, then p(R_j \mid D_i) = S^+_j (the sensitivity), whereas if R_j is a negative result, then p(R_j \mid D_i) = 1 - S^+_j (the false negative rate).
    6. We also know, or are supposed to know, the overall incidence of D_i and the probability of a particular outcome, R_j. With that, we can update our prior for D_i \mid R_j.
  2. We iterate over each of the tests, updating the priors every time new evidence comes in.

This may sound daunting and highly mathematical, but in fact most physicians have this down to an innate skill, so much so that when I explained this to a group of FY2 doctors, they couldn’t believe it – until they thought about how they thought. And that’s a key issue here: thinking about the way we arrive at results is important, because they are the bedrock of what we need to make those results intelligible to others.
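For the code-minded, a single update step of the kind described above fits into a short function; this is a minimal sketch, and the function name and arguments are mine rather than part of any package:

# Posterior probability of diagnosis D_i after one test result R_j, via Bayes' theorem.
update_posterior <- function(prior, sensitivity, specificity,
                             result = c("positive", "negative")) {
  result <- match.arg(result)
  if (result == "positive") {
    likelihood <- sensitivity                                            # p(R_j | D_i)
    p_result   <- sensitivity * prior + (1 - specificity) * (1 - prior)  # p(R_j)
  } else {
    likelihood <- 1 - sensitivity                                        # the false negative rate
    p_result   <- (1 - sensitivity) * prior + specificity * (1 - prior)
  }
  likelihood * prior / p_result
}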


Case study 2: ATA testing for coeliac disease

For a worked example of this in the diagnosis of coeliac disease, check Notebook 1: ATA case study. It puts things in the context of sensitivity and specificity in medical testing, and is in many ways quite similar to the above example, except here, we’re working with a real-world test with real-world uncertainties.

There are several ways of testing for coeliac disease, an autoimmune disorder in which the body responds to gluten proteins (gliadins and glutenins) in wheat, wheat hybrids, barley, oats and rye. One diagnostic approach looks at genetic markers in the HLA-DQ (Human Leukocyte Antigen type DQ) region, part of the MHC (Major Histocompatibility Complex) Class II receptor system. Genetic testing for a particular haplotype of the HLA-DQ2 gene, called DQ2.5, can lead to a diagnosis in most patients. Unfortunately, it’s slow and expensive. Another test, a colonoscopic biopsy of the intestines, looks at the intestinal villi, short protrusions (about 1mm long) into the intestine, for tell-tale damage – but this test is unpleasant, possibly painful and costly.

So, a more frequent approach is to look for evidence of an autoantibody called anti-tissue transglutaminase antibody (ATA) – unrelated to this gene, sadly. ATA testing is cheap and cheerful, and relatively good, with a sensitivity (S^+_{ATA}) of 85% and a specificity (S^-_{ATA}) of 97%.10 We also know the rough probability of a sample being from someone who actually has coeliac disease – for a referral lab, it’s about 1%.

Let’s consider the following case study. A patient gets tested for coeliac disease using the ATA test described above. Depending on whether the test is positive or negative, what are the chances she has coeliac disease?

Sensitivity and specificity trade-off for an ATA test given various values of true coeliac disease prevalence in the population.

If you’ve read the notebook, you know by now that the probability of having coeliac disease if testing positive is around 22%, or a little better than one-fifth. And from the visualisation to the left, you could see that small incremental improvements in specificity would yield a lot more increase in accuracy (marginal accuracy gain) than increases in sensitivity.
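The headline figure is easy to verify from the numbers quoted above:

sens  <- 0.85  # sensitivity of the ATA test
spec  <- 0.97  # specificity of the ATA test
prior <- 0.01  # pre-test probability of coeliac disease for a referral lab

sens * prior / (sens * prior + (1 - spec) * (1 - prior))  # 0.2225, i.e. about 22%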

While quite simple, this is a good case study because it emphasises a few essential things about Bayesian reasoning:

  • Always know your baselines. In this case, we took a baseline of 1%, even though the prevalence of coeliac disease in the general population is closer to a quarter of that, around 0.25%. Why? Because we don’t spot-test people for coeliac disease. People who do get tested get tested because they exhibit symptoms that may or may not be coeliac disease, and by definition they have a higher prevalence11 of coeliac disease. The factor is, of course, entirely imaginary – you would, normally, need to know or have a way to figure out the true baseline values.
  • Use independent baselines. It is absolutely crucial to make sure that you do not get the baselines from your own measurement process. In this case, for instance, the incidence of coeliac disease should not be calculated by reference to your own lab’s number of positive tests divided by total tests. This merely allows for further proliferation of false positives and negatives, however minuscule their effect. A good way is to do follow-up studies, checking how many of the patients tested positive or negative for ATA were further tested using other methodologies, many of which may be more reliable, and calculate the proportion of actual cases coming through your door by reference to that.

Case study 3: Vaccines in differential diagnosis

This case is slightly different, as we are going to compare two different scenarios. Both concern D_{VPD}, a somewhat contrived vaccine-preventable illness. D_{VPD} produces a very particular symptom or symptom set, S, and produces this symptom or symptom set in every case, without fail.12 The question is – how does the vaccination status affect the differential diagnosis of two identical patients,13 presenting with the same symptoms S, one of whom is unvaccinated?

No. That’s not how this works. That’s not how ANY of this works. Nnnnope.

It has been a regrettably enduring trope of the anti-vaccination movement that because doctors believe vaccines work, they will not diagnose a patient with a vaccine-preventable disease (VPD), simply striking it off the differential diagnosis or substituting a different diagnosis for it.14 The reality is explored in this notebook, which compares two scenarios of the same condition in two persons whose sole difference is vaccination status. That status alone makes the disease roughly 7,800 times less likely in the vaccinated person than in the unvaccinated one. The result is that a 7,800 times less likely outcome slides down the differential. As NZ paediatrician Dr Greenhouse (@greenhousemd) noted in the tweet, “it’s good medical care”. In the words of British economist John Maynard Keynes,15 “when the facts change, I change my mind”. And so do diagnosticians.

Quite simply put: it’s not an exclusion, or fudging data, or in any sensible way proof that “no vaccine in history has ever worked”. It’s simply a reflection of the reality that if a condition is almost 8,000 times less likely in a given population, then, yes, other, more frequent conditions push ahead of it.


Lessons learned

Bayesian analysis of the diagnostic procedure brings not only increased clarity about what one is doing as a clinician. It also brings the full panoply of mathematical and logical reasoning tools to bear on claims, objections and contentions – and, as in the case of the alleged non-diagnosis of vaccinated patients, lets us discard them.

The most powerful tool available to anyone who utilises a process of structured reasoning – be it clinical diagnostics, algorithmic analysis, detective work or intelligence analysis – is the ability to formally reason about one’s own toolkit of structured processes. It is my hope that if you’ve never thought about your clinical diagnostic process in these terms, you will now be able to see a new facet of it.

References

1.The basis of non-Western economies tends to be worse. That’s about as much as Western economies have going for them. See: Venezuela and the DPRK.
2.There’s a whole branch of probability that deals with continuous probabilities, but discrete probabilities are crazy enough for the time being.
3.Read: The probability of A given not-A is zero. A being any arbitrary event: the stock market crashing, the temperature tomorrow exceeding 30ºC, &.
4.In other words, it may be the same, but that’s pure accident. Mathematically, they’re almost always different.
5.It’s tempting to assume that this implies causation, or that the second event must temporally succeed the first, but none of those are implied, and in fact only serve to confuse things more.
6.You will also hear this referred to as ‘base rate neglect’ or ‘base rate fallacy’. As an epidemiologist, ‘rate’ has a specific meaning for us – it generally means events over a span of time. It’s not a rate unless it’s necessarily over time. I know, we’re pedantic like that.
7.This presupposes that these tests are independent of each other, like observations of a random variable. They generally aren’t – for instance, we run the acute phase protein CRP, W/ESR (another acute phase marker) and a WBC count, but these are typically not independent from each other. In such cases, it’s legitimate to use B = B_1 \cap B_2 \cap \ \ldots \cap B_n or, as my preferred notation goes, B = \bigcap^n_{k=1} B_k. I know ‘updating’ is the core mantra of Bayesianism, but knowing what to update and knowing where to simply calculate the conjoint probability is what experts in Bayesian reasoning rake in the big bucks for.
8.Note that a posterior from this step can, upon more new evidence, become the prior in the next round – the prior for j may be the inferred probability p(D_i), but the prior for j + 1 is p(D_i \mid R_j), and so on. More about multiple observations later.
9.It’s important to note that this is not necessarily the population incidence. For instance, the overall incidence and thus the relevant D for EBOV (D_{EBOV}) is going to be different for a haemorrhagic fever referral lab in Kinshasa and a county hospital microbiology lab in Michigan.
10.Lock, R.J. et al. (1999). IgA anti-tissue transglutaminase as a diagnostic marker of gluten sensitive enteropathy. J Clin Pathol 52(4):274-7.
11.More epidemiopedantry: ‘incidence’ refers to new cases over time, ‘prevalence’ refers to cases at a moment in time.
12.This is, of course, unrealistic. I will do a walkthrough of an example of multiple symptoms that each have an association with the illness in a later post.
13.It’s assumed gender is irrelevant to this disease.
14.Presumably hoping that refusing to diagnose a patient with diphtheria and instead diagnosing them with a throat staph infection will somehow get the patient okay enough that nobody will notice the insanely prominent pseudomembrane…
15.Or not…

Automagic epi curve plotting: part I

As of 24 May 2018, the underlying data schema of the Github repo from which the epi curve plotter draws its data has changed. Therefore, a lot of the code had to be adjusted. The current code can be found here on Github. This also plots a classical epi curve.

One of the best resources during the 2013–16 West African Ebola outbreak was Caitlin Rivers’ Github repo, which was probably one of the best ways to stay up to date on the numbers. For the current outbreak, she has also set up a Github repo, with really frequent updates straight from the WHO’s DON data and the information from the DRC Ministry of Public Health (MdlS) mailing list.1 Using R, I have set up a simple script that I only have to run every time I want a pre-defined visualisation of the current situation. I am usually doing this on a remote RStudio server, which makes it quite easy for me to quickly generate these plots on the fly.

Obtaining the most recent data

Using the following little script, I grab the latest from the ebola-drc Github repo:

# Fetch most recent DRC data.
library(magrittr)
library(curl)
library(readr)
library(dplyr)

# Build a timestamped file name, download the latest CSV to it (the tee operator,
# %T>%, passes the file path through unchanged), then read it back in.
current_drc_data <- Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%
  paste("raw_data/drc/", "drc-", ., ".csv", sep = "") %T>%
  curl_fetch_disk("https://raw.githubusercontent.com/cmrivers/ebola_drc/master/drc/data.csv", .) %>%
  read_csv()

This uses curl (the R implementation) to fetch the most recent data and save it as a timestamped2 file in the data folder I set up just for that purpose.3 Simply sourcing this script (source("fetch_drc_data.R")) should then load the current DRC dataset into the environment.4

Data munging

We need to do a little data munging. First, we melt down the data frame using reshape2‘s melt function. Melting takes ‘wide’ data and converts it into ‘long’ data – for example, in our case, the original data had a row for each daily report for each health zone, and a column for each of the various combinations of confirmed/probable/suspected over cases/deaths. Melting the data frame down creates a variable column holding the type (say, confirmed_deaths) and a value column (giving the value, e.g. 3). Using lubridate,5 the dates are parsed, and the values are stored in a numeric format.

library(magrittr)
library(reshape2)
library(lubridate)

current_drc_data %<>% melt(value.name = "value",
                           measure.vars = c("confirmed_cases", "confirmed_deaths",
                                            "probable_cases", "probable_deaths",
                                            "suspect_cases", "suspect_deaths", "ruled_out"))
current_drc_data$event_date <- lubridate::ymd(current_drc_data$event_date)
current_drc_data$report_date <- lubridate::ymd(current_drc_data$report_date)
current_drc_data$value <- as.numeric(current_drc_data$value)

Next, we drop ruled_out cases, as they play no significant role for the current visualisation.

current_drc_data <- current_drc_data[current_drc_data$variable != "ruled_out",]

We also need to split the type labels into two different columns, so as to allow plotting them as a matrix. Currently, data type labels (the variable column) has both the certainty status (confirmed, suspected or probable) and the type of indicator (cases vs deaths) in a single variable, separated by an underscore. We’ll use stringr‘s str_split_fixed to split the variable names by underscore, and join them into two separate columns, suspicion and mm, the latter denoting mortality/morbidity status.

library(stringr)

current_drc_data %<>% cbind(., str_split_fixed(use_series(., variable), "_", 2)) %>% 
                 subset(select = -c(variable)) %>% 
                 set_colnames(c("event_date", "report_date", "health_zone", "value", "suspicion", "mm"))

Let’s filter out the health zones that are being observed but have no relevant data for us yet:

relevant_health_zones <- current_drc_data %>% 
                         subset(select = c("health_zone", "value")) %>% 
                         group_by(health_zone) %>% 
                         summarise(totals = sum(value, na.rm=TRUE)) %>% 
                         dplyr::filter(totals > 0) %>% 
                         use_series(health_zone)

This gives us a vector of all health zones that are currently reporting cases. We can filter our DRC data for that:

current_drc_data %<>% dplyr::filter(health_zone %in% relevant_health_zones)

This whittles down our table by a few rows. Finally, we might want to create a fake health zone that summarises all other health zones’ respective data:

totals <- current_drc_data %>%
          group_by(event_date, report_date, suspicion, mm) %>%
          summarise(value = sum(value), health_zone = as.factor("DRC total"))

# Reorder totals to match the core dataset
totals <- totals[,c(1,2,6,5,3,4)]

We then bind these together into a single data frame:

current_drc_data %<>% rbind.data.frame(totals)

Visualising it!

Of course, all this was in pursuance of cranking out a nice visualisation. For this, we need to do a couple of things, including first ensuring that “DRC total” is treated separately and comes last:

# Get the health zones, with "DRC total" moved to the end
regions <- current_drc_data %>% use_series(health_zone) %>% unique()
regions <- regions[regions != "DRC total"]
regions %<>% c("DRC total")

current_drc_data$health_zone_f <- factor(current_drc_data$health_zone, levels = regions)

I normally start out by declaring the colour scheme I will be using. In general, I tend to use the same few colour schemes, which I keep in a few gists. For simple plots, I prefer to use no more than five colours:

colour_scheme <- c(white = rgb(238, 238, 238, maxColorValue = 255),
                   light_primary = rgb(236, 231, 216, maxColorValue = 255),
                   dark_primary = rgb(127, 112, 114, maxColorValue = 255),
                   accent_red = rgb(240, 97, 114, maxColorValue = 255),
                   accent_blue = rgb(69, 82, 98, maxColorValue = 255))

With that sorted, I can invoke the ggplot method, storing the plot in an object, p. This makes it easy to retrieve the plot later with the ggsave method.

library(ggplot2)   # make sure ggplot2 is attached for ggplot(), ggsave() & co.

p <- ggplot(current_drc_data, aes(x=event_date, y=value)) +

  # Title and subtitle
  ggtitle(paste("Daily EBOV status", "DRC", Sys.Date(), sep=" - ")) +
  labs(subtitle = "(c) Chris von Csefalvay/CBRD (cbrd.co) - @chrisvcsefalvay") +
  
  # This facets the plot based on the factor vector we created earlier
  facet_grid(health_zone_f ~ suspicion) +
  geom_path(aes(group = mm, colour = mm, alpha = mm), na.rm = TRUE) +
  geom_point(aes(colour = mm, alpha = mm)) +

  # Axis labels
  ylab("Cases") +
  xlab("Date") +

  # The x-axis runs from the first notified case to the present day; the limits
  # are set inside scale_x_date() so they don't clash with a separate xlim() call
  scale_x_date(limits = c(as.Date("2018-05-08"), Sys.Date()),
               date_breaks = "7 days", date_labels = "%m/%d") +

  # Because often there's an overlap and cases that die on the day of registration
  # tend to count here as well, some opacity is useful.
  scale_alpha_manual(values = c("cases" = 0.5, "deaths" = 0.8)) +
  scale_colour_manual(values = c("cases" = colour_scheme[["accent_blue"]], "deaths" = colour_scheme[["accent_red"]])) +

  # Ordinarily, I have these derive from a theme package, but they're very good
  # defaults and starting points
  theme(panel.spacing.y = unit(0.6, "lines"), 
        panel.spacing.x = unit(1, "lines"),
        plot.title = element_text(colour = colour_scheme[["accent_blue"]]),
        plot.subtitle = element_text(colour = colour_scheme[["accent_blue"]]),
        axis.line = element_line(colour = colour_scheme[["dark_primary"]]),
        panel.background = element_rect(fill = colour_scheme[["white"]]),
        panel.grid.major = element_line(colour = colour_scheme[["light_primary"]]),
        panel.grid.minor = element_line(colour = colour_scheme[["light_primary"]]),
        strip.background = element_rect(fill = colour_scheme[["accent_blue"]]),
        strip.text = element_text(colour = colour_scheme[["light_primary"]])
  )

DRC EBOV outbreak, 22 May 2018. The data has some significant gaps, owing to monitoring and recording issues, but some clear trends already emerge from this simple illustration.

The end result is a fairly appealing plot, although if the epidemic goes on, one might want to consider getting rid of the point markers. All that remains is to insert an automatic call to the ggsave function to save the image:

Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%
  paste("DRC-EBOV-", ., ".png", sep="") %>%
  ggsave(plot = p, device="png", path="visualisations/drc/", width = 8, height = 6)

Automation

The cronR package has a built-in cron scheduler add-in for RStudio, allowing you to manage all your code scheduling needs.

Of course, being a lazy epidemiologist, this is the kind of stuff that just has to be automated! Since I run my entire RStudio instance on a remote machine, it makes perfect sense to run this script on a regular schedule. The cronR package comes with a nice widget that allows you to schedule any task with a few clicks. Old-school command line users can, of course, always resort to ye olde command line based scheduling and execution. One important caveat: the context of cron execution will not necessarily be the same as that of your R project or indeed of your R user. Therefore, when you source a file or refer to paths, you may want to use fully qualified paths, i.e. /home/my_user/my_R_stuff/script.R rather than merely script.R. cronR is very good at logging when things go awry, so if the plots do not start to magically accumulate at the requisite rate, do give the log a check.
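
For illustration, scheduling the whole thing from R with cronR might look something like the sketch below; the script path and the 07:00 run time are assumptions of mine rather than a description of my actual crontab:

# Minimal cronR sketch: run the analysis script daily at 07:00.
# The path is hypothetical – substitute your own fully qualified path.
library(cronR)

cmd <- cron_rscript("/home/my_user/my_R_stuff/drc_ebov_plot.R")
cron_add(cmd, frequency = "daily", at = "07:00", id = "drc_ebov_daily",
         description = "Fetch DRC EBOV data and render the daily plot")

The same command string can, of course, go straight into a crontab entry if you prefer the command line.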

The next step is, of course, to automate uploads to Twitter. But that’s for another day.

References

1.Disclaimer: Dr Rivers is a friend, former collaborator and someone I hold in very high esteem. I’m also from time to time trying to contribute to these repos.
2.My convention for timestamps is the military DTG convention of DDHHMMSSMONYEAR, so e.g. 7.15AM on 21 May 2018 would be 21071500MAY2018.
3.It is, technically, bad form to put the path in the file name for the curl::curl_fetch_disk() function, given that curl::curl_fetch_disk() offers the path parameter just for that. However, due to the intricacies of piping data forward, this is arguably the best way to do it so as to allow the filename to be reused for the read_csv() function, too.
4.If for whatever reason you would prefer to keep only one copy of the data around and overwrite it every time, quite simply provide a fixed filename into curl_fetch_disk().
5.As an informal convention of mine, when using the simple conversion functions of lubridate such as ymd, I tend to note the source package, hence lubridate::ymd over simply ymd.

Ebola: a primer.

After this post was published, a lot of people asked me to do a Reddit AMA, where I could answer questions from a wider audience. The AMA has since concluded, but you can read the questions and answers here.

 

As I’m writing this, the beginnings of what could well be a major outbreak are raging in Bikoro territory, Equateur province, in the northwest of the Democratic Republic of the Congo (DRC). Recent reports indicate that Mbandaka, the capital of Equateur and home to a busy port and a million people, has reported cases as of 17 May. The death toll has reached 25 at the time of writing, and it’s anyone’s guess how bad it’ll get – having learned from the unexpectedly extensive devastation of the West African Zaire ebolavirus outbreak (2013-16), everybody is preparing for the worst case scenario. Ebolaviruses and I have a long relationship, going back over a decade – I sometimes wistfully remark that I know more about virion protein (VP) 24 of the Zaire ebolavirus (EBOV) than I know about some of my own family members. The reverse of the medal is that reading some of the nonsense in the press is borderline physically painful. I’ve assembled these resources for interested laypeople – especially journalists intending to comment on the Bikoro outbreak – in the hope that they will somewhat reduce misunderstandings.

Some taxonomy pedantry for starters

To start with, a point of pedantry: there are multiple ebolaviruses, so technically, ‘Ebola virus’ is a misnomer. Viral taxonomy is a complex thing, governed largely by the International Committee on the Taxonomy of Viruses (ICTV). The latter has preliminarily determined the taxonomy of filoviruses to look as follows:1

  • Family Filoviridae
  • Genus Ebolavirus
    • Species Bundibugyo ebolavirus (BDBV)
    • Species Reston ebolavirus (RESV or RESTV)
    • Species Sudan ebolavirus (SUDV)
    • Species Taï Forest ebolavirus, formerly Côte d’Ivoire ebolavirus (TAFV)
    • Species Zaire ebolavirus (EBOV or ZEBOV)
  • Genus Marburgvirus
    • Species Marburg marburgvirus (MARV)

By far the most important of these are EBOV and SUDV. These have been responsible for almost all major outbreaks – TAFV had only a single human case (CFR:2 1/0, 0%), and RESV killed a lot of monkeys3 but a number of humans, despite seroconverting,4 did not fall ill. SUDV is generally regarded as somewhat more benign than EBOV, with a CFR around 50% (range 41-65%, discounting the 2011 Luweero case, where the single patient died). EBOV is the type species of the genus, and it commonly has case fatality rates of up to 93%. It is almost certain that the current outbreak in the DRC is an EBOV outbreak.

Viral species are further subdivided into strains. This is important for ebolaviruses, EBOV in particular, because there seems to be an emerging divergence. Typically, ebolavirus outbreaks claim 300-400 lives at most, tend to be over in 3-4 months and are fairly localised. Because non-RESV ebolaviruses, at least in humans, need contact with bodily fluids, long chains of transmission are rare. The 2013-16 West African outbreak, however, seems to have upended this hypothesis. That outbreak lasted almost twelve times the average for all known outbreaks until then, and claimed more lives than all known ebolavirus outbreaks (since the index outbreak in Yambuku, DRC, in 1976) put together. Why this was the case is a bit of a mystery, but there is now an understanding that EBOV strains more similar to the Mayinga strain (EBOV/May) isolated in 1976 differ from strains more similar to the Makona strain (EBOV/Mak), which was the prevalent strain in the West African outbreak.

Background and history

Courtesy of SIB/ViralZone.

Ebolaviruses belong to the family Filoviridae, so named for their threadlike appearance – they are among the longest viruses, reaching a length of up to 14,000nm and a width of approximately 80nm. The genome of ebolaviruses is relatively simple, approximately 19,000 bases long, and stored as single-stranded negative-sense RNA, making ebolaviruses, and all other filoviruses, (–)ssRNA or Baltimore group V viruses. This is significant because negative-sense single-strand RNA first has to be transcribed into positive-sense RNA by the viral RNA polymerase, and therefore the genome on its own isn’t directly infectious.

Ebolaviruses, and other filoviruses, are probably pretty old – in fact, a study by Taylor et al. (2014) found genetic fossils5 of EBOV-like VP35 in the same location in the genomes of several cricetid rodents (voles and hamsters), suggesting that ebolaviruses diverged from marburgviruses around the time the common ancestor of hamsters and voles lived, sometime around the Miocene (16-23 million years ago).6

The Yambuku mission hospital’s record for Mr Mabalo, a school teacher, who was the first recorded human case of Ebola Virus Disease, dated 26 August 1976. Mr Mabalo would eventually succumb to EVD on 06 September.
Photo courtesy of Guido van der Groen/ITM Antwerp.

We also know that EBOV only relatively recently diverged from other ebolaviruses (sometime in the last century), but the first acknowledged outbreak of an ebolavirus took place in 1976 in Yambuku, in what was then Zaïre and is today the DR Congo. The story of this outbreak is extensively told in a retrospective article written four decades later by Joel Breman and a number of others who were present at the initial outbreak. Arguably, we saw the emergence of a physiologically and epidemiologically different strain of EBOV during the West African EBOV epidemic, too – at least in the wild, EBOV/Mak behaved very differently from EBOV/May: characterised by long chains of transmission, a somewhat lower CFR7 and a much longer epidemic duration with a significantly larger number of cases – indeed, the 2013-16 outbreak claimed more lives than every single known filoviral outbreak since the first recorded filoviral epidemic, the 1967 Marburg outbreak, put together. Recent evidence suggests that infection with EBOV/Mak exhibits clinically significant differences from the previously known strains, to the point that they might explain the difference between the 2013-2016 West African outbreak and previous epidemics, which typically were regionally limited, originated in central Africa (Sudan and Zaire) rather than the coastal states of the Gulf of Guinea, and lasted a few months with no more than 300-400 cases.8

Worth reading:

  • Two of the protagonists of the 1976 Yambuku outbreaks have written amazing autobiographies that are worth reading. No Time to Lose, by Peter Piot, is a fascinating book, although most of it – like Peter Piot’s career – is devoted to STDs, especially the fight against AIDS. His colleague and countryman, Guido van der Groen, has also written an engaging and well-written memoir, On the Trail of Ebola.
  • Murphy, F.A. (2016): Historical perspective: what constitutes discovery (of a new virus)? In: Kielian, M. et al. (eds.), Advances in Virus Research 95:197-220. – What’s it like to discover a virus? Fred Murphy, whose transmission electron micrograph graces the header of this blog post and has become inextricably associated with ebolaviruses, was working as CDC’s chief viropathologist in 1976, and if not a father of EBOV’s discovery, he is at the very least its godfather. His experiences with Ebola specifically are summarised in section 5.8 of the chapter.
  • Tropical medicine professor and ID physician David Brett-Major’s book, A Year of Ebola, is an up-close-and-personal account of a year of the 2013-2016 West African outbreak, and of the challenges of rendering assistance in the chaos of such an outbreak. For those unfamiliar with what a major, multi-party public health intervention involves, this book is a must-read.
  • A good and somewhat lighthearted starter is my interview with Devon from the Bugs, Blood and Bones podcast: part 1 | part 2. This discusses many of the principal points you should know about ebolaviruses, especially the reason we can’t simply eliminate ebolaviruses as easily as, say, smallpox.

The (proteomic) nature of the beast

In its octameric ‘ring’ or ‘crown’ configuration, VP40 is a regulator of RNA viral transcription. Author’s figure.

Ebolaviruses are remarkably simple for all the destruction they’re capable of. To understand the issues that curing ebolavirus infections raises, it’s important to understand how the virus itself is constructed and how it operates on a molecular level. The ebolavirus genome encodes seven proteins: a nucleoprotein (NP), an RNA polymerase (L), the glycoprotein GP, and four viral proteins (VPs): VP24, VP30, VP35 and VP40 (sometimes referred to as the matrix protein). Because the genome encodes so few proteins, some of them ‘moonlight’ – that is, they fulfil multiple functions, depending on their polymerisation state.

  • The overall structure of the virion is given by the ebola matrix protein or VP40. As a hexamer looking a bit like the S-shaped Tetris piece,9 it’s responsible for the structure of the virion, while as a crown-shaped octamer wrapped around the RNA, it regulates RNA transcription. The matrix protein’s main purpose, other than serving as a physical outer shell, is to connect the nucleocapsid with the target cell’s membrane, allowing penetration. VP40 also gives ebolaviruses their characteristic structure. For this reason, and because it also coordinates some aspects of the viral lifecycle – in particular virion assembly and ‘budding’, that is, egress from infected cells – it’s being considered as a therapeutic target.10
    As a hexamer, VP40 is the primary matrix protein of filoviral virions.
  • The RNA is surrounded by a dynamic nucleocapsid, made up of VP35, VP30 and VP24. The purpose of this is to store and, at the necessary time, deliver, the genetic payload. The nucleoprotein NP is wrapped around the RNA genome.
  • VP24 is also used to disrupt the innate immune system, specifically the STAT1 signalling pathway. Normally, in response to viral infections, interferons phosphorylate the STAT1 protein, which then binds to karyopherin alpha (KPNA). Karyopherin alpha is an ‘importin’, a shuttle protein. Once STAT1 is bound to KPNA, it is ferried to the nucleus, and stimulates gene transcription. VP24 selectively tricks this: it binds competitively to KPNA, so that STAT1 cannot bind to it. In a sense, VP24 is hijacking the cell’s internal shuttle system, preventing an adequate immune response but maintaining the ability to use the system for its own purposes.
  • L, or RNA-dependent RNA polymerase, is required because ebolaviruses are negative-sense single strand RNA viruses, and thus a complementary, positive sense strand needs to be generated for transcription.
  • GP, the ebolavirus glycoprotein, is perhaps the most essential part of the internal machinery of an ebolavirus. GP is responsible for infecting new cells, and for a cytopathogenic effect on endothelial cells – in other words, GP damages the cells that line blood vessels in particular and has been observed to cause endothelial cell loss. This in turn results in the haemorrhagic symptoms that characterise EVD’s haemorrhagic stage.11

Ebola virus disease (EVD) and pathophysiology

Human and primate ebolavirus infection (regardless of species or strain) causes Ebola Virus Disease (EVD), sometimes referred to as Ebola haemorrhagic fever (EHF). EVD is more accurate as the well-known haemorrhagic manifestations are far from ubiquitous (about half the cases at best).12

Illustration courtesy of T.W. Geisbert and H. Feldmann.

EVD begins with nonspecific signs – like a bad flu: after an incubation period of about four days to two weeks, fatigue, fever, loss of appetite and muscle aches set in, along with vomiting, diarrhoea and more vomiting. Despite their apparent simplicity, ebolaviruses carry out a complex and multifactorial programme of destruction:

  1. Prodromic stage: In the early, prodromic stage, the viral protein VP24 inhibits interferon type I and II signalling, effectively cutting the communication lines of the immune system and allowing the virus to proliferate in peace. During this time, the patient may be asymptomatic or have nonspecific symptoms like headaches, fatigue and a mild fever.
  2. Early disseminating stage: Ebolaviruses preferentially attack certain white blood cells that allow them to spread through the lymphatic system, in particular dendritic cells, macrophages and monocytes, and later on spread prolifically through liver cells and the adrenal gland, causing liver damage (leading to clotting issues and the diagnostically significant elevated transaminase levels). The death of the infected monocytes (called a cytopathic or cytopathogenic effect) causes immunosuppression through low lymphocyte counts and releases pro-inflammatory molecules, in particular TNF-alpha, and the interleukins IL-6 and IL-8, creating a state somewhat reminiscent of sepsis. GP also helps prevent neutrophils, white blood cells crucial for immune reactions, from activating.
  3. Vascular endothelial damage: Glycoprotein (GP) in vascular endothelial cells (the cells lining the walls of blood vessels) destroys the integrity of blood vessels around three to four days after infection, leading to bleeding.
  4. Liver injury and DIC: GP, when expressed in the liver, causes liver damage, and also suppresses the production of integrins. Integrins are transmembrane proteins that allow cells to attach to the various molecules outside the cell, which is crucial for clotting. Together, these lead to a paradoxical state called disseminated intravascular coagulation (DIC): small blood clots form in the capillaries all over the body, leading to ischemia (lack of blood supply) and organ failure, while at the same time using up all the clotting factors and platelets. This is responsible for the later haemorrhagic manifestations.
  5. At this stage, patients that do not recover succumb to the combined effects of multi-organ failure, volume loss from diarrhoea and massive haemorrhage.

Together, these have a damaging effect on vascular endothelial cells, the cells lining the walls of blood vessels, leading to internal bleeding and the haemorrhagic manifestations.

Eventually, the haemorrhagic (bleeding) symptoms – bleeding under the skin, uncontrollable bleeding from blood draws, bleeding into the sclerae (the whites of the eyes), blood in vomit and faeces – may begin, largely because of damage to the liver and the depletion of clotting factors.

Death usually occurs 8-14 days from onset of symptoms. Contrary to popular perception, death is actually not caused by bleeding out – the blood loss is quite simply not enough to be fatal, even in the haemorrhagic cases. Rather, ebolaviruses turn the body’s own inflammatory cascades on overdrive, causing a state that’s somewhat similar to septic shock. Survivors begin to feel better around 10-14 days after first symptoms, but recovery is slow and can take months.

Worth reading:

  • Geisbert, T.W. and Feldmann, H. (2011): Ebola haemorrhagic fever. Lancet 377:849-62. – a great summary; while intended for professional audiences, it is probably the most comprehensive article on what we know about ebolaviruses. Nb. that it was written before the 2013-16 West African outbreak.
  • Munoz-Fontela, C. and McElroy, A.K. (2017): Ebola virus disease in humans: pathophysiology and immunity. In: Mühlberger E. et al. (eds.), Marburg- and Ebolaviruses. Springer, 2017. – This is a rather pricey book, and aimed at public health experts, but is probably the best summary of post-West African outbreak scholarship on all things ebola- and marburgviruses. For those writing for a professional audience or desiring a more comprehensive understanding of the underlying biology, it’s a must-have. Disclaimer: many of the chapter authors and editors are friends and/or valued colleagues.

Ecology and reservoir hosts

Finding the reservoir host of ebolaviruses and Marburg marburgvirus consumed an incredible amount of scientific effort during the 1980s and 1990s, with relatively little to show for it. It was clear from the very beginning that ebolaviruses are zoonotic – that is, there’s a reservoir host, an animal in which the virus can persist and multiply without causing disease. This explains why it sometimes appears as if ebolaviruses (and Marburg) came out of nowhere, wreaked havoc, then disappeared as fast as they appeared. Using RT-PCR and qRT-PCR, it’s now clear that the reservoir hosts are bats – a number of species, in particular certain fruit bats. Bats have a complex interferon (IFN) system, much more complex than the human or NHP13 IFN system, which seems to give them the ability to manage the infection in their bodies (see the Kühl and Pöhlmann paper below). There’s a global increase in bat-borne pathogens causing outbreaks – these are almost all viral (the related henipaviruses Hendra virus in Australia and Nipah virus in Malaysia/Bangladesh, the coronaviruses MERS-CoV and SARS-CoV, rabies, etc.). As humanity, in need of arable land to feed an exploding population and of mineral resources like diamonds and coltan, encroaches upon the traditional habitats of chiropteran species, especially the caves and jungles where they roost, interactions between bats and humans will become more and more frequent, raising the risk of infections. Clearly, a strategy to manage ebolaviruses must also be able to manage the ecological problem of habitat loss.

Worth reading:

  • Kühl, A. and Pöhlmann, S. (2012): How Ebola Virus Counters the Interferon System. Zoon Pub Health 59:116-131. – great paper, but tough to digest for non-technical audiences. For those who prefer a slightly more relaxed version, see the next link.
  • Fagre, A. (2016): Why don’t bats get Ebola? Scientific American Guest Blog, July 18, 2016. – same topic as above, just for more popular audiences.
  • On ecology, the chapter Ecology of Filoviruses in Mühlberger et al. (eds.), op cit, is worth reading.
  • For understanding zoonotic diseases, Spillover by David Quammen (2013) is an excellent read. Ebola: The Natural and Human History of a Deadly Virus, written in 2014, updates his chapter on ebolaviruses – largely EBOV – for an audience hungry for information after the 2013 West African outbreak. Quammen has a great style and writes well, without Preston’s sensationalism. If this is your first foray into writing about, or trying to understand, filoviral zoonoses, both books are very much worth reading. The added value of Ebola: The Natural and Human History over the Ebola chapter in Spillover is, to me at least, dubious. It is, however, a much shorter read for those pressed for time.

Treatment and prophylaxis

So far, no particular agent has proved to be conclusively effective against EBOV infection after symptoms have emerged, and treatment is mainly symptomatic. It is haunting that the state of the art in treating filoviral haemorrhagic fever in 2018 is not much different from the approach Margaretha Isaäcson and her team used on the three Marburg cases – Cases 1 and 2, Australian hitchhikers, and Case 3, a nurse who took care of both Cases 1 and 2 – in 1975:

At this stage, it became clear that there would be no specific treatment that could be relied upon to attack and kill the virus responsible for this infection. The girls’ only chance of survival would, therefore, depend on meticulous, ongoing monitoring of various organ functions and managing clinical problems in anticipation or as they presented themselves. This approach required a large team in support of the core formed by the clinicians responsible for the daily evaluation, treatment and general management of the patients.
– from the notes of Margaretha Isaäcson, 26 February 1975

A model Ebola Treatment Centre. The three wards each segregate low-likelihood, high-likelihood and confirmed cases. The orange building in the lower right corner is a field morgue. Double fencing allows patients’ families to communicate with their loved ones from a safe distance, without needing to breach isolation. Illustration courtesy of MSF.

Treatment is focused on volume and electrolyte replacement (intravenously or using oral rehydration salts aka ORSs), pain management and blood transfusions to combat blood loss. To manage disseminated intravascular coagulation and the ensuing coagulopathy, heparin and clotting factors have both been used, with mixed success. Intensive care can greatly increase survival chances, but in low resource settings this remains a challenge. The West African outbreak has demonstrated the utility and sustainability of three-segment (four, if you count the morgue) Ebola Treatment Centres (ETCs, see image) as an easy and inexpensive way to reduce nosocomial spread (spread within a healthcare facility). The model ETC design, which separates confirmed, low-probability and high-probability cases, reduces the risk to lower probability cases by separating them from higher-probability or confirmed cases. One of the painful lessons of the 1976 Yambuku outbreak was that reuse of medical equipment, in particular of hypodermic needles and syringes, can greatly contribute to the spread of ebolaviruses – which makes overcoming the logistical challenge of keeping an outbreak response in an isolated, ill-accessible location supplied with single-use equipment all the more pressing.

There are no specific treatment options for EVD that have stood the test of time and rigorous trials. A few of the most often discussed specific treatment options are outlined below:

  • Convalescent plasma has for a long time been the best hope against filoviral infections, but is not always accessible and has its own risks, such as residual viral loads. It also doesn’t keep too well (like liquid plasma, it must be kept between +2ºC and +6ºC). It is taken from survivors of the infection using plasmapheresis, a process quite similar to haemodialysis except that in this case the dialysate is retained. This contains antibodies that the patient developed following their infection. Convalescent plasma also contains a range of other antibodies, and these can cause various immune reactions – importantly, convalescent plasma must come from healthy individuals (‘donor qualified’, i.e. adequate haemoglobin levels and free from bloodborne pathogens) who are compatible with the recipient’s blood type. In regions where ebolaviruses are endemic, this is one of the easiest treatment options to implement, but the efficacy of convalescent plasma may be hampered by epitopic dissimilarity (that is, if the strain the donor recovered from and the strain the recipient is suffering from are too dissimilar, the antibodies won’t work). The WHO has worked out a detailed guideline on using convalescent plasma, which also highlights one of its greatest drawbacks: it works best for patients with early stage disease.
  • ZMapp is a biological drug, specifically a monoclonal antibody. Monoclonal antibodies are artificially created equivalents of the antibodies in convalescent plasma. The great benefit of ZMapp over convalescent plasma is that it only contains antibodies specifically against EBOV, and as such the risk of immune reactions is negligible. ZMapp’s efficacy is quite controversial, as due to the scarcity and cost of the drug, the number of patients treated was too low to really be able to draw conclusions from.
  • Brincidofovir is a broad spectrum antiviral against DNA viruses, such as cytomegalovirus, smallpox and herpes simplex. For some reason, its lipid moiety appears to have shown some efficacy against EBOV, even though EBOV is not a DNA virus but a (-)ssRNA (negative-sense single-stranded RNA, Baltimore group V) virus. However, a very small (n=4) Phase II trial in Liberia was prematurely cancelled after the manufacturer decided to stop pursuing EVD as a target for brincidofovir, and all enrolled subjects died of EVD.
  • Favipiravir is also a broad spectrum antiviral, with specific activity against RNA viruses, initially developed against influenzaviruses. The JIKI trial was conducted in Gueckedou, the ground zero of the 2013-2016 outbreak, in September 2014, and has indicated some efficacy for patients with less severe disease (low to medium viral loads). Controversially, because the criteria weren’t met for a proper randomised clinical trial in late 2014, the JIKI trial was historically controlled, and this has drawn extensive professional criticism.

There are a range of ebolavirus vaccines, most specifically targeting EBOV. The two currently available vaccines are rVSV-ZEBOV and the cAd3-ZEBOV vaccine (colloquially referred to as the NIAID vaccine).

  • rVSV-ZEBOV is a somewhat quirky viral vaccine. It is intended to create antibodies to GP, the virion glycoprotein of EBOV. Normally, vaccines contain an adjuvant and an antigen, such as a viral protein (e.g. the HPV vaccine contains the protein shell, called the L1 major capsid protein, of various HPV strains). The immune system then recognises this as foreign and generates antibodies against it. rVSV-ZEBOV works a little differently – it actually contains a live virus, VSV (vesicular stomatitis virus or Indiana vesiculovirus, a distant relative of rabies), which is harmless in humans but causes a disease very similar to foot and mouth disease in cattle and horses. This recombinant (hence r) VSV expresses small amounts of GP, to which the body then generates antibodies. In a ring vaccination trial called the Ebola ça Suffit-GRV Trial, 7,284 participants were recruited in Guinea, and a parallel trial with the rVSV-ZEBOV vaccine was carried out in Sierra Leone by the CDC (the STRIVE VSV-EBOV trial). The trial faced complex ethical dilemmas. Placebo control would clearly not be ethically (or politically) acceptable, so instead the trial participants were randomised into two cohorts, some of whom received the vaccine after a three week delay. However, due to encouraging early results, the control arm was effectively dispensed with and everybody was vaccinated. The National Academies of Sciences, Engineering and Medicine published a report in which they assessed the trials, and found that much like in the case of favipiravir, it’s hard to assess a life-saving treatment in the middle of a lethal epidemic. The WHO has announced that it will use the rVSV-ZEBOV vaccine to ring vaccinate contacts of known, laboratory confirmed cases, from 21 May onwards, and has a stock of 7,000 doses of the vaccine in cold storage in Kinshasa. Ring vaccination has been used successfully in the eradication of smallpox, and there is ample evidence for its efficacy and its ability to control further spread, provided contact tracing is successful.
  • cAd3-ZEBOV aka the NIAID/GSK vaccine is a similarly structured vaccine, but derived from a chimpanzee adenovirus, ChAd3. Like the rVSV-ZEBOV vaccine, the cAd3-ZEBOV vaccine expresses glycoproteins from EBOV and, depending on configuration, SUDV.14 This vaccine is considered less ‘ready for use’, and while it’s been found safe, it is not clear what efficacy it will ultimately have.

Worth reading:

  • On Ebola treatment centres, Chowell, G. and Viboud, C. (2015): Controlling Ebola: key role of Ebola treatment centres. Lancet Inf Dis 15(2):139-141. – a good outline of the cheap yet surprisingly effective three-stage treatment centre model.
  • Medecins Sans Frontieres, who have pioneered the three-stage treatment centre structure, have a great interactive guide to a treatment centre that reflects the idea of segregation by infection probability quite well.
  • David Kroll’s article in Forbes asks the question on everyone’s mind: how will we know if the Ebola drugs used during the West African outbreak have indeed worked? Most patients received multiple different treatments, and the sample size was quite small – most of the patients in Africa have only received the usual symptomatic treatment. Clearly, there’s a huge ethical issue, and one of health equity, involved here: many drugs, high costs, many patients, and a willingness to give patients every possible chance at survival. The moral imperatives and the practicalities of the situation make it hard for researchers to gauge efficacy of individual treatments.
  • Adebamowo, C. et al. (2014). Randomised controlled trials for Ebola: practical and ethical issues. Lancet 384:1423-1424. – when it comes to clinical trials for diseases with high mortality, complex ethical issues arise. This makes research and the traditional methods of evaluating treatments difficult. Randomised controlled trials, the gold standard for assessing the efficacy of medical interventions, are especially difficult to conduct in the middle of a devastating epidemic.
  • National Academies of Sciences, Engineering and Medicine (2017). Integrating Clinical Research into Epidemic Response: The Ebola Experience. The National Academies Press, Washington, DC. – this is probably the best overview of the current state of the art when it comes to vaccines for EBOV after the West African outbreak. Chapter 4 is a must-read for vaccines, and chapter 3 for clinical treatments. Furthermore, Chapter 2 is a great in-depth exploration of the Scylla and Charybdis of doing high-quality, evidence-based clinical research in the middle of an epidemic with a high-mortality viral disease.

Keeping up to date & other stuff to read

The situation is currently quite rapidly evolving, and information flow is often quite restricted due to unreliable communication links. Perhaps the best source of information about what’s going on at the time is ProMED-mail, run by ISID. I also tweet pretty prolifically about the emerging crisis and other public health issues (you can find me at @chrisvcsefalvay), and of course you can find all my blog posts and public appearances that involve filoviruses on this page. I’m also always happy to answer questions, here in the comments thread or using the contact form (if you’re writing for a publication, please use the contact form).

I hope this primer to ebolaviruses was helpful, and if you intend to write about the subject, you now feel better informed. Please feel free to raise any questions that you think remain open in the comment thread below!

References

1.See ICTV page on filoviral taxonomy.
2.Case-fatality rate, i.e. the number of cases versus the number of deaths. Typically given as case/fatality, percentage – e.g. 10/3 (30%) means 10 cases, 3 died, 30% CFR.
3.This is the outbreak dramatised in Preston’s Hot Zone.
4.Seroconversion refers to developing antibodies against a pathogen. It does not mean actually becoming sick as well, just that the body has encountered the pathogen and has responded to it.
5.A fossil gene is what happens when a virus does not infect or kill the host, but rather incorporates bits and pieces of the viral genome into its own.
6.Taylor, D.J. et al. (2014). Evidence that ebolaviruses and cuevaviruses have been diverging from marburgviruses since the Miocene. PeerJ 2:e556, 2 Sep 2014.
7.Case-fatality ratio or case-fatality rate, which is a misnomer, since it’s neither a rate nor a ratio in the epidemiological sense. Normally given as a percentage, it is defined as \frac{C_d}{\Sigma C}, where C_d describes all deceased cases and \Sigma C is defined as the total of all cases that meet the inclusion criteria.
8.Versteeg, K. and Geisbert, T.W. (2017). Infection with the Makona variant results in a delayed and distinct host immune response compared to previous Ebola virus variants. Scientific Reports 7:9730.
9.Officially, a ‘mirrored Z free tetromino‘. Except, of course, it’s a hexomino.
10.Madara, J.J., Harty, R.N. et al. (2015). The multifunctional Ebola virus VP40 matrix protein is a promising therapeutic target. Future Virol 10(5):537-546.
11.Yang, Z.Y., Nabel, G.J. et al. (2000). Identification of the Ebola virus glycoprotein as the main viral determinant of vascular cell cytotoxicity and injury. Nature Med 6(8):886-9.
12.The descriptions of ebolaviruses or even Marburg turning patients into bags of goo or exploding with blood, largely inspired by Preston’s Hot Zone, are wildly inaccurate. Still, it’s one nasty disease.
13.Non-human primate.
14.The vaccine is intended to express glycoproteins from both when in production use. The current Phase II UK trials, conducted by Oxford University’s Jenner Institute, are done with a variant expressing only EBOV GP.

Get your Jupyterhub box set up on Linode with a single go!

At CBRD, a lot of the research work we do is done on remote machines. For various reasons, we like being able to spin up and wind down these boxes at will, and auto-configure them at short notice according to a few standard variables. At the end of the installation, then, we have a perfectly set up box with all the features we want, built around R and Python, with RStudio and JupyterHub as frontends.

Where StackScripts fit in

There is a wide range of ways to configure boxes – Puppet, Chef cookbooks, Terraform, Dockerfiles, and all that – but for ease of use, we rely on simple shell scripts that can be run as Linode StackScripts.

StackScripts are an extremely convenient way to configure a single Linode with the software you need. Unlike more complex systems like Terraform, a StackScript is essentially just a shell script that runs when the Linode is first provisioned, with no extra tooling or state to manage.

Basic usage

The Ares research node generator StackFile does a handful of things:

  • Updates the system and adds the CRAN repo as a source
  • Installs R and the RStudio version of choice
  • Installs Python and the JupyterHub version of choice
  • Installs an opinionated set of system-level packages (i.e. available to all users)
  • Configures ports and some other configuration items for the instance
  • Creates the root user
  • Daemonises the RStudio server and JupyterHub so that they restart automatically on failure and start automatically at reboot

When deployed using Linode from its StackFile, it allows for a wide range of configuration options, including ports for both Jupyter and RStudio, and a completely configured first user set up both on JupyterHub and RStudio. In addition, you can configure some install settings. A ‘barebones’ install option exists that allows for a minimum set of packages to be installed – this is useful for testing or if the desired configuration diverges from the ordinary structure. In addition, OpenCV, deep learning tools and cartography tools can be selectively disabled or enabled, as these are not always required.

User administration for Jupyterhub and RStudio

In general, user administration is preset to rely on PAM, i.e. the built-in Linux authentication structure. JupyterHub has its own administration features, described here. RStudio, on the other hand, authenticates by user group membership. The two share the same usergroup, specified in the configuration (by default and convention, this is jupyter, but you can change it), and because users created by JupyterHub fall into that user group, creating users in JupyterHub automatically grants them access to RStudio. This is overall acceptable as we tend to use both, but there might be a security concern there. If so, you can point the auth-required-user-group=$USERGROUPNAME setting at a different usergroup in /etc/rstudio/rstudio.conf.

Issues

There are some glitches that we’re trying to iron out:

  • Cartography and GIS tools glitch a little due to issues with PROJ.4.
  • GPU/CUDA support is not implemented, as this is not customarily used or provided on Linodes.
  • certbot and Let's Encrypt are not really supported yet, as our boxes are never directly public-facing, but you should most definitely find a way to put your server behind some form of SSL/TLS.
  • Currently, only Ubuntu 16.04 LTS is supported, as it’s the most widely used version; more recent versions will be added once CRAN supports them.
New to Linodes? Linode is a great service offering a simple, uncomplicated way to set up a Linux server, just how you want it! 👍🏻 In fact, I’ve been using Linode servers for over four years, and have only good experiences with them. I keep recommending them to friends, and they’re all happy! So if this post was helpful, and you’re thinking of signing up for a Linode, would you consider doing so using my affiliate link? I run this blog as a free service and pay for hosting out of pocket, and I don’t do advertisements (and never will!) or paid reviews of things (like what, paid Tyvek reviews? “Best Ebola PPE”? Nope.) – so really, it would make me so happy if you could help out this way.

As always, do let me know how things work out for you! Feel free to leave comments and ideas below. If you’re getting value out of using this script, just let me know.

Herd immunity: how it really works

There are few concepts as trivial yet as widely misunderstood as herd immunity. In a sense, it’s not all that surprising, because frankly, there’s something almost magical about it – herd immunity means that in a population, some people who are not or cannot be immunized continue to reap the benefit of immunization. On its own, this may even be counter-intuitive. And so, unsurprisingly, like many evidently true concepts, herd immunity has its malcontents – some going so far as to condemn the very idea as a ‘CDC lie’ – never mind that the concept was first used in 1923, well before the CDC was established.1

Now, let’s ignore for a moment what Dr Humphries, a nephrologist-turned-homeopath with a penchant for being economical with the truth when not outright lying, has to say – not because she’s a quack but because she has not the most basic idea of epidemiology. Instead, let’s look at this alleged ‘myth’ to begin with.

Herd immunity: the cold, hard maths

Our current understanding of herd immunity is actually a result of increasing understanding of population dynamics in epidemiology, towards the second half of the 20th century. There are, on the whole, two ways to explain it. Both are actually the same thing, and one can be derived from the other.

The simple explanation: effective R_0 depletion

The simple explanation rests on a simplification that makes it possible to describe herd immunity in terms that are intelligible at the level of high school maths. In epidemiology, R_0 (pron. ‘arr-nought‘, like a pirate) describes the basic reproduction rate of an infectious disease.2 To put it at its most simplistic: R_0 is the number of cases produced by each case. The illustration on the side shows the index case (IDX) and the first two generations of an infection with R_0 = 3.

Now, R_0 is a theoretical variable. It is usually estimated observationally, and doesn’t take measures intended to reduce it into account. And that’s where it gets interesting.

Consider the following scenario, where a third of the population is vaccinated, denoted by dark black circles around the nodes representing them. One would expect that of the 13 persons, a third, i.e. about 4, would remain disease-free. But in fact, over half of the people will remain disease-free, including three who are not vaccinated. This is because the person in the previous generation did not pass on the pathogen to them. In other words, preventing spread, e.g. by vaccination or by quarantine, can affect and reduce R_0. Thus in this case, the effective R_0 was closer to 1.66 than 3 – almost halving the R_0 by vaccinating only a third of the population.

We also know that for infections where the patient either dies or recovers, the infection has a simple ecology: every case must be ‘replaced’. In other words, if the effective R_0 falls below 1, the infection will eventually peter out. This happens quite often when everyone in a population is dead or immune after an infection has burned through it (more about that later).

Thus, the infection will be sustainable if and only if

R_{0} \geq 1

Under the assumption of a 100% efficient vaccine, the threshold value \bar{p_V} after which the infection will no longer be able to sustain itself is calculated as

\bar{p_V} = 1 - \frac{1}{R_0}

Adjusting for vaccine efficacy, E_V, which is usually less than 100%, we get

\bar{p_V} = \frac{1-\frac{1}{R_0}}{E_V} = \frac{R_0 - 1}{R_0 E_V}

For a worked example, let’s consider measles. Measles has an R_0 around 15 (although a much higher value has been observed in the past, up to 30, in some communities), and the measles vaccine is about 96% effective. What percentage of the population needs to be vaccinated? Let’s consider \bar{p_V}, the minimum or threshold value above which herd immunity is effective:

\bar{p_V} = \frac{R_0 - 1}{R_0 E_V} = \frac{15-1}{15 \cdot 0.96} = \frac{14}{14.4} \approx 97.22\%
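
For the computationally inclined, the threshold formula translates into a couple of lines of R – a minimal sketch, with a helper function name of my own choosing:

# Herd immunity threshold for a given R0 and vaccine efficacy (default: a perfect vaccine).
herd_immunity_threshold <- function(R0, efficacy = 1) {
  (R0 - 1) / (R0 * efficacy)
}

herd_immunity_threshold(R0 = 15, efficacy = 0.96)   # measles: ~0.9722, i.e. ~97.22%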

The more complex explanation: \mathcal{SIR} models

Note: this is somewhat complex maths and is generally not recommended unless you’re a masochist and/or comfortable with calculus and differential equations. It does give you a more nuanced picture of matters, but is not necessary to understand the whole of the argumentation. So feel free to skip it.

The slightly more complex explanation relies on a three-compartment model, in which the population is allotted to one of three compartments: \mathcal{S}usceptible, \mathcal{I}nfectious and \mathcal{R}ecovered. This model makes certain assumptions, such as that persons are infectious from the moment they’re exposed and that once they recover, they’re immune. There are various twists on the idea of a multicompartment model that take into account the fact that this is not true for every disease, but the overall idea is the same.3 In general, multicompartment models begin with everybody susceptible, and a seed population of infectious subjects. Vaccination in such models is usually accounted for by treating the vaccinated as ‘recovered’, and thus immune, from t = 0 onwards.

Given an invariant population (i.e. it is assumed that no births, deaths or migration occurs), the population can be described as consisting of the sum of the mutually exclusive compartments: P = \mathcal{S}(t) + \mathcal{I}(t) + \mathcal{R}(t). For the same reason, the total change is invariant over time, i.e.

\frac{d \mathcal{S}}{d t} + \frac{d \mathcal{I}}{d t} + \frac{d \mathcal{R}}{d t} = 0

Under this assumption of a closed system, we can relate the volumes of each of the compartment to the transition probabilities \beta (from \mathcal{S} to \mathcal{I}) and \gamma (from \mathcal{I} to \mathcal{R}), so that:

\frac{d \mathcal{S}}{d t} = - \frac{\beta \mathcal{I} \mathcal{S}}{P}

\frac{d \mathcal{I}}{d t} = \frac{\beta \mathcal{I} \mathcal{S}}{P} - \gamma \mathcal{I}

\frac{d \mathcal{R}}{d t} = \gamma \mathcal{I}

Incidentally, in case you were wondering how this connects to the previous explanation: R_0 = \frac{\beta}{\gamma}.
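
If you would rather watch the compartments evolve numerically than integrate by hand, the three equations above drop straight into an ODE solver. The sketch below uses the deSolve package; the parameter values (R_0 = 2, 40% of the population vaccinated at t = 0) are purely illustrative:

# Minimal SIR sketch with deSolve; parameters are illustrative, not fitted to any outbreak.
library(deSolve)

sir <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dS <- -beta * I * S / P
    dI <-  beta * I * S / P - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}

P     <- 1e6                                     # closed population
parms <- c(beta = 0.5, gamma = 0.25, P = P)      # R0 = beta / gamma = 2
init  <- c(S = 0.6 * P - 1, I = 1, R = 0.4 * P)  # 40% vaccinated at t = 0, one index case

out <- ode(y = init, times = seq(0, 365, by = 1), func = sir, parms = parms)
tail(out, 1)   # S never hits zero: the leftover susceptibles are the ones herd immunity protected

With these numbers the effective reproduction number at the outset is 2 \cdot 0.6 = 1.2, so a (much smaller) epidemic still occurs; push the initial coverage past the 50% threshold that corresponds to R_0 = 2 and the outbreak never takes off at all.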

Now, let us consider the end of the infection. If \mathcal{S} is reduced sufficiently, the disease will cease to be viable. This does not require every individual to be recovered or immune, however, as is evident from dividing the first differential equation by the third, integrating, and substituting R_0, which yields

\displaystyle \mathcal{S}(t) = \mathcal{S}(0) e^{\frac{-R_0 (\mathcal{R}(t)-\mathcal{R}(0))}{P}}

Substituting this in, the limit of \mathcal{R}, as t approaches infinity, is

\displaystyle \lim_{t\to\infty}\mathcal{R}(t) = P - \lim_{t\to\infty}\mathcal{S}(t) = P - \mathcal{S}(0) e^{\frac{-R_0 (\mathcal{R}(t)-\mathcal{R}(0))}{P}}

From the latter, it is evident that

\displaystyle \lim_{t\to\infty}\mathcal{S}(t) \neq 0 \mid \mathcal{S}(0) \neq 0

In other words, once the infection has burned out, there will still be some individuals who are not immune, not immunised and not vaccinated. These are the individuals protected by herd immunity. This is a pretty elegant explanation for why herd immunity happens and how it works. There are three points to take away from this.

First, herd immunity is not unique to vaccination. The above finding in relation to the nonzero limit of \lim_{t\to\infty}\mathcal{S}(t) holds as long as \mathcal{S}(0) \neq 0, but regardless of what \mathcal{R}(0) is. In other words, herd immunity is not something artificial.

Second, for any i \in \mathcal{S} (that is, any susceptible person) at time t, the probability of which compartment he will be in at t+1 depends on whom he encounters. That, statistically, depends on the relative sizes of the compartments. In this model, the assumption is that the sample of people i encounters will reflect the relative proportions of the individual compartments’ sizes. Thus if i meets n people at time t, each compartment will be proportionally represented, i.e. for any compartment \mathcal{C}, the proportion will be \frac{\mathcal{C}(t)}{P-1} for all \mathcal{C} \neq \mathcal{S}, for which the proportion will be \frac{\mathcal{S}(t) - 1}{P - 1}, since one cannot meet oneself. Given that the transition probability \beta_{i}(t) is assumed to equal the probability of meeting at least one element of \mathcal{I}, the following can be said. i‘s risk of infection depends on the relationship of n and \mathcal{I}(t), so that i is likely to get infected if

\displaystyle n \frac{\mathcal{I}(t)}{P-1} \geq 1

This elucidates two risk factors clearly, and the way to reduce them: reduce interactions (quarantine/self-quarantine), thereby reducing n, and reduce the proportion of infectious cases (\frac{\mathcal{I}(t)}{P-1}). The latter is where herd immunity from immunisation comes in. Recall that for a constant n, i‘s risk of infection at t rises as \mathcal{I}(t) rises.4 Recall also that while susceptible cases can turn into infectious cases, recovered (or vaccinated) cases cannot. And so, as \mathcal{R}(0) converges to P-1,5 i‘s risk of infection at any time t, denoted by \beta_{i}(t), falls. In other words,

\displaystyle \lim_{\mathcal{R}(0) \to P-1} \beta_{i}(t) = 0

Or to put it simply: the more are vaccinated at the start, the lower the probability, all things being equal, to meet someone who can pass on the infection.6

A final point to note is that this is primarily a model of statistical dynamics, and deals with average probabilities. It does not – it cannot – take account of the fact that some susceptible people are just darn unlucky, and bump into a flock of unvaccinated, shiny-eyed snowflakes. Equally, in some places, susceptible people and infected people congregate, creating a viral breeding ground, also known as a Waldorf school. There are agent-based models, which are basically attempts at brute force hacking reality, that can take account of such disparities. The takeaway is that herd immunity does not mean no susceptible individual will get infected. What it does mean is that their probability of getting infected is going to be significantly lower, for two reasons. First, given a constant number of encounters (n), the likelihood of one of them being with an infectious individual is going to be much lower. More importantly, however, because of herd immunity, the disease is going to be able to persist in the population for a far shorter time – eventually it will burn through the small number of ‘accessible’ susceptible persons. Since the cumulative risk \beta_{i}^T for i \in \mathcal{S} for an infection that dies out after time T is defined as

\beta_i^T = \displaystyle \int\limits_0^T \beta_{i}(t) \, \mathrm{d}t

– the sooner the infection dies out, the smaller the likelihood that i will be infected. With that mathematical basis, let’s tackle a few of the myths about herd immunity.

Myth #1: herd immunity only works with naturally acquired immunity

This argument goes roughly along the following lines: herd immunity does exist, but it only exists if and where the immunity is acquired the ‘natural’ way, i.e. by surviving the disease. Case in point:

The $64,000 question, of course, is what the difference is between the residual immunity from a vaccine and the residual immunity from having survived the illness. A vaccine effectively ‘simulates’ the illness without actually causing the pathological symptoms. From the perspective of the immune system, it is largely irrelevant whether it has been exposed to an actual virus that can damage the body, or merely a capsid protein that is entirely harmless but will nonetheless elicit the same immune reaction. That should suffice to bust this myth, but it’s worth considering immunity quantitatively for a moment. As we have seen above, the source of immunity doesn’t matter. In fact, it doesn’t even have to be immunity: culling every animal except one in a herd is an entirely good way to reduce disease transmission. So is sealing oneself away from the rest of society and spending the evenings telling sexually explicit stories, as the heroes and heroines of Boccaccio’s Decameron have done, since we know that

\displaystyle n \frac{\mathcal{I}(t)}{P-1} \geq 1

Boccaccio’s crowd of assorted perverts knew nothing of all this, of course, but they did know that if they reduced n, the number of contacts with possibly infected persons, their chances of surviving the plague would increase. As indeed they did. Score one for medieval perverts. The bottom line is that it is entirely immaterial how immunity was obtained.

Myth #2: Herd immunity is a concept deriving from animals. It doesn’t work on humans.

This is one of the more outlandish claims, but shockingly, it actually has a tiny kernel of truth.

Now, much of the above is a veritable storehouse of insanity, but the point it makes in the beginning has some truth to it. In human populations, herd immunity sometimes behaves anomalously, because humans are not homogeneously distributed. This is true a fortiori for humans who decide not to vaccinate, who – for better or worse – tend to flock in small groups. The term of venery for a bunch of anti-vaxxers is, in case you were wondering, a ‘plague’.7

Herd immunity has, in fact, been observed in a range of species. Humans are different in that we can knowingly and consciously decide to create herd immunity in our population and protect our fellow men, women and children – the last of whom are particularly susceptible to infectious diseases – from some of the worst killers.

Myth #3: If herd immunity can be obtained through natural immunity, surely we don’t need vaccines.

This argument has recently been peddled by the illustrious Kelly Brogan MD, who bills herself as a ‘holistic psychiatrist’ who threw away her script pad, which means she tends exclusively to the worried well and those with mild mental health issues where medication does not play as decisive a role as it does in, say, schizophrenia, severe PTSD, crippling anxiety disorders or complex neuropsychiatric post-insult phenomena.8 Here’s her foray into epidemiology, something she vaguely remembers studying in her first year of med school.

In this, Dr Brogan has successfully unearthed almost century-old evidence for what everybody knew, namely that herd immunity can be naturally obtained. To anyone who has read the maths part above, this should evoke a sensation of ‘DUH!’. The problem is twofold. One, the ‘actual virus’ has an unsavoury fatality rate of 0.1%, not including the horribly tragic, heartbreaking late consequence of measles known as SSPE.9 Two, and perhaps more importantly: you don’t get lifelong, natural immunity if you die. This may have somehow escaped Dr Brogan’s attention, but part of the point of herd immunity is to protect those who would not survive, or would suffer serious sequelae, if they contracted the infection. What we don’t know, of course, is how many of that 68% suffered permanent injuries, and how many are not counted because they died. What we do know is that all 68% probably had a miserable time. Anyone who thinks measles is so fantastic should start by contracting it themselves.

Myth #4: Herd immunity means 95% need to be vaccinated to prevent a disease.

This one comes courtesy of Sarah, aka the Healthy Home Economist,10 who – to what I presume must be the chagrin of her alma mater – states she has a Master’s from UPenn. Suspiciously enough, she does not state what in. I am somehow pretty sure it’s not public health.

The tedious conspiracy theory aside, it is quite evident just how little she understands of herd immunity. No – herd immunity is not based upon11 the idea that 95% must be vaccinated, and it is most definitely not based on the idea that 100% must be vaccinated. Indeed, the whole bloody point of herd immunity is that you do not need to vaccinate 100% to protect 100%. In fact, given the R_0 and vaccine efficacy E_V, we can predict the threshold vaccination rate for herd immunity quite simply, as demonstrated earlier: the threshold value, \bar{p_V}, can be calculated as

\bar{p_V} = \frac{R_0 - 1}{R_0 E_V}

As an illustration, the herd immunity threshold \bar{p_V} for mumps, with an efficacy of 88%12 and an R_0 of around 5.5, is \approx 92.98\%, while for Ebola, which has a very low R_0 around 2.0, herd immunity sets in once about 50% are immune.13
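Since the formula is that simple, there is little excuse for getting it wrong. Here is a quick sketch in R, using the figures quoted above (the R_0 and efficacy values are the ones from the text, not authoritative estimates):

# Herd immunity threshold: the share of the population that needs to be immune,
# given the basic reproduction number R0 and the vaccine efficacy E_V.
herd_immunity_threshold <- function(R0, E_V = 1) (R0 - 1) / (R0 * E_V)

herd_immunity_threshold(R0 = 5.5, E_V = 0.88)  # mumps: ~0.9298, i.e. ~92.98%
herd_immunity_threshold(R0 = 2.0)              # Ebola, immunity by survival (E_V = 1): 0.5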

And those ‘conventional health authorities’? That’s what we call health authorities whose ideas work.

Myth #5: If vaccines work, why do we need herd immunity?

This argument is delightfully idiotic, because it, too, ignores the fundamental facts of herd immunity. Quite apart from the fact that some people cannot receive some or all vaccines, and that others can receive them but may not generate sufficient antibody titres to have effective immunity, sometimes vaccines simply fail. Few things are 100% effective, and while vaccines are designed to be resilient, they can degrade due to inappropriate storage or fail to elicit a sufficient response for some other reason. And unlike wearing deodorant (or ‘deoderant’, as spelling-challenged anti-vaxxers would have it), vaccination is not a purely personal matter: infections pass along a chain of transmission, a model frequently used to explain the consequences that not vaccinating has on others.

In this illustration, an index patient (IDX) is infected and comes in contact with G1, who in turn comes into contact with G2 who in turn comes into contact with G3. In the first case, G1, G2 and G3 are all vaccinated. The vaccine may have a small failure rate – 2% in this case – but by the time we get to G3, his chances of contracting the infection are 1:125,000 or 0.0008%. In the second case, G2 is unvaccinated – if G1’s vaccine fails, G2 is almost guaranteed to also fall ill. By not vaccinating, his own risk has increased 50-fold, from 0.04% to 2%. But that’s not all – due to G2’s failure to vaccinate, G3 will also be affected – instead of the lottery odds of 1:125,000, his risk has also risen 50-fold, to 1:2,500. And this 50-fold increase of risk will carry down the chain of potential transmission due to G2’s failure to vaccinate. No matter how well vaccines work, there’s always a small residual risk of failure, just as there is a residual risk of failure with everything. But it takes not vaccinating to make that risk hike up 50-fold. Makes that deodorant (‘deoderant’?) analogy sound rather silly, right?
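The arithmetic behind those numbers fits in a few lines of R – a sketch of the example above, with the 2% failure rate taken from the illustration rather than from any particular vaccine:

# Probability that each generation in the chain contracts the infection, given an
# infectious index patient. A vaccinated link only passes the infection on if the
# vaccine fails; an unvaccinated link is treated as certain to be infected if exposed.
chain_risk <- function(vaccinated, p_fail = 0.02) {
  per_link <- ifelse(vaccinated, p_fail, 1)
  cumprod(per_link)                # risks for G1, G2, G3, ...
}

chain_risk(c(TRUE, TRUE, TRUE))    # G3: 8e-06, i.e. 1 in 125,000
chain_risk(c(TRUE, FALSE, TRUE))   # G2: 0.02 (50x higher), G3: 4e-04 (1 in 2,500)

One unvaccinated link multiplies the risk of everyone downstream of it by 1/p_fail – fifty-fold in this example.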

Conclusion

Admittedly, the mathematical basis of herd immunity is complex, and the idea itself is somewhat counterintuitive. Neither is a fit excuse for spreading lies and misinformation about herd immunity.

I have not engaged with the blatantly insane arguments (NWO, Zionists, Masonic conspiracies, Georgia Guidestones), nor with the blatantly untrue ones (doctors and public health officers are evil and guided solely by money as they cash in on the suffering of innocent children). I was too busy booking my next flight paid for by Big Pharma.14 Envy is a powerful force, and it’s a good way to motivate people to suspect and hate those who sacrificed their 20s and 30s to healing others and are finally getting paid in their 40s. But it’s the myths that sway the well-meaning and the uncommitted, and I firmly believe it’s part of our job as public health experts to counter them with truth.15

In every social structure, part of co-existence is assuming responsibility not just for oneself but for those affected by our decisions. Herd immunity is one of those instances where it’s no longer just about ourselves. Many have taken the language of herd immunity to suggest that it is some sort of favour or sacrifice made for the communal good, when it is in fact the very opposite – it is about keeping ourselves from doing (inadvertent, but otherwise often unavoidable) harm to others.

And when the stakes are this high, when it’s about life and death of millions who for whatever reason cannot be vaccinated or cannot form an immune response, getting the facts right is paramount. I hope this has helped you, and if you know someone who would benefit from it, please do pass it on to them.

References

1. Topley, W. W. C. and Wilson, G. S. (1923). The spread of bacterial infection; the problem of herd immunity. J Hyg 21:243–249. The CDC was founded 23 years later, in 1946.
2. Why R_0? Because it is easily confused with \mathcal{R}, the quantity denoting recovered cases in \mathcal{S(E)IR} models – to which it is entirely unrelated. To emphasize the distinction, I will use mathcal fonts for the compartments in compartment models.
3. I hope to write about SIS, SEIR and vital dynamic models in the near future, but for this argument, it really doesn’t matter.
4. Technically, as \frac{\mathcal{I}(t)}{P - 1} rises, but since the model presupposes that P is constant, it doesn’t matter.
5. Since otherwise \mathcal{R} = P and \mathcal{S} = 0, and the whole model is moot, as noted above.
6. Note that this does not, unlike the R_0 explanation, presuppose any degree of vaccine efficacy. An inefficiently vaccinated person is simply in \mathcal{S} rather than \mathcal{R}.
7. Initially, ‘a culture’ was proposed, but looking at the average opponent of vaccination, it was clear this could not possibly work.
8. In other words, if you have actual mental health issues, try an actual psychiatrist who follows evidence-based protocols.
9. Subacute sclerosing panencephalitis is a long-term delayed consequence of measles infection, manifesting as a disseminated encephalitis that is invariably fatal. There are no adjectives that do the horror that is SSPE justice, so here’s a good summary paper on it.
10. As a rule, I don’t link to blogs and websites disseminating harmful information that endangers public health.
11. Correct term: ‘on’.
12. As per the CDC.
13. Efficacy E_V is presumed to be 100% where immunity is acquired not via vaccination but by survival.
14. Anyone in public health will be happy to tell you that those things don’t merely no longer exist – they never existed in our field in the first place.
15. And, if need be, maths.