Structuring R projects

There are some things that I call Smith goods:1 things I want, nay, require, but hate doing. A clean room is one of these – I have a visceral need to have some semblance of tidiness around me, I just absolutely hate tidying, especially in the summer.2 Starting and structuring packages and projects is another of these things, which is why I’m so happy things like cookiecutter exist that do it for you in Python.

I am famously laid back about structuring R projects – my chill attitude is only occasionally compared to the Holy Inquisition, the other Holy Inquisition and Gunny R. Lee Ermey’s portrayal of Drill Sgt. Hartman, and it’s been months since I last gutted an intern for messing up namespaces.3 So while I don’t like structuring R projects, I keep doing it, because I know it matters. That’s a pearl of wisdom that occasionally came at a great price, some of which I am hoping to spare you with this post.

Five principles of structuring R projects

Every R project is different. Therefore, when structuring R projects, there has to be a lot more adaptability than there normally is. Still, when structuring R projects, I try to follow five overarching principles.

  1. The project determines the structure. In a small exploratory data analysis (EDA) project, you might have some leeway as to structural features that you might not have when writing safety-critical or autonomously running code. This variability in R – reflective of the diversity of its use – means that it’s hard to devise a boilerplate that’s universally applicable to all kinds of projects.
  2. Structure is a means to an end, not an end in itself. The reason why gutting interns, scalping them or yelling at them Gunny style is inadvisable is not just the additional paperwork it creates for HR. Rather, the point of the whole exercise is to create people who understand why the rules exist and organically adopt them, understanding how they help.
  3. Rules are good, tools are better. When tools are provided that take the burden of adherence – linters, structure generators like cookiecutter, IDE plugins, &c. – off the developer, adherence is both more likely and simpler.
  4. Structures should be interpretable to a wide range of collaborators. Even if you have no collaborators, it helps to think from the perspective of an analyst, a data scientist, a modeller, a data engineer and, most importantly, the client who will at the very end receive the overall product.
  5. Structures should be capable of evolution. Your project may change objectives, it may evolve and its scope may change. What was a pet project might become a client product. What was designed to be a massive, error-resilient superstructure might have to scale down. And most importantly, your single-player adventure may end up turning into an MMORPG. Your structure has to be able to roll with the punches.

A good starting structure

Pretty much every R project can be imagined as a sort of process: data gets ingested, magic happens, then the results – analyses, processed data, and so on – get spit out. The absolute minimum structure reflects this:

.
└── my_awesome_project
    ├── src
    ├── output
    ├── data
    │   ├── raw
    │   └── processed
    ├── README.md
    ├── run_analyses.R 
    └── .gitignore

In this structure, we see this reflected by having a data/ folder (a source), a folder for the code that performs the operations (src/) and a place to put the results (output/). The root analysis file (the sole R file on the top level) is responsible for launching and orchestrating the functions defined in the src/ folder’s contents.

The data folder

The data folder is, unsurprisingly, where your data goes. In many cases, you may not have any file-formatted raw data (e.g. where the raw data is accessed via a *DBC connection to a database), and you might even keep all intermediate files there, although that’s pretty uncommon on the whole, and might not make you the local DBA’s favourite (not to mention data protection issues). So while the raw/ subfolder might be dispensed with, you’ll most definitely need a data/ folder.

When it comes to data, it is crucial to make a distinction between source data and generated data. Rich Fitzjohn puts it best when he says to treat

  • source data as read-only, and
  • generated data as disposable.

The preferred implementation I have adopted is to have

  • a data/raw/ folder, which is usually symlinked to a folder that is write-only to clients but read-only to the R user,4
  • a data/temp/ folder, which contains temporary data, and
  • a data/output/ folder, if warranted (a small sketch of these conventions follows this list).
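
To make the layout above concrete, here is a minimal sketch of these conventions in code – the paths, file names and column names are purely illustrative:

# Source data is only ever read; anything generated goes to data/temp/ or data/output/
raw_dir    <- file.path("data", "raw")
temp_dir   <- file.path("data", "temp")
output_dir <- file.path("data", "output")

# Generated folders are disposable, so it is safe to (re)create them on every run
dir.create(temp_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# Read-only use of the source data...
observations <- read.csv(file.path(raw_dir, "observations.csv"))

# ...and a disposable, regenerable artefact derived from it
write.csv(observations, file.path(temp_dir, "observations_checked.csv"), row.names = FALSE)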

The src folder

Some call this folder R – I find this a misleading practice, as you might have C++, bash and other non-R code in it, but it is unfortunately enforced by R if you want to structure your project as a valid R package, which I advocate in some cases. I am a fan of structuring the src/ folder, usually by logical function. There are two systems of nomenclature that have worked really well for me and people I work with:

  • The library model: in this case, the root folder of src/ holds individual .R scripts that when executed will carry out an analysis. There may be one or more such scripts, e.g. for different analyses or different depths of insight. Subfolders of src/ are named after the kind of scripts they contain, e.g. ETL, transformation, plotting. The risk with this structure is that sometimes it’s tricky to remember what’s where, so descriptive file names are particularly important.
  • The pipeline model: in this case, there is a main runner script, or potentially a small number of them. These go through scripts in a sequence. It is a sensible idea in such a case to establish sequential subfolders or sequentially numbered scripts that are executed in sequence. Typically, this model performs better if there are at most a handful of distinct pipelines.

Whichever approach you adopt, a crucial point is to keep function definition and application separate. This means that only the pipeline or the runner scripts are allowed to execute (apply) functions, and other files are merely supposed to define them. Typically, folder level segregation works best for this:

  • keep all function definitions in subfolders of src/, e.g. src/data_engineering, and have the directly-executable scripts directly under src/ (this works better for larger projects), or
  • keep function definitions in src/, and keep the directly executable scripts in the root folder (this is more convenient for smaller projects, where perhaps the entire data engineering part is not much more than a single script).
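
As a minimal sketch of the first variant – the file names and the function are illustrative, not prescriptive – the separation of definition and application looks like this:

# src/data_engineering/clean_data.R -- definitions only; nothing is executed here
remove_incomplete_rows <- function(df) {
  # Drop any row with at least one missing value
  df[complete.cases(df), ]
}

# src/run_analysis.R -- the only place where functions are applied
source(file.path("src", "data_engineering", "clean_data.R"))

raw_data   <- read.csv(file.path("data", "raw", "observations.csv"))
clean_data <- remove_incomplete_rows(raw_data)
write.csv(clean_data, file.path("data", "processed", "observations_clean.csv"), row.names = FALSE)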

output and other output folders

Output may mean a range of things, depending on the nature of your project. It can be anything from a whole D.Phil thesis written in a LaTeX-compliant form to a brief report to a client. There are a couple of conventions with regard to output folders that are useful to keep in mind.

Separating plot output

It is common to have a separate folder for plots (usually called figs/ or plots/), usually so that they can be used for various purposes. My personal preference is that plot output folders should be subfolders of output/, rather than top-tier folders, unless the plots themselves are the very output of the project. That is the case where the project is intended to create a particular plot on a regular basis – as with the CBRD project, whose purpose was to regularly generate daily epicurves for the DRC Zaire ebolavirus outbreak.

With regard to maps, in general, the principle that has worked best for teams I ran was to treat static maps as plots. However, dynamic maps (e.g. LeafletJS apps), tilesets, layers or generated files (e.g. GeoJSON files) tend to deserve their own folder.

Reports and reporting

Not every project needs a reporting folder, but for business users, having a nice, pre-written reporting script that can be run automatically and produces a beautiful PDF report every day can be priceless. A large organisation I worked for in the past used this very well to monitor their Amazon AWS expenditure.5 A team of over fifty data scientists worked on a range of EC2 instances, and runaway spending – from provisioning instances that were too big, leaving instances on, and data transfer charges resulting from misconfigured instances6 – was rampant. So the client wanted daily, weekly, monthly and 10-day rolling usage nicely plotted in a report, by user, highlighting people who would go on the naughty list. This was very well accomplished by an RMarkdown template that was ‘knit’ every day at 0600 and uploaded as an HTML file onto an internal server, so that every user could see who’s been naughty and who’s been nice. EC2 usage costs went down by almost 30% in a few weeks, and that was without having to dismember anyone!7
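
The mechanics of such a scheduled report are simple enough to sketch. Assuming an RMarkdown template at src/reporting/usage_report.Rmd – an illustrative path, not a prescription – the daily job boils down to a single render call that a scheduler such as cron can run at 0600:

# Render the daily usage report; a scheduler (e.g. cron) runs this script every morning
rmarkdown::render(
  input       = file.path("src", "reporting", "usage_report.Rmd"),
  output_file = paste0("usage_report_", Sys.Date(), ".html"),
  output_dir  = file.path("output", "reports")
)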

Probably the only structural rule to keep in mind is to keep reports and reporting code separate. Reports are client products, reporting code is a work product and therefore should reside in src/.

Requirements and general settings

I am, in general, not a huge fan of outright loading whole packages to begin with. Too many users of R don’t realise that

  • you do not need to attach (library(package)) a package in order to use a function from it – as long as the package is available to R, you can simply call the function as package::function(arg1, arg2, ...), and
  • importing a package using library(package) puts every single function from that package into the namespace, masking by default any previously attached functions of the same name. This means that in order to deterministically know what any given symbol means, you would have to know, at all times, the order of package imports. Needless to say, there is enough to keep in one’s mind when coding in R without having to worry about this as well.

However, some packages might be useful to import, and sometimes it’s useful to have an initialisation script. This may be the case in three particular scenarios:

  • You need a particular locale setting, or a particularly crucial environment setting.
  • It’s your own package and you know you’re not going to shadow already existing symbols.
  • You are not using packrat or some other package management solution, and definitely need to ensure some packages are installed, but prefer not to put the clunky install-if-not-present code in every single thing.

In these cases, it’s sensible to have a file you would source before every top-level script – in an act of shameless thievery from Python, I tend to call this requirements.R, and it includes some fundamental settings I like to rely on, such as setting the locale appropriately. It also includes a CRAN install check script, although I would very much advise the use of Packrat over it, since the simple check is not version-sensitive.
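
A requirements.R along these lines might look like the following sketch – the locale and the package list are examples, not recommendations:

# requirements.R: sourced at the top of every top-level script

# Fundamental settings, e.g. a consistent locale (locale names differ across platforms)
Sys.setlocale("LC_ALL", "en_GB.UTF-8")

# Crude install-if-not-present check -- note that this is not version-sensitive,
# which is why Packrat is preferable where reproducibility matters
required_packages <- c("data.table", "ggplot2", "jsonlite")
missing_packages  <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}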

Themes, house style and other settings

It is common, in addition to all this, to keep some general settings. If your institution has a ‘house style’ for ggplot2 (for instance, a ggthemr file), this could be part of your project’s config. But where does this best go?

It would normally be perfectly fine to keep your settings in a config.R file at root level, but a config/ folder is much preferred as it prevents clutter if you derive any of your configurations from a git submodule. I’m a big fan of keeping house styles and other things intended to give a shared appearance to code and outputs (e.g. linting rules, text editor settings, map themes) in separate – and very, very well managed! – repos, as this ensures consistency across the board over time. As a result, most of my projects do have a config folder instead of a single configuration file.

It is paramount to separate project configuration and runtime configuration:

  • Project configuration pertains to the project itself, its outputs, schemes, the whole nine yards. For instance, the paper size to use for generated LaTeX documents would normally be a project configuration item. Your project configuration belongs in your config/ folder.
  • Runtime configuration pertains to parameters that relate to individual runs. In general, you should aspire to have as few of these, if any, as possible – and if you do, you should keep them as environment variables. But if you do decide to keep them as a file, it’s generally a good idea to keep them at the top level, and store them not as R files but as e.g. JSON files. There are a range of tools that can programmatically edit and change these file formats, while changing R files programmatically is fraught with difficulties.

Keeping runtime configuration editable

A few years ago, I worked on a viral forecasting tool where a range of model parameters to build the forecast from were hardcoded as R variables in a runtime configuration file. It was eventually decided to create a Python-based web interface on top of it, which would allow users to see the results as a dashboard (reading from a database where forecast results would be written) and make adjustments to some of the model parameters. The problem was, that’s really not easy to do with variables in an R file.

On the other hand, Python can easily read a JSON file into memory, change values as requested and export them onto the file system. So instead of that, the web interface would store the parameters in a JSON file, from which R would then read them and execute accordingly. Worked like a charm. Bottom line – configurations are data, and using code to store data is bad form.
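
A sketch of that sort of arrangement – the file name and the parameters are made up for illustration, and the jsonlite package is assumed to be available:

# params.json, maintained by the web interface:
# { "r0": 1.4, "forecast_horizon_days": 28 }

params <- jsonlite::fromJSON("params.json")

# The R side simply consumes the values...
forecast_horizon <- params$forecast_horizon_days

# ...and anything R needs to report back can be written out the same way
jsonlite::write_json(list(last_run = as.character(Sys.time())),
                     "last_run.json", auto_unbox = TRUE)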

Dirty little secrets

Everybody has secrets. In all likelihood, your project is no different: passwords, API keys, database credentials, the works. The first rule of this, of course, is never hardcode credentials in code. But you will need to work out how to make your project work, including via version control, while also not divulging credentials to the world at large. My preferred solutions, in order of preference, are:

  1. the keyring package, which interacts with OS X’s keychain, Windows’s Credential Store and the Secret Service API on Linux (where supported),
  2. using environment variables,
  3. using a secrets file that is .gitignored,
  4. using a config file that’s .gitignored,
  5. prompting the user.

Let’s take these – except the last one, which you should consider only as a measure of desperation, as it relies on RStudio and your code should aspire to run without it – in turn.

Using keyring

keyring is an R package that interfaces with the operating system’s keychain management solution, and works without any additional software on OS X and Windows.8 Using keyring is delightfully simple: it conceives of an individual key as belonging to a keyring and identified by a service. By reference to the service, it can then be retrieved easily once the user has authenticated to access the keychain. It has two drawbacks to be aware of:

  • It’s an interactive solution (it has to get access permission for the keychain), so if what you’re after is R code that runs quietly without any intervention, this is not your best bet.
  • A key can only contain a username and a password, so it cannot store more complex credentials, such as 4-tuple secrets (e.g. in OAuth, where you may have a consumer and a publisher key and secret each). In that case, you could split them into separate keyring keys.

However, for most interactive purposes, keyring works fine. This includes single-item secrets, e.g. API keys, where you can use some junk as your username and hold only on to the password.

By default, the operating system’s ‘main’ keyring is used, but you’re welcome to create a new one for your project. Note that users may be prompted for a keychain password at call time, and it’s helpful if they know what’s going on, so be sure you document your keyring calls well.

To set a key, simply call keyring::key_set(service = "my_awesome_service", username = "my_awesome_user"). This will launch a dialogue using the host OS’s keychain handler to request authentication to access the relevant keychain (in this case, the default keyring, as none is specified), and you can then retrieve

  • the username: using keyring::key_list("my_awesome_service")[1,2], and
  • the password: using keyring::key_get("my_awesome_service").
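
Put together – the service and user names below are placeholders – the whole flow is only a few lines:

# Store a credential; the OS keychain handler prompts for the password
keyring::key_set(service = "my_awesome_service", username = "my_awesome_user")

# Retrieve it later
api_user     <- keyring::key_list("my_awesome_service")[1, 2]
api_password <- keyring::key_get("my_awesome_service", username = api_user)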

Using environment variables

Using environment variables to hold certain secrets has become extremely popular, especially for Dockerised implementations of R code, as envvars can be very easily set using Docker. The thing to remember about environment variables is that they’re ‘relatively private’: they’re not part of the codebase, so they will definitely not accidentally get committed to the VCS, but everyone who has access to the particular user session will be able to read them. This may be an issue when e.g. multiple people are sharing the ec2-user account on an EC2 instance. The other drawback of envvars is that if there’s a large number of them, setting them can be a pain. R has a little workaround for that: if you create an envfile called .Renviron in the working directory, its contents will be loaded into the environment at the start of the session. So for instance the following .Renviron file will bind an API key and a username:

api_username = "my_awesome_user"
api_key = "e19bb9e938e85e49037518a102860147"

So when you then call Sys.getenv("api_username"), you get the correct result. It’s worth keeping in mind that the .Renviron file is sourced once, and once only: at the start of the R session. Thus, obviously, changes made after that will not propagate into the session until it ends and a new session is started. It’s also rather clumsy to edit, although most APIs used to ini files will, with the occasional grumble, digest .Renvirons.

Needless to say, committing the .Renviron file to the VCS is what is sometimes referred to as making a chocolate fireman in the business, and is generally a bad idea.

Using a .gitignored config or secrets file

config is a package that allows you to keep a range of configuration settings outside your code, in a YAML file, then retrieve them. For instance, you can create a default configuration for an API:

default:
    my_awesome_api:
        url: 'https://awesome_api.internal'
        username: 'my_test_user'
        api_key: 'e19bb9e938e85e49037518a102860147'

From R, you could then access this using the config::get() function:

my_awesome_api_configuration <- config::get("my_awesome_api")

This would then allow you to e.g. refer to the URL as my_awesome_api_configuration$url, and the API key as my_awesome_api_configuration$api_key. As long as the configuration YAML file is kept out of the VCS, all is well. The problem is that not everything in such a configuration file is supposed to be secret. For instance, it makes sense for database access credentials to keep the parameters DBI::dbConnect() needs for a connection available to other users, but keep the password private. So .gitignoring a config file is not a good idea.

A somewhat better idea is a secrets file. This file can be safely .gitignored, because it definitely only contains secrets. As previously noted, definitely create it using a format that can be widely written (JSON, YAML).9 For reasons noted in the next subsection, the thing you should definitely not do is create a secrets file that consists of R variable assignments, however convenient an idea that may appear at first. Because…
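
As a sketch – the file name and key are illustrative, and jsonlite is assumed to be available – a JSON secrets file and its use might look like this:

# secrets.json (make sure this file is listed in .gitignore!):
# { "db_password": "correct-horse-battery-staple" }

# Read the secret where it is needed, rather than into the global scope
# (see the 'Whatever you do...' section below)
get_db_password <- function(path = "secrets.json") {
  jsonlite::fromJSON(path)$db_password
}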

Whatever you do…

One of the best ways to mess up is creating a fabulous way of keeping your secret credentials truly secret… then loading them into the global scope. Never, ever assign credentials. Ever.

You might have seen code like this:

dbuser <- Sys.getenv("dbuser")
dbpass <- Sys.getenv("dbpass")

conn <- DBI::dbConnect(odbc::odbc(), UID = dbuser, PWD = dbpass)
This will work perfectly, except once it’s done, it will leave the password and the user name, in unencrypted plaintext (!), in the global scope, accessible to any code. That’s not just extremely embarrassing if, say, your wife of ten years discovers that your database password is your World of Warcraft character’s first name, but also a potential security risk. Never put credentials into any environment if possible, and if it has to happen, at least make it happen within a function so that they don’t end up in the global scope. The correct way to do the above would be more akin to this:

create_db_connection <- function() {
    # Credentials are read inside the function call, so they never enter the global scope
    DBI::dbConnect(odbc::odbc(), UID = Sys.getenv("dbuser"), PWD = Sys.getenv("dbpass"))
}
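
Used like this, the credentials only ever exist inside the function call and never land in the global environment:

# The connection object is all that ends up in the calling scope
conn <- create_db_connection()
# ... work with the connection ...
DBI::dbDisconnect(conn)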

Concluding remarks

Structuring R projects is an art, not just a science. Many best practices are highly domain-specific, and learning them generally happens by trial and error, with the occasional pratfall. In many ways, it’s the bellwether of an R developer’s skill trajectory, because it shows whether they possess the tenacity and endurance it takes to do meticulous, fine and often rather boring work in pursuance of future success – or at the very least, an easier time debugging things in the future. Studies show that one of the greatest predictors of success in life is being able to tolerate deferred gratification, and structuring R projects is a pure exercise in that discipline.

At the same time, a well-executed structure can save valuable developer time, prevent errors and allow data scientists to focus on the data rather than debugging, trying to find where that damn snippet of code is, or scratching their head trying to figure out what a particularly obscurely named function does. What might feel like an utter waste of time has enormous potential to create value for the individual, the team and the organisation alike.

I’m sure there are many aspects of structuring R projects that I have omitted or ignored – in many ways, it is my own experiences that inform and motivate these commentaries on R. Some of these observations are echoed by many authors, others diverge greatly from what’s commonly held wisdom. As with all concepts in development, I encourage you to read widely, get to know as many different ideas about structuring R projects as possible, and synthesise your own style. As long as you keep in mind why structure matters and what its ultimate aims are, you will arrive at a form of order out of chaos that will be productive, collaborative and mutually useful not just for your own development but others’ work as well.

My last commentary on defensive programming in R has spawned a vivid and exciting debate on Reddit, and many have made extremely insightful comments there. I’m deeply grateful for all who have contributed there. I hope you will also consider posting your observations in the comment section below. That way, comments will remain together with the original content.

References

1. As in, Adam Smith.
2. It took me years to figure out why. It turns out that I have ZF alpha-1 antitrypsin deficiency. As a consequence, even minimal exposure to small particulates and dust can set off violent coughing attacks and impair breathing for days. Symptoms tend to be worse in hot weather due to impaired connective tissue something-or-other.
3. That’s a joke. I don’t gut interns – they’re valuable resources, HR shuns dismembering your coworkers, it creates paperwork and I liked every intern I’ve ever worked with – but most importantly, once gutted like a fish, they are not going to learn anything new. I prefer gentle, structured discussions on the benefits of good package structure. Please respect your interns – they are the next generation, and you are probably one of their first examples of what software development/data science leadership looks like. The waves you set into motion will ripple through generations, well after you’re gone. You better set a good example.
4. Such a folder is often referred to as a ‘dropbox’, and the typical corresponding octal setting, 0422, guarantees that the R user will not accidentally overwrite data.
5. The organisation consented to me telling this story but requested anonymity, a request I honour whenever legally possible.
6. In case you’re unfamiliar with AWS: it’s a cloud service where elastic computing instances (EC2 instances) reside in ‘regions’, e.g. us-west-1a. There are (small but nonzero) charges for data transfer between regions. If you’re in one region but you configure the yum repo server of another region as your default, there will be costs, and, eventually, tears – provision ten instances with a few GBs’ worth of downloads, and there’ll be yelling. This is now more or less impossible to do except on purpose, but one must never underestimate what users are capable of from time to time!
7. Or so I’m told.
8. Linux users will need libsecret 0.16 or above, and sodium.
9. XML is acceptable if you’re threatened with waterboarding.

The Ten Rules of Defensive Programming in R

The topic of defensive programming in R is, admittedly, a little unusual. R, while fun and powerful, is not going to run defibrillators, nuclear power plants or spacecraft. In fact, much – if not most! – R code is actually executed interactively, where small glitches don’t really matter. But where R code is integrated into a pipeline, runs autonomously or is embedded into a larger analytical solution, writing code that fails well is going to be crucial. So below, I have collected my top ten principles of defensive programming in R. I have done so with an eye to users who do not come from the life critical systems community and might not have encountered defensive programming before, so some of these rules apply to all languages.

What is defensive programming?

The idea of defensive programming is not to write code that never fails. That’s an impossible aspiration. Rather, the fundamental idea is to write code that fails well. To me, ‘failing well’ means five things:

  1. Fail fast: your code should ensure all criteria are met before it embarks upon operations, especially if those are computationally expensive or might irreversibly affect data.
  2. Fail safe: where there is a failure, your code should ensure that it relinquishes all locks, does not acquire any new ones, does not write files, and so on.
  3. Fail conspicuously: when something is broken, it should return a very clear error message, and give as much information as possible to help unbreak it.
  4. Fail appropriately: failure should have appropriate effects. For every developer, it’s a judgment call to decide whether a particular issue should be a debug/info item, a warning or an error (which by definition means halting execution). Failures should be handled appropriately.
  5. Fail creatively: not everything needs to be a failure. It is perfectly legitimate to handle problems. One example is repeating an HTTP request that has timed out (see the sketch after this list): there’s no need to immediately error out, because quite frankly, that sort of stuff happens. Equally, it’s legitimate to look for a parameter, then check for a configuration file if none was provided, and finally try checking the arguments with which the code was invoked before raising an error.1
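
A minimal sketch of that last point – the URL, attempt count and timeout are illustrative, and the httr package is assumed to be available:

# Retry a flaky HTTP request a few times before conceding defeat
get_with_retry <- function(url, max_attempts = 3, wait_seconds = 2) {
  for (attempt in seq_len(max_attempts)) {
    response <- tryCatch(
      httr::GET(url, httr::timeout(5)),
      error = function(e) e
    )
    if (!inherits(response, "error")) {
      return(response)   # success -- no need to fail at all
    }
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_attempts, conditionMessage(response)))
    Sys.sleep(wait_seconds)
  }
  stop("Request to ", url, " failed after ", max_attempts, " attempts.")
}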

And so, without further ado, here are the ten ways I implement these in my day-to-day R coding practice – and I encourage you to do so yourself. You will thank yourself for it.

The Ten Commandments of Defensive Programming in R

  1. Document your code.
  2. In God we trust, everyone else we verify.
  3. Keep functions short and sweet.
  4. Refer to external functions explicitly.
  5. Don’t use require() to import packages into the namespace.
  6. Aggressively manage package and version dependencies.
  7. Use a consistent style and automated code quality tools.
  8. Everything is a package.
  9. Power in names.
  10. Know the rules and their rationale, so that you know when to break them.

ONE: Document your code.

It’s a little surprising to even see this – I mean, shouldn’t you do this stuff anyway? Yes, you should, except some people think that because so much of R is done in the interpreter anyway, rules do not apply to them. Wrong! They very much do.

A few months ago, I saw some code written by a mentee of mine. It was infinitely long – over 250 standard lines! – had half a dozen required arguments and did everything under the sun. This is, of course, as we’ll discuss later, a bad idea, but let’s put that aside. The problem is, I had no idea what the function was doing! After about half an hour of diligent row-by-row analysis, I figured it out, but that could have been half an hour spent doing something more enjoyable, such as a root canal without anaesthetic while listening to Nickelback. My friend could have saved me quite some hair-tearing by quite simply documenting his code. In R, the standard for documenting code is called roxygen2; it’s got a great parser that outputs the beautiful LaTeX docs you probably (hopefully!) have encountered when looking up a package’s documentation, and it’s described in quite a bit of detail by Hadley Wickham. What more could you wish for?

Oh. An example. Yeah, that’d be useful. We’ll be documenting a fairly simple function, which calculates the Hamming distance between two strings of equal length (eventually, we’ll also want it to throw something unpleasant in our face if they are not – but that’s for a later date). Quick recap: the Hamming distance is the number of characters that do not match among two strings. Or mathematically put,

H(s, t) = \sum_{k=1}^{\mathcal{l}(s)} D(s_k, t_k) \mid \mathcal{l}(s) = \mathcal{l}(t)

where \mathcal{l}() is the length function and D(p, q) is the dissimilarity function, which returns 1 if two letters are not identical and 0 otherwise.

So, our function would look like this:

hamming <- function(s1, s2) {
  s1 <- strsplit(s1, "")[[1]]
  s2 <- strsplit(s2, "")[[1]]
  
  return(sum(s1 != s2))
}

Not bad, and pretty evident to a seasoned R user, but it would still be a good idea to point out a thing or two. One of these would be that the result of this code will be inaccurate (technically) if the two strings are of different lengths (we could, and will, test for that, but that’s for a later date). The Hamming distance is defined only for equal-length strings, and so it would be good if the user knew what they have to do – and what they’re going to get. Upon pressing Ctrl/Cmd+Shift+Alt+R, RStudio helpfully whips us up a nice little roxygen2 skeleton:

#' Title
#'
#' @param s1 
#' @param s2 
#'
#' @return
#' @export
#'
#' @examples
hamming <- function(s1, s2) {
  s1 <- strsplit(s1, "")[[1]]
  s2 <- strsplit(s2, "")[[1]]
  
  return(sum(s1 != s2))
}

So, let’s populate it! Most of the fields are fairly self-explanatory. roxygen2, unlike JavaDoc or RST-based Python documentation, does not require formal specification of types – it’s all free text. Also, since it will be parsed into LaTeX someday, you can go wild. A few things deserve mention.

  • You can document multiple parameters. Since s1 and s2 are both going to be strings, you can simply write @param s1,s2 The strings to be compared.
  • Use \code{...} to typeset something as fixed-width.
  • To create links in the documentation, you can use \url{https://chrisvoncsefalvay.com} to link to a URL, \code{\link{someotherfunction}} to refer to the function someotherfunction in the same package, and \code{\link[adifferentpackage]{someotherfunction}} to refer to the function someotherfunction in the adifferentpackage package. Where your function has necessary dependencies outside the current script or the base packages, it is prudent to note them here.
  • You can use \seealso{} to refer to other links or other functions, in this package or another, worth checking out.
  • Anything you put under the examples will be executed as part of testing and building the documentation. If your intention is to give an idea of what the code looks like in practice, and you don’t want the result or even the side effects, you can surround your example with a \dontrun{...} environment.
  • You can draw examples from a file. In this case, you use @example instead of @examples, and specify the path, relative to the file in which the documentation is, to the script: @example docs/examples/hamming.R would be such a directive.
  • What’s that @export thing at the end? Quite simply, it tells roxygen2 to export the function to the NAMESPACE file, making it accessible for reference by other documentation files (that’s how, when you use \code{\link[somepackage]{thingamabob}}, roxygen2 knows which package to link to).

With that in mind, here’s what a decent documentation to our Hamming distance function would look like that would pass muster from a defensive programming perspective:

#' Hamming distance
#'
#' Calculates the Hamming distance between two strings of equal length.
#'
#' @param s1,s2 The strings to be compared.
#'
#' @return The Hamming distance between the two strings \code{s1} and \code{s2}, provided as an integer.
#'
#' @section Warning:
#' 
#' For a Hamming distance calculation, the input strings must be of equal length. This code does NOT reject input strings of different lengths. 
#'
#' @examples
#' hamming("AAGAGTGTCGGCATACGTGTA", "AAGAGCGTCGGCATACGTGTA")  
#'  
#' @export
hamming <- function(s1, s2) {
  s1 <- strsplit(s1, "")[[1]]
  s2 <- strsplit(s2, "")[[1]]
  
  return(sum(s1 != s2))
}

The .Rd file generated from the hamming() function’s roxygen2 docstring: an intermediary format, resembling LaTeX, from which R can build a multitude of documentation outputs

This little example shows all that a good documentation does: it provides what to supply the function with and in what type, it provides what it will spit out and in what format, and adequately warns of what is not being checked. It’s always better to check input types, but warnings go a long way.2 From this file, R generates an .Rd file, which is basically a LaTeX file that it can parse into various forms of documentation (see the caption above). In the end, it yields the documentation below, with the adequate warning – a win for defensive programming!

The finished documentation, rendered as it would be in an online documentation pane in RStudio

TWO: In God we trust, everyone else we verify.

In the above example, we have taken the user at face value: we assumed that his inputs will be of equal length, and we assumed they will be strings. But because this is a post on defensive programming, we are going to be suspicious, and not trust our user. So let’s make sure we fail early and check what the user supplies us with. In many programming languages, you would be using various assertions (e.g. the assert keyword in Python), but all R has, as far as built-ins are concerned, is stopifnot(). stopifnot() does as the name suggests: if the condition is not met, execution stops with an error message. However, on the whole, it’s fairly clunky, and it most definitely should not be used to check for user inputs. For that, there are three tactics worth considering.

  1. assertthat is a package by Hadley Wickham (who else!), which implements a range of assert clauses. Most importantly, unlike stopifnot(), assertthat::assert_that() does a decent job at producing an interpretable error message. Consider our previous Hamming distance example: instead of gracelessly falling on its face, a test using

assert_that(length(s1) == length(s2))

     would politely inform us that length(s1) not equal to length(s2). That’s worth it for the borderline Canadian politeness alone.
  2. Consider the severity of the user input failure. Can it be worked around? For instance, a function requiring an integer may, if it is given a float, try to coerce it to an integer. If you opt for this solution, make sure that you 1) always issue a warning, and 2) always allow the user to specify to run the function in ‘strict’ mode (typically by setting the strict parameter to TRUE), which will raise a fatal error rather than try to logic its way out of this pickle.
  3. Finally, make sure that it’s your code that fails, not the system code. Users should have relatively little insight into the internals of the system. And so, if at some point there’ll be a division by an argument foo, you should test whether foo == 0 at the outset and inform the user that foo cannot be zero. By the time the division operation is performed, the variable might have been renamed baz, and the user will not get much actionable intelligence out of the fact that ‘division by zero’ has occurred at some point, and baz was the culprit. Just test early for known incompatibilities, and stop further execution. The same goes, of course, for potentially malicious code.

In general, your code should be strict as to what it accepts, and you should not be afraid to reject anything that doesn’t look like what you’re looking for. Consider for this not only types but also values, e.g. if the value provided for a timeout in minutes is somewhere north of the lifetime of the universe, you should politely reject such an argument – with a good explanation, of course.
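
Applying this to the Hamming example – a sketch only, assuming assertthat is installed; the strict argument follows the ‘strict mode’ suggestion above:

hamming <- function(s1, s2, strict = TRUE) {
  # Fail fast: both inputs must be single character strings
  assertthat::assert_that(is.character(s1), length(s1) == 1,
                          is.character(s2), length(s2) == 1)

  if (nchar(s1) != nchar(s2)) {
    if (strict) {
      stop("s1 and s2 must be of equal length for a Hamming distance.")
    }
    warning("s1 and s2 are of unequal length; comparing up to the shorter string.")
  }

  s1 <- strsplit(s1, "")[[1]]
  s2 <- strsplit(s2, "")[[1]]
  n  <- min(length(s1), length(s2))
  sum(s1[seq_len(n)] != s2[seq_len(n)])
}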

Update: After posting this on Reddit, u/BillWeld pointed out a great idiom for checking user inputs that’s most definitely worth reposting here:

f <- function(a, b, c)
{
    if (getOption("warn") > 0) {
        stopifnot(
            is.numeric(a),
            is.vector(a),
            length(a) == 1,
            is.finite(a),
            a > 0,
            is.character(b),
            # Other requirements go here
            )
    }

    # The main body of the function goes here
}

I find this a great and elegant idiom, although it is your call, as the programmer, to decide which deviations and what degree of incompatibility should cause the function to fail as opposed to merely emit a warning.

THREE: Keep functions short and sweet.

Rule #4 of the Power of Ten states

No function should be longer than what can be printed on a single sheet of paper in a standard reference format with one line per statement and one line per declaration. Typically, this means no more than about 60 lines of code per function.

Rationale: Each function should be a logical unit in the code that is understandable and verifiable as a unit. It is much harder to understand a logical unit that spans multiple screens on a computer display or multiple pages when printed. Excessively long functions are often a sign of poorly structured code.

In practice, with larger screen sizes and higher resolutions, much more than a measly hundred lines fit on a single screen. However, since many users view R code in a quarter-screen window in RStudio, an appropriate figure would be about 60-80 lines. Note that this does not include comments and whitespaces, nor does it penalise indentation styles (functions, conditionals, etc.).

Functions should represent a logical unity. Therefore, if a function needs to be split for compliance with this rule, you should do so in a manner that creates logical units. Typically, one good way is to split functions by the object they act on.

FOUR: Refer to external functions explicitly.

In R, there are two ways to invoke a function, yet most people don’t tend to be aware of this. Even in many textbooks, the library(package) function is treated as quintessentially analogous to, say, import in Python. This is a fundamental misunderstanding.

In R, packages do not need to be imported in order to be able to invoke their functions, and that’s not what the library() function does anyway. library() attaches a package to the current namespace.

What does this mean? Consider the following example. The foreign package is one of my favourite packages. In my day-to-day practice, I get data from all sorts of environments, and foreign helps me import them. One of its functions, read.epiinfo(), is particularly useful as it imports data from CDC’s free EpiInfo toolkit. Assuming that foreign is in a library accessible to my instance of R,4 I can invoke the read.epiinfo() function in two ways:

  • I can directly invoke the function using its canonical name, of the form package::function() – in this case, foreign::read.epiinfo().
  • Alternatively, I can attach the entire foreign package to the namespace of the current session using library(foreign). This has three effects, of which the first tends to be well-known, the second less so and the third altogether ignored.
    1. Functions in foreign will be directly available. Regardless of the fact that it came from a different package, you will be able to invoke it the same way you invoke, say, a function defined in the same script, by simply calling read.epiinfo().
    2. If a function of the same name as any function in foreign was already in the namespace, that function will be ‘shadowed’, i.e. masked: the namespace will always resolve the name to the most recently attached function, and the older function will only be available by explicit invocation.
    3. When you invoke a function from the namespace, it will not be perfectly clear from a mere reading of the code what the function actually is, or where it comes from. Rather, the user or maintainer will have to guess what a given name will represent in the namespace at the time the code is running the particular line.

Controversially, my suggestion is

  • to eschew the use of library() altogether, and
  • to always explicitly invoke functions outside those that are in the namespace at startup.

This is not common advice, and many will disagree. That’s fine. Not all code needs to be safety-critical, and importing ggplot2 with library() for a simple plotting script is fine. But where you want code that’s easy to analyse, easy to read and can be reliably analysed as well, you want explicit invocations. Explicit invocations give you three main benefits:

  1. You will always know what code will be executed. filter may mean dplyr::filter, stats::filter, and so on, whereas specifically invoking dplyr::filter is unambiguous. You know what the code will be (simply invoking dplyr::filter without parentheses or arguments returns the source), and you know what that code is going to do (see the short sketch after this list).
  2. Your code will be more predictable. When someone – or something – analyses your code, they will not have to spend so much time trying to work out what a particular identifier refers to within the namespace at that point in the code.
  3. There is no risk that as a ‘side effect’ various other functions you seek to rely on will be removed from the namespace. In interactive coding, R usually warns you and lists all shadowed functions upon importing functions with the same name into the namespace using library(), but for code intended to be bulk executed, this issue has caused a lot of headache.
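
To make the filter example concrete – a small sketch, assuming dplyr is installed:

# After library(dplyr), a bare 'filter' is whatever was attached last;
# here it happens to be dplyr::filter, which masks stats::filter
library(dplyr)
six_cylinder <- filter(mtcars, cyl == 6)

# Explicit invocation needs no attaching and leaves no ambiguity
six_cylinder <- dplyr::filter(mtcars, cyl == 6)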

Obviously, all of this applies to require() as well, although, as the next rule explains, require() should generally not be used at all.

FIVE: Don’t use require() to import a package into the namespace.

Even seasoned R users sometimes forget the difference between library() and require(). The difference is quite simple: while both functions attempt to attach the package argument to the namespace,

  • require() returns FALSE if the import failed, while
  • library() simply loads the package and raises an error if the import failed.

Just about the only legitimate use for require() is writing an attach-or-install function. In any other case, as Yihui Xie points out, require() is almost definitely the wrong function to use.
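
For completeness, a sketch of that one legitimate use – and even this is usually better handled by a dependency manager such as Packrat:

attach_or_install <- function(pkg) {
  # require() returns FALSE rather than erroring if the package is unavailable
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

attach_or_install("ggplot2")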

SIX: Aggressively manage package and version dependencies.

Packrat is one of those packages that have changed what R is like – for the better. Packrat gives every project a package library, akin to a private /lib/ folder. This is not the place to document the sheer awesomeness of Packrat – you can do so yourself by doing the walkthrough of Packrat. But seriously, use it. Your coworkers will love you for it.

Equally important is to make sure that you specify the version of R that your code is written against. This is best accomplished on a higher level of configuration, however.

SEVEN: Use a consistent style and automated code quality tools.

This should be obvious – we’re programmers, which means we’re constitutionally lazy. If it can be solved by code faster than manually, then code it is! Two tools that help you with this are lintr and styler.

  • lintr is an amazingly widely supported (from RStudio through vim to Sublime Text 3, I hear a version for microwave ovens is in the works!) linter for R code. Linters improve code quality primarily by enforcing good coding practices rather than good style. One big perk of lintr is that it can be injected rather easily into the Travis CI workflow, which is a big deal for those who maintain multi-contributor projects and use Travis to keep the cats appropriately herded.
  • styler was initially designed to help code adhere to the Tidyverse Style Guide, which in my humble opinion is one of the best style guides that have ever existed for R. It can now take any custom style files and reformat your code, either as a function or straight from an RStudio add-in.

So use them.

EIGHT: Everything is a package.

Whether you’re writing R code for fun, profit, research or the creation of shareholder value, your coworkers and your clients – rightly! – expect a coherent piece of work product that has everything in one neat package, preferably version controlled. Sending around single R scripts might have been appropriate at some point in the mid-1990s, but it isn’t anymore. And so, your work product should always be structured like a package. As a minimum, this should include:

  1. A DESCRIPTION and NAMESPACE file.
  2. The source code, including comments.
  3. Where appropriate, data mappings and other ancillary data to implement the code. These go normally into the data/ folder. Where these are large, such as massive shape files, you might consider using Git LFS.
  4. Dependencies, preferably in a packrat repo.
  5. The documentation, helping users to understand the code and in particular, if the code is to be part of a pipeline, explaining how to interact with the API it exposes.
  6. Where the work product is an analysis rather than a bit of code intended to carry out a task, the analysis as vignettes.

To understand the notion of analyses as packages, two outstanding posts by Robert M. Flight are worth reading: part 1 explains the ‘why’ and part 2 explains the ‘how’. Robert’s work is getting a little long in the tooth, and packages like knitr have taken the place of vignettes as analytical outputs, but the principles remain the same. Inasmuch as it is possible, an analysis in R should be a self-contained package, with all the dependencies and data either linked or included. From the perspective of the user, all that should be left for them to do is to execute the analysis.

NINE: Power in names.

There are only two hard things in Computer Science: cache invalidation and naming things.

Phil Karlton

In general, R has a fair few idiosyncrasies in naming things. For starters, dots/periods . are perfectly permitted in variable names (and thus function names), whereas in most languages, the dot operator is a binary operator retrieving the first operand object’s method called the second operand:

a.b(args) = dot(a, b, args) = a_{Method: b}(args)

For instance, in Python, wallet.pay(arg1, arg2) means ‘invoke the method pay of the object wallet with the arguments arg1 and arg2’. In R, on the other hand, the dot is a character like any other, and therefore there is no special meaning attached to it – you can even have a variable containing multiple dots, or in fact a variable whose name consists entirely of dots5 – in R, .......... is a perfectly valid variable name. It is also a typical example of the fact that just because you can do something doesn’t mean you also should do so.

A few straightforward rules for variable names in R are worth abiding by:

  1. Above all, be consistent. That’s more important than whatever you choose.
  2. Some style guides, including Google’s R style guide, treat variables, functions and constants as different entities in respect of naming. This is, in my not-so-humble opinion, a blatant misunderstanding of the fact that functions are variables of the type function, and not some distinct breed of animal. Therefore, I recommend using a unitary schema for all variables, callable or not.
  3. In order of my preference, the following are legitimate options for naming:
    • Underscore separated: average_speed
    • Dot separated: average.speed
    • JavaScript style lower-case CamelCase: averageSpeed
  4. Things that don’t belong into identifiers: hyphens, non-alphanumeric characters, emojis (🤦🏼‍♂️) and other horrors.
  5. Where identifiers are hierarchical, it is better to represent them as hierarchical objects rather than assigning them to different variables. For example, instead of monthly_forecast_january, monthly_forecast_february and so on, it is better to have a list (R’s associative array) called forecasts in which the forecasts are keyed by month name, and can then be retrieved using the $ or [[ ]] accessors (see the sketch after this list). If your naming has half a dozen components, maybe it’s time to think about structuring your data better.
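
A sketch of what that looks like in practice – the values are, of course, made up:

# Instead of monthly_forecast_january, monthly_forecast_february, ...
forecasts <- list(
  january  = c(110, 112, 115),
  february = c(108, 109, 111)
)

forecasts$january         # retrieve by name
forecasts[["february"]]   # equivalent programmatic access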

Finally, in some cases, the same data may be represented by multiple formats – for instance, data about productivity is first imported as a text file, and then converted into a data frame. In such cases, Hungarian notation may be legitimate, e.g. txt_productivity or productivity.txt vs df_productivity or productivity.df. This is more or less the only case in which Hungarian notation is appropriate.6

And while we’re at variables: never, ever, ever use = to assign. Every time you do it, a kitten dies of sadness.

For file naming, some sensible rules have served me well, and will be hopefully equally useful for you:

  1. File names should be descriptive, but no longer than 63 characters.
  2. File names should be all lower case, separated by underscores, and end in .R. That’s a capital R. Not a lower-case R. EveR.
  3. Where there is a ‘head’ script that sequentially invokes (sources) a number of subsidiary scripts, it is common for the head script to be called 00-.R, and the rest given a sequential number corresponding to the order in which they are sourced and a descriptive name, e.g. 01-load_data_from_db.R, 02-transform_data_and_anonymise_records.R and so on.
  4. Where there is a core script, but it does not invoke other files sequentially, it is common for the core script to be called 00-base.R or main.R. As long as it’s somewhere made clear to the user which file to execute, all is fair.
  5. The injunction against emojis and other nonsense holds for file names, too.

TEN: Know the rules and their rationale, so that you know when to break them.

It’s important to understand why style rules and defensive programming principles exist. How else would we know which rules we can break, and when?

The reality is that there are no rules, in any field, that do not ever permit of exceptions. And defensive programming rules are no exception. Rules are tools that help us get our work done better and more reliably, not some abstract holy scripture. With that in mind, when can you ignore these rules?

  • You can, of course, always ignore these rules if you’re working on your own, and most of your work is interactive. You’re going to screw yourself over, but that’s your right and privilege.
  • Adhering to common standards is more important than doing what some dude on the internet (i.e. my good self) thinks is good R coding practice. Coherence and consistency are crucial, and you’ll have to stick to your team’s style over your own ideas. You can propose to change those rules, you can suggest that they be altogether redrafted, and link them to this page. But don’t go out and start following a style of your own just because you think it’s better (even if you’re right).
  • It’s always a good idea to appoint a suitable individual – with lots of experience and little ego, ideally! – as code quality standards coordinator (CQSC). They will then centrally coordinate adherence to standards, train on defensive coding practices, review operational adherence, manage tools and configurations and onboard new people.
  • Equally, having an ‘editor settings repo’ is pretty useful. This should support, at the very least, RStudio, lintr and styler.
  • Some prefer to have the style guide as a GitHub or Confluence wiki – I generally advise against that, as that cannot be tracked and versioned as well as, say, a bunch of RST files that are collated together using Sphinx, or some RMarkdown files that are rendered automatically upon update using a GitHub webhook.

Conclusion

In the end, defensive programming may not be critical for you at all. You may never need to use it, and even if you do, the chances that, as an R programmer, your code will have to live up to defensive programming rules and requirements are much lower than, say, for an embedded programmer. But algorithms, including those written in R, create an increasing amount of data that is used to support major decisions. What your code spits out may decide someone’s career, whether someone can get a loan, whether they can be insured or whether their health insurance will be dropped. It may even decide a whole company’s fate. This is an inescapable consequence of the algorithmic world we now live in.

A few years ago, at a seminar on coding to FDA CDRH standards, the instructor finished by giving us his overriding rule: in the end, always code as if your life, or that of your loved ones, depended on the code you write (because it very well may someday!). This may sound dramatic in the context of R, which is primarily a statistical programming language, but the point stands: code has real-life consequences, and we owe it to those whose lives, careers or livelihoods depend on our code to give them the kind of code that we wish to rely on: well-tested, reliable and stable.

References

1. One crucial element here: keep in mind the order of parameters – because the explicit should override the implied, your code should look at the arguments first, then the environment variables, then the configuration file!
2. Eagle-eyed readers might spot the mutation in the example. This is indeed the T310C mutation on the TNFRSF13B gene (17p11.2), which causes autosomal dominant type 2 Common Variable Immune Deficiency, a condition I have, but manage through regular IVIG, prophylactic antibiotics and aggressive monitoring.
3. Holzmann, G.J. (2006). The Power of 10: Rules for developing safety-critical code. IEEE Computer 39(6):95-99.
4. R stores packages in libraries, which are basically folders where the packages reside. There are global packages as well as user-specific packages. For a useful overview of packages and package installation strategies, read this summary by Nathan Stephens.
5. The exception is ..., which is not a valid variable name as it collides with the splat operator.
6. In fact, the necessity of even this case is quite arguable. A much better practice would be simply importing the text as productivity, then transform it into a data frame, if there’s no need for both versions to continue coexisting.

Bayesian reasoning in clinical diagnostics: a primer.

We know, from the source of eternal wisdom that is Saturday Morning Breakfast Cereal, that insufficient math education is the basis of the entire Western economy.1 This makes Bayesian logic and reasoning about probabilities almost like a dark art, a well-kept secret that only a few seem to know (and it shouldn’t be… but that’s a different story). This weird-wonderful argument, reflecting a much-reiterated meme about vaccines and vaccine efficacy, is a good example:

The argument, here, in case you are not familiar with the latest in anti-vaccination fallacies, is that vaccines don’t work, and they have not reduced the incidence of vaccine-preventable diseases. Rather, if a person is vaccinated for, say, measles, then despite displaying clinical signs of measles, he will be considered to have a different disease, and therefore all disease statistics proving the efficacy of vaccines are wrong. Now, that’s clearly nonsense, but it highlights one interesting point, one that has a massive bearing on computational systems drawing conclusions from evidence: namely, the internal Bayesian logic of the diagnostic process.

Which, incidentally, is the most important thing that they didn’t teach you in school. Bayesian logic, that is. Shockingly, they don’t even teach much of it in medical school unless you do research, and even there it’s seen as a predictive method, not a tool to make sense of the analytical process. Which is a pity. The reason why idiotic arguments like the above by @Cattlechildren proliferate is that physicians have been taught how to diagnose well, but never how to explain and reason about the diagnostic process. This was true for the generations before me, and is more or less true for those still in med school today. What is often covered up with nebulous concepts like ‘clinical experience’ is in fact solid Bayesian reasoning. Knowing the mathematical fundamentals of the thought process you use day to day – the process that helps you make the right decisions in the clinic – helps you reason about it, find weak points, answer challenges and respond to them. For this reason, my highest hope is that as many MDs, epidemiologists, med students, RNs, NPs and other clinical decision-makers as possible will engage with this topic, even if it’s a little long. I promise, it’s worth it.

Some basic ideas about probability

In probability, an event, usually denoted with a capital and customarily starting at A (I have no idea why, as it makes things only more confusing!), is any outcome or incidence that we’re interested in – as long as it’s binary, that is, it either happens or doesn’t happen, and discrete, that is, there’s a clear definition for it, so that we can decide if it’s happened or not – no half-way events for now.2 In other words, an event can’t happen and not happen at the same time. Or, to get used to the notation of conditionality, p(A \mid \neg A) = 0 – a thing cannot be both true and false.3

Now, we may be interested in how likely it is for an event to happen if another event happens: how likely is A if B holds true? This is denoted as p(A|B), and for now, the most important thing to keep in mind about it is that it is not necessarily the same as p(B|A)!4

Bayesian logic deals with the notion of conditional probabilities – in other words, the probability of one event, given another.5 It is one of the most widely misunderstood parts of probability, yet it is crucial to understanding the way we reason about things.

Just to understand how important this is, let us consider a classic example.


Case study 1: speed cameras

Your local authority is broke. And so, it does what local authorities do when they’re broke – no, not play poker with the borough credit card: set up a bunch of speed cameras and fine drivers. Over this particular stretch of road, the speed limit is 60mph.

According to the manufacturer, the speed cameras are very sensitive, but not very specific. In other words, they never falsely indicate that a driver was below the speed limit, but they may falsely indicate that the driver was above it, in about 3% of the cases (the false positive rate).

One morning, you’re greeted by a message in your postbox, notifying you that you’ve driven too fast and fining you a rather respectable amount of cash. What is the probability that you have indeed driven too fast?

You may feel inclined to blurt out 97%. That, in fact, is wrong.


Explanation

It’s rather counter-intuitive at first, until we consider the problem in formal terms. We know the probability p(A \mid \neg B), that is, the probability of being snapped (A) even though you were not speeding (\neg B). But what the question asks is the likelihood that you were, in fact, speeding (B) given the fact that you were snapped (A). And as we have learned, the conditional probability operator is not commutative, that is, p(A|B) is not necessarily the same as p(B|A).

Why is that the case? Because base rates matter. In other words, the probabilities of A and B, in and of themselves, are material. Consider, for a moment, the unlikely scenario of living in that mythical wonderland of law-abiding citizens where nobody speeds. Then, it does not matter how many drivers are snapped – all of them are false positives, and thus p(B|A), the probability of speeding (B) given that one got snapped by a speed camera (A), is actually zero.

In other words, if we want to reverse the conditional operator, we need to make allowances for the ‘base frequency’, the ordinary frequency with which each event occurs on its own. To overcome base frequency neglect,6 we have a mathematical tool, courtesy of the good Revd. Thomas Bayes, who sayeth that, verily,

p(B \mid A) = \frac{p(A \mid B) p(B)}{p(A)}

Or, in words: if you want to reverse the probabilities, you will have to take the base rates of each event into account. If what we know is the likelihood of getting snapped even though you were not speeding, and what we’re interested in is the likelihood that someone getting snapped is indeed speeding, we’ll need to know a few more things.


Case study 1: Speed cameras – continued

  • We know that the speed cameras have a Type II (false negative) error rate of zero – in other words, if you are speeding (B), you are guaranteed to get snapped (A) – thus, p(A \mid B) is 1.
  • We also know from the Highway Authority, who were using a different and more accurate measurement system, that approximately one in 1,000 drivers is speeding (p(B) = 0.001).
  • Finally, we know that of 1,000 drivers, 31 will be snapped – the one speeder plus roughly 30 false positives (3% of the remaining 999) –, yielding p(A) = 0.031.

Putting that into our equation,

p(B|A) = \frac{p(A \mid B) p(B)}{p(A)} = \frac{1 \cdot 0.001}{0.031} = 0.032

In other words, the likelihood that we did indeed exceed the speed limit is just barely north of 3%. That’s a far cry from the ‘intuitive’ answer of 97% (quite accidentally, it’s almost exactly its complement).
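
If you prefer to sanity-check this kind of arithmetic in code, here is a minimal sketch in R using the numbers from the case study (the variable names are mine):

p_snap_given_speeding <- 1      # false negative rate is zero: every speeder gets snapped
p_speeding            <- 0.001  # base rate: 1 in 1,000 drivers speeds
p_snap                <- 0.031  # 31 in 1,000 drivers get snapped overall

p_speeding_given_snap <- p_snap_given_speeding * p_speeding / p_snap
p_speeding_given_snap
# [1] 0.03225806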


Diagnostics, probabilities and Bayesian logic

The procedure of medical diagnostics is ultimately a relatively simple algorithm:

  1. create a list of possibilities, however remote (the process of differential diagnostics),
  2. order them in order of likelihood,
  3. update priors as you run tests.7

From a statistical perspective, this is implemented as follows.

  1. We begin by running a number of tests, specifically m of them. It is assumed that the tests are independent from each other, i.e. the value of one does not affect the value of another. Let R_j denote the results of test j \leq m.
    1. For each test, we need to iterate over all our differentials D_{i \ldots n}, and determine the probability of each in light of the new evidence, i.e. p(D_i \mid R_j).
    2. So, let’s take the results of test j that yielded the results R_j, and the putative diagnosis D_i. What we’re interested in is p(D_i \mid R_j), that is, the probability of the putative diagnosis given the new evidence. Or, to use Bayesian lingo, we are updating our prior: we had a previous probability assigned to D_i, which may have been a uniform probability or some other probability, and we are now updating it – seeing how likely it is given the new evidence, getting what is referred to as a posterior.8
    3. To calculate the posterior P(D_i | R_j), we need to know three things – the sensitivity and specificity of the test j (I’ll call these S^+_j and S^-_j, respectively), the overall incidence of D_i,9 and the overall incidence of the particular result R_j.
    4. Plugging these variables into our beloved Bayesian formula, we get p(D_i \mid R_j) = \frac{p(R_j \mid D_i) p(D_i)}{p(R_j)}.
    5. We know that p(R_j \mid D_i), that is, the probability that someone will test a particular way if they do have the condition D_i, is connected to sensitivity and specificity: if R_j is supposed to be positive if the patient has D_i, then p(R_j \mid D_i) = S^+_j (sensitivity), whereas if the test is supposed to be negative if the patient has D_i, then p(R_j \mid D_i) = S^-_j (specificity).
    6. We also know, or are supposed to know, the overall incidence of D_i and the probability of a particular outcome, R_j. With that, we can update our prior for D_i \mid R_j.
  2. We iterate over each of the tests, updating the priors every time new evidence comes in.

This may sound daunting and highly mathematical, but in fact most physicians have this down to an innate skill, so much so that when I explained this to a group of FY2 doctors, they couldn’t believe it – until they thought about how they thought. And that’s a key issue here: thinking about the way we arrive at results is important, because they are the bedrock of what we need to make those results intelligible to others.
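
For the computationally inclined, here is a minimal sketch in R of the updating loop described above. The sensitivities, specificities and results are entirely made up, and a real implementation would, as noted in the footnotes, have to deal with tests that are not independent:

# Posterior for a single (assumed independent) test result, via Bayes' theorem
update_posterior <- function(prior, sensitivity, specificity, positive) {
  p_r_given_d    <- if (positive) sensitivity else 1 - sensitivity
  p_r_given_notd <- if (positive) 1 - specificity else specificity
  p_r <- p_r_given_d * prior + p_r_given_notd * (1 - prior)  # p(R_j) by total probability
  p_r_given_d * prior / p_r
}

prior <- 0.01                                     # hypothetical incidence of D_i
tests <- data.frame(sensitivity = c(0.90, 0.80),  # made-up test characteristics
                    specificity = c(0.95, 0.99),
                    positive    = c(TRUE, TRUE))  # both results came back positive

for (j in seq_len(nrow(tests))) {
  prior <- update_posterior(prior, tests$sensitivity[j],
                            tests$specificity[j], tests$positive[j])
}
prior  # the posterior for D_i after both results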


Case study 2: ATA testing for coeliac disease

For a worked example of this in the diagnosis of coeliac disease, check Notebook 1: ATA case study. It puts things in the context of sensitivity and specificity in medical testing, and is in many ways quite similar to the above example, except here, we’re working with a real-world test with real-world uncertainties.

There are several ways of testing for coeliac disease, an autoimmune disorder in which the body responds to gluten proteins (gliadins and glutenins) in wheats, wheat hybrids, barley, oats and rye. One diagnostic approach looks at genetic markers in the HLA-DQ (Human Leukocyte Antigen type DQ), part of the MHC (Major Histocompatibility Complex) Class II receptor system. Genetic testing for a particular haplotype of the HLA-DQ2 gene, called DQ2.5, can lead to a diagnosis in most patients. Unfortunately, it’s slow and expensive. Another test, a colonoscopic biopsy of the intestines, looks at the intestinal villi, short protrusions (about 1mm long) into the intestine, for tell-tale damage – but this test is unpleasant, possibly painful and costly.

So, a more frequent way is by looking for evidence of an autoantibody called anti-tissue transglutaminase antibody (ATA) – unrelated to this gene, sadly. ATA testing is cheap and cheerful, and relatively good, with a sensitivity (S^+_{ATA}) of 85% and specificity (S^-_{ATA}) of 97%.10 We also know the rough probability of a sample being from someone who actually has coeliac disease – for a referral lab, it’s about 1%.

Let’s consider the following case study. A patient gets tested for coeliac disease using the ATA test described above. Depending on whether the test is positive or negative, what are the chances she has coeliac disease?

Sensitivity and specificity trade-off for an ATA test given various values of true coeliac disease prevalence in the population.

If you’ve read the notebook, you know by now that the probability of having coeliac disease if testing positive is around 22%, or a little better than one-fifth. And from the visualisation to the left, you could see that small incremental improvements in specificity would yield a lot more increase in accuracy (marginal accuracy gain) than increases in sensitivity.
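
As a quick back-of-the-envelope check on that figure, here is the same Bayesian formula in R, using the sensitivity, specificity and referral-lab baseline quoted above:

sens <- 0.85   # S+ for the ATA test
spec <- 0.97   # S- for the ATA test
prev <- 0.01   # assumed referral lab baseline

ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
ppv
# [1] 0.2225131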

While quite simple, this is a good case study because it emphasises a few essential things about Bayesian reasoning:

  • Always know your baselines. In this case, we took a baseline of 1%, even though the average incidence of coeliac disease in the population is closer to about a quarter of that (0.25%). Why? Because we don’t spot-test people for coeliac disease. People who do get tested get tested because they exhibit symptoms that may or may not be coeliac disease, and by definition they have a higher prevalence11 of coeliac disease. The factor is, of course, entirely imaginary – you would, normally, need to know or have a way to figure out the true baseline values.
  • Use independent baselines. It is absolutely crucial to make sure that you do not get the baselines from your own measurement process. In this case, for instance, the incidence of coeliac disease should not be calculated by reference to your own lab’s number of positive tests divided by total tests. This merely allows for further proliferation of false positives and negatives, however minuscule their effect. A good way is to do follow-up studies, checking how many of the patients tested positive or negative for ATA were further tested using other methodologies, many of which may be more reliable, and calculate the proportion of actual cases coming through your door by reference to that.

Case study 3: Vaccines in differential diagnosis

This case is slightly different, as we are going to compare two different scenarios. Both concern D_{VPD}, a somewhat contrived vaccine-preventable illness. D_{VPD} produces a very particular symptom or symptom set, S, and produces this symptom or symptom set in every case, without fail.12 The question is – how does the vaccination status affect the differential diagnosis of two identical patients,13 presenting with the same symptoms S, one of whom is unvaccinated?

No. That’s not how this works. That’s not how ANY of this works. Nnnnope.

It has been a regrettably enduring trope of the anti-vaccination movement that because doctors believe vaccines work, they will not diagnose a patient with a vaccine-preventable disease (VPD), simply striking it off the differential diagnosis or substituting a different diagnosis for it.14 The reality is explored in this notebook, which compares two scenarios of the same condition, with two persons whose sole difference is vaccination status. That status makes a massive – about 7,800-fold – difference between the likelihood of the vaccinated and the unvaccinated person having the disease. The result is that an outcome 7,800 times less likely slides down the differential. As NZ paediatrician Dr Greenhouse (@greenhousemd) noted in the tweet, “it’s good medical care”. In the words of British economist John Maynard Keynes,15 “when the facts change, I change my mind”. And so do diagnosticians.

Simply put: it’s not an exclusion, or a fudging of data, or in any sensible way proof that “no vaccine in history has ever worked”. It’s quite simply a reflection of the reality that if a condition is almost 8,000 times less likely in a population, then, yes, other, more frequent conditions push ahead of it.


Lessons learned

Bayesian analysis of the diagnostic procedure does not merely bring increased clarity about what one is doing as a clinician. It also makes the full panoply of mathematical and logical reasoning available to investigate claims, objections and contentions – and, as in the case of the alleged non-diagnosis of vaccine-preventable diseases, to discard them.

The most powerful tool available to anyone who uses a process of structured reasoning – be it clinical reasoning in diagnostics, algorithmic analysis, detective work or intelligence analysis – is the ability to reason formally about one’s own toolkit of structured processes. It is my hope that if you’ve never thought about your clinical diagnostic process in these terms, you will now be able to see a new facet of it.

References

1.The basis of non-Western economies tends to be worse. That’s about as much as Western economies have going for them. See: Venezuela and the DPRK.
2.There’s a whole branch of probability that deals with continuous probabilities, but discrete probabilities are crazy enough for the time being.
3.Read: The probability of A given not-A is zero. A being any arbitrary event: the stock market crashing, the temperature tomorrow exceeding 30ºC, &c.
4.In other words, it may be the same, but that’s pure accident. Mathematically, they’re almost always different.
5.It’s tempting to assume that this implies causation, or that the second event must temporally succeed the first, but none of those are implied, and in fact only serve to confuse things more.
6.You will also hear this referred to as ‘base rate neglect’ or ‘base rate fallacy’. As an epidemiologist, ‘rate’ has a specific meaning for us – it generally means events over a span of time. It’s not a rate unless it’s necessarily over time. I know, we’re pedantic like that.
7.This presupposes that these tests are independent of each other, like observations of a random variable. They generally aren’t – for instance, we run the acute phase protein CRP, W/ESR (another acute phase marker) and a WBC count, but these are typically not independent from each other. In such cases, it’s legitimate to use B = B_1 \cap B_2 \cap \ \ldots \cap B_n or, as my preferred notation goes, B = \bigcap^n_{k=1} B_k. I know ‘updating’ is the core mantra of Bayesianism, but knowing what to update and knowing where to simply calculate the conjoint probability is what experts in Bayesian reasoning rake in the big bucks for.
8.Note that a posterior from this step can, upon more new evidence, become the prior in the next round – the prior for j may be the inferred probability p(D_i), but the prior for j + 1 is p(D_i \mid R_j), and so on. More about multiple observations later.
9.It’s important to note that this is not necessarily the population incidence. For instance, the overall incidence and thus the relevant D for EBOV (D_{EBOV}) is going to be different for a haemorrhagic fever referral lab in Kinshasa and a county hospital microbiology lab in Michigan.
10.Lock, R.J. et al. (1999). IgA anti-tissue transglutaminase as a diagnostic marker of gluten sensitive enteropathy. J Clin Pathol 52(4):274-7.
11.More epidemiopedantry: ‘incidence’ refers to new cases over time, ‘prevalence’ refers to cases at a moment in time.
12.This is, of course, unrealistic. I will do a walkthrough of an example of multiple symptoms that each have an association with the illness in a later post.
13.It’s assumed gender is irrelevant to this disease.
14.Presumably hoping that refusing to diagnose a patient with diphtheria and instead diagnosing them with a throat staph infection will somehow get the patient okay enough that nobody will notice the insanely prominent pseudomembrane…
15.Or not…

Automagic epi curve plotting: part I

As of 24 May 2018, the underlying data schema of the Github repo from which the epi curve plotter draws its data has changed. Therefore, a lot of the code had to be adjusted. The current code can be found here on Github. This also plots a classical epi curve.

One of the best resources during the 2013-16 West African Ebola outbreak was Caitlin Rivers’ Github repo, which was probably one of the best ways to stay up to date on the numbers. For the current outbreak, she has also set up a Github repo, with really frequent updates straight from the WHO’s DON data and the information from the DRC Ministry of Public Health (MdlS) mailing list.1 Using R, I have set up a simple script that I only have to run every time I want a pre-defined visualisation of the current situation. I usually do this on a remote RStudio server, which makes it quite easy for me to quickly generate the plots on the fly.

Obtaining the most recent data

Using the following little script, I grab the latest from the ebola-drc Github repo:

# Fetch most recent DRC data.
library(magrittr)
library(curl)
library(readr)
library(dplyr)

current_drc_data <- Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%                                 # timestamp in DTG format (see note 2)
  paste("raw_data/drc/", "drc-", ., ".csv", sep = "") %T>%   # build the file path; the tee pipe passes the path (not curl's result) onwards
  curl_fetch_disk("https://raw.githubusercontent.com/cmrivers/ebola_drc/master/drc/data.csv", .) %>%   # download the latest CSV to that path
  read_csv()                                                 # read the freshly saved file

This uses curl (the R implementation) to fetch the most recent data and save it as a timestamped2 file in the data folder I set up just for that purpose.3 Simply sourcing this script (source("fetch_drc_data.R")) should then load the current DRC dataset into the environment.4

Data munging

We need to do a little data munging. First, we melt down the data frame using reshape2’s melt function. Melting takes ‘wide’ data and converts it into ‘long’ data – for example, in our case, the original data had a row for each daily report for each health zone, and a column for the various combinations of confirmed/probable/suspected over cases/deaths. Melting the data frame down creates a variable type column (say, confirmed_deaths) and a value column (giving the value, e.g. 3). Using lubridate,5 the dates are parsed, and the values are stored in a numeric format.

library(magrittr)
library(reshape2)
library(lubridate)

current_drc_data %<>% melt(value.name = "value", measure.vars = c("confirmed_cases", "confirmed_deaths", "probable_cases", "probable_deaths", "suspect_cases", "suspect_deaths", "ruled_out"))
current_drc_data$event_date <- lubridate::ymd(current_drc_data$event_date)
current_drc_data$report_date <- lubridate::ymd(current_drc_data$report_date)
current_drc_data$value <- as.numeric(current_drc_data$value)

Next, we drop ruled_out cases, as they play no significant role for the current visualisation.

current_drc_data <- current_drc_data[current_drc_data$variable != "ruled_out",]

We also need to split the type labels into two different columns, so as to allow plotting them as a matrix. Currently, the data type label (the variable column) holds both the certainty status (confirmed, suspected or probable) and the type of indicator (cases vs deaths) in a single variable, separated by an underscore. We’ll use stringr’s str_split_fixed to split the variable names at the underscore, and bind the two parts on as separate columns, suspicion and mm, the latter denoting mortality/morbidity status.

library(stringr)

current_drc_data %<>% cbind(., str_split_fixed(use_series(., variable), "_", 2)) %>% 
                 subset(select = -c(variable)) %>% 
                 set_colnames(c("event_date", "report_date", "health_zone", "value", "suspicion", "mm"))

Let’s filter out the health zones that are being observed but have no relevant data for us yet:

relevant_health_zones <- current_drc_data %>% 
                         subset(select = c("health_zone", "value")) %>% 
                         group_by(health_zone) %>% 
                         summarise(totals = sum(value, na.rm=TRUE)) %>% 
                         dplyr::filter(totals > 0) %>% 
                         use_series(health_zone)

This gives us a vector of all health zones that are currently reporting cases. We can filter our DRC data for that:

current_drc_data %<>% dplyr::filter(health_zone %in% relevant_health_zones)

This whittles down our table by a few rows. Finally, we might want to create a fake health zone that summarises all other health zones’ respective data:

totals <- current_drc_data %>% group_by(event_date, report_date, suspicion, mm) %>%
                               summarise(value = sum(value), health_zone = as.factor("DRC total"))

# Reorder totals to match the core dataset
totals <- totals[,c(1,2,6,5,3,4)]

Finally, we bind these together to a single data frame:

current_drc_data %<>% rbind.data.frame(totals)

Visualising it!

Of course, all this was in pursuance of cranking out a nice visualisation. For this, we need to do a couple of things, including first ensuring that “DRC total” is treated separately and comes last:

regions <- current_drc_data %>% use_series(health_zone) %>% unique()
regions <- regions[regions != "DRC total"]
regions %<>% c("DRC total")

current_drc_data$health_zone_f <- factor(current_drc_data$health_zone, levels = regions)

I normally start out by declaring the colour scheme I will be using. In general, I tend to use the same few colour schemes, which I keep in a few gists. For simple plots, I prefer to use no more than five colours:

colour_scheme <- c(white = rgb(238, 238, 238, maxColorValue = 255),
                   light_primary = rgb(236, 231, 216, maxColorValue = 255),
                   dark_primary = rgb(127, 112, 114, maxColorValue = 255),
                   accent_red = rgb(240, 97, 114, maxColorValue = 255),
                   accent_blue = rgb(69, 82, 98, maxColorValue = 255))

With that sorted, I can invoke the ggplot method, storing the plot in an object, p. This is so as to facilitate later retrieval by the ggsave method.

library(ggplot2)

p <- ggplot(current_drc_data, aes(x=event_date, y=value)) +

  # Title and subtitle
  ggtitle(paste("Daily EBOV status", "DRC", Sys.Date(), sep=" - ")) +
  labs(subtitle = "(c) Chris von Csefalvay/CBRD (cbrd.co) - @chrisvcsefalvay") +
  
  # This facets the plot based on the factor vector we created earlier
  facet_grid(health_zone_f ~ suspicion) +
  geom_path(aes(group = mm, colour = mm, alpha = mm), na.rm = TRUE) +
  geom_point(aes(colour = mm, alpha = mm)) +

  # Axis labels
  ylab("Cases") +
  xlab("Date") +

  # The x-axis runs from the first notified case to today
  scale_x_date(limits = c(as.Date("2018-05-08"), Sys.Date()),
               date_breaks = "7 days", date_labels = "%m/%d") +

  # Because often there's an overlap and cases that die on the day of registration
  # tend to count here as well, some opacity is useful.
  scale_alpha_manual(values = c("cases" = 0.5, "deaths" = 0.8)) +
  scale_colour_manual(values = c("cases" = colour_scheme[["accent_blue"]], "deaths" = colour_scheme[["accent_red"]])) +

  # Ordinarily, I have these derive from a theme package, but they're very good
  # defaults and starting points
  theme(panel.spacing.y = unit(0.6, "lines"), 
        panel.spacing.x = unit(1, "lines"),
        plot.title = element_text(colour = colour_scheme[["accent_blue"]]),
        plot.subtitle = element_text(colour = colour_scheme[["accent_blue"]]),
        axis.line = element_line(colour = colour_scheme[["dark_primary"]]),
        panel.background = element_rect(fill = colour_scheme[["white"]]),
        panel.grid.major = element_line(colour = colour_scheme[["light_primary"]]),
        panel.grid.minor = element_line(colour = colour_scheme[["light_primary"]]),
        strip.background = element_rect(fill = colour_scheme[["accent_blue"]]),
        strip.text = element_text(colour = colour_scheme[["light_primary"]])
  )
DRC EBOV outbreak, 22 May 2018. The data has some significant gaps, owing to monitoring and recording issues, but some clear trends already emerge from this simple illustration.

The end result is a fairly appealing plot, although if the epidemic goes on, one might want to consider getting rid of the point markers. All that remains is to insert an automatic call to the ggsave function to save the image:

Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%
  paste("DRC-EBOV-", ., ".png", sep="") %>%
  ggsave(plot = p, device="png", path="visualisations/drc/", width = 8, height = 6)

Automation

The cronR package has a built-in cron scheduler add-in for RStudio, allowing you to manage all your code scheduling needs.

Of course, being a lazy epidemiologist, this is the kind of stuff that just has to be automated! Since I run my entire RStudio instance on a remote machine, it makes perfect sense to run this script regularly. The cronR package comes with a nice widget, which will allow you to simply schedule any task. Old-school command line users can, of course, always resort to ye olde command line based scheduling and execution. One important caveat: the context of cron execution will not necessarily be the same as that of your R project or indeed of the R user. Therefore, when you source a file or refer to paths, you may want to use fully qualified paths, i.e. /home/my_user/my_R_stuff/script.R rather than merely script.R. cronR is very good at logging when things go awry, so if the plots do not start to magically accumulate at the requisite rate, do give the log a check.
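
For illustration, this is roughly what scheduling the plotting script with cronR looks like – the path, job id and schedule below are placeholders for your own:

library(cronR)

cmd <- cron_rscript("/home/my_user/my_R_stuff/plot_drc.R")   # fully qualified path, as noted above
cron_add(command = cmd,
         frequency = "daily",
         at = "07:00",
         id = "drc_epi_curve",
         description = "Daily DRC EBOV epi curve plot")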

The next step is, of course, to automate uploads to Twitter. But that’s for another day.

References

1.Disclaimer: Dr Rivers is a friend, former collaborator and someone I hold in very high esteem. I’m also from time to time trying to contribute to these repos.
2.My convention for timestamps is the military DTG convention of DDHHMMSSMONYEAR, so e.g. 7.15AM on 21 May 2018 would be 21071500MAY2018.
3.It is, technically, bad form to put the path in the file name for the curl::curl_fetch_disk() function, given that curl::curl_fetch_disk() offers the path parameter just for that. However, due to the intricacies of piping data forward, this is arguably the best way to do it so as to allow the filename to be reused for the read_csv() function, too.
4.If for whatever reason you would prefer to keep only one copy of the data around and overwrite it every time, quite simply provide a fixed filename into curl_fetch_disk().
5.As an informal convention of mine, when using the simple conversion functions of lubridate such as ymd, I tend to note the source package, hence lubridate::ymd over simply ymd.

Get your Jupyterhub box set up on Linode with a single go!

At CBRD, a lot of the research work we do is done on remote machines. For various reasons, we like being able to spin up and wind down these boxes at will, and auto-configure them at short notice according to a few standard variables. Depending on the installation options chosen, we then end up with a perfectly set up box with all the features we want, built around R and Python, with RStudio and JupyterHub as frontends.

Where StackScripts fit in

There are a wide range of ways to configure boxes – Puppet, Chef cookbooks, Terraform, Dockerfiles, and all that –, but for ease of use, we rely on simple shell files that can be run as Linode StackScripts.

StackScripts are an extremely convenient way to configure a single Linode with the software you need. Unlike more complex systems like Terraform, a StackScript is essentially just a shell script that is run when the Linode is first provisioned – there is no extra tooling or state to manage.

Basic usage

The Ares research node generator StackFile does a handful of things:

  • Updates the system and adds the CRAN repo as a source
  • Installs R and the RStudio version of choice
  • Installs Python and the JupyterHub version of choice
  • Installs an opinionated set of system level packages (i.e. available to all users)
  • Configures ports and some other configuration items for the instance
  • Creates the root user
  • Daemonises the RStudio server and JupyterHub so that they restart automatically on failure and start automatically at reboot

When deployed using Linode from its StackFile, it allows for a wide range of configuration options, including ports for both Jupyter and RStudio, and a completely configured first user set up both on JupyterHub and RStudio. In addition, you can configure some install settings. A ‘barebones’ install option exists that allows for a minimum set of packages to be installed – this is useful for testing or if the desired configuration diverges from the ordinary structure. In addition, OpenCV, deep learning tools and cartography tools can be selectively disabled or enabled, as these are not always required.

User administration for Jupyterhub and RStudio

In general, user administration is by default attached to PAM, i.e. the built-in Linux authentication structure. JupyterHub has its own administration features, described here. RStudio, on the other hand, authenticates by user group membership. The two share the same user group, specified in the configuration (by default and convention, this is jupyter, but you can change it), and because users created by JupyterHub fall into that user group, creating users in JupyterHub automatically grants them access to RStudio. This is overall acceptable as we tend to use both, but there might be a security concern there. If so, you can point the auth-required-user-group=$USERGROUPNAME setting in /etc/rstudio/rserver.conf at a different user group.

Issues

There are some glitches that we’re trying to iron out:

  • Cartography and GIS tools glitch a little due to issues with PROJ.4.
  • GPU/CUDA support is not implemented, as this is not customarily used or provided on Linodes.
  • certbot and Let's Encrypt are not really supported yet, as our boxes are never directly public-facing, but you should most definitely find a way to put your server behind some form of SSL/TLS.
  • Currently, only Ubuntu 16.04 LTS is supported, as it's the most widely used version. CRAN does not support more recent versions yet, but these will be added once CRAN support arrives.
New to Linodes? Linode is a great service offering a simple, uncomplicated way to set up a Linux server, just how you want it! 👍🏻 In fact, I’ve been using Linode servers for over four years, and have only good experiences with them. I keep recommending them to friends, and they’re all happy! So if this post was helpful, and you’re thinking of signing up for a Linode, would you consider doing so using my affiliate link? I run this blog as a free service and pay for hosting out of pocket, and I don’t do advertisements (and never will!) or paid reviews of things (like what, paid Tyvek reviews? “Best Ebola PPE”? Nope.) – so really, it would make me so happy if you could help out this way.

As always, do let me know how things work out for you! Feel free to leave comments and ideas below. If you’re getting value out of using this script, just let me know.

MedDRA + VAERS: A marriage made in hell?

This post is a Golden DDoS Award winner

So far, this blog was DDoS’d only three times within 24 hours of its publication. That deserves a prize.

Quick: what do a broken femur, Henoch-Schönlein purpura, fainting, an expired vaccine and a healthy childbirth have in common? If your answer was “they’re all valid MedDRA codes”, you’re doing pretty well. If you, from that, deduced that they all can be logged on VAERS as adverse effects of vaccination, you’re exceeding expectations. And if you also realise that the idea that Jane got an expired HPV vaccine, and as a consequence broke her femur, developed Henoch-Schönlein purpura, and suddenly gave birth to a healthy baby boy is completely idiotic and yet can be logged on VAERS, you’re getting where I’m going.

MedDRA is a medical nomenclature specifically developed for the purposes of pharmacovigilance. The idea is, actually, not dreadful – there are some things in a usual medical nomenclature like ICD-10 that are not appropriate for a nomenclature used for pharmacovigilance reporting (V97.33: sucked into jet engine comes to my mind), and then there are things that are specific to pharmacovigilance, such as “oh shoot, that was not supposed to go up his bum!” (MedDRA 10013659: vaccine administered at inappropriate site), “we overdosed little Johnny on the flu vaccine!” (MedDRA 10000381: drug overdose, accidental) and other joys that generally do only happen in the context of pharmacovigilance. So far, so good.

At the same time, MedDRA is non-hierarchical, at least on the coding level. Thus, while the ICD code V97.33 tells you that you’re dealing with an external cause of mortality and morbidity (V and Y codes), specifically air and space transport (V95-97), more specifically ‘other’ specific air transport accidents, specifically getting sucked into a jet engine (V97.33), there’s no way to extract from MedDRA 10000381 what the hell we’re dealing with. Not only do we not know if it’s a test result, a procedure, a test or a disease, we are hopelessly lost as to figuring out what larger categories it belongs to. To make matters worse, MedDRA is proprietary – which in and of itself is offensive to the extreme to the idea of open research on VAERS and other public databases: a public database should not rely on proprietary encoding! -, and it lacks the inherent logic of ICD-10. Consider the encoding of the clinical diagnosis of unilateral headache in both:

SOC: System Organ Class, HLGT: High Level Group Term, HLT: High Level Term, PT: Preferred Term, LLT: Lower Level Term

Attempting to encode the same concept in MedDRA and ICD-10, note that the final code in MedDRA (10067040) does not allow for the predecessors to be reverse engineered, unless a look-up table is used – which is proprietary. ICD-10, at the same time, uses a structure that encodes the entire ‘ancestry’ of an entity in the final code. Not only does this allow for intermediate codes, e.g. G44 for an ‘other’ headache syndrome until it is more closely defined, it also allows for structured analysis of the final code: even without a look-up table (which is public, as is the whole of ICD-10), the level of kinship between two ICD-10 codes can be ascertained with ease.

We know that an ICD code beginning with F will be something psychiatric and G will be neurological, and from that alone we can get some easy analytical approaches (a popular one is looking at billed codes and drilling down by hierarchical level of ICD-10 codes, something in which the ICD-10 is vastly superior to its predecessor). MedDRA, alas, offers us no such help.
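
To make the contrast concrete, here is a small sketch in R – the helper function is mine, the codes are the ones discussed in this post – showing how an ICD-10 code’s ancestry falls out of simple string handling, while MedDRA codes betray nothing about their relationship without the (proprietary) look-up table:

# Ancestry of an ICD-10 code, read straight off the code: chapter letter, block, full code
icd10_ancestry <- function(code) {
  block <- sub("\\..*$", "", code)        # the part left of the dot, e.g. "G44"
  unique(c(substr(block, 1, 1), block, code))
}

icd10_ancestry("G44.1")    # "G"  "G44"  "G44.1"  -- neurological, other headache syndromes
icd10_ancestry("V97.33")   # "V"  "V97"  "V97.33" -- external causes, air transport accidents

# Kinship is a simple prefix test:
startsWith("G44.1", "G44")          # TRUE: G44.1 is a child of G44

# The MedDRA codes for unilateral headache (10067040) and Lyme disease (10025169)
# carry no such structure -- nothing can be read off the codes themselves.
startsWith("10067040", "10025169")  # FALSE, and meaningless either way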

Garbage in, garbage out

OK, so we’ve got a nomenclature where the codes for needlestick injury, death, pneumonia, congenital myopathy and a CBC look all the same. That’s already bad enough. It gets worse when you can enter any and all of these into the one single field. Meet VAERS.

The idea of VAERS is to allow physicians, non-physicians and ‘members of the public’ to report incidents. These are then coded by the CDC and depending on seriousness, they may or may not be investigated (all reports that are regarded as ‘serious’ are investigated, according to the CDC). The problem is that this approach is susceptible to three particular vulnerabilities:

  • The single field problem: VAERS has a single field for ‘symptoms’. Everything’s a symptom. This includes pre-existing conditions, new onset conditions, vaccination errors, lab tests (not merely results, just the tests themselves!), interventions (without specifying if they’re before or after the vaccine), and so on. There is also no way to filter out factors that definitely have nothing to do with the vaccine, such as a pre-existing birth defect. The History/Allergies field is not coded.
  • The coding problem: what gets coded and what does not is sometimes imperfect. This being a human process, it’s impossible to expect perfection, but the ramifications of this for certain methods of analysis are immense. For instance, if there are 100 cases of uncontrollable vomiting, that may be a signal. But if half of those are coded as ‘gastrointestinal disorder’ (also an existing code), you have two values of 50, neither of which may end up being a signal.
  • The issue of multiple coding: because MedDRA is non-hierarchical, it is not possible to normalise at a higher level (say, with ICD-10 codes, at chapter or block level), and it is not clear if two codes are hierarchically related. In ICD-10, if a record contains I07 (rheumatic tricuspid valve disease) and I07.2 (tricuspid stenosis with tricuspid insufficiency), one can decide to retain the more specific or the less specific entry, depending on intended purpose of the analysis.

In the following, I will demonstrate each of these based on randomly selected reports from VAERS.

The Single Field Problem (SFP)

The core of the SFP is that there is only one codeable field, ‘symptoms’.

VAERS ID 375693-1 involves a report, in which the patient claims she developed, between the first and second round of Gardasil,

severe stomach pain, cramping, and burning that lasted weeks. Muscle aches and overall feeling of not being well. In August 2009 patient had flu like symptoms, anxiety, depression, fatigue, ulcers, acne, overall feeling of illness or impending death.

Below is the patient’s symptom transposition into MedDRA entities (under Symptoms):

Event details for VAERS report 375693-1

The above example shows the mixture of symptoms, diagnostic procedures and diagnostic entities that are coded in the ‘Symptoms’ field. The principal problem with this is that when considering mass correlations (all drugs vs all symptoms, for instance), this system would treat a blood test just as much as a contributor to a safety signal as anxiety or myalgia, which might be true issues, or depression, which is a true diagnosis. Unfiltered, this makes VAERS effectively useless for market basket style (co-occurrence frequency) analyses.

Consider for instance, that PRR is calculated as

PRR_{V,R} = \frac{\Sigma (R \mid V) / \Sigma (V)}{\Sigma (R \mid \neg V) / \Sigma (\neg V)} = \frac{\Sigma (R \mid V)}{\Sigma (V)} \cdot \frac{\Sigma (\neg V)}{\Sigma (R \mid \neg V)}

where V denotes the vaccine of interest, R denotes the reaction of interest, and the \Sigma operator denotes the sum of rows or columns that fulfill the requisite criteria (a more detailed, matrix-based version of this equation is presented here). But if \{R\}, the set of all R, contains not merely diagnoses but also various ‘non-diagnoses’, the PRR calculation will be distorted. For constant V and an unduly large R, the values computationally obtained from the VAERS data that ought to be \Sigma(R \mid V) and \Sigma(R \mid \neg V) will both be inaccurately inflated. This will yield inaccurate final results.
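
To see the mechanics (and how easily they distort), here is a toy PRR calculation in R – the counts are invented purely to illustrate the arithmetic, they are not drawn from VAERS:

r_with_v       <- 40    # reports listing vaccine V and reaction R
notr_with_v    <- 960   # reports listing V but not R
r_without_v    <- 25    # reports listing R but not V
notr_without_v <- 8975  # reports listing neither

prr <- (r_with_v / (r_with_v + notr_with_v)) /
       (r_without_v / (r_without_v + notr_without_v))
prr   # 14.4

# If 'R' also sweeps up non-symptoms (lab orders, pre-existing conditions, &c.),
# the counts feeding both numerator and denominator are inflated, and the
# resulting PRR no longer reflects a genuine safety signal.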

Just how bad IS this problem? About 30% bad, if not more. A manual tagging of the top 1,000 symptoms (by N, i.e. by the number of occurrences) was used as an estimate for how many of the diagnostic entities do not disclose an actual problem with the vaccine.

In this exercise, each of the top 1,000 diagnostic codes (by N, i.e. by occurrence) were categorised into a number of categories, which in turn were divided into includable (yellow) and non-includable (blue) categories. An includable category reveals a relevant (=adverse) test result, a symptom, a diagnosis or a particular issue, while non-includables pertain to procedures, diagnostics without results, negative test results and administration, storage & handling defects.

According to the survey of the top 1,000 codes, only a little more than 70% of the codes themselves disclose a relevant issue with the vaccine. In other words, almost a third of disclosed symptoms must be pruned, and these cannot be categorically pruned because unlike ICD-10, MedDRA does not disclose hierarchies based on which such pruning would be possible. As far as the use of MedDRA goes, this alone should be a complete disaster.

Again, for effect: a third of the codes do not disclose an actual side effect of the medication. These are not separate or identifiable in any way other than manually classifying them and seeing whether they disclose an actual side effect or just an ancillary issue. Pharmacovigilance relies on accurate source data, and VAERS is not set up, with its current use of MedDRA, to deliver that.

The coding problem

Once a VAERS report is received, it is MedDRA coded at the CDC. Now, no manual coding is perfect, but that’s not the point here. The problem is that a MedDRA code does not, in and of itself, indicate the level of detail it holds. For instance, 10025169 and 10021881 look exactly alike, where in fact the first is a lowest-level entity (an LLT – Lower-Level Term – in MedDRA lingo) representing Lyme disease, while the latter is the top-level class (SOC – System Organ Class) corresponding to infectious diseases. What this means is that once we see a MedDRA coded entity as its code, we don’t know what level of specificity we are dealing with.

The problem gets worse with named entities. You see, MedDRA has a ‘leaf’ structure: every branch must terminate in one or more (usually one) LLT. Often enough, LLTs have the same name as their parent PT, so you get PT Lyme disease and LLT Lyme disease. Not that it terrifically matters for most applications, but when you see only the verbose output, as is the case in VAERS, you don’t know if this is a PT, an LLT, or, God forbid, a higher level concept with a similar name.

Finally, to put the cherry on top of the cake, where a PT is also the LLT, they have the same code. So for Lyme disease, the PT and LLT both have the code 10025169. I’m sure this seemed like a good idea at the time.

The issue of multiple coding

As this has been touched upon previously, because MedDRA lacks an inherent hierarchy, a code cannot be converted into its next upper level without using a lookup table, whereas with, say, ICD-10, one can simply normalise to the chapter and block (the ‘part left of the dot’). More problematically, however, the same code may be a PT or an LLT, as is the case for Lyme disease (10025169).

Let’s look at this formally. Let the operator \in^* denote membership under the transitive closure of the set membership relation, so that

  1. if x \in A, then x \in^* A,
  2. if x \in A and A \subseteq B, then x \in^* B.

and so on, recursively, ad infinitum. Let furthermore \in^*_{m} denote the depth of recursion, so that

  1. for x \in A:  x \in^*_{0} A,
  2. for x \in A \mid A \subseteq B:  x \in^*_{1} B,

and, once again, so on, recursively, ad infinitum.

Then let a coding scheme \{S_{1...n}\} exhibit the Definite Degree of Transitiveness (DDoT) property iff (if and only if) for any S_m \mid m \leq n, there exists exactly one p for which it is true that S_m \in^*_{p} S.

Or, in other words, two codes S_q, S_r \mid q, r \leq n, may not be representable identically if p_q \neq p_r. Less formally: two codes on different levels may not be identical. This is clearly violated in MedDRA, as the example below shows.

Violating the Definite Degree of Transitiveness (DDoT) property: PT Botulism and LLT Botulism have different p numbers, i.e. different levels, but have nonetheless the same code. This makes the entity 10006041 indeterminate – is it the PT or the LLT for botulism? This is a result of the ‘leaf node constraint’ of MedDRA’s design, but a bug by design is still a bug, not a feature.
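
Given a look-up table (which, again, you would have to license), checking for DDoT violations is mechanically trivial – a sketch in R with a hand-made, purely illustrative table containing the codes mentioned in this post:

lut <- data.frame(code  = c("10006041", "10006041", "10025169", "10025169", "10021881"),
                  level = c("PT",       "LLT",      "PT",       "LLT",      "SOC"),
                  stringsAsFactors = FALSE)

# A code violates DDoT if it appears at more than one hierarchical level
levels_per_code <- tapply(lut$level, lut$code, function(x) length(unique(x)))
names(levels_per_code)[levels_per_code > 1]
# [1] "10006041" "10025169"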

Bonus: the ethical problem

To me as a public health researcher, there is a huge ethical problem with the use of MedDRA in VAERS. I believe very strongly in open data and in the openness of biomedical information. I’m not alone: for better or worse, the wealth – terabytes upon terabytes – of biomedical data, genetics, X-ray crystallography, models and sequences out there proves that if I’m a dreamer, I’m not the only one.

Which is why it’s little short of an insult to the public that a pharmacovigilance system is using a proprietary encoding model.

Downloads from VAERS, of course, provide the verbose names of the conditions or symptoms, but not what hierarchical level they are, nor what structure they are on. For that, unless you are a regulatory authority or a ‘non-profit’ or ‘non-commercial’ (which would already exclude a blogger who unlike me has ads on their blog to pay for hosting, or indeed most individual researchers, who by their nature could not provide the documentation to prove they aren’t making any money), you have to shell out some serious money.

MedDRA is one expensive toy.

Worse, the ‘non-profit’ definition does not include a non-profit research institution or an individual non-profit researcher, or any of the research bodies that are not medical libraries or affiliated with educational institutions but are funded by third party non-profit funding:

This just keeps getting worse. Where would a non-profit, non-patient care provider, non-educational, grant-funded research institution go?

There is something rotten with the use of MedDRA, and it’s not just how unsuitable it is for the purpose, it is also the sheer obscenity of a public database of grave public interest being tied to a (vastly unsuitable and flawed, as I hope it has been demonstrated above) nomenclature.

Is VAERS lost?

Resolving the MedDRA issue

Unlike quite a few people in the field, I don’t think VAERS is hopelessly lost. There’s, in fact, great potential in it. But the way it integrates with MedDRA has to be changed. This is both a moral point – a point of commitment to opening up government information – and one of facilitating research.

There are two alternatives at this point for the CDC.

  1. MedDRA has to open up at least the 17% of codes, complete with hierarchy, that are used within VAERS. These should be accessible, complete with the hierarchy, within VAERS, including the CDC WONDER interface.
  2. The CDC has to switch to a more suitable system. ICD-10 alone is not necessarily the best solution, and there are few alternatives, which puts MedDRA into a monopoly position that it seems to mercilessly exploit at this time. This can – and should – change.

Moving past the Single Field Problem

MedDRA apart, it is crucial for VAERS to resolve the Single Field Problem. It is clear from the issues presented in the first paragraph – a broken femur, Henoch-Schönlein purpura, fainting, an expired vaccine and a healthy childbirth – that there is a range of issues that need to be logged. A good structure would be

  1. pre-existing conditions and risk factors,
  2. symptoms that arose within 6 hours of administration,
  3. symptoms that arose within 48 hours of administration,
  4. symptoms that arose later than 48 hours of administration,
  5. non-symptoms,
  6. clinical tests without results,
  7. clinical tests segmented by positive and negative results, and
  8. ancillary circumstances, esp. circumstances pertaining to vaccination errors such as wrong vaccine administered, expired vaccine, etc.

The use of this segmentation would be able to differentiate not only time of occurrence, but also allow for adequate filtering to identify the correct denominators for the PRR.

A future with (for?) MedDRA

As said, I am not necessarily hostile to MedDRA, even if the closet libertarian in me bristles at the fact that MedDRA is mercilessly exploiting what is an effective monopoly position. But MedDRA can be better, and needs to be better – if not for its own economic interests, then for the interests of those it serves. There are three particular suggestions MedDRA needs to seriously consider.

  1. MedDRA’s entity structure is valuable – arguably, it’s the value in the entire project. If coding can be structured to reflect its internal hierarchy, MedDRA becomes parseable without a LUT,1 and kinship structures can be read straight off the codes themselves.
  2. MedDRA needs to open up, especially to researchers not falling within its narrowly defined confines of access. Especially given the inherent public nature of its use – PhV and regulation are quintessentially public functions, and this needs an open system.
  3. MedDRA’s entity structure’s biggest strength is that it comprises a range of different things, from administrative errors through physical injuries to test results and the simple fact of tests.

Conclusion

VAERS is a valuable system with a range of flaws. All of them are avoidable and correctable – but would require the requisite level of will and commitment – both on CDC’s side and that of MedDRA. For any progress in this field, it is imperative that the CDC understand that a public resource maintained in the public interest cannot be driven by a proprietary nomenclature, least of all one that is priced out of the range of the average interested individual: and if they cannot be served, does the entire system even fulfill its governmental function of being of the people and for the people? It is ultimately CDC’s asset, and it has a unique chance to leverage its position to ensure that at least as far as the 17% of MedDRA codes go that are used in VAERS, these are released openly.

In the end, however sophisticated our dissimilarity metrics, when 30% of all entities are non-symptoms and we need to manually prune the key terms to avoid denominator bloat due to non-symptom entities, such as diagnostic tests without results or clearly unconnected causes of morbidity and mortality like motor vehicle accidents, dissimilarity based approaches will suffer from serious flaws. In the absence of detailed administration and symptom tracking at an individual or institutional level, dissimilarity metrics are the cheapest and most feasible ways of creating value out of post marketing passive reports. If VAERS is to be a useful research tool, as I firmly believe it was intended to be, it must evolve to that capability for all.

References

1.Look-up table

Ebola! Graph databases! Contact tracing! Bad puns!

Thanks to the awesome folks at Neo4j Budapest and GraphAware, I will be talking tonight about Ebola, contact tracing, how graph databases help us understand epidemics and maybe prevent them someday. Now, if flying to Budapest on short notice might not work for you, you can listen to a livestream of the whole event here! It starts today, 13 February, at 1830 CET, 1730 GMT or 1230 Eastern Time, and I sincerely hope you will listen to it, live or later from the recording, also accessible here.

On the challenge of building HTTP REST APIs that don’t suck

Here’s a harsh truth: most RESTful HTTP APIs (in the following, APIs) suck, to some degree or another. Including the ones I’ve written.

Now, to an extent, this is not my fault, your fault or indeed anyone’s fault. APIs occupy a strange no man’s land between stuff designed for machines and stuff designed for humans. On one hand, APIs are intended to allow applications and services to communicate with each other. If humans want to interact with some service, they will do so via some wrapper around an API, be it an iOS app, a web application or a desktop client. Indeed, the tools you need to interact with APIs – a HTTP client – are orders of magnitude less well known and less ubiquitous than web browsers. Everybody has a web browser and knows how to use one. Few people have a dedicated desktop HTTP browser like Paw or know how to use something like curl. Quick, how do you do a token auth header in curl?

At the same time, even if the end user of the API is the under-the-hood part of a client rather than a human end user, humans have to deal with the API at some point, when they’re building whatever connects to the API. Tools like Swagger/OpenAPI were intended to somewhat simplify this process, and the idea was good – let’s have APIs generate a schema that they also serve up from which a generic client can then build a specific client. Except that’s not how it ended up working in practice, and in the overwhelming majority of cases, the way an API handler is written involves Dexedrine, coffee and long hours spent poring over the API documentation.

Which is why your API can’t suck completely. There’s no reason why your API can’t be a jumbled mess of methods from the perspective of your end user, who will interact with your service without ever needing to know what an API even is. That’s the beauty of it all. But if you want people to use your service – which you should very much want! – you’ll have to have an API that developers can get some use out of.

Now, the web is awash with stuff about best practices in developing REST APIs. And, quite frankly, most of these are chock-full of good advice. Yes, use plural nouns. Use HATEOAS. Use the right methods. Don’t create GET methods that can change state. And so on.

But the most important thing to know about rules, as a former CO of mine used to say, is to know when to break them, and what the consequences will be when you do. There’s a philosophy of RESTful API design called pragmatic REST that acknowledges this to an extent, and uses the ideas underlying REST as a guideline, rather than strict, immutable rules. So, the first step of building APIs that don’t suck is knowing the consequences of everything you do. The problem with adhering to doctrine or rules or best practices is that none of that tells you what the consequences of your actions are, whether you follow them or not. That’s especially important when considering the consequences of not following the rules: not pluralizing your nouns and using GET to alter state have vastly different consequences. The former will piss off your colleagues (rightly so), the latter will possibly endanger the safety of your API and lead to what is sometimes referred to in the industry as Some Time Spent Updating Your LinkedIn & Resume.

Secondly, and you can take this from me – no rules are self-explanatory. Even with all the guidance in the world in your hand, there’s a decent chance I’ll have no idea why most of your code is the way it is. The bottom line being: document everything. I’d much rather have an API breaking fifteen rules and giving doctrinaire rule-followers an apoplectic fit but which is well-documented over a super-tidy bit of best practices incarnate (wouldn’t that be incodeate, given that code is not strictly made of meat?) that’s missing any useful documentation, any day of the week. There are several ways to document APIs, and no strictly right one – in fact, I would use several different methods within the same project for different endpoints. So for instance a totally run-of-the-mill DELETE endpoint that takes an object UUID as an argument requires much less documentation than a complex filtering interface that takes fifty different arguments, some of which may be mandatory. A few general principles have served me well in the past when it comes to documenting APIs:

  • Keep as much of the documentation as you can out of the code itself and in the parts that actually make it into the generated documentation. In Python, for instance, that’s the docstring.
  • If your documentation allows, put examples into the docstring. An example can easily be drawn from the tests, by the way, which makes it a twofer.
  • Don’t document for documentation’s sake. Document to help people understand. Avoid tedious, wordy explanation for a method that’s blindingly obvious to everyone.
  • Eschew the concept of ‘required’ fields, values, query parameters, and so on. Nothing is ‘required’ – the world will not end if a query parameter is not provided, and you will, at the very least, be able to make the request. Rather, make it clear what the consequences of not providing the information will be. What happens if you do not enter a ‘required’ parameter? Merely calling it ‘required’ does not really tell me whether it will crash, yield a cryptic error message or simply fail silently (which is something you should also try to avoid).
  • Where something must have a particular type of value (e.g. an integer), where a value will have to be provided in a particular way (e.g. a Boolean encoded as 0/1 or True/False) or has a limited set of possible values (e.g. in an application tracking high school students, the year query parameter may only take the values ['freshman', 'sophomore', 'junior', 'senior']), make sure this is clearly documented, and make it clear whether the values are case sensitive or not.
  • Only put things into inline comments that would only be required for someone who is reading your code. Anything a user of your methods/endpoints ought to know about them should be in the docstring or otherwise end up in your documentation – your docstring, of course, has the added benefit that it will be visible not only for people reading the documentation but also for whoever is reading your code.
  • If you envisage even the most remote possibility that your API will have to handle Unicode, emojis or other fancy things (basically, anything beyond ASCII), make sure you explain how your API handles such values.
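To make this concrete, here is a minimal, hypothetical sketch of what such documentation can look like in practice – I’ve picked Flask purely for illustration, and the /students endpoint, the year parameter and the toy data are all made up. The point is the docstring: it spells out consequences, allowed values, case sensitivity and an example, rather than merely flagging things as ‘required’.

from flask import Flask, jsonify, request

app = Flask(__name__)

# Toy in-memory data – purely illustrative.
STUDENTS = [
    {'name': 'Ada', 'year': 'junior'},
    {'name': 'Grace', 'year': 'senior'},
]
VALID_YEARS = ['freshman', 'sophomore', 'junior', 'senior']


@app.route('/students', methods=['GET'])
def list_students():
    """List students, optionally filtered by year.

    Query parameters:
        year: one of 'freshman', 'sophomore', 'junior' or 'senior'
              (values are case sensitive). If omitted, students from all
              years are returned – nothing crashes. Any other value yields
              a 400 response with an explanatory message.

    Example:
        GET /students?year=junior
        -> {"students": [{"name": "Ada", "year": "junior"}]}
    """
    year = request.args.get('year')
    if year is not None and year not in VALID_YEARS:
        return jsonify(error=f'year must be one of {VALID_YEARS}'), 400
    students = [s for s in STUDENTS if year is None or s['year'] == year]
    return jsonify(students=students)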

Finally, eat your own dog food. Writing a wrapper for your API is not only a commercially sound idea (it is much more fun for other developers to just grab an API wrapper for their language of choice than having to homebrew one), it’s also a great way to gauge how painful it is to work with your API. Unless it’s anything above a 6 on the 1-10 Visual Equivalent Scale of Painful and Grumpy Faces, you’ll be fine. And if you need to make changes, make any breaking change part of a new version. An API version string doesn’t necessarily mean the API cannot change at all, but it does mean you may not make breaking changes – this means any method, any endpoint and any argument that worked on day 0 of releasing v1 will have to work on v1, forever.

Following these rules won’t ensure your API won’t suck. But they’ll make sucking much more difficult, which is half the victory. A great marksmanship instructor I used to know said that the essence of ‘technique’, be it in handling a weapon or writing an API, is to reduce the opportunity of making avoidable mistakes. Correct running technique will force you to run in a way that doesn’t even let you injure your ankle unless you deviate from the form. Correct shooting technique eliminates the risk of elevation divergences due to discrepancies in how much air remains in the lungs by simply making you squeeze the trigger at the very end of your expiration. Good API development technique keeps you from creating APIs that suck by restricting you to practices that won’t allow you to commit some of the more egregious sins of writing APIs. And the more you can see beyond the rules and synthesise them into a body of technique that keeps you from making mistakes, the better your code will be, without cramping your creativity.

Using screen to babysit long-running processes

In machine learning, especially in deep learning, long-running processes are quite common. Just yesterday, I finished running an optimisation process that ran for the best part of four days – and that’s on a 4-core machine with an Nvidia GRID K2, letting me crunch my data on 3,072 GPU cores! Of course, I did not want to babysit the whole process. Least of all did I want to have to do so from my laptop. There’s a reason we have tools like Sentry, which can be easily adapted from webapp monitoring to letting you know how your model is doing.

One solution is to spin up another virtual machine, ssh into that machine, then from that ssh into the machine running the code, so that if you drop the connection to the first machine, it will not drop the connection to the second. There is also nohup, which makes sure that the process is not killed when you ‘hang up’ the ssh connection. You will, however, not be able to get back into the process again. There are also reparenting tools like reptyr, but the need they meet is somewhat different. Enter terminal multiplexers.

Terminal multiplexers are old. They date from the era of things like time-sharing systems and other antiquities whose purpose was to allow a large number of users to get their time on a mainframe designed to serve hundreds, even thousands of users. With the advent of personal computers that had decent computational power on their own, terminal multiplexers remained the preserve of universities and other weirdos still using mainframe architectures. Fortunately for us, two great terminal multiplexers, screen (aka GNU Screen) and tmux, are still being actively developed, and are almost definitely available for your *nix of choice. This gives us a convenient tool to sneak a peek at what’s going on with our long-suffering process. Here’s how.

Step 1
ssh into your remote machine, and launch screen. You may need to do this as sudo if you encounter the error where screen, instead of starting up a new shell, returns [screen is terminating] and quits. If screen is started up correctly, you should be seeing a slightly different shell prompt (and if you started it as sudo, you will now be logged in as root).
In some scenarios, you may want to ‘name’ your screen session. Typically, this is the case when you want to share your screen with another user, e.g. for pair programming. To create a named screen, invoke screen using the session name parameter -S, as in e.g. screen -S my_shared_screen.
Step 2
In this step, we will be launching the actual script to run. If your script is Python based and you are using virtualenv (as you ought to!), activate the environment now using source <virtualenv folder>/<virtualenv name>/bin/activate, replacing <virtualenv folder> by the name of the folder where your virtualenvs live (for me, that’s the environments folder, often enough it’s something like ~/.virtualenvs) and <virtualenv name> by the name of your virtualenv (in my case, research). You have to activate your virtualenv even if you have done so outside of screen already (remember, screen means you’re in an entirely new shell, with all environment configurations, settings, aliases &c. gone)!

With your virtualenv activated, launch your script as normal – no need to launch it in the background. Indeed, one of the big advantages is the ability to see verbose mode progress indicators. If your script does not have a progress logger to stdout but logs to a logfile, you can start it using nohup, then put it into the background (Ctrl-Z, then bg) and track progress using tail -f logfile.log (where logfile.log is, of course, to be substituted by the filename of the logfile).
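If your script doesn’t log anywhere yet, something as simple as the following is enough to give tail -f something to follow – a minimal sketch, where the filename, the epoch loop and the messages are entirely illustrative stand-ins for whatever your script actually does:

import logging
import time

# Log to a file so that progress can be followed with tail -f logfile.log.
logging.basicConfig(
    filename='logfile.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)

for epoch in range(1, 101):
    # ...your actual training step would go here...
    time.sleep(1)  # stand-in for real work
    logging.info('Finished epoch %d of 100', epoch)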
Step 3
Press Ctrl-A followed by Ctrl-D to detach from the current screen. This will take you back to your original shell after noting the address of the screen you’re detaching from. These always follow the format <identifier>.<session id>.<hostname>, where <hostname> is, of course, the hostname of the computer from which the screen session was started, <session id> stands for the name you gave your screen, if any, and <identifier> is an autogenerated 4-6 digit socket identifier. In general, as long as you are on the same machine, the screen identifier or the session name will be sufficient – the full canonical name is only necessary when trying to access a screen on another host.

To see a list of all screens running under your current username, enter screen -list. Refer to that listing or the address echoed when you detached from the screen to reattach to the process using screen -r <socket identifier>[.<session identifier>.<hostname>]. This will return you to the script, which keeps executing in the background.
Result
Reattaching to the process running in the background, you can now follow the progress of the script. Use the key combination in Step 3 to detach from the process at any time, and screen -r, as above, to return to it.

Bugs
There is a known issue, caused by strace, that leads to screen immediately closing, with the message [screen is terminating] upon invoking screen as a non-privileged user.

There are generally two ways to resolve this issue.

The overall effect of both solutions is the same. Notably, both may be undesirable from a security perspective. As always, weigh risks against utility.

Do you prefer screen to staying logged in? Do you have any other cool hacks to make monitoring a machine learning process that takes considerable time to run? Let me know in the comments!

Image credits: Zenith Z-19 by ajmexico on Flickr

A deep learning

There are posts that are harder to write than others. This one perhaps has been one of the hardest. It took me the best part of four months and dozens of rewrites.

Because it’s about something I love. And about someone I love. And about something else I love. And how these three came to come into a conflict. And, perhaps, what we all can learn from that.


As many of you might know, deep learning is my jam. Not in a faddish, ‘it’s what cool kids do these days’ sense. Nor, for that matter, in the sense so awfully prevalent in Silicon Valley, whereby the utility of something is measured in how many jobs it will get rid of, presumably freeing up humans to engage in more cerebral pursuits, or how it may someday cure intrinsically human problems if only those pesky humans were to listen to their technocratic betters for once. Rather, I’m a deep learning and AI researcher who believes in what he’s doing. I believe with all I am and all I’ve got that deep learning is right now our best chance to find better ways of curing cancer, producing more with lower emissions, building structures that can withstand floods on a dime, identifying terrorists and, heck, creating entertaining stuff. I firmly believe that it’s one of the few intellectual pursuits I am somewhat suited for that is also worth my time, not least because I firmly believe that it will make me have more of it – and if not me, maybe someone equally worthy.

Which is why it was so hard for me to watch this video, of my lifelong idol Hayao Miyazaki ripping a deep learning researcher to shreds.

Now, quite frankly, I have little time for the researcher and his proposition. It’s badly made, dumb and pointless. Why one would inundate Miyazaki-san with it is beyond me. His putdown is completely on point, and not an ounce too harsh. All of his words are well deserved. As someone with a neurological chronic pain disorder that makes me sometimes feel like that creature writhing on the floor, I don’t have a shred of sympathy for this chap.1

Rather, it’s the last few words of Miyazaki-san that have punched a hole in my heart and have held my thoughts captive for months now, coming back into the forefront of my thoughts like a recurring nightmare.

“I feel like we are nearing the end of times,” he says, the camera gracefully hovering over his shoulder as he sketches through his tears. “We humans are losing faith in ourselves.”


Deep learning is something formidable, something incredible, something so futuristic yet so simple. Deep down (no pun intended), deep learning is really not much more than a combination of a few relatively simple tricks, some the best part of a century old, that together create something fantastic. Let me try to put it into layman’s terms (if you’re one of my fellow ML/AI nerds, you can just jump over this part).

Consider you are facing the arduous and yet tremendously important task of, say, identifying whether an image depicts a cat or a dog. In ML lingo, this is what we call a ‘classification’ task. One traditional approach used to be to define what cats are versus what dogs are, and provide rules. If it’s got whiskers, it’s a cat. If it’s got big puppy eyes, it’s, well, a puppy. If it’s got forward pointing eyes and a roughly circular face, it’s almost definitely a kitty. If it’s on a leash, it’s probably a dog. And so on, ad infinitum, your model of a cat-versus-dog becoming more and more accurate with each rule you add.

This is a fairly feasible approach, and it is still used. In fact, there’s a whole school of machine learning called decision trees that relies on this kind of definition of your subjects. But there are three problems with it (a quick sketch of what such a rule-based classifier might look like follows the list below).

  1. You need to know quite a bit about cats and dogs to be able to do this. At the very least, you need to be able to, and take the time and effort to, describe cats and dogs. It’s not enough to merely feed images of each to the computer.2
  2. You are limited in time and ability to put down distinguishing features – your program cannot be infinitely large, nor do you have infinite time to write it. You must prioritise by identifying the factors with the greatest differentiating potential first. In other words, you need to know, in advance, what the most salient characteristics of cats versus dogs are – that is, what characteristics are almost omnipresent among cats but hardly ever occur among dogs (and vice versa)? All dogs have a snout and no cat has a snout, whereas some cats do have floppy ears and some dogs do have almost catlike triangular ears.
  3. You are limited to what you know. Silly as that may sound, there might be some differentia between cats and dogs that are so arcane, so mathematical that no human would think of it – but which might come trivially evident to a computer.
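To make the rule-based idea concrete, here’s a deliberately silly, minimal sketch – the features and the rules are made up, which is rather the point: somebody had to sit down and think them all up.

def classify(animal):
    """Hand-written rules for the cat-versus-dog problem.

    `animal` is a dict of hand-crafted features, e.g.
    {'has_snout': True, 'goes_woof': True, 'on_leash': False}.
    """
    if animal.get('has_snout'):
        return 'dog'
    if animal.get('goes_woof'):
        return 'dog'
    if animal.get('forward_pointing_eyes') and animal.get('round_face'):
        return 'cat'
    if animal.get('on_leash'):
        return 'dog'
    # ...and so on, ad infinitum – every extra rule is yours to write.
    return 'cat'


print(classify({'has_snout': False, 'forward_pointing_eyes': True,
                'round_face': True}))  # -> 'cat'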

Deep learning, like friendship, is magic. Unlike most other techniques of machine learning, you don’t need to have the slightest idea of what differentiates cats from dogs. What you need is a few hundred images of each, preferably with a label (although that is not strictly necessary – classifiers can get by just fine without needing to be told what the names of the things they are classifying are: as long as they’re told how many different classes they are to split the images into, they will find differentiating features on their own and split the images into ‘images with thing 1’ versus ‘images with thing 2’ – magic, right?). Using modern deep learning libraries like TensorFlow and their high level abstractions (e.g. keras, tflearn), you can literally write a classifier that identifies cats versus dogs with a very high accuracy in less than 50 lines of Python – one that will be able to classify thousands of cat and dog pics in a fraction of a minute, most of which will be taken up by loading the images rather than the actual classification.
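To give you a flavour, here’s roughly what such a classifier looks like with tf.keras. This is a sketch, not a benchmark: it assumes your images live in data/train/{cats,dogs} and data/validation/{cats,dogs}, and the image size, architecture and epoch count are just sensible-looking defaults, not anything tuned.

import tensorflow as tf
from tensorflow.keras import layers

IMG_SIZE = (160, 160)

# Expects data/train/{cats,dogs}/ and data/validation/{cats,dogs}/ with images.
train_ds = tf.keras.utils.image_dataset_from_directory(
    'data/train', image_size=IMG_SIZE, batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    'data/validation', image_size=IMG_SIZE, batch_size=32)

# A small convolutional network: each Conv2D/MaxPooling2D pair extracts
# increasingly abstract features; the Dense layers do the actual deciding.
model = tf.keras.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=IMG_SIZE + (3,)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # cat (0) or dog (1)
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(train_ds, validation_data=val_ds, epochs=5)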

Told you it’s magic.


What makes deep learning ‘deep’, though? The origins of deep learning are older than modern computers. In 1943, McCulloch and Pitts published a paper3 that posited a model of neural activity based on propositional logic. Spurred by the mid-20th century advances in understanding how the nervous system works, in particular how nerve cells are interconnected, McCulloch and Pitts simply drew the obvious conclusion: there is a way you can represent neural connections using propositional logic (and, actually, vice versa). But it wasn’t until 1958 that this idea was followed up in earnest. Rosenblatt’s ground-breaking paper4 introduced this thing called the perceptron, something that sounds like the ideal robotic boyfriend/therapist but in fact was intended as a mathematical model for how the brain stores and processes information. A perceptron is a network of artificial neurons. Consider the cat/dog example. A simple single-layer perceptron has a list of input neurons x_1, x_2 and so on. Each of these describes a particular property. Does the animal have a snout? Does it go woof? Depending on how characteristic they are, each is multiplied by a weight w_n. For instance, all dogs and no cats have snouts, so w_1 will be relatively high, while long curly tails are neither universal among dogs nor entirely absent among cats, so that weight, w_n, will be relatively low.

At the end, the output neuron (denoted by the big Σ) sums up these results, and gives an estimate as to whether it’s a cat or a dog.
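In code, the whole thing fits in a few lines – a toy sketch with made-up features, weights and bias, just to show there’s no magic hiding in the sum:

# Toy single-layer perceptron: weighted sum of binary features, then a threshold.
features = {'has_snout': 1, 'goes_woof': 1, 'long_curly_tail': 0}
weights = {'has_snout': 0.9, 'goes_woof': 0.8, 'long_curly_tail': 0.2}
bias = -1.0  # how much evidence we need before we call it a dog

activation = sum(features[name] * weights[name] for name in features) + bias
print('dog' if activation > 0 else 'cat')  # -> 'dog'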

What was initially designed to model the way the brain works soon showed remarkable utility in applied computation, to the point that the US Navy was roped into building an actual, physical perceptron machine – the first application of computer vision. However, it was a complete bust. It turned out that a single-layer perceptron couldn’t really recognise a lot of patterns. What it lacked was depth.

What do we mean by depth? Consider the human brain. The brain actually doesn’t have a single part devoted to vision. Rather, it has six separate areas5 – the striate cortex (V1) and the extrastriate areas (V2-V6). These form a feedforward pathway of sorts, where V1 feeds into V2, which feeds into V3 and so on. To massively oversimplify: V1 detects optical features like edges, which it feeds on to V2, which breaks these down into more complex features: shapes, orientation, colour &c. As you proceed towards the back of the head, the visual centres detect increasingly complex abstractions from the simple visual information. What was found is that by putting layers and layers of neurons after one another, even very complex patterns can be identified accurately. There is a hierarchy of features, as the facial recognition example below shows.

The first hidden layer recognises simple geometries and blobs at different parts of the image. The second hidden layer fires if it detects particular manifestations of parts of the face – noses, eyes, mouths. Finally, the third layer fires if it ‘sees’ a particular combination of these. Much like an identikit image, a face is recognised because it contains parts of a face, which in turn are recognised because they contain a characteristic spatial alignment of simple geometries.

There’s much more to deep learning than what I have tried to convey in a few paragraphs. The applications are endless. With the cost of computing decreasing rapidly, deep learning applications have now become feasible in just about all spheres where they can be applied. And they excel everywhere, outpacing not only other machine learning approaches (which makes me absolutely stoked about the future!) but, at times, also humans.


Which leads me back to Miyazaki. You see, deep learning can’t just classify things or predict stock prices. It can also create stuff. To put an old misunderstanding to rest quite early: generative neural networks are genuinely creating new things. Rather than merely combining pre-programmed elements, they come as close as anything non-human can come to creativity.

The pinnacle of it all, generating enjoyable music, is still some ways off, and we have yet to enjoy a novel written by a deep learning engine. But to anyone who has been watching the rapid development of deep learning and especially generative algorithms based on deep learning, these are literally just questions of time.

Or perhaps, as Miyazaki said, questions of the ‘end of times’.

What sets a computer-generated piece apart from a human’s composition? Someday, they will be, as far as quality is concerned, indistinguishable. Yet something that will always set them apart is the absence of a creator.

In what is probably one of the worst-written essays in 20th century literary criticism, a field already overflowing with bad prose for bad prose’s sake – his 1967 essay La mort de l’auteur – Roland Barthes posited a sort of separation between the author and the text, countering centuries of literary criticism that sought to explain the meaning of the latter by reference to the former. According to Barthes, texts (and so, compositions, paintings &c.) have a life and existence of their own. To liberate works of art from an ‘interpretive tyranny’ that is almost self-explanatorily imposed on them, they must be read, interpreted and understood by reference to their audience and not their author. Indeed, Barthes eschews the term ‘author’ in favour of ‘scriptor’, the latter hearkening back to the Medieval monks who copied manuscripts: like them, the scriptor is not in control of the narrative or work of art that he or she composes. Devoid of the author’s authority, the work of art is now free to exist in a liberated state that allows you – the recipient – to establish its essential meaning.

Oddly, that’s not entirely what post-modernism seems to have created. If anything, there is now an increased focus on the author, at the very least in one particular sense. Consider the curious case of Wagner’s works in Israel. Because of his anti-Semitic views, arguably as well as due to the favour his music found during the tragic years of the Third Reich, Wagner’s works – even those that do not even remotely express a political position – are rarely played in Israel. Even in recent years, other than Holocaust survivor Mendi Rodan’s performance of the Siegfried Idyll in 2000, there have been very few instances of Wagner played in Israel – despite the curious fact that Theodor Herzl, founder of Zionism, admired Wagner’s music (if not his vile racial politics). Rather than the death of the author, we more often witness the death of the work. The taint of the author’s life comes to haunt the chords of their compositions and the stanzas of their poetry, every brush-stroke of theirs forever imbued with the often very human sins and mistakes of their lives.

Less dramatic, perhaps, than Wagner’s case are the increasingly frequent boycotts, outbursts and protests against works of art solely based on the character of the author or composer. One must only look at the recent past to see protests, for instance, against the works of HP Lovecraft, themselves having to do more with eldritch horrors than racist horridness, due to the author’s admittedly reprehensible views on matters of race. Outrages about one author or another, one artist or the next, are commonplace, acted out on a daily basis on the Twitter gibbets and the Facebook pillory. Rather than the death of the author, we experience the death of art, amidst an increasingly intolerant culture towards the works of flawed or sinful creators.

This is, of course, not to excuse any of those sins or flaws. They should not, and cannot, be excused. Rather, perhaps, it is to suggest that part of a better understanding of humanity is that artists are a cross-section of us as a species, equally prone to be misled and deluded into adopting positions that, as the famous German anti-Fascist and children’s book author Erich Kästner said, ‘feed the animal within man’. Nor is this to condone or justify art that actively expresses those reprehensible views – an entirely different issue. Rather, I seek merely to draw attention to the increased tendency to condemn works of art for the artist’s political sins. In many cases, these sins are far from being as straightforward as Lovecraft’s bigotry and Wagner’s anti-Semitism. In many cases, these sins can be as subtle as going against the drift of public opinion, the Orwellian sin of ‘wrongthink’. With the internet having become a haven of mob mentality (something I personally was subjected to a few years ago), the threshold of what sins of the creator shall be visited upon their creations has significantly decreased. It’s not the end of days, but you can see it from here.

In which case perhaps Miyazaki is right.

Perhaps what we need is art produced by computers.


As Miyazaki-san said, we are losing faith in ourselves. Not in our ability to create wonderful works of art, but in our ability to measure up to some flawless ethos, to some expectation of the artist as the flawless being. We are losing faith in our artists. We are losing faith in our creators, our poets and painters and sculptors and playwrights and composers, because we fear that the inevitable revelation of greater – or perhaps lesser – misdeeds or wrongful opinions from their past will not merely taint them: it will taint us no less, the fans and aficionados and cognoscenti. Put not your faith in earthly artists, for they are fickle, and prone to having opinions that might be unacceptable, or be seen as such someday. Is it not a straightforward response, then, to declare one’s love for the intolerable synthetic Baroque of Stanford machine learning genius Cary Kaiming Huang’s research? In a society where the artist’s sins taint the work of art and, through that, all those who confessed to enjoying his works, there’s no other safe bet. Only the AI can cast the first stone.

And if the cost of that is truly the chirps of Cary’s synthetic Baroque generator, Miyazaki is right on the other point, too. It truly is the end of days.

References

1. Least of all because I know how rudimentary and lame his work is. I’ve built evolutionary models of locomotion where the first stages look like this. There’s no cutting edge science here.
2. There’s a whole aspect of the story called feature extraction, which I will ignore for the sake of simplicity, and assume that it just happens. It doesn’t, of course, and it plays a huge role in identifying things, but this story is complex enough already as it is.
3. McCulloch, W. and Pitts, W. (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics 5 (4): 115–133. doi:10.1007/BF02478259.
4. Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain. Psychological Review 65 (6): 386–408. doi:10.1037/h0042519.
5. Or five, depending on whether you consider the dorsomedial area a separate area of the extrastriate cortex.