Structuring R projects

There are some things that I call Smith goods:1 things I want, nay, require, but hate doing. A clean room is one of these – I have a visceral need to have some semblance of tidiness around me, I just absolutely hate tidying, especially in the summer.2 Starting and structuring packages and projects is another of these things, which is why I'm so happy things like cookiecutter exist that do it for you in Python. I am famously laid back about structuring R projects – my chill attitude is only occasionally compared to the Holy Inquisition, the other Holy Inquisition and Gunny R. Lee Ermey's portrayal of Drill Sgt. Hartman, and it's been months since I last gutted an intern for messing up namespaces.3 So while I don't like structuring R projects, I keep doing it, because I know it matters. That's a pearl of wisdom that came occasionally at a great price, some of which I hope to save you by this post.

Five principles of structuring R projects

Every R project is different. Therefore, when structuring R projects, there has to be a lot more adaptability than there normally is. When structuring R projects, I try to follow five overarching principles.

  1. The project determines the structure. In a small exploratory data analysis (EDA) project, you might have some leeway as to structural features that you might not have when writing safety-critical or autonomously running code. This variability in R – reflective of the diversity of its use – means that it’s hard to devise a boilerplate that’s universally applicable to all kinds of projects.
  2. Structure is a means to an end, not an end in itself. The reason why gutting interns, scalping them or yelling at them Gunny style are inadvisable is not just the additional paperwork it creates for HR. Rather, the point of the whole exercise is to create people who understand why the rules exist and organically adopt them, understanding how they help.
  3. Rules are good, tools are better. When tools are provided that take the burden of adherence – linters, structure generators like cookiecutter, IDE plugins, &c. – off the developer, adherence is both more likely and simpler.
  4. Structures should be interpretable to a wide range of collaborators. Even if you have no collaborators, think from the perspective of an analyst, a data scientist, a modeller, a data engineer and, most importantly, the client who will receive the overall product at the very end.
  5. Structures should be capable of evolution. Your project may change objectives, it may evolve, it may grow or shrink. What was a pet project might become a client product. What was designed to be a massive, error-resilient superstructure might have to scale down. And most importantly, your single-player adventure may end up turning into an MMORPG. Your structure has to be able to roll with the punches.

A good starting structure

Pretty much every R project can be imagined as a sort of process: data gets ingested, magic happens, then the results – analyses, processed data, and so on – get spit out. The absolute minimum structure reflects this:

.
└── my_awesome_project
    ├── src
    ├── output
    ├── data
    │   ├── raw
    │   └── processed
    ├── README.md
    ├── run_analyses.R 
    └── .gitignore

In this structure, we see this reflected by having a data/ folder (a source), a folder for the code that performs the operations (src/) and a place to put the results (output/). The root analysis file (the sole R file on the top level) is responsible for launching and orchestrating the functions defined in the src/ folder’s contents.

The data folder

The data folder is, unsurprisingly, where your data goes. In many cases, you may not have any file-formatted raw data (e.g. where the raw data is accessed via a *DBC connection to a database), and you might even keep all intermediate data in the database, although that's pretty uncommon on the whole, and might not make you the local DBA's favourite (not to mention data protection issues). So while the raw/ subfolder might be dispensed with, you'll most definitely need a data/ folder.

When it comes to data, it is crucial to make a distinction between source data and generated data. Rich Fitzjohn puts it best when he says to treat

  • source data as read-only, and
  • generated data as disposable.

The preferred implementation I have adopted is to have

  • a data/raw/ folder, which is usually symlinked to a folder that is write-only to clients but read-only to the R user,4
  • a data/temp/ folder, which contains temporary data, and
  • a data/output/ folder, if warranted (a minimal sketch of this layout follows).
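As an illustration only (the post does not prescribe any particular tooling), this skeleton can be created with a few lines of base R; the project name is a placeholder:

# Minimal sketch: create the folder layout described above.
project_root <- "my_awesome_project"

dirs <- file.path(project_root,
                  c("src", "output", "data/raw", "data/temp", "data/output"))

# recursive = TRUE creates intermediate folders (e.g. data/) as needed
invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))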

The src folder

Some call this folder R/ – I find this a misleading practice, as you might have C++, bash and other non-R code in it, but it is unfortunately enforced by R if you want to structure your project as a valid R package, which I advocate in some cases. I am a fan of structuring the src/ folder, usually by logical function. There are two systems of nomenclature that have worked really well for me and people I work with:

  • The library model: in this case, the root folder of src/ holds individual .R scripts that when executed will carry out an analysis. There may be one or more such scripts, e.g. for different analyses or different depths of insight. Subfolders of src/ are named after the kind of scripts they contain, e.g. ETL, transformation, plotting. The risk with this structure is that sometimes it’s tricky to remember what’s where, so descriptive file names are particularly important.
  • The pipeline model: in this case, there is a main runner script (or potentially a small number of them), which executes the other scripts in sequence. It is a sensible idea in such a case to establish sequential subfolders, or sequentially numbered scripts, that are executed in order. Typically, this model performs better if there are at most a handful of distinct pipelines.

Whichever approach you adopt, a crucial point is to keep function definition and application separate. This means that only the pipeline or runner scripts are allowed to execute (apply) functions, while the other files merely define them. Typically, folder-level segregation works best for this:

  • keep all function definitions in subfolders of src/, e.g. src/data_engineering, and have the directly-executable scripts directly under src/ (this works better for larger projects), or
  • keep function definitions in src/, and keep the directly executable scripts in the root folder (this is more convenient for smaller projects, where perhaps the entire data engineering part is not much more than a single script). A minimal sketch of the first arrangement follows.
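As a sketch only, with invented file and function names, the separation might look like this: everything under src/data_engineering/ defines functions, and the runner directly under src/ is the only file that calls them.

# src/data_engineering/clean_data.R: definitions only, nothing is executed here
clean_data <- function(raw) {
  # hypothetical cleaning step: drop incomplete rows
  raw[complete.cases(raw), ]
}

# src/run_pipeline.R: the runner applies the functions defined elsewhere
source("src/data_engineering/clean_data.R")

raw       <- read.csv("data/raw/observations.csv")   # placeholder file name
processed <- clean_data(raw)
write.csv(processed, "data/processed/observations.csv", row.names = FALSE)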

output and other output folders

Output may mean a range of things, depending on the nature of your project. It can be anything from a whole D.Phil thesis written in a LaTeX-compliant form to a brief report to a client. There are a couple of conventions with regard to output folders that are useful to keep in mind.

Separating plot output

It is common to have a separate folder for plots (usually called figs/ or plots/), usually so that they can be used for various purposes. My personal preference is that plot output folders should be subfolders of output/, rather than top-tier folders, unless the plots themselves are the objective of the project, i.e. where the project is intended to create a particular plot on a regular basis. This was the case, for instance, with the CBRD project, whose purpose was to regularly generate daily epicurves for the DRC Zaire ebolavirus outbreak.

With regard to maps, in general, the principle that has worked best for teams I ran was to treat static maps as plots. However, dynamic maps (e.g. LeafletJS apps), tilesets, layers or generated files (e.g. GeoJSON files) tend to deserve their own folder.

Reports and reporting

Not every project needs a reporting folder, but for business users, having a nice, pre-written reporting script that can be run automatically and produces a beautiful PDF report every day can be priceless. A large organisation I worked for in the past used this very well to monitor their Amazon AWS expenditure.5 A team of over fifty data scientists worked on a range of EC2 instances, and runaway spending (from provisioning instances that were too big, leaving instances on, and data transfer charges resulting from misconfigured instances6) was rampant. So the client wanted daily, weekly, monthly and 10-day rolling usage nicely plotted in a report, by user, highlighting people who would go on the naughty list. This was accomplished very well by an RMarkdown template that was 'knit' every day at 0600 and uploaded as an HTML file onto an internal server, so that every user could see who had been naughty and who had been nice. EC2 usage costs went down by almost 30% in a few weeks, and that was without having to dismember anyone!7

Probably the only structural rule to keep in mind is to keep reports and reporting code separate. Reports are client products, reporting code is a work product and therefore should reside in src/.
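A minimal sketch of how this might look (file names are placeholders; the rmarkdown package is assumed to be available) keeps the reporting code in src/ and writes the client-facing product to output/:

# src/reporting/render_daily_report.R
# Renders the report template into output/, keeping the reporting code (work
# product) separate from the rendered report (client product).
rmarkdown::render(
  input       = "src/reporting/daily_report.Rmd",  # placeholder template
  output_file = paste0("daily_report_", Sys.Date(), ".html"),
  output_dir  = "output"
)

A cron job or a scheduled task can then run Rscript src/reporting/render_daily_report.R at 0600 every day.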

Requirements and general settings

I am, in general, not a huge fan of outright loading whole packages to begin with. Too many users of R don’t realise that

  • you do not need to attach (library(package)) a package in order to use a function from it – as long as the package is available to R, you can simply call the function as package::function(arg1, arg2, ...), and
  • attaching a package using library(package) puts every exported function from that package onto the search path, masking any previously attached functions of the same name. This means that in order to deterministically know what any given symbol refers to, you would have to know, at all times, the order of package attachments. Needless to say, there is enough to keep in one's mind when coding in R without worrying about this (a short example follows this list).
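For illustration, dplyr and stats are used below only because their filter() functions famously collide; calling through the namespace sidesteps the masking problem entirely:

# After library(dplyr), the bare symbol filter() no longer means stats::filter():
library(dplyr)

# Fully qualified calls are unambiguous, with or without attaching anything:
dplyr::filter(mtcars, cyl == 6)       # row-wise filtering of a data frame
stats::filter(1:10, rep(1/3, 3))      # a moving-average linear filter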

However, some packages might be worth attaching, and sometimes it's useful to have an initialisation script. This may be the case in three particular scenarios:

  • You need a particular locale setting, or a particularly crucial environment setting.
  • It’s your own package and you know you’re not going to shadow already existing symbols.
  • You are not using packrat or some other package management solution, and definitely need to ensure some packages are installed, but prefer not to put the clunky install-if-not-present code in every single script.

In these cases, it's sensible to have a file you source before every top-level script – in an act of shameless thievery from Python, I tend to call this requirements.R, and it includes some fundamental settings I like to rely on, such as setting the locale appropriately. It also includes a CRAN install check script, although I would very much advise the use of Packrat over the latter, since a simple install check is not version-sensitive.
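A minimal sketch of what such a requirements.R might contain (the package list and locale are placeholders, and this is the naive install-if-missing check rather than a version-aware solution like Packrat):

# requirements.R: sourced at the top of every top-level script

# Fundamental settings, e.g. a consistent locale for date handling
Sys.setlocale("LC_TIME", "en_GB.UTF-8")

# Naive CRAN install check: install anything that is missing.
# Not version-sensitive; prefer Packrat for real dependency management.
required_packages <- c("data.table", "ggplot2", "jsonlite")   # placeholder list
missing <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing) > 0) {
  install.packages(missing)
}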

Themes, house style and other settings

It is common, in addition to all this, to keep some general settings. If your institution has a 'house style' for ggplot2 (e.g. a ggthemr file), this could be part of your project's config. But where does this best live?

It would normally be perfectly fine to keep your settings in a config.R file at root level, but a config/ folder is much preferred, as it prevents clutter if you derive any of your configurations from a git submodule. I'm a big fan of keeping house styles and other things intended to give a shared appearance to code and outputs (e.g. linting rules, text editor settings, map themes) in separate – and very, very well managed! – repos, as this ensures consistency across the board over time. As a result, most of my projects have a config folder instead of a single configuration file.

It is paramount to separate project configuration and runtime configuration:

  • Project configuration pertains to the project itself, its outputs, schemes, the whole nine yards. For instance, the paper size to use for generated LaTeX documents would normally be a project configuration item. Your project configuration belongs in your config/ folder.
  • Runtime configuration pertains to parameters that relate to individual runs. In general, you should aspire to have as few of these as possible – and if you do need them, you should keep them as environment variables. But if you do decide to keep them in a file, it's generally a good idea to keep it at the top level, and to store it not as an R file but as, say, a JSON file. There is a range of tools that can programmatically edit and change such file formats, while changing R files programmatically is fraught with difficulties.

Keeping runtime configuration editable

A few years ago, I worked on a viral forecasting tool where a range of model parameters to build the forecast from were hardcoded as R variables in a runtime configuration file. It was eventually decided to create a Python-based web interface on top of it, which would allow users to see the results as a dashboard (reading from a database where forecast results would be written) and make adjustments to some of the model parameters. The problem was, that’s really not easy to do with variables in an R file.

On the other hand, Python can easily read a JSON file into memory, change values as requested and write it back to the file system. So instead, the web interface stored the parameters in a JSON file, from which R would then read them and execute accordingly. It worked like a charm. Bottom line: configurations are data, and using code to store data is bad form.
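A minimal sketch of this pattern using jsonlite; the file name and parameter names are invented for illustration:

# parameters.json (kept at the top level, editable by any tool or language):
# {
#   "forecast_horizon_days": 14,
#   "smoothing_window": 7
# }

params <- jsonlite::fromJSON("parameters.json")

# The R code consumes the values; it never stores them as R assignments.
horizon <- params$forecast_horizon_days
window  <- params$smoothing_window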

Dirty little secrets

Everybody has secrets. In all likelihood, your project is no different: passwords, API keys, database credentials, the works. The first rule of this, of course, is never to hardcode credentials in code. But you will need to work out how to make your project work, including under version control, while also not divulging credentials to the world at large. My preferred solutions, in order of preference, are:

  1. the keyring package, which interacts with OS X’s keychain, Windows’s Credential Store and the Secret Service API on Linux (where supported),
  2. using environment variables,
  3. using a secrets file that is .gitignored,
  4. using a config file that’s .gitignored,
  5. prompting the user.

Let’s take these – except the last one, which you should consider only as a measure of desperation, as it relies on RStudio and your code should aspire to run without it – in turn.

Using keyring

keyring is an R package that interfaces with the operating system's keychain management solution, and works without any additional software on OS X and Windows.8 Using keyring is delightfully simple: it conceives of an individual key as belonging to a keyring and identified by a service. By reference to the service, the key can then be retrieved easily once the user has authenticated to access the keychain. It has two drawbacks to be aware of:

  • It’s an interactive solution (it has to get access permission for the keychain), so if what you’re after is R code that runs quietly without any intervention, this is not your best bet.
  • A key can only contain a username and a password, so it cannot store more complex credentials, such as four-part secrets (e.g. in OAuth, where you may have a consumer and a publisher key and secret each). In that case, you could split them into separate keyring keys.

However, for most interactive purposes, keyring works fine. This includes single-item secrets, e.g. API keys, where you can use some junk as your username and hold on only to the password. By default, the operating system's 'main' keyring is used, but you're welcome to create a new one for your project. Note that users may be prompted for a keychain password at call time, and it's helpful if they know what's going on, so be sure to document your keyring calls well.

To set a key, simply call keyring::key_set(service = "my_awesome_service", username = "my_awesome_user"). This will launch a dialogue using the host OS's keychain handler to request authentication to access the relevant keychain (in this case, the system keychain, as no keychain is specified), and you can then retrieve the following (a short end-to-end sketch follows the list):

  • the username: using keyring::key_list("my_awesome_service")[1,2], and
  • the password: using keyring::key_get("my_awesome_service").
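Putting the pieces together (the service name and the throwaway username are placeholders), a single-item secret such as an API key can be stored and retrieved like this:

# Store the secret once (interactive: the OS will ask for keychain access).
# For a single-item secret, the username can be a throwaway label.
keyring::key_set(service = "my_awesome_service", username = "api")

# Later, retrieve it without ever writing the secret into your code:
api_user <- keyring::key_list("my_awesome_service")[1, 2]            # the username
api_key  <- keyring::key_get("my_awesome_service", username = "api") # the secret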

Using environment variables

Using environment variables to hold certain secrets has become extremely popular, especially for Dockerised implementations of R code, as envvars can be very easily set using Docker. The thing to remember about environment variables is that they're 'relatively private': they're not part of the codebase, so they will definitely not accidentally get committed to the VCS, but everyone who has access to the particular user session will be able to read them. This may be an issue when e.g. multiple people are sharing the ec2-user account on an EC2 instance. The other drawback of envvars is that if there's a large number of them, setting them can be a pain. R has a little workaround for that: if you create an envfile called .Renviron in the working directory, R will read it at the start of the session and set the environment variables it defines. So, for instance, the following .Renviron file will bind an API key and a username:

api_username = "my_awesome_user"
api_key = "e19bb9e938e85e49037518a102860147"

So when you then call Sys.getenv("api_username"), you get the correct result. It's worth keeping in mind that the .Renviron file is read once, and once only: at the start of the R session. Changes made after that will not propagate into the session until it ends and a new session is started. It's also rather clumsy to edit by hand, although most parsers used to ini-style files will, with the occasional grumble, digest a .Renviron.
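Since Sys.getenv() returns an empty string rather than an error when a variable is unset, a small defensive check, sketched below, is cheap insurance:

# Read an environment variable and fail loudly if it has not been set.
api_key <- Sys.getenv("api_key")
if (!nzchar(api_key)) {
  stop("api_key is not set: add it to .Renviron or export it before running.")
}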

Needless to say, committing the .Renviron file to the VCS is what is sometimes referred to as making a chocolate fireman in the business, and is generally a bad idea.

Using a .gitignored config or secrets file

config is a package that allows you to keep a range of configuration settings outside your code, in a YAML file, then retrieve them. For instance, you can create a default configuration for an API:

default:
    my_awesome_api:
        url: 'https://awesome_api.internal'
        username: 'my_test_user'
        api_key: 'e19bb9e938e85e49037518a102860147'

From R, you could then access this using the config::get() function:

my_awesome_api_configuration <- config::get("my_awesome_api")

This would then allow you to refer to the URL as my_awesome_api_configuration$url, and to the API key as my_awesome_api_configuration$api_key. As long as the configuration YAML file is kept out of the VCS, all is well. The problem is that not everything in such a configuration file is supposed to be secret. For database access, for instance, it makes sense to keep the other parameters DBI::dbConnect() needs for a connection available to collaborators, and keep only the password private. So .gitignoreing the whole config file is not a good idea.
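One way to square that circle, sketched here with invented configuration keys and with keyring standing in for whatever secret store you prefer, is to keep the shareable connection parameters in the config file and fetch only the password from a secret source at call time:

# config.yml (committed): server, driver and username are not secrets
# default:
#     warehouse:
#         driver: 'PostgreSQL'
#         server: 'warehouse.internal'
#         username: 'analytics_ro'

db_conf <- config::get("warehouse")

conn <- DBI::dbConnect(odbc::odbc(),
                       Driver = db_conf$driver,
                       Server = db_conf$server,
                       UID    = db_conf$username,
                       PWD    = keyring::key_get("warehouse"))  # the secret never touches the repo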

A somewhat better idea is a dedicated secrets file. This file can be safely .gitignored, because it definitely only contains secrets, and nothing in it needs to be shared via the repository. As previously noted, create it in a format that can be widely read and written (JSON, YAML).9 For reasons noted in the next subsection, the thing you should definitely not do is create a secrets file that consists of R variable assignments, however convenient an idea that may appear at first. Because…

Whatever you do…

One of the best ways to mess up is creating a fabulous way of keeping your secret credentials truly secret… then loading them into the global scope. Never, ever assign credentials. Ever.

You might have seen code like this:

dbuser <- Sys.getenv("dbuser")
dbpass <- Sys.getenv("dbpass")

conn <- DBI::dbConnect(odbc::odbc(), UID = dbuser, PWD = dbpass)
This will work perfectly, except that once it's done, it will leave the password and the user name, in unencrypted plaintext (!), in the global scope, accessible to any code. That's not just extremely embarrassing if, say, your wife of ten years discovers that your database password is your World of Warcraft character's first name, but also a potential security risk. Never put credentials into any environment if possible, and if it has to happen, at least make it happen within a function so that they don't end up in the global scope. The correct way to do the above would be more akin to this:

create_db_connection <- function() {
    # The credentials are read and used inside the function's scope only;
    # the last expression is returned automatically, no explicit return() needed.
    DBI::dbConnect(odbc::odbc(), UID = Sys.getenv("dbuser"), PWD = Sys.getenv("dbpass"))
}

Concluding remarks

Structuring R projects is an art, not just a science. Many best practices are highly domain-specific, and learning them generally happens by trial and error (and the occasional pratfall). In many ways, it's the bellwether of an R developer's skill trajectory, because it shows whether they possess the tenacity and endurance it takes to do meticulous, fine and often rather boring work in pursuit of future success – or at the very least, an easier time debugging things in the future. Studies show that one of the greatest predictors of success in life is being able to tolerate deferred gratification, and structuring R projects is a pure exercise in that discipline.

At the same time, a well-executed structure can save valuable developer time, prevent errors and allow data scientists to focus on the data rather than on debugging, on trying to find where that damn snippet of code is, or on scratching their heads trying to figure out what a particularly obscurely named function does. What might feel like an utter waste of time has enormous potential to create value for the individual, the team and the organisation.

I'm sure there are many aspects of structuring R projects that I have omitted or ignored – in many ways, it is my own experiences that inform and motivate these commentaries on R. Some of these observations are echoed by many authors, others diverge greatly from commonly held wisdom. As with all concepts in development, I encourage you to read widely, get to know as many different ideas about structuring R projects as possible, and synthesise your own style. As long as you keep in mind why structure matters and what its ultimate aims are, you will arrive at a form of order out of chaos that will be productive, collaborative and useful, not just for your own development but for others' work as well.

My last commentary, on defensive programming in R, spawned a lively and exciting debate on Reddit, and many have made extremely insightful comments there. I'm deeply grateful to all who contributed. I hope you will also consider posting your observations in the comment section below – that way, the comments will stay together with the original content.

References

1. As in, Adam Smith.
2. It took me years to figure out why. It turns out that I have ZF alpha-1 antitrypsin deficiency. As a consequence, even minimal exposure to small particulates and dust can set off violent coughing attacks and impair breathing for days. Symptoms tend to be worse in hot weather due to impaired connective tissue something-or-other.
3. That's a joke. I don't gut interns – they're valuable resources, HR shuns dismembering your coworkers, it creates paperwork and I liked every intern I've ever worked with – but most importantly, once gutted like a fish, they are not going to learn anything new. I prefer gentle, structured discussions on the benefits of good package structure. Please respect your interns – they are the next generation, and you are probably one of their first examples of what software development/data science leadership looks like. The waves you set into motion will ripple through generations, well after you're gone. You had better set a good example.
4. Such a folder is often referred to as a 'dropbox', and the typical corresponding octal setting, 0422, guarantees that the R user will not accidentally overwrite data.
5. The organisation consented to me telling this story but requested anonymity, a request I honour whenever legally possible.
6. In case you're unfamiliar with AWS: it's a cloud service where elastic computing instances (EC2 instances) reside in 'regions', e.g. us-west-1a. There are (small but nonzero) charges for data transfer between regions. If you're in one region but you configure the yum repo server of another region as your default, there will be costs and, eventually, tears – provision ten instances with a few GBs worth of downloads, and there'll be yelling. This is now more or less impossible to do except on purpose, but one must never underestimate what users are capable of from time to time!
7. Or so I'm told.
8. Linux users will need libsecret 0.16 or above, and sodium.
9. XML is acceptable if you're threatened with waterboarding.

Assignment in R: slings and arrows

Having recently shared my post about defensive programming in R on the r/rstats subreddit, I was blown away by the sheer number of comments as much as by the insight many of them displayed. One particular comment, by u/guepier, caught my attention. In my previous post, I came out quite vehemently against using the = operator to effect assignment in R. u/guepier made a great point, however:

But point 9 is where you’re quite simply wrong, sorry:

never, ever, ever use = to assign. Every time you do it, a kitten dies of sadness.

This is FUD, please don’t spread it. There’s nothing wrong with =. It’s purely a question of personal preference. In fact, if anything <- is more error-prone (granted, this is a very slight chance but it’s still higher than the chance of making an error when using =).

Now, assignment is no doubt a hot topic – a related issue, assignment expressions, recently led to Python's BDFL being forced to resign – so I'll have to tread carefully. A surprising number of people have surprisingly strong feelings about assignment and assignment expressions. In R, this is complicated by its unusual assignment structure, involving two assignment operators that are just different enough to be trouble.

A brief history of <-

This is the IBM Model M SSK keyboard. The APL symbols are printed on it in somewhat faint yellow.

There are many ways in which <- in R is anomalous. For starters, it is rare to find a binary operator that consists of two characters – which is an interesting window on the R <- operator’s history.

The <- operator, apparently, stems from a day long gone by, when keyboards existed for the programming language eldritch horror that is APL. When R's precursor, S, was conceived, APL keyboards and printing heads existed, and these could print a single ← character. It was only after most standard keyboard layouts ended up eschewing this all-important symbol that S, and later R, accepted the digraphic <- as a substitute.

OK, but what does it do?

In the Brown Book (Richard A. Becker and John M. Chambers (1984). S: An Interactive Environment for Data Analysis and Graphics), the underscore was actually an alias for the arrow assignment operator! Thankfully, this did not make it into R.
<- is one of the first operators anyone encounters when familiarising themselves with the R language. The general idea is quite simple: it is a directionally unambiguous assignment, i.e. it indicates quite clearly that the right-hand side value (rhs, in the following) will replace the left-hand side variable (lhs), or be assigned to the newly created lhs if it has not yet been initialised. Or that, at the very least, is the short story.

Because quite peculiarly, there is another way to accomplish a simple assignment in R: the equality sign (=). And because on the top level, a <- b and a = b are equivalent, people have sometimes treated the two as being quintessentially identical. Which is not the case. Or maybe it is. It’s all very confusing. Let’s see if we can unconfuse it.

The Holy Writ

The Holy Writ, known to the uninitiated as the R Manual, has this to say about assignment operators and their differences:

The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.

If this sounds like absolute gibberish, or you cannot think of what would qualify as not being on the top level or a subexpression in a braced list of expressions, welcome to the squad – I've had R experts scratch their heads about this for an embarrassingly long time until they realised what the R documentation, in its neutron-star-like denseness, actually meant.

If it’s in (parentheses) rather than {braces}, = and <- are going to behave weird

To translate the scriptural words quoted above into human speak, this means = cannot be used in the conditional part (the part enclosed by (parentheses) as opposed to {curly braces}) of control structures, among others. This is less an issue between <- and =, and more an issue between = and ==. Consider the following example:

x = 3

if(x = 3) 1 else 0
# Error: unexpected '=' in "if(x ="

So far so good: you should not use a single equality sign as an equality test operator. The right way to do it is:

if(x == 3) 1 else 0
# [1] 1

But what about arrow assignment?

if(x <- 3) 1 else 0
# [1] 1

Oh, look, it works! Or does it?

if(x <- 4) 1 else 0
# [1] 1

The problem is that an assignment returns the value that was assigned, and if() then coerces that value to a logical. So instead of comparing x to 4, R assigned 4 to x, and since 4 coerces to TRUE, it happily informed us that the condition is true.
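A quick way to convince yourself that it is the assigned value, not the mere success of the assignment, that drives the result (this snippet is an added illustration, not part of the original example) is to assign something that coerces to FALSE:

if (x <- 0) 1 else 0
# [1] 0

x
# [1] 0   (the assignment itself succeeded just fine)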

The bottom line is not to use = as a comparison operator, and not to use <- as anything at all in a control flow expression's conditional part. Or as John Chambers notes,

Disallowing the new assignment form in control expressions avoids programming errors (such as the example above) that are more likely with the equal operator than with other S assignments.

Chain assignments

One example of where <- and = behave differently (or rather, where one behaves and the other throws an error) is chain assignment. In a chain assignment, we exploit the fact that R assigns from right to left. The sole criterion is that every member of the chain except the rightmost must be capable of being assigned to.

# Chain assignment using <-
a <- b <- c <- 3

# Chain assignment using =
a = b = c = 3

# Chain assignment that will, unsurprisingly, fail
a = b = 3 = 4
# Error in 3 = 4 : invalid (do_set) left-hand side to assignment

So we’ve seen that as long as the chain assignment is logically valid, it’ll work fine, whether it’s using <- or =. But what if we mix them up?

a = b = c <- 1
# Works fine...

a = b <- c <- 1
# We're still great...

a <- b = c = 1
# Error in a <- b = c = 1 : could not find function "<-<-"
# Oh.

The bottom line from the example above is that where <- and = are mixed, the = assignments must be the leftmost ones: because <- binds more tightly than =, a chain that puts <- to the left of an = does not parse as an assignment chain at all. In that one particular context, = and <- are not interchangeable.

A small note on chain assignments: many people dislike them because they're 'invisible' – they return their value invisibly, so nothing is printed. If that is an issue, you can surround your chain assignment with parentheses – regardless of whether it uses <-, = or a (valid) mixture thereof:

a = b = c <- 3
# ...
# ... still nothing...
# ... ... more silence...

(a = b = c <- 3)
# [1] 3

Assignment and initialisation in functions

This is the big whammy – one of the most important differences between <- and =, and a great way to break your code. If you have paid attention until now, you'll be rewarded with, hopefully, some interesting knowledge.

In a function call, = is pure argument binding: it matches a value to a parameter name and does not create any variable in the calling environment. <-, on the other hand, is always an assignment: used inside a function call, it creates a variable, named after its lhs, in the calling environment (which, at the top level, is the global environment). This becomes quite prominent when using it in function calls.

Traditionally, when invoking a function, we are supposed to bind its arguments in the format parameter = argument.1 And as we know from how functions work, the parameter's scope is restricted to the function body. To demonstrate this:

add_up_numbers <- function(a, b) {
    return(a + b)
}

add_up_numbers(a = 3, b = 5)
# [1] 8

a + b
# Error: object 'a' not found

This is expected: a (as well as b, but that didn’t even make it far enough to get checked!) doesn’t exist in the global scope, it exists only in the local scope of the function add_up_numbers. But what happens if we use <- assignment?

add_up_numbers(a <- 3, b <- 5)
# [1] 8

a + b
# [1] 8

Now, a and b still only exist in the local scope of the function add_up_numbers. However, using the <- assignment, we have also created new variables called a and b in the global scope. It's important not to confuse this with accessing the local scope, as the following example demonstrates:

add_up_numbers(c <- 5, d <- 6)
# [1] 11

a + b
# [1] 8

c + d
# [1] 11

In other words, a + b gave us the sum of the values a and b had in the global scope. When we invoked add_up_numbers(c <- 5, d <- 6), the following happened, in order:

  1. A variable called c was initialised in the global scope. The value 5 was assigned to it.
  2. A variable called d was initialised in the global scope. The value 6 was assigned to it.
  3. The function add_up_numbers() was called with the values of c and d as positional arguments.
  4. The value of c was bound to the parameter a in the function's local scope.
  5. The value of d was bound to the parameter b in the function's local scope.
  6. The function returned the sum of the variables a and b in the local scope.

It may sound more than a little tedious to think about this function in this way, but it highlights three important things about <- assignment:

  1. In a function call, an <- assignment whose lhs happens to share a parameter's name is not the same as using =, which simply binds a value to that parameter.
  2. <- assignment in a function call affects the calling environment (at the top level, the global scope); using = to provide an argument does not.
  3. Outside this context, <- and = have the same effect, i.e. they assign, or initialise and assign, in the current scope.

Phew. If that sounds like absolute confusing gibberish, give it another read and try playing around with it a little. I promise, it makes some sense. Eventually.

So… should you or shouldn’t you?

Which raises the question that launched this whole post: should you use = for assignment at all? Quite a few style guides, such as Google’s R style guide, have outright banned the use of = as assignment operator, while others have encouraged the use of ->. Personally, I’m inclined to agree with them, for three reasons.

  1. Because of the existence of ->, assignment is by definition best structured in a way that shows what is assigned to which side. a -> b and b <- a have a formal clarity that a = b does not have.
  2. Good code is unambiguous even if the language isn’t. This way, -> and <- always mean assignment, = always means argument binding and == always means comparison.
  3. Many argue that <- is ambiguous, as x<-3 may be mistyped as x<3 or x-3, or alternatively may be (visually) parsed as x < -3, i.e. comparing x to -3. In reality, this is a non-issue. RStudio has a built-in shortcut (Alt/⎇ + -) for <-, and automatically inserts a space before and after it. And if one adheres to sound coding principles and surrounds operators with whitespace, this issue does not arise.

Like with all coding standards, consistency is key. Consistently used suboptimal solutions are superior, from a coding perspective, to an inconsistent mixture of right and wrong solutions.

References

1. A parameter is an abstract 'slot' where you can put in values that configure a function's execution. Arguments are the actual values you put in. So add_up_numbers(a, b) has the parameters a and b, and add_up_numbers(a = 3, b = 5) has the arguments 3 and 5.

Herd immunity: how it really works

There are few concepts as trivial yet as widely misunderstood as herd immunity. In a sense, that's not all that surprising, because, frankly, there's something almost magical about it: herd immunity means that in a population, some people who are not or cannot be immunised continue to reap the benefits of immunisation. On its own, this may even seem counter-intuitive. And so, unsurprisingly, like many evidently true concepts, herd immunity has its malcontents, some going so far as to condemn the very idea as a 'CDC lie' – never mind that the concept was first used in 1923, well before the CDC was established.1

Now, let’s ignore for a moment what Dr Humphries, a nephrologist-turned-homeopath with a penchant for being economical with the truth when not outright lying, has to say – not because she’s a quack but because she has not the most basic idea of epidemiology. Instead, let’s look at this alleged ‘myth’ to begin with.

Herd immunity: the cold, hard maths

Our current understanding of herd immunity is actually the result of an increasing understanding of population dynamics in epidemiology towards the second half of the 20th century. There are, on the whole, two ways to explain it. Both describe the same phenomenon, and one can be derived from the other.

The simple explanation: effective R_0 depletion

The simple explanation rests on a simplification that makes it possible to describe herd immunity in terms that are intelligible at the level of high school maths. In epidemiology, R_0 (pron. 'arr-nought', like a pirate) describes the basic reproduction number of an infectious disease.2 To put it at its most simplistic: R_0 is the number of cases produced by each case. The illustration on the side shows the index case (IDX) and the first two generations of an infection with R_0 = 3.

Now, R_0 is a theoretical variable. It is usually estimated observationally, and it does not take measures intended to reduce transmission into account. And that's where it gets interesting.

Consider the following scenario, where a third of the population is vaccinated, denoted by dark black circles around the nodes representing them. One would expect that of the 13 persons, a third, i.e. about 4, would remain disease-free. But in fact, over half of the people will remain disease-free, including three who are not vaccinated, because the person in the previous generation did not pass the pathogen on to them. In other words, preventing spread, e.g. by vaccination or by quarantine, can reduce the effective R_0. In this case, the effective R_0 was closer to 1.66 than to 3 – vaccinating only a third of the population almost halved it.

We also know that for infections where the patient either dies or recovers, the infection has a simple ecology: every case must be ‘replaced’. In other words, if the effective R_0 falls below 1, the infection will eventually peter out. This happens quite often when everyone in a population is dead or immune after an infection has burned through it (more about that later).

Thus, the infection will be sustainable if and only if

R_{0} \geq 1

Under the assumption of a 100% efficacious vaccine, vaccinating a proportion p_V of the population reduces the effective reproduction number to R_0 (1 - p_V). Setting this equal to 1 gives the threshold value \bar{p_V}, above which the infection will no longer be able to sustain itself:

\bar{p_V} = 1 - \frac{1}{R_0}

Adjusting for vaccine efficacy, E_V, which is usually less than 100%, we get

\bar{p_V} = \frac{1-\frac{1}{R_0}}{E_V} = \frac{R_0 - 1}{R_0 E_V}

For a worked example, let’s consider measles. Measles has an R_0 around 15 (although a much higher value has been observed in the past, up to 30, in some communities), and the measles vaccine is about 96% effective. What percentage of the population needs to be vaccinated? Let’s consider \bar{p_V}, the minimum or threshold value above which herd immunity is effective:

\bar{p_V} = \frac{R_0 - 1}{R_0 E_V} = \frac{15-1}{15 \cdot 0.96} = \frac{14}{14.4} \approx 97.22\%
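This calculation is easy to wrap into a short R helper; a sketch only, mirroring the formula above:

# Herd immunity threshold: the proportion that must be vaccinated, given the
# basic reproduction number R0 and the vaccine efficacy E_V.
herd_immunity_threshold <- function(R0, E_V = 1) {
  (R0 - 1) / (R0 * E_V)
}

herd_immunity_threshold(R0 = 15,  E_V = 0.96)  # measles: ~0.9722
herd_immunity_threshold(R0 = 5.5, E_V = 0.88)  # mumps:   ~0.9298
herd_immunity_threshold(R0 = 2.0)              # Ebola:    0.5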

The more complex explanation: \mathcal{SIR} models

Note: this is somewhat complex maths and is generally not recommended unless you’re a masochist and/or comfortable with calculus and differential equations. It does give you a more nuanced picture of matters, but is not necessary to understand the whole of the argumentation. So feel free to skip it.

The slightly more complex explanation relies on a three-compartment model, in which the population is allocated to one of three compartments: \mathcal{S}usceptible, \mathcal{I}nfectious and \mathcal{R}ecovered. This model makes certain assumptions, such as that persons are infectious from the moment they're exposed and that once they recover, they're immune. There are various twists on the idea of a multicompartment model that take into account the fact that this is not true for every disease, but the overall idea is the same.3 In general, multicompartment models begin with everybody susceptible, plus a seed population of infectious subjects. Vaccination in such models is usually accounted for by treating vaccinated individuals as 'recovered', and thus immune, from t = 0 onwards.

Given an invariant population (i.e. it is assumed that no births, deaths or migration occur), the population can be described as consisting of the sum of the mutually exclusive compartments: P = \mathcal{S}(t) + \mathcal{I}(t) + \mathcal{R}(t). For the same reason, the total change over time is zero, i.e.

\frac{d \mathcal{S}}{d t} + \frac{d \mathcal{I}}{d t} + \frac{d \mathcal{R}}{d t} = 0

Under this assumption of a closed system, we can relate the volumes of each of the compartment to the transition probabilities \beta (from \mathcal{S} to \mathcal{I}) and \gamma (from \mathcal{I} to \mathcal{R}), so that:

\frac{d \mathcal{S}}{d t} = - \frac{\beta \mathcal{I} \mathcal{S}}{P}

\frac{d \mathcal{I}}{d t} = \frac{\beta \mathcal{I} \mathcal{S}}{P} - \gamma \mathcal{I}

\frac{d \mathcal{R}}{d t} = \gamma \mathcal{I}

Incidentally, in case you were wondering how this connects to the previous explanation: R_0 = \frac{\beta}{\gamma}.
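For readers who prefer to see the dynamics numerically, here is a minimal sketch of the model above using the deSolve package; the parameter values are arbitrary illustrations, not taken from any real outbreak:

library(deSolve)

# dS/dt, dI/dt and dR/dt exactly as in the equations above
sir <- function(time, state, parms) {
  with(as.list(c(state, parms)), {
    dS <- -beta * I * S / P
    dI <-  beta * I * S / P - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}

P     <- 1000                                  # invariant total population
state <- c(S = 999, I = 1, R = 0)              # one index case
parms <- c(beta = 0.5, gamma = 0.25, P = P)    # R_0 = beta / gamma = 2

out <- ode(y = state, times = seq(0, 150, by = 1), func = sir, parms = parms)
tail(out)  # S settles at a nonzero value: the individuals protected by herd immunity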

Now, let us consider the end of the infection. If \mathcal{S} is reduced sufficiently, the disease will cease to be viable. This does not require every individual to be recovered or immune, however, as is evident from dividing the first differential equation by the third, integrating and substituting R_0, which yields

\displaystyle \mathcal{S}(t) = \mathcal{S}(0) e^{\frac{-R_0 (\mathcal{R}(t)-\mathcal{R}(0))}{P}}

Substituting this in, the limit of \mathcal{R}, as t approaches infinity, is

\displaystyle \lim_{t\to\infty}\mathcal{R}(t) = P - \lim_{t\to\infty}\mathcal{S}(t) = P - \mathcal{S}(0) e^{\frac{-R_0 (\mathcal{R}(t)-\mathcal{R}(0))}{P}}

From the latter, it is evident that

\displaystyle \lim_{t\to\infty}\mathcal{S}(t) \neq 0 \mid \mathcal{S}(0) \neq 0

In other words, once the infection has burned out, there will still be some individuals who are not immune, not immunised and not vaccinated. These are the individuals protected by herd immunity. This is a pretty elegant explanation for why herd immunity happens and how it works. There are three points to take away from this.

First, herd immunity is not unique to vaccination. The above finding in relation to the nonzero limit of \lim_{t\to\infty}\mathcal{S}(t) holds as long as \mathcal{S}(0) \neq 0, regardless of what \mathcal{R}(0) is. In other words, herd immunity is not something artificial.

Second, for any i \in \mathcal{S} (that is, any susceptible person) at time t, the probability of which compartment he will be in at t+1 depends on whom he encounters. That, statistically, depends on the relative sizes of the compartments. In this model, the assumption is that the sample of individuals i encounters will reflect the relative proportions of the individual compartments' sizes. Thus if i meets n people at time t, each compartment will be proportionally represented, i.e. for any compartment \mathcal{C}, the proportion will be \frac{\mathcal{C}(t)}{P-1} for all \mathcal{C} \neq \mathcal{S}, for which the proportion will be \frac{\mathcal{S}(t) - 1}{P - 1}, since one cannot meet oneself. Given that the transition probability \beta_{i}(t) is assumed to equal the probability of meeting at least one element of \mathcal{I}, the following can be said: i's risk of infection depends on the relationship of n and \mathcal{I}(t), so that i is likely to get infected if

\displaystyle n \frac{\mathcal{I}(t)}{P-1} \geq 1

This elucidates two risk factors clearly, and the way to reduce them: reduce interactions (quarantine/self-quarantine), thereby reducing n, and reduce the proportion of infectious cases (\frac{\mathcal{I}(t)}{P-1}). The latter is where herd immunity from immunisation comes in. Recall that for a constant n, i‘s risk of infection at t rises as \mathcal{I}(t) rises.4 Recall also that while susceptible cases can turn into infectious cases, recovered (or vaccinated) cases cannot. And so, as \mathcal{R}(0) converges to P-1,5 i‘s risk of infection at any time t, denoted by \beta_{i}(t), falls. In other words,

\displaystyle \lim_{\mathcal{R}(0) \to P-1} \beta_{i}(t) = 0

Or to put it simply: the more people are vaccinated at the start, the lower the probability, all other things being equal, of meeting someone who can pass on the infection.6

A final point to note is that this is primarily a model of statistical dynamics, and deals with average probabilities. It does not – it cannot – take account of the fact that some susceptible people are just darn unlucky, and bump into a flock of unvaccinated, shiny-eyed snowflakes. Equally, in some places, susceptible people and infected people congregate, creating a viral breeding ground, also known as a Waldorf school. There are agent-based models, which are basically attempts at brute-force hacking reality, that can take account of such disparities. The takeaway is that herd immunity does not mean no susceptible individual will get infected. What it does mean is that their probability of getting infected is going to be significantly lower, for two reasons. First, given a constant number of encounters (n), the likelihood of any one of them being with an infectious individual is going to be much lower. More importantly, however, because of herd immunity, the disease is going to be able to persist in the population for a far shorter time – eventually it will burn through the small number of 'accessible' susceptible persons. Since the cumulative risk \beta_{i}^T for i \in \mathcal{S} for an infection that dies out after time T is defined as

\beta_i^T = \displaystyle \int\limits_0^T \beta_{i}(t) \, \mathrm{d}t

– the sooner the infection dies out, the smaller the likelihood that i will be infected. With that mathematical basis, let’s tackle a few of the myths about herd immunity.

Myth #1: herd immunity only works with naturally acquired immunity

This argument goes roughly along the following lines: herd immunity does exist, but it only exists if and where the immunity is acquired the ‘natural’ way, i.e. by surviving the disease. Case in point:

The $64,000 question, of course, is what the difference is between the residual immunity from a vaccine and the residual immunity from having survived the illness. A vaccine effectively 'simulates' the illness without actually causing the pathological symptoms. From the perspective of the immune system, it is largely irrelevant whether it has been exposed to an actual virus that can damage the body, or merely to a capsid protein that is entirely harmless but will nonetheless elicit the same immune reaction. That should suffice to bust this myth, but it's worth considering immunity quantitatively for a moment. As we have seen above, the source of immunity doesn't matter. In fact, it doesn't even have to be immunity: culling every animal except one in a herd is an entirely good way to reduce disease transmission. So is sealing oneself away from the rest of society and spending the evenings telling sexually explicit stories, as the heroes and heroines of Boccaccio's Decameron did, since we know that an individual is likely to become infected only if

\displaystyle n \frac{\mathcal{I}(t)}{P-1} \geq 1

Boccaccio's crowd of assorted perverts knew nothing of all this, of course, but they did know that if they reduced n, the number of contacts with possibly infected persons, their chances of surviving the plague would increase. As indeed they did. Score one for medieval perverts. The bottom line is that it is entirely immaterial how immunity was obtained.

Myth #2: Herd immunity is a concept deriving from animals. It doesn’t work on humans.

This is one of the more outlandish claims, but shockingly, it actually has a tiny kernel of truth.

Now, much of the above is a veritable storehouse of insanity, but the point it makes at the beginning has some truth to it. In human populations, herd immunity sometimes behaves anomalously, because humans are not homogeneously distributed. This is true a fortiori for humans who decide not to vaccinate, who – for better or worse – tend to flock in small groups. The term of venery for a bunch of anti-vaxxers is, in case you were wondering, a 'plague'.7

Herd immunity has, in fact, been observed in a range of species. Humans are different only in that we can knowingly and consciously decide to create herd immunity in our population and protect our fellow men, women and children, the last of whom are particularly susceptible to infectious diseases, from some of the worst killers.

Myth #3: If herd immunity can be obtained through natural immunity, surely we don’t need vaccines.

This argument has recently been peddled by the illustrious Kelly Brogan MD, who bills herself as a ‘holistic psychiatrist’ who threw away her script pad, which means she tends exclusively to the worried well and those with mild mental health issues where medication does not play as decisive a role as it does in, say, schizophrenia, severe PTSD, crippling anxiety disorders or complex neuropsychiatric post-insult phenomena.8 Here’s her foray into epidemiology, something she vaguely remembers studying in her first year of med school.

In this, Dr Brogan has successfully found almost century-old evidence for what everybody already knew, namely that herd immunity can be naturally obtained. To anyone who has read the maths part above, this should evoke a sensation of 'DUH!'. The problem is twofold. One, the 'actual virus' has an unsavoury fatality rate of around 0.1%, not including the horribly tragic, heartbreaking late consequence of measles known as SSPE.9 Two, and perhaps more importantly: you don't get lifelong, natural immunity if you die. This may have somehow escaped Dr Brogan's attention, but part of the point of herd immunity is to protect those who would not survive, or would suffer serious sequelae, if they contracted the infection. What we don't know, of course, is how many of that 68% suffered permanent injuries, and how many are not included because they died. What we do know is that all of that 68% probably had a miserable time. Anyone who thinks measles is so fantastic should start by contracting it themselves.

Myth #4: Herd immunity means 95% need to be vaccinated to prevent a disease.

This one comes courtesy of Sarah, aka the Healthy Home Economist,10 who, to what I presume must be the chagrin of her alma mater, states she has a Master's from UPenn. Suspiciously enough, she does not state what it is in. I am somehow pretty sure it's not public health.

The tedious conspiracy theory aside, it is quite evident just how little she understands of herd immunity. No – herd immunity is not based upon11 the idea that 95% must be vaccinated, and it is most definitely not based on the idea that 100% must be vaccinated. Indeed, the whole bloody point of herd immunity is that you do not need to vaccinate 100% to protect 100%. In fact, given the R_0 and vaccine efficacy E_V, we can predict the threshold vaccination rate for herd immunity quite simply, as demonstrated earlier: the threshold value, \bar{p_V}, can be calculated as

\bar{p_V} = \frac{R_0 - 1}{R_0 E_V}

As an illustration, the herd immunity threshold \bar{p_V} for mumps, with an efficacy of 88%12 and an R_0 of around 5.5, is \approx 92.98\%, while for Ebola, which has a very low R_0 around 2.0, herd immunity sets in once about 50% are immune.13

And those ‘conventional health authorities’? That’s what we call health authorities whose ideas work.

Myth #5: If vaccines work, why do we need herd immunity?

This argument is delightfully idiotic, because it, too, ignores the fundamental facts underlying herd immunity. Quite apart from the fact that some people cannot receive some or all vaccines, and that other people can receive vaccines but may not generate sufficient antibody titres to have effective immunity, sometimes vaccines simply fail. Few things are 100%, and while vaccines are designed to be resilient, they can degrade due to inappropriate storage or fail to elicit a sufficient response for some other reason. Unlike wearing deodorant (or 'deoderant', as spelling-challenged anti-vaxxers would say), vaccination is not a purely personal matter: an infection can be imagined as a chain of transmission, and this is a frequently used model to explain the consequences of not vaccinating for others.

In this illustration, an index patient (IDX) is infected and comes into contact with G1, who in turn comes into contact with G2, who in turn comes into contact with G3. In the first case, G1, G2 and G3 are all vaccinated. The vaccine may have a small failure rate – 2% in this case – but by the time we get to G3, his chance of contracting the infection is 1:125,000 or 0.0008%. In the second case, G2 is unvaccinated – if G1's vaccine fails, G2 is almost guaranteed to also fall ill. By not vaccinating, his own risk has increased 50-fold, from 0.04% to 2%. But that's not all: due to G2's failure to vaccinate, G3 will also be affected – instead of the lottery odds of 1:125,000, his risk has also risen 50-fold, to 1:2,500. And this 50-fold increase of risk will carry down the chain of potential transmission, all due to G2's failure to vaccinate. No matter how well vaccines work, there's always a small residual risk of failure, just as there is a residual risk of failure with everything. But it takes not vaccinating to hike that risk up 50-fold. Makes that deodorant ('deoderant'?) analogy sound rather silly, right?
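The arithmetic behind those figures is easy to reproduce; a quick sketch, assuming the 2% per-person vaccine failure rate used above and treating an unvaccinated contact as certain to fall ill once exposed:

p_fail <- 0.02   # assumed probability that a vaccinated person's protection fails

# Everyone vaccinated: the infection must get past three vaccine failures to reach G3
p_fail^3                 # 8e-06 -> 0.0008%, about 1 in 125,000

# G2's own risk, vaccinated vs unvaccinated
p_fail^2                 # 4e-04 -> 0.04% when vaccinated
p_fail                   # 0.02  -> 2% when unvaccinated, a 50-fold increase

# G3's risk when G2 is unvaccinated: only G1's and G3's vaccines stand in the way
p_fail^2                 # 4e-04 -> about 1 in 2,500, again 50-fold higher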

Conclusion

Admittedly, the mathematical basis of herd immunity is complex. And the idea itself is somewhat counterintuitive. None of these are fit excuses for spreading lies and misinformation about herd immunity.

I have not engaged with the blatantly insane arguments (NWO, Zionists, Masonic conspiracies, Georgia Guidestones), nor with the blatantly untrue ones (that doctors and public health officers are evil and guided solely by money as they cash in on the suffering of innocent children). I was too busy booking my next flight paid for by Big Pharma.14 Envy is a powerful force, and it's a good way to motivate people to suspect and hate those who sacrificed their 20s and 30s to work on healing others and are finally getting paid in their 40s. But it's the myths that sway the well-meaning and uncommitted, and I firmly believe it's part of our job as public health experts to counter them with truth.15

In every social structure, part of co-existence is assuming responsibility not just for oneself but for those who are affected by our decisions. Herd immunity is one of those instances where it's no longer about just ourselves. Many have taken the language of herd immunity to suggest that it is some sort of favour or sacrifice made for the communal good, when it is in fact the very opposite: it is preventing (inadvertent but often unavoidable) harm to others by ourselves.

And when the stakes are this high, when it’s about the life and death of millions who for whatever reason cannot be vaccinated or cannot form an immune response, getting the facts right is paramount. I hope this has helped you, and if you know someone who would benefit from it, please do pass it on to them.

References

1. Topley, W. W. C. and Wilson, G. S. (1923). The spread of bacterial infection; the problem of herd immunity. J Hyg 21:243-249. The CDC was founded 23 years later, in 1946.
2. Why R_0? Because it could otherwise be confused with \mathcal{R}, the quantity denoting recovered cases in \mathcal{S(E)IR} models – which is entirely unrelated. To emphasize the distinction, I will use mathcal fonts for the compartments in compartment models.
3. I hope to write about SIS, SEIR and vital dynamic models in the near future, but for this argument, it really doesn’t matter.
4. Technically, as \frac{\mathcal{I}(t)}{P - 1} rises, but since the model presupposes that P is constant, it doesn’t matter.
5. Since otherwise \mathcal{R} = P and \mathcal{S} = 0, and the whole model is moot, as noted above.
6. Note that this does not, unlike the R_0 explanation, presuppose any degree of vaccine efficacy. An ineffectively vaccinated person is simply in \mathcal{S} rather than \mathcal{R}.
7. Initially, ‘a culture’ was proposed, but looking at the average opponent of vaccination, it was clear this could not possibly work.
8. In other words, if you have actual mental health issues, try an actual psychiatrist who follows evidence-based protocols.
9. Subacute sclerosing panencephalitis is a long-term delayed consequence of measles infection, manifesting as a disseminated encephalitis that is invariably fatal. There are no adjectives that do the horror that is SSPE justice, so here’s a good summary paper on it.
10. As a rule, I don’t link to blogs and websites disseminating harmful information that endangers public health.
11. Correct term: ‘on’
12. As per the CDC.
13. Efficacy E_V is presumed to be 100% where immunity is not acquired via vaccination but by survival.
14. Anyone in public health is happy to tell you those things don’t merely no longer exist, they never even existed in our field.
15. And, if need be, maths.

Epidemiologists, here’s your chance to make your parents proud.

I found this t-shirt as a stocking stuffer for my dad, who still doesn’t really know what I do for a living, other than that it has ‘to do with viruses’ (yeah… almost, Dad.). Epidemiology geeks, there’s hope: here’s your chance to make your parents proud this Christmas!

Taken on Dec 06, 2017 @ 23:26 near Budapest II. kerület, this photo was originally posted on my Instagram.

The one study you shouldn’t write

I might have my own set of ideological prejudices,1 but I am more certain of this than I am of any of them: show me proof that contradicts my most cherished beliefs, and I will read it, evaluate it critically and, if it holds up, learn from it. This, incidentally, is how I ended up believing in God and casting away the atheism of my early teens, but that’s a lateral point.

As such, I’m in support of every kind of inquiry that does not, in its process, harm humans (I am, you may be shocked to learn, far more supportive of torturing raw data than people). There’s one exception. There is that one study for every sociologist, every data scientist, every statistician, every psychologist, everyone – that one study that you should never write: the study that proves how your ideological opponents are morons, psychotics and/or terminally flawed human beings.2

Virginia Commonwealth University scholar Brad Verhulst, Pete Hatemi (now at Penn State, my sources tell me) and poor old Lindon Eaves, who of all of the aforementioned should really know better than to darken his reputation with this sort of nonsense, have just learned this lesson at what I believe will be a minuscule cost to their careers compared to the consequence this error ought to cost any researcher in any field.

In 2012, the trio published an article in the American Journal of Political Science, titled Correlation not causation: the relationship between personality traits and political ideologies. Its conclusion was, erm, ground-breaking for anyone who knows conservatives from more than the caricatures they have been reduced to in the media:

First, in line with our expectations, higher P scores correlate with more conservative military attitudes and more socially conservative beliefs for both females and males. For males, the relationship between P and military attitudes (r = 0.388) is larger than the relationship between P and social attitudes (r = 0.292). Alternatively, for females, social attitudes correlate more highly with P (r = 0.383) than military attitudes (r = 0.302).

Further, we find a negative relationship between Neuroticism and economic conservatism (r_{females} = −0.242, r_{males} = −0.239). People higher in Neuroticism tend to be more economically liberal.

(P, in the above, being the score in Eysenck’s psychoticism inventory.)

The most damning words in the above were among the very first. I am not sure what’s worse here: that actual educated people believe psychoticism correlates with military attitudes (because the military is known for courting psychotics, am I right? No? NO?!), or that they think it helps any case to disclose such a blatant bias quite so openly. In my lawyering years, if the prosecution expert had stated that the fingerprints on the murder weapon “matched those of that dirty crook over there, as I expected”, I’d have torn him to shreds, and so would any good lawyer. And that’s not because we’re born and raised bloodhounds, but because we prefer expert witnesses not to bring biases to what they are supposed to opine on in a dispassionate, clear, clinical manner.

And this story confirms why that matters.

Four years after the paper came into print (why so late?), an erratum had to be published (one that, by the way, is still not reflected on a lot of sites that republished the piece). It turns out that the gentlemen who wrote the study had ‘misread’ their numbers. Like, real bad.

The authors regret that there is an error in the published version of “Correlation not Causation: The Relationship between Personality Traits and Political Ideologies” American Journal of Political Science 56 (1), 34–51. The interpretation of the coding of the political attitude items in the descriptive and preliminary analyses portion of the manuscript was exactly reversed. Thus, where we indicated that higher scores in Table 1 (page 40) reflect a more conservative response, they actually reflect a more liberal response. Specifically, in the original manuscript, the descriptive analyses report that those higher in Eysenck’s psychoticism are more conservative, but they are actually more liberal; and where the original manuscript reports those higher in neuroticism and social desirability are more liberal, they are, in fact, more conservative. We highlight the specific errors and corrections by page number below:

It also magically turns out that the military is not full of psychotics.3 Whodda thunk.

…P is substantially correlated with liberal military and social attitudes, while Social Desirability is related to conservative social attitudes, and Neuroticism is related to conservative economic attitudes.

“No shit, Sherlock,” as they say.

The authors’ explanation is that the dog ate their homework. Ok, only a little bit better: the responses were “miscoded”, i.e. it’s all the poor grad student sods’ fault. Their academic highnesses remain faultless:

The potential for an error in our article initially was pointed out by Steven G. Ludeke and Stig H. R. Rasmussen in their manuscript, “(Mis)understanding the relationship between personality and sociopolitical attitudes.” We found the source of the error only after an investigation going back to the original copies of the data. The data for the current paper and an earlier paper (Verhulst, Hatemi and Martin (2010) “The nature of the relationship between personality traits and political attitudes.” Personality and Individual Differences 49:306–316) were collected through two independent studies by Lindon Eaves in the U.S. and Nicholas Martin in Australia. Data collection began in the 1980’s and finished in the 1990’s. The questionnaires were designed in collaboration, with one of the goals being to compare and combine the data for specific analyses. The data were combined into a single data set in the 2000’s to achieve this goal. Data are extracted on a project-by-project basis, and we found that during the extraction for the personality and attitudes project, the specific codebook used for the project was developed in error.

As a working data scientist and statistician, I’m not buying this. The study has, for all its faults, intricate statistical methods. It’s well done from a technical standpoint. It uses Cholesky decomposition and displays a relatively sophisticated statistical approach, even if at times it borders on the bizarre. The causal analysis is an absolute mess, and I have no idea where the authors got the idea that a correlation over 0.2 is “large enough for further consideration”. That’s not a scientifically accepted criterion. A correlation is either significant or not significant. There is no weird middle way of “give us more money, let’s look into it more”. The point remains, however, that the authors, while practising a good deal of cargo cult science, managed to overlook an epic blunder like this. How could that have happened?
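To make the point about significance concrete, here is a toy R illustration on simulated (entirely made-up) data: whether a correlation is statistically significant is something you test, not something you eyeball against an arbitrary 0.2 cut-off.

set.seed(42)
x <- rnorm(200)
y <- 0.25 * x + rnorm(200)  # simulate data with a modest true correlation
cor.test(x, y)              # reports the estimate, a confidence interval and a p-value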

Well, really, how could it have happened? I believe the answer lies in the words I pointed out earlier. The authors suffered from what criminal forensic science calls “cognitive contamination”. They had an idea about conservatives and liberals and what they are like – ideas that were caricatures taken to the extreme. They were blind as bats, blinded by their own ideological biases.

And that is precisely my point. There are, sometimes, articles that you shouldn’t write.

Let me give you an analogy. My religion has some pretty clear rules about what married people are, and aren’t, allowed to do. What my religion also happens to say is that it’s easier not to mess these things up if you do not court temptation. If you are a drug addict, you should not hang out with coke heads. If you are a recovering alcoholic, you would not exactly benefit from joining your friends on a drunken revelry. And if you’ve got political convictions, you are more prone to say stupid things when you find a result that confirms your ideas. The term for this is ‘confirmation bias’; the reality is that it’s the simple human proneness to see what we want to see.

Do you remember how, as a child, you used to play the game of seeing shapes in clouds? Puppies, cows, elephants and horses? The human brain works on the basis of the Gestalt principle of reification, allowing us to reconstruct known things from their parts. It’s essential to the way our brain works. But it also makes us see the things we want to see, not what is actually there.

And that’s why you should never write that one article. The one where you explain why the other side is dumb, evil or has psychotic and/or neurotic traits.

References

1. Largely, they presume outlandish stuff like ‘human life is exceptional and always worth defending’ or ‘death does not cure illnesses’ – you get my drift.
2. For starters, I maintain we are all at the very least the latter, quite probably the middle one at least a portion of the time and, frankly, the first one more often than we would believe of ourselves.
3. Yes, I know a high Eysenck P score does not mean a person is ‘psychotic’, and that Eysenck’s test is a personality trait test, not a test to diagnose a psychotic disorder.