Structuring R projects

There are some things that I call Smith goods:1 things I want, nay, require, but hate doing. A clean room is one of these – I have a visceral need to have some semblance of tidiness around me, I just absolutely hate tidying, especially in the summer.2 Starting and structuring packages and projects is another of these things, which is why I’m so happy things like cookiecutter exist that do it for you in Python.

I am famously laid back about structuring R projects – my chill attitude is only occasionally compared to the Holy Inquisition, the other Holy Inquisition and Gunny R. Lee Ermey’s portrayal of Drill Sgt. Hartman, and it’s been months since I last gutted an intern for messing up namespaces.3 So while I don’t like structuring R projects, I keep doing it, because I know it matters. That’s a pearl of wisdom that came occasionally at a great price, some of which I am hoping to save you by this post.

Five principles of structuring R projects

Every R project is different. Therefore, when structuring R projects, there has to be a lot more adaptability than there is normally. When structuring R projects, I try to follow five overarching principles.

  1. The project determines the structure. In a small exploratory data analysis (EDA) project, you might have some leeway as to structural features that you might not have when writing safety-critical or autonomously running code. This variability in R – reflective of the diversity of its use – means that it’s hard to devise a boilerplate that’s universally applicable to all kinds of projects.
  2. Structure is a means to an end, not an end in itself. The reason why gutting interns, scalping them or yelling at them Gunny style are inadvisable is not just the additional paperwork it creates for HR. Rather, the point of the whole exercise is to create people who understand why the rules exist and organically adopt them, understanding how they help.
  3. Rules are good, tools are better. When tools are provided that take the burden of adherence – linters, structure generators like cookiecutter, IDE plugins, &c. – off the developer, adherence is both more likely and simpler.
  4. Structures should be interpretable to a wide range of collaborators. Even if you have no collaborators, think from the perspective of an analyst, a data scientist, a modeller, a data engineer and, most importantly, the client who will at the very end receive the overall product.
  5. Structures should be capable of evolution. Your project may change objectives, it may evolve, it may change. What was a pet project might become a client product. What was designed to be a massive, error-resilient superstructure might have to scale down. And most importantly, your single-player adventure may end up turning into an MMORPG. Your structure has to be able to roll with the punches.

A good starting structure

Pretty much every R project can be imagined as a sort of process: data gets ingested, magic happens, then the results – analyses, processed data, and so on – get spit out. The absolute minimum structure reflects this:

.
└── my_awesome_project
    ├── src
    ├── output
    ├── data
    │   ├── raw
    │   └── processed
    ├── README.md
    ├── run_analyses.R 
    └── .gitignore

In this structure, we see this reflected by having a data/ folder (a source), a folder for the code that performs the operations (src/) and a place to put the results (output/). The root analysis file (the sole R file on the top level) is responsible for launching and orchestrating the functions defined in the src/ folder’s contents.
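In the spirit of 'tools are better than rules', the skeleton above can be generated rather than typed out by hand. The following is a minimal sketch in base R, not a replacement for a proper project generator; the helper name create_project_skeleton() is made up for illustration, and the example builds the project under a temporary directory.

```r
# Hypothetical helper: creates the minimal project skeleton described above
create_project_skeleton <- function(root) {
  folders <- c(
    "src",
    "output",
    file.path("data", "raw"),
    file.path("data", "processed")
  )
  for (folder in file.path(root, folders)) {
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }

  files <- file.path(root, c("README.md", "run_analyses.R", ".gitignore"))
  # file.create() truncates existing files, so only create the missing ones
  file.create(files[!file.exists(files)])

  invisible(root)
}

# Build the skeleton in a throwaway location and inspect the result
project_root <- file.path(tempdir(), "my_awesome_project")
create_project_skeleton(project_root)
print(list.files(project_root, all.files = TRUE, recursive = TRUE, include.dirs = TRUE))
```

Because existing files are left alone, running the helper a second time on the same folder should not destroy any work.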

The data folder

The data folder is, unsurprisingly, where your data goes. In many cases, you may not have any file-formatted raw data (e.g. where the raw data is accessed via a *DBC connection to a database), and you might even keep all intermediate files there, although that’s pretty uncommon on the whole, and might not make you the local DBA’s favourite (not to mention data protection issues). So while the raw/ subfolder might be dispensed with, you’ll most definitely need a data/ folder.

When it comes to data, it is crucial to make a distinction between source data and generated data. Rich Fitzjohn puts it best when he says to treat

  • source data as read-only, and
  • generated data as disposable.

The preferred implementation I have adopted is to have

  • a data/raw/ folder, which is usually symlinked to a folder that is write-only to clients but read-only to the R user,4
  • a data/temp/ folder, which contains temp data, and
  • a data/output/ folder, if warranted.

The src folder

Some call this folder R – I find this a misleading practice, as you might have C++, bash and other non-R code in it, but it is unfortunately enforced by R if you want to structure your project as a valid R package, which I advocate in some cases. I am a fan of structuring the src/ folder, usually by logical function. There are two systems of nomenclature that have worked really well for me and people I work with:

  • The library model: in this case, the root folder of src/ holds individual .R scripts that when executed will carry out an analysis. There may be one or more such scripts, e.g. for different analyses or different depths of insight. Subfolders of src/ are named after the kind of scripts they contain, e.g. ETL, transformation, plotting. The risk with this structure is that sometimes it’s tricky to remember what’s where, so descriptive file names are particularly important.
  • The pipeline model: in this case, there is a main runner script, or potentially a small number of them. These go through scripts in a sequence. It is a sensible idea in such a case to establish sequential subfolders or sequentially numbered scripts that are executed in order. Typically, this model performs better if there are at most a handful of distinct pipelines.

Whichever approach you adopt, a crucial point is to keep function definition and application separate. This means that only the pipeline or the runner scripts are allowed to execute (apply) functions, and other files are merely supposed to define them. Typically, folder level segregation works best for this:

  • keep all function definitions in subfolders of src/, e.g. src/data_engineering, and have the directly-executable scripts directly under src/ (this works better for larger projects), or
  • keep function definitions in src/, and keep the directly executable scripts in the root folder (this is more convenient for smaller projects, where perhaps the entire data engineering part is not much more than a single script).
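To make the separation of definition and application concrete, here is a small self-contained sketch. It sets up a throwaway project in a temporary directory; the file name clean.R and the function drop_missing() are hypothetical and stand in for whatever your project actually defines.

```r
# Set up a throwaway project so the sketch is self-contained
project_root <- file.path(tempdir(), "definition_vs_application")
definitions_dir <- file.path(project_root, "src", "data_engineering")
dir.create(definitions_dir, recursive = TRUE, showWarnings = FALSE)

# src/data_engineering/clean.R only *defines* a function, it never calls it
writeLines(
  "drop_missing <- function(values) values[!is.na(values)]",
  file.path(definitions_dir, "clean.R")
)

# The runner script loads every definition file, in alphabetical order...
definition_files <- list.files(definitions_dir, pattern = "\\.R$", full.names = TRUE)
for (definition_file in sort(definition_files)) {
  source(definition_file, local = FALSE)
}

# ...and is the only place where functions are actually *applied*
cleaned <- drop_missing(c(1, NA, 3))
print(cleaned)
# [1] 1 3
```

Sourcing a definition file has no side effects other than creating functions, which is what makes it safe to load all of them up front.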

output and other output folders

Output may mean a range of things, depending on the nature of your project. It can be anything from a whole D.Phil thesis written in a LaTeX-compliant form to a brief report to a client. There are a couple of conventions with regard to output folders that are useful to keep in mind.

Separating plot output

It is common to have a separate folder for plots (usually called figs/ or plots/), usually so that they could be used for various purposes. My personal preference is that plot output folders should be subfolders of output folders, rather than top-tier folders, unless they are the very output of the project. That is the case, for instance, where the project is intended to create a particular plot on a regular basis. This was the case, for instance, with the CBRD project whose purpose was to regularly generate daily epicurves for the DRC Zaire ebolavirus outbreak.

With regard to maps, in general, the principle that has worked best for teams I ran was to treat static maps as plots. However, dynamic maps (e.g. LeafletJS apps), tilesets, layers or generated files (e.g. GeoJSON files) tend to deserve their own folder.

Reports and reporting

Not every project needs a reporting folder, but for business users, having a nice, pre-written reporting script that can be run automatically and produces a beautiful PDF report every day can be priceless. A large organisation I worked for in the past used this very well to monitor their Amazon AWS expenditure.5 A team of over fifty data scientists worked on a range of EC2 instances, and runaway spending from provisioning instances that were too big, leaving instances on, and data transfer charges resulting from misconfigured instances6 was rampant. So the client wanted daily, weekly, monthly and 10-day rolling usage nicely plotted in a report, by user, highlighting people who would go on the naughty list. This was very well accomplished by an RMarkdown template that was ‘knit’ every day at 0600 and uploaded as an HTML file onto an internal server, so that every user could see who’s been naughty and who’s been nice. EC2 usage costs went down by almost 30% in a few weeks, and that was without having to dismember anyone!7

Probably the only structural rule to keep in mind is to keep reports and reporting code separate. Reports are client products, reporting code is a work product and therefore should reside in src/.

Requirements and general settings

I am, in general, not a huge fan of outright loading whole packages to begin with. Too many users of R don’t realise that

  • you do not need to attach (library(package)) a package in order to use a function from it – as long as the package is available to R, you can simply call the function as package::function(arg1, arg2, ...), and
  • attaching a package using library(package) puts every exported function from that package on the search path, masking by default any earlier entries of the same name. This means that in order to deterministically know what any given symbol means, you would have to know, at all times, the order in which packages were attached. Needless to say, there is enough stuff to keep in one’s mind when coding in R without having to worry about this as well.
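A short illustration of both points, using only packages that ship with R. The file path is made up, and the locally defined median() merely simulates what happens when a name on the search path gets masked.

```r
# tools is installed with R but not attached by default;
# the qualified call works without library(tools)
extension <- tools::file_ext("data/raw/cases.csv")
print(extension)
# [1] "csv"

# Simulate masking: a later definition hides stats::median
median <- function(x) "not the median you were looking for"
masked_result <- median(c(1, 2, 10))
print(masked_result)
# [1] "not the median you were looking for"

# The namespace-qualified call is unambiguous regardless of masking
qualified_median <- stats::median(c(1, 2, 10))
print(qualified_median)
# [1] 2

# Clean up the masking definition
rm(median)
```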

However, some packages might be useful to import, and sometimes it’s useful to have an initialisation script. This may be the case in three particular scenarios:

  • You need a particular locale setting, or a particularly crucial environment setting.
  • It’s your own package and you know you’re not going to shadow already existing symbols.
  • You are not using packrat or some other package management solution, and definitely need to ensure some packages are installed, but prefer not to put the clunky install-if-not-present code in every single thing.

In these cases, it’s sensible to have a file you would source before every top-level script – in an act of shameless thievery from Python, I tend to call this requirements.R, and it includes some fundamental settings I like to rely on, such as setting the locale appropriately. It also includes a CRAN install check script, although I would very much advise the use of Packrat over it, since it’s not version-sensitive.
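What such a requirements.R might look like is sketched below. This is an assumption-laden illustration rather than the exact file I use: the helper find_missing_packages() is a made-up name, the package list is a placeholder consisting of packages that ship with R, and the collation setting is just one example of a locale choice you might want to pin.

```r
# requirements.R: meant to be sourced at the top of every top-level script

# Pin collation so that sorting behaves identically on every machine
Sys.setlocale("LC_COLLATE", "C")

# Hypothetical helper: which of the required packages are not installed?
find_missing_packages <- function(required) {
  installed <- rownames(utils::installed.packages())
  setdiff(required, installed)
}

# Placeholder list: replace with the packages your project actually needs
required_packages <- c("stats", "utils", "tools")

missing_packages <- find_missing_packages(required_packages)
if (length(missing_packages) > 0) {
  # Not version-sensitive: this installs whatever the repository currently offers
  utils::install.packages(missing_packages)
}
```

A top-level script would then start with source("requirements.R") and refer to functions as package::function() from there on.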

Themes, house style and other settings

It is common, in addition to all this, to keep some general settings. If your institution has a ‘house style’ for ggplot2 (as, for instance, a ggthemr file), this could be part of your project’s config. But where does this best go?

It would normally be perfectly fine to keep your settings in a config.R file at root level, but a config/ folder is much preferred as it prevents clutter if you derive any of your configurations from a git submodule. I’m a big fan of keeping house styles and other things intended to give a shared appearance to code and outputs (e.g. linting rules, text editor settings, map themes) in separate – and very, very well managed! – repos, as this ensures consistency across the board over time. As a result, most of my projects do have a config folder instead of a single configuration file.

It is paramount to separate project configuration and runtime configuration:

  • Project configuration pertains to the project itself, its outputs, schemes, the whole nine yards. For instance, the paper size to use for generated LaTeX documents would normally be a project configuration item. Your project configuration belongs in your config/ folder.
  • Runtime configuration pertains to parameters that relate to individual runs. In general, you should aspire to have as few of these, if any, as possible – and if you do, you should keep them as environment variables. But if you do decide to keep them as a file, it’s generally a good idea to keep them at the top level, and store them not as R files but as e.g. JSON files. There are a range of tools that can programmatically edit and change these file formats, while changing R files programmatically is fraught with difficulties.

Keeping runtime configuration editable

A few years ago, I worked on a viral forecasting tool where a range of model parameters to build the forecast from were hardcoded as R variables in a runtime configuration file. It was eventually decided to create a Python-based web interface on top of it, which would allow users to see the results as a dashboard (reading from a database where forecast results would be written) and make adjustments to some of the model parameters. The problem was, that’s really not easy to do with variables in an R file.

On the other hand, Python can easily read a JSON file into memory, change values as requested and export them onto the file system. So instead of that, the web interface would store the parameters in a JSON file, from which R would then read them and execute accordingly. Worked like a charm. Bottom line – configurations are data, and using code to store data is bad form.
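The R side of such an arrangement can be sketched in a few lines. This assumes the jsonlite package is installed; the file name and the parameter names (forecast_horizon_days, n_simulations) are invented for illustration and are not those of the actual forecasting tool.

```r
# Assumes the jsonlite package is installed
config_path <- file.path(tempdir(), "runtime_config.json")

# What an external tool, e.g. a web interface, might have written
writeLines('{"forecast_horizon_days": 14, "n_simulations": 1000}', config_path)

# R reads the parameters as data, not as code
runtime_config <- jsonlite::fromJSON(config_path)
print(runtime_config$forecast_horizon_days)
# [1] 14

# Any tool, including R itself, can change a value and write the file back
runtime_config$forecast_horizon_days <- 28
jsonlite::write_json(runtime_config, config_path, auto_unbox = TRUE, pretty = TRUE)

updated_config <- jsonlite::fromJSON(config_path)
print(updated_config$forecast_horizon_days)
# [1] 28
```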

Dirty little secrets

Everybody has secrets. In all likelihood, your project is no different: passwords, API keys, database credentials, the works. The first rule of this, of course, is never hardcode credentials in code. But you will need to work out how to make your project work, including via version control, while also not divulging credentials to the world at large. My preferred solutions, in order of preference, are:

  1. the keyring package, which interacts with OS X’s keychain, Windows’s Credential Store and the Secret Service API on Linux (where supported),
  2. using environment variables,
  3. using a secrets file that is .gitignored,
  4. using a config file that’s .gitignored,
  5. prompting the user.

Let’s take these – except the last one, which you should consider only as a measure of desperation, as it relies on RStudio and your code should aspire to run without it – in turn.

Using keyring

keyring is an R package that interfaces with the operating system’s keychain management solution, and works without any additional software on OS X and Windows.8 Using keyring is delightfully simple: it conceives of an individual key as belonging to a keyring and identified by a service. By reference to the service, it can then be retrieved easily once the user has authenticated to access the keychain. It has two drawbacks to be aware of:

  • It’s an interactive solution (it has to get access permission for the keychain), so if what you’re after is R code that runs quietly without any intervention, this is not your best bet.
  • A key can only contain a username and a password, so it cannot store more complex credentials, such as 4-ple secrets (e.g. in OAuth, where you may have a consumer and a publisher key and secret each). In that case, you could split them into separate keyring keys.

However, for most interactive purposes, keyring works fine. This includes single-item secrets, e.g. API keys, where you can use some junk as your username and hold only on to the password.

By default, the operating system’s ‘main’ keyring is used, but you’re welcome to create a new one for your project. Note that users may be prompted for a keychain password at call time, and it’s helpful if they know what’s going on, so be sure you document your keyring calls well.

To set a key, simply call keyring::key_set(service = "my_awesome_service", username = "my_awesome_user"). This will launch a dialogue using the host OS’s keychain handler to request authentication to access the relevant keychain (in this case, the system keychain, as no keychain is specified), and you can then retrieve

  • the username: using keyring::key_list("my_awesome_service")[1,2], and
  • the password: using keyring::key_get("my_awesome_service").

Using environment variables

Using environment variables to hold certain secrets has become extremely popular especially for Dockerised implementations of R code, as envvars can be very easily set using Docker. The thing to remember about environment variables is that they’re ‘relatively private’: they’re not part of the codebase, so they will definitely not accidentally get committed to the VCS, but everyone who has access to the particular user session will be able to read them. This may be an issue when e.g. multiple people are sharing the ec2-user account on an EC2 instance. The other drawback of envvars is that if there’s a large number of them, setting them can be a pain. R has a little workaround for that: if you create an envfile called .Renviron in the working directory, R will read it at startup and load its values into the environment. So for instance the following .Renviron file will bind an API key and a username:

api_username = "my_awesome_user"
api_key = "e19bb9e938e85e49037518a102860147"

So when you then call Sys.getenv("api_username"), you get the correct result. It’s worth keeping in mind that the .Renviron file is sourced once, and once only: at the start of the R session. Thus, obviously, changes made after that will not propagate into the session until it ends and a new session is started. It’s also rather clumsy to edit, although most APIs used to ini files will, with the occasional grumble, digest .Renvirons.
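If restarting the session is not an option, base R's readRenviron() can re-read an envfile on demand. The sketch below writes a throwaway file to a temporary directory with placeholder values; get_required_env() is a hypothetical helper that fails loudly rather than silently returning an empty string.

```r
# Write a throwaway env file; names and values are placeholders
env_file <- file.path(tempdir(), "example.Renviron")
writeLines(c("api_username=my_awesome_user", "api_key=not_a_real_key"), env_file)

# Re-read the file mid-session instead of restarting R
readRenviron(env_file)

# Hypothetical helper: stop if a required variable is not set
get_required_env <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value)) {
    stop("Environment variable not set: ", name)
  }
  value
}

print(get_required_env("api_username"))
# [1] "my_awesome_user"
```

Reading the value through a function at the point of use, rather than binding it to a variable, also keeps it out of the global scope.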

Needless to say, committing the .Renviron file to the VCS is what is sometimes referred to as making a chocolate fireman in the business, and is generally a bad idea.

Using a .gitignored config or secrets file

config is a package that allows you to keep a range of configuration settings outside your code, in a YAML file, then retrieve them. For instance, you can create a default configuration for an API:

default:
    my_awesome_api:
        url: 'https://awesome_api.internal'
        username: 'my_test_user'
        api_key: 'e19bb9e938e85e49037518a102860147'

From R, you could then access this using the config::get() function:

my_awesome_api_configuration <- config::get("my_awesome_api")

This would then allow you to e.g. refer to the URL as my_awesome_api_configuration$url, and the API key as my_awesome_api_configuration$api_key. As long as the configuration YAML file is kept out of the VCS, all is well. The problem is that not everything in such a configuration file is supposed to be secret. For instance, with database access, it makes sense to have the other parameters DBI::dbConnect() needs for a connection available to other users, but keep the password private. So .gitignoring an entire config file is not a good idea.

A somewhat better idea is a secrets file. This file can be safely .gitignored, because it definitely only contains secrets. As previously noted, definitely create it using a format that can be widely written (JSON, YAML).9 For reasons noted in the next subsection, the thing you should definitely not do is create a secrets file that consists of R variable assignments, however convenient an idea that may appear at first. Because…

Whatever you do…

One of the best ways to mess up is creating a fabulous way of keeping your secret credentials truly secret… then loading them into the global scope. Never, ever assign credentials. Ever.

You might have seen code like this:

dbuser <- Sys.getenv("dbuser")
dbpass <- Sys.getenv("dbpass")

conn <- DBI::dbConnect(odbc::odbc(), UID = dbuser, PWD = dbpass)
This will work perfectly, except once it’s done, it will leave the password and the user name, in unencrypted plaintext (!), in the global scope, accessible to any code. That’s not just extremely embarrassing if, say, your wife of ten years discovers that your database password is your World of Warcraft character’s first name, but also a potential security risk. Never put credentials into any environment if possible, and if it has to happen, at least make it happen within a function so that they don’t end up in the global scope. The correct way to do the above would be more akin to this:

create_db_connection <- function() {
    # Credentials are read at call time and never bound to a variable,
    # so they do not linger in the global scope
    DBI::dbConnect(odbc::odbc(), UID = Sys.getenv("dbuser"), PWD = Sys.getenv("dbpass"))
}

Concluding remarks

Structuring R projects is an art, not just a science. Many best practices are highly domain-specific, and learning these generally happens by trial and error. In many ways, it’s the bellwether of an R developer’s skill trajectory, because it shows whether they possess the tenacity and endurance it takes to do meticulous, fine and often rather boring work in pursuance of future success – or at the very least, an easier time debugging things in the future. Studies show that one of the greatest predictors of success in life is being able to tolerate deferred gratification, and structuring R projects is a pure exercise in that discipline.

At the same time, a well-executed structure can save valuable developer time, prevent errors and allow data scientists to focus on the data rather than debugging and trying to find where that damn snippet of code is or scratching their head trying to figure out what a particularly obscurely named function does. What might feel like an utter waste of time has enormous potential to create value for the individual, the team and the organisation alike.

I’m sure there are many aspects of structuring R projects that I have omitted or ignored – in many ways, it is my own experiences that inform and motivate these commentaries on R. Some of these observations are echoed by many authors, others diverge greatly from what’s commonly held wisdom. As with all concepts in development, I encourage you to read widely, get to know as many different ideas about structuring R projects as possible, and synthesise your own style. As long as you keep in mind why structure matters and what its ultimate aims are, you will arrive at a form of order out of chaos that will be productive, collaborative and mutually useful not just for your own development but others’ work as well.

My last commentary on defensive programming in R has spawned a vivid and exciting debate on Reddit, and many have made extremely insightful comments there. I’m deeply grateful for all who have contributed there. I hope you will also consider posting your observations in the comment section below. That way, comments will remain together with the original content.

References

1.As in, Adam Smith.
2.It took me years to figure out why. It turns out that I have ZF alpha-1 antitrypsin deficiency. As a consequence, even minimal exposure to small particulates and dust can set off violent coughing attacks and impair breathing for days. Symptoms tend to be worse in hot weather due to impaired connective tissue something-or-other.
3.That’s a joke. I don’t gut interns – they’re valuable resources, HR shuns dismembering your coworkers, it creates paperwork and I liked every intern I’ve ever worked with – but most importantly, once gutted like a fish, they are not going to learn anything new. I prefer gentle, structured discussions on the benefits of good package structure. Please respect your interns – they are the next generation, and you are probably one of their first examples of what software development/data science leadership looks like. The waves you set into motion will ripple through generations, well after you’re gone. You better set a good example.
4.Such a folder is often referred to as a ‘dropbox’, and the typical corresponding octal setting, 0422, guarantees that the R user will not accidentally overwrite data.
5.The organisation consented to me telling this story but requested anonymity, a request I honour whenever legally possible.
6.In case you’re unfamiliar with AWS: it’s a cloud service where elastic computing instances (EC2 instances) reside in ‘regions’, e.g. us-west-1a. There are (small but nonzero) charges for data transfer between regions. If you’re in one region but you configure the yum repo server of another region as your default, there will be costs, and, eventually, tears – provision ten instances with a few GBs worth of downloads, and there’ll be yelling. This is now more or less impossible to do except on purpose, but one must never underestimate what users are capable of from time to time!
7.Or so I’m told.
8.Linux users will need libsecret 0.16 or above, and sodium.
9.XML is acceptable if you’re threatened with waterboarding.

Assignment in R: slings and arrows

Having recently shared my post about defensive programming in R on the r/rstats subreddit, I was blown away by the sheer number of comments as much as I was blown away by the insight many displayed. One particular comment by u/guepier struck my attention. In my previous post, I came out quite vehemently against using the = operator to effect assignment in R. u/guepier made a great point, however:

But point 9 is where you’re quite simply wrong, sorry:

never, ever, ever use = to assign. Every time you do it, a kitten dies of sadness.

This is FUD, please don’t spread it. There’s nothing wrong with =. It’s purely a question of personal preference. In fact, if anything <- is more error-prone (granted, this is a very slight chance but it’s still higher than the chance of making an error when using =).

Now, assignment is no doubt a hot topic – a related issue, assignment expressions, has recently led to Python’s BDFL resigning – so I’ll have to tread carefully. A surprising number of people have surprisingly strong feelings about assignment and assignment expressions. In R, this is complicated by its unusual assignment structure, involving two assignment operators that are just different enough to be trouble.

A brief history of <-

Figure: The IBM Model M SSK keyboard with APL keys. The APL symbols are printed on it in somewhat faint yellow.

There are many ways in which <- in R is anomalous. For starters, it is rare to find a binary operator that consists of two characters – which is an interesting window on the R <- operator’s history.

The <- operator, apparently, stems from a day long gone by, when keyboards existed for the programming language eldritch horror that is APL. When R’s precursor, S, was conceived, APL keyboards and printing heads existed, and these could print a single ← character. It was only after most standard keyboard assignments ended up eschewing this all-important symbol that R and S accepted the digraphic <- as a substitute.

OK, but what does it do?

In the Brown Book (Richard A. Becker and John M. Chambers (1984). S: An Interactive Environment for Data Analysis and Graphics), the underscore was actually an alias for the arrow assignment operator! Thankfully, this did not make it into R.
<- is one of the first operators anyone encounters when familiarising themselves with the R language. The general idea is quite simple: it is a directionally unambiguous assignment, i.e. it indicates quite clearly that the right-hand side value (rhs, in the following) will replace the left-hand side variable (lhs), or be assigned to the newly created lhs if it has not yet been initialised. Or that, at the very least, is the short story.

Because quite peculiarly, there is another way to accomplish a simple assignment in R: the equality sign (=). And because on the top level, a <- b and a = b are equivalent, people have sometimes treated the two as being quintessentially identical. Which is not the case. Or maybe it is. It’s all very confusing. Let’s see if we can unconfuse it.

The Holy Writ

The Holy Writ, known to uninitiated as the R Manual, has this to say about assignment operators and their differences:

The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.

If this sounds like absolute gibberish, or you cannot think of what would qualify as not being on the top level or a subexpression in a braced list of expressions, welcome to the squad – I’ve had R experts scratch their head about this for an embarrassingly long time until they realised what the R documentation, in its neutron starlike denseness, actually meant.

If it’s in (parentheses) rather than {braces}, = and <- are going to behave weird

To translate the scriptural words above quoted to human speak, this means = cannot be used in the conditional part (the part enclosed by (parentheses) as opposed to {curly braces}) of control structures, among others. This is less an issue between <- and =, and rather an issue between = and ==. Consider the following example:

x = 3

if(x = 3) 1 else 0
# Error: unexpected '=' in "if(x ="

So far so good: you should not use a single equality sign as an equality test operator. The right way to do it is:

if(x == 3) 1 else 0
# [1] 1

But what about arrow assignment?

if(x <- 3) 1 else 0
# [1] 1

Oh, look, it works! Or does it?

if(x <- 4) 1 else 0
# [1] 1

The problem is that an assignment returns the value that was assigned, and if() coerces any non-zero number to TRUE. So instead of comparing x to 4, it assigned 4 to x, then happily informed us that it is indeed true.
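It is worth checking what the condition actually sees when an arrow assignment sits inside it. In the sketch below, the variable names result_nonzero and result_zero are mine; the point is that the branch taken depends on the assigned value, with zero being coerced to FALSE and any other number to TRUE.

```r
# The condition evaluates to the assigned value, 4, which is coerced to TRUE
result_nonzero <- if (x <- 4) 1 else 0
print(result_nonzero)
# [1] 1

# Here the condition evaluates to 0, which is coerced to FALSE
result_zero <- if (x <- 0) 1 else 0
print(result_zero)
# [1] 0

# Either way, x has been silently overwritten
print(x)
# [1] 0
```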

The bottom line is not to use = as a comparison operator, and not to use <- as anything at all in a control flow expression’s conditional part. Or as John Chambers notes,

Disallowing the new assignment form in control expressions avoids programming errors (such as the example above) that are more likely with the equal operator than with other S assignments.

Chain assignments

One example of where <- and = behave differently (or rather, one behaves and the other throws an error) is a chain assignment. In a chain assignment, we exploit the fact that R assigns from right to left. The sole criterion is that all except the rightmost members of the chain must be capable of being assigned to.

# Chain assignment using <-
a <- b <- c <- 3

# Chain assignment using =
a = b = c = 3

# Chain assignment that will, unsurprisingly, fail
a = b = 3 = 4
# Error in 3 = 4 : invalid (do_set) left-hand side to assignment

So we’ve seen that as long as the chain assignment is logically valid, it’ll work fine, whether it’s using <- or =. But what if we mix them up?

a = b = c <- 1
# Works fine...

a = b <- c <- 1
# We're still great...

a <- b = c = 1
# Error in a <- b = c = 1 : could not find function "<-<-"
# Oh.

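The bottom line from the example above is that where <- and = are mixed, an = must never appear to the right of a <-: every = has to sit at the left end of the chain. The reason is operator precedence. <- binds more tightly than =, so a <- b = c = 1 is read as (a <- b) = (c = 1), which asks R to assign to the result of an assignment. In that one particular context, = and <- are not interchangeable.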
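
The failure comes from how R groups the expression rather than from the assignment itself. A small sketch using base R's parse() to inspect the grouping without evaluating anything (the names bad_chain and good_chain are only for illustration):

```r
# Parse (but do not evaluate) a mixed chain and look at how R grouped it.
bad_chain <- parse(text = "a <- b = 1")[[1]]

as.character(bad_chain[[1]])   # the outermost operator
# [1] "="
deparse(bad_chain[[2]])        # what that operator is asked to assign to
# [1] "a <- b"

# The outermost '=' is asked to assign to the call 'a <- b', which is why
# evaluation goes looking for a replacement function named "<-<-".

good_chain <- parse(text = "a = b <- 1")[[1]]
deparse(good_chain[[2]])       # here the target is a plain name
# [1] "a"
```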

A small note on chain assignments: many people dislike chain assignments because they’re ‘invisible’ – they return their value without printing anything at all. If that is an issue, you can surround your chain assignment with parentheses – regardless of whether it uses <-, = or a (valid) mixture thereof:

a = b = c <- 3
# ...
# ... still nothing...
# ... ... more silence...

(a = b = c <- 3)
# [1] 3
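
Base R's withVisible() makes the mechanism explicit. A brief sketch (silent and shown are illustrative names):

```r
# withVisible() reports an expression's value and whether it would auto-print.
silent <- withVisible(a <- b <- 3)
silent$value
# [1] 3
silent$visible
# [1] FALSE

# Wrapping the same chain in parentheses makes the value visible.
shown <- withVisible((a <- b <- 3))
shown$visible
# [1] TRUE
```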

Assignment and initialisation in functions

This is the big whammy – one of the most important differences between <- and =, and a great way to break your code. If you have paid attention until now, you’ll be rewarded by, hopefully, some interesting knowledge.

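= does double duty. At the top level it assigns; inside a function call it merely binds a value to a parameter, and does not create a variable in the calling scope at all. <-, on the other hand, always assigns: wherever it is evaluated, it creates (or overwrites) a variable, with the lhs as its name, in the scope it was called from, which is the global scope if you are working at the console. This becomes quite prominent when using it in functions.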

Traditionally, when invoking a function, we are supposed to bind its arguments in the format parameter = argument.1 And as we know about functions, the parameter’s scope is restricted to the function block. To demonstrate this:

add_up_numbers <- function(a, b) {
    return(a + b)
}

add_up_numbers(a = 3, b = 5)
# [1] 8

a + b
# Error: object 'a' not found

This is expected: a (as well as b, but that didn’t even make it far enough to get checked!) doesn’t exist in the global scope, it exists only in the local scope of the function add_up_numbers. But what happens if we use <- assignment?

add_up_numbers(a <- 3, b <- 5)
# [1] 8

a + b
# [1] 8

Now, a and b still only exist in the local scope of the function add_up_numbers. However, using the assignment operator, we have also created new variables called a and b in the global scope. It’s important not to confuse it with accessing the local scope, as the following example demonstrates:

add_up_numbers(c <- 5, d <- 6)
# [1] 11

a + b
# [1] 8

c + d
# [1] 11

In other words, a + b gave us the sum of the values a and b had in the global scope. When we invoked add_up_numbers(c <- 5, d <- 6), the following happened, in order:

  1. A variable called c was initialised in the global scope. The value 5 was assigned to it.
  2. A variable called d was initialised in the global scope. The value 6 was assigned to it.
  3. The function add_up_numbers() was called on positional arguments c and d.
  4. c was assigned to the variable a in the function’s local scope.
  5. d was assigned to the variable b in the function’s local scope.
  6. The function returned the sum of the variables a and b in the local scope.

It may sound more than a little tedious to think about this function in this way, but it highlights three important things about <- assignment:

  1. In a function call, <- assignment to a keyword name is not the same as using =, which simply binds a value to the keyword.
  2. <- assignment in a function call affects the calling scope (here, the global scope); using = to provide an argument does not.
  3. Outside this context, <- and = have the same effect, i.e. they assign, or initialise and assign, in the current scope.
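
Two caveats are worth trying out, sketched below with hypothetical helper names (ignore_argument and call_from_inside are made up for illustration). R evaluates arguments lazily, so an assignment placed in a call only happens if the function actually uses that argument. And the variable is created in whatever scope the call was made from, which is only the global scope when you are working at the top level. The outputs shown assume a fresh session.

```r
add_up_numbers <- function(a, b) {
    return(a + b)
}

# Caveat 1: an argument that is never used is never evaluated,
# so the assignment inside the call never takes place.
ignore_argument <- function(x) {
    "x was never touched"
}
ignore_argument(never_created <- 1)
exists("never_created")
# [1] FALSE

# Caveat 2: the assignment lands in the scope the call was made from.
call_from_inside <- function() {
    add_up_numbers(p <- 1, q <- 2)
    exists("p", inherits = FALSE)   # is p a local variable of call_from_inside?
}
created_locally <- call_from_inside()
created_locally
# [1] TRUE
exists("p", envir = globalenv(), inherits = FALSE)
# [1] FALSE
```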

Phew. If that sounds like absolute confusing gibberish, give it another read and try playing around with it a little. I promise, it makes some sense. Eventually.

So… should you or shouldn’t you?

Which raises the question that launched this whole post: should you use = for assignment at all? Quite a few style guides, such as Google’s R style guide, have outright banned the use of = as assignment operator, while others have encouraged the use of ->. Personally, I’m inclined to agree with them, for three reasons.

  1. Because of the existence of ->, assignment by definition is best when it’s structured in a way that shows what is assigned to which side. a -> b and b <- a have a formal clarity that a = b does not have.
  2. Good code is unambiguous even if the language isn’t. This way, -> and <- always mean assignment, = always means argument binding and == always means comparison.
  3. Many argue that <- is ambiguous, as x<-3 may be mistyped as x<3 or x-3, or alternatively may be (visually) parsed as x < -3, i.e. compare x to -3. In reality, this is a non-issue. RStudio has a built-in shortcut (Alt/⎇ + -) for <-, and automatically inserts a space before and after it. And if one adheres to sound coding principles and surrounds operators with white spaces, this is not an issue that arises.

Like with all coding standards, consistency is key. Consistently used suboptimal solutions are superior, from a coding perspective, to an inconsistent mixture of right and wrong solutions.

References

1.A parameter is an abstract ‘slot’ where you can put in values that configure a function’s execution. Arguments are the actual values you put in. So add_up_numbers(a,b) has the parameters a and b, and add_up_numbers(a = 3, b = 5) has the arguments 3 and 5.

Is Senomyx putting baby parts into your fruit juice? The answer may (not) surprise you.

I know a great number of people who oppose abortion, and who are therefore opposed to the biomedical use of tissue or cells that have been derived from foetuses aborted for that very purpose, although most of them do not oppose the use of foetal tissue from foetuses aborted for some other reason.1 Over the last week or so, many have sent me a story involving a company named Senomyx, wondering if it’s true. In particular, the story claimed that Senomyx uses foetal cells, or substances derived from foetal cell lines, to create artificial flavourings. One article I have been referred to is straight from the immediately credible-sounding Natural News:2 3

Every time you purchase mass-produced processed “food” from the likes of Kraft, PepsiCo, or Nestle, you’re choosing, whether you realize it or not, to feed your family not only genetically engineered poisons and chemical additives, but also various flavoring agents manufactured using the tissue of aborted human babies.

It’s true: A company based out of California, known as Senomyx, is in the business of using aborted embryonic cells to test fake flavoring chemicals, both savory and sweet, which are then added to things like soft drinks, candy and cookies. And Senomyx has admittedly partnered with a number of major food manufacturers to lace its cannibalistic additives into all sorts of factory foods scarfed down by millions of American consumers every single day.

Needless to say, this is total bullcrap, and in the following, we’re going on a ride in the Magic Schoolbus through this fetid swamp of lies, half-truths and quarter-wits.

In the beginning was the cell culture…

About ten years before I was born, a healthy foetus was (legally) aborted in Leiden, the Netherlands, and samples were taken from the foetus’s kidney. They were much like the cells in your kidney. Like the cells in your kidney (hopefully!), they had a beginning and an end. That end is known as the Hayflick limit or ‘replicative senescence limit’. Cells contain small ‘caps’ at the end of their chromosomes, known as telomeres.4 At every mitosis, these shorten a little, like a countdown clock, showing how many divisions the cell has left. And once the telomeres are all gone, the cell enters a stage called cellular senescence, and is permanently stuck in the G1 phase of the cell cycle, unable to move on to phase S. This is a self-preservative mechanism: with every mitosis, the cell line ages, and becomes more likely to start suffering from errors it passes on to its descendants. Experimentally, we know that the Hayflick limit is around 40-60 divisions.

Kadir Nelson, The Mother of Modern Medicine. Oil on linen, 59 1/2 x 49 1/2. Image courtesy of the National Museum of African American History & Culture, Washington, DC.

But every rule has an exception. Most obviously, there are the cells that just won’t die – this is the case in cancers. And so, some cancer cells have this senescence mechanism disabled, and can divide an unlimited number of times. They have become practically immortal. This is, of course, a nuisance if they are inside a human, because proliferation and division of those cells causes unpleasant things like tumours. For researchers, however, they offer something precious: a cell line that will, as long as it’s fed and watered, live and divide indefinitely. This is called an immortal cell line. The most famous of them, HeLa, began life as cervical epithelial cells of a woman named Henrietta Lacks. Then something dreadful happened, and the cells’ cell cycle regulation was disrupted – Ms Lacks developed aggressive cervical cancer, and died on October 4, 1951. Before that, without her permission or consent, samples were taken from her cervix, and passed on to George Otto Gey, a cell biologist who realised what would be Ms Lacks’s most lasting heritage: her cells would divide and divide, well beyond the Hayflick limit, with no end in sight. Finally, cell biologists had the Holy Grail: a single cell line that produced near-exact copies of itself, all descendants of a single cell from the cervix of a destitute African-American woman, who died without a penny to her name, but whose cells would for decades continue to save lives – among others, Salk’s polio vaccine was tested on HeLa cells.5

HeLa cells were immortal ‘by nature’ – they underwent changes that have rendered them cancerous/immortal, depending on perspective. But it’s also possible to take regular mortal cells and interfere with their cell cycle to render them immortal. And this brings us back to the kidney cells of the foetus aborted in the 1970s, a life that was never born yet will never die, but live on as HEK 293. A cell biologist, Frank Graham, working in Alex van der Eb’s lab at the University of Leiden,6 used the E1 gene from adenovirus 5 to ‘immortalise’ the cell line, effectively rewriting the cell’s internal regulation mechanism to allow it to continue to divide indefinitely rather than enter cellular senescence.7 This became the cell culture HEK 293,8 one of my favourite cell cultures altogether, with an almost transcendental beauty in symmetry.

HEK 293 cells, immunofluorescence photomicrograph.

HEK 293 is today a stock cell line, used for a range of experimental and production processes. You can buy it from Sigma-Aldrich for a little north of £350 a vial. The cells you’re getting will be exact genetic copies of the initial cell sample, and its direct genetic descendants. That’s the point of an immortalised cell line: you can alter a single cell to effectively divide indefinitely as long as the requisite nutrients, space and temperature are present. They are immensely useful for two reasons:

  • You don’t need to take new cell samples every time you need a few million cells to tinker with.
  • All cells in a cell line are perfectly identical.9 It goes without saying just how important for reproducible science it is to have widely available, standardised, reference cell lines.

Admittedly, this was a whistle-stop tour through cell cultures and immortal(ised) cell lines, but the basics should be clear. Right? I mean… right?

Unless you’re Sayer Ji.

Until about 1900 today, I had absolutely no idea of who Sayer Ji is, and I would lie if I asserted my life was drastically impoverished by that particular ignorance. Sayer Ji runs a website (which I am not going to link to, but I am going to link to RationalWiki’s entry on it, as well as Orac’s collected works on the man, the myth and the bullshit), and he describes himself as a ‘thought leader’. That alone sets off big alarms. Mr Ji, to the best of my knowledge, has no medical credentials whatsoever, nor any accredited credentials that would allow him to make the grandiose statements that he seems to indulge in with the moderation of Tony Montana when it comes to cocaine:

There is not a single disease ever identified cause by a lack of a drug, but there are diseases caused by a lack of vitamins, minerals and nutrients. Why, then, do we consider the former – chemical medicine – the standard of care and food-as-medicine as quackery? Money and power is the obvious answer.

You may question why I even engage with someone whose cognition is operating on this level, and you might be justified in doing so. Please ascribe it to my masochistic tendencies, or consider it a sacrifice for the common good. Either way, I pushed through a particular article of his which is cited quite extensively in the context of Senomyx, titled Biotech’s Dark Promise: Involuntary Cannibalism for All.10 In short, Mr Ji’s ‘article’ rests on the fundamental assertion that ‘abortion tainted vaccines’, among which he ranks anything derived from HEK 293 (which he just refers to as 293),11 constitute cannibalism. He is, of course, entirely misguided, and does not understand how cell line derived vaccines work:

Whereas cannibalism is considered by most modern societies to be the ultimate expression of uncivilized or barbaric behavior, it is intrinsic to many of the Western world’s most prized biotechnological and medical innovations. Probably the most ‘taken for granted’ example of this is the use of live, aborted fetus cell lines from induced abortions to produce vaccines. Known as diploid cell vaccines (diploid cells have two (di-) sets of chromosomes inherited from human mother and father), they are non-continuous (unlike cancer cells), and therefore must be continually replaced, i.e. new aborted, live fetal tissue must be harvested periodically.

For the time being, the VRBPAC has mooted but not approved tumour cell line derived vaccines, and is unlikely to do so anytime soon. However, the idea that diploid cell vaccines need a constant influx of cells is completely idiotic, and reveals Mr Ji’s profound ignorance.12 WI-38, for instance, is a diploid human cell line established from a single foetus in the 1960s, and frozen stocks descended from that one sample are still in use today. There is no new “aborted, live fetal tissue” that “must be harvested periodically”.

Equally, Mr Ji is unaware of the idea that vaccines do not typically contain cells from the culture, but only the proteins, VLPs or virions expressed by the cells. There’s no cannibalism if no cells are consumed, and the denaturation process (the attenuation part of attenuated vaccines) is already breaking whatever cellular parts there are to hell and back. On the infinitesimal chance that a whole cell has made it through, it will be blasted into a million little pieces by the body’s immune system – being a foreign cell around a bunch of adjuvant is like breaking into a bank vault right across from the local police station while expressly alerting the police you’re about to crack the safe and, just to be sure, providing them with a handy link for a live stream.13

From cell lines to Diet Coke: Senomyx and high throughput receptor ligand screening

If you have soldiered on until now, good job. This is where the fun part comes – debunking the fear propaganda against Senomyx by a collection of staggering ignorami.

Senomyx is a company with a pretty clever business model. Instead of using human probands to develop taste enhancers (aka ‘flavourings’) and scents, Senomyx uses a foetal cell line, specifically HEK 293, to express certain receptors that mimic the taste receptors in the human body.

Then, it tests a vast number of candidate compounds on them, and sees which elicit a particular reaction. That product (which typically is a complex organic chemical but has nothing to do with the foetal cell!) is then patented and goes into your food. To reiterate the obvious: no foetal cells ever get anywhere near your food.

The kerfuffle around this is remarkably stupid because this is basically the same as High Throughput Screening (HTS), a core component of drug development today.14 Let’s go through how a drug is developed, with a fairly simple and entirely unrealistic example.15 We know that low postsynaptic levels of monoamines, especially of serotonin (but also dopamine and norepinephrine) correlate with low mood. One way to try to increase postsynaptic levels is by inhibiting the breakdown of monoamines, which happens by an enzyme called monoamine oxidase (MAO). But how do we find out which of the several thousand promising MAO inhibitors that our computer model spit out will actually work best? We can’t run a clinical trial for each. Not even an animal test. But we can run a high throughput screen. Here’s a much simplified example of how that could be done (it’s not how it’s actually done anymore, but it gets the idea across).

  1. We take a microtitre plate (a flat plate with up to hundreds of little holes called ‘wells’ that each take about half a millilitre of fluid), and fill it with our favourite monoamine neurotransmitters. Mmm, yummy serotonin! But because we’re tricky, we label them with a fluorescent tag or fluorophore, a substance that gives off light if excited by light of a particular wavelength, but only as long as the neurotransmitter it’s attached to has not been oxidised by MAO.
  2. We add a tiny amount of each of our putative drugs to a different well of the microtitre plate.
  3. Then, we add some monoamine oxidase to each well.
  4. When illuminated by the particular wavelength of light that excites the fluorophores, some wells will light up pretty well, others will be fairly dark. The bright wells indicate that the candidate drug in that well has largely inhibited monoamine oxidase, and thus the monoamine neurotransmitter remained intact. A dark well indicates that most or all of the monoamines were oxidised and as such no longer give off light. This helps us whittle down thousands of candidate molecules to hundreds or even dozens.
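
The read-out step boils down to a thresholding exercise. A minimal sketch in R, with entirely made-up numbers: the well count, the simulated inhibition values, the linear fluorescence model and the 80% cut-off are all assumptions for illustration, not anything a real assay prescribes.

```r
# Simulated plate: one hypothetical candidate compound per well.
set.seed(42)
n_wells <- 96
plate <- data.frame(
    well = seq_len(n_wells),
    # Fraction of MAO activity each candidate blocks (0 = none, 1 = all).
    inhibition = runif(n_wells)
)

# Toy model: fluorescence is proportional to the neurotransmitter left intact,
# which here simply equals the fraction of MAO that was inhibited.
max_signal <- 1000
plate$fluorescence <- max_signal * plate$inhibition

# Keep the bright wells: candidates that preserved at least 80% of the signal.
cutoff <- 0.8 * max_signal
hits <- plate[plate$fluorescence >= cutoff, ]

nrow(hits)        # how many candidates survive the screen
head(hits$well)   # which wells to follow up on
```

In practice the cut-off would be chosen relative to control wells rather than fixed in advance, but the shape of the computation is the same.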

What Senomyx does is largely similar. While their process is proprietary, it is evident it’s largely analogous to high throughput screening. The human cell lines are modified to express receptors analogous to those in taste buds. These are exposed to potential flavourings, and the degree to which they stimulate the receptor can be quantified and even compared, so that e.g. you can gauge what quantity of this new flavouring would offset 1 gram of sugar. The candidate substances are then tested on real people, too. Some of the substances are not sweeteners themselves but taste intensifiers, which interact to intensify the sweet taste sensation of sugar. It’s a fascinating technology, and a great idea – and has a huge potential to reduce the amount of sugars, salts and other harmful dietary flavourings in many meals.

S2227, a menthol-like flavouring manufactured by Senomyx. Can you spot the aborted cells? Nope, neither can I.

The end product is a flavouring – a molecule that has nothing to do with aborted cells (an example, the potent cooling sensate S2227, is depicted to the left – I dare anyone to show me where the aborted foetal cells are!). Soylent green, it turns out, is not people babies after all.

Conclusion

Now, two matters are outstanding. One is the safety of these substances. That, however, is irrelevant to how they were isolated. The product of a high-throughput screen is no more or less likely to be toxic than something derived from nature. The second point is a little more subtle.

A number of critics of Senomyx point out that this is, in a sense, deceiving the customer, and with pearl-clutching that would have won them awards in the 1940s, point out that companies want one thing, and it’s absolutely filthy:16

Companies like PepsiCo and Nestle S.A. seek to gain over competitors. To do so, they boast products like “reduced sugar”, “reduced sodium”, “no msg”, etc. Sales profits and stocks increase when consumers believe they eat or drink a healthier product.

The Weston A Price Foundation carries skepticism. They believe that the bottom line is what’s important. Companies only want to decrease the cost of goods for increased profits. Shareholders only care about stock prices and investment potential.

Err… and you expected what? There is absolutely no doubt that something that gives your body the taste of salt without the adverse effects of a high-sodium diet is A Good Thing – so consumers do not merely think they’re getting a healthier product, they’re getting a product with the same taste they prefer, but without the adverse dietary consequences (in other words: a healthier product).

To people who have to adhere to a strict low-sodium diet due to kidney disease, heart disease/hypertension or other health issues, this could well give back a craved-for flavour and improve quality of life. To people struggling with obesity, losing weight without having to say no to their favourite drink can result in better health outcomes. People with peanut allergies can enjoy a Reese’s Cup with a synthetic and chemically different protein that creates the same taste sensation, without risking anaphylaxis. In the end, these are things that matter, and should matter more than the fact that – shock horror! – someone is making money out of this.

All Senomyx does is what drug companies have been doing for decades, and an increasing number of companies will. But yet again, the cynics who see lizardoid conspiracies and corporate deceit behind every wall, who know the price of everything but haven’t thought about the value of it for a second, have seized upon another talking point. They did so exploiting a genuine pro-life sentiment so many hold dear, intentionally misrepresenting or recklessly misunderstanding the fact that no aborted tissue will ever make its way into your Coke, nor will there be a need for a stream of abortions to feed a burgeoning industry for artificial taste bud cell lines. If anybody here is exploitative, it’s not Senomyx – it’s those who seize upon the universal human injunction against cannibalism and infanticide to push a scientifically incorrect, debunked agenda against something they themselves barely understand.

References

1.This is not a post about the politics of abortion, Roe v Wade, pro-life vs pro-choice, religion vs science or any of the rest. It is about laying a pernicious lie to rest. My position regarding abortion is quite irrelevant to this, as is theirs, but it deserves mention that all of the people who got in touch hold very genuine and consistent views about the sanctity of life. Please let’s not make it about something it isn’t about.
2.Needless to say, this is not my friends’ usual fare, they too have seen it on social media and were quite dubious.
3.In line with our linking policy, we do not link to pages that endorse violence, hate or discrimination.
4.From Greek τέλος, ‘end’.
5.Finally, Rebecca Skloot’s amazing book, The Immortal Life of Henrietta Lacks, paid a long overdue tribute to the mother of modern medicine in 2010. Her book is a must-read to anyone who wants to understand the ethical complexities of immortal cell lines, as well as the touching story of a woman whom we for so long have known by initials only, yet owe such a debt to.
6.Disclosure: the University of Leiden is my alma mater, I have spent a wonderful year there, and received great treatment at LUMC, the university hospital. I am not, and have never been, in receipt of funds from the university.
7.As such, these cells represent an immortalised cell line as opposed to an immortal cell line like HeLa, where the change to the cell cycle regulation has already occurred.
8.Human, Embryonic, Kidney, from experiment #293.
9.Sort of. Like all human processes, cell culturing is not perfect. One risk is a so-called ‘contaminated cell line’, and the classical case study for that is the Chang Liver cell line, which turned out to be all HeLa, all the way. This is not only a significant problem, it is also responsible for huge monetary loss and wasted research money. You can read more about this, and what scientists are doing to combat the problem, here.
10.For ethical reasons, this blog refuses to link to, and thus create revenue for, quacks, extremists and pseudoscientists. However, where the source material is indispensable, an archival service is used to obtain a snapshot of the website, so that you, too, can safely peruse Mr Ji’s nonsense without making him any money. GreenMedInfo has a whole tedious page on how to cite their nonsense, which I am going to roundly ignore because a) it looks and reads like it was written by someone who flunked out of pre-law in his sophomore year, b) 17 U.S.C. §107 explicitly guarantees fair use rights for scholarship, research and criticism.
11.Curiously, he does not mention WI-38 and MRC-5, both cell lines derived from the lung epithelial cells of aborted foetuses, which are widely used in vaccine production…
12.The word ‘profound’ is especially meet in this context.
13.Which sounds like something someone MUST have done already. Come on. It’s 2018.
14.This article is a fantastic illustration of just how powerful this technique is!
15.Unrealistic, because we know all there is to know about monoamine oxidase inhibitors, and there’s no point in researching them much more – but it illustrates the point well!
16.Yes, I brought THAT meme in here!

SafeGram: visualising drug safety

Update: an RMarkdown notebook explaining the whole process is available here.

Visualising vaccine safety is hard. Doing so from passive (or, as we say it in Britain, ‘spontaneous’!) pharmacovigilance (PhV) sources is even harder. Unlike in active or trial pharmacovigilance, where you are essentially dividing the number of incidents by the person-time or the number of patients in the cohort overall, in passive PhV, only incidents are reported. This makes it quite difficult to figure out their prevalence overall, but fortunately, we have some metrics we can use to better understand the issues with a particular medication or vaccine. The proportional reporting ratio (PRR) is a metric that can operate entirely on spontaneous reporting, and reflect how frequent a particular symptom is for a particular treatment versus all other treatments.

Defining PRR

For convenience’s sake, I will use the subscript * operator to mean a row or column sum of a matrix, so that

N_{i,*} = \displaystyle \sum_{j=1}^{n} N_{i,j}

and

N_{*,j} = \displaystyle \sum_{i=1}^{m} N_{i,j}

and furthermore, I will use the exclusion operator * \neg to mean all entities except the right hand value. So e.g.

N_{i, * \neg k} = \displaystyle \sum_{j=1, j \neq k}^{n} N_{i,j}

Conventionally, the PRR is often defined with reference to a 2×2 contingency table that cross-tabulates treatments (m axis) with adverse effects (n axis):

\begin{array}{l|c|c|c}
 & \text{Adverse effect of interest } (j) & \text{All other adverse effects } (\neg j) & \text{TOTAL} \\
\hline
\text{Treatment of interest } (i) & a = D_{i,j} & b = D_{i, * \neg j} & a + b = D_{i, *} = \displaystyle \sum_{l = 1}^{n} D_{i, l} \\
\hline
\text{All other treatments } (\neg i) & c = D_{* \neg i, j} & d = D_{* \neg i, * \neg j} & c + d = D_{* \neg i, *} = \displaystyle \sum_{k=1, k \neq i}^{m} \sum_{l = 1}^{n} D_{k, l}
\end{array}

With reference to the contingency table, the PRR is usually defined as

\frac{a / (a+b)}{c / (c+d)} = \frac{a}{a + b} \cdot \frac{c + d}{c}

However, let’s formally define it over any matrix D.

Definition 1. PRR. Let D be an m \times n matrix that represents the frequency with which each of the n adverse effects occurs for each of the m treatments, so that D_{i,j} (1 \leq i \leq m, 1 \leq j \leq n) represents the number of times the adverse effect j has occurred with the treatment i.

For convenience’s sake, let D_{*,j} denote \sum_{i=1}^{m} D_{i,j}, let D_{i,*} denote \sum_{j=1}^{n} D_{i,j}, and let D_{*,*} denote \sum_{i=1}^{m} \sum_{j=1}^{n} D_{i,j}. Furthermore, let D_{* \neg i, j} denote \sum_{k=1, k \neq i}^{m} D_{k,j}, let D_{i, * \neg j} denote \sum_{k=1, k \neq j}^{n} D_{i, k}, and let D_{* \neg i, *} denote \sum_{k=1, k \neq i}^{m} D_{k,*}.

Then, PRR can be calculated for each combination D_{i,j} by the following formula:

PRR_{i,j} = \frac{D_{i,j} / D_{i,*}}{D_{* \neg i, j} / D_{* \neg i, *}} = \frac{D_{i,j}}{D_{i,*}} \cdot \frac{D_{*\neg i, *}}{D_{*\neg i, j}}

Expanding this, we get

PRR_{i,j} = \frac{D_{i,j}}{\displaystyle\sum_{q=1}^n D_{i,q}} \cdot \frac{\displaystyle\sum_{r=1, r\ne i}^{m} \displaystyle\sum_{s=1}^{n} D_{r,s}}{\displaystyle\sum_{t=1, t\ne i}^{m} D_{t,j}}

Which looks and sounds awfully convoluted until we start to think of it as a relatively simple query operation: calculate the sum of each row, then divide one proportion by another. The numerator is the number of reports of the ADR of interest for the treatment of interest, divided by all ADR reports for that treatment (a over a + b). The denominator is the number of reports of the ADR of interest for all other treatments (j \mid \neg i or c), divided by all ADR reports for all treatments other than the treatment of interest (c + d). Easy peasy!
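
As a rough sketch of that query in R, assuming D is a numeric matrix with treatments in rows and adverse effects in columns (the function name prr_matrix and the report counts are made up for illustration):

```r
# Rows of D are treatments (i), columns are adverse effects (j).
prr_matrix <- function(D) {
    stopifnot(is.matrix(D), is.numeric(D))
    row_totals <- rowSums(D)   # D_{i,*}
    col_totals <- colSums(D)   # D_{*,j}
    # Reports of effect j for every treatment except i: D_{* not i, j}
    other_counts <- matrix(col_totals, nrow(D), ncol(D), byrow = TRUE) - D
    # All reports for every treatment except i: D_{* not i, *}
    other_totals <- sum(D) - row_totals
    # Dividing a matrix by a vector of length nrow(D) divides row i by element i.
    (D / row_totals) / (other_counts / other_totals)
}

# Made-up report counts, purely for illustration.
D <- rbind(
    vaccine_a = c(fever = 20, rash = 80),
    vaccine_b = c(fever = 10, rash = 190)
)
prr <- prr_matrix(D)
prr["vaccine_a", "fever"]   # (20 / 100) / (10 / 200), i.e. 4
```

Note that a cell with no reports among the other treatments produces a division by zero (Inf or NaN), so sparse tables need handling before interpretation.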

Beyond PRR

However, the PRR only tells part of the story. It does show whether a particular symptom is disproportionately often reported – but does it show whether that particular symptom is frequent at all? Evans (1998) suggested using a combination of an N-minimum, a PRR value and a chi-square value to identify a signal.1 In order to represent the overall safety profile of a drug, it’s important to show not only the PRR but also the overall incidence of each risk. The SafeGram is designed to show exactly that, for every reported side effect. To show a better estimate, instead of plotting individual points (there are several hundreds, or even thousands, of different side effects), the kernel density is plotted.
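
A combined criterion of this kind is straightforward to express in code. The sketch below is a hypothetical helper (flag_signal is not from any package), and its default thresholds of at least 3 reports, a PRR of at least 2 and a chi-square of at least 4 are values commonly quoted for this approach; treat them as assumptions and tune them to your data.

```r
# a, b, c, d are the four cells of the 2x2 contingency table.
flag_signal <- function(a, b, c, d, min_n = 3, min_prr = 2, min_chisq = 4) {
    prr <- (a / (a + b)) / (c / (c + d))
    counts <- matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)
    # chisq.test() applies Yates' continuity correction to 2x2 tables by default.
    chisq <- unname(suppressWarnings(chisq.test(counts))$statistic)
    a >= min_n && prr >= min_prr && chisq >= min_chisq
}

flag_signal(a = 20, b = 80, c = 10, d = 190)   # frequent and disproportionate
# [1] TRUE
flag_signal(a = 2, b = 98, c = 10, d = 190)    # too few reports to count
# [1] FALSE
```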

This SafeGram shows four vaccines (meningococcal, oral polio, injectable polio and smallpox) and their safety record based on VAERS data between 2006 and 2016.

The reason why SafeGrams are so intuitive is that they convey two important facts at once: how disproportionately, and how frequently, each side effect is reported. The PRR cut-off (set to 3.00 in this case) conclusively excludes statistically insignificant increases of risk.2 Of course, anything above that is not necessarily dangerous or proof of a safety signal. Rather, it allows the clinician to reason about the side effect profile of the particular medication.

  • The meningococcal vaccine (left upper corner) had several side effects that occurred frequently (hence the tall, ‘flame-like’ appearance). However, these were largely side effects that were shared among other vaccines (hence the low PRR). This is the epitome of a safe vaccine, with few surprises likely.
  • The injectable polio vaccine (IPV) has a similar profile, although the wide disseminated ‘margin’ (blue) indicates that it has a wider range of side effects compared to the meningococcal vaccine, even though virtually all of these were side effects shared among other vaccines to the same extent.
  • The oral polio vaccine (OPV, left bottom corner) shows a flattened pattern typical for vaccines that have a number of ‘peculiar’ side effects. While the disproportionately frequently reported instances are relatively infrequent, the ‘tail-like’ appearance of the OPV SafeGram is a cause for concern. The difference between meningococcal and IPV on one hand and OPV on the other is explained largely by the fact that OPV was a ‘live’ vaccine, and in small susceptible groups (hence the low numbers), it could provoke adverse effects.
  • The smallpox vaccine, another live vaccine, was known to have a range of adverse effects, with a significant part of the population (about 20%) having at least one contraindication. The large area covered indicates that there is a rather astonishing diversity of side effects, and many of these – about half of the orange kernel – lie beyond the significance boundary of 3.00. The large area covered by the kernel density estimate and the reach into the right upper corner indicates a very probable safety signal worth examining.

Interpretation

A SafeGram for each vaccine shows the two-dimensional density distribution of two things – the frequency and the proportional reporting ratio of each adverse effect reported for that vaccine (or drug or device or whatever it is applied to). When considering the safety of a particular product, the most important question is whether a particular adverse effect is serious – a product with a low chance of an irreversible severe side effect is riskier than one with a high probability of a relatively harmless side effect, such as localized soreness after injection. But the relative severity of a side effect is hard to quantify, and a better proxy for that is to assume that in general, most severe side effects will be unique to a particular vaccine. So for instance while injection site reactions and mild pyrexia following inoculation are common to all vaccines and hence their proportional reporting ratios are relatively low, reflecting roughly the number of inoculations administered, serious adverse effects tend to be more particular to the vaccine (e.g. the association of influenza vaccines with Guillain-Barré syndrome in certain years means that GBS has an elevated PRR, despite the low number of occurrences, for the flu vaccines). Discarding vaccines with a very low number of administered cases, the SafeGram remains robust to differences between the number of vaccines administered. Fig. 1 above shows a number of typical patterns. In general, anything to the left of the vertical significance line can be safely ignored, as these are generally effects shared between most other vaccines and exhibit no specific risk signal for the particular vaccine. On the other hand, occurrences to the right of the vertical significance line may – but do not necessarily – indicate a safety signal.
Of particular concern are right upper quadrant signals – these are frequent and at the same time peculiar to a particular vaccine, suggesting that they are not part of the typical post-inoculation syndrome (fever, fatigue, malaise) arising from immune activation but rather specific issues created by the antigen or the adjuvant. In rare cases, there is a lower right corner ‘stripe’, such as for the OPV, where a wide range of unique but relatively infrequent effects are produced. These, too, might indicate the need for closer scrutiny. It is crucial to note that merely having a density of signals in the statistically significant range does not automatically mean that there is a pharmacovigilance (PhV) concern, but rather that such a concern cannot be excluded. Setting the PRR significance limit is somewhat arbitrary, but Evans et al. (2001) have found a PRR of 2, more than 3 cases over a two year period and a chi-square statistic of 4 or above to be suggestive of a safety signal. Following this lead, the original SafeGram code looks at a PRR of 3.0 and above, and disregards cases with an overall frequency of less than 3Y, where Y denotes the number of years considered.
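To make the signal criteria above a little more tangible, here is a minimal Python sketch of how a PRR and the accompanying thresholds could be computed from a 2×2 table of report counts. This is not the original SafeGram code: the class and function names, the example counts, the use of an uncorrected Pearson chi-square and the reading of the frequency rule as a minimum of 3Y reports are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReportCounts:
    """Hypothetical 2x2 table of report counts for one vaccine/symptom pair."""
    a: int  # reports of this symptom for the vaccine of interest
    b: int  # reports of all other symptoms for the vaccine of interest
    c: int  # reports of this symptom for all other vaccines
    d: int  # reports of all other symptoms for all other vaccines


def prr(t: ReportCounts) -> float:
    """Proportional reporting ratio: [a / (a + b)] / [c / (c + d)].

    Assumes c > 0, i.e. the symptom was reported for at least one other vaccine.
    """
    return (t.a / (t.a + t.b)) / (t.c / (t.c + t.d))


def chi_square(t: ReportCounts) -> float:
    """Pearson chi-square for a 2x2 table (assumption: no continuity correction)."""
    n = t.a + t.b + t.c + t.d
    numerator = n * (t.a * t.d - t.b * t.c) ** 2
    denominator = (t.a + t.b) * (t.c + t.d) * (t.a + t.c) * (t.b + t.d)
    return numerator / denominator


def is_potential_signal(t: ReportCounts, years: int,
                        min_prr: float = 3.0, min_chi2: float = 4.0) -> bool:
    """Apply the three thresholds: enough reports, high PRR, high chi-square."""
    enough_reports = t.a >= 3 * years  # assumed reading of the '3Y' rule
    return enough_reports and prr(t) >= min_prr and chi_square(t) >= min_chi2


# Made-up counts: a symptom that is rare overall but concentrated in one vaccine...
peculiar = ReportCounts(a=40, b=960, c=200, d=98800)
# ...and one that is common to all vaccines (think mild fever).
shared = ReportCounts(a=300, b=700, c=29000, d=70000)

print(round(prr(peculiar), 2), is_potential_signal(peculiar, years=10))
print(round(prr(shared), 2), is_potential_signal(shared, years=10))
```

The first symptom is flagged because it is reported far more often for the vaccine of interest than elsewhere; the second is frequent, but just as frequent for every other vaccine, so its PRR stays close to 1.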

Limitations

The SafeGram inherently tries to make the best out of imperfect data. Acknowledging that passive reporting data is subject to imperfections, some caveats need to be kept in mind.

  • The algorithm assigns equal weight to every ‘symptom’ reported. VAERS uses an unfiltered version of MedDRA, a coding system for regulatory activities, and this includes a shocking array of codes that do not suggest any pathology. For instance, the VAERS implementation of MedDRA contains 530 codes for normal non-pathological states (e.g. “abdomen scan normal”), and almost 18,000 (!) events involve at least one of these ‘everything is fine!’ markers. This may be clinically useful because they may assist in differential diagnosis and excluding other causes of symptoms, but since they’re not treated separately from actually pathological symptoms, they corrupt the data to a minor but not insignificant extent. The only solution is manual filtering, and with tens of thousands of MedDRA codes, one would not necessarily be inclined to do so. The consequence is that some symptoms aren’t symptoms at all – they’re the exact opposite. This is not a problem for the PRR because it compares a symptom among those taking a particular medication against the same symptom among those who are not.
  • A lot of VAERS reports are, of course, low quality reports, and there is no way for the SafeGram to differentiate. This is a persistent problem with all passive reporting systems.
  • The SafeGram gives an overall picture of a particular drug’s or vaccine’s safety. It does not differentiate between the relative severity of a particular symptom.
  • As usual, correlation does not equal causation. As such, none of this proves the actual risk or danger of a vaccine, but rather the correlation or, in other words, potential safety signals that are worth examining.
Grouped by pathogen, the safety of a range of vaccines was examined by estimating the density of adverse event occurrence versus adverse event PRR. Note that adverse events reported in VAERS do not show or prove causation, only correlation. This shows that for the overwhelming majority of vaccines, most AEFI reports are below the PRR required to be considered a true safety signal.

SafeGrams are a great way to show the safety of vaccines, and to identify which vaccines have frequently occurring and significantly distinct (high-PRR) AEFIs that may be potential signals. It is important to note that for most common vaccines, including controversial ones like HPV, the centre of the kernel density estimate is below the PRR signal limit. The SafeGram is a useful and visually appealing proof of the safety of vaccines that can get actionable intelligence out of VAERS passive reporting evidence that is often disregarded as useless.

References

1. Evans, S. J. W. et al. (1998). Proportional reporting ratios: the uses of epidemiological methods for signal generation. Pharmacoepidemiol Drug Saf, 7(Suppl 2), 102.
2. According to Evans et al., the correct figure for PRR exclusion is 2.00, but they also use N-restriction and a minimum chi-square of 4.0.

Ebola! Graph databases! Contact tracing! Bad puns!

Thanks to the awesome folks at Neo4j Budapest and GraphAware, I will be talking tonight about Ebola, contact tracing, how graph databases help us understand epidemics and maybe prevent them someday. Now, if flying to Budapest on short notice might not work for you, you can listen to a livestream of the whole event here! It starts today, 13 February, at 1830 CET, 1730 GMT or 1230 Eastern Time, and I sincerely hope you will listen to it, live or later from the recording, also accessible here.

Are you looking for a data science sensei?

Maybe you’re a junior data scientist, maybe you’re a software developer who wants to go into data science, or perhaps you’ve dabbled in data for years in Excel but are ready to take the next step.

If so, this post is all about you, and an opportunity I offer every year.

You see, life has been very good to me in terms of training as a data scientist. I have been spoiled, really – I had the chance to learn from some of the best data scientists, work with some exceptional epidemiologists, experience some unusual challenges and face many of the day-to-day hurdles of working in data analytics. I’ve had the fortune to see this profession in all its contexts, from small enterprises to multi-million dollar FTSE100 companies, from well-run agile start-ups to large and sometimes pretty slow dinosaurs, from government through the private sector to NGOs: I’ve seen it all. I’ve done some great things. And I’ve made some superbly dumb mistakes.

And so, at the start of every year, I have opened applications for young, start-of-career data scientists looking for their Mr. Miyagi. Don’t worry: no car waxing involved. I will be choosing a single promising young data scientist and passing on as much as I can of my so-called wisdom. At the end, your skills will shine like Mr. Miyagi’s 1947 Ford Deluxe Convertible. There’s no catch, no hidden trap, no fees or charges involved (except the one mentioned below).

Eligibility criteria

To be eligible, you must meet all of the following criteria:

  • You must be 18 or above if you are taking a gap year or not attending a university/college.
  • You do not have to have a formal degree in data science or a relevant subject, but you must have completed it if you do. In other words: if you’re in your 3rd year of an English Lit degree, you’re welcome to apply, but if you’re in the middle of your CS degree, you have to wait until you’re finished – sorry. The same goes if you intend to go straight on to a data science-related postgrad within the year.
  • You must have a solid basis in mathematics: decent statistics, combinatorics, linear algebra and some high school calculus are the very minimum.
  • You must be familiar with Python (3.5 and above), and either familiar with the scientific Python stack (SciPy, NumPy, Pandas, matplotlib) or willing to pick up a lot on the go.
  • You must be willing to put in the work: we’ll be convening about once every week to ten days by Skype for an hour, and you’ll probably be doing 6-10 hours’ worth of reading and work for the rest of the week. Please be realistic about whether you can sustain this.
  • If, as recommended, you are working on an AWS EC2 instance, be aware this might cost money and make sure you can cover the costs. In practice, these are negligible.
  • You must understand that this is a physically and intellectually strenuous endeavor, and it is your responsibility to know whether you’re physically and mentally up for the job. However, no physical or mental disabilities are regarded as automatically excluding you from consideration.
  • You must not live in, reside in or be a citizen of any of the countries listed in CFR Title 22 Part 126, §126.1(d)(1) and (2).
  • You must not have been convicted of a felony anywhere. This includes ‘spent’ UK criminal convictions.

Sounds good? Apply here.

Preferred applicants

When assessing applications, the following groups are given preference:

  • Persons with mental or physical disabilities whose disability precludes them from finding conventional employment – please outline this situation on the application form.
  • Honourably discharged (or equivalent) veterans of NATO forces and the IDF – please include member 4 copy of DD-214, Wehrdienstzeitbescheinigung or equivalent document that lists type of discharge.

What we’ll be up to

Don’t worry. None of this car waxing crap.

Over the 42 weeks to follow, you will be undergoing a rigorous and structured semi-self-directed training process. This will take your background, interests and future ambitions into account, but at the core, you will:

  • master Python’s data processing stack,
  • learn how to visualize data in Python,
  • work with networks and graph databases, including Neo4j,
  • acquire the correct way of presenting results in data science to stakeholders,
  • delve into cutting-edge methods of machine learning, such as deep learning using keras,
  • work on problems in computer vision and get familiar with the Python bindings of OpenCV,
  • scrape data from social networks, and
  • learn convenient ways of representing, summarizing and distributing our results.

The programme is divided into three ‘terms’ of 14 weeks each, which each consist of 9 weeks of directed study, 4 weeks of self-directed project work and one week of R&R.

What you’ll be getting out of this

Since the introduction of Docker, tolerance for wanton destruction as part of coursework has increased, but still won’t earn you a passing grade by itself.

In the past years, mentees have noted the unusual breadth of knowledge they have acquired about data science, as well as the diversity of practical topics and the realistic question settings, with an emphasis on practical applications of data science such as presenting data products. I hope that this year, too, I’ll be able to convey the same important topics. Every year is a little different as I try to adjust the course to meet the individual participant’s needs.

The programme is not, of course, accredited by any accreditation body, but a certificate of completion will be issued to any participant who so wishes.

Application process

Simply fill in the form below and send it off by 14 January 2018. The top contenders will be contacted by e-mail or telephone for a brief conversation thereafter. Finally, a lucky winner will be picked by the 21st January 2018. Easy peasy!

 

FAQ

Q: What does ‘semi-self-directed’ mean? Is there a fixed curriculum?

A: No. There are some basic topics (see list above) that I think are quite likely to come up, but ultimately, this is about making you the data scientist you want to be. For this reason, we’ll begin by planning out where you want to improve – kinda like a PT gives you a training plan before you start out at their gym. We will then adjust as needed. This is not exam prep, it’s a learning experience, and for that reason, we can focus on delving deeper and getting the fundamentals right rather than cramming in a particular curriculum.

Q: Can I bring my own data?

A: Sure. In general, we’ll be using standard data sets, because they’re well-known and high-quality data. But if you have a dataset you collected or are otherwise entitled to use that would do equally well, there’s no reason why we couldn’t use it! Note that you must have the right to use and share the data set, meaning it’s unlikely you’re able to use data sets from your day job.

Q: Will this give me an employment advantage?

A: I don’t quite know – it’s impossible to predict. The field of data science degrees is something of a Wild West still, and while some reputable degrees have emerged, others are dubious. Employers still don’t know what to go by. However, you will most definitely be better prepared for an employment interview in data science!

Q: Why are you so keen on presenting data the right way?

A: Because as data scientists, we’re expected to not merely understand the data and draw the right conclusions, but also to convey them to stakeholders at various levels, from plant management to C-suite, in a way that gets the right message across at the first go.

Q: You’re a computational epidemiologist. Can I apply even if my work doesn’t really involve healthcare?

A: Sure. The principles are the same, and we’re largely focusing on generic topics. You might be exposed to bits and pieces of epidemiology, but I can guarantee it won’t hurt.

Q: Why do you only take on one mentee?

A: To begin with, my life is pretty busy – I have a demanding job, a family and – shock horror! – I even need to sleep every once in a while. More importantly, I want to devote my undivided attention to a worthy candidate.

Q: How come I’ve never heard of this before?

A: Until now, I’ve largely gotten mentees by word of mouth. I am concerned that this is keeping some talented people out and limiting the pool of people we should have in. That’s why this year, I have tried to make this process much more transparent.

Q: You’re rather fond of General ‘Mad Dog’ Mattis. Will there be yelling?

No.

Q: There seems to be no upper age limit. Is that a mistake?

No.

Q: I have more questions.

A: You can ask them here.

How I predicted Trump’s victory

Introit

“Can you, just once, explain it in intelligible words?”, my wife asked.

We’d been talking for about an hour about American politics, and I made a valiant effort at trying to explain to her how my predictive model for the election worked, what it took into account and what it did… but twenty minutes in, I was torn between either using terms like stochastic gradient descent and confusing her, or having to start to build everything up from high school times tables onwards.

Now, my wife is no dunce. She is one of the most intelligent people I’ve ever had the honour to encounter, and I’ve spent years moving around academia and industry and science. She’s not only a wonderful artist and a passionate supporter of the arts, she’s also endowed with that clear, incisive intelligence that can whittle down the smooth, impure rock of a nascent theory into the Koh-I-Noor clarity of her theoretical work.

Yet, the fact is, we’ve become a very specialised industry. We, who are in the business of predicting the future, now do so with models that are barely intelligible to outsiders, and some even barely intelligible to those who do not share a subfield with you (I’m looking at you, my fellow topological analytics theorists!). Quite frankly, then: the world is run by algorithms that at best a fraction of us understand.

So when asked to write an account of how I predicted Trump’s victory, I’ve tried to write an account for a ‘popular audience’.1 That means there’s more I want to get across than the way I built some model that for once turned out to be right. I also want to give you an insight into a world that’s generally pretty well hidden behind a wall made of obscure theory, social anxiety and plenty of confusing language. The latter, in and of itself, takes some time and patience to whittle down. People have asked me repeatedly what this support vector machine I was talking about all the time looked like, and were disappointed to hear it was not an actual machine with cranks and levers, just an algorithm. And the joke is not really on them, it’s largely on us. And so is the duty to make ourselves intelligible.

Prelude

I don’t think there’s been a Presidential election as controversial as Trump’s in recent history. Certainly I cannot remember any recent President having aroused the same sort of fervent reactions from supporters and opponents alike. As a quintessentially apolitical person, that struck me as the kind of odd that attracts data scientists like flies. And so, about a year ago, amidst moving stacks of boxes into my new office, I thought about modelling the outcome of the US elections.

It was a big gamble, and it was a game for a David with his sling. Here I was, with a limited (at best) understanding of the American political system, not much access to private polls the way major media and their court political scientists have, and generally having to rely on my own means to do it. I had no illusions about the chances.

After the first debate, I tweeted this:

Also, as so many asked: post debate indicators included, only 1 of over 200 ensemble models predict a HRC win. Most are strongly Trump win.

– Chris (@DoodlingData), September 28, 2016

To recall, this was a month and a half ago, and chances for Trump looked dim. He was assailed from a dozen sides. He was embroiled in what looked at the time like the largest mass accusation of sexual misconduct ever levelled against a candidate. He had, as many right and left were keen on pointing out, “no ground game”, polling unanimously went against him and I was fairly sure dinner on 10 November at our home would include crow.

But then, I had precious little to lose. I was never part of the political pundits’ cocoon, nor did I ever have a wish to be so. There’s only so much you can offer a man in consideration of a complete commonsensectomy. I do, however, enjoy playing with numbers – even if it’s a Hail Mary pass of predicting a turbulent, crazy election.

I’m not alone in that – these days, the average voter is assailed by a plethora of opinions, quantifications, pontifications and other -fications about the vote. It’s difficult to make sense of most of it. Some speak of their models with downright the same reverence with which one might once have invoked the name of the Pythiae of the Delphic Oracle. Others brashly assert that ‘math says’ one or other party has ‘already won’ the elections, a month ahead. And I would entirely forgive anyone who were to think that we are, all in all, a bunch of charlatans with slightly more high-tech dowsing rods and flashier crystal balls.

Like every data scientist, I’ve been asked a few times what I ‘really’ do. Do I wear a lab coat? I work in a ‘lab’, after all, so many deduced I would be some sort of experimental scientist. Or am I the Moneyball dude? Or Nate Silver?

Thankfully, none of those is true. I hate working in the traditional experimental science lab setting (it’s too crowded and loud for my tastes), I don’t wear a lab coat (except as a joke at the expense of one of my long-term harassers), I don’t know anything about baseball statistics and, thanks be to God, I am not Nate Silver.

I am, however, in the business of predicting the future. Which sounds very much like theorising about spaceships and hoverboards, but is in fact quite a bit narrower. You see, I’m a data scientist specialising in several fields of working with data, one of which is ‘predictive analytics’ (PA). PA emerged from combinatorics (glorified dice throwing), statistics (lies, damned lies and ~) and some other parts of math (linear algebra, topology, etc.) and altogether aims to look at the past and find features that might help predicting the future. Over the last few years, this field has experienced an absolute explosion, thanks to a concept called machine learning (ML).

ML is another of those notions that evokes more passionate fear than understanding. In fact, when I explained to a kindly old lady with an abundance of curiosity that I worked in machine learning, she asked me what kind of machines I was teaching, and what I was teaching them – and whether I had taught children before. The reality is, we don’t sit around and read Moby Dick to our computers. Nor is ML some magic step towards artificial intelligence, like Cortana ingesting the entire Forerunner archives in Halo. No, machine learning is actually quite simple: it’s the art and science of creating applications that, at least when they work well, perform better each time than the time before.

It is high art and hard science. Most of modern ML is unintelligible without very solid mathematical foundations, and yet knowledge has never really been able to substitute for experience and a flair for constructing, applying and chaining mathematical methods to the point of accomplishing the best, most accurate result.

Wait, I haven’t talked about results yet! In machine learning, we have two kinds of ‘result’. We have processes we call ‘supervised learning’, where we give the computer a pattern and expect it to keep applying it. For instance, we give it a set (known in this context as the training set) of heart rhythm (ECG) tracings, and tell it which ones are fine and which ones are pathological. We then expect the computer to accurately label any heart rhythm we give to it.

There is also another realm of machine learning, called ‘unsupervised learning’. In unsupervised learning, we let the computer find the similarities and connections it wants to. One example would be giving the computer the same set of heart traces. It would then return what we call a ‘clustering’ – a group of heartbeats on one hand that are fine, and the pathological heartbeats on the other. We are somewhat less concerned with this type of machine learning. Electoral prediction is pretty much a straightforward supervised learning task, although there are interesting addenda that one can indeed do by leveraging certain unsupervised techniques. For instance, groups of people designated by certain characteristics might vote together, and a supervised model might be ‘told’ that a given number of people have to vote ‘as a block’.

These results are what we call ‘models’.

On models

Ever since Nate Silver allegedly predicted the Obama win, there has been a bit of a mystery-and-no-science-theatre around models, and how they work. Quite simply, a model is a function, like any other. You feed it source variables, it spits out a target variable. Like your washing machine:

f(C_d, W, E_{el}, P_w) = (C_c)

That is, put in dirty clothes (C_d), water (W), electricity (E_{el}) and washing powder (P_w), get clean clothes (C_c) as a result. Simple, no?

The only reason why a model is a little different is that it is, or is supposed to be, based on the relationship between some real entities on each side of the equality, so that if we know what’s on the left side (generally easy-to-measure things), we can get what’s on the right side. And normally, models were developed in some way by reference to data where we do have both sides of the equation. An example of this is Henssge’s nomogram, a nomogram being a visual representation of certain physical relationships. That particular model was developed from hundreds, if not thousands, of (get your retching bag ready) butthole temperature measurements of dead bodies where the time of death actually was known. As I’m certain you know, when you die, you slowly assume room temperature. There are a million factors that influence this, and calculating the time since death from all of them could certainly break a supercomputer. The result would be accurate, but not much more accurate than Henssge’s method. It turns out, as a gentleman called Claus Henssge discovered, that three and a half factors are pretty much enough to estimate the time since death with reasonable accuracy: the ambient temperature, the aforementioned butthole temperature, the decedent’s body weight, and a corrective factor to account for the decedent’s state of nakedness. Those factors altogether give you 95% or so accuracy – which is pretty good.

The Henssge nomogram illustrates two features of every model:

  1. They’re all based on past or known data.
  2. They’re all, to an extent, simplifications.

Now, traditionally, a model used to be built by people who reasoned deductively, then did some inductive stuff such as testing to assuage the more scientifically obsessed. And so it was with the Henssge nomogram, where data was collected, but everyone had a pretty decent hunch that time of death will correlate best with body weight and the difference between ambient and core (= rectal) temperature. That’s because heat transfer from a body to its environment generally depends on the temperature differential and the area of the surface of exchange:

Q = hA(T_a - T_b)

where Q is heat transferred per unit time, h is the heat transfer coefficient, A is the area of the object and T_a - T_b is the temperature difference. So from that, it then follows that T_a and T_b can be measured, h is relatively constant for humans (most humans are composed of the same substance) and A can be relatively well extrapolated from body weight.2

The entire story of modelling can be understood to focus on one thing, and do it really well: based on a data set (the training set), it creates a model that seeks to describe the essence of the relationship between the variables involved in the training set. The simplest such relationships are linear: for instance, if the training set consists of {number of hamburgers ordered; amount paid}, the model will be a straight line – for every increase on the hamburger axis, there will be the same increase on the amount paid axis. Some models are more complex – when they can no longer be described as a combination of straight lines, they’re called ‘nonlinear’. And eventually, they get way too complex to be adequately plotted. That is often the consequence of the training dataset consisting not merely of two fields (number of hamburgers and the target variable, i.e. price), but a whole list of other fields. These fields are called elements of the feature vector, and when there’s a lot of them, we speak of a high-dimensional dataset. The idea of a ‘higher dimension’ might sound mysterious, but true to fashion, mathematicians can make it sound boring. In data science, we regularly throw around data sets of several hundred or thousand dimensions or even more – so many, in fact, that there are whole techniques intended to reduce this number to something more manageable.

But just how do we get our models?

Building our model

In principle, you can sit down, think about a process and create a model based on some abstract simplifications and some other relationships you are aware of. That’s how the Henssge model was born – you need no experimental data to figure out that heat loss will depend on the radiating area, the temperature difference to ‘radiate away’ and the time the body has been left to assume room temperature: these things more or less follow from an understanding of how physics happens to work. You can then use data to verify or disprove your model, and if all goes well, you will get a result in the end.

There is another way of building models, however. You can feed a computer a lot of data, and have it come up with whatever representation gives the best result. This is known as machine learning, and is generally a bigger field than I could even cursorily survey here. It comes in two flavours – unsupervised ML, in which we let the computer loose on some data and hope it turns out ok, and supervised ML, in which we give the computer a very clear indication of what appropriate outputs are for given input values. We’re going to be concerned with the latter. The general idea of supervised ML is as follows.

  1. Give the algorithm a lot of value pairs from both sides of the function – that is, show the algorithm what comes out given a particular input. The inputs, and sometimes even the outputs, may be high-dimensional – in fact, in the field I deal with normally, known as time series analytics, thousands of dimensions of data are pretty frequently encountered. This data set is known as the training set.
  2. Look at what the algorithm came up with. Start feeding it some more data for which you know the ‘correct’ output, so to speak, but which you haven’t used as part of the training set. This is known as the test set. Examine how well your model is doing at predicting the test set.
  3. Tweak model parameters until accuracy stops improving. Often, an algorithm called gradient descent is used, which is basically a fancy way of saying ‘look at whether changing a model parameter in a particular direction by \mu makes the model perform better, and if so, keep doing it until it doesn’t’. \mu is known as the ‘learning rate’, and determines on one hand how fast the model will get to a best possible approximation of the result (how fast the model will converge), and on the other, how close it will be to the true best settings. Finding a good learning rate is more a dark art than science, but something people eventually get better at with practice.
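The three steps above can be sketched in a few lines of Python, using the hamburger example from earlier: a one-parameter linear model (price per burger) fitted by gradient descent on a training set and then checked against a held-out test set. The data, the function names and the value of the learning rate are made up for illustration, and the update rule uses the slope of the error directly rather than literally probing a step in each direction.

```python
def mean_squared_error(w, data):
    """Average squared gap between predicted (w * burgers) and actual amount paid."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def gradient(w, data):
    """Slope of the mean squared error with respect to the parameter w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def fit(train, mu=0.01, steps=500):
    """Gradient descent: repeatedly nudge w against the gradient, scaled by mu."""
    w = 0.0  # arbitrary starting guess
    for _ in range(steps):
        w -= mu * gradient(w, train)
    return w


# Step 1: training set of (hamburgers ordered, amount paid) pairs -- invented data.
train = [(1, 2.5), (2, 5.0), (3, 7.5), (4, 10.0)]
# Step 2: test set the model never sees during fitting.
test = [(5, 12.5), (6, 15.0)]

# Step 3: tweak the parameter until the error stops shrinking.
w = fit(train)
print("price per burger:", round(w, 3))
print("test error:", round(mean_squared_error(w, test), 6))
```

With a much larger `mu` the updates overshoot and the error grows instead of shrinking; with a much smaller one, 500 steps are not enough to get close. That trade-off is the ‘dark art’ part.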

In this case, I was using a modelling approach called a backpropagation neural network. An artificial neural network (ANN) is basically a bunch of nodes, known as neurons, connected to each other. Each node runs a function on the input value and spits it out to its output. An ANN has these neurons arranged in layers, and generally nodes feed in one direction (‘forward’), i.e. from one layer to the next, and never among nodes in the same layer.

Neurons are connected by ‘synapses’ that are basically weighted connections (weighting simply means multiplying each input to a neuron by a value that emphasises its significance). The weights are the ‘secret sauce’ to this entire algorithm. For instance, you may have an ANN set to recognise handwritten digits. The layers would get increasingly complex. So one node may respond to whether the digit has a straight vertical line. The output node for the digit 1 would weight the output from this node quite strongly, while the output node for 8 would weight it very weakly. Now, it’s possible to pick the functions and determine the weights manually, but there’s something better – an algorithm called backpropagation that, basically, keeps adjusting weights using gradient descent (as described above) to reach an optimal weighting, i.e. one that’s most likely to return accurate values.
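For the curious, here is a toy sketch of that idea in plain Python: a network with two inputs, two hidden neurons and one output neuron, whose weights are adjusted by backpropagation. It is emphatically not the election model. The network size, the sigmoid activation, the random seed, the learning rate and the training data (a logical OR rather than handwritten digits) are all assumptions chosen to keep the example tiny.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Feed inputs forward: input layer -> hidden layer -> single output neuron."""
    # Each hidden neuron takes a weighted sum of both inputs plus a bias weight.
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    # The output neuron takes a weighted sum of the hidden outputs plus a bias.
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1])
    return hidden, out


def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - y) ** 2 for x, y in data) / len(data)


def backprop_epoch(data, w_hidden, w_out, mu):
    """One pass over the data, updating the weights in place after each example."""
    for x, y in data:
        hidden, out = forward(x, w_hidden, w_out)
        # Error signal at the output (slope of half the squared error via the sigmoid).
        delta_out = (out - y) * out * (1 - out)
        # Propagate that error signal backwards to each hidden neuron.
        delta_hidden = [delta_out * w_out[j] * h * (1 - h) for j, h in enumerate(hidden)]
        # Gradient descent: nudge every weight against its share of the error.
        for j, h in enumerate(hidden):
            w_out[j] -= mu * delta_out * h
        w_out[-1] -= mu * delta_out
        for j, d in enumerate(delta_hidden):
            w_hidden[j][0] -= mu * d * x[0]
            w_hidden[j][1] -= mu * d * x[1]
            w_hidden[j][2] -= mu * d


rng = random.Random(0)  # fixed seed so the run is repeatable
w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [rng.uniform(-1, 1) for _ in range(3)]
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # logical OR

loss_before = mean_squared_error(data, w_hidden, w_out)
for _ in range(2000):
    backprop_epoch(data, w_hidden, w_out, mu=0.5)
loss_after = mean_squared_error(data, w_hidden, w_out)
print("error before:", loss_before, "error after:", loss_after)
```

Nobody told the network which weights matter; it started from random values and the repeated small corrections did the rest.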

My main premise for creating the models was threefold.

  1. No polling. None at all. The explanation for that is twofold. First, I am not a political scientist. I don’t understand polls as well as I ought to, and I don’t trust things I don’t understand completely (and neither should you!). Most of all, though, I worry that polls are easy to influence. I witnessed the 1994 Hungarian elections, where the incumbent right-wing party won all polls and exit-poll surveys by a mile… right up until eventually the post-communists won the actual elections. How far that was a stolen election is a different question: what matters is that ever since, I have no faith at all in polling, and that hasn’t gotten better lately. Especially in the current elections, a stigma has developed around voting Trump – people have been beaten up, verbally assaulted and professionally ostracised for it. Clearly asking them politely will not give you the truth.
  2. No prejudice for or against particular indicators. The models were generated from a vast pool of indicators, and, to put it quite simply, a machine was created that looked for correlations between electoral results and various input indicators. I’m pretty sure many, even perhaps most, of those correlations were spurious. At the same time, spurious correlations don’t hurt a predictive model if you’re not intending to use the model for anything other than prediction.
  3. Assumed ergodicity. Ergodicity, quite simply, means that the average of an indicator over time is the same as the average of an indicator over space. To give you an example:3 assume you’re interested in the ‘average price’ of shoes. You may either spend a day visiting every shoe store and calculate the average of their prices (average over space), or you may swing past the window of the shoe store on your way to work and look at the prices every day for a year or so. If the price of shoes is ergodic, then the two averages will be the same. I thus made a pretty big and almost certainly false assumption, namely that the effect of certain indicators on individual Senate and House races is the same as on the Presidency. As said, while this is almost certainly false, it did make the model a little more accurate and it was the best model I could use for things for which I do not have a long history of measurements, such as Twitter prevalence.

One added twist was the use of cohort models. I did not want to pick one model to stake all on – I wanted to generate groups (cohorts) of 200 models each, wherein each would be somewhat differently tuned. Importantly, I did not want to create a ‘superteam’ of the best 200 models generated in different runs. Rather, I wanted to select the group of 200 models that is most likely to give a correct overall prediction, i.e. in which the actual outcome would most likely be the outcome predicted by the majority of the models. This allows for picking models where we know they will, ultimately, act together as an effective ensemble, and models will ‘balance out’ each other.
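The cohort idea can be sketched as follows. This is a simplified illustration, not the selection code I actually used: the cohort names, the three-model cohorts and the toy election history are all invented, and the score is simply the share of past elections in which the cohort's majority call was right.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the outcome predicted by the most models in a cohort."""
    return Counter(predictions).most_common(1)[0][0]


def cohort_score(cohort_history, actual_outcomes):
    """Fraction of past elections where the cohort's majority call was right.

    cohort_history[i] holds every model's prediction for election i.
    """
    hits = sum(
        majority_vote(preds) == actual
        for preds, actual in zip(cohort_history, actual_outcomes)
    )
    return hits / len(actual_outcomes)


def best_cohort(cohorts, actual_outcomes):
    """Pick the cohort (by name) whose majority vote was right most often."""
    return max(cohorts, key=lambda name: cohort_score(cohorts[name], actual_outcomes))


# Toy data: two cohorts of three models each, over three past elections.
actual = ["R", "D", "R"]
cohorts = {
    "cohort_a": [["R", "R", "D"], ["R", "R", "D"], ["R", "D", "D"]],
    "cohort_b": [["R", "D", "R"], ["D", "D", "R"], ["R", "R", "D"]],
}
winner = best_cohort(cohorts, actual)
```

Note that no model in cohort_b is right every time, yet the cohort's majority is. That is the ‘balancing out’ effect at work.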

A supercohort of 1,000 cohorts of 200 models each was trained on electoral data since 1900. Because of the ergodicity assumption (as detailed above), the models included non-Presidential elections, but anything ‘learned’ from such elections was penalised. This is a decent compromise if we consider the need for ergodicity. For example, I have looked at the (normalised fraction4 of the) two candidates’ media appearances and their volume of bought advertising, but mass media hasn’t always been around for the last 116 years in its current form. So I looked at the effect that this had on smaller elections. All variables were weighted to ‘decay’ depending on their age.
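One way such weighting might look is sketched below. The exponential form, the 20-year half-life and the 0.5 penalty for non-Presidential races are purely illustrative assumptions; the post does not specify the actual decay function or penalty.

```python
def decay_weight(observation_year, current_year=2016, half_life_years=20.0):
    """Weight that halves for every half_life_years of age (assumed form)."""
    age = current_year - observation_year
    return 0.5 ** (age / half_life_years)


def sample_weight(observation_year, is_presidential, penalty=0.5):
    """Combine age decay with a penalty for non-Presidential elections."""
    weight = decay_weight(observation_year)
    if not is_presidential:
        weight *= penalty  # hypothetical penalty factor
    return weight


recent_presidential = sample_weight(2016, is_presidential=True)
older_senate_race = sample_weight(1996, is_presidential=False)
```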

Tuning of model hyperparameters and deep architecture was attempted in two ways. I initially began with a classical genetic algorithm for tuning hyperparameters and architecture, aware that this was less efficient than gradient descent based algorithms but more likely to give you a diversity of hyperparameters and far more suited to multi-objective systems. Compared with gradient descent algorithms, genetic algorithms took longer but performed better. This was an acceptable tradeoff to me, so I eventually adapted a multi-objective genetic algorithm implementation, drawing on the Python DEAP package and some (ok, a LOT of) custom code. Curiously (or maybe not – I recently learned this was a ‘well known’ finding – apparently not as well known after all!), the best models came out of ‘split training’: the convolutional layers and the overall structure are genetically optimised, but the non-convolutional layers are trained using backpropagation.
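For readers who have not met genetic algorithms, here is a bare-bones sketch in plain Python. It is deliberately not the DEAP-based, multi-objective implementation described above: it optimises a single objective, and the ‘validation error’ it minimises is a made-up stand-in with an optimum at a learning rate of 0.01 and 32 hidden units.

```python
import random


def evolve(fitness, bounds, pop_size=20, generations=30, seed=0):
    """Minimal elitist genetic algorithm; lower fitness is better."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    initial_best = min(fitness(ind) for ind in population)
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]  # selection: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(len(child))  # mutate one gene, clamped to its bounds
            lo, hi = bounds[i]
            child[i] = min(hi, max(lo, child[i] + rng.gauss(0, 0.1 * (hi - lo))))
            children.append(child)
        population = parents + children
    best = min(population, key=fitness)
    return best, initial_best


# Hypothetical stand-in for validation error over two hyperparameters:
# log10 of the learning rate and the number of hidden units.
def toy_validation_error(genes):
    log_rate, hidden_units = genes
    return (log_rate + 2.0) ** 2 + ((hidden_units - 32.0) / 32.0) ** 2


bounds = [(-5.0, 0.0), (4.0, 128.0)]
best, initial_best = evolve(toy_validation_error, bounds)
```

Because the better half of each generation survives unchanged, the best individual never gets worse from one generation to the next, and the surviving population stays more varied than a single gradient descent run would be.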

Another twist was the use of ‘time contingent parameters’. That’s a fancy way of saying data that’s not available ab initio. An example of that would be post-debate changes of web search volumes for certain keywords associated with each candidate. Trivially, that information is not in existence until a week or so post-debate. These models were trained to ‘variants’. So if a particular model had information missing, it defaulted to an equally weighted model without the nodes that would have required that information. Much as this was a hacky solution, it was acceptable to me as I knew that by late October, every model would have complete information.

I wrote a custom model runner in Python with an easy-as-heck output interface – I was not concerned with creating pretty, I was concerned with creating good. The runner first pulled all data it required once again, diffed it against the previous version, reran feature extractors where there was a change, then ran the models over the feature vectors. Outputs went into CSV files and simple outputs that looked like this (welcome to 1983):

CVoncsefalvay @ orinoco ~/Developer/mfarm/election2016 $ mrun --all

< lots of miscellaneous debug outputs go here >

[13:01:06.465 02 Nov 2016 +0000] OK DONE.
[13:01:06.590 02 Nov 2016 +0000] R 167; D 32; DNC 1
[13:01:06.630 02 Nov 2016 +0000] Output written to outputs/021301NOV2016.mconfdef.csv

That’s basically saying that (after spending the best part of a day scoring through all the models) 167 models were predicting a Republican victory, 32 a Democratic victory and one model crashed, did not converge somewhere or otherwise broke. The CSV output file would then give further data about each submodel, such as predicted turnout, predictions of the electoral college and popular vote, etc. The model was run with a tolerance of 1%, i.e. up to two models can break and the model would still be acceptable. Any more than that, and a rerun would be initiated automatically. One cool thing: this was my first application using the Twilio API to send me messages keeping me up to date on the model. Yes, I know, the 1990s called, they want SMS messaging back.
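The tally and the rerun rule are simple enough to sketch. This is an illustration rather than the runner's actual code: the function name and the label strings are assumptions, with "DNC" standing for any model that crashed or did not converge.

```python
from collections import Counter


def tally(results, tolerance_percent=1):
    """Summarise cohort output and decide whether a rerun is needed.

    results holds one label per model: "R", "D", or "DNC" for a model that
    crashed or did not converge.
    """
    counts = Counter(results)
    failed = counts["DNC"]
    # With 200 models and a 1% tolerance, up to two failures are acceptable.
    allowed_failures = len(results) * tolerance_percent // 100
    summary = f"R {counts['R']}; D {counts['D']}; DNC {failed}"
    return summary, failed > allowed_failures


results = ["R"] * 167 + ["D"] * 32 + ["DNC"]
summary, needs_rerun = tally(results)
```

For the run shown above, one failure out of 200 is within tolerance, so no rerun is triggered; a third failure would tip it over.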

By the end of the week, the first models had phoned back. I was surprised: was Trump really that far ahead? The polls had slammed him, he seemed hopeless, he’s not exactly anyone’s idea of the next George Washington and he ran against more money, more media and more political capital. I had to spend the best part of a weekend confirming the models, going over them line by line, doing tests and cross-validation, until I was willing to trust my models somewhat.

But part of our story in science is to believe evidence with the same fervour we disbelieve assertions without it. And so, after being unable to find the much expected error in my code and the models, I concluded they must be right.

Living with the models

The unique exhilaration, but also the most unnerving feature, of creating these models was how different they are from my day-to-day fare. When I write predictive models, the approach is, and remains, quintessentially iterative. We build models, we try them, and iteratively improve on them. It is dangerous to fall in love with one’s own models – today’s hero is in all likelihood destined for tomorrow’s dungheap, with another, better model taking its place – until that model, too, is discarded for a better approach, and so on. We do this because of the understanding that reality is a harsh taskmaster, and it always has some surprises in store for us. This is not to say that data scientists build and sell half-assed, flawed products – quite the opposite: we give you the best possible insight we can with the information we’ve got. But how reality pans out will give us more new information, and we can work with that to move another step closer to the elusive truth of predicting the future. And one day, maybe, we’ll get there. But every day, if we play the game well, we get closer.

Predicting a one-time event is different. You don’t get pointers as to whether you are on the right track or not. There are no subtle indications of whether the model is going to work or not. I have rarely had a problem sticking by a model I built that I knew was correct, because I knew every day that new information would either confirm or improve my model – and after all, turning out the best possible model is the important part, not getting it in one shot, right? It was unnerving to have a model built on fairly experimental techniques, with the world predicting a Clinton win with a shocking unanimity. There were extremely few who predicted a Trump win, and we all were at risk of being labelled either partisans for Trump (a rather hilarious accusation when levelled at me!) or just plain crackpots. So I pledged not to discuss the technical details of my models unless and until the elections confirmed they were right.

So it came to pass that it was me, the almost apolitical one, rather than my extremely clever and politically very passionate wife, who stayed up until the early hours of the morning, watching the results pour in. With CNN, Fox and Twitter over three screens, refreshing all the time, I watched as Trump surged ahead early and maintained a steady win.

My model was right.

Coda

It’s the 16th of November today. It’s been almost a week since the elections, and America is slowly coming to terms with the unexpected. It is a long process, it is a traumatic process, and the polling and ‘quantitative social science’ professions are, to an extent, responsible for this. There were all kinds of sloppiness, multiplication of received wisdom, ‘models’ that in fact were thin confirmations of the author’s prejudices in mathematical terms, and a great deal of stupidity. That does sound harsh, but there’s no better way really to describe articles that, weeks before the election, state without a shade of doubt that we needed to ‘move on’, for Clinton had already won. I wonder if Mr Frischling had a good family recipe for crow? And on the note of election night menu, he may exchange tips with Dr Sam Wang, whom Wired declared 2016’s election data hero in an incredibly complimentary puff piece, apparently more on the basis that the author, Jeff Nesbit, hoped Wang was right than on any indication of analytical superiority.

The fact is, the polling profession failed America and has no real reason to continue to exist. The only thing it has done is make campaigns more expensive and add to the pay-to-play of American politics. I don’t really see myself crying salt tears at the polling profession’s funeral.

The jury is still out on the ‘quantitative social sciences’, but it’s not looking good. The ideological homogeneity in social science faculties worldwide, but especially in America, has contributed to the kind of disaster that happens when people live in a bubble. As scientists, we should never forget to sanity check our conclusions against our experiences, and intentionally cultivate the most diverse circle of friends we can to get as many little slivers of the human experience as we can. When one’s entire milieu consists of pro-Clinton academics, it’s hard to even entertain doubt about who is going to win – the availability heuristic is a strong and formidable adversary, and the only way to beat it is by recruiting a wide array of familiar people, faces, notions, ideas and experiences to rely on.

As I write this, I have an inch-thick pile of papers next to me: calculations, printouts, images, drafts of a longer academic paper that explains the technical side of all this in detail. Over the last few days, I’ve fielded my share of calls from the media – which was somewhat flattering, but this is not my field. I’m just an amateur who might have gotten very lucky – or maybe not.

Time will tell.

In a few months, I will once again be sharing a conference room with my academic brethren. We will discuss, theorize, ideate and exchange views; a long, vivid conversation written for a 500-voice chorus, with all the beauty and passion and dizzying heights and tumbling downs of Tallis’s Spem in Alium. The election featured prominently in those conversations last time, and no doubt that will be the case again. Many are, at least from an academic perspective, energised by what happened. Science is the only game where you actually want to lose from time to time. You want to be proven wrong, you want to see you don’t know anything, you want to be miles off, because that means there is still something else to discover, still some secrets this Creation conceals from our sight with sleights of hand and blurry mirrors. And so, perhaps the real winners are not those few, those merry few, who got it right this time. The real winners are those who, led by their curiosity about their failure to predict this election, find new solutions, new answers and, often enough, new puzzles.

That’s not a consolation prize. That’s how science works.

And while it’s cool to have predicted the election results more or less correctly, the real adventure is not the destination. The real adventure is the journey, and I hope that I have been able to grant you a little insight into this adventure some of us are on every hour of every day.

References

1. There is an academic paper with a lot more details forthcoming on the matter – incidentally, because republication is generally not permitted, it will contain many visualisations I was not able or allowed to put into this blog post. So just for that, it may be worth reading once it’s out. I will post a link to it here.
2. The reasoning here is roughly as follows. Assume the body is a sphere. All bodies are assumed to be made of the same material, which is also assumed to be homogeneous. The volume of a sphere is V = \frac{4}{3} \pi r^3 , and its mass M is that multiplied by its density \rho . Thus the radius of a sphere of a material of known density \rho can be calculated as r = \sqrt[3]{\frac{3}{4} \frac{M}{\pi \rho}} . From this, the surface area can be calculated (A = 4 \pi r^2 ). Thus, body weight is a decent stand-in for surface area.
3. I am indebted to Nassim Nicholas Taleb for this example.
4. Divide the smaller by the larger value, normalise to 1.

The Fear Factor

Update: Adam has a great response, much from his own (rather different yet nonetheless fascinating and infinitely important) perspective, that should be a valuable read to anyone who found this post even mildly interesting.

As I was perusing Twitter, I bumped into this sponsored tweet by Shell, promoting Sensabot, a ruggedised remote ops robot designed for operations in dangerous environments by Carnegie Mellon’s NREC and now apparently adapted/adopted by Shell:

That sounds great… unfortunately, it strikes me, it misunderstands a couple of things – namely,

  1. fear, and the function it has in the human mind (not just the psyche – fear is a primarily neural response, secondarily perhaps cognitive and far behind it is any psychological aspect thereof), and
  2. what a robot ought to do/know.

Now, this might just be a marketing puff, though it probably is in its own way true – I have yet to see robots with a specific system catering for fear. It is also hardly just Shell and NREC I’m singling out here – the points are much more generic and have more to do with what robots do and what they therefore ought to understand about being human.

The gift of fear

When I was a child, I once managed to piss off my grandfather enough to have him lock me in the shed. The shed smelled of chicken feed, something that turns my guts upside down to this day. Now, the shed was pretty ok, all things considered, albeit dark and damp. However, I shared it with what at my youthful estimation must have been several hundred small insects, spiders, centipedes and other creepy-crawlies.

I love nature. I just hate the things it sometimes produces. Safe to say if it does not have a brain and does creepy-crawly stuff, I will in all likelihood be creeped out by it.

So there I was, locked in with what I now know were more along the lines of a few thousand of these critters. I screamed like a banshee, until my grandmother took pity on me and released me.

Years and years later, memories of this crawled their way back into my mind at the most inopportune times. It was an uncharacteristically cool midsummer day, in the ageless and timeless beauty that you only get at NSC Bisley, the capital of UK target rifle shooting. Looking through the high-resolution scope of my rifle at Stickledown, the long distance (1,000+ yd) match rifle range, I was immersed in doing calculations of wind and gust patterns and adjustments and projectile drop in my head, trying to look out for any sign of crosswind at what is rightly known as one of the most treacherous ranges in the world of long distance target rifle shooting. That was when I suddenly felt that familiar crawl of a spider up my right leg. Had it not been for the fact that freaking out like a lunatic at a firing range while holding a stupidly overpowered sniper rifle in the middle of a few dozen other shooters on edge at what is to many of them a career match is generally frowned upon, I would probably have screamed. Instead, I unloaded, kicked the spider off me and tried to settle down, but by that time, it was all lost. My heart, pickled in adrenaline, pounded and pounded and I got some stupidly bad shots away before I conceded. And that’s how great my first Imperial Meeting went, back in 2006.

I was so bothered by this episode, I applied my usual method to it – reading every single book ever written on the subject of fear and anxiety responses (years later, when I would once again have to face scary memories from my past, much of that reading would prove helpful in retaining a modicum of sanity). I read tomes of evolutionary biology, a field so unfamiliar to me I had to ask a fellow student to give me the Cliff’s Notes of it. I read a fantastic book, The Gift of Fear by Gavin de Becker, in which he explains the importance of fear signals in avoiding violence. I read copiously on the physiology of fear responses, of the need for the inotropy and chronotropy1 of adrenaline, and the reason why battle cries and battle shouts are a universal feature of human civilisations.

And eventually, I came to realise that while fear probably lost me any stab I had at the Halford, the much coveted 1,100yd/1,200yd trophy that year (if we’re being honest here, any such chances were… fairly slim at best), it was a crucial part of getting my species, and quite probably the individual self, to that point. Of course, because this is not a Hollywood blockbuster about facing one’s fears and winning, I never went back to Bisley after this event and I would in fact never again shoot in a public competition. I would, however, spend plenty of time on the more fortunate end of a rifle, and never have another problem with it – even in inhospitable climates with various nasty creepy-crawlies.

Now, there’s a reason I’m giving a little personal vignette here – and that is to understand that we’re more familiar with the adverse effects of fear than we are with its ‘gift’, to use de Becker’s term. We, as a society, are in the mindset I was in as I walked down the firing point and tried to figure out how the hell I was going to explain all this to my coach without becoming the club joke. Fear is bad. Fear is so bad, if you get too much of it, many opt for taking medication or seeking professional help (and rightly so!). “Freedom from fear”, one of FDR’s often-quoted four freedoms, has put fear on the level of starvation, religious persecution and oppression of free expression. Fear is a big deal.

And justly so. Fear is, well, not nice. It puts people ‘on edge’, which is just fine, but it is a prioritisation mechanism – it puts efficiency and survival ahead of communication and courtesy, and leads you to be perceived as unpleasant. Fear, especially long-term levels of heightened stress response (known sometimes as the ‘biological embedding of anxiety’), can have utterly deleterious effects on long-term health2 and there now seems to be ample evidence that the damage is on a genetic level3, i.e. capable of being passed on. The harm of fear thus becomes intergenerational.

At the same time, fear is necessary for humans, and it does not take much to think of a scenario when your entire ancestral line could have been wiped out if it had not been for your slightly anxious great-great-great^n-ancestor so pathologically afraid of floods that he insisted on dwelling on a hill or so fearful of sabre-tooth tigers that he always carried a sharp object that might just be credited for his survival. In fact, Marks and Nesse (1994) argue that eliminating fear would be by no means exclusively positive – and perhaps even posit the existence of a pathological state of low fear they term ‘hypophobic disorder’.4 Certainly the lack of fear response, such as that induced by ablation of metabotropic glutamate receptor subtype 75 or interneuronal ablation by inhibition of the Dlx1 gene6 in the experimental setting or witnessed in the context of pervasive neurodevelopmental disorders both in models7 and in vivo,8 has significant evolutionary drawbacks – it is not hard to see how behaviour without fear in a world of danger can quickly lead to reduced life expectancies.

And so we get to the Sensabot, our ‘fearless’ robot. What benefits does his fearlessness yield us? None, I submit, for the reasons below.

Fear is a diverse phenomenon. Cutting it out is neurosurgery with a hatchet.

Fear is a single word (albeit one replete with synonyms) for a number of states of mind that connect somewhere in the human reactions they evoke via the amygdala and thence the limbic system, quite prominently the hypothalamic fight/flight response.9 It bundles together your fear of spiders (a genuine phobia), your fear of nuclear war (a longer-term anxiety), your fear of wasting your life (an even less acute, more existential perception) and so on. Cutting it all out is doing neurosurgery with a hatchet. A not very sharp one, either.

To put it in the context of the robot: of course, a robot who is not worried about spiders, doesn’t hyperventilate and sweat when handling dangerous substances and doesn’t freeze at the sight of a few zombies emerging from the neighbouring compartment (a risk I doubt Shell needs to envisage at this point, of course!) lacks maladaptive manifestations of fear. These intersect somewhere in the shared fact that they’re either disproportionate to the risk (spiders) or unproductive in the situation (handling dangerous substances and freezing at the sight of a zombie).

But what about other aspects of fear? Fear is a natural way to warn us of existential danger. The Sensabot relies on the human operator to have at least some degree of that, but more autonomous bots will not be able to. Nor do operators act the same when they’re in the cockpit as when they’re piloting an unmanned vehicle.10 Of course, Sensabots can be replaced far more easily than humans, but that’s irrelevant here, both axiomatically and practically (the collateral damage of e.g. an explosion caused by an incorrect action might well be an actual human in the area). Nor does the bot have the neurophysiological advantages that fear confers – specifically, the cardiac effects, faster movement, better cognitive capabilities and so on. Fear is a reserve, and machines don’t have that reserve.

Fear prioritises.

Fear is best represented as a vector, having both magnitude and direction (example: “I’m very afraid (magnitude) of spiders (direction)”). Different magnitudes help prioritising for immediacy and apprehended risk (likelihood times expected loss). Of course, it is not possible to simply bestow this upon a computer, and there are other methods of prioritising risk, but the great benefit of fear is that it distills signals down into a simple and fast calculation that is remarkably rarely wrong. It does so by considering a current signal in the context of all signals, the signal space of all possible signals, as well as learned patterns and the wider context in which the entire process is taking place. The decision whether to be afraid of something is, actually, quite complex.

This prioritisation is often lampooned when it appears to go wrong, typically when people are more afraid of certain rare risks than of more frequent ones. Typically, such caricatures of the way human fears work get several things wrong – they reduce the situation to mere probability when in reality, the likelihood of loss, the manner of loss, its impact on others, its impact on your wider community at large and so on are taken into account. That’s why more people fear terrorist attacks than car accidents, even though the latter are much more frequent. Context is everything, and if we learn only one thing from fear, let it be that the evaluation of risks takes place in a very wide context. With its holistic nature – involving the limbic system, the medial prefrontal cortex (mPFC) for memory, somewhat affected by the person’s state of arousal and HPA axis function, mediated by sensory perceptions mixed with our interpretations thereof – fear is a highly multifactorial response that can be promoted, mediated or inhibited by a number of factors on the way. There is now a degree of awareness in literature that learned and innate fears are differentiated in their propagation pathways (specifically, the involvement of the prelimbic mPFC),11 indicating again that fear is a single response to different processes reacting to different stimuli. This underlines how fear, rather than simply getting one’s brain pickled in adrenaline, is a complex phenomenon. From an evolutionary perspective, this complexity calls for an explanation – the holistic nature of fear comes, of course, at the expense of the time it takes to trigger release of adrenaline and fight/flight responses. That explanation is, of course, that fear has to accomplish more objectives than merely recoiling, reliably and every time, from a trigger: it has to weigh whether a response is going to put us in more danger, it has to weigh our resources against plans of escape, it has to consider the entire context of a situation, including the past (via the memory activity of the prelimbic mPFC).

Humans are afraid. Their robot co-workers need to understand this.

My learned friend Adam Elkus has made a great point:

He’s entirely right – if robots and humans work together, robots need to have what is sometimes referred to as the ‘theory of mind’ – an internal concept of an external mind – in order to anticipate and understand their workmates. And humans, well, humans experience fear. There’s no getting around that. An interesting consideration here would be a fear related adaptation of Baron-Cohen’s Sally-Anne test,13 which I will call the Sally-Zombie test. Anne is, in this case, replaced by a rotting carcass reanimated by dark forces. A robot ought to be able to reason about Sally’s reaction to this. Merely predicting a response is unlikely to work – especially in the absence of in-depth knowledge of Sally’s decisions in the past, it will be hard to predict whether she will freeze, run or reach for the nearest object to decapitate her once-friend-turned-zombie with. The robot, thus, ought to understand multiple things here:

  • The human emotional context: humans and their fear of their body being taken over, a fear rooted in the human appreciation for autonomy and capacity.
  • The human cultural context: zombie movies are a staple of Western culture and at least to Western observers, zombies are unequivocally scary (more advanced models would need to consider cultural differences, e.g. the response Sally would have, had she grown up in a culture, such as Haitian Voodoo or Palo Mayombe, where zombies occupy a more complex albeit still fear-inducing position).
  • The human personal context: what are Sally’s experiences with zombies? With similar stressors? With the concept of losing a friend to an abomination?
  • The human physiological context: given Sally’s state, what is she most likely to experience when her body goes through the motions of fear/panic and the corresponding neural level reactions?

The fact is that the qualia of fear is a uniquely human perception.14 No machine, however intricate, will ever have the qualia of fear. Whether it needs the qualia, however, in order to do its job is debatable. At the same time, its sheer richness and multifactorial nature, as well as the extent of its manifestations, make fear unsuitable to a mere scripted response level of understanding. While some basic features can be easily responded to without having an understanding of fear (“If Joe is around tons of highly explosive material, Joe will have a heart rate exceeding his basal heart rate by approximately 10-25% and experience palmar hyperhidrosis”), most cannot. And that means that robots, to be effective, have to develop something in between the unattainable qualia and the insufficient scripted understanding.

It’s important to understand that this is not merely for robots to understand how their human companions will think, but also to allow the latter to understand how their robot coworker would perceive a situation. A coworker who lacks, say, an ordinary human level of fear might not only endanger his companions, his actions are also likely to be unintelligible to them: why is Steve running towards the madly out-of-control spinning saw blades without any protective equipment?! Fear is so fundamental to being human that it is part of the unwritten set of shared presumptions that help us understand and anticipate each other. This is a system into which robots will, someday soon, integrate themselves.

Conclusion

Do we want robots to be afraid? Some applications, such as the recent proliferation of humanoid, emotionally expressive robots for therapeutic, educational,15 or general usage purposes16 certainly rely on at least an understanding of where the display of the physiological-communicative responses to stimuli is appropriate. But that’s not where the story ought to end. In fact, the question matters for any robot that needs more than a trivial ability to reason on its own, especially if it needs to do so in a human context. There is much that robots can do better than humans – they don’t feel pain, remorse, regret, doubt, boredom or fatigue. To inject into what one might perceive as almost perfect creatures the maladaptive aspects, too, of human responses to perceived dangers sounds counterintuitive. At the same time, on the large (evolutionary) scale as well as the individual long-term scale, fear is a gift. It is a gift we as humans must consider passing on to our creations.

References

1. That’s ‘making your heart beat stronger’ and ‘making your heart beat faster’ in human language, respectively.
2. Miller, Gregory E., Edith Chen, and Karen J. Parker. “Psychological stress in childhood and susceptibility to the chronic diseases of aging: moving toward a model of behavioral and biological mechanisms.” Psychological Bulletin 137.6 (2011): 959.
3. Sasaki, Aya, Wilfred C. de Vega, and Patrick O. McGowan. “Biological embedding in mental health: An epigenomic perspective 1.” Biochemistry and Cell Biology 91.1 (2013): 14-21.
4. Nesse, Randolph M. “Fear and fitness: An evolutionary analysis of anxiety disorders.” Ethology and Sociobiology 15.5 (1994): 247-261.
5. Masugi, Miwako, et al. “Metabotropic glutamate receptor subtype 7 ablation causes deficit in fear response and conditioned taste aversion.” The Journal of Neuroscience 19.3 (1999): 955-963.
6. Mao, Rong, et al. “Reduced conditioned fear response in mice that lack Dlx1 and show subtype-specific loss of interneurons.” Journal of Neurodevelopmental Disorders 1.3 (2009): 224.
7. Markram, Kamila, et al. “Abnormal fear conditioning and amygdala processing in an animal model of autism.” Neuropsychopharmacology 33.4 (2008): 901-912.
8. Consider DSM-IV 299.0, at Associated Features and Disorders, para.1
9. Said at risk of massive oversimplification. It is WAY more complex than that, of course, and the mPFC as well as other parts of the brain play a significant role. There is an increasing understanding that some of our most fundamental emotions like fear are as close as one gets to global in the brain!
10. As this song attests.
11. Corcoran, Kevin A., and Gregory J. Quirk. “Activity in prelimbic cortex is necessary for the expression of learned, but not innate, fears.” The Journal of Neuroscience 27.4 (2007): 840-844.
12. Adam (Elkus
13. Baron-Cohen, Simon, Alan M. Leslie, and Uta Frith. “Does the autistic child have a “theory of mind”?.” Cognition 21.1 (1985): 37-46.
14. Buck, Ross. “What is this thing called subjective experience? Reflections on the neuropsychology of qualia.” Neuropsychology 7.4 (1993): 490.
15. Movellan, Javier, et al. “Sociable robot improves toddler vocabulary skills.” Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction. ACM, 2009.
16. Consider in this field Hashimoto, Takuya, et al. “Development of the face robot SAYA for rich facial expressions.” 2006 SICE-ICASE International Joint Conference. IEEE, 2006. and Itoh, Kazuko, et al. “Various emotional expressions with emotion expression humanoid robot WE-4RII.” Robotics and Automation, 2004. TExCRA’04. First IEEE Technical Exhibition Based Conference on. IEEE, 2004.

The one study you shouldn’t write

I might have my own set of ideological prejudices,1 but I am more certain of this than I am of any of them: show me proof that contradicts my most cherished beliefs, and I will read it, evaluate it critically and, if it is correct, learn from it. This, incidentally, is how I ended up believing in God and casting away the atheism of my early teens, but that’s a lateral point.

As such, I’m in support of every kind of inquiry that does not, in its process, harm humans (I am, you may be shocked to learn, far more supportive of torturing raw data than people). There’s one exception. There is that one study for every sociologist, every data scientist, every statistician, every psychologist, everyone – that one study that you should never write: the study that proves how your ideological opponents are morons, psychotics and/or terminally flawed human beings.2

Virginia Commonwealth University scholar Brad Verhulst, Pete Hatemi (now at Penn State, my sources tell me) and poor old Lindon Eaves, who of all of the aforementioned should really know better than to darken his reputation with this sort of nonsense, have just learned this lesson at what I believe will be a minuscule cost to their careers compared to what this error ought to cost any researcher in any field.

In 2012, the trio published an article in the American Journal of Political Science, titled Correlation not causation: the relationship between personality traits and political ideologies. Its conclusion was, erm, ground-breaking for anyone who knows conservatives from more than the caricatures they have been reduced to in the media:

First, in line with our expectations, higher P scores correlate with more conservative military attitudes and more socially conservative beliefs for both females and males. For males, the relationship between P and military attitudes (r = 0.388) is larger than the relationship between P and social attitudes (r = 0.292). Alternatively, for females, social attitudes correlate more highly with P (r = 0.383) than military attitudes (r = 0.302).

Further, we find a negative relationship between Neuroticism and economic conservatism ($r_{females}$ = −0.242, $r_{males}$ = −0.239). People higher in Neuroticism tend to be more economically liberal.

(P, in the above, being the score in Eysenck’s psychoticism inventory.)

The most damning words in the above were among the very first. I am not sure what’s worst here: that actual educated people believe psychoticism correlates to military attitudes (because the military is known for courting psychotics, am I right? No? NO?!), or that they think it helps any case to disclose what is a blatant bias quite openly. In my lawyering years, if the prosecution expert had stated that the fingerprints on the murder weapon “matched those of that dirty crook over there, as I expected”, I’d have torn him to shreds, and so would any good lawyer. And that’s not because we’re born and raised bloodhounds but because we prefer people not to have biases in what they are supposed to opine on in a dispassionate, clear, clinical manner.

And this story confirms why that matters.

Four years after the paper came into print (why so late?), an erratum had to be published (that, by the way, is still not replicated on a lot of sites that republished the piece). It so turns out that the gentlemen writing the study had ‘misread’ their numbers. Like, real bad.

The authors regret that there is an error in the published version of “Correlation not Causation: The Relationship between Personality Traits and Political Ideologies” American Journal of Political Science 56 (1), 34–51. The interpretation of the coding of the political attitude items in the descriptive and preliminary analyses portion of the manuscript was exactly reversed. Thus, where we indicated that higher scores in Table 1 (page 40) reflect a more conservative response, they actually reflect a more liberal response. Specifically, in the original manuscript, the descriptive analyses report that those higher in Eysenck’s psychoticism are more conservative, but they are actually more liberal; and where the original manuscript reports those higher in neuroticism and social desirability are more liberal, they are, in fact, more conservative. We highlight the specific errors and corrections by page number below:

It also magically turns out that the military is not full of psychotics.3 Whodda thunk.

…P is substantially correlated with liberal military and social attitudes, while Social Desirability is related to conservative social attitudes, and Neuroticism is related to conservative economic attitudes.

“No shit, Sherlock,” as they say.
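For anyone wondering how a codebook can turn a finding on its head, here is a minimal sketch in Python. It assumes a made-up trait score and a made-up five-point attitude item; none of this is the study’s data, and `pearson_r` and `reverse_code` are hypothetical helpers written only for this illustration. Reading a scale backwards is a linear flip, so every correlation involving that item keeps its size and swaps its sign.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient, standard library only."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def reverse_code(score, low=1, high=5):
    """Flip an item on a low..high scale: 1 becomes 5, 2 becomes 4, and so on."""
    return low + high - score


# Hypothetical data: an invented trait score and an invented 1-5 attitude
# item that tends to rise with the trait. Nothing here comes from the study.
rng = random.Random(42)
trait = [rng.gauss(0, 1) for _ in range(500)]
attitude = [min(5, max(1, round(3 + 0.6 * t + rng.gauss(0, 1)))) for t in trait]

# The very same answers, read through a codebook that has the scale backwards.
attitude_miscoded = [reverse_code(a) for a in attitude]

r_correct = pearson_r(trait, attitude)
r_miscoded = pearson_r(trait, attitude_miscoded)

print(f"r with the right codebook:    {r_correct:+.3f}")
print(f"r with the reversed codebook: {r_miscoded:+.3f}")
```

The magnitude survives the miscoding untouched, which is why nothing in the numbers themselves flags the error. Only checking the codebook against the questionnaire does.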

The authors’ explanation is that the dog ate their homework. Ok, only a little bit better: the responses were “miscoded”, i.e. it’s all the poor grad student sods’ fault. Their academic highnesses remain faultless:

The potential for an error in our article initially was pointed out by Steven G. Ludeke and Stig H. R. Rasmussen in their manuscript, “(Mis)understanding the relationship between personality and sociopolitical attitudes.” We found the source of the error only after an investigation going back to the original copies of the data. The data for the current paper and an earlier paper (Verhulst, Hatemi and Martin (2010) “The nature of the relationship between personality traits and political attitudes.” Personality and Individual Differences 49:306–316) were collected through two independent studies by Lindon Eaves in the U.S. and Nichols Martin in Australia. Data collection began in the 1980’s and finished in the 1990’s. The questionnaires were designed in collaboration with one of the goals being to be compare and combine the data for specific analyses. The data were combined into a single data set in the 2000’s to achieve this goal. Data are extracted on a project-by-project basis, and we found that during the extraction for the personality and attitudes project, the specific codebook used for the project was developed in error.

As a working data scientist and statistician, I’m not buying this. This study has, for all its faults, intricate statistical methods. It’s well done from a technical standpoint. It uses Cholesky decomposition and displays a relatively sophisticated statistical approach, even if it’s at times bordering on the bizarre. The causal analysis is an absolute mess, and I have no idea where the authors have gotten the idea that a correlation over 0.2 is “large enough for further consideration”. That’s not a scientifically accepted idea. A correlation is significant or not significant. There is no weird middle way of “give us more money, let’s look into it more”. The point remains, however, that the authors, while practising a good deal of cargo cult science, have managed to overlook an epic blunder like this. How could that have happened?

Well, really, how could it have happened? I trust this should be explained by the words I’ve pointed out before. The authors had what is called “cognitive contamination” in the field of criminal forensic science. The authors had an idea about conservatives and liberals and what they are like. These ideas were caricaturesque to the extreme. They were blind as a bat, blinded by their own ideological biases.

And there goes my point. There are, sometimes, articles that you shouldn’t write.

Let me give you an analogy. My religion has some pretty clear rules about what married people are, and aren’t, allowed to do. Now, what my religion also happens to say is that it’s easier not to mess up these things if you do not engage in temptation. If you are a drug addict, you should not hang out with coke heads. If you are a recovering alcoholic, you would not exactly benefit from joining your friends on a drunken revelry. If you’ve got political convictions, you are more prone to say stupid things when you find a result that confirms your ideas. The term for this is ‘confirmation bias’; the reality is that it’s the simple human proneness to see what we want to see.

Do you remember how, as a child, you used to play the game of seeing shapes in clouds? Puppies, cows, elephants and horses? The human brain works on the basis of a Gestalt principle of reification, allowing us to reconstruct known things from their parts. It’s essential to the way our brain works. But it’s also making us see the things we want to see, not what we’re actually seeing.

And that’s why you should never write that one article. The one where you explain why the other side is dumb, evil or has psychotic and/or neurotic traits.

References

1. Largely, they presume outlandish stuff like ‘human life is exceptional and always worth defending’ or ‘death does not cure illnesses’, you get my drift.
2. For starters, I maintain we all are at the very least the latter, quite probably the middle one at least a portion of the time and, frankly, the first one more often than we would believe ourselves.
3. Yes, I know a high Eysenck P score does not mean a person is ‘psychotic’ and Eysenck’s test is a personality trait test, not a test to diagnose a psychotic disorder.