# Structuring R projects

There are some things that I call Smith goods:1 things I want, nay, require, but hate doing. A clean room is one of these – I have a visceral need for some semblance of tidiness around me, yet I absolutely hate tidying, especially in the summer.2 Starting and structuring packages and projects is another of these things, which is why I'm so happy things like cookiecutter exist that do it for you in Python. [su_pullquote align="right"]While I don't like structuring R projects, I keep doing it, because I know it matters. That's a pearl of wisdom that came occasionally at a great price.[/su_pullquote]I am famously laid back about structuring R projects – my chill attitude is only occasionally compared to the Holy Inquisition, the other Holy Inquisition and Gunny R. Lee Ermey's portrayal of Drill Sgt. Hartman, and it's been months since I last gutted an intern for messing up namespaces.3 So while I don't like structuring R projects, I keep doing it, because I know it matters. That's a pearl of wisdom that came, occasionally, at a great price – some of which I am hoping to save you with this post.

## Five principles of structuring R projects

Every R project is different, so any structure has to allow for a lot more adaptability than a rigid template would. When structuring R projects, I try to follow five overarching principles.

1. The project determines the structure. In a small exploratory data analysis (EDA) project, you might have some leeway as to structural features that you might not have when writing safety-critical or autonomously running code. This variability in R – reflective of the diversity of its use – means that it’s hard to devise a boilerplate that’s universally applicable to all kinds of projects.
2. Structure is a means to an end, not an end in itself. The reason why gutting interns, scalping them or yelling at them Gunny style is inadvisable is not just the additional paperwork it creates for HR. Rather, the point of the whole exercise is to create people who understand why the rules exist and adopt them organically, understanding how they help.
3. Rules are good, tools are better. When tools are provided that take the burden of adherence – linters, structure generators like cookiecutter, IDE plugins, &c. – off the developer, adherence is both more likely and simpler.
4. Structures should be interpretable to a wide range of collaborators. Even if you have no collaborators, it pays to think from the perspective of an analyst, a data scientist, a modeller, a data engineer and, most importantly, the client who will receive the overall product at the very end.
5. Structures should be capable of evolution. Your project may change objectives, it may evolve, it may grow. What was a pet project might become a client product. What was designed to be a massive, error-resilient superstructure might have to scale down. And most importantly, your single-player adventure may end up turning into an MMORPG. Your structure has to be able to roll with the punches.

## A good starting structure

Pretty much every R project can be imagined as a sort of process: data gets ingested, magic happens, then the results – analyses, processed data, and so on – get spit out. The absolute minimum structure reflects this:

.
└── my_awesome_project
    ├── src
    ├── output
    ├── data
    │   ├── raw
    │   └── processed
    ├── run_analyses.R
    └── .gitignore

In this structure, we see this reflected by having a data/ folder (a source), a folder for the code that performs the operations (src/) and a place to put the results (output/). The root analysis file (the sole R file on the top level) is responsible for launching and orchestrating the functions defined in the src/ folder’s contents.
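If you would rather not create this skeleton by hand, a few lines of base R will do it for you. This is just a convenience sketch – the folder names simply mirror the tree above, and the .gitignore contents are illustrative.

```r
# Sketch: bootstrap the minimal project skeleton shown above.
project_root <- "my_awesome_project"

dirs <- file.path(project_root, c("src", "output", "data/raw", "data/processed"))

# recursive = TRUE creates data/ together with its subfolders
sapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE)

# An empty orchestrator script and a starter .gitignore
file.create(file.path(project_root, "run_analyses.R"))
writeLines(c("data/", "output/"), file.path(project_root, ".gitignore"))
```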

## The data folder

The data folder is, unsurprisingly, where your data goes. In many cases, you may not have any file-formatted raw data (e.g. where the raw data is accessed via a *DBC connection to a database). You might even keep intermediate files in the database as well, although that's pretty uncommon on the whole, and might not make you the local DBA's favourite (not to mention data protection issues). So while the raw/ subfolder might be dispensed with, you'll most definitely need a data/ folder.

When it comes to data, it is crucial to make a distinction between source data and generated data. Rich Fitzjohn puts it best when he says to treat

• source data as read-only, and
• generated data as disposable.

The preferred implementation I have adopted is to have

• a data/raw/ folder, which is usually symlinked to a folder that is write-only to clients but read-only to the R user,4
• a data/temp/ folder, which contains temp data, and
• a data/output/ folder, if warranted.

## The src folder

Some call this folder R/ – I find this a misleading practice, as you might have C++, bash and other non-R code in it, but it is unfortunately enforced by R if you want to structure your project as a valid R package, which I advocate in some cases. I am a fan of structuring the src/ folder by logical function. There are two systems of nomenclature that have worked really well for me and people I work with:

• The library model: in this case, the root folder of src/ holds individual .R scripts that when executed will carry out an analysis. There may be one or more such scripts, e.g. for different analyses or different depths of insight. Subfolders of src/ are named after the kind of scripts they contain, e.g. ETL, transformation, plotting. The risk with this structure is that sometimes it’s tricky to remember what’s where, so descriptive file names are particularly important.
• The pipeline model: in this case, there is a main runner script, or potentially a small number of them. These go through the scripts in sequence. It is a sensible idea in such a case to establish sequential subfolders or sequentially numbered scripts. Typically, this model performs better if there are at most a handful of distinct pipelines.

Whichever approach you adopt, a crucial point is to keep function definition and application separate. This means that only the pipeline or the runner scripts are allowed to execute (apply) functions; all other files merely define them. Typically, folder-level segregation works best for this:

• keep all function definitions in subfolders of src/, e.g. src/data_engineering, and have the directly-executable scripts directly under src/ (this works better for larger projects), or
• keep function definitions in src/, and keep the directly executable scripts in the root folder (this is more convenient for smaller projects, where perhaps the entire data engineering part is not much more than a single script).
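To make the split concrete, here is a minimal sketch of the first variant. The file and function names are purely illustrative – the point is merely that the runner is the only place where anything actually gets executed.

```r
# src/run_pipeline.R -- the only script that *applies* functions.
# File and function names below are illustrative placeholders.

# The sourced files only define functions; they execute nothing themselves.
source("src/data_engineering/load_data.R")
source("src/data_engineering/clean_data.R")
source("src/plotting/plot_results.R")

raw_data <- load_data("data/raw")
clean    <- clean_data(raw_data)
plot_results(clean, output_dir = "output")
```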

## output and other output folders

Output may mean a range of things, depending on the nature of your project. It can be anything from a whole D.Phil thesis written in a LaTeX-compliant form to a brief report to a client. There are a couple of conventions with regard to output folders that are useful to keep in mind.

### Separating plot output

[su_pullquote]My personal preference is that plot output folders should be subfolders of output/, rather than top-tier folders, unless the plots themselves are the objective.[/su_pullquote]It is common to have a separate folder for plots (usually called figs/ or plots/), usually so that they can be used for various purposes. My personal preference is that plot output folders should be subfolders of output/, rather than top-tier folders, unless the plots themselves are the very output of the project. That is the case, for instance, where the project is intended to create a particular plot on a regular basis, as was the case with the CBRD project, whose purpose was to regularly generate daily epicurves for the DRC Zaire ebolavirus outbreak.

With regard to maps, in general, the principle that has worked best for teams I ran was to treat static maps as plots. However, dynamic maps (e.g. LeafletJS apps), tilesets, layers or generated files (e.g. GeoJSON files) tend to deserve their own folder.

### Reports and reporting

[su_pullquote]For business users, automatically getting a beautiful PDF report can be priceless.[/su_pullquote]Not every project needs a reporting folder, but for business users, having a nice, pre-written reporting script that can be run automatically and produces a beautiful PDF report every day can be priceless. A large organisation I worked for in the past used this very well to monitor their Amazon AWS expenditure.5 A team of over fifty data scientists worked on a range of EC2 instances, and runaway spending – from provisioning instances that were too big, leaving instances on, and data transfer charges resulting from misconfigured instances6 – was rampant. So the client wanted daily, weekly, monthly and 10-day rolling usage nicely plotted in a report, by user, highlighting people who would go on the naughty list. This was very well accomplished by an RMarkdown template that was 'knit' every day at 0600 and uploaded as an HTML file onto an internal server, so that every user could see who's been naughty and who's been nice. EC2 usage costs went down by almost 30% within a few weeks, and that was without having to dismember anyone!7

Probably the only structural rule to keep in mind is to keep reports and reporting code separate. Reports are client products, reporting code is a work product and therefore should reside in src/.

## Requirements and general settings

I am, in general, not a huge fan of outright loading whole packages to begin with. Too many users of R don’t realise that

• you do not need to attach (library(package)) a package in order to use a function from it – as long as the package is available to R, you can simply call the function as package::function(arg1, arg2, ...), and
• attaching a package using library(package) puts every single exported function from that package onto the search path, masking by default any previously attached objects of the same name. This means that in order to know deterministically what any given symbol refers to, you would have to keep track, at all times, of the order of package imports. Needless to say, there is enough stuff to keep in one's mind when coding in R without having to worry about this as well.
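As a quick illustration of the masking problem: dplyr and stats both export a function called filter(), so what the bare name means depends entirely on what has been attached, and in which order, whereas the :: form is unambiguous (this sketch assumes dplyr is installed).

```r
filter    # before attaching anything: stats::filter, a time series filter

library(dplyr)
filter    # now dplyr::filter masks the stats version

# Fully qualified calls are deterministic regardless of attach order:
stats::filter(presidents, rep(1/3, 3))    # moving average over a time series
dplyr::filter(mtcars, cyl == 6)           # row filtering on a data frame
```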

However, some packages might be useful to import, and sometimes it's useful to have an initialisation script. This may be the case in a couple of particular scenarios:

• You need a particular locale setting, or a particularly crucial environment setting.
• You are not using packrat or some other package management solution, and definitely need to ensure some packages are installed, but prefer not to put the clunky install-if-not-present code in every single script.

In these cases, it's sensible to have a file you would source before every top-level script – in an act of shameless thievery from Python, I tend to call this requirements.R, and it includes some fundamental settings I like to rely on, such as setting the locale appropriately. It also includes a CRAN install check script, although I would very much advise the use of packrat over it, since the install check is not version-sensitive.
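For illustration, such a requirements.R might look like the sketch below. The package list, locale and options are only examples of the kind of settings I mean, not a canonical file.

```r
# requirements.R -- sourced at the top of every runner script.
# Everything below is illustrative; adapt it to your project.

required_packages <- c("dplyr", "ggplot2", "readr", "jsonlite")

# Naive CRAN install check: packrat (or renv) is preferable, since this
# only checks for presence, not for a particular version.
missing <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing) > 0) {
  install.packages(missing)
}

# Fundamental settings I like to rely on, e.g. a consistent locale
Sys.setlocale("LC_TIME", "en_GB.UTF-8")
options(stringsAsFactors = FALSE)
```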

### Themes, house style and other settings

It is common, in addition to all this, to keep some general settings around. If your institution has a 'house style' for ggplot2 (as, for instance, a ggthemr file), this could be part of your project's config. But where does this best go?

[su_pullquote align=”right”]I’m a big fan of keeping house styles in separate repos, as this ensures consistency across the board.[/su_pullquote]It would normally be perfectly fine to keep your settings in a config.R file at root level, but a config/ folder is much preferred as it prevents clutter if you derive any of your configurations from a git submodule. I’m a big fan of keeping house styles and other things intended to give a shared appearance to code and outputs (e.g. linting rules, text editor settings, map themes) in separate – and very, very well managed! – repos, as this ensures consistency across the board over time. As a result, most of my projects do have a config folder instead of a single configuration file.

It is paramount to separate project configuration and runtime configuration:

• Project configuration pertains to the project itself, its outputs, schemes, the whole nine yards. For instance, the paper size to use for generated LaTeX documents would normally be a project configuration item. Your project configuration belongs in your config/ folder.
• Runtime configuration pertains to parameters that relate to individual runs. In general, you should aspire to have as few of these, if any, as possible – and if you do, you should keep them as environment variables. But if you do decide to keep them as a file, it’s generally a good idea to keep them at the top level, and store them not as R files but as e.g. JSON files. There are a range of tools that can programmatically edit and change these file formats, while changing R files programmatically is fraught with difficulties.

### Keeping runtime configuration editable

A few years ago, I worked on a viral forecasting tool where a range of model parameters to build the forecast from were hardcoded as R variables in a runtime configuration file. It was eventually decided to create a Python-based web interface on top of it, which would allow users to see the results as a dashboard (reading from a database where forecast results would be written) and make adjustments to some of the model parameters. The problem was, that’s really not easy to do with variables in an R file.

On the other hand, Python can easily read a JSON file into memory, change values as requested and export them onto the file system. So instead of that, the web interface would store the parameters in a JSON file, from which R would then read them and execute accordingly. Worked like a charm. Bottom line – configurations are data, and using code to store data is bad form.
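As a sketch of that arrangement: the parameters live in a plain JSON file that any language can edit, and R merely reads it at run time. The file name, the parameter names and the choice of jsonlite below are illustrative assumptions, not the actual project's code.

```r
library(jsonlite)

# runtime_config.json, kept at the top level and editable by anything
# that can write JSON, e.g.:
# {
#   "forecast_horizon_days": 14,
#   "smoothing_window": 7
# }

params <- fromJSON("runtime_config.json")

run_forecast <- function(params) {
  # ...the actual model would use the parameters here...
  message("Forecasting ", params$forecast_horizon_days, " days ahead")
}

run_forecast(params)
```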

## Dirty little secrets

Everybody has secrets. In all likelihood, your project is no different: passwords, API keys, database credentials, the works. The first rule of this, of course, is never hardcode credentials in code. But you will need to work out how to make your project work, including via version control, while also not divulging credentials to the world at large. My preferred solutions, in order of preference, are:

1. the keyring package, which interacts with OS X’s keychain, Windows’s Credential Store and the Secret Service API on Linux (where supported),
2. using environment variables,
3. using a secrets file that is .gitignored,
4. using a config file that’s .gitignored,
5. prompting the user.

Let’s take these – except the last one, which you should consider only as a measure of desperation, as it relies on RStudio and your code should aspire to run without it – in turn.

### Using keyring

keyring is an R package that interfaces with the operating system’s keychain management solution, and works without any additional software on OS X and Windows.8 Using keyring is delightfully simple: it conceives of an individual key as belonging to a keyring and identified by a service. By reference to the service, it can then be retrieved easily once the user has authenticated to access the keychain. It has two drawbacks to be aware of:

• It’s an interactive solution (it has to get access permission for the keychain), so if what you’re after is R code that runs quietly without any intervention, this is not your best bet.
• A key can only contain a username and a password, so it cannot store more complex credentials, such as four-part secrets (e.g. in OAuth, where you may have a consumer and a publisher key and secret each). In that case, you could split them into separate keyring keys.

However, for most interactive purposes, keyring works fine. This includes single-item secrets, e.g. API keys, where you can use some junk as your username and hold only on to the password. [su_pullquote align=”right”]For most interactive purposes, keyring works fine. This includes single-item secrets, e.g. API keys.[/su_pullquote] By default, the operating system’s ‘main’ keyring is used, but you’re welcome to create a new one for your project. Note that users may be prompted for a keychain password at call time, and it’s helpful if they know what’s going on, so be sure you document your keyring calls well.

To set a key, simply call keyring::key_set(service = "my_awesome_service", username = "my_awesome_user"). This will launch a dialogue using the host OS's keychain handler to request authentication to access the relevant keychain (in this case, the default keyring, as none is specified), and you can then retrieve

• the username: using keyring::key_list("my_awesome_service")[1,2], and
• the password: using keyring::key_get("my_awesome_service").
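Putting the two together, a hedged sketch of how this might look in practice – the service name and connection details are placeholders, and the retrieval is wrapped in a function so that the credentials never sit in the global scope (see the last section of this post):

```r
# Illustrative only: "my_awesome_service" and the DSN are placeholders.
connect_to_db <- function(service = "my_awesome_service") {
  username <- keyring::key_list(service)[1, 2]
  DBI::dbConnect(odbc::odbc(),
                 dsn = "my_awesome_dsn",
                 UID = username,
                 PWD = keyring::key_get(service, username = username))
}

conn <- connect_to_db()
```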

### Using environment variables

[su_pullquote]The thing to remember about environment variables is that they're 'relatively private': everyone in the user session will be able to read them.[/su_pullquote]Using environment variables to hold certain secrets has become extremely popular, especially for Dockerised implementations of R code, as envvars can be very easily set using Docker. The thing to remember about environment variables is that they're 'relatively private': they're not part of the codebase, so they will definitely not accidentally get committed to the VCS, but everyone who has access to the particular user session will be able to read them. This may be an issue when e.g. multiple people are sharing the ec2-user account on an EC2 instance. The other drawback of envvars is that if there's a large number of them, setting them can be a pain. R has a little workaround for that: if you create an envfile called .Renviron in the working directory, R will read it at the start of the session and set those variables in the environment. So, for instance, the following .Renviron file will bind an API key and a username:

api_username = "my_awesome_user"
api_key = "e19bb9e938e85e49037518a102860147"

So when you then call Sys.getenv("api_username"), you get the correct result. It's worth keeping in mind that the .Renviron file is sourced once, and once only: at the start of the R session. Thus, obviously, changes made after that will not propagate into the session until it ends and a new session is started. It's also rather clumsy to edit, although most tools that handle ini-style files will, with the occasional grumble, digest a .Renviron.

Needless to say, committing the .Renviron file to the VCS is what is sometimes referred to as making a chocolate fireman in the business, and is generally a bad idea.

### Using a .gitignored config or secrets file

config is a package that allows you to keep a range of configuration settings outside your code, in a YAML file, then retrieve them. For instance, you can create a default configuration for an API:

default:
  my_awesome_api:
    url: 'https://awesome_api.internal'
    api_key: 'e19bb9e938e85e49037518a102860147'

From R, you could then access this using the config::get() function:

my_awesome_api_configuration <- config::get("my_awesome_api")

This would then allow you to e.g. refer to the URL as my_awesome_api_configuration$url, and the API key as my_awesome_api_configuration$api_key. As long as the configuration YAML file is kept out of the VCS, all is well. The problem is that not everything in such a configuration file is supposed to be secret. For instance, it makes sense for database access configuration to keep the other connection details DBI::dbConnect() needs (host, driver, database name and so on) available to other users, while keeping the password private. So .gitignoring a whole config file is not a good idea.

[su_pullquote align="right"]A dedicated secrets file is a better place for credentials than a config file, as this file can then be wholesale .gitignored.[/su_pullquote]A somewhat better idea is a secrets file. This file can be safely .gitignored, because it definitely only contains secrets. As previously noted, definitely create it using a format that can be widely written (JSON, YAML).9 For reasons noted in the next subsection, the thing you should definitely not do is create a secrets file that consists of R variable assignments, however convenient an idea that may appear at first. Because…
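For illustration, a .gitignored secrets file and its use might look something like this sketch – the file name and field are invented for the example, and the values are read inside a function rather than assigned in the global scope, for the reasons set out below.

```r
# secrets.json -- listed in .gitignore, never committed. For example:
# { "db_password": "correct-horse-battery-staple" }

connect_with_secrets <- function(secrets_path = "secrets.json") {
  secrets <- jsonlite::fromJSON(secrets_path)
  DBI::dbConnect(odbc::odbc(),
                 UID = "analytics_user",   # non-secret, may live in config/
                 PWD = secrets$db_password)
}
```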

### Whatever you do…

One of the best ways to mess up is creating a fabulous way of keeping your secret credentials truly secret… then loading them into the global scope. Never, ever assign credentials. Ever.

You might have seen code like this:

dbuser <- Sys.getenv("dbuser")
dbpass <- Sys.getenv("dbpass")

conn <- DBI::dbConnect(odbc::odbc(), UID = dbuser, PWD = dbpass)
[su_pullquote]Never, ever put credentials into any environment if possible – especially not into the global scope.[/su_pullquote]This will work perfectly, except once it's done, it will leave the password and the user name, in unencrypted plaintext (!), in the global scope, accessible to any code. That's not just extremely embarrassing if, say, your wife of ten years discovers that your database password is your World of Warcraft character's first name, but also a potential security risk. Never put credentials into any environment if possible, and if it has to happen, at least make it happen within a function so that they don't end up in the global scope. The correct way to do the above would be more akin to this:

create_db_connection <- function() {
  DBI::dbConnect(odbc::odbc(),
                 UID = Sys.getenv("dbuser"),
                 PWD = Sys.getenv("dbpass"))
}

## Concluding remarks

Structuring R projects is an art, not just a science. Many best practices are highly domain-specific, and learning these generally happens by trial and error (and the occasional pratfall). In many ways, it's the bellwether of an R developer's skill trajectory, because it shows whether they possess the tenacity and endurance it takes to do meticulous, fine and often rather boring work in pursuance of future success – or at the very least, an easier time debugging things in the future. Studies show that one of the greatest predictors of success in life is being able to tolerate deferred gratification, and structuring R projects is a pure exercise in that discipline.

[su_pullquote align="right"]Structuring R projects is an art, not just a science. Many best practices are highly domain-specific, and learning these generally happens by trial and error.[/su_pullquote]At the same time, a well-executed structure can save valuable developer time, prevent errors and allow data scientists to focus on the data rather than debugging, trying to find where that damn snippet of code is, or scratching their head trying to figure out what a particularly obscurely named function does. What might feel like an utter waste of time has enormous potential to create value for the individual, the team and the organisation alike.

[su_pullquote]As long as you keep in mind why structure matters and what its ultimate aims are, you will arrive at a form of order out of chaos that will be productive, collaborative and useful.[/su_pullquote]I’m sure there are many aspects of structuring R projects that I have omitted or ignored – in many ways, it is my own experiences that inform and motivate these commentaries on R. Some of these observations are echoed by many authors, others diverge greatly from what’s commonly held wisdom. As with all concepts in development, I encourage you to read widely, get to know as many different ideas about structuring R projects as possible, and synthesise your own style. As long as you keep in mind why structure matters and what its ultimate aims are, you will arrive at a form of order out of chaos that will be productive, collaborative and mutually useful not just for your own development but others’ work as well.

My last commentary on defensive programming in R has spawned a vivid and exciting debate on Reddit, and many have made extremely insightful comments there. I’m deeply grateful for all who have contributed there. I hope you will also consider posting your observations in the comment section below. That way, comments will remain together with the original content.

References

1. As in, Adam Smith.
2. It took me years to figure out why. It turns out that I have ZF alpha-1 antitrypsin deficiency. As a consequence, even minimal exposure to small particulates and dust can set off violent coughing attacks and impair breathing for days. Symptoms tend to be worse in hot weather due to impaired connective tissue something-or-other.
3. That's a joke. I don't gut interns – they're valuable resources, HR shuns dismembering your coworkers, it creates paperwork and I liked every intern I've ever worked with – but most importantly, once gutted like a fish, they are not going to learn anything new. I prefer gentle, structured discussions on the benefits of good package structure. Please respect your interns – they are the next generation, and you are probably one of their first examples of what software development/data science leadership looks like. The waves you set into motion will ripple through generations, well after you're gone. You better set a good example.
4. Such a folder is often referred to as a 'dropbox', and the typical corresponding octal setting, 0422, guarantees that the R user will not accidentally overwrite data.
5. The organisation consented to me telling this story but requested anonymity, a request I honour whenever legally possible.
6. In case you're unfamiliar with AWS: it's a cloud service where elastic computing instances (EC2 instances) reside in 'regions', e.g. us-west-1a. There are (small but nonzero) charges for data transfer between regions. If you're in one region but you configure the yum repo server of another region as your default, there will be costs, and, eventually, tears – provision ten instances with a few GBs worth of downloads, and there'll be yelling. This is now more or less impossible to do except on purpose, but one must never underestimate what users are capable of from time to time!
7. Or so I'm told.
8. Linux users will need libsecret 0.16 or above, and sodium.
9. XML is acceptable if you're threatened with waterboarding.

# Automagic epi curve plotting: part I

As of 24 May 2018, the underlying data schema of the Github repo from which the epi curve plotter draws its data has changed. Therefore, a lot of the code had to be adjusted. The current code can be found here on Github. This also plots a classical epi curve.

One of the best resources during the 2013-16 West African Ebola outbreak was Caitlin Rivers' Github repo, which was probably one of the best ways to stay up to date on the numbers. For the current outbreak, she has also set up a Github repo, with really frequent updates straight from the WHO's DON data and the information from the DRC Ministry of Public Health (MdlS) mailing list.1 Using R, I have set up a simple script that I only have to run every time I want a pre-defined visualisation of the current situation. I am usually doing this on a remote RStudio server, which makes it quite easy for me to quickly generate these on the fly from RStudio.

### Obtaining the most recent data

Using the following little script, I grab the latest from the ebola-drc Github repo:

# Fetch most recent DRC data.
library(magrittr)
library(curl)
library(readr)    # provides read_csv()
library(dplyr)

current_drc_data <- Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%
  paste("raw_data/drc/", "drc-", ., ".csv", sep = "") %T>%
  curl_fetch_disk("https://raw.githubusercontent.com/cmrivers/ebola_drc/master/drc/data.csv", .) %>%
  read_csv()

This uses curl (the R implementation) to fetch the most recent data and save it as a timestamped2 file in the data folder I set up just for that purpose.3 Simply sourcing this script (source("fetch_drc_data.R")) should then load the current DRC dataset into the environment.4

### Data munging

We need to do a little data munging. First, we melt down the data frame using reshape2's melt function. Melting takes 'wide' data and converts it into 'long' data – for example, in our case, the original data had a row for each daily report for each health zone, and a column for the various combinations of confirmed/probable/suspected over cases/deaths. Melting the data frame down creates a variable type column (say, confirmed_deaths) and a value column (giving the value, e.g. 3). Using lubridate,5 the dates are parsed, and the values are stored in a numeric format.

library(magrittr)
library(reshape2)
library(lubridate)

current_drc_data %<>% melt(value.name = "value", measure.vars = c("confirmed_cases", "confirmed_deaths", "probable_cases", "probable_deaths", "suspect_cases", "suspect_deaths", "ruled_out"))
current_drc_data$event_date <- lubridate::ymd(current_drc_data$event_date)
current_drc_data$report_date <- lubridate::ymd(current_drc_data$report_date)
current_drc_data$value <- as.numeric(current_drc_data$value)

Next, we drop ruled_out cases, as they play no significant role for the current visualisation.
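Since melt() names the molten type column variable by default, this is a one-liner – a sketch assuming the column names produced by the code above:

```r
library(dplyr)

# Drop ruled_out entries; "variable" is melt()'s default name for the
# molten type column created above.
current_drc_data <- current_drc_data %>%
  filter(variable != "ruled_out")
```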

## Issues

There are some glitches that we’re trying to iron out:

• Cartography and GIS tools glitch a little due to issues with PROJ.4.
• GPU/CUDA support is not implemented, as this is not customarily used or provided on Linodes.
• certbot and Let's Encrypt are not really supported yet, as our boxes are never directly public-facing, but you should most definitely find a way to put your server behind some form of SSL/TLS.
• Currently, only Ubuntu 16.04 LTS is supported, as it's the most widely used version. CRAN does not support more recent versions yet, but these will be added once CRAN support arrives.
[su_note]New to Linodes? Linode is a great service offering a simple, uncomplicated way to set up a Linux server, just how you want it! 👍🏻 In fact, I’ve been using Linode servers for over four years, and have only good experiences with them. I keep recommending them to friends, and they’re all happy! So if this post was helpful, and you’re thinking of signing up for a Linode, would you consider doing so using my affiliate link? I run this blog as a free service and pay for hosting out of pocket, and I don’t do advertisements (and never will!) or paid reviews of things (like what, paid Tyvek reviews? “Best Ebola PPE”? Nope.) – so really, it would make me so happy if you could help out this way. [/su_note]

As always, do let me know how things work out for you! Feel free to leave comments and ideas below. If you’re getting value out of using this script, just let me know.


# Installing RStudio Server on Debian 9

Oh, wouldn't it be just wonderful if you could have your own RStudio installation on a server that you could then access from whatever device you currently have, including an iPad? It totally would. Except it's sometimes far from straightforward. Here's how to do it relatively painlessly.

## Step 1: Get a server

I use Linode,1 and in general, their 4096 server is pretty good. Linodes can be very easily resized, so this should not be a worry.

## Step 2: Set up the server

Once your Linode is up, it will turn up on your dashboard with a random name (linode1234567, typically). If you click on it, you will see your Linode is ‘Brand New’, which means you need to configure it. I usually keep them in groups depending on purpose: blog servers, various processing servers, hosts, research servers. Each of them then gets a name. Choose whatever nomenclature fits your needs best.

Next, click on the Rebuild tab, and configure the root password, the operating system (we’ll be using Debian 9), the swap disk size (for R, it’s a good idea to set this as large as you can) and finally, set the deployment disk size. I usually set that for 66%.

## Step 3: Install R

SSH into your Linode and log in as root. You will find your Linode's IP address and other interesting factoids about it under the Remote access tab. I have obscured some information as I don't want you scallywags messing around with my server, but the IP address is displayed both in the SSH link (the one that goes ssh root@<your Linode's IP address> or something along these lines) and below under public IPs.

Using vim, open /etc/apt/sources.list and add the following source:

# CRAN server for Debian stretch (R and related stuff)
deb http://cran.rstudio.com/bin/linux/debian stretch-cran34/

Save the file, and next install dirmngr (sudo apt-get install dirmngr). Then, add the requisite GPG key for CRAN, update the repository and install r-base:

sudo apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'

sudo apt update
sudo apt install r-base

At this point, you can enter R to test if your R installation works, and try to install a package, like ggplot2. It should work, but may ask you to select an installation server. If all is well, this is what you should be getting:

root@localhost:~# R

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

In the prompt, enter install.packages("ggplot2"). Select a server if requested. Otherwise, watch ggplot2 (and a bazillion other packages) install. Then, quit R by calling q().

## Step 4: Downgrade libssl

For some inscrutable reason, RStudio is currently set up to work with libssl1.0.0, whereas Debian 9 comes with libssl1.1 out of the box. Clearly that's not going to work, so we'll have to roll back our libssl. We'll do so by creating a file called /etc/apt/sources.list.d/jessie.list using vim, and filling it with the following:

deb http://httpredir.debian.org/debian jessie main contrib non-free
deb-src http://httpredir.debian.org/debian jessie main contrib non-free

deb http://security.debian.org/ jessie/updates main contrib non-free
deb-src http://security.debian.org/ jessie/updates main contrib non-free

In case you're curious: this creates a sources list called jessie, and allows you to draw from Debian Jessie. So let's have apt get with the program (sudo apt update), and install libssl1.0.0 (sudo apt install libssl1.0.0). It may be prudent to also install the openssl tool corresponding to the libssl version (sudo apt install openssl/jessie).

## Step 5: Time to install RStudio Server!

First, install gDebi, a package installer, by typing sudo apt-get install gdebi. Next, we’ll be grabbing the latest RStudio version using wget, and installing it using gDebi. Make sure you either do this in your home directory or in /tmp/, ideally. Note that the versions of RStudio tend to change – 1.1.442 was the current version, released 12 March 2018, at the time of writing this post, but may by now have changed. You can check the current version number here.

wget https://download2.rstudio.org/rstudio-server-1.1.442-amd64.deb
sudo gdebi rstudio-server-1.1.442-amd64.deb

If all is well and you said yes to the dress question about whether you actually want to install RStudio Server (no, you've been going through this whole pain in the rear for excrement and jocularity, duh), you should see something like this:

(Reading database ... 55651 files and directories currently installed.)
Preparing to unpack rstudio-server-1.1.442-amd64.deb ...
Unpacking rstudio-server (1.1.442) ...
Setting up rstudio-server (1.1.442) ...
rsession: no process found
● rstudio-server.service - RStudio Server
Active: active (running) since Sat 2018-03-31 21:59:25 UTC; 1s ago
Process: 16478 ExecStart=/usr/lib/rstudio-server/bin/rserver (code=exited, status=0/SUCCESS)
Main PID: 16479 (rserver)
CGroup: /system.slice/rstudio-server.service
└─16479 /usr/lib/rstudio-server/bin/rserver

Mar 31 21:59:25 localhost systemd[1]: Starting RStudio Server...
Mar 31 21:59:25 localhost systemd[1]: Started RStudio Server.

Now, if all goes well, you can navigate to your server's IP at port 8787, and you should behold something akin to the RStudio Server login screen.

There are a few more things you will want to install at this point – these are libraries that will help with SSL and other functionality from within R. Head back to the terminal and install the following:

sudo apt-get install libssl-dev libcurl4-openssl-dev libssh2-1-dev


## Step 6: Some finishing touches

The login screen will draw on PAM, Unix’s own authentication module, in lieu of a user manager. As such, to create access, you will have to create a new Unix user with adduser, and assign a password to it. This will grant it a directory of its own, and you the ability to fine-tune what they should, and what they shouldn’t, have access to. Win-win! This will allow you to share a single installation among a range of people.

## Step 7: To reverse proxy, or to not reverse proxy?

There are diverging opinions as to whether a reverse proxy such as NGINX carries any benefit. In my understanding, they do not, and there’s a not entirely unpleasant degree of security by obscurity in having a hard to guess port (which, by the way, you can change). It also makes uploads, on which you will probably rely quite a bit, more difficult. On the whole, there are more arguments against than in favour of a reverse proxy, but I may add specific guidance on reverse proxying here if there’s interest.

References

1. Using this link gives me a referral bonus of $20 as long as you remain a customer for 90 days. If you do not wish to do so, please use this link. It costs the same either way, and I would be using Linode anyway as their service is superbly reliable.

# MedDRA + VAERS: A marriage made in hell?

This post is a Golden DDoS Award winner: so far, this blog was DDoS'd only three times within 24 hours of its publication. That deserves a prize.

Quick: what do a broken femur, Henoch-Schönlein purpura, fainting, an expired vaccine and a healthy childbirth have in common? If your answer was "they're all valid MedDRA codes", you're doing pretty well. If you, from that, deduced that they all can be logged on VAERS as adverse effects of vaccination, you're exceeding expectations. And if you also realise that the idea that Jane got an expired HPV vaccine, and as a consequence broke her femur, developed Henoch-Schönlein purpura, and suddenly gave birth to a healthy baby boy is completely idiotic and yet can be logged on VAERS, you're getting where I'm going.

MedDRA is a medical nomenclature specifically developed for the purposes of pharmacovigilance. The idea is, actually, not dreadful – there are some things in a usual medical nomenclature like ICD-10 that are not appropriate for a nomenclature used for pharmacovigilance reporting (V97.33: sucked into jet engine comes to my mind), and then there are things that are specific to pharmacovigilance, such as "oh shoot, that was not supposed to go up his bum!" (MedDRA 10013659: vaccine administered at inappropriate site), "we overdosed little Johnny on the flu vaccine!" (MedDRA 10000381: drug overdose, accidental) and other joys that generally only happen in the context of pharmacovigilance. So far, so good.

At the same time, MedDRA is non-hierarchical, at least on the coding level. Thus, while the ICD code V97.33 tells you that you're dealing with an external cause of mortality and morbidity (V and Y codes), specifically air and space transport (V95-97), more specifically 'other' specific air transport accidents, specifically getting sucked into a jet engine (V97.33), there's no way to extract from MedDRA 10000381 what the hell we're dealing with. Not only do we not know if it's a test result, a procedure, a test or a disease, we are hopelessly lost as to figuring out what larger categories it belongs to. To make matters worse, MedDRA is proprietary – which in and of itself is offensive to the extreme to the idea of open research on VAERS and other public databases: a public database should not rely on proprietary encoding! – and it lacks the inherent logic of ICD-10. Consider the encoding of the clinical diagnosis of unilateral headache in both. We know that an ICD code beginning with F will be something psychiatric and G will be neurological, and from that alone we can get some easy analytical approaches (a popular one is looking at billed codes and drilling down by hierarchical level of ICD-10 codes, something in which the ICD-10 is vastly superior to its predecessor). MedDRA, alas, does not help us in this way.

## Garbage in, garbage out

OK, so we've got a nomenclature where the codes for needlestick injury, death, pneumonia, congenital myopathy and a CBC look all the same. That's already bad enough. It gets worse when you can enter any and all of these into the one single field. Meet VAERS.

The idea of VAERS is to allow physicians, non-physicians and 'members of the public' to report incidents.
These are then coded by the CDC and, depending on seriousness, they may or may not be investigated (all reports that are regarded as 'serious' are investigated, according to the CDC). The problem is that this approach is susceptible to three particular vulnerabilities:

• The single field problem: VAERS has a single field for 'symptoms'. Everything's a symptom. This includes pre-existing conditions, new onset conditions, vaccination errors, lab tests (not merely results, just the tests themselves!), interventions (without specifying if they're before or after the vaccine), and so on. There is also no way to filter out factors that definitely have nothing to do with the vaccine, such as a pre-existing birth defect. The History/Allergies field is not coded.
• The coding problem: what gets coded and what does not is sometimes imperfect. This being a human process, it's impossible to expect perfection, but the ramifications of this for certain methods of analysis are immense. For instance, if there are 100 cases of uncontrollable vomiting, that may be a signal. But if half of those are coded as 'gastrointestinal disorder' (also an existing code), you have two values of 50, neither of which may end up being a signal.
• The issue of multiple coding: because MedDRA is non-hierarchical, it is not possible to normalise at a higher level (say, with ICD-10 codes, at chapter or block level), and it is not clear if two codes are hierarchically related. In ICD-10, if a record contains I07 (rheumatic tricuspid valve disease) and I07.2 (tricuspid stenosis with tricuspid insufficiency), one can decide to retain the more specific or the less specific entry, depending on the intended purpose of the analysis.

In the following, I will demonstrate each of these based on randomly selected reports from VAERS.

### The Single Field Problem (SFP)

The core of the SFP is that there is only one codeable field, 'symptoms'. VAERS ID 375693-1 involves a report in which the patient claims she developed, between the first and second round of Gardasil,

> severe stomach pain, cramping, and burning that lasted weeks. Muscle aches and overall feeling of not being well. In August 2009 patient had flu like symptoms, anxiety, depression, fatigue, ulcers, acne, overall feeling of illness or impending death.

Below is the patient's symptom transposition into MedDRA entities (under Symptoms):

The above example shows the mixture of symptoms, diagnostic procedures and diagnostic entities that are coded in the 'Symptoms' field. The principal problem with this is that when considering mass correlations (all drugs vs all symptoms, for instance), this system would treat a blood test just as much as a contributor to a safety signal as anxiety or myalgia, which might be true issues, or depression, which is a true diagnosis. Unfiltered, this makes VAERS effectively useless for market basket analysis based (co-occurrence frequency) analyses. Consider, for instance, that $PRR$ is calculated as

$$PRR_{V,R} = \frac{\Sigma (R \mid V) / \Sigma (V)}{\Sigma (R \mid \neg V) / \Sigma (\neg V)} = \frac{\Sigma (R \mid V)}{\Sigma (V)} \cdot \frac{\Sigma (\neg V)}{\Sigma (R \mid \neg V)}$$

where $V$ denotes the vaccine of interest, $R$ denotes the reaction of interest, and the $\Sigma$ operator denotes the sum of rows or columns that fulfil the requisite criteria (a more detailed, matrix-based version of this equation is presented here).
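To make the arithmetic concrete, here is a small R sketch of the same calculation over a toy vaccine-by-reaction count matrix. The numbers are invented purely for illustration and have nothing to do with actual VAERS counts.

```r
# Toy reporting matrix: rows are vaccines, columns are reported reactions.
D <- matrix(c(20, 180,
               5, 395),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("vaccine_V", "all_other_vaccines"),
                            c("reaction_R", "all_other_reactions")))

# PRR for reaction R and vaccine V, following the formula above:
prr <- (D["vaccine_V", "reaction_R"] / sum(D["vaccine_V", ])) /
       (D["all_other_vaccines", "reaction_R"] / sum(D["all_other_vaccines", ]))

prr   # 0.1 / 0.0125 = 8: R is reported eight times as often with V as with the rest
```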
But if $\{R\}$, the set of all $R$, contains not merely diagnoses but also various 'non-diagnoses', the PRR calculation will be distorted. For constant $V$ and an unduly large $R$, the values computationally obtained from the VAERS data that ought to be $\Sigma(R \mid V)$ and $\Sigma(R \mid \neg V)$ will both be inaccurately inflated. This will yield inaccurate final results.

Just how bad IS this problem? About 30% bad, if not more. A manual tagging of the top 1,000 symptoms (by $N$, i.e. by the number of occurrences) was used as an estimate for how many of the diagnostic entities do not disclose an actual problem with the vaccine. According to the survey of the top 1,000 codes, only a little more than 70% of the codes themselves disclose a relevant issue with the vaccine. In other words, almost a third of disclosed symptoms must be pruned, and these cannot be categorically pruned because, unlike ICD-10, MedDRA does not disclose hierarchies based on which such pruning would be possible.

As far as the use of MedDRA goes, this alone should be a complete disaster. Again, for effect: a third of the codes do not disclose an actual side effect of the medication. These are not separate or identifiable in any way other than manually classifying them and seeing whether they disclose an actual side effect or just an ancillary issue. Pharmacovigilance relies on accurate source data, and VAERS is not set up, with its current use of MedDRA, to deliver that.

### The coding problem

Once a VAERS report is received, it is MedDRA coded at the CDC. Now, no manual coding is perfect, but that's not the point here. The problem is that a MedDRA code does not, in and of itself, indicate the level of detail it holds. For instance, 10025169 and 10021881 look all alike, where in fact the first is a lowest-level entity (an LLT – Lower-Level Term – in MedDRA lingo) representing Lyme disease, while the latter is the top-level class (SOC – System Organ Class) corresponding to infectious diseases. What this means is that once we see a MedDRA coded entity as its code, we don't know what level of specificity we are dealing with.

The problem gets worse with named entities. You see, MedDRA has a 'leaf' structure: every branch must terminate in one or more (usually one) LLT. Often enough, LLTs have the same name as their parent PT, so you get PT Lyme disease and LLT Lyme disease. Not that it terrifically matters for most applications, but when you see only the verbose output, as is the case in VAERS, you don't know if this is a PT, an LLT, or, God forbid, a higher level concept with a similar name. Finally, to put the cherry on top of the cake, where a PT is also the LLT, they have the same code. So for Lyme disease, the PT and LLT both have the code 10025169. I'm sure this seemed like a good idea at the time.

### The issue of multiple coding

As has been touched upon previously, because MedDRA lacks an inherent hierarchy, a code cannot be converted into its next upper level without using a lookup table, whereas with, say, ICD-10, one can simply normalise to the chapter and block (the 'part left of the dot'). More problematically, however, the same code may be a PT or an LLT, as is the case for Lyme disease (10025169).

Let's look at this formally. Let the operator $\in^*$ denote membership under the transitive closure of the set membership relation, so that

1. if $x \in A$, then $x \in^* A$,
2. if $x \in A$ and $A \subseteq B$, then $x \in^* B$,

and so on, recursively, ad infinitum.
Let furthermore $\in^*_{m}$ denote the depth of recursion, so that

1. for $x \in A$: $x \in^*_{0} A$,
2. for $x \in A \mid A \subseteq B$: $x \in^*_{1} B$,

and, once again, so on, recursively, ad infinitum.

Then let a coding scheme $\{S_{1...n}\}$ exhibit the Definite Degree of Transitiveness (DDoT) property iff (if and only if) for any $S_m \mid m \leq n$, there exists exactly one $p$ for which it is true that $S_m \in^*_{p} S$. Or, in other words, two codes $S_q, S_r \mid q, r \leq n$ may not be representable identically if $p_q \neq p_r$. Less formally: two codes on different levels may not be identical. This is clearly violated in MedDRA, as the example below shows.

### Bonus: the ethical problem

To me as a public health researcher, there is a huge ethical problem with the use of MedDRA in VAERS. I believe very strongly in open data and in the openness of biomedical information. I'm not alone: for better or worse, the wealth – terabytes upon terabytes – of biomedical data, genetics, X-ray crystallography, models, sequences prove that if I'm a dreamer, I'm not the only one. Which is why it's little short of an insult to the public that a pharmacovigilance system is using a proprietary encoding model. Downloads from VAERS, of course, provide the verbose names of the conditions or symptoms, but not what hierarchical level they are at, nor what structure they are part of. For that, unless you are a regulatory authority or a 'non-profit' or 'non-commercial' entity (which would already exclude a blogger who, unlike me, has ads on their blog to pay for hosting, or indeed most individual researchers, who by their nature could not provide the documentation to prove they aren't making any money), you have to shell out some serious money. Worse, the 'non-profit' definition does not include a non-profit research institution or an individual non-profit researcher, or any of the research bodies that are not medical libraries or affiliated with educational institutions but are funded by third party non-profit funding.

There is something rotten with the use of MedDRA, and it's not just how unsuitable it is for the purpose, it is also the sheer obscenity of a public database of grave public interest being tied to a (vastly unsuitable and flawed, as I hope has been demonstrated above) nomenclature.

## Is VAERS lost?

### Resolving the MedDRA issue

Unlike quite a few people in the field, I don't think VAERS is hopelessly lost. There's, in fact, great potential in it. But the way it integrates with MedDRA has to be changed. This is both a moral point – a point of commitment to opening up government information – and one of facilitating research. There are two alternatives at this point for the CDC.

1. MedDRA has to open up at least the 17% of codes, complete with hierarchy, that are used within VAERS. These should be accessible, complete with the hierarchy, within VAERS, including the CDC WONDER interface.
2. The CDC has to switch to a more suitable system. ICD-10 alone is not necessarily the best solution, and there are few alternatives, which puts MedDRA into a monopoly position that it seems to mercilessly exploit at this time. This can – and should – change.

### Moving past the Single Field Problem

MedDRA apart, it is crucial for VAERS to resolve the Single Field Problem. It is clear from the issues presented in the first paragraph – a broken femur, Henoch-Schönlein purpura, fainting, an expired vaccine and a healthy childbirth – that there is a range of issues that need to be logged.
A good structure would be:

1. pre-existing conditions and risk factors,
2. symptoms that arose within 6 hours of administration,
3. symptoms that arose within 48 hours of administration,
4. symptoms that arose later than 48 hours after administration,
5. non-symptoms,
6. clinical tests without results,
7. clinical tests segmented by positive and negative results, and
8. ancillary circumstances, esp. circumstances pertaining to vaccination errors such as wrong vaccine administered, expired vaccine, etc.

The use of this segmentation would not only differentiate by time of occurrence, but also allow for adequate filtering to identify the correct denominators for the $PRR$.

### A future with (for?) MedDRA

As said, I am not necessarily hostile to MedDRA, even if the closet libertarian in me bristles at the fact that MedDRA is mercilessly exploiting what is an effective monopoly position. But MedDRA can be better, and needs to be better – if not for its own economic interests, then for the interests of those it serves. There are three particular suggestions MedDRA needs to seriously consider.

1. MedDRA's entity structure is valuable – arguably, it's the value in the entire project. If coding can be structured to reflect its internal hierarchy, MedDRA becomes parseable without a LUT,1 and kinship structures become parseable without the extra step of a LUT.
2. MedDRA needs to open up, especially to researchers not falling within its narrowly defined confines of access, especially given the inherent public nature of its use – PhV and regulation are quintessentially public functions, and this needs an open system.
3. MedDRA's entity structure's biggest strength is that it comprises a range of different things, from administrative errors through physical injuries to test results and the simple fact of tests.

## Conclusion

VAERS is a valuable system with a range of flaws. All of them are avoidable and correctable – but that would require the requisite level of will and commitment, both on CDC's side and that of MedDRA. For any progress in this field, it is imperative that the CDC understand that a public resource maintained in the public interest cannot be driven by a proprietary nomenclature, least of all one that is priced out of the range of the average interested individual: and if they cannot be served, does the entire system even fulfill its governmental function of being of the people and for the people? It is ultimately CDC's asset, and it has a unique chance to leverage its position to ensure that at least as far as the 17% of MedDRA codes go that are used in VAERS, these are released openly.

In the end, however sophisticated our dissimilarity metrics, when 30% of all entities are non-symptoms and we need to manually prune the key terms to avoid denominator bloat due to non-symptom entities, such as diagnostic tests without results or clearly unconnected causes of morbidity and mortality like motor vehicle accidents, dissimilarity based approaches will suffer from serious flaws. In the absence of detailed administration and symptom tracking at an individual or institutional level, dissimilarity metrics are the cheapest and most feasible ways of creating value out of post marketing passive reports. If VAERS is to be a useful research tool, as I firmly believe it was intended to be, it must evolve to that capability for all.

References

1. Look-up table

# SafeGram: visualising drug safety

Update: an RMarkdown notebook explaining the whole process is available here.
Visualising vaccine safety is hard. Doing so from passive (or, as we say it in Britain, 'spontaneous'!) pharmacovigilance (PhV) sources is even harder. Unlike in active or trial pharmacovigilance, where you are essentially dividing the number of incidents by the person-time or the number of patients in the cohort overall, in passive PhV, only incidents are reported. This makes it quite difficult to figure out their prevalence overall, but fortunately, we have some metrics we can use to better understand the issues with a particular medication or vaccine. The proportional reporting ratio ($PRR$) is a metric that can operate entirely on spontaneous reporting, and reflect how frequent a particular symptom is for a particular treatment versus all other treatments.

#### Defining $PRR$

For convenience's sake, I will use the subscript $*$ operator to mean a row or column sum of a matrix, so that

$$N_{i,*} = \displaystyle \sum_{j=1}^{n} N_{i,j}$$

and

$$N_{*,j} = \displaystyle \sum_{i=1}^{m} N_{i,j}$$

and furthermore, I will use the exclusion operator $* \neg$ to mean all entities except the right hand value, so e.g.

$$N_{i, * \neg k} = \displaystyle \sum_{j=1, j \neq k}^m N_{i,j}$$

Conventionally, the PRR is often defined with reference to a 2×2 contingency table that cross-tabulates treatments ($m$ axis) with adverse effects ($n$ axis):

| | Adverse effect of interest ($i$) | All other adverse effects ($\neg i$) | TOTAL |
| --- | --- | --- | --- |
| Treatment of interest ($j$) | $a = D_{i,j}$ | $b = D_{i, * \neg j}$ | $a + b = D_{i, *} = \displaystyle \sum_{j = 1}^{n} D_{i, j}$ |
| All other treatments ($\neg j$) | $c = D_{* \neg i, j}$ | $d = D_{* \neg i, * \neg j}$ | $c + d = D_{* \neg i, *} = \displaystyle \sum_{k=1, k \neq i}^{m} \sum_{l = 1}^{n} D_{k, l}$ |

With reference to the contingency table, the $PRR$ is usually defined as

$$\frac{a / (a+b)}{c / (c+d)} = \frac{a}{a + b} \cdot \frac{c + d}{c}$$

However, let's formally define it over any matrix $D$.

Definition 1. $PRR$. Let $D$ be an $m \times n$ matrix that represents the frequency with which each of the $m$ adverse effects occur for each of the $n$ drugs, so that $D_{i,j}$ ($i \in m$, $j \in n$) represents the number of times the adverse effect $j$ has occurred with the treatment $i$. For convenience's sake, let $D_{*,j}$ denote $\sum_{i=1}^{m} D_{i,j}$, let $D_{i,*}$ denote $\sum_{j=1}^{n} D_{i,j}$, and let $D_{*,*}$ denote $\sum_{i=1}^{m} \sum_{j=1}^{n} D_{i,j}$. Furthermore, let $D_{* \neg i, j}$ denote $\sum_{k \neq i}^{m} D_{k,j}$ and $D_{i, * \neg j}$ denote $\sum_{k \neq j}^{n} D_{i, k}$. Then, $PRR$ can be calculated for each combination $D_{i,j}$ by the following formula:

$$PRR_{i,j} = \frac{D_{i,j} / D_{i,*}}{D_{* \neg i, j} / D_{* \neg i, *}} = \frac{D_{i,j}}{D_{i,*}} \cdot \frac{D_{*\neg i, *}}{D_{*\neg i, j}}$$

Expanding this, we get

$$PRR_{i,j} = \frac{D_{i,j}}{\displaystyle\sum_{q=1}^n D_{i,q}} \cdot \frac{\displaystyle\sum_{r=1, r\ne i}^{m} \displaystyle\sum_{s=1}^{n} D_{r,s}}{\displaystyle\sum_{t=1, t\ne i}^{m} D_{t,j}}$$

This looks and sounds awfully convoluted until we start to think of it as a relatively simple query operation: calculate the sum of each row, then calculate the quotient of the ADR of interest associated with the treatment of interest divided by all uses of the treatment of interest on one hand, and the ADR of interest associated with all other drugs ($j \mid \neg i$ or $c$) divided by all ADRs associated with all treatments other than the treatment of interest on the other. Easy peasy!

### Beyond $PRR$

However, the PRR only tells part of the story.
### Beyond $PRR$

However, the $PRR$ only tells part of the story. It does show whether a particular symptom is disproportionately often reported – but does it show whether that particular symptom is frequent at all? Evans (1998) suggested using a combination of an $N$-minimum, a $PRR$ value and a chi-square value to identify a signal.1 In order to represent the overall safety profile of a drug, it’s important to show not only the $PRR$ but also the overall incidence of each risk. The design of the SafeGram is to show exactly that, for every side effect known to have occurred. To give a better estimate, instead of plotting individual points (there are several hundred, or even thousands, of different side effects), the kernel density is plotted.

The reason why SafeGrams are so intuitive is that they convey two important facts at once. First, the $PRR$ cut-off (set to 3.00 in this case) conclusively excludes statistically insignificant increases of risk.2 Of course, anything above that is not necessarily dangerous or proof of a safety signal. Rather, it allows the clinician to reason about the side effect profile of the particular medication.

• The meningococcal vaccine (left upper corner) had several side effects that occurred frequently (hence the tall, ‘flame-like’ appearance). However, these were largely side effects that were shared among other vaccines (hence the low $PRR$). This is the epitome of a safe vaccine, with few surprises likely.
• The injectable polio vaccine (IPV) has a similar profile, although the widely disseminated ‘margin’ (blue) indicates that it has a wider range of side effects compared to the meningococcal vaccine, even though virtually all of these were side effects shared among other vaccines to the same extent.
• The oral polio vaccine (OPV, left bottom corner) shows a flattened pattern typical for vaccines that have a number of ‘peculiar’ side effects. While the disproportionately frequently reported instances are relatively infrequent, the ‘tail-like’ appearance of the OPV SafeGram is a cause for concern. The difference between meningococcal and IPV on one hand and OPV on the other is explained largely by the fact that OPV was a ‘live’ vaccine, and in small susceptible groups (hence the low numbers), it could provoke adverse effects.
• The smallpox vaccine, another live vaccine, was known to have a range of adverse effects, with a significant part of the population (about 20%) having at least one contraindication. The large area covered indicates that there is a rather astonishing diversity of side effects, and many of these – about half of the orange kernel – lie above the significance boundary of 3.00. The large area covered by the kernel density estimate and the reach into the right upper corner indicate a very probable safety signal worth examining.
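The panels in Fig. 1 were generated in R, as described in the RMarkdown notebook linked above. Purely to illustrate the idea of plotting a kernel density of $PRR$ against report counts rather than individual points, a rough Python/matplotlib sketch of a single panel might look like the following (the axis orientation is inferred from the description in this post, and the function name and signature are mine, not the original code’s):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def safegram_panel(ax, prr_values, report_counts, prr_cutoff=3.0):
    """One SafeGram-style panel: a 2D kernel density of PRR (x axis)
    against log10 report count (y axis) for one vaccine's adverse effects."""
    x = np.asarray(prr_values, dtype=float)
    y = np.log10(np.asarray(report_counts, dtype=float))

    kde = gaussian_kde(np.vstack([x, y]))
    xi, yi = np.mgrid[x.min():x.max():200j, y.min():y.max():200j]
    zi = kde(np.vstack([xi.ravel(), yi.ravel()])).reshape(xi.shape)

    ax.contourf(xi, yi, zi, levels=12)
    ax.axvline(prr_cutoff, linestyle="--", color="red")  # PRR significance line
    ax.set_xlabel("PRR")
    ax.set_ylabel("log10(reports)")

# Example panel from synthetic data, just to show the mechanics
rng = np.random.default_rng(0)
fig, ax = plt.subplots()
safegram_panel(ax, rng.lognormal(0, 1, 500), rng.integers(1, 2000, 500))
plt.show()
```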
### Interpretation

A SafeGram for each vaccine shows the two-dimensional density distribution of two things – the frequency and the proportional reporting ratio of each adverse effect reported for that vaccine (or drug, or device, or whatever it is applied to). When considering the safety of a particular product, the most important question is whether a particular adverse effect is serious – a product with a low chance of an irreversible severe side effect is riskier than one with a high probability of a relatively harmless side effect, such as localized soreness after injection. But the relative severity of a side effect is hard to quantify, and a better proxy for it is to assume that, in general, most severe side effects will be unique to a particular vaccine. So, for instance, while injection site reactions and mild pyrexia following inoculation are common to all vaccines – and hence their relative reporting rates are relatively low, reflecting roughly the number of inoculations administered – serious adverse effects tend to be more particular to the vaccine (e.g. the association of influenza vaccines with Guillain-Barré syndrome in certain years means that GBS has an elevated $PRR$ for the flu vaccines, despite the low number of occurrences). Discarding vaccines with a very low number of administered doses, the SafeGram remains robust to differences between vaccines in the number of doses administered.

Fig. 1 above shows a number of typical patterns. In general, anything to the left of the vertical significance line can be safely ignored: these are generally effects shared with most other vaccines and exhibit no specific risk signal for the particular vaccine. On the other hand, occurrences to the right of the vertical significance line may – but do not necessarily – indicate a safety signal. Of particular concern are right upper quadrant signals – these are frequent and at the same time peculiar to a particular vaccine, suggesting that they are not part of the typical post-inoculation syndrome (fever, fatigue, malaise) arising from immune activation but rather a specific issue created by the antigen or the adjuvant. In rare cases, there is a lower right corner ‘stripe’, such as for the OPV, where a wide range of unique but relatively infrequent effects are produced. These, too, might indicate the need for closer scrutiny. It is crucial to note that merely having a density of signals in the statistically significant range does not automatically mean that there is a PhV concern, but rather that such a concern cannot be excluded.

Setting the $PRR$ significance limit is somewhat arbitrary, but Evans et al. (2001) have found a $PRR$ of 2, more than 3 cases over a two-year period and a chi-square statistic of 4 or above to be suggestive of a safety signal. Following this lead, the original SafeGram code looks at a $PRR$ of 3.0 and above, and disregards cases with an overall frequency below $3Y$, where $Y$ denotes the number of years considered.
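Again purely as an illustration (the thresholds come from the criteria above, but the function itself is mine, not the SafeGram code), such a screening rule can be expressed over the same treatment × adverse effect count matrix used for the $PRR$:

```python
import numpy as np

def safety_signal_mask(D, prr_min=3.0, min_cases=3, chi2_min=4.0):
    """Flag treatment/adverse-effect pairs that meet an Evans-style screen:
    a PRR threshold, a minimum case count and a minimum chi-square for the
    implied 2x2 contingency table. D[i, j] counts reports of adverse
    effect j for treatment i."""
    D = np.asarray(D, dtype=float)
    a = D                                  # AE of interest, treatment of interest
    b = D.sum(axis=1, keepdims=True) - D   # other AEs, same treatment
    c = D.sum(axis=0, keepdims=True) - D   # same AE, other treatments
    d = D.sum() - a - b - c                # everything else
    n = a + b + c + d

    with np.errstate(divide="ignore", invalid="ignore"):
        prr = (a / (a + b)) / (c / (c + d))
        chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

    return (prr >= prr_min) & (a >= min_cases) & (chi2 >= chi2_min)
```

Evans et al.’s own thresholds would be prr_min=2 with the same case-count and chi-square minima; the defaults here follow the stricter 3.0 cut-off used in the SafeGram, and the chi-square is the plain, uncorrected Pearson statistic.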
### Limitations

The SafeGram inherently tries to make the best out of imperfect data. Acknowledging that passive reporting data is subject to imperfections, some caveats need to be kept in mind.

• The algorithm assigns equal weight to every ‘symptom’ reported. VAERS uses an unfiltered version of MedDRA, a coding system for regulatory activities, and this includes a shocking array of codes that do not suggest any pathology. For instance, the VAERS implementation of MedDRA contains 530 codes for normal, non-pathological states (e.g. “abdomen scan normal”), and almost 18,000 (!) events involve at least one of these ‘everything is fine!’ markers. This may be clinically useful, because such codes may assist in differential diagnosis and in excluding other causes of symptoms, but since they’re not treated separately from actually pathological symptoms, they corrupt the data to a minor but not insignificant extent. The only solution is manual filtering, and with tens of thousands of MedDRA codes, one would not necessarily be inclined to do so. The consequence is that some symptoms aren’t symptoms at all – they’re the exact opposite. This is not a problem for the $PRR$ itself, because it compares a symptom among those taking a particular medication against the same symptom among those who are not.
• A lot of VAERS reports are, of course, low-quality reports, and there is no way for the SafeGram to differentiate. This is a persistent problem with all passive reporting systems.
• The SafeGram gives an overall picture of a particular drug’s or vaccine’s safety. It does not differentiate between the relative severities of particular symptoms.
• As usual, correlation does not equal causation. As such, none of this proves the actual risk or danger of a vaccine; rather, it shows correlations – in other words, potential safety signals that are worth examining.

SafeGrams are a great way to show the safety of vaccines, and to identify which vaccines have frequently occurring and significantly distinct (high-$PRR$) AEFIs that may be potential signals. It is important to note that for most common vaccines, including controversial ones like HPV, the centre of the kernel density estimate lies below the $PRR$ signal limit. The SafeGram is a useful and visually appealing demonstration of the safety of vaccines, one that can extract actionable intelligence from VAERS passive reporting evidence that is often disregarded as useless.

References

1. Evans, S. J. W. et al. (1998). Proportional reporting ratios: the uses of epidemiological methods for signal generation. Pharmacoepidemiol Drug Saf, 7(Suppl 2), 102.
2. According to Evans et al., the correct figure for PRR exclusion is 2.00, but they also use N-restriction and a minimum chi-square of 4.0.

# Using screen to babysit long-running processes

In machine learning, especially in deep learning, long-running processes are quite common. Just yesterday, I finished running an optimisation process that ran for the best part of four days – and that’s on a 4-core machine with an Nvidia GRID K2, letting me crunch my data on 3,072 GPU cores! Of course, I did not want to babysit the whole process. Least of all did I want to have to do so from my laptop. There’s a reason we have tools like Sentry, which can be easily adapted from webapp monitoring to letting you know how your model is doing.

One solution is to spin up another virtual machine, ssh into that machine, then from that ssh into the machine running the code, so that if you drop the connection to the first machine, it will not drop the connection to the second. There is also nohup, which makes sure that the process is not killed when you ‘hang up’ the ssh connection. You will, however, not be able to get back into the process again. There are also reparenting tools like reptyr, but the need they meet is somewhat different.

Enter terminal multiplexers. Terminal multiplexers are old. They date from the era of things like time-sharing systems and other antiquities whose purpose was to allow a large number of users to get their time on a mainframe designed to serve hundreds, even thousands of users. With the advent of personal computers that had decent computational power on their own, terminal multiplexers remained the preserve of universities and other weirdos still using mainframe architectures. Fortunately for us, two great terminal multiplexers, screen (aka GNU Screen) and tmux, are still being actively developed, and are almost definitely available for your *nix of choice. This gives us a convenient tool to sneak a peek at what’s going on with our long-suffering process. Here’s how.

Step 1

ssh into your remote machine, and launch screen. You may need to do this as sudo if you encounter the error where screen, instead of starting up a new shell, returns [screen is terminating] and quits.
If screen is started up correctly, you should be seeing a slightly different shell prompt (and if you started it as sudo, you will now be logged in as root). In some scenarios, you may want to ‘name’ your screen session. Typically, this is the case when you want to share your screen with another user, e.g. for pair programming. To create a named screen, invoke screen using the session name parameter -S, as in e.g. screen -S my_shared_screen.

Step 2

In this step, we will be launching the actual script to run. If your script is Python-based and you are using virtualenv (as you ought to!), activate the environment now using source <virtualenv folder>/<virtualenv>/bin/activate, replacing <virtualenv folder> by the name of the folder where your virtualenvs live (for me, that’s the environments folder; often enough it’s something like ~/.virtualenvs) and <virtualenv> by the name of your virtualenv (in my case, research). You have to activate your virtualenv even if you have done so outside of screen already (remember, screen means you’re in an entirely new shell, with all environment configurations, settings, aliases &c. gone)!

With your virtualenv activated, launch the script as normal – no need to launch it in the background. Indeed, one of the big advantages is the ability to see verbose-mode progress indicators. If your script does not have a progress logger to stdout but logs to a logfile, you can start it using nohup, then put it into the background (Ctrl-Z, then bg) and track progress using tail -f logfile.log (where logfile.log is, of course, to be substituted by the filename of the logfile).

Step 3

Press Ctrl-A followed by Ctrl-D to detach from the current screen. This will take you back to your original shell after noting the address of the screen you’re detaching from. These always follow the format <identifier>.<session id>.<hostname>, where <hostname> is, of course, the hostname of the computer from which the screen session was started, <session id> stands for the name you gave your screen, if any, and <identifier> is an autogenerated 4-6 digit socket identifier. In general, as long as you are on the same machine, the screen identifier or the session name will be sufficient – the full canonical name is only necessary when trying to access a screen on another host. To see a list of all screens running under your current username, enter screen -list. Refer to that listing, or the address echoed when you detached from the screen, to reattach to the process using screen -r <socket identifier>[.<session identifier>.<hostname>]. This will return you to the script, which keeps executing in the background.

Result

Reattaching to the process running in the background, you can now follow the progress of the script. Use the key combination in Step 3 to step out of the process anytime, and the rest of the step to return to it.

Bugs

There is a known issue, caused by strace, that leads to screen immediately closing with the message [screen is terminating] upon invoking screen as a non-privileged user. There are generally two ways to resolve this issue.

• Use a privileged user account and always invoke screen as sudo.
• As a privileged user, change the permissions of screen to 2775 by entering sudo chmod 2775 $(which screen). The leading 2 sets the setgid bit, which grants the process the privileges of the executable’s owning group upon execution, so that repeated sudoing will not be necessary.

The overall effect of both solutions is the same. Notably, both may be undesirable from a security perspective. As always, weigh risks against utility.

Do you prefer screen to staying logged in? Do you have any other cool hacks to make monitoring a machine learning process that takes considerable time to run? Let me know in the comments!

Image credits: Zenith Z-19 by ajmexico on Flickr

# Fixing the mysterious Jupyter Tensorflow import bug

There’s a weird bug afoot that you might encounter when setting up a ‘lily white’ (brand new) development environment to play around with Tensorflow. As it seems to have vexed quite a few people, I thought I’d put my solution here to help future tensorflowers find their way. The problem presents after you have set up your new virtualenv. You install Jupyter and Tensorflow, and when importing, you get this:

In [1]:   import tensorflow as tf

---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
in ()
----> 1 import tensorflow as tf

ModuleNotFoundError: No module named 'tensorflow'

Oh.

Say you are a dogged pursuer of bugs, and wish to check whether you might have installed Tensorflow and Jupyter into different virtualenvs. One way to do that is to simply activate your virtualenv (using activate or source activate, depending on whether you use virtualenvwrapper) and start a Python shell. Perplexingly, importing Tensorflow here will work just fine.
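A quick way to confirm that suspicion – not part of the original walkthrough, just a diagnostic I find handy – is to compare which interpreter each environment is actually running:

```python
# Run this both in the plain Python shell inside your activated virtualenv
# and in a Jupyter notebook cell. If the two paths differ, Jupyter is being
# served by a different Python (and a different site-packages) than the one
# you installed Tensorflow into.
import sys

print(sys.executable)   # path of the running interpreter
print(sys.prefix)       # root of the environment it belongs to
```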

### The solution

Caution: At this time, this works only for CPython, aka ‘regular Python’ (if you don’t know what kind of Python you are running, it is in all likelihood CPython).

Note: In general, it is advisable to start fixing these issues by destroying your virtualenv and starting anew, although that’s not strictly necessary.

Create a virtualenv, and note the base Python executable’s version (it has to be a version for which there is a Tensorflow wheel for your platform, i.e. 2.7 or 3.3-3.6).

Step 1

Go to the PyPI website to find the Tensorflow installation appropriate to your system and your Python version (e.g. cp36 for Python 3.6). Copy the path of the correct version, then open up a terminal window and declare it as the environment variable TF_BINARY_URL. Use pip to install from the URL you set as the environment variable, then install Jupyter.

$ export TF_BINARY_URL=https://pypi.python.org/packages/b1/74/873a5fc04f1aa8d275ef1349b25c75dd87cbd7fb84fe41fc8c0a1d9afbe9/tensorflow-1.1.0rc2-cp36-cp36m-macosx_10_11_x86_64.whl#md5=c9b6f7741d955d1d3b4991a7942f48b9
$ pip install --upgrade $TF_BINARY_URL jupyter
Collecting tensorflow==1.1.0rc2 from https://pypi.python.org/packages/b1/74/873a5fc04f1aa8d275ef1349b25c75dd87cbd7fb84fe41fc8c0a1d9afbe9/tensorflow-1.1.0rc2-cp36-cp36m-macosx_10_11_x86_64.whl#md5=c9b6f7741d955d1d3b4991a7942f48b9
Using cached tensorflow-1.1.0rc2-cp36-cp36m-macosx_10_11_x86_64.whl
Collecting jupyter
Using cached jupyter-1.0.0-py2.py3-none-any.whl

(... lots more installation steps to follow ...)

Successfully installed ipykernel-4.6.1 ipython-6.0.0 jedi-0.10.2 jinja2-2.9.6 jupyter-1.0.0 jupyter-client-5.0.1 jupyter-console-5.1.0 notebook-5.0.0 prompt-toolkit-1.0.14 protobuf-3.2.0 qtconsole-4.3.0 setuptools-35.0.1 tensorflow-1.1.0rc2 tornado-4.5.1 webencodings-0.5.1 werkzeug-0.12.1
Step 2
Now for some magic. If you launch Jupyter now, there’s a good chance it won’t find Tensorflow. Why? Because you have only just installed Jupyter, your shell might not yet have updated the jupyter alias to point to the copy in the virtualenv rather than the one in your system Python installation.

Enter which jupyter to find out where the Jupyter link is pointing. If it is pointing to a path within your virtualenvs folder, you’re good to go. Otherwise, open a new terminal window and activate your virtualenv. Check where the jupyter command is pointing now – it should point to the virtualenv.

Step 3
Fire up Jupyter, and import tensorflow. Voila – you have a fully working Tensorflow environment!

As always, let me know if it works for you in the comments, or if you’ve found some alternative ways to fix this issue. Hopefully, this helps you on your way to delve into Tensorflow and explore this fantastic deep learning framework!

Header image: courtesy of Jeff Dean, Large Scale Deep Learning for Intelligent Computer Systems, adapted from Untangling invariant object recognition by DiCarlo and Cox (2007).

# A cookiecutter for Flask-Ask

If you develop for Amazon’s Alexa-powered devices, you must at some point have come across Flask-Ask, a project by John Wheeler that lets you quickly and easily build Python-based Skills for Alexa. It’s so easy, in fact, that John’s quickstart video, showing the creation of a Flask-Ask based Skill from zero to hero, takes less than five minutes! How awesome is that? Very awesome.

Bootstrapping a Flask-Ask project is not difficult – in fact, it’s pretty easy, but also pretty repetitive. And so, being the ingenious lazy developer I am, I’ve come up with a (somewhat opinionated) cookiecutter template for Flask-Ask.

## Usage

Using the Flask-Ask cookiecutter should be trivial.  Make sure you have cookiecutter installed, either in a virtualenv that you have activated or your system installation of Python. Then, simply use  cookiecutter gh:chrisvoncsefalvay/cookiecutter-flask-ask to get started. Answer the friendly assistant’s questions, and voila! You have the basics of a Flask-Ask project all scaffolded.

Once you have scaffolded your project, you will have to create a virtualenv for your project and install dependencies by invoking pip install -r requirements.txt. You will also need ngrok to test your skill from your local device.

## What’s in the box?

The cookiecutter has been configured with my Flask-Ask development preferences in mind, which in turn borrow heavily from John Wheeler’s. The cookiecutter provides a scaffold of a Flask application, including not only session start handlers and an example intent but also a number of handlers for built-in Alexa intents, such as Yes, No and Help.

There is also a folder structure you might find useful, including an intent schema for some basic Amazon intents and a corresponding empty sample_utterances.txt file, as well as a gitkeep’d folder for custom slot types. Because I’m a huge fan of Sphinx documentation and strongly believe that voice apps need to be assiduously documented to live up to their potential, there is also a docs/ folder with a Makefile and an opinionated conf.py configuration file.

## Is that all?!

Blissfully, yes, it is. Thanks to John’s extremely efficient and easy-to-use Flask-Ask project, you can discourse with your very own skill less than twenty minutes after starting the scaffolding!

You can find the cookiecutter-flask-ask project here. Issues, bugs and other woes are welcome, as are contributions (simply raise a pull request). For help and advice, you can find me on the Flask-Ask Gitter a lot during daytime CET.

# Diffie-Hellman in under 25 lines

How can you and I agree on a secret without anyone eavesdropping being able to intercept our communications? At first, the idea sounds absurd – for the longest time, without a pre-shared secret, encryption was seen as impossible. In World War II, the Enigma machines relied on a fairly complex pre-shared secret – the Enigma configurations (consisting of the rotor drum wirings and number of rotors specific to the model, the Ringstellung of the day, and Steckerbrett configurations) were effectively the pre-shared key. During the Cold War, field operatives were provided with randomly (if they were lucky) or pseudorandomly (if they weren’t, which was most of the time) generated1 one-time pads (OTPs) with which to encrypt their messages. Cold War era Soviet OTPs were, of course, vulnerable because, like most Soviet things, they were manufactured sloppily.2 But OTPs suffer from a big problem: if the key is known, the entire scheme of encryption is defeated. And somehow, you need to get that key to your field operative.

Enter the triad of Merkle, Diffie and Hellman, who in 1976 found a way to exploit the fact that raising a number to a power modulo a large prime is easy, while recovering the exponent from the result (the discrete logarithm problem) is hard. From this, they derived the algorithm that came to be known as the Diffie-Hellman algorithm.3

## How to cook up a key exchange algorithm

The idea of a key exchange algorithm is to end up with a shared secret without having to exchange anything that would require transmission of the secret. In other words, the assumption is that the communication channel is unsafe. The algorithm must withstand an eavesdropper knowing every single exchange.

Alice and Bob must first agree to use a modulus $p$ and a base $g$, so that the base is a primitive root modulo the modulus.

Alice and Bob each choose a secret key $a$ and $b$ respectively – ideally, randomly generated. The parties then exchange $A = g^a \mod(p)$ (for Alice) and $B = g^b \mod(p)$ (for Bob).

Alice now has received $B$. She goes on to compute the shared secret $s$ by calculating $B^a \mod(p)$ and Bob computes it by calculating $A^b \mod(p)$.

The whole story is premised on the equality of

$A^b \mod(p) = B^a \mod(p)$

That this holds nearly trivially true should be evident from substituting $g^b$ for $B$ and $g^a$ for $A$. Then,

$g^{ab} \mod(p) = g^{ba} \mod(p)$

Thus, both parties get the same shared secret. An eavesdropper would be able to get $A$ and $B$. Given a sufficiently large prime for $p$, in the range of 600-700 digits, the discrete logarithm problem of retrieving $a$ from $A = g^a \mod(p)$ in the knowledge of $g$ and $p$ is not efficiently solvable, not even given fairly extensive computing resources.
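The post’s title promises an implementation in under 25 lines; the excerpt above stops before the listing, so here is a minimal sketch of the exchange just described – not the author’s own code – using deliberately tiny textbook parameters (p = 23, g = 5) that would offer no security whatsoever in practice:

```python
import secrets

# Toy parameters: in real use, p is a prime hundreds of digits long and
# g a primitive root modulo p, agreed upon over the open channel.
p, g = 23, 5

a = secrets.randbelow(p - 2) + 1   # Alice's private key, never transmitted
b = secrets.randbelow(p - 2) + 1   # Bob's private key, never transmitted

A = pow(g, a, p)                   # Alice sends A to Bob in the open
B = pow(g, b, p)                   # Bob sends B to Alice in the open

s_alice = pow(B, a, p)             # Alice computes B^a mod p
s_bob = pow(A, b, p)               # Bob computes A^b mod p

assert s_alice == s_bob            # both sides now hold the same shared secret
print(s_alice)
```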

References

1. As a child, I once built a pseudorandom number generator from a sound card, a piece of wire and some stray radio electronics, which basically rested on a sampling of atmospheric noise. I was surprised to learn much later that this was the method the KGB used as well.
2. Under pressure from the advancing German Wehrmacht in 1941, they had duplicated over 30,000 pages’ worth of OTP code. This broke the golden rule of OTPs of never, ever reusing code, and opened a backdoor that two of the most eminent cryptanalysts of the 20th century, Genevieve Grotjan Feinstein and Meredith Gardner, on whose shoulders the success of the Venona project rested, could exploit.
3. It deserves noting that the D-H key exchange algorithm was another of those inventions that were invented twice but published once. In 1975, the GCHQ team around Clifford Cocks invented the same algorithm, but was barred from publishing it. Their achievements weren’t recognised until 1997.