Structuring R projects

There are some things that I call Smith goods:1 things I want, nay, require, but hate doing. A clean room is one of these – I have a visceral need to have some semblance of tidiness around me, I just absolutely hate tidying, especially in the summer.2 Starting and structuring packages and projects is another of these things, which is why I’m so happy things like cookiecutter exist that do it for you in Python. [su_pullquote align=”right”]While I don’t like structuring R projects, I keep doing it, because I know it matters. That’s a pearl of wisdom that came occasionally at a great price.[/su_pullquote]I am famously laid back about structuring R projects – my chill attitude is only occasionally compared to the Holy Inquisition, the other Holy Inquisition and Gunny R. Lee Ermey’s portrayal of Drill Sgt. Hartman, and it’s been months since I last gutted an intern for messing up namespaces.3 So while I don’t like structuring R projects, I keep doing it, because I know it matters. That’s a pearl of wisdom that came occasionally at a great price, some of which I am hoping to save you by this post.

Five principles of structuring R projects

Every R project is different. Therefore, when structuring R projects, there has to be a lot more adaptability than there is normally When structuring R projects, I try to follow five overarching principles.

  1. The project determines the structure. In a small exploratory data analysis (EDA) project, you might have some leeway as to structural features that you might not have when writing safety-critical or autonomously running code. This variability in R – reflective of the diversity of its use – means that it’s hard to devise a boilerplate that’s universally applicable to all kinds of projects.
  2. Structure is a means to an end, not an end in itself. The reason why gutting interns, scalping them or yelling at them Gunny style are inadvisable is not just the additional paperwork it creates for HR. Rather, the point of the whole exercise is to create people who understand why the rules exists and organically adopt them, understanding how they help.
  3. Rules are good, tools are better. When tools are provided that take the burden of adherence – linters, structure generators like cookiecutter, IDE plugins, &c. – off the developer, adherence is both more likely and simpler.
  4. Structures should be interpretable to a wide range of collaborators. Even if you have no collaborators, thinking from the perspective of an analyst, a data scientist, a modeller, a data engineer and, most importantly, the client who will at the very end receive the overall product.
  5. Structures should be capable of evolution. Your project may change objectives, it may evolve, it may change. What was a pet project might become a client product. What was designed to be a massive, error-resilient superstructure might have to scale down. And most importantly, your single-player adventure may end up turning into an MMORPG. Your structure has to be able to roll with the punches.

A good starting structure

Pretty much every R project can be imagined as a sort of process: data gets ingested, magic happens, then the results – analyses, processed data, and so on – get spit out. The absolute minimum structure reflects this:

.
└── my_awesome_project
    ├── src
    ├── output
    ├── data
    │   ├── raw
    │   └── processed
    ├── README.md
    ├── run_analyses.R 
    └── .gitignore

In this structure, we see this reflected by having a data/ folder (a source), a folder for the code that performs the operations (src/) and a place to put the results (output/). The root analysis file (the sole R file on the top level) is responsible for launching and orchestrating the functions defined in the src/ folder’s contents.

The data folder

The data folder is, unsurprisingly, where your data goes. In many cases, you may not have any file-formatted raw data (e.g. where the raw data is accessed via a *DBC connection to a database), and you might even keep all intermediate files there, although that’s pretty uncommon on the whole, and might not make you the local DBA’s favourite (not to mention data protection issues). So while the raw/ subfolder might be dispensed with, you’ll most definitely need a data/ folder.

When it comes to data, it is crucial to make a distinction between source data and generated data. Rich Fitzjohn puts it best when he says to treat

  • source data as read-only, and
  • generated data as disposable.

The preferred implementation I have adopted is to have

  • a data/raw/ folder, which is usually is symlinked to a folder that is write-only to clients but read-only to the R user,4,
  • a data/temp/ folder, which contains temp data, and
  • a data/output/ folder, if warranted.

The src folder

Some call this folder R– I find this a misleading practice, as you might have C++, bash and other non-R code in it, but is unfortunately enforced by R if you want to structure your project as a valid R package, which I advocate in some cases. I am a fan of structuring the src/ folder, usually by their logical function. There are two systems of nomenclature that have worked really well for me and people I work with:

  • The library model: in this case, the root folder of src/ holds individual .R scripts that when executed will carry out an analysis. There may be one or more such scripts, e.g. for different analyses or different depths of insight. Subfolders of src/ are named after the kind of scripts they contain, e.g. ETL, transformation, plotting. The risk with this structure is that sometimes it’s tricky to remember what’s where, so descriptive file names are particularly important.
  • The pipeline model: in this case, there is a main runner script or potentially a small number. These go through scripts in a sequence. It is a sensible idea in such a case to establish sequential subfolders or sequentially numbered scripts that are executed in sequence. Typically, this model performs better if there are at most a handful distinct pipelines.

Whichever approach you adopt, a crucial point is to keep function definition and application separate. This means that only the pipeline or the runner scripts are allowed to execute (apply) functions, and other files are merely supposed to define them. Typically, folder level segregation works best for this:

  • keep all function definitions in subfolders of src/, e.g. src/data_engineering, and have the directly-executable scripts directly under src/ (this works better for larger projects), or
  • keep function definitions in src/, and keep the directly executable scripts in the root folder (this is more convenient for smaller projects, where perhaps the entire data engineering part is not much more than a single script).

output and other output folders

Output may mean a range of things, depending on the nature of your project. It can be anything from a whole D.Phil thesis written in a LaTeX-compliant form to a brief report to a client. There are a couple of conventions with regard to output folders that are useful to keep in mind.

Separating plot output

[su_pullquote]My personal preference is that plot output folders should be subfolders of output/, rather than top-tier folders, unless the plots themselves are the objective.[/su_pullquote]It is common to have a separate folder for plots (usually called figs/ or plots/), usually so that they could be used for various purposes. My personal preference is that plot output folders should be subfolders of output folders, rather than top-tier folders, unless they are the very output of the project. That is the case, for instance, where the project is intended to create a particular plot on a regular basis. This was the case, for instance, with the CBRD project whose purpose was to regularly generate daily epicurves for the DRC Zaire ebolavirus outbreak.

With regard to maps, in general, the principle that has worked best for teams I ran was to treat static maps as plots. However, dynamic maps (e.g. LeafletJS apps), tilesets, layers or generated files (e.g. GeoJSON files) tend to deserve their own folder.

Reports and reporting

[su_pullquote]For business users, automatically getting a beautiful PDF report can be priceless.[/su_pullquote]Not every project needs a reporting folder, but for business users, having a nice, pre-written reporting script that can be run automatically and produces a beautiful PDF report every day can be priceless. A large organisation I worked for in the past used this very well to monitor their Amazon AWS expenditure.5 A team of over fifty data scientists worked on a range of EC2 instances, and runaway spending from provisioning instances that were too big, leaving instances on and data transfer charges resulting from misconfigured instances6 was rampant. So the client wanted daily, weekly, monthly and 10-day rolling usage nicely plotted in a report, by user, highlighting people who would go on the naughty list. This was very well accomplished by an RMarkdown template that was ‘knit‘ every day at 0600 and uploaded as an HTML file onto an internal server, so that every user could see who’s been naughty and who’s been nice. EC2 usage costs have gone down by almost 30% in a few weeks, and that was without having to dismember anyone!7

Probably the only structural rule to keep in mind is to keep reports and reporting code separate. Reports are client products, reporting code is a work product and therefore should reside in src/.

Requirements and general settings

I am, in general, not a huge fan of outright loading whole packages to begin with. Too many users of R don’t realise that

  • you do not need to attach (library(package)) a package in order to use a function from it – as long as the package is available to R, you can simply call the function as package::function(arg1, arg2, ...), and
  • importing a package using library(package) puts every single function from that package into the namespace, overwriting by default all previous entries. This means that in order to deterministically know what any given symbol means, you would have to know, at all times, the order of package imports. Needless to say, there is enough stuff to keep in one’s mind when coding in R to worry about this stuff.

However, some packages might be useful to import, and sometimes it’s useful to have an initialisation script. This may be the case in three particular scenarios:

  • You need a particular locale setting, or a particularly crucial environment setting.
  • It’s your own package and you know you’re not going to shadow already existing symbols.
  • You are not using packrat or some other package management solution, and definitely need to ensure some packages are installed, but prefer not to put the clunky install-if-not-present code in every single thing.

In these cases, it’s sensible to have a file you would source before every top-level script – in an act of shameless thievery from Python, I tend to call this requirements.R, and it includes some fundamental settings I like to rely on, such as setting the locale appropriately. It also includes a CRAN install check script, although I would very much advise the use of Packrat over it, since it’s not version-sensitive.

Themes, house style and other settings

It is common, in addition to all this, to keep some general settings. If your institution has a ‘house style’ for ggplot2 (as, for instance, a ggthemr file), for instance, this could be part of your project’s config. But where does this best go?

[su_pullquote align=”right”]I’m a big fan of keeping house styles in separate repos, as this ensures consistency across the board.[/su_pullquote]It would normally be perfectly fine to keep your settings in a config.R file at root level, but a config/ folder is much preferred as it prevents clutter if you derive any of your configurations from a git submodule. I’m a big fan of keeping house styles and other things intended to give a shared appearance to code and outputs (e.g. linting rules, text editor settings, map themes) in separate – and very, very well managed! – repos, as this ensures consistency across the board over time. As a result, most of my projects do have a config folder instead of a single configuration file.

It is paramount to separate project configuration and runtime configuration:

  • Project configuration pertains to the project itself, its outputs, schemes, the whole nine yards. For instance, the paper size to use for generated LaTeX documents would normally be a project configuration item. Your project configuration belongs in your config/ folder.
  • Runtime configuration pertains to parameters that relate to individual runs. In general, you should aspire to have as few of these, if any, as possible – and if you do, you should keep them as environment variables. But if you do decide to keep them as a file, it’s generally a good idea to keep them at the top level, and store them not as R files but as e.g. JSON files. There are a range of tools that can programmatically edit and change these file formats, while changing R files programmatically is fraught with difficulties.

Keeping runtime configuration editable

A few years ago, I worked on a viral forecasting tool where a range of model parameters to build the forecast from were hardcoded as R variables in a runtime configuration file. It was eventually decided to create a Python-based web interface on top of it, which would allow users to see the results as a dashboard (reading from a database where forecast results would be written) and make adjustments to some of the model parameters. The problem was, that’s really not easy to do with variables in an R file.

On the other hand, Python can easily read a JSON file into memory, change values as requested and export them onto the file system. So instead of that, the web interface would store the parameters in a JSON file, from which R would then read them and execute accordingly. Worked like a charm. Bottom line – configurations are data, and using code to store data is bad form.

Dirty little secrets

Everybody has secrets. In all likelihood, your project is no different: passwords, API keys, database credentials, the works. The first rule of this, of course, is never hardcode credentials in code. But you will need to work out how to make your project work, including via version control, while also not divulging credentials to the world at large. My preferred solutions, in order of preference, are:

  1. the keyring package, which interacts with OS X’s keychain, Windows’s Credential Store and the Secret Service API on Linux (where supported),
  2. using environment variables,
  3. using a secrets file that is .gitignored,
  4. using a config file that’s .gitignored,
  5. prompting the user.

Let’s take these – except the last one, which you should consider only as a measure of desperation, as it relies on RStudio and your code should aspire to run without it – in turn.

Using keyring

keyring is an R package that interfaces with the operating system’s keychain management solution, and works without any additional software on OS X and Windows.8 Using keyring is delightfully simple: it conceives of an individual key as belonging to a keyring and identified by a service. By reference to the service, it can then be retrieved easily once the user has authenticated to access the keychain. It has two drawbacks to be aware of:

  • It’s an interactive solution (it has to get access permission for the keychain), so if what you’re after is R code that runs quietly without any intervention, this is not your best bet.
  • A key can only contain a username and a password, so it cannot store more complex credentials, such as 4-ple secrets (e.g. in OAuth, where you may have a consumer and a publisher key and secret each). In that case, you could split them into separate keyring keys.

However, for most interactive purposes, keyring works fine. This includes single-item secrets, e.g. API keys, where you can use some junk as your username and hold only on to the password. [su_pullquote align=”right”]For most interactive purposes, keyring works fine. This includes single-item secrets, e.g. API keys.[/su_pullquote] By default, the operating system’s ‘main’ keyring is used, but you’re welcome to create a new one for your project. Note that users may be prompted for a keychain password at call time, and it’s helpful if they know what’s going on, so be sure you document your keyring calls well.

To set a key, simply call keyring::key_set(service = "my_awesome_service", username = "my_awesome_user). This will launch a dialogue using the host OS’s keychain handler to request authentication to access the relevant keychain (in this case, the system keychain, as no keychain is specified), and you can then retrieve

  • the username: using keyring::key_list("my_awesome_service")[1,2], and
  • the password: using keyring::key_get("my_awesome_service").

Using environment variables

[su_pullquote]The thing to remember about environment variables is that they’re ‘relatively private’: everyone in the user session will be able to read them.[/su_pullquote]Using environment variables to hold certain secrets has become extremely popular especially for Dockerised implementations of R code, as envvars can be very easily set using Docker. The thing to remember about environment variables is that they’re ‘relatively private’: they’re not part of the codebase, so they will definitely not accidentally get committed to the VCS, but everyone who has access to the particular user session  will be able to read them. This may be an issue when e.g. multiple people are sharing the ec2-user account on an EC2 instance. The other drawback of envvars is that if there’s a large number of them, setting them can be a pain. R has a little workaround for that: if you create an envfile called .Renviron in the working directory, it will store values in the environment. So for instance the following .Renviron file will bind an API key and a username:

api_username = "my_awesome_user"
api_key = "e19bb9e938e85e49037518a102860147"

So when you then call Sys.getenv("api_username"), you get the correct result. It’s worth keeping in mind that the .Renviron file is sourced once, and once only: at the start of the R session. Thus, obviously, changes made after that will not propagate into the session until it ends and a new session is started. It’s also rather clumsy to edit, although most APIs used to ini files will, with the occasional grumble, digest .Renvirons.

Needless to say, committing the .Renviron file to the VCS is what is sometimes referred to as making a chocolate fireman in the business, and is generally a bad idea.

Using a .gitignored config or secrets file

config is a package that allows you to keep a range of configuration settings outside your code, in a YAML file, then retrieve them. For instance, you can create a default configuration for an API:

default:
    my_awesome_api:
        url: 'https://awesome_api.internal'
        username: 'my_test_user'
        api_key: 'e19bb9e938e85e49037518a102860147'

From R, you could then access this using the config::get() function:

my_awesome_api_configuration <- config::get("my_awesome_api")

This would then allow you to e.g. refer to the URL as my_awesome_api_configuration$url, and the API key as my_awesome_api_configuration$api_key. As long as the configuration YAML file is kept out of the VCS, all is well. The problem is that not everything in such a configuration file is supposed to be secret. For instance, it makes sense for a database access credentials to have the other credentials DBI::dbConnect() needs for a connection available to other users, but keep the password private. So .gitignoreing a config file is not a good idea.

[su_pullquote align=”right”]A dedicated secrets file is a better place for credentials than a config file, as this file can then be wholesale .gitignored.[/su_pullquote]A somewhat better idea is a secrets file. This file can be safely .gitignored, because it definitely only contains secrets. As previously noted, definitely create it using a format that can be widely written (JSON, YAML).9 For reasons noted in the next subsection, the thing you should definitely not do is creating a secrets file that consists of R variable assignments, however convenient an idea that may appear at first. Because…

Whatever you do…

One of the best ways to mess up is creating a fabulous way of keeping your secret credentials truly secret… then loading them into the global scope. Never, ever assign credentials. Ever.

You might have seen code like this:

dbuser <- Sys.getenv("dbuser")
dbpass <- Sys.getenv("dbpass")

conn <- DBI::dbConnect(odbc::odbc(), UID = dbuser, PWD = dbpass)
[su_pullquote]Never, ever put credentials into any environment if possible – especially not into the global scope.[/su_pullquote]This will work perfectly, except once its done, it will leave the password and the user name, in unencrypted plaintext (!), in the global scope, accessible to any code. That’s not just extremely embarrassing if, say, your wife of ten years discovers that your database password is your World of Warcraft character’s first name, but also a potential security risk. Never put credentials into any environment if possible, and if it has to happen, at least make it happen within a function so that they don’t end up in the global scope. The correct way to do the above would be more akin to this:

create_db_connection <- function() {
    DBI::dbConnect(odbc::odbc(), UID = Sys.getenv("dbuser"), PWD = Sys.getenv("dbpass")) %>% return()
}

Concluding remarks

Structuring R projects is an art, not just a science. Many best practices are highly domain-specific, and learning these generally happens by trial and pratfall error. In many ways, it’s the bellwether of an R developer’s skill trajectory, because it shows whether they possess the tenacity and endurance it takes to do meticulous, fine and often rather boring work in pursuance of future success – or at the very least, an easier time debugging things in the future. Studies show that one of the greatest predictors of success in life is being able to tolerate deferred gratification, and structuring R projects is a pure exercise in that discipline.

[su_pullquote align=”right”]Structuring R projects is an art, not just a science. Many best practices are highly domain-specific, and learning these generally happens by trial and error.[/su_pullquote]At the same time, a well-executed structure can save valuable developer time, prevent errors and allow data scientists to focus on the data rather than debugging and trying to find where that damn snippet of code is or scratching their head trying to figure out what a particularly obscurely named function does. What might feel like an utter waste of time has enormous potential to create value, both for the individual, the team and the organisation.

[su_pullquote]As long as you keep in mind why structure matters and what its ultimate aims are, you will arrive at a form of order out of chaos that will be productive, collaborative and useful.[/su_pullquote]I’m sure there are many aspects of structuring R projects that I have omitted or ignored – in many ways, it is my own experiences that inform and motivate these commentaries on R. Some of these observations are echoed by many authors, others diverge greatly from what’s commonly held wisdom. As with all concepts in development, I encourage you to read widely, get to know as many different ideas about structuring R projects as possible, and synthesise your own style. As long as you keep in mind why structure matters and what its ultimate aims are, you will arrive at a form of order out of chaos that will be productive, collaborative and mutually useful not just for your own development but others’ work as well.

My last commentary on defensive programming in R has spawned a vivid and exciting debate on Reddit, and many have made extremely insightful comments there. I’m deeply grateful for all who have contributed there. I hope you will also consider posting your observations in the comment section below. That way, comments will remain together with the original content.

References   [ + ]

1.As in, Adam Smith.
2.It took me years to figure out why. It turns out that I have ZF alpha-1 antitrypsin deficiency. As a consequence, even minimal exposure to small particulates and dust can set off violent coughing attacks and impair breathing for days. Symptoms tend to be worse in hot weather due to impaired connective tissue something-or-other.
3.That’s a joke. I don’t gut interns – they’re valuable resources, HR shuns dismembering your coworkers, it creates paperwork and I liked every intern I’ve ever worked with – but most importantly, once gutted like a fish, they are not going to learn anything new. I prefer gentle, structured discussions on the benefits of good package structure. Please respect your interns – they are the next generation, and you are probably one of their first example of what software development/data science leadership looks like. The waves you set into motion will ripple through generations, well after you’re gone. You better set a good example.
4.Such a folder is often referred to as a ‘dropbox’, and the typical corresponding octal setting, 0422, guarantees that the R user will not accidentally overwrite data.
5.The organisation consented to me telling this story but requested anonymity, a request I honour whenever legally possible.
6.In case you’re unfamiliar with AWS: it’s a cloud service where elastic computing instances (EC2 instances) reside in ‘regions’, e.g. us-west-1a. There are (small but nonzero) charges for data transfer between regions. If you’re in one region but you configure the yum repo server of another region as your default, there will be costs, and, eventually, tears – provision ten instances with a few GBs worth of downloads, and there’ll be yelling. This is now more or less impossible to do except on purpose, but one must never underestimate what users are capable of from time to time!
7.Or so I’m told.
8.Linux users will need libsecret 0.16 or above, and sodium.
9.XML is acceptable if you’re threatened with waterboarding.

The Ten Rules of Defensive Programming in R

[su_pullquote align=”right”]Where R code is integrated into a pipeline, runs autonomously or is embedded into a larger analytical solution, writing code that fails well is going to be crucial.[/su_pullquote]The topic of defensive programming in R is, admittedly, a little unusual. R, while fun and powerful, is not going to run defibrillators, nuclear power plants or spacecraft. In fact, much – if not most! – R code is actually executed interactively, where small glitches don’t really matter. But where R code is integrated into a pipeline, runs autonomously or is embedded into a larger analytical solution, writing code that fails well is going to be crucial. So below, I have collected my top ten principles of defensive programming in R. I have done so with an eye to users who do not come from the life critical systems community and might not have encountered defensive programming before, so some of these rules apply to all languages.

What is defensive programming?

The idea of defensive programming is not to write code that never fails. That’s an impossible aspiration. Rather, the fundamental idea is to write code that fails well. To me, ‘failing well’ means five things:

  1. Fail fast: your code should ensure all criteria are met before they embark upon operations, especially if those are computationally expensive or might irreversibly affect data.
  2. Fail safe: where there is a failure, your code should ensure that it relinquishes all locks and does not acquire any new ones, not write files, and so on.
  3. Fail conspicuously: when something is broken, it should return a very clear error message, and give as much information as possible to help unbreak it.
  4. Fail appropriately: failure should have appropriate effects. For every developer, it’s a judgment call to ensure whether a particular issue would be a a debug/info item, a warning or an error (which by definition means halting execution). Failures should be handled appropriately.
  5. Fail creatively: not everything needs to be a failure. It is perfectly legitimate to handle problems. One example is repeating a HTTP request that has timed out: there’s no need to immediately error out, because quite frankly, that sort of stuff happens. Equally, it’s legitimate to look for a parameter, then check for a configuration file if none was provided, and finally try checking the arguments with which the code was invoked before raising an error.1

And so, without further ado, here are the ten ways I implement these in my day-to-day R coding practice – and I encourage you to do so yourself. You will thank yourself for it.

The Ten Commandments of Defensive Programming in R

  1. Document your code.
  2. In God we trust, everyone else we verify.
  3. Keep functions short and sweet.
  4. Refer to external functions explicitly.
  5. Don’t use require() to import packages into the namespace.
  6. Aggressively manage package and version dependencies.
  7. Use a consistent style and automated code quality tools.
  8. Everything is a package.
  9. Power in names.
  10. Know the rules and their rationale, so that you know when to break them.

ONE: Document your code.

It’s a little surprising to even see this – I mean, shouldn’t you do this stuff anyway? Yes, you should, except some people think that because so much of R is done in the interpreter anyway, rules do not apply to them. Wrong! They very much do.

A few months ago, I saw some code written by a mentee of mine. It was infinitely long – over 250 standard lines! –, had half a dozen required arguments and did everything under the sun. This is, of course, as we’ll discuss later, a bad idea, but let’s put that aside. The problem is, I had no idea what the function was doing! After about half an hour of diligent row-by-row analysis, I figured it out, but that could have been half an hour spent doing something more enjoyable, such as a root canal without anaesthetic while listening to Nickelback. My friend could have saved me quite some hair-tearing by quite simply documenting his code. In R, the standard for documenting the code is called roxygen2, it’s got a great parser that outputs the beautiful LaTeX docs you probably (hopefully!) have encountered when looking up a package’s documentation, and it’s described in quite a bit of detail by Hadley Wickham. What more could you wish for?

Oh. An example. Yeah, that’d be useful. We’ll be documenting a fairly simple function, which calculates the Hamming distance between two strings of equal length, and throws something unpleasant in our face if they are not. Quick recap: the Hamming distance is the number of characters that do not match among two strings. Or mathematically put,

H(s, t) = \sum_{k=1}^{\mathcal{l}(s)} I(s_k, t_k) \mid \mathcal{l}(s) = \mathcal{l}(t)

where \mathcal{l}() is the length function and D(p, q) is the dissimilarity function, which returns 1 if two letters are not identical and 0 otherwise.

So, our function would look like this:

hamming <- function(s1, s2) {
  s1 <- strsplit(s1, "")[[1]]
  s2 <- strsplit(s2, "")[[1]]
  
  return(sum(s1 != s2))
}

Not bad, and pretty evident to a seasoned R user, but it would still be a good idea to point out a thing or two. One of these would be that the result of this code will be inaccurate (technically) if the two strings are of different lengths (we could, and will, test for that, but that’s for a later date). The Hamming distance is defined only for equal-length strings, and so it would be good if the user knew what they have to do – and what they’re going to get. Upon pressing Ctrl/Cmd+Shift+Alt+R, RStudio helpfully whips us up a nice little roxygen2 skeleton:

#' Title
#'
#' @param s1 
#' @param s2 
#'
#' @return
#' @export
#'
#' @examples
hamming <- function(s1, s2) {
  s1 <- strsplit(s1, "")[[1]]
  s2 <- strsplit(s2, "")[[1]]
  
  return(sum(s1 != s2))
}

So, let’s populate it! Most of the fields are fairly self-explanatory. roxygen2, unlike JavaDoc or RST-based Python documentation, does not require formal specification of types – it’s all free text. Also, since it will be parsed into LaTeX someday, you can go wild. A few things deserve mention.

  • You can document multiple parameters. Since s1 and s2 are both going to be strings, you can simply write @param s1,s2 The strings to be compared.
  • Use \code{...} to typeset something as fixed-width.
  • To create links in the documentation, you can use \url{https://chrisvoncsefalvay.com} to link to a URL, \code{\link{someotherfunction}} to refer to the function someotherfunction in the same package, and \code{\link[adifferentpackage]{someotherfunction}} to refer to the function someotherfunction in the adifferentpackage package. Where your function has necessary dependencies outside the current script or the base packages, it is prudent to note them here.
  • You can use \seealso{} to refer to other links or other functions, in this package or another, worth checking out.
  • Anything you put under the examples will be executed as part of testing and building the documentation. If your intention is to give an idea of what the code looks like in practice, and you don’t want the result or even the side effects, you can surround your example with a \dontrun{...} environment.
  • You can draw examples from a file. In this case, you use @example instead of @examples, and specify the path, relative to the file in which the documentation is, to the script: @example docs/examples/hamming.R would be such a directive.
  • What’s that @export thing at the end? Quite simply, it tells roxygen2 to export the function to the NAMESPACE file, making it accessible for reference by other documentation files (that’s how when you use \code{\link[somepackage]{thingamabob}}, roxygen2 knows which package to link to.

With that in mind, here’s what a decent documentation to our Hamming distance function would look like that would pass muster from a defensive programming perspective:

#' Hamming distance
#'
#' Calculates the Hamming distance between two strings of equal length.
#'
#' @param s1 
#' @param s2 
#'
#' @return The Hamming distance between the two strings \code{s1} and \code{s2}, provided as an integer.
#'
#' @section Warning:
#' 
#' For a Hamming distance calculation, the input strings must be of equal length. This code does NOT reject input strings of different lengths. 
#'
#' @examples
#' hamming("AAGAGTGTCGGCATACGTGTA", "AAGAGCGTCGGCATACGTGTA")  
#'  
#' @export
hamming <- function(s1, s2) {
  s1 <- strsplit(s1, "")[[1]]
  s2 <- strsplit(s2, "")[[1]]
  
  return(sum(s1 != s2))
}

The .Rd file generated from the hamming() function’s roxygen2 docstring: an intermediary format, resembling LaTeX, from which R can build a multitude of documentation outputsThis little example shows all that a good documentation does: it provides what to supply the function with and in what type, it provides what it will spit out and in what format, and adequately warns of what is not being checked. It’s always better to check input types, but warnings go a long way.2 From this file, R generates an .Rd file, which is basically a LaTeX file that it can parse into various forms of documentation (see left.) In the end, it yields the documentation below, with the adequate warning – a win for defensive programming!

The finished documentation, rendered as it would be in an online documentation pane in RStudio

TWO: In God we trust, everyone else we verify.

In the above example, we have taken the user at face value: we assumed that his inputs will be of equal length, and we assumed they will be strings. But because this is a post on defensive programming, we are going to be suspicious, and not trust our user. So let’s make sure we fail early and check what the user supplies us with. In many programming languages, you would be using various assertions (e.g. the assert keyword in Python), but all R has, as far as built-ins are concerned, is stopifnot(). stopifnot() does as the name suggests: if the condition is not met, execution stops with an error message. However, on the whole, it’s fairly clunky, and it most definitely should not be used to check for user inputs. For that, there are three tactics worth considering.

  1. assertthat is a package by Hadley Wickham (who else!), which implements a range of assert clauses. Most importantly, unlike stopifnot(), assertthat::assert_that() does a decent job at trying to interpret the error message. Consider our previous Hamming distance example: instead of gracelessly falling on its face, a test using
assert_that(length(s1) == length(s2))
  1. would politely inform us that s1 not equal to s2. That’s worth it for the borderline Canadian politeness alone.
  2. Consider the severity of the user input failure. Can it be worked around? For instance, a function requiring an integer may, if it is given a float, try to coerce it to a float. If you opt for this solution, make sure that you 1) always issue a warning, and 2) always allow the user to specify to run the function in ‘strict’ mode (typically by setting the strict parameter to TRUE), which will raise a fatal error rather than try to logic its way out of this pickle.
  3. Finally, make sure that it’s your code that fails, not the system code. Users should have relatively little insight into the internals of the system. And so, if at some point there’ll be a division by an argument foo, you should test whether foo == 0 at the outset and inform the user that foo cannot be zero. By the time the division operation is performed, the variable might have been renamed baz, and the user will not get much actionable intelligence out of the fact that ‘division by zero’ has occurred at some point, and baz was the culprit. Just test early for known incompatibilities, and stop further execution. The same goes, of course, for potentially malicious code.

In general, your code should be strict as to what it accepts, and you should not be afraid to reject anything that doesn’t look like what you’re looking for. Consider for this not only types but also values, e.g. if the value provided for a timeout in minutes is somewhere north of the lifetime of the universe, you should politely reject such an argument – with a good explanation, of course.

Update: After posting this on Reddit, u/BillWeld pointed out a great idiom for checking user inputs that’s most definitely worth reposting here:

f <- function(a, b, c)
{
    if (getOption("warn") > 0) {
        stopifnot(
            is.numeric(a),
            is.vector(a),
            length(a) == 1,
            is.finite(a),
            a > 0,
            is.character(b),
            # Other requirements go here
            )
    }

    # The main body of the function goes here
}

I find this a great and elegant idiom, although it is your call, as the programmer, to decide which deviations and what degree of incompatibility should cause the function to fail as opposed to merely emit a warning.

THREE: Keep functions short and sweet.

Rule #4 of the Power of Ten states

No function should be longer than what can be printed on a single sheet of paper in a standard reference format with one line per statement and one line per declaration. Typically, this means no more than about 60 lines of code per function.

Rationale: Each function should be a logical unit in the code that is understandable and verifiable as a unit. It is much harder to understand a logical unit that spans multiple screens on a computer display or multiple pages when printed. Excessively long functions are often a sign of poorly structured code.

In practice, with larger screen sizes and higher resolutions, much more than a measly hundred lines fit on a single screen. However, since many users view R code in a quarter-screen window in RStudio, an appropriate figure would be about 60-80 lines. Note that this does not include comments and whitespaces, nor does it penalise indentation styles (functions, conditionals, etc.).

Functions should represent a logical unity. Therefore, if a function needs to be split for compliance with this rule, you should do so in a manner that creates logical units. Typically, one good way is to split functions by the object they act on.

FOUR: Refer to external functions explicitly.

In R, there are two ways to invoke a function, yet most people don’t tend to be aware of this. Even in many textbooks, the library(package) function is treated as quintessentially analogous to, say, import in Python. This is a fundamental misunderstanding.

In R, packages do not need to be imported in order to be able to invoke their functions, and that’s not what the library() function does anyway. library() attaches a package to the current namespace.

What does this mean? Consider the following example. The foreign package is one of my favourite packages. In my day-to-day practice, I get data from all sorts of environments, and foreign helps me import them. One of its functions, read.epiinfo(), is particularly useful as it imports data from CDC’s free EpiInfo toolkit. Assuming that foreign is in a library accessible to my instance of R,4, I can invoke the read.epiinfo() function in two ways:

  • I can directly invoke the function using its canonical name, of the form package::function() – in this case, foreign::read.epiinfo().
  • Alternatively, I can attach the entire foreign package to the namespace of the current session using library(foreign). This has three effects, of which the first tends to be well-known, the second less so and the third altogether ignored.
    1. Functions in foreign will be directly available. Regardless of the fact that it came from a different package, you will be able to invoke it the same way you invoke, say, a function defined in the same script, by simply calling read.epiinfo().
    2. If there was a package of identical name to any function in foreign, that function will be ‘shadowed’, i.e. removed from the namespace. The namespace will always refer to the most recent function, and the older function will only be available by explicit invocation.
    3. When you invoke a function from the namespace, it will not be perfectly clear from a mere reading of the code what the function actually is, or where it comes from. Rather, the user or maintainer will have to guess what a given name will represent in the namespace at the time the code is running the particular line.

Controversially, my suggestion is

  • to eschew the use of library() altogether, and
  • write always explicitly invoke functions outside those functions that are in the namespace at startup.

This is not common advice, and many will disagree. That’s fine. Not all code needs to be safety-critical, and importing ggplot2 with library() for a simple plotting script is fine. But where you want code that’s easy to analyse, easy to read and can be reliably analysed as well, you want explicit invocations. Explicit invocations give you three main benefits:

  1. You will always know what code will be executed. filter may mean dplyr::filter, stats::filter, and so on, whereas specifically invoking dplyr::filter is unambiguous. You know what the code will be (simply invoking dplyr::filter without braces or arguments returns the source), and you know what that code is going to do.
  2. Your code will be more predictable. When someone – or something – analyses your code, they will not have to spend so much time trying to guess what at the time a particular identifier refers to within the namespace.
  3. There is no risk that as a ‘side effect’ various other functions you seek to rely on will be removed from the namespace. In interactive coding, R usually warns you and lists all shadowed functions upon importing functions with the same name into the namespace using library(), but for code intended to be bulk executed, this issue has caused a lot of headache.

Obviously, all of this applies to require() as well, although on the whole the latter should not be applied in general.

FIVE: Don’t use require() to import a package into the namespace.

Even seasoned R users sometimes forget the difference between library() and require(). The difference is quite simple: while both functions attempt to attach the package argument to the namespace,

  • require() returns FALSE if the import failed, while
  • library() simply loads the package and raises an error if the import failed.

Just about the only legitimate use for require() is writing an attach-or-install function. In any other case, as Yihui Xie points out, require() is almost definitely the wrong function to use.

SIX: Aggressively manage package and version dependencies.

Packrat is one of those packages that have changed what R is like – for the better. Packrat gives every project a package library, akin to a private /lib/ folder. This is not the place to document the sheer awesomeness of Packrat – you can do so yourself by doing the walkthrough of Packrat. But seriously, use it. Your coworkers will love you for it.

Equally important is to make sure that you specify the version of R that your code is written against. This is best accomplished on a higher level of configuration, however.

SEVEN: Use a consistent style and automated code quality tools.

This should be obvious – we’re programmers, which means we’re constitutionally lazy. If it can be solved by code faster than manually, then code it is! Two tools help you in this are lintr and styler.

  • lintr is an amazingly widely supported (from RStudio through vim to Sublime Text 3, I hear a version for microwave ovens is in the works!) linter for R code. Linters improve code quality primarily by enforcing good coding practices rather than good style. One big perk of lintr is that it can be injected rather easily into the Travis CI workflow, which is a big deal for those who maintain multi-contributor projects and use Travis to keep the cats appropriately herded.
  • styler was initially designed to help code adhere to the Tidyverse Style Guide, which in my humble opinion is one of the best style guides that have ever existed for R. It can now take any custom style files and reformat your code, either as a function or straight from an RStudio add-in.

So use them.

EIGHT: Everything is a package.

Whether you’re writing R code for fun, profit, research or the creation of shareholder value, your coworkers and your clients – rightly! – expect a coherent piece of work product that has everything in one neat package, preferably version controlled. Sending around single R scripts might have been appropriate at some point in the mid-1990s, but it isn’t anymore. And so, your work product should always be structured like a package. As a minimum, this should include:

  1. A DESCRIPTION and NAMESPACE file.
  2. The source code, including comments.
  3. Where appropriate, data mappings and other ancillary data to implement the code. These go normally into the data/ folder. Where these are large, such as massive shape files, you might consider using Git LFS.
  4. Dependencies, preferably in a packrat repo.
  5. The documentation, helping users to understand the code and in particular, if the code is to be part of a pipeline, explaining how to interact with the API it exposes.
  6. Where the work product is an analysis rather than a bit of code intended to carry out a task, the analysis as vignettes.

To understand the notion of analyses as packages, two outstanding posts by Robert M. Flight are worth reading: part 1 explains the ‘why’ and part 2 explains the ‘how’. Robert’s work is getting a little long in the tooth, and packages like knitr have taken the place of vignettes as analytical outputs, but the principles remain the same. Inasmuch as it is possible, an analysis in R should be a self-contained package, with all the dependencies and data either linked or included. From the perspective of the user, all that should be left for them to do is to execute the analysis.

NINE: Power in names.

There are only two hard things in Computer Science: cache invalidation and naming things.

Phil Karlton

In general, R has a fair few idiosyncrasies in naming things. For starters, dots/periods . are perfectly permitted in variable names (and thus function names), when in most languages, the dot operator is a binary operator retrieving the first operand object’s method called the second operand:

a.b(args) = dot(a, b, args) = a_{Method: b}(args)

For instance, in Python, wallet.pay(arg1, arg2) means ‘invoke the method pay of the object wallet with the arguments arg1 and arg2‘. In R, on the other hand, it’s a character like any other, and therefore there is no special meaning attached to it – you can even have a variable contaning multiple dots, or in fact a variable whose name consists entirely of dots5 – in R, .......... is a perfectly valid variable name. It is also a typcal example of the fact that justibecause you can do something doesn’t mean you also should do so.

A few straightforward rules for variable names in R are worth abiding by:

  1. Above all, be consistent. That’s more important than whatever you choose.
  2. Some style guides, including Google’s R style guide, treat variables, functions and constants as different entities in respect of naming. This is, in my not-so-humble opinion, a blatant misunderstanding of the fact that functions are variables of the type function, and not some distinct breed of animal. Therefore, I recommend using a unitary schema for all variables, callable or not.
  3. In order of my preference, the following are legitimate options for naming:
    • Underscore separated: average_speed
    • Dot separated: average.speed
    • JavaScript style lower-case CamelCase: averageSpeed
  4. Things that don’t belong into identifiers: hyphens, non-alphanumeric characters, emojis (🤦🏼‍♂️) and other horrors.
  5. Where identifiers are hierarchical, it is better to start representing them as hierarchical objects rather than assigning them to different variables. For example, instead of monthly_forecast_january, monthly_forecast_february and so on, it is better to have a list associative array called forecasts in which the forecasts are keyed by month name, and can then be retrieved using the $key or the [key] accessors. If your naming has half a dozen components, maybe it’s time to think about structuring your data better.

Finally, in some cases, the same data may be represented by multiple formats – for instance, data about productivity is first imported as a text file, and then converted into a data frame. In such cases, Hungarian notation may be legitimate, e.g. txt_productivity or productivity.txt vs df_productivity or productivity.df. This is more or less the only case in which Hungarian notation is appropriate.6

And while we’re at variables: never, ever, ever use = to assign. Every time you do it, a kitten dies of sadness.

For file naming, some sensible rules have served me well, and will be hopefully equally useful for you:

  1. File names should be descriptive, but no longer than 63 characters.
  2. File names should be all lower case, separated by underscores, and end in .R. That’s a capital R. Not a lower-case R. EveR.
  3. Where there is a ‘head’ script that sequentially invokes (sources) a number of subsidiary scripts, it is common for the head script to be called 00-.R, and the rest given a sequential corresponding to the order in which they are sourced and a descriptive name, e.g. 01-load_data_from_db.R, 02-transform_data_and_anonymise_records.R and so on.
  4. Where there is a core script, but it does not invoke other files sequentially, it is common for the core script to be called 00-base.R or main.R. As long as it’s somewhere made clear to the user which file to execute, all is fair.
  5. The injunction against emojis and other nonsense holds for file names, too.

TEN: Know the rules and their rationale, so that you know when to break them.

It’s important to understand why style rules and defensive programming principles exist. How else would we know which rules we can break, and when?

The reality is that there are no rules, in any field, that do not ever permit of exceptions. And defensive programming rules are no exception. Rules are tools that help us get our work done better and more reliably, not some abstract holy scripture. With that in mind, when can you ignore these rules?

  • You can, of course, always ignore these rules if you’re working on your own, and most of your work is interactive. You’re going to screw yourself over, but that’s your right and privilege.
  • Adhering to common standards is more important than doing what some dude on the internet (i.e. my good self) thinks is good R coding practice. Coherence and consistency are crucial, and you’ll have to stick to your team’s style over your own ideas. You can propose to change those rules, you can suggest that they be altogether redrafted, and link them this page. But don’t go out and start following a style of your own just because you think it’s better (even if you’re right).
  • It’s always a good idea to appoint a suitable individual – with lots of experience and little ego, ideally! – as code quality standards coordinator (CQSC). They will then centrally coordinate adherence to standards, train on defensive coding practices, review operational adherence, manage tools and configurations and onboard new people.
  • Equally, having an ‘editor settings repo’ is pretty useful. This should support, at the very least, RStudio lintr and styler.
  • Some prefer to have the style guide as a GitHub or Confluence wiki – I generally advise against that, as that cannot be tracked and versioned as well as, say, a bunch of RST files that are collated together using Sphinx, or some RMarkdown files that are rendered automatically upon update using a GitHub webhook.

Conclusion

[su_pullquote align=”right”]Always code as if your life, or that of your loved ones, depended on the code you write – because it very well may someday.[/su_pullquote]In the end, defensive programming may not be critical for you at all. You may never need to use it, and even if you do, the chances that as an R programmer your code will have to live up to defensive programming rules and requirements is much lower than, say, for an embedded programmer. But algorithms, including those written in R, create an increasing amount of data that is used to support major decisions. What your code spits out may decide someone’s career, whether someone can get a loan, whether they can be insured or whether their health insurance will be dropped. It may even decide a whole company’s fate. This is an inescapable consequence of the algorithmic world we now live in.

A few years ago, at a seminar on coding to FDA CDRH standards, the instructor finished by giving us his overriding rule: in the end, always code as if your life, or that of your loved ones, depended on the code you write (because it very well may someday!). This may sound dramatic in the context of R, which is primarily a statistical programming language, but the point stands: code has real-life consequences, and we owe it to those whose lives, careers or livelihoods depend on our code to give them the kind of code that we wish to rely on: well-tested, reliable and stable.

References   [ + ]

1.One crucial element here: keep in mind the order of parameters – because the explicit should override the implied, your code should look at the arguments first, then the environment variables, then the configuration file!
2.Eagle-eyed readers might spot the mutation in the example. This is indeed the T310C mutation on the TNFRSF13B gene (17p11.2), which causes autosomal dominant type 2 Common Variable Immune Deficiency, a condition I have, but manage through regular IVIG, prophylactic antibiotics and aggressive monitoring.
3.Holzmann, G.J. (2006). The Power of 10: Rules for developing safety-critical code. IEEE Computer 39(6):95-99.
4.R stores packages in libraries, which are basically folders where the packages reside. There are global packages as well as user-specific packages. For a useful overview of packages and package installation strategies, read this summary by Nathan Stephens.
5.The exception is ..., which is not a valid variable name as it collides with the splat operator.
6.In fact, the necessity of even this case is quite arguable. A much better practice would be simply importing the text as productivity, then transform it into a data frame, if there’s no need for both versions to continue coexisting.

Installing RStudio Server on Debian 9

Oh, wouldn’t it be just wonderful if you could have your own RStudio installation on a server that you could then access from whatever device you currently have, including an iPad? It totally would. Except it’s some times far from straightforward. Here’s how to do it relatively painlessly.

Step 1: Get a server

Choose a suitable (and affordable) server, pick a location near you in the drop-down menu on the bottom, and press Add this Linode! to set up your first Linode.

I use Linode,1 and in general, their 4096 server is pretty good. Linodes can be very easily resized, so this should not be a worry.

Step 2: Set up the server

Once your Linode is up, it will turn up on your dashboard with a random name (linode1234567, typically). If you click on it, you will see your Linode is ‘Brand New’, which means you need to configure it. I usually keep them in groups depending on purpose: blog servers, various processing servers, hosts, research servers. Each of them then gets a name. Choose whatever nomenclature fits your needs best.

Give your Linode a name you can recognise it by, and assign it to a category.

Next, click on the Rebuild tab, and configure the root password, the operating system (we’ll be using Debian 9), the swap disk size (for R, it’s a good idea to set this as large as you can) and finally, set the deployment disk size. I usually set that for 66%.

Step 3: Install R

The Remote Access field shows the SSH access command, complete with the root user, as well as public IPs, default gateways and other networking stuff.

SSH into your Linode and log in as root. You will find your Linode’s IP address and other interesting factoids about it under the Remote access tab. I have obscured some information as I don’t want you scallywags messing around with my server, but the IP address is displayed both in the SSH link (the one that goes ssh [email protected] or something along these lines) and below under public IPs.

Using vim, open /etc/apt/sources.list and add the following source:

# CRAN server for Debian stretch (R and related stuff)
deb http://cran.rstudio.com/bin/linux/debian stretch-cran34/

Save the file, and next install dirmngr (sudo apt-get install dirmngr). Then, add the requisite GPG key for CRAN, update the repository and install r-base:

sudo apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'

sudo apt update
sudo apt install r-base

At this point, you can enter R to test if your R installation works, and try to install a package, like ggplot2. It should work, but may ask you to select an installation server. If all is well, this is what you should be getting:

[email protected]:~# R

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

In the prompt, enter install.packages("ggplot2"). Select a server if requested. Otherwise, watch ggplot2 (and a bazillion other packages) install. Then, quit R by calling q().

Step 4: Downgrade libssl

For some inscrutable reason, RStudio is currently set up to work with libssl1.0.0, whereas Debian 9 comes with libssl1.1.0 out of the box. Clearly that’s not going to work, so we’ll have to roll back our libssl. We’ll do so by creating a file called /etc/apt/sources.list.d/jessie.list using vim. And we’ll fill it with the following:

deb http://httpredir.debian.org/debian jessie main contrib non-free
deb-src http://httpredir.debian.org/debian jessie main contrib non-free

deb http://security.debian.org/ jessie/updates main contrib non-free
deb-src http://security.debian.org/ jessie/updates main contrib non-free

In case you’re curious: this creates a sources list called jessie, and allows you to draw from Debian Jessie. So let’s have apt get with the program (sudo apt update), and install libssl1.0.0 (sudo apt install libssl1.0.0). Ift may be prudent to also install the openssl tool corresponding to the libssl version (sudo apt install openssl/jessie).

Step 5: Time to install RStudio Server!

First, install gDebi, a package installer, by typing sudo apt-get install gdebi. Next, we’ll be grabbing the latest RStudio version using wget, and installing it using gDebi. Make sure you either do this in your home directory or in /tmp/, ideally. Note that the versions of RStudio tend to change – 1.1.442 was the current version, released 12 March 2018, at the time of writing this post, but may by now have changed. You can check the current version number here.

wget https://download2.rstudio.org/rstudio-server-1.1.442-amd64.deb
sudo gdebi rstudio-server-1.1.442-amd64.deb

If all is well and you said yes to the dress question about whether you actually want to install RStudio Server (no, you’ve been going through this whole pain in the rear for excrement and jocularity, duh) , you should see something like this:

(Reading database ... 55651 files and directories currently installed.)
Preparing to unpack rstudio-server-1.1.442-amd64.deb ...
Unpacking rstudio-server (1.1.442) ...
Setting up rstudio-server (1.1.442) ...
groupadd: group 'rstudio-server' already exists
rsession: no process found
Created symlink /etc/systemd/system/multi-user.target.wants/rstudio-server.service → /etc/systemd/system/rstudio-server.service.
● rstudio-server.service - RStudio Server
   Loaded: loaded (/etc/systemd/system/rstudio-server.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2018-03-31 21:59:25 UTC; 1s ago
  Process: 16478 ExecStart=/usr/lib/rstudio-server/bin/rserver (code=exited, status=0/SUCCESS)
 Main PID: 16479 (rserver)
    Tasks: 3 (limit: 4915)
   CGroup: /system.slice/rstudio-server.service
           └─16479 /usr/lib/rstudio-server/bin/rserver

Mar 31 21:59:25 localhost systemd[1]: Starting RStudio Server...
Mar 31 21:59:25 localhost systemd[1]: Started RStudio Server.

Now, if all goes well, you can navigate to your server’s IP at port 8787, and you should behold something akin to this:

The login interface. Technically, you could simply log in with your root password and root as the username. But just don’t yet.

There are a few more things you wish to install at this point – these are libraries that will help with SSL and other functionality from within R. Head back to the terminal and install the following:

sudo apt-get install libssl-dev libcurl4-openssl-dev libssh2-1-dev

Step 6: Some finishing touches

The login screen will draw on PAM, Unix’s own authentication module, in lieu of a user manager. As such, to create access, you will have to create a new Unix user with adduser, and assign a password to it. This will grant it a directory of its own, and you the ability to fine-tune what they should, and what they shouldn’t, have access to. Win-win! This will allow you to share a single installation among a range of people.

Step 7: To reverse proxy, or to not reverse proxy?

There are diverging opinions as to whether a reverse proxy such as NGINX carries any benefit. In my understanding, they do not, and there’s a not entirely unpleasant degree of security by obscurity in having a hard to guess port (which, by the way, you can change). It also makes uploads, on which you will probably rely quite a bit, more difficult. On the whole, there are more arguments against than in favour of a reverse proxy, but I may add specific guidance on reverse proxying here if there’s interest.

References   [ + ]

1.Using this link gives me a referral bonus of $20 as long as you remain a customer for 90 days. If you do not wish to do so, please use this link. It costs the same either way, and I would be using Linode anyway as their service is superbly reliable.

cookiecutter-flask-ask: a quick(er) start to Alexa skills!

If you develop for Amazon’s Alexa-powered devices, you must at some point have come across Flask-Ask, a project by John Wheeler that lets you quickly and easily build Python-based Skills for Alexa. It’s so easy, in fact, that John’s quickstart video, showing the creation of a Flask-Ask based Skill from zero to hero, takes less than five minutes! How awesome is that? Very awesome.

Bootstrapping a Flask-Ask project is not difficult – in fact, it’s pretty easy, but also pretty repetitive. And so, being the ingenious lazy developer I am, I’ve come up with a (somewhat opinionated) cookiecutter template for Flask-Ask.

Usage

Using the Flask-Ask cookiecutter should be trivial.  Make sure you have cookiecutter installed, either in a virtualenv that you have activated or your system installation of Python. Then, simply use  cookiecutter gh:chrisvoncsefalvay/cookiecutter-flask-ask to get started. Answer the friendly assistant’s questions, and voila! You have the basics of a Flask-Ask project all scaffolded.

Once you have scaffolded your project, you will have to create a virtualenv for your project and install dependencies by invoking pip install -r requirements.txt. You will also need ngrok to test your skill from your local device.

What’s in the box?

The cookiecutter has been configured with my Flask-Ask development preferences in mind, which in turn borrow heavily from John Wheeler‘s. The cookiecutter provides a scaffold of a Flask application, including not only session start handlers and an example intention but also a number of handlers for built-in Alexa intents, such as Yes, No and Help.

There is also a folder structure you might find useful, including an intent schema for some basic Amazon intents and a corresponding empty sample_utterances.txt file, as well as a gitkeep’d folder for custom slot types. Because I’m a huge fan of Sphinx documentation and strongly believe that voice apps need to be assiduously documented to live up to their potential, there is also a docs/ folder with a Makefile and an opinionated conf.py configuration file.

Is that all?!

Blissfully, yes, it is. Thanks to John’s extremely efficient and easy-to-use Flask-Ask project, you can discourse with your very own skill less than twenty minutes after starting the scaffolding!

You can find the cookiecutter-flask-ask project here. Issues, bugs and other woes are welcome, as are contributions (simply raise a pull request). For help and advice, you can find me on the Flask-Ask Gitter a lot during daytime CET.

cookiecutter-flask-ask is, of course, Swabware.