Are you looking for a data science sensei?

Maybe you’re a junior data scientist, maybe you’re a software developer who wants to go into data science, or perhaps you’ve dabbled in data for years in Excel but are ready to take the next step.

If so, this post is all about you, and an opportunity I offer every year.

You see, life has been very good to me in terms of training as a data scientist. I have been spoiled, really – I had the chance to learn from some of the best data scientists, work with some exceptional epidemiologists, experience some unusual challenges and face many of the day-to-day hurdles of working in data analytics. I’ve had the fortune to see this profession in all its contexts, from small enterprises to multi-million dollar FTSE100 companies, from well-run agile start-ups to large and sometimes pretty slow dinosaurs, from government through the private sector to NGOs: I’ve seen it all. I’ve done some great things. And I’ve made some superbly dumb mistakes.

And so, at the start of every year, I open applications for young, start-of-career data scientists looking for their Mr. Miyagi. Don’t worry: no car waxing involved. I will choose a single promising young data scientist and pass on as much of my so-called wisdom as I can. At the end, your skills will shine like Mr. Miyagi’s 1947 Ford Deluxe Convertible. There’s no catch, no hidden trap, no fees or charges involved (except the one mentioned below).

Eligibility criteria

To be eligible, you must:

  • Be 18 or above if you are taking a gap year or not attending a university/college.
  • Have completed any formal degree you may be pursuing. You do not need a degree in data science or a relevant subject at all – if you’re in your 3rd year of an English Lit degree, you’re welcome to apply – but if you’re in the middle of your CS degree, you’ll have to wait until you’ve finished, sorry. The same goes if you intend to go straight on to a data science-related postgrad within the year.
  • Have a solid basis in mathematics: decent statistics, combinatorics, linear algebra and some high school calculus are the very minimum.
  • Be familiar with Python (3.5 and above), and either familiar with the scientific Python stack (SciPy, NumPy, Pandas, matplotlib) or willing to pick up a lot on the go.
  • Be willing to put in the work: we’ll be convening about once every week to ten days by Skype for an hour, and you’ll probably be doing 6-10 hours’ worth of reading and work for the rest of the week. Please be realistic about whether you can sustain this.
  • Be able to cover the costs of working on an AWS EC2 instance, as recommended – in practice, these are negligible.
  • Understand that this is a physically and intellectually strenuous endeavour, and that it is your responsibility to know whether you’re physically and mentally up for it. No physical or mental disability, however, automatically excludes you from consideration.
  • Not live in, reside in or be a citizen of any of the countries listed in CFR Title 22 Part 126, §126.1(d)(1) and (2).
  • Not have been convicted of a felony anywhere. This includes ‘spent’ UK criminal convictions.

Sounds good? Apply here.

Preferred applicants

When assessing applications, the following groups are given preference:

  • Persons with mental or physical disabilities whose disability precludes them from finding conventional employment – please outline this situation on the application form.
  • Honourably discharged (or equivalent) veterans of NATO forces and the IDF – please include member 4 copy of DD-214, Wehrdienstzeitbescheinigung or equivalent document that lists type of discharge.

What we’ll be up to

Don’t worry. None of this car waxing crap.

Over the 42 weeks to follow, you will be undergoing a rigorous and structured semi-self-directed training process. This will take your background, interests and future ambitions into account, but at the core, you will:

  • master Python’s data processing stack,
  • learn how to visualize data in Python,
  • work with networks and graph databases, including Neo4j,
  • learn how to present data science results to stakeholders effectively,
  • delve into cutting-edge methods of machine learning, such as deep learning using keras,
  • work on problems in computer vision and get familiar with the Python bindings of OpenCV,
  • scrape data from social networks, and
  • learn convenient ways of representing, summarizing and distributing your results.

The programme is divided into three ‘terms’ of 14 weeks each, each consisting of 9 weeks of directed study, 4 weeks of self-directed project work and one week of R&R.

What you’ll be getting out of this

Since the introduction of Docker, tolerance for wanton destruction as part of coursework has increased, but still won’t earn you a passing grade by itself.

In past years, mentees have noted the unusual breadth of knowledge they acquired about data science, as well as the diversity of practical topics and the realistic problem settings, with an emphasis on practical applications of data science such as presenting data products. I hope that this year, too, I’ll be able to convey the same important topics. Every year is a little different, as I try to adjust the course to the individual participant’s needs.

The programme is not, of course, accredited by any accreditation body, but a certificate of completion will be issued to any participant who requests one.

Application process

Simply fill in the form below and send it off by 14 January 2018. The top contenders will be contacted by e-mail or telephone for a brief conversation thereafter. Finally, a lucky winner will be picked by 21 January 2018. Easy peasy!



Q: What does ‘semi-self-directed’ mean? Is there a fixed curriculum?

A: No. There are some basic topics (see the list above) that I think are quite likely to come up, but ultimately, this is about making you the data scientist you want to be. For this reason, we’ll begin by planning out where you want to improve – kind of like a PT giving you a training plan before you start out at their gym. We will then adjust as needed. This is not exam prep, it’s a learning experience, and for that reason, we can focus on delving deeper and getting the fundamentals right rather than cramming in a particular curriculum.

Q: Can I bring my own data?

A: Sure. In general, we’ll be using standard data sets, because they’re well-known and high-quality data. But if you have a dataset you collected or are otherwise entitled to use that would do equally well, there’s no reason why we couldn’t use it! Note that you must have the right to use and share the data set, meaning it’s unlikely you’re able to use data sets from your day job.

Q: Will this give me an employment advantage?

A: I don’t quite know – it’s impossible to predict. The field of data science degrees is something of a Wild West still, and while some reputable degrees have emerged, others are dubious. Employers still don’t know what to go by. However, you will most definitely be better prepared for an employment interview in data science!

Q: Why are you so keen on presenting data the right way?

A: Because as data scientists, we’re expected to not merely understand the data and draw the right conclusions, but also to convey them to stakeholders at various levels, from plant management to C-suite, in a way that gets the right message across at the first go.

Q: You’re a computational epidemiologist. Can I apply even if my work doesn’t really involve healthcare?

A: Sure. The principles are the same, and we’re largely focusing on generic topics. You might be exposed to bits and pieces of epidemiology, but I can guarantee it won’t hurt.

Q: Why do you only take on one mentee?

A: To begin with, my life is pretty busy – I have a demanding job, a family and – shock horror! – I even need to sleep every once in a while. More importantly, I want to devote my undivided attention to a worthy candidate.

Q: How come I’ve never heard of this before?

A: Until now, I’ve largely gotten mentees by word of mouth. I am concerned that this is keeping some talented people out and limiting the pool of people we should have in. That’s why this year, I have tried to make this process much more transparent.

Q: You’re rather fond of General ‘Mad Dog’ Mattis. Will there be yelling?


Q: There seems to be no upper age limit. Is that a mistake?


Q: I have more questions.

A: You can ask them here.

Fixing the mysterious Jupyter Tensorflow import bug

There’s a weird bug afoot that you might encounter when setting up a ‘lily white’ (brand new) development environment to play around with Tensorflow. As it seems to have vexed quite a few people, I thought I’d put my solution here to help future tensorflowers find their way. The problem presents itself after you have set up your new virtualenv: you install Jupyter and Tensorflow, and when you import Tensorflow, you get this:

In [1]: import tensorflow as tf
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-...> in <module>()
----> 1 import tensorflow as tf

ModuleNotFoundError: No module named 'tensorflow'


Added perplexion

Say you are a dogged pursuer of bugs, and wish to check whether you might have installed Tensorflow and Jupyter into different virtualenvs. One way to do that is to activate your virtualenv (using activate or source activate, depending on whether you use virtualenvwrapper) and start a Python shell. Perplexingly, importing Tensorflow there will work just fine.
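A minimal diagnostic sketch: run these lines once in the plain Python shell inside the activated virtualenv and once in a Jupyter cell. If the two runs print different paths, Jupyter is being served by a different interpreter than the one Tensorflow was installed into.

import sys
print(sys.executable)   # the interpreter actually running this code
print(sys.prefix)       # the environment that interpreter belongs to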

The solution

At this time, this works only for CPython aka ‘regular Python’ (if you don’t know what kind of Python you are running, it is in all likelihood CPython).

In general, it is advisable to start fixing these issues by destroying your virtualenv and starting anew, although that’s not strictly necessary. Create a virtualenv, and note the base Python executable’s version (it has to be a version for which there is a Tensorflow wheel for your platform, i.e. 2.7 or 3.3-3.6).
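A quick, purely optional sanity check from inside the fresh virtualenv:

import sys
# Should print 2.7 or something in the 3.3-3.6 range for a Tensorflow wheel to exist.
print('Python {}.{}'.format(sys.version_info.major, sys.version_info.minor))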

Step 1

Go to the PyPI website to find the Tensorflow installation appropriate to your system and your Python version (e.g. cp36 for Python 3.6). Copy the path of the correct version, then open up a terminal window and declare it as the environment variable TF_BINARY_URL. Use pip to install from the URL you set as the environment variable, then install Jupyter.

$ export TF_BINARY_URL=
$ pip install --upgrade $TF_BINARY_URL jupyter
Collecting tensorflow==1.1.0rc2 from
  Using cached tensorflow-1.1.0rc2-cp36-cp36m-macosx_10_11_x86_64.whl
Collecting jupyter
  Using cached jupyter-1.0.0-py2.py3-none-any.whl

(... lots more installation steps to follow ...)

Successfully installed ipykernel-4.6.1 ipython-6.0.0 jedi-0.10.2 jinja2-2.9.6 jupyter-1.0.0 jupyter-client-5.0.1 jupyter-console-5.1.0 notebook-5.0.0 prompt-toolkit-1.0.14 protobuf-3.2.0 qtconsole-4.3.0 setuptools-35.0.1 tensorflow-1.1.0rc2 tornado-4.5.1 webencodings-0.5.1 werkzeug-0.12.1
Step 2

Now for some magic. If you launch Jupyter now, there’s a good chance it won’t find Tensorflow. Why? Because you have only just installed Jupyter, the jupyter command your shell picks up may still point to your system Python installation rather than to the copy inside the virtualenv.

Enter which jupyter to find out where the jupyter command is pointing. If it points to a path within your virtualenv’s folder, you’re good to go. Otherwise, open a new terminal window, activate your virtualenv and check where the jupyter command is pointing now – it should point to the virtualenv.

Step 3

Fire up Jupyter, and import tensorflow. Voilà – you have a fully working Tensorflow environment!
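A quick verification cell for the fresh notebook – assuming the steps above went through, it should print the Tensorflow version and a path inside your virtualenv:

import sys
import tensorflow as tf
print(tf.__version__)
print(sys.executable)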

As always, let me know in the comments if it works for you, or if you’ve found an alternative way to fix this issue. Hopefully, this will help you on your way to delving into Tensorflow and exploring this fantastic deep learning framework!

Header image: courtesy of Jeff Dean, Large Scale Deep Learning for Intelligent Computer Systems, adapted from Untangling invariant object recognition by DiCarlo and Cox (2007).

A deep learning

There are posts that are harder to write than others. This one perhaps has been one of the hardest. It took me the best part of four months and dozens of rewrites.

Because it’s about something I love. And about someone I love. And about something else I love. And how these three came into conflict. And, perhaps, what we all can learn from that.

As many of you might know, deep learning is my jam. Not in a faddish, ‘it’s what cool kids do these days’ sense. Nor, for that matter, in the sense so awfully prevalent in Silicon Valley, whereby the utility of something is measured by how many jobs it will get rid of, presumably freeing up humans to engage in more cerebral pursuits, or how it may someday cure intrinsically human problems if only those pesky humans were to listen to their technocratic betters for once. Rather, I’m a deep learning and AI researcher who believes in what he’s doing. I believe with all I am and all I’ve got that deep learning is right now our best chance to find better ways of curing cancer, producing more with lower emissions, building structures that can withstand floods on a dime, identifying terrorists and, heck, creating entertaining stuff. I firmly believe that it’s one of the few intellectual pursuits I am somewhat suited for that is also worth my time, not least because I firmly believe that it will give me more of it – and if not me, maybe someone equally worthy.

Which is why it was so hard for me to watch this video, of my lifelong idol Hayao Miyazaki ripping a deep learning researcher to shreds.

Now, quite frankly, I have little time for the researcher and his proposition. It’s badly made, dumb and pointless. Why one would inundate Miyazaki-san with it is beyond me. His putdown is completely on point, and not an ounce too harsh. All of his words are well deserved. As someone with a neurological chronic pain disorder that makes me sometimes feel like that creature writhing on the floor, I don’t have a shred of sympathy for this chap.[1]

Rather, it’s the last few words of Miyazaki-san that have punched a hole in my heart and have held my thoughts captive for months now, coming back into the forefront of my thoughts like a recurring nightmare.

“I feel like we are nearing the end of times,” he says, the camera gracefully hovering over his shoulder as he sketches through his tears. “We humans are losing faith in ourselves.”

Deep learning is something formidable, something incredible, something so futuristic yet so simple. Deep down (no pun intended), deep learning is really not much more than a combination of a few relatively simple tricks, some the best part of a century old, that together create something fantastic. Let me try to put it into layman’s terms (if you’re one of my fellow ML/AI nerds, you can just jump over this part).

Suppose you are facing the arduous and yet tremendously important task of, say, identifying whether an image depicts a cat or a dog. In ML lingo, this is what we call a ‘classification’ task. One traditional approach used to be to define what cats are versus what dogs are, and provide rules. If it’s got whiskers, it’s a cat. If it’s got big puppy eyes, it’s, well, a puppy. If it’s got forward-pointing eyes and a roughly circular face, it’s almost definitely a kitty. If it’s on a leash, it’s probably a dog. And so on, ad infinitum, your model of cat-versus-dog becoming more and more accurate with each rule you add.
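A toy sketch of that rule-based approach, with features and rules invented purely for illustration – the point being that every single rule has to be thought up and written down by a human:

def classify(animal):
    # Hand-written rules, in the spirit of the description above.
    if animal.get('on_leash'):
        return 'dog'
    if animal.get('whiskers') and animal.get('forward_pointing_eyes'):
        return 'cat'
    if animal.get('big_puppy_eyes'):
        return 'dog'
    return 'no idea'   # ...and so on, ad infinitum

print(classify({'whiskers': True, 'forward_pointing_eyes': True}))   # cat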

This is a fairly feasible approach, and it is still used. In fact, there’s a whole school of machine learning, decision trees, that relies on this kind of rule-based description of the subjects. But there are three problems with it.

  1. You need to know quite a bit about cats and dogs to be able to do this. At the very least, you need to be able to, and take the time and effort to, describe cats and dogs. It’s not enough to merely feed images of each to the computer.[2]
  2. You are limited in time and ability to put down distinguishing features – your program cannot be infinitely large, nor do you have infinite time to write it. You must prioritise by identifying the factors with the greatest differentiating potential first. In other words, you need to know, in advance, what the most salient characteristics of cats versus dogs are – that is, which characteristics are almost omnipresent among cats but hardly ever occur among dogs (and vice versa)? All dogs have a snout and no cat has one, whereas ear shape is less reliable: some cats have floppy ears and some dogs have almost catlike triangular ears.
  3. You are limited to what you know. Silly as that may sound, there might be differentiae between cats and dogs so arcane, so mathematical, that no human would think of them – but which might be trivially evident to a computer.

Deep learning, like friendship, is magic. Unlike most other techniques of machine learning, you don’t need to have the slightest idea of what differentiates cats from dogs. What you need is a few hundred images of each, preferably with labels (although even that is not strictly necessary: classifiers can get by without being told what the things they are classifying are called – as long as they know how many classes to split the images into, they will find differentiating features on their own and sort the images into ‘images with thing 1’ versus ‘images with thing 2’. Magic, right?). Using modern deep learning libraries like TensorFlow and their high-level abstractions (e.g. keras, tflearn), you can literally write a classifier that tells cats from dogs with very high accuracy in less than 50 lines of Python – one that can classify thousands of cat and dog pics in a fraction of a minute, most of which is spent loading the images rather than doing the actual classification.
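A minimal sketch of such a classifier, using Keras on top of TensorFlow. Everything here is illustrative rather than prescriptive: the directory layout (data/train/cats, data/train/dogs), image size, layer sizes and training schedule are assumptions, not a recipe.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

# Stream labelled images from a folder laid out as data/train/<class_name>/...
train_data = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    'data/train', target_size=(150, 150), batch_size=32, class_mode='binary')

# A small convolutional network: two convolution/pooling blocks, then a
# fully connected head that outputs the probability of 'dog'.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train; a handful of epochs is usually enough to get well past chance level.
model.fit_generator(train_data, steps_per_epoch=100, epochs=5)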

Told you it’s magic.

What makes deep learning ‘deep’, though? The origins of deep learning are older than modern computers. In 1943, McCulloch and Pitts published a paper[3] that posited a model of neural activity based on propositional logic. Spurred by the mid-20th century advances in understanding how the nervous system works, in particular how nerve cells are interconnected, McCulloch and Pitts simply drew the obvious conclusion: there is a way to represent neural connections using propositional logic (and, actually, vice versa). But it wasn’t until 1958 that this idea was followed up in earnest. Rosenblatt’s ground-breaking paper[4] introduced this thing called the perceptron, something that sounds like the ideal robotic boyfriend/therapist but was in fact intended as a mathematical model for how the brain stores and processes information. A perceptron is a network of artificial neurons. Consider the cat/dog example. A simple single-layer perceptron has a list of input neurons x_1, x_2 and so on, each describing a particular property. Does the animal have a snout? Does it go woof? Depending on how characteristic they are, each input is multiplied by a weight w_n. For instance, all dogs and no cats have snouts, so w_1 will be relatively high, while there are cats that don’t have long curly tails and dogs that do, so the corresponding weight will be relatively low.

At the end, the output neuron (denoted by the big Σ) sums up these weighted results and gives an estimate as to whether it’s a cat or a dog.
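A minimal NumPy sketch of that single-layer perceptron; the features, weights and threshold are invented purely for illustration, with positive scores pushing the verdict towards ‘dog’.

import numpy as np

x = np.array([1.0, 1.0, 0.0])      # [has_snout, goes_woof, has_long_curly_tail]
w = np.array([0.9, 0.8, 0.1])      # snout and woof are highly characteristic, the tail much less so
b = -0.5                           # threshold term

score = np.dot(w, x) + b           # the big-sigma summation of weighted inputs
print('dog' if score > 0 else 'cat')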

What was initially designed to model the way the brain works soon showed remarkable utility in applied computation, to the point that the US Navy was roped into building an actual, physical perceptron machine, the Mark I Perceptron – the first application of computer vision. However, it was a complete bust. It turned out that a single-layer perceptron couldn’t recognise a large class of patterns – most famously, it cannot learn anything that isn’t linearly separable, such as the humble XOR function. What it lacked was depth.
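To see concretely what the single layer cannot do, consider XOR (‘one or the other, but not both’): no single weighted sum followed by a threshold reproduces it, but one hidden layer of two neurons does. A NumPy sketch, with the weights picked by hand purely for illustration rather than learned:

import numpy as np

def step(z):
    # The perceptron's threshold activation: fires (1) if the weighted sum is positive.
    return (z > 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # all four input combinations

# Hidden layer: the first column detects OR, the second detects AND.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output layer: OR minus AND, i.e. XOR.
W2 = np.array([1.0, -1.0])
b2 = -0.5

hidden = step(X @ W1 + b1)
print(step(hidden @ W2 + b2))   # [0. 1. 1. 0.] -- something no single-layer perceptron can produce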

What do we mean by depth? Consider the human brain. The brain doesn’t actually have a single part devoted to vision. Rather, it has six separate areas[5] – the striate cortex (V1) and the extrastriate areas (V2-V6). These form a feedforward pathway of sorts, where V1 feeds into V2, which feeds into V3 and so on. To massively oversimplify: V1 detects simple optical features like edges, which it feeds on to V2, which combines them into more complex features: shapes, orientation, colour &c. As you proceed towards the back of the head, the visual centres detect increasingly complex abstractions built from the simple visual information. What was found is that, likewise, by putting layers and layers of neurons one after another, even very complex patterns can be identified accurately. There is a hierarchy of features, as the facial recognition example below shows.

The first hidden layer recognises simple geometries and blobs in different parts of the image. The second hidden layer fires if it detects particular manifestations of parts of the face – noses, eyes, mouths. Finally, the third layer fires if it ‘sees’ a particular combination of these. Much like an identikit image, a face is recognised because it contains parts of a face, which in turn are recognised because they contain a characteristic spatial alignment of simple geometries.

There’s much more to deep learning than what I have tried to convey in a few paragraphs. The applications are endless. With the cost of computing decreasing rapidly, deep learning applications have now become feasible in just about all spheres where they can be applied. And they excel everywhere, outpacing not only other machine learning approaches (which makes me absolutely stoked about the future!) but, at times, also humans.

Which leads me back to Miyazaki. You see, deep learning can’t just classify things or predict stock prices. It can also create stuff. To put an old misunderstanding to rest quite early: generative neural networks are genuinely creating new things. Rather than merely combining pre-programmed elements, they come as close as anything non-human can come to creativity.

The pinnacle of it all, generating enjoyable music, is still some ways off, and we have yet to enjoy a novel written by a deep learning engine. But to anyone who has been watching the rapid development of deep learning and especially generative algorithms based on deep learning, these are literally just questions of time.

Or perhaps, as Miyazaki said, questions of the ‘end of times’.

What sets a computer-generated piece apart from a human’s composition? Someday, they will be, as far as quality is concerned, indistinguishable. Yet something that will always set them apart is the absence of a creator.

In what is probably one of the worst-written essays in 20th-century literary criticism, a field already overflowing with bad prose for bad prose’s sake, Roland Barthes posited in his 1967 essay La mort de l’auteur a sort of separation between the author and the text, countering centuries of literary criticism that sought to explain the meaning of the latter by reference to the former. According to Barthes, texts (and so, compositions, paintings &c.) have a life and existence of their own. To liberate works of art from an ‘interpretive tyranny’ almost self-evidently imposed on them, they must be read, interpreted and understood by reference to their audience and not their author. Indeed, Barthes eschews the term ‘author’ in favour of ‘scriptor’, the latter hearkening back to the Medieval monks who copied manuscripts: like them, the scriptor is not in control of the narrative or work of art that he or she composes. Devoid of the author’s authority, the work of art is now free to exist in a liberated state that allows you – the recipient – to establish its essential meaning.

Oddly, that’s not entirely what post-modernism seems to have created. If anything, there is now an increased focus on the author, at the very least in one particular sense. Consider the curious case of Wagner’s works in Israel. Because of his anti-Semitic views, arguably as well as because of the favour his music found during the tragic years of the Third Reich, Wagner’s works – even those that do not even remotely express a political position – are rarely played in Israel. Even in recent years, other than Holocaust survivor Mendi Rodan’s performance of the Siegfried Idyll in 2000, there have been very few instances of Wagner being played in Israel – despite the curious fact that Theodor Herzl, the founder of Zionism, admired Wagner’s music (if not his vile racial politics). Rather than the death of the author, we more often witness the death of the work. The taint of the author’s life comes to haunt the chords of their compositions and the stanzas of their poetry, every brush-stroke forever imbued with the often very human sins and mistakes of their lives.

Less dramatic, perhaps, than Wagner’s case are the increasingly frequent boycotts, outbursts and protests against works of art based solely on the character of the author or composer. One need only look at the recent past to see protests, for instance, against the works of HP Lovecraft – works that have more to do with eldritch horrors than racist horridness – due to the author’s admittedly reprehensible views on matters of race. Outrages about one author or another, one artist or the next, are commonplace, acted out on a daily basis on the Twitter gibbets and the Facebook pillory. Rather than the death of the author, we experience the death of art, amidst a culture increasingly intolerant towards the works of flawed or sinful creators.

This is, of course, not to excuse any of those sins or flaws. They should not, and cannot, be excused. Rather, perhaps, it is to suggest that part of a better understanding of humanity is that artists are a cross-section of us as a species, equally prone to be misled and deluded into adopting positions that, as the famous German anti-Fascist and children’s book author Erich Kästner said, ‘feed the animal within man’. Nor is this to condone or justify art that actively expresses those reprehensible views – an entirely different issue. Rather, I seek merely to draw attention to the increased tendency to condemn works of art for the artist’s political sins. In many cases, these sins are far from being as straightforward as Lovecraft’s bigotry and Wagner’s anti-Semitism. In many cases, these sins can be as subtle as going against the drift of public opinion, the Orwellian sin of ‘wrongthink’. With the internet having become a haven of mob mentality (something I personally was subjected to a few years ago), the threshold of what sins of the creator shall be visited upon their creations has significantly decreased. It’s not the end of days, but you can see it from here.

In which case perhaps Miyazaki is right.

Perhaps what we need is art produced by computers.

As Miyazaki-san said, we are losing faith in ourselves. Not in our ability to create wonderful works of art, but in our ability to measure up to some flawless ethos, to some expectation of the artist as a flawless being. We are losing faith in our artists. We are losing faith in our creators, our poets and painters and sculptors and playwrights and composers, because we fear that the inevitable revelation of greater – or perhaps lesser – misdeeds or wrongful opinions from their past will not merely taint them: it will no less taint us, the fans and aficionados and cognoscenti. Put not your faith in earthly artists, for they are fickle, and prone to having opinions that might be unacceptable, or be seen as such someday. Is it not a straightforward response, then, to declare one’s love for the intolerable synthetic Baroque of Stanford machine learning genius Cary Kaiming Huang’s research? In a society where the artist’s sins taint the work of art and, through that, all those who have confessed to enjoying his works, there’s no other safe bet. Only the AI can cast the first stone.

And if the cost of that is truly the chirps of Cary’s synthetic Baroque generator, Miyazaki is right on the other point, too. It truly is the end of days.

References

1. Not least because I know how rudimentary and lame his work is. I’ve built evolutionary models of locomotion where the first stages look like this. There’s no cutting-edge science here.
2. There’s a whole aspect of the story called feature extraction, which I will ignore for the sake of simplicity, and assume that it just happens. It doesn’t, of course, and it plays a huge role in identifying things, but this story is complex enough already as it is.
3. McCulloch, W and Pitts, W (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics 5 (4): 115–133. doi:10.1007/BF02478259.
4. Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain. Psych Rev 65 (6): 386–408. doi:10.1037/h0042519
5. Or five, depending on whether you consider the dorsomedial area a separate area of the extrastriate cortex.