Are you looking for a data science sensei?

Maybe you’re a junior data scientist, maybe you’re a software developer who wants to go into data science, or perhaps you’ve dabbled in data for years in Excel but are ready to take the next step.

If so, this post is all about you, and an opportunity I offer every year.

You see, life has been very good to me in terms of training as a data scientist. I have been spoiled, really – I had the chance to learn from some of the best data scientists, work with some exceptional epidemiologists, experience some unusual challenges and face many of the day-to-day hurdles of working in data analytics. I’ve had the fortune to see this profession in all its contexts, from small enterprises to multi-million dollar FTSE100 companies, from well-run agile start-ups to large and sometimes pretty slow dinosaurs, from government through the private sector to NGOs: I’ve seen it all. I’ve done some great things. And I’ve made some superbly dumb mistakes.

And so, at the start of every year, I have opened applications for young, start-of-career data scientists looking for their Mr. Miyagi. Don’t worry: no car waxing involved. I will be choosing a single promising young data scientist and pass on as much as I can of my so-called wisdom. At the end, your skills will shine like Mr. Miyagi’s 1947 Ford Deluxe Convertible. There’s no catch, no hidden trap, no fees or charges involved (except the one mentioned below).

Eligibility criteria

To be eligible, you must be:

  • 18 or above if you are taking a gap year or not attending a university/college.
  • You do not have to have a formal degree in data science or a relevant subject, but you must have completed it if you do. In other words: if you’re in your 3rd year of an English Lit degree, you’re welcome to apply, but if you’re in the middle of your CS degree, you have to wait until you’re finished – sorry. The same goes if you intend to go straight on to a data science-related postgrad within the year.
  • Have a solid basis in mathematics: decent statistics, combinatorics, linear algebra and some high school calculus are the very minimum.
  • You must be familiar with Python (3.5 and above), and either familiar with the scientific Python stack (SciPy, NumPy, Pandas, matplotlib) or willing to pick up a lot on the go.
  • Be willing to put in the work: we’ll be convening about once every week to ten days by Skype for an hour, and you’ll probably be doing 6-10 hours’ worth of reading and work for the rest of the week. Please be realistic if you can sustain this.
  • If, as recommended, you are working on an AWS EC2 instance, be aware this might cost money and make sure you can cover the costs. In practice, these are negligible.
  • You must understand that this is a physically and intellectually strenuous endeavor, and it is your responsibility to know whether you’re physically and mentally up for the job. However, no physical or mental disabilities are regarded as automatically excluding you of consideration.
  • You must not live in, reside in or be a citizen of any of the countries listed in CFR Title 22 Part 126, §126.1(d)(1) and (2).
  • You must not have been convicted of a felony anywhere. This includes ‘spent’ UK criminal convictions.

Sounds good? Apply here.

Preferred applicants

When assessing applications, the following groups are given preference:

  • Persons with mental or physical disabilities whose disability precludes them from finding conventional employment – please outline this situation on the application form.
  • Honourably discharged (or equivalent) veterans of NATO forces and the IDF – please include member 4 copy of DD-214, Wehrdienstzeitbescheinigung or equivalent document that lists type of discharge.

What we’ll be up to

Don’t worry. None of this car waxing crap.

Over the 42 weeks to follow, you will be undergoing a rigorous and structured semi-self-directed training process. This will take your background, interests and future ambitions into account, but at the core, you will:

  • master Python’s data processing stack,
  • learn how to visualize data in Python,
  • work with networks and graph databases, including Neo4j,
  • acquire the correct way of presenting results in data science to stakeholders,
  • delve into cutting-edge methods of machine learning, such as deep learning using keras,
  • work on problems in computer vision and get familiar with the Python bindings of OpenCV,
  • scrape data from social networks, and
  • learn convenient ways of representing, summarizing and distributing our results.

The programme is divided into three ‘terms’ of 14 weeks each, which each consist of 9 weeks of directed study, 4 weeks of self-directed project work and one week of R&R.

What you’ll be getting out of this

Since the introduction of Docker, tolerance for wanton destruction as part of coursework has increased, but still won’t earn you a passing grade by itself.

In the past years, mentees have noted the unusual breadth of knowledge they have acquired about data science, as well as the diversity of practical topics and the realistic question settings, with an emphasis on practical applications of data science such as presenting data products. I hope that this year, too, I’ll be able to convey the same important topics. Every year is a little different as I try to adjust the course to meet the individual participant’s needs.

The programme is not, of course, accredited by any accreditation body, but a certificate of completion will be issued to any participant who wishes so.

Application process

Simply fill in the form below and send it off by 14 January 2018. The top contenders will be contacted by e-mail or telephone for a brief conversation thereafter. Finally, a lucky winner will be picked by the 21st January 2018. Easy peasy!

 

FAQ

Q: What does ‘semi-self-directed’ mean? Is there a fixed curriculum?

A: No. There are some basic topics (see list above) that I think are quite likely to come up, but ultimately, this is about making you the data scientist you want to be. For this reason, we’ll begin by planning out where you want to improve – kinda like a PT gives you a training plan before you start out at their gym. We will then adjust as needed. This is not an exam prep, it’s a learning experience, and for that reason, we can focus on delving deeper and getting the fundaments right over other cramming in a particular curriculum.

Q: Can I bring your own data?

A: Sure. In general, we’ll be using standard data sets, because they’re well-known and high-quality data. But if you have a dataset you collected or are otherwise entitled to use that would do equally well, there’s no reason why we couldn’t use it! Note that you must have the right to use and share the data set, meaning it’s unlikely you’re able to use data sets from your day job.

Q: Will this give me an employment advantage?

A: I don’t quite know – it’s impossible to predict. The field of data science degrees is something of a Wild West still, and while some reputable degrees have emerged, others are dubious. Employers still don’t know what to go by. However, you will most definitely be better prepared for an employment interview in data science!

Q: Why are you so keen on presenting data the right way?

A: Because as data scientists, we’re expected to not merely understand the data and draw the right conclusions, but also to convey them to stakeholders at various levels, from plant management to C-suite, in a way that gets the right message across at the first go.

Q: You’re a computational epidemiologist. Can I apply even if my work doesn’t really involve healthcare?

A: Sure. The principles are the same, and we’re largely focusing on generic topics. You might be exposed to bits and pieces of epidemiology, but I can guarantee it won’t hurt.

Q: Why do you only take on one mentee?

A: To begin with, my life is pretty busy – I have a demanding job, a family and – shock horror! – I even need to sleep every once in a while. More importantly, I want to devote my undivided attention to a worthy candidate.

Q: How come I’ve never heard of this before?

A: Until now, I’ve largely gotten mentees by word of mouth. I am concerned that this is keeping some talented people out and limiting the pool of people we should have in. That’s why this year, I have tried to make this process much more transparent.

Q: You’re rather fond of General ‘Mad Dog’ Mattis. Will there be yelling?

No.

Q: There seems to be no upper age limit. Is that a mistake?

No.

Q: I have more questions.

A: You can ask them here.

cookiecutter-flask-ask: a quick(er) start to Alexa skills!

If you develop for Amazon’s Alexa-powered devices, you must at some point have come across Flask-Ask, a project by John Wheeler that lets you quickly and easily build Python-based Skills for Alexa. It’s so easy, in fact, that John’s quickstart video, showing the creation of a Flask-Ask based Skill from zero to hero, takes less than five minutes! How awesome is that? Very awesome.

Bootstrapping a Flask-Ask project is not difficult – in fact, it’s pretty easy, but also pretty repetitive. And so, being the ingenious lazy developer I am, I’ve come up with a (somewhat opinionated) cookiecutter template for Flask-Ask.

Usage

Using the Flask-Ask cookiecutter should be trivial.  Make sure you have cookiecutter installed, either in a virtualenv that you have activated or your system installation of Python. Then, simply use  cookiecutter gh:chrisvoncsefalvay/cookiecutter-flask-ask to get started. Answer the friendly assistant’s questions, and voila! You have the basics of a Flask-Ask project all scaffolded.

Once you have scaffolded your project, you will have to create a virtualenv for your project and install dependencies by invoking pip install -r requirements.txt. You will also need ngrok to test your skill from your local device.

What’s in the box?

The cookiecutter has been configured with my Flask-Ask development preferences in mind, which in turn borrow heavily from John Wheeler‘s. The cookiecutter provides a scaffold of a Flask application, including not only session start handlers and an example intention but also a number of handlers for built-in Alexa intents, such as Yes, No and Help.

There is also a folder structure you might find useful, including an intent schema for some basic Amazon intents and a corresponding empty sample_utterances.txt file, as well as a gitkeep’d folder for custom slot types. Because I’m a huge fan of Sphinx documentation and strongly believe that voice apps need to be assiduously documented to live up to their potential, there is also a docs/ folder with a Makefile and an opinionated conf.py configuration file.

Is that all?!

Blissfully, yes, it is. Thanks to John’s extremely efficient and easy-to-use Flask-Ask project, you can discourse with your very own skill less than twenty minutes after starting the scaffolding!

You can find the cookiecutter-flask-ask project here. Issues, bugs and other woes are welcome, as are contributions (simply raise a pull request). For help and advice, you can find me on the Flask-Ask Gitter a lot during daytime CET.

cookiecutter-flask-ask is, of course, Swabware.

Give your Twitter account a memory wipe… for free.

The other day, my wife has decided to get rid of all the tweets on one of her twitter accounts, while of course retaining all the followers. But bulk deleting tweets is far from easy. There are, fortunately, plenty of tools that offer you the service of bulk deleting your tweets… for a cost, of course. One had a freemium model that allowed three free deletes per day. I quickly calculated that it would have taken my wife something on the order of twelve years to get rid of all her tweets. No, seriously. That’s silly. I can write some Python code to do that faster, can’t I?

Turns out you can. First, of course, you’ll need to create a Twitter app from the account you wish to wipe and generate an access token, since we’ll also be performing actions on behalf of the account.

import tweepy
import time

CONSUMER_KEY=<your consumer key>
CONSUMER_SECRET=<your consumer secret>
ACCESS_TOKEN=<your access token>
ACCESS_TOKEN_SECRET=<your access token secret>
SCREEN_NAME=<your screen name, without the @>

Time to use tweepy’s OAuth handler to connect to the Twitter API:

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

Now, we could technically write an extremely sophisticated script, which looks at the returned headers to determine when we will be cut off by the API throttle… but we’ll use the easy and brutish route of holding off for a whole hour if we get cut off. At 350 requests per hour, each capable of deleting 100 tweets, we can get rid of a 35,000 tweet account in a single hour with no waiting time, which is fairly decent.

The approach will be simple: we ask for batches of 100 tweets, then call the .destroy() method on each of them, which thanks to tweepy is now bound into the object representing every tweet we receive. If we encounter errors, we respond accordingly: if it’s a RateLimitError, an error object from tweepy that – as its name suggests – shows that the rate limit has been exceeded, we’ll hold off for an hour (we could elicit the reset time from headers, but this is much simpler… and we’ve got time!), if it can’t find the status we simply leap over it (sometimes that happens, especially when someone is doing some manual deleting at the same time) and otherwise, we break the loops.

def destroy():
    while True:
        q = api.user_timeline(screen_name=SCREEN_NAME,
                              count=100)
        for each in q:
            try:
                each.destroy()
            except tweepy.RateLimitError as e:
                print (u"Rate limit exceeded: {0:s}".format(e.message))
                time.sleep(3600)
            except tweepy.TweepError as e:
                if e.message == "No status found with that ID.":
                    continue
            except Exception as e:
                print (u"Encountered undefined error: {0:s}".format(e.message))
                break
        break

Finally, we’ll make sure this is called as the module default:

if __name__ == '__main__':
    destroy()

Happy destruction!

Immortal questions

When asked for a title for his 1979 collection of philosophical papers, my all-time favourite philosopher1 Thomas Nagel chose the title Mortal Questions, an apt title, for most of our philosophical preoccupations (and especially those pertaining to the broad realm of moral philosophy) stem from the simple fact that we’re all mortal, and human life is as such an irreplaceable good. By extension, most things that can be created by humans are capable of being destroyed by humans.

That time is ending, and we need a new ethics for that.

Consider the internet. We all know it’s vulnerable, but is it existentially vulnerable?2 The answer is probably no. Neither would any significantly distributed self-provisioning pseudo-AI be. And by pseudo-AI, I don’t even mean a particularly clever or futuristic or independently reasoning system, but rather a system that can provision resources for itself in response to threat factors just as certain balancers and computational systems we write and use on a day to day basis can commission themselves new cloud resources to carry out their mandate. Based on their mandate, such systems are potentially existentially immortal/existentially indestructible.3

The human factor in this is that such a system will be constrained by mandates we give them. Ergo,4 those mandates are as fit a subject for human moral reasoning as any other human action.

Which means we’re going to need that new ethics pretty darn’ fast, for there isn’t a lot of time left. Distributed systems, smart contracts, trustless M2M protocols, the plethora of algorithms that have arisen that each bring us a bit closer to a machine capable of drawing subtle conclusions from source data (hidden Markov models, 21st century incarnations of fuzzy logic, certain sorts of programmatic higher order logic and a few other factors are all moving towards an expansion of what we as humans can create and the freedom we can give our applications. Who, even ten years ago, would have thought that one day I will be able to give a computing cluster my credit card and if it ran out of juice, it could commission additional resources until it bled me dry and I had to field angry questions from my wife? And that was a simple dumb computing cluster. Can you teach a computing cluster to defend itself? Why the heck not, right?

Geeks who grew up on Asimov’s laws of robotics, myself included, think of this sort of problem as largely being one of giving the ‘right’ mandates to the system, overriding mandates to keep itself safe, not to harm humans,5 or the like. But any sufficiently well-written system will eventually grow to the level of the annoying six-year-old, who lives for the sole purpose of trying to twist and redefine his parents’ words to mean the opposite of what they intended.6 In the human world, a mandate takes place in a context. A writ is executed within a legal system. An order by a superior officer is executed according to the applicable rules of military justice, including circumstances when the order ought not be carried out. Passing these complex human contexts, which most of us ignore as we do all the things we grew up with and take for granted, into a more complicated model may not be feasible. Rules cannot be formulated exhaustively,7 as such a formulation by definition would have to encompass all past, present and future – all that potentially can happen. Thus, the issue moves on soon from merely providing mandates to what in the human world is known as ‘statutory construction’ or interpretation of legislative works. How are computers equipped to reason about symbolic propositions according to rules that we humans can predict? In other words, how can we teach rules of reasoning about rules in a way that is not inherently recursing this question (i.e. is not based on a simple conditional rule based framework).

Which means that the best that can be provided in such a situation is a framework based on values, and target optimisation algorithms (i.e. what’s the best way to reach the overriding objective with least damage to other objectives and so on). Which in turn will need a good bit of rethinking ethical norms.

But the bottom line is quite simple: we’re about to start creating immortals. Right now, you can put data on distributed file infrastructures like IPFS that’s effectively impossible to destroy using a reasonable amount of resources. Equally, distributed applications via survivable infrastructures such as the blockchain, as well as smart contract platforms, are relatively immortal. The creation of these is within the power of just about everyone with a modicum of computing skills. The rise of powerful distributed execution engines for smart contracts, like Maverick Labs’ Aletheia Platform,8 will give a burst of impetus to systems’ ability to self-provision, enter into contracts, procure services and thus even effect their own protection (or destruction). They are incarnate, and they are immortal. For what it’s worth, man is steps away from creating its own brand of deities.9

What are the ethics of creating a god? What is right and wrong in this odd, novel context? What is good and evil to a device?

The time to figure out these questions is running out with merciless rapidity.

Title image: God the Architect of the Universe, Codex Vindobonensis 2554, f1.v

References   [ + ]

1. That does not mean I agree with even half of what he’s saying. But I do undoubtedly acknowledge his talent, agility of mind, style of writing, his knowledge and his ability to write good and engaging papers that have not yet fallen victim to the neo-sophistry dominating universities.
2. I define existential vulnerability as being capable of being destroyed by an adversary that does not require the adversary to accept an immense loss or undertake a nonsensically arduous task. For example, it is possible to kill the internet by nuking the whole planet, but that would be rather disproportionate. Equally, destruction of major lines of transmission may at best isolate bits of the internet (think of it in graph theory terms as turning the internet from a connected graph into a spanning acyclic tree), but it takes rather more to kill off everything. On the other hand, your home network is existentially vulnerable. I kill router, game over, good night and good luck.
3. As in, lack existential vulnerability.
4. According to my professors at Oxford, my impatience towards others who don’t see the connections I do has led me to try to make up for it by the rather annoying verbal tic of overusing ‘thus’ at the start of every other sentence. I wrote a TeX macro that automatically replaced it with neatly italicised ‘Ergo‘. Sometimes, I wonder why they never decided to drown me in the Cherwell.
5. …or at least not to harm a given list of humans or a given type of humans.
6. Many of these, myself included, are at risk of becoming lawyers. Parents, talk to your kids. If you don’t talk to them about the evils of law school, who will?
7. H.L.A. Hart makes some good points regarding this
8. Mandatory disclosure: I’m one of the creators of Aletheia, and a shareholder and CTO of its parent corporation.
9. For the avoidance of doubt: as a Christian, a scientist and a developer of some pretty darn complex things, I do not believe that these constructs, even if omnipotent, omniscient and omnipresent as they someday will be by leveraging IoT and surveillance networks, are anything like my capital-G God. For lack of space, there’s no way to go into an exhaustive level of detail here, but my God is not defined by its omniscience and omnipotence, it’s defined by his grace, mercy and love for us. I’d like to see an AI become incarnate and then suffer and die for the salvation of all of humanity and the forgiveness of sins. The true power of God, which no machine will ever come close to, was never as strongly demonstrated as when the child Jesus lay in the manger, among animals, ready to give Himself up to save a fallen, broken humanity. And I don’t see any machine ever coming close to that.

If you love something, let it break your heart.

A few days ago, I finished re-reading Michele Cushatt‘s book Undone, an incredible journey through what she describes as an ‘unexpected life’. I cannot possibly do the book justice, save by telling you to go out and buy it, right now. I was reminded of her book today when thinking about coding and software development (which I do way too much for someone who is principally supposed to be a tool user, not a toolmaker).

Broken

There’s a style of Japanese pottery known as 金繕い (kintsukuroi), in which broken pottery is repaired with a mixture of lacquer and fine-grained gold or silver dust. The first time I saw a kintsukuroi bowl, I must have been about 12. It was unceremoniously stacked among a lot of other stuff in a side room of the Kunsthistorisches Museum in Vienna, which hosted an exhibition on Asian pottery. It was as close to a transcendental experience as I have ever had.

Kintsukuroi accepts brokenness as part of life. It accepts the fact that Things Break, and sometimes not all the king’s horses and not all the king’s men can put what was broken back together again. It is a brutally honest acknowledgement of what it means to live in an imperfect, fallen world. But at the same time, it is an acknowledgement of the fact that yes, sometimes, you cannot put something broken back together again with seams fitting in lattice-to-lattice perfection. But what fills up the empty spaces, what fills up the gaps of what used to be, can be valuable. Just as the few drops of gold lacquer in a kintsukuroi bowl are worth more than the bowl itself ever was, brokenness and healing can make us something more valuable.

Coding is hard. It’s heartbreaking. It’s sometimes frustrating. A few weeks ago, I spent hours trying to figure out what to make of a fifteen-line, completely obscure error message in R citing various Java code citing C++ code… I was near punching the screen.

It turns out it was the system’s way of telling me the server connection timed out. Did it say so? Heck no. It took hours of digging and debugging around to find out how a Java exception in the middle of R code even made sense. Those are the moments you just want to smash some of the objects in arm’s reach.

The only good thing about these moments is that there’s a faint chance you’ll remember them. That, by the way, is called growth.

Mended

Which leads me to the point of developing immunity to heartbreak. I wrote my first piece of code that I actually dared to share with the wider public in 2000. Github wasn’t around yet, so people shared code on mailing lists and messageboards. I posted this snippet of code I wrote to calculate the haversine distance, written in Python, that for some reason didn’t run.

Within an hour, I had 25 comments. They ran the gamut of “just give up already” to “this is some of the worst code I’ve ever seen”. In other words, the kind of feedback you aren’t allowed to give these days. Now, I’m not a particularly thick-skinned person, and it duly broke my heart.

I got upset… until I felt something stir in me, a resolve and an unwillingness to leave it at this. I decided I’m going to work on my code until I could turn out the best haversine implementation that I can. It took me the best part of a month. I posted it to the same messageboard, in the same thread.

The same people I thought were absolute jerks now sent appreciative comments. One even noted he learned a new trick from my code.

That’s when I understood the people who broke my heart only weeks ago didn’t do so out of ill-will or meanness. They did so because they, too, had their hearts broken day by day by the interpreter’s harsh output. And they knew that getting your heart broken by people is a lot easier than getting your heart broken by a machine.

Heartbreak is important. Heartbreak is what happens when you try to scale a mountain too high. Heartbreak is the sign that you’ve gone beyond your comfort zone.

And growth happens in that space. Growth cannot happen in a space of continual heartbreak – that’s just pure frustration, and frustration causes resignation, not growth. But a space without heartbreak has no challenges, no risk, no disappointment, no growth.

If you love something, go out and make it break your heart.

Then stand up, and try again.

And again.