Are you looking for a data science sensei?

Maybe you’re a junior data scientist, maybe you’re a software developer who wants to go into data science, or perhaps you’ve dabbled in data for years in Excel but are ready to take the next step.

If so, this post is all about you, and an opportunity I offer every year.

You see, life has been very good to me in terms of training as a data scientist. I have been spoiled, really – I had the chance to learn from some of the best data scientists, work with some exceptional epidemiologists, experience some unusual challenges and face many of the day-to-day hurdles of working in data analytics. I’ve had the fortune to see this profession in all its contexts, from small enterprises to multi-million dollar FTSE100 companies, from well-run agile start-ups to large and sometimes pretty slow dinosaurs, from government through the private sector to NGOs: I’ve seen it all. I’ve done some great things. And I’ve made some superbly dumb mistakes.

And so, at the start of every year, I open applications for young, start-of-career data scientists looking for their Mr. Miyagi. Don’t worry: no car waxing involved. I will choose a single promising young data scientist and pass on as much of my so-called wisdom as I can. At the end, your skills will shine like Mr. Miyagi’s 1947 Ford Deluxe Convertible. There’s no catch, no hidden trap, no fees or charges involved (except the one mentioned below).

Eligibility criteria

To be eligible, you must meet all of the following criteria:

  • 18 or above if you are taking a gap year or not attending a university/college.
  • You do not need a formal degree in data science or a related subject, but if you are pursuing one, you must have completed it. In other words: if you’re in your 3rd year of an English Lit degree, you’re welcome to apply, but if you’re in the middle of your CS degree, you have to wait until you’re finished – sorry. The same goes if you intend to go straight on to a data science-related postgrad within the year.
  • Have a solid basis in mathematics: decent statistics, combinatorics, linear algebra and some high school calculus are the very minimum.
  • You must be familiar with Python (3.5 and above), and either familiar with the scientific Python stack (SciPy, NumPy, Pandas, matplotlib) or willing to pick up a lot on the go.
  • Be willing to put in the work: we’ll be convening about once every week to ten days by Skype for an hour, and you’ll probably be doing 6-10 hours’ worth of reading and work for the rest of the week. Please be realistic about whether you can sustain this.
  • If, as recommended, you are working on an AWS EC2 instance, be aware this might cost money and make sure you can cover the costs. In practice, these are negligible.
  • You must understand that this is a physically and intellectually strenuous endeavor, and it is your responsibility to know whether you’re physically and mentally up for the job. However, no physical or mental disabilities are regarded as automatically excluding you from consideration.
  • You must not live in, reside in or be a citizen of any of the countries listed in CFR Title 22 Part 126, §126.1(d)(1) and (2).
  • You must not have been convicted of a felony anywhere. This includes ‘spent’ UK criminal convictions.

Sounds good? Apply here.

Preferred applicants

When assessing applications, the following groups are given preference:

  • Persons with mental or physical disabilities whose disability precludes them from finding conventional employment – please outline this situation on the application form.
  • Honourably discharged (or equivalent) veterans of NATO forces and the IDF – please include member 4 copy of DD-214, Wehrdienstzeitbescheinigung or equivalent document that lists type of discharge.

What we’ll be up to

Don’t worry. None of this car waxing crap.

Over the 42 weeks to follow, you will be undergoing a rigorous and structured semi-self-directed training process. This will take your background, interests and future ambitions into account, but at the core, you will:

  • master Python’s data processing stack,
  • learn how to visualize data in Python,
  • work with networks and graph databases, including Neo4j,
  • learn the right way to present data science results to stakeholders,
  • delve into cutting-edge methods of machine learning, such as deep learning using keras,
  • work on problems in computer vision and get familiar with the Python bindings of OpenCV,
  • scrape data from social networks, and
  • learn convenient ways of representing, summarizing and distributing your results.

The programme is divided into three ‘terms’ of 14 weeks each, which each consist of 9 weeks of directed study, 4 weeks of self-directed project work and one week of R&R.

What you’ll be getting out of this

Since the introduction of Docker, tolerance for wanton destruction as part of coursework has increased, but still won’t earn you a passing grade by itself.

In the past years, mentees have noted the unusual breadth of knowledge they have acquired about data science, as well as the diversity of practical topics and the realistic question settings, with an emphasis on practical applications of data science such as presenting data products. I hope that this year, too, I’ll be able to convey the same important topics. Every year is a little different as I try to adjust the course to meet the individual participant’s needs.

The programme is not, of course, accredited by any accreditation body, but a certificate of completion will be issued to any participant who wishes to receive one.

Application process

Simply fill in the form below and send it off by 14 January 2018. The top contenders will be contacted by e-mail or telephone for a brief conversation thereafter. Finally, a lucky winner will be picked by 21 January 2018. Easy peasy!


FAQ

Q: What does ‘semi-self-directed’ mean? Is there a fixed curriculum?

A: No. There are some basic topics (see list above) that I think are quite likely to come up, but ultimately, this is about making you the data scientist you want to be. For this reason, we’ll begin by planning out where you want to improve – kinda like a PT gives you a training plan before you start out at their gym. We will then adjust as needed. This is not exam prep, it’s a learning experience, and for that reason, we can focus on delving deeper and getting the fundamentals right rather than cramming in a particular curriculum.

Q: Can I bring my own data?

A: Sure. In general, we’ll be using standard data sets, because they’re well-known and high-quality data. But if you have a dataset you collected or are otherwise entitled to use that would do equally well, there’s no reason why we couldn’t use it! Note that you must have the right to use and share the data set, meaning it’s unlikely you’re able to use data sets from your day job.

Q: Will this give me an employment advantage?

A: I don’t quite know – it’s impossible to predict. The field of data science degrees is something of a Wild West still, and while some reputable degrees have emerged, others are dubious. Employers still don’t know what to go by. However, you will most definitely be better prepared for an employment interview in data science!

Q: Why are you so keen on presenting data the right way?

A: Because as data scientists, we’re expected to not merely understand the data and draw the right conclusions, but also to convey them to stakeholders at various levels, from plant management to C-suite, in a way that gets the right message across at the first go.

Q: You’re a computational epidemiologist. Can I apply even if my work doesn’t really involve healthcare?

A: Sure. The principles are the same, and we’re largely focusing on generic topics. You might be exposed to bits and pieces of epidemiology, but I can guarantee it won’t hurt.

Q: Why do you only take on one mentee?

A: To begin with, my life is pretty busy – I have a demanding job, a family and – shock horror! – I even need to sleep every once in a while. More importantly, I want to devote my undivided attention to a worthy candidate.

Q: How come I’ve never heard of this before?

A: Until now, I’ve largely gotten mentees by word of mouth. I am concerned that this is keeping some talented people out and limiting the pool of people we should have in. That’s why this year, I have tried to make this process much more transparent.

Q: You’re rather fond of General ‘Mad Dog’ Mattis. Will there be yelling?

A: No.

Q: There seems to be no upper age limit. Is that a mistake?

A: No.

Q: I have more questions.

A: You can ask them here.

10 tips for passing the Neo4j Certified Professional examination

Everybody loves a good certification. Twice so when it’s for free and quadruply so if it’s in a cool new technology like Neo4j. In case you’re unfamiliar with Neo4j, it’s a graph database – a novel database concept that belongs to the NoSQL class of databases, i.e. it does not follow a relational model. Rather, it allows for the storage of, and computation on, graphs.

From a purely mathematical perspective, a graph G(V,E) is formally defined as an ordered pair of a set of vertices V (called nodes in Neo4j) and a set of edges E (known as relationships in Neo4j). In other words, the first-class citizens of a graph are ‘things’ and ‘connections between things’. No doubt you can already think of a lot of problems that can be conceptualised as graph problems. Indeed, for a surprising number of things that don’t sound very graph-y at all, it is possible to make use of graph databases. Not that you should always do so (no single technology is a panacea for every problem and I would look very suspiciously at someone who would implement time series as a graph database), but that does not mean it’s not possible in most cases.

Which leads me to the appeal of Neo4j. In general, you had two approaches to graph operations until graph databases entered the scene. One was to write your own graph object model and have it persist in memory. That’s not bad, but a database it sure ain’t. The alternative was to decompose the graph into a table of vertices and their properties and another table of connections between vertices (an edge list) and then store it in a regular RDBMS or, somewhat more efficiently, in a NoSQL key-value store. That’s a little better, but it still requires considerable reinvention of the wheel.
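To make the second approach concrete, here is a minimal sketch in plain Python of a graph decomposed into a vertex table and a table of connections. The data, the relationship name `ACTED_IN` and the helper functions `neighbours` and `co_actors` are all made up for illustration; a real implementation would sit on top of an RDBMS or key-value store rather than in-memory containers.

```python
# Hypothetical toy data: a "vertex table" keyed by id...
vertices = {
    1: {"label": "Person", "name": "Kevin Bacon"},
    2: {"label": "Person", "name": "Tom Hanks"},
    3: {"label": "Movie", "title": "Apollo 13"},
}

# ...and a "connections table": one row per edge (source id, type, target id).
edges = [
    (1, "ACTED_IN", 3),
    (2, "ACTED_IN", 3),
]


def neighbours(vertex_id, rel_type):
    """Return the ids reachable from vertex_id over outgoing edges of rel_type."""
    return [dst for src, rel, dst in edges if src == vertex_id and rel == rel_type]


def co_actors(vertex_id):
    """Find everyone who acted in the same movie: a hand-written two-hop lookup."""
    movies = set(neighbours(vertex_id, "ACTED_IN"))
    return sorted(
        src
        for src, rel, dst in edges
        if rel == "ACTED_IN" and dst in movies and src != vertex_id
    )


print([vertices[v]["name"] for v in co_actors(1)])
```

Even this tiny two-hop question needs its own traversal function, which is the wheel reinvention in question: every new kind of pattern or path means more hand-written lookup code.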

The strength of graph databases is that they facilitate more complex operations, way beyond storage and retrieval of graphs, such as searching for patterns, properties and paths. One done-to-death example would be the famous problem known as Six Degrees of Kevin Bacon, a pop culture version of Erdős numbers: for an actor A and Kevin Bacon K within a graph G_Actors with A, K ∈ G_Actors, what is the shortest path (and is it below six jumps?) to get from A to K? Graph databases turn this into a simple query. Neo4j is one of the first industrial-grade graph DBs, with an enterprise-grade product that you can safely deploy in a production system without worrying too much about it. Written in Java, it’s stable, fast and has enough API wrappers to have some left over for the presents next Christmas. Alongside the more traditional APIs, it’s got a very friendly and very visual web-based interface that immediately plots your query results and a somewhat weird but ultimately not very counter-intuitive query language known as Cypher. As such, if graph problems are the kind of problem you deal with on a regular basis, taking Neo4j for a spin might be a very good idea.
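For a sense of what a graph database takes off your hands here, this is roughly what the Kevin Bacon question looks like when written by hand. It is a sketch only: a plain breadth-first search over an in-memory adjacency dict, with placeholder actor names and a hypothetical `shortest_path` helper. It says nothing about how Neo4j implements path finding internally.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the list of vertices on a shortest path, or None."""
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for nxt in adjacency.get(path[-1], ()):
            if nxt in visited:
                continue
            if nxt == goal:
                return path + [nxt]
            visited.add(nxt)
            queue.append(path + [nxt])
    return None  # goal is not reachable from start


# Hypothetical co-starring graph (undirected, so every edge is listed both ways).
co_starred = {
    "Actor A": ["Actor B"],
    "Actor B": ["Actor A", "Kevin Bacon"],
    "Kevin Bacon": ["Actor B"],
    "Actor Z": [],
}

path = shortest_path(co_starred, "Actor A", "Kevin Bacon")
hops = len(path) - 1
print(path, hops, hops < 6)
```

Breadth-first search is enough here because every hop costs the same. With weighted edges you would need a different algorithm, which is one more reason to let the database do the work.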

Which in turn leads me to the Neo4j certification. For the unbeatable price of $0.00, you can now sit for the esteemed title of Neo4j Certified Professional – that is, if you pass the 80-question, 60-minute time-capped test with a score of 80% or above. Now, don’t let the fact that it’s offered for free fool you – the test is pretty ferocious. It takes a fairly in-depth knowledge of Neo4j to pass: I’ve been around Neo4j ever since it has been around, and although I passed at the first try when I finally sat the exam recently, it was surprisingly hard even for me. The time cap means that even if you do decide to refer to your notes (I am not sure whether that counts as cheating – I personally did not, as it would have been far too time-intensive), you won’t be able to pass merely from notes. Worse, there are no practice exams, and preparation material is scarce outside the (rather pricey!) training courses. As such, I’ve written up the ten things I wish I had known before embarking upon the exam. While I did pass at the first try, it was a lot harder than I expected, and I would definitely have prepared for it differently had I known what it would be like! Fortunately, you can attempt it as often as you like at no cost, so it’s by no means an impossible task,[1] but you’re in for a ride if you wish to pass with a good score. Fasten your seat belt, flip up the tray table and put your seat in a fully upright position – it’s time to get Neo4j’d!

1. This is not a user test… it’s a user and DBA test.

I haven’t heard of a single Neo4j shop that had a dedicated Neo4j DBA to support graph operations. Which is ok – compared to the relatively arcane art of (enterprise) RDBMS DBAs, Neo4j is a breeze to configure. At the same time, the model seems to expect users to know what they’re doing themselves and be confident with some close-to-the-metal database tweaking. Good.

The downside is that about a quarter or so of the questions have to do with the configuration of Neo4j, and they do get into the nitty-gritty. You’re expected, for instance, to know fairly detailed minutiae of Enterprise edition High Availability server settings.

2. Pay attention to Cypher queries. The devil’s in the details.

If you’ve done as many multiple choice tests as I have, you know you’ve learned one thing for sure: all of them follow the same pattern. Two answers are complete bunk and anyone who’s done their reading can spot that. The remaining two are deceptively similar, however, and both sound ‘correct enough’. In the Neo4j test, this is mainly in the realm of the Cypher queries. A number of questions involve a ‘problem’ being described and four possible Cypher queries. The candidate must then spot which of these, or which several of these, answer the problem description. Often the correct answer may be distinguished from the incorrect one by as little as a correctly placed colon or a bracket closed in the right order. When in doubt, have a very sharp look at the Cypher syntax.

Oh, incidentally? The test makes relatively liberal use of the ‘both directions match’ (a)-[:RELATION]-(b) query pattern. This catches (a)-[:RELATION]->(b) as well as (b)-[:RELATION]->(a). The lack of the little arrow is easy to overlook and can lead you down the wrong path…
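If the difference feels abstract, the following sketch mimics the two patterns in plain Python over a tiny, made-up edge list. The names `match_directed` and `match_undirected` are hypothetical stand-ins for the Cypher patterns, not part of any Neo4j API, and the sketch ignores corner cases such as self-loops.

```python
# Hypothetical stored relationships: (start node, type, end node).
edges = [
    ("alice", "KNOWS", "bob"),
    ("carol", "KNOWS", "alice"),
]


def match_directed(edges, rel_type):
    """Analogue of (a)-[:REL]->(b): one (a, b) row per stored relationship."""
    return [(src, dst) for src, rel, dst in edges if rel == rel_type]


def match_undirected(edges, rel_type):
    """Analogue of (a)-[:REL]-(b): each relationship matches in both orientations."""
    rows = []
    for src, rel, dst in edges:
        if rel == rel_type:
            rows.append((src, dst))
            rows.append((dst, src))
    return rows


directed = match_directed(edges, "KNOWS")
undirected = match_undirected(edges, "KNOWS")
print(len(directed), len(undirected))
```

The undirected form returns twice as many rows here, including `("bob", "alice")`, which the directed form never produces. That is exactly the kind of difference a missing arrowhead hides in a multiple-choice answer.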

3. Develop query equivalence to second nature.

Python was built so that there would be one, and exactly one, right way to do everything. Sort of. Cypher is the opposite – there are dozens of ways to express certain relations, largely owing to the equivalence of relationships. As such, be aware of two equivalences. One is the equivalence of inline property filters and WHERE predicates:

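// Form 1: the property filter is written inline in the node pattern
MATCH (a:Person {name: "John Smith"})-[:REL]->(b)
RETURN a;

// Form 2: the same filter expressed as a WHERE predicate
MATCH (a:Person)-[:REL]->(b)
WHERE a.name = "John Smith"
RETURN a;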

Also, the following partials are equivalent, but not always:

(a)-[:FIRST_REL]->(b)<-[:SECOND_REL]-(c)

(a)-[:FIRST_REL]->(b)
(c)-[:SECOND_REL]->(b)

When you see a Cypher statement, you should be able to see all of its forms. Recap question: when are the statements in the second pair NOT equivalent?

4. The test is designed on the basis of the Enterprise edition.

Neo4j comes in two ‘flavours’ – Community and Enterprise. The latter has a lot of cool features, such as an error-resilient, distributed ‘High Availability’ mode. The certification’s premise is that you are familiar – and familiar to a fairly high degree, actually! – with many of the Enterprise-only features of Neo4j. As such, unless you’re fortunate enough to be an Enterprise user, it might pay to download the 30-day evaluation version of Neo4j Enterprise.

5. The test is generally well-written.

In other words, most things are fairly clear. By fairly clear, I mean that there is little ambiguity and it uses the same language as the reference (although, comparing test questions to phrases that stuck in my head and that I ended up checking after the test, just enough words are changed to deter would-be cheaters from Ctrl+F-ing through the manual!). There are no trick questions – so try to understand the questions in their most ‘mundane’, ‘trivial’ way. Yes, sometimes it is that simple!

6. TRUNCATE BRAINSPACE sql_clauses;

A lot of traditional SQL clauses (yes, TRUNCATE is one example – so is JOIN and its multifarious siblings, which describe a concept that simply does not exist in Neo4j) come up as red herrings in Cypher application questions. Try to force your brain to make a switch from SQL to Cypher – and don’t fall for the trap of instinctively thinking of the clauses in the SQL solution! Forget SQL. And most of all, forget its logic of selection – MATCHing is something rather different than SELECTing in SQL.

7. Have a 30,000ft overview of the subject

In particular, have an overview of what your options are to get particular things done. How can you access Neo4j? You might have spent 99% of your time on the web interface and/or interacting using the SDK, but there is actually a shell. How can you back up Neo4j, and what does a backup do? What are your options to monitor Neo4j? Once again, most users are more likely to think of one solution, perhaps two, when there are several more. The difficult thing about this test is that it requires you to be exhaustive – both in breadth and in depth.

8. Algorithms, statistics and aggregation

As far as I’m aware, everyone gets slightly different questions, but my test did not include anything about the graph algorithms inherent in Neo4j (good news for philistines, I mean, people who want to get stuff done). It did, however, include quite a bit of detail about aggregation functions. You make of that what you will.

9. Practice on Northwind but know the Movie DB like the back of your hand.

Out of the box, if you install Neo4j Community on your computer, you have two sample databases that the Browser offers to load into your instance – Movie and Northwind. The latter should be highly familiar to you if you have a past in relational databases. Meanwhile, the former is a Neo4j favourite, not least for the Kevin Bacon angle. If you did the self-paced Getting Started training (as you should have!), you’ll have used the Movie DB enough to get a good grip of it. Most of the questions on the test pertain or relate in some way to that graph, so a degree of familiarity can help you spot errors faster. At the same time, Northwind is both a better and bigger database, more fun to use and allows for more complex queries. Northwind should therefore be your educational tool, but you should know Movie rather well for that little plus of familiar feeling that can make the difference between passing and failing. Oh, by the way – while Getting Started is a great course, you will not stand a snowball’s chance in hell without the Production course. This is so even if you’ve done your fill of deployments and integrations – quite simply put, the breadth of the test is statistically very likely to be beyond your own experiences, even if you’ve done e.g. High Availability deployments yourself. In the real world, we specialise – for the test, however, you must be a generalist.

10. Refcards are your friends.

Start with the one for Cypher. Then build your own for High Availability. Laminate them and carry them around, if need be – or take the few functions or clauses that are your weak spots, put them on post-its and plaster them on your wall. Whatever helps – unless you’re writing Cypher code 24/7 (in which case, what are you doing here?), which I doubt happens a lot, there’s quite simply no substitute for seeing correct code and getting a feeling for good versus bad code. The test is incredibly fast-paced – 80 questions over 60 minutes gives you 45 seconds per question. At least 15-20 seconds of that goes on reading the question, if not more (it definitely was more for me – as noted, most questions repay a thorough reading!). Realistically, if you want to make that and have time to think about the more complex questions, you’ve got to be able to bang out the simple Cypher questions quickly (I’d say there were about 8-10 of them altogether, worth an average number of points, though I didn’t count them, which I do regret now).
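The time budget is worth spelling out once, using only the figures quoted in this post (80 questions, 60 minutes, an 80% pass mark, 15-20 seconds of reading per question). The sketch assumes every question carries the same weight, which is a simplification on my part; the variable names are just for illustration.

```python
QUESTIONS = 80
TIME_LIMIT_MINUTES = 60
PASS_PERCENT = 80
READING_SECONDS = (15, 20)  # rough reading time per question, as estimated above

# Average time available per question.
seconds_per_question = TIME_LIMIT_MINUTES * 60 / QUESTIONS

# Correct answers needed, rounded up, ASSUMING equally weighted questions.
correct_needed = -(-QUESTIONS * PASS_PERCENT // 100)

# What is left for actually thinking once the question has been read.
thinking_seconds = tuple(seconds_per_question - r for r in READING_SECONDS)

print(seconds_per_question, correct_needed, thinking_seconds)
```

Under those assumptions you can afford to get at most 16 questions wrong, and you have roughly half a minute of thinking time per question, which is why the simple Cypher questions need to be near-automatic.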

 

While the Neo4j certification exam is far from easy, it is doable (hey, if I can do it, so can you!). As graph databases are becoming increasingly important due to the recognition that they have the potential to accelerate certain calculations on graph data, coupled with the understanding that a lot of natural processes are in reality closer to relationship-driven interactions than the static picture that traditional RDBMS logic seeks to convey, knowing Neo4j is a definite asset for you and your team. Regardless of your intent to get certified and/or view on certifications in general (mine, too, is in general more on the less complimentary side), what you learn can be an indispensable asset in research and operations as well. Of course, I’m happy to answer any questions about Neo4j and the certification exam, insofar as my subjective views can make a valid contribution to the matter.

Update 15.02.2016: Neo4j community caretaker Michael Hunger has been so kind as to leave a comment on this article, pointing out that the scant feedback is intentional – it prevents re-takers from simply banging in the correct answers from the feedback e-mail. That makes perfect sense – and is not something I thought of. Thanks, Michael. He is also encouraging recent test takers to propose questions for the test – to me, it’s an unprecedented amazingness for a certificate provider to actually ask the community what they believe to be the cornerstones and benchmarks of knowledge in a particular field. So do take him up on that offer – his e-mail is in his comment below.

 

Title image credits: Dr Tamás Nepusz, Reconstructing the structure of the world-wide music scene with Last.fm.

References

1. I’ve been told that feedback on failed tests is fairly terrible – there is no feedback to most questions, and you’re not given the correct answers.