Merton_College_library_hall

The one study you shouldn’t write

I might have my own set of ideological prejudices,1Largely, they presume outlandish stuff like ‘human life is exceptional and always worth defending’ or ‘death does not cure illnesses’, you get my drift. while at the same time I am more sure than I am about any of these I am certain about this: show me proof that contradicts my most cherished beliefs, and I will read it, evaluate it critically and if correct, learn from it. This, incidentally, is how I ended up believing in God and casting away the atheism of my early teens, but that’s a lateral point.

As such, I’m in support of every kind of inquiry that does not, in its process, harm humans (I am, you may be shocked to learn, far more supportive of torturing raw data than people). There’s one exception. There is that one study for every sociologist, every data scientist, every statistician, every psychologist, everyone – that one study that you should never write: the study that proves how your ideological opponents are morons, psychotics and/or terminally flawed human beings.2For starters, I maintain we all are at the very least the latter, quite probably the middle one at least a portion of the time and, frankly, the first one more often than we would believe ourselves.

Virginia Commonwealth University scholar Brad Verhulst, Pete Hatemi (now at Penn State, my sources tell me) and poor old Lindon Eaves, who of all of the aforementioned should really know better than to darken his reputation with this sort of nonsense, have just learned this lesson at what I believe will be a minuscule cost to their careers compared to the consequence this error ought to cost any researcher in any field.

In 2012, the trio published an article in the American Journal of Political Science, titled Correlation not causation: the relationship between personality traits and political ideologies. Its conclusion was, erm, ground-breaking for anyone who knows conservatives from more than the caricatures they have been reduced to in the media:

First, in line with our expectations, higher P scores correlate with more conservative military attitudes and more socially conservative beliefs for both females and males. For males, the relationship between P and military attitudes (r = 0.388) is larger than the relationship between P and social attitudes (r = 0.292). Alternatively, for females, social attitudes correlate more highly with P (r = 0.383) than military attitudes (r = 0.302).

Further, we find a negative relationship between Neuroticism and economic conservatism (r_{females} = −0.242, $$r_{males}$$ = −0.239). People higher in Neuroticism tend to be more economically liberal.

(P, in the above, being the score in Eysenck’s psychoticism inventory.)

The most damning words in the above were among the very first. I am not sure what’s worst here: that actual educated people believe psychoticism correlates to military attitudes (because the military is known for courting psychotics, am I right? No? NO?!), or that they think it helps any case to disclose what is a blatant bias quite openly. In my lawyering years, if the prosecution expert had stated that the fingerprints on the murder weapon “matched those of that dirty crook over there, as I expected”, I’d have torn him to shreds, and so would any good lawyer. And that’s not because we’re born and raised bloodhounds but because we prefer people not to have biases in what they are supposed to opine on in a dispassionate, clear, clinical manner.

And this story confirms why that matters.

Four years after the paper came into print (why so late?), an erratum had to be  published (that, by the way, is still not replicated on a lot of sites that republished the piece). It so turns out that the gentlemen writing the study have ‘misread’ their numbers. Like, real bad.

The authors regret that there is an error in the published version of “Correlation not Causation: The Relationship between Personality Traits and Political Ideologies” American Journal of Political Science 56 (1), 34–51. The interpretation of the coding of the political attitude items in the descriptive and preliminary analyses portion of the manuscript was exactly reversed. Thus, where we indicated that higher scores in Table 1 (page 40) reflect a more conservative response, they actually reflect a more liberal response. Specifically, in the original manuscript, the descriptive analyses report that those higher in Eysenck’s psychoticism are more conservative, but they are actually more liberal; and where the original manuscript reports those higher in neuroticism and social desirability are more liberal, they are, in fact, more conservative. We highlight the specific errors and corrections by page number below:

It also magically turns out that the military is not full of psychotics.3Yes, I know a high Eysenck P score does not mean a person is ‘psychotic’ and Eysenck’s test is a personality trait test, not a test to diagnose a psychotic disorder. Whodda thunk.

…Ρ is substantially correlated with liberal military and social attitudes, while Social Desirability is related to conservative social attitudes, and Neuroticism is related to conservative economic attitudes.

“No shit, Sherlock,” as they say.

The authors’ explanation is that the dog ate their homework. Ok, only a little bit better: the responses were “miscoded”, i.e. it’s all the poor grad student sods’ fault. Their academic highnesses remain faultless:

The potential for an error in our article initially was pointed out by Steven G. Ludeke and Stig H. R. Rasmussen in their manuscript, “(Mis)understanding the relationship between personality and sociopolitical attitudes.” We found the source of the error only after an investigation going back to the original copies of the data. The data for the current paper and an earlier paper (Verhulst, Hatemi and Martin (2010) “The nature of the relationship between personality traits and political attitudes.” Personality and Individual Differences 49:306–316) were collected through two independent studies by Lindon Eaves in the U.S. and Nichols Martin in Australia. Data collection began in the 1980’s and finished in the 1990’s. The questionnaires were designed in collaboration with one of the goals being to be compare and combine the data for specific analyses. The data were combined into a single data set in the 2000’s to achieve this goal. Data are extracted on a project-by-project basis, and we found that during the extraction for the personality and attitudes project, the specific codebook used for the project was developed in error.

As a working data scientist and statistician, I’m not buying this. This study has, for all its faults, intricate statistical methods. It’s well done from a technical standpoint. It uses Cholesky decomposition and displays a relatively sophisticated statistical approach, even if it’s at times bordering on the bizarre. The causal analysis is an absolute mess, and I have no idea where the authors have gotten the idea that a correlation over 0.2 is “large enough for further consideration”. That’s not a scientifically accepted idea. A correlation is significant or not significant. There is no weird middle way of “give us more money, let’s look into it more”. The point remains, however, that the authors, while practising a good deal of cargo cult science, have managed to oversee an epic blunder like this. How could that have happened?

Well, really, how could it have happened? I trust this should be explained by the words I’ve pointed out before. The authors had what is called “cognitive contamination” in the field of criminal forensic science. The authors had an idea about conservatives and liberals and what they are like. These ideas were caricaturesque to the extreme. They were blind as a bat, blinded by their own ideological biases.

And there goes my point. There are, sometimes, articles that you shouldn’t write.

Let me give you an analogy. My religion has some pretty clear rules about what married people are, and aren’t, allowed to do. Now, what my religion also happens to say is that it’s easier not to mess up these things if you do not engage in temptation. If you are a drug addict, you should not hang out with coke heads. If you are a recovering alcoholic, you would not exactly benefit from hanging out with your friends on a drunken revelry. If you’ve got political convictions, you are more prone to say stupid things when you find a result that confirms your ideas. The term for this is ‘confirmation bias’, the reality is that it’s the simple human proneness to see what we want to see.

Do you remember how as a child, you used to play the game of seeing shapes in clouds? Puppies, cows, elephants and horses? The human brain works on the basis of a Gestalt principle of reification, allowing us to reconstruct known things from its parts. It’s essential to the way our brain works. But it’s also making us see the things we want to see, not what we’re actually seeing.

And that’s why you should never write that one article. The one where you explain why the other side is dumb, evil or has psychotic and/or neurotic traits.

References   [ + ]

1. Largely, they presume outlandish stuff like ‘human life is exceptional and always worth defending’ or ‘death does not cure illnesses’, you get my drift.
2. For starters, I maintain we all are at the very least the latter, quite probably the middle one at least a portion of the time and, frankly, the first one more often than we would believe ourselves.
3. Yes, I know a high Eysenck P score does not mean a person is ‘psychotic’ and Eysenck’s test is a personality trait test, not a test to diagnose a psychotic disorder.
men-in-black

Give your Twitter account a memory wipe… for free.

The other day, my wife has decided to get rid of all the tweets on one of her twitter accounts, while of course retaining all the followers. But bulk deleting tweets is far from easy. There are, fortunately, plenty of tools that offer you the service of bulk deleting your tweets… for a cost, of course. One had a freemium model that allowed three free deletes per day. I quickly calculated that it would have taken my wife something on the order of twelve years to get rid of all her tweets. No, seriously. That’s silly. I can write some Python code to do that faster, can’t I?

Turns out you can. First, of course, you’ll need to create a Twitter app from the account you wish to wipe and generate an access token, since we’ll also be performing actions on behalf of the account.

import tweepy
import time

CONSUMER_KEY=<your consumer key>
CONSUMER_SECRET=<your consumer secret>
ACCESS_TOKEN=<your access token>
ACCESS_TOKEN_SECRET=<your access token secret>
SCREEN_NAME=<your screen name, without the @>

Time to use tweepy’s OAuth handler to connect to the Twitter API:

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

Now, we could technically write an extremely sophisticated script, which looks at the returned headers to determine when we will be cut off by the API throttle… but we’ll use the easy and brutish route of holding off for a whole hour if we get cut off. At 350 requests per hour, each capable of deleting 100 tweets, we can get rid of a 35,000 tweet account in a single hour with no waiting time, which is fairly decent.

The approach will be simple: we ask for batches of 100 tweets, then call the .destroy() method on each of them, which thanks to tweepy is now bound into the object representing every tweet we receive. If we encounter errors, we respond accordingly: if it’s a RateLimitError, an error object from tweepy that – as its name suggests – shows that the rate limit has been exceeded, we’ll hold off for an hour (we could elicit the reset time from headers, but this is much simpler… and we’ve got time!), if it can’t find the status we simply leap over it (sometimes that happens, especially when someone is doing some manual deleting at the same time) and otherwise, we break the loops.

def destroy():
    while True:
        q = api.user_timeline(screen_name=SCREEN_NAME,
                              count=100)
        for each in q:
            try:
                each.destroy()
            except tweepy.RateLimitError as e:
                print (u"Rate limit exceeded: {0:s}".format(e.message))
                time.sleep(3600)
            except tweepy.TweepError as e:
                if e.message == "No status found with that ID.":
                    continue
            except Exception as e:
                print (u"Encountered undefined error: {0:s}".format(e.message))
                break
        break

Finally, we’ll make sure this is called as the module default:

if __name__ == '__main__':
    destroy()

Happy destruction!

IMG911-B-c

Immortal questions

When asked for a title for his 1979 collection of philosophical papers, my all-time favourite philosopher1That does not mean I agree with even half of what he’s saying. But I do undoubtedly acknowledge his talent, agility of mind, style of writing, his knowledge and his ability to write good and engaging papers that have not yet fallen victim to the neo-sophistry dominating universities. Thomas Nagel chose the title Mortal Questions, an apt title, for most of our philosophical preoccupations (and especially those pertaining to the broad realm of moral philosophy) stem from the simple fact that we’re all mortal, and human life is as such an irreplaceable good. By extension, most things that can be created by humans are capable of being destroyed by humans.

That time is ending, and we need a new ethics for that.

Consider the internet. We all know it’s vulnerable, but is it existentially vulnerable?2I define existential vulnerability as being capable of being destroyed by an adversary that does not require the adversary to accept an immense loss or undertake a nonsensically arduous task. For example, it is possible to kill the internet by nuking the whole planet, but that would be rather disproportionate. Equally, destruction of major lines of transmission may at best isolate bits of the internet (think of it in graph theory terms as turning the internet from a connected graph into a spanning acyclic tree), but it takes rather more to kill off everything. On the other hand, your home network is existentially vulnerable. I kill router, game over, good night and good luck. The answer is probably no. Neither would any significantly distributed self-provisioning pseudo-AI be. And by pseudo-AI, I don’t even mean a particularly clever or futuristic or independently reasoning system, but rather a system that can provision resources for itself in response to threat factors just as certain balancers and computational systems we write and use on a day to day basis can commission themselves new cloud resources to carry out their mandate. Based on their mandate, such systems are potentially existentially immortal/existentially indestructible.3As in, lack existential vulnerability.

The human factor in this is that such a system will be constrained by mandates we give them. Ergo,4According to my professors at Oxford, my impatience towards others who don’t see the connections I do has led me to try to make up for it by the rather annoying verbal tic of overusing ‘thus’ at the start of every other sentence. I wrote a TeX macro that automatically replaced it with neatly italicised ‘Ergo‘. Sometimes, I wonder why they never decided to drown me in the Cherwell. those mandates are as fit a subject for human moral reasoning as any other human action.

Which means we’re going to need that new ethics pretty darn’ fast, for there isn’t a lot of time left. Distributed systems, smart contracts, trustless M2M protocols, the plethora of algorithms that have arisen that each bring us a bit closer to a machine capable of drawing subtle conclusions from source data (hidden Markov models, 21st century incarnations of fuzzy logic, certain sorts of programmatic higher order logic and a few other factors are all moving towards an expansion of what we as humans can create and the freedom we can give our applications. Who, even ten years ago, would have thought that one day I will be able to give a computing cluster my credit card and if it ran out of juice, it could commission additional resources until it bled me dry and I had to field angry questions from my wife? And that was a simple dumb computing cluster. Can you teach a computing cluster to defend itself? Why the heck not, right?

Geeks who grew up on Asimov’s laws of robotics, myself included, think of this sort of problem as largely being one of giving the ‘right’ mandates to the system, overriding mandates to keep itself safe, not to harm humans,5…or at least not to harm a given list of humans or a given type of humans. or the like. But any sufficiently well-written system will eventually grow to the level of the annoying six-year-old, who lives for the sole purpose of trying to twist and redefine his parents’ words to mean the opposite of what they intended.6Many of these, myself included, are at risk of becoming lawyers. Parents, talk to your kids. If you don’t talk to them about the evils of law school, who will? In the human world, a mandate takes place in a context. A writ is executed within a legal system. An order by a superior officer is executed according to the applicable rules of military justice, including circumstances when the order ought not be carried out. Passing these complex human contexts, which most of us ignore as we do all the things we grew up with and take for granted, into a more complicated model may not be feasible. Rules cannot be formulated exhaustively,7H.L.A. Hart makes some good points regarding this as such a formulation by definition would have to encompass all past, present and future – all that potentially can happen. Thus, the issue moves on soon from merely providing mandates to what in the human world is known as ‘statutory construction’ or interpretation of legislative works. How are computers equipped to reason about symbolic propositions according to rules that we humans can predict? In other words, how can we teach rules of reasoning about rules in a way that is not inherently recursing this question (i.e. is not based on a simple conditional rule based framework).

Which means that the best that can be provided in such a situation is a framework based on values, and target optimisation algorithms (i.e. what’s the best way to reach the overriding objective with least damage to other objectives and so on). Which in turn will need a good bit of rethinking ethical norms.

But the bottom line is quite simple: we’re about to start creating immortals. Right now, you can put data on distributed file infrastructures like IPFS that’s effectively impossible to destroy using a reasonable amount of resources. Equally, distributed applications via survivable infrastructures such as the blockchain, as well as smart contract platforms, are relatively immortal. The creation of these is within the power of just about everyone with a modicum of computing skills. The rise of powerful distributed execution engines for smart contracts, like Maverick Labs’ Aletheia Platform,8Mandatory disclosure: I’m one of the creators of Aletheia, and a shareholder and CTO of its parent corporation. will give a burst of impetus to systems’ ability to self-provision, enter into contracts, procure services and thus even effect their own protection (or destruction). They are incarnate, and they are immortal. For what it’s worth, man is steps away from creating its own brand of deities.9For the avoidance of doubt: as a Christian, a scientist and a developer of some pretty darn complex things, I do not believe that these constructs, even if omnipotent, omniscient and omnipresent as they someday will be by leveraging IoT and surveillance networks, are anything like my capital-G God. For lack of space, there’s no way to go into an exhaustive level of detail here, but my God is not defined by its omniscience and omnipotence, it’s defined by his grace, mercy and love for us. I’d like to see an AI become incarnate and then suffer and die for the salvation of all of humanity and the forgiveness of sins. The true power of God, which no machine will ever come close to, was never as strongly demonstrated as when the child Jesus lay in the manger, among animals, ready to give Himself up to save a fallen, broken humanity. And I don’t see any machine ever coming close to that.

What are the ethics of creating a god? What is right and wrong in this odd, novel context? What is good and evil to a device?

The time to figure out these questions is running out with merciless rapidity.

Title image: God the Architect of the Universe, Codex Vindobonensis 2554, f1.v

References   [ + ]

1. That does not mean I agree with even half of what he’s saying. But I do undoubtedly acknowledge his talent, agility of mind, style of writing, his knowledge and his ability to write good and engaging papers that have not yet fallen victim to the neo-sophistry dominating universities.
2. I define existential vulnerability as being capable of being destroyed by an adversary that does not require the adversary to accept an immense loss or undertake a nonsensically arduous task. For example, it is possible to kill the internet by nuking the whole planet, but that would be rather disproportionate. Equally, destruction of major lines of transmission may at best isolate bits of the internet (think of it in graph theory terms as turning the internet from a connected graph into a spanning acyclic tree), but it takes rather more to kill off everything. On the other hand, your home network is existentially vulnerable. I kill router, game over, good night and good luck.
3. As in, lack existential vulnerability.
4. According to my professors at Oxford, my impatience towards others who don’t see the connections I do has led me to try to make up for it by the rather annoying verbal tic of overusing ‘thus’ at the start of every other sentence. I wrote a TeX macro that automatically replaced it with neatly italicised ‘Ergo‘. Sometimes, I wonder why they never decided to drown me in the Cherwell.
5. …or at least not to harm a given list of humans or a given type of humans.
6. Many of these, myself included, are at risk of becoming lawyers. Parents, talk to your kids. If you don’t talk to them about the evils of law school, who will?
7. H.L.A. Hart makes some good points regarding this
8. Mandatory disclosure: I’m one of the creators of Aletheia, and a shareholder and CTO of its parent corporation.
9. For the avoidance of doubt: as a Christian, a scientist and a developer of some pretty darn complex things, I do not believe that these constructs, even if omnipotent, omniscient and omnipresent as they someday will be by leveraging IoT and surveillance networks, are anything like my capital-G God. For lack of space, there’s no way to go into an exhaustive level of detail here, but my God is not defined by its omniscience and omnipotence, it’s defined by his grace, mercy and love for us. I’d like to see an AI become incarnate and then suffer and die for the salvation of all of humanity and the forgiveness of sins. The true power of God, which no machine will ever come close to, was never as strongly demonstrated as when the child Jesus lay in the manger, among animals, ready to give Himself up to save a fallen, broken humanity. And I don’t see any machine ever coming close to that.
Screenshot 2016-05-13 11.27.11

Actually, yes, you should sometimes share your talent for free.

Adam Hess is a ‘comedian’. I don’t know what that means these days, so I’ll give him the benefit of doubt here and assume that he’s someone paid to be funny rather than someone living with their parents and occasionally embarrassing themselves at Saturday Night Open Mic. I came across his tweet from yesterday, in which he attempted some sarcasm aimed at an advertisement in which Sainsbury’s was looking for an artist who would, free of charge, refurbish their canteen in Camden.

Now, I’m married to an artist. I have dabbled in art myself, though with the acute awareness that I’ll never make a darn penny anytime soon given my utter lack of a) skills, b) talent. As such, I have a good deal of compassion for artists who are upset when clients, especially fairly wealthy ones, ask young artists and designers at the beginning of their career to create something for free. You wouldn’t tell a junior solicitor or a freshly qualified accountant to do your legal matters or your accounts for free to ‘gain experience’, ‘get some exposure’ and ‘perhaps get some future business’. It invalidates the fact that artists are, like any other profession, working for a living and have got bills to pay.

Then there’s the reverse of the medal. I spend my life in a profession that has a whole culture of giving our knowledge, skills and time away for free. The result is an immense body of code and knowledge that is, I repeat, publicly available for free. Perhaps, if you’re not in the tech industry, you might want to stop and think about this for five minutes. The multi-trillion industry that is the internet and its associated revenue streams, from e-commerce through Netflix to, uh, porn (regrettably, a major source of internet-based revenue), rely for its very operation on software that people have built for no recompense at all, and/or which was open-sourced by large companies. Over half of all web servers globally run Apache or nginx, both having open-source licences.1Apache and a BSD variant licence, respectively. To put it in other words – over half the servers on the internet use software for which the creators are not paid a single penny.

The most widespread blog engine, WordPress, is open source. Most servers running SaaS products use an open-source OS, usually something *nix based. Virtually all programming languages are open-source – freely available and provided for no recompense. Closer to the base layer of the internet, the entire TCP/IP stack is open, as is BIND, the de facto gold standard for DNS servers.2DNS servers translate verbose and easy-to-remember domain names to IP addresses, which are not that easy to remember. And whatever your field, chances are, there is a significant open source community in it.

Over the last decade and a bit, I have open-sourced quite a bit of code myself. That’s, to use Mr Hess’s snark, free stuff I produced to, among others, ‘impress’ employers. A few years ago, I attended an interview for the data department of a food retailer. As a ‘show and tell’ piece, I brought them a client for their API that I built and open-sourced over the days preceding the interview.3An API is the way third-party software can communicate with a service. API wrappers or API clients are applications written for a particular language that translate the API to objects native to that language. They were ready to offer me the job right there and then. But it takes patience and faith – patience to understand that rewards for this sort of work are not immediate and faith in one’s own skills to know that they will someday be recognised. That is, of course, not the sole reason – or even the main reason – why I open-source software, but I would lie if I pretended it was not sometimes at the back of my head.

At which point it’s somewhat ironic to see Mr Hess complain about an artist being asked to do something for free (and he wasn’t even approached – this is a public advertisement in a local fishwrap!) while using a software pipeline worth millions that people have built, and simply given away, for free, for the betterment of our species and our shared humanity.

Worse, it’s quite clear that this seems to be an initiative not by Sainsbury’s but rather by a few workers who want slightly nicer surroundings but cannot afford to pay for it. Note that it’s the staff canteen, rather than customer areas, that are to be decorated. At this point, Mr Hess sounds greedier than Sainsbury’s. Who, really, is ‘exploiting’ whom here?

In my business life, I would estimate the return I get from work done free of charge at 2-300% long term. That includes, for the avoidance of doubt, people for whom I’ve done work who ended up not paying me anything at all ever. I’m not sure how it works in comedy, but in the real world, occasionally doing something for someone else without demanding recompense is not only lucrative, it’s also beneficial in other ways:

  • It builds connections because it personalises a business relationship.
  • It builds character because it teaches the value of selflessness.
  • And it’s fun. Frankly, the best times I’ve had during my working career usually involved unpaid engagements, free-of-charge investments of time, open-source contributions or volunteer work.

The sad fact is that many, like Mr Hess, confuse righteous indignation about those who seek to profit off ‘young artists’ by exploiting them with the terrific, horrific, scary prospect of doing something for free just once in a blue moon.

Fortunately, there are plenty of young artists eager to show their skills who either have more business acumen than Mr Hess or more common sense than to publicly snub their noses at the fearsome prospect of actually doing something they are [supposed to be] enjoying for free. As such, I doubt that the Camden Sainsbury’s canteen will go undecorated.

Of the 800 or so retweets, I see few who would heed a word of wisdom, as I see the retweets are awash with remarks that are various degrees of confused, irate or just full of creative smuggity smugness), but for the rest, I’d venture the following word of wisdom:4Credited to Dale Carnegie, but reportedly in use much earlier.

If you want to make a million dollars, you’ve got to first make a million people happy.

The much-envied wealth of Silicon Valley did not happen because they greedily demanded an hourly rate for every line of code they ever produces. It happened because of the realisation that we all are but dwarfs on the shoulders of giants, and ultimately our lives are going to be made not by what we secret away but by what others share to lift us up, and what we share to lift up others with.

You are the light of the world. A city seated on a mountain cannot be hid. Neither do men light a candle and put it under a bushel, but upon a candlestick, that it may shine to all that are in the house. So let your light shine before men, that they may see your good works, and glorify your Father who is in heaven.

Matthew 5:14-16

Title image: The blind Orion carries Cedalion on his shoulders, from Nicolas Poussin’s The Blind Orion Searching for the Rising Sun, 1658. Oil on canvas; 46 7/8 x 72 in. (119.1 x 182.9 cm), Metropolitan Museum of Art.

References   [ + ]

1. Apache and a BSD variant licence, respectively.
2. DNS servers translate verbose and easy-to-remember domain names to IP addresses, which are not that easy to remember.
3. An API is the way third-party software can communicate with a service. API wrappers or API clients are applications written for a particular language that translate the API to objects native to that language.
4. Credited to Dale Carnegie, but reportedly in use much earlier.
rotor-cipher-machine-1147801_1280

Diffie-Hellman in under 25 lines

How can you and I agree on a secret without anyone eavesdropping being able to intercept our communications? At first, the idea sounds absurd – for the longest time, without a pre-shared secret, encryption was seen as impossible. In World War II, the Enigma machines relied on a fairly complex pre-shared secret – the Enigma configurations (consisting of the rotor drum wirings and number of rotors specific to the model, the Ringstellung of the day, and Steckbrett configurations) were effectively the pre-shared key. During the Cold War, field operatives were provided with one-time pads (OTPs), randomly (if they were lucky) or pseudorandomly (if they weren’t, which was most of the time) generated1As a child, I once built a pseudorandom number generator from a sound card, a piece of wire and some stray radio electronics, which basically rested on a sampling of atmospheric noise. I was surprised to learn much later that this was the method the KGB used as well. one time pads (OTPs) with which to encrypt their messages. Cold War era Soviet OTPs were, of course, vulnerable because like most Soviet things, they were manufactured sloppily.2Under pressure from the advancing German Wehrmacht in 1941, they had duplicated over 30,000 pages worth of OTP code. This broke the golden rule of OTPs of never, ever reusing code, and ended up with a backdoor that two of the most eminent female cryptanalysts of the 20th, Genevieve Grotjan Feinstein and Meredith Gardner, on whose shoulders the success of the Venona project rested, could exploit. But OTPs are vulnerable to a big problem: if the key is known, the entire scheme of encryption is defeated. And somehow, you need to get that key to your field operative.

Enter the triad of Merkle, Diffie and Hellman, who in 1976 found a way to exploit the fact that multiplying primes is simple but decomposing a large number into the product of two primes is difficult. From this, they derived the algorithm that came to be known as the Diffie-Hellman algorithm.3It deserves noting that the D-H key exchange algorithm was another of those inventions that were invented twice but published once. In 1975, the GCHQ team around Clifford Cocks invented the same algorithm, but was barred from publishing it. Their achievements weren’t recognised until 1997.

5535098

How to cook up a key exchange algorithm

The idea of a key exchange algorithm is to end up with a shared secret without having to exchange anything that would require transmission of the secret. In other words, the assumption is that the communication channel is unsafe. The algorithm must withstand an eavesdropper knowing every single exchange.

Alice and Bob must first agree to use a modulus p and a baseg, so that the base is a primitive root modulo the modulus.

Alice and Bob each choose a secret key a and b respectively – ideally, randomly generated. The parties then exchange A = g^a \mod(p) (for Alice) and B = g^b \mod(p) (for Bob).

Alice now has received B. She goes on to compute the shared secret s by calculating B^a \mod(p) and Bob computes it by calculating A^b \mod(p).

The whole story is premised on the equality of

A^b \mod(p) = B^a \mod(p)

That this holds nearly trivially true should be evident from substituting g^b for B and g^a for A. Then,

g^{ab} \mod(p) = g^{ba} \mod(p)

Thus, both parties get the same shared secret. An eavesdropper would be able to get A and B. Given a sufficiently large prime for p, in the range of 6-700 digits, the discrete logarithm problem of retrieving a from B^a \mod(p) in the knowledge of B and p is not efficiently solvable, not even given fairly extensive computing resources. Read more

References   [ + ]

1. As a child, I once built a pseudorandom number generator from a sound card, a piece of wire and some stray radio electronics, which basically rested on a sampling of atmospheric noise. I was surprised to learn much later that this was the method the KGB used as well.
2. Under pressure from the advancing German Wehrmacht in 1941, they had duplicated over 30,000 pages worth of OTP code. This broke the golden rule of OTPs of never, ever reusing code, and ended up with a backdoor that two of the most eminent female cryptanalysts of the 20th, Genevieve Grotjan Feinstein and Meredith Gardner, on whose shoulders the success of the Venona project rested, could exploit.
3. It deserves noting that the D-H key exchange algorithm was another of those inventions that were invented twice but published once. In 1975, the GCHQ team around Clifford Cocks invented the same algorithm, but was barred from publishing it. Their achievements weren’t recognised until 1997.
Panna cotta time!

Panna cotta time!

Panna cotta time!

Summertime is panna cotta time! A panna cotta (Italian for ‘cooked cream’) is a great dessert for hot days, as it’s light, does not melt (like chocolate does), and feels cool without weighing your tummy down. It can even substitute for a full meal as it’s a fairly strong dish.


15′ + 3-5h in fridge
Easy peasy

Ingredients

  • 3 cups of heavy cream (‘double cream’ for Limeys) or mascarpone
  • 1/3 cup fine sugar
  • 35ml milk
  • 2 teaspoonfuls of vanilla extract, ideally alcoholic
  • 1 tablespoon or 2 normal sheets of gelatin (be sure to get one you trust, bad gelatin is worse than no gelatin!)
  • Frozen fruit (raspberries, blueberries and forest fruits are generally the best) – alternatively, simply keep the fruit in the fridge for 3-4 hours
  • Finely grated lemon peel (the real thing, not freeze-dried crap)

  1. Add the milk to the saucepan and gently warm. Dissolve the mascarpone or cream in the saucepan, using a whisk if needed.
  2. Add the vanilla extract.
  3. In a separate saucepan, warm up 25-30ml water and dissolve the gelatin.
  4. Pour gelatin into the milk/cream mixture and gently dissolve.
  5. Divide among 6-8 ramekin dishes or small Kilner jars.
  6. Drop in the cold fruits.
  7. Sprinkle lemon peel over the mixture.
  8. Put into fridge, covering it either only very gently with a paper towel or not at all.
  9. Leave to cool for 3-4 hours. Enjoy cold, with a root beer or as a treat on a hot summer day.
watman

What’s the value of a conditional clause?

No, seriously, bear with me. I haven’t lost my mind. Consider the following.

Joe, a citizen of Utopia, makes Ut$142,000 a year. In Utopia, you pay 25% on your first Ut$120,000 and from there on 35% on all earnings above. Let’s calculate Joe’s tax rate.

Trivial, no? JavaScript:

var income = 142000;
var tax_to_pay = 0;

if(income <= 120000){
 tax_to_pay = income * 0.25;
} else {
 tax_to_pay = 30000 + (income - 120000) * 0.35;
}

console.log(tax_to_pay);

And Python:

income = 142000

if income <= 120000
    tax_to_pay = income * 0.25
else
    tax_to_pay = 30000 + (income - 120000) * 0.35

print(tax_to_pay)

And so on. Now let’s consider the weirdo in the ranks, Julia:

income = 142000

if income <= 120000:
    tax_to_pay = income * 0.25
else
    tax_to_pay = 30000 + (income - 120000) * 0.35
end

Returns 37700.0 all right. But now watch what Julia can do and the other languages (mostly) can’t! The following is perfectly correct Julia code.

income = 142000

tax_to_pay = (if income <= 120000
                  income * 0.25
              else
                  30000 + (income - 120000) * 0.35
              end)

print(tax_to_pay)

This, too, will return 37700.0. Now, you might say, that’s basically no different from a ternary operator. Except unlike with ternary ops in most languages, you can actually put as much code there as you want and have as many side effects as your heart desires, while still being able to assign the result of the last calculation within the block that ends up getting executed to the variable at the head.

Now, that raises the question of what the value of a while expression is. Any guesses?

i = 0

digits = (while i < 10
                i = i + 1
          end)

Well, Julia says 0123456789 when executing it, so digits surely must be…

julia> digits


Wait, wat?! That must be wrong. Let’s type check it.

julia> typeof(digits)
Void

So there you have it. A conditional has a value, a while loop doesn’t… even if both return a value. Sometimes, you’ve gotta love Julia, kick back with a stiff gin and listen to Gary Bernhardt.

Screenshot 2016-02-14 13.13.28

10 tips for passing the Neo4j Certified Professional examination

Everybody loves a good certification. Twice so when it’s for free and quadruply so if it’s in a cool new technology like Neo4j. In case you’re unfamiliar with Neo4j, it’s a graph database – a novel database concept that belongs to the NoSQL class of databases, i.e. it does not follow a relational model. Rather, it allows for the storage of, and computation on, graphs.

From a purely mathematical perspective, a graph G(V,E) is formally defined as an ordered pair of vertices V (called nodes in Neo4j) and edges E (known as relationships in Neo4j). In other words, the first class citizens of a graph are ‘things’ and ‘connections between things’. No doubt you can already think of a lot of problems that can be conceptualised as graph problems. Indeed, for a surprising number of things that don’t sound very graph-y at all, it is possible to make use of graph databases. Not that you should always do so (no single technology is a panacea to every problem and I would look very suspiciously at someone who would implement time series as a graph database), but that does not mean it’s not possible in most cases.

Which leads me to the appeal of Neo4j. In general, you had two approaches to graph operations until graph databases entered the scene. One was to write your own graph object model and have it persist in memory. That’s not bad, but a database it sure ain’t. Meanwhile, an alternative is to decompose the graph into a table of vertices and its properties and another table of connections between vertices (an adjacency matrix) and then store it in a regular RDBMS or, somewhat more efficiently, in a NoSQL key-value store. That’s a little better, but it still requires considerable reinvention of the wheel.

The strength of graph databases is that they facilitate more complex operations, way beyond storage and retrieval of graphs, such as searching for patterns, properties and paths. One done-to-death example would be the famous problem known as Six Degrees of Kevin Bacon, a pop culture version of Erdös numbers: for an actor A and a Kevin Bacon K within a graph G_{Actors} with A, K \in G_{Actors}, what is the shortest path (and is it below six jumps?) to get from A to K? Graph databases turn this into a simple query. Neo4j is one of the first industrial grade graph DBs, with an enterprise grade product that you can safely deploy in a production system without worrying too much about it. Written in Java, it’s stable, fast and has enough API wrappers to have some left over for the presents next Christmas. Alongside the more traditional APIs, it’s got a very friendly and very visual web-based interface that immediately plots your query results and a somewhat weird but ultimately not very counter-intuitive query language known as Cypher. As such, if graph problems are the kind of problem you deal with on a regular basis, taking Neo4j for a spin might be a very good idea.

Which in turn leads me to the Neo4j certification. For the unbeatable price of $0.00, you can now sit for the esteemed title of Neo4j Certified Professional – that is, if you pass the 80-question, 60-minute time-capped test with a score of 80% or above. Now, let not the fact that it’s offered for free deter you – the test is pretty ferocious. It takes a fairly in-depth knowledge of Neo4j to pass (I’ve been around Neo4j ever since it has been around, and while I’ve never tried it and passed at first try recently, it has been surprisingly hard even for me!), the time cap means that even if you do decide to refer to your notes (I am not sure if that’s not cheating – I personally did not, as it was just so time-intensive), you won’t be able to pass merely from notes. Worse, there are no test exams and preparation material is scarce outside (rather pricey!) trainings. As such, I’ve written up the ten things I wish I had known before embarking upon the exam. While I did pass at the first try, it was a lot harder than I expected and I would definitely have prepared for it differently, had I known what it would be like! Fortunately, you can attempt it as often as you would like for no cost, and as such it’s by no means an impossible task,1I’ve been told that feedback on failed tests is fairly terrible – there is no feedback to most questions, and you’re not given the correct answers. but you’re in for a ride if you wish to pass with a good score. Fasten your seat belt, flip up the tray table and put your seat in a fully upright position – it’s time to get Neo4j’d!

1. This is not a user test… it’s a user and DBA test.

I haven’t heard of a single Neo4j shop that had a dedicated Neo4j DBA to support graph operations. Which is ok – compared to the relatively arcane art of (enterprise) RDBMS DBAs, Neo4j is a breeze to configure. At the same time, the model seems to expect users to know what they’re doing themselves and be confident with some close-to-the-metal database tweaking. Good.

The downside is that about a quarter or so of the questions have to do with the configuration of Neo4j, and they do get into the nitty-gritty. You’re expected, for instance, to know fairly detailed minutiae of Enterprise edition High Availability server settings.

2. Pay attention to Cypher queries. The devil’s in the details.

If you’ve done as many multiple choice tests as I have, you know you’ve learned one thing for sure: all of them follow the same pattern. Two answers are complete bunk and anyone who’s done their reading can spot that. The remaining two are deceptively similar, however, and both sound ‘correct enough’. In the Neo4j test, this is mainly in the realm of the Cypher queries. A number of questions involve a ‘problem’ being described and four possible Cypher queries. The candidate must then spot which of these, or which several of these, answer the problem description. Often the correct answer may be distinguished from the incorrect one by as little as a correctly placed colon or a bracket closed in the right order. When in doubt, have a very sharp look at the Cypher syntax.

Oh, incidentally? The test makes relatively liberal use of the ‘both directions match’ (a)-[:RELATION]-(b) query pattern. This catches (a)-[:RELATION]->(b) as well as (b)-[:RELATION]->(a). The lack of the little arrow is easy to overlook and can lead you down the wrong path…

3. Develop query equivalence to second nature.

Python was built so that there would be one, and exactly one, right way to do everything. Sort of. Cypher is the opposite – there are dozens of ways to express certain relations, largely owing to the equivalence of relationships. As such, be aware of two equivalences. One is the equivalence of inline parameters and WHERE parameters:

MATCH (a:Person {name: "John Smith"})-[:REL]->(b)
RETURN a;
MATCH (a:Person)-[:REL]->(b)
WHERE a.name = "John Smith"
RETURN a;

Also, the following partials are equivalent, but not always:

(a)-[:FIRST_REL]->(b)<-[:SECOND_REL]-(c)
(a)-[:FIRST_REL]->(b)
(c)-[:SECOND_REL]->(b)

When you see a Cypher statement, you should be able to see all of its forms. Recap question: when are the statements in the second pair NOT equivalent?

4. The test is designed on the basis of the Enterprise edition.

Neo4j comes in two ‘flavours’ – Community and Enterprise. The latter has a lot of cool features, such as an error-resilient, distributed ‘High Availability’ mode. The certification’s premise is that you are familiar – and familiar to a fairly high degree, actually! – with many of the Enterprise-only features of Neo4j. As such, unless you’re fortunate enough to be an enterprise user, it might repay itself to download the 30-day evaluation version of Neo4j Enterprise.

5. The test is generally well-written.

In other words, most things are fairly clear. By fairly clear, I mean that there is little ambiguity and it uses the same language as the reference (although comparing test questions to phrases that stuck in my head which I ended up checking after the test, just enough words are changed to deter would-be cheaters from Ctrl+F-ing through the manual! There are no trick questions – so try to understand the questions in their most ‘mundane’, ‘trivial’ way. Yes, sometimes it is that simple!

6. TRUNCATE BRAINSPACE sql_clauses;

A lot of traditional SQL clauses (yes, TRUNCATE is one example – so is JOIN and its multifarious siblings, which describe a concept that simply does not exist in Neo4j) come up as red herrings in Cypher application questions. Try to force your brain to make a switch from SQL to Cypher – and don’t fall for the trap of instinctively thinking of the clauses in the SQL solution! Forget SQL. And most of all, forget its logic of selection – MATCHing is something rather different than SELECTing in SQL.

7. Have a 30,000ft overview of the subject

In particular, have an overview of what your options are to get particular things done. How can you access Neo4j? You might have spent 99% of your time on the web interface and/or interacting using the SDK, but there is actually a shell. How can you backup from Neo4j, and what does backup do? What are your options to monitor Neo4j? Once again, most users are more likely to think of one solution, perhaps two, when there are several more. The difficult thing about this test is that it requires you to be exhaustive – both in breadth and in depth.

8. Algorithms, statistics and aggregation

As far as I’m aware, everyone gets slightly different questions, but my test did not include anything about the graph algorithms inherent in Neo4j (good news for philistines people who want to get stuff done). It did, however, include quite a bit of detail about aggregation functions. You make of that what you will.

9. Practice on Northwind but know the Movie DB like the back of your hand.

Out of the box, if you install Neo4j Community on your computer, you have two sample databases that the Browser offers to load into your instance – Movie and Northwind. The latter should be highly familiar to you if you have a past in relational databases. Meanwhile, the former is a Neo4j favourite, not the least for the Kevin Bacon angle. If you did the self-paced Getting Started training (as you should have!), you’ll have used the Movie DB enough to get a good grip of it. Most of the questions on the text pertain or relate in some way to that graph, so a degree of familiarity can help you spot errors faster. At the same time, Northwind is both a better and bigger database, more fun to use and allows for more complex queries. Northwind should therefore be your educational tool, but you should know Movie rather well for that little plus of familiar feeling that can make the difference between passing and failing. Oh, by the way – while Getting Started is a great course, you will not stand a snowball’s chance in hell without the Production course. This is so even if you’ve done your fill of deployments and integrations – quite simply put, the breadth of the test is statistically very likely to be beyond your own experiences, even if you’ve done e.g. High Availability deployments yourself. In the real world, we specialise – for the test, however, you must be a generalist.

10. Refcards are your friends.

Start with the one for Cypher. Then build your own for High Availability. Laminate them and carry them around, if need be – or take the few functions or clauses that are your weak spots, put them on post-its and plaster them on your wall. Whatever helps – unless you’re writing Cypher code 24/7 (in which case, what are you doing here?), which I doubt happens a lot, there’s quite simply no substitute for seeing correct code and being able to get a feeling for good versus bad code. The test is incredibly fast paced – 80 questions over 60 minutes gives you 45 seconds for a turnkey execution. At least 15-20 of that is reading the question, if not more (it definitely was more for me – as noted, most questions repay a thorough reading!). Realistically, if you want to make that and have time to think about the more complex questions, you’ve got to be able to bang out simple Cypher questions (I’d say there were about 8-10 of them altogether, worth an average number of points, though I (and I do regret this now) didn’t count them.

 

While the Neo4j certification exam is far from easy, it is doable (hey, if I can do it, so can you!). As graph databases are becoming increasingly important due to the recognition that they have the potential to accelerate certain calculations on graph data, coupled with the understanding that a lot of natural processes are in reality closer to relationship-driven interactions than the static picture that traditional RDBMS logic seeks to convey, knowing Neo4j is a definite asset for you and your team. Regardless of your intent to get certified and/or view on certifications in general (mine, too, is in general more on the less complimentary side), what you learn can be an indispensable asset in research and operations as well. Of course, I’m happy to answer any questions about Neo4j and the certification exam, insofar as my subjective views can make a valid contribution to the matter.

Update 15.02.2016: Neo4j community caretaker Michael Hunger has been so kind as to leave a comment on this article, pointing out that the scant feedback is intentional – it prevents re-takers from simply banging in the correct answers from the feedback e-mail. That makes perfect sense – and is not something I thought of. Thanks, Michael. He is also encouraging recent test takers to propose questions for the test – to me, it’s an unprecedented amazingness for a certificate provider to actually ask the community what they believe to be to be the cornerstone and benchmarks of knowledge in a particular field. So do take him up on that offer – his e-mail is in his comment below.

 

Title image credits: Dr Tamás Nepusz, Reconstructing the structure of the world-wide music scene with Last.fm.

References   [ + ]

1. I’ve been told that feedback on failed tests is fairly terrible – there is no feedback to most questions, and you’re not given the correct answers.
scada

(I)IoT security is not SCADA security

The other day, at the annual Worldwide Threats hearing of the Senate Armed Services Committee (SASC) – the literal sum of all fears of the intelligence community and the US military -, the testimony of DNI James Clapper made notice of the emerging threats of hacking the Internet of Things:

“Smart” devices incorporated into the electric grid, vehicles—including autonomous vehicles—and household appliances are improving efficiency, energy conservation, and convenience. However, security industry analysts have demonstrated that many of these new systems can threaten data privacy, data integrity, or continuity of services. In the future, intelligence services might use the IoT for identification, surveillance, monitoring, location tracking, and targeting for recruitment, or to gain access to networks or user credentials.1Testimony of DNI James R. Clapper to the Senate Armed Services Committee, 9 February 2016. Available here.

It’s good to hear a degree of concern for safety where IoT applications are concerned. The problem is, they come in two flavours, neither of which is helpful.

One is the “I can’t see your fridge killing you” approach. Beloved of Silicon Valley, it is generally a weak security approach based on the fact that for what it’s worth, most of IoT is a toy you can live without. So what if your internet-connected egg monitor stops sending packages to your MQTT server about your current egg stock? What if your fridge misregulates because some teenager is playing a prank on you?

The problem with this approach is that it entirely ignores the fact that Industrial IoT exists. Beyond that, you find a multiverse of incredibly safety-critical devices with IoT connectivity. This includes pain pumps, anaesthesia and vitals monitors, navigation systems, road controls and so on. A few of these are air-gapped, but if Stuxnet is anything to go by, that’ll be of little avail altogether. In other words, just because some IoT devices are toys doesn’t mean all of them are.

The other does take security seriously… but it smacks of SCADA talk. Yes, the dangers that affect interfacing a computer with the controls of any real world object, up to and including fridges, cat feeders and nuclear power plants, mean that there are particular dangers, and whether that interface is a 1980s protocol or the Internet of Things makes no great diference. But the threat of this perspective is that it ignores the IoT specific part of the risk profile. This includes, for instance, the vulnerabilities inherent in the #1 carrier medium of non-industrial IoT traffic – 802.11b/g. Any vulnerability in the main wireless protocols is in turn a vulnerability of IoT connected devices. I’d wager that wasn’t much of an issue back when Reagan was President.

Moreover, a scary part of IoT is that a lot of the devices interface directly with customers, rather than professionals. You cannot rely on anything that’s not in the box. The wireless network will be badly configured, the passwords will be the user’s birthday and generally, everything will be as unsafe as it gets. Of course, this isn’t even true for a minority of users, but it is the worst case scenario, and that’s what determines the risk profile of a component.

The bottom line is that if you ignore the incredible diversity of devices caught by the new favourite buzzword IoT has become, you get a slanted risk profile, towards toys (IoT connected floating thingies in your pool that measure your cannonballs? Are you serious?) or towards an aging system of industrial control protocols that some hope will be revived by, rather than supplanted by, IoT.

That’s a problem if you scratch anything but the top layer. Below the very specific layers, a number of fairly generic interconnection layers lie. And the consequence of that is that is that every layer, protocol or method of communication that can conceivably be used for IoT data transmission eventually will be used for that, and as such must have a defensible risk profile.

And in turn, every (I)IoT application must consider, in its risk profile, what the risk profiles of the stack it is based on is. Every element of the stack contributes to the risk, and quite often these risks are encapsulated in vendor risk estimates of higher-up layers sold as composite. Thus for instance an entire IoT appliance might be sold with a particular risk profile, but in reality, its risk profile is the sum of the risk profiles of all of its layers: its sensors, its hardware (especially where sex via hex is concerned – something that might arouse older engineers entirely the wrong way), the radio frequencies, the susceptibility of the radio hardware to certain unpleasant hacks, heck, even the possibility of van Eck phreaking on screens.2A few years ago, a large and undisclosed metallurgical company was concerned about information leaking onto the open market that indicated possible hacking of their output values at one of their largest plants, values that were, due to some oddities in the markets, quite significant in global commodity prices. After an enormously long hunt and scouring through the entire corporate and production network, they were about to give up when they found a van Eck loop antenna curled to the backside of a subsidiary monitor that an engineer has ‘hacked’ never to go into power save. As a recovering lawyer, I entirely foresee that the expected standard from non-private users of IoT devices, as well as those who build IoT devices, will be examining the whole stack rather than relying in previous promises. It may be turtles all the way down, but you’re expected to know all about those turtles now. Merely subscribing to one of the two prevalent risk models will not suffice.

References   [ + ]

1. Testimony of DNI James R. Clapper to the Senate Armed Services Committee, 9 February 2016. Available here.
2. A few years ago, a large and undisclosed metallurgical company was concerned about information leaking onto the open market that indicated possible hacking of their output values at one of their largest plants, values that were, due to some oddities in the markets, quite significant in global commodity prices. After an enormously long hunt and scouring through the entire corporate and production network, they were about to give up when they found a van Eck loop antenna curled to the backside of a subsidiary monitor that an engineer has ‘hacked’ never to go into power save.
Screenshot 2016-02-07 16.12.59

The sinful algorithm

In 1318, amidst a public sentiment that was less than enthusiastic about King Edward II, a young clerk from Oxford, John Deydras of Powderham, claimed to be the rightful ruler of England. He spun a long and rather fantastic tale that involved sows biting off the ears of children and other assorted lunacy.1It is now more or less consensus that Deydras was mentally ill and made the whole story up. Whether he himself believed it or not is another question. Edward II took much better to the pretender than his wife, the all-around badass Isabella of France, who was embarrassed by the whole affair, and Edward’s barons, who feared more sedition if they let this one slide. As such, eventually, Deydras was tried for sedition.

Deydras’s defence was that he has been convinced to engage in this charade by his cat, through whom the devil appeared to him.2As an obedient servant to a kitten, I have trouble believing this! That did not meet with much leniency, it did however result in one of the facts that exemplified the degree to which medieval criminal jurisprudence was divorced from reason and reality: besides Deydras, his cat, too, was tried, convicted, sentenced to death and hung, alongside his owner.

Before the fashionable charge of unreasonableness is brought against the Edwardian courts, let it be noted that other times and cultures have fared no better. In the later middle ages, it was fairly customary for urban jurisdictions to remove objects that have been involved in a crime beyond the city limits, giving rise to the term extermination (ex terminare, i.e., [being put] beyond the ends).3Falcón y Tella, Maria J. (2014). Justice and law, 60. Brill Nijhoff, Leiden The Privileges of Ratisbon (1207) allowed the house in which a crime took place or which harboured an outlaw to be razed to the ground – the house itself was as guilty as its owner.4Falcón y Tella, Maria J. and Falcón y Tella, Fernando (2006). Punishment and Culture: a right to punish? Nijhoff, Leiden. And even a culture as civilised and rationalistic as the Greeks fared no better, falling victim to the same surge of unreason. Hyde describes

The Prytaneum was the Hôtel de Ville of Athens as of every Greek town. In it was the common hearth of the city, which represented the unity and vitality of the community. From its perpetual fire, colonists, like the American Indians, would carry sparks to their new homes, as a symbol of fealty to the mother city, and here in very early times the prytanis or chieftain probably dwelt. In the Prytaneum at Athens the statues of Eirene (Peace) and Hestia (Hearth) stood; foreign ambassadors, famous citizens, athletes, and strangers were entertained there at the public expense; the laws of the great law-giver Solon were displayed within it and before his day the chief archon made it his home.
One of the important features of the Prytaneum at Athens were the curious murder trials held in its immediate vicinity. Many Greek writers mention these trials, which appear to have comprehended three kinds of cases. In the first place, if a murderer was unknown or could not be found, he was nevertheless tried at this court. Then inanimate things – such as stones, beams, pliece of iron, etc., – which had caused the death of a man by falling upon him-were put on trial at the Prytaneum, and lastly animals, which had similarly been the cause of death.
Though all these trials were of a ceremonial character, they were carried on with due process of law. Thus, as in all murder trials at Athens, because of the religious feeling back of them that such crimes were against the gods as much as against men, they took place in the open air, that the judges might not be contaminated by the pollution supposed to exhale from the prisoner by sitting under the same roof with him.
(…)
[T]he trial of things, was thus stated by Plato:
“And if any lifeless thing deprive a man of life, except in the case of a thunderbolt or other fatal dart sent from the gods – whether a man is killed by lifeless objects falling upon him, or his falling upon them, the nearest of kin shall appoint the nearest neighbour to be a judge and thereby acquit himself and the whole family of guilt. And he shall cast forth the guilty thing beyond the border.”
Thus we see that this case was an outgrowth from, or amplification of the [courts’ jurisdiction trying and punishing criminals in absentia]; for if the murderer could not be found, the thing that was used in the slaying, if it was known, was punished.5Hyde, Walter W. (1916). The Prosecution and Punishment of Animals and Lifeless Things in the Middle Ages and Modern Times. 64 U.Pa.LRev. 696.


Looking at the current wave of fashionable statements about the evils of algorithms have reminded me eerily of the superstitious pre-Renaissance courts, convening in damp chambers to mete out punishments not only on people but also on impersonal objects. The same detachment from reality, from the Prytaneum through Xerxes’s flogging of the Hellespont through hanging cats for being Satan’s conduits, is emerging once again, in the sophisticated terminology of ‘systematized biases’:

Clad in the pseudo-sophistication of a man who bills himself as ‘one of the world’s leading thinkers‘, a wannabe social theorist with an MBA from McGill and a career full of buzzwords (everything is ‘foremost’, ‘agenda-setting’ or otherwise ‘ultimative’!) that now apparently qualifies him to discuss algorithms, Mr Haque makes three statements that have now become commonly accepted dogma among certain circles when discussing algorithms.

  1. Algorithms are means to social control, or at the very least, social influence.
  2. Algorithms are made by a crowd of ‘geeks’, a largely homogenous, socially self-selected group that’s mostly white, male, middle to upper middle class and educated to a Masters level.
  3. ‘Systematic biases’, by which I presume he seeks to allude to the concept of institutional -isms in the absence of an actual propagating institution, mean that these algorithms are reflective of various biases, effectively resulting in (at best) disadvantage and (at worst) actual prejudice and discrimination against groups that do not fit the majority demographic of those who develop code.

Needless to say, leading thinkers and all that, this is absolute, total and complete nonsense. Here’s why.


A geek’s-eye view of algorithms

We live in a world governed by algorithms – and we have ever since men have mastered basic mathematics. The Polynesian sailors navigating based on stars and the architects of Solomon’s Temple were no less using algorithms than modern machine learning techniques or data mining outfits are. Indeed, the very word itself is a transliteration of the name of the 8th century Persian mathematician Al-Khwarazmi.6Albeit what we currently regard as the formal definition of an algorithm is largely informed by the work of Hilbert in the 1920s, Church’s lambda calculus and, eventually, the emergence of Turing machines. And for most of those millennia of unwitting and untroubled use of algorithms, there were few objections.

The problem is that algorithms now play a social role. What you read is determined by algorithms. The ads on a website? Algorithms. Your salary? Ditto. A million other things are algorithmically calculated. This has endowed the concept of algorithms with an air of near-conspiratorial mystery. You totally expect David Icke to jump out of your quicksort code one day.

Whereas, in reality, algorithms are nothing special to ‘us geeks’. They’re ways to do three things:

  1. Execute things in a particular order, sometimes taking the results of previous steps as starting points. This is called sequencing.
  2. Executing things a particular number of times. This is called iteration.
  3. Executing things based on a predicate being true or false. This is conditionality.

From these three building blocks, you can literally reconstruct every single algorithm that has ever been used. There. That’s all the mystery.

So quite probably, what people mean when they rant about ‘algorithms’ is not the concept of algorithms but particular types of algorithm. In particular, social algorithms, content filtering, optimisation and routing algorithms are involved there.

Now, what you need to understand is that geeks care relatively little about the real world ‘edges’ of problems. They’re not doing this out of contempt or not caring, but rather to compartmentalise problems to manageable little bits. It’s easier to solve tiny problems and make sure the solutions can interoperate than creating a single, big solution that eventually never happens.

To put it this way: to us, most things, if not everything, is an interface. And this largely determines what it means when we talk about the performance of an algorithm.

Consider your washing machine: it can be accurately modelled in the following way.

The above models a washing machine. Given supply water and power, if you put in dirty clothes and detergent, you will eventually get clean clothes and grey water. And that is all.
The above models a washing machine. Given supply water and power, if you put in dirty clothes and detergent, you will eventually get clean clothes and grey water. And that is all.

Your washing machine is an algorithm of sorts. It’s got parameters (water, power, dirty clothes) and return values (grey water, clean clothes). Now, as long as your washing machine fulfils a certain specification (sometimes called a promise7I discourage the promise terminology here as I’ve seen it confuzzled with the asynchronous meaning of the word way too often or a contract), according to which it will deliver a given set of predictable outputs to a given set of inputs, all will be well. Sort of.

“Sort of”, because washing machines can break. A defect in an algorithm is defined as ‘betraying the contract’, in other words, the algorithm has gone wrong if it has been given the right supply and yields the wrong result. Your washing machine might, however, fail internally. The motor might die. A sock might get stuck in it. The main control unit might short out.


Now consider the following (extreme simplification of an) algorithm. MD5 is what we call a cryptographic hash function. It takes something – really, anything that can be expressed in binary – and gives a 128-bit hash value. On one hand, it is generally impossible to invert the process (i.e. it is not possible to conclusively deduce what the original message was), while at the same time the same message will always yield the same hash value.

Without really having an understanding of what goes on behind the scenes,8In case you’re interested, RFC1321 explains MD5’s internals in a lot of detail. you can rely on the promise given by MD5. This is so in every corner of the universe. The value of MD5("Hello World!") is 0xed076287532e86365e841e92bfc50d8c in every corner of the universe. It was that value yesterday. It will be that value tomorrow. It will be that value at the heat death of the universe. What we mean when we say that an algorithm is perfect is that it upholds, and will uphold, its promise. Always.

At the same time, there are aspects of MD5 that are not perfect. You see, perfection of an algorithm is quite context-dependent, much as the world’s best, most ‘perfect’ hammer is utterly useless when what you need is a screwdriver. As such, for instance, we know that MD5 has to map every possible bit value of every possible length to a limited number of possible hash values (128 bit worth of values, to be accurate, which equates to 2^128 or approximately 3.4×10^38 distinct values). These seem a lot, but are actually teensy when you consider that they are used to map every possible amount of binary data, of every possible length. As such, it is known that sometimes different things can have the same hash value. This is called a ‘collision’, and it is a necessary feature of all hash algorithms. It is not a ‘fault’ or a ‘shortcoming’ of the algorithm, no more than we regard the non-commutativity of division a ‘shortcoming’.

Which is why it’s up to you, when you’re using an algorithm, to know what it can and cannot do. Algorithms are tools. Unlike the weird perception in Mr Haque’s swirl of incoherence, we do not worship algorithms. We don’t tend to sacrifice small animals to quicksort and you can rest assured we don’t routinely bow to a depiction of binary search trees. No more do we believe in the ‘perfection’ of algorithms than a surgeon believes in the ‘perfection’ of his scalpel or a pilot believes in the ‘perfection’ of their aircraft. Both know their tools have imperfections. They merely rely on the promise that if used with an understanding of its limitations, you can stake your, and others’, lives on it. That’s not tool-worship, that’s what it means to be a tool-using human being.


The Technocratic Spectre

We don’t know the name of the first human who banged two stones together to make fire, and became the archetype for Prometheus, but I’m rather sure he was rewarded by his fellow humans rewarded with very literally tearing out his very literal and very non-regrowing liver. Every progress in the history of humanity had those who not merely feared progress and the new, but immediately saw seven kinds of nefarious scheming behind it. Beyond (often justified!) skepticism and a critical stance towards new inventions and a reserved approach towards progress (all valid positions!), there is always a caste of professional fear-mongerers, who, after painting a spectre of disaster, immediately proffer the solution: which, of course, is giving them control over all things new, for they are endowed with the mythical talents that one requires to be so presumptuous as to claim to be able to decide for others without even hearing their views.

The difference is that most people have become incredibly lazy. The result is that there is now a preference for fear over informed understanding that comes at the price of investing some time in reading up on the technologies that now are playing such a transformative role. How many Facebook users do you think have re-posted the “UCC 1-308 and Rome Statute” nonsense? And how many of them, you reckon, actually know how Facebook uses their data? While much of what they do is proprietary, the Facebook graph algorithms are partly primitives9Building blocks commonly used that are well-known and well-documented and partly open. If you wanted, you could, with a modicum of mathematical and computing knowledge, have a good stab at understanding what is going on. On the other hand, posting bad legalese is easier. Much easier.

And thus, as a result, we have a degree of skepticism towards ‘algorithms’, mostly by people like Mr Haque who do not quite understand what they are talking about and are not actually referring to algorithms but their social use.

And there lieth the Technocratic Spectre. It has always been a fashionable argument against progress, good or ill, that it is some mysterious machination by a scientific-technical elite aimed at the common man’s detriment. There is now a new iteration of this philosophy, and it is quite surprising how the backwards, low-information edges of the far right reach hands to the far left’s paranoid and misinformed segment. At least the Know-Nothings of the right live in an honest admission of ignorance, eschewing the over-blown credentials and inflated egos of their left-wing brethren like Mr Haque. But in ignorance, they both are one another’s match.

The left-wing argument against technological progress is an odd one, for the IT business, especially the part heavy on research and innovation that comes up with algorithms and their applications, is a very diverse and rather liberal sphere. Nor does this argument square too well with the traditional liberal values of upholding civil liberties, first and foremost that of freedom of expression and conscience. Instead, the objective seems to be an ever more expansive campaign, conducted entirely outside parliamentary procedure (basing itself on regulating private services from the inside and a goodly amount of shaming people into doing their will through the kind of agitated campaigning that I have never had the displeasure to see in a democracy), of limiting the expression of ideas to a rather narrowly circumscribed set, with the pretense that some minority groups are marginalised and even endangered by wrongthink.10Needless to say, a multiple-times-over minority in IT, the only people who have marginalised and endangered me were these stalwart defenders of the right never to have to face a controversial opinion.


Their own foray at algorithms has not fared well. One need only look at the misguided efforts of a certain Bay Area developer notorious for telling people to set themselves on fire. Her software, intended to block wrongthink on the weirder-than-weird cultural phenomenon of Gamergate by blocking Twitter users who have followed a small number of acknowledged wrongthinkers, expresses the flaws of this ideology beautifully. Not only is subtleness and a good technical understanding lacking. There is also a distinct shortage of good common sense and, most of all, an understanding of how to use algorithms. While terribly inefficient and horrendously badly written 11Especially for someone who declaims, with pride, her 15-year IT business experience…, the algorithm behind the GGAutoblocker is sound. It does what its creator intended it to do on a certain level: allow you to block everyone who is following controversial personalities. That this was done without an understanding of the social context (e.g. that this is a great way to block the uncommitted and those who wish to be as widely informed as possible, is of course the very point.

The problem is not with “geeks”.

The problem is when “geeks” decide to play social engineering. Whey they suddenly throw down their coding gear and decide they’re going to transform who talks with whom and how information is exchanged. The problem is exactly the opposite: it happens when geeks cease to be geeks.

It happens when Facebook experiments with users’ timelines without their consent. It happens when companies implement policies aimed at a really laudable goal (diversity and inclusion) that leads to statements by employees that should make any sane person shudder (You know who you are, Bay Area). It happens when Twitter decides they are going to experiment with their only asset. This is how it is rewarded.

$TWTR crash
Top tip: when you’ve got effectively no assets to speak of, screwing with your user base can be very fatal very fast.

The problem is not geeks seeing a technical solution to every socio-political issue.

The problem is a certain class of ‘geeks’ seeing a socio-political use to every tool.


Sins of the algorithm

Why algorithms? Because algorithms are infinitely dangerous: because they are, as I noted above, within their area of applicability universally true and correct.

But they’re also resilient. An algorithm feels no shame. An algorithm feels no guilt. You can’t fire it. You can’t tell them to set themselves on fire or, as certain elements have done to me for a single statistical analysis, threaten to rape my wife and/or kill me. An algorithm cannot be guilted into ‘right-think’. And worst of all, algorithms cannot be convincingly presented as having an internal political bias. Quicksort is not Republican. R/B trees are not Democrats. Neural nets can’t decide to be homophobic.

And for people whose sole argumentation lies on the plane of politics, in particular grievance and identity politics, this is a devastating strike. Algorithms are the greased eels unable to be framed for the ideological sins that are used to attack and remove undesirables from political and social discourse. And to those who wish to govern this discourse by fear and intimidation, a bunch of code that steadfastly spits out results and to hell with threats is a scary prospect.

And so, if you cannot invalidate the code, you have to invalidate the maker. Algorithms perpetuate real equality by being by definition unable to exercise the same kind of bias humans do (not that they don’t have their own kind of bias, but the similarity ends with the word – if your algorithm has a racial or ethnic or gender bias, you’re using it wrong). Algorithms are meritocratic, being immune to nepotism and petty politicking. A credit scorer does not care about your social status the way Mr Jones at the bank might privilege the child of his golf partners over a young unmarried ethnic couple. Trading algorithms don’t care whether you’re a severely ill young man playing the markets from hospital.12It was a great distraction. Without human intervention, algorithms have a purity and lack of bias that cannot easily be replicated once humans have touched the darn things.

And so, those whose stock in life is a thorough education in harnessing grievances for their own gain are going after “the geeks”.


Perhaps the most disgusting thing about Mr Haque’s tweet is the contraposition between “geeks” and “regular humans”, with the assumption that “regular humans” know all about algorithms and unlike the blindly algorithm-worshipping geeks, understand how ‘life is more complicated’ and algorithms are full of geeky biases.

For starters, this is hard to take seriously when in the same few tweets, Mr Haque displays a lack of understanding of algorithms that doesn’t befit an Oregon militia hick, never mind somebody who claims spurious credentials as a foremost thinker.

“Regular humans”, whatever they are that geeks aren’t (and really, I’m not one for geek supremacy, but if Mr Haque had spent five minutes among geeks, he’d know the difference is not what, and where, he thinks it is), don’t have some magical understanding of the shortcomings of algorithms. Heck, usually, they don’t have a regular understanding of algorithms, never mind magical. But it sure sounds good when you’re in the game of shaming some of the most productive members of society unless they contribute to the very problem you’re complaining about. For of course ‘geeks’ can atone for their ‘geekdom’ by becoming more of a ‘regular human’, by starting to engage in various ill-fated political forays that end with the problems that sent the blue bird into a dive on Friday.


Little of this is surprising, though. Anyone who has been paying attention could see the warning signs of a forced politicisation of technology, under the guise of making it more equal and diverse. In my experience, diverse teams perform better, yield better results, work a little faster, communicate better and make fewer big mistakes (albeit a little more small ones). In particular, gender-diverse and ethnically diverse teams are much more than the sum of their parts. This is almost universally recognised, and few businesses that have intentionally resisted creating diverse, agile teams have fared well in the long run.13Not that any statement about this matter is not shut down by reference to ludicrous made-up words like ‘mansplaining’. I’m a huge fan of diversity – because it lives up to a meritocratic ideal, one to which I am rather committed after I’ve had to work my way into tech through a pretty arduous journey.

Politicising a workplace, on the other hand, I am less fond of. Quite simply, it’s not our job. It’s not our job, because for what it’s worth, we’re just a bunch of geeks. There are things we’re better at. Building algorithms is one.

But they are now the enemy. And because they cannot be directly attacked, we’ll become targets. With the passion of a zealot, it will be taught that algorithms are not clever mathematical shortcuts but merely geeks’ prejudices expressed in maths.

And that’s a problem. If you look into the history of mathematics, most of it is peppered by people who held one kind of unsavoury view or another. Moore was a virulent racist. Pauli loved loose women. Half the 20th century mathematicians were communists at some point of their career. Haldane thought Stalin was a great man. And I could go on. But I don’t, because it does not matter. Because they took part in the only truly universal human experience: discovery.


But discovery has its enemies and malcontents. The attitude they display, evidenced by Haque’s tweet too, is ultimately eerily reminiscent of the letter that sounded the death knell on the venerable pre-WW II German mathematical tradition. Titled Kunst des Zitierens (The Art of Citing), it was written in 1934 by Ludwig Bieberbach, a vicious anti-Semite and generally unpleasant character, who was obsessed with the idea of a ‘German mathematics’, free of the Hilbertian internationalism, of what he saw as Jewish influence, of the liberalism of the German mathematical community in the inter-war years. He writes:

“Ein Volk, das eingesehen hat, wie fremde Herrschaftsgelüste an seinem Marke nagen, wie Volksfremde daran arbeiten, ihm fremde Art aufzuzwingen, muss Lehrer von einem ihm fremden Typus ablehnen.”

 

“A people that has recognised how foreign ambitions of power attack its brand, how aliens work on imposing foreign ways on it, has to reject teachers from a type alien to it.”

Algorithms, and the understanding of what they do, protect us from lunatics like Bieberbach. His ‘German mathematics’, suffused with racism and Aryan mysticism, was no less delusional than the idea that a cabal of geeks is imposing a ‘foreign way’ of algorithmically implementing their prejudices, as if geeks actually cared about that stuff.

Every age will produce its Lysenko and its Bieberbach, and every generation has its share of zealots that demand ideological adherence and measure the merit of code and mathematics based on the author’s politics.

Like on Lysenko and Bieberbach, history will have its judgment on them, too.

Head image credits: Max Slevogt, Xerxes at the Hellespont (Allegory on Sea Power). Bildermann 13, Oct. 5, 1916. With thanks to the President and Fellows of Harvard College.

References   [ + ]

1. It is now more or less consensus that Deydras was mentally ill and made the whole story up. Whether he himself believed it or not is another question.
2. As an obedient servant to a kitten, I have trouble believing this!
3. Falcón y Tella, Maria J. (2014). Justice and law, 60. Brill Nijhoff, Leiden
4. Falcón y Tella, Maria J. and Falcón y Tella, Fernando (2006). Punishment and Culture: a right to punish? Nijhoff, Leiden.
5. Hyde, Walter W. (1916). The Prosecution and Punishment of Animals and Lifeless Things in the Middle Ages and Modern Times. 64 U.Pa.LRev. 696.
6. Albeit what we currently regard as the formal definition of an algorithm is largely informed by the work of Hilbert in the 1920s, Church’s lambda calculus and, eventually, the emergence of Turing machines.
7. I discourage the promise terminology here as I’ve seen it confuzzled with the asynchronous meaning of the word way too often
8. In case you’re interested, RFC1321 explains MD5’s internals in a lot of detail.
9. Building blocks commonly used that are well-known and well-documented
10. Needless to say, a multiple-times-over minority in IT, the only people who have marginalised and endangered me were these stalwart defenders of the right never to have to face a controversial opinion.
11. Especially for someone who declaims, with pride, her 15-year IT business experience…
12. It was a great distraction.
13. Not that any statement about this matter is not shut down by reference to ludicrous made-up words like ‘mansplaining’.