The rest of the day was mainly reserved for more technical talks: Tom White introduced the merits of MR2, also known as YARN. Steve Loughran gave a very insightful talk on the various failure modes of Hadoop. Though the NameNode is the most obvious single point of failure, there are a few more traps waiting for those depending on their Hadoop clusters. Hadoop does just fine with single hard disks failing. A single failing machine usually does not create a huge issue either. However, what if the switch one of your racks is connected to fails? Suddenly not just one machine has to be re-replicated but a whole rack of machines. Even if you have enough space left in your cluster, can your network deal with the replication traffic? What if your cluster is split in half as a result? Steve gave an introduction to the various HA configurations available for Hadoop. There's one insight I really liked though: If you are looking for SPOFs in your system – just carry a pager … and wait.
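To get a feeling for the rack failure question raised in the talk, a back-of-envelope estimate helps. The following is only a minimal sketch: the function name and all numbers (nodes per rack, data per node, usable bandwidth) are hypothetical assumptions of mine, not figures from Steve's talk, and real clusters throttle and distribute replication traffic in ways this ignores.

```python
def rereplication_hours(nodes_lost, tb_per_node, usable_gbit_per_s):
    """Rough time needed to re-create the replicas that lived on lost nodes.

    Assumes the surviving cluster can devote `usable_gbit_per_s` of aggregate
    bandwidth to replication traffic (a hypothetical figure).
    """
    bytes_to_copy = nodes_lost * tb_per_node * 1e12
    bytes_per_second = usable_gbit_per_s * 1e9 / 8
    return bytes_to_copy / bytes_per_second / 3600


# Hypothetical cluster: 10 TB of data per node, 10 Gbit/s left for replication.
single_machine = rereplication_hours(1, 10, 10)   # roughly 2.2 hours
whole_rack = rereplication_hours(20, 10, 10)      # roughly 44 hours
```

Under these made-up assumptions a single machine is back to full replication within a couple of hours, while a rack of 20 machines keeps the network busy for almost two days.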
In the afternoon I joined Ted Dunning's talk on fast kNN soon to be available in Mahout – the speedups gained really do look impressive – just like the fact that the algorithm is all online and single pass.
It was good to meet with so many big data people in two days – including Sean Owen who joined the Data Science Meetup in the evening.
Thanks to the O'Reilly Strata team – you really did an awesome job making Strata EU an interesting and very well organised event. If you yourself are still wondering what this big data thing is and in what respect it might be relevant to your company, Strata is the place to be to find out: Though a tad too high-level for people with a technical interest, the selection of talks is really great when it comes to showing the wide impact of big data applications, from IT and the medical sector right up to data journalism.
If you are interested in anything big data, in particular how to turn the technology into value, make sure you check out the conferences in New York and Santa Clara. Also, all keynotes from London were videotaped and are available on YouTube by now.
The first Tuesday morning keynote put the hype around big data into historical context: According to Wikipedia, big data apps are defined by their capability of coping with data sets that are larger than can be handled with commonly available machines and algorithms. Going from that definition we can look back in history and realize that the issue of big data actually isn't that new: Even back in the 1950s people had to deal with big data problems. One example the speaker went through was a trading company that back in the old days had a very capable computer at their disposal. To ensure optimal utilisation they would rent out computing power whenever they did not need it for their own computations. One of the tasks they had to accomplish was a government contract: Freight charges on rail had been changed to be distance based. As a result the British government needed information on the pairwise distances between all train stations in GB. The developers had to deal with the fact that they did not have enough memory to fit the whole computation into it, so they had to partition the task. Also, Dijkstra's algorithm for finding shortest paths in graphs wasn't invented until 4 years later, so they had to figure something out themselves to get the job done (note: what they came up with was actually very similar to what Dijkstra published later – only that they never published it). The conclusion is quite obvious: The problems we face today with petabytes of data aren't particularly new – we are again pushing frontiers, inventing new algorithms as we go and partitioning our data to suit the compute power that we have.
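To illustrate the kind of computation described in that keynote, here is a minimal sketch of computing pairwise shortest distances one batch of source stations at a time. It is not the method the 1950s team used (which was never published). It simply runs Dijkstra's algorithm per station, and the toy network, its distances and the function names are all hypothetical.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm: distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, length in graph[node].items():
            candidate = d + length
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


def pairwise_in_batches(graph, batch_size):
    """Yield distance tables for `batch_size` source stations at a time,
    so that only one batch of tables has to be held in memory at once."""
    stations = sorted(graph)
    for start in range(0, len(stations), batch_size):
        batch = stations[start:start + batch_size]
        yield {s: shortest_distances(graph, s) for s in batch}


# Hypothetical toy rail network; the distances are made up.
RAIL = {
    "A": {"B": 5, "C": 9},
    "B": {"A": 5, "C": 3},
    "C": {"A": 9, "B": 3, "D": 4},
    "D": {"C": 4},
}

for tables in pairwise_in_batches(RAIL, batch_size=2):
    for station, distances in tables.items():
        print(station, distances)
```

In a memory-constrained setting each batch would be written out before the next one is computed. Here the batches are just printed.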
With everyday examples and a bit of hackery the second keynote went into detail on what it means to live in a world that increasingly depends on the sensors around us. The first example the speaker gave was a hotel that featured RFID cards for room access. On the card it was noted that every entry to and exit from the room is being tracked – how scary is that? In particular when taking into account how simple it is to trick the system behind it into revealing some of the gathered information, as the speaker showed a few slides later. A second example he gave was a leaked dataset of mobile device types, names and usernames. Looking at the statistics of that dataset answers a few questions. What is the distribution of device types? It was mainly iPads as opposed to iPhones or Android phones. What is the distribution of device names? Right after manufacturer names those contained mainly male first names. Correlating these with statistics on the most common baby names per year showed that their owners were mainly in their mid thirties. So the group of people whose data had leaked used the app mainly on an iPad, was mainly male and in their thirties. With a bit more digging it was possible to deduce who exactly had leaked the data – and do that well enough for the responsible party (an American publisher) to not be able to deny it. The last example showed how to use geographical self tracking correlated with credit card transactions to identify fraudulent transactions – in some cases faster than the bank would discover them.
The last keynote provided some insight into the publication bias prevalent in academic publishing, and in medical publications in particular: There the preference for publishing positive results is particularly detrimental as it has a direct effect on patient treatment.
The second keynote touched upon the topic of data literacy: In an age in which growing amounts of data are being generated, being able to make sense of that data becomes a crucial skill for citizens, just like reading, writing and computing. The speaker's message was two-fold: a) People currently are not being taught how to deal with that data but are being taught that all that growing data is evil, like an enemy hiding under their bed just waiting to jump at them. b) When it comes to making the people around you data literate the common wisdom is to simplify, simplify, simplify. However, her approach is a little different: Don't simplify. Instead give people the option to learn and improve. As a trivial comparison: Just because her own little baby does not yet talk doesn't mean she shouldn't talk to it. Over time the little human will learn and adapt and have great fun communicating with others. Similarly we shouldn't over-simplify but give others a chance to learn.
The last keynote gave a really nice perspective on information overload and the history of information creation. It started back in the age of clay tablets, when 90% of writing was used for accounting only and tablets were tagged for easier findability. It continued with the invention of paper – back then still as rolls as opposed to books, which facilitated easy sequential reading but made random access hard. The obvious next step was books that allow for random access reads. Then came initial printing efforts in an age where books were still a scarce resource, followed by the age of the printing press with movable type when books became ubiquitous – introducing the need for more metadata attached to books like title pages, TOCs and indexes for better findability. As book production became simpler and cheaper, people soon had to think of new ways to cope with the ever growing amount of information available to them. Compared to that the current big data revolution does not look so unfamiliar anymore: Much like the printing press allowed for more and more books to become available, Hadoop allows for more and more data to be stored in clusters. As a result we will have to think about new ways to cope with the increasing amount of data at our disposal. It is time to start going beyond the mere production processes and deal with the implications for society. Each past data revolution left both winners and losers, mostly unintended by those who invented the production processes. The same will happen with today's data revolution.
After the keynotes I joined some of the nerdcore track talks on Clojure for data science and Cascalog for distributed data analysis and briefly joined the talk on data literacy for those playing with self tracking tools, before finally joining some friends heading out for an Apache Dinner. Always great to meet with people you know in cities abroad. Thanks to the cloud of people who facilitated the event!
A few weeks ago I attended O'Reilly Strata EU. As I had the honour of being on the program committee I remember how hard it was to decide on which talks to accept and which ones to decline. It's great to see that potential turned into an awesome conference on all things Big Data.
I arrived a bit late as I flew in only on Monday morning. So I didn't get to see all of the keynotes and plunged right into Dyson's talk on the history of computing from Alan Turing to now, including the everlasting goal of making computers more like humans, making them what is generally called intelligent.
The next keynote was co-presented by the Guardian and Google on the Guardian big data blog. The Guardian is very well known for their innovative approach to journalism that relies more and more on being able to make sense of ever growing datasets – both public and not-yet-published. It was quite interesting to see them use technologies like Google Refine for cleaning up data, see them mention common tools like Google Spreadsheets or Tableau for data presentation and learn more about how they enrich data by joining it with publicly available datasets.