Strata EU - part 3

The first Tuesday morning keynote put the hype around big data into historical context. According to Wikipedia, big data applications are defined by their ability to cope with data sets that are larger than what commonly available machines and algorithms can handle. Going by that definition, a look back at history shows that big data is not actually a new issue: even in the 1950s people had to deal with big data problems. One example the speaker went through was a trading company that, back in those days, had a very capable computer at its disposal. To ensure optimal utilisation, the company rented out computing power whenever it was not needed for its own computations. One of the jobs it took on was a government contract: freight charges on rail had been changed to be distance based, so the British government needed the pairwise distances between all train stations in Great Britain. The developers did not have enough memory to hold the whole computation, so they had to partition the task. Dijkstra’s algorithm for finding shortest paths in graphs also wasn’t invented until four years later, so they had to come up with something themselves to get the job done (note: what they came up with was actually very similar to what Dijkstra published later, only they never published it). The conclusion is quite obvious: the problems we face today with petabytes of data aren’t particularly new. We are again pushing frontiers, inventing new algorithms as we go, and partitioning our data to suit the compute power that we have.
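To make the station-distance task a bit more concrete, here is a minimal sketch of computing pairwise shortest distances in partitions, so that only one batch of results has to be held in memory at a time. It uses Dijkstra’s algorithm as it is known today, not whatever the developers actually built back then. The toy network, its station names, its distances and the function names are all made up for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest distance from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return dist


def pairwise_distances_in_batches(graph, batch_size):
    """Yield {source: {target: distance}} for one batch of sources at a time."""
    sources = sorted(graph)
    for start in range(0, len(sources), batch_size):
        batch = sources[start:start + batch_size]
        yield {s: dijkstra(graph, s) for s in batch}


# Hypothetical toy network: station names and distances are invented.
toy_network = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 4, "C": 2, "D": 5},
    "C": {"A": 1, "B": 2, "D": 8},
    "D": {"B": 5, "C": 8},
}

all_distances = {}
for batch in pairwise_distances_in_batches(toy_network, batch_size=2):
    # A real job would write each batch out to storage here instead of
    # collecting everything in one dictionary.
    all_distances.update(batch)
```

Each batch only depends on the network itself, so the batches could just as well be computed one after the other on a small machine or side by side on several.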

With everyday examples and a bit of hackery, the second keynote went into detail on what it means to live in a world that increasingly depends on the sensors around us. The first example the speaker gave was a hotel that used RFID cards for room access. The card noted that every entry to and exit from the room is tracked. How scary is that? Especially given how simple it is to trick the system behind it into revealing some of the gathered information, as the speaker showed a few slides later. A second example he gave was a leaked dataset of mobile device types, device names and usernames. Looking at the statistics of that dataset answers a few questions. What is the distribution of device types? Mainly iPads, as opposed to iPhones or Android phones. What is the distribution of device names? Right after manufacturer names, they mainly contained male first names. Correlating these with statistics on the most common baby name per year showed that the owners were mainly in their mid thirties. So the group of people whose data had leaked used the app mainly on an iPad, was mainly male and in their thirties. With a bit more digging it was possible to deduce who exactly had leaked the data, and to do so well enough that the responsible party (an American publisher) could not deny it. The last example showed how to correlate geographical self-tracking with credit card transactions to identify fraudulent transactions, in some cases faster than the bank would discover them.
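The last idea can be sketched in a few lines. This is not the speaker’s code, just one simple way such a check might look: for every card transaction, find the location fix closest in time and flag the transaction if it happened too far away from where the card holder was. The data layout, the thresholds and the sample records are assumptions made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


def suspicious_transactions(track, transactions, max_km=50.0, max_gap_s=3600):
    """Return ids of transactions made far away from the tracked position."""
    flagged = []
    if not track:
        return flagged  # nothing to compare against
    for tx in transactions:
        # Location fix closest in time to the transaction.
        nearest = min(track, key=lambda fix: abs(fix["t"] - tx["t"]))
        if abs(nearest["t"] - tx["t"]) > max_gap_s:
            continue  # no fix close enough in time to judge this one
        distance = haversine_km(nearest["lat"], nearest["lon"],
                                tx["lat"], tx["lon"])
        if distance > max_km:
            flagged.append(tx["id"])
    return flagged


# Hypothetical sample data: timestamps in seconds, coordinates in degrees.
track = [
    {"t": 0, "lat": 51.5074, "lon": -0.1278},     # London
    {"t": 1800, "lat": 51.5074, "lon": -0.1278},  # still in London
]
transactions = [
    {"id": "coffee", "t": 600, "lat": 51.51, "lon": -0.13},            # London
    {"id": "electronics", "t": 1200, "lat": 40.7128, "lon": -74.006},  # New York
]

flagged = suspicious_transactions(track, transactions)
```

With the sample data above, only the purchase made on the other side of the Atlantic ends up in `flagged`. Online purchases have no meaningful location, so a real check would have to treat those separately.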

The last keynote provided some insight into the publication bias prevalent in academic publishing, and in medical publications in particular: there, the preference for publishing positive results is especially detrimental because it has a direct effect on patient treatment.