Strata EU - part 3
The first Tuesday morning keynote put the hype around big data into historical context: according to Wikipedia, big data applications are defined by their ability to cope with data sets larger than commonly available machines and algorithms can handle. Starting from that definition, a look back in history shows that the issue of big data isn't actually that new: even back in the 1950s people had to deal with big data problems. One example the speaker walked through was a trading company that, back then, had a very capable computer at its disposal. To ensure optimal utilisation they would rent out computing power whenever they did not need it for their own work. One of the tasks they took on this way was a government contract: rail freight charges had been changed to be distance based, so the British government needed the pairwise distances between all train stations in Great Britain. The developers did not have enough memory to hold the whole computation at once, so they had to partition the task. Also, Dijkstra's algorithm for finding shortest paths in graphs wouldn't be published for another four years – so they had to figure something out themselves to get the job done (compared with what Dijkstra later published, their solution was actually very similar; they just never published it). The conclusion is quite obvious: the problems we face today with petabytes of data aren't particularly new – once again we are pushing frontiers, inventing new algorithms as we go, and partitioning our data to suit the compute power we have.
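To make the algorithmic core concrete, here is a minimal Python sketch of the approach Dijkstra later published: single-source shortest paths over a weighted graph, run once per station to build up the full pairwise table. The rail network and distances below are invented for illustration; the talk did not show code, and the 1950s implementation was of course nothing like this.

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths; graph maps node -> list of (neighbour, distance)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in graph[node]:
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist

# Toy rail network with invented distances in miles.
network = {
    "London":    [("Reading", 36), ("Cambridge", 61)],
    "Reading":   [("London", 36), ("Bristol", 82)],
    "Cambridge": [("London", 61)],
    "Bristol":   [("Reading", 82)],
}

# The full pairwise table: one run per station. Each run only needs a single
# station's distances in memory at a time, which also hints at how the task
# can be partitioned when the whole table does not fit.
pairwise = {station: dijkstra(network, station) for station in network}
print(pairwise["London"]["Bristol"])  # 36 + 82 = 118
```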
With everyday examples and a bit of hackery, the second keynote went into detail on what it means to live in a world that increasingly depends on sensors around us. The first example the speaker gave was a hotel that used RFID cards for room access. The card itself noted that every entry to and exit from the room is tracked – scary enough on its own, and even more so considering how simple it is to trick the system behind it into revealing some of the gathered information, as the speaker showed a few slides later. A second example he gave was a leaked dataset of mobile device types, device names and usernames. Looking at the statistics of that dataset revealed a lot: the distribution of device types showed mainly iPads, as opposed to iPhones or Android phones; the distribution of device names showed, right after manufacturer defaults, mostly male first names. Correlating those names with statistics on the most common baby names per year suggested the owners were mainly in their mid-thirties. So the group of people whose data had leaked used the app mainly on an iPad and was mostly male and in their thirties. With a bit more digging it was possible to deduce exactly who had leaked the data – and to do so well enough that the responsible party (an American publisher) could not deny it.
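As a rough illustration of that kind of analysis, here is a small Python sketch: it tallies device types and estimates owner ages by mapping first names extracted from device names to the year each name peaked in popularity. All records and the name-to-year lookup are invented; the real dataset and the speaker's tooling were not shown.

```python
from collections import Counter

# Invented records mimicking the shape of the leaked data: (device_type, device_name).
records = [
    ("iPad",   "Michael's iPad"),
    ("iPad",   "iPad"),             # manufacturer default, carries no owner name
    ("iPhone", "Sarah's iPhone"),
    ("iPad",   "David's iPad"),
]

# Invented lookup: first name -> year in which it peaked as a baby name.
peak_birth_year = {"Michael": 1977, "Sarah": 1982, "David": 1975}

# Distribution of device types.
print(Counter(device_type for device_type, _ in records))

# Extract owner names from "<Name>'s <device>" patterns and estimate ages,
# assuming the analysis runs in 2012, around the time of the talk.
ages = [
    2012 - peak_birth_year[name.split("'")[0]]
    for _, name in records
    if name.split("'")[0] in peak_birth_year
]
print(sum(ages) / len(ages))  # rough average age of the affected users
```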
The last example showed how to correlate geographical self-tracking with credit card transactions to identify fraudulent charges – in some cases faster than the bank would discover them.
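A minimal sketch of that idea, assuming a GPS track and card charges that both carry timestamps and coordinates (the data layout and the distance and time thresholds are my assumptions, not the speaker's): flag a charge when no self-tracked position puts the card owner near the point of sale around the time of the transaction.

```python
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

def is_suspicious(charge, track, max_km=50, window=timedelta(hours=2)):
    """Flag a card charge if no self-tracked position puts the owner nearby in time."""
    charge_time, charge_lat, charge_lon = charge
    for point_time, lat, lon in track:
        close_in_time = abs(point_time - charge_time) <= window
        close_in_space = haversine_km(charge_lat, charge_lon, lat, lon) <= max_km
        if close_in_time and close_in_space:
            return False  # the owner was plausibly at the point of sale
    return True

# Invented data: a GPS fix in London, then a card charge in Berlin minutes later.
track = [(datetime(2012, 10, 2, 9, 30), 51.5074, -0.1278)]
charge = (datetime(2012, 10, 2, 9, 45), 52.5200, 13.4050)
print(is_suspicious(charge, track))  # True – the owner was nowhere near the charge
```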
The last keynote provided some insight into the publication bias prevalent in academic publishing, in particular in medical publications: there the preference for publishing positive results is especially detrimental, as it has a direct effect on patient treatment.