Day two of Strata started with an inspiring insight from the host that extended the vision discussed earlier in the tutorials: it is not about the tools at all. The current value of data analytics lies in the data itself and in the conclusions and actions drawn from analysing it.
The first keynote was presented by bit.ly. For them there are four dimensions to data analytics:
- Timeliness: There must be realtime access, or at least streaming access to incoming data.
- Storage: It must provide the means to efficiently store, access, query and operate on data.
- Education: There is no clear path to becoming a data scientist today.
- Imagination: It takes imagination to come up with new, interesting ways to look at existing data.
Since bit.ly stores shortened URLs, there really are three views on their data: the very personal view of the intrinsic preferences you express through your participation in the network; the neighborhood view, which takes your friends and acquaintances into account; and finally the global view, which allows for drawing conclusions on a very large scale. It is a way to find out what is happening worldwide just by looking at log data.
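To make the three views concrete, here is a minimal sketch that computes them from a toy click log. The record layout, the `friends` mapping and the function name `three_views` are hypothetical and only serve as illustration; nothing here reflects how bit.ly actually stores or processes its data.

```python
from collections import Counter


def three_views(clicks, friends, user):
    """Return (personal, neighborhood, global) click counts per URL.

    clicks  -- iterable of (user, url) pairs (assumed log format)
    friends -- dict mapping a user to the set of their friends (assumed)
    user    -- the user whose personal and neighborhood views we want
    """
    personal = Counter()
    neighborhood = Counter()
    global_view = Counter()
    circle = friends.get(user, set())
    for who, url in clicks:
        global_view[url] += 1          # everybody counts towards the global view
        if who == user:
            personal[url] += 1         # the user's own clicks
        elif who in circle:
            neighborhood[url] += 1     # clicks by friends and acquaintances
    return personal, neighborhood, global_view


# Made-up sample data for illustration only.
clicks = [
    ("ann", "a.example"),
    ("bob", "a.example"),
    ("bob", "b.example"),
    ("cat", "c.example"),
]
friends = {"ann": {"bob"}}
personal, neighborhood, global_view = three_views(clicks, friends, "ann")
```

The same log feeds all three counters; only the filter on who clicked changes, which is why one pass over the data is enough.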
In contrast to the all-digital bit.ly, Thomson Reuters comes from a very different background. Although they act on a global scale, distributing news worldwide, a lot of manual intervention is still needed to come up with high-quality, clean, curated data. In addition, their clients focus on very low latency to be able to act on incoming news at the stock market.
For traditional media providers it is very important to bring news together with context and users: Knowing who users are and where they live may result in delivering better service with more focussed information. However, the speaker sees a huge gap between what is possible with today's Web 2.0 applications and what is still common practice in large corporate environments: Social networking sites tend to gather data implicitly without clearly telling users what is collected and for which purpose. In corporate environments, though, it was (and still is) common practice to come up with general compliance rules that aim at protecting data privacy and insulating corporate networks from public ones.
Focussing on cautious and explicit data mining might let these corporate environments benefit from cost savings and targeted information publishing as well.
Mythology of big data
Each technology carries in itself the seeds of its own destruction, and the same is true for Hadoop and friends: The code itself is about to turn into a commodity. As a result the real value lies in the data it processes and in the knowledge of how to combine existing tools to solve 80% of your data analytics problems.
The myth really lies in the lonely hacker sitting in front of his laptop solving the world's data analysis problems. Instead, analytics is all about communication and learning from those who stored and generated the data. Only they are able to tell more about the business cases as well as the context of the data. Only domain knowledge can help solve real problems.
Over time, data went from being the product to being a by-product and, in the past decade, to being an asset. Nowadays it is turning into a substrate for developing better applications. There is no need for huge data sets to turn data into a basis for better applications. In the end it boils down to using data to revamp your organisation's decisions, moving from horse trading and gut-check based decisions to scientific, data-backed, informed decisions.
Amazon - Werner Vogels
For Amazon, big data means that storing, collecting, analyzing and processing the data are hard to do. Being able to do so currently is a competitive advantage. In contrast to BI, where questions drove the way data was stored and collected, today infrastructure is cheap enough to creatively come up with new analytics questions based on available data.
- Collecting data ranges from a streaming model to daily imports or even batch imports - never underestimate the bandwidth of FedEx. There even is a FedEx import at Amazon.
- Never underestimate the need for increased storage capacity. Storage on AWS can be increased dynamically.
- When organizing data keep data quality and manual cleansing in mind - there is a Mechanical Turk offering for that at AWS.
- For analysis MapReduce currently is the obvious choice - AWS offers Elastic MapReduce for that.
- The trend goes more and more towards sharing analysis results via public APIs to enable customers downstream to reuse data and provide added value on top of it.
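The remark about the bandwidth of FedEx can be checked with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the data volume, link speed and shipping time are made-up assumptions rather than figures from the talk, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def transfer_days(data_terabytes, link_megabits_per_second):
    """Days needed to push the data over a fully utilized network link."""
    bits = data_terabytes * 1e12 * 8                    # decimal terabytes to bits
    seconds = bits / (link_megabits_per_second * 1e6)   # link speed in bits per second
    return seconds / SECONDS_PER_DAY


def faster_to_ship(data_terabytes, link_megabits_per_second, shipping_days):
    """True if mailing the disks beats uploading over the network."""
    return transfer_days(data_terabytes, link_megabits_per_second) > shipping_days


# Assumed example: 10 TB over a 100 Mbit/s link versus a 2-day courier.
days_over_network = transfer_days(10, 100)
ship_it = faster_to_ship(10, 100, 2)
```

Under these assumptions the upload takes a bit more than nine days, so the courier wins; for small data sets the network wins again.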
Microsoft Azure Data Marketplace
Microsoft used their keynote to announce the Azure Data Marketplace - a place to make data available for easy use and trading. To work with data today you have to find it and license it from its original owner, which incurs the overhead of negotiating licensing terms.
The goal of Microsoft is to provide a one-click, one-stop shop for data with a unified and discoverable interface. They work with providers to ensure cleanup and curation. In turn providers get a marketplace for trading data. It will be possible to visualize data before purchase to avoid buying what you do not know. There is a subscription model that allows for constant updates and comes with licensing issues already cleared. There are consistent APIs to the data that can be incorporated by solution partners to provide better integration and analysis support.
At the very end the Heritage Health Prize was announced - a 3 million dollar data mining competition open for participation starting next April.