O'Reilly Strata - day one afternoon lectures

2011-02-13 22:18

Big data at startups - Infochimps

As a startup there is no option for getting good people other than growing your own: offer the chance to gain a lot of experience in return for a not-so-great wage. Start out with really great hires:

  • People who have the "get shit done gene": they discover new projects, are proud to contribute to team efforts, and are confident making changes to a code base they probably do not know beforehand. To find these people, ask open-ended questions in interviews.
  • People who are passionate learners, who use the tools out there, use open code and are willing to be proven wrong.
  • People who are generally fun to work with.

Put these people on small, non-mission-critical initial projects - make them fail on parallel tasks (and tell them they will fail) to teach them to ask for help. What is really hard for new hires is learning to deal with git, ssh keys and command line tooling, knowing what to do when something breaks, and knowing when to ask for help.

Infochimps uses Kanban for organisation: each developer has a task he has chosen at any given point in time. He is responsible for getting that task done - which may well involve getting help from others. Being responsible for a complete feature is one big performance boost once the feature truly goes online. Code review is used for teachable moments - and in cases where something really goes wrong.

Development itself is organised to optimise for developers' joy - which usually means taking Java out of the loop.

Machine learning at Orbitz

They use Hadoop mostly for log analysis. Here, too, they ran into the problem of fields or whole entries missing from the original log format. To be able to dynamically add new attributes and deal with growing data volumes they moved from a data warehouse solution to Apache Hadoop. Hadoop is used for data preparation before training, for training recommender models, and for cross-validation setups. Hive has been added for ad-hoc queries, usually issued by business users.
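
As a rough illustration of that data preparation step, here is a minimal sketch (not Orbitz's actual code) of a Hadoop Streaming style mapper in Python that tolerates the missing fields mentioned above; the field names are made-up assumptions:

    import sys

    # Assumed log layout - purely illustrative, not Orbitz's real schema.
    EXPECTED_FIELDS = ["timestamp", "user_id", "hotel_id", "position", "clicked"]

    def parse(line):
        parts = line.rstrip("\n").split("\t")
        # Older log entries may lack trailing fields; pad them so every record
        # ends up with the same shape.
        parts += [""] * (len(EXPECTED_FIELDS) - len(parts))
        return dict(zip(EXPECTED_FIELDS, parts))

    for line in sys.stdin:
        record = parse(line)
        # Drop entries missing the attributes the recommender actually needs.
        if not record["user_id"] or not record["hotel_id"]:
            continue
        print("\t".join(record[f] for f in EXPECTED_FIELDS))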

Data scaling patterns at LinkedIn

While scaling to ever-growing data, LinkedIn's developers collected a few patterns that make dealing with data easier:

  • When building applications, constantly monitor your invariants: it is very frustrating to run an hour-long job just to find out at the very end that you made a mistake during data import (a small invariant-check sketch follows this list).
  • Have a QA cluster, and have versioning on your releases to allow for easy rollback should anything go wrong. Unit tests go without saying.
  • Profile your jobs to avoid bottlenecks: Do not read from the distributed cache in a combiner - do not reuse code that was intended for a different component without thorough review.
  • Dealing with real-world data means dealing with irregular, dirty data: when generating pairs of users for connection recommendations, Obama caused problems as he is friends with seemingly every American.
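
A minimal sketch of such an invariant check, run over a sample of the imported data before launching the long job; the tab-separated layout, field names and sample size are illustrative assumptions, not LinkedIn's actual format:

    import itertools
    import sys

    def check_invariants(path, sample_size=10000):
        """Count records in a sample that violate the assumed import format."""
        violations = 0
        with open(path) as f:
            for line in itertools.islice(f, sample_size):
                fields = line.rstrip("\n").split("\t")
                if len(fields) != 3:  # assumed: member_id, connection_id, timestamp
                    violations += 1
                    continue
                member_id, connection_id, _timestamp = fields
                if not member_id.isdigit() or not connection_id.isdigit():
                    violations += 1
        return violations

    if __name__ == "__main__":
        bad = check_invariants(sys.argv[1])
        if bad:
            sys.exit("import looks broken: %d bad records in sample" % bad)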

However, the biggest bottleneck is IO during shuffling, as every mapper talks to every reducer. As a rule of thumb, do most of the work on the map side and minimise the data sent to reducers. This also applies to many of the machine learning M/R formulations. One idea for reducing shuffling load is to pre-filter on the map side with Bloom filters.
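
A minimal sketch of that idea, assuming a streaming-style Python mapper and a small hand-rolled Bloom filter (file name, filter size and hash count are illustrative assumptions, not LinkedIn's implementation): only records whose join key might appear on the other side are emitted, so everything else never reaches the shuffle.

    import hashlib
    import sys

    class BloomFilter(object):
        def __init__(self, size_bits=1 << 20, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, key):
            # Derive num_hashes bit positions from a single MD5 digest.
            digest = hashlib.md5(key.encode("utf-8")).hexdigest()
            for i in range(self.num_hashes):
                yield int(digest[i * 6:(i + 1) * 6], 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    def build_filter(path):
        bf = BloomFilter()
        with open(path) as f:
            for line in f:
                bf.add(line.strip())
        return bf

    if __name__ == "__main__":
        # In a real streaming job this file would come from the distributed cache;
        # "member_ids.txt" is a hypothetical name.
        candidates = build_filter("member_ids.txt")
        for line in sys.stdin:
            key = line.split("\t", 1)[0]
            if candidates.might_contain(key):
                sys.stdout.write(line)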

To serve at scale:

  • Run stuff multiple times.
  • Iterate quickly to get fast feedback.
  • Do AB testing to measure performance (a minimal significance-check sketch follows this list).
  • Push out quickly for feedback.
  • Try out what you would like to see.
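
As a small illustration of the AB testing point above, a sketch of a two-proportion z-test on conversion counts; the numbers are made up, and in practice one would likely reach for a statistics library instead:

    import math

    def ab_significance(conv_a, visitors_a, conv_b, visitors_b):
        """Two-proportion z-test for an A/B experiment."""
        p_a = conv_a / float(visitors_a)
        p_b = conv_b / float(visitors_b)
        pooled = (conv_a + conv_b) / float(visitors_a + visitors_b)
        stderr = math.sqrt(pooled * (1 - pooled) * (1.0 / visitors_a + 1.0 / visitors_b))
        z = (p_a - p_b) / stderr
        # Two-sided p-value from the normal approximation.
        p_value = math.erfc(abs(z) / math.sqrt(2))
        return z, p_value

    if __name__ == "__main__":
        z, p = ab_significance(430, 10000, 512, 10000)
        print("z = %.2f, p = %.4f" % (z, p))  # a small p suggests the variants really differ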

See also sna-projects.com/blog for more information.

O'Reilly Strata - Day two - keynotes

2011-02-12 20:17

Day two of Strata started with a very inspiring insight from the host that extended the vision discussed earlier in the tutorials: it's not at all about the tools; the current value of data analytics lies in the data itself and in the conclusions and actions drawn from analysing it.

Bit.ly keynote

The first keynote was presented by bit.ly - for them there are four dimensions to data analytics:

  • Timeliness: There must be realtime access, or at least streaming access to incoming data.
  • Storage must provide the means to efficiently store, access, query and operate on data.
  • Education as there is no clear path to becoming a data scientist today.
  • Imagination to come up with new interesting ways to look at existing data.

Storing shortened URLs, bit.ly really has three views on its data: the very personal, intrinsic preferences expressed in your participation in the network; the neighborhood view, taking into account your friends and acquaintances; and finally the global view that allows for drawing conclusions on a very large scale - a way to find out what's happening worldwide just by looking at log data.

Thomson Reuters

In contrast to the all-digital bit.ly, Thomson Reuters comes from a very different background: though it acts on a global scale, distributing news worldwide, a lot of manual intervention is still required to come up with high-quality, clean, curated data. In addition, their clients care about very low latency so they can act on incoming news at the stock market.

For traditional media providers it is very important to bring news together with context and users: knowing who users are and where they live may result in delivering better service with more focussed information. However, he sees a huge gap between what is possible with today's Web 2.0 applications and what is still common practice in large corporate environments: social networking sites tend to gather data implicitly without clearly telling users what is collected and for which purpose. In corporate environments, though, it was (and still is) common practice to come up with general compliance rules aimed at protecting data privacy and insulating corporate networks from public ones.

Focussing on cautious and explicit data mining might help these environments benefit from cost savings and targeted information publishing as well.

Mythology of big data

Each technology carries in itself the seeds of its own destruction - the same is true for Hadoop and friends: the code is about to turn into a commodity itself. As a result the real value lies in the data it processes and in the knowledge of how to combine existing tools to solve 80% of your data analytics problems.

The myth really lies in the lonely hacker sitting in front of his laptop solving the world's data analysis problems. Instead, analytics is all about communication and learning from those who stored and generated the data. Only they can tell you more about the business cases as well as the context of the data. Only domain knowledge can help solve real problems.

Over the past decade data evolved from being the product, to being a by-product, to being an asset. Nowadays it is turning into a substrate for developing better applications - and there is no need for huge data sets to use it that way. In the end it boils down to using data to move your organisation's decisions from horse-trading, gut-check decisions to scientific, data-backed, informed decisions.

Amazon - Werner Vogels

For Amazon, big data means that storing, collecting, analyzing and processing the data are hard to do. Being able to do so currently is a competitive advantage. In contrast to BI, where questions drove the way data was stored and collected, today infrastructure is cheap enough to creatively come up with new analytics questions based on the data that is already available.

  • Collecting data ranges from a streaming model over daily imports to batch imports - never underestimate the bandwidth of FedEx. There even is a FedEx import at Amazon.
  • Never underestimate the need for increased storage capacity. Storage on AWS can be increased dynamically.
  • When organizing data keep data quality and manual cleansing in mind - there is a Mechanical Turk offering for that at AWS.
  • For analysis, MapReduce currently is the obvious choice - AWS offers Elastic MapReduce for that.
  • The trend goes more and more towards sharing analysis results via public APIs to enable downstream customers to reuse data and provide added value on top of it.

Microsoft Azure Data Marketplace

Microsoft used their keynote to announce the Azure Data Marketplace - a place to make data available for easy use and trading. To deal with data today you have to find it and license it from its original owner - which incurs overhead for negotiating licensing terms.

Microsoft's goal is to provide a one-stop shop for data with a unified and discoverable interface. They work with providers to ensure cleanup and curation; in turn, providers get a marketplace for trading data. It will be possible to visualize data before purchase to avoid buying what you do not know. There is a subscription model that allows for constant updates and has licensing issues cleared. There are consistent APIs to the data that can be incorporated by solution partners to provide better integration and analysis support.

At the very end the Heritage Health Prize was announced - a $3 million data mining competition open for participation starting next April.

O'Reilly Strata - Tutorial data analytics

2011-02-11 20:17

Acting based on data

It comes as no surprise to hear that in the data analytics world, too, engineers are unwilling to share with higher management the details of how their analysis works - while on the other side there is not much interest in learning how analytics really works. This culture leads to a sort of black-art, witchcraft attitude towards data analytics that hinders most innovation.

When starting to establish data analytics in your business there are a few steps to consider: first of all, no matter how beautiful the visualizations may look in the tool you just chose to work with and are considering buying - keep in mind that shiny pebbles won't solve your problems. Instead focus on what kind of information you really want to extract and choose the tool that does that job best. Keep in mind that data never comes as clean as analysts would love it to be.

  • Ask yourself how complete your data really is (are all fields you are looking at filled for all relevant records?) - a small completeness-check sketch follows this list.
  • Are those fields filled with accurate information? (Ever asked yourself why everyone using your registration form seems to work for a startup with 1-100 engineers instead of choosing one of the many other options down the list?)
  • For how long will that data remain accurate?
  • For how long will it be relevant for your business case?
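
A minimal sketch of the completeness and accuracy questions above, run against a hypothetical CSV export; the column names and the suspicious default value are illustrative assumptions:

    import csv
    from collections import Counter

    def completeness_report(path, columns):
        """Fraction of records with a non-empty value per column."""
        filled = Counter()
        total = 0
        with open(path) as f:
            for row in csv.DictReader(f):
                total += 1
                for col in columns:
                    if (row.get(col) or "").strip():
                        filled[col] += 1
        return {col: filled[col] / float(total) for col in columns} if total else {}

    if __name__ == "__main__":
        report = completeness_report("registrations.csv",
                                     ["company_size", "industry", "country"])
        for col, ratio in sorted(report.items()):
            print("%s: %.0f%% filled" % (col, ratio * 100))
        # A field that is 100% filled but always holds the first dropdown option
        # (e.g. company_size == "1-100") is complete, yet not accurate.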

Even the cleanest data set can get you only so far: you need to link your data back to actual transactions to be able to segment your customers and add value from data analytics.

When introducing data analytics, check whether people are actually willing to share their data. Check whether management is willing to act on potential results - that may be as easy as spending lots of money on data cleansing, or it may involve changing workflows to be able to provide better source data. As a result of data analytics there may be even more severe changes ahead of you: Are people willing to change the product based on pure data? Are they willing to adjust the marketing budget? ... job descriptions? ... development budget? How fast is the turnaround for these changes? If changes are only made yearly, there is no value in having realtime analytics.

In the end it boils down to applying the OODA cycle: only if you can observe, orient, decide and act faster than your competitor do you have a real business advantage.

Data analytics ethics

Today Apache Hadoop provides the means to give data analytics superpowers to everyone: it combines the use of commodity hardware with scaling to high data volumes. With great power there must also come great responsibility, according to Stan Lee. In the realm of data science that involves being asked to solve problems that are technologically trivial but ethically at least questionable:

  • Helping others adjust their weapons to increase death rates.
  • Helping others turn themselves into a monopoly.
  • Predicting the likelihood that cheap food makes you so sick that you are able and willing to take the provider to court.

On the other hand it can solve cases that are mutually beneficial for the provider and the customer: predicting when visitors to a casino are about to become unhappy and willing to leave, before they even know it themselves, may give the casino employees a brief time window for counter-actions (e.g. offering a free meal).

In the end it boils down to not screwing up other people's lives; deciding which action does the least harm while achieving the most benefit; which treats people at least proportionally if not equally; what serves the community as a whole - or more simply: what leads me to being the person I always wanted to be.