O’Reilly Strata - day one afternoon lectures
Big data at startups - Infochimps
As a startup there is no other option for getting good people than to grow your own: Offer the chance to gain a lot of experience in return for a not-so-great wage. Start out with really great hires:
- People who have the "get shit done gene": They discover new projects, are proud to contribute to team efforts, and are confident making changes to a code base they probably have not seen beforehand. To find these people, ask open-ended questions in interviews.
- People who are passionate learners, who use the tools out there, use open code and are willing to be proven wrong.
- People who are generally fun to work with.
Put these people on small, non-mission-critical initial projects - make them fail on parallel tasks (and tell them they will fail) to teach them to ask for help. What is really hard for new hires: learning to deal with git, ssh keys and command line tools, knowing when to ask for help, and knowing what to do when something breaks.
Infochimps uses Kanban for organisation: Each developer has a task he has chosen at any given point in time and is responsible for getting that task done - which may well involve getting help from others. Being responsible for a complete feature is one big performance boost once the feature truly goes online. Code review is used for teachable moments - and in cases where something really goes wrong.
Development itself is organised to optimise for developers' joy - which usually means taking Java out of the loop.
Machine learning at Orbitz
They use Hadoop mostly for log analysis. Here, too, the problem of fields or whole entries missing from the original log format came up. To be able to dynamically add new attributes and to deal with growing data volumes, they went from a data warehouse solution to Apache Hadoop. Hadoop is used for data preparation before training, for training recommender models, and for cross-validation setups. Hive has been added for ad-hoc queries, usually issued by business users.
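To make the missing-data point concrete, here is a minimal sketch of a defensive log-parsing mapper (recent Hadoop MapReduce API); the five-column tab-separated layout, the field positions and the counter names are assumptions made up for the example, not Orbitz's actual log format:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal sketch: tolerate missing fields during data preparation.
// The five-column tab-separated layout is a hypothetical example.
public class LogCleanupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int EXPECTED_FIELDS = 5;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", -1);

        // Whole entries may be truncated - count and skip them.
        if (fields.length < EXPECTED_FIELDS) {
            context.getCounter("LogCleanup", "TRUNCATED_RECORDS").increment(1);
            return;
        }

        String userId = fields[0];
        String itemId = fields[2];

        // Individual fields may be empty - count and skip those records too.
        if (userId.isEmpty() || itemId.isEmpty()) {
            context.getCounter("LogCleanup", "MISSING_FIELDS").increment(1);
            return;
        }

        // Emit userId -> itemId pairs for downstream recommender training.
        context.write(new Text(userId), new Text(itemId));
    }
}
```

The counters make the amount of dropped data visible in the job's web UI instead of silently shrinking the training set.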
Data scaling patterns at LinkedIn
When scaling to growing data volumes, LinkedIn developers gathered a few patterns that help make dealing with data easier:
- When building applications, constantly monitor your invariants: It is frustrating to run an hour-long job just to find out at the very end that you made a mistake during data import (a counter check is sketched after this list).
- Have a QA cluster, and version your releases to allow for easy rollback should anything go wrong. Unit tests go without saying.
- Profile your jobs to avoid bottlenecks: Do not read from the distributed cache in a combiner, and do not reuse code that was intended for a different component without thorough review.
- Dealing with real-world data means dealing with irregular, dirty data: When generating pairs of users for connection recommendations, Obama caused problems as he is connected to seemingly every American.
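To make the invariant point concrete, here is a minimal sketch of a driver that checks counters right after a cheap map-only import job and aborts before any expensive training job starts; it reuses the hypothetical LogCleanupMapper sketched in the Orbitz section above, and the one-percent threshold is equally made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal sketch: check import invariants right after the cheap import
// job instead of noticing broken data at the end of an hour-long pipeline.
public class ImportWithInvariants {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log import");
        job.setJarByClass(ImportWithInvariants.class);
        job.setMapperClass(LogCleanupMapper.class); // defensive mapper sketched above
        job.setNumReduceTasks(0);                   // map-only import step
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        if (!job.waitForCompletion(true)) {
            System.exit(1);
        }

        // Invariant (hypothetical threshold): at most 1% of the input may be dropped.
        Counters counters = job.getCounters();
        long total = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
        long dropped = counters.findCounter("LogCleanup", "TRUNCATED_RECORDS").getValue()
                + counters.findCounter("LogCleanup", "MISSING_FIELDS").getValue();

        if (total == 0 || dropped * 100 > total) {
            System.err.println("Import invariant violated: dropped " + dropped
                    + " of " + total + " records - aborting before training.");
            System.exit(1);
        }
    }
}
```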
The biggest bottleneck, however, is IO during shuffling, as every map task talks to every reducer. As a rule of thumb, do most of the work on the map side and minimise the data sent to the reducers. This also applies to many of the machine learning M/R formulations. One idea for reducing the shuffle load is to pre-filter on the map side with Bloom filters.
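A minimal sketch of such a map-side pre-filter, based on the Bloom filter implementation that ships with Hadoop (org.apache.hadoop.util.bloom); the side file with the serialised filter, the configuration key and the tab-separated key layout are assumptions made up for the example:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Minimal sketch: drop records on the map side if their key cannot
// possibly match the other side of the join, so they never hit the shuffle.
public class BloomFilterJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final BloomFilter filter = new BloomFilter();

    @Override
    protected void setup(Context context) throws IOException {
        // Load a pre-built filter (e.g. over all user ids present on the
        // smaller side of the join) from a side file in the mapper setup.
        Path filterPath = new Path(context.getConfiguration().get("bloom.filter.path"));
        try (FSDataInputStream in =
                FileSystem.get(context.getConfiguration()).open(filterPath)) {
            filter.readFields(in);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String userId = value.toString().split("\t", -1)[0];

        // membershipTest never yields false negatives, so every record that
        // actually has a join partner still reaches the reducers.
        if (filter.membershipTest(new Key(userId.getBytes(StandardCharsets.UTF_8)))) {
            context.write(new Text(userId), value);
        }
    }
}
```

The filter itself would be built in a separate, cheap pass over the smaller dataset and serialised with its write method; since Bloom filters only produce false positives, a few unnecessary records may still be shuffled, but nothing that is needed gets lost.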
To serve at scale:
- Run stuff multiple times.
- Iterate quickly to get fast feedback.
- Do A/B testing to measure performance (a bucketing sketch follows this list).
- Push out quickly for feedback.
- Try out what you would like to see.
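As a small illustration of the A/B testing point, here is a sketch of deterministic bucketing; the class name, the hash choice and the 50/50 split are assumptions made up for the example. Hashing a stable member id together with the experiment name keeps each user in the same variant across requests while keeping different experiments independent of each other:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Minimal sketch of deterministic A/B bucketing.
public class AbBucketer {

    /** Returns true if this member should see the treatment variant. */
    public static boolean inTreatment(String memberId, String experiment) {
        CRC32 crc = new CRC32();
        crc.update((experiment + ":" + memberId).getBytes(StandardCharsets.UTF_8));
        return crc.getValue() % 100 < 50; // hypothetical 50/50 split
    }

    public static void main(String[] args) {
        // The same member and experiment always map to the same bucket.
        System.out.println(inTreatment("member-42", "new-recommender-v2"));
    }
}
```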
See also sna-projects.com/blog for more information.