Wonder if you should switch from your RDBMS to Apache Hadoop: Don't!

2013-08-26 17:10
Last weekend I spend a lot of fun time at FrOSCon* in Sankt Augustin - always great to catch up with friends in the open source space. As always there were quite a few talks on NoSQL, Hadoop, but also really solid advise on tuning your system for stuff like MySQL (including a side note on PostgreSQL and Oracle) from Kristian Köhntopp. When following some of the discussions in the audience before and after the talk I could not help but shake my head on some of the advise given about HDFS and friends.

This is to give a really short rule of thumb on what project to use for which occasion. Maybe it helps clear some false assumptions. Note: All of the below are most likely gross oversimplifications. Don't use it as hard and fast advise but as a first step towards finding more information with your preferred search engine.

Use Case 1 - relational data Technology
I have usual relational data Use a relational database - think MySQL and friends
I have relational data but my database doesn't perform. Tune your system, go to step 1.
I have relational data but way more reads than one machine can accomodate. Have master-slave replication turned on, configure enough slaves to accommodate your traffic.
I have relational data, but way too much data for a single machine. Start sharding your database.
I have a lot of reads (or writes) and too much data for a single machine. If the sharding+replication pain gets unbearable but you still need strong consistency guarantees start playing with HBase. You might loose the option of SQL but win being able to scale beyond traditional solutions. Hint: Unless your online product is hugely successful switching to HBase usually means you've missed some tuning option.
Use Case 2 - Crawling Technology
I want to store and process a crawl of the internet. Store it as flat files, if you like encode metadata together with the data in protocol buffers, thrift or Avro.
My internet crawl no longer fits on a single disk. Put multiple disks in your machine, RAID them if you like.
Processing my crawl takes too long. Optimise your pipeline. Make sure you utilise all processors in your machine.
Processing the crawl still takes too long. If your data doesn't fit on a single machine, takes way too long to process but there is no bigger machine that you can reasonably pay for you are probably willing to take some pain. Get yourself more than one machine, hook them together, install Hadoop and use either plain map reduce , Pig, Hive or Cascading to process the data. Distribution-wise Apache, Cloudera, MapR, Hortonworks are all good choices.
Use Case 3 - BI Technology
I have structured data and want my business analysts to find great new insights. Use a data warehouse your analysts are most familiar with.
I want to draw conclusions from one year worth of traffic on a busy web site (hint: resulting log files no longer fit on the hard disk of my biggest machine). Stream your logs into HDFS. From here it depends: If it's your developers that want to get their hands dirty, Cascading and depending packages might be a decent idea. There's plenty of UDFs in Pig that will help you as well. If the work is to be done by data analysts that only speak SQL use Hive.
I want to correlate user transactions with social media activity around my brand. See above.

A really short three bullet point summary

  • Use HBase to scale the backend your end users interact with. If you want to trade strong consistency for being able to span multiple datacenters on multiple continents take a look at Cassandra.
  • Use plain HDFS with Hive/Pig/Cascading for batch analysis. This could be business intelligence queries against user transactions, log file analysis for statistics, data extraction steps for internet crawls, social media data or other sensor data.
  • Use Drill or Impala for low latency business intelligence queries.

Good advise from ApacheConEU 2008/9

Back at one of the first ApacheCons I ever attended there was an Apache Hadoop BoF. One of the attendees asked for good reasons to switch from his current working infrastructure to Hadoop. In my opinion the advise he got from Christophe Bisciglia is still valid today. Paraphrased version:
For as long as you wonder why you should be switching to Hadoop, don't.

A parting note: I've left CouchDB, MongoDB, JackRabbit and friends out of the equation. The reason for this is my own lack of first-hand experience with those projects. Maybe someone else can add to the list here.

* A note to the organisers: Thilo and myself married last year in September. So when seeing the term "Fromm" in a speaker's surname doesn't automatically mean that the speaker hotel room should be booked on the name "Thilo Fromm" - the speaker on your list could as well be called "Isabel Drost-Fromm". It was hilarious to have the speaker reimbursement package signed by my husband though this year around I was the one giving a talk at your conference ;)

Apache Sling and Jackrabbit event coming to Berlin

2012-07-12 20:59
Interested in Apache Sling and/or Apache Jackrabbit? Then you might be interested in hearing that on September 26th to 28th there will be an event in town on these two topics - mainly organised by Adobe, but labeled as community event, meaning that there will be a number of active community members attending the conference: adaptTo().

From their website:

In late September 2012 Berlin will become the global heart beat for developers working on the Adobe CQ technical stack. pro!vision and Adobe are working jointly to set up a pure technical event for developers that will be focused on Apache Sling, Apache Jackrabbit, Apache Felix and more specifically on Adobe CQ: adaptTo(), Berlin. September 26-28 2012.

Apache Hadoop Get Together - Hand over

2011-11-02 16:20
Apache Hadoop receives lots of attention from large US corporations who are using the project to scale their data processing pipelines:

“Facebook uses Hadoop and Hive extensively to process large data sets. [...]” (Ashish Thusoo, Engineering Manager at Facebook), "Hadoop is a key ingredient in allowing LinkedIn to build many of our most computationally difficult features [...]" (Jay Kreps, Principal Engineer, LinkedIn), "Hadoop enables [Twitter] to store, process, and derive insights from our data in ways that wouldn't otherwise be possible. [...]" (Kevin Weil, Analytics Lead, Twitter). Found on Yahoo developer blog.

However the system's use is not limited to large corporations only: With 101tec, Zanox, nugg.ad, nurago also local German players are using the project to enable new applications. Add components like Lucene, Redis, CouchDB, HBase and UIMA to the mix and you end up with a set of majour open source components that allow developers to rapidly develop systems that until a few years ago were possible only either in Google-like companies or in research.

The Berlin Apache Hadoop Get Together started in 2008 allowed to learn more on how the average local company leveraged this software. It is a platform to get in touch informally, exchange knowledge and best practices across corporate boundaries.

After three years of organising that event it is time to hand it over to new caring hands: David Obermann from Idealo kindly volunteered to take over organisation. He is a long-term attendee of the event and will continue it in the roughly the same spirit as before: Technical talks on success stories by users, new features by developers - not solely restricted to Hadoop only but also taking into account related projects.

A huge Thank You for taking up the work of co-ordinating, finding a venue and a sponsor for the videos goes to David! If any of you attending the event think that you have an interesting story to share, would like to support the event financially or just help out please get in touch with David.

Looking forward to the next Apache Hadoop Get Together Berlin. Watch this space for updates on when and where it will take place.

Devoxx – Day 2 HBase

2010-12-09 21:25
Devoxx featured several interesting case studies of how HBase and Hadoop can be used to scale data analysis back ends as well as data serving front ends.


Dmitry Ryaboy from Twitter explained how to scale high load and large data systems using Cassandra. Looking at the sheer amount of tweets generated each day it becomes obvious that with a system like MySQL alone this site cannot be run.

Twitter has released several of their internal tools under a free software license for others to re-use – some of them being rather straight forward, others more involved. At Twitter each Tweet is annotated by a user_id, a time stamp (ok if skewed by a few minutes) as well as a unique tweet_id. In order to come up with a solution for generating the latter one they built a library called snowflake. Though rather simple algorithm even works in a cross data-centre set-up: The first bits are composed of the current time stamp, the following bits encode the data-centre, after that there is room for a counter. The tweet_ids are globally ordered by time and distinct across data-centres without the need for global synchronisation.

With gizzard Twitter released a rather general sharding implementation that is used internally to run distributed versions of Lucene, MySQL as well as Redis (to be introduced for caching tweet timelines due to its explicit support for lists as data structures for values that are not available in memcached).

FlockDB for large scale social graph storage and analysis. Rainbird for time series analysis, though with OpenTSDB there is something comparable available for HBase. Haplocheirus for message vector caching (currently based on memcached, soon to be migrated to Redis for its richer data structures). The queries available through the front-end are rather limited thus making it easy to provide pre-computed, optimised version in the back-end. As with the caching problem a tradeoff between hit rate on the pool of pre-computed items vs. storage cost can be made based on the observed query distribution.

In the back-end of Twitter various statistical and data mining analysis are run on top of Hadoop HBase To compute potentially interesting followers for users, to extract potentially interesting products etc.
The final take-home message here: Go from requirements to final solution. In the space of storage systems there is not such thing as a silver bullet. Instead you have to carefully evaluate features and properties of each solutions as your data and load increase.


When implementing Facebook Messaging (a new feature that was announced this week) Facebook decided to go for HBase instead of Cassandra. The requirements of the feature included massive scale, long-tail write access to the database (which more or less ruled out MySQL and comparable solutions) and a need for strict ordering of messages (which ruled out any eventually consistent system. The decision was made to use HBase.

A team of 15 developers (including operations and frontend) was working on the system for one year before it was finally released. The feature supports for integration of facebook messaging, IM, SMS and mail into one single system making it possible to group all messages by conversation no matter which device was used to send the message originally. That way each user's inbox turns into a social inbox.


Cosmin Lehene presented four use cases of Hadoop at Adobe. The first one dealt with creating and evaluating profiles of the Adobe Media Player. Users would be associated with a vector giving more information on what types of genre the meda they consumed belonged to. These vectors would then be used to generate recommendations for additional content to view in order to increase consumption rate. Adobe built a clustering system that would interface Mahout's canopy- and k-means implementations with their HBase backend for user grouping. Thanks Cosmin for including that information in your presentation!

A second use case focussed on finding out more on the usage of flash on the internet. Using Google to search for flash content was no good as only the first 2000 results could be viewed thus resulting in a highly skewed sample. Instead they used a mixture of nutch and HBase for storage to retrieve the content. Analysis was done with respect to various features of flash movies, such as frame rates. The analysis revealed a large gap between the perceived typical usage and the actual usage of flash on the internet.

The third use case involves analysis of images and usage patterns on the Photoshop-in-a-browser edition of Photoshop.com. The forth use case dealt with scaling the infrastructure that powers businesscatalyst – a turn-key online business platform solution including analysis, campaigning and more. When purchased by Adobe the system was very successful business-wise. However the infrastructure was by no means able to put up with the load it had to accommodate. Changing to a back-end based on HBase led to better performance, faster report generation.

Devoxx – University – Cassandra, HBase

2010-12-06 21:20
During the morning session FIXME Ellison gave an introduction to the distributed NoSQL database Cassandra. Being generally based on the Dynamo paper from Amazon the key-value store distributes key/value pairs according to a consistent hashing schema. Nodes can be added dynamically making the system well suited for elastic scaling. In contrast to Dynamo, Cassandra can be tuned for the required consistency level. The system is tuned for storing moderately sized key/value pairs. Large blobs of data should not be stored into it. A recent addition to Cassandra has been the integration with Hadoop's Map/Reduce implementation for data processing. In addition Cassandra comes with a Thrift interface as well as higher level interfaces such as Hector.

In the afternoon Michael Stack and Jonathan Grey gave an overview of HBase covering basic installation, going into more detail concerning the APIs. Most interesting to me was the HBase architecture and optimisation details. The systems is inspired by Google's BigTable paper. It uses Apache HDFS as storage back-end inheriting the failure resilience of HDFS. The system uses Zookeeper for co-ordination and meta-data storage. However fear not, Zookeeper comes packaged with HBase, so in case you have not yet setup your own Zookeeper installation, HBase will do that for you on installation.

HBase is split into a master server for holding meta-data and region servers for storing the actual data. When storing data HBase optimises storage for data locality. This is done in two ways: On write the first copy usually goes to the local disk of the client, so even when storage exceeds the size of one block and remote copies get replicated to differing nodes, at least the first copy gets stored on one machine only. During a compaction phase that is scheduled regularly (usually every 24h, jitter time can be added to lower the load on the cluster) data is re-shuffled for optimal layout. When running Map/Reduce jobs against HBase this data locality can be easily exploited. HBase comes with its own input and output formats for Hadoop jobs.

HBase comes not only with Map/Reduce integration, it also publishes a Thrift interface, a REST interface and can be queried from an HBase shell.

Devoxx University – MongoDB, Mahout

2010-12-05 21:19
The second tutorial was given by Roger Bodamer on MongoDB. It concentrates on being horizontally scalable by avoiding joins and complex, multi document transactions. It supports a new data model that allows for flexible, changeable "schemas".

The exact data layout is determined by the types of operations you expect for your application, by the access patterns (reading vs. writing data; types of updates and types of queries). Also don't forget about indexing tables by columns to speed up frequently run queries.

Scaling MongoDB is supported by replication in a master/slave setup quite as any traditional system. In a replica set of n nodes, any of these can be elected as the primary (taking writes). If that one goes down, new master election happens. For durability all writes are required to go to at least a majority of all nodes, if that does not happen, now guarantee is given as to the availability of the update in case of primary failure. Write sharding comes with MongoDB as well.
Java support for Mongo is pretty standard - Raw Mongo driver comes in a Map<... ...> flavour. Morphia supports Pojo mapping, annotations etc. for MongoDB Java integration, Code generators for various other JVM languages are available as well.

See also: http://blog.wordnik.com/12-months-with-mongodb

My talk was scheduled for 30min in the afternoon. I went into some detail on what is necessary to build a news clustering system with Mahout and finished the presentation by a short overview of the other use cases that could be solved with the various algorithms. In the audience, nearly all had heard about Hadoop before – most likely in the introductory session that same morning. Same for Lucene. Solr was known to about half of the attendees. Mahout to just a few. Knowing that only very few attendees had any Machine Learning background I tried to provide a very high level overview of what can be done with the library, not going into too much mathematical details. There were quite a few interested questions after the presentation – both online and offline, including requests for examples on how to integrate the software with Solr. In addition connectors for instance to HBase as a data-source were interesting to people. Show-casing integration of Mahout, possibly even providing not only Java- but also REST interfaces might be one route to easier integration and faster adoption of Mahout.

Devoxx Antwerp

2010-12-03 21:16
With 3000 attendees Devoxx is the largest Java Community conference world-wide. Each year in autumn it takes place in Antwerp/ Belgium, in recent years in the Metropolis cinema. The conference tickets were sold out long before doors were opened this year.
The focus of the presentations are mainly on enterprise Java featuring talks by famous Joshua Bloch, Mark Reihnhold and others on new features of the upcoming JDK release as well as intricacies of the Java programming language itself.
This year for the first time the scope was extended to include one whole track on NoSQL databases. The track was organised by Steven Noels. It featured fantastic presentations on HBase use cases, easily accessible introductions to the concepts and usage of Hadoop.
To me it was interesting to observe which talks people would go to. In contrast to many other conferences here the NoSQL/ cloud-computing presentations were less visited than I'd have expected. One reason might be the fact that especially on conference day two they had to compete with popular topics such as the Java puzzlers, Live Java posse and others. However when talking to other attendees their seemed to be a clear gap between the two communities caused probably by a mixture of

  • there being very different problems to be solved in the enterprise world vs. the free software, requirements and scalability driven NoSQL community. Although even comparably small companies (compared to the Googles and Yahoo!s of this world) in Germany are already facing scaling issues, these problems are not yet that pervasive in the Java community as a whole. To me this was rather important to learn, as coming from a Machine learning background, now working for a search provider and being involved with Mahout, Lucene and Hadoop scalability and a growth in data has always been one of the major drivers for any projects I have been working on so far.
  • Even when faced with growing amounts of data in the regular enterprise world developers seem to be faced with the problem of not being able to freely select the technologies to be used for implementing a project. In contrast to startups and lean software teams there still seem to be quite a few teams that are not only given what to implement but also how to implement the software unnecessarily restricting the tools to use to solve a given problem.

One final factor that drives developers adopting NoSQL and cloud computing technologies is the observation for the need to optimise the system as a whole – to think outside the box of fixed APIs and module development units. To that end the DevOps movement was especially interesting to me as only by getting the knowledge largely hidden in operations teams into development and mixing that with the skill of software developers can lead to truly elastic and adaptable systems.

Apache Mahout at Apache Con NA

2010-10-15 20:39
The upcoming Apache Con NA to take place in Atlanta will feature several tracks relevant to users of Apache Mahout, Lucene and Hadoop: There will be a full track on Hadoop as well as one on NoSQL on Wednesday featuring talks on the framework itself, Pig and Hive as well as presentations from users on special use cases and on their way of getting the system to production.

The track on Mahout, Lucene and friends starts on Thursday afternoon, followed by another series of Lucene presentations on Friday.

Also don't miss the track on the community and business tracks for a glimpse behind the scenes. Of course there will also be tracks on well-known Apache Tomcat, httpd, OSGi and many, many more.

Looking forward to meeting you in Atlanta!

NoSQL summer Berlin - this evening

2010-08-11 06:38
This evening at Volkspark Friedrichshain, Café Schoenbrunn the next NoSQL summer Berlin (organised by Tim Lossen) is meeting to discuss the paper on Amazon's Dynamo "Dynamo: Amazon's Highly Available Key-value Store". The group is planning to meet at 19:30 for some beer and discussions on the publication.

My highly subjective Berlin Buzzwords recap

2010-06-13 18:32
Last November I innocently asked Grant what it would take to make him to give a talk in Berlin. The only requirement he told me was that I'd have to pay for his flight. About eight months later we had Berlin Buzzwords - a conference all around the topics scalability, data storage and search. With Simon Willnauer, Uwe Schindler, Michael Busch, Robert Muir, Grant Ingersoll, Andrzej Bialecki and many others we had quite a few Lucene people in town.

From the NoSQL community, Peter Neubauer, Rusty Klophaus, Jan Lehnardt, Mathias Meyer, Eric Evans and many others made sure people got their fair share of NoSQL knowledge. With Aaron Kimball, Jay Booth, Doug Judd and Steve Loughran we had several Hadoop and related people at the conference.

The conference also featured two talks on Apache Mahout: An overview from Frank Scholten as well as a more in-depth talk by Sean Owen. It's great to see the project grow - not only in terms of development community but also in terms of requests from professional Mahout users.

In addition we had a keynote by Pieter Hintjens that concentrated on messaging in general and 0MQ in particular - a scalability topic otherwise highly underrepresented at Berlin Buzzwords.

We got well over 300 attendees that filled Berlin Kosmos - a former cinema. Attendees were a good mixture of Apache and non-Apache people, developers and users. People used the breaks and bar tours after the event to get in touch, exchange ideas. It's always good to see developers discuss design issues and architectural challenges.

Monday evening was reserved for local people taking out the speakers and interested attendees for Bar Tours to Friedrichshain. Those from Berlin took Berlin Buzzwords people to their favourite restaurants and bars - or to what they considered to be "typical Berlin". Some spent evenings later that week drinking beer or Berliner Weisse.

The tour for keynote speakers Grant Ingersoll, Pieter Hintjens and friends was organised by Julia and myself. We went over to Kreuzberg - some went to famous Burgermeister for Burgers, the other half went to a nearby Indian restaurant. After that we spent the evening in Club der Visionäre - a club next to the water. Me personally I left at about midnight - several people of the Lucene community moved to the well known Fette Ecke later on.

When asking the audience about repeating the conference next year, all hands went up immediately. Beside lots of praise for the organisation, from the feedback form we put up we got some good ideas on how to improve the conference next year. I'd love to have you guys back here in 2011 - and I'd love to get even more attendees in. Was great fun having you here. Thanks for 5 great days:

Five instead of two days, because:

  • Keynote speakers got a special treatment - that is a personal city guide for the weekend before Buzzwords.
  • We had the official conference start on Sunday with a Barcamp.
  • We had another Apache dinner on Wednesday with those Apache people that live in Berlin. In addition the Aaron and Sarah joined us as they were still in town for the Apache Hadoop trainings. Also Greg Stein had pizza and beer with us - he was in town for the svn conference at the end of the week.

Thanks to all who helped turn this conference into a success: Julia Gemählich for conference management, Ulf and Wetter for WiFi setup, Nils for travel management, Simon and Jan for support ranking talks and reaching out to your communities, all speakers for fantastic talks, those taking pictures of the conference and sharing them on Flickr for showing those who stayed at home how great the conference was, peoplezapping for the videos that will soon be available online, all sponsors for supporting the conference, all attendees for their participation. I'd love to have all of you (and many more) back in Berlin next year. An informal call for presentations has been set up already - submit now and be the one to set the trend instead of just following the Buzzwords!

For those who do not want to wait for another year: We will have another Apache Hadoop Get Together in September 2010 - watch this space for more information. If you'd like to give a talk their and present your Hadoop/ Solr/ Lucene etc. system - please get in touch with me.