Large Scalability - Papers and implementations

2009-06-23 12:08
In recent years the Googles and Amazons on this world have released papers on how to scale computing and processing to terrabytes of data. These publications have led to the implementation of various open source projects that benefit from that knowledge. However mapping the various open source projects to the original papers and assigning tasks that these projects solve is not always easy.

With no guarantee of completeness this lists provides a short mapping from open source project to publication.

There are further overviews available online as well as a set of slides from the NOSQL debrief.















Map Reduce Hadoop Core Map Reduce Distributed programming on rails, 5 Hadoop questions, 10 Map Reduce Tips
GFS HDFS (Hadoop File System) Distributed file system for unstructured data
Bigtable HBase, Hypertable Distributed storage for structured data, When to use HBase.
Chubby Zookeeper Distributed lock- and naming service
Sawzall PIG, Cascading, JAQL, Hive Higher level langage for writing map reduce jobs
Protocol Buffers Protocol Buffers, Thrift, Avro, more traditional: Hessian, Java serialization Data serialization, early benchmarks
Some NoSQL storage solutions CouchDB, MongoDB CouchDB: document database
Dynamo Dynomite, Voldemort, Cassandra Distributed key-value stores
Index Lucene Search index
Index distribution katta, Solr, nutch Distributed Lucene indexes
Crawling nutch, Heritrix, droids, Grub, Aperture Crawling linked pages



June 2009 Apache Hadoop Get Together @ Berlin

2009-06-21 21:33
Just a brief reminder: Next week on Thursday the next Apache Hadoop Get Together is scheduled to take place in Berlin. There are quite a few interesting talks scheduled:

  • Torsten Curdt: Data Legacy - the challenges of an evolving data warehouse
  • Christoph M. Friedrich, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI): "SCAIView - Lucene for Life Science Knowledge Discovery".
  • Uri Boness from JTeam in Amsterdam: Solr - From Theory to Practice.


See http://upcoming.yahoo.com/event/2488959/ for more information.

For those interested in NOSQL Meetups, the discussion over at the NOSQL mailing list might be of interest to you: http://blog.oskarsson.nu/2009/06/nosql-debrief.html

Back from Zürich

2009-05-05 16:58
I spend the last five days in Zurich. I wanted to visit the city again - and still owed one of my friends there a visit. I am really happy the weather was quite nice over the weekend. That way I could spend quite some time in town (got another one of those puzzles) and go for a hike on the Ütli mountain: I took the steep way up that had quite a lot of stairs. Interestingly though, despite being quite tired when I finally arrived on top, my legs did not have sore muscles the next day. Seems going to work and back again by bike does indeed help a bit, even if we have no hills in Berlin.

Yesterday I was allowed to present the Apache project Mahout in a Google tech talk. Usually I am talking to people well familiar with the various Apache projects. Giving my talk I asked people who was familiar with Lucene, with Hadoop. To me it was pretty unusual that very few engineers were aware of these. It almost seemed like it is unusual to have a look at what is going outside the company? Or was it just the selection of people that were interested in my talk?

I tried to cover most of the basics, put Mahout into the context of the Lucene umbrella project. I tried to show some of the applications that can be built with Mahout and detailed some of the things that are on our agenda.

Some of the questions I received were on the scalability of Hadoop, on the general distribution of people being paid to work on Free Software projects vs. those working on them in their freetime. Another question was whether the project is targeted to text only applications (which of course it is not, as feature extraction so far has been left to the user). Last but not least the relation to UIMA was brought up by a former IBM-UIMA engineer.

To summarize: For me it was a pretty interesting experience to give this tech talk. I hope it did help me to do away with some of my "Apache bias". It is always valuable to look into what is going outside one's community.

DIMA @ TU Berlin

2009-05-03 07:26
On Friday, the 24th of April Prof. Volker Markl organised a Welcome Workshop at TU Berlin. The day started with an introduction by the Dekan of the faculty. First talk was given by Rudolf Bayer on the topic "From B-Trees to UB-Trees". Second presentation was by Guy Lohman on "LEO, DB2's Learning Optimizer".

After the coffee break, Volker Markl gave an introduction to his selected research field, outstanding tasks and the way he is going to accomplish his goals. Seems like scalability is playing a major role in his tasks. Interestingly Hadoop was chosen as an infrastructure basis.

In his talk Volker Markl announced the newly started BBI Colloquium. It is a regular meeting in Berlin dedicated to the scientific discurs on topics relevant to the participating researchers. Participating researchers are Prof. Oliver Günther, Prof. Johann-Christoph Freytag, Prof. Ulf Leser from HU Berlin, Prof. Dr. Volker Markl from TU Berlin, Prof. Dr. Heinz Schweppe from FU Berlin and Prof. Dr. Felix Naumann from HPI Potsdam.

Feedback from the Hadoop User Group UK

2009-04-29 08:54
A few weeks after the Hadoop User Group UK is over, there are quite a few postings on the event online. I will try to keep this page updated if there are any further reviews. The one I found so far:

http://huguk.org/2009/04/huguk-2-wrap-up.html - the wrap-up of the event itself.

http://blog.oskarsson.nu/2009_04_01_archive.html - a short summary by the organiser - Thanks again for a great event.

http://www.cloudera.com/blog/2009/04/21/hadoop-uk-user-group-meeting/ - a short summary on the Cloudera blog.

http://people.kmi.open.ac.uk/adam/?p=26 - a quick overview with a Mahout focus by Adam Rae.

June 2009 Apache Hadoop Get Together @ Berlin

2009-04-23 19:30
Title: Apache Hadoop Get Together @ Berlin
Location: newthinking store Berlin Mitte
Link out: Click here
Description: I just announced the fifth Apache Hadoop Get Together in Berlin at the newthinking store. Torsten Curdt offered to give a talk on data serialization with Thrift and Protocol Buffers.

If you have a topic you would like to talk about: Feel free to just bring your slides - there will be a beamer and lots of people interested in scalable information retrieval.
Start Time: 17:00
Date: 2009-06-25

Mahout on EC2

2009-04-21 21:00
Amazon released Elastic Map Reduce only a few weeks ago. EMR is based on a hosted Hadoop environment and offers machines to run map reduce jobs against data in S3 on demand.

Last week Stephen Green has spent quite some effort to get Mahout running on EMR. Thanks to his work Mahout is running on EMR since last Thursday night. Read the weblog of Tim Bass for further information.

Hadoop User Group UK

2009-04-21 20:34
On Tuesday the 14th the second Hadoop User Group UK took place in London. This time venue and pizza was sponsored by Sun. The room quickly filled http://www.thecepblog.com/2009/04/19/kmeans-clustering-now-running-on-elastic-mapreduce/with approximately 70 people.

Tom opened the session with a talk on 10 practical tips on how to get the most benefit from Apache Hadoop. The first question users should ask themselves is which type of programming language they want to use. There is a choice between structured data processing languages (PIG or Hive), dynamic languages (Streaming or Dumbo), or using Java which is closest to the system.

Tom's second hint dealt with the size of files to process with Hadoop: Both - too large unsplittable and too small ones are bad for performance. In the first case, the workload cannot be easily distributed across the nodes in the latter case each unit of work is to small to account for startup and coordination overhead. There are ways to remedy these problems with sequence files and map files though. Another performance optimization would be to chain individual jobs - PIG and Hive do a pretty decent job in automatically generating such jobs. ChainMapper and ChainReducer can help with creating chained jobs.

Another important task when implementing map reduce jobs is to tell Hadoop the progress of your job. For once, this is important for long running jobs in order for them to remain alive and not be killed by the framework due to timeouts. Second, it is convenient for the user as he can view the progress in the web UI of Hadoop.

Usual suspects for tuning a job: Number of mappers and reducers, usage of combiners, compression customised data serialisation, shuffling tweaks. Of course there is always the option to let someone else do the tuning: Cloudera does provide support as well as pre-built packages init scripts and the like ;)

In the second talk I did a brief Mahout intro. It was surprising to me that half of the attendees already employed machine learning algorithm implementations in their daily work. Judging from the discussion after the talk and from questions I received after it the interest in the project seems pretty high. The slide I liked the most: The announcement of our first 0.1 release. Thanks to all Mahout committers and contributors who made this possible.

After the coffee break Craig gave an introduction to Terrier an extensible information retrieval plattform developed at the university of Glasgow. He mentioned a few other open IR platforms namely Tuple Flow, Zettair, Lemur/Indri, Xapian, as well as of course nutch/Solr/Lucene.

What does Terrier have to do with the HugUK? Well index creation in Terrier is now based on an implementation that makes use of Hadoop for parallelization. Craig did some very interesting analysis on scalability of the solution: The team was able to achieve scaling near linear in the number of nodes added (at least as long as more than reducer is used ;) ).

After the pizza Paolo described his experiences implementing the vanilla pagerank computation with Hadoop. One of his test datasets was the Citeseer citation graph. Interestingly enough: Some of the nodes in this graph have self references (maybe due to extraction problems), duplicate citations, and the data comes in an invalid xml format.

The last talk was on HBase by Michael Stack. I am really happy I attended HugUK as I missed that talk in Amsterdam at the ApacheCon. First Michael gave an overview of which features of a typical RDBMS are not supported by HBase: Relations, joins, and of course JDBC being among the limitations. On the pro site HBase offers a multiple node solutions that has scale out and replication built in.

HBase can be used as source as well as as sink for map reduce jobs and thus integrates nicely with the Apache Hadoop stack. The framework provides a simple shell for administrative tasks (surgery on sick clusters forced flushes non sql get scan and put methods). In addition the master comes with a UI to monitor the cluster state.

Your typical DBA work though differs with HBase: Data locality and physical layout do matter and can be configured. Michaels recommendation was to start out testing with the XL instance on EC2 and decrease instances if you find out that it is too large.

The talk finished with an outlook of the features in the upcoming release the issues on the todo list and an overview of companies already using HBase.

After talks were finished quite a few attendees went over to a pub close by: Drinking beer, discussing new directions and sharing war stories.

I would to thank Johan Oskarsson for organising the event. And a special thanks to Tom for letting me use his Laptop for the Apache Mahout presentation: the hard disk of mine broke exactly one day before.

Last but not least thank you to Sylvio and Susi for letting me stay at their place - and thanks to Helene for crying only during daytime when I was out anyway ;)

Hope to see at least some of the attendees again at the next Hadoop Meetup in Berlin. Looking forward to the next Hadoop User Group UK.

Hadoop User Group UK

2009-03-07 23:10
Title: Hadoop User Group UK
Location: London/ Sun Office
Link out: Click here
Date: 2009-04-14


Johan Oskarsson is organising the second Hadoop User Group UK in London in April this year. The schedule is already up:


  • Tom White (Cloudera): Practical MapReduce
  • Michael Stack (Powerset): Apache HBase
  • Isabel Drost (ASF): Introducing Apache Mahout
  • Iadh Ounis and Craig Macdonald (University of Glasgow): Terrier
  • Paolo Castagna (HP): "Having Fun with PageRank and MapReduce"


Tickets are free but limited. So better register soon. Looking forward to seeing you there.

March 2009 Hadoop Get Together Berlin

2009-03-07 19:50
Since last summer, newthinking store Berlin is hosting a Hadoop Meetup every quarter of the year. The scope of these user group meetings is not only on Hadoop projects but deals with technologies necessary with storing, processing and searching large amounts of data.

The meeting last Thursday featured a talk by Lars George on his experiences using HBase in customer projects as early as in 2007. His talk discussed his requirements for a distributed database. He then explained the basics of HBase and described his experiences using the software for customer projects. Bottom line for me is that although in a very early stage the project does provide a lot of value: Instead of re-implementing your own solution it is possible to benefit from the efforts of others. One thing I consider especially remarkable is the effort of the HBase community helping users in case they run into problems.

The second talk was from Jan Lehnardt on CouchDB. Jan explained the main design goals of the system. He detailed the architecture of CouchDB. Then he explained how Erlang made it possible to reach the goals in comparably short time.

The slides of the talks are both available online:

Lars George: HBase

Jan Lehnardt: CouchDB

The talks were followed by several questions and interesting discussions (with some beer in the Keyser Soze close by).

The next Get Together will be held in June 2009. Looking forward to see you in Berlin by then.