Hadoop User Group UK

2009-04-21 20:34
On Tuesday the 14th the second Hadoop User Group UK took place in London. This time, venue and pizza were sponsored by Sun. The room quickly filled with approximately 70 people.

Tom opened the session with a talk on ten practical tips on how to get the most benefit out of Apache Hadoop. The first question users should ask themselves is which type of programming language they want to use: there is a choice between structured data processing languages (Pig or Hive), dynamic languages (Streaming or Dumbo), and Java, which is closest to the system.

Tom's second hint dealt with the size of files to process with Hadoop: both too-large unsplittable files and too-small ones are bad for performance. In the first case the workload cannot be easily distributed across the nodes; in the latter case each unit of work is too small to justify the startup and coordination overhead. There are ways to remedy both problems with sequence files and map files, though. Another performance optimisation is to chain individual jobs - Pig and Hive do a pretty decent job of generating such chains automatically. ChainMapper and ChainReducer can help with creating chained jobs by hand.
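
For illustration, here is a minimal sketch (my own, not from the talk) of how such a chain could be wired up with the org.apache.hadoop.mapred API of that time - ChainedJob, TokenizeMapper, FilterMapper and SumReducer are hypothetical placeholder classes:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;
    import org.apache.hadoop.mapred.lib.ChainReducer;

    public class ChainedJob {
      public static JobConf configure() {
        JobConf job = new JobConf(ChainedJob.class);

        // First mapper in the chain: raw input to tokens.
        ChainMapper.addMapper(job, TokenizeMapper.class,
            LongWritable.class, Text.class,   // its input key/value types
            Text.class, Text.class,           // its output key/value types
            true, new JobConf(false));

        // Second mapper runs on the output of the first, inside the same
        // task - no intermediate job and no extra round trip to HDFS.
        ChainMapper.addMapper(job, FilterMapper.class,
            Text.class, Text.class,
            Text.class, LongWritable.class,
            true, new JobConf(false));

        // A single reducer closes the chain.
        ChainReducer.setReducer(job, SumReducer.class,
            Text.class, LongWritable.class,
            Text.class, LongWritable.class,
            true, new JobConf(false));

        return job;
      }
    }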

Another important task when implementing MapReduce jobs is to tell Hadoop about the progress of your job. For one thing, this is important for long-running tasks so that they remain alive and are not killed by the framework due to timeouts. Second, it is convenient for users, as they can view the progress in Hadoop's web UI.
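
A minimal sketch of what this can look like in a mapper of the old org.apache.hadoop.mapred API - the expensive computation is of course just a placeholder:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SlowMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      public void map(LongWritable offset, Text record,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        for (String token : record.toString().split("\\s+")) {
          expensiveComputation(token);                      // placeholder for real work
          reporter.progress();                              // tell the framework we are alive
          reporter.incrCounter("SlowMapper", "TOKENS", 1);  // shows up in the web UI
        }
        reporter.setStatus("processed record at offset " + offset.get());
        output.collect(record, new LongWritable(1));
      }

      private void expensiveComputation(String token) { /* ... */ }
    }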

Usual suspects for tuning a job: the number of mappers and reducers, usage of combiners, compression, customised data serialisation, and shuffling tweaks. Of course there is always the option to let someone else do the tuning: Cloudera does provide support as well as pre-built packages, init scripts and the like ;)
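
To make those knobs a little more concrete, here is a hedged sketch using the JobConf API of that time - MyJob and SumReducer are placeholders, and the numbers are illustrative, not recommendations:

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class TuningExample {
      public static JobConf configure() {
        JobConf job = new JobConf(MyJob.class);

        job.setNumReduceTasks(16);                        // number of reducers
        job.setCombinerClass(SumReducer.class);           // combiner cuts shuffle volume
        job.setCompressMapOutput(true);                   // compress intermediate data
        job.setMapOutputCompressorClass(GzipCodec.class);
        job.setInt("io.sort.mb", 200);                    // shuffle tweak: map-side sort buffer

        return job;
      }
    }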

In the second talk I gave a brief Mahout intro. It was surprising to me that half of the attendees already employed machine learning algorithm implementations in their daily work. Judging from the discussion after the talk and from the questions I received afterwards, the interest in the project seems pretty high. The slide I liked most: the announcement of our first 0.1 release. Thanks to all Mahout committers and contributors who made this possible.

After the coffee break Craig gave an introduction to Terrier, an extensible information retrieval platform developed at the University of Glasgow. He mentioned a few other open IR platforms, namely TupleFlow, Zettair, Lemur/Indri, and Xapian, as well as, of course, Nutch/Solr/Lucene.

What does Terrier have to do with the HugUK? Well, index creation in Terrier is now based on an implementation that uses Hadoop for parallelisation. Craig presented some very interesting analysis of the scalability of the solution: the team was able to achieve scaling nearly linear in the number of nodes added (at least as long as more than one reducer is used ;) ).

After the pizza Paolo described his experiences implementing the vanilla PageRank computation with Hadoop. One of his test datasets was the CiteSeer citation graph. Interestingly enough, some of the nodes in this graph have self references (maybe due to extraction problems), there are duplicate citations, and the data comes in an invalid XML format.
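
For readers new to the algorithm, here is a rough sketch (my own, not Paolo's code) of what one PageRank iteration can look like in the old Hadoop API, assuming each input line maps a node id to its current rank and a comma-separated list of outlinks:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class PageRankIteration {

      public static class PRMapper extends MapReduceBase
          implements Mapper<Text, Text, Text, Text> {
        public void map(Text node, Text rankAndLinks,
                        OutputCollector<Text, Text> out, Reporter r) throws IOException {
          String[] parts = rankAndLinks.toString().split("\t");
          double rank = Double.parseDouble(parts[0]);
          String[] links = parts.length > 1 ? parts[1].split(",") : new String[0];
          for (String target : links) {   // distribute the rank over all outlinks
            out.collect(new Text(target), new Text(Double.toString(rank / links.length)));
          }
          // re-emit the adjacency list so the reducer can rebuild the graph
          out.collect(node, new Text("#" + (parts.length > 1 ? parts[1] : "")));
        }
      }

      public static class PRReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text node, Iterator<Text> values,
                           OutputCollector<Text, Text> out, Reporter r) throws IOException {
          double sum = 0.0;
          String links = "";
          while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("#")) links = v.substring(1);  // the graph structure
            else sum += Double.parseDouble(v);              // a rank contribution
          }
          double rank = 0.15 + 0.85 * sum;                  // damping factor 0.85
          out.collect(node, new Text(rank + "\t" + links)); // same format as the input
        }
      }
    }

Run with KeyValueTextInputFormat so that the mapper keys are the node ids, and repeat the job until the ranks converge.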

The last talk was on HBase by Michael Stack. I am really happy I attended HugUK, as I missed that talk in Amsterdam at the ApacheCon. First Michael gave an overview of which features of a typical RDBMS are not supported by HBase: relations, joins, and of course JDBC being among the limitations. On the pro side, HBase offers a multi-node solution that has scale-out and replication built in.

HBase can be used as a source as well as a sink for MapReduce jobs and thus integrates nicely with the Apache Hadoop stack. The framework provides a simple shell for administrative tasks (surgery on sick clusters, forced flushes, non-SQL get, scan and put methods). In addition, the master comes with a UI to monitor the cluster state.
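
To give a flavour of the get/scan/put trio from the Java client side, a hedged sketch - the table and column names are made up, and the API shown is the client interface as it stabilised in later HBase releases:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseBasics {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "pages");

        Put put = new Put(Bytes.toBytes("row1"));   // put: write one cell
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
                Bytes.toBytes("<html/>"));
        table.put(put);

        Get get = new Get(Bytes.toBytes("row1"));   // get: read a single row
        Result row = table.get(get);
        System.out.println(Bytes.toString(
            row.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"))));

        ResultScanner scanner = table.getScanner(new Scan());  // scan: iterate over rows
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
      }
    }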

Your typical DBA work, though, differs with HBase: data locality and physical layout do matter and can be configured. Michael's recommendation was to start out testing with the XL instance on EC2 and decrease the instance size if you find out that it is too large.

The talk finished with an outlook on the features in the upcoming release, the issues on the todo list, and an overview of companies already using HBase.

After the talks were finished quite a few attendees went over to a pub close by: drinking beer, discussing new directions and sharing war stories.

I would like to thank Johan Oskarsson for organising the event. And a special thanks to Tom for letting me use his laptop for the Apache Mahout presentation: my own hard disk broke exactly one day before.

Last but not least, thank you to Sylvio and Susi for letting me stay at their place - and thanks to Helene for crying only during daytime when I was out anyway ;)

Hope to see at least some of the attendees again at the next Hadoop Meetup in Berlin. Looking forward to the next Hadoop User Group UK.

Announcing Apache Mahout 0.1

2009-04-08 15:11
This morning I received Grant's release mail for Apache Mahout. I am really happy that after little more than one year we now have our first release out there, to be tested and scrutinised by anyone interested in the project. Thanks to all the committers who have helped make this possible. A special thanks to Grant Ingersoll for putting so much time into getting many release issues out of the way, as well as to those who reviewed the release candidates and reported all the major and minor problems.

For those who are not familiar with Mahout: The goal of the project is to build a suite of machine learning libraries under the Apache license. The main focus is on:

  • Building a viable community that develops new features, helps users with software problems, and is interested in the data mining problems of Mahout users.
  • Developing stable, well documented, scalable software that solves your problems.


The current release includes several algorithms for clustering (k-Means, Canopy, fuzzy k-Means, Dirichlet-based) and for classification (Naive Bayes and Complementary Naive Bayes). There is some integration with the Watchmaker evolutionary programming framework. The Taste collaborative filtering framework moved to Mahout as well; Taste has been around for a while and is much more mature than the rest of the code.

With this being a 0.1 release we are looking for early adopters who are willing to work with cutting-edge software and gain benefits from working closely with the community. We are seeking feedback on use cases as well as performance numbers. If you are using Mahout in your projects, plan to use it, or are even only evaluating it - we are happy to hear back from you on our mailing lists. Tell us what you like and what works well, but do not forget to tell us what you would like to see improved. Contributions and patches, as always, are very welcome.

For more information see the project homepage, especially the wiki and the Lucene weblog by Grant Ingersoll.

GSoC: Student applications closed

2009-04-05 09:57
Title: GSoC: Student applications closed
Link out: Click here
Description: After this date no more student applications are accepted. Internal ranking at Apache starts seven days earlier. The ranking process finishes on the 16th of April.
Date: 2009-04-03

Students begin coding for their GSoC projects

2009-04-05 09:54
Title: Students begin coding for their GSoC projects
Link out: Click here
Description: GSoC time.
Date: 2009-05-26

Apache Con Europe 2009 - part 3

2009-03-29 19:56
Friday was the last conference day. I enjoyed the Apache pioneers panel with a brief history of the Apache Software Foundation as well as lots of stories on how people first got in contact with the ASF.

After lunch I went to the testing and cloud session. I enjoyed the talk on Continuum and its uses by Wendy Smoak. She gave a basic overview of why one would want a CI system and provided a brief introduction to Continuum. After that Carlos Sanchez showed how to use the cloud to automate interface tests with Selenium: the basic idea is to automatically (initiated through Maven) start up AMIs on EC2, each configured with a different operating system, and run Selenium tests against the application under development in these. A really nice system for running automated interface tests.

The final session for me was the talk by Chris Anderson and Jan Lehnardt on CouchDB deployments.

The day ended with the Closing Event and Raffle. A big thank you to Ross Gardler for including the Berlin Apache Hadoop Get Together at the newthinking store in the announcements! Will send the CfP to ConCom really soon, as promised. Finally, I won one package of caffeinated sweets at the Raffle - does that mean less sleep for me in the coming weeks?

Now I am finally back home and had some time to do a quick writeup. If you are interested in the complete notes, go to http://fotos.isabel-drost.de (default login is published in the login window). Looking forward to the Hadoop User Group UK on 14th of April. If you have not signed up yet - do so now: http://huguk.org

Apache Con Europe 2009 - part 2

2009-03-29 19:42
Thursday morning started with an interesting talk on open source collaboration tools and how they can help resolve some of the collaboration overhead in commercial software projects. Four goals can be reached with the help of the right tools: sharing the project vision, tracking the current status of the project, finding places to help the project, and documenting the project history as well as the reasons for decisions along the way. The exact tool used is irrelevant as long as it helps to accomplish the four tasks above.

The second talk was on cloud architectures by Steve Loughran. He explained what reasons there are to go into the cloud and what a typical cloud architecture looks like. Steve described Amazon's offering, mentioned other cloud service providers, and highlighted some options for a private cloud. However, his main interest is in building a standardised cloud stack. Currently, choosing one of the cloud providers means vendor lock-in: your application uses a special API, and your data is stored on special servers. There are quite a few tools necessary for building a cloud stack available at Apache (Hadoop, HBase, CouchDB, Pig, ZooKeeper...). The question that remains is how to integrate the various pieces and extend them where necessary to arrive at a solution that can compete with AppEngine or Azure.

After lunch I went to the Solr case study by JTeam. Basically one great commercial for Solr. They even brought the happy customer to Apache Con to talk about the experience with Solr from his point of view. Great work, really!

The Lightning Talk session ended the day - with a "Happy birthday to you" from the community.

After having spent the last four days from 8 a.m. to 12 p.m. at Apache Con, I really did need some rest on Thursday and went to bed pretty early: at 11 p.m. ...

Apache Con Europe 2009 - part 1

2009-03-29 18:41
The past week, members, committers and users of Apache software projects gathered in Amsterdam for another Apache Con EU - and to celebrate the 10th birthday of the ASF. One week dedicated to the development and use of Free Software and the Apache Way.

Monday was BarCamp day for me, the first BarCamp I ever attended. Unfortunately not all participants proposed talks, so some of the atmosphere of an unconference was missing. The first talk, by Danese Cooper, was on "HowTo: Amsterdam Coffee Shops". She explained the ins and outs of going to coffee shops in Amsterdam and gave both legal and practical advice. There was a presentation of the OpenStreetMap project as well as of several Apache projects. One talk discussed transferring the ideas of Free Software to other parts of life. Ross Gardler started a discussion on how to advocate contributions to Free Software projects in science and education.

Tuesday for me meant having some time for Mahout during the Hackathon. Specifically, I looked into enhancing matrices with meta information. In the evening there were quite a few interesting talks at the Lucene Meetup: Jukka gave an overview of Tika, and Grant introduced Solr. After Grant's talk some of the participants shared numbers on their Solr installations (number of documents per index, query volume, machine setup). To me it was extremely interesting to gain some insight into what people actually accomplish with Solr. The final talk was on Apache Droids, a still-incubating crawling framework.

The Wednesday tracks were a little unfair: the Hadoop track (videos available online for a small fee) ran right in parallel to the Lucene track. The day started with a very interesting keynote by Raghu from Yahoo! on their storage system PNUTS. He went into quite some technical detail. Obviously there is interest in publishing the underlying code under an open source license.

After the Mahout introduction by Grant Ingersoll I changed rooms to the Hadoop track. Arun Murthy shared his experience on tuning and debugging Hadoop applications. After lunch Olga Natkovich gave an introduction to Pig - a higher-level language on top of Hadoop that allows for the specification of filter operations, joins and basic control flow of MapReduce jobs in just a few lines of Pig Latin code. Tom White gave an overview of what it means to run Hadoop on the EC2 cloud. He compared several options for storing the data to process. It seems very likely that there will soon be quite a few more providers of cloud services in addition to Amazon.

Allen Wittenauer gave an overview of Hadoop from the operations point of view. Steve Loughran finally covered the topic of running Hadoop on dynamically allocated servers.

The day finished with a pretty interesting BOF on Hadoop. There still are people who do not clearly see the differences between Hadoop-based systems and database-backed applications. The best way to find out whether the model fits: set up a trial cluster and experiment yourself. No one can tell which solution is best for you except yourself (and maybe Cloudera setting up the cluster for you :) ).

After that the Mahout/UIMA BOF was scheduled - there were quite a few interesting discussions on what UIMA can be used for and how it integrates with Mahout. One major take-home message: we need more examples integrating both. We developers do see the clear connections, but users often do not realise that many Apache projects should be used together to get the biggest value out of them.

Hadoop User Group UK

2009-03-07 23:10
Title: Hadoop User Group UK
Location: London/ Sun Office
Link out: Click here
Date: 2009-04-14


Johan Oskarsson is organising the second Hadoop User Group UK in London in April this year. The schedule is already up:


  • Tom White (Cloudera): Practical MapReduce
  • Michael Stack (Powerset): Apache HBase
  • Isabel Drost (ASF): Introducing Apache Mahout
  • Iadh Ounis and Craig Macdonald (University of Glasgow): Terrier
  • Paolo Castagna (HP): "Having Fun with PageRank and MapReduce"


Tickets are free but limited, so better register soon. Looking forward to seeing you there.

March 2009 Hadoop Get Together Berlin

2009-03-07 19:50
Since last summer, the newthinking store in Berlin has been hosting a Hadoop Meetup every quarter. The scope of these user group meetings is not limited to Hadoop projects but covers the technologies necessary for storing, processing and searching large amounts of data.

The meeting last Thursday featured a talk by Lars George on his experiences using HBase in customer projects as early as 2007. His talk discussed his requirements for a distributed database. He then explained the basics of HBase and described his experiences using the software in customer projects. The bottom line for me is that although the project is at a very early stage, it does provide a lot of value: instead of re-implementing your own solution, it is possible to benefit from the efforts of others. One thing I consider especially remarkable is the effort the HBase community puts into helping users when they run into problems.

The second talk was by Jan Lehnardt on CouchDB. Jan explained the main design goals of the system and detailed the architecture of CouchDB. He then explained how Erlang made it possible to reach those goals in a comparably short time.

The slides of the talks are both available online:

Lars George: HBase

Jan Lehnardt: CouchDB

The talks were followed by several questions and interesting discussions (with some beer in the Keyser Soze close by).

The next Get Together will be held in June 2009. Looking forward to seeing you in Berlin then.