Open Source Expo

2009-10-29 07:38
Title: Open Source Expo
Location: Karlsruhe
Link out: Click here
Description: There will be a booth at the Open Source Expo introducing interested visitors to the Apache projects Lucene and Mahout. Of course, we are also happy to answer any questions about the ASF in general.
Start Date: 2009-11-15
End Date: 2009-11-16

Mahout@TU WS 09/10

2009-09-09 23:08
Title: Mahout@TU WS 09/10


There is going to be a project/seminar course at TU Berlin on Apache Mahout. The goal is to introduce students to working on a free software project: interacting with the community and building production-ready software.

Students will be given several potential tasks, ranging from optimizing existing implementations to implementing new algorithms and (depending on their prior knowledge) improving, scaling, and parallelizing existing ones.

Successful completion of the course depends on a number of factors: interaction with the community, the ability to write tested (as in test-first-developed) code that performs well in large-scale environments, the ability to show incremental development progress at each iteration, the ability to review patches and improvements, and the use of tools like SCM, the issue tracker, and mailing lists. Of course the theoretical background - understanding existing publications as well as extending their ideas - is crucial as well.

If you are a student interested in Mahout and still in need of some course work, consider subscribing to the Mahout course at DIMA Berlin (linked below). The goal is for your work to be integrated into one of the next releases, once the community is satisfied with it.

If you are a Mahout developer or user and have an issue that you consider suitable for a student to solve, please share your ideas.


Location: TU Berlin
Link out: Click here
Start Date: 2009-10-01
End Date: 2010-03-31

GSoC at Mahout

2009-09-09 22:22
GSoC 2009 is about to finish: final evaluations are through, most of the code submitted by Mahout's students has been committed to svn, and code samples are on their way to Google.

Mahout had three students join the project: Robin worked on an HBase-based Naive Bayes extension and on frequent itemset discovery, David contributed a distributed LDA implementation, and Deneche worked on a Random Forest implementation. All three of them did great work during the summer, contributing not only code but also valuable input on the project's mailing lists. As a result, all three were given committer status by the end of GSoC.

Apart from three new additions to the code base, summer also brought quite some traffic to the user list - not only in terms of subscriptions but also in terms of developers contributing to the discussions online. Currently, it looks like the project is really gaining momentum, as also noted in Grant Ingersoll's post.

Discussions on the dev list about Mahout's future road map clearly showed that the developers share the vision of a scalable, potentially distributed, stable machine learning library; that the focus should be on production-ready code under a commercially friendly license rather than on bleeding-edge research implementations; and, last but not least, that the goal is to build a lively, diverse community around the project to guarantee further development and user support.

2009 brought quite a few talks on Mahout both in Germany and in the US (besides all the events on Hadoop, scalable databases, and cloud computing in general), with an ApacheCon US talk introducing Mahout in Oakland still to come.

Yesterday, a great article introducing Apache Mahout with hands-on examples was published on IBM developerWorks by Grant Ingersoll. Check it out if you want to learn more about Mahout and machine learning in general.

Inglourious Basterds

2009-08-24 22:48
This evening I went to the Odeon cinema in Berlin Schöneberg. It is a pretty traditional, old-fashioned, and very lovely cinema that specialises in showing non-dubbed, original versions of movies.

The cinema was completely sold out today for the great movie Inglourious Basterds. Fortunately we were able to grab some of the last tickets.

In case the entrance seems familiar to those who have attended a Mahout presentation in the recent past: a picture of the Odeon usually illustrates one part of my motivation on the Mahout slides ;)

Flying back home from Cologne

2009-08-23 20:40
Last weekend FrOSCon took place in Sankt Augustin, near Cologne. FrOSCon is organized every year at the University of Applied Sciences in Sankt Augustin. It is a volunteer-driven event with the goal of bringing developers and users of free software projects together. This year the conference featured five tracks, among them cloud computing and Java.

Unfortunately this year the conference started with a little surprise for me and my boyfriend: as we were both speakers, we had booked a room in Hotel Regina via the conference committee. Yet on Friday evening we learned that the reservation had never actually reached the hotel... So, after several minutes of talking to the receptionist and calling the organizers, we ended up in a room that had been booked for Friday night by someone known to arrive no earlier than Saturday. Fortunately for us, we have a few friends close by in Düsseldorf: Fnord was so very kind as to let us have his guest couch for the following night.

Check-in time next morning: on the right-hand side, the regular registration booth; on the left-hand side, the entrance for VIPs only. The FSFE quickly realized its opportunity: they soon started distributing flyers and stickers among the waiting exhibitors and speakers.

Organizational issues aside, most of the talks were very interesting and well presented. The Java track featured two talks by Apache Tomcat committer Peter Roßbach, the first on the new Servlet 3.0 API, the second on Tomcat 7. Sadly, my talk ran in parallel to his Tomcat talk, so I could not attend it. I appreciated several of the ideas on cloud computing highlighted in the keynote: cloud computing as such is not really new or innovative; rather, several good ideas so far known under names like utility computing are now being improved and refined to make computation a commodity. At the moment, however, cloud computing providers tend to sell their offerings as new, innovative products, and there is no standard API for cloud computing services. That makes switching from one provider to another extremely hard and leads to vendor lock-in for users.

The afternoon was filled by my talk. This time I tried something that I had so far only done in user groups of up to 20 people: I first gave a short introduction to who I am and then asked the audience to describe themselves in one sentence. There were about 50 people, and after 10 minutes everyone had given their self-introduction. It was a nice way of getting detailed information on what knowledge to expect from the audience, and it was interesting to hear that people from IBM and Microsoft were in the room.

After that I attended the RestMS talk by Thilo Fromm and Peter Hintjens. They showed a novel, community-driven approach to standards creation. RestMS is a messaging standard based on a RESTful way of communicating. The standard itself is still in its very early stages, but there are some very “alpha, alpha, alpha” implementations out there that can be used for playing around. According to Peter, there are actually people who already use these implementations on production servers and send back bug reports.

Sunday started with an overview of the DaVinci VM by Dalibor Topic, the author of the OpenJDK article series in the German Java Magazin. The second talk of the day was an introduction to Scala. I already knew a few details of the language, but the presentation made it easy to learn more: it was organised as an open question-and-answer session, with live coding leading through the talk.

After lunch and some rest, the last two topics of interest were the details of the FFII campaigns against software patents and an overview of the upcoming changes in GNOME 3.0.

This year's FrOSCon did have some organizational quirks, but the quality of most of the talks was really good, with at least one interesting topic in nearly every time slot - though I must admit that this was easy in my case, with Java and cloud computing both being of interest to me.

Update: Videos are online.

Back from Zürich

2009-05-05 16:58
I spent the last five days in Zurich. I wanted to visit the city again - and still owed one of my friends there a visit. I am really happy the weather was quite nice over the weekend. That way I could spend quite some time in town (and got another one of those puzzles) and go for a hike on the Ütli mountain: I took the steep way up, which had quite a lot of stairs. Interestingly, despite being quite tired when I finally arrived at the top, my legs were not sore the next day. It seems going to work and back again by bike does indeed help a bit, even though we have no hills in Berlin.

Yesterday I got to present the Apache project Mahout in a Google tech talk. Usually I am talking to people well familiar with the various Apache projects. During my talk I asked who was familiar with Lucene and with Hadoop. To me it was pretty unusual that very few engineers were aware of these. Is it really that uncommon to have a look at what is going on outside the company? Or was it just the selection of people who were interested in my talk?

I tried to cover most of the basics and put Mahout into the context of the Lucene umbrella project. I showed some of the applications that can be built with Mahout and detailed some of the things that are on our agenda.

Some of the questions I received were on the scalability of Hadoop and on the general distribution of people being paid to work on free software projects vs. those working on them in their free time. Another question was whether the project is targeted at text-only applications (which of course it is not, as feature extraction has so far been left to the user). Last but not least, the relation to UIMA was brought up by a former IBM UIMA engineer.

To summarize: for me it was a pretty interesting experience to give this tech talk. I hope it helped me do away with some of my "Apache bias". It is always valuable to look at what is going on outside one's own community.

Feedback from the Hadoop User Group UK

2009-04-29 08:54
A few weeks after the Hadoop User Group UK, there are quite a few postings about the event online. I will try to keep this page updated as further reviews appear. The ones I found so far:

http://huguk.org/2009/04/huguk-2-wrap-up.html - the wrap-up of the event itself.

http://blog.oskarsson.nu/2009_04_01_archive.html - a short summary by the organiser - Thanks again for a great event.

http://www.cloudera.com/blog/2009/04/21/hadoop-uk-user-group-meeting/ - a short summary on the Cloudera blog.

http://people.kmi.open.ac.uk/adam/?p=26 - a quick overview with a Mahout focus by Adam Rae.

Mahout on EC2

2009-04-21 21:00
Amazon released Elastic MapReduce (EMR) only a few weeks ago. EMR is based on a hosted Hadoop environment and offers machines on demand to run map/reduce jobs against data in S3.

Last week Stephen Green spent quite some effort on getting Mahout running on EMR. Thanks to his work, Mahout has been running on EMR since last Thursday night. Read Tim Bass's weblog (http://www.thecepblog.com/2009/04/19/kmeans-clustering-now-running-on-elastic-mapreduce/) for further information.

Hadoop User Group UK

2009-04-21 20:34
On Tuesday the 14th, the second Hadoop User Group UK took place in London. This time venue and pizza were sponsored by Sun. The room quickly filled with approximately 70 people.

Tom opened the session with a talk on ten practical tips on how to get the most benefit from Apache Hadoop. The first question users should ask themselves is which type of programming language they want to use: there is a choice between structured data processing languages (Pig or Hive), dynamic languages (via Streaming or Dumbo), and Java, which is closest to the system.

Tom's second hint dealt with the size of files to process with Hadoop: both too-large unsplittable files and too-small files are bad for performance. In the first case the workload cannot easily be distributed across the nodes; in the latter, each unit of work is too small to make up for the startup and coordination overhead. There are ways to remedy these problems with sequence files and map files, though. Another performance optimization is chaining individual jobs - Pig and Hive do a pretty decent job of generating such chains automatically. ChainMapper and ChainReducer can help with creating chained jobs by hand, as sketched below.
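
Here is a minimal sketch of such a chain against the mapred API of the time; the stage classes are made-up pass-through stand-ins for real processing steps:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.ChainMapper;
    import org.apache.hadoop.mapred.lib.ChainReducer;

    public class ChainedJobSketch {

      // First map stage: passes records through unchanged (stand-in).
      static class StageOneMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable key, Text value,
            OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
          output.collect(key, value);
        }
      }

      // Second map stage, same shape; a real one would filter or enrich.
      static class StageTwoMapper extends StageOneMapper {}

      // Reduce stage closing the chain.
      static class ClosingReducer extends MapReduceBase
          implements Reducer<LongWritable, Text, LongWritable, Text> {
        public void reduce(LongWritable key, Iterator<Text> values,
            OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
          while (values.hasNext()) {
            output.collect(key, values.next());
          }
        }
      }

      public static JobConf buildJob() {
        JobConf job = new JobConf(ChainedJobSketch.class);
        job.setJobName("chained-example");
        // Both map stages run back to back inside a single map task,
        // so no intermediate job (and no extra HDFS round trip) is needed.
        ChainMapper.addMapper(job, StageOneMapper.class,
            LongWritable.class, Text.class, LongWritable.class, Text.class,
            true, new JobConf(false));
        ChainMapper.addMapper(job, StageTwoMapper.class,
            LongWritable.class, Text.class, LongWritable.class, Text.class,
            true, new JobConf(false));
        // A single reduce stage closes the chain; further map stages could
        // follow it via ChainReducer.addMapper(...).
        ChainReducer.setReducer(job, ClosingReducer.class,
            LongWritable.class, Text.class, LongWritable.class, Text.class,
            true, new JobConf(false));
        return job;
      }
    }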

Another important task when implementing map/reduce jobs is to tell Hadoop about the progress of your job. For one, this is important for long-running jobs so that they remain alive and are not killed by the framework due to timeouts. Second, it is convenient for the user, who can view the progress in Hadoop's web UI. A small sketch follows below.
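
A minimal sketch of progress reporting with the mapred API of the time; the per-record work is just a stand-in:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SlowMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      public void map(LongWritable key, Text value,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        // ... imagine some long-running per-record work here ...
        // Tell the framework we are still alive so the task is not
        // killed due to a timeout...
        reporter.progress();
        // ...and bump a counter that shows up in the web UI.
        reporter.incrCounter("SlowMapper", "records-processed", 1);
        output.collect(new Text(value.toString()), new LongWritable(1));
      }
    }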

The usual suspects for tuning a job: the number of mappers and reducers, the use of combiners, compression, customised data serialisation, and shuffle tweaks (see the snippet below). Of course there is always the option of letting someone else do the tuning: Cloudera provides support as well as pre-built packages, init scripts, and the like ;)
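
For illustration, a few of those knobs as they appear on JobConf; the values chosen here are arbitrary:

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;

    public class TuningSketch {
      public static void tune(JobConf job) {
        // More reduce tasks spread the shuffle and reduce work across nodes.
        job.setNumReduceTasks(16);
        // A combiner pre-aggregates map output and cuts shuffle volume.
        job.setCombinerClass(LongSumReducer.class);
        // Compressing intermediate map output trades CPU for network I/O.
        job.setCompressMapOutput(true);
        job.setMapOutputCompressorClass(GzipCodec.class);
      }
    }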

In the second talk I gave a brief Mahout intro. It was surprising to me that half of the attendees already employed machine learning implementations in their daily work. Judging from the discussion and the questions I received after the talk, interest in the project seems pretty high. The slide I liked the most: the announcement of our first release, 0.1. Thanks to all Mahout committers and contributors who made this possible.

After the coffee break, Craig gave an introduction to Terrier, an extensible information retrieval platform developed at the University of Glasgow. He mentioned a few other open IR platforms, namely Tuple Flow, Zettair, Lemur/Indri, and Xapian, as well as, of course, Nutch/Solr/Lucene.

What does Terrier have to do with HugUK? Well, index creation in Terrier is now based on an implementation that uses Hadoop for parallelization. Craig presented some very interesting analysis of the solution's scalability: the team was able to achieve near-linear scaling in the number of nodes added (at least as long as more than one reducer is used ;) ).

After the pizza, Paolo described his experiences implementing the vanilla PageRank computation with Hadoop. One of his test datasets was the CiteSeer citation graph. Interestingly enough, some of the nodes in this graph have self-references (maybe due to extraction problems), there are duplicate citations, and the data comes in an invalid XML format.

The last talk was on HBase, by Michael Stack. I am really happy I attended HugUK, as I had missed that talk at the ApacheCon in Amsterdam. First Michael gave an overview of which features of a typical RDBMS are not supported by HBase: relations, joins, and of course JDBC are among the limitations. On the pro side, HBase offers a multi-node solution with scale-out and replication built in.

HBase can be used as a source as well as a sink for map/reduce jobs and thus integrates nicely with the Apache Hadoop stack (see the sketch below). The framework provides a simple shell for administrative tasks (surgery on sick clusters, forced flushes, and non-SQL get, scan, and put methods). In addition, the master comes with a UI to monitor the cluster state.
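
A minimal sketch of the source side of that integration, based on the TableMapper helpers of the HBase client API as I remember them; exact class and method names shifted between releases, and the table names here are made up:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class HBaseSourceSketch {

      // Emits one (row key, 1) pair per row scanned from the table.
      static class RowCountMapper extends TableMapper<Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(ImmutableBytesWritable row, Result columns,
            Context context) throws IOException, InterruptedException {
          context.write(new Text(row.get()), ONE);
        }
      }

      public static Job buildJob() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-as-source");
        job.setJarByClass(HBaseSourceSketch.class);
        // HBase as source: every region of table "pages" becomes one split.
        TableMapReduceUtil.initTableMapperJob("pages", new Scan(),
            RowCountMapper.class, Text.class, LongWritable.class, job);
        // Using HBase as sink would be the mirror image, e.g.:
        // TableMapReduceUtil.initTableReducerJob("counts", MyTableReducer.class, job);
        return job;
      }
    }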

Typical DBA work differs with HBase, though: data locality and the physical layout do matter and can be configured. Michael's recommendation was to start out testing with the XL instance on EC2 and move to smaller instances if that turns out to be too large.

The talk finished with an outlook on the features of the upcoming release, the issues on the todo list, and an overview of companies already using HBase.

After the talks were finished, quite a few attendees went over to a pub close by: drinking beer, discussing new directions, and sharing war stories.

I would like to thank Johan Oskarsson for organising the event. And a special thanks to Tom for letting me use his laptop for the Apache Mahout presentation: the hard disk of mine had broken exactly one day before.

Last but not least, thank you to Sylvio and Susi for letting me stay at their place - and thanks to Helene for crying only during daytime, when I was out anyway ;)

Hope to see at least some of the attendees again at the next Hadoop Meetup in Berlin. Looking forward to the next Hadoop User Group UK.

Announcing Apache Mahout 0.1

2009-04-08 15:11
This morning I received Grant's release mail for Apache Mahout. I am really happy that after little more than one year we now have our first release out there, to be tested and scrutinised by anyone interested in the project. Thanks to all the committers who helped make this possible. A special thanks to Grant Ingersoll for putting so much time into getting many release issues out of the way, as well as to those who reviewed the release candidates and reported all the major and minor problems.

For those who are not familiar with Mahout: The goal of the project is to build a suite of machine learning libraries under the Apache license. The main focus is on:

  • Building a viable community that develops new features, helps users with software problems, and is interested in the data mining problems of Mahout users.
  • Developing stable, well documented, scalable software that solves your problems.


The current release includes several algorithms for clustering (k-Means, Canopy, fuzzy k-Means, Dirichlet-based) and for classification (Naive Bayes and Complementary Naive Bayes). There is some integration with the Watchmaker evolutionary programming framework. The Taste collaborative filtering framework moved to Mahout as well; Taste has been around for a while and is much more mature than the rest of the code. A small sketch of its use follows below.
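
To give a flavour of Taste, here is a minimal user-based recommender sketch; method signatures moved around in the early releases, so treat this as illustrative only, and note that "ratings.csv" (one userID,itemID,preference triple per line) is a made-up file name:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class TasteSketch {
      public static void main(String[] args) throws Exception {
        // Preferences are loaded from a plain text file.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider the ten most similar users as the neighborhood.
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top three recommendations for user 42.
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }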

With this being a 0.1 release, we are looking for early adopters who are willing to work with cutting-edge software and benefit from working closely with the community. We are seeking feedback on use cases as well as performance numbers. If you are using Mahout in your projects, plan to use it, or are even only evaluating it, we would be happy to hear back from you on our mailing lists. Tell us what you like and what works well, but do not forget to tell us what you would like to see improved. Contributions and patches are, as always, very welcome.

For more information see the project homepage, especially the wiki and the Lucene weblog by Grant Ingersoll.