GSoC at Mahout

GSoC at Mahout #

GSoC 2009 is about to finish: Final evaluations are through, most of the code submitted by Mahout’s students has been committed to svn, code samples are on their way to Google.

In Mahout, we had three students joining the project: Robin working on an HBase based Naive Bayes extension and on frequent itemset discovery. David contributing a distributed LDA implementation. Deneche was working on a Random Forest implementation. All three of them have done great work during this summer, contributing not only code but valuable input on the project’s mailinglists as well. As a result, all three of them have been given committer status by the end of GSoC.

Apart from three new additions to the code base, summer also brought quite some traffic to the user list - not only in terms of subscriptions but also in terms of developers contributing to the discussions online. Currently, it looks like the project is really gaining momentum, as also noted in Grant Ingersoll’s post.

Discussions on the dev list on the future road map of Mahout clearly showed that the developers share the vision of a scalable, potentially distributed, stable machine learning library. That the focus should be on production ready code under a commercially friendly license instead of bleeding edge research implementations. Last but no least the goal is to build a lively, diverse community around the project to guarantee further development and user support.

2009 brought quite a few talks both in Germany as well as the US on the topic of Mahout (besides all the events on Hadoop, scalable databases and cloud computing in general) with an Apache Con US talk introducing Mahout in Oakland still to come.

Yesterday, a great article indroducing Apache Mahout with hands-on examples was published on IBM Developerworks by Grant Ingersoll. Check it out, if you want to learn more on Mahout, and Machine Learning in general.