Archive for the ‘Mahout’ Category

Teaching Free Software Development

June 20th, 2010

Last summer I was invited to give a presentation on Apache Mahout at TU Berlin. After the talk, some of the research group members asked me to design and give a course on scalable machine learning with open source software during the winter semester.

The project attracted four to five students - not very many - but then again it is a course people can take voluntarily. During the first semester, participants were asked to use Mahout to build a system that crawls web pages, assigns them to clusters and makes the content searchable with Lucene. The intention was to get students to contribute any patches they had to make back to Mahout. In addition, the code behind the system was supposed to be published after the project was over.

This setup turned out to be sub-optimal: The participants never grew confident enough to publish their ideas and design on the mailing list, or to send in the access data for the SCM system that hosted the project source code.

A similar setup was run at HPI Potsdam by Christoph Böhm: He let students implement various information retrieval and machine learning algorithms on top of Apache Hadoop. After the course was over he tried to motivate students to publish their code at Apache Mahout. So far I have seen no submissions.

Aware of these problems, I chose a slightly different model when I set up the course for the summer semester at TU: With only four students, none of whom has enough free cycles to work on the project full time, I set the goal of implementing an HMM - including tests, examples and documentation. Since the course is roughly aligned with GSoC, I asked students to publish their timeline in JIRA. As soon as coding started I urged them to publish even incremental progress and ask the community for feedback.

Now we do have an open JIRA issue with a patch attached to it. The students have also already received some code review feedback. With Berlin Buzzwords in town while the course was still running, I took the chance to put students in touch with other Mahout developers. It looks like at least one of them is planning to stay with the project for a little longer. For me it would be a great success if at least one student could be turned into a longer term contributor to the project.

So far it looks like applying the general principle of releasing code early and often helps people to integrate into a project. My own lesson learned from those experiences, however, is to urge students early on to get in touch and release their code: It was not particularly easy to get them to send e-mails to public mailing lists. However, once they had done this just once, feedback was usually very positive - and they were surprised by how friendly and helpful people in the free software community generally are.

Mahout

Scaling user groups

May 26th, 2010

A few hours ago, Jan Lehnardt posted a link on How to organise a nerd conference - joking that this is how we planned Berlin Buzzwords. Well, it is not exactly that easy - however the comic actually is not so far from the truth either:

About two years ago, after I had started Apache Mahout together with Grant Ingersoll, Karl Wettin and others, several Apache Hadoop user groups, meetups and get togethers started to pop up all around the world. The one closest to me was the Hadoop user group UK. Back in 2008 I was pretty envious of all these user groups - being so distributed, there was no way I could ever attend all of them, though the talks were certainly interesting. So the thought of a back then naive free software developer was: Let’s have that in Berlin. To get initial talks I called Stefan Groschupf. His answer was very positive: Oh yeah, let’s do this. I am in Germany for another two weeks, so it should be within that timeframe. We agreed that if no-one showed up, we could still have some pizza together and share insights from our projects.

For the venue I knew from regular meetups of the Free Software Foundation Europe - read FSF*E* - that newthinking store was available for free for meetups of free software developers. On I went, called Martin from the store and booked the room. After that some mails went to the usual suspects, mailing lists and such. At the first meetup two years ago, more than 15 attendees showed up - with two more people who had prepared slides. Pizzas obviously had to wait a little.

If you are wondering what that looked like back then: Thanks to Martin for taking the picture and putting it online.



We (as in all attendees) decided to repeat the exercise three months later*; talks for the next time were proposed during that first session. No one objected to having it in Berlin again - everyone knew this was the only way to avoid having to do the organization next time.

The meetup grew steadily in size, and talks started being proposed three to six months in advance. I ended up creating not only a mailing list for the meetup but also a blog so I could publish news on Jan’s CouchDB talk and Lars George’s HBase talk back then. We got video sponsoring from Cloudera (Thanks Christophe), StudiVZ (Thanks Nils), and Nokia (Thanks Matt). Late last year I organised the first European NoSQL meetup together with Jan Lehnardt - 80 attendees, lots of potential for more, the newthinking store obviously a bit too small for that :)

If you are wondering what NoSQL and Hadoop meetups looked like last time:


During that meetup the idea was born for a larger NoSQL conference in Berlin in 2010. First ideas were tossed around together with Jan and Simon Willnauer during ApacheCon US in Oakland. The topic Hadoop got added there. Finally, in January 2010, Lucene was added to the mix. We contacted newthinking for support and got a very warm welcome.

Now - two years after the first Apache Hadoop Get Together Berlin - we are proud to host Berlin Buzzwords, focussed on NoSQL, Apache Hadoop and search as in Apache Lucene. The conference is co-organised by newthinking communications, Simon Willnauer, Jan Lehnardt and myself. A big thanks to neofonie for supporting me by making it possible for me to do most of the organisation during my regular working hours.

The speaker lineup looks fantastic. Registration is going very well - exceeding expectations (did I mention that registration is still open, group and student tickets still available?).

I am really looking forward to an amazing conference on the 7th and 8th of June. We will have a NoSQL barcamp in the newthinking store on the Sunday evening before the conference. Keynote speaker packages have been sent out and were well received. Hotel rooms for speakers are booked. We are about to pull together the last loose ends in the coming days. Happy to have so many guys (and a few girls) interested in scalability topics here in town at the beginning of June. Looking forward to seeing you in Berlin.

* The second meetup turned out to be the first and so far only one that took place without the organiser - I broke my leg on my way to newthinking by getting hit by a BMW X5… *sigh* Note for other meetup organizers: Always have a backup moderator - in my case that was my neofonie manager Holger Düwiger, who happened to attend that meetup for the first time back then.

Apache Hadoop Get Together Berlin, Berlin Buzzwords, Hadoop, Lucene, Mahout

Working on Mahout as part of your studies at TU Berlin

April 9th, 2010

Did you ever wonder who those weird people working on free software projects are? Did you ever ask yourself how these developers organise their work, how they collaborate, which values are important to them? Did you ever think about participating in a free software project yourself but never really had time to do so because your studies were just too time-consuming?

Well, if you are a student at one of the Berlin universities, there is a project at the research group DIMA at TU Berlin that might be of interest to you: Hot Topics in Information Management is the second edition of last year’s course, which focussed on building systems with Apache Mahout.

This term the course will concentrate on extending Mahout. During the first week, students are given a set of possible project ideas to choose from. Of course you are invited to add your own ideas as well. You will need to come up with a rough plan of material to read, modules to implement and a timeframe for each module.

You are asked not only to implement your chosen extension but also to thoroughly (unit-/integration-) test it, to document it, to provide examples of its usage and finally to work together with the community on contributing your implementation back to the project.

During the course you are free to re-use resources built up for last year’s course - both hardware as well as installed software and available data.

The course starts next week on Tuesday - registration closes in a few days, so make sure you sign up if you are interested in working on Mahout during your regular project time and getting credits for that.

Mahout

GSoC - one day to go for your application

April 8th, 2010

If you are a student interested in participating in Google Summer of Code: Registration closes tomorrow (as in “April 9, 19:00 UTC”). Hopefully you have already published and discussed your proposal with your favourite project, so you have a clear plan of where to go and which milestones to achieve in summer.

If you are interested in Apache Mahout: Yes, as in previous years, we are again looking for students willing to work on awesome student projects this summer. Several core Mahout developers have signed up as mentors for GSoC. With Robin, one of our former GSoC students has now turned into a mentor: It’s always amazing to watch students stick with the project and continue contributing valuable input.

So in case you would love to learn more about machine learning, train your software development skills and work with great people on your favourite problem, do not forget to submit your project proposal by tomorrow.

Mahout

Apache Mahout 0.3 released

March 18th, 2010

This week, Apache Mahout 0.3 was released. First of all thanks to all committers and contributors who made that possible: Thanks for all your hard work on making the code even faster and integrating even more algorithms.

The highlights:

  • New: math and collections modules based on the high performance Colt library
  • Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
  • Parallel Dirichlet process clustering (model-based clustering algorithm)
  • Parallel co-occurrence based recommender
  • Parallel text document to vector conversion using LLR based ngram generation
  • Parallel Lanczos SVD (Singular Value Decomposition) solver
  • Shell scripts for easier running of algorithms, utilities and examples

… and much much more: code cleanup, many bug fixes and performance improvements. Check out the new release and watch for further news on Apache Mahout to come in the next days and weeks.

Details on what’s included can be found in the release notes.

Downloads are available from the Apache Mirrors.

Mahout

Google Summer of Code starting

March 10th, 2010

As published on the Google Open Source blog, the application period for mentoring organizations for GSoC starts now. The ASF is already in the process of applying. If you are a student looking for an interesting project to work on during the coming summer, you might consider participating in GSoC. It gives you a great opportunity to get in touch with successful free software projects, learn how to work in global teams, improve your communication skills and, last but not least, show and publish your fantastic coding skills.

If you want to learn more on Why you should contribute to open source, the article by Shalin Shekhar Mangar is a great summary of some of the reasons why people work on open source projects.

Apache, Hacking, Mahout

Learning to Rank Challenge

March 9th, 2010

In one of his recent blog posts, Jeff Dalton wrote about currently running machine learning challenges. Especially interesting for those working on search engines and interested in learning new rankings from data should be the Yahoo! Learning to Rank Challenge, to be held in conjunction with ICML 2010 in Haifa, Israel. The goal is to show that your algorithm scales on real-world data provided by Yahoo!. The tasks are split in two: the first one focusses on traditional learning to rank procedures, the second one on transfer learning. Tracks are open to participants from industry and research.

A second challenge was published by the machine learning theory blog. The challenge is hosted by Yahoo! as well and deals with Key scientific challenges in statistics and machine learning.

Both programs look pretty interesting - it would be great to see lots of people from the community participating and comparing their systems.

Mahout, Science

Preliminary schedule online for ignite Berlin

February 23rd, 2010

Today the first talks scheduled for ignite Berlin were published. If you would like to give a talk yourself: Submission still seems to be open.

Events, Freetime, Mahout

FOSDEM 2010 - 10 years FOSDEM

February 3rd, 2010

I'm going to FOSDEM, the Free and Open Source Software Developers' European Meeting

The final schedule of FOSDEM 2010 is up: Looks like bad news - 306 interesting talks within just one weekend. Lots of interesting talks in the main track including Greg Kroah-Hartman on “Write and Submit your first Linux kernel Patch”, David Recordon from Facebook on “Scaling Facebook with OpenSource tools”, Bernard Li on “Ganglia: 10 years of monitoring clusters and grids”, Andrew Tanenbaum with his “MINIX 3: a Modular, Self-Healing POSIX-compatible Operating System” talk, Benoît Chesneau on “CouchDB! REST and Database!” and many, many more.

In addition there will be many interesting DevRooms, including one on NoSQL, one on Free Java, the Mono DevRoom featuring a talk by Miguel de Icaza…

Looks like a weekend packed with interesting talks and discussions. If you are going there and are interested in an ad-hoc Hadoop-Beer-drinking meetup, make sure to contact me before the event.

Events, Free Software, Mahout

Mahout in Action

January 11th, 2010

As noted earlier by Grant Ingersoll, the first chapters of Mahout in Action are already online at Manning:



Sean, Robin, keep up the great work! I would love to read more of the book in the near future.

Mahout