
Posts Tagged ‘Mahout’

Teaching Free Software Development

June 20th, 2010

In summer last year I was invited to give a presentation on Apache Mahout at TU Berlin. After the talk, some of the research group members asked me to design and give a course on scalable machine learning with open source software during the winter semester.

The project attracted four to five students - not very many - but then again it is a course people take voluntarily. During the first semester participants were asked to use Mahout to build a system that crawls web pages, assigns them to clusters and makes the content searchable with Lucene. The intention was to get students to contribute any patches they had to make back to Mahout. In addition, the code behind the system was supposed to be published after the project was over.

This setup turned out to be sub-optimal: The participants never grew confident enough to publish their ideas and design on the mailing list, let alone to send in the access data for the SCM system that hosted the project source code.

A similar setup was run at HPI Potsdam by Christoph Böhm: He let students implement various information retrieval and machine learning algorithms on top of Apache Hadoop. After the course was over he tried to motivate students to publish their code at Apache Mahout. So far I have seen no submissions.

Aware of these problems, I chose a slightly different model when I set up the course for the summer semester at TU: With only four students, none of whom had enough free cycles to work on the project full time, I set the goal of implementing an HMM (hidden Markov model) - including tests, examples and documentation. As the course was roughly aligned with GSoC, I asked students to publish their timeline in JIRA. As soon as coding started I urged them to publish even incremental progress and ask the community for feedback.

Now we do have an open JIRA issue with a patch attached to it. People also got some code review feedback already. With Berlin Buzzwords in town while the course was still running, I took the chance to get students in touch with other Mahout developers. It looks like at least one of them is planning to stay with the project for a little longer. For me it would be a great success if at least one student could be turned into a longer term contributor to the project.

So far it looks like applying the general principle of releasing code early and often helps people integrate into a project. My own lesson learned from those experiences, however, is to urge students early on to get in touch and release their code: It was not particularly easy to get them to send e-mails to public mailing lists. However, once they had done this just once, feedback usually was very positive - and they were surprised by how friendly and helpful people in the free software community generally are.

Mahout

My highly subjective Berlin Buzzwords recap

June 13th, 2010

Last November I innocently asked Grant what it would take to get him to give a talk in Berlin. The only requirement he told me was that I’d have to pay for his flight. About eight months later we had Berlin Buzzwords - a conference all about scalability, data storage and search. With Simon Willnauer, Uwe Schindler, Michael Busch, Robert Muir, Grant Ingersoll, Andrzej Bialecki and many others we had quite a few Lucene people in town.


From the NoSQL community, Peter Neubauer, Rusty Klophaus, Jan Lehnardt, Mathias Meyer, Eric Evans and many others made sure people got their fair share of NoSQL knowledge. With Aaron Kimball, Jay Booth, Doug Judd and Steve Loughran we had several Hadoop and related people at the conference.

The conference also featured two talks on Apache Mahout: An overview from Frank Scholten as well as a more in-depth talk by Sean Owen. It’s great to see the project grow - not only in terms of development community but also in terms of requests from professional Mahout users.

In addition we had a keynote by Pieter Hintjens that concentrated on messaging in general and 0MQ in particular - a scalability topic otherwise highly underrepresented at Berlin Buzzwords.


We got well over 300 attendees who filled Berlin Kosmos - a former cinema. Attendees were a good mixture of Apache and non-Apache people, developers and users. People used the breaks and bar tours after the event to get in touch and exchange ideas. It’s always good to see developers discuss design issues and architectural challenges.

Monday evening was reserved for local people taking out the speakers and interested attendees for Bar Tours to Friedrichshain. Those from Berlin took Berlin Buzzwords people to their favourite restaurants and bars - or to what they considered to be “typical Berlin”. Some spent evenings later that week drinking beer or Berliner Weisse.




The tour for keynote speakers Grant Ingersoll, Pieter Hintjens and friends was organised by Julia and myself. We went over to Kreuzberg - some went to the famous Burgermeister for burgers, the other half went to a nearby Indian restaurant. After that we spent the evening in Club der Visionäre - a club next to the water. I personally left at about midnight - several people from the Lucene community moved on to the well-known Fette Ecke later on.

When we asked the audience about repeating the conference next year, all hands went up immediately. Besides lots of praise for the organisation, the feedback form we put up brought us some good ideas on how to improve the conference next year. I’d love to have you guys back here in 2011 - and I’d love to get even more attendees in. It was great fun having you here. Thanks for 5 great days:

Five instead of two days, because:

  • Keynote speakers got a special treatment - that is a personal city guide for the weekend before Buzzwords.
  • We had the official conference start on Sunday with a Barcamp.
  • We had another Apache dinner on Wednesday with those Apache people that live in Berlin. In addition, Aaron and Sarah joined us as they were still in town for the Apache Hadoop trainings. Also Greg Stein had pizza and beer with us - he was in town for the svn conference at the end of the week.


Thanks to all who helped turn this conference into a success: Julia Gemählich for conference management, Ulf and Wetter for WiFi setup, Nils for travel management, Simon and Jan for support ranking talks and reaching out to your communities, all speakers for fantastic talks, those taking pictures of the conference and sharing them on Flickr for showing those who stayed at home how great the conference was, peoplezapping for the videos that will soon be available online, all sponsors for supporting the conference, all attendees for their participation. I’d love to have all of you (and many more) back in Berlin next year. An informal call for presentations has been set up already - submit now and be the one to set the trend instead of just following the Buzzwords!

For those who do not want to wait for another year: We will have another Apache Hadoop Get Together in September 2010 - watch this space for more information. If you’d like to give a talk there and present your Hadoop/Solr/Lucene etc. system - please get in touch with me.

Apache Hadoop Get Together Berlin, Berlin Buzzwords, Events

Berlin Buzzwords - End of CfP drawing closer

April 11th, 2010

One week to go for submitting a talk on your favourite NoSQL topic, your favourite search application or your most interesting data analysis task: The call for presentations for Berlin Buzzwords ends on April 17th, that is Sunday next week.

Shortly after the last talk has been submitted we will start announcing speakers - the final list of speakers is to be expected by the start of May, and the final schedule will be published shortly after that.

Berlin Buzzwords

Berlin Buzzwords - Early bird registration

April 10th, 2010

I would like to invite everyone interested in data storage, analysis and search to join us for two days on June 7/8th in Berlin for Berlin Buzzwords - an in-depth, technical, developer-focused conference located in the heart of Europe. Presentations will range from beginner friendly introductions on the hot data analysis topics up to in-depth technical presentations of scalable architectures.

Our intention is to bring together users and developers of data storage, analysis and search projects. Meet members of the development team working on projects you use. Get in touch with other developers you may know only from mailing list discussions. Exchange ideas with those using your software and get their feedback while having a drink in one of Berlin’s many bars.

Early bird registration has been extended until April 17th - so don’t wait too long.

If you would like to submit a talk yourself: Conference submission is open for little more than one week. More details are available online in the call for presentations.

Looking forward to meeting you in the beautiful, vibrant city of Berlin this summer for a conference packed with high profile speakers, awesome talks and lots of interesting discussions.

Berlin Buzzwords

Working on Mahout as part of your studies at TU Berlin

April 9th, 2010

Did you ever wonder who those weird people working on free software projects are? Did you ever ask yourself how these developers organise their work, how they collaborate, which values are important to them? Did you ever think about participating in a free software project yourself but never really had time to do so because your studies were just too time-consuming?

Well, if you are a student at one of the Berlin universities, there is a project at the research group DIMA at TU Berlin that might be of interest to you: Hot Topics in Information Management is the second edition of last year’s course, which focussed on building systems with Apache Mahout.

This term the course will concentrate on extending Mahout. During the first week, students are given a set of possible project ideas to choose from. Of course you are invited to add your own ideas as well. You will need to come up with a rough plan of material to read, modules to implement and a timeframe for each module.

You are asked to not only implement your chosen extension but to thoroughly (unit-/integration-) test it, to document it, to provide examples of its usage and finally to work together with the community on contributing your implementation back to the project.

During the course you are free to re-use resources built up for last year’s course - both hardware as well as installed software and available data.

The course starts next week on Tuesday - registration closes in a few days, so make sure you sign up if you are interested in working on Mahout during your regular project time and getting credits for that.

Mahout

GSoC - one day to go for your application

April 8th, 2010

If you are a student interested in participating in Google Summer of Code: Registration closes tomorrow (as in “April 9, 19:00 UTC”). You hopefully published and discussed your proposal at your favourite project already so you have a clear plan of where to go and which milestones to achieve in summer.

If you are interested in Apache Mahout: Yes, as in previous years, we are again looking for students willing to work on awesome student projects this summer. Several core Mahout developers have signed up as mentors for GSoC. With Robin, one of our former GSoC students has now turned into a mentor: It’s always amazing to watch students stick with the project and continue contributing valuable input.

So in case you would love to learn more about machine learning, train your software development skills and work with great people on your favourite problem, do not forget to submit your project proposal by tomorrow.

Mahout

Apache Mahout 0.3 released

March 18th, 2010

This week, Apache Mahout 0.3 was released. First of all thanks to all committers and contributors who made that possible: Thanks for all your hard work on making the code even faster and integrating even more algorithms.

The highlights:

  • New: math and collections modules based on the high performance Colt library
  • Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
  • Parallel Dirichlet process clustering (model-based clustering algorithm)
  • Parallel co-occurrence based recommender
  • Parallel text document to vector conversion using LLR based ngram generation
  • Parallel Lanczos SVD (Singular Value Decomposition) solver
  • Shell scripts for easier running of algorithms, utilities and examples

… and much much more: code cleanup, many bug fixes and performance improvements. Check out the new release and watch for further news on Apache Mahout to come in the next days and weeks.

Details on what’s included can be found in the release notes.

Downloads are available from the Apache Mirrors.

Mahout

Seminar on scaling learning at DIMA TU Berlin

March 17th, 2010

Last Thursday the seminar on scaling learning problems took place at DIMA at TU Berlin. We had five students give talks.

The talks started with an introduction to map reduce. Oleg Mayevskiy first explained the basic concept, then gave an overview of the parallelization architecture and finally showed how jobs can be formulated as map reduce jobs.

His paper as well as his slides are available online.
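To give a rough idea of what formulating a job as a map reduce job means, here is a minimal single-machine sketch of the classic word count example. It is not taken from the talk: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and a real framework such as Hadoop would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> Iterable[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as a framework would do between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(documents: List[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially (hypothetical driver for illustration)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

The point of the split is that each map call only looks at one document and each reduce call only looks at one key, which is what makes the work easy to distribute.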

Second was Daniel Georg - he was working on the rather broad topic of NoSQL databases. As the topic is too broad to be covered in one 20-minute talk, Daniel focussed on distributed solutions - namely Bigtable/HBase and Yahoo! PNUTS.

Daniel’s paper as well as the slides are available online as well.

Third was Dirk Dieter Flamming on duplicate detection. He concentrated on algorithms for near duplicate detection needed when building information retrieval systems that work with real world documents: The web is full of copies, mirrors, near duplicates and documents made of partial copies. The important task is to identify near duplicates to not only reduce the data store but to potentially be able to track original authorship over time.

Again, paper and slides are available online.
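For readers who have not come across near duplicate detection before, one common textbook approach is to compare documents by their overlapping word sequences (shingles) using Jaccard similarity. The sketch below assumes that approach; it is not necessarily what was covered in the talk, and the function names, the shingle size `k=3` and the threshold of 0.5 are arbitrary choices for illustration.

```python
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of all k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts are represented by a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a: str, doc_b: str, k: int = 3, threshold: float = 0.5) -> bool:
    """Treat two documents as near duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    mirror = "the quick brown fox jumps over the lazy cat"
    print(jaccard(shingles(original), shingles(mirror)))
    print(is_near_duplicate(original, mirror))
```

Comparing every pair of documents this way does not scale to web-sized collections, which is why practical systems usually combine shingling with hashing or sketching techniques.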

After a short break, Qiuyan Xu presented ways to learn ranking functions from explicit as well as implicit user feedback. Any interaction with search engines provides valuable feedback about the quality of the current ranking function. Watching users - and learning from their clicks - can help to improve future ranking functions.

A very detailed paper as well as slides are available for download.
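As a small illustration of how clicks can be turned into training data, the sketch below uses one widely known heuristic for implicit feedback: a clicked result is assumed to be preferred over every unclicked result that was ranked above it. This is an assumption chosen for illustration, not a summary of the methods in the paper, and the function name `preference_pairs` is hypothetical.

```python
from typing import List, Set, Tuple


def preference_pairs(ranking: List[str], clicked: Set[str]) -> List[Tuple[str, str]]:
    """Return (preferred, less_preferred) pairs derived from one result list.

    Each clicked document is taken to be preferred over every unclicked
    document that appeared above it in the ranking.
    """
    pairs: List[Tuple[str, str]] = []
    for position, doc in enumerate(ranking):
        if doc in clicked:
            for skipped in ranking[:position]:
                if skipped not in clicked:
                    pairs.append((doc, skipped))
    return pairs


if __name__ == "__main__":
    # The user skipped d1 and d2 and clicked d3.
    print(preference_pairs(["d1", "d2", "d3", "d4"], {"d3"}))
```

Pairs like these can then serve as training examples for a pairwise learning to rank method.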

The last talk was by Robert Kubiak on topic detection and tracking. The talk presented methods for identifying and tracking upcoming topics e.g. in news streams or blog postings. Given the amount of new information published digitally each day, these systems can help by following interesting news topics or by sending notifications on new, upcoming topics.

Paper and slides are available online.

If you are a student in Berlin interested in scalable machine learning: The next course IMPRO2 has been set up. As last year, the goal is to not only improve your skills in writing code but also to interact with the community and, if appropriate, to contribute back the work created during the course.

Science

Learning to Rank Challenge

March 9th, 2010

In one of his recent blog posts, Jeff Dalton wrote about currently running machine learning challenges. Especially interesting for those working on search engines and interested in learning new rankings from data should be the Yahoo! Learning to Rank Challenge, to be held in conjunction with this year’s ICML in Haifa, Israel. The goal is to show that your algorithm scales on real-world data provided by Yahoo!. The challenge is split into two tracks: The first one focusses on traditional learning to rank procedures, the second one on transfer learning. Both tracks are open to participants from industry and research.

A second challenge was published by the machine learning theory blog. The challenge is hosted by Yahoo! as well and deals with Key scientific challenges in statistics and machine learning.

Both programs look pretty interesting - it would be great to see lots of people from the community participating and comparing their systems.

Mahout, Science

Mahout at Berlin ignite

March 1st, 2010

This evening the first Berlin ignite event took place in the “Festsaal” in Berlin X-Berg. Organiser of the event was Matt Biddulph from Nokia Gate 5. We had eleven fantastic talks (ok, to be more precise: At least ten fantastic ones, my own can only be judged by the audience ;) ).

Topics included things you can learn when starting to collect data, themes from (agile) project management, RepRap machines (see also the Rep Rap FOSDEM 2010 talk), bots and robots. The talks finished with a presentation of a Part time scientist’s vision of getting to the moon - an article on the project is available on heise newsticker.

The room was filled with more than 120 people, resulting in a location packed with interested attendees. It was great seeing the talks on such diverse topics. I hope to have more events of this format here in Berlin. Thanks go to Matt, all speakers and everyone involved in making the event a big success.

For those who didn’t make it to the event, slides and audio should go online soon. At least the slides on Mahout are available online.

*Camp