Machine learning problem settings

2011-08-06 07:10
Together with Sebastian Schelter I held a Nokia sponsored (Thank you!) lecture on large scale data analysis and data mining during the past few months.

After supervising a few successful university projects based on Apache Mahout the goal of this lecture was to introduce students to some of the basic concepts and problems encountered today in a world where huge datasets are generally available and are easy to process with Apache Hadoop. As such the course is targeted at an entry level audience - thorough treatment of the mathematical background of latest machine learning technology is left to the machine learning research groups in Potsdam, at TU Berlin and the neural information processing group at TU.

Slides and exercises are available online via git. Please let me know if you want to re-use them in your lecture.

The very first problem that users of machine learning algorithms usually come across is mapping their application problem to one of the various machine learning problems. In 2010 Michael Brückner gave a lecture on Intelligent Data Analysis with Matlab (slides and videos in German) including a simple taxonomy of algorithms:

According to

  • the types of input data an algorithm can handle (either independent instances, also called examples, sequences or graphs of instances)
  • the type of training data available (e.g. instances with assigned nominal target attribute, no labels at all, a partial sorting of sets of instances)
  • and the learning goal
algorithms can be nicely partitioned by the learning problem that they solve. Based on that very first step of identifying exactly what the problem setting is, deciding which algorithm to use becomes much easier. Based on that taxonomy I came up with the above graphic giving a first overview of which tasks can be solved with machine learning:

Boxes in dark blue are what in general is called supervised learning, yellow unsupervised and light blue semi supervised - based on the amount of labeled training data available. Red boxes indicate settings with the goal of knowledge discovery. Green are any ranking problems.

Apache Mahout Hackathon Berlin

2011-03-21 21:39
Last year Sebastian Schelter from Berlin was added to the list of committers for Apache Mahout. With two committers in town the idea was born to meet some day, work on Mahout. So why not just announce that meeting publicly and invite others who might be interested in learning more about the framework? I got in touch with c-base - a hacker space in Berlin well suited to host a Hackathon - and quickly got their ok for the event.

As a result the first Apache Mahout Hackathon took place at c-base in Berlin last weekend. We had about eight attendees - arriving at varying times: I guess 11a.m. simply is way too early to get up for your average software developer on a Saturday. I got a few people surprised by the venue - especially those who were attending a Hackathon for the very first time and had expected c-base to be some IT company ;)

We started the day with a brief collection of ideas that everyone wanted to work on: Some needed help to use Mahout - topics included:

  • How to use Apache Mahout collaborative filtering with complex models.
  • How to use Apache Mahout via a web application?
  • How to use classification (mostly focussed on using Naive Bayes from within web applications).
  • Is HBase a solution for scalable graph mining algorithms?
  • Is there a frequent itemset algorithm that respects temporal changes in patterns?

Those more into Mahout development proposed a slightly different set of topics:

  • PLSI and Map/Reduce?
  • Build customisable sampling strategies for distributed recommendations.
  • Come up with a more Java API friendly configuration scheme for Mahout clusterings.
  • Complete the distributed SVD recommender.

Quickly teams of two to three (and more) people formed. First several user side questions could be addressed by mixing more experienced Mahout developers with newbie users. Apart from Mahout specifics also more basic questions of getting involved even by simply contributing to the online documentation, answering questions on the mailing lists or just providing structured access to existing material that users generally have trouble finding.

Another topic that is being overlooked all too when asking users to contribute to the project is the process of creating, submitting, applying and reviewing patches itself: Being deeply involved with free software projects dealing with patches, integration of issue tracker and svn with the project mailing lists all seems very obvious. However even this seemingly basic setup sometimes looks confusing and complex to regular users - that is very common but not limited to people who are just starting to work as software developers.

Thanks to Thilo Fromm for taking the group picture.

In the evening people finally started hacking more sophisticated tasks - working on the first project patches. On Sunday only the really hard core developers remained - leading to a rather focussed work on Mahout improvements which in the end led to first patches sent in from the Mahout Hackathon.

Apache Mahout Meetup Amsterdam

2011-02-19 20:18

Last week I was honoured to be invited as one of the two speakers on Apache Mahout at the Mahout meetup in Amsterdam at JTeams offices. After free beer, cola and pizza Frank Scholten gave an overview of Mahout's clustering capabilities. After a brief introduction to Mahout itself he went into a little more detail on how clustering works in general. After that with a selection of Seinfeld scripts he used a fun data set to guide the audience through the process of choosing the right data preparation steps, coming up with good training parameters and finally evaluating clustering quality.

After that I gave a brief introduction to classification with Mahout - going into a little more detail when it comes to data preparation and quality evaluation. The audience seemed most interested in learning more on how data preparation works - after all that step cannot really be covered by Mahout itself (though we do have some support) but instead needs a lot of domain knowledge from the user side.

Judging from the brief round of self introductions the meetup was well visited by an intesting mixture of people coming from JTeam, Hippo, the dutch police working on data analytics, developers working at RIPE and many more.

If you are interested in more data analysis, search and data storage - do not miss registration for Berlin Buzzwords on June 6/7th 2011.

Apache Mahout in Amsterdam

2011-01-25 20:00
On February 7th there will be an Apache Mahout meetup in Amsterdam kindly organised by JTeam. There will be two presentations - one by myself on classification with Apache Mahout as well as a second one by Frank Scholten on clustering with Apache Mahout.

  • Time: 18:00
  • Location: Frederiksplein 1, 1017XK Amsterdam, The Netherlands

Looking forward to a few days in Amsterdam.


2011-01-23 15:46
It's already sort of a nice little tradition for me to spend the first weekend in February in Brussels for FOSDEM. This year I am particulary happy that there will be a Data Analytics Dev Room at FOSDEM. A huge Thanks to @ogrisel and @nmaillot who have done most of the heavy lifting of getting the schedule in place.

I'm going to FOSDEM, the Free and Open Source Software Developers' European Meeting

Looking forward to an interesting Cloud Track, to meeting Peter Hintjens who is going to give a talk on 0MQ, the DevOps presentation and lots of very interesting DevRooms. Looks like again it's going to be tough to decide on which presentations to go to at any one time.

O'Reilly Strata Conference

2011-01-22 04:34
Title: O'Reilly Strata Conference
Location: Santa Clara
Link out: Click here
Description: Early next February O'Reilly is planning to put on a very interesting conference on the topic of data analysis and the business of generating value from raw digital data.

Strata 2011

I'm really glad to have received the acceptance notification for my presentation and travel sponsorship from the DICODE project. So see you in Santa Clara.
Start Date: 2011-02-01
End Date: 2011-02-03

If you are still unsure whether you should attend or not: Strata kindly handed out discount codes to speakers to share with their followers and readers. It saves you 25% of the registration cost - just use str11fsd during registration.

Apache Mahout Podcast

2010-12-13 21:21
During Apache Con ATL Michael Coté interviewed Grant Ingersoll on Apache Mahout. The interview is available online as podcast. The interview covers the goals and current use cases of the project, goes into some detail on the reasons for initially starting it. If you are wondering what Mahout is all about, what you can do with it and which direction development is heading, the interview is a great option to find out more.

Apache Mahout 0.4 release

2010-11-03 15:21
On last Sunday the Apache Mahout project published the 0.4 release. Nearly every piece of the code has been refactored and improved since the last 0.3 release. The release was timed to happen exactly before Apache Con NA in Atlanta. As such it was published on October 31st - the Halloween release, sort-of.

Especially mentionable are the following improvements:

  • Model refactoring and CLI changes to improve integration and consistency
  • Map/Reduce job to compute the pairwise similarities of the rows of a matrix using a customizable similarity measure
  • Map/Reduce job to compute the item-item-similarities for item-based collaborative filtering
  • More support for distributed operations on very large matrices
  • Easier access to Mahout operations via the command line
  • New vector encoding framework for high speed vectorization without a pre-built dictionary
  • Additional elements of supervised model evaluation framework
  • Promoted several pieces of old Colt framework to tested status (QR decomposition, in particular)
  • Can now save random forests and use it to classify new data

New features and algorithms include:

  • New ClusterEvaluator and CDbwClusterEvaluator offer new ways to evaluate clustering effectiveness
  • New Spectral Clustering and MinHash Clustering (still experimental)
  • New VectorModelClassifier allows any set of clusters to be used for classification
  • RecommenderJob has been evolved to a fully distributed item-based recommender
  • Distributed Lanczos SVD implementation
  • New HMM based sequence classification from GSoC (currently as sequential version only and still experimental)
  • Sequential logistic regression training framework
  • New SGD classifier
  • Experimental new type of NB classifier, and feature reduction options for existing one

There were many, many more small fixes, improvements, refactorings and cleanup. Go check out the new release, give the new features a try and report back to us on the user mailing list.

Apache Mahout @ Lisbon Codebits

2010-10-31 09:36
Second week of November I'll spend a few days in Lisbon - never would have thought that I'd return so quickly when I visited this beautiful city this summer during vacation. I'll be there for Codebits - thanks to Sapo for inviting me to be there.

Back in summer I learned only after I returned to Germany that there was someone form Portugal seeking to meet with other Apache people exactly when I was down there. I contacted the guy proposing to do an Apache Dinner to see how many other committers and friends could be reached. In addition Filipe asked me whether I could imagine flying down to Sapo to give a talk on Mahout as devs there would be interested in it. Well, I told him that if I got travel support, I'd be happy to be there. This 10min chat quickly turned into an invitation to a great conference in Lisbon. Looking forward to meet you there. (And looking forward to weather that compared to Germany is way warmer and more sunny right now. :) )