Archive

Archive for the ‘Mahout’ Category

RecSys Stammtisch Berlin - December 2012

December 30th, 2012 at 12:40pm

Earlier this month I attended the fourth Recommender Stammtisch in Berlin. The event was kindly hosted by Soundcloud - who on top of organising the speakers provided a really yummy buffet by Kochzeichen D.

With Paul Lamere the evening started with a very entertaining but also very packed talk on why music recommendation is special - or put more generally why all recommender systems are special:

  • Traditionally recommender systems found their way into the wild to drive sales. In music however the main goal is to help users discover new content.
  • Listeners are very different: Ranging from those indifferent to what is being played (imagine someone sitting in a coffee bar enjoying their espresso - it’s unlikely that those would want to influence the playlist of the shop’s entertainment system unless they are really annoyed with it’s content). There are casual listeners who from time to time skip a piece. There are more engaged people who train their own recommender through services like last.fm. Finally there are fanatics that are really into certain kinds of music. Building just one system to fit them all won’t do. Also relying on just one signal won’t do - instead you will have to deal with both, content signals like loudness plots as well as community signals.
  • Music applications tend to be highly interactive. So even if there is little to no reliable explicit feedback people tell you how much you like your music when skipping, turning pieces louder, interacting with the content behind the song being played.
  • In contrast to many other domains music deals with a vast item space and a huge long tail of songs that almost never get interacted with.
  • In contrast to shopping recommenders however in music making mistakes is comparably cheap: In most situations music isn’t purchased on a song by song basis but based on some subscription model. That way the actual cost of playing the wrong song is low. Also songs tend to be not much longer than 5min so also users are less annoyed when confronted with a slightly wrong piece of music.
  • When implementing recommenders for a shopping site it is to be avoided to re-recommend stuff a user has purchased already. This is not the case in music recommendation. Quite the contrary: Re-recommending known music is one indicator for playlists people will like.
  • When it comes to building playlists care must be taken to organise songs in a coherent way, mixing new and familiar songs in a pleasing order - essentially the goal should be to take the listener on a journey.
  • Fast updates are crucial: Music business itself is fast paced with new releases coming out regularly and being taken up very quickly by major broadcasting stations.
  • Music is highly contextual: It pays to know if the user is in the mood for calm or for faster music..
  • There are highly passionate users that are very easy to scare away - those tend to be the loudest ones influencing your user community most.
  • Though meta data is key in music as well, never expect it to be correct. There are all sorts of weird band and song names that you never thought would be possible - an observation that also Ticketmaster made when building their ticket search engine.
  • Music is highly social and irrational - so just knowing your users friends and their tastes won’t get you to being perfect.

Overall I guess the conclusion is that no matter which domain you deal with you will always need to know the exact properties of that domain to build a successful system.

In the second talk Brian McFee explained one way of modeling playlists with context. With that he concentrated on passive music discovery - that is based on one query return a list of music to listen to sequentially as opposed to active retrieval where users issue a query to search for a specific piece of music.

Historically it turned out to be difficult to come up with any playlist generator that is better than randomly selecting songs to play. His model is based on a random walk notian where the vertices are songs and edges represent learnt group similarities. Groups were represented by features like familiarity, social tags, specific audio features, metadata, release dates etc. Depending on the playlist category in most cases he was able to show that his model actually does perform better than random.

In the third talk Oscar Celma showed some techniques to also benefit from some of the more complicated signals for music recommendation. Essentially his take was that by relying on usage signals only you will be stuck with the head of the usage distribution only. What you want though is to be able to provide recommendations for the long tail as well.

Some signals he mentioned included content based features (rythm, BPM, timbre, harmony), usage signals, social signals (beware of people trying to game the system or make fun of it though) and a mix of all those. His recommendation was to put content signals at the end of the processing pipeline for re-ranking and refining playlists.

When providing recommendations it is essential to be able to answer why something was recommended. Even just in the space of novelty vs. relevancy to the user there are four possible strategies: a) recommend only old stuff that is marginally relevant to the specific user: This will end up pulling up mostly popular songs. b) recommend what is new but not relevant to the user: This will end up pulling out songs that turn your user away. c) recommend what is relevant to the user but old, this will mostly surface stuff the user knows already but is a safe bet to play. d) recommend what is both relevant and new to the user - here the real interesting work starts as this deals with recommending genuinely new songs to users.

To balance discovery with safe fallback go for skips, bans, likes and dislikes. Take into account the user context and attention.

The final point the speaker made was the need to take into account the whole picture: Your shiny new recommendation algorithm will just be a tiny piece in the puzzle. Much more work will need to go into data collection and ingestion, into API design.

The last talk finally went into some detail of the history of playlist creation - back from music creators’ choices, via radio station mixes, mix tapes and finally ending up at spotify and fully automatic playlist creation.

There is a vast body of knowledge on how to create successful playlists e.g. among DJs that speak about warm-up phases, chillout times, alternating types of music in order to take the audience on a journey. Even just shuffling music the user already knows can be very powerful given the pool of songs the shuffle is based on neither too large (containing too broad types of music) nor too small (leading to frequent repetitions). According to Ben Fields the science and art of playlist generation and in particular evaluation is still pretty much in it’s infancy with much to come.

General, Mahout, Science ,

Fourth #Recsys Stammtisch Berlin

October 23rd, 2012 at 10:38pm

This evening the 4th #recsys Stammtisch (German for “a meetup involving beer”) was kindly organised by Alan Said, Zeno Gantner and Till Plumbaum. The event was hosted by Aklamio with beers and drinks provided by Plista. They had three talks:

  • @AlanSaid gave an overview of the topics covered in this year’s RecSys conference in Dublin. Instead of going into too much technical detail the presentation gave a whirl-wind tour of the topics that are currently under discussion, the competitions to participate in and links to people relevant to the topic to follow up with. He put his slides online already.
  • As second speaker the meetup had @zenogantner give a tour to his MyMedialight recommender system library. Though written in c# there is no need for a deep c# knowledge to use the system - it comes with useful command line tools out of the box, supports all common algorithms and evaluation setups. One of the few talks where life demos actually worked.
  • The third talk - one of the rare “slide-free” presentations - covered Plista and it’s relation to recommender systems. After going into some more detail on where they came from (from a big over-arching solution down to the narrow, sharp focus of doing ad recommendations), where they want to go (back to an over-arching solution to be offered as a service with the goal of bringing interaction data of many services together in one hosted system). Most interesting news to me: They are working on an open source web-service layer for Apache Mahout that seems to be already in production. Definitely something to watch.

Overall a good crowd of over 20 people from various startups, universities and larger companies in Berlin joined the meetup. There were even some people travelling there from Magdeburg. Pretty good to know that there are so many people knowledgeable in the general area of recommender systems in and close to Berlin - and good to see some of those I knew already before the meetup again. Looking forward to the next event - any volunteers for organising one?

Mahout , ,

Apache Mahout 0.6 released

February 8th, 2012 at 9:33pm

As of Monday, February 6th a new Apache Mahout version was released. The new package features

Lots of performance improvments:

  • A new LDA implementation using Collapsed Variational Bayes 0th Derivative Approximation - try that out if you have been bothered by the way less than optimal performance of the old version.
  • Improved Decision Tree performance and added support for regression problems
  • Reduced runtime of dot product between vectors - many algorithms in Mahout rely on that, so these performance improvements will affect anyone using them.
  • Reduced runtime of LanczosSolver tests - make modifications to Mahout more easily and have faster development cycles by faster testing.
  • Increased efficiency of parallel ALS matrix factorization
  • Performance improvements in RowSimilarityJob, TransposeJob - helpful for anyone trying to find similar items or running the Hadoop based recommender

New features:

  • K-Trusses, Top-Down and Bottom-Up clustering, Random Walk with Restarts implementation
  • SSVD enhancements

Better integration:

  • Added MongoDB and Cassandra DataModel support
  • Added numerous clustering display examples

Many bug fixes, refactorings, and other small improvements. More information is available in the Release Notes.

Overall great improvements towards better performance, better stability and integration. However there are still quite some outstanding issues and issues in need for review. Come join the project, help us improve existing patches, improve performance and in particular integration and streamlining of how to use the different parts of the project.

Mahout , , ,

Learning Machine Learning with Apache Mahout

December 13th, 2011 at 10:20pm

Once in a while I get questions like Where to start learning more on machine learning. Other than the official sources I think there is quite good coverage also in the Mahout community: Since it was founded several presentations have been given that give an overview of Apache Mahout, introduce special features or even go into more details on particular implementations. Below is an attempt to create a collection of talks given so far without any claim to contain links to all videos or lectures. Feel free to add your favourite in the comments section. In addition I linked to some online courses with further material to get you started.

When looking for books of course check out Mahout in Action. Also Taming Text and the data mining book that comes with weka are good starting points for practitioners.

Introductory, overview videos

Technical details

Further course material

Mahout, Science , ,

See you in Vancouver at Apache Con NA 2011

October 24th, 2011 at 1:49pm

Mid November Apache hosts its famous yearly conference - this time in Vancouver/Canada. They kindly accepted my presentations on Apache Mahout for intelligent data analysis (mostly focused on introducing the project to new comers and showing what happened within the project in the past year - if you have any wish concerning topics you would like to see covered in particular, please let me know) as well as a more committer focused one on Talking people into creating patches (with the goal of highlighting some of the issues new-comers to free software projects that want to contribute run into and initiating a discussion on what helps to convince them to keep up the momentum and over come and obstacles).

Looking forward to seeing you in Vancouver for Apache Con NA.

Apache Con, Free Software, Mahout ,

GoTo Con

October 10th, 2011 at 8:49pm

Location: Amsterdam
Link out: Click here
Start Date: 2011-10-12
End Date: 2011-10-14

This week late Tuesday night I am going to leave for GoTo con in Amsterdam. Train tickets are already booked - this is going to be my first trip with City Night line, will see how great they are.

GoTo Amsterdam features a special Apache track as well as several talk on scaling up, searching, but also includes stuff in general architectural decisions. If you have not registered yet - use dros200 as promotion code to get a discount on the registration prize.

Looking forward to seeing you in Amsterdam later this week.

General, Mahout , ,

Apache Mahout Hackathon Berlin

March 21st, 2011 at 9:39pm

Last year Sebastian Schelter from Berlin was added to the list of committers for Apache Mahout. With two committers in town the idea was born to meet some day, work on Mahout. So why not just announce that meeting publicly and invite others who might be interested in learning more about the framework? I got in touch with c-base - a hacker space in Berlin well suited to host a Hackathon - and quickly got their ok for the event.

As a result the first Apache Mahout Hackathon took place at c-base in Berlin last weekend. We had about eight attendees - arriving at varying times: I guess 11a.m. simply is way too early to get up for your average software developer on a Saturday. I got a few people surprised by the venue - especially those who were attending a Hackathon for the very first time and had expected c-base to be some IT company ;)

We started the day with a brief collection of ideas that everyone wanted to work on: Some needed help to use Mahout - topics included:

  • How to use Apache Mahout collaborative filtering with complex models.
  • How to use Apache Mahout via a web application?
  • How to use classification (mostly focussed on using Naive Bayes from within web applications).
  • Is HBase a solution for scalable graph mining algorithms?
  • Is there a frequent itemset algorithm that respects temporal changes in patterns?

Those more into Mahout development proposed a slightly different set of topics:

  • PLSI and Map/Reduce?
  • Build customisable sampling strategies for distributed recommendations.
  • Come up with a more Java API friendly configuration scheme for Mahout clusterings.
  • Complete the distributed SVD recommender.

Quickly teams of two to three (and more) people formed. First several user side questions could be addressed by mixing more experienced Mahout developers with newbie users. Apart from Mahout specifics also more basic questions of getting involved even by simply contributing to the online documentation, answering questions on the mailing lists or just providing structured access to existing material that users generally have trouble finding.

Another topic that is being overlooked all too when asking users to contribute to the project is the process of creating, submitting, applying and reviewing patches itself: Being deeply involved with free software projects dealing with patches, integration of issue tracker and svn with the project mailing lists all seems very obvious. However even this seemingly basic setup sometimes looks confusing and complex to regular users - that is very common but not limited to people who are just starting to work as software developers.

Thanks to Thilo Fromm for taking the group picture.

In the evening people finally started hacking more sophisticated tasks - working on the first project patches. On Sunday only the really hard core developers remained - leading to a rather focussed work on Mahout improvements which in the end led to first patches sent in from the Mahout Hackathon.

Hacking, Mahout , , ,

Apache Mahout Meetup Amsterdam

February 19th, 2011 at 8:18pm

Last week I was honoured to be invited as one of the two speakers on Apache Mahout at the Mahout meetup in Amsterdam at JTeams offices. After free beer, cola and pizza Frank Scholten gave an overview of Mahout’s clustering capabilities. After a brief introduction to Mahout itself he went into a little more detail on how clustering works in general. After that with a selection of Seinfeld scripts he used a fun data set to guide the audience through the process of choosing the right data preparation steps, coming up with good training parameters and finally evaluating clustering quality.

After that I gave a brief introduction to classification with Mahout - going into a little more detail when it comes to data preparation and quality evaluation. The audience seemed most interested in learning more on how data preparation works - after all that step cannot really be covered by Mahout itself (though we do have some support) but instead needs a lot of domain knowledge from the user side.

Judging from the brief round of self introductions the meetup was well visited by an intesting mixture of people coming from JTeam, Hippo, the dutch police working on data analytics, developers working at RIPE and many more.

If you are interested in more data analysis, search and data storage - do not miss registration for Berlin Buzzwords on June 6/7th 2011.

General, Mahout , ,

O’Reilly Strata Conference

January 22nd, 2011 at 4:34am

Title: O’Reilly Strata Conference
Location: Santa Clara
Link out: Click here
Description: Early next February O’Reilly is planning to put on a very interesting conference on the topic of data analysis and the business of generating value from raw digital data.


Strata 2011

I’m really glad to have received the acceptance notification for my presentation and travel sponsorship from the DICODE project. So see you in Santa Clara.
Start Date: 2011-02-01
End Date: 2011-02-03

If you are still unsure whether you should attend or not: Strata kindly handed out discount codes to speakers to share with their followers and readers. It saves you 25% of the registration cost - just use str11fsd during registration.

General, Mahout , , ,

Apache Mahout Hackathon Berlin

December 14th, 2010 at 8:50pm

Early next year - on February 19th/20th to be more precise - the first Apache Mahout Hackathon is scheduled to take place at c-base. The Hackathon will take one weekend. There will be plenty of time to hack on your favourite Mahout issue, to get in touch with two of the Mahout committers and get your machine learning project off the ground.

Please contact isabel@apache.org if you are planning to attend this event or register with the xing event so we can plan for enough space for everyone. If you have not registered for the event there is now guarantee you will be admitted.

If you’d like to support the event: We are still looking for sponsors for drinks and pizza.

Mahout , , , ,