On Taming Text

2013-01-01 20:21
This time of the year I would usually post pictures of my bicycle standing in the snow somewhere in Tierpark. This year, however, I was tricked into using public transport instead: a) After my husband found a new job, we now share part of the route to work - and he isn't crazy enough to go by bike when it's snowing. b) I got myself a Nexus 7 earlier this month, which removed the need to carry paper books with me on public transport. c) Early in December Grant Ingersoll asked me for feedback on the by now nearly finished "Taming Text" (currently available as MEAP at Manning). So I even had a really interesting book to read on my way home.

Up to mid-December "Taming Text" was one of those books that had always been very high on my to-read list: Judging from the TOC alone it looked like the book to read if you ever wanted to write a search application. So when I got the offer to read and review the book I was really curious which topics it would cover and how deep the explanations would go.

tl;dr



Short version: If you are building search applications - that is, anything that puts a search box on a web site, be it an online store or a news article archive - this is the book to read. It covers all the gory details of how to implement the features we have come to take for granted when using search: type-ahead, spelling correction, faceting, automatic tagging and more. The book motivates the value of each feature from the user's perspective, explains how to implement these features with proven technologies like Apache Lucene, OpenNLP, and Mahout, and shows how those projects work internally to provide the functionality you need.

Longer summary



Search can be as easy as providing one box in some corner of your web site that users can type into to find relevant pages. However, thinking about the topic just a little longer, a few more handy features that users have come to expect come to mind:

  • Type ahead avoids superfluous typing - it also comes in handy for avoiding spelling errors and for knowing in advance which queries will actually return a decent number of documents.
  • Spelling correction is pretty much standard - and avoids user frustration with hard-to-spell query terms (a minimal sketch of this feature follows the list).
  • Faceting is a great way to discover and explore more content, in particular when there are a few structured attributes attached to your items (prices for books, colors for cars etc.).
  • Named Entity Recognition is well known among publishers, who use automatic tagging services to support their staff.
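
Picking out the spelling correction feature: below is a minimal "did you mean" sketch built on the SpellChecker from Lucene's spellcheck module. This is not code from the book, the file names are made up, and the exact method signatures vary between Lucene versions (this follows the 3.x style):

    import java.io.File;

    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.FSDirectory;

    public class DidYouMean {
      public static void main(String[] args) throws Exception {
        // Build a spell checking index from a plain text dictionary that holds
        // one known term per line; both paths are made up for this sketch.
        SpellChecker spellChecker = new SpellChecker(FSDirectory.open(new File("spell-index")));
        spellChecker.indexDictionary(new PlainTextDictionary(new File("terms.txt")));

        // Fetch up to five dictionary terms close to the misspelled query term.
        String[] suggestions = spellChecker.suggestSimilar("recommendr", 5);
        for (String suggestion : suggestions) {
          System.out.println("Did you mean: " + suggestion);
        }
        spellChecker.close();
      }
    }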


The authors of Taming Text decided to structure the book around the task of building an automatic question answering system. Throughout the book they present the technologies that need to be orchestrated to build such an application but that are each valuable in their own right.

In contrast to Search Patterns (which focuses mainly on the product manager perspective and contains much less technical detail), Taming Text is the book to read for any engineer working on search applications. In contrast to books like Programming Collective Intelligence, Taming Text takes you one level further by not only showing which tools to use but also explaining their inner workings so that you can adapt them exactly to your use case. To me, Taming Text is the ideal complementary book to Mahout in Action (for the machine learning part) and Lucene in Action (for the search part).

Back in 1998 it was estimated that 80% of all information is unstructured data. To make sense of that wealth of data we need technologies that can deal with unstructured data. Search is one of the most basic but also most powerful ways to analyse texts. With a good mixture of theoretical background and hands-on examples, Taming Text guides you through the process of building a successful search application, whether you are dealing with a vast product database you want to make more accessible to your users, with an ever-growing news archive, or with several blog posts and Twitter messages you want to extract data from.

RecSys Stammtisch Berlin - December 2012

2012-12-30 12:40
Earlier this month I attended the fourth Recommender Stammtisch in Berlin. The event was kindly hosted by Soundcloud, who on top of organising the speakers provided a really yummy buffet by Kochzeichen D.

The evening started with a very entertaining but also very packed talk by Paul Lamere on why music recommendation is special - or, put more generally, why all recommender systems are special:


  • Traditionally, recommender systems found their way into the wild to drive sales. In music, however, the main goal is to help users discover new content.
  • Listeners are very different, ranging from those indifferent to what is being played (imagine someone sitting in a coffee bar enjoying their espresso - it's unlikely they would want to influence the playlist of the shop's entertainment system unless they are really annoyed with its content), to casual listeners who skip a piece from time to time, to more engaged people who train their own recommender through services like last.fm, to fanatics who are really into certain kinds of music. Building just one system to fit them all won't do. Relying on just one signal won't do either - instead you will have to deal with both content signals like loudness plots and community signals.
  • Music applications tend to be highly interactive. So even if there is little to no reliable explicit feedback, people tell you how much they like the music by skipping, turning pieces louder, or interacting with the content behind the song being played.
  • In contrast to many other domains music deals with a vast item space and a huge long tail of songs that almost never get interacted with.
  • In contrast to shopping recommenders, making mistakes in music is comparably cheap: In most situations music isn't purchased on a song-by-song basis but through some subscription model, so the actual cost of playing the wrong song is low. Also, songs tend to be no longer than about five minutes, so users are less annoyed when confronted with a slightly wrong piece of music.
  • When implementing recommenders for a shopping site, re-recommending items a user has already purchased is to be avoided. Not so in music recommendation. Quite the contrary: re-recommending known music is one indicator of playlists people will like.
  • When it comes to building playlists care must be taken to organise songs in a coherent way, mixing new and familiar songs in a pleasing order - essentially the goal should be to take the listener on a journey.
  • Fast updates are crucial: Music business itself is fast paced with new releases coming out regularly and being taken up very quickly by major broadcasting stations.
  • Music is highly contextual: It pays to know whether the user is in the mood for calm or for faster music.
  • There are highly passionate users who are very easy to scare away - those tend to be the loudest ones, influencing your user community most.
  • Though metadata is key in music as well, never expect it to be correct. There are all sorts of weird band and song names that you never thought possible - an observation Ticketmaster also made when building their ticket search engine.
  • Music is highly social and irrational - so just knowing your users' friends and their tastes won't get you to perfection.


Overall I guess the conclusion is that no matter which domain you deal with, you will always need to know its exact properties to build a successful system.

In the second talk Brian McFee explained one way of modelling playlists with context. He concentrated on passive music discovery - that is, based on one query, return a list of music to listen to sequentially, as opposed to active retrieval where users issue a query to search for a specific piece of music.

Historically it turned out to be difficult to come up with any playlist generator that performs better than randomly selecting songs to play. His model is based on a random walk notion where the vertices are songs and the edges represent learnt group similarities. Groups were represented by features like familiarity, social tags, specific audio features, metadata, release dates etc. Depending on the playlist category, in most cases he was able to show that his model actually does perform better than random.
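
To make the random walk idea a bit more tangible, here is a toy sketch in plain Java - not Brian McFee's actual model: the graph representation is made up and the learnt group features are reduced to plain edge weights:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    /** Toy playlist generator: a random walk over a song graph whose edge
     *  weights stand in for learnt similarities between songs. */
    public class RandomWalkPlaylist {

      // song -> (neighbouring song -> transition weight)
      private final Map<String, Map<String, Double>> graph;

      RandomWalkPlaylist(Map<String, Map<String, Double>> graph) {
        this.graph = graph;
      }

      /** Pick the next song with probability proportional to its edge weight. */
      private String step(String current, Random random) {
        Map<String, Double> neighbours = graph.get(current);
        if (neighbours == null || neighbours.isEmpty()) {
          return current; // dead end: stay on the current song
        }
        double total = 0.0;
        for (double weight : neighbours.values()) {
          total += weight;
        }
        double draw = random.nextDouble() * total;
        for (Map.Entry<String, Double> edge : neighbours.entrySet()) {
          draw -= edge.getValue();
          if (draw <= 0.0) {
            return edge.getKey();
          }
        }
        return current; // not reached as long as all edge weights are positive
      }

      /** Walk the graph, collecting the visited songs into a playlist. */
      List<String> playlist(String seed, int length, Random random) {
        List<String> result = new ArrayList<>();
        String current = seed;
        for (int i = 0; i < length; i++) {
          result.add(current);
          current = step(current, random);
        }
        return result;
      }
    }

In the talk's setting the edge weights would be learnt from the group features listed above, and the resulting walks would then be evaluated per playlist category.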

In the third talk Oscar Celma showed some techniques to also benefit from some of the more complicated signals for music recommendation. Essentially his take was that by relying on usage signals alone you will be stuck with the head of the usage distribution. What you want, though, is to be able to provide recommendations for the long tail as well.

Some signals he mentioned included content-based features (rhythm, BPM, timbre, harmony), usage signals, social signals (beware of people trying to game the system or make fun of it, though) and a mix of all of those. His recommendation was to put content signals at the end of the processing pipeline for re-ranking and refining playlists.
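
As a sketch of what "content signals at the end of the pipeline" could look like: below, candidates scored by a usage-based collaborative filtering stage are re-ordered by a content feature at the very end. The BPM feature and the blend weight are assumptions for illustration, not Celma's actual system:

    import java.util.Comparator;
    import java.util.List;

    /** Illustrative last pipeline stage: usage signals produce and score the
     *  candidates, content signals only adjust the final ordering. */
    public class ContentReRanker {

      static class Candidate {
        final String songId;
        final double usageScore; // from the collaborative filtering stage
        final double bpm;        // a content feature extracted from the audio

        Candidate(String songId, double usageScore, double bpm) {
          this.songId = songId;
          this.usageScore = usageScore;
          this.bpm = bpm;
        }
      }

      /** Penalise candidates whose tempo is far from the target tempo; the
       *  0.01 weight is made up and would need tuning against real data. */
      static void rerank(List<Candidate> candidates, double targetBpm) {
        candidates.sort(Comparator.comparingDouble(
            (Candidate c) -> -(c.usageScore - 0.01 * Math.abs(c.bpm - targetBpm))));
      }
    }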

When providing recommendations it is essential to be able to answer why something was recommended. Even just in the space of novelty vs. relevance to the user there are four possible strategies: a) Recommend old items that are only marginally relevant to the specific user - this will end up pulling up mostly popular songs. b) Recommend what is new but not relevant to the user - this will end up pulling out songs that turn your user away. c) Recommend what is relevant to the user but old - this will mostly surface items the user already knows, but is a safe bet to play. d) Recommend what is both relevant and new to the user - here the really interesting work starts, as this deals with recommending genuinely new songs to users.

To balance discovery with safe fallbacks, rely on skips, bans, likes and dislikes, and take the user's context and attention into account.

The final point the speaker made was the need to look at the whole picture: Your shiny new recommendation algorithm will just be a tiny piece of the puzzle. Much more work will need to go into data collection and ingestion, and into API design.

The last talk finally went into some detail on the history of playlist creation - from music creators' choices, via radio station mixes and mix tapes, to Spotify and fully automatic playlist creation.

There is a vast body of knowledge on how to create successful playlists, e.g. among DJs, who speak about warm-up phases, chill-out times and alternating types of music in order to take the audience on a journey. Even just shuffling music the user already knows can be very powerful, given that the pool of songs the shuffle is based on is neither too large (containing too broad a mix of music) nor too small (leading to frequent repetitions). According to Ben Fields the science and art of playlist generation, and in particular its evaluation, is still pretty much in its infancy, with much more to come.

Fourth #Recsys Stammtisch Berlin

2012-10-23 22:38
This evening the 4th #recsys Stammtisch (German for "a meetup involving beer") was kindly organised by Alan Said, Zeno Gantner and Till Plumbaum. The event was hosted by Aklamio, with beers and drinks provided by Plista. The evening featured three talks:


  • @AlanSaid gave an overview of the topics covered at this year's RecSys conference in Dublin. Instead of going into too much technical detail, the presentation gave a whirlwind tour of the topics currently under discussion, the competitions to participate in, and links to relevant people to follow up with. He has already put his slides online.
  • As the second speaker, @zenogantner gave a tour of his MyMediaLite recommender system library. Though it is written in C#, no deep C# knowledge is needed to use the system - it comes with useful command line tools out of the box and supports all common algorithms and evaluation setups. One of the few talks where live demos actually worked.
  • The third talk - one of the rare "slide-free" presentations - covered Plista and its relation to recommender systems. The speaker went into some detail on where they came from (a big over-arching solution narrowed down to the sharp focus of doing ad recommendations) and where they want to go (back to an over-arching solution offered as a service, with the goal of bringing interaction data from many services together in one hosted system). The most interesting news to me: They are working on an open source web-service layer for Apache Mahout that seems to be in production already. Definitely something to watch.


Overall a good crowd of over 20 people from various startups, universities and larger companies in Berlin joined the meetup - some even travelled there from Magdeburg. Pretty good to know that there are so many people knowledgeable about recommender systems in and close to Berlin - and good to see again some of those I already knew before the meetup. Looking forward to the next event - any volunteers for organising one?

Speaking at ApacheCon EU 2012

2012-09-15 12:47
I'll be at ApacheCon EU in November - looking forward to an interesting conference on all things Apache that is finally returning to Europe. Go there if you want to learn more about Tomcat, Hadoop, httpd, HBase, Camel, OpenOffice, Mahout, Lucene and more.

Now on to prepare the two talks I submitted:


  • "Choosing the right tool for your data analysis task - Apache Mahout in context"
  • "I was voted to be committer. Now what?"


Looking forward to seeing you there.

Recsys meetup Berlin

2012-07-25 01:31
Planning a meetup in Berlin: 8 people register, a table for 14 is booked, 16+ people arrive - all of that even though no pre-defined topic or talk was announced. It seems building recommender systems is a hot topic in Berlin right now.

Thanks to Zeno Gantner from MyMediaLite for organising the event - looking forward to the next edition.

Apache Mahout 0.6 released

2012-02-08 21:33
As of Monday, February 6th a new Apache Mahout version was released. The new package features

Lots of performance improvements:


  • A new LDA implementation using Collapsed Variational Bayes 0th Derivative Approximation - try it out if you have been bothered by the far-from-optimal performance of the old version.
  • Improved Decision Tree performance and added support for regression problems
  • Reduced runtime of the dot product between vectors - many algorithms in Mahout rely on it, so this performance improvement will affect anyone using them (a small example follows this list).
  • Reduced runtime of the LanczosSolver tests - modify Mahout more easily and enjoy faster development cycles thanks to faster testing.
  • Increased efficiency of parallel ALS matrix factorization
  • Performance improvements in RowSimilarityJob and TransposeJob - helpful for anyone trying to find similar items or running the Hadoop based recommender.
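
As a quick illustration of the dot product API whose runtime was reduced - a toy example against Mahout's math module, not taken from the release notes:

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class DotProductExample {
      public static void main(String[] args) {
        Vector a = new DenseVector(new double[] {1.0, 2.0, 3.0});
        Vector b = new DenseVector(new double[] {4.0, 5.0, 6.0});
        // This call sits on the hot path of many Mahout algorithms,
        // which is why speeding it up pays off across the project.
        System.out.println(a.dot(b)); // prints 32.0
      }
    }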


New features:

  • K-Trusses, Top-Down and Bottom-Up clustering, Random Walk with Restarts implementation
  • SSVD enhancements


Better integration:

  • Added MongoDB and Cassandra DataModel support (see the sketch after this list for where a DataModel plugs in).
  • Added numerous clustering display examples
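
To put the DataModel support into context, here is a minimal sketch of Mahout's Taste recommender API. The ratings file name and all parameters are made up; the point is that the FileDataModel below could be swapped for one of the new MongoDB or Cassandra backed implementations without touching the rest of the code:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class TasteExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv holds userID,itemID,rating lines; any DataModel
        // implementation can be plugged in at this point.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighbourhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighbourhood, similarity);

        // The five items most likely to be of interest to user 42.
        List<RecommendedItem> items = recommender.recommend(42L, 5);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " scored " + item.getValue());
        }
      }
    }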


Many bug fixes, refactorings, and other small improvements. More information is available in the Release Notes.

Overall, great improvements towards better performance, stability and integration. However, there are still quite a few outstanding issues and issues in need of review. Come join the project and help us improve existing patches, performance and, in particular, the integration and streamlining of how to use the different parts of the project.

Learning Machine Learning with Apache Mahout

2011-12-13 22:20
Once in a while I get questions like "Where do I start learning more about machine learning?". Besides the official sources I think there is quite good coverage in the Mahout community as well: Since the project was founded, several presentations have been given that provide an overview of Apache Mahout, introduce special features or go into more detail on particular implementations. Below is an attempt to collect the talks given so far, without any claim to link to all videos or lectures. Feel free to add your favourite in the comments section. In addition I have linked to some online courses with further material to get you started.

When looking for books, of course check out Mahout in Action. Taming Text and the data mining book that comes with Weka are also good starting points for practitioners.

Introductory, overview videos

Technical details

Further course material

See you in Vancouver at Apache Con NA 2011

2011-10-24 13:49
In mid-November Apache hosts its famous yearly conference - this time in Vancouver, Canada. They kindly accepted my presentations: one on Apache Mahout for intelligent data analysis (mostly focused on introducing the project to newcomers and showing what happened within the project in the past year - if you have any wish concerning topics you would like to see covered, please let me know), as well as a more committer-focused one on Talking people into creating patches (with the goal of highlighting some of the issues that newcomers who want to contribute to free software projects run into, and initiating a discussion on what helps to convince them to keep up the momentum and overcome any obstacles).

Looking forward to seeing you in Vancouver for Apache Con NA.

GoTo Con

2011-10-10 20:49
Location: Amsterdam

Start Date: 2011-10-12
End Date: 2011-10-14


Late Tuesday night this week I am going to leave for GoTo Con in Amsterdam. Train tickets are already booked - this is going to be my first trip with City Night Line; we'll see how good they are.

GoTo Amsterdam features a special Apache track as well as several talks on scaling up and searching, but also covers general architectural decisions. If you have not registered yet - use dros200 as promotion code to get a discount on the registration price.

Looking forward to seeing you in Amsterdam later this week.

Apache Mahout Hackathon Berlin

2011-03-21 21:39
Last year Sebastian Schelter from Berlin was added to the list of committers for Apache Mahout. With two committers in town, the idea was born to meet up some day and work on Mahout. So why not just announce that meeting publicly and invite others who might be interested in learning more about the framework? I got in touch with c-base - a hacker space in Berlin well suited to host a Hackathon - and quickly got their OK for the event.

As a result the first Apache Mahout Hackathon took place at c-base in Berlin last weekend. We had about eight attendees arriving at varying times: I guess 11 a.m. simply is way too early for your average software developer to get up on a Saturday. A few people were surprised by the venue - especially those who were attending a Hackathon for the very first time and had expected c-base to be some IT company ;)

We started the day with a brief collection of ideas that everyone wanted to work on. Some attendees needed help using Mahout - topics included:

  • How to use Apache Mahout collaborative filtering with complex models.
  • How to use Apache Mahout via a web application?
  • How to use classification (mostly focussed on using Naive Bayes from within web applications).
  • Is HBase a solution for scalable graph mining algorithms?
  • Is there a frequent itemset algorithm that respects temporal changes in patterns?


Those more into Mahout development proposed a slightly different set of topics:

  • PLSI and Map/Reduce?
  • Build customisable sampling strategies for distributed recommendations.
  • Come up with a more Java API friendly configuration scheme for Mahout clusterings.
  • Complete the distributed SVD recommender.


Quickly, teams of two to three (and more) people formed. First, several user-side questions could be addressed by mixing more experienced Mahout developers with newbie users. Apart from Mahout specifics, we also discussed more basic ways of getting involved - even by simply contributing to the online documentation, answering questions on the mailing lists or providing structured access to existing material that users generally have trouble finding.

Another topic that is all too often overlooked when asking users to contribute to a project is the process of creating, submitting, applying and reviewing patches itself: To those deeply involved with free software projects, dealing with patches and with the integration of the issue tracker and svn with the project mailing lists all seems very obvious. However, even this seemingly basic setup can look confusing and complex to regular users - something that is very common for, but not limited to, people who are just starting to work as software developers.

Thanks to Thilo Fromm for taking the group picture.

In the evening people finally started hacking on more sophisticated tasks - working on the first project patches. On Sunday only the really hard-core developers remained, leading to rather focussed work on Mahout improvements which in the end resulted in the first patches sent in from the Mahout Hackathon.