On Taming Text

2013-01-01 20:21
This time of the year I would usually post pictures of my bicycle standing in the snow somewhere in Tierpark. This year, however, I was tricked into using public transport instead: a) After my husband found a new job, we now share part of the route to work - and he isn't crazy enough to go by bike when it's snowing. b) I got myself a Nexus 7 earlier this month, which made taking paper books along on public transport obsolete. c) Early in December Grant Ingersoll asked me for feedback on the by now nearly finished "Taming Text" (currently available as MEAP at Manning). So I even had a really interesting book to read on my way home.

Up to mid-December "Taming Text" was one of those books that had always been very high on my to-read list: At least judging from the TOC it looked like the book to read if you ever wanted to write a search application. So when I got the offer to read and review the book I was really curious which topics it would cover and how deep the explanations would go.

tl;dr



Short version: If you are building search applications - that is, anything that puts a search box on a web site, be it an online store or a news article archive - this is the book to read. It covers all the gory details of how to implement features we have come to take for granted when using search: type-ahead, spelling correction, faceting, automatic tagging and more. The book motivates the value of these features from the user's perspective, explains how to implement them with proven technologies like Apache Lucene, OpenNLP and Mahout, and shows how those projects work internally to provide the functionality you need.

Longer summary



Search can be as easy as providing one box in some corner of your web site that users can type into to find relevant pages. However, when thinking about the topic just a little longer, a few more handy features that users have come to expect come to mind:

  • Type ahead to avoid superfluous typing - it also comes in handy to avoid spelling errors and to know exactly which query actually will return a decent number of documents.
  • Spelling correction is pretty much standard - and avoids user frustration with hard to spell query terms.
  • Faceting is a great way to discover and explore more content, in particular when there are a few structured attributes attached to your items (prices for books, colors for cars, etc.).
  • Named Entity Recognition is well known among publishers who use automatic tagging services to support their staff (see the short OpenNLP sketch right after this list).

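To make that last point a bit more concrete, here is a minimal sketch of named entity recognition with OpenNLP's NameFinderME. It is only an illustration, not an example from the book: the model file name and the input sentence are placeholders, and a real pipeline would first run a sentence detector and tokenizer.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NerSketch {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained name finder model (the path is a placeholder).
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            // In a real pipeline these tokens would come from a tokenizer.
            String[] tokens = {"Grant", "Ingersoll", "is", "one", "of", "the", "authors", "."};

            // Each span marks the token range and type of a detected entity.
            for (Span span : finder.find(tokens)) {
                String entity = String.join(" ",
                        Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
                System.out.println(span.getType() + ": " + entity);
            }

            // Clear the adaptive data the finder accumulates per document.
            finder.clearAdaptiveData();
        }
    }
}
```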

The authors of Taming Text decided to structure the book around the task of building an automatic Question Answering system. Throughout the book they present technologies that need to be orchestrated to build such an application but are each valuable in their own right.

In contrast to Search Patterns (which is focused mainly on the product manager perspective and contains much less technical detail) Taming Text is the book to read for any engineer working on search applications. In contrast to books like Programming Collective Intelligence, Taming Text takes you one level further by not only showing the tools to use but also explaining their inner workings so that you can adapt them exactly to your use case. To me, Taming Text is the ideal complementary book to Mahout in Action (for the machine learning part) and Lucene in Action (for the search part).

Back in 1998 it was estimated that 80% of all information is unstructured data. In order to make sense of that wealth of data we need technologies that can deal with unstructured content. Search is one of the most basic but also most powerful ways to analyse text. With a good mixture of theoretical background and hands-on examples, Taming Text guides you through the process of building a successful search application - no matter whether you are dealing with a vast product database that you want to make more accessible to your users, an ever growing news archive, or several blog posts and Twitter messages that you want to extract data from.

RecSys Stammtisch Berlin - December 2012

2012-12-30 12:40
Earlier this month I attended the fourth Recommender Stammtisch in Berlin. The event was kindly hosted by SoundCloud - who, on top of organising the speakers, provided a really yummy buffet by Kochzeichen D.

The evening started with Paul Lamere's very entertaining but also very packed talk on why music recommendation is special - or, put more generally, why all recommender systems are special:


  • Traditionally recommender systems found their way into the wild to drive sales. In music however the main goal is to help users discover new content.
  • Listeners are very different: They range from people indifferent to what is being played (imagine someone sitting in a coffee bar enjoying their espresso - it's unlikely they would want to influence the playlist of the shop's entertainment system unless they are really annoyed with its content), over casual listeners who from time to time skip a piece and more engaged people who train their own recommender through services like last.fm, to fanatics that are really into certain kinds of music. Building just one system to fit them all won't do. Relying on just one signal won't do either - instead you will have to deal with both content signals (like loudness plots) and community signals.
  • Music applications tend to be highly interactive. So even if there is little to no reliable explicit feedback, people tell you how much they like your music by skipping songs, turning pieces up louder or interacting with the content behind the song being played.
  • In contrast to many other domains music deals with a vast item space and a huge long tail of songs that almost never get interacted with.
  • In contrast to shopping recommenders, however, making mistakes in music is comparatively cheap: In most situations music isn't purchased on a song-by-song basis but through some subscription model, so the actual cost of playing the wrong song is low. Also, songs tend to be not much longer than five minutes, so users are less annoyed when confronted with a slightly wrong piece of music.
  • When implementing recommenders for a shopping site you usually avoid re-recommending items a user has already purchased. Not so in music recommendation - quite the contrary: Re-recommending known music is one indicator of playlists people will like.
  • When it comes to building playlists care must be taken to organise songs in a coherent way, mixing new and familiar songs in a pleasing order - essentially the goal should be to take the listener on a journey.
  • Fast updates are crucial: Music business itself is fast paced with new releases coming out regularly and being taken up very quickly by major broadcasting stations.
  • Music is highly contextual: It pays to know if the user is in the mood for calm or for faster music.
  • There are highly passionate users that are very easy to scare away - those tend to be the loudest ones influencing your user community most.
  • Though metadata is key in music as well, never expect it to be correct. There are all sorts of weird band and song names that you never thought would be possible - an observation that Ticketmaster also made when building their ticket search engine.
  • Music is highly social and irrational - so just knowing your users' friends and their tastes won't make your recommendations perfect.


Overall I guess the conclusion is that no matter which domain you deal with you will always need to know the exact properties of that domain to build a successful system.

In the second talk Brian McFee explained one way of modeling playlists with context. He concentrated on passive music discovery - that is, based on one query, return a list of music to listen to sequentially - as opposed to active retrieval, where users issue a query to search for a specific piece of music.

Historically it turned out to be difficult to come up with any playlist generator that is better than randomly selecting songs to play. His model is based on a random walk notion where the vertices are songs and the edges represent learnt group similarities. Groups were represented by features like familiarity, social tags, specific audio features, metadata, release dates etc. Depending on the playlist category, in most cases he was able to show that his model actually does perform better than random.
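
I did not take detailed notes of the model itself, so the following is only a toy sketch of the general random-walk-over-a-song-graph idea, not McFee's actual model: the graph, the edge weights and the way similarities would be learnt are all placeholders here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Toy weighted random walk over a song graph - illustrative only. */
public class PlaylistWalk {

    // song -> (neighbour song -> learnt similarity weight); weights are made up here
    private final Map<String, Map<String, Double>> edges = new HashMap<>();

    public void addEdge(String from, String to, double weight) {
        edges.computeIfAbsent(from, k -> new HashMap<>()).put(to, weight);
    }

    /** Walks the graph for the requested playlist length, sampling neighbours by weight. */
    public List<String> playlist(String seedSong, int length, Random random) {
        List<String> result = new ArrayList<>();
        String current = seedSong;
        result.add(current);
        while (result.size() < length) {
            Map<String, Double> neighbours = edges.getOrDefault(current, Collections.emptyMap());
            if (neighbours.isEmpty()) {
                break; // dead end - a real system would back off to some popularity prior
            }
            double total = neighbours.values().stream().mapToDouble(Double::doubleValue).sum();
            double pick = random.nextDouble() * total;
            for (Map.Entry<String, Double> e : neighbours.entrySet()) {
                pick -= e.getValue();
                if (pick <= 0) {
                    current = e.getKey();
                    break;
                }
            }
            result.add(current);
        }
        return result;
    }
}
```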

In the third talk Oscar Celma showed some techniques to also benefit from some of the more complicated signals for music recommendation. Essentially his take was that by relying on usage signals alone you will be stuck with the head of the usage distribution. What you want, though, is to be able to provide recommendations for the long tail as well.

Some signals he mentioned included content-based features (rhythm, BPM, timbre, harmony), usage signals, social signals (beware of people trying to game the system or make fun of it, though) and a mix of all of those. His recommendation was to put content signals at the end of the processing pipeline for re-ranking and refining playlists.

When providing recommendations it is essential to be able to answer why something was recommended. Even just in the space of novelty vs. relevance to the user there are four possible strategies: a) recommend only old stuff that is marginally relevant to the specific user - this will end up surfacing mostly popular songs; b) recommend what is new but not relevant to the user - this will end up surfacing songs that turn your user away; c) recommend what is relevant to the user but old - this will mostly surface stuff the user already knows, but is a safe bet to play; d) recommend what is both relevant and new to the user - here the really interesting work starts, as this deals with recommending genuinely new songs to users.

To balance discovery with safe fallbacks, look at skips, bans, likes and dislikes, and take the user's context and attention into account.

The final point the speaker made was the need to look at the whole picture: Your shiny new recommendation algorithm will only be a tiny piece of the puzzle. Much more work will need to go into data collection and ingestion, and into API design.

The last talk finally went into some detail on the history of playlist creation - from music creators' choices, via radio station mixes and mix tapes, to Spotify and fully automatic playlist creation.

There is a vast body of knowledge on how to create successful playlists, e.g. among DJs who speak about warm-up phases, chill-out times and alternating types of music in order to take the audience on a journey. Even just shuffling music the user already knows can be very powerful, given that the pool of songs the shuffle is based on is neither too large (containing too broad a mix of music) nor too small (leading to frequent repetitions). According to Ben Fields the science and art of playlist generation, and in particular its evaluation, is still pretty much in its infancy with much more to come.

Data Scientists - researchers' perspectives

2012-08-03 20:35
"Data scientist" as a term has caught quite some attention as of late (together with all the big data, scalability and cloud hype). Instead of re-hashing arguments seen in other sources I thought it might make more sense to link to a few of the thought provoking posts I came across recently.



FrOSCon 2012

2012-07-31 20:25
On August 25th/26th the Free and Open Source Conference (FrOSCon) will again kick off in Sankt Augustin, Germany.



The event is completely community-organised and hosted by the FH Sankt Augustin. It covers a broad range of free software topics like Arduino microcontrollers, git goodies, politics, strace, OpenNebula, Wireshark and others.

Three highlights that are on my schedule:



Looking forward to interesting talks and discussions at FrOSCon.

Book: Search Patterns

2012-07-28 20:41
I got the book months ago during FOSDEM - the O'Reilly book table always is a pretty dangerous meeting point for me: Search Patterns - Design for Discovery is one of those small, deceptively beautiful books that manages to explain effective search engine design by focusing on end user needs while also going into some detail concerning the basics of search engine backends.

We use search engines on a daily basis, not only for finding content on the web but also for navigating shopping sites, discovering news content and even finding articles on blogs and open source project pages. Many discovery tasks can easily be expressed as a search problem and as a result tackled with what is by now standard off-the-shelf software like Apache Lucene - or even its commercial counterparts from the enterprise search market. Still, oftentimes search is perceived as being made up of simply a small box that users type (typically one or two term) queries into and that in return shows a list of some ten links.

After setting the stage for search in the first chapter, the book goes into some more detail in "The anatomy of search". In a very approachable way it explains all the components, from user constraints and the graphical interface to the basics of retrieval and evaluating search performance in terms of precision and recall. The third chapter shows some behavioural patterns that make discovery easier for users - from incrementally constructing the answer and progressively disclosing more and more detail up to being predictable.
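
For readers who have not come across the two evaluation measures before: precision is the fraction of returned results that are relevant, recall is the fraction of all relevant documents that were actually returned. A quick sketch of my own (the document IDs are made up):

```java
import java.util.Set;

public class PrecisionRecall {

    /** Fraction of retrieved documents that are relevant. */
    static double precision(Set<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return retrieved.isEmpty() ? 0.0 : (double) hits / retrieved.size();
    }

    /** Fraction of relevant documents that were actually retrieved. */
    static double recall(Set<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = Set.of("doc1", "doc2", "doc3", "doc4");
        Set<String> relevant = Set.of("doc2", "doc4", "doc7");
        // precision = 2/4 = 0.50, recall = 2/3 = 0.67
        System.out.printf("precision=%.2f recall=%.2f%n",
                precision(retrieved, relevant), recall(retrieved, relevant));
    }
}
```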

Finally the design patterns as identified by the authors are introduced. Pretty obvious to those working in the field but well explained to those not intimately familiar with the topic:


  • Though perceived by users as a mere convenience to type less, autocomplete can actually help guide the user's search in case of ambiguities and can help avoid imprecise results.
  • Expected as it might be by users, presenting the best result first actually goes a long way when building credibility for a search engine. Having more precise queries to guide the search (e.g. as a result of autocomplete) helps here. So does having strong ranking criteria to build up a compelling ranking function that is used by default (even though others might be offered as an alternative for users to explore more and different results).
  • Federated search has both advantages (integrating otherwise isolated silos of knowledge) and disadvantages (its speed being dominated by the slowest connected search engine).
  • Faceted navigation is pretty much standard for any major search engine - giving the user the option to start with a broad query that returns an overwhelming number of results, but guiding the user when refining the query, is one major way of driving searches (see the short Solr sketch after this list).
  • Offering personalisation tends to be a beloved feature, though it is particularly hard to implement and needs a good deal of user data to work well. Usually there are features that require less work and are more promising to start with.
  • Pagination is just as much a standard that users expect - though its implementation can differ: We are used to clicking a next button, but this actually may not make much sense and just interrupt the user's flow. Much more appealing - but sometimes also confusing - can be interfaces that simply extend the result page when scrolling to its end.
  • Structured results provide a way to give the user more than just an outlink - triggered by specific searches it may be possible to directly answer the user's question instead of linking to content that answers it.
  • Actionable results are a way for the user to get active - either by voting on results, bookmarking them or sharing them with others.
  • Unified discovery is about accepting that search always plays a role in a bigger context and has to play well with the discovery mode the user is in: When searching for "apple" while browsing the category "electronics" it's rather unlikely that I am looking for the fruit. Similarly search should take context into account and support me seamlessly when switching from discovery to directed search and back to discovery mode.

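Faceted navigation in particular is cheap to try out these days. As a hedged example of my own, assuming a Solr backend with a `color` field (SolrJ details differ slightly between versions, so treat this as a sketch rather than a reference):

```java
import org.apache.solr.client.solrj.SolrQuery;

public class FacetQuerySketch {
    public static void main(String[] args) {
        // Roughly equivalent to: /select?q=*:*&facet=true&facet.field=color&facet.mincount=1
        SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        query.addFacetField("color");
        query.setFacetMinCount(1);

        // Execute this against your SolrClient/SolrServer of choice and read the
        // facet counts from the response to render the navigation sidebar.
        System.out.println(query);
    }
}
```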

The book concludes by going into some detail on example search engines and presenting some features that are not yet commonplace but might change the world by employing search in new and creative ways.

Easy to read, well written, with several nice examples that make the technical points simpler to understand. Definitely a good read for domain experts planning to build a search engine, designers trying to understand the basics of building effective search engines, and engineers struggling for words to explain why a seemingly little box can cause a whole lot of pain when done wrong but a whole lot of joy when done right.

Need your input: Failing big data projects - experiences from the wild

2012-07-18 20:11
A few weeks ago my talk on "How to fail your big data project quick and rapidly" was accepted at the O'Reilly Strata conference in London. The basic intention of this talk is to share some anti-patterns, embarrassing failure modes and "please don't do this at home" kind of advice with those entering the buzzwordy space of big data.

Inspired by Thomas Sundberg's presentation on "failing software projects", the talk will be split into five chapters and highlight the top two failure factors for each.

I only have so much knowledge of what can go wrong when dealing with big data. In addition, no one likes talking about what did not work in their environment. So I'd like to invite you to share your war stories in a public etherpad - either anonymously or including your name so I can give credit. Some ideas are already sketched out - feel free to extend, adjust, re-rank or change them.

Looking forward to your stories.

GeeCon - TDD and its influence on software design

2012-05-22 08:04
The second talk I went to on the first day was on the influence of TDD on software design. Keith Braithwaite did a really great job of first introducing the concept of cyclomatic complexity and then showing, using Hudson as well as many other open source Java projects as examples, that the average cyclomatic complexity in all those projects is actually pretty close to one and, when plotted over all methods, pretty much follows a power law distribution. Comparing the shapes of those distributions across projects, he found that the less steep the curve is - that is, the more balanced the distribution and the fewer really complex pieces there are in the code - the happier developers are with the current state of the code. Not only that, the distribution would also be transformed into something more balanced after refactorings.

Now, looking at a selection of open source projects, he analyzed what the alpha of the distribution of cyclomatic complexities is for projects that have no tests at all, projects that have tests, and projects that were developed according to TDD. It turns out that the latter were the ones with the most balanced alpha.
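
I don't know which estimator Keith used, but one common way to get the exponent alpha of a power-law-ish sample is the maximum likelihood estimate described by Clauset, Shalizi and Newman. A rough sketch of the continuous approximation, with made-up per-method complexity values:

```java
public class PowerLawAlpha {

    /**
     * Continuous MLE approximation: alpha = 1 + n / sum(ln(x_i / xMin)).
     * Only a rough fit for discrete data such as cyclomatic complexities;
     * assumes at least one value is strictly larger than xMin.
     */
    static double estimateAlpha(int[] values, double xMin) {
        double logSum = 0.0;
        int n = 0;
        for (int x : values) {
            if (x >= xMin) {
                logSum += Math.log(x / xMin);
                n++;
            }
        }
        return 1.0 + n / logSum;
    }

    public static void main(String[] args) {
        // Made-up per-method complexities: mostly 1, a handful of complex methods.
        int[] complexities = {1, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 2, 1, 7, 1, 1, 2, 1, 1, 15};
        System.out.printf("estimated alpha = %.2f%n", estimateAlpha(complexities, 1.0));
    }
}
```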

GeeCon - Randomized testing

2012-05-21 08:02
I arrived late, during lunch time on Thursday, for GeeCon – however just in time to listen to one of the most interesting talks when it comes to testing. Did you ever have the issue of writing code that runs well in your development environment but crashes as soon as it's rolled out at customers, only to find out that their Locale setting was causing the issue? Ever had to deal with random test failures because, against better advice, your tests depended on execution order - an order that is almost guaranteed to change with new JVM releases?

The Lucene community has encountered many similar issues. In effect they are faced with having to test a huge number of different configuration combinations in order to make sure that their software runs in all client setups. In recent months they developed an approach called randomised testing to tackle this problem: Essentially "random tests" are run multiple times on each run, each time with a slightly different configuration and input, in a different environment (e.g. Locale settings, time zones, JVMs, operating systems). Each of these configurations is pseudo-random – however, on test failure the framework will reveal the seed that was used to initialise the pseudo-random number generator and thus allow you to reproduce the failure deterministically.

The idea itself is not new: Published in a paper by Ntafos and used in fuzzers to identify security holes in applications, this kind of technique is pretty well known. However, applying it to writing tests is a newer idea, used at Lucene.

The advantage is clear: With every new run of the test suite you gain confidence that your code is actually stable against any kind of user input. The downside, of course, is that you will discover all sorts of different issues and bugs, not only in your code but also in the JVM itself. If your library is being used in all sorts of different setups, fixing these issues upfront is crucial to avoid users being surprised that it does not work well in their setup. Make sure to fix these failures quickly though – developers tend to ignore flickering tests over time. Adding randomness – and thereby essentially increasing the number of tests in your test suite – will also increase the amount of effort you have to invest in fixing broken code.

Dawid Weiss gave a great overview of how random tests can be used to harden a code base. He introduced the test framework written at Carrot Search that isolates the random testing features: It comes with a RandomizedRunner implementation that can be used to substitute JUnit's own runner. It's capable of enforcing test isolation by tracking spawned threads that might be leaking out of tests. In addition it provides utilities, for instance for creating random strings, locales and numbers, as well as annotations to denote how often a test should run and when it should run (always vs. nightly).
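
A minimal sketch of what such a test can look like with the Carrot Search randomizedtesting framework. The exact helper method names and annotations vary between releases, so treat this as an illustration rather than a reference:

```java
import static org.junit.Assert.assertTrue;

import java.util.Arrays;

import org.junit.Test;
import org.junit.runner.RunWith;

import com.carrotsearch.randomizedtesting.RandomizedRunner;
import com.carrotsearch.randomizedtesting.RandomizedTest;
import com.carrotsearch.randomizedtesting.annotations.Repeat;

@RunWith(RandomizedRunner.class)
public class RandomizedSortTest extends RandomizedTest {

    // Run the same test several times, each time with a different (but reproducible) seed.
    @Test
    @Repeat(iterations = 20)
    public void sortedOutputIsSorted() {
        int[] input = new int[randomIntBetween(0, 1000)];
        for (int i = 0; i < input.length; i++) {
            input[i] = getRandom().nextInt();
        }

        int[] sorted = input.clone();
        Arrays.sort(sorted); // stand-in for the code under test

        // On failure the runner reports the master seed, so the exact
        // failing input can be reproduced deterministically.
        for (int i = 1; i < sorted.length; i++) {
            assertTrue(sorted[i - 1] <= sorted[i]);
        }
    }
}
```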

So when you have tests with random input – how do you check for correctness? The most obvious approach is to check the exact output where that is possible: When testing a sorting method, no matter what the implementation and the input are, the output should always be sorted, which is easy enough to check. Checking against simpler, but in practice maybe more expensive, algorithms is another option.
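
As a small illustration of the "compare against a simpler reference" idea, here is a sketch of my own that needs no test framework at all; the "fast" implementation is just a stand-in:

```java
import java.util.Arrays;
import java.util.Random;

public class ReferenceCheck {

    public static void main(String[] args) {
        long seed = System.nanoTime();
        Random random = new Random(seed);

        int[] input = new int[random.nextInt(1000)];
        for (int i = 0; i < input.length; i++) {
            input[i] = random.nextInt();
        }

        // "Fast" implementation under test - a stand-in here.
        int[] actual = input.clone();
        Arrays.sort(actual);

        // Simple, obviously correct (but slow) reference: selection sort.
        int[] expected = selectionSort(input.clone());

        if (!Arrays.equals(actual, expected)) {
            // Print the seed so the failing input can be reproduced deterministically.
            throw new AssertionError("mismatch, reproduce with seed " + seed);
        }
    }

    static int[] selectionSort(int[] values) {
        for (int i = 0; i < values.length; i++) {
            int min = i;
            for (int j = i + 1; j < values.length; j++) {
                if (values[j] < values[min]) {
                    min = j;
                }
            }
            int tmp = values[i];
            values[i] = values[min];
            values[min] = tmp;
        }
        return values;
    }
}
```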

A second approach is to do sanity checks: Math.abs() should at least return non-negative values. The third approach is to do no checking at all in some cases. Why would that help? You'd be surprised by how many failures and exceptions you get just by using your API in unexpected ways or giving your program unexpected input. This kind of behaviour checking does not need any assertions.

Note: Really loved the APad/ iMiga that Dawid used to give his talk! Been such a long time since I last played with my own Amiga...

Learning Machine Learning with Apache Mahout

2011-12-13 22:20
Once in a while I get questions like "Where do I start learning more about machine learning?". Other than the official sources I think there is quite good coverage in the Mahout community as well: Since the project was founded, several presentations have been given that give an overview of Apache Mahout, introduce special features or go into more detail on particular implementations. Below is an attempt to create a collection of talks given so far, without any claim to contain links to all videos or lectures. Feel free to add your favourite in the comments section. In addition I have linked to some online courses with further material to get you started.

When looking for books, of course check out Mahout in Action. Taming Text and the data mining book that comes with Weka are good starting points for practitioners as well.

Introductory, overview videos





Technical details




Further course material



Machine learning problem settings

2011-08-06 07:10
Together with Sebastian Schelter I held a Nokia sponsored (Thank you!) lecture on large scale data analysis and data mining during the past few months.

After supervising a few successful university projects based on Apache Mahout, the goal of this lecture was to introduce students to some of the basic concepts and problems encountered today in a world where huge datasets are widely available and easy to process with Apache Hadoop. As such the course is targeted at an entry-level audience - a thorough treatment of the mathematical background of the latest machine learning technology is left to the machine learning research groups in Potsdam, at TU Berlin and the neural information processing group at TU.

Slides and exercises are available online via git. Please let me know if you want to re-use them in your lecture.



The very first problem that users of machine learning algorithms usually come across is mapping their application problem to one of the various machine learning problem settings. In 2010 Michael Brückner gave a lecture on Intelligent Data Analysis with Matlab (slides and videos in German) that included a simple taxonomy of algorithms:

According to


  • the types of input data an algorithm can handle (independent instances, also called examples; sequences of instances; or graphs of instances)
  • the type of training data available (e.g. instances with assigned nominal target attribute, no labels at all, a partial sorting of sets of instances)
  • and the learning goal
algorithms can be nicely partitioned by the learning problem they solve. Once that very first step of identifying exactly what the problem setting is has been taken, deciding which algorithm to use becomes much easier. Based on that taxonomy I came up with the above graphic, giving a first overview of which tasks can be solved with machine learning:

Boxes in dark blue are what in general is called supervised learning, yellow boxes unsupervised and light blue semi-supervised learning - based on the amount of labeled training data available. Red boxes indicate settings with the goal of knowledge discovery. Green boxes are ranking problems.
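
A toy sketch of how the first questions from that taxonomy already narrow things down - the category names below are mine, not from the lecture:

```java
public class ProblemSetting {

    enum Labels { FULLY_LABELED, PARTIALLY_LABELED, UNLABELED }

    /** Maps the available training data to a coarse family of learning problems. */
    static String settingFor(Labels labels, boolean targetIsNominal) {
        switch (labels) {
            case FULLY_LABELED:
                return targetIsNominal ? "classification (supervised)"
                                       : "regression (supervised)";
            case PARTIALLY_LABELED:
                return "semi-supervised learning";
            default:
                return "clustering / knowledge discovery (unsupervised)";
        }
    }

    public static void main(String[] args) {
        System.out.println(settingFor(Labels.FULLY_LABELED, true)); // classification
        System.out.println(settingFor(Labels.UNLABELED, false));    // clustering
    }
}
```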