ApacheCon EU - part 09

2012-11-18 20:54 More posts about tfidf Apache Con ApacheCon Lucene ap apacheconeu Solr
In the Solr track Elastic Search and Solr Cloud went into competition. The comparison itself was slightly apples-and-oranges like as the speaker compared the current ES version based on Lucene 3.x and Solr Cloud based on Lucene 4.0. During the comparison it still turned out that both solutions are more or less comparable - so choice again depends on your application. However I did like the conclusion: The speaker did not pick a clear winner in terms of projects. However he did have another clear winner: The user community will benefit from there being two projects as this kind of felt competition did speed up development already considerably.

The day finished with hoss' Stump the Chump session: The audience was asked to submit questions before the session, the jury was than asked to pick the winning question that stumped Hoss the most.

Some interesting bits from that question: One guy had the problem of having to provide somewhat diverse results in terms e.g. manufacturers in his online shop. There are a few tricks to deal with this problem: a) clean your data - don't have items that use keyword spamming side by side with regular entries. Assuming this is done you could b) use grouping to collapse items from the same manufacturer and let the user drill deeper from there. Also using c) a secondary sorting value can help - one hint: Solr ships with a random value out of the box for such cases.

For me the last day started with hossman's session on boosting and scoring tricks with Solr - including a cute reference for explaining TF-IDF ranking to people (see also a message tweeted earlier for an explanation of what a picture taken during my wedding has to do with ranking documents):


Share photos on twitter with Twitpic


Though TF-IDF is standard IR scoring it probably is not enough for your application. There's a lot of domain knowledge that you can encode in your ranking:

  • novelty factors - number of ratings/ standard deviation of ratings - ranks controversial items on top that might be more interesting than just having the stuff that everyone loves
  • scarcity - people like buying what is nearly sold out
  • profit margin
  • create your score manually by an external factor - e.g. popularity by association like categories that are more popular than others or items that are more popular depending on the time of day or year


There are a few sledge hammers people usually think of that can turn against you really badly: Say you rank by novelty only - that is you sort by date. The counter example given was the case of the AOL-Time Warner merger - being a big story news papers would post essays on it, do evaluations etc. However also articles only remotely related to it would mention the case. So be the end of the week when searching for it you would find all those little only remotely relevant articles and have to search through all of them to find the really big and important essay.

There are cases where it seems like only recency is all that matters: Filter only for the most recent items and re-try only in case of no results. The counter example is the case where products are released just on a yearly basis but you set the filter to say a month. This way up until May 31 your users will run into the retry branch and get a whole lot of results. However when a new product comes out on June first from that day onward the old results won't be reachable anymore - leading to a very weird experience for those of your users who saw yesterday's results.

There were times when scoring was influenced by keyword stuffing to emulate higher scores - don't do that anymore, Solr does support sophisticated per field and document boosting that make such hacks superfluous.

Instead rather use edismax for weighting fields. Some hints on that one: configure omitNorms in order to avoid having keyword stuffing influence your ranking. Configure omitTermFrequencyAndPosition if the term frequency in any document does not really tell you much e.g. in case of small documents only.

With current versions of Solr you can use your custom scoring per field. In addition a few ones are shipped that come with options for tweaking - like for instance the sweetSpotSimilarity wher you can tell the scorer that up to a certain length no length penalisation should happen.

Create your own boost functions that in addition to TF-IDF rely on rating values, click rates, prices or category influences. There's even an external file field option to allow you to load your scoring value per document or category from an external file that can be updated on a much more frequent basis than you would otherwise want to re-index all documents in your solr. For those suffering from the "For business reasons this document must come first no matter what the algorithm says" syndrom - there's a query elevation component for very fine grained tuning of rankings per query. Keep in mind so that this can easily turn into a maintanance nightmare. However it can be handy when quickly fixing a highly valuable business based use case: With that component it is possible to explicitly exclude documents from matching and precisely setting where to rank individual documents.

When it comes to user analytics and personalisation many people think of highly sophisticated algorithms that need lots of data to be trained. Yes Mahout can help you with personalisation and recommendation - but there are a few low hanging fruits to grab before:

  • Use the history of registered users or those you can identify through cookies - track the keywords they are looking for, the sort and filter functions commonly used.
  • Bucket people by explicit or implicit demographics.
  • Even just grouping people by the os and browser they use can help to identify preferences.


All of this information is relatively cheap to get by and can be used in many creative ways:

  • Provide default sort and filter functions for returning users.
  • Filter on the current query but when scoring take the older query into account.
  • Based on category facet used before do boosting in the next search assuming that the two queries are related.


Essentially the goal is to identify three factors for your users: What is their preference, what is the differentiator and what is your confidence in your estimation.

Another option could be to use the SweetSpotPlateau: If someone clicked on a price range facet on the next related query do not hide other prices but boost those that are in the previous facet.

One side effect to keep in mind: Your cache hit rate will go down now you are tailoring your results to individual users or user groups.

Biggest news for Lucene and Solr was the release of Lucene 4 - find more details online in an article published recently.