ApacheCon EU - part 09

2012-11-18 20:54
In the Solr track Elastic Search and Solr Cloud went into competition. The comparison itself was slightly apples-and-oranges, as the speaker compared the then-current ES version based on Lucene 3.x with Solr Cloud based on Lucene 4.0. Still it turned out that both solutions are more or less comparable - so the choice again depends on your application. However I did like the conclusion: the speaker did not pick a clear winner in terms of projects. He did name another clear winner though: the user community will benefit from there being two projects, as this felt competition has already sped up development considerably.

The day finished with Hoss' Stump the Chump session: the audience was asked to submit questions before the session, and the jury was then asked to pick the winning question that stumped Hoss the most.

Some interesting bits from that session: one guy had the problem of having to provide somewhat diverse results in terms of e.g. manufacturers in his online shop. There are a few tricks to deal with this problem: a) clean your data - don't have items that use keyword spamming side by side with regular entries. Assuming this is done you could b) use grouping to collapse items from the same manufacturer and let the user drill deeper from there. Also c) a secondary sort value can help - one hint: Solr ships with a random sort field out of the box for such cases.
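Tricks b) and c) translate directly into Solr request parameters - a sketch against a hypothetical index, where the `manufacturer` field and the `random_*` dynamic field name are assumptions on my part:

```
q=laptop
&group=true
&group.field=manufacturer            # b) collapse items from the same manufacturer
&sort=score desc,random_1234 desc    # c) RandomSortField as a secondary tie-breaker
```

The `random_1234` part assumes a `random_*` dynamic field of type `solr.RandomSortField` declared in the schema; varying the number in the field name yields a different but stable ordering.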

For me the last day started with hossman's session on boosting and scoring tricks with Solr - including a cute reference for explaining TF-IDF ranking to people (see also a message tweeted earlier for an explanation of what a picture taken during my wedding has to do with ranking documents):

(photo embedded via Twitpic)

Though TF-IDF is standard IR scoring it probably is not enough for your application. There's a lot of domain knowledge that you can encode in your ranking:
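As a quick reminder of what TF-IDF actually computes, a minimal sketch in Python (the toy corpus is of course made up):

```python
import math

def tf_idf(term, doc, corpus):
    """Classic TF-IDF: raw term frequency, weighted by how rare the term is."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)   # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["solr", "lucene", "search"],
    ["lucene", "scoring", "talk"],
    ["wedding", "photo"],
]
# "lucene" appears in two of three documents, "wedding" in only one -
# so at equal term frequency the rarer term scores higher:
print(tf_idf("lucene", corpus[0], corpus))   # ~0.405
print(tf_idf("wedding", corpus[2], corpus))  # ~1.099
```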

  • novelty factors - number of ratings / standard deviation of ratings - ranks controversial items on top, which might be more interesting than just the stuff that everyone loves
  • scarcity - people like buying what is nearly sold out
  • profit margin
  • create your score manually from an external factor - e.g. popularity by association, like categories that are more popular than others, or items that are more popular depending on the time of day or year
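Several of these signals can be wired into an edismax request as function-query boosts - a sketch, where all field names (`num_ratings`, `profit_margin`) are assumptions of mine:

```
q=camera&defType=edismax
&qf=title^2 description                            # relative field weights
&bf=log(sum(num_ratings,1))                        # additive boost: rating-count signal
&boost=if(exists(profit_margin),profit_margin,1)   # multiplicative boost
```

Note the difference: `bf` adds to the TF-IDF score, while `boost` multiplies it - multiplicative boosts tend to be easier to reason about because they scale with relevance instead of drowning it out.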

There are a few sledgehammers people usually think of that can turn against you really badly: say you rank by novelty only - that is, you sort by date. The counter example given was the AOL-Time Warner merger - being a big story, newspapers would post essays on it, do evaluations etc. However articles only remotely related to it would also mention the case. So by the end of the week, when searching for it you would find all those little, only remotely relevant articles and have to search through all of them to find the really big and important essay.

There are cases where it seems like recency is all that matters: filter for the most recent items only and re-try only in case of no results. The counter example is the case where products are released just on a yearly basis but you set the filter to, say, a month. Up until May 31 your users will run into the retry branch and get a whole lot of results. However when a new product comes out on June 1st, from that day onward the old results won't be reachable anymore - leading to a very weird experience for those of your users who saw yesterday's results.
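A gentler alternative to the hard date filter is a decaying recency boost, so old documents fade rather than disappear - a sketch using Solr's recip function query (the `release_date` field name is an assumption):

```
# 3.16e-11 is roughly 1 / (milliseconds per year),
# so a document's boost halves about one year after release
&boost=recip(ms(NOW,release_date),3.16e-11,1,1)
```

Since recip(x,m,a,b) = a/(m*x+b), a brand-new document gets a boost of 1 and a year-old one a boost of about 0.5 - old products stay findable, just further down.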

There were times when scoring was influenced by keyword stuffing to emulate higher scores - don't do that anymore: Solr supports sophisticated per-field and per-document boosting that makes such hacks superfluous.

Instead rather use edismax for weighting fields. Some hints on that one: configure omitNorms in order to avoid having keyword stuffing influence your ranking. Configure omitTermFreqAndPositions if the term frequency in a document does not really tell you much, e.g. when you only have very short documents.
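In schema.xml that boils down to flags on the field definitions - a sketch with made-up field names:

```xml
<!-- no length norms: document length and keyword stuffing stop influencing scores -->
<field name="title" type="text_general" indexed="true" stored="true"
       omitNorms="true"/>
<!-- no term frequencies or positions: a term either matches or it doesn't -->
<field name="tag" type="string" indexed="true" stored="true"
       omitTermFreqAndPositions="true"/>
```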

With current versions of Solr you can use custom scoring per field. In addition a few implementations are shipped that come with options for tweaking - like for instance the SweetSpotSimilarity, where you can tell the scorer that up to a certain length no length penalisation should happen.
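Configured per field type in schema.xml, that might look as follows - the length-norm values here are purely illustrative:

```xml
<fieldType name="text_sweet" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <similarity class="solr.SweetSpotSimilarityFactory">
    <!-- documents between 3 and 50 terms get no length penalty at all -->
    <int name="lengthNormMin">3</int>
    <int name="lengthNormMax">50</int>
    <float name="lengthNormSteepness">0.5</float>
  </similarity>
</fieldType>
```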

Create your own boost functions that in addition to TF-IDF rely on rating values, click rates, prices or category influences. There's even an external file field option that allows you to load your scoring value per document or category from an external file - one that can be updated much more frequently than you would otherwise want to re-index all documents in your Solr. For those suffering from the "For business reasons this document must come first no matter what the algorithm says" syndrome - there's a query elevation component for very fine-grained tuning of rankings per query. Keep in mind though that this can easily turn into a maintenance nightmare. However it can be handy when quickly fixing a highly valuable business use case: with that component it is possible to explicitly exclude documents from matching and to set precisely where individual documents rank.
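An ExternalFileField sketch - the popularity field name and file contents are made up for illustration:

```xml
<!-- schema.xml: values come from a file named external_popularity in the index
     directory, with lines like "doc42=3.5"; reloadable without re-indexing -->
<fieldType name="ext_popularity" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="popularity" type="ext_popularity" indexed="false" stored="false"/>
```

The values can then be pulled into the score via a function query, e.g. a boost on `field(popularity)`.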

When it comes to user analytics and personalisation many people think of highly sophisticated algorithms that need lots of data to be trained. Yes, Mahout can help you with personalisation and recommendation - but there are a few low-hanging fruit to grab first:

  • Use the history of registered users or those you can identify through cookies - track the keywords they are looking for, the sort and filter functions commonly used.
  • Bucket people by explicit or implicit demographics.
  • Even just grouping people by the OS and browser they use can help to identify preferences.

All of this information is relatively cheap to come by and can be used in many creative ways:

  • Provide default sort and filter functions for returning users.
  • Filter on the current query but when scoring take the older query into account.
  • Based on the category facet used before, do boosting in the next search, assuming that the two queries are related.

Essentially the goal is to identify three factors for your users: What is their preference, what is the differentiator and what is your confidence in your estimation.

Another option could be to use the SweetSpotPlateau: if someone clicked on a price range facet, then on the next related query do not hide other prices, but boost those that fall into the previously chosen range.

One side effect to keep in mind: your cache hit rate will go down now that you are tailoring your results to individual users or user groups.

Biggest news for Lucene and Solr was the release of Lucene 4 - find more details online in an article published recently.

ApacheCon EU - part 07

2012-11-16 20:51
Julien Nioche shared some details on the nutch crawler. Being the mother of all Hadoop projects (as in: Hadoop was born out of developments inside of nutch) the project had become rather quiet, but has seen a steady stream of development in the recent past. Julien himself uses nutch for gathering crawled data for several customer projects - feeding this data into an NLP pipeline based on Behemoth that glues Mahout, UIMA and Gate together.

The basic crawling steps - building the web graph, computing a link-based ranking, indexing - are still the same as when I last looked at the project; just that for indexing the project now uses Solr instead of its own Lucene-based solution.

The main advantage of nutch is its pluggability: the protocol parser, html filter, url filter and url normaliser can all be exchanged for your own implementations.

In their 2.0 version they moved away from using their own plain HDFS storage to a table schema - mapped to the actual database through Gora, an abstraction layer to connect to e.g. Cassandra or HBase. The schema itself is based on Avro but can be adapted to your needs. The advantages are obvious: though still distributed, this approach is much easier and simpler in terms of logic kept in nutch itself. Also it is easier for third parties to connect to the data - all you need is the schema as well as Gora. The current disadvantage lies in its configuration overhead and instability compared to the old solution. Most likely at least the latter will go away as version 2.0 stabilises.

In terms of future work the project focuses on stabilisation and on synchronising the features of versions 1.x and 2.x (link ranking is only available in 1.x while support for elastic search is only available in 2.x). In terms of functionality the goal is to move to Solr Cloud, support sitemaps (as implemented by crawler-commons) and more (pluggable?) indexers.

The goal is to delegate implementations to upstream projects - as was already done for Tika and Solr. Most likely this will also happen for the fetcher, protocol handling, robots.txt handling, url normalisation and filtering, graph processing code and others.

The next talk in the Solr/Lucene track dealt with scaling Solr to big data. The goal of the speaker was to index 100 million documents - with the number of documents expected to grow in the future. Being under heavy time pressure and having a bash wizard on the project, they started building lots of their glue code and components as bash scripts: there were scripts for starting/stopping services, remote indexing, performance monitoring, content extraction, ingestion and deployment. Short term this was a very good idea - it allowed for fast iterations and learning. In the long run they slowly replaced their custom software with standard components (Tika for content extraction, Puppet for deployment etc.).

They quickly learnt to value property files in order to easily reconfigure the system even in production (relying heavily on bash, XML was of course not an option). One problem this came in handy for was adjusting the sharding configuration - going from simple random sharding via old-vs-new to monthly shards, they could optimise the configuration for their load. What worked well for them was to separate JVM startup from Solr core startup - they would start with empty Solrs and only point them to the data directories after verifying that the JVMs booted up correctly.

In terms of performance they learnt to go wide quickly - instead of spending hours on optimising their one huge box they ended up distributing their solrs to multiple separate machines.

In terms of ingestion pipelines: theirs is currently based on an indexing directory convention, moving the index directory as soon as all data is ingested. The advantage here is the atomicity of mv - the disadvantage is having to move around lots of data. Their goal is to go for HDFS for indexing soon and ZooKeeper for configuration handling.
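The mv trick is worth spelling out: a consumer never sees a half-written directory, because the rename flips into place in a single step. A sketch with made-up directory names:

```shell
#!/bin/sh
set -e
BASE=$(mktemp -d)

# ingest into a staging directory the consumers never look at
mkdir "$BASE/incoming"
echo "doc1" > "$BASE/incoming/part-00000"

# once ingestion is complete, one atomic rename publishes everything -
# atomic only while source and target are on the same filesystem
mv "$BASE/incoming" "$BASE/ready"
ls "$BASE/ready"
```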

In terms of testing: in contrast to having a dev-test-production environment, their goal is to have an elastic compute cloud that can be adjusted accordingly. Though EC2 itself is very cost intensive and poses additional problems with firewalling and moving data, cloud computing could still be a solution for them - in particular given projects like CloudStack or Open Cloud. The goal would be to do cycle scavenging in their own datacenter: run heavy computations when there is not a lot of load on the system and turn those analyses off in case of incoming traffic.

When it comes to testing and monitoring changes they made good experiences with JConsole (connecting it to several Solrs at once through a simple IP discovery script) and with SolrMeter as a performance debugging tool.

Some implementation details: they used Solr as some sort of NoSQL cache as well (for thousands of queries/s), push the schema definition to Solr from the app, and have common fields plus the option for developers to add custom fields in the app. Their experience is to not do expensive stuff in Solr but to move that outside - this applies in particular to content extraction. For storage they used an Avro-based binary file format (mainly in order to save storage space and have a versioned schema with fast compression and decompression). They are starting to use Tika as their pipeline and for auto content detection, scaling up with Behemoth. They learnt to upgrade indexes without re-indexing using the Lucene upgrade tooling. In addition they use Grimreaper to kill servers if anything goes wrong and restart them later.

GeeCon - Solr at Allegro

2012-05-25 08:07
One talk particularly interesting to me was on Allegro's (the Polish eBay) Solr usage. In terms of numbers: they have 20 million offers in Poland and another 10 million active offers in partnering countries. In addition, their index holds 50 million inactive offers in Poland and 40 million closed offers outside that country. They serve 8 million updates a day, that is roughly 100 updates a second. Those are related to the start/end of bidding phases, buy-now actions, cancelled bids and the bids themselves.

They get 105 million requests per day; at peak time in the evening that is 3.5k requests per second. Of those, 75% are answered in less than 5 ms and 90% in less than 20 ms.

To achieve that performance they are using Solr. Coming from a database-based system and going via a proprietary search product, they are now happy users of Solr - with much better support both from the community and from contractors than with their previous paid-for solution.

The speakers went into some detail on how they solved particular technical issues: they decided to go for an external data feeder to avoid putting the database itself under too much load even when just indexing the updates. On updates they need to reconstruct the whole document, as an update for Solr right now means deleting the old document and indexing the new one. In addition commits are pretty expensive, so they ended up delaying commits for as long as the SLA would allow (one minute) and committing in batches.
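The delayed-commit idea can be sketched in a few lines of Python - this is my own illustration, not Allegro's code; the commit callback stands in for an actual Solr commit request:

```python
import time

class BatchingIndexer:
    """Buffer incoming updates; commit at most once per SLA window."""

    def __init__(self, commit, max_delay=60.0, clock=time.monotonic):
        self.commit = commit          # callable receiving the batched docs
        self.max_delay = max_delay    # the one-minute SLA from the talk
        self.clock = clock
        self.pending = []
        self.last_commit = clock()

    def add(self, doc):
        self.pending.append(doc)
        if self.clock() - self.last_commit >= self.max_delay:
            self.flush()

    def flush(self):
        if self.pending:
            self.commit(list(self.pending))
            self.pending.clear()
        self.last_commit = self.clock()

# simulated clock ticks: init=0s, first add=0s, second add=61s, flush=61s
committed = []
ticks = iter([0.0, 0.0, 61.0, 61.0]).__next__
indexer = BatchingIndexer(committed.append, max_delay=60.0, clock=ticks)
indexer.add("offer-1")   # buffered, SLA window still open
indexer.add("offer-2")   # 61s elapsed -> both docs go out in one commit
print(committed)         # [['offer-1', 'offer-2']]
```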

They tried to shard indexes by category - that did not work particularly well, as with their user behaviour it resulted in too many cross-shard requests. Index size was an issue for them, so they reduced the amount of data indexed and stored in Solr to the absolute minimum - all else was out-sourced to a key-value store (in their case MongoDB).

When it comes to caching, that proved to be the component that needed most tweaks: they put a Varnish in front (Solr speaks XML over HTTP, which is simple enough to find caches for) - in combination with the commit delay they had in place they could tune eviction times. The result were cache hit rates of about 30 to 40 percent. When it comes to internal caches: high eviction and low hit rates are a problem. Watch the Solr admin console for statistics. Are there too many unique objects in your index? Are caches too small? Are there too many unique queries? They ended up binding users to Solr backends by making the routing sticky via the user's cookie - as users tend to drill down on the same dataset over and over again, in their case that raised hit rates substantially. When tuning filter queries: have them as independent as possible - don't use many unique combinations of the same filters over and over again. Instead filter individually to make better use of the filter cache.
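The filter query advice in concrete terms: each fq parameter is cached as its own filterCache entry, so independent filters compose and get reused, while every unique combination is a fresh cache miss. A sketch with made-up field names:

```
# one cache entry for the whole combination - reused only for this exact pair:
&fq=category:laptops AND price:[100 TO 200]

# two independent cache entries - each reusable across other queries:
&fq=category:laptops
&fq=price:[100 TO 200]
```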

For them, Solr proved to be a stable, efficient and flexible system that is easy to monitor, maintain and change, and that ran without failure for the first 8 months - with the whole architecture designed and prototyped (at near production quality) by one developer in six months.

Currently the system is running on 10 solr slaves (+ power backup) compared to 25 nodes before. A full index takes 4 hours, bottlenecked at the feeder, potentially that could be pushed down to one hour. Updates of course flow in continuously.

Apache Con – last day

2010-11-27 23:23
Day three of Apache Con started with interesting talks on Tomcat 7, including an introduction to the new features of that release. Those include better memory leak prevention and detection capabilities - the implementation of these capabilities has led to the discovery of various leaks that appear under more or less weird circumstances in famous open source libraries and the JVM itself. But better management and reporting facilities are also part of the new release.

As I started the third day over at the Tomcat track, I unfortunately missed the Tika and Nutch presentations by Chris Mattmann - so I am happy that at least the slides were published online: $LINK. The development of nutch was especially interesting for me as it was the first Apache project I got involved with back in 2004. Nutch started out as a project with the goal of providing an open source alternative for internet-scale search engines. Based on Lucene as its indexing kernel, it also provides crawling, content extraction and link analysis.

Focussed on building an internet-scale search engine, the need for a distributed processing environment quickly became apparent. Initial implementations of a nutch distributed file system and a map reduce engine led to the creation of the Apache Hadoop project.

In recent years it became comparably quiet around nutch. Besides Hadoop, content extraction was also factored out of the project, into Apache Tika. At the moment development is gaining momentum again. Future developments are supposed to be focussed on building an efficient crawling engine. As storage backend the project wants to leverage Apache HBase, for content extraction Tika is to be used, and Solr as the indexing backend.

I loved the presentation by Geoffrey Young on how they used Solr at Ticketmaster to replace their old MySQL-based search system for better performance and more features. Indexing documents representing music CDs presents some special challenges when it comes to domain modeling: there are bands with names like "!!!". In addition users are very likely to misspell certain artist names. In contrast to large search providers like Google, these businesses usually have neither the human resources nor enough log data to provide comparable coverage e.g. when implementing spell-checking. A very promising and agile approach taken instead was to parse log files for the most common failing queries and learn from those about features users need: there were many queries including geo information, coming from users looking for an event at one specific location. As a result geo information was added to the index - leading to happier users.

Scaling user groups

2010-05-26 19:32
A few hours ago, Jan Lehnardt posted a link on how to organise a nerd conference - joking that this is how we planned Berlin Buzzwords. Well, it is not exactly that easy - however the comic actually is not that far from the truth either:

About two years ago, after having started Apache Mahout together with Grant Ingersoll, Karl Wettin and others, several Apache Hadoop user groups, meetups and get-togethers started to pop up all around the world. The one closest to me was the Hadoop user group UK. Back in 2008 I was pretty envious of all these user groups - being so distributed, there was no way I could ever attend all of them, though the talks certainly were interesting. So the naive thought of a back-then naive free software developer was: let's have that in Berlin. To line up initial talks I called Stefan Groschupf. His answer was very positive: "Oh yeah, let's do this. I am in Germany for another two weeks, so it should be at about that timeframe." We agreed that if no-one showed up, we could still have some pizza together and share insights from our projects.

For the venue I knew from regular meetups of the Free Software Foundation Europe - read FSF*E* - that newthinking store was available for free for meetups of free software developers. On I went, called Martin from the store, booked the room. After that, some mails went out to the usual suspects, mailing lists and such. At the first meetup two years ago, more than 15 attendees showed up - with two more people who had prepared slides. The pizzas obviously had to wait a little.

If you are wondering what that looked like back then - Thanks to Martin for taking the image back then and putting it online.

We (as in: all attendees) decided to repeat the exercise three months later*; talks for the next time were proposed during that first session. No-one objected to having it in Berlin again - everyone knew this was the only way to avoid having to do the organisation next time.

The meetup grew steadily in size, talks started being proposed three to six months in advance. I ended up creating not only a mailing list for the meetup but also a blog so I could publish news on Jan's CouchDB talk and Lars George's HBase talk back then. We got video sponsoring from Cloudera (Thanks Christophe), StudiVZ (Thanks Nils), and Nokia (Thanks Matt). Late last year I did the first European NoSQL meetup together with Jan Lehnardt - 80 attendees, lots of potential for more, the newthinking store obviously a bit too small for that :)

If you are wondering what NoSQL and Hadoop meetups looked like last time:

During that meetup the idea was born for a larger NoSQL conference in Berlin in 2010. First ideas were tossed around together with Jan and Simon Willnauer during Apache Con US in Oakland. The topic Hadoop got added there. In January 2010 finally Lucene was added to the mix. We contacted newthinking for support - got a very warm welcome.

Now - two years after the first Apache Hadoop Get Together Berlin - we are proud to host Berlin Buzzwords, focussed on NoSQL, Apache Hadoop and search as in Apache Lucene. The conference is co-organised by newthinking communications, Simon Willnauer, Jan Lehnardt and myself. A big thanks to neofonie for supporting me by making it possible to do most of the organisation during my regular working hours.

The speaker lineup looks fantastic. Registration is going very well - exceeding expectations (did I mention that registration is still open, group and student tickets still available?).

I am really looking forward to an amazing conference on 7th and 8th of June. We will have a NoSQL barcamp in newthinking store Sunday evening before the conference. Keynote speaker packages have been sent out and were well received. Hotel rooms for speakers are booked. We are about to pull together the last loose ends in the coming days. Happy to have so many guys (and a few girls) interested in scalability topics here in town at the beginning of June. Looking forward to seeing you in Berlin.

* The second meetup turned out to be the first and so far only one that took place without the organiser - I broke my leg on my way to newthinking by getting hit by a BMW X5... *sigh* Note for other meetup organisers: always have a backup moderator - in my case that was my neofonie manager Holger Düwiger, who happened to attend that meetup for the first time back then.

Berlin Buzzwords - Early bird registration

2010-04-10 15:02
I would like to invite everyone interested in data storage, analysis and search to join us for two days on June 7/8th in Berlin for Berlin Buzzwords - an in-depth, technical, developer-focused conference located in the heart of Europe. Presentations will range from beginner friendly introductions on the hot data analysis topics up to in-depth technical presentations of scalable architectures.

Our intention is to bring together users and developers of data storage, analysis and search projects. Meet members of the development team working on projects you use. Get in touch with other developers you may know only from mailing list discussions. Exchange ideas with those using your software and get their feedback while having a drink in one of Berlin's many bars.

Early bird registration has been extended until April 17th - so don't wait too long.

If you would like to submit a talk yourself: conference submission is open for a little more than one week. More details are available online in the call for presentations.

Looking forward to meeting you in the beautiful, vibrant city of Berlin this summer for a conference packed with high profile speakers, awesome talks and lots of interesting discussions.

Some pictures

2010-03-25 11:00
Uwe and Simon were so kind as to take some pictures at the last Hadoop Get Together in Berlin:

(five photos from the Hadoop Get Together Berlin)

Thanks for the pictures.

Chris Male on spatial search with Lucene

2010-03-16 20:42
Last week the March 2010 Hadoop Get Together took place in Berlin. The last speaker was Chris Male, on spatial search with Lucene and Solr. The video is now available online:

Lucene Chris Male from Isabel Drost on Vimeo.

Feel free to share and distribute the video to anyone who might be interested. Thank you Chris, for traveling over from Amsterdam for an awesome talk on spatial search.

If you want to learn more about what people over at Lucene and Solr are currently working on, head over to Berlin Buzzwords - a conference on scalable search, storage and data analysis. If you have interesting projects yourself - feel free to submit a talk.

Thanks to Nokia for sponsoring the video taping - and again as always thanks to newthinking for providing the location for free.

Call for presentations - Berlin Buzzwords

2010-03-11 15:09

Call for Presentations Berlin Buzzwords
Berlin Buzzwords 2010 - Search, Store, Scale
7/8 June 2010

This is to announce the opening of the Berlin Buzzwords 2010 call for presentations. Berlin Buzzwords is the first conference on scalable and open search, data processing and data storage in Germany, taking place in Berlin.

The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

  • Information retrieval, search - Lucene, Solr, katta or comparable solutions
  • NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
  • Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives

Closely related topics not explicitly listed above are welcome. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.

Important Dates (all dates in GMT +2)

  • Submission deadline: April 17th 2010, 23:59
  • Notification of accepted speakers: May 1st, 2010.
  • Publication of final schedule: May 9th, 2010.
  • Conference: June 7/8. 2010.

High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.

Proposals should be submitted at http://berlinbuzzwords.de/content/cfp no later than April 17th, 2010. Acceptance notifications will be sent out on May 1st. Please include your name, bio and email, the title of the talk and a brief abstract in English. Please indicate whether you want to give a short (30min) or long (45min) presentation, and the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted at experienced users).

The presentation format is short: either 30 or 45 minutes including questions. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us.

Follow @hadoopberlin on Twitter for updates. News on the conference will be published on our website at http://berlinbuzzwords.de

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Schedule and further updates on the event will be published on http://berlinbuzzwords.de

Slides are available

2010-03-11 00:49
Slides for the last Hadoop Get Together are available online:

Videos will follow as soon as they are ready. Watch this space for further updates.