Archive

Archive for the ‘Lucene’ Category

Elastic Search meetup Berlin – January 2013

February 1st, 2013 at 6:34pm

The first meetup this year I went to started with a large bag of good news for Elastic Search users. In the offices of Sys Eleven (thanks for hosting) the meetup started at 7p.m. last Tuesday. Simon Willnauer gave an overview of what to expect of the upcoming major release of Elastic Search:

For all 0.20.x version ES features a shard allocator version that is ignorant of which index shards belong to, machine properties, usage patterns. Especially ignoring index information can be detrimental and lead to having all shards of one index on one machine in the end leading to hot spots in your cluster. Today this is solved by lots of manual intervention or even using custom shard allocator implementations.

With the new release there will be an EvenShardCountAllocator that allows for balancing shards of indexes on machines – by default it will behave like the old allocator but can be configured to take weighted factors into account. The implementation will start with basic properties like “which index does this shard belong to” but the goal is to also make variables like remaining disk space available. To avoid constant re-allocation there is a threshold on the delta that has to be passed for re-allocation to kick in.

0.21 will be released when Lucene 4.1 is integrated. That will bring new codecs, concurrent flushing (to avoid the stop-the-world flush during indexing that is used in anything below Lucene 4 – hint: Give less memory to your JVM in order to cause more frequent flushes), there will be compressed sort fields, spellchecking and suggest built into the search request (though unigram only). There will be one similarity configurable per field – that means you can switch from TF-IDF to alternative built-in scoring models or even build your own.

Speaking of rolling your own: There is a new interface for FieldData (used for faceting, scoring and sorting) to allow for specialised data structures and implementations per field. Also the default implementation will be much more memory efficient for most scenarios be using UTF-8 instead of UTF-16 characters).

As for GeoSpatial: The code came to Lucene as a code dump that the contributor wasn’t willing to support or maintain. It was replaced by an implementation that wasn’t that much better. However the community is about to take up the mess and turn it into something better.

After the talk the session essentially changed to an “interactive mailing list” setup where people would ask questions live and get answers both from other users as well as the developers. Some example was the question for recommendability of pyes as a library. Most people had used it, many ran into issues when trying to run an upgrade with features being taken away or behaviour being changed without much notice. There are plans to release Perl, Ruby and Python clients. However also using JRuby, Groovy, Scala or Clojure to communicate with ES works well.

On the benefit of joining the cluster for requests: That safes one hop for routing, result merging, is an option to have a master w/o data and helps with indexing as the data doesn’t go through an additional node.

As for plugins the next thing needed is an upgrade and versioning schema. Concerning plugin reloading without restarting the cluster there was not much ambition to get that into the project from the ES side of things – there is just too much hazzle when it comes to loading and unloading classes with references still hanging around to make that worthwhile.

Speaking of clients: When writing your own don’t rely on the binary protocol. This is a private interface that can be subject to change at any time.

When dealing with AWS: The S3 gateway is not recommended to be used as it is way too slow (and as a result very expensive). Rather backup with replicas, keep the data around for backup or use rsync. When trying to backup across regions this is nothing that ES will help you with directly – rather send your data to both sites and index locally. One recommendation that came from the audience was to not try and use EBS as the IO optimised versions are just too expensive – it’s much more cost effective to rely on ephermeral storage. Another thing to checkout is the support for ES being zone aware to avoid having all shards in one availability zone. Also the node discovery timeout should be increased to at least one minute to work in AWS. When it comes to hosted solutions like heroko you usually are too limited in what you can do with these offers compared to the low maintenance overhead of running your own cluster. Oh, and don’t even think about index encryption if you want to have a fast index without spending hours and hours of development time on speeding your solution up with custom codecs and the like :)

Looking forward to the Elastic Search next meetup end of February – location still to be announced. It’s always interesting to see such meetup groups grow (this time from roughly 15 in November to over 30 in January).

PS: A final shout-out to Hossman - that psychological trick you played on my at your boosting and biasing talk at Apache Con EU is slightly annoying: Everytime someone mentions TF-IDF in a talk (and that isn’t too unlikely in any Lucene, Solr, Elastic Search talks) I panicingly double check whether there are funny pictures on the slide shown! ;)

Free Software, Lucene , ,

Elastic Search meetup Berlin

November 28th, 2012 at 12:39am

Today Retresco hosted the (to my knowledge fourth) Elastic Search User Group Berlin - a group dedicated to using Lucene as part of Elastic Search. With roughly fifteen attendees the meetup attracted a decent crowd - most interestingly many of the people there were already using the software either in production or for closed beta projects.

The fist talk given was by people from ferret-go - a company doing media monitoring for brands focused on the German market. They are pretty new to the search topic, on top they aren’t fluent Java developers but do most stuff in Python. Essentially their whole application is built on top of Elastic Search - most features are implemented as more or less complex search queries. In recent weeks they had to deal with typical problems related to growing data set sizes, nodes getting hot in particular when load balancing isn’t configured quite right and balancing shards on a per index level instead of doing it globally for all shards (particularly bad in their case as they added an index with smaller shards and one with significantly larger shards into ES).

The second talk gave a really nice overview on things to keep in mind before putting ES to production - as is usually the case, the default configuration makes it easy to get started but most likely is not what you want in your production environment.

Really nice, technically focused event. Thanks to Retresco for hosting the meetup including beer and Club Mate and to ElasticSearch.com for paying for the pizza. Check their meetup page for the next event - most likely to be scheduled in January.

Also if you’d like to learn more on Lucene 4 (talk by Simon Willnauer) - make sure to attend the Apache Hadoop Get Together December 2012 (if you need another ticket, it might make sense to politely ask the organiser, make sure you are registered, otherwise most likely you won’t get through security). Other two talks scheduled: Pere Urbon Bayes on NoSQL and Graph, a love story! as well as myself on How to Fail Your Big Data Project Quick and Rapidly.

Get Together, Lucene ,

ApacheConEU - part 10

November 19th, 2012 at 8:34pm

In the next session Jukka introduced Tika - a toolkit for parsing content from files including a heuristics based component for guessing the file type: Based on file extension, magic and certain patterns in the file the file type can be guessed rather reliably. Some anecdotes:

  • not all mime types are registered with IANA, there are of course conflicting file extensions,
  • Microsoft Word not only localises their interface but also the magic in the file,
  • html detection is particularly hard as there is quite some overlap with other file formats (e.g. there are such things as html mails…)
  • xhtml parsing is quite reliable by using an actual xml parser for the first few bytes to identify the namespace of the document
  • identifying odf documents is easy - though zipped the magic is preserved uncompressed at a pre-defined location in the file
  • for ooxml the file has to be unpacked to identify it
  • plain text content is hardest - there are a few heuristics based on UTF BOMs, ASCII stats, line ending stats, byte histograms, still it’s not fool proof.

In addition Tika can extract metadata: For text that can be as easy as encoding, length, content type and comments. For images that is extended by image size and potentially EXIF data. For pdf data it gets even more comprehensive including information on the file creator, save date and more (same applies for MS office documents). Most document metadata is based on the doublin core standard, for images there’s EXIF and IPCT - soon there’ll also be xmb related data that can be discovered.

Apache Con, Lucene , ,

ApacheCon EU - part 09

November 18th, 2012 at 8:54pm

In the Solr track Elastic Search and Solr Cloud went into competition. The comparison itself was slightly apples-and-oranges like as the speaker compared the current ES version based on Lucene 3.x and Solr Cloud based on Lucene 4.0. During the comparison it still turned out that both solutions are more or less comparable - so choice again depends on your application. However I did like the conclusion: The speaker did not pick a clear winner in terms of projects. However he did have another clear winner: The user community will benefit from there being two projects as this kind of felt competition did speed up development already considerably.

The day finished with hoss’ Stump the Chump session: The audience was asked to submit questions before the session, the jury was than asked to pick the winning question that stumped Hoss the most.

Some interesting bits from that question: One guy had the problem of having to provide somewhat diverse results in terms e.g. manufacturers in his online shop. There are a few tricks to deal with this problem: a) clean your data - don’t have items that use keyword spamming side by side with regular entries. Assuming this is done you could b) use grouping to collapse items from the same manufacturer and let the user drill deeper from there. Also using c) a secondary sorting value can help - one hint: Solr ships with a random value out of the box for such cases.

For me the last day started with hossman’s session on boosting and scoring tricks with Solr - including a cute reference for explaining TF-IDF ranking to people (see also a message tweeted earlier for an explanation of what a picture taken during my wedding has to do with ranking documents):


Share photos on twitter with Twitpic

Though TF-IDF is standard IR scoring it probably is not enough for your application. There’s a lot of domain knowledge that you can encode in your ranking:

  • novelty factors - number of ratings/ standard deviation of ratings - ranks controversial items on top that might be more interesting than just having the stuff that everyone loves
  • scarcity - people like buying what is nearly sold out
  • profit margin
  • create your score manually by an external factor - e.g. popularity by association like categories that are more popular than others or items that are more popular depending on the time of day or year

There are a few sledge hammers people usually think of that can turn against you really badly: Say you rank by novelty only - that is you sort by date. The counter example given was the case of the AOL-Time Warner merger - being a big story news papers would post essays on it, do evaluations etc. However also articles only remotely related to it would mention the case. So be the end of the week when searching for it you would find all those little only remotely relevant articles and have to search through all of them to find the really big and important essay.

There are cases where it seems like only recency is all that matters: Filter only for the most recent items and re-try only in case of no results. The counter example is the case where products are released just on a yearly basis but you set the filter to say a month. This way up until May 31 your users will run into the retry branch and get a whole lot of results. However when a new product comes out on June first from that day onward the old results won’t be reachable anymore - leading to a very weird experience for those of your users who saw yesterday’s results.

There were times when scoring was influenced by keyword stuffing to emulate higher scores - don’t do that anymore, Solr does support sophisticated per field and document boosting that make such hacks superfluous.

Instead rather use edismax for weighting fields. Some hints on that one: configure omitNorms in order to avoid having keyword stuffing influence your ranking. Configure omitTermFrequencyAndPosition if the term frequency in any document does not really tell you much e.g. in case of small documents only.

With current versions of Solr you can use your custom scoring per field. In addition a few ones are shipped that come with options for tweaking - like for instance the sweetSpotSimilarity wher you can tell the scorer that up to a certain length no length penalisation should happen.

Create your own boost functions that in addition to TF-IDF rely on rating values, click rates, prices or category influences. There’s even an external file field option to allow you to load your scoring value per document or category from an external file that can be updated on a much more frequent basis than you would otherwise want to re-index all documents in your solr. For those suffering from the “For business reasons this document must come first no matter what the algorithm says” syndrom - there’s a query elevation component for very fine grained tuning of rankings per query. Keep in mind so that this can easily turn into a maintanance nightmare. However it can be handy when quickly fixing a highly valuable business based use case: With that component it is possible to explicitly exclude documents from matching and precisely setting where to rank individual documents.

When it comes to user analytics and personalisation many people think of highly sophisticated algorithms that need lots of data to be trained. Yes Mahout can help you with personalisation and recommendation - but there are a few low hanging fruits to grab before:

  • Use the history of registered users or those you can identify through cookies - track the keywords they are looking for, the sort and filter functions commonly used.
  • Bucket people by explicit or implicit demographics.
  • Even just grouping people by the os and browser they use can help to identify preferences.

All of this information is relatively cheap to get by and can be used in many creative ways:

  • Provide default sort and filter functions for returning users.
  • Filter on the current query but when scoring take the older query into account.
  • Based on category facet used before do boosting in the next search assuming that the two queries are related.

Essentially the goal is to identify three factors for your users: What is their preference, what is the differentiator and what is your confidence in your estimation.

Another option could be to use the SweetSpotPlateau: If someone clicked on a price range facet on the next related query do not hide other prices but boost those that are in the previous facet.

One side effect to keep in mind: Your cache hit rate will go down now you are tailoring your results to individual users or user groups.

Biggest news for Lucene and Solr was the release of Lucene 4 - find more details online in an article published recently.

Apache Con, Lucene , , , , ,

ApacheConEU - part 07

November 16th, 2012 at 8:51pm

Julien Nioche shared some details on the nutch crawler. Being the mother of all Hadoop projects (as in Hadoop was born out of developments inside of nutch) the project has become rather quite with a steady stream of development in the recent past. Julien himself uses the nutch for gathering crawled data for several customer projects - feeding this data into an NLP pipeline based on Behemoth that glues Mahout, UIMA and Gate together.

The basic crawling steps including building the web graph, computing a link based ranking method and indexing are still the same since last I looked at the project - just that for indexing the project now uses solr instead of their own lucene based solution.

The main advantage of nutch is its pluggability: the protocol parser, html filter, url filter, url normaliser all can be exchanged against your own implementations.

In their 2.0 version they moved away from using their own plain hdfs storage to a table schema - mapped to the real database through Gora, an abstraction layer to connect to e.g. Cassandra or HBase. The schema itself is based on Avro but can be adopted to your needs. The advantages are obvious: Though still distributed this approach is much easier and simpler in terms of logic kept in nutch itself. Also it is easier to connect to the data for third parties - all you need is the schema as well as Gora. The current disadvantage lies in it’s configuration overhead and instability compared to the old solution. Most likely at least the latter one will go away as version 2.0 stabelises.

In terms of future work the project focuses on stabilisation, synchronising features of version 1.x and 2.x (link ranking is only available in version 1.x while support for elastic search is only available in version 2.x). In terms of functionality the goal is to move to Solr Cloud, support sitemaps (as implemented by commons crawler), more (pluggable?) indexers.

The goal is to delegate implementations - it was already done for Tika and Solr. Most likely it will also happen for the fetcher, protocol handling, robots.txt handling, url normalisation and filtering, graph processing code and others.

The next talk in the Solr/Lucene talk dealt with scaling Solr to big data. The goal of the speaker was to index 100 million documents - the number of documents was expected to grow in the future. Being under heavy time pressure and having a bash wizard on the project they started building lots of their glue code and components in bash scripts: There were scripts for starting/stopping services, remote indexing, performance monitoring, content extraction, ingestion and deployment. Short term this was a very good idea - it allowed for fast iterations and learning. On the long run they slowly replaced their custom software with standard components (tika for content extraction, puppet for deployment etc.).

They quickly learnt to value property files in order to easily reconfigure the system even in production (relying heavily on bash xml was of course not an option). One problem this came in handy with was adjusting the sharding configuration - going from a simple random sharding to old vs new to monthly they could optimise the configuration to their load. What worked well for them was to separate JVM startup from Solr core startup - they would start with empty solrs and only point them to the data directories aafter verifying that the JVMs booted up correctly.

In terms of performance they learnt to go wide quickly - instead of spending hours on optimising their one huge box they ended up distributing their solrs to multiple separate machines.

In terms of ingestion pipelines: Theirs is currently based on an indexing directory convention, moving the indexing as soon as all data is ingested. The advantage here is the atomicity of mv that can be used - disadvantage is having to move around lots of data. Their goal is to go for hdfs for indexing soon and zookeeper for configuration handling.

In terms of testing: In contrast to having a dev-test-production environment their goal is to have an elastic compute cloud that can be adjusted accordingly. Though EC2 itstelf is very cost intensive, poses additional problems with firewalling and moving data their cloud computing could still be a solution - in particular given projects like cloud stack or open cloud. The goal would be to do cycle scavaging in their own datacenter, do heavy computations when there is not a lot of load on the system and turn those analysis of in case of incoming traffic.

When it comes to testing and monitoring changes they made good experiences with using JConsole (connecting it to several solrs at once through a simple ip discovery script) and solr meter as a performance debugging tool.

Some implementation details: They used Solr as some sort of NoSQL cache as well (for thousands of queries/s), push the schema definition to solr from the app, have common fields and the option for developers to add custom fields in the app. Their experience is to not do expensive stuff in solr but to move that outside - this applies in particular to content extraction. For storage they used an avro based binary file format (mainly in order to save storage space, have a versioned schema and fast compression and de-compression). They are starting to use tika as their pipeline and for auto content detection, scaling up with behemoth. They learnt to upgrade indexes without re-indexing using the lucene upgrade tooling. In addition they use Grimreaper to kill servers if anything goes wrong and restart it later.

Apache Con, Lucene , , , ,

GeeCon - Solr at Allegro

May 25th, 2012 at 8:07am

One particularly interesting to me was on Allegro’s (polish Ebay) Solr usage. In terms of numbers: They have 20Mio offers in Poland, another 10Mio active offers in partnering countries. In addition in their index there are 50Mio inactive offers in Poland and 40 Mio closed offers outside that country. They serve 8Mio updates a day, that is 100 updates a second. Those are related to start/end of bidding phase, buy now actions, cancelled bids, bids themselves.

Per day they have 105Mio requests per day, on peak time in the evening that is 3.5k requests per second. Of those 75% are answered in less than 5ms, 90% in less than 20ms.

To achieve that performance they are using Solr. Coming from a database based system, going via a proprietary search product they are now happy users of Solr with much better customer support both from the community as well as from contractors than with their previous paid for solution.

The speakers went into some detail on how they solved particular technical issues: They had to decide to go for an external data feeder to avoid putting the database itself under too much load even when just indexing the updates. On updates they need to deal with having to reconstruct the whole document as updates for Solr right now mean deleting the old document and indexing the new one. In addition commits are pretty expensive, so they ended up delaying commits for as long as the SLA would allow (one minute) and committing them as batch.

They tried to shard indexes by category facetted by – that did not work particularly wrong as with their user behaviour it resulted in too many cross-shard requests. Index size was an issue for them so they reduced the amount of data indexed and stored in Solr to the absolute minimum – all else was out-sourced to a key-value store (in their case MongoDB).

When it comes to caching that proved to be the component that needed most tweaks – they put a varnish in front (Solr speaks xml over http which is simple enough to find caches for) – in relation with the index delay they had in place they could tune eviction times. Result were cache hit rates of about 30 to 40 percent. When it comes to internal caches: High eviction and low hit rates are a problem. Watch the Solr Admin Console for more statistics. Are there too many unique objects in your index? Are caches too small? Are there too many unique queries? They ended up binding users to solr backends by having a routing be sticky with the user’s cookie – as users tend to drill down on the same dataset over and over again in their case that raised hit rates substancially. When tuning filter queries: Have them as independent as possible – don’t use many unique combinations of the same filtering over and over again. Instead filter individually to better use that cache.

For them Solr proved to be a stable, efficient, flexible, easy to monitor and maintain and change system that ran without failure for the first 8 months with the whole architecture being designed and prototyped (at near production quality) by one developer in six months.

Currently the system is running on 10 solr slaves (+ power backup) compared to 25 nodes before. A full index takes 4 hours, bottlenecked at the feeder, potentially that could be pushed down to one hour. Updates of course flow in continuously.

Lucene , ,

Scaling user groups

May 26th, 2010 at 7:32pm

A few hours ago, Jan Lehnardt posted a link on How to organise a nerd conference - joking that this is how we planned Berlin Buzzwords. Well, it is not exactly that easy - however the comic actually is not so far from the truth either:

About two years ago, after having started Apache Mahout together with Grant Ingersoll, Karl Wettin and others, several Apache Hadoop user groups, meetups and get togethers started to pop up all around the world. The one closest to me was the Hadoop user group UK. Back in 2008 I was pretty envious to all these user groups - being so distributed, there was no way I could ever attend all of them, though talks were certainly interesting. So the naive thought of a back then naive free software developer was: Let’s have that in Berlin. To have initial talks I called Stefan Groschupf. His answer was very positive: Oh yeah, let’s do this. I am in Germany for another two weeks, so it should be at about that timeframe. We agreed that if no-one showed up, we could still have some pizza together and share insights from our projects.

For the venue I knew from regular meetups of the Free Software Foundation Europe - read FSF*E* - that newthinking store was available for free for meetups for devs of free software. On I went, calling Martin from the store, booked the room. After that some mails went to the usual suspects, mailing lists and such. At the first meetup two years ago, more than 15 attendees - with two more people who had prepared slides. Pizzas obviously had to wait a little.

If you are wondering what that looked like back then - Thanks to Martin for taking the image back then and putting it online.



We (as in all attendees) decided to repeat the exercise three months later*, talks for the next time were proposed during that first session. Noone objected to having it in Berlin again - everyone knew this was the only way to avoid having to do the organization next time.

The meetup grew steadily in size, talks started being proposed three to six months in advance. I ended up creating not only a mailing list for the meetup but also a blog so I could publish news on Jan’s CouchDB talk and Lars George’s HBase talk back then. We got video sponsoring from Cloudera (Thanks Christophe), StudiVZ (Thanks Nils), and Nokia (Thanks Matt). Late last year I did the first European NoSQL meetup together with Jan Lehnardt - 80 attendees, lots of potential for more, the newthinking store obviously a bit too small for that :)

If you are wondering what NoSQL and Hadoop meetups looked like last time:


During that meetup the idea was born for a larger NoSQL conference in Berlin in 2010. First ideas were tossed around together with Jan and Simon Willnauer during Apache Con US in Oakland. The topic Hadoop got added there. In January 2010 finally Lucene was added to the mix. We contacted newthinking for support - got a very warm welcome.

Now - two years after the first Apache Hadoop Get Together Berlin we are proud to host Berlin Buzzwords - focussed on NoSQL, Apache Hadoop and search as in Apache Lucene.The conference is co-organised by newthinking communications, Simon Willnauer, Jan Lehnardt and myself. A big thanks to neofonie for supporting me by making it possible that I could do most of the organisation during my regular working hours.

The speaker lineup looks fantastic. Registration is going very well - exceeding expectations (did I mention that registration is still open, group and student tickets still available?).

I am really looking forward to an amazing conference on 7th and 8th of June. We will have a NoSQL barcamp in newthinking store Sunday evening before the conference. Keynote speaker packages have been sent out and were well received. Hotel rooms for speakers are booked. We are about to pull together the last loose ends in the coming days. Happy to have so many guys (and a few girls) interested in scalability topics here in town at the beginning of June. Looking forward to seeing you in Berlin.

* The second meetup turned out to be the first and so far only one that took place w/o the organiser - I broke my leg on my way to newthinking by getting hit by a BMW X5… *sigh* Note for other meetup organizers: Always have a backup moderator - in may case that was my neofonie manager Holger Düwiger who happened to attend that meetup for the first time back then.

Berlin Buzzwords, Get Together, Hadoop, Lucene, Mahout , , , , ,

The 7 deadly sins of (Java) software developers

January 23rd, 2010 at 9:09am

On Lucid Imaginations Blog Jay Hill published a great article on The seven deadly sins of solr. Basically it is a collection of his experiences “analyzing and evaluating a great many instances of Solr implementations, running in some of the largest Fortune 500 companies”. It is a collection of common mistakes, mis-configurations and pitfalls in Solr installations in production use.

I loved the article very much. However, many of the symptoms that Jay described in his article do not apply to Solr installations only. In the following I will try to come up with a more general classification of errors that occur when your average Java developer starts using a sufficiently large framework that is supposed to make his work easier. Happy about any input on your favourite production issues.

Remark: What is printed in italic is quoted as is.

Sin number 1: Sloth - I’ll do it later

Let’s define sloth as laziness or indifference. This one bites most of us at some time or another. We just can’t resist the impulse to take a shortcut, or we simply refuse to acknowledge the amount of effort required to do a task properly. Ultimately we wind up paying the price, usually with interest.

There is even a name for it in Scrum: Technical debt. It may be ok to take a shortcut, given this is done based on an informed decision. As with regular debt, you may get a big advantage like launching way earlier than your competitor. However as with real debt, it does come at a prize.

Lack of commitment

Jay describes the problems that are especially frequent when switching search applications: Humans in general do not like giving up their habits. A nice example described in more detail in a recent Zeit article is what happens each year in December/ January when the first snow falls: It is by no means irregular or not to be expected that it starts snowing in December in Germany. However there will be lots of people who are not prepared for that. They refuse to put on winter tiers in late autumn. They use their car instead of public transport despite warnings in public press. The conclusion of the article was simple: People are simply not willing to change habits they got used to. It does take longer and is a bit less flexible to get to work by public transport instead of your own car. It does require adjusting your daily routine, optimising your processes.

Something similar happens to a developer that is “forced” to switch technology, be it the search server, the database, the build system or simply the version control system: The old ways of doing stuff simply may not work as expected. New tools might be called for. New technologies to learn. However in not so seldom cases developers just blame the new tools: “But with the old setup this would always work.”

Developing software - probably more than anything else - means constant development, constant change. Technologies shift as tasks shift, tools are improved as workflows change. Developing software means to constantly watch closely what you are doing, reflecting on what works and what doesn’t and changing things that don’t work. Accepting change, seeing it as a chance rather than an obstacle is critical.

If however change is imposed on developers though good arguments in favour of the old approach exist, it may be worth the effort to at least take the technical view into account to make an informed decision.

Not reviewing, editing, or changing the default configuration files.

I have extended this one a bit: Developers not changing default configuration files are not that uncommon. Be it the default database configuration, default logging configuration for your modules or default configuration of your servlet container. Even if you are using software pre-packed by your distribution, it is still worth the effort to review configuration files for your services and adjust them to your needs. Usually they are to be used as examples that still need tweaking and customization after roll-out.

JVM settings and GC

If you are running Java application there is no way around to adjust GC settings as well as general JVM settings to your particular use case. There are great tutorials at sun.com that explain both the settings themselves as well as several rules-of-thumb of where to start. Still nothing should stop you from measuring your particular application and its specific needs - both, before and after tuning. Along with that goes the obvious recommendation to simply “know-your-tools” - learning load testing tools shortly before launch time is certainly no good choice. Trying to find out more on Java memory analysis late in the development cycle just because you need to find that stupid memory leak like *now* is no good idea neither.

There are several nice talks as well as several tutorials available online on the topic of JVM tuning, debugging memory as well as threading issues, one of them being the talk by Rainer Jung at Frocson 2008.

Sin number 2: Greed

Running a service on insufficient hardware (be it main memory, harddisks, bandwidth, …) is not only an issue with Solr installations. There are many cases where just adding hardware may help in the short run, but is a dead-end in the long run:

  • Given a highly inefficient implementation, identifying bottlenecks, profiling, benchmarking and optimization go a long way.
  • Given an inappropriate architecture, redesign, reimplementation and maybe even switching base technologies does help.

However as Jay pointed out, running production servers with less power than your average desktop Mac has does not help neither.

Sin number 3: Pride

Engineers love to code. Sometimes to the point of wanting to create custom work that may have a solution in place already, just because: a) They believe they can do it better. b) They believe they can learn by going through the process. c) It “would be fun”. This is not meant to discourage new work to help out with an open-source project, to contribute bug fixes, or certainly to improve existing functionality. But be careful not to rush off and start coding before you know what options already exist. Measure twice, cut once.

Don’t re-invent the wheel.

As described in Jay’s post, there are developers who seem to be actively searching for reasons to re-invent the wheel. Sure, this is far easier with open source software than with commercial software. Access to code here makes the difference: Understanding, learning from, sharing and improving the software is key to free software.

However there are so many cases where improve does not mean re-implement but submitting patches, fixing bugs, adding new features to the orignal project or just refactoring the original code and ironing out some well known bumbs to make life easier for others.

Every now and then a new query abstraction language for map reduce pops up. Some of those really solve distinct problem settings that cannot (and should not) be solved within one language. Especially if a technology is young, this is pretty usual as people try out different approaches to see what works and what does not work out so well. Good and stable things come from that - in general the fittest approach survives. However, too often I have heard developers excusing their re-invention by “having had too few time to do a throughough evaluation of existing frameworks and libraries”. The irony here really is that usually, coding up your own solution does take time as well. In other cases the excuse was missing support for some of the features needed. How about adding those features, submitting them upstream and benefitting from what is already there and an active community supporting the project, testing it, applying fixes and adding further improvements?

Make use of the mailing lists and the list archives.

Communication is key to success in software development. According to Conway’s law “Organizations
which design systems are constrained to produce systems which are copies of the communication structures of these organizations.” I guess it is pretty obvious that developing software today generally means designing complex systems.

In Open source, mailing lists (and bug trackers, the code itself, release notes etc.) are all ways for communication. (See also Bertrand’s brilliant talk on open source collaboration tools for that). With in-house development there is even added benefit as face-to-face communication or at least teleconferencing is possible.

However software developers in general seem to be reluctant to ask questions, to discuss their design, their implementation and their needs for changes. It just seems simpler to work-around a situation that disturbs you instead of propagating the problem to its source - or just asking for the information you need. A nice article on a related topic was published recently it-republik.

However asking these questions, taking part in these discussions is what makes software better. It is what happens regularly within open source projects in terms of design discussions on mailing lists, discussions on focussed issues in the bug tracker as well as in terms of code review.

There are several best practices that come with Agile Development that help starting discussions on code. Pair programming is one of these. Code reviews are another example. Having more than two eye balls look at a problem usually makes the solution more robust, gives confidence in what was implemented and as a nice side effect spreads knowledge on the code avoiding a single point of failure with just one developer being familiar with a particular piece of code.

Sin number 4: Lust

Must have more!You’ll have to grant me artistic license on this one, or else we won’t be able to keep this blog G-rated. So let’s define lust as “an unnatural craving for something to the point of self-indulgence or lunacy”. OK.

Setting the JVM Heap size too high, not leaving enough RAM for the OS.

Jay describes how setting the JVM RAM allocation too high can lead to Java eating up all memory and leaving nothing for the OS. The observation does not apply to Solr deployments only. Tomcat is just yet another application where this applies as well. Especially with IO-bound applications giving too much memory to the JVM is grave as the OS does not longer have enough space for disk caches.

The general take-away probably should be to measure and tune according to the real observed behaviour of your application. A second take-home message would be to understand your system - not only the Java part of it, but the whole machine from Java, the OS down to the hardware - to tune it effectively. However that should be a well known fact anyway. For Java developers, it sometimes helps to simply talk to your operations guys to get the bigger picture.

Too much attention on the JVM and garbage collection.

There are actually two aspects here: For one, as described by Jay it should not be necessary to try every arcane JVM or GC setting unless you are a JVM expert. More precisely, simply trying various options w/o understanding, what they mean, what side-effects they have and in which situations they help obviously isn’t a very good idea.

The second aspect would be developers starting with JVM optimization only to learn later on that the real problem is within their own application. Tuning JVM parameters really should be one of the last steps in your optimization pipeline. First should be benchmarking and profiling your own code. At the same stage you should review configuration parameters of your application (size of thread pools, connection pools etc.) as well your libraries and frameworks (here come solr’s configuration files, Tomcat’s configuration, RDBMs configuration parameters, cache configurations…). Last but not least should be JVM tuning - starting with adjusting memory to a reasonable amount, setting the GC configuration that makes most sense to your application.

Sin number 5: Envy

Bah!

Wanting features that other sites have, that you really don’t need.

It should be good engineering practice to start with your business needs and distill user stories from that and identify the technology that solves your problem. Don’t go from problem to solution without first having understood your problem. Or even worse: Don’t go from solution (that is from a technology you would love to use) to searching for a problem that this solution might solve: “But there must be a RDBMS somewhere in our architecture, right?”

Wanting to have a bigger index than the other guy.

The antithesis of the “greed” issue of not allocating enough resources. “Shooting for the moon” and trying to allow for possible growth over the next 20 years. Another scenario would be to never fix your system but leave every piece open and configurable, in the end leading to a system that is harder to configure than sendmail is. Yet another scenario would be to plan for billions of users before even launching: That may make sense for a new Google gadget, however for the “new kid on the block”? Probably unlikely, unless you have really good marketing guys. Plan for what is reasonable in your project, observe real traffic and identify real bottlenecks once you see them. Usually estimations of what bottlenecks could be are just plain wrong unless you have lot’s of experience with the type of application you are building. As Jeff Dean pointed out in his WSDM 2009 keynote, the right design for X users may still be right with 10x the amount of users. But do plan a rewrite at about the time you start having 100x and more the amount of users.

Sin number 6: Gluttony

“Staying fit and trim” is usually good practice when designing and running Solr applications. A lot of these issues cross over into the “Sloth” category, and are generally cases where the extra effort to keep your configuration and data efficiently managed is not considered important.

Lack of attention to field configuration in the schema.

Storing fields that will never be retrieved. Indexing fields that will never be searched. Storing term vectors, positions and offsets when they will never be used. Unnecessary bloat. Understand your data and your users and design your schema and fields accordingly.

On a more general scale that might be wrapped into the general advise of keeping only data that is really needed: Rotate logs on a schedule fit to your business, operations needs and based on available machines. Rotate data written into your database backend: It may make sense to keep users that did not interact with your application for 10 years. If you have a large datacenter for storage that may make even more sense. However usually keeping inactive users in your records simply eats up space.

Unexamined queries that are redundant or inefficient.

Queries that catch too much information, are redundant or multiple queries that could be folded into one are not only a problem for Solr users. Anyone using data sources that are expensive to query probably knows how to optimize those queries for reduced cost.

Sin number 7: Wrath

Now! While wrath is usually considered to be synonymous with anger, let’s use an older definition here: “a vehement denial of the truth, both to others and in the form of self-denial, impatience.”

Assuming you will never need to re-index your data.

Hmm - don’t only backup. Include recovery in your plans! Admittedly with search applications, this includes keeping the original documents - it is not unusual to add more fields or to want to parse data differently from the first indexing run. Same applies if you are post-processing data that has been entered by users or spidered from the web for tasks like information extraction, classifier training etc.

Rushing to production.

Of course we all have deadlines, but you only get one chance to make a first impression. Years ago I was part of a project where we released our search application prematurely (ahead of schedule) because the business decided it was better to have something in place rather than not have a search option. We developers felt that, with another four weeks of work we could deliver a fully-ready system that would be an excellent search application. But we rushed to production with some major flaws. Customers of ours were furious when they searched for their products and couldn’t find them. We developed a bad reputation, angered some business partners, and lost money just because it was deemed necessary to have a search application up and running four weeks early.

Leaving that as is - just adding, this does not apply to search applications only ;)

So keep it simple and separate, stay smart, stay up to date, and keep your application on the straight-and-narrow (YAGNI ;) ). Seek (intelligently) and ye shall find.

Free Software, Hacking, Lucene, Scrum , , ,

With a little help from my friends

December 31st, 2009 at 11:55pm

The end of the year 2009 is quickly approaching. To me it feels a little like it ran away far too quickly. So instead of taking part in the annual review of past events, I would like to use it as an opportunity to say thank you: The past twelve months were a lot of fun with lots of interesting, nice people from all over the world. I got the chance to meet quite a bit of the Mahout community, I got lots and lots of new developers from all over Germany - or more precisely the EU - to attend the Apache Hadoop Get Together in Berlin. The interest in Mahout has grown tremendously over the past year.

All of this would not have been possible without the help of many people: First of all I’d like to thank Thilo Fromm - for making me happy whenever I was disappointed, for solacing me when I when I was sad, for patiently listening to me nervously whining before each and every talk, for kindly reviewing my slides and last but not least for helping me fix some of the problems that bugged me. Oh - and, thanks for helping me fix the issue in the zookeeper c-client within minutes that puzzled me for days.

Another big Thanks goes to family, first and foremost my mum, who kindly took care of organizing quite a bit of my paperwork and kept me on schedule with so many “unimportant” tasks like getting an appointment with some hospital to finally get the screws taken out of my knee ;)

A special thanks goes to the growing Mahout community as well as to the Lucene people - you know, who you are - keep up the great work: You rock!

Furthermore there are students at TU Berlin who have shown that with Mahout it is “dead-simple” to write an application that, given a stream of documents, groups them by topic and makes the result searchable in Solr. Thanks to you for solving the minor and major problems, for communicating with the community, for transparently communicating problems. Looking forward to continue working together with you next year.

Finally a big thank you to all of the speakers, sponsors and attendees of the Apache Hadopp Get Together, the NoSQL conference and the Apache Dinner Berlin - without you these events would never have been possible. Looking forward to seeing you again in January/ March 2010!

I hope I didn’t forget too many people - just in case: I am pretty grateful for all the input, help and feedback I got this year.

PS: Another thanks to the spaceboyz visiting Berlin for 26C3 for helping Thilo tidy up our apartment after Congress was over this year ;)

Hadoop, Lucene, Mahout , , , , ,

Moving from Fast to Solr

November 19th, 2009 at 8:34pm

Sesat has published a nice in-depth report on why to move from Fast to Solr. The article also includes a description of the steps taken to move over as well as several statistics:

http://sesat.no/moving-from-fast-to-solr-review.html

On a related topic, the following article details, where Apple is using Lucene/Solr to power it’s search. Spoiler: Look at Spotlight, their desktop search, as well as on the iTunes search with about 800 QPS.

Update: As the site above could not be reached for quite some time, you should either look into the Google Cache version.

Free Software, Lucene ,