Elastic Search meetup Berlin – January 2013

2013-02-01 18:34
The first meetup this year I went to started with a large bag of good news for Elastic Search users. In the offices of Sys Eleven (thanks for hosting) the meetup started at 7p.m. last Tuesday. Simon Willnauer gave an overview of what to expect of the upcoming major release of Elastic Search:

For all 0.20.x version ES features a shard allocator version that is ignorant of which index shards belong to, machine properties, usage patterns. Especially ignoring index information can be detrimental and lead to having all shards of one index on one machine in the end leading to hot spots in your cluster. Today this is solved by lots of manual intervention or even using custom shard allocator implementations.

With the new release there will be an EvenShardCountAllocator that allows for balancing shards of indexes on machines – by default it will behave like the old allocator but can be configured to take weighted factors into account. The implementation will start with basic properties like “which index does this shard belong to” but the goal is to also make variables like remaining disk space available. To avoid constant re-allocation there is a threshold on the delta that has to be passed for re-allocation to kick in.

0.21 will be released when Lucene 4.1 is integrated. That will bring new codecs, concurrent flushing (to avoid the stop-the-world flush during indexing that is used in anything below Lucene 4 – hint: Give less memory to your JVM in order to cause more frequent flushes), there will be compressed sort fields, spellchecking and suggest built into the search request (though unigram only). There will be one similarity configurable per field – that means you can switch from TF-IDF to alternative built-in scoring models or even build your own.

Speaking of rolling your own: There is a new interface for FieldData (used for faceting, scoring and sorting) to allow for specialised data structures and implementations per field. Also the default implementation will be much more memory efficient for most scenarios be using UTF-8 instead of UTF-16 characters).

As for GeoSpatial: The code came to Lucene as a code dump that the contributor wasn't willing to support or maintain. It was replaced by an implementation that wasn't that much better. However the community is about to take up the mess and turn it into something better.

After the talk the session essentially changed to an “interactive mailing list” setup where people would ask questions live and get answers both from other users as well as the developers. Some example was the question for recommendability of pyes as a library. Most people had used it, many ran into issues when trying to run an upgrade with features being taken away or behaviour being changed without much notice. There are plans to release Perl, Ruby and Python clients. However also using JRuby, Groovy, Scala or Clojure to communicate with ES works well.

On the benefit of joining the cluster for requests: That safes one hop for routing, result merging, is an option to have a master w/o data and helps with indexing as the data doesn't go through an additional node.

As for plugins the next thing needed is an upgrade and versioning schema. Concerning plugin reloading without restarting the cluster there was not much ambition to get that into the project from the ES side of things – there is just too much hazzle when it comes to loading and unloading classes with references still hanging around to make that worthwhile.

Speaking of clients: When writing your own don't rely on the binary protocol. This is a private interface that can be subject to change at any time.

When dealing with AWS: The S3 gateway is not recommended to be used as it is way too slow (and as a result very expensive). Rather backup with replicas, keep the data around for backup or use rsync. When trying to backup across regions this is nothing that ES will help you with directly – rather send your data to both sites and index locally. One recommendation that came from the audience was to not try and use EBS as the IO optimised versions are just too expensive – it's much more cost effective to rely on ephermeral storage. Another thing to checkout is the support for ES being zone aware to avoid having all shards in one availability zone. Also the node discovery timeout should be increased to at least one minute to work in AWS. When it comes to hosted solutions like heroko you usually are too limited in what you can do with these offers compared to the low maintenance overhead of running your own cluster. Oh, and don't even think about index encryption if you want to have a fast index without spending hours and hours of development time on speeding your solution up with custom codecs and the like :)

Looking forward to the Elastic Search next meetup end of February – location still to be announced. It's always interesting to see such meetup groups grow (this time from roughly 15 in November to over 30 in January).

PS: A final shout-out to Hossman - that psychological trick you played on my at your boosting and biasing talk at Apache Con EU is slightly annoying: Everytime someone mentions TF-IDF in a talk (and that isn't too unlikely in any Lucene, Solr, Elastic Search talks) I panicingly double check whether there are funny pictures on the slide shown! ;)