Archive

Archive for the ‘Apache’ Category

Notes on storage options - FOSDEM 05

February 17th, 2013 at 8:43pm

On MySQL

Second day at FOSDEM for me started with the MySQL dev room. One thing that made me smile was in the MySQL new features talk: The speaker announced support for “NoSQL interfaces” to MySQL. That is kind of fun in two dimensions: A) What he really means is support for the memcached interface. Given the vast number of different interfaces to databases today, announcing anything as “supports NoSQL interfaces” sounds kind of silly. B) Given the fact that many databases refrain from supporting SQL not because they think their interface is inferior to SQL but because they sacrifice SQL compliance for better performance, Hadoop integration, scaling properties or others this seems really kind of turning the world upside-down.

As for new features – the new MySQL release improved the query optimiser, subquery support. When it comes to replication there were improvements along the lines of performance (multi threaded slaves etc.), data integrity (replication check sums being computed, propagated and checked), agility (support for time delayed replication), failover and recovery.

There were improvements along the lines of performance schemata, security, workbench features. The goal is to be the go-to-database for small and growing businesses on the web.

After that I joined the systemd in Debian talk. Looking forward to systemd support in my next Debian version.

HBase optimisation notes

Lars George’s talk on HBase performance notes was pretty much packed – like any other of the NoSQL (and really also the community/marketing and legal dev room) talks.

Lars started by explaining that by default HBase is configured to reserve 40% of the JVM heap for in memory stores to speed up reading, 20% for the blockcache used for writing and leaves the rest as breath area.

On read HBase will first locate the correct region server and route the request accordingly – this information is cached on the client side for faster access. Prefetching on boot-up is possible to save a few milliseconds on first requests. In order to touch as little files as possible when fetching bloomfilters and time ranges are used. In addition the block cache is queried to avoid going to disk entirely. A hint: Leave as much space as possible for the OS file cache for faster access. When monitoring reads make sure to check the metrics exported by HBase e.g. by tracking them over time in Ganglia.

The cluster size will determine your write performance: HBase files are so-called log structured merge trees. Writes are first stored in memory and in the so-called Write-Ahead-Log (WAL, stored and as a result replicated on HDFS). This information is flushed to disk periodically either when there are too many log files around or the system gets under memory pressure. WAL without pending edits are being discarded.

HBase files are written in an append-only fashion. Regular compactions make sure that deleted records are being deleted.

In general the WAL file size is configured to be 64 to 128 MB. In addition only 32 log files are permitted before a flush is forced. This can be too small a file size or number of log files in periods of high write request numbers and is detrimental in particular as writes sync across all stores, so large cells in one family will cause a lot of writes.

Bypassing the WAL is possible though not recommended as it is the only source for durability there is. It may make sense on derived columns that can easily be re-created in a co-processor on crash.

Too small WAL sizes can lead to compaction storms happening on your cluster: Many small files than have to be merged sequentially into one large file. Keep in mind that flushes happen across column families even if just one family triggers.

Some handy numbers to have when computing write performance of your cluster and sizing HBase configuration for your use case: HDFS has an expected 35 to 50 MB/s throughput. Given different cell size this is how that number translates to HBase write performance:

Cell size OPS
0.5MB 70-100
100kB 250-500
10kB with 800 less than expected as this HBase is not optimised for these sizes
1kB 6000, see above

As a general rule of thumb: Have your memstore be driven by size number of regions and flush size. Have the number of allowed WAL logs before flush be driven by fill and flush rates.. The capacity of your cluster is driven by the JVM heap, region count and size, key distribution (check the talks on HBase schema design). There might be ways to get rid of the Java heap restriction through off-heap memory, however that is not yet implemented.

Keep enough and large enough WAL logs, do not oversubscribe the memstore space, keep the flush size in the right boundaries, check WAL usage on your cluster. Use Ganglia for cluster monitoring. Enable compression, tweak the compaction algorithm to peg background I/O, keep uneven families in separate tables, watch the metrics for blockcache and memstore.

Event, Free Software, Hadoop , ,

Elastic Search meetup Berlin – January 2013

February 1st, 2013 at 6:34pm

The first meetup this year I went to started with a large bag of good news for Elastic Search users. In the offices of Sys Eleven (thanks for hosting) the meetup started at 7p.m. last Tuesday. Simon Willnauer gave an overview of what to expect of the upcoming major release of Elastic Search:

For all 0.20.x version ES features a shard allocator version that is ignorant of which index shards belong to, machine properties, usage patterns. Especially ignoring index information can be detrimental and lead to having all shards of one index on one machine in the end leading to hot spots in your cluster. Today this is solved by lots of manual intervention or even using custom shard allocator implementations.

With the new release there will be an EvenShardCountAllocator that allows for balancing shards of indexes on machines – by default it will behave like the old allocator but can be configured to take weighted factors into account. The implementation will start with basic properties like “which index does this shard belong to” but the goal is to also make variables like remaining disk space available. To avoid constant re-allocation there is a threshold on the delta that has to be passed for re-allocation to kick in.

0.21 will be released when Lucene 4.1 is integrated. That will bring new codecs, concurrent flushing (to avoid the stop-the-world flush during indexing that is used in anything below Lucene 4 – hint: Give less memory to your JVM in order to cause more frequent flushes), there will be compressed sort fields, spellchecking and suggest built into the search request (though unigram only). There will be one similarity configurable per field – that means you can switch from TF-IDF to alternative built-in scoring models or even build your own.

Speaking of rolling your own: There is a new interface for FieldData (used for faceting, scoring and sorting) to allow for specialised data structures and implementations per field. Also the default implementation will be much more memory efficient for most scenarios be using UTF-8 instead of UTF-16 characters).

As for GeoSpatial: The code came to Lucene as a code dump that the contributor wasn’t willing to support or maintain. It was replaced by an implementation that wasn’t that much better. However the community is about to take up the mess and turn it into something better.

After the talk the session essentially changed to an “interactive mailing list” setup where people would ask questions live and get answers both from other users as well as the developers. Some example was the question for recommendability of pyes as a library. Most people had used it, many ran into issues when trying to run an upgrade with features being taken away or behaviour being changed without much notice. There are plans to release Perl, Ruby and Python clients. However also using JRuby, Groovy, Scala or Clojure to communicate with ES works well.

On the benefit of joining the cluster for requests: That safes one hop for routing, result merging, is an option to have a master w/o data and helps with indexing as the data doesn’t go through an additional node.

As for plugins the next thing needed is an upgrade and versioning schema. Concerning plugin reloading without restarting the cluster there was not much ambition to get that into the project from the ES side of things – there is just too much hazzle when it comes to loading and unloading classes with references still hanging around to make that worthwhile.

Speaking of clients: When writing your own don’t rely on the binary protocol. This is a private interface that can be subject to change at any time.

When dealing with AWS: The S3 gateway is not recommended to be used as it is way too slow (and as a result very expensive). Rather backup with replicas, keep the data around for backup or use rsync. When trying to backup across regions this is nothing that ES will help you with directly – rather send your data to both sites and index locally. One recommendation that came from the audience was to not try and use EBS as the IO optimised versions are just too expensive – it’s much more cost effective to rely on ephermeral storage. Another thing to checkout is the support for ES being zone aware to avoid having all shards in one availability zone. Also the node discovery timeout should be increased to at least one minute to work in AWS. When it comes to hosted solutions like heroko you usually are too limited in what you can do with these offers compared to the low maintenance overhead of running your own cluster. Oh, and don’t even think about index encryption if you want to have a fast index without spending hours and hours of development time on speeding your solution up with custom codecs and the like :)

Looking forward to the Elastic Search next meetup end of February – location still to be announced. It’s always interesting to see such meetup groups grow (this time from roughly 15 in November to over 30 in January).

PS: A final shout-out to Hossman - that psychological trick you played on my at your boosting and biasing talk at Apache Con EU is slightly annoying: Everytime someone mentions TF-IDF in a talk (and that isn’t too unlikely in any Lucene, Solr, Elastic Search talks) I panicingly double check whether there are funny pictures on the slide shown! ;)

Free Software, Lucene , ,

Linux vs. Hadoop - some inspiration?

January 16th, 2013 at 8:22pm

This (even for my blog’s standards) long-ish blog post was inspired by a talk given late last year at Apache Con EU as well as from discussions around what constitutes “Apache Hadoop compatibility” and how to make extending Hadoop easier. The post is based on conversations with at least one guy close to the Linux kernel community and another developer working on Hadoop. Both were extremly helpful in answering my questions and sanity checking the post below. After all I’m neither an expert on Linux kernel development and design, nor am I an expert on the detailed design and implementation of features coming up in the next few Hadoop releases. Thanks for your input.

Posting this here as I thought the result of my trials to understand the exact design commonalities and differences better might be interesting for others as well. Disclaimer: This is by no means an attempt to influence current development, it just summarizes some recent thoughts and analysis. As a result I’m happy about comments pointing out additions or corrections - preferably as trackback or maybe on Google Plus as I had to turn of comments on this very blog for spamming reasons.

In his slides on “Insides Hadoop dev” during Apache Con EU:

Steve Loughran included a comparison that popped up rather often already in recent past but still made me think:

“Apache Hadoop is an OS for the datacenter”

It does make a very good point, even though being slightly misleading in my opinion:

  • There are lots of applications that need to run in a datacenter that do not imply having to use Hadoop at all - think mobile application backends, content management systems of publishers, encyclopedia hosting. Growing you may still run into the need for central log processing, scheduling and storing data.
  • Even if your application benefits from a Hadoop cluster you will need a zoo of other projects not necessarily related to the project to successfully run your cluster - think configuration management, monitoring, alerting. Actually many of these topics are on the radar of Hadoop developers - with an intend to avoid the NIH principle and rather integrate better with existing proven standard tools.

However if you do want to do large scale data analysis on largely unstructured data today you will most likely end up using Apache Hadoop.

When talking about operating systems in the free software world inevitably the topic will drift towards the Linux kernel. Being one the most successful free software projects out there from time to time it’s interesting and valuable to look at its history and present in terms of development process, architecture, stakeholders in the development cycle and the way conflicting interests are being dealt with.

Although interesting in many dimensions this blog post focuses just on two related aspects:

  • How to balance innovation for stability in critical parts of the system.
  • How to deal with modularity and API stability from an architectural point of view taking project-external (read: non-mainline) module contributions into account.

The post is not going to deal with just “everything map/reduce” but focus solely on software written specifically to work with Apache Hadoop. In particular Map/Reduce layers plugged on top of existing distributed file systems that ignore data locality guarantees as well as layers on top of existing relational database management systems that ignore easy distribution and fail over are intentionally being ignored.

Balancing innovation with stability

One pain point mentioned during Steve’s talk was the perceived need for a very stable and reliable HDFS that prevents changes and improvements from making it into Hadoop. The rational is very simple: Many customers have entrusted lots (as in not easy to re-create in any reasonable time frame) of critical (as in the service offered degrades substantially when no longer based on that data) data to Hadoop. Even when in a backup Hadoop going down for a file system failure would still be catastrophic as it would take ages to get all systems back to a working state - time that means loosing lots of customer interaction with the service provided.

When glancing over to Linux-land (or Windows, or MacOS really) the situation isn’t much different: Though both backup and recovery are much cheaper there, having to restore a user’s hard-disk just due to some weird programming mistake still is not acceptable. Where does innovation happen there? Well, if you want durability and stability all you do is to use one of the well proven file system implementations - everyone knows names like ext2, xfs and friends. A simple “man mount” will reveal many more. If on the contrary you need some more cutting edge features or want to implement a whole new idea of how a file system should work, you are free to implement your own module or contribute to those marked as EXPERIMENTAL.

If Hadoop really is the OS of the datacenter than maybe it’s time to think about ways that enable users to swap in their prefered file system implementation, maybe it’s time for developers to focus implementation of new features that could break existing deployed systems to separate modules. Maybe it’s time to announce an end-of-support-date for older implementations (unless there are users that not only need support but are willing to put time and implementation effort into maintaining these old versions that is.)

Dealing with modularity and API stability

With the vision of being able to completely replace whole sub-systems comes the question of how to guarantee some sort of interoperability. The market for Hadoop and surrounding projects is already split, it’s hard to grasp for outsiders and newcomers which components work with wich version of Hadoop. Is there a better way to do things?

Looking at the Linux kernel I see some parallels here: There’s components built on top of kernel system calls (tools like ls, mkdir etc. all rely on a fixed set of system calls being available). On the other hand there’s a wide variety of vendors offering kernel drivers for their hardware. Those come in three versions:

  • Some are distributed as part of the mainline kernel (e.g. those for Intel graphics cards).
  • Some are distributed separately but including all source code (e.g. ….)
  • Some are distributed as binary blog with some generic GPLed glue logic (e.g. those provided by NVIDIA for their graphics cards).

Essentially there are two kinds of programming interfaces: ABIs (Application Binary Interfaces) that are being developed against from user space applications like “ls” and friends. APIs (Application Programming Interfaces) that are being developed against by kernel modules like the one by NVIDIA.

Coming back to Hadoop I see some parallelism here: There are ABIs that are being used by user space applications like “hadoop fs -ls” or your average map/reduce application. There are also some sort of APIs that strictly only allow for communication between HDFS, Map/Reduce and applications on top.

The Java ecosystem has a history of having APIs defined and standardised through the JCP and implemented by multiple vendors afterwards. With Apache projects people coming from a plain Java world often wonder why there is no standard that defines the APIs of valuable projects like Lucene or even Hadoop. Even log4j, commons logging and build tooling follow the “defacto standardisation” approach where development defines the API as opposed to a standardisation committee.

Going one step back the natrual question to ask is why there is demand for standardisation. What are the benefits of having APIs standardised? Going through a lengthy standardisation process obviously can’t be the benefit.

Advantages that come to my mind:

  • When having multiple vendors involved that do not want to or cannot communicate otherwise a standardisation committee can provide a neutral ground for communication in particular for the engineers involved.
  • For users there is some higher level document they can refer to in order to compare solutions and see how painful it might be to migrate.

Having been to a DIN/ISO SQL meetup lately there’s also a few pitfalls that I can think of:

  • You really have to make sure that your standard isn’t going to be polluted with things that never get implemented just because someone thought a particular feature could be interesting.
  • Standardisation usually takes a long time (read: mutliple years) until something valuable that than can be adopted and implemented in the industry is created.

More concerns include but are not limited to the problem of testing the standard - when putting the standard into main focus instead of the implementation there is a risk of including features in the standard that are hard or even impossible to implement. There is the risk of running into competing organisations gaming the system, making deals with each other - all leading to compromises that are everything but technologically sensible. There clearly is a barrier to entry when standardisation happens in a professional standards body. (On a related note: At least the German group working on the DIN/ISO standard defining the standard query language in particular in big data environments. Let me know if you would like to get involved.)

Concerning the first advantage (having some neutral ground for vendors to meet): Looking at your average standardisation effort those committees may be neutral ground. However communication isn’t necessarily available to the public for whatever reasons. Compared to the situation little over a decade ago there’s also one major shift in how development is done on successful projects: Software is no longer developed in-house only. Many successful components that enable productivity are developed in the open in a collaborative way that is open to any participant. Httpd, Linux, PHP, Lucene, Hadoop, Perl, Python, Django, Debian and others are all developed by teams spanning continents, cultures and most importantly corporations. Those projects provide a neutral ground for developers to meet and discuss their idea of what an implementation should look like.

Pondering a bit more on where successful projects I know of came from reveals something particularly interesting: ODF first was implemented as part of Open Office and then turned into a standardised format. XMPP was first implemented and than turned into an IETF standardised protocol. Lucene never went for any storage format or even search API standardisation but defined very rigid backwards compatibility guidelines that users learnt to trust. Linux itself never went for ABI standardisation - instead they opted for very strict ABI backwards compat guidelines that developers of user space tools could rely on.

Looking at the Linux kernel in particular the rule is that user facing ABIs are supposed to be backwards compatible: You will always be able to run yesterday’s ls against a newer kernel. One advantage for me as a user is that this way I can easily upgrade the kernel in my system without having to worry about any of the installed user space software.

The picture looks rather different with Linux’ APIs: Those are intentionally not considered holy and subject to change if need be. As a result vendors providing proprietary kernel driver like NVIDIA have the burden of providing updated versions in case they want to support more than one kernel version.

I could imaging a world similar to that for Hadoop: A world in which clients run older versions of Hadoop but are still able to talk to their upgraded clusters. A world in which older MapReduce programs still run when deployed on newer clusters. The only people who would need to worry about API upgrades would be those providing plugins to Hadoop itself or replace components of the system. According to Steve this is what YARN promises: Turn MR into user layer code, have the lower level resource manager for requesting machines near the data.

Hacking, Hadoop , , , ,

RecSys Stammtisch Berlin - December 2012

December 30th, 2012 at 12:40pm

Earlier this month I attended the fourth Recommender Stammtisch in Berlin. The event was kindly hosted by Soundcloud - who on top of organising the speakers provided a really yummy buffet by Kochzeichen D.

With Paul Lamere the evening started with a very entertaining but also very packed talk on why music recommendation is special - or put more generally why all recommender systems are special:

  • Traditionally recommender systems found their way into the wild to drive sales. In music however the main goal is to help users discover new content.
  • Listeners are very different: Ranging from those indifferent to what is being played (imagine someone sitting in a coffee bar enjoying their espresso - it’s unlikely that those would want to influence the playlist of the shop’s entertainment system unless they are really annoyed with it’s content). There are casual listeners who from time to time skip a piece. There are more engaged people who train their own recommender through services like last.fm. Finally there are fanatics that are really into certain kinds of music. Building just one system to fit them all won’t do. Also relying on just one signal won’t do - instead you will have to deal with both, content signals like loudness plots as well as community signals.
  • Music applications tend to be highly interactive. So even if there is little to no reliable explicit feedback people tell you how much you like your music when skipping, turning pieces louder, interacting with the content behind the song being played.
  • In contrast to many other domains music deals with a vast item space and a huge long tail of songs that almost never get interacted with.
  • In contrast to shopping recommenders however in music making mistakes is comparably cheap: In most situations music isn’t purchased on a song by song basis but based on some subscription model. That way the actual cost of playing the wrong song is low. Also songs tend to be not much longer than 5min so also users are less annoyed when confronted with a slightly wrong piece of music.
  • When implementing recommenders for a shopping site it is to be avoided to re-recommend stuff a user has purchased already. This is not the case in music recommendation. Quite the contrary: Re-recommending known music is one indicator for playlists people will like.
  • When it comes to building playlists care must be taken to organise songs in a coherent way, mixing new and familiar songs in a pleasing order - essentially the goal should be to take the listener on a journey.
  • Fast updates are crucial: Music business itself is fast paced with new releases coming out regularly and being taken up very quickly by major broadcasting stations.
  • Music is highly contextual: It pays to know if the user is in the mood for calm or for faster music..
  • There are highly passionate users that are very easy to scare away - those tend to be the loudest ones influencing your user community most.
  • Though meta data is key in music as well, never expect it to be correct. There are all sorts of weird band and song names that you never thought would be possible - an observation that also Ticketmaster made when building their ticket search engine.
  • Music is highly social and irrational - so just knowing your users friends and their tastes won’t get you to being perfect.

Overall I guess the conclusion is that no matter which domain you deal with you will always need to know the exact properties of that domain to build a successful system.

In the second talk Brian McFee explained one way of modeling playlists with context. With that he concentrated on passive music discovery - that is based on one query return a list of music to listen to sequentially as opposed to active retrieval where users issue a query to search for a specific piece of music.

Historically it turned out to be difficult to come up with any playlist generator that is better than randomly selecting songs to play. His model is based on a random walk notian where the vertices are songs and edges represent learnt group similarities. Groups were represented by features like familiarity, social tags, specific audio features, metadata, release dates etc. Depending on the playlist category in most cases he was able to show that his model actually does perform better than random.

In the third talk Oscar Celma showed some techniques to also benefit from some of the more complicated signals for music recommendation. Essentially his take was that by relying on usage signals only you will be stuck with the head of the usage distribution only. What you want though is to be able to provide recommendations for the long tail as well.

Some signals he mentioned included content based features (rythm, BPM, timbre, harmony), usage signals, social signals (beware of people trying to game the system or make fun of it though) and a mix of all those. His recommendation was to put content signals at the end of the processing pipeline for re-ranking and refining playlists.

When providing recommendations it is essential to be able to answer why something was recommended. Even just in the space of novelty vs. relevancy to the user there are four possible strategies: a) recommend only old stuff that is marginally relevant to the specific user: This will end up pulling up mostly popular songs. b) recommend what is new but not relevant to the user: This will end up pulling out songs that turn your user away. c) recommend what is relevant to the user but old, this will mostly surface stuff the user knows already but is a safe bet to play. d) recommend what is both relevant and new to the user - here the real interesting work starts as this deals with recommending genuinely new songs to users.

To balance discovery with safe fallback go for skips, bans, likes and dislikes. Take into account the user context and attention.

The final point the speaker made was the need to take into account the whole picture: Your shiny new recommendation algorithm will just be a tiny piece in the puzzle. Much more work will need to go into data collection and ingestion, into API design.

The last talk finally went into some detail of the history of playlist creation - back from music creators’ choices, via radio station mixes, mix tapes and finally ending up at spotify and fully automatic playlist creation.

There is a vast body of knowledge on how to create successful playlists e.g. among DJs that speak about warm-up phases, chillout times, alternating types of music in order to take the audience on a journey. Even just shuffling music the user already knows can be very powerful given the pool of songs the shuffle is based on neither too large (containing too broad types of music) nor too small (leading to frequent repetitions). According to Ben Fields the science and art of playlist generation and in particular evaluation is still pretty much in it’s infancy with much to come.

General, Mahout, Science ,

Elastic Search meetup Berlin

November 28th, 2012 at 12:39am

Today Retresco hosted the (to my knowledge fourth) Elastic Search User Group Berlin - a group dedicated to using Lucene as part of Elastic Search. With roughly fifteen attendees the meetup attracted a decent crowd - most interestingly many of the people there were already using the software either in production or for closed beta projects.

The fist talk given was by people from ferret-go - a company doing media monitoring for brands focused on the German market. They are pretty new to the search topic, on top they aren’t fluent Java developers but do most stuff in Python. Essentially their whole application is built on top of Elastic Search - most features are implemented as more or less complex search queries. In recent weeks they had to deal with typical problems related to growing data set sizes, nodes getting hot in particular when load balancing isn’t configured quite right and balancing shards on a per index level instead of doing it globally for all shards (particularly bad in their case as they added an index with smaller shards and one with significantly larger shards into ES).

The second talk gave a really nice overview on things to keep in mind before putting ES to production - as is usually the case, the default configuration makes it easy to get started but most likely is not what you want in your production environment.

Really nice, technically focused event. Thanks to Retresco for hosting the meetup including beer and Club Mate and to ElasticSearch.com for paying for the pizza. Check their meetup page for the next event - most likely to be scheduled in January.

Also if you’d like to learn more on Lucene 4 (talk by Simon Willnauer) - make sure to attend the Apache Hadoop Get Together December 2012 (if you need another ticket, it might make sense to politely ask the organiser, make sure you are registered, otherwise most likely you won’t get through security). Other two talks scheduled: Pere Urbon Bayes on NoSQL and Graph, a love story! as well as myself on How to Fail Your Big Data Project Quick and Rapidly.

Get Together, Lucene ,

ApacheConEU - part 10

November 19th, 2012 at 8:34pm

In the next session Jukka introduced Tika - a toolkit for parsing content from files including a heuristics based component for guessing the file type: Based on file extension, magic and certain patterns in the file the file type can be guessed rather reliably. Some anecdotes:

  • not all mime types are registered with IANA, there are of course conflicting file extensions,
  • Microsoft Word not only localises their interface but also the magic in the file,
  • html detection is particularly hard as there is quite some overlap with other file formats (e.g. there are such things as html mails…)
  • xhtml parsing is quite reliable by using an actual xml parser for the first few bytes to identify the namespace of the document
  • identifying odf documents is easy - though zipped the magic is preserved uncompressed at a pre-defined location in the file
  • for ooxml the file has to be unpacked to identify it
  • plain text content is hardest - there are a few heuristics based on UTF BOMs, ASCII stats, line ending stats, byte histograms, still it’s not fool proof.

In addition Tika can extract metadata: For text that can be as easy as encoding, length, content type and comments. For images that is extended by image size and potentially EXIF data. For pdf data it gets even more comprehensive including information on the file creator, save date and more (same applies for MS office documents). Most document metadata is based on the doublin core standard, for images there’s EXIF and IPCT - soon there’ll also be xmb related data that can be discovered.

Apache Con, Lucene , ,

ApacheCon EU - part 09

November 18th, 2012 at 8:54pm

In the Solr track Elastic Search and Solr Cloud went into competition. The comparison itself was slightly apples-and-oranges like as the speaker compared the current ES version based on Lucene 3.x and Solr Cloud based on Lucene 4.0. During the comparison it still turned out that both solutions are more or less comparable - so choice again depends on your application. However I did like the conclusion: The speaker did not pick a clear winner in terms of projects. However he did have another clear winner: The user community will benefit from there being two projects as this kind of felt competition did speed up development already considerably.

The day finished with hoss’ Stump the Chump session: The audience was asked to submit questions before the session, the jury was than asked to pick the winning question that stumped Hoss the most.

Some interesting bits from that question: One guy had the problem of having to provide somewhat diverse results in terms e.g. manufacturers in his online shop. There are a few tricks to deal with this problem: a) clean your data - don’t have items that use keyword spamming side by side with regular entries. Assuming this is done you could b) use grouping to collapse items from the same manufacturer and let the user drill deeper from there. Also using c) a secondary sorting value can help - one hint: Solr ships with a random value out of the box for such cases.

For me the last day started with hossman’s session on boosting and scoring tricks with Solr - including a cute reference for explaining TF-IDF ranking to people (see also a message tweeted earlier for an explanation of what a picture taken during my wedding has to do with ranking documents):


Share photos on twitter with Twitpic

Though TF-IDF is standard IR scoring it probably is not enough for your application. There’s a lot of domain knowledge that you can encode in your ranking:

  • novelty factors - number of ratings/ standard deviation of ratings - ranks controversial items on top that might be more interesting than just having the stuff that everyone loves
  • scarcity - people like buying what is nearly sold out
  • profit margin
  • create your score manually by an external factor - e.g. popularity by association like categories that are more popular than others or items that are more popular depending on the time of day or year

There are a few sledge hammers people usually think of that can turn against you really badly: Say you rank by novelty only - that is you sort by date. The counter example given was the case of the AOL-Time Warner merger - being a big story news papers would post essays on it, do evaluations etc. However also articles only remotely related to it would mention the case. So be the end of the week when searching for it you would find all those little only remotely relevant articles and have to search through all of them to find the really big and important essay.

There are cases where it seems like only recency is all that matters: Filter only for the most recent items and re-try only in case of no results. The counter example is the case where products are released just on a yearly basis but you set the filter to say a month. This way up until May 31 your users will run into the retry branch and get a whole lot of results. However when a new product comes out on June first from that day onward the old results won’t be reachable anymore - leading to a very weird experience for those of your users who saw yesterday’s results.

There were times when scoring was influenced by keyword stuffing to emulate higher scores - don’t do that anymore, Solr does support sophisticated per field and document boosting that make such hacks superfluous.

Instead rather use edismax for weighting fields. Some hints on that one: configure omitNorms in order to avoid having keyword stuffing influence your ranking. Configure omitTermFrequencyAndPosition if the term frequency in any document does not really tell you much e.g. in case of small documents only.

With current versions of Solr you can use your custom scoring per field. In addition a few ones are shipped that come with options for tweaking - like for instance the sweetSpotSimilarity wher you can tell the scorer that up to a certain length no length penalisation should happen.

Create your own boost functions that in addition to TF-IDF rely on rating values, click rates, prices or category influences. There’s even an external file field option to allow you to load your scoring value per document or category from an external file that can be updated on a much more frequent basis than you would otherwise want to re-index all documents in your solr. For those suffering from the “For business reasons this document must come first no matter what the algorithm says” syndrom - there’s a query elevation component for very fine grained tuning of rankings per query. Keep in mind so that this can easily turn into a maintanance nightmare. However it can be handy when quickly fixing a highly valuable business based use case: With that component it is possible to explicitly exclude documents from matching and precisely setting where to rank individual documents.

When it comes to user analytics and personalisation many people think of highly sophisticated algorithms that need lots of data to be trained. Yes Mahout can help you with personalisation and recommendation - but there are a few low hanging fruits to grab before:

  • Use the history of registered users or those you can identify through cookies - track the keywords they are looking for, the sort and filter functions commonly used.
  • Bucket people by explicit or implicit demographics.
  • Even just grouping people by the os and browser they use can help to identify preferences.

All of this information is relatively cheap to get by and can be used in many creative ways:

  • Provide default sort and filter functions for returning users.
  • Filter on the current query but when scoring take the older query into account.
  • Based on category facet used before do boosting in the next search assuming that the two queries are related.

Essentially the goal is to identify three factors for your users: What is their preference, what is the differentiator and what is your confidence in your estimation.

Another option could be to use the SweetSpotPlateau: If someone clicked on a price range facet on the next related query do not hide other prices but boost those that are in the previous facet.

One side effect to keep in mind: Your cache hit rate will go down now you are tailoring your results to individual users or user groups.

Biggest news for Lucene and Solr was the release of Lucene 4 - find more details online in an article published recently.

Apache Con, Lucene , , , , ,

ApacheConEU - part 07

November 16th, 2012 at 8:51pm

Julien Nioche shared some details on the nutch crawler. Being the mother of all Hadoop projects (as in Hadoop was born out of developments inside of nutch) the project has become rather quite with a steady stream of development in the recent past. Julien himself uses the nutch for gathering crawled data for several customer projects - feeding this data into an NLP pipeline based on Behemoth that glues Mahout, UIMA and Gate together.

The basic crawling steps including building the web graph, computing a link based ranking method and indexing are still the same since last I looked at the project - just that for indexing the project now uses solr instead of their own lucene based solution.

The main advantage of nutch is its pluggability: the protocol parser, html filter, url filter, url normaliser all can be exchanged against your own implementations.

In their 2.0 version they moved away from using their own plain hdfs storage to a table schema - mapped to the real database through Gora, an abstraction layer to connect to e.g. Cassandra or HBase. The schema itself is based on Avro but can be adopted to your needs. The advantages are obvious: Though still distributed this approach is much easier and simpler in terms of logic kept in nutch itself. Also it is easier to connect to the data for third parties - all you need is the schema as well as Gora. The current disadvantage lies in it’s configuration overhead and instability compared to the old solution. Most likely at least the latter one will go away as version 2.0 stabelises.

In terms of future work the project focuses on stabilisation, synchronising features of version 1.x and 2.x (link ranking is only available in version 1.x while support for elastic search is only available in version 2.x). In terms of functionality the goal is to move to Solr Cloud, support sitemaps (as implemented by commons crawler), more (pluggable?) indexers.

The goal is to delegate implementations - it was already done for Tika and Solr. Most likely it will also happen for the fetcher, protocol handling, robots.txt handling, url normalisation and filtering, graph processing code and others.

The next talk in the Solr/Lucene talk dealt with scaling Solr to big data. The goal of the speaker was to index 100 million documents - the number of documents was expected to grow in the future. Being under heavy time pressure and having a bash wizard on the project they started building lots of their glue code and components in bash scripts: There were scripts for starting/stopping services, remote indexing, performance monitoring, content extraction, ingestion and deployment. Short term this was a very good idea - it allowed for fast iterations and learning. On the long run they slowly replaced their custom software with standard components (tika for content extraction, puppet for deployment etc.).

They quickly learnt to value property files in order to easily reconfigure the system even in production (relying heavily on bash xml was of course not an option). One problem this came in handy with was adjusting the sharding configuration - going from a simple random sharding to old vs new to monthly they could optimise the configuration to their load. What worked well for them was to separate JVM startup from Solr core startup - they would start with empty solrs and only point them to the data directories aafter verifying that the JVMs booted up correctly.

In terms of performance they learnt to go wide quickly - instead of spending hours on optimising their one huge box they ended up distributing their solrs to multiple separate machines.

In terms of ingestion pipelines: Theirs is currently based on an indexing directory convention, moving the indexing as soon as all data is ingested. The advantage here is the atomicity of mv that can be used - disadvantage is having to move around lots of data. Their goal is to go for hdfs for indexing soon and zookeeper for configuration handling.

In terms of testing: In contrast to having a dev-test-production environment their goal is to have an elastic compute cloud that can be adjusted accordingly. Though EC2 itstelf is very cost intensive, poses additional problems with firewalling and moving data their cloud computing could still be a solution - in particular given projects like cloud stack or open cloud. The goal would be to do cycle scavaging in their own datacenter, do heavy computations when there is not a lot of load on the system and turn those analysis of in case of incoming traffic.

When it comes to testing and monitoring changes they made good experiences with using JConsole (connecting it to several solrs at once through a simple ip discovery script) and solr meter as a performance debugging tool.

Some implementation details: They used Solr as some sort of NoSQL cache as well (for thousands of queries/s), push the schema definition to solr from the app, have common fields and the option for developers to add custom fields in the app. Their experience is to not do expensive stuff in solr but to move that outside - this applies in particular to content extraction. For storage they used an avro based binary file format (mainly in order to save storage space, have a versioned schema and fast compression and de-compression). They are starting to use tika as their pipeline and for auto content detection, scaling up with behemoth. They learnt to upgrade indexes without re-indexing using the lucene upgrade tooling. In addition they use Grimreaper to kill servers if anything goes wrong and restart it later.

Apache Con, Lucene , , , ,

ApacheConEU - part 05

November 14th, 2012 at 8:47pm

The afternoon featured several talks on HBase - both it’s implementation as well as schema optimisation. One major issue in schema design in the choice of key. Simplest recommendation is to make sure that keys are designed such that on reading data load will be evenly distributed accross all nodes to prevent region-server hot-spotting. General advise here are hashing or reversing urls.

When it comes to running your own HBase cluster make sure you know what is going on in the cluster at any point in time:

  • Hbase comes with tools for checking and fixing tables,
  • tools for inspecting hfiles that HBase stores data in,
  • commands for inspecting the binary write ahead log,
  • web interfaces for master and region servers,
  • offline meta data repair tooling.

When it comes to system monitoring make sure to track cluster behaviour over time e.g. by using Ganglia or OpenTSDB and configure your alerts accordingly.

One tip for high traffic sites - it might make sense to disable automatic splitting to avoid splits during peaks and rather postpone them to low traffic times. One rather new project presented to monitor region sizes was Hannibal.

At the end of his talk the speaker went into some more detail on problems encountered when rolling out HBase and lessons learnt:

  • the topic itself was new so both engineering and ops were learning.
  • at scale nothing that was tested on small scale works as advertised.
  • hardware issues will occur, tuning the configuration to your workload is crucial.
  • they used early versions - inevitably leading to stability issues.
  • it’s crucial to know that something bad is building up before all systems start catching fire - monitoring and alerting the right thing is important. With Hadoop there are multiple levels to monitor: the hardware, os, jvm, Hadoop itself, HBase on top. It’s important to correlate the metrics.
  • key- and schema design are key.
  • monitoring and knowledgable operations are important.
  • there are no emergency actions - in case of an emergency it just will take time to recover: Even if there is a backup, even just transferring the data back can take hours and days.
  • HBase (and Hadoop) is DevOps technology.
  • there is a huge learning curve to get to a state that makes using these systems easy.

In his talk on HBase schema design Lars George started out with an introduction to the HBase architecture. On the logical level it’s best to think of HBase as a large distributed hash table - all data except for names are stored as byte arrays (with helpers to transform that data back into well known data types provided). The tables themselves are stored in a sparse format which means that null values essentially come for free.

On a more technical level the project uses zookeeper for master election, split tracking and state tracking. The architecture itself is based on log structured merge trees. All data initially ends up in the write ahead log - with data always being appended to the end this log can be written and ready very efficiently without any seek penalty. The data is inserted at the respected region server in memory (mem store, size of 64 MB typically) and synched to disk in regular intervals. In HBase all files are immutable - modifications are done only by writing new data and merging it in later. Deletes also happen by marking data as deleted instead of really deleting it. On a minor compaction the recently few files are being merged. On a major compaction all files are merged and deletes are being handled. Handling your own major compaction is possible as well.

In terms of performance lookup by key is the best you can do. If you do lookup by value this will result in a full-table scan. There is an option to give HBase a hint as to where to find the key when it is updated only infrequently - there is an option to provide a timestamp of roughly where to look for it. Also there are options to use Bloomfilters for better read performance. Another option is to move more data into the row key itself if that is the data you will be searching for very often. Make sure to de-normalize your data as HBase does not do any sophisticated joins, there will be duplication in your data as all should be pre-joined to get better performance. Have intelligent keys that make match your read/write patterns. Also make sure to keep your keys reasonably short as they are being repeated for each entry - so moving the whole data into the key isn’t going to get you anything.

Speaking of read write patterns - as a general rule of thumb: to optimise for better write performance tune the memstore size. For better read performance tune the block cache size. One final hint: Anything below 8 servers really is just a test setup as it doesn’t get you any real advantages.

Apache Con, Hadoop , ,

ApacheConEU - part 04

November 13th, 2012 at 8:46pm

The second talk I went to was the one on the dev@hadoop.a.o insights given by Steve Loughran. According to Steve Hadoop has turned into what he calls an operating system for the data center - similar to Linux in that it’s development is not driven by a vendor but by its users: Even though Hortenworks, Cloudera and MapR each have full time people working on Hadoop (and related projects), this work usually is driven by customer requirements which ultimately means that someone is running a Hadoop cluster that he has trouble with and wants to have fixed. In that sense the community at large has benefitted a lot from the financial crisis that Y! has slipped into: Most of the Hadoop knowledge that is now spread across companies like Linked.In, Facebook and others comes from engineers leaving Y! and joining one of those companies. With that also the development cycle of Hadoop has changed: While it was mostly driven by Y! schedule in the beginning - crunching out new releases nearly on a monthly basis with a dip around Christmas time it’s got more irregular later - to pick up a more regular schedule just recently.


Image taken a few talks earlier in the same session.

In terms of version numbers: 1.x is to be considered stable, production ready with fixes and low risk patches applied only. For 2.x the picture is a bit different - currently that is in alpha stage, features and fixes go there first, new code is developed there first. However Hadoop is not just http://hadoop.apache.org - the ecosystem is much larger including projects like Mahout, Hama, HBase and Accumulo built on top of Hadoop and Hive, Pig, Sqoop and Flume for integraion and analysis, as well as oozie for coordination an zookeeper for distributed configuration. There’s even more in incubation (or just recently graduated): Kafka for logging, whirr for cloud deployment, giraph for graph processing, ambari for cluster management, s4 for distributed stream processing, hcatalog and templeton for schema management, chuckwa for loggin purposes. All of the latter ones love helping hands. If you want to help out and want to “play Apache” those are the place to go to.

With such a large ecosystem one of the major pain points when rolling out your own Hadoop cluster is integrating all components you need: All projects release on separate release schedules, usually documenting against which version of Hadoop they were built. However finding a working combination is not always trivial. The first place to go to that springs to mind are the standard Linux distributions - however for their release and support cycles (e.g. Debian’s guarantees for stable releases) the pace inside of Hadoop and the speed with which old versions were declared “no longer supported” still is too fast. So what alternatives are there? You can go with Cloudera who ship Apache Hadoop extended with their patches and additional proprietary software. You can opt for Hortenworks that ships the ASF Hadoop only. Or you can opt for Apache itself and either tame the zoo yourself or rely on BigTop that aims to integrate the latest versions.

Even though there are people working fulltime on the project there’s still way more work to do than hands to help out. Some issues do fall through the cracks, in particular if you are the only person affected by the issue. Ultimately this may mean that if the bug only affects you you will have to be the one to fix the issue. Before going out and hacking the source itself it may make sense to go out and search for the stack trace you are looking at, the error message in front of you - look in JIRA and on the mailing list archives to see if there’s someone else who has solved the problem already - and in an ideal case even provided a patch that just didn’t make it into the distribution so far.

Also contributing to Hadoop is a great way to get to work on CS hard problems: There’s distributed computing involved (clearly), there’s consensus implementations like Paxos, there’s work to optimise Hadoop for specific CPU architectures, there’s scheduling and data placement problems to solve, machine learning and graph algorithm implementations and more. If you are into these topics - come and join, those are highly needed skills. People on the list tend to be friendly - they are “just” overloaded.

Of course there are barriers to entry - just like the Linux kernel Hadoop has become business critical for many of its users. Steve’s recommendation for dealing with this circumstance was to not compete with existing work that is being done - instead either concentrate on areas that aren’t covered yet or even better yet collaborate with others. A single lonely developer just cannot reasonably compete.

In terms of commit process Hadoop is running a review-than-commit protocol. Also it makes things seemingly more secure it also means that the development process is a lot slower, that things get neglected, that there is a lot of frustration also on the committers’ side when having to update a patch oftentimes. In addition tests running for a long time doesn’t make contributing substancially easier. With lots of valuable and critical data being stored in HDFS makeing radical changes there also isn’t something that happens easily. The best way to really get started is to get trusted and have a track record: The maintanance cost for abandoned patches that are large, hard to grasp and no longer supported by the original author is just to high.

Get a track record by getting known on dev@ and meetups. Show your knowledge by helping out others, help with patches that are not your own, don’t rewrite core initially, help with building plug-in points (like for shuffling implementations), help with testing across the whole configuration space, testing in unusual settings. Delegate the test at scale to those that have the huge clusters - there will always be issues revealed in their setup that do not cause any problems on a single machine or in a tiny cluster. Also make sure to download the package in the beta phase and test that it works in your problem space. Another way to get involved is by writing better documentation - either online or as a book. Share your experience.

One major challenge is work on major refactorings - there is the option to branch out, switch to a commit-than-review model but that only post-pones work to merge time. For independent works it’s even more complicated to get the changes in. Also integrating post graduate work that can be very valuable isn’t particularly simple - especially if there already is a lack in helping hand s- who’s going to spend the work to mentor those students.

Some ideas to improve the situation would b the help with spreading the knowledge - in local university partnerships, google hangouts, local dev workshops, using git and gerrit for better distributed development and merging.

Apache Con, Hadoop , ,