Notes on storage options - FOSDEM 05

2013-02-17 20:43

Second day at FOSDEM for me started with the MySQL dev room. One thing that made me smile was in the MySQL new features talk: The speaker announced support for “NoSQL interfaces” to MySQL. That is kind of fun in two dimensions: A) What he really means is support for the memcached interface. Given the vast number of different interfaces to databases today, announcing anything as “supports NoSQL interfaces” sounds kind of silly. B) Given the fact that many databases refrain from supporting SQL not because they think their interface is inferior to SQL but because they sacrifice SQL compliance for better performance, Hadoop integration, scaling properties or others this seems really kind of turning the world upside-down.

As for new features – the new MySQL release improved the query optimiser, subquery support. When it comes to replication there were improvements along the lines of performance (multi threaded slaves etc.), data integrity (replication check sums being computed, propagated and checked), agility (support for time delayed replication), failover and recovery.

There were improvements along the lines of performance schemata, security, workbench features. The goal is to be the go-to-database for small and growing businesses on the web.

After that I joined the systemd in Debian talk. Looking forward to systemd support in my next Debian version.

HBase optimisation notes

Lars George's talk on HBase performance notes was pretty much packed – like any other of the NoSQL (and really also the community/marketing and legal dev room) talks.

Lars started by explaining that by default HBase is configured to reserve 40% of the JVM heap for in memory stores to speed up reading, 20% for the blockcache used for writing and leaves the rest as breath area.

On read HBase will first locate the correct region server and route the request accordingly – this information is cached on the client side for faster access. Prefetching on boot-up is possible to save a few milliseconds on first requests. In order to touch as little files as possible when fetching bloomfilters and time ranges are used. In addition the block cache is queried to avoid going to disk entirely. A hint: Leave as much space as possible for the OS file cache for faster access. When monitoring reads make sure to check the metrics exported by HBase e.g. by tracking them over time in Ganglia.

The cluster size will determine your write performance: HBase files are so-called log structured merge trees. Writes are first stored in memory and in the so-called Write-Ahead-Log (WAL, stored and as a result replicated on HDFS). This information is flushed to disk periodically either when there are too many log files around or the system gets under memory pressure. WAL without pending edits are being discarded.

HBase files are written in an append-only fashion. Regular compactions make sure that deleted records are being deleted.

In general the WAL file size is configured to be 64 to 128 MB. In addition only 32 log files are permitted before a flush is forced. This can be too small a file size or number of log files in periods of high write request numbers and is detrimental in particular as writes sync across all stores, so large cells in one family will cause a lot of writes.

Bypassing the WAL is possible though not recommended as it is the only source for durability there is. It may make sense on derived columns that can easily be re-created in a co-processor on crash.

Too small WAL sizes can lead to compaction storms happening on your cluster: Many small files than have to be merged sequentially into one large file. Keep in mind that flushes happen across column families even if just one family triggers.

Some handy numbers to have when computing write performance of your cluster and sizing HBase configuration for your use case: HDFS has an expected 35 to 50 MB/s throughput. Given different cell size this is how that number translates to HBase write performance:

Cell sizeOPS
10kBwith 800 less than expected as this HBase is not optimised for these sizes
1kB6000, see above

As a general rule of thumb: Have your memstore be driven by size number of regions and flush size. Have the number of allowed WAL logs before flush be driven by fill and flush rates.. The capacity of your cluster is driven by the JVM heap, region count and size, key distribution (check the talks on HBase schema design). There might be ways to get rid of the Java heap restriction through off-heap memory, however that is not yet implemented.

Keep enough and large enough WAL logs, do not oversubscribe the memstore space, keep the flush size in the right boundaries, check WAL usage on your cluster. Use Ganglia for cluster monitoring. Enable compression, tweak the compaction algorithm to peg background I/O, keep uneven families in separate tables, watch the metrics for blockcache and memstore.

Apache Con – last day

2010-11-27 23:23
Day three of Apache Con started with interesting talks on Tomcat 7, including an introduction to the new features of that release. Those include better memory leak prevention and detection capabilities – the implementation of these capabilities have lead to the discovery of various leaks that appear under more or less weird circumstances in famous open source libraries and the JVM itself. But also better management and reporting facilities are part of the new release.

As I started the third day over at the Tomcat track, unfortunately I missed the Tika and Nutch presentations by Chris Mattman – so happy, that at least the slides were published online: $LINK. The development of nutch was especially interesting for me as that was the first Apache project I got involved with back in 2004. Nutch started out as a project with the goal of providing an open source alternative internet-scale search engine. Based on Lucene as a indexer kernel, it also providing crawling, content extraction and link analysis.

Focussed on building an internet scale search engine the need for a distributed processing environment quickly became apparent. Initial implementations of a nutch distributed file system and a map reduce engine lead to the creation of the Apache Hadoop project.

In recent years it was comparably quiet around nutch. Besides Hadoop also content extraction was factored out of the project into Apache Tika. At the moment development is getting more momentum again. Future developments are supposed to be focussed on building an efficient crawling engine. As storage backend the project wants to leverage Apache HBase, for content extraction Tika is to be used, as indexing backend Solr.

I loved the presentation by Geoffrey Young on how they used Solr to replace their old MySQL search based system for better performance and more features at Ticketmaster. Indexing documents representing music CDs presents some special challenges when it comes to domain modeling: There are bands with names like “!!!”. In addition users are very likely to misspell certain artists names. In contrast to large search providers like Google these businesses usually have neither human resources nor enough log data to provide comparable coverage e.g. when implementing spell-checking. A very promising and agile approach taking instead was to parse log files for most common failing queries and from that learn more about features needed by users: There were many queries including geo information coming from users looking for an event at one specific location. As a result geo information was added to the index leading to happier users.