Notes on storage options - FOSDEM 05

2013-02-17 20:43
On MySQL

Second day at FOSDEM for me started with the MySQL dev room. One thing that made me smile was in the MySQL new features talk: The speaker announced support for “NoSQL interfaces” to MySQL. That is kind of fun in two ways: A) What he really means is support for the memcached interface – given the vast number of different interfaces to databases today, announcing anything as “supports NoSQL interfaces” sounds kind of silly. B) Many databases refrain from supporting SQL not because they think SQL is an inferior interface, but because they deliberately sacrifice SQL compliance for better performance, Hadoop integration, scaling properties and the like – so this really seems like turning the world upside-down.

As for new features – the new MySQL release improves the query optimiser and subquery support. When it comes to replication there were improvements along the lines of performance (multi-threaded slaves etc.), data integrity (replication checksums being computed, propagated and checked), agility (support for time-delayed replication), failover and recovery.

There were also improvements around the Performance Schema, security and Workbench features. The goal is to be the go-to database for small and growing businesses on the web.

After that I joined the systemd in Debian talk. Looking forward to systemd support in my next Debian version.

HBase optimisation notes

Lars George's talk with HBase performance notes was pretty much packed – like any other of the NoSQL talks (and really also those in the community/marketing and legal dev rooms).



Lars started by explaining that by default HBase is configured to reserve 40% of the JVM heap for the in-memory stores that buffer writes, 20% for the block cache used to speed up reads, and to leave the rest as breathing room.
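
A minimal sketch of how these fractions can be set programmatically (the same keys can of course live in hbase-site.xml); the property names follow the 0.94-era configuration and the values are only an illustration mirroring the defaults mentioned above, not a recommendation:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class HeapTuningSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();

            // Fraction of the region server heap used for memstores (write buffers).
            conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);

            // Fraction of the heap reserved for the block cache (read caching).
            conf.setFloat("hfile.block.cache.size", 0.2f);

            // Whatever is left over stays as general working memory for the JVM.
            System.out.println("memstore fraction:    "
                    + conf.getFloat("hbase.regionserver.global.memstore.upperLimit", -1f));
            System.out.println("block cache fraction: "
                    + conf.getFloat("hfile.block.cache.size", -1f));
        }
    }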

On read HBase will first locate the correct region server and route the request accordingly – this information is cached on the client side for faster access. Prefetching on boot-up is possible to save a few milliseconds on first requests. In order to touch as few files as possible when fetching data, bloom filters and time ranges are used. In addition the block cache is queried to avoid going to disk entirely. A hint: Leave as much space as possible to the OS file cache for faster access. When monitoring reads make sure to check the metrics exported by HBase, e.g. by tracking them over time in Ganglia.
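
For illustration, a read that makes use of those hints might look roughly like the following; this is a hypothetical example against a made-up table using the classic HTable/Get client API, not code from the talk:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReadPathSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "metrics");   // hypothetical table

            Get get = new Get(Bytes.toBytes("row-42"));
            get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"));
            // Narrowing the time range lets HBase skip store files whose time
            // range does not overlap the query at all.
            long now = System.currentTimeMillis();
            get.setTimeRange(now - 3600 * 1000L, now);

            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("value"))));
            table.close();
        }
    }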

The cluster size will determine your write performance: HBase files are so-called log-structured merge trees. Writes are first stored in memory and in the so-called write-ahead log (WAL, stored on HDFS and as a result replicated). This information is flushed to disk periodically, either when there are too many log files around or when the system comes under memory pressure. WALs without pending edits are discarded.

HBase files are written in an append-only fashion. Regular compactions make sure that deleted records actually get removed.

In general the WAL file size is configured to be 64 to 128 MB. In addition only 32 log files are permitted before a flush is forced. In periods of high write volume both the file size and the number of log files can prove too small; this is particularly detrimental as flushes sync across all stores, so large cells in one column family will cause a lot of flushing.
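
The corresponding knobs can be sketched roughly as follows; the property names are those of the 0.92/0.94 line and the values are purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class WalSizingSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();

            // Roll the WAL once it reaches ~95% of its block size (which in turn
            // defaults to the HDFS block size, typically 64-128 MB).
            conf.setFloat("hbase.regionserver.logroll.multiplier", 0.95f);

            // Number of WAL files allowed to pile up before a flush is forced.
            conf.setInt("hbase.regionserver.maxlogs", 32);

            // Per-region memstore flush size in bytes (128 MB here).
            conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        }
    }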

Bypassing the WAL is possible, though not recommended, as it is the only source of durability there is. It may make sense for derived columns that can easily be re-created, e.g. in a co-processor, after a crash.
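
On the client side such a deliberate trade-off looks roughly like the sketch below; depending on the HBase version this is either setWriteToWAL(false) or, in later releases, a Durability hint. The table and column names are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SkipWalSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "derived_counts");   // hypothetical table

            Put put = new Put(Bytes.toBytes("user-123"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes(42L));

            // Skip the write-ahead log: faster, but this cell is lost if the
            // region server crashes before the memstore is flushed. Only do this
            // for data that can be recomputed, e.g. by a co-processor or batch job.
            put.setWriteToWAL(false);

            table.put(put);
            table.close();
        }
    }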

Too small WAL sizes can lead to compaction storms on your cluster: many small files that then have to be merged sequentially into one large file. Keep in mind that flushes happen across column families even if just one family triggers.

Some handy numbers to have when computing the write performance of your cluster and sizing the HBase configuration for your use case: HDFS has an expected throughput of 35 to 50 MB/s. Given different cell sizes this is how that number translates to HBase write performance:






Cell size    Ops per second
0.5 MB       70-100
100 kB       250-500
10 kB        ~800   (less than expected; HBase is not optimised for cells this small)
1 kB         ~6000  (see above)


As a general rule of thumb: Have your memstore size be driven by the number of regions and the flush size. Have the number of WAL logs allowed before a flush be driven by fill and flush rates. The capacity of your cluster is driven by the JVM heap, region count and size, and key distribution (check the talks on HBase schema design). There might be ways to get around the Java heap restriction through off-heap memory, however that is not yet implemented.
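
A back-of-the-envelope calculation along those lines; all numbers are assumptions picked only to show the kind of arithmetic involved:

    public class MemstoreCapacitySketch {
        public static void main(String[] args) {
            long heapBytes = 8L * 1024 * 1024 * 1024;   // assumed 8 GB region server heap
            double memstoreFraction = 0.4;              // global memstore limit from above
            long flushSizeBytes = 128L * 1024 * 1024;   // assumed per-region flush size
            int regionsPerServer = 40;                  // assumed region count
            int familiesPerRegion = 1;

            long memstoreBudget = (long) (heapBytes * memstoreFraction);
            long worstCase = (long) regionsPerServer * familiesPerRegion * flushSizeBytes;

            System.out.printf("memstore budget:      %d MB%n", memstoreBudget / (1024 * 1024));
            System.out.printf("worst case memstores: %d MB%n", worstCase / (1024 * 1024));
            // If the worst case exceeds the budget, flushes are forced early,
            // producing many small files and eventually compaction storms.
            System.out.println(worstCase > memstoreBudget
                    ? "-> oversubscribed: expect premature flushes"
                    : "-> fits: flush size and region count are compatible");
        }
    }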

Keep enough, and large enough, WAL logs, do not oversubscribe the memstore space, keep the flush size within the right boundaries, and check WAL usage on your cluster. Use Ganglia for cluster monitoring. Enable compression, tweak the compaction algorithm to keep background I/O in check, keep uneven column families in separate tables, and watch the metrics for block cache and memstore.

ApacheConEU - part 05

2012-11-14 20:47
The afternoon featured several talks on HBase - both its implementation as well as schema optimisation. One major issue in schema design is the choice of key. The simplest recommendation is to design keys such that on reads the load will be evenly distributed across all nodes to prevent region-server hot-spotting. Common advice here is to hash keys or to reverse URLs.
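
A tiny sketch of both techniques (salting a key with a hash prefix and reversing the domain part of a URL); the helper names are invented for this example:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class KeyDesignSketch {
        // Prefix the key with two bytes of its own hash so that lexicographically
        // close keys (e.g. sequential IDs) spread across regions.
        static byte[] saltedKey(String key) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            byte[] raw = key.getBytes(StandardCharsets.UTF_8);
            byte[] out = new byte[2 + raw.length];
            System.arraycopy(digest, 0, out, 0, 2);
            System.arraycopy(raw, 0, out, 2, raw.length);
            return out;
        }

        // Reverse the domain part of a URL so that all pages of one domain sort
        // together while the load still spreads over the key space.
        static String reversedDomainKey(String host, String path) {
            String[] parts = host.split("\\.");
            StringBuilder sb = new StringBuilder();
            for (int i = parts.length - 1; i >= 0; i--) {
                sb.append(parts[i]);
                if (i > 0) sb.append('.');
            }
            return sb.append(path).toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(reversedDomainKey("blog.example.com", "/2012/11/hbase"));
            // -> com.example.blog/2012/11/hbase
            System.out.println(saltedKey("user-000001").length);   // 2 bytes salt + key
        }
    }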

When it comes to running your own HBase cluster make sure you know what is going on in the cluster at any point in time:

  • HBase comes with tools for checking and fixing tables,
  • tools for inspecting hfiles that HBase stores data in,
  • commands for inspecting the binary write ahead log,
  • web interfaces for master and region servers,
  • offline meta data repair tooling.


When it comes to system monitoring make sure to track cluster behaviour over time e.g. by using Ganglia or OpenTSDB and configure your alerts accordingly.

One tip for high traffic sites - it might make sense to disable automatic splitting to avoid splits during peaks and rather postpone them to low traffic times. One rather new project presented to monitor region sizes was Hannibal.

At the end of his talk the speaker went into some more detail on problems encountered when rolling out HBase and lessons learnt:

  • the topic itself was new so both engineering and ops were learning.
  • at scale nothing that was tested on small scale works as advertised.
  • hardware issues will occur, tuning the configuration to your workload is crucial.
  • they used early versions - inevitably leading to stability issues.
  • it's crucial to know that something bad is building up before all systems start catching fire - monitoring and alerting the right thing is important. With Hadoop there are multiple levels to monitor: the hardware, os, jvm, Hadoop itself, HBase on top. It's important to correlate the metrics.
  • key- and schema design are key.
  • monitoring and knowledgable operations are important.
  • there are no emergency actions - in case of an emergency it will just take time to recover: Even if there is a backup, just transferring the data back can take hours or even days.
  • HBase (and Hadoop) is DevOps technology.
  • there is a huge learning curve to get to a state that makes using these systems easy.



In his talk on HBase schema design Lars George started out with an introduction to the HBase architecture. On the logical level it's best to think of HBase as a large distributed hash table - all data except for names is stored as byte arrays (with helpers provided to transform that data back into well known data types). The tables themselves are stored in a sparse format, which means that null values essentially come for free.
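
For example, the Bytes helper class handles the round trip between Java types and the raw byte arrays HBase stores; a minimal sketch:

    import org.apache.hadoop.hbase.util.Bytes;

    public class BytesSketch {
        public static void main(String[] args) {
            // Row keys, qualifiers and values are all plain byte arrays to HBase.
            byte[] rowKey = Bytes.toBytes("user-42");
            byte[] amount = Bytes.toBytes(1299L);
            byte[] ratio  = Bytes.toBytes(0.75d);

            // The same helper converts back into well known data types.
            System.out.println(Bytes.toString(rowKey));   // user-42
            System.out.println(Bytes.toLong(amount));     // 1299
            System.out.println(Bytes.toDouble(ratio));    // 0.75
        }
    }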

On a more technical level the project uses ZooKeeper for master election, split tracking and state tracking. The architecture itself is based on log-structured merge trees. All data initially ends up in the write-ahead log - with data always being appended to the end, this log can be written and read very efficiently without any seek penalty. The data is then inserted into memory at the respective region server (the mem store, typically 64 MB in size) and synced to disk in regular intervals. In HBase all files are immutable - modifications are done only by writing new data and merging it in later. Deletes also happen by marking data as deleted instead of really deleting it. On a minor compaction the most recent few files are merged. On a major compaction all files are merged and deletes are handled. Triggering major compactions yourself is possible as well.

In terms of performance, lookup by key is the best you can do; lookup by value results in a full-table scan. If a key is updated only infrequently, there is an option to give HBase a hint as to where to find it by providing a timestamp of roughly where to look. Also there are options to use bloom filters for better read performance. Another option is to move more data into the row key itself if that is the data you will be searching for very often. Make sure to de-normalise your data as HBase does not do any sophisticated joins; there will be duplication in your data as everything should be pre-joined to get better performance. Have intelligent keys that match your read/write patterns. Also make sure to keep your keys reasonably short as they are repeated for each entry - so moving the whole data into the key isn't going to get you anything.
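
A sketch of what “moving data into the key” can look like in practice: a composite key of user id plus an inverted timestamp, so that a prefix scan returns a user's newest entries first. All names are hypothetical:

    import org.apache.hadoop.hbase.util.Bytes;

    public class CompositeKeySketch {
        // Key layout: <userId as fixed-width long><Long.MAX_VALUE - timestamp>.
        // Inverting the timestamp makes newer entries sort before older ones.
        static byte[] messageKey(long userId, long timestampMillis) {
            return Bytes.add(Bytes.toBytes(userId),
                             Bytes.toBytes(Long.MAX_VALUE - timestampMillis));
        }

        public static void main(String[] args) {
            byte[] newer = messageKey(42L, 2000000L);
            byte[] older = messageKey(42L, 1000000L);
            // Lexicographic byte order is HBase's sort order: newer sorts first.
            System.out.println(Bytes.compareTo(newer, older) < 0);   // true
        }
    }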

Speaking of read write patterns - as a general rule of thumb: to optimise for better write performance tune the memstore size. For better read performance tune the block cache size. One final hint: Anything below 8 servers really is just a test setup as it doesn't get you any real advantages.

Cloudera in Berlin

2011-11-14 20:24
Cloudera is hosting another round of trainings in Berlin in November this year. In addition to the trainings on Apache Hadoop this time around there will also be trainings on Apache HBase.

Register online via:



FOSDEM - HBase at Facebook Messaging

2011-02-17 20:17

Nicolas Spiegelberg gave an awesome introduction not only to the architecture that powers Facebook messaging but also to the design decisions behind their use of Apache HBase as a storage backend. Disclaimer: HBase is used for message storage; attachments are handled by a different backend, Haystack.

The reasons to go for HBase include its strong consistency model, support for auto failover, load balancing of shards, support for compression, atomic read-modify-write support and the inherent Map/Reduce support.

When going from MySQL to HBase some technological problems had to be solved: Coming from MySQL basically all data was normalised - so in an ideal world, migration would have involved one large join to port all the data over to HBase. As this is not feasible in a production environment, what was done instead was to load all data into an intermediary HBase table, join the data via Map/Reduce and import it all into the target HBase instance. The whole setup was run as a dark launch - fed with parallel live traffic for performance optimisation and measurement.

The goal was zero data loss in HBase - which meant using the append branch of Apache Hadoop HDFS. They re-designed the HBase master in the process to avoid having a single point of failure; backup masters are handled by ZooKeeper. Lots of bug fixes went back from Facebook's engineers to the HBase code base. In addition, for stability reasons, rolling restarts for upgrades were added, as well as performance improvements and consistency checks.

The Apache HBase community received lots of love from Facebook for their willingness to work together with the Facebook team on better stability and performance. Work on improvements was shared between the teams in an amazingly open and inclusive model of development.


One additional hint: FOSDEM videos of all talks including this one have been put online in the meantime.

CFP - Berlin Buzzwords 2011 - search, score, scale

2011-01-26 08:00
This is to announce Berlin Buzzwords 2011 - the second edition of the successful conference on scalable and open search, data processing and data storage in Germany, taking place in Berlin.




Call for Presentations Berlin Buzzwords

http://berlinbuzzwords.de

Berlin Buzzwords 2011 - Search, Store, Scale

6/7 June 2011




The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:




  • IR / Search - Lucene, Solr, katta or comparable solutions
  • NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
  • Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives




Closely related topics not explicitly listed above are welcome. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.



Important Dates (all dates in GMT +2)

  • Submission deadline: March 1st 2011, 23:59 CET
  • Notification of accepted speakers: March 22nd, 2011
  • Publication of final schedule: April 5th, 2011
  • Conference: June 6/7, 2011




High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.



Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no later than March 1st, 2011. Acceptance notifications will be sent out soon after the submission deadline. Please include your name, bio and email, the title of the talk and a brief abstract in English. Please indicate whether you want to give a lightning (10min), short (20min) or long (40min) presentation and indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted at experienced users). If you'd like to pitch your brand new product in your talk, please let us know as well - there will be extra space for presenting new ideas, awesome products and great new projects.



The presentation format is short. We will be enforcing the schedule rigorously.



If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us.



Follow @berlinbuzzwords on Twitter for updates. News on the conference will be published on our website at http://berlinbuzzwords.de.



Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.



Schedule and further updates on the event will be published on http://berlinbuzzwords.de. Please re-distribute this CfP to people who might be interested.



Contact us at:



newthinking communications GmbH
Schönhauser Allee 6/7
10119 Berlin, Germany
Julia Gemählich
Isabel Drost
+49(0)30-9210 596

Devoxx – Day 2 HBase

2010-12-09 21:25
Devoxx featured several interesting case studies of how HBase and Hadoop can be used to scale data analysis back ends as well as data serving front ends.

Twitter



Dmitry Ryaboy from Twitter explained how to scale high-load, large-data systems using Cassandra. Looking at the sheer amount of tweets generated each day it becomes obvious that this site cannot be run on a system like MySQL alone.

Twitter has released several of their internal tools under a free software license for others to re-use – some of them rather straight forward, others more involved. At Twitter each tweet is annotated with a user_id, a time stamp (ok if skewed by a few minutes) as well as a unique tweet_id. To come up with a solution for generating the latter they built a library called snowflake. The rather simple algorithm even works in a cross data-centre set-up: The first bits are composed of the current time stamp, the following bits encode the data centre, and after that there is room for a counter. The tweet_ids are globally ordered by time and distinct across data centres without the need for global synchronisation.
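
A minimal sketch of that kind of id scheme (not Twitter's actual implementation; epoch and bit widths are just plausible assumptions): a 64-bit id composed of a millisecond timestamp, a data-centre/worker id and a per-millisecond sequence counter.

    public class SnowflakeLikeIdSketch {
        private static final long EPOCH = 1288834974657L; // assumed custom epoch
        private static final int WORKER_BITS = 10;        // assumed width for datacentre+worker
        private static final int SEQUENCE_BITS = 12;      // assumed width for the counter

        private final long workerId;
        private long lastTimestamp = -1L;
        private long sequence = 0L;

        public SnowflakeLikeIdSketch(long workerId) {
            this.workerId = workerId;
        }

        public synchronized long nextId() {
            long now = System.currentTimeMillis();
            if (now == lastTimestamp) {
                // Same millisecond: bump the counter (roll-over handling omitted).
                sequence = (sequence + 1) & ((1L << SEQUENCE_BITS) - 1);
            } else {
                sequence = 0L;
                lastTimestamp = now;
            }
            // timestamp | worker | sequence: ids are time-ordered and distinct
            // across data centres without any global synchronisation.
            return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                    | (workerId << SEQUENCE_BITS)
                    | sequence;
        }

        public static void main(String[] args) {
            SnowflakeLikeIdSketch gen = new SnowflakeLikeIdSketch(3);
            System.out.println(gen.nextId());
            System.out.println(gen.nextId());
        }
    }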

With gizzard Twitter released a rather general sharding implementation that is used internally to run distributed versions of Lucene, MySQL as well as Redis (to be introduced for caching tweet timelines due to its explicit support for lists as data structures for values that are not available in memcached).

There is FlockDB for large scale social graph storage and analysis, Rainbird for time series analysis (though with OpenTSDB there is something comparable available for HBase), and Haplocheirus for message vector caching (currently based on memcached, soon to be migrated to Redis for its richer data structures). The queries available through the front-end are rather limited, making it easy to provide pre-computed, optimised versions in the back-end. As with the caching problem, a trade-off between hit rate on the pool of pre-computed items vs. storage cost can be made based on the observed query distribution.

In the back-end of Twitter various statistical and data mining analyses are run on top of Hadoop and HBase to compute potentially interesting followers for users, to extract potentially interesting products etc.
The final take-home message here: Go from requirements to final solution. In the space of storage systems there is no such thing as a silver bullet. Instead you have to carefully evaluate the features and properties of each solution as your data and load increase.

Facebook



When implementing Facebook Messaging (a new feature that was announced this week) Facebook decided to go for HBase instead of Cassandra. The requirements of the feature included massive scale, long-tail write access to the database (which more or less ruled out MySQL and comparable solutions) and a need for strict ordering of messages (which ruled out any eventually consistent system). The decision was made to use HBase.

A team of 15 developers (including operations and frontend) was working on the system for one year before it was finally released. The feature supports integration of Facebook messaging, IM, SMS and mail into one single system, making it possible to group all messages by conversation no matter which device was used to send the message originally. That way each user's inbox turns into a social inbox.

Adobe



Cosmin Lehene presented four use cases of Hadoop at Adobe. The first one dealt with creating and evaluating profiles for the Adobe Media Player. Users would be associated with a vector giving more information on what genres the media they consumed belongs to. These vectors would then be used to generate recommendations for additional content to view in order to increase consumption rate. Adobe built a clustering system that would interface Mahout's canopy and k-means implementations with their HBase backend for user grouping. Thanks Cosmin for including that information in your presentation!

A second use case focussed on finding out more about the usage of Flash on the internet. Using Google to search for Flash content was no good as only the first 2000 results could be viewed, resulting in a highly skewed sample. Instead they used a mixture of Nutch for retrieving the content and HBase for storage. Analysis was done with respect to various features of Flash movies, such as frame rates. The analysis revealed a large gap between the perceived typical usage and the actual usage of Flash on the internet.

The third use case involves analysis of images and usage patterns on the Photoshop-in-a-browser edition at Photoshop.com. The fourth use case dealt with scaling the infrastructure that powers businesscatalyst – a turn-key online business platform solution including analysis, campaigning and more. When purchased by Adobe the system was very successful business-wise, however the infrastructure was by no means able to put up with the load it had to accommodate. Changing to a back-end based on HBase led to better performance and faster report generation.

Devoxx – Day two – Hadoop and HBase

2010-12-08 21:24
In his session on the current state of Hadoop, Tom went into a little more detail on the features in the latest release and on the roadmap for upcoming releases (including Kerberos-based security, append support, a warm standby namenode and others).
He also gave a very interesting view on the current Hadoop ecosystem. More and more projects are being created that either extend Hadoop or are built on top of it. Several of these are run as projects at the Apache Software Foundation, however some are available outside of Apache only. Using graphviz he created a graph of projects depending on or extending Hadoop and from that provided a rough classification of these projects.

As is to be expected, HDFS and Map/Reduce form the very basis of this ecosystem. Right next to them sits ZooKeeper, a distributed coordination and locking service.

Storage systems extending the capabilities of HDFS include HBase, which adds random read/write as well as realtime access to the otherwise batch-oriented distributed file system. With Pig, Hive and Cascading, three projects are making it easier to formulate complex queries for Hadoop. Among the three, Pig is mainly focussed on expressing data filtering and processing, with SQL support being added over time as well. Hive came from the need for SQL formulation on Hadoop clusters. Cascading goes a slightly different way, providing a Java API for easier query formulation. The new kid on the block, sort of, is Plume, a project initiated by Ted Dunning with the goal of coming up with a Map/Reduce abstraction layer inspired by Google's FlumeJava publication.

There are several projects for data import into HDFS. Sqoop can be used for interfacing with RDBMSs. Chukwa and Flume deal with feeding log data into the file system. For general co-ordination and workflow orchestration there is Oozie, originally developed at Yahoo!, as well as support for workflow definition in Cascading.

When storing data in Hadoop it is a common requirement to find a compact, structured representation of the data. Though human readable, XML files are not very compact. However, when using any binary format, schema evolution commonly is a problem: Adding, renaming or deleting fields in most cases causes the need to upgrade all code interacting with the data as well as to re-format already stored data. With Thrift, Avro and Protocol Buffers there are three options available for storing data in a compact, structured binary format. All three projects support schema evolution, not only allowing users to deal with missing data but also providing a means to map old fields to new ones and vice versa.
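
As an illustration of that kind of schema evolution, a small sketch using Avro's generic API: data written with an old schema is read back with a newer schema that adds a field carrying a default value. The record and field names are invented for the example:

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class SchemaEvolutionSketch {
        static final String OLD = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"}]}";
        static final String NEW = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"country\",\"type\":\"string\",\"default\":\"unknown\"}]}";

        public static void main(String[] args) throws Exception {
            Schema writerSchema = new Schema.Parser().parse(OLD);
            Schema readerSchema = new Schema.Parser().parse(NEW);

            // Write a record using the old schema.
            GenericRecord user = new GenericData.Record(writerSchema);
            user.put("name", "isabel");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writerSchema).write(user, encoder);
            encoder.flush();

            // Read it back with the new schema: the added field gets its default.
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            GenericRecord evolved = new GenericDatumReader<GenericRecord>(
                    writerSchema, readerSchema).read(null, decoder);
            System.out.println(evolved.get("name") + " / " + evolved.get("country"));
            // -> isabel / unknown
        }
    }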

Devoxx – University – Cassandra, HBase

2010-12-06 21:20
During the morning session FIXME Ellison gave an introduction to the distributed NoSQL database Cassandra. Generally based on the Dynamo paper from Amazon, the key-value store distributes key/value pairs according to a consistent hashing scheme. Nodes can be added dynamically, making the system well suited for elastic scaling. In contrast to Dynamo, Cassandra can be tuned for the required consistency level. The system is tuned for storing moderately sized key/value pairs; large blobs of data should not be stored in it. A recent addition to Cassandra has been integration with Hadoop's Map/Reduce implementation for data processing. In addition Cassandra comes with a Thrift interface as well as higher level interfaces such as Hector.
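
A toy version of such a consistent-hashing ring, far simpler than Cassandra's actual partitioner and with invented node names, just to show the principle:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class ConsistentHashSketch {
        private final SortedMap<Long, String> ring = new TreeMap<Long, String>();

        // Place each node at several positions on the ring for smoother balance.
        void addNode(String node) throws Exception {
            for (int i = 0; i < 8; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }

        // A key is owned by the first node clockwise from the key's position.
        String nodeFor(String key) throws Exception {
            SortedMap<Long, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }

        private static long hash(String s) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xffL);
            }
            return h;
        }

        public static void main(String[] args) throws Exception {
            ConsistentHashSketch ch = new ConsistentHashSketch();
            ch.addNode("cass-1");
            ch.addNode("cass-2");
            ch.addNode("cass-3");
            System.out.println(ch.nodeFor("user:42"));
            ch.addNode("cass-4");   // adding a node only remaps a fraction of the keys
            System.out.println(ch.nodeFor("user:42"));
        }
    }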

In the afternoon Michael Stack and Jonathan Gray gave an overview of HBase, covering basic installation and going into more detail concerning the APIs. Most interesting to me were the HBase architecture and optimisation details. The system is inspired by Google's BigTable paper. It uses Apache HDFS as its storage back-end, inheriting the failure resilience of HDFS. The system uses ZooKeeper for co-ordination and meta-data storage. However fear not, ZooKeeper comes packaged with HBase, so in case you have not yet set up your own ZooKeeper installation, HBase will do that for you on installation.

HBase is split into a master server for holding meta-data and region servers for storing the actual data. When storing data HBase optimises storage for data locality. This is done in two ways: On write the first copy usually goes to the local disk of the client, so even when storage exceeds the size of one block and remote copies get replicated to differing nodes, at least the first copy gets stored on one machine only. During a compaction phase that is scheduled regularly (usually every 24h, jitter time can be added to lower the load on the cluster) data is re-shuffled for optimal layout. When running Map/Reduce jobs against HBase this data locality can be easily exploited. HBase comes with its own input and output formats for Hadoop jobs.
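
A rough sketch of wiring a Map/Reduce job against an HBase table via the provided input format; table name, class names and the pre-YARN style Job API are assumptions, not code from the talk:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class HBaseMapReduceSketch {
        // Emits one count per row; data locality is exploited because the input
        // splits are aligned with the regions of the table.
        static class RowCountMapper extends TableMapper<Text, IntWritable> {
            @Override
            protected void map(ImmutableBytesWritable key, Result value, Context ctx)
                    throws java.io.IOException, InterruptedException {
                ctx.write(new Text("rows"), new IntWritable(1));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "row count sketch");
            job.setJarByClass(HBaseMapReduceSketch.class);

            Scan scan = new Scan();
            scan.setCaching(500);          // fetch rows in larger batches
            scan.setCacheBlocks(false);    // do not pollute the block cache from MR

            TableMapReduceUtil.initTableMapperJob(
                    "webtable", scan, RowCountMapper.class,
                    Text.class, IntWritable.class, job);
            job.setNumReduceTasks(0);
            job.setOutputFormatClass(NullOutputFormat.class);   // sketch only: discard output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }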

HBase comes not only with Map/Reduce integration, it also publishes a Thrift interface, a REST interface and can be queried from an HBase shell.

Devoxx University – Productive programmer, HBase

2010-12-04 21:17
The first day at Devoxx featured several tutorials – most interesting to me was the one on the productive programmer. The speaker also is the author of the equally named O'Reilly book. The book was the result of the observation that developers today are more and more IDE-bound, no longer able to use the command line effectively. The result is developers that are unnecessarily slow when creating software. The goal was to bring usage patterns of productive software development to juniors who grew up in a GUI-only environment. However, half-way through the book it became apparent that a book on command line wizardry alone is barely interesting at all. So the focus was shifted and now includes more general productivity patterns.
The goal was to accelerate development – mostly by avoiding time consuming usage patterns (minimise mouse usage) and by automating repetitive tasks (computers are good at doing dull, repetitive tasks – that's what they are made for).
The second goal was increasing focus. Two main ingredients to that are switching off anything that disturbs the development flow: No more pop-ups, no more mail notifications, no more flashing side windows. If you have ever had the effect of thinking “So late already?” when your colleagues were going out for lunch – then you know what is meant by being in the flow. It takes up to 20 min to get into this mode – but just a fraction of a second to be thrown out. With developers being significantly more productive in this state it makes sense to reduce the risk of being thrown out.
Third goal was about canonicality, fourth one on automation.
During the morning I hopped on and off the Hadoop talk as well – the tutorial was great to get into the system, Tom White went into detail also explaining several of the most common advanced patterns. Of course not that much new stuff if you sort-of know the system already :)

Apache Con – Hadoop, HBase, Httpd

2010-11-24 23:19
The first Apache Con day featured several presentations on NoSQL databases (track sponsored by Day software), a Hadoop track as well as presentations on Httpd and an Open source business track.

Since its inception Hadoop has always been intended to be run in trusted environments, firewalled from hostile users or even attackers. As such it never really supported any security features. This is about to change with the new Hadoop release including better Kerberos-based security.

When creating files in Hadoop a long awaited feature was append support. Basically, up to now writing to Hadoop was a one-off job: Open a file, write your data, close it and be done. Re-opening and appending data was not possible. This situation is especially bad for HBase as its design relies on being able to append data to an existing file. There have been earlier efforts for adding append support to HDFS, as well as integration of such patches by third party vendors. However, only with a current Hadoop version is append officially supported by HDFS.

A very interesting use case of the Hadoop stack was presented by $name-here from $name. They are using a Hadoop cluster to provide a repository of code released under free software licenses. The business case is to enable corporations to check their source code against existing code and spot license infringements. This does not only include linking to free software under incompatible licenses but also developers copying pieces of free code that originally were available only under copyleft licenses, e.g. copying entire classes or functions into internal projects.

The speaker went into some detail explaining the initial problems they had run into: Obviously it is not a good idea to mix and match Hadoop and HBase versions freely; instead it is best practice to use only versions claimed to be compatible by the developers. Another common mistake is to leave the parameters of both projects at their defaults. The default parameters are supposed to be fool-proof, however they are optimised to work well for Hadoop newbies who want to try out the system on a single node, and they obviously need more attention in a distributed setting. Other anti-patterns include storing only tiny little files in the cluster, thus quickly running out of memory on the namenode (which keeps all file information, including block mappings, in main memory for faster access).

In the NoSQL track Jonathan Gray from Facebook gave a very interesting overview of the current state of HBase. It turns out that Facebook would announce their internal use of HBase for the newly launched Facebook messaging feature only a few days later.

HBase has adopted a release cycle including development/ production releases to get their systems into interested users' hands more quickly: Users willing to try out new experimental features can use the development releases of HBase. Those who are not should go for the stable releases.

After focussing on improving performance in the past months the project is currently focussing on stability: Data loss is to be avoided by all means. ZooKeeper is to be integrated more tightly for storing configuration information, thus enabling live reconfiguration (at least to some extent). In addition HBase is aiming to integrate stored-procedure-like behaviour: As explained in Google's Percolator paper $LINK, batch oriented processing gets you only so far. If data gets added constantly, it makes sense to give up some of the throughput batch-based systems provide and instead optimise for shorter processing cycles by implementing event-triggered processing.

On recommendation of one of neofonie's sys-ops I visited some of the httpd talks: First Rich Bowen gave an overview of unusual tasks one can solve with httpd. The list included things like automatically re-writing HTTP response content to match your application. There is even a spell checker for request URLs: Given that marketing has handed your flyer to the press with a typo in the URL, chances are that the spellchecking module can fix these automatically for each request. Common mistakes covered are switched letters, numbers replaced by letters etc. The performance cost has to be paid only in case no hit could be found – so instead of returning a 404 right away the server first tries to find the document by taking into account common misspellings.