Devoxx – University – Cassandra, HBase

During the morning session FIXME Ellison gave an introduction to the distributed NoSQL database Cassandra. Largely based on Amazon's Dynamo paper, the key/value store distributes key/value pairs across nodes according to a consistent hashing scheme. Nodes can be added dynamically, which makes the system well suited for elastic scaling. In contrast to Dynamo, Cassandra lets clients tune the consistency level they require. The system is designed for moderately sized key/value pairs; large blobs of data should not be stored in it. A recent addition to Cassandra is the integration with Hadoop's Map/Reduce implementation for data processing. In addition, Cassandra comes with a Thrift interface as well as higher-level clients such as Hector.
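To illustrate the consistent hashing idea from the Dynamo paper (this is a conceptual sketch, not Cassandra's actual partitioner), here is a minimal Java toy: nodes and keys are hashed onto the same ring, and each key is owned by the first node that follows it on the ring. All class and method names are made up for the example.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hashing ring illustrating the Dynamo idea Cassandra builds on.
public class HashRing {

    private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();

    // Place a node on the ring at the position given by its hash.
    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    // A key is owned by the first node at or after the key's position on the ring,
    // wrapping around to the first node if necessary.
    public String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        Integer position = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(position);
    }

    // Any reasonably uniform hash function will do for the sketch.
    private int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println("key 'user:42' lives on " + ring.nodeFor("user:42"));
    }
}
```

The nice property is that adding a node only moves the keys between the new node and its predecessor on the ring, which is what makes adding nodes dynamically so cheap.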

In the afternoon Michael Stack and Jonathan Gray gave an overview of HBase, covering basic installation and then going into more detail on the APIs. Most interesting to me were the HBase architecture and optimisation details. The system is inspired by Google's BigTable paper. It uses Apache HDFS as its storage back-end, inheriting HDFS's failure resilience, and relies on ZooKeeper for co-ordination and meta-data storage. However, fear not: ZooKeeper comes packaged with HBase, so in case you have not yet set up your own ZooKeeper installation, HBase will do that for you on installation.
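As a rough illustration of the client API covered in the talk, here is a minimal write/read round-trip using the classic HTable client. The table name 'blog' and the column family 'content' are made up for the example, and the code assumes an hbase-site.xml pointing at the cluster is on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal write/read round-trip against a table 'blog' with a column family 'content'.
public class HBaseHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
        HTable table = new HTable(conf, "blog");

        // Store a single cell: row key, column family, qualifier, value.
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("title"), Bytes.toBytes("Devoxx University"));
        table.put(put);

        // Read the cell back.
        Get get = new Get(Bytes.toBytes("row-1"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("title"));
        System.out.println(Bytes.toString(value));

        table.close();
    }
}
```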

HBase is split into a master server, which holds the meta-data, and region servers, which store the actual data. When storing data HBase optimises for data locality. This is done in two ways: on write, the first copy usually goes to the local disk of the writing client (the region server), so even when the data exceeds the size of one block and the remaining replicas are spread across different nodes, at least the first copy ends up on a single machine. During a regularly scheduled compaction phase (usually every 24 hours; jitter can be added to lower the load on the cluster) data is rewritten into an optimal layout. When running Map/Reduce jobs against HBase this data locality can easily be exploited. HBase comes with its own input and output formats for Hadoop jobs, as sketched below.
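The Map/Reduce integration boils down to HBase shipping its own TableInputFormat together with the TableMapReduceUtil helper, which wires a Scan and a TableMapper into a Hadoop job. The sketch below counts the rows of a hypothetical table 'blog' via a Hadoop counter, roughly along the lines of the RowCounter tool that ships with HBase.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Counts the rows of an HBase table with a map-only job that uses HBase's input format.
public class RowCountJob {

    static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Result> {
        enum Counters { ROWS }

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns, Context context) {
            // One counter increment per row; the mapper writes no output at all.
            // Splits are aligned with regions, so mappers mostly read local data.
            context.getCounter(Counters.ROWS).increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-row-count");
        job.setJarByClass(RowCountJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows from the region servers in batches
        scan.setCacheBlocks(false);  // a full scan should not pollute the block cache

        // Wires TableInputFormat, the scan and the mapper into the job.
        TableMapReduceUtil.initTableMapperJob("blog", scan, RowCountMapper.class,
                ImmutableBytesWritable.class, Result.class, job);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

After the job finishes, the result can be read from the ROWS counter in the job's counters.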

Besides Map/Reduce integration, HBase also publishes a Thrift interface and a REST interface, and it can be queried interactively from the HBase shell.