Wonder if you should switch from your RDBMS to Apache Hadoop: Don't!

2013-08-26 17:10
Last weekend I spend a lot of fun time at FrOSCon* in Sankt Augustin - always great to catch up with friends in the open source space. As always there were quite a few talks on NoSQL, Hadoop, but also really solid advise on tuning your system for stuff like MySQL (including a side note on PostgreSQL and Oracle) from Kristian Köhntopp. When following some of the discussions in the audience before and after the talk I could not help but shake my head on some of the advise given about HDFS and friends.

This is to give a really short rule of thumb on what project to use for which occasion. Maybe it helps clear some false assumptions. Note: All of the below are most likely gross oversimplifications. Don't use it as hard and fast advise but as a first step towards finding more information with your preferred search engine.

Use Case 1 - relational data Technology
I have usual relational data Use a relational database - think MySQL and friends
I have relational data but my database doesn't perform. Tune your system, go to step 1.
I have relational data but way more reads than one machine can accomodate. Have master-slave replication turned on, configure enough slaves to accommodate your traffic.
I have relational data, but way too much data for a single machine. Start sharding your database.
I have a lot of reads (or writes) and too much data for a single machine. If the sharding+replication pain gets unbearable but you still need strong consistency guarantees start playing with HBase. You might loose the option of SQL but win being able to scale beyond traditional solutions. Hint: Unless your online product is hugely successful switching to HBase usually means you've missed some tuning option.
Use Case 2 - Crawling Technology
I want to store and process a crawl of the internet. Store it as flat files, if you like encode metadata together with the data in protocol buffers, thrift or Avro.
My internet crawl no longer fits on a single disk. Put multiple disks in your machine, RAID them if you like.
Processing my crawl takes too long. Optimise your pipeline. Make sure you utilise all processors in your machine.
Processing the crawl still takes too long. If your data doesn't fit on a single machine, takes way too long to process but there is no bigger machine that you can reasonably pay for you are probably willing to take some pain. Get yourself more than one machine, hook them together, install Hadoop and use either plain map reduce , Pig, Hive or Cascading to process the data. Distribution-wise Apache, Cloudera, MapR, Hortonworks are all good choices.
Use Case 3 - BI Technology
I have structured data and want my business analysts to find great new insights. Use a data warehouse your analysts are most familiar with.
I want to draw conclusions from one year worth of traffic on a busy web site (hint: resulting log files no longer fit on the hard disk of my biggest machine). Stream your logs into HDFS. From here it depends: If it's your developers that want to get their hands dirty, Cascading and depending packages might be a decent idea. There's plenty of UDFs in Pig that will help you as well. If the work is to be done by data analysts that only speak SQL use Hive.
I want to correlate user transactions with social media activity around my brand. See above.

A really short three bullet point summary

  • Use HBase to scale the backend your end users interact with. If you want to trade strong consistency for being able to span multiple datacenters on multiple continents take a look at Cassandra.
  • Use plain HDFS with Hive/Pig/Cascading for batch analysis. This could be business intelligence queries against user transactions, log file analysis for statistics, data extraction steps for internet crawls, social media data or other sensor data.
  • Use Drill or Impala for low latency business intelligence queries.

Good advise from ApacheConEU 2008/9

Back at one of the first ApacheCons I ever attended there was an Apache Hadoop BoF. One of the attendees asked for good reasons to switch from his current working infrastructure to Hadoop. In my opinion the advise he got from Christophe Bisciglia is still valid today. Paraphrased version:
For as long as you wonder why you should be switching to Hadoop, don't.

A parting note: I've left CouchDB, MongoDB, JackRabbit and friends out of the equation. The reason for this is my own lack of first-hand experience with those projects. Maybe someone else can add to the list here.

* A note to the organisers: Thilo and myself married last year in September. So when seeing the term "Fromm" in a speaker's surname doesn't automatically mean that the speaker hotel room should be booked on the name "Thilo Fromm" - the speaker on your list could as well be called "Isabel Drost-Fromm". It was hilarious to have the speaker reimbursement package signed by my husband though this year around I was the one giving a talk at your conference ;)

JAX: Hadoop overview by Bernd Fondermann

2013-05-18 20:29

After breakfast was over the first day started with a talk by Bernd on the
Hadoop ecosystem. He did a good job selecting the most important and
interesting projects related to storing data in HDFS and processing it with Map
Reduce. After the usual "what is Hadoop", "what does the general architecture
look like", "what will change with YARN" Bernd gave a nice overview of which
publications each of the relevant projects rely on:

  • HDFS is mainly based on the paper on GFS.

  • Map Reduce comes with it's own publication.

  • The big table paper mainly inspired Cassandra (to some extend), HBase,
    Accumulo and Hypertable.

  • Protocol Buffers inspired Avro and Thrift, and is available as free
    software itself.

  • Dremel (the storage side of things) inspired Parquet.

  • The query language side of Dremel inspired Drill and Impala.

  • Power Drill might inspire Drill.

  • Pregel (a graph database) inspired Giraph.

  • Percolator provided some inspiration to HBase.

  • Dynamo by Amazon kicked of Cassandra and others.

  • Chubby inspired Zookeeper, both are based on Paxos.

  • On top of Map Reduce today there are tons of higher level languages,
    starting with Sawzall inside of Google, continuing with Pig and Hive at Apache
    we are now left with added languages like Cascading, Cascalog, Scalding and
    many more.

  • There are many other interesting publications (Megastore, Spanner, F1 to
    name just a few) for which there is no free implementation yet. In addition
    with Storm, Hana and Haystack there are implementations lacking canonical

After this really broad clarification of names and terms used, Bernd went into
some more detail on how Zookeeper is being used for defining the namenode in
Hadoop 2, how high availablility and federation works for namenodes. In
addition he gave a clear explanation of how block reports work on cluster
bootup. The remainder of the talk was reserved for giving an intro to HBase,
Giraph and Drill.

Hadoop Summit Amsterdam

2013-05-16 20:27

About a month ago I attended the first European Hadoop Summit, organised by
Hortonworks in Amsterdam. The two day conference brought together both vendors
and users of Apache Hadoop for talks, exhibition and after conference beer

Russel Jurney kindly asked me to chair the Hadoop applied track during
Apache Con EU. As a result I had a good excuse to attend the event. Overall
there were at least three times as many submissions than could reasonably be
accepted. Accordingly accepting proposals was pretty hard.

Though some of the Apache community aspect was missing at Hadoop summit it was
interesting nevertheless to see who is active in this space both as users as
well as vendors.

If you check out the talks on Youtube make sure to not miss the two sessions by
Ted Dunning as well as the talk on handling logging data by Twitter.

ApacheConNA: Hadoop metrics

2013-05-14 20:25

Have you ever measured the general behaviour of your Hadoop jobs? Have you
sized your cluster accordingly? Do you know whether your work load really is IO
bound or CPU bound? Legend has it noone expecpt Allen Wittenauer over at
Linked.In, formerly Y! ever did this analysis for his clusters.

Steve Watt gave a pitch for actually going out into your datacenter measuring
what is going on there and adjusting the deployment accordingly: In small
clusters it may make sense to rely on raided disks instead of additional
storage nodes to guarantee ``replication levels''. When going out to vendors to
buy hardware don't rely on paper calculations only: Standard servers in Hadoop
clusters are 1 or 2u. This is quite unlike beefy boxes being sold otherwise.

Figure out what reference architecture is being used by partners, run your
standard workloads, adjust the configuration. If you want to run the 10TB
Terrasort to benchmark your hardware and system configuration. Make sure to
capture data during all your runs - have Ganglia or SAR, watch out for
intersting behaviour in io rates, cpu utilisation, network traffic. The goal is
to get the cpu busy, not wait for network or disk.

After the instrumentation and trial run look for over- and underprovisionings,
adjust, leather, rinse, repeat.

Also make sure to talk to the datacenter people: There are floor space, power
and cooling constraints to keep in mind. Don't let the whole datacenter go down
because your cpu intensive job is drawing more power than the DC was designed
for. Ther are also power constraints per floor tile due to cooling issues -
those should dictate the design.

Take a close look at the disks you deploy: SATA vs. SAS can make a 40%
performance difference at a 20% cost difference. Also the number of cores per
machines dictates the number of disks to spread the likelyhood of random read
access. As a rule of thumb - in a 2U machine today there should be at least
twelve large form factor disks.

When it comes to controllers he goal should be to get a dedicated lane to disc,
safe one controller if price is an issue. Trade off compute power against power

Designing your network keep in mind that one switch going down means that one
rack will be gone. This may be a non-issue in a Y! size cluster, in your
smaller scale world it might be worth the money investing in a second switch
though: Having 20 nodes go black isn't a lot of fun if you cannot farm out the
work and re-replication to other nodes and racks. Also make sure to have enough
ports in rack switches for the machines you are planning to provision.

Avoid playing the ops whake-a-mole game by having one large cluster in the
organisation than many different ones where possible. Multi-tenancy in Hadoop is
still pre-mature though.

If you want to play with future deployments - watch out for HP currently
packing 270 servers where today are just two via system on a chip designs.

Apache Hadoop Get Together Berlin

2013-05-08 23:41
This evening I joined the group over at Immobilienscout 24 for today's Hadoop Get Together. David Obermann had invited Dr. Falk-Florian Henrich from CeleraOne to talk about their real-time analytics on live data streams.

Their system is being used by the New York Times Springer's Die Welt for traffic analysis. The goal is to identify recurring users that might be willing to pay for the content they want to read. The trade-off here is to keep readers interested long enough to make them pay in the end, instead of scaring them away with a restrictive pay wall which would immediately lead to way less ad revenues.

Currently CeleraOne's system is based on a combination of MongoDB for persistent storage, ZeroMQ for communicating with the revenue engine and http/json for connecting to the controlling web frontend. The live traffic analysis is all done in RAM, while long term storage ends up in MongoDB.

The second speaker was Michael Hausenblas from MapR. He spends most of his time contributing to Apache Drill - an open source implementation of Google's Dremel.

Being an Apache project Drill is developed in an open, meritocratic way - contributors come from various different backgrounds. Currently Drill is in its early stages of development: They have a logical plan, a reference interpreter, a basic SQL parser. There is a demo application. As data backends they support HBase.

For most of the implementation they are trying to re-use existing libraries, e.g. for the columnar storage Drill is looking into either using Twitter's Parquet or Hive ORC file format.

In terms of contributing to the project: There is no need to be a rockstar programmer to make valuable contributions to Apache Drill: Use cases, documentation, test data are all valuable and appreciated by the project.

For more information check out the slide deck (this is an older version - this nights edition most likely soon to be published):

If you missed today's event make sure to get enlisted in the Hadoop Get Together Xing Group so next time you get a notification.

One thing to note though: When registering for the event - please make sure to free your ticket if you cannot make it. I had a few requests from people who would have loved to attend today who didn't get a ticket but would most likely have fit into the room.

Notes on storage options - FOSDEM 05

2013-02-17 20:43

Second day at FOSDEM for me started with the MySQL dev room. One thing that made me smile was in the MySQL new features talk: The speaker announced support for “NoSQL interfaces” to MySQL. That is kind of fun in two dimensions: A) What he really means is support for the memcached interface. Given the vast number of different interfaces to databases today, announcing anything as “supports NoSQL interfaces” sounds kind of silly. B) Given the fact that many databases refrain from supporting SQL not because they think their interface is inferior to SQL but because they sacrifice SQL compliance for better performance, Hadoop integration, scaling properties or others this seems really kind of turning the world upside-down.

As for new features – the new MySQL release improved the query optimiser, subquery support. When it comes to replication there were improvements along the lines of performance (multi threaded slaves etc.), data integrity (replication check sums being computed, propagated and checked), agility (support for time delayed replication), failover and recovery.

There were improvements along the lines of performance schemata, security, workbench features. The goal is to be the go-to-database for small and growing businesses on the web.

After that I joined the systemd in Debian talk. Looking forward to systemd support in my next Debian version.

HBase optimisation notes

Lars George's talk on HBase performance notes was pretty much packed – like any other of the NoSQL (and really also the community/marketing and legal dev room) talks.

Lars started by explaining that by default HBase is configured to reserve 40% of the JVM heap for in memory stores to speed up reading, 20% for the blockcache used for writing and leaves the rest as breath area.

On read HBase will first locate the correct region server and route the request accordingly – this information is cached on the client side for faster access. Prefetching on boot-up is possible to save a few milliseconds on first requests. In order to touch as little files as possible when fetching bloomfilters and time ranges are used. In addition the block cache is queried to avoid going to disk entirely. A hint: Leave as much space as possible for the OS file cache for faster access. When monitoring reads make sure to check the metrics exported by HBase e.g. by tracking them over time in Ganglia.

The cluster size will determine your write performance: HBase files are so-called log structured merge trees. Writes are first stored in memory and in the so-called Write-Ahead-Log (WAL, stored and as a result replicated on HDFS). This information is flushed to disk periodically either when there are too many log files around or the system gets under memory pressure. WAL without pending edits are being discarded.

HBase files are written in an append-only fashion. Regular compactions make sure that deleted records are being deleted.

In general the WAL file size is configured to be 64 to 128 MB. In addition only 32 log files are permitted before a flush is forced. This can be too small a file size or number of log files in periods of high write request numbers and is detrimental in particular as writes sync across all stores, so large cells in one family will cause a lot of writes.

Bypassing the WAL is possible though not recommended as it is the only source for durability there is. It may make sense on derived columns that can easily be re-created in a co-processor on crash.

Too small WAL sizes can lead to compaction storms happening on your cluster: Many small files than have to be merged sequentially into one large file. Keep in mind that flushes happen across column families even if just one family triggers.

Some handy numbers to have when computing write performance of your cluster and sizing HBase configuration for your use case: HDFS has an expected 35 to 50 MB/s throughput. Given different cell size this is how that number translates to HBase write performance:

Cell size OPS
0.5MB 70-100
100kB 250-500
10kB with 800 less than expected as this HBase is not optimised for these sizes
1kB 6000, see above

As a general rule of thumb: Have your memstore be driven by size number of regions and flush size. Have the number of allowed WAL logs before flush be driven by fill and flush rates.. The capacity of your cluster is driven by the JVM heap, region count and size, key distribution (check the talks on HBase schema design). There might be ways to get rid of the Java heap restriction through off-heap memory, however that is not yet implemented.

Keep enough and large enough WAL logs, do not oversubscribe the memstore space, keep the flush size in the right boundaries, check WAL usage on your cluster. Use Ganglia for cluster monitoring. Enable compression, tweak the compaction algorithm to peg background I/O, keep uneven families in separate tables, watch the metrics for blockcache and memstore.

Linux vs. Hadoop - some inspiration?

2013-01-16 20:22
This (even for my blog’s standards) long-ish blog post was inspired by a talk given late last year at Apache Con EU as well as from discussions around what constitutes “Apache Hadoop compatibility” and how to make extending Hadoop easier. The post is based on conversations with at least one guy close to the Linux kernel community and another developer working on Hadoop. Both were extremly helpful in answering my questions and sanity checking the post below. After all I’m neither an expert on Linux kernel development and design, nor am I an expert on the detailed design and implementation of features coming up in the next few Hadoop releases. Thanks for your input.

Posting this here as I thought the result of my trials to understand the exact design commonalities and differences better might be interesting for others as well. Disclaimer: This is by no means an attempt to influence current development, it just summarizes some recent thoughts and analysis. As a result I’m happy about comments pointing out additions or corrections - preferably as trackback or maybe on Google Plus as I had to turn of comments on this very blog for spamming reasons.

In his slides on “Insides Hadoop dev” during Apache Con EU:

Steve Loughran included a comparison that popped up rather often already in recent past but still made me think:

“Apache Hadoop is an OS for the datacenter”

It does make a very good point, even though being slightly misleading in my opinion:

  • There are lots of applications that need to run in a datacenter that do not imply having to use Hadoop at all - think mobile application backends, content management systems of publishers, encyclopedia hosting. Growing you may still run into the need for central log processing, scheduling and storing data.
  • Even if your application benefits from a Hadoop cluster you will need a zoo of other projects not necessarily related to the project to successfully run your cluster - think configuration management, monitoring, alerting. Actually many of these topics are on the radar of Hadoop developers - with an intend to avoid the NIH principle and rather integrate better with existing proven standard tools.

However if you do want to do large scale data analysis on largely unstructured data today you will most likely end up using Apache Hadoop.

When talking about operating systems in the free software world inevitably the topic will drift towards the Linux kernel. Being one the most successful free software projects out there from time to time it’s interesting and valuable to look at its history and present in terms of development process, architecture, stakeholders in the development cycle and the way conflicting interests are being dealt with.

Although interesting in many dimensions this blog post focuses just on two related aspects:

  • How to balance innovation for stability in critical parts of the system.
  • How to deal with modularity and API stability from an architectural point of view taking project-external (read: non-mainline) module contributions into account.

The post is not going to deal with just “everything map/reduce” but focus solely on software written specifically to work with Apache Hadoop. In particular Map/Reduce layers plugged on top of existing distributed file systems that ignore data locality guarantees as well as layers on top of existing relational database management systems that ignore easy distribution and fail over are intentionally being ignored.

Balancing innovation with stability

One pain point mentioned during Steve’s talk was the perceived need for a very stable and reliable HDFS that prevents changes and improvements from making it into Hadoop. The rational is very simple: Many customers have entrusted lots (as in not easy to re-create in any reasonable time frame) of critical (as in the service offered degrades substantially when no longer based on that data) data to Hadoop. Even when in a backup Hadoop going down for a file system failure would still be catastrophic as it would take ages to get all systems back to a working state - time that means loosing lots of customer interaction with the service provided.

When glancing over to Linux-land (or Windows, or MacOS really) the situation isn’t much different: Though both backup and recovery are much cheaper there, having to restore a user’s hard-disk just due to some weird programming mistake still is not acceptable. Where does innovation happen there? Well, if you want durability and stability all you do is to use one of the well proven file system implementations - everyone knows names like ext2, xfs and friends. A simple “man mount” will reveal many more. If on the contrary you need some more cutting edge features or want to implement a whole new idea of how a file system should work, you are free to implement your own module or contribute to those marked as EXPERIMENTAL.

If Hadoop really is the OS of the datacenter than maybe it’s time to think about ways that enable users to swap in their prefered file system implementation, maybe it’s time for developers to focus implementation of new features that could break existing deployed systems to separate modules. Maybe it’s time to announce an end-of-support-date for older implementations (unless there are users that not only need support but are willing to put time and implementation effort into maintaining these old versions that is.)

Dealing with modularity and API stability

With the vision of being able to completely replace whole sub-systems comes the question of how to guarantee some sort of interoperability. The market for Hadoop and surrounding projects is already split, it’s hard to grasp for outsiders and newcomers which components work with wich version of Hadoop. Is there a better way to do things?

Looking at the Linux kernel I see some parallels here: There’s components built on top of kernel system calls (tools like ls, mkdir etc. all rely on a fixed set of system calls being available). On the other hand there’s a wide variety of vendors offering kernel drivers for their hardware. Those come in three versions:

  • Some are distributed as part of the mainline kernel (e.g. those for Intel graphics cards).
  • Some are distributed separately but including all source code (e.g. ….)
  • Some are distributed as binary blog with some generic GPLed glue logic (e.g. those provided by NVIDIA for their graphics cards).

Essentially there are two kinds of programming interfaces: ABIs (Application Binary Interfaces) that are being developed against from user space applications like “ls” and friends. APIs (Application Programming Interfaces) that are being developed against by kernel modules like the one by NVIDIA.

Coming back to Hadoop I see some parallelism here: There are ABIs that are being used by user space applications like “hadoop fs -ls” or your average map/reduce application. There are also some sort of APIs that strictly only allow for communication between HDFS, Map/Reduce and applications on top.

The Java ecosystem has a history of having APIs defined and standardised through the JCP and implemented by multiple vendors afterwards. With Apache projects people coming from a plain Java world often wonder why there is no standard that defines the APIs of valuable projects like Lucene or even Hadoop. Even log4j, commons logging and build tooling follow the “defacto standardisation” approach where development defines the API as opposed to a standardisation committee.

Going one step back the natrual question to ask is why there is demand for standardisation. What are the benefits of having APIs standardised? Going through a lengthy standardisation process obviously can’t be the benefit.

Advantages that come to my mind:

  • When having multiple vendors involved that do not want to or cannot communicate otherwise a standardisation committee can provide a neutral ground for communication in particular for the engineers involved.
  • For users there is some higher level document they can refer to in order to compare solutions and see how painful it might be to migrate.

Having been to a DIN/ISO SQL meetup lately there’s also a few pitfalls that I can think of:

  • You really have to make sure that your standard isn’t going to be polluted with things that never get implemented just because someone thought a particular feature could be interesting.
  • Standardisation usually takes a long time (read: mutliple years) until something valuable that than can be adopted and implemented in the industry is created.

More concerns include but are not limited to the problem of testing the standard - when putting the standard into main focus instead of the implementation there is a risk of including features in the standard that are hard or even impossible to implement. There is the risk of running into competing organisations gaming the system, making deals with each other - all leading to compromises that are everything but technologically sensible. There clearly is a barrier to entry when standardisation happens in a professional standards body. (On a related note: At least the German group working on the DIN/ISO standard defining the standard query language in particular in big data environments. Let me know if you would like to get involved.)

Concerning the first advantage (having some neutral ground for vendors to meet): Looking at your average standardisation effort those committees may be neutral ground. However communication isn’t necessarily available to the public for whatever reasons. Compared to the situation little over a decade ago there’s also one major shift in how development is done on successful projects: Software is no longer developed in-house only. Many successful components that enable productivity are developed in the open in a collaborative way that is open to any participant. Httpd, Linux, PHP, Lucene, Hadoop, Perl, Python, Django, Debian and others are all developed by teams spanning continents, cultures and most importantly corporations. Those projects provide a neutral ground for developers to meet and discuss their idea of what an implementation should look like.

Pondering a bit more on where successful projects I know of came from reveals something particularly interesting: ODF first was implemented as part of Open Office and then turned into a standardised format. XMPP was first implemented and than turned into an IETF standardised protocol. Lucene never went for any storage format or even search API standardisation but defined very rigid backwards compatibility guidelines that users learnt to trust. Linux itself never went for ABI standardisation - instead they opted for very strict ABI backwards compat guidelines that developers of user space tools could rely on.

Looking at the Linux kernel in particular the rule is that user facing ABIs are supposed to be backwards compatible: You will always be able to run yesterday’s ls against a newer kernel. One advantage for me as a user is that this way I can easily upgrade the kernel in my system without having to worry about any of the installed user space software.

The picture looks rather different with Linux’ APIs: Those are intentionally not considered holy and subject to change if need be. As a result vendors providing proprietary kernel driver like NVIDIA have the burden of providing updated versions in case they want to support more than one kernel version.

I could imaging a world similar to that for Hadoop: A world in which clients run older versions of Hadoop but are still able to talk to their upgraded clusters. A world in which older MapReduce programs still run when deployed on newer clusters. The only people who would need to worry about API upgrades would be those providing plugins to Hadoop itself or replace components of the system. According to Steve this is what YARN promises: Turn MR into user layer code, have the lower level resource manager for requesting machines near the data.

ApacheConEU - part 05

2012-11-14 20:47
The afternoon featured several talks on HBase - both it's implementation as well as schema optimisation. One major issue in schema design in the choice of key. Simplest recommendation is to make sure that keys are designed such that on reading data load will be evenly distributed accross all nodes to prevent region-server hot-spotting. General advise here are hashing or reversing urls.

When it comes to running your own HBase cluster make sure you know what is going on in the cluster at any point in time:

  • Hbase comes with tools for checking and fixing tables,
  • tools for inspecting hfiles that HBase stores data in,
  • commands for inspecting the binary write ahead log,
  • web interfaces for master and region servers,
  • offline meta data repair tooling.

When it comes to system monitoring make sure to track cluster behaviour over time e.g. by using Ganglia or OpenTSDB and configure your alerts accordingly.

One tip for high traffic sites - it might make sense to disable automatic splitting to avoid splits during peaks and rather postpone them to low traffic times. One rather new project presented to monitor region sizes was Hannibal.

At the end of his talk the speaker went into some more detail on problems encountered when rolling out HBase and lessons learnt:

  • the topic itself was new so both engineering and ops were learning.
  • at scale nothing that was tested on small scale works as advertised.
  • hardware issues will occur, tuning the configuration to your workload is crucial.
  • they used early versions - inevitably leading to stability issues.
  • it's crucial to know that something bad is building up before all systems start catching fire - monitoring and alerting the right thing is important. With Hadoop there are multiple levels to monitor: the hardware, os, jvm, Hadoop itself, HBase on top. It's important to correlate the metrics.
  • key- and schema design are key.
  • monitoring and knowledgable operations are important.
  • there are no emergency actions - in case of an emergency it just will take time to recover: Even if there is a backup, even just transferring the data back can take hours and days.
  • HBase (and Hadoop) is DevOps technology.
  • there is a huge learning curve to get to a state that makes using these systems easy.

In his talk on HBase schema design Lars George started out with an introduction to the HBase architecture. On the logical level it's best to think of HBase as a large distributed hash table - all data except for names are stored as byte arrays (with helpers to transform that data back into well known data types provided). The tables themselves are stored in a sparse format which means that null values essentially come for free.

On a more technical level the project uses zookeeper for master election, split tracking and state tracking. The architecture itself is based on log structured merge trees. All data initially ends up in the write ahead log - with data always being appended to the end this log can be written and ready very efficiently without any seek penalty. The data is inserted at the respected region server in memory (mem store, size of 64 MB typically) and synched to disk in regular intervals. In HBase all files are immutable - modifications are done only by writing new data and merging it in later. Deletes also happen by marking data as deleted instead of really deleting it. On a minor compaction the recently few files are being merged. On a major compaction all files are merged and deletes are being handled. Handling your own major compaction is possible as well.

In terms of performance lookup by key is the best you can do. If you do lookup by value this will result in a full-table scan. There is an option to give HBase a hint as to where to find the key when it is updated only infrequently - there is an option to provide a timestamp of roughly where to look for it. Also there are options to use Bloomfilters for better read performance. Another option is to move more data into the row key itself if that is the data you will be searching for very often. Make sure to de-normalize your data as HBase does not do any sophisticated joins, there will be duplication in your data as all should be pre-joined to get better performance. Have intelligent keys that make match your read/write patterns. Also make sure to keep your keys reasonably short as they are being repeated for each entry - so moving the whole data into the key isn't going to get you anything.

Speaking of read write patterns - as a general rule of thumb: to optimise for better write performance tune the memstore size. For better read performance tune the block cache size. One final hint: Anything below 8 servers really is just a test setup as it doesn't get you any real advantages.

ApacheConEU - part 04

2012-11-13 20:46
The second talk I went to was the one on the dev@hadoop.a.o insights given by Steve Loughran. According to Steve Hadoop has turned into what he calls an operating system for the data center - similar to Linux in that it's development is not driven by a vendor but by its users: Even though Hortenworks, Cloudera and MapR each have full time people working on Hadoop (and related projects), this work usually is driven by customer requirements which ultimately means that someone is running a Hadoop cluster that he has trouble with and wants to have fixed. In that sense the community at large has benefitted a lot from the financial crisis that Y! has slipped into: Most of the Hadoop knowledge that is now spread across companies like Linked.In, Facebook and others comes from engineers leaving Y! and joining one of those companies. With that also the development cycle of Hadoop has changed: While it was mostly driven by Y! schedule in the beginning - crunching out new releases nearly on a monthly basis with a dip around Christmas time it's got more irregular later - to pick up a more regular schedule just recently.

Image taken a few talks earlier in the same session.

In terms of version numbers: 1.x is to be considered stable, production ready with fixes and low risk patches applied only. For 2.x the picture is a bit different - currently that is in alpha stage, features and fixes go there first, new code is developed there first. However Hadoop is not just http://hadoop.apache.org - the ecosystem is much larger including projects like Mahout, Hama, HBase and Accumulo built on top of Hadoop and Hive, Pig, Sqoop and Flume for integraion and analysis, as well as oozie for coordination an zookeeper for distributed configuration. There's even more in incubation (or just recently graduated): Kafka for logging, whirr for cloud deployment, giraph for graph processing, ambari for cluster management, s4 for distributed stream processing, hcatalog and templeton for schema management, chuckwa for loggin purposes. All of the latter ones love helping hands. If you want to help out and want to "play Apache" those are the place to go to.

With such a large ecosystem one of the major pain points when rolling out your own Hadoop cluster is integrating all components you need: All projects release on separate release schedules, usually documenting against which version of Hadoop they were built. However finding a working combination is not always trivial. The first place to go to that springs to mind are the standard Linux distributions - however for their release and support cycles (e.g. Debian's guarantees for stable releases) the pace inside of Hadoop and the speed with which old versions were declared "no longer supported" still is too fast. So what alternatives are there? You can go with Cloudera who ship Apache Hadoop extended with their patches and additional proprietary software. You can opt for Hortenworks that ships the ASF Hadoop only. Or you can opt for Apache itself and either tame the zoo yourself or rely on BigTop that aims to integrate the latest versions.

Even though there are people working fulltime on the project there's still way more work to do than hands to help out. Some issues do fall through the cracks, in particular if you are the only person affected by the issue. Ultimately this may mean that if the bug only affects you you will have to be the one to fix the issue. Before going out and hacking the source itself it may make sense to go out and search for the stack trace you are looking at, the error message in front of you - look in JIRA and on the mailing list archives to see if there's someone else who has solved the problem already - and in an ideal case even provided a patch that just didn't make it into the distribution so far.

Also contributing to Hadoop is a great way to get to work on CS hard problems: There's distributed computing involved (clearly), there's consensus implementations like Paxos, there's work to optimise Hadoop for specific CPU architectures, there's scheduling and data placement problems to solve, machine learning and graph algorithm implementations and more. If you are into these topics - come and join, those are highly needed skills. People on the list tend to be friendly - they are "just" overloaded.

Of course there are barriers to entry - just like the Linux kernel Hadoop has become business critical for many of its users. Steve's recommendation for dealing with this circumstance was to not compete with existing work that is being done - instead either concentrate on areas that aren't covered yet or even better yet collaborate with others. A single lonely developer just cannot reasonably compete.

In terms of commit process Hadoop is running a review-than-commit protocol. Also it makes things seemingly more secure it also means that the development process is a lot slower, that things get neglected, that there is a lot of frustration also on the committers' side when having to update a patch oftentimes. In addition tests running for a long time doesn't make contributing substancially easier. With lots of valuable and critical data being stored in HDFS makeing radical changes there also isn't something that happens easily. The best way to really get started is to get trusted and have a track record: The maintanance cost for abandoned patches that are large, hard to grasp and no longer supported by the original author is just to high.

Get a track record by getting known on dev@ and meetups. Show your knowledge by helping out others, help with patches that are not your own, don't rewrite core initially, help with building plug-in points (like for shuffling implementations), help with testing across the whole configuration space, testing in unusual settings. Delegate the test at scale to those that have the huge clusters - there will always be issues revealed in their setup that do not cause any problems on a single machine or in a tiny cluster. Also make sure to download the package in the beta phase and test that it works in your problem space. Another way to get involved is by writing better documentation - either online or as a book. Share your experience.

One major challenge is work on major refactorings - there is the option to branch out, switch to a commit-than-review model but that only post-pones work to merge time. For independent works it's even more complicated to get the changes in. Also integrating post graduate work that can be very valuable isn't particularly simple - especially if there already is a lack in helping hand s- who's going to spend the work to mentor those students.

Some ideas to improve the situation would b the help with spreading the knowledge - in local university partnerships, google hangouts, local dev workshops, using git and gerrit for better distributed development and merging.

Video: Stefan Hübner on Cascalog

2012-08-28 20:49