After breakfast, the first day started with a talk by Bernd on the
Hadoop ecosystem. He did a good job selecting the most important and
interesting projects related to storing data in HDFS and processing it with Map
Reduce. After the usual "what is Hadoop", "what does the general architecture
look like" and "what will change with YARN", Bernd gave a nice overview of which
publications each of the relevant projects relies on:
- HDFS is mainly based on the paper on GFS.
- Map Reduce comes with its own publication.
- The Bigtable paper mainly inspired Cassandra (to some extent), HBase,
Accumulo and Hypertable.
- Protocol Buffers inspired Avro and Thrift, and is available as free
- Dremel (the storage side of things) inspired Parquet.
- The query language side of Dremel inspired Drill and Impala.
- Power Drill might inspire Drill.
- Pregel (a graph processing system) inspired Giraph.
- Percolator provided some inspiration to HBase.
- Dynamo by Amazon kicked off Cassandra and others.
- Chubby inspired Zookeeper; both are based on Paxos.
- On top of Map Reduce there are tons of higher-level languages today:
starting with Sawzall inside of Google and continuing with Pig and Hive at
Apache, we are now left with added languages like Cascading, Cascalog, Scalding and
- There are many other interesting publications (Megastore, Spanner, F1 to
name just a few) for which there is no free implementation yet. In addition,
with Storm, Hana and Haystack there are implementations lacking canonical
After this really broad clarification of names and terms used, Bernd went into
some more detail on how Zookeeper is being used for defining the namenode in
Hadoop 2, and how high availability and federation work for namenodes. In
addition he gave a clear explanation of how block reports work on cluster
bootup. The remainder of the talk was reserved for giving an intro to HBase,
Giraph and Drill.
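
To make the block report idea a bit more concrete, here is a minimal sketch. It is not taken from the talk or from the Hadoop code base. It only models the general idea that the namenode does not persist block locations, rebuilds them from the reports datanodes send at startup, and stays in safe mode until enough blocks have been reported. The class name `NameNodeSketch`, its methods and the `threshold` default are all assumptions made up for illustration.

```python
from collections import defaultdict


class NameNodeSketch:
    """Toy model of a namenode collecting block reports at startup."""

    def __init__(self, expected_blocks, threshold=0.999):
        # Block IDs the namenode knows about from its file system metadata.
        self.expected_blocks = set(expected_blocks)
        # Fraction of expected blocks that must be reported before
        # safe mode is left (illustrative value, an assumption).
        self.threshold = threshold
        # block ID -> datanodes holding a replica; rebuilt from reports only.
        self.locations = defaultdict(set)

    def receive_block_report(self, datanode, block_ids):
        """Record every known block the datanode says it stores."""
        for block_id in block_ids:
            if block_id in self.expected_blocks:
                self.locations[block_id].add(datanode)

    def reported_fraction(self):
        """Share of expected blocks with at least one reported replica."""
        if not self.expected_blocks:
            return 1.0
        return len(self.locations) / len(self.expected_blocks)

    def in_safe_mode(self):
        """True while too few blocks have been reported."""
        return self.reported_fraction() < self.threshold
```

With three expected blocks and `threshold=1.0`, a first report from one datanode covering two of the blocks would leave the sketch in safe mode; a second report covering the remaining block would end it.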