JAX: Hadoop overview by Bernd Fondermann
After breakfast, the first day started with a talk by Bernd on the Hadoop
ecosystem. He did a good job selecting the most important and interesting
projects related to storing data in HDFS and processing it with Map Reduce.
After the usual "what is Hadoop", "what does the general architecture look
like" and "what will change with YARN", Bernd gave a nice overview of which
publications each of the relevant projects relies on:
- HDFS is mainly based on the paper on GFS.
- Map Reduce comes with its own publication.
- The Bigtable paper mainly inspired Cassandra (to some extent), HBase,
Accumulo and Hypertable.
- Protocol Buffers inspired Avro and Thrift, and is available as free
software itself.
- Dremel (the storage side of things) inspired Parquet.
- The query language side of Dremel inspired Drill and Impala.
- Power Drill might inspire Drill.
- Pregel (a graph processing system) inspired Giraph.
- Percolator provided some inspiration to HBase.
- Dynamo by Amazon kicked off Cassandra and others.
- Chubby inspired Zookeeper; both build on Paxos-style consensus.
- On top of Map Reduce there are tons of higher-level languages today:
starting with Sawzall inside of Google and continuing with Pig and Hive at
Apache, we are now left with additional languages like Cascading, Cascalog,
Scalding and many more.
- There are many other interesting publications (Megastore, Spanner, F1 to
name just a few) for which there is no free implementation yet. Conversely,
with Storm, Hana and Haystack there are implementations that lack a canonical
publication.
After this really broad clarification of names and terms, Bernd went into
some more detail on how Zookeeper is used for determining the active namenode
in Hadoop 2, and how high availability and federation work for namenodes. In
addition he gave a clear explanation of how block reports work on cluster
bootup. The remainder of the talk was reserved for an intro to HBase, Giraph
and Drill.