JAX: Hadoop overview by Bernd Fondermann

2013-05-18 20:29


After breakfast was over the first day started with a talk by Bernd on the
Hadoop ecosystem. He did a good job selecting the most important and
interesting projects related to storing data in HDFS and processing it with Map
Reduce. After the usual "what is Hadoop", "what does the general architecture
look like", "what will change with YARN" Bernd gave a nice overview of which
publications each of the relevant projects rely on:




  • HDFS is mainly based on the paper on GFS.

  • Map Reduce comes with it's own publication.

  • The big table paper mainly inspired Cassandra (to some extend), HBase,
    Accumulo and Hypertable.

  • Protocol Buffers inspired Avro and Thrift, and is available as free
    software itself.

  • Dremel (the storage side of things) inspired Parquet.

  • The query language side of Dremel inspired Drill and Impala.

  • Power Drill might inspire Drill.

  • Pregel (a graph database) inspired Giraph.

  • Percolator provided some inspiration to HBase.

  • Dynamo by Amazon kicked of Cassandra and others.

  • Chubby inspired Zookeeper, both are based on Paxos.

  • On top of Map Reduce today there are tons of higher level languages,
    starting with Sawzall inside of Google, continuing with Pig and Hive at Apache
    we are now left with added languages like Cascading, Cascalog, Scalding and
    many more.

  • There are many other interesting publications (Megastore, Spanner, F1 to
    name just a few) for which there is no free implementation yet. In addition
    with Storm, Hana and Haystack there are implementations lacking canonical
    publications.



After this really broad clarification of names and terms used, Bernd went into
some more detail on how Zookeeper is being used for defining the namenode in
Hadoop 2, how high availablility and federation works for namenodes. In
addition he gave a clear explanation of how block reports work on cluster
bootup. The remainder of the talk was reserved for giving an intro to HBase,
Giraph and Drill.

BigDataCon

2013-05-17 20:29


Together with Uwe Schindler I had published a series of articles on Apache
Lucene at Software and Support Media's Java Mag several years ago. Earlier this
year S&S kindly invited my to their BigDataCon - co-located with JAX to give a
talk of my choosing that at least touches upon Lucene.


Thinking back and forth about what topic to cover what came to my mind was to
give a talk on how easy it is to do text classification with Mahout when
relying on Apache Lucene for text analysis, tokenisation and token filtering.
All classes essentially are in place to integrate Lucene Analyzers with Mahout
vector generation - needed e.g. as a pre-processing step for classification or
text clustering.


Feel free to check out some of my sandbox code over at <a
href=``http://github.org/MaineC/sofia''>github</a>.


After attending the conference I can only recommend everyone interested in Java
programming and able to understand German to buy a ticket for the conference.
It's really well executed, great selection of talks (though the sponsored
keynotes usually aren't particularly interesting), tasty meals, interesting
people to chat with.