Video: Stefan Hübner on Cascalog

2012-08-28 20:49

Apache Hadoop Get Together Berlin - August 2012

2012-08-15 23:30
Despite beautiful summer weather, roughly 50 people gathered at ImmobilienScout24 for the August 2012 edition of the Apache Hadoop Get Together. (Thanks to ImmoScout for hosting the event and sponsoring drinks and pizza, as well as to David Obermann for organising the meetup.)



Today there were three talks: In the first presentation Dragan Milosevic (also known from his earlier talk at the Hadoop Get Together and his presentation at Berlin Buzzwords) provided more insight into how Zanox manages their internal RPC protocols, in particular when it comes to versioning and upgrading protocol versions. Though in principle simple to solve, this sort of problem is still very common when starting to roll out distributed systems and scaling them over time. The concepts he described were not unlike what is available today in projects like Avro, Thrift or Protocol Buffers. However, by the time Zanox needed versioning support for their client-server applications, none of these projects was a really good fit. This also highlights one important constraint: with communication being a very central component of distributed systems, changing libraries after an implementation has gone to production can be too painful to follow through.
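
To make the versioning idea concrete, here is a minimal Clojure sketch (purely illustrative - the message shapes, field names and version numbers are invented, and this is not Zanox's actual code): every message carries an explicit version tag, and the decoder dispatches on it, so old and new clients can coexist during a rolling upgrade.

;; dispatch on the :version field embedded in every message
(defmulti decode-request :version)

(defmethod decode-request 1 [msg]
  ;; v1 carried the user id under :uid
  {:user-id (:uid msg)
   :query   (:query msg)})

(defmethod decode-request 2 [msg]
  ;; v2 renamed the field and added an optional timeout
  {:user-id (:user-id msg)
   :query   (:query msg)
   :timeout (get msg :timeout 5000)})

(decode-request {:version 1 :uid 42 :query "shoes"})
;; => {:user-id 42, :query "shoes"}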

In the second presentation Stefanie Huber, Manuel Messner and Stephan Friese showed how Gameduell uses Hadoop to provide better data analytics for marketing, BI, developers, product managers and others. Founded in 2003, they have accumulated quite a bit of data consisting of micro transactions (related to payment operations), user activities, and gaming results that are needed for balancing games. Their team turned a hairy, complex system into a pretty clean, Hadoop-based solution: by now all actions end up in a Hadoop cluster (with an option to subscribe to a feed of realtime events). Typically from there people start analysis jobs either in plain map/reduce or in Pig and export the data to external databases for further analysis by BI people, who prefer Hive as a query language since it is much closer to SQL than any of the alternatives. Lately they introduced HCatalog to provide a common view on the data for all three analysis options - and to allow for a more abstract view that does not require knowing the exact filesystem layout to access the data.

After a short break, Stefan Hübner gave the last talk of the evening, introducing Cascalog to the otherwise pretty Java-savvy crowd. Being based on Cascading, Cascalog provides a concise way of formulating queries against a Hadoop cluster (compared to plain map/reduce). When contrasted with Pig or Hive, what stands out is the option to easily and seamlessly integrate additional functions (both map- and reduce-side) into Cascalog scripts without switching languages or abstractions. Note: when it comes to testing Cascalog scripts, one project to look at is Midje.
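
As a taste of what that integration looks like, here is a minimal, hypothetical sketch (the people generator and its data are made up; defmapop, ?<- and stdout come from cascalog.api): a custom map-side operation is plain Clojure and slots directly into a query, with no language switch.

(use 'cascalog.api)
(require '[clojure.string :as string])

;; a custom map-side operation defined in plain Clojure
(defmapop upper-case-op [s]
  (string/upper-case s))

;; given a generator `people` of [?name ?city] tuples, the custom
;; operation is used like any built-in predicate:
(defn shout-cities [people]
  (?<- (stdout) [?name ?shouted]
       (people ?name ?city)
       (upper-case-op ?city :> ?shouted)))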

Overall a really interesting evening with lots of new input and interesting discussions. Always amazing to see what big data applications people in Berlin are developing. It's awesome to see so many development teams adopt still rather new technologies (some even still in the Apache Incubator) for production systems. Looking forward to the next edition - as well as to the slides and videos of today's edition.

Note to self: Clojure with Vim and Maven

2012-07-17 20:07
Steps to get a somewhat working Clojure environment with vim:



Note: There is more convenient tooling for Emacs (see also getting started with Clojure and Emacs) - it's just that my fingers are more used to interacting with vim...

2nd note: This post is not an introduction or walk-through on how to get Clojure set up in vim - it's not even particularly complete. This is intentional - if you want to start tinkering with Clojure: use Emacs! This is just my way to re-discover through Google what I did the other day but forgot in the meantime.

Clojure Berlin - March 2012

2012-03-07 22:37
In today's Clojure meetup Stefan Hübner gave an introduction to Cascalog - a Clojure library based on Cascading for large-scale data processing on Apache Hadoop without hassle.

After a brief overview of what he uses the tool for - log processing at his day job for http://maps.nokia.com - Stefan went into some more detail on why he chose Cascalog over other projects that provide abstraction layers on top of Hadoop's plain map/reduce library: Both Pig and Hive provide easy-to-learn SQL-like languages for quickly writing analysis jobs. The major disadvantage, however, shows up when domain-specific operators are needed - in particular when these turn out to be needed just once: developers end up switching back and forth between e.g. Pig Latin and Java code to accomplish their analysis needs. These kinds of one-off analysis tasks are exactly where Cascalog shines: there is no need to leave the Clojure context - just program your map/reduce jobs at a very high level (Cascalog itself is quite similar to Datalog in syntax, which makes it easy to read and lets you forget about all the nitty-gritty details of writing map/reduce jobs).

Writing a join that computes persons' age and gender from a trivial data model (with age and gender as generators of person/age and person/gender tuples) is as simple as typing:


;; Persons' age and gender: both predicates share ?person,
;; so Cascalog joins the two datasets on that variable
(?<- (stdout) [?person ?age ?gender]
     (age ?person ?age)
     (gender ?person ?gender))


Multiple sorts of input generators are implemented already: reading plain text files and using files in HDFS as input are both common use cases. Of course it is also possible to provide your own generator implementation to integrate any type of data input beyond what is available out of the box.
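
As a hedged example of the common case (the path below is made up), hfs-textline from cascalog.api wraps text files in HDFS as a generator of one-tuples, one per line, which can then feed any query:

(use 'cascalog.api)
(require '[cascalog.ops :as c])

;; count the lines of a text file sitting in HDFS
(let [lines (hfs-textline "hdfs:///logs/2012-03-06")]
  (?<- (stdout) [?count]
       (lines ?line)
       (c/count ?count)))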

In my view Cascalog combines the speed of development that Pig and Hive brought with the flexibility of seamlessly switching to a powerful programming language for anything custom. If you have been using or even contributing to either Cascalog or Cascading: I'd love to see your submission to Berlin Buzzwords - remember, the submission deadline is this week, on Sunday (CET).