Clojure Berlin - March 2012

Clojure Berlin - March 2012 #

In today’s Clojure meetup Stefan Hübner gave an introduction to Cascalog - a Clojure library based on Cascading for large scale data processing on Apache Hadoop without hassle.

After a brief overview of what he is using the tool for to do log processing at his day job for http://maps.nokia.com Stefan went into some more detail on why he chose Cascalog over other project that provide abstraction layers on top of Hadoop’s plain map/reduce library: Both Pig and Hive provide easy to learn SQL-like languages to quickly write analysis jobs. The major disadvantage however comes when in need for domain specific operators - in particular when these turn out to be needed just once: Developers end up switching back and forth between e.g. Pig Latin and Java code to accomplish their analysis need. These kinds of one-off analysis tasks are exactly where Cascalog shines: No need to leave the Clojure context, just program your map/reduce jobs on a very high level (Cascalog itself is quite similar to datalog in syntax which makes it easy to read and simple to forget about all the nitty-gritty details of writing map/reduce jobs).

Writing a join to compute persons’ age and gender from a trivial data model is as simple as typing:


;; Persons’ age and gender
(?<- (stdout)
[?person ?age ?gender]
(age ?person ?age)
(gender ?person ?gender)


Multiple sorts of input generators are implemented already: Reading text files, using files in HDFS as input are both common use cases. Of course it is possible to provide your own implementation for that as well to integrate any type of data input in addition to what is available already.

In my view Cascalog combines the speed of development that was brought by Pig and Hive with the flexibility of being able to seemlessly switch to a powerful programming language for anything custom. If you yourself have been using or even contributing to either Cascalog or Cascading: I’d love to see your submission to Berlin Buzzwords - remember, the submission deadline is this week on Sunday MEZ.