Video: Stefan Hübner on Cascalog

2012-08-28 20:49

Apache Hadoop Get Together Berlin - August 2012

2012-08-15 23:30
Despite beautiful summer weather roughly 50 people gathered at ImmobilienScout24 for the August 2012 edition of the Apache Hadoop Get Together (Thanks again for hosting the event and sponsoring drinks and pizza to ImmoScout as well as to David Obermann for organising the meetup.

Today there were three talks: In the first presentation Dragan Milosevic (also known from his talk at the Hadoop GetTogether and his presentation at Berlin Buzzwords) provided more insight as to how Zanox is managing their internal RPC protocols in particular when it comes to versioning and upgrading protocol versions. Though in principle very simple to do this sort of problem still is very common when starting to roll out distributed systems and scaling them over time. The concepts he described were not unlike what is available today in projects like Avro, Thrift or protocol buffers. However by the time they needed versioning support for their client server applications neither of these projects was a really good fit. This also highlights one important constraint: With communication being a very central component in distributed systems, changing libraries after an implementation went to production can be too painful to be followed through.

In the second presentation Stefanie Huber, Manuel Messner and Stephan Friese showed how Gameduell is using Hadoop to provide better data analytics for marketing, BI, developers, product managers Founded in 2003 they have a accumulated quite a bit of data consisting of micro transactions (related to payment operations), user activities, gaming results that need to be used for balancing games. Their team turned a hairy, complex system into a pretty clean, Hadoop based solution: By now all actions end up in a Hadoop cluster (with an option to subscribe to a feed for realtime events). Typically from there people would start analysis jobs either in plain map reduce or in pig and export the data to external databases for further analysis by BI people who preferred Hive as a query language as it is much closer to SQL than any of the alternatives. As of late they introduced HCatalog to support providing a common view on data for all three analysis options - in addition to allowing for a more abstract view of the data available that does not require knowing the exact filesystem structure to access the data.

After a short break in the last talk of the evening Stefan Hübner introduced Cascalog to the otherwise pretty Java-savvy crowd. Being based on Cascading Cascalog provides for a concise way of formulating queries to a Hadoop cluster (compared to plain map reduce). Also when contrasted with Pig or Hive what stands out is the option to easily and seemlessly integrate additional functions (both map- and reduce-side) into Cascalog scripts without switching languages or abstractions. Note: When testing Cascalog scripts, one project to look at is Midje.

Overall a really interesting evening with lots of new input, interesting discussions and new input. Always amazing to see what other big data applications people in Berlin are developing. It's awesome to see so many development teams adopt seemingly new technologies (some even still in the Apache Incubator) for production systems. Looking forward to the next edition - as well as to the slides and videos of today's edition.

Apache Hadoop Get Together Berlin

2012-07-23 20:41
As seen on Xing - the next Apache Hadoop Get Together is planned to take place in August:

When: 15. August, 18 p.m.

Where: Immobilien Scout GmbH, Andreasstr. 10, 10243 Berlin

As always there will be slots of 30min each for talks on your Hadoop topic. After each talk there will be time for discussion.

It is important to indicate attendance. Only registered visitors will be permitted to attend.

Register here:

Talks scheduled thus far:

Dragan Milosevic

Robust Communication Mechanisms in zanox Reporting Systems

It happened an annoying number of times that we wanted to improve only one particular component in our distributed reporting system, but often had to update almost everything due to the RPC version-mismatch, which occurred in a communication between the updated component and the rest of our system. To mitigate this problem and to significantly simplify the integration of new components, we extended the used RPC protocol to perform a version handshake before the actual communication starts. This RPC extension is accompanied with serialisation/deserialization methods, which are downward compatible due to being able to successfully deserialise any
serialised older version of exchanged objects. Putting together these extensions makes it possible for us to successfully operate multiple versions of frontend and backend components, and to have the power to autonomously decide what and when should be updated/improved in our distributed reporting system.

Two other talks are planned and I will provide you with further information soon.

A big Thank You goes to Immobilien Scout GmbH for providing the venue at no cost for our event and for sponsoring the videotaping of the presentations.

Looking forward to seeing you in Berlin,


Berlin Hadoop Get Together (April 2012)- videos are up

2012-04-23 14:22

Apache Hadoop Get Together Berlin December 2011

2011-12-08 01:50
First of all a huge Thank You to David Obermann for organising today's Apache Hadoop Get Together Berlin: After a successful Berlin Buzzwords and a rather long pause following that finally a Christmas meetup took place today at Smarthouse, kindly sponsored by Axel Springer and organised by David Obermann from idealo. About 40 guests from Neofonie, Nokia, Amen, StudiVZ, Gameduell, TU Berlin, nurago, Soundcloud, and many others made it to the event.

In the first presentation Douwe Osinga from triposo went into some details on what Triposo is all about, how development for it differs in terms of scope and focus at larger corporations and what patterns they use for getting the data crawled, cleaned and served to users.

The goal of Triposo is to be able to build travel guides in a fully automated way. In contrast to simply creating a catalog of places to go to the goal is to have an application that is competitive to Lonely Planet books: Have tours, detailed background information, recommend places to visit based on wheather and seasonal signals, allow users to create travel books.

Joining Triposo from Google, Douwe gave a rather interesting perspective on what makes a startup interesting for innovative ideas. There are four interesting aspects of application development that according to his talk matter for Google projects: First is embracing failure. Not only can single hard disks fail, but servers might be switched off automatically for maintenance, even entire datacenters going offline must not affect your application. Second is a strong focus on speed: Developers working with dynamic languages like Python that allow for rapid prototyping at the expense of slower runtime are generally frowned upon. Third building block is the focus on search that is ingrained in every piece of architecture and thinking. Fourth and last is a strong mentality to build your own which may lead to great software but leaves software developers in an isolated island of proprietary software that can limit but at least shapes your way of thinking.

He gave Youtube as an example: Though built on top of MySQL, implemented in Python and certainly not failure proof in every aspect they succeeded by concentrating on users' needs, time to market and iteratively improving their software with a frequent (as in one week) develop-release-deploy cycle. When entering new markets and providing innovative applications it often is crucial to be able to move quickly at the expense of speed and stability. It certainly is important to consider different architectures and chose the one that is appropriate to solve the problem at hand. Same reasoning applies for Apache Hadoop as well: Do not try to solve problems with it that it is not made to solve. Instead first think what is the right tool for your job.

Triposo itself is built on top of 12 data sources. Most are freely available, integrated to build a usable and valuable travel guide application for iOS and Android. The features available in Triposo can be phrased in terms of a search and information retrieval problem setting and as such lend itself well for integrated sources. With offers from Amazon, Google itself, Dropbox and the like it has become easy to deploy applications in an elastic way and scale with your user base and demand for more country coverage. For them it proved advantages to go for an implementation based on dynamic languages for pure development speed.

When it comes to QA they take a semi-manual approach: There are scripts checking recall (Brandenburger Tor must be found for the Berlin guide) as well as precision (there must be only one Brandenburger Tor in Berlin). Those rules need to be manually tuned.

When integrating different sources you quickly run into a duplicate discovery problem. Their approach is pretty pragmatic: Merge anything that you are confident enough to say it is a duplicate. Kill everything that is likely a duplicate but you are not confident enough to merge. The general guideline is to rather miss a place than have it twice.

For the wikipedia source so far they are only parsing the English version. There are plans to also support other languages - in particular for parsing to increase data quality as e.g. for some places geo coordinates may be available in the German article but not in the English one.

Though not going into too many technical details the talk gave some nice insights as to the strengths and weaknesses of different company sizes and mindsets when it comes to innovation as well as stabilization. Certainly a startup to watch, glad to hear that though incorporated in the US most developers actually live in Berlin now.

The second talk was given by Max Jakob from Neofonie GmbH (working on EU funded research project Dicode) gave an overview of their pipeline for named entity extraction and disambiguation based on a language model extracted from the raw German wikipedia dump. They used Pig to scale a pipeline down from about a week to 1.5 hours with not much development overhead: Quite some logic could be re-used from the open source project pignlproc initiated by Olivier Grisel. This project already features a Wikipedia loader, a UDF for extracting information from Wikipedia documents and additional scripts for training and building corpora.

Based on that they defined the ML probability of a surface form being a named entity. The script itself is not very magical: The whole process can be expressed as a few steps involving grouping and couting tuples. The effect in terms of runtime vs. development time however is impressive.

Checkout their DICODE github project for further details on the code itself.

After the meetup about 20 attendees followed David to a bar nearby. It is always great to get a chance to talk to the speakers, exchange experiences with others and learn more on what people are actually working on with Hadoop after the event.

Slides of all talks are going to be posted soon, videos go online as soon as they are post processed so stay tuned for further information.

Looking forward to seeing you again for the next meetup. If you could not make it this time, there is a very easy way to not have that happen again next time: First speaker to submit a talk proposal to David sets the date and time of the meetup (taking into account any constraints with venue and video taping of course).

Apache Hadoop Get Together - March 2010 - Update

2010-02-11 14:25
Due to conflicts in the schedule of newthinking store, we had to change the time of the Get Together slightly. We will start one hour earlier than announced.

When: March 10th, 4p.m.
Where: newthinking store, Tucholskystr. 48, Berlin Mitte

Looking forward to seeing you there.