Apache Hadoop Get Together - February 2012

2012-02-23 00:14
Today the first Hadoop Get Together Berlin 2012 took place - David got the event hosted by and at Axel Springer who kindly also paid for the (soon to be published) videos. Thanks also to the unbelievable Machine company for the tasty buffet after the meetup. Another thanks to Open Source Press for donating three of their Hadoop books.



Today's selection was quite diverse: The event started with a presentation by Markus Andrezak who gave an overview of Kanban and how it helped him change the development workflow over at eBay/mobile. Being well suited for environments that require flexibility Kanban is well suited to decrease risk associated with any single release by bringing the number of features released down to an absolute minimum. At Mobile his team got release cycles down to once a day. More than ten times a day however aren't unheard of as well. The general goal for him was to reduce the risk associated with releases by reducing the number of features released per release, reducing the number of moving parts in one release and as a result reducing the number of potential sources for problems: If anything goes wrong, rolling back is no issue - nor is narrowing down on the potential sources of bugs in the changed software that were introduced in that particular release.

This development and output focused part of the process is complemented by an input focused Kanban cycle for product design: Products are going from idea to vision to a more detailed backlog to development and finally live the same as issues in development itself move from Todo to in progress, under review and done.

With both cycles the main goal is to keep the number of items in progress as low as possible. This will result in more focus for each developer and greatly reduce overhead: Don't do more than one or two things at a time. Only catch: Most companies are focused on keeping development busy at all times - their goal is to reach 100% utilization. This however is in no way correlated to actual efficiency: By having 100% utilization there is not way you can deal with problems along the way, there is no buffer. Instead the idea should be to concentrate on a constant flow of released and live features instead.



Now what is the link of all that to Hadoop? (Hint: No, this is no pun on the project's slow release cycle.) The process of Kanban allows for frequent releases, it allows for frequent feedback. This enables a model of development that starts out from a model of your business case (no matter how coarse that may be), start building some code, measure your performance with that code based on actual usage data and adjust the model accordingly. Kanban lets you iterate very quickly on that loop getting you ahead of competitors eventually. In terms of technology one strong tool in their toolbox to really do data analytics on their incoming data is to use Hadoop and scale up analysing business data.

In the second talk Martin Scholl started out by drawing a comparison from music vs. printed music sheets to the actual performance of musicians in a concert: The former represents static, factual data. The latter represents a process that may be recorded, but may not by copied itself as it lives by the interactions with the audience. The same holds true for social networks: Their current state and the way you look at them is deeply influenced by your way of interacting with the system in realtime.

So in addition to data storage solutions for static data, he argues, we also need a way to process streaming data in an efficient and fault tolerant way. The system he uses for that purpose is Storm that was open-sourced by Twitter late last year. Built on top of zeroMQ it allows for flexible and fault tolerant messaging. Example applications mentioned are event analysis (filtering, aggregation, counting, monitoring), parallel distributed rpc based on message passing.

Two concrete examples include setting up a live A/B testing environment that is dynamically reconfigurable based on it's input as well as event handling in a social network environment where interactions might trigger messages being sent by mail and instant message but also trigger updates in a recommendation model.

In the last talk Fabian Hüske from TU Berlin introduced Stratosphere - an EU founded research project that is working on an extended computational model on top of HDFS that provides more flexibility and better performance. Being developed before the rise of Apache Hadoop YARN unfortunately essentially what they did was to re-implement the whole map/reduce computational layer and put their system into that. Would be interesting to see how a port to YARN performs and what sort of advantages it gives in production.

Looking forward to seeing you all in June for Berlin Buzzwords - make sure to submit your presentation soon, call for presentations won't be extended this year.

February 2012 Apache Hadoop Get Together Berlin

2012-01-31 20:34
The upcoming Apache Hadoop Get-Together is scheduled for 22. February, 6 p.m. - taking place at Axel Springer, Axel-Springer-Str. 65, 10888 Berlin. Thanks to Springer for sponsoring the location!

Note: It is important to indicate attendance. Due to security restrictions at the venue only registered visitors will be permitted. Get your ticket here: https://www.xing.com/events/hadoop-22-02-859807

Talks scheduled thus far:

Markus Andrezak : "Queue Management in Product Development with Kanban - enabling flow and fast feedback along the value chain" - It's a truism today that fast feedback from your market is a key advantage. This talk is about how you can deliver smallest product increments or MVPs (minimal viable products) quickly to your market to get fastest possible feedback on cause and effect of your product changes. To achieve that, it helps to provide a continuous deployment infrastructure as well as all you need for A/B testing and other feedback instruments. To make the most of these achievements, Kanban helps to limit work in progress, thus manage queues and speed up lead times (time from order to delivery or concept to cash). This helps us speed through the OODA Loop, i.e. Eric Ries' (The Lean Startup) Model -> Build -> Code -> Measure -> Data -> Validate -> Model. The more we can go through the loop, the more we have a chance to fine tune and validate our model of the business and finally make the right decisions.

Markus is one of Germany’s leading Kanban practitioners - writing and presenting talks about it in numerous publications and conferences. He will provide a brief view into how he is achieving fast feedback in diverse contexts.
Currently he is Head of mobile commerce at mobile.de.

Martin Scholl : "On Firehoses and Storms: Event Thinking, Event Processing" - The SQL doctrine is still in full effect and still fundamentally affects the way software is designed, the state it is stored in as well as the system architecture. With the NoSQL movement people have started to realize that the manner in which data is stored affects the full stack -- and that reduction of impedance mismatch is a good thing(TM). "Thinking in events" follows this tradition of questioning what is state-of-the-art. Modeling a system not in mutable entities (as with data stores) but as a stream of immutable events that incrementally modify state, yields results that will exceed your expectations. This talk will be about event thinking, event software modeling and how Twitter's Storm can help you process events at large.

Martin Scholl is interested in data management systems. He is also a Founder of infinipool GmbH.


Fabian Hüske : "Large-Scale Data Analysis Beyond Map/Reduce" - Stratosphere is a joint project by TU Berlin, HU Berlin, and HPI Potsdam and researches "Information Management on the Cloud". In the course of the project, a massively parallel data processing system is built. The current version of the system consists of the parallel PACT programming model, a database inspired optimizer, and the parallel dataflow processing engine, Nephele. Stratosphere has been released as open source. This talk will focus on the PACT programming model, which is a generalization of Map/Reduce, and show how PACT eases the specification of complex data analysis tasks. At the end of the talk, an overview of Stratosphere's upcoming release will be given.

Fabian has been a research associate at the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin since June 2008. He is working in the Stratosphere research project, focusing on parallel programming models, parallel data processing, and query optimization. Fabian started his studies at the University of Cooperative Education, Stuttgart, in cooperation with IBM Germany in 2003. During that course, he visited the IBM Almaden Research Center in San Jose, USA, twice and finished in 2006. Fabian undertook his studies at Universität Ulm and earned a master's degree in 2008. His research interests include distributed information management, query processing, and query optimization.


A big Thank You goes to Axel Springer for providing the venue at no cost for our event and for paying for videos to be taped of the presentations. A huge thanks also to David Obermann for organising the event.

Looking forward to seeing you in Berlin.