Video: "Accessing Hadoop data with HCatalog and PostgreSQL"

2012-08-20 20:53

Apache Hadoop Get Together Berlin - August 2012

2012-08-15 23:30
Despite beautiful summer weather, roughly 50 people gathered at ImmobilienScout24 for the August 2012 edition of the Apache Hadoop Get Together. (Thanks again to ImmoScout for hosting the event and sponsoring drinks and pizza, as well as to David Obermann for organising the meetup.)



Today there were three talks: In the first presentation Dragan Milosevic (also known from his talk at the Hadoop Get Together and his presentation at Berlin Buzzwords) provided more insight into how Zanox manages its internal RPC protocols, in particular when it comes to versioning and upgrading protocol versions. Though in principle simple to solve, this sort of problem is still very common when rolling out distributed systems and scaling them over time. The concepts he described were not unlike what is available today in projects like Avro, Thrift or Protocol Buffers. However, by the time they needed versioning support for their client-server applications, none of these projects was a really good fit. This also highlights one important constraint: with communication being a very central component of distributed systems, changing libraries after an implementation has gone to production can be too painful to follow through.
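
To make the versioning idea concrete, here is a minimal sketch (plain Java, not zanox's actual code) of downward-compatible serialisation: every serialised object carries a version tag, and the deserialiser knows how to read all older layouts, so upgraded components can still consume objects written by components that have not been upgraded yet. The UserRecord class and its fields are made up for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class UserRecord {
  static final int CURRENT_VERSION = 2;

  String name = "";
  String email = ""; // field added in version 2 of the wire format

  void write(DataOutput out) throws IOException {
    out.writeInt(CURRENT_VERSION); // version tag precedes the payload
    out.writeUTF(name);
    out.writeUTF(email);
  }

  static UserRecord read(DataInput in) throws IOException {
    UserRecord record = new UserRecord();
    int version = in.readInt(); // the tag decides which layout to expect
    record.name = in.readUTF();
    // Downward compatible: a version-1 payload simply lacks the email field.
    record.email = version >= 2 ? in.readUTF() : "";
    return record;
  }
}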

In the second presentation Stefanie Huber, Manuel Messner and Stephan Friese showed how Gameduell uses Hadoop to provide better data analytics for marketing, BI, developers, product managers and others. Founded in 2003, they have accumulated quite a bit of data, consisting of micro transactions (related to payment operations), user activities, and gaming results that are needed for balancing games. Their team turned a hairy, complex system into a pretty clean, Hadoop-based solution: by now all actions end up in a Hadoop cluster (with an option to subscribe to a feed for realtime events). Typically people start analysis jobs from there, either in plain map/reduce or in Pig, and export the data to external databases for further analysis by BI people, who prefer Hive as a query language as it is much closer to SQL than any of the alternatives. Lately they introduced HCatalog to provide a common view on the data for all three analysis options - in addition to allowing for a more abstract view of the available data that does not require knowing the exact filesystem structure to access it.
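
As an illustration of what that buys you (a sketch only, not Gameduell's actual code): with HCatalog, a plain map/reduce job addresses a table by name via the metastore instead of hard-coding paths and file formats. The database and table names below are hypothetical, and the setInput() call follows the HCatalog 0.4-era API.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class ActivityByUser {

  public static class ActivityMapper
      extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(WritableComparable key, HCatRecord record, Context ctx)
        throws IOException, InterruptedException {
      // Columns come via the table schema - no parsing of a known file layout.
      // Column 0 is assumed to hold the user id.
      ctx.write(new Text(record.get(0).toString()), ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJobName("activity-by-user");
    job.setJarByClass(ActivityByUser.class);
    // The metastore resolves storage format and location for the named table.
    job.setInputFormatClass(HCatInputFormat.class);
    HCatInputFormat.setInput(job, InputJobInfo.create("default", "user_activity", null));
    job.setMapperClass(ActivityMapper.class);
    job.setNumReduceTasks(0); // counting reducer omitted to keep the sketch short
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}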

After a short break, in the last talk of the evening Stefan Hübner introduced Cascalog to the otherwise pretty Java-savvy crowd. Being based on Cascading, Cascalog provides a concise way of formulating queries against a Hadoop cluster (compared to plain map/reduce). Also, when contrasted with Pig or Hive, what stands out is the option to easily and seamlessly integrate additional functions (both map- and reduce-side) into Cascalog scripts without switching languages or abstractions. Note: when testing Cascalog scripts, one project to look at is Midje.

Overall a really interesting evening with lots of new input and interesting discussions. It is always amazing to see what big data applications people in Berlin are developing, and awesome to see so many development teams adopt relatively new technologies (some even still in the Apache Incubator) for production systems. Looking forward to the next edition - as well as to the slides and videos of today's edition.

ApacheCon returns to Europe

2012-08-01 20:41
In November, ApacheCon will come back to Europe. The event will take place in Sinsheim, inviting foundation members, project committers, contributors and users to meet, discuss and have fun during the week-long event.



Several meetups will be held the weekend before the main conference kicks off, watch out for announcements on your favourite project mailing list.

ApacheCon is still open for submissions until August 3rd - head over to the call for submissions for more information. The conference is split into several tracks that are handled individually: Apache Daily (tools, frameworks and components used on a daily basis), Apache Java Enterprise projects, Big Data, Camel in Action, Cloud, Linked Data, Lucene, Modular Java Applications, NoSQL Database, OFBiz (the Apache enterprise automation project), OpenOffice and finally Web Infrastructure (covering httpd, Tomcat and Traffic Server - the heart of many Internet projects).

Make sure to mark the date in your calendar to meet the people behind the ASF projects, learn more about how the foundation works and what makes Apache projects special compared to others. Join us for a week of fun and dense talks on all things Apache.


The Apache Feather logo is a trademark of The Apache Software Foundation.

Apache Hadoop Get Together Berlin

2012-07-23 20:41
As seen on Xing - the next Apache Hadoop Get Together is planned to take place in August:

When: 15 August, 6 p.m.

Where: Immobilien Scout GmbH, Andreasstr. 10, 10243 Berlin


As always there will be slots of 30min each for talks on your Hadoop topic. After each talk there will be time for discussion.

It is important to indicate attendance. Only registered visitors will be permitted to attend.

Register here: https://www.xing.com/events/hadoop-get-together-1114707


Talks scheduled thus far:

Speaker:
Dragan Milosevic

Session:
Robust Communication Mechanisms in zanox Reporting Systems

It happened an annoying number of times that we wanted to improve only one particular component in our distributed reporting system, but often had to update almost everything due to RPC version mismatches between the updated component and the rest of the system. To mitigate this problem and to significantly simplify the integration of new components, we extended the RPC protocol in use to perform a version handshake before the actual communication starts. This RPC extension is accompanied by serialisation/deserialisation methods which are downward compatible, being able to successfully deserialise any serialised older version of the exchanged objects. Together, these extensions make it possible for us to operate multiple versions of frontend and backend components, and give us the power to autonomously decide what should be updated or improved in our distributed reporting system, and when.
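
For readers unfamiliar with the pattern, here is a minimal sketch of such a version handshake (generic Java, not the zanox implementation; version numbers and wire format are made up): both sides announce the highest protocol version they speak before any payload is exchanged, and the lower of the two wins.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

public class VersionHandshake {

  static final int HIGHEST_SUPPORTED = 3; // this side understands versions 1..3

  // Runs once per connection, before the first real request.
  static int negotiate(Socket socket) throws IOException {
    DataOutputStream out = new DataOutputStream(socket.getOutputStream());
    DataInputStream in = new DataInputStream(socket.getInputStream());
    out.writeInt(HIGHEST_SUPPORTED); // announce what we can speak
    out.flush();
    int remote = in.readInt();       // learn what the peer can speak
    int agreed = Math.min(HIGHEST_SUPPORTED, remote);
    if (agreed < 1) {
      throw new IOException("no common protocol version");
    }
    // From here on, both sides select the serialisers matching 'agreed'.
    return agreed;
  }
}

Combined with downward-compatible deserialisation, either side of the connection can then be upgraded independently.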


Two other talks are planned and I will provide you with further information soon.

A big Thank You goes to Immobilien Scout GmbH for providing the venue at no cost for our event and for sponsoring the videotaping of the presentations.

Looking forward to seeing you in Berlin,

David

Need your input: Failing big data projects - experiences from the wild

2012-07-18 20:11
A few weeks ago my talk on "How to fail your big data project quick and rapidly" was accepted at the O'Reilly Strata conference in London. The basic intention of this talk is to share some anti-patterns, embarrassing failure modes and "please don't do this at home" kind of advice with those entering the buzzwordy space of big data.

Inspired by Thomas Sundberg's presentation on "failing software projects", the talk will be split into five chapters, highlighting the top two failure factors for each.

I only have so much knowledge of what can go wrong when dealing with big data. In addition, no one likes talking about what did not work in their own environment. So I'd like to invite you to share your war stories in a public etherpad - either anonymously or with your name attached so I can give credit. Some ideas are already sketched out - feel free to extend, adjust, re-rank or change them.

Looking forward to your stories.

Berlin Buzzwords Schedule online - book your ticket now

2012-04-30 10:29
As of the beginning of last week, the Berlin Buzzwords schedule is online. The Program Committee has completed reviewing all submissions and set up a schedule containing a great lineup of speakers for this year's Berlin Buzzwords program. Among the speakers are Leslie Hawthorn (Red Hat), Alex Lloyd (Google), Michael Busch (Twitter) as well as Nicolas Spiegelberg (Facebook). Check out the program in the online schedule.

Berlin Buzzwords standard conference tickets are still available. Note that we also offer a special rate for groups of 5 and more attendees with a 15% discount off the standard ticket price. Make sure to book your ticket now: ticket prices will rise by another 100 Euros for last-minute purchases in three weeks!

“Berlin Buzzwords is by far one of the best conferences around if you care about search, distributed systems, and NoSQL...” says Shay Banon, founder of ElasticSearch.

Berlin Buzzwords will take place June 4th and 5th 2012 at Urania Berlin. The third edition of the conference for developers and users of open source projects again focuses on everything related to scalable search, data analysis in the cloud and NoSQL databases. We are bringing together developers, scientists, and analysts working on innovative technologies for storing, analysing and searching today's massive amounts of digital data.

Berlin Buzzwords is organised by newthinking communications GmbH in collaboration with Isabel Drost (Member of the Apache Software Foundation, PMC member Apache community development and co-founder of Apache Mahout), Jan Lehnardt (PMC member Apache CouchDB) and Simon Willnauer (Member of the Apache Software Foundation, PMC member Apache Lucene).

More information - including speaker interviews, ticket sales, press information and "meet me at bbuzz" buttons - is available on the official Berlin Buzzwords website.

Looking forward to meeting you in June.


PS: Did I mention that Berlin is all beautiful in Summer?

Berlin Hadoop Get Together (April 2012) - videos are up

2012-04-23 14:22

Clojure Berlin - March 2012

2012-03-07 22:37
In today's Clojure meetup Stefan Hübner gave an introduction to Cascalog - a Clojure library based on Cascading for large scale data processing on Apache Hadoop without hassle.

After a brief overview of how he uses the tool for log processing at his day job at http://maps.nokia.com, Stefan went into more detail on why he chose Cascalog over other projects that provide abstraction layers on top of Hadoop's plain map/reduce library: both Pig and Hive provide easy-to-learn SQL-like languages for quickly writing analysis jobs. The major disadvantage, however, shows up when domain-specific operators are needed - in particular when these turn out to be needed just once: developers end up switching back and forth between e.g. Pig Latin and Java code to accomplish their analysis needs. These kinds of one-off analysis tasks are exactly where Cascalog shines: there is no need to leave the Clojure context; you just program your map/reduce jobs on a very high level. (Cascalog itself is quite similar to Datalog in syntax, which makes it easy to read and lets you forget about all the nitty-gritty details of writing map/reduce jobs.)

Writing a join to compute persons' age and gender from a trivial data model is as simple as typing:


;; Persons' age and gender, implicitly joined on ?person;
;; (stdout) is the sink the results are written to
(?<- (stdout) [?person ?age ?gender]
     (age ?person ?age)
     (gender ?person ?gender))

Multiple sorts of input generators are implemented already: reading plain text files and using files in HDFS as input are both common use cases. Of course it is also possible to provide your own implementation to integrate any other type of data input beyond what is available already.

In my view Cascalog combines the speed of development that Pig and Hive brought with the flexibility of being able to seamlessly switch to a powerful programming language for anything custom. If you have been using or even contributing to either Cascalog or Cascading: I'd love to see your submission to Berlin Buzzwords - remember, the submission deadline is this Sunday (CET).

Berlin Hadoop Get Together - videos are up

2012-03-02 20:08

Apache Hadoop Get Together - February 2012

2012-02-23 00:14
Today the first Hadoop Get Together Berlin of 2012 took place - David got the event hosted by and at Axel Springer, who kindly also paid for the (soon to be published) videos. Thanks also to The unbelievable Machine Company for the tasty buffet after the meetup, and another thanks to Open Source Press for donating three of their Hadoop books.



Today's selection was quite diverse: The event started with a presentation by Markus Andrezak, who gave an overview of Kanban and how it helped him change the development workflow over at eBay/mobile. Being a good fit for environments that require flexibility, Kanban helps decrease the risk associated with any single release by bringing the number of features per release down to an absolute minimum. At mobile his team got release cycles down to once a day; more than ten releases a day aren't unheard of either. The general goal for him was to reduce the risk associated with releases by reducing the number of features per release and the number of moving parts in each release, and as a result the number of potential sources of problems: if anything goes wrong, rolling back is no issue - nor is narrowing down the potential sources of bugs introduced in that particular release.

This development- and output-focused part of the process is complemented by an input-focused Kanban cycle for product design: products move from idea to vision to a more detailed backlog to development and finally to live, just as issues in development itself move from todo to in progress to under review to done.

With both cycles the main goal is to keep the number of items in progress as low as possible. This results in more focus for each developer and greatly reduces overhead: don't do more than one or two things at a time. The only catch: most companies are focused on keeping development busy at all times - their goal is to reach 100% utilisation. That, however, is in no way correlated with actual efficiency: at 100% utilisation there is no way you can deal with problems along the way, as there is no buffer. Instead the idea should be to concentrate on a constant flow of released, live features.



Now what is the link between all that and Hadoop? (Hint: no, this is not a pun on the project's slow release cycle.) The Kanban process allows for frequent releases and thus frequent feedback. This enables a model of development that starts out from a model of your business case (no matter how coarse that may be), builds some code, measures the performance of that code based on actual usage data, and adjusts the model accordingly. Kanban lets you iterate very quickly on that loop, eventually getting you ahead of competitors. In terms of technology, one strong tool in their toolbox for doing analytics on the incoming data is Hadoop, which lets them scale up the analysis of their business data.

In the second talk Martin Scholl started out by drawing a comparison between printed sheet music and the actual performance of musicians in a concert: the former represents static, factual data; the latter represents a process that may be recorded, but cannot itself be copied, as it lives by the interaction with the audience. The same holds true for social networks: their current state and the way you look at them is deeply influenced by the way you interact with the system in realtime.

So in addition to data storage solutions for static data, he argues, we also need a way to process streaming data in an efficient and fault-tolerant way. The system he uses for that purpose is Storm, which was open-sourced by Twitter late last year. Built on top of ZeroMQ, it allows for flexible and fault-tolerant messaging. Example applications mentioned were event analysis (filtering, aggregation, counting, monitoring) and parallel distributed RPC based on message passing.

Two concrete examples included setting up a live A/B testing environment that is dynamically reconfigurable based on its input, as well as event handling in a social network environment where an interaction might trigger messages being sent by mail and instant message, but also updates to a recommendation model.
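
To give a flavour of the programming model (a sketch against the Storm 0.7-era backtype.storm Java API, not code from the talk): a topology wires spouts and bolts together, and groupings decide how tuples are routed between them. EventSpout and CounterBolt below are hypothetical classes implementing Storm's IRichSpout and IRichBolt interfaces.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class EventCountTopology {
  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    // Hypothetical spout emitting ("user", "action") tuples from the event feed.
    builder.setSpout("events", new EventSpout(), 2);
    // Hypothetical counting bolt; fieldsGrouping routes equal actions to the same task.
    builder.setBolt("counter", new CounterBolt(), 4)
           .fieldsGrouping("events", new Fields("action"));

    Config conf = new Config();
    conf.setNumWorkers(2);

    // In-process cluster for testing; StormSubmitter would deploy to a real cluster.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("event-count", conf, builder.createTopology());
  }
}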

In the last talk Fabian Hüske from TU Berlin introduced Stratosphere - an EU-funded research project working on an extended computational model on top of HDFS that provides more flexibility and better performance. Having been developed before the rise of Apache Hadoop YARN, they unfortunately essentially had to re-implement the whole map/reduce computational layer and embed their system into that. It would be interesting to see how a port to YARN performs and what sort of advantages it brings in production.

Looking forward to seeing you all in June for Berlin Buzzwords - make sure to submit your presentation soon; the call for presentations won't be extended this year.