Apache Hadoop Get Together Berlin

2013-05-08 23:41
This evening I joined the group over at Immobilienscout 24 for today's Hadoop Get Together. David Obermann had invited Dr. Falk-Florian Henrich from CeleraOne to talk about their real-time analytics on live data streams.

Their system is being used by the New York Times Springer's Die Welt for traffic analysis. The goal is to identify recurring users that might be willing to pay for the content they want to read. The trade-off here is to keep readers interested long enough to make them pay in the end, instead of scaring them away with a restrictive pay wall which would immediately lead to way less ad revenues.

Currently CeleraOne's system is based on a combination of MongoDB for persistent storage, ZeroMQ for communicating with the revenue engine and http/json for connecting to the controlling web frontend. The live traffic analysis is all done in RAM, while long term storage ends up in MongoDB.

The second speaker was Michael Hausenblas from MapR. He spends most of his time contributing to Apache Drill - an open source implementation of Google's Dremel.

Being an Apache project Drill is developed in an open, meritocratic way - contributors come from various different backgrounds. Currently Drill is in its early stages of development: They have a logical plan, a reference interpreter, a basic SQL parser. There is a demo application. As data backends they support HBase.

For most of the implementation they are trying to re-use existing libraries, e.g. for the columnar storage Drill is looking into either using Twitter's Parquet or Hive ORC file format.

In terms of contributing to the project: There is no need to be a rockstar programmer to make valuable contributions to Apache Drill: Use cases, documentation, test data are all valuable and appreciated by the project.

For more information check out the slide deck (this is an older version - this nights edition most likely soon to be published):

If you missed today's event make sure to get enlisted in the Hadoop Get Together Xing Group so next time you get a notification.

One thing to note though: When registering for the event - please make sure to free your ticket if you cannot make it. I had a few requests from people who would have loved to attend today who didn't get a ticket but would most likely have fit into the room.

Apache Mahout 0.6 released

2012-02-08 21:33
As of Monday, February 6th a new Apache Mahout version was released. The new package features

Lots of performance improvments:

  • A new LDA implementation using Collapsed Variational Bayes 0th Derivative Approximation - try that out if you have been bothered by the way less than optimal performance of the old version.
  • Improved Decision Tree performance and added support for regression problems
  • Reduced runtime of dot product between vectors - many algorithms in Mahout rely on that, so these performance improvements will affect anyone using them.
  • Reduced runtime of LanczosSolver tests - make modifications to Mahout more easily and have faster development cycles by faster testing.
  • Increased efficiency of parallel ALS matrix factorization
  • Performance improvements in RowSimilarityJob, TransposeJob - helpful for anyone trying to find similar items or running the Hadoop based recommender

New features:

  • K-Trusses, Top-Down and Bottom-Up clustering, Random Walk with Restarts implementation
  • SSVD enhancements

Better integration:

  • Added MongoDB and Cassandra DataModel support
  • Added numerous clustering display examples

Many bug fixes, refactorings, and other small improvements. More information is available in the Release Notes.

Overall great improvements towards better performance, better stability and integration. However there are still quite some outstanding issues and issues in need for review. Come join the project, help us improve existing patches, improve performance and in particular integration and streamlining of how to use the different parts of the project.

Devoxx University – MongoDB, Mahout

2010-12-05 21:19
The second tutorial was given by Roger Bodamer on MongoDB. It concentrates on being horizontally scalable by avoiding joins and complex, multi document transactions. It supports a new data model that allows for flexible, changeable "schemas".

The exact data layout is determined by the types of operations you expect for your application, by the access patterns (reading vs. writing data; types of updates and types of queries). Also don't forget about indexing tables by columns to speed up frequently run queries.

Scaling MongoDB is supported by replication in a master/slave setup quite as any traditional system. In a replica set of n nodes, any of these can be elected as the primary (taking writes). If that one goes down, new master election happens. For durability all writes are required to go to at least a majority of all nodes, if that does not happen, now guarantee is given as to the availability of the update in case of primary failure. Write sharding comes with MongoDB as well.
Java support for Mongo is pretty standard - Raw Mongo driver comes in a Map<... ...> flavour. Morphia supports Pojo mapping, annotations etc. for MongoDB Java integration, Code generators for various other JVM languages are available as well.

See also: http://blog.wordnik.com/12-months-with-mongodb

My talk was scheduled for 30min in the afternoon. I went into some detail on what is necessary to build a news clustering system with Mahout and finished the presentation by a short overview of the other use cases that could be solved with the various algorithms. In the audience, nearly all had heard about Hadoop before – most likely in the introductory session that same morning. Same for Lucene. Solr was known to about half of the attendees. Mahout to just a few. Knowing that only very few attendees had any Machine Learning background I tried to provide a very high level overview of what can be done with the library, not going into too much mathematical details. There were quite a few interested questions after the presentation – both online and offline, including requests for examples on how to integrate the software with Solr. In addition connectors for instance to HBase as a data-source were interesting to people. Show-casing integration of Mahout, possibly even providing not only Java- but also REST interfaces might be one route to easier integration and faster adoption of Mahout.