Mahout at Berlin ignite

2010-03-01 22:24
This evening the first Berlin ignite event took place in the "Festsaal" in Berlin X-Berg. Organiser of the event was Matt Biddulph from Nokia Gate 5. We had eleven fantastic talks (ok, to be more precise: At least ten fantastic ones, my own can only be judged by the audience ;) ).

Topics included things you can learn when starting to collect data, themes from (agile) project management, RepRap machines (see also the RepRap FOSDEM 2010 talk), bots and robots. The talks finished with a presentation of a Part time scientist's vision of getting to the moon - an article on the project is available on heise newsticker.

The room was filled with more than 120 people, resulting in a location packed with interested attendees. It was great seeing the talks on such diverse topics. Hope to have more events of this format here in Berlin. Thanks go to Matt, all speakers and everyone involved in making the event a big success.

For those who didn't make it to the event, slides and audio should go online soon. At least the slides on Mahout are available online.

Open Community Camp 2010

2010-02-12 13:07
The following information just reached me via Marten Vijn. Thought it might be interesting to you:

I am pleased to announce OpenCommunityCamp 2010.

The camp is from 10th to 18th July, in Oegstgeest, the Netherlands.

The website[1] is refreshed and the first speakers are booked.
It is time to register[2] if you plan be there (please do this

Currently we need to find more people to attend, self-organizing groups
for the day program and interesting speakers for the evening program.

I look forward to hear your ideas and plan if you come. If you have
any questions don't hesitate to mail me.

kind regards,
Marten Vijn


On Thursday: Open Hadoop User Group Munich

2009-12-16 06:06
If one evening of Apache Hadoop is not enough for you: The next Christmas Meetup in Germany takes place one day later in Munich.

  • When: Thursday December 17, 2009 at 5:30pm open end
  • Where: eCircle AG, Nymphenburger Straße 86, 80636 München ("Bruckmann" Building, "U1 Mailinger Str", map in German and look for the signs)

Talks scheduled by Bob and Lars:

Bob Schulze from eCircle will be giving the first presentation on how eCircle is planning to use the Hadoop stack.

Dave Butlerdi will be giving an overview of his usage of Hadoop.

Lars George will report on the state of affairs of the HBase project: what it is, what it does and how he has been using it (since early 2008).

There is a quick train connection from Berlin to Munich. So if you are attending the Berlin Get Together, it is very easy to travel south to Munich one day later and visit the Munich event as well.

NoSQL Berlin Meetup

2009-10-23 13:24
Yesterday evening the NoSQL Berlin Meetup took place in newthinking store, Berlin Mitte. We had planned for some 50 to 70 people. It quickly became clear that the room would be full - at the start I counted about 80 guests interested in NoSQL topics, both locals from Berlin and visitors traveling here from New York.

Some pictures are available on flickr - thanks to @langalex for sending the url to me:

The meetup started with an introduction to the basic principles of consistency and agreement protocols that are the basis of many scalable storage solutions, including Scalaris. Monika Moser explained why one can have only two of the three goals of consistency, availability and partition tolerance. After that she gave an introduction to Paxos - a scalable, partition-tolerant agreement protocol.

In the second talk, Mathias Meyer introduced Redis - a wicked fast key-value store that supports strings, lists and sets as values. It is implemented in C and comes with a persistence mechanism. Only problem: All the data stored in Redis needs to fit in memory for this store to work.

After a short break Jan Lehnardt gave an overview of building P2P applications with CouchDB. He showed how CouchDB can be scaled to large deployments with modules that build distribution and sharding on top of CouchDB. But CouchDB can also be scaled down to run on mobile devices. As synchronization is so simple with that DB it is a perfect fit for Ubuntu One - the initiative of Canonical that brings a personal cloud to everyone for sharing and distributing your data.

Martin Scholl gave an overview of Riak - a highly distributed key-value store with support for map-reduce style queries, sharding of data and a REST interface.

The last session included a talk by Mathias Stearn on MongoDB - a document store that does not keep its documents as JSON text but uses BSON, a binary format, for document encoding. This makes it easy to support compact and fast object (de-)serialization.
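To make the BSON idea a bit more concrete, here is a minimal sketch in Python of how a flat document could be encoded by hand. It follows the published BSON layout (a little-endian int32 total size, a sequence of typed elements, a trailing zero byte) but handles only strings and 32-bit integers. The function names are made up for this illustration and are not part of any MongoDB driver.

```python
import struct


def encode_cstring(text):
    """Encode a key as UTF-8 bytes followed by a terminating zero byte."""
    return text.encode("utf-8") + b"\x00"


def encode_element(name, value):
    """Encode one key/value pair; only str and 32-bit int are handled here."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; keep this sketch strict.
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # Type 0x02: length-prefixed UTF-8 string (length includes the zero byte).
        return b"\x02" + encode_cstring(name) + struct.pack("<i", len(data)) + data
    if isinstance(value, int):
        # Type 0x10: 32-bit signed integer, little-endian.
        return b"\x10" + encode_cstring(name) + struct.pack("<i", value)
    raise TypeError("unsupported type: %r" % type(value))


def encode_document(doc):
    """Encode a flat dict as: int32 total size, elements, zero terminator."""
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    # The size field counts itself (4 bytes) plus the body.
    return struct.pack("<i", len(body) + 4) + body


if __name__ == "__main__":
    print(encode_document({"hello": "world"}))
```

Because every string and every document carries its length up front, a reader can skip over values without scanning for delimiters, which is one reason a binary encoding of this kind can be parsed quickly.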

The final talk was given by Prof. Stefan Edlich on object oriented databases.

After the event, speakers and attendees switched over to Cafe Aufsturz for some drinks, beer and food - and of course for further discussions.

Big thanks go to the sponsors: Versant, Peritor (drinks at newthinking), StudiVZ (videos), Sociomantic (drinks at Aufsturz) and Soundcloud (food at Aufsturz). Another big thanks to Jan Lehnardt and Thomas Nicolai for helping me set up this event.

Looking forward to seeing you guys either in Oakland this November or probably next year at the next NoSQL conference in Berlin.

Getting Hadoop trunk up and running from source

2009-10-04 20:18
Having told Thilo about the possibility to write Hadoop jobs in Python with Dumbo, we spent some time getting Dumbo 0.21 up and running over the past weekend. The first option the wiki proposes is to take a pre-0.21 release and patch that to work with the current Dumbo release. The second option described takes the not-yet-released version of Hadoop that can be used w/o any patches.

We decided to follow the latter suggestion. After the latest split of the project, we downloaded common, hdfs and mapreduce. Building each project was easy - assuming that ant, Sun JDK 6 (for Hadoop), Forrest (for the documentation pages) and Sun JDK 5 (for Forrest) are installed.

Deviating from the documentation, the distributed filesystem as well as map reduce are now started from separate scripts ([...] instead of [...]). These scripts are located in the common project. In addition, the variables HADOOP_HDFS_HOME and HADOOP_MAPRED_HOME must be set to point to the respective projects for cluster setup to work. Other than that the setup currently is identical to the previous version.
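As a rough sketch of the environment setup, the following Python helper builds an environment with the two variables pointing at the split projects. It assumes the three checkouts live side by side under one root directory and are named hdfs and mapreduce after the sub-projects; both the layout and the function name are assumptions made for illustration, not something the Hadoop documentation prescribes.

```python
import os
from pathlib import Path


def split_hadoop_env(checkout_root, base_env=None):
    """Return a copy of the environment with the split-project variables set.

    checkout_root is a hypothetical directory that contains the 'hdfs' and
    'mapreduce' checkouts next to 'common'.
    """
    root = Path(checkout_root)
    # Copy so that the caller's environment (or os.environ) is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_HDFS_HOME"] = str(root / "hdfs")
    env["HADOOP_MAPRED_HOME"] = str(root / "mapreduce")
    return env


if __name__ == "__main__":
    env = split_hadoop_env("/opt/hadoop-trunk")  # example path, adjust as needed
    print(env["HADOOP_HDFS_HOME"])
    print(env["HADOOP_MAPRED_HOME"])
```

The resulting dictionary can be passed as the env argument of subprocess.run when invoking the start scripts from the common project.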

Dev House Berlin 2.0

2009-10-04 20:04
This weekend DevHouseBerlin took place in the Box119, kindly organized by Jan Lehnardt, sponsored by Upstream and StudiVZ. There were about 30 people gathered in Friedrichshain, hacking and discussing various projects: Mostly Python/Django, Ruby/Rails and Erlang people.

The first day was reserved for hacking and exchanging ideas. In the late afternoon attendees put together a list of talks that were then rated and ranked, with the top three chosen for presentation on Sunday. The list included topics on CouchDB, RestMS, Hadoop, Concurrency in Erlang, P2P CouchDB and many more. The first three topics were chosen by the participants for presentation.

During the time at DevHouse I finally got a list of topics and papers up at Mahout TU project - now only the exact credit system for the Mahout course at TU is missing. I got some time to work on Mahout improvements and documentation. Unfortunately I was too tired today to complete the code review for MAHOUT-157 - promise to do that early next week.

Spending one weekend with like-minded people and being able to pair with someone else on the more complex problems made the weekend a great time for me. Planning to be there again next year. Thanks to the sponsors and organisers for making this happen.

AWS User Group Berlin

2009-09-29 16:06
On Monday the first AWS user group took place in newthinking store, Berlin. The event featured talks by Martin Buhr from Amazon as well as presentations of AWS users like Dawanda, Peritor and Sound Cloud.

Unfortunately the most interesting question concerning Elastic Map Reduce was left unanswered by Martin: Does using EMR facilitate exploiting the data locality/rack locality optimizations that are possible in Hadoop? The question on whether Amazon is using the AWS APIs internally as well was answered positively, though of course they did not publish all of their systems infrastructure.

The next meeting is scheduled to take place in two months' time. Thanks to Peritor for organizing the meetup.

First NoSQL Meetup in Germany

2009-09-09 18:58
On October 22nd 2009 the first NoSQL Meetup Germany is going to take place in newthinking store/ Berlin:

Please submit your presentation proposals by September 22nd; accepted speakers will be notified soon after.

If you would like to sponsor the event, feel free to contact us: We would be very happy to provide videos after the event and free drinks for everyone during the event.

Hope to see you soon in Berlin.

Apache Con Europe 2009 - part 1

2009-03-29 18:41
The past week members, committers and users of Apache software projects gathered in Amsterdam for another Apache Con EU - and to celebrate the 10th birthday of the ASF. One week dedicated to the development and use of Free Software and the Apache Way.

Monday was BarCamp day for me, the first BarCamp I ever attended. Unfortunately not all participants proposed talks, so some of the atmosphere of an unconference was missing. The first talk by Danese Cooper was on "HowTo: Amsterdam Coffee Shops". She explained the ins and outs of going to coffee shops in Amsterdam and gave both legal and practical advice. There was a presentation of the Open Street Map project and of several Apache projects. One talk discussed transferring the ideas of Free Software to other parts of life. Ross Gardler started a discussion on how to advocate contributions to Free Software projects in science and education.

Tuesday for me meant having some time for Mahout during the Hackathon. Specifically I looked into enhancing matrices with meta information. In the evening there were quite a few interesting talks at the Lucene Meetup: Jukka gave an overview of Tika, Grant introduced Solr. After Grant's talk some of the participants shared numbers on their Solr installations (number of documents per index, query volume, machine setup). To me it was extremely interesting to gain some insight into what people actually accomplish with Solr. The final talk was on Apache Droids, a still incubating crawling framework.

The Wednesday tracks were a little unfair: The Hadoop track (videos available online for a small fee) was right in parallel to the Lucene track. The day started with a very interesting keynote by Raghu from Yahoo! on their storage system PNUTS. He went into quite some technical detail. Obviously there is interest in publishing the underlying code under an open source license.

After the Mahout introduction by Grant Ingersoll I changed rooms to the Hadoop track. Arun Murthy shared his experience on tuning and debugging Hadoop applications. After lunch Olga Natkovich gave an introduction to Pig - a higher-level language on top of Hadoop that allows for specifying filter operations, joins and the basic control flow of map reduce jobs in just a few lines of Pig Latin code. Tom White gave an overview of what it means to run Hadoop on the EC2 cloud. He compared several options for storing the data to process. Today it is very likely that there will soon be quite a few more providers of cloud services in addition to Amazon.

Allen Wittenauer gave an overview of Hadoop from the operations point of view. Finally, Steve Loughran covered the topic of running Hadoop on dynamically allocated servers.

The day finished with a pretty interesting BOF on Hadoop. There still are people who do not clearly see the differences between Hadoop based systems and database backed applications. Best way to find out whether the model fits: Set up a trial cluster and experiment yourself. No one can tell which solution is best for you except for yourself (and maybe Cloudera setting up the cluster for you :) ).

After that the Mahout/UIMA BOF was scheduled - there were quite a few interesting discussions on what UIMA can be used for and how it integrates with Mahout. One major take home message: We need more examples integrating both. We developers do see the clear connections. But users often do not realize that many Apache projects should be used together to get the biggest value out.

Cloud Camp Berlin

2009-03-23 13:29
Title: Cloud Camp Berlin
Date: 2009-04-30