Large Scalability - Papers and implementations

2009-06-23 12:08
In recent years the Googles and Amazons on this world have released papers on how to scale computing and processing to terrabytes of data. These publications have led to the implementation of various open source projects that benefit from that knowledge. However mapping the various open source projects to the original papers and assigning tasks that these projects solve is not always easy.

With no guarantee of completeness this lists provides a short mapping from open source project to publication.

There are further overviews available online as well as a set of slides from the NOSQL debrief.















Map Reduce Hadoop Core Map Reduce Distributed programming on rails, 5 Hadoop questions, 10 Map Reduce Tips
GFS HDFS (Hadoop File System) Distributed file system for unstructured data
Bigtable HBase, Hypertable Distributed storage for structured data, When to use HBase.
Chubby Zookeeper Distributed lock- and naming service
Sawzall PIG, Cascading, JAQL, Hive Higher level langage for writing map reduce jobs
Protocol Buffers Protocol Buffers, Thrift, Avro, more traditional: Hessian, Java serialization Data serialization, early benchmarks
Some NoSQL storage solutions CouchDB, MongoDB CouchDB: document database
Dynamo Dynomite, Voldemort, Cassandra Distributed key-value stores
Index Lucene Search index
Index distribution katta, Solr, nutch Distributed Lucene indexes
Crawling nutch, Heritrix, droids, Grub, Aperture Crawling linked pages



Open Street Map @ FSFE meetup

2009-06-21 20:59
At the last meeting of the local FSFE group here in Berlin Sabine Stengel from cartogis gave a presentation on Open Street Map. But instead of focussing on the technical side she described the legal issues and showed the broad variety of commercial projects that are possible with this type of mapping information.

It was interesting to learn of how detailed and high quality the information provided by volunteers really is. I think it will be interesting to see, how the project keeps traction after "everything is mapped" - how it remains interesting to stay involved, to keep the information up to date over a longer period of time.

Open Source Development is good for you

2009-05-21 09:08
GSoC (Google summer of code) - one of the open source programs of Google - has started again in 2009. Students come to work for open source projects during the summer and on success are paid by Google a fair amount of money.

This program is an ideal oportunity for students to get into open source projects: You get a mentor, you have pre-defined task to work on with a goal you set yourself. And in the end there is money.

At the beginning of GSoC student ranking Ted Dunning posted a very interesting mail on his view on why students should participate in open source development:

  • It is a perfect chance to work together with senior developers that are passionate about what they do.
  • Usually universities teach the theoretical side of life, which is good. But if working in industry later, students need experience with current development best practices and tools. They need to be aware of test driven development, they need to know how to use source control systems, continuous integration tools, build management frameworks, bug tracking tools. Open source projects usually are a great place to try out these technologies and learn how to best apply them.
  • Working on open source students need to coordinate with their peers. They need to learn that development is not only about coding, but about communication as well.
  • Last but not least this is a chance to chose yourself what you are working on and achieve so much more than when starting yet another brand new single developer project.


In the end all this adds up to learning and practicing the skills needed to successfully work on software development projects with more than just a few developers.

Tomcat Tuesday talk

2009-05-21 09:07
Since several months at neofonie we have a talk given by external or internal developers on various subjects each Tuesday. Usually these presentations are a nice way to get an overview of new emerging technologies, to get an overview of current conference topics or to gain insight into interesting internal projects.

This week we had Apache Tomcat Committer and PMC Peter Rossbach here at neofonie to talk about the Tomcat architecture and Tomcat clustering solutions. He gave two pretty in-depth presentations on the Tomcat internals, Tomcat optimization and extension points.

Some points that were especially interesting to me: The project started out in the late nineties, initiated by a bunch of developers who just wanted to see what it takes to write a web application container and that fullfills the spec. The goal basically was a reference implementation. Soon enough however users defined the resulting code as production ready and used it.

There are a few caveats from this history that are still visible. One is the lack of tests in the codebase. Sure, each release is tested agains the Sun TCK - but these tests cannot be opened to the general public. So if you as a developer make extensions or modifications to the code base there is no easy way of knowing whether you broke something or not.

For me as a developer it was interesting to see really how complex it quickly gets to cluster tomcat deployments and make them failure resistant. Some tools mentioned that help automatic with easier deployment are Puppet and FAI. One issue however that is still on the developer's agenda is Tomcat monitoring.

To summarize: The conference room was packed with developers expecting two very interesting talks. Thanks to Peter Rossbach for coming to neofonie and explaining more on the internals of the Tomcat software, the project and the community behind.

Back from Zürich

2009-05-05 16:58
I spend the last five days in Zurich. I wanted to visit the city again - and still owed one of my friends there a visit. I am really happy the weather was quite nice over the weekend. That way I could spend quite some time in town (got another one of those puzzles) and go for a hike on the Ütli mountain: I took the steep way up that had quite a lot of stairs. Interestingly though, despite being quite tired when I finally arrived on top, my legs did not have sore muscles the next day. Seems going to work and back again by bike does indeed help a bit, even if we have no hills in Berlin.

Yesterday I was allowed to present the Apache project Mahout in a Google tech talk. Usually I am talking to people well familiar with the various Apache projects. Giving my talk I asked people who was familiar with Lucene, with Hadoop. To me it was pretty unusual that very few engineers were aware of these. It almost seemed like it is unusual to have a look at what is going outside the company? Or was it just the selection of people that were interested in my talk?

I tried to cover most of the basics, put Mahout into the context of the Lucene umbrella project. I tried to show some of the applications that can be built with Mahout and detailed some of the things that are on our agenda.

Some of the questions I received were on the scalability of Hadoop, on the general distribution of people being paid to work on Free Software projects vs. those working on them in their freetime. Another question was whether the project is targeted to text only applications (which of course it is not, as feature extraction so far has been left to the user). Last but not least the relation to UIMA was brought up by a former IBM-UIMA engineer.

To summarize: For me it was a pretty interesting experience to give this tech talk. I hope it did help me to do away with some of my "Apache bias". It is always valuable to look into what is going outside one's community.

Announcing Apache Mahout 0.1

2009-04-08 15:11
This morning I received Grant's release mail of Apache Mahout. I am really happy that after little more than one year we now have our first release out there to test and scrutinate by anyone interested in the project. Thanks to all the committers who have helped make this possible. A special thanks to Grant Ingersoll for putting so much time into getting many release issues out of the way as well as to those who reviewed the release candidates and all the major and minor problems.

For those who are not familiar with Mahout: The goal of the project is to build a suite of machine learning libraries under the Apache license. The main focus is on:

  • Building a viable community that develops new features, helps users with software problems and is interested in the data mining problems Mahout users.
  • Developing stable, well documented, scalable software that solves your problems.


The current release includes several algorithms for clustering (k-Means, Canopy, fuzzy k-Means, Dirichlet based), for classification (Naive Bayes and Complementary Naive Bayes). There is some integration with the Watchmaker evolutionary programming framework. The Taste Collaborative Filtering framework moved to Mahout as well. Taste has been around for a while and is much more mature than the rest of the code.

With this being a 0.1 release we are looking for early adopters that are willing to work with cutting edge software and gain benefits from working closely together with the community. We are seeking feedback on use cases as well as performance numbers. If you are using Mahout in your projects or plan to use it or even only evaluate it - we are happy about hearing back from you on our mailing lists. Tell us what you like, what works well, but do not forget to tell us what you would like to improve. Contributions and Patches as always are very welcome.

For more information see the project homepage, especially the wiki and the Lucene weblog by Grant Ingersoll.

Apache Con Europe 2009 - part 2

2009-03-29 19:42
Thursday morning started with an interesting talk on open source collaboration tools and how they can help resolving some collaboration overhead on commercial software projects. Four goals can be reached with the help of the right tools: Sharing the project vision, tracking the current status of the project, finding places to help the project and documenting the project history as well as the reasons for decisions along the way. The exact tool used is irrelevant as long as it helps to solve the four tasks above.

The second talk was on cloud architectures by Steve Loughran. He explained what reasons there are to go into the cloud, what a typical cloud architecture looks like. Steve described Amazon's offer, mentioned other cloud service providers and highlighted some options for a private cloud. However his main interest is in building a standardised cloud stack. Currently choosing one of the cloud provides means vendor lock-in: Your application uses a special API, your data are stored on special servers. There are quite a few tools necessary for building a cloud stack available at Apache (Hadoop, HBase, CouchDB, Pig, Zookeeper...). The question that remains is how to integrate the various pieces and extend where necessary to arrive at a solution that can compete with AppEngine or Azure?

After lunch I went to the Solr case study by JTeam. Basically one great commercial for Solr. They even brought the happy customer to Apache Con to talk about the experience of Solr from his point of view. Great work, really!

The Lightning Talk session ended the day - with a "Happy birthday to you" from the community.

After having spent the last 4 days from 8a.m. to 12p.m. at Apache Con I really did need some rest on Thursday and went to bed pretty early: At 11p.m. ...

Cloud Camp Berlin

2009-03-23 13:29
Title: Cloud Camp Berlin
Link out: Click here
Date: 2009-04-30

FSFE booth at the Chemnitzer Linux Tage

2009-03-23 08:30
This year for the 11th time the "Linux Tage" were organized at the university of Chemnitz. Each year in March this means two days devoted to the topic of open and free software. It means an event that is very well organized by a pretty professional team of volunteers.

For the third time the FSFE had its booth at the event - this time run by Rainer Kersten, Uwe Zemisch and me. Recurring questions at the booth were

  • "What the hack is FSFE and in which ways do you actually support free software?"
  • "I am already fellow, you keep telling me there are these great fellowship meetups. Do you know whether there is one near my town? How do these events start? How are they organized?"


It was interesting to see that FSFE is one of the few organizations that try to fill the gap between those writing open source software and those actually making decisions that are relevant to the developers but know nothing of writing software whatsoever.

Besides running the booth there was some time left for a few talks. I decided to go to the OpenMP talk. The idea is to develop a highlevel API for marking code passages for parallel execution. It is not designed for parallel programming on clusters but on multi core machines. Somewhat related to the Java concurrency package but far more high level.

The second talk I went to was on personal data protection laws in Germany. One funny piece of information: Even the ministry of justice was sued sucessfully for storing to much information on the visitors of its webpage.

Last talk I went to was on Google Android. To me it looks like a nice mix of completly open source (like Open Moko) and completely closed source. If you need a phone you can use for making phone calls but still want to play with it and be root on the phone (399$ for the dev phone, sim unlocked, only available for registered developers, registration is 25,-$), Android G1 propably is the way to go. The phone is highly integrated with Google applications. The assumption when building it seems to have been, that people are online all the time with that phone.

For coding: Only a Java API is available, no C or C++. SDK is available for Lin/Mac/Win. The emulator does work, only thing it does not reflect is the real speed of the device itself. Each app gets its own VM, Dalvik supports process memory sharing that makes that less expensive. In case of memory shortage apps are killed in order of user impact (empty/precreated, background, service (mp3 player), visible apps, foreground apps). Idea is to kill those apps that are least visible to the user. The programmer needs to take care that apps constantly store state so restarting them gets them up in the same state they were in when killed.

More information online: http://www.htc.com; adoid.git.kernel.org;

All in all: Really nice weekend. Looking forwared to return next year.

FSFE Meetup Berlin

2009-03-07 23:07
Title: FSFE Meetup Berlin
Location: newthinking store
Start Time: 19:00
Date: 2009-03-12


The FSFE Berlin group is going to meet next Thursday. Topics to discuss at the next Get Together are


  • Document Freedom Day activities.
  • This year is the year of elections in Germany. How do we get the topic of Free Software into the discussion?


Feel free to visit the wiki, drop by and take part in the discussions. After the meetup we will go to some restaurant close by for food and drinks.