Apache Mahout @ Devoxx Tools in Action Track

2010-11-01 09:32
This year's Devoxx will feature several presentations from the Apache Hadoop ecosystem, including Tom White on the basics of Hadoop (HDFS, MapReduce, Hive and Pig) as well as Michael Stack on HBase.



In addition there will be a brief Tools in Action presentation on Monday evening featuring Apache Mahout.

Please let me know if you are going to Devoxx - it would be great to meet some more Apache people there, maybe have dinner on one of the conference days.

Apache Hadoop in Debian Squeeze

2010-07-17 12:04
After using Mandrake for quite a while (I still blame my boyfriend Thilo for infecting not only my computer but also myself, first with that system and then with the more general idea of Free Software - but that's another story), I switched to Debian GNU/Linux after finishing my master's thesis (back then in the version code-named Woody). Since then I have always had Debian on my private box as my main operating system - I even installed it on my MacBook following the steps in the Debian Wiki.

As I am also an Apache Mahout committer - a project closely related to Apache Hadoop - I always found it kind of sad that there were no Hadoop packages in the official Debian repositories. I tried multiple times to get into Debian packaging myself: I learned what "debian/rules" is all about and discovered some of the intricacies of packaging Java based software. However I have to admit that I never found enough time to really finish that task.

A few weeks before this year's FOSDEM I learned on the Apache Hadoop as well as the Debian Java mailing lists that a guy called Thomas Koch was working on solving bug 535861 - the ITP to package Hadoop. We met at FOSDEM, where I tried to raise some attention in the audience for Thomas' plans (back then he was in need of help with a few last missing pieces). In addition I invited him to Berlin Buzzwords to get in touch with other Hadoop developers and users for further input.

I am really happy that by now Hadoop has made it into the official Debian package repositories - as soon as Debian Squeeze is released, installing it will be as simple as apt-get install [Hadoop component you need] (see the Debian package search).

Squeeze

If you want to speed up the process of Squeeze being released as the stable version: help fix the remaining bugs in that distribution. There are various Debian Bug Squashing Parties being organised around the world. The next one in Berlin is next Monday; the one for Munich is running this weekend. I also just got the information that Fefe posted a link to the Mozilla bug bounty in his blog.

The packages are based on the upstream Apache Hadoop distribution. Being comparatively new, they are intended for development machines at the moment. If you are using Debian and want to work with Hadoop, this is a great opportunity to help make the packages more stable by simply using them and reporting your experiences back to the Debian community.

In addition Debian now also provides packages for ZooKeeper as well as HBase - though the HBase version is not yet production ready, as the HDFS append patch is still missing.

To follow the general state and progress of these packages, feel free to watch the package pages for Hadoop, HBase and ZooKeeper respectively.

Thomas currently plans to work more closely with upstream, e.g. to tidy up the chaos in the start-up scripts and fix other minor glitches. So watch out for further improvements.

In addition I just saw another interesting ITP in the Debian bugtracker: Wishlist: katta. I am sure there are quite a few others as well.

Berlin Buzzwords - Early bird registration

2010-04-10 15:02
I would like to invite everyone interested in data storage, analysis and search to join us for two days on June 7/8 in Berlin for Berlin Buzzwords - an in-depth, technical, developer-focused conference located in the heart of Europe. Presentations will range from beginner-friendly introductions to the hot data analysis topics up to in-depth technical presentations of scalable architectures.

Our intention is to bring together users and developers of data storage, analysis and search projects. Meet members of the development team working on projects you use. Get in touch with other developers you may know only from mailing list discussions. Exchange ideas with those using your software and get their feedback while having a drink in one of Berlin's many bars.

Early bird registration has been extended until April 17th - so don't wait too long.

If you would like to submit a talk yourself: conference submission is open for a little more than one more week. More details are available online in the call for presentations.

Looking forward to meeting you in the beautiful, vibrant city of Berlin this summer for a conference packed with high profile speakers, awesome talks and lots of interesting discussions.

Bob Schulze on Tips and patterns with HBase

2010-03-24 03:41
At the last Hadoop Get Together in Berlin Bob Schulze from eCircle in Munich gave a presentation on “Tips and patterns with HBase”. The talk has been video recorded. The result is now available online:

HBase Bob Schulze from Isabel Drost on Vimeo.



Feel free to share and distribute the video. Thanks to Bob for an awesome talk on eCircle's usage of HBase - and for providing some background information on how HBase was applied to solve their problems.

Another thanks to Nokia for sponsoring the video taping - and to newthinking for providing the location for free.

Looking forward to Berlin Buzzwords in June. Early bird registration is open, and several great talk proposals have already been submitted. If you are a Hadoop Get Together visitor (or even speaker) and would like to have a community ticket, please contact me.

Seminar on scaling learning at DIMA TU Berlin

2010-03-17 21:10
Last Thursday the seminar on scaling learning problems took place at DIMA at TU Berlin. We had five students give talks.

The talks started with an introduction to MapReduce. Oleg Mayevskiy first explained the basic concept, then gave an overview of the parallelization architecture and finally showed how computations can be formulated as MapReduce jobs.

His paper as well as his slides are available online.
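
To make the MapReduce idea concrete, here is a minimal word count sketch against the Hadoop 0.20-style Java API (this is my own illustration, not taken from the talk): the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: for each input line, emit (word, 1) for every token.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (token.length() > 0) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reducer: sum up all counts emitted for the same word.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
            sum += value.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework takes care of splitting the input, shipping the map and reduce tasks across the cluster and sorting the intermediate (word, count) pairs - which is exactly the parallelization architecture the talk covered.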

Second was Daniel Georg, who worked on the rather broad topic of NoSQL databases. Since the topic is far too broad to be covered in one 20 minute talk, Daniel focussed on distributed solutions - namely Bigtable/HBase and Yahoo! PNUTS.

Daniel's paper and slides are also available online.
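
For a feel of what working against such a distributed store looks like from the client side, here is a rough sketch using an HBase 0.20-era Java client API (the table name "webtable", the column family and the values are made up for illustration, and API details differ between HBase versions):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientSketch {
      public static void main(String[] args) throws Exception {
        // Reads cluster settings from hbase-site.xml on the classpath.
        HBaseConfiguration conf = new HBaseConfiguration();

        // "webtable" with column family "content" is assumed to exist already,
        // e.g. created via the HBase shell.
        HTable table = new HTable(conf, "webtable");

        // Store a value: row key + column family + qualifier + value.
        Put put = new Put(Bytes.toBytes("com.example/index.html"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
            Bytes.toBytes("<html>...</html>"));
        table.put(put);

        // Read it back by row key.
        Get get = new Get(Bytes.toBytes("com.example/index.html"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
        System.out.println(Bytes.toString(value));

        table.close();
      }
    }

The Bigtable data model behind this - rows sorted by key, grouped into column families, partitioned across region servers - is what both HBase and PNUTS-style systems build on.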

Third was Dirk Dieter Flamming on duplicate detection. He concentrated on algorithms for near duplicate detection, which are needed when building information retrieval systems that work with real world documents: the web is full of copies, mirrors, near duplicates and documents made of partial copies. The important task is to identify near duplicates, not only to reduce the amount of data stored but also to potentially be able to track original authorship over time.

Again, paper and slides are available online.
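
As a small illustration of the kind of technique covered (my own sketch, not necessarily the algorithms from the talk), a common approach compares documents via the Jaccard similarity of their word shingles; real systems put hashing or minhashing on top of this to scale, and the shingle size and threshold below are arbitrary choices.

    import java.util.HashSet;
    import java.util.Set;

    public class NearDuplicateSketch {

      // Break a document into overlapping word n-grams ("shingles").
      static Set<String> shingles(String text, int size) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + size <= words.length; i++) {
          StringBuilder shingle = new StringBuilder();
          for (int j = i; j < i + size; j++) {
            if (j > i) shingle.append(' ');
            shingle.append(words[j]);
          }
          result.add(shingle.toString());
        }
        return result;
      }

      // Jaccard similarity: |intersection| / |union| of the shingle sets.
      static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
      }

      public static void main(String[] args) {
        String d1 = "The quick brown fox jumps over the lazy dog";
        String d2 = "The quick brown fox jumped over a lazy dog";
        double similarity = jaccard(shingles(d1, 3), shingles(d2, 3));
        // Documents above some threshold (e.g. 0.7) are treated as near duplicates.
        System.out.println("similarity = " + similarity);
      }
    }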

After a short break, Qiuyan Xu presented ways to learn ranking functions from explicit as well as implicit user feedback. Any interaction with a search engine provides valuable feedback about the quality of the current ranking function. Watching users - and learning from their clicks - can help improve future ranking functions.

A very detailed paper as well as slides are available for download.
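
One common way to turn implicit feedback into training data (roughly following Joachims' work on clickthrough data; whether the talk used exactly this is not something I can vouch for) is to derive pairwise preferences: a clicked result is assumed to be preferred over results ranked above it that were skipped. A small sketch with made-up types:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ClickPreferences {

      // A single pairwise training example: "preferred" should rank above "other".
      static class Preference {
        final String preferred;
        final String other;
        Preference(String preferred, String other) {
          this.preferred = preferred;
          this.other = other;
        }
        public String toString() {
          return preferred + " > " + other;
        }
      }

      // ranking: result list as shown to the user (best first);
      // clicked: the results the user clicked on.
      static List<Preference> extract(List<String> ranking, Set<String> clicked) {
        List<Preference> preferences = new ArrayList<Preference>();
        for (int i = 0; i < ranking.size(); i++) {
          if (!clicked.contains(ranking.get(i))) {
            continue;
          }
          // The clicked result is preferred over every skipped result above it.
          for (int j = 0; j < i; j++) {
            if (!clicked.contains(ranking.get(j))) {
              preferences.add(new Preference(ranking.get(i), ranking.get(j)));
            }
          }
        }
        return preferences;
      }

      public static void main(String[] args) {
        List<String> ranking = Arrays.asList("doc1", "doc2", "doc3", "doc4");
        Set<String> clicked = new HashSet<String>(Arrays.asList("doc3"));
        // Yields: doc3 > doc1, doc3 > doc2 - input for a pairwise ranking learner.
        System.out.println(extract(ranking, clicked));
      }
    }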

The last talk was by Robert Kubiak on topic detection and tracking. It presented methods for identifying and tracking upcoming topics, e.g. in news streams or blog postings. Given the amount of new information published digitally each day, such systems can help with following interesting news topics or by sending notifications on new, upcoming topics.

Paper and slides are available online.
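
As a very rough illustration of the underlying idea (not the specific algorithms presented): terms whose frequency in the current time window jumps relative to a background window can be flagged as candidates for an emerging topic. The windows and the threshold here are arbitrary.

    import java.util.HashMap;
    import java.util.Map;

    public class BurstDetectionSketch {

      // Count how often each term occurs across a batch of documents.
      static Map<String, Integer> termCounts(String[] documents) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String doc : documents) {
          for (String term : doc.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            Integer old = counts.get(term);
            counts.put(term, old == null ? 1 : old + 1);
          }
        }
        return counts;
      }

      public static void main(String[] args) {
        // Background window (e.g. last month) vs. current window (e.g. today).
        String[] background = {"markets stable today", "weather sunny in berlin"};
        String[] current = {"volcano ash cloud grounds flights",
                            "ash cloud closes airports", "flights cancelled over ash"};

        Map<String, Integer> before = termCounts(background);
        Map<String, Integer> now = termCounts(current);

        // Flag terms whose smoothed frequency ratio exceeds an arbitrary threshold.
        double threshold = 3.0;
        for (Map.Entry<String, Integer> entry : now.entrySet()) {
          Integer oldCount = before.get(entry.getKey());
          double ratio = (entry.getValue() + 1.0) / ((oldCount == null ? 0 : oldCount) + 1.0);
          if (ratio >= threshold) {
            System.out.println("emerging term candidate: " + entry.getKey());
          }
        }
      }
    }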

If you are a student in Berlin interested in scalable machine learning: the next IMPRO2 course has been set up. As last year, the goal is not only to improve your skills in writing code but also to interact with the community and, if appropriate, to contribute back the work created during the course.

Slides are available

2010-03-11 00:49
Slides for the last Hadoop Get Together are available online:



Videos will follow as soon as they are ready. Watch this space for further updates.

Apache Hadoop Get Together March 2010

2010-03-11 00:40
Today (or more correctly, yesterday) the March 2010 Hadoop Get Together took place in the newthinking store. I arrived rather early to have some time for Berlin Buzzwords planning - I got there nearly an hour before the meetup. However, it did not take very long until the first guests arrived. So I quickly got my introductory slides in place - Martin from newthinking already had the room set up, camera in place and audio working.

When the meetup started the room was already packed with some 60 people - we ended up with over 70 people interested in the mix of talks on Hadoop, HBase and spatial search with Lucene and Solr. Doing the regular "Who are you" round, we learned that there were people from nurago, Xing, StudiVZ, *lots and lots* of people from Nokia, Zanox, eCircle, nugg.ad and many others.

The meetup was kindly supported by newthinking store (providing the venue for free) and Nokia (sponsoring the videos). Steffen Bickel took the chance during the introduction to give a brief overview of Nokia and - guess what - explain that Nokia is a great place to work and yeah, they are hiring!

The first talk was given by Bob Schulze, who joined the meetup coming from eCircle in Munich. Drawing on his experience with scaling their infrastructure beyond a regular database/data warehouse setup, he explained how HBase helped when processing really large amounts of data. Being an e-mail marketing provider, eCircle does have quite a bit of data to process. And yes, eCircle is hiring.

The second talk was by Dragan Milosevic from Zanox on scaling product search and reporting with Hadoop. Just like eCircle, Zanox came from a regular RDBMS setup that became too expensive and too complex to scale before switching over to a Hadoop/Lucene stack. He used his chance to make the Lucene developers aware of the fact that there are users who are actually using Lucene's compression features. Zanox, as well, is looking for people to hire.

The last talk was by Chris Male from JTeam in Amsterdam on the developments in Lucene and Solr to support spatial search. There are various development routes being followed: Cartesian tiers as well as numeric range searches. He also explained that most of the features are still under heavy development. He finished his talk with a demo of what can be done with spatial search in Lucene/Solr. You already guessed it, JTeam is hiring as well ;)
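
To give an idea of the numeric range route (the Cartesian tier approach works differently), here is my own sketch of a bounding box query built from two numeric range queries on latitude and longitude fields, assuming a Lucene 3.0-era API as it existed around that time; field names and coordinates are illustrative.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class BoundingBoxSketch {
      public static void main(String[] args) throws Exception {
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory,
            new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);

        // Index a point of interest with its coordinates as numeric fields.
        Document doc = new Document();
        doc.add(new Field("name", "newthinking store", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new NumericField("lat", Field.Store.YES, true).setDoubleValue(52.5263));
        doc.add(new NumericField("lon", Field.Store.YES, true).setDoubleValue(13.4020));
        writer.addDocument(doc);
        writer.close();

        // Bounding box around central Berlin: two range queries ANDed together.
        BooleanQuery box = new BooleanQuery();
        box.add(NumericRangeQuery.newDoubleRange("lat", 52.45, 52.60, true, true),
            BooleanClause.Occur.MUST);
        box.add(NumericRangeQuery.newDoubleRange("lon", 13.30, 13.50, true, true),
            BooleanClause.Occur.MUST);

        IndexSearcher searcher = new IndexSearcher(directory);
        TopDocs hits = searcher.search(box, 10);
        for (ScoreDoc hit : hits.scoreDocs) {
          System.out.println(searcher.doc(hit.doc).get("name"));
        }
        searcher.close();
      }
    }

A bounding box like this is only a filter step; ranking by actual distance from a query point is where the more interesting (and still moving) parts of the Lucene/Solr spatial work come in.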

After the talks we went to Cafe Aufsturz for beers, drinks and some food. People enjoyed talking to each other and exchanging experiences. A Lucene-focussed table quickly formed - main topics: spatial search, Lucene/Solr merge threads, heavy committing, Mike McCandless (is this guy real or just an alter ego of the Lucene community?).

Some time around 11 p.m. the core of the guests (well, the Lucene part of the meetup, that is Simon, Uwe and the guys from JTeam) moved over to a bar close by, next to the Central cinema, for some more beer and drinks. At about 1 a.m. it finally was time to head home.

I'd like to say thanks: First of all to the speakers. Without you the meetup would not be possible. Second to newthinking and Nokia for their support. And of course to all attendees for having grown the meetup to its current size.

I had a really nice evening with people from the Hadoop, HBase and Lucene communities. Special thanks to you guys from JTeam for traveling six hours to Berlin just for a "little", though no longer that tiny, Hadoop meetup. The promise stands: I will visit one of your next Lucene meetups in Amsterdam and present Mahout there - however, I need some help finding affordable accommodation ;)

Hope to see you all in June at Berlin Buzzwords.