Archive

Archive for the ‘Hadoop’ Category

Apache Hadoop in Debian Squeeze

July 17th, 2010

After using Mandrake for quite a while (still blaming my boyfriend Thilo for infecting not only my computer but also myself first with that system, then with the more general idea of Free Software - but that’s another story.) after finishing my master’s thesis I started using GNU Debian Linux (back then in the version code-named Woody). Since I always had a GNU Debian on my private box as my main operating system - even installed it on my MacBook following the steps in the Debian Wiki.

As I am also an Apache Mahout committer, closely related to the Apache Hadoop project, I always found it kind of sad that there were no Hadoop packages in the official Debian repositories. I tried multiple times to find some time to get into Debian packaging myself, I learned what “debian/rules” is all about and discovered some of the intricacies of packaging Java based software. However I have to admit that I never was able to find enough time to really finish that task.

A few weeks before this year’s FOSDEM I learned on the Apache Hadoop as well as on the Debian Java lists that a guy called Thomas Koch was working on solving bug 535861 - ITP to package Hadoop. We met at FOSDEM where I tried to raise some attention in the audience for Thomas’ plans (back then he was in need for help with a few last missing pieces). In addition I invited him for Berlin Buzzwords to get in touch with other Hadoop developers and users for further input.

I am really happy that by now Hadoop has made it into the official Debian package repositories - as soon as Debian Squeezeapt-get install [Hadoop component you need]: Debian package search.

Squeeze

If you want to speed up the process of Squeeze being released as stable version: Help fixing the remaining bugs in that distribution. There are various Debian Bug Squashing Parties being organised around the world. Next one in Berlin is on next Monday, the one for Munich is running this weekend. Just got the information that Fefe posted in his blog a link to the Mozilla bug bounty:

The packages are based on the upstream Apache Hadoop distribution, being comparably new they are intended for development machines at the moment. If you are using Debian and want to work with Hadoop - this is a great opportunity to help making the packages more stable by simply using them and reporting your experiences back to the Debian community.

In addition Debian now also provides packages for Zookeeper as well as HBase - though the HBase version is not yet production ready as the HDFS-append patch is still missing.

To follow the general state and progress of these packages feel free to follow the packages pages for Hadoop, HBase, Zookeeper respectively.

Thomas currently plans to work more closely with upstream e.g. to tidy up the chaos in the start-up scripts and other minor glitches. So watch out for further improvements.

In addition I just saw another interesting ITP in the Debian bugtracker: Wishlist: katta. I am sure there are quite a few others as well.

Hadoop , , ,

Scaling user groups

May 26th, 2010

A few hours ago, Jan Lehnardt posted a link on How to organise a nerd conference - joking that this is how we planned Berlin Buzzwords. Well, it is not exactly that easy - however the comic actually is not so far from the truth either:

About two years ago, after having started Apache Mahout together with Grant Ingersoll, Karl Wettin and others, several Apache Hadoop user groups, meetups and get togethers started to pop up all around the world. The one closest to me was the Hadoop user group UK. Back in 2008 I was pretty envious to all these user groups - being so distributed, there was no way I could ever attend all of them, though talks were certainly interesting. So the naive thought of a back then naive free software developer was: Let’s have that in Berlin. To have initial talks I called Stefan Groschupf. His answer was very positive: Oh yeah, let’s do this. I am in Germany for another two weeks, so it should be at about that timeframe. We agreed that if no-one showed up, we could still have some pizza together and share insights from our projects.

For the venue I knew from regular meetups of the Free Software Foundation Europe - read FSF*E* - that newthinking store was available for free for meetups for devs of free software. On I went, calling Martin from the store, booked the room. After that some mails went to the usual suspects, mailing lists and such. At the first meetup two years ago, more than 15 attendees - with two more people who had prepared slides. Pizzas obviously had to wait a little.

If you are wondering what that looked like back then - Thanks to Martin for taking the image back then and putting it online.



We (as in all attendees) decided to repeat the exercise three months later*, talks for the next time were proposed during that first session. Noone objected to having it in Berlin again - everyone knew this was the only way to avoid having to do the organization next time.

The meetup grew steadily in size, talks started being proposed three to six months in advance. I ended up creating not only a mailing list for the meetup but also a blog so I could publish news on Jan’s CouchDB talk and Lars George’s HBase talk back then. We got video sponsoring from Cloudera (Thanks Christophe), StudiVZ (Thanks Nils), and Nokia (Thanks Matt). Late last year I did the first European NoSQL meetup together with Jan Lehnardt - 80 attendees, lots of potential for more, the newthinking store obviously a bit too small for that :)

If you are wondering what NoSQL and Hadoop meetups looked like last time:


During that meetup the idea was born for a larger NoSQL conference in Berlin in 2010. First ideas were tossed around together with Jan and Simon Willnauer during Apache Con US in Oakland. The topic Hadoop got added there. In January 2010 finally Lucene was added to the mix. We contacted newthinking for support - got a very warm welcome.

Now - two years after the first Apache Hadoop Get Together Berlin we are proud to host Berlin Buzzwords - focussed on NoSQL, Apache Hadoop and search as in Apache Lucene.The conference is co-organised by newthinking communications, Simon Willnauer, Jan Lehnardt and myself. A big thanks to neofonie for supporting me by making it possible that I could do most of the organisation during my regular working hours.

The speaker lineup looks fantastic. Registration is going very well - exceeding expectations (did I mention that registration is still open, group and student tickets still available?).

I am really looking forward to an amazing conference on 7th and 8th of June. We will have a NoSQL barcamp in newthinking store Sunday evening before the conference. Keynote speaker packages have been sent out and were well received. Hotel rooms for speakers are booked. We are about to pull together the last loose ends in the coming days. Happy to have so many guys (and a few girls) interested in scalability topics here in town at the beginning of June. Looking forward to seeing you in Berlin.

* The second meetup turned out to be the first and so far only one that took place w/o the organiser - I broke my leg on my way to newthinking by getting hit by a BMW X5… *sigh* Note for other meetup organizers: Always have a backup moderator - in may case that was my neofonie manager Holger Düwiger who happened to attend that meetup for the first time back then.

Apache Hadoop Get Together Berlin, Berlin Buzzwords, Hadoop, Lucene, Mahout , , , , ,

Building a Hadoop Job Jar with Maven

March 11th, 2010

Put here as a reminder, so I do not forget about it. There is a really nice tutorial online on Building Hadoop Job with Maven.

Hadoop , ,

Alan Atlas at Scrumtisch Berlin

February 17th, 2010

At the last Berlin Scrumtisch, @AlanAtlas gave a presentation on how he introduced Scrum at Amazon (starting as early as back in 2004). Introducing Scrum at Amazon by that time seemed natural due to a few factors:

  • Amazon was and is always very customer centric. The original methodology of working backwards in time - that is starting with the press release, from there writing the FAQ and manual and finally get to the code - really made people concentrate on the product.
  • Teams were sized according to the two-pizza rule: Each team is supposed to be no larger than can reasonably be fed for lunch with two pizzas. Turns out that is about five to ten people, given regular american pizzas.
  • Management didn’t really care exactly *how* the job of software development was done. They only cared *that* it was done. This proved as both - an advantage as it gave quite a bit of freedom - and a disadvantage - as indifference leads to impediments in the process that cannot be easily resolved.

The goal of Alan’s team was to build an infinitely scalable, zero downtime storage system that had a web service based interface - it was the S3 team. Given the task at hand, it was clear that the task wasn’t going to be anything even close to trivial. Alan’s idea: Try Scrum on that.

He himself came to the idea after attending an Agile conference and hearing about Scrum for the first time there. Alan requested a Scrum training from his manager, who approved it, provided it took place in Seattle - close to Amazon HQ that is. Turns out, the Scrum trainer available in Seattle happened to be Ken Schwaber.

The idea of Scrum basically was spread by word of mouth propaganda: In the hallways, in the smokers lounge, close to the coffee machine. People started adopting it - with some teams doing better, some worse. Allan himself got two days a quarter to give people an introduction to this funny new methodology called Scrum.

18 months later, S3 was shipped - as a side effect, the number of subscribers to the internal scrum mailing list had increased from three to 100, there were 150 CSM graduates, still three teams using XM, reputation of scrum was mostly positive.

In January 2007 others started giving their own introductions to Scrum. Developers generally liked it, there were significant failure cases, lots of resistance and misunderstanding mostly on the management side who sometimes confused it with lean production.

In June 2008 Allen became a fulltime internal Scrum trainer. First thing organised was a Scrum gathering - in an open spaces like setup people were invited to discuss their issues and questions on Scrum.

In January 2009, about 50% Scrum usage was reached, their were 600 subscribers on the Scrum mailing list, 450 CSM graduates, reputation of Scrum was as good (or bad) as months before. However it became more and more clear that further restructuring would only be possible against a lot of resistance.

There were a few lessons learned from introducing Scrum in other companies as well:

  • You cannot introduce Scrum if there is no notion of comparably stable teams (as in for at least six months, not five projects at a time per developer).
  • You need permission and ownership of committments to really get going.
  • You need to know at least the basics of Scrum.

There are a few factors that make introduction easier: Community matters. People need to be able to talk to each other, exchange experiences and learn from one another. Coaching matters. More support, immediate success in lots of cases is the result of getting a coach on the team. Credibility for management increases, Scrum implementation is easier understood that way, skepticism and resistance reduced. Management matters. Middle management must be part of the process. They will mess up scrum if not educated correctly. Going bottom-up works up to a point - but to go the whole way, you do need management.

For Amazon, that is the point where implementation got stuck for Allan: Management was neither interested nor really involved in Scrum. However there are some impediments that cannot be fixed w/o. Still Scrum is working to some extend - teams are still trying to get better and improve the process. So it is not really “Scrum, but”. Even going only part of the way helped a lot already.

As for the questions: There were a few unusual ones - e.g. on what to do with a team that is itself skeptical about Scrum introduction - especially if that was done by higher management. There are a few ways to remedy that: Point out success to the negative team member. Make that member a Scrum master to let him go through the process and really understand what is going on. And above all make the reasons for going that way clear and transparent. Including promises and wishes from that change.

Another question dealt with how to convince management of Scrum. Clearly better performing teams with happy developers (”you cannot buy developers with money only”) are some valid reasons.

Yet another question dealt with convincing customers. Allan’s way is to sneak Scrum into the process: Speak the customers language. If he does not like spring planning, call it a project status meeting.

Asked for cases of developers feeling like they work in a hamster wheel when doing Scrum, the general consensus was that it needs to be made sure that people do not only get more responsibilities but get the benefits from Scrum as well. Otherwise Scrum with all its numbers and measurements can be perfectly abused and turned into an amazing micro management tool.

Asked for other bay area companies using Scrum - yes, apple does so, Microsoft at least tries to. Google certainly uses the ideas, without calling it Scrum. Facebook seems not to use it. Xing (not bay area) uses Scrum. There seem to be about 50% of all software development companies worldwide who are using Scrum.

After the talk there was time to gather and have some pizza and pasta, time for drinks and discussions. I really appreciated the comments and ideas exchanged after the talk. Hope to see you all next time around.

Hadoop, Scrum , ,

FOSDEM - video recordings online

February 14th, 2010

As published in the FOSDEM blog the video recordings are available online - at least for the main track and the lightning talks. Happy video watching!

Events, Freetime, Hadoop , , ,

Berlin Buzzwords - June 2010

February 11th, 2010

As announced at FOSDEM: Early June (currently scheduled for 7th/8th) a conference on the topics scalable search, storage and processing will take place in Kalkscheune/Berlin. The conference is co-organised by newthinking store, Jan Lehnardt, Simon Willnauer, Thilo Fromm, and Isabel Drost.

The focus will be on NoSQL databases like CouchDB, Jackrabbit, MongoDB, HBase. Search tracks will cover topics like Lucene, Solr, katta and others. Data munging tracks will focus mainly on Hadoop, MapReduce in general and distributed systems.

More information including the call for presentations will be made available online next week on a separate webpage. Early registration starts in March. Watch this blog for more information or follow @hadoopberlin.

Berlin Buzzwords, Events, Hadoop , , , ,

Hadoop trainings in Europe

February 2nd, 2010

Recently I received this mail from Christophe Bisciglia on Cloudera Hadoop trainings. Thought it might be interesting to the Hadoop Berlin community:

Hadoop Fans,

Over the next year, you’ll see new options for Hadoop training and
certification from Cloudera. One of the first things you’ll see will
be live sessions outside the US, tentatively planned for the April /
May time frame.

We’ve seen strong interest in Hadoop on all of our international
trips, so we’d like to ask for community input as we decide exactly
which cities to visit next. For cities we come to, we’ll offer our 3
day developer training + certification, and with sufficient interest,
we may also include a 1 day training + certification program for
system administrators.

If you are interested in attending one or both of these sessions,
please fill out a brief survey (link below). If you’re using Hadoop at
work, and it’s time to train more of your team, you can let us know
how large of a group you have. Survey responses aren’t a commitment to
attend, but we may reach out to respondents before we schedule a
session to get a better understanding of actual attendance.

You can fill out survey here: http://www.surveymonkey.com/s/MKGZHG9

If you have any trouble with the survey, or are interested in a
private training session, please don’t hesitate to reach out directly.

Cheers,
Christophe

Apache Hadoop Get Together Berlin, Hadoop ,

Shopping at Ikea

February 1st, 2010

Some weeks ago, Thilo had a tiny little gadget not to be missed in an average geek’s appartment: A server - admittedly a little old and a bit slow, but still usable for playing around. He installed Ubuntu server on it. At the evening we got it configured to run Hadoop. Little later we found out that some friends of us probably, maybe have some usable hardware left as well - we’ll see on Monday.

However having a server on your dinner table is not really practical: There’s always some danger of spilling tea over it… However last week, one of my colleagues posted a link to the Lack Rack wiki page in the eth-0 Wiki on one of our mailing lists.

So yesterday was one of the (very rare) days, when I got Thilo to join me on a trip to Ikea. The result can be seen in the images above. Looks like elephants invaded our living room ;)

Hacking, Hadoop , ,

Hadoop at Heise c’t

January 31st, 2010

<surreptitious_advertising>
Interesting for those readers speaking German: Heise published an introductory article on Hadoop in its latest issue. Have fun reading.
<surreptitious_advertising/>

Thanks to Simon for proof-reading and providing valuable input. Thanks to Thilo Fromm for the hadoop graphics (unfortunately none of them got published in its original form), the catchy title, proof-reading the text over and over again and for keeping me sane during several past and coming months.

If you want to know more on Apache Hadoop, come watch my FOSDEM Hadoop talk next weekend. If you want to join discussions on Apache Hadoop and Lucene, stay tuned for a conference in Berlin on these topics.

Apache Con, Apache Hadoop Get Together Berlin, Events, Hadoop , , ,

With a little help from my friends

December 31st, 2009

The end of the year 2009 is quickly approaching. To me it feels a little like it ran away far too quickly. So instead of taking part in the annual review of past events, I would like to use it as an opportunity to say thank you: The past twelve months were a lot of fun with lots of interesting, nice people from all over the world. I got the chance to meet quite a bit of the Mahout community, I got lots and lots of new developers from all over Germany - or more precisely the EU - to attend the Apache Hadoop Get Together in Berlin. The interest in Mahout has grown tremendously over the past year.

All of this would not have been possible without the help of many people: First of all I’d like to thank Thilo Fromm - for making me happy whenever I was disappointed, for solacing me when I when I was sad, for patiently listening to me nervously whining before each and every talk, for kindly reviewing my slides and last but not least for helping me fix some of the problems that bugged me. Oh - and, thanks for helping me fix the issue in the zookeeper c-client within minutes that puzzled me for days.

Another big Thanks goes to family, first and foremost my mum, who kindly took care of organizing quite a bit of my paperwork and kept me on schedule with so many “unimportant” tasks like getting an appointment with some hospital to finally get the screws taken out of my knee ;)

A special thanks goes to the growing Mahout community as well as to the Lucene people - you know, who you are - keep up the great work: You rock!

Furthermore there are students at TU Berlin who have shown that with Mahout it is “dead-simple” to write an application that, given a stream of documents, groups them by topic and makes the result searchable in Solr. Thanks to you for solving the minor and major problems, for communicating with the community, for transparently communicating problems. Looking forward to continue working together with you next year.

Finally a big thank you to all of the speakers, sponsors and attendees of the Apache Hadopp Get Together, the NoSQL conference and the Apache Dinner Berlin - without you these events would never have been possible. Looking forward to seeing you again in January/ March 2010!

I hope I didn’t forget too many people - just in case: I am pretty grateful for all the input, help and feedback I got this year.

PS: Another thanks to the spaceboyz visiting Berlin for 26C3 for helping Thilo tidy up our apartment after Congress was over this year ;)

Hadoop, Lucene, Mahout , , , ,