Hadoop at Heise c't

2010-01-31 13:37
Interesting for readers who speak German: Heise published an introductory article on Hadoop in its latest issue. Have fun reading.

Thanks to Simon for proof-reading and providing valuable input. Thanks to Thilo Fromm for the Hadoop graphics (unfortunately none of them got published in their original form), the catchy title, proof-reading the text over and over again, and for keeping me sane during the past several months and the ones to come.

If you want to know more about Apache Hadoop, come watch my FOSDEM Hadoop talk next weekend. If you want to join discussions on Apache Hadoop and Lucene, stay tuned for a conference in Berlin on these topics.

ApacheCon Oakland Roundup

2009-11-19 20:15
Two weeks ago, ApacheCon US 2009 ended in Oakland, California. Shane published a set of links to articles with information on what happened at ApacheCon. Some of them were officially published by the Apache PRC project; others are write-ups by individuals about which talks they attended and which topics they found particularly interesting.

Apache Con US Wrap Up

2009-11-16 22:10
Some weeks ago I attended ApacheCon US 09 in Oakland, California. In the meantime, videos of one of the sessions have been published online:

You can find a wrap up of the most prominent topics at the conference at heise (unfortunately Germany-only).

By far the largest topics at the conference:
  • Lucene - there was a meetup with over 100 attendees as well as two main tracks with Lucene-focused talks. The new features of Lucene 2.9.* were the center of interest: the new range search capabilities, segment search that improves caching, a new token stream API that makes annotating terms more flexible, as well as a lot of performance improvements. Shortly after the conference, Lucene 2.9.1 and Solr 1.4 were released, so end-users switching to the new versions now benefit from better performance and several new features.
  • Hadoop - large-scale data processing is currently one of the biggest topics, be it logfile analysis, business intelligence, or ad-hoc analysis of user data. Hadoop was covered by a user meetup as well as one track on the first conference day. The track started with an introduction by Owen O'Malley and Doug Cutting and continued with talks on HBase, Hive, Pig, and other projects from the Hadoop ecosystem.
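Logfile analysis of the kind mentioned above maps naturally onto MapReduce. As a minimal sketch - plain Python standing in for a Hadoop Streaming job, with a made-up simplified log format - counting requests per HTTP status code could look like this:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (status_code, 1) per log line; assumes a simplified
    # space-separated format where the status code is the last field.
    parts = line.split()
    if parts:
        yield parts[-1], 1

def reducer(pairs):
    # Hadoop delivers map output sorted by key; group by key and sum.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, sum(count for _, count in group)

if __name__ == "__main__":
    logs = ["GET /index.html 200", "GET /missing 404", "GET / 200"]
    pairs = [kv for line in logs for kv in mapper(line)]
    print(dict(reducer(pairs)))  # {'200': 2, '404': 1}
```

In a real Hadoop Streaming job, the mapper and reducer would be separate scripts reading from stdin, with the framework handling the sort-and-shuffle step between them across the cluster.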

But projects like Apache Tomcat and Apache HTTPD were also well covered, with one to two sessions each.

Currently, a hot topic within the foundation is the challenge of bringing the community together face-to-face. Apache projects have become so numerous that covering them all within 3+2 days of conference and trainings no longer seems feasible. One way to mitigate this might be to motivate people to run more local meetups, potentially supported by ConCom, as has already happened in the Lucene and Hadoop communities. A related topic is the task of community building and community growth within the ASF. Google Summer of Code has been a great way to integrate new people; however, the model does not scale that well for the foundation. With ComDev, a new project was founded with the goal of working on community development issues: talking to researchers and getting students into open source early on. The project is largely supported by Ross Gardler, who already has experience teaching and promoting open source and free software in a research context as part of the OSS Watch project in the UK.

Apache Con US 09 brought together a large community of Apache software developers and users from all over the world who gathered in California, not only for the talks but also for face-to-face communication, coding together and exchanging ideas.

Update: Slides of my Mahout talk are now online.

Lucene Meetup Oakland

2009-11-04 06:05
Though it is pretty late in the evening, the room is packed with some 100 people, most of them Solr or pure Lucene Java users. There are quite a few Lucene committers at the meetup from all over the world. Several have even heard about Mahout - some have even used it :)

Some introductory questions about index sizes and query volume: 1 million documents seems pretty standard for Lucene deployments - several people run 10 million as well. Some even use indexes with up to billions of documents in Lucene - but at low query volume. Usually people run projects at about 10 queries per second, though rates go up to 500.

Eric's presentation gives a nice introduction to what is going on with Lucene/Solr in terms of user interfaces. He starts with an overview of the problems libraries face when building search engines - especially the faceting side of things. Particularly interesting seems Solaritas - a Velocity response writer that makes it easy to render search responses not as XML but as simple templated output. He of course also included an overview of the features of LucidFind, the Lucid-hosted search engine for all Lucene projects and sub-projects. Take-home message: the interface is the application, as are the URLs. Facets are not just for lists.

The second talk is given by Uwe, with an overview of the implementation of numeric searches, range queries, and numeric range filters in Lucene.
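The core idea behind Lucene's numeric ranges is to index each number at several precisions, so that a large range decomposes into only a handful of terms instead of one term per possible value. A toy sketch of that multi-precision idea in plain Python - not Lucene's actual API; the function and parameter names here are made up for illustration:

```python
def precision_terms(value, bits=8, step=2):
    """Index an integer at several precisions by progressively zeroing
    its low-order bits; the shift amount is stored alongside each term.

    A range query can then match a few coarse (high-shift) terms for the
    middle of the range and fine terms only at its edges, instead of
    enumerating every value inside the range.
    """
    return [(shift, value >> shift) for shift in range(0, bits, step)]

# 181 == 0b10110101 is indexed at four precisions:
print(precision_terms(181))  # [(0, 181), (2, 45), (4, 11), (6, 2)]
```

Each indexed number costs a few extra terms at write time, in exchange for range queries that touch far fewer terms at search time.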

The third presenter is Stefan on katta - a project on top of Lucene that adds index splits, load balancing, index replication, failover, and distributed TF-IDF. The mission of katta is to build a distributed Lucene for large indexes under high query load. The project relies heavily on ZooKeeper for coordination and uses Hadoop IPC for search communication.

Lightning talks include:

  • Zoie: a realtime search extension for Lucene, developed inside LinkedIn and now open-sourced on Google Code.
  • Jukka proposed a new project: a Lucene-based content management system.
  • The next presenter highlighted the problem of document-to-document search, where queries are not just one or two terms but more like 40.
  • The next talk shared some statistics: an average response time of more than 2s leads to a 40% abandonment rate for sites. The presenter is very interested in the Lucene Ajax project. Before using Solr, the speaker set up projects with solutions like Endeca or Mercato. To him, Solr is an alternative that supports faceting.
  • Andrzej gives an overview of index pruning in 5 minutes - with details on which approaches are currently being discussed in research as well as in the Lucene JIRA.
  • The next talk was on Lucy - a Lucene port to C.
  • Another talk gave an overview of findings from analysing the Lucene community.
  • One more lightning talk came from someone using and deploying Typo3 pages. Typo3 does come with an integrated search engine; the presenter's group built an extension that integrates the CMS with Solr search.
  • The final talk was given by Grant Ingersoll on Mahout. Thanks for that!

Big Thanks to Lucid for sponsoring the meetup.

Hadoop Get Together Berlin @ Apache Con US Barcamp

2009-11-03 21:05
This is my first real day at ApacheCon US 2009. I arrived yesterday afternoon and was kept awake by three Lucene committers until midnight: "Otherwise you will have a very bad jetlag"... Admittedly, it did work out: I slept like a baby until about 8:00 a.m. the next morning and am not that tired today.

Today, Hackathon, trainings, and BarCamp Apache happen in parallel. Ross Gardler tricked me into doing a presentation about my experiences running local user meetups. I put the slides online.

The general consensus was that it is actually not that hard to run such a meetup - at least if you have someone local to help organize it, or if you do it in a town you know very well. There are ways to get support from the ASF for such meetups - people help you find speakers, talk to potential sponsors, or find a location. In my experience, if you run the event in your hometown, finding a location is not that hard: either you are lucky enough to have someone like newthinking store around, or you can contact your local university or even your employer to find a conference room that you can use for free.

Getting the first two to three meetups up and running - especially finding speakers - is hard. However, being part of an Apache project already should help: you probably know your community and who would be willing to speak at one of those meetups. Once the meetup is well established, it should be possible to find sponsors to pay for video taping, free beer, and pizza.

Keep in mind that having a fixed schedule ready in advance helps attract people - it's always good to know why one should travel to the meetup by train or plane. Don't forget to plan time for socializing after the event - having some beer and maybe food together makes it easy for people to connect after the meetup.

Apache Con US - Program up

2009-10-29 07:35
The final program is available for download at http://us.apachecon.com. The schedule is packed with interesting talks on Hadoop, Lucene, Tomcat, httpd, web services, and OSGi. For the less tech-savvy, there is a business track explaining how best to use open source software in an enterprise environment. There is also a community track explaining what makes open source projects successful.

Looking forward to seeing you in Oakland.

Apache Con drawing closer

2009-09-04 06:47
In November I will be traveling to Oakland - for me it is the first ApacheCon US ever, and the first ApacheCon at which I will be giving a talk in one of the main tracks:

I will be presenting Apache Mahout: giving an overview of the project and our current status, and explaining which problems can be solved with the current implementation. The talk will conclude with an outlook on upcoming tasks and features our users can expect in the near future.

There is already great news: first commercial users like Mippin are sharing their experiences with Mahout.

Currently, I am looking forward to meeting several Mahout (and Lucene, Hadoop, Solr, ...) committers there. I met some at ApacheCon EU already, but it's always nice to talk in person to people one previously knew only from mailing lists. Of course I am also looking forward to having time to review and write code. Hope to see you there.

Update: Flights booked.

Apache Con Europe 2009 - part 3

2009-03-29 19:56
Friday was the last conference day. I enjoyed the Apache pioneers panel with a brief history of the Apache Software Foundation as well as lots of stories on how people first got in contact with the ASF.

After lunch I went to the testing and cloud session. I enjoyed the talk on Continuum and its uses by Wendy Smoak: she gave a basic overview of why one would want a CI system and provided a brief introduction to Continuum. After that, Carlos Sanchez showed how to use the cloud to automate interface tests with Selenium. The basic idea is to automatically (initiated through Maven) start up AMIs on EC2, each configured with a different operating system, and run Selenium tests against the application under development in them. A really nice system for running automated interface tests.

The final session for me was the talk by Chris Anderson and Jan Lehnardt on CouchDB deployments.

The day ended with the Closing Event and Raffle. A big thank-you to Ross Gardler for including the Berlin Apache Hadoop Get Together at newthinking store in the announcements! I will send the CfP to ConCom really soon, as promised. Finally, I won a package of caffeinated sweets in the Raffle - does that mean less sleep for me in the coming weeks?

Now I am finally back home and have had some time to do a quick writeup. If you are interested in the complete notes, go to http://fotos.isabel-drost.de (the default login is published in the login window). Looking forward to the Hadoop User Group UK on the 14th of April. If you have not signed up yet - do so now: http://huguk.org

Apache Con Europe 2009 - part 2

2009-03-29 19:42
Thursday morning started with an interesting talk on open source collaboration tools and how they can help reduce collaboration overhead in commercial software projects. Four goals can be reached with the help of the right tools: sharing the project vision, tracking the current status of the project, finding places to help the project, and documenting the project history along with the reasons for decisions made along the way. The exact tool used is irrelevant as long as it helps accomplish these four tasks.

The second talk was on cloud architectures, by Steve Loughran. He explained the reasons to move into the cloud and what a typical cloud architecture looks like. Steve described Amazon's offering, mentioned other cloud service providers, and highlighted some options for a private cloud. However, his main interest is in building a standardised cloud stack. Currently, choosing one of the cloud providers means vendor lock-in: your application uses a special API, and your data is stored on special servers. Quite a few of the tools necessary for building a cloud stack are available at Apache (Hadoop, HBase, CouchDB, Pig, ZooKeeper...). The question that remains is how to integrate the various pieces, and extend them where necessary, to arrive at a solution that can compete with AppEngine or Azure.

After lunch I went to the Solr case study by JTeam - basically one big commercial for Solr. They even brought the happy customer to ApacheCon to talk about his experience with Solr from his point of view. Great work, really!

The Lightning Talk session ended the day - with a "Happy birthday to you" from the community.

After having spent the last four days at ApacheCon from 8 a.m. to midnight, I really did need some rest on Thursday and went to bed pretty early: at 11 p.m. ...

Apache Con Europe 2009 - part 1

2009-03-29 18:41
This past week, members, committers, and users of Apache software projects gathered in Amsterdam for another ApacheCon EU - and to celebrate the 10th birthday of the ASF. One week dedicated to the development and use of Free Software and the Apache Way.

Monday was BarCamp day for me - the first BarCamp I ever attended. Unfortunately, not all participants proposed talks, so some of the atmosphere of an unconference was missing. The first talk, by Danese Cooper, was on "HowTo: Amsterdam Coffee Shops": she explained the ins and outs of going to coffee shops in Amsterdam and gave both legal and practical advice. There was a presentation of the OpenStreetMap project, as well as of several Apache projects. One talk discussed transferring the ideas of Free Software to other parts of life. Ross Gardler started a discussion on how to advocate contributions to Free Software projects in science and education.

Tuesday for me meant having some time for Mahout during the Hackathon; specifically, I looked into enhancing matrices with meta information. In the evening there were quite a few interesting talks at the Lucene Meetup: Jukka gave an overview of Tika, and Grant introduced Solr. After Grant's talk, some of the participants shared numbers on their Solr installations (documents per index, query volume, machine setup). To me it was extremely interesting to gain some insight into what people actually accomplish with Solr. The final talk was on Apache Droids, a still-incubating crawling framework.

The Wednesday tracks were a little unfair: the Hadoop track (videos available online for a small fee) ran right in parallel to the Lucene track. The day started with a very interesting keynote by Raghu from Yahoo! on their storage system PNUTS. He went into quite some technical detail. Apparently there is interest in publishing the underlying code under an open source license.

After the Mahout introduction by Grant Ingersoll, I changed rooms to the Hadoop track. Arun Murthy shared his experience tuning and debugging Hadoop applications. After lunch, Olga Natkovich gave an introduction to Pig - a higher-level language on top of Hadoop that lets filter operations, joins, and basic control flow of MapReduce jobs be specified in just a few lines of Pig Latin code. Tom White gave an overview of what it means to run Hadoop on the EC2 cloud and compared several options for storing the data to process. It seems very likely that there will soon be quite a few more providers of cloud services in addition to Amazon.

Allen Wittenauer gave an overview of Hadoop from the operations point of view. Steve Loughran covered the topic of running Hadoop on dynamically allocated servers.

The day finished with a pretty interesting BOF on Hadoop. There are still people who do not clearly see the differences between Hadoop-based systems and database-backed applications. The best way to find out whether the model fits: set up a trial cluster and experiment yourself. No one can tell which solution is best for you except yourself (and maybe Cloudera, setting up the cluster for you :) ).

After that, the Mahout/UIMA BOF was scheduled - there were quite a few interesting discussions on what UIMA can be used for and how it integrates with Mahout. One major take-home message: we need more examples integrating the two. We developers see the clear connections, but users often do not realize that many Apache projects are best used together to get the most value out of them.