Archive

Archive for April, 2009

Scrum Table with Thoralf Klatt

April 29th, 2009 at 9:19am

On Wednesday, the 22nd of April, about 20 people interested in Scrum gathered in the DiVino in Friedrichshain/Berlin. The event was split in two parts: In the first half we gathered topics participants were interested in, put priorities next the them and discussed the most highly ranked topic: “Scrum in large teams, splitting large tasks across teams.”

The basic take home messages of the discussion:

  • One way to cleanly split a task across teams is to first do a design sprint together, fix the API and then split up. Problem with that: Integration and validation of what you do theoretically up front.
  • Another way is to continously integrate all parts, that way you get direct feedback. Might be impractical without a sort of fixed API though.
  • Do keep in mind that increasing the team exponentially increases management overhead.
  • Do track the progress and performance with well known values (delivered value per sprint, velocity, define KPIs etc.)

The second part of the meetup was covered by the talf of Thoralf from Nokia Siemens networks on how they do scrum across countries and continents. Main interessting points for me:

  • Face to face communication is necessary - good video equipment can help with that.

  • Integrating ready made products into new solutions create new challenges to solve.
  • Transparency and communication with developers become a challenge.

More information on the event can be found on the blog of the round table.

General, Scrum ,

Feedback from the Hadoop User Group UK

April 29th, 2009 at 8:54am

A few weeks after the Hadoop User Group UK is over, there are quite a few postings on the event online. I will try to keep this page updated if there are any further reviews. The one I found so far:

http://huguk.org/2009/04/huguk-2-wrap-up.html - the wrap-up of the event itself.

http://blog.oskarsson.nu/2009_04_01_archive.html - a short summary by the organiser - Thanks again for a great event.

http://www.cloudera.com/blog/2009/04/21/hadoop-uk-user-group-meeting/ - a short summary on the Cloudera blog.

http://people.kmi.open.ac.uk/adam/?p=26 - a quick overview with a Mahout focus by Adam Rae.

General, Hadoop User Group UK, Software Foundation , , ,

June 2009 Apache Hadoop Get Together @ Berlin

April 23rd, 2009 at 7:30pm

Title: Apache Hadoop Get Together @ Berlin
Location: newthinking store Berlin Mitte
Link out: Click here
Description: I just announced the fifth Apache Hadoop Get Together in Berlin at the newthinking store. Torsten Curdt offered to give a talk on data serialization with Thrift and Protocol Buffers.

If you have a topic you would like to talk about: Feel free to just bring your slides - there will be a beamer and lots of people interested in scalable information retrieval.
Start Time: 17:00
Date: 2009-06-25

General, Get Together, Software Foundation , ,

Mahout on EC2

April 21st, 2009 at 9:00pm

Amazon released Elastic Map Reduce only a few weeks ago. EMR is based on a hosted Hadoop environment and offers machines to run map reduce jobs against data in S3 on demand.

Last week Stephen Green has spent quite some effort to get Mahout running on EMR. Thanks to his work Mahout is running on EMR since last Thursday night. Read the weblog of Tim Bass for further information.

Software Foundation , ,

Hadoop User Group UK

April 21st, 2009 at 8:34pm

On Tuesday the 14th the second Hadoop User Group UK took place in London. This time venue and pizza was sponsored by Sun. The room quickly filled http://www.thecepblog.com/2009/04/19/kmeans-clustering-now-running-on-elastic-mapreduce/with approximately 70 people.

Tom opened the session with a talk on 10 practical tips on how to get the most benefit from Apache Hadoop. The first question users should ask themselves is which type of programming language they want to use. There is a choice between structured data processing languages (PIG or Hive), dynamic languages (Streaming or Dumbo), or using Java which is closest to the system.

Tom’s second hint dealt with the size of files to process with Hadoop: Both - too large unsplittable and too small ones are bad for performance. In the first case, the workload cannot be easily distributed across the nodes in the latter case each unit of work is to small to account for startup and coordination overhead. There are ways to remedy these problems with sequence files and map files though. Another performance optimization would be to chain individual jobs - PIG and Hive do a pretty decent job in automatically generating such jobs. ChainMapper and ChainReducer can help with creating chained jobs.

Another important task when implementing map reduce jobs is to tell Hadoop the progress of your job. For once, this is important for long running jobs in order for them to remain alive and not be killed by the framework due to timeouts. Second, it is convenient for the user as he can view the progress in the web UI of Hadoop.

Usual suspects for tuning a job: Number of mappers and reducers, usage of combiners, compression customised data serialisation, shuffling tweaks. Of course there is always the option to let someone else do the tuning: Cloudera does provide support as well as pre-built packages init scripts and the like ;)

In the second talk I did a brief Mahout intro. It was surprising to me that half of the attendees already employed machine learning algorithm implementations in their daily work. Judging from the discussion after the talk and from questions I received after it the interest in the project seems pretty high. The slide I liked the most: The announcement of our first 0.1 release. Thanks to all Mahout committers and contributors who made this possible.

After the coffee break Craig gave an introduction to Terrier an extensible information retrieval plattform developed at the university of Glasgow. He mentioned a few other open IR platforms namely Tuple Flow, Zettair, Lemur/Indri, Xapian, as well as of course nutch/Solr/Lucene.

What does Terrier have to do with the HugUK? Well index creation in Terrier is now based on an implementation that makes use of Hadoop for parallelization. Craig did some very interesting analysis on scalability of the solution: The team was able to achieve scaling near linear in the number of nodes added (at least as long as more than reducer is used ;) ).

After the pizza Paolo described his experiences implementing the vanilla pagerank computation with Hadoop. One of his test datasets was the Citeseer citation graph. Interestingly enough: Some of the nodes in this graph have self references (maybe due to extraction problems), duplicate citations, and the data comes in an invalid xml format.

The last talk was on HBase by Michael Stack. I am really happy I attended HugUK as I missed that talk in Amsterdam at the ApacheCon. First Michael gave an overview of which features of a typical RDBMS are not supported by HBase: Relations, joins, and of course JDBC being among the limitations. On the pro site HBase offers a multiple node solutions that has scale out and replication built in.

HBase can be used as source as well as as sink for map reduce jobs and thus integrates nicely with the Apache Hadoop stack. The framework provides a simple shell for administrative tasks (surgery on sick clusters forced flushes non sql get scan and put methods). In addition the master comes with a UI to monitor the cluster state.

Your typical DBA work though differs with HBase: Data locality and physical layout do matter and can be configured. Michaels recommendation was to start out testing with the XL instance on EC2 and decrease instances if you find out that it is too large.

The talk finished with an outlook of the features in the upcoming release the issues on the todo list and an overview of companies already using HBase.

After talks were finished quite a few attendees went over to a pub close by: Drinking beer, discussing new directions and sharing war stories.

I would to thank Johan Oskarsson for organising the event. And a special thanks to Tom for letting me use his Laptop for the Apache Mahout presentation: the hard disk of mine broke exactly one day before.

Last but not least thank you to Sylvio and Susi for letting me stay at their place - and thanks to Helene for crying only during daytime when I was out anyway ;)

Hope to see at least some of the attendees again at the next Hadoop Meetup in Berlin. Looking forward to the next Hadoop User Group UK.

General, Hadoop User Group UK, Software Foundation , , ,

Announcing Apache Mahout 0.1

April 8th, 2009 at 3:11pm

This morning I received Grant’s release mail of Apache Mahout. I am really happy that after little more than one year we now have our first release out there to test and scrutinate by anyone interested in the project. Thanks to all the committers who have helped make this possible. A special thanks to Grant Ingersoll for putting so much time into getting many release issues out of the way as well as to those who reviewed the release candidates and all the major and minor problems.

For those who are not familiar with Mahout: The goal of the project is to build a suite of machine learning libraries under the Apache license. The main focus is on:

  • Building a viable community that develops new features, helps users with software problems and is interested in the data mining problems Mahout users.
  • Developing stable, well documented, scalable software that solves your problems.

The current release includes several algorithms for clustering (k-Means, Canopy, fuzzy k-Means, Dirichlet based), for classification (Naive Bayes and Complementary Naive Bayes). There is some integration with the Watchmaker evolutionary programming framework. The Taste Collaborative Filtering framework moved to Mahout as well. Taste has been around for a while and is much more mature than the rest of the code.

With this being a 0.1 release we are looking for early adopters that are willing to work with cutting edge software and gain benefits from working closely together with the community. We are seeking feedback on use cases as well as performance numbers. If you are using Mahout in your projects or plan to use it or even only evaluate it - we are happy about hearing back from you on our mailing lists. Tell us what you like, what works well, but do not forget to tell us what you would like to improve. Contributions and Patches as always are very welcome.

For more information see the project homepage, especially the wiki and the Lucene weblog by Grant Ingersoll.

Free Software, Software Foundation ,

GSoC: Student applications.

April 5th, 2009 at 9:57am

Title: GSoC: Student applications closed
Link out: Click here
Description: After this date no more student applications are accepted. Internal ranking at Apache starts 7 days earlier. The ranking process finishes at 16th of April.
Date: 2009-04-03

General, Software Foundation , ,

Students begin coding for their GSoC projects

April 5th, 2009 at 9:54am

Title: Students begin coding for their GSoC projects
Link out: Click here
Description: GSoC time.
Date: 2009-05-26

General, Software Foundation , ,