December 8th, 2010 at 7:48am
A few hours ago the Mahout Meetup at MapR Technologies in San Jose/CA ended. Two photos taken at the event leaked - happy to be able to publish them here.
More information on the discussions and more technical details to follow. Stay tuned.
Mahout
Bay Area, Mahout, MapR
December 5th, 2010 at 9:19pm
The second tutorial was given by Roger Bodamer on MongoDB. It concentrates on being horizontally scalable by avoiding joins and complex, multi document transactions. It supports a new data model that allows for flexible, changeable “schemas”.
The exact data layout is determined by the types of operations you expect for your application, by the access patterns (reading vs. writing data; types of updates and types of queries). Also don’t forget about indexing tables by columns to speed up frequently run queries.
Scaling MongoDB is supported by replication in a master/slave setup quite as any traditional system. In a replica set of n nodes, any of these can be elected as the primary (taking writes). If that one goes down, new master election happens. For durability all writes are required to go to at least a majority of all nodes, if that does not happen, now guarantee is given as to the availability of the update in case of primary failure. Write sharding comes with MongoDB as well.
Java support for Mongo is pretty standard - Raw Mongo driver comes in a Map<..., ... > flavour. Morphia supports Pojo mapping, annotations etc. for MongoDB Java integration, Code generators for various other JVM languages are available as well.
See also: http://blog.wordnik.com/12-months-with-mongodb
My talk was scheduled for 30min in the afternoon. I went into some detail on what is necessary to build a news clustering system with Mahout and finished the presentation by a short overview of the other use cases that could be solved with the various algorithms. In the audience, nearly all had heard about Hadoop before – most likely in the introductory session that same morning. Same for Lucene. Solr was known to about half of the attendees. Mahout to just a few. Knowing that only very few attendees had any Machine Learning background I tried to provide a very high level overview of what can be done with the library, not going into too much mathematical details. There were quite a few interested questions after the presentation – both online and offline, including requests for examples on how to integrate the software with Solr. In addition connectors for instance to HBase as a data-source were interesting to people. Show-casing integration of Mahout, possibly even providing not only Java- but also REST interfaces might be one route to easier integration and faster adoption of Mahout.
Mahout
Mahout, mongodb, NOSQL
December 3rd, 2010 at 9:16pm
With 3000 attendees Devoxx is the largest Java Community conference world-wide. Each year in autumn it takes place in Antwerp/ Belgium, in recent years in the Metropolis cinema. The conference tickets were sold out long before doors were opened this year.
The focus of the presentations are mainly on enterprise Java featuring talks by famous Joshua Bloch, Mark Reihnhold and others on new features of the upcoming JDK release as well as intricacies of the Java programming language itself.
This year for the first time the scope was extended to include one whole track on NoSQL databases. The track was organised by Steven Noels. It featured fantastic presentations on HBase use cases, easily accessible introductions to the concepts and usage of Hadoop.
To me it was interesting to observe which talks people would go to. In contrast to many other conferences here the NoSQL/ cloud-computing presentations were less visited than I’d have expected. One reason might be the fact that especially on conference day two they had to compete with popular topics such as the Java puzzlers, Live Java posse and others. However when talking to other attendees their seemed to be a clear gap between the two communities caused probably by a mixture of
- there being very different problems to be solved in the enterprise world vs. the free software, requirements and scalability driven NoSQL community. Although even comparably small companies (compared to the Googles and Yahoo!s of this world) in Germany are already facing scaling issues, these problems are not yet that pervasive in the Java community as a whole. To me this was rather important to learn, as coming from a Machine learning background, now working for a search provider and being involved with Mahout, Lucene and Hadoop scalability and a growth in data has always been one of the major drivers for any projects I have been working on so far.
- Even when faced with growing amounts of data in the regular enterprise world developers seem to be faced with the problem of not being able to freely select the technologies to be used for implementing a project. In contrast to startups and lean software teams there still seem to be quite a few teams that are not only given what to implement but also how to implement the software unnecessarily restricting the tools to use to solve a given problem.
One final factor that drives developers adopting NoSQL and cloud computing technologies is the observation for the need to optimise the system as a whole – to think outside the box of fixed APIs and module development units. To that end the DevOps movement was especially interesting to me as only by getting the knowledge largely hidden in operations teams into development and mixing that with the skill of software developers can lead to truly elastic and adaptable systems.
General, Mahout, Software Foundation
antwerp, Devoxx, Java, Mahout, NOSQL
December 1st, 2010 at 9:09pm
After Apache Con I spent a few days in Lisbon for Codebits. The conference is not developers-only. It is more of a mixture of hacking event, conference, exhibition. Though the location was not optimal for giving presentations (large exhibition hall with now a rather noisy presentation area) the whole event brought quite an interesting mixture of people together in one place in the capital of Portugal.

I had been to Portugal earlier this year, however that was just for recreating and vacation. So this time around I was quite happy to get the chance of seeing some part of the local culture that otherwise I would probably never have gotten access to. Having some loose ties to the Berlin hackers community, to the free software people in Europe but also to pragmatic open source developers what was most astonishing to me was to see the comparably huge amount of systems running Microsoft Windows used by codebits attendees. Talking a bit with locals it seemed like using free software for development is not all that unusual in Portugal, however people tend to wait for problems getting fixed instead of getting involved and actively contributing back.
Mahout
codebits, Lisbon, Mahout, teddy
November 30th, 2010 at 9:07pm
Flying to Atlanta I finally had a few hours of time to finalize the review of the Mahout in Action MEAP edition. The book is intended for potential users of the Apache Mahout, a project focussing on implementing scalable algorithms for machine learning.
Describing machine learning algorithms and their application to practioners is a non-trivial task: Usually there is more than one algorithm available for seemingly identically problem settings. In addition each algorithm usually comes with multiple parameters for fine-tuning its behaviour to the problem setting at hand.
Sean Owen does an awesome job explaining the basic concepts behind building recommender systems in that book. In a very intuitive way he highlights the properties of each algorithm and its options. Based on one example setting taken from a real world problem (parents buying music Cds for their children based on more or less background information) he highlights the properties of each available recommender algorithm.
The second section of the book highlights available implementations for clustering documents, that is grouping documents by similarity – a problem that is very common when it comes to grouping texts into topics and detecting upcoming new topics in a stream of publications. Robin Anil and Ted Dunning make it very easy to understand what clustering is all about, explain how to use, configure and use the current implementations in Mahout in various practical settings.
The book looks very promising. It is well suited for engineers looking for an explanation of how to successfully use Mahout to solve real world problems. In contrast to existing publications it makes it easy to grasp the basic concepts event without wading through complicated computations. The book is specially targeted to Mahout users. However it does give important background information on the algorithms available that is needed to decide on exactly which implementation and which configuration to use. Looking forward to the last section on classification algorithms.
Mahout
Book Review, Mahout, Manning
November 3rd, 2010 at 3:21pm
On last Sunday the Apache Mahout project published the 0.4 release. Nearly every piece of the code has been refactored and improved since the last 0.3 release. The release was timed to happen exactly before Apache Con NA in Atlanta. As such it was published on October 31st - the Halloween release, sort-of.
Especially mentionable are the following improvements:
- Model refactoring and CLI changes to improve integration and consistency
- Map/Reduce job to compute the pairwise similarities of the rows of a matrix using a customizable similarity measure
- Map/Reduce job to compute the item-item-similarities for item-based collaborative filtering
- More support for distributed operations on very large matrices
- Easier access to Mahout operations via the command line
- New vector encoding framework for high speed vectorization without a pre-built dictionary
- Additional elements of supervised model evaluation framework
- Promoted several pieces of old Colt framework to tested status (QR decomposition, in particular)
- Can now save random forests and use it to classify new data
New features and algorithms include:
- New ClusterEvaluator and CDbwClusterEvaluator offer new ways to evaluate clustering effectiveness
- New Spectral Clustering and MinHash Clustering (still experimental)
- New VectorModelClassifier allows any set of clusters to be used for classification
- RecommenderJob has been evolved to a fully distributed item-based recommender
- Distributed Lanczos SVD implementation
- New HMM based sequence classification from GSoC (currently as sequential version only and still experimental)
- Sequential logistic regression training framework
- New SGD classifier
- Experimental new type of NB classifier, and feature reduction options for existing one
There were many, many more small fixes, improvements, refactorings and cleanup. Go check out the new release, give the new features a try and report back to us on the user mailing list.
Mahout
Apache Mahout, release
November 1st, 2010 at 9:32am
This year’s Devoxx will feature several presentations coming from the Apache Hadoop ecosystem including Tom White on the basics of Hadoop: HDFS, MapReduce, Hive and Pig as well as Michael Stack on HBase.

In addition there will be a brief Tools in Action presentation on Monday evening featuring Apache Mahout.
Please let me know if you are going to Devoxx - would be great to meet some more Apache people there, maybe have dinner at one of the conference days.
General, Mahout, Software Foundation
Apache Dinner, Devoxx, Hadoop, hbase, Mahout, Software Foundation
October 25th, 2010 at 6:51pm
This evening the first Machine Learning Gossip meeting is scheduled to take place at 9p.m. at Victoriabar: Professionals working in research advancing machine learning algorithms and industry projects putting machine learning algorithms to practical use meet for some drinks, food and hopefully lots of interesting discussions.
If successful the meeting is supposed to take place on a regular schedule. Ask Michael Brückner for the date and location of the next meetup.
General, Mahout, Science
Berlin, Machine Learning, Mahout
October 21st, 2010 at 1:55pm
A few weeks ago we had the autumn edition of the Apache Hadoop Get Together in newthinking store in Berlin. I am glad to announce the first video online:
Mahout Sebastian Schelter from Isabel Drost on Vimeo.
Thanks to JTeam for sponsoring video taping, thanks to newthinking for providing the location and thanks to Martin Schmidt from newthinking for producing the video.
Stay tuned for the second video to be published next week.
Get Together, Mahout
Get Together, Hadoop, Mahout
June 20th, 2010 at 12:55pm
In Summer last year I was invited to give a presentation on Apache Mahout at TU Berlin. After the talk was over some of the research group members asked me to design and give a course on scalable machine learning with open source software during the winter semester.
The project attracted four to five students - not very many - but then again it is a course people can take voluntarily. During the first semester participants were asked to integrate Mahout to build a system that crawls web pages, assigns them to clusters and makes the content searchable with Lucene. The intention was to get students to publish any patches they have to make at Mahout. In addition the code behind the system was supposed to be published after the project was over.
This setup turned out to be sub-optimal: The participants never grew confident enough to publish not only their ideas and design on the mailinglist but also send in the access data to the SCM system that hosted the project source code.
Some similar setup was run at HPI Potsdam by Christoph Böhm: He let students implement various information retrieval and machine learning algorithms on top of Apache Hadoop. After the course was over he tried to motivate students to publish their code at Apache Mahout. So far I have seen no submissions.
Being aware of these problems next time I setup the course for the summer semester at TU I chose a slightly different model: Having only four students who do not have enough free cycles to work on the project full time I set the goal to implement an HMM - including tests, example and documentation. Being roughly aligned with GSoC I asked students to publish their timeline in JIRA. As soon as coding started I urged them to publish even incremental progress and ask the community for feedback.
Now we do have an open JIRA issue with a patch attached to it. People also got some code review feedback already. Having Berlin Buzzwords in town while the course was still running I used my chance to get students in touch with other Mahout developers. Looks like at least one of them is planning to stay with the project for a little longer. For me it would be a great success if at least one student could be turned into a longer term contributor to the project.
So far it looks like applying the general principle of releasing code early and often helps people do integrate into some project. My own lesson learned from those experiences however is to urge students early on to get in touch and release their code: It was not particularly easy to get them to send e-mails to public mailing lists. However if they had done this just once, feedback usually was very positive - and surprised by how friendly and helpful in the free software community generally are.
Mahout
Free Software, Mahout, Teaching, TU Berlin, University