Archive

Archive for the ‘Apache’ Category

Apache Mahout Podcast

December 13th, 2010 at 9:21pm

During Apache Con ATL Michael Coté interviewed Grant Ingersoll on Apache Mahout. The interview is available online as podcast. The interview covers the goals and current use cases of the project, goes into some detail on the reasons for initially starting it. If you are wondering what Mahout is all about, what you can do with it and which direction development is heading, the interview is a great option to find out more.

Apache Con, Mahout, Software Foundation , , ,

Apache Lunch Devoxx

December 11th, 2010 at 9:30pm

On Twitter I suggested to host an Apache dinner during Devoxx. Matthias Wesendorf of Apache MyFaces was so kind to take up the discussion carrying it over to the Apache community mailing-list. It quickly turned out that there was quite some interest with several members and committers attending Devoxx. We scheduled the meetup for Friday after the conference during lunch time.
I pinged a few Apache related people I knew would attend the conference (being a speaker and a committer at some Apache project almost certainly resulted in getting a ping). Steven Noels kindly made a reservation at a restaurant close by and announced time and geo coordinates on party.apache.org. Although several speakers had left already that very same morning, we turned out to be eleven people – including Stephen Coleburn, Mathias Wessendorf, Steven Noels, Martijn Dashorst of the Apache Wicked project. Was great meeting all of you – and being able to put some faces to names :)

General, Software Foundation , , ,

Devoxx – Day 2 HBase

December 9th, 2010 at 9:25pm

Devoxx featured several interesting case studies of how HBase and Hadoop can be used to scale data analysis back ends as well as data serving front ends.

Twitter

Dmitry Ryaboy from Twitter explained how to scale high load and large data systems using Cassandra. Looking at the sheer amount of tweets generated each day it becomes obvious that with a system like MySQL alone this site cannot be run.

Twitter has released several of their internal tools under a free software license for others to re-use – some of them being rather straight forward, others more involved. At Twitter each Tweet is annotated by a user_id, a time stamp (ok if skewed by a few minutes) as well as a unique tweet_id. In order to come up with a solution for generating the latter one they built a library called snowflake. Though rather simple algorithm even works in a cross data-centre set-up: The first bits are composed of the current time stamp, the following bits encode the data-centre, after that there is room for a counter. The tweet_ids are globally ordered by time and distinct across data-centres without the need for global synchronisation.

With gizzard Twitter released a rather general sharding implementation that is used internally to run distributed versions of Lucene, MySQL as well as Redis (to be introduced for caching tweet timelines due to its explicit support for lists as data structures for values that are not available in memcached).

FlockDB for large scale social graph storage and analysis. Rainbird for time series analysis, though with OpenTSDB there is something comparable available for HBase. Haplocheirus for message vector caching (currently based on memcached, soon to be migrated to Redis for its richer data structures). The queries available through the front-end are rather limited thus making it easy to provide pre-computed, optimised version in the back-end. As with the caching problem a tradeoff between hit rate on the pool of pre-computed items vs. storage cost can be made based on the observed query distribution.

In the back-end of Twitter various statistical and data mining analysis are run on top of Hadoop HBase To compute potentially interesting followers for users, to extract potentially interesting products etc.
The final take-home message here: Go from requirements to final solution. In the space of storage systems there is not such thing as a silver bullet. Instead you have to carefully evaluate features and properties of each solutions as your data and load increase.

Facebook

When implementing Facebook Messaging (a new feature that was announced this week) Facebook decided to go for HBase instead of Cassandra. The requirements of the feature included massive scale, long-tail write access to the database (which more or less ruled out MySQL and comparable solutions) and a need for strict ordering of messages (which ruled out any eventually consistent system. The decision was made to use HBase.

A team of 15 developers (including operations and frontend) was working on the system for one year before it was finally released. The feature supports for integration of facebook messaging, IM, SMS and mail into one single system making it possible to group all messages by conversation no matter which device was used to send the message originally. That way each user’s inbox turns into a social inbox.

Adobe

Cosmin Lehene presented four use cases of Hadoop at Adobe. The first one dealt with creating and evaluating profiles of the Adobe Media Player. Users would be associated with a vector giving more information on what types of genre the meda they consumed belonged to. These vectors would then be used to generate recommendations for additional content to view in order to increase consumption rate. Adobe built a clustering system that would interface Mahout’s canopy- and k-means implementations with their HBase backend for user grouping. Thanks Cosmin for including that information in your presentation!

A second use case focussed on finding out more on the usage of flash on the internet. Using Google to search for flash content was no good as only the first 2000 results could be viewed thus resulting in a highly skewed sample. Instead they used a mixture of nutch and HBase for storage to retrieve the content. Analysis was done with respect to various features of flash movies, such as frame rates. The analysis revealed a large gap between the perceived typical usage and the actual usage of flash on the internet.

The third use case involves analysis of images and usage patterns on the Photoshop-in-a-browser edition of Photoshop.com. The forth use case dealt with scaling the infrastructure that powers businesscatalyst – a turn-key online business platform solution including analysis, campaigning and more. When purchased by Adobe the system was very successful business-wise. However the infrastructure was by no means able to put up with the load it had to accommodate. Changing to a back-end based on HBase led to better performance, faster report generation.

General, Hacking, Mahout , , , , , ,

Apache Mahout Meetup in San Jose

December 8th, 2010 at 7:48am

A few hours ago the Mahout Meetup at MapR Technologies in San Jose/CA ended. Two photos taken at the event leaked - happy to be able to publish them here.

More information on the discussions and more technical details to follow. Stay tuned.

Mahout , ,

Devoxx University – MongoDB, Mahout

December 5th, 2010 at 9:19pm

The second tutorial was given by Roger Bodamer on MongoDB. It concentrates on being horizontally scalable by avoiding joins and complex, multi document transactions. It supports a new data model that allows for flexible, changeable “schemas”.

The exact data layout is determined by the types of operations you expect for your application, by the access patterns (reading vs. writing data; types of updates and types of queries). Also don’t forget about indexing tables by columns to speed up frequently run queries.

Scaling MongoDB is supported by replication in a master/slave setup quite as any traditional system. In a replica set of n nodes, any of these can be elected as the primary (taking writes). If that one goes down, new master election happens. For durability all writes are required to go to at least a majority of all nodes, if that does not happen, now guarantee is given as to the availability of the update in case of primary failure. Write sharding comes with MongoDB as well.
Java support for Mongo is pretty standard - Raw Mongo driver comes in a Map<..., ... > flavour. Morphia supports Pojo mapping, annotations etc. for MongoDB Java integration, Code generators for various other JVM languages are available as well.

See also: http://blog.wordnik.com/12-months-with-mongodb

My talk was scheduled for 30min in the afternoon. I went into some detail on what is necessary to build a news clustering system with Mahout and finished the presentation by a short overview of the other use cases that could be solved with the various algorithms. In the audience, nearly all had heard about Hadoop before – most likely in the introductory session that same morning. Same for Lucene. Solr was known to about half of the attendees. Mahout to just a few. Knowing that only very few attendees had any Machine Learning background I tried to provide a very high level overview of what can be done with the library, not going into too much mathematical details. There were quite a few interested questions after the presentation – both online and offline, including requests for examples on how to integrate the software with Solr. In addition connectors for instance to HBase as a data-source were interesting to people. Show-casing integration of Mahout, possibly even providing not only Java- but also REST interfaces might be one route to easier integration and faster adoption of Mahout.

Mahout , ,

Devoxx Antwerp

December 3rd, 2010 at 9:16pm

With 3000 attendees Devoxx is the largest Java Community conference world-wide. Each year in autumn it takes place in Antwerp/ Belgium, in recent years in the Metropolis cinema. The conference tickets were sold out long before doors were opened this year.
The focus of the presentations are mainly on enterprise Java featuring talks by famous Joshua Bloch, Mark Reihnhold and others on new features of the upcoming JDK release as well as intricacies of the Java programming language itself.
This year for the first time the scope was extended to include one whole track on NoSQL databases. The track was organised by Steven Noels. It featured fantastic presentations on HBase use cases, easily accessible introductions to the concepts and usage of Hadoop.
To me it was interesting to observe which talks people would go to. In contrast to many other conferences here the NoSQL/ cloud-computing presentations were less visited than I’d have expected. One reason might be the fact that especially on conference day two they had to compete with popular topics such as the Java puzzlers, Live Java posse and others. However when talking to other attendees their seemed to be a clear gap between the two communities caused probably by a mixture of

  • there being very different problems to be solved in the enterprise world vs. the free software, requirements and scalability driven NoSQL community. Although even comparably small companies (compared to the Googles and Yahoo!s of this world) in Germany are already facing scaling issues, these problems are not yet that pervasive in the Java community as a whole. To me this was rather important to learn, as coming from a Machine learning background, now working for a search provider and being involved with Mahout, Lucene and Hadoop scalability and a growth in data has always been one of the major drivers for any projects I have been working on so far.
  • Even when faced with growing amounts of data in the regular enterprise world developers seem to be faced with the problem of not being able to freely select the technologies to be used for implementing a project. In contrast to startups and lean software teams there still seem to be quite a few teams that are not only given what to implement but also how to implement the software unnecessarily restricting the tools to use to solve a given problem.

One final factor that drives developers adopting NoSQL and cloud computing technologies is the observation for the need to optimise the system as a whole – to think outside the box of fixed APIs and module development units. To that end the DevOps movement was especially interesting to me as only by getting the knowledge largely hidden in operations teams into development and mixing that with the skill of software developers can lead to truly elastic and adaptable systems.

General, Mahout, Software Foundation , , , ,

Teddy in Lisbon

December 1st, 2010 at 9:09pm

After Apache Con I spent a few days in Lisbon for Codebits. The conference is not developers-only. It is more of a mixture of hacking event, conference, exhibition. Though the location was not optimal for giving presentations (large exhibition hall with now a rather noisy presentation area) the whole event brought quite an interesting mixture of people together in one place in the capital of Portugal.

I had been to Portugal earlier this year, however that was just for recreating and vacation. So this time around I was quite happy to get the chance of seeing some part of the local culture that otherwise I would probably never have gotten access to. Having some loose ties to the Berlin hackers community, to the free software people in Europe but also to pragmatic open source developers what was most astonishing to me was to see the comparably huge amount of systems running Microsoft Windows used by codebits attendees. Talking a bit with locals it seemed like using free software for development is not all that unusual in Portugal, however people tend to wait for problems getting fixed instead of getting involved and actively contributing back.

Mahout , , ,

Mahout in Action

November 30th, 2010 at 9:07pm

Flying to Atlanta I finally had a few hours of time to finalize the review of the Mahout in Action MEAP edition. The book is intended for potential users of the Apache Mahout, a project focussing on implementing scalable algorithms for machine learning.
Describing machine learning algorithms and their application to practioners is a non-trivial task: Usually there is more than one algorithm available for seemingly identically problem settings. In addition each algorithm usually comes with multiple parameters for fine-tuning its behaviour to the problem setting at hand.
Sean Owen does an awesome job explaining the basic concepts behind building recommender systems in that book. In a very intuitive way he highlights the properties of each algorithm and its options. Based on one example setting taken from a real world problem (parents buying music Cds for their children based on more or less background information) he highlights the properties of each available recommender algorithm.
The second section of the book highlights available implementations for clustering documents, that is grouping documents by similarity – a problem that is very common when it comes to grouping texts into topics and detecting upcoming new topics in a stream of publications. Robin Anil and Ted Dunning make it very easy to understand what clustering is all about, explain how to use, configure and use the current implementations in Mahout in various practical settings.
The book looks very promising. It is well suited for engineers looking for an explanation of how to successfully use Mahout to solve real world problems. In contrast to existing publications it makes it easy to grasp the basic concepts event without wading through complicated computations. The book is specially targeted to Mahout users. However it does give important background information on the algorithms available that is needed to decide on exactly which implementation and which configuration to use. Looking forward to the last section on classification algorithms.

Mahout , ,

Travelling

November 22nd, 2010 at 11:17pm

Currently on my way back from a series of conferences in the past three weeks in the IC from Schiphol. After three weeks of conferences, lots of new input and lots of interesting projects I learned about it is finally time to head back and put the stuff I have learned to good use.


View Travelling in a larger map

As seems normal with open source conferences I got far more input on interesting projects than I can expect to ever get applied in on a daily basis. Still it is always inspiring to meet with other developers in the same field – or even quite different fields and learn more on what projects they are working on, how they solve various problems.

A huge Thank You goes to the DICODE EU research project for sponsoring the Apache Con and Devoxx trips, another Thanks to Sapo.pt for inviting me to Lisbon and covering travel expenses. A special thank you to the assistant at neofonie who made travel arrangements for Atlanta and Antwerp: It all worked without problems even up to me having a power outlet in the train that is finally taking me back.

Apache Con, General, Software Foundation , , , ,

Apache Mahout 0.4 release

November 3rd, 2010 at 3:21pm

On last Sunday the Apache Mahout project published the 0.4 release. Nearly every piece of the code has been refactored and improved since the last 0.3 release. The release was timed to happen exactly before Apache Con NA in Atlanta. As such it was published on October 31st - the Halloween release, sort-of.

Especially mentionable are the following improvements:

  • Model refactoring and CLI changes to improve integration and consistency
  • Map/Reduce job to compute the pairwise similarities of the rows of a matrix using a customizable similarity measure
  • Map/Reduce job to compute the item-item-similarities for item-based collaborative filtering
  • More support for distributed operations on very large matrices
  • Easier access to Mahout operations via the command line
  • New vector encoding framework for high speed vectorization without a pre-built dictionary
  • Additional elements of supervised model evaluation framework
  • Promoted several pieces of old Colt framework to tested status (QR decomposition, in particular)
  • Can now save random forests and use it to classify new data

New features and algorithms include:

  • New ClusterEvaluator and CDbwClusterEvaluator offer new ways to evaluate clustering effectiveness
  • New Spectral Clustering and MinHash Clustering (still experimental)
  • New VectorModelClassifier allows any set of clusters to be used for classification
  • RecommenderJob has been evolved to a fully distributed item-based recommender
  • Distributed Lanczos SVD implementation
  • New HMM based sequence classification from GSoC (currently as sequential version only and still experimental)
  • Sequential logistic regression training framework
  • New SGD classifier
  • Experimental new type of NB classifier, and feature reduction options for existing one

There were many, many more small fixes, improvements, refactorings and cleanup. Go check out the new release, give the new features a try and report back to us on the user mailing list.

Mahout ,