Archive for the ‘Mahout’ Category

Apache Mahout 0.6 released

February 8th, 2012 at 9:33pm

On Monday, February 6th, a new Apache Mahout version was released. The new package features the following.

Lots of performance improvements:

  • A new LDA implementation using Collapsed Variational Bayes 0th Derivative Approximation - try that out if you have been bothered by the far from optimal performance of the old version.
  • Improved Decision Tree performance and added support for regression problems
  • Reduced runtime of dot product between vectors - many algorithms in Mahout rely on that, so these performance improvements will affect anyone using them.
  • Reduced runtime of LanczosSolver tests - faster tests make it easier to modify Mahout and shorten development cycles.
  • Increased efficiency of parallel ALS matrix factorization
  • Performance improvements in RowSimilarityJob, TransposeJob - helpful for anyone trying to find similar items or running the Hadoop based recommender

New features:

  • K-Trusses, Top-Down and Bottom-Up clustering, Random Walk with Restarts implementation
  • SSVD enhancements

Better integration:

  • Added MongoDB and Cassandra DataModel support
  • Added numerous clustering display examples

Many bug fixes, refactorings, and other small improvements. More information is available in the Release Notes.

Overall, these are great improvements towards better performance, stability and integration. However, there are still quite a few outstanding issues and issues in need of review. Come join the project and help us improve existing patches, performance and, in particular, integration and the streamlining of how the different parts of the project are used.

Mahout

Learning Machine Learning with Apache Mahout

December 13th, 2011 at 10:20pm

Once in a while I get questions like "Where do I start learning more about machine learning?" Other than the official sources, I think there is also quite good coverage in the Mahout community: since it was founded, several presentations have been given that provide an overview of Apache Mahout, introduce special features or even go into more detail on particular implementations. Below is an attempt to collect the talks given so far, without any claim to link to all videos or lectures. Feel free to add your favourite in the comments section. In addition, I linked to some online courses with further material to get you started.

When looking for books, of course check out Mahout in Action. Taming Text and the data mining book that comes with Weka are also good starting points for practitioners.

Introductory, overview videos

Technical details

Further course material

Mahout, Science

See you in Vancouver at Apache Con NA 2011

October 24th, 2011 at 1:49pm

In mid November Apache hosts its famous yearly conference - this time in Vancouver, Canada. They kindly accepted my presentation on Apache Mahout for intelligent data analysis, which is mostly focused on introducing the project to newcomers and showing what happened within the project in the past year. If there are particular topics you would like to see covered, please let me know. They also accepted a more committer-focused talk on Talking people into creating patches. Its goal is to highlight some of the issues that newcomers who want to contribute to free software projects run into, and to initiate a discussion on what helps convince them to keep up the momentum and overcome any obstacles.

Looking forward to seeing you in Vancouver for Apache Con NA.

Apache Con, Free Software, Mahout

GoTo Con

October 10th, 2011 at 8:49pm

Location: Amsterdam
Start Date: 2011-10-12
End Date: 2011-10-14

This week, late on Tuesday night, I am going to leave for GoTo Con in Amsterdam. Train tickets are already booked - this is going to be my first trip with City Night Line, so we will see how great they are.

GoTo Amsterdam features a special Apache track as well as several talks on scaling up and searching, and also covers general architectural decisions. If you have not registered yet, use dros200 as promotion code to get a discount on the registration price.

Looking forward to seeing you in Amsterdam later this week.

General, Mahout

Apache Mahout Hackathon Berlin

March 21st, 2011 at 9:39pm

Last year Sebastian Schelter from Berlin was added to the list of committers for Apache Mahout. With two committers in town, the idea was born to meet some day and work on Mahout. So why not just announce that meeting publicly and invite others who might be interested in learning more about the framework? I got in touch with c-base - a hacker space in Berlin well suited to host a Hackathon - and quickly got their ok for the event.

As a result the first Apache Mahout Hackathon took place at c-base in Berlin last weekend. We had about eight attendees, arriving at varying times: I guess 11 a.m. simply is way too early to get up for your average software developer on a Saturday. A few people were surprised by the venue - especially those who were attending a Hackathon for the very first time and had expected c-base to be some IT company ;)

We started the day with a brief collection of ideas that everyone wanted to work on: Some needed help to use Mahout - topics included:

  • How to use Apache Mahout collaborative filtering with complex models.
  • How to use Apache Mahout via a web application?
  • How to use classification (mostly focussed on using Naive Bayes from within web applications).
  • Is HBase a solution for scalable graph mining algorithms?
  • Is there a frequent itemset algorithm that respects temporal changes in patterns?

Those more into Mahout development proposed a slightly different set of topics:

  • PLSI and Map/Reduce?
  • Build customisable sampling strategies for distributed recommendations.
  • Come up with a more Java API friendly configuration scheme for Mahout clusterings.
  • Complete the distributed SVD recommender.

Teams of two to three (and more) people formed quickly. First, several user side questions could be addressed by mixing more experienced Mahout developers with newbie users. Apart from Mahout specifics, we also discussed more basic questions of getting involved, even by simply contributing to the online documentation, answering questions on the mailing lists or just providing structured access to existing material that users generally have trouble finding.

Another topic that is overlooked all too often when asking users to contribute to the project is the process of creating, submitting, applying and reviewing patches itself. To those deeply involved with free software projects, dealing with patches and the integration of issue tracker and svn with the project mailing lists all seems very obvious. However, even this seemingly basic setup sometimes looks confusing and complex to regular users - that is very common among, but not limited to, people who are just starting to work as software developers.

Thanks to Thilo Fromm for taking the group picture.

In the evening people finally started hacking on more sophisticated tasks - working on the first project patches. On Sunday only the really hard-core developers remained, leading to rather focussed work on Mahout improvements, which in the end led to the first patches sent in from the Mahout Hackathon.

Hacking, Mahout

Apache Mahout Meetup Amsterdam

February 19th, 2011 at 8:18pm

Last week I was honoured to be invited as one of the two speakers on Apache Mahout at the Mahout meetup in Amsterdam at JTeam’s offices. After free beer, cola and pizza, Frank Scholten gave an overview of Mahout’s clustering capabilities. After a brief introduction to Mahout itself he went into a little more detail on how clustering works in general. He then used a selection of Seinfeld scripts as a fun data set to guide the audience through the process of choosing the right data preparation steps, coming up with good training parameters and finally evaluating clustering quality.

After that I gave a brief introduction to classification with Mahout - going into a little more detail when it comes to data preparation and quality evaluation. The audience seemed most interested in learning more on how data preparation works - after all that step cannot really be covered by Mahout itself (though we do have some support) but instead needs a lot of domain knowledge from the user side.

Judging from the brief round of self introductions, the meetup was well attended by an interesting mixture of people coming from JTeam, Hippo, the Dutch police working on data analytics, developers working at RIPE and many more.

If you are interested in more data analysis, search and data storage - do not miss registration for Berlin Buzzwords on June 6/7th 2011.

General, Mahout

O’Reilly Strata Conference

January 22nd, 2011 at 4:34am

Title: O’Reilly Strata Conference
Location: Santa Clara
Start Date: 2011-02-01
End Date: 2011-02-03

Early next February O’Reilly is planning to put on a very interesting conference on the topic of data analysis and the business of generating value from raw digital data.

I’m really glad to have received the acceptance notification for my presentation and travel sponsorship from the DICODE project. So see you in Santa Clara.

If you are still unsure whether you should attend or not: Strata kindly handed out discount codes to speakers to share with their followers and readers. It saves you 25% of the registration cost - just use str11fsd during registration.

General, Mahout

Apache Mahout Hackathon Berlin

December 14th, 2010 at 8:50pm

Early next year - on February 19th/20th to be more precise - the first Apache Mahout Hackathon is scheduled to take place at c-base. The Hackathon will take one weekend. There will be plenty of time to hack on your favourite Mahout issue, to get in touch with two of the Mahout committers and get your machine learning project off the ground.

Please contact isabel@apache.org if you are planning to attend this event, or register with the XING event, so we can plan for enough space for everyone. If you have not registered for the event there is no guarantee you will be admitted.

If you’d like to support the event: We are still looking for sponsors for drinks and pizza.

Mahout

Apache Mahout Podcast

December 13th, 2010 at 9:21pm

During Apache Con ATL Michael Coté interviewed Grant Ingersoll on Apache Mahout. The interview is available online as a podcast. It covers the goals and current use cases of the project and goes into some detail on the reasons for initially starting it. If you are wondering what Mahout is all about, what you can do with it and which direction development is heading, the interview is a great way to find out more.

Apache Con, Mahout, Software Foundation

Devoxx – Day 2 HBase

December 9th, 2010 at 9:25pm

Devoxx featured several interesting case studies of how HBase and Hadoop can be used to scale data analysis back ends as well as data serving front ends.

Twitter

Dmitry Ryaboy from Twitter explained how to scale high load and large data systems using Cassandra. Looking at the sheer amount of tweets generated each day it becomes obvious that with a system like MySQL alone this site cannot be run.

Twitter has released several of their internal tools under a free software license for others to re-use – some of them being rather straightforward, others more involved. At Twitter each Tweet is annotated with a user_id, a time stamp (ok if skewed by a few minutes) as well as a unique tweet_id. In order to generate the latter they built a library called snowflake. Though rather simple, the algorithm even works in a cross data-centre set-up: The first bits are composed of the current time stamp, the following bits encode the data-centre, and after that there is room for a counter. The tweet_ids are globally ordered by time and distinct across data-centres without the need for global synchronisation.
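The bit layout described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the snowflake library itself: the bit widths, the IdGenerator class and the decode helper are assumptions made for this example.

```python
import threading
import time

# Assumed bit widths for illustration only; not Twitter's actual layout.
DATACENTRE_BITS = 10
COUNTER_BITS = 12
MAX_COUNTER = (1 << COUNTER_BITS) - 1


def current_millis():
    """Wall-clock time in milliseconds."""
    return int(time.time() * 1000)


class IdGenerator:
    """Hypothetical generator: time stamp bits, data-centre bits, counter bits."""

    def __init__(self, datacentre_id, clock=current_millis):
        if not 0 <= datacentre_id < (1 << DATACENTRE_BITS):
            raise ValueError("datacentre_id does not fit into DATACENTRE_BITS")
        self._datacentre_id = datacentre_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._counter = 0

    def next_id(self):
        with self._lock:
            # Never move backwards, even if the wall clock does.
            now = max(self._clock(), self._last_ms)
            if now == self._last_ms:
                if self._counter == MAX_COUNTER:
                    # Counter exhausted: borrow the next millisecond.
                    now += 1
                    self._counter = 0
                else:
                    self._counter += 1
            else:
                self._counter = 0
            self._last_ms = now
            return (
                (now << (DATACENTRE_BITS + COUNTER_BITS))
                | (self._datacentre_id << COUNTER_BITS)
                | self._counter
            )


def decode(generated_id):
    """Split an id back into (time stamp, data-centre, counter)."""
    counter = generated_id & MAX_COUNTER
    datacentre = (generated_id >> COUNTER_BITS) & ((1 << DATACENTRE_BITS) - 1)
    timestamp = generated_id >> (DATACENTRE_BITS + COUNTER_BITS)
    return timestamp, datacentre, counter
```

Because the time stamp occupies the most significant bits, sorting ids numerically sorts them by time, and two data-centres can never produce the same id since their data-centre bits differ. No coordination between generators is needed.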

With gizzard Twitter released a rather general sharding implementation that is used internally to run distributed versions of Lucene, MySQL as well as Redis (to be introduced for caching tweet timelines due to its explicit support for lists as value data structures, which are not available in memcached).

Further tools mentioned: FlockDB for large scale social graph storage and analysis, and Rainbird for time series analysis, though with OpenTSDB there is something comparable available for HBase. Haplocheirus is used for message vector caching (currently based on memcached, soon to be migrated to Redis for its richer data structures). The queries available through the front-end are rather limited, which makes it easy to provide pre-computed, optimised versions in the back-end. As with the caching problem, a tradeoff between hit rate on the pool of pre-computed items and storage cost can be made based on the observed query distribution.
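To make the tradeoff concrete, here is a small Python sketch that estimates the hit rate for a given pool size from an observed query log. The function name and the toy query log are made up for illustration; this is not how Twitter computes it.

```python
from collections import Counter


def hit_rate_for_pool_size(query_log, pool_size):
    """Fraction of observed queries answered from a pool holding
    pre-computed results for the pool_size most frequent queries."""
    if not query_log:
        return 0.0
    counts = Counter(query_log)
    covered = sum(count for _, count in counts.most_common(pool_size))
    return covered / len(query_log)


# Hypothetical query log with a skewed distribution.
log = ["home"] * 6 + ["mentions"] * 3 + ["search:x"]
for size in range(4):
    print(size, hit_rate_for_pool_size(log, size))
```

With a skewed distribution like this one, most of the hit rate comes from the first few pre-computed items, so each further item costs the same storage for a smaller gain.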

In the back-end of Twitter various statistical and data mining analyses are run on top of Hadoop and HBase, for example to compute potentially interesting followers for users or to extract potentially interesting products.
The final take-home message here: Go from requirements to final solution. In the space of storage systems there is no such thing as a silver bullet. Instead you have to carefully evaluate the features and properties of each solution as your data and load increase.

Facebook

When implementing Facebook Messaging (a new feature that was announced this week) Facebook decided to go for HBase instead of Cassandra. The requirements of the feature included massive scale, long-tail write access to the database (which more or less ruled out MySQL and comparable solutions) and a need for strict ordering of messages (which ruled out any eventually consistent system). These requirements led to the decision to use HBase.

A team of 15 developers (including operations and frontend) worked on the system for one year before it was finally released. The feature supports the integration of Facebook messaging, IM, SMS and mail into one single system, making it possible to group all messages by conversation no matter which device was originally used to send the message. That way each user’s inbox turns into a social inbox.

Adobe

Cosmin Lehene presented four use cases of Hadoop at Adobe. The first one dealt with creating and evaluating profiles of Adobe Media Player users. Each user would be associated with a vector describing which genres the media they consumed belonged to. These vectors would then be used to generate recommendations for additional content to view in order to increase the consumption rate. Adobe built a clustering system that would interface Mahout’s canopy and k-means implementations with their HBase backend for user grouping. Thanks Cosmin for including that information in your presentation!
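To illustrate the idea of grouping users by genre vectors, here is a minimal in-memory k-means sketch in Python. It is neither Mahout's implementation nor Adobe's system: the genre dimensions, the toy vectors and the choice of initial centroids are assumptions for the example, and canopy clustering (often used to pick the initial centroids) is left out.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the given vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def k_means(vectors, centroids, iterations=10):
    """Plain k-means: assign each vector to its nearest centroid,
    then move every centroid to the mean of its assigned vectors."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for vector in vectors:
            clusters[nearest_centroid(vector, centroids)].append(vector)
        # Keep the old centroid if no vector was assigned to it.
        centroids = [mean(cluster) if cluster else centroids[i]
                     for i, cluster in enumerate(clusters)]
    assignments = [nearest_centroid(v, centroids) for v in vectors]
    return centroids, assignments


# Hypothetical user profiles: share of consumed media per genre
# (comedy, news, sports).
users = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
]
centroids, assignments = k_means(users, [users[0], users[2]])
print(assignments)
```

Users that end up in the same cluster have similar viewing habits, so content popular within a cluster is a natural candidate to recommend to its other members.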

A second use case focussed on finding out more about the usage of Flash on the internet. Using Google to search for Flash content was no good, as only the first 2000 results could be viewed, resulting in a highly skewed sample. Instead they used a mixture of Nutch and HBase for storage to retrieve the content. Analysis was done with respect to various features of Flash movies, such as frame rates. The analysis revealed a large gap between the perceived typical usage and the actual usage of Flash on the internet.

The third use case involved analysis of images and usage patterns on the Photoshop-in-a-browser edition of Photoshop.com. The fourth use case dealt with scaling the infrastructure that powers businesscatalyst – a turn-key online business platform solution including analysis, campaigning and more. When purchased by Adobe the system was very successful business-wise. However, the infrastructure was by no means able to cope with the load it had to accommodate. Changing to a back-end based on HBase led to better performance and faster report generation.

General, Hacking, Mahout