Archive

Posts Tagged ‘Mahout’

Devoxx University – MongoDB, Mahout

December 5th, 2010 at 9:19pm

The second tutorial was given by Roger Bodamer on MongoDB. MongoDB concentrates on being horizontally scalable by avoiding joins and complex, multi-document transactions. It supports a data model that allows for flexible, changeable “schemas”.

The exact data layout is determined by the types of operations you expect for your application and by the access patterns (reading vs. writing data; types of updates and types of queries). Also don’t forget to index collections on the fields used by frequently run queries.
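
To make the idea of flexible documents and field indexes a bit more concrete, here is a minimal sketch in plain Python. It does not use MongoDB or its driver; the `posts` data, the field names and the `build_index` helper are made up for illustration only. It shows documents of one collection carrying different fields, and a lookup structure on one field standing in for an index.

```python
from collections import defaultdict

# Hypothetical "documents" of one collection; they need not share the same fields.
posts = [
    {"_id": 1, "author": "alice", "title": "Hello", "tags": ["intro"]},
    {"_id": 2, "author": "bob", "title": "Scaling",
     "comments": [{"by": "alice", "text": "Nice"}]},
    {"_id": 3, "author": "alice", "title": "Indexes"},
]


def build_index(docs, field):
    """Map each value of `field` to the ids of the documents holding it."""
    index = defaultdict(list)
    for doc in docs:
        if field in doc:  # documents lacking the field are simply skipped
            index[doc[field]].append(doc["_id"])
    return index


# Looking up by author now avoids scanning every document.
author_index = build_index(posts, "author")
alice_posts = author_index["alice"]
```

Which fields deserve such an index depends on the queries the application actually runs, which is the point made above about access patterns.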

Scaling MongoDB is supported by replication in a master/slave setup, much as in any traditional system. In a replica set of n nodes, any node can be elected as the primary (the one taking writes). If the primary goes down, a new one is elected. For durability, writes have to reach at least a majority of all nodes; if that does not happen, no guarantee is given as to the availability of the update in case of primary failure. Sharding of writes comes with MongoDB as well.
Java support for Mongo is pretty standard: the raw Mongo driver comes in a Map<..., ... > flavour. Morphia supports POJO mapping, annotations etc. for MongoDB Java integration. Code generators for various other JVM languages are available as well.
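
The majority rule for durable writes mentioned above boils down to simple arithmetic. The following is a small Python sketch of that rule only; the function names are invented here and are not part of any MongoDB API.

```python
def majority(n_nodes):
    """Smallest number of nodes that is more than half of the replica set."""
    return n_nodes // 2 + 1


def is_majority_acknowledged(acks, n_nodes):
    """True if at least a majority of the nodes have confirmed the write."""
    return acks >= majority(n_nodes)
```

For a replica set of three nodes, two acknowledgements are enough; with four nodes, three are needed.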

See also: http://blog.wordnik.com/12-months-with-mongodb

My talk was scheduled for 30min in the afternoon. I went into some detail on what is necessary to build a news clustering system with Mahout and finished the presentation with a short overview of the other use cases that can be solved with the various algorithms. In the audience, nearly all had heard about Hadoop before – most likely in the introductory session that same morning. The same was true for Lucene. Solr was known to about half of the attendees, Mahout to just a few. Knowing that only very few attendees had any machine learning background, I tried to provide a very high-level overview of what can be done with the library, not going into too much mathematical detail. There were quite a few interested questions after the presentation – both online and offline, including requests for examples on how to integrate the software with Solr. In addition, connectors, for instance to HBase as a data source, were of interest to people. Showcasing integration of Mahout, possibly even providing not only Java but also REST interfaces, might be one route to easier integration and faster adoption of Mahout.

Mahout

Devoxx Antwerp

December 3rd, 2010 at 9:16pm

With 3000 attendees, Devoxx is the largest Java community conference world-wide. Each year in autumn it takes place in Antwerp, Belgium, in recent years in the Metropolis cinema. The conference tickets were sold out long before doors were opened this year.
The focus of the presentations is mainly on enterprise Java, featuring talks by the famous Joshua Bloch, Mark Reinhold and others on new features of the upcoming JDK release as well as intricacies of the Java programming language itself.
This year for the first time the scope was extended to include one whole track on NoSQL databases. The track was organised by Steven Noels. It featured fantastic presentations on HBase use cases, easily accessible introductions to the concepts and usage of Hadoop.
To me it was interesting to observe which talks people would go to. In contrast to many other conferences, the NoSQL/cloud-computing presentations here were less visited than I’d have expected. One reason might be the fact that, especially on conference day two, they had to compete with popular topics such as the Java puzzlers, the live Java Posse and others. However, when talking to other attendees, there seemed to be a clear gap between the two communities, caused probably by a mixture of

  • There being very different problems to be solved in the enterprise world vs. the free software, requirements and scalability driven NoSQL community. Although even comparably small companies (compared to the Googles and Yahoo!s of this world) in Germany are already facing scaling issues, these problems are not yet that pervasive in the Java community as a whole. To me this was rather important to learn: coming from a machine learning background, now working for a search provider and being involved with Mahout, Lucene and Hadoop, scalability and growth in data have always been among the major drivers for any project I have been working on so far.
  • Even when faced with growing amounts of data, developers in the regular enterprise world seem to be faced with the problem of not being able to freely select the technologies to be used for implementing a project. In contrast to startups and lean software teams, there still seem to be quite a few teams that are told not only what to implement but also how to implement it, unnecessarily restricting the tools available to solve a given problem.

One final factor that drives developers to adopt NoSQL and cloud computing technologies is the recognition of the need to optimise the system as a whole – to think outside the box of fixed APIs and module development units. To that end the DevOps movement was especially interesting to me, as only getting the knowledge largely hidden in operations teams into development and mixing it with the skills of software developers can lead to truly elastic and adaptable systems.

General, Mahout, Software Foundation

Teddy in Lisbon

December 1st, 2010 at 9:09pm

After Apache Con I spent a few days in Lisbon for Codebits. The conference is not developers-only. It is more of a mixture of hacking event, conference and exhibition. Though the location was not optimal for giving presentations (a large exhibition hall with a rather noisy presentation area), the whole event brought quite an interesting mixture of people together in one place in the capital of Portugal.

I had been to Portugal earlier this year, however that was just for recreation and vacation. So this time around I was quite happy to get the chance of seeing some part of the local culture that otherwise I would probably never have gotten access to. Having some loose ties to the Berlin hackers community, to the free software people in Europe but also to pragmatic open source developers, what was most astonishing to me was to see the comparably huge number of systems running Microsoft Windows used by Codebits attendees. Talking a bit with locals, it seemed like using free software for development is not all that unusual in Portugal, however people tend to wait for problems getting fixed instead of getting involved and actively contributing back.

Mahout

Mahout in Action

November 30th, 2010 at 9:07pm

Flying to Atlanta I finally had a few hours of time to finalize the review of the Mahout in Action MEAP edition. The book is intended for potential users of Apache Mahout, a project focussing on implementing scalable algorithms for machine learning.
Describing machine learning algorithms and their application to practitioners is a non-trivial task: usually there is more than one algorithm available for seemingly identical problem settings. In addition, each algorithm usually comes with multiple parameters for fine-tuning its behaviour to the problem setting at hand.
Sean Owen does an awesome job explaining the basic concepts behind building recommender systems in that book. In a very intuitive way, and based on one example setting taken from a real world problem (parents buying music CDs for their children based on more or less background information), he highlights the properties of each available recommender algorithm and its options.
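
For readers new to recommenders, the basic idea behind a user-based approach can be sketched in a few lines of plain Python. This is neither Mahout code nor an example from the book: the ratings, the names and the choice of cosine similarity are assumptions made purely for illustration.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "anna": {"cd_a": 5.0, "cd_b": 3.0, "cd_c": 4.0},
    "ben": {"cd_a": 4.0, "cd_b": 3.0, "cd_d": 5.0},
    "cara": {"cd_b": 1.0, "cd_c": 2.0, "cd_d": 2.0},
}


def cosine(u, v):
    """Cosine similarity of two rating dictionaries (0.0 if nothing is shared)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, all_ratings, top_n=1):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in all_ratings.items():
        if other == user:
            continue
        sim = cosine(all_ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item in all_ratings[user]:
                continue  # only recommend unseen items
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]
```

Swapping the similarity measure or the neighbourhood of users considered is exactly the kind of option that changes a recommender's behaviour.
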
The second section of the book highlights available implementations for clustering documents, that is grouping documents by similarity – a problem that is very common when it comes to grouping texts into topics and detecting upcoming new topics in a stream of publications. Robin Anil and Ted Dunning make it very easy to understand what clustering is all about and explain how to configure and use the current implementations in Mahout in various practical settings.
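
Grouping documents by similarity can be illustrated with a toy example. The sketch below is plain Python, not Mahout: it uses a simple greedy single-pass grouping over word-count vectors rather than any of the algorithms covered in the book, and the headlines and the threshold value are invented for illustration.

```python
from collections import Counter
from math import sqrt


def vectorize(text):
    """Turn a text into a bag-of-words vector (term -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.3):
    """Greedy single pass: join the first cluster whose first member is similar enough."""
    vectors = [vectorize(text) for text in texts]
    clusters = []  # each cluster is a list of indices into `texts`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new topic
    return clusters


headlines = [
    "hadoop cluster scales out",
    "new hadoop cluster release",
    "football cup final tonight",
    "cup final ends in draw",
]
groups = cluster(headlines)
```

A document that matches no existing group opens a new one, which is a crude stand-in for detecting a new topic in a stream of publications.
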
The book looks very promising. It is well suited for engineers looking for an explanation of how to successfully use Mahout to solve real world problems. In contrast to existing publications, it makes it easy to grasp the basic concepts even without wading through complicated computations. The book is specifically targeted at Mahout users. However, it does give important background information on the available algorithms that is needed to decide on exactly which implementation and which configuration to use. Looking forward to the last section on classification algorithms.

Mahout

Apache Con – Mahout, commons and Lucene

November 26th, 2010 at 11:21pm

On the second day, the track interesting to me provided an overview of some of the Apache commons projects. Though seemingly small in scope and light-weight in implementation and dependencies, these projects provide vital features not yet well supported by the Sun JVM. There is the commons math implementation, featuring a fair number of algebraic, numeric and trigonometric functions (among others), and the commons exec framework for executing processes externally to the JVM without running into the danger of creating deadlocks or wasting resources.
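
The two hazards mentioned for external processes, blocking on unread output and processes that never finish, are not specific to Java. As a language-neutral illustration (this is Python's standard `subprocess` module, not commons exec, and the `run` helper is a made-up name), a call that collects the output while waiting and enforces a timeout might look like this:

```python
import subprocess
import sys


def run(command, timeout_seconds=10):
    """Run an external command, collecting its output and enforcing a time limit."""
    completed = subprocess.run(
        command,
        capture_output=True,      # output is read while waiting, so pipes cannot fill up
        text=True,                # decode output as text instead of bytes
        timeout=timeout_seconds,  # the child is killed and TimeoutExpired raised if exceeded
    )
    return completed.returncode, completed.stdout


# Example: start a second Python interpreter as the external process.
code, out = run([sys.executable, "-c", "print('hello')"])
```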

After that the Mahout and Lucene presentations were up. Grant gave a great overview of various use cases of machine learning in the wild, rightly claiming that anyone using the internet today makes use of some machine learning powered application each day – be it e-mail spam filtering, the Gmail priority inbox, recommended articles on news sites, recommended items to buy at shopping sites or targeted advertisements shown when browsing. The talk concluded with a more detailed presentation of how to successfully combine the features of Mahout and Lucene/Solr to build next generation web services that integrate user feedback into their user experience.

Apache Con

Apache Mahout @ Devoxx Tools in Action Track

November 1st, 2010 at 9:32am

This year’s Devoxx will feature several presentations coming from the Apache Hadoop ecosystem including Tom White on the basics of Hadoop: HDFS, MapReduce, Hive and Pig as well as Michael Stack on HBase.

In addition there will be a brief Tools in Action presentation on Monday evening featuring Apache Mahout.

Please let me know if you are going to Devoxx - it would be great to meet some more Apache people there, maybe have dinner on one of the conference days.

General, Mahout, Software Foundation

Apache Mahout @ Lisbon Codebits

October 31st, 2010 at 9:36am

In the second week of November I’ll spend a few days in Lisbon - I never would have thought that I’d return so quickly when I visited this beautiful city this summer during vacation. I’ll be there for Codebits - thanks to Sapo for inviting me.

Back in summer I learned only after I returned to Germany that there was someone from Portugal seeking to meet with other Apache people exactly when I was down there. I contacted the guy, proposing to do an Apache Dinner to see how many other committers and friends could be reached. In addition, Filipe asked me whether I could imagine flying down to Sapo to give a talk on Mahout, as devs there would be interested in it. Well, I told him that if I got travel support, I’d be happy to be there. This 10min chat quickly turned into an invitation to a great conference in Lisbon. Looking forward to meeting you there. (And looking forward to weather that, compared to Germany, is way warmer and sunnier right now. :) )

General, Software Foundation

CfP: Data Analysis Dev Room at Fosdem 2011

October 27th, 2010 at 6:56am

Call for Presentations: Data Analysis Dev Room, FOSDEM
http://fosdem.org
5 February 2011
1pm to 7pm
Brussels, Belgium

This is to announce the Data Analysis DevRoom co-located with FOSDEM: the first meetup on analysing and learning from data, taking place in Brussels, Belgium.

Important Dates (all dates in GMT +2):

  • Submission deadline: 2010-12-17
  • Notification of accepted speakers: 2010-12-20
  • Publication of final schedule: 2011-01-10
  • Meetup: 2011-02-05

Data analysis is an increasingly popular topic in the hacker community. This trend is illustrated by declarations such as:

“I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it.”

– Hal Varian, Google’s chief economist

Topics
The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

  • Information retrieval / Search
  • Large Scale data processing
  • Machine Learning
  • Text Mining
  • Computer vision
  • Linked Open Data

Sample list of related open source / data projects (not exhaustive):

  • http://lucene.apache.org
  • http://hadoop.apache.org (including MapReduce, Pig, Hive, …)
  • http://www.r-project.org/
  • http://scipy.org
  • http://mahout.apache.org
  • http://opennlp.sourceforge.net
  • http://nltk.org
  • http://opencv.willowgarage.com
  • http://mloss.org & http://mldata.org
  • http://dbpedia.org & http://freebase.com

Closely related topics not explicitly listed above are welcome.

High quality, technical submissions are called for, ranging from principles to practice.

We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.

Submissions should be based on free software solutions.

Submission
Proposals should be submitted to fosdem.datadevroom@gmail.com no later than 2010-12-17. Acceptance notifications will be sent out on 2010-12-20.

Please include your name, bio and email, the title of the talk and a brief abstract in English. Please indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted at experienced users).

The presentation format is short: 30 minutes including questions. We will be enforcing the schedule rigorously.

Sponsoring
If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us. Note: “DataDevRoom sponsors” will not be endorsed as “FOSDEM sponsors” and hence not listed in the sponsors section on the fosdem.org website.

Announcements
Follow @DataDevRoom on twitter for updates. News on the conference will be published on our website at http://fosdem.org.

Program Chairs:

  • Olivier Grisel - @ogrisel
  • Isabel Drost - @MaineC
  • Nicolas Maillot - @nmaillot

Please re-distribute this CFP to people who might be interested.

General

Video: Max Heimel on sequence tagging w/ Apache Mahout

October 26th, 2010 at 7:58pm

Some time ago Max Heimel from TU Berlin gave a presentation on the new HMM support in the Mahout 0.4 release at the Apache Hadoop Get Together in Berlin:

Mahout Max Heimel from Isabel Drost on Vimeo.

Thanks to JTeam for sponsoring video taping, thanks to newthinking for providing the location and thanks to Martin Schmidt from newthinking for producing the video.

Get Together

Machine Learning Gossip Meeting Berlin

October 25th, 2010 at 6:51pm

This evening the first Machine Learning Gossip meeting is scheduled to take place at 9 p.m. at Victoriabar: professionals working in research advancing machine learning algorithms and in industry projects putting machine learning algorithms to practical use meet for some drinks, food and hopefully lots of interesting discussions.

If successful, the meeting is supposed to take place on a regular schedule. Ask Michael Brückner for the date and location of the next meetup.

General, Mahout, Science