Apache Mahout 0.4 release
Last Sunday the Apache Mahout project published its 0.4 release. Nearly
every piece of the code has been refactored and improved since the 0.3 release. The release was timed to land
just before ApacheCon NA in Atlanta, so it was published on October 31st: the Halloween release,
sort of.
The following improvements are especially worth mentioning:
- Model refactoring and CLI changes to improve integration and consistency
- Map/Reduce job to compute the pairwise similarities of the rows of a matrix using a customizable similarity measure
- Map/Reduce job to compute the item-item similarities for item-based collaborative filtering
- More support for distributed operations on very large matrices
- Easier access to Mahout operations via the command line
- New vector encoding framework for high-speed vectorization without a pre-built dictionary
- Additional elements of the supervised model evaluation framework
- Promoted several pieces of the old Colt framework to tested status (QR decomposition, in particular)
- Can now save random forests and use them to classify new data
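To give a feel for what vectorization without a pre-built dictionary means, here is a minimal Python sketch of the general hashing trick. It is not Mahout's encoder API: the helper name `hash_encode`, the choice of MD5 as the hash, and the vector size of 16 are all assumptions made for illustration.

```python
import hashlib


def hash_encode(tokens, num_features=16):
    """Encode tokens into a fixed-size count vector via the hashing trick.

    Hypothetical helper for illustration only; not part of Mahout.
    """
    vector = [0.0] * num_features
    for token in tokens:
        # A stable hash maps each token straight to a slot, so no
        # dictionary of known terms has to be built beforehand.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % num_features
        vector[index] += 1.0
    return vector


doc = "the quick brown fox jumps over the lazy dog".split()
encoded = hash_encode(doc)
```

Because the slot is computed from the token itself, previously unseen words can be encoded in a single pass over the data. The price is that distinct tokens may collide in the same slot.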
New features and algorithms include:
- New ClusterEvaluator and CDbwClusterEvaluator offer new ways to evaluate clustering effectiveness
- New Spectral Clustering and MinHash Clustering (still experimental)
- New VectorModelClassifier allows any set of clusters to be used for classification
- RecommenderJob has evolved into a fully distributed item-based recommender
- Distributed Lanczos SVD implementation
- New HMM-based sequence classification from GSoC (currently sequential only and still experimental)
- Sequential logistic regression training framework
- New SGD classifier
- Experimental new type of NB classifier, and feature reduction options for the existing one
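For readers new to the SGD and sequential logistic regression items in the list above, the following Python sketch shows the general technique: logistic regression trained one example at a time with stochastic gradient descent. It does not use or mirror Mahout's classes. The function names, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Return the estimated probability that the label is 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def train_sgd(examples, num_features, epochs=50, learning_rate=0.5, seed=42):
    """Sequential SGD for logistic regression (illustrative, not Mahout code)."""
    rng = random.Random(seed)
    weights = [0.0] * num_features
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for features, label in data:
            # Gradient step on the log-likelihood of this single example.
            error = label - predict(weights, features)
            for i, x in enumerate(features):
                weights[i] += learning_rate * error * x
    return weights


# Toy data: the first feature is a constant bias term; the label is 1 when x > 0.
toy = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
weights = train_sgd(toy, num_features=2)
```

Each update touches only one example, which is why this style of training works sequentially on data that is streamed rather than loaded all at once.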
There were many, many more small fixes, improvements, refactorings and cleanups. Go check out the new release, give the new features a try and report back to us on the user mailing list.