Seminar on scaling learning at DIMA TU Berlin #
Last Thursday the seminar on scaling learning problems took place at DIMA at TU Berlin. We had five students give
talks.
The talks started with an introduction to MapReduce. Oleg Mayevskiy first explained the basic concept, then gave an overview of the parallelization architecture, and finally showed how common tasks can be formulated as MapReduce jobs.
His paper as well as his slides are available
online.
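The canonical illustration of the programming model (an assumption for illustration here, not necessarily the example from the talk) is word counting: the map phase emits a `(word, 1)` pair per occurrence, the framework groups pairs by key, and the reduce phase sums the counts. A minimal single-process sketch:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Emit an intermediate (word, 1) pair for every word occurrence.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Sum all counts emitted for one word.
    return (word, sum(counts))

def run_job(documents):
    # Shuffle step: sort and group intermediate pairs by key,
    # as the framework would do between map and reduce.
    pairs = sorted(kv for doc in documents for kv in map_phase(doc))
    return dict(reduce_phase(word, (count for _, count in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

print(run_job(["the quick brown fox", "the lazy dog"]))
```

In a real framework such as Hadoop, the map and reduce functions run in parallel on many machines, and the shuffle happens over the network; only the two user-supplied functions stay this small.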
Second was Daniel Georg, who worked on the rather broad topic of NoSQL databases. Since the topic is far too broad to be covered in a single 20-minute talk, Daniel focused on distributed solutions - namely Bigtable/HBase and Yahoo!
PNUTS.
Daniel’s paper as well as
the slides are available online
as well.
Third was Dirk Dieter Flamming on duplicate detection. He concentrated on algorithms for near-duplicate detection, which are needed when building information retrieval systems that work with real-world documents: the web is full of copies, mirrors, near duplicates, and documents made of partial copies. Identifying near duplicates is important not only to reduce the amount of data stored but also to make it possible to track original authorship over time.
Again, paper and slides are available
online.
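One common approach to near-duplicate detection (named here as an illustrative assumption; the paper covers the algorithms in detail) is w-shingling: represent each document as its set of overlapping word w-grams and compare sets with Jaccard similarity. A minimal sketch:

```python
def shingles(text, w=3):
    # Represent a document as its set of overlapping word w-grams.
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    # Resemblance of two shingle sets: |A intersect B| / |A union B|.
    return len(a & b) / len(a | b)

doc1 = "the web is full of copies mirrors and near duplicates"
doc2 = "the web is full of mirrors copies and near duplicates"
similarity = jaccard(shingles(doc1), shingles(doc2))
# Pairs whose similarity exceeds some threshold are flagged as near duplicates.
```

Exact pairwise comparison does not scale to the web; production systems typically compress the shingle sets with sketches such as MinHash before comparing.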
After a short break, Qiuyan Xu presented ways to learn ranking functions from explicit as well as implicit user feedback. Any interaction with a search engine provides valuable feedback about the quality of the current ranking function. Watching users - and learning from their clicks - can help improve future ranking functions.
A very detailed paper as well as slides are available for download.
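A classic way to turn click logs into training data (one well-known technique, given here as an illustrative assumption rather than the specific method from the talk) is to extract pairwise preferences: a clicked result is taken as preferred over every higher-ranked result the user saw and skipped. A minimal sketch:

```python
def click_preferences(ranking, clicked):
    # For each clicked document, emit (preferred, skipped) pairs saying the
    # user preferred it over every higher-ranked document left unclicked.
    preferences = []
    for position, doc in enumerate(ranking):
        if doc in clicked:
            preferences.extend((doc, other) for other in ranking[:position]
                               if other not in clicked)
    return preferences

# User saw results d1..d4 and clicked only d3:
print(click_preferences(["d1", "d2", "d3", "d4"], {"d3"}))
```

Such pairs are relative judgments, not absolute relevance labels, which makes them robust to the fact that users mostly inspect the top of the ranking; they can then feed a pairwise learning-to-rank method.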
The last talk was by Robert Kubiak, on topic detection and tracking. He presented methods for identifying and tracking emerging topics, e.g. in news streams or blog postings. Given the amount of new information published digitally each day, such systems can help users follow interesting news topics or notify them of new, emerging topics.
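A simple baseline for spotting new topics in a stream (a common first-story-detection heuristic, assumed here for illustration; the paper discusses the actual methods) is to compare each incoming document against everything seen so far and flag it as a new topic when nothing is sufficiently similar:

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def detect_new_topics(stream, threshold=0.5):
    # First-story detection: a document dissimilar to every earlier
    # document is flagged as the start of a new topic.
    seen, new_topics = [], []
    for doc in stream:
        vector = Counter(doc.lower().split())
        if all(cosine(vector, old) < threshold for old in seen):
            new_topics.append(doc)
        seen.append(vector)
    return new_topics
```

Real systems refine this with time decay, named entities, and better term weighting, since raw term overlap alone confuses ongoing coverage of an old topic with a genuinely new one.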
Paper and slides are available online.
If you are a student in Berlin interested in scalable machine learning: the next IMPRO2 course has been set up. As last year, the goal is not only to improve your skills in writing code but also to interact with the community and, where appropriate, to contribute back the work created during the course.