Data Scientists - researchers' perspectives

2012-08-03 20:35
"Data scientist" as a term has attracted quite a bit of attention lately (together with all the big data, scalability and cloud hype). Instead of rehashing arguments seen in other sources, I thought it might make more sense to link to a few of the thought-provoking posts I came across recently.



Machine learning problem settings

2011-08-06 07:10
Together with Sebastian Schelter I held a Nokia-sponsored (Thank you!) lecture on large-scale data analysis and data mining during the past few months.

After I had supervised a few successful university projects based on Apache Mahout, the goal of this lecture was to introduce students to some of the basic concepts and problems encountered today in a world where huge datasets are generally available and are easy to process with Apache Hadoop. As such, the course is targeted at an entry-level audience; thorough treatment of the mathematical background of the latest machine learning technology is left to the machine learning research groups in Potsdam, at TU Berlin and the neural information processing group at TU.

Slides and exercises are available online via git. Please let me know if you want to re-use them in your lecture.



The very first problem that users of machine learning algorithms usually come across is mapping their application problem to one of the various machine learning problems. In 2010 Michael Brückner gave a lecture on Intelligent Data Analysis with Matlab (slides and videos in German) including a simple taxonomy of algorithms:

According to


  • the types of input data an algorithm can handle (either independent instances, also called examples, sequences or graphs of instances)
  • the type of training data available (e.g. instances with assigned nominal target attribute, no labels at all, a partial sorting of sets of instances)
  • and the learning goal
algorithms can be nicely partitioned by the learning problem that they solve. Based on that very first step of identifying exactly what the problem setting is, deciding which algorithm to use becomes much easier. Based on that taxonomy I came up with the above graphic giving a first overview of which tasks can be solved with machine learning:

Boxes in dark blue are what is generally called supervised learning, yellow unsupervised and light blue semi-supervised - based on the amount of labeled training data available. Red boxes indicate settings with the goal of knowledge discovery. Green boxes are ranking problems.
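
The distinction between supervised, unsupervised and semi-supervised settings by the amount of labeled training data can be sketched in a few lines of Python. This is only an illustration, not part of the lecture material: the function name learning_setting and the convention of marking a missing label with None are assumptions made for this example.

```python
from typing import Optional, Sequence


def learning_setting(labels: Sequence[Optional[str]]) -> str:
    """Name the learning setting implied by how many instances carry a label.

    Assumption for this sketch: a missing label is represented as None.
    """
    if not labels:
        raise ValueError("need at least one training instance")

    # Count the instances that come with an assigned target attribute.
    labeled = sum(1 for label in labels if label is not None)

    if labeled == len(labels):
        return "supervised"        # every instance is labeled
    if labeled == 0:
        return "unsupervised"      # no labels at all
    return "semi-supervised"       # only some instances are labeled


if __name__ == "__main__":
    print(learning_setting(["spam", "ham", "spam"]))
    print(learning_setting([None, None]))
    print(learning_setting(["spam", None, None]))
```

The sketch covers only one axis of the taxonomy (the type of training data available). The type of input data and the learning goal would have to be added as further parameters to distinguish, for example, ranking from knowledge discovery.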

Machine Learning Gossip Meeting Berlin

2010-10-25 18:51
This evening the first Machine Learning Gossip meeting is scheduled to take place at 9 p.m. at Victoriabar: professionals working in research advancing machine learning algorithms and in industry projects putting machine learning algorithms to practical use meet for some drinks, food and hopefully lots of interesting discussions.

If it is successful, the meeting is supposed to take place on a regular schedule. Ask Michael Brückner for the date and location of the next meetup.

MLOSS workshop at ICML accepted

2010-02-15 19:31
The workshop on machine learning open source software has been accepted. Find further details on the workshop homepage.

Submissions are open until April 10th, Samoa time.