Machine learning problem settings

2011-08-06 07:10
Over the past few months, Sebastian Schelter and I held a Nokia-sponsored lecture (thank you!) on large-scale data analysis and data mining.

After supervising a few successful university projects based on Apache Mahout, the goal of this lecture was to introduce students to some of the basic concepts and problems encountered today, in a world where huge datasets are generally available and easy to process with Apache Hadoop. As such, the course is targeted at an entry-level audience. A thorough treatment of the mathematical background of the latest machine learning technology is left to the machine learning research groups in Potsdam, at TU Berlin and the neural information processing group at TU.

Slides and exercises are available online via git. Please let me know if you want to re-use them in your lecture.

The very first problem that users of machine learning algorithms usually come across is mapping their application problem to one of the various machine learning problems. In 2010 Michael Brückner gave a lecture on Intelligent Data Analysis with Matlab (slides and videos in German) including a simple taxonomy of algorithms:

According to

  • the types of input data an algorithm can handle (independent instances, also called examples; sequences of instances; or graphs of instances)
  • the type of training data available (e.g. instances with assigned nominal target attribute, no labels at all, a partial sorting of sets of instances)
  • and the learning goal
algorithms can be nicely partitioned by the learning problem that they solve. Once that very first step of identifying the exact problem setting is done, deciding which algorithm to use becomes much easier. Based on that taxonomy I came up with the above graphic, which gives a first overview of the tasks that can be solved with machine learning:

Boxes in dark blue are what is generally called supervised learning, yellow boxes are unsupervised and light blue ones semi-supervised, depending on the amount of labeled training data available. Red boxes indicate settings with the goal of knowledge discovery. Green boxes are ranking problems.
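
To make the idea of mapping an application to a problem setting a bit more concrete, here is a minimal sketch in Python. It is not the taxonomy from the lecture or the graphic: the category names, the `PROBLEM_SETTINGS` table and the `identify_setting` helper are all made up for illustration, and the table only covers a handful of well-known settings.

```python
from typing import Optional

# Hypothetical, heavily simplified lookup table (an assumption for
# illustration, not the taxonomy from the lecture). The key follows the
# three criteria: (input data type, training data type, learning goal).
PROBLEM_SETTINGS = {
    ("independent", "nominal_labels", "predict_label"): "classification",
    ("independent", "numeric_labels", "predict_value"): "regression",
    ("independent", "no_labels", "group_similar"): "clustering",
    ("independent", "partial_order", "order_instances"): "ranking",
    ("sequence", "nominal_labels", "predict_label"): "sequence labeling",
}


def identify_setting(input_type: str, training_data: str, goal: str) -> Optional[str]:
    """Return the problem setting for the three criteria, or None if the
    combination is not covered by this small illustrative table."""
    return PROBLEM_SETTINGS.get((input_type, training_data, goal))


# Example: spam filtering on individual mails with spam/ham labels.
print(identify_setting("independent", "nominal_labels", "predict_label"))
```

Once the setting is pinned down this way, the search for a suitable algorithm can be restricted to those that solve exactly that setting.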