Machine learning problem settings
Over the past few months, Sebastian Schelter and I held a Nokia-sponsored (thank you!) lecture on large-scale
data analysis and data mining.
Following the supervision of a few successful
university projects based on Apache Mahout, the goal of this lecture was to introduce students to some of the basic
concepts and problems encountered today, in a world where huge datasets are generally available and easy to process
with Apache Hadoop. As such, the course is targeted at an entry-level audience - thorough treatment of the mathematical
background of the latest machine learning technology is left to the machine learning research groups in Potsdam and at TU Berlin, and to the
neural information processing group at TU.
Slides and exercises are available online via git. Please let me know if you want to re-use
them in your lecture.
The very first problem that users
of machine learning algorithms usually come across is mapping their application problem to one of the various machine
learning problems. In 2010 Michael Brückner gave a lecture on Intelligent Data Analysis with Matlab (slides and
videos in German) including a simple taxonomy of algorithms:
The taxonomy groups algorithms according to
- the type of input data an algorithm can handle (independent instances, also called examples; sequences of instances; or graphs of instances)
- the type of training data available (e.g. instances with an assigned nominal target attribute, no labels at all, or a partial ordering of sets of instances)
- and the learning goal
Boxes in dark blue are what is generally called supervised learning, yellow boxes are unsupervised and light blue boxes semi-supervised learning - based on the amount of labeled training data available. Red boxes indicate settings with the goal of knowledge discovery. Green boxes are ranking problems.
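
To make the distinction by amount of labeled training data a bit more concrete, here is a minimal Python sketch. It is not taken from the lecture material: the function name `learning_setting` and the convention of marking an unlabeled instance with `None` are assumptions made purely for illustration, and the sketch ignores the other two taxonomy dimensions (input type and learning goal).

```python
from typing import Optional, Sequence


def learning_setting(labels: Sequence[Optional[str]]) -> str:
    """Name the setting from how many training instances carry a label.

    Hypothetical helper: `None` stands for "no label assigned".
    """
    if not labels:
        raise ValueError("need at least one training instance")

    # Count the instances that have a target attribute assigned.
    labeled = sum(1 for label in labels if label is not None)

    if labeled == len(labels):
        return "supervised"       # every instance is labeled
    if labeled == 0:
        return "unsupervised"     # no labels at all
    return "semi-supervised"      # only some instances are labeled


if __name__ == "__main__":
    print(learning_setting(["spam", "ham", "spam"]))
    print(learning_setting([None, None, None]))
    print(learning_setting(["spam", None, None]))
```

In practice the boundary is less crisp than this sketch suggests; it only mirrors the rough colour coding described above.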