Thanks for all the help

2012-12-31 11:24
This year was a blast: It started with the ever great FOSDEM in Brussels (see you there in 2013?), an invitation to GeeCon in Poznan (if you ever get an invitation to speak there - do accept, the organisers do an amazing job at that event). In summer we had Berlin Buzzwords in Berlin for the third time with 700 attendees (to retain the community feel to the conference we decided to limit tickets in 2013, so make sure you get your's early). In autumn I changed my name and afterwards spent two amazing weeks in Sydney, only to attend Strata EU afterwards. Finally in December I was invited to go through the most amazing submissions for Hadoop Summit in Amsterdam 2013 (it was incredibly hard to pick and choose - thanks to Sean and Torsten for assisting me with that choice for the Hadoop Applied track.)

I think I would have gone mad if it hadn't been for all the help from friends and family: A big hug to my husband for keeping me sane when ever times got a bit rough in terms of stuff in my calendar. Thanks for all your support throughout the year. Another huge hug to my family - in particular to my mom who early 2012 volunteered to take care of most of the local organisation of our wedding (we got married close to where I grew up) and put in several surprises that she "kept mum" about up to the very last second. Also everyone who helped fill your wedding magazine with content (and train my ma in dealing with all sorts of document formats containing that content in her mail box - me personally I was forbidden to even just touch her machine during the first nine months of 2012 ;) ).

Another thanks to David Obermann for a series of interesting talks at this year's Apache Hadoop Get Together Berlin. It's amazing to see the event continue to grow even after essentially stepping down from being the main organiser.

Speaking of events: Another Thank You to Julia Gemählich, Claudia Brückner and the whole Berlin Buzzwords team. This was the first year I reduced the time I put into the event considerably - it was the first year I could attend the conference and not be all too tired to enjoy at least some of the presentations. You did a great job! Also thanks to my colleagues over at Nokia who provided me with a day a week to get things done for Buzzwords. In the same context: A very big thank you to every one who helped turn Berlin Buzzwords into a welcoming event for everyone: Thanks to all speakers, sponsors and attendees. Looking forward to seeing you again next year.

Finally a really big Thanks to all the people who helped turn our wedding day and vacation afterwards into the great time it was: Thanks to our families, our friends (including but not limited to the best photographer and friend I've met so far, those who hosted us in Sydney, and the many people who provided us with information on where to go and what to do.)

There's one thing though that bugged me by the end of this year:

So I decided that my New Year's resolution for 2013 would be to ramp up the time I spend on Apache one way or another: At least as committer for Apache Mahout, Mentor for Apache Drill and as a Member of the foundation.

Wishing all of you a Happy New Year - and looking forward to another successful Berlin Buzzwords in 2013.

RecSys Stammtisch Berlin - December 2012

2012-12-30 12:40
Earlier this month I attended the fourth Recommender Stammtisch in Berlin. The event was kindly hosted by Soundcloud - who on top of organising the speakers provided a really yummy buffet by Kochzeichen D.

With Paul Lamere the evening started with a very entertaining but also very packed talk on why music recommendation is special - or put more generally why all recommender systems are special:

  • Traditionally recommender systems found their way into the wild to drive sales. In music however the main goal is to help users discover new content.
  • Listeners are very different: Ranging from those indifferent to what is being played (imagine someone sitting in a coffee bar enjoying their espresso - it's unlikely that those would want to influence the playlist of the shop's entertainment system unless they are really annoyed with it's content). There are casual listeners who from time to time skip a piece. There are more engaged people who train their own recommender through services like Finally there are fanatics that are really into certain kinds of music. Building just one system to fit them all won't do. Also relying on just one signal won't do - instead you will have to deal with both, content signals like loudness plots as well as community signals.
  • Music applications tend to be highly interactive. So even if there is little to no reliable explicit feedback people tell you how much you like your music when skipping, turning pieces louder, interacting with the content behind the song being played.
  • In contrast to many other domains music deals with a vast item space and a huge long tail of songs that almost never get interacted with.
  • In contrast to shopping recommenders however in music making mistakes is comparably cheap: In most situations music isn't purchased on a song by song basis but based on some subscription model. That way the actual cost of playing the wrong song is low. Also songs tend to be not much longer than 5min so also users are less annoyed when confronted with a slightly wrong piece of music.
  • When implementing recommenders for a shopping site it is to be avoided to re-recommend stuff a user has purchased already. This is not the case in music recommendation. Quite the contrary: Re-recommending known music is one indicator for playlists people will like.
  • When it comes to building playlists care must be taken to organise songs in a coherent way, mixing new and familiar songs in a pleasing order - essentially the goal should be to take the listener on a journey.
  • Fast updates are crucial: Music business itself is fast paced with new releases coming out regularly and being taken up very quickly by major broadcasting stations.
  • Music is highly contextual: It pays to know if the user is in the mood for calm or for faster music..
  • There are highly passionate users that are very easy to scare away - those tend to be the loudest ones influencing your user community most.
  • Though meta data is key in music as well, never expect it to be correct. There are all sorts of weird band and song names that you never thought would be possible - an observation that also Ticketmaster made when building their ticket search engine.
  • Music is highly social and irrational - so just knowing your users friends and their tastes won't get you to being perfect.

Overall I guess the conclusion is that no matter which domain you deal with you will always need to know the exact properties of that domain to build a successful system.

In the second talk Brian McFee explained one way of modeling playlists with context. With that he concentrated on passive music discovery - that is based on one query return a list of music to listen to sequentially as opposed to active retrieval where users issue a query to search for a specific piece of music.

Historically it turned out to be difficult to come up with any playlist generator that is better than randomly selecting songs to play. His model is based on a random walk notian where the vertices are songs and edges represent learnt group similarities. Groups were represented by features like familiarity, social tags, specific audio features, metadata, release dates etc. Depending on the playlist category in most cases he was able to show that his model actually does perform better than random.

In the third talk Oscar Celma showed some techniques to also benefit from some of the more complicated signals for music recommendation. Essentially his take was that by relying on usage signals only you will be stuck with the head of the usage distribution only. What you want though is to be able to provide recommendations for the long tail as well.

Some signals he mentioned included content based features (rythm, BPM, timbre, harmony), usage signals, social signals (beware of people trying to game the system or make fun of it though) and a mix of all those. His recommendation was to put content signals at the end of the processing pipeline for re-ranking and refining playlists.

When providing recommendations it is essential to be able to answer why something was recommended. Even just in the space of novelty vs. relevancy to the user there are four possible strategies: a) recommend only old stuff that is marginally relevant to the specific user: This will end up pulling up mostly popular songs. b) recommend what is new but not relevant to the user: This will end up pulling out songs that turn your user away. c) recommend what is relevant to the user but old, this will mostly surface stuff the user knows already but is a safe bet to play. d) recommend what is both relevant and new to the user - here the real interesting work starts as this deals with recommending genuinely new songs to users.

To balance discovery with safe fallback go for skips, bans, likes and dislikes. Take into account the user context and attention.

The final point the speaker made was the need to take into account the whole picture: Your shiny new recommendation algorithm will just be a tiny piece in the puzzle. Much more work will need to go into data collection and ingestion, into API design.

The last talk finally went into some detail of the history of playlist creation - back from music creators' choices, via radio station mixes, mix tapes and finally ending up at spotify and fully automatic playlist creation.

There is a vast body of knowledge on how to create successful playlists e.g. among DJs that speak about warm-up phases, chillout times, alternating types of music in order to take the audience on a journey. Even just shuffling music the user already knows can be very powerful given the pool of songs the shuffle is based on neither too large (containing too broad types of music) nor too small (leading to frequent repetitions). According to Ben Fields the science and art of playlist generation and in particular evaluation is still pretty much in it's infancy with much to come.

Strata EU - part 4

2012-10-28 20:17
The rest of the day was mainly reserved for more technical talks: Tom Wight introducing the merits of MR2, also known as YARN. Steve Loughran gave a very insightful talk on the various failure modes of Hadoop – though the Namenode is like the most obvious single point of failure there are a few more traps waiting for those depending on their Hadoop clusters: Hadoop does just find with single harddisks failing. Failing single machines usually also does not create a huge issue. However what if the switch one of your racks is connected with fails? Suddenly not just one machine has to be re-replicated but a whole rack of machines. Even if you have enough space in your cluster left, can your network deal with the replication traffic? What if your cluster is split in half as a result? Steve gave an introduction to the various HA configurations available for Hadoop. There's one insight I really liked though: If you are looking for SPOFs in your system – just carry a pager … and wait.

In the afternoon I joined Ted Dunning's talk on fast kNN soon to be available in Mahout – the speedups gained really do look impressive – just like the fact that the algorithm is all online and single pass.

It was good to meet with so many big data people in two days – including Sean Owen who joined the Data Science Meetup in the evening.

Thanks to the O'Reilly Strata team – you really did an awesome job making Strata EU an interesting and very well organised event. If you yourself are still wondering what this big data thing is and in what respect it might be relevant to your company Strata is the place to be to find out: Though being a tad to high-level for people with a technical interest the selection of talks is really great when it comes to showing the wide impact of big data applications from IT, the medical sector right up to data journalism.

If you are interested in anything big data, in particular who to turn the technology into value make sure you check out the conferences in New York and Santa Clara. Also all keynotes of London were video taped and are available on YouTube by now.

Strata EU - part 3

2012-10-27 20:16
The first Tuesday morning keynote put the hype around big data into historical context: According to wikipedia big data apps are defined by their capability of coping with data set sizes that are larger than can be handled with commonly available machines and algorithms. Going from that definition we can look back to history and will realize that the issue of big data actually isn't that new: Even back in the 1950s people had to deal with big data problems. One example the speaker went through was a trading company that back in the old days had a very capable computer at their disposal. To ensure optimal utilisation they would rent out computing power whenever they did not need it for their own computations. One of the tasks they had to accomplish was a government contract: Freight charges on rails had been changed to be distance based. As a result the British government needed information on the pairwise distances between all trainstations in GB. The developers had to deal with the fact that they did not have enough memory to fit all computation into it – as a result they had to partition the task. Also Dijkstra's algorithm for finding shortest paths in graphs wasn't invented until 4 years later – so they had to figure something out themselves to get the job done (note: Compared to what Dijkstra published later it actually was very similar – only that they never published it). The conclusion is quite obvious: The problems we face today with Petabytes of data aren't particularly new – we are again pushing frontiers, inventing new algorithms as we go, partition our data to suit the compute power that we have.

With everyday examples and a bit of hackery the second keynote went into detail on what it means to live in a world that increasingly depends on sensors around us. The first example the speaker gave was on a hotel that featured RFID cards for room access. On the card it was noted that every entry and exit to the room is being tracked – how scary is that? In particular when taking into account how simple it is to trick the system behind into revealing some of the gathered information as shown a few slides later by the speaker. A second example he have was a leaked dataset of mobile device types, names and usernames. By looking at the statistics of that dataset (What is the distribution of device types – it was mainly iPads as opposed to iPhones or Android phones. What is the distribution of device names? - Right after manufacturer names those contained mainly male names. When correlating these with a statistic on most common baby name per year they managed to find that those were mainly in their mid thirties.) The group of people whose data had leaked used the app mainly on an iPad, was mainly male and in their thirties. With a bit more digging it was possible to deduce who exactly had leaked the data – and do that well enough for the responsible person (an American publisher) to not be able to deny that. The last example showed how to use geographical self tracking correlated with credit card transactions to identify fraudulent transactions – in some cases faster than the bank would discover them.

The last keynote provided some insight into the presentation bias prevalent in academic publishing – but in particular in medical publications: There the preference to publish positive results is particularly detrimental as it has a direct effect on patient treatment.

Strata EU - part 2

2012-10-26 20:15
The second keynote touched upon the topic of data literacy: In an age in which growing amounts of data are being generated being able to make sense of these becomes a crucial skill for citizens just like reading, writing and computing. The speaker's message was two-fold: a) People currently are not being taught how to deal with that data but are being taught that all that growing data is evil. Like an enemy hiding under their bed just waiting to jump at them. b) When it comes to getting the people around you literate the common wisdom is to simplify, simplify, simplify. However her approach is a little different: Don't simplify. Instead give people the option to learn and improve. As a trivial comparison: Just because her own little baby does not yet talk doesn't mean she shouldn't talk to it. Over time the little human will learn and adapt and have great fun communicating with others. Similarly we shouldn't over-simplify but give others a chance to learn.

The last keynote dealt gave a really nice perspective on information overload and the history of information creation. Starting back in the age of clay tablets where writing was to 90% used for accounting only – tablets being tagged for easier findability. Continuing with the invention of paper – back then still as roles as opposed to books that facilitated easy sequential reading but made random access hard. The obvious next step being books that allow for random access read. Going on to initial printing efforts in an age where books were still a scarce resource. Continuing to the age of the printing press with movable types when books became ubiquitous – introducing the need for more metadata attached to books like title pages, TOCs and indexes for better findability. As book production became simpler and cheaper people soon had to think of new ways to cope with the ever growing amount of information available to them. Compared to that the current big data revolution does not look to familiar anymore: Much like the printing press allowed for more and more books to become available , Hadoop allows for more and more data to be stored in clusters. As a result we will have to think about new ways to cope with the increasing amount of data at our disposal, time to start going beyond the mere production processes and deal with the implications for society. Each past data revolution left both – winners and loosers – mainly unintentioned by those who invented the production processes. Same will happen with today's data revolution.

After the keynotes I joined some of the nerdcore track talks on Clojure for data science and Cascalog for distributed data analysis, briefly joined the talk on data literacy for those playing with self tracking tools to finally join some friends heading out for an Apache Dinner. Always great to meet with people you know in cities abroad. Thanks to the cloud of people who facilitated the event!

O'Reilly Strata London - part 1

2012-10-25 20:13
A few weeks ago I attended O'Reilly Strata EU. As I had the honour of being on the program committee I remember how hard it was to decide on which talks to accept and which ones to decline. It's great to see that potential turned into an awesome conference on all things Big Data.

I arrived a bit late as I flew in only Monday morning. So I didn't get to see all of the keynotes and plunged right into Dyson's talk on the history of computing from Alan Turing to now including the everlasting goal of making computers more like humans, making them what is generally called intelligent.

The next keynote was co-presented by the Guardian and Google on the Guardian big data blog. Guardian is very well known for their innovative approach to journalism that more and more relies on being able to make sense of ever growing datasets – both public and not-yet-published. It was quite interesting to see them use technologies like Google Refine for cleaning up data, see them mention common tools like Google spreadsheets or Tableau for data presentation and learn more on how they enrich data by joining it with publicly available datasets.

Some thoughts on a conf taxonomy

2012-09-16 12:53
One common way for open source developers to meet face-to-face is to attend conferences relevant to their subject of interest. A common way to have one near you if there ain't none yet is to go and organise one yourself. The most obvious stuff to resolve for that task:

  • Most likely there will be some financial transactions involved - sponsors wanting to support you, attendees paying for their tickets, you paying for the venue and for food.
  • Someone will have to choose which speakers to invite.
  • How to scale if there are more speakers and attendees than you can reasonably welcome yourself.

So far I've come across a multitude of ways to deal with these two issues alone. Some encountered at events with >200 attendees are listed below. Feel free to add your context.

NameContent selectionFor profitTicketsFoodScaling model
FOSDEM/ Brusselsopen CfP, decision by organisersNope - it's hosted by a university, organised by a couple of students and an incredible multitude of volunteers.Access is completely free though attendees are being asked to support the conference with a donation.Food is on sale through the organisersIn addition to two main tracks there's a multitude of independently but affiliated and co-located so-called dev rooms that are completely community organised e.g. for Debian, Java, Embedded, KDE and others
Frosconopen CfP, decision by organisersNope - again hosted by a university, organised by a couple of students and volunteersTickets are cheap - in the 5 Euro rangeFood is on-sale at the event.There are workshops and related events that are community organised. Those are starting to get more visible in the main program as well.
Linux Tage Chemnitzopen CfP, decision by organisers + committee.Nope - hosted by TU Chemnitz with huge local support.Cheap - in the 5 Euro range.On sale at the event (soup and related stuff).Stable number of attendees so far.
Chaos Communication Congressopen CfP, decision by organisers + committeeyesfor four days slightly less than 100,- Euroon sale in the venue as well as aroundmove to different location
Chaos Campopen CfP, decision by organisers + committeeyes100 < prize < 500,- range for whole week including camping groundon sale at the locationnot needed so far
Berlin Buzzwordsopen CfP, decision by volunteersyesmore than 300,- Euros in early birdincluded in the priceaffiliated workshops
ApacheConopen CfP, decision by volunteersyesin EU >200,-, in US usually >1k$included in priceaffiliated meetups
Lucene Revolutionopen CfP, decision by organisersmore or less, mainly PR for organiser>500,-included in pricenot needed so far
GoTo Coninvitation onlyyes>500,- rangeincluded in priceturn the "one location" only conference into a series that moves across Europe with the help of some locals that are interested in having the event
Strataopen CfP, decision made by committee - final decision by organisersyesin the >500 Euro rangeincluded in pricesplit in different locations, organisers remain the same still

From the above table to me it seems that most conferences differ in whether they are fully non profit solely for the sake of education. In contrast to that there are events that are for profit (as in support the organisers financially), or some kind of self-marketing where profit is indirect in terms of more contracts signed. They also differ in whether submissions are open or invited talks only. In addition there are those that have paid talks (usually clearly marked as such) or accept talks through the submission form only. In terms of cost one model is to go extremely low-cost with no money paid for venue or food vs. those that include catering in the ticket price.

Me personally I have a strong preference to events that feature an open CfP - mainly because talks tend to be more diverse and - given a strong program committee - also of decent quality as only the best make it through. In addition the events tend to be less formal when fully community organised - over time regulars among speakers, attendees and exhibition participants tend to know each other generating a rather friendly athmosphere.

Moving to a new domain

2012-09-12 12:30
Executive summary: This is to warn those of you who are subscribed to this blog - the domain to reach this blog w/o redirects will soon change to by - you might want to adjust your rss subscription accordingly.

Longer version: This blog post is scheduled to go live some time after lunch-time on September 12th 2012. You might have heart rumors before - that date Ms. Isabel Drost and Mr. Thilo Fromm are supposed to get married.

There were times when war and conflicts between kingdoms were settled by having children of the reigns get married. Today this old tradition is being continued on a much smaller scale by having a couple get married that is comprised of one half being passionate about Linux Kernel hacking and a strong proponent of GPL/LGPL open source licensing and the other half coming from the Java world, mainly contributing to ASL projects.

As a bit of "showing of good will" both agreed to the proposal of Matthias Kirschner: Girls that are FSFE fellows really should only marry other FSFE fellows. So we got Thilo a fellowship membership setup very quickly.

PS: Now looking forward to dancing into a new part of life this evening ;)

Pps: Thanks to photomic for the DLSR fotos, and to masq for taking the above picture and mailing it to my server. Having a secure shell on your mobile phone rocks!

FrOSCon - on teaching

2012-09-09 08:17
The last talk I went to during FrOSCon was Selena's keynote on "Mistakes were made". She started by explaining how she taught computer science (or even just computer-) concepts to teachers herself - emphasizing how exhausting teaching can be, how many even trivial concepts were unknown to her students. After that Selena briefly sketched how she herself came to IT - emphasizing how providing mostly the information she needed to accomplish the current task at hand and telling how to get more information helped her make her first steps a great deal.

The main point of her talk however was to highlight some of the underlying causes for the lack of talented cs students. Some background literature is online at her piratepad on the subject.

The discussion that followed the keynote (and included contributions from two very interested, refreshingly dedicated teachers) was quite lively: People generally agreed that computer science/ computing or even just logical and statistical thinking plays a sadly minor role in current education. Students are mainly forced to memorize large amounts of facts by heart but are not taught to question their environment, discover relations or rate sources of information. The obvious question that seemed to follows was that on what to remove from the curriculum when introducing computing as a subject. My personal take on that is that maybe there is no need for removing anything - instead changing the way concepts are taught might already go a long way: Put arts, maths, natural sciences and music into context, have kids evaluate statistics and rate them not only in maths but also in e.g. biology by letting them examine some common statistical fallacies in the subject area.

Another problem stated was the common lack of technical understanding, the common lack of time for preparation and the common lack of understanding for the concept of open source or creative commons content. Taken together this makes sharing teaching material and improving it together with others incredibly hard.

Selena's call to action was for geeks to get involved and educate the people near and dear to them instead of giving up. On thing to add to that: Most German universities have some sort of visitors' days to prospective students - some even have collaborations with schools to do projects together with younger ones - make sure to check out your own university - you might well find out that teaching is not only exhausting but also particularly rewarding especially when teaching students that really want to know and participate in your project just because they want to.

If you know any teachers who are open to the idea of having externals take over some their lessons or at least provide input get them connected with your peers that are interested in educating others. Also keep in mind that most open source projects, hacker spaces and related organisations in Germany are so-called "gemeinnütziger e.V." - a status that in many cases was achieved by declaring the advancement of education as at least one of their goals.

FrOSCon - understanding Linux with strace

2012-09-06 20:29
Being a Java child I had only dealt with strace once before: Trying to figure out whether any part of the Mahout tests happens to use /dev/random for initialisation in a late night debugging session with my favourite embedded Linux developer. Strace itself is a great tool to actually see what your program is doing in terms of system calls, giving you the option to follow on a very detailed level what is going on.

In his talk Harald König gave a very easy to follow over view on how to understand Linux with strace. Starting with the basic use cases (trace file access, program calls, replay data, analyse time stamps, do some statistics) the quickly moved on to showing some more advanced tricks you can do with the little tool: Finding sys-calls that take surprisingly long vs. times when user code is doing long-running computations. Capturing and replaying e.g. networking related calls to simulate problems that the application runs into. Figuring out bottlenecks (or just plain weird stuff) in the application by figuring out the most frequent syscall. Figuring out which configuration files an application really touches - sorting them by last modified date with a bit of shell magic might give an answer to the common question of whether the last update or the last time the user tinkered with the configuration turned his favourite editor to appear green instead of white. On the other hand it can also reveal when configurations have been deleted (in the presentation he moved away the user-emacs configuration. As a result emacs tried >30 times to find it for various configuration options during startup: Is it there? No. ... Is it now there? No. ... Maybe now? Nope. ... ;) ).

When looking at strace, you might also want to take a look at ltrace that traces library calls - the output there might be a bit more readable in that it's not just system calls but also library calls. Remember though that tracing everything can not only make your app pretty slow but also quickly generates several gigabytes of information.