First of all a huge Thank You to David Obermann for organising today's Apache Hadoop Get Together Berlin: After a successful Berlin Buzzwords and a rather long pause following that finally a Christmas meetup took place today at Smarthouse, kindly sponsored by Axel Springer and organised by David Obermann from idealo. About 40 guests from Neofonie, Nokia, Amen, StudiVZ, Gameduell, TU Berlin, nurago, Soundcloud, nugg.ad and many others made it to the event.
In the first presentation Douwe Osinga from triposo
went into some details on what Triposo is all about, how development for it differs in terms of scope and focus at larger corporations and what patterns they use for getting the data crawled, cleaned and served to users.
The goal of Triposo is to be able to build travel guides in a fully automated way. In contrast to simply creating a catalog of places to go to the goal is to have an application that is competitive to Lonely Planet books: Have tours, detailed background information, recommend places to visit based on wheather and seasonal signals, allow users to create travel books.
Joining Triposo from Google, Douwe gave a rather interesting perspective on what makes a startup interesting for innovative ideas. There are four interesting aspects of application development that according to his talk matter for Google projects: First is embracing failure. Not only can single hard disks fail, but servers might be switched off automatically for maintenance, even entire datacenters going offline must not affect your application. Second is a strong focus on speed: Developers working with dynamic languages like Python that allow for rapid prototyping at the expense of slower runtime are generally frowned upon. Third building block is the focus on search that is ingrained in every piece of architecture and thinking. Fourth and last is a strong mentality to build your own which may lead to great software but leaves software developers in an isolated island of proprietary software that can limit but at least shapes your way of thinking.
He gave Youtube as an example: Though built on top of MySQL, implemented in Python and certainly not failure proof in every aspect they succeeded by concentrating on users' needs, time to market and iteratively improving their software with a frequent (as in one week) develop-release-deploy cycle. When entering new markets and providing innovative applications it often is crucial to be able to move quickly at the expense of speed and stability. It certainly is important to consider different architectures and chose the one that is appropriate to solve the problem at hand. Same reasoning applies for Apache Hadoop as well: Do not try to solve problems with it that it is not made to solve. Instead first think what is the right tool for your job.
Triposo itself is built on top of 12 data sources. Most are freely available, integrated to build a usable and valuable travel guide application for iOS and Android. The features available in Triposo can be phrased in terms of a search and information retrieval problem setting and as such lend itself well for integrated sources. With offers from Amazon, Google itself, Dropbox and the like it has become easy to deploy applications in an elastic way and scale with your user base and demand for more country coverage. For them it proved advantages to go for an implementation based on dynamic languages for pure development speed.
When it comes to QA they take a semi-manual approach: There are scripts checking recall (Brandenburger Tor must be found for the Berlin guide) as well as precision (there must be only one Brandenburger Tor in Berlin). Those rules need to be manually tuned.
When integrating different sources you quickly run into a duplicate discovery problem. Their approach is pretty pragmatic: Merge anything that you are confident enough to say it is a duplicate. Kill everything that is likely a duplicate but you are not confident enough to merge. The general guideline is to rather miss a place than have it twice.
For the wikipedia source so far they are only parsing the English version. There are plans to also support other languages - in particular for parsing to increase data quality as e.g. for some places geo coordinates may be available in the German article but not in the English one.
Though not going into too many technical details the talk gave some nice insights as to the strengths and weaknesses of different company sizes and mindsets when it comes to innovation as well as stabilization. Certainly a startup to watch, glad to hear that though incorporated in the US most developers actually live in Berlin now.
The second talk was given by Max Jakob from Neofonie GmbH (working on EU funded research project Dicode) gave an overview of their pipeline for named entity extraction and disambiguation based on a language model extracted from the raw German wikipedia dump. They used Pig to scale a pipeline down from about a week to 1.5 hours with not much development overhead: Quite some logic could be re-used from the open source project pignlproc initiated by Olivier Grisel. This project already features a Wikipedia loader, a UDF for extracting information from Wikipedia documents and additional scripts for training and building corpora.
Based on that they defined the ML probability of a surface form being a named entity. The script itself is not very magical: The whole process can be expressed as a few steps involving grouping and couting tuples. The effect in terms of runtime vs. development time however is impressive.
Checkout their DICODE github project
for further details on the code itself.
After the meetup about 20 attendees followed David to a bar nearby. It is always great to get a chance to talk to the speakers, exchange experiences with others and learn more on what people are actually working on with Hadoop after the event.
Slides of all talks are going to be posted soon, videos go online as soon as they are post processed so stay tuned for further information.
Looking forward to seeing you again for the next meetup. If you could not make it this time, there is a very easy way to not have that happen again next time: First speaker to submit a talk proposal to David sets the date and time of the meetup (taking into account any constraints with venue and video taping of course).