Archive

Archive for December, 2011

#28c3

December 30th, 2011 at 2:07am

Restate my assumptions.

One: Mathematics is the language of nature.

Two: Everything around us can be represented and understood through numbers.

Three: If you graph the numbers of any system, patterns emerge. Therefore, there are patterns everywhere in nature.

The above is a quote from today’s “Hackers in movies” talk at 28c3 - which amongst others also showed a brief snippet of the movie Pi. For several years I stayed well away from that one famous Hackers’ conference in Berlin that takes place annually between Christmas and New Year. 23C3 was the last congress I attended until today. Though there were several fantastic talks and mean presentation quality was pretty good the standard deviation of talk quality was just too high for my taste. In addition due to limited room sizes with 4 tracks there were quite a few space issues.

In recent years much of that has changed: The maximum number of tickets is strictly enforced, there is an additional lounge area in a large tent next to the entrance, for the sake of having larger rooms the number of tracks was reduced to three. Streaming works for the most part making it possible for those who did not get one of the 3000 full conference tickets to follow the program from their preferred hacker space. In addition fem does an amazing job of recording, mastering, encoding and pushing videos online: Hacker Jeopardy - a show that wasn’t over until early Thursday morning (about 3a.m.?) - was up on Youtube at least on Thursday at 7a.m if not earlier.

Several nice geeks got me talked into joining the crowd briefly this evening for a the last three talks in “Saal 1″ depicted above: You cannot be in Berlin during 28c3 and not see the so-called “fnord Jahresrückblick” by Fefe and Frank Rieger, creators of the Alternativlos podcast.

Overall it is amazing to watch BCC being invaded by a large group of hackers. It’s fun to see quite a few of them on Alexanderplatz, watch people have fun with a TV B Gone in front of large electronics stores. It’s great to get to watch highly technical but also several political talks 4 days in a row from 11 a.m. until at least 2p.m. the following day that are being given by people who are very passionate about what they do and the projects they spend their time on.

If you are into tinkering, hacking, trying out sorting algorithms and generally having fun with technology make sure you check out the 28c3 Youtube channel. If you want to learn more on Hacker culture, mark the days between Christmas and New Year next year and attend 29c3 - don’t worry if you do not speak German - the majority of talks is in English, most of the ones that aren’t are being translated on the fly by volunteers. If you are good at translations, feel free to volunteer yourself for that task. Speaking of volunteering: My respect to all angels (helping hands), heralds (those introducing speakers), noc (network operating center), poc (phone operating center), the organisation team and anyone who helps keep make that event as enjoyable to attendees as it is.

Update: Thank you to the geeks who after staying in our apartment for #28c3 helped get it back to a clean state - actually cleaner than it was before. You rock!

General, Hacking

Learning Machine Learning with Apache Mahout

December 13th, 2011 at 10:20pm

Once in a while I get questions like Where to start learning more on machine learning. Other than the official sources I think there is quite good coverage also in the Mahout community: Since it was founded several presentations have been given that give an overview of Apache Mahout, introduce special features or even go into more details on particular implementations. Below is an attempt to create a collection of talks given so far without any claim to contain links to all videos or lectures. Feel free to add your favourite in the comments section. In addition I linked to some online courses with further material to get you started.

When looking for books of course check out Mahout in Action. Also Taming Text and the data mining book that comes with weka are good starting points for practitioners.

Introductory, overview videos

Technical details

Further course material

Mahout, Science , ,

Berlin Tech Meetups

December 9th, 2011 at 10:32pm

Berlin currently is of growing interest for software engineers, has a very active startup scene and as a result several community organised meetups. Listed below is a short, “highly objective” selection of local user groups - showing just the breadth of topics discussed.

If you want to discover new meetups: It helps attending one that is closest to your interest as usually people follow several user groups. In addition watching the scheduled event at co-working and hacker spaces like co-up Berlin, betahaus, c-base can help.

General, Relocating to Berlin ,

Video up: Douwe Osinga

December 9th, 2011 at 10:01pm

Video: Max Jacob on Pig for NLP

December 9th, 2011 at 9:26pm

Slides online

December 9th, 2011 at 6:55am

Slides of this week’s Apache Hadoop Get Together Berlin are online at:

Overall a great event, well organised - looking forward to seeing you at the next Get Together. If you want to get in touch with our participants, learn about new events or simply chat between meetups join our Apache Hadoop Get Together Linked.In Group.

Get Together , , , ,

Apache Hadoop Get Together Berlin December 2011

December 8th, 2011 at 1:50am

First of all a huge Thank You to David Obermann for organising today’s Apache Hadoop Get Together Berlin: After a successful Berlin Buzzwords and a rather long pause following that finally a Christmas meetup took place today at Smarthouse, kindly sponsored by Axel Springer and organised by David Obermann from idealo. About 40 guests from Neofonie, Nokia, Amen, StudiVZ, Gameduell, TU Berlin, nurago, Soundcloud, nugg.ad and many others made it to the event.

In the first presentation Douwe Osinga from triposo went into some details on what Triposo is all about, how development for it differs in terms of scope and focus at larger corporations and what patterns they use for getting the data crawled, cleaned and served to users.

The goal of Triposo is to be able to build travel guides in a fully automated way. In contrast to simply creating a catalog of places to go to the goal is to have an application that is competitive to Lonely Planet books: Have tours, detailed background information, recommend places to visit based on wheather and seasonal signals, allow users to create travel books.

Joining Triposo from Google, Douwe gave a rather interesting perspective on what makes a startup interesting for innovative ideas. There are four interesting aspects of application development that according to his talk matter for Google projects: First is embracing failure. Not only can single hard disks fail, but servers might be switched off automatically for maintenance, even entire datacenters going offline must not affect your application. Second is a strong focus on speed: Developers working with dynamic languages like Python that allow for rapid prototyping at the expense of slower runtime are generally frowned upon. Third building block is the focus on search that is ingrained in every piece of architecture and thinking. Fourth and last is a strong mentality to build your own which may lead to great software but leaves software developers in an isolated island of proprietary software that can limit but at least shapes your way of thinking.

He gave Youtube as an example: Though built on top of MySQL, implemented in Python and certainly not failure proof in every aspect they succeeded by concentrating on users’ needs, time to market and iteratively improving their software with a frequent (as in one week) develop-release-deploy cycle. When entering new markets and providing innovative applications it often is crucial to be able to move quickly at the expense of speed and stability. It certainly is important to consider different architectures and chose the one that is appropriate to solve the problem at hand. Same reasoning applies for Apache Hadoop as well: Do not try to solve problems with it that it is not made to solve. Instead first think what is the right tool for your job.

Triposo itself is built on top of 12 data sources. Most are freely available, integrated to build a usable and valuable travel guide application for iOS and Android. The features available in Triposo can be phrased in terms of a search and information retrieval problem setting and as such lend itself well for integrated sources. With offers from Amazon, Google itself, Dropbox and the like it has become easy to deploy applications in an elastic way and scale with your user base and demand for more country coverage. For them it proved advantages to go for an implementation based on dynamic languages for pure development speed.

When it comes to QA they take a semi-manual approach: There are scripts checking recall (Brandenburger Tor must be found for the Berlin guide) as well as precision (there must be only one Brandenburger Tor in Berlin). Those rules need to be manually tuned.

When integrating different sources you quickly run into a duplicate discovery problem. Their approach is pretty pragmatic: Merge anything that you are confident enough to say it is a duplicate. Kill everything that is likely a duplicate but you are not confident enough to merge. The general guideline is to rather miss a place than have it twice.

For the wikipedia source so far they are only parsing the English version. There are plans to also support other languages - in particular for parsing to increase data quality as e.g. for some places geo coordinates may be available in the German article but not in the English one.

Though not going into too many technical details the talk gave some nice insights as to the strengths and weaknesses of different company sizes and mindsets when it comes to innovation as well as stabilization. Certainly a startup to watch, glad to hear that though incorporated in the US most developers actually live in Berlin now.

The second talk was given by Max Jakob from Neofonie GmbH (working on EU funded research project Dicode) gave an overview of their pipeline for named entity extraction and disambiguation based on a language model extracted from the raw German wikipedia dump. They used Pig to scale a pipeline down from about a week to 1.5 hours with not much development overhead: Quite some logic could be re-used from the open source project pignlproc initiated by Olivier Grisel. This project already features a Wikipedia loader, a UDF for extracting information from Wikipedia documents and additional scripts for training and building corpora.

Based on that they defined the ML probability of a surface form being a named entity. The script itself is not very magical: The whole process can be expressed as a few steps involving grouping and couting tuples. The effect in terms of runtime vs. development time however is impressive.

Checkout their DICODE github project for further details on the code itself.

After the meetup about 20 attendees followed David to a bar nearby. It is always great to get a chance to talk to the speakers, exchange experiences with others and learn more on what people are actually working on with Hadoop after the event.

Slides of all talks are going to be posted soon, videos go online as soon as they are post processed so stay tuned for further information.

Looking forward to seeing you again for the next meetup. If you could not make it this time, there is a very easy way to not have that happen again next time: First speaker to submit a talk proposal to David sets the date and time of the meetup (taking into account any constraints with venue and video taping of course).

Get Together , ,

About

December 1st, 2011 at 10:16am

This is the blog of Isabel. I am member of the Apache Software Foundation, co-founder of Apache Mahout. In my free time I like hacking on free software projects, bringing people together, learning more on technologies related to the current big data hype, travelling to friends in interesting countries.

I am co-founder of the Berlin Apache Hadoop Get Together and the Berlin Buzzwords conference - it’s so much more fun to have all those interesting people come to Berlin instead of having to travel abroad. I’m also regular speaker at various conferences on all things free software and big data. From time to time I publish stuff not only on my personal blog but also on heise.de, at the Software and Support Verlag and various others.

Disclaimer: I am writing this blog with my “Apache hat”, sometimes also with my “FSFE hat”, on my head. The opinions expressed herein are my own personal opinions and do not represent my employer’s view in any way. Except when explicitly stated otherwise activities and events described are not related to my daytime job.

About