Archive for February, 2012

Apache Hadoop Get Together - February 2012

February 23rd, 2012 at 12:14am

Today the first Hadoop Get Together Berlin of 2012 took place. David arranged for the event to be hosted by and at Axel Springer, who kindly also paid for the (soon to be published) videos. Thanks also to the unbelievable Machine Company for the tasty buffet after the meetup, and to Open Source Press for donating three of their Hadoop books.

Today’s selection was quite diverse. The event started with a presentation by Markus Andrezak, who gave an overview of Kanban and how it helped him change the development workflow over at eBay/mobile. Kanban suits environments that require flexibility, and it decreases the risk associated with any single release by bringing the number of features per release down to an absolute minimum. At Mobile his team got release cycles down to once a day; more than ten releases a day aren’t unheard of either. His general goal was to reduce release risk by shipping fewer features and fewer moving parts per release, and thereby fewer potential sources of problems: If anything goes wrong, rolling back is no issue - nor is narrowing down which change in that particular release introduced the bug.

This development- and output-focused part of the process is complemented by an input-focused Kanban cycle for product design: Products move from idea to vision to a more detailed backlog to development and finally go live, the same way issues in development itself move from todo to in progress, under review and done.

With both cycles the main goal is to keep the number of items in progress as low as possible. This results in more focus for each developer and greatly reduces overhead: Don’t do more than one or two things at a time. Only catch: Most companies are focused on keeping development busy at all times - their goal is to reach 100% utilization. This however is in no way correlated with actual efficiency: At 100% utilization there is no buffer, so there is no way to deal with problems along the way. The idea should instead be to concentrate on a constant flow of released, live features.
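The point about utilization can be made concrete with a small calculation. The sketch below is not from the talk: it assumes the team behaves like a simple single-server queue (the textbook M/M/1 model), in which the average time an item spends in the system is the average handling time divided by (1 - utilization). The class and method names are made up for illustration.

```java
public class UtilizationSketch {

    // Average time an item spends in an M/M/1 queue (waiting plus handling),
    // expressed as a multiple of the average handling time: 1 / (1 - rho).
    // ASSUMPTION: the M/M/1 model is a simplification chosen for this sketch.
    static double relativeLeadTime(double utilization) {
        if (utilization < 0.0 || utilization >= 1.0) {
            throw new IllegalArgumentException("utilization must be in [0, 1)");
        }
        return 1.0 / (1.0 - utilization);
    }

    public static void main(String[] args) {
        double[] levels = {0.5, 0.8, 0.9, 0.99};
        for (double rho : levels) {
            System.out.printf("utilization %.2f -> lead time %.1f x handling time%n",
                    rho, relativeLeadTime(rho));
        }
    }
}
```

Under this assumed model lead time grows without bound as utilization approaches 100%, which is one way to read the "no buffer" argument.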

Now what is the link of all that to Hadoop? (Hint: No, this is no pun on the project’s slow release cycle.) Kanban allows for frequent releases and thereby for frequent feedback. This enables a style of development that starts out from a model of your business case (no matter how coarse that may be), builds some code, measures how that code performs based on actual usage data and adjusts the model accordingly. Kanban lets you iterate on that loop very quickly, eventually getting you ahead of competitors. In terms of technology, Hadoop is one strong tool in the toolbox for doing real analytics on incoming data and for scaling up the analysis of business data.

In the second talk Martin Scholl started out by comparing music as recorded or printed on sheets with the actual performance of musicians in a concert: The former represents static, factual data. The latter represents a process that may be recorded, but cannot itself be copied as it lives by the interactions with the audience. The same holds true for social networks: Their current state and the way you look at them is deeply influenced by your way of interacting with the system in realtime.

So in addition to storage solutions for static data, he argues, we also need a way to process streaming data in an efficient and fault tolerant way. The system he uses for that purpose is Storm, which was open-sourced by Twitter late last year. Built on top of ZeroMQ, it allows for flexible and fault tolerant messaging. Example applications mentioned were event analysis (filtering, aggregation, counting, monitoring) and parallel distributed RPC based on message passing.

Two concrete examples include setting up a live A/B testing environment that is dynamically reconfigurable based on its input, as well as event handling in a social network environment where interactions might trigger messages being sent by mail and instant message but also trigger updates in a recommendation model.

In the last talk Fabian Hüske from TU Berlin introduced Stratosphere - an EU-funded research project that is working on an extended computational model on top of HDFS that provides more flexibility and better performance. As it was developed before the rise of Apache Hadoop YARN, the team unfortunately had to essentially re-implement the whole map/reduce computational layer and put their system into that. It would be interesting to see how a port to YARN performs and what sort of advantages it gives in production.

Looking forward to seeing you all in June for Berlin Buzzwords - make sure to submit your presentation soon, call for presentations won’t be extended this year.

Get Together

HowTo: Meetups in Berlin

February 14th, 2012 at 8:23pm

I get that question once in a while - and need the list below myself every now and then: How to actually set up a meetup in Germany. Essentially it all boils down to three questions: Which channels to use for PR? Where to hold the meeting? What other benefits to offer to attendees?

When it comes to PR there are several options:

  • Announce the meetup on relevant mailing lists
  • Use social networking sites relevant to your project - in Germany Xing works best; Twitter, Facebook, LinkedIn and Google+ are other options
  • Ask anyone you know personally for help with spreading the word
  • If you have a personal blog, post the information there

Where to go for the meetup:

The venue usually is the biggest question mark. After deciding on how big you’d like to shoot for initially you can start looking for a location. For your first meetup don’t rent a room - with a bit of creativity there are lots of options that are free of charge.

  • If you are a student or have active relations to any university, going there usually is the cheapest and least complicated option.
  • Another option is to just book a table in a restaurant that has a reasonably large room. Simply choose your favourite one - knowing the owner helps in getting extra space.
  • Third option is to go to any co-working space that also has a meeting area. In general they are very open to hosting community events - co-up Berlin, Betahaus are just two options.
  • If you are planning a less formal event, your local hacker spaces might be an option: c-base Berlin, in Berlin e.V. are two Berlin examples. Hacker Dojo and Noisebridge are two Bay Area examples.
  • Last but not least look out for local startups that are currently hiring new people: They tend to be very open to hosting events. See Berlin Buzzwords Hackathon providers list for some examples.

What else?

  • Make sure attendees can register themselves - Xing works for that, so do Google forms
  • Set up a mailing list or some other notification service to help people track future events (Google Groups works, so does a dedicated Twitter account)
  • Provide some background online - meetup.com works but does charge a small fee. Setting up a blog on WordPress or Blogger works as well, though it is not quite as interactive as the meetup.com site.
  • Get in touch with attendees and local companies - usually they are quite happy to provide some financial support to your meetup for free drinks or videos.
  • If you want videos: Recording audio is trivial, and putting it online is extremely simple if you use SoundCloud’s app. Recording video is rather simple as well but can be time consuming. Finding sponsors to pay for the videos is reasonably simple if you offer to brand them. For the Hadoop Get Together we usually hire Martin Schmidt. Sites to put videos online: Vimeo works but has rather low upload limits, blip.tv is a bit better in this respect.
  • Sponsoring in general: Companies looking for developers related to the meetup’s technology as well as those providing consulting for that technology tend to be open to supporting local events. What works best is to contact people you already know there - they will know best who to ask internally.

One final note: Being the organiser of such a meetup puts you at the center of a local community. Over time people will start remembering your face and name. Make sure you do the same - you should at least be able to remember faces, affiliations and names of your regular attendees.

General

Happy Valentine

February 14th, 2012 at 6:24am

Free Software developers can be very critical: Every single line of code gets scrutinized, every design is reviewed by several often opinionated people. Even the way communities are supposed to work sometimes gets restricted. Sometimes a simple Thank You can make all the difference for any contributor or committer.

I love Free Software!

FSFE proposed a really nice campaign: Celebrate the “I love Free Software” - Day on February 14th. In the hope that some of the readers of this blog actively develop or contribute to free software projects - this is a thank you for you! It’s your contributions that make all the difference - be it code, documentation, help for users or code reviews.

Free Software

February 14th: “I love free software day”

February 13th, 2012 at 9:07pm

This year FSFE is once again running their I love free software campaign on February 14th: The goal they put up is to have more love reports, hugs and Thank You messages sent out than bug reports filed against projects.

They have put a few ideas online on what to do that day. I’d like to add one additional option: If you are using any free software and you feel the urgent need to file a bug report on that day, use the opportunity to submit a patch as well: Make sure to not only describe what is going wrong but add a patch that contains a test to show the issue and a code modification that fixes the issue, is compatible with the project’s coding guidelines and doesn’t break anything else in the project. Any other contribution (documentation, increasing test coverage, help to other users) is welcome as well of course.

Software Foundation

Note to self - Java heap analysis

February 9th, 2012 at 9:30pm

As I keep searching for those URLs over and over again, I am linking them here. When running into JVM heap issues (an out of memory exception is a pretty sure sign, so can be the program getting slower and slower over time) there are a few things you can do for analysis:

Start by telling the affected JVM process to output some statistics on heap layout as well as thread state by sending it a SIGQUIT (if you want to use the signal number - it’s 3 - avoid typing 9 instead ;) ).
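From a shell the signal is typically sent with kill -3 <pid> or kill -QUIT <pid>. When sending signals is not an option, similar information can be collected from inside the process through the standard java.lang.management API. The following is only a sketch of that idea: the class and method names are made up for illustration, and the output is far less detailed than a real SIGQUIT dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class JvmSnapshot {

    // One-line summary of current heap usage in bytes.
    static String heapSummary() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return String.format("heap used=%d committed=%d max=%d",
                heap.getUsed(), heap.getCommitted(), heap.getMax());
    }

    // Name and state of every live thread, one per line.
    static String threadSummary() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock details so no optional JVM features are required
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append(info.getThreadName())
               .append(' ')
               .append(info.getThreadState())
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(heapSummary());
        System.out.print(threadSummary());
    }
}
```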

More detailed insight is available via JConsole - remote setup can be a bit tricky but is well doable and worth the effort as it gives much more detail on what is running and what memory consumption really looks like.

For a detailed analysis take a heap dump with either jmap, JConsole or by starting the process with the JVM option -XX:+HeapDumpOnOutOfMemoryError. Look at it either with jhat or the IBM heap analyzer. NetBeans also offers nice support for searching for memory leaks.
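With jmap a dump is usually taken with something like jmap -dump:format=b,file=heap.hprof <pid> (the file name is just an example). A dump can also be triggered from within a running process. The sketch below assumes a HotSpot-based JVM, since HotSpotDiagnosticMXBean lives in the com.sun.management package and is not part of the portable Java API; the class name HeapDumper and the output location are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {

    // Writes a heap dump of the current JVM to the given file.
    // ASSUMPTION: HotSpot-based JVM; the target file must not exist yet.
    static File dumpHeap(File target, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(target.getAbsolutePath(), liveObjectsOnly);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Unique name in the temp directory; newer JDKs expect the .hprof suffix.
        File target = new File(System.getProperty("java.io.tmpdir"),
                "heap-sketch-" + System.nanoTime() + ".hprof");
        dumpHeap(target, true);
        System.out.println("Heap dump written to " + target.getAbsolutePath());
    }
}
```

The resulting file can then be opened with the same tools as a dump taken from the outside.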

On a more general note, for diagnosing Java applications see Rainer Jung’s presentation on troubleshooting Java applications as well as Attila Szegedi’s presentation on JVM tuning.

Hacking, Note to Self

Apache Mahout 0.6 released

February 8th, 2012 at 9:33pm

As of Monday, February 6th a new Apache Mahout version was released. The new package features

Lots of performance improvements:

  • A new LDA implementation using Collapsed Variational Bayes 0th Derivative Approximation - try that out if you have been bothered by the way less than optimal performance of the old version.
  • Improved Decision Tree performance and added support for regression problems
  • Reduced runtime of dot product between vectors - many algorithms in Mahout rely on that, so these performance improvements will affect anyone using them.
  • Reduced runtime of LanczosSolver tests - faster tests mean faster development cycles and make it easier to modify Mahout.
  • Increased efficiency of parallel ALS matrix factorization
  • Performance improvements in RowSimilarityJob, TransposeJob - helpful for anyone trying to find similar items or running the Hadoop based recommender

New features:

  • K-Trusses, Top-Down and Bottom-Up clustering, Random Walk with Restarts implementation
  • SSVD enhancements

Better integration:

  • Added MongoDB and Cassandra DataModel support
  • Added numerous clustering display examples

Many bug fixes, refactorings, and other small improvements. More information is available in the Release Notes.

Overall great improvements towards better performance, better stability and integration. However there are still quite a few outstanding issues and issues in need of review. Come join the project, help us improve existing patches, improve performance and in particular integration and streamlining of how to use the different parts of the project.

Mahout

Clojure in Berlin

February 2nd, 2012 at 12:01am

Though I had the chance to tinker with some Clojure code only briefly, its programming model and the resulting compact programs do fascinate me. As the resulting code runs on a JVM and integrates well with existing Java libraries, migration is comparatively cheap and easy.

Today I finally managed to attend the local Berlin Clojure meetup, co-organised by Stefan Hübner and Fronx. Timing couldn’t have been much better: In this evening’s event Philip Potter from Thoughtworks introduced Overtone - a library for making music with Clojure.

After installing and configuring JACK for sound output, SuperCollider and Overtone, outputting your first tone is as simple as loading the Overtone library and typing

(definst foo [] (saw 220))
(foo)

To stop it type (stop).

Other types of waves are of course supported as well, as is playing different waves simultaneously and modifying them at runtime. Expressing sounds as notes (c, d, e, f, g) that may have a certain length is possible too - which makes it so much easier to design music than having to think in frequencies.

A sample of what can easily be done with Overtone:


The original sound is way better - this sample was taken with a mobile phone, compressed, re-coded and then put online. Check out the Overtone project for the real thing - and don’t even try to listen to the sample with low-end laptop speakers ;)

Overall a well organised meetup (Thanks to Soundcloud for hosting it, to the organisers for putting it together and to the speaker for a really well done introduction to Overtone) and an interesting way to get started with Clojure with very fast (audio) feedback.

General