On being aggressively public

2014-02-01 19:06
If it didn't happen on the mailing list it didn't happen at all.
... with all it's implications this seems to be the hardest lesson for newcomers to the Apache way of development to learn.

In the minimal sense it means that any decision a project takes has to be taken publicly, preferably in some archived, searchable medium. In a wider sense it's usually interpreted as:

  • ... don't ask people getting started questions privately by mail, rather go to the user mailing list.
  • ... don't coordinate and take decisions in video conferences, rather take the final decision on the dev mailing list.
  • ... don't ask people privately on IM, rather go to the user mailing list.


There are even canned answers that can get sent out to those asking privately.

This concept seems to be very hard to grasp for people: As a student, developers tend to be afraid to ask "stupid questions" - it's easy to say that "there are no stupid questions, only stupid answers" - truly feeling this to be true can be hard though in particular for people who are just beginning.

However (assuming the question hasn't been answered before or the answer wasn't easy to find with a search engine of your choice) asking these questions is a valuable contribution: Chances are, there are a couple other people who would have the same question - and are too shy to ask. Chances are even bigger that those developing the project would never think of documenting the answer anywhere simply because they do not see it as a question due to their background and knowledge.

For researchers in particular on their way to getting a PhD. the problem usually is the habit of publishing facts only after they are checked, cleaned and well presentable. When it comes to developing software what this approach boils down to developing a huge amount of code hidden from others. The disadvantages are clear: You get no feedback on whether or not there would be simpler/faster approaches to solving parts of your problem. Developers of the upstream project won't take the needs for this extension into account when rewriting current modules or adding new code. As a result general advise is to share and discuss your ideas before even starting to code.

Even for seasoned software engineers this approach tends to be a problem when not used to it: In a corporate environment feedback on code often is perceived as humiliating. As much as we preach that a developer isn't their code so feedback on coding issues is not to be taken seriously it still takes a lot of getting used to when suddenly people give that feedback - and give it in the open for everyone to see.

Also often decisions are taken privately before communicating publicly in corporations. As a result having heated design discussions tends to feel like a major marketing problem. Instead what is achieved is making the decision process publicly visible to others so they can follow the arguments without repeating them over and over. It also means that others can chime in and help out if they are affected (after all the second mantra is that those who do the work are the ones who ultimately take the decisions).

Again another reason could be for people new to the project or new to some role in the project that asking trivial questions looks intimidating. Again - even if you've contributed to a project for years noone expects you to know anything. If any piece isn't documented in sufficient depth asking the question provides a track record for others to later refer to. It also means you tend to get your answer faster because there are more people who can answer your question. Oh - and if you don't get an answer chances are, everyone else is in the dark as well but was just not asking.

One final word: One fear many people seem to share when asking is to make their own knowledge gaps and flaws public - not only to peers but also to current or future superiors. There's three parts to the advise that I tend to give: Everyone once started small. If your potential future manager or team mates don't understand this, than learning and making mistakes obviously aren't welcome in the environment that you are aspiring to join - is it really worth joining such an environment? Even so you start small, if you continue to contribute on a regular basis the amount of information that you produce online will grow. Guess how much time people spend on evaluating CVs - it's very unlikely they will read and follow each discussion that you contributed to. And as a third point - do assume that by default everything you do online will be visible and searchable forever. However chances are that ten or fifteen years from now what you are writing today will no longer be accessible: Services go away, projects and their infrastructure die, discussion media change - anyone still remember using NNTP on a regular basis?

As an exercise to the reader: Before writing this post I spend a couple of hours trying to dig up the mails I had sent I believe in 2000 to the I think KURT and RTLinux discussion lists in order to prepare some presentation about the state of realtime Linux back at university. I search the web, search the archives I could find, searched my own backups from back then - but couldn't find anything. If you do find some trace of that, I owe you a beer next time we meet at FOSDEM ;)

Notes on storage options - FOSDEM 05

2013-02-17 20:43
On MySQL

Second day at FOSDEM for me started with the MySQL dev room. One thing that made me smile was in the MySQL new features talk: The speaker announced support for “NoSQL interfaces” to MySQL. That is kind of fun in two dimensions: A) What he really means is support for the memcached interface. Given the vast number of different interfaces to databases today, announcing anything as “supports NoSQL interfaces” sounds kind of silly. B) Given the fact that many databases refrain from supporting SQL not because they think their interface is inferior to SQL but because they sacrifice SQL compliance for better performance, Hadoop integration, scaling properties or others this seems really kind of turning the world upside-down.

As for new features – the new MySQL release improved the query optimiser, subquery support. When it comes to replication there were improvements along the lines of performance (multi threaded slaves etc.), data integrity (replication check sums being computed, propagated and checked), agility (support for time delayed replication), failover and recovery.

There were improvements along the lines of performance schemata, security, workbench features. The goal is to be the go-to-database for small and growing businesses on the web.

After that I joined the systemd in Debian talk. Looking forward to systemd support in my next Debian version.

HBase optimisation notes

Lars George's talk on HBase performance notes was pretty much packed – like any other of the NoSQL (and really also the community/marketing and legal dev room) talks.



Lars started by explaining that by default HBase is configured to reserve 40% of the JVM heap for in memory stores to speed up reading, 20% for the blockcache used for writing and leaves the rest as breath area.

On read HBase will first locate the correct region server and route the request accordingly – this information is cached on the client side for faster access. Prefetching on boot-up is possible to save a few milliseconds on first requests. In order to touch as little files as possible when fetching bloomfilters and time ranges are used. In addition the block cache is queried to avoid going to disk entirely. A hint: Leave as much space as possible for the OS file cache for faster access. When monitoring reads make sure to check the metrics exported by HBase e.g. by tracking them over time in Ganglia.

The cluster size will determine your write performance: HBase files are so-called log structured merge trees. Writes are first stored in memory and in the so-called Write-Ahead-Log (WAL, stored and as a result replicated on HDFS). This information is flushed to disk periodically either when there are too many log files around or the system gets under memory pressure. WAL without pending edits are being discarded.

HBase files are written in an append-only fashion. Regular compactions make sure that deleted records are being deleted.

In general the WAL file size is configured to be 64 to 128 MB. In addition only 32 log files are permitted before a flush is forced. This can be too small a file size or number of log files in periods of high write request numbers and is detrimental in particular as writes sync across all stores, so large cells in one family will cause a lot of writes.

Bypassing the WAL is possible though not recommended as it is the only source for durability there is. It may make sense on derived columns that can easily be re-created in a co-processor on crash.

Too small WAL sizes can lead to compaction storms happening on your cluster: Many small files than have to be merged sequentially into one large file. Keep in mind that flushes happen across column families even if just one family triggers.

Some handy numbers to have when computing write performance of your cluster and sizing HBase configuration for your use case: HDFS has an expected 35 to 50 MB/s throughput. Given different cell size this is how that number translates to HBase write performance:






Cell sizeOPS
0.5MB70-100
100kB250-500
10kBwith 800 less than expected as this HBase is not optimised for these sizes
1kB6000, see above


As a general rule of thumb: Have your memstore be driven by size number of regions and flush size. Have the number of allowed WAL logs before flush be driven by fill and flush rates.. The capacity of your cluster is driven by the JVM heap, region count and size, key distribution (check the talks on HBase schema design). There might be ways to get rid of the Java heap restriction through off-heap memory, however that is not yet implemented.

Keep enough and large enough WAL logs, do not oversubscribe the memstore space, keep the flush size in the right boundaries, check WAL usage on your cluster. Use Ganglia for cluster monitoring. Enable compression, tweak the compaction algorithm to peg background I/O, keep uneven families in separate tables, watch the metrics for blockcache and memstore.

Elastic Search meetup Berlin – January 2013

2013-02-01 18:34
The first meetup this year I went to started with a large bag of good news for Elastic Search users. In the offices of Sys Eleven (thanks for hosting) the meetup started at 7p.m. last Tuesday. Simon Willnauer gave an overview of what to expect of the upcoming major release of Elastic Search:

For all 0.20.x version ES features a shard allocator version that is ignorant of which index shards belong to, machine properties, usage patterns. Especially ignoring index information can be detrimental and lead to having all shards of one index on one machine in the end leading to hot spots in your cluster. Today this is solved by lots of manual intervention or even using custom shard allocator implementations.

With the new release there will be an EvenShardCountAllocator that allows for balancing shards of indexes on machines – by default it will behave like the old allocator but can be configured to take weighted factors into account. The implementation will start with basic properties like “which index does this shard belong to” but the goal is to also make variables like remaining disk space available. To avoid constant re-allocation there is a threshold on the delta that has to be passed for re-allocation to kick in.

0.21 will be released when Lucene 4.1 is integrated. That will bring new codecs, concurrent flushing (to avoid the stop-the-world flush during indexing that is used in anything below Lucene 4 – hint: Give less memory to your JVM in order to cause more frequent flushes), there will be compressed sort fields, spellchecking and suggest built into the search request (though unigram only). There will be one similarity configurable per field – that means you can switch from TF-IDF to alternative built-in scoring models or even build your own.

Speaking of rolling your own: There is a new interface for FieldData (used for faceting, scoring and sorting) to allow for specialised data structures and implementations per field. Also the default implementation will be much more memory efficient for most scenarios be using UTF-8 instead of UTF-16 characters).

As for GeoSpatial: The code came to Lucene as a code dump that the contributor wasn't willing to support or maintain. It was replaced by an implementation that wasn't that much better. However the community is about to take up the mess and turn it into something better.

After the talk the session essentially changed to an “interactive mailing list” setup where people would ask questions live and get answers both from other users as well as the developers. Some example was the question for recommendability of pyes as a library. Most people had used it, many ran into issues when trying to run an upgrade with features being taken away or behaviour being changed without much notice. There are plans to release Perl, Ruby and Python clients. However also using JRuby, Groovy, Scala or Clojure to communicate with ES works well.

On the benefit of joining the cluster for requests: That safes one hop for routing, result merging, is an option to have a master w/o data and helps with indexing as the data doesn't go through an additional node.

As for plugins the next thing needed is an upgrade and versioning schema. Concerning plugin reloading without restarting the cluster there was not much ambition to get that into the project from the ES side of things – there is just too much hazzle when it comes to loading and unloading classes with references still hanging around to make that worthwhile.

Speaking of clients: When writing your own don't rely on the binary protocol. This is a private interface that can be subject to change at any time.

When dealing with AWS: The S3 gateway is not recommended to be used as it is way too slow (and as a result very expensive). Rather backup with replicas, keep the data around for backup or use rsync. When trying to backup across regions this is nothing that ES will help you with directly – rather send your data to both sites and index locally. One recommendation that came from the audience was to not try and use EBS as the IO optimised versions are just too expensive – it's much more cost effective to rely on ephermeral storage. Another thing to checkout is the support for ES being zone aware to avoid having all shards in one availability zone. Also the node discovery timeout should be increased to at least one minute to work in AWS. When it comes to hosted solutions like heroko you usually are too limited in what you can do with these offers compared to the low maintenance overhead of running your own cluster. Oh, and don't even think about index encryption if you want to have a fast index without spending hours and hours of development time on speeding your solution up with custom codecs and the like :)

Looking forward to the Elastic Search next meetup end of February – location still to be announced. It's always interesting to see such meetup groups grow (this time from roughly 15 in November to over 30 in January).

PS: A final shout-out to Hossman - that psychological trick you played on my at your boosting and biasing talk at Apache Con EU is slightly annoying: Everytime someone mentions TF-IDF in a talk (and that isn't too unlikely in any Lucene, Solr, Elastic Search talks) I panicingly double check whether there are funny pictures on the slide shown! ;)

Thanks for all the help

2012-12-31 11:24
This year was a blast: It started with the ever great FOSDEM in Brussels (see you there in 2013?), an invitation to GeeCon in Poznan (if you ever get an invitation to speak there - do accept, the organisers do an amazing job at that event). In summer we had Berlin Buzzwords in Berlin for the third time with 700 attendees (to retain the community feel to the conference we decided to limit tickets in 2013, so make sure you get your's early). In autumn I changed my name and afterwards spent two amazing weeks in Sydney, only to attend Strata EU afterwards. Finally in December I was invited to go through the most amazing submissions for Hadoop Summit in Amsterdam 2013 (it was incredibly hard to pick and choose - thanks to Sean and Torsten for assisting me with that choice for the Hadoop Applied track.)

I think I would have gone mad if it hadn't been for all the help from friends and family: A big hug to my husband for keeping me sane when ever times got a bit rough in terms of stuff in my calendar. Thanks for all your support throughout the year. Another huge hug to my family - in particular to my mom who early 2012 volunteered to take care of most of the local organisation of our wedding (we got married close to where I grew up) and put in several surprises that she "kept mum" about up to the very last second. Also everyone who helped fill your wedding magazine with content (and train my ma in dealing with all sorts of document formats containing that content in her mail box - me personally I was forbidden to even just touch her machine during the first nine months of 2012 ;) ).

Another thanks to David Obermann for a series of interesting talks at this year's Apache Hadoop Get Together Berlin. It's amazing to see the event continue to grow even after essentially stepping down from being the main organiser.

Speaking of events: Another Thank You to Julia Gemählich, Claudia Brückner and the whole Berlin Buzzwords team. This was the first year I reduced the time I put into the event considerably - it was the first year I could attend the conference and not be all too tired to enjoy at least some of the presentations. You did a great job! Also thanks to my colleagues over at Nokia who provided me with a day a week to get things done for Buzzwords. In the same context: A very big thank you to every one who helped turn Berlin Buzzwords into a welcoming event for everyone: Thanks to all speakers, sponsors and attendees. Looking forward to seeing you again next year.

Finally a really big Thanks to all the people who helped turn our wedding day and vacation afterwards into the great time it was: Thanks to our families, our friends (including but not limited to the best photographer and friend I've met so far, those who hosted us in Sydney, and the many people who provided us with information on where to go and what to do.)

There's one thing though that bugged me by the end of this year:





So I decided that my New Year's resolution for 2013 would be to ramp up the time I spend on Apache one way or another: At least as committer for Apache Mahout, Mentor for Apache Drill and as a Member of the foundation.

Wishing all of you a Happy New Year - and looking forward to another successful Berlin Buzzwords in 2013.

GeeCon - failing software projects fast and rapidly

2012-05-23 08:04
My second day started with a talk on how to fail projects fast and rapidly. There are a few tricks to do that that relate to different aspects of your project. Lets take a look at each of them in turn.

The first measures to take to fail a project are organisational really:

  • Refer to developers as resources – that will demotivate them and express that they are replaceable instead of being valuable human beings.
  • Schedule meetings often and make everyone attend. However cancel them on short notice, do not show up yourself or come unprepared.
  • Make daily standups really long – 45min at least. Or better yet: Schedule weekly team meetings at a table, but cancel them as often as you can.
  • Always demand Minutes of Meeting after the meeting. (Hint: Yes, they are good to cover your ass, however if you have to do that, your organisation is screwed anyway.)
  • Plans are nothing, planning is everything – however planning should be done by the most experienced, estimation does not have to happen collectively (that only leads to the team feeling like they promissed something), rather have estimations be done by the most experienced manager.
  • Control all the details, assign your resources to tasks and do not let them self-organise.


When it comes to demotivating developers there are a few more things than the obvious critizing in public that will help destroy your team culture:

  • Don't invest in tooling – the biggest screen, fastest computer, most comfortable office really should be reserved for those doing the hard work, namely managers.
  • Make working off-site impossible or really hard: Avoid having laptops for people, avoid setting up workable VPN solutions, do not open any ssh ports into your organisation.
  • Demand working overtime. People will become tired, they'll sacrifice family and hobbies, guess how long they will remain happy coders.
  • Blindly deploy coding standards across the whole company and have those agreed upon in a committee. We all know how effective committee driven design (thanks to Pieter Hintjens for that term) is. Also demand 100% test coverage, forbid test driven development, forbid pair programming, demand 100% Junit coverage.
  • And of course check quality and performance as the very last thing during the development cycle. While at that avoid frequent deployments, do not let developers onto production machines – not even with read only access. Don't do small releases, let alone continuous deployment.
  • As a manager when rolling out changes: Forget about those retrospectives and incremental change. Roll out big changes at a time.
  • As a team lead accept broken builds, don't stop the line to fix a build – rather have one guy fix it while others continue to add new features.


When it comes to architecture there are a few certain ways to project death that you can follow to kill development:

  • Enforce framework usage across all projects in your company. Do the same for editors, development frameworks, databases etc. Instead of using the right tool for the job standardise the way development is done.
  • Employ a bunch of ivory tower architects that communicate with UML and Slide-ware only.
  • Remember: We are building complex systems. Complex systems need complex design. Have that design decided upon by a committee.
  • Communication should be system agnostic and standardised – why not use SOAP's xml over http?
  • Use Singletons – they'll give you tightly coupled systems with a decent amount of global state.


When it comes to development we can also make life for developers very hard:

  • Don't establish best practices and patterns – there is no need to learn from past failure.
  • We need not definition of done – everyone knows when something is done and what in particular that really means, right?
  • We need not common language – in particular not between developers and business analysts.
  • Don't use version control – or rely on Clear Case.
  • Don't do continuous integration.
  • Have no code ownership – in contrast have a separate module modified by a different developer and forbid others to contribute. That leaves us with a nice bus factor of 1.
  • Don't do pair programming to spread the knowledge. See above.
  • Don't do refactoring – rather get it right from the start.
  • Don't do non-functional requirements – something like “must cope with high load” is enough of a specification. Also put any testing at the end of the development process, do lots of manual testing (after all machines cannot judge quality as well as humans can, right?), post-pone all difficult pieces to the end, with a bit of luck they get dropped anyway. Also test evenly – there is no need to test more important or more complex pieces heavier than others.

Disclaimer for those who do not understand irony: The speaker Thomas Sundberg is very much into the agile manifesto, agile principles and xp values. The fun part of irony is that you can turn around the meaning of most of what is written above and get some good advise on not failing your projects.

GeeCon - TDD and it's influence on software design

2012-05-22 08:04
The second talk I went to on the first day was on the influence of TDD on software design. Keith Braithwaite did a really great job of first introducing the concept of cyclomatic complexity and than showing at the example of Hudson as well as many other open source Java projects that the average and mean cyclomatic complexity of all those projects actually is pretty close to one and when plotted for all methods pretty much follows a power law distribution. Comparing the properties of their specific distribution of cyclomatic complexities over projects he found out that the less steep the curve is, that is the more balance the distribution is, that is the less really complex pieces there are in the code the more likely are developers happy with the current state of the code. Not only that, also that distribution would be transformed into something more balanced after refactorings.

Now looking at a selection of open source projects he analyzed what the alpha of the distribution of cyclomatic complexity is for projects that have no tests at all, have tests and those that were developed according to TDD. Turns out that the latter ones were the ones with the most balanced alpha.

GeeCon - Randomized testing

2012-05-21 08:02
I arrived late during lunch time on Thursday for GeeCon – however just in time to listen to one of the most interesting talks when it comes to testing. Did you ever have the issue of writing code that runs well in your development environment but crashes as soon as it's rolled out at customers only to find out that their Locale setting was causing the issues? Ever had to deal with random test failure because against better advise your tests did depend on execution order that is almost guaranteed to be different on new JVM releases?

The Lucene community has encountered many similar issues. In effect they are faced with having to test a huge number of different configuration combinations in order to make sure that their software runs in all client setups. In recent months they developed an approach called randomised testing to tackle this problem: Essentially on each run “random tests” are run multiple times, each time with a slightly different configuration, input, in a different environment (e.g. Locale settings, time zones, JVMs, operating systems). Each of these configurations are pseudo random – however on test failure the framework will reveal the seed that was used to initialize that pseudo random number generator and thus allow you to reproduce the failure deterministically.

The idea itself is not new: published in a paper by Ntafos, used in fuzzers to identify security holes in applications this kind of technique is pretty well known. However applying it to write tests is a new idea used at Lucene.

The advantage is clear: With every new run of the test suite you gain confidence that your code is actually stable to any kind of user input. The downside of course is that you will discover all sorts of different issues and bugs not only in your code but also in the JVM itself. If your library is being used in all sorts of different setups fixing these issues upfront however is crucial to avoid users being surprised that it does not work well in their setup. Make sure to fix these failures quickly though – developers tend to ignore flickering tests over time. Adding randomness – and thereby essentially increasing the number of tests in your testsuite – will add the amount of effort to invest in fixing broken code.

Dawid Weiss gave a great overview of how random tests can be used to harden a code base. He introduced the testframework written at carrot search that isolated the random test features: It comes with a RandomizedRunner implementation that can be used to subsitute junit's own runner. It's capable of tracking test isolation by tracking spawned threads that might be leaking out of tests. In addition it provides utilities for instance for creating random strings, locals, numbers as well as annotations to denote how often a test should run and when it should run (always vs. nightly).

So when having tests with random input – how do you check for correctness? The most obvious thing to do is when being able to check the exact output. When testing a sorting method, not matter what the implementation and the input is – the output should always be sorted, which is easy enough to check. Also checking against simpler, but maybe in practice more expensive algorithms is an option.

A second approach is to do sanity checks: Math.abs() at least should always return positive integers. The third approach is to do no checking at all in some cases. Why would that help? You'd be surprised by how many failures and exceptions you get by actually using your API in unexpected ways or giving your program unexpected input. This kind of behaviour checking does not need any assertions.

Note: Really loved the APad/ iMiga that Dawid used to give his talk! Been such a long time since I last played with my own Amiga...

Happy Valentine

2012-02-14 06:24
Free Software developers can be very critical: Every single line of code gets scrutinized, every design is reviewed by several often opinionated people. Even the way communities are supposed to work sometimes gets restricted. Sometimes a simple Thank You can make all the difference for any contributor or committer.

I love Free Software!

FSFE proposed a really nice campaign: Celebrate the "I love Free Software" - Day on February 14th. In the hope that some of the readers of this blog actively develop or contribute to free software projects - this is a thank you for you! It's your contributions that make all the difference - be it code, documentation, help for users or code reviews.

February 14th: "I love free software day"

2012-02-13 21:07
This year FSFE is once again running their I love free software campaign on February 14th: The goal they put up is to have more love reports, hugs and Thank You messages sent out than bug reports filed against projects.

They have put online a few ideas on what to do that day. I'd like to add one additional option: If you are using any free software and you feel the urgent need to file a bug report on that day, use the opportunity to submit a patch as well: Make sure to not only describe what is going wrong but add a patch that contains a test to show the issue and a code modification that fixes the issue, is compatible with the project's coding guidelines, doesn't break anything else in the project. Any other contribution (documentation, increasing test coverage, help to other users) welcome as well of course.

See you in Vancouver at Apache Con NA 2011

2011-10-24 13:49
Mid November Apache hosts its famous yearly conference - this time in Vancouver/Canada. They kindly accepted my presentations on Apache Mahout for intelligent data analysis (mostly focused on introducing the project to new comers and showing what happened within the project in the past year - if you have any wish concerning topics you would like to see covered in particular, please let me know) as well as a more committer focused one on Talking people into creating patches (with the goal of highlighting some of the issues new-comers to free software projects that want to contribute run into and initiating a discussion on what helps to convince them to keep up the momentum and over come and obstacles).

Looking forward to seeing you in Vancouver for Apache Con NA.