Note to self: Backup bottlenecks

2014-03-23 18:26
I learnt the following relations the hard way 10 years ago when trying to back up a rather tiny amount of data, and went through the computation again three years ago. Still I had to re-do the computation this morning when trying to pull a final full backup from my old MacBook. Posting here for future reference. Note 1: Some numbers like 10BASE-T are included only for historical reference. Note 2: I excluded the Alice DSL uplink speed - if included, the left-hand chart would no longer be particularly helpful or readable...

Children tinkering

2014-01-05 02:07
Years ago I decided that in case I got the same question at least three times I would write down the answer and put it somewhere online in a more or less public location that I can link to. The latest question I got once too often came from daddies (mostly, sorry - not even a handful of moms around me, let alone moms who are into tech) looking for ways to get their children in touch with technology.

Of course every recommendation depends heavily on the age and interests of the little one in question. However, most recommendations are based on using a child's love for games - see also a story on how a father accidentally turned his daughter into a Dungeons and Dragons fan for a bit more background on what I mean.

There are several obvious paths, including Lego Mindstorms, the programming kits by Fischertechnik, several electronics kits you can get at your favourite shop, and fun stuff like Makey Makey kits that can turn a banana into a controller. Also many games come with their own level designers (think Little Big Planet, though the older children might remember that even Far Cry, Doom and friends came with level designers).

In addition, by now there are quite a few courses and hacking events that kids are invited to go to - like the FrogLabs co-located with FrOSCon, the Chaos macht Schule initiative by the CCC, meetups like the ones hosted by Open Tech School, and Jugend Hackt. Quite a few universities also collaborate with schools to bring pupils in touch with research (and oftentimes tech) - like e.g. at HU Berlin.

Beyond those, there are three less typical recommendations:

  • As a child I loved programming a turtle (well, a white dot really) to move across the screen forwards or backwards, to turn east, south, west or north, to paint or to stop painting. The slightly more advanced (both in a graphical as well as in an interactive sense of the word) version of that would be to go for Squeak (all Smalltalk, first heard about it about a decade ago at the Chemnitzer Linuxtage) or Scratch (a geek dad kindly showed that to me years ago).
  • When it comes to hardware hacking one recommendation I can give from personal experience is to take part in one of the soldering courses by Mitch Altman - you know, the guy who invented the "TV-B-Gone". Really simple circuits, you solder yourself (no worries, the parts are large and robust enough that breaking them is really, really, really hard). What you end up with tends to be blinking and in some cases is programmable. As an aside: Those courses really aren't only interesting for little ones - I've seen adults attend, including people who are pretty deep into Java programming and barely ever touch circuits in their daily work.
  • If you are more into board games: Years ago one of my friends invited me to a RoboRally session. Essentially every player programs their little robot to move across the board.

When it comes to books one piece I can highly recommend (didn't know something like that existed until my colleagues came up with it) would be the book "Geek Mom" - there's also an edition called "Geek Dad". Be warned though, this is not tech only.

If you know of any other events, meetups, books or games that you think should really go on that list, let me know.

Linux vs. Hadoop - some inspiration?

2013-01-16 20:22
This (even for my blog’s standards) long-ish blog post was inspired by a talk given late last year at Apache Con EU as well as by discussions around what constitutes “Apache Hadoop compatibility” and how to make extending Hadoop easier. The post is based on conversations with at least one person close to the Linux kernel community and another developer working on Hadoop. Both were extremely helpful in answering my questions and sanity-checking the post below. After all, I’m neither an expert on Linux kernel development and design, nor am I an expert on the detailed design and implementation of features coming up in the next few Hadoop releases. Thanks for your input.

Posting this here as I thought the result of my attempts to understand the exact design commonalities and differences better might be interesting for others as well. Disclaimer: This is by no means an attempt to influence current development, it just summarizes some recent thoughts and analysis. As a result I’m happy about comments pointing out additions or corrections - preferably as a trackback or maybe on Google Plus, as I had to turn off comments on this very blog for spamming reasons.

In his slides on “Insides Hadoop dev” at Apache Con EU, Steve Loughran included a comparison that has popped up rather often in the recent past but still made me think:

“Apache Hadoop is an OS for the datacenter”

It does make a very good point, even though being slightly misleading in my opinion:

  • There are lots of applications that need to run in a datacenter but do not imply having to use Hadoop at all - think mobile application backends, content management systems of publishers, encyclopedia hosting. As you grow you may still run into the need for central log processing, scheduling and storing data.
  • Even if your application benefits from a Hadoop cluster you will need a zoo of other projects not necessarily related to it to successfully run your cluster - think configuration management, monitoring, alerting. Actually many of these topics are on the radar of Hadoop developers - with the intent to avoid not-invented-here syndrome and rather integrate better with existing, proven standard tools.

However if you do want to do large scale data analysis on largely unstructured data today you will most likely end up using Apache Hadoop.

When talking about operating systems in the free software world, inevitably the topic will drift towards the Linux kernel. As it is one of the most successful free software projects out there, from time to time it’s interesting and valuable to look at its history and present in terms of development process, architecture, stakeholders in the development cycle and the way conflicting interests are dealt with.

Although interesting in many dimensions this blog post focuses just on two related aspects:

  • How to balance innovation with stability in critical parts of the system.
  • How to deal with modularity and API stability from an architectural point of view taking project-external (read: non-mainline) module contributions into account.

The post is not going to deal with just “everything map/reduce” but focuses solely on software written specifically to work with Apache Hadoop. In particular, Map/Reduce layers plugged on top of existing distributed file systems that ignore data locality guarantees, as well as layers on top of existing relational database management systems that ignore easy distribution and failover, are intentionally ignored.

Balancing innovation with stability

One pain point mentioned during Steve’s talk was the perceived need for a very stable and reliable HDFS, which prevents changes and improvements from making it into Hadoop. The rationale is very simple: Many customers have entrusted lots (as in not easy to re-create in any reasonable time frame) of critical (as in the service offered degrades substantially when no longer based on that data) data to Hadoop. Even with backups in place, Hadoop going down due to a file system failure would still be catastrophic as it would take ages to get all systems back to a working state - time that means losing lots of customer interaction with the service provided.

When glancing over to Linux-land (or Windows, or MacOS really) the situation isn’t much different: Though both backup and recovery are much cheaper there, having to restore a user’s hard disk just due to some weird programming mistake still is not acceptable. So where does innovation happen there? Well, if you want durability and stability you use one of the well-proven file system implementations - everyone knows names like ext2, xfs and friends. A simple “man mount” will reveal many more. If on the contrary you need some more cutting-edge features or want to implement a whole new idea of how a file system should work, you are free to implement your own module or contribute to those marked as EXPERIMENTAL.

If Hadoop really is the OS of the datacenter then maybe it’s time to think about ways that enable users to swap in their preferred file system implementation, and maybe it’s time for developers to move the implementation of new features that could break existing deployed systems into separate modules. Maybe it’s time to announce an end-of-support date for older implementations (unless there are users who not only need support but are willing to put time and implementation effort into maintaining these old versions, that is).

Dealing with modularity and API stability

With the vision of being able to completely replace whole sub-systems comes the question of how to guarantee some sort of interoperability. The market for Hadoop and surrounding projects is already split; it’s hard for outsiders and newcomers to grasp which components work with which version of Hadoop. Is there a better way to do things?

Looking at the Linux kernel I see some parallels here: There’s components built on top of kernel system calls (tools like ls, mkdir etc. all rely on a fixed set of system calls being available). On the other hand there’s a wide variety of vendors offering kernel drivers for their hardware. Those come in three versions:

  • Some are distributed as part of the mainline kernel (e.g. those for Intel graphics cards).
  • Some are distributed separately but include all source code (e.g. ….)
  • Some are distributed as a binary blob with some generic GPLed glue logic (e.g. those provided by NVIDIA for their graphics cards).

Essentially there are two kinds of programming interfaces: ABIs (Application Binary Interfaces) that are developed against by user space applications like “ls” and friends, and APIs (Application Programming Interfaces) that are developed against by kernel modules like the one by NVIDIA.

Coming back to Hadoop I see some parallels here as well: There are ABIs that are used by user space applications like “hadoop fs -ls” or your average map/reduce application. There are also some sort of APIs that strictly only allow for communication between HDFS, Map/Reduce and applications on top.

The Java ecosystem has a history of having APIs defined and standardised through the JCP and implemented by multiple vendors afterwards. With Apache projects, people coming from a plain Java world often wonder why there is no standard that defines the APIs of valuable projects like Lucene or even Hadoop. Even log4j, commons logging and build tooling follow the “de facto standardisation” approach where development defines the API, as opposed to a standardisation committee.

Going one step back, the natural question to ask is why there is demand for standardisation at all. What are the benefits of having APIs standardised? Going through a lengthy standardisation process obviously can’t be the benefit in itself.

Advantages that come to my mind:

  • When having multiple vendors involved that do not want to or cannot communicate otherwise a standardisation committee can provide a neutral ground for communication in particular for the engineers involved.
  • For users there is some higher level document they can refer to in order to compare solutions and see how painful it might be to migrate.

Having been to a DIN/ISO SQL meetup lately, there are also a few pitfalls that I can think of:

  • You really have to make sure that your standard isn’t going to be polluted with things that never get implemented just because someone thought a particular feature could be interesting.
  • Standardisation usually takes a long time (read: multiple years) until something valuable that can then be adopted and implemented in the industry is created.

More concerns include but are not limited to the problem of testing the standard - when putting the standard into the main focus instead of the implementation there is a risk of including features in the standard that are hard or even impossible to implement. There is the risk of running into competing organisations gaming the system, making deals with each other - all leading to compromises that are everything but technologically sensible. There clearly is a barrier to entry when standardisation happens in a professional standards body. (On a related note: At least the German group working on the DIN/ISO standard defining the standard query language is interested in big data environments in particular. Let me know if you would like to get involved.)

Concerning the first advantage (having some neutral ground for vendors to meet): Looking at your average standardisation effort, those committees may be neutral ground. However, communication isn’t necessarily available to the public for whatever reasons. Compared to the situation a little over a decade ago there’s also one major shift in how development is done on successful projects: Software is no longer developed in-house only. Many successful components that enable productivity are developed in the open in a collaborative way that is open to any participant. Httpd, Linux, PHP, Lucene, Hadoop, Perl, Python, Django, Debian and others are all developed by teams spanning continents, cultures and most importantly corporations. Those projects provide a neutral ground for developers to meet and discuss their idea of what an implementation should look like.

Pondering a bit more on where the successful projects I know of came from reveals something particularly interesting: ODF was first implemented as part of Open Office and then turned into a standardised format. XMPP was first implemented and then turned into an IETF-standardised protocol. Lucene never went for any storage format or even search API standardisation but defined very rigid backwards-compatibility guidelines that users learnt to trust. Linux itself never went for ABI standardisation - instead they opted for very strict ABI backwards-compatibility guidelines that developers of user space tools could rely on.

Looking at the Linux kernel in particular the rule is that user facing ABIs are supposed to be backwards compatible: You will always be able to run yesterday’s ls against a newer kernel. One advantage for me as a user is that this way I can easily upgrade the kernel in my system without having to worry about any of the installed user space software.

The picture looks rather different with Linux’ APIs: Those are intentionally not considered holy and are subject to change if need be. As a result, vendors providing proprietary kernel drivers like NVIDIA have the burden of providing updated versions in case they want to support more than one kernel version.

I could imagine a world similar to that for Hadoop: A world in which clients run older versions of Hadoop but are still able to talk to their upgraded clusters. A world in which older MapReduce programs still run when deployed on newer clusters. The only people who would need to worry about API upgrades would be those providing plugins to Hadoop itself or replacing components of the system. According to Steve this is what YARN promises: Turn MR into user-layer code, and have the lower-level resource manager for requesting machines near the data.

Note to self - link to 3D maps

2012-09-24 08:39
After searching for the link the third time today - just in case I happen to be again looking for Nokia's 3d maps: is the non-plugin link that works in Firefox.

FrOSCon - Git Goodies

2012-09-05 20:34
In his talk on Git Goodies, Sebastian Harl introduced not only some of the lesser known git tooling but also gave a brief introduction to how git organises its database. He started with an explanation of how patches essentially are treated as blobs identified by SHA1 hashes (thus avoiding duplication not only in the local database but all over the git universe), pointed to by trees that are in turn generated and extended by commits, which are in turn referenced by branches (which update on new commits) and tags (which don't). With that concept in mind it suddenly becomes trivial to understand that HEAD simply is a reference to wherever your next commit is going to go. It also becomes natural that HEAD pointing just to a commit id but not to a branch is called a detached HEAD.
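The object model sketched above can be poked at directly with git's plumbing commands; here is a minimal sketch in a throwaway repository (file name and commit message are made up):

```shell
set -e
d=$(mktemp -d); cd "$d"
git init -q .
git config user.email you@example.com
git config user.name You
echo hello > greeting.txt
git add greeting.txt
git commit -q -m "initial commit"

# A commit points to a tree, which points to blobs:
git cat-file -t HEAD                          # -> commit
git cat-file -p HEAD                          # shows tree id, author, message
git cat-file -p "$(git rev-parse 'HEAD^{tree}')"   # lists the blob for greeting.txt

# HEAD is just a reference; pointing it at a raw commit id detaches it:
git checkout -q "$(git rev-parse HEAD)"
git status | head -1                          # reports a detached HEAD
```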

Commits in git are tracked in three spaces: in the repository (this is where stuff goes after a commit), in the index (this is where stuff goes after an add or rm) and in the working directory. Reverting works across these spaces: git checkout takes content from the repository or index and puts it into the working copy; git reset --mixed only touches the index, while reset --hard also resets the working directory.
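A minimal sketch of moving a change between those three spaces, again in a throwaway repository (file name and contents invented):

```shell
set -e
d=$(mktemp -d); cd "$d"; git init -q .
git config user.email you@example.com; git config user.name You
echo v1 > file.txt; git add file.txt; git commit -q -m "v1"

echo v2 > file.txt        # change lives only in the working directory
git add file.txt          # now also staged in the index
git reset -q --mixed      # index matches HEAD again, working copy keeps v2
cat file.txt              # -> v2
git checkout -- file.txt  # working copy restored from the index
cat file.txt              # -> v1
```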

When starting to work more with git, start reading the man and help pages. They contain lots of goodies that make daily work easier: There are options that allow for colored diffs, setting external merge tools (e.g. vimdiff), and setting the push default (just the current branch or all matching branches). There are options to define aliases for commands (diff here has a large variety of options that can be handy, like coloring only differing words instead of lines). There are options to set the git-dir (where .git lies) as well as the working directory, which makes it easy to track your website in git without having the git directory lie in your public_html folder.
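A few of those goodies as git config one-liners. These write to your global config, so the first line sandboxes them into a scratch HOME (drop it for real use; the website paths at the end are placeholders):

```shell
export HOME=$(mktemp -d)                       # sandbox; drop this line for real use
git config --global color.ui auto              # colored output for diff, status, ...
git config --global merge.tool vimdiff         # external merge tool
git config --global push.default current       # push only the current branch
git config --global alias.wdiff 'diff --color-words'   # word- instead of line-level diffs
git config --global --get alias.wdiff          # -> diff --color-words

# Tracking your website without a .git directory inside public_html:
# git --git-dir="$HOME/site.git" --work-tree="$HOME/public_html" status
```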

There is git archive to export your tree as a tar.gz. When browsing the git history, tig can come in handy - it allows for browsing your repository with an ncurses interface, showing logs, diffs and the tree of all commits. You can ask it to only show logs that match a certain pattern.
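For example, git archive exports a clean snapshot of a commit without any .git metadata (throwaway repository, names invented):

```shell
set -e
d=$(mktemp -d); cd "$d"; git init -q .
git config user.email you@example.com; git config user.name You
echo data > file.txt; git add file.txt; git commit -q -m "snapshot me"

git archive --format=tar.gz -o snapshot.tar.gz HEAD
tar -tzf snapshot.tar.gz        # -> file.txt

# tig is a separate install; it accepts git-log options to filter what is
# shown, e.g.:
# tig --grep=pattern
```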

Make sure to also look at the documentation of rev-parse, which explains how to reference commits in an even more flexible manner (e.g. master@{yesterday}). Also check out the git reflog to take a look at the version history of your versioning. It is really handy if you ever mess up your repository and need to get back to a sane state, and also a good way to recover commits that are no longer referenced by any branch. Take a look at git-bisect to learn how to binary-search for the commit that broke your build. Use a fine-granular way to add changes to your repository with git add -p - and do not forget to take a look at git stash as well as cherry-pick.
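The reflog records where your refs pointed over time, which is what makes "undo" possible. A sketch in a scratch repository (the "mess up" is simulated with a hard reset):

```shell
set -e
d=$(mktemp -d); cd "$d"; git init -q .
git config user.email you@example.com; git config user.name You
echo one > f; git add f; git commit -q -m "first"
echo two > f; git add f; git commit -q -m "second"

git reset -q --hard HEAD~1                  # "mess up": drop the second commit
git log --oneline | wc -l                   # -> 1
lost=$(git reflog --format=%H | sed -n 2p)  # position before the reset
git reset -q --hard "$lost"                 # back to a sane state
git log --oneline | wc -l                   # -> 2

# Also mentioned in the talk:
# git rev-parse 'master@{yesterday}'   # flexible ways to name commits
# git bisect start <bad> <good>        # binary search for a breaking commit
# git add -p                           # stage individual hunks interactively
```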

FrOSCon 2012 - REST

2012-08-29 19:33
Together with Thilo I went to FrOSCon last weekend. Despite a few minor glitches and the "traditional" long BBQ line the conference was very well organised and again brought together a very diverse crowd of people including but not limited to Debian developers, OpenOffice people, FSFE representatives, KDE and Gnome developers, people with background in Lisp, Clojure, PHP, Java, C and HTML5.

The first talk we went to was given by JThijssen on REST in practice. After briefly introducing REST and going a bit into myths and false beliefs about REST, he explained how REST principles can be applied in your average software development project.

To set a common understanding of the topic he first introduced the four-step REST Maturity Model: Step zero means using plain old XML over HTTP for RPC, or SOAP. Nothing particularly fancy here - to some extent this even breaks common standards related to HTTP. Going one level up means modeling your entities as resources. Level two is as simple as using the HTTP verbs for what they are intended - don't delete anything on the other side just by using a GET request. Level three finally means using hypermedia controls, HATEOAS and providing navigational means to decide on what to do next.

Myths and legends

REST is always HTTP - well, REST is transport agnostic. However, it mostly uses HTTP for transport.

REST equals CRUD - though not designed for that, it is often used for that task in practice.

REST scales - as a protocol, yes; however, that of course does not mean that the backend you are talking to does. All REST does for you is give you a means to scale horizontally without having to worry too much about server state.

Common mistakes

Using HTTP verbs - if you've ever dealt with web crawling you probably know those stories of some server's content being deleted just by crawling a public-facing web site, because there was a "delete" button somewhere that would trigger a delete action through an innocent-looking GET request. The lesson learnt from those: Use the verbs for what they are intended to be used. One commonly confused pair is PUT vs. POST. A common rule of thumb that also applies to the CouchDB REST API: Use PUT if you know what the resulting URL should be (e.g. when storing an entry to the database and you know the key that you want to use). Use POST if you do not care about which URL should result from the operation (e.g. if the database should automatically generate a unique key for you). Also make sure to use the error codes as intended - never return a 2xx status code only to add an xml snippet to the payload that explains to the surprised user that an error occurred, including an error code. If you really need an explanation of why this is considered bad practice if not plain evil, think about caching policies and related issues.
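The PUT vs. POST rule of thumb can be sketched with curl against CouchDB's document API (this assumes a CouchDB instance listening on localhost:5984; the database name `mydb`, the key and the document content are made up):

```shell
# Client chooses the key -> PUT to the URL the resource should live at:
curl -X PUT http://localhost:5984/mydb/order-4711 \
     -H 'Content-Type: application/json' \
     -d '{"item": "widget"}'

# Client does not care about the key -> POST to the collection; CouchDB
# generates a unique id and reports it in the response body:
curl -X POST http://localhost:5984/mydb \
     -H 'Content-Type: application/json' \
     -d '{"item": "widget"}'
```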

When dealing with resources a common mistake is to stuff as much information as possible into one single resource for one particular use case. This means transferring a lot of additional information that may not be needed for other use cases. A better approach could be to allow clients to request custom views and joins of the data instead of pre-generating them.

When it comes to logging in to your API - don't design around HTTP, use it. Sure, you can hand a session id in a cookie to the user. However, then you are left with the problem of handling client state on the server - which was supposed to be stateless so clients can talk to any server. You could store the logged-in information in the client cookie - signing and encrypting it might even make that slightly less weird. However, the cleaner approach is to authenticate individual requests and avoid state altogether.
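A sketch of what authenticating individual requests looks like on the wire (host, path, credentials and token are placeholders):

```shell
# HTTP Basic auth: credentials accompany every single request, the server
# keeps no session state, so any server in the pool can answer:
curl -u alice:secret https://api.example.com/v1/orders

# Alternatively a signed token the server can verify without a session store:
curl -H 'Authorization: Bearer <token>' https://api.example.com/v1/orders
```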

When it comes to URL design keep in mind to keep them in a format that is easy to handle for caches. An easy check would be to try and bookmark the page you are looking at. Also think about ways to increase the number of cache hits if results are even slightly expensive to generate. Think about an interface to retrieve the distance from Amsterdam to Brussels. The URL could be /distance/to/from - however given no major road issues the distance from Amsterdam to Brussels should be the same as from Brussels to Amsterdam. One easy way to deal with that would be to allow for both requests but to send a redirect to the first version in case a user requests the second. The semantics would be slightly different when asking for driving directions - there the returned answers would indeed differ.
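On the wire, the distance example could look like this (hypothetical service; the point is that both spellings end up hitting one cacheable, bookmarkable URL):

```shell
# Canonical form answers directly:
curl -i https://maps.example.com/distance/amsterdam/brussels

# Reversed form redirects to the canonical URL instead of duplicating it,
# e.g. 301 Moved Permanently, Location: /distance/amsterdam/brussels
curl -i https://maps.example.com/distance/brussels/amsterdam

# -L follows the redirect transparently:
curl -iL https://maps.example.com/distance/brussels/amsterdam
```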

The speaker also introduced a concept for handling asynchronous updates that I found interesting: When creating a resource hand out a 202 accepted response including a queue ticket that can be used to query for progress. For as long as the ticket is not yet being actively dealt with it may even contain cancellation methods. As soon as the resource is created requesting the ticket URL will return a redirect to the newly created resource.
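Sketched as an HTTP exchange (endpoints, payload and ticket ids are invented for illustration):

```shell
# Creation request is accepted but not performed immediately:
curl -i -X POST https://api.example.com/v1/reports -d '{"type": "yearly"}'
# hypothetical response: 202 Accepted, Location: /v1/queue/ticket-42

# Polling the ticket while the job is queued; at this stage the ticket
# might still offer a DELETE to cancel:
curl -i https://api.example.com/v1/queue/ticket-42

# Once the resource exists, requesting the ticket redirects to it,
# e.g. 303 See Other, Location: /v1/reports/2012
```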

The gist of the talk for me was to not break the Rest constraints unless you really have to - stay realistic and pragmatic about the whole topic. After all, most likely you are not going to build the next Twitter API ;)

Spotted this morning...

2012-08-16 22:03
in front of my office:

Ever wondered how the accurate navigable map data for your Garmin or your in-car navigation system (most likely) is created? One piece of the puzzle is the car above, collecting data for Navteq, a subsidiary of Nokia.

On Reading Code

2012-08-02 15:14

“If you don’t have time to read, you don’t have the time or the tools to write.” –Stephen King

Quite a while ago GeeCon published the videotaped talk of Kevlin Henney on "Cool Code". This keynote is great to watch for everyone who loves to read code - not the kind you encounter in real-world enterprise systems, but the kind that truly teaches you lessons:

GeeCON 2012: Kevlin Henney - Cool Code from GeeCON Conference on Vimeo.

Need your input: Failing big data projects - experiences from the wild

2012-07-18 20:11
A few weeks ago my talk on "How to fail your big data project quick and rapidly" was accepted at the O'Reilly Strata conference in London. The basic intention of this talk is to share some anti-patterns, embarrassing failure modes and "please don't do this at home" kind of advice with those entering the buzzwordy space of big data.

Inspired by Thomas Sundberg's presentation on "failing software projects", the talk will be split into five chapters and highlight the top two failure factors for each.

I only have so much knowledge of what can go wrong when dealing with big data. In addition no one likes talking about what did not work in their environment. So I'd like to invite you to share your war stories in a public etherpad - either anonymously or including your name so I can give credit. Some ideas are already sketched up - feel free to extend, adjust, re-rank or change.

Looking forward to your stories.

Note to self: Clojure with Vim and Maven

2012-07-17 20:07
Steps to get a somewhat working Clojure environment with vim:

Note: There is more convenient tooling for emacs (see also getting started with clojure and emacs) - it's just that my fingers are more used to interacting with vim...

2nd note: This post is not an introduction or walkthrough on how to get Clojure set up in vim - it's not even particularly complete. This is intentional - if you want to start tinkering with Clojure: Use Emacs! This is just my way to re-discover through Google what I did the other day but forgot in the meantime.