Note to self: Backup bottlenecks

2014-03-23 18:26
I learnt the following relations the hard way 10 years ago when trying to back up a rather tiny amount of data, and went through the computation again three years ago. Still, I had to redo the computation this morning when trying to pull a final full backup from my old MacBook. Posting here for future reference. Note 1: Some numbers, like 10BASE-T, are included only for historic reference. Note 2: I excluded the Alice DSL uplink speed - if included, the left-hand chart would no longer be particularly helpful or readable...
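
The charts themselves did not survive here, so as a rough stand-in, here is a minimal sketch of the kind of computation involved: transfer time is data size divided by link speed. The 100 GB data size is a made-up example, the function name transfer_hours is hypothetical, and the rates are nominal link speeds (10BASE-T at 10 Mbit/s, Fast Ethernet at 100 Mbit/s, Gigabit Ethernet at 1000 Mbit/s) with protocol overhead and disk speed ignored, so real numbers will be worse.

```shell
# hours needed to move SIZE gigabytes over a link of RATE megabits per second,
# assuming the link is the only bottleneck (no protocol or disk overhead)
transfer_hours() {
  # 1 GB = 8000 Mbit (decimal units); 3600 seconds per hour
  awk -v gb="$1" -v mbit="$2" 'BEGIN { printf "%.1f\n", gb * 8000 / mbit / 3600 }'
}

# hypothetical 100 GB backup over three nominal link speeds
transfer_hours 100 10      # 10BASE-T
transfer_hours 100 100     # Fast Ethernet
transfer_hours 100 1000    # Gigabit Ethernet
```

For a real backup, the slowest of disk, interface and network determines the result, so the smallest rate in the chain is the one to plug in.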

Note to self: Basic R operations

2012-10-18 22:55
After searching for that all too often and for too long (in particular the "add a column as index" bit):

  • To read a file: d
  • Useful for getting an overview of the data: summary(d); head(d); tail(d)
  • For sorting some data frame: s
  • For adding a column to a data frame: s$idx
  • For plotting a column: ggplot(s, aes(idx, engagement)) + geom_point() + scale_x_log10()

Note to self - link to 3D maps

2012-09-24 08:39
After searching for the link for the third time today - just in case I happen to be looking for Nokia's 3D maps again: is the non-plugin link that works in Firefox.

Note to self: Clojure with Vim and Maven

2012-07-17 20:07
Steps to get a somewhat working Clojure environment with vim:

Note: There is more convenient tooling for emacs (see also getting started with clojure and emacs) - it's just that my fingers are more used to interacting with vim...

2nd note: This post is not an introduction or walk-through on how to get Clojure set up in vim - it's not even particularly complete. This is intentional - if you want to start tinkering with Clojure: Use Emacs! This is just my way to rediscover through Google what I did the other day but forgot in the meantime.

Second steps with git

2012-04-22 20:34
Leaving this here in case I'll search for it later again - and I'm pretty sure I will.

The following is a simplification of the git workflow detailed earlier - in particular the first two steps and a little background.

Instead of starting by cloning the upstream repository on github and then going from there as follows:

#clone the github repository
git clone

#add upstream to the local clone
git remote add upstream git://

you can also take a slightly different approach and start with an empty github repository to push your changes into instead:

#clone the upstream repository
git clone git://

#add your personal - still empty - repo as a second remote to the local clone
git remote add personal

#push your local modifications branch mods to your personal repo
git push personal mods

That should leave you with branch mods being visible in your personal repo now.
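
To try the sequence above without touching github, here is a minimal sketch in which two local bare repositories stand in for the upstream and the personal repo. All paths, the file README, the branch name trunk and the demo committer identity are assumptions made up for this illustration; the real URLs are not given in this post.

```shell
#!/bin/sh
# Sketch only: local bare repos replace the real upstream and personal repos.
set -e
work=$(mktemp -d)
cd "$work"

# stand-in for the upstream project, with one commit on branch trunk
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/trunk
echo "hello" > README
git add README
git -c user.name=demo -c user.email=demo@example.org commit -q -m "initial commit"
cd "$work"
git clone -q --bare seed upstream.git

# stand-in for the personal - still empty - repository
git init -q --bare personal.git

# clone the upstream repository
git clone -q upstream.git local
cd local

# add the personal repo as a second remote
git remote add personal "$work/personal.git"

# create a local modification on branch mods
git checkout -q -b mods
echo "change" >> README
git -c user.name=demo -c user.email=demo@example.org commit -q -am "local modification"

# push branch mods to the personal repo
git push -q personal mods

# list what the personal repo knows about now
git --git-dir="$work/personal.git" branch --list
```

In this setup origin keeps pointing at upstream, so a plain git fetch still follows the upstream project while pushes go to the remote named personal.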

Note to self - Java heap analysis

2012-02-09 21:30
As I keep searching for those URLs over and over again, I am linking them here. When running into JVM heap issues (an OutOfMemoryError is a pretty sure sign; the program getting slower and slower over time can be another) there are a few things you can do for analysis:

Start with telling the affected JVM process to output some statistics on heap layout as well as thread state by sending it a SIGQUIT (if you want to use the number instead - it's 3 - avoid typing 9 instead ;) ).
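
As a small sketch of that step: the helper name thread_dump is made up for this note, and the PID has to be looked up first (for example with jps or ps). The JVM keeps running and writes the dump to its own standard output, so that is where to look, not in the terminal sending the signal.

```shell
# send SIGQUIT to a running JVM, given its process id
thread_dump() {
  kill -QUIT "$1"     # same effect as: kill -3 "$1"
}

# double-check the number-to-name mapping before typing digits
kill -l 3

# usage, with the PID of the affected JVM:
#   thread_dump <pid>
```

Using the symbolic name QUIT instead of the number sidesteps the 3-versus-9 typo entirely.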

More detailed insight is available via jConsole - remote setup can be a bit tricky but is well doable and worth the effort, as it gives much more detail on what is running and what memory consumption really looks like.

For a detailed analysis take a heap dump with either jmap, jConsole or by starting the process with the JVM option -XX:+HeapDumpOnOutOfMemoryError. Look at it either with jhat or the IBM heap analyzer. NetBeans also offers nice support for searching for memory leaks.
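
A minimal sketch of the command-line variants, assuming a JDK of that era with jmap and jhat on the PATH (newer JDKs no longer ship jhat). The helper names, the file name heap.hprof and app.jar are placeholders made up for this note.

```shell
# write a binary heap dump of a running JVM: heap_dump <pid> <file>
heap_dump() {
  jmap -dump:format=b,file="$2" "$1"
}

# browse a dump with jhat, which serves its analysis over HTTP: browse_dump <file>
browse_dump() {
  jhat "$1"
}

# usage:
#   heap_dump <pid> heap.hprof
#   browse_dump heap.hprof
#
# alternatively, let the JVM write the dump itself when it runs out of memory:
#   java -XX:+HeapDumpOnOutOfMemoryError -jar app.jar
```

Taking a dump pauses the application while the heap is written, so on a large production heap it is worth picking the moment deliberately.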

On a more general note on diagnosing java stuff see Rainer Jung's presentation on troubleshooting Java applications as well as Attila Szegedi's presentation on JVM tuning.

Note to self: svn:ignore usage

2011-02-25 20:47
Putting the information here to make retrieving it a bit easier next time.

When working with svn and some random IDE I'd really love to avoid checking in any files that are IDE specific (project configuration, classpath, etc.). The command to do that:

svn propedit svn:ignore $directory_to_edit

After issuing this command you'll be prompted to enter the patterns for the files or directory names to ignore.
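
If the interactive prompt is not wanted, svn propset can take the patterns from a file instead. A minimal sketch, assuming typical Eclipse and Idea file names as patterns (they are examples, not a list from the documentation) and falling back to the current directory when $directory_to_edit is not set:

```shell
set -e
dir=${directory_to_edit:-.}

# one pattern per line; these IDE file names are illustrative assumptions
patterns=$(mktemp)
cat > "$patterns" <<'EOF'
.classpath
.project
.settings
.idea
*.iml
EOF

# only touch the property when svn exists and dir is a working copy
if command -v svn >/dev/null 2>&1 && svn info "$dir" >/dev/null 2>&1; then
  svn propset svn:ignore -F "$patterns" "$dir"
  svn propget svn:ignore "$dir"
else
  echo "not an svn working copy, patterns left in $patterns"
fi
```

Note that propset replaces the existing value of svn:ignore on that directory, while propedit lets you add to it, and that the property change itself still has to be committed.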

More detailed information in the official documentation on svn:ignore.

First steps with git

2010-10-30 19:47
A few weeks ago I started to use git not only for tracking changes in my own private repository but also for Mahout development and for reviewing patches. My setup probably is a bit unusual, so I thought I'd first describe that before diving deeper into the specific steps.

Workflow to implement

With my development I wanted to follow Mahout trunk very closely, integrating and merging any changes as soon as I continue to work on the code. I wanted to be able to work with two different machines on the client side that are located at two distinct physical locations. I was fine with publishing any changes or intermediate progress online.

The tools used

I set up a clone of the official Mahout git repository on github as a place to check changes into and as a place to publish my own changes.

On each machine used, I cloned this github repository. After that I added the official Mahout git repository as upstream repository to be able to fetch and merge in any upstream changes.

Command set

After cloning the official Mahout repository into my own github account, the following set of commands was used on a single client machine to clone and set up the repository. See also the Github help on forking git repositories.

#clone the github repository
git clone

#add upstream to the local clone
git remote add upstream git://

One additional piece of configuration that helped make life easier was to set up a list of files and file patterns to be ignored by git.
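
A minimal sketch of such a list, assuming the usual .gitignore file in the repository root is used for it. The patterns are illustrative guesses at IDE and build output names, not the list actually used for Mahout, and the scratch repository only exists to make the example self-contained.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# IDE and build artifacts that should never show up in git status
cat > .gitignore <<'EOF'
.idea/
*.iml
.classpath
.project
.settings/
target/
EOF

touch demo.iml Foo.java

# check-ignore exits with 0 for ignored paths and 1 for everything else
git check-ignore -q demo.iml && echo "demo.iml is ignored"
git check-ignore -q Foo.java || echo "Foo.java is not ignored"
```

Patterns that should stay private to one machine can go into .git/info/exclude instead, which uses the same syntax but is not shared through the repository.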

Each distinct changeset (be it code review, code style changes or steps towards own changes) would then be done in its own local branch. To share them with other developers as well as make them accessible to my second machine I would use the following commands on the machine used for initial development:

#create the branch
git branch MAHOUT-666

#publish the branch on github
git push origin MAHOUT-666

To get all changes both from my first machine and from upstream into the second machine all that was needed was:

#select correct local branch
git checkout trunk

#get and merge changes from upstream
git fetch upstream
git merge upstream/trunk

#get changes from github
git fetch origin
git merge origin/trunk

#get branch from above
git checkout -b MAHOUT-666 origin/MAHOUT-666

Of course pushing changes into an Apache repository is not possible. So I would still end up creating a patch, submitting that to JIRA for review and in the end applying and committing that via svn. As soon as these changes finally made it into the official trunk all branches created earlier were rendered obsolete.

What still makes me stick with git especially for reviewing patches and working on multiple changesets is its capability to quickly and completely locally create branches. This feature totally changed my so-far established workflow for keeping changesets separate:

With svn I would create a separate checkout of the original repository from a remote server, make my changes or even just apply a patch for review. To speed things up or be able to work offline I would keep one svn checkout clean, copy that to a different location and only there apply the patch.

In combination with using an IDE this workflow would result in me having to re-import each different checkout as a separate project. Even though both Idea and Eclipse are reasonably fast with importing and setting up projects it would still cost some time.

With git all I do is one clone. After that I can locally create branches without contacting the server again. I usually keep trunk clean from any local changes - patches are applied to separate branches for review. The same happens to any code modifications. That way all work can happen when disconnected from the version control server.

When combined with IntelliJ Idea fun becomes even greater: The IDE regularly scans the filesystem for updated files. So after each git checkout I'll find the IDE automatically adjust to the changed source code - that way avoiding project re-creation. Same is of course possible with Eclipse - it just involves one additional click on the Refresh button.

For me git helped speed up my work processes and supported use cases that otherwise would have involved sending patches to and fro between separate mailboxes. That way work with patches and changesets seemed way more natural and better supported by the version control system itself. In addition it of course is a great relief to be able to check in, diff, log, check out etc. even when disconnected from the network - which for me still is one of the biggest advantages of any distributed version control system.

Lance Norskog recently pointed out one more step that is helpful:
You didn't mention how to purge your project branch out of the github fork. From Deleting a remote branch or tag

This command is a bit arcane at first glance… git push REMOTENAME :BRANCHNAME. If you look at the advanced push syntax above it should make a bit more sense. You are literally telling git “push nothing into BRANCHNAME on REMOTENAME”. And, you also have to delete the branch locally also.
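
To see the deletion in action without a github account, here is a minimal sketch in which a local bare repository stands in for the github fork. The paths, the file README and the demo committer identity are assumptions for illustration; the branch name follows the example used earlier.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
cd "$work"

# stand-in for the github fork
git init -q --bare fork.git
git clone -q fork.git local 2>/dev/null
cd local
git symbolic-ref HEAD refs/heads/trunk
echo "hello" > README
git add README
git -c user.name=demo -c user.email=demo@example.org commit -q -m "initial commit"
git push -q origin trunk

# create and publish the project branch
git branch MAHOUT-666
git push -q origin MAHOUT-666

# push "nothing" into the remote branch, which deletes it there
git push -q origin :MAHOUT-666

# the local branch is separate and has to be removed as well
git branch -q -d MAHOUT-666
```

Newer git versions also accept git push origin --delete MAHOUT-666, which does the same thing with a more readable spelling.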

Building a Hadoop Job Jar with Maven

2010-03-11 19:16
Put here as a reminder, so I do not forget about it. There is a really nice tutorial online on Building Hadoop Job with Maven.