Archive for the ‘Hacking’ Category

Part 3: A polite way to say no - and why there are times when it doesn’t work.

September 7th, 2010

After having shared my thoughts on how to improve focus and how to track tasks eating up time, this post will explain how to keep time invested at a more or less constant level. The goal of this exercise is to keep obligations at a reasonable level - be it at work or during one’s spare time.

Recently I have collected a small set of techniques to reduce what gets to my desk - I don’t claim this list to be exhaustive. However, some of it did help me organise conferences and still have a life besides that.

Sharing and delegating tasks

Sharing and delegating are actually two different ways of integrating other people: Sharing for me means working together on a topic. That could be as easy as setting up a CMS or it could be more involved, as in publishing articles on Lucene in some magazine. The advantage is that both of you can contribute to the task, possibly even learn from each other: When I was doing the article series on Lucene together with Uwe, it was also a great learning experience for me to have someone take the time to explain to me - well, not only to me - what flexible indexing, local search and numeric range queries are really all about, as in technically implemented. So it was not only an enormous time-saver for me, as the alternative would have been me reading through documentation, code and mailing lists to get up to date. It also gave me the unique opportunity to learn from the very developers of these features about how they work and how they are meant to be used.

The disadvantage of sharing is that part of the work still remains on your desk. That’s where delegation helps: Take the task, find someone who is capable and willing to solve it and give it to them. There are two problems here: First you have to trust people to actually work on the task. Second you probably cannot avoid checking back from time to time to see if there is progress, if there are any impediments etc. So it means less work than with sharing. But there is more risk in not getting your results and more work to be done for co-ordination. However it is a very powerful technique if applied correctly to scale what can be achieved: Telling people what you need help with and letting them take over some of that work does scale way better than micro-managing people or even trying to be part of every piece of a project. It means giving up some of your control, in return you can turn to other potentially more involved tasks. Note to self: Need to build up more trust in that area.

Both concepts, however, are not actually about saying no but about being able to say yes even if you already have very little time left.

Prioritisation

Prioritising tasks can be done on a scale from zero to any arbitrarily large number. Obviously it helps with deciding whom to say no to: It’s going to be those projects rated very low, that is, those you could easily do without. That’s the simplest case as it is easiest to explain. The strategy I usually use is to be honest with people: If there are conflicting conferences, it’s easy to reject invitations. If some publication does not pay off for you, it’s easiest to be open and honest with people and tell them. Usually they will understand.

A second reason for a rating of zero is that the task is one of those “Does not belong on my desk” tasks. My advice for those would be to get rid of them as quickly as possible: They draw away your energy without giving back any value. This issue plays nicely with the “patches welcome” theme from open source: People working on open source projects are most successful if they are driven by their own needs. So if you want something implemented, either implement and submit it yourself - or find someone you can pay to do so. People will not work for you. You can jump up and down, complain on the mailing lists - but if the feature you would like to see is something that no-one else in the existing community needs, it won’t get done until someone needs it.

Introduce barriers

A nice way of rejecting favours that works at least sometimes is to raise the barrier. The example here would be getting an invitation to give an introductory talk for a closed audience. So what I tried was to raise the bar by asking for funding for travel and accommodation.

Keep in mind though that there is the risk that the one inviting you actually accepts your conditions - no matter how high you think you have set them. Especially the example given above has the problem of being too low a bar in most cases. So be prepared to have to keep your promise. As a result the conditions you set really should lead to the task turning into something that is fun to do.

Cut off early

Imagine you have committed to some task. Later on you realise you won’t actually make it: You have no time, priorities have changed, the task is too involved or any other reason you could potentially imagine.

The most important way to reduce the load on your desk is to communicate this issue as early as possible. It’s clear that people will be more disappointed the later they learn that something they probably depend on won’t arrive in time or will never happen: They’ll never be extremely happy, however the sooner they learn, the more time they have on their part to react. And actually, most people are not that disappointed at all, simply because they may have counted some risk into the equation when giving you the task - which is not to say you should lower the reliability of your commitments, simply because no-one is expecting you to meet your goals anyway. However, usually the amount of trouble expected is way higher than what actually happens. Second note to self: Don’t forget about this option.

Patches welcome

At least in open source: If it’s nothing that helps make your world better - there are other people out there to help out. Patches being welcome may seem obvious. However, in some areas it really is not: If someone asks a project member to be present at some conference, he may not consider himself capable of representing the project or even just making an impact by talking to people about it. That is the point at which to encourage people that any input is welcome - not only code, but also documentation, communication and marketing work.

Of course, as with any pattern, there are boundaries when not to apply it or when applying it would mean too much effort or loss. If that is the case and you have committed and cannot step back, then you should think about what could be a great reward if you went through with the tasks: What would it take to make you happily comply and still gain energy through what you are doing? Basically it isn’t about doing what you like but about loving what you do (L. Tolstoi).

There is also valuable advice on managing one’s energy from the Apache Software Foundation that is specifically targeted at new committers. If you have not done so yet, take the time to read it.

Freetime, Hacking

Part 2: Tracking tasks, or - Where the hack did my time go to last week?

September 3rd, 2010

After summarising some strategies for not losing track of tasks, meetings and conferences in the last post, this one is going to focus on the retrospect on achievements. If at some point in time you have asked yourself “Where the hack did time go to?” - maybe after two busy weeks of work - this article might have a few techniques for you.

Usually when that happens to me it’s either a sign that I’ve been on vacation (where that is totally fine) or that too many different, sometimes small but numerous tasks have sneaked into my schedule.

Together with Thilo I have found a few techniques helpful in dealing with these kinds of problems. The goals in applying them (at least for me) have been:

  • Configure the planned work load to a manageable amount.
  • Make transparent and trackable (to oneself and others) which and how many tasks have been finished.
  • Track over time any changes in number of tasks accomplished per time slot.

After hearing about Scrum and its way of planning tasks I thought about using it not only for software development but for task planning in general. Scrum comprises some techniques that help achieve the goals described above:

  1. In Scrum, development is split into sprints: Iterations of focussed software development that are confined to a fixed length. Each sprint is filled up with tasks. The number of tasks put into one sprint is defined by the so-called velocity of the team.
  2. Tasks are ordered by priority by the product owner. Priority here is influenced by factors like risk (riskier tasks should be attacked earlier than safe ones), ROI (those tasks that promise to increase ROI most should of course be done and launched first) and a few more. After prioritisation, tasks are estimated in order - that way those tasks most important to the product owner are guaranteed to have an estimated complexity defined even if there was not enough time to estimate all items.
  3. Complexity here does not mean “amount of time to implement a feature” - it’s more like how much time do we need, how much communication overhead is involved, how complex is the feature. A workable way to come up with reasonably sensible numbers is to choose one base item, assign a complexity of one to it and estimate all coming items in terms of “is as complex as the base item”, “has double the complexity” - and so on - according to the Fibonacci sequence. Fibonacci is well suited for that task as its values do not increase linearly - similarly, humans are better at estimating small things (be it distances or complexities). As soon as items get too big, estimation also tends to be way off the real number.
  4. To come up with a reasonable estimate of what can be done in any week, I usually just look back to past weeks and use that as an estimate. That technique is close enough to the real number to be a working approach.
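The estimation steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the cut-off of the point scale at 21 and the three-week averaging window are made up for the example and are not part of Scrum itself.

```python
from statistics import mean

# Allowed complexity values: the start of the Fibonacci sequence.
# Where to cut the scale off (here: 21) is an arbitrary choice.
FIBONACCI_POINTS = [1, 2, 3, 5, 8, 13, 21]


def snap_to_fibonacci(relative_size: float) -> int:
    """Map a rough "x times the base item" guess to the nearest allowed value.

    Ties are resolved towards the smaller value.
    """
    return min(FIBONACCI_POINTS, key=lambda p: (abs(p - relative_size), p))


def estimate_velocity(points_per_week: list[int], window: int = 3) -> int:
    """Estimate next week's capacity as the average of the last few weeks."""
    recent = points_per_week[-window:]
    if not recent:
        return 0
    return round(mean(recent))
```

With a base item of complexity one, something that feels "about six times as complex" would end up as a 5 on this scale, and a history of 8, 13, 10 and 12 points would suggest planning roughly 12 points for the coming week.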

To track what got done during the past week, we use a whiteboard as Scrum Board. Tasks are put into the known categories of todo, checked out and done. That way when resetting the board after one week and adding tasks for the following week it is pretty obvious which actions ate up most of the time. The amount of work that goes onto the board is restricted to not be larger than what got accomplished during the past week.
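Filling the board for the coming week could look roughly like the sketch below. It assumes a backlog that is already sorted by priority and estimated in points; the function name `plan_week` and the choice to skip an item that does not fit and try the next one are my own assumptions, not a fixed rule.

```python
def plan_week(backlog: list[tuple[str, int]], capacity: int) -> tuple[list[str], int]:
    """Pick tasks in priority order until last week's velocity is used up.

    backlog:  (task name, estimated points) pairs, most important first.
    capacity: points accomplished during the past week.
    Returns the planned task names and the points they add up to.
    """
    planned: list[str] = []
    used = 0
    for name, points in backlog:
        # Skip anything that would push the board over capacity.
        if used + points <= capacity:
            planned.append(name)
            used += points
    return planned, used
```

Everything that is left over simply stays in the backlog for one of the following weeks.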

So what goes onto the whiteboard? Basically anything that we cannot track as working hours: The Hadoop Get Together can be found just next to doing the laundry. Writing and sending out the long deferred e-mail is put right next to going out for dinner with potential sponsors for free software courses at university.

Now that weekly time tracking is set up - is there a way to also come up with a nice longer term measure? Turns out, there are actually three:

First and most obviously the whiteboard itself provides an easy measure: By tracking weekly velocity and plotting that against time it is easy to identify weeks with more or less freetime. As a second source of information a quick look into one’s calendar quickly shows how many meetings and conferences one attended over the course of a year. Last but not least it helps to track talks given on a separate webpage.

It helps to look back from time to time: To evaluate the benefit of the respective activities, to not lose track of the tasks accomplished, to prioritise and maybe re-order stuff on the ToDo list. Would be great if you’d share some of your techniques of tracking and tracing time and tasks - either in the comments or as a separate blog post.

Freetime, Hacking, Scrum

Some statistics

August 11th, 2010

Various research projects focus on learning more on how open source communities work:

  • What makes people commit themselves to such projects?
  • How much involvement from various companies is there?
  • Do people contribute during working hours or in their spare time?
  • Who are the most active contributors in terms of individuals and in terms of companies?

When asked to fill out surveys, especially in cases where that happens for the n-th time with n being larger than say 5, software developers usually are not very likely to fill out these questionnaires. However, knowing some of the processes of open source software development, it soon becomes obvious there are way more extensive sources for information - albeit not trivial to evaluate and prone to at least some error.

Free software tends to be developed “in the open”: Project members with various backgrounds get together to collaborate on a common subject, exchanging ideas, designs and thoughts digitally. Nearly every project with more than two members at least has mailing list archives and some sort of commit log to some version control system. Usually people also have bugtrackers that one can use as a source for information.

If we take the ASF as an example, there is a nice application to create various statistics from svn logs:

The caveats of this analysis are pretty obvious: Commit times are set according to the local time of the server, however that may be far off compared to the actual timezone the developer lives in. Even when knowing each developer’s timezone there is still some inaccuracy in the estimations as people might cross timezone boundaries when going off for vacation. Still the data available from that source should already provide some input as to when people are contributing, how many projects they work on, how much effort in general goes into each project etc.
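To make the timezone caveat a bit more concrete, here is a minimal Python sketch that shifts commit timestamps into each developer's own timezone before counting commits per hour of day. It is not the application mentioned above; the input format (author plus an ISO timestamp already normalised to UTC), the function name and the author-to-timezone mapping are assumptions for the example. Fixed UTC offsets are used to keep it self-contained, so daylight saving time is ignored; passing `zoneinfo.ZoneInfo` objects instead should handle that.

```python
from collections import Counter
from datetime import datetime, timezone, tzinfo


def local_commit_hours(
    commits: list[tuple[str, str]],
    author_zones: dict[str, tzinfo],
    default_zone: tzinfo = timezone.utc,
) -> Counter:
    """Count commits per (author, local hour of day).

    commits:      (author, ISO timestamp in UTC without offset) pairs.
    author_zones: assumed home timezone of each author; travelling
                  developers will still be attributed to the wrong hour.
    """
    hours: Counter = Counter()
    for author, stamp in commits:
        utc_time = datetime.fromisoformat(stamp).replace(tzinfo=timezone.utc)
        local_time = utc_time.astimezone(author_zones.get(author, default_zone))
        hours[(author, local_time.hour)] += 1
    return hours
```

A commit logged at 22:30 UTC then counts as a late-night commit for someone living at UTC+2, but as an afternoon commit for someone at UTC-7, which is exactly the difference the raw server times hide.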

Turning the analysis the other way around and looking at mailing list contributions, one might ask whether a company indeed is involved in free software development. One trivial, naive first shot could be to simply look for mailing list postings that originate from some corporate mail address. Again the raw numbers displayed below have to be normalised. This time company size and fraction of developers vs. non-developers in a company have to be taken into consideration when comparing graphs and numbers to each other.

Yet another caveat is mailing lists that are not archived in the mail archiving service that one may have chosen as the basis for comparison. In addition people may contribute from their employer’s machines but not use the corporate mail address (I personally am one of these outliers, using the apache.org address for anything at some ASF project).
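The naive first shot described above, counting postings per sender domain and then normalising by the number of developers, might be sketched like this in Python. The function names are made up for the example, and the developer head counts are something you would have to supply yourself; the sketch also inherits all the caveats mentioned, for instance people posting from a non-corporate address.

```python
from collections import Counter
from email.utils import parseaddr


def postings_per_domain(from_headers: list[str]) -> Counter:
    """Count postings by the domain part of each From header."""
    counts: Counter = Counter()
    for header in from_headers:
        _, address = parseaddr(header)
        if "@" in address:
            counts[address.rsplit("@", 1)[1].lower()] += 1
    return counts


def postings_per_developer(
    counts: Counter, developers_per_domain: dict[str, int]
) -> dict[str, float]:
    """Normalise raw counts by an (assumed) number of developers per company."""
    return {
        domain: counts[domain] / developers
        for domain, developers in developers_per_domain.items()
        if developers > 0
    }
```

Only the normalised numbers are really comparable between a small consultancy and a large corporation.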

101tec
JTeam
Tarent
Kippdata
Lucid Imagination
Day
HP
IBM
Yahoo!
Nokia
Oracle
Sun

Easily visible even from that trivial 5min analysis, however, is the general trend of involvement in free software projects. In addition, the projects that employees are working with and contributing to most actively are displayed prominently - it comes as no surprise that for Yahoo! that is Hadoop. In addition, if graphs go back in time far enough, one might even see the timeframe of when a company changed its open source strategy (or was founded (see the graph of Lucid), or got acquired (see Sun’s graph), or acquired a company with a different strategy (see Oracle’s graph) ;) ).

A general piece of advice might be to first use the data that is already out there as a starting point - in contrast to other closed communities free software developers tend to generate a lot of it. And usually it is freely available online. However, when doing so, know your data well and be cautious about drawing premature conclusions: The effect you are seeing may well be caused by some external factor.

Free Software, Hacking

Part 1: Travelling minds

August 3rd, 2010

In the last post I promised to share some more information on techniques I came across and found useful under an increasing work load. Instead of taking a close look at my professional calendar I decided to use my private one as an example - first because spare time is even more precious than working hours, simply because there is so little of it, and secondly because I am free to publicly scrutinize not only the methods for keeping it in good shape but also the entries in it.

I am planning to split the article into five pieces as follows, as keeping all information in one article would lead to a text longer than I could possibly expect to be read from beginning to end:

  1. Part 1: Travelling minds - how to stay focussed in an always-on community.
  2. Part 2: Tracking tasks, or: Where the hack did my time go to last week?
  3. Part 3: A polite way to say no - and why there are times when it doesn’t work.
  4. Part 4: Constant evaluation and improvement: Finding sources for feedback.
  5. Part 5: A final word on vacation.

Several years ago, I had no problem with tasks like going out reading a book for hours, working on code for hours, answering mails only from time to time, thinking about one particular problem for days. As the number of projects and tasks grew, these tasks became increasingly hard to accomplish: Writing code, my mind would wander off to the mailing list; when reviewing patches my mind would start actually thinking about that one implementation that was still lingering on my hard disk.

There are a few techniques for getting back to that state of thinking about just one thing at a time. One article I found very insightful was an essay by Paul Graham. He gave a pretty good analysis of thoughts that can bind your attention and draw it away from what should actually be the thing you are thinking about. According to his analysis a pretty reliable way to discover ideas that steal your attention is to observe what thoughts your mind wanders to when you are taking a shower (I would add cycling to work here, basically anything that lets your mind free to dream and think): If it is not in line with what you would like to think about, it might be a good time to think about the need to change.

There are a few ways to force your mind to stay “on-topic”. Some very easy ones are explained in a recent blog post on attention span (Thanks to Thilo for the link):

  • Organising your virtual desktops such that applications are sorted according to tasks (one for communication, one for coding project x, another one for working on project y) helps to switch off distraction that would otherwise hide in plain sight. Who wants to work on code if TweetDeck is blinking at you next to your editor? In contrast to the original author I would not go so far as to switch off multiple monitors: It’s great to have your editor, some terminals and documentation in the browser open all at the same time in one workspace. However, I do try to keep everything that has to do with communication separate from coding etc.
  • Train to work for longer and longer periods of time on one task and one task only: The world does not fall apart if people have to wait for an answer to your mail for longer than 30min - at least they’ll get used to it. You do not need to take your phone to meetings: If anything is starting to melt down there will be people who know where you are and who will drag you out of the meeting room in no time. Anything else can well wait for another 60min.
  • When working with tabbed browsing: Don’t open more tabs than you can easily scan. You won’t read those interesting blog posts you found four weeks ago anyway. In modern browsers it is possible to detach tabs. That way you can follow the first hint of keeping even the web pages sorted on desktops according to activity: You do not need your time tracking application next to your editor. Having only documentation and testing application open there does help.
  • Keep your environment friendly and supportive. Whoever has shared an office (or a lecture at university back when I was a student) with me knows that close to my desk the probability of finding sweets, cookies, drinks and snacks approaches one. Being hungry when trying to fix a bug does not help, believe me.

One additional trick that helps staying just focussed enough for debugging complex problems is to make use of systematic debugging by Andreas Zeller (also explained in Zen and the Art of Motorcycle Maintenance). The trick is to explicitly track your thoughts on paper: Write down your hypothesis of what causes the problem. Then identify an experiment to test the hypothesis - you should know how to use your debugger, when to use print statements, which unit tests to write and when to simply take a very close look at the code and potentially make it simpler for that. Only when your experiment confirms that you have found the cause of the problem have you really identified what you need to fix.

There are a few other techniques for getting things out of your head that are just there to distract you: If you have ever read the book “Getting Things Done” or seen the Inbox Zero presentations you may already have an idea of what I am hinting at.

By now I have a calendar application that works like a charm: It reminds me of meetings ahead of time, it warns me in case of conflicts, it accepts notes, it has an amazing life span of one year and is always available (provided I do not forget it at home):

- got mine here ;) That’s for organising meetings, going to conferences, getting articles done in time and not forgetting about family birthdays.

For week to week planning we tend to use Scrum including a scrum board. However that is not only for planning as anyone using Scrum may have expected already.

For my inbox the rule is to filter any mailing list into its own folder. Second rule is to keep the number of messages in my inbox to something that fits into a window with fewer than 15 lines: Anything I need for further reference (conference instructions, contacts, addresses that did not yet go into my little blue book, phone numbers not yet stored in my mobile phone) goes into its own folder. Anything that needs a reply is not allowed to stay in the inbox for longer than half a week. For larger projects mail gets sorted into its own project folder. Anything else simply goes to an archive: There are search indexes available, even Linux supports desktop search, search is even integrated in most mail clients. Oh and did I mention that I managed to search for one specific mail for an hour just recently, though it was filed into its own perfectly logical folder - simply because I had forgotten which folder it was?
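Expressed as code, these inbox rules could look roughly like the following Python sketch. It is not my actual mail filter setup: the folder names, the keyword-based project matching and both function names are assumptions for illustration. The `headers` argument can be any mapping with a `get` method, for instance a plain dict or a message object from Python's `email` package.

```python
from collections.abc import Mapping
from datetime import datetime, timedelta


def folder_for(headers: Mapping[str, str], project_keywords: dict[str, str]) -> str:
    """Pick a target folder: list mail first, then project mail, else inbox."""
    list_id = headers.get("List-Id")
    if list_id:
        # "Some list <dev.example.org>" -> "dev.example.org"
        identifier = list_id.strip().rstrip(">").rsplit("<", 1)[-1]
        return "lists/" + identifier
    subject = (headers.get("Subject") or "").lower()
    for keyword, folder in project_keywords.items():
        if keyword in subject:
            return folder
    return "INBOX"


def reply_overdue(
    received: datetime,
    now: datetime,
    limit: timedelta = timedelta(days=3, hours=12),
) -> bool:
    """True if a mail has been waiting for a reply longer than half a week."""
    return now - received > limit
```

Whatever ends up in the inbox and is neither reference material nor waiting for a reply would then go to the archive by hand.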

To get rid of things I have to do “some time in the near future but not now” I keep a list in my notebook - just so my mind knows the note is there for me to review and it knows I won’t forget about it. So to some extent my notebook is my personal swap space. One thing I learnt at Google was to not use loose paper for these kinds of notes - a bound book is way better in that it keeps all notes in one place. In addition you do not run the risk of throwing notes away too early or misplacing them.

The only thing missing is a real product backlog that keeps track of larger things to do and projects to accomplish - something like “I really do need to find a weekend to drive these >250km north to the eastbaltic sea (Thanks to Astro for pointing out the typo to me - hey, that means there is at least one guy who actually did read that blog post from beginning to end - wow!) and relax” :)

Apache, Free Software, Freetime, Hacking

Series: Getting things done

July 30th, 2010

Probably not too unusual for people working on free software mostly (though no longer exclusively) in their spare time, the number of items that appear in my private calendar has increased steadily in the past months and years:

  • Every three months I am organising the Apache Hadoop Get Together in Berlin.
  • I have been asked (and accepted the offer) to publish articles on Hadoop and Lucene in magazines.
  • There are various conferences I attend - either as speaker or simply as participant: FOSDEM, FrOSCon, ApacheCon NA, Devoxx, Chemnitzer Linuxtag - to name just a few.
  • For Berlin Buzzwords I did get quite a bit of time for organisation, still some issues leaked over to what others would call free time.
  • I am mentoring one of Mahout’s GSoC students which is a lot of fun.
  • At least I try to spend as much time as possible on the Mahout mailing lists keeping up with what is developed and discussed there.

There are various techniques to cope with increased work load and still find enough time to relax. Some of them involve simply remembering what to do at the right time, some involve prioritization, others deal with measuring and planning what to do. In this tiny series I’ll explain the techniques I employ - or at least try to - in the hope of getting your feedback and comments on how to improve the system. After all, the most important task is to constantly improve one’s own processes.

Freetime, Hacking, Scrum

Google Summer of Code starting

March 10th, 2010

As published on the Google Open Source blog, the application period for mentoring organizations for GSoC starts now. The ASF is already in the process of applying. If you are a student looking for an interesting project to work on during the coming summer - you might consider participating in GSoC. It does give you a great opportunity to get in touch with successful free software projects, learn how to work in global teams, improve your communication skills and last but not least show and publish your fantastic coding skills.

If you want to learn more on Why you should contribute to open source, the article by Shalin Shekhar Mangar is a great summary of some of the reasons why people work on open source projects.

Apache, Hacking, Mahout

Shopping at Ikea

February 1st, 2010

Some weeks ago, Thilo had a tiny little gadget not to be missed in an average geek’s apartment: A server - admittedly a little old and a bit slow, but still usable for playing around. He installed Ubuntu server on it. In the evening we got it configured to run Hadoop. A little later we found out that some friends of ours probably, maybe have some usable hardware left as well - we’ll see on Monday.

However having a server on your dinner table is not really practical: There’s always some danger of spilling tea over it… However last week, one of my colleagues posted a link to the Lack Rack wiki page in the eth-0 Wiki on one of our mailing lists.

So yesterday was one of the (very rare) days, when I got Thilo to join me on a trip to Ikea. The result can be seen in the images above. Looks like elephants invaded our living room ;)

Hacking, Hadoop

The 7 deadly sins of (Java) software developers

January 23rd, 2010

On Lucid Imagination’s blog Jay Hill published a great article on The seven deadly sins of Solr. Basically it is a collection of his experiences “analyzing and evaluating a great many instances of Solr implementations, running in some of the largest Fortune 500 companies”. It is a collection of common mistakes, mis-configurations and pitfalls in Solr installations in production use.

I loved the article very much. However, many of the symptoms that Jay described in his article do not apply to Solr installations only. In the following I will try to come up with a more general classification of errors that occur when your average Java developer starts using a sufficiently large framework that is supposed to make his work easier. Happy about any input on your favourite production issues.

Remark: What is printed in italic is quoted as is.

Sin number 1: Sloth - I’ll do it later

Let’s define sloth as laziness or indifference. This one bites most of us at some time or another. We just can’t resist the impulse to take a shortcut, or we simply refuse to acknowledge the amount of effort required to do a task properly. Ultimately we wind up paying the price, usually with interest.

There is even a name for it in Scrum: Technical debt. It may be ok to take a shortcut, given this is done based on an informed decision. As with regular debt, you may get a big advantage like launching way earlier than your competitor. However, as with real debt, it does come at a price.

Lack of commitment

Jay describes the problems that are especially frequent when switching search applications: Humans in general do not like giving up their habits. A nice example described in more detail in a recent Zeit article is what happens each year in December/January when the first snow falls: It is by no means irregular or not to be expected that it starts snowing in December in Germany. However, there will be lots of people who are not prepared for that. They refuse to put on winter tyres in late autumn. They use their car instead of public transport despite warnings in the press. The conclusion of the article was simple: People are simply not willing to change habits they got used to. It does take longer and is a bit less flexible to get to work by public transport instead of your own car. It does require adjusting your daily routine, optimising your processes.

Something similar happens to a developer that is “forced” to switch technology, be it the search server, the database, the build system or simply the version control system: The old ways of doing stuff simply may not work as expected. New tools might be called for. New technologies to learn. However, not infrequently developers just blame the new tools: “But with the old setup this would always work.”

Developing software - probably more than anything else - means constant development, constant change. Technologies shift as tasks shift, tools are improved as workflows change. Developing software means to constantly watch closely what you are doing, reflecting on what works and what doesn’t and changing things that don’t work. Accepting change, seeing it as a chance rather than an obstacle is critical.

If however change is imposed on developers even though good arguments in favour of the old approach exist, it may be worth the effort to at least take the technical view into account to make an informed decision.

Not reviewing, editing, or changing the default configuration files.

I have extended this one a bit: Developers not changing default configuration files are not that uncommon. Be it the default database configuration, default logging configuration for your modules or default configuration of your servlet container. Even if you are using software pre-packed by your distribution, it is still worth the effort to review configuration files for your services and adjust them to your needs. Usually they are to be used as examples that still need tweaking and customization after roll-out.

JVM settings and GC

If you are running a Java application there is no way around adjusting GC settings as well as general JVM settings to your particular use case. There are great tutorials at sun.com that explain both the settings themselves as well as several rules-of-thumb of where to start. Still nothing should stop you from measuring your particular application and its specific needs - both before and after tuning. Along with that goes the obvious recommendation to simply “know-your-tools” - learning load testing tools shortly before launch time is certainly not a good choice. Trying to find out more on Java memory analysis late in the development cycle just because you need to find that stupid memory leak like *now* is not a good idea either.

There are several nice talks as well as several tutorials available online on the topic of JVM tuning, debugging memory as well as threading issues, one of them being the talk by Rainer Jung at FrOSCon 2008.

Sin number 2: Greed

Running a service on insufficient hardware (be it main memory, hard disks, bandwidth, …) is not only an issue with Solr installations. There are many cases where just adding hardware may help in the short run, but is a dead-end in the long run:

  • Given a highly inefficient implementation, identifying bottlenecks, profiling, benchmarking and optimization go a long way.
  • Given an inappropriate architecture, redesign, reimplementation and maybe even switching base technologies does help.

However, as Jay pointed out, running production servers with less power than your average desktop Mac does not help either.

Sin number 3: Pride

Engineers love to code. Sometimes to the point of wanting to create custom work that may have a solution in place already, just because: a) They believe they can do it better. b) They believe they can learn by going through the process. c) It “would be fun”. This is not meant to discourage new work to help out with an open-source project, to contribute bug fixes, or certainly to improve existing functionality. But be careful not to rush off and start coding before you know what options already exist. Measure twice, cut once.

Don’t re-invent the wheel.

As described in Jay’s post, there are developers who seem to be actively searching for reasons to re-invent the wheel. Sure, this is far easier with open source software than with commercial software. Access to code here makes the difference: Understanding, learning from, sharing and improving the software is key to free software.

However, there are many cases where improving does not mean re-implementing but submitting patches, fixing bugs, adding new features to the original project or just refactoring the original code and ironing out some well-known bumps to make life easier for others.

Every now and then a new query abstraction language for map reduce pops up. Some of those really solve distinct problem settings that cannot (and should not) be solved within one language. Especially if a technology is young, this is pretty usual as people try out different approaches to see what works and what does not work out so well. Good and stable things come from that - in general the fittest approach survives. However, too often I have heard developers excusing their re-invention by “having had too little time to do a thorough evaluation of existing frameworks and libraries”. The irony here really is that usually, coding up your own solution does take time as well. In other cases the excuse was missing support for some of the features needed. How about adding those features, submitting them upstream and benefitting from what is already there and an active community supporting the project, testing it, applying fixes and adding further improvements?

Make use of the mailing lists and the list archives.

Communication is key to success in software development. According to Conway’s law “Organizations which design systems are constrained to produce systems which are copies of the communication structures of these organizations.” I guess it is pretty obvious that developing software today generally means designing complex systems.

In open source, mailing lists (and bug trackers, the code itself, release notes etc.) are all channels of communication. (See also Bertrand’s brilliant talk on open source collaboration tools for that). With in-house development there is even an added benefit, as face-to-face communication or at least teleconferencing is possible.

However, software developers in general seem to be reluctant to ask questions, to discuss their design, their implementation and their needs for changes. It just seems simpler to work around a situation that disturbs you instead of propagating the problem to its source - or just asking for the information you need. A nice article on a related topic was published recently on it-republik.

However asking these questions, taking part in these discussions is what makes software better. It is what happens regularly within open source projects in terms of design discussions on mailing lists, discussions on focussed issues in the bug tracker as well as in terms of code review.

There are several best practices that come with Agile Development that help start discussions on code. Pair programming is one of these. Code reviews are another example. Having more than two eyeballs look at a problem usually makes the solution more robust, gives confidence in what was implemented and, as a nice side effect, spreads knowledge of the code, avoiding a single point of failure with just one developer being familiar with a particular piece of code.

Sin number 4: Lust

Must have more! You’ll have to grant me artistic license on this one, or else we won’t be able to keep this blog G-rated. So let’s define lust as “an unnatural craving for something to the point of self-indulgence or lunacy”. OK.

Setting the JVM Heap size too high, not leaving enough RAM for the OS.

Jay describes how setting the JVM RAM allocation too high can lead to Java eating up all memory and leaving nothing for the OS. The observation does not apply to Solr deployments only. Tomcat is just another application where this applies as well. Especially with IO-bound applications, giving too much memory to the JVM is a grave mistake, as the OS no longer has enough space for disk caches.

The general take-away probably should be to measure and tune according to the real observed behaviour of your application. A second take-home message would be to understand your system - not only the Java part of it, but the whole machine from Java, the OS down to the hardware - to tune it effectively. However that should be a well known fact anyway. For Java developers, it sometimes helps to simply talk to your operations guys to get the bigger picture.
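As a rough illustration of leaving room for the OS, the following Python sketch derives a starting heap size from the machine's total RAM. The 50% share reserved for the OS and its disk cache, the helper names, and the choice to set `-Xms` equal to `-Xmx` are all assumptions made for this example, not recommendations from Jay's post. Treat the result as a first guess to be checked against measurements.

```python
def heap_budget_mb(total_ram_mb, os_cache_fraction=0.5, other_processes_mb=0):
    """Suggest a JVM heap size in MB that leaves a share of RAM to the OS.

    os_cache_fraction is the (assumed) share of remaining memory kept free
    for the operating system and its disk cache.
    """
    if not 0 <= os_cache_fraction < 1:
        raise ValueError("os_cache_fraction must be in [0, 1)")
    available = total_ram_mb - other_processes_mb
    if available <= 0:
        raise ValueError("no memory left after other processes")
    return int(available * (1 - os_cache_fraction))


def jvm_flags(heap_mb):
    """Build the standard -Xms/-Xmx heap flags for the given heap size."""
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


# Hypothetical machine: 16 GB RAM, 1 GB used by other services.
heap = heap_budget_mb(16384, os_cache_fraction=0.5, other_processes_mb=1024)
print(" ".join(jvm_flags(heap)))
```

For an IO-bound service such as a search index, a larger `os_cache_fraction` may be the better starting point, since the disk cache does much of the work.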

Too much attention on the JVM and garbage collection.

There are actually two aspects here: For one, as described by Jay, it should not be necessary to try every arcane JVM or GC setting unless you are a JVM expert. More precisely, simply trying various options without understanding what they mean, what side effects they have and in which situations they help obviously isn’t a very good idea.

The second aspect would be developers starting with JVM optimization only to learn later on that the real problem is within their own application. Tuning JVM parameters really should be one of the last steps in your optimization pipeline. First should be benchmarking and profiling your own code. At the same stage you should review the configuration parameters of your application (size of thread pools, connection pools etc.) as well as those of your libraries and frameworks (here come Solr’s configuration files, Tomcat’s configuration, RDBMS configuration parameters, cache configurations…). Last but not least should be JVM tuning - starting with adjusting memory to a reasonable amount, then setting the GC configuration that makes most sense for your application.

Sin number 5: Envy

Bah!

Wanting features that other sites have, that you really don’t need.

It should be good engineering practice to start with your business needs and distill user stories from that and identify the technology that solves your problem. Don’t go from problem to solution without first having understood your problem. Or even worse: Don’t go from solution (that is from a technology you would love to use) to searching for a problem that this solution might solve: “But there must be a RDBMS somewhere in our architecture, right?”

Wanting to have a bigger index than the other guy.

The antithesis of the “greed” issue of not allocating enough resources. “Shooting for the moon” and trying to allow for possible growth over the next 20 years. Another scenario would be to never fix your system but leave every piece open and configurable, in the end leading to a system that is harder to configure than sendmail is. Yet another scenario would be to plan for billions of users before even launching: That may make sense for a new Google gadget, however for the “new kid on the block”? Probably unlikely, unless you have really good marketing guys. Plan for what is reasonable in your project, observe real traffic and identify real bottlenecks once you see them. Usually estimations of what bottlenecks could be are just plain wrong unless you have lots of experience with the type of application you are building. As Jeff Dean pointed out in his WSDM 2009 keynote, the right design for X users may still be right with 10x the number of users. But do plan a rewrite at about the time you start having 100x or more the number of users.

Sin number 6: Gluttony

“Staying fit and trim” is usually good practice when designing and running Solr applications. A lot of these issues cross over into the “Sloth” category, and are generally cases where the extra effort to keep your configuration and data efficiently managed is not considered important.

Lack of attention to field configuration in the schema.

Storing fields that will never be retrieved. Indexing fields that will never be searched. Storing term vectors, positions and offsets when they will never be used. Unnecessary bloat. Understand your data and your users and design your schema and fields accordingly.
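A check like this can be sketched in a few lines of Python. The field list below is a hypothetical, hand-written stand-in for what a schema declares (the keys `indexed`, `stored` and `termVectors` mirror the attribute names used in Solr's schema.xml), and the sets of searched, retrieved and term-vector-using fields are assumed to come from your own knowledge of the application or from query logs. It is not an API offered by Solr itself.

```python
def audit_fields(fields, searched, retrieved, uses_term_vectors=()):
    """Return (field name, finding) pairs for settings that look like bloat."""
    findings = []
    for field in fields:
        name = field["name"]
        if field.get("indexed") and name not in searched:
            findings.append((name, "indexed but never searched"))
        if field.get("stored") and name not in retrieved:
            findings.append((name, "stored but never retrieved"))
        if field.get("termVectors") and name not in uses_term_vectors:
            findings.append((name, "term vectors stored but never used"))
    return findings


# Hypothetical schema and usage information.
schema = [
    {"name": "id", "indexed": True, "stored": True},
    {"name": "body", "indexed": True, "stored": True, "termVectors": True},
    {"name": "internal_notes", "indexed": True, "stored": True},
]
report = audit_fields(
    schema,
    searched={"id", "body"},
    retrieved={"id"},
)
for name, finding in report:
    print(f"{name}: {finding}")
```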

On a more general scale, that might be wrapped into the general advice of keeping only data that is really needed: Rotate logs on a schedule that fits your business and operations needs and the machines available. Rotate data written into your database backend: It may make sense to keep users that did not interact with your application for 10 years. If you have a large datacenter for storage that may make even more sense. However, usually keeping inactive users in your records simply eats up space.

Unexamined queries that are redundant or inefficient.

Queries that catch too much information, are redundant or multiple queries that could be folded into one are not only a problem for Solr users. Anyone using data sources that are expensive to query probably knows how to optimize those queries for reduced cost.
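One simple case of folding several queries into one is a series of single-term lookups on the same field. The Python sketch below groups such lookups and builds one query string per field in Lucene's `field:("a" OR "b")` syntax. The function names and the (field, value) input format are assumptions for illustration, and the escaping only covers backslashes and double quotes inside quoted phrases.

```python
def quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def fold_term_queries(lookups):
    """Fold (field, value) lookups into one OR query per field.

    Duplicate lookups are dropped; field and value order is preserved.
    """
    grouped = {}
    for field, value in lookups:
        values = grouped.setdefault(field, [])
        if value not in values:
            values.append(value)
    queries = []
    for field, values in grouped.items():
        joined = " OR ".join(quote(v) for v in values)
        queries.append(f"{field}:({joined})")
    return queries


# Four separate lookups collapse into two queries.
lookups = [("id", "1"), ("id", "2"), ("id", "1"), ("type", "book")]
folded = fold_term_queries(lookups)
print(folded)
```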

Sin number 7: Wrath

Now! While wrath is usually considered to be synonymous with anger, let’s use an older definition here: “a vehement denial of the truth, both to others and in the form of self-denial, impatience.”

Assuming you will never need to re-index your data.

Hmm - don’t only backup. Include recovery in your plans! Admittedly with search applications, this includes keeping the original documents - it is not unusual to add more fields or to want to parse data differently from the first indexing run. Same applies if you are post-processing data that has been entered by users or spidered from the web for tasks like information extraction, classifier training etc.

Rushing to production.

Of course we all have deadlines, but you only get one chance to make a first impression. Years ago I was part of a project where we released our search application prematurely (ahead of schedule) because the business decided it was better to have something in place rather than not have a search option. We developers felt that, with another four weeks of work we could deliver a fully-ready system that would be an excellent search application. But we rushed to production with some major flaws. Customers of ours were furious when they searched for their products and couldn’t find them. We developed a bad reputation, angered some business partners, and lost money just because it was deemed necessary to have a search application up and running four weeks early.

Leaving that as is - just adding, this does not apply to search applications only ;)

So keep it simple and separate, stay smart, stay up to date, and keep your application on the straight-and-narrow (YAGNI ;) ). Seek (intelligently) and ye shall find.

Free Software, Hacking, Lucene, Scrum

Getting Hadoop trunk up and running from source

October 4th, 2009

Having told Thilo about the possibility to write Hadoop jobs in Python with Dumbo, we spent some time getting Dumbo 0.21 up and running over the past weekend. The first option the wiki proposes is to take a pre-0.21 release and patch that to work with the current Dumbo release. The second option described takes the not-yet-released version of Hadoop that can be used w/o any patches.

We decided to follow the latter suggestion. After the latest split of the project, we downloaded common, hdfs and mapreduce. Building each project was easy - assuming that ant, Sun JDK 6 (for Hadoop), Forrest (for the documentation pages) and Sun JDK 5 (for Forrest) are installed.

Deviating from the documentation, the distributed filesystem as well as map reduce are now started from separate scripts (start-dfs.sh/ start-mapred.sh instead of start-all.sh). These scripts are located in the common project. In addition the variables HADOOP_HDFS_HOME and HADOOP_MAPRED_HOME must be set to point to respective projects for cluster setup to work. Other than that the setup currently is identical to the previous version.
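The setup described above can be sketched as a small Python helper that prepares the environment and the two start commands. The directory layout under `/opt/hadoop` is purely hypothetical, and placing the scripts in a `bin/` subdirectory of the common project is an assumption on my part; adjust both to your checkout.

```python
import os
import posixpath


def hadoop_env(hdfs_dir, mapred_dir, base_env=None):
    """Return a copy of the environment with the two Hadoop variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_HDFS_HOME"] = hdfs_dir
    env["HADOOP_MAPRED_HOME"] = mapred_dir
    return env


def start_commands(common_dir):
    """Paths of the separate start scripts (bin/ location is assumed)."""
    bin_dir = posixpath.join(common_dir, "bin")
    return [
        posixpath.join(bin_dir, "start-dfs.sh"),
        posixpath.join(bin_dir, "start-mapred.sh"),
    ]


# Hypothetical checkout locations of the three projects.
env = hadoop_env("/opt/hadoop/hdfs", "/opt/hadoop/mapreduce", base_env={})
commands = start_commands("/opt/hadoop/common")
for command in commands:
    # To actually start the daemons: subprocess.run([command], env=env, check=True)
    print(command)
```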

*Camp, Hacking

Dev House Berlin 2.0

October 4th, 2009

This weekend DevHouseBerlin took place in the Box119, kindly organized by Jan Lehnardt, sponsored by Upstream and StudiVZ. There were about 30 people gathered in Friedrichshain, hacking and discussing various projects: Mostly Python/ Django, Ruby/ Rails and Erlang people.

The first day was reserved for hacking and exchanging ideas. In the late afternoon, attendees put together a list of talks that were then rated and ranked, with the top three chosen for presentation on Sunday. The list included topics on CouchDB, RestMS, Hadoop, Concurrency in Erlang, P2P CouchDB and many more. The first three topics were chosen by the participants for presentation.

During the time at DevHouse I finally got a list of topics and papers up at Mahout TU project - now only the exact credit system for the Mahout course at TU is missing. I got some time to work on Mahout improvements and documentation. Unfortunately I was too tired today to complete the code review for MAHOUT-157 - promise to do that early next week.

Spending one weekend with like-minded people and being able to pair with someone else on more complex problems made the weekend a great time for me. Planning to be there again next year. Thanks to the sponsors and organisers for making this happen.

*Camp, Hacking