Archive for January, 2010

Hadoop at Heise c’t

January 31st, 2010

<surreptitious_advertising>
Of interest to German-speaking readers: Heise published an introductory article on Hadoop in the latest issue of c’t. Have fun reading.
</surreptitious_advertising>

Thanks to Simon for proof-reading and providing valuable input. Thanks to Thilo Fromm for the Hadoop graphics (unfortunately none of them got published in their original form), the catchy title, proof-reading the text over and over again and for keeping me sane during several past and coming months.

If you want to know more on Apache Hadoop, come watch my FOSDEM Hadoop talk next weekend. If you want to join discussions on Apache Hadoop and Lucene, stay tuned for a conference in Berlin on these topics.

Apache Con, Apache Hadoop Get Together Berlin, Events, Hadoop

March 2010 Apache Hadoop Get Together Berlin

January 29th, 2010

This is to announce the next Apache Hadoop Get Together, which will take place at the newthinking store in Berlin.

  • When: March 10th, 4 p.m.
  • Where: Newthinking store Berlin

As always there will be slots of 20 minutes each for talks on your Hadoop topic. After each talk there will be plenty of time for discussion. You can order drinks directly at the bar in the newthinking store. If you like, you can order pizza. We will go to Cafe Aufsturz after the event for some beer and something to eat.


Talks scheduled so far:

Chris Male (JTeam/ Amsterdam): Spatial Search with Solr

Abstract: The rise in popularity of Google Maps and mobile devices with GPS has resulted in a trend in the search field. People are no longer content with finding results that match a text query; they also want to find results which are near a location. So-called spatial search differs considerably from traditional free text search in that it cannot be achieved through common search techniques such as inverted indexes. Instead, new algorithms and data structures had to be developed that achieve efficient and accurate spatial search and that also allow spatial search to have a role in the determination of a result’s relevance. This technology has primarily been found in proprietary closed source search applications; however, in the last 12-18 months, considerable effort has been invested into bringing open source spatial search support to Apache Solr and Lucene. While much is still left to be done, this talk will introduce how spatial search is currently supported in Solr, what work is happening currently, and a roadmap for future developments.

Dragan Milosevic (zanox/ Berlin): Product Search and Reporting powered by Hadoop

Abstract: To efficiently process and index 80 million products, as well as store and analyse 30 million clicks and 500 million views daily, Zanox AG is using Hadoop HDFS and MapReduce technologies. This talk will present product-processing and reporting frameworks running on a 17-node Hadoop cluster, being able to (1) robustly store products and tracking data in a distributed manner, (2) rapidly consolidate, normalise and categorise products, (3) merge and aggregate tracking data and (4) efficiently build indexes for supporting distributed search and reporting, running in several search clusters.

Bob Schulze (eCircle/ Munich): Database and Table Design Tips with HBase

Abstract: Recurring design patterns for the BigTable/HBase storage model.

A big thank-you goes to the newthinking store for providing a room in the center of Berlin for us. Another big thank-you goes to Nokia Gate 5 for sponsoring videos of the talks. Links to the videos will be posted here.

Please do indicate on the following Upcoming event if you are planning to attend to make planning (and booking tables at Aufsturz) easier. Registration through Xing is possible as well.

Looking forward to seeing you in Berlin,
Isabel

Apache Hadoop Get Together Berlin

The 7 deadly sins of (Java) software developers

January 23rd, 2010

On the Lucid Imagination blog, Jay Hill published a great article on the seven deadly sins of Solr. It draws on his experience “analyzing and evaluating a great many instances of Solr implementations, running in some of the largest Fortune 500 companies” and collects common mistakes, misconfigurations and pitfalls in Solr installations in production use.

I loved the article very much. However, many of the symptoms that Jay described in his article do not apply to Solr installations only. In the following I will try to come up with a more general classification of errors that occur when your average Java developer starts using a sufficiently large framework that is supposed to make his work easier. Happy about any input on your favourite production issues.

Remark: What is printed in italic is quoted as is.

Sin number 1: Sloth - I’ll do it later

Let’s define sloth as laziness or indifference. This one bites most of us at some time or another. We just can’t resist the impulse to take a shortcut, or we simply refuse to acknowledge the amount of effort required to do a task properly. Ultimately we wind up paying the price, usually with interest.

There is even a name for it in Scrum: technical debt. It may be OK to take a shortcut, provided this is based on an informed decision. As with regular debt, you may get a big advantage like launching way earlier than your competitor. However, as with real debt, it does come at a price.

Lack of commitment

Jay describes the problems that are especially frequent when switching search applications: Humans in general do not like giving up their habits. A nice example described in more detail in a recent Zeit article is what happens each year in December/ January when the first snow falls: It is by no means irregular or unexpected that it starts snowing in December in Germany. However there will be lots of people who are not prepared for that. They refuse to put on winter tires in late autumn. They use their car instead of public transport despite warnings in the press. The conclusion of the article was simple: People are simply not willing to change habits they got used to. It does take longer and is a bit less flexible to get to work by public transport instead of your own car. It does require adjusting your daily routine, optimising your processes.

Something similar happens to a developer who is “forced” to switch technology, be it the search server, the database, the build system or simply the version control system: The old ways of doing stuff simply may not work as expected. New tools might be called for. New technologies have to be learned. However, quite often developers just blame the new tools: “But with the old setup this would always work.”

Developing software - probably more than anything else - means constant development, constant change. Technologies shift as tasks shift, tools are improved as workflows change. Developing software means to constantly watch closely what you are doing, reflecting on what works and what doesn’t and changing things that don’t work. Accepting change, seeing it as a chance rather than an obstacle is critical.

If, however, change is imposed on developers even though good arguments in favour of the old approach exist, it may be worth the effort to at least take the technical view into account in order to make an informed decision.

Not reviewing, editing, or changing the default configuration files.

I have extended this one a bit: Developers not changing default configuration files are not that uncommon. Be it the default database configuration, default logging configuration for your modules or default configuration of your servlet container. Even if you are using software pre-packed by your distribution, it is still worth the effort to review configuration files for your services and adjust them to your needs. Usually they are to be used as examples that still need tweaking and customization after roll-out.

JVM settings and GC

If you are running a Java application, there is no way around adjusting GC settings as well as general JVM settings to your particular use case. There are great tutorials at sun.com that explain both the settings themselves and several rules of thumb for where to start. Still, nothing should stop you from measuring your particular application and its specific needs, both before and after tuning. Along with that goes the obvious recommendation to simply “know your tools” - learning load testing tools shortly before launch time is certainly not a good choice. Trying to find out more about Java memory analysis late in the development cycle just because you need to find that stupid memory leak like *now* is not a good idea either.

There are several nice talks as well as several tutorials available online on the topic of JVM tuning and debugging memory as well as threading issues, one of them being the talk by Rainer Jung at FrOSCon 2008.

Sin number 2: Greed

Running a service on insufficient hardware (be it main memory, harddisks, bandwidth, …) is not only an issue with Solr installations. There are many cases where just adding hardware may help in the short run, but is a dead-end in the long run:

  • Given a highly inefficient implementation, identifying bottlenecks, profiling, benchmarking and optimization go a long way.
  • Given an inappropriate architecture, redesign, reimplementation and maybe even switching base technologies does help.

However, as Jay pointed out, running production servers with less power than your average desktop Mac does not help either.

Sin number 3: Pride

Engineers love to code. Sometimes to the point of wanting to create custom work that may have a solution in place already, just because: a) They believe they can do it better. b) They believe they can learn by going through the process. c) It “would be fun”. This is not meant to discourage new work to help out with an open-source project, to contribute bug fixes, or certainly to improve existing functionality. But be careful not to rush off and start coding before you know what options already exist. Measure twice, cut once.

Don’t re-invent the wheel.

As described in Jay’s post, there are developers who seem to be actively searching for reasons to re-invent the wheel. Sure, this is far easier with open source software than with commercial software. Access to code here makes the difference: Understanding, learning from, sharing and improving the software is key to free software.

However, there are so many cases where improving does not mean re-implementing but submitting patches, fixing bugs, adding new features to the original project or just refactoring the original code and ironing out some well-known bumps to make life easier for others.

Every now and then a new query abstraction language for map reduce pops up. Some of those really solve distinct problem settings that cannot (and should not) be solved within one language. Especially if a technology is young, this is pretty usual as people try out different approaches to see what works and what does not work out so well. Good and stable things come from that - in general the fittest approach survives. However, too often I have heard developers excusing their re-invention by “having had too little time to do a thorough evaluation of existing frameworks and libraries”. The irony here really is that coding up your own solution usually takes time as well. In other cases the excuse was missing support for some of the features needed. How about adding those features, submitting them upstream and benefitting from what is already there and an active community supporting the project, testing it, applying fixes and adding further improvements?

Make use of the mailing lists and the list archives.

Communication is key to success in software development. According to Conway’s law “Organizations which design systems are constrained to produce systems which are copies of the communication structures of these organizations.” I guess it is pretty obvious that developing software today generally means designing complex systems.

In open source, mailing lists (and bug trackers, the code itself, release notes etc.) are all means of communication. (See also Bertrand’s brilliant talk on open source collaboration tools for that). With in-house development there is even the added benefit that face-to-face communication or at least teleconferencing is possible.

However, software developers in general seem to be reluctant to ask questions, to discuss their design, their implementation and their needs for changes. It just seems simpler to work around a situation that disturbs you instead of propagating the problem to its source - or just asking for the information you need. A nice article on a related topic was published recently on it-republik.

However asking these questions, taking part in these discussions is what makes software better. It is what happens regularly within open source projects in terms of design discussions on mailing lists, discussions on focussed issues in the bug tracker as well as in terms of code review.

There are several best practices that come with Agile Development that help starting discussions on code. Pair programming is one of these. Code reviews are another example. Having more than two eye balls look at a problem usually makes the solution more robust, gives confidence in what was implemented and as a nice side effect spreads knowledge on the code avoiding a single point of failure with just one developer being familiar with a particular piece of code.

Sin number 4: Lust

Must have more! You’ll have to grant me artistic license on this one, or else we won’t be able to keep this blog G-rated. So let’s define lust as “an unnatural craving for something to the point of self-indulgence or lunacy”. OK.

Setting the JVM Heap size too high, not leaving enough RAM for the OS.

Jay describes how setting the JVM RAM allocation too high can lead to Java eating up all memory and leaving nothing for the OS. The observation does not apply to Solr deployments only. Tomcat is just another application where this applies as well. Especially with IO-bound applications, giving too much memory to the JVM is a grave mistake, as the OS no longer has enough space for disk caches.

The general take-away probably should be to measure and tune according to the real observed behaviour of your application. A second take-home message would be to understand your system - not only the Java part of it, but the whole machine from Java, the OS down to the hardware - to tune it effectively. However that should be a well known fact anyway. For Java developers, it sometimes helps to simply talk to your operations guys to get the bigger picture.

Too much attention on the JVM and garbage collection.

There are actually two aspects here: For one, as described by Jay, it should not be necessary to try every arcane JVM or GC setting unless you are a JVM expert. More precisely, simply trying various options without understanding what they mean, what side effects they have and in which situations they help obviously isn’t a very good idea.

The second aspect would be developers starting with JVM optimization only to learn later on that the real problem is within their own application. Tuning JVM parameters really should be one of the last steps in your optimization pipeline. First should be benchmarking and profiling your own code. At the same stage you should review the configuration parameters of your application (size of thread pools, connection pools etc.) as well as those of your libraries and frameworks (here come Solr’s configuration files, Tomcat’s configuration, RDBMS configuration parameters, cache configurations…). Last but not least should be JVM tuning - starting with adjusting memory to a reasonable amount and setting the GC configuration that makes the most sense for your application.

Sin number 5: Envy

Bah!

Wanting features that other sites have, that you really don’t need.

It should be good engineering practice to start with your business needs and distill user stories from that and identify the technology that solves your problem. Don’t go from problem to solution without first having understood your problem. Or even worse: Don’t go from solution (that is from a technology you would love to use) to searching for a problem that this solution might solve: “But there must be a RDBMS somewhere in our architecture, right?”

Wanting to have a bigger index than the other guy.

The antithesis of the “greed” issue of not allocating enough resources. “Shooting for the moon” and trying to allow for possible growth over the next 20 years. Another scenario would be to never fix your system but leave every piece open and configurable, in the end leading to a system that is harder to configure than sendmail is. Yet another scenario would be to plan for billions of users before even launching: That may make sense for a new Google gadget, however for the “new kid on the block”? Probably unlikely, unless you have really good marketing guys. Plan for what is reasonable in your project, observe real traffic and identify real bottlenecks once you see them. Usually estimations of what bottlenecks could be are just plain wrong unless you have lots of experience with the type of application you are building. As Jeff Dean pointed out in his WSDM 2009 keynote, the right design for X users may still be right with 10x the number of users. But do plan a rewrite at about the time you start having 100x or more the number of users.

Sin number 6: Gluttony

“Staying fit and trim” is usually good practice when designing and running Solr applications. A lot of these issues cross over into the “Sloth” category, and are generally cases where the extra effort to keep your configuration and data efficiently managed is not considered important.

Lack of attention to field configuration in the schema.

Storing fields that will never be retrieved. Indexing fields that will never be searched. Storing term vectors, positions and offsets when they will never be used. Unnecessary bloat. Understand your data and your users and design your schema and fields accordingly.

On a more general scale that might be wrapped into the general advice of keeping only data that is really needed: Rotate logs on a schedule that fits your business and operations needs and the machines available. Rotate data written into your database backend: It may make sense to keep users that did not interact with your application for 10 years. If you have a large datacenter for storage that may make even more sense. However, usually keeping inactive users in your records simply eats up space.
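As a rough illustration of rotating logs on a schedule, here is a minimal sketch in Python (chosen only for brevity; the same idea applies to any logging framework). It uses the standard library's `TimedRotatingFileHandler`. The file name `app.log`, the logger name and the 14-day retention are assumptions made for this example, not recommendations from Jay's article.

```python
import logging
import logging.handlers
import os
import tempfile


def build_rotating_logger(log_dir, keep_days=14):
    """Return a logger that starts a new file at midnight and keeps
    at most `keep_days` rotated files (hypothetical retention value)."""
    handler = logging.handlers.TimedRotatingFileHandler(
        os.path.join(log_dir, "app.log"),  # hypothetical file name
        when="midnight",                   # rotate once per day
        backupCount=keep_days,             # older files are deleted
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger("rotation_sketch")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger, handler


# Write into a temporary directory so the sketch leaves no traces behind.
log_dir = tempfile.mkdtemp()
logger, handler = build_rotating_logger(log_dir)
logger.info("service started")
handler.close()
```

The retention period is the value to adjust to your business and operations needs; the rest is plumbing.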

Unexamined queries that are redundant or inefficient.

Queries that catch too much information, are redundant or multiple queries that could be folded into one are not only a problem for Solr users. Anyone using data sources that are expensive to query probably knows how to optimize those queries for reduced cost.

Sin number 7: Wrath

Now! While wrath is usually considered to be synonymous with anger, let’s use an older definition here: “a vehement denial of the truth, both to others and in the form of self-denial, impatience.”

Assuming you will never need to re-index your data.

Hmm - don’t only backup. Include recovery in your plans! Admittedly with search applications, this includes keeping the original documents - it is not unusual to add more fields or to want to parse data differently from the first indexing run. Same applies if you are post-processing data that has been entered by users or spidered from the web for tasks like information extraction, classifier training etc.

Rushing to production.

Of course we all have deadlines, but you only get one chance to make a first impression. Years ago I was part of a project where we released our search application prematurely (ahead of schedule) because the business decided it was better to have something in place rather than not have a search option. We developers felt that, with another four weeks of work we could deliver a fully-ready system that would be an excellent search application. But we rushed to production with some major flaws. Customers of ours were furious when they searched for their products and couldn’t find them. We developed a bad reputation, angered some business partners, and lost money just because it was deemed necessary to have a search application up and running four weeks early.

Leaving that as is - just adding, this does not apply to search applications only ;)

So keep it simple and separate, stay smart, stay up to date, and keep your application on the straight-and-narrow (YAGNI ;) ). Seek (intelligently) and ye shall find.

Free Software, Hacking, Lucene, Scrum

Apache Dinner January 2010

January 18th, 2010

This evening in X-Berg several local committers met for the second “Apache Dinner” - an informal gathering of local Apache committers, friends and associates for food, beer and interesting discussions. Next one is probably to be scheduled some time in February. Feel free to send a message to Torsten Curdt to be included on the next invitation mail. Thanks for organizing a nice evening, Torsten. Hope to see even more Apache friends at the next dinner ;)

Apache, Freetime

Mahout in Action

January 11th, 2010

As noted earlier by Grant Ingersoll, the first chapters of Mahout in Action are already online at Manning:



Sean, Robin, keep up the great work! I would love to read more of the book in the near future.

Mahout

How much of Scrum is implemented?

January 6th, 2010

I have started using Scrum for various purposes: It has inspired the way software is developed at my current employer. I use it to organize a students’ project at university. In addition we are using it at home to get all personal tasks (preparing breakfast, doing the laundry, meeting with friends…) in line for each week.

I am constantly looking for ways to evaluate, refine and improve work, so I am also looking for ideas on how to evaluate which aspects of the Scrum implementation can actually be improved. One pretty common way to do this evaluation is the so-called “Nokia Test”, a set of questions on project management that lets you judge your implementation of Scrum. As an example, let’s have a closer look at our “Scrum Housework” implementation.


Question 1 - Iterations

  • No iterations - 0
  • Iterations > 6 weeks - 1
  • Variable length < 6 weeks - 2
  • Fixed iteration length 6 weeks - 3
  • Fixed iteration length 5 weeks - 4
  • Fixed iteration 4 weeks or less - 10

Currently we are doing one week iterations - planning ahead for longer just seems impossible, except for events like going to conferences or regular birthdays. So that would be 10 points for iterations.


Question 2 - Testing within the Sprint

  • No dedicated QA - 0
  • Unit tested - 1
  • Feature tested - 5
  • Features tested as soon as completed - 7
  • Software passes acceptance testing - 8
  • Software is deployed - 10

Hmm. Admittedly there is no real testing in place except for smoke testing for stuff like emptying the dish washer.

Question 3 - Agile Specification

  • No requirements - 0
  • Big requirements documents - 1
  • Poor user stories - 4
  • Good requirements - 5
  • Good user stories - 7
  • Just enough, just in time specifications - 8
  • Good user stories tied to specifications as needed - 10

We do not have big documents describing how to set up the Christmas tree. But at the beginning of each sprint there is a set of user stories, if needed with acceptance criteria specified. So something like “Tidy up computer desk” would be augmented by the information: “To the extent that there are no items except for the laptop on the desk afterwards and the desk was dusted”. That might probably make a 10.


Question 4 - Product Owner

  • No Product Owner - 0
  • Product Owner who doesn’t understand Scrum - 1
  • Product Owner who disrupts team - 2
  • Product Owner not involved with team - 2
  • Product owner with clear product backlog estimated by team before Sprint Planning meeting (READY) - 5
  • Product owner with release roadmap with dates based on team velocity - 8
  • Product owner who motivates team - 10

We are both familiar with Scrum. However, due to the nature of the tasks and due to lack of people in the loop we are exchanging the role of the product owner regularly. We are still missing a product backlog - currently it is loosely defined as a pile of post-it notes with estimates put beside each item that define its complexity. So I would give some 3 points on that one.


Question 5 - Product Backlog

  • No Product Backlog - 0
  • Multiple Product Backlogs - 1
  • Single Product Backlog - 3
  • Product Backlog clearly specified and prioritized by ROI before Sprint Planning (READY) - 5
  • Product Owner has release burndown with release date based on velocity - 7
  • Product Owner can measure ROI based on real revenue, cost per story point, or other metrics - 10

We only have one product backlog though it is very informal. So that would make 2 points.




Question 6 - Estimates

  • Product Backlog not estimated - 0
  • Estimates not produced by team - 1
  • Estimates not produced by planning poker - 5
  • Estimates produced by planning poker by team - 8
  • Estimate error < 10% - 10

Naturally those doing the tasks are those producing the estimates. Thanks to Agile42 in Berlin we now even have a set of planning poker cards: Yeah! That makes 8 points. Just as an example: Getting returnable bottles back to the shop makes for 8 complexity points, going to the cinema is 16, just like storing the Christmas decorations back in their boxes, and preparing breakfast is just about 3 points ;)


Question 7 - Sprint Burndown Chart

  • No burndown chart - 0
  • Burndown chart not updated by team - 1
  • Burndown chart in hours/days not accounting for work in progress (partial tasks burn down) - 2
  • Burndown chart only burns down when task is done (TrackDone pattern) - 4
  • Burndown only burns down when story is done - 5
  • Add 3 points if team knows velocity
  • Add 2 points if Product Owner release plan is based on known velocity

We do have a whiteboard with post-it notes on it that are checked out and moved to done as soon as they are done - there is no arguing about the laundry being done before it is cleaned, dried, ironed and back in the closet. ;) So that would make for 5 points. In addition we know our velocity, which would make another 3 points.



Naturally we are pretty capable of telling what can reasonably be expected to be done within the coming sprints. That might add another 2 points for the release being based on that velocity.


Question 8 - Team Disruption

  • Manager or Project Leader disrupts team - 0
  • Product Owner disrupts team - 1
  • Managers, Project Leaders or Team leaders telling people what to do - 3
  • Have Project Leader and Scrum roles - 5
  • No one disrupting team, only Scrum roles - 10

There are events and people interrupting running sprints: Say, NoSQL meetups that are planned spontaneously or new articles that get written and printed within less than a week. But usually these events are rather rare and are kept to a minimum due to the short sprint length. So that might make for 3 points.


Question 9 - Team

  • Tasks assigned to individuals during Sprint Planning – 0
  • Team members do not have any overlap in their area of expertise – 0
  • No emergent leadership - one or more team members designated as a directive authority - 1
  • Team does not have the necessary competency - 2
  • Team commits collectively to Sprint goal and backlog - 7
  • Team members collectively fight impediments during the sprint - 9
  • Team is in hyperproductive state - 10

Currently we are in a state where we have identified impediments and started to address them - declining tasks that cannot reasonably be done within the given timeframe, getting a real product backlog up, tracking even minor tasks like writing e-mails to organize the Apache Hadoop Get Together. So that makes for 9 points.

In total that makes for 54 points (excuse for computing incorrectly: It is 23:34, I am a little tired but cannot sleep due to caffeine). How does your team score on the Nokia test?
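The tally can be double-checked with a few lines of Python. This is a minimal sketch under two assumptions: question 2 is counted as 0 points (no score is stated for it above), and both velocity bonuses are added to question 7. Under those assumptions the sum comes to 55, one more than the total given above.

```python
# Scores as stated in the answers above. Question 2 has no stated score,
# so the 0 used here is an assumption made for this sketch.
scores = {
    "Iterations": 10,
    "Testing within the Sprint": 0,      # assumption: not scored in the text
    "Agile Specification": 10,
    "Product Owner": 3,
    "Product Backlog": 2,
    "Estimates": 8,
    "Sprint Burndown Chart": 5 + 3 + 2,  # base score plus both velocity bonuses
    "Team Disruption": 3,
    "Team": 9,
}

total = sum(scores.values())
maximum = 10 * len(scores)  # each of the nine questions is worth up to 10 points
print(f"{total} of {maximum} points")
```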

Scrum

Third “December Hadoop Get Together” video online

January 5th, 2010

In the following video taken at the last Hadoop Get Together in Berlin Jörg Möllenkamp explains why Hadoop is interesting for Sun - and why Sun Hardware might be a good fit for Hadoop applications:

Hadoop Jörg Möllenkamp from Isabel Drost on Vimeo.

In a blog post published after the event, Jörg gives more details on his idea of Parasitic Hadoop he introduced at the meetup.

Apache Hadoop Get Together Berlin

Second December Hadoop Get Together video

January 3rd, 2010

Richard Hutton from nugg.ad explained how they scaled their ad recommendation system to an increasing number of users with the help of Hadoop. To learn more on their use case and details on which problems they solved with Hadoop, watch the video below:

Hadoop Richard Hutton from Isabel Drost on Vimeo.

Apache Hadoop Get Together Berlin