One Ring to find them
One Ring to bring them all
and in the darkness bind them:
One Ring to find them
Mid November Apache hosts its famous yearly conference - this time in Vancouver/Canada. They kindly accepted my presentations on Apache Mahout for intelligent data analysis (mostly focused on introducing the project to new comers and showing what happened within the project in the past year - if you have any wish concerning topics you would like to see covered in particular, please let me know) as well as a more committer focused one on Talking people into creating patches (with the goal of highlighting some of the issues new-comers to free software projects that want to contribute run into and initiating a discussion on what helps to convince them to keep up the momentum and over come and obstacles).
Looking forward to seeing you in Vancouver for Apache Con NA.
Day two of GoTo Con Amsterdam started with a keynote by former Squeak developer Dan Ingalls. He introduced the Lively kernel - a component architecture for HTML5 that runs in any browser and allows easy composition, sharing and programming of items. Having seen Squeak years ago and being thrilled by its concepts even back then it was amazing to see what you can do with Lively kernel in a browser. If you are a designer and have some spare minutes to spend, consider taking a closer look at this project and dedicating some of your time to help them get better graphics and shapes for their system.
After the keynote I had a hard time deciding on whether to watch Ross Gardler’s introduction to the Apache way or Friso van Vollenhoven’s talk on building three Hadoop clusters in a year - too many interesting stuff in parallel that day. In the end I went for the Hadoop talk - listening to presentations on what Hadoop is actually being used for is always interesting - especially if it involves institutions like RIPE who have the data to analyze the internet downtime in Egypt.
Frise gave a great overview of Hadoop and how you can even use it for personal purposes: Apache Whirr makes it easy to use Hadoop in the cloud by enabling simple EC2 deployment, the Twitter API is a never ending source for more data to analyze (if you don’t have any yourself).
After Jim Webber’s presentation on the graph database neo4j I joined the talk on HBase use cases by Michael Stack. He introduced a set of HBase usages, problems people ran into and lessons learned. HBase in itself is built on top of HDFS - and as such inherits its advantages, strengths and some of its weaknesses.
It is great for handling 100s of GB up to PB of data in an online, random access but strong consistency kind of model. It provides a ruby based shell, comes with a java api, map reduce connectivity, pig, hive and cascading integration, provides metrics through the hadoop metrics subsystem that are exposed via JMX and through Ganglia, provides server side filters and co-processors, hadoop security, versioning, replication and more.
Stumbleupon deals with 1B stumbles a month, has 20M users (growing), users spend approx 7h a month stumbling. For them HBase is the de-factor storage engine. It’s now 2.5years in production and enabled a “throw nothing away” culture, streamlined development. Analysis is done on a separate HBase cluster from the online version. Their lessons learnt: Educate engineering on how it works, study production numbers (small changes can make a for big payoff), over provisioning makes your life easier and gets your weekends back.
… is a distributed, scalable time series database that collects, stores and serves metrics on the fly. For stumbleupon it is their ears and eyes into the system that quickly replaced the usual mix of ganglia, munin and cacti.
As announced earlier this year, Facebook’s messaging system is based on HBase. Also facebook metrics and analytics are stored in HBase. The full story is available in a SIGMOD paper by Facebook.
In short - for Facebook Messaging HBase has to deal with 500M users, millions of messages and billions of instant messages per day. Most interesting piece of the system here was their migration path that by running both systems in parallel made switching over really smooth albeit still technologically challenging.
Their lessons learnt include the need to study production and adjust accordingly, to iterate on the schema to get it right. They also made the experience that there were still some pretty gnarly bugs - however with the help of the HBase community those could be sorted out bit by bit. They also concentrated on building a system that allows for locality - inter rack communication can kill you.
They keep their version of the bing webcrawl in HBase. They have high data ingest volumns (up to multiple TB/hour) from multiple streams. Atop their application also has a wide spectrum of access patterns (from scans down to single cell access). Yahoo right now runs the single larges known HBase cluster on top of 980 2.4 GHz nodes with 16 cores and 24GB Ram each in addition to 6×2TB of disk. Their biggest table has 50B documents, most of the data is loaded in bulk though.
… uses HBase as backend for their image hosting service. In contrast to the above HBase users they don’t have a dedicated dev team but are highly motivated and skilled ops. Being cost senstive and with a little bit of bad luck with them really everything went bad that could go bad - from crashing JVMs, bad RAM crashes, bad glibc with a race condition, etc. Their lessons learnt include that it’s better to run more smaller nodes than less big nodes. In addition lots of RAM is always great to avoid swapping.
The final talk I attended that day was on tackling the folklore around high performance computing. The speakers re-visited common wisdom that is generally known in the Java Community and re-evaluated it’s applicability to recent hardware architectures. Make sure to check out their slides for details on common mis-conceptions when it comes to optimization patterns. Basic take away from this talk is to know not only your programming language and framework but also the VM you are implementing your code for, the system your application will run on and the hardware your stuff will be deployed to: Hardware vendors have gone to great length optimizing their systems, but software developers have been amazing at cancelling out those optimizations quicker then they were put in.
All in all a great conference with lots of inspiring input. Thanks to the organizers for their hard work. Looking forward to seeing some of you over in Vancouver for Apache Con NA 2011.
Last week GoTo Con took place in Amsterdam. Being a sister conference to GoTo in Aarhus the Amsterdam event focused on the broad topics of agile development, architectural challenges, backend and frontend development, platforms like the JVM and .NET. In addition the Amsterdam event featured a special Apache track tailored towards presentations focusing on the development model at Apache and the technologies developed at Apache.
So far Dart is just a tech preview - on the agenda of the development team we find items such as better support for REST arguments, enums, reflection, pattern matching, tooling for test coverage and profiling. All code is freely available, also the language specification and tutorials are open. The developers would love to get more feedback from external teams.
Twitter JVM tuning best practices
In his presentation on JVM tuning Attila Szegedi went into quite some detail on what kind of measures Twitter usually takes when it comes to optimizing code that run on the JVM and exhibits performance issues. Broadly speaking there are three dimensions along which the usual culprits for bad performance hide:
- Memory footprint of the applciation.
- Latency of requests.
- Thread coordination issues.
Memory footprint reduction
A first step always should be to verify that memory is actually responsible for the issues seen. Running the JVM with verbosegc turned on helps identify how often and how effective full GC cycles happen on the machine. Next step is to take into account the simple solution: Evaluate whether the application can simply be given more memory. If that does not help or is impossible start thinking about how to shrink memory requirements: Use caching to avoid having to load all data im memory at once, trim down the data representation used in your implementation, when looking into what to trim know exactly what amount of memory various objects need and how many of these object you actually keep in memory - this analysis should also go into detail when using code generated from frameworks like thrift.
When taking a simple view latency optimization boils down to making a tradeoff between memory usage and time. A little less naive view is to understand that actually it is a set of three goals to optimize:
Tuning an application means to take the product of the three, shift focus but keep the product stable. Optimization is assumed to increase the resulting product.
Biggest thread to latency are full gc cycles. Things to keep in mind when tuning and optimizing: Though the type of gc to run is configurable, this configuration does not apply to cleanup of eden space - Eden is always cleaned up with a stop-the-world gc. In general this is not too grave, as cleaning up objects that are no longer referenced is very cheap. However it can turn into a problem when there are too many surviving objects.
When it comes to selecting GC implementations: Optimize for throughput by delaying GC for as long as possible. This is especially handy for bulk jobs. When optimizing for responsiveness use low pause collectors - they incur a somewhat constant penalty however those avoid having single requests with extremely large response time. This is most handy for online jobs.
Other options to look into: Use adaptivesizepolicy and maxgcholdmillis to allow the jvm to size heap on its own based on your target characteristics. Use the printheapatgc option to view gc heap collection statistics - especially watch out for fromspace being less than 100%, use printtenuredistribution to keep an eye on number of ages, size distribution. In general, give an app as much memory as possible - when using concurrent mark and sweep implementation make sure to over-provision by about 25 to 30% to give the app a gc cushion for operation. If you can spare one cpu, set initiateoccupationfraction to 0 and let gc run all the time.
The last issue in general causing delays are thread coordination issues. The facilities for multi-threaded programming in Java are still pretty low level - even worse, developers generally hardly know about synchronized - not so much about the atomic data types that are available - let alone other features of the concurrent package.
Make sure you check out the speaker’s slides they certainly contain valuable information for developers that want to scale their Java applications.
Another talk that was pretty interesting to me was the introduction of Akka - a project I had only heard about before but did not have any deep technical background knowledge on. The goals when building it were fault tolerance, scalability and concurrency. Basically an easy way to scale up and out. Built in Scala, Akka also comes with Java bindings.
Akka is built around the actor model for easier distribution. Actors are isolated, communicate only via messages and have no shared memory - making it easy to run them in a distributed way without having to worry about synchronization. Distribution across machines is currently based on protocol buffers and NIO. However the resulting network topology is still hard wired during development time.
The goal of new Akka developments is to make roll-out dynamic and adaptive. For that they came up with a zookeeper based virtual address resolution, configurable load balancing strategies and the option for reconfiguration during runtime.
The first day was filled with lots of technical talks - so several remained more on the overview/introductory level - which is a good thing to learn about new technologies. In addition there were a few presentations on new features of upcoming and past releases for instance for Java 7 and Sprint 3.1 - it’s always nice to learn about the rational behind changes and improvements.
As for the agile talks - most of them propagated pretty innovative ideas that need a lot of courage to put into practice. However in several cases I could not help but get the feeling that either the processes presented were very specific to the environment they were established in and would not survive sudden stress - be it decline in revenue or team issues. In addition quite a few ideas that were introduced as novelties were already inherent in existing processes: Trust and natural communication really is the goal when establishing things like Scrum. In the end, the meetings are just the tool to get there. Clarity wrt to vision and even business value is core to prioritizing work to be done. Understanding and finding suitable metrics to measure and monitor business value of a product should be at the heart of any development project.
Overall the first day brought together a good crowd of talented people exchanging interesting ideas, news on current projects and technical details of battle-field-stories. Being still rather small, the Amsterdam edition of GoTo con certainly made it easy to get in touch with speakers as well as other attendees over a cup of coffee and discuss the presented issues. Huge thanks to the organizers for putting together an interesting schedule, booking a really tasty meal and having a friendly answer to any question from confused attendees.
Finally getting to spend some days of vacation in the bay area soon - so I thought I might try to leverage the lazy web on recommendations on what not to miss while there. So far on our list:
- Rent a canoo in Sausalito
- Take a tour to Alcatras
- Get to drive down along the coast towards Half Moon Bay
- Spend one or two days strolling through the city to take photos
- Do one of these touristic bike trips across the Golden Gate bridge
- Spend a day or two to get hiking in one of the red wood tree forests
- Got a recommendation to drive up to Lake Tahoe for two days
- Get to see Napa valley
- Spend a day on whale watching
So that already makes up for ten days - however if you spot anything important and beautiful missing, please let me know. If you have any recommendation on where to spend Halloween - please also get in touch. If you are up to meeting for a cup of coffee or dinner and talk about Mahout, HBase, Hadoop and friends - we certainly can fit that into the schedule somewhere.
It was several years ago that a frequent traveller told me about it being more comfortable and time saving to travel mid-sized distances (e.g. from Berlin to Amsterdam) by train - night train that is - instead of flying. Back then without decent knowledge of which combination of booking early and discount card make prizes of trips by train somewhat comparable to those offered by airlines that wasn’t really an option for me.
It took me years and a conference in Vienna to re-visit to his proposal: Trying to find ways to reduce the amount of times I fly I looked for alternatives. Going by car clearly is not an option as it’s more time consuming and on such long distances also more stressful. Going by regular train seemed like a waste of time as well. So I re-checked the offers for night trains. The idea of going to bed in Berlin and waking up at my destination just seemed too good.
Currently sitting in a CNL to Amsterdam I have to admit that for now this is the most relaxing way to travel I’ve found so far: Get on board, sleep heavenly (though not as comfortable as in your preferred hotel, beds are still ok), get your breakfast brought to bed in the morning. Mix in meeting friendly people (so far maybe I was just lucky) that are either discovering Europe as backpackers (met one from Australia and one from Canada yesterday) or happen to be taking that train as part of their weekly commute.
I think at least for most European cities I’m converted now
Link out: Click here
Start Date: 2011-10-12
End Date: 2011-10-14
This week late Tuesday night I am going to leave for GoTo con in Amsterdam. Train tickets are already booked - this is going to be my first trip with City Night line, will see how great they are.
GoTo Amsterdam features a special Apache track as well as several talk on scaling up, searching, but also includes stuff in general architectural decisions. If you have not registered yet - use dros200 as promotion code to get a discount on the registration prize.
Looking forward to seeing you in Amsterdam later this week.