ApacheConNA: Misc

2013-05-15 20:26

In his talk on SPDY, Matthew Steele explained how he implemented the SPDY protocol
as an Apache httpd module - working around most of the safety measures and
design decisions in the current httpd version. Essentially, to get httpd to
support the protocol all you need now is mod_spdy plus a modified version of

The keynote on the last day was given by the Puppet founder. Some interesting
points to take away from that:

  • Though hard in the beginning (and half way through, and after years) it
    is important to learn to give up control: It usually is much more productive and
    leads to better results to encourage people to do something than to be
    restrictive about it. A single developer only has so much bandwidth - by
    farming tasks out to others - and giving them full control - you substantially
    increase your throughput without having to put in more energy.

  • Be transparent - it's ok to have commercial goals with your project. Just
    make sure that the community knows about them and is not surprised to learn
    about them later.

  • Be nice - not many succeed at this, not many are truly able to ignore
    religion (vi vs. emacs). This also means being welcoming to newbies, hustling
    at conferences, and engaging the community as opposed to announcing changes.

Overall good advice for those starting or working on an OSS project and seeking
to increase visibility and reach.

If you want to learn more on what other talks were given at ApacheCon NA or want to follow up in more detail on some of the talks described here check out the slides archive online.

ApacheConNA: Hadoop metrics

2013-05-14 20:25

Have you ever measured the general behaviour of your Hadoop jobs? Have you
sized your cluster accordingly? Do you know whether your workload really is IO
bound or CPU bound? Legend has it no one except Allen Wittenauer over at
LinkedIn, formerly Y!, ever did this analysis for his clusters.

Steve Watt gave a pitch for actually going out into your datacenter, measuring
what is going on there and adjusting the deployment accordingly: In small
clusters it may make sense to rely on RAIDed disks instead of additional
storage nodes to guarantee ``replication levels''. When going out to vendors to
buy hardware don't rely on paper calculations only: Standard servers in Hadoop
clusters are 1 or 2U. This is quite unlike the beefy boxes being sold otherwise.

Figure out what reference architecture is being used by partners, run your
standard workloads, adjust the configuration. If you want to benchmark your
hardware and system configuration, run the 10TB TeraSort. Make sure to
capture data during all your runs - have Ganglia or SAR running, watch out for
interesting behaviour in IO rates, cpu utilisation, network traffic. The goal is
to get the cpu busy, not wait for network or disk.

After the instrumentation and trial run look for over- and underprovisioning,
adjust, lather, rinse, repeat.

Also make sure to talk to the datacenter people: There are floor space, power
and cooling constraints to keep in mind. Don't let the whole datacenter go down
because your cpu intensive job is drawing more power than the DC was designed
for. There are also power constraints per floor tile due to cooling issues -
those should dictate the design.

Take a close look at the disks you deploy: SATA vs. SAS can make a 40%
performance difference at a 20% cost difference. Also the number of cores per
machine dictates the number of disks to spread the likelihood of random read
access. As a rule of thumb - in a 2U machine today there should be at least
twelve large form factor disks.

When it comes to controllers the goal should be to get a dedicated lane to each
disk; save on the controller if price is an issue. Trade off compute power
against power consumption.

Designing your network keep in mind that one switch going down means that one
rack will be gone. This may be a non-issue in a Y! size cluster, in your
smaller scale world it might be worth the money investing in a second switch
though: Having 20 nodes go black isn't a lot of fun if you cannot farm out the
work and re-replication to other nodes and racks. Also make sure to have enough
ports in rack switches for the machines you are planning to provision.

Avoid playing the ops whack-a-mole game by having one large cluster in the
organisation rather than many different ones where possible. Multi-tenancy in Hadoop is
still premature though.

If you want to play with future deployments - watch out for HP currently
packing 270 servers into the space where today there are just two, via system on a chip designs.

ApacheConNA: Monitoring httpd and Tomcat

2013-05-13 20:23

Monitoring - a task generally neglected - or overdone - during development.
But still vital enough to wake up people from well earned sleep at night when
done wrong. Rainer Jung provided some valuable insights on how to monitor Apache httpd and Tomcat.

Of course failure detection, alarms and notifications are all part of good
monitoring. However so are the avoidance of false positives as well as metric
collection and visualisation in advance, to help with capacity planning and
uncover irregular behaviour.

In general the standard pieces being monitored are load, cache utilisation,
memory, garbage collection and response times. What we do not see from all that
are times spent waiting for the backend, looping in code, blocked threads.

When it comes to monitoring Java - JMX is pretty much the standard choice. Data
is grouped in management beans (MBeans). Each Java process has default beans,
on top there are beans provided by Tomcat, on top there may be application
specific ones.

For remote access, there are Java clients that know the protocol - the server
must be configured though to accept their connection. Keep in mind to open the
firewall in between as well if there is any. Well known clients include
JVisualVM (nice for interactive inspection) and jmxterm as a command line client.

The only issue: Most MBeans encode source code structure, where what you really
need is change rates. In general those are easy to infer though.
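Turning those cumulative counters into change rates is a matter of polling and differencing. A minimal sketch (the class and names are mine, not from the talk):

```java
// Counters exposed via MBeans are cumulative; a monitoring poller turns
// them into rates by differencing successive samples.
public class RateSampler {
    private long lastValue;
    private long lastTimeMs;

    public RateSampler(long value, long timeMs) {
        lastValue = value;
        lastTimeMs = timeMs;
    }

    /** Returns events per second since the previous sample. */
    public double sample(long value, long timeMs) {
        double rate = (value - lastValue) * 1000.0 / (timeMs - lastTimeMs);
        lastValue = value;
        lastTimeMs = timeMs;
        return rate;
    }

    public static void main(String[] args) {
        RateSampler s = new RateSampler(1000, 0);
        // 600 new requests over 60 seconds -> 10 requests/s
        System.out.println(s.sample(1600, 60_000));
    }
}
```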

On the server side for Tomcat there is the JMXProxy in the Tomcat manager that
exposes MBeans. In addition there is Jolokia (including JSON serialisation) or
the option to roll your own.

So what kind of information is in MBeans:

  • OS - load, process cpu time, physical memory, global OS level
    stats. As an example: Dividing process cpu time by elapsed time gives you the
    average cpu utilisation.

  • Runtime MBean gives uptime.

  • Threading MBean gives information on count, max available threads etc

  • Class Loading MBean should stay stable unless you are using dynamic
    languages or have enabled class unloading for JSPs in Tomcat.

  • Compilation contains HotSpot compiler information.

  • Memory contains information on all regions thrown in one pot. If you need
    more fine grained information look out for the Memory Pool and GC MBeans.
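The default beans listed above are available in-process through the standard java.lang.management API; a minimal sketch that prints a few of them:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.RuntimeMXBean;
import java.lang.management.ThreadMXBean;

// Reads a handful of the default platform MBeans of the current JVM.
public class MBeanPeek {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        RuntimeMXBean rt = ManagementFactory.getRuntimeMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();

        System.out.println("load average: " + os.getSystemLoadAverage());
        System.out.println("uptime (ms):  " + rt.getUptime());
        System.out.println("live threads: " + threads.getThreadCount());
        System.out.println("heap used:    " + mem.getHeapMemoryUsage().getUsed());
    }
}
```

The same beans are what a remote JMX client like JVisualVM or jmxterm shows you over the wire.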

As for Tomcat specific things:

  • Threadpool (for each connector) has information on size and the number of
    busy threads.

  • GlobalRequestProc has request counts, processing times, max time, bytes
    received/sent, error count (those that Tomcat notices that is).

  • RequestProcessor exists once per thread, it shows if a request is
    currently running and for how long. Nice to see if there are long running
    requests.

  • DataSource provides information on Tomcat provided database connections.

Per Webapp there are a couple of more MBeans:

  • ManagerMBean has information on session management - e.g. session
    counter since start, login rate, active sessions, expired sessions, max active
    sessions since restart (here a reset is possible), number of rejected
    sessions, average alive time, processing time it took to clean up sessions,
    create and expire rate for the last 100 sessions.

  • ServletMBean contains request count, accumulated processing time.

  • JspMBean (together with activated loading/unloading policy) has
    information on unload and reload stats and provides the max number of loaded
    JSPs.

For httpd the goals with monitoring are pretty similar. The only difference is
the protocol used - in this case provided by the status module. As an
alternative use the scoreboard connections.

You will find information on

  • restart time, uptime

  • server load

  • total number of accesses and traffic

  • idle workers and number of requests currently processed

  • cpu usage - though that is only accurate when all children are stopped
    which in production isn't particularly likely.

Lines that indicate what threads do contain waiting, request read, send reply
- more information is documented online.
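The machine-readable variant of the status page (?auto) emits one ``Key: value'' pair per line, which makes it easy to scrape. A minimal sketch of a parser plus one derived rate (class name and sample values are mine; feed it the body of http://yourhost/server-status?auto):

```java
import java.util.HashMap;
import java.util.Map;

// Parses mod_status "?auto" output ("Key: value" per line) into a map.
// Fetching the page itself is left out.
public class StatusParser {
    public static Map<String, String> parse(String body) {
        Map<String, String> metrics = new HashMap<>();
        for (String line : body.split("\n")) {
            int colon = line.indexOf(": ");
            if (colon > 0) {
                metrics.put(line.substring(0, colon), line.substring(colon + 2));
            }
        }
        return metrics;
    }

    public static void main(String[] args) {
        String sample = "Total Accesses: 131\nUptime: 600\nBusyWorkers: 2\nIdleWorkers: 8";
        Map<String, String> m = parse(sample);
        // average requests per second since restart = Total Accesses / Uptime
        double rps = Double.parseDouble(m.get("Total Accesses"))
                   / Double.parseDouble(m.get("Uptime"));
        System.out.println("req/s since restart: " + rps);
    }
}
```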

When monitoring make sure to monitor not only production but also your stress
tests to make meaningful comparisons.

ApacheConNA: On Security

2013-05-12 20:22

During the security talk at ApacheCon a topic commonly glossed over by
developers was covered in quite some detail: With software being developed that
is deployed rather widely online (over 50% of all websites are powered
by the Apache webserver) naturally security issues are of large concern.

Currently there are eight trustworthy people on the foundation-wide security
response team, subscribed to security@apache.org. The team was started by
William A. Rowe when he found a vulnerability in httpd. The general work mode -
as opposed to the otherwise ``all things open'' way of doing things at Apache -
is to keep the issues found private until fixed, and then publicise them widely.

So when running Apache software on your servers - how do you learn about
security issues? There is no such thing as a priority list for specific
vendors. The only way to get an inside scoop is to join the respective
project's PMC list - that is to get active yourself.

So what is being found? 90% of all security issues are found by security
researchers. The remaining 10% are usually published accidentally - e.g. by
users submitting the issue through the regular public bug tracker of the
respective project.

In Tomcat currently no issue was disclosed w/o letting the project know. httpd
still is the prime target - even of security researchers who are in favour of
a full disclosure policy - the PMC cannot do a lot here other than fix issues
quickly (usually within 24 hours).

As a general rule of thumb: Keep your average release cycle time in mind - how
long will it take to get fixes into people's hands? Communicate transparently
which version will get security fixes - and which won't.

As for static analysis tools - many of those are written for web apps and as
such not very helpful for a container. What is highly dangerous in a web app
may just be the thing the container has to do to provide features to web apps.
As for Tomcat, they have made good experiences with Findbugs - most others have
too many false positives.

When dealing with a vulnerability yourself, try to get guidance from the
security team on what actually is a security vulnerability - though the final
decision is with the project.

Dealing with the tradeoff of working in private vs. exposing users affected by
the vulnerability to attacks is up to the PMC. Some work in public but call the
actual fix a refactoring or cleanup. Given enough coding skills on the attacker
side this of course will not help too much as sort of reverse engineering what
is being fixed by the patches is still possible. On the other hand doing
everything in private on a separate branch isn't public development anymore.

After this general introduction Mark gave a good overview of the good, the bad
and the ugly way of handling security issues in Tomcat. For his slides
(including an anecdote of what according to the timing and topic looks like it
was highly related to the 2011 Java hash collision talk at the Chaos Communication
Congress) check out the slides archive online.

ApacheConNA: On documentation

2013-05-11 20:20

In her talk on documentation in OSS projects Noirin gave a great wrap up of the topic of
what documentation to create for a project and how to go about that task.

One way to think about documentation is to keep in mind that it fulfills
different tasks: There is conceptual, procedural and task-reference
documentation. When starting to analyse your docs you may first want to debug
the way it fails to help its users: ``I can't read my mail'' really could mean
``My computer is under water''.

A good way to find awesome documentation can be to check out Stackoverflow
questions on your project, blog posts and training articles. Users today really
are searching instead of browsing docs. So where to find documentation actually
is starting to matter less. What does matter though is that those pages with
relevant information are written in a way that makes it easy to find them
through search engines: Provide a decent title, stable URLs, reasonable tags
and descriptions. By the way, both infra and docs people are happy to talk to
*good* SEO guys.

In terms of where to keep documentation:

For conceptual docs that need regular review it's probably best to keep them in
version control. For task documentation steps should be easy to upgrade once
they fail for users. Make sure to accept bug reports in any form - be it on
Facebook, Twitter or in your issue tracker.

When writing good documentation always keep your audience in mind: If you don't
have a specific one, pitch one. Don't try to cater for everyone - if your docs
are too simplistic or too complex for others, link out to further material.
Understand their level of understanding. Understand what they will do after
reading the docs.

On a general level always include an about section, a system overview, a
description of when to read the doc, how to achieve the goal, provide
examples, provide a troubleshooting section and provide further information
links. Write breadth first - details are hard to fill in without a bigger
picture. Complete the overview section last. Call out context and
prerequisites explicitly, don't make your audience do more than they really
need to do. Reserve the details for a later document.

In general the most important and most general stuff as well as the basics
should come first. Mention the number of steps to be taken early. When it comes
to providing details: The more you provide, the more important the reader will
deem that part.

ApacheConNA: On delegation

2013-05-10 20:19

In her talk on delegation Deb Nicholson touched upon a really important topic in
OSS: Your project may live longer than you are willing to support it yourself.

The first important point about delegation is to delegate - and to not wait
until you have to do it. Soon you will realise that mentoring and delegation
actually is a way to multiply your resources.

In order to delegate, you need people to delegate to. To find those it can be
helpful to understand what motivates people to work in general as well as on
open source in particular: Sure, fixing a given problem and working on great
software projects may be part of it. As important though is recognition
individually and in groups of people.

Keeping that in mind, ``Thanking'' is actually a license to print free money in
the open source world. Do it in a verbose manner to be believable, do it in
public and in a way that makes your contributors feel a little bit of glory.

Another way to lead people in is to help out socially: Facilitate connections,
suggest connections, introduce people. Based on the diversity of the project
you are working on you may be in a way larger network and have access to many
more corporations and communities than any peer who is not active. Use that
network.

Also when leading OSS projects keep an eye on people being rude: Your project
should be accessible to facilitate participation.

In case of questions treat them as a welcome opportunity to pull a new
community member in: Answer quickly, answer on your list, delegate to middle
seniors to pull them in. Have training missions for people who want to get
started and don't know your tooling yet. Have prepared documents to provide
links to in case questions occur.

In Apache we tend to argue people should not fall victim to volunteeritis.
Another way to put that is to make sure to avoid the licked cookie syndrome:
When people volunteer to do a task and never re-appear that task is tainted
until explicitly marked as ``not taken'' later on. One way to automate that is
to have a fixed deadline after which tasks are automatically marked as free to
take and tackle by anyone.

When it comes to the question of When to write documentation: There really is
no point in time that should stop you from contributing docs - all the way from
just above getting started level (writing the getting started docs for those
following you) up to the ``I'm an awesome super-hacker'' mode for those trying
to hack on similar areas.

Especially when delegating to newbies make sure to set the right expectations:
How long is it going to take to fix an issue, what is the task complexity, tell
them who is going to be involved, who is there to help out in case of road
blocks.

In general make sure to be a role model for the behaviour you want in your
project: Ask questions yourself, step back when you have taken on too much,
appreciate people stepping back.

Understand the motivation of your newcomers - try to talk to them one on one
to understand their motivation and help to align work on the project with their
life goal. When starting to delegate, start with tasks that seem too small to
delegate at all to get new people familiar with the process - and to get
yourself familiar with the feeling of giving up control. Usually you will need
to pull tasks apart that before were done by one person. Don't look for a
person replacement - instead look for separate tasks and how people can best
perform these.

Make visible and clear what you need: Is it code or reviews? Documentation or
translations, UX helpers? Incentivise what you really need - have code sprints,
gamify the process of creating better docs, put the logo creation up as a
contest.

All of this is great if you have only people who all contribute in a very
positive way. What if there is someone whose contributions are actually
detrimental to the project? How to deal with bad people? They may not even do
so intentionally... One option is to find a task that better suits their
skills. Another might be to find another project for them that better fits
their way of communicating. Talk to the person in question, address head on
what is going on. Talking around or avoiding that conversation usually only
delays and enlarges your problem. One simple but effective strategy can be to
tell people what you would like them to do in order to help them find out that
this is not what they want to do - that they are not the right people for you
and should find a better place.

More on this can be found in material like ``How assholes are killing your
project'' as well as the ``Poisonous people talk'' and the book ``Producing
open source software''.

On the how of dealing with bad people make sure to criticise privately first,
check in a backchannel of other committers for their opinion - otherwise you
might be lonely very quickly. Keep to criticising the behaviour instead of the
person itself. Most people really do not want to be a jerk.

ApacheConNA: First keynote

2013-05-09 20:13

All three ApacheCon keynotes were focussed around the general theme of open
source communities. The first one, given by Theo, had very good advice for the
engineer not only striving to work on open source software but to become an
excellent software developer:

  • Be loyal to the problem instead of to the code: You shouldn't be
    addicted to any particular programming language or framework and refuse to work
    and get familiar with others. Avoid high specialisation and seek cross
    fertilisation. Instead of addiction to your tooling you should seek to
    diversify your toolset to use the best for your current problem.

  • Work towards becoming a generalist: Understand your stack top to bottom -
    starting with your code, potentially passing through the VM it runs in, down to the
    hardware layer. Do the same to requirements you are exposed to: Being 1s old
    may be just good enough to be ``always current'' when thinking of a news
    serving web site. Try to understand the real problem that underpins a certain
    technical requirement that is being brought up to you. This deep understanding
    of how your system works can make the difference in fixing a production issue
    in three days instead of twelve weeks.

The last point is particularly interesting for those aiming to write scalable
code: Software and frameworks today are intended to make development easier -
with high probability they will break when running at the edge.

What is unique about the ASF is the great opportunity to meet people with
experience in many different technologies. In addition there is an unparalleled
level of trust in a community as diverse as the ASF. One open question that
remains is how to leverage this potential successfully within the foundation.

ApacheConNA: Meet the Indian tribe

2013-05-08 20:10
ApacheCon is the ``User Conference of the Apache Software Foundation''. What
does that mean? If you are going to ApacheCon you have the chance of meeting
committers of your favourite projects as well as members of the foundation
itself. Though there are a lot of talks that are interesting from a technical
point of view the goal really is to turn you into an active member of the
foundation yourself. This is true for the North American version even more than
for the European edition.

So why should you as a general user of Apache software be interested in
attending then? Pieter Hintjens put it quite nicely in an interview on his
latest ZeroMQ book with O'Reilly:

If you are using free software in particular in commercial setups you really do
want to know how the project is governed and what it takes to get active and
involved yourself. What would it take to move the project into a direction that
fits your business needs? How do you make sure features you need are actually
being added to the project instead of useless stuff?

ApacheCon is the conference to find out how Apache projects work internally,
the place to be to meet active people in person and put faces to names. Lots of
community building events focus on getting newbies in touch with long term
contributors.

ApacheConEU - part 11 (last part)

2012-11-20 20:35
One of the last sessions covered logging frameworks for Java. Christian Grobmeier started by detailing the common requirements for all logging frameworks:

  • Speed - developers do not want to pay a disproportionate penalty for using a logging framework.
  • Fail-safety and reliability - under no circumstances should your logging framework kill your application. In addition it would be most annoying to find that one log message that would help you decipher the problem your application ran into missing. As obvious as those requirements sound - there are counter examples to both: There is a memory leak in log4j1, and when reconfiguring logback on the fly it may well lose messages.
  • Log frameworks should be backwards compatible: Both, changing the API as well as incompatible configuration file formats aren’t particularly great when wanting to upgrade the logging framework version you use.
  • On top of all that, the way you do logging ultimately is a matter of taste - by now there are several implementations for Java alone catering to needs ranging from pure simplicity to huge flexibility: log4j, logback, java.util.logging, AVSL, tinylog. On top of that there are aggregators like SLF4J and commons logging - even though the speaker himself does contribute to the latter framework, at the current time he still recommends using SLF4J as it is actively being developed, more modern and better supported.

Biggest news shared in the talk was the release of log4j2 which comes with commons-logging and slf4j integration. This version finally gets rid of the if (debug.enabled()) { log.debug(...) } idiom by introducing place holders and variable length argument lists, essentially enabling format strings for logging. Markers can be added to the logging code to allow for later filtering. Writing plugins has been made a whole lot simpler. There is support for an easier to read xml configuration format as well as json configurations. In addition configurations can be set to be reloaded on change in a pre-defined interval. It is slightly slower than logback and log4j on average, though we are still talking about a few milliseconds for a large amount of log messages. Unfortunately those averages did not come with error bars which would have made interpreting them in comparison a bit easier.
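As an illustration, with the placeholder style a guard-free call looks like log.debug("handled {} requests in {} ms", count, millis) - the message is only formatted if debug is enabled. And a minimal log4j2 XML configuration might look like the following sketch (the appender name and pattern are my own choices); the monitorInterval attribute enables the reload-on-change behaviour mentioned above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- monitorInterval: re-check this file for changes every 30 seconds -->
<Configuration status="warn" monitorInterval="30">
  <Appenders>
    <Console name="Stdout" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{ISO8601} [%t] %-5level %logger{36} - %msg%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Stdout"/>
    </Root>
  </Loggers>
</Configuration>
```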

However log4j is not just about logging. It does have sub projects for viewing logs of httpd, log4j and others (called Chainsaw), and logging for PHP and .NET.

Log4j has a rather complex history: As soon as the leader of the project left, activity died away. By now activity has picked up again quite a bit with 4 contributors. They created 6 releases in the last year alone, on top of 600 mailing list messages and a huge amount of commits. Still, log4j too is hiring - if you want to work for free on a fun project that affects nearly every Java developer world wide and work together with awesome coders, this project is seeking new contributors.

If you consider logging boring just be reminded that logs are a valuable source of user activity, leading to features like being able to recommend new products to customers, localise content offerings or even just adjust the default settings of your web page to increase click through. Related to log4j there's Apache Flume dealing with distributed logging, there are challenges in the cloud and mobile space, and Apache Mayhem for logger ingestion.

The last technical session dealt with Apache Buildr - a build system for Java, though not only Java, written in Ruby. The advantage is that it delivers artifact resolution and download from Maven repositories through Ivy plugins, provides greater flexibility through Ruby integration and can fall back to Ant tasks if needed.

The final session was the closing plenary given by Nick Burch. Most notably he invited attendees to ApacheCon 2013 Europe. Looking forward to meeting all of you there. The community edition of ApacheCon was an awesome setup for people to meet and not only pitch their projects but also provide deep technical detail and show off more of what the Apache community is all about. Looking forward to the audio recordings as well as to the videos taken during the conference. CU all next year!

ApacheConEU - part 10

2012-11-19 20:34
In the next session Jukka introduced Tika - a toolkit for parsing content from files including a heuristics based component for guessing the file type: Based on file extension, magic and certain patterns in the file the file type can be guessed rather reliably. Some anecdotes:

  • not all mime types are registered with IANA, there are of course conflicting file extensions,
  • Microsoft Word not only localises their interface but also the magic in the file,
  • html detection is particularly hard as there is quite some overlap with other file formats (e.g. there are such things as html mails...)
  • xhtml parsing is quite reliable by using an actual xml parser for the first few bytes to identify the namespace of the document
  • identifying odf documents is easy - though zipped the magic is preserved uncompressed at a pre-defined location in the file
  • for ooxml the file has to be unpacked to identify it
  • plain text content is hardest - there are a few heuristics based on UTF BOMs, ASCII stats, line ending stats, byte histograms; still it's not foolproof.
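The magic-based part of such detection can be illustrated with a toy sketch (Tika's real detector is far more elaborate, combining magic, file extension and content heuristics; class name and the small magic table are mine):

```java
import java.nio.charset.StandardCharsets;

// Toy magic-byte sniffer: inspects the first bytes of a file to guess
// its media type, as described for Tika's detection above.
public class MagicSniff {
    public static String detect(byte[] head) {
        if (head.length >= 4 && head[0] == '%' && head[1] == 'P'
                && head[2] == 'D' && head[3] == 'F') {
            return "application/pdf";
        }
        if (head.length >= 2 && head[0] == 'P' && head[1] == 'K') {
            // zip container - also the outer shape of odf and ooxml files
            return "application/zip";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xD8) {
            return "image/jpeg";
        }
        // no magic matched - fall back (e.g. plain text needs heuristics)
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(head)); // prints "application/pdf"
    }
}
```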

In addition Tika can extract metadata: For text that can be as easy as encoding, length, content type and comments. For images that is extended by image size and potentially EXIF data. For pdf data it gets even more comprehensive, including information on the file creator, save date and more (the same applies for MS Office documents). Most document metadata is based on the Dublin Core standard, for images there's EXIF and IPTC - soon there'll also be XMP related data that can be discovered.