JAX: Java performance myths

2013-05-22 20:37
This talk was one of the famous talks on Java performance myths by Arno Haase.
His main point, supported by dozens of illustrative examples, was that
software developers should stop trusting the word-of-mouth, cargo-cult myths
that abound among engineers. Again the goal should be to write readable code
above all: for one, the Java compiler and JIT are great at optimising. In
addition, many of the myths spread in the Java community that are claimed to
lead to better performance are simply not true.

It was interesting to learn how many different aspects of both software and
hardware contribute to code performance. Micro benchmarks are considered
dangerous for a reason: it is hard to create a well-controlled environment that
matches what the code will encounter in production, since the results are
influenced by things like just-in-time compilation, CPU throttling, etc.

Some myths that Arno proved wrong include final making code faster (for method
parameters it makes no difference at all, the bytecode is identical with and
without it) and inheritance always being expensive (even with an abstract
class between the interface and the implementation, Java 6 and 7 can still
inline the method in question). Another one concerned the often wrongly scoped
Java vs. C comparisons. One myth revolved around the creation of temporary
objects: since Java 6 and 7, in simple cases even these can be optimised away.

When it comes to (un-)boxing and reflection there is a performance penalty. For
the latter it is mostly for method lookup, not so much for calling the method.
What we are talking about, however, are penalties in the range of about 1000
compute cycles, which is still dwarfed by the cost of any remote call.
Reflection on fields is even cheaper.
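
Since the lookup is where most of the cost sits, one common way to keep
reflection cheap is to look the method up once and reuse the Method object. A
minimal sketch of that idea follows; the class and method names are made up for
illustration and are not from the talk.

```java
import java.lang.reflect.Method;

final class ReflectionDemo {
    // Look the method up once and keep the result around.
    private static final Method LENGTH = lookup();

    private static Method lookup() {
        try {
            return String.class.getMethod("length");
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    static int lengthViaReflection(String s) {
        try {
            // Reuses the cached Method; the int result comes back boxed as Integer.
            return (Integer) LENGTH.invoke(s);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Whether the remaining call overhead matters at all is something to measure in
your own application rather than assume.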

One of the more widespread myths revolved around string concatenation being
expensive: doing "A" + "B" in code will be turned into "AB" in bytecode. Even
doing the same with a variable will be turned into the use of StringBuilder
ever since -XX:+OptimizeStringConcat was turned on by default.
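
The constant folding part is easy to check yourself. The sketch below (class
and method names are made up for illustration) relies on the fact that
compile-time constant strings are interned, so the folded "A" + "B" and the
literal "AB" are the very same object.

```java
final class ConcatDemo {
    // Both operands are literals, so the compiler folds this into the constant "AB".
    static final String FOLDED = "A" + "B";

    // The parameter is not a compile-time constant, so this string is built at run time.
    static String concat(String suffix) {
        return "A" + suffix;
    }

    static boolean foldedIsSameInstanceAsLiteral() {
        // Identity comparison on purpose: compile-time constants are interned.
        return FOLDED == "AB";
    }
}
```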

The main message here is to stop trusting your intuition when reasoning about a
system's performance and performance bottlenecks. Instead the goal should be to
go and measure what is really going on. The myths above are simple examples of
where the average Java intuition goes wrong. If you really want to get the last
bit of speed out of your application, make sure to stay on top of what the JVM
turns your code into and how that is then executed on the hardware you have
rolled out.

JAX: Does parallel equal performant?

2013-05-21 20:34
In general there is a tendency to equate parallel implementations with
performant implementations. Except in the really naive case there is always
going to be some overhead due to scheduling work, managing memory sharing and
network communication. Essentially that knowledge is reflected in Amdahl's law
(the amount of serial work limits the benefit from running parts of your
implementation in parallel, http://en.wikipedia.org/wiki/Amdahl's_law), and,
in the case of queuing, Little's law (http://en.wikipedia.org/wiki/Little's_law).
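
Both laws fit into a line of code each. The following sketch uses the standard
textbook formulas; the class, method and parameter names are my own and not
from the talk.

```java
final class ScalingLaws {
    // Amdahl's law: speedup with n workers when a fraction p of the work can run in parallel.
    static double amdahlSpeedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    // Little's law: average items in the system = arrival rate * average time in the system.
    static double littleItemsInSystem(double arrivalsPerSecond, double secondsInSystem) {
        return arrivalsPerSecond * secondsInSystem;
    }
}
```

With 90% of the work parallelisable, amdahlSpeedup(0.9, n) stays below 10 no
matter how many workers n you add, which is exactly the limit imposed by the
serial 10%.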

When looking at current Java optimisations there is quite a bit going on to
support better parallelisation: work is being done to reduce lock contention,
the GC adaptive sizing policy has been improved to a usable state, and there is
added support for parallel arrays and the lambdas' splittable interface.

When it comes to better locking optimisations, what is most notable is work
towards coarsening locks at compile and JIT time (essentially moving locks from
the inside of a loop to the outside); eliminating locks if objects are being
used in a local, non-threaded context anyway; and support for biased locking
(that is, only paying for full locking once a second thread tries to access an
object). All three taken together can lead to performance improvements that
make StringBuffer and StringBuilder perform almost equally in a
single-threaded context.
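
To make lock coarsening a bit more tangible, here is a hand-written sketch of
the transformation. It only illustrates the idea: the class and method names
are made up, and whether a given JVM actually coarsens (or, since the buffer
never leaves the method, removes the locks altogether) is up to the JIT.

```java
final class LockCoarseningDemo {
    // As written by the developer: the lock is taken and released on every iteration.
    static String lockPerIteration(int n) {
        StringBuffer sb = new StringBuffer(); // every append() is synchronized on sb
        for (int i = 0; i < n; i++) {
            sb.append(i);
        }
        return sb.toString();
    }

    // Hand-written equivalent of coarsening: one lock around the whole loop.
    static String lockAroundLoop(int n) {
        StringBuffer sb = new StringBuffer();
        synchronized (sb) {
            for (int i = 0; i < n; i++) {
                sb.append(i); // re-entrant acquisition of the lock that is already held
            }
        }
        return sb.toString();
    }
}
```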

For pieces of code that suffer from false sharing (two variables used
independently in separate threads that end up in the same CPU cache line and as
a result are both flushed on update) there is a new annotation: adding the
"@Contended" annotation tells the JVM where to add cache line padding (or to
re-arrange fields entirely) to avoid false sharing. Another way to avoid false
sharing seems to be to look for class cohesion: cohesive classes where methods
and variables are closely related tend to suffer less from false sharing. If
you would like to view the resulting layout, use the "-XX:+PrintFieldLayout"
option.

Java 8 will bring a few more notable improvements including changes to the
adaptive sizing GC policy, the introduction of parallel arrays that allow for
parallel execution of predicates on array entries, changes to the concurrency
libraries, and internalised iterators.

JAX: Pigs, snakes and deaths by 1k cuts

2013-05-20 20:32
In his talk on performance problems Rainer Schuppe gave a great introduction to
which kinds of performance problems can be observed in production and how to
best root-cause them.

Simply put, performance issues usually arise due to a difference in either data
volume, concurrency levels or resource usage between the dev, QA and production
environments. The tooling to uncover and explain them is pretty well known:
starting with looking at log files, there are ARM tools, aspects, bytecode
instrumentation, sampling, watching JMX statistics, and PMI tools.

All of these tools have their own unique advantages and disadvantages. With
logs you get the most freedom, however you have to know what to log at
development time. In addition logging is I/O heavy, so doing too much can slow
down the application itself. In a common distributed system logs need to be
aggregated somehow. A simple example of what can go wrong is cascading
exceptions spilled to disk that cause machines to run out of disk space one
after the other. When relying on logging make sure to keep transaction
contexts, in particular transaction ids, across machines and services to
correlate outages. In terms of tool support, look at Scribe, Splunk and Flume.
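
A minimal sketch of carrying a transaction id through the log lines of one
request could look like the following. All names here are hypothetical; in a
real system the id would arrive with the request (for example in a header) and
be passed on to every downstream service.

```java
import java.util.UUID;

final class TxLog {
    // Holds the transaction id of the request currently handled by this thread.
    private static final ThreadLocal<String> TX_ID = new ThreadLocal<>();

    static void begin(String incomingId) {
        // Reuse the caller's id if one was passed along, otherwise start a new one.
        TX_ID.set(incomingId != null ? incomingId : UUID.randomUUID().toString());
    }

    static String currentId() {
        return TX_ID.get();
    }

    // Prefixes the message with the id so log lines can be correlated across machines.
    static String format(String message) {
        return "[tx=" + TX_ID.get() + "] " + message;
    }

    static void end() {
        TX_ID.remove(); // avoid leaking ids into the next request on a pooled thread
    }
}
```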

A tool often used for tracking down performance issues in development is the
well-known profiler. Usually it creates lots of very detailed data. However, it
is most valuable in development: in production, profiling a complete server
stack produces way too much load and data to be feasible. In addition, there is
usually no transaction context available for correlation.

A third way of watching applications do their work is to watch via JMX. This
capability is built in for any Java application, in particular for servlet
containers. Again there is no transaction context. Unless you take care of it
yourself, there won't be any historic data.
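
As a small sketch of what is available out of the box, the platform MXBeans
can also be read from inside the JVM itself. The class below (its name and the
line format are my own choice) samples the current heap usage; writing such
lines somewhere periodically is one simple way to get the missing history.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

final class JmxHeapSample {
    // Reads the current heap usage from the platform MemoryMXBean of this JVM.
    static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return heap.getUsed();
    }

    // One line of "history": a timestamp plus the sampled value.
    static String sampleLine() {
        return System.currentTimeMillis() + " heapUsedBytes=" + usedHeapBytes();
    }
}
```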

When it comes to diagnosing problems, you are essentially left with fixing
either the "it does not work" case or the "it is slow" case.

For the "it is slow" case there are a few incarnations:

  • It was always slow, we got used to it.

  • It gets slow over time.

  • It gets slower exponentially.

  • It suddenly gets slow.

  • There is a spontaneous crash.

In the case of "it does not work" you are left with the following observations:

  • Sudden outages.

  • Always flaky.

  • Sporadic error messages.

  • Silent death.

  • Increasing error rates.

  • Misleading error messages.

In the end you will always be spinning in a loop: look at symptoms, eliminate
non-causes, identify suspects, then confirm or eliminate them by comparing to
normal behaviour. If not done with that: lather, rinse, repeat. When it comes
to causes for errors and slowness you will usually run into one of the
following: bad coding practices, too much load, missing backends, resource
conflicts, memory and resource leakage, as well as hardware/networking issues.

Some symptoms you may observe include foreseeable lock ups (it's always slow
after four hours, so we just reboot automatically before that), consistent
slowness, sporadic errors (it always happens after a certain request came in),
getting slower and slower (most likely leaking resources), sudden chaos (e.g.
someone pulling the plug or someone removing a hard disk), and high utilisation
of resources.

Linear memory leak

In case of a linear memory leak, the application usually runs into an OOM
eventually, getting ever slower before that due to GC pressure. Reasons could
be linear structures being filled but never emptied. What you observe are
growing heap utilisation and growing GC times. In order to find such leakage
make sure to turn on verbose GC logging and take heap dumps. One challenge
though: it may be hard to find the leakage if the problem is not one large
object, but many, many small ones that bleed the application to death by 1000
cuts.
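
A typical "filled but never emptied" structure is a home-grown cache. One
possible remedy, sketched here with a made-up class name, is to put a hard
bound on it so that old entries get evicted; the standard LinkedHashMap offers
a hook for exactly that.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // true = access order, so the eldest entry is the least recently used one
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the eldest entry.
        return size() > maxEntries;
    }
}
```

Note that this class is not thread safe; it is only meant to show how a bound
keeps the heap from growing linearly.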

In development and testing you will do heap comparisons. Keep in mind that
taking a heap dump causes the JVM to stop. You can use common profilers to look
at the heap dump. There are variants that help with automatic leak detection.

A variant is the pig in a python issue where sudden unusually large objects
cause the application to be overloaded.

Resource leaks and conflicts

Another common problem is leaking resources other than memory; not closing
file handles is one incarnation. Those problems cause slowness over time and
may lead to the heap growing over time, though usually that is not the most
visible problem. If instance tracking does not help here, your last resort
should be doing code audits.

In case of conflicting resource usage you usually face code that was developed
with overly cautious locking and data integrity constraints. The way to go is
to take thread dumps to uncover threads in block and wait states.
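
A full thread dump is usually taken with external tooling, but a coarse summary
of thread states can also be produced from inside the JVM. The following is a
minimal sketch with a made-up class name; a large share of BLOCKED or WAITING
threads would be the hint to look closer.

```java
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateSummary {
    // Counts live threads per state, a coarse summary of what a full thread dump shows.
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            Thread.State state = t.getState();
            Integer old = counts.get(state);
            counts.put(state, old == null ? 1 : old + 1);
        }
        return counts;
    }
}
```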

Bad coding practices

When it comes to bad coding practices, what is usually seen is code in endless
loops (easy to see in thread dumps) and CPU-bound computations where no result
caching is done. Layeritis with too much (de-)serialisation can also be a
problem. In addition there is a general "the ORM will save us all" problem that
may lead to massive SQL statements, or to using the wrong data fetch strategy.
When it comes to caching: if caches are too large, access times of course grow
as well. There could be never-ending retry loops or forever-blocking network
calls. Also people tend to catch exceptions but not do anything about them
other than adding a little #fixme annotation to the code.
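
As a counter-example to the never-ending retry loop and the swallowed
exception, here is a sketch of a bounded retry that gives up and surfaces the
last failure. The helper name, the attempt limit and the linear backoff are my
own assumptions, not something shown in the talk.

```java
import java.util.concurrent.Callable;

final class Retry {
    // Calls the task up to maxAttempts times, sleeping between attempts, then gives up.
    static <T> T withRetries(Callable<T> task, int maxAttempts, long backoffMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // remember the failure instead of swallowing it
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt); // simple linear backoff
                }
            }
        }
        throw last; // surface the last failure to the caller
    }
}
```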

When it comes to locking you might run into dead-/live-lock problems. There
could be chokepoints (resources that all threads need for each processing
chain). In a thread dump you will typically see lots of threads in wait instead
of block states.

In addition there could be internal and external bottlenecks. In particular
keep those in mind when dealing with databases.

The goal should be to find an optimum for your application between too many too
small requests that waste resources getting dispatched, and one huge request
that everyone else is waiting for.

JAX: Java HPC by Norman Maurer

2013-05-19 20:31
For slides see also: Speakerdeck: High performance networking on the JVM

Norman started his talk by clarifying what he means by high scale: in his talk
anything above 1000 concurrent connections is considered high scale, while
anything below 100 concurrent connections is fine to be handled with threads
and blocking IO. Before tuning anything, make sure to measure whether you have
any problem at all: readability should always go before optimisation.

He gave a few pointers as to where to look for optimisations: get started by
studying the socket options - TCP_NODELAY as well as the send and receive
buffer sizes are most interesting. When under GC pressure (check the GC logs
to figure out if you are) make sure to minimise allocation and deallocation of
objects. In order to do that consider making objects static and final where
possible. Make sure to use CMS or G1 for garbage collection in order to
maximise throughput. Size areas in the JVM heap according to your access
patterns. The goal should always be to minimise the chance of running into a
stop-the-world garbage collection.
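
For plain java.net sockets the options mentioned above map to three setters. A
minimal sketch, with a made-up helper name and buffer sizes that are purely
placeholders to be replaced by measured values:

```java
import java.io.IOException;
import java.net.Socket;

final class SocketTuning {
    // Applies the options mentioned above; the buffer sizes are only hints to the OS.
    static void tune(Socket socket, int sendBytes, int receiveBytes) throws IOException {
        socket.setTcpNoDelay(true);                // TCP_NODELAY: disable Nagle's algorithm
        socket.setSendBufferSize(sendBytes);       // SO_SNDBUF
        socket.setReceiveBufferSize(receiveBytes); // SO_RCVBUF
    }
}
```

The operating system is free to round or cap the buffer sizes, so read them
back with getSendBufferSize() and getReceiveBufferSize() if the exact values
matter.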

When it comes to using buffers you have the choice of using direct or heap
buffers. While the former are expensive to create, the latter come with the
cost of being zeroed out. Often people start buffer pooling, potentially
initialising the pool in a lazy manner. In order to avoid memory fragmentation
in the Java heap, it can be a good idea to create the buffer at startup time
and re-use it later on.

In particular when parsing structured messages, as are common in protocols, it
usually makes sense to use gathering writes and scattering reads to minimise
the number of system calls for reading and writing. Also try to buffer more if
you want to minimise system calls. Use slice and duplicate to create views on
your buffers to avoid memory copies. Use a file channel when copying files
without modifications.
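
To illustrate the slice and duplicate advice, here is a small sketch of a
helper (the class and method names are made up) that hands out a view on part
of a larger buffer without copying any bytes and without touching the position
of the original.

```java
import java.nio.ByteBuffer;

final class BufferViews {
    // Returns a view on bytes [offset, offset + length) of the buffer without copying them.
    static ByteBuffer view(ByteBuffer buffer, int offset, int length) {
        ByteBuffer window = buffer.duplicate(); // shares content, has its own position and limit
        window.limit(offset + length);
        window.position(offset);
        return window.slice(); // new buffer starting at offset, still sharing the content
    }
}
```

A write through the view is visible in the underlying buffer, which is what
makes this useful for carving headers and payloads out of one pooled buffer.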

Make sure you do not block - think of DNS servers being unavailable or slow as
an example.

As a parting note, make sure to define and document your threading model. It
may ease development to know that some objects will always only be used in a
single threaded context. It usually helps to reduce context switches as well as
keeping data in the same thread to avoid having to use synchronisation and the
use of volatile.

Also make a conscious decision about which protocol you would like to use for
transport: in addition to TCP there are also UDP, UDT and SCTP. Use pipelining
in order to parallelise.

Devoxx – Day one – Java, Performance and Devops

2010-12-15 21:22
In his keynote Mark Reinhold provided some information on the very interesting features to be included in the Java 7 release. Generics will be easier to declare with the diamond operator. Nested try-finally constructs that are nowadays needed to safely close resources will no longer be necessary: there will be the option of implementing a Closeable interface supporting a method close() that gets called whenever objects of that class's type go out of scope. That way resources can be freed automatically. Though different in concept, it still reminds me a lot of the functionality typically provided by destructors in C++.
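
In the released Java 7 this feature is the try-with-resources statement, which works with anything implementing java.lang.AutoCloseable (java.io.Closeable extends it). A minimal sketch with a hypothetical resource class that merely records whether close() was called:

```java
final class TryWithResourcesDemo {
    // A hypothetical resource that records whether close() was called.
    static final class TrackedResource implements AutoCloseable {
        boolean closed = false;

        String read() {
            return "data";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    static TrackedResource useAndReturn() {
        TrackedResource handle = null;
        try (TrackedResource resource = new TrackedResource()) {
            handle = resource;
            resource.read();
        } // close() is called here, also when the body throws
        return handle;
    }
}
```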

The support for lambda operators and direct method references that will greatly help reduce clutter due to nested inner classes has been postponed to later Java releases. Though it took 4 years to come up with the Java 7 release, new features are pretty much limited. However the current roadmap looks pretty much release date driven. The intention seems to be to get developers focussed on a limited set of reachable features to finally get the release out into the hands of users.

The speaker claimed that Oracle remains committed to Java development, first and foremost because it is a heavy Java user itself, but also in order to generate revenue indirectly (through selling support and consulting for Java related products) and directly (through Java support), and to reduce internal development cost and Java friction.

Though Oracle had a JVM implementation of its own (JRockit), development of HotSpot will be continued, mostly due to a larger number of developers being familiar with HotSpot. However the monitoring and diagnosis tooling that was superior in JRockit is supposed to be ported to HotSpot.

In the core Java session I also went to the talk on Java performance analysis by Joshua Bloch. He did a good job bringing the topic of performance analysis on complex systems to software developers. In ancient times it was quite easy to estimate a piece of code's performance by static code analysis. Looking at the expression if (condition && secondCondition), it is still commonly considered to be faster to use “&&” over “&”. However with current CPU architectures that make heavy use of instruction pipelines, it heavily depends on their branch prediction heuristics whether this statement is still true. Dirtying the pipeline by using && may well be more expensive than doing the extra evaluation. General message: the performance of your code in a real world system depends on the hardware it runs on, the operating system as well as the exact VM version used. Estimating performance based on static analysis only is no longer possible.
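
The semantic difference between the two operators is easy to show; which one is faster on a given machine is exactly the thing that has to be measured. In the sketch below (names made up for illustration) a counter records how many operands actually get evaluated.

```java
final class ShortCircuitDemo {
    static int evaluations = 0;

    static boolean check(boolean value) {
        evaluations++; // count how often an operand is actually evaluated
        return value;
    }

    static boolean shortCircuit() {
        return check(false) && check(true); // right operand is skipped, which implies a conditional jump
    }

    static boolean eager() {
        return check(false) & check(true); // both operands are always evaluated
    }
}
```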

However even when doing benchmarks one might well reach false conclusions. It is common knowledge that a benchmark on a VM needs to be run multiple times: VM warm-up phases are well known to developers, so the common performance pattern for one specific function usually looks like this:

However even when repeating the test on the same machine multiple times, the values seen after warm-up may be skewed substantially. The only remedy against reaching false conclusions is to do several VM runs, average over the runs (and provide the median etc., which are less susceptible to outliers) and provide error bars for each averaged run. When comparing two different implementations the only way to reliably tell which one is better than the other is to do statistical significance tests. Consider the diagram below. When leaving error bars out, the left implementation seems clearly better than the right. However when taking into account how widely skewed the performance numbers are and adding error bars to the entries, this is no longer the case: the two runs are no longer statistically significantly different.
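
The summary statistics mentioned above take only a few lines to compute. The sketch below is my own illustration, not code from the talk: it computes mean, median and the standard error of the mean over the timings of several VM runs, and uses "mean plus/minus two standard errors" as a crude error bar. The overlap check is only a first screen and does not replace a proper significance test.

```java
import java.util.Arrays;

final class RunStats {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double median(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Standard error of the mean, based on the sample standard deviation (n - 1).
    static double standardError(double[] xs) {
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) squares += (x - m) * (x - m);
        return Math.sqrt(squares / (xs.length - 1)) / Math.sqrt(xs.length);
    }

    // Crude screen: do the intervals "mean +/- 2 standard errors" of both samples overlap?
    static boolean errorBarsOverlap(double[] a, double[] b) {
        double loA = mean(a) - 2 * standardError(a), hiA = mean(a) + 2 * standardError(a);
        double loB = mean(b) - 2 * standardError(b), hiB = mean(b) + 2 * standardError(b);
        return loA <= hiB && loB <= hiA;
    }
}
```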