JAX: Pigs, snakes and deaths by 1k cuts

In his talk on performance problems, Rainer Schuppe gave a great introduction to
the kinds of performance problems that can be observed in production and how
best to root-cause them.


Simply put, performance issues usually arise due to a difference in either data
volume, concurrency levels or resource usage between the dev, QA and production
environments. The tooling to uncover and explain them is pretty well known:
starting with looking at logfiles, ARM tools, using aspects, bytecode
instrumentation, sampling, watching JMX statistics, and PMI tools.


All of these tools have their own unique advantages and disadvantages. With
logs you get the most freedom, however you have to know what to log at
development time. In addition logging is I/O heavy, so doing too much of it can
slow down the application itself. In a common distributed system logs need to
be aggregated somehow. A simple example of what can go wrong is cascading
exceptions spilled to disk that cause machines to run out of disk space one
after the other. When relying on logging make sure to keep transaction
contexts, in particular transaction ids across machines and services, to
correlate outages. In terms of tool support, look at scribe, splunk and flume.
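
As a rough illustration of carrying a transaction id through the logs (my
sketch, not something shown in the talk), SLF4J's MDC can attach the id to
every log line of a request so aggregated logs can be correlated across
services; the key name txId and the wiring are assumptions:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import java.util.UUID;

public class TransactionLogging {

    private static final Logger LOG = LoggerFactory.getLogger(TransactionLogging.class);

    // Hypothetical entry point: attach a transaction id to every log line of this request.
    public void handleRequest(String incomingTxId) {
        // Reuse the id handed over by an upstream service, or create one at the edge.
        String txId = (incomingTxId != null) ? incomingTxId : UUID.randomUUID().toString();
        MDC.put("txId", txId);
        try {
            LOG.info("request started");   // log pattern should include %X{txId}
            // ... actual work, passing txId on to downstream services ...
            LOG.info("request finished");
        } finally {
            MDC.remove("txId");            // don't leak the id into unrelated requests
        }
    }
}
```

With a log pattern that prints %X{txId}, grepping the aggregated logs for one
id shows the full path of a transaction across machines.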


A tool often used for tracking down performance issues in development is the
well known profiler. Usually it creates lots of very detailed data. However it
is most valuable in development - in production, profiling a complete server
stack produces way too much load and data to be feasible. In addition, again
there is usually no transaction context available for correlation.
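
If a full profiler is too heavy for production, one crude compromise is to
sample thread stacks periodically yourself; the following is a minimal sketch
of that idea using the standard ThreadMXBean (the one-second interval and
printing only the top frame are my assumptions, not a recommendation from the
talk):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class StackSampler {

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Sample once per second; a real setup would aggregate hot frames
        // instead of printing them.
        while (true) {
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                StackTraceElement[] stack = info.getStackTrace();
                if (stack.length > 0) {
                    System.out.println(info.getThreadName() + " " + info.getThreadState()
                            + " @ " + stack[0]);
                }
            }
            Thread.sleep(1000L);
        }
    }
}
```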


A third way of watching applications do their work is to watch via JMX. This
capability is built in for any Java application, in particular for servlet
containers. Again there is no transaction context. And unless you take care of
it yourself there won’t be any historic data.
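
As a minimal sketch of the JMX route (the one-second interval and printing to
stdout instead of persisting the values are assumptions of mine), the platform
MBeans can be polled from inside the JVM; keeping the history is up to you:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class JmxPoller {

    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Poll the built-in MBeans; to get historic data you would have to
        // write these values to some time series store yourself.
        while (true) {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.printf("heap used=%d max=%d threads=%d%n",
                    heap.getUsed(), heap.getMax(), threads.getThreadCount());
            Thread.sleep(1000L);
        }
    }
}
```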


When it comes to diagnosing problems, you are essentially left with fixing
either the “it does not work” case or the “it is slow” case.


For the “it is slow” case there are a few incarnations:


  • It was always slow, we got used to it.

  • It gets slow over time.

  • It gets slower exponentially.

  • It suddenly gets slow.

  • There is a spontaneous crash.




In the case of “it does not work” you are left with the following observations:


  • Sudden outages.

  • Always flaky.

  • Sporadic error messages.

  • Silent death.

  • Increasing error rates.

  • Misleading error messages.




In the end you will always be spinning in a loop of looking at symptoms,
eliminating non-causes, identifying suspects, and confirming or eliminating
them by comparing against normal behaviour. If that does not settle it: lather,
rinse, repeat. When it comes to causes for errors and slowness, you will
usually run into one of the following: bad coding practices, too much load,
missing backends, resource conflicts, memory and resource leakage, as well as
hardware and networking issues.


Some symptoms you may observe include foreseeable lock-ups (it’s always slow
after four hours, so we just reboot automatically before that), consistent
slowness, sporadic errors (it always happens after a certain request came in),
getting slower and slower (most likely leaking resources), sudden chaos (e.g.
someone pulling the plug or someone removing a hard disk), and high utilisation
of resources.

Linear memory leak



In case of a linear memory leak, the application usually runs into an OOM
eventually, getting ever slower before that due to GC pressure. Reasons could
be linear structures being filled but never emptied. What you observe are
growing heap utilisation and growing GC times. In order to find such leakage,
make sure to turn on verbose GC logging and take heap dumps to find the leak.
One challenge though: it may be hard to find the leakage if the problem is not
one large object, but many, many small ones that lead to a death by 1000 cuts
bleeding the application to death.
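
To make the linear leak concrete, here is a minimal, self-inflicted example of
my own (the flags in the comment are the usual options for verbose GC logging
and heap dumps; the 1 KB object size is arbitrary):

```java
import java.util.ArrayList;
import java.util.List;

// Run e.g. with -Xmx64m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError to watch
// heap utilisation and GC times grow until the OutOfMemoryError hits.
public class LinearLeak {

    // A structure that is filled on every request but never emptied.
    private static final List<byte[]> SEEN_REQUESTS = new ArrayList<>();

    public static void main(String[] args) {
        while (true) {
            // Each "request" leaves a small object behind: many small cuts
            // rather than one big object, which makes the leak hard to spot
            // in a heap dump.
            SEEN_REQUESTS.add(new byte[1024]);
        }
    }
}
```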


In development and testing you will do heap comparisons. Keep in mind that
taking a heap dump pauses the JVM while the dump is written. You can use common
profilers to look at the heap dump. There are variants that help with automatic
leak detection.


A variant is the “pig in a python” issue, where sudden, unusually large objects
cause the application to be overloaded.


Resource leaks and conflicts


Another common problem is leaking resources other than memory - not closing
file handles can be one incarnation. Those problems cause slowness over time,
and they may lead to the heap growing over time - usually that is not the most
visible problem though. If instance tracking does not help here, your last
resort should be doing code audits.
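
As an illustration of the file-handle incarnation (my example, not the
speaker’s), the first method below leaks a handle whenever reading throws,
while the try-with-resources variant closes it on every code path:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class HandleLeak {

    // Leaks the underlying file handle if readLine() throws: close() is never reached.
    static String firstLineLeaky(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = reader.readLine();
        reader.close();
        return line;
    }

    // try-with-resources closes the reader on every code path.
    static String firstLineSafe(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return reader.readLine();
        }
    }
}
```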


In case of conflicting resource usage you usually face code that was developed
with overly cautious locking and data integrity constraints. The way to go is
thread dumps to uncover threads in blocked and waiting states.
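
A sketch of what such overly cautious locking can look like (hypothetical
example of mine): one coarse monitor that every operation funnels through.
Under load, a thread dump - e.g. taken with jstack - will show most threads
blocked on it:

```java
public class CoarseLock {

    private final Object giantLock = new Object();

    // Every operation, read or write, funnels through one monitor; under load
    // a thread dump shows all but one thread BLOCKED right here.
    public void handle(Runnable work) {
        synchronized (giantLock) {
            work.run();
        }
    }
}
```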


Bad coding practices


When it comes to bad coding practices, what is usually seen is code in endless
loops (easy to see in thread dumps) and CPU-bound computations where no result
caching is done. Also layeritis with too much (de-)serialisation can be a
problem. In addition there is a general “the ORM will save us all” problem that
may lead to massive SQL statements, or to using the wrong data fetch strategy.
When it comes to caching - if caches are too large, access times of course grow
as well. There could be never-ending retry loops and network calls that block
forever. Also people tend to catch exceptions but not do anything about them
other than adding a little #fixme annotation to the code.
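
Picking one item from that list, here is a minimal sketch of result caching
for a CPU-bound computation (the computation and the map-based cache are
stand-ins of mine, not from the talk):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ResultCache {

    private final Map<Integer, Long> cache = new ConcurrentHashMap<>();

    // Cache the result of a CPU-bound computation instead of redoing it per request.
    public long expensive(int input) {
        return cache.computeIfAbsent(input, ResultCache::compute);
    }

    private static long compute(int input) {
        long result = 0;
        for (int i = 0; i < 1_000_000; i++) {
            result += (long) input * i;   // stand-in for real work
        }
        return result;
    }
}
```

Note that an unbounded map like this is itself the “caches too large” problem
from above; a real cache needs a size bound and an eviction policy.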


When it comes to locking you might run into deadlock and livelock problems.
There could be chokepoints (resources that all threads need for each processing
chain). In a thread dump you will typically see lots of wait instead of block
time.
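
The classic lock-ordering deadlock looks roughly like this (textbook sketch,
not from the talk); jstack or ThreadMXBean.findDeadlockedThreads() will point
at the two threads blocked on each other’s monitor:

```java
public class LockOrderDeadlock {

    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    public static void main(String[] args) {
        // Thread 1 takes A then B, thread 2 takes B then A: with the right
        // timing each waits forever for the lock the other one holds.
        new Thread(() -> {
            synchronized (LOCK_A) {
                pause();
                synchronized (LOCK_B) { }
            }
        }).start();
        new Thread(() -> {
            synchronized (LOCK_B) {
                pause();
                synchronized (LOCK_A) { }
            }
        }).start();
    }

    private static void pause() {
        try {
            Thread.sleep(100L);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```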


In addition there could be internal and external bottlenecks. In particular,
keep those in mind when dealing with databases.


The goal should be to find an optimum for your application between too many
tiny requests that waste resources on dispatching, and one huge request that
everyone else is waiting for.
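
As a hedged sketch of that middle ground, JDBC batching sends inserts in
medium-sized chunks instead of one statement per row or one giant statement;
the table, column and batch size below are assumptions of mine:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchedInserts {

    // Neither one statement per row (too many tiny requests) nor one statement
    // for everything (one huge request): flush in medium-sized batches.
    static void insertAll(Connection connection, List<String> names) throws SQLException {
        final int batchSize = 500;   // assumption; tune for your database
        try (PreparedStatement insert =
                     connection.prepareStatement("INSERT INTO users (name) VALUES (?)")) {
            int pending = 0;
            for (String name : names) {
                insert.setString(1, name);
                insert.addBatch();
                if (++pending == batchSize) {
                    insert.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                insert.executeBatch();
            }
        }
    }
}
```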