JAX: Tales from production

2013-05-23 20:38
In a second presentation, Peter Roßbach, together with Andreas Schmidt, provided
more detail on what the topic of logging entails in real-world projects.
Development messages turn into valuable information needed to uncover issues
and downtime of systems, to plan capacity, to measure the effect of software
changes, and to analyse resource usage under real-world load. In addition to
these technical use cases there is a need to provide business metrics.


When dealing with multiple systems you have to correlate values across machines
and services and provide meaningful visualisations so that the correct
conclusions can be drawn.


When thinking about your log architecture you might want to consider storing
more than just log messages. Facts like release numbers should also be tracked
somewhere, ready to be joined in when needed to correlate behaviour with a
release version. To make that possible, also track events such as rolling out a
release to production; launching in a new market or switching traffic to a new
system could be other such events. Provide not only pure log messages but also
aggregated metrics and counters. All of these pieces should be stored and
tracked automatically to free operations for more important work.
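

A minimal sketch of what such an automatically tracked release event could look
like, assuming an SLF4J setup; the logger name and field names are illustrative,
not something prescribed in the talk:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    // Record a deployment as a structured log event so it can later be joined
    // against application logs and metrics. Field names are invented examples.
    public class DeploymentEvents {

        private static final Logger EVENTS = LoggerFactory.getLogger("events.deployment");

        public static void releaseRolledOut(String releaseVersion, String environment) {
            MDC.put("event", "release_rollout");
            MDC.put("release", releaseVersion);
            MDC.put("environment", environment);
            try {
                EVENTS.info("release {} rolled out to {}", releaseVersion, environment);
            } finally {
                MDC.clear();
            }
        }
    }

Called from the deployment pipeline rather than by hand, an event like this ends
up in the same store as the ordinary log lines and can be correlated with them.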


Have you ever thought about documenting not only your software, its interfaces
and input/output formats? What about documenting the logged information as well?
What about the fields contained in each log message - are they documented, or do
people have to infer their meaning from the content? What about valid ranges
for values - are they noted down somewhere? Did you record whether a specific
field can only contain integers or whether some day it could also contain
letters? What about the number format - is it decimal or hexadecimal?
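

One way to keep that documentation close to the code is to describe each field
of a log event explicitly; the field names, ranges and formats below are
invented examples, not a prescribed schema:

    /**
     * Describes every field of an access log event so nobody has to infer
     * meaning, ranges or number formats from the raw content.
     */
    public enum AccessLogField {

        /** ISO-8601 timestamp with millisecond resolution, e.g. 2013-05-23T20:38:12.345Z. */
        TIMESTAMP("timestamp"),

        /** HTTP status code as a decimal integer, 100-599. */
        STATUS("status"),

        /** Request duration in milliseconds; non-negative decimal integer. */
        DURATION_MS("duration_ms"),

        /** Opaque id used to correlate lines across services; lower-case hex string. */
        REQUEST_ID("request_id");

        private final String key;

        AccessLogField(String key) {
            this.key = key;
        }

        public String key() {
            return key;
        }
    }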


For a nice architecture documentation from the BBC, check out Winning the
metrics battle on the BBC dev blog.


There's an abundance of tools out there to help you with all sorts of logging
related topics:




  • For visualisation and transport: Datadog, kibana, logstash, statsd,
    graphite, syslog-ng

  • For providing the values: JMX, metrics, Jolokia

  • For collection: collectd, statsd, graphite, newrelic, datadog

  • For storage: typical RRD tools including RRD4j, MongoDB, OpenTSDB based
    on HBase, Hadoop

  • For charting: Munin, Cacti, Nagios, Graphite, Ganglia, New Relic, Datadog

  • For profiling: Dynatrace, New Relic, Boundary

  • For events: Zabbix, Icinga, OMD, OpenNMS, HypericHQ, Nagios, JBoss RHQ

  • For logging: splunk, Graylog2, Kibana, logstash




Make sure to provide metrics consistently and be able to add them with minimal
effort. Self-adaptation and automation are useful for this. Make sure developers,
operations and product owners are able to use the same system so there is no
information gap on either side. Your logging pipeline should be tailored to
provide easy and fast feedback on the implementation and features of the
product.
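

As a sketch of adding a metric with minimal effort, and assuming the "metrics"
entry in the tool list above refers to Coda Hale's metrics-core library, a timer
plus a JMX reporter costs roughly one line per measurement; Jolokia, collectd
and friends can then pick the values up without further code:

    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.JmxReporter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;

    public class RequestMetrics {

        private static final MetricRegistry REGISTRY = new MetricRegistry();
        private static final Timer REQUESTS = REGISTRY.timer("shop.checkout.requests");

        // Expose all registered metrics via JMX.
        public static void startJmxReporting() {
            JmxReporter.forRegistry(REGISTRY)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .build()
                    .start();
        }

        public static void handleRequest(Runnable work) {
            final Timer.Context context = REQUESTS.time();
            try {
                work.run();   // count, rate, max, min and mean duration are tracked automatically
            } finally {
                context.stop();
            }
        }
    }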


To reach a decent level of automation a set of tools is needed for:


  • Configuration management (where to store passwords, URLs or IPs, log
    levels etc.). Typical names here include Zookeeper, but also CFEngine, Puppet
    and Chef.

  • Deployment management. Typical names here are UC4, uDeploy, glu, or the
    Etsy deployment tooling.

  • Server orchestration (e.g. what is started when during boot). Typical
    names include UC4, Nolio, Marionette Collective, rundeck.

  • Automated provisioning (think "how long does it take from server failure
    to bringing that service back up online?"). Typical names include kickstart,
    vagrant, or typical cloud environments.

  • Test-driven / behaviour-driven environments (think about adjusting not
    only your application but also firewall configurations). Typical tools that
    come to mind here include Serverspec, rspec, cucumber, c-puppet, chef.

  • When it comes to defining the points of communication for the whole
    pipeline there is no better tool than traditional pen and paper and getting
    development and operations together into one room.




The tooling to support this process ranges from simple self-written bash scripts
in the startup model, to frameworks that support parts of the flow, up to
process-based suites that guide you through it. No matter which path you choose,
the goal should always be to end up with a well documented, reproducible step
into production. When introducing such systems, problems in your organisation
may become apparent. Sometimes it helps to just create facts: it's easier to ask
for forgiveness than permission.

JAX: Logging best practices

2013-05-22 20:37
The ideal outcome of Peter Roßbach's talk on logging best practices was to have
attendees leave the room thinking "we know all this already and are applying
it successfully" - most likely though the majority left thinking about how to
implement even the most basic advice discussed.


From his consultancy and fire-fighting background he has a good overview of what
logging in the average corporate environment looks like: no logging plan, no
rules, dozens of logging frameworks in active use, output in many different
languages, and no structured log events but a myriad of different quoting,
formatting and bracketing standards instead.


So what should the ideal log line contain? First of all it should really be a
single log line instead of a multi-line something that cannot be reconstructed
when interleaved with other messages. The line should not only contain the name
of the class that logged the information (actually that is the least important
piece of information); it should also contain the thread id, the server name,
and a (standardised and always consistently formatted) timestamp at a decent
resolution (hint: one new timestamp per second is not helpful when facing
several hundred requests per second). Make sure timing is aligned across
machines if timestamps are needed for correlating logs. Ideally there should be
context in the form of a request id, flow id, or session id.
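

A minimal sketch of carrying that context into every line via SLF4J's MDC; the
field names and the filter-style flow are assumptions, and the actual layout
would live in the logging backend, e.g. a Logback pattern such as
"%d{ISO8601} %-5level [%thread] %X{host} %X{requestId} %logger - %msg%n":

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public class RequestContextFilter {

        private static final Logger LOG = LoggerFactory.getLogger(RequestContextFilter.class);

        public void handle(String requestId, String sessionId, Runnable next) {
            MDC.put("requestId", requestId);
            MDC.put("sessionId", sessionId);
            MDC.put("host", System.getenv().getOrDefault("HOSTNAME", "unknown"));
            try {
                LOG.info("request started");
                next.run();       // every line logged further down carries the same context
            } finally {
                MDC.clear();      // do not leak context to the next request on this thread
            }
        }
    }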


When thinking about logs, do not think too much about human readability - think
more in terms of machine readability and parsability. Treat your logging system
as the database in your data center that has to handle the most traffic. It is
what holds the user interactions and system metrics that can be used as business
metrics, for debugging performance problems, and for digging up functional
issues. Most likely you will want to turn free text, which provides lots of
flexibility for screwing up, into a more structured format like JSON, or even
some binary format that is storage efficient (think Protocol Buffers, Thrift,
Avro).
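

One way to get there, sketched with Jackson and invented field names: serialise
a map of well-known fields into a single JSON line per event (in a real setup a
JSON encoder in the logging backend, or one of the binary formats mentioned
above, would replace the hand-rolled map):

    import java.time.Instant;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class StructuredLogger {

        private static final Logger LOG = LoggerFactory.getLogger("events.checkout");
        private static final ObjectMapper MAPPER = new ObjectMapper();

        public static void orderPlaced(String requestId, long durationMs, int items) throws Exception {
            Map<String, Object> event = new LinkedHashMap<>();
            event.put("timestamp", Instant.now().toString());
            event.put("event", "order_placed");
            event.put("request_id", requestId);
            event.put("duration_ms", durationMs);
            event.put("items", items);
            LOG.info(MAPPER.writeValueAsString(event));   // one parsable line per event
        }
    }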


In terms of log levels, make sure to log development traces on trace, provide
detailed problem analysis output on debug, and put normal behaviour onto info.
In case of degraded functionality, log to warn. Things you cannot easily
recover from go on error. When it comes to logging hierarchies, do not only
think in class hierarchies but also in terms of use cases: just because your
HTTP connector is used in two modules doesn't mean that there should be no way
to turn logging on for just one of the modules.
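

A small sketch of use-case based logger names; the module names are invented,
but the idea is that the shared connector logs under the name of the module it
works for, so debug output can be enabled for "shop.checkout.http" without also
flooding "shop.search.http":

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class HttpConnector {

        private final Logger log;

        public HttpConnector(String useCase) {
            // e.g. "shop.checkout" or "shop.search"
            this.log = LoggerFactory.getLogger(useCase + ".http");
        }

        public void get(String url) {
            log.debug("GET {}", url);
            // ... actual connection handling ...
        }
    }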


When designing your logging make sure to talk to all stakeholders to get clear
requirements. Make sure you can find out how the system is being used in the
wild, and be able to quantify the number of exceptions, the max, min and average
duration of a request, and similar metrics.


Tools you could look at for help include but are not limited to splunk, JMX,
jconsole, syslog, logstash, statsd, and Redis for log collection and queuing.


As a parting exercise: Look at all of your own logfiles and count the different
formats used for storing time.