Data serialization

2009-06-26 08:39
XML, JSON and others are currently the standard data exchange formats. Their main benefit is being human-readable while still structured enough to be easily parsed by programs. Their problems are overhead in both size and parsing time. In addition, XML at least is not really as human-readable as it could be.

Binary formats are an alternative. Yet those are often not platform independent (offering either C++ or Java or Python bindings, but not all of them) or not upgradable (what if your boss comes along and wants you to add yet another field? Do you have to reprocess all your existing data?).

There are a few libraries that promise to solve at least some of these problems. Usually you specify your data format in an IDL, generate (byte)code from it, and use mechanisms provided by the library to upgrade your format later.
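As a minimal sketch of what such an IDL looks like, here is a hypothetical Thrift struct (the name and fields are made up for illustration); the optional field is the kind of late addition the upgrade mechanisms are meant to absorb without reprocessing existing data:

    struct LogEvent {
      1: required string message,   // present since the first version
      2: optional i64 timestamp     // added later; old records simply lack it
    }

Running the Thrift compiler on such a file (roughly: thrift --gen java logevent.thrift) generates the serialization classes for each target language; the numeric field tags are what keeps old and new versions of the format compatible on the wire.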

Yesterday at the Berlin Apache Hadoop Get Together, Torsten Curdt gave a short introduction to two of these solutions: Thrift and Protocol Buffers. He explained why Joost decided to use such a library in the first place and why they went with Thrift instead of Protocol Buffers.

This morning I gathered a list of the data serialization libraries that are currently available:

  • Thrift ... developed at Facebook, now in the Apache Incubator, active community, bindings for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.
  • ProtoBuf ... developed at Google, driven mainly by a single developer, bindings for C++, Java, and Python.
  • Avro ... started by Doug Cutting, skips code generation: the schema itself is shipped with the data (see the sketch after this list).
  • ETCH ... developed at Cisco, now in the Apache Incubator, bindings for Java, C#, and JavaScript.
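
To illustrate the Avro point: instead of compiling an IDL into classes, an Avro schema is plain JSON that is read at runtime and stored with the data, so a generic reader can decode records without any generated code. A hypothetical schema (record and field names made up for illustration):

    {"type": "record",
     "name": "LogEvent",
     "fields": [
       {"name": "message",   "type": "string"},
       {"name": "timestamp", "type": "long"}
     ]}

Writing records then just means passing in plain dictionaries; a minimal sketch using the Python avro package, assuming the schema above is saved as logevent.avsc:

    import avro.schema
    from avro.datafile import DataFileWriter
    from avro.io import DatumWriter

    # Parse the JSON schema at runtime -- no code generation step.
    schema = avro.schema.parse(open("logevent.avsc").read())

    # Append records as plain dicts; the schema is embedded in the file,
    # so any reader can decode it later using the writer's schema.
    writer = DataFileWriter(open("events.avro", "wb"), DatumWriter(), schema)
    writer.append({"message": "node started", "timestamp": 1246005540})
    writer.close()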

There are some performance benchmarks online, as well as another recent, extensive comparison of the serialization performance of various frameworks.