Data serialization
XML, JSON and friends are the current standard data exchange formats. Their main benefit is being human-readable while
still structured enough to be easily parsed by programs. The downsides are overhead in size and parsing time. And XML,
at least, is not nearly as human-readable as it could be.
Binary formats are an alternative. Yet those are often
not platform independent (offering either C++ or Java or Python bindings, but not all of them) or not upgradable (what
if your boss comes along and wants you to add yet another field? Do you have to reprocess all your existing data?).
There are a few libraries
that promise to solve at least some of these problems. Usually you specify your data format in an IDL, generate
(byte)code from it, and use mechanisms provided by the library to upgrade your format later on.
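For illustration, here is what such an IDL might look like in Thrift; the struct and its fields are invented for this example, but the numbered field tags are the mechanism that makes later format upgrades safe:

```
// user.thrift -- a hypothetical schema, made up for this example
struct User {
  1: required string name,
  2: required i32 id,
  // Added in a later version of the format: the new field gets a fresh
  // tag number and is marked optional, so records written before the
  // change still deserialize without reprocessing.
  3: optional string email,
}
```

Running the compiler, e.g. `thrift --gen java user.thrift`, then produces the serialization classes for the chosen language.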
Yesterday at the Berlin
Apache Hadoop Get Together, Torsten Curdt gave a short introduction to two of these solutions: Thrift and Protocol
Buffers. He explained why Joost decided to use one of these libraries and why they went with Thrift instead of
Protocol Buffers.
This morning I gathered a list of the data exchange libraries that are currently
available:
- Thrift … developed at Facebook, now in
the Apache Incubator; active community; bindings for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa,
Smalltalk, and OCaml.
- ProtoBuf … developed at Google, maintained mainly
by a single developer; bindings for C++, Java, and Python.
- Avro …
started by Doug Cutting; ships the schema together with the data and thus skips code generation (see the sketch after this list).
- ETCH …
developed at Cisco, now in the Apache Incubator; bindings for Java, C#, and JavaScript.
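To show what skipping code generation looks like in practice, here is a small sketch against the Avro Java API, assuming a reasonably recent Avro release; the `User` record schema is made up for this example:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroNoCodegen {
    public static void main(String[] args) {
        // The schema is plain JSON, parsed at runtime -- no IDL compiler
        // run, no generated classes. (Schema invented for this example.)
        String json = "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
                + "{\"name\": \"name\", \"type\": \"string\"},"
                + "{\"name\": \"id\", \"type\": \"int\"}]}";
        Schema schema = new Schema.Parser().parse(json);

        // Records are built and inspected generically against the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("id", 42);
        System.out.println(user); // prints the record as JSON
    }
}
```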
There are some performance benchmarks online, as well as another recent, extensive comparison of the serialization performance of various frameworks.