Or how to use protocol buffers efficiently in Python.

This post is inspired by the many posts raving about how JSON is so much faster or better than a binary format. In principle a text format should not win: a binary format is compact by design and parsing it is straightforward, with no text mangling to deal with, so it should be faster overall. The implementation can let it down, though, as it does for Python Protocol Buffers in the default installation, which uses a pure Python implementation. Switching to the C++ implementation gives the significant speedup you would expect, and you don’t have to change any code; it is just installation and setup.

We use Protocol Buffers at Ráiteas quite happily. They provide a language-independent and efficient way of transferring information between programs. We use Python, Java and C++, and the exchange of information has been by far the least troublesome part of the code thanks to Protocol Buffers.

Getting Python Up To Speed

At the time of this post I am using Python 2.7.5 and Protocol Buffers 2.5.0, and my development platform is OS X 10.9 using Clang. I hit a couple of issues when making the switch to the C++ implementation on OS X and have compiled a few notes to ease the pain going forward.

Benchmark introduction

I have based my testing on the micro-benchmark found on GitHub here, in the folder protobuftest. My fork with changes is here. Note: it is only the protobuftest sub-project I was interested in.

It is a micro-benchmark and as such doesn't implement a particularly complicated message; the only processing going on is serialisation to disk and reading back. I decided to try a few compression formats and included lz4 and gzip for comparison for the Protocol Buffer message (though zlib as a default would have been a faster compromise). There is some overhead, but lz4 is really fast. It does not compress as well as gzip, or as zlib, which is very close to gzip in this example. The compression you gain versus the raw PB is not great, but if bandwidth is at a premium you are probably better off with, say, zlib. Of course, if the time to compress or CPU usage is not important you can use bzip2 or even xz (http://en.wikipedia.org/wiki/Xz).
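As a rough illustration of the size side of that trade-off, here is a minimal sketch using only the compressors that ship with the Python 3 standard library (lz4 needs a third-party package, so it is left out). The payload is a hypothetical stand-in for serialised messages, not the benchmark's data, so the numbers will not match the tables in this post.

```python
import bz2
import gzip
import lzma
import zlib

# Hypothetical payload: repetitive bytes standing in for serialised messages.
payload = b"".join(
    b"name=person%d;email=person%d@example.com;" % (i, i) for i in range(2000)
)

# Each label maps to a (compress, decompress) pair from the standard library.
codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def compressed_sizes(data):
    """Return {label: compressed size in bytes}, checking each round trip."""
    sizes = {}
    for label, (compress, decompress) in codecs.items():
        packed = compress(data)
        assert decompress(packed) == data  # the round trip must be lossless
        sizes[label] = len(packed)
    return sizes


sizes = compressed_sizes(payload)
for label, size in sizes.items():
    print("%-6s %8d bytes  ratio %.2f" % (label, size, size / len(payload)))
```

gzip and zlib both use DEFLATE underneath, which is why their sizes tend to land close together; the difference is mostly the container format and the default compression level.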

The starting point

Start by cloning the repository and checking out the tag bench-1.

git clone https://github.com/hnrkptrsn/benchmarks.git
cd benchmarks/protobuftest
git checkout tags/bench-1
python protobuftest.py

You should get output similar to this (it will vary according to your particular system):

Read is the time to read messages relative to json, with >1.0 being slower. Write is the time to write messages relative to json, with >1.0 being slower. Size is the size of the messages on disk relative to json, with >1.0 being more bytes on disk. (Lower is better.)

Format Read Write Size
json 1.00 1.00 1.00
proto 15.23 6.26 0.57
json.gz 3.15 0.96 0.37
csv.gz 4.25 0.32 0.34
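To make the columns concrete, here is a minimal sketch of how ratios like these can be produced. It is not the benchmark's actual code: the records, file names and helper functions are hypothetical, it uses Python 3, and it only covers json and gzipped json so that it runs with the standard library alone.

```python
import gzip
import json
import os
import tempfile
import time

# Hypothetical records standing in for the benchmark's messages.
records = [{"id": i, "name": "person%d" % i, "email": "p%d@example.com" % i}
           for i in range(5000)]


def write_json(path):
    with open(path, "w") as fh:
        json.dump(records, fh)


def read_json(path):
    with open(path) as fh:
        return json.load(fh)


def write_json_gz(path):
    with gzip.open(path, "wt") as fh:
        json.dump(records, fh)


def read_json_gz(path):
    with gzip.open(path, "rt") as fh:
        return json.load(fh)


def measure(writer, reader, path):
    """Return (read seconds, write seconds, bytes on disk) for one format."""
    start = time.perf_counter()
    writer(path)
    write_s = time.perf_counter() - start
    start = time.perf_counter()
    loaded = reader(path)
    read_s = time.perf_counter() - start
    assert loaded == records  # what we read must match what we wrote
    return read_s, write_s, os.path.getsize(path)


def relative_table(results, baseline="json"):
    """Divide every measurement by the baseline's, so the baseline is 1.00."""
    base = results[baseline]
    return {name: tuple(v / b for v, b in zip(vals, base))
            for name, vals in results.items()}


with tempfile.TemporaryDirectory() as tmp:
    results = {
        "json": measure(write_json, read_json, os.path.join(tmp, "out.json")),
        "json.gz": measure(write_json_gz, read_json_gz, os.path.join(tmp, "out.json.gz")),
    }

table = relative_table(results)
print("Format   Read  Write  Size")
for name, (read_r, write_r, size_r) in table.items():
    print("%-8s %.2f  %.2f  %.2f" % (name, read_r, write_r, size_r))
```

A single timed run like this is noisy; repeating each measurement and taking the best or the median would give steadier ratios.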

This does show rather poor Protocol Buffer performance, as mentioned in the original benchmark readme.

I added lz4 and gzip to the Protocol Buffer tests. Check out the bench-2 tag and have a look:

git checkout tags/bench-2
python protobuftest.py

Format Read Write Size
json 1.00 1.00 1.00
proto 16.20 6.13 0.57
pb.lz4 10.97 5.69 0.52
pb.gz 12.56 6.12 0.35
json.gz 3.34 0.94 0.37
csv.gz 4.39 0.31 0.34

Compression shrinks the Protocol Buffer output on disk only slightly with lz4 (0.57 to 0.52) and more noticeably with gzip (0.35).

Using the C++ PB implementation

Now install the binary (C++) version of Protocol Buffers; you may want a look here. Then run the test program again while asking for the cpp version.
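One way to ask for the cpp version is sketched below, assuming the selection is done through the PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION environment variable that the protobuf Python package reads at import time. The helper names are hypothetical, and the sketch is written for Python 3 rather than the Python 2.7 used in this post.

```python
import os
import subprocess
import sys

# Assumed variable name: the protobuf Python package reads this at import time.
ENV_VAR = "PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"


def build_env(implementation="cpp"):
    """Return a copy of the current environment that requests one backend."""
    env = dict(os.environ)
    env[ENV_VAR] = implementation
    return env


def run_benchmark(script="protobuftest.py", implementation="cpp"):
    """Run the benchmark script in a child process with the backend selected."""
    return subprocess.call([sys.executable, script], env=build_env(implementation))


def active_implementation():
    """Report which backend protobuf picked, or None if it is not installed."""
    try:
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    return api_implementation.Type()
```

From a shell the same effect should come from prefixing the command with the variable, along the lines of PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp python protobuftest.py. Printing active_implementation() from inside the benchmark is a quick check that the C++ backend was really picked up.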


Now Protocol Buffers are starting to look interesting.

Format Read Write Size
json 1.00 1.00 1.00
proto 1.07 0.21 0.57
pb.lz4 1.21 0.22 0.52
pb.gz 3.39 0.26 0.35
json.gz 3.86 1.09 0.37
csv.gz 4.40 0.24 0.34

Improve performance by compiling further

This page is a good read and inspiration for the final test run.

Check out the bench-3 tag, then compile and install the Python extension.

git checkout tags/bench-3

protoc --cpp_out=. addressbook.proto

sudo \
ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future \
python setup.py build

sudo \
ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future \
python setup.py install

Format Read Write Size
json 1.00 1.00 1.00
proto 0.19 0.06 0.57
pb.lz4 0.23 0.06 0.52
pb.gz 2.47 0.10 0.35
json.gz 3.84 1.10 0.37
csv.gz 4.51 0.25 0.34

Who says Protocol Buffers were slow?

Have a look at @yaaang for a better description of the use of the compiled versions of PBs. And have a look at @sanand0 for the original benchmark if you want a go yourself.

Message final size on disk

Only compress Protocol Buffers if profiling shows that bandwidth is the limiting factor; otherwise the CPU cost is quite significant. The point at which the bandwidth cost overshadows the CPU cost depends on actual use.

Size on disk (bytes) Format
5383631 output.csvz
15988791 output.json
5855888 output.jsz
9183462 output.pb
5667722 output.pbgz
8308787 output.pbz
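The bandwidth versus CPU trade-off can be put into numbers with a small sketch. It takes the two file sizes from the table above, assuming output.pbgz is the gzip-compressed Protocol Buffer file (its size ratio matches the pb.gz row). The extra CPU time is a made-up figure for illustration; measure your own.

```python
RAW_PB_BYTES = 9183462    # output.pb from the table above
GZIP_PB_BYTES = 5667722   # output.pbgz, assumed to be the gzip-compressed file
EXTRA_CPU_SECONDS = 0.5   # hypothetical cost of compressing plus decompressing


def total_seconds(size_bytes, cpu_seconds, bandwidth_bytes_per_s):
    """Time to process and ship one file: CPU work plus transfer time."""
    return cpu_seconds + size_bytes / bandwidth_bytes_per_s


def compression_pays_off(raw_size, packed_size, extra_cpu_seconds,
                         bandwidth_bytes_per_s):
    """True when the compressed file arrives sooner despite the CPU cost."""
    raw = total_seconds(raw_size, 0.0, bandwidth_bytes_per_s)
    packed = total_seconds(packed_size, extra_cpu_seconds, bandwidth_bytes_per_s)
    return packed < raw


def break_even_bandwidth(raw_size, packed_size, extra_cpu_seconds):
    """Bandwidth in bytes/s below which compression is the faster choice."""
    return (raw_size - packed_size) / extra_cpu_seconds


for bandwidth in (1000000, 100000000):  # roughly 1 MB/s and 100 MB/s
    wins = compression_pays_off(RAW_PB_BYTES, GZIP_PB_BYTES,
                                EXTRA_CPU_SECONDS, bandwidth)
    print("%11d bytes/s: compress? %s" % (bandwidth, wins))
```

With these assumed numbers compression wins on the slow link and loses on the fast one, which is the trade-off described at the start of this section.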

Other message formats

At Ráiteas we are looking at a few newer developments in terms of message objects, namely capnproto, from the author of Protocol Buffers, and FlatBuffers from Google. Both look very interesting, and there are of course more message libraries out there.