Or how to use protocol buffers efficiently in Python.
This post is inspired by the many posts raving about how JSON is so much faster, or better, or something, than a binary format. In principle JSON or any other text format should not be faster: a binary format is compact by design and parsing is supposed to be straightforward, with no inefficient text mangling to deal with, hence faster overall. The implementation can suffer, though, as it does for Python Protocol Buffers in the default installation, which uses pure Python. By switching to the C++ implementation you gain a significant speedup, as you would expect. And you don’t have to change any code; it is just installation and setup.
We use Protocol Buffers at Ráiteas quite happily; they provide a language-independent and efficient way of transferring information between programs. We make use of Python, Java and C++, and the exchange of information has been by far the least troublesome part of the code thanks to Protocol Buffers.
Getting Python Up To Speed
At the time of this post I am using Python 2.7.5 and Protocol Buffers 2.5.0, and my development platform is OS X 10.9 using Clang. I hit a couple of issues when making the switch to the C++ implementation on OS X and have compiled a few notes to ease the pain going forward.
The benchmark used here is a micro-benchmark and as such doesn't implement a particularly complicated message; the only processing going on is serialisation and reading back from disk. I decided to try a few compression formats as well and included lz4 and gzip for comparison on the Protocol Buffer message (though zlib as a default would have been a faster compromise). There is some overhead, but lz4 is really fast, although its output is not as small as that of gzip or zlib, which are very close to each other in this example. The compression you gain versus the raw PB is not great, but if bandwidth is at a premium you are probably better off with, say, zlib. Of course, if the time to compress or CPU usage is not important you can use bzip2 or even xz (http://en.wikipedia.org/wiki/Xz).
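To get a feel for the size versus speed trade-off between the compression formats mentioned above, here is a minimal sketch using only standard-library codecs. It assumes Python 3 (the benchmark itself was run on Python 2.7.5, where `gzip.compress` and `lzma` are not available), it leaves out lz4 because that needs a third-party package, and the `payload` is a made-up stand-in rather than a real Protocol Buffer message.

```python
import bz2
import gzip
import lzma
import time
import zlib

# Hypothetical stand-in for a serialised message. Repetitive bytes compress
# very well; real Protocol Buffer payloads will usually compress less.
payload = b"name=Alice;email=alice@example.com;phone=555-0100;" * 200

# Standard-library codecs only; lz4 is a third-party package and is omitted.
codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def compare(data):
    """Return {codec name: (compressed size in bytes, seconds to compress)}."""
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Round-trip check: the codec must give back the original bytes.
        assert decompress(packed) == data
        results[name] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    for name, (size, seconds) in compare(payload).items():
        print(f"{name:5s} {size:6d} bytes  {seconds * 1000:.3f} ms")
```

Timings from a single run like this are noisy, so treat them as a rough indication only and measure with your own messages before picking a format.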
The starting point
Start by cloning the repository and checking out the tag bench-1.
git clone https://github.com/hnrkptrsn/benchmarks.git
cd benchmarks/protobuftest
git checkout tags/bench-1
python protobuftest.py
You should get output similar to this (it will vary according to your particular system):
Read is the time to read messages relative to JSON, with >1.0 being worse. Write is the time to write messages relative to JSON, with >1.0 being worse. Size is the size of the messages on disk relative to JSON, with >1.0 being more bytes on disk. (Lower is better.)
This does show rather poor Protocol Buffer performance, as mentioned in the original benchmark readme.
I added lz4 and gzip to the Protocol Buffer tests. Check out the bench-2 tag and have a look:
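The relative figures described above are plain ratios against the JSON baseline. A minimal sketch of that calculation follows; the function name `relative_to_baseline` and the numbers in `measured` are made up purely for illustration and are not results from the benchmark.

```python
def relative_to_baseline(results, baseline="json"):
    """Divide every metric by the baseline's value for the same metric."""
    base = results[baseline]
    return {
        fmt: {metric: value / base[metric] for metric, value in metrics.items()}
        for fmt, metrics in results.items()
    }


# Made-up measurements for illustration only (seconds, seconds, bytes).
measured = {
    "json": {"read": 2.0, "write": 4.0, "size": 1000},
    "protobuf": {"read": 1.0, "write": 6.0, "size": 500},
}

ratios = relative_to_baseline(measured)
print(ratios["protobuf"])  # {'read': 0.5, 'write': 1.5, 'size': 0.5}
```

In this invented example the format reads in half the time of JSON, takes one and a half times as long to write, and uses half the bytes on disk.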
git checkout tags/bench-2
python protobuftest.py
Size on disk changes slightly for the protocol buffer implementation.
Using the C++ PB implementation
Now install the binary (C++) version of Protocol Buffers. You may want to have a look here. Then run the test program again, asking for the cpp version:
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp python protobuftest.py
Now Protocol Buffers are starting to look interesting.
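If the numbers do not change, it is worth checking which backend was actually loaded. The sketch below relies on `google.protobuf.internal.api_implementation`, which is an internal module rather than a stable public API, so treat it as an assumption that it behaves this way in your installed release. The helper name `active_protobuf_implementation` is hypothetical.

```python
import os


def active_protobuf_implementation():
    """Return the backend name reported by protobuf, or None if not installed."""
    try:
        # Internal module: not a stable public API, may change between releases.
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    return api_implementation.Type()


if __name__ == "__main__":
    # The environment variable is read when protobuf is first imported,
    # so it has to be set before the interpreter loads google.protobuf.
    requested = os.environ.get("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "(not set)")
    print("requested:", requested)
    print("active:   ", active_protobuf_implementation())
```

Run it with the same `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp` prefix as the benchmark; if the active backend still reports the pure Python implementation, the C++ extension was most likely not built or installed.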
Improve performance by compiling further
This page is a good read and inspiration for the final test run.
Get the bench-3 tag and compile and install the python extension.
git checkout tags/bench-3
protoc --cpp_out=. addressbook.proto
sudo \
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp \
ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future \
python setup.py build
sudo \
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp \
ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future \
python setup.py install
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp python protobuftest.py
Who says Protocol Buffers were slow?
Message final size on disk
You should probably only compress Protocol Buffers if profiling shows that bandwidth is the limiting factor; otherwise the CPU impact is of course quite significant. There is a trade-off, and the point where the bandwidth cost overshadows the CPU cost depends on actual use.
|Size on disk||Format|
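The bandwidth versus CPU trade-off can be reasoned about with a very simple model: total time is the CPU time spent compressing plus the time on the wire. The sketch below is only that model, with hypothetical function names and invented numbers; it ignores decompression on the receiving side, latency and any overlap of compression and sending.

```python
def transfer_seconds(size_bytes, bandwidth_bytes_per_s, cpu_seconds=0.0):
    """Time to ship one message: CPU spent compressing plus time on the wire."""
    return cpu_seconds + size_bytes / bandwidth_bytes_per_s


def compression_pays_off(raw_size, packed_size, cpu_seconds, bandwidth_bytes_per_s):
    """True when compressing and then sending beats sending the raw message."""
    raw = transfer_seconds(raw_size, bandwidth_bytes_per_s)
    packed = transfer_seconds(packed_size, bandwidth_bytes_per_s, cpu_seconds)
    return packed < raw


# Hypothetical numbers: a 1 MB message that compresses to 0.6 MB in 20 ms.
slow_link = 1_000_000      # roughly 1 MB/s
fast_link = 100_000_000    # roughly 100 MB/s
print(compression_pays_off(1_000_000, 600_000, 0.020, slow_link))  # True
print(compression_pays_off(1_000_000, 600_000, 0.020, fast_link))  # False
```

With these invented figures compression wins on the slow link (0.62 s against 1.0 s) and loses on the fast one (0.026 s against 0.01 s), which is why profiling your actual setup matters.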
Other message formats
At Ráiteas we are looking at a few newer developments in terms of message objects, namely Cap’n Proto, from the author of Protocol Buffers, and FlatBuffers from Google. Both look very interesting, and there are of course more message libraries out there.