A Binary Choice

As a general rule, don't invent your own file format until you have to, and even then, probably don't. But sometimes, you have to.

Tim C's company was building a format they called "generic raw format". It was solving a hard problem: they were collecting messages from a variety of organizations, in a mix of binary and plaintext, and dumping them into a flat file. Each file might contain many messages, and they needed to be able to split those messages and timestamp them correctly.

This meant that the file format needed a header at the top. It would contain information about byte order and version number, have space for arbitrary keys, and report the header length, all as text, represented as key/value pairs. Then they realized that some of the clients and vendors supplying this data might want to include binary data in the header, so it would also need a binary section.

All of this created some technical issues. The key one was that the header length, stored as text, could change the length of the header. This wasn't itself a deal-breaker, but other little flags created problems. If they represented byte-order as BIGENDIAN=Y, would that create confusion for their users? Would users make mistakes about what architecture they were on, or expect to use LITTLEENDIAN=Y instead?
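To make the header-length problem concrete, here is a minimal Python sketch. The key name HEADERLEN and the newline-terminated KEY=VALUE layout are assumptions made for illustration; the story does not give the actual spec. Because the length is written as decimal text, adding a digit to the length makes the header one byte longer, so the writer has to loop until the number it writes matches the size it produces.

```python
def render_header(fields: dict[str, str]) -> bytes:
    """Render KEY=VALUE lines plus a self-describing HEADERLEN line.

    HEADERLEN is a hypothetical key name; it must equal the total
    size of the header in bytes, including its own line.
    """
    body = "".join(f"{k}={v}\n" for k, v in fields.items()).encode("ascii")
    length = len(body)  # first guess: ignores the HEADERLEN line itself
    while True:
        candidate = body + f"HEADERLEN={length}\n".encode("ascii")
        if len(candidate) == length:
            return candidate  # the text length now describes itself
        # Writing the length changed the length; try again.
        length = len(candidate)


# An 87-byte body: the total first comes out as 100, but "100" has
# three digits, so the header grows again before it settles.
header = render_header({"K": "x" * 84})
print(len(header), header[-14:])
```

The loop settles after a few passes, but every writer and every hand-editor of the file has to get this right, which is the kind of trap a fixed-width binary field avoids.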

In the end, it just made more sense to make all of the important fields binary fields. The header could still have a text section, which could contain arbitrary key/value pairs. For things like endianness, there were much simpler ways to solve the problem, like reserving 32 bits and having clients store a 1 in it. The parser could then detect whether that read as 0x00000001 or 0x01000000 and react accordingly. Having the header length be a fixed-width integer and not text also meant that recording the length wouldn't impact the length.
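The byte-order marker idea can be sketched in a few lines of Python. The layout here (a 4-byte marker followed immediately by a 4-byte header length) and the function names are assumptions for illustration, not the real format. The writer stores the integer 1 in its native byte order; the reader interprets the same four bytes as little-endian and sees either 1 or the byte-swapped value 0x01000000.

```python
import struct


def detect_byte_order(marker: bytes) -> str:
    """Return the struct prefix ('<' or '>') the writer used."""
    if len(marker) != 4:
        raise ValueError("byte-order marker must be exactly 4 bytes")
    (value,) = struct.unpack("<I", marker)  # always read as little-endian
    if value == 0x00000001:
        return "<"  # writer was little-endian
    if value == 0x01000000:
        return ">"  # writer was big-endian: the 1 arrives byte-swapped
    raise ValueError(f"unrecognized byte-order marker: {value:#010x}")


def read_header_length(raw: bytes) -> int:
    """Read a hypothetical header: 4-byte marker, then 4-byte length."""
    order = detect_byte_order(raw[:4])
    (length,) = struct.unpack(order + "I", raw[4:8])
    return length


# The same logical header written by two different kinds of machine.
print(read_header_length(struct.pack("<II", 1, 64)))
print(read_header_length(struct.pack(">II", 1, 64)))
```

Because the length field is always four bytes, its value can change without moving anything else in the header, and nobody has to remember whether the flag was spelled BIGENDIAN=Y or LITTLEENDIAN=Y.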

These were all pretty reasonable things to do in a header format, and good compromises for usability and their business needs. So of course, Blaise, the CTO, objected to these changes.

"I thought we'd agreed to text!" Blaise said, when reviewing the plan for the header format.

"Well, we did," Tim explained. "But as I said, for technical reasons, it makes much more sense."

"Right, but if you do that, we can't use cat or head to review the contents of the file header."

Tim blinked. "The header has a section for binary data anyway. No one should be using cat or head to look at it."

"How else would they look at it?"

"Part of this project is to release a low-level dump tool, so they can interact with the data that way. You shouldn't just cat binary files to your terminal, weird stuff can happen."

Blaise was not convinced. "The operations people might not have the tool installed! I use cat for reading files, our file should be catable."

"But, again," Tim said, trying to be patient. "The header contains a reserved section for binary data anyway, the file content itself may be binary data, the entire idea behind what we're doing here doesn't work with, and was never meant to work with, cat."

Blaise pulled up a terminal, grabbed a sample file, and cated it. "There," he said, triumphantly, pointing at the header section where he could see key/value pairs in a sea of binary nonsense. "I can still see the header parameters. I want the file to be like that."

At this point, Tim was out of things to say. He and his team revised the spec into a header format that was much harder to use, much more confusing, and much more annoying. The CTO got what the CTO wanted.

Surprisingly, they ended up having a hard time getting their partners to adopt the new format though…


This post originally appeared on The Daily WTF.
