1/10/2013

Your giant proprietary (or at least siloed) codebase is a huge liability

There has been a lot of news this week about vulnerabilities in very low-level platform code that many, many people run in production. First there was a Ruby exploit, and now today I see that there is a new Java zero-day.

The truth is, these kinds of exploits are absolutely everywhere. When off-the-shelf libraries are assembled into a whole that is greater than the sum of its parts, strange interactions become possible that the original authors never conceived of.

In the case of the Ruby exploit, from what I read it seems to go something like this: part of the web decoding machinery that could decode URL-encoded parameters was extended to be able to decode XML. The XML decoding machinery was then extended to be able to decode YAML.

YAML has a syntax for serializing arbitrary Ruby objects, and when such a YAML document is deserialized, a new instance of that object is created. With careful crafting of the input, an attacker can get arbitrary code execution.
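For a sense of what that looks like in code, here is a minimal Python sketch of the same class of bug, assuming the third-party PyYAML library (5.1 or newer) is installed; the payload string is a made-up illustration, not the actual Rails exploit:

    import yaml

    # Attacker-controlled YAML naming an arbitrary callable to apply on load.
    payload = "!!python/object/apply:os.system ['echo owned']"

    # safe_load only builds plain data types and rejects language-specific tags.
    print(yaml.safe_load("{a: 1, b: [2, 3]}"))   # {'a': 1, 'b': [2, 3]}

    try:
        yaml.safe_load(payload)                  # refuses the python/ tag
    except yaml.YAMLError:
        print("rejected: unknown tag")

    # The unsafe loader would construct the object, i.e. actually run os.system:
    # yaml.load(payload, Loader=yaml.UnsafeLoader)  # never do this with untrusted input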

This is also the reason it is not a good idea to use pickle as a network serialization format in Python. You might think, "oh, I'll use marshal. Marshal doesn't support arbitrary class serialization." But take a look at the list of object types marshal does support:

None, integers, long integers, floating point numbers, strings, Unicode objects, tuples, lists, sets, dictionaries, and code objects

Code objects. I rest my case. Of course, you would have to execute the code object you get back from the marshal module for an attacker's payload to run on your server, but some hacker somewhere is probably going to figure out some crazy way to make that happen.
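To make that concrete, a minimal sketch using only the standard library; the final exec is the step an attacker needs you, or some forgotten code path, to perform for them:

    import marshal

    # compile() produces a code object; marshal will happily serialize it.
    code = compile("print('this string came from the attacker')", "<payload>", "exec")
    blob = marshal.dumps(code)

    # On the receiving side, loads() hands the code object right back...
    restored = marshal.loads(blob)

    # ...and all it takes is one exec somewhere to run it.
    # exec(restored)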

Which brings me to my main point: I've observed over the years that for some reason business-type people, and even some programmers, seem to think that a large proprietary codebase that nobody else is allowed to look at is an asset. It's not; it's a liability!

You don't understand what's in your code. You don't understand what's in the code of the many libraries you use every day. Codebases are written over weeks, months, and years, by different people, in different frames of mind.

There are solutions to this code complexity problem. We can break large, complex codebases into small parts that are very explicit and careful about validating their input. We can completely isolate these parts from each other so that they can't accidentally (or maliciously) break something.

Libraries could strive for simplicity and explicitness rather than kitchen-sink-itis. If a surgeon wants to do surgery, they are going to choose a light, sharp, well-balanced scalpel, not an old Swiss Army knife.

Code that only a few people have to look at doesn't have to be clear. Only those few people have to bear the mental burden of holding that nasty code in their head. Code that a lot of people need to look at has a higher probability of being clear. This is one advantage of open source; obviously, it's not enough.

My suggestion for reducing the complexity of interactions like these is to create simpler, better-defined libraries and to isolate those libraries from each other in separate processes.

Processes evolved in the '70s to isolate users from each other, but now it is 2013 and we could start isolating more and more libraries from each other. For languages that don't use reference counting (reference counts are writes to every object they touch, which defeats page sharing), fork with copy-on-write may be good enough to let us actually use many, many UNIX processes for a single application without consuming too many resources.
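As a sketch of that isolation pattern, here is what quarantining one library call in its own process can look like with Python's standard multiprocessing module, which forks on Unix. Python itself is admittedly a weak fit for the copy-on-write argument, since its reference counting dirties pages, and risky_parse is just a hypothetical stand-in for whatever library handles untrusted input:

    from multiprocessing import Process, Queue

    def risky_parse(data, out):
        # Hypothetical stand-in for a library that handles untrusted input.
        # If it is exploited or crashes, only this child process is affected.
        out.put(len(data))

    if __name__ == "__main__":
        results = Queue()
        worker = Process(target=risky_parse, args=(b"untrusted bytes", results))
        worker.start()
        print("parsed length:", results.get())
        worker.join()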