Tuesday, March 22, 2011

Distributed systems and dynamic typing

There's a very good blog post by Robert Harper, a CMU CS professor, called "Dynamic languages are static languages." It's very enlightening and I strongly encourage you to read it. If I understand it correctly, which I freely admit I may not, the general idea is that dynamic languages can be thought of as static languages that have a single all-encompassing type. In that regard, dynamic languages are a proper subset of static languages. If you think I misinterpreted his post and I'm confused, please flame me in the comments.

In a previous post, called "Parallelism is not concurrency", he opines on a pet peeve of mine, namely how the terms parallelism and concurrency are nonchalantly and incorrectly interchanged. Parallelism applies to deterministic operations that operate on similar data in similar time. Some examples of parallel operations include rendering of 3D scenes on GPUs, or encoding/decoding blocks of compressed video. Concurrency, on the other hand, refers to the nondeterministic manner in which distributed systems operate, particularly ones where CPUs are separated over an unreliable network which can lose connectivity at any time.

In a third blog post, "Teaching FP to freshmen", Robert says he'll be teaching Standard ML to freshmen, and further argues he won't be teaching Object Oriented Programming because it's "both anti-modular and anti-parallel by its very nature, and hence unsuitable for a modern CS curriculum."

Three very interesting blog posts by the same person, clearly a well-educated computer science professor who knows more than I do. I mean, I like programming languages and I'm working on my own dynamically typed functional programming language, but Robert Harper wrote a book, one I certainly couldn't write. So who am I to judge?

Parallelism and Concurrency are Both Important

What I see missing from Robert Harper's writing is any attention paid to concurrency. He pays intense attention to parallelism, recognizes parallelism as important for the future, and strongly advocates functional languages as a great way to address the problems of parallelism. I have absolutely no bone to pick with him here and wish him great luck with his courses at CMU which address this problem domain.

However, my interests primarily lie in the realm of concurrent computing, and particularly in the area of distributed computing. In the area of distributed computing I find dynamic languages particularly important and think Robert Harper's article on static vs. dynamic languages omits some of the advantages of dynamic languages which make them work better in distributed systems.

This is a weighty subject, and one in which I don't think my own ideas alone make a particularly cogent argument. For some opening remarks, I will defer to Joe Armstrong, creator of the Erlang programming language, and his opinion from the book "Coders at Work" by Peter Seibel. As a bit of context to this quote, Joe is discussing the potential advantages that a static type system might confer upon Erlang. But then he gets to the disadvantages...
...the static type people say, "Well, we really rather like the benefits of dynamic types when we're marshaling data structures." We can't send an arbitrary program down a wire and reconstruct it at the other end because we need to know the type. And we have—Cardelli called it a system that's permanently inconsistent. We have systems that are growing and changing all the time, where the parts may be temporarily inconsistent. And as I change the code in a system, it's not atomic. Some of the nodes change, others don't. They talk to each other—at certain times they're consistent. At other times—when we go over a communication boundary—do we trust that the boundary is correct?
Type Systems and the CAP Theorem

There are two particularly sticky problems when it comes to the use of type in distributed systems. The first is the issue of serialization, or marshaling, of particular states. One way or another this is a solvable problem, both for statically typed and dynamically typed languages. I really don't want to delve too deep into this issue as it distracts from the larger point I'm trying to make, but in general, I think this is an easier problem to solve in dynamic type systems. I would also like to note that serialization formats which vomit their types all over the protocol and the codebase are outgrowths of static type systems. I'm looking at you, CORBA, SOAP, Protocol Buffers, and Thrift. On the flip side, systems which choose a minimal, semi-universal type system, such as JSON, BSON, BERT, and Msgpack, are all outgrowths of dynamic type systems. If I have a point to make here, I think it's that these systems are outgrowths of two very different ways of thinking about the same problem.

Marshaling is still a very important topic in distributed systems. Erlang largely abstracts this problem away from the programmer, allowing distributed nodes to exchange Erlang "terms" between processes on distributed nodes in the exact same way one would exchange messages between two processes located on the same host. The process of serializing that data, transmitting it over the network, receiving it via TCP, decoding it, and sending it to the appropriate Erlang process, is completely transparent to the end user. This is an extraordinarily powerful abstraction.

While statically typed languages can attempt to marshal data in a minimalistic JSON-like type system, this typically isn't the case. Instead, statically typed languages generally seem to prefer to vomit their types all over the protocol. The boilerplate code needed to marshal/unmarshal particular types can be generated automatically by a declaration of the types and methods of a particular protocol, such as the WSDL files used by SOAP. Again, users of statically typed languages could reduce the state of particular entities in their system to one which could fit into a minimalistic type system, but for static languages this is still a manual process, or one which requires manual code generation. In a language like Erlang which is built from the ground up to be distributed, dynamic, and operate around a minimalistic type system, serialization and deserialization can happen completely automatically.

Why is this important in a distributed system? Because, to paraphrase Joe Armstrong, distributed systems are messy. Imagine an Erlang-like distributed system that's statically typed. In order for such a system to work effectively, all nodes in the system must have the exact same code loaded and therefore have a universal consensus on what the types in the system are. This has huge implications on the theoretical properties on such a system. In order for a distributed system to agree on the nature of all types, it must be consistent.

However, if you're familiar with the CAP theorem, you may recognize the inherent problem this may cause. The CAP theorem gives you three options: a consistent fault-tolerant system, a highly available fault-tolerant system, or a consistent highly available system which breaks at the first sign of a fault. Only two of these options provide the consistency needed to ensure universal agreement on the types in the system such that automatic marshaling/unmarshaling of static types is even possible. In a distributed system, you either must give up universal agreement on the types, or sacrifice availability.

To quote Joe again, distributed systems are composed of parts which are "growing and changing all the time" with "parts may be temporarily inconsistent." While there aren't any guarantees that distributed systems built around dynamic type systems will work, inconsistent statically typed systems with disagreements about types are guaranteed not to work. Dynamic systems not only provide the possibility that your system may continue to function with different versions of the code loaded at the same time, but the ability for the programmer to plan for this contingency and offer ways to mitigate it. It's possible this may result in errors, but it may work, whereas incompatible type definitions are universally guaranteed to create errors. In a distributed environment, dynamic type systems provide extensibility, whereas static type systems actively seek to preclude it.

Something I've seen time and time again in systems like SOAP and even Thrift and Protocol buffers is attempts by programmers to escape the constraints of the type system, which almost universally fall into proprietary ways to store key/value pairs. One SOAP API I'm working with now provides "Maps" with "Entry" tags that have a key attribute and an associated value. Another implementation provides an array of "NameValuePair" objects. These solutions seem ugly, but in my opinion, their intentions are not. These are people seeking to extend running systems without the need to completely redefine the protocol. That's very much a practical concern.

Distributed Applications Must Be Flexible

In order for distributed programming to work effectively, nodes need to be able to call functions on each other without the need for programmers to write custom marshaling/demarshaling code for each type. The marshaled data needs to work extensibly, so that nodes running different versions of the code can still talk to each other in a forwards and backwards compatible manner.

Protocols will change over time. Older versions of the code need to work with a newer protocol, and vice versa, older versions of the protocol need to work with newer code. Nodes should be upgraded as practicality dictates. Perhaps your system administrator begins an upgrade and you lose access to a datacenter, because a janitor at your upstream ISP spilled a bucket of mopwater all over their datacenter's raised floor and caused a huge electrical disaster. Now your distributed application is running two different versions of the code, and it's suffered a network partition. Does this mean it should just break when the network partition is fixed?

Erlang has shown us it's possible to recover from these kinds of conditions. Even when we can't change code atomically across the entirety of a distributed application, it should still continue to operate without breaking. Distributed systems should be able to grow and change all the time without rigid boundaries.