Tuesday, April 17, 2012

Introducing DCell: actor-based distributed objects for Ruby

DCell, which is short for Distributed Celluloid (and pronounced like the batteries you used to jam into boom boxes and RC cars), is an actor-based distributed object-oriented programming framework for Ruby. Celluloid is an actor library I wrote for Ruby which exposes concurrent Ruby objects that "quack" just like any other Ruby object. DCell takes the asynchronous messaging protocol from Celluloid and exposes it to distributed networks of Ruby interpreters by marshaling Celluloid's messages as strings and sending them to other nodes over 0MQ.

Before I talk about DCell I'd like to talk a little bit about the history behind distributed objects in general and the ideas that DCell draws upon.

A Brief History of Distributed Objects

Nowadays you don't hear people talking much about distributed objects. However, once upon a time, distributed objects were big business. They used to be one of Steve Jobs' passions during his time at NeXT. In the mid-90's, Steve Jobs phased out NeXT's hardware division and began repositioning the company as, among other things, the "largest object supplier in the world." NeXT turned its focus exclusively to its software, namely the WebObjects framework for building dynamic web sites. It would also ship the Enterprise Objects Framework, which Steve described as allowing you to "make NeXTSTEP objects, and with no programming, have persistent and coherent storage with SQL databases" (sound a little bit like ActiveRecord a decade before ActiveRecord, anyone?).

WebObjects was built on a technology called Portable Distributed Objects, which allowed Objective C applications developed on any platform to seamlessly interoperate over computer networks and across platforms. If you haven't seen it, watch Steve Jobs' 1995 presentation The Future of Objects (although you may want to skip directly to where Steve begins talking about distributed objects). Even now, some 16 years later, these demos still seem futuristic. Steve loved the simplicity of distributed objects: "Objects can message objects transparently that live on other machines over the network, and you don't have to worry about the networking gunk, and you don't have to worry about finding them, and you don't have to worry about anything. It's just as if you messaged an object that's right next door."

So why do some of Steve's demos still seem futuristic and slightly beyond our grasp a decade and a half later? Why is building multi-tier client/server (i.e. web) applications still a challenge that typically involves writing lots of client and server wrapper code instead of transparently invoking distributed objects? Unfortunately, Steve's rosy future of distributed objects never really came to pass. NeXT's technology was commercial, only available in Objective C (at a time when C++ reigned supreme and no one used Objective C), and couldn't beat the open web and the standards that would emerge for developing web APIs. Meanwhile, the industry standards for distributed objects would be mired in debacles of their own.

In the before time, in the long long ago, in the days before JSON, XML, and REST, when Tim Berners-Lee had just started hosting the world's first web site on his personal NeXTstation, C++ was the lingua franca of the day and people were just getting started making C++ programs communicate over networks. The Object Management Group (with the now-unfortunate acronym OMG) hashed out the necessary wire protocols, object representations, interface definitions, and a profusion of other standards needed to allow C++ programs to invoke remote objects over TCP/IP networks. The result was CORBA.

CORBA is a technology that drowned in its own complexity; however, in the early '90s it was the Enterprise: Serious Business way to allow applications to communicate over networks. Once you waded through the myriad complexities of the various CORBA technologies, you were left with objects you could invoke over the network in a manner somewhat reminiscent of the way you would invoke an object locally.

However, the web soon took off, and HTTP became the de facto protocol programs used to communicate. Unfortunately, HTTP is a terrible protocol for implementing distributed objects. Where HTTP exposes a predefined set of verbs you can use to manipulate resources which have a number of possible representations, objects don't have representations: they are just objects, and you talk to them with messages. This didn't stop people from trying to build a distributed object protocol on top of HTTP though. The result was SOAP, a protocol which abandoned CORBA's "orbs" for web servers, its Interface Definition Language for WSDL, its Common Data Representation for XML, and its Inter-Orb Object Protocol for HTTP.

This was something of a step in the right direction: SOAP actually was a "Simple" Object Access Protocol when compared to CORBA, and it would soon vanquish CORBA for Enterprise: Serious Business applications. While SOAP gained a few fans, particularly in the Java and .NET communities, who saw the value of being able to expose objects to the network with a few point-and-clicks that generated gobs of XML, it would soon join CORBA in the ranks of reviled technologies.

SOAP's complexity comes first and foremost from being a committee standard that succumbed to "too many cooks" syndrome in the same way CORBA did. It also suffered from trying to be a cross-language protocol that needed to deal with static type systems, requiring tools which could read WSDL definitions and spit out volumes of generated boilerplate code for interacting with remote services. Beyond that, SOAP suffered from an impedance mismatch with HTTP, largely ignoring the features HTTP provides and using it as little more than a transport wrapper for shoving blobs of XML across the network. The XML contained the actual messages to be sent to remote objects or the responses coming from a remote method invocation, while anything done at the HTTP level itself was just boilerplate.

REST to the rescue?

For all its complexity, it was easy to lose sight of what SOAP was actually trying to do. Rather than painlessly interacting with remote objects, SOAP left us wondering why things were so slow, staring at WSDL errors wondering what went wrong, and picking through gobs and gobs of XML trying to debug problems. REST, which eschewed distributed objects and favored the paradigms of HTTP, would be SOAP's coup de grâce. SOAP is now relegated to a handful of legacy enterprise applications, whereas the rest of the open web has almost universally embraced REST.

So REST triumphed and web APIs reign supreme. Distributed objects are little more than a footnote in history. And we're still left wondering why putting together the sorts of demos Steve Jobs was showing with Portable Distributed Objects in 1995 is so hard.

Using REST makes sense when exposing services for third parties to use. However, if you control both the client and server, and have Ruby frontends talking to Ruby services, REST begins to look like a bunch of gunk that's getting in the way:

Implementing domain objects, REST services, and REST clients becomes work duplicated in 3 places across systems using this sort of architecture. Wouldn't it be a lot simpler if the frontend web application could simply consume remote services as if they were just objects?

This sort of simplicity is the goal of distributed objects.

Distributed Objects in Ruby

Ruby has its own foray into the world of distributed objects: DRb, or Distributed Ruby. DRb exposes Ruby objects to the network, each uniquely identified by a URI. Clients can ask for a DRb object by URI and get a DRbObject back. These DRbObjects act as proxies, intercepting calls and sending them over the network, serializing them with Ruby's Marshal, where they're handled by a remote DRbServer. DRbServer uses a thread-per-connection model, allowing it to concurrently process several requests. When a DRb connection handler receives a request, it looks up the requested object, invokes a method on it, then serializes the response and sends it back to the caller.
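
Here's roughly what that looks like in practice (the TimeServer class and the URI are just for illustration; DRb itself ships with Ruby's standard library):

    require 'drb/drb'

    # --- in the server process ---
    class TimeServer
      def now
        Time.now
      end
    end

    # Expose the object at a dRuby URI; DRbServer handles each connection in its own thread
    DRb.start_service("druby://127.0.0.1:8787", TimeServer.new)
    DRb.thread.join   # block so the server process stays alive

    # --- in a client process ---
    DRb.start_service   # start DRb on the client side as well (standard idiom)
    remote = DRbObject.new_with_uri("druby://127.0.0.1:8787")
    remote.now   # the call is marshaled, sent over the socket, invoked remotely, and the result returned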

For the most part, DRb has lingered in obscurity, though the recent publication of the dRuby book has stirred up a modicum of interest. Where Steve Jobs thought PDO was "by far the easiest way to build multi-tier client/server applications because of [its] completely transparent distributed object model," Rubyists don't turn to DRb to build multi-tier applications, but instead typically rely on REST, building out APIs for what is, in the end, distributed communication between Ruby objects.

Why is it then that DRb isn't the go-to tool people use for building multi-tiered web applications in Ruby? It's easy to say that DRb failed because people are used to thinking in terms of HTTP and can better understand the semantics of the system when using HTTP, especially when it comes to areas like load balancing and caching. Separating services with HTTP also opens up the door to reimplementing those services in a different language in the future. However, even with future-proofing for a rewrite in another language out of the picture, I think most Rubyists would still choose to use REST APIs instead of DRb, and I think that's a defensible position.

While DRb does a great job of trying to make access to remote objects as transparent as possible, it has a number of flaws. DRb is inherently multithreaded but doesn't give the user any sort of tools or framework to manage concurrent access to objects. This means building DRb applications immediately exposes you to all the complexities of multithreaded programming, whether you're aware of it or not, and Rubyists seem generally uncomfortable with building thread-safe programs. And while DRb allows you to talk to in-process objects the same way you'd talk to out-of-process objects, it doesn't make it natural to build a program that way.

Beyond Objects: The Power of Distributed Erlang

While CORBA and SOAP are reviled for their complexity, there's another distributed system which is beloved for its high level of abstraction: Distributed Erlang. Erlang is, if anything, a highly opinionated language whose central design goal is to build robust self-healing systems you never need to shut down. When it comes to distribution, Erlang's goal is to make it as transparent as possible. Erlang is a dynamic language which insists you express everything within a scant number of core types. This makes serializing state in Erlang so you can ship it across the wire extremely simple and fast.

However, the real strength of Erlang is the Actor Model, which can be more or less summarized as follows:
  1. Actors communicate by sending messages
  2. Every actor has a mailbox with a unique address. If you have an actor's address, you can send it messages
  3. Actors can create other actors, and when they do, they know the new actor's address, so they can send it messages
  4. Actors can pass addresses (or handles) to themselves or other actors in messages
Erlang uses these rules within individual VMs as the basis of its concurrency model. Erlang actors (a.k.a. processes) all run concurrently and communicate with messages. However, Erlang also supports distribution using the exact same primitives it uses for concurrency. It doesn't matter which type of actor you're talking to in Erlang, they "quack" the same, and thus Erlang has you model your problem in a way that provides both concurrency and distribution using the same abstraction.

Distributed Erlang offers several features aimed at building robust distributed systems. The underlying messaging protocol is asynchronous, allowing many more messaging patterns than traditional RPC systems (e.g. HTTP) which use a request/response pattern that keeps the client and server in lockstep. Some examples of these patterns are round robin (distributing messages across N actors), scatter/gather (distributing computation across N actors and gathering the results), and publish/subscribe (allowing actors to register interest in a topic, then informing them of events related to that topic).

In addition, Erlang processes can link to each other and receive events whenever a remote actor exits (i.e. if it crashes). This allows you to build robust systems that can detect errors and take action accordingly. Erlang emphasizes a "fail early" philosophy where actors are encouraged not to try to handle errors but instead crash and restart in a clean state. Linking allows groups of interdependent actors to be taken down en masse, with all of them restarting in a clean state afterward. Exit events can also be handled, which is useful in distributed systems for things like leader election.

DCell provides all of these features. When you create an actor with Celluloid, a proxy object to the actor is returned. This proxy lets you use the method protocol to communicate with an actor using messages. DCell implements special marshalling behavior for these proxy objects, allowing you to pass them around between nodes in a DCell system and invoke methods on remote actors in the exact same way you would with local actors.

Unlike DRb, DCell also exposes asynchronous behaviors, such as executing method calls in the background, as well as futures, which let you schedule a method invocation now and wait for the result later. DCell also lets distributed actors link to each other and be informed when a remote actor terminates.
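
As a rough sketch, assuming a cluster is already up with an actor registered as :worker on a node whose ID is "other" (those names are made up, and the exact async/future syntax has shifted between Celluloid/DCell versions):

    node   = DCell::Node["other"]   # handle to the remote node
    worker = node[:worker]          # proxy to the actor registered there as :worker

    worker.process("some job")                    # synchronous: blocks until the remote actor replies
    worker.process!("some job")                   # asynchronous: fire-and-forget, no reply expected
    future = worker.future(:process, "some job")  # kick the call off now...
    future.value                                  # ...and block for the result later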

I'm certainly not the first to imitate Erlang's approach to distribution. It's been seen in many other (distributed) actor frameworks, including the Akka framework in Scala and the Jobim framework in Clojure.

Bringing Erlang's ideas over to Ruby

I have a long history of projects that try to cross-pollinate Ruby and Erlang. My first attempt was Revactor, a previous actor library of mine that provided a very raw, low-level API almost identical to the Rubinius Actor API. Revactor modeled each actor as a Fiber and thus provided no true concurrency. Another of my projects, Reia, tried to bring a more friendly syntax and OO semantics to Erlang.

With Celluloid I've come full circle, trying to implement Erlang's ideas on Ruby again. Only this time, Celluloid makes working with actors easy and intuitive by embracing the uniform access principle and allowing you to build concurrent systems that you can talk to just like any other Ruby object. Celluloid also provides asynchronous calls (what Erlang would call a "cast") where a method is invoked on the receiver but the caller doesn't wait for a response. In addition, Celluloid provides futures, which allow you to kick off a method on a remote actor and obtain the value returned from the call at some point in the future.
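
For example, here's a minimal Celluloid actor using the bang-method async calls and future(:name) form from the Celluloid releases of this era (later versions moved to .async and .future proxies):

    require 'celluloid'

    class Fibber
      include Celluloid

      # deliberately slow recursive Fibonacci, so there's something worth backgrounding
      def fib(n)
        n < 2 ? n : fib(n - 1) + fib(n - 2)
      end
    end

    fibber = Fibber.new               # boots an actor with its own thread and mailbox
    fibber.fib(10)                    # synchronous call: blocks and returns 55
    fibber.fib!(30)                   # asynchronous "cast": returns immediately, result discarded
    future = fibber.future(:fib, 30)  # runs in the actor; fetch the value whenever you want
    future.value                      # blocks until the actor finishes, then returns 832040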

In addition, Celluloid embraces many of Erlang's ideas about fault tolerance, including a "crash early" philosophy. Celluloid lets you link groups of interdependent actors together so that if any one fails, you can crash the entire group. Supervisors and supervision trees automatically restart actors in a clean state whenever they crash.
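
A hedged sketch of what that looks like (supervise_as and the actor registry are Celluloid APIs from this era, though the details have moved around between versions):

    require 'celluloid'

    class Worker
      include Celluloid

      def divide(x, y)
        x / y   # a ZeroDivisionError here crashes the actor
      end
    end

    # Run the actor under a supervisor and register it by name
    Worker.supervise_as :worker

    Celluloid::Actor[:worker].divide(10, 0) rescue nil  # crashes the actor; the error is re-raised to the caller
    sleep 0.1                                           # give the supervisor a moment to restart it
    Celluloid::Actor[:worker].divide(10, 2)             # => 5, served by a fresh copy in a clean state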

Celluloid does all of this using an asynchronous message protocol. Actors communicate with other actors by sending asynchronous messages. A message might say that an actor has crashed, that another actor is requesting a method be invoked, or that a method invocation is complete and the response is a given value. All of the heavy lifting for building robust, fault-tolerant systems is baked into Celluloid.

When programs are factored this way, adding distribution is easy. DCell takes the existing primitives Celluloid has built up for building concurrent programs and exposes them onto the network. DCell itself acts as little more than a message router, and the majority of the work in adding fault tolerance is still handled by Celluloid.

Getting Started with DCell

To understand how DCell works we need to look at how a cluster is organized. This is an example of a cluster with 5 nodes:



In this picture the green nodes represent individual Ruby VMs. The links between the nodes are shown in black or gray to illustrate actively connected or potentially connected nodes. DCell makes connections between nodes lazily as actors request them. This means DCell clusters are potentially fully connected networks where each of the nodes is (or can be) directly connected to every other node in the cluster. DCell doesn't implement any sort of routing system or overlay network, and instead depends on all nodes being directly accessible to each other over TCP.

DCell uses 0MQ to manage transporting data over the network. 0MQ supports several different messaging patterns, and in the future DCell may use more of them, but for the time being DCell uses PUSH/PULL sockets exclusively. This works well because Celluloid's messaging system is asynchronous by design: each node has a PULL socket that represents the node's mailbox, and the other nodes on the network have a PUSH socket to send that node messages.
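
To make that concrete, here's what a PUSH/PULL pairing looks like using the ffi-rzmq gem directly (this is only an illustration of the socket types; DCell manages its own sockets internally):

    require 'ffi-rzmq'

    context = ZMQ::Context.new

    # The "mailbox" side: a PULL socket bound to the node's address
    mailbox = context.socket(ZMQ::PULL)
    mailbox.bind("tcp://127.0.0.1:7777")

    # A peer that wants to send this node messages connects a PUSH socket to it
    peer = context.socket(ZMQ::PUSH)
    peer.connect("tcp://127.0.0.1:7777")
    peer.send_string("a marshaled Celluloid message would go here")

    message = ""
    mailbox.recv_string(message)   # ffi-rzmq fills in the string passed by reference
    puts message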

To configure an individual DCell node, we need to give it a node ID and a 0MQ address to bind to. Node IDs look like domain names, and 0MQ addresses look like URLs that start with tcp:
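
Something along these lines (the option names may differ slightly between DCell versions, so treat this as approximate):

    require 'dcell'

    DCell.start :id => "node1.example.org", :addr => "tcp://127.0.0.1:7777"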


To create a cluster, we need to start another Ruby VM and connect it to the first VM. Once you have a cluster of multiple nodes, you can bootstrap additional nodes into the cluster by pointing them at any node, and all nodes will gossip about the newly added node:
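
Starting the second node looks something like this; note that the :seed option below is a hypothetical placeholder for however your DCell version points a new node at an existing one:

    require 'dcell'

    # Second VM: bind to its own address and point it at an existing node
    # (:seed is a hypothetical placeholder; check the DCell docs for the real option name)
    DCell.start :id => "node2.example.org", :addr => "tcp://127.0.0.1:7778",
                :seed => "tcp://127.0.0.1:7777"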

Once you are connected to another node, you can browse the available nodes using DCell::Node.all:
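
For instance, from an irb session on either node you'd see something along these lines (the inspect output is approximate):

    >> DCell::Node.all
     => [#<DCell::Node[node1.example.org] @addr="tcp://127.0.0.1:7777">,
         #<DCell::Node[node2.example.org] @addr="tcp://127.0.0.1:7778">]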


To invoke a particular service on a node, obtain a handle to its node object, then look up an individual actor by the name it's registered under. By default, all nodes run a basic information service which you can use to experiment with DCell:
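
For example (the :info name is the registered info service actor; the uptime method shown here is illustrative, so check the info service source for what it actually exposes):

    node = DCell::Node["node2.example.org"]   # handle to a remote node
    info = node[:info]                        # proxy to the actor registered there as :info
    info.uptime                               # e.g. => 1979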

To implement your own DCell service, all you have to do is create a Celluloid actor and register it on your node. See the info service source code for an example.
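
As a rough sketch, assuming registration via supervise_as (which both supervises the actor and registers it by name; the registration details may vary by version):

    require 'dcell'

    class GreetingService
      include Celluloid

      def greet(name)
        "Hello, #{name}!"
      end
    end

    DCell.start :id => "greeter.example.org", :addr => "tcp://127.0.0.1:7779"
    GreetingService.supervise_as :greeter

    # From any other node in the cluster:
    #   DCell::Node["greeter.example.org"][:greeter].greet("world")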

DCell Explorer

DCell also includes a simple web UI for visualizing the state of a particular DCell cluster. To launch the web UI, run:
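
Something like the following (I'm writing the constructor arguments from memory, so treat the exact signature as approximate):

    require 'dcell/explorer'

    DCell::Explorer.new("localhost", 8000)   # serves the dashboard at http://localhost:8000/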


Then go to http://localhost:8000/ (provided you used the same host/port). You should see the following:


This is the basic dashboard that DCell's web UI provides. You can see connected nodes and their connection state, and if they're available, browse the info service to see various information about them.

We're just getting started...

DCell is only about half a year old and still relatively immature, but it's already seen a great number of contributors and a lot of attention. Some of the most exciting developments are still on the horizon, including Paxos-powered multicall support which will let you call a quorum of nodes, along with generalized group membership support with leader election, and generalized pub/sub.

All that said, I'd like to think that DCell is the most powerful distributed systems framework available for Ruby today, and I would love to see the remaining bugs ironed out and missing features added.

That can only happen if people are using DCell, finding bugs, and reporting missing features. This may sound a little bit scary, but if you're considering building a nontrivial distributed system in Ruby, DCell is a great place to start.