Tuesday, July 26, 2011

The Trouble with Erlang (or Erlang is a ghetto)


This is a blog post I have been meaning to write for quite some time. I lament doing so because I've made a considerable time investment into the Erlang infrastructure and really love some of its ideas. Erlang has done a great and still unique job of synthesizing a number of concepts in a very interesting way. But after using the platform in various capacities for some 4 years now, there are some glaring issues I think need to be called out.

Records suck and there's no struct/map data structure

Erlang has a feature called "records" which uses the preprocessor to give you something akin to a struct or map, i.e. a way to access named fields of a particular object/term within the system. As far as I can tell, there's pretty much universal agreement within the community that this is a huge limitation, and several proposals have been made to remedy the problem. The requested feature has typically been referred to as a "frame", and several proposals for implementing frames have been floating around for several years. Yet no action has been taken on the problem.

So why doesn't Erlang have frames? While Erlang is an open source project, its implementation and release cycle are managed by Ericsson, the company that created it, and Ericsson just doesn't seem to care. I'm not sure what Ericsson's priorities are when it comes to adding features to Erlang, but in my opinion they're doing a worse job of engaging the community than Oracle has been doing with Java. I hate Oracle as a company, but so far it feels like they've actually done a fairly good job managing Java development and moving Java forward. I can't say that at all with Ericsson, and frames are the quintessential example of this.

Erlang sucks at managing memory

Once upon a time I looked upon BEAM's design as the future pattern all virtual machines would follow. I strongly encourage you to read that post before taking issue with anything I have to say in regard to this matter. I have completely reversed my opinion since the time I write that post.

The other night I tweeted "If you're looking for a language that gets multicore concurrency right, look at how Azul implemented Java on their Vega architecture" and I definitely stand by that. Azul is a company that got a lot of smart hardware and software people together and had them work on designing a custom system which would scale to hundreds of CPU cores (up to 768 of them), heaps that topped 500 GB (up to 768GB), and had the GC pause only 10-20ms at a time. The realtime performance characteristics Azul managed to eek out of their system lead them to often describe their GC as "pauseless".

Where Azul was scaling up to 768 CPUs in 2007, Erlang was crapping out around 15 CPUs in 2009. For everything Erlang had to say about the importance of immutability and messaging in concurrent systems, and despite Joe Armstrong's promise that "your Erlang program should just run N times faster on an N core processor," it turns out that on the Erlang VM the N core processor promise had an upper bound of around 15.

Why is this? Erlang implements its own memory allocator and can't take advantage of libraries like tcmalloc to provide better multithreaded heap management. I can't fault a language VM like BEAM for doing this save for the fact that what Erlang provides is relatively crappy.

Erlang has done a fairly decent job given the constraints it was working within. Erlang wanted to provide a soft realtime system, and managed to create one that works on commodity architectures, unlike the Azul Vega appliances which require custom hardware. However, Azul has managed to port their version of the JVM to x86 hardware with their Zing Architecture, which wraps the JVM in a separate runtime container which uses software transactional memory to replace the hardware transactional memory found on the Vega appliances. It's higher overhead but provides similar guarantees. Java also provides the RTSJ specification for building realtime systems in Java.

Both Zing and RTSJ demonstrate that Erlang's approach to building a realtime garbage collected system, using separate heaps per process, isn't necessary to still provide realtime characteristics. Erlang's approach of using separate heaps is nonstandard and comparatively hard to optimize because most other systems are using a shared heap model. Azul's Vega architecture shows that shared heaps can scale up to hundreds of CPU cores and hundreds of gigabytes of heap while still providing realtime characteristics. Even more exciting is that AMD's Fusion architecture, which they're implementing in conjunction with ARM, provides read and write barriers at the hardware level necessary to provide a system like Azul using commodity hardware.

However, I think everything I just said is moot for the majority of applications. People building messaging systems want the best performance possible but don't typically have software realtime constraints. The Erlang VM's approach to soft realtime made a design decision which hampers its messaging speed, namely the use of separate heaps, which requires messages be copied from one heap to another. This means the Erlang VM does not provide zero-copy messaging. Every time you send a message from one Erlang process to another, some amount of data must be copied.

Erlang has partly mitigated this problem by providing a separate shared heap for binaries, which are the Erlang type for arbitrary blobs of binary data. This means if you ensure the majority of data you move around doesn't contain anything of significant size except binaries, perhaps this won't be a problem. However, if you're moving large collections of numbers around (Erlang's strings-as-lists-of-integers come to mind), messaging will be comparatively slow compared to a zero copy system.

Various solutions to this have been proposed for BEAM, such as switching from a shared-nothing heap to a shared heap or a hybrid heap (where message-passed objects are copied once), however the Erlang garbage collector is not suitable for managing shared/hybrid heaps and would need to be rewritten for the task, and nobody has managed to get the shared/hybrid heaps working with Erlang's SMP scheduler, or rewritten the garbage collector to be more suitable to the task of managing a shared/hybrid heap.

A potential solution to this? Erjang, an implementation of Erlang on the JVM, provides zero copy messaging using the Kilim library for lightweight threads.

JIT? What JIT?

Erlang has a "JIT" compiler called HiPE, which is mostly hype. I put JIT in quotes because HiPE is mostly an Erlang-to-native-code compiler with a limited set of backends which does a pretty bad job of optimizing and can't use runtime profiling information to improve the quality of the native code it generates in the way JIT compilers like HotSpot are able to. Calling HiPE a just-in-time compiler is a stretch as it is for most part an ahead-of-time native compiler for Erlang. The quality of native code produced by HiPE can be so poor that it's often outperformed by the userland bytecode interpreter implemented in BEAM.

HiPE can perform a very limited set of optimizations. In particular, Erlang code is factored into modules, and HiPE's inliner is unable to inline natie code across modules. This is due to HiPE's lack of a deoptimizer (a.k.a. deopt), or a way to translate JITed code back into bytecode, which is necessary in general but particularly necessary in Erlang for cases like hot code swapping. Deopt support is a feature of many JIT compilers in languages more popular than Erlang, most notably the HotSpot compiler on the JVM. Google's V8 virtual machine for JavaScript added deoptimization support as part of their "Crankshaft" compilation infrastructure.

Erlang isn't general purpose

Erlang hates state. It especially hates shared state. The only facility provided by the language for dealing with shared state in Erlang is called "Erlang Term Storage" and provides a Judy array that several Erlang processes can talk to. The semantics of ETS are fairly awkward and using it directly is difficult. Erlang has a baked-in database called Mnesia which is built on ETS. Mnesia's performance characteristics aren't great but it provides a friendlier face for ETS. These are the only solutions to shared state baked into the language.

What should you do if you want to deal with a shared-state concurrency program in Erlang? The general advice is: don't. Erlang isn't designed for solving shared-state concurrency problems. If you encounter a shared state concurrency problem while developing your Erlang program, sorry, you picked the wrong language. Perhaps you should move along... and Clojure offers you some great ways to tackle shared state concurrency problems.

The syntax is atrocious

I think this one goes without saying. That said...

Let me come at this from a different angle than you're probably expecting: I've recently started working with Clojure, and I have to say, I really think Erlang would've been a lot better off with a Lisp-like syntax than a Prolog-inspired syntax. To-date Erlang is the only popular language with a Prolog inspired syntax and all of the awkward tokens and gramatical constructions make me wish it just had a simple Lispy syntax. This has been implemented in Robert Virding's Lisp Flavoured Erlang, which is very cool and worth checking out.

That opinion might come as a surprise, because the main project I was developing in Erlang was Reia, a Ruby-like syntax and runtime for Erlang. I've discontinued this project, for many reasons, one of which is because it's been surpassed in features and documentation by a similar project, José Valim's Elixir. After years of working on Reia, I've really grown to believe I'd rather spend my time working on a language which incorporates Erlang's ideas, but on the JVM with mutable state.

The Erlang cargo cult would love to hang me out to dry for even saying that... so let me address it right now.

Immutable state sucks and isn't necessary for Erlang-Style Concurrency

Immutable state languages force object creation whenever anything changes. This can be partially mitigated by persistent data structures, which are able to share bits and pieces of each other because they're immutable. This works, for example, when attempting to create a sublist that consists of the last N elements of a list. But what if you want the first N elements? You have to make a new list. What if you want elements M..N? You have to make a new list.

In mutable state languages, performance problems can often be mitigated by mutating local (i.e. non-shared) state instead of creating new objects. To give an example from the Ruby language, combining two strings with the + operator, which creates a new string from two old ones, is significantly slower than combining two strings with the concatenating >> operator, which modifies the original string. Mutating state rather than creating new objects means there's fewer objects for the garbage collector to clean up and helps keep your program in-cache on inner loops. If you've seen Cliff Click's crash course on modern hardware, you're probably familiar with the idea that latency from cache misses is quickly becoming the dominating factor in today's software performance. Too much object creation blows the cache.

Cliff Click also covered Actors, the underpinning of Erlang's concurrency model, in his Concurrency Revolution from a Hardware Perspective talk at JavaOne. One takeaway from this is that actors should provide a safe system for mutable state, because all mutable state is confined to actors which only communicate using messages. Actors should facilitate a shared-nothing system where concurrent state mutations are impossible because no two actors share state and rely on messages for all synchronization and state exchange.

The Kilim library for Java provides a fast zero-copy messaging system for Java which still enables mutable state. In Kilim, when one actor sends a message, it loses visibility of the object it sends, and it becomes the responsibility of the recipient. If both actors need a copy of the message, the sender can make a copy of an object before it's sent to the recipient. Again, Erlang doesn't provide zero-copy (except for binaries) so Kilim's worst case is actually Erlang's best case.

The limitations of concurrent objects in Reia were solved using mutable state in my Celluloid concurrent object library for Ruby, but that deserves a blog post in and of itself.

Single assignment is just as problematic as destructive assignment

Erlang doesn't allow destructive assignments of variables, instead variables can only be assigned once. Single assignment is often trotted out as a panacea for the woes of mistakenly rebinding a variable then using it later expecting you had the original value. However, let me show you a real-world case that has happened to me on several occasions which wouldn't be an error in a language with destructive assignment and pattern matching (e.g. Reia).

There exists a complimentary case of mistaken variable usage to the afforementioned problem with destructive assignment. In single-assignment programs, it involves mistakenly using the same variable name twice excepting the variable to be unbound the second time:

The first pattern matching expression binds the Foo variable to something. In the second case, we've mistakenly forgot Foo was already bound. What's the result?

exception error: no match of right hand side... 

We get no compiler warning in this case. This is the type of error you only encounter at runtime. It can lay undetected in your codebase, unless you're writing tests. Know what other problem writing tests solves? Mistaken destructive assignments.

Single assignment is often trotted out by the Erlang cargo cult as having something to do with Erlang's concurrency model. This couldn't be more mistaken. Reia compiled destructive assignments into Static Single Assignment (SSA) form. This form provides versioned variables in the same manner as most Erlang programmers end up doing manually. Furthermore, SSA is functional programming. While it may not jive with the general idealism of functional programming, the two forms (SSA and continuation passing style) have been formally proven identical.

The standard library is inconsistent, ugly, and riddled with legacy

Should module names in the standard library be plural, like "lists"? Or should they be singular, like "string"? Should we count from 1, as in most of the functions found in things like the lists module, or should we count from 0 like the functions found in the array module? How do I get the length of a list? Is it lists:length/1? No, it's erlang:length/1. How do I get the Nth element of the tuple? Should I look in the tuple module? Wait, there is no tuple module! Instead it's erlang:element/2. How about the length of a tuple? It's erlang:tuple_size/1. Why is the length of a list just "length" whereas the length of a tuple is "tuple_size"? Wouldn't "list_length" be more consistent, as it calls out it works on lists?

When we call erlang:now() to get the current time, it returns {1311,657039,366306}.  What the hell does that mean? It's a tuple with three elements. How could time possible need three elements? A quick look at the documentation reveals that this tuple takes the form {Megaseconds, Seconds, Microseconds}. Separating out Microseconds makes sense... Erlang has no native decimal type so using a float would lose precision. But why split apart Megaseconds and Seconds?

Once upon a time Erlang didn't support integers large enough to store the combination of Megaseconds and Seconds, so they were split apart. The result is a meaningless jumble of three numbers, which you have to run through the confusingly named calendar:now_to_local_time/1 function to get a human meaningful result, which doesn't tell you what time it is now, but instead takes the tuple that erlang:now/0 returns as an argument and will spit back meaningful {Year, Month, Day} and {Hour, Minute, Second} tuples.

Legacy in the grammar

Try to use "query" as an atom in Erlang, e.g. {query, "SELECT * FROM foobar"}. What happens?

syntax error before: ','

This is because 'query' is a reserved word which was reserved for Mnemosyne queries. Never heard of Mnemosyne? That's because it's an archaic way of querying Erlang's built-in database, Mnesia, and has been replaced with Query List Comprehensions (QLC). However, it remains around for backwards compatibility.

You can't use "query" as a function name. You can't tag a tuple with "query". You can't do anything with "query" except invoke a deprecated legacy API which no one uses anymore.

Strings-as-lists suck

Erlang provides two ways of representing strings. One is as lists of integers, which is the traditional way that most of the library functions support. Another is binaries. Erlang has no way of differentiating lists of integers that represent strings from lists of integers that are actually lists of integers. If you send a list of integers in a message to another process, the entire list of integers is copied every time. On 64-bit platforms, every integer takes up 64-bits.

The obvious solution here is to use binaries instead of lists of integers. Binaries are more compact and exist in a separate heap so they aren't copied each time they're sent in a message. The Erlang ecosystem seems to be gradually transitioning towards using binaries rather than strings. However, much of the tooling and string functions are designed to work with list-based strings. To leverage these functions, you have to convert a binary to a list before working with it. This just feels like unnecessary pain.

The abstract concept of lists as strings isn't inherently flawed. In many ways it does make sense to think of strings as lists of characters. Lists as strings would probably make a lot more sense if Erlang had a native character type distinct from integers which was more compact and could avoid being copied each time a string is sent in a message like a binary. Perhaps in such a system it'd be possible to avoid transcoding strings read off the wire or completely transforming them to a different representation, which is costly, inefficient, and often times unnecessary (yes, this is a problem with Java too).

There's no "let"

Want a local binding in Erlang? Perhaps you've used let for this in a Lisp. What happens when you try to do this in Erlang? Even attempting to use "let" in Erlang just yields: syntax error before: 'let'

Once upon a time Erlang was supposed to get let bindings, and the "let" keyword was set aside for this purpose. But much like frames, it never happened. Instead, let is now an unimplemented reserved word which just breaks your programs.

There's no "nil"

In Clojure, I can write the following: (if false :youll-never-know).  This implicitly returns "nil" because the condition was false. What's the equivalent Erlang?

Erlang forces you to specify a clause that always matches regardless of whether you care about the result or not. If no clause matches, you get the amazingly fun "badmatch" exception. In cases where you don't care about the result, you're still forced to add a nonsense clause which returns a void value just to prevent the runtime from raising an exception.

Where do I go from here?

Again, I want to emphasize that I have a great deal of respect for Erlang conceptually. But at this point I'd like to take what I've learned and go elsewhere with it. One direction I've gone is the Celluloid concurrent object library for Ruby. You can read more about it in the original blog post I wrote about Celluloid, which is a bit out-of-date at this point. I have a forthcoming blog post which should dive a bit deeper into Celluloid's guts and how it can do things which aren't possible in Erlang.

As you've probably guess from the references sprinkled throughout this post, I'm learning Clojure. I'm a fan of the JVM and Clojure provides a great functional language for leveraging the JVM's features. I think the sort of things that I'd be writing in Erlang I'll try writing in Clojure instead. Clojure has elegant Lisp syntax. Clojure has maps. Clojure has powerful facilities for dealing with concurrent shared state problems. Clojure has great semantics for safely managing mutable state in a concurrent environment. Clojure has real strings. Clojure has let. Clojure has nil. Clojure runs on the JVM and can leverage the considerable facilities of the HotSpot JIT and JVM garbage collectors.

I'd also like to try my hand at creating a JVM language, especially with the impeding release of Java 7 this Thursday. Java 7 brings with it InvokeDynamic, a fast way to dispatch methods in dynamic languages, and considerably eases the difficulty of implementing dynamic languages on the JVM. Stay tuned for more details on this.

Wednesday, June 29, 2011

Why I'm stopping work on Reia

Some of you may have seen Elixir, a Ruby-inspired language for the Erlang VM, created by
José Valim. Whenever I see posts on HN or proggit about Elixir, I check out the comments, and almost every time someone asks about Reia, specifically how the two languages compare.

Reia and Elixir are extremely similar. If you check out their respective codebases, there's an almost bizarro world similarity between them. The project layout is almost identical, as is the build process and associated scripts to drive various parts of the language like the compiler and REPL. My impression of the code after looking over it was we were using largely the same mechanisms, and where they differed, José's solutions were something I had been thinking of changing, or something entirely new I hadn't thought of which seemed pretty cool.

At this point, in all respects Elixir is farther along than Reia and a better implementation. I think it's pretty bad to have two competing niche languages with nearly identical features.

So I'll be discontinuing work on Reia. The codebase and the web site will remain up on Github for however long Github wants to host them. Celluloid is Reia's spiritual successor. The main feature I always wanted out of Reia was a concurrent object system, and Celluloid implements that on top of Ruby.

Wednesday, May 11, 2011

Introducing Celluloid: a concurrent object framework for Ruby

I've spend a lot of time recently working on a new concurrency library for Ruby called Celluloid. In short, its goal is to make working with threads as easy as working with Ruby objects in most cases, while still remaining loaded with all sorts of power user features for the edge cases. It's heavily inspired by Erlang and the actor model (hence "celluloid") and represents my best effort to port as many of these concepts over while making them more Ruby-like. This is the first in what I hope will be a series of blog posts describing various concurrency problems you might encounter in Ruby and how Celluloid can help solve them.

If you're already sold on using threads in Ruby, feel free to skip the next section of this post. However, as threads have remained a perpetual pariah in Ruby as a language, I feel some explanation is in order as to why you might actually consider using them.

Ruby and Threads

Rubyists generally don't like threads. There's plenty of good reasons to dislike threads: they're error prone for end users and the original implementation of threads in the Matz Ruby Interpreter was pretty crappy and broken in multiple ways. Even with the latest YARV interpreter found in Ruby 1.9, a global lock prevents multiple threads from running concurrently.

On the flip side, if you need multicore concurrency Ruby processes are cheap and there are some pretty good libraries like DRb for allowing Ruby VMs to work together. But even then most people are using Ruby to write stateless webapps that store all state in a database, so you can just run multiple Ruby VMs which all have the same application loaded to leverage multiple CPUs in a machine.

I used to be in the thread-hater camp, having cut my teeth on multithreaded C programs which left a bitter taste in my mouth, but recently I've changed my tune. This is mainly due to the great work of the JRuby and Rubinius teams to add true multicore concurrency to their Ruby implementations. JRuby has supported multicore concurrency via threads for awhile, and Rubinius is adding it in their hydra branch. With these Ruby interpreters, you can run one virtual machine per host and the threads you create will be automatically load balanced among all available CPU cores.

This has immediate benefits for things like Rails applications which enable thread safe mode. Rails will automatically create a new thread per request, allowing one VM to service multiple requests simultaneously. On interpreters like JRuby and Rubinius Hydra, this means you can run just a single VM per host and your application will utilize all available CPU cores. All the memory overhead of loading multiple copies of your application is mitigated, and as an added benefit you can take advantage of the better garbage collection these VMs (particularly the JVM) offer.

There is a catch: libraries can't share state across threads without using some form of thread synchronization. This is often trotted out as a persistent worry of those who prefer to run their Rails apps in the standard single threaded mode. Those gem authors, who knows what they're doing? Maybe they're using global variables!  People don't think about this sort of stuff in Ruby, so shouldn't we just assume that 100% of Ruby libraries aren't thread safe per default?

The truth, at least for things like Rails apps, is that the general way they operate typically eschews thread safety issues. Ruby as a language favors object creation over mutating existing objects, and webapps generally create a new set of objects per request and don't provide mechanisms for sharing state between connections due to their stateless nature. In general, webapps are stateless and don't do things which will share state between threads.

If you do intend to go thread safe on your Rails app, you should certainly do your due diligence for auditing the libraries you use for unsafe usage of global and class variables, but in general I think the worries about running Rails apps in multithreaded mode are overblown. Ruby has much better semantics for promoting thread safety than other languages that have made the leap from single threaded to multithreaded (e.g. C/C++), and those languages have managed to make the transition with countless applications running in a multithreaded mode.

In the two years I've been deploying thread safe Rails applications, I've encountered exactly one thread safety bug, and that was in a library that originally claimed to have a specific thread safe mode but removed it from later releases and I unfortunately didn't catch that they had done so. The fix was simple: just create a thread-specific instance of an object I needed rather than sharing one across all threads. I won't say finding the bug was easy peasy, but all in all I don't think one bug was a bad price to pay for all the benefits of moving to a multithreaded deployment.

Concurrent Objects: How do they work?

Celluloid's concurrent objects work like a combination of normal Ruby objects and threads. You can call methods on them just like normal Ruby objects. To create a concurrent object, declare a class that includes the Celluloid::Actor module:

Then call the spawn method on the class:

This creates a new concurrent Charlie Sheen object. Calling the current_status method on it returns the normal value we'd expect from a method call. If an exception is raised, it will likewise be raised in the scope of the caller. But behind the scenes, all these things are happening in a separate thread.

Let's say things aren't going so well for Charlie. Instead of winning, Charlie is fired:

How can we help Charlie win again?

Calling Sheen#win! here does something special: it executes the method asynchronously. Adding a ! to the end of any method name sends a message to a concurrent object to execute a method, but doesn't wait for a response and thus will always return nil. You can think of this like signaling an object to do something in the background, and perhaps you'll check on the result later using normal synchronous method calls.

Using a ! to call a method asynchronously follows the Ruby convention of predicate methods with a bang on the end being "dangerous." While there are certain dangers of asynchronous methods (namely in how errors are handled), providing thread safe access to instance variables is not one of them. Charlie is running in his own thread, but there's no need to synchronize access to his private variables. This is where Celluloid's secret sauce comes in.

Charlie maintains a queue of pending method calls and executes them one at a time in sequence. Celluloid uses and asynchronous messaging layer that you can communicate with using normal Ruby method call syntax. However, when you call a method on a concurrent object in Celluloid, the "message" you send is quite literal and takes the form of a request object which waits for a response object (instances of Celluloid::Call and Celluloid::Response respectively).

This approach is largely inspired by the gen_server abstraction within the Erlang/OTP framework. For you Erlang nerds who might be worried Celluloid tries to jam everything into gen_server-shaped boxes, let me say right away that isn't the case, but you will have to wait for a further installment of my blog to find out why.

Celluloid by Example: Parallel Map

Let's start with a simple, practical, real-world example. If you're interested in digging deeper into the guts of Celluloid before starting this, I'd suggest you check out the README. That said, let's start with a simple problem: how can we implement a parallel map? That is to say, how can we reimplement Enumerable#map such that all of the map operations are performed in parallel rather than sequentially?

As this is a contrived and relatively simple problem, I'll go ahead and share with you how you might do it using Ruby threads as opposed to using Celluloid:

This version alone winds up being all you need to accomplish simple parallel map operations in Ruby. Here are some examples of using it from irb:

This pmap implementation behaves just like we'd expect map to. It returns the value of the block for each element if everything succeeds correctly, and raises an exception if anything goes wrong along the way.

Now I'd like to show you how to refactor this code to fit into the concurrent object pattern Celluloid uses. Let's start by trying to represent this same code using an object to perform the computation:

To turn this into a concurrent object, we first need to include Celluloid::Actor. To achieve concurrency, we need to make the method that performs the computation callable asynchronously. The initialize method is called synchronously by spawn (in case something goes wrong during initialization), so we'll need to create a separate method that actually calls the given block:

After that we can rewrite Enumerable#pmap using this class:

This creates a new Mapper actor for each element and calls Mapper#run asynchronously on each of them. After every one of them is executing they're iterated again, this time checking the return value. Since actors can only process one method call at a time, the call to Mapper#value will block until Mapper#run has completed, even though Mapper#run was called asynchronously.

This approach of allowing a value to be computed in the background and then only blocking when the value is requested is called a future. You've now seen how to implement a future, but it's also baked directly into Celluloid itself. Here's how to implement Enumerable#pmap using Celluloid::Futures:

Like Mapper, Celluloid::Future takes arguments, passes them to a block, then runs that block in the background asynchronously. Only when the value is requested does it block the current thread.

Now that we have a pmap function, what can we do with it? How about we compare the time it takes to do a bit of Google Image Search screen scraping for different color images for a particular search term using regular map vs. pmap?

The performance metrics vary across Ruby implementations, but in general, the parallel version goes approximately 3.5x faster, and the Celluloid version is 5-10ms slower than the version written using raw Ruby threads.

While this example is fairly trivial, in the next installment of this blog I'd like to demonstrate how to write a Sinatra-based chat server similar to node_chat.

What does this mean for Revactor?

In 2008 I wrote another actor library called Revactor, based on Fibers and evented I/O. While those ideas have grown increasingly popular, Revactor never did. I attribute this largely to Revactor's API, which was a fairly literal translation of Erlang's APIs into Ruby form with too little focus on putting an easy and accessible face on things. If you saw Mike Perham's recent article on actors in Rubinius (Revactor used a mostly identical API, as did MenTaLguY's Omnibus library), the code can be a little daunting, to the point you might need to learn a little Erlang just to figure out how to use it.

Celluloid is the official successor to Revactor. While Celluloid is primarily based on thread-based actors, it's been designed from the ground up with the idea of eventually incorporating event-based actors as well which can interoperate with an event library like EventMachine or cool.io. I know I originally dissed Scala for having both thread-based and event-based actors, but short of an Erlang-like process abstraction, it's not a bad compromise.

What about Reia?

One of the projects I've been working on the longest is Reia, a Ruby-like programming language for the Erlang VM. Today happens to be Reia's third birthday, and I do have to say after three years it's not where I thought it would be. It's generally usable but still missing some big features I feel are needed to make it a "complete" language. The main thing I feel is missing from Reia is a concurrent object model, and you can think of Celluloid as being an in-Ruby prototype of how it would work. I started to work on this in the legacy branch of Reia, but felt like it was too big of a project to tackle until I had fleshed out some of the other fundamentals of the language.

After I feel comfortable with how Celluloid is working I would like to try reimplementing in Reia. After that I think Reia will evolve into a truly useful language which bridges the gap between object oriented languages and Erlang-style concurrency.

I think Celluloid has the potential to be a truly useful library in Ruby on its own, however. It provides most of the underpinnings needed for more Erlang-like concurrent applications without having to switch to a different language.

Tuesday, April 5, 2011

Reia: Now with a PEG parser

I haven't blogged about Reia in so long. Mea culpa! It's about this time that people start to think "I haven't heard about Reia in awhile. It must be dead." No, Reia is very much alive my friends, and making some pretty interesting progress... I hope to be able to announce some important new features around Reia's 3rd birthday. Stay tuned for that, but first a preview of things to come...


An experimental branch of Reia is now available which uses Neotoma to generate its parser thanks to some pretty impressive contributions by Graeme Defty. Neotoma is a Parsing Expression Grammar (PEG) based parser generator for Erlang, inspired by Packrat. Neotoma's author Sean Cribbs has also released Neotoma 1.5 with contributions by Graeme Defty which uses Erlang binaries internally and is substantially faster than the previous version.

Why does a PEG matter? PEGs make short work of problems that are extremely complex to handle when you attempt to subdivide the problem of parsing into separate scanning and parsing steps, such as how a traditional lex/yacc (or in Reia's case, leex/yecc) parser would operate. If you've ever attempted to look at the source code of the Ruby parser, you'll find some strange and confusing feedback between the lexer and the parser.

What's this for? Well the foremost reason is: interpolated strings. Ruby allows you to embed arbitrary Ruby code inside of any string using the #{...} syntax. This is a really great feature and one I managed to half-assed implement in Reia because I strongly believe in its awesomeness. However, Reia's implementation is a bit brittle and has a lot of implementations. Just recently I switched to the latest greatest version of leex which ships with the Erlang runtime and had to make some rather arcane changes to some code that uses leex internals, just to continue to support a partially functioning string interpolation mechanism.

Now, toss in some more fun complexities: the awesome /.../ regex literal syntax. We all love it, but why doesn't every language have it? What you may not realize is that this syntax is ambiguous in certain cases with the / and /= operators like it is in other languages like JavaScript. Now, for added fun, toss in the fact that regular expressions can interpolate just like strings, and you're beginning to see the makings of a gramatical nightmare.

All of these things are easy with a PEG. PEGs blend tokenization with parsing and allow comprehension of a much wider range of languages than is possible without pulling your hair out using lex/yacc-like tools, and much more than that, it's a very natural process with PEGs. PEGs are also by nature right-recursive, something that works quite well in immutable state languages like Erlang that have to build their syntax trees from right-to-left due to the nature of immutable singly linked lists. These lists only let you append elements on the left.

This sounds well and good, but unfortunately I have some bad news. Even after a few rounds of performance tuning, using an experimental branch of Neotoma that uses binaries internally instead of lists, the PEG parser is still about an order of magnitude slower than the leex/yecc version. Talking to Neotoma's author Sean Cribbs, it sounds like Neotoma might be doing an excessive amount of unnecessary copying internally. If PEGs pique your interest and you know a bit of Erlang you might take a peek at Neotoma and see if you can find some potential performance optimizations.

I would still like to merge the peg branch of Reia, and the PEG grammar fixes a lot of quirks in the present yecc grammar, but I'm still holding out until the performance is improved to closer to the leex/yecc speeds.

Tuesday, March 22, 2011

Distributed systems and dynamic typing

There's a very good blog post by Robert Harper, a CMU CS professor, called "Dynamic languages are static languages." It's very enlightening and I strongly encourage you to read it. If I understand it correctly, which I freely admit I may not, the general idea is that dynamic languages can be thought of as static languages that have a single all-encompassing type. In that regard, dynamic languages are a proper subset of static languages. If you think I misinterpreted his post and I'm confused, please flame me in the comments.

In a previous post, called "Parallelism is not concurrency", he opines on a pet peeve of mine, namely how the terms parallelism and concurrency are nonchalantly and incorrectly interchanged. Parallelism applies to deterministic operations that operate on similar data in similar time. Some examples of parallel operations include rendering of 3D scenes on GPUs, or encoding/decoding blocks of compressed video. Concurrency, on the other hand, refers to the nondeterministic manner in which distributed systems operate, particularly ones where CPUs are separated over an unreliable network which can lose connectivity at any time.

In a third blog post, "Teaching FP to freshmen", Robert says he'll be teaching Standard ML to freshmen, and further argues he won't be teaching Object Oriented Programming because it's "both anti-modular and anti-parallel by its very nature, and hence unsuitable for a modern CS curriculum."

Three very interesting blog posts by the same person, clearly a well-educated computer science professor who knows more than I do. I mean, I like programming languages and I'm working on my own dynamically typed functional programming language, but Robert Harper wrote a book, one I certainly couldn't write. So who am I to judge?

Parallelism and Concurrency are Both Important

What I see missing from Robert Harper's writing is any attention paid to concurrency. He pays intense attention to parallelism, recognizes parallelism as important for the future, and strongly advocates functional languages as a great way to address the problems of parallelism. I have absolutely no bone to pick with him here and wish him great luck with his courses at CMU which address this problem domain.

However, my interests primarily lie in the realm of concurrent computing, and particularly in the area of distributed computing. In the area of distributed computing I find dynamic languages particularly important and think Robert Harper's article on static vs. dynamic languages omits some of the advantages of dynamic languages which make them work better in distributed systems.

This is a weighty subject, and one in which I don't think my own ideas alone make a particularly cogent argument. For some opening remarks, I will defer to Joe Armstrong, creator of the Erlang programming language, and his opinion from the book "Coders at Work" by Peter Seibel. As a bit of context to this quote, Joe is discussing the potential advantages that a static type system might confer upon Erlang. But then he gets to the disadvantages...
...the static type people say, "Well, we really rather like the benefits of dynamic types when we're marshaling data structures." We can't send an arbitrary program down a wire and reconstruct it at the other end because we need to know the type. And we have—Cardelli called it a system that's permanently inconsistent. We have systems that are growing and changing all the time, where the parts may be temporarily inconsistent. And as I change the code in a system, it's not atomic. Some of the nodes change, others don't. They talk to each other—at certain times they're consistent. At other times—when we go over a communication boundary—do we trust that the boundary is correct?
Type Systems and the CAP Theorem

There are two particularly sticky problems when it comes to the use of type in distributed systems. The first is the issue of serialization, or marshaling, of particular states. One way or another this is a solvable problem, both for statically typed and dynamically typed languages. I really don't want to delve too deep into this issue as it distracts from the larger point I'm trying to make, but in general, I think this is an easier problem to solve in dynamic type systems. I would also like to note that serialization formats which vomit their types all over the protocol and the codebase are outgrowths of static type systems. I'm looking at you, CORBA, SOAP, Protocol Buffers, and Thrift. On the flip side, systems which choose a minimal, semi-universal type system, such as JSON, BSON, BERT, and Msgpack, are all outgrowths of dynamic type systems. If I have a point to make here, I think it's that these systems are outgrowths of two very different ways of thinking about the same problem.

Marshaling is still a very important topic in distributed systems. Erlang largely abstracts this problem away from the programmer, allowing distributed nodes to exchange Erlang "terms" between processes on distributed nodes in the exact same way one would exchange messages between two processes located on the same host. The process of serializing that data, transmitting it over the network, receiving it via TCP, decoding it, and sending it to the appropriate Erlang process, is completely transparent to the end user. This is an extraordinarily powerful abstraction.

While statically typed languages can attempt to marshal data in a minimalistic JSON-like type system, this typically isn't the case. Instead, statically typed languages generally seem to prefer to vomit their types all over the protocol. The boilerplate code needed to marshal/unmarshal particular types can be generated automatically by a declaration of the types and methods of a particular protocol, such as the WSDL files used by SOAP. Again, users of statically typed languages could reduce the state of particular entities in their system to one which could fit into a minimalistic type system, but for static languages this is still a manual process, or one which requires manual code generation. In a language like Erlang which is built from the ground up to be distributed, dynamic, and operate around a minimalistic type system, serialization and deserialization can happen completely automatically.

Why is this important in a distributed system? Because, to paraphrase Joe Armstrong, distributed systems are messy. Imagine an Erlang-like distributed system that's statically typed. In order for such a system to work effectively, all nodes in the system must have the exact same code loaded and therefore have a universal consensus on what the types in the system are. This has huge implications on the theoretical properties on such a system. In order for a distributed system to agree on the nature of all types, it must be consistent.

However, if you're familiar with the CAP theorem, you may recognize the inherent problem this may cause. The CAP theorem gives you three options: a consistent fault-tolerant system, a highly available fault-tolerant system, or a consistent highly available system which breaks at the first sign of a fault. Only two of these options provide the consistency needed to ensure universal agreement on the types in the system such that automatic marshaling/unmarshaling of static types is even possible. In a distributed system, you either must give up universal agreement on the types, or sacrifice availability.

To quote Joe again, distributed systems are composed of parts which are "growing and changing all the time" with "parts may be temporarily inconsistent." While there aren't any guarantees that distributed systems built around dynamic type systems will work, inconsistent statically typed systems with disagreements about types are guaranteed not to work. Dynamic systems not only provide the possibility that your system may continue to function with different versions of the code loaded at the same time, but the ability for the programmer to plan for this contingency and offer ways to mitigate it. It's possible this may result in errors, but it may work, whereas incompatible type definitions are universally guaranteed to create errors. In a distributed environment, dynamic type systems provide extensibility, whereas static type systems actively seek to preclude it.

Something I've seen time and time again in systems like SOAP and even Thrift and Protocol buffers is attempts by programmers to escape the constraints of the type system, which almost universally fall into proprietary ways to store key/value pairs. One SOAP API I'm working with now provides "Maps" with "Entry" tags that have a key attribute and an associated value. Another implementation provides an array of "NameValuePair" objects. These solutions seem ugly, but in my opinion, their intentions are not. These are people seeking to extend running systems without the need to completely redefine the protocol. That's very much a practical concern.

Distributed Applications Must Be Flexible

In order for distributed programming to work effectively, nodes need to be able to call functions on each other without the need for programmers to write custom marshaling/demarshaling code for each type. The marshaled data needs to work extensibly, so that nodes running different versions of the code can still talk to each other in a forwards and backwards compatible manner.

Protocols will change over time. Older versions of the code need to work with a newer protocol, and vice versa, older versions of the protocol need to work with newer code. Nodes should be upgraded as practicality dictates. Perhaps your system administrator begins an upgrade and you lose access to a datacenter, because a janitor at your upstream ISP spilled a bucket of mopwater all over their datacenter's raised floor and caused a huge electrical disaster. Now your distributed application is running two different versions of the code, and it's suffered a network partition. Does this mean it should just break when the network partition is fixed?

Erlang has shown us it's possible to recover from these kinds of conditions. Even when we can't change code atomically across the entirety of a distributed application, it should still continue to operate without breaking. Distributed systems should be able to grow and change all the time without rigid boundaries.

Tuesday, March 15, 2011

Relativity for Programmers (A.K.A. Oh CAP!!!)

When you first start to learn physics you typically learn Newtonian Mechanics. Objects have inertia. Force is mass times acceleration. For every action there is an equal and opposite reaction. These all seem straightforward.

Albert Einstein turned the world on its head. Everything is relative. Time is relative. Simultaneity is relative. There is no universal clock.

The same thing happened in the computer science world, not with a bang or the attention that was given relativistic physics, but with a small crowd paying attention. CAP Theorem has the same implications for computer scientists as relativity had for physicists, but no one is really paying attention.

Computer scientists may dream of a fully consistent database distributed across a network, capable of two-phase commit and typical consistency guarantees. Not many think about the implications of such a database.

Imagine you had your database split across two datacenters. Let's call them Foo and Bar. While at one point in time, Foo and Bar could talk, now they can't. Perhaps a clumsy janitor spilled the bucket for his mop, and it destroyed all of your routing hardware between your datacenters. Some customers are accessing Foo and some are accessing Bar. If they want to perform a read, they'll get data that's possibly stale. If they want to perform a write, what happens?

CAP Theorem

CAP Theorem is possibly the most important property of distributed systems that you can possibly understand. It's been motherfucking proven by computer scientists, bitches! So what does it state? The core idea is that consistency which is both instantaneous and global is impossible, except on a single node. As soon as you have more than node storing data for you, you either have to sacrifice availability or consistency. Much like how you can only infer the state of the universe from old light beams hitting your light cone in relativistic physics, distributed systems can only infer the state of other nodes via the protocols they communicate with. These protocols can fall into one of three different forms per the CAP theorem:
  • CP: we have masters you have to listen to, bitch slaves. You can try to change the cluster state but all changes have to go through the masters, and if your master is split, you can't modify any state you bitch submissive node
  • AP: sure, we'll accept reads and writes. We'll try to eventually resolve any conflicts in the event of a netsplit. If we're cool we'll let you specify your own conflict resolution mechanism in the event of conflicting writes, otherwise we'll just do last writer wins.
  • CA: the network broke? Whoa. I was totally not expecting that. I was just happily doing bongrips until you barged in. I'm so high I don't even know what's going on.
If you're writing a datastore you consider to be "CA", you're doing it wrong. Fortunately, most datastores that claim to have "CA" properties are actually CP. That is to say, in the event of a network partition, they sacrifice availability. Reads to the cluster hopefully work, but writes shouldn't. Writes should error or time out for a consistent system. Any system that functions this way is in fact "CP". That is to say: it provides consistency, but in the event of a netsplit, you lose availability, at least for writes. Ideally these systems still provide stale reads.

Calling your system CA is an admission of the fact that if a partition separation occurs, and due to writes each partition has a separate view of the world, when the netsplit resolves your partitions have no mechanism for resolving conflicting states between partitions and thus your datastore is just fucked. Short of a system administrator picking which view of the world they consider to be the best and blowing away the state on the other nodes, the system state is going to be inconsistent and irresolvable.

So what are the alternatives which are partition tolerant and don't break in the event that the janitor spills mopwater on the router that routes packets between your datacenters?

Option #1: sacrifice consistency for availability

This option lets clients read and write from any partition. When connectivity between partitions is restored, there's a plan for resolving all of the writes into a consistent world state. The simplest plan is last writer wins, which is used by Cassandra. Whoever was the last one to alter the cluster state provided the definitive cluster state, simply by virtue of writing last.

In an eventual consistency model, clients may have inconsistent views of the world. This is unfortunately a fact of life. State may unexpectedly change by a particular client's view, because it's being modified on a different partition and those partitions managed to sync up. The road ahead of you may be a beach, but after syncing up with another part of the world, it may turn into a cactus. Such is life.

Option #2: sacrifice availability for consistency

Say you have a master/slave data store, and someone writes to a slave. Wat do? Slave can't talk to master. What's the most obvious option? Error.

This is a "CP" system. Slaves can serve stale state, but if you want to change the state, you have to talk to the master for that state. State can be partitioned/shareded/durr between multiple masters, but if the master for the particular state a slave wants to modify isn't available, sorry, that's a paddlin', or rather an error.

Option #3: highly available, and consistent at the partition level, netsplit fixed wat???

The only other option, the "CA" option, is that each partition pretends that writes to that partition are consistent, but when the netsplits are resolved, there's no consistent way to resolve that state. That is to say, if netsplits occur, each partition is highly available and allows writes, but as soon as the netsplit resolves, the cluster state is broke and irreconcilable.

In a quick survey of NoSQL datastores, Membase, MongoDB, and the hypothetical Redis Cluster all claim to be "CA". Are they really? Or do they just sacrifice availability in the event of a netsplit? If that's true, they're actually "CP", not "CA".

You really don't want to sacrifice partition tolerance. In fact, some might go as far as to say you CAN'T sacrifice partition tolerance. If you think your system is "CA" you probably understand the the CAP theorem and its implications.

CAP is the law, and AP or CP are the only reasonable options 

If you're running an app on a distributed network, things are going to go wrong. You might idealistically claim that you can provide both consistency and availability, sacrificing partition tolerance. You can't.


Can you pick "CA"? Sure, if you're content with losing all your data if the network had any errors. That's probably not what you mean. If you're building software for a fault-tolerant world, it's probably either going to be:
  • Consistent but unavailable during netsplits (CP)
  • "Eventually" consistent but available during netsplits (AP)
The only other option is "broken after netsplits". The best option is probably AP: this is the approach was pioneered by the Amazon Dynamo system which powered Amazon's shopping cart, and by copycat open source systems like Riak and Cassandra, not that Riak and Cassandra aren't awesome too.

Thursday, October 14, 2010

How to make a cheap CI traffic light

We use continuous integration (CI) testing where I work, using Hudson as our CI server. I saw this recent article on the social impacts of making the CI testing status more visible and how that encourages people to fix things that are broken. One way to improve the visibility beyond just having a monitor displaying the CI server status page is to add a large traffic light displaying the status of your CI server. This is an idea I first heard of being used at Digg and have later seen at companies like Github. Below is the Hudson status setup we have at my workplace, showing both the traffic light and the monitor displaying the build status page:


Green means everything built successfully. Yellow means a build is in progress. Red means something is broken. It's a very effective way to let everyone in the office know the status of your builds and encourage people to fix them!

Believe it or not, adding a traffic light to your CI setup is simple, relatively easy, and cheap! Also, if you're like me occasionally breaking into electronics and soldering things is fun! Here's what you'll need to build your own CI traffic light:
  • Lava Lite Traffic Light from Amazon ($19.99)
  • A computer running Linux with a parallel port
  • 4-conductor wire (one for each lamp + ground)
  • A DB-25 connector
  • Soldering supplies (soldering iron, solder, desolder braid, etc)
I bought the Lava Lite Traffic Light based on some reviews from people who used it as a Toastmasters timing light (to let people know when their time speaking is up):


By default it blinks in what the box claims is a "random" pattern but is actually a fixed pattern of red, yellow, green, red, yellow, green. But after opening up the traffic light itself, which only requires a few screws, I found out it was extremely easy to modify.

By default it's controlled by a single IC which is mounted on its own daughterboard, soldered to a small control board which contains the requisite transistors and resistors for powering the lamps and switching them on and off. The first thing you'll need to do is unscrew the control board and desolder the daughterboard from it. Desolder braid is your friend here.

The next step will be following the traces from the lamps back to where the daughterboard is mounted. This is fairly straightforward as the components involved in switching the lamps on and off are in threes. Looking at the control board you should very quickly be able to identify these components. There are three larger blue resistors which limit the current going to the lamps. Don't touch those! There are three transistors which are used to turn the lamps on and off. And finally there are three smaller brown resistors... follow the traces from these resistors to where they used to attach to the daughterboard. These are at TTL voltage, which is good, because so is a parallel port. What we'll be doing is soldering wires onto these three connectors and wiring them directly into a parallel port:


I'm afraid I don't have any fancy schematics for how exactly to do this, but it shouldn't be too hard! Here's a photo of where I soldered onto the control board. You can see the slot in the lower right corner where the daughterboard for the control IC used to be. I used a green wire for the green light, red for red, white for yellow, and black for ground. The circuits that go to ground are large and well labeled, so it should be fairly straightforward to find the points to solder onto:


If you've soldered correctly, you should be able to short any of the lamp wires (or a combination thereof) with the ground wire and they should turn on.

On the other side of your wire, you'll be wanting to solder on a DB25 connector. Wire the three wires which control the lamps to data lines 0, 1, and 2, and the ground to any of the many ground pins available on a parallel port. Here's a picture of mine:


And here's a handy dandy parallel port pinout diagram:


You'll be wanting to solder onto D0, D2, D2, and probably pin 18 (ground). I put green on D0, yellow on D1, and red on D2.

Next, you'll need some software to control the traffic light over the parallel port. To control our traffic light, I wrote a simple UDP server in C that receives 1-byte packets and writes them out to the parallel port. I originally wrote the server in Ruby but garbage collection was hanging up some fun light shows I was trying to put on, so I rewrote it in C for moar realtime. Here is the C source code to the traffic light server:


This uses the /dev/port character device on Linux, so you'll either need to run this server as root or a user specifically configured to have permissions to /dev/port. One thing I soon discovered was that the parallel port is stateful and will remain in the last state you set it to. This means you don't constantly have to write to the parallel port to keep a lamp on. Instead you can just set a state and the parallel port will remain in that state. Another thing I discovered that I still can't explain is that for some reason I needed to write the same byte twice to get the light to show the appropriate state. I don't know why, but it works!

Next, you'll need a client for your traffic light server. Here's a simple one I wrote in Ruby:


Just wait until this script gets passed around the office. Believe me from experience, everyone will go nuts for the first few days playing with the traffic light. Setting the light to an individual color is all, but for a true acid test here's a script I wrote which calls the above one and will make your traffic light rave balls!


But we didn't build this traffic light just to make it rave balls, did we? No, back to business, we need to get it to show the status of the Hudson build server. Here are some examples you can go off of which we're using now. It's a hacked together collection of scripts, to be sure, but it gets the job done:


Just set your Hudson server address (and username and password, if you protect it with HTTP basic auth like we do) and you're ready to go.

With all those scripts combined and a quick cron job to periodically poll Hudson's status, the traffic light will automatically display your build status and let everyone in your office know how diligent you've been about keeping your tests passing.

Now, get to work and build your own traffic light!