Wednesday, January 28, 2009

The cutting edge of VM design

The Java Virtual Machine generally stands out as the state of the art, loaded with so many crazy optimization tricks that it would make your head spin. For this reason many languages (including brand new ones) are targeting it as their runtime, as opposed to writing their own virtual machine, which I guess some people still like to do. However, I think the route the JVM has taken is one which will gradually wane in popularity as programmers begin to face a future with dozens if not hundreds of CPU cores at their disposal.

Designers of virtual machines will begin to undergo a realization which is already upon the designers of computer processors: focus less on doing one thing quickly and more on doing many things at once. This is why I do not believe the state of the art in virtual machines lies in things like the JVM. Rather, I see the Erlang virtual machine as the state of the art. Java simply was not designed for this future:

Image stolen from Ulf Wiger, who took it from Joe Armstrong, who borrowed it from Erik Hagersten

At first glance it may be difficult to appreciate what makes the Erlang VM interesting. Soft realtime garbage collection is nice, but the JVM has hard realtime garbage collection. Erlang's JIT is comparatively slow at things like numerical computing and can't inline across modules the way HotSpot can inline methods across classes. It's a notoriously difficult beast for inexperienced system administrators, who may see it start gobbling up CPU and RAM for no apparent reason, and the only way to probe what it's doing is to get onto its shell and enter a bunch of commands in an esoteric functional language, horrors!

However, these problems are relatively minor when you look at what's on the roadmap for the Erlang VM. In the beginning of his book Programming Erlang, the language's creator, Joe Armstrong, lays out the dream of what his concurrency model hopes to achieve: your program runs N times faster on N CPUs. Erlang has more or less achieved this through the use of distribution. However, distributed computing is hard: suddenly you must consider the cost of sending messages across the network, dealing with latency, network outages, crashing servers, etc. While Erlang/OTP provides a great framework for distributed computing, many of the concerns of distributed computing become irrelevant when you're dealing with a single machine with vast numbers of CPU cores.

The Erlang virtual machine now supports an SMP scheduler which runs a native scheduler thread for each CPU core. This lets the VM distribute processes, its concurrency primitive, across multiple CPUs at once without any of the burdens distributed computing places on the programmer. It's a comparatively simple affair: make as many processes as you want and the scheduler will load balance them across cores accordingly.
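
A minimal sketch of what that looks like in Erlang (the module name and workload are my own, purely for illustration): spawn a pile of processes, collect their results, and let the scheduler worry about which cores they run on.

    -module(spawn_many).
    -export([run/1]).

    %% Spawn N processes, each doing a trivial bit of work and reporting
    %% back to the parent; the SMP scheduler spreads them across cores.
    run(N) ->
        Parent = self(),
        Pids = [spawn(fun() -> Parent ! {self(), I * I} end)
                || I <- lists:seq(1, N)],
        %% Collect one result per process.
        [receive {Pid, Result} -> Result end || Pid <- Pids].

Calling spawn_many:run(100000) creates a hundred thousand processes; whether they land on two cores or sixty-four is entirely the scheduler's problem, not yours.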

However, while this all seems nice and rosy, the SMP scheduler is full of bottlenecks. It works quite well for two or four CPU cores, but as you add more and more you see diminishing returns. The same goes for the JVM, but while the JVM's designers are stuck with its fundamental design constraints, the Erlang concurrency model allows its VM designers considerably more freedom for optimization.

A recent presentation by super famous Erlang guy Ulf Wiger lays out the future of multicore programming with Erlang. The presentation goes through the present scalability weaknesses of the Erlang VM and how they plan on addressing them. You can take Ulf's word for it, or if you're lazy you can just keep reading and I'll try to break it down for you.

So, here's what it is: (sorry for ripping these images out of your presentation, Ulf! They rule!)

This image had a big Ericsson logo on it. Did I do bad by ripping it off? Oh well, if someone complains I'll make my own awesome version.

This is how the Erlang SMP scheduler works today. There are lots of schedulers, each running in a native thread, which all pull from a common run queue. Can you see the bottleneck? Look at all those arrows pointing to the same place! So what's the solution? Divide and conquer:

By giving each scheduler thread its own run queue, the only centralized ickiness becomes the logic needed to load balance processes across the various schedulers. This whole thing is beginning to look a lot like how an operating system's kernel works, isn't it? Well great, so how did that all work out in terms of SMP scalability?

Uh oh, the diagonal line I was expecting appears to be a bit flaccid

The red line shows the present SMP scheduler with a single run queue, and the blue line shows the next generation SMP scheduler which will be released in the next version of Erlang. As you can see, adding multiple run queues improved the scheduler's performance, but clearly there is still a bottleneck as you add CPU cores: the graphs show the program actually slowing down as the number of cores increases. So what's the problem?

Aack, a bunch of arrows pointing at the same place again!

Erlang's memory allocator is presently a point of lock contention and is limiting the SMP scheduler's scalability across increasing numbers of CPU cores. I can only assume from Ulf throwing this into his presentation that this is the next bottleneck the Erlang/OTP team at Ericsson intends to address after they've released the new SMP scheduler.

The moderate gains of the new SMP scheduler may not seem like something to get excited about, since the improvements in the benchmark weren't all that spectacular. So why am I excited? Because one by one the developers of the Erlang virtual machine are removing scalability bottlenecks and increasing the VM's performance. Due to Erlang's underlying concurrency model, the potential for optimization remains huge. And as the number of CPU cores available on a single chip continues to increase (this benchmark was run on a 64-core CPU), these sorts of optimizations will become increasingly necessary to leverage a CPU's full capabilities.

Architectures like the JVM aren't completely doomed in the multicore future. Approaches like software transactional memory can be used on the JVM (particularly with a language like Clojure) in ways that are better suited to certain types of concurrency problems than Erlang's shared-nothing process approach.

But overall, I see Erlang's approach to concurrency as one which makes things easier on the programmer while foisting more of the underlying complexity onto the VM itself. I think Erlang's approach to concurrency will generally make more sense to programmers than STM, especially if it has the right kind of face on it, which is what I hope to achieve with my language Reia, which targets the Erlang VM. And most of all, I see great potential for VM optimizations to improve scalability on multicore CPUs, and hope one day the SMP scheduler achieves Joe Armstrong's dream of your program running N times faster on N CPUs. Whatever approach the Erlang VM eventually settles upon will be studied for years to come.

Friday, January 9, 2009

Reia and object-aware concurrency

Concurrency is perhaps the foremost issue today's programmers must learn to cope with. CPU designers have exhausted the optimizations they can provide to programmers working with a single thread of execution, and more modern designs take transistors which would ordinarily have gone toward improving sequential performance and instead throw them at more cores. Intel demoed a CPU with 80 cores, yet only one third the transistors of a Core 2 Duo. This is the future of processors: more cores that each do less. CPU designers are going to devote fewer transistors to each individual core while trying to pack more and more cores onto a single chip. Some crazy people are predicting that the number of CPU cores we see on a single chip is going to grow exponentially. I don't know about that, but it's as reasonable a guess as any at this point, and it seems to be where the trends are heading.

This is posing a big problem for programmers. Now that all of these cores are available, how do you write programs that can make use of them all? The typical way most programmers are used to dealing with this problem is with threads. Threads let you spread your program's execution across multiple CPU cores, which is great! But threads are error-prone: the synchronization they require is as hard to get right as manual memory management. Pay attention and use good strategies and it's manageable. Put it in the hands of mediocre programmers and things can quickly fall apart. And once they do, it can be incredibly hard to debug.

But it gets worse. Let's assume you've managed to use threads correctly in your program. Great! How well can it scale across an increasing number of CPU cores? The synchronization model employed by most programs uses locks to protect shared state, and to get a threaded program to scale well, those locks must be scrutinized, benchmarked, and made increasingly granular. While this is all doable, granular locking is complicated, the reality is most programmers trying to pull it off aren't going to get it right, and for that reason many may avoid it entirely.

So how well will programmers be able to leverage these new multicore CPUs? That question remains up in the air. Threaded approaches are difficult at best, and often hard to scale across an increasing number of CPU cores, especially when complex data structures are involved. For that reason some perceive that programmers are in a sort of crisis in regard to how they will leverage an increasing number of CPU cores.

Some solutions to this problem have emerged. One that gets a lot of attention is software transactional memory. This is a model which looks at memory in a manner similar to a database, using transactions to control concurrent access to shared state. Unfortunately this model has failed to achieve widespread popularity. Theories abound as to what exactly the problem is, generally centering on how difficult it is to apply to real-world problems. So far, this approach to concurrency remains in the collective unconscious as a theoretical one which is rarely applied.

Another approach which has received a considerable amount of hype in certain communities is "Erlang-style concurrency." Erlang is a programming language originally developed for telephony applications which uses lots of small processes that each do one thing and do it well, and talk to each other over message queues. Some liken this to the Unix philosophy: "Write programs that do one thing and do it well. Write programs to work together." However, the actual approach of using processes which communicate with messages is known as the Actor model, which grew out of Smalltalk.

Erlang processes have lots of nice properties which make concurrency easy. Like OS processes they run simultaneously, and the Erlang runtime can distribute them across all available CPU cores with no additional work on the programmer's part. They are pre-emptible: if a process is doing something computationally intensive, then after a certain number of "reductions" (e.g. function calls) the Erlang scheduler will pre-empt it and let another process run. Erlang processes don't share state and can be garbage collected independently of one another, which means there's no "stop the world" condition with the garbage collector. But also, by not sharing state, synchronization gets much easier: you no longer have to synchronize access to shared state, instead you need to synchronize the way your program behaves, which is a far easier problem.
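
As a small illustration of synchronizing behavior rather than state, here's a sketch of a counter that lives in its own process (the module and message names are mine, not anything from OTP). Only the counter process ever touches the count, so there is nothing to lock; callers just send it messages.

    -module(counter).
    -export([start/0, increment/1, value/1]).

    %% Spawn a counter process that owns its own state.
    start() ->
        spawn(fun() -> loop(0) end).

    %% Asynchronously bump the counter.
    increment(Counter) ->
        Counter ! increment,
        ok.

    %% Synchronously ask the counter for its current value.
    value(Counter) ->
        Counter ! {value, self()},
        receive
            {Counter, Count} -> Count
        end.

    %% The receive loop serializes all updates through the mailbox.
    loop(Count) ->
        receive
            increment -> loop(Count + 1);
            {value, From} -> From ! {self(), Count}, loop(Count)
        end.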

Last year I began working on an Actor model implementation for Ruby called Revactor. It had many of the features seen in Erlang: concurrent "processes" with message boxes and the ability to selectively receive messages from them. However, two things deterred me from continuing to work on Revactor. First, it was slow: its messaging speed between two processes was nearly two orders of magnitude slower than Erlang's, as was its process creation speed. Second, and probably the larger issue, was how much overlap I felt existed between Actors and objects. Actors did many of the same things objects did, but actors and objects did not play well together.

After seeing how much slower Actors in Ruby were compared to Erlang, I decided to try a different approach entirely. If Actors couldn't come to Ruby, perhaps I could bring Ruby to Erlang. In early 2008 I started tinkering with a Ruby-like language on top of Erlang called Reia. Reia brings with it an object model which works on top of the Erlang process and messaging model. Like other object models, objects in Reia communicate with messages, but rather than being some metaphorical construct, Reia objects quite literally communicate with messages.

This approach marks a radical departure from most other object oriented languages. Most object oriented languages have a concurrency primitive like threads which is completely divorced from objects. Threads give you no guarantees about how state is managed or synchronized, and they have no knowledge of method invocation. In Reia, all objects are concurrent and synchronize around the method dispatch protocol. You can think of each object as being its own thread, except each object's state is independent of all the others, so there's no need for semaphores, mutexes, or critical sections. All synchronization is handled through message dispatch itself, and thanks to the shared-nothing process architecture you only need to worry about synchronizing behavior, not state.
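
Reia's actual dispatch protocol isn't spelled out here, but a rough sketch of the underlying idea in plain Erlang might look like the following: each "object" is a process whose receive loop holds the instance state, and a "method call" is just a tagged message plus a reply. The message format is my own invention for illustration, not Reia's real protocol.

    -module(stack_object).
    -export([new/0, call/2]).

    %% "Instantiate" an object: spawn a process holding its state.
    new() ->
        spawn(fun() -> loop([]) end).

    %% Synchronous "method call": send a request, block until the reply.
    call(Object, Request) ->
        Ref = make_ref(),
        Object ! {call, self(), Ref, Request},
        receive
            {reply, Ref, Result} -> Result
        end.

    %% The object's receive loop doubles as its method dispatcher.
    loop(Items) ->
        receive
            {call, From, Ref, {push, Item}} ->
                From ! {reply, Ref, ok},
                loop([Item | Items]);
            {call, From, Ref, pop} ->
                case Items of
                    [Top | Rest] -> From ! {reply, Ref, {ok, Top}}, loop(Rest);
                    []           -> From ! {reply, Ref, empty},     loop(Items)
                end
        end.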

Furthermore, Reia gives you deep hooks into the method dispatch process, allowing objects to respond to method calls asynchronously, to invoke methods asynchronously, or to asynchronously retrieve the response to a method call. This small grab bag of asynchronous tricks allows you to quickly and easily add concurrency to your programs while still preserving the traditional method dispatch approach seen in object oriented languages.
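
In terms of the hypothetical sketch above, an asynchronous call just splits the send from the receive, so the caller is free to do other work while the object computes its reply:

    %% Fire the "method call" without blocking on the reply...
    Object = stack_object:new(),
    stack_object:call(Object, {push, 42}),
    Ref = make_ref(),
    Object ! {call, self(), Ref, pop},

    %% ...do something else in the meantime...
    io:format("doing other work~n"),

    %% ...then collect the reply whenever it's convenient.
    Result = receive {reply, Ref, R} -> R end.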

Will this object-aware approach to concurrency and synchronization actually gain traction with programmers? It's hard to say; however, it's clear that the other approaches aren't doing so well either. With the number of cores available to programmers skyrocketing, I think it's clear solutions are needed. Only time will tell which ones are actually successful.