Wednesday, January 28, 2009

The cutting edge of VM design

The Java Virtual Machine generally stands out as the state-of-the-art, loaded with all sorts of crazy optimization tricks it would make your head spin. For this reason many languages (including brand new languages) are targeting it as their runtime, as opposed to writing their own virtual machine which I guess some people still like to do. However, I think the route the JVM has taken is one which will gradually wane in popularity as programmers begin to face a future with dozens if not hundreds of CPU cores at their disposal.

Designers of virtual machines will begin to undergo a realization which is already upon the designers of computer processors: focus less on doing one thing quickly and focus more on being good at doing many things at once. This is why I do not believe the state of the art in virtual machine lies in things like the JVM. Rather, I see the Erlang virtual machine as being the state of the art. Java simply was not designed for this future:

Image stolen from Ulf Wiger who took it from Joe Armstrong who borrowed it from Erik Hagersten

At first glance it may be difficult to appreciate what makes the Erlang VM interesting. Soft realtime garbage collection is nice but the JVM has hard realtime garbage collection. Erlang's JIT is comparatively slow at things like numerical computing and can't inline across modules the way HotSpot can inline methods across classes. It's a notoriously difficult beast for inexperienced system administrators who may see it start gobbling up CPU and RAM for no apparent reason, and the only way to probe what's doing is to get onto its shell and enter a bunch of commands in an esoteric functional language, horrors!

However, these problems are relatively minor when you look at what's on the roadmap for the Erlang VM. In the beginning of his book Programming Erlang, the language's creator, Joe Armstrong, lays out the dream of what his concurrency model hopes to achieve: your program runs N times faster on N CPUs. Erlang has more or less achieved this through the use of distribution. However, distributed computing is hard: suddenly you must consider the cost of sending messages across the network, dealing with latency, network outages, crashing servers, etc. While Erlang/OTP provides a great framework for distributed computing, many of the concerns of distributed computing become irrelevant when you're dealing with a single machine with vast numbers of CPU cores.

The Erlang virtual machine now supports an SMP scheduler which runs a hardware thread for each CPU core. This lets the VM distribute processes, its concurrency primitive, across multiple CPUs at once with none of the impacts distributed computing has to the programmer. It's a comparatively simple affair: make as many processes as you want and the scheduler will load balance them accordingly.

However, while this all seems nice and rosy the SMP scheduler is full of bottlenecks. It works quite well for two or four CPU cores, but as you add more and more you see diminishing returns. The same goes for the JVM, but while they're stuck with their fundamental design constraints, the Erlang concurrency model allows its VM designers considerably more freedom for optimization.

A recent presentation by super famous Erlang guy Ulf Wiger lays out the future of multicore programming with Erlang. The presentation goes through the present scalability weaknesses of the Erlang VM and how they plan on addressing them. You can take Ulf's word for it, or if you're lazy you can just keep reading and I'll try to break it down for you.

So, here's what it is: (sorry for ripping these images out of your presentation, Ulf! They rule!)

This image had a big Ericsson logo on it. Did I do bad by ripping it off? Oh well, if someone complains I'll make my own awesome version.

This is how the Erlang SMP scheduler works today. There are lots of schedulers, each running in a native thread, which all pull from a common run queue. Can you see the bottleneck? Look at all those arrows pointing to the same place! So what's the solution? Divide and conquer:

By giving each scheduler thread its own run queue, the only centralized ickyness becomes the logic needed to load balance processes across various schedulers. This whole thing is beginning to look a lot like how an operating system's kernel works, isn't it? Well great, so how did that all work out in terms of SMP scalability?

Uhoh, the diagonal line I was expecting appears to be a bit flaccid

The red line shows the present SMP scheduler with a single run queue, and the blue line shows the next generation SMP scheduler which will be released in the next version of Erlang. As you can see, adding multiple run queues improved the scheduler's performance, but clearly there is still a bottleneck as you add CPU cores. The graphs show the program actually slowing down with increasing number of cores. So what's the problem?

Aack, a bunch of arrows pointing at the same place again!

Erlang's memory allocator is presently a locking contention point and is limiting the SMP scheduler's scalability across increasing numbers of CPU cores. I can only assume from Ulf throwing this into his presentation that this is the next bottleneck the Erlang/OTP team at Ericsson intends to address after they've released the new SMP scheduler.

The moderate gains of the new SMP scheduler may not seem like something to be excited about, especially since the improvements in the benchmark weren't all that spectacular. So why am I excited? Because one by one the developers of the Erlang virtual machine are removing scalability bottlenecks and increasing the VM's performance. Due to Erlang's underlying concurrency model, the optimization potentials remain huge. And as the number of CPU cores available on a single chip continues to increase (this benchmark was run on a 64-core CPU) these sorts of optimizations will become increasingly necessary to leverage a CPU's full capabilities.

Architectures like the JVM aren't completely doomed in the multicore future. Approaches like software transactional memory can be used on the JVM (particularly with a language like Clojure) in ways that are better suited to certain types of concurrency problems than Erlang's shared-nothing process approach.

But overall, I find Erlang's approach to concurrency as one which makes things easier on the programmer while foisting more of the underlying complexity onto the VM itself. I think Erlang's approach to concurrency will generally make more sense to programmers than STM, especially if it has the right kind of face on it, which is what I hope to achieve with my language Reia, which targets the Erlang VM. And most of all, I see great potential for VM optimizations to improve the scalability on multicore CPUs, and hope one day the SMP scheduler is able to achieve Joe Armstrong's dream of your program running N times faster on N CPUs. Whatever approach the Erlang VM eventually settles upon will be studied for years to come.

258 comments:

«Oldest   ‹Older   401 – 258 of 258
«Oldest ‹Older   401 – 258 of 258   Newer› Newest»