Wednesday, January 28, 2009

The cutting edge of VM design

The Java Virtual Machine generally stands out as the state of the art, loaded with all sorts of crazy optimization tricks that would make your head spin. For this reason many languages (including brand new ones) target it as their runtime, as opposed to writing their own virtual machine, which I guess some people still like to do. However, I think the route the JVM has taken is one which will gradually wane in popularity as programmers begin to face a future with dozens if not hundreds of CPU cores at their disposal.

Designers of virtual machines will begin to undergo a realization which is already upon the designers of computer processors: focus less on doing one thing quickly and more on doing many things at once. This is why I do not believe the state of the art in virtual machine design lies in things like the JVM. Rather, I see the Erlang virtual machine as the state of the art. Java simply was not designed for this future:

Image stolen from Ulf Wiger who took it from Joe Armstrong who borrowed it from Erik Hagersten

At first glance it may be difficult to appreciate what makes the Erlang VM interesting. Soft realtime garbage collection is nice, but the JVM has hard realtime garbage collection. Erlang's JIT is comparatively slow at things like numerical computing and can't inline across modules the way HotSpot can inline methods across classes. It's a notoriously difficult beast for inexperienced system administrators, who may see it start gobbling up CPU and RAM for no apparent reason, and the only way to probe what it's doing is to get onto its shell and enter a bunch of commands in an esoteric functional language. Horrors!

However, these problems are relatively minor when you look at what's on the roadmap for the Erlang VM. In the beginning of his book Programming Erlang, the language's creator, Joe Armstrong, lays out the dream of what his concurrency model hopes to achieve: your program runs N times faster on N CPUs. Erlang has more or less achieved this through the use of distribution. However, distributed computing is hard: suddenly you must consider the cost of sending messages across the network, dealing with latency, network outages, crashing servers, etc. While Erlang/OTP provides a great framework for distributed computing, many of the concerns of distributed computing become irrelevant when you're dealing with a single machine with vast numbers of CPU cores.

The Erlang virtual machine now supports an SMP scheduler which runs a native thread for each CPU core. This lets the VM distribute processes, its concurrency primitive, across multiple CPUs at once, without any of the burdens distributed computing places on the programmer. It's a comparatively simple affair: make as many processes as you want and the scheduler will load balance them accordingly.

However, while this all seems nice and rosy, the SMP scheduler is full of bottlenecks. It works quite well for two or four CPU cores, but as you add more and more you see diminishing returns. The same goes for the JVM, but while the JVM's designers are stuck with its fundamental design constraints, Erlang's concurrency model allows its VM designers considerably more freedom for optimization.

A recent presentation by super famous Erlang guy Ulf Wiger lays out the future of multicore programming with Erlang. The presentation goes through the present scalability weaknesses of the Erlang VM and how they plan on addressing them. You can take Ulf's word for it, or if you're lazy you can just keep reading and I'll try to break it down for you.

So, here's what it is: (sorry for ripping these images out of your presentation, Ulf! They rule!)

This image had a big Ericsson logo on it. Did I do bad by ripping it off? Oh well, if someone complains I'll make my own awesome version.

This is how the Erlang SMP scheduler works today. There are lots of schedulers, each running in a native thread, which all pull from a common run queue. Can you see the bottleneck? Look at all those arrows pointing to the same place! So what's the solution? Divide and conquer:

By giving each scheduler thread its own run queue, the only centralized ickiness becomes the logic needed to load balance processes across the various schedulers. This whole thing is beginning to look a lot like an operating system's kernel, isn't it? Well, great, so how did that all work out in terms of SMP scalability?

Uhoh, the diagonal line I was expecting appears to be a bit flaccid

The red line shows the present SMP scheduler with a single run queue, and the blue line shows the next-generation SMP scheduler which will be released in the next version of Erlang. As you can see, adding multiple run queues improved the scheduler's performance, but clearly there is still a bottleneck: beyond a certain point, the graphs show the program actually slowing down as more cores are added. So what's the problem?

Aack, a bunch of arrows pointing at the same place again!

Erlang's memory allocator is presently a lock contention point and is limiting the SMP scheduler's scalability across increasing numbers of CPU cores. I can only assume from Ulf throwing this into his presentation that this is the next bottleneck the Erlang/OTP team at Ericsson intends to address after they've released the new SMP scheduler.

The moderate gains of the new SMP scheduler may not seem like something to be excited about, especially since the improvements in the benchmark weren't all that spectacular. So why am I excited? Because one by one, the developers of the Erlang virtual machine are removing scalability bottlenecks and increasing the VM's performance. Thanks to Erlang's underlying concurrency model, the potential for optimization remains huge. And as the number of CPU cores available on a single chip continues to increase (this benchmark was run on a 64-core CPU), these sorts of optimizations will become increasingly necessary to leverage a CPU's full capabilities.

Architectures like the JVM aren't completely doomed in the multicore future. Approaches like software transactional memory can be used on the JVM (particularly with a language like Clojure) in ways that are better suited to certain types of concurrency problems than Erlang's shared-nothing process approach.

But overall, I find Erlang's approach to concurrency to be one that makes things easier on the programmer while foisting more of the underlying complexity onto the VM itself. I think it will generally make more sense to programmers than STM, especially if it has the right kind of face on it, which is what I hope to achieve with my language Reia, which targets the Erlang VM. Most of all, I see great potential for VM optimizations to improve scalability on multicore CPUs, and I hope one day the SMP scheduler achieves Joe Armstrong's dream of your program running N times faster on N CPUs. Whatever approach the Erlang VM eventually settles upon will be studied for years to come.

10 comments:

Keymone said...

great post! thank you!

Wes Felter said...

Doesn't Azul run HotSpot on a 864-core system?

johnbender said...

switched blogs I see. I'm waiting till you're done with Reia to continue effing about with the Erlang VM.

I want ruby on the Erlang VM and you're as close as I'm getting.

Rick Minerich said...

I never realized Erlang had its own VM tuned for parallelism. I'm going to have to spend some time reading about Erlang for sure.

Thanks for the post!

David Anderson said...

Just a note: if the bottleneck in the new Erlang VM is the memory allocator being slammed by many OS threads, the Erlang folks should consider using tcmalloc.

Tcmalloc is the memory allocator used by Google in programs with many dozens (even hundreds) of OS threads competing for memory allocator time. It has a separate allocation pool for each thread that greatly reduces the need to hit the contended main heap allocator, and has a ton of smart logic to recycle memory in such a way as to remain thread-local as much as possible.

It probably won't make that floppy diagonal stand completely straight, but tcmalloc is known to keep chugging along at O(n)-ish long after glibc malloc has given up and died. And the allocator, along with the CPU and heap profilers, is all open source too :-).

Stephan.Schmidt said...

If the Erlang VM is optimized for parallelism, why is the actor implementation only 1.5 to 2 times faster than fresh Java actor implementations?

(Or sometimes slower, http://www.codemonkeyism.com/archives/2008/06/21/interesting-picture-benchmarking-erlang-versus-java-concurrency/ )

Otherwise good article,

Stephan

--
Programming is hard - http://blog.codemonkeyism.com
http://twitter.com/codemonkeyism

sysprv said...

Take a look at the Dis virtual machine in Inferno.

Tim said...

I wonder why so many people are currently so excited about green threads. Are there any hard-numbers that show their benefits?

Not too many years ago, there were a number of green thread implementations. For example, the first Java VMs had them. I also remember the first Linux thread libraries being user-space implementations. But eventually it was demonstrated that a good kernel-based thread implementation was not significantly slower than a user-space one, and the common wisdom became that green threads were a bad idea that just added complexity.

Has this changed on modern multi-core CPUs, and if yes, why?

rgh said...

Tim,

Erlang uses green threads in conjunction with system threads, so if you have 1000 processes (as they are called in Erlang) on a machine with four cores, they will be spread across the four cores.

An Erlang process is a very lightweight thing; you can create thousands of them in under a second or so and they consume very little memory (200 bytes, off the top of my head). This, coupled with the actor model, creates a *very* powerful way of writing systems that will work across multiple cores/CPUs/machines.

Arkadin Agathos said...

Love your blog, hate the design. Light lettering on very dark background. Very difficult to read. Dark lettering on light background: much easier to read. 'Nuf said.