Monday, April 6, 2009

Why I don't like Scala

Scala is a hybrid functional/imperative object/actor language for the Java Virtual Machine which has recently gained notoriety after Twitter selected it as the basis of their future development. My language Reia is also a hybrid functional/imperative object/actor language. You may think that given these similarities I would like Scala... but I don't.

I originally tried Scala back in 2007, shortly after I started becoming proficient in Erlang. I was extremely interested in the actor model at the time, and Scala provides an actor model implementation. Aside from Erlang, Scheme, and Io, it was one of the only languages I had discovered which attempted an actor model implementation, and Scala's actor support seemed heavily influenced by Erlang (even at the syntactic level). I was initially excited about Scala but slowly grew discontented with it.

At first I enjoyed being able to use objects within the sequential portions of my code and actors for the concurrent parts. This is an idea I would carry over into Ruby with the Revactor library, an actor model implementation I created for Ruby 1.9. Revactor lets you write sequential code using normal Ruby objects, while using actors for concurrency. Like Scala, Revactor was heavily influenced by Erlang, to the point that I created an almost identical translation of Erlang's actor API to Ruby, much as MenTaLguY had done in his actor library, Omnibus. MenTaLguY is perhaps the most concurrency-aware Ruby developer I have ever met, so I felt I might be on to something.

However, rather quickly I discovered something about using actors and objects in the same program: there was considerable overlap in what actors and objects do. More and more I found myself trying to reinvent objects with actors. I also began to realize that Scala was running into this problem as well. There were some particularly egregious cases. What follows is the most insidious one.

The WTF Operator

One of the most common patterns in actor-based programs is the Remote Procedure Call or RPC. As the actor protocol is asynchronous, RPCs provide synchronous calls in the form of two asynchronous messages, a request and a response.

RPCs are extremely prevalent in actor-based programming, to the point that Joe Armstrong, creator of Erlang, says:
95% of the time standard synchronous RPCs will work - but not all the time, that's why it's nice to be able to open up things and muck around at the message passing level.
Because RPCs are so common, the creators of Scala gave them their own operator: "!?"

WTF?! While it's easy to poke fun at an operator that resembles an interrobang, the duplicated semantics of this operator are what I dislike. To illustrate the point, let me show you some Scala code:

response = receiver !? request

and the equivalent code in Reia:

response = receiver.request()

Reia can use the standard method invocation syntax because in Reia, all objects are actors. Scala takes an "everything is an object" approach, with actors being an additional entity which duplicates some, but not all, of the functions of objects. In Scala, actors are objects, whereas in Reia objects are actors.
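
To make the comparison concrete, here is a rough sketch of what typically sits behind a !? call, written against the scala.actors API of that era (the Lookup/Result messages and the store actor are hypothetical, purely for illustration):

import scala.actors.Actor._

case class Lookup(key: String)
case class Result(value: Option[String])

// An actor servicing synchronous Lookup requests
val store = actor {
  val data = Map("foo" -> "bar")
  loop {
    react {
      case Lookup(key) => reply(Result(data.get(key)))
    }
  }
}

// The caller blocks until the actor replies: two asynchronous messages
// (request and response) dressed up as one synchronous call
val answer = store !? Lookup("foo")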

Scala's object model borrows heavily from Java, which is in turn largely inspired by C++. In this model, objects are effectively just state, and method calls (a.k.a. "sending a message to an object") are little more than function calls which act upon and mutate that state.

Scala also implements the actor model, which is in turn inspired by Smalltalk and its messaging-based approach to object orientation. The result is a language which straddles two worlds: objects acted upon by function calls, and actors which are acted upon by messages.
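
A minimal sketch of those two worlds side by side (hypothetical Counter examples, again using the old scala.actors API):

import scala.actors.Actor._

// The "object world": state mutated by ordinary method calls
class Counter { private var n = 0; def increment() = { n += 1; n } }
val c = new Counter
c.increment()              // a plain function call acting on the object's state

// The "actor world": state owned by an actor, reachable only through messages
case object Increment
val counter = actor {
  var n = 0
  loop { react { case Increment => n += 1; reply(n) } }
}
counter !? Increment       // the same idea, expressed as message passing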

Furthermore, Scala's actors fall prey to Clojure creator Rich Hickey's concerns about actor-based languages:
It reduces your flexibility in modeling - this is a world in which everyone sits in a windowless room and communicates only by mail. Programs are decomposed as piles of blocking switch statements. You can only handle messages you anticipated receiving. Coordinating activities involving multiple actors is very difficult. You can't observe anything without its cooperation/coordination - making ad-hoc reporting or analysis impossible, instead forcing every actor to participate in each protocol.
Reia offers a solution to this problem with its objects-as-actors approach: all actor-objects speak a common protocol, the "Object" protocol, and above that, they speak whatever methods belong to their class. Objects implicitly participate in the same actor protocol, because they all inherit the same behavior from their common ancestor.

Scala's actors... well... if you !? them a message they aren't explicitly hardcoded to understand (and yes nitpickers, common behaviors can be abstracted into functions) they will ?! at your message and ignore it.
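
A quick sketch of what I mean (again hypothetical, using the old scala.actors API): if an actor's react block has no case for a message, that message just sits in the mailbox, and a synchronous caller is left hanging unless it uses the timeout variant of !?:

import scala.actors.Actor._

case object Ping
case object Pong
case object Shutdown   // a message the actor was never taught to handle

val a = actor {
  loop {
    react {
      case Ping => reply(Pong)
    }
  }
}

a !? Ping                      // fine: returns Pong
val r = a !? (1000, Shutdown)  // no matching case, so the request is ignored
                               // and the call times out, returning None
// a plain `a !? Shutdown` would simply block the caller forever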

Two kinds of actors?

One of the biggest strengths of the Erlang VM is its approach to lightweight concurrency. The Erlang VM was designed from the ground up so you don't have to be afraid of creating a large number of Erlang processes (i.e. actors). Unlike the JVM, the Erlang VM is stackless and therefore much better at lightweight concurrency. Erlang's VM also has advanced mechanisms for load balancing its lightweight processes across CPU cores. The result is a system which lets you create a large number of actors, relying on the Erlang virtual machine to load balance them across all the available CPU cores for you.

Because the JVM isn't stackless and uses native threads as its concurrency mechanism, Scala couldn't implement Erlang-style lightweight concurrency, and instead compromises by implementing two types of actors. One type of actor is based on native threads. However, native threads are substantially heavier than a lightweight Erlang process, limiting the number of thread-based actors you can create in Scala as compared to Erlang. To address this limitation, Scala implements its own form of lightweight concurrency in the form of event-based actors. Event-based actors do not have all the functionality of thread-based actors but do not incur the penalty of needing to be backed by a native thread.
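
The split shows up directly in the scala.actors API: thread-based actors wait in receive, which parks the underlying JVM thread, while event-based actors use react, which never returns and lets the scheduler reuse the thread. A rough sketch (the Worker classes are hypothetical, purely for illustration):

import scala.actors.Actor
import scala.actors.Actor._

case class Work(n: Int)

// Thread-based: receive blocks the underlying thread while waiting,
// so each idle actor of this kind still pins a native JVM thread
class ThreadedWorker extends Actor {
  def act() {
    while (true) {
      receive { case Work(n) => println("threaded: " + n) }
    }
  }
}

// Event-based: react never returns; the actor releases its thread while
// idle and is resumed by the scheduler when a message arrives
class EventedWorker extends Actor {
  def act() {
    loop {
      react { case Work(n) => println("evented: " + n) }
    }
  }
}

val t = new ThreadedWorker
val e = new EventedWorker
t.start(); e.start()
t ! Work(1)
e ! Work(2)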

Should you use a thread-based actor or an event-based actor in your Scala program? The fact that you even have to ask is a case of implementation details (namely the JVM's lack of lightweight concurrency) creeping out into the language design. Projects like Kilim are trying to address lightweight concurrency on the JVM, and hopefully Scala will be able to leverage such a project in the future as the basis of its actor model and close the threaded/evented gap, but for now Scala makes you choose.

Scala leaves you with three similar, overlapping constructs to choose from when modeling state, identity, and concurrency in programs: objects, event-based actors, and thread-based actors. Which should you choose?

Reia provides both objects and actors, but actors are there for edge cases and intended to be used almost exclusively by advanced programmers. Reia introduces a number of asynchronous concepts into its object model, and for that reason objects alone should suffice for most programmers, even when writing concurrent programs.

Advantages of Scala's approach

Reia's approach comes with a number of disadvantages, despite the theoretical benefits I've outlined above. For starters, Scala is a language built on the JVM, which is arguably the best language virtual machine available. Scala's sequential performance tops Erlang's, even if its performance in concurrent benchmarks typically lags behind.

Reia's main disadvantage is that its object model does not work like any other language in existence, unless you consider Erlang's approach an "object model". Objects, being shared-nothing, individually garbage-collected Erlang processes, are much heavier (approximately 300 machine words at minimum) than objects in your typical object oriented language (where I hear some runtimes offer zero overhead objects, or something). Your typical "throw objects at the problem" programmer is going to build a system where the more objects are involved, the more error prone it becomes. Reia is a language which asks you to sit back for a second and ponder what can be modeled as possibly nested structures of lists, tuples, and maps instead of objects.

Reia does not allow cyclical call graphs, meaning that an object receiving a call cannot call back into another object earlier in the call graph. Instead, objects deeper in the call graph must interact with any previously called objects asynchronously. If your head is spinning now I don't blame you, and if you do understand what I'm talking about, I cannot offer any solution beyond being aware of the call flow and ensuring that all "back-calls" are made asynchronously. A cyclic call graph results in a deadlock, one which can presently only be detected through timeouts, and it remains a terrible pathological case. This is far and away the biggest problem I have faced with Reia's object model, I do not have a good solution, and I would be eager for anyone's help in finding one.
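
The same pathology is easy to reproduce with Scala's actors and !? (a hypothetical sketch, not Reia code, but the deadlock mechanics are identical):

import scala.actors.Actor
import scala.actors.Actor._

case object Query
case object CallBack

object CycleDemo {
  // A synchronously calls B...
  val a: Actor = actor {
    react {
      case "go" => println(b !? Query)
    }
  }
  // ...and B, while servicing that call, synchronously calls back into A.
  // A is still blocked waiting for B's reply, so it can never answer the
  // CallBack request: both actors wait forever. That is the cyclic-call-graph
  // deadlock described above; making the back-call asynchronous (sending
  // CallBack with ! and not waiting) avoids it.
  val b: Actor = actor {
    react {
      case Query => println(a !? CallBack); reply("done")
    }
  }
}

// CycleDemo.a ! "go"   // uncommenting this deadlocks both actors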

All that said, Scala's solution is so beautifully familiar! It preserves the traditional OOP semantics of C++ which were carried over into Java, and this is the model most OOP programmers already know. I sometimes worry that the approach to OOP I am advocating in Reia will be rejected by developers familiar with the C++-inspired model, because Reia's approach is more complex and more confusing.

Furthermore, Scala's object model is not only familiar, it's extremely well-studied and well-optimized. The JVM is immensely capable of inlining method calls, which means calls spanning multiple objects can be condensed down to a single function call. This is possible because the Smalltalk-inspired illusion that these objects are sending and receiving messages is dropped entirely: objects are treated, C++-style, as mere chunks of state, so an inlined method call can act on many of them at once. In Reia, all objects are concurrent, share no state, and can only communicate with messages. Inlining calls across objects is thoroughly impossible, because sending messages in Reia is not some theoretical construct: it's what really happens, and it cannot simply be abstracted away into a function call which mutates the state of multiple objects. Each object is its own world and synchronizes with the outside by sending other objects messages and waiting for their responses (or perhaps just sending messages and moving on without waiting for a response).

So why even bother?

Why even bother pursuing Reia's approach then, if it's more complex and slow? I think its theoretical purity offers many advantages. Synchronizing concurrency through the object model itself abstracts away a lot of complexity. Traditional object usage patterns in Reia (aside from cyclical call graphs) have traditional object behavior, but when necessary, objects can easily be made concurrent by using them asynchronously. Because of this, the programmer isn't burdened with deciding ahead of time which parts of the system need to be sequential and which concurrent. They don't need to rip out their obj.method() calls and replace them with WTFs!? when a part of the system they didn't initially anticipate needs to become concurrent. Programmers shouldn't even need to use actors directly unless they're implementing certain actor-specific behaviors, like FSMs (the 5% case Joe Armstrong was talking about).

Why build objects on actors?

Objects aren't something I really have a concrete, logical defense of, as opposed to a functional approach. To each their own is all I can say. Object oriented programming is something of a cargo cult... its enthusiasts give defenses which often apply equally to functional programming, especially functional programming with the actor model as in Erlang.

My belief is that if concurrent programming can be mostly abstracted to the level of objects themselves, a lot more people will be able to understand it. People understand objects and the separation of concerns which is supposedly implicit in their identities, but present thread-based approaches to concurrency just toss that out the window. The same problem is present in Scala: layering the actor model on top of an object system means you end up with a confusing upper layer which works kind of like objects, but not quite, and runs concurrently, while objects don't. When you want to change part of the system from an object into an actor, you have to rewrite it into actor syntax and change all your lovely dots into !?s?!

Building the object system on top of the actor model itself means objects become the concurrency primitive. There is no divorce between objects and actors. There is no divorce between objects, thread-based actors, and event-based actors as in Scala. In Reia objects are actors, speak a common protocol, and can handle the most common use cases.

When objects are actors, I think everything becomes a lot simpler.

Sunday, April 5, 2009

Twitter: a followup

My last post about Twitter unsurprisingly drew responses from the super duper microblogging interconnected developers at Twitter.

I'm still confused as to why Starling ever came about in the first place, but their reasoning for the move to Scala is a lot clearer now. They want a statically typed language to manage a larger codebase, and a faster language so they can grow to handle more load. It would seem their entire system is both stateful and high-throughput, for which they need a high-performance, disk-logged message queue.

However, for the one stateless part of their system, the webapp, I guess they're going to stick with Ruby, and the good old Matz Ruby Interpreter. JRuby won't run their app for some reason.

That's at least what I digested from the Twitter employees. Thanks for your replies.

Friday, April 3, 2009

Twitter: blaming Ruby for their mistakes?

"It's a poor workman who blames his tools..."
With the Internet abuzz about Google being in talks with Twitter, it seems that Ruby has become the proverbial whipping boy for Twitter's scaling problems. Twitter developer Alex Payne is now preaching that Scala is the new Ruby, and panning Ruby for its technical foibles:
One of the things that I’ve found throughout my career is the need to have long-lived processes. And Ruby, like many scripting languages, has trouble being an environment for long lived processes. But the JVM is very good at that, because it’s been optimized for that over the last ten years. So Scala provides a basis for writing long-lived servers, and that’s primarily what we use it for at Twitter right now.
I've certainly been bitten by Ruby's poor garbage collection. The de facto Ruby interpreter uses a primitive mark-sweep garbage collection algorithm which slowly "leaks" memory over time as the heap fragments. It sure would be nice if Ruby had all sorts of exotic pluggable GC options the way the JVM does. Twitter developer Robey Pointer opines: "With Scala we could still write this really high level code, but be on the JVM." Yes, it sure would be nice if you could use a high level language like Ruby on the JVM.

Except Ruby does run on the JVM with JRuby, and this is something most Rubyists I know are aware of. I've been following JRuby pretty closely for the past several months because I am particularly interested in using Ruby for "always on" devices, so I need the compacting garbage collection the JVM provides. JRuby doesn't perform as well as Scala, but it is a fully implemented, viable, and in most cases better performing implementation than the de facto interpreter.

I can only assume someone at Twitter knows about and has used JRuby. Why this doesn't enter into their technology selection process is beyond me. Thanks to JRuby, "the interpreter sucks!" is no longer a valid complaint against Ruby, but that doesn't seem to prevent Twitter, one of the foremost Ruby-using companies in the world, from trashing it. This is ironic considering that complaints that "Ruby doesn't scale!" are almost inextricably linked to Twitter's scaling problems, while other companies have managed huge Rails deployments without scaling problems. I do not envy the task Twitter has before them, and at my job I certainly don't deal with the sheer volumes of data they do (although I do still deal with asynchronous processing of a lot of data, using Ruby), but it's my belief that Twitter's scaling problems have much more to do with the culture at Twitter than they do with Ruby as a language.

At the heart of this Ruby vs. Scala debacle at Twitter is their message queue. Rather than choosing one of the hundreds of message queues that are already available (including ones written in Ruby), Twitter seemed to succumb to NIH and wrote their own. The result was Starling, a message queue which speaks the memcache protocol (never mind there's already a message queue that does that too).

Starling is quite possibly one of the slowest and most poorly designed message queues in existence. I work for a company which, among other things, does a lot of message queue-driven asynchronous background processing of data using Ruby processes. When we selected a message queue, we surveyed at least a dozen of them, one of which was Starling. We did some basic simulated load testing, seeing how each queue performed as we increased the number of readers/writers and the message volume. Starling's performance was utterly abysmal: as we increased the number of readers and writers, its throughput quickly approached rock bottom.

As I perhaps somewhat self-aggrandizingly consider myself one of the most knowledgeable people regarding I/O in the Ruby world, I decided to peek around the Starling source and see what I could find. What I found was a half-assed and pathetically underperforming reinvention of EventMachine, an event-based networking framework for Ruby which is the Ruby answer to Python's Twisted framework. EventMachine is built on an underlying C++ implementation, and while the API it exposes is rather ugly, it's quite fast. This gross oversight was not present in the other message queue available for Ruby, which benchmarked substantially faster than Starling. Eventually Starling would be forked as "Evented Starling" and this gross oversight would be corrected.

As someone who has contributed to the EventMachine project and written my own high performance Ruby event framework, this is my reaction to the design of Starling:


It's not as if it's particularly hard to write a message queue. For shits and grins I wrote my own in Erlang just to compare it to Starling. The result was more or less as full featured as Starling, but performed a few orders of magnitude better, and was 1/10th the size (150 lines of code as opposed to 1500). My queue doesn't perform nearly as well as mature, open source alternatives, but it was a fun exercise to gauge just how badly the Twitter people failed.

Starling was clearly its authors' first attempt at writing a high performance network server in Ruby, and they failed miserably. I've never seen an explanation from Twitter as to why they felt existing message queues were inadequate. However, it became painfully clear that Starling was woefully inadequate:
By mid-2008, one of these Ruby message queues completely crashed and developers needed two and a half hours to shove the dropped Tweets back through the system. When your game is micro-blogging, that's a lifetime.
Yes, with absolutely zero experience in writing high performance network servers in Ruby, Twitter's NIH led them to homebrew their own message queue. And surprise surprise, it failed miserably! What was Twitter's reaction? Did they start looking for a better, open source message queue system written by people who are actually competent to develop message queues? No, of course not, more NIH to the rescue:
Then, in his spare time, one developer ported the code to Scala. According to Payne, the Scala queue could process the same message backlog in 20 seconds.
Yes, clearly Ruby is the problem, and more NIH is the solution. The result was Kestrel, a new message queue written in Scala which nobody but Twitter uses. It performs a lot better than Starling, though! Just not as well as RabbitMQ, a queue so fast certain crazy people I know are streaming video through it in realtime.

I've never seen Twitter's rationale for writing their own message queue in the first place. Reading the list of requirements given in the Kestrel description, I'm completely confused as to why MemcacheQ does not meet their needs. If you're willing to drop the "use the memcache protocol" requirement there are dozens of queues which would seem to fit their needs, with better performance than Kestrel.

I'm uncertain as to what else Twitter is using Scala for besides its message queue. Given all their myopic and seemingly JRuby-unaware harping on using Ruby for background jobs:
"And that wall was not so much it code but in limitations in Ruby virtual machine. There's a lot of things that Ruby is great at, but long running processes? Particularly memory intensive ones? Not so much."
...I'm guessing they are in the process of ripping out all the background jobs and rewriting them in Scala. But the message queue remains the centerpiece of their argument. The failures of the message queue are Twitter's own, not Ruby's.

I would be curious to hear what arguments, if any, Twitter had against JRuby, or against other message queues. The public arguments I've seen, and the decision-making process I'm inferring from them, seem incredibly inept. This bears out something I've heard from those who deal with the Twitter people: their scaling problems come not so much from Ruby as from bad design decisions.

Overall, I think they're missing the point of Ruby. Ruby shines as an automation and "glue code" language, providing you with a Swiss Army knife that lets you easily integrate many software components written in whatever language you want into a single, cohesive application. Message queues are commoditized in the software world, and Ruby clients exist for virtually all of them. AMQP, XMPP, Stomp, you name it and chances are Ruby speaks it. The EngineYard folks are doing wonderful things with Ruby and XMPP/AMQP in their Vertebra and Nanite projects. The Twitter folks are... off reinventing message queues in Ruby, then blaming Ruby when their implementation turns out to be slow.

In conclusion... is Ruby a bad language for writing message queues in? Yes, there are much better choices. Message queues are a particularly performance critical piece of software, which requires your language to have excellent I/O capabilities and a decent strategy for concurrency. Ruby has neither of these, so it's no wonder Starling fails miserably. But message queues aren't something you should be writing yourself. This speaks much more to Twitter's culture of NIH than it does to Ruby as a language.

Is Ruby a bad language for writing long-running processes? Absolutely not. JRuby brings the state-of-the-art garbage collection algorithms available in the JVM to the Ruby world. These are the exact same technologies that are available in Scala. JRuby addresses all of their concerns for long-running processes, but they don't bother to mention it and instead just point out the problems of the de facto Ruby interpreter.

I expect this debate is raging inside Twitter and we're only seeing the surface of it. My apologies to Twitter if there is actually a well thought out rationale for what they're doing, but if so, the public message (and software) you're giving to the world is entirely unconvincing.

Update: If you check the comments you'll see the Twitter folks have clarified their position, and I've created a new post in response. I think their arguments for Scala and Kestrel are certainly reasonable, and their position makes much more sense when it's based on Scala's strengths, not Ruby's weaknesses. Twitter is a great service that I use every day (I mean, I have a Twitter sidebar on my blog and all), despite its occasional stability problems. I wish them luck on their new Scala-based backend and hope they can get these stability problems licked.