"It's a poor workman who blames his tools..."
With the Internet abuzz about Google being in talks with Twitter, it seems that Ruby has become the proverbial whipping boy for Twitter's scaling problems. Twitter developer Alex Payne is now preaching that Scala is the new Ruby, and panning Ruby for its technical foibles:
One of the things that I’ve found throughout my career is the need to have long-lived processes. And Ruby, like many scripting languages, has trouble being an environment for long lived processes. But the JVM is very good at that, because it’s been optimized for that over the last ten years. So Scala provides a basis for writing long-lived servers, and that’s primarily what we use it for at Twitter right now.
I've certainly been bitten by Ruby's poor garbage collection. The de facto Ruby interpreter uses a primitive mark-sweep garbage collection algorithm which slowly "leaks" memory over time as the heap fragments. It sure would be nice if Ruby had all sorts of exotic pluggable GC options the way the JVM does. Twitter developer Robey Pointer opines: "With Scala we could still write this really high level code, but be on the JVM." Yes, it sure would be nice if you could use a high-level language like Ruby on the JVM.
Except Ruby does run on the JVM with JRuby, and this is something most Rubyists I know are aware of. I've been following JRuby pretty closely for the past several months because I am particularly interested in using Ruby for "always on" devices, so I need the compacting garbage collection the JVM provides. JRuby doesn't perform as well as Scala, but it is a complete, viable implementation of Ruby, and in most cases it performs better than the de facto interpreter.
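To make the fragmentation argument concrete, here is a deliberately simplified model. The `ToyHeap` class is hypothetical and is not how MRI actually lays out its heap: a non-moving collector can free dead objects in place but never moves survivors, so free space ends up scattered in holes too small for a larger allocation, while a compacting collector like those available on the JVM can slide live objects together and recover one contiguous free region.

```python
class ToyHeap:
    """Toy fixed-slot heap; a hypothetical model, not MRI's real heap layout."""

    def __init__(self, size):
        self.slots = [None] * size  # None marks a free slot

    def alloc(self, obj_id, n):
        """First-fit allocation of n contiguous slots; returns start index or None."""
        run = 0
        for i, slot in enumerate(self.slots):
            run = run + 1 if slot is None else 0
            if run == n:
                start = i - n + 1
                for j in range(start, i + 1):
                    self.slots[j] = obj_id
                return start
        return None

    def free(self, obj_id):
        """What a non-moving sweep does: mark the object's slots free in place."""
        self.slots = [None if s == obj_id else s for s in self.slots]

    def free_count(self):
        return sum(1 for s in self.slots if s is None)

    def compact(self):
        """What a compacting collector can do: slide live objects to the front.

        A real compacting collector must also rewrite every reference to the
        moved objects; this toy has no references, so it skips that step.
        """
        live = [s for s in self.slots if s is not None]
        self.slots = live + [None] * (len(self.slots) - len(live))


if __name__ == "__main__":
    heap = ToyHeap(8)
    for obj in range(8):
        heap.alloc(obj, 1)       # fill every slot with a one-slot object
    for obj in range(0, 8, 2):
        heap.free(obj)           # free every other object
    print(heap.free_count())     # 4 slots free in total...
    print(heap.alloc("big", 2))  # ...but none contiguous, so None
    heap.compact()
    print(heap.alloc("big", 2))  # after compaction it fits at index 4
```

Half the heap is free, yet a two-slot object still cannot be placed until the survivors are moved; that is the "slow leak" in miniature.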
I can only assume someone at Twitter knows about and has used JRuby. Why it doesn't enter into their technology selection process is beyond me. Thanks to JRuby, "the interpreter sucks!" is no longer a valid complaint against Ruby, but that doesn't stop Twitter, one of the foremost Ruby-using companies in the world, from trashing it. This is ironic, given that complaints of "Ruby doesn't scale!" are almost inextricably linked to Twitter's scaling problems, while other companies have managed huge Rails deployments without such problems. I do not envy the task Twitter has before it, and at my job I certainly don't deal with the sheer volume of data they do (although I still deal with asynchronous processing of a lot of data, using Ruby). Even so, I believe Twitter's scaling problems have much more to do with the culture at Twitter than with Ruby as a language.
At the heart of this Ruby vs. Scala debacle at Twitter is their message queue. Rather than choosing one of the hundreds of message queues already available (including ones written in Ruby), Twitter seemed to succumb to NIH ("Not Invented Here") syndrome and wrote their own. The result was Starling, a message queue which speaks the memcache protocol (never mind that there's already a message queue that does that too).
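The appeal of speaking the memcache protocol is that any existing memcache client can double as a queue client. The sketch below assumes the mapping Starling is generally described as using, where `set` appends a value to a named queue and `get` pops the oldest value from it. The `MemcacheQueue` class is hypothetical: it handles one complete request at a time, supports a single key per `get`, ignores flags and expiry, and skips networking and persistence entirely.

```python
from collections import defaultdict, deque


class MemcacheQueue:
    """Hypothetical toy queue that reuses memcache text-protocol verbs."""

    def __init__(self):
        self.queues = defaultdict(deque)  # queue name -> FIFO of payloads

    def handle(self, request: bytes) -> bytes:
        line, _, rest = request.partition(b"\r\n")
        parts = line.split()
        if not parts:
            return b"ERROR\r\n"
        cmd = parts[0]

        # "set <key> <flags> <exptime> <bytes>\r\n<data>\r\n" enqueues <data>.
        if cmd == b"set" and len(parts) == 5:
            key = parts[1]
            try:
                nbytes = int(parts[4])
            except ValueError:
                return b"CLIENT_ERROR bad command line format\r\n"
            data = rest[:nbytes]
            if len(data) != nbytes or rest[nbytes:nbytes + 2] != b"\r\n":
                return b"CLIENT_ERROR bad data chunk\r\n"
            self.queues[key].append(data)
            return b"STORED\r\n"

        # "get <key>\r\n" dequeues the oldest payload, or returns just END if empty.
        if cmd == b"get" and len(parts) == 2:
            key = parts[1]
            q = self.queues.get(key)  # .get() avoids creating empty queues
            if not q:
                return b"END\r\n"
            data = q.popleft()
            return b"VALUE %s 0 %d\r\n%s\r\nEND\r\n" % (key, len(data), data)

        return b"ERROR\r\n"


if __name__ == "__main__":
    mq = MemcacheQueue()
    print(mq.handle(b"set jobs 0 0 5\r\nhello\r\n"))  # b'STORED\r\n'
    print(mq.handle(b"get jobs\r\n"))  # b'VALUE jobs 0 5\r\nhello\r\nEND\r\n'
    print(mq.handle(b"get jobs\r\n"))  # b'END\r\n'
```

The whole "protocol" is a few dozen lines; the hard parts of a real queue are the networking, concurrency and durability that this sketch leaves out.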
Starling is quite possibly one of the slowest and most poorly designed message queues in existence. I work for a company which, among other things, does a lot of message queue-driven asynchronous background processing of data using Ruby processes. When we selected a message queue, we surveyed at least a dozen of them, one of which was Starling. We did some basic simulated load testing, measuring how each queue performed as we increased both the number of readers/writers and the message volume. Starling's performance was utterly abysmal: as we added readers and writers, its performance quickly approached rock bottom.
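A load test along those lines can be sketched as a harness that starts some number of writer and reader threads and times how long it takes to push a fixed number of messages through. This is an illustrative sketch, not the harness we actually used: `load_test` is hypothetical, and Python's in-process `queue.Queue` stands in for a client connection to whichever networked queue is under test.

```python
import queue
import threading
import time


def load_test(make_queue, writers, readers, messages_per_writer):
    """Hypothetical harness: N writers and M readers share one queue.

    Returns (messages_received, elapsed_seconds).
    """
    q = make_queue()
    received = [0] * readers  # one counter per reader, so no lock is needed
    sentinel = object()       # tells a reader to stop

    def writer():
        for i in range(messages_per_writer):
            q.put(i)

    def reader(idx):
        while True:
            item = q.get()
            if item is sentinel:
                return
            received[idx] += 1

    reader_threads = [threading.Thread(target=reader, args=(i,)) for i in range(readers)]
    writer_threads = [threading.Thread(target=writer) for _ in range(writers)]

    start = time.perf_counter()
    for t in reader_threads + writer_threads:
        t.start()
    for t in writer_threads:
        t.join()
    # Sentinels go in after every real message, so FIFO order guarantees
    # readers drain all messages before any of them sees a sentinel.
    for _ in range(readers):
        q.put(sentinel)
    for t in reader_threads:
        t.join()
    elapsed = time.perf_counter() - start
    return sum(received), elapsed


if __name__ == "__main__":
    for n in (1, 4, 16):
        count, secs = load_test(queue.Queue, writers=n, readers=n, messages_per_writer=5_000)
        print(f"{n:>2} writers/readers: {count} messages in {secs:.3f}s")
```

To make the numbers mean anything, `make_queue` would return a thin (hypothetical) wrapper exposing `put` and `get` over a real queue client; the in-process figures above say nothing about Starling itself.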
As I (perhaps somewhat self-aggrandizingly) consider myself one of the most knowledgeable people regarding I/O in the Ruby world, I decided to poke around the Starling source and see what I could find. What I found was a half-assed and pathetically underperforming reinvention of EventMachine, an event-based networking framework which is Ruby's answer to Python's Twisted framework. EventMachine is built on an underlying C++ implementation, and while the API it exposes is rather ugly, it's quite fast. The other message queue available for Ruby did not share this gross oversight, and it benchmarked substantially faster than Starling. Starling was eventually forked as "Evented Starling," which corrected the oversight.
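For readers unfamiliar with the reactor pattern behind EventMachine and Twisted: a single thread asks the operating system which connections are ready, then dispatches each ready connection to a registered callback, instead of dedicating a thread or a blocking call to every client. The sketch below uses Python's standard `selectors` module; the `Reactor` class and `echo_handler` are hypothetical illustrations, not EventMachine's API.

```python
import selectors
import socket


class Reactor:
    """Minimal single-threaded reactor: one loop dispatches readiness events."""

    def __init__(self):
        self.sel = selectors.DefaultSelector()

    def on_readable(self, sock, callback):
        # Register a non-blocking socket; callback(sock) runs when it is readable.
        sock.setblocking(False)
        self.sel.register(sock, selectors.EVENT_READ, callback)

    def remove(self, sock):
        self.sel.unregister(sock)

    def run_once(self, timeout=None):
        # Wait until registered sockets are ready (or timeout), then dispatch each.
        for key, _mask in self.sel.select(timeout):
            key.data(key.fileobj)


def echo_handler(reactor):
    """Build a callback that echoes whatever arrives back to the sender."""
    def on_data(sock):
        data = sock.recv(4096)
        if data:
            # Simplification: a real reactor would buffer output and wait for
            # EVENT_WRITE rather than assume the send buffer has room.
            sock.sendall(data)
        else:
            # An empty read means the peer closed the connection.
            reactor.remove(sock)
            sock.close()
    return on_data


if __name__ == "__main__":
    server_side, client_side = socket.socketpair()
    reactor = Reactor()
    reactor.on_readable(server_side, echo_handler(reactor))
    client_side.sendall(b"ping")
    reactor.run_once(timeout=1.0)
    print(client_side.recv(4096))
    reactor.remove(server_side)
    server_side.close()
    client_side.close()
```

A production loop calls `run_once` forever and registers many sockets at once; the point is that one thread multiplexes them all, which is what Starling's hand-rolled I/O was reinventing.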
As someone who has contributed to the EventMachine project and written my own high performance Ruby event framework, this is my reaction to the design of Starling:
It's not as if it's particularly hard to write a message queue. For shits and grins I wrote my own in Erlang just to compare it to Starling. The result was more or less as full-featured as Starling, but performed a few orders of magnitude better and was 1/10th the size (150 lines of code as opposed to 1500). My queue doesn't perform nearly as well as mature open source alternatives, but it was a fun exercise to gauge just how badly the Twitter people failed.
Starling was clearly its authors' first attempt at writing a high performance network server in Ruby, and they failed miserably. I've never seen an explanation from Twitter as to why they felt existing message queues were inadequate. However, it soon became painfully clear that Starling itself was woefully inadequate:
By mid-2008, one of these Ruby message queues completely crashed and developers needed two and a half hours to shove the dropped Tweets back through the system. When your game is micro-blogging, that's a lifetime.
Yes, with absolutely zero experience writing high performance network servers in Ruby, Twitter's NIH led them to homebrew their own message queue. And surprise, surprise, it failed miserably! What was Twitter's reaction? Did they start looking for a better open source message queue written by people who are actually competent to develop message queues? No, of course not. More NIH to the rescue:
Then, in his spare time, one developer ported the code to Scala. According to Payne, the Scala queue could process the same message backlog in 20 seconds.
Yes, clearly Ruby is the problem, and more NIH is the solution. The result was Kestrel, a new message queue written in Scala which nobody but Twitter uses. It performs a lot better than Starling, though! Just not as well as RabbitMQ, a queue so fast that certain crazy people I know are streaming video through it in realtime.
I've never seen Twitter's rationale for writing their own message queue in the first place. Reading the list of requirements given in the Kestrel description, I'm completely confused as to why MemcacheQ does not meet their needs. If you're willing to drop the "use the memcache protocol" requirement there are dozens of queues which would seem to fit their needs, with better performance than Kestrel.
I'm uncertain what else Twitter is using Scala for besides its message queue. Given all their myopic and seemingly JRuby-unaware harping about Ruby's unsuitability for background jobs:
"And that wall was not so much it code but in limitations in Ruby virtual machine. There's a lot of things that Ruby is great at, but long running processes? Particularly memory intensive ones? Not so much."
...I'm guessing they are in the process of ripping out all their background jobs and rewriting them in Scala. But the message queue remains the centerpiece of their argument, and the failures of that message queue are Twitter's own, not Ruby's.
I would be curious to hear what arguments, if any, Twitter had against JRuby or other message queues. The public arguments I've seen, and the decision-making process I'm inferring from them, seem incredibly inept. This fits with something I've heard from people who deal with the Twitter folks: their scaling problems come not so much from Ruby as from bad design decisions.
Overall, I think they're missing the point of Ruby. Ruby shines as an automation and "glue code" language, providing you with a Swiss Army knife that lets you easily integrate software components written in whatever language you want into a single, cohesive application. Message queues are a commodity in the software world, and Ruby clients exist for virtually all of them. AMQP, XMPP, Stomp: you name it, and chances are Ruby speaks it. The Engine Yard folks are doing wonderful things with Ruby and XMPP/AMQP in their Vertebra and Nanite projects. The Twitter folks are... off reinventing message queues in Ruby, then blaming Ruby when their implementation turns out to be slow.
In conclusion... is Ruby a bad language for writing message queues? Yes; there are much better choices. Message queues are a particularly performance-critical piece of software, which requires that your language have excellent I/O capabilities and a decent strategy for concurrency. Ruby has neither, so it's no wonder Starling fails miserably. But message queues aren't something you should be writing yourself. This speaks much more to Twitter's culture of NIH than it does to Ruby as a language.
Is Ruby a bad language for writing long-running processes? Absolutely not. JRuby brings the state-of-the-art garbage collection algorithms available in the JVM to the Ruby world. These are exactly the same technologies available to Scala. JRuby addresses all of their concerns about long-running processes, but they don't bother to mention it and instead just point out the problems of the de facto Ruby interpreter.
I expect this debate is raging inside Twitter and we're only seeing the surface of it. My apologies to Twitter if there is actually a well-thought-out rationale for what they're doing, but if so, the public message (and software) you're giving to the world is entirely unconvincing.
Update: If you check the comments you'll see the Twitter folks have clarified their position, and I've created a new post in response. I think their arguments for Scala and Kestrel are certainly reasonable, and their position makes much more sense when it's based on Scala's strengths, not Ruby's weaknesses. Twitter is a great service that I use every day (I mean, I have a Twitter sidebar on my blog and all), despite its occasional stability problems. I wish them luck on their new Scala-based backend and hope they can get these stability problems licked.