Friday, April 3, 2009

Twitter: blaming Ruby for their mistakes?

"It's a poor workman who blames his tools..."
With the Internet abuzz about Google being in talks with Twitter, it seems that Ruby has become the proverbial whipping boy for Twitter's scaling problems. Twitter developer Alex Payne is now preaching that Scala is the new Ruby, and panning Ruby for its technical foibles:
One of the things that I’ve found throughout my career is the need to have long-lived processes. And Ruby, like many scripting languages, has trouble being an environment for long lived processes. But the JVM is very good at that, because it’s been optimized for that over the last ten years. So Scala provides a basis for writing long-lived servers, and that’s primarily what we use it for at Twitter right now.
I've certainly been bitten by Ruby's poor garbage collection. The de facto Ruby interpreter uses a primitive mark-sweep garbage collection algorithm which slowly "leaks" memory over time as the heap fragments. It sure would be nice if Ruby had all sorts of exotic pluggable GC options the way the JVM does. Twitter developer Robey Pointer opines: "With Scala we could still write this really high level code, but be on the JVM." Yes, it sure would be nice if you could use a high level language like Ruby on the JVM.

Except Ruby does run on the JVM with JRuby, and this is something most Rubyists I know are aware of. I've been following JRuby pretty closely for the past several months because I am particularly interested in using Ruby for "always on" devices, so I need the compacting garbage collection the JVM provides. JRuby doesn't perform as well as Scala, but it is a complete, viable implementation that in most cases outperforms the de facto interpreter.
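To make that concrete, here's a minimal sketch of the kind of long-lived process I'm talking about. The worker itself is nothing special; the point is that under JRuby you can hand tuning flags straight to the JVM with the -J prefix and let its collectors keep the heap compact. The flags and numbers below are purely illustrative, not recommendations:

    # worker.rb -- a trivial "always on" process that slowly churns garbage.
    # Under MRI's mark-sweep collector a process like this tends to bloat;
    # under JRuby the JVM's compacting collectors keep it in check.
    #
    # Run it with JVM options passed through via -J, e.g.:
    #   jruby -J-Xmx256m -J-XX:+UseConcMarkSweepGC worker.rb
    loop do
      batch = Array.new(10_000) { |i| "message #{i}" * 10 }
      batch.each { |msg| msg.upcase }   # stand-in for real work
      sleep 1
    end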

I can only assume someone at Twitter knows about and has used JRuby. Why this doesn't enter into their technology selection process is beyond me. Thanks to JRuby, "the interpreter sucks!" is no longer a valid complaint against Ruby, but that doesn't seem to prevent Twitter, one of the foremost Ruby-using companies in the world, from trashing it. This is ironic considering that complaints that "Ruby doesn't scale!" are almost inextricably linked to Twitter's scaling problems, while other companies have managed huge Rails deployments without such trouble. I do not envy the task Twitter has before them, and at my job I certainly don't deal with the sheer volumes of data they do (although I do still deal with asynchronous processing of a lot of data, using Ruby), but it's my belief that Twitter's scaling problems have much more to do with the culture at Twitter than they do with Ruby as a language.

At the heart of this Ruby vs. Scala debacle at Twitter is their message queue. Rather than choosing one of the hundreds of message queues that are already available (including ones written in Ruby), Twitter seemed to succumb to NIH and wrote their own. The result was Starling, a message queue which talks the memcache protocol (never mind there's already a message queue that does that too).
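For readers who haven't seen it: because Starling masquerades as a memcached server, pushing and popping messages from Ruby looks like ordinary cache traffic. Roughly, with the memcache-client gem (the host and port, which I believe defaults to 22122, are for illustration only):

    require 'rubygems'
    require 'memcache'

    # Starling speaks the memcache protocol: "set" enqueues onto the named
    # queue and "get" dequeues the oldest message (or returns nil if empty).
    queue = MemCache.new('localhost:22122')

    queue.set('jobs', 'resize avatar for user 42')
    job = queue.get('jobs')
    puts job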

Starling is quite possibly one of the slowest and most poorly designed message queues in existence. I work for a company which, among other things, does a lot of message queue-driven asynchronous background processing of data using Ruby processes. When we selected a message queue, we surveyed at least a dozen of them, one of which was Starling. We did some basic simulated load testing, seeing how each queue performed as we increased the number of readers/writers and the message volume. Starling's performance was utterly abysmal: as we added readers and writers, its throughput quickly approached rock bottom.
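The shape of that test was roughly the following, again assuming a memcache-speaking queue like Starling and the memcache-client gem. The thread counts, message counts, and the 'bench' queue name are made up for the example; our real harness was more involved:

    require 'rubygems'
    require 'memcache'

    WRITERS  = 50
    READERS  = 50
    MESSAGES = 10_000

    start = Time.now

    # Each writer thread pushes its share of the messages...
    writers = (1..WRITERS).map do
      Thread.new do
        conn = MemCache.new('localhost:22122')
        (MESSAGES / WRITERS).times { |i| conn.set('bench', "payload #{i}") }
      end
    end

    # ...while each reader thread polls until it has popped its share.
    readers = (1..READERS).map do
      Thread.new do
        conn = MemCache.new('localhost:22122')
        popped = 0
        until popped == MESSAGES / READERS
          popped += 1 if conn.get('bench')
        end
      end
    end

    (writers + readers).each { |t| t.join }
    puts "#{(MESSAGES / (Time.now - start)).round} messages/sec"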

As I perhaps somewhat self-aggrandizingly consider myself one of the most knowledgeable people regarding I/O in the Ruby world, I decided to poke around the Starling source and see what I could find. What I found was a half-assed and pathetically underperforming reinvention of EventMachine, an event-based networking framework for Ruby which is the Ruby answer to Python's Twisted. EventMachine is built on an underlying C++ implementation, and while the API it exposes is rather ugly, it's quite fast. This gross oversight was not present in the other message queue available for Ruby, which benchmarked substantially faster than Starling. Eventually Starling would be forked as "Evented Starling" and the oversight would be corrected.
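For those who haven't used it, this is the pattern EventMachine gives you for free and that Starling originally reimplemented by hand. A toy line-echo server rather than a queue, just to show the shape of the API:

    require 'rubygems'
    require 'eventmachine'

    # EventMachine drives the reactor loop and hands each connection's raw
    # bytes to receive_data; there is no hand-rolled select()/accept() code.
    module LineEcho
      def receive_data(data)
        send_data(data)   # echo back whatever the client sent
      end
    end

    EventMachine.run do
      EventMachine.start_server('127.0.0.1', 10_000, LineEcho)
      puts 'echo server listening on 127.0.0.1:10000'
    end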

As someone who has contributed to the EventMachine project and written my own high performance Ruby event framework, this is my reaction to the design of Starling:

[image: the obligatory Picard facepalm]

It's not as if it's particularly hard to write a message queue. For shits and grins I wrote my own in Erlang just to compare it to Starling. The result was more or less as full featured as Starling, but performed a few orders of magnitude better, and was 1/10th the size (150 lines of code as opposed to 1500). My queue doesn't perform nearly as well as mature, open source alternatives, but it was a fun exercise to gauge just how badly the Twitter people failed.

Starling was clearly its authors' first attempt at writing a high performance network server in Ruby, and it showed. I've never seen an explanation from Twitter as to why they felt existing message queues were inadequate. However, it became painfully clear that Starling was woefully inadequate:
By mid-2008, one of these Ruby message queues completely crashed and developers needed two and a half hours to shove the dropped Tweets back through the system. When your game is micro-blogging, that's a lifetime.
Yes: with absolutely zero experience writing high performance network servers in Ruby, Twitter let NIH lead them to homebrew their own message queue. And surprise surprise, it failed miserably! What was Twitter's reaction? Did they start looking for a better, open source message queue written by people who are actually competent to develop message queues? No, of course not; more NIH to the rescue:
Then, in his spare time, one developer ported the code to Scala. According to Payne, the Scala queue could process the same message backlog in 20 seconds.
Yes, clearly Ruby is the problem, and more NIH is the solution. The result was Kestrel, a new message queue written in Scala which nobody but Twitter uses. It performs a lot better than Starling, though! Just not as well as RabbitMQ, a queue so fast certain crazy people I know are streaming video through it in realtime.

I've never seen Twitter's rationale for writing their own message queue in the first place. Reading the list of requirements given in the Kestrel description, I'm completely confused as to why MemcacheQ does not meet their needs. If you're willing to drop the "use the memcache protocol" requirement there are dozens of queues which would seem to fit their needs, with better performance than Kestrel.

I'm uncertain what else Twitter is using Scala for besides its message queue. Given all their myopic and seemingly JRuby-unaware harping on the problems of using Ruby for background jobs:
"And that wall was not so much in code but in limitations in the Ruby virtual machine. There's a lot of things that Ruby is great at, but long running processes? Particularly memory intensive ones? Not so much."
...I'm guessing they are in the process of ripping out all the background jobs and rewriting them in Scala. But the message queue remains the centerpiece of their argument, and its failures are Twitter's own, not Ruby's.

I would be curious to hear what arguments, if any, Twitter had against JRuby, or other message queues. The public arguments I've seen, and the decision-making process I'm inferring from them, seem incredibly inept. This matches something I've heard from people who deal with the Twitter folks: their scaling problems come not so much from Ruby as from bad design decisions.

Overall, I think they're missing the point of Ruby. Ruby shines as an automation and "glue code" language, providing you with a Swiss Army knife that lets you easily integrate many software components written in whatever language you want into a single, cohesive application. Message queues are a commodity in the software world, and Ruby clients exist for virtually all of them. AMQP, XMPP, Stomp, you name it and chances are Ruby speaks it. The Engine Yard folks are doing wonderful things with Ruby and XMPP/AMQP in their Vertebra and Nanite projects. The Twitter folks are... off reinventing message queues in Ruby, then blaming Ruby when their implementation turns out to be slow.
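As a taste of what that glue code looks like, here's a rough sketch of talking to any Stomp-speaking broker (ActiveMQ, StompServer, and so on) with the stomp gem. The broker address and queue name are placeholders, and older versions of the gem called the producer method send where newer ones use publish:

    require 'rubygems'
    require 'stomp'

    # Connect to a Stomp broker (login, passcode, host, port).
    client = Stomp::Client.new('', '', 'localhost', 61613)

    # Print anything that shows up on the queue...
    client.subscribe('/queue/tweets') do |msg|
      puts "got: #{msg.body}"
    end

    # ...and push a message onto it ("send" in older versions of the gem).
    client.publish('/queue/tweets', 'ruby as glue code')
    sleep 1     # give the subscription a moment to fire
    client.close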

In conclusion... is Ruby a bad language for writing message queues in? Yes, there are much better choices. A message queue is a particularly performance critical piece of software, which requires that your language have excellent I/O capabilities and a decent concurrency story. Ruby has neither, so it's no wonder Starling performs so poorly. But message queues aren't something you should be writing yourself in the first place. This speaks much more to Twitter's culture of NIH than it does to Ruby as a language.

Is Ruby a bad language for writing long-running processes? Absolutely not. JRuby brings the JVM's state-of-the-art garbage collectors to the Ruby world, the exact same technologies available to Scala. JRuby addresses all of their concerns about long-running processes, but they don't bother to mention it and instead just point out the problems of the de facto Ruby interpreter.

I expect this debate is raging inside Twitter and we're only seeing the surface of it. My apologies to Twitter if there is actually a well thought out rationale for what they're doing, but if so, the public message (and software) you're giving to the world is entirely unconvincing.

Update: If you check the comments you'll see the Twitter folks have clarified their position, and I've created a new post in response. I think their arguments for Scala and Kestrel are certainly reasonable, and their position makes much more sense when it's based on Scala's strengths, not Ruby's weaknesses. Twitter is a great service that I use every day (I mean, I have a Twitter sidebar on my blog and all), despite its occasional stability problems. I wish them luck on their new Scala-based backend and hope they can get these stability problems licked.

48 comments:

johnbender said...

I'm glad you mentioned Erlang as messaging plays directly to its strengths (even without mentioning the ease with which it can scale out)

Simon said...

Of the message queues you mention, only ActiveMQ was mature enough to be a reasonable option back in early 2008 when they released Starling. See my more detailed comment on hacker news: http://news.ycombinator.com/item?id=546275

johnbender said...

@simon

Yes, but why rewrite it in Scala now that they have more options? It still doesn't make any sense.

Steve Jenson said...

Thanks for the blog post.

We can't use JRuby without rewriting all of our gems that rely on native extensions.

We did try other message queues and found them to either not work well at our volume or to lack features (like paging to disk) that we needed.

Daniel said...

> We can't use JRuby without
> rewriting all of our gems
> that rely on native extensions.

Don't quite understand this. Isn't the message server pretty stand-alone? Does it need to load lots of gems? Also by rewriting in Scala you automatically lose all your gems....

Ikai Lan said...

Tony,

I really enjoyed reading this article. Keep it up!

Ikai
http://www.twitter.com/ikai

Steve Jenson said...

Daniel, the port to Scala for starling was purely for performance. Starling didn't rely heavily on third-party libraries. Twitter.com proper does.

Charles Oliver Nutter said...

Steve Jenson: I don't want to hassle you too much, but your statement is really absurd. "We can't use JRuby without rewriting all of our gems that rely on native extensions." So you wrote the backend in Scala? None of the furor raised over the Twitter move has anything to do with the frontend, which is still running Ruby. Keep that part running on CRuby if it suits you, I really don't care. But the Artima interview listed many points in favor of Scala and against Ruby that could have been remedied by JRuby. You must understand we work very hard to help people understand that many problems with CRuby are solved very neatly by JRuby. If Scala was truly the best option to solve Twitter's problems, so be it...I recognize that Scala is a great tool. But continuing to use CRuby's issues as justification for leaving Ruby while ignoring JRuby really stinks.

Steve Jenson said...

Charles, we respect the hard work that you guys have done with JRuby but for the systems that we wanted to build, we felt that we'd get more from a system that compiled to JVM bytecode and could perform on par with Java. We were hoping to also see benefits from Scala's advanced type system. So far, we haven't been disappointed. That doesn't mean there's no place for JRuby at Twitter, we just haven't found it yet. We're open to it.

Ed Borasky said...

1. JRuby can compile to JVM bytecode, right?

2. There's no way to know how Twitter as a whole will perform on JRuby without investing significant manpower and performance engineering. The same goes for a port to Scala, Erlang, or C. There ain't no free lunches or silver bullets in performance engineering.

3. It's easy to sit on the sidelines and criticize Twitter for scalability issues. But no matter what kind of architecture or platform you have, when you build a public-facing application with the kind of unregulated load growth that Twitter is supporting, there are going to be performance engineering challenges. eBay's had them, Google's had them, Amazon's had them, Netflix has had them. I'm sure the major online banking servers have too, although you won't hear about them publicly. :)

Eventually you have to take it back to the business and make the hard choices. Sure, there's some interesting computer science / application architecture here, and lots of room for improvement in the underlying Ruby core interpreters. I've profiled them -- I've seen what they do down to the line of code. So has Charlie, at least for JRuby.

But it's a business decision as much as it is a technical decision. Do you regulate load growth? If so, how? Do you throw hardware at it? What happens if the growth stops, as all exponential growth usually does? Who are your customers? What are their needs? What are the priorities?

Yehuda Katz said...

@steve Your very last comment (you wanted Java speeds and typing) was a valid argument, and the argument should be left at that.

The arguments related to long-running processes, threading, and garbage collection (and now C extensions) are muddling the topic so it cannot be properly argued in good faith.

Tony said...

Steve,

In what ways did ActiveMQ not meet your needs which Starling did?

ActiveMQ is much faster than Starling and supports a disk log. It also speaks the Stomp protocol which has been well-supported in Ruby for quite awhile.

Nick said...

Steve did extensive load and stress testing of ActiveMQ, RabbitMQ, etc. ActiveMQ is actually quite slow (much slower than Kestrel), RabbitMQ consistently crashes with too many producers and too few consumers.

The whole "NIH" argument is an unfair one for services at Twitter's scale. Long ago Twitter reached the point where we run daily into bugs and performance limitations in supposedly enterprise-class open source software. We crash mysql, we crash memcached, etc. all the time.

Scala is a very nice language. It's very fast, faster than JRuby, and its type system provides some advantages over a dynamic language for certain kinds of applications. Some engineers here prefer it to Ruby, and so weighing the pros and cons, most new network services at Twitter are being written in Scala.

Steve Jenson said...

Tony,

We didn't compare ActiveMQ to Starling, we compared it to Kestrel. Robey had already written Kestrel in his spare time and we were going to use it as a more scalable drop-in replacement while we evaluated other queues.

ActiveMQ, in persistent mode, was very slow. An order of magnitude slower than Kestrel. RabbitMQ was speedy but if you put in more messages than you had memory, it would run out of memory and crash.

We also evaluated several commercial message queues but found them lacking in Ruby support.

MemcacheQ wasn't around when we did this evaluation but it's inappropriate due to its use of Oracle BerkeleyDB, which we don't want to pay for just to get functionality we already have.

The commenter 'Nick' works at Twitter and can attest that we spent several weeks going over our options, running extensive load tests, and presented our findings to the team at each stage. We did our due diligence.

sethladd said...

It sounds like Twitter did a lot of research and could very well have completely sound reasons for their choices. I think it's time Twitter put together a white paper or presentation on queue services. This would silence a lot of the critics and benefit the community as the research is opened up. I suspect that, if the research is sound and ActiveMQ truly is slow (at twitter scale), then the ActiveMQ community would get it together and fix the issue. This benefits everyone. Twitter becomes a thought leader in large scale message processing and ActiveMQ gets a whole lot better.

Remember when people found the Linux kernel networking stack to be very slow compared to Windows? The Linux people got it together and refactored it to be zero copy and ultimately very fast.

Benchmarks and solid research help everyone. Twitter: please share this. The scientific method has been proven, let's act like good engineers and measure and prove and share.

Ehsanul said...

Totally agree with sethladd. Releasing information about their internal evaluations will both help Twitter's image, and help the community.

cremes said...

The last poster is right; post your findings.

Also, if neither ActiveMQ or RabbitMQ were perfect fits, why not take advantage of their open source nature and fix them? It seems odd to create a greenfield message queueing project when so many open alternatives exist.

Okay, so they weren't perfect in your testing. But weren't they closer than what you had (nothing)?

Alternately, I know the guys behind RabbitMQ do consulting work for money (LShift, I think). You could have farmed the work out and not distracted your internal folks with designing, engineering, building, testing and maintaining another home-grown component.

Release the details. We'll all be better off for it.

Nick said...

I suspect we can release our numbers, hopefully we will do so. But--and I realize I'm asking for faith on your guys' part--give us the benefit of the doubt, set aside your NIH phobia. We had a RabbitMQ guy IN THE OFFICE to discuss our needs.

Existing open source projects aren't sacred. It's OK for there to be yet another one, especially if it has concrete improvements. Contributing back to an open-source project, when it's not in your language of choice/expertise (e.g., Java, Erlang) is not a great option. In our case, Kestrel makes a certain set of trade offs with respect to durability that *MQ doesn't. In Kestrel, the write-ahead log is in memory and is only periodically flushed to disk. In *MQ the log is always synced to disk. Finally, the use of the memcached protocol w/ client-side round-robin is a fundamentally different way of providing scaling and availability than the *MQs, and prohibits hard ordering.

Finally, when you work on a large code-base at scale like Twitter, the integration costs of e.g., moving to Stomp or AMQP can easily outweigh the costs of writing and operationalizing a fully encapsulated service that preserves an existing API (Starling/Memcached). With kestrel, we could simply replace 1 of our N starling servers and see how it behaves. No client code needed to change.

Now imagine a complete port to stomp. Suppose it turns out during our initial deploy that there are major defects somewhere (in the message queue, in the new producer/consumer code, whatever). How do we roll back? We have data in two message queues, producers and consumers at completely different versions, etc.

These issues are complex and reflect interacting with a legacy codebase and weighing engineering trade-offs that you aren't normally concerned with until you're Twitter. So please, give us the benefit of the doubt.

Tony said...

The one thing that still doesn't make sense to me is: if your requirements are as high as you claim (which I'm sure they are), how did you ever get by on Starling?

Steve Jenson said...

Tony,

We had a much smaller site then and ran a lot more Starlings.

Alex Payne said...

Tony,

Hoo boy. First of all, I hope you've had a chance to read my general reply to the articles about my Web 2.0 Expo talk [1] and this response to a vocal member of the Ruby community [2]. I sound like a pretty unreasonable guy filtered through the tech press and Reddit comments, but I hope less so in my own words.

Secondly, the quote at the top of your post is from my coworker, Steve Jenson, who's been participating in the discussion on this post.

On JRuby: as Steve said, we can't actually boot our main Rails app on JRuby. That's a blocker. Incidentally, if you know of anyone who has a large JRuby deployment, we'd be interested in that first-hand experience. If you don't, it might be a little early to say it would solve all our problems.

It's also incorrect to say that the way JRuby and Scala make use of the JVM is exactly the same. Much like our other decisions haven't been arbitrary, our decision to use Scala over other JVM-hosted languages was based on investigation.

On our culture: if you'd like to know about how we write code, or how our code has evolved over time, just ask us. We're all on Twitter, of course, but most of the engineers also have blogs and publish their email addresses. There's no need to speculate. Just ask. There's not a "raging debate" internally because we make our engineering decisions like engineers: we experiment, and base our decisions on the results of those experiments.

It's definitely true that Starling and Evented Starling are relatively immature queuing systems. I was eager to get them out of our stack. So, as Steve said, we put all the MQ's you think we'd try through their paces not too long ago, and we knocked one after another over in straightforward benchmarks. Some, like RabbitMQ, just up and died. Others chugged on, but slowly. Where we ran into issues, we contacted experts and applied best practices, but in the end, we found that Kestrel fit our particular use cases better and more reliably. This was not the hypothesis we had going into those benchmarks, but it's what the data bore out.

We get a lot of speculation to the tune of "why haven't those idiots tried x, it's so obvious!" Generally, we have tried x, as well as y and z. Funnily enough, I was actually pushing to get us on RabbitMQ, but our benchmarks showed that it just wouldn't work for us, which is a shame, because it advertises some sexy features.

Personally, I'm extremely NIH-averse; I research open source and commercial solutions before cutting a new path. In the case of our MQ, one of our engineers actually wrote Kestrel in his free time, so it was a bit more like we adopted an existing open source project than rolled our own. Pretty much the last thing we want to be doing is focusing on problems outside our domain. As it so happens, though, moving messages around quickly is our business. I don't think it's crazy-go-nuts that we've spent some time on an MQ.

I hope my colleagues and I have been able to answer some of your questions. As I said, in the future, please consider emailing us so we can share our experience. Then, we can have a public discussion about facts, not speculation. Perhaps, as commenter sethladd suggested, the onus is on us to produce a whitepaper or presentation about our findings so as to stave off such speculation. Time constraints are the main reason why we haven't done so.

[1] http://al3x.net/2009/04/04/reasoned-technical-discussion.html
[2] http://blog.obiefernandez.com/content/2009/04/my-reasoned-response-about-scala-at-twitter.html#IDComment18212539

Tony said...

Wow, thanks. That was quite a satisfactory response. I'll post a followup blog here in a bit.

cremes said...

RE: forking or improving an existing open source project and integrating it with your systems...

Asked and answered. Thanks for your transparency.

Paul said...

So what I get from these comments from the Twitter guys is that Starling sucked and was subsequently replaced by a new Scala messaging system.

In other words Ruby wasn't the problem, it was Starling (which, as the Twitter guys said, was a good messaging system when written but just couldn't handle the increased load in the long run).

alexis said...

Hello, Alexis here from the RabbitMQ team. I am the guy who visited the Twitter offices, that al3x mentioned in his comment above.

First off I would like to second the statement made by some of the Twitter guys above that they do not have an 'NIH' mindset. While some individuals in their team like to code new stuff, which can become a form of NIH, it is true that *as a team* they approach problems systematically.

Secondly, I know something about what the Twitter team do with messaging, and what they don't use messaging for. And I have just read a large number of blog posts and comments about Scala, messaging, Twitter (and Alex's book!). People are making massive assumptions about how Twitter do and don't use messaging. Most of these assumptions are completely wrong. Give these guys a break. They are trying to make improvements in a running system which is scrutinised every minute of every day, for a single sign of failure. This would drive anyone to drink, madness, or worse yet, functional languages ;-)

The choice of language is secondary to the design of any messaging system. RabbitMQ is written in erlang/OTP which like Scala can use a share-nothing model. But this is no guarantee that your messaging system will work well in every possible scenario that any customer could ever use. This is why writing good messaging systems is hard, and in the case of RabbitMQ it leads to our being careful to add major features slowly, because we don't want the product to be compromised by bad designs that are hard to remove later. We have to do things this way because we have a lot of (mostly happy) users.

Now, quite a few of those users have asked us to add a feature called 'page to disk'. We are adding this feature now. 'Page to disk' means that when messages are persistent, they do *not* get held in memory at the same time. Note that in the current version of RabbitMQ, if a message is persisted, then a copy is held in memory as well. One of the RabbitMQ team recently blogged about this here: http://www.lshift.net/blog/2009/04/02/cranial-surgery-giving-rabbit-more-memory (please note that the messages are few but large in order to test the overflow properties of the system)

The scenario in which page-to-disk is needed is as follows:

1. You have relatively slow consumers with durable subscriptions. E.g. they disappear for days at a time.

2. You have to keep all messages that they have not seen yet and cannot flush them on a timeout basis.

3. You have enough producers and data that this fills up the memory of your broker.

4. You cannot, or don't want to, run the broker on multiple machines, e.g. using RabbitMQ clusters.

Since it is not my place to speak about what Twitter actually does under the hood, I shall leave it to readers to figure out when, if at all, these criteria apply to the several ways that Twitter might or might not use messaging.

It's quite easy to write a messaging system that manages balanced transient flows where ingress and egress are similar. Writing messaging systems that work under any combination of flows, on any number of machines, and in multiple different reliability scenarios ... is a more interesting problem. Page-to-disk is a way to make RabbitMQ better and address more scenarios.

If you are reading this and have other ways to improve the broker, please send us information via the mailing list or privately to info at rabbitmq.com, and please be as detailed and concrete as possible.

If you are in SF and want to know more about how you can build a twitter type system using RabbitMQ, we shall be talking about it, a little bit, this Wednesday evening: http://www.bayfp.org/blog/2009/03/25/next-meeting-rabbitmq-wednesday-april-8th-730pm/ (with beers afterwards). We'll just be talking about *messaging* in various scenarios.

I'd like to finish by asking everyone to check out Harper Reed's most excellent new project: http://www.awesomeupdater.com/

Cheers,

alexis

PS: to one person who shall remain nameless, thanks for a completely fatuous tweet. I'll buy you a beer next week if you come to the talk on Wednesday :-)

cremes said...

Looks like the folks over at SecondLife did an evaluation of several message queues and posted their results. Check it out here:

http://tinyurl.com/c8x6z5

bob pasker said...

@alex> I research open source and commercial solutions before cutting a new path.

I recommended 1.5 million message/second Tervela box through numerous channels, and I don't have any evidence that it was ever considered.

http://www.tervela.com/tmx

Glenn Rempe said...

Maybe I am cynical, but in the interest of transparency please also note that Alex may also be pimping his own book about Scala, which *may* in part help us understand why he is being more vocal in re-opening the debate on 'Ruby v. Scala'.

I am not in any way saying his arguments and experiences are not well thought out (any programmer worth his salt knows there is no silver bullet language), and I won't take a position on Ruby vs. Scala since I have only used one of them.

He does though in fact appear to have a vested interest in this language based religious war (above and beyond his Twitter affiliation) which has proven over several years to provide excellent link-bait...

http://oreilly.com/catalog/9780596157746/

I'm just saying...

johny boyd said...

"Glenn" has to muddy the waters with his statement about the twitter folks not being all that honest since one of them is writing a scala book.

Give me a break dude, and go off with your tabloid comments elsewhere. Just when folks start to have a reasonable discussion ...

-jpb

Alex Payne said...

Bob: part of research, for us, was considering cost and integration time. See Nick's comments above for why Kestrel ended up being much easier to test than other complete-rewrite solutions. We did appreciate your suggestion.

Glenn: if you search around for what people make on technical books, you'll see that I don't have much to gain from book sales. I'm putting a lot more time into my half of the book than I'll ever make back. But it's well worth it for what I'm learning in the process.

Blaine said...

Tony: as the guy who wrote Starling, I'll be nice and simply say that you haven't done your research. Your Picard there is hitting his head over the stupidity of the internal combustion engine.

Starling was written over two years ago (it took a long time to release). It was written in about a day and a half. At the time, no memcache-based queue servers existed, RabbitMQ required knowledge of erlang, and rev and EventMachine didn't exist or were so immature that they were essentially useless. ActiveMQ didn't have Ruby client libraries, and Spread, well. Anyhow.

If you do the math, for the number of queue operations Twitter was doing *two years ago*, Starling is plenty fast enough, because each queue action took several orders of magnitude longer than the queuing / de-queueing itself.

Moreover, you might try turning off the fsync() that is present in the default Starling distribution, which makes it easily four times slower than normal operation. And then you might compare apples to apples, and ensure that your queues are doing the same thing.

And to top it all off, in the README for Starling I wrote:

"Starling is "slow" as far as messaging systems are concerned. In practice, it's fast enough."

It was a case of worse-is-better, and frankly, its existence has been an important part of discussions around message queue design from a usability standpoint. Prior to Starling, queuing was HARD for small organizations, and the agenda was set by JMS and "the enterprise" - oh, there's Picard again.

I'm not going to defend Starling as a great example of high performance and resilient message queue or networking software design, and I tend to recommend RabbitMQ or Kestrel to people looking for a simple queue.

In the future, as Alex suggests, please feel free to email developers of software before making many unsubstantiated and incorrect claims. My email address is and has always been in the Starling source code.

Tony said...

"EventMachine didn't exist or were so immature that they were essentially useless."

That is most certainly not the case. I've been using EventMachine since 2006, and while I am not a fan of the API (hence writing Rev), it was most certainly stable and mature back then.

The EventMachine-based Ruby StompServer was released in late 2006 and outperforms (non-Evented) Starling.

nothingmuch said...

I don't want to pay for Oracle BerkeleyDB either, that's why I use it without giving them money. It is opensource, you know =P

Ryan said...

nothingmuch said...
"I don't want to pay for Oracle BerkeleyDB either, that's why I use it without giving them money. It is opensource, you know =P"

It's under a copyleft license which prevents its redistribution along with your software unless you distribute your source or pay them a (very large, last time I heard) fee.

Tony said...

One thing I really feel I should say:

For the purposes of my job, we do not need a stateful message queue. We run an automation system which keeps all state in a database. Most of our system is effectively stateless, including our message queues. If our message queues crash, our system recovers from the state in the database.

Twitter, which is running a messaging system, apparently needs to persist state across their entire system. I don't know if the system could be architected in a more stateless manner, but I'll give them the benefit of the doubt and assume they have tried to make their system less stateful. This means all their queues need to be disk logged. The performance tests I anecdotally cited never intentionally used a disk-logged message queue, but in the case of Starling we were, as that appears to be the only option Starling provides.

I'm afraid this is an apples-to-oranges comparison and I may be unduly sullying Starling.

That said, we were testing a persistent configuration of Ruby StompServer, which despite being written in Ruby was able to outperform Starling (at the level of ~100 readers/writers).

In Twitter's case it sounds as if they were running a rather large "flock of Starlings" to handle the load. This works, but in our system we run one queue per server.

ted said...

Twitter is a successful company, therefore, whatever it does, must be 'correct'. If your company is successful then whatever it does is 'correct' even if it is completely different from Twitter.
No tongue in cheek intended. Going toward Scala must be a good thing, otherwise they wouldn't do it.
Results are everything.

nothingmuch said...

why would twitter want to redistribute Berkeley DB itself?

It's perfectly legal to distribute source code that links against it, and even that doesn't seem to be a real concern for an internal product.

The claim that BDB costs money does not really apply in this situation unless they want to sell a product that is prebuilt with it, the way I understand the licensing issues anyway.

kookster said...

Very interesting conversation. I work on ActiveMessaging, and have used most of the brokers mentioned above.

I'll throw out one more - there is the reliable messaging gem that is lovely for development - persists to disk, very lightweight, and trivial to install and use. I don't think I would recommend it for high-demand production, as I haven't tested it for such, but for smaller projects and development, I am a big fan.

I know from several other large applications that ActiveMQ is troubled at load, and am not surprised it fell over.

Stomp is not a perfect protocol, such as lacking a good way to rollback message receipt, but there are more than a few impls out there in python, ruby, and java via rabbit and activemq. I always found it disappointing that twitter didn't use the protocol if not the impls out there, as it seems a good fit (short text messages are stomp's bread and butter).

I do understand that twitter is a messaging app at core, so it makes sense to invest in building something, I only wish that they had adopted a messaging spec/protocol such as amqp or stomp for their impl.

Personally, I have high hopes for RabbitMQ/amqp, and that is the next broker/protocol I'll be integrating with ActiveMessaging.

-Andrew Kuklewicz

Vidar said...

Any messaging server where IO doesn't dominate to the point where language choice is irrelevant is badly written.

Writing one in Ruby is fairly simple. I wrote a Ruby Stomp server for inhouse use years ago, and it was trivial to optimize it to the point where the time spent in my Ruby code was less than 10% of the total time spent - the rest of the time was spent in the kernel handling IO syscalls.

Ed Borasky said...

"Any messaging server where IO doesn't dominate to the point where language choice is irrelevant is badly written." -- Vidar

"For every complex problem there is an answer that is clear, simple, and wrong." -- H. L. Mencken

:)

But seriously, building a scalable architecture for something as complex as a messaging server involves more than just "throwing cores at it until it's I/O bound". :) The optimum case, which you will never achieve except "on the average", is for the processor and I/O utilizations to be approximately equal and for the number of "users" to be at a point known in queuing theory as "N*"

This is called "asymptotic bounds analysis" (ABA) and can be found in lots of places, but a good start is

http://www.cs.washington.edu/homes/lazowska/qsp/Images/Chap_05.pdf

HTH :)

Rob Davies said...

Apache ActiveMQ is highly configurable - you can always make it scale or perform better - depending on your use case: see Scaling ActiveMQ

Bruce Snyder said...

FWIW, I'd like to know how ActiveMQ did not meet the needs of Twitter so that we can use that information to improve ActiveMQ. There has been no discussion of configurations or topologies or use cases so I'm not even sure where to begin.

FYI, ActiveMQ was born out of the Apache Geronimo project but actually began life at the Codehaus. The first releases of ActiveMQ began appearing back around 2003 or 2004. It was only later in 2005 that we moved it to the Apache Software Foundation.

There are some comments about RabbitMQ from Alexis regarding the page-to-disk feature so I can address that. ActiveMQ already has this feature so that messages (and even references) are not held in memory (see message cursors here: http://bit.ly/10u5WJ).

The additional comments from Alexis are exactly correct. Building a messaging system that will suit any use case, running on any number of systems, using any topology with all the necessary features is a difficult task and certainly takes time to perfect.

There was also a comment from Alex Payne that experts were contacted. Was anyone from the ActiveMQ community contacted?

Since the research has already been done, please let us know why ActiveMQ did not work for Twitter so that we can improve it.

asanjuan said...

It's a fact that RoR/Ruby is slower than Scala, why kill the messenger for the message? It's pretty humorous how all the RoR hypesters are now jumping off a sinking ship.

Michael Wilson said...

I'm a little late to the party, but this is a wonderful article; thanks! Your words played a part in my selection of RabbitMQ in a web-app I'm working on. I just published a post on my blog making reference to this great article: SeatSync Engineering: Choosing Platforms.

yanli said...

Now Akka 2.0 is released to scale Scala into the next leap!