Tuesday, June 29, 2010

Reia: Pluggable Parsers

One stand-out quality of the Ruby community is a fascination with obtaining and manipulating Ruby parse trees.  Such a fascination exists in many languages, but it's particularly weird in Ruby because until Ruby 1.9 there was no first-class way to obtain a Ruby parse tree.  People went spelunking with C code into Ruby's internals, ripping the parse tree right out and exposing it back to the Ruby environment.  Eventually Ruby parsers were implemented in Ruby itself in various projects.  Yet it remains that while Ruby as a language seems to attract parse tree tinkerers, the language itself does not provide first-class ways to satisfy their needs.

I firmly believe that being able to obtain a parse tree for the programming language you're using is important and should be a first-class language feature.  To that end, Reia supports a String#parse method:

>> "2+2".parse()
=> [(:binary_op,1,:'+',(:integer,1,2),(:integer,1,2))]

This parses the "2+2" string as Reia source code.  The result might remind you a little bit of Lisp: it's a Reia parse tree.  Right now there aren't immediate uses for Reia parse trees, but I'd soon like to add an interface for compiling/executing them.  Erlang supports a feature called "parse transforms" which allows on-the-fly transformations of Erlang syntax.  I'd also like to add such a feature to Reia.

If String#parse were just used to parse Reia source code it'd be a bit of a waste.  However, it can be used for more than just that.  For example, parsing JSON (as of tonight):

>> '{"foo": [1,2,3], "bar": [4,5,6]}'.parse(:json)
=> {"foo"=>[1,2,3],"bar"=>[4,5,6]}

After some recent problems dealing with JSON libraries in Ruby, I really felt JSON parsing should be part of the standard library.  With this syntax, it almost feels like JSON parsing is part of the core language.  Rubyists generally implement that sort of thing by monkeypatching the core types.  Reia lets anyone define their own String#parse method by defining special module names, with no modifications to the core types required (which Reia doesn't let you do anyway).

To better understand how this works, let's take a look at how Reia implements String#parse:

def parse(format)
  "#{format.to_s().capitalize()}Parser".to_module().parse(self)
end

Given a format of :foobar, String#parse will capitalize the argument into "Foobar", then look for a "FoobarParser" module to parse itself with.  This means anyone can add a parser to the language just by defining a module name that ends with "Parser" and has a parse method which accepts a string as an argument.
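For readers who don't speak Reia, here is a rough sketch of the same name-based dispatch in Python.  It is an illustration only: the PARSERS registry and the parser decorator are hypothetical stand-ins for Reia's module lookup (to_module), and the JsonParser here simply wraps Python's json module rather than anything Reia ships.

```python
import json

# Hypothetical registry standing in for Reia's module namespace.
PARSERS = {}


def parser(cls):
    """Register a parser class under its own class name."""
    PARSERS[cls.__name__] = cls
    return cls


@parser
class JsonParser:
    @staticmethod
    def parse(text):
        return json.loads(text)


def parse(text, fmt):
    """Mimic String#parse: "json" -> "Json" -> look up "JsonParser"."""
    name = f"{fmt.capitalize()}Parser"
    if name not in PARSERS:
        raise LookupError(f"no parser named {name}")
    return PARSERS[name].parse(text)


result = parse('{"foo": [1,2,3], "bar": [4,5,6]}', "json")
```

Adding a new format in this sketch means defining one more class whose name ends in "Parser"; the parse function itself never changes.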

In short, anyone can add a parser to the language which can be called with a simple, elegant syntax.  No monkeypatching required.

Monday, June 28, 2010

How to Properly Utilize Modern Computers


Holy abstract concept, Batman! Computers are complicated things, especially modern networked ones filled with multiple CPU cores, and anyone professing to know a singular way to utilize them is truly a madman, or a genius, or a little bit of both... but before we can talk about modern computers we must first talk about computers as they used to be.

While this may look prehistoric, it actually happened at Burning Man

Long ago, programming languages were crap and programming was hard. Ken Thompson and Dennis Ritchie reinvented the way we think about computers by designing not only a new operating system but a new programming language to write that operating system in. It was an ambitious effort that has forever shaped modern computing. Some people don't appreciate it and wax philosophical about hypothetically superior solutions. Those people are retards. Unix rules. Get over it.

No, this isn't quite as good as having sex


Unix had a brilliant underlying philosophy: do one thing and do it well; use multiple processes to solve problems; use text streams as your interface.  The simplicity of the Unix model had a beautiful elegance to it and made it very easy to leverage host resources in a scalable manner.  Instead of writing big clunky monolithic applications, write several small programs that use text streams to talk to each other.  Then if your host just happens to have multiple processors, the kernel can handle the task of farming out multiple jobs to multiple CPUs.

Shells and scripting languages were created to provide the interface and glue to the underlying system utilities. Users could easily queue up a series of tools to analyze and digest text streams as they saw fit.  The interesting thing about this approach is that users of these sorts of utilities were often performing pure functional programming.  Each utility acts as a function which accepts its input over a pipe and produces output which it sends over a pipe.
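To illustrate the "utilities as functions" idea, here is a small Python sketch in which each stage takes a stream of lines and returns a new one, so a pipeline becomes plain function composition.  The function names and the sample log lines are made up for the example; they only loosely imitate grep, awk, sort, and uniq.

```python
from collections import Counter


def grep(pattern, lines):
    """Keep only the lines containing pattern, loosely like grep."""
    return (line for line in lines if pattern in line)


def cut_field(index, lines):
    """Pick one whitespace-separated field per line, loosely like awk '{print $N}'."""
    return (line.split()[index] for line in lines)


def count_unique(lines):
    """Count occurrences of each line, loosely like sort | uniq -c."""
    return sorted(Counter(lines).items())


# Made-up sample input standing in for a text stream.
log = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /other 404",
]

# In spirit: grep GET | awk '{print $3}' | sort | uniq -c
status_counts = count_unique(cut_field(2, grep("GET", log)))
```

None of the stages touch shared state: each one only consumes its input and produces output, which is what makes them easy to recombine.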
A pearl, not to be confused with Perl.  Perl is not a gem.

Perl fit into this ecosystem beautifully.  Perl was focused on making short scripts which work on text streams, while providing easy conversion back and forth between text streams and numbers, since the text stream processing you want to do in Unix often involves some kind of math.  Perl is an extremely expressive language which allowed people to write far more powerful scripts than anything that had been seen in previous Unix scripting languages.  It was powerful, expressive, and whimsical.  Unfortunately, its whimsy would also be its demise.

Perl's approach didn't scale well to large applications.  The level of abstraction it provided was targeted at writing short scripts within the multiprocess Unix environment.  However, the tide was turning against the entire Unix philosophy.  Monolithic applications and application environments were soon to become the norm.
OHAI!!!


Java tried to abstract away the underlying operating system.  It was not easy to write Java programs that fit into the traditional Unix philosophy.  Java strongly prefers you talk directly to things in Java Land, and because of that they reinvented standard Unix tools like cron as Quartz.  Rather than using the traditional Unix shared-nothing process model to leverage multiple CPUs, Java wants you to use threads.  If you write your entire application this way, you can deploy an application by running a single instance of the Java Virtual Machine and giving it all your system memory.  With a single instance of the JVM you can theoretically utilize all of your available system resources.

Java still got a lot of things wrong.  Threads are one problem (I'll get into that later).  Another is handling application upgrades.  Some environments tried to support hot code swapping, but this usually ended up leaking memory.  In general, the recommended approach for upgrading a Java application is to stop and restart the JVM.  If you happen to be running a network server, such as, say, a web server, this means you have to wait for all clients to disconnect, or you have to shut down without completing their requests.  Depending on the nature of your network protocol, clients may remain connected indefinitely, so upgrades for those types of services typically mean mandatory outages.

EWWWWW!!!!

Unfortunately, both the Unix model and the multithreaded model have warts.

Unix doesn't exactly provide the greatest set of tools for managing multiple processes.  The interprocess signaling model used to manage processes left an awful lot to be desired.  The pipe mechanism used for interprocess communication is rather primitive.  Requiring everything be serialized to text streams incurs a lot of overhead, especially when you write several programs in the same language and can use more efficient data structures than text to communicate data.

In that regard, there are a lot of incentives towards moving to something like Java for concurrent programming.  However, threads have warts too.

The semantics are just plain confusing and the possibility of error is huge.  There is a set of best practices which mostly come down to one overriding concern: don't share state between threads.  As long as you never share state between threads there is never any concern over data corruption in concurrent programs.  However, many multithreaded programs share state all over the place, using a collection of highly error-prone synchronization mechanisms to try to keep everything kosher.  If you happen to forget to synchronize access to any given piece of shared state, you're screwed: you've just encountered a threading bug.  Sharing state between threads requires extreme vigilance on the part of the programmer, and also intimate knowledge about how threads work and their possible caveats.
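Here is a minimal Python sketch of what that vigilance looks like in practice.  The Counter class and the run helper are invented for illustration; the point is that every single access to the shared value has to go through the lock, and nothing in the language forces you to remember that.

```python
import threading


class Counter:
    """A piece of state shared between threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, this read-modify-write could interleave with
        # another thread's and silently lose updates.
        with self._lock:
            self.value += 1


def run(counter, n_threads=4, n_increments=10_000):
    """Hammer the counter from several threads and return the final value."""

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


total = run(Counter())
```

With the lock in place the total is always n_threads times n_increments.  Delete the "with" line and the program still runs, which is exactly the problem: a forgotten lock is not an error, just a bug waiting for the right interleaving.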

Beyond all this, threads are managed by the kernel, and talking to the kernel has high overhead.  A truly amazing feat would be to soup up the Unix model and build your system using lots of shared-nothing processes that communicate using messages and mailboxes rather than primitive text streams.  This is exactly the approach that was taken by Erlang.

I AM ERLANG!!!

Erlang took the whole Unix philosophy to the next level.  Erlang processes work like Unix processes, except they use mailboxes and messages instead of pipes.  Unlike threads, Erlang processes run in userspace, which makes them relatively fast.  You can create new Erlang processes a lot faster than you can create threads.  The Erlang VM can run one kernel thread per CPU on your system and load balance processes.  Code can be hot-swapped at runtime in a well-defined manner with extremely consistent semantics.  The entire language philosophy emphasizes the creation of distributed, fault-tolerant, self-healing programs which are able to not only leverage an entire computer, but leverage an entire network of connected computers, using a philosophy which is similar to but an improvement on the Unix approach.
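The mailbox-and-message style can be imitated, very roughly, in Python.  This is only a sketch of the programming model: it uses kernel threads and queue.Queue, so it has none of the cheapness of real Erlang processes, and the spawn and doubler names are made up for the example.

```python
import queue
import threading


def spawn(fn, *args):
    """Start fn in a thread with its own mailbox and return that mailbox."""
    mailbox = queue.Queue()
    threading.Thread(target=fn, args=(mailbox, *args), daemon=True).start()
    return mailbox


def doubler(mailbox):
    """Receive (sender, number) messages and reply with number * 2."""
    while True:
        message = mailbox.get()
        if message == "stop":
            break
        sender, number = message
        sender.put(number * 2)


me = queue.Queue()       # the main "process" gets a mailbox too
worker = spawn(doubler)
worker.put((me, 21))     # roughly Worker ! {self(), 21} in Erlang
reply = me.get(timeout=5)
worker.put("stop")
```

The two sides share nothing except the messages they exchange, so there is no lock anywhere in the user code.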

In Erlang, all state is immutable.  This completely eliminates the problems of sharing state between threads.  Due to the way the language is designed it is simply not possible.  This opens up possibilities for Erlang language implementers to safely share state across threads, since the data can't be mutated.  Unfortunately attempts at using this approach in the present Erlang virtual machine have not yet led to significant performance benefits.

Erlang has its own warts.  For everything it gets right semantically, it is still an aesthetically ugly language.  Very few would describe Erlang code as beautiful.  Despite claims that the semantics, and not the syntax, are the barrier to learning Erlang, the main excuse I've heard from people who have avoided Erlang is that they don't like the syntax.
Clojure's logo is so awesome!

Clojure offers a different approach to leveraging modern multiprocessor computers.  It provides shared state that threads can work on transactionally, an approach called Software Transactional Memory (STM), which works kind of like a database.  When you aren't inside a transaction, all state is immutable, which means all state within the language is inherently "thread safe".
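To give a feel for the transactional idea, here is a deliberately tiny Python sketch of optimistic updates on a single shared cell.  It is not how Clojure's STM is implemented (which coordinates transactions across many refs); the Ref class and transact function are hypothetical names, and the only thing borrowed is the pattern of "read a snapshot, compute a new value, commit only if nobody else got there first, otherwise retry".

```python
import threading


class Ref:
    """A shared cell holding a value plus a version counter."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._lock = threading.Lock()

    def read(self):
        """Return a consistent (value, version) snapshot."""
        with self._lock:
            return self._value, self._version

    def try_commit(self, expected_version, new_value):
        """Install new_value only if no one has committed since the snapshot."""
        with self._lock:
            if self._version != expected_version:
                return False
            self._value = new_value
            self._version += 1
            return True


def transact(ref, update):
    """Apply the pure function update to ref, retrying on conflict."""
    while True:
        value, version = ref.read()
        if ref.try_commit(version, update(value)):
            return


balance = Ref(0)


def deposit_many():
    for _ in range(1000):
        transact(balance, lambda value: value + 1)


threads = [threading.Thread(target=deposit_many) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

final_balance, _ = balance.read()
```

The update function never mutates anything; it just maps an old value to a new one, and the retry loop takes care of conflicts.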

Because it's built for the JVM, Clojure is able to take advantage of all the previous effort put into an efficient native threads implementation for the Java programming language.  While this is great for utilizing multicore systems, it's still centered around the notion of shared state.  Distributing your program to multiple computers requires a conceptually different approach than you would ordinarily use to distribute a problem to multiple CPUs.

Beyond that, Clojure uses Lisp syntax.  While some people enjoy writing raw syntax trees because its "homoiconic" nature (not to be confused with Madonna or house music) means they can work all sorts of wonderful wizardry with macros, history has shown that in general most people are not really big fans of that sort of syntax.  Lisp has a lot of parens for a reason: because in most other languages those parens are implicit.


So what's the answer?  How do we "properly" utilize modern computers?  I don't have the answer, only an opinion.

The Unix model was great.  It just lacked a few features to really carry it over to distributed systems.  That said, I really like the idea of running a single VM like the JVM per host, and letting it consume all available system resources running a single application.

Erlang lets you do this, except it provides a Unix-like process model with many of the warts excised.  Erlang has excellent process management, and lets you interact with processes on remote nodes the same way you'd interact with the local system.  Erlang replaced the lousy pipe-based model of interprocess communication with messages, mailboxes, and even filters that allow you to selectively receive from your mailbox.
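Selective receive is the part that pipes have no answer for, so here is a single-threaded Python sketch of the idea.  The Mailbox class is invented for illustration, and unlike a real Erlang receive it returns None instead of blocking when nothing matches.

```python
from collections import deque


class Mailbox:
    """A queue of messages that can be received out of order by filter."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self, matches=lambda message: True):
        """Return the oldest message accepted by matches, leaving the rest queued."""
        for index, message in enumerate(self._messages):
            if matches(message):
                del self._messages[index]
                return message
        return None  # a real Erlang receive would block here instead


box = Mailbox()
box.send(("log", "starting up"))
box.send(("error", "disk full"))
box.send(("log", "still running"))

# Pull the error out first, even though a log message arrived before it.
first_error = box.receive(lambda message: message[0] == "error")
next_message = box.receive()
```

The skipped messages stay in the mailbox in their original order, ready for a later receive.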

Erlang provides lightweight, shared-nothing userspace processes and a great way for them to communicate, as well as a scheduler that can dynamically load balance them between native threads and thus host CPUs.  Among many programming experts I've talked to there's a general consensus that having some sort of userspace concurrency context, be it a coroutine or a userspace thread, is a very handy construct to have.  Erlang, perhaps more than any other language out there, has wrapped up userspace concurrency contexts into a very neat little package.

I still feel Erlang's main drawback is its syntax, and I have a few ideas about that.  I think my language Reia brings with it the expressivity of a scripting language like Perl or Ruby combined with the awesome semantics of Erlang which allow it to easily utilize networks of multicore computers.  Reia can support the monolithic one-process-per-application approach so associated with Java while allowing developers to write multiprocess applications internally.  Reia is scripting evolved.

Sunday, June 20, 2010

Dear Twitter: fix your fucking shit, seriously

UPDATE (3/2012): Hi there. This post still seems to get a lot of traffic, but I'd like for you to know I've changed my opinion. At the time I wrote this, it was immediately after my first RailsConf where I was depending on using Twitter in order to be able to get in contact with people, and at the time, their service was somewhat lousy.

Since then, Twitter has done an amazing job of shoring up their infrastructure and making it robust. That said, this post no longer reflects my opinion of Twitter. I continue to use Twitter every day and it's still my personal favorite social network. Please take the post below with a grain of salt and recognize that it's an artifact of its time. I'm leaving the original text for your consideration, but please recognize the context.

I use Twitter every day. Every single fucking day. So when Twitter goes down, it affects me. And lately, Twitter has been down every single fucking day.  It's not like they're unaware of it.  Twitter Unavailable.  High Error Rate on Twitter.com.  Temporarily Missing Tweets.  High Error Rate on Twitter.com.  Site Availability Issues Due to Failed Enhancement of Our Timeline Cache.  Working on Incorrect Tweet Counts.  Bursts of Elevated Errors.  Bursts of Errors.  Site-Wide Availability Issues.  High Error Rate on Twitter.com.  Site Availability Issues.  More Site Availability Issues. And Even More Site Availability Issues!  And all of those within the past two weeks.

Twitter, it isn't hard to conclude your site is fucking broken.

I use Twitter because of the community of people. From a technology perspective, Twitter is markedly inferior to Facebook and Google Buzz, which not only manage to stay up a lot more than Twitter, but also support basic features like threaded conversations.  I use your site because of the community, and exclusively because of the community.  I know the community of Ruby programmers likes Twitter, and I'm not going to get them to move.  So I'm stuck with Twitter.

From a technological perspective, Twitter is lagging behind... way behind.  Facebook has uptime, an order of magnitude more traffic, and threaded conversations!  Google Buzz has uptime, and threaded conversations too!  Twitter does not have threaded conversations, and is broken all the time.  I understand Twitter hired the Twitoaster guy to add threaded conversations.  Before you add that, can you please make sure your site isn't broken all the time?

Seriously, I want to like Twitter.  I use Twitter all the time.  I am a fucking Twitter whore.  But seriously Twitter, you are the only site whose 503 Service Temporarily Unavailable page is known by name.  Stephen Colbert is even namedropping it.  While I'm a systems architect, I don't want to give you architecture advice.  You're a high traffic site and I can't intimately know your pain points like you do.  You know your pain points.  So fucking fix them.  Facebook works consistently with an order of magnitude more traffic.  Google Buzz works consistently.  So why the fucking fuck doesn't Twitter work consistently?

Twitter, you fucking fail. Fix your fucking shit. Seriously.

Saturday, June 19, 2010

Reia: Everything Is An Object



I recently added support for immutable objects to Reia.  Immutable objects work in a similar manner to objects in languages like Ruby, except once created they cannot be changed.  You can set instance variables inside of the "initialize" method (the constructor), but once they've been set, they cannot be altered.  If you want to make any changes, you'll have to create a new object.
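If you want to play with the idea outside of Reia, Python's frozen dataclasses behave in a loosely similar way.  This is an analogy rather than a description of how Reia implements anything, and the Point class is just a made-up example.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Point:
    x: int
    y: int


origin = Point(0, 0)

# Fields are bound by the constructor; later assignment is rejected.
try:
    origin.x = 5
    mutated = True
except FrozenInstanceError:
    mutated = False

# "Changing" a value means building a new object from the old one.
moved = replace(origin, x=5)
```

After this runs, origin is still Point(0, 0) and moved is a separate Point(5, 0).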

Now I've gone one step further: all of Reia's core types are now immutable objects.  This means they are defined in Reia using the same syntax as user-defined immutable objects.  And since Reia looks a lot like Ruby, that means their implementation should be easy to understand for anyone who is familiar with Ruby.  Neat, huh?

When I originally started working on Reia, I drank Erlang-creator Joe Armstrong's kool-aid about object oriented programming.  I wanted to map OOP directly on to the Erlang asynchronous messaging model, and proceeded along that path.  When you sent a message to an object, I wanted that message to be literal, not some hand-wavey concept which was implemented as little more than function calls.

However, this meant concurrency came into play whenever you wanted to encapsulate some particular piece of state into an object.  And if the state of that object never changed, not only was this needlessly complex, it was a total waste!  Furthermore, the core types behaved as if they were "objects" when really they weren't... they pretended to work in a similar manner, but they were special blessed immutable objects.  People asked me if they could implement their own immutable objects, and sadly my answer was no.

Encapsulating immutable states has always been a pain point for Erlang.  The canonical approach, Erlang records, is a goofy and oft-reviled preprocessor construct with an unwieldy syntax.  Later Erlang added an "unsupported" feature called parameterized modules, which feel like half-assed immutable objects.  There are very few fans of either of these features.

The typical OOP thinking is that objects provide a great tool for encapsulating state.  So why do Erlang programmers have to use things like records or parameterized modules instead of objects?  Let's look at Joe Armstrong's reasoning:
Consider "time". In an OO language "time" has to be an object. But in a non OO language a "time" is a instance of a data type. For example, in Erlang there are lots of different varieties of time, these can be clearly and unambiguously specified using type declarations
Okay, great!  So if we try to get the current time in Erlang, what does it give us?

Eshell V5.7.3  (abort with ^G)
1> erlang:now().
{1276,989504,651041}

Aiee!  What the crap is that?  In order to even begin to figure it out, we have to consult the "type declaration":

now() -> {MegaSecs, Secs, MicroSecs}

Okay, this is beginning to make more sense, after consulting the documentation.  What we've received is a tuple, which splits the current time up into megaseconds, seconds, and microseconds since January 1st, 1970.  Microseconds are split off so we don't lose precision by trying to store the value as a float, which makes sense.  However, megaseconds and seconds were split up because at the time the erlang:now() function was written, Erlang did not have support for bignums.  In other words, the type declaration is tainted with legacy.
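The arithmetic for putting the tuple back together is simple enough: total seconds are MegaSecs * 1,000,000 + Secs, with MicroSecs carried alongside.  Here is a small Python sketch of that conversion; now_to_datetime is a name made up for this example, and it uses integer arithmetic throughout so no precision is lost to floats.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def now_to_datetime(now):
    """Convert an Erlang-style {MegaSecs, Secs, MicroSecs} tuple to a UTC datetime."""
    mega_secs, secs, micro_secs = now
    total_seconds = mega_secs * 1_000_000 + secs
    return EPOCH + timedelta(seconds=total_seconds, microseconds=micro_secs)


# The tuple from the shell session above.
stamp = now_to_datetime((1276, 989504, 651041))
```

For that tuple the total comes to 1,276,989,504 seconds since the epoch, which lands on June 19th, 2010 in UTC.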

So what if we have erlang:now() output, how do we, say, convert that into a human meaningful representation of the time, instead of the number of seconds since January 1st, 1970?  Can you guess?  Probably not...

1> calendar:now_to_local_time(erlang:now()).
{{2010,6,19},{17,34,40}}

Of course, the calendar module!  I'm sure that's exactly where you were thinking of looking, right?  No, not really.  The response is decently comprehensible, if you know what time it is.  However, this doesn't seem like a particularly good solution to me.  I guess Joe Armstrong likes his functions and type declarations.  I don't.  So how does Reia do it?

>> Time()
=> #<Time 2010/6/19 17:36:36>

Look at that.  Time is an object!  An immutable object in this case.  Thanks to the fact that there are functions bound to the time object's identity, it automatically knows how to display itself in a human-meaningful manner.  Because the identity of this particular piece of state is coupled to functions which automatically know how to act on it, we don't have to do a lot of digging to figure out how to make it human meaningful.  It just happens automatically.
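The same coupling of state and display logic can be sketched in Python.  This is an imitation for illustration, not Reia's actual Time implementation; the field names are assumptions, and the __repr__ method simply mirrors the output format shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class Time:
    year: int
    month: int
    day: int
    hour: int
    minute: int
    second: int

    def __repr__(self):
        # The value carries its own knowledge of how to present itself.
        return (
            f"#<Time {self.year}/{self.month}/{self.day} "
            f"{self.hour:02}:{self.minute:02}:{self.second:02}>"
        )


shown = repr(Time(2010, 6, 19, 17, 36, 36))
```

Nobody has to go hunting for a calendar module: asking the value to display itself is enough.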

In the end, I'm a fan of objects and Joe Armstrong is a fan of traditional functional programming principles.  They're both solutions to the same problem, but my argument is Erlang doesn't have a good solution to the problem of user-defined types and coupling the identity of states to corresponding functions designed to act on them.  In the case of the latter, Joe Armstrong thinks it's a bad thing whereas I consider it a basic, essential feature of a programming language.  As for the former, Erlang has given us records and parameterized modules, neither of which are a good solution.

I recently learned (courtesy the JRuby developers) that when Matz created Ruby, he took the core functions of Perl and mapped them all onto objects.  Every single core function in Perl has a corresponding Ruby method which belongs to a particular class.  I am now doing the same thing with Erlang, mapping all of the core functions Erlang provides onto Reia objects.

If Ruby is object-oriented Perl, then Reia is object-oriented Erlang.

Saturday, June 5, 2010

Reia: Immutable objects at last!


When I started creating Reia, I was originally skeptical of the "everything is an object" concept seen in languages like Smalltalk and Ruby.  This was partially inspired by Erlang creator Joe Armstrong's hatred of object oriented programming.  However, the more I used Erlang and saw its solutions for state encapsulation, like records and parameterized modules, as inadequate, the more I looked to objects to provide a solution.

Moments ago I committed the remaining code needed to support instance variables within Reia's immutable objects.  You can view an example of Reia's immutable objects in action.  If you're familiar with Ruby, you'll hopefully have little trouble reading this code:

>> class Foo; def initialize(value); @value = value; end; def value; @value; end; end
=> Foo
>> obj = Foo(42)
=> #<Foo @value=42>
>> obj.value()
=> 42

One caveat: Reia uses Python-style class instantiation syntax, i.e. Foo(42).  Rubyists should read this as Foo.new(42).

So what are immutable objects, exactly?  They work much like the traditional objects you're used to using in Ruby.  However, unlike Ruby, Reia is an immutable state language.  This means once you create an object, you cannot modify it.  The constructor method, which borrows the name "initialize" from Ruby, has the special ability to bind instance variables within a particular object; however, once that initialize method completes and the object is created, no changes can be made.  If you want to modify the instance variables, you'll have to create a new object.

Reia will eventually support objects whose state is allowed to change.  These will take the form of concurrent objects, which is the original design goal I had in mind with Reia.  Mutable objects take the form of Erlang processes, and more specifically Erlang/OTP gen_servers, which do not share any state with other concurrent objects and communicate only with Erlang messages.  Going forward, my goal is to make all of the Reia built-in types into immutable objects, allowing user-defined immutable objects, and also allowing concurrent objects whose state can change (albeit in a purely functional manner).
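To show what "state that changes in a purely functional manner" might look like, here is a rough Python imitation of the gen_server pattern.  It is a sketch under loud assumptions: Reia's concurrent objects don't exist yet, the handle_call, serve, and call names only echo OTP's vocabulary, and threads with queues stand in for Erlang processes and mailboxes.

```python
import queue
import threading


def handle_call(request, state):
    """Pure function: return (reply, new_state) without mutating anything."""
    if request == "increment":
        return state + 1, state + 1
    if request == "get":
        return state, state
    return "unknown request", state


def serve(mailbox, state):
    """Server loop: the only 'mutation' is rebinding state to the next value."""
    while True:
        request, reply_to = mailbox.get()
        if request == "stop":
            reply_to.put("stopped")
            return
        reply, state = handle_call(request, state)
        reply_to.put(reply)


def call(mailbox, request):
    """Send a request and wait for the reply on a private mailbox."""
    reply_to = queue.Queue()
    mailbox.put((request, reply_to))
    return reply_to.get(timeout=5)


counter = queue.Queue()
threading.Thread(target=serve, args=(counter, 0), daemon=True).start()

call(counter, "increment")
call(counter, "increment")
current = call(counter, "get")
call(counter, "stop")
```

From the outside the counter looks mutable, but every state transition is an ordinary function from old state to new state, and no other thread can reach that state except by sending a message.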

If you've been following me so far, I hope you can sense how concurrency affects Reia's state management model:
  • Sequential objects have immutable instance variables
  • Concurrent objects will have mutable instance variables
This is similar to the state management compromise Rich Hickey chose in the Clojure language.  In Clojure, by default all state is immutable.  However, Clojure employs Software Transactional Memory for concurrency, and inside of Clojure's STM transactions (i.e. where concurrency enters the picture) state becomes mutable.

There's still a lot to be implemented in Reia's object model.  I intend to support polymorphism through simple class-based inheritance, and the code needed to support that is partially in place.  I'd like to support Ruby-style mix-ins.  Once these features are all in place, I intend to completely rewrite the core types, reimplementing them all as immutable objects.

All that said, if you're interested in Reia and would like to start hacking on it, Reia could really use a standard library of objects, and the requisite code is now in place to facilitate that.  I would encourage anyone with an interest in Reia to clone it on github and start implementing the standard library features you'd like to see.  The standard library needs all sorts of things, particularly wrappers for things like files, sockets, ETS tables, and other things which are already provided by the Erlang standard library.

Don't worry too much about making mistakes.  Just send me a pull request and I'll incorporate your code, review it, and make changes where I see issues.  I'd like to prevent the standard library from snowballing into the monster the Ruby standard library is presently, so if you have a feature you'd like to see incorporated, ping me on it (through github is fine) and I'll let you know if I think it should be incorporated.

I'm actively trying to recruit the open source community to build Reia's standard library, so if you're interested, start hacking!