Tuesday, April 5, 2011

Reia: Now with a PEG parser

I haven't blogged about Reia in so long. Mea culpa! It's about this time that people start to think "I haven't heard about Reia in awhile. It must be dead." No, Reia is very much alive my friends, and making some pretty interesting progress... I hope to be able to announce some important new features around Reia's 3rd birthday. Stay tuned for that, but first a preview of things to come...


An experimental branch of Reia is now available which uses Neotoma to generate its parser thanks to some pretty impressive contributions by Graeme Defty. Neotoma is a Parsing Expression Grammar (PEG) based parser generator for Erlang, inspired by Packrat. Neotoma's author Sean Cribbs has also released Neotoma 1.5 with contributions by Graeme Defty which uses Erlang binaries internally and is substantially faster than the previous version.

Why does a PEG matter? PEGs make short work of problems that are extremely complex to handle when you attempt to subdivide the problem of parsing into separate scanning and parsing steps, such as how a traditional lex/yacc (or in Reia's case, leex/yecc) parser would operate. If you've ever attempted to look at the source code of the Ruby parser, you'll find some strange and confusing feedback between the lexer and the parser.

What's this for? Well the foremost reason is: interpolated strings. Ruby allows you to embed arbitrary Ruby code inside of any string using the #{...} syntax. This is a really great feature and one I managed to half-assed implement in Reia because I strongly believe in its awesomeness. However, Reia's implementation is a bit brittle and has a lot of implementations. Just recently I switched to the latest greatest version of leex which ships with the Erlang runtime and had to make some rather arcane changes to some code that uses leex internals, just to continue to support a partially functioning string interpolation mechanism.

Now, toss in some more fun complexities: the awesome /.../ regex literal syntax. We all love it, but why doesn't every language have it? What you may not realize is that this syntax is ambiguous in certain cases with the / and /= operators like it is in other languages like JavaScript. Now, for added fun, toss in the fact that regular expressions can interpolate just like strings, and you're beginning to see the makings of a gramatical nightmare.

All of these things are easy with a PEG. PEGs blend tokenization with parsing and allow comprehension of a much wider range of languages than is possible without pulling your hair out using lex/yacc-like tools, and much more than that, it's a very natural process with PEGs. PEGs are also by nature right-recursive, something that works quite well in immutable state languages like Erlang that have to build their syntax trees from right-to-left due to the nature of immutable singly linked lists. These lists only let you append elements on the left.

This sounds well and good, but unfortunately I have some bad news. Even after a few rounds of performance tuning, using an experimental branch of Neotoma that uses binaries internally instead of lists, the PEG parser is still about an order of magnitude slower than the leex/yecc version. Talking to Neotoma's author Sean Cribbs, it sounds like Neotoma might be doing an excessive amount of unnecessary copying internally. If PEGs pique your interest and you know a bit of Erlang you might take a peek at Neotoma and see if you can find some potential performance optimizations.

I would still like to merge the peg branch of Reia, and the PEG grammar fixes a lot of quirks in the present yecc grammar, but I'm still holding out until the performance is improved to closer to the leex/yecc speeds.