Tuesday, March 3, 2009

Indentation sensitivity: a post-mortem?

I've been designing my language Reia with an indentation-sensitive syntax like Python... until now.

Things went fairly smoothly at first. I read a short description of how Python's lexer works and in no time banged out an equivalent one with leex. Inspired by this, I proceeded constructing my grammar, largely defaulting to Ruby or Erlang in cases I was uncertain of and hardly ever looking at the Python grammar. I wanted Reia to have a purely expression-based syntax where everything returns a value, as opposed to a Pythonic grammar with both expressions and statements which return no value. This would prove to be my undoing.

At the time I didn't understand the beautiful set of compromises Guido van Rossum managed to make when designing the Python grammar. I became curious as to why my language had multi-line lambdas and Python did not. The answer to that question would lay in a problem with my grammar I would soon discover and could not come up with a good solution to.

I first discovered the problem with Reia's multi-line blocks. I had used them in the past in small 1-2 liner functions without issue, and all had seemed well. Then some time later I tried to add more to a function after a multi-line block. I ran into an unexpected syntax error... I couldn't put any additional expressions after a multi-line block.

Blocks are more or less lambdas, so it would seem as if I had hit the infamous multi-line lambda problem. I compared the output of my lexer to Python's and found the two to be identical. The answer would lie in the grammar. Looking at the tokens there was no statement separator between multi-line statements in Python.

From Python's grammar.txt:

statement ::= stmt_list NEWLINE | compound_stmt

What does this line mean? All ordinary statements (i.e. stmt_lists), such as expressions, and anything that fits on a single line must be terminated by a NEWLINE. The "compound statements", ones which span multiple lines, were self-contained. There was no statement separator between them and the next statement. In the Zen of Python they just are, with no explicitly delineated boundaries.

This underlies the magic of Python's "indent consistently and your code will be valid" syntax. I could not duplicate this using a Python-style lexer. The best I could come up with was modifying the lexer to accept a few different cases of empty lines as being syntactically significant, which meant indentation sensitivity would be more obtrusive.

Reading Guido van Rossum's rationale against multi-line lambdas in Python, Language Design Is Not Just Solving Puzzles, he lays out his case: any solution that embeds an indentation-based block in the middle of an expression is unacceptable, and an alternate way of grouping syntax besides indentation is unacceptable. I certainly agree with the latter... it's silly to have some expression/statement groupings indentation based and some grouped by enclosing tokens. The former was a bit more subtle... I didn't realize what separating out statements from expressions afforded in the grammar (specifically in regard to the self-contained "compound statements").

I started noticing other mismatches between indentation sensitivity and an expression-based grammar which aren't present in Python. Implicit line joning in Python removes tokens which are useless in the given context (which much be inside an expression) since expressions can't contain statements. In Reia's previous form implicit line joining would rule out a whole host of otherwise gramatically valid constructions embedding expressions with indent blocks inside other expressions. But without implicit line joining, the programmer is very limited in terms of splitting up function arguments, lists, dicts, and tuples across multiple lines.

I also started to notice all sorts of strange syntactic oddities which emerge in a grammar which embeds indentation blocks in statements. This is exactly the kind of ugliness I believe Guido sought to avoid by disallowing indent blocks within expressions.

Unhappy with that direction, I looked to alternatives which preserve indentation sensitivity. The Haskell approach to indentation seemed interesting, and I found a expression-based Python-like language with Haskell-like indentation: Logix. It turned out to be by Tom Locke, who would go on to author Hobo, the web application builder for Rails. (Edit: Tom Locke mailed me back saying he doesn't like indentation-based syntaxes any more and now prefers Ruby)

This seemed like a potential direction to go in, but at this point, I was sick of indentation sensitivity. My conclusion is it works very well in Python, and if you're content to separate your grammar into statements and expressions and only have indent blocks in statements, it's a great way to go with a language.

I simply wasn't content to abandon an expression-based grammar. I needed multi-line expressions, specifically for Ruby-style blocks, if nothing else. In a totally unscientific poll of Rubyists I put together, the #1 feature with a 75% approval rating was blocks. I couldn't have multi-line expressions in an indentation sensitive language where I desperately deisred them for blocks. So something had to give...

Last night I pushed a new "indentless" branch for Reia. Without indentation sensitivity the Python influence is starting to wane and it's beginning to look a lot like Ruby. I spent tonight cleaning it up and making sure it can build itself from scratch, as well as converting all the code examples and remaining parts of the standard library.

This is quite likely the direction Reia is going to go. So far the response to abandoning indentation sensitivity has been overwhelmingly positive. It will certainly be a boon for people making template languages as now it will be quite easy to put together something like ERb or a template language which embeds Reia syntax with little worry.

Short of someone making an impassioned plea for preserving indentation sensitivity, it will soon be gone from Reia.

182 comments: