Tuesday, March 3, 2009

Indentation sensitivity: a post-mortem?

I've been designing my language Reia with an indentation-sensitive syntax like Python... until now.

Things went fairly smoothly at first. I read a short description of how Python's lexer works and in no time banged out an equivalent one with leex. Inspired by this, I proceeded constructing my grammar, largely defaulting to Ruby or Erlang in cases I was uncertain of and hardly ever looking at the Python grammar. I wanted Reia to have a purely expression-based syntax where everything returns a value, as opposed to a Pythonic grammar with both expressions and statements which return no value. This would prove to be my undoing.

At the time I didn't understand the beautiful set of compromises Guido van Rossum managed to make when designing the Python grammar. I became curious as to why my language had multi-line lambdas and Python did not. The answer to that question would lay in a problem with my grammar I would soon discover and could not come up with a good solution to.

I first discovered the problem with Reia's multi-line blocks. I had used them in the past in small 1-2 liner functions without issue, and all had seemed well. Then some time later I tried to add more to a function after a multi-line block. I ran into an unexpected syntax error... I couldn't put any additional expressions after a multi-line block.

Blocks are more or less lambdas, so it would seem as if I had hit the infamous multi-line lambda problem. I compared the output of my lexer to Python's and found the two to be identical. The answer would lie in the grammar. Looking at the tokens there was no statement separator between multi-line statements in Python.

From Python's grammar.txt:

statement ::= stmt_list NEWLINE | compound_stmt

What does this line mean? All ordinary statements (i.e. stmt_lists), such as expressions, and anything that fits on a single line must be terminated by a NEWLINE. The "compound statements", ones which span multiple lines, were self-contained. There was no statement separator between them and the next statement. In the Zen of Python they just are, with no explicitly delineated boundaries.

This underlies the magic of Python's "indent consistently and your code will be valid" syntax. I could not duplicate this using a Python-style lexer. The best I could come up with was modifying the lexer to accept a few different cases of empty lines as being syntactically significant, which meant indentation sensitivity would be more obtrusive.

Reading Guido van Rossum's rationale against multi-line lambdas in Python, Language Design Is Not Just Solving Puzzles, he lays out his case: any solution that embeds an indentation-based block in the middle of an expression is unacceptable, and an alternate way of grouping syntax besides indentation is unacceptable. I certainly agree with the latter... it's silly to have some expression/statement groupings indentation based and some grouped by enclosing tokens. The former was a bit more subtle... I didn't realize what separating out statements from expressions afforded in the grammar (specifically in regard to the self-contained "compound statements").

I started noticing other mismatches between indentation sensitivity and an expression-based grammar which aren't present in Python. Implicit line joning in Python removes tokens which are useless in the given context (which much be inside an expression) since expressions can't contain statements. In Reia's previous form implicit line joining would rule out a whole host of otherwise gramatically valid constructions embedding expressions with indent blocks inside other expressions. But without implicit line joining, the programmer is very limited in terms of splitting up function arguments, lists, dicts, and tuples across multiple lines.

I also started to notice all sorts of strange syntactic oddities which emerge in a grammar which embeds indentation blocks in statements. This is exactly the kind of ugliness I believe Guido sought to avoid by disallowing indent blocks within expressions.

Unhappy with that direction, I looked to alternatives which preserve indentation sensitivity. The Haskell approach to indentation seemed interesting, and I found a expression-based Python-like language with Haskell-like indentation: Logix. It turned out to be by Tom Locke, who would go on to author Hobo, the web application builder for Rails. (Edit: Tom Locke mailed me back saying he doesn't like indentation-based syntaxes any more and now prefers Ruby)

This seemed like a potential direction to go in, but at this point, I was sick of indentation sensitivity. My conclusion is it works very well in Python, and if you're content to separate your grammar into statements and expressions and only have indent blocks in statements, it's a great way to go with a language.

I simply wasn't content to abandon an expression-based grammar. I needed multi-line expressions, specifically for Ruby-style blocks, if nothing else. In a totally unscientific poll of Rubyists I put together, the #1 feature with a 75% approval rating was blocks. I couldn't have multi-line expressions in an indentation sensitive language where I desperately deisred them for blocks. So something had to give...

Last night I pushed a new "indentless" branch for Reia. Without indentation sensitivity the Python influence is starting to wane and it's beginning to look a lot like Ruby. I spent tonight cleaning it up and making sure it can build itself from scratch, as well as converting all the code examples and remaining parts of the standard library.

This is quite likely the direction Reia is going to go. So far the response to abandoning indentation sensitivity has been overwhelmingly positive. It will certainly be a boon for people making template languages as now it will be quite easy to put together something like ERb or a template language which embeds Reia syntax with little worry.

Short of someone making an impassioned plea for preserving indentation sensitivity, it will soon be gone from Reia.

5 comments:

Brian Ford said...

The force is strong in this one; he will join the dark side. :)

Thanks for the explanation, Tony. I never bothered to understand it at this level.

Reid Kleckner said...

So I was considering a similar problem, since I'm also trying to design a programming language as a small side project. It's much less serious, and more of a playground for ideas.

For our language, we went the statements/expressions route, but I think you can hack blocks back into a statement grammar. We came up with a syntax something like this:

my_dict.items().each(&) func(k, v):
...
...

The idea is similar to the Ruby innovation of putting the block code outside of the parenthesis as one would have in Lisp. Here, you go further, and just use s simple syntactic placeholder (I picked & since we weren't using it for bitwise and), and just put it at the end of the line. You could nest blocks, but you could not have more than one per statement. It seems like you can't do that in Ruby either, and I think having multiple blocks stuffed into the same statement would be super confusing.

Of course, in your case you'd have to abandon things like:

x = if b
'b'
else
'c'

But we didn't want that syntax anyway.

Matthew said...

You could extend into a second dimension. By that I mean that a line of code is a horizontal sequence of terms, but some of those terms could extend vertically. You would need an editor which would let you insert a multi-line section of text into a line. A bit tricky in normal editors, though...

tav said...

Hey Tony,

While it doesn't suit the expressions-only approach you desire, I came up with a Pythonic way to do Ruby blocks:

* http://tav.espians.com/ruby-style-blocks-in-python.html

Hope it's useful.

Let me know what you think and keep up the great work with Reia!

-- Thanks, tav

Ian Bicking said...

If you want to support templating languages, then you really should support some kind of AST that template languages can compile down into. Generating source code from a template is half-assed at best.