A while back I wrote a parser combinator library in the Elixir language called Ergo. I did it partly as a language learning exercise and partly because I had some things needing parsing that related to AgendaScope (which is also being written in Elixir).
Barring some performance issues, Ergo has actually turned into quite a nice tool for parsing and it’s getting a workout in two projects: one related to AgendaScope and one a fun side-project. But there is a problem I’ve hit that I haven’t quite thought my way around yet. The side project is easier to talk about so let’s use a problem example from there:
@game begin title: "The Maltese Owl" author: "Matt Mower" @place bobs_office begin description: """A dimly lit, smoky, office with a battered leather top desk and swivel chair.""" w: #outer_office s: #private_room end @actor bob_heart begin name: "Bob" role: #detective end … end
A parser combinator combines simple parsers together to create more complex parsers. A simple parser might be
character that parsers things like ‘a’, ‘b’, ‘5’, ‘#’ and so on. Then there are parsers (called combinator parsers) that take other parsers and use a rule to apply them. It’s a bit like higher-order functions.
sequence parser applies a series of parsers, in turn, and returns
:ok if they all match, otherwise
choice takes a list of parsers and returns
:ok when one of them matches, otherwise
many parser takes a single parser and applies it repeatedly. When a parser returns
:error the input is rewound so that the next parser gets a chance to match on the input. So in the
choice each parser gets an opportunity to parse the same input.
Let’s look at an example to parse the above syntax (ignoring whitespace & other issues to keep things clear). It might look something like:
def game() do sequence([ literal("@game"), literal("begin"), many( choice([ place(), actor() ]), literal("end") ]) end def place() do sequence([ literal("@place"), id(), literal("begin"), attributes(), literal("end") ]) end # actor() will look very similar to place() def attributes() do many(attribute()) end def attribute() do sequence([ name(), literal(":"), value() ]) )
choice parser attempts to parse a
place and if that fails, rewinds the input, and attempts to match
actor instead. The
many parser would then repeat this choice over and over so that you could parse an infinite stream of places and actors, in any order.
many parser can succeed zero or more times. It’s like the
* operator in a regular expression. When its child parser returns
:error (in our example, when the
choice cannot match either a
actor) it stops. But what does it return?
It turns out that
:ok and not
:error. That’s because
many is greedy and will keep trying to match forever so that not matching at some point is not only expected but required!
If we were parsing:
@game begin @place … end @actor … end @finally … end end
many to terminate because
@final is not part of the sequence of places & actors. However, if we look at the example at the beginning, we have two attributes defined on
It turns out that an attribute name must be at least 3 characters long so these are invalid and
attributes is consequently returning an
:error and triggering the end of the
many. But this is a different end than meeting
In this case the input is rewound and the next parser —
literal("end") — gets applied. But it’s getting applied to input that should have been consumed by the
So the problem, as clearly as I can state it, is how to distinguish between situations in which many receiving an
:error from its child parser means “end of the sequence” and when it means “I should have parsed this, but couldn’t because of a user error.”
I’m not sure what an elegant solution is to this problem. My first instinct is to add a “parser” called something like
commit that specifies that an error after this point is a user-error.
So, for example:
def place() do sequence([ literal("@place"), commit(), literal("begin"), … literal("end") ]) end
What this means is that if we have correctly parsed the input “@place” we know we are definitely in something that should match. If we can’t get
place to match from here then the user has specified an incorrect input and rather than
:error it should return something like
:fatal as an indicator to future parsers that the input cannot be correctly parsed, rather than attempting to continue parsing on on the wrong inputs.
Is the problem clear? Would this work? Is there something I’ve missed? Perhaps a more elegant solution to this problem?