A while back I wrote a parser combinator library in the Elixir language called Ergo. I did it partly as a language learning exercise and partly because I had some things needing parsing that related to AgendaScope (which is also being written in Elixir).
Barring some performance issues, Ergo has actually turned into quite a nice tool for parsing and it’s getting a workout in two projects: one related to AgendaScope and one a fun side-project. But there is a problem I’ve hit that I haven’t quite thought my way around yet. The side project is easier to talk about so let’s use a problem example from there:
@game begin
title: "The Maltese Owl"
author: "Matt Mower"
@place bobs_office begin
description: """A dimly lit, smoky, office with a battered leather top desk and swivel chair."""
w: #outer_office
s: #private_room
end
@actor bob_heart begin
name: "Bob"
role: #detective
end
…
end
A parser combinator combines simple parsers together to create more complex parsers. A simple parser might be character
that parsers things like ‘a’, ‘b’, ‘5’, ‘#’ and so on. Then there are parsers (called combinator parsers) that take other parsers and use a rule to apply them. It’s a bit like higher-order functions.
The sequence
parser applies a series of parsers, in turn, and returns :ok
if they all match, otherwise :error
. The choice
takes a list of parsers and returns :ok
when one of them matches, otherwise :error
. The many
parser takes a single parser and applies it repeatedly. When a parser returns :error
the input is rewound so that the next parser gets a chance to match on the input. So in the choice
each parser gets an opportunity to parse the same input.
Let’s look at an example to parse the above syntax (ignoring whitespace & other issues to keep things clear). It might look something like:
def game() do
sequence([
literal("@game"),
literal("begin"),
many(
choice([
place(),
actor()
]),
literal("end")
])
end
def place() do
sequence([
literal("@place"),
id(),
literal("begin"),
attributes(),
literal("end")
])
end
# actor() will look very similar to place()
def attributes() do
many(attribute())
end
def attribute() do
sequence([
name(),
literal(":"),
value()
])
)
Within game
the choice
parser attempts to parse a place
and if that fails, rewinds the input, and attempts to match actor
instead. The many
parser would then repeat this choice over and over so that you could parse an infinite stream of places and actors, in any order.
The many
parser can succeed zero or more times. It’s like the *
operator in a regular expression. When its child parser returns :error
(in our example, when the choice
cannot match either a place
or actor
) it stops. But what does it return?
It turns out that many
returns :ok
and not :error
. That’s because many
is greedy and will keep trying to match forever so that not matching at some point is not only expected but required!
If we were parsing:
@game begin
@place …
end
@actor …
end
@finally …
end
end
We expect many
to terminate because @final
is not part of the sequence of places & actors. However, if we look at the example at the beginning, we have two attributes defined on bobs_office
called n
and w
.
It turns out that an attribute name must be at least 3 characters long so these are invalid and attributes
is consequently returning an :error
and triggering the end of the many
. But this is a different end than meeting @finally
.
In this case the input is rewound and the next parser — literal("end")
— gets applied. But it’s getting applied to input that should have been consumed by the place
parser!
So the problem, as clearly as I can state it, is how to distinguish between situations in which many receiving an :error
from its child parser means “end of the sequence” and when it means “I should have parsed this, but couldn’t because of a user error.”
I’m not sure what an elegant solution is to this problem. My first instinct is to add a “parser” called something like commit
that specifies that an error after this point is a user-error.
So, for example:
def place() do
sequence([
literal("@place"),
commit(),
literal("begin"),
…
literal("end")
])
end
What this means is that if we have correctly parsed the input “@place” we know we are definitely in something that should match. If we can’t get place
to match from here then the user has specified an incorrect input and rather than place
returning :error
it should return something like :fatal
as an indicator to future parsers that the input cannot be correctly parsed, rather than attempting to continue parsing on on the wrong inputs.
Is the problem clear? Would this work? Is there something I’ve missed? Perhaps a more elegant solution to this problem?