Many’s the difficult parser

A while back I wrote a parser combinator library in the Elixir language called Ergo. I did it partly as a language learning exercise and partly because I had some things needing parsing that related to AgendaScope (which is also being written in Elixir).

Barring some performance issues, Ergo has actually turned into quite a nice tool for parsing and it’s getting a workout in two projects: one related to AgendaScope and one a fun side-project. But there is a problem I’ve hit that I haven’t quite thought my way around yet. The side project is easier to talk about so let’s use a problem example from there:

@game begin
  title: "The Maltese Owl"
  author: "Matt Mower"

  @place bobs_office begin
    description: """A dimly lit, smoky, office with a battered leather top desk and swivel chair."""
    w: #outer_office
    s: #private_room
  end

  @actor bob_heart begin
    name: "Bob"
    role: #detective
  end

  …
end

A parser combinator combines simple parsers together to create more complex parsers. A simple parser might be character that parsers things like ‘a’, ‘b’, ‘5’, ‘#’ and so on. Then there are parsers (called combinator parsers) that take other parsers and use a rule to apply them. It’s a bit like higher-order functions.

The sequence parser applies a series of parsers, in turn, and returns :ok if they all match, otherwise :error. The choice takes a list of parsers and returns :ok when one of them matches, otherwise :error. The many parser takes a single parser and applies it repeatedly. When a parser returns :error the input is rewound so that the next parser gets a chance to match on the input. So in the choice each parser gets an opportunity to parse the same input.

Let’s look at an example to parse the above syntax (ignoring whitespace & other issues to keep things clear). It might look something like:

def game() do
  sequence([
    literal("@game"),
    literal("begin"),
    many(
      choice([
        place(),
        actor()
    ]),
    literal("end")
  ])
end

def place() do
  sequence([
    literal("@place"),
    id(),
    literal("begin"),
    attributes(),
    literal("end")
  ])
end

# actor() will look very similar to place()

def attributes() do
  many(attribute())
end

def attribute() do
  sequence([
    name(),
    literal(":"),
    value()
  ])
)

Within game the choice parser attempts to parse a place and if that fails, rewinds the input, and attempts to match actor instead. The many parser would then repeat this choice over and over so that you could parse an infinite stream of places and actors, in any order.

The many parser can succeed zero or more times. It’s like the * operator in a regular expression. When its child parser returns :error (in our example, when the choice cannot match either a place or actor) it stops. But what does it return?

It turns out that many returns :ok and not :error. That’s because many is greedy and will keep trying to match forever so that not matching at some point is not only expected but required!

If we were parsing:

@game begin
  @place …
  end

  @actor …
  end

  @finally …
  end
end

We expect many to terminate because @final is not part of the sequence of places & actors. However, if we look at the example at the beginning, we have two attributes defined on bobs_office called n and w.

It turns out that an attribute name must be at least 3 characters long so these are invalid and attributes is consequently returning an :error and triggering the end of the many. But this is a different end than meeting @finally.

In this case the input is rewound and the next parser — literal("end") — gets applied. But it’s getting applied to input that should have been consumed by the place parser!

So the problem, as clearly as I can state it, is how to distinguish between situations in which many receiving an :error from its child parser means “end of the sequence” and when it means “I should have parsed this, but couldn’t because of a user error.”

I’m not sure what an elegant solution is to this problem. My first instinct is to add a “parser” called something like commit that specifies that an error after this point is a user-error.

So, for example:

def place() do
  sequence([
    literal("@place"),
    commit(),
    literal("begin"),
    …
    literal("end")
  ])
end

What this means is that if we have correctly parsed the input “@place” we know we are definitely in something that should match. If we can’t get place to match from here then the user has specified an incorrect input and rather than place returning :error it should return something like :fatal as an indicator to future parsers that the input cannot be correctly parsed, rather than attempting to continue parsing on on the wrong inputs.

Is the problem clear? Would this work? Is there something I’ve missed? Perhaps a more elegant solution to this problem?

Leave a Reply

Your email address will not be published.