Discussion of some design details

I want to describe some thoughts about parser construction, design decisions and error messaging, mostly for reference, but also to understand how these parsers work.

Lazy evaluation by %or%, and constructing parsers

The parser p1 %or% p2 performs lazy evaluation, meaning that if p1 succeeds then p2 is never tested. This is in contrast to the parser that Hutton (1992) describes1. The choice for lazy evaluation is motivated by the realization (laziness?) that a parser that evaluates all possible alternatives may quickly become extremely complex: it splits into two alternative paths with every %or%. I wondered whether this decision would cause practical problems, because, in principle if both p1 and p2 would be successful the result of this lazy evaluating parser will depend on the order in which the two alternatives are put. However, I do not believe that this should lead to problems in practical applications, for two reasons:

  1. a document format that is interpretable in multiple ways is not well-designed: it is fundamentally undecidable which of the alternative interpretations should be followed.
  2. a parser that can produce multiple interpretations of a well-designed, unambiguous document format should be re-designed.

There is one situation in which you could get into problems, namely when the differentiation between two document formats depends on their full consumption by a parser. For example:

Format-1

XXXX
YYYY
ZZZZ

and Format-2

XXXX
YYYY
QQQQ

can only be distinguished when the last line is read. If two alternative parsers do not read beyond the YYYY, then both will be successful, and we have an ambiguous interpretation. This is why we urge you to put an eof() parser at the end of any parser that you construct. It is also the reason why a warning is issued when a parser does not completely consume the input2.

Tracking where a parser error occurs

When you just get a fail [] as the output when a parser fails on a long input then it can become challenging to find out what went wrong. It would help a lot if you would know the index number (“line” number) of the input on which the parser fails. This means that the parser has to track where it is and where it fails. Since the only function that shifts to a next index is the %then% combinator, and since any other function that shifts to the right, like the repeater functions (zero_or_more(), etc.) is based on this function, we just have to count the number of times the %then% combinator was called up to the failure to know where the parser failed. A complication is that we may be testing alternative parsers implemented by the %or% combinator, and the parser may fail on different lines in the alternative “arms”. In that case we will say that the parser fails in the element with the highest index when both arms of an %or% combinator fail. The package returns a marker object (printed as []) with that element index number as an attribute. The principle is illustrated in the figure below.

<i>**Illustration of marker positioning when a parser fails on an input vector of length 6**. Each filled dot represents a successful parse. A `%then%` combinator shifts to the next element whereas an `%or%` combinator applies alternative parsers to the input. The parser fails at the red crosses and the red square. A marker (red square) will be put at the parser that fails at the input element with the highest index, which is element #5 here.</i>

Illustration of marker positioning when a parser fails on an input vector of length 6. Each filled dot represents a successful parse. A %then% combinator shifts to the next element whereas an %or% combinator applies alternative parsers to the input. The parser fails at the red crosses and the red square. A marker (red square) will be put at the parser that fails at the input element with the highest index, which is element #5 here.

In the current implementation we chose to track the current element index number with a state variable. The principle of using state variables is described in section 7.4 of R Packages (2e).

Note that proper tracking and error messaging only takes place when the parser is wrapped in the reporter() function. Below I illustrate the example above by parsing the first 6 letters of the alphabet by a parser that fails on the indicated positions.

p <- function() {
  literal("A") %then%
    literal("B") %then%
    (arm1() %or% arm2())
}

arm1 <- function() {
  literal("D") %then%
    literal("E") %then%
    literal("F") %then%
    literal("G")
}

arm2 <- function() {
  literal("C") %then% 
    (arm21() %or% arm22())
}

arm21 <- function() {
  literal("D") %then%
    literal("F") %then%
    literal("G")
}

arm22 <- function() {
  literal("E") %then%
    literal("F") %then%
    literal("G")
}

try(reporter((p() %then% eof()))(LETTERS[1:6]))
#> Error : Parser failed on line 5 of input.
#> Line content: "E"

An now successfully parsing on arm22

try(reporter((p() %then% eof()))(c("A","B","C","E","F","G")))
#> [[1]]
#> [1] "A"
#> 
#> [[2]]
#> [1] "B"
#> 
#> [[3]]
#> [1] "C"
#> 
#> [[4]]
#> [1] "E"
#> 
#> [[5]]
#> [1] "F"
#> 
#> [[6]]
#> [1] "G"

The package rex

After making the first public version of this package I discovered that there exists a package called rex (“Friendly regular expressions”) that uses a very similar approach and in a number of cases even the same commands as this package. The examples in the package show that it is used to construct parsers for strings, i.e. single-element character vectors, not multi-element character vectors. I haven’t tested it, but if it is fast enough I can imagine that it can be used in parcr’s function match_s() to create parsers for strings that have a complex structure.

Literature

Hutton, G. (1992) Higher-order functions for parsing. J Funct Programming 2: 323–343.

  1. In Hutton (1992) the %or% operator is called $alt.↩︎

  2. This only happens if the parser has been converted to an error-messaging parser by wrapping it in the reporter() function.↩︎