Practical general top-down parsers

Ali Afroozeh and Anastasia Izmaylova

Promotors: prof.dr. P. Klint (UvA and CWI) and prof.dr. J. Vinju (CWI and TU/e)
Universiteit van Amsterdam
Date: 11 June, 2019
Thesis: PDF

Summary

This thesis presents the design and implementation of practical general top-down parsers. Top-down parsing, usually in the form of recursive-descent parsing, is efficient, easy to understand and is widely used to manually write parsers for real programming languages. Our work in this thesis has been inspired by the Generalized LL (GLL) parsing algorithm and Johnson’s CPS recognizers. These two work generalize recursive-descent parsing to support all context-free grammars, including left-recursive ones. In fact, part of our work can be seen as merging and unifying GLL and Johnson’s CPS recognizers. Working with Johnson’s recognizers led us to propose a more efficient GSS for GLL, which resembles traditional function memoization, and also shaped our work on data-dependent grammars. Our experience with GLL gave us insight into how to make Johnson’s CPS recognizers cubic and generate SPPF from them.
We propose data-dependent grammars as an abstract intermediate language for implementing various disambiguation constructs without knowledge of a specific parsing technology. As data-dependent grammars are rather low-level, we also provide high-level disambiguation constructs, e.g., for operator precedence and indentation-sensitivity, that desugar to data-dependent grammars. A significant part of this thesis is dedicated to the semantics and implementation of disambiguation rules for operator precedence. We introduce a derivation-based semantics for operator precedence disambiguation that is independent of the underlying parsing technique, and is safe, i.e., does not remove sentences from the language when there is no ambiguity. We propose an implementation of our operator precedence semantics using data-dependent grammars. In the course of this thesis, we have developed Iguana, our GLL-based data-dependent parsing framework. The results of our performance evaluation show that Iguana is practical for parsing real programming languages, and in terms of performance, is comparable to a mature parsing tool such as ANTLR.
Finally, based on Johnson’s CPS recognizers, we present general parser combinators that can deal with all context-free grammars and produce an SPPF in cubic time and space. Our general parser combinators have the flexibility and expressiveness of traditional parser combinators, and the performance guarantee of general parsing algorithms. We used our general parser combinators as the basis for Meerkat, a general parser combinator library in Scala.