Parsing with Regular Expressions & Extensions to Kleene Algebra

Niels Bjørn Bugge Grathwohl
2015
PhD thesis

Abstract

In the first part of this thesis, we investigate methods for regular expression parsing. The work is divided into three parts.

The first part is a two-pass algorithm for greedy regular expression parsing in a semi-streaming fashion. It executes in time O(mn) for expressions of size m and input of size n. The first pass outputs a compact log of only k bits per input symbol, where k is the number of alternatives and Kleene stars in the expression. The second pass processes this log to produce a compact bit-coded parse tree, not requiring the input symbols at all. This is an improvement upon previous parsing algorithms that either do not scale linearly in the input or require more working memory for the log and use three passes instead of two.

The second part is an optimally streaming algorithm: bits of the bit-coded parse tree are output as soon as it is semantically possible to guarantee what the prefix of the final parse tree is. This can be done in time O(2m log m+mn) for expressions of size m and input of size n. To be optimal, a PSPACE-complete preprocessing step is required: For a fixed regular expression, the algorithm’s run-time scales linearly with the input length. For inherently non-streamable expressions like a*x + a*y the algorithm degrades gracefully to a two-pass algorithm.

The third part is a determinization procedure for the optimally streaming algorithm where the preprocessing step has been omitted, along with a surface language, Kleenex, for expressing general non-deterministic finite transducers. We show how to compile Kleenex programs into deterministic streaming string transducers (SSTs), and demonstrate the performance of a Kleenex compiler that translates the SSTs to efficient C code. The resulting programs are essentially optimally streaming, run in worst-case linear time in the input size, and show consistent high performance in the 1 Gbps range on various use cases.

In the second part of this thesis, we present two extensions to Kleene algebra.

Chomsky algebra is an algebra with a structure similar to Kleene algebra, but with a generalized mu-operator for recursion instead of the Kleene star. We show that the axioms of idempotent semirings along with continuity of the mu-operator completely axiomatizes the equational theory of the context-free languages.

KAT+B! is an extension to Kleene algebra with tests (KAT) with mutable state. We describe a test algebra B! for mutable tests and give a commutative coproduct between KATs. The axioms of B! together with KAT and a set of commutativity conditions are shown to completely axiomatize the equational theory of an arbitrary KAT enriched with mutable state. The theory is EXPSPACE-complete. We give some examples of its use in showing the Böhm–Jacopini theorem and the folk theorem that all WHILE programs can be put into a normal form with only one WHILE loop.