Kleenex: Compiling Nondeterministic Transducers to Deterministic Streaming Transducers

Grathwohl, Bjørn Bugge and Henglein, Fritz and Rasmussen, Ulrik Terp and Søholm, Kristoffer Aalund and Tørholm, Sebastian Paaske
2016
Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages

Abstract

We present and illustrate Kleenex, a language for expressing general nondeterministic finite transducers, and its novel compilation to streaming string transducers with worst-case linear-time performance and sustained high throughput. Its underlying theory is based on transducer decomposition into oracle and action machines: the oracle machine performs streaming greedy disambiguation of the input; the action machine performs the output actions. In use cases Kleenex achieves consistently high throughput rates around the 1 Gbps range on stock hardware. It performs well, especially in complex use cases, in comparison to both specialized and related tools such as GNU AWK, GNU sed, RE2, Ragel and regular-expression libraries.