PhD defence by Heather Christine Lent

Title

NLP Across the Resource Landscape: Development in Creole NLP & Evaluation in Semantic Parsing

Abstract

Data availability is the crux on which much research in NLP depends. When a language is highly resourced, it is possible to train large-scale models that demonstrate impressive performance on a wide array of tasks. Unfortunately, the majority of the world's languages are still lower-resourced, and lack the data (if any exists) that such models require for training. As a result, NLP research can look very different in lower- versus higher-resourced settings, and both scenarios come with their own particular challenges. For lower-resourced languages, different methods must be developed for overcoming the constraints of limited data, as well as for leveraging resources from other languages, unless significantly more effort is spent on resource creation. Meanwhile, for higher-resourced languages, even when performance metrics are competitive, it can be difficult to evaluate what biases a model has learned from large datasets, and to determine a model's concrete limitations. Thus, in this work, we present a collection of studies from these two different settings of data availability: lower-resourced NLP for Creole languages and higher-resourced semantic parsing evaluation.

The studies on NLP for Creoles each approach the topic from a different angle. First, we explore a linguistically motivated approach for language modeling of Creoles, utilizing a Distributionally Robust Objective, but ultimately find that this method does not outperform standard Empirical Risk Minimization (Chapter 2). Next, we investigate transfer learning for Creoles, as this is a common approach for leveraging information from higher-resourced languages for lower-resourced ones. We find that the typical scenario, whereby cross-lingual learning is achieved by training on a set of languages closely related to the target language, cannot be trivially applied to Creoles (Chapter 3). The final two works within the scope of Creole NLP turn to community needs and shared resources. The first explores the needs for language technology within Creole-speaking communities, as it is critical that researchers do not assume these needs on behalf of a community (Chapter 4). The second discusses approaches, progress, and considerations for creating a multitask benchmark dataset for Creoles, which will give NLP researchers the opportunity to include these languages in their work and to further develop Creole NLP (Chapter 5).

For the remainder of this work, we present two studies on approaches for evaluating semantic parsers trained for English. In the first, we introduce a test suite for unit testing of text-to-SQL semantic parsers, in order to identify the true strengths and weaknesses of a model, beyond opaque accuracy metrics; we find that even state-of-the-art models still struggle with simple SQL operations like selecting columns (Chapter 6). In the second, we test three semantic role labeling parsers for their susceptibility to bias against figurative, non-literal utterances, as such language is common in everyday communication. We find that the parser utilizing a large-scale pre-trained language model was more biased against figurative language than the models using other word representation approaches (Chapter 7).

Supervisors

Principal Supervisor: Professor Anders Østerskov Søgaard

Assessment Committee

Tenure-Track Assistant Professor Desmond Elliott, Dpt. of Computer Science

Tenure-Track Assistant Professor Katharina Kann, UC Boulder

Leader of defence: Tenure-Track Assistant Professor Daniel Hershcovich, Dpt. of Computer Science

For an electronic copy of the thesis, please visit the PhD Programme page

This defence will take place physically, but you can follow online on Zoom