Fine-grained Named-Entity Recognition for the East-India Company domain

Arnoult, Sophie; Nijman, Brecht; Wissen, Leon van

doi:10.63744/DRbhWNTzqNzR

Abstract

The Digital Humanities can now benefit from easily accessible tools and pretrained models. Questions remain, however, about how well the data used to train these models match the task data. For a task like Named-Entity Recognition (NER), domain specificity expresses itself not only in the linguistic domain but also in the entities of interest. While fine-grained entity tagsets are valuable, they are harder to annotate, which leads to smaller, less representative training data, and they may also be less interoperable with other NER label sets. In this work, we introduce a new fine-grained NER dataset for early modern Dutch texts related to the Dutch East India Company, covering 15 NER tags and 8000 mentions. We show that training a language model on the task data improves NER performance compared to off-the-shelf multilingual pretrained models. We further introduce a new method, class-agnostic co-training, to augment training data with existing NER datasets from the same domain that use more restricted tagsets. We demonstrate that this method improves performance for augmented tags while increasing overall precision. Our annotations and code are publicly available.
