Anthology of Computers and the Humanities · Volume 3

Fine-grained Named-Entity Recognition for the East-India Company domain

Sophie Arnoult1, Brecht Nijman2 and Leon van Wissen3

  • 1 VU University, Amsterdam, The Netherlands
  • 2 Huygens Institute, Amsterdam, The Netherlands
  • 3 University of Amsterdam, Amsterdam, The Netherlands

Permanent Link: https://doi.org/10.63744/DRbhWNTzqNzR

Published: 21 November 2025

Keywords: Dutch historical domain, named-entity recognition, pretrained language models, data augmentation

Abstract

The Digital Humanities can now draw on easily accessible tools and pretrained models, but questions remain about how well the data used to train these models matches the task data. For a task like Named-Entity Recognition (NER), domain specificity shows not only in the language but also in the entities of interest. Fine-grained entity tagsets are valuable, but they are harder to annotate, which leads to smaller, less representative training data, and they may be less interoperable with other NER label sets. In this work, we introduce a new fine-grained NER dataset for early modern Dutch texts related to the Dutch East India Company, covering 15 NER tags and 8000 mentions. We show that training a language model on the task data improves NER performance compared to off-the-shelf multilingual pretrained models. We further introduce a new method, class-agnostic co-training, to augment training data with existing NER datasets from the same domain that use more restricted tagsets. We demonstrate that this method improves performance for augmented tags while increasing overall precision. Our annotations and code are publicly available.