Anthology of Computers and the Humanities · Volume 3

Zero-shot Methods for Historical Text Restoration

Kiara M. H. Liu1, Martin Mueller2 and Matthew Wilkens1

  • 1 Department of Information Science, Cornell University, Ithaca, U.S.A.
  • 2 Department of English, Northwestern University, Cook County, U.S.A.

Permanent Link: https://doi.org/10.63744/gz3Wm6kr19yr

Published: 21 November 2025

Keywords: historical text, zero-shot learning, language models, computational humanities

Abstract

The EarlyPrint corpus is a uniquely high-value resource, comprising over 60,000 digitized Early Modern English texts published between 1473 and the early 1700s. Although the texts were hand-keyed from scans of the original documents, transcription defects remain a problem due to the limitations of early scanning technologies. Specifically, unrecognizable letters are denoted by the “blackdot” character (“•”). Previous methods, including both human review and an LSTM-based approach, had moderate success in correcting these transcription errors. This paper expands on that work by exploring zero-shot techniques using historically adapted large language models. We divide blackdot words into two groups: for the less challenging instances we use lexical matching combined with zero-shot evaluation, and for the more complex cases we use direct zero-shot prediction. We achieve 95% accuracy on valid instances in the first group and 78.6% accuracy across the majority of blackdot words in the second. In total, we recommend 2.8 million missing-letter predictions and implement over 700,000 high-confidence corrections within the corpus, substantially improving data quality for scholarly use.
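
As a rough illustration of what lexical matching for a blackdot word might look like, the following minimal Python sketch turns a word containing “•” into a pattern and looks it up in a word list. The function name `candidates`, the toy lexicon, and the assumption that each blackdot stands for exactly one letter are illustrative choices, not details taken from the paper. The zero-shot language-model step that would choose among several candidates is not shown.

```python
import re

BLACKDOT = "\u2022"  # the "•" character marking an unrecognizable letter


def candidates(word, lexicon):
    """Return lexicon entries consistent with `word`.

    Assumption (not from the paper): each blackdot replaces exactly
    one letter, so the candidate must have the same length as `word`.
    """
    # Build a regex: a blackdot matches any single letter,
    # every other character must match literally.
    pattern = "".join(
        r"[^\W\d_]" if ch == BLACKDOT else re.escape(ch)
        for ch in word.lower()
    )
    rx = re.compile(pattern)
    return sorted(w for w in lexicon if rx.fullmatch(w))


# Hypothetical toy lexicon of Early Modern spellings.
toy_lexicon = {"loue", "lone", "line", "lyne", "haue"}

print(candidates("l\u2022ne", toy_lexicon))  # several candidates remain
print(candidates("ha\u2022e", toy_lexicon))  # a single candidate remains
```

In this sketch, a word with exactly one lexicon match could be corrected directly, while a word with several matches would still need contextual scoring to pick among them.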