Zero-shot Methods for Historical Text Restoration

Liu, Kiara M. H. (ORCID 0009-0004-2578-7986); Mueller, Martin; Wilkens, Matthew (ORCID 0000-0001-6749-9318)

doi:10.63744/gz3Wm6kr19yr

Abstract

The EarlyPrint corpus is a uniquely high-value resource, comprising over 60,000 digitized Early Modern English texts published between 1473 and the early 1700s. Although the texts were hand-keyed from scans of original documents, transcription defects remain a problem due to the limitations of early scanning technologies. Specifically, unrecognizable letters are denoted by the “blackdot” character (“•”). Previous methods, including both human review and an LSTM-based approach, achieved moderate success in correcting these transcription errors. This paper expands on previous work by exploring zero-shot techniques using historically adapted large language models. We identify two groups of blackdot words: for the less challenging instances, we use lexical matching combined with zero-shot evaluation; for the more complex cases, we use direct zero-shot prediction. We achieve 95% accuracy on valid instances in the first group and 78.6% accuracy across the majority of blackdot words in the second. In total, we recommend 2.8 million missing-letter predictions and implement over 700,000 high-confidence corrections within the corpus, substantially improving data quality for scholarly use.
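As an illustration of the lexical-matching step only, the following minimal sketch finds the lexicon entries that fit a damaged token. It is not the paper's implementation: the function name `candidates`, the toy lexicon, and the assumptions that each blackdot stands for exactly one letter and that letters fall in the range a-z are all hypothetical choices made for this example.

```python
import re

BLACKDOT = "•"

def candidates(damaged, lexicon):
    """Return lexicon words that fit a damaged token (hypothetical helper).

    Assumption: each blackdot replaces exactly one letter in a-z.
    """
    # Build a regex: a blackdot becomes a one-letter wildcard,
    # every other character must match literally.
    pattern = "".join(
        "[a-z]" if ch == BLACKDOT else re.escape(ch)
        for ch in damaged.lower()
    )
    regex = re.compile(pattern)
    # Sort so the candidate list is deterministic.
    return sorted(w for w in lexicon if regex.fullmatch(w.lower()))

# Toy lexicon with period spellings, invented for this example.
lexicon = {"loue", "lone", "line", "love", "liue"}
print(candidates("lo•e", lexicon))  # → ['lone', 'loue', 'love']
```

When more than one candidate survives, as here, lexical matching alone cannot decide between them; under the approach described above, that is where zero-shot evaluation by a language model would come in, scoring each candidate in its sentence context.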