Anthology of Computers and the Humanities · Volume 3

Digging Through Garbage: Detection of ‘Garbage’ Words in Digitized Historical Documents

Mirjam Cuper1 and Ethan den Boer2

  • 1 Research Department, KB, The National Library of The Netherlands, The Hague, The Netherlands
  • 2 Independent researcher, Rotterdam, The Netherlands

Permanent Link: https://doi.org/10.63744/wD9bYr0WxUTa

Published: 21 November 2025

Keywords: Digital Heritage, OCR Quality, Garbage Detection

Abstract

Digitized historical heritage is widely available, and the need for high-quality material is increasing. However, some Optical Character Recognition (OCR) problems remain unsolved. Approaches have arisen to correct such problems after digitization through automatic post-processing, but sometimes the OCR output is of such low quality that these post-processing methods are not an option. To detect this ‘garbage’ output, various ‘garbage detection’ methods have been developed, based on English and German. We expand upon these methods by developing a garbage detection method tailored to 17th-century Dutch. Using a dataset of 6,245 17th-century Dutch newspapers for which both the original OCR output and manually corrected transcriptions are available, we compare various rule-based and machine learning methods. We developed a semi-automated method to create a labeled dataset for training, and we created a rule-based method tailored to 17th-century Dutch. Our results show that various machine learning models outperform rule-based methods. We have made our models publicly available for use by both researchers and heritage institutions.