Anthology of Computers and the Humanities · Volume 3

Digging Through Garbage: Detection of ‘Garbage’ Words in Digitized Historical Documents

Mirjam Cuper1 and Ethan den Boer2

  • 1 Research Department, KB, The National Library of The Netherlands, The Hague, The Netherlands
  • 2 Independent researcher, Rotterdam, The Netherlands

Permanent Link: https://doi.org/10.63744/wD9bYr0WxUTa

Published: 21 November 2025

Keywords: Digital Heritage, OCR Quality, Garbage Detection

Abstract

Digitized historical heritage is widely available, and the need for high-quality material is increasing. However, some Optical Character Recognition (OCR) problems remain unsolved. Approaches have arisen to correct such problems after digitization through automatic post-processing, but sometimes the OCR output is of such low quality that these post-processing methods are not an option. To detect this ‘garbage’ output, various ‘garbage detection’ methods have been developed, based on English and German. We expand upon these methods by developing a garbage detection method tailored to 17th-century Dutch. Using a dataset of 6,245 17th-century Dutch newspapers for which both the original OCR output and manually corrected transcriptions are available, we compare various rule-based and machine learning methods. We developed a semi-automated method to create a labeled dataset for training, and we created a rule-based method tailored to 17th-century Dutch. Our results show that various machine learning models outperform rule-based methods. We have made our models publicly available for use by both researchers and heritage institutions.