Anthology of Computers and the Humanities · Volume 3

How Scalable is Quality Assessment of Text Recognition? A Combination of Ground Truth and Confidence Scores

Michał Bubula1, Konstantin Baierer1, Jörg Lehmann1, Clemens Neudecker1, Vahid Rezanezhad1 and Doris Škarić1

1 Department for Information and Data Management, Berlin State Library, Berlin, Germany

Permanent Link: https://doi.org/10.63744/GR59c1iXu6Wj

Published: 21 November 2025

Keywords: optical character recognition, evaluation, confidence scores, word error rates, ground truth, historical document analysis

Abstract

Vast amounts of historical documents are being digitized and subsequently processed with optical character recognition (OCR), but assessing the quality of the results is challenging at larger scales. Ground Truth-based evaluation requires sufficient and representative data, which are expensive to create. Following recent work, we investigate whether the confidence scores automatically provided by text recognition systems can serve as a proxy. Based on an analysis of the relationship between word error rates and word confidence scores for several OCR engines, we find that the latter can serve as a useful indicator of OCR quality. In a second step, we explore the scalability and reliability of combining Ground Truth and confidence scores for the quality assessment of text recognition in several experiments on a heterogeneous dataset comprising almost 5 million pages of historical documents dating from 1456 to 2000. A deeper analysis of the evaluation results provides insights into typical issues in OCR of historical documents, suggesting potential directions for future work.