Vast amounts of historical documents are being digitized with subsequent optical character recognition (OCR), but assessing the quality of the results is challenging at scale. Ground Truth-based evaluation requires sufficient and representative data, which are expensive to create. Following recent work, we investigate whether the confidence scores automatically provided by text recognition systems can serve as a proxy. Based on an analysis of the relationship between word error rates and word confidence scores for several OCR engines, we find that confidence scores can serve as a useful indicator of OCR quality. In a second step, we explore the scalability and reliability of combining Ground Truth and confidence scores for quality assessment of text recognition in several experiments on a heterogeneous dataset comprising almost 5 million pages of historical documents from 1456 to 2000. A deeper analysis of the evaluation results provides insights into typical issues in OCR of historical documents and suggests potential directions for future work.
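To make the core idea concrete, the following is a minimal sketch (not the authors' code) of how per-page word error rates can be correlated with mean word confidences. The pages and confidence values are hypothetical placeholders; in practice, confidences would be parsed from engine output such as ALTO or hOCR files.

```python
# Minimal sketch: relating per-page word error rate (WER) to the mean
# word confidence reported by an OCR engine. Page data is hypothetical.

from statistics import correlation  # Python 3.10+

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (r != h)))     # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

# Hypothetical pages: (ground truth, OCR output, word confidences in 0..1).
pages = [
    ("the quick brown fox", "the quick brown fox", [0.99, 0.98, 0.97, 0.99]),
    ("ye olde towne crier", "ye oldc tovne crier", [0.95, 0.61, 0.55, 0.93]),
    ("anno domini 1456",    "anno d0mini 1456",    [0.97, 0.58, 0.90]),
]

wers = [word_error_rate(gt, ocr) for gt, ocr, _ in pages]
mean_confs = [sum(c) / len(c) for _, _, c in pages]

# A strong negative correlation suggests confidence can proxy for quality.
print(f"Pearson r(WER, mean confidence) = {correlation(wers, mean_confs):.3f}")
```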
