Large language models offer transformative potential for digitizing historical texts, but their application to humanities research raises critical questions about temporal bias and historical representation. We present the first systematic evaluation of multimodal LLMs for historical optical character recognition (OCR), testing 11 leading models on 1,030 pages of 18th-century Russian texts printed in Civil font. Using a contamination-free dataset from the National Library of Russia, we demonstrate that while LLMs substantially outperform traditional OCR systems (achieving 3.36% vs. 21.55–45.96% character error rates), they exhibit systematic temporal biases that fundamentally compromise historical authenticity. Our analysis reveals two distinct forms of distortion: a “modernization trap” where models automatically “correct” historical orthography to contemporary standards, and paradoxical “over-historicization” where models insert anachronistic medieval Slavonic characters into 18th-century texts. These errors reflect what we term the absence of “historical linguistic competence” — models treat historical language not as a continuum of specific periods but as an undifferentiated space labeled “old”. Different model families exhibit distinct error signatures, exposing how architectural choices and training data composition shape temporal bias. These findings reveal that “epistemic anachronism” in AI systems goes beyond inherited editorial biases. While training data explains modernization, the concurrent archaization demonstrates a fundamental architectural limitation: without temporal metadata as a training signal, models cannot develop “historical linguistic competence” even when explicitly provided with dates. Our work shows how these systems create temporal chimeras that appear historical while actively corrupting the historical record.
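The character error rate (CER) figures above follow the standard definition: the Levenshtein edit distance between the system output and the ground-truth transcription, divided by the length of the ground truth. A minimal sketch of that metric (the function names and example strings here are illustrative, not from the paper's evaluation pipeline):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn hyp into ref."""
    # Classic dynamic-programming table, kept to two rows for memory.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution / match
                            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: a model silently modernizing the historical
# spelling counts as substitution errors against the ground truth.
print(cer("миръ", "мир"))   # one deleted final character out of four
```

Note that a “modernizing” correction and a genuine misrecognition are indistinguishable under CER alone; separating the two requires the kind of error categorization the analysis above describes.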
