Anthology of Computers and the Humanities · Volume 3

From Raw Text to Meaningful Information: Named Entity Recognition, Disambiguation, and Semantic Enrichment of a Large Corpus of Historical Police Records (Antwerp, 1876–1945)

Lith Lefranc1

  • 1 Center for Urban History, University of Antwerp, Belgium

Permanent Link: https://doi.org/10.63744/Aym1S8P80hvy

Published: 21 November 2025

Keywords: Named Entity Recognition, Historical Entity Disambiguation, Geocoding, Name-to-Gender Inference, Toponym Resolution, Night Studies, Urban History

Abstract

This paper presents a pipeline for transforming noisy, machine-readable historical records into structured, meaningful information. Unlike prior methods, which often rely on contemporary natural language processing tools, this framework adapts language models and ontologies to the historical and regional specificity of the data. Using 271 incident books from Antwerp’s local police (1876–1945), we developed an integrated approach that combines historical named entity recognition, disambiguation, and further semantic enrichment. We trained domain-adapted transformer models on manually annotated data to extract dates, times, locations, and demographic information about the individuals involved in the incidents (names, birthplaces, birth years, and occupations). Post-processing methods address historical spelling variations, handwritten text recognition errors, and inconsistent administrative practices through normalization and disambiguation. The pipeline enriches the extracted data through geocoding of historical street names using custom gazetteers, automated name-to-gender inference, and systematic conversion of occupational descriptions to social class categories via the HISCO/HISCLASS standards. The system achieves F1-scores ranging from 0.82 to 0.99 across entity types, demonstrating how computational methods can unlock noisy historical records for data-driven urban history research.