Anthology of Computers and the Humanities · Volume 3

Producing Structured Data from Historical Sources:
a Preliminary Application to French Senate Tables

Joël Féral1, Joseph Chazalon2 and Marie Puren2

  • 1 École Nationale des Chartes, Paris, France
  • 2 LRE, EPITA, Le Kremlin-Bicêtre, France

Permanent Link: https://doi.org/10.63744/oXn2aMxza3iJ

Published: 21 November 2025

Keywords: Parliamentary Archives, Structured Data, Generative Models, Evaluation

Abstract

We address the extraction of structured data from noisy historical documents (namely, the 1931 Tables nominatives of the French Senate) using an LLM guided by lightly constrained generation rather than strict post-hoc validation. Our contribution is threefold: (1) a minimal, application-driven target schema (speaker name + list of page references), expressed so that it can be injected into the prompt to steer generation; (2) a hybrid pipeline that decouples OCR from schema-oriented generation, leveraging the LLM’s tolerance of OCR noise while limiting hallucinations via an expected JSON format; (3) an evaluation protocol for structured outputs that uses optimal record matching and a continuous Integrated Matching Quality metric to overcome the brittleness of precision and recall. Code and data are publicly available at https://github.com/EPITAResearchLab/feral.25.chr.
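
To make the first two contributions concrete, the following is a minimal sketch of how a target schema of this kind could be injected into a prompt and how the returned JSON could be read leniently. It is an illustration under stated assumptions, not the authors' implementation: the field names `speaker` and `pages`, the prompt wording, and the helpers `build_prompt` and `parse_records` are all hypothetical, and the call to the LLM itself is deliberately left out.

```python
import json

# Hypothetical schema: the abstract only specifies "speaker name + list of
# page references"; the field names and integer page type are assumptions.
TARGET_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "speaker": {"type": "string"},
            "pages": {"type": "array", "items": {"type": "integer"}},
        },
        "required": ["speaker", "pages"],
    },
}


def build_prompt(ocr_text: str) -> str:
    """Embed the schema in the prompt so that it steers generation."""
    return (
        "Extract every speaker entry from the OCR text below.\n"
        "Answer with JSON that follows this schema:\n"
        f"{json.dumps(TARGET_SCHEMA, indent=2)}\n\n"
        f"OCR text:\n{ocr_text}"
    )


def parse_records(raw: str) -> list[dict]:
    """Leniently read a model answer: keep well-formed records, skip the rest."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    records = []
    for item in data:
        if not isinstance(item, dict):
            continue
        speaker = item.get("speaker")
        pages = item.get("pages")
        pages_ok = isinstance(pages, list) and all(
            isinstance(p, int) and not isinstance(p, bool) for p in pages
        )
        if isinstance(speaker, str) and pages_ok:
            records.append({"speaker": speaker, "pages": pages})
    return records
```

In such a setup, `build_prompt` would receive the OCR output of one page, and `parse_records` would be applied to whatever text the model returns; malformed records are skipped rather than causing the whole answer to be rejected, in keeping with the lightly constrained approach described above.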