Producing Structured Data from Historical Sources: A Preliminary
Application to French Senate <em>Tables</em>

Féral, Joël; Chazalon, Joseph; Puren, Marie

doi:10.63744/oXn2aMxza3iJ

Abstract

We address the extraction of structured data from noisy historical documents, namely the 1931 Tables nominatives of the French Senate, using an LLM guided by lightly constrained generation rather than strict post-hoc validation. Our contribution is threefold: (1) a minimal, application-driven target schema (speaker name + list of page references), expressed so that it can be injected into the prompt to steer generation; (2) a hybrid pipeline that decouples OCR from schema-oriented generation, leveraging the LLM's tolerance of OCR noise while limiting hallucinations through an expected JSON format; (3) an evaluation protocol for structured outputs that uses optimal record matching and a continuous Integrated Matching Quality metric, which avoids the brittleness of precision and recall. Code and data are publicly available at https://github.com/EPITAResearchLab/feral.25.chr.
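To make the target schema and the record-matching idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the authors' repository: the field names `name` and `pages`, the `Entry` class, the `similarity` score, and the sample records are all hypothetical. In particular, `similarity` is not the Integrated Matching Quality metric, and the exhaustive search merely stands in for a proper assignment solver.

```python
import json
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import permutations
from typing import List, Tuple


@dataclass(frozen=True)
class Entry:
    """One index record: a speaker name and the pages that mention them."""
    name: str
    pages: Tuple[int, ...]


def parse_entries(llm_json: str) -> List[Entry]:
    """Parse a JSON array of {"name": str, "pages": [int, ...]} objects."""
    return [Entry(item["name"], tuple(item["pages"])) for item in json.loads(llm_json)]


def similarity(a: Entry, b: Entry) -> float:
    """Illustrative score in [0, 1]: mean of name similarity and page-set Jaccard."""
    name_score = SequenceMatcher(None, a.name, b.name).ratio()
    pages_a, pages_b = set(a.pages), set(b.pages)
    union = pages_a | pages_b
    page_score = len(pages_a & pages_b) / len(union) if union else 1.0
    return (name_score + page_score) / 2


def best_matching(pred: List[Entry], gold: List[Entry]) -> List[Tuple[int, int, float]]:
    """Find the one-to-one matching that maximizes total similarity.

    Exhaustive search, so only usable on a handful of records.
    Returns (pred_index, gold_index, score) triples.
    """
    swap = len(pred) > len(gold)
    small, large = (gold, pred) if swap else (pred, gold)
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(i, j, similarity(small[i], large[j])) for i, j in enumerate(perm)]
        total = sum(score for _, _, score in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    # Report indices in (pred, gold) order regardless of which list was shorter.
    return [(j, i, s) if swap else (i, j, s) for i, j, s in best_pairs]


# Made-up records: the prediction is reordered and contains an OCR-style typo.
gold = [Entry("Dupont (Jean)", (12, 45)), Entry("Martin (Paul)", (7,))]
pred = parse_entries(
    '[{"name": "Martin (Pau1)", "pages": [7]},'
    ' {"name": "Dupont (Jean)", "pages": [12, 45]}]'
)
for pred_idx, gold_idx, score in best_matching(pred, gold):
    print(pred[pred_idx].name, "<->", gold[gold_idx].name, round(score, 2))
```

Matching records before scoring them means that a reordered or slightly misspelled prediction is still compared against the right reference entry, and a continuous per-pair score gives partial credit where exact-match precision and recall would count a plain miss.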
