We address the extraction of structured data from noisy historical documents, namely the 1931 Tables nominatives of the French Senate, using an LLM guided by lightly constrained generation rather than strict post-hoc validation. Our contribution is threefold: (1) a minimal, application-driven target schema (speaker name + list of page references), expressed so that it can be injected into the prompt to steer generation; (2) a hybrid pipeline that decouples OCR from schema-oriented generation, leveraging the LLM’s tolerance of OCR noise while limiting hallucinations through an expected JSON format; (3) an evaluation protocol for structured outputs that uses optimal record matching and a continuous Integrated Matching Quality metric, which overcomes the brittleness of precision and recall. Code and data are publicly available at https://github.com/EPITAResearchLab/feral.25.chr.
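As an illustration of contribution (1), the following is a minimal sketch of how a schema of the form "speaker name + list of page references" could be written down and injected into a prompt. The field names `name` and `pages`, the prompt wording, and the helpers `build_prompt` and `parse_records` are assumptions made for this example and are not taken from the released code; the light parsing step only mirrors the idea of an expected JSON format and is not the paper's method.

```python
import json

# Hypothetical schema: the source only specifies "speaker name + list of page
# references"; the field names "name" and "pages" are assumed for illustration.
SPEAKER_INDEX_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "pages": {"type": "array", "items": {"type": "integer"}},
        },
        "required": ["name", "pages"],
    },
}


def build_prompt(ocr_text: str) -> str:
    """Embed the schema in the prompt so that it steers generation."""
    schema_text = json.dumps(SPEAKER_INDEX_SCHEMA, indent=2)
    return (
        "Extract every speaker and the page references listed for them "
        "from the OCR text below.\n"
        "Answer with JSON only, following this schema:\n"
        f"{schema_text}\n\n"
        f"OCR text:\n{ocr_text}"
    )


def parse_records(llm_output: str) -> list[dict]:
    """Keep only records that have the expected shape (a light check)."""
    data = json.loads(llm_output)
    if not isinstance(data, list):
        return []
    records = []
    for item in data:
        if not isinstance(item, dict):
            continue
        name, pages = item.get("name"), item.get("pages")
        if isinstance(name, str) and isinstance(pages, list):
            # Drop page entries that are not integers instead of failing.
            kept = [p for p in pages if isinstance(p, int)]
            records.append({"name": name, "pages": kept})
    return records
```

With a made-up input, `build_prompt("DUPONT (M.), 12, 45")` returns a prompt containing both the schema and the OCR text, and `parse_records` silently discards items that lack a name or a page list, which keeps the check lenient rather than strict.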
