Anthology of Computers and the Humanities · Volume 4

Faire du neuf avec du balisé : Quand une édition TEI devient la mémoire d’un RAG [Making Something New from Markup: When a TEI Edition Becomes the Memory of a RAG]

Clément Castellon², Floriane Chiffoleau¹,² and Alina Miasnikova¹,²

  • 1 Observatoire des Textes, des Idées et des Corpus (ObTIC), Paris, France
  • 2 Sorbonne Université, Paris, France

Permanent Link: https://doi.org/10.63744/UR8DCmRglj12

Published: 21 May 2025

Keywords: digital edition, retrieval-augmented generation (RAG), TEI XML encoding, heritage collection

Mots clés : édition numérique, génération à enrichissement contextuel (RAG), encodage XML-TEI, collection patrimoniale

Abstract

This study investigates how XML-TEI encoded digital editions of heritage collections can be integrated into a Retrieval-Augmented Generation (RAG) system. Using the collections of the Valentin Haüy Association, we demonstrate that exploiting the structure already encoded in TEI (hierarchical divisions, titles, and paragraphs) improves chunking and information retrieval. Results show that the RAG system delivers precise, coherent answers that faithfully reflect the source documents, while TEI metadata enables traceability. This approach highlights the potential of structured digital editions as a dynamic memory for RAG systems, offering improved exploration and accessibility of complex archival corpora within the digital humanities.
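
To make the idea of structure-aware chunking concrete, the following is a minimal sketch in Python, not the pipeline used in this study. It assumes a TEI document whose body is organised into nested `<div>` elements with `<head>` titles and `<p>` paragraphs, and it emits one chunk per paragraph together with the chain of enclosing titles, so that a retrieved passage could be traced back to its place in the edition. The sample document, the function name `tei_chunks`, and the chunk fields are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}
DIV_TAG = f"{{{TEI_URI}}}div"
P_TAG = f"{{{TEI_URI}}}p"

# Hypothetical miniature TEI document, invented for this example.
# It is deliberately minimal and not schema-valid.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample document</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter">
        <head>Chapter 1</head>
        <p>First paragraph.</p>
        <div type="section">
          <head>Section 1.1</head>
          <p>Second   paragraph.</p>
        </div>
      </div>
    </body>
  </text>
</TEI>"""


def tei_chunks(xml_text):
    """Return one chunk per <p>, carrying the titles of its enclosing <div>s."""
    root = ET.fromstring(xml_text)
    doc_title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    ).strip()
    body = root.find(".//tei:body", TEI_NS)
    chunks = []

    def walk(node, path):
        # Extend the section path with this node's own <head>, if it has one.
        head = node.findtext("tei:head", default="", namespaces=TEI_NS).strip()
        here = path + [head] if head else path
        for child in node:
            if child.tag == P_TAG:
                # Flatten inline markup and normalise whitespace.
                text = " ".join("".join(child.itertext()).split())
                if text:
                    chunks.append(
                        {"document": doc_title, "section_path": here, "text": text}
                    )
            elif child.tag == DIV_TAG:
                walk(child, here)

    if body is not None:
        walk(body, [])
    return chunks


if __name__ == "__main__":
    for chunk in tei_chunks(SAMPLE_TEI):
        print(" > ".join(chunk["section_path"]), "|", chunk["text"])
```

Keeping the section path alongside each chunk is one simple way to preserve the hierarchy at retrieval time. A real system would likely also merge very short paragraphs, split very long ones, and record identifiers such as `xml:id` where the edition provides them.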