Anthology of Computers and the Humanities · Volume 3

How ‘Pagan’ Is My Text? Information Extraction from Untranscribed Data

Rachael M. Griffiths1,2 and Marieke Meelen1

  • 1 DTAL, University of Cambridge, United Kingdom
  • 2 EPHE-PSL, Paris, France

Permanent Link: https://doi.org/10.63744/aYiz0uLyIS4f

Published: 21 November 2025

Keywords: HTR, Information Extraction, Text Classification, Tibetan, Religious Studies

Abstract

In this paper, we present work in progress on Information Extraction and Text Classification for large manuscript collections that have not yet been transcribed. We propose a three-stage pipeline. The first stage digitises the manuscripts using a collaborative Handwritten Text Recognition (HTR) workflow. The second stage applies Normalisation and Segmentation to the resulting texts to create searchable collections. Finally, we discuss how Text Classification and Information Extraction can help identify texts with Tibetan ‘Pagan’ religious features that are hidden among texts belonging to the Buddhist and Bön religious traditions.