How ‘Pagan’ Is My Text? Information Extraction from Untranscribed Data

Griffiths, Rachael M.; Meelen, Marieke

doi:10.63744/aYiz0uLyIS4f

Abstract

In this paper, we present work in progress on Information Extraction and Text Classification for large manuscript collections that have not yet been transcribed. We propose a three-stage pipeline: first, digitisation using a collaborative Handwritten Text Recognition (HTR) workflow; second, Normalisation and Segmentation of the texts to create searchable collections; and third, Text Classification and Information Extraction, which we discuss as a means of identifying texts with Tibetan ‘Pagan’ religious features that are hidden among texts belonging to the Buddhist and Bön religious traditions.
