Benchmarking Methods for Digitizing Print Bibliographies

Rodrigues, Elizabeth; Khalid, Muhammad; Nandi, Shayak; Vrieze, Amelia; Yu, Tianyang

doi:10.63744/SisvHqHBH67Z

Abstract

This short paper reports on work in progress to develop a pipeline for digitizing a two-column print bibliography as structured, queryable data. Print bibliographies are a major potential source of data for computational literary studies, but only if digitizing them in a useful format can be made technically and economically feasible for interested researchers. To this end, the project seeks to benchmark traditional OCR and multimodal LLM transcription of idiosyncratically formatted text, and the conversion of that text to JSON, in the resource-constrained and pedagogically motivated context of a small liberal arts college.