Anthology of Computers and the Humanities · Volume 3

Benchmarking Methods for Digitizing Print Bibliographies

Elizabeth Rodrigues1, Muhammad Khalid2, Shayak Nandi2, Amelia Vrieze2 and Tianyang Yu2

  • 1 Libraries, Grinnell College, Grinnell, IA, USA
  • 2 Computer Science, Grinnell College, Grinnell, IA, USA

Permanent Link: https://doi.org/10.63744/SisvHqHBH67Z

Published: 21 November 2025

Keywords: multimodal large language models, structured data, dataset creation

Abstract

This short paper reports on work in progress to develop a pipeline for digitizing a two-column print bibliography as queryable structured data. Print bibliographies constitute a major potential source of data for computational literary studies, but only if digitizing them in a useful format can be made technically and economically feasible for interested researchers. To this end, the project seeks to benchmark traditional OCR and multimodal LLM transcription of idiosyncratically formatted text, and the conversion of that text to JSON, in the resource-constrained and pedagogically motivated context of a small liberal arts college.