This data paper reports work in progress on a large-scale, open book catalogue, motivated by the need to understand the “shadow libraries” increasingly used to train large language models. We describe the aggregation and planned unification of tens of millions of bibliographic records from Library Genesis, Z-Library, OpenLibrary, and Goodreads. The paper outlines our methodology for handling poor-quality metadata through a unified “work/edition/author” data model, cross-source validation, and heuristic-based enrichment. The resulting metadata-only catalogue will provide a new resource for computational humanities research and enable a critical audit of opaque AI training corpora. We conclude with a legal analysis justifying our approach under EU and US law.
