Anthology of Computers and the Humanities · Volume 3

Quantifying Archival Silences: Phylogenetic Diversity Analysis of Controlled Vocabulary Utilization

Melvin Wevers1, Thomas Smits1 and Folgert Karsdorp2

  • 1 Department of History, University of Amsterdam, Amsterdam, the Netherlands
  • 2 Meertens Institute, Amsterdam, the Netherlands

Permanent Link: https://doi.org/10.63744/H9LnUCR9Mxmx

Published: 21 November 2025

Keywords: phylogenetic diversity, controlled vocabularies, cultural heritage, audiovisual archives

Abstract

This study adapts Faith’s Phylogenetic Diversity metric from ecology to measure controlled vocabulary utilization in archival collections, addressing limitations of traditional diversity measures, which ignore hierarchical relationships between terms. We introduce three diagnostic ratios (Coverage, Completeness, and Cataloging Intensity) and apply them to 878,046 photographs across 16 Dutch National Archives collections cataloged with the hierarchical GTAA vocabulary. The framework provides quantitative tools for assessing how vocabulary utilization patterns influence cultural heritage accessibility, and it highlights the tension between cataloging intensity and comprehensive research utility. The findings suggest that the interaction between collection content characteristics and institutional cataloging practices creates different pathways for cultural heritage discovery: collections vary substantially both in the scope of conceptual domains addressed (coverage ratio) and in the thoroughness of description within those domains (completeness ratio). The framework thereby offers empirical benchmarks for evidence-based collection assessment and metadata evaluation.
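To make the core idea concrete, the following minimal sketch computes Faith's Phylogenetic Diversity on a toy term hierarchy. It is an illustration under stated assumptions, not the authors' implementation: the hierarchy, the term names, and the function `faith_pd` are hypothetical, every branch is given unit length, and the rooted variant of the metric is used (the path to the root is always counted). It does not attempt to reproduce the three diagnostic ratios, which are not defined in the abstract.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical toy hierarchy (child -> parent); NOT taken from GTAA.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "transport": "root",
    "sports": "root",
    "bicycles": "transport",
    "trams": "transport",
    "football": "sports",
    "skating": "sports",
}


def faith_pd(used_terms: Iterable[str], parent: Dict[str, Optional[str]]) -> int:
    """Rooted Faith's PD with unit branch lengths (an assumption).

    Counts the distinct edges on the paths from each used term to the root.
    Each edge is identified by its child node.
    """
    edges: Set[str] = set()
    for term in used_terms:
        node = term
        # Walk upward until the root or an already counted edge is reached.
        while parent[node] is not None and node not in edges:
            edges.add(node)
            node = parent[node]
    return len(edges)


# Two collections that each use two terms (equal richness).
pd_close = faith_pd(["bicycles", "trams"], PARENT)      # sibling terms
pd_spread = faith_pd(["bicycles", "football"], PARENT)  # terms in different branches
print(pd_close, pd_spread)  # prints "3 4"
```

Both toy collections use two terms, so a plain count of distinct terms cannot tell them apart. The hierarchy-aware measure is higher for the collection whose terms sit in different branches, which is the kind of distinction that measures ignoring term relationships miss.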