Anthology of Computers and the Humanities · Volume 3

Benchmarking Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: Preliminary Results on Studying Christian Iconography

Gianmarco Spinaci1,2, Lukas Klic2 and Giovanni Colavizza1,3

  • 1 Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
  • 2 Villa I Tatti, The Harvard University Center for Italian Renaissance Studies, Florence, Italy
  • 3 Department of Communication, University of Copenhagen, Copenhagen, Denmark

Permanent Link: https://doi.org/10.63744/oxWtm5MhhwBH

Published: 21 November 2025

Keywords: Multimodal Models, Large Language Models, Image Classification, Iconography

Abstract

This study evaluates the capabilities of multimodal large language models (LLMs) in the task of single-label classification of Christian iconography, focusing on their performance in zero-shot and few-shot settings across curated datasets. The goal was to assess whether general-purpose models such as GPT-4o and Gemini 2.5 can interpret Christian iconography, a task typically addressed by supervised classifiers, and to evaluate their performance. Two research questions guided the analysis: (RQ1) How do multimodal LLMs perform on image classification of Christian saints? And (RQ2) how does performance vary when the input is enriched with contextual information or few-shot exemplars? We conducted a benchmarking study using three datasets that support Iconclass natively: ArtDL, ICONCLASS, and Wikidata, each filtered to its ten most frequent classes. Models were tested under three conditions: (1) classification using class labels, (2) classification with Iconclass descriptions, and (3) few-shot learning with five exemplars. Results were compared against ResNet50 baselines fine-tuned on the same datasets. The findings show that Gemini-2.5 Pro and GPT-4o outperformed the ResNet50 baselines across the three configurations, reaching peaks of 90.45% and 88.20% on ArtDL, respectively. Accuracy dropped significantly on the Wikidata dataset, suggesting model sensitivity to image size and metadata alignment. Enriching prompts with class descriptions generally improved zero-shot performance, while few-shot learning produced lower results, with only occasional and minimal gains in accuracy. We conclude that general-purpose multimodal LLMs are capable of classification in visually complex cultural heritage domains, even without task-specific fine-tuning. However, their performance is influenced by dataset consistency and prompt design. These results support the application of LLMs as metadata curation tools in digital humanities workflows and motivate future research on prompt optimization and the extension of the study to other classification strategies and models.
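To make the zero-shot setup concrete, the sketch below illustrates condition (1), classification using class labels only, as a single prompt to a multimodal model. This is not the authors' code: the OpenAI Python SDK is assumed as the interface, and the class list, prompt wording, and function name are hypothetical placeholders.

```python
# Illustrative sketch (assumed interface: OpenAI Python SDK, model "gpt-4o").
# The class names and prompt wording are hypothetical, not the paper's exact prompts.
import base64
from openai import OpenAI

# Hypothetical subset of the ten Iconclass-derived classes used in the benchmark.
CLASSES = ["Saint Jerome", "Saint John the Baptist", "Saint Sebastian", "Virgin Mary"]


def classify_image(image_path: str, model: str = "gpt-4o") -> str:
    """Ask the model to assign exactly one label from CLASSES to the given image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "You are an expert in Christian iconography. "
        "Classify the artwork into exactly one of these classes: "
        + ", ".join(CLASSES)
        + ". Answer with the class name only."
    )

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```

Conditions (2) and (3) would extend the same prompt with Iconclass descriptions of each class or with five labeled exemplar images, respectively.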