Traditional methods for evaluating visual artwork produced in art therapy remain scarce and time-intensive, creating barriers to systematic assessment of therapeutic progress. This study presents the first application of multimodal dense embeddings to the longitudinal evaluation of art therapy outcomes in individuals with schizophrenia. We analyzed 168 art therapy images produced by 14 participants with schizophrenia using CLIP (Contrastive Language-Image Pretraining) embeddings. The embeddings captured meaningful semantic patterns: real images showed significantly greater semantic dispersion than spatially randomized controls. Longitudinal analysis revealed progressive semantic diversification, with significant increases over time in the semantic distance between consecutive images (β = 0.284, p = 0.001) and in cumulative semantic drift from each participant's first image (β = 0.336, p < 0.001). Analysis of individual differences showed that volume metrics varied across several orders of magnitude (M = 1.13 × 10¹¹, SD = 2.05 × 10¹¹), indicating highly individualized patterns of semantic exploration. Vision-language models thus provide a novel, objective methodology for evaluating progression in art therapy, one that reveals systematic patterns of semantic evolution during treatment. The progressive semantic diversification observed suggests that art therapy facilitates expanding creative expression and psychological exploration over time. The substantial individual differences in semantic exploration patterns indicate potential for personalized treatment approaches based on creative trajectory analysis. This methodology offers promising applications for systematic art therapy assessment, treatment monitoring, and personalized intervention strategies in clinical practice.
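
The two longitudinal measures named above, the distance between consecutive images and the cumulative drift from the first image, can be illustrated with a minimal sketch. The abstract does not state which distance metric was used, so cosine distance is an assumption here, and the function names and the toy two-dimensional vectors are hypothetical stand-ins for real CLIP image embeddings ordered by session.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumed metric, not confirmed by the study)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def consecutive_distances(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Distance between each image and the one made just before it."""
    return [
        cosine_distance(embeddings[i - 1], embeddings[i])
        for i in range(1, len(embeddings))
    ]


def drift_from_first(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Distance between each later image and the participant's first image."""
    first = embeddings[0]
    return [cosine_distance(first, e) for e in embeddings[1:]]


# Toy example: three hypothetical embeddings in chronological order.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
step_sizes = consecutive_distances(toy_embeddings)
drift = drift_from_first(toy_embeddings)
```

In this toy series each step has the same size, while the drift from the first image keeps growing, which is the kind of pattern a positive slope over time would reflect. Regressing either series on session order, for instance with a mixed-effects model per participant, would be one way to obtain slope estimates, though the abstract does not specify the model that was fitted.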
