Gaze-referential inference, the ability to infer what others are looking at, is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photographs with systematically manipulated difficulty and variability, compared their performance with that of human participants (N = 65), and analyzed the behavioral patterns with mixed-effects models.
We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. VLMs even chose each candidate option almost equally often. Are they randomly guessing? Although most VLMs struggle, when we zoom in on the five top-tier VLMs with above-chance performance, we find that their performance declined with increasing task difficulty but varied only slightly across prompts and scene objects. These behavioral patterns cannot be explained by treating them as random guessers. Instead, they likely combine heuristics with guessing, so that their performance is sensitive to task difficulty but robust to perceptual variation. This suggests that VLMs, still lacking gaze-inference capability, are not yet technologies that can interact naturally with humans, but the potential remains.
94 of 111 VLMs performed no better than random guessing (~42%), while humans achieved ~91% accuracy. VLMs responded with every possible option almost equally often. Even top-tier VLMs like GPT-4o only reached ~50% accuracy. Are they simply guessing randomly?
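For readers wondering where a fractional chance level like ~42% comes from: with a mix of 2-, 3-, and 4-choice trials, the baseline is the average of 1/k over trials. A minimal sketch, with a hypothetical trial mix (the study's actual design determines the exact figure):

```python
# Sketch: how a chance baseline is computed when trials vary in the number
# of candidate objects (2, 3, or 4 in this study). The trial mix below is a
# hypothetical placeholder, not the study's actual design.

def chance_accuracy(choice_counts):
    """Expected accuracy of a uniform random guesser over the trial set."""
    return sum(1 / k for k in choice_counts) / len(choice_counts)

# hypothetical trial mix; the real distribution determines the ~42% figure
trials = [2] * 50 + [3] * 30 + [4] * 20
print(f"chance baseline ~ {chance_accuracy(trials):.2%}")
```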
VLM size and release date show minimal correlation with performance (both R² < 0.03), suggesting fundamental architectural limitations rather than scale issues. Their error patterns also differ markedly from humans' and do not become more human-like as models grow larger (R² < 0.1).
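As a concrete sketch of this check (illustrative numbers, not the study's data), one can regress per-model accuracy on log parameter count and read off R²; the same recipe applies to release date:

```python
# Sketch of the scaling check: regress per-model accuracy on log parameter
# count. All numbers below are illustrative, not the study's data.
import numpy as np
from scipy.stats import linregress

log_params = np.log10([7e9, 13e9, 34e9, 72e9, 110e9])  # hypothetical model sizes
accuracy = np.array([0.36, 0.41, 0.38, 0.43, 0.40])    # hypothetical accuracies

fit = linregress(log_params, accuracy)
print(f"R^2 = {fit.rvalue ** 2:.3f}")  # R^2 < 0.03 would indicate no scaling trend
```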
The baseline-corrected performance of the five top-tier VLMs degrades significantly as objects move closer together and as the number of candidate objects increases, but surprisingly shows no view-angle effect (humans do show one, p < 0.001). In general, this means their performance moves closer to random guessing as difficulty increases.
These task-difficulty effects suggest meaningful computation rather than random responding. Top-tier VLMs likely employ heuristics (or approximations) that work to some extent in easier cases but break down under challenging conditions, an indication of partial understanding of gaze direction and spatial relationships.
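One way to make this account concrete (our illustrative formalization, not a model from the paper): suppose a responder applies a heuristic that succeeds with probability h, where h shrinks with difficulty, and otherwise guesses uniformly among k choices. Accuracy then decays toward chance as the heuristic breaks down:

```python
# Illustrative mixture of heuristic and guessing (not the paper's model):
# with probability h the heuristic answers correctly; otherwise the model
# guesses uniformly among k candidate objects.

def mixture_accuracy(h: float, k: int) -> float:
    """Expected accuracy of a heuristic-plus-guessing responder."""
    return h + (1 - h) * (1 / k)

# As difficulty grows, h shrinks and accuracy approaches the 1/k chance level.
for h in (0.6, 0.3, 0.0):
    print(f"h={h:.1f}: accuracy={mixture_accuracy(h, 3):.2f} (chance = {1/3:.2f})")
```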
Humans show strong diagonal patterns in their response matrices, indicating consistently correct responses, while VLMs distribute their choices nearly uniformly across all options. This pattern is consistent with a random-guessing account and prompted our initial question: are they guessing?
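A small sketch of the contrast described above, using made-up response matrices (rows: correct object; columns: chosen object):

```python
# Illustrative response matrices (made-up numbers): humans concentrate mass
# on the diagonal, while VLMs spread choices nearly uniformly.
import numpy as np

human = np.array([[0.95, 0.03, 0.02],
                  [0.04, 0.92, 0.04],
                  [0.03, 0.04, 0.93]])
vlm = np.array([[0.35, 0.33, 0.32],
                [0.34, 0.33, 0.33],
                [0.33, 0.34, 0.33]])

for name, m in (("human", human), ("vlm", vlm)):
    # average probability of choosing the correct object
    print(f"{name}: diagonal mass = {np.trace(m) / m.shape[0]:.2f}")
```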
We conducted a pre-registered study focusing on five top-performing VLMs (GPT-4o, Gemini 1.5 Pro, InternLM-XComposer2-vl-7b, Qwen2.5-VL-72B-Instruct, and GLM-4V-9B) and humans to understand their behavioral patterns using mixed-effects logistic regression models. To correct for the chance baseline, which varies with the number of choices, we examine not raw accuracy but the ratio between the odds of a correct response and the odds of success at chance. We found that performance moves closer to the baseline as difficulty increases; the hypothesis that heuristics used in easier cases break down in harder ones stems from this degradation pattern.
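A minimal sketch of this baseline correction, assuming chance success is 1/k on a k-choice trial (function names are ours):

```python
# Baseline correction via odds ratio: compare the odds of answering correctly
# to the odds of succeeding by chance (1/k on a k-choice trial). A ratio of 1
# is indistinguishable from guessing; larger values mean above-chance skill.

def odds(p: float) -> float:
    return p / (1 - p)

def baseline_corrected_odds_ratio(p_correct: float, n_choices: int) -> float:
    return odds(p_correct) / odds(1 / n_choices)

# 50% accuracy is at chance on 2-choice trials but above chance on 3-choice ones
print(baseline_corrected_odds_ratio(0.50, 2))  # 1.0
print(baseline_corrected_odds_ratio(0.50, 3))  # 2.0
```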
Significant for 4/5 VLMs and humans: Baseline-corrected performance degrades as objects get closer together (p < 0.05 for Gemini, GPT-4o, GLM-4V, and InternLM; p = 0.15 for Qwen).
Significant for all 5 VLMs and humans: Baseline-corrected performance drops dramatically (p < 0.001) as the number of candidate objects increases from 2 to 3 and 4, with effect sizes ranging from -0.165 to -0.285.
Human baseline-corrected performance degrades when viewing the gazer's side profile (p < 0.001), whereas VLMs show no significant view-angle effect.
While the number of gazers (two) is too small to support conclusions about gazer sensitivity, the random-effect variance is small across prompts and nearly zero across object combinations.
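For concreteness, here is a hedged sketch of a mixed-effects logistic regression in this spirit, using the Bayesian mixed GLM from statsmodels; the data file, column names, and variance-component structure are our assumptions, not the pre-registered specification:

```python
# Sketch of a mixed-effects logistic regression on trial-level correctness.
# Fixed effects: proximity, number of candidates, view angle; random
# intercepts for prompt and object combination. The file and column names
# are hypothetical placeholders.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("trial_level_responses.csv")  # hypothetical data file

model = BinomialBayesMixedGLM.from_formula(
    "correct ~ proximity + n_objects + view_angle",  # fixed effects
    {"prompt": "0 + C(prompt)",                      # random intercept: prompt
     "objects": "0 + C(object_combo)"},              # random intercept: objects
    df,
)
result = model.fit_vb()  # variational Bayes approximation
print(result.summary())
```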
This work uses gaze as a proxy to demonstrate how controlled studies can look behind benchmark scores (which would have marked these models as random guessers) and reductionist labels ("approximate retrievers") to investigate behavioral patterns, constrain hypotheses (the models are unlikely to be merely guessing), and generate new hypotheses (that they may rely primarily on head direction rather than eye direction) that are subject to further mechanistic investigation.
@article{vlmGaze2025,
  title={Can Vision Language Models Infer Human Gaze Direction? A Controlled Study},
  author={Zhang, Zory and Feng, Pinyuan and Wang, Bingyang and Zhao, Tianwei and Yu, Suyang and Gao, Qingying and Deng, Hokin and Ma, Ziqiao and Li, Yijiang and Luo, Dezhi},
  year={2025},
  eprint={2506.05412},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.05412},
}