👁️ Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

Preprint under review

1Brown University, 2Columbia University, 3Emory University, 4Johns Hopkins University,
5University of Washington, 6Carnegie Mellon University, 7University of Michigan, 8UC San Diego

Co-leading, *Co-advising

GrowAI Team | growing-ai-like-a-child.github.io

TL;DR: On a task requiring inference of human gaze direction, our controlled study reveals a substantial performance gap between top-tier Vision-Language Models (VLMs) and humans, along with behavioral patterns in VLM responses that suggest they are not simply guessing.

Abstract

Gaze-referential inference—the ability to infer what others are looking at—is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill in 111 Vision Language Models (VLMs) using photographs with systematically manipulated difficulty and variability, compared their performance with that of human participants (N = 65), and analyzed behavior using mixed-effects models.

We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. The VLMs even chose each option almost equally often. Were they randomly guessing? Although most VLMs struggled, when we zoomed in on five top-tier VLMs with above-chance performance, we found that their performance declined with increasing task difficulty yet varied only slightly across prompts and scene objects. These behavioral patterns cannot be explained by treating them as random guessers. Instead, they likely use a combination of heuristics and guessing, such that their performance is sensitive to task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, are not yet ready to serve as technologies that interact naturally with humans, though the potential remains.

🔍 Key Findings

Massive Performance Gap

94 of 111 VLMs performed no better than random guessing (~42%), while humans achieved ~91% accuracy. The VLMs chose every possible option almost equally often, and even top-tier VLMs such as GPT-4o reached only ~50% accuracy. Are they simply guessing randomly?

Scaling Limitations

VLM size and release date show minimal correlation with performance (both R² < 0.03), suggesting fundamental architectural limitations rather than insufficient scale. Their error patterns also differ markedly from those of humans and do not become more human-like as VLMs grow larger (R² < 0.1).

Task Difficulty Effects

The baseline-corrected performance of the 5 top-tier VLMs degrades significantly as objects get closer together and as the number of candidate objects grows, but surprisingly shows no view-angle effect (whereas humans do, p < 0.001). In effect, their performance moves closer to random guessing as difficulty increases.

Not Random Guessing

Task Difficulty Effects suggest meaningful computation rather than random responses. Top-tier VLMs likely employ heuristics (or approximations) that work to some extent for easier cases but break down under challenging conditions, an indication of partial understanding of gaze direction and spatial relationships.

Controlled Study & Stimuli Examples

Human Survey Interfaces (via jsPsych + Prolific)

📊 Analyses

Analysis A: Overall Performance Comparison

  • ~42%: Random-guessing baseline (expected chance performance)
  • 94/111: VLMs at chance level (failed a statistical significance test against the baseline)
  • ~91%: Human performance (near-ceiling accuracy)
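
To make the chance baseline concrete, here is a minimal sketch (not the paper's exact procedure) of how an expected-chance accuracy follows from the mix of trials with different numbers of candidate objects, and how a single model's accuracy can be tested against it with a one-sided binomial test. The trial counts and the correct-response count below are placeholders, not the study's numbers.

```python
from scipy.stats import binomtest

# Placeholder trial counts per condition: number of candidate objects -> number of trials.
# The actual stimulus mix is what yields the reported ~42% baseline; these counts are illustrative.
trials_per_condition = {2: 40, 3: 40, 4: 20}

total = sum(trials_per_condition.values())
# Expected chance accuracy = weighted average of 1/k over the trial mix.
chance = sum((n / total) * (1.0 / k) for k, n in trials_per_condition.items())
print(f"chance baseline: {chance:.3f}")

# One-sided binomial test: is this (hypothetical) model's correct count above chance?
n_correct = 48
result = binomtest(n_correct, n=total, p=chance, alternative="greater")
print(f"accuracy: {n_correct / total:.3f}, p = {result.pvalue:.4f}")
```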

🔍 Confusion Matrix Analysis

Confusion matrices comparison

Humans show strong diagonal patterns, indicating consistent correct responses, while VLMs produce nearly uniform distributions across all choices, a pattern consistent with random guessing and the source of our initial speculation: are they guessing?
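
As an illustration of this diagonal-versus-uniform contrast, the sketch below builds confusion matrices for two simulated responders (the data and accuracy levels are synthetic, not the study's) and checks whether the marginal response distribution is distinguishable from uniform.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
k, n = 4, 200  # hypothetical 4-choice trials
targets = rng.integers(0, k, size=n)
vlm_like = rng.integers(0, k, size=n)                    # guess-like: picks options uniformly
human_like = np.where(rng.random(n) < 0.9, targets,
                      rng.integers(0, k, size=n))        # mostly correct

def confusion(targets, responses, k):
    """Rows index the true target object, columns the chosen object."""
    mat = np.zeros((k, k), dtype=int)
    np.add.at(mat, (targets, responses), 1)
    return mat

for name, resp in [("VLM-like", vlm_like), ("human-like", human_like)]:
    mat = confusion(targets, resp, k)
    accuracy = np.trace(mat) / mat.sum()
    # Chi-square test of the response counts against a uniform distribution over choices.
    _, p_uniform = chisquare(np.bincount(resp, minlength=k))
    print(f"{name}: accuracy = {accuracy:.2f}, uniform-response p = {p_uniform:.3f}")
```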

📈 Scaling Analysis

Neither VLM size nor release date correlates meaningfully with accuracy (both R² < 0.03), suggesting core architectural limitations.
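
The computation behind such R² figures is an ordinary least-squares fit of accuracy against (log) model size or release date; the sketch below uses made-up numbers purely to show the procedure.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical (parameter count in billions, accuracy) pairs for a handful of VLMs.
params_b = np.array([7, 9, 13, 34, 72, 110, 400], dtype=float)
accuracy = np.array([0.41, 0.43, 0.40, 0.44, 0.45, 0.42, 0.50])

# Regress accuracy on log10(model size); rvalue**2 is the R^2 reported in such analyses.
fit = linregress(np.log10(params_b), accuracy)
print(f"slope = {fit.slope:.3f}, R^2 = {fit.rvalue ** 2:.3f}")
```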

Analysis B: Behavioral Patterns in Top-Tier VLMs and Humans

We conducted a pre-registered study focusing on five top-performing VLMs (GPT-4o, Gemini 1.5 Pro, InternLM-XComposer2-vl-7b, Qwen2.5-VL-72B-Instruct, and GLM-4V-9B) and humans to understand their behavioral patterns using mixed-effects logistic regression models. To correct for the varying chance baseline induced by different numbers of choices, we analyze not raw accuracy but the ratio between the odds of a correct response and the odds at chance (a simplified sketch of this correction appears below the figure). We found that performance moves closer to the baseline as difficulty increases; the hypothesis that heuristics used in easier cases break down in harder ones stems from this degradation pattern.

The estimated marginal means. The random-guessing baseline is indicated by dashed lines.
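
One standard way to implement the baseline correction described above is a logistic regression with an offset of logit(chance), so that every coefficient describes a log odds ratio relative to the per-trial chance baseline. The sketch below is a simplified, fixed-effects-only version on simulated data (the paper fits mixed-effects models, which additionally include random effects, e.g., for prompts and object combinations); all variable names and effect magnitudes here are placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated trial-level data for one model: difficulty factors plus the number of
# candidate objects, which determines the chance baseline on each trial.
rng = np.random.default_rng(1)
n = 600
df = pd.DataFrame({
    "n_choices": rng.choice([2, 3, 4], size=n),
    "proximity": rng.choice(["far", "medium", "near"], size=n),
    "view": rng.choice(["frontal", "side"], size=n),
})
chance = 1.0 / df["n_choices"]
# Correctness somewhat above chance, degrading as objects get closer together.
boost = df["proximity"].map({"far": 0.20, "medium": 0.10, "near": 0.02})
df["correct"] = (rng.random(n) < np.clip(chance + boost, 0, 1)).astype(int)

# Offsetting by logit(chance) makes the coefficients log odds ratios relative to
# the random-guessing baseline, i.e., a baseline-corrected performance measure.
df["offset"] = np.log(chance / (1.0 - chance))
model = smf.glm("correct ~ C(proximity) + C(view) + C(n_choices)",
                data=df, family=sm.families.Binomial(), offset=df["offset"])
print(model.fit().summary())
```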

📐 Proximity Effect

Significant for 4/5 VLMs and humans: baseline-corrected performance degrades as objects get closer together (p < 0.05 for Gemini, GPT-4o, GLM-4V, and InternLM; p = 0.15 for Qwen).

🎯 Choice Effect

Significant for all 5 VLMs and humans: baseline-corrected performance drops dramatically (p < 0.001) as the number of candidate objects increases from 2 to 3 and 4, with effect sizes ranging from -0.165 to -0.285.

👁️ View Effect

Human baseline-corrected performance degrades when viewing the gazer's side profile (p < 0.001), whereas VLMs show no significant effect.

Key Insight: VLMs may rely on head orientation rather than eye-gaze direction, making them less sensitive to side views, which increase the geometric ambiguity of eye direction and hinder counterfactual reasoning in human observers (who do show a View effect).

🔄 Sensitivity Analysis

While the number of gazers (2) is too small to draw conclusions about sensitivity to the gazer, variance across prompts is small and variance across object combinations is nearly zero.

Implication: The inferences VLMs make depend on task difficulty yet are robust to surface-level perceptual variations, such as object visual features and background distractors, suggesting meaningful computation rather than random guessing.
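
A rough descriptive check of this kind of robustness (not the paper's variance-component estimates, which come from the mixed-effects model's random effects) is to compare accuracy across prompt and object-combination groupings; everything below is simulated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 800
df = pd.DataFrame({
    "prompt": rng.choice([f"prompt_{i}" for i in range(4)], size=n),
    "object_combo": rng.choice([f"combo_{i}" for i in range(10)], size=n),
    "correct": (rng.random(n) < 0.5).astype(int),
})

# Spread of group-level accuracy; a mixed model would separate true between-group
# variance from the binomial sampling noise that inflates these raw standard deviations.
for factor in ["prompt", "object_combo"]:
    acc = df.groupby(factor)["correct"].mean()
    print(f"{factor}: mean accuracy = {acc.mean():.3f}, sd across levels = {acc.std():.3f}")
```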

💡 Implications & Future Work

Implications for AI Development

  • 1. Gaze Inference Deficits: The problem may lie in failing to pick up on eye gaze when inferring looking direction.
  • 2. Theory of Mind and Human-AI Interaction: VLMs lack the fundamental gaze inference abilities that bootstrap human social cognition, and they are not yet ready for natural collaboration requiring social understanding.
  • 3. Beyond Scaling: Current paradigms may be insufficient; more parameters won't close this gap.

Open Questions

  • 1. What is the mechanism of approximation in top-tier VLMs, and how does it emerge from the data?
  • 2. What roles do VLMs assign to gaze (e.g., the referential nature of gaze, using gaze to resolve conversational referent ambiguity)?
  • 3. To have AI that learns like humans and acquires Theory of Mind from naturalistic data, will a curriculum that encourages early development of gaze-referential understanding be helpful?

Methodological Contribution

This work uses gaze as a proxy to demonstrate how controlled studies can help us look past benchmark scores (which would have marked these models as random guessers) and reductionist labels (“approximate retrievers”) to investigate behavioral patterns, constrain hypotheses (the models are unlikely to be just guessing), and generate new hypotheses (e.g., that they may rely primarily on head direction rather than eye direction) that are open to further mechanistic investigation.

BibTeX

@article{vlmGaze2025,
  title={Can Vision Language Models Infer Human Gaze Direction? A Controlled Study},
  author={Zhang, Zory and Feng, Pinyuan and Wang, Bingyang and Zhao, Tianwei and Yu, Suyang and Gao, Qingying and Deng, Hokin and Ma, Ziqiao and Li, Yijiang and Luo, Dezhi},
  year={2025},
  eprint={2506.05412},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.05412},
}