Exploring and Improving Multimodal Interactive Intelligence

Frontier models such as Gemini-3-Pro and GPT-5 match or exceed human performance on elite competition benchmarks in mathematics, programming, and scientific reasoning. Yet the same models fail more than 90% of the time when asked to solve a simple 3×3 visual jigsaw puzzle through interaction. This gap exposes a fundamental weakness in visual interaction and exploration, capabilities essential for autonomous agents and robotics.

VisGym is a suite of 17 diverse, customizable, and scalable interactive environments for evaluating and training visual interaction and exploration. Our results show that competition-level reasoning alone is insufficient for robust visual agents, and that progress in multimodal intelligence requires rethinking how models explore, act, and learn from visual feedback.
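To make the interaction pattern concrete, here is a minimal, self-contained sketch of an episodic loop over a toy 3×3 jigsaw. Everything in it is illustrative rather than VisGym's actual interface: the `JigsawEnv` class, its reset/step signature, and the scripted agent are our assumptions, and the observation here is a tuple of tile indices where a real visual environment would return rendered images.

```python
import random


class JigsawEnv:
    """Toy stand-in for a visual jigsaw environment (hypothetical, not
    VisGym's real API). The observation is a tuple of tile indices in
    grid order; an action (i, j) swaps the tiles at positions i and j.
    The puzzle is solved when tile k sits at position k for every k.
    """

    def __init__(self, size=3, seed=0):
        self.size = size
        self.rng = random.Random(seed)
        self.grid = []

    def reset(self):
        # Scramble the tiles and return the initial observation.
        n = self.size * self.size
        self.grid = list(range(n))
        self.rng.shuffle(self.grid)
        return tuple(self.grid)

    def step(self, action):
        # Apply one swap, then report (observation, reward, done).
        i, j = action
        self.grid[i], self.grid[j] = self.grid[j], self.grid[i]
        done = self.grid == sorted(self.grid)
        reward = 1.0 if done else 0.0
        return tuple(self.grid), reward, done


def scripted_agent(obs):
    """Move the first misplaced tile straight to its home position."""
    for pos, tile in enumerate(obs):
        if tile != pos:
            return (pos, obs.index(pos))  # swap home slot with tile's slot
    return (0, 0)  # already solved; harmless no-op


env = JigsawEnv(size=3, seed=0)
obs = env.reset()
done = list(obs) == sorted(obs)
steps = 0
while not done:
    obs, reward, done = env.step(scripted_agent(obs))
    steps += 1
print(f"solved in {steps} swaps")  # at most size * size - 1 swaps
```

The point of the sketch is the loop structure, observe, act, receive visual feedback, repeat, which is exactly the capability that interactive evaluation exercises and that single-shot benchmark reasoning does not.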
Contributors
Zirui (Colin) Wang*, Junyi Zhang*, Jiaxin Ge*, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, Xudong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez
Publications
CoRR – VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents