Exploring and Improving Multimodal Interactive Intelligence

Frontier models such as Gemini-3-Pro and GPT-5 match or exceed human performance on elite competition benchmarks in mathematics, programming, and scientific reasoning. Yet the same models fail more than 90% of the time when asked to solve a simple 3×3 visual jigsaw puzzle through interaction. This gap exposes a fundamental weakness in visual interaction and exploration, capabilities essential for autonomous agents and robotics.

VisGym is a suite of 17 diverse, customizable, and scalable interactive environments for evaluating and training visual interaction and exploration. Our results show that competition-level reasoning alone is insufficient for robust visual agents, and that progress in multimodal intelligence requires rethinking how models explore, act, and learn from visual feedback.
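To make the interaction pattern concrete, here is a minimal, self-contained sketch of an episodic loop over a toy 3×3 jigsaw. Everything in it is illustrative rather than VisGym's actual interface: the `JigsawEnv` class, its reset/step signature, and the scripted agent are our assumptions, and the observation here is a tuple of tile indices where a real visual environment would return rendered images.

```python
import random


class JigsawEnv:
    """Toy stand-in for a visual jigsaw environment (hypothetical, not
    VisGym's real API). The observation is a tuple of tile indices in
    grid order; an action (i, j) swaps the tiles at positions i and j.
    The puzzle is solved when tile k sits at position k for every k.
    """

    def __init__(self, size=3, seed=0):
        self.size = size
        self.rng = random.Random(seed)
        self.grid = []

    def reset(self):
        # Scramble the tiles and return the initial observation.
        n = self.size * self.size
        self.grid = list(range(n))
        self.rng.shuffle(self.grid)
        return tuple(self.grid)

    def step(self, action):
        # Apply one swap, then report (observation, reward, done).
        i, j = action
        self.grid[i], self.grid[j] = self.grid[j], self.grid[i]
        done = self.grid == sorted(self.grid)
        reward = 1.0 if done else 0.0
        return tuple(self.grid), reward, done


def scripted_agent(obs):
    """Move the first misplaced tile straight to its home position."""
    for pos, tile in enumerate(obs):
        if tile != pos:
            return (pos, obs.index(pos))  # swap home slot with tile's slot
    return (0, 0)  # already solved; harmless no-op


env = JigsawEnv(size=3, seed=0)
obs = env.reset()
done = list(obs) == sorted(obs)
steps = 0
while not done:
    obs, reward, done = env.step(scripted_agent(obs))
    steps += 1
print(f"solved in {steps} swaps")  # at most size * size - 1 swaps
```

The point of the sketch is the loop structure, observe, act, receive visual feedback, repeat, which is exactly the capability that interactive evaluation exercises and that single-shot benchmark reasoning does not.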
Contributors
Zirui (Colin) Wang*, Junyi Zhang*, Jiaxin Ge*, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, Xudong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez
Publications
CoRR – VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents