SVE – UC Berkeley Sky Computing Lab

Stateful Visual Encoders for Vision-Language Models

Humans compare images by looking back and forth between them. Many open-weight vision-language models (VLMs) instead encode each image independently and defer comparison to the language model. We introduce SVE (Stateful Visual Encoders for Vision-Language Models), in which the visual encoder itself becomes change-aware.

GitHub

Website

Contributors

Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

Publications

CoRR – Stateful Visual Encoders for Vision-Language Models