SVE

Stateful Visual Encoders for Vision-Language Models

Humans compare images by looking back and forth. Many open-weight VLMs instead encode each image independently and defer comparison to the language model (LM). We introduce SVE (Stateful Visual Encoders for Vision-Language Models), in which the visual encoder itself becomes change-aware.


Contributors

Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell