Retrospective Verification and Self-Correction

Vision-Language Models (VLMs) have made huge strides on tasks such as image captioning and visual question answering. However, they still suffer from hallucinations, generating descriptions of nonexistent objects or concepts.
Previous approaches generally fall into two paradigms:
- Generation Adjustment: This method aims to improve the alignment of textual outputs with visual inputs by modifying the VLM’s generation process. This can be done either in a training-free manner (adjusting logits at decoding time) or through a training-based approach (introducing additional supervision signals or custom objective functions).
- Post-hoc Verification: This method introduces large external models (e.g., GPT-4) to evaluate and verify outputs after generation.
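To make the training-free flavor of generation adjustment concrete, here is a minimal sketch of contrastive-style logit adjustment at decoding time. The function name, the `alpha` parameter, and the two-pass setup (logits computed with and without the image) are illustrative assumptions, not a specific method's implementation:

```python
def adjust_logits(with_image, without_image, alpha=0.5):
    """Illustrative decoding-time logit adjustment (assumed form).

    Tokens the model would predict even without seeing the image are
    penalized, boosting tokens that are actually grounded in the visual
    input: adjusted = (1 + alpha) * l_image - alpha * l_no_image.
    """
    return [(1 + alpha) * a - alpha * b
            for a, b in zip(with_image, without_image)]

# Toy vocabulary of 3 tokens: token 2 scores high only when the image
# is present, so the adjustment amplifies its lead.
with_img = [1.0, 2.0, 4.0]
no_img = [1.0, 2.0, 1.0]
adjusted = adjust_logits(with_img, no_img, alpha=0.5)
# adjusted == [1.0, 2.0, 5.5]
```

The key design point is that no retraining is needed: the adjustment operates purely on the two logit vectors at each decoding step.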
However, generation adjustment methods struggle to correct erroneous tokens once they have been generated and do not leverage retrospective reasoning to assess output quality. Post-hoc verification, on the other hand, is computationally expensive and often produces generic refusals rather than targeted corrections.
REVERSE: REtrospective VERification and SElf-correction
To address both limitations, we introduce REVERSE (REtrospective VERification and SElf-correction), the first framework to integrate generation adjustment with online post-hoc verification within a single VLM architecture. REVERSE detects, backtracks, and corrects hallucinations during the decoding process itself, rather than after generation completes.
Contributors
Tsung-Han Wu, Heekyung Lee, Jiaxin Ge, Joseph E. Gonzalez, Trevor Darrell, David M. Chan