Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems that require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling.
To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain-of-Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning the intermediate embeddings of the MLLM with those from vision encoders.
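The alignment objective can be sketched as a cosine-distance loss between projected MLLM hidden states and vision-encoder features. This is a minimal illustrative sketch: the function names, tensor shapes, linear projection, and the choice of cosine distance are all assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np


def alignment_loss(mllm_hidden, vision_feats, proj):
    """Mean (1 - cosine similarity) between projected MLLM hidden states
    and vision-encoder embeddings. Shapes and the projection matrix are
    illustrative assumptions, not the paper's exact recipe."""
    # Project MLLM hidden states (T, d_llm) into the vision space (T, d_vis).
    z = mllm_hidden @ proj
    # L2-normalize both sets of token embeddings.
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    v = vision_feats / np.linalg.norm(vision_feats, axis=-1, keepdims=True)
    # Cosine-distance loss, averaged over tokens: 0 when perfectly aligned.
    return float(np.mean(1.0 - np.sum(z * v, axis=-1)))


# Toy check with random data (dimensions are arbitrary).
rng = np.random.default_rng(0)
T, d_llm, d_vis = 8, 16, 12
h = rng.normal(size=(T, d_llm))
W = rng.normal(size=(d_llm, d_vis))
v = h @ W  # perfectly aligned case
print(round(alignment_loss(h, v, W), 6))  # → 0.0
```

In training, a loss of this form would be minimized jointly with the usual language-modeling objective, so that the hidden states feeding each reasoning step stay anchored to the vision encoders' representations.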
Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR significantly improves performance on VSI-Bench from 33.0% to 52.9%, a 19.9 percentage-point gain over Qwen2.5-VL. Code is available on the project page.
VaLR exhibits consistent performance improvements as reasoning length increases.
VaLR scales with data size. Compared to vanilla SFT, VaLR reaches comparable performance on the V* benchmark with >20x less training.
We evaluate VaLR on VSI-Bench, a 3D VQA benchmark with multi-view images. VaLR-S and VaLR-M denote the single-encoder (DINOv3)-aligned model and the multi-encoder (DINOv3, SigLIPv2, $\pi^3$)-aligned model, respectively. VaLR outperforms all baselines by a large margin, showing robust reasoning capabilities in long-context scenarios.
VaLR outperforms all baselines on perception benchmarks including BLINK, MMVP, MMStar, V*, and CVBench, preserving visual knowledge during moderate-context reasoning.
@article{jeon2026vision,
title={Vision-aligned Latent Reasoning for Multi-modal Large Language Model},
author={Jeon, Byungwoo and Jeong, Yoonwoo and Lee, Hyunseok and Cho, Minsu and Shin, Jinwoo},
journal={arXiv preprint arXiv:2602.04476},
year={2026}
}