Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data. As a result, they often fail to capture 3D spatial relationships between objects and backgrounds in the real world, which constrains their effectiveness in many downstream applications.
To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea is to convert dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding.
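The multi-turn CoT process described above could be sketched as follows. This is a minimal illustrative sketch only: all function names, message formats, and example spatial facts are our own assumptions, not the paper's actual implementation.

```python
# Illustrative sketch: turning dense spatial facts into a multi-turn
# Chain-of-Thought transcript that moves from pixel-level measurements
# toward a scene-level summary. All names here are hypothetical.

def depth_to_text(depth_cm: float) -> str:
    """Convert a pixel-level depth estimate into a linguistic description."""
    return f"The nearest surface is about {depth_cm / 100:.1f} m from the camera."

def build_cot_turns(pixel_facts, object_facts, scene_facts):
    """Build a multi-turn transcript: each turn conditions on the previous
    one and raises the level of abstraction (pixel -> object -> scene)."""
    return [
        {"role": "user", "content": "Describe the depth structure of this image."},
        {"role": "assistant", "content": " ".join(pixel_facts)},
        {"role": "user", "content": "Using that, describe the spatial relations between objects."},
        {"role": "assistant", "content": " ".join(object_facts)},
        {"role": "user", "content": "Now summarize the 3D layout of the whole scene."},
        {"role": "assistant", "content": " ".join(scene_facts)},
    ]

turns = build_cot_turns(
    pixel_facts=[depth_to_text(120)],
    object_facts=["The chair is in front of the table."],
    scene_facts=["An indoor scene with all furniture within 3 m of the camera."],
)
```

The resulting transcript could then serve as the language-side supervision when distilling spatial knowledge into the vision encoder through the LLM.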
To validate its effectiveness, we apply SpatialBoost to state-of-the-art vision encoders such as DINOv3 and evaluate the performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 from 55.9 to 59.7 mIoU on ADE20K, a 3.8-point gain over the pre-trained DINOv3 and a new state of the art.
We construct hierarchical reasoning data with spatial knowledge ranging from the pixel level to the scene level.
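The pixel-to-scene hierarchy could be organized as sketched below. The annotation schema and level names are illustrative assumptions; the paper's actual data format may differ.

```python
# Hypothetical sketch of ordering spatial annotations so that reasoning
# proceeds from pixel-level facts, through object-level relations, to a
# scene-level summary. Field names ("level", "text") are assumptions.

HIERARCHY = ["pixel", "object", "scene"]

def order_reasoning_steps(annotations):
    """Sort annotations into the hierarchical order pixel -> object -> scene."""
    rank = {level: i for i, level in enumerate(HIERARCHY)}
    return sorted(annotations, key=lambda a: rank[a["level"]])

sample = [
    {"level": "scene", "text": "A kitchen with the counter on the left."},
    {"level": "pixel", "text": "Depth at the image center is 1.5 m."},
    {"level": "object", "text": "The kettle sits on the counter."},
]
steps = order_reasoning_steps(sample)
```

Ordering the steps this way mirrors the multi-turn CoT process: each reasoning stage builds on the denser facts produced before it.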
SpatialBoost enhances the visual representation on explicit spatial tasks: the performance of DINOv2 with SpatialBoost (0.25) on NYUd matches that of DINOv3 (0.25).
SpatialBoost enhances the visual representation on spatially related tasks: the performance of DINOv3 rises from 55.9 to 59.7 mIoU, achieving state-of-the-art results.
We evaluate SpatialBoost on the 3D benchmark Lexicon3D, which consists of diverse 3D-centric tasks such as Vision-Language Reasoning (VLR), Visual Grounding (VG), Geometry Understanding (GU), and 3D Semantic Understanding (3D SU). SpatialBoost outperforms the baselines on all 3D-centric tasks in Lexicon3D.
SpatialBoost enhances the visual representation for robot learning, which requires spatial knowledge. With SpatialBoost, OpenCLIP (70.5) and DINOv2 (75.8) are on par with the state-of-the-art vision encoders SigLIPv2 (69.7) and DINOv3 (72.8), respectively.
SpatialBoost enhances the general knowledge of the visual representation. We evaluate SpatialBoost on image classification and retrieval tasks; the consistent performance gains on these tasks indicate that SpatialBoost is not overfit to specific 3D knowledge.
SpatialBoost scales with data size: performance on monocular depth estimation and semantic segmentation improves as the amount of training data grows.
@article{jeon2026spatialboost,
title={SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning},
author={Jeon, Byungwoo and Kim, Dongyoung and Jang, Huiwon and Kim, Insoo and Shin, Jinwoo},
journal={arXiv preprint arXiv:2603.22057},
year={2026}
}