Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they often fail to learn 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. We attribute this to the limited availability of large-scale 3D training data, which makes it difficult for current image representation learning approaches to capture spatial relationships. This motivates the need for learning paradigms that provide strong supervision while requiring less data.
To address this, we propose a novel learning framework, SpatialBoost, which enhances the spatial awareness of existing pre-trained vision encoders by injecting dense 3D spatial knowledge expressed in linguistic form. Specifically, the core idea is to convert dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject this spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding.
To validate its effectiveness, we apply SpatialBoost to state-of-the-art vision encoders such as DINOv3 and evaluate the resulting gains on a wide range of benchmarks requiring both 3D perception and general vision abilities.
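As one concrete reading of the paragraph above, the sketch below shows how linguistically expressed spatial knowledge could be injected into a vision encoder through a language model: the encoder's patch features are projected into the LLM's token space, the multi-turn CoT text supervises next-token prediction, and the language loss back-propagates into the encoder. Every module here (`vision_encoder`, `projector`, the tiny transformer standing in for the LLM) and the function `cot_injection_step` are illustrative stand-ins introduced for this sketch, not the released SpatialBoost implementation.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: in practice a pre-trained backbone (e.g., DINOv2/DINOv3)
# and a pre-trained LLM would replace these tiny, randomly initialized modules.
vision_encoder = nn.Sequential(nn.Conv2d(3, 64, 16, 16), nn.Flatten(2))  # B x 64 x N patches
projector = nn.Linear(64, 128)            # maps patch features into the LLM's hidden width
llm = nn.TransformerEncoder(nn.TransformerEncoderLayer(128, 4, batch_first=True), num_layers=2)
lm_head = nn.Linear(128, 1000)            # toy vocabulary of 1000 text tokens
text_embed = nn.Embedding(1000, 128)
for frozen in (llm, lm_head, text_embed): # only the visual side is updated in this sketch
    frozen.requires_grad_(False)

def cot_injection_step(image, cot_tokens, optimizer):
    """One training step: multi-turn CoT token ids carrying spatial knowledge
    supervise next-token prediction conditioned on the encoder's patch features,
    so the language loss back-propagates into the vision encoder."""
    feats = vision_encoder(image).transpose(1, 2)      # B x N x 64 patch features
    vis = projector(feats)                             # B x N x 128 visual tokens in LLM space
    txt = text_embed(cot_tokens[:, :-1])               # teacher forcing on the CoT text
    hidden = llm(torch.cat([vis, txt], dim=1))
    logits = lm_head(hidden[:, vis.size(1):])          # predictions at the text positions
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), cot_tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random data in place of real images and tokenized CoT conversations.
opt = torch.optim.AdamW(
    list(vision_encoder.parameters()) + list(projector.parameters()), lr=1e-4)
print(cot_injection_step(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 32)), opt))
```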
 
We construct hierarchical reasoning data with spatial knowledge, from the pixel level to the scene level.
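To make this caption concrete, here is a minimal sketch of what such hierarchical reasoning data could look like when derived from a depth map and an instance segmentation: pixel-level relative-depth statements, object-level distance statements, and a scene-level layout summary, packaged as conversation turns. The function name, conversation format, and the particular three levels are assumptions made for illustration, not the paper's released data pipeline.

```python
import numpy as np

def build_spatial_cot(depth: np.ndarray, seg: np.ndarray, labels: dict, seed: int = 0):
    """Turn dense spatial signals into (question, answer) turns that progress
    from pixel-level to scene-level. `depth` is an HxW depth map, `seg` an HxW
    instance-id map, and `labels` maps instance ids to class names."""
    rng = np.random.default_rng(seed)
    h, w = depth.shape
    turns = []

    # Pixel level: relative depth of two randomly sampled pixels.
    (y1, x1), (y2, x2) = rng.integers(0, [h, w], size=(2, 2))
    rel = "closer to" if depth[y1, x1] < depth[y2, x2] else "farther from"
    turns.append(("Which of the two marked pixels is nearer to the camera?",
                  f"Pixel ({x1},{y1}) is {rel} the camera than pixel ({x2},{y2})."))

    # Object level: mean depth of every segmented object.
    mean_depth = {}
    for obj_id, name in labels.items():
        mask = seg == obj_id
        if mask.any():
            mean_depth[name] = float(depth[mask].mean())
            turns.append((f"How far away is the {name}?",
                          f"The {name} is roughly {mean_depth[name]:.1f} m from the camera."))

    # Scene level: coarse near-to-far layout aggregated over the objects above.
    layout = ", ".join(sorted(mean_depth, key=mean_depth.get))
    turns.append(("Describe the scene layout from near to far.",
                  f"From near to far: {layout}."))
    return turns

# Toy usage: a synthetic 4x4 scene with two objects.
depth = np.array([[1.0, 1.2, 3.0, 3.1],
                  [1.1, 1.3, 3.2, 3.0],
                  [1.0, 1.1, 2.9, 3.3],
                  [1.2, 1.0, 3.1, 3.2]])
seg = np.array([[1, 1, 2, 2]] * 4)
for q, a in build_spatial_cot(depth, seg, {1: "chair", 2: "bookshelf"}):
    print(q, "->", a)
```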
SpatialBoost enhances the visual representation on an explicit spatial task, monocular depth estimation: on NYUd, DINOv2 with SpatialBoost (0.25) matches DINOv3 (0.25).
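For context on how such numbers are typically obtained, the snippet below sketches a frozen-feature linear depth probe in the style of common NYUd evaluations: a 1x1 convolution regresses per-patch depth from frozen encoder features, and RMSE is computed against ground truth. The probe design, the feature shapes, and the assumption that the 0.25 figures are RMSE are ours for illustration and may differ from the paper's exact protocol.

```python
import torch
import torch.nn as nn

class LinearDepthProbe(nn.Module):
    """Lightweight head on top of a *frozen* encoder: a 1x1 conv predicts one
    depth value per patch, which is upsampled back to image resolution."""
    def __init__(self, feat_dim: int, patch: int = 16):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        self.up = nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: B x C x H/patch x W/patch from the frozen encoder
        return self.up(self.head(patch_feats)).squeeze(1)   # B x H x W depth map

def rmse(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Root-mean-square error, a standard NYUd depth metric (lower is better)."""
    return torch.sqrt(((pred - gt) ** 2).mean())

# Toy check with random tensors standing in for NYUd features and ground truth.
probe = LinearDepthProbe(feat_dim=768)
feats = torch.randn(2, 768, 14, 14)          # e.g., ViT-B patch features at 224x224 input
print(rmse(probe(feats), torch.rand(2, 224, 224)).item())
```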
 
          
SpatialBoost enhances the visual representation on spatially related tasks. On semantic segmentation, the performance of DINOv3 rises from 55.9 to 59.7 mIoU, achieving state-of-the-art results.
 
            
We evaluate SpatialBoost on the 3D benchmark Lexicon3D, which consists of diverse 3D-centric tasks such as Vision-Language Reasoning (VLR), Visual Grounding (VG), Geometry Understanding (GU), and 3D Semantic Understanding (3D SU). SpatialBoost improves performance on all 3D-centric tasks in Lexicon3D.
 
        
SpatialBoost enhances the visual representation on robot learning tasks that require spatial knowledge. With SpatialBoost, OpenCLIP (70.5) and DINOv2 (75.8) are on par with the state-of-the-art vision encoders SigLIPv2 (69.7) and DINOv3 (72.8), respectively.
 
SpatialBoost enhances the general knowledge of the visual representation. We evaluate SpatialBoost on image classification and retrieval tasks; consistent performance gains on these tasks indicate that SpatialBoost does not overfit to specific 3D knowledge.
 
SpatialBoost scales with data size. Performance on monocular depth estimation and semantic segmentation improves as the amount of training data grows.
 
      @misc{jeon2025spatialboost,
  author    = {Jeon, Byungwoo and Kim, Dongyoung and Jang, Huiwon and Kim, Insoo and Shin, Jinwoo},
  title     = {SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning},
  journal   = {arXiv},
  year      = {2025},
}