Publication Type : Journal Article
Publisher : IEEE
Source : 2025 IEEE International Conference on Computer Vision and Machine Intelligence (CVMI)
Url : https://doi.org/10.1109/cvmi66673.2025.11337662
Campus : Coimbatore
School : School of Physical Sciences
Department : Mathematics
Year : 2025
Abstract : Spatial relationship understanding is an asset of Vision-Language Models (VLMs) in real-world tasks, e.g., robotic grasping and self-driving navigation. Existing VLMs trained only on RGB images are limited in spatial relationship reasoning because they lack depth perception. In this paper, we overcome this limitation by incorporating Monocular Depth Estimation (MDE) in fine-tuning VLMs. We employ three state-of-the-art MDE models (ZoeDepth, Depth Anything V2, and DepthPro) to generate depth maps of a large variety of images from SpatialQA. The depth-enhanced images are utilized to fine-tune the Mini-InternVL-1.5 model, a lightweight VLM with 2 billion parameters. The spatial reasoning abilities of the base and fine-tuned models are compared on the SpatialBench benchmark, varying the depth estimation model to study which yields better spatial reasoning. We find that fine-tuning using depth information significantly enhances spatial awareness, particularly in counting, object existence, and reachability tasks. Of the three MDE models, ZoeDepth consistently yields the best performance gains. These findings highlight the importance of incorporating depth cues in the training pipelines of VLMs to unlock their full potential in spatial relationship tasks.
Cite this Research Publication : Karthik Prasad G, Murali Krishna Panthangi, Enhancing Spatial Reasoning in Vision-Language Models via Monocular Depth Estimation: A Comparative Study on SpatialBench, 2025 IEEE International Conference on Computer Vision and Machine Intelligence (CVMI), IEEE, 2025, https://doi.org/10.1109/cvmi66673.2025.11337662