A Vision-Language-Action Model Approach for Geospatially Guided Autonomous Navigation

Authors

  • Deepthi Kumar Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • Tyler Treat Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • James Gallagher Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • Edward Oughton Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2025.5279

Abstract

Autonomous navigation in dynamic environments requires robust computer vision to ensure obstacle avoidance along precomputed paths. In geospatial robotics, understanding visual input is essential for making route decisions, avoiding collisions, and reaching desired waypoints. While single-model approaches such as YOLOv5 offer object detection, they often struggle with ambiguous or low-confidence visual inputs, especially in cluttered spaces. This project builds on Google's PaLM-SayCan and the vision-language-action model (VLAM) paradigm, applying their core idea of grounding language models in decision-making to improve perception and precision using compact multimodal models. Furthermore, existing vision systems often lack the contextual understanding and adaptive decision-making needed for human-centered environments. Instead of relying on large-scale LLMs, we explore the use of lightweight, robust models, most notably SmolVLM2, fine-tuned on COCO (Common Objects in Context) images for better contextual reasoning in visual scenes. Our multi-model perception pipeline uses YOLOv8 for initial object detection, with SmolVLM2 acting as a fallback validator when detection confidence is low. We implement a fusion of classical and generative techniques, creating a decision module that selects among four possible actions: proceed, detour, stop, and query. Route planning is based on Dijkstra-generated waypoints, with the fused vision system driving live adjustments. Overall, this approach achieved 88% detection precision and 91% obstacle avoidance success, improving obstacle avoidance by 27% and achieving significantly better throughput compared to YOLO-only inference. This work demonstrates that compact VLAMs can significantly improve the perception layer in autonomous navigation, especially in human-centric geospatial environments.
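
A minimal sketch of the fused perception and decision logic described in the abstract is given below. The YOLOv8 checkpoint, the 0.5 confidence threshold, the prompt wording, and the stubbed SmolVLM2 validator are illustrative assumptions, not the exact configuration used in this work.

```python
# Sketch of the multi-model decision module: YOLOv8 handles initial detection,
# and a compact VLM (e.g. SmolVLM2) is consulted only when confidence is low.
# Model choice, threshold, and prompt are assumptions for demonstration.
from ultralytics import YOLO

yolo = YOLO("yolov8n.pt")   # hypothetical YOLOv8 checkpoint
CONF_THRESHOLD = 0.5        # assumed cut-off below which the VLM validator runs
ACTIONS = ("proceed", "detour", "stop", "query")


def vlm_validate(image_path: str, prompt: str) -> str:
    """Placeholder for the SmolVLM2 fallback validator.

    In the actual pipeline this would run a fine-tuned SmolVLM2 checkpoint
    over the frame and prompt; here it simply returns an ambiguous answer.
    """
    return "unsure"


def decide(image_path: str) -> str:
    """Return one of the four navigation actions for a single camera frame."""
    results = yolo(image_path)[0]

    if len(results.boxes) == 0:
        return "proceed"  # nothing detected along the planned path

    max_conf = float(results.boxes.conf.max())
    if max_conf >= CONF_THRESHOLD:
        # High-confidence obstacle: the classical branch decides on its own.
        return "detour"

    # Low-confidence detection: fall back to the VLM for contextual reasoning.
    answer = vlm_validate(
        image_path,
        "Is there an obstacle blocking the robot's path? Answer yes, no, or unsure.",
    )
    if "yes" in answer.lower():
        return "stop"
    if "no" in answer.lower():
        return "proceed"
    return "query"  # ambiguous scene: request operator input or re-query


# Example usage:
# action = decide("frame_0001.jpg")
```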
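
The Dijkstra-based route planning step can likewise be illustrated with a standard shortest-path computation over a waypoint graph. The toy graph, node names, and edge costs below are invented for illustration; the actual planner operates on the project's geospatial graph.

```python
# Minimal Dijkstra sketch for generating a waypoint sequence between two nodes.
import heapq


def dijkstra_waypoints(graph, start, goal):
    """Return the lowest-cost sequence of waypoints from start to goal.

    graph: dict mapping node -> list of (neighbor, edge_cost) pairs.
    """
    frontier = [(0.0, start, [start])]  # (cost so far, node, path taken)
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, edge_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + edge_cost, neighbor, path + [neighbor]))
    return []  # goal unreachable


# Example usage on a toy waypoint graph:
toy_graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],
}
print(dijkstra_waypoints(toy_graph, "A", "D"))  # ['A', 'B', 'C', 'D']
```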

Published

2025-09-25

Section

College of Science: Department of Geography and Geoinformation Science