A Vision-Language-Action Model Approach for Geospatially Guided Autonomous Navigation

Authors

  • Deepthi Kumar Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • Tyler Treat Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • James Gallagher Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • Edward Oughton Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2025.5279

Abstract

Autonomous navigation in dynamic environments requires robust computer vision to ensure obstacle avoidance along precomputed paths. In geospatial robotics, understanding visual input is essential for making route decisions, avoiding collisions, and reaching desired waypoints. While single-model approaches such as YOLOv5 offer object detection, they often struggle with ambiguous or low-confidence visual inputs, especially in cluttered spaces. This project builds on Google's PaLM-SayCan and the vision-language-action model (VLAM) paradigm, applying their core idea of grounding language models in decision-making to improve perception and precision using compact multimodal models. Furthermore, existing vision systems often lack the contextual understanding and adaptive decision-making needed for human-centered environments. Instead of relying on large-scale LLMs, we explore the use of lightweight, robust models, most notably SmolVLM2, fine-tuned on COCO (Common Objects in Context) images for better contextual reasoning in visual scenes. Our multi-model perception pipeline uses YOLOv8 for initial object detection, with SmolVLM2 acting as a fallback validator when detection confidence is low. We implement a fusion of classical and generative techniques, creating a decision module that selects among four possible actions: proceed, detour, stop, and query. Route planning is based on Dijkstra-generated waypoints, with the fused vision system driving live adjustments. Overall, this approach achieved 88% detection precision and 91% obstacle avoidance success, improving obstacle avoidance by 27% and achieving significantly better throughput compared to YOLO-only inference. This work demonstrates that compact VLAMs can significantly improve the perception layer in autonomous navigation, especially in human-centric geospatial environments.
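
A minimal sketch of the fused perception and decision logic described in the abstract is given below. The YOLOv8 checkpoint, the 0.5 confidence threshold, the prompt wording, and the stubbed SmolVLM2 validator are illustrative assumptions, not the exact configuration used in this work.

```python
# Sketch of the multi-model decision module: YOLOv8 handles initial detection,
# and a compact VLM (e.g. SmolVLM2) is consulted only when confidence is low.
# Model choice, threshold, and prompt are assumptions for demonstration.
from ultralytics import YOLO

yolo = YOLO("yolov8n.pt")   # hypothetical YOLOv8 checkpoint
CONF_THRESHOLD = 0.5        # assumed cut-off below which the VLM validator runs
ACTIONS = ("proceed", "detour", "stop", "query")


def vlm_validate(image_path: str, prompt: str) -> str:
    """Placeholder for the SmolVLM2 fallback validator.

    In the actual pipeline this would run a fine-tuned SmolVLM2 checkpoint
    over the frame and prompt; here it simply returns an ambiguous answer.
    """
    return "unsure"


def decide(image_path: str) -> str:
    """Return one of the four navigation actions for a single camera frame."""
    results = yolo(image_path)[0]

    if len(results.boxes) == 0:
        return "proceed"  # nothing detected along the planned path

    max_conf = float(results.boxes.conf.max())
    if max_conf >= CONF_THRESHOLD:
        # High-confidence obstacle: the classical branch decides on its own.
        return "detour"

    # Low-confidence detection: fall back to the VLM for contextual reasoning.
    answer = vlm_validate(
        image_path,
        "Is there an obstacle blocking the robot's path? Answer yes, no, or unsure.",
    )
    if "yes" in answer.lower():
        return "stop"
    if "no" in answer.lower():
        return "proceed"
    return "query"  # ambiguous scene: request operator input or re-query


# Example usage:
# action = decide("frame_0001.jpg")
```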
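
The Dijkstra-based route planning step can likewise be illustrated with a standard shortest-path computation over a waypoint graph. The toy graph, node names, and edge costs below are invented for illustration; the actual planner operates on the project's geospatial graph.

```python
# Minimal Dijkstra sketch for generating a waypoint sequence between two nodes.
import heapq


def dijkstra_waypoints(graph, start, goal):
    """Return the lowest-cost sequence of waypoints from start to goal.

    graph: dict mapping node -> list of (neighbor, edge_cost) pairs.
    """
    frontier = [(0.0, start, [start])]  # (cost so far, node, path taken)
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, edge_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + edge_cost, neighbor, path + [neighbor]))
    return []  # goal unreachable


# Example usage on a toy waypoint graph:
toy_graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],
}
print(dijkstra_waypoints(toy_graph, "A", "D"))  # ['A', 'B', 'C', 'D']
```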

Published

2025-09-25

Section

College of Science: Department of Geography and Geoinformation Science