Adaptive Multimodal RGB–LWIR Sensor Fusion and YOLO Backbone Architectures for Real-Time Object Tracking in Autonomous Turret Systems Across Variable Illumination Conditions
DOI:
https://doi.org/10.13021/jssr2025.5277

Abstract
Autonomous robotic platforms are playing a growing role across the emergency services sector, supporting missions such as search-and-rescue operations in disaster zones and reconnaissance. However, traditional red-green-blue (RGB) detection pipelines struggle in low-light environments, while thermal-based systems fail to capture object characteristics such as color and texture. These complementary limitations suggest that combining thermal and visible imaging may yield more reliable performance across diverse conditions. To address these challenges, this study introduces a unified, adaptive framework that fuses long-wave infrared (LWIR) and RGB video streams at multiple fusion ratios and dynamically selects the optimal detection model based on ambient illumination. Using a library of 33 custom-trained You Only Look Once (YOLO) models, we identified the top-performing architectures for three illumination conditions: no light (<10 lux), dim light (10–1000 lux), and daylight (>1000 lux). Fusion was performed by blending pixel intensities from aligned LWIR and RGB frames at eleven predefined ratios, from full RGB (100/0) to full LWIR (0/100) in 10% steps. We created a dataset of over 22,000 annotated images spanning varied illumination conditions, with a 75/25 split for model training and validation. Preliminary findings suggest that adaptive RGB–LWIR fusion produced noticeably higher average confidence and firing success rates than our baseline models (YOLOv5 and YOLOv11) operating on single modalities. This work lays the foundation for adaptive multimodal perception, improving the reliability of autonomous robotic vision in diverse environments.
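The fusion and model-selection steps described in the abstract can be summarized in a short sketch. The snippet below is illustrative only: it assumes pre-aligned, same-size frames, and the function names (fuse_frames, select_model), weight-file paths, and use of OpenCV's addWeighted for pixel blending are assumptions drawn from the abstract rather than the authors' released code. It blends RGB and LWIR frames at one of the eleven ratios and picks a detector for the measured illumination band.

import cv2
import numpy as np

# Eleven fusion ratios: 0.0 = full LWIR (0/100) ... 1.0 = full RGB (100/0), in 10% steps.
FUSION_RATIOS = [i / 10 for i in range(11)]

def fuse_frames(rgb_frame: np.ndarray, lwir_frame: np.ndarray, rgb_weight: float) -> np.ndarray:
    """Blend pixel intensities of pre-aligned, same-size RGB and LWIR frames."""
    if lwir_frame.ndim == 2:
        # LWIR is typically single-channel; replicate it to 3 channels so shapes match.
        lwir_frame = cv2.cvtColor(lwir_frame, cv2.COLOR_GRAY2BGR)
    return cv2.addWeighted(rgb_frame, rgb_weight, lwir_frame, 1.0 - rgb_weight, 0.0)

def select_model(lux: float) -> str:
    """Return detector weights for the measured illumination (hypothetical file names)."""
    if lux < 10:       # no light (<10 lux)
        return "weights/yolo_no_light.pt"
    if lux <= 1000:    # dim light (10-1000 lux)
        return "weights/yolo_dim_light.pt"
    return "weights/yolo_daylight.pt"  # daylight (>1000 lux)

In a live pipeline, the selected weights would presumably be loaded into the corresponding YOLO model and fed frames fused at the ratio chosen for that illumination band.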
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.