Digital Twin Framework for Real-Time Computing Infrastructure Monitoring
Abstract
Spatiotemporal studies demand significant computing power and infrastructure. The recent rise of artificial intelligence
has further amplified these computational requirements and introduced new cybersecurity risks. Existing data center
management tools diagnose system errors slowly and lack predictive capabilities. To address these challenges, we
developed a Computing Infrastructure Digital Twin (DT) of a 600-machine data center (DC) to enable real-time
monitoring, autonomous detection of system issues, and efficient resource management. Metrics from the physical DC
are collected using Prometheus, enabling real-time insights and an alert system based on pre-defined rules. Additional
information, such as system log files, is retrieved and stored in a PostgreSQL database for downstream large language
model (LLM) tasks such as summarization, information extraction, and anomaly detection. The 3D virtual
replica, modeled in Autodesk Fusion and Onshape and visualized through NVIDIA Omniverse, reflects the real-time status
of the infrastructure, allowing users to detect and explore system errors through an interactive interface. Initial results
suggest that this DT can inform the development of more efficient and secure data center management systems. Future
research will incorporate artificial intelligence and machine learning (AI/ML) to predict potential anomalies,
system errors, and security threats.
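
As a rough illustration of what an alert system based on pre-defined rules might look like, the sketch below evaluates threshold rules over a stream of metric samples. It is not the authors' implementation and does not use the Prometheus API. The names AlertRule and evaluate, the metric cpu_temp_celsius, the machine labels, and the "fire after N consecutive breaches" behavior are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical pre-defined rule: fire when a metric stays above a threshold."""
    name: str
    metric: str
    threshold: float
    for_samples: int  # consecutive breaching samples required before firing


# One sample is (machine, metric name, value); samples arrive in time order.
Sample = Tuple[str, str, float]


def evaluate(rules: List[AlertRule], samples: Iterable[Sample]) -> Set[Tuple[str, str]]:
    """Return the (machine, rule name) pairs firing after all samples are seen."""
    streak: Dict[Tuple[str, str], int] = {}
    firing: Set[Tuple[str, str]] = set()
    for machine, metric, value in samples:
        for rule in rules:
            if rule.metric != metric:
                continue
            key = (machine, rule.name)
            if value > rule.threshold:
                streak[key] = streak.get(key, 0) + 1
            else:
                # A healthy sample resets the streak and clears the alert.
                streak[key] = 0
                firing.discard(key)
            if streak[key] >= rule.for_samples:
                firing.add(key)
    return firing


# Assumed example data, not measurements from the data center in the paper.
rules = [AlertRule("HighCpuTemp", "cpu_temp_celsius", 85.0, 2)]
samples = [
    ("node-001", "cpu_temp_celsius", 90.0),
    ("node-002", "cpu_temp_celsius", 90.0),
    ("node-001", "cpu_temp_celsius", 91.0),
    ("node-002", "cpu_temp_celsius", 70.0),
]
alerts = evaluate(rules, samples)
print(sorted(alerts))
```

Requiring several consecutive breaches before firing is one common way to avoid alerting on a single noisy reading. In a digital twin, the resulting set of firing alerts could drive the status shown for each machine in the 3D view.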
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.