A Comparative Study of Decision Tree and Random Forest ML Algorithms for Estimating PM2.5 Levels from AOD

Authors

  • Kapilan Karunakaran, NSF Spatiotemporal Innovation Center, Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • Seren Smith, NSF Spatiotemporal Innovation Center, Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • Zifu Wang, NSF Spatiotemporal Innovation Center, Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA

Abstract

Ground monitoring systems that measure PM2.5 levels are expensive to deploy in remote regions. Where such regions are frequently affected by forest fires, the lack of air quality data leads to excessive smoke exposure and, in turn, to increased respiratory disease. With a suitable machine learning model, the air quality index (AQI) can be estimated for these regions from readily available aerosol optical depth (AOD) data. In this paper we compare two machine learning algorithms, Decision Tree and Random Forest (RF), across various locations and conditions. The comparison aims to select the better model based on factors such as temperature, humidity, season, and geography. PurpleAir sensor data are used for training. Preprocessing consists of classifying sensor data as indoor or outdoor, averaging the data where applicable, and grouping by season. Based on the preliminary analysis, the RF model is up to 50% more accurate when both indoor and outdoor sensor data are used and when the region is impacted by a forest fire. The two algorithms produce similar estimates when trained on a homogeneous dataset that excludes extreme weather events such as fires or storms. Accuracy varies significantly during weather events such as rain because of the increased humidity. In addition, training the model on a single dataset limits its accuracy in predicting various types of wildfires. This paper concludes that, while it is possible to estimate PM2.5 at levels close to those reported by ground monitoring systems, the training data and model need to account for all influencing factors to improve accuracy.
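
The preprocessing steps named in the abstract (separating indoor from outdoor sensor data, averaging, and grouping by season) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the record layout, the sample values, the location labels "inside" and "outside", and the helper names season_of and seasonal_means are all assumptions made for illustration, and the month-to-season mapping assumes Northern Hemisphere meteorological seasons.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (month, location_type, pm25 in ug/m3).
# Layout and values are illustrative only, not data from the study.
READINGS = [
    (1, "outside", 8.0),
    (1, "outside", 12.0),
    (1, "inside", 3.0),
    (7, "outside", 40.0),
    (7, "outside", 60.0),
    (10, "inside", 5.0),
]


def season_of(month):
    """Map a month number (1-12) to a season (assumed Northern Hemisphere)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "autumn"


def seasonal_means(readings, location_type=None):
    """Average PM2.5 per season, optionally keeping a single location type."""
    groups = defaultdict(list)
    for month, loc, pm25 in readings:
        # Skip records of the other location type when a filter is given.
        if location_type is not None and loc != location_type:
            continue
        groups[season_of(month)].append(pm25)
    return {season: mean(values) for season, values in groups.items()}


# Outdoor-only averages versus averages over indoor and outdoor data combined.
outdoor = seasonal_means(READINGS, location_type="outside")
combined = seasonal_means(READINGS)
```

In this toy data, including the indoor reading lowers the winter average, which shows why the indoor/outdoor split has to be decided before any model is trained on the grouped values.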

Published

2024-10-13

Section

College of Science: Department of Geography and Geoinformation Science