Imputing Missing Chlorophyll-a Data in the Chesapeake Bay Region Using Random Forest and KNN Machine Learning Models

Authors

  • Elena Zhang, NSF Spatiotemporal Innovation Center, Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • Rakshita Chidananda, NSF Spatiotemporal Innovation Center, Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • Chaowei Yang, NSF Spatiotemporal Innovation Center, Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2025.5351

Abstract

Harmful algal blooms (HABs) have become an increasingly pressing environmental issue in the Chesapeake Bay Region, threatening surrounding ecosystems and human health through the release of toxins or through sheer biomass. As the nation’s largest estuary, Chesapeake Bay is not only a center of agricultural activity such as shellfish farming but also supports a wide range of habitats. Consequently, predicting algal blooms is essential to protecting water quality. Satellite measurement of chlorophyll-a (Chl-a) provides accessible and consistent data for training prediction models; however, the data are often obscured by cloud cover and other environmental factors. Incomplete and discontinuous data compromise the accuracy of predictions, so it is essential to impute the gaps, and machine learning offers a way to do so. We used a full-coverage dataset from the year 2025 to compare the performance of Random Forest (RF) and K-Nearest Neighbors (KNN) models for imputing Chl-a values in the Chesapeake Bay Region. The best model was RF, with a coefficient of determination (R²) of 0.916, a notable improvement over KNN, which yielded an R² of 0.588, when 30.46% of the data was missing. Iterative imputation further improved the RF model, raising its R² from an initial 0.85 to 0.916. Not only was the RF model more accurate, but it also exhibited a significantly lower runtime, which proved essential when processing large amounts of satellite data. This study provides valuable insight into methods of adjusting RF models to improve the quality of Chl-a data used to advise water quality management.
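
The evaluation described in the abstract (hide a known fraction of a full-coverage dataset, impute it, and score the result with R²) can be illustrated with a minimal sketch. This is not the authors' code or data: the grid is a synthetic stand-in for a Chl-a field, the function names (`r2_score`, `knn_impute`, `evaluate`) are hypothetical, and only a simple spatial KNN imputer is shown. An RF model would presumably be swapped in at the `knn_impute` step using a machine learning library.

```python
import math
import random


def r2_score(truth, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


def knn_impute(known, targets, k=5):
    """Fill each target (x, y) with the mean value of its k nearest known pixels."""
    filled = []
    for tx, ty in targets:
        nearest = sorted(
            known, key=lambda p: (p[0] - tx) ** 2 + (p[1] - ty) ** 2
        )[:k]
        filled.append(sum(v for _, _, v in nearest) / len(nearest))
    return filled


def evaluate(size=30, missing_fraction=0.3046, k=5, seed=0):
    """Mask a fraction of a complete grid, impute it, and return (R2, n_missing, n_total)."""
    rng = random.Random(seed)
    # Synthetic, smooth stand-in for a full-coverage Chl-a grid (NOT real data).
    field = [
        (x, y, 5.0 + 3.0 * math.sin(x / 5.0) + 2.0 * math.cos(y / 7.0))
        for x in range(size)
        for y in range(size)
    ]
    rng.shuffle(field)
    # Hide a random subset to imitate gaps; the hidden values serve as ground truth.
    n_missing = round(missing_fraction * len(field))
    hidden, known = field[:n_missing], field[n_missing:]
    pred = knn_impute(known, [(x, y) for x, y, _ in hidden], k=k)
    truth = [v for _, _, v in hidden]
    return r2_score(truth, pred), n_missing, len(field)


if __name__ == "__main__":
    r2, n_missing, total = evaluate()
    print(f"masked {n_missing} of {total} pixels, KNN R2 = {r2:.3f}")
```

Because the hidden values are known, the score measures imputation quality directly. The masking here is uniformly random, which is an assumption of this sketch; real cloud gaps are spatially clustered, so scores on such a toy field say nothing about the figures reported in the abstract.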

Published

2025-09-25

Issue

Section

College of Science: Department of Geography and Geoinformation Science