Evaluation of Synthetic Population Data Created Using Generative Adversarial Networks

Authors

  • Srihan Kotnana, Aspiring Scientists' Summer Internship Program Intern
  • Taylor Anderson, Aspiring Scientists' Summer Internship Program Mentor
  • Andreas Züfle, Aspiring Scientists' Summer Internship Program Mentor
  • Hamdi Kavak, Aspiring Scientists' Summer Internship Program Mentor

DOI:

https://doi.org/10.13021/jssr2021.3201

Abstract

Realistic synthetic populations are essential for many agent-based models to produce accurate predictions. Generating them is difficult because population data are high-dimensional and their attributes follow irregular distributions. Deep generative models have been proposed to address this problem because they can model arbitrary distributions with greater flexibility. This study compares and evaluates synthetic populations generated by different generative adversarial network (GAN) models. We use the Public Use Microdata Sample (PUMS) for Fairfax County, Virginia to evaluate the performance of a tabular GAN, a conditional tabular GAN (CTGAN), and CopulaGAN, a variant of the CTGAN. Metrics from the TableEvaluator and SDV Python libraries are used to measure correlations and probability distributions of population attributes. To compare models, we computed F1 scores for several classifiers (logistic regression, random forest, decision tree, and multi-layer perceptron); we then averaged the Jaccard similarity, a metric we used to measure the closeness between the real and synthetic F1 scores for each category. We found that the CTGAN and the CopulaGAN both outperformed the tabular GAN, and that the CTGAN narrowly outperformed the CopulaGAN by 2% in average similarity score. Our approach can be applied to other regions in the United States and can be used to model populations accurately when only a small sample of the population is available.
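
The abstract does not give the formula behind the averaged Jaccard similarity, so the following is only a minimal sketch under one possible reading: for each classifier, one model is trained on real data and one on synthetic data, both predict labels for the same held-out rows, and the Jaccard index measures how much those predictions overlap. The function names and the toy predictions are hypothetical illustrations, not the study's code or results.

```python
from statistics import mean


def jaccard_similarity(preds_real, preds_synth):
    """Jaccard index of two prediction lists, viewed as sets of (row, label) pairs.

    Assumption: both lists hold predictions for the same held-out rows,
    one from a model trained on real data, one from a model trained on
    synthetic data.
    """
    if len(preds_real) != len(preds_synth):
        raise ValueError("prediction lists must have the same length")
    set_real = set(enumerate(preds_real))
    set_synth = set(enumerate(preds_synth))
    union = set_real | set_synth
    if not union:
        return 1.0  # two empty prediction lists agree trivially
    return len(set_real & set_synth) / len(union)


def average_similarity(predictions_by_classifier):
    """Average the per-classifier Jaccard index into one similarity score."""
    return mean(
        jaccard_similarity(real, synth)
        for real, synth in predictions_by_classifier.values()
    )


# Toy predictions, invented purely for illustration.
toy_predictions = {
    "logistic_regression": ([0, 1, 1, 0], [0, 1, 1, 0]),  # full agreement
    "decision_tree": ([0, 1, 1, 0], [0, 1, 0, 0]),        # one disagreement
}
score = average_similarity(toy_predictions)
print(f"average similarity: {score:.2f}")
```

Under this reading, a score near 1 means that models trained on the synthetic population behave almost the same as models trained on the real one, which is the sense in which one GAN can be said to outperform another.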

Published

2022-12-13

Section

College of Science: Department of Computational and Data Sciences