Open Access

ARTICLE

Generating Synthetic Data for Machine Learning Models from the Pediatric Heart Network Fontan I Dataset

Vatche Bahudian, John Valdovinos*

Department of Electrical and Computer Engineering, California State University Northridge, Northridge, CA 91330, USA

* Corresponding Author: John Valdovinos.

(This article belongs to the Special Issue: Artificial Intelligence in Congenital Heart Disease)

Congenital Heart Disease 2025, 20(1), 115-127. https://doi.org/10.32604/chd.2025.063991

Abstract

Background: The population of Fontan patients, who are born with a single functioning ventricle, is growing, and with it the need for artificial intelligence models that predict short-term and long-term health outcomes for patients with the Fontan circulation. Generative adversarial networks (GANs) can generate realistic and useful synthetic data for training such models. Despite their promise, GANs have not been widely adopted in the congenital heart disease research community, due in part to a lack of knowledge on how to employ them.

Methods: In this study, a GAN was used to generate synthetic data from the Pediatric Heart Network Fontan I dataset. A subset of the data, consisting of the echocardiographic and BNP measures collected from Fontan patients, was used to train the GAN. Two sets of synthetic data were created to understand the effect of data missingness on synthetic data generation. The first was created from real data in which the missing values were imputed using Multiple Imputation by Chained Equations (MICE) (referred to as synthetic from imputed real samples). The second was created from real data in which samples with missing values were dropped (referred to as synthetic from dropped real samples). Both synthetic datasets were evaluated for fidelity visually, by comparing histograms and principal component analysis (PCA) plots, and quantitatively, by (1) comparing synthetic and real data using the Kolmogorov-Smirnov test to evaluate the similarity between two distributions and (2) training a neural network to distinguish between real and synthetic samples. Both synthetic datasets were evaluated for utility by training a neural network on synthetic data and testing its ability to classify patients with ventricular dysfunction from echocardiographic and serological measures.

Results: In histograms, associated probability density functions, and PCA plots, both synthetic datasets visually resembled real Fontan data in distribution and variance. Quantitatively, synthetic data from dropped real samples had higher similarity scores, as measured by the Kolmogorov-Smirnov statistic, for all but one feature (age at Fontan) compared to synthetic data from imputed real samples, which had dissimilar scores for three features (Echo SV, Echo tda, and BNP). In addition, synthetic data from dropped real samples resembled real data to a larger extent (49.3% classification error) than synthetic data from imputed real samples (65.28% classification error). Classification errors approximating 50% indicate that the two datasets are indistinguishable. In terms of utility, a neural network trained on synthetic data from imputed real samples classified ventricular dysfunction in real data with a classification error of 10.99%. Similarly, a neural network trained on synthetic data from dropped real samples classified ventricular dysfunction in real data with a classification error of 9.44%.

Conclusions: Although the data used here represent a limited subset of the vast data available from the Pediatric Heart Network, generative adversarial networks can create synthetic data that mimics the probability distribution of real Fontan echocardiographic measures. Clinicians can use these synthetic data to create models that predict health outcomes for Fontan patients.
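The per-feature Kolmogorov-Smirnov fidelity check described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `ks_fidelity`, the toy feature names, the randomly generated stand-in data, and the 0.05 significance threshold are all assumptions made for illustration. It uses `scipy.stats.ks_2samp`, which compares two samples and returns a statistic (smaller means more similar distributions) and a p-value.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_fidelity(real, synthetic, feature_names, alpha=0.05):
    """Run a two-sample KS test on each feature column.

    Returns {feature: (statistic, p_value, similar)}, where `similar` is True
    when the test does not reject equality of distributions at level `alpha`
    (the threshold is an assumed value, not taken from the article).
    """
    results = {}
    for j, name in enumerate(feature_names):
        stat, p_value = ks_2samp(real[:, j], synthetic[:, j])
        results[name] = (float(stat), float(p_value), bool(p_value >= alpha))
    return results


# Toy stand-in data (hypothetical, NOT Fontan measurements).
rng = np.random.default_rng(0)
features = ["feature_a", "feature_b"]
real = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(500, 2))
synthetic = np.column_stack([
    rng.normal(0.0, 1.0, 500),  # drawn from the same distribution as real
    rng.normal(8.0, 2.0, 500),  # deliberately shifted away from real
])

report = ks_fidelity(real, synthetic, features)
for name, (stat, p_value, similar) in report.items():
    print(f"{name}: KS statistic={stat:.3f}, p={p_value:.3g}, similar={similar}")
```

In this toy setup the shifted feature yields a much larger KS statistic than the matched one, which is the kind of per-feature dissimilarity the abstract reports for some features in the synthetic data from imputed real samples.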

Keywords

Synthetic data; congenital heart disease; Fontan circulation

Cite This Article

APA Style
Bahudian, V., & Valdovinos, J. (2025). Generating synthetic data for machine learning models from the Pediatric Heart Network Fontan I dataset. Congenital Heart Disease, 20(1), 115–127. https://doi.org/10.32604/chd.2025.063991
Vancouver Style
Bahudian V, Valdovinos J. Generating synthetic data for machine learning models from the Pediatric Heart Network Fontan I dataset. Congeni Heart Dis. 2025;20(1):115–127. https://doi.org/10.32604/chd.2025.063991
IEEE Style
V. Bahudian and J. Valdovinos, “Generating Synthetic Data for Machine Learning Models from the Pediatric Heart Network Fontan I Dataset,” Congeni. Heart Dis., vol. 20, no. 1, pp. 115–127, 2025. https://doi.org/10.32604/chd.2025.063991



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.