Publications

A Reinforcement Learning Approach to Synthetic Data Generation
Preprint
Synthetic data generation (SDG) is a promising approach for enabling data sharing in biomedical studies while preserving patient privacy. Yet state-of-the-art generative models often require large datasets and complex training procedures, limiting their applicability in the small-sample settings common in biomedical research. This study aims to develop a more principled and efficient approach to SDG and to evaluate its efficacy for biomedical applications. We reframe SDG as a reinforcement learning (RL) problem and introduce RLSyn, a novel framework that models the data generator as a stochastic policy over patient records and optimizes it with Proximal Policy Optimization (PPO) using discriminator-derived rewards. We evaluate RLSyn on two biomedical datasets, AI-READI and MIMIC-IV, and benchmark it against state-of-the-art generative adversarial networks (GANs) and diffusion-based methods across extensive privacy, utility, and fidelity evaluations. On MIMIC-IV, RLSyn achieves predictive utility comparable to diffusion models (S2R AUC 0.902 vs. 0.906) while slightly outperforming them in fidelity (NMI 0.001 vs. 0.003; DWD 2.073 vs. 2.797) and achieving comparably low privacy risk (membership inference AUC ~0.50). On the smaller AI-READI dataset, RLSyn again matches diffusion-based utility (S2R AUC 0.873 vs. 0.871), while achieving higher fidelity (NMI 0.001 vs. 0.002; DWD 13.352 vs. 16.441) and significantly lower vulnerability to membership inference attacks (AUC 0.544 vs. 0.601). Both RLSyn and diffusion-based models substantially outperform GANs in utility and fidelity on both datasets. Our results suggest that reinforcement learning provides a principled and effective approach to synthetic biomedical data generation, particularly in data-scarce regimes.
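The abstract's core idea, a generator trained as a policy with PPO against discriminator-derived rewards, can be sketched in miniature. The sketch below is an illustration, not the paper's implementation: "records" are categories of a toy distribution, the policy is a softmax over four categories, and the reward assumes an ideal discriminator whose logit equals the log density ratio, so each sample's reward is log p_real(x) - log pi_old(x). All names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "record" space: 4 categories; p_real stands in for the real data distribution.
p_real = np.array([0.6, 0.2, 0.1, 0.1])
logits = np.zeros(4)          # generator policy parameters (softmax over categories)
clip_eps, lr = 0.2, 0.1       # PPO clip range and learning rate (illustrative values)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tv(p, q):
    # total variation distance between two categorical distributions
    return 0.5 * np.abs(p - q).sum()

tv_start = tv(softmax(logits), p_real)

for _ in range(300):
    pi_old = softmax(logits)
    x = rng.choice(4, size=256, p=pi_old)            # generate a batch of records
    # Ideal-discriminator reward: log density ratio of real vs. generated.
    rew = np.log(p_real[x]) - np.log(pi_old[x])
    adv = rew - rew.mean()                           # mean-baseline advantage
    for _ in range(4):                               # a few PPO epochs per batch
        pi = softmax(logits)
        ratio = pi[x] / pi_old[x]
        # PPO clipping: drop the gradient where the clipped branch is active.
        clipped = ((ratio > 1 + clip_eps) & (adv > 0)) | ((ratio < 1 - clip_eps) & (adv < 0))
        w = np.where(clipped, 0.0, ratio * adv)
        # grad of log pi(x) w.r.t. logits is onehot(x) - pi; accumulate over the batch.
        g = np.bincount(x, weights=w, minlength=4) - w.sum() * pi
        logits += lr * g / len(x)

tv_end = tv(softmax(logits), p_real)
```

With this reward choice the policy gradient descends the KL divergence from the generator to the real distribution, so the generator distribution moves toward p_real rather than collapsing onto the single highest-reward category; a trained discriminator, as in the abstract, would supply a noisy estimate of the same log-ratio signal.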