Publications

Sample-Efficient Synthetic Data Generation via Reinforcement Learning
Preprint coming soon!
Synthetic health data is a powerful tool for enabling machine learning research while protecting patient privacy and lowering barriers to data access. However, existing generative models often struggle in small-sample settings, which limits their utility in domains where data is scarce. This project introduces a reinforcement learning (RL) framework for generating synthetic electronic health records (EHRs). Using the NIH Bridge2AI AI-READI diabetes dataset as a testbed, we benchmarked our RL model against state-of-the-art EHR generative approaches, including GAN-based and diffusion-based methods. On the AI-READI dataset, the RL model achieved higher downstream utility, higher statistical fidelity, and lower privacy risk than the competing models. When scaled to the larger MIMIC-IV EHR dataset, the RL framework matched the performance of existing state-of-the-art models, demonstrating both robustness and scalability.