Mr. David Seong
MD Student, expected graduation Spring 2026
Ph.D. Student in Immunology, admitted Autumn 2021
MSTP Student
All Publications
-
Benchmarking cell type and gene set annotation by large language models with AnnDictionary.
Nature communications
2025; 16 (1): 9511
Abstract
We develop an open-source package called AnnDictionary to facilitate the parallel, independent analysis of multiple anndata. AnnDictionary is built on top of LangChain and AnnData and supports all common large language model (LLM) providers. AnnDictionary only requires 1 line of code to configure or switch the LLM backend and it contains numerous multithreading optimizations to support the analysis of many anndata and large anndata. We use AnnDictionary to perform the first benchmarking study of all major LLMs at de novo cell-type annotation. LLMs vary greatly in absolute agreement with manual annotation based on model size. Inter-LLM agreement also varies with model size. We find LLM annotation of most major cell types to be more than 80-90% accurate, and we will maintain a leaderboard of LLM cell type annotation. Furthermore, we benchmark these LLMs at functional annotation of gene sets, and find that Claude 3.5 Sonnet recovers close matches of functional gene set annotations in over 80% of test sets.
View details for DOI 10.1038/s41467-025-64511-x
View details for PubMedID 41152246
View details for PubMedCentralID 8080633
-
Epigenomic profile of GBA1 in Parkinson's disease.
Parkinsonism & related disorders
2025; 140: 108066
Abstract
While genome-wide association studies have identified GBA1 as a key gene contributing to disease severity and cognitive decline in PD, its molecular effects remain poorly understood. We used integrative bulk ATAC-seq across six brain regions from autopsied individuals with PD and varying genetic risk to characterize region- and cell type-specific molecular differences. Using Cellformer, an AI-based bulk ATAC-seq-deconvolution tool, we determined cell type-specific effects of GBA1 on PD disease progression and then validated our findings using whole transcriptome data from blood samples. Epigenomic differences between PD with ("GBA+"; n = 15) and without ("GBA-"; n = 15) GBA1 variants were localized in substantia nigra. Nineteen chromatin-accessible regions strictly separated GBA+ from GBA-, including the promoter sites of key genes such as CACNA1C, EHMT1, and SLC25A48. The effect in GBA+ spanned the main cell types in brain, and chromatin differences between GBA- and GBA+ increased with neuropathologic progression of disease. Significant differences in the epigenomic profile in GBA+ were observed in neuronal cells (AUROC = 0.8, AUPRC = 0.8, P-value < 0.0001). Validation in blood samples distinguished between GBA+ and GBA- subtypes, achieving AUROC values of 0.99. Over 5000 transcripts in blood cells distinguished GBA+ from GBA-, validating key genes and pathways from our epigenomic analysis of brain regions. Our study provides novel insights into the cell type-specific epigenomic and transcriptomic landscape of GBA+ and its molecular divergence from other PD subtypes, and highlights potential therapeutic targets for this genetically defined subset of PD.
View details for DOI 10.1016/j.parkreldis.2025.108066
View details for PubMedID 41033114
-
Deep learning-based cell type profiles reveal signatures of Alzheimer's disease resilience and resistance.
Brain : a journal of neurology
2025
Abstract
Neurological disorders result from the complex and poorly understood contributions of many cell types, essential for uncovering mechanisms behind these disorders and identifying specific therapeutic targets. Single-nucleus technologies have advanced brain disease research, but remain limited by their low nuclear transcriptional coverage, high cost, and technical complexity. To address this, we applied a transformer-based deep learning model that restores cell type-specific transcriptional programs from bulk RNA-seq, significantly outperforming previous methods. This enables large-scale and cost-effective investigation of cell type-specific transcriptomes in complex and heterogeneous phenotypes such as cognitive resilience or brain resistance to Alzheimer's disease. Our analysis identified astrocytes as the major cell mediator of Alzheimer's disease resilience across cerebral cortex regions, while excitatory neurons and oligodendrocyte progenitor cells emerged as the major cell mediators of resistance, maintaining synaptic function and preserving neuron health. Finally, we show that our approach could restore the whole tissue transcriptome, offering an unbiased framework for exploring cell-specific functions beyond single nucleus data.
View details for DOI 10.1093/brain/awaf285
View details for PubMedID 40794555
-
Benchmarking of pre-training strategies for electronic health record foundation models.
JAMIA open
2025; 8 (4): ooaf090
Abstract
Objective: Our objective is to compare different pre-training strategies for electronic health record (EHR) foundation models. Materials and Methods: We evaluated three approaches using a transformer-based architecture: baseline (no pre-training), self-supervised pre-training with masked language modeling, and supervised pre-training. The models were assessed on their ability to predict both major adverse cardiac events and mortality occurring within 12 months. The pre-training cohort was 405,679 patients prescribed antihypertensives and the fine-tuning cohort was 5,525 patients who received doxorubicin. Results: Task-specific supervised pre-training achieved superior performance (AUROC 0.70, AUPRC 0.23), outperforming both self-supervised pre-training and the baseline. However, when the model was evaluated on the task of 12-month mortality prediction, the self-supervised model performed best. Discussion: While supervised pre-training excels when aligned with downstream tasks, self-supervised approaches offer more generalized utility. Conclusion: Pre-training strategy selection should consider intended applications, data availability, and transferability requirements.
View details for DOI 10.1093/jamiaopen/ooaf090
View details for PubMedID 40809468
-
CD301b+ monocyte-derived dendritic cells mediate resistance to radiotherapy.
The Journal of experimental medicine
2025; 222 (6)
Abstract
Monocytes infiltrating tumors acquire various states that distinctly impact cancer treatment. Here, we show that resistance of tumors to radiotherapy (RT) is controlled by the accumulation of monocyte-derived dendritic cells (moDCs). These moDCs are characterized by the expression of CD301b and have a superior capacity to generate regulatory T cells (Tregs). Accordingly, moDC depletion limits Treg generation and improves the therapeutic outcome of RT. Mechanistically, we demonstrate that granulocyte-macrophage colony-stimulating factor (GM-CSF) derived from radioresistant tumor cells following RT is necessary for the accumulation of moDCs. Our results unravel the immunosuppressive function of moDCs and identify GM-CSF as an immunotherapeutic target during RT.
View details for DOI 10.1084/jem.20231717
View details for PubMedID 40146036
View details for PubMedCentralID PMC11949126
-
Author Correction: AI-guided precision parenteral nutrition for neonatal intensive care units.
Nature medicine
2025
View details for DOI 10.1038/s41591-025-03691-x
View details for PubMedID 40205201
-
AI-guided precision parenteral nutrition for neonatal intensive care units.
Nature medicine
2025
Abstract
One in ten neonates are admitted to neonatal intensive care units, highlighting the need for precise interventions. However, the application of artificial intelligence (AI) in guiding neonatal care remains underexplored. Total parenteral nutrition (TPN) is a life-saving treatment for preterm neonates; however, implementation of the therapy in its current form is subjective, error-prone and resource-consuming. Here, we developed TPN2.0, a data-driven approach that optimizes and standardizes TPN using information collected routinely in electronic health records. We assembled a decade of TPN compositions (79,790 orders; 5,913 patients) at Stanford to train TPN2.0. In addition to internal validation, we also validated our model in an external cohort (63,273 orders; 3,417 patients) from a second hospital. Our algorithm identified 15 TPN formulas that can enable a precision-medicine approach (Pearson's R = 0.94 compared to experts), increasing safety and potentially reducing cost. A blinded study (n = 192) revealed that physicians rated TPN2.0 higher than current best practice. In patients with high disagreement between the actual prescriptions and TPN2.0, standard prescriptions were associated with increased morbidities (for example, odds ratio = 3.33; P value = 0.0007 for necrotizing enterocolitis), while TPN2.0 recommendations were linked to reduced risk. Finally, we demonstrated that TPN2.0 employing a transformer architecture enabled guideline-adhering, physician-in-the-loop recommendations that allow collaboration between the care team and AI.
View details for DOI 10.1038/s41591-025-03601-1
View details for PubMedID 40133525
View details for PubMedCentralID 10593864
-
PregMedNet: Multifaceted Maternal Medication Impacts on Neonatal Complications.
medRxiv : the preprint server for health sciences
2025
Abstract
While medication intake is common among pregnant women, medication safety remains underexplored, leading to unclear guidance for patients and healthcare professionals. PregMedNet addresses this gap by providing a multifaceted maternal medication safety framework based on systematic analysis of 1.19 million mother-baby dyads from U.S. claims databases. A novel confounding adjustment pipeline was applied to systematically control confounders for multiple medication-disease pairs, robustly identifying both known and novel maternal medication effects. Notably, one of the newly discovered associations was experimentally validated, demonstrating the reliability of claims data and machine learning for perinatal medication safety studies. Additionally, potential biological mechanisms of newly identified associations were generated using a graph learning method. These findings highlight PregMedNet's value in promoting safer medication use during pregnancy and maternal-neonatal outcomes.
View details for DOI 10.1101/2025.02.13.25322242
View details for PubMedID 39990567
View details for PubMedCentralID PMC11844599
-
A machine learning approach to leveraging electronic health records for enhanced omics analysis.
Nature machine intelligence
2025; 7 (2): 293-306
Abstract
Omics studies produce a large number of measurements, enabling the development, validation and interpretation of systems-level biological models. Large cohorts are required to power these complex models; yet, the cohort size remains limited due to clinical and budgetary constraints. We introduce clinical and omics multimodal analysis enhanced with transfer learning (COMET), a machine learning framework that incorporates large, observational electronic health record databases and transfer learning to improve the analysis of small datasets from omics studies. By pretraining on electronic health record data and adaptively blending both early and late fusion strategies, COMET overcomes the limitations of existing multimodal machine learning methods. Using two independent datasets, we showed that COMET improved the predictive modelling performance and biological discovery compared with the analysis of omics data with traditional methods. By incorporating electronic health record data into omics analyses, COMET enables more precise patient classifications, beyond the simplistic binary reduction to cases and controls. This framework can be broadly applied to the analysis of multimodal omics studies and reveals more powerful biological insights from limited cohort sizes.
View details for DOI 10.1038/s42256-024-00974-9
View details for PubMedID 40008295
View details for PubMedCentralID PMC11847705
-
Generating pregnant patient biological profiles by deconvoluting clinical records with electronic health record foundation models.
Briefings in bioinformatics
2024; 25 (6)
Abstract
Translational biology posits a strong bi-directional link between clinical phenotypes and a patient's biological profile. By leveraging this bi-directional link, we can efficiently deconvolute pre-existing clinical information into biological profiles. However, traditional computational tools are limited in their ability to resolve this link because of the relatively small sizes of paired clinical-biological datasets for training and the high dimensionality/sparsity of tabular clinical data. Here, we use state-of-the-art foundation models (FMs) for electronic health record (EHR) data to generate proteomics profiles of pregnant patients, thereby deconvoluting pre-existing clinical information into biological profiles without the cost and effort of running large-scale traditional omics studies. We show that FM-derived representations of a patient's EHR data coupled with a fully connected neural network prediction head can generate 206 blood protein expression levels. Interestingly, these proteins were enriched for developmental pathways, while proteins not able to be generated from EHR data were enriched for metabolic pathways. Finally, we show a proteomic signature of gestational diabetes that includes proteins with established and novel links to gestational diabetes. These results showcase the power of FM-derived EHR representations in efficiently generating biological states of pregnant patients. This capability can revolutionize disease understanding and therapeutic development, offering a cost-effective, time-efficient, and less invasive alternative to traditional methods of generating proteomics.
View details for DOI 10.1093/bib/bbae574
View details for PubMedID 39545787
View details for PubMedCentralID PMC11565587
-
Computational Approaches for Predicting Preterm Birth and Newborn Outcomes.
Clinics in perinatology
2024; 51 (2): 461-473
Abstract
Preterm birth (PTB) and its associated morbidities are a leading cause of infant mortality and morbidity. Accurate predictive models and a better biological understanding of PTB-associated morbidities are critical in reducing their adverse effects. Increasing availability of multimodal high-dimensional data sets with concurrent advances in artificial intelligence (AI) have created a rich opportunity to gain novel insights into PTB, a clinically complex and multifactorial disease. Here, the authors review the use of AI to analyze 3 modes of data: electronic health records, biological omics, and social determinants of health metrics.
View details for DOI 10.1016/j.clp.2024.02.005
View details for PubMedID 38705652
View details for PubMedCentralID PMC11070639
-
Comprehensive overview of the anesthesiology research landscape: A machine Learning Analysis of 737 NIH-funded anesthesiology primary Investigator's publication trends.
Heliyon
2024; 10 (7): e29050
Abstract
Anesthesiology plays a crucial role in perioperative care, critical care, and pain management, impacting patient experiences and clinical outcomes. However, our understanding of the anesthesiology research landscape is limited. Accordingly, we initiated a data-driven analysis through topic modeling to uncover research trends, enabling informed decision-making and fostering progress within the field. The easyPubMed R package was used to collect 32,300 PubMed abstracts spanning from 2000 to 2022. These abstracts were authored by 737 Anesthesiology Principal Investigators (PIs) who were recipients of National Institutes of Health (NIH) funding from 2010 to 2022. Abstracts were preprocessed, vectorized, and analyzed with the state-of-the-art BERTopic algorithm to identify pillar topics and trending subtopics within anesthesiology research. Temporal trends were assessed using the Mann-Kendall test. The publishing journals with most abstracts in this dataset were Anesthesia & Analgesia (1133), Anesthesiology (992), and Pain (671). Eight pillar topics were identified and categorized as basic or clinical sciences based on a hierarchical clustering analysis. Amongst the pillar topics, "Cells & Proteomics" had both the highest annual and total number of abstracts. Interestingly, there was an overall upward trend for all topics spanning the years 2000-2022. However, when focusing on the period from 2015 to 2022, topics "Cells & Proteomics" and "Pulmonology" exhibit a downward trajectory. Additionally, various subtopics were identified, with notable increasing trends in "Aneurysms", "Covid 19 Pandemic", and "Artificial intelligence & Machine Learning". Our work offers a comprehensive analysis of the anesthesiology research landscape by providing insights into pillar topics and trending subtopics. These findings contribute to a better understanding of anesthesiology research and can guide future directions.
View details for DOI 10.1016/j.heliyon.2024.e29050
View details for PubMedID 38623206
View details for PubMedCentralID PMC11016610
-
Transitional dendritic cells are distinct from conventional DC2 precursors and mediate proinflammatory antiviral responses.
Nature immunology
2023
Abstract
High-dimensional approaches have revealed heterogeneity amongst dendritic cells (DCs), including a population of transitional DCs (tDCs) in mice and humans. However, the origin and relationship of tDCs to other DC subsets has been unclear. Here we show that tDCs are distinct from other well-characterized DCs and conventional DC precursors (pre-cDCs). We demonstrate that tDCs originate from bone marrow progenitors shared with plasmacytoid DCs (pDCs). In the periphery, tDCs contribute to the pool of ESAM+ type 2 DCs (DC2s), and these DC2s have pDC-related developmental features. Different from pre-cDCs, tDCs have less turnover, capture antigen, respond to stimuli and activate antigen-specific naive T cells, all characteristics of differentiated DCs. Different from pDCs, viral sensing by tDCs results in IL-1beta secretion and fatal immune pathology in a murine coronavirus model. Our findings suggest that tDCs are a distinct pDC-related subset with a DC2 differentiation potential and unique proinflammatory function during viral infections.
View details for DOI 10.1038/s41590-023-01545-7
View details for PubMedID 37414907
-
Rapid recruitment and IFN-I-mediated activation of monocytes dictate focal radiotherapy efficacy.
Science immunology
2023; 8 (84): eadd7446
Abstract
The recruitment of monocytes and their differentiation into immunosuppressive cells is associated with the low efficacy of preclinical nonconformal radiotherapy (RT) for tumors. However, nonconformal RT (non-CRT) does not mimic clinical practice, and little is known about the role of monocytes after RT modes used in patients, such as conformal RT (CRT). Here, we investigated the acute immune response induced by CRT. Contrary to non-CRT approaches, we found that CRT induces a rapid and robust recruitment of monocytes to the tumor that minimally differentiate into tumor-associated macrophages or dendritic cells but instead up-regulate major histocompatibility complex II and costimulatory molecules. We found that these large numbers of infiltrating monocytes are responsible for activating effector polyfunctional CD8+ tumor-infiltrating lymphocytes that reduce tumor burden. Mechanistically, we show that monocyte-derived type I interferon is pivotal in promoting monocyte accumulation and immunostimulatory function in a positive feedback loop. We also demonstrate that monocyte accumulation in the tumor microenvironment is hindered when RT inadvertently affects healthy tissues, as occurs in non-CRT. Our results unravel the immunostimulatory function of monocytes during clinically relevant modes of RT and demonstrate that limiting the exposure of healthy tissues to radiation has a positive therapeutic effect on the overall antitumor immune response.
View details for DOI 10.1126/sciimmunol.add7446
View details for PubMedID 37294749