Yuanning Zheng
Postdoctoral Scholar, Biomedical Informatics
Honors & Awards
-
NIH K99/R00 Pathway to Independence Award, NIH/NCI (2025-2030)
All Publications
-
Single-cell multimodal analysis reveals tumor microenvironment predictive of treatment response in non-small cell lung cancer.
Science advances
2025; 11 (21): eadu2151
Abstract
Non-small cell lung cancer (NSCLC) constitutes over 80% of lung cancer cases and remains a leading cause of cancer-related mortality worldwide. Despite the advent of immune checkpoint inhibitors, their efficacy is limited to 27 to 45% of patients. Identifying likely treatment responders is essential for optimizing healthcare and improving quality of life. We generated multiplex immunofluorescence (mIF) images, histopathology, and RNA sequencing data from human NSCLC tissues. Through the analysis of mIF images, we characterized the spatial organization of 1.5 million cells based on the expression levels for 33 biomarkers. To enable large-scale characterization of tumor microenvironments, we developed NucSegAI, a deep learning model for automated nuclear segmentation and cellular classification in histology images. With this model, we analyzed the morphological, textural, and topological phenotypes of 45.6 million cells across 119 whole-slide images. Through unsupervised phenotype discovery, we identified specific lymphocyte phenotypes predictive of immunotherapy response. Our findings can improve patient stratification and guide selection of effective therapeutic regimens.
View details for DOI 10.1126/sciadv.adu2151
View details for PubMedID 40408481
-
Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models.
Nature biomedical engineering
2024
Abstract
Training machine-learning models with synthetically generated data can alleviate the problem of data scarcity when acquiring diverse and sufficiently large datasets is costly and challenging. Here we show that cascaded diffusion models can be used to synthesize realistic whole-slide image tiles from latent representations of RNA-sequencing data from human tumours. Alterations in gene expression affected the composition of cell types in the generated synthetic image tiles, which accurately preserved the distribution of cell types and maintained the cell fraction observed in bulk RNA-sequencing data, as we show for lung adenocarcinoma, kidney renal papillary cell carcinoma, cervical squamous cell carcinoma, colon adenocarcinoma and glioblastoma. Machine-learning models pretrained with the generated synthetic data performed better than models trained from scratch. Synthetic data may accelerate the development of machine-learning models in scarce-data settings and allow for the imputation of missing data modalities.
View details for DOI 10.1038/s41551-024-01193-8
View details for PubMedID 38514775
-
Digital profiling of gene expression from histology images with linearized attention
Nature Communications
2024; 15 (1): 9886
View details for DOI 10.1038/s41467-024-54182-5
-
EpiMix is an integrative tool for epigenomic subtyping using DNA methylation.
Cell reports methods
2023; 3 (7): 100515
Abstract
DNA methylation (DNAme) is a major epigenetic factor influencing gene expression with alterations leading to cancer and immunological and cardiovascular diseases. Recent technological advances have enabled genome-wide profiling of DNAme in large human cohorts. There is a need for analytical methods that can more sensitively detect differential methylation profiles present in subsets of individuals from these heterogeneous, population-level datasets. We developed an end-to-end analytical framework named "EpiMix" for population-level analysis of DNAme and gene expression. Compared with existing methods, EpiMix showed higher sensitivity in detecting abnormal DNAme that was present in only small patient subsets. We extended the model-based analyses of EpiMix to cis-regulatory elements within protein-coding genes, distal enhancers, and genes encoding microRNAs and long non-coding RNAs (lncRNAs). Using cell-type-specific data from two separate studies, we discover epigenetic mechanisms underlying childhood food allergy and survival-associated, methylation-driven ncRNAs in non-small cell lung cancer.
View details for DOI 10.1016/j.crmeth.2023.100515
View details for PubMedID 37533639
View details for PubMedCentralID PMC10391348
-
Spatial cellular architecture predicts prognosis in glioblastoma.
Nature communications
2023; 14 (1): 4122
Abstract
Intra-tumoral heterogeneity and cell-state plasticity are key drivers for the therapeutic resistance of glioblastoma. Here, we investigate the association between spatial cellular organization and glioblastoma prognosis. Leveraging single-cell RNA-seq and spatial transcriptomics data, we develop a deep learning model to predict transcriptional subtypes of glioblastoma cells from histology images. Employing this model, we phenotypically analyze 40 million tissue spots from 410 patients and identify consistent associations between tumor architecture and prognosis across two independent cohorts. Patients with poor prognosis exhibit higher proportions of tumor cells expressing a hypoxia-induced transcriptional program. Furthermore, a clustering pattern of astrocyte-like tumor cells is associated with worse prognosis, while dispersion and connection of the astrocytes with other transcriptional subtypes correlate with decreased risk. To validate these results, we develop a separate deep learning model that utilizes histology images to predict prognosis. Applying this model to spatial transcriptomics data reveals survival-associated regional gene expression programs. Overall, our study presents a scalable approach to unravel the transcriptional heterogeneity of glioblastoma and establishes a critical connection between spatial cellular architecture and clinical outcomes.
View details for DOI 10.1038/s41467-023-39933-0
View details for PubMedID 37433817
View details for PubMedCentralID PMC10336135
-
A deep-learning algorithm to classify skin lesions from mpox virus infection.
Nature medicine
2023
Abstract
Undetected infection and delayed isolation of infected individuals are key factors driving the monkeypox virus (now termed mpox virus or MPXV) outbreak. To enable earlier detection of MPXV infection, we developed an image-based deep convolutional neural network (named MPXV-CNN) for the identification of the characteristic skin lesions caused by MPXV. We assembled a dataset of 139,198 skin lesion images, split into training/validation and testing cohorts, comprising non-MPXV images (n=138,522) from eight dermatological repositories and MPXV images (n=676) from the scientific literature, news articles, social media and a prospective cohort of the Stanford University Medical Center (n=63 images from 12 patients, all male). In the validation and testing cohorts, the sensitivity of the MPXV-CNN was 0.83 and 0.91, the specificity was 0.965 and 0.898 and the area under the curve was 0.967 and 0.966, respectively. In the prospective cohort, the sensitivity was 0.89. The classification performance of the MPXV-CNN was robust across various skin tones and body regions. To facilitate the usage of the algorithm, we developed a web-based app by which the MPXV-CNN can be accessed for patient guidance. The capability of the MPXV-CNN for identifying MPXV lesions has the potential to aid in MPXV outbreak mitigation.
View details for DOI 10.1038/s41591-023-02225-7
View details for PubMedID 36864252
-
Response to anti-angiogenic therapy is associated with AIMP protein family expression in glioblastoma and lower-grade gliomas.
Cancer research communications
2025
Abstract
Glioblastoma (GBM) is a highly vascularized, heterogeneous tumor, yet anti-angiogenic therapies have yielded limited survival benefits. The lack of validated predictive biomarkers for treatment response stratification remains a major challenge. Aminoacyl tRNA synthetase complex-interacting multicomplex proteins (AIMPs) 1/2/3 have been implicated in CNS diseases, but their roles in gliomas remain unexplored. We investigated their association with angiogenesis and their significance as predictive biomarkers for anti-angiogenic treatment response. In this multi-cohort retrospective study we analyzed glioma samples from TCGA, CGGA, Rembrandt, Gravendeel, BELOB and REGOMA trials, and four single-cell transcriptomic datasets. Multi-omic analyses incorporated transcriptomic, epigenetic, and proteomic data. Kaplan-Meier and Cox proportional hazards models were used to assess the potential prognostic value of AIMPs in heterogeneous and homogeneous treatment-groups. Using single-cell transcriptomics, we explored spatial and cell-type-specific AIMP2 expression in GBM. AIMP1/2/3 expressions correlated significantly with angiogenesis across TCGA cancers. In gliomas, AIMPs were upregulated in tumor vs. normal tissues, higher- vs. lower-grade gliomas, and recurrent vs. primary tumors (p<0.05). Upon retrospective analysis of two clinical trials assessing different anti-angiogenic drugs, we found that high-AIMP2 subgroups had improved response to therapies in GBM (REGOMA: HR 4.75 [1.96-11.5], p<0.001; BELOB: HR 2.3 [1.17-4.49], p=0.015). AIMP2-cg04317940 methylation emerged as a clinically applicable stratification marker. Single-cell analysis revealed homogeneous AIMP2 expression in tumor tissues, particularly in AC-like cells, suggesting a mechanistic link to tumor angiogenesis. These findings provide novel insights into the role of AIMPs in angiogenesis, offering improved patient stratification and therapeutic outcomes in recurrent GBM.
View details for DOI 10.1158/2767-9764.CRC-25-0170
View details for PubMedID 40874786
-
A 20-feature radiomic signature of triple-negative breast cancer identifies patients at high risk of death.
NPJ breast cancer
2025; 11 (1): 79
Abstract
A substantial proportion of patients with non-metastatic triple-negative breast cancer (TNBC) experience disease progression and death despite treatment. However, no tool currently exists to discriminate those at higher risk of death. To identify high-risk TNBC, we conducted a retrospective analysis of 749 patients from two independent cohorts. We built a prediction model that leverages breast magnetic resonance imaging (MRI) features to predict risk groups based on a 50-gene Transcriptomics Signature (TS). The TS distinguished patients at high risk of death in multivariate survival analysis (Transcriptomic cohort: hazard ratio [HR] = 13.6, 95% confidence interval [CI] = 1.56-1, p = 0.02; SCAN-B cohort: HR = 1.45, CI 1.04-2.03, p = 0.02). The model identified a 20-feature radiomic signature derived from breast MRI that predicted the TS-based risk groups. This imaging-based classifier was applied to a validation cohort (log rank p = 0.013, accuracy 0.72, AUC 0.71, F1 0.74, precision 0.67, and recall 0.82), detecting a 25% absolute survival difference between high- and low-risk groups after 5 years.
View details for DOI 10.1038/s41523-025-00790-3
View details for PubMedID 40715116
View details for PubMedCentralID 8824427
-
Evaluating Vision and Pathology Foundation Models for Computational Pathology: A Comprehensive Benchmark Study.
Research square
2025
Abstract
To advance precision medicine in pathology, robust AI-driven foundation models are increasingly needed to uncover complex patterns in large-scale pathology datasets, enabling more accurate disease detection, classification, and prognostic insights. However, despite substantial progress in deep learning and computer vision, the comparative performance and generalizability of these pathology foundation models across diverse histopathological datasets and tasks remain largely unexamined. In this study, we conduct a comprehensive benchmarking of 31 AI foundation models for computational pathology, including general vision models (VM), general vision-language models (VLM), pathology-specific vision models (Path-VM), and pathology-specific vision-language models (Path-VLM), evaluated over 41 tasks sourced from TCGA, CPTAC, external benchmarking datasets, and out-of-domain datasets. Our study demonstrates that Virchow2, a pathology foundation model, delivered the highest performance across TCGA, CPTAC, and external tasks, highlighting its effectiveness in diverse histopathological evaluations. We also show that Path-VM outperformed both Path-VLM and VM, securing top rankings across tasks despite lacking a statistically significant edge over vision models. Our findings reveal that model size and data size did not consistently correlate with improved performance in pathology foundation models, challenging assumptions about scaling in histopathological applications. Lastly, our study demonstrates that a fusion model, integrating top-performing foundation models, achieved superior generalization across external tasks and diverse tissues in histopathological analysis. These findings emphasize the need for further research to understand the underlying factors influencing model performance and to develop strategies that enhance the generalizability and robustness of pathology-specific vision foundation models across different tissue types and datasets. 
PathBench: https://pathbench.stanford.edu/
View details for DOI 10.21203/rs.3.rs-6823810/v1
View details for PubMedID 40630532
View details for PubMedCentralID PMC12236927
-
Revealing cancer driver genes through integrative transcriptomic and epigenomic analyses with Moonlight.
PLoS computational biology
2025; 21 (4): e1012999
Abstract
Cancer involves dynamic changes caused by (epi)genetic alterations such as mutations or abnormal DNA methylation patterns which occur in cancer driver genes. These driver genes are divided into oncogenes and tumor suppressors depending on their function and mechanism of action. Discovering driver genes in different cancer (sub)types is important not only for increasing current understanding of carcinogenesis but also from prognostic and therapeutic perspectives. We have previously developed a framework called Moonlight which uses a systems biology multi-omics approach for prediction of driver genes. Here, we present an important development in Moonlight2 by incorporating a DNA methylation layer which provides epigenetic evidence for deregulated expression profiles of driver genes. To this end, we present a novel functionality called Gene Methylation Analysis (GMA) which investigates abnormal DNA methylation patterns to predict driver genes. This is achieved by integrating the tool EpiMix which is designed to detect such aberrant DNA methylation patterns in a cohort of patients and further couples these patterns with gene expression changes. To showcase GMA, we applied it to three cancer (sub)types (basal-like breast cancer, lung adenocarcinoma, and thyroid carcinoma) where we discovered 33, 190, and 263 epigenetically driven genes, respectively. A subset of these driver genes had prognostic effects with expression levels significantly affecting survival of the patients. Moreover, a subset of the driver genes demonstrated therapeutic potential as drug targets. This study provides a framework for exploring the driving forces behind cancer and provides novel insights into the landscape of three cancer (sub)types by integrating gene expression and methylation data.
View details for DOI 10.1371/journal.pcbi.1012999
View details for PubMedID 40258059
-
Reliability-enhanced data cleaning in biomedical machine learning using inductive conformal prediction.
PLoS computational biology
2025; 21 (2): e1012803
Abstract
Accurately labeling large datasets is important for biomedical machine learning yet challenging, and modern data augmentation methods may introduce noise into the training data, which can degrade machine learning model performance. Existing approaches addressing noisy training data typically rely on strict modeling assumptions, classification models, and well-curated datasets. To address these limitations, we propose a novel reliability-based training-data-cleaning method employing inductive conformal prediction (ICP). This method uses a small set of well-curated training data and leverages ICP-calculated reliability metrics to selectively correct mislabeled data and outliers within vast quantities of noisy training data. The efficacy is validated across three classification tasks with distinct modalities: filtering drug-induced-liver-injury (DILI) literature with free-text title and abstract, predicting ICU admission of COVID-19 patients through CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise were introduced to the training labels via label permutation. Our training-data-cleaning method significantly enhanced the downstream classification performance (paired t-tests, p ≤ 0.05 among 30 random train/test partitions): significant accuracy enhancement in 86 out of 96 DILI experiments (up to 11.4% increase from 0.812 to 0.905), significant AUROC and AUPRC enhancements in all 48 COVID-19 experiments (up to 23.8% increase from 0.597 to 0.739 for AUROC, and 69.8% increase from 0.183 to 0.311 for AUPRC), and significant accuracy and macro-average F1-score improvements in 47 out of 48 RNA-sequencing experiments (up to 74.6% increase from 0.351 to 0.613 for accuracy, and 89.0% increase from 0.267 to 0.505 for F1-score). The improvement can be both statistically and clinically significant for information retrieval, disease diagnosis and prognosis.
The method offers the potential to substantially boost classification performance in biomedical machine learning tasks without necessitating an excessive volume of well-curated training data or strong data distribution and modeling assumptions in existing semi-supervised learning methods.
View details for DOI 10.1371/journal.pcbi.1012803
View details for PubMedID 39946419
-
Deep learning uncovers histological patterns of YAP1/TEAD activity related to disease aggressiveness in cancer patients.
iScience
2025; 28 (1): 111638
Abstract
Over the last decade, Hippo signaling has emerged as a major tumor-suppressing pathway. Its dysregulation is associated with abnormal expression of YAP1 and TEAD-family genes. Recent works have highlighted the role of YAP1/TEAD activity in several cancers and its potential therapeutic implications. Therefore, identifying patients with a dysregulated Hippo pathway is key to enhancing treatment impact. Although recent studies have derived RNA-seq-based signatures, there remains a need for a reproducible and cost-effective method to measure the pathway activation. In recent years, deep learning applied to histology slides has emerged as an effective way to predict molecular information from a data modality available in clinical routine. Here, we trained models to predict YAP1/TEAD activity from H&E-stained histology slides in multiple cancers. The robustness of our approach was assessed in seven independent validation cohorts. Finally, we showed that histological markers of disease aggressiveness were associated with dysfunctional Hippo signaling.
View details for DOI 10.1016/j.isci.2024.111638
View details for PubMedID 39868035
View details for PubMedCentralID PMC11758823
-
Digital Spatial Profiling identifies distinct patterns of immuno-oncology-related gene expression within oropharyngeal tumours in relation to HPV and p16 status
Frontiers in Oncology
2024; 14: 1428741
View details for DOI 10.3389/fonc.2024.1428741
-
Multimodal deep learning to predict prognosis in adult and pediatric brain tumors.
Communications medicine
2023; 3 (1): 44
Abstract
The introduction of deep learning in both imaging and genomics has significantly advanced the analysis of biomedical data. For complex diseases such as cancer, different data modalities may reveal different disease characteristics, and the integration of imaging with genomic data has the potential to unravel more information than using these data sources in isolation. Here, we propose a DL framework that combines these two modalities with the aim to predict brain tumor prognosis. Using two separate glioma cohorts of 783 adults and 305 pediatric patients we developed a DL framework that can fuse histopathology images with gene expression profiles. Three strategies for data fusion were implemented and compared: early, late, and joint fusion. Additional validation of the adult glioma models was done on an independent cohort of 97 adult patients. Here we show that the developed multimodal data models achieve better prediction results compared to the single data models, but also lead to the identification of more relevant biological pathways. When testing our adult models on a third brain tumor dataset, we show our multimodal framework is able to generalize and performs better on new data from different cohorts. Leveraging the concept of transfer learning, we demonstrate how our pediatric multimodal models can be used to predict prognosis for two more rare (less available samples) pediatric brain tumors. Our study illustrates that a multimodal data fusion approach can be successfully implemented and customized to model clinical outcome of adult and pediatric brain tumors.
View details for DOI 10.1038/s43856-023-00276-y
View details for PubMedID 36991216
View details for PubMedCentralID 5563115
-
Early Dietary Exposures Epigenetically Program Mammary Cancer Susceptibility through Igf1-Mediated Expansion of the Mammary Stem Cell Compartment.
Cells
2022; 11 (16)
Abstract
Diet is a critical environmental factor affecting breast cancer risk, and recent evidence shows that dietary exposures during early development can affect lifetime mammary cancer susceptibility. To elucidate the underlying mechanisms, we used our established crossover feeding mouse model, where exposure to a high-fat and high-sugar (HFHS) diet during defined developmental windows determines mammary tumor incidence and latency in carcinogen-treated mice. Mammary tumor incidence is significantly increased in mice receiving a HFHS post-weaning diet (high-tumor mice, HT) compared to those receiving a HFHS diet during gestation (low-tumor mice, LT). The current study revealed that the mammary stem cell (MaSC) population was significantly increased in mammary glands from HT compared to LT mice. Igf1 expression was increased in mammary stromal cells from HT mice, where it promoted MaSC self-renewal. The increased Igf1 expression was induced by DNA hypomethylation of the Igf1 Pr1 promoter, mediated by a decrease in Dnmt3b levels. Mammary tissues from HT mice also had reduced levels of Igfbp5, leading to increased bioavailability of tissue Igf1. This study provides novel insights into how early dietary exposures program mammary cancer risk, demonstrating that effective dietary intervention can reduce mammary cancer incidence.
View details for DOI 10.3390/cells11162558
View details for PubMedID 36010633
View details for PubMedCentralID PMC9406400
-
Overexpression of IGF-1 During Early Development Expands the Number of Mammary Stem Cells and Primes them for Transformation.
Stem cells (Dayton, Ohio)
2022; 40 (3): 273-289
Abstract
Insulin-like growth factor I (IGF-1) has been implicated in breast cancer due to its mitogenic and anti-apoptotic effects. Despite substantial research on the role of IGF-1 in tumor progression, the relationship of IGF-1 to tissue stem cells, particularly in mammary tissue, and the resulting tumor susceptibility has not been elucidated. Previous studies with the BK5.IGF-1 transgenic (Tg) mouse model reveal that IGF-1 does not act as a classical, post-carcinogen tumor promoter in the mammary gland. Pre-pubertal Tg mammary glands display increased numbers and enlarged sizes of terminal end buds, a niche for mammary stem cells (MaSCs). Here we show that MaSCs from both wild-type (WT) and Tg mice expressed IGF-1R and that overexpression of Tg IGF-1 increased numbers of MaSCs by undergoing symmetric division, resulting in an expansion of the MaSC and luminal progenitor (LP) compartments in pre-pubertal female mice. This expansion was maintained post-pubertally and validated by mammosphere assays in vitro and transplantation assays in vivo. The addition of recombinant IGF-1 promoted, and IGF-1R downstream inhibitors decreased, mammosphere formation. Single-cell transcriptomic profiles generated from 2 related platforms reveal that IGF-1 stimulated quiescent MaSCs to enter the cell cycle and increased their expression of genes involved in proliferation, plasticity, tumorigenesis, invasion, and metastasis. This study identifies a novel, pro-tumorigenic mechanism, where IGF-1 increases the number of transformation-susceptible carcinogen targets during the early stages of mammary tissue development, and "primes" their gene expression profiles for transformation.
View details for DOI 10.1093/stmcls/sxab018
View details for PubMedID 35356986