Siyu He
Postdoctoral Scholar, Biomedical Data Sciences
Bio
I am a postdoctoral fellow in the Department of Biomedical Data Science at Stanford University, where I am advised by Dr. James Zou and Dr. Stephen Quake.
My research interests lie at the intersection of statistical machine learning, computational biology, stem cell engineering, and disease modeling. My mission is to leverage AI methodologies in biomedicine to accelerate our understanding of diseases. I earned my PhD in Biomedical Engineering from Columbia University, where I was co-advised by Dr. Kam Leong and Dr. Elham Azizi. I hold a Bachelor's degree in Physics from Xi'an Jiaotong University.
Stanford Advisors
-
Stephen Quake, Postdoctoral Research Mentor
-
James Zou, Postdoctoral Faculty Sponsor
All Publications
-
Benchmarking cell type and gene set annotation by large language models with AnnDictionary.
Nature communications
2025; 16 (1): 9511
Abstract
We develop an open-source package called AnnDictionary to facilitate the parallel, independent analysis of multiple anndata. AnnDictionary is built on top of LangChain and AnnData and supports all common large language model (LLM) providers. AnnDictionary only requires 1 line of code to configure or switch the LLM backend and it contains numerous multithreading optimizations to support the analysis of many anndata and large anndata. We use AnnDictionary to perform the first benchmarking study of all major LLMs at de novo cell-type annotation. LLMs vary greatly in absolute agreement with manual annotation based on model size. Inter-LLM agreement also varies with model size. We find LLM annotation of most major cell types to be more than 80-90% accurate, and we will maintain a leaderboard of LLM cell type annotation. Furthermore, we benchmark these LLMs at functional annotation of gene sets, and find that Claude 3.5 Sonnet recovers close matches of functional gene set annotations in over 80% of test sets.
View details for DOI 10.1038/s41467-025-64511-x
View details for PubMedID 41152246
View details for PubMedCentralID 8080633
-
Squidiff: Predicting cellular development and responses to perturbations using a diffusion model.
bioRxiv : the preprint server for biology
2025
Abstract
Single-cell sequencing has revolutionized our understanding of cellular heterogeneity and responses to environmental stimuli. However, mapping transcriptomic changes across diverse cell types in response to various stimuli and elucidating underlying disease mechanisms remains challenging. Studies involving physical stimuli, such as radiotherapy, or chemical stimuli, like drug testing, demand labor-intensive experimentation, hindering mechanistic insight and drug discovery. Here we present Squidiff, a diffusion model-based generative framework that predicts transcriptomic changes across diverse cell types in response to environmental changes. We demonstrate Squidiff's robustness across cell differentiation, gene perturbation, and drug response prediction. Through continuous denoising and semantic feature integration, Squidiff learns transient cell states and predicts high-resolution transcriptomic landscapes over time and conditions. Furthermore, we applied Squidiff to model blood vessel organoid development and cellular responses to neutron irradiation and growth factors. Our results demonstrate that Squidiff enables in silico screening of molecular landscapes, facilitating rapid hypothesis generation and providing valuable insights for precision medicine.
View details for DOI 10.1101/2024.11.16.623974
View details for PubMedID 40909548
View details for PubMedCentralID PMC12407682
-
Quantifying large language model usage in scientific papers.
Nature human behaviour
2025
Abstract
Scientific publishing is the primary means of disseminating research findings. There has been speculation about how extensively large language models (LLMs) are being used in academic writing. Here we conduct a systematic analysis across 1,121,912 preprints and published papers from January 2020 to September 2024 on arXiv, bioRxiv and Nature portfolio journals, using a population-level framework based on word frequency shifts to estimate the prevalence of LLM-modified content over time. Our findings suggest a steady increase in LLM usage, with the largest and fastest growth estimated for computer science papers (up to 22%). By comparison, mathematics papers and the Nature portfolio showed lower evidence of LLM modification (up to 9%). LLM modification estimates were higher among papers from first authors who post preprints more frequently, papers in more crowded research areas and papers of shorter lengths. Our findings suggest that LLMs are being broadly used in scientific writing.
View details for DOI 10.1038/s41562-025-02273-8
View details for PubMedID 40760036
View details for PubMedCentralID 5199034
-
Spatial multi-omics and deep learning reveal fingerprints of immunotherapy response and resistance in hepatocellular carcinoma.
bioRxiv : the preprint server for biology
2025
Abstract
Despite advances in immunotherapy treatment, nonresponse rates remain high, and mechanisms of resistance to checkpoint inhibition remain unclear. To address this gap, we performed spatial transcriptomic and proteomic profiling on human hepatocellular carcinoma tissues collected before and after immunotherapy. We developed an interpretable, multimodal deep learning framework to extract key cellular and molecular signatures from these data. Our graph neural network approach based on spatial proteomic inputs achieved outstanding performance (ROC-AUC > 0.9) in predicting patient treatment response. Key predictive features and associated spatial transcriptomic profiles revealed the multi-omic landscape of immunotherapy response and resistance. One such feature was an interface niche expressing restrictive extracellular matrix factors that physically separates tumor tissue and lymphoid aggregates in nonresponders. We integrate this and other spatially-resolved signatures into SPARC, a multi-omic "fingerprint" comprising scores for immunotherapy response and resistance mechanisms. This study lays groundwork for future patient stratification and treatment strategies in cancer immunotherapy.
View details for DOI 10.1101/2025.06.11.656869
View details for PubMedID 40661489
View details for PubMedCentralID PMC12259099
-
Encoding spatial tumour dynamics with Starfysh
NATURE REVIEWS CANCER
2024
View details for DOI 10.1038/s41568-024-00764-w
View details for Web of Science ID 001330068300001
View details for PubMedID 39394485
-
Starfysh integrates spatial transcriptomic and histologic data to reveal heterogeneous tumor-immune hubs
NATURE BIOTECHNOLOGY
2024
Abstract
Spatially resolved gene expression profiling provides insight into tissue organization and cell-cell crosstalk; however, sequencing-based spatial transcriptomics (ST) lacks single-cell resolution. Current ST analysis methods require single-cell RNA sequencing data as a reference for rigorous interpretation of cell states, mostly do not use associated histology images and are not capable of inferring shared neighborhoods across multiple tissues. Here we present Starfysh, a computational toolbox using a deep generative model that incorporates archetypal analysis and any known cell type markers to characterize known or new tissue-specific cell states without a single-cell reference. Starfysh improves the characterization of spatial dynamics in complex tissues using histology images and enables the comparison of niches as spatial hubs across tissues. Integrative analysis of primary estrogen receptor (ER)-positive breast cancer, triple-negative breast cancer (TNBC) and metaplastic breast cancer (MBC) tissues led to the identification of spatial hubs with patient- and disease-specific cell type compositions and revealed metabolic reprogramming shaping immunosuppressive hubs in aggressive MBC.
View details for DOI 10.1038/s41587-024-02173-8
View details for Web of Science ID 001190085400001
View details for PubMedID 38514799
View details for PubMedCentralID 9118175
-
Human vascular organoids with a mosaic AKT1 mutation recapitulate Proteus syndrome.
bioRxiv : the preprint server for biology
2024
Abstract
Vascular malformation, a key clinical phenotype of Proteus syndrome, lacks effective models for pathophysiological study and drug development due to limited patient sample access. To bridge this gap, we built a human vascular organoid model replicating Proteus syndrome's vasculature. Using CRISPR/Cas9 genome editing and gene overexpression, we created induced pluripotent stem cells (iPSCs) embodying the Proteus syndrome-specific AKT E17K point mutation for organoid generation. Our findings revealed that AKT overactivation in these organoids resulted in smaller sizes yet increased vascular connectivity, although with less stable connections. This phenomenon likely stems from boosted vasculogenesis triggered by AKT overactivation, leading to increased vascular sprouting. Additionally, a notable increase in dysfunctional PDGFRbeta+ mural cells, impaired in matrix secretion, was observed in these AKT-overactivated organoids. The application of AKT inhibitors (ARQ092, AZD5363, or GDC0068) reversed the vascular malformations; the inhibitors' effectiveness was directly linked to reduced connectivity in the organoids. In summary, our study introduces an innovative in vitro model combining organoid technology and gene editing to explore vascular pathophysiology in Proteus syndrome. This model not only simulates Proteus syndrome vasculature but also holds potential for mimicking vasculatures of other genetically driven diseases. It represents an advance in drug development for rare diseases, historically plagued by slow progress.
View details for DOI 10.1101/2024.01.26.577324
View details for PubMedID 38328122
https://orcid.org/0000-0001-7187-3034