I’m currently a Stanford Science Fellow in the Stanford University School of Medicine, where I develop algorithms and machine learning methods with a focus on biological applications.
I did my Ph.D. at the Computer Science and Artificial Intelligence Lab at MIT and was an undergraduate at Stanford University. I’ve also worked on machine learning for early-pipeline moonshots at Google X and for health-related applications at Illumina.
Honors & Awards
Stanford Science Fellow, Stanford University (2021)
National Defense Science and Engineering Graduate Fellowship Program, US Department of Defense (2019)
Professional Education
Bachelor of Science, Stanford University, ENGL-MIN (2016)
Bachelor of Science, Stanford University, CS-BSH (2016)
Ph.D., Massachusetts Institute of Technology, Electrical Engineering and Computer Science (2021)
Stanford Advisors
Peter Kim, Postdoctoral Faculty Sponsor
Patents
Brian Hie, Bryan Bryson, Bonnie Berger. "United States Patent 11,011,253 Escape profiling for therapeutic and vaccine development"
Publications
Evolutionary-scale prediction of atomic-level protein structure with a language model.
Science (New York, N.Y.)
2023; 379 (6637): 1123-1130
Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.
View details for DOI 10.1126/science.ade2574
View details for PubMedID 36927031
Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins.
The degree to which evolution is predictable is a fundamental question in biology. Previous attempts to predict the evolution of protein sequences have been limited to specific proteins and to small changes, such as single-residue mutations. Here, we demonstrate that by using a protein language model to predict the local evolution within protein families, we recover a dynamic "vector field" of protein evolution that we call evolutionary velocity (evo-velocity). Evo-velocity generalizes to evolution over vastly different timescales, from viral proteins evolving over years to eukaryotic proteins evolving over geologic eons, and can predict the evolutionary dynamics of proteins that were not used to develop the original model. Evo-velocity also yields new evolutionary insights by predicting strategies of viral-host immune escape, resolving conflicting theories on the evolution of serpins, and revealing a key role of horizontal gene transfer in the evolution of eukaryotic glycolysis.
View details for DOI 10.1016/j.cels.2022.01.003
View details for PubMedID 35120643
- Predicting the mutational drivers of future SARS-CoV-2 variants of concern. Science Translational Medicine 2022: eabk3445
Adaptive machine learning for protein engineering.
Current opinion in structural biology
2022; 72: 145-152
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
View details for DOI 10.1016/j.sbi.2021.11.002
View details for PubMedID 34896756
Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities
2021; 22 (1): 131
A complete understanding of biological processes requires synthesizing information across heterogeneous modalities, such as age, disease status, or gene expression. Technological advances in single-cell profiling have enabled researchers to assay multiple modalities simultaneously. We present Schema, which uses a principled metric learning strategy that identifies informative features in a modality to synthesize disparate modalities into a single coherent interpretation. We use Schema to infer cell types by integrating gene expression and chromatin accessibility data; demonstrate informative data visualizations that synthesize multiple modalities; perform differential gene expression analysis in the context of spatial variability; and estimate evolutionary pressure on peptide sequences.
View details for DOI 10.1186/s13059-021-02313-2
View details for Web of Science ID 000656147300001
View details for PubMedID 33941239
View details for PubMedCentralID PMC8091541
Learning the language of viral evolution and escape
2021; 371 (6526): 284-+
The ability for viruses to mutate and evade the human immune system and cause infection, called viral escape, remains an obstacle to antiviral and vaccine development. Understanding the complex rules that govern escape could inform therapeutic design. We modeled viral escape with machine learning algorithms originally developed for human natural language. We identified escape mutations as those that preserve viral infectivity but cause a virus to look different to the immune system, akin to word changes that preserve a sentence's grammaticality but change its meaning. With this approach, language models of influenza hemagglutinin, HIV-1 envelope glycoprotein (HIV Env), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike viral proteins can accurately predict structural escape patterns using sequence data alone. Our study represents a promising conceptual bridge between natural language and viral evolution.
View details for DOI 10.1126/science.abd7331
View details for Web of Science ID 000607782500053
View details for PubMedID 33446556
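The abstract above describes escape mutations as those that stay "grammatical" while changing "meaning". A minimal sketch of how candidate mutations could be prioritized on those two axes follows. It assumes each mutation already has a grammaticality score and a semantic-change score from some language model; the mutation names, the scores, the rank-sum combination, and the `beta` weight are all illustrative assumptions, not the paper's confirmed procedure.

```python
def rank_ascending(values):
    """Return ranks (1 = smallest value); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def escape_priority(mutations, grammaticality, semantic_change, beta=1.0):
    """Order mutations so that those scoring high on both axes come first.

    The combined score is rank(semantic_change) + beta * rank(grammaticality),
    so a mutation must do well on both criteria to reach the top.
    """
    g = rank_ascending(grammaticality)
    s = rank_ascending(semantic_change)
    scores = [s_i + beta * g_i for g_i, s_i in zip(g, s)]
    order = sorted(range(len(mutations)), key=lambda i: scores[i], reverse=True)
    return [mutations[i] for i in order]


# Hypothetical toy scores for four made-up mutations.
mutations = ["A1C", "D2E", "G3H", "K4L"]
grammaticality = [0.90, 0.10, 0.80, 0.20]   # higher = more plausible sequence
semantic_change = [0.10, 0.90, 0.70, 0.20]  # higher = looks more different
ranked = escape_priority(mutations, grammaticality, semantic_change)
print(ranked)
```

Using ranks rather than raw scores keeps the two axes comparable even when they live on different scales. In this toy example, the mutation that is moderately high on both axes outranks mutations that are extreme on only one.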
Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design
2020; 11 (5): 461-+
Machine learning that generates biological hypotheses has transformative potential, but most learning algorithms are susceptible to pathological failure when exploring regimes beyond the training data distribution. A solution to address this issue is to quantify prediction uncertainty so that algorithms can gracefully handle novel phenomena that confound standard methods. Here, we demonstrate the broad utility of robust uncertainty prediction in biological discovery. By leveraging Gaussian process-based uncertainty prediction on modern pre-trained features, we train a model on just 72 compounds to make predictions over a 10,833-compound library, identifying and experimentally validating compounds with nanomolar affinity for diverse kinases and whole-cell growth inhibition of Mycobacterium tuberculosis. Uncertainty facilitates a tight iterative loop between computation and experimentation and generalizes across biological domains as diverse as protein engineering and single-cell transcriptomics. More broadly, our work demonstrates that uncertainty should play a key role in the increasing adoption of machine learning algorithms into the experimental lifecycle.
View details for DOI 10.1016/j.cels.2020.09.007
View details for Web of Science ID 000592218000004
View details for PubMedID 33065027
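As a rough illustration of how prediction uncertainty can shape which candidates are tested next, the sketch below ranks compounds by a confidence-bound score. It assumes a model such as a Gaussian process has already produced a predictive mean and standard deviation per compound. The compound names, the numbers, and the `mean - beta * std` rule are hypothetical and are not taken from the paper.

```python
def rank_by_confidence_bound(names, means, stds, beta=1.0):
    """Rank candidates by predicted mean penalized by predictive uncertainty.

    beta = 0 ignores uncertainty; a larger beta favors confident predictions.
    """
    scores = {n: m - beta * s for n, m, s in zip(names, means, stds)}
    return sorted(names, key=lambda n: scores[n], reverse=True)


# Hypothetical predictions (higher mean = stronger predicted affinity).
names = ["cmpd_a", "cmpd_b", "cmpd_c"]
means = [8.0, 9.0, 7.5]
stds = [0.5, 3.0, 0.2]

conservative = rank_by_confidence_bound(names, means, stds, beta=1.0)
greedy = rank_by_confidence_bound(names, means, stds, beta=0.0)
print(conservative, greedy)
```

With `beta=0.0` the highest predicted mean wins regardless of how uncertain it is. With `beta=1.0` the highly uncertain candidate drops to the bottom, which is the kind of behavior that guards against overconfident predictions far from the training data.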
- Computational Methods for Single-Cell RNA Sequencing. Annual Review of Biomedical Data Science 2020; 3: 339-364
Geometric Sketching Compactly Summarizes the Single-Cell Transcriptomic Landscape
2019; 8 (6): 483-+
Large-scale single-cell RNA sequencing (scRNA-seq) studies that profile hundreds of thousands of cells are becoming increasingly common, overwhelming existing analysis pipelines. Here, we describe how to enhance and accelerate single-cell data analysis by summarizing the transcriptomic heterogeneity within a dataset using a small subset of cells, which we refer to as a geometric sketch. Our sketches provide more comprehensive visualization of transcriptional diversity, capture rare cell types with high sensitivity, and reveal biological cell types via clustering. Our sketch of umbilical cord blood cells uncovers a rare subpopulation of inflammatory macrophages, which we experimentally validated. The construction of our sketches is extremely fast, which enabled us to accelerate other crucial resource-intensive tasks, such as scRNA-seq data integration, while maintaining accuracy. We anticipate our algorithm will become an increasingly essential step when sharing and analyzing the rapidly growing volume of scRNA-seq data and help enable the democratization of single-cell omics.
View details for DOI 10.1016/j.cels.2019.05.003
View details for Web of Science ID 000472959800004
View details for PubMedID 31176620
View details for PubMedCentralID PMC6597305
Efficient integration of heterogeneous single-cell transcriptomes using Scanorama
2019; 37 (6): 685-+
Integration of single-cell RNA sequencing (scRNA-seq) data from multiple experiments, laboratories and technologies can uncover biological insights, but current methods for scRNA-seq data integration are limited by a requirement for datasets to derive from functionally similar cells. We present Scanorama, an algorithm that identifies and merges the shared cell types among all pairs of datasets and accurately integrates heterogeneous collections of scRNA-seq data. We applied Scanorama to integrate and remove batch effects across 105,476 cells from 26 diverse scRNA-seq experiments representing 9 different technologies. Scanorama is sensitive to subtle temporal changes within the same cell lineage, successfully integrating functionally similar cells across time series data of CD14+ monocytes at different stages of differentiation into macrophages. Finally, we show that Scanorama is orders of magnitude faster than existing techniques and can integrate a collection of 1,095,538 cells in just ~9 h.
View details for DOI 10.1038/s41587-019-0113-3
View details for Web of Science ID 000470108400020
View details for PubMedID 31061482
View details for PubMedCentralID PMC6551256
- Fine-mapping cis-regulatory variants in diverse human populations. eLife 2019; 8
Realizing private and practical pharmacological collaboration
2018; 362 (6412): 347-350
Although combining data from multiple entities could power life-saving breakthroughs, open sharing of pharmacological data is generally not viable because of data privacy and intellectual property concerns. To this end, we leverage modern cryptographic tools to introduce a computational protocol for securely training a predictive model of drug-target interactions (DTIs) on a pooled dataset that overcomes barriers to data sharing by provably ensuring the confidentiality of all underlying drugs, targets, and observed interactions. Our protocol runs within days on a real dataset of more than 1 million interactions and is more accurate than state-of-the-art DTI prediction methods. Using our protocol, we discover previously unidentified DTIs that we experimentally validated via targeted assays. Our work lays a foundation for more effective and cooperative biomedical research.
View details for DOI 10.1126/science.aat4807
View details for Web of Science ID 000447680100050
View details for PubMedID 30337410
View details for PubMedCentralID PMC6519716
Pooled ChIP-Seq Links Variation in Transcription Factor Binding to Complex Disease Risk
2016; 165 (3): 730-741
Cis-regulatory elements such as transcription factor (TF) binding sites can be identified genome-wide, but it remains far more challenging to pinpoint genetic variants affecting TF binding. Here, we introduce a pooling-based approach to mapping quantitative trait loci (QTLs) for molecular-level traits. Applying this to five TFs and a histone modification, we mapped thousands of cis-acting QTLs, with over 25-fold lower cost compared to standard QTL mapping. We found that single genetic variants frequently affect binding of multiple TFs, and CTCF can recruit all five TFs to its binding sites. These QTLs often affect local chromatin and transcription but can also influence long-range chromosomal contacts, demonstrating a role for natural genetic variation in chromosomal architecture. Thousands of these QTLs have been implicated in genome-wide association studies, providing candidate molecular mechanisms for many disease risk loci and suggesting that TF binding variation may underlie a large fraction of human phenotypic variation.
View details for DOI 10.1016/j.cell.2016.03.041
View details for Web of Science ID 000374636800029
View details for PubMedID 27087447
View details for PubMedCentralID PMC4842172