Brian Hie's Profile | Stanford Profiles

Bio

I am an Assistant Professor of Chemical Engineering at Stanford University, the Dieter Schwarz Foundation Stanford Data Science Faculty Fellow, and an Innovation Investigator at Arc Institute. I supervise the Laboratory of Evolutionary Design, where we conduct research at the intersection of biology and machine learning.

I was previously a Stanford Science Fellow in the Stanford University School of Medicine and a Visiting Researcher at Meta AI. I completed my Ph.D. at MIT CSAIL and was an undergraduate at Stanford University.

Academic Appointments

Assistant Professor, Chemical Engineering
Member, Bio-X
Affiliate, Institute for Human-Centered Artificial Intelligence (HAI)
Faculty Fellow, Sarafan ChEM-H
The Dieter Schwarz Foundation SDS Faculty Fellow, Stanford Data Science

Honors & Awards

AI2050 Early Career Fellow, Schmidt Sciences (2025)
Innovation Investigator, Arc Institute (2024)
Stanford Science Fellow, Stanford University (2021)
National Defense Science and Engineering Graduate Fellowship Program, US Department of Defense (2019)

Boards, Advisory Committees, Professional Organizations

Board of Reviewing Editors, Science Magazine (2025 - Present)

Professional Education

Ph.D., Massachusetts Institute of Technology, Electrical Engineering and Computer Science (2021)
M.S., Massachusetts Institute of Technology, Electrical Engineering and Computer Science (2019)
B.S.H., Stanford University, Computer Science (2016)

Patents

Anupama Thubagere Jagadeesh, Brian Hie. "United States Patent 11,532,400 Hyperspectral scanning to determine skin health", Dec 20, 2022
Brian Hie, Bonnie Berger, Hyunghoon Cho. "United States Patent 11,450,439 Realizing private and practical pharmacological collaboration using a neural network architecture configured for reduced computation overhead", Sep 20, 2022
Brian Hie, Bryan Bryson, Bonnie Berger. "United States Patent 11,011,253 Escape profiling for therapeutic and vaccine development", May 18, 2021

2025-26 Courses

- Colloquium
  CHEMENG 699 (Aut, Spr)
- Data Science and Machine Learning Approaches in Chemical and Materials Engineering
  CHEMENG 177, CHEMENG 277, MATSCI 166, MATSCI 176 (Win)
- Data Science for Computational Molecular Biology
  DATASCI 194B, DATASCI 294B (Spr)
Independent Studies (8)
- Directed Investigation
  BIOE 392 (Aut, Win, Spr, Sum)
- Directed Reading
  BMDS 299 (Spr)
- Directed Study
  BIOE 391 (Aut, Win, Spr, Sum)
- Graduate Research
  BIOPHYS 300 (Aut, Win, Spr, Sum)
- Graduate Research in Chemical Engineering
  CHEMENG 600 (Aut, Win, Spr, Sum)
- Ph.D. Research Rotation
  CME 391 (Spr)
- Undergraduate Honors Research in Chemical Engineering
  CHEMENG 190H (Aut, Win, Spr, Sum)
- Undergraduate Research in Chemical Engineering
  CHEMENG 190 (Aut, Win, Spr, Sum)
Prior Year Courses
2024-25 Courses
- Colloquium
  CHEMENG 699 (Aut, Win, Spr)
- Data Science and Machine Learning Approaches in Chemical and Materials Engineering
  CHEMENG 177, CHEMENG 277, MATSCI 166, MATSCI 176 (Win)
- Data Science for Computational Molecular Biology
  DATASCI 194B, DATASCI 294B (Spr)

Stanford Advisees

Doctoral Dissertation Reader (AC)
Tianyu Lu, Alp Tartici
Orals Chair
Adonis Rubio, Izumi de los Rios Kobara
Postdoctoral Faculty Sponsor
Alex Hao, Lily Taylor
Doctoral Dissertation Advisor (AC)
Brandon Ameglio, Garyk Brixi, Daniel Chang, Brian Kang, Samuel King, Aditi Merchant, Talal Widatalla
Doctoral Dissertation Co-Advisor (AC)
Mia Grahn, Ivan Specht, Chloe Wen, Chang M. Yun
Doctoral (Program)
Matthew Liu, Divya Nori

All Publications

Efficient generation of epitope-targeted antibodies with Germinal. Nature biotechnology Mille-Fragoso, L. S., Driscoll, C. L., Wang, J. N., Dai, H., Widatalla, T., Zhang, J. L., Zhang, X., Rao, B., Feng, L., Hie, B. L., Gao, X. J. 2026

Abstract

Obtaining antibodies to specific protein targets is a widely important yet experimentally laborious process. Meanwhile, computational methods for antibody design have been limited by low success rates that require resource-intensive screening. Here we introduce Germinal, a broadly enabling generative pipeline that designs antibodies against specific epitopes with nanomolar binding affinities while requiring only low-n experimental testing. Our method co-optimizes antibody structure and sequence by integrating a structure predictor with an antibody-specific protein language model to perform de novo design of functional complementarity-determining regions onto a user-specified structural framework. When tested against four diverse protein targets, Germinal designed functional antibodies across all targets and binder formats, testing only 43-101 designs for each antigen. Validated designs also exhibited robust expression in mammalian cells and high sequence and structural novelty. We provide open-source code and full computational and experimental protocols to facilitate wide adoption.

View details for DOI 10.1038/s41587-026-03187-0

View details for PubMedID 42337361

View details for PubMedCentralID 431171
Genome modelling and design across all domains of life with Evo 2. Nature Brixi, G., Durrant, M. G., Ku, J., Naghipourfar, M., Poli, M., Sun, G., Brockman, G., Chang, D., Fanton, A., Gonzalez, G. A., King, S. H., Li, D. B., Merchant, A. T., Nguyen, E., Ricci-Tam, C., Romero, D. W., Schmok, J. C., Taghibakhshi, A., Vorontsov, A., Yang, B., Deng, M., Gorton, L., Nguyen, N., Wang, N. K., Pearce, M. T., Simon, E., Adams, E., Amador, Z. J., Ashley, E. A., Baccus, S. A., Dai, H., Dillmann, S., Ermon, S., Guo, D., Herschl, M. H., Ilango, R., Janik, K., Lu, A. X., Mehta, R., Mofrad, M. R., Ng, M. Y., Pannu, J., Ré, C., St John, J., Sullivan, J., Tey, J., Viggiano, B., Zhu, K., Zynda, G., Balsam, D., Collison, P., Costa, A. B., Hernandez-Boussard, T., Ho, E., Liu, M. Y., McGrath, T., Powell, K., Pinglay, S., Burke, D. P., Goodarzi, H., Hsu, P. D., Hie, B. L. 2026

Abstract

All of life encodes information with DNA. Although tools for genome sequencing, synthesis and editing have transformed biological research, we still lack sufficient understanding of the immense complexity encoded by genomes to predict the effects of many classes of genomic changes or to intelligently compose new biological systems. Artificial intelligence models that learn information from genomic sequences across diverse organisms have increasingly advanced prediction and design capabilities. Here we introduce Evo 2, a biological foundation model trained on 9 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life to have a 1 million token context window with single-nucleotide resolution. Evo 2 learns to accurately predict the functional impacts of genetic variation (from noncoding pathogenic mutations to clinically significant BRCA1 variants) without task-specific fine-tuning. Mechanistic interpretability analyses reveal that Evo 2 learns representations associated with biological features, including exon-intron boundaries, transcription factor binding sites, protein structural elements and prophage genomic regions. The generative abilities of Evo 2 produce mitochondrial, prokaryotic and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Evo 2 also generates experimentally validated chromatin accessibility patterns when guided by predictive models and inference-time search. We have made Evo 2 fully open, including model parameters, training code, inference code and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.

View details for DOI 10.1038/s41586-026-10176-5

View details for PubMedID 41781614

View details for PubMedCentralID 12057570
Rapid directed evolution guided by protein language models and epistatic interactions. Science (New York, N.Y.) Tran, V. Q., Nemeth, M., Bartie, L. J., Chandrasekaran, S. S., Fanton, A., Moon, H. C., Hie, B. L., Konermann, S., Hsu, P. D. 2026: eaea1820

Abstract

Protein engineering is limited by the inefficient search through a high-dimensional sequence space to find combinations of synergistic mutations. Traditional approaches use stepwise mutation stacking, whereas machine learning methods require extensive datasets or multiple experimental rounds and are bottlenecked by costly, length-limited gene synthesis. We present MULTI-evolve, a rapid evolution framework that systematically engineers multimutants. Our approach combines protein language models or existing functional data with epistatic modelling to predict synergistic combinations. Proposed multimutants are built through MULTI-assembly, a mutagenesis method enabling high-efficiency assembly across multikilobase sequences. Applying MULTI-evolve to three proteins achieved up to 10-fold improvements with a single round of machine learning-guided directed evolution. MULTI-evolve provides a streamlined approach for end-to-end, multimutant engineering for a broad range of protein types and functions.

View details for DOI 10.1126/science.aea1820

View details for PubMedID 41712694
Semantic design of functional de novo genes from a genomic language model. Nature Merchant, A. T., King, S. H., Nguyen, E., Hie, B. L. 2025

Abstract

Generative genomic models can design increasingly complex biological systems. However, controlling these models to generate novel sequences with desired functions remains challenging. Here, we show that Evo, a genomic language model, can leverage genomic context to perform function-guided design that accesses novel regions of sequence space. By learning semantic relationships across prokaryotic genes, Evo enables a genomic 'autocomplete' in which a DNA prompt encoding genomic context for a function of interest guides the generation of novel sequences enriched for related functions, which we refer to as 'semantic design'. We validate this approach by experimentally testing the activity of generated anti-CRISPR proteins and type II and III toxin-antitoxin systems, including de novo genes with no significant sequence similarity to natural proteins. In-context design of proteins and non-coding RNAs with Evo achieves robust activity and high experimental success rates even in the absence of structural priors, known evolutionary conservation or task-specific fine-tuning. We then use Evo to complete millions of prompts to produce SynGenome, a database containing over 120 billion base pairs of artificial intelligence-generated genomic sequences that enables semantic design across many functions. More broadly, these results demonstrate that generative genomics with biological language models can extend beyond natural sequences.

View details for DOI 10.1038/s41586-025-09749-7

View details for PubMedID 41261132

View details for PubMedCentralID 12057570
Efficient generation of epitope-targeted de novo antibodies with Germinal. bioRxiv : the preprint server for biology Mille-Fragoso, L. S., Wang, J. N., Driscoll, C. L., Dai, H., Widatalla, T., Zhang, X., Hie, B. L., Gao, X. J. 2025

Abstract

Obtaining novel antibodies against specific protein targets is a widely important yet experimentally laborious process. Meanwhile, computational methods for antibody design have been limited by low success rates that currently require resource-intensive screening. Here, we introduce Germinal, a broadly enabling generative framework that designs antibodies against specific epitopes with nanomolar binding affinities while requiring only low-n experimental testing. Our method co-optimizes antibody structure and sequence by integrating a structure predictor with an antibody-specific protein language model to perform de novo design of functional complementarity-determining regions (CDRs) onto a user-specified structural framework. When tested against four diverse protein targets, Germinal achieved an experimental success rate of 4-22% across all targets, testing only 43-101 designs for each antigen. Validated nanobodies also exhibited robust expression in mammalian cells and nanomolar binding affinities. We provide open-source code and full computational and experimental protocols to facilitate wide adoption. Germinal represents a milestone in efficient, epitope-targeted de novo antibody design, with notable implications for the development of molecular tools and therapeutics.

View details for DOI 10.1101/2025.09.19.677421

View details for PubMedID 41040335
Sidechain conditioning and modeling for full-atom protein sequence design with FAMPNN. Proceedings of machine learning research Widatalla, T., Shuai, R. W., Hie, B. L., Huang, P. S. 2025; 267: 66746-66771

Abstract

Leading deep learning-based methods for fixed-backbone protein sequence design do not model protein sidechain conformation during sequence generation despite the large role that the three-dimensional arrangement of sidechain atoms plays in protein conformation, stability, and overall protein function. Instead, these models implicitly reason about crucial sidechain interactions based on backbone geometry and known amino acid sequence labels. To address this, we present FAMPNN (Full-Atom MPNN), a sequence design method that explicitly models both sequence identity and sidechain conformation for each residue, where the per-token distribution of a residue's discrete amino acid identity and its continuous sidechain conformation are learned with a combined categorical cross-entropy and diffusion loss objective. We demonstrate that learning these distributions jointly is a highly synergistic task that improves sequence recovery while achieving state-of-the-art sidechain packing. Furthermore, benefits from full-atom modeling generalize from sequence recovery to practical protein design applications, such as zero-shot prediction of experimental binding and stability measurements.

View details for DOI 10.1101/2024.09.25.614868

View details for PubMedID 41307002

View details for PubMedCentralID PMC12646570
Utilizing Machine Learning to Improve Neutralization Potency of an HIV-1 Antibody Targeting the gp41 N-Heptad Repeat. ACS chemical biology Filsinger Interrante, M. V., Tang, S., Kim, S., Shanker, V. R., Hie, B. L., Bruun, T. U., Wu, W., Pak, J. E., Fernandez, D., Kim, P. S. 2025

Abstract

The N-heptad repeat (NHR) of the HIV-1 gp41 prehairpin intermediate (PHI) is an attractive potential vaccine target with high sequence conservation across diverse strains. However, despite the potency of NHR-targeting peptides and clinical efficacy of the NHR-targeting entry inhibitor enfuvirtide, no potently neutralizing NHR-directed monoclonal antibodies (mAbs) or antisera have been identified or elicited to date. The lack of potent NHR-binding mAbs both dampens enthusiasm for vaccine development efforts at this target and presents a barrier to performing passive immunization experiments with NHR-targeting antibodies. To address this challenge, we previously developed an improved variant of the NHR-directed mAb D5, called D5_AR, which is capable of neutralizing diverse tier-2 viruses. Building on that work, here we present the 2.7 Å crystal structure of D5_AR bound to NHR mimetic peptide IQN17. We then utilize protein language models and supervised machine learning to generate small (n < 100) libraries of D5_AR variants that are subsequently screened for improved neutralization potency. We identify a variant with 5-fold improved neutralization potency, D5_FI, which is the most potent NHR-directed monoclonal antibody characterized to date and exhibits broad neutralization of tier-2 and -3 pseudoviruses as well as replicating R5 and X4 challenge strains. Additionally, our work highlights the ability of protein language models to efficiently identify improved mAb variants from relatively small libraries.

View details for DOI 10.1021/acschembio.5c00035

View details for PubMedID 40540236
Sidechain conditioning and modeling for full-atom protein sequence design with FAMPNN Widatalla, T., Shuai, R. W., Hie, B. L., Huang, P. edited by Singh, A., Fazel, M., Hsu, D., Lacoste-Julien, S., Berkenkamp, F., Maharaj, T., Wagstaff, K., Zhu, J. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2025: 66746-66771

View details for Web of Science ID 001693167600260
Sequence modeling and design from molecular to genome scale with Evo. Science (New York, N.Y.) Nguyen, E., Poli, M., Durrant, M. G., Kang, B., Katrekar, D., Li, D. B., Bartie, L. J., Thomas, A. W., King, S. H., Brixi, G., Sullivan, J., Ng, M. Y., Lewis, A., Lou, A., Ermon, S., Baccus, S. A., Hernandez-Boussard, T., Re, C., Hsu, P. D., Hie, B. L. 2024; 386 (6723): eado9336

Abstract

The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific language models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These prediction and generation capabilities span molecular to genomic scales of complexity, advancing our understanding and control of biology.

View details for DOI 10.1126/science.ado9336

View details for PubMedID 39541441
Unsupervised evolution of protein and antibody complexes with a structure-informed language model. Science (New York, N.Y.) Shanker, V. R., Bruun, T. U., Hie, B. L., Kim, P. S. 2024; 385 (6704): 46-53

Abstract

Large language models trained on sequence information alone can learn high-level principles of protein design. However, beyond sequence, the three-dimensional structures of proteins determine their specific function, activity, and evolvability. Here, we show that a general protein language model augmented with protein structure backbone coordinates can guide evolution for diverse proteins without the need to model individual functional tasks. We also demonstrate that ESM-IF1, which was only trained on single-chain structures, can be extended to engineer protein complexes. Using this approach, we screened about 30 variants of two therapeutic clinical antibodies used to treat severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. We achieved up to 25-fold improvement in neutralization and 37-fold improvement in affinity against antibody-escaped viral variants of concern BQ.1.1 and XBB.1.5, respectively. These findings highlight the advantage of integrating structural information to identify efficient protein evolution trajectories without requiring any task-specific training data.

View details for DOI 10.1126/science.adk8946

View details for PubMedID 38963838
Scanorama: integrating large and diverse single-cell transcriptomic datasets. Nature protocols Hie, B. L., Kim, S., Rando, T. A., Bryson, B., Berger, B. 2024

Abstract

Merging diverse single-cell RNA sequencing (scRNA-seq) data from numerous experiments, laboratories and technologies can uncover important biological insights. Nonetheless, integrating scRNA-seq data encounters special challenges when the datasets are composed of diverse cell type compositions. Scanorama offers a robust solution for improving the quality and interpretation of heterogeneous scRNA-seq data by effectively merging information from diverse sources. Scanorama is designed to address the technical variation introduced by differences in sample preparation, sequencing depth and experimental batches that can confound the analysis of multiple scRNA-seq datasets. Here we provide a detailed protocol for using Scanorama within a Scanpy-based single-cell analysis workflow coupled with Google Colaboratory, a cloud-based free Jupyter notebook environment service. The protocol involves Scanorama integration, a process that typically spans 0.5-3 h. Scanorama integration requires a basic understanding of cellular biology, transcriptomic technologies and bioinformatics. Our protocol and new Scanorama-Colaboratory resource should make scRNA-seq integration more widely accessible to researchers.

View details for DOI 10.1038/s41596-024-00991-3

View details for PubMedID 38844552
Generative artificial intelligence for de novo protein design. Current opinion in structural biology Winnifrith, A., Outeiral, C., Hie, B. L. 2024; 86: 102794

Abstract

Engineering new molecules with desirable functions and properties has the potential to extend our ability to engineer proteins beyond what nature has so far evolved. Advances in the so-called 'de novo' design problem have recently been brought forward by developments in artificial intelligence. Generative architectures, such as language models and diffusion processes, seem adept at generating novel, yet realistic proteins that display desirable properties and perform specified functions. State-of-the-art design protocols now achieve experimental success rates nearing 20%, thus widening the access to de novo designed proteins. Despite extensive progress, there are clear field-wide challenges, for example, in determining the best in silico metrics to prioritise designs for experimental testing, and in designing proteins that can undergo large conformational changes or be regulated by post-translational modifications. With an increase in the number of models being developed, this review provides a framework to understand how these tools fit into the overall process of de novo protein design. Throughout, we highlight the power of incorporating biochemical knowledge to improve performance and interpretability.

View details for DOI 10.1016/j.sbi.2024.102794

View details for PubMedID 38663170
Inverse folding of protein complexes with a structure-informed language model enables unsupervised antibody evolution. bioRxiv : the preprint server for biology Shanker, V. R., Bruun, T. U., Hie, B. L., Kim, P. S. 2023

Abstract

Large language models trained on sequence information alone are capable of learning high level principles of protein design. However, beyond sequence, the three-dimensional structures of proteins determine their specific function, activity, and evolvability. Here we show that a general protein language model augmented with protein structure backbone coordinates and trained on the inverse folding problem can guide evolution for diverse proteins without needing to explicitly model individual functional tasks. We demonstrate inverse folding to be an effective unsupervised, structure-based sequence optimization strategy that also generalizes to multimeric complexes by implicitly learning features of binding and amino acid epistasis. Using this approach, we screened ~30 variants of two therapeutic clinical antibodies used to treat SARS-CoV-2 infection and achieved up to 26-fold improvement in neutralization and 37-fold improvement in affinity against antibody-escaped viral variants-of-concern BQ.1.1 and XBB.1.5, respectively. In addition to substantial overall improvements in protein function, we find inverse folding performs with leading experimental success rates among other reported machine learning-guided directed evolution methods, without requiring any task-specific training data.

View details for DOI 10.1101/2023.12.19.572475

View details for PubMedID 38187780

View details for PubMedCentralID PMC10769282
Machine Learning for Protein Engineering. ArXiv Johnston, K. E., Fannjiang, C., Wittmann, B. J., Hie, B. L., Yang, K. K., Wu, Z. 2023

Abstract

Directed evolution of proteins has been the most effective method for protein engineering. However, a new paradigm is emerging, fusing the library generation and screening approaches of traditional directed evolution with computation through the training of machine learning models on protein sequence fitness data. This chapter highlights successful applications of machine learning to protein engineering and directed evolution, organized by the improvements that have been made with respect to each step of the directed evolution cycle. Additionally, we provide an outlook for the future based on the current direction of the field, namely in the development of calibrated models and in incorporating other modalities, such as protein structure.

View details for DOI 10.1038/nrm2805

View details for PubMedID 37292483

View details for PubMedCentralID PMC10246115
Efficient evolution of human antibodies from general protein language models. Nature biotechnology Hie, B. L., Shanker, V. R., Xu, D., Bruun, T. U., Weidenbacher, P. A., Tang, S., Wu, W., Pak, J. E., Kim, P. S. 2023

Abstract

Natural evolution must explore a vast landscape of possible sequences for desirable yet rare mutations, suggesting that learning from natural evolutionary strategies could guide artificial evolution. Here we report that general protein language models can efficiently evolve human antibodies by suggesting mutations that are evolutionarily plausible, despite providing the model with no information about the target antigen, binding specificity or protein structure. We performed language-model-guided affinity maturation of seven antibodies, screening 20 or fewer variants of each antibody across only two rounds of laboratory evolution, and improved the binding affinities of four clinically relevant, highly mature antibodies up to sevenfold and three unmatured antibodies up to 160-fold, with many designs also demonstrating favorable thermostability and viral neutralization activity against Ebola and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pseudoviruses. The same models that improve antibody binding also guide efficient evolution across diverse protein families and selection pressures, including antibiotic resistance and enzyme activity, suggesting that these results generalize to many settings.

View details for DOI 10.1038/s41587-023-01763-2

View details for PubMedID 37095349

View details for PubMedCentralID 4410700
Evolutionary-scale prediction of atomic-level protein structure with a language model. Science (New York, N.Y.) Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., Dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., Rives, A. 2023; 379 (6637): 1123-1130

Abstract

Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.

View details for DOI 10.1126/science.ade2574

View details for PubMedID 36927031
Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell systems Hie, B. L., Yang, K. K., Kim, P. S. 2022

Abstract

The degree to which evolution is predictable is a fundamental question in biology. Previous attempts to predict the evolution of protein sequences have been limited to specific proteins and to small changes, such as single-residue mutations. Here, we demonstrate that by using a protein language model to predict the local evolution within protein families, we recover a dynamic "vector field" of protein evolution that we call evolutionary velocity (evo-velocity). Evo-velocity generalizes to evolution over vastly different timescales, from viral proteins evolving over years to eukaryotic proteins evolving over geologic eons, and can predict the evolutionary dynamics of proteins that were not used to develop the original model. Evo-velocity also yields new evolutionary insights by predicting strategies of viral-host immune escape, resolving conflicting theories on the evolution of serpins, and revealing a key role of horizontal gene transfer in the evolution of eukaryotic glycolysis.

View details for DOI 10.1016/j.cels.2022.01.003

View details for PubMedID 35120643
Predicting the mutational drivers of future SARS-CoV-2 variants of concern. Science translational medicine Maher, M. C., Bartha, I., Weaver, S., di Iulio, J., Ferri, E., Soriaga, L., Lempp, F. A., Hie, B. L., Bryson, B., Berger, B., Robertson, D. L., Snell, G., Corti, D., Virgin, H. W., Kosakovsky Pond, S. L., Telenti, A. 2022: eabk3445

View details for DOI 10.1126/scitranslmed.abk3445

View details for PubMedID 35014856
Adaptive machine learning for protein engineering. Current opinion in structural biology Hie, B. L., Yang, K. K. 2022; 72: 145-152

Abstract

Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.

View details for DOI 10.1016/j.sbi.2021.11.002

View details for PubMedID 34896756
Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities. Genome biology Singh, R., Hie, B. L., Narayan, A., Berger, B. 2021; 22 (1): 131

Abstract

A complete understanding of biological processes requires synthesizing information across heterogeneous modalities, such as age, disease status, or gene expression. Technological advances in single-cell profiling have enabled researchers to assay multiple modalities simultaneously. We present Schema, which uses a principled metric learning strategy that identifies informative features in a modality to synthesize disparate modalities into a single coherent interpretation. We use Schema to infer cell types by integrating gene expression and chromatin accessibility data; demonstrate informative data visualizations that synthesize multiple modalities; perform differential gene expression analysis in the context of spatial variability; and estimate evolutionary pressure on peptide sequences.

View details for DOI 10.1186/s13059-021-02313-2

View details for Web of Science ID 000656147300001

View details for PubMedID 33941239

View details for PubMedCentralID PMC8091541
Learning the language of viral evolution and escape SCIENCE Hie, B., Zhong, E. D., Berger, B., Bryson, B. 2021; 371 (6526): 284-+

Abstract

The ability for viruses to mutate and evade the human immune system and cause infection, called viral escape, remains an obstacle to antiviral and vaccine development. Understanding the complex rules that govern escape could inform therapeutic design. We modeled viral escape with machine learning algorithms originally developed for human natural language. We identified escape mutations as those that preserve viral infectivity but cause a virus to look different to the immune system, akin to word changes that preserve a sentence's grammaticality but change its meaning. With this approach, language models of influenza hemagglutinin, HIV-1 envelope glycoprotein (HIV Env), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike viral proteins can accurately predict structural escape patterns using sequence data alone. Our study represents a promising conceptual bridge between natural language and viral evolution.

View details for DOI 10.1126/science.abd7331

View details for Web of Science ID 000607782500053

View details for PubMedID 33446556
Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design CELL SYSTEMS Hie, B., Bryson, B. D., Berger, B. 2020; 11 (5): 461-+

Abstract

Machine learning that generates biological hypotheses has transformative potential, but most learning algorithms are susceptible to pathological failure when exploring regimes beyond the training data distribution. A solution to address this issue is to quantify prediction uncertainty so that algorithms can gracefully handle novel phenomena that confound standard methods. Here, we demonstrate the broad utility of robust uncertainty prediction in biological discovery. By leveraging Gaussian process-based uncertainty prediction on modern pre-trained features, we train a model on just 72 compounds to make predictions over a 10,833-compound library, identifying and experimentally validating compounds with nanomolar affinity for diverse kinases and whole-cell growth inhibition of Mycobacterium tuberculosis. Uncertainty facilitates a tight iterative loop between computation and experimentation and generalizes across biological domains as diverse as protein engineering and single-cell transcriptomics. More broadly, our work demonstrates that uncertainty should play a key role in the increasing adoption of machine learning algorithms into the experimental lifecycle.

View details for DOI 10.1016/j.cels.2020.09.007

View details for Web of Science ID 000592218000004

View details for PubMedID 33065027
Computational Methods for Single-Cell RNA Sequencing ANNUAL REVIEW OF BIOMEDICAL DATA SCIENCE, VOL 3, 2020 Hie, B., Peters, J., Nyquist, S. K., Shalek, A. K., Berger, B., Bryson, B. D. edited by Altman, R. B. 2020; 3: 339-364

View details for DOI 10.1146/annurev-biodatasci-012220-100601

View details for Web of Science ID 000613910200014
Geometric Sketching Compactly Summarizes the Single-Cell Transcriptomic Landscape CELL SYSTEMS Hie, B., Cho, H., DeMeo, B., Bryson, B., Berger, B. 2019; 8 (6): 483-+

Abstract

Large-scale single-cell RNA sequencing (scRNA-seq) studies that profile hundreds of thousands of cells are becoming increasingly common, overwhelming existing analysis pipelines. Here, we describe how to enhance and accelerate single-cell data analysis by summarizing the transcriptomic heterogeneity within a dataset using a small subset of cells, which we refer to as a geometric sketch. Our sketches provide more comprehensive visualization of transcriptional diversity, capture rare cell types with high sensitivity, and reveal biological cell types via clustering. Our sketch of umbilical cord blood cells uncovers a rare subpopulation of inflammatory macrophages, which we experimentally validated. The construction of our sketches is extremely fast, which enabled us to accelerate other crucial resource-intensive tasks, such as scRNA-seq data integration, while maintaining accuracy. We anticipate our algorithm will become an increasingly essential step when sharing and analyzing the rapidly growing volume of scRNA-seq data and help enable the democratization of single-cell omics.

View details for DOI 10.1016/j.cels.2019.05.003

View details for Web of Science ID 000472959800004

View details for PubMedID 31176620

View details for PubMedCentralID PMC6597305
Efficient integration of heterogeneous single-cell transcriptomes using Scanorama NATURE BIOTECHNOLOGY Hie, B., Bryson, B., Berger, B. 2019; 37 (6): 685-+

Abstract

Integration of single-cell RNA sequencing (scRNA-seq) data from multiple experiments, laboratories and technologies can uncover biological insights, but current methods for scRNA-seq data integration are limited by a requirement for datasets to derive from functionally similar cells. We present Scanorama, an algorithm that identifies and merges the shared cell types among all pairs of datasets and accurately integrates heterogeneous collections of scRNA-seq data. We applied Scanorama to integrate and remove batch effects across 105,476 cells from 26 diverse scRNA-seq experiments representing 9 different technologies. Scanorama is sensitive to subtle temporal changes within the same cell lineage, successfully integrating functionally similar cells across time series data of CD14+ monocytes at different stages of differentiation into macrophages. Finally, we show that Scanorama is orders of magnitude faster than existing techniques and can integrate a collection of 1,095,538 cells in just ~9 h.

View details for DOI 10.1038/s41587-019-0113-3

View details for Web of Science ID 000470108400020

View details for PubMedID 31061482

View details for PubMedCentralID PMC6551256
Fine-mapping cis-regulatory variants in diverse human populations ELIFE Tehranchi, A., Hie, B., Dacre, M., Kaplow, I., Pettie, K., Combs, P., Fraser, H. B. 2019; 8

View details for DOI 10.7554/eLife.39595

View details for Web of Science ID 000455701800001
Realizing private and practical pharmacological collaboration SCIENCE Hie, B., Cho, H., Berger, B. 2018; 362 (6412): 347-350

Abstract

Although combining data from multiple entities could power life-saving breakthroughs, open sharing of pharmacological data is generally not viable because of data privacy and intellectual property concerns. To this end, we leverage modern cryptographic tools to introduce a computational protocol for securely training a predictive model of drug-target interactions (DTIs) on a pooled dataset that overcomes barriers to data sharing by provably ensuring the confidentiality of all underlying drugs, targets, and observed interactions. Our protocol runs within days on a real dataset of more than 1 million interactions and is more accurate than state-of-the-art DTI prediction methods. Using our protocol, we discover previously unidentified DTIs that we experimentally validated via targeted assays. Our work lays a foundation for more effective and cooperative biomedical research.

View details for DOI 10.1126/science.aat4807

View details for Web of Science ID 000447680100050

View details for PubMedID 30337410

View details for PubMedCentralID PMC6519716
Pooled ChIP-Seq Links Variation in Transcription Factor Binding to Complex Disease Risk CELL Tehranchi, A. K., Myrthil, M., Martin, T., Hie, B. L., Golan, D., Fraser, H. B. 2016; 165 (3): 730-741

Abstract

Cis-regulatory elements such as transcription factor (TF) binding sites can be identified genome-wide, but it remains far more challenging to pinpoint genetic variants affecting TF binding. Here, we introduce a pooling-based approach to mapping quantitative trait loci (QTLs) for molecular-level traits. Applying this to five TFs and a histone modification, we mapped thousands of cis-acting QTLs, with over 25-fold lower cost compared to standard QTL mapping. We found that single genetic variants frequently affect binding of multiple TFs, and CTCF can recruit all five TFs to its binding sites. These QTLs often affect local chromatin and transcription but can also influence long-range chromosomal contacts, demonstrating a role for natural genetic variation in chromosomal architecture. Thousands of these QTLs have been implicated in genome-wide association studies, providing candidate molecular mechanisms for many disease risk loci and suggesting that TF binding variation may underlie a large fraction of human phenotypic variation.

View details for DOI 10.1016/j.cell.2016.03.041

View details for Web of Science ID 000374636800029

View details for PubMedID 27087447

View details for PubMedCentralID PMC4842172
