Work Experience


  • Junior Machine Learning Scientist, ProteinQure (May 3, 2021 - August 31, 2022)

    Location

    Toronto, Ontario, Canada

All Publications


  • Ensemble-conditioned protein sequence design with Caliby. bioRxiv : the preprint server for biology Shuai, R. W., Lu, T., Bhatti, S., Kouba, P., Huang, P. S. 2025

    Abstract

    Structure-conditioned sequence design models aim to design a protein sequence that will fold into a given target structure. Deep-learning-based approaches for sequence design have proven highly successful for various protein design applications, but many non-idealized backbones still remain out of reach for current models under typical in silico success criteria. We hypothesize that training objectives prioritizing native sequence recovery unintentionally push models to reproduce non-structural signals (e.g. phylogenetic relatedness, neutral drift, or dataset sampling biases), rather than a broadly generalizable structure-sequence mapping. Inspired by recent work bridging sequence likelihood and fitness prediction in protein language models, we introduce Caliby, a Potts model-based sequence design method capable of conditioning on an ensemble of structures. Conditioning on a synthetic ensemble generated from an input backbone allows sampling of sequences consistent with the structural constraints of the ensemble while averaging out undesired biases towards the native sequence. Ensemble-conditioned sequence design with Caliby reduces native sequence recovery while substantially improving AlphaFold2 self-consistency, outperforming state-of-the-art models ProteinMPNN and ChromaDesign on both native and de novo backbones. Finally, we train a variant of Caliby on only soluble proteins and demonstrate in silico that Protpardelle-1c binder designs that were previously deemed undesignable by SolubleMPNN are actually designable under SolubleCaliby, highlighting limitations of existing filtering pipelines. These results suggest that Caliby can expand the de novo design space beyond highly idealized backbones.

    View details for DOI 10.1101/2025.09.30.679633

    View details for PubMedID 41256639

    View details for PubMedCentralID PMC12621727

  • SLAE: Strictly Local All-atom Environment for Protein Representation. bioRxiv : the preprint server for biology Chen, Y., Zhao, C., Huang, P. S., Lu, T., Wayment-Steele, H. K. 2025

    Abstract

    Building physically grounded protein representations is central to computational biology, yet most existing approaches rely on sequence-pretrained language models or backbone-only graphs that overlook side-chain geometry and chemical detail. We present SLAE, a unified all-atom framework for learning protein representations from each residue's local atomic neighborhood using only atom types and interatomic geometries. To encourage expressive feature extraction, we introduce a novel multi-task autoencoder objective that combines coordinate reconstruction, sequence recovery, and energy regression. SLAE reconstructs all-atom structures with high fidelity from latent residue environments and achieves state-of-the-art performance across diverse downstream tasks via transfer learning. SLAE's latent space is chemically informative and environmentally sensitive, enabling quantitative assessment of structural qualities and smooth interpolation between conformations at all-atom resolution.

    View details for DOI 10.1101/2025.10.03.680398

    View details for PubMedID 41278779

    View details for PubMedCentralID PMC12632552

  • Conditional Protein Structure Generation with Protpardelle-1c. bioRxiv : the preprint server for biology Lu, T., Shuai, R., Kouba, P., Li, Z., Chen, Y., Shirali, A., Kim, J., Huang, P. S. 2025

    Abstract

    We present Protpardelle-1c, a collection of protein structure generative models with robust motif scaffolding and support for multi-chain complex generation under hotspot-conditioning. Enabling sidechain-conditioning to a backbone-only model increased Protpardelle-1c's MotifBench score from 4.97 to 28.16, outperforming RFdiffusion's 21.27. The crop-conditional all-atom model achieved 208 unique solutions on the La-Proteina all-atom motif scaffolding benchmark, on par with La-Proteina while having ~10 times fewer parameters. At 22M parameters, Protpardelle-1c enables rapid sampling, taking 40 minutes to sample all 3000 MotifBench backbones on an NVIDIA A100-80GB, compared to 31 hours for RFdiffusion.

    View details for DOI 10.1101/2025.08.18.670959

    View details for PubMedID 40894579

    View details for PubMedCentralID PMC12393353

  • Assessing generative model coverage of protein structures with SHAPES. Cell systems Lu, T., Liu, M., Chen, Y., Kim, J., Huang, P. S. 2025: 101347

    Abstract

    Recent advances in generative modeling enable efficient sampling of protein structures, but their tendency to optimize for designability imposes a bias toward idealized structures at the expense of loops and other complex structural motifs that are critical for function. We introduce SHAPES (structural and hierarchical assessment of proteins with embedding similarity) to evaluate five state-of-the-art generative models of protein structures. Using structural embeddings across multiple structural hierarchies, ranging from local geometries to global protein architectures, we reveal substantial undersampling of the observed protein structure space by these models. We use Fréchet protein distance (FPD) to quantify distributional coverage. Different models are distinct in their coverage behavior across different sampling noise scales and temperatures. The frequency of tertiary motifs (TERMs) further supports the observations. More robust sequence design and structure prediction methods are likely crucial in guiding the development of models with improved coverage of the designable protein space. A record of this paper's transparent peer review process is included in the supplemental information.

    View details for DOI 10.1016/j.cels.2025.101347

    View details for PubMedID 40738113

  • Synthetic biology education and pedagogy: a review of evolving practices in a growing discipline. Frontiers in Education Menard, J., Diep, P., Sheikh, F., Escobar, A., Dykstra, C. B., Sajtovich, V. A., Ahmadi, A., Bodyreva, E., Boucinha, A., Chandrasekharan, S., Duan, J., Emond, C., Lu, T., McLean, I., Morse, L., Serra, D., Stancescu, A., Suresh, S., Ingalls, B. P. 2024; 9

  • Sparks of function by de novo protein design. Nature biotechnology Chu, A. E., Lu, T., Huang, P. S. 2024; 42 (2): 203-215

    Abstract

    Information in proteins flows from sequence to structure to function, with each step causally driven by the preceding one. Protein design is founded on inverting this process: specify a desired function, design a structure executing this function, and find a sequence that folds into this structure. This 'central dogma' underlies nearly all de novo protein-design efforts. Our ability to accomplish these tasks depends on our understanding of protein folding and function and our ability to capture this understanding in computational methods. In recent years, deep learning-derived approaches for efficient and accurate structure modeling and enrichment of successful designs have enabled progression beyond the design of protein structures and towards the design of functional proteins. We examine these advances in the broader context of classical de novo protein design and consider implications for future challenges to come, including fundamental capabilities such as sequence and structure co-design and conformational control considering flexibility, and functional objectives such as antibody and enzyme design.

    View details for DOI 10.1038/s41587-024-02133-2

    View details for PubMedID 38361073

    View details for PubMedCentralID 6423711

  • Geometric Deep Learning for Structure-Based Ligand Design. ACS central science Powers, A. S., Yu, H. H., Suriana, P., Koodli, R. V., Lu, T., Paggi, J. M., Dror, R. O. 2023; 9 (12): 2257-2267

    Abstract

    A pervasive challenge in drug design is determining how to expand a ligand (a small molecule that binds to a target biomolecule) in order to improve various properties of the ligand. Adding single chemical groups, known as fragments, is important for lead optimization tasks, and adding multiple fragments is critical for fragment-based drug design. We have developed a comprehensive framework that uses machine learning and three-dimensional protein-ligand structures to address this challenge. Our method, FRAME, iteratively determines where on a ligand to add fragments, selects fragments to add, and predicts the geometry of the added fragments. On a comprehensive benchmark, FRAME consistently improves predicted affinity and selectivity relative to the initial ligand, while generating molecules with more drug-like chemical properties than docking-based methods currently in widespread use. FRAME learns to accurately describe molecular interactions despite being given no prior information on such interactions. The resulting framework for quality molecular hypothesis generation can be easily incorporated into the workflows of medicinal chemists for diverse tasks, including lead optimization, fragment-based drug discovery, and de novo drug design.

    View details for DOI 10.1021/acscentsci.3c00572

    View details for PubMedID 38161364

  • ELASPIC2 (EL2): Combining Contextualized Language Models and Graph Neural Networks to Predict Effects of Mutations. Journal of Molecular Biology Strokach, A., Lu, T., Kim, P. M. 2021; 433 (11): 166810

    Abstract

    The ELASPIC web server allows users to evaluate the effect of mutations on protein folding and protein-protein interaction on a proteome-wide scale. It uses homology models of proteins and protein-protein interactions, which have been precalculated for several proteomes, and machine learning models, which integrate structural information with sequence conservation scores, in order to make its predictions. Since the original publication of the ELASPIC web server, several advances have motivated a revisiting of the problem of mutation effect prediction. First, progress in neural network architectures and self-supervised pre-training has resulted in models which provide more informative embeddings of protein sequence and structure than those used by the original version of ELASPIC. Second, the amount of training data has increased several-fold, largely driven by advances in deep mutation scanning and other multiplexed assays of variant effect. Here, we describe two machine learning models which leverage the recent advances in order to achieve superior accuracy in predicting the effect of mutation on protein folding and protein-protein interaction. The models incorporate features generated using pre-trained transformer- and graph convolution-based neural networks, and are trained to optimize a ranking objective function, which permits the use of heterogeneous training data. The outputs from the new models have been incorporated into the ELASPIC web server, available at http://elaspic.kimlab.org.

    View details for DOI 10.1016/j.jmb.2021.166810

    View details for Web of Science ID 000648520800013

    View details for PubMedID 33450251