All Publications


  • DeepLoc 2.0: multi-label subcellular localization prediction using protein language models. Nucleic acids research Thumuluri, V., Almagro Armenteros, J. J., Johansen, A. R., Nielsen, H., Winther, O. 2022

    Abstract

    The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.

    View details for DOI 10.1093/nar/gkac278

    View details for PubMedID 35489069

  • SignalP 6.0 predicts all five types of signal peptides using protein language models. Nature biotechnology Teufel, F., Almagro Armenteros, J. J., Johansen, A. R., Gislason, M. H., Pihl, S. I., Tsirigos, K. D., Winther, O., Brunak, S., von Heijne, G., Nielsen, H. 2022

    Abstract

    Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.

    View details for DOI 10.1038/s41587-021-01156-3

    View details for PubMedID 34980915

  • NetSolP: predicting protein solubility in Escherichia coli using language models. Bioinformatics (Oxford, England) Thumuluri, V., Martiny, H. M., Almagro Armenteros, J. J., Salomon, J., Nielsen, H., Johansen, A. R. 2022; 38 (4): 941-946

    Abstract

    Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btab801

    View details for PubMedID 35088833

  • NetSolP: predicting protein solubility in E. coli using language models. Bioinformatics (Oxford, England) Thumuluri, V., Martiny, H., Armenteros, J. J., Salomon, J., Nielsen, H., Johansen, A. R. 2021

    Abstract

    MOTIVATION: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. RESULTS: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. AVAILABILITY: The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. SUPPLEMENTARY INFORMATION: Supplementary data is attached in submission.

    View details for DOI 10.1093/bioinformatics/btab801

    View details for PubMedID 34849581

  • Deep protein representations enable recombinant protein expression prediction. Computational biology and chemistry Martiny, H. M., Armenteros, J. J., Johansen, A. R., Salomon, J., Nielsen, H. 2021; 95: 107596

    Abstract

    A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimizations. However, such strategies are time-consuming, and an alternative strategy would be to select genes for better compatibility with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabeled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, we obtain a modest performance of 0.64 area-under-the-curve (AUC) and 0.2 Matthews correlation coefficient (MCC). However, we find that this is sufficient for the prioritization of expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model.

    View details for DOI 10.1016/j.compbiolchem.2021.107596

    View details for PubMedID 34775287