All Publications


  • DeepLoc 2.1: multi-label membrane protein type prediction using protein language models. Nucleic acids research Ødum, M. T., Teufel, F., Thumuluri, V., Almagro Armenteros, J. J., Johansen, A. R., Winther, O., Nielsen, H. 2024

    Abstract

    DeepLoc 2.0 is a popular web server for the prediction of protein subcellular localization and sorting signals. Here, we introduce DeepLoc 2.1, which additionally classifies the input proteins into the membrane protein types Transmembrane, Peripheral, Lipid-anchored and Soluble. Leveraging pre-trained transformer-based protein language models, the server utilizes a three-stage architecture for sequence-based, multi-label predictions. Comparative evaluations with other established tools on a test set of 4933 eukaryotic protein sequences, constructed following stringent homology partitioning, demonstrate state-of-the-art performance. Notably, DeepLoc 2.1 outperforms existing models, with the larger ProtT5 model exhibiting a marginal advantage over the ESM-1B model. The web server is available at https://services.healthtech.dtu.dk/services/DeepLoc-2.1.

    View details for DOI 10.1093/nar/gkae237

    View details for PubMedID 38587188

  • GraphPart: homology partitioning for biological sequence analysis. NAR genomics and bioinformatics Teufel, F., Gíslason, M. H., Almagro Armenteros, J. J., Johansen, A. R., Winther, O., Nielsen, H. 2023; 5 (4): lqad088

    Abstract

    When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too-closely related pairs remain. We present GraphPart, an algorithm for homology partitioning that divides the data such that closely related sequences always end up in the same partition, while keeping as many sequences as possible in the dataset. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation on a par with reduction approaches.

    View details for DOI 10.1093/nargab/lqad088

    View details for PubMedID 37850036

    View details for PubMedCentralID PMC10578201

  • DeepLoc 2.0: multi-label subcellular localization prediction using protein language models. Nucleic acids research Thumuluri, V., Almagro Armenteros, J. J., Johansen, A. R., Nielsen, H., Winther, O. 2022

    Abstract

    The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.

    View details for DOI 10.1093/nar/gkac278

    View details for PubMedID 35489069

  • SignalP 6.0 predicts all five types of signal peptides using protein language models. Nature biotechnology Teufel, F., Almagro Armenteros, J. J., Johansen, A. R., Gislason, M. H., Pihl, S. I., Tsirigos, K. D., Winther, O., Brunak, S., von Heijne, G., Nielsen, H. 2022

    Abstract

    Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.

    View details for DOI 10.1038/s41587-021-01156-3

    View details for PubMedID 34980915

  • NetSolP: predicting protein solubility in Escherichia coli using language models. Bioinformatics (Oxford, England) Thumuluri, V., Martiny, H. M., Almagro Armenteros, J. J., Salomon, J., Nielsen, H., Johansen, A. R. 2022; 38 (4): 941-946

    Abstract

    Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btab801

    View details for PubMedID 35088833

  • NetSolP: predicting protein solubility in E. coli using language models. Bioinformatics (Oxford, England) Thumuluri, V., Martiny, H., Armenteros, J. J., Salomon, J., Nielsen, H., Johansen, A. R. 2021

    Abstract

    MOTIVATION: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. RESULTS: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. AVAILABILITY: The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. SUPPLEMENTARY INFORMATION: Supplementary data is attached in submission.

    View details for DOI 10.1093/bioinformatics/btab801

    View details for PubMedID 34849581

  • Deep protein representations enable recombinant protein expression prediction. Computational biology and chemistry Martiny, H. M., Armenteros, J. J., Johansen, A. R., Salomon, J., Nielsen, H. 2021; 95: 107596

    Abstract

    A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimizations. However, such strategies are time-consuming, and an alternative strategy would be to select genes for better compatibility with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabelled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, we obtain a modest performance of 0.64 area-under-the-curve (AUC) and 0.2 Matthews correlation coefficient (MCC). However, we find that this is sufficient for the prioritization of expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model.

    View details for DOI 10.1016/j.compbiolchem.2021.107596

    View details for PubMedID 34775287

  • Prediction of GPI-anchored proteins with pointer neural networks. Current research in biotechnology Gislason, M., Nielsen, H., Armenteros, J., Johansen, A. 2021; 3: 6-13