Alexander Johansen
Ph.D. Student in Computer Science, admitted Autumn 2020
All Publications
-
DeepLoc 2.1: multi-label membrane protein type prediction using protein language models.
Nucleic acids research
2024
Abstract
DeepLoc 2.0 is a popular web server for the prediction of protein subcellular localization and sorting signals. Here, we introduce DeepLoc 2.1, which additionally classifies the input proteins into the membrane protein types Transmembrane, Peripheral, Lipid-anchored and Soluble. Leveraging pre-trained transformer-based protein language models, the server utilizes a three-stage architecture for sequence-based, multi-label predictions. Comparative evaluations with other established tools on a test set of 4933 eukaryotic protein sequences, constructed following stringent homology partitioning, demonstrate state-of-the-art performance. Notably, DeepLoc 2.1 outperforms existing models, with the larger ProtT5 model exhibiting a marginal advantage over the ESM-1B model. The web server is available at https://services.healthtech.dtu.dk/services/DeepLoc-2.1.
View details for DOI 10.1093/nar/gkae237
View details for PubMedID 38587188
-
GraphPart: homology partitioning for biological sequence analysis.
NAR genomics and bioinformatics
2023; 5 (4): lqad088
Abstract
When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too-closely related pairs remain. We present GraphPart, an algorithm for homology partitioning that divides the data such that closely related sequences always end up in the same partition, while keeping as many sequences as possible in the dataset. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation on a par with reduction approaches.
View details for DOI 10.1093/nargab/lqad088
View details for PubMedID 37850036
View details for PubMedCentralID PMC10578201
-
DeepLoc 2.0: multi-label subcellular localization prediction using protein language models.
Nucleic acids research
2022
Abstract
The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.
View details for DOI 10.1093/nar/gkac278
View details for PubMedID 35489069
-
SignalP 6.0 predicts all five types of signal peptides using protein language models.
Nature biotechnology
2022
Abstract
Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.
View details for DOI 10.1038/s41587-021-01156-3
View details for PubMedID 34980915
-
NetSolP: predicting protein solubility in Escherichia coli using language models.
Bioinformatics (Oxford, England)
2022; 38 (4): 941-946
Abstract
Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btab801
View details for PubMedID 35088833
-
NetSolP: predicting protein solubility in E. coli using language models.
Bioinformatics (Oxford, England)
2021
Abstract
MOTIVATION: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. RESULTS: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. AVAILABILITY: The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. SUPPLEMENTARY INFORMATION: Supplementary data is attached in submission.
View details for DOI 10.1093/bioinformatics/btab801
View details for PubMedID 34849581
-
Prediction of GPI-anchored proteins with pointer neural networks
Current research in biotechnology
2021; 3: 6-13
View details for DOI 10.1016/j.crbiot.2021.01.001
View details for Web of Science ID 000739728600002
-
Deep protein representations enable recombinant protein expression prediction.
Computational biology and chemistry
2021; 95: 107596
Abstract
A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimizations. However, such strategies are time-consuming, and an alternative strategy would be to select genes for better compatibility with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabelled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, we obtain a modest performance of 0.64 area-under-the-curve (AUC) and 0.2 Matthews correlation coefficient (MCC). However, we find that this is sufficient for the prioritization of expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model.
View details for DOI 10.1016/j.compbiolchem.2021.107596
View details for PubMedID 34775287