Alexander Johansen's Profile | Stanford Profiles

Institute Affiliations

Member (Student), Cardiovascular Institute

Contact

Academic
arjo@stanford.edu

University - Student
Department: Computer Science
Position: Graduate

Additional Info

Mail Code: 9025
ORCID:
https://orcid.org/0000-0002-4993-7916

All Publications

DeepLoc 2.1: multi-label membrane protein type prediction using protein language models. Nucleic acids research Ødum, M. T., Teufel, F., Thumuluri, V., Almagro Armenteros, J. J., Johansen, A. R., Winther, O., Nielsen, H. 2024

Abstract

DeepLoc 2.0 is a popular web server for the prediction of protein subcellular localization and sorting signals. Here, we introduce DeepLoc 2.1, which additionally classifies the input proteins into the membrane protein types Transmembrane, Peripheral, Lipid-anchored and Soluble. Leveraging pre-trained transformer-based protein language models, the server utilizes a three-stage architecture for sequence-based, multi-label predictions. Comparative evaluations with other established tools on a test set of 4933 eukaryotic protein sequences, constructed following stringent homology partitioning, demonstrate state-of-the-art performance. Notably, DeepLoc 2.1 outperforms existing models, with the larger ProtT5 model exhibiting a marginal advantage over the ESM-1B model. The web server is available at https://services.healthtech.dtu.dk/services/DeepLoc-2.1.

View details for DOI 10.1093/nar/gkae237

View details for PubMedID 38587188
GraphPart: homology partitioning for biological sequence analysis. NAR genomics and bioinformatics Teufel, F., Gíslason, M. H., Almagro Armenteros, J. J., Johansen, A. R., Winther, O., Nielsen, H. 2023; 5 (4): lqad088

Abstract

When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too-closely related pairs remain. We present GraphPart, an algorithm for homology partitioning that divides the data such that closely related sequences always end up in the same partition, while keeping as many sequences as possible in the dataset. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation on a par with reduction approaches.

View details for DOI 10.1093/nargab/lqad088

View details for PubMedID 37850036

View details for PubMedCentralID PMC10578201
DeepLoc 2.0: multi-label subcellular localization prediction using protein language models. Nucleic acids research Thumuluri, V., Almagro Armenteros, J. J., Johansen, A. R., Nielsen, H., Winther, O. 2022

Abstract

The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.

View details for DOI 10.1093/nar/gkac278

View details for PubMedID 35489069
SignalP 6.0 predicts all five types of signal peptides using protein language models. Nature biotechnology Teufel, F., Almagro Armenteros, J. J., Johansen, A. R., Gislason, M. H., Pihl, S. I., Tsirigos, K. D., Winther, O., Brunak, S., von Heijne, G., Nielsen, H. 2022

Abstract

Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.

View details for DOI 10.1038/s41587-021-01156-3

View details for PubMedID 34980915
NetSolP: predicting protein solubility in Escherichia coli using language models. Bioinformatics (Oxford, England) Thumuluri, V., Martiny, H. M., Almagro Armenteros, J. J., Salomon, J., Nielsen, H., Johansen, A. R. 2022; 38 (4): 941-946

Abstract

Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. Supplementary data are available at Bioinformatics online.

View details for DOI 10.1093/bioinformatics/btab801

View details for PubMedID 35088833
NetSolP: predicting protein solubility in E. coli using language models. Bioinformatics (Oxford, England) Thumuluri, V., Martiny, H., Armenteros, J. J., Salomon, J., Nielsen, H., Johansen, A. R. 2021

Abstract

MOTIVATION: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. RESULTS: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. AVAILABILITY: The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. SUPPLEMENTARY INFORMATION: Supplementary data is attached in submission.

View details for DOI 10.1093/bioinformatics/btab801

View details for PubMedID 34849581
Prediction of GPI-anchored proteins with pointer neural networks. Current Research in Biotechnology Gislason, M., Nielsen, H., Armenteros, J., Johansen, A. 2021; 3: 6-13

View details for DOI 10.1016/j.crbiot.2021.01.001

View details for Web of Science ID 000739728600002
Deep protein representations enable recombinant protein expression prediction. Computational biology and chemistry Martiny, H. M., Armenteros, J. J., Johansen, A. R., Salomon, J., Nielsen, H. 2021; 95: 107596

Abstract

A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimizations. However, such strategies are time-consuming, and an alternative strategy would be to select genes for better compatibility with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabelled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, we obtain a modest performance of 0.64 area-under-the-curve (AUC) and 0.2 Matthews correlation coefficient (MCC). However, we find that this is sufficient for the prioritization of expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model.

View details for DOI 10.1016/j.compbiolchem.2021.107596

View details for PubMedID 34775287

Alexander Johansen

Ph.D. Student in Computer Science, admitted Autumn 2020
