PostDoc, Institute for Systems Biology, Proteomics & Systems Biology Mentor: Ruedi Aebersold
Ph.D., University of California, Los Angeles, Chemistry & Biochemistry Mentor: David Eisenberg
B.S., Washington University in St. Louis, Computer Science & Biochemistry
Current Research and Scholarly Interests
The Mallick lab focuses on translating multi-omic discovery into precision diagnostics. In particular, we use tightly integrated computational and experimental multi-omic approaches to discover the processes underlying how cells behave (or misbehave) and, accordingly, how cancers develop and grow. We hope that by exploring these processes, and by formalizing our knowledge in predictive mathematical models, we will be able to better identify biomarkers that can be used to detect cancers earlier and describe how they are likely to behave (e.g., aggressive vs. indolent, drug sensitive vs. resistant).
More specifically, we are working in three focus areas: Cancer Systems Biology, Multi-scale Biomarker Biology and Technology Development. Notably, many of the studies in our group are investigating fundamental physiological processes and thus are generally applicable to a range of cell-types and diseases.
Our group has also been leading the development of ProteoWizard, an open source set of libraries and tools to simplify the process of developing proteomics tools. They read and write the HUPO-PSI mzML standard and have been incorporated into the ISB's Trans-Proteomic Pipeline.
For more information see http://mallicklab.stanford.edu
Independent Studies (11)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Win)
- Directed Reading and Research
BIOMEDIN 299 (Win)
- Directed Reading in Radiology
RAD 299 (Aut, Win, Spr, Sum)
- Early Clinical Experience in Radiology
RAD 280 (Aut, Win, Spr, Sum)
- Graduate Research
IMMUNOL 399 (Spr, Sum)
- Graduate Research
RAD 399 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Win)
- Medical Scholars Research
RAD 370 (Aut, Win, Spr, Sum)
- Readings in Radiology Research
RAD 101 (Aut, Win, Spr, Sum)
- Teaching in Immunology
IMMUNOL 290 (Spr, Sum)
- Undergraduate Research
RAD 199 (Aut, Win, Spr, Sum)
Prior Year Courses
- Mass Spectrometry and Proteomics: Opening the Black Box
BIOS 227 (Win)
Building high-quality assay libraries for targeted analysis of SWATH MS data.
2015; 10 (3): 426-441
Targeted proteomics by selected/multiple reaction monitoring (S/MRM) or, on a larger scale, by SWATH (sequential window acquisition of all theoretical spectra) MS (mass spectrometry) typically relies on spectral reference libraries for peptide identification. Quality and coverage of these libraries are therefore of crucial importance for the performance of the methods. Here we present a detailed protocol that has been successfully used to build high-quality, extensive reference libraries supporting targeted proteomics by SWATH MS. We describe each step of the process, including data acquisition by discovery proteomics, assertion of peptide-spectrum matches (PSMs), generation of consensus spectra and compilation of MS coordinates that uniquely define each targeted peptide. Crucial steps such as false discovery rate (FDR) control, retention time normalization and handling of post-translationally modified peptides are detailed. Finally, we show how to use the library to extract SWATH data with the open-source software Skyline. The protocol takes 2-3 d to complete, depending on the extent of the library and the computational resources available.
View details for DOI 10.1038/nprot.2015.015
View details for PubMedID 25675208
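The FDR control step mentioned in the abstract above can be illustrated with a minimal sketch of target-decoy counting, one common way to estimate FDR for peptide-spectrum matches. This is an illustration under stated assumptions, not necessarily the procedure used in the protocol: the function name, the `(score, is_decoy)` input format, and the decoys/targets estimator are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def score_threshold_at_fdr(psms: List[Tuple[float, bool]],
                           max_fdr: float = 0.01) -> float:
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: (score, is_decoy) pairs; a higher score means a better match.
    The FDR at a cutoff is estimated as decoys / targets among all PSMs
    scoring at or above that cutoff (assumption: scores are distinct).
    Returns float('inf') if no cutoff satisfies the bound.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    best = float("inf")
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep walking down the list: the estimate can dip back under the
        # bound after a decoy, so we want the lowest passing cutoff.
        if targets > 0 and decoys / targets <= max_fdr:
            best = score
    return best


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, False),
               (7.0, True), (6.0, False), (5.0, True)]
    print(score_threshold_at_fdr(example, max_fdr=0.25))
```

PSMs scoring at or above the returned cutoff would then be the ones carried forward into consensus-spectrum generation.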
- A cross-platform toolkit for mass spectrometry and proteomics NATURE BIOTECHNOLOGY 2012; 30 (10): 918-920
Quantitative Proteomic Profiling Identifies Protein Correlates to EGFR Kinase Inhibition
MOLECULAR CANCER THERAPEUTICS
2012; 11 (5): 1071-1081
Clinical oncology is hampered by lack of tools to accurately assess a patient's response to pathway-targeted therapies. Serum and tumor cell surface proteins whose abundance, or change in abundance in response to therapy, differentiates patients responding to a therapy from patients not responding to a therapy could be usefully incorporated into tools for monitoring response. Here, we posit and then verify that proteomic discovery in in vitro tissue culture models can identify proteins with concordant in vivo behavior and further, can be a valuable approach for identifying tumor-derived serum proteins. In this study, we use stable isotope labeling of amino acids in culture (SILAC) with proteomic technologies to quantitatively analyze the gefitinib-related protein changes in a model system for sensitivity to EGF receptor (EGFR)-targeted tyrosine kinase inhibitors. We identified 3,707 intracellular proteins, 1,276 cell surface proteins, and 879 shed proteins. More than 75% of the proteins identified had quantitative information, and a subset consisting of 400 proteins showed a statistically significant change in abundance following gefitinib treatment. We validated the change in expression profile in vitro and screened our panel of response markers in an in vivo isogenic resistant model and showed that these were markers of gefitinib response and not simply markers of phospho-EGFR downregulation. In doing so, we also were able to identify which proteins might be useful as markers for monitoring response and which proteins might be useful as markers for a priori prediction of response.
View details for DOI 10.1158/1535-7163.MCT-11-0852
View details for Web of Science ID 000307984800003
View details for PubMedID 22411897
Evolutionary Modeling of Combination Treatment Strategies To Overcome Resistance to Tyrosine Kinase Inhibitors in Non-Small Cell Lung Cancer
2011; 8 (6): 2069-2079
Many initially successful anticancer therapies lose effectiveness over time, and eventually, cancer cells acquire resistance to the therapy. Acquired resistance remains a major obstacle to improving remission rates and achieving prolonged disease-free survival. Consequently, novel approaches to overcome or prevent resistance are of significant clinical importance. There has been considerable interest in treating non-small cell lung cancer (NSCLC) with combinations of EGFR-targeted therapeutics (e.g., erlotinib) and cytotoxic therapeutics (e.g., paclitaxel); however, acquired resistance to erlotinib, driven by a variety of mechanisms, remains an obstacle to treatment success. In about 50% of cases, resistance is due to a T790M point mutation in EGFR, and T790M-containing cells ultimately dominate the tumor composition and lead to tumor regrowth. We employed a combined experimental and mathematical modeling-based approach to identify treatment strategies that impede the outgrowth of primary T790M-mediated resistance in NSCLC populations. Our mathematical model predicts the population dynamics of mixtures of sensitive and resistant cells, thereby describing how the tumor composition, initial fraction of resistant cells, and degree of selective pressure influence the time until progression of disease. Model development relied upon quantitative experimental measurements of cell proliferation and death using a novel microscopy approach. Using this approach, we systematically explored the space of combination treatment strategies and demonstrated that optimally timed sequential strategies yielded large improvements in survival outcome relative to monotherapies at the same concentrations. Our investigations revealed regions of the treatment space in which low-dose sequential combination strategies, after preclinical validation, may lead to a tumor reduction and improved survival outcome for patients with T790M-mediated resistance.
View details for DOI 10.1021/mp200270v
View details for Web of Science ID 000297537300011
View details for PubMedID 21995722
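To make the idea of population dynamics in a mixture of sensitive and resistant cells concrete, here is a minimal sketch. It assumes the simplest possible model (two independent populations with constant exponential net growth rates under continuous therapy), which is far cruder than the published model; the function names, parameters, and the definition of progression as regrowth to the starting size are assumptions made for illustration only.

```python
import math


def relative_tumor_size(t: float, resistant_fraction: float,
                        rate_sensitive: float, rate_resistant: float) -> float:
    """Tumor size at time t relative to its size at t = 0."""
    sensitive = (1.0 - resistant_fraction) * math.exp(rate_sensitive * t)
    resistant = resistant_fraction * math.exp(rate_resistant * t)
    return sensitive + resistant


def time_to_progression(resistant_fraction: float, rate_sensitive: float,
                        rate_resistant: float, tol: float = 1e-9) -> float:
    """Time at which the tumor regrows to its starting size under therapy.

    Assumes sensitive cells shrink (rate_sensitive < 0) and resistant
    cells grow (rate_resistant > 0). Returns 0.0 if the tumor never shrinks.
    """
    f = resistant_fraction
    if not 0.0 < f < 1.0:
        raise ValueError("resistant_fraction must be strictly between 0 and 1")
    if not rate_sensitive < 0.0 < rate_resistant:
        raise ValueError("need rate_sensitive < 0 < rate_resistant")
    if (1.0 - f) * rate_sensitive + f * rate_resistant >= 0.0:
        return 0.0  # total size is already non-decreasing at t = 0
    # The size curve is convex; start the search at its minimum.
    lo = math.log(-(1.0 - f) * rate_sensitive / (f * rate_resistant)) \
        / (rate_resistant - rate_sensitive)
    hi = max(2.0 * lo, 1.0)
    while relative_tumor_size(hi, f, rate_sensitive, rate_resistant) < 1.0:
        hi *= 2.0
    # Bisection between the minimum (size < 1) and hi (size >= 1).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if relative_tumor_size(mid, f, rate_sensitive, rate_resistant) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for frac in (0.01, 0.001):
        print(frac, round(time_to_progression(frac, -1.0, 0.5), 3))
```

Even in this toy setting, a smaller initial resistant fraction delays regrowth, which is the qualitative dependence the abstract describes.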
Impact of Protein Stability, Cellular Localization, and Abundance on Proteomic Detection of Tumor-Derived Proteins in Plasma
2011; 6 (7)
Tumor-derived, circulating proteins are potentially useful as biomarkers for detection of cancer, for monitoring of disease progression, regression and recurrence, and for assessment of therapeutic response. Here we interrogated how a protein's stability, cellular localization, and abundance affect its observability in blood by mass-spectrometry-based proteomics techniques. We performed proteomic profiling on tumors and plasma from two different xenograft mouse models. A statistical analysis of this data revealed protein properties indicative of the detection level in plasma. Though 20% of the proteins identified in plasma were tumor-derived, only 5% of the proteins observed in the tumor tissue were found in plasma. Both intracellular and extracellular tumor proteins were observed in plasma; however, after normalizing for tumor abundance, extracellular proteins were seven times more likely to be detected. Although proteins that were more abundant in the tumor were also more likely to be observed in plasma, the relationship was nonlinear: Doubling the spectral count increased detection rate by only 50%. Many secreted proteins, even those with relatively low spectral count, were observed in plasma, but few low abundance intracellular proteins were observed. Proteins predicted to be stable by dipeptide composition were significantly more likely to be identified in plasma than less stable proteins. The number of tryptic peptides in a protein was not significantly related to the chance of a protein being observed in plasma. Quantitative comparison of large versus small tumors revealed that the abundance of proteins in plasma as measured by spectral count was associated with the tumor size, but the relationship was not one-to-one; a 3-fold decrease in tumor size resulted in a 16-fold decrease in protein abundance in plasma. 
This study provides quantitative support for a tumor-derived marker prioritization strategy that favors secreted and stable proteins over all but the most abundant intracellular proteins.
View details for DOI 10.1371/journal.pone.0023090
View details for Web of Science ID 000293286500074
View details for PubMedID 21829587
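The two nonlinear relationships quoted in the abstract above (doubling spectral count raises detection rate by 50%; a 3-fold change in tumor size gives a 16-fold change in plasma abundance) can be summarized as scaling exponents. The sketch below assumes a power-law form y ∝ x^k, which is an interpretive assumption and not a model stated in the abstract; the function name is hypothetical.

```python
import math


def scaling_exponent(fold_change_x: float, fold_change_y: float) -> float:
    """Exponent k such that fold_change_y == fold_change_x ** k.

    Assumes a power-law relationship y ~ x**k between the two quantities.
    """
    return math.log(fold_change_y) / math.log(fold_change_x)


if __name__ == "__main__":
    # Spectral count doubles, detection rate rises 1.5-fold: sublinear.
    print(round(scaling_exponent(2.0, 1.5), 3))
    # Tumor size changes 3-fold, plasma abundance 16-fold: superlinear.
    print(round(scaling_exponent(3.0, 16.0), 3))
```

Under that assumption, the first exponent is below 1 and the second is well above 1, which is one way to read "the relationship was not one-to-one".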
Proteomics: a pragmatic perspective
2010; 28 (7): 695-709
The evolution of mass spectrometry-based proteomic technologies has advanced our understanding of the complex and dynamic nature of proteomes while concurrently revealing that no 'one-size-fits-all' proteomic strategy can be used to address all biological questions. Whereas some techniques, such as those for analyzing protein complexes, have matured and are broadly applied with great success, others, such as global quantitative protein expression profiling for biomarker discovery, are still confined to a few expert laboratories. In this Perspective, we attempt to distill the wide array of conceivable proteomic approaches into a compact canon of techniques suited to asking and answering specific types of biological questions. By discussing the relationship between the complexity of a biological sample and the difficulty of implementing the appropriate analysis approach, we contrast areas of proteomics broadly usable today with those that require significant technical and conceptual development. We hope to provide nonexperts with a guide for calibrating expectations of what can realistically be learned from a proteomics experiment and for gauging the planning and execution effort. We further provide a detailed supplement explaining the most common techniques in proteomics.
View details for DOI 10.1038/nbt.1658
View details for Web of Science ID 000279723900027
View details for PubMedID 20622844
Computational prediction of proteotypic peptides for quantitative proteomics.
2007; 25 (1): 125-131
Mass spectrometry-based quantitative proteomics has become an important component of biological and clinical research. Although such analyses typically assume that a protein's peptide fragments are observed with equal likelihood, only a few so-called 'proteotypic' peptides are repeatedly and consistently identified for any given protein present in a mixture. Using >600,000 peptide identifications generated by four proteomic platforms, we empirically identified >16,000 proteotypic peptides for 4,030 distinct yeast proteins. Characteristic physicochemical properties of these peptides were used to develop a computational tool that can predict proteotypic peptides for any protein from any organism, for a given platform, with >85% cumulative accuracy. Possible applications of proteotypic peptides include validation of protein identifications, absolute quantification of proteins, annotation of coding sequences in genomes, and characterization of the physical principles governing key elements of mass spectrometric workflows (e.g., digestion, chromatography, ionization and fragmentation).
View details for PubMedID 17195840
- Anti-MET ImmunoPET for Non-Small Cell Lung Cancer Using Novel Fully Human Antibody Fragments MOLECULAR CANCER THERAPEUTICS 2014; 13 (11): 2607-2617
Characterizing deformability and surface friction of cancer cells.
Proceedings of the National Academy of Sciences of the United States of America
2013; 110 (19): 7580-7585
Metastasis requires the penetration of cancer cells through tight spaces, which is mediated by the physical properties of the cells as well as their interactions with the confined environment. Various microfluidic approaches have been devised to mimic traversal in vitro by measuring the time required for cells to pass through a constriction. Although a cell's passage time is expected to depend on its deformability, measurements from existing approaches are confounded by a cell's size and its frictional properties with the channel wall. Here, we introduce a device that enables the precise measurement of (i) the size of a single cell, given by its buoyant mass, (ii) the velocity of the cell entering a constricted microchannel (entry velocity), and (iii) the velocity of the cell as it transits through the constriction (transit velocity). Changing the deformability of the cell by perturbing its cytoskeleton primarily alters the entry velocity, whereas changing the surface friction by immobilizing positive charges on the constriction's walls primarily alters the transit velocity, indicating that these parameters can give insight into the factors affecting the passage of each cell. When accounting for cell buoyant mass, we find that cells possessing higher metastatic potential exhibit faster entry velocities than cells with lower metastatic potential. We additionally find that some cell types with higher metastatic potential exhibit greater than expected changes in transit velocities, suggesting that not only the increased deformability but reduced friction may be a factor in enabling invasive cancer cells to efficiently squeeze through tight spaces.
View details for DOI 10.1073/pnas.1218806110
View details for PubMedID 23610435
A physical sciences network characterization of non-tumorigenic and metastatic cells
To investigate the transition from non-cancerous to metastatic from a physical sciences perspective, the Physical Sciences-Oncology Centers (PS-OC) Network performed molecular and biophysical comparative studies of the non-tumorigenic MCF-10A and metastatic MDA-MB-231 breast epithelial cell lines, commonly used as models of cancer metastasis. Experiments were performed in 20 laboratories from 12 PS-OCs. Each laboratory was supplied with identical aliquots and common reagents and culture protocols. Analyses of these measurements revealed dramatic differences in their mechanics, migration, adhesion, oxygen response, and proteomic profiles. Model-based multi-omics approaches identified key differences between these cells' regulatory networks involved in morphology and survival. These results provide a multifaceted description of cellular parameters of two widely used cell lines and demonstrate the value of the PS-OC Network approach for integration of diverse experimental observations to elucidate the phenotypes associated with cancer metastasis.
View details for DOI 10.1038/srep01449
View details for Web of Science ID 000318061300001
View details for PubMedID 23618955
Anterior gradient 2 (AGR2): Blood-based biomarker elevated in metastatic prostate cancer associated with the neuroendocrine phenotype
2013; 73 (3): 306-315
Anterior gradient 2 (AGR2) is associated with metastatic progression in prostate cancer cells as well as other normal and malignant tissues. We investigated AGR2 expression in patients with metastatic prostate cancer. Blood was collected from 44 patients with metastatic prostate cancer separated as: castration sensitive prostate cancer (CSPC, n = 5); castration resistant prostate cancer (CRPC, n = 36); and neuroendocrine-predominate CRPC defined by PSA ≤ 1 ng/ml in the presence of wide-spread metastatic disease (NE-CRPC, n = 3). AGR2 mRNA levels were measured with RT-PCR in circulating tumor cell (CTC)-enriched peripheral blood. Plasma AGR2 levels were determined via ELISA assay. AGR2 expression was modulated in prostate cancer cell lines using plasmid and viral vectors. AGR2 mRNA levels are elevated in CTCs and strongly correlated with CTC enumeration. Plasma AGR2 levels are elevated in all sub-groups. AGR2 levels vary independently of PSA and change in some patients in response to androgen-directed and other therapies. Plasma AGR2 levels are highest in the NE-CRPC sub-group. A correlation between AGR2, chromogranin A (CGA), and neuron-specific enolase (NSE) expression is demonstrated in prostate cancer cell lines. We conclude that AGR2 expression is elevated at the mRNA and protein level in patients with metastatic prostate cancer. In particular, we find that AGR2 expression is associated with features consistent with neuroendocrine, or anaplastic, prostate cancer, exemplified by an aggressive clinical phenotype without elevation in circulating PSA levels. Further studies are warranted to explore the mechanistic and prognostic implications of AGR2 expression in this patient population.
View details for DOI 10.1002/pros.22569
View details for Web of Science ID 000313895900010
View details for PubMedID 22911164
Unexpected Dissemination Patterns in Lymphoma Progression Revealed by Serial Imaging within a Murine Lymph Node
2012; 72 (23): 6111-6118
Non-Hodgkin lymphoma (NHL) is a heterogeneous and highly disseminated disease, but the mechanisms of its growth and dissemination are not well understood. Using a mouse model of this disease, we used multimodal imaging, including intravital microscopy (IVM) combined with bioluminescence, as a powerful tool to better elucidate NHL progression. We injected enhanced green fluorescent protein and luciferase-expressing Eμ-Myc/Arf(-/-) (Cdkn2a(-/-)) mouse lymphoma cells (EL-Arf(-/-)) into C57BL/6NCrl mice intravenously. Long-term observation inside a peripheral lymph node was enabled by a novel lymph node internal window chamber technique that allows chronic, sequential lymph node imaging under in vivo physiologic conditions. Interestingly, during early stages of tumor progression we found that few if any lymphoma cells homed initially to the inguinal lymph node (ILN), despite clear evidence of lymphoma cells in the bone marrow and spleen. Unexpectedly, we detected a reproducible efflux of lymphoma cells from spleen and bone marrow, concomitant with a massive and synchronous influx of lymphoma cells into the ILN, several days after injection. We confirmed a coordinated efflux/influx of tumor cells by injecting EL-Arf(-/-) lymphoma cells directly into the spleen and observing a burst of lymphoma cells, validating that the burst originated in organs remote from the lymph nodes. Our findings argue that in NHL an efflux of tumor cells from one disease site to another, distant site in which they become established occurs in discrete bursts.
View details for DOI 10.1158/0008-5472.CAN-12-2579
View details for Web of Science ID 000311893100005
View details for PubMedID 23033441
Investigation of acquired resistance to EGFR-targeted therapies in lung cancer using cDNA microarrays.
Methods in molecular biology (Clifton, N.J.)
2012; 795: 233-253
Clinical tools to accurately describe, evaluate, and predict an individual's response to cancer therapy are a field-wide priority; in many advanced cancers, only 10-20% of individuals will have a clinical benefit from therapy, yet we treat the entire population. Furthermore, many therapies are initially effective, but lose effectiveness over time. Here we describe methods to derive in vitro models of resistance to EGFR tyrosine kinase inhibitors. We additionally describe approaches to characterize possible mechanisms of resistance by genomic and transcriptomic approaches.
View details for DOI 10.1007/978-1-61779-337-0_16
View details for PubMedID 21960227
- Cancer as a Multi-scale Complex Adaptive System Assessment Of Physical Sciences And Engineering Advances In Life Sciences And Oncology (Aphelion) In Europe 2012: 4-21
Installation and use of LabKey Server for proteomics.
Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.]
2011; Chapter 13: Unit 13 5-?
LabKey Server (formerly CPAS, the Computational Proteomics Analysis System) provides a Web-based platform for mining data from liquid chromatography-tandem mass spectrometry (LC-MS/MS) proteomic experiments. This open source platform supports systematic proteomic analyses and secure data management, integration, and sharing. LabKey Server incorporates several tools currently used in proteomic analysis, including the X! Tandem search engine, the ProteoWizard toolkit, and the PeptideProphet and ProteinProphet data mining tools. These tools and others are integrated into LabKey Server, which provides an extensible architecture for developing high-throughput biological applications. The LabKey Server analysis pipeline acts on data in standardized file formats, so that researchers may use LabKey Server with other search engines, including Mascot or SEQUEST, that follow a standardized format for reporting search engine results. Supported builds of LabKey Server are freely available at http://www.labkey.com/. Documentation and source code are available under the Apache License 2.0 at http://www.labkey.org.
View details for DOI 10.1002/0471250953.bi1305s36
View details for PubMedID 22161569
A High-Confidence Human Plasma Proteome Reference Set with Estimated Concentrations in PeptideAtlas
MOLECULAR & CELLULAR PROTEOMICS
2011; 10 (9)
Human blood plasma can be obtained relatively noninvasively and contains proteins from most, if not all, tissues of the body. Therefore, an extensive, quantitative catalog of plasma proteins is an important starting point for the discovery of disease biomarkers. In 2005, we showed that different proteomics measurements using different sample preparation and analysis techniques identify significantly different sets of proteins, and that a comprehensive plasma proteome can be compiled only by combining data from many different experiments. Applying advanced computational methods developed for the analysis and integration of very large and diverse data sets generated by tandem MS measurements of tryptic peptides, we have now compiled a high-confidence human plasma proteome reference set with well over twice the identified proteins of previous high-confidence sets. It includes a hierarchy of protein identifications at different levels of redundancy following a clearly defined scheme, which we propose as a standard that can be applied to any proteomics data set to facilitate cross-proteome analyses. Further, to aid in development of blood-based diagnostics using techniques such as selected reaction monitoring, we provide a rough estimate of protein concentrations using spectral counting. We identified 20,433 distinct peptides, from which we inferred a highly nonredundant set of 1929 protein sequences at a false discovery rate of 1%. We have made this resource available via PeptideAtlas, a large, multiorganism, publicly accessible compendium of peptides identified in tandem MS experiments conducted by laboratories around the world.
View details for DOI 10.1074/mcp.M110.006353
View details for Web of Science ID 000294729200003
View details for PubMedID 21632744
- Applying Multi-Agent Techniques to Cancer Modeling Proceedings of the Sixth Workshop on Multiagent Sequential Decision Making in Uncertain Domains 2011
- Interactively Mapping Data Sources into the Semantic Web Proceedings of The First International Symposium on Linked Science 2011; 783
Model-based discovery of circulating biomarkers.
Methods in molecular biology (Clifton, N.J.)
2011; 728: 87-107
Proteomic-based biomarker discovery approaches broadly attempt to identify proteins whose basal abundance, or change in abundance in response to a perturbation (e.g., a therapeutic intervention) is able to discriminate between populations of patients. Up until recently, the majority of approaches for discovering circulating biomarkers have focused on directly profiling serum or plasma to identify such proteins. However, the complexity and dynamic range of protein abundance in serum and plasma create a significant challenge for proteomics methods. To overcome these barriers, diverse approaches to simplify or to fractionate serum and plasma have been developed. For some diseases, such as those related to specific organs, there may be useful marker proteins that originate in the organ. Here, we describe an approach for marker discovery that focuses on the profiling of either primary tissue or cell culture models thereof.
View details for DOI 10.1007/978-1-61779-068-3_5
View details for PubMedID 21468942
Peptide Identification from Mixture Tandem Mass Spectra
MOLECULAR & CELLULAR PROTEOMICS
2010; 9 (7): 1476-1485
The success of high-throughput proteomics hinges on the ability of computational methods to identify peptides from tandem mass spectra (MS/MS). However, a common limitation of most peptide identification approaches is the nearly ubiquitous assumption that each MS/MS spectrum is generated from a single peptide. We propose a new computational approach for the identification of mixture spectra generated from more than one peptide. Capitalizing on the growing availability of large libraries of single-peptide spectra (spectral libraries), our quantitative approach is able to identify up to 98% of all mixture spectra from equally abundant peptides and automatically adjust to varying abundance ratios of up to 10:1. Furthermore, we show how theoretical bounds on spectral similarity avoid the need to compare each experimental spectrum against all possible combinations of candidate peptides (achieving speedups of over five orders of magnitude) and demonstrate that mixture-spectra can be identified in a matter of seconds against proteome-scale spectral libraries. Although our approach was developed for and is demonstrated on peptide spectra, we argue that the generality of the methods allows for their direct application to other types of spectral libraries and mixture spectra.
View details for DOI 10.1074/mcp.M000136-MCP201
View details for Web of Science ID 000279397200009
View details for PubMedID 20348588
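As a rough illustration of matching a mixture spectrum against a spectral library, the sketch below scores every pair of library spectra, combined at a few abundance ratios, by cosine similarity to the query. This is a brute-force stand-in: the published approach specifically avoids exhaustive pair enumeration by using theoretical bounds on spectral similarity. The binned-dictionary spectrum representation, the function names, and the fixed list of mixing weights are all assumptions made for this example.

```python
import math
from itertools import combinations
from typing import Dict, Optional, Tuple

Spectrum = Dict[int, float]  # hypothetical format: m/z bin -> intensity


def cosine(a: Spectrum, b: Spectrum) -> float:
    """Cosine similarity between two binned spectra."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mix(major: Spectrum, minor: Spectrum, weight: float) -> Spectrum:
    """Linear combination major + weight * minor."""
    out = dict(major)
    for k, v in minor.items():
        out[k] = out.get(k, 0.0) + weight * v
    return out


def best_pair(query: Spectrum, library: Dict[str, Spectrum],
              weights: Tuple[float, ...] = (1.0, 0.5, 0.1)
              ) -> Tuple[float, Optional[str], Optional[str], Optional[float]]:
    """Exhaustively find the (major, minor, weight) combination closest to query."""
    best = (-1.0, None, None, None)
    for (name_a, spec_a), (name_b, spec_b) in combinations(library.items(), 2):
        for w in weights:
            # Try each peptide of the pair as the more abundant component.
            for major, minor, n1, n2 in ((spec_a, spec_b, name_a, name_b),
                                         (spec_b, spec_a, name_b, name_a)):
                score = cosine(query, mix(major, minor, w))
                if score > best[0]:
                    best = (score, n1, n2, w)
    return best


if __name__ == "__main__":
    library = {
        "PEPA": {100: 1.0, 200: 2.0},
        "PEPB": {150: 3.0, 250: 1.0},
        "PEPC": {300: 1.0, 400: 1.0},
    }
    query = mix(library["PEPA"], library["PEPB"], 0.5)
    print(best_pair(query, library))
```

The cost of this enumeration grows quadratically with library size, which is exactly the scaling problem the similarity bounds in the paper are meant to sidestep.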
- Mass spectrometry based proteomics in cancer research Modern Molecular Biology: Approaches for Unbiased Discovery in Cancer Research 2010: 117-156
Recommendations from the 2008 International Summit on Proteomics Data Release and Sharing Policy: The Amsterdam Principles
JOURNAL OF PROTEOME RESEARCH
2009; 8 (7): 3689-3692
Policies supporting the rapid and open sharing of genomic data have directly fueled the accelerated pace of discovery in large-scale genomics research. The proteomics community is starting to implement analogous policies and infrastructure for making large-scale proteomics data widely available on a precompetitive basis. On August 14, 2008, the National Cancer Institute (NCI) convened the "International Summit on Proteomics Data Release and Sharing Policy" in Amsterdam, The Netherlands, to identify and address potential roadblocks to rapid and open access to data. The six principles agreed upon by key stakeholders at the summit addressed issues surrounding (1) timing, (2) comprehensiveness, (3) format, (4) deposition to repositories, (5) quality metrics, and (6) responsibility for proteomics data release. This summit report explores various approaches to develop a framework of data release and sharing principles that will most effectively fulfill the needs of the funding agencies and the research community.
View details for DOI 10.1021/pr900023z
View details for Web of Science ID 000267694600043
View details for PubMedID 19344107
ProteoWizard: open source software for rapid proteomics tools development
2008; 24 (21): 2534-2536
The ProteoWizard software project provides a modular and extensible set of open-source, cross-platform tools and libraries. The tools perform proteomics data analyses; the libraries enable rapid tool creation by providing a robust, pluggable development framework that simplifies and unifies data file access, and performs standard proteomics and LCMS dataset computations. The library contains readers and writers of the mzML data format, which has been written using modern C++ techniques and design principles and supports a variety of platforms with native compilers. The software has been specifically released under the Apache v2 license to ensure it can be used in both academic and commercial projects. In addition to the library, we also introduce a rapidly growing set of companion tools whose implementation helps to illustrate the simplicity of developing applications on top of the ProteoWizard library. Cross-platform software that compiles using native compilers (i.e. GCC on Linux, MSVC on Windows and XCode on OSX) is available for download free of charge, at http://proteowizard.sourceforge.net. This website also provides code examples and documentation. It is our hope the ProteoWizard project will become a standard platform for proteomics development; consequently, code use, contribution and further development are strongly encouraged.
View details for DOI 10.1093/bioinformatics/btn323
View details for Web of Science ID 000260381200017
View details for PubMedID 18606607
Halobacterium salinarum NRC-1 PeptideAtlas: Toward strategies for targeted proteomics and improved proteome coverage
JOURNAL OF PROTEOME RESEARCH
2008; 7 (9): 3755-3764
The relatively small numbers of proteins and fewer possible post-translational modifications in microbes provide a unique opportunity to comprehensively characterize their dynamic proteomes. We have constructed a PeptideAtlas (PA) covering 62.7% of the predicted proteome of the extremely halophilic archaeon Halobacterium salinarum NRC-1 by compiling approximately 636 000 tandem mass spectra from 497 mass spectrometry runs in 88 experiments. Analysis of the PA with respect to biophysical properties of constituent peptides, functional properties of parent proteins of detected peptides, and performance of different mass spectrometry approaches has highlighted plausible strategies for improving proteome coverage and selecting signature peptides for targeted proteomics. Notably, discovery of a significant correlation between absolute abundances of mRNAs and proteins has helped identify low abundance of proteins as the major limitation in peptide detection. Furthermore, we have discovered that iTRAQ labeling for quantitative proteomic analysis introduces a significant bias in peptide detection by mass spectrometry. Therefore, despite identifying at least one proteotypic peptide for almost all proteins in the PA, a context-dependent selection of proteotypic peptides appears to be the most effective approach for targeted proteomics.
View details for DOI 10.1021/pr800031f
View details for Web of Science ID 000259015500014
View details for PubMedID 18652504
Precursor-ion mass re-estimation improves peptide identification on hybrid instruments
JOURNAL OF PROTEOME RESEARCH
2008; 7 (9): 4031-4039
Mass spectrometry-based proteomics experiments have become an important tool for studying biological systems. Identifying the proteins in complex mixtures by assigning peptide fragmentation spectra to peptide sequences is an important step in the proteomics process. The 1-2 ppm mass-accuracy of hybrid instruments, like the LTQ-FT, has been cited as a key factor in their ability to identify a larger number of peptides with greater confidence than competing instruments. However, in replicate experiments of an 18-protein mixture, we note parent masses deviate 171 ppm, on average, for ion-trap data directed identifications and 8 ppm, on average, for preview Fourier transform (FT) data directed identifications. These deviations are neither caused by poor calibration nor by excessive ion-loading and are most likely due to errors in parent mass estimation. To reduce these deviations, we introduce msPrefix, a program to re-estimate a peptide's parent mass from an associated high-accuracy full-scan survey spectrum. In 18-protein mixture experiments, msPrefix parent mass estimates deviate only 1 ppm, on average, from the identified peptides. In a cell lysate experiment searched with a tolerance of 50 ppm, 2295 peptides were confidently identified using native data and 4560 using msPrefixed data. Likewise, in a plasma experiment searched with a tolerance of 50 ppm, 326 peptides were identified using native data and 1216 using msPrefixed data. msPrefix is also able to determine which MS/MS spectra were possibly derived from multiple precursor ions. In complex mixture experiments, we demonstrate that more than 50% of triggered MS/MS may have had multiple precursor ions and note that spectra with multiple candidate ions are less likely to result in an identification using TANDEM. These results demonstrate that integration of msPrefix into traditional shotgun proteomics workflows significantly improves identification results.
View details for DOI 10.1021/pr800307m
View details for Web of Science ID 000259015500038
View details for PubMedID 18707148
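The parts-per-million deviations and search tolerances quoted in the abstract above follow from a simple ratio. The following is a minimal Python sketch of that arithmetic only, assuming masses in daltons; the function names and the example masses are illustrative and are not taken from msPrefix.

```python
def ppm_error(observed_mass: float, theoretical_mass: float) -> float:
    """Signed deviation of an observed mass from a theoretical mass, in ppm."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def within_tolerance(observed_mass: float, theoretical_mass: float,
                     tol_ppm: float = 50.0) -> bool:
    """True if the absolute ppm deviation does not exceed the search tolerance."""
    return abs(ppm_error(observed_mass, theoretical_mass)) <= tol_ppm


# Hypothetical peptide with a theoretical mass of 1000.0 Da.
theoretical = 1000.0
poor_estimate = 1000.171   # roughly 171 ppm off: rejected at a 50 ppm tolerance
good_estimate = 1000.001   # roughly 1 ppm off: accepted

print(within_tolerance(poor_estimate, theoretical))
print(within_tolerance(good_estimate, theoretical))
```

A precursor mass estimate that is off by more than the search tolerance excludes the correct peptide from the candidate list, which is consistent with the gain in identifications the abstract reports after re-estimation.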
The standard protein mix database: A diverse data set to assist in the production of improved peptide and protein identification software tools
JOURNAL OF PROTEOME RESEARCH
2008; 7 (1): 96-103
Tandem mass spectrometry (MS/MS) is frequently used in the identification of peptides and proteins. Typical proteomic experiments rely on algorithms such as SEQUEST and MASCOT to compare thousands of tandem mass spectra against the theoretical fragment ion spectra of peptides in a database. The probabilities that these spectrum-to-sequence assignments are correct can be determined by statistical software such as PeptideProphet or through estimations based on reverse or decoy databases. However, many of the software applications that assign probabilities for MS/MS spectra to sequence matches were developed using training data sets from 3D ion-trap mass spectrometers. Given the variety of types of mass spectrometers that have become commercially available over the last 5 years, we sought to generate a data set of reference data covering multiple instrumentation platforms to facilitate both the refinement of existing computational approaches and the development of novel software tools. We analyzed the proteolytic peptides in a mixture of tryptic digests of 18 proteins, named the "ISB standard protein mix", using 8 different mass spectrometers. These include linear and 3D ion traps, two quadrupole time-of-flight platforms (qq-TOF), and two MALDI-TOF-TOF platforms. The resulting data set, which has been named the Standard Protein Mix Database, consists of over 1.1 million spectra in 150+ replicate runs on the mass spectrometers. The data were inspected for quality of separation and searched using SEQUEST. All data, including the native raw instrument and mzXML formats and the PeptideProphet validated peptide assignments, are available at http://regis-web.systemsbiology.net/PublicDatasets/.
View details for DOI 10.1021/pr070244j
View details for Web of Science ID 000252154200012
View details for PubMedID 17711323
Quantitative proteomic analysis of the budding yeast cell cycle using acid-cleavable isotope-coded affinity tag reagents
2006; 6 (23): 6146-6157
Quantitative profiling of proteins, the direct effectors of nearly all biological functions, will undoubtedly complement technologies for the measurement of mRNA. Systematic proteomic measurement of the cell cycle is now possible by using stable isotopic labeling with isotope-coded affinity tag reagents and software tools for high-throughput analysis of LC-MS/MS data. We provide here the first such study achieving quantitative, global proteomic measurement of a time-course gene expression experiment in a model eukaryote, the budding yeast Saccharomyces cerevisiae, during the cell cycle. We sampled 48% of all predicted ORFs, and provide the data, including identifications, quantitations, and statistical measures of certainty, to the community in a sortable matrix. We do not detect significant concordance in the dynamics of the system over the time-course tested between our proteomic measurements and microarray measures collected from similarly treated yeast cultures. Our proteomic dataset therefore provides a necessary and complementary measure of eukaryotic gene expression, establishes a rich database for the functional analysis of S. cerevisiae proteins, and will enable further development of technologies for global proteomic analysis of higher eukaryotes.
View details for DOI 10.1002/pmic.200600159
View details for Web of Science ID 000242879000004
View details for PubMedID 17133367
Protein cross-linking analysis using mass spectrometry, isotope-coded cross-linkers, and integrated computational data processing
JOURNAL OF PROTEOME RESEARCH
2006; 5 (9): 2270-2282
Distance constraints in proteins and protein complexes provide invaluable information for calculation of 3D structures, identification of protein binding partners and localization of protein-protein contact sites. We have developed an integrative approach to identify and characterize such sites through the analysis of proteolytic products derived from proteins chemically cross-linked by isotopically coded cross-linkers using LC-MALDI tandem mass spectrometry and computer software. This method is specifically tailored toward the rapid analysis of low microgram amounts of proteins or multimeric protein complexes cross-linked with nonlabeled and deuterium-labeled bis-NHS ester cross-linking reagents (both commercially available and readily synthesized). Through labeling with [18O]water solvent and LC-MALDI analysis, the method further allows the possible distinction between Type 0 and Type 1 or Type 2 modified peptides (monolinks and looplinks or cross-links), although such a distinction is more readily made from analysis of tandem mass spectrometry data. When applied to the bacterial Colicin E7 DNAse/Im7 heterodimeric protein complex, 23 cross-links were identified including six intersubunit cross-links, all between residues that are close in space when examined in the context of the X-ray structure of the heterodimer. In addition, cross-links were successfully identified in five single subunit proteins, beta-lactoglobulin, cytochrome c, lysozyme, myoglobin, and ribonuclease A, establishing the generality of the approach.
View details for DOI 10.1021/pr060154z
View details for Web of Science ID 000240200700024
View details for PubMedID 16944939
Mutagenesis of putative serine-threonine phosphorylation sites proximal to Arg255 of human cytochrome P450c17 does not selectively promote its 17,20-lyase activity
FERTILITY AND STERILITY
2006; 85: 1290-1299
Objective: To investigate the role of serine-threonine phosphorylation on the activity of human P450c17. Design: In vitro study. Setting: Academic basic research laboratory. Patient(s): None. Intervention(s): P450c17 expression constructs with a FLAG-tag on either the C-terminus or N-terminus of the protein were generated. Human C-terminal FLAG-tagged P450c17 chromosomal DNA was subjected to site-directed mutagenesis. Serine 258 and threonine 260 each were mutated to alanine and aspartic acid. The mutant P450c17s were expressed in COS-7 cells, and the enzymatic activities were measured. Main Outcome Measure(s): 17alpha-Hydroxylase and C(17-20) lyase activities of human P450c17. Result(s): C-terminal FLAG-tagged P450c17 functioned indistinguishably from the wild-type P450c17. Mutants S258A, S258D, and T260D had significantly less 17alpha-hydroxylase and C(17-20) lyase activities than the wild type. Conclusion(s): Adding an epitope tag to the C-terminus of the P450c17 protein does not interfere with its activities and will be a useful tool to isolate human P450c17 protein from cultured cells. Phosphorylation of serine 258 but not threonine 260 may act as a physiologic regulator of both enzymatic activities through interaction with obligatory redox partners.
View details for DOI 10.1016/j.fertnstert.2005.12.011
View details for Web of Science ID 000236902300028
View details for PubMedID 16616104
Signal maps for mass spectrometry-based comparative proteomics
MOLECULAR & CELLULAR PROTEOMICS
2006; 5 (3): 423-432
Mass spectrometry-based proteomic experiments, in combination with liquid chromatography-based separation, can be used to compare complex biological samples across multiple conditions. These comparisons are usually performed on the level of protein lists generated from individual experiments. Unfortunately, given the current technologies, these lists typically cover only a small fraction of the total protein content, making global comparisons extremely limited. Recently, approaches have been suggested that are built on the comparison of computationally built feature lists instead of protein identifications. Although these approaches promise to capture a bigger spectrum of the proteins present in a complex mixture, their success is strongly dependent on the correctness of the identified features and the aligned retention times of these features across multiple experiments. In this experimental-computational study, we went one step further and performed the comparisons directly on the signal level. First, signal maps were constructed that associate the experimental signals across multiple experiments. Then, a feature detection algorithm used this integrated information to identify those features that are discriminating or common across multiple experiments. At the core of our approach is a score function that faithfully recognizes mass spectra from similar peptide mixtures and an algorithm that produces an optimal alignment (time warping) of the liquid chromatography experiments on the basis of raw MS signal, making minimal assumptions on the underlying data. We provide experimental evidence that suggests uniqueness and correctness of the resulting signal maps even on low accuracy mass spectrometers. These maps can be used for a variety of proteomic analyses. Here we illustrate the use of signal maps for the discovery of diagnostic biomarkers. An implementation of our algorithm is available on our Web server.
View details for DOI 10.1074/mcp.M500133-MCP200
View details for Web of Science ID 000236142800001
View details for PubMedID 16269421
Analysis of the Saccharomyces cerevisiae proteome with PeptideAtlas
2006; 7 (11)
We present the Saccharomyces cerevisiae PeptideAtlas composed from 47 diverse experiments and 4.9 million tandem mass spectra. The observed peptides align to 61% of Saccharomyces Genome Database (SGD) open reading frames (ORFs), 49% of the uncharacterized SGD ORFs, 54% of S. cerevisiae ORFs with a Gene Ontology annotation of 'molecular function unknown', and 76% of ORFs with Gene names. We highlight the use of this resource for data mining, construction of high quality lists for targeted proteomics, validation of proteins, and software development.
View details for DOI 10.1186/gb-2006-7-11-r106
View details for Web of Science ID 000243967000010
View details for PubMedID 17101051
The PeptideAtlas project
NUCLEIC ACIDS RESEARCH
2006; 34: D655-D658
The completion of the sequencing of the human genome and the concurrent, rapid development of high-throughput proteomic methods have resulted in an increasing need for automated approaches to archive proteomic data in a repository that enables the exchange of data among researchers and also accurate integration with genomic data. PeptideAtlas (http://www.peptideatlas.org/) addresses these needs by identifying peptides by tandem mass spectrometry (MS/MS), statistically validating those identifications and then mapping identified sequences to the genomes of eukaryotic organisms. A meaningful comparison of data across different experiments generated by different groups using different types of instruments is enabled by the implementation of a uniform analytic process. This uniform statistical validation ensures a consistent and high-quality set of peptide and protein identifications. The raw data from many diverse proteomic experiments are made available in the associated PeptideAtlas repository in several formats. Here we present a summary of our process and details about the Human, Drosophila and Yeast PeptideAtlas builds.
View details for DOI 10.1093/nar/gkj040
View details for Web of Science ID 000239307700138
View details for PubMedID 16381952
- A perspective on protein profiling of blood BJU INTERNATIONAL 2005; 96 (4): 477-482
Scoring proteomes with proteotypic peptide probes
NATURE REVIEWS MOLECULAR CELL BIOLOGY
2005; 6 (7): 577-583
Technologies for genome-wide analyses typically undergo a transition from a discovery phase to a scoring phase. In the discovery phase, the genomic universe is explored and all pertinent data are noted. In the scoring phase, relevant entities are screened to reveal groups of genes that are associated with specific biological processes or conditions. In this article, we propose that the transition from a discovery to a scoring phase is also essential, feasible and imminent for proteomics.
View details for DOI 10.1038/nrm1683
View details for Web of Science ID 000230245700014
View details for PubMedID 15957003
High throughput quantitative analysis of serum proteins using glycopeptide capture and liquid chromatography mass spectrometry
MOLECULAR & CELLULAR PROTEOMICS
2005; 4 (2): 144-155
It is expected that the composition of the serum proteome can provide valuable information about the state of the human body in health and disease and that this information can be extracted via quantitative proteomic measurements. Suitable proteomic techniques need to be sensitive, reproducible, and robust to detect potential biomarkers below the level of highly expressed proteins, generate data sets that are comparable between experiments and laboratories, and have high throughput to support statistical studies. Here we report a method for high throughput quantitative analysis of serum proteins. It consists of the selective isolation of peptides that are N-linked glycosylated in the intact protein, the analysis of these now deglycosylated peptides by liquid chromatography electrospray ionization mass spectrometry, and the comparative analysis of the resulting patterns. By focusing selectively on a few formerly N-linked glycopeptides per serum protein, the complexity of the analyte sample is significantly reduced and the sensitivity and throughput of serum proteome analysis are increased compared with the analysis of total tryptic peptides from unfractionated samples. We provide data that document the performance of the method and show that sera from untreated normal mice and genetically identical mice with carcinogen-induced skin cancer can be unambiguously discriminated using unsupervised clustering of the resulting peptide patterns. We further identify, by tandem mass spectrometry, some of the peptides that were consistently elevated in cancer mice compared with their control littermates.
View details for DOI 10.1074/mcp.M400090-MCP200
View details for Web of Science ID 000227381300004
View details for PubMedID 15608340
- Finding protein domain boundaries: an automated, non-homology-based method IEEE Intelligent Systems 2005; Nov-Dec (6): 26-33
Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry
2005; 6 (1)
A crucial aim upon the completion of the human genome is the verification and functional annotation of all predicted genes and their protein products. Here we describe the mapping of peptides derived from accurate interpretations of protein tandem mass spectrometry (MS) data to eukaryotic genomes and the generation of an expandable resource for integration of data from many diverse proteomics experiments. Furthermore, we demonstrate that peptide identifications obtained from high-throughput proteomics can be integrated on a large scale with the human genome. This resource could serve as an expandable repository for MS-derived proteome information.
View details for Web of Science ID 000226337200015
View details for PubMedID 15642101
PFIT and PFRIT: Bioinformatic algorithms for detecting glycosidase function from structure and sequence
2004; 13 (1): 221-229
The identification of the enzymes involved in the metabolism of simple and complex carbohydrates presents one bioinformatic challenge in the post-genomic era. Here, we present the PFIT and PFRIT algorithms for identifying those proteins adopting the alpha/beta barrel fold that function as glycosidases. These algorithms are based on the observation that proteins adopting the alpha/beta barrel fold share positions in their tertiary structures having equivalent sets of atomic interactions. These are conserved tertiary interaction positions, which have been implicated in both structure and function. Glycosidases adopting the alpha/beta barrel fold share more conserved tertiary interactions than alpha/beta barrel proteins having other functions. The enrichment pattern of conserved tertiary interactions in the glycosidases is the information that PFIT and PFRIT use to predict whether any given alpha/beta barrel will function as a glycosidase or not. Using as a test set a database of 19 glycosidase and 45 nonglycosidase alpha/beta barrel proteins with low sequence similarity, PFIT and PFRIT can correctly predict glycosidase function for 84% of the proteins known to function as glycosidases. PFIT and PFRIT incorrectly predict glycosidase function for 25% of the nonglycosidases. The program PSI-BLAST can also correctly identify 84% of the 19 glycosidases; however, it incorrectly predicts glycosidase function for 50% of the nonglycosidases (twofold greater than PFIT and PFRIT). Overall, we demonstrate that the structure-based PFIT and PFRIT algorithms are both more selective and sensitive for predicting glycosidase function than the sequence-based PSI-BLAST algorithm.
View details for DOI 10.1110/ps.03274104
View details for Web of Science ID 000187587700022
View details for PubMedID 14691237
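The selectivity comparison in the abstract above can be made concrete by converting the reported rates into an expected precision. The sketch below is a back-of-the-envelope calculation and is not part of PFIT or PFRIT. It uses only the figures quoted in the abstract (19 glycosidases, 45 nonglycosidases, 84% sensitivity, and false-positive rates of 25% and 50%); the function name is hypothetical, and the counts are expected values rather than reported tallies.

```python
def expected_precision(sensitivity: float, false_positive_rate: float,
                       n_positive: int, n_negative: int) -> float:
    """Expected fraction of positive calls that are true positives."""
    true_positives = sensitivity * n_positive
    false_positives = false_positive_rate * n_negative
    return true_positives / (true_positives + false_positives)


# Rates and set sizes as quoted in the abstract.
pfit_precision = expected_precision(0.84, 0.25, 19, 45)
psiblast_precision = expected_precision(0.84, 0.50, 19, 45)

print(f"PFIT/PFRIT: {pfit_precision:.2f}")
print(f"PSI-BLAST:  {psiblast_precision:.2f}")
```

Under these figures, about 59% of the PFIT/PFRIT glycosidase calls on this test set would be correct, against about 41% for PSI-BLAST, even though both methods recover the same fraction of true glycosidases.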
Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach
2003; 4 (9)
The genome of Mycobacterium tuberculosis was analyzed using recently developed computational approaches to infer protein function and protein linkages. We evaluated and employed a method to infer genes likely to belong to the same operon, as judged by the nucleotide distance between genes in the same genomic orientation, and combined this method with those of the Rosetta Stone, Phylogenetic Profile and conserved Gene Neighbor computational methods for the inference of protein function.
View details for Web of Science ID 000185048100012
View details for PubMedID 12952538
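The operon inference described in the abstract above, which groups neighboring genes by intergenic distance and orientation, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the `Gene` record, the function name, the gene coordinates, and the 100-nucleotide `max_gap` threshold are all hypothetical, and the abstract does not state the threshold actually used.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based inclusive start coordinate
    end: int     # 1-based inclusive end coordinate
    strand: str  # "+" or "-"


def group_into_putative_operons(genes: List[Gene],
                                max_gap: int = 100) -> List[List[Gene]]:
    """Group adjacent same-strand genes separated by at most max_gap nucleotides."""
    operons: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            previous = operons[-1][-1]
            gap = gene.start - previous.end - 1  # nucleotides between the two genes
            if previous.strand == gene.strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


# Hypothetical coordinates for illustration only.
example_genes = [
    Gene("a", 1, 100, "+"),
    Gene("b", 150, 300, "+"),     # 49 nt after a, same strand: joins a
    Gene("c", 1000, 1200, "+"),   # 699 nt after b: starts a new group
    Gene("d", 1210, 1400, "-"),   # close to c but opposite strand: new group
]
groups = group_into_putative_operons(example_genes)
print([[g.name for g in group] for group in groups])
```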
The directional atomic solvation energy: An atom-based potential for the assignment of protein sequences to known folds
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (25): 16041-16046
The Directional Atomic Solvation EnergY (DASEY) is an atom-based description of the environment of an amino acid position within a known 3D protein structure. The DASEY has been developed to align and score a probe amino acid sequence to a library of template protein structures for fold assignment. DASEY is computed by summing the atomic solvation parameters of atoms falling within a tetrahedral sector, or petal, extending 16 Å along each of the four bond axes of each alpha-carbon atom of the protein. The DASEY discriminates between pairs of structurally equivalent positions and random pairs in protein structures sharing a fold but belonging to different superfamilies, unlike some previous descriptors of protein environments, such as buried area. Furthermore, the DASEY values have characteristic patterns of residue replacement, an essential feature of a successful fold assignment method. Benchmarking fold assignment with DASEY achieves coverage of 56% of sequences with 90% accuracy when probe sequences are matched to protein structural templates belonging to the same fold but to a different superfamily, an improvement of greater than 200% over a previous method.
View details for DOI 10.1073/pnas.252626399
View details for Web of Science ID 000179783400041
View details for PubMedID 12461172
Genomic evidence that the intracellular proteins of archaeal microbes contain disulfide bonds
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (15): 9679-9684
Disulfide bonds have only rarely been found in intracellular proteins. That pattern is consistent with the chemically reducing environment inside the cells of well-studied organisms. However, recent experiments and new calculations based on genomic data of archaea provide striking contradictions to this pattern. Our results indicate that the intracellular proteins of certain hyperthermophilic archaea, especially the crenarchaea Pyrobaculum aerophilum and Aeropyrum pernix, are rich in disulfide bonds. This finding implicates disulfide bonding in stabilizing many thermostable proteins and points to novel chemical environments inside these microbes. These unexpected results illustrate the wealth of biochemical insights available from the growing reservoir of genomic data.
View details for DOI 10.1073/pnas.142310499
View details for Web of Science ID 000177042400017
View details for PubMedID 12107280
A modeled hydrophobic domain on the TCL1 oncoprotein mediates association with AKT at the cytoplasmic membrane
2002; 41 (20): 6376-6382
AKT has a critical role in relaying cell survival and proliferation signals initiated by ligand binding to surface receptors in mammalian cells. Induction of AKT serine/threonine kinase activity is augmented by the T-cell leukemia-1 (TCL1) oncoprotein through a physical association requiring the AKT pleckstrin homology domain. Here, we used molecular modeling and identified an exposed hydrophobic patch composed of two discontinuous amino acid stretches near one end of the TCL1 beta-barrel that was required for a TCL1-AKT association. Site-directed mutations of this region did not affect TCL1 secondary structure, yet they disrupted interactions with AKT. This region was found in other members of the TCL1 oncoprotein family, such as TCL1b and MTCP1, and suggested a conserved, novel AKT binding domain. Interestingly, TCL1 and AKT co-localize in multiple cell compartments, but only extracts from the plasma membrane stimulate optimal complex formation in vitro. Identification of an AKT binding domain on TCL1 is an important step in deciphering the complex interactions that regulate AKT kinase activity in lymphocyte development and neoplasia within the immune system.
View details for DOI 10.1021/bi016068o
View details for Web of Science ID 000175651400019
View details for PubMedID 12009899
GXXXG and AXXXA: Common alpha-helical interaction motifs in proteins, particularly in extremophiles
2002; 41 (19): 5990-5997
The GXXXG motif is a frequently occurring sequence of residues that is known to favor helix-helix interactions in membrane proteins. Here we show that the GXXXG motif is also prevalent in soluble proteins whose structures have been determined. Some 152 proteins from a non-redundant PDB set contain at least one alpha-helix with the GXXXG motif, 41 +/- 9% more than expected if glycine residues were uniformly distributed in those alpha-helices. More than 50% of the GXXXG-containing alpha-helices participate in helix-helix interactions. In fact, 26 of those helix-helix interactions are structurally similar to the helix-helix interaction of the glycophorin A dimer, where two transmembrane helices associate to form a dimer stabilized by the GXXXG motif. As for the glycophorin A structure, we find backbone-to-backbone atomic contacts of the C alpha-H...O type in each of these 26 helix-helix interactions that display the stereochemical hallmarks of hydrogen bond formation. These glycophorin A-like helix-helix interactions are enriched in the general set of helix-helix interactions containing the GXXXG motif, suggesting that the inferred C alpha-H...O hydrogen bonds stabilize the helix-helix interactions. In addition to the GXXXG motif, some 808 proteins from the non-redundant PDB set contain at least one alpha-helix with the AXXXA motif (30 +/- 3% greater than expected). Both the GXXXG and AXXXA motifs occur frequently in predicted alpha-helices from 24 fully sequenced genomes. Occurrence of the AXXXA motif is enhanced to a greater extent in thermophiles than in mesophiles, suggesting that helical interaction based on the AXXXA motif may be a common mechanism of thermostability in protein structures. We conclude that the GXXXG sequence motif stabilizes helix-helix interactions in proteins, and that the AXXXA sequence motif also stabilizes the folded state of proteins.
View details for DOI 10.1021/bi0200763
View details for Web of Science ID 000175547000007
View details for PubMedID 11993993
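The GXXXG and AXXXA motifs discussed in the abstract above are simple sequence patterns: the same residue at positions i and i+4. The following Python sketch locates them in a one-letter amino acid string. It is an illustration only; the function name and the toy sequence are hypothetical, and unlike the study it makes no attempt to restrict matches to alpha-helices or to estimate expected frequencies.

```python
import re
from typing import List


def find_motif_positions(sequence: str, flank: str) -> List[int]:
    """Return 0-based start positions of flank-X-X-X-flank, including overlaps."""
    # A zero-width lookahead lets overlapping occurrences be reported.
    pattern = re.compile(rf"(?={re.escape(flank)}.{{3}}{re.escape(flank)})")
    return [match.start() for match in pattern.finditer(sequence.upper())]


toy_sequence = "GAAAGAAAG"  # hypothetical sequence, not from the paper
gxxxg_positions = find_motif_positions(toy_sequence, "G")
axxxa_positions = find_motif_positions(toy_sequence, "A")
print(gxxxg_positions, axxxa_positions)
```

The lookahead matters because motifs can overlap, as with the two GXXXG occurrences that share the central glycine in the toy sequence.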
- Making sense of proteomics: Using bioinformatics to discover a protein's structure, functions and interactions. Proteins and Proteomics: A Laboratory Manual. Cold Spring Harbor Laboratory Press: 2002: Chapter 11
The 1.7 angstrom crystal structure of BPI: A study of how two dissimilar amino acid sequences can adopt the same fold
JOURNAL OF MOLECULAR BIOLOGY
2000; 299 (4): 1019-1034
We have extended the resolution of the crystal structure of human bactericidal/permeability-increasing protein (BPI) to 1.7 Å. BPI has two domains with the same fold, but with little sequence similarity. To understand the similarity in structure of the two domains, we compare the corresponding residue positions in the two domains by the method of 3D-1D profiles. A 3D-1D profile is a string formed by assigning each position in the 3D structure to one of 18 environment classes. The environment classes are defined by the local secondary structure, the area of the residue which is buried from solvent, and the fraction of the area buried by polar atoms. A structural alignment between the two BPI domains was used to compare the 3D-1D environments of structurally equivalent positions. Greater than 31% of the aligned positions have conserved 3D-1D environments, but only 13% have conserved residue identities. Analysis of the 3D-1D environmentally conserved positions helps to identify pairs of residues likely to be important in conserving the fold, regardless of the residue similarity. We find examples of 3D-1D environmentally conserved positions with dissimilar residues which nevertheless play similar structural roles. To generalize our findings, we analyzed four other proteins with similar structures yet dissimilar sequences. Together, these examples show that aligned pairs of dissimilar residues often share similar structural roles, stabilizing dissimilar sequences in the same fold.
View details for Web of Science ID 000087680400016
View details for PubMedID 10843855
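The comparison in the abstract above reduces to counting identical symbols at structurally aligned positions, once for the environment strings and once for the residue strings. A minimal Python sketch follows, assuming each of the 18 environment classes is encoded as a single character and that the alignment is gap-free; the function name, the encoding, and the example strings are hypothetical and are not BPI data.

```python
def fraction_conserved(aligned_a: str, aligned_b: str) -> float:
    """Fraction of aligned positions at which the two strings carry the same symbol."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    if not aligned_a:
        raise ValueError("aligned strings must not be empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y)
    return matches / len(aligned_a)


# Hypothetical six-position alignment of two domains.
environments_1, environments_2 = "EEPPBB", "EEPPHH"   # one letter per environment class
residues_1, residues_2 = "LKVAGT", "MKTSDQ"           # one-letter amino acid codes

environment_conservation = fraction_conserved(environments_1, environments_2)
residue_conservation = fraction_conserved(residues_1, residues_2)
print(environment_conservation, residue_conservation)
```

In this toy alignment the environments agree at four of six positions while the residues agree at only one, the same qualitative pattern the abstract reports for the two BPI domains.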
Selecting protein targets for structural genomics of Pyrobaculum aerophilum: Validating automated fold assignment methods by using binary hypothesis testing
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2000; 97 (6): 2450-2455
Three-dimensional protein folds were assigned to all ORFs of the recently sequenced genome of the hyperthermophilic archaeon Pyrobaculum aerophilum. Binary hypothesis testing was used to estimate a confidence level for each assignment. A separate test was conducted to assign a probability for whether each sequence has a novel fold, i.e., one that is not yet represented in the experimental database of known structures. Of the 2,130 predicted nontransmembrane proteins in this organism, 916 matched a fold at a cumulative 90% confidence level, and 245 could be assigned at a 99% confidence level. Likewise, 286 proteins were predicted to have a previously unobserved fold with a 90% confidence level, and 14 at a 99% confidence level. These statistically based tools are combined with homology searches against the Online Mendelian Inheritance in Man (OMIM) human genetics database and other protein databases for the selection of attractive targets for crystallographic or NMR structure determination. Results of these studies have been collated and placed at http://www.doe-mbi.ucla.edu/people/parag/PA_HOME/, the University of California, Los Angeles-Department of Energy Pyrobaculum aerophilum web site.
View details for Web of Science ID 000085941400011
View details for PubMedID 10706641
- The accidental bioinformaticist JOURNAL OF CELLULAR BIOCHEMISTRY 2000; 80 (2): 208-209