Doctor of Philosophy, Indiana University (2011)
Bachelor of Science, Peking University (2003)
Ronald Davis, Postdoctoral Faculty Sponsor
Comprehensive curation and analysis of fungal biosynthetic gene clusters of published natural products.
Fungal genetics and biology : FG & B
Microorganisms produce a wide range of natural products (NPs) with clinically and agriculturally relevant biological activities. In bacteria and fungi, genes encoding successive steps in a biosynthetic pathway tend to be clustered on the chromosome as biosynthetic gene clusters (BGCs). Historically, "activity-guided" approaches to NP discovery have focused on bioactivity screening of NPs produced by culturable microbes. In contrast, recent "genome mining" approaches first identify candidate BGCs, express these biosynthetic genes using synthetic biology methods, and finally test for the production of NPs. Fungal genome mining efforts and the exploration of novel sequence and NP space are limited, however, by the lack of a comprehensive catalog of BGCs encoding experimentally-validated products. In this study, we generated a comprehensive reference set of fungal NPs whose biosynthetic gene clusters are described in the published literature. To generate this dataset, we first identified NCBI records that included both a peer-reviewed article and an associated nucleotide record. We filtered these records by text and homology criteria to identify putative NP-related articles and BGCs. Next, we manually curated the resulting articles, chemical structures, and protein sequences. The resulting catalog contains 197 unique NP compounds covering several major classes of fungal NPs, including polyketides, non-ribosomal peptides, terpenoids, and alkaloids. The distribution of articles published per compound shows a bias towards the study of certain popular compounds, such as the aflatoxins. Phylogenetic analysis of biosynthetic genes suggests that much chemical and enzymatic diversity remains to be discovered in fungi. 
Our catalog was incorporated into the recently launched Minimum Information about Biosynthetic Gene cluster (MIBiG) repository to create the largest known set of fungal BGCs and associated NPs, a resource that we anticipate will guide future genome mining and synthetic biology efforts toward discovering novel fungal enzymes and metabolites.
View details for DOI 10.1016/j.fgb.2016.01.012
View details for PubMedID 26808821
SEPARATING THE CAUSES AND CONSEQUENCES IN DISEASE TRANSCRIPTOME.
Pacific Symposium on Biocomputing
2016; 21: 381-92
The causes of complex diseases are multifactorial and the phenotypes of complex diseases are typically heterogeneous, posing significant challenges for both experimental design and statistical inference in the study of such diseases. Transcriptome profiling can potentially provide key insights into the pathogenesis of diseases, but the signals from disease causes and consequences are intertwined, leaving it to speculation which signals are likely causal. Genome-wide association studies, on the other hand, provide direct evidence of the potential genetic causes of diseases, but they do not provide a comprehensive view of disease pathogenesis, and they have difficulty detecting weak signals from individual genes. Here we propose an approach, diseaseExPatho, that combines transcriptome data, regulome knowledge, and GWAS results if available, to separate the causes and consequences in the disease transcriptome. DiseaseExPatho computationally deconvolutes the expression data into gene expression modules, hierarchically ranks the modules based on the regulome using a novel algorithm, and, given GWAS data, directly labels the potential causal gene modules based on their correlations with genome-wide gene-disease associations. Strikingly, we observed that the putative causal modules are not necessarily differentially expressed in disease, while the other modules can show strong differential expression without enrichment of top GWAS variations. On the other hand, we showed that the regulatory network based module ranking prioritized the putative causal modules consistently in 6 diseases. We suggest that the approach is applicable to other common and rare complex diseases to prioritize causal pathways with or without genome-wide association studies.
View details for PubMedID 26776202
- Assessment of the Radiation Effects of Cardiac CT Angiography Using Protein and Genetic Biomarkers JACC-CARDIOVASCULAR IMAGING 2015; 8 (8): 873-884
A maximum-likelihood approach to absolute protein quantification in mass spectrometry
View details for DOI 10.1145/2808719.2808750
Computational approaches to protein inference in shotgun proteomics
Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high throughput nature, shotgun proteomics faces challenges with respect to the analysis and interpretation of experimental data. Among such challenges, the identification of proteins present in a sample has been recognized as an important computational task. This task generally consists of (1) assigning experimental tandem mass spectra to peptides derived from a protein database, and (2) mapping assigned peptides to proteins and quantifying the confidence of identified proteins. Protein identification is fundamentally a statistical inference problem with a number of methods proposed to address its challenges. In this review we categorize current approaches into rule-based, combinatorial optimization and probabilistic inference techniques, and present them using integer programming and Bayesian inference frameworks. We also discuss the main challenges of protein identification and propose potential solutions with the goal of spurring innovative research in this area.
View details for DOI 10.1186/1471-2105-13-S16-S4
View details for Web of Science ID 000312714500004
View details for PubMedID 23176300
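The combinatorial optimization view of protein inference mentioned in the abstract above is often illustrated as a minimum set cover: choose the fewest proteins that explain all identified peptides. The sketch below is a greedy approximation of that idea, not the formulation used in the review; the function name `greedy_parsimony` and the toy protein and peptide labels are hypothetical.

```python
from typing import Dict, List, Set


def greedy_parsimony(
    protein_to_peptides: Dict[str, Set[str]], observed: Set[str]
) -> List[str]:
    """Greedy set cover: repeatedly pick the protein that explains the
    most still-unexplained peptides (an approximation, not an exact ILP)."""
    uncovered = set(observed)
    chosen: List[str] = []
    while uncovered:
        # Iterate in sorted order so ties resolve alphabetically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & uncovered),
        )
        gain = protein_to_peptides[best] & uncovered
        if not gain:
            break  # remaining peptides map to no protein in the database
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical toy database: peptide p3 is shared by proteins A and B.
toy_db = {"A": {"p1", "p2", "p3"}, "B": {"p3", "p4"}, "C": {"p4"}}
print(greedy_parsimony(toy_db, {"p1", "p2", "p3", "p4"}))
```

Shared peptides such as `p4` are what make the problem ambiguous: proteins B and C explain it equally well, and a parsimony rule alone cannot tell them apart, which is one motivation for the probabilistic approaches the review also covers.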
Protein identification problem from a Bayesian point of view.
Statistics and its interface
2012; 5 (1): 21-37
We present a generic Bayesian framework for the peptide and protein identification in proteomics, and provide a unified interpretation for the database searching and the de novo peptide sequencing approaches that are used in peptide identification. We describe several probabilistic graphical models and a variety of prior distributions that can be incorporated into the Bayesian framework to model different types of prior information, such as the known protein sequences, the known protein abundances, the peptide precursor masses, the estimated peptide retention time and the peptide detectabilities. Various applications of the Bayesian framework are discussed theoretically, including its application to the identification of peptides containing mutations and post-translational modifications.
View details for DOI 10.4310/SII.2012.v5.n1.a3
View details for PubMedID 24761189
- Investigation of VUV photodissociation propensities using peptide libraries INTERNATIONAL JOURNAL OF MASS SPECTROMETRY 2011; 308 (2-3): 142-154
To Release or Not to Release: Evaluating Information Leaks in Aggregate Human-Genome Data
COMPUTER SECURITY - ESORICS 2011
2011; 6879: 607-627
View details for Web of Science ID 000307366400033
The Importance of Peptide Detectability for Protein Identification, Quantification, and Experiment Design in MS/MS Proteomics
JOURNAL OF PROTEOME RESEARCH
2010; 9 (12): 6288-6297
Peptide detectability is defined as the probability that a peptide is identified in an LC-MS/MS experiment and has been useful in providing solutions to protein inference and label-free quantification. Previously, predictors for peptide detectability trained on standard or complex samples were proposed. Although the models trained on complex samples may benefit from the large training data sets, it is unclear to what extent they are affected by the unequal abundances of identified proteins. To address this challenge and improve detectability prediction, we present a new algorithm for the iterative learning of peptide detectability from complex mixtures. We provide evidence that the new method approximates detectability with useful accuracy and, based on its design, can be used to interpret the outcome of other learning strategies. We studied the properties of peptides from the bacterium Deinococcus radiodurans and found that at standard quantities, its tryptic peptides can be roughly classified as either detectable or undetectable, with a relatively small fraction having medium detectability. We extend the concept of detectability from peptides to proteins and apply the model to predict the behavior of a replicate LC-MS/MS experiment from a single analysis. Finally, our study summarizes a theoretical framework for peptide/protein identification and label-free quantification.
View details for DOI 10.1021/pr1005586
View details for Web of Science ID 000284856200018
View details for PubMedID 21067214
Structure-based kernels for the prediction of catalytic residues and their involvement in human inherited disease
2010; 26 (16): 1975-1982
Enzyme catalysis is involved in numerous biological processes and the disruption of enzymatic activity has been implicated in human disease. Despite this, various aspects of catalytic reactions are not completely understood, such as the mechanics of reaction chemistry and the geometry of catalytic residues within active sites. As a result, the computational prediction of catalytic residues has the potential to identify novel catalytic pockets, aid in the design of more efficient enzymes and also predict the molecular basis of disease. We propose a new kernel-based algorithm for the prediction of catalytic residues based on protein sequence, structure and evolutionary information. The method relies upon explicit modeling of similarity between residue-centered neighborhoods in protein structures. We present evidence that this algorithm evaluates favorably against established approaches, and also provides insights into the relative importance of the geometry, physicochemical properties and evolutionary conservation of catalytic residue activity. The new algorithm was used to identify known mutations associated with inherited disease whose molecular mechanism might be predicted to operate specifically through the loss or gain of catalytic residues. It should, therefore, provide a viable approach to identifying the molecular basis of disease in which the loss or gain of function is not caused solely by the disruption of protein stability. Our analysis suggests that both mechanisms are actively involved in human inherited disease. Source code for the structural kernel is available at www.informatics.indiana.edu/predrag/.
View details for DOI 10.1093/bioinformatics/btq319
View details for Web of Science ID 000280703500008
View details for PubMedID 20551136
Combinatorial Libraries of Synthetic Peptides as a Model for Shotgun Proteomics
2010; 82 (15): 6559-6568
A synthetic approach to model the analytical complexity of biological proteolytic digests has been developed. Combinatorial peptide libraries ranging in length between 9 and 12 amino acids that represent typical tryptic digests were designed, synthesized, and analyzed. Individual libraries and mixtures thereof were studied by replicate liquid chromatography-ion trap mass spectrometry and compared to a tryptic digest of Deinococcus radiodurans. Similar to complex proteome analysis, replicate study of individual libraries identified additional unique peptides. Fewer novel sequences were revealed with each additional analysis in a manner similar to that observed for biological data. Our results demonstrate a bimodal distribution of peptides sorting to either very low or very high levels of detection. Upon mixing of libraries at equal abundance, a length-dependent bias in favor of longer sequence identification was observed. Peptide identification as a function of site-specific amino acid content was characterized with certain amino acids proving to be of considerable importance. This report demonstrates that peptide libraries of defined character can serve as a reference for instrument characterization. Furthermore, they are uniquely suited to delineate the physical properties that influence identification of peptides, which provides a foundation for optimizing the study of samples with less defined heterogeneity.
View details for DOI 10.1021/ac100910a
View details for Web of Science ID 000280401400036
View details for PubMedID 20669997
A Bayesian Approach to Protein Inference Problem in Shotgun Proteomics
JOURNAL OF COMPUTATIONAL BIOLOGY
2009; 16 (8): 1183-1193
The protein inference problem represents a major challenge in shotgun proteomics. In this article, we describe a novel Bayesian approach to address this challenge by incorporating the predicted peptide detectabilities as the prior probabilities of peptide identification. We propose a rigorous probabilistic model for protein inference and provide practical algorithmic solutions to this problem. We used a complex synthetic protein mixture to test our method and obtained promising results.
View details for DOI 10.1089/cmb.2009.0018
View details for Web of Science ID 000269639100015
View details for PubMedID 19645593
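To make concrete how peptide detectability can inform protein inference, here is a deliberately simplified single-protein sketch. It is not the model from the article above: it assumes peptides are identified independently, ignores peptides shared between proteins, and uses a hypothetical function name (`protein_posterior`) and a hypothetical false-identification rate.

```python
from typing import Dict, Set


def protein_posterior(
    detectability: Dict[str, float],
    identified: Set[str],
    prior: float = 0.5,
    false_id_rate: float = 0.01,  # assumed chance of a spurious identification
) -> float:
    """Posterior probability that one protein is present, given which of
    its peptides were identified (toy model with independent peptides)."""
    like_present = 1.0
    like_absent = 1.0
    for peptide, d in detectability.items():
        if peptide in identified:
            like_present *= d              # seen, as detectability predicts
            like_absent *= false_id_rate   # seen only by error
        else:
            like_present *= 1.0 - d        # missed despite protein presence
            like_absent *= 1.0 - false_id_rate
    numerator = prior * like_present
    return numerator / (numerator + (1.0 - prior) * like_absent)


# Hypothetical protein with one easy and one hard-to-detect peptide.
det = {"easy": 0.9, "hard": 0.1}
print(protein_posterior(det, {"easy"}))  # only the easy peptide was seen
print(protein_posterior(det, {"hard"}))  # only the hard peptide was seen
```

In this toy setting, missing a highly detectable peptide counts as evidence against the protein, so identifying only the hard peptide yields a lower posterior than identifying only the easy one.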
Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study
CCS'09: PROCEEDINGS OF THE 16TH ACM CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY
View details for Web of Science ID 000281662800049
"REVERSE ECOLOGY" AND THE POWER OF POPULATION GENOMICS
2008; 62 (12): 2984-2994
Rapid and inexpensive sequencing technologies are making it possible to collect whole genome sequence data on multiple individuals from a population. This type of data can be used to quickly identify genes that control important ecological and evolutionary phenotypes by finding the targets of adaptive natural selection, and we therefore refer to such approaches as "reverse ecology." To quantify the power gained in detecting positive selection using population genomic data, we compare three statistical methods for identifying targets of selection: the McDonald-Kreitman test, the mkprf method, and a likelihood implementation for detecting d(N)/d(S) > 1. Because the first two methods use polymorphism data we expect them to have more power to detect selection. However, when applied to population genomic datasets from human, fly, and yeast, the tests using polymorphism data were actually weaker in two of the three datasets. We explore reasons why the simpler comparative method has identified more genes under selection, and suggest that the different methods may really be detecting different signals from the same sequence data. Finally, we find several statistical anomalies associated with the mkprf method, including an almost linear dependence between the number of positively selected genes identified and the prior distributions used. We conclude that interpreting the results produced by this method should be done with some caution.
View details for DOI 10.1111/j.1558-5646.2008.00486.x
View details for Web of Science ID 000261442900004
View details for PubMedID 18752601
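The McDonald-Kreitman test named in the abstract above compares nonsynonymous and synonymous counts among fixed differences (Dn, Ds) and polymorphisms (Pn, Ps) in a 2x2 table. The sketch below is a minimal standard-library version using a two-sided Fisher exact test and the neutrality index; the function names and the example counts are illustrative assumptions, not the implementation or data from the paper.

```python
from math import comb
from typing import Optional, Tuple


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def pmf(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = pmf(a)
    # Sum all tables at least as unlikely as the observed one.
    p = sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def mk_test(dn: int, ds: int, pn: int, ps: int) -> Tuple[Optional[float], float]:
    """Return (neutrality index, p-value). NI = (Pn/Ps) / (Dn/Ds);
    NI < 1 suggests an excess of nonsynonymous fixed differences."""
    ni = (pn * ds) / (ps * dn) if ps and dn else None
    return ni, fisher_exact_two_sided(dn, ds, pn, ps)


# Illustrative counts only.
ni, p_value = mk_test(dn=7, ds=17, pn=2, ps=42)
print(ni, p_value)
```

A neutrality index below 1 together with a small p-value is the pattern usually read as evidence of positive selection; the test relies on polymorphism data, which is the property the abstract contrasts with the divergence-only d(N)/d(S) approach.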
A Bayesian approach to protein inference problem in shotgun proteomics
RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, PROCEEDINGS
2008; 4955: 167-180
View details for Web of Science ID 000254391500012