Professional Education

  • Doctor of Philosophy, Indiana University (2011)
  • Bachelor of Science, Peking University (2003)

All Publications

  • Assessment of the Radiation Effects of Cardiac CT Angiography Using Protein and Genetic Biomarkers JACC-CARDIOVASCULAR IMAGING Nguyen, P. K., Lee, W. H., Li, Y. F., Hong, W. X., Hu, S., Chan, C., Liang, G., Nguyen, I., Ong, S., Churko, J., Wang, J., Altman, R. B., Fleischmann, D., Wu, J. C. 2015; 8 (8): 873-884
  • Computational approaches to protein inference in shotgun proteomics BMC BIOINFORMATICS Li, Y. F., Radivojac, P. 2012; 13


    Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high-throughput nature, shotgun proteomics faces challenges with respect to the analysis and interpretation of experimental data. Among such challenges, the identification of proteins present in a sample has been recognized as an important computational task. This task generally consists of (1) assigning experimental tandem mass spectra to peptides derived from a protein database, and (2) mapping assigned peptides to proteins and quantifying the confidence of identified proteins. Protein identification is fundamentally a statistical inference problem, and a number of methods have been proposed to address its challenges. In this review we categorize current approaches into rule-based, combinatorial optimization and probabilistic inference techniques, and present them using integer programming and Bayesian inference frameworks. We also discuss the main challenges of protein identification and propose potential solutions with the goal of spurring innovative research in this area.

    View details for DOI 10.1186/1471-2105-13-S16-S4

    View details for Web of Science ID 000312714500004

    View details for PubMedID 23176300
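
    The combinatorial-optimization view of step (2) above can be illustrated with a small sketch. It treats protein inference as a set-cover problem: choose a small set of proteins whose peptides explain every identified peptide. This is a generic greedy approximation, not the formulation from the review; the function name, the protein labels, and the peptide labels are all hypothetical.

```python
def greedy_protein_cover(protein_to_peptides, identified_peptides):
    """Greedy set-cover sketch: pick proteins until every identified
    peptide that maps to some protein is explained.

    protein_to_peptides: dict mapping protein name -> set of peptide names
    identified_peptides: iterable of peptide names observed in the sample
    Returns (chosen proteins in pick order, peptides left unexplained).
    """
    uncovered = set(identified_peptides)
    chosen = []
    while uncovered:
        # Protein explaining the most still-uncovered peptides;
        # sorting first makes ties resolve alphabetically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & uncovered),
        )
        gain = protein_to_peptides[best] & uncovered
        if not gain:
            break  # remaining peptides map to no protein in the database
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical toy database: protein B shares all of its peptides with A.
toy_db = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3"},
    "C": {"p4"},
}
chosen, unexplained = greedy_protein_cover(toy_db, ["p1", "p2", "p3", "p4"])
print(chosen, unexplained)
```

    In this toy case protein B is never selected, because A already explains its peptides. Shared (degenerate) peptides of this kind are what make the mapping from peptides to proteins ambiguous; the greedy rule does not guarantee a minimum cover and assigns no confidence to the proteins it selects.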

  • Protein identification problem from a Bayesian point of view. Statistics and its interface Li, Y. F., Arnold, R. J., Radivojac, P., Tang, H. 2012; 5 (1): 21-37


    We present a generic Bayesian framework for peptide and protein identification in proteomics, and provide a unified interpretation of the database searching and de novo peptide sequencing approaches that are used in peptide identification. We describe several probabilistic graphical models and a variety of prior distributions that can be incorporated into the Bayesian framework to model different types of prior information, such as the known protein sequences, the known protein abundances, the peptide precursor masses, the estimated peptide retention time and the peptide detectabilities. Various applications of the Bayesian framework are discussed theoretically, including its application to the identification of peptides containing mutations and post-translational modifications.

    View details for DOI 10.4310/SII.2012.v5.n1.a3

    View details for PubMedID 24761189

  • The Importance of Peptide Detectability for Protein Identification, Quantification, and Experiment Design in MS/MS Proteomics JOURNAL OF PROTEOME RESEARCH Li, Y. F., Arnold, R. J., Tang, H., Radivojac, P. 2010; 9 (12): 6288-6297


    Peptide detectability is defined as the probability that a peptide is identified in an LC-MS/MS experiment and has been useful in providing solutions to protein inference and label-free quantification. Previously, predictors for peptide detectability trained on standard or complex samples were proposed. Although the models trained on complex samples may benefit from the large training data sets, it is unclear to what extent they are affected by the unequal abundances of identified proteins. To address this challenge and improve detectability prediction, we present a new algorithm for the iterative learning of peptide detectability from complex mixtures. We provide evidence that the new method approximates detectability with useful accuracy and, based on its design, can be used to interpret the outcome of other learning strategies. We studied the properties of peptides from the bacterium Deinococcus radiodurans and found that at standard quantities, its tryptic peptides can be roughly classified as either detectable or undetectable, with a relatively small fraction having medium detectability. We extend the concept of detectability from peptides to proteins and apply the model to predict the behavior of a replicate LC-MS/MS experiment from a single analysis. Finally, our study summarizes a theoretical framework for peptide/protein identification and label-free quantification.

    View details for DOI 10.1021/pr1005586

    View details for Web of Science ID 000284856200018

    View details for PubMedID 21067214

  • Structure-based kernels for the prediction of catalytic residues and their involvement in human inherited disease BIOINFORMATICS Xin, F., Myers, S., Li, Y. F., Cooper, D. N., Mooney, S. D., Radivojac, P. 2010; 26 (16): 1975-1982


    Enzyme catalysis is involved in numerous biological processes and the disruption of enzymatic activity has been implicated in human disease. Despite this, various aspects of catalytic reactions are not completely understood, such as the mechanics of reaction chemistry and the geometry of catalytic residues within active sites. As a result, the computational prediction of catalytic residues has the potential to identify novel catalytic pockets, aid in the design of more efficient enzymes and also predict the molecular basis of disease. We propose a new kernel-based algorithm for the prediction of catalytic residues based on protein sequence, structure and evolutionary information. The method relies upon explicit modeling of similarity between residue-centered neighborhoods in protein structures. We present evidence that this algorithm evaluates favorably against established approaches, and also provides insights into the relative importance of the geometry, physicochemical properties and evolutionary conservation of catalytic residue activity. The new algorithm was used to identify known mutations associated with inherited disease whose molecular mechanism might be predicted to operate specifically through the loss or gain of catalytic residues. It should, therefore, provide a viable approach to identifying the molecular basis of disease in which the loss or gain of function is not caused solely by the disruption of protein stability. Our analysis suggests that both mechanisms are actively involved in human inherited disease. Source code for the structural kernel is available at

    View details for DOI 10.1093/bioinformatics/btq319

    View details for Web of Science ID 000280703500008

    View details for PubMedID 20551136

  • Combinatorial Libraries of Synthetic Peptides as a Model for Shotgun Proteomics ANALYTICAL CHEMISTRY Bohrer, B. C., Li, Y. F., Reilly, J. P., Clemmer, D. E., DiMarchi, R. D., Radivojac, P., Tang, H., Arnold, R. J. 2010; 82 (15): 6559-6568


    A synthetic approach to model the analytical complexity of biological proteolytic digests has been developed. Combinatorial peptide libraries ranging in length between 9 and 12 amino acids that represent typical tryptic digests were designed, synthesized, and analyzed. Individual libraries and mixtures thereof were studied by replicate liquid chromatography-ion trap mass spectrometry and compared to a tryptic digest of Deinococcus radiodurans. Similar to complex proteome analysis, replicate study of individual libraries identified additional unique peptides. Fewer novel sequences were revealed with each additional analysis in a manner similar to that observed for biological data. Our results demonstrate a bimodal distribution of peptides sorting to either very low or very high levels of detection. Upon mixing of libraries at equal abundance, a length-dependent bias in favor of longer sequence identification was observed. Peptide identification as a function of site-specific amino acid content was characterized with certain amino acids proving to be of considerable importance. This report demonstrates that peptide libraries of defined character can serve as a reference for instrument characterization. Furthermore, they are uniquely suited to delineate the physical properties that influence identification of peptides, which provides a foundation for optimizing the study of samples with less defined heterogeneity.

    View details for DOI 10.1021/ac100910a

    View details for Web of Science ID 000280401400036

    View details for PubMedID 20669997

  • A Bayesian Approach to Protein Inference Problem in Shotgun Proteomics JOURNAL OF COMPUTATIONAL BIOLOGY Li, Y. F., Arnold, R. J., Li, Y., Radivojac, P., Sheng, Q., Tang, H. 2009; 16 (8): 1183-1193


    The protein inference problem represents a major challenge in shotgun proteomics. In this article, we describe a novel Bayesian approach to address this challenge by incorporating the predicted peptide detectabilities as the prior probabilities of peptide identification. We propose a rigorous probabilistic model for protein inference and provide practical algorithmic solutions to this problem. We used a complex synthetic protein mixture to test our method and obtained promising results.

    View details for DOI 10.1089/cmb.2009.0018

    View details for Web of Science ID 000269639100015

    View details for PubMedID 19645593

  • "REVERSE ECOLOGY" AND THE POWER OF POPULATION GENOMICS EVOLUTION Li, Y. F., Costello, J. C., Holloway, A. K., Hahn, M. W. 2008; 62 (12): 2984-2994


    Rapid and inexpensive sequencing technologies are making it possible to collect whole genome sequence data on multiple individuals from a population. This type of data can be used to quickly identify genes that control important ecological and evolutionary phenotypes by finding the targets of adaptive natural selection, and we therefore refer to such approaches as "reverse ecology." To quantify the power gained in detecting positive selection using population genomic data, we compare three statistical methods for identifying targets of selection: the McDonald-Kreitman test, the mkprf method, and a likelihood implementation for detecting d(N)/d(S) > 1. Because the first two methods use polymorphism data we expect them to have more power to detect selection. However, when applied to population genomic datasets from human, fly, and yeast, the tests using polymorphism data were actually weaker in two of the three datasets. We explore reasons why the simpler comparative method has identified more genes under selection, and suggest that the different methods may really be detecting different signals from the same sequence data. Finally, we find several statistical anomalies associated with the mkprf method, including an almost linear dependence between the number of positively selected genes identified and the prior distributions used. We conclude that interpreting the results produced by this method should be done with some caution.

    View details for DOI 10.1111/j.1558-5646.2008.00486.x

    View details for Web of Science ID 000261442900004

    View details for PubMedID 18752601
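
    The McDonald-Kreitman test named in the abstract above compares nonsynonymous and synonymous counts for fixed differences between species (Dn, Ds) and for polymorphisms within a species (Pn, Ps). The sketch below is a minimal standard-library illustration of the standard test, not the code used in the paper. The function names and the example counts are assumptions chosen only for demonstration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """Return (neutrality index, alpha, Fisher p-value) for MK counts."""
    if dn == 0 or ps == 0:
        raise ValueError("neutrality index is undefined when Dn or Ps is zero")
    ni = (pn / ps) / (dn / ds)   # NI < 1: excess of fixed nonsynonymous changes
    alpha = 1.0 - ni             # estimated fraction of adaptive substitutions
    return ni, alpha, fisher_exact_two_sided(dn, ds, pn, ps)


# Illustrative counts (assumed for this example, not taken from the paper).
ni, alpha, p_value = mk_test(dn=7, ds=17, pn=2, ps=42)
print(f"NI = {ni:.3f}, alpha = {alpha:.3f}, p = {p_value:.4f}")
```

    A neutrality index below 1 with a small p-value is usually read as evidence of positive selection. The d(N)/d(S) > 1 criterion mentioned in the abstract uses divergence data only, which is one way the two approaches can respond to different signals in the same sequences.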