I am a hybrid computer scientist, statistician and bioinformatician with broad interests in genome sciences and medicine. I have worked extensively on metagenomics, cancer genomics and structural variants. My goal is to push forward the field of genomic sciences, in particular personalized genomics and medicine, by integrating data, technology, computation and statistical modeling. My publications have addressed methodology development in microbial community analysis and, more recently, in structural variant and cancer genome analysis.
Honors & Awards
Travel Fellowship, Alzheimer’s Association International Conference (2016)
Reviewer's Choice Best Abstract, The American Society of Human Genetics Annual Meeting (2015)
Travel Fellowship, Bayer International Computational Biology Workshop (2014)
Dissertation Year Fellowship, University of Southern California (2012)
Merit Fellowship, University of Southern California (2006-2007)
Boards, Advisory Committees, Professional Organizations
Program Committee Co-chair, COMMAND workshop of the IEEE Bioinformatics and Biomedicine Conference 2015 (2015)
Doctor of Philosophy, University of Southern California (2013)
Master of Science, University of Southern California, Los Angeles, US, Statistics (2012)
Master of Science, University of Southern California, Los Angeles, US, Computer Science (2008)
Master of Science, Fudan University, Shanghai, China, Physics (Theoretical Physics) (2006)
Bachelor of Science, Fudan University, Shanghai, China, Electronics Engineering (2003)
Integrated metagenomic data analysis demonstrates that a loss of diversity in oral microbiota is associated with periodontitis.
2017; 18: 1041-?
Periodontitis is an inflammatory disease affecting the tissues supporting teeth (periodontium). Integrative analysis of metagenomic samples from multiple periodontitis studies is a powerful way to examine microbiota diversity and interactions within the host oral cavity. A total of 43 subjects were recruited to participate in two previous studies profiling the microbial community of human subgingival plaque samples using shotgun metagenomic sequencing. We integrated metagenomic sequence data from those two studies, including six healthy controls, 14 sites representative of stable periodontitis, 16 sites representative of progressing periodontitis, and seven periodontal sites of unknown status. We applied phylogenetic diversity, differential abundance, and network analyses, as well as clustering, to the integrated dataset to compare microbiological community profiles among the different disease states. We found alpha-diversity, i.e., mean species diversity in sites or habitats at a local scale, to be the single strongest predictor of subjects' periodontitis status (P < 0.011). More specifically, healthy subjects had the highest alpha-diversity, while subjects with stable sites had the lowest alpha-diversity. From these results, we developed an alpha-diversity logistic model-based naive classifier able to perfectly predict the disease status of the seven subjects with unknown periodontal status (not used in training). Phylogenetic profiling resulted in the discovery of nine marker microbes, and these species are able to differentiate between stable and progressing periodontitis, achieving an accuracy of 94.4%. Finally, we found that the reduction of negatively correlated species is a notable signature of disease progression. Our results consistently show a strong association between the loss of oral microbiota diversity and the progression of periodontitis, suggesting that metagenomics sequencing and phylogenetic profiling are predictive of early periodontitis, leading to potential therapeutic intervention. Our results also support a keystone pathogen-mediated polymicrobial synergy and dysbiosis (PSD) model to explain the etiology of periodontitis. Apart from P. gingivalis, we identified three additional keystone species potentially mediating the progression of periodontitis based on pathogenic characteristics similar to those of known keystone pathogens.
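As a rough illustration of the alpha-diversity comparison described in this abstract, the sketch below computes a per-site diversity index from species counts. The Shannon index is used here only as a simple stand-in (the study itself reports phylogenetic diversity), and the function name and counts are hypothetical, not data or code from the paper.

```python
import math


def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical species counts for two sites (invented for illustration).
even_site = [25, 25, 25, 25]    # four equally abundant taxa
dominated_site = [97, 1, 1, 1]  # one taxon dominates the community

print(shannon_diversity(even_site))       # equals ln(4) for an even community
print(shannon_diversity(dominated_site))  # lower, reflecting lost diversity
```

A score like this could then serve as the single covariate of a logistic classifier, which is the general shape of the predictor the abstract describes.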
View details for DOI 10.1186/s12864-016-3254-5
View details for PubMedID 28198672
View details for PubMedCentralID PMC5310281
A genome-wide approach for detecting novel insertion-deletion variants of mid-range size.
Nucleic acids research
2016; 44 (15)
We present SWAN, a statistical framework for robust detection of genomic structural variants in next-generation sequencing data, together with an analysis of mid-range size insertions and deletions (<10 Kb) for whole genome analysis and DNA mixtures. To identify these mid-range size events, SWAN collectively uses information from read-pair, read-depth and one-end-mapped reads through statistical likelihoods based on Poisson field models. SWAN also uses soft-clip/split read remapping to supplement the likelihood analysis and determine variant boundaries. The accuracy of SWAN is demonstrated by in silico spike-ins and by identification of known variants in the NA12878 genome. We used SWAN to identify a set of novel mid-range insertions and deletions that were confirmed by targeted deep re-sequencing. An R package implementation of SWAN is open source and freely available.
View details for DOI 10.1093/nar/gkw481
View details for PubMedID 27325742
View details for PubMedCentralID PMC5009736
- Statistical significance approximation in local trend analysis of high-throughput time-series data using the theory of Markov chains BMC BIOINFORMATICS 2015; 16
Efficient statistical significance approximation for local similarity analysis of high-throughput time series data
2013; 29 (2): 230-237
Local similarity analysis of biological time series data helps elucidate the varying dynamics of biological systems. However, its applications to large scale high-throughput data are limited by slow permutation procedures for statistical significance evaluation. We developed a theoretical approach to approximate the statistical significance of local similarity analysis based on the approximate tail distribution of the maximum partial sum of independent identically distributed (i.i.d.) random variables. Simulations show that the derived formula approximates the tail distribution reasonably well (starting at time points > 10 with no delay and > 20 with delay) and provides P-values comparable with those from permutations. The new approach enables efficient calculation of statistical significance for pairwise local similarity analysis, making possible all-to-all local association studies otherwise prohibitive. As a demonstration, local similarity analysis of human microbiome time series shows that core operational taxonomic units (OTUs) are highly synergetic and some of the associations are body-site specific across samples. The new approach is implemented in our eLSA package, which now provides pipelines for faster local similarity analysis of time series data. The tool is freely available from eLSA's website: http://meta.usc.edu/softs/lsa. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/bts668
View details for Web of Science ID 000313722800011
View details for PubMedID 23178636
Extended local similarity analysis (eLSA) of microbial community and other time series data with replicates
BMC SYSTEMS BIOLOGY
The increasing availability of time series microbial community data from metagenomics and other molecular biological studies has enabled the analysis of large-scale microbial co-occurrence and association networks. Among the many analytical techniques available, the Local Similarity Analysis (LSA) method is unique in that it captures local and potentially time-delayed co-occurrence and association patterns in time series data that cannot otherwise be identified by ordinary correlation analysis. However, LSA, as originally developed, does not consider time series data with replicates, which hinders the full exploitation of available information. With replicates, it is possible to understand the variability of the local similarity (LS) score and to obtain its confidence interval. We extended our LSA technique to time series data with replicates and termed it extended LSA, or eLSA. Simulations showed the capability of eLSA to capture subinterval and time-delayed associations. We implemented the eLSA technique in an easy-to-use analytic software package. The software pipeline integrates data normalization, statistical correlation calculation, statistical significance evaluation, and association network construction steps. We applied the eLSA technique to microbial community and gene expression datasets, where unique time-dependent associations were identified. The extended LSA analysis technique was demonstrated to reveal statistically significant local and potentially time-delayed association patterns in replicated time series data beyond that of ordinary correlation analysis. These statistically significant associations can provide insights into the real dynamics of biological systems. The newly designed eLSA software efficiently streamlines the analysis and is freely available from the eLSA homepage, which can be accessed at http://meta.usc.edu/softs/lsa.
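To make the idea of a local, possibly time-delayed association concrete, here is a minimal sketch of the dynamic programming commonly used to compute a local similarity score between two series. It is not the eLSA package's code: the z-score normalization, the function names, and the example series are assumptions for illustration, and replicates and significance evaluation are not handled.

```python
import statistics


def zscore(series):
    """Standardize a series to zero mean and unit (population) variance."""
    mean = statistics.fmean(series)
    sd = statistics.pstdev(series)
    if sd == 0:
        raise ValueError("series must not be constant")
    return [(v - mean) / sd for v in series]


def local_similarity(x, y, max_delay=0):
    """Signed local similarity score in [-1, 1] with |i - j| <= max_delay."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    n = len(x)
    xs, ys = zscore(x), zscore(y)
    # pos/neg hold the best positive/negative partial sums ending at (i, j).
    pos = [[0.0] * (n + 1) for _ in range(n + 1)]
    neg = [[0.0] * (n + 1) for _ in range(n + 1)]
    best_pos = best_neg = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - max_delay), min(n, i + max_delay) + 1):
            prod = xs[i - 1] * ys[j - 1]
            pos[i][j] = max(0.0, pos[i - 1][j - 1] + prod)
            neg[i][j] = max(0.0, neg[i - 1][j - 1] - prod)
            best_pos = max(best_pos, pos[i][j])
            best_neg = max(best_neg, neg[i][j])
    score = max(best_pos, best_neg) / n
    return score if best_pos >= best_neg else -score


# Hypothetical series: y repeats the pulse in x two time points later.
x = [0, 0, 1, 5, 1, 0, 0, 0]
y = [0, 0, 0, 0, 1, 5, 1, 0]
print(local_similarity(x, y, max_delay=0))  # weak without allowing a delay
print(local_similarity(x, y, max_delay=2))  # strong once a delay is allowed
```

Because the score is the maximum over all aligned subintervals, it can pick up an association that holds only during part of the series, which a global correlation coefficient would dilute.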
View details for DOI 10.1186/1752-0509-5-S2-S15
View details for Web of Science ID 000301987000015
View details for PubMedID 22784572
Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads
2011; 6 (12)
Accurate estimation of microbial community composition based on metagenomic sequencing data is fundamental for subsequent metagenomics analysis. Prevalent estimation methods are mainly based on directly summarizing alignment results or variants thereof, and often yield biased and/or unstable estimates. We have developed a unified probabilistic framework (named GRAMMy) by explicitly modeling read assignment ambiguities, genome size biases and read distributions along the genomes. A maximum likelihood method is employed to compute Genome Relative Abundance of microbial communities using the Mixture Model theory (GRAMMy). GRAMMy has been demonstrated to give estimates that are accurate and robust across both simulated and real read benchmark datasets. We applied GRAMMy to a collection of 34 metagenomic read sets from four metagenomics projects and identified 99 frequent species (minimally 0.5% abundant in at least 50% of the data-sets) in the human gut samples. Our results show substantial improvements over previous studies, such as adjusting the over-estimated abundance for Bacteroides species for human gut samples, by providing a new reference-based strategy for metagenomic sample comparisons. GRAMMy can be used flexibly with many read assignment tools (mapping, alignment or composition-based) even with low-sensitivity mapping results from huge short-read datasets. It will be increasingly useful as an accurate and robust tool for abundance estimation with the growing size of read sets and the expanding database of reference genomes.
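The mixture-model idea in this abstract can be sketched with a small expectation-maximization (EM) loop. This is an illustrative sketch in the spirit of the approach, not GRAMMy's implementation: the function name, the input layout (one row of per-genome read likelihoods per read), the fixed iteration count, and the toy numbers are all assumptions.

```python
def estimate_relative_abundance(read_probs, genome_lengths, iterations=100):
    """EM for mixing proportions, then a genome-length correction.

    read_probs[r][g] is an assumed likelihood of read r under genome g
    (0.0 when the read does not map to that genome).
    """
    n_genomes = len(genome_lengths)
    n_reads = len(read_probs)
    if n_reads == 0:
        raise ValueError("at least one read is required")
    mix = [1.0 / n_genomes] * n_genomes  # initial mixing proportions
    for _ in range(iterations):
        expected = [0.0] * n_genomes
        for probs in read_probs:
            weights = [m * p for m, p in zip(mix, probs)]
            total = sum(weights)
            if total == 0:
                raise ValueError("read has zero likelihood under every genome")
            # E-step: split the ambiguous read across genomes by responsibility.
            for g, w in enumerate(weights):
                expected[g] += w / total
        # M-step: mixing proportion is the expected share of reads.
        mix = [e / n_reads for e in expected]
    # Longer genomes yield more reads per cell, so divide by length.
    per_length = [m / length for m, length in zip(mix, genome_lengths)]
    norm = sum(per_length)
    return [v / norm for v in per_length]


# Hypothetical input: 10 reads unique to genome A, 20 unique to genome B,
# and 6 reads that map equally well to both.
reads = [[1.0, 0.0]] * 10 + [[0.0, 1.0]] * 20 + [[0.5, 0.5]] * 6
print(estimate_relative_abundance(reads, genome_lengths=[1000, 2000]))
```

In this toy input genome B attracts twice as many reads but is also twice as long, so the length-corrected abundances come out equal, which is the kind of genome-size bias that direct read counting would miss.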
View details for DOI 10.1371/journal.pone.0027992
View details for Web of Science ID 000298173500008
View details for PubMedID 22162995
Phase transition in sequence unique reconstruction
JOURNAL OF SYSTEMS SCIENCE & COMPLEXITY
2007; 20 (1): 18-29
View details for Web of Science ID 000254940400002
CRISPR-Cas9-targeted fragmentation and selective sequencing enable massively parallel microsatellite analysis
Microsatellites are multi-allelic and composed of short tandem repeats (STRs) with individual motifs composed of mononucleotides, dinucleotides or higher including hexamers. Next-generation sequencing approaches and other STR assays rely on a limited number of PCR amplicons, typically in the tens. Here, we demonstrate STR-Seq, a next-generation sequencing technology that analyses over 2,000 STRs in parallel, and provides the accurate genotyping of microsatellites. STR-Seq employs in vitro CRISPR-Cas9-targeted fragmentation to produce specific DNA molecules covering the complete microsatellite sequence. Amplification-free library preparation provides single molecule sequences without unique molecular barcodes. STR-selective primers enable massively parallel, targeted sequencing of large STR sets. Overall, STR-Seq has higher throughput, improved accuracy and provides a greater number of informative haplotypes compared with other microsatellite analysis approaches. With these new features, STR-Seq can identify a 0.1% minor genome fraction in a DNA mixture composed of different, unrelated samples.
View details for DOI 10.1038/ncomms14291
View details for Web of Science ID 000393379700001
View details for PubMedID 28169275
View details for PubMedCentralID PMC5309709
Correlation detection strategies in microbial data sets vary widely in sensitivity and precision
2016; 10 (7): 1669-1681
Disruption of healthy microbial communities has been linked to numerous diseases, yet microbial interactions are little understood. This is due in part to the large number of bacteria, and the much larger number of interactions (easily in the millions), making experimental investigation very difficult at best and necessitating the nascent field of computational exploration through microbial correlation networks. We benchmark the performance of eight correlation techniques on simulated and real data in response to challenges specific to microbiome studies: fractional sampling of ribosomal RNA sequences, uneven sampling depths, rare microbes and a high proportion of zero counts. Also tested is the ability to distinguish signals from noise, and detect a range of ecological and time-series relationships. Finally, we provide specific recommendations for correlation technique usage. Although some methods perform better than others, there is still considerable need for improvement in current techniques.
View details for DOI 10.1038/ismej.2015.235
View details for Web of Science ID 000378292100011
View details for PubMedID 26905627
View details for PubMedCentralID PMC4918442
- SCAN STATISTICS ON POISSON RANDOM FIELDS WITH APPLICATIONS IN GENOMICS ANNALS OF APPLIED STATISTICS 2016; 10 (2): 726-755
- Pan-cancer analysis of the extent and consequences of intratumor heterogeneity NATURE MEDICINE 2016; 22 (1): 105-?
Cross-depth analysis of marine bacterial networks suggests downward propagation of temporal changes
2015; 9 (12): 2573-2586
Interactions among microbes and stratification across depths are both believed to be important drivers of microbial communities, though little is known about how microbial associations differ between and across depths. We have monitored the free-living microbial community at the San Pedro Ocean Time-series station, monthly, for a decade, at five different depths: 5 m, the deep chlorophyll maximum layer, 150 m, 500 m and 890 m (just above the sea floor). Here, we introduce microbial association networks that combine data from multiple ocean depths to investigate both within- and between-depth relationships, sometimes time-lagged, among microbes and environmental parameters. The euphotic zone, deep chlorophyll maximum and 890 m depth each contain two negatively correlated 'modules' (groups of many inter-correlated bacteria and environmental conditions) suggesting regular transitions between two contrasting environmental states. Two-thirds of pairwise correlations of bacterial taxa between depths lagged such that changes in the abundance of deeper organisms followed changes in shallower organisms. Taken in conjunction with previous observations of seasonality at 890 m, these trends suggest that planktonic microbial communities throughout the water column are linked to environmental conditions and/or microbial communities in overlying waters. Poorly understood groups including Marine Group A, Nitrospina and AEGEAN-169 clades contained taxa that showed diverse association patterns, suggesting these groups contain multiple ecological species, each shaped by different factors, which we have started to delineate. These observations build upon previous work at this location, lending further credence to the hypothesis that sinking particles and vertically migrating animals transport materials that significantly shape the time-varying patterns of microbial community composition.
View details for DOI 10.1038/ismej.2015.76
View details for Web of Science ID 000365094400004
View details for PubMedID 25989373
Emergence of Hemagglutinin Mutations During the Course of Influenza Infection.
2015; 5: 16178-?
Influenza remains a significant cause of disease mortality. The ongoing threat of influenza infection is partly attributable to the emergence of new mutations in the influenza genome. Among the influenza viral gene products, the hemagglutinin (HA) glycoprotein plays a critical role in influenza pathogenesis, is the target for vaccines and accumulates new mutations that may alter the efficacy of immunization. To study the emergence of HA mutations during the course of infection, we employed a deep-targeted sequencing method. We used samples from 17 patients with active H1N1 or H3N2 influenza infections. These patients were not treated with antivirals. In addition, we had samples from five patients who were analyzed longitudinally. Thus, we determined the quantitative changes in the fractional representation of HA mutations during the course of infection. Across individuals in the study, a series of novel HA mutations that directly altered the HA coding sequence were identified. Serial viral sampling revealed HA mutations that either were stable, expanded or were reduced in representation during the course of the infection. Overall, we demonstrated the emergence of unique mutations specific to an infected individual and temporal genetic variation during infection.
View details for DOI 10.1038/srep16178
View details for PubMedID 26538451
View details for PubMedCentralID PMC4633648
Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads
Encyclopedia of Metagenomics: Genes, Genomes and Metagenomes. Basics, Methods, Databases and Tools
edited by Nelson, K.
View details for DOI 10.1007/978-1-4614-6418-1_723-4
Extended Local Similarity Analysis (eLSA) of Biological Data
Encyclopedia of Metagenomics: Genes, Genomes and Metagenomes. Basics, Methods, Databases and Tools
edited by Nelson, K.
View details for DOI 10.1007/978-1-4614-6418-1_722-5
- A Quantitative Evaluation of Health Care System in US, China, and Sweden Health Med 2013; 7 (4)
Genetic analysis of differentiation of T-helper lymphocytes
GENETICS AND MOLECULAR RESEARCH
2013; 12 (2): 972-987
In the human immune system, T-helper cells are able to differentiate into two lymphocyte subsets: Th1 and Th2. The intracellular signaling pathways of differentiation form a dynamic regulation network by secreting distinctive types of cytokines, while differentiation is regulated by two major gene loci: T-bet and GATA-3. We developed a system dynamics model to simulate the differentiation and re-differentiation process of T-helper cells, based on gene expression levels of T-bet and GATA-3 during differentiation of these cells. We arrived at three ultimate states of the model and came to the conclusion that cell differentiation potential exists as long as the system dynamics is at an unstable equilibrium point; the T-helper cells will no longer have the potential of differentiation when the model reaches a stable equilibrium point. In addition, the time lag caused by expression of transcription factors can lead to oscillations in the secretion of cytokines during differentiation.
View details for DOI 10.4238/2013.April.2.13
View details for Web of Science ID 000320030100011
View details for PubMedID 23613243
Marine bacterial, archaeal and protistan association networks reveal ecological linkages
2011; 5 (9): 1414-1425
Microbes have central roles in ocean food webs and global biogeochemical processes, yet specific ecological relationships among these taxa are largely unknown. This is in part due to the dilute, microscopic nature of the planktonic microbial community, which prevents direct observation of their interactions. Here, we use a holistic (that is, microbial system-wide) approach to investigate time-dependent variations among taxa from all three domains of life in a marine microbial community. We investigated the community composition of bacteria, archaea and protists through cultivation-independent methods, along with total bacterial and viral abundance, and physico-chemical observations. Samples and observations were collected monthly over 3 years at a well-described ocean time-series site of southern California. To find associations among these organisms, we calculated time-dependent rank correlations (that is, local similarity correlations) among relative abundances of bacteria, archaea, protists, total abundance of bacteria and viruses and physico-chemical parameters. We used a network generated from these statistical correlations to visualize and identify time-dependent associations among ecologically important taxa, for example, the SAR11 cluster, stramenopiles, alveolates, cyanobacteria and ammonia-oxidizing archaea. Negative correlations, perhaps suggesting competition or predation, were also common. The analysis revealed a progression of microbial communities through time, and also a group of unknown eukaryotes that were highly correlated with dinoflagellates, indicating possible symbioses or parasitism. Possible 'keystone' species were evident. The network has statistical features similar to previously described ecological networks, and in network parlance has non-random, small world properties (that is, highly interconnected nodes). This approach provides new insights into the natural history of microbes.
View details for DOI 10.1038/ismej.2011.24
View details for Web of Science ID 000295782900003
View details for PubMedID 21430787
PPLook: an automated data mining tool for protein-protein interaction
Extracting and visualizing protein-protein interactions (PPI) from the text literature is a meaningful topic in protein science. It assists the identification of interactions among proteins. There is a lack of tools to extract PPI and to visualize and classify the results. We developed a PPI search system, termed PPLook, which automatically extracts and visualizes protein-protein interactions (PPI) from text. Given a query protein name, PPLook can search a dataset for other proteins interacting with it by using a keywords dictionary pattern-matching algorithm, and display the topological parameters, such as the number of nodes, edges, and connected components. The visualization component of PPLook enables us to view the interaction relationships among the proteins in a three-dimensional space based on the OpenGL graphics interface technology. PPLook can also select a protein semantic class, count the proteins of that semantic class that interact with the query protein, and count the articles in which an interaction involving the query protein appears. Moreover, PPLook provides heterogeneous search and a user-friendly graphical interface. PPLook is an effective tool for biologists and biosystem developers who need to access PPI information from the literature. PPLook is freely available for non-commercial users at http://meta.usc.edu/softs/PPLook.
View details for DOI 10.1186/1471-2105-11-326
View details for Web of Science ID 000280331700002
View details for PubMedID 20550717
Oligonucleotide profiling for discriminating bacteria in bacterial communities
COMBINATORIAL CHEMISTRY & HIGH THROUGHPUT SCREENING
2007; 10 (4): 247-255
Based on the relative ratios of di- and tri-nucleotides in the DNA sequences, the profiles of 164 genome sequences from 152 representative microbial organisms were computed. By comparing the profiles of the genomes and their substrings of length 500 bp, the fluctuations of the relative abundances of di- and tri-nucleotides of these genomic sequences were analyzed. A new method to discriminate the origins of orphan DNA sequences was proposed, and the origins of 17 uncultured bacterium sequences from a bacterial community in the human gut were postulated and discussed.
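The profiling idea can be sketched for the dinucleotide case as follows. The abstract does not spell out its exact ratio, so this sketch assumes the widely used observed-over-expected odds ratio f(xy) / (f(x) f(y)); the function names, the distance measure, and the toy sequences are likewise assumptions rather than the paper's method.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_profile(seq):
    """Map each dinucleotide to its observed/expected frequency ratio."""
    seq = seq.upper()
    mono = Counter(c for c in seq if c in BASES)
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    di = Counter(p for p in pairs if set(p) <= set(BASES))
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence needs at least one valid dinucleotide")
    profile = {}
    for a, b in product(BASES, repeat=2):
        expected = (mono[a] / n_mono) * (mono[b] / n_mono)
        observed = di[a + b] / n_di
        profile[a + b] = observed / expected if expected > 0 else 0.0
    return profile


def profile_distance(p, q):
    """Mean absolute difference between two profiles over all 16 keys."""
    return sum(abs(p[k] - q[k]) for k in p) / len(p)


# Hypothetical sequences: assign a fragment to the closer reference profile.
reference_a = dinucleotide_profile("ACGTACGTACGTACGTACGT")
reference_b = dinucleotide_profile("AATTAATTGGCCGGCCAATT")
fragment = dinucleotide_profile("ACGTACGTACGT")
print(profile_distance(fragment, reference_a))
print(profile_distance(fragment, reference_b))
```

Comparing a fragment's profile against each reference profile and choosing the smallest distance is one simple way to guess the origin of an orphan sequence; real fragments would need to be much longer than these toy strings for the ratios to be stable.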
View details for Web of Science ID 000247022400002
View details for PubMedID 17506707