I am a hybrid computer scientist, statistician and bioinformatician with broad interests in genome sciences and medicine. I have worked extensively on metagenomics, cancer genomics and structural variants. My goal is to advance genomic sciences, in particular personalized genomics and medicine, by integrating data, technology, computation and statistical modeling. My publications address methodology development in microbial community analysis and, more recently, in structural variant and cancer genome analysis.
Instructor, Medicine - Oncology
Honors & Awards
Postdoc Fellowship, American Cancer Society (2019)
Scholar-In-Training Award, American Association for Cancer Research (2018)
Travel Fellowship, Alzheimer’s Association International Conference (2016)
Reviewer's Choice Best Abstract, The American Society of Human Genetics Annual Meeting (2015)
Travel Fellowship, Bayer International Computational Biology Workshop (2014)
Dissertation Year Fellowship, University of Southern California (2012)
Merit Fellowship, University of Southern California (2006-2007)
Boards, Advisory Committees, Professional Organizations
Program Committee Co-chair, COMMAND workshop of the IEEE Bioinformatics and Biomedicine Conference 2015 (2015)
Doctor of Philosophy, University of Southern California, Los Angeles, US, Bioinformatics and Computational Biology (2013)
Master of Science, University of Southern California, Los Angeles, US, Statistics (2012)
Master of Science, University of Southern California, Los Angeles, US, Computer Science (2008)
Master of Science, Fudan University, Shanghai, China, Physics (Theoretical Physics) (2006)
Bachelor of Science, Fudan University, Shanghai, China, Electronics Engineering (2003)
SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution.
Background: Simulating genome sequence data with variant features facilitates the development and benchmarking of structural variant analysis programs. However, there are only a few data simulators that provide structural variants in silico and even fewer that provide variants with different allelic fractions and haplotypes. Findings: We developed SVEngine, an open source tool to address this need. SVEngine simulates next generation sequencing data with embedded structural variations. As input, SVEngine takes template haploid sequences (FASTA) and an external variant file, a variant distribution file and/or a clonal phylogeny tree file (NEWICK). Subsequently, it simulates and outputs sequence contigs (FASTAs), sequence reads (FASTQs) and/or post-alignment files (BAMs). All of the files contain the desired variants, along with BED files containing the ground truth. SVEngine's flexible design process enables one to specify size, position, and allelic fraction for deletions, insertions, duplications, inversions and translocations. Finally, SVEngine simulates sequence data that replicates the characteristics of a sequencing library with mixed sizes of DNA insert molecules. SVEngine is highly parallelized to reduce the simulation time. Conclusions: We demonstrated the versatile features of SVEngine and its improved runtime compared with other available simulators. SVEngine's features include the simulation of locus-specific variant frequency designed to mimic the phylogeny of cancer clonal evolution. We validated SVEngine's accuracy by simulating genome-wide structural variants of NA12878 and a heterogeneous cancer genome. Our evaluation included checking various sequencing mapping features such as coverage change, read clipping, insert size shift and neighbouring hanging read pairs for representative variant types.
Structural variant callers Lumpy and Manta and tumor heterogeneity estimator THetA2 were able to perform realistically on the simulated data. SVEngine is implemented as a standard Python package and is freely available for academic use at: https://bitbucket.org/charade/svengine.
View details for DOI 10.1093/gigascience/giy081
View details for PubMedID 29982625
Identification of large rearrangements in cancer genomes with barcode linked reads.
Nucleic acids research
2018; 46 (4): e19
Large genomic rearrangements involve inversions, deletions and other structural changes that span megabase segments of the human genome. This category of genetic aberration is the cause of many hereditary genetic disorders and contributes to the pathogenesis of diseases like cancer. We developed a new algorithm called ZoomX for analysing barcode-linked sequence reads; these sequences can be traced to individual high molecular weight DNA molecules (>50 kb). To generate barcode-linked sequence reads, we employ a library preparation technology (10X Genomics) that uses droplets to partition and barcode DNA molecules. Using linked read data from whole genome sequencing, we identify large genomic rearrangements, typically greater than 200 kb, even when they are only present in low allelic fractions. Our algorithm uses a Poisson scan statistic to identify genomic rearrangement junctions, determines counts of junction-spanning molecules and applies Fisher's exact test to determine the statistical significance of somatic aberrations. Utilizing a well-characterized human genome, we benchmarked this approach to accurately identify large rearrangements. Subsequently, we demonstrated that our algorithm identifies somatic rearrangements when present in lower allelic fractions, as occurs in tumors. We characterized a set of complex cancer rearrangements with multiple classes of structural aberrations and with possible roles in oncogenesis.
View details for DOI 10.1093/nar/gkx1193
View details for PubMedID 29186506
View details for PubMedCentralID PMC5829571
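The ZoomX abstract above mentions Fisher's exact test on counts of junction-spanning molecules. The following is a minimal sketch of a one-sided test on a 2x2 table, not the ZoomX implementation: the function name `fisher_exact_greater`, the table layout (tumor versus normal, spanning versus non-spanning molecules) and the example counts are all assumptions made for illustration.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Hypothetical layout: row 1 = tumor, row 2 = matched normal;
    column 1 = molecules spanning the junction, column 2 = molecules that do not.
    Returns P(X >= a) under the hypergeometric null with all margins fixed.
    """
    row1 = a + b           # molecules observed in the tumor sample
    col1 = a + c           # junction-spanning molecules in both samples
    total = a + b + c + d
    denom = comb(total, row1)
    p_value = 0.0
    # Sum the probabilities of tables at least as extreme as the observed one.
    for k in range(a, min(row1, col1) + 1):
        p_value += comb(col1, k) * comb(total - col1, row1 - k) / denom
    return p_value


# Illustrative counts only: 8 of 100 tumor molecules span the junction,
# 0 of 100 normal molecules do.
p = fisher_exact_greater(8, 92, 0, 100)
print(f"one-sided p-value: {p:.4g}")
```

A small p-value here suggests that the junction is enriched in the tumor relative to the normal sample, which is the sense in which such a test can flag a candidate somatic event.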
CoreProbe: A Novel Algorithm for Estimating Relative Abundance Based on Metagenomic Reads.
2018; 9 (6)
With the rapid development of high-throughput sequencing technology, the analysis of metagenomic sequencing data and the accurate and efficient estimation of relative microbial abundance have become important ways to explore the composition and function of microbial communities. The accuracy and efficiency of relative microbial abundance estimation are closely related to the algorithm and to the selection of the reference sequence for sequence alignment. We introduced the microbial core genome as the reference sequence for potential microbes in a metagenomic sample, constructed finite mixture and latent Dirichlet models, and used the Gibbs sampling algorithm to estimate the relative abundance of microorganisms. The simulation results showed that our approach can improve efficiency while maintaining high accuracy and is more suitable for high-throughput metagenomic data. The new approach was implemented in our CoreProbe package, which provides a pipeline for accurate and efficient estimation of the relative abundance of microbes in a community. This tool is available free of charge from the CoreProbe website. The Docker image can be obtained with the following instruction: sudo docker pull panhongfei/coreprobe:1.0.
View details for DOI 10.3390/genes9060313
View details for PubMedID 29925824
View details for PubMedCentralID PMC6027520
Integrated metagenomic data analysis demonstrates that a loss of diversity in oral microbiota is associated with periodontitis.
2017; 18: 1041-?
Periodontitis is an inflammatory disease affecting the tissues supporting teeth (periodontium). Integrative analysis of metagenomic samples from multiple periodontitis studies is a powerful way to examine microbiota diversity and interactions within the host oral cavity. A total of 43 subjects were recruited to participate in two previous studies profiling the microbial community of human subgingival plaque samples using shotgun metagenomic sequencing. We integrated metagenomic sequence data from those two studies, including six healthy controls, 14 sites representative of stable periodontitis, 16 sites representative of progressing periodontitis, and seven periodontal sites of unknown status. We applied phylogenetic diversity, differential abundance, and network analyses, as well as clustering, to the integrated dataset to compare microbiological community profiles among the different disease states. We found alpha-diversity, i.e., mean species diversity in sites or habitats at a local scale, to be the single strongest predictor of subjects' periodontitis status (P < 0.011). More specifically, healthy subjects had the highest alpha-diversity, while subjects with stable sites had the lowest alpha-diversity. From these results, we developed an alpha-diversity logistic model-based naive classifier able to perfectly predict the disease status of the seven subjects with unknown periodontal status (not used in training). Phylogenetic profiling resulted in the discovery of nine marker microbes, and these species are able to differentiate between stable and progressing periodontitis, achieving an accuracy of 94.4%.
Finally, we found that the reduction of negatively correlated species is a notable signature of disease progression. Our results consistently show a strong association between the loss of oral microbiota diversity and the progression of periodontitis, suggesting that metagenomics sequencing and phylogenetic profiling are predictive of early periodontitis, leading to potential therapeutic intervention. Our results also support a keystone pathogen-mediated polymicrobial synergy and dysbiosis (PSD) model to explain the etiology of periodontitis. Apart from P. gingivalis, we identified three additional keystone species potentially mediating the progression of periodontitis, based on pathogenic characteristics similar to those of known keystone pathogens.
View details for DOI 10.1186/s12864-016-3254-5
View details for PubMedID 28198672
View details for PubMedCentralID PMC5310281
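The periodontitis abstract above relies on alpha-diversity as a per-sample summary. As a rough illustration only, the sketch below computes the Shannon index, a common alpha-diversity measure. This is a swapped-in technique: the study itself reports phylogenetic diversity, and the function name `shannon_diversity` and the example counts are assumptions.

```python
import math


def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts.

    `counts` holds per-taxon read counts (or abundances) for one sample.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Illustrative samples: an even community versus one dominated by a single taxon.
even_sample = [25, 25, 25, 25]
dominated_sample = [97, 1, 1, 1]
print(shannon_diversity(even_sample), shannon_diversity(dominated_sample))
```

An even community scores higher than a dominated one, which matches the intuition behind using diversity loss as a disease signature.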
A genome-wide approach for detecting novel insertion-deletion variants of mid-range size.
Nucleic acids research
2016; 44 (15)
We present SWAN, a statistical framework for robust detection of genomic structural variants in next-generation sequencing data, and an analysis of mid-range size insertions and deletions (<10 Kb) for whole genome analysis and DNA mixtures. To identify these mid-range size events, SWAN collectively uses information from read-pair, read-depth and one-end-mapped reads through statistical likelihoods based on Poisson field models. SWAN also uses soft-clip/split read remapping to supplement the likelihood analysis and determine variant boundaries. The accuracy of SWAN is demonstrated by in silico spike-ins and by identification of known variants in the NA12878 genome. We used SWAN to identify a set of novel mid-range insertions and deletions that were confirmed by targeted deep re-sequencing. An R package implementation of SWAN is open source and freely available.
View details for DOI 10.1093/nar/gkw481
View details for PubMedID 27325742
View details for PubMedCentralID PMC5009736
Statistical significance approximation in local trend analysis of high-throughput time-series data using the theory of Markov chains
Local trend (i.e. shape) analysis of time series data reveals co-changing patterns in the dynamics of biological systems. However, slow permutation procedures to evaluate the statistical significance of local trend scores have limited its application to high-throughput time series data analysis, e.g., data from studies based on next generation sequencing technology. By extending the theories for the tail probability of the range of the sum of Markovian random variables, we propose formulae for approximating the statistical significance of local trend scores. Using simulations and real data, we show that the approximate p-value is close to that obtained using a large number of permutations (starting at time points >20 with no delay and >30 with delay of at most three time steps) in that the non-zero decimals of the p-values obtained by the approximation and the permutations are mostly the same when the approximate p-value is less than 0.05. In addition, the approximate p-value is slightly larger than that based on permutations, making hypothesis testing based on the approximate p-value conservative. The approximation enables efficient calculation of p-values for pairwise local trend analysis, making large scale all-versus-all comparisons possible. We also propose a hybrid approach that integrates the approximation and permutations to obtain accurate p-values for significantly associated pairs. We further demonstrate its use with the analysis of the Plymouth Marine Laboratory (PML) microbial community time series from high-throughput sequencing data and found interesting organism co-occurrence dynamic patterns. The software tool is integrated into the eLSA software package, which now provides accelerated local trend and similarity analysis pipelines for time series data. The package is freely available from the eLSA website: http://bitbucket.org/charade/elsa.
View details for DOI 10.1186/s12859-015-0732-8
View details for Web of Science ID 000361431300001
View details for PubMedID 26390921
View details for PubMedCentralID PMC4578688
Efficient statistical significance approximation for local similarity analysis of high-throughput time series data
2013; 29 (2): 230-237
Local similarity analysis of biological time series data helps elucidate the varying dynamics of biological systems. However, its applications to large scale high-throughput data are limited by slow permutation procedures for statistical significance evaluation. We developed a theoretical approach to approximate the statistical significance of local similarity analysis based on the approximate tail distribution of the maximum partial sum of independent identically distributed (i.i.d.) random variables. Simulations show that the derived formula approximates the tail distribution reasonably well (starting at time points > 10 with no delay and > 20 with delay) and provides P-values comparable with those from permutations. The new approach enables efficient calculation of statistical significance for pairwise local similarity analysis, making possible all-to-all local association studies that would otherwise be prohibitive. As a demonstration, local similarity analysis of human microbiome time series shows that core operational taxonomic units (OTUs) are highly synergetic and some of the associations are body-site specific across samples. The new approach is implemented in our eLSA package, which now provides pipelines for faster local similarity analysis of time series data. The tool is freely available from eLSA's website: http://meta.usc.edu/softs/lsa. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/bts668
View details for Web of Science ID 000313722800011
View details for PubMedID 23178636
Extended local similarity analysis (eLSA) of microbial community and other time series data with replicates
BMC SYSTEMS BIOLOGY
The increasing availability of time series microbial community data from metagenomics and other molecular biological studies has enabled the analysis of large-scale microbial co-occurrence and association networks. Among the many analytical techniques available, the Local Similarity Analysis (LSA) method is unique in that it captures local and potentially time-delayed co-occurrence and association patterns in time series data that cannot otherwise be identified by ordinary correlation analysis. However, LSA, as originally developed, does not consider time series data with replicates, which hinders the full exploitation of available information. With replicates, it is possible to understand the variability of the local similarity (LS) score and to obtain its confidence interval. We extended our LSA technique to time series data with replicates and termed it extended LSA, or eLSA. Simulations showed the capability of eLSA to capture subinterval and time-delayed associations. We implemented the eLSA technique in an easy-to-use analytic software package. The software pipeline integrates data normalization, statistical correlation calculation, statistical significance evaluation, and association network construction steps. We applied the eLSA technique to microbial community and gene expression datasets, where unique time-dependent associations were identified. The extended LSA analysis technique was demonstrated to reveal statistically significant local and potentially time-delayed association patterns in replicated time series data beyond those found by ordinary correlation analysis. These statistically significant associations can provide insights into the real dynamics of biological systems. The newly designed eLSA software efficiently streamlines the analysis and is freely available from the eLSA homepage, which can be accessed at http://meta.usc.edu/softs/lsa.
View details for DOI 10.1186/1752-0509-5-S2-S15
View details for Web of Science ID 000301987000015
View details for PubMedID 22784572
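The local similarity (LS) score described in the eLSA abstracts above is commonly computed with a banded dynamic program over the two series. The sketch below shows that idea under stated assumptions: the inputs are taken to be already normalized, replicates are not handled, and the function name `local_similarity` and its `max_delay` parameter are illustrative choices rather than the eLSA package's API.

```python
def local_similarity(x, y, max_delay=0):
    """Signed local similarity score of two equal-length, pre-normalized series.

    Finds the contiguous stretch, with time shift |i - j| <= max_delay, whose
    summed products x[i] * y[j] are largest in magnitude, then divides by n.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("series must be non-empty and of equal length")
    # pos/neg track the best positive and negative running sums ending at (i, j).
    pos = [[0.0] * (n + 1) for _ in range(n + 1)]
    neg = [[0.0] * (n + 1) for _ in range(n + 1)]
    best_pos = best_neg = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - max_delay), min(n, i + max_delay)
        for j in range(lo, hi + 1):
            prod = x[i - 1] * y[j - 1]
            pos[i][j] = max(0.0, pos[i - 1][j - 1] + prod)
            neg[i][j] = max(0.0, neg[i - 1][j - 1] - prod)
            best_pos = max(best_pos, pos[i][j])
            best_neg = max(best_neg, neg[i][j])
    score = max(best_pos, best_neg) / n
    return score if best_pos >= best_neg else -score


# Illustrative series: y leads x by one time step.
x = [0.0, 1.0, 2.0, 0.0]
y = [1.0, 2.0, 0.0, 0.0]
print(local_similarity(x, y, max_delay=0), local_similarity(x, y, max_delay=1))
```

Allowing a delay of one step raises the score in this example, which illustrates how a time-shifted association can be missed by an unshifted comparison.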
Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads
2011; 6 (12)
Accurate estimation of microbial community composition based on metagenomic sequencing data is fundamental for subsequent metagenomics analysis. Prevalent estimation methods are mainly based on directly summarizing alignment results or variants thereof, and often result in biased and/or unstable estimates. We have developed a unified probabilistic framework (named GRAMMy) by explicitly modeling read assignment ambiguities, genome size biases and read distributions along the genomes. A maximum likelihood method is employed to compute the Genome Relative Abundance of microbial communities using the Mixture Model theory (GRAMMy). GRAMMy has been demonstrated to give estimates that are accurate and robust across both simulated and real read benchmark datasets. We applied GRAMMy to a collection of 34 metagenomic read sets from four metagenomics projects and identified 99 frequent species (minimally 0.5% abundant in at least 50% of the data-sets) in the human gut samples. Our results show substantial improvements over previous studies, such as adjusting the over-estimated abundance for Bacteroides species for human gut samples, by providing a new reference-based strategy for metagenomic sample comparisons. GRAMMy can be used flexibly with many read assignment tools (mapping, alignment or composition-based) even with low-sensitivity mapping results from huge short-read datasets. It will be increasingly useful as an accurate and robust tool for abundance estimation with the growing size of read sets and the expanding database of reference genomes.
View details for DOI 10.1371/journal.pone.0027992
View details for Web of Science ID 000298173500008
View details for PubMedID 22162995
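The mixture-model idea in the GRAMMy abstract above can be sketched with a small expectation-maximization loop. This is an illustrative sketch under assumptions, not GRAMMy's code: `read_probs` is a hypothetical matrix of per-read assignment probabilities for each reference genome, and the final step assumes that relative abundance is proportional to the mixing proportion divided by genome length.

```python
def estimate_abundance(read_probs, genome_lengths, n_iter=200):
    """EM estimate of mixing proportions, then length-corrected abundance.

    read_probs[r][g]  : assumed probability of read r given genome g
    genome_lengths[g] : length of reference genome g in bases
    """
    n_genomes = len(genome_lengths)
    mix = [1.0 / n_genomes] * n_genomes
    for _ in range(n_iter):
        counts = [0.0] * n_genomes
        # E-step: split each ambiguous read across genomes by posterior weight.
        for probs in read_probs:
            weights = [m * p for m, p in zip(mix, probs)]
            total = sum(weights)
            if total == 0.0:
                continue  # read matches no reference genome
            for g, w in enumerate(weights):
                counts[g] += w / total
        norm = sum(counts)
        if norm == 0.0:
            raise ValueError("no read could be assigned to any genome")
        # M-step: mixing proportions are the normalized expected read counts.
        mix = [c / norm for c in counts]
    # Longer genomes yield more reads per cell, so divide by genome length.
    per_base = [m / length for m, length in zip(mix, genome_lengths)]
    scale = sum(per_base)
    return [v / scale for v in per_base]


# Illustrative input: three reads unique to genome 0, one unique to genome 1.
reads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(estimate_abundance(reads, [3000, 1000]))
```

In this example the first genome receives three times as many reads but is also three times as long, so the two genomes come out roughly equally abundant.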
Phase transition in sequence unique reconstruction
JOURNAL OF SYSTEMS SCIENCE & COMPLEXITY
2007; 20 (1): 18-29
View details for Web of Science ID 000254940400002
Chromosome-scale mega-haplotypes enable digital karyotyping of cancer aneuploidy
NUCLEIC ACIDS RESEARCH
2017; 45 (19): e162
Genomic instability is a frequently occurring feature of cancer that involves large-scale structural alterations. These somatic changes in chromosome structure include duplication of entire chromosome arms and aneuploidy where chromosomes are duplicated beyond normal diploid content. However, the accurate determination of aneuploidy events in cancer genomes is a challenge. Recent advances in sequencing technology allow the characterization of haplotypes that extend megabases along the human genome using high molecular weight (HMW) DNA. For this study, we employed a library preparation method in which sequence reads have barcodes linked to single HMW DNA molecules. Barcode-linked reads are used to generate extended haplotypes on the order of megabases. We developed a method that leverages haplotypes to identify chromosomal segmental alterations in cancer and uses this information to join haplotypes together, thus extending the range of phased variants. With this approach, we identified mega-haplotypes that encompass entire chromosome arms. We characterized the chromosomal arm changes and aneuploidy events in a manner that offers similar information as a traditional karyotype but with the benefit of DNA sequence resolution. We applied this approach to characterize aneuploidy and chromosomal alterations from a series of primary colorectal cancers.
View details for DOI 10.1093/nar/gkx712
View details for Web of Science ID 000414552300001
View details for PubMedID 28977555
View details for PubMedCentralID PMC5737808
CRISPR-Cas9-targeted fragmentation and selective sequencing enable massively parallel microsatellite analysis
Microsatellites are multi-allelic and composed of short tandem repeats (STRs) with individual motifs composed of mononucleotides, dinucleotides or higher including hexamers. Next-generation sequencing approaches and other STR assays rely on a limited number of PCR amplicons, typically in the tens. Here, we demonstrate STR-Seq, a next-generation sequencing technology that analyses over 2,000 STRs in parallel, and provides the accurate genotyping of microsatellites. STR-Seq employs in vitro CRISPR-Cas9-targeted fragmentation to produce specific DNA molecules covering the complete microsatellite sequence. Amplification-free library preparation provides single molecule sequences without unique molecular barcodes. STR-selective primers enable massively parallel, targeted sequencing of large STR sets. Overall, STR-Seq has higher throughput, improved accuracy and provides a greater number of informative haplotypes compared with other microsatellite analysis approaches. With these new features, STR-Seq can identify a 0.1% minor genome fraction in a DNA mixture composed of different, unrelated samples.
View details for DOI 10.1038/ncomms14291
View details for Web of Science ID 000393379700001
View details for PubMedID 28169275
View details for PubMedCentralID PMC5309709
Correlation detection strategies in microbial data sets vary widely in sensitivity and precision
2016; 10 (7): 1669-1681
Disruption of healthy microbial communities has been linked to numerous diseases, yet microbial interactions are little understood. This is due in part to the large number of bacteria, and the much larger number of interactions (easily in the millions), making experimental investigation very difficult at best and necessitating the nascent field of computational exploration through microbial correlation networks. We benchmark the performance of eight correlation techniques on simulated and real data in response to challenges specific to microbiome studies: fractional sampling of ribosomal RNA sequences, uneven sampling depths, rare microbes and a high proportion of zero counts. Also tested is the ability to distinguish signals from noise, and detect a range of ecological and time-series relationships. Finally, we provide specific recommendations for correlation technique usage. Although some methods perform better than others, there is still considerable need for improvement in current techniques.
View details for DOI 10.1038/ismej.2015.235
View details for Web of Science ID 000378292100011
View details for PubMedID 26905627
View details for PubMedCentralID PMC4918442
Scan statistics on Poisson random fields with applications in genomics
ANNALS OF APPLIED STATISTICS
2016; 10 (2): 726-755
Pan-cancer analysis of the extent and consequences of intratumor heterogeneity
2016; 22 (1): 105-?
Intratumor heterogeneity (ITH) drives neoplastic progression and therapeutic resistance. We used the bioinformatics tools 'expanding ploidy and allele frequency on nested subpopulations' (EXPANDS) and PyClone to detect clones that are present at a ≥10% frequency in 1,165 exome sequences from tumors in The Cancer Genome Atlas. 86% of tumors across 12 cancer types had at least two clones. ITH in the morphology of nuclei was associated with genetic ITH (Spearman's correlation coefficient, ρ = 0.24-0.41; P < 0.001). Mutation of a driver gene that typically appears in smaller clones was a survival risk factor (hazard ratio (HR) = 2.15, 95% confidence interval (CI): 1.71-2.69). The risk of mortality also increased when >2 clones coexisted in the same tumor sample (HR = 1.49, 95% CI: 1.20-1.87). In two independent data sets, copy-number alterations affecting either <25% or >75% of a tumor's genome predicted reduced risk (HR = 0.15, 95% CI: 0.08-0.29). Mortality risk also declined when >4 clones coexisted in the sample, suggesting a trade-off between the costs and benefits of genomic instability. ITH and genomic instability thus have the potential to be useful measures that can universally be applied to all cancers.
View details for DOI 10.1038/nm.3984
View details for Web of Science ID 000367590700022
Cross-depth analysis of marine bacterial networks suggests downward propagation of temporal changes
2015; 9 (12): 2573-2586
Interactions among microbes and stratification across depths are both believed to be important drivers of microbial communities, though little is known about how microbial associations differ between and across depths. We have monitored the free-living microbial community at the San Pedro Ocean Time-series station, monthly, for a decade, at five different depths: 5 m, the deep chlorophyll maximum layer, 150 m, 500 m and 890 m (just above the sea floor). Here, we introduce microbial association networks that combine data from multiple ocean depths to investigate both within- and between-depth relationships, sometimes time-lagged, among microbes and environmental parameters. The euphotic zone, deep chlorophyll maximum and 890 m depth each contain two negatively correlated 'modules' (groups of many inter-correlated bacteria and environmental conditions) suggesting regular transitions between two contrasting environmental states. Two-thirds of pairwise correlations of bacterial taxa between depths lagged such that changes in the abundance of deeper organisms followed changes in shallower organisms. Taken in conjunction with previous observations of seasonality at 890 m, these trends suggest that planktonic microbial communities throughout the water column are linked to environmental conditions and/or microbial communities in overlying waters. Poorly understood groups including Marine Group A, Nitrospina and AEGEAN-169 clades contained taxa that showed diverse association patterns, suggesting these groups contain multiple ecological species, each shaped by different factors, which we have started to delineate. These observations build upon previous work at this location, lending further credence to the hypothesis that sinking particles and vertically migrating animals transport materials that significantly shape the time-varying patterns of microbial community composition.
View details for DOI 10.1038/ismej.2015.76
View details for Web of Science ID 000365094400004
View details for PubMedID 25989373
A new multiple feature approach for rapid and highly accurate somatic structural variation discovery from whole cancer genome sequencing
AMER ASSOC CANCER RESEARCH. 2015
Emergence of Hemagglutinin Mutations During the Course of Influenza Infection.
2015; 5: 16178-?
Influenza remains a significant cause of disease mortality. The ongoing threat of influenza infection is partly attributable to the emergence of new mutations in the influenza genome. Among the influenza viral gene products, the hemagglutinin (HA) glycoprotein plays a critical role in influenza pathogenesis, is the target for vaccines and accumulates new mutations that may alter the efficacy of immunization. To study the emergence of HA mutations during the course of infection, we employed a deep-targeted sequencing method. We used samples from 17 patients with active H1N1 or H3N2 influenza infections. These patients were not treated with antivirals. In addition, we had samples from five patients who were analyzed longitudinally. Thus, we determined the quantitative changes in the fractional representation of HA mutations during the course of infection. Across individuals in the study, a series of novel HA mutations that directly altered the HA coding sequence were identified. Serial viral sampling revealed HA mutations that either were stable, expanded or were reduced in representation during the course of the infection. Overall, we demonstrated the emergence of unique mutations specific to an infected individual and temporal genetic variation during infection.
View details for DOI 10.1038/srep16178
View details for PubMedID 26538451
View details for PubMedCentralID PMC4633648
Extended Local Similarity Analysis (eLSA) of Biological Data
Encyclopedia of Metagenomics: Genes, Genomes and Metagenomes. Basics, Methods, Databases and Tools
edited by Nelson, K.
View details for DOI 10.1007/978-1-4614-6418-1_722-5
Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads
Encyclopedia of Metagenomics: Genes, Genomes and Metagenomes. Basics, Methods, Databases and Tools
edited by Nelson, K.
View details for DOI 10.1007/978-1-4614-6418-1_723-4
A Quantitative Evaluation of Health Care System in US, China, and Sweden
Health Med
2013; 7 (4)
Genetic analysis of differentiation of T-helper lymphocytes
GENETICS AND MOLECULAR RESEARCH
2013; 12 (2): 972-987
In the human immune system, T-helper cells are able to differentiate into two lymphocyte subsets: Th1 and Th2. The intracellular signaling pathways of differentiation form a dynamic regulation network by secreting distinctive types of cytokines, while differentiation is regulated by two major gene loci: T-bet and GATA-3. We developed a system dynamics model to simulate the differentiation and re-differentiation process of T-helper cells, based on gene expression levels of T-bet and GATA-3 during differentiation of these cells. We arrived at three ultimate states of the model and came to the conclusion that cell differentiation potential exists as long as the system dynamics is at an unstable equilibrium point; the T-helper cells will no longer have the potential of differentiation when the model reaches a stable equilibrium point. In addition, the time lag caused by expression of transcription factors can lead to oscillations in the secretion of cytokines during differentiation.
View details for DOI 10.4238/2013.April.2.13
View details for Web of Science ID 000320030100011
View details for PubMedID 23613243
Marine bacterial, archaeal and protistan association networks reveal ecological linkages
2011; 5 (9): 1414-1425
Microbes have central roles in ocean food webs and global biogeochemical processes, yet specific ecological relationships among these taxa are largely unknown. This is in part due to the dilute, microscopic nature of the planktonic microbial community, which prevents direct observation of their interactions. Here, we use a holistic (that is, microbial system-wide) approach to investigate time-dependent variations among taxa from all three domains of life in a marine microbial community. We investigated the community composition of bacteria, archaea and protists through cultivation-independent methods, along with total bacterial and viral abundance, and physico-chemical observations. Samples and observations were collected monthly over 3 years at a well-described ocean time-series site of southern California. To find associations among these organisms, we calculated time-dependent rank correlations (that is, local similarity correlations) among relative abundances of bacteria, archaea, protists, total abundance of bacteria and viruses and physico-chemical parameters. We used a network generated from these statistical correlations to visualize and identify time-dependent associations among ecologically important taxa, for example, the SAR11 cluster, stramenopiles, alveolates, cyanobacteria and ammonia-oxidizing archaea. Negative correlations, perhaps suggesting competition or predation, were also common. The analysis revealed a progression of microbial communities through time, and also a group of unknown eukaryotes that were highly correlated with dinoflagellates, indicating possible symbioses or parasitism. Possible 'keystone' species were evident. The network has statistical features similar to previously described ecological networks, and in network parlance has non-random, small world properties (that is, highly interconnected nodes). This approach provides new insights into the natural history of microbes.
View details for DOI 10.1038/ismej.2011.24
View details for Web of Science ID 000295782900003
View details for PubMedID 21430787
PPLook: an automated data mining tool for protein-protein interaction
Extracting and visualizing protein-protein interactions (PPI) from the text literature is a meaningful topic in protein science. It assists the identification of interactions among proteins. There is a lack of tools to extract PPI and to visualize and classify the results. We developed a PPI search system, termed PPLook, which automatically extracts and visualizes protein-protein interactions (PPI) from text. Given a query protein name, PPLook can search a dataset for other proteins interacting with it by using a keywords dictionary pattern-matching algorithm, and display the topological parameters, such as the number of nodes, edges, and connected components. The visualization component of PPLook enables us to view the interaction relationships among the proteins in a three-dimensional space based on the OpenGL graphics interface technology. PPLook can also select a protein semantic class, count the number of proteins in a semantic class that interact with the query protein, and count the number of articles in the literature that report an interaction involving the query protein. Moreover, PPLook provides heterogeneous search and a user-friendly graphical interface. PPLook is an effective tool for biologists and biosystem developers who need to access PPI information from the literature. PPLook is freely available for non-commercial users at http://meta.usc.edu/softs/PPLook.
View details for DOI 10.1186/1471-2105-11-326
View details for Web of Science ID 000280331700002
View details for PubMedID 20550717
Oligonucleotide profiling for discriminating bacteria in bacterial communities
COMBINATORIAL CHEMISTRY & HIGH THROUGHPUT SCREENING
2007; 10 (4): 247-255
Based on the relative ratios of di- and tri-nucleotides in the DNA sequences, the profiles of 164 genome sequences from 152 representative microbial organisms were computed. By comparing the profiles of the genomes and their substrings with length 500 bps, the fluctuations of the relative abundances of di- and tri-nucleotides of these genomic sequences were analyzed. A new method to discriminate the origins of orphan DNA sequences was proposed, and the origins of 17 uncultured bacterium sequences from a bacterial community in the human gut were postulated and discussed.
View details for Web of Science ID 000247022400002
View details for PubMedID 17506707
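The oligonucleotide profiling abstract above builds genome profiles from relative ratios of di- and tri-nucleotides. A minimal sketch for the dinucleotide case follows. It assumes the ratio is the observed dinucleotide frequency divided by the product of its two mononucleotide frequencies, which is one common definition and may differ from the one used in the paper. The function name `dinucleotide_profile` and the example sequence are illustrative.

```python
from collections import Counter
from itertools import product


def dinucleotide_profile(seq):
    """Map each dinucleotide XY to f(XY) / (f(X) * f(Y)) for one sequence.

    Ratios above 1 mean the dinucleotide is over-represented relative to what
    the base composition alone would predict; absent bases give a ratio of 0.
    """
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    profile = {}
    for a, b in product("ACGT", repeat=2):
        fa, fb = mono[a] / n_mono, mono[b] / n_mono
        fab = di[a + b] / n_di
        profile[a + b] = fab / (fa * fb) if fa > 0 and fb > 0 else 0.0
    return profile


# Illustrative sequence: a strict AC repeat over-represents the AC dinucleotide.
profile = dinucleotide_profile("ACACACAC")
print(profile["AC"], profile["CA"], profile["AA"])
```

Computing such a profile for a whole genome and for short windows of it gives two vectors that can be compared, which is the kind of comparison the abstract describes for assigning orphan sequences to likely source organisms.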