Wing Hung Wong
Stephen R. Pierce Family Goldman Sachs Professor of Science and Human Health and Professor of Biomedical Data Science
Statistics
Web page: https://web.stanford.edu/group/wonglab/
Bio
I am a professor at Stanford University with joint appointments in the Department of Statistics and the Department of Biomedical Data Science. My current research interests are Bayesian Statistics, Computational Biology and Precision Medicine.
Academic Appointments
-
Professor, Statistics
-
Professor, Department of Biomedical Data Science
-
Member, Bio-X
-
Member, Cardiovascular Institute
-
Member, Stanford Cancer Institute
-
Member, Wu Tsai Neurosciences Institute
Honors & Awards
-
Grace Wahba Award and Lecture, Institute of Mathematical Statistics (2023)
-
COPSS Distinguished Achievement Award and Lectureship, Committee of Presidents of Statistical Societies (2021)
-
Founding Member, The Academy of Sciences of Hong Kong (2015)
-
Academician, Academia Sinica (2010)
-
Member, National Academy of Sciences (2009)
-
Bahadur Lecturer, The University of Chicago (2006)
-
Fellow, American Association for the Advancement of Science (2002)
-
Neyman Lecturer, Institute of Mathematical Statistics (2002)
-
Fellow, American Statistical Association (1998)
-
COPSS Award, Committee of Presidents of Statistical Societies (1993)
-
Fellow, Institute of Mathematical Statistics (1991)
-
Fellow, Guggenheim Foundation (1986)
Current Research and Scholarly Interests
My lab is interested in statistics and genomics. Past contributions include the use of Monte Carlo algorithms in Bayesian computation, asymptotic inference in high or infinite dimensional problems, and bioinformatics tools for the analysis of microarray data and sequencing data. Current interests include i) gene regulatory analysis based on integrative modeling of data from diverse cell types and from single cells, and ii) use of semiconductor technology to enable novel biological experiments.
2024-25 Courses
- Biomedical Data Science Student Seminar
BIOMEDIN 201 (Win)
- Consulting Workshop
STATS 390 (Win)
- Workshop in Biostatistics
BIODS 260B, STATS 260B (Win)
Independent Studies (8)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Independent Study
STATS 299 (Aut, Win, Spr, Sum)
- Industrial Research for Statisticians
STATS 298 (Aut, Win, Spr, Sum)
- Industrial Research for Statisticians
STATS 398 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Ph.D. Research
CME 400 (Aut, Win, Spr, Sum)
- Research
STATS 399 (Aut, Win, Spr, Sum)
Prior Year Courses
2023-24 Courses
- Bayesian Statistics
STATS 270, STATS 370 (Spr)
- Biomedical Data Science Student Seminar
BIODS 201, BIOMEDIN 201 (Win)
- Literature of Statistics
STATS 319 (Spr)
- Workshop in Biostatistics
BIODS 260B, STATS 260B (Win)
2022-23 Courses
- Bayesian Statistics
STATS 270, STATS 370 (Aut)
- Biomedical Informatics Student Seminar
BIODS 201, BIOMEDIN 201 (Win)
2021-22 Courses
- Bayesian Statistics
STATS 270, STATS 370 (Spr)
Stanford Advisees
-
Doctoral Dissertation Reader (AC)
Ian Christopher Tanoh
-
Postdoctoral Faculty Sponsor
Zhanying Feng, Qiao Liu, Wanwen Zeng
-
Doctoral Dissertation Advisor (AC)
Sophia Lu
-
Master's Program Advisor
Claudio Aguilar Alvarez, Xinyi Ai, Moritz Bolling, Jinny Chung, Andy Dai, Yiran Fan, Daniel Frees, Arnav Gangal, Salil Goyal, Martin Pollack
-
Postdoctoral Research Mentor
Hanmin Guo
Graduate and Fellowship Programs
-
Biology (School of Humanities and Sciences) (PhD Program)
All Publications
-
Author Correction: Cost-effective methylome sequencing of cell-free DNA for accurately detecting and locating cancer.
Nature communications
2024; 15 (1): 3693
View details for DOI 10.1038/s41467-024-48018-5
View details for PubMedID 38693151
-
Prioritizing disease-related rare variants by integrating gene expression data.
bioRxiv : the preprint server for biology
2024
Abstract
Rare variants, comprising a vast majority of human genetic variations, are likely to have more deleterious impact on human diseases compared to common variants. Here we present carrier statistic, a statistical framework to prioritize disease-related rare variants by integrating gene expression data. By quantifying the impact of rare variants on gene expression, carrier statistic can prioritize those rare variants that have large functional consequence in the diseased patients. Through simulation studies and analyzing real multi-omics dataset, we demonstrated that carrier statistic is applicable in studies with limited sample size (a few hundreds) and achieves substantially higher sensitivity than existing rare variants association methods. Application to Alzheimer's disease reveals 16 rare variants within 15 genes with extreme carrier statistics. The carrier statistic method can be applied to various rare variant types and is adaptable to other omics data modalities, offering a powerful tool for investigating the molecular mechanisms underlying complex diseases.
View details for DOI 10.1101/2024.03.19.585836
View details for PubMedID 38562756
-
Genetic effects of sequence-conserved enhancer-like elements on human complex traits.
Genome biology
2024; 25 (1): 1
Abstract
The vast majority of findings from human genome-wide association studies (GWAS) map to non-coding sequences, complicating their mechanistic interpretations and clinical translations. Non-coding sequences that are evolutionarily conserved and biochemically active could offer clues to the mechanisms underpinning GWAS discoveries. However, genetic effects of such sequences have not been systematically examined across a wide range of human tissues and traits, hampering progress to fully understand regulatory causes of human complex traits. Here we develop a simple yet effective strategy to identify functional elements exhibiting high levels of human-mouse sequence conservation and enhancer-like biochemical activity, which scales well to 313 epigenomic datasets across 106 human tissues and cell types. Combined with 468 GWAS of European (EUR) and East Asian (EAS) ancestries, these elements show tissue-specific enrichments of heritability and causal variants for many traits, which are significantly stronger than enrichments based on enhancers without sequence conservation. These elements also help prioritize candidate genes that are functionally relevant to body mass index (BMI) and schizophrenia but were not reported in previous GWAS with large sample sizes. Our findings provide a comprehensive assessment of how sequence-conserved enhancer-like elements affect complex traits in diverse tissues and demonstrate a generalizable strategy of integrating evolutionary and biochemical data to elucidate human disease genetics.
View details for DOI 10.1186/s13059-023-03142-1
View details for PubMedID 38167462
View details for PubMedCentralID PMC10759394
-
Detection and analysis of complex structural variation in human genomes across populations and in brains of donors with psychiatric disorders
Cell
2024; Published online September 30, 2024
View details for DOI 10.1016/j.cell.2024.09.014
-
Data integration and inference of gene regulation using single-cell temporal multimodal data with scTIE.
Genome research
2023
Abstract
Single-cell technologies offer unprecedented opportunities to dissect gene regulatory mechanisms in context-specific ways. Although there are computational methods for extracting gene regulatory relationships from scRNA-seq and scATAC-seq data, the data integration problem, essential for accurate cell type identification, has been mostly treated as a standalone challenge. Here we present scTIE, a unified method that integrates temporal multimodal data and infers regulatory relationships predictive of cellular state changes. scTIE uses an autoencoder to embed cells from all time points into a common space using iterative optimal transport, followed by extracting interpretable information to predict cell trajectories. Using a variety of synthetic and real temporal multimodal datasets, we demonstrate scTIE achieves effective data integration while preserving more biological signals than existing methods, particularly in the presence of batch effects and noise. Furthermore, on the exemplar multiome dataset we generated from differentiating mouse embryonic stem cells over time, we demonstrate scTIE captures regulatory elements highly predictive of cell transition probabilities, providing new avenues to understand the regulatory landscape driving developmental processes.
View details for DOI 10.1101/gr.277960.123
View details for PubMedID 38190633
-
Mutations in human DNA methyltransferase DNMT1 induce specific genome-wide epigenomic and transcriptomic changes in neurodevelopment.
Human molecular genetics
2023
Abstract
DNA methyltransferase type 1 (DNMT1) is a major enzyme involved in maintaining the methylation pattern after DNA replication. Mutations in DNMT1 have been associated with autosomal dominant cerebellar ataxia, deafness, and narcolepsy (ADCA-DN). We used fibroblasts, induced pluripotent stem cells (iPSCs) and induced neurons (iNs) generated from patients with ADCA-DN and controls, to explore the epigenomic and transcriptomic effects of mutations in DNMT1. We show cell-type specific changes in gene expression and DNA methylation patterns. DNA methylation and gene expression changes were negatively correlated in iPSCs and iNs. In addition, we identified a group of genes associated with clinical phenotypes of ADCA-DN, including PDGFB and PRDM8 for cerebellar ataxia, psychosis and dementia, and NR2F1 for deafness and optic atrophy. Furthermore, ZFP57, which is required to maintain gene imprinting through DNA methylation during early development, was hypomethylated in promoters and exhibited upregulated expression in patients with ADCA-DN in both iPSC and iNs. Our results provide insight into the functions of DNMT1 and the molecular changes associated with ADCA-DN, with potential implications for genes associated with related phenotypes.
View details for DOI 10.1093/hmg/ddad123
View details for PubMedID 37584462
-
EpiGePT: a Pretrained Transformer model for epigenomics.
bioRxiv : the preprint server for biology
2023
Abstract
The transformer-based models, such as GPT-3 and DALL-E, have achieved unprecedented breakthroughs in the field of natural language processing and computer vision. The inherent similarities between natural language and biological sequences have prompted a new wave of inferring the grammatical rules underneath the biological sequences. In genomic study, it is worth noting that DNA sequences alone cannot explain all the gene activities due to epigenetic mechanism. To investigate this problem, we propose EpiGePT, a new transformer-based language pretrained model in epigenomics, for predicting genome-wide epigenomic signals by considering the mechanistic modeling of transcriptional regulation. Specifically, EpiGePT takes the context-specific activities of transcription factors (TFs) into consideration, which could offer deeper biological insights comparing to models trained on DNA sequence only. In a series of experiments, EpiGePT demonstrates state-of-the-art performance in a diverse epigenomic signals prediction tasks as well as new prediction tasks by fine-tuning. Furthermore, EpiGePT is capable of learning the cell-type-specific long-range interactions through the self-attention mechanism and interpreting the genetic variants that associated with human diseases. We expect that the advances of EpiGePT can shed light on understanding the complex regulatory mechanisms in gene regulation. We provide free online prediction service of EpiGePT through https://health.tsinghua.edu.cn/epigept/.
View details for DOI 10.1101/2023.07.15.549134
View details for PubMedID 37502861
View details for PubMedCentralID PMC10370089
-
Comprehensive tissue deconvolution of cell-free DNA by deep learning for disease diagnosis and monitoring.
Proceedings of the National Academy of Sciences of the United States of America
2023; 120 (28): e2305236120
Abstract
Plasma cell-free DNA (cfDNA) is a noninvasive biomarker for cell death of all organs. Deciphering the tissue origin of cfDNA can reveal abnormal cell death because of diseases, which has great clinical potential in disease detection and monitoring. Despite the great promise, the sensitive and accurate quantification of tissue-derived cfDNA remains challenging to existing methods due to the limited characterization of tissue methylation and the reliance on unsupervised methods. To fully exploit the clinical potential of tissue-derived cfDNA, here we present one of the largest comprehensive and high-resolution methylation atlas based on 521 noncancer tissue samples spanning 29 major types of human tissues. We systematically identified fragment-level tissue-specific methylation patterns and extensively validated them in orthogonal datasets. Based on the rich tissue methylation atlas, we develop the first supervised tissue deconvolution approach, a deep-learning-powered model, cfSort, for sensitive and accurate tissue deconvolution in cfDNA. On the benchmarking data, cfSort showed superior sensitivity and accuracy compared to the existing methods. We further demonstrated the clinical utilities of cfSort with two potential applications: aiding disease diagnosis and monitoring treatment side effects. The tissue-derived cfDNA fraction estimated from cfSort reflected the clinical outcomes of the patients. In summary, the tissue methylation atlas and cfSort enhanced the performance of tissue deconvolution in cfDNA, thus facilitating cfDNA-based disease detection and longitudinal treatment monitoring.
View details for DOI 10.1073/pnas.2305236120
View details for PubMedID 37399400
-
scTIE: data integration and inference of gene regulation using single-cell temporal multimodal data.
bioRxiv : the preprint server for biology
2023
Abstract
Single-cell technologies offer unprecedented opportunities to dissect gene regulatory mechanisms in context-specific ways. Although there are computational methods for extracting gene regulatory relationships from scRNA-seq and scATAC-seq data, the data integration problem, essential for accurate cell type identification, has been mostly treated as a standalone challenge. Here we present scTIE, a unified method that integrates temporal multimodal data and infers regulatory relationships predictive of cellular state changes. scTIE uses an autoencoder to embed cells from all time points into a common space using iterative optimal transport, followed by extracting interpretable information to predict cell trajectories. Using a variety of synthetic and real temporal multimodal datasets, we demonstrate scTIE achieves effective data integration while preserving more biological signals than existing methods, particularly in the presence of batch effects and noise. Furthermore, on the exemplar multiome dataset we generated from differentiating mouse embryonic stem cells over time, we demonstrate scTIE captures regulatory elements highly predictive of cell transition probabilities, providing new potentials to understand the regulatory landscape driving developmental processes.
View details for DOI 10.1101/2023.05.18.541381
View details for PubMedID 37292801
View details for PubMedCentralID PMC10245711
-
NeuronMotif: Deciphering cis-regulatory codes by layer-wise demixing of deep neural networks.
Proceedings of the National Academy of Sciences of the United States of America
2023; 120 (15): e2216698120
Abstract
Discovering DNA regulatory sequence motifs and their relative positions is vital to understanding the mechanisms of gene expression regulation. Although deep convolutional neural networks (CNNs) have achieved great success in predicting cis-regulatory elements, the discovery of motifs and their combinatorial patterns from these CNN models has remained difficult. We show that the main difficulty is due to the problem of multifaceted neurons which respond to multiple types of sequence patterns. Since existing interpretation methods were mainly designed to visualize the class of sequences that can activate the neuron, the resulting visualization will correspond to a mixture of patterns. Such a mixture is usually difficult to interpret without resolving the mixed patterns. We propose the NeuronMotif algorithm to interpret such neurons. Given any convolutional neuron (CN) in the network, NeuronMotif first generates a large sample of sequences capable of activating the CN, which typically consists of a mixture of patterns. Then, the sequences are "demixed" in a layer-wise manner by backward clustering of the feature maps of the involved convolutional layers. NeuronMotif can output the sequence motifs, and the syntax rules governing their combinations are depicted by position weight matrices organized in tree structures. Compared to existing methods, the motifs found by NeuronMotif have more matches to known motifs in the JASPAR database. The higher-order patterns uncovered for deep CNs are supported by the literature and ATAC-seq footprinting. Overall, NeuronMotif enables the deciphering of cis-regulatory codes from deep CNs and enhances the utility of CNN in genome interpretation.
View details for DOI 10.1073/pnas.2216698120
View details for PubMedID 37023129
-
Revealing Free Energy Landscape From MD Data via Conditional Angle Partition Tree.
IEEE/ACM transactions on computational biology and bioinformatics
2023; 20 (2): 1384-1394
Abstract
Deciphering the free energy landscape of biomolecular structure space is crucial for understanding many complex molecular processes, such as protein-protein interaction, RNA folding, and protein folding. A major source of current dynamic structure data is Molecular Dynamics (MD) simulations. Several methods have been proposed to investigate the free energy landscape from MD data, but all of them rely on the assumption that kinetic similarity is associated with global geometric similarity, which may lead to unsatisfactory results. In this paper, we proposed a new method called Conditional Angle Partition Tree to reveal the hierarchical free energy landscape by correlating local geometric similarity with kinetic similarity. Its application on the benchmark alanine dipeptide MD data showed a much better performance than existing methods in exploring and understanding the free energy landscape. We also applied it to the MD data of Villin HP35. Our results are more reasonable on various aspects than those from other methods and very informative on the hierarchical structure of its energy landscape.
View details for DOI 10.1109/TCBB.2022.3172352
View details for PubMedID 35503836
-
Statins improve endothelial function via suppression of epigenetic-driven EndMT
Nature Cardiovascular Research
2023
View details for DOI 10.1038/s44161-022-00205-7
-
Convergence Rates of a Class of Multivariate Density Estimation Methods Based on Adaptive Partitioning
Journal of Machine Learning Research
2023; 24
View details for Web of Science ID 001125491200001
-
Heritability enrichment in context-specific regulatory networks improves phenotype-relevant tissue identification.
eLife
2022; 11
Abstract
Systems genetics holds the promise to decipher complex traits by interpreting their associated SNPs through gene regulatory networks derived from comprehensive multi-omics data of cell types, tissues, and organs. Here, we propose SpecVar to integrate paired chromatin accessibility and gene expression data into context-specific regulatory network atlas and regulatory categories, conduct heritability enrichment analysis with GWAS summary statistics, identify relevant tissues, and depict common genetic factors acting in the shared regulatory networks between traits by relevance correlation. Our method improves power upon existing approaches by associating SNPs with context-specific regulatory elements to assess heritability enrichments and by explicitly prioritizing gene regulations underlying relevant tissues. Ablation studies, independent data validation, and comparison experiments with existing methods on GWAS of six phenotypes show that SpecVar can improve heritability enrichment, accurately detect relevant tissues, and reveal causal regulations. Furthermore, SpecVar correlates the relevance patterns for pairs of phenotypes and better reveals shared SNP associated regulations of phenotypes than existing methods. Studying GWAS of 206 phenotypes in UK-Biobank demonstrates that SpecVar leverages the context-specific regulatory network atlas to prioritize phenotypes' relevant tissues and shared heritability for biological and therapeutic insights. SpecVar provides a powerful way to interpret SNPs via context-specific regulatory networks and is available at https://github.com/AMSSwanglab/SpecVar.
View details for DOI 10.7554/eLife.82535
View details for PubMedID 36525361
-
Author Correction: Regulatory analysis of single cell multiome gene expression and chromatin accessibility data with scREG.
Genome biology
2022; 23 (1): 213
View details for DOI 10.1186/s13059-022-02786-9
View details for PubMedID 36229829
-
HiChIPdb: a comprehensive database of HiChIP regulatory interactions.
Nucleic acids research
2022
Abstract
Elucidating the role of 3D architecture of DNA in gene regulation is crucial for understanding cell differentiation, tissue homeostasis and disease development. Among various chromatin conformation capture methods, HiChIP has received increasing attention for its significant improvement over other methods in profiling of regulatory (e.g. H3K27ac) and structural (e.g. cohesin) interactions. To facilitate the studies of 3D regulatory interactions, we developed a HiChIP interactions database, HiChIPdb (http://health.tsinghua.edu.cn/hichipdb/). The current version of HiChIPdb contains 262M annotated HiChIP interactions from 200 high-throughput HiChIP samples across 108 cell types. The functionalities of HiChIPdb include: (i) standardized categorization of HiChIP interactions in a hierarchical structure based on organ, tissue and cell line and (ii) comprehensive annotations of HiChIP interactions with regulatory genes and GWAS Catalog SNPs. To the best of our knowledge, HiChIPdb is the first comprehensive database that utilizes a unified pipeline to map the functional interactions across diverse cell types and tissues in different resolutions. We believe this database has the potential to advance cutting-edge research in regulatory mechanisms in development and disease by removing the barrier in data aggregation, preprocessing, and analysis.
View details for DOI 10.1093/nar/gkac859
View details for PubMedID 36215037
-
DETECTION OF COMPLEX STRUCTURAL GENOME VARIANTS USING ARC-SV AND THEIR ENRICHMENT INSIDE GENES OF NEURODEVELOPMENTAL PATHWAYS
ELSEVIER. 2022: E177-E178
View details for DOI 10.1016/j.euroneuro.2022.07.322
View details for Web of Science ID 000898874200094
-
Cost-effective methylome sequencing of cell-free DNA for accurately detecting and locating cancer.
Nature communications
2022; 13 (1): 5566
Abstract
Early cancer detection by cell-free DNA faces multiple challenges: low fraction of tumor cell-free DNA, molecular heterogeneity of cancer, and sample sizes that are not sufficient to reflect diverse patient populations. Here, we develop a cancer detection approach to address these challenges. It consists of an assay, cfMethyl-Seq, for cost-effective sequencing of the cell-free DNA methylome (with > 12-fold enrichment over whole genome bisulfite sequencing in CpG islands), and a computational method to extract methylation information and diagnose patients. Applying our approach to 408 colon, liver, lung, and stomach cancer patients and controls, at 97.9% specificity we achieve 80.7% and 74.5% sensitivity in detecting all-stage and early-stage cancer, and 89.1% and 85.0% accuracy for locating tissue-of-origin of all-stage and early-stage cancer, respectively. Our approach cost-effectively retains methylome profiles of cancer abnormalities, allowing us to learn new features and expand to other cancer types as training cohorts grow.
View details for DOI 10.1038/s41467-022-32995-6
View details for PubMedID 36175411
-
Unfolding the genotype-to-phenotype black box of cardiovascular diseases through cross-scale modeling.
iScience
2022; 25 (8): 104790
Abstract
Complex traits such as cardiovascular diseases (CVD) are the results of complicated processes jointly affected by genetic and environmental factors. Genome-wide association studies (GWAS) identified genetic variants associated with diseases but usually did not reveal the underlying mechanisms. There could be many intermediate steps at epigenetic, transcriptomic, and cellular scales inside the black box of genotype-phenotype associations. In this article, we present a machine-learning-based cross-scale framework GRPath to decipher putative causal paths (pcPaths) from genetic variants to disease phenotypes by integrating multiple omics data. Applying GRPath on CVD, we identified 646 and 549 pcPaths linking putative causal regions, variants, and gene expressions in specific cell types for two types of heart failure, respectively. The findings suggest new understandings of coronary heart disease. Our work promoted the modeling of tissue- and cell type-specific cross-scale regulation to uncover mechanisms behind disease-associated variants, and provided new findings on the molecular mechanisms of CVD.
View details for DOI 10.1016/j.isci.2022.104790
View details for PubMedID 35992073
-
Nested epistasis enhancer networks for robust genome regulation.
Science (New York, N.Y.)
2022: eabk3512
Abstract
Mammalian genomes possess multiple enhancers spanning an ultralong distance (>megabases) to modulate important genes, yet it is unclear how these enhancers coordinate to achieve this task. Here, we combine multiplexed CRISPRi screening with machine learning to define quantitative enhancer-enhancer interactions. We find that the ultralong distance enhancer network possesses a nested multi-layer architecture that confers functional robustness of gene expression. Experimental characterization reveals that enhancer epistasis is maintained by three-dimensional chromosomal interactions and BRD4 condensation. Machine learning prediction of synergistic enhancers provides an effective strategy to identify non-coding variant pairs associated with pathogenic genes in diseases beyond Genome-Wide Association Studies (GWAS) analysis. Our work unveils nested epistasis enhancer networks, which can better explain enhancer functions within cells and in diseases.
View details for DOI 10.1126/science.abk3512
View details for PubMedID 35951677
-
Regulatory analysis of single cell multiome gene expression and chromatin accessibility data with scREG.
Genome biology
2022; 23 (1): 114
Abstract
Technological development has enabled the profiling of gene expression and chromatin accessibility from the same cell. We develop scREG, a dimension reduction methodology, based on the concept of cis-regulatory potential, for single cell multiome data. This concept is further used for the construction of subpopulation-specific cis-regulatory networks. The capability of inferring useful regulatory network is demonstrated by the two-fold increment on network inference accuracy compared to the Pearson correlation-based method and the 27-fold enrichment of GWAS variants for inflammatory bowel disease in the cis-regulatory elements. The R package scREG provides comprehensive functions for single cell multiome data analysis.
View details for DOI 10.1186/s13059-022-02682-2
View details for PubMedID 35578363
-
scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning.
Nature biotechnology
2022
Abstract
Single-cell multiomics data continues to grow at an unprecedented pace. Although several methods have demonstrated promising results in integrating several data modalities from the same tissue, the complexity and scale of data compositions present in cell atlases still pose a challenge. Here, we present scJoint, a transfer learning method to integrate atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semisupervised framework and uses a neural network to simultaneously train labeled and unlabeled data, allowing label transfer and joint visualization in an integrative framework. Using atlas data as well as multimodal datasets generated with ASAP-seq and CITE-seq, we demonstrate that scJoint is computationally efficient and consistently achieves substantially higher cell-type label accuracy than existing methods while providing meaningful joint visualizations. Thus, scJoint overcomes the heterogeneity of different data modalities to enable a more comprehensive understanding of cellular phenotypes.
View details for DOI 10.1038/s41587-021-01161-6
View details for PubMedID 35058621
-
Leveraging cell-type-specific regulatory networks to interpret genetic variants in abdominal aortic aneurysm.
Proceedings of the National Academy of Sciences of the United States of America
2022; 119 (1)
Abstract
Abdominal aortic aneurysm (AAA) is a common degenerative cardiovascular disease whose pathobiology is not clearly understood. The cellular heterogeneity and cell-type-specific gene regulation of vascular cells in human AAA have not been well-characterized. Here, we performed analysis of whole-genome sequencing data in AAA patients versus controls with the aim of detecting disease-associated variants that may affect gene regulation in human aortic smooth muscle cells (AoSMC) and human aortic endothelial cells (HAEC), two cell types of high relevance to AAA disease. To support this analysis, we generated H3K27ac HiChIP data for these cell types and inferred cell-type-specific gene regulatory networks. We observed that AAA-associated variants were most enriched in regulatory regions in AoSMC, compared with HAEC and CD4+ cells. The cell-type-specific regulation defined by this HiChIP data supported the importance of ERG and the KLF family of transcription factors in AAA disease. The analysis of regulatory elements that contain noncoding variants and also are differentially open between AAA patients and controls revealed the significance of the interleukin-6-mediated signaling pathway. This finding was further validated by including information from the deleteriousness effect of nonsynonymous single-nucleotide variants in AAA patients and additional control data from the Medical Genome Reference Bank dataset. These results shed important insights into AAA pathogenesis and provide a model for cell-type-specific analysis of disease-associated variants.
View details for DOI 10.1073/pnas.2115601119
View details for PubMedID 34930827
-
Coupled generation.
Journal of the American Statistical Association
2022; 117 (539): 1243-1253
Abstract
Instance generation creates representative examples to interpret a learning model, as in regression and classification. For example, representative sentences of a topic of interest describe the topic specifically for sentence categorization. In such a situation, a large number of unlabeled observations may be available in addition to labeled data, for example, many unclassified text corpora (unlabeled instances) are available with only a few classified sentences (labeled instances). In this article, we introduce a novel generative method, called a coupled generator, producing instances given a specific learning outcome, based on indirect and direct generators. The indirect generator uses the inverse principle to yield the corresponding inverse probability, enabling to generate instances by leveraging an unlabeled data. The direct generator learns the distribution of an instance given its learning outcome. Then, the coupled generator seeks the best one from the indirect and direct generators, which is designed to enjoy the benefits of both and deliver higher generation accuracy. For sentence generation given a topic, we develop an embedding-based regression/classification in conjuncture with an unconditional recurrent neural network for the indirect generator, whereas a conditional recurrent neural network is natural for the corresponding direct generator. Moreover, we derive finite-sample generation error bounds for the indirect and direct generators to reveal the generative aspects of both methods thus explaining the benefits of the coupled generator. Finally, we apply the proposed methods to a real benchmark of abstract classification and demonstrate that the coupled generator composes reasonably good sentences from a dictionary to describe a specific topic of interest.
View details for DOI 10.1080/01621459.2020.1844719
View details for PubMedID 36465716
View details for PubMedCentralID PMC9718422
-
On the identifiability of the isoform deconvolution problem: application to select the proper fragment length in an RNAseq library.
Bioinformatics (Oxford, England)
2022
Abstract
MOTIVATION: Isoform deconvolution is an NP-hard problem. The accuracy of the proposed solutions is far from perfect. At present, it is not known if gene structure and isoform concentration can be uniquely inferred given paired-end reads, and there is no objective method to select the fragment length to improve the number of identifiable genes. Different pieces of evidence suggest that the optimal fragment length is gene-dependent, stressing the need for a method that selects the fragment length according to a reasonable trade-off across all the genes in the whole genome. RESULTS: A gene is considered to be identifiable if it is possible to get both the structure and concentration of its transcripts univocally. Here, we present a method to state the identifiability of this deconvolution problem. Assuming a given transcriptome and that the coverage is sufficient to interrogate all junction reads of the transcripts, this method states whether or not a gene is identifiable given the read length and fragment length distribution. Applying this method using different read and fragment length combinations, the optimal average fragment length for the human transcriptome is around 400-600 nt for coding genes and 150-200 nt for long non-coding RNAs. The optimal read length is the largest one that fits in the fragment length. The potential benefit of combining several libraries to reconstruct the transcriptome is also discussed. Combining two libraries of very different fragment lengths results in a significant improvement in gene identifiability. AVAILABILITY: Code is available in GitHub (https://github.com/JFerrer-B/transcriptome-identifiability). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btab873
View details for PubMedID 34978563
-
AN EQUATION FOR THE IDENTIFICATION OF AVERAGE CAUSAL EFFECT IN NONLINEAR MODELS
STATISTICA SINICA
2022; 32: 539-545
View details for DOI 10.5705/ss.202021.0191
View details for Web of Science ID 000748651700001
-
DeepCAGE: Incorporating transcription factors in genome-wide prediction of chromatin accessibility.
Genomics, proteomics & bioinformatics
2022
Abstract
Although computational approaches have been complementing high-throughput biological experiments for the identification of functional regions in the human genome, it remains a great challenge to systematically decipher interactions between transcription factors and regulatory elements to achieve interpretable annotations of chromatin accessibility across diverse cellular contexts. To solve this problem, we propose DeepCAGE, a deep learning framework that integrates sequence information and binding status of transcription factors, for the accurate prediction of chromatin accessible regions at a genome-wide scale in a variety of cell types. DeepCAGE takes advantage of a densely connected deep convolutional neural network architecture to automatically learn sequence signatures of known chromatin accessible regions and then incorporates such features with expression levels and binding activities of human core transcription factors to predict novel chromatin accessible regions. In a series of systematic comparisons with existing methods, DeepCAGE exhibits superior performance in not only the classification but also the regression of chromatin accessibility signals. In a detailed analysis of transcription factor activities, DeepCAGE successfully extracts novel binding motifs and measures the contribution of a transcription factor to the regulation with respect to a specific locus in a certain cell type. When applied to whole-genome sequencing data analysis, our method successfully prioritizes putative deleterious variants underlying a human complex trait and thus provides insights into the understanding of disease-associated genetic variants. DeepCAGE can be downloaded from https://github.com/kimmo1019/DeepCAGE.
View details for DOI 10.1016/j.gpb.2021.08.015
View details for PubMedID 35293310
-
Collaborative Multilabel Classification
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2021
View details for DOI 10.1080/01621459.2021.1961783
View details for Web of Science ID 000693298100001
-
Sc-compReg enables the comparison of gene regulatory networks between conditions using single-cell data.
Nature communications
2021; 12 (1): 4763
Abstract
The comparison of gene regulatory networks between diseased versus healthy individuals or between two different treatments is an important scientific problem. Here, we propose sc-compReg as a method for the comparative analysis of gene expression regulatory networks between two conditions using single cell gene expression (scRNA-seq) and single cell chromatin accessibility data (scATAC-seq). Our software, sc-compReg, can be used as a stand-alone package that provides joint clustering and embedding of the cells from both scRNA-seq and scATAC-seq, and the construction of differential regulatory networks across two conditions. We apply the method to compare the gene regulatory networks of an individual with chronic lymphocytic leukemia (CLL) versus a healthy control. The analysis reveals a tumor-specific B cell subpopulation in the CLL patient and identifies TOX2 as a potential regulator of this subpopulation.
View details for DOI 10.1038/s41467-021-25089-2
View details for PubMedID 34362918
-
Dynamic chromatin regulatory landscape of human CAR T cell exhaustion.
Proceedings of the National Academy of Sciences of the United States of America
2021; 118 (30)
Abstract
Dysfunction in T cells limits the efficacy of cancer immunotherapy. We profiled the epigenome, transcriptome, and enhancer connectome of exhaustion-prone GD2-targeting HA-28z chimeric antigen receptor (CAR) T cells and control CD19-targeting CAR T cells, which present less exhaustion-inducing tonic signaling, at multiple points during their ex vivo expansion. We found widespread, dynamic changes in chromatin accessibility and three-dimensional (3D) chromosome conformation preceding changes in gene expression, notably at loci proximal to exhaustion-associated genes such as PDCD1, CTLA4, and HAVCR2, and increased DNA motif access for AP-1 family transcription factors, which are known to promote exhaustion. Although T cell exhaustion has been studied in detail in mice, we find that the regulatory networks of T cell exhaustion differ between species and involve distinct loci of accessible chromatin and cis-regulated target genes in human CAR T cell exhaustion. Deletion of exhaustion-specific candidate enhancers of PDCD1 suppresses the expression of PD-1 in an in vitro model of T cell dysfunction and in HA-28z CAR T cells, suggesting enhancer editing as a path forward in improving cancer immunotherapy.
View details for DOI 10.1073/pnas.2104758118
View details for PubMedID 34285077
-
Sensitive detection of tumor mutations from blood and its application to immunotherapy prognosis.
Nature communications
2021; 12 (1): 4172
Abstract
Cell-free DNA (cfDNA) is attractive for many applications, including detecting cancer, identifying the tissue of origin, and monitoring. A fundamental task underlying these applications is SNV calling from cfDNA, which is hindered by the very low tumor content. Thus sensitive and accurate detection of low-frequency mutations (<5%) remains challenging for existing SNV callers. Here we present cfSNV, a method incorporating multi-layer error suppression and hierarchical mutation calling, to address this challenge. Furthermore, by leveraging cfDNA's comprehensive coverage of tumor clonal landscape, cfSNV can profile mutations in subclones. In both simulated and real patient data, cfSNV outperforms existing tools in sensitivity while maintaining high precision. cfSNV enhances the clinical utilities of cfDNA by improving mutation detection performance in medium-depth sequencing data, therefore making Whole-Exome Sequencing a viable option. As an example, we demonstrate that the tumor mutation profile from cfDNA WES data can provide an effective biomarker to predict immunotherapy outcomes.
View details for DOI 10.1038/s41467-021-24457-2
View details for PubMedID 34234141
-
MIMIC: an optimization method to identify cell type-specific marker panel for cell sorting.
Briefings in bioinformatics
2021
Abstract
Multi-omics data allow us to select a small set of informative markers for the discrimination of specific cell types and study of cellular heterogeneity. However, it is often challenging to choose an optimal marker panel from the high-dimensional molecular profiles for a large number of cell types. Here, we propose a method called Mixed Integer programming Model to Identify Cell type-specific marker panel (MIMIC). MIMIC maintains the hierarchical topology among different cell types and simultaneously maximizes the specificity of a fixed number of selected markers. MIMIC was benchmarked on the mouse ENCODE RNA-seq dataset, with 29 diverse tissues, for 43 surface markers (SMs) and 1345 transcription factors (TFs). MIMIC could select biologically meaningful markers and is robust for different accuracy criteria. It shows advantages over the standard single gene-based approaches and widely used dimensional reduction methods, such as multidimensional scaling and t-SNE, both in accuracy and in biological interpretation. Furthermore, the combination of SMs and TFs achieves better specificity than SMs or TFs alone. Applying MIMIC to a large collection of 641 RNA-seq samples covering 231 cell types identifies a panel of TFs and SMs that reveal the modularity of cell type association networks. Finally, the scalability of MIMIC is demonstrated by selecting enhancer markers from mouse ENCODE data. MIMIC is freely available at https://github.com/MengZou1/MIMIC.
View details for DOI 10.1093/bib/bbab235
View details for PubMedID 34180954
-
Simultaneous deep generative modeling and clustering of single cell genomic data.
Nature machine intelligence
2021; 3 (6): 536-544
Abstract
Recent advances in single-cell technologies, including single-cell ATAC-seq (scATAC-seq), have enabled large-scale profiling of the chromatin accessibility landscape at the single cell level. However, the characteristics of scATAC-seq data, including high sparsity and high dimensionality, have greatly complicated the computational analysis. Here, we proposed scDEC, a computational tool for single cell ATAC-seq analysis with deep generative neural networks. scDEC is built on a pair of generative adversarial networks (GANs), and is capable of learning the latent representation and inferring the cell labels, simultaneously. In a series of experiments, scDEC demonstrates superior performance over other tools in scATAC-seq analysis across multiple datasets and experimental settings. In downstream applications, we demonstrated that the generative power of scDEC helps to infer the trajectory and intermediate state of cells during differentiation and the latent features learned by scDEC can potentially reveal both biological cell types and within-cell-type variations. We also showed that it is possible to extend scDEC for the integrative analysis of multi-modal single cell data.
View details for DOI 10.1038/s42256-021-00333-y
View details for PubMedID 34179690
View details for PubMedCentralID PMC8223760
-
Modeling regulatory network topology improves genome-wide analyses of complex human traits.
Nature communications
2021; 12 (1): 2851
Abstract
Genome-wide association studies (GWAS) have cataloged many significant associations between genetic variants and complex traits. However, most of these findings have unclear biological significance, because they often have small effects and occur in non-coding regions. Integration of GWAS with gene regulatory networks addresses both issues by aggregating weak genetic signals within regulatory programs. Here we develop a Bayesian framework that integrates GWAS summary statistics with regulatory networks to infer genetic enrichments and associations simultaneously. Our method improves upon existing approaches by explicitly modeling network topology to assess enrichments, and by automatically leveraging enrichments to identify associations. Applying this method to 18 human traits and 38 regulatory networks shows that genetic signals of complex traits are often enriched in interconnections specific to trait-relevant cell types or tissues. Prioritizing variants within enriched networks identifies known and previously undescribed trait-associated genes revealing biological and therapeutic insights.
View details for DOI 10.1038/s41467-021-22588-0
View details for PubMedID 33990562
-
Simultaneous deep generative modelling and clustering of single-cell genomic data
NATURE MACHINE INTELLIGENCE
2021
View details for DOI 10.1038/s42256-021-00333-y
View details for Web of Science ID 000649431300002
-
Density estimation using deep generative neural networks.
Proceedings of the National Academy of Sciences of the United States of America
2021; 118 (15)
Abstract
Density estimation is one of the fundamental problems in both statistics and machine learning. In this study, we propose Roundtrip, a computational framework for general-purpose density estimation based on deep generative neural networks. Roundtrip retains the generative power of deep generative models, such as generative adversarial networks (GANs) while it also provides estimates of density values, thus supporting both data generation and density estimation. Unlike previous neural density estimators that put stringent conditions on the transformation from the latent space to the data space, Roundtrip enables the use of much more general mappings where target density is modeled by learning a manifold induced from a base density (e.g., Gaussian distribution). Roundtrip provides a statistical framework for GAN models where an explicit evaluation of density values is feasible. In numerical experiments, Roundtrip exceeds state-of-the-art performance in a diverse range of density estimation tasks.
View details for DOI 10.1073/pnas.2101344118
View details for PubMedID 33833061
-
hReg-CNCC reconstructs a regulatory network in human cranial neural crest cells and annotates variants in a developmental context.
Communications biology
2021; 4 (1): 442
Abstract
Cranial Neural Crest Cells (CNCC) originate at the cephalic region from forebrain, midbrain and hindbrain, migrate into the developing craniofacial region, and subsequently differentiate into multiple cell types. The entire specification, delamination, migration, and differentiation process is highly regulated and abnormalities during this craniofacial development cause birth defects. To better understand the molecular networks underlying CNCC, we integrate paired gene expression & chromatin accessibility data and reconstruct the genome-wide human Regulatory network of CNCC (hReg-CNCC). Consensus optimization predicts high-quality regulations and reveals the architecture of upstream, core, and downstream transcription factors that are associated with functions of neural plate border, specification, and migration. hReg-CNCC allows us to annotate genetic variants of human facial GWAS and disease traits with associated cis-regulatory modules, transcription factors, and target genes. For example, we reveal the distal and combinatorial regulation of multiple SNPs to core TF ALX1 and associations to facial distances and cranial rare disease. In addition, hReg-CNCC connects the DNA sequence differences in evolution, such as ultra-conserved elements and human accelerated regions, with gene expression and phenotype. hReg-CNCC provides a valuable resource to interpret genetic variants as early as gastrulation during embryonic development. The network resources are available at https://github.com/AMSSwanglab/hReg-CNCC .
View details for DOI 10.1038/s42003-021-01970-0
View details for PubMedID 33824393
-
Coupled Generation
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2020
View details for DOI 10.1080/01621459.2020.1844719
View details for Web of Science ID 000604694300001
-
Mini-Batch Metropolis-Hastings With Reversible SGLD Proposal
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2020
View details for DOI 10.1080/01621459.2020.1782222
View details for Web of Science ID 000569188700001
-
Time course regulatory analysis based on paired expression and chromatin accessibility data.
Genome research
2020
Abstract
The time course experiment is a widely used design in the study of cellular processes such as differentiation or response to stimuli. In this paper, we propose TimeReg (Time Course Regulatory Analysis) as a method for the analysis of gene regulatory networks based on paired gene expression and chromatin accessibility data from the time course. TimeReg can be used to prioritize regulatory elements, to extract core regulatory modules at each time point, to identify key regulators driving changes of the cellular state, and to causally connect the modules across different time points. We applied the method to analyze paired chromatin accessibility and gene expression data from a retinoic acid (RA)-induced mouse embryonic stem cell (mESC) differentiation experiment. The analysis identified 57,048 novel regulatory elements, regulating cerebellar development, synapse assembly and hindbrain morphogenesis, which substantially extended our knowledge of cis-regulatory elements during the differentiation. Using single cell RNA-seq data, we showed that the core regulatory modules can reflect the properties of different subpopulations of cells. Finally, the driver regulators are shown to be important in clarifying the relations between modules across adjacent time points. As a second example, applying our method to time-course data from Ascl1-induced direct reprogramming of fibroblasts to neurons identified Id1/2 as driver regulators of the early stage of reprogramming.
View details for DOI 10.1101/gr.257063.119
View details for PubMedID 32188700
-
Integrated functional genomic analyses of Klinefelter and Turner syndromes reveal global network effects of altered X chromosome dosage.
Proceedings of the National Academy of Sciences of the United States of America
2020
Abstract
In both Turner syndrome (TS) and Klinefelter syndrome (KS), copy number aberrations of the X chromosome lead to various developmental symptoms. We report a comparative analysis of TS vs. KS regarding differences at the genomic network level measured in primary samples by analyzing gene expression, DNA methylation, and chromatin conformation. X-chromosome inactivation (XCI) silences transcription from one X chromosome in female mammals, on which most genes are inactive, and some genes escape from XCI. In TS, almost all differentially expressed escape genes are down-regulated but most differentially expressed inactive genes are up-regulated. In KS, differentially expressed escape genes are up-regulated while the majority of inactive genes appear unchanged. Interestingly, 94 differentially expressed genes (DEGs) overlapped between the TS vs. female and KS vs. male comparisons, and these almost uniformly display expression changes in opposite directions. DEGs on the X chromosome and the autosomes are coexpressed in both syndromes, indicating that there are molecular ripple effects of the changes in X chromosome dosage. Six potential candidate genes (RPS4X, SEPT6, NKRF, CXorf57, NAA10, and FLNA) for KS are identified on Xq, as well as candidate central genes on Xp for TS. Only promoters of inactive genes are differentially methylated in both syndromes while escape gene promoters remain unchanged. The intrachromosomal contact map of the X chromosome in TS exhibits the structure of an active X chromosome. The discovery of shared DEGs indicates the existence of common molecular mechanisms for gene regulation in TS and KS that transmit the gene dosage changes to the transcriptome.
View details for DOI 10.1073/pnas.1910003117
View details for PubMedID 32071206
-
Chromatin accessibility landscape and regulatory network of high-altitude hypoxia adaptation.
Nature communications
2020; 11 (1): 4928
Abstract
High-altitude adaptation of Tibetans represents a remarkable case of natural selection during recent human evolution. Previous genome-wide scans found many non-coding variants under selection, suggesting a pressing need to understand the functional role of non-coding regulatory elements (REs). Here, we generate time courses of paired ATAC-seq and RNA-seq data on cultured HUVECs under hypoxic and normoxic conditions. We further develop a variant interpretation methodology (vPECA) to identify active selected REs (ASREs) and associated regulatory network. We discover three causal SNPs of EPAS1, the key adaptive gene for Tibetans. These SNPs decrease the accessibility of ASREs with weakened binding strength of relevant TFs, and cooperatively down-regulate EPAS1 expression. We further construct the downstream network of EPAS1, elucidating its roles in hypoxic response and angiogenesis. Collectively, we provide a systematic approach to interpret phenotype-associated noncoding variants in proper cell types and relevant dynamic conditions, to model their impact on gene regulation.
View details for DOI 10.1038/s41467-020-18638-8
View details for PubMedID 33004791
-
A method for scoring the cell type-specific impacts of noncoding variants in personal genomes.
Proceedings of the National Academy of Sciences of the United States of America
2020
Abstract
A person's genome typically contains millions of variants which represent the differences between this personal genome and the reference human genome. The interpretation of these variants, i.e., the assessment of their potential impact on a person's phenotype, is currently of great interest in human genetics and medicine. We have developed a prioritization tool called OpenCausal which takes as inputs 1) a personal genome and 2) a reference context-specific TF expression profile and returns a list of noncoding variants prioritized according to their impact on chromatin accessibility for any given genomic region of interest. We applied OpenCausal to 6,430 samples across 18 tissues derived from the GTEx project and found that the variants prioritized by OpenCausal are highly enriched for eQTLs and caQTLs. We further propose a strategy to integrate the predicted open scores with genome-wide association studies (GWAS) data to prioritize putative causal variants and regulatory elements for a given risk locus (i.e., fine-mapping analysis). As an initial example, we applied this method to a GWAS dataset of human height and found that the prioritized putative variants and elements are correlated with the phenotype (i.e., heights of individuals) better than others.
View details for DOI 10.1073/pnas.1922703117
View details for PubMedID 32817564
-
Xrare: a machine learning method jointly modeling phenotypes and genetic evidence for rare disease diagnosis
GENETICS IN MEDICINE
2019; 21 (9): 2126–34
View details for DOI 10.1038/s41436-019-0439-8
View details for Web of Science ID 000484400800023
-
DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning.
Nucleic acids research
2019
Abstract
Interactions between regulatory elements are of crucial importance for the understanding of transcriptional regulation and the interpretation of disease mechanisms. The Hi-C technique has been developed for genome-wide detection of chromatin contacts. However, unless extremely deep sequencing is performed on a very large number of input cells, which is technically limited and expensive, current Hi-C experiments do not have high enough resolution to resolve contacts between regulatory elements. Here, we develop DeepTACT, a bootstrapping deep learning model, to integrate genome sequences and chromatin accessibility data for the prediction of chromatin contacts between regulatory elements. DeepTACT can infer not only promoter-enhancer interactions, but also promoter-promoter interactions. In tests based on promoter capture Hi-C data, DeepTACT shows better performance over existing methods. DeepTACT analysis also identifies a class of hub promoters, which are correlated with transcriptional activation across cell lines, enriched in housekeeping genes, functionally related to fundamental biological processes, and capable of reflecting cell similarity. Finally, the utility of chromatin contacts in the study of human diseases is illustrated by the association of IFNA2 to coronary artery disease via an integrative analysis of GWAS data and interactions predicted by DeepTACT.
View details for PubMedID 30869141
-
DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data.
Nature communications
2019; 10 (1): 4613
Abstract
Characterizing and interpreting heterogeneous mixtures at the cellular level is a critical problem in genomics. Single-cell assays offer an opportunity to resolve cellular level heterogeneity, e.g., scRNA-seq enables single-cell expression profiling, and scATAC-seq identifies active regulatory elements. Furthermore, while scHi-C can measure the chromatin contacts (i.e., loops) between active regulatory elements and target genes in single cells, bulk HiChIP can measure such contacts at a higher resolution. In this work, we introduce DC3 (De-Convolution and Coupled-Clustering) as a method for the joint analysis of various bulk and single-cell data such as HiChIP, RNA-seq and ATAC-seq from the same heterogeneous cell population. DC3 can simultaneously identify distinct subpopulations, assign single cells to the subpopulations (i.e., clustering) and de-convolve the bulk data into subpopulation-specific data. The subpopulation-specific profiles of gene expression, chromatin accessibility and enhancer-promoter contact obtained by DC3 provide a comprehensive characterization of the gene regulatory system in each subpopulation.
View details for DOI 10.1038/s41467-019-12547-1
View details for PubMedID 31601804
-
Extensive and deep sequencing of the Venter/HuRef genome for developing and benchmarking genome analysis tools
SCIENTIFIC DATA
2018; 5
View details for DOI 10.1038/sdata.2018.261
View details for Web of Science ID 000453585800001
-
Extensive and deep sequencing of the Venter/HuRef genome for developing and benchmarking genome analysis tools.
Scientific data
2018; 5: 180261
Abstract
We produced an extensive collection of deep re-sequencing datasets for the Venter/HuRef genome using the Illumina massively-parallel DNA sequencing platform. The original Venter genome sequence is a very-high quality phased assembly based on Sanger sequencing. Therefore, researchers developing novel computational tools for the analysis of human genome sequence variation for the dominant Illumina sequencing technology can test and hone their algorithms by making variant calls from these Venter/HuRef datasets and then immediately confirm the detected variants in the Sanger assembly, freeing them of the need for further experimental validation. This process also applies to implementing and benchmarking existing genome analysis pipelines. We prepared and sequenced 200bp and 350bp short-insert whole-genome sequencing libraries (sequenced to 100x and 40x genomic coverages respectively) as well as 2kb, 5kb, and 12kb mate-pair libraries (49x, 122x, and 145x physical coverages respectively). Lastly, we produced a linked-read library (128x physical coverage) from which we also performed haplotype phasing.
View details for PubMedID 30561434
-
Towards high performance data analytic on heterogeneous many-core systems: A study on Bayesian Sequential Partitioning
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
2018; 122: 36–50
View details for DOI 10.1016/j.jpdc.2018.07.011
View details for Web of Science ID 000448232400004
-
CRISPhieRmix: a hierarchical mixture model for CRISPR pooled screens.
Genome biology
2018; 19 (1): 159
Abstract
Pooled CRISPR screens allow researchers to interrogate genetic causes of complex phenotypes at the genome-wide scale and promise higher specificity and sensitivity compared to competing technologies. Unfortunately, two problems exist, particularly for CRISPRi/a screens: variability in guide efficiency and large rare off-target effects. We present a method, CRISPhieRmix, that resolves these issues by using a hierarchical mixture model with a broad-tailed null distribution. We show that CRISPhieRmix allows for more accurate and powerful inferences in large-scale pooled CRISPRi/a screens. We discuss key issues in the analysis and design of screens, particularly the number of guides needed for faithful full discovery.
View details for PubMedID 30296940
-
CRISPR Activation Screens Systematically Identify Factors that Drive Neuronal Fate and Reprogramming.
Cell stem cell
2018
Abstract
Comprehensive identification of factors that can specify neuronal fate could provide valuable insights into lineage specification and reprogramming, but systematic interrogation of transcription factors, and their interactions with each other, has proven technically challenging. We developed a CRISPR activation (CRISPRa) approach to systematically identify regulators of neuronal-fate specification. We activated expression of all endogenous transcription factors and other regulators via a pooled CRISPRa screen in embryonic stem cells, revealing genes including epigenetic regulators such as Ezh2 that can induce neuronal fate. Systematic CRISPR-based activation of factor pairs allowed us to generate a genetic interaction map for neuronal differentiation, with confirmation of top individual and combinatorial hits as bona fide inducers of neuronal fate. Several factor pairs could directly reprogram fibroblasts into neurons, which shared similar transcriptional programs with endogenous neurons. This study provides an unbiased discovery approach for systematic identification of genes that drive cell-fate acquisition.
View details for PubMedID 30318302
-
Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations.
Proceedings of the National Academy of Sciences of the United States of America
2018
Abstract
When different types of functional genomics data are generated on single cells from different samples of cells from the same heterogeneous population, the clustering of cells in the different samples should be coupled. We formulate this "coupled clustering" problem as an optimization problem and propose the method of coupled nonnegative matrix factorizations (coupled NMF) for its solution. The method is illustrated by the integrative analysis of single-cell RNA-sequencing (RNA-seq) and single-cell ATAC-sequencing (ATAC-seq) data.
View details for PubMedID 29987051
-
Unsupervised clustering and epigenetic classification of single cells
NATURE COMMUNICATIONS
2018; 9: 2410
Abstract
Characterizing epigenetic heterogeneity at the cellular level is a critical problem in the modern genomics era. Assays such as single cell ATAC-seq (scATAC-seq) offer an opportunity to interrogate cellular level epigenetic heterogeneity through patterns of variability in open chromatin. However, these assays exhibit technical variability that complicates clear classification and cell type identification in heterogeneous populations. We present scABC, an R package for the unsupervised clustering of single-cell epigenetic data, to classify scATAC-seq data and discover regions of open chromatin specific to cell identity.
View details for PubMedID 29925875
-
A 1.86mJ/Gb/Query Bit-Plane Payload Machine Learning Processor in 90nm CMOS
IEEE. 2018
View details for Web of Science ID 000450113800042
-
DIABETIC RETINOPATHY DETECTION BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS
IEEE. 2018: 1030–34
View details for Web of Science ID 000446384601045
-
CORRELATION-BASED FACE DETECTION FOR RECOGNIZING FACES IN VIDEOS
IEEE. 2018: 3101–5
View details for Web of Science ID 000446384603054
-
CONFNET: PREDICT WITH CONFIDENCE
IEEE. 2018: 2921–25
View details for Web of Science ID 000446384603018
-
Challenges and recommendations for epigenomics in precision health
NATURE BIOTECHNOLOGY
2017; 35 (12): 1128–32
View details for PubMedID 29220033
-
Simultaneous inference of phenotype-associated genes and relevant tissues from GWAS data via Bayesian integration of multiple tissue-specific gene networks
JOURNAL OF MOLECULAR CELL BIOLOGY
2017; 9 (6): 436–52
Abstract
Although genome-wide association studies (GWAS) have successfully identified thousands of genomic loci associated with hundreds of complex traits in the past decade, the debate about such problems as missing heritability and weak interpretability has been appealing for effective computational methods to facilitate the advanced analysis of the vast volume of existing and anticipated genetic data. Towards this goal, gene-level integrative GWAS analysis with the assumption that genes associated with a phenotype tend to be enriched in biological gene sets or gene networks has recently attracted much attention, due to such advantages as straightforward interpretation, less multiple testing burdens, and robustness across studies. However, existing methods in this category usually exploit non-tissue-specific gene networks and thus lack the ability to utilize informative tissue-specific characteristics. To overcome this limitation, we proposed a Bayesian approach called SIGNET (Simultaneously Inference of GeNEs and Tissues) to integrate GWAS data and multiple tissue-specific gene networks for the simultaneous inference of phenotype-associated genes and relevant tissues. Through extensive simulation studies, we showed the effectiveness of our method in finding both associated genes and relevant tissues for a phenotype. In applications to real GWAS data of 14 complex phenotypes, we demonstrated the power of our method in both deciphering genetic basis and discovering biological insights of a phenotype. With this understanding, we expect to see SIGNET as a valuable tool for integrative GWAS analysis, thereby boosting the prevention, diagnosis, and treatment of human inherited diseases and eventually facilitating precision medicine.
View details for PubMedID 29300920
-
Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis
NATURE COMMUNICATIONS
2017; 8: 59
Abstract
RNA-sequencing (RNA-seq) is an essential technique for transcriptome studies, and hundreds of analysis tools have been developed since its debut. Although recent efforts have attempted to assess the latest available tools, they have not evaluated the analysis workflows comprehensively to unleash the power within RNA-seq. Here we conduct an extensive study analysing a broad spectrum of RNA-seq workflows. Surpassing the expression analysis scope, our work also includes assessment of RNA variant-calling, RNA editing and RNA fusion detection techniques. Specifically, we examine both short- and long-read RNA-seq technologies, 39 analysis tools resulting in ~120 combinations, and ~490 analyses involving 15 samples with a variety of germline, cancer and stem cell data sets. We report the performance and propose a comprehensive RNA-seq analysis protocol, named RNACocktail, along with a computational pipeline achieving high accuracy. Validation on different samples reveals that our proposed protocol could help researchers extract more biologically relevant predictions by broad analysis of the transcriptome. RNA-seq is widely used for transcriptome analysis. Here, the authors analyse a wide spectrum of RNA-seq workflows and present a comprehensive analysis protocol named RNACocktail as well as a computational pipeline leveraging the widely used tools for accurate RNA-seq analysis.
View details for PubMedID 28680106
-
COSINE: non-seeding method for mapping long noisy sequences.
Nucleic acids research
2017
Abstract
Third generation sequencing (TGS) technologies are highly promising, but the long and noisy reads from TGS are difficult to align using existing algorithms. Here, we present COSINE, a conceptually new method designed specifically for aligning long reads contaminated by a high level of errors. COSINE computes the context similarity of two stretches of nucleobases given the similarity over distributions of their short k-mers (k = 3-4) along the sequences. The results on simulated and real data show that COSINE achieves high sensitivity and specificity under a wide range of read accuracies. When the error rate is high, COSINE can offer substantial advantages over existing alignment methods.
View details for DOI 10.1093/nar/gkx511
View details for PubMedID 28586438
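The idea of comparing two sequence stretches through the distributions of their short k-mers can be illustrated with a minimal sketch. This is not the COSINE implementation: the function names `kmer_profile`, `cosine_similarity` and `context_similarity` are hypothetical, and plain cosine similarity over overlapping k-mer counts is an assumed stand-in for the paper's context-similarity measure.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Count the overlapping k-mers of a nucleotide sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse k-mer count vectors."""
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a stretch shorter than k has no k-mers
    return dot / (norm_a * norm_b)


def context_similarity(x: str, y: str, k: int = 3) -> float:
    """Hypothetical helper: similarity of two stretches via k-mer profiles."""
    return cosine_similarity(kmer_profile(x, k), kmer_profile(y, k))
```

Because the score depends only on k-mer counts and not on exact base-by-base agreement, a few substitutions or indels change it gradually, which is one plausible reason such measures suit noisy long reads.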
-
Predicting transcription factor binding motifs from DNA-binding domains, chromatin accessibility and gene expression data.
Nucleic acids research
2017; 45 (10): 5666-5677
Abstract
Transcription factors (TFs) play crucial roles in regulating gene expression through interactions with specific DNA sequences. Recently, the sequence motifs of almost 400 human TFs have been identified using high-throughput SELEX sequencing. However, there remain a large number of TFs (∼800) with no high-throughput-derived binding motifs. Computational methods capable of associating known motifs to such TFs will avoid tremendous experimental efforts and enable deeper understanding of transcriptional regulatory functions. We present a method to associate known motifs to TFs (MATLAB code is available in Supplementary Materials). Our method is based on a probabilistic framework that not only exploits DNA-binding domains and specificities, but also integrates open chromatin, gene expression and genomic data to accurately infer monomeric and homodimeric binding motifs. Our analysis resulted in the assignment of motifs to 200 TFs with no SELEX-derived motifs, roughly a 50% increase compared to the existing coverage.
View details for DOI 10.1093/nar/gkx358
View details for PubMedID 28472398
-
Modeling gene regulation from paired expression and chromatin accessibility data.
Proceedings of the National Academy of Sciences of the United States of America
2017
Abstract
The rapid increase of genome-wide datasets on gene expression, chromatin states, and transcription factor (TF) binding locations offers an exciting opportunity to interpret the information encoded in genomes and epigenomes. This task can be challenging as it requires joint modeling of context-specific activation of cis-regulatory elements (REs) and the effects on transcription of associated regulatory factors. To meet this challenge, we propose a statistical approach based on paired expression and chromatin accessibility (PECA) data across diverse cellular contexts. In our approach, we model (i) the localization to REs of chromatin regulators (CRs) based on their interaction with sequence-specific TFs, (ii) the activation of REs due to CRs that are localized to them, and (iii) the effect of TFs bound to activated REs on the transcription of target genes (TGs). The transcriptional regulatory network inferred by PECA provides a detailed view of how trans- and cis-regulatory elements work together to affect gene expression in a context-specific manner. We illustrate the feasibility of this approach by analyzing paired expression and accessibility data from the mouse Encyclopedia of DNA Elements (ENCODE) and explore various applications of the resulting model.
View details for DOI 10.1073/pnas.1704553114
View details for PubMedID 28576882
-
Scalable multi-sample single-cell data analysis by Partition-Assisted Clustering and Multiple Alignments of Networks.
PLoS computational biology
2017; 13 (12): e1005875
Abstract
Mass cytometry (CyTOF) has greatly expanded the capability of cytometry. It is now easy to generate multiple CyTOF samples in a single study, with each sample containing single-cell measurement on 50 markers for more than hundreds of thousands of cells. Current methods do not adequately address the issues concerning combining multiple samples for subpopulation discovery, and these issues can be quickly and dramatically amplified with increasing number of samples. To overcome this limitation, we developed Partition-Assisted Clustering and Multiple Alignments of Networks (PAC-MAN) for the fast automatic identification of cell populations in CyTOF data closely matching that of expert manual-discovery, and for alignments between subpopulations across samples to define dataset-level cellular states. PAC-MAN is computationally efficient, allowing the management of very large CyTOF datasets, which are increasingly common in clinical studies and cancer studies that monitor various tissue samples for each subject.
View details for PubMedID 29281633
-
Convergence rates of a partition based Bayesian multivariate density estimation method
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
View details for Web of Science ID 000452649404078
-
Phased Genome Sequencing Through Chromosome Sorting.
Methods in molecular biology (Clifton, N.J.)
2017; 1551: 171-188
Abstract
Phase information of an individual genome provides fundamentally useful genetic information for the understanding of genome function, phenotype, and disease. With the development of new sequencing technology, much interest has been focused on the challenges in obtaining long-range phase information. Here, we present the detailed protocol for a method capable of generating genomic sequences completely phased across the entire chromosome through FACS-mediated chromosome sorting and next generation sequencing, known as Phase-seq.
View details for DOI 10.1007/978-1-4939-6750-6_10
View details for PubMedID 28138847
-
Simultaneous dimension reduction and adjustment for confounding variation
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2016; 113 (51): 14662-14667
Abstract
Dimension reduction methods are commonly applied to high-throughput biological datasets. However, the results can be hindered by confounding factors, either biological or technical in origin. In this study, we extend principal component analysis (PCA) to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We show that AC-PCA can adjust for (i) variations across individual donors present in a human brain exon array dataset and (ii) variations of different species in a model organism ENCODE RNA sequencing dataset. Our approach is able to recover the anatomical structure of neocortical regions and to capture the shared variation among species during embryonic development. For gene selection purposes, we extend AC-PCA with sparsity constraints and propose and implement an efficient algorithm. The methods developed in this paper can also be applied to more general settings. The R package and MATLAB source code are available at https://github.com/linzx06/AC-PCA.
View details for DOI 10.1073/pnas.1617317113
View details for PubMedID 27930330
-
Modeling the causal regulatory network by integrating chromatin accessibility and transcriptome data
NATIONAL SCIENCE REVIEW
2016; 3 (2): 240-251
View details for DOI 10.1093/nsr/nww025
View details for Web of Science ID 000379759700025
-
Stable 5-Hydroxymethylcytosine (5hmC) Acquisition Marks Gene Activation During Chondrogenic Differentiation
JOURNAL OF BONE AND MINERAL RESEARCH
2016; 31 (3): 524-534
Abstract
Regulation of gene expression changes during chondrogenic differentiation by DNA methylation and demethylation is little understood. Methylated cytosines (5mC) are oxidized by the ten-eleven-translocation (TET) proteins to 5-hydroxymethylcytosines (5hmC), 5-formylcytosines (5fC) and 5-carboxylcytosines (5caC) eventually leading to a replacement by unmethylated cytosines (C) i.e. DNA demethylation. Additionally, 5hmC is stable and acts as an epigenetic mark by itself. Here, we report that global changes in 5hmC mark chondrogenic differentiation in vivo and in vitro. Tibia anlagen and growth plate analyses during limb development at mouse embryonic days E 11.5, 13.5 and 17.5 showed dynamic changes in 5hmC levels in the differentiating chondrocytes. A similar increase in 5hmC levels was observed in the ATDC5 chondroprogenitor cell line accompanied by increased expression of the TET proteins during in vitro differentiation. Loss of TET1 in ATDC5 decreased 5hmC levels and impaired differentiation, demonstrating a functional role for TET1-mediated 5hmC dynamics in chondrogenic differentiation. Global analyses of the 5hmC-enriched sequences during early and late chondrogenic differentiation identified 5hmC distribution to be enriched in the regulatory regions of genes preceding the transcription start site (TSS) as well as in the gene bodies. Stable gains in 5hmC were observed in specific subsets of genes including genes associated with cartilage development and in chondrogenic lineage-specific genes. 5hmC gains in regulatory promoter and enhancer regions as well as in gene bodies were strongly associated with activated but not repressed genes, indicating a potential regulatory role for DNA hydroxymethylation in chondrogenic gene expression.
View details for DOI 10.1002/jbmr.2711
View details for Web of Science ID 000373596800006
View details for PubMedID 26363184
-
Computational Aspects of Optional Polya Tree
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2016; 25 (1): 301-320
View details for DOI 10.1080/10618600.2014.1002927
View details for Web of Science ID 000372129900016
-
The primate-specific noncoding RNA HPAT5 regulates pluripotency during human preimplantation development and nuclear reprogramming
NATURE GENETICS
2016; 48 (1): 44-?
View details for DOI 10.1038/ng.3449
View details for Web of Science ID 000367255300012
-
A Hardware-Efficient Sigmoid Function With Adjustable Precision for a Neural Network System
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS
2015; 62 (11): 1073-1077
View details for DOI 10.1109/TCSII.2015.2456531
View details for Web of Science ID 000365988500013
-
Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing
NUCLEIC ACIDS RESEARCH
2015; 43 (18)
Abstract
We developed an innovative hybrid sequencing approach, IDP-fusion, to detect fusion genes, determine fusion sites and identify and quantify fusion isoforms. IDP-fusion is the first method to study gene fusion events by integrating Third Generation Sequencing long reads and Second Generation Sequencing short reads. We applied IDP-fusion to PacBio data and Illumina data from the MCF-7 breast cancer cells. Compared with the existing tools, IDP-fusion detects fusion genes at higher precision and a very low false positive rate. The results show that IDP-fusion will be useful for unraveling the complexity of multiple fusion splices and fusion isoforms within tumorigenesis-relevant fusion genes.
View details for DOI 10.1093/nar/gkv562
View details for Web of Science ID 000366406500002
View details for PubMedID 26040699
View details for PubMedCentralID PMC4605286
-
An ensemble approach to accurately detect somatic mutations using SomaticSeq
GENOME BIOLOGY
2015; 16
Abstract
SomaticSeq is an accurate somatic mutation detection pipeline implementing a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers, and extracts over 70 individual genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision tree learner to create a classifier for predicting mutation statuses. We validate our results with both synthetic and real data. We report that SomaticSeq is able to achieve better overall accuracy than any individual tool incorporated.
View details for DOI 10.1186/s13059-015-0758-2
View details for Web of Science ID 000361452100004
View details for PubMedID 26381235
View details for PubMedCentralID PMC4574535
-
MetaSV: an accurate and integrative structural-variant caller for next generation sequencing
BIOINFORMATICS
2015; 31 (16): 2741-2744
Abstract
Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous works have attempted to combine different methods, but they still suffer from poor accuracy particularly for insertions. We propose MetaSV, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes. Using simulation and experimental data, we demonstrate the effectiveness of MetaSV across various SV types and sizes. Code in Python is at http://bioinform.github.io/metasv/. Contact: rd@bina.com. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btv204
View details for Web of Science ID 000359666600020
View details for PubMedID 25861968
View details for PubMedCentralID PMC4528635
-
Genome-Wide Mapping of DNA Hydroxymethylation in Osteoarthritic Chondrocytes
ARTHRITIS & RHEUMATOLOGY
2015; 67 (8): 2129-2140
Abstract
Objective: To examine the genome-wide distribution of hydroxymethylated cytosine (5hmC) in osteoarthritic (OA) and normal chondrocytes in order to investigate the effect on OA-specific gene expression. Methods: Cartilage was obtained from OA patients undergoing total knee arthroplasty or from control patients undergoing anterior cruciate ligament reconstruction. Genome-wide sequencing of 5hmC-enriched DNA was performed in a small cohort of normal and OA chondrocytes to identify differentially hydroxymethylated regions (DhMRs) in OA chondrocytes. Data from the genome-wide sequencing of 5hmC-enriched DNA were intersected with global OA gene expression data to define subsets of genes and pathways potentially affected by increased 5hmC levels in OA chondrocytes. Results: A total of 70,591 DhMRs were identified in OA chondrocytes as compared to normal chondrocytes, 44,288 (63%) of which were increased in OA chondrocytes. The majority of DhMRs (66%) were gained in gene bodies. Increased DhMRs were observed in ∼50% of genes previously implicated in OA pathology including MMP3, LRP5, GDF5, and COL11A1. Furthermore, analyses of gene expression data revealed gene body gain of 5hmC appears to be preferentially associated with activated, but not repressed, genes in OA chondrocytes. Conclusion: This study provides the first genome-wide profiling of 5hmC distribution in OA chondrocytes. We had previously reported a global increase in 5hmC levels in OA chondrocytes. Gain of 5hmC in the gene body is found to be characteristic of activated genes in OA chondrocytes, highlighting the influence of 5hmC as an epigenetic mark in OA. In addition, this study identifies multiple OA-associated genes that are potentially regulated either singularly by gain of DNA hydroxymethylation or in combination with loss of DNA methylation.
View details for DOI 10.1002/art.39179
View details for Web of Science ID 000358609300018
View details for PubMedCentralID PMC4519426
-
VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications
BIOINFORMATICS
2015; 31 (9): 1469-1471
Abstract
VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. VarSim simulates and validates a wide range of variants, including single nucleotide variants, small indels and large structural variants. It is an automated, comprehensive compute framework supporting parallel computation and multiple read simulators. Furthermore, we developed a novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics. Thus far, it is the most comprehensive validation tool for secondary analysis in next generation sequencing. Code in Java and Python along with instructions to download the reads and variants is at http://bioinform.github.io/varsim. Contact: rd@bina.com. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btu828
View details for Web of Science ID 000355665800019
View details for PubMedID 25524895
View details for PubMedCentralID PMC4410653
-
Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods.
Scientific reports
2015; 5: 14493
Abstract
A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools.
View details for DOI 10.1038/srep14493
View details for PubMedID 26412485
View details for PubMedCentralID PMC4585973
-
Learning regulatory programs by threshold SVD regression
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2014; 111 (44): 15675-15680
Abstract
We formulate a statistical model for the regulation of global gene expression by multiple regulatory programs and propose a thresholding singular value decomposition (T-SVD) regression method for learning such a model from data. Extensive simulations demonstrate that this method offers improved computational speed and higher sensitivity and specificity over competing approaches. The method is used to analyze microRNA (miRNA) and long noncoding RNA (lncRNA) data from The Cancer Genome Atlas (TCGA) consortium. The analysis yields previously unidentified insights into the combinatorial regulation of gene expression by noncoding RNAs, as well as findings that are supported by evidence from the literature.
View details for DOI 10.1073/pnas.1417808111
View details for Web of Science ID 000344088100029
View details for PubMedID 25331876
View details for PubMedCentralID PMC4226119
-
Human tRNA synthetase catalytic nulls with diverse functions.
Science
2014; 345 (6194): 328-332
Abstract
Genetic efficiency in higher organisms depends on mechanisms to create multiple functions from single genes. To investigate this question for an enzyme family, we chose aminoacyl tRNA synthetases (AARSs). They are exceptional in their progressive and accretive proliferation of noncatalytic domains as the Tree of Life is ascended. Here we report discovery of a large number of natural catalytic nulls (CNs) for each human AARS. Splicing events retain noncatalytic domains while ablating the catalytic domain to create CNs with diverse functions. Each synthetase is converted into several new signaling proteins with biological activities "orthogonal" to that of the catalytic parent. We suggest that splice variants with nonenzymatic functions may be more general, as evidenced by recent findings of other catalytically inactive splice-variant enzymes.
View details for DOI 10.1126/science.1252943
View details for PubMedID 25035493
-
Modeling stochastic noise in gene regulatory systems.
Quantitative biology (Beijing, China)
2014; 2 (1): 1-29
Abstract
The Master equation is considered the gold standard for modeling the stochastic mechanisms of gene regulation in molecular detail, but it is too complex to solve exactly in most cases, so approximation and simulation methods are essential. However, there is still a lack of consensus about the best way to carry these out. To help clarify the situation, we review Master equation models of gene regulation, theoretical approximations based on an expansion method due to N.G. van Kampen and R. Kubo, and simulation algorithms due to D.T. Gillespie and P. Langevin. Expansion of the Master equation shows that for systems with a single stable steady-state, the stochastic model reduces to a deterministic model in a first-order approximation. Additional theory, also due to van Kampen, describes the asymptotic behavior of multistable systems. To support and illustrate the theory and provide further insight into the complex behavior of multistable systems, we perform a detailed simulation study comparing the various approximation and simulation methods applied to synthetic gene regulatory systems with various qualitative characteristics. The simulation studies show that for large stochastic systems with a single steady-state, deterministic models are quite accurate, since the probability distribution of the solution has a single peak tracking the deterministic trajectory whose variance is inversely proportional to the system size. In multistable stochastic systems, large fluctuations can cause individual trajectories to escape from the domain of attraction of one steady-state and be attracted to another, so the system eventually reaches a multimodal probability distribution in which all stable steady-states are represented proportional to their relative stability. However, since the escape time scales exponentially with system size, this process can take a very long time in large systems.
View details for PubMedID 25632368
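The simulation approach attributed to Gillespie in the abstract above can be sketched for the simplest possible case. The model here is an assumption chosen for illustration, not one taken from the paper: a single species produced at constant rate `k_prod` and degraded at rate `k_deg * n`. The function name and parameters are hypothetical.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0, t_end, seed=0):
    """Exact stochastic simulation of an assumed birth-death process.

    Reactions: production (n -> n + 1) with propensity k_prod, and
    degradation (n -> n - 1) with propensity k_deg * n.
    Returns the jump times and the molecule count after each jump.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        birth = k_prod
        death = k_deg * n
        total = birth + death
        if total == 0:
            break  # no reaction can fire any more
        # Waiting time to the next reaction is exponential with rate `total`.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Pick which reaction fires, with probability proportional to propensity.
        if rng.random() * total < birth:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts
```

For this linear model the deterministic rate equation has the single steady state `k_prod / k_deg`, so a long trajectory from a call such as `gillespie_birth_death(10.0, 0.1, 0, 200.0)` would be expected to fluctuate around 100, in line with the single-steady-state behaviour the abstract describes.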
-
Density estimation on multivariate censored data with optional Polya tree
BIOSTATISTICS
2014; 15 (1): 182-195
Abstract
Analyzing the failure times of multiple events is of interest in many fields. Estimating the joint distribution of the failure times in a non-parametric way is not straightforward because some failure times are often right-censored and only known to be greater than observed follow-up times. Although it has been studied, there is no universally optimal solution for this problem. It is still challenging and important to provide alternatives that may be more suitable than existing ones in specific settings. Related problems of the existing methods are not only limited to infeasible computations, but also include the lack of optimality and possible non-monotonicity of the estimated survival function. In this paper, we proposed a non-parametric Bayesian approach for directly estimating the density function of multivariate survival times, where the prior is constructed based on the optional Pólya tree. We investigated several theoretical aspects of the procedure and derived an efficient iterative algorithm for implementing the Bayesian procedure. The empirical performance of the method was examined via extensive simulation studies. Finally, we presented a detailed analysis using the proposed method on the relationship among organ recovery times in severely injured patients. From the analysis, we suggested interesting medical information that can be further pursued in clinics.
View details for DOI 10.1093/biostatistics/kxt025
View details for Web of Science ID 000328286700019
View details for PubMedID 23902636
View details for PubMedCentralID PMC3862208
-
Characterization of the human ESC transcriptome by hybrid sequencing
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2013; 110 (50): E4821-E4830
Abstract
Although transcriptional and posttranscriptional events are detected in RNA-Seq data from second-generation sequencing, full-length mRNA isoforms are not captured. On the other hand, third-generation sequencing, which yields much longer reads, has current limitations of lower raw accuracy and throughput. Here, we combine second-generation sequencing and third-generation sequencing with a custom-designed method for isoform identification and quantification to generate a high-confidence isoform dataset for human embryonic stem cells (hESCs). We report 8,084 RefSeq-annotated isoforms detected as full-length and an additional 5,459 isoforms predicted through statistical inference. Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified. Further characterization of the novel loci indicates that a subset is expressed in pluripotent cells but not in diverse fetal and adult tissues; moreover, their reduced expression perturbs the network of pluripotency-associated genes. Results suggest that gene identification, even in well-characterized human cell lines and tissues, is likely far from complete.
View details for DOI 10.1073/pnas.1320101110
View details for Web of Science ID 000328061700004
View details for PubMedID 24282307
View details for PubMedCentralID PMC3864310
-
Multivariate Density Estimation by Bayesian Sequential Partitioning
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2013; 108 (504): 1402-1410
View details for DOI 10.1080/01621459.2013.813389
View details for Web of Science ID 000328908700025
-
Early role for IL-6 signalling during generation of induced pluripotent stem cells revealed by heterokaryon RNA-Seq.
Nature cell biology
2013; 15 (10): 1244-1252
Abstract
Molecular insights into somatic cell reprogramming to induced pluripotent stem cells (iPS) would aid regenerative medicine, but are difficult to elucidate in iPS because of their heterogeneity, as relatively few cells undergo reprogramming (0.1-1%). To identify early acting regulators, we capitalized on non-dividing heterokaryons (mouse embryonic stem cells fused to human fibroblasts), in which reprogramming towards pluripotency is efficient and rapid, enabling the identification of transient regulators required at the onset. We used bi-species transcriptome-wide RNA-seq to quantify transcriptional changes in the human somatic nucleus during reprogramming towards pluripotency in heterokaryons. During heterokaryon reprogramming, the cytokine interleukin 6 (IL6), which is not detectable at significant levels in embryonic stem cells, was induced 50-fold. A 4-day culture with IL6 at the onset of iPS reprogramming replaced stably transduced oncogenic c-Myc such that transduction of only Oct4, Klf4 and Sox2 was required. IL6 also activated another Jak/Stat target, the serine/threonine kinase gene Pim1, which accounted for the IL6-mediated twofold increase in iPS frequency. In contrast, LIF, another induced GP130 ligand, failed to increase iPS frequency or activate c-Myc or Pim1, thereby revealing a differential role for the two Jak/Stat inducers in iPS generation. These findings demonstrate the power of heterokaryon bi-species global RNA-seq to identify early acting regulators of reprogramming, for example, extrinsic replacements for stably transduced transcription factors such as the potent oncogene c-Myc.
View details for DOI 10.1038/ncb2835
View details for PubMedID 23995732
-
LEARNING A NONLINEAR DYNAMICAL SYSTEM MODEL OF GENE REGULATION: A PERTURBED STEADY-STATE APPROACH
ANNALS OF APPLIED STATISTICS
2013; 7 (3): 1311-1333
View details for DOI 10.1214/13-AOAS645
View details for Web of Science ID 000328198700003
-
Personalized prediction of first-cycle in vitro fertilization success
FERTILITY AND STERILITY
2013; 99 (7): 1905-1911
Abstract
Objective: To test whether the probability of having a live birth (LB) with the first IVF cycle (C1) can be predicted and personalized for patients in diverse environments. Design: Retrospective validation of multicenter prediction model. Setting: Three university-affiliated outpatient IVF clinics located in different countries. Patient(s): Using primary models aggregated from >13,000 C1s, we applied the boosted tree method to train a preIVF-diversity model (PreIVF-D) with 1,061 C1s from 2008 to 2009, and validated predicted LB probabilities with an independent dataset comprising 1,058 C1s from 2008 to 2009. Intervention(s): None. Main outcome measure(s): Predictive power, reclassification, receiver operator characteristic analysis, calibration, dynamic range. Result(s): Overall, with PreIVF-D, 86% of cases had significantly different LB probabilities compared with age control, and more than one-half had higher LB probabilities. Specifically, 42% of patients could have been identified by PreIVF-D to have a personalized predicted success rate >45%, whereas an age-control model could not differentiate them from others. Furthermore, PreIVF-D showed improved predictive power, with 36% improved log-likelihood (or 9.0-fold by log-scale; >1,000-fold linear scale), and prediction errors for subgroups ranged from 0.9% to 3.7%. Conclusion(s): Validated prediction of personalized LB probabilities from diverse multiple sources identifies excellent prognoses in more than one-half of patients.
View details for DOI 10.1016/j.fertnstert.2013.02.016
View details for Web of Science ID 000320505900028
View details for PubMedID 23522806
-
Detecting DNA modifications from SMRT sequencing data by modeling sequence context dependence of polymerase kinetic.
PLoS computational biology
2013; 9 (3)
Abstract
DNA modifications such as methylation and DNA damage can play critical regulatory roles in biological systems. Single molecule, real time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information that can be used for the direct detection of DNA modifications. We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction, allowing for the possibility of estimating the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop an Empirical Bayesian hierarchical model for incorporating historical data. Our results show that the model could greatly increase DNA modification detection accuracy, and reduce the requirement for control data coverage. For some DNA modifications that have a strong signal, a control sample is not even needed when historical data are used as an alternative to control. Thus, sequencing costs can be greatly reduced by using the model. We implemented the model in an R package named seqPatch, which is available at https://github.com/zhixingfeng/seqPatch.
View details for DOI 10.1371/journal.pcbi.1002935
View details for PubMedID 23516341
-
RNA sequencing reveals a diverse and dynamic repertoire of the Xenopus tropicalis transcriptome over development
GENOME RESEARCH
2013; 23 (1): 201-216
Abstract
The Xenopus embryo has provided key insights into fate specification, the cell cycle, and other fundamental developmental and cellular processes, yet a comprehensive understanding of its transcriptome is lacking. Here, we used paired end RNA sequencing (RNA-seq) to explore the transcriptome of Xenopus tropicalis in 23 distinct developmental stages. We determined expression levels of all genes annotated in RefSeq and Ensembl and showed for the first time on a genome-wide scale that, despite a general state of transcriptional silence in the earliest stages of development, approximately 150 genes are transcribed prior to the midblastula transition. In addition, our splicing analysis uncovered more than 10,000 novel splice junctions at each stage and revealed that many known genes have additional unannotated isoforms. Furthermore, we used Cufflinks to reconstruct transcripts from our RNA-seq data and found that ∼13.5% of the final contigs are derived from novel transcribed regions, both within introns and in intergenic regions. We then developed a filtering pipeline to separate protein-coding transcripts from noncoding RNAs and identified a confident set of 6686 noncoding transcripts in 3859 genomic loci. Since the current reference genome, XenTro3, consists of hundreds of scaffolds instead of full chromosomes, we also performed de novo reconstruction of the transcriptome using Trinity and uncovered hundreds of transcripts that are missing from the genome. Collectively, our data will not only aid in completing the assembly of the Xenopus tropicalis genome but will also serve as a valuable resource for gene discovery and for unraveling the fundamental mechanisms of vertebrate embryogenesis.
View details for DOI 10.1101/gr.141424.112
View details for Web of Science ID 000312963400019
View details for PubMedID 22960373
View details for PubMedCentralID PMC3530680
-
Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases
GENOME RESEARCH
2013; 23 (1): 129-141
Abstract
Current generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently, single-molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date no statistical framework has been proposed to enhance the power to detect these events while also controlling for false-positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best-performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events, while others represent putative chemically modified sites of unknown types.
View details for DOI 10.1101/gr.136739.111
View details for Web of Science ID 000312963400012
View details for PubMedID 23093720
View details for PubMedCentralID PMC3530673
-
An Oct4-Sall4-Nanog network controls developmental progression in the pre-implantation mouse embryo
MOLECULAR SYSTEMS BIOLOGY
2013; 9
Abstract
Landmark events occur in a coordinated manner during pre-implantation development of the mammalian embryo, yet the regulatory network that orchestrates these events remains largely unknown. Here, we present the first systematic investigation of the network in pre-implantation mouse embryos using morpholino-mediated gene knockdowns of key embryonic stem cell (ESC) factors followed by detailed transcriptome analysis of pooled embryos, single embryos, and individual blastomeres. We delineated the regulons of Oct4, Sall4, and Nanog and identified a set of metabolism- and transport-related genes that were controlled by these transcription factors in embryos but not in ESCs. Strikingly, the knockdown embryos arrested at a range of developmental stages. We provided evidence that the DNA methyltransferase Dnmt3b has a role in determining the extent to which a knockdown embryo can develop. We further showed that the feed-forward loop comprising Dnmt3b, the pluripotency factors, and the miR-290-295 cluster exemplifies a network motif that buffers embryos against gene expression noise. Our findings indicate that Oct4, Sall4, and Nanog form a robust and integrated network to govern mammalian pre-implantation development.
View details for DOI 10.1038/msb.2012.65
View details for Web of Science ID 000314415800002
View details for PubMedID 23295861
View details for PubMedCentralID PMC3564263
-
Neural-specific Sox2 input and differential Gli-binding affinity provide context and positional information in Shh-directed neural patterning
GENES & DEVELOPMENT
2012; 26 (24): 2802-2816
Abstract
In the vertebrate neural tube, regional Sonic hedgehog (Shh) signaling invokes a time- and concentration-dependent induction of six different cell populations mediated through Gli transcriptional regulators. Elsewhere in the embryo, Shh/Gli responses invoke different tissue-appropriate regulatory programs. A genome-scale analysis of DNA binding by Gli1 and Sox2, a pan-neural determinant, identified a set of shared regulatory regions associated with key factors central to cell fate determination and neural tube patterning. Functional analysis in transgenic mice validates core enhancers for each of these factors and demonstrates the dual requirement for Gli1 and Sox2 inputs for neural enhancer activity. Furthermore, through an unbiased determination of Gli-binding site preferences and analysis of binding site variants in the developing mammalian CNS, we demonstrate that differential Gli-binding affinity underlies threshold-level activator responses to Shh input. In summary, our results highlight Sox2 input as a context-specific determinant of the neural-specific Shh response and differential Gli-binding site affinity as an important cis-regulatory property critical for interpreting Shh morphogen action in the mammalian neural tube.
View details for DOI 10.1101/gad.207142.112
View details for Web of Science ID 000312775700012
View details for PubMedID 23249739
View details for PubMedCentralID PMC3533082
-
Activation of Innate Immunity Is Required for Efficient Nuclear Reprogramming
CELL
2012; 151 (3): 547-558
Abstract
Retroviral overexpression of reprogramming factors (Oct4, Sox2, Klf4, c-Myc) generates induced pluripotent stem cells (iPSCs). However, the integration of foreign DNA could induce genomic dysregulation. Cell-permeant proteins (CPPs) could overcome this limitation. To date, this approach has proved exceedingly inefficient. We discovered a striking difference in the pattern of gene expression induced by viral versus CPP-based delivery of the reprogramming factors, suggesting that a signaling pathway required for efficient nuclear reprogramming was activated by the retroviral, but not CPP approach. In gain- and loss-of-function studies, we find that the toll-like receptor 3 (TLR3) pathway enables efficient induction of pluripotency by viral or mmRNA approaches. Stimulation of TLR3 causes rapid and global changes in the expression of epigenetic modifiers to enhance chromatin remodeling and nuclear reprogramming. Activation of inflammatory pathways is required for efficient nuclear reprogramming in the induction of pluripotency.
View details for DOI 10.1016/j.cell.2012.09.034
View details for PubMedID 23101625
-
Improving PacBio Long Read Accuracy by Short Read Alignment
PLOS ONE
2012; 7 (10)
Abstract
The recent development of third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show LSC can correct PacBio long reads to reduce the error rate by more than threefold. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq study. Compared with another hybrid correction tool, LSC can achieve over double the sensitivity and similar specificity.
View details for DOI 10.1371/journal.pone.0046679
View details for Web of Science ID 000309580800039
View details for PubMedID 23056399
View details for PubMedCentralID PMC3464235
-
Fast and accurate read alignment for resequencing
BIOINFORMATICS
2012; 28 (18): 2366-2373
Abstract
Next-generation sequence analysis has become an important task both in laboratory and clinical settings. A key stage in the majority of sequence analysis workflows, such as resequencing, is the alignment of genomic reads to a reference genome. The accurate alignment of reads with large indels is a computationally challenging task for researchers. We introduce SeqAlto as a new algorithm for read alignment. For reads longer than or equal to 100 bp, SeqAlto is up to 10× faster than existing algorithms, while retaining high accuracy and the ability to align reads with large (up to 50 bp) indels. This improvement in efficiency is particularly important in the analysis of future sequencing data where the number of reads approaches many billions. Furthermore, SeqAlto uses less than 8 GB of memory to align against the human genome. SeqAlto is benchmarked against several existing tools with both real and simulated data. Linux and Mac OS X binaries free for academic use are available at http://www.stanford.edu/group/wonglab/seqalto. Contact: whwong@stanford.edu.
View details for DOI 10.1093/bioinformatics/bts450
View details for Web of Science ID 000308532300059
View details for PubMedID 22811546
View details for PubMedCentralID PMC3436849
-
Six2 and Wnt Regulate Self-Renewal and Commitment of Nephron Progenitors through Shared Gene Regulatory Networks
DEVELOPMENTAL CELL
2012; 23 (3): 637-651
Abstract
A balance between Six2-dependent self-renewal and canonical Wnt signaling-directed commitment regulates mammalian nephrogenesis. Intersectional studies using chromatin immunoprecipitation and transcriptional profiling identified direct target genes shared by each pathway within nephron progenitors. Wnt4 and Fgf8 are essential for progenitor commitment; cis-regulatory modules flanking each gene are cobound by Six2 and β-catenin and are dependent on conserved Lef/Tcf binding sites for activity. In vitro and in vivo analyses suggest that Six2 and Lef/Tcf factors form a regulatory complex that promotes progenitor maintenance while entry of β-catenin into this complex promotes nephrogenesis. Alternative transcriptional responses associated with Six2 and β-catenin cobinding events occur through non-Lef/Tcf DNA binding mechanisms, highlighting the regulatory complexity downstream of Wnt signaling in the developing mammalian kidney.
View details for DOI 10.1016/j.devcel.2012.07.008
View details for Web of Science ID 000308776400019
View details for PubMedID 22902740
-
Predicting personalized multiple birth risks after in vitro fertilization-double embryo transfer
FERTILITY AND STERILITY
2012; 98 (1)
Abstract
Objective: To report and evaluate the performance and utility of an approach to predicting IVF-double embryo transfer (DET) multiple birth risks that is evidence-based, clinic-specific, and considers each patient's clinical profile. Design: Retrospective prediction modeling. Setting: An outpatient university-affiliated IVF clinic. Patient(s): We used boosted tree methods to analyze 2,413 independent IVF-DET treatment cycles that resulted in live births. The IVF cycles were retrieved from a database that comprised more than 33,000 IVF cycles. Intervention(s): None. Main outcome measure(s): The performance of this prediction model, MBP-BIVF, was validated by an independent data set, to evaluate predictive power, discrimination, dynamic range, and reclassification. Result(s): Multiple birth probabilities ranging from 11.8% to 54.8% were predicted by the model and were significantly different from control predictions in more than half of the patients. The prediction model showed an improvement of 146% in predictive power and 16.0% in discrimination over control. The population standard error was 1.8%. Conclusion(s): We showed that IVF patients have inherently different risks of multiple birth, even when DET is specified, and this risk can be predicted before ET. The use of clinic-specific prediction models provides an evidence-based and personalized method to counsel patients.
View details for DOI 10.1016/j.fertnstert.2012.04.011
View details for Web of Science ID 000305950200020
View details for PubMedID 22673597
-
A Sparse Transmission Disequilibrium Test for Haplotypes Based on Bradley-Terry Graphs
HUMAN HEREDITY
2012; 73 (1): 52-61
Abstract
Linkage and association analysis based on haplotype transmission disequilibrium can be more informative than single marker analysis. Several works have been proposed in recent years to extend the transmission disequilibrium test (TDT) to haplotypes. Among them, a powerful approach called the evolutionary tree TDT (ET-TDT) incorporates information about the evolutionary relationship among haplotypes using the cladogram of the locus. In this work we extend this approach by taking into consideration the sparsity of causal mutations in the evolutionary history. We first introduce the notion of a Bradley-Terry (BT) graph representation of a haplotype locus. The most important property of the BT graph is that sparsity of the edge set of the graph corresponds to a small number of causal mutations in the evolution of the haplotypes. We then propose a method to test the null hypothesis of no linkage and association against sparse alternatives under which a small number of edges on the BT graph have non-nil effects. We compare the performance of our approach to that of the ET-TDT through a power study, and show that incorporating sparsity of causal mutations can significantly improve the power of a haplotype-based TDT.
View details for DOI 10.1159/000335937
View details for Web of Science ID 000302111100008
View details for PubMedID 22398955
View details for PubMedCentralID PMC3357149
-
Coupling Optional Polya Trees and the Two Sample Problem
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2011; 106 (496): 1553-1565
View details for DOI 10.1198/jasa.2011.tm10003
View details for Web of Science ID 000299662900026
-
A BOOTSTRAP-BASED NON-PARAMETRIC ANOVA METHOD WITH APPLICATIONS TO FACTORIAL MICROARRAY DATA
STATISTICA SINICA
2011; 21 (2): 495-514
View details for Web of Science ID 000290459900002
-
A New FACS Approach Isolates hESC Derived Endoderm Using Transcription Factors
PLOS ONE
2011; 6 (3)
Abstract
We show that high quality microarray gene expression profiles can be obtained following FACS sorting of cells using combinations of transcription factors. We use this transcription factor FACS (tfFACS) methodology to perform a genomic analysis of hESC-derived endodermal lineages marked by combinations of SOX17, GATA4, and CXCR4, and find that triple positive cells have a much stronger definitive endoderm signature than other combinations of these markers. Additionally, SOX17(+) GATA4(+) cells can be obtained at a much earlier stage of differentiation, prior to expression of CXCR4(+) cells, providing an important new tool to isolate this earlier definitive endoderm subtype. Overall, tfFACS represents an advancement in FACS technology which broadly crosses multiple disciplines, most notably in regenerative medicine to redefine cellular populations.
View details for DOI 10.1371/journal.pone.0017536
View details for Web of Science ID 000288170900026
View details for PubMedID 21408072
View details for PubMedCentralID PMC3052315
-
Human transcriptome array for high-throughput clinical studies
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2011; 108 (9): 3707-3712
Abstract
A 6.9 million-feature oligonucleotide array of the human transcriptome [Glue Grant human transcriptome (GG-H array)] has been developed for high-throughput and cost-effective analyses in clinical studies. This array allows comprehensive examination of gene expression and genome-wide identification of alternative splicing as well as detection of coding SNPs and noncoding transcripts. The performance of the array was examined and compared with mRNA sequencing (RNA-Seq) results over multiple independent replicates of liver and muscle samples. Compared with RNA-Seq of 46 million uniquely mappable reads per replicate, the GG-H array is highly reproducible in estimating gene and exon abundance. Although both platforms detect similar expression changes at the gene level, the GG-H array is more sensitive at the exon level. Deeper sequencing is required to adequately cover low-abundance transcripts. The array has been implemented in a multicenter clinical program and has generated high-quality, reproducible data. Considering the clinical trial requirements of cost, sample availability, and throughput, the GG-H array has a wide range of applications. An emerging approach for large-scale clinical genomic studies is to first use RNA-Seq to the sufficient depth for the discovery of transcriptome elements relevant to the disease process followed by high-throughput and reliable screening of these elements on thousands of patient samples using custom-designed arrays.
View details for DOI 10.1073/pnas.1019753108
View details for Web of Science ID 000287844400051
View details for PubMedID 21317363
View details for PubMedCentralID PMC3048146
-
Statistical Modeling of RNA-Seq Data
STATISTICAL SCIENCE
2011; 26 (1): 62-83
View details for DOI 10.1214/10-STS343
View details for Web of Science ID 000292424900013
-
Completely phased genome sequencing through chromosome sorting
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2011; 108 (1): 12-17
Abstract
The two haploid genome sequences that a person inherits from the two parents represent the most fundamentally useful type of genetic information for the study of heritable diseases and the development of personalized medicine. Because of the difficulty in obtaining long-range phase information, current sequencing methods are unable to provide this information. Here, we introduce and show feasibility of a scalable approach capable of generating genomic sequences completely phased across the entire chromosome.
View details for DOI 10.1073/pnas.1016725108
View details for PubMedID 21169219
-
THE ANALYSIS OF CHIP-SEQ DATA
METHODS IN ENZYMOLOGY, VOL 497: SYNTHETIC BIOLOGY, METHODS FOR PART/DEVICE CHARACTERIZATION AND CHASSIS ENGINEERING, PT A
2011; 497: 51-73
Abstract
Chromatin immunoprecipitation coupled with ultra-high-throughput parallel DNA sequencing (ChIP-seq) is an effective technology for the investigation of genome-wide protein-DNA interactions. Examples of applications include the studies of RNA polymerases transcription, transcriptional regulation, and histone modifications. The technology provides accurate and high-resolution mapping of the protein-DNA binding loci that are important in the understanding of many processes in development and diseases. Since the introduction of ChIP-seq experiments in 2007, many statistical and computational methods have been developed to support the analysis of the massive datasets from these experiments. However, because of the complex, multistaged analysis workflow, it is still difficult for an experimental investigator to conduct the analysis of his or her own ChIP-seq data. In this chapter, we review the basic design of ChIP-seq experiments and provide an in-depth tutorial on how to prepare, to preprocess, and to analyze ChIP-seq datasets. The tutorial is based on a revised version of our software package CisGenome, which was designed to encompass most standard tasks in ChIP-seq data analysis. Relevant statistical and computational issues will be highlighted, discussed, and illustrated by means of real data examples.
View details for DOI 10.1016/B978-0-12-385075-1.00003-2
View details for Web of Science ID 000291321200003
View details for PubMedID 21601082
-
Integration of Brassinosteroid Signal Transduction with the Transcription Network for Plant Growth Regulation in Arabidopsis
DEVELOPMENTAL CELL
2010; 19 (5): 765-777
Abstract
Brassinosteroids (BRs) regulate a wide range of developmental and physiological processes in plants through a receptor-kinase signaling pathway that controls the BZR transcription factors. Here, we use transcript profiling and chromatin-immunoprecipitation microarray (ChIP-chip) experiments to identify 953 BR-regulated BZR1 target (BRBT) genes. Functional studies of selected BRBTs further demonstrate roles in BR promotion of cell elongation. The BRBT genes reveal numerous molecular links between the BR-signaling pathway and downstream components involved in developmental and physiological processes. Furthermore, the results reveal extensive crosstalk between BR and other hormonal and light-signaling pathways at multiple levels. For example, BZR1 not only controls the expression of many signaling components of other hormonal and light pathways but also coregulates common target genes with light-signaling transcription factors. Our results provide a genomic map of steroid hormone actions in plants that reveals a regulatory network that integrates hormonal and light-signaling pathways for plant growth regulation.
View details for DOI 10.1016/j.devcel.2010.10.010
View details for Web of Science ID 000284516300016
View details for PubMedID 21074725
View details for PubMedCentralID PMC3018842
-
From EM to Data Augmentation: The Emergence of MCMC Bayesian Computation in the 1980s
STATISTICAL SCIENCE
2010; 25 (4): 506-516
View details for DOI 10.1214/10-STS341
View details for Web of Science ID 000288497200006
-
Deep phenotyping to predict live birth outcomes in in vitro fertilization
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2010; 107 (31): 13570-13575
Abstract
Nearly 75% of in vitro fertilization (IVF) treatments do not result in live births and patients are largely guided by a generalized age-based prognostic stratification. We sought to provide personalized and validated prognosis by using available clinical and embryo data from prior, failed treatments to predict live birth probabilities in the subsequent treatment. We generated a boosted tree model, IVFBT, by training it with IVF outcomes data from 1,676 first cycles (C1s) from 2003-2006, followed by external validation with 634 cycles from 2007-2008. We tested whether this model could predict the probability of having a live birth in the subsequent treatment (C2). By using nondeterministic methods to identify prognostic factors and their relative nonredundant contribution, we generated a prediction model, IVFBT, that was superior to the age-based control by providing over 1,000-fold improvement to fit new data (p<0.05), and increased discrimination by receiver operating characteristic analysis (area-under-the-curve, 0.80 vs. 0.68 for C1, 0.68 vs. 0.58 for C2). IVFBT provided predictions that were more accurate for approximately 83% of C1 and approximately 60% of C2 cycles that were out of the range predicted by age. Over half of those patients were reclassified to have higher live birth probabilities. We showed that data from a prior cycle could be used effectively to provide personalized and validated live birth probabilities in a subsequent cycle. Our approach may be replicated and further validated in other IVF clinics.
View details for DOI 10.1073/pnas.1002296107
View details for PubMedID 20643955
-
Detection of splice junctions from paired-end RNA-seq data by SpliceMap
NUCLEIC ACIDS RESEARCH
2010; 38 (14): 4570-4578
Abstract
Alternative splicing is a prevalent post-transcriptional process, which is not only important to normal cellular function but is also involved in human diseases. The newly developed second generation sequencing technique provides high-throughput data (RNA-seq data) to study alternative splicing events in different types of cells. Here, we present a computational method, SpliceMap, to detect splice junctions from RNA-seq data. This method does not depend on any existing annotation of gene structures and is capable of finding novel splice junctions with high sensitivity and specificity. It can handle long reads (50-100 nt) and can exploit paired-read information to improve mapping accuracy. Several parameters are included in the output to indicate the reliability of the predicted junction and help filter out false predictions. We applied SpliceMap to analyze 23 million paired 50-nt reads from human brain tissue. The results show at this depth of sequencing, RNA-seq can support reliable detection of splice junctions except for those that are present at very low level. Compared to current methods, SpliceMap can achieve 12% higher sensitivity without sacrificing specificity.
View details for DOI 10.1093/nar/gkq211
View details for Web of Science ID 000280922400010
View details for PubMedID 20371516
View details for PubMedCentralID PMC2919714
-
CisGenome Browser: a flexible tool for genomic data visualization
BIOINFORMATICS
2010; 26 (14): 1781-1782
Abstract
We present an open source, platform independent tool, called CisGenome Browser, which can work together with any other data analysis program to serve as a flexible component for genomic data visualization. It can also work by itself as a standalone genome browser. By working as a light-weight web server, CisGenome Browser is a convenient tool for data sharing between labs. It has features that are specifically designed for ultra high-throughput sequencing data visualization. Availability: http://biogibbs.stanford.edu/~jiangh/browser/
View details for DOI 10.1093/bioinformatics/btq286
View details for Web of Science ID 000279474400017
View details for PubMedID 20513664
View details for PubMedCentralID PMC2894522
-
An "Almost Exhaustive" Search-Based Sequential Permutation Method for Detecting Epistasis in Disease Association Studies
GENETIC EPIDEMIOLOGY
2010; 34 (5): 434-443
Abstract
Due to the complex nature of common diseases, their etiology is likely to involve "uncommon but strong" (UBS) interactive effects--i.e. allelic combinations that are each present in only a small fraction of the patients but associated with high disease risk. However, the identification of such effects using standard methods for testing association can be difficult. In this work, we introduce a method for testing interactions that is particularly powerful in detecting UBS effects. The method consists of two modules--one is a pattern counting algorithm designed for efficiently evaluating the risk significance of each marker combination, and the other is a sequential permutation scheme for multiple testing correction. We demonstrate the work of our method using a candidate gene data set for cardiovascular and coronary diseases with an injected UBS three-locus interaction. In addition, we investigate the power and false rejection properties of our method using data sets simulated from a joint dominance three-locus model that gives rise to UBS interactive effects. The results show that our method can be much more powerful than standard approaches such as trend test and multifactor dimensionality reduction for detecting UBS interactions.
View details for DOI 10.1002/gepi.20496
View details for Web of Science ID 000280349600007
View details for PubMedID 20583286
-
Analysis of factorial time-course microarrays with application to a clinical study of burn injury
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2010; 107 (22): 9923-9928
Abstract
Time-course microarray experiments are capable of capturing dynamic gene expression profiles. It is important to study how these dynamic profiles depend on the multiple factors that characterize the experimental condition under which the time course is observed. Analytic methods are needed to simultaneously handle the time course and factorial structure in the data. We developed a method to evaluate factor effects by pooling information across the time course while accounting for multiple testing and nonnormality of the microarray data. The method effectively extracts gene-specific response features and models their dependency on the experimental factors. Both longitudinal and cross-sectional time-course data can be handled by our approach. The method was used to analyze the impact of age on the temporal gene response to burn injury in a large-scale clinical study. Our analysis reveals that 21% of the genes responsive to burn are age-specific, among which expressions of mitochondria and immunoglobulin genes are differentially perturbed in pediatric and adult patients by burn injury. These new findings in the body's response to burn injury between children and adults support further investigations of therapeutic options targeting specific age groups. The methodology proposed here has been implemented in R package "TANOVA" and submitted to the Comprehensive R Archive Network at http://www.r-project.org/. It is also available for download at http://gluegrant1.stanford.edu/TANOVA/.
View details for DOI 10.1073/pnas.1002757107
View details for Web of Science ID 000278246000005
View details for PubMedID 20479259
View details for PubMedCentralID PMC2890487
-
OPTIONAL POLYA TREE AND BAYESIAN INFERENCE
ANNALS OF STATISTICS
2010; 38 (3): 1433-1459
View details for DOI 10.1214/09-AOS755
View details for Web of Science ID 000277471000006
-
Hedgehog pathway-regulated gene networks in cerebellum development and tumorigenesis
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2010; 107 (21): 9736-9741
Abstract
Many genes initially identified for their roles in cell fate determination or signaling during development can have a significant impact on tumorigenesis. In the developing cerebellum, Sonic hedgehog (Shh) stimulates the proliferation of granule neuron precursor cells (GNPs) by activating the Gli transcription factors. Inappropriate activation of Shh target genes results in unrestrained cell division and eventually medulloblastoma, the most common pediatric brain malignancy. We find dramatic differences in the gene networks that are directly driven by the Gli1 transcription factor in GNPs and medulloblastoma. Gli1 binding location analysis revealed hundreds of genomic loci bound by Gli1 in normal and cancer cells. Only one third of the genes bound by Gli1 in GNPs were also bound in tumor cells. Correlation with gene expression levels indicated that 116 genes were preferentially transcribed in tumors, whereas 132 genes were target genes in both GNPs and medulloblastoma. Quantitative PCR and in situ hybridization for some putative target genes support their direct regulation by Gli. The results indicate that transformation of normal GNPs into deadly tumor cells is accompanied by a distinct set of Gli-regulated genes and may provide candidates for targeted therapies.
View details for DOI 10.1073/pnas.1004602107
View details for Web of Science ID 000278054700048
View details for PubMedID 20460306
View details for PubMedCentralID PMC2906878
-
Modeling Co-Expression across Species for Complex Traits: Insights to the Difference of Human and Mouse Embryonic Stem Cells
PLOS COMPUTATIONAL BIOLOGY
2010; 6 (3)
Abstract
Complex interactions between genes or proteins contribute substantially to phenotypic evolution. We present a probabilistic model and a maximum likelihood approach for cross-species clustering analysis and for identification of conserved as well as species-specific co-expression modules. This model enables a "soft" cross-species clustering (SCSC) approach by encouraging but not enforcing orthologous genes to be grouped into the same cluster. SCSC is therefore robust to obscure orthologous relationships and can reflect different functional roles of orthologous genes in different species. We generated a time-course gene expression dataset for differentiating mouse embryonic stem (ES) cells, and compiled a dataset of published gene expression data on differentiating human ES cells. Applying SCSC to analyze these datasets, we identified conserved and species-specific gene regulatory modules. Together with protein-DNA binding data, an SCSC cluster specifically induced in murine ES cells indicated that the KLF2/4/5 transcription factors, although critical to maintaining the pluripotent phenotype in mouse ES cells, were decoupled from the OCT4/SOX2/NANOG regulatory module in human ES cells. Two of the target genes of murine KLF2/4/5, LIN28 and NODAL, were rewired to be targets of OCT4/SOX2/NANOG in human ES cells. Moreover, by mapping SCSC clusters onto KEGG signaling pathways, we identified the signal transduction components that were induced in pluripotent ES cells in either a conserved or a species-specific manner. These results suggest that the pluripotent cell identity can be established and maintained through more than one gene regulatory network.
View details for DOI 10.1371/journal.pcbi.1000707
View details for Web of Science ID 000278125200015
View details for PubMedID 20300647
View details for PubMedCentralID PMC2837392
-
Modeling non-uniformity in short-read rates in RNA-Seq data
GENOME BIOLOGY
2010; 11 (5)
Abstract
After mapping, RNA-Seq data can be summarized by a sequence of read counts commonly modeled as Poisson variables with constant rates along each transcript, which actually fit data poorly. We suggest using variable rates for different positions, and propose two models to predict these rates based on local sequences. These models explain more than 50% of the variations and can lead to improved estimates of gene and isoform expressions for both Illumina and Applied Biosystems data.
View details for DOI 10.1186/gb-2010-11-5-r50
View details for Web of Science ID 000279631000015
View details for PubMedID 20459815
View details for PubMedCentralID PMC2898062
-
ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2009; 106 (51): 21521-21526
Abstract
Next-generation sequencing has greatly increased the scope and the resolution of transcriptional regulation study. RNA sequencing (RNA-Seq) and ChIP-Seq experiments are now generating comprehensive data on transcript abundance and on regulator-DNA interactions. We propose an approach for an integrated analysis of these data based on feature extraction of ChIP-Seq signals, principal component analysis, and regression-based component selection. Compared with traditional methods, our approach not only offers higher power in predicting gene expression from ChIP-Seq data but also provides a way to capture cooperation among regulators. In mouse embryonic stem cells (ESCs), we find that a remarkably high proportion of variation in gene expression (65%) can be explained by the binding signals of 12 transcription factors (TFs). Two groups of TFs are identified. Whereas the first group (E2f1, Myc, Mycn, and Zfx) act as activators in general, the second group (Oct4, Nanog, Sox2, Smad1, Stat3, Tcfcp2l1, and Esrrb) may serve as either activator or repressor depending on the target. The two groups of TFs cooperate tightly to activate genes that are differentially up-regulated in ESCs. In the absence of binding by the first group, the binding of the second group is associated with genes that are repressed in ESCs and derepressed upon early differentiation.
View details for DOI 10.1073/pnas.0904863106
View details for Web of Science ID 000272994200013
View details for PubMedID 19995984
View details for PubMedCentralID PMC2789751
-
Identifiability of isoform deconvolution from junction arrays and RNA-Seq
BIOINFORMATICS
2009; 25 (23): 3056-3059
Abstract
Splice junction microarrays and RNA-seq are two popular ways of quantifying splice variants within a cell. Unfortunately, isoform expressions cannot always be determined from the expressions of individual exons and splice junctions. While this issue has been noted before, the extent of the problem on various platforms has not yet been explored, nor have potential remedies been presented. We propose criteria that will guarantee identifiability of an isoform deconvolution model on exon and splice junction arrays and in RNA-Seq. We show that up to 97% of 2256 alternatively spliced human genes selected from the RefSeq database lead to identifiable gene models in RNA-seq, with similar results in mouse. However, in the Human Exon array only 26% of these genes lead to identifiable models, and even in the most comprehensive splice junction array only 69% lead to identifiable models. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btp544
View details for Web of Science ID 000272080800002
View details for PubMedID 19762346
View details for PubMedCentralID PMC3167695
-
Dissecting Early Differentially Expressed Genes in a Mixture of Differentiating Embryonic Stem Cells
PLOS COMPUTATIONAL BIOLOGY
2009; 5 (12)
Abstract
The differentiation of embryonic stem cells is initiated by a gradual loss of pluripotency-associated transcripts and induction of differentiation genes. Accordingly, the detection of differentially expressed genes at the early stages of differentiation could assist the identification of the causal genes that either promote or inhibit differentiation. The previous methods of identifying differentially expressed genes by comparing different cell types would inevitably include a large portion of genes that respond to, rather than regulate, the differentiation process. We demonstrate through the use of biological replicates and a novel statistical approach that the gene expression data obtained without prior separation of cell types are informative for detecting differentially expressed genes at the early stages of differentiation. Applying the proposed method to analyze the differentiation of murine embryonic stem cells, we identified and then experimentally verified Smarcad1 as a novel regulator of pluripotency and self-renewal. We formalized this statistical approach as a statistical test that is generally applicable to analyze other differentiation processes.
View details for DOI 10.1371/journal.pcbi.1000607
View details for Web of Science ID 000274229000025
View details for PubMedID 20019792
View details for PubMedCentralID PMC2784941
-
FoxOs Cooperatively Regulate Diverse Pathways Governing Neural Stem Cell Homeostasis
CELL STEM CELL
2009; 5 (5): 540-553
Abstract
The PI3K-AKT-FoxO pathway is integral to lifespan regulation in lower organisms and essential for the stability of long-lived cells in mammals. Here, we report the impact of combined FoxO1, 3, and 4 deficiencies on mammalian brain physiology with a particular emphasis on the study of the neural stem/progenitor cell (NSC) pool. We show that the FoxO family plays a prominent role in NSC proliferation and renewal. FoxO-deficient mice show initial increased brain size and proliferation of neural progenitor cells during early postnatal life, followed by precocious significant decline in the NSC pool and accompanying neurogenesis in adult brains. Mechanistically, integrated transcriptomic, promoter, and functional analyses of FoxO-deficient NSC cultures identified direct gene targets with known links to the regulation of human brain size and the control of cellular proliferation, differentiation, and oxidative defense. Thus, the FoxO family coordinately regulates diverse genes and pathways to govern key aspects of NSC homeostasis in the mammalian brain.
View details for DOI 10.1016/j.stem.2009.09.013
View details for Web of Science ID 000272019500015
View details for PubMedID 19896444
View details for PubMedCentralID PMC3285492
-
Energy landscape of a spin-glass model: Exploration and characterization
PHYSICAL REVIEW E
2009; 79 (5)
Abstract
The disconnectivity graph (DG) is widely used to represent energy landscapes. Although powerful numerical methods have been developed to construct DGs for continuous potential-energy surfaces, they have difficulties in applications to discrete Hamiltonians as the case of spin-glass models. When the configuration space is large, brute force enumeration of all configurations to build a DG is not practical. We propose an alternative approach to construct DGs based on recursive partition of Monte Carlo samples from microcanonical ensembles. To characterize energy landscapes, we define the local density of states (LDOS) on a DG, with which one can compute many thermodynamic properties over local energy basins for any temperature. Estimation of LDOS is developed with DG construction. We further propose the concepts of tree entropy and local escape probability, both of which are functions of local density of states, to capture the symmetry and the roughness of a Boltzmann distribution, respectively. Our approach is applied to a study of the Sherrington-Kirkpatrick spin-glass model with N varying between 20 and 100 spins. We observe that the energy landscape is extremely asymmetric and there exists a sharp increase in local escape probability preceding the transition from spin glass to paramagnetic phase.
View details for DOI 10.1103/PhysRevE.79.051117
View details for Web of Science ID 000266500700031
View details for PubMedID 19518426
-
Modeling the spatio-temporal network that drives patterning in the vertebrate central nervous system
BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS
2009; 1789 (4): 299-305
Abstract
In this review, we discuss the gene regulatory network underlying the patterning of the ventral neural tube during vertebrate embryogenesis. The neural tube is partitioned into domains of distinct cell fates by inductive signals along both anterior-posterior and dorsal-ventral axes. A defining feature of the dorsal-ventral patterning is the graded distribution of Sonic hedgehog (Shh), which acts as a morphogen to specify several classes of ventral neurons in a concentration-dependent fashion. These inductive signals translate into patterned expressions of transcription factors that define different neural progenitor subtypes. Progenitor boundaries are sharpened by repressive interactions between these transcription factors. The progenitor-expressed transcription factors induce another set of transcription factors that are thought to contribute to neural identities in post-mitotic neural precursors. Thus, the gene regulatory network of the ventral neural tube patterning is characterized by hierarchical expression [inductive signal-->progenitor specifying factors (mitotic)--> precursor specifying factors (post mitotic)--> differentiated neural markers] and cross-repression between progenitor-expressed regulatory factors. Although a number of transcriptional regulators have been identified at each hierarchical level, their precise regulatory relationships are not clear. Here we discuss approaches aimed at clarifying and extending our understanding of the formation and propagation of this network.
View details for DOI 10.1016/j.bbagrm.2009.01.002
View details for Web of Science ID 000265729800008
View details for PubMedID 19445894
-
Cross-hybridization modeling on Affymetrix exon arrays
BIOINFORMATICS
2008; 24 (24): 2887-2893
Abstract
Microarray designs have become increasingly probe-rich, enabling targeting of specific features, such as individual exons or single nucleotide polymorphisms. These arrays have the potential to achieve quantitative high-throughput estimates of transcript abundances, but currently these estimates are affected by biases due to cross-hybridization, in which probes hybridize to off-target transcripts. To study cross-hybridization, we map Affymetrix exon array probes to a set of annotated mRNA transcripts, allowing a small number of mismatches or insertion/deletions between the two sequences. Based on a systematic study of the degree to which probes with a given match type to a transcript are affected by cross-hybridization, we developed a strategy to correct for cross-hybridization biases of gene-level expression estimates. Comparison with Solexa ultra high-throughput sequencing data demonstrates that correction for cross-hybridization leads to a significant improvement of gene expression estimates. We provide mappings between human and mouse exon array probes and off-target transcripts and provide software extending the GeneBASE program for generating gene-level expression estimates including the cross-hybridization correction http://biogibbs.stanford.edu/~kkapur/GeneBase/.
View details for DOI 10.1093/bioinformatics/btn571
View details for Web of Science ID 000261456700012
View details for PubMedID 18984598
View details for PubMedCentralID PMC2639301
-
RECONSTRUCTING THE ENERGY LANDSCAPE OF A DISTRIBUTION FROM MONTE CARLO SAMPLES
ANNALS OF APPLIED STATISTICS
2008; 2 (4): 1307-1331
View details for DOI 10.1214/08-AOAS196
View details for Web of Science ID 000262731100010
-
An integrated software system for analyzing ChIP-chip and ChIP-seq data
NATURE BIOTECHNOLOGY
2008; 26 (11): 1293-1300
Abstract
We present CisGenome, a software system for analyzing genome-wide chromatin immunoprecipitation (ChIP) data. CisGenome is designed to meet all basic needs of ChIP data analyses, including visualization, data normalization, peak detection, false discovery rate computation, gene-peak association, and sequence and motif analysis. In addition to implementing previously published ChIP-microarray (ChIP-chip) analysis methods, the software contains statistical methods designed specifically for ChIP sequencing (ChIP-seq) data obtained by coupling ChIP with massively parallel sequencing. The modular design of CisGenome enables it to support interactive analyses through a graphic user interface as well as customized batch-mode computation for advanced data mining. A built-in browser allows visualization of array images, signals, gene structure, conservation, and DNA sequence and motif information. We demonstrate the use of these tools by a comparative analysis of ChIP-chip and ChIP-seq data for the transcription factor NRSF/REST, a study of ChIP-seq analysis with or without a negative control sample, and an analysis of a new motif in Nanog- and Sox2-binding regions.
View details for DOI 10.1038/nbt.1505
View details for Web of Science ID 000260832200025
View details for PubMedID 18978777
View details for PubMedCentralID PMC2596672
-
SeqMap: mapping massive amount of oligonucleotides to the genome
BIOINFORMATICS
2008; 24 (20): 2395-2396
Abstract
SeqMap is a tool for mapping large amount of short sequences to the genome. It is designed for finding all the places in a reference genome where each sequence may come from. This task is essential to the analysis of data from ultra high-throughput sequencing machines. With a carefully designed index-filtering algorithm and an efficient implementation, SeqMap can map tens of millions of short sequences to a genome of several billions of nucleotides. Multiple substitutions and insertions/deletions of the nucleotide bases in the sequences can be tolerated and therefore detected. SeqMap supports FASTA input format and various output formats, and provides command line options for tuning almost every aspect of the mapping process. A typical mapping can be done in a few hours on a desktop PC. Parallel use of SeqMap on a cluster is also very straightforward.
View details for DOI 10.1093/bioinformatics/btn429
View details for Web of Science ID 000259973500020
View details for PubMedID 18697769
View details for PubMedCentralID PMC2562015
-
A genome-scale analysis of the cis-regulatory circuitry underlying sonic hedgehog-mediated patterning of the mammalian limb
GENES & DEVELOPMENT
2008; 22 (19): 2651-2663
Abstract
Sonic hedgehog (Shh) signals via Gli transcription factors to direct digit number and identity in the vertebrate limb. We characterized the Gli-dependent cis-regulatory network through a combination of whole-genome chromatin immunoprecipitation (ChIP)-on-chip and transcriptional profiling of the developing mouse limb. These analyses identified approximately 5000 high-quality Gli3-binding sites, including all known Gli-dependent enhancers. Discrete binding regions exhibit a higher-order clustering, highlighting the complexity of cis-regulatory interactions. Further, Gli3 binds inertly to previously identified neural-specific Gli enhancers, demonstrating the accessibility of their cis-regulatory elements. Intersection of DNA binding data with gene expression profiles predicted 205 putative limb target genes. A subset of putative cis-regulatory regions were analyzed in transgenic embryos, establishing Blimp1 as a direct Gli target and identifying Gli activator signaling in a direct, long-range regulation of the BMP antagonist Gremlin. In contrast, a long-range silencer cassette downstream from Hand2 likely mediates Gli3 repression in the anterior limb. These studies provide the first comprehensive characterization of the transcriptional output of a Shh-patterning process in the mammalian embryo and a framework for elaborating regulatory networks in the developing limb.
View details for DOI 10.1101/gad.1693008
View details for Web of Science ID 000259700900010
View details for PubMedID 18832070
-
Isolation and transcriptional profiling of purified hepatic cells derived from human embryonic stem cells
STEM CELLS
2008; 26 (8): 2032-2041
Abstract
The differentiation of human embryonic stem cells (hESCs) into functional hepatocytes provides a powerful in vitro model system for studying the molecular mechanisms governing liver development. Furthermore, a well-characterized renewable supply of hepatocytes differentiated from hESCs could be used for in vitro assays of drug metabolism and toxicology, screening of potential antiviral agents, and cell-based therapies to treat liver disease. In this study, we describe a protocol for the differentiation of hESCs toward hepatic cells with complex cellular morphologies. Putative hepatic cells were identified and isolated using a lentiviral vector, containing the alpha-fetoprotein promoter driving enhanced green fluorescent protein expression (AFP:eGFP). Whole-genome transcriptional profiling was performed on triplicate samples of AFP:eGFP+ and AFP:eGFP- cell populations using the recently released Affymetrix Exon Array ST 1.0 (Santa Clara, CA, http://www.affymetrix.com). Statistical analysis of the transcriptional profiles demonstrated that the AFP:eGFP+ population is highly enriched for genes characteristic of hepatic cells. These data provide a unique insight into the complex process of hepatocyte differentiation, point to signaling pathways that may be manipulated to more efficiently direct the differentiation of hESCs toward mature hepatocytes, and identify molecular markers that may be used for further dissection of hepatic cell differentiation from hESCs.
View details for DOI 10.1634/stemcells.2007-0964
View details for PubMedID 18535157
-
Defining Human Embryo Phenotypes by Cohort-Specific Prognostic Factors
PLOS ONE
2008; 3 (7)
Abstract
Hundreds of thousands of human embryos are cultured yearly at in vitro fertilization (IVF) centers worldwide, yet the vast majority fail to develop in culture or following transfer to the uterus. However, human embryo phenotypes have not been formally defined, and current criteria for embryo transfer largely focus on characteristics of individual embryos. We hypothesized that embryo cohort-specific variables describing sibling embryos as a group may predict developmental competence as measured by IVF cycle outcomes and serve to define human embryo phenotypes. We retrieved data for all 1117 IVF cycles performed in 2005 at Stanford University Medical Center, and further analyzed clinical data from the 665 fresh IVF, non-donor cycles and their associated 4144 embryos. Thirty variables representing patient characteristics, clinical diagnoses, treatment protocol, and embryo parameters were analyzed in an unbiased manner by regression tree models, based on dichotomous pregnancy outcomes defined by positive serum beta-human chorionic gonadotropin (beta-hCG). IVF cycle outcomes were most accurately predicted at approximately 70% by four non-redundant, embryo cohort-specific variables that, remarkably, were more informative than any measures of individual, transferred embryos: Total number of embryos, number of 8-cell embryos, rate (percentage) of cleavage arrest in the cohort and day 3 follicle stimulating hormone (FSH) level. While three of these variables captured the effects of other significant variables, only the rate of cleavage arrest was independent of any known variables. Our findings support defining human embryo phenotypes by non-redundant, prognostic variables that are specific to sibling embryos in a cohort.
View details for DOI 10.1371/journal.pone.0002562
View details for PubMedID 18596962
-
Learning causal Bayesian network structures from experimental data
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2008; 103 (482): 778-789
View details for DOI 10.1198/016214508000000193
View details for Web of Science ID 000257897500035
-
Reconfigurable Computing for Learning Bayesian Networks
16th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
ASSOC COMPUTING MACHINERY. 2008: 203–211
View details for Web of Science ID 000267587700020
-
Optimal discovery of a stochastic genetic network
American Control Conference 2008
IEEE. 2008: 2773–2779
View details for Web of Science ID 000259261502023
-
Evolutionary Monte Carlo methods for clustering
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2007; 16 (4): 855-876
View details for DOI 10.1198/106186007X255072
View details for Web of Science ID 000252010500006
-
A gene regulatory network in mouse embryonic stem cells
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2007; 104 (42): 16438-16443
Abstract
We analyze new and existing expression and transcription factor-binding data to characterize gene regulatory relations in mouse ES cells (ESC). In addition to confirming the key roles of Oct4, Sox2, and Nanog, our analysis identifies several genes, such as Esrrb, Stat3, Tcf7, Sall4, and LRH-1, as statistically significant coregulators. The regulatory interactions among 15 core regulators are used to construct a gene regulatory network in ESC. The network encapsulates extensive cross-regulations among the core regulators, highlights how they may control epigenetic processes, and reveals the surprising roles of nuclear receptors. Our analysis also provides information on the regulation of a large number of putative target genes of the network.
View details for DOI 10.1073/pnas.0701014104
View details for Web of Science ID 000250373400012
View details for PubMedID 17940043
-
Assessing the conservation of mammalian gene expression using high-density exon arrays
MOLECULAR BIOLOGY AND EVOLUTION
2007; 24 (6): 1283-1285
Abstract
Microarray data from multiple species have been used to study evolutionary constraints on gene expression. Expression measurements from conventional microarray platforms such as the 3' expression arrays are strongly affected by platform-dependent probe effects that may introduce apparent but misleading discrepancies between species. In this manuscript, we assess the conservation of mammalian gene expression in adult tissues using data from a high-density exon array platform. The exon arrays have more than 6 million probes on a single array targeting all exons in a genome. We find that, unlike 3' array data, gene expression measurements from exon arrays reveal patterns of gene expression that are highly conserved between humans and mice in multiple tissues. Our analysis provides strong evidence for widespread stabilizing selection pressure on transcript abundance during mammalian evolution.
View details for DOI 10.1093/molbev/msm061
View details for Web of Science ID 000247207700001
View details for PubMedID 17387099
-
COUPLING HIDDEN MARKOV MODELS FOR THE DISCOVERY OF Cis-REGULATORY MODULES IN MULTIPLE SPECIES
ANNALS OF APPLIED STATISTICS
2007; 1 (1): 36-65
View details for DOI 10.1214/07-AOAS103
View details for Web of Science ID 000261050400003
-
Genomic characterization of Gli-activator targets in sonic hedgehog-mediated neural patterning
DEVELOPMENT
2007; 134 (10): 1977-1989
Abstract
Sonic hedgehog (Shh) acts as a morphogen to mediate the specification of distinct cell identities in the ventral neural tube through a Gli-mediated (Gli1-3) transcriptional network. Identifying Gli targets in a systematic fashion is central to the understanding of the action of Shh. We examined this issue in differentiating neural progenitors in mouse. An epitope-tagged Gli-activator protein was used to directly isolate cis-regulatory sequences by chromatin immunoprecipitation (ChIP). ChIP products were then used to screen custom genomic tiling arrays of putative Hedgehog (Hh) targets predicted from transcriptional profiling studies, surveying 50-150 kb of non-transcribed sequence for each candidate. In addition to identifying expected Gli-target sites, the data predicted a number of unreported direct targets of Shh action. Transgenic analysis of binding regions in Nkx2.2, Nkx2.1 (Titf1) and Rab34 established these as direct Hh targets. These data also facilitated the generation of an algorithm that improved in silico predictions of Hh target genes. Together, these approaches provide significant new insights into both tissue-specific and general transcriptional targets in a crucial Shh-mediated patterning process.
View details for DOI 10.1242/dev.001966
View details for Web of Science ID 000246138700016
View details for PubMedID 17442700
-
FoxOs are lineage-restricted redundant tumor suppressors and regulate endothelial cell homeostasis
CELL
2007; 128 (2): 309-323
Abstract
Activated phosphoinositide 3-kinase (PI3K)-AKT signaling appears to be an obligate event in the development of cancer. The highly related members of the mammalian FoxO transcription factor family, FoxO1, FoxO3, and FoxO4, represent one of several effector arms of PI3K-AKT signaling, prompting genetic analysis of the role of FoxOs in the neoplastic phenotypes linked to PI3K-AKT activation. While germline or somatic deletion of up to five FoxO alleles produced remarkably modest neoplastic phenotypes, broad somatic deletion of all FoxOs engendered a progressive cancer-prone condition characterized by thymic lymphomas and hemangiomas, demonstrating that the mammalian FoxOs are indeed bona fide tumor suppressors. Transcriptome and promoter analyses of differentially affected endothelium identified direct FoxO targets and revealed that FoxO regulation of these targets in vivo is highly context-specific, even in the same cell type. Functional studies validated Sprouty2 and PBX1, among others, as FoxO-regulated mediators of endothelial cell morphogenesis and vascular homeostasis.
View details for DOI 10.1016/j.cell.2006.12.029
View details for Web of Science ID 000244420500016
View details for PubMedID 17254969
View details for PubMedCentralID PMC1855089
-
Exon arrays provide accurate assessments of gene expression
GENOME BIOLOGY
2007; 8 (5)
Abstract
We have developed a strategy for estimating gene expression on Affymetrix Exon arrays. The method includes a probe-specific background correction and a probe selection strategy in which a subset of probes with highly correlated intensities across multiple samples are chosen to summarize gene expression. Our results demonstrate that the proposed background model offers improvements over the default Affymetrix background correction and that Exon arrays may provide more accurate measurements of gene expression than traditional 3' arrays.
View details for DOI 10.1186/gb-2007-8-5-r82
View details for Web of Science ID 000246983100029
View details for PubMedID 17504534
View details for PubMedCentralID PMC1929160
-
Probe Selection and Expression Index Computation of Affymetrix Exon Arrays
PLOS ONE
2006; 1 (1)
Abstract
There is great current interest in developing microarray platforms for measuring mRNA abundance at both gene level and exon level. The Affymetrix Exon Array is a new high-density gene expression microarray platform, with over six million probes targeting all annotated and predicted exons in a genome. An important question for the analysis of exon array data is how to compute overall gene expression indexes. Because of the complexity of the design of exon array probes, this problem is different in nature from summarizing gene-level expression from traditional 3' expression arrays. In this manuscript, we use exon array data from 11 human tissues to study methods for computing gene-level expression. We showed that for most genes there is a subset of exon array probes having highly correlated intensities across multiple samples. We suggest that these probes could be used as reliable indicators of overall gene expression levels. We developed a probe selection algorithm to select such a subset of highly correlated probes for each gene, and computed gene expression indexes using the selected probes. Our results demonstrate that probe selection improves gene expression estimates from exon arrays. The selected probes can be used in future analyses of other exon array datasets to compute gene expression indexes.
View details for DOI 10.1371/journal.pone.0000088
View details for Web of Science ID 000207443600087
View details for PubMedID 17183719
View details for PubMedCentralID PMC1762343
-
A comparative analysis of genome-wide chromatin immunoprecipitation data for mammalian transcription factors
NUCLEIC ACIDS RESEARCH
2006; 34 (21)
Abstract
Genome-wide location analysis (ChIP-chip, ChIP-PET) is a powerful technique to study mammalian transcriptional regulation. In order to obtain a basic understanding of the location data generated for mammalian transcription factors and potential issues in their analysis, we conducted a comparative study of eight independent ChIP experiments involving six different transcription factors in human and mouse. Our cross-study comparisons, to the best of our knowledge the first to analyze multiple datasets, revealed the importance of carefully chosen genomic controls in the de novo identification of key transcription factor binding motifs, raised issues about the interpretation of ubiquitously occurring sequence motifs, and demonstrated the clustering tendency of protein-binding regions for certain transcription factors.
View details for DOI 10.1093/nar/gkl803
View details for Web of Science ID 000242716800004
View details for PubMedID 17090591
View details for PubMedCentralID PMC1669715
-
Computational biology: Toward deciphering gene regulatory information in mammalian genomes
BIOMETRICS
2006; 62 (3): 645-663
Abstract
Computational biology is a rapidly evolving area where methodologies from computer science, mathematics, and statistics are applied to address fundamental problems in biology. The study of gene regulatory information is a central problem in current computational biology. This article reviews recent development of statistical methods related to this field. Starting from microarray gene selection, we examine methods for finding transcription factor binding motifs and cis-regulatory modules in coregulated genes, and methods for utilizing information from cross-species comparisons and ChIP-chip experiments. The ultimate understanding of cis-regulatory logic in mammalian genomes may require the integration of information collected from all these steps.
View details for DOI 10.1111/j.1541-0420.2006.00625.x
View details for Web of Science ID 000240708300001
View details for PubMedID 16984301
-
Is the future biology Shakespearean or Newtonian?
MOLECULAR BIOSYSTEMS
2006; 2 (9): 411-416
Abstract
"Cells do not care about mathematics" thus concluded a biologist friend after a discussion on the future of biology. And indeed, why should they care? But if we exchange the word "cell" with "rock", "Moon" or "electrons", do we have to change the sentence also? Starting from this line of thought, we review some recent developments in understanding the stochastic behavior of biological systems. We emphasize the importance of a molecular Signal Generator in the study of genetic networks.
View details for DOI 10.1039/b607243g
View details for Web of Science ID 000240284300007
View details for PubMedID 17153137
-
A tale of two morphogen gradients: Identifying Gli targets of Hedgehog Signaling
65th Annual Meeting of the Society-for-Developmental-Biology
ACADEMIC PRESS INC ELSEVIER SCIENCE. 2006: 423–23
View details for DOI 10.1016/j.ydbio.2006.04.301
View details for Web of Science ID 000238996200296
-
A study of density of states and ground states in hydrophobic-hydrophilic protein folding models by equi-energy sampling
JOURNAL OF CHEMICAL PHYSICS
2006; 124 (24)
Abstract
We propose an equi-energy (EE) sampling approach to study protein folding in the two-dimensional hydrophobic-hydrophilic (HP) lattice model. This approach enables efficient exploration of the global energy landscape and provides accurate estimates of the density of states, which then allows us to conduct a detailed study of the thermodynamics of HP protein folding, in particular, on the temperature dependence of the transition from folding to unfolding and on how sequence composition affects this phenomenon. With no extra cost, this approach also provides estimates on global energy minima and ground states. Without using any prior structural information of the protein the EE sampler is able to find the ground states that match the best known results in most benchmark cases. The numerical results demonstrate it as a powerful method to study lattice protein folding models.
View details for DOI 10.1063/1.2208607
View details for Web of Science ID 000238730600039
View details for PubMedID 16821999
-
Inferring loss-of-heterozygosity from unpaired tumors using high-density oligonucleotide SNP arrays
PLOS COMPUTATIONAL BIOLOGY
2006; 2 (5): 323-332
Abstract
Loss of heterozygosity (LOH) of chromosomal regions bearing tumor suppressors is a key event in the evolution of epithelial and mesenchymal tumors. Identification of these regions usually relies on genotyping tumor and counterpart normal DNA and noting regions where heterozygous alleles in the normal DNA become homozygous in the tumor. However, paired normal samples for tumors and cell lines are often not available. With the advent of oligonucleotide arrays that simultaneously assay thousands of single-nucleotide polymorphism (SNP) markers, genotyping can now be done at high enough resolution to allow identification of LOH events by the absence of heterozygous loci, without comparison to normal controls. Here we describe a hidden Markov model-based method to identify LOH from unpaired tumor samples, taking into account SNP intermarker distances, SNP-specific heterozygosity rates, and the haplotype structure of the human genome. When we applied the method to data genotyped on 100 K arrays, we correctly identified 99% of SNP markers as either retention or loss. We also correctly identified 81% of the regions of LOH, including 98% of regions greater than 3 megabases. By integrating copy number analysis into the method, we were able to distinguish LOH from allelic imbalance. Application of this method to data from a set of prostate samples without paired normals identified known regions of prevalent LOH. We have developed a method for analyzing high-density oligonucleotide SNP array data to accurately identify regions of LOH and retention in tumors without the need for paired normal samples.
View details for DOI 10.1371/journal.pcbi.0020041
View details for Web of Science ID 000239493900001
View details for PubMedID 16699594
View details for PubMedCentralID PMC1458964
-
Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data
BMC BIOINFORMATICS
2006; 7
Abstract
Like microarray-based investigations, high-throughput proteomics techniques require machine learning algorithms to identify biomarkers that are informative for biological classification problems. Feature selection and classification algorithms need to be robust to noise and outliers in the data. We developed a recursive support vector machine (R-SVM) algorithm to select important genes/biomarkers for the classification of noisy data. We compared its performance to a similar, state-of-the-art method (SVM recursive feature elimination or SVM-RFE), paying special attention to the ability to recover the true informative genes/biomarkers and the robustness to outliers in the data. Simulation experiments show that a 5% to 20% improvement over SVM-RFE can be achieved with regard to these properties. The SVM-based methods are also compared with a conventional univariate method and their respective strengths and weaknesses are discussed. R-SVM was applied to two sets of SELDI-TOF-MS proteomics data, one from a human breast cancer study and the other from a study on rat liver cirrhosis. Important biomarkers found by the algorithm were validated by follow-up biological experiments. The proposed R-SVM method is suitable for analyzing noisy high-throughput proteomics and microarray data and it outperforms SVM-RFE in the robustness to noise and in the ability to recover informative features. The multivariate SVM-based method outperforms the univariate method in the classification performance, but univariate methods can reveal more of the differentially expressed features especially when there are correlations between the features.
View details for DOI 10.1186/1471-2105-7-197
View details for Web of Science ID 000237263600001
View details for PubMedID 16606446
-
An improved distance measure between the expression profiles linking co-expression and co-regulation in mouse
BMC BIOINFORMATICS
2006; 7
Abstract
Many statistical algorithms combine microarray expression data and genome sequence data to identify transcription factor binding motifs in lower eukaryotic genomes. Finding cis-regulatory elements in higher eukaryote genomes, however, remains a challenge, as searching in the promoter regions of genes with similar expression patterns often fails. The difficulty is partially attributable to the poor performance of the similarity measures for comparing expression profiles. The widely accepted measures are inadequate for distinguishing genes transcribed from distinct regulatory mechanisms in the complicated genomes of higher eukaryotes. By defining the regulatory similarity between a gene pair as the number of common known transcription factor binding motifs in the promoter regions, we compared the performance of several expression distance measures on seven mouse expression data sets. We propose a new distance measure that accounts for both the linear trends and fold-changes of expression across the samples. The study reveals that the proposed distance measure for comparing expression profiles enables us to identify genes with a large number of common regulatory elements because it reflects the inherent regulatory information better than widely accepted distance measures such as the Pearson's correlation or cosine correlation with or without log transformation.
View details for DOI 10.1186/1471-2105-7-44
View details for Web of Science ID 000236062200001
View details for PubMedID 16438730
-
Reliable prediction of transcription factor binding sites by phylogenetic verification
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2005; 102 (47): 16945-16950
Abstract
We present a statistical methodology that largely improves the accuracy in computational predictions of transcription factor (TF) binding sites in eukaryote genomes. This method models the cross-species conservation of binding sites without relying on accurate sequence alignment. It can be coupled with any motif-finding algorithm that searches for overrepresented sequence motifs in individual species and can increase the accuracy of the coupled motif-finding algorithm. Because this method is capable of accurately detecting TF binding sites, it also enhances our ability to predict the cis-regulatory modules. We applied this method to the published chromatin immunoprecipitation (ChIP)-chip data in Saccharomyces cerevisiae and found that its sensitivity and specificity are 9% and 14% higher than those of two recent methods. We also recovered almost all of the previously verified TF binding sites and made predictions on the cis-regulatory elements that govern the tight regulation of ribosomal protein genes in 13 eukaryote species (2 plants, 4 yeasts, 2 worms, 2 insects, and 3 mammals). These results give insights into the transcriptional regulation in eukaryotic organisms.
View details for DOI 10.1073/pnas.0504201102
View details for Web of Science ID 000233463200009
View details for PubMedID 16286651
View details for PubMedCentralID PMC1283155
-
De novo discovery of a tissue-specific gene regulatory module in a chordate
GENOME RESEARCH
2005; 15 (10): 1315-1324
Abstract
We engage the experimental and computational challenges of de novo regulatory module discovery in a complex and largely unstudied metazoan genome. Our analysis is based on the comprehensive characterization of regulatory elements of 20 muscle genes in the chordate, Ciona savignyi. Three independent types of data we generate contribute to the characterization of a muscle-specific regulatory module: (1) Positive elements (PEs), short sequences sufficient for strong muscle expression that are identified in a high-resolution in vivo analysis; (2) CisModules (CMs), candidate regulatory modules defined by clusters of overrepresented motifs predicted de novo; and (3) Conserved elements (CEs), short noncoding sequences of strong conservation between C. savignyi and C. intestinalis. We estimate the accuracy of the computational predictions by an analysis of the intersection of these data. As final biological validation of the discovered muscle regulatory module, we implement a novel algorithm to search the genome for instances of the module and identify seven novel enhancers.
View details for DOI 10.1101/gr.4062605
View details for PubMedID 16169925
-
HumanUpstream and MouseUpstream: Databases of promoter sequences in the human and mouse genomes
OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY
2005; 9 (3): 220-224
Abstract
Large-scale genome annotations, based largely on gene prediction programs, may be inaccurate in their predictions of transcription start sites, so that the identification of promoter regions remains unreliable. Here we focus on the identification of reliable gene promoter regions, critical to the understanding of transcriptional regulation. We report the construction of databases of upstream sequences, Human Upstream and Mouse Upstream, based on information from both the human and mouse genomes and the database of expressed sequence tags (dbEST). Using the ENSEMBL generic genome annotation system, our approach allows more reliable identification of transcript start sites, and therefore extraction of more reliable promoter regions. The Human Upstream and Mouse Upstream databases are available free of charge.
View details for Web of Science ID 000232649500002
View details for PubMedID 16209636
-
TileMap: create chromosomal map of tiling array hybridizations
BIOINFORMATICS
2005; 21 (18): 3629-3636
Abstract
The tiling array is a new type of microarray that can be used to survey genomic transcriptional activities and transcription factor binding sites at high resolution. The goal of this paper is to develop effective statistical tools to identify genomic loci that show transcriptional or protein binding patterns of interest. A two-step approach is proposed and is implemented in TileMap. In the first step, a test-statistic is computed for each probe based on a hierarchical empirical Bayes model. In the second step, the test-statistics of probes within a genomic region are used to infer whether the region is of interest or not. The hierarchical empirical Bayes model shrinks variance estimates and increases sensitivity of the analysis. It allows complex multiple sample comparisons that are essential for the study of temporal and spatial patterns of hybridization across different experimental conditions. Neighboring probes are combined through a moving average method (MA) or a hidden Markov model (HMM). Unbalanced mixture subtraction is proposed to provide approximate estimates of false discovery rate for MA and model parameters for HMM. TileMap is freely available at http://biogibbs.stanford.edu/~jihk/TileMap/index.htm; the same page includes coloured versions of all figures.
View details for DOI 10.1093/bioinformatics/bti593
View details for Web of Science ID 000231694600007
View details for PubMedID 16046496
-
Identification of Gli target genes using chromatin immuno-precipitation with a genetically inducible system on genomic arrays.
64th Annual Meeting of the Society-for-Developmental-Biology
ACADEMIC PRESS INC ELSEVIER SCIENCE. 2005: 666–66
View details for Web of Science ID 000230683800463
-
Sampling motifs on phylogenetic trees
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2005; 102 (27): 9481-9486
Abstract
We present a method to find motifs by simultaneously using the overrepresentation property and the evolutionary conservation property of motifs. This method is applicable to divergent species where alignment is unreliable, which overcomes a major limitation of the current methods. The method has been applied to search regulatory motifs in four yeast species based on ChIP-chip data in Saccharomyces cerevisiae and obtained 20% higher accuracy than the best current methods. We also discovered cis-regulatory elements that govern the tight regulation of ribosomal protein genes in two distantly related insects by using this method. These results demonstrate that our method will be useful for the extraction of regulatory signals in multiple genomes.
View details for DOI 10.1073/pnas.0501620102
View details for Web of Science ID 000230406000010
View details for PubMedID 15983378
View details for PubMedCentralID PMC1160516
-
mSin3A corepressor regulates diverse transcriptional networks governing normal and neoplastic growth and survival
GENES & DEVELOPMENT
2005; 19 (13): 1581-1595
Abstract
mSin3A is a core component of a large multiprotein corepressor complex with associated histone deacetylase (HDAC) enzymatic activity. Physical interactions of mSin3A with many sequence-specific transcription factors have linked the mSin3A corepressor complex to the regulation of diverse signaling pathways and associated biological processes. To dissect the complex nature of mSin3A's actions, we monitored the impact of conditional mSin3A deletion on the developmental, cell biological, and transcriptional levels. mSin3A was shown to play an essential role in early embryonic development and in the proliferation and survival of primary, immortalized, and transformed cells. Genetic and biochemical analyses established a role for mSin3A/HDAC in p53 deacetylation and activation, although genetic deletion of p53 was not sufficient to attenuate the mSin3A null cell lethal phenotype. Consistent with mSin3A's broad biological activities beyond regulation of the p53 pathway, time-course gene expression profiling following mSin3A deletion revealed deregulation of genes involved in cell cycle regulation, DNA replication, DNA repair, apoptosis, chromatin modifications, and mitochondrial metabolism. Computational analysis of the mSin3A transcriptome using a knowledge-based database revealed several nodal points through which mSin3A influences gene expression, including the Myc-Mad, E2F, and p53 transcriptional networks. Further validation of these nodes derived from in silico promoter analysis showing enrichment for Myc-Mad, E2F, and p53 cis-regulatory elements in regulatory regions of up-regulated genes following mSin3A depletion. Significantly, in silico promoter analyses also revealed specific cis-regulatory elements binding the transcriptional activator Stat and the ISWI ATP-dependent nucleosome remodeling factor Falz, thereby expanding further the mSin3A network of regulatory factors. Together, these integrated genetic, biochemical, and computational studies demonstrate the involvement of mSin3A in the regulation of diverse pathways governing many aspects of normal and neoplastic growth and survival and provide an experimental framework for the analysis of essential genes with diverse biological functions.
View details for DOI 10.1101/gad.1286905
View details for Web of Science ID 000230334600008
View details for PubMedID 15998811
View details for PubMedCentralID PMC1172064
-
A small-molecule inhibitor of Mps1 blocks the spindle-checkpoint response to a lack of tension on mitotic chromosomes
CURRENT BIOLOGY
2005; 15 (11): 1070-1076
Abstract
The spindle checkpoint prevents chromosome loss by preventing chromosome segregation in cells with improperly attached chromosomes [1, 2 and 3]. The checkpoint senses defects in the attachment of chromosomes to the mitotic spindle [4] and the tension exerted on chromosomes by spindle forces in mitosis [5, 6 and 7]. Because many cancers have defects in chromosome segregation, this checkpoint may be required for survival of tumor cells and may be a target for chemotherapy. We performed a phenotype-based chemical-genetic screen in budding yeast and identified an inhibitor of the spindle checkpoint, called cincreasin. We used a genome-wide collection of yeast gene-deletion strains and traditional genetic and biochemical analysis to show that the target of cincreasin is Mps1, a protein kinase required for checkpoint function [8]. Despite the requirement for Mps1 for sensing both the lack of microtubule attachment and tension at kinetochores, we find concentrations of cincreasin that selectively inhibit the tension-sensitive branch of the spindle checkpoint. At these concentrations, cincreasin causes lethal chromosome missegregation in mutants that display chromosomal instability. Our results demonstrate that Mps1 can be exploited as a target and that inhibiting the tension-sensitive branch of the spindle checkpoint may be a way of selectively killing cancer cells that display chromosomal instability.
View details for DOI 10.1016/j.cub.2005.05.020
View details for Web of Science ID 000229984100031
View details for PubMedID 15936280
-
UbIC² - Towards ubiquitous bio-information computing: Data protocols, middleware, and web services for heterogeneous biological information integration and retrieval
INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING
2005; 15 (3): 475-485
View details for Web of Science ID 000230937200002
-
A boosting approach for motif modeling using ChIP-chip data
BIOINFORMATICS
2005; 21 (11): 2636-2643
Abstract
Building an accurate binding model for a transcription factor (TF) is essential to differentiate its true binding targets from spurious ones. This is an important step toward understanding gene regulation. This paper describes a boosting approach to modeling TF-DNA binding. Different from the widely used weight matrix model, which predicts TF-DNA binding based on a linear combination of position-specific contributions, our approach builds a TF binding classifier by combining a set of weight matrix based classifiers, thus yielding a non-linear binding decision rule. The proposed approach was applied to the ChIP-chip data of Saccharomyces cerevisiae. When compared with the weight matrix method, our new approach showed significant improvements on the specificity in a majority of cases.
View details for DOI 10.1093/bioinformatics/bti402
View details for Web of Science ID 000229441500010
View details for PubMedID 15817698
-
Tight clustering: A resampling-based approach for identifying stable and tight patterns in data
BIOMETRICS
2005; 61 (1): 10-16
Abstract
In this article, we propose a method for clustering that produces tight and stable clusters without forcing all points into clusters. The methodology is general but was initially motivated from cluster analysis of microarray experiments. Most current algorithms aim to assign all genes into clusters. For many biological studies, however, we are mainly interested in identifying the most informative, tight, and stable clusters of sizes, say, 20-60 genes for further investigation. We want to avoid the contamination of tightly regulated expression patterns of biologically relevant genes due to other genes whose expressions are only loosely compatible with these patterns. "Tight clustering" has been developed specifically to address this problem. It applies K-means clustering as an intermediate clustering engine. Early truncation of a hierarchical clustering tree is used to overcome the local minimum problem in K-means clustering. The tightest and most stable clusters are identified in a sequential manner through an analysis of the tendency of genes to be grouped together under repeated resampling. We validated this method in a simulated example and applied it to analyze a set of expression profiles in the study of embryonic stem cells.
View details for Web of Science ID 000227576600002
View details for PubMedID 15737073
-
Comparative linkage analysis and visualization of high-density oligonucleotide SNP array data
BMC GENETICS
2005; 6
Abstract
The identification of disease-associated genes using single nucleotide polymorphisms (SNPs) has been increasingly reported. In particular, the Affymetrix Mapping 10 K SNP microarray platform uses one PCR primer to amplify the DNA samples and determine the genotype of more than 10,000 SNPs in the human genome. This provides the opportunity for large scale, rapid and cost-effective genotyping assays for linkage analysis. However, the analysis of such datasets is nontrivial because of the large number of markers, and visualizing the linkage scores in the context of genome maps remains less automated using the current linkage analysis software packages. For example, the haplotyping results are commonly represented in the text format. Here we report the development of a novel software tool called CompareLinkage for automated formatting of the Affymetrix Mapping 10 K genotype data into the "Linkage" format and the subsequent analysis with multi-point linkage software programs such as Merlin and Allegro. The new software has the ability to visualize the results for all these programs in dChip in the context of genome annotations and cytoband information. In addition we implemented a variant of the Lander-Green algorithm in the dChipLinkage module of dChip software (V1.3) to perform parametric linkage analysis and haplotyping of SNP array data. These functions are integrated with the existing modules of dChip to visualize SNP genotype data together with LOD score curves. We have analyzed three families with recessive and dominant diseases using the new software programs and the comparison results are presented and discussed. The CompareLinkage and dChipLinkage software packages are freely available. They provide the visualization tools for high-density oligonucleotide SNP array data, as well as the automated functions for formatting SNP array data for the linkage analysis programs Merlin and Allegro and calling these programs for linkage analysis. The results can be visualized in dChip in the context of genes and cytobands. In addition, a variant of the Lander-Green algorithm is provided that allows parametric linkage analysis and haplotyping.
View details for DOI 10.1186/1471-2156-6-7
View details for Web of Science ID 000227316700001
View details for PubMedID 15713228
View details for PubMedCentralID PMC551603
-
GeneNotes - A novel information management software for biologists
BMC BIOINFORMATICS
2005; 6
Abstract
Collecting and managing information is a challenging task in a genome-wide profiling research project. Most databases and online computational tools require direct human involvement. Information and computational results are presented in various multimedia formats (e.g., text, image, PDF, Word files, etc.), many of which cannot be automatically processed by computers in biologically meaningful ways. In addition, the quality of computational results is far from perfect and requires nontrivial manual examination. The timely selection, integration and interpretation of heterogeneous biological information still heavily rely on the sensibility of biologists. Biologists often feel overwhelmed by the huge amount and great diversity of distributed heterogeneous biological information. We developed an information management application called GeneNotes. GeneNotes is the first application that allows users to collect and manage multimedia biological information about genes/ESTs. GeneNotes provides an integrated environment for users to surf the Internet, collect notes for genes/ESTs, and retrieve notes. GeneNotes is supported by a server that integrates gene annotations from many major databases (e.g., HGNC, MGI, etc.). GeneNotes uses the integrated gene annotations to (a) identify genes given various types of gene IDs (e.g., RefSeq ID, GenBank ID, etc.), and (b) provide quick views of genes. GeneNotes is free for academic usage. The program and the tutorials are available at: http://bayes.fas.harvard.edu/genenotes/. GeneNotes provides a novel human-computer interface to assist researchers to collect and manage biological information. It also provides a platform for studying how users behave when they manipulate biological information. The results of such study can lead to innovation of more intelligent human-computer interfaces that greatly shorten the cycle of biology research.
View details for DOI 10.1186/1471-2105-6-20
View details for Web of Science ID 000227451700001
View details for PubMedID 15686593
View details for PubMedCentralID PMC549201
-
Functional annotation and network reconstruction through cross-platform integration of microarray data
NATURE BIOTECHNOLOGY
2005; 23 (2): 238-243
Abstract
The rapid accumulation of microarray data translates into a need for methods to effectively integrate data generated with different platforms. Here we introduce an approach, 2nd-order expression analysis, that addresses this challenge by first extracting expression patterns as meta-information from each data set (1st-order expression analysis) and then analyzing them across multiple data sets. Using yeast as a model system, we demonstrate two distinct advantages of our approach: we can identify genes of the same function yet without coexpression patterns and we can elucidate the cooperativities between transcription factors for regulatory network reconstruction by overcoming a key obstacle, namely the quantification of activities of transcription factors. Experiments reported in the literature and performed in our lab support a significant number of our predictions.
View details for DOI 10.1038/nbt1058
View details for Web of Science ID 000226797600032
View details for PubMedID 15654329
-
Detect and adjust for population stratification in population-based association study using genomic control markers: an application of Affymetrix Genechip (R) Human Mapping 10K array
EUROPEAN JOURNAL OF HUMAN GENETICS
2004; 12 (12): 1001-1006
Abstract
Population-based association design is often compromised by false or nonreplicable findings, partially due to population stratification. Genomic control (GC) approaches were proposed to detect and adjust for this confounder. To date, the performance of this strategy has not been extensively evaluated on real data. More than 10 000 single-nucleotide polymorphisms (SNPs) were genotyped on subjects from four populations (including an Asian, an African-American and two Caucasian populations) using GeneChip Mapping 10 K array. On these data, we tested the performance of two GC approaches in different scenarios including various numbers of GC markers and different degrees of population stratification. In the scenario of substantial population stratification, both GC approaches are sensitive using only 20-50 random SNPs, and the mixed subjects can be separated into homogeneous subgroups. In the scenario of moderate stratification, both GC approaches have poor sensitivities. However, the bias in association test can still be corrected even when no statistically significant population stratification is detected. We conducted extensive benchmark analyses on GC approaches using SNPs over the whole human genome. We found that the GC method can cluster subjects into homogeneous subgroups if there is a substantial difference in genetic background. The inflation factor, estimated by GC markers, can effectively adjust for the confounding effect of population stratification regardless of its extent. We also suggest that as few as 50 random SNPs with heterozygosity >40% should be sufficient as genomic controls.
View details for DOI 10.1038/sj.ejhg.5201273
View details for Web of Science ID 000225165200004
View details for PubMedID 15367915
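The inflation-factor adjustment mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used median-based estimator of the inflation factor for 1-degree-of-freedom chi-square association statistics; the abstract does not say which two GC approaches were evaluated, so the function names and the choice of estimator here are illustrative assumptions rather than the paper's procedure.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(null_marker_stats):
    """Estimate the inflation factor from chi-square statistics computed
    at genomic-control (presumed null) markers. Assumed estimator:
    observed median divided by the theoretical median, floored at 1."""
    lam = median(null_marker_stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)


def adjust_statistic(candidate_stat, lam):
    """Deflate a candidate marker's chi-square statistic by lambda."""
    return candidate_stat / lam
```

In use, the statistics at the genomic-control SNPs would be passed to `inflation_factor`, and each candidate SNP's test statistic would then be divided by the result before computing a p-value.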
-
Estimation of genotype error rate using samples with pedigree information - an application on the GeneChip Mapping 10K array
GENOMICS
2004; 84 (4): 623-630
Abstract
Currently, most analytical methods assume all observed genotypes are correct; however, it is clear that errors may reduce statistical power or bias inference in genetic studies. We propose procedures for estimating error rate in genetic analysis and apply them to study the GeneChip Mapping 10K array, which is a technology that has recently become available and allows researchers to survey over 10,000 SNPs in a single assay. We employed a strategy to estimate the genotype error rate in pedigree data. First, the "dose-response" reference curve between error rate and the observable error number was derived by simulation, conditional on given pedigree structures and genotypes. Second, the error rate was estimated by calibrating the number of observed errors in real data to the reference curve. We evaluated the performance of this method by simulation study and applied it to a data set of 30 pedigrees genotyped using the GeneChip Mapping 10K array. This method performed favorably in all scenarios we surveyed. The dose-response reference curve was monotone and almost linear with a large slope. The method was able to estimate accurately the error rate under various pedigree structures and error models and under heterogeneous error rates. Using this method, we found that the average genotyping error rate of the GeneChip Mapping 10K array was about 0.1%. Our method provides a quick and unbiased solution to address the genotype error rate in pedigree data. It behaves well in a wide range of settings and can be easily applied in other genetic projects. The robust estimation of genotyping error rate allows us to estimate power and sample size and conduct unbiased genetic tests. The GeneChip Mapping 10K array has a low overall error rate, which is consistent with the results obtained from alternative genotyping assays.
View details for DOI 10.1016/j.ygeno.2004.05.003
View details for Web of Science ID 000224091200001
View details for PubMedID 15475239
-
Genomic analysis of mouse retinal development
PLOS BIOLOGY
2004; 2 (9): 1411-1431
View details for DOI 10.1371/journal.pbio.0020247
View details for Web of Science ID 000224108100020
-
CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2004; 101 (33): 12114-12119
Abstract
The regulatory information for a eukaryotic gene is encoded in cis-regulatory modules. The binding sites for a set of interacting transcription factors have the tendency to colocalize to the same modules. Current de novo motif discovery methods do not take advantage of this knowledge. We propose a hierarchical mixture approach to model the cis-regulatory module structure. Based on the model, a new de novo motif-module discovery algorithm, CisModule, is developed for the Bayesian inference of module locations and within-module motif sites. Dynamic programming-like recursions are developed to reduce the computational complexity from exponential to linear in sequence length. By using both simulated and real data sets, we demonstrate that CisModule is not only accurate in predicting modules but also more sensitive in detecting motif patterns and binding sites than standard motif discovery methods are.
View details for DOI 10.1073/pnas.0402858101
View details for Web of Science ID 000223410100038
View details for PubMedID 15297614
-
Integrated analysis of microarray data and gene function information
OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY
2004; 8 (2): 106-117
Abstract
Microarray data should be interpreted in the context of existing biological knowledge. Here we present integrated analysis of microarray data and gene function classification data using homogeneity analysis. Homogeneity analysis is a graphical multivariate statistical method for analyzing categorical data. It converts categorical data into graphical display. By simultaneously quantifying the microarray-derived gene groups and gene function categories, it captures the complex relations between biological information derived from microarray data and the existing knowledge about the gene function. Thus, homogeneity analysis provides a mathematical framework for integrating the analysis of microarray data and the existing biological knowledge.
View details for Web of Science ID 000223063500003
View details for PubMedID 15268770
-
Molecular diversity of astrocytes with implications for neurological disorders
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2004; 101 (22): 8384-8389
Abstract
The astrocyte represents the most abundant yet least understood cell type of the CNS. Here, we use a stringent experimental strategy to molecularly define the astrocyte lineage by integrating microarray datasets across several in vitro model systems of astrocyte differentiation, primary astrocyte cultures, and various astrocyte-rich CNS structures. The intersection of astrocyte data sets, coupled with the application of nonastrocytic exclusion filters, yielded many astrocyte-specific genes possessing strikingly varied patterns of regional CNS expression. Annotation of these astrocyte-specific genes provides direct molecular documentation of the diverse physiological roles of the astrocyte lineage. This global perspective in the normal brain also provides a framework for how astrocytes may participate in the pathogenesis of common neurological disorders like Alzheimer's disease, Parkinson's disease, stroke, epilepsy, and primary brain tumors.
View details for DOI 10.1073/pnas.0402140101
View details for Web of Science ID 000221831800025
View details for PubMedID 15155908
-
dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data
BIOINFORMATICS
2004; 20 (8): 1233-1240
Abstract
Oligonucleotide microarrays allow genotyping of thousands of single-nucleotide polymorphisms (SNPs) in parallel. Recently, this technology has been applied to loss-of-heterozygosity (LOH) analysis of paired normal and tumor samples. However, methods and software for analyzing such data are not fully developed. Here, we report automated methods for pooling SNP array replicates to make LOH calls, visualizing SNP and LOH data along chromosomes in the context of genes and cytobands, making statistical inference to identify shared LOH regions, clustering samples based on LOH profiles and correlating the clustering results to clinical variables. Application of these methods to prostate and breast cancer datasets generates biologically important results. The software module dChipSNP implementing these methods is available at http://biosun1.harvard.edu/complab/dchip/snp/. The breast cancer data are provided by Andrea L. Richardson, Zhigang C. Wang and James D. Iglehart.
View details for DOI 10.1093/bioinformatics/bth069
View details for Web of Science ID 000221556100004
View details for PubMedID 14871870
-
GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space.
Applied bioinformatics
2004; 3 (4): 261-264
Abstract
The analysis of complex patterns of gene regulation is central to understanding the biology of cells, tissues and organisms. Patterns of gene regulation pertaining to specific biological processes can be revealed by a variety of experimental strategies, particularly microarrays and other highly parallel methods, which generate large datasets linking many genes. Although methods for detecting gene expression have improved substantially in recent years, understanding the physiological implications of complex patterns in gene expression data is a major challenge. This article presents GoSurfer, an easy-to-use graphical exploration tool with built-in statistical features that allow a rapid assessment of the biological functions represented in large gene sets. GoSurfer takes one or two list(s) of gene identifiers (Affymetrix probe set ID) as input and retrieves all the Gene Ontology (GO) terms associated with the input genes. GoSurfer visualises these GO terms in a hierarchical tree format. With GoSurfer, users can perform statistical tests to search for the GO terms that are enriched in the annotations of the input genes. These GO terms can be highlighted on the GO tree. Users can manipulate the GO tree in various ways and interactively query the genes associated with any GO term. The user-generated graphics can be saved as graphics files, and all the GO information related to the input genes can be exported as text files. GoSurfer is a Windows-based program freely available for noncommercial use and can be downloaded at http://www.gosurfer.org. Datasets used to construct the trees shown in the figures in this article are available at http://www.gosurfer.org/download/GoSurfer.zip.
View details for PubMedID 15702958
-
Comparative analysis of gene sets in the gene ontology space under the multiple hypothesis testing framework
IEEE Computational Systems Bioinformatics Conference (CSB 2004)
IEEE COMPUTER SOC. 2004: 425–435
Abstract
The Gene Ontology (GO) resource can be used as a powerful tool to uncover the properties shared among, and specific to, a list of genes produced by high-throughput functional genomics studies, such as microarray studies. In the comparative analysis of several gene lists, researchers may be interested in knowing which GO terms are enriched in one list of genes but relatively depleted in another. Statistical tests such as Fisher's exact test or Chi-square test can be performed to search for such GO terms. However, because multiple GO terms are tested simultaneously, individual p-values from individual tests do not serve as good indicators for picking GO terms. Furthermore, because these multiple tests are highly correlated, usual multiple testing procedures that work under an independence assumption are not applicable. In this paper we introduce a procedure, based on False Discovery Rate (FDR), to treat this correlated multiple testing problem. This procedure calculates a moderately conserved estimator of q-value for every GO term. We identify the GO terms with q-values that satisfy a desired level as the significant GO terms. This procedure has been implemented into the GoSurfer software. GoSurfer is a Windows-based graphical data mining tool. It is freely available at http://www.gosurfer.org.
View details for Web of Science ID 000224127800042
View details for PubMedID 16448035
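To make the testing setup in the abstract above concrete, here is a small sketch of a one-sided Fisher-type (hypergeometric) enrichment p-value for a single GO term, followed by a q-value style adjustment across terms. The adjustment shown is the standard Benjamini-Hochberg step-up rule, used purely as a stand-in; it is not the correlation-aware estimator the paper introduces, and all function names are hypothetical.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """One-sided enrichment p-value, P(X >= k), where X counts how many
    of n listed genes carry a GO term annotated to K of N genes overall."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (stand-in for q-values),
    returned in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q
```

GO terms whose adjusted value falls below a chosen level would then be reported as significant, mirroring the selection step the abstract describes.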
-
Clustering analysis of SAGE data using a Poisson approach
GENOME BIOLOGY
2004; 5 (7)
Abstract
Serial analysis of gene expression (SAGE) data have been poorly exploited by clustering analysis owing to the lack of appropriate statistical methods that consider their specific properties. We modeled SAGE data by Poisson statistics and developed two Poisson-based distances. Their application to simulated and experimental mouse retina data shows that the Poisson-based distances are more appropriate and reliable for analyzing SAGE data compared to other commonly used distances or similarity measures such as Pearson correlation or Euclidean distance.
View details for Web of Science ID 000222429500016
View details for PubMedID 15239836
View details for PubMedCentralID PMC463327
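As a rough illustration of what a Poisson-motivated distance for tag counts can look like, the sketch below scores a gene against a cluster by a chi-square-type deviation between its observed counts and the counts expected if it shared the cluster's expression profile. The abstract does not give the formulas for the two distances it proposes, so this particular form and the function names are assumptions made for illustration only.

```python
def expected_counts(gene_total, cluster_profile):
    """Expected tag counts per library if the gene followed the cluster's
    profile, scaled to the gene's own total count."""
    total = sum(cluster_profile)
    return [gene_total * c / total for c in cluster_profile]


def poisson_chi2_distance(counts, cluster_profile):
    """Chi-square-type deviation of observed counts from the expected
    counts; the Poisson variance equals the mean, hence the division."""
    exp = expected_counts(sum(counts), cluster_profile)
    return sum((o - e) ** 2 / e for o, e in zip(counts, exp) if e > 0)
```

A gene with counts `[1, 2, 3]` and a cluster with summed counts `[100, 200, 300]` have the same profile shape, so the distance is zero even though their magnitudes differ greatly, which is a situation where Euclidean distance would report a large gap.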
-
In silico prediction of transcription factors that interact with the E2F family of transcription factors
8th International Conference on Control, Automation, Robotics and Vision (ICARCV 2004)
IEEE. 2004: 1325–1330
View details for Web of Science ID 000230484501099
-
Determination of local statistical significance of patterns in Markov sequences with application to promoter element identification
JOURNAL OF COMPUTATIONAL BIOLOGY
2004; 11 (1): 1-14
Abstract
High-level eukaryotic genomes present a particular challenge to the computational identification of transcription factor binding sites (TFBSs) because of their long noncoding regions and large numbers of repeat elements. This is evidenced by the noisy results generated by most current methods. In this paper, we present a p-value-based scoring scheme using probability generating functions to evaluate the statistical significance of potential TFBSs. Furthermore, we introduce the local genomic context into the model so that candidate sites are evaluated based both on their similarities to known binding sites and on their contrasts against their respective local genomic contexts. We demonstrate that our approach is advantageous in the prediction of myogenin and MEF2 binding sites in the human genome. We also apply LMM to large-scale human binding site sequences in situ and found that, compared to current popular methods, LMM analysis can reduce false positive errors by more than 50% without compromising sensitivity. This improvement will be of importance to any subsequent algorithm that aims to detect regulatory modules based on known PSSMs.
View details for Web of Science ID 000220234300001
View details for PubMedID 15072685
-
Towards Ubiquitous Bio-Information Computing: Data protocols, middleware, and Web services for heterogeneous biological information integration and retrieval
4th IEEE Symposium on Bioinformatics and Bioengineering (BIBE 2004)
IEEE COMPUTER SOC. 2004: 57–64
View details for Web of Science ID 000222238200009
-
ChipInfo: software for extracting gene annotation and gene ontology information for microarray analysis
NUCLEIC ACIDS RESEARCH
2003; 31 (13): 3483-3486
Abstract
To date, assembling comprehensive annotation information for all probe sets of any Affymetrix microarrays remains a time-consuming, error-prone and challenging task. ChipInfo is designed for retrieving annotation information from online databases such as NetAffx and Gene Ontology and organizing such information into easily interpretable tabular format outputs. As companion software to dChip and GoSurfer, ChipInfo enables users to independently update the information resource files of these software packages. It also has functions for computing related summary statistics of probe sets and Gene Ontology terms. ChipInfo is available at http://biosun1.harvard.edu/complab/chipinfo/.
View details for DOI 10.1093/nar/gkg598
View details for Web of Science ID 000183832900041
View details for PubMedID 12824349
View details for PubMedCentralID PMC169004
-
A method for tight clustering: with application to microarray
2nd International Computational Systems Bioinformatics Conference
IEEE COMPUTER SOC. 2003: 396–397
View details for Web of Science ID 000188997700057
-
Transitive functional annotation by shortest-path analysis of gene expression data
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (20): 12783-12788
Abstract
Current methods for the functional analysis of microarray gene expression data make the implicit assumption that genes with similar expression profiles have similar functions in cells. However, among genes involved in the same biological pathway, not all gene pairs show high expression similarity. Here, we propose that transitive expression similarity among genes can be used as an important attribute to link genes of the same biological pathway. Based on large-scale yeast microarray expression data, we use the shortest-path analysis to identify transitive genes between two given genes from the same biological process. We find that not only functionally related genes with correlated expression profiles are identified but also those without. In the latter case, we compare our method to hierarchical clustering, and show that our method can reveal functional relationships among genes in a more precise manner. Finally, we show that our method can be used to reliably predict the function of unknown genes from known genes lying on the same shortest path. We assigned functions for 146 yeast genes that are considered as unknown by the Saccharomyces Genome Database and by the Yeast Proteome Database. These genes constitute around 5% of the unknown yeast ORFome.
View details for DOI 10.1073/pnas.192159399
View details for Web of Science ID 000178391700053
View details for PubMedID 12196633
View details for PubMedCentralID PMC130537
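The shortest-path idea in the abstract above can be sketched with Dijkstra's algorithm on a gene graph whose edge weights are expression dissimilarities. The graph layout, the gene labels, and the choice of weight (for example one minus the absolute correlation) are assumptions for illustration; the abstract does not specify how the authors built or weighted their graph.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph
    {gene: {neighbor: dissimilarity}}. Returns the list of genes on the
    lowest-cost path from source to target, or None if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical example: A and C are weakly co-expressed directly,
# but both are strongly co-expressed with B.
example_graph = {
    "A": {"B": 0.1, "C": 0.9},
    "B": {"A": 0.1, "C": 0.2},
    "C": {"A": 0.9, "B": 0.2},
}
```

Calling `shortest_path(example_graph, "A", "C")` returns the path through B, which plays the role of a transitive gene linking two genes whose direct expression similarity is low.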
-
Extensive and divergent circadian gene expression in liver and heart
NATURE
2002; 417 (6884): 78-83
Abstract
Many mammalian peripheral tissues have circadian clocks; endogenous oscillators that generate transcriptional rhythms thought to be important for the daily timing of physiological processes. The extent of circadian gene regulation in peripheral tissues is unclear, and to what degree circadian regulation in different tissues involves common or specialized pathways is unknown. Here we report a comparative analysis of circadian gene expression in vivo in mouse liver and heart using oligonucleotide arrays representing 12,488 genes. We find that peripheral circadian gene regulation is extensive (≥ 8-10% of the genes expressed in each tissue), that the distributions of circadian phases in the two tissues are markedly different, and that very few genes show circadian regulation in both tissues. This specificity of circadian regulation cannot be accounted for by tissue-specific gene expression. Despite this divergence, the clock-regulated genes in liver and heart participate in overlapping, extremely diverse processes. A core set of 37 genes with similar circadian regulation in both tissues includes candidates for new clock genes and output genes, and it contains genes responsive to circulating factors with circadian or diurnal rhythms.
View details for Web of Science ID 000175307200041
View details for PubMedID 11967526
-
Recombinatoric exploration of novel folded structures: A heteropolymer-based model of protein evolutionary landscapes
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (2): 809-814
Abstract
The role of recombination in evolution is compared with that of point mutations (substitutions) in the context of a simple, polymer physics-based model mapping between sequence (genotype) and conformational (phenotype) spaces. Crossovers and point mutations of lattice chains with a hydrophobic polar code are investigated. Sequences encoding for a single ground-state conformation are considered viable and used as model proteins. Point mutations lead to diffusive walks on the evolutionary landscape, whereas crossovers can "tunnel" through barriers of diminished fitness. The degree to which crossovers allow for more efficient sequence and structural exploration depends on the relative rates of point mutations versus that of crossovers and the dispersion in fitness that characterizes the ruggedness of the evolutionary landscape. The probability that a crossover between a pair of viable sequences results in viable sequences is an order of magnitude higher than random, implying that a sequence's overall propensity to encode uniquely is embodied partially in local signals. Consistent with this observation, certain hydrophobicity patterns are significantly more favored than others among fragments (i.e., subsequences) of sequences that encode uniquely, and examples reminiscent of autonomous folding units in real proteins are found. The number of structures explored by both crossovers and point mutations is always substantially larger than that via point mutations alone, but the corresponding numbers of sequences explored can be comparable when the evolutionary landscape is rugged. Efficient structural exploration requires intermediate nonextreme ratios between point-mutation and crossover rates.
View details for Web of Science ID 000173450100050
View details for PubMedID 11805332
View details for PubMedCentralID PMC117387
-
Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects
NUCLEIC ACIDS RESEARCH
2001; 29 (12): 2549-2557
Abstract
We consider the problem of comparing the gene expression levels of cells grown under two different conditions using cDNA microarray data. We use a quality index, computed from duplicate spots on the same slide, to filter out outlying spots, poor quality genes and problematical slides. We also perform calibration experiments to show that normalization between fluorescent labels is needed and that the normalization is slide dependent and non-linear. A rank invariant method is suggested to select non-differentially expressed genes and to construct normalization curves in comparative experiments. After normalization the residuals from the calibration data are used to provide prior information on variance components in the analysis of comparative experiments. Based on a hierarchical model that incorporates several levels of variations, a method for assessing the significance of gene effects in comparative experiments is presented. The analysis is demonstrated via two groups of experiments with 125 and 4129 genes, respectively, in Escherichia coli grown in glucose and acetate.
View details for Web of Science ID 000169616100015
View details for PubMedID 11410663
View details for PubMedCentralID PMC55725
-
Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2001; 96 (454): 653-666
View details for Web of Science ID 000168986400030
-
Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data
5th Annual Lake Tahoe Symposium
WILEY-LISS. 2001: 120–125
Abstract
Algorithms for performing feature extraction and normalization on high-density oligonucleotide gene expression arrays have not been fully explored, and the impact these algorithms have on the downstream analysis is not well understood. Advances in such low-level analysis methods are essential to increase the sensitivity and specificity of detecting whether genes are present and/or differentially expressed. We have developed and implemented a number of algorithms for the analysis of expression array data in a software application, the DNA-Chip Analyzer (dChip). In this report, we describe the algorithms for feature extraction and normalization, and present validation data and comparison results with some of the algorithms currently in use.
View details for Web of Science ID 000173803200018
View details for PubMedID 11842437
-
Evolutionary Monte Carlo: Applications to Cp model sampling and change point problem
STATISTICA SINICA
2000; 10 (2): 317-342
View details for Web of Science ID 000086955700001
-
Relaxed simulated tempering for VLSI floorplan designs
4th Asia and South Pacific Design Automation Conference (ASP-DAC 99)
IEEE. 1999: 13–16
View details for Web of Science ID 000079494700004
-
Torsional relaxation for biopolymers
JOURNAL OF COMPUTATIONAL BIOLOGY
1998; 5 (4): 655-665
Abstract
We describe a method for making natural, physical movements in a chained polymer by sequentially adjusting a few neighboring torsion angles in the polymer backbone. In addition to being very fast and easy to implement, the method is also very general. It applies equally well to proteins and nucleic acids. This method is then used to design a local refinement procedure. We test the refinement procedure on the minimization of a simple energy function for proteins. The energy function has a simplified potential for hydrophobic interaction, a hydrogen-bond term, and a term for van der Waals interaction. There is considerable current interest in such simple energy functions for protein folding. When applied to refine structures found by a global search method, the refinement is able to produce a large reduction in the hydrogen-bond term and the van der Waals term of the energy. We conclude that the method is particularly effective in finding good "packing" of residues in an initially compact conformation.
View details for Web of Science ID 000078698900003
View details for PubMedID 10072082