Director of Genome Informatics, Department of Pathology (2011 - Present)
B.A.Sc., University of British Columbia, Engineering Physics (2002)
Ph.D., University of British Columbia, Genetics (2006)
Current Research and Scholarly Interests
We study how genome variation affects cellular phenotypes and how disease can be modeled in cells. To do this, we combine genomic approaches such as next-generation RNA sequencing with the development and application of state-of-the-art bioinformatics and statistical genetics methods. See our website at http://montgomerylab.stanford.edu/
- Next Generation Sequencing and Applications
BIOS 201 (Win)
Independent Studies (15)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr)
- Directed Investigation
BIOE 392 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Reading in Genetics
GENE 299 (Win, Spr)
- Directed Reading in Pathology
PATH 299 (Aut, Win, Spr, Sum)
- Early Clinical Experience in Pathology
PATH 280 (Aut, Win, Spr, Sum)
- Graduate Research
GENE 399 (Aut, Win, Spr)
- Graduate Research
PATH 399 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr)
- Medical Scholars Research
GENE 370 (Win, Spr)
- Medical Scholars Research
PATH 370 (Aut, Win, Spr, Sum)
- Out-of-Department Graduate Research
BIO 300X (Aut, Win, Spr, Sum)
- Supervised Study
GENE 260 (Aut, Win, Spr)
- Undergraduate Research
GENE 199 (Win, Spr)
- Undergraduate Research
PATH 199 (Aut, Win, Spr, Sum)
Prior Year Courses
- Genetics and Developmental Biology Training Camp
DBIO 200, GENE 200 (Aut)
- Next Generation Sequencing and Applications
BIOS 201 (Win)
Graduate and Fellowship Programs
Biomedical Informatics (PhD Program)
Directed evolution using dCas9-targeted somatic hypermutation in mammalian cells.
Engineering and study of protein function by directed evolution has been limited by the technical requirement to use global mutagenesis or introduce DNA libraries. Here, we develop CRISPR-X, a strategy to repurpose the somatic hypermutation machinery for protein engineering in situ. Using catalytically inactive dCas9 to recruit variants of cytidine deaminase (AID) with MS2-modified sgRNAs, we can specifically mutagenize endogenous targets with limited off-target damage. This generates diverse libraries of localized point mutations and can target multiple genomic locations simultaneously. We mutagenize GFP and select for spectrum-shifted variants, including EGFP. Additionally, we mutate the target of the cancer therapeutic bortezomib, PSMB5, and identify known and novel mutations that confer bortezomib resistance. Finally, using a hyperactive AID variant, we mutagenize loci both upstream and downstream of transcriptional start sites. These experiments illustrate a powerful approach to create complex libraries of genetic variants in native context, which is broadly applicable to investigate and improve protein function.
View details for DOI 10.1038/nmeth.4038
View details for PubMedID 27798611
Small RNA Sequencing in Cells and Exosomes Identifies eQTLs and 14q32 as a Region of Active Export.
G3 (Bethesda, Md.)
2017; 7 (1): 31-39
Exosomes are small extracellular vesicles that carry heterogeneous cargo, including RNA, between cells. Increasing evidence suggests that exosomes are important mediators of intercellular communication and biomarkers of disease. Despite this, the variability of exosomal RNA between individuals has not been well quantified. To assess this variability, we sequenced the small RNA of cells and exosomes from a 17-member family. Across individuals, we show that selective export of miRNAs occurs not only at the level of specific transcripts, but that a cluster of 74 mature miRNAs on chromosome 14q32 is massively exported in exosomes while mostly absent from cells. We also observe more interindividual variability between exosomal samples than between cellular ones and identify four miRNA expression quantitative trait loci shared between cells and exosomes. Our findings indicate that genomically colocated miRNAs can be exported together and highlight the variability in exosomal miRNA levels between individuals as relevant for exosome use as diagnostics.
View details for DOI 10.1534/g3.116.036137
View details for PubMedID 27799337
View details for PubMedCentralID PMC5217120
DNA Methylation Profiling of Uniparental Disomy Subjects Provides a Map of Parental Epigenetic Bias in the Human Genome.
American journal of human genetics
2016; 99 (3): 555-566
Genomic imprinting is a mechanism in which gene expression varies depending on parental origin. Imprinting occurs through differential epigenetic marks on the two parental alleles, with most imprinted loci marked by the presence of differentially methylated regions (DMRs). To identify sites of parental epigenetic bias, here we have profiled DNA methylation patterns in a cohort of 57 individuals with uniparental disomy (UPD) for 19 different chromosomes, defining imprinted DMRs as sites where the maternal and paternal methylation levels diverge significantly from the biparental mean. Using this approach we identified 77 DMRs, including nearly all those described in previous studies, in addition to 34 DMRs not previously reported. These include a DMR at TUBGCP5 within the recurrent 15q11.2 microdeletion region, suggesting potential parent-of-origin effects associated with this genomic disorder. We also observed a modest parental bias in DNA methylation levels at every CpG analyzed across ∼1.9 Mb of the 15q11-q13 Prader-Willi/Angelman syndrome region, demonstrating that the influence of imprinting is not limited to individual regulatory elements such as CpG islands, but can extend across entire chromosomal domains. Using RNA-seq data, we detected signatures consistent with imprinted expression associated with nine novel DMRs. Finally, using a population sample of 4,004 blood methylomes, we define patterns of epigenetic variation at DMRs, identifying rare individuals with global gain or loss of methylation across multiple imprinted loci. Our data provide a detailed map of parental epigenetic bias in the human genome, providing insights into potential parent-of-origin effects.
View details for DOI 10.1016/j.ajhg.2016.06.032
View details for PubMedID 27569549
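The DMR definition in the abstract above (maternal and paternal methylation diverging from the biparental mean) can be illustrated with a minimal Python sketch. This is not the study's method: the `is_candidate_dmr` helper, the fixed `min_shift` threshold used in place of a significance test, the requirement that the two shifts point in opposite directions, and the toy methylation values are all assumptions made for illustration.

```python
from statistics import mean


def is_candidate_dmr(maternal, paternal, biparental, min_shift=0.2):
    """Flag a site when maternal-UPD and paternal-UPD methylation shift in
    opposite directions from the biparental mean by at least `min_shift`.

    Assumption: a fixed threshold stands in for a real significance test.
    """
    base = mean(biparental)
    d_mat = mean(maternal) - base
    d_pat = mean(paternal) - base
    opposite = d_mat * d_pat < 0
    return opposite and abs(d_mat) >= min_shift and abs(d_pat) >= min_shift


# Hypothetical methylation fractions (0 to 1) per site:
# (maternal UPD samples, paternal UPD samples, biparental controls)
toy_sites = {
    "site_1": ([0.95, 0.90], [0.05, 0.10], [0.50, 0.52, 0.48]),
    "site_2": ([0.51, 0.49], [0.50, 0.52], [0.50, 0.51, 0.49]),
}

candidates = [
    name for name, (m, p, b) in toy_sites.items() if is_candidate_dmr(m, p, b)
]
```

In this toy input only `site_1` is flagged, because its maternal and paternal samples sit on opposite sides of the biparental mean by a wide margin.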
Impact of the X Chromosome and sex on regulatory variation
2016; 26 (6): 768-777
The X Chromosome, with its unique mode of inheritance, contributes to differences between the sexes at a molecular level, including sex-specific gene expression and sex-specific impact of genetic variation. Improving our understanding of these differences offers to elucidate the molecular mechanisms underlying sex-specific traits and diseases. However, to date, most studies have either ignored the X Chromosome or had insufficient power to test for the sex-specific impact of genetic variation. By analyzing whole blood transcriptomes of 922 individuals, we have conducted the first large-scale, genome-wide analysis of the impact of both sex and genetic variation on patterns of gene expression, including comparison between the X Chromosome and autosomes. We identified a depletion of expression quantitative trait loci (eQTL) on the X Chromosome, especially among genes under high selective constraint. In contrast, we discovered an enrichment of sex-specific regulatory variants on the X Chromosome. To resolve the molecular mechanisms underlying such effects, we generated chromatin accessibility data through ATAC-sequencing to connect sex-specific chromatin accessibility to sex-specific patterns of expression and regulatory variation. As sex-specific regulatory variants discovered in our study can inform sex differences in heritable disease prevalence, we integrated our data with genome-wide association study data for multiple immune traits identifying several traits with significant sex biases in genetic susceptibilities. Together, our study provides genome-wide insight into how genetic variation, the X Chromosome, and sex shape human gene regulation and disease.
View details for DOI 10.1101/gr.197897.115
View details for Web of Science ID 000377090400005
View details for PubMedID 27197214
View details for PubMedCentralID PMC4889977
- An Efficient Multiple-Testing Adjustment for eQTL Studies that Accounts for Linkage Disequilibrium between Variants AMERICAN JOURNAL OF HUMAN GENETICS 2016; 98 (1): 216-224
ORegAnno 3.0: a community-driven resource for curated regulatory annotation.
Nucleic acids research
2016; 44 (D1): D126-32
The Open Regulatory Annotation database (ORegAnno) is a resource for curated regulatory annotation. It contains information about regulatory regions, transcription factor binding sites, RNA binding sites, regulatory variants, haplotypes, and other regulatory elements. ORegAnno differentiates itself from other regulatory resources by facilitating crowd-sourced interpretation and annotation of regulatory observations from the literature and highly curated resources. It contains a comprehensive annotation scheme that aims to describe both the elements and outcomes of regulatory events. Moreover, ORegAnno assembles these disparate data sources and annotations into a single, high quality catalogue of curated regulatory information. The current release is an update of the database previously featured in the NAR Database Issue, and now contains 1 948 307 records, across 18 species, with a combined coverage of 334 215 080 bp. Complete records, annotation, and other associated data are available for browsing and download at http://www.oreganno.org/.
View details for DOI 10.1093/nar/gkv1203
View details for PubMedID 26578589
Integrative functional genomics identifies regulatory mechanisms at coronary artery disease loci.
2016; 7: 12092
Coronary artery disease (CAD) is the leading cause of mortality and morbidity, driven by both genetic and environmental risk factors. Meta-analyses of genome-wide association studies have identified >150 loci associated with CAD and myocardial infarction susceptibility in humans. A majority of these variants reside in non-coding regions and are co-inherited with hundreds of candidate regulatory variants, presenting a challenge to elucidate their functions. Herein, we use integrative genomic, epigenomic and transcriptomic profiling of perturbed human coronary artery smooth muscle cells and tissues to begin to identify causal regulatory variation and mechanisms responsible for CAD associations. Using these genome-wide maps, we prioritize 64 candidate variants and perform allele-specific binding and expression analyses at seven top candidate loci: 9p21.3, SMAD3, PDGFD, IL6R, BMP1, CCDC97/TGFB1 and LMOD1. We validate our findings in expression quantitative trait loci cohorts, which together reveal new links between CAD associations and regulatory function in the appropriate disease context.
View details for DOI 10.1038/ncomms12092
View details for PubMedID 27386823
- A global reference for human genetic variation NATURE 2015; 526 (7571): 68-?
The landscape of genomic imprinting across diverse adult human tissues
2015; 25 (7): 927-936
Genomic imprinting is an important regulatory mechanism that silences one of the parental copies of a gene. To systematically characterize this phenomenon, we analyze tissue specificity of imprinting from allelic expression data in 1582 primary tissue samples from 178 individuals from the Genotype-Tissue Expression (GTEx) project. We characterize imprinting in 42 genes, including both novel and previously identified genes. Tissue specificity of imprinting is widespread, and gender-specific effects are revealed in a small number of genes in muscle with stronger imprinting in males. IGF2 shows maternal expression in the brain instead of the canonical paternal expression elsewhere. Imprinting appears to have only a subtle impact on tissue-specific expression levels, with genes lacking a systematic expression difference between tissues with imprinted and biallelic expression. In summary, our systematic characterization of imprinting in adult tissues highlights variation in imprinting between genes, individuals, and tissues.
View details for DOI 10.1101/gr.192278.115
View details for Web of Science ID 000357356900001
View details for PubMedID 25953952
View details for PubMedCentralID PMC4484390
Human genomics. Effect of predicted protein-truncating genetic variants on the human transcriptome.
Science
2015; 348 (6235): 666-669
Accurate prediction of the functional effect of genetic variation is critical for clinical genome interpretation. We systematically characterized the transcriptome effects of protein-truncating variants, a class of variants expected to have profound effects on gene function, using data from the Genotype-Tissue Expression (GTEx) and Geuvadis projects. We quantitated tissue-specific and positional effects on nonsense-mediated transcript decay and present an improved predictive model for this decay. We directly measured the effect of variants both proximal and distal to splice junctions. Furthermore, we found that robustness to heterozygous gene inactivation is not due to dosage compensation. Our results illustrate the value of transcriptome data in the functional interpretation of genetic variants.
View details for DOI 10.1126/science.1261877
View details for PubMedID 25954003
View details for PubMedCentralID PMC4537935
Genetic conflict reflected in tissue-specific maps of genomic imprinting in human and mouse.
Nature genetics
2015; 47 (5): 544-549
Genomic imprinting is an epigenetic process that restricts gene expression to either the maternally or paternally inherited allele. Many theories have been proposed to explain its evolutionary origin, but understanding has been limited by a paucity of data mapping the breadth and dynamics of imprinting within any organism. We generated an atlas of imprinting spanning 33 mouse and 45 human developmental stages and tissues. Nearly all imprinted genes were imprinted in early development and either retained their parent-of-origin expression in adults or lost it completely. Consistent with an evolutionary signature of parental conflict, imprinted genes were enriched for coexpressed pairs of maternally and paternally expressed genes, showed accelerated expression divergence between human and mouse, and were more highly expressed than their non-imprinted orthologs in other species. Our approach demonstrates a general framework for the discovery of imprinting in any species and sheds light on the causes and consequences of genomic imprinting in mammals.
View details for DOI 10.1038/ng.3274
View details for PubMedID 25848752
View details for PubMedCentralID PMC4414907
Tissue-specific effects of genetic and epigenetic variation on gene regulation and splicing.
2015; 11 (1)
Understanding how genetic variation affects distinct cellular phenotypes, such as gene expression levels, alternative splicing and DNA methylation levels, is essential for better understanding of complex diseases and traits. Furthermore, how inter-individual variation of DNA methylation is associated to gene expression is just starting to be studied. In this study, we use the GenCord cohort of 204 newborn Europeans' lymphoblastoid cell lines, T-cells and fibroblasts derived from umbilical cords. The samples were previously genotyped for 2.5 million SNPs, mRNA-sequenced, and assayed for methylation levels in 482,421 CpG sites. We observe that methylation sites associated to expression levels are enriched in enhancers, gene bodies and CpG island shores. We show that while the correlation between DNA methylation and gene expression can be positive or negative, it is very consistent across cell-types. However, this epigenetic association to gene expression appears more tissue-specific than the genetic effects on gene expression or DNA methylation (observed in both sharing estimations based on P-values and effect size correlations between cell-types). This predominance of genetic effects can also be reflected by the observation that allele specific expression differences between individuals dominate over tissue-specific effects. Additionally, we discover genetic effects on alternative splicing and interestingly, a large amount of DNA methylation correlating to alternative splicing, both in a tissue-specific manner. The locations of the SNPs and methylation sites involved in these associations highlight the participation of promoter proximal and distant regulatory regions on alternative splicing. Overall, our results provide high-resolution analyses showing how genome sequence variation has a broad effect on cellular phenotypes across cell-types, whereas epigenetic factors provide a secondary layer of variation that is more tissue-specific. 
Furthermore, the details of how this tissue-specificity may vary across inter-relations of molecular traits, and where these are occurring, can yield further insights into gene regulation and cellular biology as a whole.
View details for DOI 10.1371/journal.pgen.1004958
View details for PubMedID 25634236
RNA Sequencing and Analysis.
Cold Spring Harbor protocols
2015; 2015 (11): pdb.top084970
RNA sequencing (RNA-Seq) uses the capabilities of high-throughput sequencing methods to provide insight into the transcriptome of a cell. Compared to previous Sanger sequencing- and microarray-based methods, RNA-Seq provides far higher coverage and greater resolution of the dynamic nature of the transcriptome. Beyond quantifying gene expression, the data generated by RNA-Seq facilitate the discovery of novel transcripts, identification of alternatively spliced genes, and detection of allele-specific expression. Recent advances in the RNA-Seq workflow, from sample preparation to library construction to data analysis, have enabled researchers to further elucidate the functional complexity of the transcription. In addition to polyadenylated messenger RNA (mRNA) transcripts, RNA-Seq can be applied to investigate different populations of RNA, including total RNA, pre-mRNA, and noncoding RNA, such as microRNA and long ncRNA. This article provides an introduction to RNA-Seq methods, including applications, experimental design, and technical challenges.
View details for DOI 10.1101/pdb.top084970
View details for PubMedID 25870306
Type I interferon signaling genes in recurrent major depression: increased expression detected by whole-blood RNA sequencing.
Molecular psychiatry
2014; 19 (12): 1267-1274
A study of genome-wide gene expression in major depressive disorder (MDD) was undertaken in a large population-based sample to determine whether altered expression levels of genes and pathways could provide insights into biological mechanisms that are relevant to this disorder. Gene expression studies have the potential to detect changes that may be because of differences in common or rare genomic sequence variation, environmental factors or their interaction. We recruited a European ancestry sample of 463 individuals with recurrent MDD and 459 controls, obtained self-report and semi-structured interview data about psychiatric and medical history and other environmental variables, sequenced RNA from whole blood and genotyped a genome-wide panel of common single-nucleotide polymorphisms. We used analytical methods to identify MDD-related genes and pathways using all of these sources of information. In analyses of association between MDD and expression levels of 13 857 single autosomal genes, accounting for multiple technical, physiological and environmental covariates, a significant excess of low P-values was observed, but there was no significant single-gene association after genome-wide correction. Pathway-based analyses of expression data detected significant association of MDD with increased expression of genes in the interferon α/β signaling pathway. This finding could not be explained by potentially confounding diseases and medications (including antidepressants) or by computationally estimated proportions of white blood cell types. Although cause-effect relationships cannot be determined from these data, the results support the hypothesis that altered immune signaling has a role in the pathogenesis, manifestation, and/or the persistence and progression of MDD.
View details for DOI 10.1038/mp.2013.161
View details for PubMedID 24296977
- High-Resolution Transcriptome Analysis with Long-Read RNA Sequencing PLOS ONE 2014; 9 (9)
Transcriptome sequencing of a large human family identifies the impact of rare noncoding variants.
American journal of human genetics
2014; 95 (3): 245-256
Recent and rapid human population growth has led to an excess of rare genetic variants that are expected to contribute to an individual's genetic burden of disease risk. To date, much of the focus has been on rare protein-coding variants, for which potential impact can be estimated from the genetic code, but determining the impact of rare noncoding variants has been more challenging. To improve our understanding of such variants, we combined high-quality genome sequencing and RNA sequencing data from a 17-individual, three-generation family to contrast expression quantitative trait loci (eQTLs) and splicing quantitative trait loci (sQTLs) within this family to eQTLs and sQTLs within a population sample. Using this design, we found that eQTLs and sQTLs with large effects in the family were enriched with rare regulatory and splicing variants (minor allele frequency < 0.01). They were also more likely to influence essential genes and genes involved in complex disease. In addition, we tested the capacity of diverse noncoding annotation to predict the impact of rare noncoding variants. We found that distance to the transcription start site, evolutionary constraint, and epigenetic annotation were considerably more informative for predicting the impact of rare variants than for predicting the impact of common variants. These results highlight that rare noncoding variants are important contributors to individual gene-expression profiles and further demonstrate a significant capability for genomic annotation to predict the impact of rare noncoding variants.
View details for DOI 10.1016/j.ajhg.2014.08.004
View details for PubMedID 25192044
View details for PubMedCentralID PMC4157143
Transcriptome sequencing from diverse human populations reveals differentiated regulatory architecture.
PLoS genetics
2014; 10 (8)
Large-scale sequencing efforts have documented extensive genetic variation within the human genome. However, our understanding of the origins, global distribution, and functional consequences of this variation is far from complete. While regulatory variation influencing gene expression has been studied within a handful of populations, the breadth of transcriptome differences across diverse human populations has not been systematically analyzed. To better understand the spectrum of gene expression variation, alternative splicing, and the population genetics of regulatory variation in humans, we have sequenced the genomes, exomes, and transcriptomes of EBV transformed lymphoblastoid cell lines derived from 45 individuals in the Human Genome Diversity Panel (HGDP). The populations sampled span the geographic breadth of human migration history and include Namibian San, Mbuti Pygmies of the Democratic Republic of Congo, Algerian Mozabites, Pathan of Pakistan, Cambodians of East Asia, Yakut of Siberia, and Mayans of Mexico. We discover that approximately 25.0% of the variation in gene expression found amongst individuals can be attributed to population differences. However, we find few genes that are systematically differentially expressed among populations. Of this population-specific variation, 75.5% is due to expression rather than splicing variability, and we find few genes with strong evidence for differential splicing across populations. Allelic expression analyses indicate that previously mapped common regulatory variants identified in eight populations from the International Haplotype Map Phase 3 project have similar effects in our seven sampled HGDP populations, suggesting that the cellular effects of common variants are shared across diverse populations.
Together, these results provide a resource for studies analyzing functional differences across populations by estimating the degree of shared gene expression, alternative splicing, and regulatory genetics across populations from the broadest points of human migration history yet sampled.
View details for DOI 10.1371/journal.pgen.1004549
View details for PubMedID 25121757
Cis and trans effects of human genomic variants on gene expression.
PLoS genetics
2014; 10 (7)
Gene expression is a heritable cellular phenotype that defines the function of a cell and can lead to diseases in case of misregulation. In order to detect genetic variations affecting gene expression, we performed association analysis of single nucleotide polymorphisms (SNPs) and copy number variants (CNVs) with gene expression measured in 869 lymphoblastoid cell lines of the Avon Longitudinal Study of Parents and Children (ALSPAC) cohort in cis and in trans. We discovered that 3,534 genes (false discovery rate (FDR) = 5%) are affected by an expression quantitative trait locus (eQTL) in cis and 48 genes are affected in trans. We observed that CNVs are more likely to be eQTLs than SNPs. In addition, we found that variants associated to complex traits and diseases are enriched for trans-eQTLs and that trans-eQTLs are enriched for cis-eQTLs. As a variant affecting both a gene in cis and in trans suggests that the cis gene is functionally linked to the trans gene expression, we looked specifically for trans effects of cis-eQTLs. We discovered that 26 cis-eQTLs are associated to 92 genes in trans with the cis-eQTLs of the transcriptions factors BATF3 and HMX2 affecting the most genes. We then explored if the variation of the level of expression of the cis genes were causally affecting the level of expression of the trans genes and discovered several causal relationships between variation in the level of expression of the cis gene and variation of the level of expression of the trans gene. This analysis shows that a large sample size allows the discovery of secondary effects of human variations on gene expression that can be used to construct short directed gene regulatory networks.
View details for DOI 10.1371/journal.pgen.1004461
View details for PubMedID 25010687
Determining causality and consequence of expression quantitative trait loci
2014; 133 (6): 727-735
Expression quantitative trait loci (eQTLs) are currently the most abundant and systematically-surveyed class of functional consequence for genetic variation. Recent genetic studies of gene expression have identified thousands of eQTLs in diverse tissue types for the majority of human genes. Application of this large eQTL catalog provides an important resource for understanding the molecular basis of common genetic diseases. However, only now has both the availability of individuals with full genomes and corresponding advances in functional genomics provided the opportunity to dissect eQTLs to identify causal regulatory variants. Resolving the properties of such causal regulatory variants is improving understanding of the molecular mechanisms that influence traits and guiding the development of new genome-scale approaches to variant interpretation. In this review, we provide an overview of current computational and experimental methods for identifying causal regulatory variants and predicting their phenotypic consequences.
View details for DOI 10.1007/s00439-014-1446-0
View details for Web of Science ID 000336317000005
View details for PubMedID 24770875
Allelic Expression of Deleterious Protein-Coding Variants across Human Tissues.
2014; 10 (5)
Personal exome and genome sequencing provides access to loss-of-function and rare deleterious alleles whose interpretation is expected to provide insight into individual disease burden. However, for each allele, accurate interpretation of its effect will depend on both its penetrance and the trait's expressivity. In this regard, an important factor that can modify the effect of a pathogenic coding allele is its level of expression; a factor which itself characteristically changes across tissues. To better inform the degree to which pathogenic alleles can be modified by expression level across multiple tissues, we have conducted exome, RNA and deep, targeted allele-specific expression (ASE) sequencing in ten tissues obtained from a single individual. By combining such data, we report the impact of rare and common loss-of-function variants on allelic expression exposing stronger allelic bias for rare stop-gain variants and informing the extent to which rare deleterious coding alleles are consistently expressed across tissues. This study demonstrates the potential importance of transcriptome data to the interpretation of pathogenic protein-coding variants.
View details for DOI 10.1371/journal.pgen.1004304
View details for PubMedID 24786518
Dissecting the causal genetic mechanisms of coronary heart disease.
Current atherosclerosis reports
2014; 16 (5): 406
Large-scale genome-wide association studies (GWAS) have identified 46 loci that are associated with coronary heart disease (CHD). Additionally, 104 independent candidate variants (false discovery rate of 5 %) have been identified (Schunkert H, Konig IR, Kathiresan S, Reilly MP, Assimes TL, Holm H et al. Nat Genet 43:333-8, 2011; Deloukas P, Kanoni S, Willenborg C, Farrall M, Assimes TL, Thompson JR et al. Nat Genet 45:25-33, 2012; C4D Genetics Consortium. Nat Genet 43:339-44, 2011). The majority of the causal genes in these loci function independently of conventional risk factors. It is postulated that a number of the CHD-associated genes regulate basic processes in the vascular cells involved in atherosclerosis, and that study of the signaling pathways that are modulated in this cell type by causal regulatory variation will provide critical new insights for targeting the initiation and progression of disease. In this review, we will discuss the types of experimental approaches and data that are critical to understanding the molecular processes that underlie the disease risk at 9p21.3, TCF21, SORT1, and other CHD-associated loci.
View details for DOI 10.1007/s11883-014-0406-4
View details for PubMedID 24623178
SplicePlot: a utility for visualizing splicing quantitative trait loci.
2014; 30 (7): 1025-1026
RNA-Sequencing has provided unprecedented resolution of alternative splicing and splicing-quantitative trait loci (sQTL). However, there are few tools available for visualizing the genotype-dependent effects of splicing at a population level. SplicePlot is a simple command line utility that produces intuitive visualization of sQTLs and their effects. SplicePlot takes mapped RNA-seq reads in BAM format and genotype data in VCF format as input and outputs publication quality sashimi plots, hive plots, and structure plots enabling better investigation and understanding of the role of genetics on alternative splicing and transcript structure. Availability and Implementation: Source code and detailed documentation are available at http://montgomerylab.stanford.edu/spliceplot/index.html under Resources and at GitHub. SplicePlot is implemented in Python and is supported on Linux and Mac OS. A VirtualBox virtual machine running Ubuntu with SplicePlot already installed is also available.
View details for DOI 10.1093/bioinformatics/btt733
View details for PubMedID 24363378
Path-scan: a reporting tool for identifying clinically actionable variants.
Pacific Symposium on Biocomputing
2014; 19: 229-240
The American College of Medical Genetics and Genomics (ACMG) recently released guidelines regarding the reporting of incidental findings in sequencing data. Given the availability of Direct to Consumer (DTC) genetic testing and the falling cost of whole exome and genome sequencing, individuals will increasingly have the opportunity to analyze their own genomic data. We have developed a web-based tool, PATH-SCAN, which annotates individual genomes and exomes for ClinVar designated pathogenic variants found within the genes from the ACMG guidelines. Because mutations in these genes predispose individuals to conditions with actionable outcomes, our tool will allow individuals or researchers to identify potential risk variants in order to consult physicians or genetic counselors for further evaluation. Moreover, our tool allows individuals to anonymously submit their pathogenic burden, so that we can crowd source the collection of quantitative information regarding the frequency of these variants. We tested our tool on 1092 publicly available genomes from the 1000 Genomes project, 163 genomes from the Personal Genome Project, and 15 genomes from a clinical genome sequencing research project. Excluding the most commonly seen variant in 1000 Genomes, about 20% of all genomes analyzed had a ClinVar designated pathogenic variant that required further evaluation.
View details for PubMedID 24297550
View details for PubMedCentralID PMC4008882
Transcriptome analysis reveals differential splicing events in IPF lung tissue.
2014; 9 (5)
Idiopathic pulmonary fibrosis (IPF) is a complex disease in which a multitude of proteins and networks are disrupted. Interrogation of the transcriptome through RNA sequencing (RNA-Seq) enables the determination of genes whose differential expression is most significant in IPF, as well as the detection of alternative splicing events which are not easily observed with traditional microarray experiments. We sequenced messenger RNA from 8 IPF lung samples and 7 healthy controls on an Illumina HiSeq 2000, and found evidence for substantial differential gene expression and differential splicing. 873 genes were differentially expressed in IPF (FDR<5%), and 440 unique genes had significant differential splicing events in at least one exonic region (FDR<5%). We used qPCR to validate the differential exon usage in the second and third most significant exonic regions, in the genes COL6A3 (RNA-Seq adjusted pval = 7.18e-10) and POSTN (RNA-Seq adjusted pval = 2.06e-09), which encode the extracellular matrix proteins collagen alpha-3(VI) and periostin. The increased gene-level expression of periostin has been associated with IPF and its clinical progression, but its differential splicing has not been studied in the context of this disease. Our results suggest that alternative splicing of these and other genes may be involved in the pathogenesis of IPF. We have developed an interactive web application which allows users to explore the results of our RNA-Seq experiment, as well as those of two previously published microarray experiments, and we hope that this will serve as a resource for future investigations of gene regulation in IPF.
View details for DOI 10.1371/journal.pone.0097550
View details for PubMedID 24805851
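The FDR<5% thresholds quoted in the IPF abstract above refer to false discovery rate control. As a minimal sketch, assuming the Benjamini-Hochberg procedure (the abstract does not say which FDR method was used) and made-up p-values rather than the study's data, adjusted p-values could be computed along these lines:

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five exonic regions (illustrative only).
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
example_adjusted = benjamini_hochberg(example_pvalues)
# Positions of the regions that pass a 5% FDR threshold.
significant = [i for i, q in enumerate(example_adjusted) if q < 0.05]
```

A region is then reported as significant when its adjusted value falls below 0.05, which is how a statement such as "FDR<5%" is usually read.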
High-resolution transcriptome analysis with long-read RNA sequencing.
2014; 9 (9)
RNA sequencing (RNA-seq) enables characterization and quantification of individual transcriptomes as well as detection of patterns of allelic expression and alternative splicing. Current RNA-seq protocols depend on high-throughput short-read sequencing of cDNA. However, as ongoing advances are rapidly yielding increasing read lengths, a technical hurdle remains in identifying the degree to which differences in read length influence various transcriptome analyses. In this study, we generated two paired-end RNA-seq datasets of differing read lengths (2×75 bp and 2×262 bp) for lymphoblastoid cell line GM12878 and compared the effect of read length on transcriptome analyses, including read-mapping performance, gene and transcript quantification, and detection of allele-specific expression (ASE) and allele-specific alternative splicing (ASAS) patterns. Our results indicate that, while the current long-read protocol is considerably more expensive than short-read sequencing, there are important benefits that can only be achieved with longer read length, including lower mapping bias and reduced ambiguity in assigning reads to genomic elements, such as mRNA transcripts. We show that these benefits ultimately lead to improved detection of cis-acting regulatory and splicing variation effects within individuals.
View details for DOI 10.1371/journal.pone.0108095
View details for PubMedID 25251678
- Transcriptome Analysis Reveals Differential Splicing Events in IPF Lung Tissue. PloS one 2014; 9 (3)
Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals
2014; 24 (1): 14-24
Understanding the consequences of regulatory variation in the human genome remains a major challenge, with important implications for understanding gene regulation and interpreting the many disease-risk variants that fall outside of protein-coding regions. Here, we provide a direct window into the regulatory consequences of genetic variation by sequencing RNA from 922 genotyped individuals. We present a comprehensive description of the distribution of regulatory variation: by the specific expression phenotypes altered, the properties of affected genes, and the genomic characteristics of regulatory variants. We detect variants influencing expression of over ten thousand genes, and through the enhanced resolution offered by RNA-sequencing, for the first time we identify thousands of variants associated with specific phenotypes including splicing and allelic expression. Evaluating the effects of both long-range intra-chromosomal and trans (cross-chromosomal) regulation, we observe modularity in the regulatory network, with three-dimensional chromosomal configuration playing a particular role in regulatory modules within each chromosome. We also observe a significant depletion of regulatory variants affecting central and critical genes, along with a trend of reduced effect sizes as variant frequency increases, providing evidence that purifying selection and buffering have limited the deleterious impact of regulatory variation on the cell. Further, generalizing beyond observed variants, we have analyzed the genomic properties of variants associated with expression and splicing and developed a Bayesian model to predict regulatory consequences of genetic variants, applicable to the interpretation of individual genomes and disease studies. Together, these results represent a critical step toward characterizing the complete landscape of human regulatory variation.
View details for DOI 10.1101/gr.155192.113
View details for Web of Science ID 000329163500002
View details for PubMedID 24092820
Quantifying RNA allelic ratios by microfluidic multiplex PCR and sequencing.
2014; 11 (1): 51-54
We developed a targeted RNA sequencing method that couples microfluidics-based multiplex PCR and deep sequencing (mmPCR-seq) to uniformly and simultaneously amplify up to 960 loci in 48 samples independently of their gene expression levels and to accurately and cost-effectively measure allelic ratios even for low-quantity or low-quality RNA samples. We applied mmPCR-seq to RNA editing and allele-specific expression studies. mmPCR-seq complements RNA-seq for studying allelic variations in the transcriptome.
View details for DOI 10.1038/nmeth.2736
View details for PubMedID 24270603
View details for PubMedCentralID PMC3877737
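The allelic ratios mentioned in the mmPCR-seq abstract above are, at their simplest, the share of reads at a heterozygous site that carry one allele. The sketch below is an illustration under stated assumptions, not the paper's pipeline: the function names are hypothetical, the read counts are invented, and the exact binomial test against a 50:50 expectation is one common choice that the abstract does not itself specify.

```python
from math import comb


def allelic_ratio(ref_reads, alt_reads):
    """Fraction of reads at a site that carry the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_reads / total


def binomial_two_sided_p(ref_reads, alt_reads, expected=0.5):
    """Exact two-sided binomial p-value for departure from `expected`.

    Suitable for modest read depths; very deep sites would need a
    log-space or approximate calculation instead.
    """
    n = ref_reads + alt_reads
    probs = [
        comb(n, k) * expected**k * (1 - expected) ** (n - k)
        for k in range(n + 1)
    ]
    observed = probs[ref_reads]
    # Add up every outcome that is at least as unlikely as the observed one.
    tail = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical site: 15 reference reads and 5 alternate reads.
ratio = allelic_ratio(15, 5)
p_value = binomial_two_sided_p(15, 5)
```

A ratio far from 0.5 with a small p-value would flag the site as a candidate for allele-specific expression, subject to the usual checks for mapping bias.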
Transcriptome and genome sequencing uncovers functional variation in humans.
2013; 501 (7468): 506-511
Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project, the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences. We discover extremely widespread genetic variation affecting the regulation of most genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on the cellular mechanisms of regulatory and loss-of-function variation, and allows us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome.
View details for DOI 10.1038/nature12531
View details for PubMedID 24037378
- Transcriptome and genome sequencing uncovers functional variation in humans NATURE 2013; 501 (7468): 506-511
Systematic functional regulatory assessment of disease-associated variants.
Proceedings of the National Academy of Sciences of the United States of America
2013; 110 (23): 9607-9612
Genome-wide association studies have discovered many genetic loci associated with disease traits, but the functional molecular basis of these associations is often unresolved. Genome-wide regulatory and gene expression profiles measured across individuals and diseases reflect downstream effects of genetic variation and may allow for functional assessment of disease-associated loci. Here, we present a unique approach for systematic integration of genetic disease associations, transcription factor binding among individuals, and gene expression data to assess the functional consequences of variants associated with hundreds of human diseases. In an analysis of genome-wide binding profiles of NFκB, we find that disease-associated SNPs are enriched in NFκB binding regions overall, and specifically for inflammatory-mediated diseases, such as asthma, rheumatoid arthritis, and coronary artery disease. Using genome-wide variation in transcription factor-binding data, we find that NFκB binding is often correlated with disease-associated variants in a genotype-specific and allele-specific manner. Furthermore, we show that this binding variation is often related to expression of nearby genes, which are also found to have altered expression in independent profiling of the variant-associated disease condition. Thus, using this integrative approach, we provide a unique means to assign putative function to many disease-associated SNPs.
View details for DOI 10.1073/pnas.1219099110
View details for PubMedID 23690573
Desktop transcriptome sequencing from archival tissue to identify clinically relevant translocations.
American journal of surgical pathology
2013; 37 (6): 796-803
Somatic mutations, often translocations or single nucleotide variations, are pathognomonic for certain types of cancers and are increasingly of clinical importance for diagnosis and prediction of response to therapy. Conventional clinical assays only evaluate 1 mutation at a time, and targeted tests are often constrained to identify only the most common mutations. Genome-wide or transcriptome-wide high-throughput sequencing (HTS) of clinical samples offers an opportunity to evaluate for all clinically significant mutations with a single test. Recently a "desktop version" of HTS has become available, but most of the experience to date is based on data obtained from high-quality DNA from frozen specimens. In this study, we demonstrate, as a proof of principle, that translocations in sarcomas can be diagnosed from formalin-fixed paraffin-embedded (FFPE) tissue with desktop HTS. Using the first generation MiSeq platform, full transcriptome sequencing was performed on FFPE material from archival blocks of 3 synovial sarcomas, 3 myxoid liposarcomas, 2 Ewing sarcomas, and 1 clear cell sarcoma. Mapping the reads to the "sarcomatome" (all known 83 genes involved in translocations and mutations in sarcoma) and using a novel algorithm for ranking fusion candidates, the pathognomonic fusions and the exact breakpoints were identified in all cases of synovial sarcoma, myxoid liposarcoma, and clear cell sarcoma. The Ewing sarcoma fusion gene was detectable in FFPE material only with a sequencing platform that generates greater sequencing depth. The results show that a single transcriptome HTS assay, from FFPE, has the potential to replace conventional molecular diagnostic techniques for the evaluation of clinically relevant mutations in cancer.
View details for DOI 10.1097/PAS.0b013e31827ad9b2
View details for PubMedID 23598961
The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes.
2013; 23 (5): 749-761
Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.
View details for DOI 10.1101/gr.148718.112
View details for PubMedID 23478400
View details for PubMedCentralID PMC3638132
Examination of the relationship between variation at 17q21 and childhood wheeze phenotypes
JOURNAL OF ALLERGY AND CLINICAL IMMUNOLOGY
2013; 131 (3): 685-694
Genome-wide association studies have identified associations of genetic variants at 17q21 near ORMDL3 with childhood asthma. We sought to determine whether associations in this region are specific to particular asthma phenotypes and specific to ORMDL3. We examined associations between 244 independent single nucleotide polymorphisms (SNPs) plus 13 previously identified asthma-related SNPs in the region between 34 and 36 Mb on chromosome 17 and early wheezing phenotypes, doctor-diagnosed asthma and atopy at 7½ years, and bronchial hyperresponsiveness and lung function at 8½ years in 7045 children from the Avon Longitudinal Study of Parents and Children birth cohort study. With this, cis expression quantitative trait loci signals for the same SNPs were assessed in 875 samples across genes in the same region. The strongest evidence for phenotypic association was seen for persistent wheezing (rs8076131 near ORMDL3: relative risk ratio [RRR], 1.60 [95% CI, 1.40-1.84], P = 1.4 × 10⁻¹¹; rs2305480 near GSDML: RRR, 1.60 [95% CI, 1.39-1.83], P = 1.5 × 10⁻¹¹; and rs9303277 near IKZF3: RRR, 1.57 [95% CI, 1.37-1.79], P = 4.4 × 10⁻¹¹). Similar but less precisely estimated effects were seen for intermediate-onset wheeze, but there was little evidence of associations with other wheezing phenotypes. There was some evidence of associations with bronchial hyperresponsiveness. SNPs across the whole region show strong evidence of association with differential levels of expression at GSDML, IKZF3, and MED24, as well as ORMDL3. Associations of SNPs in the 17q21 locus are specific to asthma and specific wheezing phenotypes and are not explained by associations with intermediate phenotypes, such as atopy or lung function.
View details for DOI 10.1016/j.jaci.2012.09.021
View details for Web of Science ID 000315587800008
View details for PubMedID 23154084
Integrating GWAS and Expression Data for Functional Characterization of Disease-Associated SNPs: An Application to Follicular Lymphoma
AMERICAN JOURNAL OF HUMAN GENETICS
2013; 92 (1): 126-130
Post-GWAS (genome-wide association study) methods are greatly needed for characterizing the function of trait-associated SNPs. Strategies integrating various biological data sets with GWAS results will provide insights into the mechanistic role of associated SNPs. Here, we present a method that integrates RNA sequencing (RNA-seq) and allele-specific expression data with GWAS data to further characterize SNPs associated with follicular lymphoma (FL). We investigated the influence on gene expression of three established FL-associated loci (rs10484561, rs2647012, and rs6457327) by measuring their correlation with human-leukocyte-antigen (HLA) expression levels obtained from publicly available RNA-seq expression data sets from lymphoblastoid cell lines. Our results suggest that SNPs linked to the protective variant rs2647012 exert their effect by a cis-regulatory mechanism involving modulation of HLA-DQB1 expression. In contrast, no effect on HLA expression was observed for the colocalized risk variant rs10484561. The application of integrative methods, such as those presented here, to other post-GWAS investigations will help identify causal disease variants and enhance our understanding of biological disease mechanisms.
View details for DOI 10.1016/j.ajhg.2012.11.009
View details for Web of Science ID 000313759000013
View details for PubMedID 23246294
View details for PubMedCentralID PMC3542469
Passive and active DNA methylation and the interplay with genetic variation in gene regulation.
DNA methylation is an essential epigenetic mark whose role in gene regulation and whose dependency on genomic sequence and environment are not fully understood. In this study we provide novel insights into the mechanistic relationships between genetic variation, DNA methylation and transcriptome sequencing data in three different cell-types of the GenCord human population cohort. We find that the associations between DNA methylation and gene expression variation among individuals are likely due to different mechanisms from those establishing methylation-expression patterns during differentiation. Furthermore, cell-type differential DNA methylation may delineate a platform in which local inter-individual changes may respond to or act in gene regulation. We show that unlike genetic regulatory variation, DNA methylation alone does not significantly drive allele specific expression. Finally, inferred mechanistic relationships using genetic variation as well as correlations with TF abundance reveal both a passive and active role of DNA methylation to regulatory interactions influencing gene expression.
View details for DOI 10.7554/eLife.00523
View details for PubMedID 23755361
- Normalizing RNA-Sequencing Data by Modeling Hidden Covariates with Prior Knowledge. PloS one 2013; 8 (7)
- Performance of genomic medicine. Genome biology 2013; 14 (12): 316
Cancer Transcriptome Sequencing and Analysis
Cancer Genomics: From Bench to Personalized Medicine
Elsevier. 2013; 1: 31–49
View details for DOI 10.1016/B978-0-12-396967-5.00003-7
Normalizing RNA-sequencing data by modeling hidden covariates with prior knowledge.
2013; 8 (7)
Transcriptomic assays that measure expression levels are widely used to study the manifestation of environmental or genetic variations in cellular processes. RNA-sequencing in particular has the potential to considerably improve such understanding because of its capacity to assay the entire transcriptome, including novel transcriptional events. However, as with earlier expression assays, analysis of RNA-sequencing data requires carefully accounting for factors that may introduce systematic, confounding variability in the expression measurements, resulting in spurious correlations. Here, we consider the problem of modeling and removing the effects of known and hidden confounding factors from RNA-sequencing data. We describe a unified residual framework that encapsulates existing approaches, and using this framework, present a novel method, HCP (Hidden Covariates with Prior). HCP uses a more informed assumption about the confounding factors, and performs as well or better than existing approaches while having a much lower computational cost. Our experiments demonstrate that accounting for known and hidden factors with appropriate models improves the quality of RNA-sequencing data in two very different tasks: detecting genetic variations that are associated with nearby expression variations (cis-eQTLs), and constructing accurate co-expression networks.
View details for DOI 10.1371/journal.pone.0068141
View details for PubMedID 23874524
Detection and impact of rare regulatory variants in human disease.
Frontiers in genetics
2013; 4: 67-?
Advances in genome sequencing are providing unprecedented resolution of rare and private variants. However, methods which assess the effect of these variants have relied predominantly on information within coding sequences. Assessing their impact in non-coding sequences remains a significant contemporary challenge. In this review, we highlight the role of regulatory variation as causative agents and modifiers of monogenic disorders. We further discuss how advances in functional genomics are now providing new opportunity to assess the impact of rare non-coding variants and their role in disease.
View details for DOI 10.3389/fgene.2013.00067
View details for PubMedID 23755067
Sex-biased genetic effects on gene regulation in humans
2012; 22 (12): 2368-2375
Human regulatory variation, reported as expression quantitative trait loci (eQTLs), contributes to differences between populations and tissues. The contribution of eQTLs to differences between sexes, however, has not been investigated to date. Here we explore regulatory variation in females and males and demonstrate that 12%-15% of autosomal eQTLs function in a sex-biased manner. We show that genes possessing sex-biased eQTLs are expressed at similar levels across the sexes and highlight cases of genes controlling sexually dimorphic and shared traits that are under the control of distinct regulatory elements in females and males. This study illustrates that sex provides important context that can modify the effects of functional genetic variants.
View details for DOI 10.1101/gr.134981.111
View details for Web of Science ID 000311895500005
View details for PubMedID 22960374
Mapping cis- and trans-regulatory effects across multiple tissues in twins
2012; 44 (10): 1084-?
Sequence-based variation in gene expression is a key driver of disease risk. Common variants regulating expression in cis have been mapped in many expression quantitative trait locus (eQTL) studies, typically in single tissues from unrelated individuals. Here, we present a comprehensive analysis of gene expression across multiple tissues conducted in a large set of mono- and dizygotic twins that allows systematic dissection of genetic (cis and trans) and non-genetic effects on gene expression. Using identity-by-descent estimates, we show that at least 40% of the total heritable cis effect on expression cannot be accounted for by common cis variants, a finding that reveals the contribution of low-frequency and rare regulatory variants with respect to both transcriptional regulation and complex trait susceptibility. We show that a substantial proportion of gene expression heritability is trans to the structural gene, and we identify several replicating trans variants that act predominantly in a tissue-restricted manner and may regulate the transcription of many genes.
View details for DOI 10.1038/ng.2394
View details for Web of Science ID 000309550200006
View details for PubMedID 22941192
Genotype-Based Test in Mapping Cis-Regulatory Variants from Allele-Specific Expression Data
2012; 7 (6)
Identifying and understanding the impact of gene regulatory variation is of considerable importance in evolutionary and medical genetics; such variants are thought to be responsible for human-specific adaptation and to have an important role in genetic disease. Regulatory variation in cis is readily detected in individuals showing uneven expression of a transcript from its two allelic copies, an observation referred to as allelic imbalance (AI). Identifying individuals exhibiting AI allows mapping of regulatory DNA regions and the potential to identify the underlying causal genetic variant(s). However, existing mapping methods require knowledge of the haplotypes, which make them sensitive to phasing errors. In this study, we introduce a genotype-based mapping test that does not require haplotype-phase inference to locate regulatory regions. The test relies on partitioning genotypes of individuals exhibiting AI and those not expressing AI in a 2×3 contingency table. The performance of this test to detect linkage disequilibrium (LD) between a potential regulatory site and a SNP located in this region was examined by analyzing the simulated and the empirical AI datasets. In simulation experiments, the genotype-based test outperforms the haplotype-based tests with the increasing distance separating the regulatory region from its regulated transcript. The genotype-based test performed equally well with the experimental AI datasets, either from genome-wide cDNA hybridization arrays or from RNA sequencing. By avoiding the need of haplotype inference, the genotype-based test will suit AI analyses in population samples of unknown haplotype structure and will additionally facilitate the identification of cis-regulatory variants that are located far away from the regulated transcript.
View details for DOI 10.1371/journal.pone.0038667
View details for Web of Science ID 000305351700058
View details for PubMedID 22685595
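The genotype-based test in the abstract above partitions individuals with and without allelic imbalance (AI) by genotype in a 2×3 contingency table. A minimal sketch of testing such a table follows, assuming a Pearson chi-square statistic; the abstract does not state which statistic the authors applied, and the function name and counts here are invented for illustration.

```python
from math import exp


def chi_square_2x3(table):
    """Pearson chi-square test for a 2x3 table of counts.

    Rows: individuals showing AI, individuals not showing AI.
    Columns: the three genotype classes at a candidate SNP (e.g. AA, AB, BB).
    Returns (statistic, p_value).
    """
    if len(table) != 2 or any(len(row) != 3 for row in table):
        raise ValueError("expected a 2x3 table")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one count")
    grand_total = sum(row_totals)
    statistic = 0.0
    for r in range(2):
        for c in range(3):
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (table[r][c] - expected) ** 2 / expected
    # With (2-1)*(3-1) = 2 degrees of freedom, the chi-square
    # upper tail has the closed form exp(-x/2).
    p_value = exp(-statistic / 2)
    return statistic, p_value


# Hypothetical counts: AI individuals are mostly heterozygous at the SNP.
ai_table = [
    [5, 30, 5],    # individuals showing AI
    [20, 10, 20],  # individuals not showing AI
]
statistic, p_value = chi_square_2x3(ai_table)
```

Because the table is built from unphased genotype counts alone, no haplotype inference is needed, which is the property the abstract emphasizes.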
Patterns of Cis Regulatory Variation in Diverse Human Populations
2012; 8 (4): 272-284
The genetic basis of gene expression variation has long been studied with the aim to understand the landscape of regulatory variants, but also more recently to assist in the interpretation and elucidation of disease signals. To date, many studies have looked in specific tissues and population-based samples, but there has been limited assessment of the degree of inter-population variability in regulatory variation. We analyzed genome-wide gene expression in lymphoblastoid cell lines from a total of 726 individuals from 8 global populations from the HapMap3 project and correlated gene expression levels with HapMap3 SNPs located in cis to the genes. We describe the influence of ancestry on gene expression levels within and between these diverse human populations and uncover a non-negligible impact on global patterns of gene expression. We further dissect the specific functional pathways differentiated between populations. We also identify 5,691 expression quantitative trait loci (eQTLs) after controlling for both non-genetic factors and population admixture and observe that half of the cis-eQTLs are replicated in one or more of the populations. We highlight patterns of eQTL-sharing between populations, which are partially determined by population genetic relatedness, and discover significant sharing of eQTL effects between Asians, European-admixed, and African subpopulations. Specifically, we observe that both the effect size and the direction of effect for eQTLs are highly conserved across populations. We observe an increasing proximity of eQTLs toward the transcription start site as sharing of eQTLs among populations increases, highlighting that variants close to TSS have stronger effects and therefore are more likely to be detected across a wider panel of populations. Together these results offer a unique picture and resource of the degree of differentiation among human populations in functional regulatory variation and provide an estimate for the transferability of complex trait variants across populations.
View details for DOI 10.1371/journal.pgen.1002639
View details for Web of Science ID 000303441800020
View details for PubMedID 22532805
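The cis-eQTL analysis described in the abstract above correlates gene expression with nearby SNP genotypes. The sketch below is an illustration only: it assumes genotypes coded as allele dosages (0, 1, 2), uses an ordinary least-squares slope and Pearson correlation (the abstract does not name the correlation measure used), and runs on invented numbers. The function name `eqtl_effect` is hypothetical.

```python
from statistics import mean


def eqtl_effect(dosages, expression):
    """Least-squares slope and Pearson r of expression on allele dosage."""
    if len(dosages) != len(expression) or len(dosages) < 3:
        raise ValueError("need matched samples, at least three of them")
    mean_x, mean_y = mean(dosages), mean(expression)
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype or expression has no variation")
    slope = sxy / sxx               # expression change per extra allele copy
    r = sxy / (sxx * syy) ** 0.5    # strength and direction of association
    return slope, r


# Hypothetical data: six individuals, one SNP, one gene.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
slope, r = eqtl_effect(dosages, expression)
```

Running the same calculation separately in each population and comparing the sign and size of the slope is one simple way to look at the cross-population sharing of effects that the abstract reports.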
A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes
2012; 335 (6070): 823-828
Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
View details for DOI 10.1126/science.1215040
View details for Web of Science ID 000300356400036
View details for PubMedID 22344438
Meta-analysis of genome-wide association studies identifies three new risk loci for atopic dermatitis
2012; 44 (2): 187-192
Atopic dermatitis (AD) is a commonly occurring chronic skin disease with high heritability. Apart from filaggrin (FLG), the genes influencing atopic dermatitis are largely unknown. We conducted a genome-wide association meta-analysis of 5,606 affected individuals and 20,565 controls from 16 population-based cohorts and then examined the ten most strongly associated new susceptibility loci in an additional 5,419 affected individuals and 19,833 controls from 14 studies. Three SNPs reached genome-wide significance in the discovery and replication cohorts combined, including rs479844 upstream of OVOL1 (odds ratio (OR) = 0.88, P = 1.1 × 10⁻¹³) and rs2164983 near ACTL9 (OR = 1.16, P = 7.1 × 10⁻⁹), both of which are near genes that have been implicated in epidermal proliferation and differentiation, as well as rs2897442 in KIF3A within the cytokine cluster at 5q31.1 (OR = 1.11, P = 3.8 × 10⁻⁸). We also replicated association with the FLG locus and with two recently identified association signals at 11q13.5 (rs7927894; P = 0.008) and 20q13.33 (rs6010620; P = 0.002). Our results underline the importance of both epidermal barrier function and immune dysregulation in atopic dermatitis pathogenesis.
View details for DOI 10.1038/ng.1017
View details for Web of Science ID 000299664400018
View details for PubMedID 22197932
DNA methylation profiles of human active and inactive X chromosomes
2011; 21 (10): 1592-1600
X-chromosome inactivation (XCI) is a dosage compensation mechanism that silences the majority of genes on one X chromosome in each female cell. To characterize epigenetic changes that accompany this process, we measured DNA methylation levels in 45,X patients carrying a single active X chromosome (X(a)), and in normal females, who carry one X(a) and one inactive X (X(i)). Methylated DNA was immunoprecipitated and hybridized to high-density oligonucleotide arrays covering the X chromosome, generating epigenetic profiles of active and inactive X chromosomes. We observed that XCI is accompanied by changes in DNA methylation specifically at CpG islands (CGIs). While the majority of CGIs show increased methylation levels on the X(i), XCI actually results in significant reductions in methylation at 7% of CGIs. Both intra- and inter-genic CGIs undergo epigenetic modification, with the biggest increase in methylation occurring at the promoters of genes silenced by XCI. In contrast, genes escaping XCI generally have low levels of promoter methylation, while genes that show inter-individual variation in silencing show intermediate increases in methylation. Thus, promoter methylation and susceptibility to XCI are correlated. We also observed a global correlation between CGI methylation and the evolutionary age of X-chromosome strata, and that genes escaping XCI show increased methylation within gene bodies. We used our epigenetic map to predict 26 novel genes escaping XCI, and searched for parent-of-origin-specific methylation differences, but found no evidence to support imprinting on the human X chromosome. Our study provides a detailed analysis of the epigenetic profile of active and inactive X chromosomes.
View details for DOI 10.1101/gr.112680.110
View details for Web of Science ID 000295407800004
View details for PubMedID 21862626
Epistatic Selection between Coding and Regulatory Variation in Human Evolution and Disease
AMERICAN JOURNAL OF HUMAN GENETICS
2011; 89 (3): 459-463
Interaction (nonadditive effects) between genetic variants has been highlighted as an important mechanism underlying phenotypic variation, but the discovery of genetic interactions in humans has proved difficult. In this study, we show that the spectrum of variation in the human genome has been shaped by modifier effects of cis-regulatory variation on the functional impact of putatively deleterious protein-coding variants. We analyzed 1000 Genomes population-scale resequencing data from Europe (CEU [Utah residents with Northern and Western European ancestry from the CEPH collection]) and Africa (YRI [Yoruba in Ibadan, Nigeria]) together with gene expression data from arrays and RNA sequencing for the same samples. We observed an underrepresentation of derived putatively functional coding variation on the more highly expressed regulatory haplotype, which suggests stronger purifying selection against deleterious coding variants that have increased penetrance because of their regulatory background. Furthermore, the frequency spectrum and impact size distribution of common regulatory polymorphisms (eQTLs) appear to be shaped in order to minimize the selective disadvantage of having deleterious coding mutations on the more highly expressed haplotype. Interestingly, eQTLs explaining common disease GWAS signals showed an enrichment of putative epistatic effects, suggesting that some disease associations might arise from interactions increasing the penetrance of rare coding variants. In conclusion, our results indicate that regulatory and coding variants often modify the functional impact of each other. This specific type of genetic interaction is detectable from sequencing data in a genome-wide manner, and characterizing these joint effects might help us understand functional mechanisms behind genetic associations to human phenotypes, including both Mendelian and common disease.
View details for DOI 10.1016/j.ajhg.2011.08.004
View details for Web of Science ID 000294939800012
View details for PubMedID 21907014
Rare and Common Regulatory Variation in Population-Scale Sequenced Human Genomes
2011; 7 (7)
Population-scale genome sequencing allows the characterization of functional effects of a broad spectrum of genetic variants underlying human phenotypic variation. Here, we investigate the influence of rare and common genetic variants on gene expression patterns, using variants identified from sequencing data from the 1000 genomes project in an African and European population sample and gene expression data from lymphoblastoid cell lines. We detect comparable numbers of expression quantitative trait loci (eQTLs) when compared to genotypes obtained from HapMap 3, but as many as 80% of the top expression quantitative trait variants (eQTVs) discovered from 1000 genomes data are novel. The properties of the newly discovered variants suggest that mapping common causal regulatory variants is challenging even with full resequencing data; however, we observe significant enrichment of regulatory effects in splice-site and nonsense variants. Using RNA sequencing data, we show that 46.2% of nonsynonymous variants are differentially expressed in at least one individual in our sample, creating widespread potential for interactions between functional protein-coding and regulatory variants. We also use allele-specific expression to identify putative rare causal regulatory variants. Furthermore, we demonstrate that outlier expression values can be due to rare variant effects, and we approximate the number of such effects harboured in an individual by effect size. Our results demonstrate that integration of genomic and RNA sequencing analyses allows for the joint assessment of genome sequence and genome function.
View details for DOI 10.1371/journal.pgen.1002144
View details for Web of Science ID 000293338600007
View details for PubMedID 21811411
Genome-wide association study identifies a common variant associated with risk of endometrial cancer
2011; 43 (5): 451-?
Endometrial cancer is the most common malignancy of the female genital tract in developed countries. To identify genetic variants associated with endometrial cancer risk, we performed a genome-wide association study involving 1,265 individuals with endometrial cancer (cases) from Australia and the UK and 5,190 controls from the Wellcome Trust Case Control Consortium. We compared genotype frequencies in cases and controls for 519,655 SNPs. Forty-seven SNPs that showed evidence of association with endometrial cancer in stage 1 were genotyped in 3,957 additional cases and 6,886 controls. We identified an endometrial cancer susceptibility locus close to HNF1B at 17q12 (rs4430796, P = 7.1 × 10⁻¹⁰) that is also associated with risk of prostate cancer and is inversely associated with risk of type 2 diabetes.
View details for DOI 10.1038/ng.812
View details for Web of Science ID 000289972600015
View details for PubMedID 21499250
From expression QTLs to personalized transcriptomics
NATURE REVIEWS GENETICS
2011; 12 (4): 277-282
Approaches that combine expression quantitative trait loci (eQTLs) and genome-wide association (GWA) studies are offering new functional information about the aetiology of complex human traits and diseases. Improved study designs--which take into account technological advances in resolving the transcriptome, cell history and state, population of origin and diverse endophenotypes--are providing insights into the architecture of disease and the landscape of gene regulation in humans. Furthermore, these advances are helping to establish links between cellular effects and organismal traits.
View details for DOI 10.1038/nrg2969
View details for Web of Science ID 000288531700011
View details for PubMedID 21386863
The Architecture of Gene Regulatory Variation across Multiple Human Tissues: The MuTHER Study
2011; 7 (2)
While there have been studies exploring regulatory variation in one or more tissues, the complexity of tissue-specificity in multiple primary tissues is not yet well understood. We explore in depth the role of cis-regulatory variation in three human tissues: lymphoblastoid cell lines (LCL), skin, and fat. The samples (156 LCL, 160 skin, 166 fat) were derived simultaneously from a subset of well-phenotyped healthy female twins of the MuTHER resource. We discover an abundance of cis-eQTLs in each tissue similar to previous estimates (858 or 4.7% of genes). In addition, we apply factor analysis (FA) to remove effects of latent variables, thus more than doubling the number of our discoveries (1,822 eQTL genes). The unique study design (Matched Co-Twin Analysis--MCTA) permits immediate replication of eQTLs using co-twins (93%-98%) and validation of the considerable gain in eQTL discovery after FA correction. We highlight the challenges of comparing eQTLs between tissues. After verifying previous significance threshold-based estimates of tissue-specificity, we show their limitations given their dependency on statistical power. We propose that continuous estimates of the proportion of tissue-shared signals and direct comparison of the magnitude of effect on the fold change in expression are essential properties that jointly provide a biologically realistic view of tissue-specificity. Under this framework we demonstrate that 30% of eQTLs are shared among the three tissues studied, while another 29% appear exclusively tissue-specific. However, even among the shared eQTLs, a substantial proportion (10%-20%) have significant differences in the magnitude of fold change between genotypic classes across tissues. Our results underline the need to account for the complexity of eQTL tissue-specificity in an effort to assess consequences of such variants for complex traits.
View details for DOI 10.1371/journal.pgen.1002003
View details for Web of Science ID 000287697300035
View details for PubMedID 21304890
Identification of cis- and trans- regulatory variation modulating microRNA expression levels in human fibroblasts
2011; 21 (1): 68-73
MicroRNAs (miRNAs) are regulatory noncoding RNAs that affect the production of a significant fraction of human mRNAs via post-transcriptional regulation. Interindividual variation of the miRNA expression levels is likely to influence the expression of miRNA target genes and may therefore contribute to phenotypic differences in humans, including susceptibility to common disorders. The extent to which miRNA levels are genetically controlled is largely unknown. In this report, we assayed the expression levels of miRNAs in primary fibroblasts from 180 European newborns of the GenCord project and performed association analysis to identify eQTLs (expression quantitative trait loci). We detected robust expression for 121 miRNAs out of 365 interrogated. We have identified significant cis- (10%) and trans- (11%) eQTLs. Furthermore, we detected one genomic locus (rs1522653) that influences the expression levels of five miRNAs, thus unraveling a novel mechanism for coregulation of miRNA expression.
View details for DOI 10.1101/gr.109371.110
View details for Web of Science ID 000285868300007
View details for PubMedID 21147911
A map of human genome variation from population-scale sequencing
2010; 467 (7319): 1061-1073
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
View details for DOI 10.1038/nature09534
View details for Web of Science ID 000283548600039
View details for PubMedCentralID PMC3042601
Genevar: a database and Java application for the analysis and visualization of SNP-gene associations in eQTL studies
2010; 26 (19): 2474-2476
Genevar (GENe Expression VARiation) is a database and Java tool designed to integrate multiple datasets, and provides analysis and visualization of associations between sequence variation and gene expression. Genevar allows researchers to investigate expression quantitative trait loci (eQTL) associations within a gene locus of interest in real time. The database and application can be installed on a standard computer in database mode and, in addition, on a server to share discoveries among affiliations or the broader community over the Internet via web services protocols. http://www.sanger.ac.uk/resources/software/genevar
View details for DOI 10.1093/bioinformatics/btq452
View details for Web of Science ID 000282170000023
View details for PubMedID 20702402
Integrating common and rare genetic variation in diverse human populations
2010; 467 (7311): 52-58
Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called 'HapMap 3', includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of
View details for DOI 10.1038/nature09298
View details for Web of Science ID 000281461200033
View details for PubMedID 20811451
Transcriptome genetics using second generation sequencing in a Caucasian population
2010; 464 (7289): 773-U151
Gene expression is an important phenotype that informs about genetic and environmental effects on cellular state. Many studies have previously identified genetic variants for gene expression phenotypes using custom and commercially available microarrays. Second generation sequencing technologies are now providing unprecedented access to the fine structure of the transcriptome. We have sequenced the mRNA fraction of the transcriptome in 60 extended HapMap individuals of European descent and have combined these data with genetic variants from the HapMap3 project. We have quantified exon abundance based on read depth and have also developed methods to quantify whole transcript abundance. We have found that approximately 10 million reads of sequencing can provide access to the same dynamic range as arrays with better quantification of alternative and highly abundant transcripts. Correlation with SNPs (single nucleotide polymorphisms) leads to a larger discovery of eQTLs (expression quantitative trait loci) than with arrays. We also detect a substantial number of variants that influence the structure of mature transcripts indicating variants responsible for alternative splicing. Finally, measures of allele-specific expression allowed the identification of rare eQTLs and allelic differences in transcript structure. This analysis shows that high throughput sequencing technologies reveal new properties of genetic effects on the transcriptome and allow the exploration of genetic effects in cellular processes.
View details for DOI 10.1038/nature08903
View details for Web of Science ID 000276205000048
View details for PubMedID 20220756
Candidate Causal Regulatory Effects by Integration of Expression QTLs with Complex Trait Genetic Associations
2010; 6 (4)
The recent success of genome-wide association studies (GWAS) is now followed by the challenge to determine how the reported susceptibility variants mediate complex traits and diseases. Expression quantitative trait loci (eQTLs) have been implicated in disease associations through overlaps between eQTLs and GWAS signals. However, the abundance of eQTLs and the strong correlation structure (LD) in the genome make it likely that some of these overlaps are coincidental and not driven by the same functional variants. In the present study, we propose an empirical methodology, which we call Regulatory Trait Concordance (RTC), that accounts for local LD structure and integrates eQTLs and GWAS results in order to reveal the subset of association signals that are due to cis eQTLs. We simulate genomic regions of various LD patterns with either one or two causal variants and show that our score outperforms SNP correlation metrics, be they statistical (r²) or historical (D'). Following the observation of a significant abundance of regulatory signals among currently published GWAS loci, we apply our method with the goal of prioritizing relevant genes for each of the respective complex traits. We detect several potential disease-causing regulatory effects, with a strong enrichment for immunity-related conditions, consistent with the nature of the cell line tested (LCLs). Furthermore, we present an extension of the method in trans, where interrogating the whole genome for downstream effects of the disease variant can be informative regarding its unknown primary biological effect. We conclude that integrating cellular phenotype associations with organismal complex traits will facilitate the biological interpretation of the genetic effects on these traits.
View details for DOI 10.1371/journal.pgen.1000895
View details for Web of Science ID 000277354200012
View details for PubMedID 20369022
- Out of the sequencer and into the wiki as we face new challenges in genome informatics. Genome biology 2010; 11 (10): 308-?
Annotating the regulatory genome.
Methods in molecular biology (Clifton, N.J.)
2010; 674: 313-349
Determining the timing and molecular repertoire responsible for gene expression is fundamental to understanding a gene's function. Heritable differences in this character are increasingly regarded as explanatory for complex and common traits. For many known trait-predisposing genes, studies have sought to elucidate the associated logic behind gene regulation. However, there exist many challenges in deciphering these mechanisms. Among them, it is recognized that we have limited understanding of regulatory complexity, the current models of gene regulation have low specificity and any gene's regulatory logic is dependent on biological context. Addressing these limitations and defining the regulatory genome is an ongoing challenge for molecular biology. We discuss current efforts to define and annotate the regulatory genome by focusing on curation and text-mining activities. We further highlight the type of information and curation process for describing regulatory elements within the ORegAnno database ( www.oreganno.org ) and how the general standards for such information are changing.
View details for DOI 10.1007/978-1-60761-854-6_20
View details for PubMedID 20827601
The resolution of the genetics of gene expression
HUMAN MOLECULAR GENETICS
2009; 18: R211-R215
Understanding the influence of genetics on the molecular mechanisms underpinning human phenotypic diversity is fundamental to being able to predict health outcomes and treat disease. To interrogate the role of genetics on cellular state and function, gene expression has been extensively used. Past and present studies have highlighted important patterns of heritability, population differentiation and tissue-specificity in gene expression. Current and future studies are taking advantage of systems biology-based approaches and advances in sequencing technology: new methodology aims to translate regulatory networks to enrich pathways responsible for disease etiology and 2nd generation sequencing now offers single-molecular resolution of the transcriptome providing unprecedented information on the structural and genetic characteristics of gene expression. Such advances are leading to a future where rich cellular phenotypes will facilitate understanding of the transmission of genetic effect from the gene to organism.
View details for DOI 10.1093/hmg/ddp400
View details for Web of Science ID 000271265600012
View details for PubMedID 19808798
Common Regulatory Variation Impacts Gene Expression in a Cell Type-Dependent Manner
2009; 325 (5945): 1246-1250
Studies correlating genetic variation to gene expression facilitate the interpretation of common human phenotypes and disease. As functional variants may be operating in a tissue-dependent manner, we performed gene expression profiling and association with genetic variants (single-nucleotide polymorphisms) on three cell types of 75 individuals. We detected cell type-specific genetic effects, with 69 to 80% of regulatory variants operating in a cell type-specific manner, and identified multiple expression quantitative trait loci (eQTLs) per gene, unique or shared among cell types and positively correlated with the number of transcripts per gene. Cell type-specific eQTLs were found at larger distances from genes and at lower effect size, similar to known enhancers. These data suggest that the complete regulatory variant repertoire can only be uncovered in the context of cell-type specificity.
View details for DOI 10.1126/science.1174148
View details for Web of Science ID 000269523200038
View details for PubMedID 19644074
Is the thrifty genotype hypothesis supported by evidence based on confirmed type 2 diabetes- and obesity-susceptibility variants?
2009; 52 (9): 1846-1851
According to the thrifty genotype hypothesis, the high prevalence of type 2 diabetes and obesity is a consequence of genetic variants that have undergone positive selection during historical periods of erratic food supply. The recent expansion in the number of validated type 2 diabetes- and obesity-susceptibility loci, coupled with access to empirical data, enables us to look for evidence in support (or otherwise) of the thrifty genotype hypothesis using proven loci. We employed a range of tests to obtain complementary views of the evidence for selection: we determined whether the risk allele at associated 'index' single-nucleotide polymorphisms is derived or ancestral, calculated the integrated haplotype score (iHS) and assessed the population differentiation statistic fixation index (F(ST)) for 17 type 2 diabetes and 13 obesity loci. We found no evidence for significant differences for the derived/ancestral allele test. None of the studied loci showed strong evidence for selection based on the iHS score. We find a high F(ST) for rs7901695 at TCF7L2, the largest type 2 diabetes effect size found to date. Our results provide some evidence for selection at specific loci, but there are no consistent patterns of selection that provide conclusive confirmation of the thrifty genotype hypothesis. Discovery of more signals and more causal variants for type 2 diabetes and obesity is likely to allow more detailed examination of these issues.
View details for DOI 10.1007/s00125-009-1419-3
View details for Web of Science ID 000268776100018
View details for PubMedID 19526209
Current computational methods for prioritizing candidate regulatory polymorphisms.
Methods in molecular biology (Clifton, N.J.)
2009; 569: 89-114
Discovery of DNA sequence variants responsible for human phenotypic variation is key to advances in molecular diagnostics and medicines. Historically, variants that alter the protein-coding sequence of genes have been targeted when attempting to identify a trait's etiology; this is done because the rules governing these regions are generally well-understood and candidate variants can be easily selected. However, the effects of variants on gene regulation are increasingly regarded as being as important as protein-coding variation in uncovering the nature of phenotypic variation. I discuss resources and methodology that have recently been developed to computationally prioritize variants that may alter gene expression.
View details for DOI 10.1007/978-1-59745-524-4_5
View details for PubMedID 19623487
ORegAnno: an open-access community-driven resource for regulatory annotation
NUCLEIC ACIDS RESEARCH
2008; 36: D107-D113
ORegAnno is an open-source, open-access database and literature curation system for community-based annotation of experimentally identified DNA regulatory regions, transcription factor binding sites and regulatory variants. The current release comprises 30 145 records curated from 922 publications and describing regulatory sequences for over 3853 genes and 465 transcription factors from 19 species. A new feature called the 'publication queue' allows users to input relevant papers from scientific literature as targets for annotation. The queue contains 4438 gene regulation papers entered by experts and another 54 351 identified by text-mining methods. Users can enter or 'check out' papers from the queue for manual curation using a series of user-friendly annotation pages. A typical record entry consists of species, sequence type, sequence, target gene, binding factor, experimental outcome and one or more lines of experimental evidence. An evidence ontology was developed to describe and categorize these experiments. Records are cross-referenced to Ensembl or Entrez gene identifiers, PubMed and dbSNP and can be visualized in the Ensembl or UCSC genome browsers. All data are freely available through search pages, XML data dumps or web services at: http://www.oreganno.org.
View details for DOI 10.1093/nar/gkm967
View details for Web of Science ID 000252545400020
View details for PubMedID 18006570
Text-mining assisted regulatory annotation
2008; 9 (2)
Decoding transcriptional regulatory networks and the genomic cis-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature. We develop text-mining strategies to identify relevant publications and extract sequence information to assist the regulatory annotation process. Using a vector space model to identify Medline abstracts from papers likely to have high cis-regulatory content, we demonstrate that document relevance ranking can assist the curation of transcriptional regulatory networks and estimate that, minimally, 30,000 papers harbor unannotated cis-regulatory data. In addition, we show that DNA sequences can be extracted from primary text with high cis-regulatory content and mapped to genome sequences as a means of identifying the location, organism and target gene information that is critical to the cis-regulatory annotation process. Our results demonstrate that text-mining technologies can be successfully integrated with genome annotation systems, thereby increasing the availability of annotated cis-regulatory data needed to catalyze advances in the field of gene regulation.
View details for DOI 10.1186/gb-2008-9-2-r31
View details for Web of Science ID 000254659300013
View details for PubMedID 18271954
Population genomics of human gene expression
2007; 39 (10): 1217-1224
Genetic variation influences gene expression, and this variation in gene expression can be efficiently mapped to specific genomic regions and variants. Here we have used gene expression profiling of Epstein-Barr virus-transformed lymphoblastoid cell lines of all 270 individuals genotyped in the HapMap Consortium to elucidate the detailed features of genetic variation underlying gene expression variation. We find that gene expression is heritable and that differentiation between populations is in agreement with earlier small-scale studies. A detailed association analysis of over 2.2 million common SNPs per population (5% frequency in HapMap) with gene expression identified at least 1,348 genes with association signals in cis and at least 180 in trans. Replication in at least one independent population was achieved for 37% of cis signals and 15% of trans signals, respectively. Our results strongly support an abundance of cis-regulatory variation in the human genome. Detection of trans effects is limited but suggests that regulatory variation may be the key primary effect contributing to phenotypic variation in humans. We also explore several methodologies that improve the current state of analysis of gene expression variation.
View details for DOI 10.1038/ng2142
View details for Web of Science ID 000249737400017
View details for PubMedID 17873874
A survey of genomic properties for the detection of regulatory polymorphisms
PLOS COMPUTATIONAL BIOLOGY
2007; 3 (6): 1000-1010
Advances in the computational identification of functional noncoding polymorphisms will aid in cataloging novel determinants of health and identifying genetic variants that explain human evolution. To date, however, the development and evaluation of such techniques has been limited by the availability of known regulatory polymorphisms. We have attempted to address this by assembling, from the literature, a computationally tractable set of regulatory polymorphisms within the ORegAnno database (http://www.oreganno.org). We have further used 104 regulatory single-nucleotide polymorphisms from this set and 951 polymorphisms of unknown function, from 2-kb and 152-bp noncoding upstream regions of genes, to investigate the discriminatory potential of 23 properties related to gene regulation and population genetics. Among the most important properties detected in this region are distance to transcription start site, local repetitive content, sequence conservation, minor and derived allele frequencies, and presence of a CpG island. We further used the entire set of properties to evaluate their collective performance in detecting regulatory polymorphisms. Using a 10-fold cross-validation approach, we were able to achieve a sensitivity and specificity of 0.82 and 0.71, respectively, and we show that this performance is strongly influenced by the distance to the transcription start site.
View details for DOI 10.1371/journal.pcbi.0030106
View details for Web of Science ID 000249105500010
View details for PubMedID 17559298
ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation
2006; 22 (5): 637-640
Our understanding of gene regulation is currently limited by our ability to collectively synthesize and catalogue transcriptional regulatory elements stored in scientific literature. Over the past decade, this task has become increasingly challenging as the accrual of biologically validated regulatory sequences has accelerated. To meet this challenge, novel community-based approaches to regulatory element annotation are required. Here, we present the Open Regulatory Annotation (ORegAnno) database as a dynamic collection of literature-curated regulatory regions, transcription factor binding sites and regulatory mutations (polymorphisms and haplotypes). ORegAnno has been designed to manage the submission, indexing and validation of new annotations from users worldwide. Submissions to ORegAnno are immediately cross-referenced to EnsEMBL, dbSNP, Entrez Gene, the NCBI Taxonomy database and PubMed, where appropriate. ORegAnno is available directly through MySQL, Web services, and online at http://www.oreganno.org. All software is licensed under the Lesser GNU Public License (LGPL).
View details for DOI 10.1093/bioinformatics/btk027
View details for Web of Science ID 000235604400024
View details for PubMedID 16397004
cisRED: a database system for genome-scale computational discovery of regulatory elements
NUCLEIC ACIDS RESEARCH
2006; 34: D68-D73
We describe cisRED, a database for conserved regulatory elements that are identified and ranked by a genome-scale computational system (www.cisred.org). The database and high-throughput predictive pipeline are designed to address diverse target genomes in the context of rapidly evolving data resources and tools. Motifs are predicted in promoter regions using multiple discovery methods applied to sequence sets that include corresponding sequence regions from vertebrates. We estimate motif significance by applying discovery and post-processing methods to randomized sequence sets that are adaptively derived from target sequence sets, retain motifs with p-values below a threshold and identify groups of similar motifs and co-occurring motif patterns. The database offers information on atomic motifs, motif groups and patterns. It is web-accessible, and can be queried directly, downloaded or installed locally.
View details for DOI 10.1093/nar/gkj075
View details for Web of Science ID 000239307700015
View details for PubMedID 16381958
- An application of peer-to-peer technology to the discovery, use and assessment of bioinformatics programs NATURE METHODS 2005; 2 (8): 563-563
Sockeye: A 3D environment for comparative genomics
2004; 14 (5): 956-962
Comparative genomics techniques are used in bioinformatics analyses to identify the structural and functional properties of DNA sequences. As the amount of available sequence data steadily increases, the ability to perform large-scale comparative analyses has become increasingly relevant. In addition, the growing complexity of genomic feature annotation means that new approaches to genomic visualization need to be explored. We have developed a Java-based application called Sockeye that uses three-dimensional (3D) graphics technology to facilitate the visualization of annotation and conservation across multiple sequences. This software uses the Ensembl database project to import sequence and annotation information from several eukaryotic species. A user can additionally import their own custom sequence and annotation data. Individual annotation objects are displayed in Sockeye by using custom 3D models. Ensembl-derived and imported sequences can be analyzed by using a suite of multiple and pair-wise alignment algorithms. The results of these comparative analyses are also displayed in the 3D environment of Sockeye. By using the Java3D API to visualize genomic data in a 3D environment, we are able to compactly display cross-sequence comparisons. This provides the user with a novel platform for visualizing and comparing genomic feature organization.
View details for DOI 10.1101/gr.1890304
View details for Web of Science ID 000221171700022
View details for PubMedID 15123592
The genome sequence of the SARS-associated coronavirus
2003; 300 (5624): 1399-1404
We sequenced the 29,751-base genome of the severe acute respiratory syndrome (SARS)-associated coronavirus known as the Tor2 isolate. The genome sequence reveals that this coronavirus is only moderately related to other known coronaviruses, including two human coronaviruses, HCoV-OC43 and HCoV-229E. Phylogenetic analysis of the predicted viral proteins indicates that the virus does not closely resemble any of the three previously known groups of coronaviruses. The genome sequence will aid in the diagnosis of SARS virus infection in humans and potential animal hosts (using polymerase chain reaction and immunological tests), in the development of antivirals (including neutralizing antibodies), and in the identification of putative epitopes for vaccine development.
View details for DOI 10.1126/science.1085953
View details for Web of Science ID 000183181800036
View details for PubMedID 12730501