Honors & Awards
Clarendon Scholar, University of Oxford (2010-2015)
Osler Award, University of Oxford (2010-2015)
Gates Millennium Scholar, Bill & Melinda Gates Foundation (2004-2008)
DPhil, University of Oxford, Clinical Medicine (2015)
B.S., Massachusetts Institute of Technology, Mathematics (2008)
- Topics in Biomedical Data Science: Large-scale inference
BIODS 215 (Win)
- Workshop in Biostatistics
BIODS 260C, STATS 260C (Spr)
- Independent Studies (4)
Prior Year Courses
- Topics in Biomedical Data Science: Large-scale inference
BIODS 215 (Spr)
- Workshop in Biostatistics
BIODS 260C, STATS 260C (Spr)
- Topics in Biomedical Data Science: Large-scale inference
The genetic architecture of type 2 diabetes
2016; 536 (7614): 41-?
The genetic architecture of common traits, including the number, frequency, and effect sizes of inherited variants that contribute to individual risk, has been long debated. Genome-wide association studies have identified scores of common variants associated with type 2 diabetes, but in aggregate, these explain only a fraction of the heritability of this disease. Here, to test the hypothesis that lower-frequency variants explain much of the remainder, the GoT2D and T2D-GENES consortia performed whole-genome sequencing in 2,657 European individuals with and without diabetes, and exome sequencing in 12,940 individuals from five ancestry groups. To increase statistical power, we expanded the sample size via genotyping and imputation in a further 111,548 subjects. Variants associated with type 2 diabetes after sequencing were overwhelmingly common and most fell within regions previously identified by genome-wide association studies. Comprehensive enumeration of sequence variation is necessary to identify functional alleles that provide important clues to disease pathophysiology, but large-scale sequencing does not support the idea that lower-frequency variants have a major role in predisposition to type 2 diabetes.
View details for DOI 10.1038/nature18642
View details for Web of Science ID 000380999200026
View details for PubMedID 27398621
A protein-truncating R179X variant in RNF186 confers protection against ulcerative colitis
Protein-truncating variants protective against human disease provide in vivo validation of therapeutic targets. Here we used targeted sequencing to conduct a search for protein-truncating variants conferring protection against inflammatory bowel disease exploiting knowledge of common variants associated with the same disease. Through replication genotyping and imputation we found that a predicted protein-truncating variant (rs36095412, p.R179X, genotyped in 11,148 ulcerative colitis patients and 295,446 controls, MAF=up to 0.78%) in RNF186, a single-exon ring finger E3 ligase with strong colonic expression, protects against ulcerative colitis (overall P=6.89 × 10⁻⁷, odds ratio=0.30). We further demonstrate that the truncated protein exhibits reduced expression and altered subcellular localization, suggesting the protective mechanism may reside in the loss of an interaction or function via mislocalization and/or loss of an essential transmembrane domain.
View details for DOI 10.1038/ncomms12342
View details for Web of Science ID 000380952600001
View details for PubMedID 27503255
Discovery of rare variants for complex phenotypes
2016; 135 (6): 625-634
With the rise of sequencing technologies, it is now feasible to assess the role rare variants play in the genetic contribution to complex trait variation. While some of the earlier targeted sequencing studies successfully identified rare variants of large effect, unbiased gene discovery using exome sequencing has experienced limited success for complex traits. Nevertheless, rare variant association studies have demonstrated that rare variants do contribute to phenotypic variability, but sample sizes will likely have to be even larger than those of common variant association studies to be powered for the detection of genes and loci. Large-scale sequencing efforts of tens of thousands of individuals, such as the UK10K Project and aggregation efforts such as the Exome Aggregation Consortium, have made great strides in advancing our knowledge of the landscape of rare variation, but there remain many considerations when studying rare variation in the context of complex traits. We discuss these considerations in this review, presenting a broad range of topics at a high level as an introduction to rare variant analysis in complex traits including the issues of power, study design, sample ascertainment, de novo variation, and statistical testing approaches. Ultimately, as sequencing costs continue to decline, larger sequencing studies will yield clearer insights into the biological consequence of rare mutations and may reveal which genes play a role in the etiology of complex traits.
View details for DOI 10.1007/s00439-016-1679-1
View details for Web of Science ID 000377017000005
View details for PubMedID 27221085
Assessing allele-specific expression across multiple tissues from RNA-seq read data
2015; 31 (15): 2497-2504
RNA sequencing enables allele-specific expression (ASE) studies that complement standard genotype expression studies for common variants and, importantly, also allow measuring the regulatory impact of rare variants. The Genotype-Tissue Expression (GTEx) project is collecting RNA-seq data on multiple tissues of the same set of individuals, and novel methods are required for the analysis of these data. We present a statistical method to compare different patterns of ASE across tissues and to classify genetic variants according to their impact on the tissue-wide expression profile. We focus on strong ASE effects that we expect to see for protein-truncating variants, but our method can also be adjusted for other types of ASE effects. We illustrate the method with a real data example on a tissue-wide expression profile of a variant causal for lipoid proteinosis, and with a simulation study to assess our method more generally.
View details for DOI 10.1093/bioinformatics/btv074
View details for Web of Science ID 000359312400011
View details for PubMedID 25819081
- Effect of predicted protein-truncating genetic variants on the human transcriptome SCIENCE 2015; 348 (6235): 666-669
The Power of Gene-Based Rare Variant Methods to Detect Disease-Associated Variation and Test Hypotheses About Complex Disease
2015; 11 (4)
Genome and exome sequencing in large cohorts enables characterization of the role of rare variation in complex diseases. Success in this endeavor, however, requires investigators to test a diverse array of genetic hypotheses which differ in the number, frequency and effect sizes of underlying causal variants. In this study, we evaluated the power of gene-based association methods to interrogate such hypotheses, and examined the implications for study design. We developed a flexible simulation approach, using 1000 Genomes data, to (a) generate sequence variation at human genes in up to 10K case-control samples, and (b) quantify the statistical power of a panel of widely used gene-based association tests under a variety of allelic architectures, locus effect sizes, and significance thresholds. For loci explaining ~1% of phenotypic variance underlying a common dichotomous trait, we find that all methods have low absolute power to achieve exome-wide significance (~5-20% power at α = 2.5 × 10⁻⁶) in 3K individuals; even in 10K samples, power is modest (~60%). The combined application of multiple methods increases sensitivity, but does so at the expense of a higher false positive rate. MiST, SKAT-O, and KBAC have the highest individual mean power across simulated datasets, but we observe wide architecture-dependent variability in the individual loci detected by each test, suggesting that inferences about disease architecture from analysis of sequencing studies can differ depending on which methods are used. Our results imply that tens of thousands of individuals, extensive functional annotation, or highly targeted hypothesis testing will be required to confidently detect or exclude rare variant signals at complex disease loci.
View details for DOI 10.1371/journal.pgen.1005165
View details for Web of Science ID 000354524200049
View details for PubMedID 25906071
Choice of transcripts and software has a large effect on variant annotation
Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. Incorrect or incomplete annotations can cause researchers both to overlook potentially disease-relevant DNA variants and to dilute interesting variants in a pool of false positives. Researchers are aware of these issues in general, but the extent of the dependency of final results on the choice of transcripts and software used for annotation has not been quantified in detail. This paper quantifies the extent of differences in annotation of 80 million variants from a whole-genome sequencing study. We compare results using the RefSeq and Ensembl transcript sets as the basis for variant annotation with the software Annovar, and also compare the results from two annotation software packages, Annovar and VEP (Ensembl's Variant Effect Predictor), when using Ensembl transcripts. We found only 44% agreement in annotations for putative loss-of-function variants when using the RefSeq and Ensembl transcript sets as the basis for annotation with Annovar. The rate of matching annotations for loss-of-function and nonsynonymous variants combined was 79% and for all exonic variants it was 83%. When comparing results from Annovar and VEP using Ensembl transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy. Using these comparisons, we characterised the types of apparent errors made by Annovar and VEP and discuss their impact on the analysis of DNA variants in genome sequencing studies. Variant annotation is not yet a solved problem. Choice of transcript set can have a large effect on the ultimate variant annotations obtained in a whole-genome sequencing study. Choice of annotation software can also have a substantial effect. The annotation step in the analysis of a genome sequencing study must therefore be considered carefully, and a conscious choice made as to which transcript set and software are used for annotation.
View details for DOI 10.1186/gm543
View details for Web of Science ID 000339377700001
View details for PubMedID 24944579
Assessing association between protein truncating variants and quantitative traits
2013; 29 (19): 2419-2426
In sequencing studies of common diseases and quantitative traits, power to test rare and low frequency variants individually is weak. To improve power, a common approach is to combine statistical evidence from several genetic variants in a region. Major challenges are how to do the combining and which statistical framework to use. General approaches for testing association between rare variants and quantitative traits include aggregating genotypes and trait values, referred to as 'collapsing', or using a score-based variance component test. However, little attention has been paid to alternative models tailored for protein truncating variants. Recent studies have highlighted the important role that protein truncating variants, commonly referred to as 'loss of function' variants, may have on disease susceptibility and quantitative levels of biomarkers. We propose a Bayesian modelling framework for the analysis of protein truncating variants and quantitative traits. Our simulation results show that our models have an advantage over the commonly used methods. We apply our models to sequence and exome-array data and discover strong evidence of association between low plasma triglyceride levels and protein truncating variants at APOC3 (Apolipoprotein C3). Software is available from http://www.well.ox.ac.uk/~rivas/mamba
View details for DOI 10.1093/bioinformatics/btt409
View details for Web of Science ID 000324778500008
View details for PubMedID 23860716
Transcriptome and genome sequencing uncovers functional variation in humans.
2013; 501 (7468): 506-511
Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project--the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences. We discover extremely widespread genetic variation affecting the regulation of most genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on the cellular mechanisms of regulatory and loss-of-function variation, and allows us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome.
View details for DOI 10.1038/nature12531
View details for PubMedID 24037378
Deep Resequencing of GWAS Loci Identifies Rare Variants in CARD9, IL23R and RNF186 That Are Associated with Ulcerative Colitis
2013; 9 (9)
Genome-wide association studies and follow-up meta-analyses in Crohn's disease (CD) and ulcerative colitis (UC) have recently identified 163 disease-associated loci that meet genome-wide significance for these two inflammatory bowel diseases (IBD). These discoveries have already had a tremendous impact on our understanding of the genetic architecture of these diseases and have directed functional studies that have revealed some of the biological functions that are important to IBD (e.g. autophagy). Nonetheless, these loci can only explain a small proportion of disease variance (~14% in CD and 7.5% in UC), suggesting that not only are additional loci to be found but that the known loci may contain high effect rare risk variants that have gone undetected by GWAS. To test this, we have used a targeted sequencing approach in 200 UC cases and 150 healthy controls (HC), all of French Canadian descent, to study 55 genes in regions associated with UC. We performed follow-up genotyping of 42 rare non-synonymous variants in independent case-control cohorts (totaling 14,435 UC cases and 20,204 HC). Our results confirmed significant association to rare non-synonymous coding variants in both IL23R and CARD9, previously identified from sequencing of CD loci, as well as identified a novel association in RNF186. With the exception of CARD9 (OR = 0.39), the rare non-synonymous variants identified were of moderate effect (OR = 1.49 for RNF186 and OR = 0.79 for IL23R). RNF186 encodes a protein with a RING domain having predicted E3 ubiquitin-protein ligase activity and two transmembrane domains. Importantly, the disease-coding variant is located in the ubiquitin ligase domain. Finally, our results suggest that rare variants in genes identified by genome-wide association in UC are unlikely to contribute significantly to the overall variance for the disease. Rather, these are expected to help focus functional studies of the corresponding disease loci.
View details for DOI 10.1371/journal.pgen.1003723
View details for Web of Science ID 000325076600010
View details for PubMedID 24068945
A Flexible Approach for the Analysis of Rare Variants Allowing for a Mixture of Effects on Binary or Quantitative Traits
2013; 9 (8)
Multiple rare variants either within or across genes have been hypothesised to collectively influence complex human traits. The increasing availability of high throughput sequencing technologies offers the opportunity to study the effect of rare variants on these traits. However, appropriate and computationally efficient analytical methods are required to account for collections of rare variants that display a combination of protective, deleterious and null effects on the trait. We have developed a novel method for the analysis of rare genetic variation in a gene, region or pathway that, by simply aggregating summary statistics at each variant, can: (i) test for the presence of a mixture of effects on a trait; (ii) be applied to both binary and quantitative traits in population-based and family-based data; (iii) adjust for covariates to allow for non-genetic risk factors and; (iv) incorporate imputed genetic variation. In addition, for preliminary identification of promising genes, the method can be applied to association summary statistics, available from meta-analysis of published data, for example, without the need for individual level genotype data. Through simulation, we show that our method is immune to the presence of bi-directional effects, with no apparent loss in power across a range of different mixtures, and can achieve greater power than existing approaches as long as summary statistics at each variant are robust. We apply our method to investigate association of type-1 diabetes with imputed rare variants within genes in the major histocompatibility complex using genotype data from the Wellcome Trust Case Control Consortium.
View details for DOI 10.1371/journal.pgen.1003694
View details for Web of Science ID 000323830300045
View details for PubMedID 23966874
Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease
2011; 43 (11): 1066-U50
More than 1,000 susceptibility loci have been identified through genome-wide association studies (GWAS) of common variants; however, the specific genes and full allelic spectrum of causal variants underlying these findings have not yet been defined. Here we used pooled next-generation sequencing to study 56 genes from regions associated with Crohn's disease in 350 cases and 350 controls. Through follow-up genotyping of 70 rare and low-frequency protein-altering variants in nine independent case-control series (16,054 Crohn's disease cases, 12,153 ulcerative colitis cases and 17,575 healthy controls), we identified four additional independent risk factors in NOD2, two additional protective variants in IL23R, a highly significant association with a protective splice variant in CARD9 (P < 1 × 10⁻¹⁶, odds ratio ≈ 0.29) and additional associations with coding variants in IL18RAP, CUL2, C1orf106, PTPN22 and MUC19. We extend the results of successful GWAS by identifying new, rare and probably functional variants that could aid functional experiments and predictive models.
View details for DOI 10.1038/ng.952
View details for Web of Science ID 000296584000009
View details for PubMedID 21983784
A framework for variation discovery and genotyping using next-generation DNA sequencing data
2011; 43 (5): 491-?
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.
View details for DOI 10.1038/ng.806
View details for Web of Science ID 000289972600023
View details for PubMedID 21478889
Testing for an Unusual Distribution of Rare Variants
2011; 7 (3)
Technological advances make it possible to use high-throughput sequencing as a primary discovery tool of medical genetics, specifically for assaying rare variation. Still this approach faces the analytic challenge that the influence of very rare variants can only be evaluated effectively as a group. A further complication is that any given rare variant could have no effect, could increase risk, or could be protective. We propose here the C-alpha test statistic as a novel approach for testing for the presence of this mixture of effects across a set of rare variants. Unlike existing burden tests, C-alpha, by testing the variance rather than the mean, maintains consistent power when the target set contains both risk and protective variants. Through simulations and analysis of case/control data, we demonstrate good power relative to existing methods that assess the burden of rare variants in individuals.
View details for DOI 10.1371/journal.pgen.1001322
View details for Web of Science ID 000288996600004
View details for PubMedID 21408211
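The variance-based idea in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming the commonly stated form of the C-alpha statistic: for each variant, the squared deviation of the case copy count from its binomial expectation, minus the binomial variance, summed over variants. The function name, the input layout, and the normal approximation for the p-value are assumptions made for illustration, not the authors' implementation, and the sketch does not attempt refinements such as pooling singletons or permutation-based p-values.

```python
from math import comb, erfc, sqrt


def c_alpha(case_counts, total_counts, p0):
    """Illustrative C-alpha style test of overdispersion.

    case_counts[i]  : copies of variant i observed in cases
    total_counts[i] : copies of variant i observed in cases + controls
    p0              : expected fraction of copies in cases under the null
                      (e.g. the fraction of the sample that are cases)
    """
    t_stat = 0.0
    variance = 0.0
    for y, n in zip(case_counts, total_counts):
        binom_var = n * p0 * (1 - p0)
        # Excess of the squared deviation over its binomial expectation.
        t_stat += (y - n * p0) ** 2 - binom_var
        # Null variance of that term, by enumerating the binomial distribution.
        for u in range(n + 1):
            prob = comb(n, u) * p0 ** u * (1 - p0) ** (n - u)
            variance += ((u - n * p0) ** 2 - binom_var) ** 2 * prob
    z = t_stat / sqrt(variance) if variance > 0 else 0.0
    # One-sided upper-tail p-value under a normal approximation.
    p_value = 0.5 * erfc(z / sqrt(2))
    return t_stat, variance, z, p_value


# Two variants seen only in cases and two seen only in controls:
# the mean burden is balanced, but the spread is far larger than binomial.
print(c_alpha([5, 0, 4, 0], [5, 5, 4, 4], 0.5))
```

In this toy input the case and control burdens are equal (9 copies each), so a mean-based burden comparison would see nothing, while the variance-based statistic is strongly positive because each variant sits entirely on one side.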
biMM: Efficient estimation of genetic variances and covariances for cohorts with high-dimensional phenotype measurements.
Genetic research utilizes a decomposition of trait variances and covariances into genetic and environmental parts. Our software package biMM is a computationally efficient implementation of a bivariate linear mixed model for settings where hundreds of traits have been measured on partially overlapping sets of individuals. Implementation in R is freely available at www.iki.fi/mpirinen. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btx166
View details for PubMedID 28369165
Variant Enriched in the Finnish Population is Associated With Fasting Insulin Levels and Type 2 Diabetes Risk.
To identify novel coding association signals and facilitate characterization of mechanisms influencing glycemic traits and type 2 diabetes risk, we analyzed 109,215 variants derived from exome array genotyping together with an additional 390,225 variants from exome sequence in up to 39,339 normoglycemic individuals from five ancestry groups. We identified a novel association between the coding variant (p.Pro50Thr) in AKT2 and fasting plasma insulin (FI), a gene in which rare fully penetrant mutations are causal for monogenic glycemic disorders. The low-frequency allele is associated with a 12% increase in FI levels. This variant is present at 1.1% frequency in Finns but virtually absent in individuals from other ancestries. Carriers of the FI-increasing allele had increased 2-h insulin values, decreased insulin sensitivity, and increased risk of type 2 diabetes (odds ratio 1.05). In cellular studies, the AKT2-Thr50 protein exhibited a partial loss of function. We extend the allelic spectrum for coding variants in AKT2 associated with disorders of glucose homeostasis and demonstrate bidirectional effects of variants within the pleckstrin homology domain of AKT2.
View details for DOI 10.2337/db16-1329
View details for PubMedID 28341696
Rare and low-frequency coding variants alter human adult height.
2017; 542 (7640): 186-190
Height is a highly heritable, classic polygenic trait with approximately 700 common associated variants identified through genome-wide association studies so far. Here, we report 83 height-associated coding variants with lower minor-allele frequencies (in the range of 0.1-4.8%) and effects of up to 2 centimetres per allele (such as those in IHH, STC2, AR and CRISPLD2), greater than ten times the average effect of common variants. In functional follow-up studies, rare height-increasing alleles of STC2 (giving an increase of 1-2 centimetres per allele) compromised proteolytic inhibition of PAPP-A and increased cleavage of IGFBP-4 in vitro, resulting in higher bioavailability of insulin-like growth factors. These 83 height-associated variants overlap genes that are mutated in monogenic growth disorders and highlight new biological candidates (such as ADAMTS3, IL11RA and NOX4) and pathways (such as proteoglycan and glycosaminoglycan synthesis) involved in growth. Our results demonstrate that sufficiently large sample sizes can uncover rare and low-frequency variants of moderate-to-large effect associated with polygenic human phenotypes, and that these variants implicate relevant genes and pathways.
View details for DOI 10.1038/nature21039
View details for PubMedID 28146470
View details for PubMedCentralID PMC5302847
Frameshift indels introduced by genome editing can lead to in-frame exon skipping.
2017; 12 (6)
The introduction of frameshift indels by genome editing has emerged as a powerful technique to study the functions of uncharacterized genes in cell lines and model organisms. Such mutations should lead to mRNA degradation owing to nonsense-mediated mRNA decay or the production of severely truncated proteins. Here, we show that frameshift indels engineered by genome editing can also lead to skipping of "multiple of three nucleotides" exons. Such splicing events result in in-frame mRNA that may encode fully or partially functional proteins. We also characterize a segregating nonsense variant (rs2273865) located in a "multiple of three nucleotides" exon of LGALS8 that increases exon skipping in human erythroblast samples. Our results highlight the potentially frequent contribution of exonic splicing regulatory elements and are important for the interpretation of negative results in genome editing experiments. Moreover, they may contribute to a better annotation of loss-of-function mutations in the human genome.
View details for DOI 10.1371/journal.pone.0178700
View details for PubMedID 28570605
Analysis of protein-coding genetic variation in 60,706 humans
2016; 536 (7616): 285-?
Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.
View details for DOI 10.1038/nature19057
View details for Web of Science ID 000381804900026
View details for PubMedID 27535533
A null mutation in ANGPTL8 does not associate with either plasma glucose or type 2 diabetes in humans
BMC ENDOCRINE DISORDERS
Experiments in mice initially suggested a role for the protein angiopoietin-like 8 (ANGPTL8) in glucose homeostasis. However, subsequent experiments in model systems have challenged this proposed role. We sought to better understand the importance of ANGPTL8 in human glucose homeostasis by examining the association of a null mutation in ANGPTL8 with fasting glucose levels and risk for type 2 diabetes. A naturally-occurring null mutation in human ANGPTL8 (rs145464906; c.361C>T; p.Q121X) is carried by ~1 in 1000 individuals of European ancestry and is associated with higher levels of plasma high-density lipoprotein cholesterol, suggesting that this mutation has functional significance. We examined the association of p.Q121X with fasting glucose levels and risk for type 2 diabetes in up to 95,558 individuals (14,824 type 2 diabetics and 80,734 controls). We found no significant association of p.Q121X with either fasting glucose or type 2 diabetes (p-value = 0.90 and 0.65, respectively). Given our sample sizes, we had >98% power to detect at least a 0.23 mmol/L effect on plasma glucose and >95% power to detect a 70% increase in risk for type 2 diabetes. Disruption of ANGPTL8 function in humans does not seem to have a large effect on measures of glucose tolerance.
View details for DOI 10.1186/s12902-016-0088-8
View details for Web of Science ID 000369216000001
View details for PubMedID 26822414
A Frameshift in CSF2RB Predominant Among Ashkenazi Jews Increases Risk for Crohn's Disease and Reduces Monocyte Signaling via GM-CSF.
Crohn's disease (CD) has the highest prevalence in Ashkenazi Jewish populations. We sought to identify rare, CD-associated frameshift variants of high functional and statistical effects. We performed exome sequencing and array-based genotype analyses of 1477 Ashkenazi Jewish individuals with CD and 2614 Ashkenazi Jewish individuals without CD (controls). To validate our findings, we performed genotype analyses of an additional 1515 CD cases and 7052 controls for frameshift mutations in the colony-stimulating factor 2-receptor β common subunit gene (CSF2RB). Intestinal tissues and blood samples were collected from patients with CD; lamina propria leukocytes were isolated and expression of CSF2RB and granulocyte-macrophage colony-stimulating factor-responsive cells were defined by adenomatous polyposis coli (APC) time-of-flight mass cytometry (CyTOF analysis). Variants of CSF2RB were transfected into HEK293 cells and the expression and functions of gene products were compared. In the discovery cohort, we associated CD with a frameshift mutation in CSF2RB (P = 8.52 × 10⁻⁴); the finding was validated in the replication cohort (combined P = 3.42 × 10⁻⁶). Incubation of intestinal lamina propria leukocytes with granulocyte-macrophage colony-stimulating factor resulted in high levels of phosphorylation of signal transducer and activator of transcription (STAT5) and lesser increases in phosphorylation of extracellular signal-regulated kinase and AK straining transforming (AKT). Cells co-transfected with full-length and mutant forms of CSF2RB had reduced pSTAT5 after stimulation with granulocyte-macrophage colony-stimulating factor, compared with cells transfected with control CSF2RB, indicating a dominant-negative effect of the mutant gene. Monocytes from patients with CD who were heterozygous for the frameshift mutation (6% of CD cases analyzed) had reduced responses to granulocyte-macrophage colony-stimulating factor and markedly decreased activity of aldehyde dehydrogenase; activity of this enzyme has been associated with immune tolerance. In a genetic analysis of Ashkenazi Jewish individuals, we associated CD with a frameshift mutation in CSF2RB. Intestinal monocytes from carriers of this mutation had reduced responses to granulocyte-macrophage colony-stimulating factor, providing an additional mechanism for alterations to the innate immune response in individuals with CD.
View details for DOI 10.1053/j.gastro.2016.06.045
View details for PubMedID 27377463
A Protein Domain and Family Based Approach to Rare Variant Association Analysis.
2016; 11 (4): e0153803
It has become common practice to analyse large scale sequencing data with statistical approaches based around the aggregation of rare variants within the same gene. We applied a novel approach to rare variant analysis by collapsing variants together using protein domain and family coordinates, regarded to be a more discrete definition of a biologically functional unit. Using Pfam definitions, we collapsed rare variants (Minor Allele Frequency ≤ 1%) together in three different ways: 1) variants within single genomic regions which map to individual protein domains; 2) variants within two individual protein domain regions which are predicted to be responsible for a protein-protein interaction; 3) all variants within combined regions from multiple genes responsible for coding the same protein domain (i.e. protein families). A conventional collapsing analysis using gene coordinates was also undertaken for comparison. We used UK10K sequence data and investigated associations between regions of variants and lipid traits using the sequence kernel association test (SKAT). We observed no strong evidence of association between regions of variants based on Pfam domain definitions and lipid traits. Quantile-Quantile plots illustrated that the overall distributions of p-values from the protein domain analyses were comparable to that of a conventional gene-based approach. Deviations from this distribution suggested that collapsing by either protein domain or gene definitions may be favourable depending on the trait analysed. We have collapsed rare variants together using protein domain and family coordinates to present an alternative approach over collapsing across conventionally used gene-based regions. Although no strong evidence of association was detected in these analyses, future studies may still find value in adopting these approaches to detect previously unidentified association signals.
View details for DOI 10.1371/journal.pone.0153803
View details for PubMedID 27128313
The landscape of genomic imprinting across diverse adult human tissues
2015; 25 (7): 927-936
Genomic imprinting is an important regulatory mechanism that silences one of the parental copies of a gene. To systematically characterize this phenomenon, we analyze tissue specificity of imprinting from allelic expression data in 1582 primary tissue samples from 178 individuals from the Genotype-Tissue Expression (GTEx) project. We characterize imprinting in 42 genes, including both novel and previously identified genes. Tissue specificity of imprinting is widespread, and gender-specific effects are revealed in a small number of genes in muscle with stronger imprinting in males. IGF2 shows maternal expression in the brain instead of the canonical paternal expression elsewhere. Imprinting appears to have only a subtle impact on tissue-specific expression levels, with genes lacking a systematic expression difference between tissues with imprinted and biallelic expression. In summary, our systematic characterization of imprinting in adult tissues highlights variation in imprinting between genes, individuals, and tissues.
View details for DOI 10.1101/gr.192278.115
View details for Web of Science ID 000357356900001
View details for PubMedID 25953952
View details for PubMedCentralID PMC4484390
- The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans SCIENCE 2015; 348 (6235): 648-660
Whole-genome sequencing to understand the genetic architecture of common gene expression and biomarker phenotypes
HUMAN MOLECULAR GENETICS
2015; 24 (5): 1504-1512
Initial results from sequencing studies suggest that there are relatively few low-frequency (<5%) variants associated with large effects on common phenotypes. We performed low-pass whole-genome sequencing in 680 individuals from the InCHIANTI study to test two primary hypotheses: (i) that sequencing would detect single low-frequency-large effect variants that explained similar amounts of phenotypic variance as single common variants, and (ii) that some common variant associations could be explained by low-frequency variants. We tested two sets of disease-related common phenotypes for which we had statistical power to detect large numbers of common variant-common phenotype associations: 11 132 cis-gene expression traits in 450 individuals and 93 circulating biomarkers in all 680 individuals. From a total of 11 657 229 high-quality variants of which 6 129 221 and 5 528 008 were common and low frequency (<5%), respectively, low frequency-large effect associations comprised 7% of detectable cis-gene expression traits [89 of 1314 cis-eQTLs at P < 1 × 10^-6 (false discovery rate ∼5%)] and one of eight biomarker associations at P < 8 × 10^-10. Very few (30 of 1232; 2%) common variant associations were fully explained by low-frequency variants. Our data show that whole-genome sequencing can identify low-frequency variants undetected by genotyping-based approaches when sample sizes are sufficiently large to detect substantial numbers of common variant associations, and that common variant associations are rarely explained by single low-frequency variants of large effect.
View details for DOI 10.1093/hmg/ddu560
View details for Web of Science ID 000350142800025
View details for PubMedID 25378555
- Exome sequencing identifies rare LDLR and APOA5 alleles conferring risk for myocardial infarction. Nature 2015; 518 (7537): 102-106
Identification and Functional Characterization of G6PC2 Coding Variants Influencing Glycemic Traits Define an Effector Transcript at the G6PC2-ABCB11 Locus
2015; 11 (1)
Genome-wide association studies (GWAS) for fasting glucose (FG) and insulin (FI) have identified common variant signals which explain 4.8% and 1.2% of trait variance, respectively. It is hypothesized that low-frequency and rare variants could contribute substantially to unexplained genetic variance. To test this, we analyzed exome-array data from up to 33,231 non-diabetic individuals of European ancestry. We found exome-wide significant (P < 5 × 10^-7) evidence for two loci not previously highlighted by common variant GWAS: GLP1R (p.Ala316Thr, minor allele frequency (MAF) = 1.5%) influencing FG levels, and URB2 (p.Glu594Val, MAF = 0.1%) influencing FI levels. Coding variant associations can highlight potential effector genes at (non-coding) GWAS signals. At the G6PC2/ABCB11 locus, we identified multiple coding variants in G6PC2 (p.Val219Leu, p.His177Tyr, and p.Tyr207Ser) influencing FG levels, conditionally independent of each other and the non-coding GWAS signal. In vitro assays demonstrate that these associated coding alleles result in reduced protein abundance via proteasomal degradation, establishing G6PC2 as an effector gene at this locus. Reconciliation of single-variant associations and functional effects was only possible when haplotype phase was considered. In contrast to earlier reports suggesting that, paradoxically, glucose-raising alleles at this locus are protective against type 2 diabetes (T2D), the p.Val219Leu G6PC2 variant displayed a modest but directionally consistent association with T2D risk. Coding variant associations for glycemic traits in GWAS signals highlight PCSK1, RREB1, and ZHX3 as likely effector transcripts. These coding variant association signals do not have a major impact on the trait variance explained, but they do provide valuable biological insights.
View details for DOI 10.1371/journal.pgen.1004876
View details for Web of Science ID 000349314600012
View details for PubMedID 25625282
Whole-Exome Sequencing Identifies Rare and Low-Frequency Coding Variants Associated with LDL Cholesterol.
American journal of human genetics
2014; 94 (2): 233-245
Elevated low-density lipoprotein cholesterol (LDL-C) is a treatable, heritable risk factor for cardiovascular disease. Genome-wide association studies (GWASs) have identified 157 variants associated with lipid levels but are not well suited to assess the impact of rare and low-frequency variants. To determine whether rare or low-frequency coding variants are associated with LDL-C, we exome sequenced 2,005 individuals, including 554 individuals selected for extreme LDL-C (>98th or <2nd percentile). Follow-up analyses included sequencing of 1,302 additional individuals and genotype-based analysis of 52,221 individuals. We observed significant evidence of association between LDL-C and the burden of rare or low-frequency variants in PNPLA5, encoding a phospholipase-domain-containing protein, and both known and previously unidentified variants in PCSK9, LDLR and APOB, three known lipid-related genes. The effect sizes for the burden of rare variants for each associated gene were substantially higher than those observed for individual SNPs identified from GWASs. We replicated the PNPLA5 signal in an independent large-scale sequencing study of 2,084 individuals. In conclusion, this large whole-exome-sequencing study for LDL-C identified a gene not known to be implicated in LDL-C and provides unique insight into the design and analysis of similar experiments.
View details for DOI 10.1016/j.ajhg.2014.01.010
View details for PubMedID 24507775
- Transcriptome and genome sequencing uncovers functional variation in humans NATURE 2013; 501 (7468): 506-511
Association Between Variants of PRDM1 and NDP52 and Crohn's Disease, Based on Exome Sequencing and Functional Studies
2013; 145 (2): 339-347
Genome-wide association studies (GWAS) have identified 140 Crohn's disease (CD) susceptibility loci. For most loci, the variants that cause disease are not known and the genes affected by these variants have not been identified. We aimed to identify variants that cause CD through detailed sequencing, genetic association, expression, and functional studies. We sequenced whole exomes of 42 unrelated subjects with CD and 5 healthy subjects (controls) and then filtered single nucleotide variants by incorporating association results from meta-analyses of CD GWAS and in silico mutation effect prediction algorithms. We then genotyped 9348 subjects with CD, 2868 subjects with ulcerative colitis, and 14,567 control subjects and associated variants analyzed in functional studies using materials from subjects and controls and in vitro model systems. We identified rare missense mutations in PR domain-containing 1 (PRDM1) and associated these with CD. These mutations increased proliferation of T cells and secretion of cytokines on activation and increased expression of the adhesion molecule L-selectin. A common CD risk allele, identified in GWAS, correlated with reduced expression of PRDM1 in ileal biopsy specimens and peripheral blood mononuclear cells (combined P = 1.6 × 10^-8). We identified an association between CD and a common missense variant, Val248Ala, in nuclear domain 10 protein 52 (NDP52) (P = 4.83 × 10^-9). We found that this variant impairs the regulatory functions of NDP52 to inhibit nuclear factor κB activation of genes that regulate inflammation and affect the stability of proteins in Toll-like receptor pathways. We have extended the results of GWAS and provide evidence that variants in PRDM1 and NDP52 determine susceptibility to CD. PRDM1 maps adjacent to a CD interval identified in GWAS and encodes a transcription factor expressed by T and B cells. NDP52 is an adaptor protein that functions in selective autophagy of intracellular bacteria and signaling molecules, supporting the role of autophagy in the pathogenesis of CD.
View details for DOI 10.1053/j.gastro.2013.04.040
View details for Web of Science ID 000322630600023
View details for PubMedID 23624108
Mosaic PPM1D mutations are associated with predisposition to breast and ovarian cancer.
2013; 493 (7432): 406-410
Improved sequencing technologies offer unprecedented opportunities for investigating the role of rare genetic variation in common disease. However, there are considerable challenges with respect to study design, data analysis and replication. Using pooled next-generation sequencing of 507 genes implicated in the repair of DNA in 1,150 samples, an analytical strategy focused on protein-truncating variants (PTVs) and a large-scale sequencing case-control replication experiment in 13,642 individuals, here we show that rare PTVs in the p53-inducible protein phosphatase PPM1D are associated with predisposition to breast cancer and ovarian cancer. PPM1D PTV mutations were present in 25 out of 7,781 cases versus 1 out of 5,861 controls (P = 1.12 × 10^-5), including 18 mutations in 6,912 individuals with breast cancer (P = 2.42 × 10^-4) and 12 mutations in 1,121 individuals with ovarian cancer (P = 3.10 × 10^-9). Notably, all of the identified PPM1D PTVs were mosaic in lymphocyte DNA and clustered within a 370-base-pair region in the final exon of the gene, carboxy-terminal to the phosphatase catalytic domain. Functional studies demonstrate that the mutations result in enhanced suppression of p53 in response to ionizing radiation exposure, suggesting that the mutant alleles encode hyperactive PPM1D isoforms. Thus, although the mutations cause premature protein truncation, they do not result in the simple loss-of-function effect typically associated with this class of variant, but instead probably have a gain-of-function effect. Our results have implications for the detection and management of breast and ovarian cancer risk. More generally, these data provide new insights into the role of rare and of mosaic genetic variants in common conditions, and the use of sequencing in their identification.
View details for DOI 10.1038/nature11725
View details for PubMedID 23242139
- Mosaic PPM1D mutations are associated with predisposition to breast and ovarian cancer NATURE 2013; 493 (7432): 406-U152
Rare, Low-Frequency, and Common Variants in the Protein-Coding Sequence of Biological Candidate Genes from GWASs Contribute to Risk of Rheumatoid Arthritis
AMERICAN JOURNAL OF HUMAN GENETICS
2013; 92 (1): 15-27
The extent to which variants in the protein-coding sequence of genes contribute to risk of rheumatoid arthritis (RA) is unknown. In this study, we addressed this issue by deep exon sequencing and large-scale genotyping of 25 biological candidate genes located within RA risk loci discovered by genome-wide association studies (GWASs). First, we assessed the contribution of rare coding variants in the 25 genes to the risk of RA in a pooled sequencing study of 500 RA cases and 650 controls of European ancestry. We observed an accumulation of rare nonsynonymous variants exclusive to RA cases in IL2RA and IL2RB (burden test: p = 0.007 and p = 0.018, respectively). Next, we assessed the aggregate contribution of low-frequency and common coding variants to the risk of RA by dense genotyping of the 25 gene loci in 10,609 RA cases and 35,605 controls. We observed a strong enrichment of coding variants with a nominal signal of association with RA (p < 0.05) after adjusting for the best signal of association at the loci (p_enrichment = 6.4 × 10^-4). For one locus containing CD2, we found that a missense variant, rs699738 (c.798C>A [p.His266Gln]), and a noncoding variant, rs624988, reside on distinct haplotypes and independently contribute to the risk of RA (p = 4.6 × 10^-6). Overall, our results indicate that variants (distributed across the allele-frequency spectrum) within the protein-coding portion of a subset of biological candidate genes identified by GWASs contribute to the risk of RA. Further, we have demonstrated that very large sample sizes will be required for comprehensively identifying the independent alleles contributing to the missing heritability of RA.
View details for DOI 10.1016/j.ajhg.2012.11.012
View details for Web of Science ID 000313759000002
View details for PubMedID 23261300
Pooled DNA Resequencing of 68 Myocardial Infarction Candidate Genes in French Canadians
2012; 5 (5): 547-554
Familial history is a strong risk factor for coronary artery disease (CAD), especially for early-onset myocardial infarction (MI). Several genes and chromosomal regions have been implicated in the genetic cause of coronary artery disease/MI, mostly through the discovery of familial mutations implicated in hyper-/hypocholesterolemia by linkage studies and single nucleotide polymorphisms by genome-wide association studies. Except for a few examples (eg, PCSK9), the role of low-frequency genetic variation (minor allele frequency [MAF] ≈0.1%-5%) on MI/coronary artery disease predisposition has not been extensively investigated. We selected 68 candidate genes and sequenced their exons (394 kb) in 500 early-onset MI cases and 500 matched controls, all of French-Canadian ancestry, using solution-based capture in pools of nonindexed DNA samples. In these regions, we identified 1852 single nucleotide variants (695 novel) and captured 85% of the variants with MAF≥1% found by the 1000 Genomes Project in European-ancestry individuals. Using gene-based association testing, we prioritized for follow-up 29 low-frequency variants in 8 genes and attempted to genotype them for replication in 1594 MI cases and 2988 controls from 2 French-Canadian panels. Our pilot association analysis of low-frequency variants in 68 candidate genes did not identify genes with large effect on MI risk in French Canadians. We have optimized a strategy, applicable to all complex diseases and traits, to discover efficiently and cost-effectively DNA sequence variants in large populations. Resequencing endeavors to find low-frequency variants implicated in common human diseases are likely to require very large sample size.
View details for DOI 10.1161/CIRCGENETICS.112.963165
View details for Web of Science ID 000309886500011
View details for PubMedID 22923420
Genetic Adaptation of Fatty-Acid Metabolism: A Human-Specific Haplotype Increasing the Biosynthesis of Long-Chain Omega-3 and Omega-6 Fatty Acids
AMERICAN JOURNAL OF HUMAN GENETICS
2012; 90 (5): 809-820
Omega-3 and omega-6 long-chain polyunsaturated fatty acids (LC-PUFAs) are essential for the development and function of the human brain. They can be obtained directly from food, e.g., fish, or synthesized from precursor molecules found in vegetable oils. To determine the importance of genetic variability to fatty-acid biosynthesis, we studied FADS1 and FADS2, which encode rate-limiting enzymes for fatty-acid conversion. We performed genome-wide genotyping (n = 5,652 individuals) and targeted resequencing (n = 960 individuals) of the FADS region in five European population cohorts. We also analyzed available genomic data from human populations, archaic hominins, and more distant primates. Our results show that present-day humans have two common FADS haplotypes, defined by 28 closely linked SNPs across 38.9 kb, that differ dramatically in their ability to generate LC-PUFAs. No independent effects on FADS activity were seen for rare SNPs detected by targeted resequencing. The more efficient, evolutionarily derived haplotype appeared after the lineage split leading to modern humans and Neanderthals and shows evidence of positive selection. This human-specific haplotype increases the efficiency of synthesizing essential long-chain fatty acids from precursors and thereby might have provided an advantage in environments with limited access to dietary LC-PUFAs. In the modern world, this haplotype has been associated with lifestyle-related diseases, such as coronary artery disease.
View details for DOI 10.1016/j.ajhg.2012.03.014
View details for Web of Science ID 000303907500005
View details for PubMedID 22503634
A map of human genome variation from population-scale sequencing
2010; 467 (7319): 1061-1073
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
View details for DOI 10.1038/nature09534
View details for Web of Science ID 000283548600039
View details for PubMedCentralID PMC3042601
High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency
2010; 42 (10): 851-?
Discovering the molecular basis of mitochondrial respiratory chain disease is challenging given the large number of both mitochondrial and nuclear genes that are involved. We report a strategy of focused candidate gene prediction, high-throughput sequencing and experimental validation to uncover the molecular basis of mitochondrial complex I disorders. We created seven pools of DNA from a cohort of 103 cases and 42 healthy controls and then performed deep sequencing of 103 candidate genes to identify 151 rare variants that were predicted to affect protein function. We established genetic diagnoses in 13 of 60 previously unsolved cases using confirmatory experiments, including cDNA complementation to show that mutations in NUBPL and FOXRED1 can cause complex I deficiency. Our study illustrates how large-scale sequencing, coupled with functional prediction and experimental validation, can be used to identify causal mutations in individual cases.
View details for DOI 10.1038/ng.659
View details for Web of Science ID 000282276600014
View details for PubMedID 20818383
Fine Mapping in 94 Inbred Mouse Strains Using a High-Density Haplotype Resource
2010; 185 (3): 1081-1095
The genetics of phenotypic variation in inbred mice has for nearly a century provided a primary weapon in the medical research arsenal. A catalog of the genetic variation among inbred mouse strains, however, is required to enable powerful positional cloning and association techniques. A recent whole-genome resequencing study of 15 inbred mouse strains captured a significant fraction of the genetic variation among a limited number of strains, yet the common use of hundreds of inbred strains in medical research motivates the need for a high-density variation map of a larger set of strains. Here we report a dense set of genotypes from 94 inbred mouse strains containing 10.77 million genotypes over 121,433 single nucleotide polymorphisms (SNPs), dispersed at 20-kb intervals on average across the genome, with an average concordance of 99.94% with previous SNP sets. Through pairwise comparisons of the strains, we identified an average of 4.70 distinct segments over 73 classical inbred strains in each region of the genome, suggesting limited genetic diversity between the strains. Combining these data with genotypes of 7570 gap-filling SNPs, we further imputed the untyped or missing genotypes of 94 strains over 8.27 million Perlegen SNPs. The imputation accuracy among classical inbred strains is estimated at 99.7% for the genotypes imputed with high confidence. We demonstrated the utility of these data in high-resolution linkage mapping through power simulations and statistical power analysis and provide guidelines for developing such studies. We also provide a resource of in silico association mapping between the complex traits deposited in the Mouse Phenome Database with our genotypes. We expect that these resources will facilitate effective designs of both human and mouse studies for dissecting the genetic basis of complex traits.
View details for DOI 10.1534/genetics.110.115014
View details for Web of Science ID 000281906800030
View details for PubMedID 20439770
Genetic Analysis of Human Traits In Vitro: Drug Response and Gene Expression in Lymphoblastoid Cell Lines
2008; 4 (11)
Lymphoblastoid cell lines (LCLs), originally collected as renewable sources of DNA, are now being used as a model system to study genotype-phenotype relationships in human cells, including searches for QTLs influencing levels of individual mRNAs and responses to drugs and radiation. In the course of attempting to map genes for drug response using 269 LCLs from the International HapMap Project, we evaluated the extent to which biological noise and non-genetic confounders contribute to trait variability in LCLs. While drug responses could be technically well measured on a given day, we observed significant day-to-day variability and substantial correlation to non-genetic confounders, such as baseline growth rates and metabolic state in culture. After correcting for these confounders, we were unable to detect any QTLs with genome-wide significance for drug response. A much higher proportion of variance in mRNA levels may be attributed to non-genetic factors (intra-individual variance, i.e., biological noise, levels of the EBV virus used to transform the cells, ATP levels) than to detectable eQTLs. Finally, in an attempt to improve power, we focused analysis on those genes that had both detectable eQTLs and correlation to drug response; we were unable to detect evidence that eQTL SNPs are convincingly associated with drug response in the model. While LCLs are a promising model for pharmacogenetic experiments, biological noise and in vitro artifacts may reduce power and have the potential to create spurious association due to confounding.
View details for DOI 10.1371/journal.pgen.1000287
View details for Web of Science ID 000261481000040
View details for PubMedID 19043577