PhD, MIT, Harvard-MIT Division of Health Sciences and Technology (2016)
MS, Stanford University, Bioengineering (2010)
BS, Stanford University, Biomedical Computation (2010)
- Biology and Applications of CRISPR/Cas9: Genome Editing and Epigenome Modifications
BIOS 268, GENE 268 (Spr)
- Current Issues in Genetics
GENE 219 (Spr)
- Independent Studies (4)
Graduate and Fellowship Programs
Compatibility rules of human enhancer and promoter sequences.
Gene regulation in the human genome is controlled by distal enhancers that activate specific nearby promoters1. One model for this specificity is that promoters might have sequence-encoded preferences for certain enhancers, for example mediated by interacting sets of transcription factors or cofactors2. This "biochemical compatibility" model has been supported by observations at individual human promoters and by genome-wide measurements in Drosophila3-9. However, the degree to which human enhancers and promoters are intrinsically compatible has not been systematically measured, and how their activities combine to control RNA expression remains unclear. Here we designed a high-throughput reporter assay called ExP STARR-seq (enhancer x promoter self-transcribing active regulatory region sequencing) and applied it to examine the combinatorial compatibilities of 1,000 enhancer and 1,000 promoter sequences in human K562 cells. We identify simple rules for enhancer-promoter compatibility: most enhancers activated all promoters by similar amounts, and intrinsic enhancer and promoter activities combine multiplicatively to determine RNA output (R² = 0.82). In addition, two classes of enhancers and promoters showed subtle preferential effects. Promoters of housekeeping genes contained built-in activating motifs for factors such as GABPA and YY1, which decreased the responsiveness of promoters to distal enhancers. Promoters of variably expressed genes lacked these motifs and showed stronger responsiveness to enhancers. Together, this systematic assessment of enhancer-promoter compatibility suggests a multiplicative model tuned by enhancer and promoter class to control gene transcription in the human genome.
View details for DOI 10.1038/s41586-022-04877-w
View details for PubMedID 35594906
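The multiplicative rule described in the abstract above can be illustrated with a minimal sketch. This is not the authors' analysis code: the element names, the activity values, and the function `predicted_expression` are hypothetical, chosen only to show what "combine multiplicatively" implies.

```python
import math

# Hypothetical intrinsic activities (arbitrary units), for illustration only.
enhancers = {"E1": 2.0, "E2": 8.0}
promoters = {"P1": 0.5, "P2": 4.0}


def predicted_expression(enhancer_activity, promoter_activity):
    """Multiplicative model: output = enhancer activity x promoter activity."""
    return enhancer_activity * promoter_activity


# Predicted output for every enhancer-promoter pair.
table = {
    (e, p): predicted_expression(ea, pa)
    for e, ea in enhancers.items()
    for p, pa in promoters.items()
}

# Under this model, swapping E1 for E2 changes output by the same fold
# regardless of which promoter the enhancer is paired with.
fold_p1 = table[("E2", "P1")] / table[("E1", "P1")]
fold_p2 = table[("E2", "P2")] / table[("E1", "P2")]

# Equivalently, the model is additive in log space.
log_sum = math.log(enhancers["E2"]) + math.log(promoters["P2"])
log_out = math.log(table[("E2", "P2")])
```

A purely multiplicative model of this kind has no enhancer-promoter specificity term. The class-specific effects mentioned in the abstract would appear as deviations from it.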
Genome-wide enhancer maps link risk variants to disease genes.
Genome-wide association studies (GWAS) have identified thousands of noncoding loci that are associated with human diseases and complex traits, each of which could reveal insights into the mechanisms of disease1. Many of the underlying causal variants may affect enhancers2,3, but we lack accurate maps of enhancers and their target genes to interpret such variants. We recently developed the activity-by-contact (ABC) model to predict which enhancers regulate which genes and validated the model using CRISPR perturbations in several cell types4. Here we apply this ABC model to create enhancer-gene maps in 131 human cell types and tissues, and use these maps to interpret the functions of GWAS variants. Across 72 diseases and complex traits, ABC links 5,036 GWAS signals to 2,249 unique genes, including a class of 577 genes that appear to influence multiple phenotypes through variants in enhancers that act in different cell types. In inflammatory bowel disease (IBD), causal variants are enriched in predicted enhancers by more than 20-fold in particular cell types such as dendritic cells, and ABC achieves higher precision than other regulatory methods at connecting noncoding variants to target genes. These variant-to-function maps reveal an enhancer that contains an IBD risk variant and that regulates the expression of PPIF to alter the membrane potential of mitochondria in macrophages. Our study reveals principles of genome regulation, identifies genes that affect IBD and provides a resource and generalizable strategy to connect risk variants of common diseases to their molecular and cellular functions.
View details for DOI 10.1038/s41586-021-03446-x
View details for PubMedID 33828297
HyPR-seq: Single-cell quantification of chosen RNAs via hybridization and sequencing of DNA probes.
Proceedings of the National Academy of Sciences of the United States of America
2020; 117 (52): 33404–13
Single-cell quantification of RNAs is important for understanding cellular heterogeneity and gene regulation, yet current approaches suffer from low sensitivity for individual transcripts, limiting their utility for many applications. Here we present Hybridization of Probes to RNA for sequencing (HyPR-seq), a method to sensitively quantify the expression of hundreds of chosen genes in single cells. HyPR-seq involves hybridizing DNA probes to RNA, distributing cells into nanoliter droplets, amplifying the probes with PCR, and sequencing the amplicons to quantify the expression of chosen genes. HyPR-seq achieves high sensitivity for individual transcripts, detects nonpolyadenylated and low-abundance transcripts, and can profile more than 100,000 single cells. We demonstrate how HyPR-seq can profile the effects of CRISPR perturbations in pooled screens, detect time-resolved changes in gene expression via measurements of gene introns, and detect rare transcripts and quantify cell-type frequencies in tissue using low-abundance marker genes. By directing sequencing power to genes of interest and sensitively quantifying individual transcripts, HyPR-seq reduces costs by up to 100-fold compared to whole-transcriptome single-cell RNA-sequencing, making HyPR-seq a powerful method for targeted RNA profiling in single cells.
View details for DOI 10.1073/pnas.2010738117
View details for PubMedID 33376219
Activity-by-contact model of enhancer-promoter regulation from thousands of CRISPR perturbations
2019; 51 (12): 1664-+
Enhancer elements in the human genome control how genes are expressed in specific cell types and harbor thousands of genetic variants that influence risk for common diseases1-4. Yet, we still do not know how enhancers regulate specific genes, and we lack general rules to predict enhancer-gene connections across cell types5,6. We developed an experimental approach, CRISPRi-FlowFISH, to perturb enhancers in the genome, and we applied it to test >3,500 potential enhancer-gene connections for 30 genes. We found that a simple activity-by-contact model substantially outperformed previous methods at predicting the complex connections in our CRISPR dataset. This activity-by-contact model allows us to construct genome-wide maps of enhancer-gene connections in a given cell type, on the basis of chromatin state measurements. Together, CRISPRi-FlowFISH and the activity-by-contact model provide a systematic approach to map and predict which enhancers regulate which genes, and will help to interpret the functions of the thousands of disease risk variants in the noncoding genome.
View details for DOI 10.1038/s41588-019-0538-0
View details for Web of Science ID 000499696700003
View details for PubMedID 31784727
View details for PubMedCentralID PMC6886585
Local regulation of gene expression by lncRNA promoters, transcription and splicing
2016; 539 (7629): 452–55
Mammalian genomes are pervasively transcribed to produce thousands of long non-coding RNAs (lncRNAs). A few of these lncRNAs have been shown to recruit regulatory complexes through RNA-protein interactions to influence the expression of nearby genes, and it has been suggested that many other lncRNAs can also act as local regulators. Such local functions could explain the observation that lncRNA expression is often correlated with the expression of nearby genes. However, these correlations have been challenging to dissect and could alternatively result from processes that are not mediated by the lncRNA transcripts themselves. For example, some gene promoters have been proposed to have dual functions as enhancers, and the process of transcription itself may contribute to gene regulation by recruiting activating factors or remodelling nucleosomes. Here we use genetic manipulation in mouse cell lines to dissect 12 genomic loci that produce lncRNAs and find that 5 of these loci influence the expression of a neighbouring gene in cis. Notably, none of these effects requires the specific lncRNA transcripts themselves and instead involves general processes associated with their production, including enhancer-like activity of gene promoters, the process of transcription, and the splicing of the transcript. Furthermore, such effects are not limited to lncRNA loci: we find that four out of six protein-coding loci also influence the expression of a neighbour. These results demonstrate that cross-talk among neighbouring genes is a prevalent phenomenon that can involve multiple mechanisms and cis-regulatory signals, including a role for RNA splice sites. These mechanisms may explain the function and evolution of some genomic loci that produce lncRNAs and broadly contribute to the regulation of both coding and non-coding genes.
View details for DOI 10.1038/nature20149
View details for Web of Science ID 000388161700059
View details for PubMedID 27783602
View details for PubMedCentralID PMC6853796
Systematic mapping of functional enhancer-promoter connections with CRISPR interference
2016; 354 (6313): 769–73
Gene expression in mammals is regulated by noncoding elements that can affect physiology and disease, yet the functions and target genes of most noncoding elements remain unknown. We present a high-throughput approach that uses clustered regularly interspaced short palindromic repeats (CRISPR) interference (CRISPRi) to discover regulatory elements and identify their target genes. We assess >1 megabase of sequence in the vicinity of two essential transcription factors, MYC and GATA1, and identify nine distal enhancers that control gene expression and cellular proliferation. Quantitative features of chromatin state and chromosome conformation distinguish the seven enhancers that regulate MYC from other elements that do not, suggesting a strategy for predicting enhancer-promoter connectivity. This CRISPRi-based approach can be applied to dissect transcriptional networks and interpret the contributions of noncoding genetic variation to human disease.
View details for DOI 10.1126/science.aag2445
View details for Web of Science ID 000387326300042
View details for PubMedID 27708057
View details for PubMedCentralID PMC5438575
The Xist lncRNA Exploits Three-Dimensional Genome Architecture to Spread Across the X Chromosome
2013; 341 (6147): 767-+
Many large noncoding RNAs (lncRNAs) regulate chromatin, but the mechanisms by which they localize to genomic targets remain unexplored. We investigated the localization mechanisms of the Xist lncRNA during X-chromosome inactivation (XCI), a paradigm of lncRNA-mediated chromatin regulation. During the maintenance of XCI, Xist binds broadly across the X chromosome. During initiation of XCI, Xist initially transfers to distal regions across the X chromosome that are not defined by specific sequences. Instead, Xist identifies these regions by exploiting the three-dimensional conformation of the X chromosome. Xist requires its silencing domain to spread across actively transcribed regions and thereby access the entire chromosome. These findings suggest a model in which Xist coats the X chromosome by searching in three dimensions, modifying chromosome structure, and spreading to newly accessible locations.
View details for DOI 10.1126/science.1237973
View details for Web of Science ID 000323122200041
View details for PubMedID 23828888
View details for PubMedCentralID PMC3778663
Computational estimates of annular diameter reveal genetic determinants of mitral valve function and disease.
2022; 7 (3)
The fibrous annulus of the mitral valve plays an important role in valvular function and cardiac physiology, while normal variation in the size of cardiovascular anatomy may share a genetic link with common and rare disease. We derived automated estimates of mitral valve annular diameter in the 4-chamber view from 32,220 MRI images from the UK Biobank at ventricular systole and diastole as the basis for GWAS. Mitral annular dimensions corresponded to previously described anatomical norms, and GWAS inclusive of 4 population strata identified 10 loci, including possibly novel loci (GOSR2, ERBB4, MCTP2, MCPH1) and genes related to cardiac contractility (BAG3, TTN, RBFOX1). ATAC-Seq of primary mitral valve tissue localized multiple variants to regions of open chromatin in biologically relevant cell types and rs17608766 to an algorithmically predicted enhancer element in GOSR2. We observed strong genetic correlation with measures of contractility and mitral valve disease and clinical correlations with heart failure, cerebrovascular disease, and ventricular arrhythmias. Polygenic scoring of mitral valve annular diameter in systole was predictive of risk of mitral valve prolapse across 4 cohorts. In summary, genetic and clinical studies of mitral valve annular diameter revealed genetic determinants of mitral valve biology, while highlighting clinical associations. Polygenic determinants of mitral valve annular diameter may represent an independent risk factor for mitral prolapse. Overall, computationally estimated phenotypes derived at scale from medical imaging represent an important substrate for genetic discovery and clinical risk prediction.
View details for DOI 10.1172/jci.insight.146580
View details for PubMedID 35132965
Systematic identification of genomic elements that regulate FCGR2A expression and harbor variants linked with autoimmune disease.
Human molecular genetics
BACKGROUND: FCGR2A binds antibody-antigen complexes to regulate the abundance of circulating and deposited complexes along with downstream immune and autoimmune responses. While the abundance of FCGR2A may be critical in immune-mediated diseases, little is known about whether its surface expression is regulated through cis genomic elements and non-coding variants. In the current study, we aimed to characterize the regulation of FCGR2A expression, the impact of genetic variation and its association with autoimmune disease. METHODS: We applied CRISPR-based interference and editing to scrutinize 1.7 Mb of open chromatin surrounding the FCGR2A gene to identify regulatory elements. Relevant transcription factors binding to these regions were defined through public databases. Genetic variants affecting regulation were identified using luciferase reporter assays and were verified in a cohort of 1996 genotyped healthy individuals using flow cytometry. RESULTS: We identified a complex proximal region and five distal enhancers regulating FCGR2A. The proximal region split into subregions upstream and downstream of the transcription start site, was enriched in binding of inflammation-regulated transcription factors, and harbored a variant associated with FCGR2A expression in primary myeloid cells. One distal enhancer region was occupied by CCCTC-binding factor (CTCF) whose binding site was disrupted by a rare genetic variant, altering gene expression. CONCLUSIONS: The FCGR2A gene is regulated by multiple proximal and distal genomic regions, with links to autoimmune disease. These findings may open up novel therapeutic avenues where fine-tuning of FCGR2A levels may constitute a part of treatment strategies for immune-mediated diseases.
View details for DOI 10.1093/hmg/ddab372
View details for PubMedID 34970970
COVID-19 tissue atlases reveal SARS-CoV-2 pathology and cellular targets.
COVID-19, caused by SARS-CoV-2, can result in acute respiratory distress syndrome and multiple-organ failure1-4, but little is known about its pathophysiology. Here, we generated single-cell atlases of 23 lung, 16 kidney, 16 liver and 19 heart COVID-19 autopsy donor tissue samples, and spatial atlases of 14 lung donors. Integrated computational analysis uncovered substantial remodeling in the lung epithelial, immune and stromal compartments, with evidence of multiple paths of failed tissue regeneration, including defective alveolar type 2 differentiation and expansion of fibroblasts and putative TP63+ intrapulmonary basal-like progenitor cells. Viral RNAs were enriched in mononuclear phagocytic and endothelial lung cells which induced specific host programs. Spatial analysis in lung distinguished inflammatory host responses in lung regions with and without viral RNA. Analysis of the other tissue atlases showed transcriptional alterations in multiple cell types in COVID-19 donor heart tissue, and mapped cell types and genes implicated with disease severity based on COVID-19 GWAS. Our foundational dataset elucidates the biological impact of severe SARS-CoV-2 infection across the body, a key step towards new treatments.
View details for DOI 10.1038/s41586-021-03570-8
View details for PubMedID 33915569
- Inherited causes of clonal haematopoiesis in 97,691 whole genomes (vol 586, pg 763, 2020) Nature 2021; 591 (7851): E27
- Author Correction: Inherited causes of clonal haematopoiesis in 97,691 whole genomes. Nature 2021
Activity-dependent regulome of human GABAergic neurons reveals new patterns of gene regulation and neurological disease heritability.
Neuronal activity-dependent gene expression is essential for brain development. Although transcriptional and epigenetic effects of neuronal activity have been explored in mice, such an investigation is lacking in humans. Because alterations in GABAergic neuronal circuits are implicated in neurological disorders, we conducted a comprehensive activity-dependent transcriptional and epigenetic profiling of human induced pluripotent stem cell-derived GABAergic neurons similar to those of the early developing striatum. We identified genes whose expression is inducible after membrane depolarization, some of which have specifically evolved in primates and/or are associated with neurological diseases, including schizophrenia and autism spectrum disorder (ASD). We define the genome-wide profile of human neuronal activity-dependent enhancers, promoters and the transcription factors CREB and CRTC1. We found significant heritability enrichment for ASD in the inducible promoters. Our results suggest that sequence variation within activity-inducible promoters of developing human forebrain GABAergic neurons contributes to ASD risk.
View details for DOI 10.1038/s41593-020-00786-1
View details for PubMedID 33542524
Inherited causes of clonal haematopoiesis in 97,691 whole genomes.
Age is the dominant risk factor for most chronic human diseases, but the mechanisms through which ageing confers this risk are largely unknown1. The age-related acquisition of somatic mutations that lead to clonal expansion in regenerating haematopoietic stem cell populations has recently been associated with both haematological cancer2-4 and coronary heart disease5; this phenomenon is termed clonal haematopoiesis of indeterminate potential (CHIP)6. Simultaneous analyses of germline and somatic whole-genome sequences provide the opportunity to identify root causes of CHIP. Here we analyse high-coverage whole-genome sequences from 97,691 participants of diverse ancestries in the National Heart, Lung, and Blood Institute Trans-omics for Precision Medicine (TOPMed) programme, and identify 4,229 individuals with CHIP. We identify associations with blood cell, lipid and inflammatory traits that are specific to different CHIP driver genes. Association of a genome-wide set of germline genetic variants enabled the identification of three genetic loci associated with CHIP status, including one locus at TET2 that was specific to individuals of African ancestry. In silico-informed in vitro evaluation of the TET2 germline locus enabled the identification of a causal variant that disrupts a TET2 distal enhancer, resulting in increased self-renewal of haematopoietic stem cells. Overall, we observe that germline genetic variation shapes haematopoietic stem cell function, leading to CHIP through mechanisms that are specific to clonal haematopoiesis as well as shared mechanisms that lead to somatic mutations across tissues.
View details for DOI 10.1038/s41586-020-2819-2
View details for PubMedID 33057201
- Publisher Correction: Deep coverage whole genome sequences and plasma lipoprotein(a) in individuals of European and African ancestries. Nature communications 2020; 11 (1): 1715
Prioritizing disease and trait causal variants at the TNFAIP3 locus using functional and genomic features
2020; 11 (1): 1237
Genome-wide association studies have associated thousands of genetic variants with complex traits and diseases, but pinpointing the causal variant(s) among those in tight linkage disequilibrium with each associated variant remains a major challenge. Here, we use seven experimental assays to characterize all common variants at the multiple disease-associated TNFAIP3 locus in five disease-relevant immune cell lines, based on a set of features related to regulatory potential. Trait/disease-associated variants are enriched among SNPs prioritized based on either: (1) residing within CRISPRi-sensitive regulatory regions, or (2) localizing in a chromatin accessible region while displaying allele-specific reporter activity. Of the 15 trait/disease-associated haplotypes at TNFAIP3, 9 have at least one variant meeting one or both of these criteria, 5 of which are further supported by genetic fine-mapping. Our work provides a comprehensive strategy to characterize genetic variation at important disease-associated loci, and aids in the effort to identify trait causal genetic variants.
View details for DOI 10.1038/s41467-020-15022-4
View details for Web of Science ID 000549162600014
View details for PubMedID 32144282
View details for PubMedCentralID PMC7060350
Functional disease architectures reveal unique biological role of transposable elements
2019; 10: 4054
Transposable elements (TE) comprise roughly half of the human genome. Though initially derided as junk DNA, they have been widely hypothesized to contribute to the evolution of gene regulation. However, the contribution of TE to the genetic architecture of diseases remains unknown. Here, we analyze data from 41 independent diseases and complex traits to draw three conclusions. First, TE are uniquely informative for disease heritability. Despite overall depletion for heritability (54% of SNPs, 39 ± 2% of heritability), TE explain substantially more heritability than expected based on their depletion for known functional annotations. This implies that TE acquire function in ways that differ from known functional annotations. Second, older TE contribute more to disease heritability, consistent with acquiring biological function. Third, Short Interspersed Nuclear Elements (SINE) are far more enriched for blood traits than for other traits. Our results can help elucidate the biological roles that TE play in the genetic architecture of diseases.
View details for DOI 10.1038/s41467-019-11957-5
View details for Web of Science ID 000484599900004
View details for PubMedID 31492842
View details for PubMedCentralID PMC6731302
CRISPR Tools for Systematic Studies of RNA Regulation
Cold Spring Harbor Perspectives in Biology
2019; 11 (8)
RNA molecules perform diverse functions in mammalian cells, including transferring genetic information from DNA to protein and playing diverse regulatory roles through interactions with other cellular components. Here, we discuss how clustered regularly interspaced short palindromic repeat (CRISPR)-based technologies for directed perturbations of DNA and RNA are revealing new insights into RNA regulation. First, we review the fundamentals of CRISPR-Cas enzymes and functional genomics tools that leverage these systems. Second, we explore how these new perturbation technologies are transforming the study of regulation of and by RNA, focusing on the functions of DNA regulatory elements and long noncoding RNAs (lncRNAs). Third, we highlight an emerging class of RNA-targeting CRISPR-Cas enzymes that have the potential to catalyze studies of RNA biology by providing tools to directly perturb or measure RNA modifications and functions. Together, these tools enable systematic studies of RNA function and regulation in mammalian cells.
View details for DOI 10.1101/cshperspect.a035386
View details for Web of Science ID 000482756900008
View details for PubMedID 31371352
View details for PubMedCentralID PMC6671937
Discovering metabolic disease gene interactions by correlated effects on cellular morphology
2019; 24: 108–19
Impaired expansion of peripheral fat contributes to the pathogenesis of insulin resistance and Type 2 Diabetes (T2D). We aimed to identify novel disease-gene interactions during adipocyte differentiation. Genes in disease-associated loci for T2D, adiposity and insulin resistance were ranked according to expression in human adipocytes. The top 125 genes were ablated in human pre-adipocytes via CRISPR/Cas9 and the resulting cellular phenotypes quantified during adipocyte differentiation with high-content microscopy and automated image analysis. Morphometric measurements were extracted from all images and used to construct morphologic profiles for each gene. Over 107 morphometric measurements were obtained. Clustering of the morphologic profiles across all genes revealed a group of 14 genes characterized by decreased lipid accumulation, and enriched for known lipodystrophy genes. For two lipodystrophy genes, BSCL2 and AGPAT2, sub-clusters with PLIN1 and CEBPA identified by morphological similarity were validated by independent experiments as novel protein-protein and gene regulatory interactions. A morphometric approach in adipocytes can resolve multiple cellular mechanisms for metabolic disease loci; this approach enables mechanistic interrogation of the hundreds of metabolic disease loci whose function still remains unknown.
View details for DOI 10.1016/j.molmet.2019.03.001
View details for Web of Science ID 000468472300008
View details for PubMedID 30940487
View details for PubMedCentralID PMC6531784
Gene-centric functional dissection of human genetic variation uncovers regulators of hematopoiesis
Genome-wide association studies (GWAS) have identified thousands of variants associated with human diseases and traits. However, the majority of GWAS-implicated variants are in non-coding regions of the genome and require in depth follow-up to identify target genes and decipher biological mechanisms. Here, rather than focusing on causal variants, we have undertaken a pooled loss-of-function screen in primary hematopoietic cells to interrogate 389 candidate genes contained in 75 loci associated with red blood cell traits. Using this approach, we identify 77 genes at 38 GWAS loci, with most loci harboring 1-2 candidate genes. Importantly, the hit set was strongly enriched for genes validated through orthogonal genetic approaches. Genes identified by this approach are enriched in specific and relevant biological pathways, allowing regulators of human erythropoiesis and modifiers of blood diseases to be defined. More generally, this functional screen provides a paradigm for gene-centric follow up of GWAS for a variety of human diseases and traits.
View details for DOI 10.7554/eLife.44080
View details for Web of Science ID 000468967900001
View details for PubMedID 31070582
View details for PubMedCentralID PMC6534380
- CRISPR-SURF: discovering regulatory elements by deconvolution of CRISPR tiling screen data Nature Methods 2018; 15 (12): 992-+
- The NORAD lncRNA assembles a topoisomerase complex critical for genome stability (vol 561, pg 132, 2018) Nature 2018; 563 (7733): E32
The NORAD lncRNA assembles a topoisomerase complex critical for genome stability
2018; 561 (7721): 132-+
The human genome contains thousands of long non-coding RNAs1, but specific biological functions and biochemical mechanisms have been discovered for only about a dozen2-7. A specific long non-coding RNA, non-coding RNA activated by DNA damage (NORAD), has recently been shown to be required for maintaining genomic stability8, but its molecular mechanism is unknown. Here we combine RNA antisense purification and quantitative mass spectrometry to identify proteins that directly interact with NORAD in living cells. We show that NORAD interacts with proteins involved in DNA replication and repair in steady-state cells and localizes to the nucleus upon stimulation with replication stress or DNA damage. In particular, NORAD interacts with RBMX, a component of the DNA-damage response, and contains the strongest RBMX-binding site in the transcriptome. We demonstrate that NORAD controls the ability of RBMX to assemble a ribonucleoprotein complex, which we term NORAD-activated ribonucleoprotein complex 1 (NARC1), that contains the known suppressors of genomic instability topoisomerase I (TOP1), ALYREF and the PRPF19-CDC5L complex. Cells depleted for NORAD or RBMX display an increased frequency of chromosome segregation defects, reduced replication-fork velocity and altered cell-cycle progression, which represent phenotypes that are mechanistically linked to TOP1 and PRPF19-CDC5L function. Expression of NORAD in trans can rescue defects caused by NORAD depletion, but rescue is significantly impaired when the RBMX-binding site in NORAD is deleted. Our results demonstrate that the interaction between NORAD and RBMX is important for NORAD function, and that NORAD is required for the assembly of the previously unknown topoisomerase complex NARC1, which contributes to maintaining genomic stability. In addition, we uncover a previously unknown function for long non-coding RNAs in modulating the ability of an RNA-binding protein to assemble a higher-order ribonucleoprotein complex.
View details for DOI 10.1038/s41586-018-0453-z
View details for Web of Science ID 000443755200046
View details for PubMedID 30150775
Deep coverage whole genome sequences and plasma lipoprotein(a) in individuals of European and African ancestries (vol 9, 2606, 2018)
2018; 9: 3493
The original version of this article contained an error in the name of the author Ramachandran S. Vasan, which was incorrectly given as Vasan S. Ramachandran. This has now been corrected in both the PDF and HTML versions of the article.
View details for DOI 10.1038/s41467-018-05975-y
View details for Web of Science ID 000442522400001
View details for PubMedID 30140049
View details for PubMedCentralID PMC6107495
Positional specificity of different transcription factor classes within enhancers
Proceedings of the National Academy of Sciences of the United States of America
2018; 115 (30): E7222–E7230
Gene expression is controlled by sequence-specific transcription factors (TFs), which bind to regulatory sequences in DNA. TF binding occurs in nucleosome-depleted regions of DNA (NDRs), which generally encompass regions with lengths similar to those protected by nucleosomes. However, less is known about where within these regions specific TFs tend to be found. Here, we characterize the positional bias of inferred binding sites for 103 TFs within ∼500,000 NDRs across 47 cell types. We find that distinct classes of TFs display different binding preferences: Some tend to have binding sites toward the edges, some toward the center, and some at other positions within the NDR. These patterns are highly consistent across cell types, suggesting that they may reflect TF-specific intrinsic structural or functional characteristics. In particular, TF classes with binding sites at NDR edges are enriched for those known to interact with histones and chromatin remodelers, whereas TFs with central enrichment interact with other TFs and cofactors such as p300. Our results suggest distinct regiospecific binding patterns and functions of TF classes within enhancers.
View details for DOI 10.1073/pnas.1804663115
View details for Web of Science ID 000439574700030
View details for PubMedID 29987030
View details for PubMedCentralID PMC6065035
Ribosome Levels Selectively Regulate Translation and Lineage Commitment in Human Hematopoiesis
2018; 173 (1): 90-+
Blood cell formation is classically thought to occur through a hierarchical differentiation process, although recent studies have shown that lineage commitment may occur earlier in hematopoietic stem and progenitor cells (HSPCs). The relevance to human blood diseases and the underlying regulation of these refined models remain poorly understood. By studying a genetic blood disorder, Diamond-Blackfan anemia (DBA), where the majority of mutations affect ribosomal proteins and the erythroid lineage is selectively perturbed, we are able to gain mechanistic insight into how lineage commitment is programmed normally and disrupted in disease. We show that in DBA, the pool of available ribosomes is limited, while ribosome composition remains constant. Surprisingly, this global reduction in ribosome levels more profoundly alters translation of a select subset of transcripts. We show how the reduced translation of select transcripts in HSPCs can impair erythroid lineage commitment, illuminating a regulatory role for ribosome levels in cellular differentiation.
View details for DOI 10.1016/j.cell.2018.02.036
View details for Web of Science ID 000428234200010
View details for PubMedID 29551269
View details for PubMedCentralID PMC5866246
Deep coverage whole genome sequences and plasma lipoprotein(a) in individuals of European and African ancestries.
2018; 9 (1): 2606
Lipoprotein(a), Lp(a), is a modified low-density lipoprotein particle that contains apolipoprotein(a), encoded by LPA, and is a highly heritable, causal risk factor for cardiovascular diseases that varies in concentrations across ancestries. Here, we use deep-coverage whole genome sequencing in 8392 individuals of European and African ancestry to discover and interpret both single-nucleotide variants and copy number (CN) variation associated with Lp(a). We observe that the genetic determinants of Lp(a) differ between Europeans and Africans, with several determinants unique to one ancestry. The common variant rs12740374 associated with Lp(a) cholesterol is an eQTL for SORT1 and independent of LDL cholesterol. Observed associations of aggregates of rare non-coding variants are largely explained by LPA structural variation, namely the LPA kringle IV 2 (KIV2)-CN. Finally, we find that LPA risk genotypes confer greater relative risk for incident atherosclerotic cardiovascular diseases compared to directly measured Lp(a), and are significantly associated with measures of subclinical atherosclerosis in African Americans.
View details for DOI 10.1038/s41467-018-04668-w
View details for PubMedID 29973585
View details for PubMedCentralID PMC6031652
Deep-coverage whole genome sequences and blood lipids among 16,324 individuals.
2018; 9 (1): 3391
Large-scale deep-coverage whole-genome sequencing (WGS) is now feasible and offers potential advantages for locus discovery. We perform WGS in 16,324 participants from four ancestries at mean depth >29X and analyze genotypes with four quantitative traits: plasma total cholesterol, low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol, and triglycerides. Common variant association yields known loci except for a few variants previously poorly imputed. Rare coding variant association yields known Mendelian dyslipidemia genes, but rare non-coding variant association detects no signals. A high 2M-SNP LDL-C polygenic score (top 5th percentile) confers similar effect size to a monogenic mutation (~30 mg/dl higher for each); however, among those with severe hypercholesterolemia, 23% have a high polygenic score and only 2% carry a monogenic mutation. At these sample sizes and for these phenotypes, the incremental value of WGS for discovery is limited, but WGS permits simultaneous assessment of monogenic and polygenic models of severe hypercholesterolemia.
View details for DOI 10.1038/s41467-018-05747-8
View details for PubMedID 30140000
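A polygenic score of the kind mentioned in the abstract above is, in its simplest form, a weighted sum of allele dosages. The sketch below assumes that additive form; the variant identifiers, effect sizes, and the integer-percent cutoff rule are hypothetical and are not taken from the study.

```python
def polygenic_score(dosages, effect_sizes):
    """Sum effect_size * dosage over variants present in both mappings.

    dosages: variant id -> number of effect alleles carried (0, 1 or 2).
    effect_sizes: variant id -> per-allele effect (hypothetical values).
    """
    return sum(effect_sizes[v] * d for v, d in dosages.items() if v in effect_sizes)


def high_score_cutoff(scores, top_percent=5):
    """Return the lowest score that still falls in the top `top_percent` percent.

    Uses integer arithmetic so the cutoff index is exact; at least one
    score is always counted as "high". This rule is an assumption.
    """
    if not scores:
        raise ValueError("at least one score is required")
    ordered = sorted(scores)
    n_top = max(1, len(ordered) * top_percent // 100)
    return ordered[-n_top]
```

With 100 individuals and the default setting, the five highest scores would be flagged as high under this rule.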
Genome-scale activation screen identifies a lncRNA locus regulating a gene neighbourhood
2017; 548 (7667): 343-+
Mammalian genomes contain thousands of loci that transcribe long noncoding RNAs (lncRNAs), some of which are known to carry out critical roles in diverse cellular processes through a variety of mechanisms. Although some lncRNA loci encode RNAs that act non-locally (in trans), there is emerging evidence that many lncRNA loci act locally (in cis) to regulate the expression of nearby genes, for example through functions of the lncRNA promoter, transcription, or transcript itself. Despite their potentially important roles, it remains challenging to identify functional lncRNA loci and distinguish among these and other mechanisms. Here, to address these challenges, we developed a genome-scale CRISPR-Cas9 activation screen that targets more than 10,000 lncRNA transcriptional start sites to identify noncoding loci that influence a phenotype of interest. We found 11 lncRNA loci that, upon recruitment of an activator, mediate resistance to BRAF inhibitors in human melanoma cells. Most candidate loci appear to regulate nearby genes. Detailed analysis of one candidate, termed EMICERI, revealed that its transcriptional activation resulted in dosage-dependent activation of four neighbouring protein-coding genes, one of which confers the resistance phenotype. Our screening and characterization approach provides a CRISPR toolkit with which to systematically discover the functions of noncoding loci and elucidate their diverse roles in gene regulation and cellular function.
View details for DOI 10.1038/nature23451
View details for Web of Science ID 000407748400035
View details for PubMedID 28792927
View details for PubMedCentralID PMC5706657
A Genetic Variant Associated with Five Vascular Diseases Is a Distal Regulator of Endothelin-1 Gene Expression
2017; 170 (3): 522-+
Genome-wide association studies (GWASs) implicate the PHACTR1 locus (6p24) in risk for five vascular diseases, including coronary artery disease, migraine headache, cervical artery dissection, fibromuscular dysplasia, and hypertension. Through genetic fine mapping, we prioritized rs9349379, a common SNP in the third intron of the PHACTR1 gene, as the putative causal variant. Epigenomic data from human tissue revealed an enhancer signature at rs9349379 exclusively in aorta, suggesting a regulatory function for this SNP in the vasculature. CRISPR-edited stem cell-derived endothelial cells demonstrate rs9349379 regulates expression of endothelin 1 (EDN1), a gene located 600 kb upstream of PHACTR1. The known physiologic effects of EDN1 on the vasculature may explain the pattern of risk for the five associated diseases. Overall, these data illustrate the integration of genetic, phenotypic, and epigenetic analysis to identify the biologic mechanism by which a common, non-coding variant can distally regulate a gene and contribute to the pathogenesis of multiple vascular diseases.
View details for DOI 10.1016/j.cell.2017.06.049
View details for Web of Science ID 000406462400011
View details for PubMedID 28753427
View details for PubMedCentralID PMC5785707
Recurrent and functional regulatory mutations in breast cancer
2017; 547 (7661): 55-+
Genomic analysis of tumours has led to the identification of hundreds of cancer genes on the basis of the presence of mutations in protein-coding regions. By contrast, much less is known about cancer-causing mutations in non-coding regions. Here we perform deep sequencing in 360 primary breast cancers and develop computational methods to identify significantly mutated promoters. Clear signals are found in the promoters of three genes. FOXA1, a known driver of hormone-receptor positive breast cancer, harbours a mutational hotspot in its promoter leading to overexpression through increased E2F binding. RMRP and NEAT1, two non-coding RNA genes, carry mutations that affect protein binding to their promoters and alter expression levels. Our study shows that promoter regions harbour recurrent mutations in cancer with functional consequences and that the mutations occur at similar frequencies as in coding regions. Power analyses indicate that more such regions remain to be discovered through deep sequencing of adequately sized cohorts of patients.
View details for DOI 10.1038/nature22992
View details for Web of Science ID 000404839900030
View details for PubMedID 28658208
View details for PubMedCentralID PMC5563978
Systematic dissection of genomic features determining transcription factor binding and enhancer function
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2017; 114 (7): E1291–E1300
Enhancers regulate gene expression through the binding of sequence-specific transcription factors (TFs) to cognate motifs. Various features influence TF binding and enhancer function, including the chromatin state of the genomic locus, the affinities of the binding site, the activity of the bound TFs, and interactions among TFs. However, the precise nature and relative contributions of these features remain unclear. Here, we used massively parallel reporter assays (MPRAs) involving 32,115 natural and synthetic enhancers, together with high-throughput in vivo binding assays, to systematically dissect the contribution of each of these features to the binding and activity of genomic regulatory elements that contain motifs for PPARγ, a TF that serves as a key regulator of adipogenesis. We show that distinct sets of features govern PPARγ binding vs. enhancer activity. PPARγ binding is largely governed by the affinity of the specific motif site and higher-order features of the larger genomic locus, such as chromatin accessibility. In contrast, the enhancer activity of PPARγ binding sites depends on varying contributions from dozens of TFs in the immediate vicinity, including interactions between combinations of these TFs. Different pairs of motifs follow different interaction rules, including subadditive, additive, and superadditive interactions among specific classes of TFs, with both spatially constrained and flexible grammars. Our results provide a paradigm for the systematic characterization of the genomic features underlying regulatory elements, applicable to the design of synthetic regulatory elements or the interpretation of human genetic variation.
View details for DOI 10.1073/pnas.1621150114
View details for Web of Science ID 000393989300030
View details for PubMedID 28137873
View details for PubMedCentralID PMC5321001
Cohesin Loss Eliminates All Loop Domains.
2017; 171 (2): 305–20.e24
The human genome folds to create thousands of intervals, called "contact domains," that exhibit enhanced contact frequency within themselves. "Loop domains" form because of tethering between two loci (almost always bound by CTCF and cohesin) lying on the same chromosome. "Compartment domains" form when genomic intervals with similar histone marks co-segregate. Here, we explore the effects of degrading cohesin. All loop domains are eliminated, but neither compartment domains nor histone marks are affected. Loss of loop domains does not lead to widespread ectopic gene activation but does affect a significant minority of active genes. In particular, cohesin loss causes superenhancers to co-localize, forming hundreds of links within and across chromosomes and affecting the regulation of nearby genes. We then restore cohesin and monitor the re-formation of each loop. Although re-formation rates vary greatly, many megabase-sized loops recovered in under an hour, consistent with a model where loop extrusion is rapid.
View details for PubMedID 28985562
View details for PubMedCentralID PMC5846482
Long non-coding RNAs: spatial amplifiers that control nuclear structure and gene expression
NATURE REVIEWS MOLECULAR CELL BIOLOGY
2016; 17 (12): 756–70
Over the past decade, it has become clear that mammalian genomes encode thousands of long non-coding RNAs (lncRNAs), many of which are now implicated in diverse biological processes. Recent work studying the molecular mechanisms of several key examples - including Xist, which orchestrates X chromosome inactivation - has provided new insights into how lncRNAs can control cellular functions by acting in the nucleus. Here we discuss emerging mechanistic insights into how lncRNAs can regulate gene expression by coordinating regulatory proteins, localizing to target loci and shaping three-dimensional (3D) nuclear organization. We explore these principles to highlight biological challenges in gene regulation, in which lncRNAs are well-suited to perform roles that cannot be carried out by DNA elements or protein regulators alone, such as acting as spatial amplifiers of regulatory signals in the nucleus.
View details for DOI 10.1038/nrm.2016.126
View details for Web of Science ID 000388967900010
View details for PubMedID 27780979
Principles of Systems Biology-No. 10
2016; 3 (4): 318–20
CRISPR analysis of gene regulatory elements, a near-complete yeast genetic interaction map, and multi-omics mass spectrometry are milestones covered in this month's Cell Systems Call (Cell Systems 1, 307).
View details for Web of Science ID 000395781400002
View details for PubMedID 27788354
Eradication of large established tumors in mice by combination immunotherapy that engages innate and adaptive immune responses.
Checkpoint blockade with antibodies specific for cytotoxic T lymphocyte-associated protein (CTLA)-4 or programmed cell death 1 (PDCD1; also known as PD-1) elicits durable tumor regression in metastatic cancer, but these dramatic responses are confined to a minority of patients. This suboptimal outcome is probably due in part to the complex network of immunosuppressive pathways present in advanced tumors, which are unlikely to be overcome by intervention at a single signaling checkpoint. Here we describe a combination immunotherapy that recruits a variety of innate and adaptive immune cells to eliminate large tumor burdens in syngeneic tumor models and a genetically engineered mouse model of melanoma; to our knowledge tumors of this size have not previously been curable by treatments relying on endogenous immunity. Maximal antitumor efficacy required four components: a tumor-antigen-targeting antibody, a recombinant interleukin-2 with an extended half-life, anti-PD-1 and a powerful T cell vaccine. Depletion experiments revealed that CD8(+) T cells, cross-presenting dendritic cells and several other innate immune cell subsets were required for tumor regression. Effective treatment induced infiltration of immune cells and production of inflammatory cytokines in the tumor, enhanced antibody-mediated tumor antigen uptake and promoted antigen spreading. These results demonstrate the capacity of an elicited endogenous immune response to destroy large, established tumors and elucidate essential characteristics of combination immunotherapies that are capable of curing a majority of tumors in experimental settings typically viewed as intractable.
View details for DOI 10.1038/nm.4200
View details for PubMedID 27775706
View details for PubMedCentralID PMC5209798
RNA Antisense Purification (RAP) for Mapping RNA Interactions with Chromatin
NUCLEAR BODIES AND NONCODING RNAS: METHODS AND PROTOCOLS
2015; 1262: 183–97
RNA-centric biochemical purification is a general approach for studying the functions and mechanisms of noncoding RNAs. Here, we describe the experimental procedures for RNA antisense purification (RAP), a method for selective purification of endogenous RNA complexes from cell extracts that enables mapping of RNA interactions with chromatin. In RAP, the user cross-links cells to fix endogenous RNA complexes and purifies these complexes through hybrid capture with biotinylated antisense oligos. DNA loci that interact with the target RNA are identified using high-throughput DNA sequencing.
View details for DOI 10.1007/978-1-4939-2253-6_11
View details for Web of Science ID 000357692500012
View details for PubMedID 25555582
RNA-RNA Interactions Enable Specific Targeting of Noncoding RNAs to Nascent Pre-mRNAs and Chromatin Sites
2014; 159 (1): 188–99
Intermolecular RNA-RNA interactions are used by many noncoding RNAs (ncRNAs) to achieve their diverse functions. To identify these contacts, we developed a method based on RNA antisense purification to systematically map RNA-RNA interactions (RAP-RNA) and applied it to investigate two ncRNAs implicated in RNA processing: U1 small nuclear RNA, a component of the spliceosome, and Malat1, a large ncRNA that localizes to nuclear speckles. U1 and Malat1 interact with nascent transcripts through distinct targeting mechanisms. Using differential crosslinking, we confirmed that U1 directly hybridizes to 5' splice sites and 5' splice site motifs throughout introns and found that Malat1 interacts with pre-mRNAs indirectly through protein intermediates. Interactions with nascent pre-mRNAs cause U1 and Malat1 to localize proximally to chromatin at active genes, demonstrating that ncRNAs can use RNA-RNA interactions to target specific pre-mRNAs and genomic sites. RAP-RNA is sensitive to lower abundance RNAs as well, making it generally applicable for investigating ncRNAs.
View details for DOI 10.1016/j.cell.2014.08.018
View details for Web of Science ID 000343095000019
View details for PubMedID 25259926
View details for PubMedCentralID PMC4177037
Transcriptome-wide Mapping Reveals Widespread Dynamic-Regulated Pseudouridylation of ncRNA and mRNA
2014; 159 (1): 148–62
Pseudouridine is the most abundant RNA modification, yet except for a few well-studied cases, little is known about the modified positions and their function(s). Here, we develop Ψ-seq for transcriptome-wide quantitative mapping of pseudouridine. We validate Ψ-seq with spike-ins and de novo identification of previously reported positions and discover hundreds of unique sites in human and yeast mRNAs and snoRNAs. Perturbing pseudouridine synthases (PUS) uncovers which pseudouridine synthase modifies each site and their target sequence features. mRNA pseudouridylation depends on both site-specific and snoRNA-guided pseudouridine synthases. Upon heat shock in yeast, Pus7p-mediated pseudouridylation is induced at >200 sites, and PUS7 deletion decreases the levels of otherwise pseudouridylated mRNA, suggesting a role in enhancing transcript stability. rRNA pseudouridine stoichiometries are conserved but reduced in cells from dyskeratosis congenita patients, where the PUS DKC1 is mutated. Our work identifies an enhanced, transcriptome-wide scope for pseudouridine and methods to dissect its underlying mechanisms and function.
View details for DOI 10.1016/j.cell.2014.08.028
View details for Web of Science ID 000343095000016
View details for PubMedID 25219674
View details for PubMedCentralID PMC4180118
Topological organization of multichromosomal regions by the long intergenic noncoding RNA Firre
NATURE STRUCTURAL & MOLECULAR BIOLOGY
2014; 21 (2): 198-+
RNA, including long noncoding RNA (lncRNA), is known to be an abundant and important structural component of the nuclear matrix. However, the molecular identities, functional roles and localization dynamics of lncRNAs that influence nuclear architecture remain poorly understood. Here, we describe one lncRNA, Firre, that interacts with the nuclear-matrix factor hnRNPU through a 156-bp repeating sequence and localizes across an ~5-Mb domain on the X chromosome. We further observed Firre localization across five distinct trans-chromosomal loci, which reside in spatial proximity to the Firre genomic locus on the X chromosome. Both genetic deletion of the Firre locus and knockdown of hnRNPU resulted in loss of colocalization of these trans-chromosomal interacting loci. Thus, our data suggest a model in which lncRNAs such as Firre can interface with and modulate nuclear architecture across chromosomes.
View details for DOI 10.1038/nsmb.2764
View details for Web of Science ID 000331093600013
View details for PubMedID 24463464
View details for PubMedCentralID PMC3950333
Neuregulin Autocrine Signaling Promotes Self-Renewal of Breast Tumor-Initiating Cells by Triggering HER2/HER3 Activation
2014; 74 (1): 341-352
Currently, only patients with HER2-positive tumors are candidates for HER2-targeted therapies. However, recent clinical observations suggest that the survival of patients with HER2-low breast cancers, who lack HER2 amplification, may benefit from adjuvant therapy that targets HER2. In this study, we explored a mechanism through which these benefits may be obtained. Prompted by the hypothesis that HER2/HER3 signaling in breast tumor-initiating cells (TIC) promotes self-renewal and survival, we obtained evidence that neuregulin 1 (NRG1) produced by TICs promotes their proliferation and self-renewal in HER2-low tumors, including in triple-negative breast tumors. Pharmacologic inhibition of EGFR, HER2, or both receptors reduced breast TIC survival and self-renewal in vitro and in vivo and increased TIC sensitivity to ionizing radiation. Through a tissue microarray analysis, we found that NRG1 expression and associated HER2 activation occurred in a subset of HER2-low breast cancers. Our results offer an explanation for why HER2 inhibition blocks the growth of HER2-low breast tumors. Moreover, they argue that dual inhibition of EGFR and HER2 may offer a useful therapeutic strategy to target TICs in these tumors. In generating a mechanistic rationale to apply HER2-targeting therapies in patients with HER2-low tumors, this work shows why these therapies could benefit a considerably larger number of patients with breast cancer than they currently reach.
View details for DOI 10.1158/0008-5472.CAN-13-1055
View details for Web of Science ID 000329297600033
View details for PubMedID 24177178
View details for PubMedCentralID PMC3917843
Three-Dimensional Genome Architecture Influences Partner Selection for Chromosomal Translocations in Human Disease
2012; 7 (9): e44196
Chromosomal translocations are frequent features of cancer genomes that contribute to disease progression. These rearrangements result from formation and illegitimate repair of DNA double-strand breaks (DSBs), a process that requires spatial colocalization of chromosomal breakpoints. The "contact first" hypothesis suggests that translocation partners colocalize in the nuclei of normal cells, prior to rearrangement. It is unclear, however, the extent to which spatial interactions based on three-dimensional genome architecture contribute to chromosomal rearrangements in human disease. Here we intersect Hi-C maps of three-dimensional chromosome conformation with collections of 1,533 chromosomal translocations from cancer and germline genomes. We show that many translocation-prone pairs of regions genome-wide, including the cancer translocation partners BCR-ABL and MYC-IGH, display elevated Hi-C contact frequencies in normal human cells. Considering tissue specificity, we find that translocation breakpoints reported in human hematologic malignancies have higher Hi-C contact frequencies in lymphoid cells than those reported in sarcomas and epithelial tumors. However, translocations from multiple tissue types show significant correlation with Hi-C contact frequencies, suggesting that both tissue-specific and universal features of chromatin structure contribute to chromosomal alterations. Our results demonstrate that three-dimensional genome architecture shapes the landscape of rearrangements directly observed in human disease and establish Hi-C as a key method for dissecting these effects.
View details for DOI 10.1371/journal.pone.0044196
View details for Web of Science ID 000309973900005
View details for PubMedID 23028501
View details for PubMedCentralID PMC3460994
ProfileChaser: searching microarray repositories based on genome-wide patterns of differential expression
2011; 27 (23): 3317-3318
We introduce ProfileChaser, a web server that allows for querying the Gene Expression Omnibus based on genome-wide patterns of differential expression. Using a novel, content-based approach, ProfileChaser retrieves expression profiles that match the differentially regulated transcriptional programs in a user-supplied experiment. This analysis identifies statistical links to similar expression experiments from the vast array of publicly available data on diseases, drugs, phenotypes and other experimental conditions. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btr548
View details for Web of Science ID 000297352100015
View details for PubMedID 21967760
View details for PubMedCentralID PMC3223361
The Lin28/let-7 Axis Regulates Glucose Metabolism
2011; 147 (1): 81–94
The let-7 tumor suppressor microRNAs are known for their regulation of oncogenes, while the RNA-binding proteins Lin28a/b promote malignancy by inhibiting let-7 biogenesis. We have uncovered unexpected roles for the Lin28/let-7 pathway in regulating metabolism. When overexpressed in mice, both Lin28a and LIN28B promote an insulin-sensitized state that resists high-fat-diet induced diabetes. Conversely, muscle-specific loss of Lin28a or overexpression of let-7 results in insulin resistance and impaired glucose tolerance. These phenomena occur, in part, through the let-7-mediated repression of multiple components of the insulin-PI3K-mTOR pathway, including IGF1R, INSR, and IRS2. In addition, the mTOR inhibitor, rapamycin, abrogates Lin28a-mediated insulin sensitivity and enhanced glucose uptake. Moreover, let-7 targets are enriched for genes containing SNPs associated with type 2 diabetes and control of fasting glucose in human genome-wide association studies. These data establish the Lin28/let-7 pathway as a central regulator of mammalian glucose metabolism.
View details for DOI 10.1016/j.cell.2011.08.033
View details for Web of Science ID 000295396700017
View details for PubMedID 21962509
View details for PubMedCentralID PMC3353524
Content-based microarray search using differential expression profiles
With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations. We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3. Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.
View details for DOI 10.1186/1471-2105-11-603
View details for Web of Science ID 000286192100001
View details for PubMedID 21172034
View details for PubMedCentralID PMC3022631
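The idea of ranking experiments by a weighted correlation of their differential expression profiles can be sketched as follows. This is a generic weighted Pearson correlation, not necessarily the "p-weighted" measure used in the paper: how the per-gene weights are derived is not stated in the abstract, so the weights here are simply supplied by the caller, and all function and variable names are hypothetical.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted means of each profile.
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances (the common 1/total factor cancels).
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(vx * vy)


def rank_experiments(query, library, weights):
    """Rank library profiles by weighted correlation with the query, best first.

    library: experiment name -> per-gene score list aligned with `query`.
    """
    scored = [(name, weighted_pearson(query, profile, weights))
              for name, profile in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

One plausible choice, stated here only as an assumption, is to give larger weights to genes whose differential expression is more statistically significant, so that noisy genes contribute less to the ranking.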
Independent component analysis: Mining microarray data for fundamental human gene expression modules
JOURNAL OF BIOMEDICAL INFORMATICS
2010; 43 (6): 932-944
As public microarray repositories rapidly accumulate gene expression data, these resources contain increasingly valuable information about cellular processes in human biology. This presents a unique opportunity for intelligent data mining methods to extract information about the transcriptional modules underlying these biological processes. Modeling cellular gene expression as a combination of functional modules, we use independent component analysis (ICA) to derive 423 fundamental components of human biology from a 9395-array compendium of heterogeneous expression data. Annotation using the Gene Ontology (GO) suggests that while some of these components represent known biological modules, others may describe biology not well characterized by existing manually-curated ontologies. In order to understand the biological functions represented by these modules, we investigate the mechanism of the preclinical anti-cancer drug parthenolide (PTL) by analyzing the differential expression of our fundamental components. Our method correctly identifies known pathways and predicts that N-glycan biosynthesis and T-cell receptor signaling may contribute to PTL response. The fundamental gene modules we describe have the potential to provide pathway-level insight into new gene expression datasets.
View details for DOI 10.1016/j.jbi.2010.07.001
View details for Web of Science ID 000285036700009
View details for PubMedID 20619355
View details for PubMedCentralID PMC2991480