Bio


Anshul Kundaje is Associate Professor of Genetics and Computer Science at Stanford University. His primary research area is large-scale computational regulatory genomics. The Kundaje lab specializes in developing statistical and machine learning methods for large-scale integrative analysis of heterogeneous, high-throughput functional genomic and genetic data to decipher regulatory elements and long-range regulatory interactions, learn predictive regulatory network models across individuals, cell-types and species, and improve detection and interpretation of natural and disease-associated genetic variation. Previously, as a postdoc at Stanford and Research Scientist at MIT, Anshul was the lead computational analyst of the ENCODE Project and the Roadmap Epigenomics Project. Anshul is also a recipient of the 2016 NIH Director's New Innovator Award and the 2014 Alfred Sloan Fellowship.

Honors & Awards


  • HUGO Chen Award of Excellence, Human Genome Organization (2019)
  • NIH Director's New Innovator Award, NIH (2016)
  • Alfred Sloan Foundation Research Fellowship, Alfred Sloan Foundation (2014-2016)

Boards, Advisory Committees, Professional Organizations


  • Advisor, National Human Genome Research Institute Genomic Data Science Working Group (2021 - Present)
  • Editorial Board, Journal of Computational Biology (2021 - Present)
  • Editorial Board, Genome Research (2020 - Present)
  • Advisor, NIH Director's Advisory Committee for Artificial Intelligence in Biomedical Research (2019 - 2021)

Current Research and Scholarly Interests


Our research focuses on the development of statistical and machine learning methods for integrative analysis of diverse functional genomic and genetic data to learn models of gene regulation. We have led the analysis efforts of the Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Projects, developing novel methods for:
1. Adaptive thresholding and normalization of massive collections of functional genomic data (e.g. ChIP-seq and DNase-seq)
2. Dissecting combinatorial transcription factor co-occupancy within and across cell-types
3. Predicting cell-type specific enhancers from chromatin state profiles
4. Exploiting expression and chromatin co-dynamics to predict enhancer-target gene links
5. Jointly modeling sequence grammars at regulatory elements and their chromatin state dynamics, expression changes of regulators and functional interaction data to learn unified multi-scale gene regulation programs
6. Elucidating the heterogeneity of chromatin architecture at regulatory elements
7. Improving the detection and interpretation of potentially causal disease-associated variants from genome-wide association studies (GWAS)
More recently, we have also been developing methods to:
1. Decipher the functional heterogeneity of transcription factor binding
2. Learn long-range, three-dimensional regulatory interactions
3. Infer causal regulatory mechanisms by integrating diverse functional genomic data from temporal (e.g. differentiation/reprogramming) and perturbation (e.g. drug response, knockdown, genome-editing) experiments
4. Model the complex relationships between genetic variation, regulatory chromatin variation and expression variation in healthy and diseased individuals
5. Build deep learning frameworks for genomics

Projects


  • The Encyclopedia of DNA Elements (ENCODE) Project, Stanford University, MIT

    The project generates a resource of cell-type specific genome-wide regulatory maps in the human genome. We develop statistical processing methods for next-gen sequencing based functional genomic data and machine learning methods to predict regulatory events and to learn combinatorial regulatory effects of transcription factors and cell-type specific regulatory networks.

    Location

    Stanford, CA

  • The Roadmap Epigenomics Project, MIT (February 2012 - Present)

    The project generates genome-wide epigenomic maps in 200 human cell types. We develop computational methods and analyses to infer cell-type specific regulatory elements (e.g. enhancers) and their activity states, learn cell-type specific regulatory networks and use these maps to interpret GWAS and disease studies.

    Location

    Boston, MA

All Publications


  • Transcription factor stoichiometry, motif affinity and syntax regulate single-cell chromatin dynamics during fibroblast reprogramming to pluripotency. bioRxiv : the preprint server for biology Nair, S., Ameen, M., Sundaram, L., Pampari, A., Schreiber, J., Balsubramani, A., Wang, Y. X., Burns, D., Blau, H. M., Karakikes, I., Wang, K. C., Kundaje, A. 2023

    Abstract

    Ectopic expression of OCT4, SOX2, KLF4 and MYC (OSKM) transforms differentiated cells into induced pluripotent stem cells. To refine our mechanistic understanding of reprogramming, especially during the early stages, we profiled chromatin accessibility and gene expression at single-cell resolution across a densely sampled time course of human fibroblast reprogramming. Using neural networks that map DNA sequence to ATAC-seq profiles at base-resolution, we annotated cell-state-specific predictive transcription factor (TF) motif syntax in regulatory elements, inferred affinity- and concentration-dependent dynamics of Tn5-bias corrected TF footprints, linked peaks to putative target genes, and elucidated rewiring of TF-to-gene cis-regulatory networks. Our models reveal that early in reprogramming, OSK, at supraphysiological concentrations, rapidly open transient regulatory elements by occupying non-canonical low-affinity binding sites. As OSK concentration falls, the accessibility of these transient elements decays as a function of motif affinity. We find that these OSK-dependent transient elements sequester the somatic TF AP-1. This redistribution is strongly associated with the silencing of fibroblast-specific genes within individual nuclei. Together, our integrated single-cell resource and models reveal insights into the cis-regulatory code of reprogramming at unprecedented resolution, connect TF stoichiometry and motif syntax to diversification of cell fate trajectories, and provide new perspectives on the dynamics and role of transient regulatory elements in somatic silencing.

    View details for DOI 10.1101/2023.10.04.560808

    View details for PubMedID 37873116

  • Integrative single-cell analysis of cardiogenesis identifies developmental trajectories and non-coding mutations in congenital heart disease. Cell Ameen, M., Sundaram, L., Shen, M., Banerjee, A., Kundu, S., Nair, S., Shcherbina, A., Gu, M., Wilson, K. D., Varadarajan, A., Vadgama, N., Balsubramani, A., Wu, J. C., Engreitz, J. M., Farh, K., Karakikes, I., Wang, K. C., Quertermous, T., Greenleaf, W. J., Kundaje, A. 2022; 185 (26): 4937

    Abstract

    To define the multi-cellular epigenomic and transcriptional landscape of cardiac cellular development, we generated single-cell chromatin accessibility maps of human fetal heart tissues. We identified eight major differentiation trajectories involving primary cardiac cell types, each associated with dynamic transcription factor (TF) activity signatures. We contrasted regulatory landscapes of iPSC-derived cardiac cell types and their in vivo counterparts, which enabled optimization of in vitro differentiation of epicardial cells. Further, we interpreted sequence-based deep learning models of cell-type-resolved chromatin accessibility profiles to decipher underlying TF motif lexicons. De novo mutations predicted to affect chromatin accessibility in arterial endothelium were enriched in congenital heart disease (CHD) cases vs. controls. In vitro studies in iPSCs validated the functional impact of identified variation on the predicted developmental cell types. This work thus defines the cell-type-resolved cis-regulatory sequence determinants of heart development and identifies disruption of cell type-specific regulatory elements in CHD.

    View details for DOI 10.1016/j.cell.2022.11.028

    View details for PubMedID 36563664

  • The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nature genetics Kim, D. S., Risca, V. I., Reynolds, D. L., Chappell, J., Rubin, A. J., Jung, N., Donohue, L. K., Lopez-Pajares, V., Kathiria, A., Shi, M., Zhao, Z., Deep, H., Sharmin, M., Rao, D., Lin, S., Chang, H. Y., Snyder, M. P., Greenleaf, W. J., Kundaje, A., Khavari, P. A. 2021

    Abstract

    Transcription factors bind DNA sequence motif vocabularies in cis-regulatory elements (CREs) to modulate chromatin state and gene expression during cell state transitions. A quantitative understanding of how motif lexicons influence dynamic regulatory activity has been elusive due to the combinatorial nature of the cis-regulatory code. To address this, we undertook multiomic data profiling of chromatin and expression dynamics across epidermal differentiation to identify 40,103 dynamic CREs associated with 3,609 dynamically expressed genes, then applied an interpretable deep-learning framework to model the cis-regulatory logic of chromatin accessibility. This analysis framework identified cooperative DNA sequence rules in dynamic CREs regulating synchronous gene modules with diverse roles in skin differentiation. Massively parallel reporter assay analysis validated temporal dynamics and cooperative cis-regulatory logic. Variants linked to human polygenic skin disease were enriched in these time-dependent combinatorial motif rules. This integrative approach shows the combinatorial cis-regulatory lexicon of epidermal differentiation and represents a general framework for deciphering the organizational principles of the cis-regulatory code of dynamic gene regulation.

    View details for DOI 10.1038/s41588-021-00947-3

    View details for PubMedID 34650237

  • A genome-wide atlas of co-essential modules assigns function to uncharacterized genes. Nature genetics Wainberg, M., Kamber, R. A., Balsubramani, A., Meyers, R. M., Sinnott-Armstrong, N., Hornburg, D., Jiang, L., Chan, J., Jian, R., Gu, M., Shcherbina, A., Dubreuil, M. M., Spees, K., Meuleman, W., Snyder, M. P., Bassik, M. C., Kundaje, A. 2021

    Abstract

    A central question in the post-genomic era is how genes interact to form biological pathways. Measurements of gene dependency across hundreds of cell lines have been used to cluster genes into 'co-essential' pathways, but this approach has been limited by ubiquitous false positives. In the present study, we develop a statistical method that enables robust identification of gene co-essentiality and yields a genome-wide set of functional modules. This atlas recapitulates diverse pathways and protein complexes, and predicts the functions of 108 uncharacterized genes. Validating top predictions, we show that TMEM189 encodes plasmanylethanolamine desaturase, a key enzyme for plasmalogen synthesis. We also show that C15orf57 encodes a protein that binds the AP2 complex, localizes to clathrin-coated pits and enables efficient transferrin uptake. Finally, we provide an interactive webtool for the community to explore our results, which establish co-essentiality profiling as a powerful resource for biological pathway identification and discovery of new gene functions.

    View details for DOI 10.1038/s41588-021-00840-z

    View details for PubMedID 33859415

  • Base-resolution models of transcription-factor binding reveal soft motif syntax. Nature genetics Avsec, Ž., Weilert, M., Shrikumar, A., Krueger, S., Alexandari, A., Dalal, K., Fropf, R., McAnany, C., Gagneur, J., Kundaje, A., Zeitlinger, J. 2021

    Abstract

    The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.

    View details for DOI 10.1038/s41588-021-00782-6

    View details for PubMedID 33603233

  • Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer's and Parkinson's diseases. Nature genetics Corces, M. R., Shcherbina, A., Kundu, S., Gloudemans, M. J., Fresard, L., Granja, J. M., Louie, B. H., Eulalio, T., Shams, S., Bagdatli, S. T., Mumbach, M. R., Liu, B., Montine, K. S., Greenleaf, W. J., Kundaje, A., Montgomery, S. B., Chang, H. Y., Montine, T. J. 2020

    Abstract

    Genome-wide association studies of neurological diseases have identified thousands of variants associated with disease phenotypes. However, most of these variants do not alter coding sequences, making it difficult to assign their function. Here, we present a multi-omic epigenetic atlas of the adult human brain through profiling of single-cell chromatin accessibility landscapes and three-dimensional chromatin interactions of diverse adult brain regions across a cohort of cognitively healthy individuals. We developed a machine-learning classifier to integrate this multi-omic framework and predict dozens of functional SNPs for Alzheimer's and Parkinson's diseases, nominating target genes and cell types for previously orphaned loci from genome-wide association studies. Moreover, we dissected the complex inverted haplotype of the MAPT (encoding tau) Parkinson's disease risk locus, identifying putative ectopic regulatory interactions in neurons that may mediate this disease association. This work expands understanding of inherited variation and provides a roadmap for the epigenomic dissection of causal regulatory variation in disease.

    View details for DOI 10.1038/s41588-020-00721-x

    View details for PubMedID 33106633

  • Opportunities and challenges for transcriptome-wide association studies. Nature genetics Wainberg, M., Sinnott-Armstrong, N., Mancuso, N., Barbeira, A. N., Knowles, D. A., Golan, D., Ermel, R., Ruusalepp, A., Quertermous, T., Hao, K., Bjorkegren, J. M., Im, H., Pasaniuc, B., Rivas, M. A., Kundaje, A. 2019; 51 (4): 592-599

  • Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics (Oxford, England) Greenside, P., Shimko, T., Fordyce, P., Kundaje, A. 2018; 34 (17): i629-i637

    Abstract

    Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Code is available at: https://github.com/kundajelab/dfim. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/bty575

    View details for PubMedID 30423062

    View details for PubMedCentralID PMC6129272

  • Opportunities and obstacles for deep learning in biology and medicine. Journal of the Royal Society Interface Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., Ferrero, E., Agapow, P., Zietz, M., Hoffman, M. M., Xie, W., Rosen, G. L., Lengerich, B. J., Israeli, J., Lanchantin, J., Woloszynek, S., Carpenter, A. E., Shrikumar, A., Xu, J., Cofer, E. M., Lavender, C. A., Turaga, S. C., Alexandari, A. M., Lu, Z., Harris, D. J., DeCaprio, D., Qi, Y., Kundaje, A., Peng, Y., Wiley, L. K., Segler, M. S., Boca, S. M., Swamidass, S., Huang, A., Gitter, A., Greene, C. S. 2018; 15 (141)

    Abstract

    Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

    View details for PubMedID 29618526

    View details for PubMedCentralID PMC5938574

  • Denoising genome-wide histone ChIP-seq with convolutional neural networks. Bioinformatics Koh, P., Pierson, E., Kundaje, A. 2017; 33 (14): i225-i233

    Abstract

    Chromatin immune-precipitation sequencing (ChIP-seq) experiments are commonly used to obtain genome-wide profiles of histone modifications associated with different types of functional genomic elements. However, the quality of histone ChIP-seq data is affected by many experimental parameters such as the amount of input DNA, antibody specificity, ChIP enrichment and sequencing depth. Making accurate inferences from chromatin profiling experiments that involve diverse experimental parameters is challenging. We introduce a convolutional denoising algorithm, Coda, that uses convolutional neural networks to learn a mapping from suboptimal to high-quality histone ChIP-seq data. This overcomes various sources of noise and variability, substantially enhancing and recovering signal when applied to low-quality chromatin profiling datasets across individuals, cell types and species. Our method has the potential to improve data quality at reduced costs. More broadly, this approach, using a high-dimensional discriminative model to encode a generative noise process, is generally applicable to other biological domains where it is easy to generate noisy data but difficult to analytically characterize the noise or underlying data distribution. Availability: https://github.com/kundajelab/coda. Contact: akundaje@stanford.edu.

    View details for PubMedID 28881977

  • Genetic Control of Chromatin States in Humans Involves Local and Distal Chromosomal Interactions. Cell Grubert, F., Zaugg, J. B., Kasowski, M., Ursu, O., Spacek, D. V., Martin, A. R., Greenside, P., Srivas, R., Phanstiel, D. H., Pekowska, A., Heidari, N., Euskirchen, G., Huber, W., Pritchard, J. K., Bustamante, C. D., Steinmetz, L. M., Kundaje, A., Snyder, M. 2015; 162 (5): 1051-1065

    Abstract

    Deciphering the impact of genetic variants on gene regulation is fundamental to understanding human disease. Although gene regulation often involves long-range interactions, it is unknown to what extent non-coding genetic variants influence distal molecular phenotypes. Here, we integrate chromatin profiling for three histone marks in lymphoblastoid cell lines (LCLs) from 75 sequenced individuals with LCL-specific Hi-C and ChIA-PET-based chromatin contact maps to uncover one of the largest collections of local and distal histone quantitative trait loci (hQTLs). Distal QTLs are enriched within topologically associated domains and exhibit largely concordant variation of chromatin state coordinated by proximal and distal non-coding genetic variants. Histone QTLs are enriched for common variants associated with autoimmune diseases and enable identification of putative target genes of disease-associated variants from genome-wide association studies. These analyses provide insights into how genetic variation can affect human disease phenotypes by coordinated changes in chromatin at interacting regulatory elements.

    View details for DOI 10.1016/j.cell.2015.07.048

    View details for Web of Science ID 000360589900015

    View details for PubMedCentralID PMC4556133

  • Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer's disease. Nature Gjoneska, E., Pfenning, A. R., Mathys, H., Quon, G., Kundaje, A., Tsai, L., Kellis, M. 2015; 518 (7539): 365-369

    Abstract

    Alzheimer's disease (AD) is a severe age-related neurodegenerative disorder characterized by accumulation of amyloid-β plaques and neurofibrillary tangles, synaptic and neuronal loss, and cognitive decline. Several genes have been implicated in AD, but chromatin state alterations during neurodegeneration remain uncharacterized. Here we profile transcriptional and chromatin state dynamics across early and late pathology in the hippocampus of an inducible mouse model of AD-like neurodegeneration. We find a coordinated downregulation of synaptic plasticity genes and regulatory regions, and upregulation of immune response genes and regulatory regions, which are targeted by factors that belong to the ETS family of transcriptional regulators, including PU.1. Human regions orthologous to increasing-level enhancers show immune-cell-specific enhancer signatures as well as immune cell expression quantitative trait loci, while decreasing-level enhancer orthologues show fetal-brain-specific enhancer activity. Notably, AD-associated genetic variants are specifically enriched in increasing-level enhancer orthologues, implicating immune processes in AD predisposition. Indeed, increasing enhancers overlap known AD loci lacking protein-altering variants, and implicate additional loci that do not reach genome-wide significance. Our results reveal new insights into the mechanisms of neurodegeneration and establish the mouse as a useful model for functional studies of AD regulatory regions.

    View details for DOI 10.1038/nature14252

    View details for PubMedID 25693568

  • Integrative analysis of 111 reference human epigenomes. Nature Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi, A., Kheradpour, P., Zhang, Z., Wang, J., Ziller, M. J., Amin, V., Whitaker, J. W., Schultz, M. D., Ward, L. D., Sarkar, A., Quon, G., Sandstrom, R. S., Eaton, M. L., Wu, Y., Pfenning, A. R., Wang, X., Claussnitzer, M., Liu, Y., Coarfa, C., Harris, R. A., Shoresh, N., Epstein, C. B., Gjoneska, E., Leung, D., Xie, W., Hawkins, R. D., Lister, R., Hong, C., Gascard, P., Mungall, A. J., Moore, R., Chuah, E., Tam, A., Canfield, T. K., Hansen, R. S., Kaul, R., Sabo, P. J., Bansal, M. S., Carles, A., Dixon, J. R., Farh, K., Feizi, S., Karlic, R., Kim, A., Kulkarni, A., Li, D., Lowdon, R., Elliott, G., Mercer, T. R., Neph, S. J., Onuchic, V., Polak, P., Rajagopal, N., Ray, P., Sallari, R. C., Siebenthall, K. T., Sinnott-Armstrong, N. A., Stevens, M., Thurman, R. E., Wu, J., Zhang, B., Zhou, X., Beaudet, A. E., Boyer, L. A., De Jager, P. L., Farnham, P. J., Fisher, S. J., Haussler, D., Jones, S. J., Li, W., Marra, M. A., McManus, M. T., Sunyaev, S., Thomson, J. A., Tlsty, T. D., Tsai, L., Wang, W., Waterland, R. A., Zhang, M. Q., Chadwick, L. H., Bernstein, B. E., Costello, J. F., Ecker, J. R., Hirst, M., Meissner, A., Milosavljevic, A., Ren, B., Stamatoyannopoulos, J. A., Wang, T., Kellis, M. 2015; 518 (7539): 317-330

    Abstract

    The reference human genome sequence set the stage for studies of genetic variation and its association with human disease, but epigenomic studies lack a similar reference. To address this need, the NIH Roadmap Epigenomics Consortium generated the largest collection so far of human epigenomes for primary cells and tissues. Here we describe the integrative analysis of 111 reference human epigenomes generated as part of the programme, profiled for histone modification patterns, DNA accessibility, DNA methylation and RNA expression. We establish global maps of regulatory elements, define regulatory modules of coordinated activity, and their likely activators and repressors. We show that disease- and trait-associated genetic variants are enriched in tissue-specific epigenomic marks, revealing biologically relevant cell types for diverse human traits, and providing a resource for interpreting the molecular basis of human disease. Our results demonstrate the central role of epigenomic information for understanding gene regulation, cellular differentiation and human disease.

    View details for DOI 10.1038/nature14248

    View details for PubMedID 25693563

  • An integrated encyclopedia of DNA elements in the human genome. Nature Dunham, I., Kundaje, A., Aldred, S. F., Collins, P. J., Davis, C., Doyle, F., Epstein, C. B., Frietze, S., Harrow, J., Kaul, R., Khatun, J., Lajoie, B. R., Landt, S. G., Lee, B., Pauli, F., Rosenbloom, K. R., Sabo, P., Safi, A., Sanyal, A., Shoresh, N., Simon, J. M., Song, L., Trinklein, N. D., Altshuler, R. C., Birney, E., Brown, J. B., Cheng, C., Djebali, S., Dong, X., Dunham, I., Ernst, J., Furey, T. S., Gerstein, M., Giardine, B., Greven, M., Hardison, R. C., Harris, R. S., Herrero, J., Hoffman, M. M., Iyer, S., Kellis, M., Khatun, J., Kheradpour, P., Kundaje, A., Lassmann, T., Li, Q., Lin, X., Marinov, G. K., Merkel, A., Mortazavi, A., Parker, S. C., Reddy, T. E., Rozowsky, J., Schlesinger, F., Thurman, R. E., Wang, J., Ward, L. D., Whitfield, T. W., Wilder, S. P., Wu, W., Xi, H. S., Yip, K. Y., Zhuang, J., Bernstein, B. E., Birney, E., Dunham, I., Green, E. D., Gunter, C., Snyder, M., Pazin, M. J., Lowdon, R. F., Dillon, L. A., Adams, L. B., Kelly, C. J., Zhang, J., Wexler, J. R., Green, E. D., Good, P. J., Feingold, E. A., Bernstein, B. E., Birney, E., Crawford, G. E., Dekker, J., Elnitski, L., Farnham, P. J., Gerstein, M., Giddings, M. C., Gingeras, T. R., Green, E. D., Guigo, R., Hardison, R. C., Hubbard, T. J., Kellis, M., Kent, W. J., Lieb, J. D., Margulies, E. H., Myers, R. M., Snyder, M., Stamatoyannopoulos, J. A., Tenenbaum, S. A., Weng, Z., White, K. P., Wold, B., Khatun, J., Yu, Y., Wrobel, J., Risk, B. A., Gunawardena, H. P., Kuiper, H. C., Maier, C. W., Xie, L., Chen, X., Giddings, M. C., Bernstein, B. E., Epstein, C. B., Shoresh, N., Ernst, J., Kheradpour, P., Mikkelsen, T. S., Gillespie, S., Goren, A., Ram, O., Zhang, X., Wang, L., Issner, R., Coyne, M. J., Durham, T., Ku, M., Truong, T., Ward, L. D., Altshuler, R. C., Eaton, M. L., Kellis, M., Djebali, S., Davis, C. A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A., Tanzer, A., Lagarde, J., Lin, W., Schlesinger, F., Xue, C., Marinov, G. K., Khatun, J., Williams, B. A., Zaleski, C., Rozowsky, J., Roeder, M., Kokocinski, F., Abdelhamid, R. F., Alioto, T., Antoshechkin, I., Baer, M. T., Batut, P., Bell, I., Bell, K., Chakrabortty, S., Chen, X., Chrast, J., Curado, J., Derrien, T., Drenkow, J., Dumais, E., Dumais, J., Duttagupta, R., Fastuca, M., Fejes-Toth, K., Ferreira, P., Foissac, S., Fullwood, M. J., Gao, H., Gonzalez, D., Gordon, A., Gunawardena, H. P., Howald, C., Jha, S., Johnson, R., Kapranov, P., King, B., Kingswood, C., Li, G., Luo, O. J., Park, E., Preall, J. B., Presaud, K., Ribeca, P., Risk, B. A., Robyr, D., Ruan, X., Sammeth, M., Sandhu, K. S., Schaeffer, L., See, L., Shahab, A., Skancke, J., Suzuki, A. M., Takahashi, H., Tilgner, H., Trout, D., Walters, N., Wang, H., Wrobel, J., Yu, Y., Hayashizaki, Y., Harrow, J., Gerstein, M., Hubbard, T. J., Reymond, A., Antonarakis, S. E., Hannon, G. J., Giddings, M. C., Ruan, Y., Wold, B., Carninci, P., Guigo, R., Gingeras, T. R., Rosenbloom, K. R., Sloan, C. A., Learned, K., Malladi, V. S., Wong, M. C., Barber, G., Cline, M. S., Dreszer, T. R., Heitner, S. G., Karolchik, D., Kent, W. J., Kirkup, V. M., Meyer, L. R., Long, J. C., Maddren, M., Raney, B. J., Furey, T. S., Song, L., Grasfeder, L. L., Giresi, P. G., Lee, B., Battenhouse, A., Sheffield, N. C., Simon, J. M., Showers, K. A., Safi, A., London, D., Bhinge, A. A., Shestak, C., Schaner, M. R., Kim, S. K., Zhang, Z. Z., Mieczkowski, P. A., Mieczkowska, J. O., Liu, Z., McDaniell, R. M., Ni, Y., Rashid, N. U., Kim, M. J., Adar, S., Zhang, Z., Wang, T., Winter, D., Keefe, D., Birney, E., Iyer, V. R., Lieb, J. D., Crawford, G. E., Li, G., Sandhu, K. S., Zheng, M., Wang, P., Luo, O. J., Shahab, A., Fullwood, M. J., Ruan, X., Ruan, Y., Myers, R. M., Pauli, F., Williams, B. A., Gertz, J., Marinov, G. K., Reddy, T. E., Vielmetter, J., Partridge, E. C., Trout, D., Varley, K. E., Gasper, C., Bansal, A., Pepke, S., Jain, P., Amrhein, H., Bowling, K. M., Anaya, M., Cross, M. K., King, B., Muratet, M. A., Antoshechkin, I., Newberry, K. M., McCue, K., Nesmith, A. S., Fisher-Aylor, K. I., Pusey, B., DeSalvo, G., Parker, S. L., Balasubramanian, S., Davis, N. S., Meadows, S. K., Eggleston, T., Gunter, C., Newberry, J. S., Levy, S. E., Absher, D. M., Mortazavi, A., Wong, W. H., Wold, B., Blow, M. J., Visel, A., Pennachio, L. A., Elnitski, L., Margulies, E. H., Parker, S. C., Petrykowska, H. M., Abyzov, A., Aken, B., Barrell, D., Barson, G., Berry, A., Bignell, A., Boychenko, V., Bussotti, G., Chrast, J., Davidson, C., Derrien, T., Despacio-Reyes, G., Diekhans, M., Ezkurdia, I., Frankish, A., Gilbert, J., Gonzalez, J. M., Griffiths, E., Harte, R., Hendrix, D. A., Howald, C., Hunt, T., Jungreis, I., Kay, M., Khurana, E., Kokocinski, F., Leng, J., Lin, M. F., Loveland, J., Lu, Z., Manthravadi, D., Mariotti, M., Mudge, J., Mukherjee, G., Notredame, C., Pei, B., Rodriguez, J. M., Saunders, G., Sboner, A., Searle, S., Sisu, C., Snow, C., Steward, C., Tanzer, A., Tapanari, E., Tress, M. L., van Baren, M. J., Walters, N., Washietl, S., Wilming, L., Zadissa, A., Zhang, Z., Brent, M., Haussler, D., Kellis, M., Valencia, A., Gerstein, M., Reymond, A., Guigo, R., Harrow, J., Hubbard, T. J., Landt, S. G., Frietze, S., Abyzov, A., Addleman, N., Alexander, R. P., Auerbach, R. K., Balasubramanian, S., Bettinger, K., Bhardwaj, N., Boyle, A. P., Cao, A. R., Cayting, P., Charos, A., Cheng, Y., Cheng, C., Eastman, C., Euskirchen, G., Fleming, J. D., Grubert, F., Habegger, L., Hariharan, M., Harmanci, A., Iyengar, S., Jin, V. X., Karczewski, K. J., Kasowski, M., Lacroute, P., Lam, H., Lamarre-Vincent, N., Leng, J., Lian, J., Lindahl-Allen, M., Min, R., Miotto, B., Monahan, H., Moqtaderi, Z., Mu, X. J., O'Geen, H., Ouyang, Z., Patacsil, D., Pei, B., Raha, D., Ramirez, L., Reed, B., Rozowsky, J., Sboner, A., Shi, M., Sisu, C., Slifer, T., Witt, H., Wu, L., Xu, X., Yan, K., Yang, X., Yip, K. Y., Zhang, Z., Struhl, K., Weissman, S. M., Gerstein, M., Farnham, P. J., Snyder, M., Tenenbaum, S. A., Penalva, L. O., Doyle, F., Karmakar, S., Landt, S. G., Bhanvadia, R. R., Choudhury, A., Domanus, M., Ma, L., Moran, J., Patacsil, D., Slifer, T., Victorsen, A., Yang, X., Snyder, M., White, K. P., Auer, T., Centanin, L., Eichenlaub, M., Gruhl, F., Heermann, S., Hoeckendorf, B., Inoue, D., Kellner, T., Kirchmaier, S., Mueller, C., Reinhardt, R., Schertel, L., Schneider, S., Sinn, R., Wittbrodt, B., Wittbrodt, J., Weng, Z., Whitfield, T. W., Wang, J., Collins, P. J., Aldred, S. F., Trinklein, N. D., Partridge, E. C., Myers, R. M., Dekker, J., Jain, G., Lajoie, B. R., Sanyal, A., Balasundaram, G., Bates, D. L., Byron, R., Canfield, T. K., Diegel, M. J., Dunn, D., Ebersol, A. K., Frum, T., Garg, K., Gist, E., Hansen, R. S., Boatman, L., Haugen, E., Humbert, R., Jain, G., Johnson, A. K., Johnson, E. M., Kutyavin, T. V., Lajoie, B. R., Lee, K., Lotakis, D., Maurano, M. T., Neph, S. J., Neri, F. V., Nguyen, E. D., Qu, H., Reynolds, A. P., Roach, V., Rynes, E., Sabo, P., Sanchez, M. E., Sandstrom, R. S., Sanyal, A., Shafer, A. O., Stergachis, A. B., Thomas, S., Thurman, R. E., Vernot, B., Vierstra, J., Vong, S., Wang, H., Weaver, M. A., Yan, Y., Zhang, M., Akey, J. M., Bender, M., Dorschner, M. O., Groudine, M., MacCoss, M. J., Navas, P., Stamatoyannopoulos, G., Kaul, R., Dekker, J., Stamatoyannopoulos, J. A., Dunham, I., Beal, K., Brazma, A., Flicek, P., Herrero, J., Johnson, N., Keefe, D., Lukk, M., Luscombe, N. M., Sobral, D., Vaquerizas, J. M., Wilder, S. P., Batzoglou, S., Sidow, A., Hussami, N., Kyriazopoulou-Panagiotopoulou, S., Libbrecht, M. W., Schaub, M. A., Kundaje, A., Hardison, R. C., Miller, W., Giardine, B., Harris, R. S., Wu, W., Bickel, P. J., Banfai, B., Boley, N. P., Brown, J. B., Huang, H., Li, Q., Li, J. J., Noble, W. S., Bilmes, J. A., Buske, O. J., Hoffman, M. M., Sahu, A. D., Kharchenko, P. V., Park, P. J., Baker, D., Taylor, J., Weng, Z., Iyer, S., Dong, X., Greven, M., Lin, X., Wang, J., Xi, H. S., Zhuang, J., Gerstein, M., Alexander, R. P., Balasubramanian, S., Cheng, C., Harmanci, A., Lochovsky, L., Min, R., Mu, X. J., Rozowsky, J., Yan, K., Yip, K. Y., Birney, E. 2012; 489 (7414): 57-74

    Abstract

    The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

    View details for DOI 10.1038/nature11247

    View details for Web of Science ID 000308347000039

    View details for PubMedID 22955616

    View details for PubMedCentralID PMC3439153

  • Architecture of the human regulatory network derived from ENCODE data NATURE Gerstein, M. B., Kundaje, A., Hariharan, M., Landt, S. G., Yan, K., Cheng, C., Mu, X. J., Khurana, E., Rozowsky, J., Alexander, R., Min, R., Alves, P., Abyzov, A., Addleman, N., Bhardwaj, N., Boyle, A. P., Cayting, P., Charos, A., Chen, D. Z., Cheng, Y., Clarke, D., Eastman, C., Euskirchen, G., Frietze, S., Fu, Y., Gertz, J., Grubert, F., Harmanci, A., Jain, P., Kasowski, M., Lacroute, P., Leng, J., Lian, J., Monahan, H., O'Geen, H., Ouyang, Z., Partridge, E. C., Patacsil, D., Pauli, F., Raha, D., Ramirez, L., Reddy, T. E., Reed, B., Shi, M., Slifer, T., Wang, J., Wu, L., Yang, X., Yip, K. Y., Zilberman-Schapira, G., Batzoglou, S., Sidow, A., Farnham, P. J., Myers, R. M., Weissman, S. M., Snyder, M. 2012; 489 (7414): 91-100

    Abstract

    Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.

    View details for DOI 10.1038/nature11245

    View details for PubMedID 22955619

  • Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements GENOME RESEARCH Kundaje, A., Kyriazopoulou-Panagiotopoulou, S., Libbrecht, M., Smith, C. L., Raha, D., Winters, E. E., Johnson, S. M., Snyder, M., Batzoglou, S., Sidow, A. 2012; 22 (9): 1735-1747

    Abstract

    Gene regulation at functional elements (e.g., enhancers, promoters, insulators) is governed by an interplay of nucleosome remodeling, histone modifications, and transcription factor binding. To enhance our understanding of gene regulation, the ENCODE Consortium has generated a wealth of ChIP-seq data on DNA-binding proteins and histone modifications. We additionally generated nucleosome positioning data on two cell lines, K562 and GM12878, by MNase digestion and high-depth sequencing. Here we relate 14 chromatin signals (12 histone marks, DNase, and nucleosome positioning) to the binding sites of 119 DNA-binding proteins across a large number of cell lines. We developed a new method for unsupervised pattern discovery, the Clustered AGgregation Tool (CAGT), which accounts for the inherent heterogeneity in signal magnitude, shape, and implicit strand orientation of chromatin marks. We applied CAGT on a total of 5084 data set pairs to obtain an exhaustive catalog of high-resolution patterns of histone modifications and nucleosome positioning signals around bound transcription factors. Our analyses reveal extensive heterogeneity in how histone modifications are deposited, and how nucleosomes are positioned around binding sites. With the exception of the CTCF/cohesin complex, asymmetry of nucleosome positioning is predominant. Asymmetry of histone modifications is also widespread, for all types of chromatin marks examined, including promoter, enhancer, elongation, and repressive marks. The fine-resolution signal shapes discovered by CAGT unveiled novel correlation patterns between chromatin marks, nucleosome positioning, and sequence content. Meta-analyses of the signal profiles revealed a common vocabulary of chromatin signals shared across multiple cell lines and binding proteins.

    View details for DOI 10.1101/gr.136366.111

    View details for PubMedID 22955985

  • ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia GENOME RESEARCH Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B. E., Bickel, P., Brown, J. B., Cayting, P., Chen, Y., DeSalvo, G., Epstein, C., Fisher-Aylor, K. I., Euskirchen, G., Gerstein, M., Gertz, J., Hartemink, A. J., Hoffman, M. M., Iyer, V. R., Jung, Y. L., Karmakar, S., Kellis, M., Kharchenko, P. V., Li, Q., Liu, T., Liu, X. S., Ma, L., Milosavljevic, A., Myers, R. M., Park, P. J., Pazin, M. J., Perry, M. D., Raha, D., Reddy, T. E., Rozowsky, J., Shoresh, N., Sidow, A., Slattery, M., Stamatoyannopoulos, J. A., Tolstorukov, M. Y., White, K. P., Xi, S., Farnham, P. J., Lieb, J. D., Wold, B. J., Snyder, M. 2012; 22 (9): 1813-1831

    Abstract

    Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.

    View details for DOI 10.1101/gr.136184.111

    View details for PubMedID 22955991

  • Linking disease associations with regulatory information in the human genome GENOME RESEARCH Schaub, M. A., Boyle, A. P., Kundaje, A., Batzoglou, S., Snyder, M. 2012; 22 (9): 1748-1759

    Abstract

    Genome-wide association studies have been successful in identifying single nucleotide polymorphisms (SNPs) associated with a large number of phenotypes. However, an associated SNP is likely part of a larger region of linkage disequilibrium. This makes it difficult to precisely identify the SNPs that have a biological link with the phenotype. We have systematically investigated the association of multiple types of ENCODE data with disease-associated SNPs and show that there is significant enrichment for functional SNPs among the currently identified associations. This enrichment is strongest when integrating multiple sources of functional information and when highest confidence disease-associated SNPs are used. We propose an approach that integrates multiple types of functional data generated by the ENCODE Consortium to help identify "functional SNPs" that may be associated with the disease phenotype. Our approach generates putative functional annotations for up to 80% of all previously reported associations. We show that for most associations, the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the reported association rather than the reported SNP itself. Our results show that the experimental data sets generated by the ENCODE Consortium can be successfully used to suggest functional hypotheses for variants associated with diseases and other phenotypes.

    View details for DOI 10.1101/gr.136127.111

    View details for PubMedID 22955986

  • A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast PLOS COMPUTATIONAL BIOLOGY Kundaje, A., Xin, X., Lan, C., Lianoglou, S., Zhou, M., Zhang, L., Leslie, C. 2008; 4 (11)

    Abstract

    Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. 
In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included.

    View details for DOI 10.1371/journal.pcbi.1000224

    View details for Web of Science ID 000261480800016

    View details for PubMedID 19008939

    View details for PubMedCentralID PMC2573020

  • Combining sequence and time series expression data to learn transcriptional modules IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Kundaje, A., Middendorf, M., Gao, F., Wiggins, C., Leslie, C. 2005; 2 (3): 194-202

    Abstract

    Our goal is to cluster genes into transcriptional modules--sets of genes where similarity in expression is explained by common regulatory mechanisms at the transcriptional level. We want to learn modules from both time series gene expression data and genome-wide motif data that are now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules. Starting with a set of motifs representing known or putative regulatory elements (transcription factor binding sites) and the counts of occurrences of these motifs in each gene's promoter region, together with a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. We also present a technique based on the Jensen-Shannon entropy contributions of motifs in the learned model for associating the most significant motifs to each module. Thus, the algorithm gives a global approach for associating sets of regulatory elements to "modules" of genes with similar time series expression profiles. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course data sets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated to known cell cycle transcription factors.

    View details for Web of Science ID 000235704200003

    View details for PubMedID 17044183

  • Multicenter integrated analysis of noncoding CRISPRi screens. Nature methods Yao, D., Tycko, J., Oh, J. W., Bounds, L. R., Gosai, S. J., Lataniotis, L., Mackay-Smith, A., Doughty, B. R., Gabdank, I., Schmidt, H., Guerrero-Altamirano, T., Siklenka, K., Guo, K., White, A. D., Youngworth, I., Andreeva, K., Ren, X., Barrera, A., Luo, Y., Yardımcı, G. G., Tewhey, R., Kundaje, A., Greenleaf, W. J., Sabeti, P. C., Leslie, C., Pritykin, Y., Moore, J. E., Beer, M. A., Gersbach, C. A., Reddy, T. E., Shen, Y., Engreitz, J. M., Bassik, M. C., Reilly, S. K. 2024

    Abstract

    The ENCODE Consortium's efforts to annotate noncoding cis-regulatory elements (CREs) have advanced our understanding of gene regulatory landscapes. Pooled, noncoding CRISPR screens offer a systematic approach to investigate cis-regulatory mechanisms. The ENCODE4 Functional Characterization Centers conducted 108 screens in human cell lines, comprising >540,000 perturbations across 24.85 megabases of the genome. Using 332 functionally confirmed CRE-gene links in K562 cells, we established guidelines for screening endogenous noncoding elements with CRISPR interference (CRISPRi), including accurate detection of CREs that exhibit variable, often low, transcriptional effects. Benchmarking five screen analysis tools, we find that CASA produces the most conservative CRE calls and is robust to artifacts of low-specificity single guide RNAs. We uncover a subtle DNA strand bias for CRISPRi in transcribed regions with implications for screen design and analysis. Together, we provide an accessible data resource, predesigned single guide RNAs for targeting 3,275,697 ENCODE SCREEN candidate CREs with CRISPRi and screening guidelines to accelerate functional characterization of the noncoding genome.

    View details for DOI 10.1038/s41592-024-02216-7

    View details for PubMedID 38504114

    View details for PubMedCentralID 3771521

  • Protocol for mapping the three-dimensional organization of dinoflagellate genomes. STAR protocols Marinov, G. K., Kundaje, A., Greenleaf, W. J., Grossman, A. R. 2024; 5 (2): 102941

    Abstract

    Dinoflagellate genomes often are very large and difficult to assemble, which has until recently precluded their analysis with modern functional genomic tools. Here, we present a protocol for mapping three-dimensional (3D) genome organization in dinoflagellates and using it for scaffolding their genome assemblies. We describe steps for crosslinking, nuclear lysis, denaturation, restriction digest, ligation, and DNA shearing and purification. We then detail procedures for sequencing library generation and computational analysis, including initial Hi-C read mapping and 3D-DNA scaffolding/assembly correction. For complete details on the use and execution of this protocol, please refer to Marinov et al.1.

    View details for DOI 10.1016/j.xpro.2024.102941

    View details for PubMedID 38483898

  • Genome-wide interaction study with smoking for colorectal cancer risk identifies novel genetic loci related to tumor suppression, inflammation and immune response Carreras-Torres, R., Kim, A. E., Lin, Y., Diez-Obrero, V., Bien, S. A., Qu, C., Wang, J., Dimou, N., Aglago, E. K., Bouras, E., Campbell, P. T., Casey, G., Chang-Claude, J., Drew, D. A., Gunter, M., Jordahl, K. M., Kawaguchi, E., Kundaje, A., Morrison, J. L., Murphy, N., Newcomb, P., Obon-Santacana, M., Papadimitriou, N., Peoples, A. R., Ruiz-Narvaez, E., Shcherbina, A., Stern, M. C., Su, Y., Tian, Y., Tsilidis, K. K., van Duijnhoven, F. B., Hsu, L., Peters, U., Moreno, V., Gauderman, W. SPRINGERNATURE. 2024: 772
  • Rewriting regulatory DNA to dissect and reprogram gene expression. bioRxiv : the preprint server for biology Martyn, G. E., Montgomery, M. T., Jones, H., Guo, K., Doughty, B. R., Linder, J., Chen, Z., Cochran, K., Lawrence, K. A., Munson, G., Pampari, A., Fulco, C. P., Kelley, D. R., Lander, E. S., Kundaje, A., Engreitz, J. M. 2023

    Abstract

    Regulatory DNA sequences within enhancers and promoters bind transcription factors to encode cell type-specific patterns of gene expression. However, the regulatory effects and programmability of such DNA sequences remain difficult to map or predict because we have lacked scalable methods to precisely edit regulatory DNA and quantify the effects in an endogenous genomic context. Here we present an approach to measure the quantitative effects of hundreds of designed DNA sequence variants on gene expression, by combining pooled CRISPR prime editing with RNA fluorescence in situ hybridization and cell sorting (Variant-FlowFISH). We apply this method to mutagenize and rewrite regulatory DNA sequences in an enhancer and the promoter of PPIF in two immune cell lines. Of 672 variant-cell type pairs, we identify 497 that affect PPIF expression. These variants appear to act through a variety of mechanisms including disruption or optimization of existing transcription factor binding sites, as well as creation of de novo sites. Disrupting a single endogenous transcription factor binding site often led to large changes in expression (up to -40% in the enhancer, and -50% in the promoter). The same variant often had different effects across cell types and states, demonstrating a highly tunable regulatory landscape. We use these data to benchmark performance of sequence-based predictive models of gene regulation, and find that certain types of variants are not accurately predicted by existing models. Finally, we computationally design 185 small sequence variants (≤10 bp) and optimize them for specific effects on expression in silico. 84% of these rationally designed edits showed the intended direction of effect, and some had dramatic effects on expression (-100% to +202%). 
Variant-FlowFISH thus provides a powerful tool to map the effects of variants and transcription factor binding sites on gene expression, test and improve computational models of gene regulation, and reprogram regulatory DNA.

    View details for DOI 10.1101/2023.12.20.572268

    View details for PubMedID 38187584

    View details for PubMedCentralID PMC10769263

  • Genome-wide gene-environment interaction analyses to understand the relationship between red meat and processed meat intake and colorectal cancer risk. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology Stern, M. C., Sanchez Mendez, J., Kim, A. E., Obón-Santacana, M., Moratalla-Navarro, F., Martín, V., Moreno, V., Lin, Y., Bien, S. A., Qu, C., Su, Y. R., White, E., Harrison, T. A., Huyghe, J. R., Tangen, C. M., Newcomb, P. A., Phipps, A. I., Thomas, C. E., Kawaguchi, E. S., Lewinger, J. P., Morrison, J. L., Conti, D. V., Wang, J., Thomas, D. C., Platz, E. A., Visvanathan, K., Keku, T. O., Newton, C. C., Um, C. Y., Kundaje, A., Shcherbina, A., Murphy, N., Gunter, M. J., Dimou, N., Papadimitriou, N., Bézieau, S., van Duijnhoven, F. J., Männistö, S., Rennert, G., Wolk, A., Hoffmeister, M., Brenner, H., Chang-Claude, J., Tian, Y., Le Marchand, L., Cotterchio, M., Tsilidis, K. K., Bishop, D. T., Melaku, Y. A., Lynch, B. M., Buchanan, D. D., Ulrich, C. M., Ose, J., Peoples, A. R., Pellatt, A. J., Li, L., Devall, M. A., Campbell, P. T., Albanes, D., Weinstein, S. J., Berndt, S. I., Gruber, S. B., Ruiz-Narvaez, E., Song, M., Joshi, A. D., Drew, D. A., Petrick, J. L., Chan, A. T., Giannakis, M., Peters, U., Hsu, L., Gauderman, W. J. 2023

    Abstract

    High red meat and/or processed meat consumption are established colorectal cancer (CRC) risk factors. We conducted a genome-wide gene-environment (GxE) interaction analysis to identify genetic variants that may modify these associations. A pooled sample of 29,842 CRC cases and 39,635 controls of European ancestry from 27 studies were included. Quantiles for red meat and processed meat intake were constructed from harmonized questionnaire data. Genotyping arrays were imputed to the Haplotype Reference Consortium. Two-step EDGE and joint tests of GxE interaction were utilized in our genome-wide scan. Meta-analyses confirmed positive associations between increased consumption of red meat and processed meat with CRC risk (per quartile red meat OR = 1.30; 95%CI = 1.21-1.41; processed meat OR = 1.40; 95%CI = 1.20-1.63). Two significant genome-wide GxE interactions for red meat consumption were found. Joint GxE tests revealed the rs4871179 SNP in chromosome 8 (downstream of HAS2); greater than median of consumption ORs = 1.38 (95%CI = 1.29-1.46), 1.20 (95%CI = 1.12-1.27), and 1.07 (95%CI = 0.95-1.19) for CC, CG and GG, respectively. The two-step EDGE method identified the rs35352860 SNP in chromosome 18 (SMAD7 intron); greater than median of consumption ORs = 1.18 (95%CI = 1.11-1.24), 1.35 (95%CI = 1.26-1.44), and 1.46 (95%CI = 1.26-1.69) for CC, CT, and TT, respectively. We propose two novel biomarkers that support the role of meat consumption with an increased risk of CRC. The reported GxE interactions may explain the increased risk of CRC in certain population subgroups.

    View details for DOI 10.1158/1055-9965.EPI-23-0717

    View details for PubMedID 38112776

  • Identification of constrained sequence elements across 239 primate genomes. Nature Kuderna, L. F., Ulirsch, J. C., Rashid, S., Ameen, M., Sundaram, L., Hickey, G., Cox, A. J., Gao, H., Kumar, A., Aguet, F., Christmas, M. J., Clawson, H., Haeussler, M., Janiak, M. C., Kuhlwilm, M., Orkin, J. D., Bataillon, T., Manu, S., Valenzuela, A., Bergman, J., Rouselle, M., Silva, F. E., Agueda, L., Blanc, J., Gut, M., de Vries, D., Goodhead, I., Harris, R. A., Raveendran, M., Jensen, A., Chuma, I. S., Horvath, J. E., Hvilsom, C., Juan, D., Frandsen, P., Schraiber, J. G., de Melo, F. R., Bertuol, F., Byrne, H., Sampaio, I., Farias, I., Valsecchi, J., Messias, M., da Silva, M. N., Trivedi, M., Rossi, R., Hrbek, T., Andriaholinirina, N., Rabarivola, C. J., Zaramody, A., Jolly, C. J., Phillips-Conroy, J., Wilkerson, G., Abee, C., Simmons, J. H., Fernandez-Duque, E., Kanthaswamy, S., Shiferaw, F., Wu, D., Zhou, L., Shao, Y., Zhang, G., Keyyu, J. D., Knauf, S., Le, M. D., Lizano, E., Merker, S., Navarro, A., Nadler, T., Khor, C. C., Lee, J., Tan, P., Lim, W. K., Kitchener, A. C., Zinner, D., Gut, I., Melin, A. D., Guschanski, K., Schierup, M. H., Beck, R. M., Karakikes, I., Wang, K. C., Umapathy, G., Roos, C., Boubli, J. P., Siepel, A., Kundaje, A., Paten, B., Lindblad-Toh, K., Rogers, J., Marques Bonet, T., Farh, K. K. 2023

    Abstract

    Noncoding DNA is central to our understanding of human gene regulation and complex diseases1,2, and measuring the evolutionary sequence constraint can establish the functional relevance of putative regulatory elements in the human genome3-9. Identifying the genomic elements that have become constrained specifically in primates has been hampered by the faster evolution of noncoding DNA compared to protein-coding DNA10, the relatively short timescales separating primate species11, and the previously limited availability of whole-genome sequences12. Here we construct a whole-genome alignment of 239 species, representing nearly half of all extant species in the primate order. Using this resource, we identified human regulatory elements that are under selective constraint across primates and other mammals at a 5% false discovery rate. We detected 111,318 DNase I hypersensitivity sites and 267,410 transcription factor binding sites that are constrained specifically in primates but not across other placental mammals and validate their cis-regulatory effects on gene expression. These regulatory elements are enriched for human genetic variants that affect gene expression and complex traits and diseases. Our results highlight the important role of recent evolution in regulatory sequence elements differentiating primates, including humans, from other placental mammals.

    View details for DOI 10.1038/s41586-023-06798-8

    View details for PubMedID 38030727

    View details for PubMedCentralID 1891336

  • An encyclopedia of enhancer-gene regulatory interactions in the human genome. bioRxiv : the preprint server for biology Gschwind, A. R., Mualim, K. S., Karbalayghareh, A., Sheth, M. U., Dey, K. K., Jagoda, E., Nurtdinov, R. N., Xi, W., Tan, A. S., Jones, H., Ma, X. R., Yao, D., Nasser, J., Avsec, Ž., James, B. T., Shamim, M. S., Durand, N. C., Rao, S. S., Mahajan, R., Doughty, B. R., Andreeva, K., Ulirsch, J. C., Fan, K., Perez, E. M., Nguyen, T. C., Kelley, D. R., Finucane, H. K., Moore, J. E., Weng, Z., Kellis, M., Bassik, M. C., Price, A. L., Beer, M. A., Guigó, R., Stamatoyannopoulos, J. A., Lieberman Aiden, E., Greenleaf, W. J., Leslie, C. S., Steinmetz, L. M., Kundaje, A., Engreitz, J. M. 2023

    Abstract

    Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease1-6. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium7. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.

    View details for DOI 10.1101/2023.11.09.563812

    View details for PubMedID 38014075

    View details for PubMedCentralID PMC10680627

  • Latent human herpesvirus 6 is reactivated in CAR T cells. Nature Lareau, C. A., Yin, Y., Maurer, K., Sandor, K. D., Daniel, B., Yagnik, G., Peña, J., Crawford, J. C., Spanjaart, A. M., Gutierrez, J. C., Haradhvala, N. J., Riberdy, J. M., Abay, T., Stickels, R. R., Verboon, J. M., Liu, V., Buquicchio, F. A., Wang, F., Southard, J., Song, R., Li, W., Shrestha, A., Parida, L., Getz, G., Maus, M. V., Li, S., Moore, A., Roberts, Z. J., Ludwig, L. S., Talleur, A. C., Thomas, P. G., Dehghani, H., Pertel, T., Kundaje, A., Gottschalk, S., Roth, T. L., Kersten, M. J., Wu, C. J., Majzner, R. G., Satpathy, A. T. 2023

    Abstract

    Cell therapies have yielded durable clinical benefits for patients with cancer, but the risks associated with the development of therapies from manipulated human cells are understudied. For example, we lack a comprehensive understanding of the mechanisms of toxicities observed in patients receiving T cell therapies, including recent reports of encephalitis caused by reactivation of human herpesvirus 6 (HHV-6) [1]. Here, through petabase-scale viral genomics mining, we examine the landscape of human latent viral reactivation and demonstrate that HHV-6B can become reactivated in cultures of human CD4+ T cells. Using single-cell sequencing, we identify a rare population of HHV-6 'super-expressors' (about 1 in 300-10,000 cells) that possess high viral transcriptional activity, among research-grade allogeneic chimeric antigen receptor (CAR) T cells. By analysing single-cell sequencing data from patients receiving cell therapy products that are approved by the US Food and Drug Administration [2] or are in clinical studies [3-5], we identify the presence of HHV-6-super-expressor CAR T cells in patients in vivo. Together, the findings of our study demonstrate the utility of comprehensive genomics analyses in implicating cell therapy products as a potential source contributing to the lytic HHV-6 infection that has been reported in clinical trials [1,6-8] and may influence the design and production of autologous and allogeneic cell therapies.

    View details for DOI 10.1038/s41586-023-06704-2

    View details for PubMedID 37938768

    View details for PubMedCentralID 9827115

  • The chromatin landscape of the euryarchaeon Haloferax volcanii. Genome biology Marinov, G. K., Bagdatli, S. T., Wu, T., He, C., Kundaje, A., Greenleaf, W. J. 2023; 24 (1): 253

    Abstract

    BACKGROUND: Archaea, together with Bacteria, represent the two main divisions of life on Earth, with many of the defining characteristics of the more complex eukaryotes tracing their origin to evolutionary innovations first made in their archaeal ancestors. One of the most notable such features is nucleosomal chromatin, although archaeal histones and chromatin differ significantly from those of eukaryotes, not all archaea possess histones and it is not clear if histones are a main packaging component for all that do. Despite increased interest in archaeal chromatin in recent years, its properties have been little studied using genomic tools. RESULTS: Here, we adapt the ATAC-seq assay to archaea and use it to map the accessible landscape of the genome of the euryarchaeote Haloferax volcanii. We integrate the resulting datasets with genome-wide maps of active transcription and single-stranded DNA (ssDNA) and find that while H. volcanii promoters exist in a preferentially accessible state, unlike most eukaryotes, modulation of transcriptional activity is not associated with changes in promoter accessibility. Applying orthogonal single-molecule footprinting methods, we quantify the absolute levels of physical protection of H. volcanii and find that Haloferax chromatin is similarly or only slightly more accessible, in aggregate, than that of eukaryotes. We also evaluate the degree of coordination of transcription within archaeal operons and make the unexpected observation that some CRISPR arrays are associated with highly prevalent ssDNA structures. CONCLUSIONS: Our results provide the first comprehensive maps of chromatin accessibility and active transcription in Haloferax across conditions and thus a foundation for future functional studies of archaeal chromatin.

    View details for DOI 10.1186/s13059-023-03095-5

    View details for PubMedID 37932847

  • Transcriptomics and chromatin accessibility in multiple African population samples. bioRxiv : the preprint server for biology DeGorter, M. K., Goddard, P. C., Karakoc, E., Kundu, S., Yan, S. M., Nachun, D., Abell, N., Aguirre, M., Carstensen, T., Chen, Z., Durrant, M., Dwaracherla, V. R., Feng, K., Gloudemans, M. J., Hunter, N., Moorthy, M. P., Pomilla, C., Rodrigues, K. B., Smith, C. J., Smith, K. S., Ungar, R. A., Balliu, B., Fellay, J., Flicek, P., McLaren, P. J., Henn, B., McCoy, R. C., Sugden, L., Kundaje, A., Sandhu, M. S., Gurdasani, D., Montgomery, S. B. 2023

    Abstract

    Mapping the functional human genome and impact of genetic variants is often limited to European-descendent population samples. To aid in overcoming this limitation, we measured gene expression using RNA sequencing in lymphoblastoid cell lines (LCLs) from 599 individuals from six African populations to identify novel transcripts including those not represented in the hg38 reference genome. We used whole genomes from the 1000 Genomes Project and 164 Maasai individuals to identify 8,881 expression and 6,949 splicing quantitative trait loci (eQTLs/sQTLs), and 2,611 structural variants associated with gene expression (SV-eQTLs). We further profiled chromatin accessibility using ATAC-Seq in a subset of 100 representative individuals, to identify chromatin accessibility quantitative trait loci (caQTLs) and allele-specific chromatin accessibility, and provide predictions for the functional effect of 78.9 million variants on chromatin accessibility. Using this map of eQTLs and caQTLs we fine-mapped GWAS signals for a range of complex diseases. Combined, this work expands global functional genomic data to identify novel transcripts, functional elements and variants, understand population genetic history of molecular quantitative trait loci, and further resolve the genetic basis of multiple human traits and disease.

    View details for DOI 10.1101/2023.11.04.564839

    View details for PubMedID 37986808

    View details for PubMedCentralID PMC10659267

  • The landscape of the histone-organized chromatin of Bdellovibrionota bacteria. bioRxiv : the preprint server for biology Marinov, G. K., Doughty, B., Kundaje, A., Greenleaf, W. J. 2023

    Abstract

    Histone proteins have traditionally been thought to be restricted to eukaryotes and most archaea, with eukaryotic nucleosomal histones deriving from their archaeal ancestors. In contrast, bacteria lack histones as a rule. However, histone proteins have recently been identified in a few bacterial clades, most notably the phylum Bdellovibrionota, and these histones have been proposed to exhibit a range of divergent features compared to histones in archaea and eukaryotes. However, no functional genomic studies of the properties of Bdellovibrionota chromatin have been carried out. In this work, we map the landscape of chromatin accessibility, active transcription and three-dimensional genome organization in a member of Bdellovibrionota (a Bacteriovorax strain). We find that, similar to what is observed in some archaea and in eukaryotes with compact genomes such as yeast, Bacteriovorax chromatin is characterized by preferential accessibility around promoter regions. Similar to eukaryotes, chromatin accessibility in Bacteriovorax positively correlates with gene expression. Mapping active transcription through single-strand DNA (ssDNA) profiling revealed that unlike in yeast, but similar to the state of mammalian and fly promoters, Bacteriovorax promoters exhibit very strong polymerase pausing. Finally, similar to that of other bacteria without histones, the Bacteriovorax genome exists in a three-dimensional (3D) configuration organized by the parABS system along the axis defined by replication origin and termination regions. These results provide a foundation for understanding the chromatin biology of the unique Bdellovibrionota bacteria and the functional diversity in chromatin organization across the tree of life.

    View details for DOI 10.1101/2023.10.30.564843

    View details for PubMedID 37961278

    View details for PubMedCentralID PMC10634947

  • RNA polymerase II dynamics and mRNA stability feedback scale mRNA amounts with cell size. Cell Swaffer, M. P., Marinov, G. K., Zheng, H., Fuentes Valenzuela, L., Tsui, C. Y., Jones, A. W., Greenwood, J., Kundaje, A., Greenleaf, W. J., Reyes-Lamothe, R., Skotheim, J. M. 2023

    Abstract

    A fundamental feature of cellular growth is that total protein and RNA amounts increase with cell size to keep concentrations approximately constant. A key component of this is that global transcription rates increase in larger cells. Here, we identify RNA polymerase II (RNAPII) as the limiting factor scaling mRNA transcription with cell size in budding yeast, as transcription is highly sensitive to the dosage of RNAPII but not to other components of the transcriptional machinery. Our experiments support a dynamic equilibrium model where global RNAPII transcription at a given size is set by the mass action recruitment kinetics of unengaged nucleoplasmic RNAPII to the genome. However, this only drives a sub-linear increase in transcription with size, which is then partially compensated for by a decrease in mRNA decay rates as cells enlarge. Thus, limiting RNAPII and feedback on mRNA stability work in concert to scale mRNA amounts with cell size.

    View details for DOI 10.1016/j.cell.2023.10.012

    View details for PubMedID 37944513

  • Drug Discovery in Low Data Regimes: Leveraging a Computational Pipeline for the Discovery of Novel SARS-CoV-2 Nsp14-MTase Inhibitors. bioRxiv : the preprint server for biology Nigam, A., Hurley, M. F., Li, F., Konkoĭová, E., Klíma, M., Trylčová, J., Pollice, R., Çinaroǧlu, S. S., Levin-Konigsberg, R., Handjaya, J., Schapira, M., Chau, I., Perveen, S., Ng, H. L., Ümit Kaniskan, H., Han, Y., Singh, S., Gorgulla, C., Kundaje, A., Jin, J., Voelz, V. A., Weber, J., Nencka, R., Boura, E., Vedadi, M., Aspuru-Guzik, A. 2023

    Abstract

    The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has led to significant global morbidity and mortality. A crucial viral protein, the non-structural protein 14 (nsp14), catalyzes the methylation of viral RNA and plays a critical role in viral genome replication and transcription. Due to the low mutation rate in the nsp region among various SARS-CoV-2 variants, nsp14 has emerged as a promising therapeutic target. However, discovering potential inhibitors remains a challenge. In this work, we introduce a computational pipeline for the rapid and efficient identification of potential nsp14 inhibitors by leveraging virtual screening and the NCI open compound collection, which contains 250,000 freely available molecules for researchers worldwide. The introduced pipeline provides a cost-effective and efficient approach for early-stage drug discovery by allowing researchers to evaluate promising molecules without incurring synthesis expenses. Our pipeline successfully identified seven promising candidates after experimentally validating only 40 compounds. Notably, we discovered NSC620333, a compound that exhibits a strong binding affinity to nsp14 with a dissociation constant of 427 ± 84 nM. In addition, we gained new insights into the structure and function of this protein through molecular dynamics simulations. We identified new conformational states of the protein and determined that residues Phe367, Tyr368, and Gln354 within the binding pocket serve as stabilizing residues for novel ligand interactions. We also found that metal coordination complexes are crucial for the overall function of the binding pocket. Lastly, we present the solved crystal structure of the nsp14-MTase complexed with SS148, a potent inhibitor of methyltransferase activity at the nanomolar level (IC50 value of 70 ± 6 nM). Our computational pipeline accurately predicted the binding pose of SS148, demonstrating its effectiveness and potential in accelerating drug discovery efforts against SARS-CoV-2 and other emerging viruses.

    View details for DOI 10.1101/2023.10.03.560722

    View details for PubMedID 37873443

    View details for PubMedCentralID PMC10592886

  • Short tandem repeats bind transcription factors to tune eukaryotic gene expression. Science (New York, N.Y.) Horton, C. A., Alexandari, A. M., Hayes, M. G., Marklund, E., Schaepe, J. M., Aditham, A. K., Shah, N., Suzuki, P. H., Shrikumar, A., Afek, A., Greenleaf, W. J., Gordân, R., Zeitlinger, J., Kundaje, A., Fordyce, P. M. 2023; 381 (6664): eadd1250

    Abstract

    Short tandem repeats (STRs) are enriched in eukaryotic cis-regulatory elements and alter gene expression, yet how they regulate transcription remains unknown. We found that STRs modulate transcription factor (TF)-DNA affinities and apparent on-rates by about 70-fold by directly binding TF DNA-binding domains, with energetic impacts exceeding many consensus motif mutations. STRs maximize the number of weakly preferred microstates near target sites, thereby increasing TF density, with impacts well predicted by statistical mechanics. Confirming that STRs also affect TF binding in cells, neural networks trained only on in vivo occupancies predicted effects identical to those observed in vitro. Approximately 90% of TFs preferentially bound STRs that need not resemble known motifs, providing a cis-regulatory mechanism to target TFs to genomic sites.

    View details for DOI 10.1126/science.add1250

    View details for PubMedID 37733848

  • Genome-wide interaction analysis of folate for colorectal cancer risk. The American journal of clinical nutrition Bouras, E., Kim, A. E., Lin, Y., Morrison, J., Du, M., Albanes, D., Barry, E. L., Baurley, J. W., Berndt, S. I., Bien, S. A., Bishop, T. D., Brenner, H., Budiarto, A., Burnett-Hartman, A., Campbell, P. T., Carreras-Torres, R., Casey, G., Cenggoro, T. W., Chan, A. T., Chang-Claude, J., Conti, D. V., Cotterchio, M., Devall, M., Diez-Obrero, V., Dimou, N., Drew, D. A., Figueiredo, J. C., Giles, G. G., Gruber, S. B., Gunter, M. J., Harrison, T. A., Hidaka, A., Hoffmeister, M., Huyghe, J. R., Joshi, A. D., Kawaguchi, E. S., Keku, T. O., Kundaje, A., Le Marchand, L., Lewinger, J. P., Li, L., Lynch, B. M., Mahesworo, B., Männistö, S., Moreno, V., Murphy, N., Newcomb, P. A., Obón-Santacana, M., Ose, J., Palmer, J. R., Papadimitriou, N., Pardamean, B., Pellatt, A. J., Peoples, A. R., Platz, E. A., Potter, J. D., Qi, L., Qu, C., Rennert, G., Ruiz-Narvaez, E., Sakoda, L. C., Schmit, S. L., Shcherbina, A., Stern, M. C., Su, Y. R., Tangen, C. M., Thomas, D. C., Tian, Y., Um, C. Y., van Duijnhoven, F. J., Van Guelpen, B., Visvanathan, K., Wang, J., White, E., Wolk, A., Woods, M. O., Ulrich, C. M., Hsu, L., Gauderman, W. J., Peters, U., Tsilidis, K. K. 2023

    Abstract

    Epidemiological and experimental evidence suggests that higher folate intake is associated with a decreased colorectal cancer (CRC) risk; however, the mechanisms underlying this relationship are not fully understood. Genetic variation that may have a direct or indirect impact on folate metabolism can provide insights into folate's role in CRC. Our aim was to perform a genome-wide interaction analysis to identify genetic variants that may modify the association of folate on CRC risk. We applied traditional case-control logistic regression, joint 3-degree of freedom (3DF), and a two-step weighted hypothesis approach to test the interactions of common variants (allele frequency >1%) across the genome and dietary folate, folic acid supplement use, and total folate in relation to risk of CRC, in 30,550 cases and 42,336 controls from 51 studies from 3 genetic consortia (CCFR, CORECT, GECCO). Inverse associations of dietary, total folate, and folic acid supplement with CRC were found [odds ratio: 0.93 (95% confidence intervals [CI]: 0.90-0.96), and 0.91 (0.89-0.94) per quartile higher intake, and 0.82 (0.78-0.88) for users vs. non-users, respectively]. Interactions (P-interaction < 5×10^-8) of folic acid supplement and variants in the 3p25.2 locus [in the region of Synapsin II (SYN2)/tissue inhibitor of metalloproteinase 4 (TIMP4)] were found using the traditional interaction analysis, with variant rs150924902 (located upstream to SYN2) showing the strongest interaction. In stratified analyses by rs150924902 genotypes, folate supplement was associated with decreased CRC risk among those carrying the TT genotype (OR = 0.82; 95% CI: 0.79-0.86) but increased CRC risk among those carrying the TA genotype (OR = 1.63; 95% CI: 1.29-2.05), suggesting a qualitative interaction (P-interaction = 1.4×10^-8). No interactions were observed for dietary and total folate. Variation in 3p25.2 locus may modify the association of folate supplement with CRC risk. Experimental studies and studies incorporating other relevant -omics data are warranted to validate this finding.

    View details for DOI 10.1016/j.ajcnut.2023.08.010

    View details for PubMedID 37640106

  • Chromatin accessibility in the Drosophila embryo is determined by transcription factor pioneering and enhancer activation. Developmental cell Brennan, K. J., Weilert, M., Krueger, S., Pampari, A., Liu, H. Y., Yang, A. W., Morrison, J. A., Hughes, T. R., Rushlow, C. A., Kundaje, A., Zeitlinger, J. 2023

    Abstract

    Chromatin accessibility is integral to the process by which transcription factors (TFs) read out cis-regulatory DNA sequences, but it is difficult to differentiate between TFs that drive accessibility and those that do not. Deep learning models that learn complex sequence rules provide an unprecedented opportunity to dissect this problem. Using zygotic genome activation in Drosophila as a model, we analyzed high-resolution TF binding and chromatin accessibility data with interpretable deep learning and performed genetic validation experiments. We identify a hierarchical relationship between the pioneer TF Zelda and the TFs involved in axis patterning. Zelda consistently pioneers chromatin accessibility proportional to motif affinity, whereas patterning TFs augment chromatin accessibility in sequence contexts where they mediate enhancer activation. We conclude that chromatin accessibility occurs in two tiers: one through pioneering, which makes enhancers accessible but not necessarily active, and the second when the correct combination of TFs leads to enhancer activation.

    View details for DOI 10.1016/j.devcel.2023.07.007

    View details for PubMedID 37557175

  • The ENCODE Uniform Analysis Pipelines. Research square Hitz, B. C., Lee, J. W., Jolanki, O., Kagda, M. S., Graham, K., Sud, P., Gabdank, I., Strattan, J. S., Sloan, C. A., Dreszer, T., Rowe, L. D., Podduturi, N. R., Malladi, V. S., Chan, E. T., Davidson, J. M., Ho, M., Miyasato, S., Simison, M., Tanaka, F., Luo, Y., Whaling, I., Hong, E. L., Lee, B. T., Sandstrom, R., Rynes, E., Nelson, J., Nishida, A., Ingersoll, A., Buckley, M., Frerker, M., Kim, D. S., Boley, N., Trout, D., Dobin, A., Rahmanian, S., Wyman, D., Balderrama-Gutierrez, G., Reese, F., Durand, N. C., Dudchenko, O., Weisz, D., Rao, S. S., Blackburn, A., Gkountaroulis, D., Sadr, M., Olshansky, M., Eliaz, Y., Nguyen, D., Bochkov, I., Shamim, M. S., Mahajan, R., Aiden, E., Gingeras, T., Heath, S., Hirst, M., Kent, W. J., Kundaje, A., Mortazavi, A., Wold, B., Cherry, J. M. 2023

    Abstract

    The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19,000 functional genomics experiments across more than 1,000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/), is publicly available in GitHub, with images available on Dockerhub (https://hub.docker.com), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud gives small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.

    View details for DOI 10.21203/rs.3.rs-3111932/v1

    View details for PubMedID 37503119

    View details for PubMedCentralID PMC10371165

  • Advances and prospects for the Human BioMolecular Atlas Program (HuBMAP). Nature cell biology Jain, S., Pei, L., Spraggins, J. M., Angelo, M., Carson, J. P., Gehlenborg, N., Ginty, F., Gonçalves, J. P., Hagood, J. S., Hickey, J. W., Kelleher, N. L., Laurent, L. C., Lin, S., Lin, Y., Liu, H., Naba, A., Nakayasu, E. S., Qian, W. J., Radtke, A., Robson, P., Stockwell, B. R., Van de Plas, R., Vlachos, I. S., Zhou, M., Börner, K., Snyder, M. P. 2023

    Abstract

    The Human BioMolecular Atlas Program (HuBMAP) aims to create a multi-scale spatial atlas of the healthy human body at single-cell resolution by applying advanced technologies and disseminating resources to the community. As the HuBMAP moves past its first phase, creating ontologies, protocols and pipelines, this Perspective introduces the production phase: the generation of reference spatial maps of functional tissue units across many organs from diverse populations and the creation of mapping tools and infrastructure to advance biomedical research.

    View details for DOI 10.1038/s41556-023-01194-w

    View details for PubMedID 37468756

    View details for PubMedCentralID 8238499

  • Chromatin accessibility dynamics of neurogenic niche cells reveal defects in neural stem cell adhesion and migration during aging. Nature aging Yeo, R. W., Zhou, O. Y., Zhong, B. L., Sun, E. D., Navarro Negredo, P., Nair, S., Sharmin, M., Ruetz, T. J., Wilson, M., Kundaje, A., Dunn, A. R., Brunet, A. 2023

    Abstract

    The regenerative potential of brain stem cell niches deteriorates during aging. Yet the mechanisms underlying this decline are largely unknown. Here we characterize genome-wide chromatin accessibility of neurogenic niche cells in vivo during aging. Interestingly, chromatin accessibility at adhesion and migration genes decreases with age in quiescent neural stem cells (NSCs) but increases with age in activated (proliferative) NSCs. Quiescent and activated NSCs exhibit opposing adhesion behaviors during aging: quiescent NSCs become less adhesive, whereas activated NSCs become more adhesive. Old activated NSCs also show decreased migration in vitro and diminished mobilization out of the niche for neurogenesis in vivo. Using tension sensors, we find that aging increases force-producing adhesions in activated NSCs. Inhibiting the cytoskeletal-regulating kinase ROCK reduces these adhesions, restores migration in old activated NSCs in vitro, and boosts neurogenesis in vivo. These results have implications for restoring the migratory potential of NSCs and for improving neurogenesis in the aged brain.

    View details for DOI 10.1038/s43587-023-00449-3

    View details for PubMedID 37443352

    View details for PubMedCentralID 4683085

  • Single-cell multi-omics of mitochondrial DNA disorders reveals dynamics of purifying selection across human immune cells. Nature genetics Lareau, C. A., Dubois, S. M., Buquicchio, F. A., Hsieh, Y. H., Garg, K., Kautz, P., Nitsch, L., Praktiknjo, S. D., Maschmeyer, P., Verboon, J. M., Gutierrez, J. C., Yin, Y., Fiskin, E., Luo, W., Mimitou, E. P., Muus, C., Malhotra, R., Parikh, S., Fleming, M. D., Oevermann, L., Schulte, J., Eckert, C., Kundaje, A., Smibert, P., Vardhana, S. A., Satpathy, A. T., Regev, A., Sankaran, V. G., Agarwal, S., Ludwig, L. S. 2023

    Abstract

    Pathogenic mutations in mitochondrial DNA (mtDNA) compromise cellular metabolism, contributing to cellular heterogeneity and disease. Diverse mutations are associated with diverse clinical phenotypes, suggesting distinct organ- and cell-type-specific metabolic vulnerabilities. Here we establish a multi-omics approach to quantify deletions in mtDNA alongside cell state features in single cells derived from six patients across the phenotypic spectrum of single large-scale mtDNA deletions (SLSMDs). By profiling 206,663 cells, we reveal the dynamics of pathogenic mtDNA deletion heteroplasmy consistent with purifying selection and distinct metabolic vulnerabilities across T-cell states in vivo and validate these observations in vitro. By extending analyses to hematopoietic and erythroid progenitors, we reveal mtDNA dynamics and cell-type-specific gene regulatory adaptations, demonstrating the context-dependence of perturbing mitochondrial genomic integrity. Collectively, we report pathogenic mtDNA heteroplasmy dynamics of individual blood and immune cells across lineages, demonstrating the power of single-cell multi-omics for revealing fundamental properties of mitochondrial genetics.

    View details for DOI 10.1038/s41588-023-01433-8

    View details for PubMedID 37386249

    View details for PubMedCentralID 3809581

  • Probing the diabetes and colorectal cancer relationship using gene - environment interaction analyses. British journal of cancer Dimou, N., Kim, A. E., Flanagan, O., Murphy, N., Diez-Obrero, V., Shcherbina, A., Aglago, E. K., Bouras, E., Campbell, P. T., Casey, G., Gallinger, S., Gruber, S. B., Jenkins, M. A., Lin, Y., Moreno, V., Ruiz-Narvaez, E., Stern, M. C., Tian, Y., Tsilidis, K. K., Arndt, V., Barry, E. L., Baurley, J. W., Berndt, S. I., Bézieau, S., Bien, S. A., Bishop, D. T., Brenner, H., Budiarto, A., Carreras-Torres, R., Cenggoro, T. W., Chan, A. T., Chang-Claude, J., Chanock, S. J., Chen, X., Conti, D. V., Dampier, C. H., Devall, M., Drew, D. A., Figueiredo, J. C., Giles, G. G., Gsur, A., Harrison, T. A., Hidaka, A., Hoffmeister, M., Huyghe, J. R., Jordahl, K., Kawaguchi, E., Keku, T. O., Larsson, S. C., Le Marchand, L., Lewinger, J. P., Li, L., Mahesworo, B., Morrison, J., Newcomb, P. A., Newton, C. C., Obon-Santacana, M., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Peoples, A. R., Pharoah, P. D., Platz, E. A., Potter, J. D., Rennert, G., Scacheri, P. C., Schoen, R. E., Su, Y. R., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Ulrich, C. M., Um, C. Y., van Duijnhoven, F. J., Visvanathan, K., Vodicka, P., Vodickova, L., White, E., Wolk, A., Woods, M. O., Qu, C., Kundaje, A., Hsu, L., Gauderman, W. J., Gunter, M. J., Peters, U. 2023

    Abstract

    Diabetes is an established risk factor for colorectal cancer. However, the mechanisms underlying this relationship still require investigation and it is not known if the association is modified by genetic variants. To address these questions, we undertook a genome-wide gene-environment interaction analysis. We used data from 3 genetic consortia (CCFR, CORECT, GECCO; 31,318 colorectal cancer cases/41,499 controls) and undertook genome-wide gene-environment interaction analyses with colorectal cancer risk, including interaction tests of genetics (G) × diabetes (1-degree of freedom; d.f.) and joint testing of G × diabetes, G-colorectal cancer association (2-d.f. joint test) and G-diabetes correlation (3-d.f. joint test). Based on the joint tests, we found that the association of diabetes with colorectal cancer risk is modified by loci on chromosomes 8q24.11 (rs3802177, SLC30A8 - OR_AA: 1.62, 95% CI: 1.34-1.96; OR_AG: 1.41, 95% CI: 1.30-1.54; OR_GG: 1.22, 95% CI: 1.13-1.31; 3-d.f. p-value: 5.46 × 10^-11) and 13q14.13 (rs9526201, LRCH1 - OR_GG: 2.11, 95% CI: 1.56-2.83; OR_GA: 1.52, 95% CI: 1.38-1.68; OR_AA: 1.13, 95% CI: 1.06-1.21; 2-d.f. p-value: 7.84 × 10^-9). These results suggest that variation in genes related to insulin signaling (SLC30A8) and immune function (LRCH1) may modify the association of diabetes with colorectal cancer risk and provide novel insights into the biology underlying the diabetes and colorectal cancer relationship.

    View details for DOI 10.1038/s41416-023-02312-z

    View details for PubMedID 37365285

    View details for PubMedCentralID 6767750

  • A genetic locus within the FMN1/GREM1 gene region interacts with body mass index in colorectal cancer risk. Cancer research Aglago, E. K., Kim, A. E., Lin, Y., Qu, C., Evangelou, M., Ren, Y., Morrison, J., Albanes, D., Arndt, V., Barry, E. L., Baurley, J. W., Berndt, S. I., Bien, S. A., Bishop, D. T., Bouras, E., Brenner, H., Buchanan, D. D., Budiarto, A., Carreras-Torres, R., Casey, G., Cenggoro, T. W., Chan, A. T., Chang-Claude, J., Chen, X., Conti, D. V., Devall, M., Díez-Obrero, V., Dimou, N., Drew, D., Figueiredo, J. C., Gallinger, S., Giles, G. G., Gruber, S. B., Gsur, A., Gunter, M. J., Hampel, H., Harlid, S., Hidaka, A., Harrison, T. A., Hoffmeister, M., Huyghe, J. R., Jenkins, M. A., Jordahl, K., Joshi, A. D., Kawaguchi, E. S., Keku, T. O., Kundaje, A., Larsson, S. C., Le Marchand, L., Lewinger, J. P., Li, L., Lynch, B. M., Mahesworo, B., Mandic, M., Obón-Santacana, M., Moreno, V., Murphy, N., Nan, H., Nassir, R., Newcomb, P. A., Ogino, S., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Peoples, A. R., Platz, E. A., Potter, J. D., Prentice, R. L., Rennert, G., Ruiz-Narvaez, E., Sakoda, L. C., Scacheri, P. C., Schmit, S. L., Schoen, R. E., Shcherbina, A., Slattery, M. L., Stern, M. C., Su, Y. R., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Tian, Y., Ulrich, C. M., van Duijnhoven, F. J., Van Guelpen, B., Visvanathan, K., Vodicka, P., Wang, J., White, E., Wolk, A., Woods, M. O., Wu, A. H., Zemlianskaia, N., Hsu, L., Gauderman, W. J., Peters, U., Tsilidis, K. K., Campbell, P. T. 2023

    Abstract

    Colorectal cancer (CRC) risk can be impacted by genetic, environmental, and lifestyle factors, including diet and obesity. Gene-environment (G×E) interactions can provide biological insights into the effects of obesity on CRC risk. Here, we assessed potential genome-wide G×E interactions between body mass index (BMI) and common single nucleotide polymorphisms (SNPs) for CRC risk using data from 36,415 CRC cases and 48,451 controls from three international CRC consortia (CCFR, CORECT, and GECCO). The G×E tests included the conventional logistic regression using multiplicative terms (one-degree of freedom, 1DF test), the two-step EDGE method, and the joint 3DF test, each of which is powerful for detecting G×E interactions under specific conditions. BMI was associated with higher CRC risk. The two-step approach revealed a statistically significant G×BMI interaction located within the Formin 1/Gremlin 1 (FMN1/GREM1) gene region (rs58349661). This SNP was also identified by the 3DF test, with a suggestive statistical significance in the 1DF test. Among participants with the CC genotype of rs58349661, overweight and obesity categories were associated with higher CRC risk, whereas null associations were observed across BMI categories in those with the TT genotype. Using data from three large international consortia, this study discovered a locus in the FMN1/GREM1 gene region that interacts with BMI on the association with CRC risk. Further studies should examine the potential mechanisms through which this locus modifies the etiologic link between obesity and CRC.

    View details for DOI 10.1158/0008-5472.CAN-22-3713

    View details for PubMedID 37249599

  • De novo distillation of thermodynamic affinity from deep learning regulatory sequence models of in vivo protein-DNA binding. bioRxiv : the preprint server for biology Alexandari, A. M., Horton, C. A., Shrikumar, A., Shah, N., Li, E., Weilert, M., Pufall, M. A., Zeitlinger, J., Fordyce, P. M., Kundaje, A. 2023

    Abstract

    Transcription factors (TF) are proteins that bind DNA in a sequence-specific manner to regulate gene transcription. Despite their unique intrinsic sequence preferences, in vivo genomic occupancy profiles of TFs differ across cellular contexts. Hence, deciphering the sequence determinants of TF binding, both intrinsic and context-specific, is essential to understand gene regulation and the impact of regulatory, non-coding genetic variation. Biophysical models trained on in vitro TF binding assays can estimate intrinsic affinity landscapes and predict occupancy based on TF concentration and affinity. However, these models cannot adequately explain context-specific, in vivo binding profiles. Conversely, deep learning models, trained on in vivo TF binding assays, effectively predict and explain genomic occupancy profiles as a function of complex regulatory sequence syntax, albeit without a clear biophysical interpretation. To reconcile these complementary models of in vitro and in vivo TF binding, we developed Affinity Distillation (AD), a method that extracts thermodynamic affinities de-novo from deep learning models of TF chromatin immunoprecipitation (ChIP) experiments by marginalizing away the influence of genomic sequence context. Applied to neural networks modeling diverse classes of yeast and mammalian TFs, AD predicts energetic impacts of sequence variation within and surrounding motifs on TF binding as measured by diverse in vitro assays with superior dynamic range and accuracy compared to motif-based methods. Furthermore, AD can accurately discern affinities of TF paralogs. Our results highlight thermodynamic affinity as a key determinant of in vivo binding, suggest that deep learning models of in vivo binding implicitly learn high-resolution affinity landscapes, and show that these affinities can be successfully distilled using AD. 
    This new biophysical interpretation of deep learning models enables high-throughput in silico experiments to explore the influence of sequence context and variation on both intrinsic affinity and in vivo occupancy.

    View details for DOI 10.1101/2023.05.11.540401

    View details for PubMedID 37214836

  • CasKAS: direct profiling of genome-wide dCas9 and Cas9 specificity using ssDNA mapping. Genome biology Marinov, G. K., Kim, S. H., Bagdatli, S. T., Higashino, S. I., Trevino, A. E., Tycko, J., Wu, T., Bintu, L., Bassik, M. C., He, C., Kundaje, A., Greenleaf, W. J. 2023; 24 (1): 85

    Abstract

    Detecting and mitigating off-target activity is critical to the practical application of CRISPR-mediated genome and epigenome editing. While numerous methods have been developed to map Cas9 binding specificity genome-wide, they are generally time-consuming and/or expensive, and not applicable to catalytically dead CRISPR enzymes. We have developed CasKAS, a rapid, inexpensive, and facile assay for identifying off-target CRISPR enzyme binding and cleavage by chemically mapping the unwound single-stranded DNA structures formed upon binding of a sgRNA-loaded Cas9 protein. We demonstrate this method in both in vitro and in vivo contexts.

    View details for DOI 10.1186/s13059-023-02930-z

    View details for PubMedID 37085898

    View details for PubMedCentralID PMC10120127

  • The ENCODE Imputation Challenge: a critical assessment of methods for cross-cell type imputation of epigenomic profiles. Genome biology Schreiber, J., Boix, C., Wook Lee, J., Li, H., Guan, Y., Chang, C. C., Chang, J. C., Hawkins-Hooker, A., Schölkopf, B., Schweikert, G., Carulla, M. R., Canakoglu, A., Guzzo, F., Nanni, L., Masseroli, M., Carman, M. J., Pinoli, P., Hong, C., Yip, K. Y., Spence, J. P., Batra, S. S., Song, Y. S., Mahony, S., Zhang, Z., Tan, W., Shen, Y., Sun, Y., Shi, M., Adrian, J., Sandstrom, R., Farrell, N., Halow, J., Lee, K., Jiang, L., Yang, X., Epstein, C., Strattan, J. S., Bernstein, B., Snyder, M., Kellis, M., Stafford, W., Kundaje, A. 2023; 24 (1): 79

    Abstract

    A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.

    View details for DOI 10.1186/s13059-023-02915-y

    View details for PubMedID 37072822

    View details for PubMedCentralID PMC10111747

  • The ENCODE Uniform Analysis Pipelines. bioRxiv : the preprint server for biology Hitz, B. C., Jin-Wook, L., Jolanki, O., Kagda, M. S., Graham, K., Sud, P., Gabdank, I., Strattan, J. S., Sloan, C. A., Dreszer, T., Rowe, L. D., Podduturi, N. R., Malladi, V. S., Chan, E. T., Davidson, J. M., Ho, M., Miyasato, S., Simison, M., Tanaka, F., Luo, Y., Whaling, I., Hong, E. L., Lee, B. T., Sandstrom, R., Rynes, E., Nelson, J., Nishida, A., Ingersoll, A., Buckley, M., Frerker, M., Kim, D. S., Boley, N., Trout, D., Dobin, A., Rahmanian, S., Wyman, D., Balderrama-Gutierrez, G., Reese, F., Durand, N. C., Dudchenko, O., Weisz, D., Rao, S. S., Blackburn, A., Gkountaroulis, D., Sadr, M., Olshansky, M., Eliaz, Y., Nguyen, D., Bochkov, I., Shamim, M. S., Mahajan, R., Aiden, E., Gingeras, T., Heath, S., Hirst, M., Kent, W. J., Kundaje, A., Mortazavi, A., Wold, B., Cherry, J. M. 2023

    Abstract

    The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/) is publicly available in GitHub, with images available on Dockerhub (https://hub.docker.com), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.

    View details for DOI 10.1101/2023.04.04.535623

    View details for PubMedID 37066421

    View details for PubMedCentralID PMC10104020

  • The polyclonal path to malignant transformation in familial adenomatous polyposis Schenck, R. O., Khan, A., Horning, A., Esplin, E. D., Monte, E., Wu, S., Hanson, C., Bararpour, N., Neves, S., Jiang, L., Contrepois, K., Lee, H., Guha, T. K., Hu, Z., Laquindanum, R., Mills, M. A., Chaib, H., Chiu, R., Jian, R., Chan, J., Ellenberger, M., Becker, W. R., Bahmani, B., Michael, B., Shen, J., Lancaster, S., Ladabaum, U., Kundaje, A., Longacre, T. A., Greenleaf, W. J., Ford, J. M., Snyder, M. P., Curtis, C. AMER ASSOC CANCER RESEARCH. 2023
  • Genome-Wide Analyses Characterize Shared Heritability Among Cancers and Identify Novel Cancer Susceptibility Regions. Journal of the National Cancer Institute Lindström, S., Wang, L., Feng, H., Majumdar, A., Huo, S., Macdonald, J., Harrison, T., Turman, C., Chen, H., Mancuso, N., Bammler, T., Gallinger, S., Gruber, S. B., Gunter, M. J., Le Marchand, L., Moreno, V., Offit, K., de Vivo, I., O'Mara, T. A., Spurdle, A. B., Tomlinson, I., Fitzgerald, R., Gharahkhani, P., Gockel, I., Jankowski, J., Macgregor, S., Schumacher, J., Barnholtz-Sloan, J., Bondy, M. L., Houlston, R. S., Jenkins, R. B., Melin, B., Wrensch, M., Brennan, P., Christiani, D., Johansson, M., Mckay, J., Aldrich, M. C., Amos, C. I., Landi, M. T., Tardon, A., Bishop, D. T., Demenais, F., Goldstein, A. M., Iles, M. M., Kanetsky, P. A., Law, M. H., Amundadottir, L. T., Stolzenberg-Solomon, R., Wolpin, B. M., Klein, A., Petersen, G., Risch, H., Chanock, S. J., Purdue, M. P., Scelo, G., Pharoah, P., Kar, S., Hung, R. J., Pasaniuc, B., Kraft, P. 2023

    Abstract

    The shared inherited genetic contribution to risk of different cancers is not fully known. In this study, we leverage results from twelve cancer genome-wide association studies (GWAS) to quantify pair-wise genome-wide genetic correlations across cancers and identify novel cancer susceptibility loci. We collected GWAS summary statistics for twelve solid cancers based on 376,759 cancer cases and 532,864 controls of European ancestry. The included cancer types were breast, colorectal, endometrial, esophageal, glioma, head and neck, lung, melanoma, ovarian, pancreatic, prostate, and renal cancers. We conducted cross-cancer GWAS and transcriptome-wide association studies (TWAS) to discover novel cancer susceptibility loci. Finally, we assessed the extent of variant-specific pleiotropy among cancers at known and newly identified cancer susceptibility loci. We observed wide-spread but modest genome-wide genetic correlations across cancers. In cross-cancer GWAS and TWAS, we identified 15 novel cancer susceptibility loci. Additionally, we identified multiple variants at 77 distinct loci with strong evidence of being associated with at least two cancer types by testing for pleiotropy at known cancer susceptibility loci. Overall, these results suggest that some genetic risk variants are shared among cancers, though much of cancer heritability is cancer- and thus tissue-specific. The increase in statistical power associated with larger sample sizes in cross-disease analysis allows for the identification of novel susceptibility regions. Future studies incorporating data on multiple cancer types are likely to identify additional regions associated with the risk of multiple cancer types.

    View details for DOI 10.1093/jnci/djad043

    View details for PubMedID 36929942

  • Aberrant phase separation is a common killing strategy of positively charged peptides in biology and human disease. bioRxiv : the preprint server for biology Boeynaems, S., Ma, X. R., Yeong, V., Ginell, G. M., Chen, J. H., Blum, J. A., Nakayama, L., Sanyal, A., Briner, A., Haver, D. V., Pauwels, J., Ekman, A., Schmidt, H. B., Sundararajan, K., Porta, L., Lasker, K., Larabell, C., Hayashi, M. A., Kundaje, A., Impens, F., Obermeyer, A., Holehouse, A. S., Gitler, A. D. 2023

    Abstract

    Positively charged repeat peptides are emerging as key players in neurodegenerative diseases. These peptides can perturb diverse cellular pathways but a unifying framework for how such promiscuous toxicity arises has remained elusive. We used mass-spectrometry-based proteomics to define the protein targets of these neurotoxic peptides and found that they all share similar sequence features that drive their aberrant condensation with these positively charged peptides. We trained a machine learning algorithm to detect such sequence features and unexpectedly discovered that this mode of toxicity is not limited to human repeat expansion disorders but has evolved countless times across the tree of life in the form of cationic antimicrobial and venom peptides. We demonstrate that an excess in positive charge is necessary and sufficient for this killer activity, which we name 'polycation poisoning'. These findings reveal an ancient and conserved mechanism and inform ways to leverage its design rules for new generations of bioactive peptides.

    View details for DOI 10.1101/2023.03.09.531820

    View details for PubMedID 36945394

    View details for PubMedCentralID PMC10028949

  • Author Correction: Deciphering colorectal cancer genetics through multi-omic analysis of 100,204 cases and 154,587 controls of European and east Asian ancestries. Nature genetics Fernandez-Rozadilla, C., Timofeeva, M., Chen, Z., Law, P., Thomas, M., Schmit, S., Diez-Obrero, V., Hsu, L., Fernandez-Tajes, J., Palles, C., Sherwood, K., Briggs, S., Svinti, V., Donnelly, K., Farrington, S., Blackmur, J., Vaughan-Shaw, P., Shu, X., Long, J., Cai, Q., Guo, X., Lu, Y., Broderick, P., Studd, J., Huyghe, J., Harrison, T., Conti, D., Dampier, C., Devall, M., Schumacher, F., Melas, M., Rennert, G., Obon-Santacana, M., Martin-Sanchez, V., Moratalla-Navarro, F., Oh, J. H., Kim, J., Jee, S. H., Jung, K. J., Kweon, S., Shin, M., Shin, A., Ahn, Y., Kim, D., Oze, I., Wen, W., Matsuo, K., Matsuda, K., Tanikawa, C., Ren, Z., Gao, Y., Jia, W., Hopper, J., Jenkins, M., Win, A. K., Pai, R., Figueiredo, J., Haile, R., Gallinger, S., Woods, M., Newcomb, P., Duggan, D., Cheadle, J., Kaplan, R., Maughan, T., Kerr, R., Kerr, D., Kirac, I., Bohm, J., Mecklin, L., Jousilahti, P., Knekt, P., Aaltonen, L., Rissanen, H., Pukkala, E., Eriksson, J., Cajuso, T., Hanninen, U., Kondelin, J., Palin, K., Tanskanen, T., Renkonen-Sinisalo, L., Zanke, B., Mannisto, S., Albanes, D., Weinstein, S., Ruiz-Narvaez, E., Palmer, J., Buchanan, D., Platz, E., Visvanathan, K., Ulrich, C., Siegel, E., Brezina, S., Gsur, A., Campbell, P., Chang-Claude, J., Hoffmeister, M., Brenner, H., Slattery, M., Potter, J., Tsilidis, K., Schulze, M., Gunter, M., Murphy, N., Castells, A., Castellvi-Bel, S., Moreira, L., Arndt, V., Shcherbina, A., Stern, M., Pardamean, B., Bishop, T., Giles, G., Southey, M., Idos, G., McDonnell, K., Abu-Ful, Z., Greenson, J., Shulman, K., Lejbkowicz, F., Offit, K., Su, Y., Steinfelder, R., Keku, T., van Guelpen, B., Hudson, T., Hampel, H., Pearlman, R., Berndt, S., Hayes, R., Martinez, M. E., Thomas, S., Corley, D., Pharoah, P., Larsson, S., Yen, Y., Lenz, H., White, E., Li, L., Doheny, K., Pugh, E., Shelford, T., Chan, A., Cruz-Correa, M., Lindblom, A., Hunter, D., Joshi, A., Schafmayer, C., Scacheri, P., Kundaje, A., Nickerson, D., Schoen, R., Hampe, J., Stadler, Z., Vodicka, P., Vodickova, L., Vymetalkova, V., Papadopoulos, N., Edlund, C., Gauderman, W., Thomas, D., Shibata, D., Toland, A., Markowitz, S., Kim, A., Chanock, S., van Duijnhoven, F., Feskens, E., Sakoda, L., Gago-Dominguez, M., Wolk, A., Naccarati, A., Pardini, B., FitzGerald, L., Lee, S. C., Ogino, S., Bien, S., Kooperberg, C., Li, C., Lin, Y., Prentice, R., Qu, C., Bezieau, S., Tangen, C., Mardis, E., Yamaji, T., Sawada, N., Iwasaki, M., Haiman, C., Le Marchand, L., Wu, A., Qu, C., McNeil, C., Coetzee, G., Hayward, C., Deary, I., Harris, S., Theodoratou, E., Reid, S., Walker, M., Ooi, L. Y., Moreno, V., Casey, G., Gruber, S., Tomlinson, I., Zheng, W., Dunlop, M., Houlston, R., Peters, U. 2023

    View details for DOI 10.1038/s41588-023-01334-w

    View details for PubMedID 36782065

  • CPA-Perturb-seq: Multiplexed single-cell characterization of alternative polyadenylation regulators. bioRxiv : the preprint server for biology Kowalski, M. H., Wessels, H., Linder, J., Choudhary, S., Hartman, A., Hao, Y., Mascio, I., Dalgarno, C., Kundaje, A., Satija, R. 2023

    Abstract

    Most mammalian genes have multiple polyA sites, representing a substantial source of transcript diversity that is governed by the cleavage and polyadenylation (CPA) regulatory machinery. To better understand how these proteins govern polyA site choice we introduce CPA-Perturb-seq, a multiplexed perturbation screen dataset of 42 known CPA regulators with a 3' scRNA-seq readout that enables transcriptome-wide inference of polyA site usage. We develop a statistical framework to specifically identify perturbation-dependent changes in intronic and tandem polyadenylation, and discover modules of co-regulated polyA sites exhibiting distinct functional properties. By training a multi-task deep neural network (APARENT-Perturb) on our dataset, we delineate a cis-regulatory code that predicts responsiveness to perturbation and reveals interactions between distinct regulatory complexes. Finally, we leverage our framework to re-analyze published scRNA-seq datasets, identifying new regulators that affect the relative abundance of alternatively polyadenylated transcripts, and characterizing extensive cellular heterogeneity in 3' UTR length amongst antibody-producing cells. Our work highlights the potential for multiplexed single-cell perturbation screens to further our understanding of post-transcriptional regulation in vitro and in vivo.

    View details for DOI 10.1101/2023.02.09.527751

    View details for PubMedID 36798324

  • Single-Molecule Mapping of Chromatin Accessibility Using NOMe-seq/dSMF. Methods in molecular biology (Clifton, N.J.) Hinks, M., Marinov, G. K., Kundaje, A., Bintu, L., Greenleaf, W. J. 2023; 2611: 101-119

    Abstract

    The bulk of gene expression regulation in most organisms is accomplished through the action of transcription factors (TFs) on cis-regulatory elements (CREs). In eukaryotes, these CREs are generally characterized by nucleosomal depletion and thus higher physical accessibility of DNA. Many methods exploit this property to map regions of high average accessibility, and thus putative active CREs, in bulk. However, these techniques do not provide information about coordinated patterns of accessibility along the same DNA molecule, nor do they map the absolute levels of occupancy/accessibility. SMF (Single-Molecule Footprinting) fills these gaps by leveraging recombinant DNA cytosine methyltransferases (MTase) to mark accessible locations on individual DNA molecules. In this chapter, we discuss current methods and important considerations for performing SMF experiments.

    View details for DOI 10.1007/978-1-0716-2899-7_8

    View details for PubMedID 36807067

  • Simultaneous Single-Cell Profiling of the Transcriptome and Accessible Chromatin Using SHARE-seq. Methods in molecular biology (Clifton, N.J.) Kim, S. H., Marinov, G. K., Bagdatli, S. T., Higashino, S. I., Shipony, Z., Kundaje, A., Greenleaf, W. J. 2023; 2611: 187-230

    Abstract

    The ability to analyze the transcriptomic and epigenomic states of individual single cells has in recent years transformed our ability to measure and understand biological processes. Recent advancements have focused on increasing sensitivity and throughput to provide richer and deeper biological insights at the cellular level. The next frontier is the development of multiomic methods capable of analyzing multiple features from the same cell, such as the simultaneous measurement of the transcriptome and the chromatin accessibility of candidate regulatory elements. In this chapter, we discuss and describe SHARE-seq (Simultaneous high-throughput ATAC, and RNA expression with sequencing) for carrying out simultaneous chromatin accessibility and transcriptome measurements in single cells, together with the experimental and analytical considerations for achieving optimal results.

    View details for DOI 10.1007/978-1-0716-2899-7_11

    View details for PubMedID 36807070

  • Genome-Wide Mapping of Active Regulatory Elements Using ATAC-seq. Methods in molecular biology (Clifton, N.J.) Marinov, G. K., Shipony, Z., Kundaje, A., Greenleaf, W. J. 2023; 2611: 3-19

    Abstract

    Active cis-regulatory elements (cREs) in eukaryotes are characterized by nucleosomal depletion and, accordingly, higher accessibility. This property has turned out to be immensely useful for identifying cREs genome-wide and tracking their dynamics across different cellular states and is the basis of numerous methods taking advantage of the preferential enzymatic cleavage/labeling of accessible DNA. ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) has emerged as the most versatile and widely adaptable method and has been widely adopted as the standard tool for mapping open chromatin regions. Here, we discuss the current optimal practices and important considerations for carrying out ATAC-seq experiments, primarily in the context of mammalian systems.

    View details for DOI 10.1007/978-1-0716-2899-7_1

    View details for PubMedID 36807060

  • Genome-wide interaction study with smoking for colorectal cancer risk identifies novel genetic loci related to tumor suppression, inflammation and immune response. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology Carreras-Torres, R., Kim, A. E., Lin, Y., Díez-Obrero, V., Bien, S. A., Qu, C., Wang, J., Dimou, N., Aglago, E. K., Albanes, D., Arndt, V., Baurley, J. W., Berndt, S. I., Bézieau, S., Bishop, D. T., Bouras, E., Brenner, H., Budiarto, A., Campbell, P. T., Casey, G., Chan, A. T., Chang-Claude, J., Chen, X., Conti, D. V., Dampier, C. H., Devall, M. A., Drew, D. A., Figueiredo, J. C., Gallinger, S., Giles, G. G., Gruber, S. B., Gsur, A., Gunter, M. J., Harrison, T. A., Hidaka, A., Hoffmeister, M., Huyghe, J. R., Jenkins, M. A., Jordahl, K. M., Kawaguchi, E., Keku, T. O., Kundaje, A., Le Marchand, L., Lewinger, J. P., Li, L., Mahesworo, B., Morrison, J. L., Murphy, N., Nan, H., Nassir, R., Newcomb, P. A., Obón-Santacana, M., Ogino, S., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Peoples, A. R., Pharoah, P. D., Platz, E. A., Rennert, G., Ruiz-Narvaez, E., Sakoda, L. C., Scacheri, P. C., Schmit, S. L., Schoen, R. E., Shcherbina, A., Slattery, M. L., Stern, M. C., Su, Y. R., Tangen, C. M., Thomas, D. C., Tian, Y., Tsilidis, K. K., Ulrich, C. M., van Duijnhoven, F. J., Van Guelpen, B., Visvanathan, K., Vodicka, P., Cenggoro, T. W., Weinstein, S. J., White, E., Wolk, A., Woods, M. O., Hsu, L., Peters, U., Moreno, V., Gauderman, W. J. 2022

    Abstract

    Tobacco smoking is an established risk factor for colorectal cancer (CRC). However, genetically-defined population subgroups may have increased susceptibility to smoking-related effects on CRC. A genome-wide interaction scan was performed including 33,756 CRC cases and 44,346 controls from three genetic consortia. Evidence of an interaction was observed between smoking status (ever vs never smokers) and a locus on 3p12.1 (rs9880919, p=4.58×10^-8), with higher associated risk in subjects carrying the GG genotype (OR 1.25, 95% CI 1.20-1.30) compared with the other genotypes (OR <1.17 for GA and AA). Among ever smokers, we observed interactions between smoking intensity (increase in 10 cigarettes smoked per day) and two loci on 6p21.33 (rs4151657, p=1.72×10^-8) and 8q24.23 (rs7005722, p=2.88×10^-8). Subjects carrying the rs4151657 TT genotype showed higher risk (OR 1.12, 95% CI 1.09-1.16) compared with the other genotypes (OR <1.06 for TC and CC). Similarly, higher risk was observed among subjects carrying the rs7005722 AA genotype (OR 1.17, 95% CI 1.07-1.28) compared with the other genotypes (OR <1.13 for AC and CC). Functional annotation revealed that SNPs in 3p12.1 and 6p21.33 loci were located in regulatory regions, and were associated with expression levels of nearby genes. Genetic models predicting gene expression revealed that smoking parameters were associated with lower CRC risk with higher expression levels of CADM2 (3p12.1) and ATF6B (6p21.33). Our study identified novel genetic loci that may modulate the risk for CRC of smoking status and intensity, linked to tumor suppression and immune response. These findings can guide potential prevention treatments.

    View details for DOI 10.1158/1055-9965.EPI-22-0763

    View details for PubMedID 36576985

  • Deciphering colorectal cancer genetics through multi-omic analysis of 100,204 cases and 154,587 controls of European and east Asian ancestries. Nature genetics Fernandez-Rozadilla, C., Timofeeva, M., Chen, Z., Law, P., Thomas, M., Schmit, S., Díez-Obrero, V., Hsu, L., Fernandez-Tajes, J., Palles, C., Sherwood, K., Briggs, S., Svinti, V., Donnelly, K., Farrington, S., Blackmur, J., Vaughan-Shaw, P., Shu, X. O., Long, J., Cai, Q., Guo, X., Lu, Y., Broderick, P., Studd, J., Huyghe, J., Harrison, T., Conti, D., Dampier, C., Devall, M., Schumacher, F., Melas, M., Rennert, G., Obón-Santacana, M., Martín-Sánchez, V., Moratalla-Navarro, F., Oh, J. H., Kim, J., Jee, S. H., Jung, K. J., Kweon, S. S., Shin, M. H., Shin, A., Ahn, Y. O., Kim, D. H., Oze, I., Wen, W., Matsuo, K., Matsuda, K., Tanikawa, C., Ren, Z., Gao, Y. T., Jia, W. H., Hopper, J., Jenkins, M., Win, A. K., Pai, R., Figueiredo, J., Haile, R., Gallinger, S., Woods, M., Newcomb, P., Duggan, D., Cheadle, J., Kaplan, R., Maughan, T., Kerr, R., Kerr, D., Kirac, I., Böhm, J., Mecklin, L. P., Jousilahti, P., Knekt, P., Aaltonen, L., Rissanen, H., Pukkala, E., Eriksson, J., Cajuso, T., Hänninen, U., Kondelin, J., Palin, K., Tanskanen, T., Renkonen-Sinisalo, L., Zanke, B., Männistö, S., Albanes, D., Weinstein, S., Ruiz-Narvaez, E., Palmer, J., Buchanan, D., Platz, E., Visvanathan, K., Ulrich, C., Siegel, E., Brezina, S., Gsur, A., Campbell, P., Chang-Claude, J., Hoffmeister, M., Brenner, H., Slattery, M., Potter, J., Tsilidis, K., Schulze, M., Gunter, M., Murphy, N., Castells, A., Castellví-Bel, S., Moreira, L., Arndt, V., Shcherbina, A., Stern, M., Pardamean, B., Bishop, T., Giles, G., Southey, M., Idos, G., McDonnell, K., Abu-Ful, Z., Greenson, J., Shulman, K., Lejbkowicz, F., Offit, K., Su, Y. R., Steinfelder, R., Keku, T., van Guelpen, B., Hudson, T., Hampel, H., Pearlman, R., Berndt, S., Hayes, R., Martinez, M. E., Thomas, S., Corley, D., Pharoah, P., Larsson, S., Yen, Y., Lenz, H. J., White, E., Li, L., Doheny, K., Pugh, E., Shelford, T., Chan, A., Cruz-Correa, M., Lindblom, A., Hunter, D., Joshi, A., Schafmayer, C., Scacheri, P., Kundaje, A., Nickerson, D., Schoen, R., Hampe, J., Stadler, Z., Vodicka, P., Vodickova, L., Vymetalkova, V., Papadopoulos, N., Edlund, C., Gauderman, W., Thomas, D., Shibata, D., Toland, A., Markowitz, S., Kim, A., Chanock, S., van Duijnhoven, F., Feskens, E., Sakoda, L., Gago-Dominguez, M., Wolk, A., Naccarati, A., Pardini, B., FitzGerald, L., Lee, S. C., Ogino, S., Bien, S., Kooperberg, C., Li, C., Lin, Y., Prentice, R., Qu, C., Bézieau, S., Tangen, C., Mardis, E., Yamaji, T., Sawada, N., Iwasaki, M., Haiman, C., Le Marchand, L., Wu, A., Qu, C., McNeil, C., Coetzee, G., Hayward, C., Deary, I., Harris, S., Theodoratou, E., Reid, S., Walker, M., Ooi, L. Y., Moreno, V., Casey, G., Gruber, S., Tomlinson, I., Zheng, W., Dunlop, M., Houlston, R., Peters, U. 2022

    Abstract

    Colorectal cancer (CRC) is a leading cause of mortality worldwide. We conducted a genome-wide association study meta-analysis of 100,204 CRC cases and 154,587 controls of European and east Asian ancestry, identifying 205 independent risk associations, of which 50 were unreported. We performed integrative genomic, transcriptomic and methylomic analyses across large bowel mucosa and other tissues. Transcriptome- and methylome-wide association studies revealed an additional 53 risk associations. We identified 155 high-confidence effector genes functionally linked to CRC risk, many of which had no previously established role in CRC. These have multiple different functions and specifically indicate that variation in normal colorectal homeostasis, proliferation, cell adhesion, migration, immunity and microbial interactions determines CRC risk. Cross-tissue analyses indicated that over a third of effector genes most probably act outside the colonic mucosa. Our findings provide insights into colorectal oncogenesis and highlight potential targets across tissues for new CRC treatment and chemoprevention strategies.

    View details for DOI 10.1038/s41588-022-01222-9

    View details for PubMedID 36539618

  • GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic acids research Frankish, A., Carbonell-Sala, S., Diekhans, M., Jungreis, I., Loveland, J. E., Mudge, J. M., Sisu, C., Wright, J. C., Arnan, C., Barnes, I., Banerjee, A., Bennett, R., Berry, A., Bignell, A., Boix, C., Calvet, F., Cerdan-Velez, D., Cunningham, F., Davidson, C., Donaldson, S., Dursun, C., Fatima, R., Giorgetti, S., Giron, C. G., Gonzalez, J. M., Hardy, M., Harrison, P. W., Hourlier, T., Hollis, Z., Hunt, T., James, B., Jiang, Y., Johnson, R., Kay, M., Lagarde, J., Martin, F. J., Gomez, L. M., Nair, S., Ni, P., Pozo, F., Ramalingam, V., Ruffier, M., Schmitt, B. M., Schreiber, J. M., Steed, E., Suner, M., Sumathipala, D., Sycheva, I., Uszczynska-Ratajczak, B., Wass, E., Yang, Y. T., Yates, A., Zafrulla, Z., Choudhary, J. S., Gerstein, M., Guigo, R., Hubbard, T. J., Kellis, M., Kundaje, A., Paten, B., Tress, M. L., Flicek, P. 2022

    Abstract

    GENCODE produces high quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

    View details for DOI 10.1093/nar/gkac1071

    View details for PubMedID 36420896

  • Deciphering the impact of genetic variation on human polyadenylation using APARENT2. Genome biology Linder, J., Koplik, S. E., Kundaje, A., Seelig, G. 2022; 23 (1): 232

    Abstract

    3'-end processing by cleavage and polyadenylation is an important and finely tuned regulatory process during mRNA maturation. Numerous genetic variants are known to cause or contribute to human disorders by disrupting the cis-regulatory code of polyadenylation signals. Yet, due to the complexity of this code, variant interpretation remains challenging. We introduce a residual neural network model, APARENT2, that can infer 3'-cleavage and polyadenylation from DNA sequence more accurately than any previous model. This model generalizes to the case of alternative polyadenylation (APA) for a variable number of polyadenylation signals. We demonstrate APARENT2's performance on several variant datasets, including functional reporter data and human 3' aQTLs from GTEx. We apply neural network interpretation methods to gain insights into disrupted or protective higher-order features of polyadenylation. We fine-tune APARENT2 on human tissue-resolved transcriptomic data to elucidate tissue-specific variant effects. By combining APARENT2 with models of mRNA stability, we extend aQTL effect size predictions to the entire 3' untranslated region. Finally, we perform in silico saturation mutagenesis of all human polyadenylation signals and compare the predicted effects of [Formula: see text] million variants against gnomAD. While loss-of-function variants were generally selected against, we also find specific clinical conditions linked to gain-of-function mutations. For example, we detect an association between gain-of-function mutations in the 3'-end and autism spectrum disorder. To experimentally validate APARENT2's predictions, we assayed clinically relevant variants in multiple cell lines, including microglia-derived cells. A sequence-to-function model based on deep residual learning enables accurate functional interpretation of genetic variants in polyadenylation signals and, when coupled with large human variation databases, elucidates the link between functional 3'-end mutations and human health.

    View details for DOI 10.1186/s13059-022-02799-4

    View details for PubMedID 36335397

  • The dynseq browser track shows context-specific features at nucleotide resolution. Nature genetics Nair, S., Barrett, A., Li, D., Raney, B. J., Lee, B. T., Kerpedjiev, P., Ramalingam, V., Pampari, A., Lekschas, F., Wang, T., Haeussler, M., Kundaje, A. 2022

    View details for DOI 10.1038/s41588-022-01194-w

    View details for PubMedID 36241719

  • Single-cell multiome of the human retina and deep learning nominate causal variants in complex eye diseases. Cell genomics Wang, S. K., Nair, S., Li, R., Kraft, K., Pampari, A., Patel, A., Kang, J. B., Luong, C., Kundaje, A., Chang, H. Y. 2022; 2 (8)

    Abstract

    Genome-wide association studies (GWASs) of eye disorders have identified hundreds of genetic variants associated with ocular disease. However, the vast majority of these variants are noncoding, making it challenging to interpret their function. Here we present a joint single-cell atlas of gene expression and chromatin accessibility of the adult human retina with more than 50,000 cells, which we used to analyze single-nucleotide polymorphisms (SNPs) implicated by GWASs of age-related macular degeneration, glaucoma, diabetic retinopathy, myopia, and type 2 macular telangiectasia. We integrate this atlas with a HiChIP enhancer connectome, expression quantitative trait loci (eQTL) data, and base-resolution deep learning models to predict noncoding SNPs with causal roles in eye disease, assess SNP impact on transcription factor binding, and define their known and novel target genes. Our efforts nominate pathogenic SNP-target gene interactions for multiple vision disorders and provide a potentially powerful resource for interpreting noncoding variation in the eye.

    View details for DOI 10.1016/j.xgen.2022.100164

    View details for PubMedID 36277849

  • Automated sequence-based annotation and interpretation of the human genome. Nature genetics Kundaje, A., Meuleman, W. 2022

    View details for DOI 10.1038/s41588-022-01123-x

    View details for PubMedID 35817978

  • Author Correction: Single-nucleus chromatin accessibility profiling highlights regulatory mechanisms of coronary artery disease risk. Nature genetics Turner, A. W., Hu, S. S., Mosquera, J. V., Ma, W. F., Hodonsky, C. J., Wong, D., Auguste, G., Song, Y., Sol-Church, K., Farber, E., Kundu, S., Kundaje, A., Lopez, N. G., Ma, L., Ghosh, S. K., Onengut-Gumuscu, S., Ashley, E. A., Quertermous, T., Finn, A. V., Leeper, N. J., Kovacic, J. C., Björkegren, J. L., Zang, C., Miller, C. L. 2022

    View details for DOI 10.1038/s41588-022-01142-8

    View details for PubMedID 35768727

  • Single-cell analyses define a continuum of cell state and composition changes in the malignant transformation of polyps to colorectal cancer. Nature genetics Becker, W. R., Nevins, S. A., Chen, D. C., Chiu, R., Horning, A. M., Guha, T. K., Laquindanum, R., Mills, M., Chaib, H., Ladabaum, U., Longacre, T., Shen, J., Esplin, E. D., Kundaje, A., Ford, J. M., Curtis, C., Snyder, M. P., Greenleaf, W. J. 2022

    Abstract

    To chart cell composition and cell state changes that occur during the transformation of healthy colon to precancerous adenomas to colorectal cancer (CRC), we generated single-cell chromatin accessibility profiles and single-cell transcriptomes from 1,000 to 10,000 cells per sample for 48 polyps, 27 normal tissues and 6 CRCs collected from patients with or without germline APC mutations. A large fraction of polyp and CRC cells exhibit a stem-like phenotype, and we define a continuum of epigenetic and transcriptional changes occurring in these stem-like cells as they progress from homeostasis to CRC. Advanced polyps contain increasing numbers of stem-like cells, regulatory T cells and a subtype of pre-cancer-associated fibroblasts. In the cancerous state, we observe T cell exhaustion, RUNX1-regulated cancer-associated fibroblasts and increasing accessibility associated with HNF4A motifs in epithelia. DNA methylation changes in sporadic CRC are strongly anti-correlated with accessibility changes along this continuum, further identifying regulatory markers for molecular staging of polyps.

    View details for DOI 10.1038/s41588-022-01088-x

    View details for PubMedID 35726067

  • Accelerating in-silico saturation mutagenesis using compressed sensing. Bioinformatics (Oxford, England) Schreiber, J., Nair, S., Balsubramani, A., Kundaje, A. 2022

    Abstract

    MOTIVATION: In-silico saturation mutagenesis (ISM) is a popular approach in computational genomics for calculating feature attributions on biological sequences that proceeds by systematically perturbing each position in a sequence and recording the difference in model output. However, this method can be slow because systematically perturbing each position requires performing a number of forward passes proportional to the length of the sequence being examined. RESULTS: In this work, we propose a modification of ISM that leverages the principles of compressed sensing to require only a constant number of forward passes, regardless of sequence length, when applied to models that contain operations with a limited receptive field, such as convolutions. Our method, named Yuzu, can reduce the time that ISM spends in convolution operations by several orders of magnitude and, consequently, Yuzu can speed up ISM on several commonly used architectures in genomics by over an order of magnitude. Notably, we found that Yuzu provides speedups that increase with the complexity of the convolution operation and the length of the sequence being analyzed, suggesting that Yuzu provides large benefits in realistic settings. AVAILABILITY: We have made this tool available at https://github.com/kundajelab/yuzu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btac385

    View details for PubMedID 35678521
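
    The baseline procedure that the abstract above describes (perturb each position, record the change in model output) can be sketched as follows. This is a minimal illustration of naive ISM only, not the Yuzu method and not code from the Yuzu repository. The names `toy_model`, `naive_ism` and `count_forward_passes` are hypothetical, and the motif-counting scorer is an assumed stand-in for a trained model.

```python
from typing import Callable, Dict, List

ALPHABET = "ACGT"


def toy_model(seq: str) -> float:
    """Hypothetical stand-in for a trained model: counts 'TATA' occurrences."""
    motif = "TATA"
    last_start = len(seq) - len(motif)
    return float(sum(seq[i:i + len(motif)] == motif for i in range(last_start + 1)))


def naive_ism(seq: str, model: Callable[[str], float]) -> List[Dict[str, float]]:
    """Score every single-base substitution as model(mutant) - model(reference)."""
    ref_score = model(seq)  # one forward pass on the unmodified sequence
    scores: List[Dict[str, float]] = []
    for pos, ref_base in enumerate(seq):
        row: Dict[str, float] = {}
        for alt in ALPHABET:
            if alt == ref_base:
                row[alt] = 0.0  # the reference base has no effect by definition
            else:
                mutant = seq[:pos] + alt + seq[pos + 1:]
                row[alt] = model(mutant) - ref_score  # one forward pass per mutant
        scores.append(row)
    return scores


def count_forward_passes(seq_len: int, alphabet_size: int = 4) -> int:
    """Forward passes used by naive_ism: 1 reference + (alphabet_size - 1) per position."""
    return 1 + seq_len * (alphabet_size - 1)


if __name__ == "__main__":
    sequence = "GGTATAGG"
    attributions = naive_ism(sequence, toy_model)
    for position, row in enumerate(attributions):
        print(position, sequence[position], row)
    print("forward passes:", count_forward_passes(len(sequence)))
```

    The pass count grows linearly with sequence length, which is the cost the abstract identifies as the bottleneck.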

  • Single-nucleus chromatin accessibility profiling highlights regulatory mechanisms of coronary artery disease risk. Nature genetics Turner, A. W., Hu, S. S., Mosquera, J. V., Ma, W. F., Hodonsky, C. J., Wong, D., Auguste, G., Song, Y., Sol-Church, K., Farber, E., Kundu, S., Kundaje, A., Lopez, N. G., Ma, L., Ghosh, S. K., Onengut-Gumuscu, S., Ashley, E. A., Quertermous, T., Finn, A. V., Leeper, N. J., Kovacic, J. C., Björkegren, J. L., Zang, C., Miller, C. L. 2022

    Abstract

    Coronary artery disease (CAD) is a complex inflammatory disease involving genetic influences across cell types. Genome-wide association studies have identified over 200 loci associated with CAD, where the majority of risk variants reside in noncoding DNA sequences impacting cis-regulatory elements. Here, we applied single-nucleus assay for transposase-accessible chromatin with sequencing to profile 28,316 nuclei across coronary artery segments from 41 patients with varying stages of CAD, which revealed 14 distinct cellular clusters. We mapped ~320,000 accessible sites across all cells, identified cell-type-specific elements and transcription factors, and prioritized functional CAD risk variants. We identified elements in smooth muscle cell transition states (for example, fibromyocytes) and functional variants predicted to alter smooth muscle cell- and macrophage-specific regulation of MRAS (3q22) and LIPA (10q23), respectively. We further nominated key driver transcription factors such as PRDM16 and TBX2. Together, this single-nucleus atlas provides a critical step towards interpreting regulatory mechanisms across the continuum of CAD risk.

    View details for DOI 10.1038/s41588-022-01069-0

    View details for PubMedID 35590109

  • Genome-Wide Interaction Analysis of Genetic Variants with Menopausal Hormone Therapy for Colorectal Cancer Risk. Journal of the National Cancer Institute Tian, Y., Kim, A. E., Bien, S. A., Lin, Y., Qu, C., Harrison, T., Carreras-Torres, R., Diez-Obrero, V., Dimou, N., Drew, D. A., Hidaka, A., Huyghe, J. R., Jordahl, K. M., Morrison, J., Murphy, N., Obon-Santacana, M., Ulrich, C. M., Ose, J., Peoples, A. R., Ruiz-Narvaez, E. A., Shcherbina, A., Stern, M., Su, Y., van Duijnhoven, F. J., Arndt, V., Baurley, J., Berndt, S. I., Bishop, D. T., Brenner, H., Buchanan, D. D., Chan, A. T., Figueiredo, J. C., Gallinger, S., Gruber, S. B., Harlid, S., Hoffmeister, M., Jenkins, M. A., Joshi, A. D., Keku, T. O., Larsson, S. C., Le Marchand, L., Li, L., Giles, G. G., Milne, R. L., Nan, H., Nassir, R., Ogino, S., Budiarto, A., Platz, E. A., Potter, J. D., Prentice, R. L., Rennert, G., Sakoda, L. C., Schoen, R. E., Slattery, M. L., Thibodeau, S. N., Van Guelpen, B., Visvanathan, K., White, E., Wolk, A., Woods, M. O., Wu, A. H., Campbell, P. T., Casey, G., Conti, D. V., Gunter, M. J., Kundaje, A., Lewinger, J. P., Moreno, V., Newcomb, P. A., Pardamean, B., Thomas, D. C., Tsilidis, K. K., Peters, U., Gauderman, W. J., Hsu, L., Chang-Claude, J. 2022

    Abstract

    BACKGROUND: The use of menopausal hormone therapy (MHT) may interact with genetic variants to influence colorectal cancer (CRC) risk. METHODS: We conducted a genome-wide gene-environment interaction between single nucleotide polymorphisms and the use of any MHT, estrogen-only, and combined estrogen-progestogen therapy with CRC risk, among 28,486 postmenopausal women (11,519 cases and 16,967 controls) from 38 studies, using logistic regression, two-step method, and 2- or 3-degree-of-freedom (d.f.) joint test. A set-based score test was applied for rare genetic variants. RESULTS: The use of any MHT, estrogen-only and estrogen-progestogen were associated with a reduced CRC risk [odds ratio (OR) with 95% confidence interval (95% CI) of 0.71 (0.64-0.78), 0.65 (0.53-0.79), and 0.73 (0.59-0.90), respectively]. The two-step method identified a statistically significant interaction between a GRIN2B variant rs117868593 and MHT use, whereby MHT-associated CRC risk was significantly reduced in women with the GG genotype [0.68 (0.64-0.72)] but not within strata of GC or CC genotypes. A statistically significant interaction between a DCBLD1 intronic variant at 6q22.1 (rs10782186) and MHT use was identified by the 2-d.f. joint test. The MHT-associated CRC risk was reduced with increasing number of rs10782186-C alleles, showing ORs of 0.78 (0.70-0.87) for TT, 0.68 (0.63-0.73) for TC, and 0.66 (0.60-0.74) for CC genotypes. In addition, five genes in rare variant analysis showed suggestive interactions with MHT (two-sided P < 1.2 × 10⁻⁴). CONCLUSION: Genetic variants that modify the association between MHT and CRC risk were identified, offering new insights into pathways of CRC carcinogenesis and potential mechanisms involved.

    View details for DOI 10.1093/jnci/djac094

    View details for PubMedID 35512400

  • Author Correction: Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature ENCODE Project Consortium, Moore, J. E., Purcaro, M. J., Pratt, H. E., Epstein, C. B., Shoresh, N., Adrian, J., Kawli, T., Davis, C. A., Dobin, A., Kaul, R., Halow, J., Van Nostrand, E. L., Freese, P., Gorkin, D. U., Shen, Y., He, Y., Mackiewicz, M., Pauli-Behn, F., Williams, B. A., Mortazavi, A., Keller, C. A., Zhang, X., Elhajjajy, S. I., Huey, J., Dickel, D. E., Snetkova, V., Wei, X., Wang, X., Rivera-Mulia, J. C., Rozowsky, J., Zhang, J., Chhetri, S. B., Zhang, J., Victorsen, A., White, K. P., Visel, A., Yeo, G. W., Burge, C. B., Lecuyer, E., Gilbert, D. M., Dekker, J., Rinn, J., Mendenhall, E. M., Ecker, J. R., Kellis, M., Klein, R. J., Noble, W. S., Kundaje, A., Guigo, R., Farnham, P. J., Cherry, J. M., Myers, R. M., Ren, B., Graveley, B. R., Gerstein, M. B., Pennacchio, L. A., Snyder, M. P., Bernstein, B. E., Wold, B., Hardison, R. C., Gingeras, T. R., Stamatoyannopoulos, J. A., Weng, Z., Abascal, F., Acosta, R., Addleman, N. J., Adrian, J., Afzal, V., Aken, B., Akiyama, J. A., Jammal, O. A., Amrhein, H., Anderson, S. M., Andrews, G. R., Antoshechkin, I., Ardlie, K. G., Armstrong, J., Astley, M., Banerjee, B., Barkal, A. A., Barnes, I. H., Barozzi, I., Barrell, D., Barson, G., Bates, D., Baymuradov, U. K., Bazile, C., Beer, M. A., Beik, S., Bender, M. A., Bennett, R., Bouvrette, L. P., Bernstein, B. E., Berry, A., Bhaskar, A., Bignell, A., Blue, S. M., Bodine, D. M., Boix, C., Boley, N., Borrman, T., Borsari, B., Boyle, A. P., Brandsmeier, L. A., Breschi, A., Bresnick, E. H., Brooks, J. A., Buckley, M., Burge, C. B., Byron, R., Cahill, E., Cai, L., Cao, L., Carty, M., Castanon, R. G., Castillo, A., Chaib, H., Chan, E. T., Chee, D. R., Chee, S., Chen, H., Chen, H., Chen, J., Chen, S., Cherry, J. M., Chhetri, S. B., Choudhary, J. S., Chrast, J., Chung, D., Clarke, D., Cody, N. A., Coppola, C. J., Coursen, J., D'Ippolito, A. 
M., Dalton, S., Danyko, C., Davidson, C., Davila-Velderrain, J., Davis, C. A., Dekker, J., Deran, A., DeSalvo, G., Despacio-Reyes, G., Dewey, C. N., Dickel, D. E., Diegel, M., Diekhans, M., Dileep, V., Ding, B., Djebali, S., Dobin, A., Dominguez, D., Donaldson, S., Drenkow, J., Dreszer, T. R., Drier, Y., Duff, M. O., Dunn, D., Eastman, C., Ecker, J. R., Edwards, M. D., El-Ali, N., Elhajjajy, S. I., Elkins, K., Emili, A., Epstein, C. B., Evans, R. C., Ezkurdia, I., Fan, K., Farnham, P. J., Farrell, N. P., Feingold, E. A., Ferreira, A., Fisher-Aylor, K., Fitzgerald, S., Flicek, P., Foo, C. S., Fortier, K., Frankish, A., Freese, P., Fu, S., Fu, X., Fu, Y., Fukuda-Yuzawa, Y., Fulciniti, M., Funnell, A. P., Gabdank, I., Galeev, T., Gao, M., Giron, C. G., Garvin, T. H., Gelboin-Burkhart, C. A., Georgolopoulos, G., Gerstein, M. B., Giardine, B. M., Gifford, D. K., Gilbert, D. M., Gilchrist, D. A., Gillespie, S., Gingeras, T. R., Gong, P., Gonzalez, A., Gonzalez, J. M., Good, P., Goren, A., Gorkin, D. U., Graveley, B. R., Gray, M., Greenblatt, J. F., Griffiths, E., Groudine, M. T., Grubert, F., Gu, M., Guigo, R., Guo, H., Guo, Y., Guo, Y., Gursoy, G., Gutierrez-Arcelus, M., Halow, J., Hardison, R. C., Hardy, M., Hariharan, M., Harmanci, A., Harrington, A., Harrow, J. L., Hashimoto, T. B., Hasz, R. D., Hatan, M., Haugen, E., Hayes, J. E., He, P., He, Y., Heidari, N., Hendrickson, D., Heuston, E. F., Hilton, J. A., Hitz, B. C., Hochman, A., Holgren, C., Hou, L., Hou, S., Hsiao, Y. E., Hsu, S., Huang, H., Hubbard, T. J., Huey, J., Hughes, T. R., Hunt, T., Ibarrientos, S., Issner, R., Iwata, M., Izuogu, O., Jaakkola, T., Jameel, N., Jansen, C., Jiang, L., Jiang, P., Johnson, A., Johnson, R., Jungreis, I., Kadaba, M., Kasowski, M., Kasparian, M., Kato, M., Kaul, R., Kawli, T., Kay, M., Keen, J. C., Keles, S., Keller, C. A., Kelley, D., Kellis, M., Kheradpour, P., Kim, D. S., Kirilusha, A., Klein, R. J., Knoechel, B., Kuan, S., Kulik, M. 
J., Kumar, S., Kundaje, A., Kutyavin, T., Lagarde, J., Lajoie, B. R., Lambert, N. J., Lazar, J., Lee, A. Y., Lee, D., Lee, E., Lee, J. W., Lee, K., Leslie, C. S., Levy, S., Li, B., Li, H., Li, N., Li, X., Li, Y. I., Li, Y., Li, Y., Li, Y., Lian, J., Libbrecht, M. W., Lin, S., Lin, Y., Liu, D., Liu, J., Liu, P., Liu, T., Liu, X. S., Liu, Y., Liu, Y., Long, M., Lou, S., Loveland, J., Lu, A., Lu, Y., Lecuyer, E., Ma, L., Mackiewicz, M., Mannion, B. J., Mannstadt, M., Manthravadi, D., Marinov, G. K., Martin, F. J., Mattei, E., McCue, K., McEown, M., McVicker, G., Meadows, S. K., Meissner, A., Mendenhall, E. M., Messer, C. L., Meuleman, W., Meyer, C., Miller, S., Milton, M. G., Mishra, T., Moore, D. E., Moore, H. M., Moore, J. E., Moore, S. H., Moran, J., Mortazavi, A., Mudge, J. M., Munshi, N., Murad, R., Myers, R. M., Nandakumar, V., Nandi, P., Narasimha, A. M., Narayanan, A. K., Naughton, H., Navarro, F. C., Navas, P., Nazarovs, J., Nelson, J., Neph, S., Neri, F. J., Nery, J. R., Nesmith, A. R., Newberry, J. S., Newberry, K. M., Ngo, V., Nguyen, R., Nguyen, T. B., Nguyen, T., Nishida, A., Noble, W. S., Novak, C. S., Novoa, E. M., Nunez, B., O'Donnell, C. W., Olson, S., Onate, K. C., Otterman, E., Ozadam, H., Pagan, M., Palden, T., Pan, X., Park, Y., Partridge, E. C., Paten, B., Pauli-Behn, F., Pazin, M. J., Pei, B., Pennacchio, L. A., Perez, A. R., Perry, E. H., Pervouchine, D. D., Phalke, N. N., Pham, Q., Phanstiel, D. H., Plajzer-Frick, I., Pratt, G. A., Pratt, H. E., Preissl, S., Pritchard, J. K., Pritykin, Y., Purcaro, M. J., Qin, Q., Quinones-Valdez, G., Rabano, I., Radovani, E., Raj, A., Rajagopal, N., Ram, O., Ramirez, L., Ramirez, R. N., Rausch, D., Raychaudhuri, S., Raymond, J., Razavi, R., Reddy, T. E., Reimonn, T. M., Ren, B., Reymond, A., Reynolds, A., Rhie, S. K., Rinn, J., Rivera, M., Rivera-Mulia, J. C., Roberts, B. S., Rodriguez, J. M., Rozowsky, J., Ryan, R., Rynes, E., Salins, D. 
N., Sandstrom, R., Sasaki, T., Sathe, S., Savic, D., Scavelli, A., Scheiman, J., Schlaffner, C., Schloss, J. A., Schmitges, F. W., See, L. H., Sethi, A., Setty, M., Shafer, A., Shan, S., Sharon, E., Shen, Q., Shen, Y., Sherwood, R. I., Shi, M., Shin, S., Shoresh, N., Siebenthall, K., Sisu, C., Slifer, T., Sloan, C. A., Smith, A., Snetkova, V., Snyder, M. P., Spacek, D. V., Srinivasan, S., Srivas, R., Stamatoyannopoulos, G., Stamatoyannopoulos, J. A., Stanton, R., Steffan, D., Stehling-Sun, S., Strattan, J. S., Su, A., Sundararaman, B., Suner, M., Syed, T., Szynkarek, M., Tanaka, F. Y., Tenen, D., Teng, M., Thomas, J. A., Toffey, D., Tress, M. L., Trout, D. E., Trynka, G., Tsuji, J., Upchurch, S. A., Ursu, O., Uszczynska-Ratajczak, B., Uziel, M. C., Valencia, A., Biber, B. V., van der Velde, A. G., Van Nostrand, E. L., Vaydylevich, Y., Vazquez, J., Victorsen, A., Vielmetter, J., Vierstra, J., Visel, A., Vlasova, A., Vockley, C. M., Volpi, S., Vong, S., Wang, H., Wang, M., Wang, Q., Wang, R., Wang, T., Wang, W., Wang, X., Wang, Y., Watson, N. K., Wei, X., Wei, Z., Weisser, H., Weissman, S. M., Welch, R., Welikson, R. E., Weng, Z., Westra, H., Whitaker, J. W., White, C., White, K. P., Wildberg, A., Williams, B. A., Wine, D., Witt, H. N., Wold, B., Wolf, M., Wright, J., Xiao, R., Xiao, X., Xu, J., Xu, J., Yan, K., Yan, Y., Yang, H., Yang, X., Yang, Y., Yardimci, G. G., Yee, B. A., Yeo, G. W., Young, T., Yu, T., Yue, F., Zaleski, C., Zang, C., Zeng, H., Zeng, W., Zerbino, D. R., Zhai, J., Zhan, L., Zhan, Y., Zhang, B., Zhang, J., Zhang, J., Zhang, K., Zhang, L., Zhang, P., Zhang, Q., Zhang, X., Zhang, Y., Zhang, Z., Zhao, Y., Zheng, Y., Zhong, G., Zhou, X., Zhu, Y., Zimmerman, J., Ai, R., Li, S. 2022

    View details for DOI 10.1038/s41586-021-04226-3

    View details for PubMedID 35474001

  • Author Correction: Perspectives on ENCODE. Nature ENCODE Project Consortium, Snyder, M. P., Gingeras, T. R., Moore, J. E., Weng, Z., Gerstein, M. B., Ren, B., Hardison, R. C., Stamatoyannopoulos, J. A., Graveley, B. R., Feingold, E. A., Pazin, M. J., Pagan, M., Gilchrist, D. A., Hitz, B. C., Cherry, J. M., Bernstein, B. E., Mendenhall, E. M., Zerbino, D. R., Frankish, A., Flicek, P., Myers, R. M., Abascal, F. B., Acosta, R., Addleman, N. J., Adrian, J., Afzal, V., Aken, B., Ai, R., Akiyama, J. A., Jammal, O. A., Amrhein, H., Anderson, S. M., Andrews, G. R., Antoshechkin, I., Ardlie, K. G., Armstrong, J., Astley, M., Banerjee, B., Barkal, A. A., Barnes, I. H., Barozzi, I., Barrell, D., Barson, G., Bates, D., Baymuradov, U. K., Bazile, C., Beer, M. A., Beik, S., Bender, M. A., Bennett, R., Bouvrette, L. P., Bernstein, B. E., Berry, A., Bhaskar, A., Bignell, A., Blue, S. M., Bodine, D. M., Boix, C., Boley, N., Borrman, T., Borsari, B., Boyle, A. P., Brandsmeier, L. A., Breschi, A., Bresnick, E. H., Brooks, J. A., Buckley, M., Burge, C. B., Byron, R., Cahill, E., Cai, L., Cao, L., Carty, M., Castanon, R. G., Castillo, A., Chaib, H., Chan, E. T., Chee, D. R., Chee, S., Chen, H., Chen, H., Chen, J., Chen, S., Cherry, J. M., Chhetri, S. B., Choudhary, J. S., Chrast, J., Chung, D., Clarke, D., Cody, N. A., Coppola, C. J., Coursen, J., D'Ippolito, A. M., Dalton, S., Danyko, C., Davidson, C., Davila-Velderrain, J., Davis, C. A., Dekker, J., Deran, A., DeSalvo, G., Despacio-Reyes, G., Dewey, C. N., Dickel, D. E., Diegel, M., Diekhans, M., Dileep, V., Ding, B., Djebali, S., Dobin, A., Dominguez, D., Donaldson, S., Drenkow, J., Dreszer, T. R., Drier, Y., Duff, M. O., Dunn, D., Eastman, C., Ecker, J. R., Edwards, M. D., El-Ali, N., Elhajjajy, S. I., Elkins, K., Emili, A., Epstein, C. B., Evans, R. C., Ezkurdia, I., Fan, K., Farnham, P. J., Farrell, N., Feingold, E. A., Ferreira, A., Fisher-Aylor, K., Fitzgerald, S., Flicek, P., Foo, C. 
S., Fortier, K., Frankish, A., Freese, P., Fu, S., Fu, X., Fu, Y., Fukuda-Yuzawa, Y., Fulciniti, M., Funnell, A. P., Gabdank, I., Galeev, T., Gao, M., Giron, C. G., Garvin, T. H., Gelboin-Burkhart, C. A., Georgolopoulos, G., Gerstein, M. B., Giardine, B. M., Gifford, D. K., Gilbert, D. M., Gilchrist, D. A., Gillespie, S., Gingeras, T. R., Gong, P., Gonzalez, A., Gonzalez, J. M., Good, P., Goren, A., Gorkin, D. U., Graveley, B. R., Gray, M., Greenblatt, J. F., Griffiths, E., Groudine, M. T., Grubert, F., Gu, M., Guigo, R., Guo, H., Guo, Y., Guo, Y., Gursoy, G., Gutierrez-Arcelus, M., Halow, J., Hardison, R. C., Hardy, M., Hariharan, M., Harmanci, A., Harrington, A., Harrow, J. L., Hashimoto, T. B., Hasz, R. D., Hatan, M., Haugen, E., Hayes, J. E., He, P., He, Y., Heidari, N., Hendrickson, D., Heuston, E. F., Hilton, J. A., Hitz, B. C., Hochman, A., Holgren, C., Hou, L., Hou, S., Hsiao, Y. E., Hsu, S., Huang, H., Hubbard, T. J., Huey, J., Hughes, T. R., Hunt, T., Ibarrientos, S., Issner, R., Iwata, M., Izuogu, O., Jaakkola, T., Jameel, N., Jansen, C., Jiang, L., Jiang, P., Johnson, A., Johnson, R., Jungreis, I., Kadaba, M., Kasowski, M., Kasparian, M., Kato, M., Kaul, R., Kawli, T., Kay, M., Keen, J. C., Keles, S., Keller, C. A., Kelley, D., Kellis, M., Kheradpour, P., Kim, D. S., Kirilusha, A., Klein, R. J., Knoechel, B., Kuan, S., Kulik, M. J., Kumar, S., Kundaje, A., Kutyavin, T., Lagarde, J., Lajoie, B. R., Lambert, N. J., Lazar, J., Lee, A. Y., Lee, D., Lee, E., Lee, J. W., Lee, K., Leslie, C. S., Levy, S., Li, B., Li, H., Li, N., Li, S., Li, X., Li, Y. I., Li, Y., Li, Y., Li, Y., Lian, J., Libbrecht, M. W., Lin, S., Lin, Y., Liu, D., Liu, J., Liu, P., Liu, T., Liu, X. S., Liu, Y., Liu, Y., Long, M., Lou, S., Loveland, J., Lu, A., Lu, Y., Lecuyer, E., Ma, L., Mackiewicz, M., Mannion, B. J., Mannstadt, M., Manthravadi, D., Marinov, G. K., Martin, F. J., Mattei, E., McCue, K., McEown, M., McVicker, G., Meadows, S. K., Meissner, A., Mendenhall, E. M., Messer, C. 
L., Meuleman, W., Meyer, C., Miller, S., Milton, M. G., Mishra, T., Moore, D. E., Moore, H. M., Moore, J. E., Moore, S. H., Moran, J., Mortazavi, A., Mudge, J. M., Munshi, N., Murad, R., Myers, R. M., Nandakumar, V., Nandi, P., Narasimha, A. M., Narayanan, A. K., Naughton, H., Navarro, F. C., Navas, P., Nazarovs, J., Nelson, J., Neph, S., Neri, F. J., Nery, J. R., Nesmith, A. R., Newberry, J. S., Newberry, K. M., Ngo, V., Nguyen, R., Nguyen, T. B., Nguyen, T., Nishida, A., Noble, W. S., Novak, C. S., Novoa, E. M., Nunez, B., O'Donnell, C. W., Olson, S., Onate, K. C., Otterman, E., Ozadam, H., Pagan, M., Palden, T., Pan, X., Park, Y., Partridge, E. C., Paten, B., Pauli-Behn, F., Pazin, M. J., Pei, B., Pennacchio, L. A., Perez, A. R., Perry, E. H., Pervouchine, D. D., Phalke, N. N., Pham, Q., Phanstiel, D. H., Plajzer-Frick, I., Pratt, G. A., Pratt, H. E., Preissl, S., Pritchard, J. K., Pritykin, Y., Purcaro, M. J., Qin, Q., Quinones-Valdez, G., Rabano, I., Radovani, E., Raj, A., Rajagopal, N., Ram, O., Ramirez, L., Ramirez, R. N., Rausch, D., Raychaudhuri, S., Raymond, J., Razavi, R., Reddy, T. E., Reimonn, T. M., Ren, B., Reymond, A., Reynolds, A., Rhie, S. K., Rinn, J., Rivera, M., Rivera-Mulia, J. C., Roberts, B., Rodriguez, J. M., Rozowsky, J., Ryan, R., Rynes, E., Salins, D. N., Sandstrom, R., Sasaki, T., Sathe, S., Savic, D., Scavelli, A., Scheiman, J., Schlaffner, C., Schloss, J. A., Schmitges, F. W., See, L. H., Sethi, A., Setty, M., Shafer, A., Shan, S., Sharon, E., Shen, Q., Shen, Y., Sherwood, R. I., Shi, M., Shin, S., Shoresh, N., Siebenthall, K., Sisu, C., Slifer, T., Sloan, C. A., Smith, A., Snetkova, V., Snyder, M. P., Spacek, D. V., Srinivasan, S., Srivas, R., Stamatoyannopoulos, G., Stamatoyannopoulos, J. A., Stanton, R., Steffan, D., Stehling-Sun, S., Strattan, J. S., Su, A., Sundararaman, B., Suner, M., Syed, T., Szynkarek, M., Tanaka, F. Y., Tenen, D., Teng, M., Thomas, J. A., Toffey, D., Tress, M. L., Trout, D. 
E., Trynka, G., Tsuji, J., Upchurch, S. A., Ursu, O., Uszczynska-Ratajczak, B., Uziel, M. C., Valencia, A., Biber, B. V., van der Velde, A. G., Van Nostrand, E. L., Vaydylevich, Y., Vazquez, J., Victorsen, A., Vielmetter, J., Vierstra, J., Visel, A., Vlasova, A., Vockley, C. M., Volpi, S., Vong, S., Wang, H., Wang, M., Wang, Q., Wang, R., Wang, T., Wang, W., Wang, X., Wang, Y., Watson, N. K., Wei, X., Wei, Z., Weisser, H., Weissman, S. M., Welch, R., Welikson, R. E., Weng, Z., Westra, H., Whitaker, J. W., White, C., White, K. P., Wildberg, A., Williams, B. A., Wine, D., Witt, H. N., Wold, B., Wolf, M., Wright, J., Xiao, R., Xiao, X., Xu, J., Xu, J., Yan, K., Yan, Y., Yang, H., Yang, X., Yang, Y., Yardimci, G. G., Yee, B. A., Yeo, G. W., Young, T., Yu, T., Yue, F., Zaleski, C., Zang, C., Zeng, H., Zeng, W., Zerbino, D. R., Zhai, J., Zhan, L., Zhan, Y., Zhang, B., Zhang, J., Zhang, J., Zhang, K., Zhang, L., Zhang, P., Zhang, Q., Zhang, X., Zhang, Y., Zhang, Z., Zhao, Y., Zheng, Y., Zhong, G., Zhou, X., Zhu, Y., Zimmerman, J. 2022

    View details for DOI 10.1038/s41586-021-04213-8

    View details for PubMedID 35474002

  • Beyond GWAS of Colorectal Cancer: Evidence of Interaction with Alcohol Consumption and Putative Causal Variant for the 10q24.2 Region. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology Jordahl, K. M., Shcherbina, A., Kim, A. E., Su, Y. R., Lin, Y., Wang, J., Qu, C., Albanes, D., Arndt, V., Baurley, J. W., Berndt, S. I., Bien, S. A., Bishop, D. T., Bouras, E., Brenner, H., Buchanan, D. D., Budiarto, A., Campbell, P. T., Carreras-Torres, R., Casey, G., Cenggoro, T. W., Chan, A. T., Conti, D. V., Dampier, C. H., Devall, M. A., Díez-Obrero, V., Dimou, N., Drew, D. A., Figueiredo, J. C., Gallinger, S., Giles, G. G., Gruber, S. B., Gsur, A., Gunter, M. J., Hampel, H., Harlid, S., Harrison, T. A., Hidaka, A., Hoffmeister, M., Huyghe, J. R., Jenkins, M. A., Joshi, A. D., Keku, T. O., Larsson, S. C., Le Marchand, L., Lewinger, J. P., Li, L., Mahesworo, B., Moreno, V., Morrison, J. L., Murphy, N., Nan, H., Nassir, R., Newcomb, P. A., Obón-Santacana, M., Ogino, S., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Peoples, A. R., Pharoah, P. D., Platz, E. A., Potter, J. D., Prentice, R. L., Rennert, G., Ruiz-Narvaez, E., Sakoda, L. C., Scacheri, P. C., Schmit, S. L., Schoen, R. E., Slattery, M. L., Stern, M. C., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Tian, Y., Tsilidis, K. K., Ulrich, C. M., van Duijnhoven, F. J., Van Guelpen, B., Visvanathan, K., Vodicka, P., White, E., Wolk, A., Woods, M. O., Wu, A. H., Zemlianskaia, N., Chang-Claude, J., Gauderman, W. J., Hsu, L., Kundaje, A., Peters, U. 2022: OF1-OF13

    Abstract

    Currently known associations between common genetic variants and colorectal cancer explain less than half of its heritability of 25%. As alcohol consumption has a J-shape association with colorectal cancer risk, nondrinking and heavy drinking are both risk factors for colorectal cancer. Individual-level data was pooled from the Colon Cancer Family Registry, Colorectal Transdisciplinary Study, and Genetics and Epidemiology of Colorectal Cancer Consortium to compare nondrinkers (≤1 g/day) and heavy drinkers (>28 g/day) with light-to-moderate drinkers (1-28 g/day) in GxE analyses. To improve power, we implemented joint 2df and 3df tests and a novel two-step method that modifies the weighted hypothesis testing framework. We prioritized putative causal variants by predicting allelic effects using support vector machine models. For nondrinking as compared with light-to-moderate drinking, the hybrid two-step approach identified 13 significant SNPs with pairwise r² > 0.9 in the 10q24.2/COX15 region. When stratified by alcohol intake, the A allele of lead SNP rs2300985 has a dose-response increase in risk of colorectal cancer as compared with the G allele in light-to-moderate drinkers [OR for GA genotype = 1.11; 95% confidence interval (CI), 1.06-1.17; OR for AA genotype = 1.22; 95% CI, 1.14-1.31], but not in nondrinkers or heavy drinkers. Among the correlated candidate SNPs in the 10q24.2/COX15 region, rs1318920 was predicted to disrupt an HNF4 transcription factor binding motif. Our study suggests that the association with colorectal cancer in 10q24.2/COX15 observed in genome-wide association study is strongest in nondrinkers. We also identified rs1318920 as the putative causal regulatory variant for the region. The study identifies multifaceted evidence of a possible functional effect for rs1318920.

    View details for DOI 10.1158/1055-9965.EPI-21-1003

    View details for PubMedID 35438744

  • fastISM: Performant in-silico saturation mutagenesis for convolutional neural networks. Bioinformatics (Oxford, England) Nair, S., Shrikumar, A., Schreiber, J., Kundaje, A. 2022

    Abstract

    MOTIVATION: Deep learning models such as convolutional neural networks are able to accurately map biological sequences to associated functional readouts and properties by learning predictive de novo representations. In-silico saturation mutagenesis (ISM) is a popular feature attribution technique for inferring contributions of all characters in an input sequence to the model's predicted output. The main drawback of ISM is its runtime, as it involves multiple forward propagations of all possible mutations of each character in the input sequence through the trained model to predict the effects on the output. RESULTS: We present fastISM, an algorithm that speeds up ISM by a factor of over 10x for commonly used convolutional neural network architectures. fastISM is based on the observations that the majority of computation in ISM is spent in convolutional layers, and a single mutation only disrupts a limited region of intermediate layers, rendering most computation redundant. fastISM reduces the gap between backpropagation-based feature attribution methods and ISM. It far surpasses the runtime of backpropagation-based methods on multi-output architectures, making it feasible to run ISM on a large number of sequences. AVAILABILITY: An easy-to-use Keras/TensorFlow 2 implementation of fastISM is available at https://github.com/kundajelab/fastISM. fastISM can be installed using pip install fastism. A hands-on tutorial can be found at https://colab.research.google.com/github/kundajelab/fastISM/blob/master/notebooks/colab/DeepSEA.ipynb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btac135

    View details for PubMedID 35238376
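
    The observation in the abstract above, that a single mutation only disrupts a limited region of intermediate layers, can be illustrated with a short receptive-field calculation. This is a sketch under stated assumptions and not the fastISM implementation: it assumes a plain stack of stride-1, same-padded 1D convolutions with odd kernel widths and no pooling or dilation, and the function name `affected_ranges` and the example kernel widths are hypothetical.

```python
from typing import List, Tuple


def affected_ranges(pos: int, kernel_widths: List[int], seq_len: int) -> List[Tuple[int, int]]:
    """Inclusive (start, end) output indices per layer that can change when input `pos` changes.

    Assumes stride-1, same-padded convolutions with odd kernel widths, so each
    layer widens the affected window by kernel_width // 2 on both sides.
    """
    lo, hi = pos, pos
    ranges: List[Tuple[int, int]] = []
    for width in kernel_widths:
        half = width // 2
        lo = max(0, lo - half)             # clip at the left edge of the sequence
        hi = min(seq_len - 1, hi + half)   # clip at the right edge of the sequence
        ranges.append((lo, hi))
    return ranges


if __name__ == "__main__":
    seq_len = 1000
    layers = [19, 11, 7]  # assumed kernel widths, for illustration only
    for depth, (start, end) in enumerate(affected_ranges(500, layers, seq_len), start=1):
        fraction = (end - start + 1) / seq_len
        print(f"layer {depth}: positions {start}-{end} ({fraction:.1%} of outputs)")
```

    Under these assumptions only a few percent of each layer's outputs depend on the mutated base; the remaining activations match the reference sequence, which is the redundancy the abstract refers to.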

  • MITI minimum information guidelines for highly multiplexed tissue images. Nature methods Schapiro, D., Yapp, C., Sokolov, A., Reynolds, S. M., Chen, Y., Sudar, D., Xie, Y., Muhlich, J., Arias-Camison, R., Arena, S., Taylor, A. J., Nikolov, M., Tyler, M., Lin, J., Burlingame, E. A., Human Tumor Atlas Network, Chang, Y. H., Farhi, S. L., Thorsson, V., Venkatamohan, N., Drewes, J. L., Pe'er, D., Gutman, D. A., Herrmann, M. D., Gehlenborg, N., Bankhead, P., Roland, J. T., Herndon, J. M., Snyder, M. P., Angelo, M., Nolan, G., Swedlow, J. R., Schultz, N., Merrick, D. T., Mazzili, S. A., Cerami, E., Rodig, S. J., Santagata, S., Sorger, P. K., Abravanel, D. L., Achilefu, S., Ademuyiwa, F. O., Adey, A. C., Aft, R., Ahn, K. J., Alikarami, F., Alon, S., Ashenberg, O., Baker, E., Baker, G. J., Bandyopadhyay, S., Bayguinov, P., Beane, J., Becker, W., Bernt, K., Betts, C. B., Bletz, J., Blosser, T., Boire, A., Boland, G. M., Boyden, E. S., Bucher, E., Bueno, R., Cai, Q., Cambuli, F., Campbell, J., Cao, S., Caravan, W., Chaligne, R., Chan, J. M., Chasnoff, S., Chatterjee, D., Chen, A. A., Chen, C., Chen, C., Chen, B., Chen, F., Chen, S., Chheda, M. G., Chin, K., Cho, H., Chun, J., Cisneros, L., Coffey, R. J., Cohen, O., Colditz, G. A., Cole, K. A., Collins, N., Cotter, D., Coussens, L. M., Coy, S., Creason, A. L., Cui, Y., Zhou, D. C., Curtis, C., Davies, S. R., Bruijn, I., Delorey, T. M., Demir, E., Denardo, D., Diep, D., Ding, L., DiPersio, J., Dubinett, S. M., Eberlein, T. J., Eddy, J. A., Esplin, E. D., Factor, R. E., Fatahalian, K., Feiler, H. S., Fernandez, J., Fields, A., Fields, R. C., Fitzpatrick, J. A., Ford, J. M., Franklin, J., Fulton, B., Gaglia, G., Galdieri, L., Ganesh, K., Gao, J., Gaudio, B. L., Getz, G., Gibbs, D. L., Gillanders, W. E., Goecks, J., Goodwin, D., Gray, J. W., Greenleaf, W., Grimm, L. J., Gu, Q., Guerriero, J. L., Guha, T., Guimaraes, A. R., Gutierrez, B., Hacohen, N., Hanson, C. R., Harris, C. R., Hawkins, W. G., Heiser, C. N., Hoffer, J., Hollmann, T. J., Hsieh, J. J., Huang, J., Hunger, S. P., Hwang, E., Iacobuzio-Donahue, C., Iglesia, M. D., Islam, M., Izar, B., Jacobson, C. A., Janes, S., Jayasinghe, R. G., Jeudi, T., Johnson, B. E., Johnson, B. E., Ju, T., Kadara, H., Karnoub, E., Karpova, A., Khan, A., Kibbe, W., Kim, A. H., King, L. M., Kozlowski, E., Krishnamoorthy, P., Krueger, R., Kundaje, A., Ladabaum, U., Laquindanum, R., Lau, C., Lau, K. S., LeBoeuf, N. R., Lee, H., Lenburg, M., Leshchiner, I., Levy, R., Li, Y., Lian, C. G., Liang, W., Lim, K., Lin, Y., Liu, D., Liu, Q., Liu, R., Lo, J., Lo, P., Longabaugh, W. J., Longacre, T., Luckett, K., Ma, C., Maher, C., Maier, A., Makowski, D., Maley, C., Maliga, Z., Manoj, P., Maris, J. M., Markham, N., Marks, J. R., Martinez, D., Mashl, J., Masilionis, I., Massague, J., Mazurowski, M. A., McKinley, E. T., McMichael, J., Meyerson, M., Mills, G. B., Mitri, Z. I., Moorman, A., Mudd, J., Murphy, G. F., Deen, N. N., Navin, N. E., Nawy, T., Ness, R. M., Nevins, S., Nirmal, A. J., Novikov, E., Oh, S. T., Oldridge, D. A., Owzar, K., Pant, S. M., Park, W., Patti, G. J., Paul, K., Pelletier, R., Persson, D., Petty, C., Pfister, H., Polyak, K., Puram, S. V., Qiu, Q., Villalonga, A. Q., Ramirez, M. A., Rashid, R., Reeb, A. N., Reid, M. E., Remsik, J., Riesterer, J. L., Risom, T., Ritch, C. C., Rolong, A., Rudin, C. M., Ryser, M. D., Sato, K., Sears, C. L., Semenov, Y. R., Shen, J., Shoghi, K. I., Shrubsole, M. J., Shyr, Y., Sibley, A. B., Simmons, A. J., Sinha, A., Sivagnanam, S., Song, S., Southar-Smith, A., Spira, A. E., Cyr, J. S., Stefankiewicz, S., Storrs, E. P., Stover, E. H., Strand, S. H., Straub, C., Street, C., Su, T., Surrey, L. F., Suver, C., Tan, K., Terekhanova, N. V., Ternes, L., Thadi, A., Thomas, G., Tibshirani, R., Umeda, S., Uzun, Y., Vallius, T., Van Allen, E. R., Vandekar, S., Vega, P. N., Veis, D. J., Vennam, S., Verma, A., Vigneau, S., Wagle, N., Wahl, R., Walle, T., Wang, L., Warchol, S., Washington, M. K., Watson, C., Weimer, A. K., Wendl, M. C., West, R. B., White, S., Windon, A. L., Wu, H., Wu, C., Wu, Y., Wyczalkowski, M. A., Xu, J., Yao, L., Yu, W., Zhang, K., Zhu, X. 2022; 19 (3): 262-267

    View details for DOI 10.1038/s41592-022-01415-4

    View details for PubMedID 35277708

  • The chromatin organization of a chlorarachniophyte nucleomorph genome. Genome biology Marinov, G. K., Chen, X., Wu, T., He, C., Grossman, A. R., Kundaje, A., Greenleaf, W. J. 2022; 23 (1): 65

    Abstract

    BACKGROUND: Nucleomorphs are remnants of secondary endosymbiotic events between two eukaryote cells wherein the endosymbiont has retained its eukaryotic nucleus. Nucleomorphs have evolved at least twice independently, in chlorarachniophytes and cryptophytes, yet they have converged on a remarkably similar genomic architecture, characterized by the most extreme compression and miniaturization among all known eukaryotic genomes. Previous computational studies have suggested that nucleomorph chromatin likely exhibits a number of divergent features. RESULTS: In this work, we provide the first maps of open chromatin, active transcription, and three-dimensional organization for the nucleomorph genome of the chlorarachniophyte Bigelowiella natans. We find that the B. natans nucleomorph genome exists in a highly accessible state, akin to that of ribosomal DNA in some other eukaryotes, and that it is highly transcribed over its entire length, with few signs of polymerase pausing at transcription start sites (TSSs). At the same time, most nucleomorph TSSs show very strong nucleosome positioning. Chromosome conformation (Hi-C) maps reveal that nucleomorph chromosomes interact with one another at their telomeric regions and show the relative contact frequencies between the multiple genomic compartments of distinct origin that B. natans cells contain. CONCLUSIONS: We provide the first study of a nucleomorph genome using modern functional genomic tools, and derive numerous novel insights into the physical and functional organization of these unique genomes.

    View details for DOI 10.1186/s13059-022-02639-5

    View details for PubMedID 35232465

  • Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network (vol 12, 3297, 2021). Nature communications Grapotte, M., Saraswat, M., Bessiere, C., Menichelli, C., Ramilowski, J. A., Severin, J., Hayashizaki, Y., Itoh, M., Tagami, M., Murata, M., Kojima-Ishiyama, M., Noma, S., Noguchi, S., Kasukawa, T., Hasegawa, A., Suzuki, H., Nishiyori-Sueki, H., Frith, M. C., Chatelain, C., Carninci, P., de Hoon, M. L., Wasserman, W. W., Brehelin, L., Lecellier, C., FANTOM consortium 2022; 13 (1): 1200

    View details for DOI 10.1038/s41467-022-28758-y

    View details for Web of Science ID 000771136200018

    View details for PubMedID 35232988

    View details for PubMedCentralID PMC8888638

  • Short tandem repeats recruit transcription factors to tune eukaryotic gene expression. Horton, C. A., Alexandari, A. M., Hayes, M. G., Schaepe, J. M., Marklund, E., Shah, N., Aditham, A. K., Shrikumar, A., Afek, A., Greenleaf, W. J., Gordan, R., Zeitlinger, J., Kundaje, A., Fordyce, P. M. CELL PRESS. 2022: 287A-288A

  • Domain adaptive neural networks improve cross-species prediction of transcription factor binding. Genome research Cochran, K., Srivastava, D., Shrikumar, A., Balsubramani, A., Hardison, R. C., Kundaje, A., Mahony, S. 2022

    Abstract

    The intrinsic DNA sequence preferences and cell-type specific cooperative partners of transcription factors (TFs) are typically highly conserved. Hence, despite the rapid evolutionary turnover of individual TF binding sites, predictive sequence models of cell-type specific genomic occupancy of a TF in one species should generalize to closely matched cell types in a related species. To assess the viability of cross-species TF binding prediction, we train neural networks to discriminate ChIP-seq peak locations from genomic background and evaluate their performance within and across species. Cross-species predictive performance is consistently worse than within-species performance, which we show is caused in part by species-specific repeats. To account for this domain shift, we use an augmented network architecture to automatically discourage learning of training species-specific sequence features. This domain adaptation approach corrects for prediction errors on species-specific repeats and improves overall cross-species model performance. Our results demonstrate that cross-species TF binding prediction is feasible when models account for domain shifts driven by species-specific repeats.

    View details for DOI 10.1101/gr.275394.121

    View details for PubMedID 35042722

  • ZEB2 Shapes the Epigenetic Landscape of Atherosclerosis. Circulation Cheng, P., Wirka, R. C., Clarke, L., Zhao, Q., Kundu, R., Nguyen, T., Nair, S., Sharma, D., Kim, H., Shi, H., Assimes, T., Kim, J., Kundaje, A., Quertermous, T. 2022; 145 (6): 469–485

    Abstract

    Background: Smooth muscle cells (SMC) transition into a number of different phenotypes during atherosclerosis, including those that resemble fibroblasts and chondrocytes, and make up the majority of cells in the atherosclerotic plaque. To better understand the epigenetic and transcriptional mechanisms that mediate these cell state changes, and how they relate to risk for coronary artery disease (CAD), we have investigated the causality and function of transcription factors (TFs) at genome-wide associated loci. Methods: We employed CRISPR-Cas9 genome and epigenome editing to identify the causal gene and cell(s) for a complex CAD GWAS signal at 2q22.3. Subsequently, single-cell epigenetic and transcriptomic profiling in murine models and human coronary artery smooth muscle cells were employed to understand the cellular and molecular mechanism by which this CAD risk gene exerts its function. Results: CRISPR-Cas9 genome and epigenome editing showed that the complex CAD genetic signals within a genomic region at 2q22.3 lie within smooth muscle long-distance enhancers for ZEB2, a TF extensively studied in the context of epithelial mesenchymal transition (EMT) in development and cancer. ZEB2 regulates SMC phenotypic transition through chromatin remodeling that obviates accessibility and disrupts both Notch and TGFβ signaling, thus altering the epigenetic trajectory of SMC transitions. SMC specific loss of ZEB2 resulted in an inability of transitioning SMCs to turn off contractile programming and take on a fibroblast-like phenotype, but accelerated the formation of chondromyocytes, mirroring features of high-risk atherosclerotic plaques in human coronary arteries. Conclusions: These studies identify ZEB2 as a new CAD GWAS gene that affects features of plaque vulnerability through direct effects on the epigenome, providing a new therapeutic approach to target vascular disease.

    View details for DOI 10.1161/CIRCULATIONAHA.121.057789

  • A Congenital Anemia Reveals Distinct Targeting Mechanisms for Master Transcription Factor GATA1. Blood Ludwig, L., Lareau, C. A., Bao, E. L., Liu, N., Utsugisawa, T., Tseng, A. M., Myers, S. A., Verboon, J. M., Ulirsch, J. C., Luo, W., Muus, C., Fiorini, C., Olive, M. E., Vockley, C. M., Munschauer, M., Hunter, A., Ogura, H., Yamamoto, T., Inada, H., Nakagawa, S., Ozono, S., Subramanian, V., Chiarle, R., Glader, B., Carr, S. A., Aryee, M. J., Kundaje, A., Orkin, S., Regev, A., McCavit, T., Kanno, H., Sankaran, V. G. 2022

    Abstract

    Master regulators, such as the hematopoietic transcription factor (TF) GATA1, play an essential role in orchestrating lineage commitment and differentiation. However, the precise mechanisms by which such TFs regulate transcription through interactions with specific cis-regulatory elements remain incompletely understood. Here, we describe a form of congenital hemolytic anemia caused by missense mutations in an intrinsically disordered region of GATA1, with a poorly understood role in transcriptional regulation. Through integrative functional approaches, we demonstrate that these mutations perturb GATA1 transcriptional activity by partially impairing nuclear localization and selectively altering precise chromatin occupancy by GATA1. These alterations in chromatin occupancy and concordant chromatin accessibility changes alter faithful gene expression, with failure to both effectively silence and activate select genes necessary for effective terminal red cell production. We demonstrate how disease-causing mutations can reveal regulatory mechanisms that enable the faithful genomic targeting of master TFs during cellular differentiation.

    View details for DOI 10.1182/blood.2021013753

    View details for PubMedID 35030251

  • Single-Molecule Multikilobase-Scale Profiling of Chromatin Accessibility Using m6A-SMAC-Seq and m6A-CpG-GpC-SMAC-Seq. Methods in molecular biology (Clifton, N.J.) Marinov, G. K., Shipony, Z., Kundaje, A., Greenleaf, W. J. 2022; 2458: 269-298

    Abstract

    A hallmark feature of active cis-regulatory elements (CREs) in eukaryotes is their nucleosomal depletion and, accordingly, higher accessibility to enzymatic treatment. This property has been the basis of a number of sequencing-based assays for genome-wide identification and tracking the activity of CREs across different biological conditions, such as DNase-seq, ATAC-seq, NOMe-seq, and others. However, the fragmentation of DNA inherent to many of these assays and the limited read length of short-read sequencing platforms have so far not allowed the simultaneous measurement of the chromatin accessibility state of CREs located distally from each other. The combination of labeling accessible DNA with DNA modifications and nanopore sequencing has made it possible to develop such assays. Here, we provide a detailed protocol for carrying out the SMAC-seq assay (Single-Molecule long-read Accessible Chromatin mapping sequencing), in its m6A-SMAC-seq and m6A-CpG-GpC-SMAC-seq variants, together with methods for data processing and analysis, and discuss key experimental and analytical considerations for working with SMAC-seq datasets.

    View details for DOI 10.1007/978-1-0716-2140-0_15

    View details for PubMedID 35103973

  • ZEB2 Shapes the Epigenetic Landscape of Atherosclerosis. Circulation Cheng, P., Wirka, R. C., Clarke, L. S., Zhao, Q., Kundu, R., Nguyen, T., Nair, S., Sharma, D., Kim, H. J., Shi, H., Assimes, T., Kim, J. B., Kundaje, A., Quertermous, T. 2022

    View details for DOI 10.1161/CIRCULATIONAHA.121.057789

    View details for PubMedID 34990206

  • Transcriptional and chromatin-based partitioning mechanisms uncouple protein scaling from cell size. Molecular cell Swaffer, M. P., Kim, J., Chandler-Brown, D., Langhinrichs, M., Marinov, G. K., Greenleaf, W. J., Kundaje, A., Schmoller, K. M., Skotheim, J. M. 2021

    Abstract

    Biosynthesis scales with cell size such that protein concentrations generally remain constant as cells grow. As an exception, synthesis of the cell-cycle inhibitor Whi5 "sub-scales" with cell size so that its concentration is lower in larger cells to promote cell-cycle entry. Here, we find that transcriptional control uncouples Whi5 synthesis from cell size, and we identify histones as the major class of sub-scaling transcripts besides WHI5 by screening for similar genes. Histone synthesis is thereby matched to genome content rather than cell size. Such sub-scaling proteins are challenged by asymmetric cell division because proteins are typically partitioned in proportion to newborn cell volume. To avoid this fate, Whi5 uses chromatin-binding to partition similar protein amounts to each newborn cell regardless of cell size. Disrupting both Whi5 synthesis and chromatin-based partitioning weakens G1 size control. Thus, specific transcriptional and partitioning mechanisms determine protein sub-scaling to control cell size.

    View details for DOI 10.1016/j.molcel.2021.10.007

    View details for PubMedID 34731644

  • Cell-specific Chromatin Landscape Of Human Coronary Artery Resolves Mechanisms Of Disease Risk. Turner, A. W., Hu, S., Mosquera, J., Ma, W., Hodonsky, C. J., Wong, D., Auguste, G. E., Sol-Church, K., Farber, E., Kundu, S., Kundaje, A. B., Lopez, N. G., Ma, L., Ghosh, S., Onengut-Gumuscu, S., Ashley, E. A., Quertermous, T., Finn, A., Leeper, N. J., Kovacic, J. C., Bjorkegren, J. L., Zang, C., Miller, C. L. LIPPINCOTT WILLIAMS & WILKINS. 2021

  • Cell-free DNA fragments inform epigenomic mechanisms for early detection of breast cancer. Gafni, E., Harvey, A., Jaroszewicz, A., Solari, O., Landolin, J., Barbirou, M., Miller, A., Tonellato, P. J., Kundaje, A., Jeffrey, S. S., Curtis, C., Sledge, G. W., Giresi, P., Boley, N. AMER ASSOC CANCER RESEARCH. 2021

  • AP-1 is a temporally regulated dual gatekeeper of reprogramming to pluripotency. Proceedings of the National Academy of Sciences of the United States of America Markov, G. J., Mai, T., Nair, S., Shcherbina, A., Wang, Y. X., Burns, D. M., Kundaje, A., Blau, H. M. 2021; 118 (23)

    Abstract

    Somatic cell transcription factors are critical to maintaining cellular identity and constitute a barrier to human somatic cell reprogramming; yet a comprehensive understanding of the mechanism of action is lacking. To gain insight, we examined epigenome remodeling at the onset of human nuclear reprogramming by profiling human fibroblasts after fusion with murine embryonic stem cells (ESCs). By assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) and chromatin immunoprecipitation sequencing we identified enrichment for the activator protein 1 (AP-1) transcription factor c-Jun at regions of early transient accessibility at fibroblast-specific enhancers. Expression of a dominant negative AP-1 mutant (dnAP-1) reduced accessibility and expression of fibroblast genes, overcoming the barrier to reprogramming. Remarkably, efficient reprogramming of human fibroblasts to induced pluripotent stem cells was achieved by transduction with vectors expressing SOX2, KLF4, and inducible dnAP-1, demonstrating that dnAP-1 can substitute for exogenous human OCT4. Mechanistically, we show that the AP-1 component c-Jun has two unexpected temporally distinct functions in human reprogramming: 1) to potentiate fibroblast enhancer accessibility and fibroblast-specific gene expression, and 2) to bind to and repress OCT4 as a complex with MBD3. Our findings highlight AP-1 as a previously unrecognized potent dual gatekeeper of the somatic cell state.

    View details for DOI 10.1073/pnas.2104841118

    View details for PubMedID 34088849

  • Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network. Nature communications Grapotte, M., Saraswat, M., Bessiere, C., Menichelli, C., Ramilowski, J. A., Severin, J., Hayashizaki, Y., Itoh, M., Tagami, M., Murata, M., Kojima-Ishiyama, M., Noma, S., Noguchi, S., Kasukawa, T., Hasegawa, A., Suzuki, H., Nishiyori-Sueki, H., Frith, M. C., Chatelain, C., Carninci, P., de Hoon, M. L., Wasserman, W. W., Brehelin, L., Lecellier, C., FANTOM Consortium 2021; 12 (1): 3297

    Abstract

    Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.

    View details for DOI 10.1038/s41467-021-23143-7

    View details for Web of Science ID 000660869500001

    View details for PubMedID 34078885

    View details for PubMedCentralID PMC8172540

  • Transcription-dependent domain-scale three-dimensional genome organization in the dinoflagellate Breviolum minutum. Nature genetics Marinov, G. K., Trevino, A. E., Xiang, T., Kundaje, A., Grossman, A. R., Greenleaf, W. J. 2021

    Abstract

    Dinoflagellate chromosomes represent a unique evolutionary experiment, as they exist in a permanently condensed, liquid crystalline state; are not packaged by histones; and contain genes organized into tandem gene arrays, with minimal transcriptional regulation. We analyze the three-dimensional genome of Breviolum minutum, and find large topological domains (dinoflagellate topologically associating domains, which we term 'dinoTADs') without chromatin loops, which are demarcated by convergent gene array boundaries. Transcriptional inhibition disrupts dinoTADs, implicating transcription-induced supercoiling as the primary topological force in dinoflagellates.

    View details for DOI 10.1038/s41588-021-00848-5

    View details for PubMedID 33927397

  • Publisher Correction: MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome biology Cheng, J., Celik, M. H., Kundaje, A., Gagneur, J. 2021; 22 (1): 107

    View details for DOI 10.1186/s13059-021-02338-7

    View details for PubMedID 33858505

  • Genome-wide enhancer maps link risk variants to disease genes. Nature Nasser, J., Bergman, D. T., Fulco, C. P., Guckelberger, P., Doughty, B. R., Patwardhan, T. A., Jones, T. R., Nguyen, T. H., Ulirsch, J. C., Lekschas, F., Mualim, K., Natri, H. M., Weeks, E. M., Munson, G., Kane, M., Kang, H. Y., Cui, A., Ray, J. P., Eisenhaure, T. M., Collins, R. L., Dey, K., Pfister, H., Price, A. L., Epstein, C. B., Kundaje, A., Xavier, R. J., Daly, M. J., Huang, H., Finucane, H. K., Hacohen, N., Lander, E. S., Engreitz, J. M. 2021

    Abstract

    Genome-wide association studies (GWAS) have identified thousands of noncoding loci that are associated with human diseases and complex traits, each of which could reveal insights into the mechanisms of disease. Many of the underlying causal variants may affect enhancers, but we lack accurate maps of enhancers and their target genes to interpret such variants. We recently developed the activity-by-contact (ABC) model to predict which enhancers regulate which genes and validated the model using CRISPR perturbations in several cell types. Here we apply this ABC model to create enhancer-gene maps in 131 human cell types and tissues, and use these maps to interpret the functions of GWAS variants. Across 72 diseases and complex traits, ABC links 5,036 GWAS signals to 2,249 unique genes, including a class of 577 genes that appear to influence multiple phenotypes through variants in enhancers that act in different cell types. In inflammatory bowel disease (IBD), causal variants are enriched in predicted enhancers by more than 20-fold in particular cell types such as dendritic cells, and ABC achieves higher precision than other regulatory methods at connecting noncoding variants to target genes. These variant-to-function maps reveal an enhancer that contains an IBD risk variant and that regulates the expression of PPIF to alter the membrane potential of mitochondria in macrophages. Our study reveals principles of genome regulation, identifies genes that affect IBD and provides a resource and generalizable strategy to connect risk variants of common diseases to their molecular and cellular functions.

    View details for DOI 10.1038/s41586-021-03446-x

    View details for PubMedID 33828297

  • WILDS: A Benchmark of in-the-Wild Distribution Shifts. Koh, P., Sagawa, S., Marklund, H., Xie, S., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R., Gao, I., Lee, T., David, E., Stavness, I., Guo, W., Earnshaw, B. A., Haque, I. S., Beery, S., Leskovec, J., Kundaje, A., Pierson, E., Levine, S., Finn, C., Liang, P., Meila, M., Zhang, T. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2021

  • Chromatin and gene-regulatory dynamics of the developing human cerebral cortex at single-cell resolution. Cell Trevino, A. E., Müller, F., Andersen, J., Sundaram, L., Kathiria, A., Shcherbina, A., Farh, K., Chang, H. Y., Pașca, A. M., Kundaje, A., Pașca, S. P., Greenleaf, W. J. 2021

    Abstract

    Genetic perturbations of cortical development can lead to neurodevelopmental disease, including autism spectrum disorder (ASD). To identify genomic regions crucial to corticogenesis, we mapped the activity of gene-regulatory elements generating a single-cell atlas of gene expression and chromatin accessibility both independently and jointly. This revealed waves of gene regulation by key transcription factors (TFs) across a nearly continuous differentiation trajectory, distinguished the expression programs of glial lineages, and identified lineage-determining TFs that exhibited strong correlation between linked gene-regulatory elements and expression levels. These highly connected genes adopted an active chromatin state in early differentiating cells, consistent with lineage commitment. Base-pair-resolution neural network models identified strong cell-type-specific enrichment of noncoding mutations predicted to be disruptive in a cohort of ASD individuals and identified frequently disrupted TF binding sites. This approach illustrates how cell-type-specific mapping can provide insights into the programs governing human development and disease.

    View details for DOI 10.1016/j.cell.2021.07.039

    View details for PubMedID 34390642

  • Learning cis-regulatory principles of ADAR-based RNA editing from CRISPR-mediated mutagenesis. Nature communications Liu, X., Sun, T., Shcherbina, A., Li, Q., Jarmoskaite, I., Kappel, K., Ramaswami, G., Das, R., Kundaje, A., Li, J. B. 2021; 12 (1): 2165

    Abstract

    Adenosine-to-inosine (A-to-I) RNA editing catalyzed by ADAR enzymes occurs in double-stranded RNAs. Despite a compelling need towards predictive understanding of natural and engineered editing events, how the RNA sequence and structure determine the editing efficiency and specificity (i.e., cis-regulation) is poorly understood. We apply a CRISPR/Cas9-mediated saturation mutagenesis approach to generate libraries of mutations near three natural editing substrates at their endogenous genomic loci. We use machine learning to integrate diverse RNA sequence and structure features to model editing levels measured by deep sequencing. We confirm known features and identify new features important for RNA editing. Training and testing XGBoost algorithm within the same substrate yield models that explain 68 to 86 percent of substrate-specific variation in editing levels. However, the models do not generalize across substrates, suggesting complex and context-dependent regulation patterns. Our integrative approach can be applied to larger scale experiments towards deciphering the RNA editing code.

    View details for DOI 10.1038/s41467-021-22489-2

    View details for PubMedID 33846332

  • MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome biology Cheng, J., Çelik, M. H., Kundaje, A., Gagneur, J. 2021; 22 (1): 94

    Abstract

    We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting tissue-specific variations associated with genetic variants in most tissues of the GTEx dataset, with largest improvements on brain tissues. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. We foresee that MTSplice will aid interpreting variants associated with tissue-specific disorders.

    View details for DOI 10.1186/s13059-021-02273-7

    View details for PubMedID 33789710

  • Genetic effects on transcriptome profiles in colon epithelium provide functional insights for genetic risk loci. Cellular and molecular gastroenterology and hepatology Díez-Obrero, V., Dampier, C. H., Moratalla-Navarro, F., Devall, M., Plummer, S. J., Díez-Villanueva, A., Peters, U., Bien, S., Huyghe, J. R., Kundaje, A., Ibáñez-Sanz, G., Guinó, E., Obón-Santacana, M., Carreras-Torres, R., Casey, G., Moreno, V. 2021

    Abstract

    The association of genetic variation with tissue-specific gene expression and alternative splicing guides functional characterization of complex trait associated loci and may suggest novel genes implicated in disease. Here, we aimed to 1) generate reference profiles of colon mucosa gene expression and alternative splicing and compare them across colon subsites (ascending, transverse and descending), 2) identify expression and splicing quantitative trait loci (QTLs), 3) find traits for which identified QTLs contribute to single nucleotide polymorphism (SNP)-based heritability, 4) propose candidate effector genes, and 5) provide a web-based visualization resource. We collected colonic mucosal biopsies from 485 healthy adults and performed bulk RNA sequencing (RNA-Seq). We performed genome-wide SNP genotyping from blood leukocytes. Statistical approaches and bioinformatics software were used for QTL identification and downstream analyses. We provided a complete quantification of gene expression and alternative splicing across colon subsites and described their differences. We identified thousands of expression and splicing QTLs and defined their enrichment at genome-wide regulatory regions. We found that part of the SNP-based heritability of diseases affecting colon tissue, such as colorectal cancer and inflammatory bowel disease, but also of diseases affecting other tissues, such as psychiatric conditions, can be explained by the identified QTLs. We provided candidate effector genes for multiple phenotypes. Finally, we provided the Colon Transcriptome Explorer (CoTrEx). We provided the largest characterization to date of gene expression and splicing across colon subsites. Our findings provide greater etiological insight into complex traits and diseases influenced by transcriptomic changes in colon tissue.

    View details for DOI 10.1016/j.jcmgh.2021.02.003

    View details for PubMedID 33601062

  • Genetic architectures of proximal and distal colorectal cancer are partly distinct. Gut Huyghe, J. R., Harrison, T. A., Bien, S. A., Hampel, H., Figueiredo, J. C., Schmit, S. L., Conti, D. V., Chen, S., Qu, C., Lin, Y., Barfield, R., Baron, J. A., Cross, A. J., Diergaarde, B., Duggan, D., Harlid, S., Imaz, L., Kang, H. M., Levine, D. M., Perduca, V., Perez-Cornago, A., Sakoda, L. C., Schumacher, F. R., Slattery, M. L., Toland, A. E., van Duijnhoven, F. J., Van Guelpen, B., Agudo, A., Albanes, D., Alonso, M. H., Anderson, K., Arnau-Collell, C., Arndt, V., Banbury, B. L., Bassik, M. C., Berndt, S. I., Bézieau, S., Bishop, D. T., Boehm, J., Boeing, H., Boutron-Ruault, M. C., Brenner, H., Brezina, S., Buch, S., Buchanan, D. D., Burnett-Hartman, A., Caan, B. J., Campbell, P. T., Carr, P. R., Castells, A., Castellví-Bel, S., Chan, A. T., Chang-Claude, J., Chanock, S. J., Curtis, K. R., de la Chapelle, A., Easton, D. F., English, D. R., Feskens, E. J., Gala, M., Gallinger, S. J., Gauderman, W. J., Giles, G. G., Goodman, P. J., Grady, W. M., Grove, J. S., Gsur, A., Gunter, M. J., Haile, R. W., Hampe, J., Hoffmeister, M., Hopper, J. L., Hsu, W. L., Huang, W. Y., Hudson, T. J., Jenab, M., Jenkins, M. A., Joshi, A. D., Keku, T. O., Kooperberg, C., Kühn, T., Küry, S., Le Marchand, L., Lejbkowicz, F., Li, C. I., Li, L., Lieb, W., Lindblom, A., Lindor, N. M., Männistö, S., Markowitz, S. D., Milne, R. L., Moreno, L., Murphy, N., Nassir, R., Offit, K., Ogino, S., Panico, S., Parfrey, P. S., Pearlman, R., Pharoah, P. D., Phipps, A. I., Platz, E. A., Potter, J. D., Prentice, R. L., Qi, L., Raskin, L., Rennert, G., Rennert, H. S., Riboli, E., Schafmayer, C., Schoen, R. E., Seminara, D., Song, M., Su, Y. R., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Trichopoulou, A., Ulrich, C. M., Visvanathan, K., Vodicka, P., Vodickova, L., Vymetalkova, V., Weigl, K., Weinstein, S. J., White, E., Wolk, A., Woods, M. O., Wu, A. H., Abecasis, G. R., Nickerson, D. A., Scacheri, P. C., Kundaje, A., Casey, G., Gruber, S. B., Hsu, L., Moreno, V., Hayes, R. B., Newcomb, P. A., Peters, U. 2021

    Abstract

    An understanding of the etiologic heterogeneity of colorectal cancer (CRC) is critical for improving precision prevention, including individualized screening recommendations and the discovery of novel drug targets and repurposable drug candidates for chemoprevention. Known differences in molecular characteristics and environmental risk factors among tumors arising in different locations of the colorectum suggest partly distinct mechanisms of carcinogenesis. The extent to which the contribution of inherited genetic risk factors for CRC differs by anatomical subsite of the primary tumor has not been examined. To identify new anatomical subsite-specific risk loci, we performed genome-wide association study (GWAS) meta-analyses including data of 48 214 CRC cases and 64 159 controls of European ancestry. We characterised effect heterogeneity at CRC risk loci using multinomial modelling. We identified 13 loci that reached genome-wide significance (p<5×10⁻⁸) and that were not reported by previous GWASs for overall CRC risk. Multiple lines of evidence support candidate genes at several of these loci. We detected substantial heterogeneity between anatomical subsites. Just over half (61) of 109 known and new risk variants showed no evidence for heterogeneity. In contrast, 22 variants showed association with distal CRC (including rectal cancer), but no evidence for association or an attenuated association with proximal CRC. For two loci, there was strong evidence for effects confined to proximal colon cancer. Genetic architectures of proximal and distal CRC are partly distinct. Studies of risk factors and mechanisms of carcinogenesis, and precision prevention strategies should take into consideration the anatomical subsite of the tumour.

    View details for DOI 10.1136/gutjnl-2020-321534

    View details for PubMedID 33632709

  • Landscape of cohesin-mediated chromatin loops in the human genome Grubert, F., Srivas, R., Spacek, D., Kasowski, M., Ruiz-Velasco, M., Sinnott-Armstrong, N., Greenside, P., Narasimha, A., Liu, Q., Geller, B., Sanghi, A., Kulik, M., Sa, S., Rabinovitch, M., Kundaje, A., Dalton, S., Zaugg, J., Snyder, M. SPRINGERNATURE. 2020: 72
  • Transparency and reproducibility in artificial intelligence. Nature Haibe-Kains, B., Adam, G. A., Hosny, A., Khodakarami, F., Massive Analysis Quality Control (MAQC) Society Board of Directors, Waldron, L., Wang, B., McIntosh, C., Goldenberg, A., Kundaje, A., Greene, C. S., Broderick, T., Hoffman, M. M., Leek, J. T., Korthauer, K., Huber, W., Brazma, A., Pineau, J., Tibshirani, R., Hastie, T., Ioannidis, J. P., Quackenbush, J., Aerts, H. J., Shraddha, T., Kusko, R., Sansone, S., Tong, W., Wolfinger, R. D., Mason, C. E., Jones, W., Dopazo, J., Furlanello, C. 2020; 586 (7829): E14–E16

    View details for DOI 10.1038/s41586-020-2766-y

    View details for PubMedID 33057217

  • Perspectives on ENCODE. Nature ENCODE Project Consortium, Snyder, M. P., Gingeras, T. R., Moore, J. E., Weng, Z., Gerstein, M. B., Ren, B., Hardison, R. C., Stamatoyannopoulos, J. A., Graveley, B. R., Feingold, E. A., Pazin, M. J., Pagan, M., Gilchrist, D. A., Hitz, B. C., Cherry, J. M., Bernstein, B. E., Mendenhall, E. M., Zerbino, D. R., Frankish, A., Flicek, P., Myers, R. M., Abascal, F., Acosta, R., Addleman, N. J., Adrian, J., Afzal, V., Aken, B., Akiyama, J. A., Jammal, O. A., Amrhein, H., Anderson, S. M., Andrews, G. R., Antoshechkin, I., Ardlie, K. G., Armstrong, J., Astley, M., Banerjee, B., Barkal, A. A., Barnes, I. H., Barozzi, I., Barrell, D., Barson, G., Bates, D., Baymuradov, U. K., Bazile, C., Beer, M. A., Beik, S., Bender, M. A., Bennett, R., Bouvrette, L. P., Bernstein, B. E., Berry, A., Bhaskar, A., Bignell, A., Blue, S. M., Bodine, D. M., Boix, C., Boley, N., Borrman, T., Borsari, B., Boyle, A. P., Brandsmeier, L. A., Breschi, A., Bresnick, E. H., Brooks, J. A., Buckley, M., Burge, C. B., Byron, R., Cahill, E., Cai, L., Cao, L., Carty, M., Castanon, R. G., Castillo, A., Chaib, H., Chan, E. T., Chee, D. R., Chee, S., Chen, H., Chen, H., Chen, J., Chen, S., Cherry, J. M., Chhetri, S. B., Choudhary, J. S., Chrast, J., Chung, D., Clarke, D., Cody, N. A., Coppola, C. J., Coursen, J., D'Ippolito, A. M., Dalton, S., Danyko, C., Davidson, C., Davila-Velderrain, J., Davis, C. A., Dekker, J., Deran, A., DeSalvo, G., Despacio-Reyes, G., Dewey, C. N., Dickel, D. E., Diegel, M., Diekhans, M., Dileep, V., Ding, B., Djebali, S., Dobin, A., Dominguez, D., Donaldson, S., Drenkow, J., Dreszer, T. R., Drier, Y., Duff, M. O., Dunn, D., Eastman, C., Ecker, J. R., Edwards, M. D., El-Ali, N., Elhajjajy, S. I., Elkins, K., Emili, A., Epstein, C. B., Evans, R. C., Ezkurdia, I., Fan, K., Farnham, P. J., Farrell, N., Feingold, E. A., Ferreira, A., Fisher-Aylor, K., Fitzgerald, S., Flicek, P., Foo, C. 
S., Fortier, K., Frankish, A., Freese, P., Fu, S., Fu, X., Fu, Y., Fukuda-Yuzawa, Y., Fulciniti, M., Funnell, A. P., Gabdank, I., Galeev, T., Gao, M., Giron, C. G., Garvin, T. H., Gelboin-Burkhart, C. A., Georgolopoulos, G., Gerstein, M. B., Giardine, B. M., Gifford, D. K., Gilbert, D. M., Gilchrist, D. A., Gillespie, S., Gingeras, T. R., Gong, P., Gonzalez, A., Gonzalez, J. M., Good, P., Goren, A., Gorkin, D. U., Graveley, B. R., Gray, M., Greenblatt, J. F., Griffiths, E., Groudine, M. T., Grubert, F., Gu, M., Guigo, R., Guo, H., Guo, Y., Guo, Y., Gursoy, G., Gutierrez-Arcelus, M., Halow, J., Hardison, R. C., Hardy, M., Hariharan, M., Harmanci, A., Harrington, A., Harrow, J. L., Hashimoto, T. B., Hasz, R. D., Hatan, M., Haugen, E., Hayes, J. E., He, P., He, Y., Heidari, N., Hendrickson, D., Heuston, E. F., Hilton, J. A., Hitz, B. C., Hochman, A., Holgren, C., Hou, L., Hou, S., Hsiao, Y. E., Hsu, S., Huang, H., Hubbard, T. J., Huey, J., Hughes, T. R., Hunt, T., Ibarrientos, S., Issner, R., Iwata, M., Izuogu, O., Jaakkola, T., Jameel, N., Jansen, C., Jiang, L., Jiang, P., Johnson, A., Johnson, R., Jungreis, I., Kadaba, M., Kasowski, M., Kasparian, M., Kato, M., Kaul, R., Kawli, T., Kay, M., Keen, J. C., Keles, S., Keller, C. A., Kelley, D., Kellis, M., Kheradpour, P., Kim, D. S., Kirilusha, A., Klein, R. J., Knoechel, B., Kuan, S., Kulik, M. J., Kumar, S., Kundaje, A., Kutyavin, T., Lagarde, J., Lajoie, B. R., Lambert, N. J., Lazar, J., Lee, A. Y., Lee, D., Lee, E., Lee, J. W., Lee, K., Leslie, C. S., Levy, S., Li, B., Li, H., Li, N., Li, X., Li, Y. I., Li, Y., Li, Y., Li, Y., Lian, J., Libbrecht, M. W., Lin, S., Lin, Y., Liu, D., Liu, J., Liu, P., Liu, T., Liu, X. S., Liu, Y., Liu, Y., Long, M., Lou, S., Loveland, J., Lu, A., Lu, Y., Lecuyer, E., Ma, L., Mackiewicz, M., Mannion, B. J., Mannstadt, M., Manthravadi, D., Marinov, G. K., Martin, F. J., Mattei, E., McCue, K., McEown, M., McVicker, G., Meadows, S. K., Meissner, A., Mendenhall, E. M., Messer, C. 
L., Meuleman, W., Meyer, C., Miller, S., Milton, M. G., Mishra, T., Moore, D. E., Moore, H. M., Moore, J. E., Moore, S. H., Moran, J., Mortazavi, A., Mudge, J. M., Munshi, N., Murad, R., Myers, R. M., Nandakumar, V., Nandi, P., Narasimha, A. M., Narayanan, A. K., Naughton, H., Navarro, F. C., Navas, P., Nazarovs, J., Nelson, J., Neph, S., Neri, F. J., Nery, J. R., Nesmith, A. R., Newberry, J. S., Newberry, K. M., Ngo, V., Nguyen, R., Nguyen, T. B., Nguyen, T., Nishida, A., Noble, W. S., Novak, C. S., Novoa, E. M., Nunez, B., O'Donnell, C. W., Olson, S., Onate, K. C., Otterman, E., Ozadam, H., Pagan, M., Palden, T., Pan, X., Park, Y., Partridge, E. C., Paten, B., Pauli-Behn, F., Pazin, M. J., Pei, B., Pennacchio, L. A., Perez, A. R., Perry, E. H., Pervouchine, D. D., Phalke, N. N., Pham, Q., Phanstiel, D. H., Plajzer-Frick, I., Pratt, G. A., Pratt, H. E., Preissl, S., Pritchard, J. K., Pritykin, Y., Purcaro, M. J., Qin, Q., Quinones-Valdez, G., Rabano, I., Radovani, E., Raj, A., Rajagopal, N., Ram, O., Ramirez, L., Ramirez, R. N., Rausch, D., Raychaudhuri, S., Raymond, J., Razavi, R., Reddy, T. E., Reimonn, T. M., Ren, B., Reymond, A., Reynolds, A., Rhie, S. K., Rinn, J., Rivera, M., Rivera-Mulia, J. C., Roberts, B., Rodriguez, J. M., Rozowsky, J., Ryan, R., Rynes, E., Salins, D. N., Sandstrom, R., Sasaki, T., Sathe, S., Savic, D., Scavelli, A., Scheiman, J., Schlaffner, C., Schloss, J. A., Schmitges, F. W., See, L. H., Sethi, A., Setty, M., Shafer, A., Shan, S., Sharon, E., Shen, Q., Shen, Y., Sherwood, R. I., Shi, M., Shin, S., Shoresh, N., Siebenthall, K., Sisu, C., Slifer, T., Sloan, C. A., Smith, A., Snetkova, V., Snyder, M. P., Spacek, D. V., Srinivasan, S., Srivas, R., Stamatoyannopoulos, G., Stamatoyannopoulos, J. A., Stanton, R., Steffan, D., Stehling-Sun, S., Strattan, J. S., Su, A., Sundararaman, B., Suner, M., Syed, T., Szynkarek, M., Tanaka, F. Y., Tenen, D., Teng, M., Thomas, J. A., Toffey, D., Tress, M. L., Trout, D. 
E., Trynka, G., Tsuji, J., Upchurch, S. A., Ursu, O., Uszczynska-Ratajczak, B., Uziel, M. C., Valencia, A., Biber, B. V., van der Velde, A. G., Van Nostrand, E. L., Vaydylevich, Y., Vazquez, J., Victorsen, A., Vielmetter, J., Vierstra, J., Visel, A., Vlasova, A., Vockley, C. M., Volpi, S., Vong, S., Wang, H., Wang, M., Wang, Q., Wang, R., Wang, T., Wang, W., Wang, X., Wang, Y., Watson, N. K., Wei, X., Wei, Z., Weisser, H., Weissman, S. M., Welch, R., Welikson, R. E., Weng, Z., Westra, H., Whitaker, J. W., White, C., White, K. P., Wildberg, A., Williams, B. A., Wine, D., Witt, H. N., Wold, B., Wolf, M., Wright, J., Xiao, R., Xiao, X., Xu, J., Xu, J., Yan, K., Yan, Y., Yang, H., Yang, X., Yang, Y., Yardimci, G. G., Yee, B. A., Yeo, G. W., Young, T., Yu, T., Yue, F., Zaleski, C., Zang, C., Zeng, H., Zeng, W., Zerbino, D. R., Zhai, J., Zhan, L., Zhan, Y., Zhang, B., Zhang, J., Zhang, J., Zhang, K., Zhang, L., Zhang, P., Zhang, Q., Zhang, X., Zhang, Y., Zhang, Z., Zhao, Y., Zheng, Y., Zhong, G., Zhou, X., Zhu, Y., Zimmerman, J. 2020; 583 (7818): 693–98

    Abstract

    The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.

    View details for DOI 10.1038/s41586-020-2449-8

    View details for PubMedID 32728248

  • The Human Tumor Atlas Network: Charting Tumor Transitions across Space and Time at Single-Cell Resolution. Cell Rozenblatt-Rosen, O., Regev, A., Oberdoerffer, P., Nawy, T., Hupalowska, A., Rood, J. E., Ashenberg, O., Cerami, E., Coffey, R. J., Demir, E., Ding, L., Esplin, E. D., Ford, J. M., Goecks, J., Ghosh, S., Gray, J. W., Guinney, J., Hanlon, S. E., Hughes, S. K., Hwang, E. S., Iacobuzio-Donahue, C. A., Jane-Valbuena, J., Johnson, B. E., Lau, K. S., Lively, T., Mazzilli, S. A., Pe'er, D., Santagata, S., Shalek, A. K., Schapiro, D., Snyder, M. P., Sorger, P. K., Spira, A. E., Srivastava, S., Tan, K., West, R. B., Williams, E. H., Human Tumor Atlas Network, Aberle, D., Achilefu, S. I., Ademuyiwa, F. O., Adey, A. C., Aft, R. L., Agarwal, R., Aguilar, R. A., Alikarami, F., Allaj, V., Amos, C., Anders, R. A., Angelo, M. R., Anton, K., Ashenberg, O., Aster, J. C., Babur, O., Bahmani, A., Balsubramani, A., Barrett, D., Beane, J., Bender, D. E., Bernt, K., Berry, L., Betts, C. B., Bletz, J., Blise, K., Boire, A., Boland, G., Borowsky, A., Bosse, K., Bott, M., Boyden, E., Brooks, J., Bueno, R., Burlingame, E. A., Cai, Q., Campbell, J., Caravan, W., Cerami, E., Chaib, H., Chan, J. M., Chang, Y. H., Chatterjee, D., Chaudhary, O., Chen, A. A., Chen, B., Chen, C., Chen, C., Chen, F., Chen, Y., Chheda, M. G., Chin, K., Chiu, R., Chu, S., Chuaqui, R., Chun, J., Cisneros, L., Coffey, R. J., Colditz, G. A., Cole, K., Collins, N., Contrepois, K., Coussens, L. M., Creason, A. L., Crichton, D., Curtis, C., Davidsen, T., Davies, S. R., de Bruijn, I., Dellostritto, L., De Marzo, A., Demir, E., DeNardo, D. G., Diep, D., Ding, L., Diskin, S., Doan, X., Drewes, J., Dubinett, S., Dyer, M., Egger, J., Eng, J., Engelhardt, B., Erwin, G., Esplin, E. D., Esserman, L., Felmeister, A., Feiler, H. S., Fields, R. C., Fisher, S., Flaherty, K., Flournoy, J., Ford, J. M., Fortunato, A., Frangieh, A., Frye, J. L., Fulton, R. S., Galipeau, D., Gan, S., Gao, J., Gao, L., Gao, P., Gao, V. 
R., Geiger, T., George, A., Getz, G., Ghosh, S., Giannakis, M., Gibbs, D. L., Gillanders, W. E., Goecks, J., Goedegebuure, S. P., Gould, A., Gowers, K., Gray, J. W., Greenleaf, W., Gresham, J., Guerriero, J. L., Guha, T. K., Guimaraes, A. R., Guinney, J., Gutman, D., Hacohen, N., Hanlon, S., Hansen, C. R., Harismendy, O., Harris, K. A., Hata, A., Hayashi, A., Heiser, C., Helvie, K., Herndon, J. M., Hirst, G., Hodi, F., Hollmann, T., Horning, A., Hsieh, J. J., Hughes, S., Huh, W. J., Hunger, S., Hwang, S. E., Iacobuzio-Donahue, C. A., Ijaz, H., Izar, B., Jacobson, C. A., Janes, S., Jane-Valbuena, J., Jayasinghe, R. G., Jiang, L., Johnson, B. E., Johnson, B., Ju, T., Kadara, H., Kaestner, K., Kagan, J., Kalinke, L., Keith, R., Khan, A., Kibbe, W., Kim, A. H., Kim, E., Kim, J., Kolodzie, A., Kopytra, M., Kotler, E., Krueger, R., Krysan, K., Kundaje, A., Ladabaum, U., Lake, B. B., Lam, H., Laquindanum, R., Lau, K. S., Laughney, A. M., Lee, H., Lenburg, M., Leonard, C., Leshchiner, I., Levy, R., Li, J., Lian, C. G., Lim, K., Lin, J., Lin, Y., Liu, Q., Liu, R., Lively, T., Longabaugh, W. J., Longacre, T., Ma, C. X., Macedonia, M. C., Madison, T., Maher, C. A., Maitra, A., Makinen, N., Makowski, D., Maley, C., Maliga, Z., Mallo, D., Maris, J., Markham, N., Marks, J., Martinez, D., Mashl, R. J., Masilionais, I., Mason, J., Massague, J., Massion, P., Mattar, M., Mazurchuk, R., Mazutis, L., Mazzilli, S. A., McKinley, E. T., McMichael, J. F., Merrick, D., Meyerson, M., Miessner, J. R., Mills, G. B., Mills, M., Mondal, S. B., Mori, M., Mori, Y., Moses, E., Mosse, Y., Muhlich, J. L., Murphy, G. F., Navin, N. E., Nawy, T., Nederlof, M., Ness, R., Nevins, S., Nikolov, M., Nirmal, A. J., Nolan, G., Novikov, E., Oberdoerffer, P., O'Connell, B., Offin, M., Oh, S. T., Olson, A., Ooms, A., Ossandon, M., Owzar, K., Parmar, S., Patel, T., Patti, G. J., Pe'er, D., Pe'er, I., Peng, T., Persson, D., Petty, M., Pfister, H., Polyak, K., Pourfarhangi, K., Puram, S. 
V., Qiu, Q., Quintanal-Villalonga, A., Raj, A., Ramirez-Solano, M., Rashid, R., Reeb, A. N., Regev, A., Reid, M., Resnick, A., Reynolds, S. M., Riesterer, J. L., Rodig, S., Roland, J. T., Rosenfield, S., Rotem, A., Roy, S., Rozenblatt-Rosen, O., Rudin, C. M., Ryser, M. D., Santagata, S., Santi-Vicini, M., Sato, K., Schapiro, D., Schrag, D., Schultz, N., Sears, C. L., Sears, R. C., Sen, S., Sen, T., Shalek, A., Sheng, J., Sheng, Q., Shoghi, K. I., Shrubsole, M. J., Shyr, Y., Sibley, A. B., Siex, K., Simmons, A. J., Singer, D. S., Sivagnanam, S., Slyper, M., Snyder, M. P., Sokolov, A., Song, S., Sorger, P. K., Southard-Smith, A., Spira, A., Srivastava, S., Stein, J., Storm, P., Stover, E., Strand, S. H., Su, T., Sudar, D., Sullivan, R., Surrey, L., Suva, M., Tan, K., Terekhanova, N. V., Ternes, L., Thammavong, L., Thibault, G., Thomas, G. V., Thorsson, V., Todres, E., Tran, L., Tyler, M., Uzun, Y., Vachani, A., Van Allen, E., Vandekar, S., Veis, D. J., Vigneau, S., Vossough, A., Waanders, A., Wagle, N., Wang, L., Wendl, M. C., West, R., Williams, E. H., Wu, C., Wu, H., Wu, H., Wyczalkowski, M. A., Xie, Y., Yang, X., Yapp, C., Yu, W., Yuan, Y., Zhang, D., Zhang, K., Zhang, M., Zhang, N., Zhang, Y., Zhao, Y., Zhou, D. C., Zhou, Z., Zhu, H., Zhu, Q., Zhu, X., Zhu, Y., Zhuang, X. 2020; 181 (2): 236–49

    Abstract

    Crucial transitions in cancer, including tumor initiation, local expansion, metastasis, and therapeutic resistance, involve complex interactions between cells within the dynamic tumor ecosystem. Transformative single-cell genomics technologies and spatial multiplex in situ methods now provide an opportunity to interrogate this complexity at unprecedented resolution. The Human Tumor Atlas Network (HTAN), part of the National Cancer Institute (NCI) Cancer Moonshot Initiative, will establish a clinical, experimental, computational, and organizational framework to generate informative and accessible three-dimensional atlases of cancer transitions for a diverse set of tumor types. This effort complements both ongoing efforts to map healthy organs and previous large-scale cancer genomics approaches focused on bulk sequencing at a single point in time. Generating single-cell, multiparametric, longitudinal atlases and integrating them with clinical outcomes should help identify novel predictive biomarkers and features as well as therapeutically relevant cell types, cell states, and cellular interactions across transitions. The resulting tumor atlases should have a profound impact on our understanding of cancer biology and have the potential to improve cancer detection, prevention, and therapeutic discovery for better precision-medicine treatments of cancer patients and those at risk for cancer.

    View details for DOI 10.1016/j.cell.2020.03.053

    View details for PubMedID 32302568

  • CRISPR screens in cancer spheroids identify 3D growth-specific vulnerabilities. Nature Han, K., Pierce, S. E., Li, A., Spees, K., Anderson, G. R., Seoane, J. A., Lo, Y. H., Dubreuil, M., Olivas, M., Kamber, R. A., Wainberg, M., Kostyrko, K., Kelly, M. R., Yousefi, M., Simpkins, S. W., Yao, D., Lee, K., Kuo, C. J., Jackson, P. K., Sweet-Cordero, A., Kundaje, A., Gentles, A. J., Curtis, C., Winslow, M. M., Bassik, M. C. 2020; 580 (7801): 136-141

    Abstract

    Cancer genomics studies have identified thousands of putative cancer driver genes. Development of high-throughput and accurate models to define the functions of these genes is a major challenge. Here we devised a scalable cancer-spheroid model and performed genome-wide CRISPR screens in 2D monolayers and 3D lung-cancer spheroids. CRISPR phenotypes in 3D more accurately recapitulated those of in vivo tumours, and genes with differential sensitivities between 2D and 3D conditions were highly enriched for genes that are mutated in lung cancers. These analyses also revealed drivers that are essential for cancer growth in 3D and in vivo, but not in 2D. Notably, we found that carboxypeptidase D is responsible for removal of a C-terminal RKRR motif from the α-chain of the insulin-like growth factor 1 receptor that is critical for receptor activity. Carboxypeptidase D expression correlates with patient outcomes in patients with lung cancer, and loss of carboxypeptidase D reduced tumour growth. Our results reveal key differences between 2D and 3D cancer models, and establish a generalizable strategy for performing CRISPR screens in spheroids to reveal cancer vulnerabilities.

    View details for DOI 10.1038/s41586-020-2099-x

    View details for PubMedID 32238925

  • Long-range single-molecule mapping of chromatin accessibility in eukaryotes. Nature methods Shipony, Z., Marinov, G. K., Swaffer, M. P., Sinnott-Armstrong, N. A., Skotheim, J. M., Kundaje, A., Greenleaf, W. J. 2020

    Abstract

    Mapping open chromatin regions has emerged as a widely used tool for identifying active regulatory elements in eukaryotes. However, existing approaches, limited by reliance on DNA fragmentation and short-read sequencing, cannot provide information about large-scale chromatin states or reveal coordination between the states of distal regulatory elements. We have developed a method for profiling the accessibility of individual chromatin fibers, a single-molecule long-read accessible chromatin mapping sequencing assay (SMAC-seq), enabling the simultaneous, high-resolution, single-molecule assessment of chromatin states at multikilobase length scales. Our strategy is based on combining the preferential methylation of open chromatin regions by DNA methyltransferases with low sequence specificity, in this case EcoGII, an N6-methyladenosine (m6A) methyltransferase, and the ability of nanopore sequencing to directly read DNA modifications. We demonstrate that aggregate SMAC-seq signals match bulk-level accessibility measurements, observe single-molecule nucleosome and transcription factor protection footprints, and quantify the correlation between chromatin states of distal genomic elements.

    View details for DOI 10.1038/s41592-019-0730-2

    View details for PubMedID 32042188

  • High-Throughput Discovery and Characterization of Human Transcriptional Effectors. Cell Tycko, J., DelRosso, N., Hess, G. T., Aradhana, Banerjee, A., Mukund, A., Van, M. V., Ego, B. K., Yao, D., Spees, K., Suzuki, P., Marinov, G. K., Kundaje, A., Bassik, M. C., Bintu, L. 2020

    Abstract

    Thousands of proteins localize to the nucleus; however, it remains unclear which contain transcriptional effectors. Here, we develop HT-recruit, a pooled assay where protein libraries are recruited to a reporter, and their transcriptional effects are measured by sequencing. Using this approach, we measure gene silencing and activation for thousands of domains. We find a relationship between repressor function and evolutionary age for the KRAB domains, discover that Homeodomain repressor strength is collinear with Hox genetic organization, and identify activities for several domains of unknown function. Deep mutational scanning of the CRISPRi KRAB maps the co-repressor binding surface and identifies substitutions that improve stability/silencing. By tiling 238 proteins, we find repressors as short as ten amino acids. Finally, we report new activator domains, including a divergent KRAB. These results provide a resource of 600 human proteins containing effectors and demonstrate a scalable strategy for assigning functions to protein domains.

    View details for DOI 10.1016/j.cell.2020.11.024

    View details for PubMedID 33326746

  • Maximum Likelihood with Bias-Corrected Calibration is Hard-To-Beat at Label Shift Adaptation Alexandari, A. M., Kundaje, A., Shrikumar, A., Daume, H., Singh, A. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2020
  • Landscape of cohesin-mediated chromatin loops in the human genome. Nature Grubert, F., Srivas, R., Spacek, D. V., Kasowski, M., Ruiz-Velasco, M., Sinnott-Armstrong, N., Greenside, P., Narasimha, A., Liu, Q., Geller, B., Sanghi, A., Kulik, M., Sa, S., Rabinovitch, M., Kundaje, A., Dalton, S., Zaugg, J. B., Snyder, M. 2020; 583 (7818): 737–43

    Abstract

    Physical interactions between distal regulatory elements have a key role in regulating gene expression, but the extent to which these interactions vary between cell types and contribute to cell-type-specific gene expression remains unclear. Here, to address these questions as part of phase III of the Encyclopedia of DNA Elements (ENCODE), we mapped cohesin-mediated chromatin loops, using chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), and analysed gene expression in 24 diverse human cell types, including core ENCODE cell lines. Twenty-eight per cent of all chromatin loops vary across cell types; these variations modestly correlate with changes in gene expression and are effective at grouping cell types according to their tissue of origin. The connectivity of genes corresponds to different functional classes, with housekeeping genes having few contacts, and dosage-sensitive genes being more connected to enhancer elements. This atlas of chromatin loops complements the diverse maps of regulatory architecture that comprise the ENCODE Encyclopedia, and will help to support emerging analyses of genome structure and function.

    View details for DOI 10.1038/s41586-020-2151-x

    View details for PubMedID 32728247

  • Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature Moore, J. E., Purcaro, M. J., Pratt, H. E., Epstein, C. B., Shoresh, N., Adrian, J., Kawli, T., Davis, C. A., Dobin, A., Kaul, R., Halow, J., Van Nostrand, E. L., Freese, P., Gorkin, D. U., Shen, Y., He, Y., Mackiewicz, M., Pauli-Behn, F., Williams, B. A., Mortazavi, A., Keller, C. A., Zhang, X. O., Elhajjajy, S. I., Huey, J., Dickel, D. E., Snetkova, V., Wei, X., Wang, X., Rivera-Mulia, J. C., Rozowsky, J., Zhang, J., Chhetri, S. B., Zhang, J., Victorsen, A., White, K. P., Visel, A., Yeo, G. W., Burge, C. B., Lécuyer, E., Gilbert, D. M., Dekker, J., Rinn, J., Mendenhall, E. M., Ecker, J. R., Kellis, M., Klein, R. J., Noble, W. S., Kundaje, A., Guigó, R., Farnham, P. J., Cherry, J. M., Myers, R. M., Ren, B., Graveley, B. R., Gerstein, M. B., Pennacchio, L. A., Snyder, M. P., Bernstein, B. E., Wold, B., Hardison, R. C., Gingeras, T. R., Stamatoyannopoulos, J. A., Weng, Z. 2020; 583 (7818): 699–710

    Abstract

    The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

    View details for DOI 10.1038/s41586-020-2493-4

    View details for PubMedID 32728249

  • Homogeneity in the association of body mass index with type 2 diabetes across the UK Biobank: A Mendelian randomization study. PLoS medicine Wainberg, M., Mahajan, A., Kundaje, A., McCarthy, M. I., Ingelsson, E., Sinnott-Armstrong, N., Rivas, M. A. 2019; 16 (12): e1002982

    Abstract

    BACKGROUND: Lifestyle interventions to reduce body mass index (BMI) are critical public health strategies for type 2 diabetes prevention. While weight loss interventions have shown demonstrable benefit for high-risk and prediabetic individuals, we aimed to determine whether the same benefits apply to those at lower risk. METHODS AND FINDINGS: We performed a multi-stratum Mendelian randomization study of the effect size of BMI on diabetes odds in 287,394 unrelated individuals of self-reported white British ancestry in the UK Biobank, who were recruited from across the United Kingdom from 2006 to 2010 when they were between the ages of 40 and 69 years. Individuals were stratified on the following diabetes risk factors: BMI, diabetes family history, and genome-wide diabetes polygenic risk score. The main outcome measure was the odds ratio of diabetes per 1-kg/m² BMI reduction, in the full cohort and in each stratum. Diabetes prevalence increased sharply with BMI, family history of diabetes, and genetic risk. Conversely, predicted risk reduction from weight loss was strikingly similar across BMI and genetic risk categories. Weight loss was predicted to substantially reduce diabetes odds even among lower-risk individuals: for instance, a 1-kg/m² BMI reduction was associated with a 1.37-fold reduction (95% CI 1.12-1.68) in diabetes odds among non-overweight individuals (BMI < 25 kg/m²) without a family history of diabetes, similar to that in obese individuals (BMI ≥ 30 kg/m²) with a family history (1.21-fold reduction, 95% CI 1.13-1.29). A key limitation of this analysis is that the BMI-altering DNA sequence polymorphisms it studies represent cumulative predisposition over an individual's entire lifetime, and may consequently incorrectly estimate the risk modification potential of weight loss interventions later in life. CONCLUSIONS: In a population-scale cohort, lower BMI was consistently associated with reduced diabetes risk across BMI, family history, and genetic risk categories, suggesting all individuals can substantially reduce their diabetes risk through weight loss. Our results support the broad deployment of weight loss interventions to individuals at all levels of diabetes risk.

    View details for DOI 10.1371/journal.pmed.1002982

    View details for PubMedID 31821322

  • NETWORK MODELLING OF TOPOLOGICAL DOMAINS USING HI-C DATA. The annals of applied statistics Wang, Y. X., Sarkar, P., Ursu, O., Kundaje, A., Bickel, P. J. 2019; 13 (3): 1511-1536

    Abstract

    Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted; whereas the nodes in this case, that is, the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this nonexchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate the domains identified have desirable epigenetic features and compare them across different cell types.

    View details for DOI 10.1214/19-aoas1244

    View details for PubMedID 32968472

    View details for PubMedCentralID PMC7508461

  • Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics (Oxford, England) Nair, S., Kim, D. S., Perricone, J., Kundaje, A. 2019; 35 (14): i108-i116

    Abstract

    Genome-wide profiles of chromatin accessibility and gene expression in diverse cellular contexts are critical to decipher the dynamics of transcriptional regulation. Recently, convolutional neural networks have been used to learn predictive cis-regulatory DNA sequence models of context-specific chromatin accessibility landscapes. However, these context-specific regulatory sequence models cannot generalize predictions across cell types. We introduce multi-modal, residual neural network architectures that integrate cis-regulatory sequence and context-specific expression of trans-regulators to predict genome-wide chromatin accessibility profiles across cellular contexts. We show that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor. We leverage this feature and employ novel strategies for training models to enhance genome-wide prediction of shared and context-specific chromatin accessible sites across cell types. We interpret the models to reveal insights into cis- and trans-regulation of chromatin dynamics across 123 diverse cellular contexts. The code is available at https://github.com/kundajelab/ChromDragoNN. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btz352

    View details for PubMedID 31510655

  • GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs. Bioinformatics (Oxford, England) Shrikumar, A., Prakash, E., Kundaje, A. 2019; 35 (14): i173-i182

    Abstract

    Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM) or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, non-redundant TF motifs. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines. Code and example notebooks to reproduce results are at https://github.com/kundajelab/gkmexplain. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btz322

    View details for PubMedID 31510661

  • Matrix stiffness induces a tumorigenic phenotype in mammary epithelium through changes in chromatin accessibility. Nature biomedical engineering Stowers, R. S., Shcherbina, A., Israeli, J., Gruber, J. J., Chang, J., Nam, S., Rabiee, A., Teruel, M. N., Snyder, M. P., Kundaje, A., Chaudhuri, O. 2019

    Abstract

    In breast cancer, the increased stiffness of the extracellular matrix is a key driver of malignancy. Yet little is known about the epigenomic changes that underlie the tumorigenic impact of extracellular matrix mechanics. Here, we show in a three-dimensional culture model of breast cancer that stiff extracellular matrix induces a tumorigenic phenotype through changes in chromatin state. We found that increased stiffness yielded cells with more wrinkled nuclei and with increased lamina-associated chromatin, that cells cultured in stiff matrices displayed more accessible chromatin sites, which exhibited footprints of Sp1 binding, and that this transcription factor acts along with the histone deacetylases 3 and 8 to regulate the induction of stiffness-mediated tumorigenicity. Just as cell culture on soft environments or in them rather than on tissue-culture plastic better recapitulates the acinar morphology observed in mammary epithelium in vivo, mammary epithelial cells cultured on soft microenvironments or in them also more closely replicate the in vivo chromatin state. Our results emphasize the importance of culture conditions for epigenomic studies, and reveal that chromatin state is a critical mediator of mechanotransduction.

    View details for DOI 10.1038/s41551-019-0420-5

    View details for PubMedID 31285581

  • Predicting gene expression from plasma cell-free DNA using both the fragment length and fragment position St John, J. A., Gafni, E., White, B., Kannan, A., Hansen, L., Jaroszewicz, A., Kundaje, A., Boley, N. AMER ASSOC CANCER RESEARCH. 2019
  • The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Scientific reports Amemiya, H. M., Kundaje, A., Boyle, A. P. 2019; 9 (1): 9354

    Abstract

    Functional genomics assays based on high-throughput sequencing greatly expand our ability to understand the genome. Here, we define the ENCODE blacklist, a comprehensive set of regions in the human, mouse, worm, and fly genomes that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The removal of the ENCODE blacklist is an essential quality measure when analyzing functional genomics data.

    View details for DOI 10.1038/s41598-019-45839-z

    View details for PubMedID 31249361
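
    As a rough illustration of the blacklist removal step described in this abstract, the sketch below drops any peak that overlaps a blacklisted region. The function names, the half-open (BED-style) interval convention, and the coordinates are assumptions made for this example; they are not taken from the paper or from ENCODE tooling.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open intervals [start, end) share at least one base."""
    return start_a < end_b and start_b < end_a


def remove_blacklisted(peaks, blacklist):
    """Keep only peaks that do not overlap any blacklisted region."""
    kept = []
    for chrom, start, end in peaks:
        hit = any(
            chrom == b_chrom and overlaps(start, end, b_start, b_end)
            for b_chrom, b_start, b_end in blacklist
        )
        if not hit:
            kept.append((chrom, start, end))
    return kept


# Hypothetical coordinates, for illustration only.
blacklist = [("chr1", 1000, 2000)]
peaks = [
    ("chr1", 500, 900),     # upstream of the blacklisted region: kept
    ("chr1", 1500, 1600),   # inside the blacklisted region: removed
    ("chr2", 1500, 1600),   # same coordinates, different chromosome: kept
    ("chr1", 2000, 2100),   # starts where the region ends (half-open): kept
]
filtered = remove_blacklisted(peaks, blacklist)
```

    The linear scan is quadratic in the worst case; for genome-scale peak sets an interval tree or a sorted sweep would be the more usual choice.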

  • The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nature biotechnology Avsec, Z., Kreuzhuber, R., Israeli, J., Xu, N., Cheng, J., Shrikumar, A., Banerjee, A., Kim, D. S., Beier, T., Urban, L., Kundaje, A., Stegle, O., Gagneur, J. 2019

    View details for DOI 10.1038/s41587-019-0140-0

    View details for PubMedID 31138913

  • Cell cycle dynamics of human pluripotent stem cells primed for differentiation. Stem cells (Dayton, Ohio) Shcherbina, A., Li, J., Narayanan, C., Greenleaf, W., Kundaje, A., Chetty, S. 2019

    Abstract

    Understanding the molecular properties of the cell cycle of human pluripotent stem cells (hPSCs) is critical for effectively promoting differentiation. Here, we use the Fluorescence Ubiquitin Cell Cycle Indicator (FUCCI) system adapted into hPSCs and perform RNA-sequencing on cell cycle sorted hPSCs primed and unprimed for differentiation. Gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner in cells primed for differentiation without altering genes associated with pluripotency. Furthermore, we identify an important role for PI3K signaling in regulating the early transitory states of hPSCs toward differentiation. SIGNIFICANCE STATEMENT: Generating differentiated cell types from human pluripotent stem cells (hPSCs) holds great therapeutic promise, but has proven to be challenging in practice. The cell cycle may play an important role in enhancing the differentiation potential of hPSCs. Here, the authors track and isolate hPSCs from different phases of the cell cycle and perform RNA-sequencing. The data show that gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner as hPSCs transition toward differentiation and highlight an important role for PI3K signaling in regulating these early transitory states. © AlphaMed Press 2019.

    View details for DOI 10.1002/stem.3041

    View details for PubMedID 31135093

  • Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses GENOME RESEARCH Benayoun, B. A., Pollina, E. A., Singh, P., Mahmoudi, S., Harel, I., Casey, K. M., Dulken, B. W., Kundaje, A., Brunet, A. 2019; 29 (4): 697–709
  • Initiation of mtDNA transcription is followed by pausing, and diverges across human cell types and during evolution (vol 27, pg 362, 2017) GENOME RESEARCH Blumberg, A., Rice, E. J., Kundaje, A., Danko, C. G., Mishmar, D. 2019; 29 (4): 710

    View details for DOI 10.1101/gr.248971.119

    View details for Web of Science ID 000462858600018

    View details for PubMedID 30936176

    View details for PubMedCentralID PMC6442388

  • Measuring the reproducibility and quality of Hi-C data. Genome biology Yardimci, G. G., Ozadam, H., Sauria, M. E., Ursu, O., Yan, K., Yang, T., Chakraborty, A., Kaul, A., Lajoie, B. R., Song, F., Zhan, Y., Ay, F., Gerstein, M., Kundaje, A., Li, Q., Taylor, J., Yue, F., Dekker, J., Noble, W. S. 2019; 20 (1): 57

    Abstract

    BACKGROUND: Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. RESULTS: Using real and simulated data, we profile the performance of several recently proposed methods for assessing reproducibility of population Hi-C data, including HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep. By explicitly controlling noise and sparsity through simulations, we demonstrate the deficiencies of performing simple correlation analysis on pairs of matrices, and we show that methods developed specifically for Hi-C data produce better measures of reproducibility. We also show how to use established measures, such as the ratio of intra- to interchromosomal interactions, and novel ones, such as QuASAR-QC, to identify low-quality experiments. CONCLUSIONS: In this work, we assess reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices. Through this extensive validation and benchmarking of Hi-C data, we describe best practices for reproducibility and quality assessment of Hi-C experiments. We make all software publicly available at http://github.com/kundajelab/3DChromatin_ReplicateQC to facilitate adoption in the community.

    View details for PubMedID 30890172

  • mtDNA Chromatin-like Organization Is Gradually Established during Mammalian Embryogenesis. iScience Marom, S., Blumberg, A., Kundaje, A., Mishmar, D. 2019; 12: 141–51

    Abstract

    Unlike the nuclear genome, the mammalian mitochondrial genome (mtDNA) is thought to be coated solely by mitochondrial transcription factor A (TFAM), whose binding sequence preferences are debated. Therefore, higher-order mtDNA organization is considered much less regulated than both the bacterial nucleoid and the nuclear chromatin. However, our recently identified conserved DNase footprinting pattern in human mtDNA, which co-localizes with regulatory elements and responds to physiological conditions, likely reflects a structured higher-order mtDNA organization. We hypothesized that this pattern emerges during embryogenesis. To test this hypothesis, we analyzed assay for transposase-accessible chromatin sequencing (ATAC-seq) results collected during the course of mouse and human early embryogenesis. Our results reveal, for the first time, a gradual and dynamic emergence of the adult mtDNA footprinting pattern during embryogenesis of both mammals. Taken together, our findings suggest that the structured adult chromatin-like mtDNA organization is gradually formed during mammalian embryogenesis.

    View details for PubMedID 30684873

  • Mitigation of off-target toxicity in CRISPR-Cas9 screens for essential non-coding elements. Nature communications Tycko, J., Wainberg, M., Marinov, G. K., Ursu, O., Hess, G. T., Ego, B. K., Aradhana, Li, A., Truong, A., Trevino, A. E., Spees, K., Yao, D., Kaplow, I. M., Greenside, P. G., Morgens, D. W., Phanstiel, D. H., Snyder, M. P., Bintu, L., Greenleaf, W. J., Kundaje, A., Bassik, M. C. 2019; 10 (1): 4063

    Abstract

    Pooled CRISPR-Cas9 screens are a powerful method for functionally characterizing regulatory elements in the non-coding genome, but off-target effects in these experiments have not been systematically evaluated. Here, we investigate Cas9, dCas9, and CRISPRi/a off-target activity in screens for essential regulatory elements. The sgRNAs with the largest effects in genome-scale screens for essential CTCF loop anchors in K562 cells were not single guide RNAs (sgRNAs) that disrupted gene expression near the on-target CTCF anchor. Rather, these sgRNAs had high off-target activity that, while only weakly correlated with absolute off-target site number, could be predicted by the recently developed GuideScan specificity score. Screens conducted in parallel with CRISPRi/a, which do not induce double-stranded DNA breaks, revealed that a distinct set of off-targets also cause strong confounding fitness effects with these epigenome-editing tools. Promisingly, filtering of CRISPRi libraries using GuideScan specificity scores removed these confounded sgRNAs and enabled identification of essential regulatory elements.

    View details for DOI 10.1038/s41467-019-11955-7

    View details for PubMedID 31492858

  • Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PloS one Movva, R., Greenside, P., Marinov, G. K., Nair, S., Shrikumar, A., Kundaje, A. 2019; 14 (6): e0218073

    Abstract

    The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.

    View details for DOI 10.1371/journal.pone.0218073

    View details for PubMedID 31206543

  • Discovery of common and rare genetic risk variants for colorectal cancer. Nature genetics Huyghe, J. R., Bien, S. A., Harrison, T. A., Kang, H. M., Chen, S., Schmit, S. L., Conti, D. V., Qu, C., Jeon, J., Edlund, C. K., Greenside, P., Wainberg, M., Schumacher, F. R., Smith, J. D., Levine, D. M., Nelson, S. C., Sinnott-Armstrong, N. A., Albanes, D., Alonso, M. H., Anderson, K., Arnau-Collell, C., Arndt, V., Bamia, C., Banbury, B. L., Baron, J. A., Berndt, S. I., Bezieau, S., Bishop, D. T., Boehm, J., Boeing, H., Brenner, H., Brezina, S., Buch, S., Buchanan, D. D., Burnett-Hartman, A., Butterbach, K., Caan, B. J., Campbell, P. T., Carlson, C. S., Castellvi-Bel, S., Chan, A. T., Chang-Claude, J., Chanock, S. J., Chirlaque, M., Cho, S. H., Connolly, C. M., Cross, A. J., Cuk, K., Curtis, K. R., de la Chapelle, A., Doheny, K. F., Duggan, D., Easton, D. F., Elias, S. G., Elliott, F., English, D. R., Feskens, E. J., Figueiredo, J. C., Fischer, R., FitzGerald, L. M., Forman, D., Gala, M., Gallinger, S., Gauderman, W. J., Giles, G. G., Gillanders, E., Gong, J., Goodman, P. J., Grady, W. M., Grove, J. S., Gsur, A., Gunter, M. J., Haile, R. W., Hampe, J., Hampel, H., Harlid, S., Hayes, R. B., Hofer, P., Hoffmeister, M., Hopper, J. L., Hsu, W., Huang, W., Hudson, T. J., Hunter, D. J., Ibanez-Sanz, G., Idos, G. E., Ingersoll, R., Jackson, R. D., Jacobs, E. J., Jenkins, M. A., Joshi, A. D., Joshu, C. E., Keku, T. O., Key, T. J., Kim, H. R., Kobayashi, E., Kolonel, L. N., Kooperberg, C., Kuhn, T., Kury, S., Kweon, S., Larsson, S. C., Laurie, C. A., Le Marchand, L., Leal, S. M., Lee, S. C., Lejbkowicz, F., Lemire, M., Li, C. I., Li, L., Lieb, W., Lin, Y., Lindblom, A., Lindor, N. M., Ling, H., Louie, T. L., Mannisto, S., Markowitz, S. D., Martin, V., Masala, G., McNeil, C. E., Melas, M., Milne, R. L., Moreno, L., Murphy, N., Myte, R., Naccarati, A., Newcomb, P. A., Offit, K., Ogino, S., Onland-Moret, N. C., Pardini, B., Parfrey, P. S., Pearlman, R., Perduca, V., Pharoah, P. D., Pinchev, M., Platz, E. A., Prentice, R. L., Pugh, E., Raskin, L., Rennert, G., Rennert, H. S., Riboli, E., Rodriguez-Barranco, M., Romm, J., Sakoda, L. C., Schafmayer, C., Schoen, R. E., Seminara, D., Shah, M., Shelford, T., Shin, M., Shulman, K., Sieri, S., Slattery, M. L., Southey, M. C., Stadler, Z. K., Stegmaier, C., Su, Y., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Thomas, S. S., Toland, A. E., Trichopoulou, A., Ulrich, C. M., Van Den Berg, D. J., van Duijnhoven, F. J., Van Guelpen, B., van Kranen, H., Vijai, J., Visvanathan, K., Vodicka, P., Vodickova, L., Vymetalkova, V., Weigl, K., Weinstein, S. J., White, E., Win, A. K., Wolf, C. R., Wolk, A., Woods, M. O., Wu, A. H., Zaidi, S. H., Zanke, B. W., Zhang, Q., Zheng, W., Scacheri, P. C., Potter, J. D., Bassik, M. C., Kundaje, A., Casey, G., Moreno, V., Abecasis, G. R., Nickerson, D. A., Gruber, S. B., Hsu, L., Peters, U. 2018

    Abstract

    To further dissect the genetic architecture of colorectal cancer (CRC), we performed whole-genome sequencing of 1,439 cases and 720 controls, imputed discovered sequence variants and Haplotype Reference Consortium panel variants into genome-wide association study data, and tested for association in 34,869 cases and 29,051 controls. Findings were followed up in an additional 23,262 cases and 38,296 controls. We discovered a strongly protective 0.3% frequency variant signal at CHD1. In a combined meta-analysis of 125,478 individuals, we identified 40 new independent signals at P < 5 × 10⁻⁸, bringing the number of known independent signals for CRC to ~100. New signals implicate lower-frequency variants, Kruppel-like factors, Hedgehog signaling, Hippo-YAP signaling, long noncoding RNAs and somatic drivers, and support a role for immune function. Heritability analyses suggest that CRC risk is highly polygenic, and larger, more comprehensive studies enabling rare variant analysis will improve understanding of biology underlying this risk and influence personalized screening strategies and drug development.

    View details for PubMedID 30510241

  • Intertumoral Heterogeneity in SCLC Is Influenced by the Cell Type of Origin CANCER DISCOVERY Yang, D., Denny, S. K., Greenside, P. G., Chaikovsky, A. C., Brady, J. J., Ouadah, Y., Granja, J. M., Jahchan, N. S., Lim, J., Kwok, S., Kong, C. S., Berghoff, A. S., Schmitt, A., Reinhardt, H., Park, K., Preusser, M., Kundaje, A., Greenleaf, W. J., Sage, J., Winslow, M. M. 2018; 8 (10): 1316–31
  • GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs BIOINFORMATICS Ursu, O., Boley, N., Taranova, M., Wang, Y., Yardimci, G., Noble, W., Kundaje, A. 2018; 34 (16): 2701–7
  • A common pattern of DNase I footprinting throughout the human mtDNA unveils clues for a chromatin-like organization GENOME RESEARCH Blumberg, A., Danko, C. G., Kundaje, A., Mishmar, D. 2018; 28 (8): 1158–68

    Abstract

    Human mitochondrial DNA (mtDNA) is believed to lack chromatin and histones. Instead, it is coated solely by the transcription factor TFAM. We asked whether mtDNA packaging is more regulated than once thought. To address this, we analyzed DNase-seq experiments in 324 human cell types and found, for the first time, a pattern of 29 mtDNA Genomic footprinting (mt-DGF) sites shared by ∼90% of the samples. Their syntenic conservation in mouse DNase-seq experiments reflects selective constraints. Colocalization with known mtDNA regulatory elements, with G-quadruplex structures, in TFAM-poor sites (in HeLa cells) and with transcription pausing sites suggests a functional regulatory role for such mt-DGFs. An altered mt-DGF pattern in interleukin 3-treated CD34+ cells, certain tissue differences, and a significant prevalence change in fetal versus nonfetal samples offer first clues to their physiological importance. Taken together, human mtDNA has a conserved protein-DNA organization, which is likely involved in mtDNA regulation.

    View details for PubMedID 30002158

    View details for PubMedCentralID PMC6071632

  • Decoding regulatory sequence across skin differentiation with deep learning Kim, D., Risca, V., Chappell, J., Shi, M., Zhao, Z., Jung, N., Chang, H., Snyder, M., Greenleaf, W., Kundaje, A., Khavari, P. ELSEVIER SCIENCE INC. 2018: S135
  • GenomeDISCO: A concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics (Oxford, England) Ursu, O., Boley, N., Taranova, M., Wang, Y. X., Yardimci, G. G., Noble, W. S., Kundaje, A. 2018

    Abstract

    Motivation: The three-dimensional organization of chromatin plays a critical role in gene regulation and disease. High-throughput chromosome conformation capture experiments such as Hi-C are used to obtain genome-wide maps of 3D chromatin contacts. However, robust estimation of data quality and systematic comparison of these contact maps is challenging due to the multi-scale, hierarchical structure of chromatin contacts and the resulting properties of experimental noise in the data. Measuring concordance of contact maps is important for assessing reproducibility of replicate experiments and for modeling variation between different cellular contexts. Results: We introduce a concordance measure called GenomeDISCO (DIfferences between Smoothed COntact maps) for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments. The key idea is to smooth contact maps using random walks on the contact map graph, before estimating concordance. We use simulated datasets to benchmark GenomeDISCO's sensitivity to different types of noise that affect chromatin contact maps. When applied to a large collection of Hi-C datasets, GenomeDISCO accurately distinguishes biological replicates from samples obtained from different cell types. GenomeDISCO also generalizes to other chromosome conformation capture assays, such as HiChIP. Availability: Software implementing GenomeDISCO is available at https://github.com/kundajelab/genomedisco. Contact: akundaje@stanford.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

    View details for PubMedID 29554289
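
    The idea of smoothing contact maps with random walks before comparing them can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the GenomeDISCO implementation: the function names, the default of three walk steps, and the normalization by the number of nodes are all choices made for this example, and the published software applies its own normalization and preprocessing.

```python
def row_normalize(matrix):
    """Turn a contact matrix into a random-walk transition matrix."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def smooth(matrix, steps=3):
    """Smooth a contact map by taking `steps` random-walk steps on its graph."""
    transition = row_normalize(matrix)
    result = transition
    for _ in range(steps - 1):
        result = matmul(result, transition)
    return result


def concordance(map_a, map_b, steps=3):
    """Return 1 minus the per-node L1 difference between the smoothed maps."""
    smooth_a = smooth(map_a, steps)
    smooth_b = smooth(map_b, steps)
    l1 = sum(
        abs(x - y)
        for row_a, row_b in zip(smooth_a, smooth_b)
        for x, y in zip(row_a, row_b)
    )
    return 1.0 - l1 / len(smooth_a)


# Two toy 3-bin contact maps (hypothetical counts).
path_map = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
other_map = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
score_same = concordance(path_map, path_map)
score_diff = concordance(path_map, other_map)
```

    Because every nonempty row of a transition matrix sums to 1, the per-node L1 difference is at most 2, so this score lies between -1 and 1, with 1 meaning the smoothed maps are identical.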

  • ChIP-ping the branches of the tree: functional genomics and the evolution of eukaryotic gene regulation BRIEFINGS IN FUNCTIONAL GENOMICS Marinov, G. K., Kundaje, A. 2018; 17 (2): 116–37

    Abstract

    Advances in the methods for detecting protein-DNA interactions have played a key role in determining the directions of research into the mechanisms of transcriptional regulation. The most recent major technological transformation happened a decade ago, with the move from using tiling arrays [chromatin immunoprecipitation (ChIP)-on-Chip] to high-throughput sequencing (ChIP-seq) as a readout for ChIP assays. In addition to the numerous other ways in which it is superior to arrays, by eliminating the need to design and manufacture them, sequencing also opened the door to carrying out comparative analyses of genome-wide transcription factor occupancy across species and studying chromatin biology in previously less accessible model and nonmodel organisms, thus allowing us to understand the evolution and diversity of regulatory mechanisms in unprecedented detail. Here, we review the biological insights obtained from such studies in recent years and discuss anticipated future developments in the field.

    View details for DOI 10.1093/bfgp/ely004

    View details for Web of Science ID 000429027600006

    View details for PubMedID 29529131

  • Impact of regulatory variation across human iPSCs and differentiated cells GENOME RESEARCH Banovich, N. E., Li, Y. I., Raj, A., Ward, M. C., Greenside, P., Calderon, D., Tung, P., Burnett, J. E., Myrthil, M., Thomas, S. M., Burrows, C. K., Romero, I., Pavlovic, B. J., Kundaje, A., Pritchard, J. K., Gilad, Y. 2018; 28 (1): 122–31

    Abstract

    Induced pluripotent stem cells (iPSCs) are an essential tool for studying cellular differentiation and cell types that are otherwise difficult to access. We investigated the use of iPSCs and iPSC-derived cells to study the impact of genetic variation on gene regulation across different cell types and as models for studies of complex disease. To do so, we established a panel of iPSCs from 58 well-studied Yoruba lymphoblastoid cell lines (LCLs); 14 of these lines were further differentiated into cardiomyocytes. We characterized regulatory variation across individuals and cell types by measuring gene expression levels, chromatin accessibility, and DNA methylation. Our analysis focused on a comparison of inter-individual regulatory variation across cell types. While most cell-type-specific regulatory quantitative trait loci (QTLs) lie in chromatin that is open only in the affected cell types, we found that 20% of cell-type-specific regulatory QTLs are in shared open chromatin. This observation motivated us to develop a deep neural network to predict open chromatin regions from DNA sequence alone. Using this approach, we were able to use the sequences of segregating haplotypes to predict the effects of common SNPs on cell-type-specific chromatin accessibility.

    View details for PubMedID 29208628

  • Prediction of protein-ligand interactions from paired protein sequence motifs and ligand substructures. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Greenside, P., Hillenmeyer, M., Kundaje, A. 2018; 23: 20–31

    Abstract

    Identification of small molecule ligands that bind to proteins is a critical step in drug discovery. Computational methods have been developed to accelerate the prediction of protein-ligand binding, but often depend on 3D protein structures. As only a limited number of protein 3D structures have been resolved, the ability to predict protein-ligand interactions without relying on a 3D representation would be highly valuable. We use an interpretable confidence-rated boosting algorithm to predict protein-ligand interactions with high accuracy from ligand chemical substructures and protein 1D sequence motifs, without relying on 3D protein structures. We compare several protein motif definitions, assess generalization of our model's predictions to unseen proteins and ligands, demonstrate recovery of well established interactions and identify globally predictive protein-ligand motif pairs. By bridging biological and chemical perspectives, we demonstrate that it is possible to predict protein-ligand interactions using only motif-based features and that interpretation of these features can reveal new insights into the molecular mechanics underlying each interaction. Our work also lays a foundation to explore more predictive feature sets and sophisticated machine learning approaches as well as other applications, such as predicting unintended interactions or the effects of mutations.

    View details for PubMedID 29218866

  • Umap and Bismap: quantifying genome and methylome mappability. Nucleic acids research Karimzadeh, M., Ernst, C., Kundaje, A., Hoffman, M. M. 2018; 46 (20): e120

    Abstract

    Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding and chemical modifications. Every region in a genome assembly has a property called 'mappability', which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. These regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. Both to correct assumptions of uniformity in downstream analysis and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at https://bismap.hoffmanlab.org for use with genome browsers.

    View details for PubMedID 30169659
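
    The notion of mappability described in this abstract can be illustrated with a toy calculation. The sketch below treats a k-mer as uniquely mappable when it occurs exactly once in the sequence, then scores each position by the fraction of overlapping k-mers that are unique. It is a simplification made for this example: it ignores the reverse strand, mismatches, and bisulfite conversion, and the function names are hypothetical rather than part of the Umap software.

```python
from collections import Counter


def unique_kmer_starts(genome, k):
    """For each k-mer start position, report whether that k-mer occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


def mappability(genome, k):
    """Fraction of the k-mers covering each position that are unique."""
    unique = unique_kmer_starts(genome, k)
    scores = []
    for pos in range(len(genome)):
        first = max(0, pos - k + 1)          # earliest k-mer start covering pos
        last = min(pos, len(genome) - k)     # latest k-mer start covering pos
        covering = unique[first:last + 1]
        scores.append(sum(covering) / len(covering) if covering else 0.0)
    return scores


# Toy sequence in which "ACGT" occurs twice, so reads starting there are ambiguous.
toy_genome = "ACGTACGTTT"
unique_flags = unique_kmer_starts(toy_genome, 4)
scores = mappability(toy_genome, 4)
```

    Positions covered only by the repeated k-mer score 0, while positions covered only by unique k-mers score 1; intermediate values mark regions where some but not all overlapping reads would map unambiguously.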

  • Differential analysis of chromatin accessibility and histone modifications for predicting mouse developmental enhancers. Nucleic acids research Fu, S., Wang, Q., Moore, J. E., Purcaro, M. J., Pratt, H. E., Fan, K., Gu, C., Jiang, C., Zhu, R., Kundaje, A., Lu, A., Weng, Z. 2018; 46 (21): 11184–201

    Abstract

    Enhancers are distal cis-regulatory elements that modulate gene expression. They are depleted of nucleosomes and enriched in specific histone modifications; thus, calling DNase-seq and histone mark ChIP-seq peaks can predict enhancers. We evaluated nine peak-calling algorithms for predicting enhancers validated by transgenic mouse assays. DNase and H3K27ac peaks were consistently more predictive than H3K4me1/2/3 and H3K9ac peaks. DFilter and Hotspot2 were the best DNase peak callers, while HOMER, MUSIC, MACS2, DFilter and F-seq were the best H3K27ac peak callers. We observed that the differential DNase or H3K27ac signals between two distant tissues increased the area under the precision-recall curve (PR-AUC) of DNase peaks by 17.5-166.7% and that of H3K27ac peaks by 7.1-22.2%. We further improved this differential signal method using multiple contrast tissues. Evaluated using a blind test, the differential H3K27ac signal method substantially improved PR-AUC from 0.48 to 0.75 for predicting heart enhancers. We further validated our approach using postnatal retina and cerebral cortex enhancers identified by massively parallel reporter assays, and observed improvements for both tissues. In summary, we compared nine peak callers and devised a superior method for predicting tissue-specific mouse developmental enhancers by reranking the called peaks.

    View details for PubMedID 30137428

  • Challenges and recommendations for epigenomics in precision health NATURE BIOTECHNOLOGY Carter, A. C., Chang, H. Y., Church, G., Dombkowski, A., Ecker, J. R., Gil, E., Giresi, P. G., Greely, H., Greenleaf, W. J., Hacohen, N., He, C., Hill, D., Ko, J., Kohane, I., Kundaje, A., Palmer, M., Snyder, M. P., Tung, J., Urban, A., Vidal, M., Wong, W. 2017; 35 (12): 1128–32

    View details for PubMedID 29220033

  • Chromatin accessibility dynamics reveal novel functional enhancers in C. elegans GENOME RESEARCH Daugherty, A. C., Yeo, R. W., Buenrostro, J. D., Greenleaf, W. J., Kundaje, A., Brunet, A. 2017; 27 (12): 2096–2107

    Abstract

    Chromatin accessibility, a crucial component of genome regulation, has primarily been studied in homogeneous and simple systems, such as isolated cell populations or early-development models. Whether chromatin accessibility can be assessed in complex, dynamic systems in vivo with high sensitivity remains largely unexplored. In this study, we use ATAC-seq to identify chromatin accessibility changes in a whole animal, the model organism Caenorhabditis elegans, from embryogenesis to adulthood. Chromatin accessibility changes between developmental stages are highly reproducible, recapitulate histone modification changes, and reveal key regulatory aspects of the epigenomic landscape throughout organismal development. We find that over 5000 distal noncoding regions exhibit dynamic changes in chromatin accessibility between developmental stages and could thereby represent putative enhancers. When tested in vivo, several of these putative enhancers indeed drive novel cell-type- and temporal-specific patterns of expression. Finally, by integrating transcription factor binding motifs in a machine learning framework, we identify EOR-1 as a unique transcription factor that may regulate chromatin dynamics during development. Our study provides a unique resource for C. elegans, a system in which the prevalence and importance of enhancers remains poorly characterized, and demonstrates the power of using whole organism chromatin accessibility to identify novel regulatory regions in complex systems.

    View details for PubMedID 29141961

  • Enrichment of colorectal cancer associations in functional regions: Insight for using epigenomics data in the analysis of whole genome sequence-imputed GWAS data PLOS ONE Bien, S. A., Auer, P. L., Harrison, T. A., Qu, C., Connolly, C. M., Greenside, P. G., Chen, S., Berndt, S. I., Bezieau, S., Kang, H. M., Huyghe, J., Brenner, H., Casey, G., Chan, A. T., Hopper, J. L., Banbury, B. L., Chang-Claude, J., Chanock, S. J., Haile, R. W., Hoffmeister, M., Fuchsberger, C., Jenkins, M. A., Leal, S. M., Lemire, M., Newcomb, P. A., Gallinger, S., Potter, J. D., Schoen, R. E., Slattery, M. L., Smith, J. D., Le Marchand, L., White, E., Zanke, B. W., Abecasis, G. R., Carlson, C. S., Peters, U., Nickerson, D. A., Kundaje, A., Hsu, L., GECCO CCFR 2017; 12 (11): e0186518

    Abstract

    The evaluation of less frequent genetic variants and their effect on complex disease poses new challenges for genomic research. To investigate whether epigenetic data can be used to inform aggregate rare-variant association methods (RVAM), we assessed whether variants more significantly associated with colorectal cancer (CRC) were preferentially located in non-coding regulatory regions, and whether enrichment was specific to colorectal tissues. Active regulatory elements (ARE) were mapped using data from 127 tissues and cell-types from NIH Roadmap Epigenomics and Encyclopedia of DNA Elements (ENCODE) projects. We investigated whether CRC association p-values were more significant for common variants 1) inside versus outside AREs, or 2) inside colorectal (CR) AREs versus AREs of other tissues and cell-types. We employed an integrative epigenomic RVAM for variants with allele frequency <1%. Gene sets were defined as ARE variants within 200 kilobases of a transcription start site (TSS) using either CR ARE or ARE from non-digestive tissues. CRC-set association p-values were used to evaluate enrichment of less frequent variant associations in CR ARE versus non-digestive ARE. ARE from 126/127 tissues and cell-types were significantly enriched for stronger CRC-variant associations. Strongest enrichment was observed for digestive tissues and immune cell types. CR-specific ARE were also enriched for stronger CRC-variant associations compared to ARE combined across non-digestive tissues (p-value = 9.6 × 10⁻⁴). Additionally, we found enrichment of stronger CRC association p-values for rare variant sets of CR ARE compared to non-digestive ARE (p-value = 0.029). Integrative epigenomic RVAM may enable discovery of less frequent variants associated with CRC, and ARE of digestive and immune tissues are most informative. Although distance-based aggregation of less frequent variants in CR ARE surrounding TSS showed modest enrichment, future association studies would likely benefit from joint analysis of transcriptomes and epigenomes to better link regulatory variation with target genes.

    View details for PubMedID 29161273

  • Vicus: Exploiting local structures to improve network-based analysis of biological data PLOS COMPUTATIONAL BIOLOGY Wang, B., Huang, L., Zhu, Y., Kundaje, A., Batzoglou, S., Goldenberg, A. 2017; 13 (10): e1005621

    Abstract

    Biological networks entail important topological features and patterns critical to understanding interactions within complicated biological systems. Despite great progress in understanding their structure, much more can be done to improve our inference and network analysis. Spectral methods play a key role in many network-based applications. Fundamental to spectral methods is the Laplacian, a matrix that captures the global structure of the network. Unfortunately, the Laplacian does not take into account intricacies of the network's local structure and is sensitive to noise in the network. These two properties are fundamental to biological networks and cannot be ignored. We propose an alternative matrix, Vicus. The Vicus matrix captures the local neighborhood structure of the network and thus is more effective at modeling biological interactions. We demonstrate the advantages of Vicus in the context of spectral methods by extensive empirical benchmarking on tasks such as single cell dimensionality reduction, protein module discovery and ranking genes for cancer subtyping. Our experiments show that using Vicus, spectral methods result in more accurate and robust performance in all of these tasks.

    View details for PubMedID 29023470

    View details for PubMedCentralID PMC5638230

  • Genome-scale measurement of off-target activity using Cas9 toxicity in high-throughput screens NATURE COMMUNICATIONS Morgens, D. W., Wainberg, M., Boyle, E. A., Ursu, O., Araya, C. L., Tsui, C. K., Haney, M. S., Hess, G. T., Han, K., Jeng, E. E., Li, A., Snyder, M. P., Greenleaf, W. J., Kundaje, A., Bassik, M. C. 2017; 8

    Abstract

    CRISPR-Cas9 screens are powerful tools for high-throughput interrogation of genome function, but can be confounded by nuclease-induced toxicity at both on- and off-target sites, likely due to DNA damage. Here, to test potential solutions to this issue, we design and analyse a CRISPR-Cas9 library with 10 variable-length guides per gene and thousands of negative controls targeting non-functional, non-genic regions (termed safe-targeting guides), in addition to non-targeting controls. We find this library has excellent performance in identifying genes affecting growth and sensitivity to the ricin toxin. The safe-targeting guides allow for proper control of toxicity from on-target DNA damage. Using this toxicity as a proxy to measure off-target cutting, we demonstrate with tens of thousands of guides both the nucleotide position-dependent sensitivity to single mismatches and the reduction of off-target cutting using truncated guides. Our results demonstrate a simple strategy for high-throughput evaluation of target specificity and nuclease toxicity in Cas9 screens.

    View details for DOI 10.1038/ncomms15178

    View details for PubMedID 28474669

  • Dynamic and stable enhancer-promoter contacts regulate epidermal terminal differentiation Lopez-Pajares, V., Rubin, A., Barajas, B., Furlan-Magaril, M., Mumbach, M., Greenleaf, W., Kundaje, A., Snyder, M., Chang, H., Fraser, P., Khavari, P. A. ELSEVIER SCIENCE INC. 2017: S80
  • Initiation of mtDNA transcription is followed by pausing, and diverges across human cell types and during evolution. Genome research Blumberg, A., Rice, E. J., Kundaje, A., Danko, C. G., Mishmar, D. 2017; 27 (3): 362-373

    Abstract

    Mitochondrial DNA (mtDNA) genes have long been known to be cotranscribed in polycistrons, yet it remains impossible to study nascent mtDNA transcripts quantitatively in vivo using existing tools. To this end, we used deep sequencing (GRO-seq and PRO-seq) and analyzed nascent mtDNA-encoded RNA transcripts in diverse human cell lines and metazoan organisms. Surprisingly, accurate detection of human mtDNA transcription initiation sites (TISs) in the heavy and light strands revealed a novel conserved transcription pausing site near the light-strand TIS. This pausing site correlated with the presence of a bacterial pausing sequence motif, with reduced SNP density, and with a DNase footprinting signal in all tested cells. Its location within conserved sequence block 3 (CSBIII), just upstream of the known transcription-replication transition point, suggests involvement in such transition. Analysis of nonhuman organisms enabled de novo mtDNA sequence assembly, as well as detection of previously unknown mtDNA TIS, pausing, and transcription termination sites with unprecedented accuracy. Whereas mammals (Pan troglodytes, Macaca mulatta, Rattus norvegicus, and Mus musculus) showed a human-like mtDNA transcription pattern, the invertebrate pattern (Drosophila melanogaster and Caenorhabditis elegans) profoundly diverged. Our approach paves the path toward in vivo, quantitative, reference sequence-free analysis of mtDNA transcription in all eukaryotes.

    View details for DOI 10.1101/gr.209924.116

    View details for PubMedID 28049628

  • Molecular definition of a metastatic lung cancer state reveals a targetable CD109-Janus kinase-Stat axis. Nature medicine Chuang, C., Greenside, P. G., Rogers, Z. N., Brady, J. J., Yang, D., Ma, R. K., Caswell, D. R., Chiou, S., Winters, A. F., Grüner, B. M., Ramaswami, G., Spencley, A. L., Kopecky, K. E., Sayles, L. C., Sweet-Cordero, E. A., Li, J. B., Kundaje, A., Winslow, M. M. 2017; 23 (3): 291-300

    Abstract

    Lung cancer is the leading cause of cancer deaths worldwide, with the majority of mortality resulting from metastatic spread. However, the molecular mechanism by which cancer cells acquire the ability to disseminate from primary tumors, seed distant organs, and grow into tissue-destructive metastases remains incompletely understood. We combined tumor barcoding in a mouse model of human lung adenocarcinoma with unbiased genomic approaches to identify a transcriptional program that confers metastatic ability and predicts patient survival. Small-scale in vivo screening identified several genes, including Cd109, that encode novel pro-metastatic factors. We uncovered signaling mediated by Janus kinases (Jaks) and the transcription factor Stat3 as a critical, pharmacologically targetable effector of CD109-driven lung cancer metastasis. In summary, by coupling the systematic genomic analysis of purified cancer cells in distinct malignant states from mouse models with extensive human validation, we uncovered several key regulators of metastatic ability, including an actionable pro-metastatic CD109-Jak-Stat3 axis.

    View details for DOI 10.1038/nm.4285

    View details for PubMedID 28191885

  • Predicting gene expression in massively parallel reporter assays: a comparative study. Human mutation Kreimer, A., Zeng, H., Edwards, M. D., Guo, Y., Tian, K., Shin, S., Welch, R., Wainberg, M., Mohan, R., Sinnott-Armstrong, N. A., Li, Y., Eraslan, G., Amin, T. B., Goke, J., Mueller, N. S., Kellis, M., Kundaje, A., Beer, M. A., Keles, S., Gifford, D. K., Yosef, N. 2017

    Abstract

    In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role "coded" in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.

    View details for DOI 10.1002/humu.23197

    View details for PubMedID 28220625

  • An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nature methods Corces, M. R., Trevino, A. E., Hamilton, E. G., Greenside, P. G., Sinnott-Armstrong, N. A., Vesuna, S., Satpathy, A. T., Rubin, A. J., Montine, K. S., Wu, B., Kathiria, A., Cho, S. W., Mumbach, M. R., Carter, A. C., Kasowski, M., Orloff, L. A., Risca, V. I., Kundaje, A., Khavari, P. A., Montine, T. J., Greenleaf, W. J., Chang, H. Y. 2017

    Abstract

    We present Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility profiling that works across multiple applications with substantial improvement of signal-to-background ratio and information content. The Omni-ATAC protocol generates chromatin accessibility profiles from archival frozen tissue samples and 50-μm sections, revealing the activities of disease-associated DNA elements in distinct human brain structures. The Omni-ATAC protocol enables the interrogation of personal regulomes in tissue context and translational studies.

    View details for PubMedID 28846090

  • High-Throughput Characterization of Cascade type I-E CRISPR Guide Efficacy Reveals Unexpected PAM Diversity and Target Sequence Preferences. Genetics Fu, B. X., Wainberg, M., Kundaje, A., Fire, A. Z. 2017; 206 (4): 1727–38

    Abstract

    Interactions between Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) RNAs and CRISPR-associated (Cas) proteins form an RNA-guided adaptive immune system in prokaryotes. The adaptive immune system utilizes segments of the genetic material of invasive foreign elements in the CRISPR locus. The loci are transcribed and processed to produce small CRISPR RNAs (crRNAs), with degradation of invading genetic material directed by a combination of complementarity between RNA and DNA and in some cases recognition of adjacent motifs called PAMs (Protospacer Adjacent Motifs). Here we describe a general, high-throughput procedure to test the efficacy of thousands of targets, applying this to the Escherichia coli type I-E Cascade (CRISPR-associated complex for antiviral defense) system. These studies were followed with reciprocal experiments in which the consequence of CRISPR activity was survival in the presence of a lytic phage. From the combined analysis of the Cascade system, we found that (i) type I-E Cascade PAM recognition is more expansive than previously reported, with at least 22 distinct PAMs, with many of the noncanonical PAMs having CRISPR-interference abilities similar to the canonical PAMs; (ii) PAM positioning appears precise, with no evidence for tolerance to PAM slippage in interference; and (iii) while increased guanine-cytosine (GC) content in the spacer is associated with higher CRISPR-interference efficiency, high GC content (>62.5%) decreases CRISPR-interference efficiency. Our findings provide a comprehensive functional profile of Cascade type I-E interference requirements and a method to assay spacer efficacy that can be applied to other CRISPR-Cas systems.

    View details for PubMedID 28634160

  • Learning Important Features Through Propagating Activation Differences Shrikumar, A., Greenside, P., Kundaje, A., Precup, D., Teh, Y. W. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2017
  • Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements. Nature genetics Mumbach, M. R., Satpathy, A. T., Boyle, E. A., Dai, C., Gowen, B. G., Cho, S. W., Nguyen, M. L., Rubin, A. J., Granja, J. M., Kazane, K. R., Wei, Y., Nguyen, T., Greenside, P. G., Corces, M. R., Tycko, J., Simeonov, D. R., Suliman, N., Li, R., Xu, J., Flynn, R. A., Kundaje, A., Khavari, P. A., Marson, A., Corn, J. E., Quertermous, T., Greenleaf, W. J., Chang, H. Y. 2017

    Abstract

    The challenge of linking intergenic mutations to target genes has limited molecular understanding of human diseases. Here we show that H3K27ac HiChIP generates high-resolution contact maps of active enhancers and target genes in rare primary human T cell subtypes and coronary artery smooth muscle cells. Differentiation of naive T cells into T helper 17 cells or regulatory T cells creates subtype-specific enhancer-promoter interactions, specifically at regions of shared DNA accessibility. These data provide a principled means of assigning molecular functions to autoimmune and cardiovascular disease risk variants, linking hundreds of noncoding variants to putative gene targets. Target genes identified with HiChIP are further supported by CRISPR interference and activation at linked enhancers, by the presence of expression quantitative trait loci, and by allele-specific enhancer loops in patient-derived primary cells. The majority of disease-associated enhancers contact genes beyond the nearest gene in the linear genome, leading to a fourfold increase in the number of potential target genes for autoimmune and cardiovascular diseases.

    View details for PubMedID 28945252

  • Lineage-specific dynamic and pre-established enhancer-promoter contacts cooperate in terminal differentiation. Nature genetics Rubin, A. J., Barajas, B. C., Furlan-Magaril, M., Lopez-Pajares, V., Mumbach, M. R., Howard, I., Kim, D. S., Boxer, L. D., Cairns, J., Spivakov, M., Wingett, S. W., Shi, M., Zhao, Z., Greenleaf, W. J., Kundaje, A., Snyder, M., Chang, H. Y., Fraser, P., Khavari, P. A. 2017; 49 (10): 1522–28

    Abstract

    Chromosome conformation is an important feature of metazoan gene regulation; however, enhancer-promoter contact remodeling during cellular differentiation remains poorly understood. To address this, genome-wide promoter capture Hi-C (CHi-C) was performed during epidermal differentiation. Two classes of enhancer-promoter contacts associated with differentiation-induced genes were identified. The first class ('gained') increased in contact strength during differentiation in concert with enhancer acquisition of the H3K27ac activation mark. The second class ('stable') were pre-established in undifferentiated cells, with enhancers constitutively marked by H3K27ac. The stable class was associated with the canonical conformation regulator cohesin, whereas the gained class was not, implying distinct mechanisms of contact formation and regulation. Analysis of stable enhancers identified a new, essential role for a constitutively expressed, lineage-restricted ETS-family transcription factor, EHF, in epidermal differentiation. Furthermore, neither class of contacts was observed in pluripotent cells, suggesting that lineage-specific chromatin structure is established in tissue progenitor cells and is further remodeled in terminal differentiation.

    View details for PubMedID 28805829

  • An atlas of transcriptional, chromatin accessibility, and surface marker changes in human mesoderm development SCIENTIFIC DATA Koh, P. W., Sinha, R., Barkal, A. A., Morganti, R. M., Chen, A., Weissman, I. L., Ang, L. T., Kundaje, A., Loh, K. M. 2016; 3

    Abstract

    Mesoderm is the developmental precursor to myriad human tissues including bone, heart, and skeletal muscle. Unravelling the molecular events through which these lineages become diversified from one another is integral to developmental biology and understanding changes in cellular fate. To this end, we developed an in vitro system to differentiate human pluripotent stem cells through primitive streak intermediates into paraxial mesoderm and its derivatives (somites, sclerotome, dermomyotome) and separately, into lateral mesoderm and its derivatives (cardiac mesoderm). Whole-population and single-cell analyses of these purified populations of human mesoderm lineages through RNA-seq, ATAC-seq, and high-throughput surface marker screens illustrated how transcriptional changes co-occur with changes in open chromatin and surface marker landscapes throughout human mesoderm development. This molecular atlas will facilitate study of human mesoderm development (which cannot be interrogated in vivo due to restrictions on human embryo studies) and provides a broad resource for the study of gene regulation in development at the single-cell level, knowledge that might one day be exploited for regenerative medicine.

    View details for DOI 10.1038/sdata.2016.109

    View details for PubMedID 27996962

  • The International Human Epigenome Consortium: A Blueprint for Scientific Collaboration and Discovery CELL Stunnenberg, H. G., Hirst, M., Int Human Epigenome Consortium 2016; 167 (5): 1145-1149

    Abstract

    The International Human Epigenome Consortium (IHEC) coordinates the generation of a catalog of high-resolution reference epigenomes of major primary human cell types. The studies now presented (see the Cell Press IHEC web portal at http://www.cell.com/consortium/IHEC) highlight the coordinated achievements of IHEC teams to gather and interpret comprehensive epigenomic datasets to gain insights in the epigenetic control of cell states relevant for human health and disease.

    View details for DOI 10.1016/j.cell.2016.11.007

    View details for Web of Science ID 000389470100004

    View details for PubMedID 27863232

  • Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nature genetics Corces, M. R., Buenrostro, J. D., Wu, B., Greenside, P. G., Chan, S. M., Koenig, J. L., Snyder, M. P., Pritchard, J. K., Kundaje, A., Greenleaf, W. J., Majeti, R., Chang, H. Y. 2016; 48 (10): 1193-1203

    Abstract

    We define the chromatin accessibility and transcriptional landscapes in 13 human primary blood cell types that span the hematopoietic hierarchy. Exploiting the finding that the enhancer landscape better reflects cell identity than mRNA levels, we enable 'enhancer cytometry' for enumeration of pure cell types from complex populations. We identify regulators governing hematopoietic differentiation and further show the lineage ontogeny of genetic elements linked to diverse human diseases. In acute myeloid leukemia (AML), chromatin accessibility uncovers unique regulatory evolution in cancer cells with a progressively increasing mutation burden. Single AML cells exhibit distinctive mixed regulome profiles corresponding to disparate developmental stages. A method to account for this regulatory heterogeneity identified cancer-specific deviations and implicated HOX factors as key regulators of preleukemic hematopoietic stem cell characteristics. Thus, regulome dynamics can provide diverse insights into hematopoietic development and disease.

    View details for DOI 10.1038/ng.3646

    View details for PubMedID 27526324

  • Characterization of the direct targets of FOXO transcription factors throughout evolution. Aging cell Webb, A. E., Kundaje, A., Brunet, A. 2016; 15 (4): 673-685

    Abstract

    FOXO transcription factors (FOXOs) are central regulators of lifespan across species, yet they also have cell-specific functions, including adult stem cell homeostasis and immune function. Direct targets of FOXOs have been identified genome-wide in several species and cell types. However, whether FOXO targets are specific to cell types and species or conserved across cell types and throughout evolution remains uncharacterized. Here, we perform a meta-analysis of direct FOXO targets across tissues and organisms, using data from mammals as well as Caenorhabditis elegans and Drosophila. We show that FOXOs bind cell type-specific targets, which have functions related to that particular cell. Interestingly, FOXOs also share targets across different tissues in mammals, and the function and even the identity of these shared mammalian targets are conserved in invertebrates. Evolutionarily conserved targets show enrichment for growth factor signaling, metabolism, stress resistance, and proteostasis, suggesting an ancestral, conserved role in the regulation of these processes. We also identify candidate cofactors at conserved FOXO targets that change in expression with age, including CREB and ETS family factors. This meta-analysis provides insight into the evolution of the FOXO network and highlights downstream genes and cofactors that may be particularly important for FOXO's conserved function in adult homeostasis and longevity.

    View details for DOI 10.1111/acel.12479

    View details for PubMedID 27061590

  • Mapping the Pairwise Choices Leading from Pluripotency to Human Bone, Heart, and Other Mesoderm Cell Types CELL Loh, K. M., Chen, A., Koh, P. W., Deng, T. Z., Sinha, R., Tsai, J. M., Barkal, A. A., Shen, K. Y., Jain, R., Morganti, R. M., Shyh-Chang, N., Fernhoff, N. B., George, B. M., Wernig, G., Salomon, R. E., Chen, Z., Vogel, H., Epstein, J. A., Kundaje, A., Talbot, W. S., Beachy, P. A., Ang, L. T., Weissman, I. L. 2016; 166 (2): 451-467

    Abstract

    Stem-cell differentiation to desired lineages requires navigating alternating developmental paths that often lead to unwanted cell types. Hence, comprehensive developmental roadmaps are crucial to channel stem-cell differentiation toward desired fates. To this end, here, we map bifurcating lineage choices leading from pluripotency to 12 human mesodermal lineages, including bone, muscle, and heart. We defined the extrinsic signals controlling each binary lineage decision, enabling us to logically block differentiation toward unwanted fates and rapidly steer pluripotent stem cells toward 80%-99% pure human mesodermal lineages at most branchpoints. This strategy enabled the generation of human bone and heart progenitors that could engraft in respective in vivo models. Mapping stepwise chromatin and single-cell gene expression changes in mesoderm development uncovered somite segmentation, a previously unobservable human embryonic event transiently marked by HOPX expression. Collectively, this roadmap enables navigation of mesodermal development to produce transplantable human tissue progenitors and uncover developmental processes.

    View details for DOI 10.1016/j.cell.2016.06.011

    View details for PubMedID 27419872

  • Using functional data from Roadmap Epigenomics to inform analysis of rare variants linked to gene expression in a large colorectal cancer study Bien, S. A., Harrison, T. A., Auer, P. L., Qu, F., Huyghe, J., Banbury, B., Greenside, P., Abecasis, G. R., Berndt, S. I., Bezieau, S., Brenner, H., Casey, G., Chan, A. T., Chang-Claude, J., Chen, S., Smith, J. D., Le Marchand, L., Carlson, C., Newcomb, P. A., Fuchsberger, C., Slattery, M. L., Kang, H. M., White, E., Potter, J., Gallinger, S. J., Hoffmeister, M., Gruber, S. B., Nickerson, D. A., Peters, U., Kundaje, A., Hsu, L. AMER ASSOC CANCER RESEARCH. 2016
  • Impact of the X Chromosome and sex on regulatory variation GENOME RESEARCH Kukurba, K. R., Parsana, P., Balliu, B., Smith, K. S., Zappala, Z., Knowles, D. A., Fave, M., Davis, J. R., Li, X., Zhu, X., Potash, J. B., Weissman, M. M., Shi, J., Kundaje, A., Levinson, D. F., Awadalla, P., Mostafavi, S., Battle, A., Montgomery, S. B. 2016; 26 (6): 768-777

    Abstract

    The X Chromosome, with its unique mode of inheritance, contributes to differences between the sexes at a molecular level, including sex-specific gene expression and sex-specific impact of genetic variation. Improving our understanding of these differences offers to elucidate the molecular mechanisms underlying sex-specific traits and diseases. However, to date, most studies have either ignored the X Chromosome or had insufficient power to test for the sex-specific impact of genetic variation. By analyzing whole blood transcriptomes of 922 individuals, we have conducted the first large-scale, genome-wide analysis of the impact of both sex and genetic variation on patterns of gene expression, including comparison between the X Chromosome and autosomes. We identified a depletion of expression quantitative trait loci (eQTL) on the X Chromosome, especially among genes under high selective constraint. In contrast, we discovered an enrichment of sex-specific regulatory variants on the X Chromosome. To resolve the molecular mechanisms underlying such effects, we generated chromatin accessibility data through ATAC-sequencing to connect sex-specific chromatin accessibility to sex-specific patterns of expression and regulatory variation. As sex-specific regulatory variants discovered in our study can inform sex differences in heritable disease prevalence, we integrated our data with genome-wide association study data for multiple immune traits identifying several traits with significant sex biases in genetic susceptibilities. Together, our study provides genome-wide insight into how genetic variation, the X Chromosome, and sex shape human gene regulation and disease.

    View details for DOI 10.1101/gr.197897.115

    View details for PubMedID 27197214

  • An Arntl2-Driven Secretome Enables Lung Adenocarcinoma Metastatic Self-Sufficiency CANCER CELL Brady, J. J., Chuang, C., Greenside, P. G., Rogers, Z. N., Murray, C. W., Caswell, D. R., Hartmann, U., Connolly, A. J., Sweet-Cordero, E. A., Kundaje, A., Winslow, M. M. 2016; 29 (5): 697-710

    Abstract

    The ability of cancer cells to establish lethal metastatic lesions requires the survival and expansion of single cancer cells at distant sites. The factors controlling the clonal growth ability of individual cancer cells remain poorly understood. Here, we show that high expression of the transcription factor ARNTL2 predicts poor lung adenocarcinoma patient outcome. Arntl2 is required for metastatic ability in vivo and clonal growth in cell culture. Arntl2 drives metastatic self-sufficiency by orchestrating the expression of a complex pro-metastatic secretome. We identify Clock as an Arntl2 partner and functionally validate the matricellular protein Smoc2 as a pro-metastatic secreted factor. These findings shed light on the molecular mechanisms that enable single cancer cells to form allochthonous tumors in foreign tissue environments.

    View details for DOI 10.1016/j.ccell.2016.03.003

    View details for PubMedID 27150038

  • Unsupervised Learning from Noisy Networks with Applications to Hi-C Data Wang, B., Zhu, J., Ursu, O., Pourshafeie, A., Batzoglou, S., Kundaje, A., Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2016
  • Regulatory analysis of the C. elegans genome with spatiotemporal resolution (vol 512, pg 400, 2014) NATURE Araya, C. L., Kawli, T., Kundaje, A., Jiang, L., Wu, B., Vafeados, D., Terrell, R., Weissdepp, P., Gevirtzman, L., Mace, D., Niu, W., Boyle, A. P., Xie, D., Ma, L., Murray, J. I., Reinke, V., Waterston, R. H., Snyder, M. 2015; 528 (7580): 152

    View details for DOI 10.1038/nature16075

    View details for Web of Science ID 000365606000069

    View details for PubMedID 26560031

  • H3K4me3 Breadth Is Linked to Cell Identity and Transcriptional Consistency (vol 158, pg 673, 2014) CELL Benayoun, B. A., Pollina, E. A., Ucar, D., Mahmoudi, S., Karra, K., Wong, E. D., Devarajan, K., Daugherty, A. C., Kundaje, A. B., Mancini, E., Hitz, B. C., Gupta, R., Rando, T. A., Baker, J. C., Snyder, M. P., Cherry, J., Brunet, A. 2015; 163 (5): 1281-U264

    View details for DOI 10.1016/j.cell.2015.10.051

    View details for Web of Science ID 000366044700024

    View details for PubMedID 28930648

  • Characterization of TCF21 Downstream Target Regions Identifies a Transcriptional Network Linking Multiple Independent Coronary Artery Disease Loci. PLoS genetics Sazonova, O., Zhao, Y., Nürnberg, S., Miller, C., Pjanic, M., Castano, V. G., Kim, J. B., Salfati, E. L., Kundaje, A. B., Bejerano, G., Assimes, T., Yang, X., Quertermous, T. 2015; 11 (5)

    Abstract

    To functionally link coronary artery disease (CAD) causal genes identified by genome wide association studies (GWAS), and to investigate the cellular and molecular mechanisms of atherosclerosis, we have used chromatin immunoprecipitation sequencing (ChIP-Seq) with the CAD associated transcription factor TCF21 in human coronary artery smooth muscle cells (HCASMC). Analysis of identified TCF21 target genes for enrichment of molecular and cellular annotation terms identified processes relevant to CAD pathophysiology, including "growth factor binding," "matrix interaction," and "smooth muscle contraction." We characterized the canonical binding sequence for TCF21 as CAGCTG, identified AP-1 binding sites in TCF21 peaks, and by conducting ChIP-Seq for JUN and JUND in HCASMC confirmed that there is significant overlap between TCF21 and AP-1 binding loci in this cell type. Expression quantitative trait variation mapped to target genes of TCF21 was significantly enriched among variants with low P-values in the GWAS analyses, suggesting a possible functional interaction between TCF21 binding and causal variants in other CAD disease loci. Separate enrichment analyses found over-representation of TCF21 target genes among CAD associated genes, and linkage disequilibrium between TCF21 peak variation and that found in GWAS loci, consistent with the hypothesis that TCF21 may affect disease risk through interaction with other disease associated loci. Interestingly, enrichment for TCF21 target genes was also found among other genome wide association phenotypes, including height and inflammatory bowel disease, suggesting a functional profile important for basic cellular processes in non-vascular tissues. Thus, data and analyses presented here suggest that study of GWAS transcription factors may be a highly useful approach to identifying disease gene interactions and thus pathways that may be relevant to complex disease etiology.

    View details for DOI 10.1371/journal.pgen.1005202

    View details for PubMedID 26020271

  • Fine mapping of type 1 diabetes susceptibility loci and evidence for colocalization of causal variants with lymphoid gene enhancers NATURE GENETICS Onengut-Gumuscu, S., Chen, W., Burren, O., Cooper, N. J., Quinlan, A. R., Mychaleckyj, J. C., Farber, E., Bonnie, J. K., Szpak, M., Schofield, E., Achuthan, P., Guo, H., Fortune, M. D., Stevens, H., Walker, N. M., Ward, L. D., Kundaje, A., Kellis, M., Daly, M. J., Barrett, J. C., Cooper, J. D., Deloukas, P., Todd, J. A., Wallace, C., Concannon, P., Rich, S. S. 2015; 47 (4): 381-386

    Abstract

    Genetic studies of type 1 diabetes (T1D) have identified 50 susceptibility regions, finding major pathways contributing to risk, with some loci shared across immune disorders. To make genetic comparisons across autoimmune disorders as informative as possible, a dense genotyping array, the Immunochip, was developed, from which we identified four new T1D-associated regions (P < 5 × 10⁻⁸). A comparative analysis with 15 immune diseases showed that T1D is more similar genetically to other autoantibody-positive diseases, significantly most similar to juvenile idiopathic arthritis and significantly least similar to ulcerative colitis, and provided support for three additional new T1D risk loci. Using a Bayesian approach, we defined credible sets for the T1D-associated SNPs. The associated SNPs localized to enhancer sequences active in thymus, T and B cells, and CD34+ stem cells. Enhancer-promoter interactions can now be analyzed in these cell types to identify which particular genes and regulatory sequences are causal.

    View details for DOI 10.1038/ng.3245

    View details for Web of Science ID 000351922900014

    View details for PubMedID 25751624

    View details for PubMedCentralID PMC4380767

  • Reassessment of Piwi Binding to the Genome and Piwi Impact on RNA Polymerase II Distribution DEVELOPMENTAL CELL Lin, H., Chen, M., Kundaje, A., Valouev, A., Yin, H., Liu, N., Neuenkirchen, N., Zhong, M., Snyder, M. 2015; 32 (6): 772-774

    Abstract

    Drosophila Piwi was reported by Huang et al. (2013) to be guided by piRNAs to piRNA-complementary sites in the genome, which then recruits heterochromatin protein 1a and histone methyltransferase Su(Var)3-9 to the sites. Among additional findings, Huang et al. (2013) also reported Piwi binding sites in the genome and the reduction of RNA polymerase II in euchromatin but its increase in pericentric regions in piwi mutants. Marinov et al. (2015) disputed the validity of the Huang et al. bioinformatic pipeline that led to the last two claims. Here we report our independent reanalysis of the data using current bioinformatic methods. Our reanalysis agrees with Marinov et al. (2015) that Piwi's genomic targets still remain to be identified but confirms the Huang et al. claim that Piwi influences RNA polymerase II distribution in the genome. This Matters Arising Response addresses the Marinov et al. (2015) Matters Arising, published concurrently in this issue of Developmental Cell.

    View details for DOI 10.1016/j.devcel.2015.03.004

    View details for PubMedID 25805139

  • A comparative encyclopedia of DNA elements in the mouse genome NATURE Yue, F., Cheng, Y., Breschi, A., Vierstra, J., Wu, W., Ryba, T., Sandstrom, R., Ma, Z., Davis, C., Pope, B. D., Shen, Y., Pervouchine, D. D., Djebali, S., Thurman, R. E., Kaul, R., Rynes, E., Kirilusha, A., Marinov, G. K., Williams, B. A., Trout, D., Amrhein, H., Fisher-Aylor, K., Antoshechkin, I., DeSalvo, G., See, L., Fastuca, M., Drenkow, J., Zaleski, C., Dobin, A., Prieto, P., Lagarde, J., Bussotti, G., Tanzer, A., Denas, O., Li, K., Bender, M. A., Zhang, M., Byron, R., Groudine, M. T., McCleary, D., Pham, L., Ye, Z., Kuan, S., Edsall, L., Wu, Y., Rasmussen, M. D., Bansal, M. S., Kellis, M., Keller, C. A., Morrissey, C. S., Mishra, T., Jain, D., Dogan, N., Harris, R. S., Cayting, P., Kawli, T., Boyle, A. P., Euskirchen, G., Kundaje, A., Lin, S., Lin, Y., Jansen, C., Malladi, V. S., Cline, M. S., Erickson, D. T., Kirkup, V. M., Learned, K., Sloan, C. A., Rosenbloom, K. R., De Sousa, B. L., Beal, K., Pignatelli, M., Flicek, P., Lian, J., Kahveci, T., Lee, D., Kent, W. J., Santos, M. R., Herrero, J., Notredame, C., Johnson, A., Vong, S., Lee, K., Bates, D., Neri, F., Diegel, M., Canfield, T., Sabo, P. J., Wilken, M. S., Reh, T. A., Giste, E., Shafer, A., Kutyavin, T., Haugen, E., Dunn, D., Reynolds, A. P., Neph, S., Humbert, R., Hansen, R. S., de Bruijn, M., Selleri, L., Rudensky, A., Josefowicz, S., Samstein, R., Eichler, E. E., Orkin, S. H., Levasseur, D., Papayannopoulou, T., Chang, K., Skoultchi, A., Gosh, S., Disteche, C., Treuting, P., Wang, Y., Weiss, M. J., Blobel, G. A., Cao, X., Zhong, S., Wang, T., Good, P. J., Lowdon, R. F., Adams, L. B., Zhou, X., Pazin, M. J., Feingold, E. A., Wold, B., Taylor, J., Mortazavi, A., Weissman, S. M., Stamatoyannopoulos, J. A., Snyder, M. P., Guigo, R., Gingeras, T. R., Gilbert, D. M., Hardison, R. C., Beer, M. A., Ren, B. 2014; 515 (7527): 355-364

    Abstract

    The laboratory mouse shares the majority of its protein-coding genes with humans, making it the premier model organism in biomedical research, yet the two mammals differ in significant ways. To gain greater insights into both shared and species-specific transcriptional and cellular regulatory programs in the mouse, the Mouse ENCODE Consortium has mapped transcription, DNase I hypersensitivity, transcription factor binding, chromatin modifications and replication domains throughout the mouse genome in diverse cell and tissue types. By comparing with the human genome, we not only confirm substantial conservation in the newly annotated potential functional sequences, but also find a large degree of divergence of sequences involved in transcriptional regulation, chromatin state and higher order chromatin organization. Our results illuminate the wide range of evolutionary forces acting on genes and their regulatory regions, and provide a general resource for research into mammalian biology and mechanisms of human diseases.

    View details for DOI 10.1038/nature13992

    View details for Web of Science ID 000345770600034

  • Principles of regulatory information conservation between mouse and human. Nature Cheng, Y., Ma, Z., Kim, B. H., Wu, W., Cayting, P., Boyle, A. P., Sundaram, V., Xing, X., Dogan, N., Li, J., Euskirchen, G., Lin, S., Lin, Y., Visel, A., Kawli, T., Yang, X., Patacsil, D., Keller, C. A., Giardine, B., Kundaje, A., Wang, T., Pennacchio, L. A., Weng, Z., Hardison, R. C., Snyder, M. P. 2014; 515 (7527): 371-5

    Abstract

    To broaden our understanding of the evolution of gene regulation mechanisms, we generated occupancy profiles for 34 orthologous transcription factors (TFs) in human-mouse erythroid progenitor, lymphoblast and embryonic stem-cell lines. By combining the genome-wide transcription factor occupancy repertoires, associated epigenetic signals, and co-association patterns, here we deduce several evolutionary principles of gene regulatory features operating since the mouse and human lineages diverged. The genomic distribution profiles, primary binding motifs, chromatin states, and DNA methylation preferences are well conserved for TF-occupied sequences. However, the extent to which orthologous DNA segments are bound by orthologous TFs varies both among TFs and with genomic location: binding at promoters is more highly conserved than binding at distal elements. Notably, occupancy-conserved TF-occupied sequences tend to be pleiotropic; they function in several tissues and also co-associate with many TFs. Single nucleotide variants at sites with potential regulatory functions are enriched in occupancy-conserved TF-occupied sequences.

    View details for DOI 10.1038/nature13985

    View details for PubMedID 25409826

    View details for PubMedCentralID PMC4343047

  • Principles of regulatory information conservation between mouse and human NATURE Cheng, Y., Ma, Z., Kim, B., Wu, W., Cayting, P., Boyle, A. P., Sundaram, V., Xing, X., Dogan, N., Li, J., Euskirchen, G., Lin, S., Lin, Y., Visel, A., Kawli, T., Yang, X., Patacsil, D., Keller, C. A., Giardine, B., Kundaje, A., Wang, T., Pennacchio, L. A., Weng, Z., Hardison, R. C., Snyder, M. P. 2014; 515 (7527): 371-375

    Abstract

    To broaden our understanding of the evolution of gene regulation mechanisms, we generated occupancy profiles for 34 orthologous transcription factors (TFs) in human-mouse erythroid progenitor, lymphoblast and embryonic stem-cell lines. By combining the genome-wide transcription factor occupancy repertoires, associated epigenetic signals, and co-association patterns, here we deduce several evolutionary principles of gene regulatory features operating since the mouse and human lineages diverged. The genomic distribution profiles, primary binding motifs, chromatin states, and DNA methylation preferences are well conserved for TF-occupied sequences. However, the extent to which orthologous DNA segments are bound by orthologous TFs varies both among TFs and with genomic location: binding at promoters is more highly conserved than binding at distal elements. Notably, occupancy-conserved TF-occupied sequences tend to be pleiotropic; they function in several tissues and also co-associate with many TFs. Single nucleotide variants at sites with potential regulatory functions are enriched in occupancy-conserved TF-occupied sequences.

    View details for DOI 10.1038/nature13985

    View details for Web of Science ID 000345770600036

    View details for PubMedCentralID PMC4343047

  • A comparative encyclopedia of DNA elements in the mouse genome. Nature Yue, F., Cheng, Y., Breschi, A., Vierstra, J., Wu, W., Ryba, T., Sandstrom, R., Ma, Z., Davis, C., Pope, B. D., Shen, Y., Pervouchine, D. D., Djebali, S., Thurman, R. E., Kaul, R., Rynes, E., Kirilusha, A., Marinov, G. K., Williams, B. A., Trout, D., Amrhein, H., Fisher-Aylor, K., Antoshechkin, I., DeSalvo, G., See, L., Fastuca, M., Drenkow, J., Zaleski, C., Dobin, A., Prieto, P., Lagarde, J., Bussotti, G., Tanzer, A., Denas, O., Li, K., Bender, M. A., Zhang, M., Byron, R., Groudine, M. T., McCleary, D., Pham, L., Ye, Z., Kuan, S., Edsall, L., Wu, Y., Rasmussen, M. D., Bansal, M. S., Kellis, M., Keller, C. A., Morrissey, C. S., Mishra, T., Jain, D., Dogan, N., Harris, R. S., Cayting, P., Kawli, T., Boyle, A. P., Euskirchen, G., Kundaje, A., Lin, S., Lin, Y., Jansen, C., Malladi, V. S., Cline, M. S., Erickson, D. T., Kirkup, V. M., Learned, K., Sloan, C. A., Rosenbloom, K. R., Lacerda de Sousa, B., Beal, K., Pignatelli, M., Flicek, P., Lian, J., Kahveci, T., Lee, D., Kent, W. J., Ramalho Santos, M., Herrero, J., Notredame, C., Johnson, A., Vong, S., Lee, K., Bates, D., Neri, F., Diegel, M., Canfield, T., Sabo, P. J., Wilken, M. S., Reh, T. A., Giste, E., Shafer, A., Kutyavin, T., Haugen, E., Dunn, D., Reynolds, A. P., Neph, S., Humbert, R., Hansen, R. S., de Bruijn, M., Selleri, L., Rudensky, A., Josefowicz, S., Samstein, R., Eichler, E. E., Orkin, S. H., Levasseur, D., Papayannopoulou, T., Chang, K., Skoultchi, A., Gosh, S., Disteche, C., Treuting, P., Wang, Y., Weiss, M. J., Blobel, G. A., Cao, X., Zhong, S., Wang, T., Good, P. J., Lowdon, R. F., Adams, L. B., Zhou, X., Pazin, M. J., Feingold, E. A., Wold, B., Taylor, J., Mortazavi, A., Weissman, S. M., Stamatoyannopoulos, J. A., Snyder, M. P., Guigo, R., Gingeras, T. R., Gilbert, D. M., Hardison, R. C., Beer, M. A., Ren, B. 2014; 515 (7527): 355-364

    Abstract

    The laboratory mouse shares the majority of its protein-coding genes with humans, making it the premier model organism in biomedical research, yet the two mammals differ in significant ways. To gain greater insights into both shared and species-specific transcriptional and cellular regulatory programs in the mouse, the Mouse ENCODE Consortium has mapped transcription, DNase I hypersensitivity, transcription factor binding, chromatin modifications and replication domains throughout the mouse genome in diverse cell and tissue types. By comparing with the human genome, we not only confirm substantial conservation in the newly annotated potential functional sequences, but also find a large degree of divergence of sequences involved in transcriptional regulation, chromatin state and higher order chromatin organization. Our results illuminate the wide range of evolutionary forces acting on genes and their regulatory regions, and provide a general resource for research into mammalian biology and mechanisms of human diseases.

    View details for DOI 10.1038/nature13992

    View details for PubMedID 25409824

  • Transcription Factors Bind Negatively Selected Sites within Human mtDNA Genes GENOME BIOLOGY AND EVOLUTION Blumberg, A., Sailaja, B. S., Kundaje, A., Levin, L., Dadon, S., Shmorak, S., Shaulian, E., Meshorer, E., Mishmar, D. 2014; 6 (10): 2634-2646

    Abstract

    Transcription of mitochondrial DNA (mtDNA)-encoded genes is thought to be regulated by a handful of dedicated transcription factors (TFs), suggesting that mtDNA genes are separately regulated from the nucleus. However, several TFs, with known nuclear activities, were found to bind mtDNA and regulate mitochondrial transcription. Additionally, mtDNA transcriptional regulatory elements, which were proved important in vitro, were harbored by a deletion that normally segregated among healthy individuals. Hence, mtDNA transcriptional regulation is more complex than once thought. Here, by analyzing ENCODE chromatin immunoprecipitation sequencing (ChIP-seq) data, we identified strong binding sites of three bona fide nuclear TFs (c-Jun, Jun-D, and CEBPb) within human mtDNA protein-coding genes. We validated the binding of two TFs by ChIP-quantitative polymerase chain reaction (c-Jun and Jun-D) and showed their mitochondrial localization by electron microscopy and subcellular fractionation. As a step toward investigating the functionality of these TF-binding sites (TFBS), we assessed signatures of selection. By analyzing 9,868 human mtDNA sequences encompassing all major global populations, we recorded genetic variants in tips and nodes of mtDNA phylogeny within the TFBS. We next calculated the effects of variants on binding motif prediction scores. Finally, the mtDNA variation pattern in predicted TFBS, occurring within ChIP-seq negative-binding sites, was compared with ChIP-seq positive-TFBS (CPR). Motifs within CPRs of c-Jun, Jun-D, and CEBPb harbored either only tip variants or their nodal variants retained high motif prediction scores. This reflects negative selection within mtDNA CPRs, thus supporting their functionality. Hence, human mtDNA-coding sequences may have dual roles, namely coding for genes yet possibly also possessing regulatory potential.

    View details for DOI 10.1093/gbe/evu210

    View details for PubMedID 25245407

  • Comparative analysis of regulatory information and circuits across distant species. Nature Boyle, A. P., Araya, C. L., Brdlik, C., Cayting, P., Cheng, C., Cheng, Y., Gardner, K., Hillier, L. W., Janette, J., Jiang, L., Kasper, D., Kawli, T., Kheradpour, P., Kundaje, A., Li, J. J., Ma, L., Niu, W., Rehm, E. J., Rozowsky, J., Slattery, M., Spokony, R., Terrell, R., Vafeados, D., Wang, D., Weisdepp, P., Wu, Y., Xie, D., Yan, K., Feingold, E. A., Good, P. J., Pazin, M. J., Huang, H., Bickel, P. J., Brenner, S. E., Reinke, V., Waterston, R. H., Gerstein, M., White, K. P., Kellis, M., Snyder, M. 2014; 512 (7515): 453-456

    Abstract

    Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.

    View details for DOI 10.1038/nature13668

    View details for PubMedID 25164757

  • Comparative analysis of metazoan chromatin organization. Nature Ho, J. W., Jung, Y. L., Liu, T., Alver, B. H., Lee, S., Ikegami, K., Sohn, K., Minoda, A., Tolstorukov, M. Y., Appert, A., Parker, S. C., Gu, T., Kundaje, A., Riddle, N. C., Bishop, E., Egelhofer, T. A., Hu, S. S., Alekseyenko, A. A., Rechtsteiner, A., Asker, D., Belsky, J. A., Bowman, S. K., Chen, Q. B., Chen, R. A., Day, D. S., Dong, Y., Dose, A. C., Duan, X., Epstein, C. B., Ercan, S., Feingold, E. A., Ferrari, F., Garrigues, J. M., Gehlenborg, N., Good, P. J., Haseley, P., He, D., Herrmann, M., Hoffman, M. M., Jeffers, T. E., Kharchenko, P. V., Kolasinska-Zwierz, P., Kotwaliwale, C. V., Kumar, N., Langley, S. A., Larschan, E. N., Latorre, I., Libbrecht, M. W., Lin, X., Park, R., Pazin, M. J., Pham, H. N., Plachetka, A., Qin, B., Schwartz, Y. B., Shoresh, N., Stempor, P., Vielle, A., Wang, C., Whittle, C. M., Xue, H., Kingston, R. E., Kim, J. H., Bernstein, B. E., Dernburg, A. F., Pirrotta, V., Kuroda, M. I., Noble, W. S., Tullius, T. D., Kellis, M., MacAlpine, D. M., Strome, S., Elgin, S. C., Liu, X. S., Lieb, J. D., Ahringer, J., Karpen, G. H., Park, P. J. 2014; 512 (7515): 449-452

    Abstract

    Genome function is dynamically regulated in part by chromatin, which consists of the histones, non-histone proteins and RNA molecules that package DNA. Studies in Caenorhabditis elegans and Drosophila melanogaster have contributed substantially to our understanding of molecular mechanisms of genome function in humans, and have revealed conservation of chromatin components and mechanisms. Nevertheless, the three organisms have markedly different genome sizes, chromosome architecture and gene organization. On human and fly chromosomes, for example, pericentric heterochromatin flanks single centromeres, whereas worm chromosomes have dispersed heterochromatin-like regions enriched in the distal chromosomal 'arms', and centromeres distributed along their lengths. To systematically investigate chromatin organization and associated gene regulation across species, we generated and analysed a large collection of genome-wide chromatin data sets from cell lines and developmental stages in worm, fly and human. Here we present over 800 new data sets from our ENCODE and modENCODE consortia, bringing the total to over 1,400. Comparison of combinatorial patterns of histone modifications, nuclear lamina-associated domains, organization of large-scale topological domains, chromatin environment at promoters and enhancers, nucleosome positioning, and DNA replication patterns reveals many conserved features of chromatin organization among the three organisms. We also find notable differences in the composition and locations of repressive chromatin. These data sets and analyses provide a rich resource for comparative and species-specific investigations of chromatin composition, organization and function.

    View details for DOI 10.1038/nature13415

    View details for PubMedID 25164756

  • Regulatory analysis of the C. elegans genome with spatiotemporal resolution. Nature Araya, C. L., Kawli, T., Kundaje, A., Jiang, L., Wu, B., Vafeados, D., Terrell, R., Weissdepp, P., Gevirtzman, L., Mace, D., Niu, W., Boyle, A. P., Xie, D., Ma, L., Murray, J. I., Reinke, V., Waterston, R. H., Snyder, M. 2014; 512 (7515): 400-405

    Abstract

    Discovering the structure and dynamics of transcriptional regulatory events in the genome with cellular and temporal resolution is crucial to understanding the regulatory underpinnings of development and disease. We determined the genomic distribution of binding sites for 92 transcription factors and regulatory proteins across multiple stages of Caenorhabditis elegans development by performing 241 ChIP-seq (chromatin immunoprecipitation followed by sequencing) experiments. Integration of regulatory binding and cellular-resolution expression data produced a spatiotemporally resolved metazoan transcription factor binding map. Using this map, we explore developmental regulatory circuits that encode combinatorial logic at the levels of co-binding and co-expression of transcription factors, characterizing the genomic coverage and clustering of regulatory binding, the binding preferences of, and biological processes regulated by, transcription factors, the global transcription factor co-associations and genomic subdomains that suggest shared patterns of regulation, and identifying key transcription factors and transcription factor co-associations for fate specification of individual lineages and cell types.

    View details for DOI 10.1038/nature13497

    View details for PubMedID 25164749

  • Reply to Brunet and Doolittle: Both selected effect and causal role elements can influence human biology and disease PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Kellis, M., Wold, B., Snyder, M. P., Bernstein, B. E., Kundaje, A., Marinov, G. K., Ward, L. D., Birney, E., Crawford, G. E., Dekker, J., Dunham, I., Elnitski, L. L., Farnham, P. J., Feingold, E. A., Gerstein, M., Giddings, M. C., Gilbert, D. M., Gingeras, T. R., Green, E. D., Guigo, R., Hubbard, T., Kent, J., Lieb, J. D., Myers, R. M., Pazin, M. J., Ren, B., Stamatoyannopoulos, J., Weng, Z., White, K. P., Hardison, R. C. 2014; 111 (33): E3366-E3366

    View details for DOI 10.1073/pnas.1410434111

    View details for Web of Science ID 000340438800004

    View details for PubMedID 25275169

    View details for PubMedCentralID PMC4143047

  • H3K4me3 Breadth Is Linked to Cell Identity and Transcriptional Consistency. Cell Benayoun, B. A., Pollina, E. A., Ucar, D., Mahmoudi, S., Karra, K., Wong, E. D., Devarajan, K., Daugherty, A. C., Kundaje, A. B., Mancini, E., Hitz, B. C., Gupta, R., Rando, T. A., Baker, J. C., Snyder, M. P., Cherry, J. M., Brunet, A. 2014; 158 (3): 673-688

    Abstract

    Trimethylation of histone H3 at lysine 4 (H3K4me3) is a chromatin modification known to mark the transcription start sites of active genes. Here, we show that H3K4me3 domains that spread more broadly over genes in a given cell type preferentially mark genes that are essential for the identity and function of that cell type. Using the broadest H3K4me3 domains as a discovery tool in neural progenitor cells, we identify novel regulators of these cells. Machine learning models reveal that the broadest H3K4me3 domains represent a distinct entity, characterized by increased marks of elongation. The broadest H3K4me3 domains also have more paused polymerase at their promoters, suggesting a unique transcriptional output. Indeed, genes marked by the broadest H3K4me3 domains exhibit enhanced transcriptional consistency and increased transcriptional levels, and perturbation of H3K4me3 breadth leads to changes in transcriptional consistency. Thus, H3K4me3 breadth contains information that could ensure transcriptional precision at key cell identity/function genes.

    View details for DOI 10.1016/j.cell.2014.06.027

    View details for PubMedID 25083876

  • Diverse patterns of genomic targeting by transcriptional regulators in Drosophila melanogaster GENOME RESEARCH Slattery, M., Ma, L., Spokony, R. F., Arthur, R. K., Kheradpour, P., Kundaje, A., Negre, N., Crofts, A., Ptashkin, R., Zieba, J., Ostapenko, A., Suchy, S., Victorsen, A., Jameel, N., Grundstad, A., Gao, W., Moran, J. R., Rehm, E., Grossman, R. L., Kellis, M., White, K. P. 2014; 24 (7): 1224-1235

    Abstract

    Annotation of regulatory elements and identification of the transcription-related factors (TRFs) targeting these elements are key steps in understanding how cells interpret their genetic blueprint and their environment during development, and how that process goes awry in the case of disease. One goal of the modENCODE (model organism ENCyclopedia of DNA Elements) Project is to survey a diverse sampling of TRFs, both DNA-binding and non-DNA-binding factors, to provide a framework for the subsequent study of the mechanisms by which transcriptional regulators target the genome. Here we provide an updated map of the Drosophila melanogaster regulatory genome based on the location of 84 TRFs at various stages of development. This regulatory map reveals a variety of genomic targeting patterns, including factors with strong preferences toward proximal promoter binding, factors that target intergenic and intronic DNA, and factors with distinct chromatin state preferences. The data also highlight the stringency of the Polycomb regulatory network, and show association of the Trithorax-like (Trl) protein with hotspots of DNA binding throughout development. Furthermore, the data identify more than 5800 instances in which TRFs target DNA regions with demonstrated enhancer activity. Regions of high TRF co-occupancy are more likely to be associated with open enhancers used across cell types, while lower TRF occupancy regions are associated with complex enhancers that are also regulated at the epigenetic level. Together these data serve as a resource for the research community in the continued effort to dissect transcriptional regulatory mechanisms directing Drosophila development.

    View details for DOI 10.1101/gr.168807.113

    View details for Web of Science ID 000338185000015

    View details for PubMedID 24985916

    View details for PubMedCentralID PMC4079976

  • Defining functional DNA elements in the human genome PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Kellis, M., Wold, B., Snyder, M. P., Bernstein, B. E., Kundaje, A., Marinov, G. K., Ward, L. D., Birney, E., Crawford, G. E., Dekker, J., Dunham, I., Elnitski, L. L., Farnham, P. J., Feingold, E. A., Gerstein, M., Giddings, M. C., Gilbert, D. M., Gingeras, T. R., Green, E. D., Guigo, R., Hubbard, T., Kent, J., Lieb, J. D., Myers, R. M., Pazin, M. J., Ren, B., Stamatoyannopoulos, J. A., Weng, Z., White, K. P., Hardison, R. C. 2014; 111 (17): 6131-6138

    Abstract

    With the completion of the human genome sequence, attention turned to identifying and annotating its functional DNA elements. As a complement to genetic and comparative genomics approaches, the Encyclopedia of DNA Elements Project was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types. The resulting genome-wide data reveal sites of biochemical activity with high positional resolution and cell type specificity that facilitate studies of gene regulation and interpretation of noncoding variants associated with human disease. However, the biochemically active regions cover a much larger fraction of the genome than do evolutionarily conserved regions, raising the question of whether nonconserved but biochemically active regions are truly functional. Here, we review the strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies. We also analyze the relationship between signal intensity, genomic coverage, and evolutionary conservation. Our results reinforce the principle that each approach provides complementary information and that we need to use combinations of all three to elucidate genome function in human biology and disease.

    View details for DOI 10.1073/pnas.1318948111

    View details for Web of Science ID 000335199000025

    View details for PubMedID 24753594

    View details for PubMedCentralID PMC4035993

  • Large-Scale Quality Analysis of Published ChIP-seq Data. G3 (Bethesda, Md.) Marinov, G. K., Kundaje, A., Park, P. J., Wold, B. J. 2014; 4 (2): 209-223

    Abstract

    ChIP-seq has become the primary method for identifying in vivo protein-DNA interactions on a genome-wide scale, with nearly 800 publications involving the technique appearing in PubMed as of December 2012. Individually and in aggregate, these data are an important and information-rich resource. However, uncertainties about data quality confound their use by the wider research community. Recently, the Encyclopedia of DNA Elements (ENCODE) project developed and applied metrics to objectively measure ChIP-seq data quality. The ENCODE quality analysis was useful for flagging datasets for closer inspection, eliminating or replacing poor data, and for driving changes in experimental pipelines. There had been no similarly systematic quality analysis of the large and disparate body of published ChIP-seq profiles. Here, we report a uniform analysis of vertebrate transcription factor ChIP-seq datasets in the Gene Expression Omnibus (GEO) repository as of April 1, 2012. The majority (55%) of datasets scored as being highly successful, but a substantial minority (20%) were of apparently poor quality, and another ∼25% were of intermediate quality. We discuss how different uses of ChIP-seq data are affected by specific aspects of data quality, and we highlight exceptional instances for which the metric values should not be taken at face value. Unexpectedly, we discovered that a significant subset of control datasets (i.e., no immunoprecipitation and mock immunoprecipitation samples) display an enrichment structure similar to successful ChIP-seq data. This can, in turn, affect peak calling and data interpretation. Published datasets identified here as high-quality comprise a large group that users can draw on for large-scale integrated analysis. In the future, ChIP-seq quality assessment similar to that used here could guide experimentalists at early stages in a study, provide useful input in the publication process, and be used to stratify ChIP-seq data for different community-wide uses.

    View details for DOI 10.1534/g3.113.008680

    View details for PubMedID 24347632

    View details for PubMedCentralID PMC3931556

  • STAT3 Targets Suggest Mechanisms of Aggressive Tumorigenesis in Diffuse Large B-Cell Lymphoma G3-GENES GENOMES GENETICS Hardee, J., Ouyang, Z., Zhang, Y., Kundaje, A., Lacroute, P., Snyder, M. 2013; 3 (12): 2173-2185

    Abstract

    The signal transducer and activator of transcription 3 (STAT3) is a transcription factor that, when dysregulated, becomes a powerful oncogene found in many human cancers, including diffuse large B-cell lymphoma. Diffuse large B-cell lymphoma is the most common form of non-Hodgkin's lymphoma and has two major subtypes: germinal center B-cell-like and activated B-cell-like. Compared with the germinal center B-cell-like form, activated B-cell-like lymphomas respond much more poorly to current therapies and often exhibit overexpression or overactivation of STAT3. To investigate how STAT3 might contribute to this aggressive phenotype, we have integrated genome-wide studies of STAT3 DNA binding using chromatin immunoprecipitation-sequencing with whole-transcriptome profiling using RNA-sequencing. STAT3 binding sites are present near almost a third of all genes that differ in expression between the two subtypes, and examination of the affected genes identified previously undetected and clinically significant pathways downstream of STAT3 that drive oncogenesis. Novel treatments aimed at these pathways may increase the survivability of activated B-cell-like diffuse large B-cell lymphoma.

    View details for DOI 10.1534/g3.113.007674

    View details for PubMedID 24142927

  • Extensive Variation in Chromatin States Across Humans SCIENCE Kasowski, M., Kyriazopoulou-Panagiotopoulou, S., Grubert, F., Zaugg, J. B., Kundaje, A., Liu, Y., Boyle, A. P., Zhang, Q. C., Zakharia, F., Spacek, D. V., Li, J., Xie, D., Olarerin-George, A., Steinmetz, L. M., Hogenesch, J. B., Kellis, M., Batzoglou, S., Snyder, M. 2013; 342 (6159): 750-752

    Abstract

    The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.

    View details for DOI 10.1126/science.1242510

    View details for PubMedID 24136358

  • Integrative annotation of chromatin elements from ENCODE data NUCLEIC ACIDS RESEARCH Hoffman, M. M., Ernst, J., Wilder, S. P., Kundaje, A., Harris, R. S., Libbrecht, M., Giardine, B., Ellenbogen, P. M., Bilmes, J. A., Birney, E., Hardison, R. C., Dunham, I., Kellis, M., Noble, W. S. 2013; 41 (2): 827-841

    Abstract

    The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotations of non-coding regulatory elements correlate strongly with mammalian evolutionary constraint, and provide an unbiased approach for evaluating metrics of evolutionary constraint in human. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape.

    View details for DOI 10.1093/nar/gks1284

    View details for Web of Science ID 000314121100021

    View details for PubMedID 23221638

    View details for PubMedCentralID PMC3553955

  • Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors Wang, J., Zhuang, J., Iyer, S., Lin, X., Whitfield, T. W., Greven, M. C., Pierce, B. G., Dong, X., Kundaje, A., Cheng, Y., Rando, O. J., Birney, E., Myers, R. M., Noble, W. S., Snyder, M., Weng, Z. TAYLOR & FRANCIS INC. 2013: 49-50
  • Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors GENOME RESEARCH Wang, J., Zhuang, J., Iyer, S., Lin, X., Whitfield, T. W., Greven, M. C., Pierce, B. G., Dong, X., Kundaje, A., Cheng, Y., Rando, O. J., Birney, E., Myers, R. M., Noble, W. S., Snyder, M., Weng, Z. 2012; 22 (9): 1798-1812

    Abstract

    Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has become the dominant technique for mapping transcription factor (TF) binding regions genome-wide. We performed an integrative analysis centered around 457 ChIP-seq data sets on 119 human TFs generated by the ENCODE Consortium. We identified highly enriched sequence motifs in most data sets, revealing new motifs and validating known ones. The motif sites (TF binding sites) are highly conserved evolutionarily and show distinct footprints upon DNase I digestion. We frequently detected secondary motifs in addition to the canonical motifs of the TFs, indicating tethered binding and cobinding between multiple TFs. We observed significant position and orientation preferences between many cobinding TFs. Genes specifically expressed in a cell line are often associated with a greater occurrence of nearby TF binding in that cell line. We observed cell-line-specific secondary motifs that mediate the binding of the histone deacetylase HDAC2 and the enhancer-binding protein EP300. TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions, flanked by well-positioned nucleosomes, and many of these features show cell type specificity. The GC-richness may be beneficial for regulating TF binding because, when unoccupied by a TF, these regions are occupied by nucleosomes in vivo. We present the results of our analysis in a TF-centric web repository Factorbook (http://factorbook.org) and will continually update this repository as more ENCODE data are generated.

    View details for DOI 10.1101/gr.139105.112

    View details for Web of Science ID 000308272800020

    View details for PubMedID 22955990

    View details for PubMedCentralID PMC3431495

  • Long noncoding RNAs are rarely translated in two human cell lines GENOME RESEARCH Banfai, B., Jia, H., Khatun, J., Wood, E., Risk, B., Gundling, W. E., Kundaje, A., Gunawardena, H. P., Yu, Y., Xie, L., Krajewski, K., Strahl, B. D., Chen, X., Bickel, P., Giddings, M. C., Brown, J. B., Lipovich, L. 2012; 22 (9): 1646-1657

    Abstract

    Data from the Encyclopedia of DNA Elements (ENCODE) project show over 9640 human genome loci classified as long noncoding RNAs (lncRNAs), yet only ~100 have been deeply characterized to determine their role in the cell. To measure the protein-coding output from these RNAs, we jointly analyzed two recent data sets produced in the ENCODE project: tandem mass spectrometry (MS/MS) data mapping expressed peptides to their encoding genomic loci, and RNA-seq data generated by ENCODE in long polyA+ and polyA- fractions in the cell lines K562 and GM12878. We used the machine-learning algorithm RuleFit3 to regress the peptide data against RNA expression data. The most important covariate for predicting translation was, surprisingly, the Cytosol polyA- fraction in both cell lines. LncRNAs are ~13-fold less likely to produce detectable peptides than similar mRNAs, indicating that ~92% of GENCODE v7 lncRNAs are not translated in these two ENCODE cell lines. Intersecting 9640 lncRNA loci with 79,333 peptides yielded 85 unique peptides matching 69 lncRNAs. Most cases were due to a coding transcript misannotated as lncRNA. Two exceptions were an unprocessed pseudogene and a bona fide lncRNA gene, both with open reading frames (ORFs) compromised by upstream stop codons. All potentially translatable lncRNA ORFs had only a single peptide match, indicating low protein abundance and/or false-positive peptide matches. We conclude that with very few exceptions, ribosomes are able to distinguish coding from noncoding transcripts and, hence, that ectopic translation and cryptic mRNAs are rare in the human lncRNAome.

    View details for DOI 10.1101/gr.134767.111

    View details for Web of Science ID 000308272800007

    View details for PubMedID 22955977

    View details for PubMedCentralID PMC3431482

  • Modeling gene expression using chromatin features in various cellular contexts GENOME BIOLOGY Dong, X., Greven, M. C., Kundaje, A., Djebali, S., Brown, J. B., Cheng, C., Gingeras, T. R., Gerstein, M., Guigo, R., Birney, E., Weng, Z. 2012; 13 (9)

    Abstract

    Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines. We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA. Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.

    View details for DOI 10.1186/gb-2012-13-9-r53

    View details for Web of Science ID 000313182600006

    View details for PubMedID 22950368

    View details for PubMedCentralID PMC3491397

  • Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors GENOME BIOLOGY Yip, K. Y., Cheng, C., Bhardwaj, N., Brown, J. B., Leng, J., Kundaje, A., Rozowsky, J., Birney, E., Bickel, P., Snyder, M., Gerstein, M. 2012; 13 (9)

    Abstract

    Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors. As part of the consortium effort in providing a concise abstraction of the data for facilitating various types of downstream analyses, we constructed statistical models that capture the genomic features of three paired types of regions by machine-learning methods: firstly, regions with active or inactive binding; secondly, those with extremely high or low degrees of co-binding, termed HOT and LOT regions; and finally, regulatory modules proximal or distal to genes. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, many of which were validated experimentally. We further associated the predicted enhancers with potential target transcripts and the transcription factors involved. For HOT regions, we found a significant fraction of transcription factor binding without clear sequence motifs and showed that this observation could be related to strong DNA accessibility of these regions. Overall, the three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity. Our machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.

    View details for DOI 10.1186/gb-2012-13-9-r48

    View details for Web of Science ID 000313182600001

    View details for PubMedID 22950945

    View details for PubMedCentralID PMC3491392

  • A User's Guide to the Encyclopedia of DNA Elements (ENCODE) PLOS BIOLOGY Myers, R. M., Stamatoyannopoulos, J., Snyder, M., Dunham, I., Hardison, R. C., Bernstein, B. E., Gingeras, T. R., Kent, W. J., Birney, E., Wold, B., Crawford, G. E., Bernstein, B. E., Epstein, C. B., Shoresh, N., Ernst, J., Mikkelsen, T. S., Kheradpour, P., Zhang, X., Wang, L., Issner, R., Coyne, M. J., Durham, T., Ku, M., Thanh Truong, T., Ward, L. D., Altshuler, R. C., Lin, M. F., Kellis, M., Gingeras, T. R., Davis, C. A., Kapranov, P., Dobin, A., Zaleski, C., Schlesinger, F., Batut, P., Chakrabortty, S., Jha, S., Lin, W., Drenkow, J., Wang, H., Bell, K., Gao, H., Bell, I., Dumais, E., Dumais, J., Antonarakis, S. E., Ucla, C., Borel, C., Guigo, R., Djebali, S., Lagarde, J., Kingswood, C., Ribeca, P., Sammeth, M., Alioto, T., Merkel, A., Tilgner, H., Carninci, P., Hayashizaki, Y., Lassmann, T., Takahashi, H., Abdelhamid, R. F., Hannon, G., Fejes-Toth, K., Preall, J., Gordon, A., Sotirova, V., Reymond, A., Howald, C., Graison, E. A., Chrast, J., Ruan, Y., Ruan, X., Shahab, A., Poh, W. T., Wei, C., Crawford, G. E., Furey, T. S., Boyle, A. P., Sheffield, N. C., Song, L., Shibata, Y., Vales, T., Winter, D., Zhang, Z., London, D., Wang, T., Birney, E., Keefe, D., Iyer, V. R., Lee, B., McDaniell, R. M., Liu, Z., Battenhouse, A., Bhinge, A. A., Lieb, J. D., Grasfeder, L. L., Showers, K. A., Giresi, P. G., Kim, S. K., Shestak, C., Myers, R. M., Pauli, F., Reddy, T. E., Gertz, J., Partridge, E. C., Jain, P., Sprouse, R. O., Bansal, A., Pusey, B., Muratet, M. A., Varley, K. E., Bowling, K. M., Newberry, K. M., Nesmith, A. S., Dilocker, J. A., Parker, S. L., Waite, L. L., Thibeault, K., Roberts, K., Absher, D. M., Wold, B., Mortazavi, A., Williams, B., Marinov, G., Trout, D., Pepke, S., King, B., McCue, K., Kirilusha, A., DeSalvo, G., Fisher-Aylor, K., Amrhein, H., Vielmetter, J., Sherlock, G., Sidow, A., Batzoglou, S., Rauch, R., Kundaje, A., Libbrecht, M., Margulies, E. H., Parker, S. C., Elnitski, L., Green, E. D., Hubbard, T., Harrow, J., Searle, S., Kokocinski, F., Aken, B., Frankish, A., Hunt, T., Despacio-Reyes, G., Kay, M., Mukherjee, G., Bignell, A., Saunders, G., Boychenko, V., Brent, M., van Baren, M. J., Brown, R. H., Gerstein, M., Khurana, E., Balasubramanian, S., Zhang, Z., Lam, H., Cayting, P., Robilotto, R., Lu, Z., Guigo, R., Derrien, T., Tanzer, A., Knowles, D. G., Mariotti, M., Kent, W. J., Haussler, D., Harte, R., Diekhans, M., Kellis, M., Lin, M., Kheradpour, P., Ernst, J., Reymond, A., Howald, C., Graison, E. A., Chrast, J., Valencia, A., Tress, M., Manuel Rodriguez, J., Snyder, M., Landt, S. G., Raha, D., Shi, M., Euskirchen, G., Grubert, F., Kasowski, M., Lian, J., Cayting, P., Lacroute, P., Xu, Y., Monahan, H., Patacsil, D., Slifer, T., Yang, X., Charos, A., Reed, B., Wu, L., Auerbach, R. K., Habegger, L., Hariharan, M., Rozowsky, J., Abyzov, A., Weissman, S. M., Gerstein, M., Struhl, K., Lamarre-Vincent, N., Lindahl-Allen, M., Miotto, B., Moqtaderi, Z., Fleming, J. D., Newburger, P., Farnham, P. J., Frietze, S., O'Geen, H., Xu, X., Blahnik, K. R., Cao, A. R., Iyengar, S., Stamatoyannopoulos, J. A., Kaul, R., Thurman, R. E., Wang, H., Navas, P. A., Sandstrom, R., Sabo, P. J., Weaver, M., Canfield, T., Lee, K., Neph, S., Roach, V., Reynolds, A., Johnson, A., Rynes, E., Giste, E., Vong, S., Neri, J., Frum, T., Johnson, E. M., Nguyen, E. D., Ebersol, A. K., Sanchez, M. E., Sheffer, H. H., Lotakis, D., Haugen, E., Humbert, R., Kutyavin, T., Shafer, T., Dekker, J., Lajoie, B. R., Sanyal, A., Kent, W. J., Rosenbloom, K. R., Dreszer, T. R., Raney, B. J., Barber, G. P., Meyer, L. R., Sloan, C. A., Malladi, V. S., Cline, M. S., Learned, K., Swing, V. K., Zweig, A. S., Rhead, B., Fujita, P. A., Roskin, K., Karolchik, D., Kuhn, R. M., Haussler, D., Birney, E., Dunham, I., Wilder, S. P., Keefe, D., Sobral, D., Herrero, J., Beal, K., Lukk, M., Brazma, A., Vaquerizas, J. M., Luscombe, N. M., Bickel, P. J., Boley, N., Brown, J. B., Li, Q., Huang, H., Gerstein, M., Habegger, L., Sboner, A., Rozowsky, J., Auerbach, R. K., Yip, K. Y., Cheng, C., Yan, K., Bhardwaj, N., Wang, J., Lochovsky, L., Jee, J., Gibson, T., Leng, J., Du, J., Hardison, R. C., Harris, R. S., Song, G., Miller, W., Haussler, D., Roskin, K., Suh, B., Wang, T., Paten, B., Noble, W. S., Hoffman, M. M., Buske, O. J., Weng, Z., Dong, X., Wang, J., Xi, H., Tenenbaum, S. A., Doyle, F., Penalva, L. O., Chittur, S., Tullius, T. D., Parker, S. C., White, K. P., Karmakar, S., Victorsen, A., Jameel, N., Bild, N., Grossman, R. L., Snyder, M., Landt, S. G., Yang, X., Patacsil, D., Slifer, T., Dekker, J., Lajoie, B. R., Sanyal, A., Weng, Z., Whitfield, T. W., Wang, J., Collins, P. J., Trinklein, N. D., Partridge, E. C., Myers, R. M., Giddings, M. C., Chen, X., Khatun, J., Maier, C., Yu, Y., Gunawardena, H., Risk, B., Feingold, E. A., Lowdon, R. F., Dillon, L. A., Good, P. J. 2011; 9 (4)

    Abstract

    The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.

    View details for DOI 10.1371/journal.pbio.1001046

    View details for Web of Science ID 000289938900014

  • CP motifs, Hap1 and heme signaling Zhang, L., Leslie, C., Lee, H. C., Kundaje, A., Ie, E., Xin, X., Freund, Y., MEDIMOND MEDIMOND S R L. 2007: 45-+
  • A classification-based framework for predicting and analyzing gene regulatory response NIPS Workshop on New Problems and Methods in Computational Biology Kundaje, A., Middendorf, M., Shah, M., Wiggins, C. H., Freund, Y., Leslie, C. BIOMED CENTRAL LTD. 2006

    Abstract

    We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem--predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the Adaboost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree. In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a more shallow and interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data. Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast--the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors--and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code are available for download from http://www.cs.columbia.edu/compbio/robust-geneclass.

    View details for DOI 10.1186/1471-2105-7-S1-S5

    View details for Web of Science ID 000236765200005

    View details for PubMedID 16723008

    View details for PubMedCentralID PMC1810316

  • Statistical analysis of MPSS measurements: Application to the study of LPS-activated macrophage gene expression PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Stolovitzky, G. A., Kundaje, A., Held, G. A., Duggar, K. H., Haudenschild, C. D., Zhou, D., Vasicek, T. J., Smith, K. D., Aderem, A., Roach, J. C. 2005; 102 (5): 1402-1407

    Abstract

    Massively Parallel Signature Sequencing (MPSS), a recently developed high-throughput transcription profiling technology, has the ability to profile almost every transcript in a sample without requiring prior knowledge of the sequence of the transcribed genes. As is the case with DNA microarrays, effective data analysis depends crucially on understanding how noise affects measurements. We analyze the sources of noise in MPSS and present a quantitative model describing the variability between replicate MPSS assays. We use this model to construct statistical hypotheses that test whether an observed change in gene expression in a pair-wise comparison is significant. This analysis is then extended to the determination of the significance of changes in expression levels measured over the course of a time series of measurements. We apply these analytic techniques to the study of a time series of MPSS gene expression measurements on LPS-stimulated macrophages. To evaluate our statistical significance metrics, we compare our results with published data on macrophage activation measured by using Affymetrix GeneChips.

    View details for DOI 10.1073/pnas.0406555102

    View details for Web of Science ID 000226877300029

    View details for PubMedID 15668391

    View details for PubMedCentralID PMC547838

  • Motif discovery through predictive modeling of gene regulation 9th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2005) Middendorf, M., Kundaje, A., Shah, M., Freund, Y., Wiggins, C. H., Leslie, C. SPRINGER-VERLAG BERLIN. 2005: 538–552
  • Predicting genetic regulatory response using classification: Yeast stress response 1st Annual RECOMB Satellite Workshop on Regulatory Genomics Middendorf, M., Kundaje, A., Wiggins, C., Freund, Y., Leslie, C. SPRINGER-VERLAG BERLIN. 2005: 1–13
  • Predicting genetic regulatory response using classification BIOINFORMATICS Middendorf, M., Kundaje, A., Wiggins, C., Freund, Y., Leslie, C. 2004; 20: 232-240
  • Support vector machine (SVM) classification of multifocal visual evoked potential responses (mfVEP) from Glaucoma patients. Baroumand, F., Kundaje, A. B., Zhang, Leslie, C., Hood, D. C. ASSOC RESEARCH VISION OPHTHALMOLOGY INC. 2004: U106
  • Spectrogram analysis of genomes EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING Sussillo, D., Kundaje, A., Anastassiou, D. 2004; 2004 (1): 29-42