Anshul Kundaje is an Assistant Professor of Genetics and Computer Science at Stanford University. His primary research area is large-scale computational regulatory genomics. The Kundaje lab specializes in developing statistical and machine learning methods for large-scale integrative analysis of heterogeneous, high-throughput functional genomic and genetic data to decipher regulatory elements and long-range regulatory interactions, learn predictive regulatory network models across individuals, cell-types and species, and improve detection and interpretation of natural and disease-associated genetic variation. Previously, as a postdoc at Stanford and a Research Scientist at MIT, Anshul was the lead computational analyst of the ENCODE Project and the Roadmap Epigenomics Project. Anshul is also a recipient of the 2016 NIH Director's New Innovator Award and the 2014 Alfred Sloan Fellowship.

Honors & Awards

  • HUGO Chen Award of Excellence, Human Genome Organization (2019)
  • NIH Director's New Innovator Award, NIH (2016)
  • Alfred Sloan Foundation Research Fellowship, Alfred Sloan Foundation (2014-2016)

Boards, Advisory Committees, Professional Organizations

  • Advisor, NIH Director's Advisory Committee for Artificial Intelligence in Biomedical Research (2019 - Present)

Current Research and Scholarly Interests

Our research focuses on the development of statistical and machine learning methods for integrative analysis of diverse functional genomic and genetic data to learn models of gene regulation. We have led the analysis efforts of the Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Projects with the development of novel methods for:
1. Adaptive thresholding and normalization of massive collections of functional genomic data (e.g. ChIP-seq and DNase-seq)
2. Dissecting combinatorial transcription factor co-occupancy within and across cell-types
3. Predicting cell-type specific enhancers from chromatin state profiles
4. Exploiting expression and chromatin co-dynamics to predict enhancer-target gene links
5. Jointly modeling sequence grammars at regulatory elements and their chromatin state dynamics, expression changes of regulators and functional interaction data to learn unified multi-scale gene regulation programs
6. Elucidating the heterogeneity of chromatin architecture at regulatory elements
7. Improving the detection and interpretation of potentially causal disease-associated variants from genome-wide association studies
More recently, we have also been developing methods to
1. Decipher the functional heterogeneity of transcription factor binding
2. Learn long-range, three-dimensional regulatory interactions
3. Infer causal regulatory mechanisms by integrating diverse functional genomic data from temporal (e.g. differentiation/reprogramming) and perturbation (e.g. drug response, knockdown, genome-editing) experiments
4. Model the complex relationships between genetic variation, regulatory chromatin variation and expression variation in healthy and diseased individuals
5. Build deep learning frameworks for genomics


  • The Encyclopedia of DNA Elements (ENCODE) Project, Stanford University, MIT

    The project generates a resource of cell-type specific genome-wide regulatory maps in the human genome. We develop statistical processing methods for next-gen sequencing based functional genomic data and machine learning methods to predict regulatory events and to learn the combinatorial regulatory effects of transcription factors and cell-type specific regulatory networks.


    Stanford, CA

  • The Roadmap Epigenomics Project, MIT (February 2012 - Present)

    The project generates genome-wide epigenomic maps in 200 human cell types. We develop computational methods and analyses to infer cell-type specific regulatory elements (e.g. enhancers) and their activity states, learn cell-type specific regulatory networks and use these maps to interpret GWAS and disease studies.


    Boston, MA

All Publications

  • The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nature genetics Kim, D. S., Risca, V. I., Reynolds, D. L., Chappell, J., Rubin, A. J., Jung, N., Donohue, L. K., Lopez-Pajares, V., Kathiria, A., Shi, M., Zhao, Z., Deep, H., Sharmin, M., Rao, D., Lin, S., Chang, H. Y., Snyder, M. P., Greenleaf, W. J., Kundaje, A., Khavari, P. A. 2021


    Transcription factors bind DNA sequence motif vocabularies in cis-regulatory elements (CREs) to modulate chromatin state and gene expression during cell state transitions. A quantitative understanding of how motif lexicons influence dynamic regulatory activity has been elusive due to the combinatorial nature of the cis-regulatory code. To address this, we undertook multiomic data profiling of chromatin and expression dynamics across epidermal differentiation to identify 40,103 dynamic CREs associated with 3,609 dynamically expressed genes, then applied an interpretable deep-learning framework to model the cis-regulatory logic of chromatin accessibility. This analysis framework identified cooperative DNA sequence rules in dynamic CREs regulating synchronous gene modules with diverse roles in skin differentiation. Massively parallel reporter assay analysis validated temporal dynamics and cooperative cis-regulatory logic. Variants linked to human polygenic skin disease were enriched in these time-dependent combinatorial motif rules. This integrative approach shows the combinatorial cis-regulatory lexicon of epidermal differentiation and represents a general framework for deciphering the organizational principles of the cis-regulatory code of dynamic gene regulation.

    View details for DOI 10.1038/s41588-021-00947-3

    View details for PubMedID 34650237

  • A genome-wide atlas of co-essential modules assigns function to uncharacterized genes. Nature genetics Wainberg, M., Kamber, R. A., Balsubramani, A., Meyers, R. M., Sinnott-Armstrong, N., Hornburg, D., Jiang, L., Chan, J., Jian, R., Gu, M., Shcherbina, A., Dubreuil, M. M., Spees, K., Meuleman, W., Snyder, M. P., Bassik, M. C., Kundaje, A. 2021


    A central question in the post-genomic era is how genes interact to form biological pathways. Measurements of gene dependency across hundreds of cell lines have been used to cluster genes into 'co-essential' pathways, but this approach has been limited by ubiquitous false positives. In the present study, we develop a statistical method that enables robust identification of gene co-essentiality and yields a genome-wide set of functional modules. This atlas recapitulates diverse pathways and protein complexes, and predicts the functions of 108 uncharacterized genes. Validating top predictions, we show that TMEM189 encodes plasmanylethanolamine desaturase, a key enzyme for plasmalogen synthesis. We also show that C15orf57 encodes a protein that binds the AP2 complex, localizes to clathrin-coated pits and enables efficient transferrin uptake. Finally, we provide an interactive webtool for the community to explore our results, which establish co-essentiality profiling as a powerful resource for biological pathway identification and discovery of new gene functions.

    View details for DOI 10.1038/s41588-021-00840-z

    View details for PubMedID 33859415

  • Base-resolution models of transcription-factor binding reveal soft motif syntax. Nature genetics Avsec, Ž., Weilert, M., Shrikumar, A., Krueger, S., Alexandari, A., Dalal, K., Fropf, R., McAnany, C., Gagneur, J., Kundaje, A., Zeitlinger, J. 2021


    The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.

    View details for DOI 10.1038/s41588-021-00782-6

    View details for PubMedID 33603233

  • Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer's and Parkinson's diseases. Nature genetics Corces, M. R., Shcherbina, A., Kundu, S., Gloudemans, M. J., Fresard, L., Granja, J. M., Louie, B. H., Eulalio, T., Shams, S., Bagdatli, S. T., Mumbach, M. R., Liu, B., Montine, K. S., Greenleaf, W. J., Kundaje, A., Montgomery, S. B., Chang, H. Y., Montine, T. J. 2020


    Genome-wide association studies of neurological diseases have identified thousands of variants associated with disease phenotypes. However, most of these variants do not alter coding sequences, making it difficult to assign their function. Here, we present a multi-omic epigenetic atlas of the adult human brain through profiling of single-cell chromatin accessibility landscapes and three-dimensional chromatin interactions of diverse adult brain regions across a cohort of cognitively healthy individuals. We developed a machine-learning classifier to integrate this multi-omic framework and predict dozens of functional SNPs for Alzheimer's and Parkinson's diseases, nominating target genes and cell types for previously orphaned loci from genome-wide association studies. Moreover, we dissected the complex inverted haplotype of the MAPT (encoding tau) Parkinson's disease risk locus, identifying putative ectopic regulatory interactions in neurons that may mediate this disease association. This work expands understanding of inherited variation and provides a roadmap for the epigenomic dissection of causal regulatory variation in disease.

    View details for DOI 10.1038/s41588-020-00721-x

    View details for PubMedID 33106633

  • Opportunities and challenges for transcriptome-wide association studies NATURE GENETICS Wainberg, M., Sinnott-Armstrong, N., Mancuso, N., Barbeira, A. N., Knowles, D. A., Golan, D., Ermel, R., Ruusalepp, A., Quertermous, T., Hao, K., Bjorkegren, J. M., Im, H., Pasaniuc, B., Rivas, M. A., Kundaje, A. 2019; 51 (4): 592–99
  • Opportunities and obstacles for deep learning in biology and medicine JOURNAL OF THE ROYAL SOCIETY INTERFACE Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., Ferrero, E., Agapow, P., Zietz, M., Hoffman, M. M., Xie, W., Rosen, G. L., Lengerich, B. J., Israeli, J., Lanchantin, J., Woloszynek, S., Carpenter, A. E., Shrikumar, A., Xu, J., Cofer, E. M., Lavender, C. A., Turaga, S. C., Alexandari, A. M., Lu, Z., Harris, D. J., DeCaprio, D., Qi, Y., Kundaje, A., Peng, Y., Wiley, L. K., Segler, M. S., Boca, S. M., Swamidass, S., Huang, A., Gitter, A., Greene, C. S. 2018; 15 (141)


    Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

    View details for PubMedID 29618526

    View details for PubMedCentralID PMC5938574

  • Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics (Oxford, England) Greenside, P., Shimko, T., Fordyce, P., Kundaje, A. 2018; 34 (17): i629–i637


    Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Code is available at: data are available at Bioinformatics online.

    View details for PubMedID 30423062

  • Denoising genome-wide histone ChIP-seq with convolutional neural networks BIOINFORMATICS Koh, P., Pierson, E., Kundaje, A. 2017; 33 (14): I225–I233


    Chromatin immune-precipitation sequencing (ChIP-seq) experiments are commonly used to obtain genome-wide profiles of histone modifications associated with different types of functional genomic elements. However, the quality of histone ChIP-seq data is affected by many experimental parameters such as the amount of input DNA, antibody specificity, ChIP enrichment and sequencing depth. Making accurate inferences from chromatin profiling experiments that involve diverse experimental parameters is challenging. We introduce a convolutional denoising algorithm, Coda, that uses convolutional neural networks to learn a mapping from suboptimal to high-quality histone ChIP-seq data. This overcomes various sources of noise and variability, substantially enhancing and recovering signal when applied to low-quality chromatin profiling datasets across individuals, cell types and species. Our method has the potential to improve data quality at reduced costs. More broadly, this approach (using a high-dimensional discriminative model to encode a generative noise process) is generally applicable to other biological domains where it is easy to generate noisy data but difficult to analytically characterize the noise or underlying data distribution.

    View details for PubMedID 28881977

  • Genetic Control of Chromatin States in Humans Involves Local and Distal Chromosomal Interactions CELL Grubert, F., Zaugg, J. B., Kasowski, M., Ursu, O., Spacek, D. V., Martin, A. R., Greenside, P., Srivas, R., Phanstiel, D. H., Pekowska, A., Heidari, N., Euskirchen, G., Huber, W., Pritchard, J. K., Bustamante, C. D., Steinmetz, L. M., Kundaje, A., Snyder, M. 2015; 162 (5): 1051-1065


    Deciphering the impact of genetic variants on gene regulation is fundamental to understanding human disease. Although gene regulation often involves long-range interactions, it is unknown to what extent non-coding genetic variants influence distal molecular phenotypes. Here, we integrate chromatin profiling for three histone marks in lymphoblastoid cell lines (LCLs) from 75 sequenced individuals with LCL-specific Hi-C and ChIA-PET-based chromatin contact maps to uncover one of the largest collections of local and distal histone quantitative trait loci (hQTLs). Distal QTLs are enriched within topologically associated domains and exhibit largely concordant variation of chromatin state coordinated by proximal and distal non-coding genetic variants. Histone QTLs are enriched for common variants associated with autoimmune diseases and enable identification of putative target genes of disease-associated variants from genome-wide association studies. These analyses provide insights into how genetic variation can affect human disease phenotypes by coordinated changes in chromatin at interacting regulatory elements.

    View details for DOI 10.1016/j.cell.2015.07.048

    View details for Web of Science ID 000360589900015

    View details for PubMedCentralID PMC4556133

  • Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer's disease. Nature Gjoneska, E., Pfenning, A. R., Mathys, H., Quon, G., Kundaje, A., Tsai, L., Kellis, M. 2015; 518 (7539): 365-369


    Alzheimer's disease (AD) is a severe age-related neurodegenerative disorder characterized by accumulation of amyloid-β plaques and neurofibrillary tangles, synaptic and neuronal loss, and cognitive decline. Several genes have been implicated in AD, but chromatin state alterations during neurodegeneration remain uncharacterized. Here we profile transcriptional and chromatin state dynamics across early and late pathology in the hippocampus of an inducible mouse model of AD-like neurodegeneration. We find a coordinated downregulation of synaptic plasticity genes and regulatory regions, and upregulation of immune response genes and regulatory regions, which are targeted by factors that belong to the ETS family of transcriptional regulators, including PU.1. Human regions orthologous to increasing-level enhancers show immune-cell-specific enhancer signatures as well as immune cell expression quantitative trait loci, while decreasing-level enhancer orthologues show fetal-brain-specific enhancer activity. Notably, AD-associated genetic variants are specifically enriched in increasing-level enhancer orthologues, implicating immune processes in AD predisposition. Indeed, increasing enhancers overlap known AD loci lacking protein-altering variants, and implicate additional loci that do not reach genome-wide significance. Our results reveal new insights into the mechanisms of neurodegeneration and establish the mouse as a useful model for functional studies of AD regulatory regions.

    View details for DOI 10.1038/nature14252

    View details for PubMedID 25693568

  • Integrative analysis of 111 reference human epigenomes. Nature Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi, A., Kheradpour, P., Zhang, Z., Wang, J., Ziller, M. J., Amin, V., Whitaker, J. W., Schultz, M. D., Ward, L. D., Sarkar, A., Quon, G., Sandstrom, R. S., Eaton, M. L., Wu, Y., Pfenning, A. R., Wang, X., Claussnitzer, M., Liu, Y., Coarfa, C., Harris, R. A., Shoresh, N., Epstein, C. B., Gjoneska, E., Leung, D., Xie, W., Hawkins, R. D., Lister, R., Hong, C., Gascard, P., Mungall, A. J., Moore, R., Chuah, E., Tam, A., Canfield, T. K., Hansen, R. S., Kaul, R., Sabo, P. J., Bansal, M. S., Carles, A., Dixon, J. R., Farh, K., Feizi, S., Karlic, R., Kim, A., Kulkarni, A., Li, D., Lowdon, R., Elliott, G., Mercer, T. R., Neph, S. J., Onuchic, V., Polak, P., Rajagopal, N., Ray, P., Sallari, R. C., Siebenthall, K. T., Sinnott-Armstrong, N. A., Stevens, M., Thurman, R. E., Wu, J., Zhang, B., Zhou, X., Beaudet, A. E., Boyer, L. A., De Jager, P. L., Farnham, P. J., Fisher, S. J., Haussler, D., Jones, S. J., Li, W., Marra, M. A., McManus, M. T., Sunyaev, S., Thomson, J. A., Tlsty, T. D., Tsai, L., Wang, W., Waterland, R. A., Zhang, M. Q., Chadwick, L. H., Bernstein, B. E., Costello, J. F., Ecker, J. R., Hirst, M., Meissner, A., Milosavljevic, A., Ren, B., Stamatoyannopoulos, J. A., Wang, T., Kellis, M. 2015; 518 (7539): 317-330


    The reference human genome sequence set the stage for studies of genetic variation and its association with human disease, but epigenomic studies lack a similar reference. To address this need, the NIH Roadmap Epigenomics Consortium generated the largest collection so far of human epigenomes for primary cells and tissues. Here we describe the integrative analysis of 111 reference human epigenomes generated as part of the programme, profiled for histone modification patterns, DNA accessibility, DNA methylation and RNA expression. We establish global maps of regulatory elements, define regulatory modules of coordinated activity, and their likely activators and repressors. We show that disease- and trait-associated genetic variants are enriched in tissue-specific epigenomic marks, revealing biologically relevant cell types for diverse human traits, and providing a resource for interpreting the molecular basis of human disease. Our results demonstrate the central role of epigenomic information for understanding gene regulation, cellular differentiation and human disease.

    View details for DOI 10.1038/nature14248

    View details for PubMedID 25693563

  • Architecture of the human regulatory network derived from ENCODE data NATURE Gerstein, M. B., Kundaje, A., Hariharan, M., Landt, S. G., Yan, K., Cheng, C., Mu, X. J., Khurana, E., Rozowsky, J., Alexander, R., Min, R., Alves, P., Abyzov, A., Addleman, N., Bhardwaj, N., Boyle, A. P., Cayting, P., Charos, A., Chen, D. Z., Cheng, Y., Clarke, D., Eastman, C., Euskirchen, G., Frietze, S., Fu, Y., Gertz, J., Grubert, F., Harmanci, A., Jain, P., Kasowski, M., Lacroute, P., Leng, J., Lian, J., Monahan, H., O'Geen, H., Ouyang, Z., Partridge, E. C., Patacsil, D., Pauli, F., Raha, D., Ramirez, L., Reddy, T. E., Reed, B., Shi, M., Slifer, T., Wang, J., Wu, L., Yang, X., Yip, K. Y., Zilberman-Schapira, G., Batzoglou, S., Sidow, A., Farnham, P. J., Myers, R. M., Weissman, S. M., Snyder, M. 2012; 489 (7414): 91-100


    Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.

    View details for DOI 10.1038/nature11245

    View details for PubMedID 22955619

  • An integrated encyclopedia of DNA elements in the human genome NATURE Dunham, I., Kundaje, A., Aldred, S. F., Collins, P. J., Davis, C., Doyle, F., Epstein, C. B., Frietze, S., Harrow, J., Kaul, R., Khatun, J., Lajoie, B. R., Landt, S. G., Lee, B., Pauli, F., Rosenbloom, K. R., Sabo, P., Safi, A., Sanyal, A., Shoresh, N., Simon, J. M., Song, L., Trinklein, N. D., Altshuler, R. C., Birney, E., Brown, J. B., Cheng, C., Djebali, S., Dong, X., Dunham, I., Ernst, J., Furey, T. S., Gerstein, M., Giardine, B., Greven, M., Hardison, R. C., Harris, R. S., Herrero, J., Hoffman, M. M., Iyer, S., Kellis, M., Khatun, J., Kheradpour, P., Kundaje, A., Lassmann, T., Li, Q., Lin, X., Marinov, G. K., Merkel, A., Mortazavi, A., Parker, S. C., Reddy, T. E., Rozowsky, J., Schlesinger, F., Thurman, R. E., Wang, J., Ward, L. D., Whitfield, T. W., Wilder, S. P., Wu, W., Xi, H. S., Yip, K. Y., Zhuang, J., Bernstein, B. E., Birney, E., Dunham, I., Green, E. D., Gunter, C., Snyder, M., Pazin, M. J., Lowdon, R. F., Dillon, L. A., Adams, L. B., Kelly, C. J., Zhang, J., Wexler, J. R., Green, E. D., Good, P. J., Feingold, E. A., Bernstein, B. E., Birney, E., Crawford, G. E., Dekker, J., Elnitski, L., Farnham, P. J., Gerstein, M., Giddings, M. C., Gingeras, T. R., Green, E. D., Guigo, R., Hardison, R. C., Hubbard, T. J., Kellis, M., Kent, W. J., Lieb, J. D., Margulies, E. H., Myers, R. M., Snyder, M., Stamatoyannopoulos, J. A., Tenenbaum, S. A., Weng, Z., White, K. P., Wold, B., Khatun, J., Yu, Y., Wrobel, J., Risk, B. A., Gunawardena, H. P., Kuiper, H. C., Maier, C. W., Xie, L., Chen, X., Giddings, M. C., Bernstein, B. E., Epstein, C. B., Shoresh, N., Ernst, J., Kheradpour, P., Mikkelsen, T. S., Gillespie, S., Goren, A., Ram, O., Zhang, X., Wang, L., Issner, R., Coyne, M. J., Durham, T., Ku, M., Truong, T., Ward, L. D., Altshuler, R. C., Eaton, M. L., Kellis, M., Djebali, S., Davis, C. 
A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A., Tanzer, A., Lagarde, J., Lin, W., Schlesinger, F., Xue, C., Marinov, G. K., Khatun, J., Williams, B. A., Zaleski, C., Rozowsky, J., Roeder, M., Kokocinski, F., Abdelhamid, R. F., Alioto, T., Antoshechkin, I., Baer, M. T., Batut, P., Bell, I., Bell, K., Chakrabortty, S., Chen, X., Chrast, J., Curado, J., Derrien, T., Drenkow, J., Dumais, E., Dumais, J., Duttagupta, R., Fastuca, M., Fejes-Toth, K., Ferreira, P., Foissac, S., Fullwood, M. J., Gao, H., Gonzalez, D., Gordon, A., Gunawardena, H. P., Howald, C., Jha, S., Johnson, R., Kapranov, P., King, B., Kingswood, C., Li, G., Luo, O. J., Park, E., Preall, J. B., Presaud, K., Ribeca, P., Risk, B. A., Robyr, D., Ruan, X., Sammeth, M., Sandhu, K. S., Schaeffer, L., See, L., Shahab, A., Skancke, J., Suzuki, A. M., Takahashi, H., Tilgner, H., Trout, D., Walters, N., Wang, H., Wrobel, J., Yu, Y., Hayashizaki, Y., Harrow, J., Gerstein, M., Hubbard, T. J., Reymond, A., Antonarakis, S. E., Hannon, G. J., Giddings, M. C., Ruan, Y., Wold, B., Carninci, P., Guigo, R., Gingeras, T. R., Rosenbloom, K. R., Sloan, C. A., Learned, K., Malladi, V. S., Wong, M. C., Barber, G., Cline, M. S., Dreszer, T. R., Heitner, S. G., Karolchik, D., Kent, W. J., Kirkup, V. M., Meyer, L. R., Long, J. C., Maddren, M., Raney, B. J., Furey, T. S., Song, L., Grasfeder, L. L., Giresi, P. G., Lee, B., Battenhouse, A., Sheffield, N. C., Simon, J. M., Showers, K. A., Safi, A., London, D., Bhinge, A. A., Shestak, C., Schaner, M. R., Kim, S. K., Zhang, Z. Z., Mieczkowski, P. A., Mieczkowska, J. O., Liu, Z., McDaniell, R. M., Ni, Y., Rashid, N. U., Kim, M. J., Adar, S., Zhang, Z., Wang, T., Winter, D., Keefe, D., Birney, E., Iyer, V. R., Lieb, J. D., Crawford, G. E., Li, G., Sandhu, K. S., Zheng, M., Wang, P., Luo, O. J., Shahab, A., Fullwood, M. J., Ruan, X., Ruan, Y., Myers, R. M., Pauli, F., Williams, B. A., Gertz, J., Marinov, G. K., Reddy, T. E., Vielmetter, J., Partridge, E. C., Trout, D., Varley, K. 
E., Gasper, C., Bansal, A., Pepke, S., Jain, P., Amrhein, H., Bowling, K. M., Anaya, M., Cross, M. K., King, B., Muratet, M. A., Antoshechkin, I., Newberry, K. M., McCue, K., Nesmith, A. S., Fisher-Aylor, K. I., Pusey, B., DeSalvo, G., Parker, S. L., Balasubramanian, S., Davis, N. S., Meadows, S. K., Eggleston, T., Gunter, C., Newberry, J. S., Levy, S. E., Absher, D. M., Mortazavi, A., Wong, W. H., Wold, B., Blow, M. J., Visel, A., Pennachio, L. A., Elnitski, L., Margulies, E. H., Parker, S. C., Petrykowska, H. M., Abyzov, A., Aken, B., Barrell, D., Barson, G., Berry, A., Bignell, A., Boychenko, V., Bussotti, G., Chrast, J., Davidson, C., Derrien, T., Despacio-Reyes, G., Diekhans, M., Ezkurdia, I., Frankish, A., Gilbert, J., Gonzalez, J. M., Griffiths, E., Harte, R., Hendrix, D. A., Howald, C., Hunt, T., Jungreis, I., Kay, M., Khurana, E., Kokocinski, F., Leng, J., Lin, M. F., Loveland, J., Lu, Z., Manthravadi, D., Mariotti, M., Mudge, J., Mukherjee, G., Notredame, C., Pei, B., Rodriguez, J. M., Saunders, G., Sboner, A., Searle, S., Sisu, C., Snow, C., Steward, C., Tanzer, A., Tapanari, E., Tress, M. L., van Baren, M. J., Walters, N., Washietl, S., Wilming, L., Zadissa, A., Zhang, Z., Brent, M., Haussler, D., Kellis, M., Valencia, A., Gerstein, M., Reymond, A., Guigo, R., Harrow, J., Hubbard, T. J., Landt, S. G., Frietze, S., Abyzov, A., Addleman, N., Alexander, R. P., Auerbach, R. K., Balasubramanian, S., Bettinger, K., Bhardwaj, N., Boyle, A. P., Cao, A. R., Cayting, P., Charos, A., Cheng, Y., Cheng, C., Eastman, C., Euskirchen, G., Fleming, J. D., Grubert, F., Habegger, L., Hariharan, M., Harmanci, A., Iyengar, S., Jin, V. X., Karczewski, K. J., Kasowski, M., Lacroute, P., Lam, H., Lamarre-Vincent, N., Leng, J., Lian, J., Lindahl-Allen, M., Min, R., Miotto, B., Monahan, H., Moqtaderi, Z., Mu, X. 
J., O'Geen, H., Ouyang, Z., Patacsil, D., Pei, B., Raha, D., Ramirez, L., Reed, B., Rozowsky, J., Sboner, A., Shi, M., Sisu, C., Slifer, T., Witt, H., Wu, L., Xu, X., Yan, K., Yang, X., Yip, K. Y., Zhang, Z., Struhl, K., Weissman, S. M., Gerstein, M., Farnham, P. J., Snyder, M., Tenenbaum, S. A., Penalva, L. O., Doyle, F., Karmakar, S., Landt, S. G., Bhanvadia, R. R., Choudhury, A., Domanus, M., Ma, L., Moran, J., Patacsil, D., Slifer, T., Victorsen, A., Yang, X., Snyder, M., White, K. P., Auer, T., Centanin, L., Eichenlaub, M., Gruhl, F., Heermann, S., Hoeckendorf, B., Inoue, D., Kellner, T., Kirchmaier, S., Mueller, C., Reinhardt, R., Schertel, L., Schneider, S., Sinn, R., Wittbrodt, B., Wittbrodt, J., Weng, Z., Whitfield, T. W., Wang, J., Collins, P. J., Aldred, S. F., Trinklein, N. D., Partridge, E. C., Myers, R. M., Dekker, J., Jain, G., Lajoie, B. R., Sanyal, A., Balasundaram, G., Bates, D. L., Byron, R., Canfield, T. K., Diegel, M. J., Dunn, D., Ebersol, A. K., Frum, T., Garg, K., Gist, E., Hansen, R. S., Boatman, L., Haugen, E., Humbert, R., Jain, G., Johnson, A. K., Johnson, E. M., Kutyavin, T. V., Lajoie, B. R., Lee, K., Lotakis, D., Maurano, M. T., Neph, S. J., Neri, F. V., Nguyen, E. D., Qu, H., Reynolds, A. P., Roach, V., Rynes, E., Sabo, P., Sanchez, M. E., Sandstrom, R. S., Sanyal, A., Shafer, A. O., Stergachis, A. B., Thomas, S., Thurman, R. E., Vernot, B., Vierstra, J., Vong, S., Wang, H., Weaver, M. A., Yan, Y., Zhang, M., Akey, J. M., Bender, M., Dorschner, M. O., Groudine, M., MacCoss, M. J., Navas, P., Stamatoyannopoulos, G., Kaul, R., Dekker, J., Stamatoyannopoulos, J. A., Dunham, I., Beal, K., Brazma, A., Flicek, P., Herrero, J., Johnson, N., Keefe, D., Lukk, M., Luscombe, N. M., Sobral, D., Vaquerizas, J. M., Wilder, S. P., Batzoglou, S., Sidow, A., Hussami, N., Kyriazopoulou-Panagiotopoulou, S., Libbrecht, M. W., Schaub, M. A., Kundaje, A., Hardison, R. C., Miller, W., Giardine, B., Harris, R. S., Wu, W., Bickel, P. 
J., Banfai, B., Boley, N. P., Brown, J. B., Huang, H., Li, Q., Li, J. J., Noble, W. S., Bilmes, J. A., Buske, O. J., Hoffman, M. M., Sahu, A. D., Kharchenko, P. V., Park, P. J., Baker, D., Taylor, J., Weng, Z., Iyer, S., Dong, X., Greven, M., Lin, X., Wang, J., Xi, H. S., Zhuang, J., Gerstein, M., Alexander, R. P., Balasubramanian, S., Cheng, C., Harmanci, A., Lochovsky, L., Min, R., Mu, X. J., Rozowsky, J., Yan, K., Yip, K. Y., Birney, E. 2012; 489 (7414): 57-74


    The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

    View details for DOI 10.1038/nature11247

    View details for Web of Science ID 000308347000039

    View details for PubMedID 22955616

    View details for PubMedCentralID PMC3439153

  • Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements GENOME RESEARCH Kundaje, A., Kyriazopoulou-Panagiotopoulou, S., Libbrecht, M., Smith, C. L., Raha, D., Winters, E. E., Johnson, S. M., Snyder, M., Batzoglou, S., Sidow, A. 2012; 22 (9): 1735-1747


    Gene regulation at functional elements (e.g., enhancers, promoters, insulators) is governed by an interplay of nucleosome remodeling, histone modifications, and transcription factor binding. To enhance our understanding of gene regulation, the ENCODE Consortium has generated a wealth of ChIP-seq data on DNA-binding proteins and histone modifications. We additionally generated nucleosome positioning data on two cell lines, K562 and GM12878, by MNase digestion and high-depth sequencing. Here we relate 14 chromatin signals (12 histone marks, DNase, and nucleosome positioning) to the binding sites of 119 DNA-binding proteins across a large number of cell lines. We developed a new method for unsupervised pattern discovery, the Clustered AGgregation Tool (CAGT), which accounts for the inherent heterogeneity in signal magnitude, shape, and implicit strand orientation of chromatin marks. We applied CAGT on a total of 5084 data set pairs to obtain an exhaustive catalog of high-resolution patterns of histone modifications and nucleosome positioning signals around bound transcription factors. Our analyses reveal extensive heterogeneity in how histone modifications are deposited, and how nucleosomes are positioned around binding sites. With the exception of the CTCF/cohesin complex, asymmetry of nucleosome positioning is predominant. Asymmetry of histone modifications is also widespread, for all types of chromatin marks examined, including promoter, enhancer, elongation, and repressive marks. The fine-resolution signal shapes discovered by CAGT unveiled novel correlation patterns between chromatin marks, nucleosome positioning, and sequence content. Meta-analyses of the signal profiles revealed a common vocabulary of chromatin signals shared across multiple cell lines and binding proteins.

    View details for DOI 10.1101/gr.136366.111

    View details for PubMedID 22955985

  • ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia GENOME RESEARCH Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B. E., Bickel, P., Brown, J. B., Cayting, P., Chen, Y., DeSalvo, G., Epstein, C., Fisher-Aylor, K. I., Euskirchen, G., Gerstein, M., Gertz, J., Hartemink, A. J., Hoffman, M. M., Iyer, V. R., Jung, Y. L., Karmakar, S., Kellis, M., Kharchenko, P. V., Li, Q., Liu, T., Liu, X. S., Ma, L., Milosavljevic, A., Myers, R. M., Park, P. J., Pazin, M. J., Perry, M. D., Raha, D., Reddy, T. E., Rozowsky, J., Shoresh, N., Sidow, A., Slattery, M., Stamatoyannopoulos, J. A., Tolstorukov, M. Y., White, K. P., Xi, S., Farnham, P. J., Lieb, J. D., Wold, B. J., Snyder, M. 2012; 22 (9): 1813-1831


    Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE and modENCODE portals.

    View details for DOI 10.1101/gr.136184.111

    View details for PubMedID 22955991

  • Linking disease associations with regulatory information in the human genome GENOME RESEARCH Schaub, M. A., Boyle, A. P., Kundaje, A., Batzoglou, S., Snyder, M. 2012; 22 (9): 1748-1759


    Genome-wide association studies have been successful in identifying single nucleotide polymorphisms (SNPs) associated with a large number of phenotypes. However, an associated SNP is likely part of a larger region of linkage disequilibrium. This makes it difficult to precisely identify the SNPs that have a biological link with the phenotype. We have systematically investigated the association of multiple types of ENCODE data with disease-associated SNPs and show that there is significant enrichment for functional SNPs among the currently identified associations. This enrichment is strongest when integrating multiple sources of functional information and when highest confidence disease-associated SNPs are used. We propose an approach that integrates multiple types of functional data generated by the ENCODE Consortium to help identify "functional SNPs" that may be associated with the disease phenotype. Our approach generates putative functional annotations for up to 80% of all previously reported associations. We show that for most associations, the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the reported association rather than the reported SNP itself. Our results show that the experimental data sets generated by the ENCODE Consortium can be successfully used to suggest functional hypotheses for variants associated with diseases and other phenotypes.

    View details for DOI 10.1101/gr.136127.111

    View details for PubMedID 22955986

  • A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast PLOS COMPUTATIONAL BIOLOGY Kundaje, A., Xin, X., Lan, C., Lianoglou, S., Zhou, M., Zhang, L., Leslie, C. 2008; 4 (11)


    Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included.

    View details for DOI 10.1371/journal.pcbi.1000224

    View details for Web of Science ID 000261480800016

    View details for PubMedID 19008939

    View details for PubMedCentralID PMC2573020

  • Combining sequence and time series expression data to learn transcriptional modules IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Kundaje, A., Middendorf, M., Gao, F., Wiggins, C., Leslie, C. 2005; 2 (3): 194-202


    Our goal is to cluster genes into transcriptional modules--sets of genes where similarity in expression is explained by common regulatory mechanisms at the transcriptional level. We want to learn modules from both time series gene expression data and genome-wide motif data that are now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules. Starting with a set of motifs representing known or putative regulatory elements (transcription factor binding sites) and the counts of occurrences of these motifs in each gene's promoter region, together with a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. We also present a technique based on the Jensen-Shannon entropy contributions of motifs in the learned model for associating the most significant motifs to each module. Thus, the algorithm gives a global approach for associating sets of regulatory elements to "modules" of genes with similar time series expression profiles. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course data sets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated to known cell cycle transcription factors.

    View details for Web of Science ID 000235704200003

    View details for PubMedID 17044183

  • AP-1 is a temporally regulated dual gatekeeper of reprogramming to pluripotency. Proceedings of the National Academy of Sciences of the United States of America Markov, G. J., Mai, T., Nair, S., Shcherbina, A., Wang, Y. X., Burns, D. M., Kundaje, A., Blau, H. M. 2021; 118 (23)


    Somatic cell transcription factors are critical to maintaining cellular identity and constitute a barrier to human somatic cell reprogramming; yet a comprehensive understanding of the mechanism of action is lacking. To gain insight, we examined epigenome remodeling at the onset of human nuclear reprogramming by profiling human fibroblasts after fusion with murine embryonic stem cells (ESCs). By assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) and chromatin immunoprecipitation sequencing we identified enrichment for the activator protein 1 (AP-1) transcription factor c-Jun at regions of early transient accessibility at fibroblast-specific enhancers. Expression of a dominant negative AP-1 mutant (dnAP-1) reduced accessibility and expression of fibroblast genes, overcoming the barrier to reprogramming. Remarkably, efficient reprogramming of human fibroblasts to induced pluripotent stem cells was achieved by transduction with vectors expressing SOX2, KLF4, and inducible dnAP-1, demonstrating that dnAP-1 can substitute for exogenous human OCT4. Mechanistically, we show that the AP-1 component c-Jun has two unexpected temporally distinct functions in human reprogramming: 1) to potentiate fibroblast enhancer accessibility and fibroblast-specific gene expression, and 2) to bind to and repress OCT4 as a complex with MBD3. Our findings highlight AP-1 as a previously unrecognized potent dual gatekeeper of the somatic cell state.

    View details for DOI 10.1073/pnas.2104841118

    View details for PubMedID 34088849

  • Transcription-dependent domain-scale three-dimensional genome organization in the dinoflagellate Breviolum minutum. Nature genetics Marinov, G. K., Trevino, A. E., Xiang, T., Kundaje, A., Grossman, A. R., Greenleaf, W. J. 2021


    Dinoflagellate chromosomes represent a unique evolutionary experiment, as they exist in a permanently condensed, liquid crystalline state; are not packaged by histones; and contain genes organized into tandem gene arrays, with minimal transcriptional regulation. We analyze the three-dimensional genome of Breviolum minutum, and find large topological domains (dinoflagellate topologically associating domains, which we term 'dinoTADs') without chromatin loops, which are demarcated by convergent gene array boundaries. Transcriptional inhibition disrupts dinoTADs, implicating transcription-induced supercoiling as the primary topological force in dinoflagellates.

    View details for DOI 10.1038/s41588-021-00848-5

    View details for PubMedID 33927397

  • Publisher Correction: MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome biology Cheng, J., Celik, M. H., Kundaje, A., Gagneur, J. 2021; 22 (1): 107

    View details for DOI 10.1186/s13059-021-02338-7

    View details for PubMedID 33858505

  • Genome-wide enhancer maps link risk variants to disease genes. Nature Nasser, J., Bergman, D. T., Fulco, C. P., Guckelberger, P., Doughty, B. R., Patwardhan, T. A., Jones, T. R., Nguyen, T. H., Ulirsch, J. C., Lekschas, F., Mualim, K., Natri, H. M., Weeks, E. M., Munson, G., Kane, M., Kang, H. Y., Cui, A., Ray, J. P., Eisenhaure, T. M., Collins, R. L., Dey, K., Pfister, H., Price, A. L., Epstein, C. B., Kundaje, A., Xavier, R. J., Daly, M. J., Huang, H., Finucane, H. K., Hacohen, N., Lander, E. S., Engreitz, J. M. 2021


    Genome-wide association studies (GWAS) have identified thousands of noncoding loci that are associated with human diseases and complex traits, each of which could reveal insights into the mechanisms of disease. Many of the underlying causal variants may affect enhancers, but we lack accurate maps of enhancers and their target genes to interpret such variants. We recently developed the activity-by-contact (ABC) model to predict which enhancers regulate which genes and validated the model using CRISPR perturbations in several cell types. Here we apply this ABC model to create enhancer-gene maps in 131 human cell types and tissues, and use these maps to interpret the functions of GWAS variants. Across 72 diseases and complex traits, ABC links 5,036 GWAS signals to 2,249 unique genes, including a class of 577 genes that appear to influence multiple phenotypes through variants in enhancers that act in different cell types. In inflammatory bowel disease (IBD), causal variants are enriched in predicted enhancers by more than 20-fold in particular cell types such as dendritic cells, and ABC achieves higher precision than other regulatory methods at connecting noncoding variants to target genes. These variant-to-function maps reveal an enhancer that contains an IBD risk variant and that regulates the expression of PPIF to alter the membrane potential of mitochondria in macrophages. Our study reveals principles of genome regulation, identifies genes that affect IBD and provides a resource and generalizable strategy to connect risk variants of common diseases to their molecular and cellular functions.

    View details for DOI 10.1038/s41586-021-03446-x

    View details for PubMedID 33828297

  • Genetic architectures of proximal and distal colorectal cancer are partly distinct. Gut Huyghe, J. R., Harrison, T. A., Bien, S. A., Hampel, H., Figueiredo, J. C., Schmit, S. L., Conti, D. V., Chen, S., Qu, C., Lin, Y., Barfield, R., Baron, J. A., Cross, A. J., Diergaarde, B., Duggan, D., Harlid, S., Imaz, L., Kang, H. M., Levine, D. M., Perduca, V., Perez-Cornago, A., Sakoda, L. C., Schumacher, F. R., Slattery, M. L., Toland, A. E., van Duijnhoven, F. J., Van Guelpen, B., Agudo, A., Albanes, D., Alonso, M. H., Anderson, K., Arnau-Collell, C., Arndt, V., Banbury, B. L., Bassik, M. C., Berndt, S. I., Bézieau, S., Bishop, D. T., Boehm, J., Boeing, H., Boutron-Ruault, M. C., Brenner, H., Brezina, S., Buch, S., Buchanan, D. D., Burnett-Hartman, A., Caan, B. J., Campbell, P. T., Carr, P. R., Castells, A., Castellví-Bel, S., Chan, A. T., Chang-Claude, J., Chanock, S. J., Curtis, K. R., de la Chapelle, A., Easton, D. F., English, D. R., Feskens, E. J., Gala, M., Gallinger, S. J., Gauderman, W. J., Giles, G. G., Goodman, P. J., Grady, W. M., Grove, J. S., Gsur, A., Gunter, M. J., Haile, R. W., Hampe, J., Hoffmeister, M., Hopper, J. L., Hsu, W. L., Huang, W. Y., Hudson, T. J., Jenab, M., Jenkins, M. A., Joshi, A. D., Keku, T. O., Kooperberg, C., Kühn, T., Küry, S., Le Marchand, L., Lejbkowicz, F., Li, C. I., Li, L., Lieb, W., Lindblom, A., Lindor, N. M., Männistö, S., Markowitz, S. D., Milne, R. L., Moreno, L., Murphy, N., Nassir, R., Offit, K., Ogino, S., Panico, S., Parfrey, P. S., Pearlman, R., Pharoah, P. D., Phipps, A. I., Platz, E. A., Potter, J. D., Prentice, R. L., Qi, L., Raskin, L., Rennert, G., Rennert, H. S., Riboli, E., Schafmayer, C., Schoen, R. E., Seminara, D., Song, M., Su, Y. R., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Trichopoulou, A., Ulrich, C. M., Visvanathan, K., Vodicka, P., Vodickova, L., Vymetalkova, V., Weigl, K., Weinstein, S. J., White, E., Wolk, A., Woods, M. O., Wu, A. H., Abecasis, G. R., Nickerson, D. A., Scacheri, P. C., Kundaje, A., Casey, G., Gruber, S. B., Hsu, L., Moreno, V., Hayes, R. B., Newcomb, P. A., Peters, U. 2021


    An understanding of the etiologic heterogeneity of colorectal cancer (CRC) is critical for improving precision prevention, including individualized screening recommendations and the discovery of novel drug targets and repurposable drug candidates for chemoprevention. Known differences in molecular characteristics and environmental risk factors among tumors arising in different locations of the colorectum suggest partly distinct mechanisms of carcinogenesis. The extent to which the contribution of inherited genetic risk factors for CRC differs by anatomical subsite of the primary tumor has not been examined. To identify new anatomical subsite-specific risk loci, we performed genome-wide association study (GWAS) meta-analyses including data of 48 214 CRC cases and 64 159 controls of European ancestry. We characterised effect heterogeneity at CRC risk loci using multinomial modelling. We identified 13 loci that reached genome-wide significance (p<5×10⁻⁸) and that were not reported by previous GWASs for overall CRC risk. Multiple lines of evidence support candidate genes at several of these loci. We detected substantial heterogeneity between anatomical subsites. Just over half (61) of 109 known and new risk variants showed no evidence for heterogeneity. In contrast, 22 variants showed association with distal CRC (including rectal cancer), but no evidence for association or an attenuated association with proximal CRC. For two loci, there was strong evidence for effects confined to proximal colon cancer. Genetic architectures of proximal and distal CRC are partly distinct. Studies of risk factors and mechanisms of carcinogenesis, and precision prevention strategies should take into consideration the anatomical subsite of the tumour.

    View details for DOI 10.1136/gutjnl-2020-321534

    View details for PubMedID 33632709

  • Chromatin and gene-regulatory dynamics of the developing human cerebral cortex at single-cell resolution. Cell Trevino, A. E., Müller, F., Andersen, J., Sundaram, L., Kathiria, A., Shcherbina, A., Farh, K., Chang, H. Y., Pașca, A. M., Kundaje, A., Pașca, S. P., Greenleaf, W. J. 2021


    Genetic perturbations of cortical development can lead to neurodevelopmental disease, including autism spectrum disorder (ASD). To identify genomic regions crucial to corticogenesis, we mapped the activity of gene-regulatory elements generating a single-cell atlas of gene expression and chromatin accessibility both independently and jointly. This revealed waves of gene regulation by key transcription factors (TFs) across a nearly continuous differentiation trajectory, distinguished the expression programs of glial lineages, and identified lineage-determining TFs that exhibited strong correlation between linked gene-regulatory elements and expression levels. These highly connected genes adopted an active chromatin state in early differentiating cells, consistent with lineage commitment. Base-pair-resolution neural network models identified strong cell-type-specific enrichment of noncoding mutations predicted to be disruptive in a cohort of ASD individuals and identified frequently disrupted TF binding sites. This approach illustrates how cell-type-specific mapping can provide insights into the programs governing human development and disease.

    View details for DOI 10.1016/j.cell.2021.07.039

    View details for PubMedID 34390642

  • WILDS: A Benchmark of in-the-Wild Distribution Shifts Koh, P., Sagawa, S., Marklund, H., Xie, S., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R., Gao, I., Lee, T., David, E., Stavness, I., Guo, W., Earnshaw, B. A., Haque, I. S., Beery, S., Leskovec, J., Kundaje, A., Pierson, E., Levine, S., Finn, C., Liang, P., Meila, M., Zhang, T. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2021
  • Learning cis-regulatory principles of ADAR-based RNA editing from CRISPR-mediated mutagenesis. Nature communications Liu, X., Sun, T., Shcherbina, A., Li, Q., Jarmoskaite, I., Kappel, K., Ramaswami, G., Das, R., Kundaje, A., Li, J. B. 2021; 12 (1): 2165


    Adenosine-to-inosine (A-to-I) RNA editing catalyzed by ADAR enzymes occurs in double-stranded RNAs. Despite a compelling need towards predictive understanding of natural and engineered editing events, how the RNA sequence and structure determine the editing efficiency and specificity (i.e., cis-regulation) is poorly understood. We apply a CRISPR/Cas9-mediated saturation mutagenesis approach to generate libraries of mutations near three natural editing substrates at their endogenous genomic loci. We use machine learning to integrate diverse RNA sequence and structure features to model editing levels measured by deep sequencing. We confirm known features and identify new features important for RNA editing. Training and testing XGBoost algorithm within the same substrate yield models that explain 68 to 86 percent of substrate-specific variation in editing levels. However, the models do not generalize across substrates, suggesting complex and context-dependent regulation patterns. Our integrative approach can be applied to larger scale experiments towards deciphering the RNA editing code.

    View details for DOI 10.1038/s41467-021-22489-2

    View details for PubMedID 33846332

  • MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome biology Cheng, J., Çelik, M. H., Kundaje, A., Gagneur, J. 2021; 22 (1): 94


    We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting tissue-specific variations associated with genetic variants in most tissues of the GTEx dataset, with largest improvements on brain tissues. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. We foresee that MTSplice will aid interpreting variants associated with tissue-specific disorders.

    View details for DOI 10.1186/s13059-021-02273-7

    View details for PubMedID 33789710

  • Genetic effects on transcriptome profiles in colon epithelium provide functional insights for genetic risk loci. Cellular and molecular gastroenterology and hepatology Díez-Obrero, V., Dampier, C. H., Moratalla-Navarro, F., Devall, M., Plummer, S. J., Díez-Villanueva, A., Peters, U., Bien, S., Huyghe, J. R., Kundaje, A., Ibáñez-Sanz, G., Guinó, E., Obón-Santacana, M., Carreras-Torres, R., Casey, G., Moreno, V. 2021


    The association of genetic variation with tissue-specific gene expression and alternative splicing guides functional characterization of complex trait associated loci and may suggest novel genes implicated in disease. Here, we aimed to 1) generate reference profiles of colon mucosa gene expression and alternative splicing and compare them across colon subsites (ascending, transverse and descending), 2) identify expression and splicing quantitative trait loci (QTLs), 3) find traits for which identified QTLs contribute to single nucleotide polymorphism (SNP)-based heritability, 4) propose candidate effector genes, and 5) provide a web-based visualization resource. We collected colonic mucosal biopsies from 485 healthy adults and performed bulk RNA sequencing (RNA-Seq). We performed genome-wide SNP genotyping from blood leukocytes. Statistical approaches and bioinformatics software were used for QTL identification and downstream analyses. We provided a complete quantification of gene expression and alternative splicing across colon subsites and described their differences. We identified thousands of expression and splicing QTLs and defined their enrichment at genome-wide regulatory regions. We found that part of the SNP-based heritability of diseases affecting colon tissue, such as colorectal cancer and inflammatory bowel disease, but also of diseases affecting other tissues, such as psychiatric conditions, can be explained by the identified QTLs. We provided candidate effector genes for multiple phenotypes. Finally, we provided the Colon Transcriptome Explorer (CoTrEx). We provided the largest characterization to date of gene expression and splicing across colon subsites. Our findings provide greater etiological insight into complex traits and diseases influenced by transcriptomic changes in colon tissue.

    View details for DOI 10.1016/j.jcmgh.2021.02.003

    View details for PubMedID 33601062

  • Transparency and reproducibility in artificial intelligence. Nature Haibe-Kains, B., Adam, G. A., Hosny, A., Khodakarami, F., Massive Analysis Quality Control (MAQC) Society Board of Directors, Waldron, L., Wang, B., McIntosh, C., Goldenberg, A., Kundaje, A., Greene, C. S., Broderick, T., Hoffman, M. M., Leek, J. T., Korthauer, K., Huber, W., Brazma, A., Pineau, J., Tibshirani, R., Hastie, T., Ioannidis, J. P., Quackenbush, J., Aerts, H. J., Shraddha, T., Kusko, R., Sansone, S., Tong, W., Wolfinger, R. D., Mason, C. E., Jones, W., Dopazo, J., Furlanello, C. 2020; 586 (7829): E14–E16

    View details for DOI 10.1038/s41586-020-2766-y

    View details for PubMedID 33057217

  • Perspectives on ENCODE. Nature ENCODE Project Consortium, Snyder, M. P., Gingeras, T. R., Moore, J. E., Weng, Z., Gerstein, M. B., Ren, B., Hardison, R. C., Stamatoyannopoulos, J. A., Graveley, B. R., Feingold, E. A., Pazin, M. J., Pagan, M., Gilchrist, D. A., Hitz, B. C., Cherry, J. M., Bernstein, B. E., Mendenhall, E. M., Zerbino, D. R., Frankish, A., Flicek, P., Myers, R. M., Abascal, F., Acosta, R., Addleman, N. J., Adrian, J., Afzal, V., Aken, B., Akiyama, J. A., Jammal, O. A., Amrhein, H., Anderson, S. M., Andrews, G. R., Antoshechkin, I., Ardlie, K. G., Armstrong, J., Astley, M., Banerjee, B., Barkal, A. A., Barnes, I. H., Barozzi, I., Barrell, D., Barson, G., Bates, D., Baymuradov, U. K., Bazile, C., Beer, M. A., Beik, S., Bender, M. A., Bennett, R., Bouvrette, L. P., Bernstein, B. E., Berry, A., Bhaskar, A., Bignell, A., Blue, S. M., Bodine, D. M., Boix, C., Boley, N., Borrman, T., Borsari, B., Boyle, A. P., Brandsmeier, L. A., Breschi, A., Bresnick, E. H., Brooks, J. A., Buckley, M., Burge, C. B., Byron, R., Cahill, E., Cai, L., Cao, L., Carty, M., Castanon, R. G., Castillo, A., Chaib, H., Chan, E. T., Chee, D. R., Chee, S., Chen, H., Chen, H., Chen, J., Chen, S., Cherry, J. M., Chhetri, S. B., Choudhary, J. S., Chrast, J., Chung, D., Clarke, D., Cody, N. A., Coppola, C. J., Coursen, J., D'Ippolito, A. M., Dalton, S., Danyko, C., Davidson, C., Davila-Velderrain, J., Davis, C. A., Dekker, J., Deran, A., DeSalvo, G., Despacio-Reyes, G., Dewey, C. N., Dickel, D. E., Diegel, M., Diekhans, M., Dileep, V., Ding, B., Djebali, S., Dobin, A., Dominguez, D., Donaldson, S., Drenkow, J., Dreszer, T. R., Drier, Y., Duff, M. O., Dunn, D., Eastman, C., Ecker, J. R., Edwards, M. D., El-Ali, N., Elhajjajy, S. I., Elkins, K., Emili, A., Epstein, C. B., Evans, R. C., Ezkurdia, I., Fan, K., Farnham, P. J., Farrell, N., Feingold, E. A., Ferreira, A., Fisher-Aylor, K., Fitzgerald, S., Flicek, P., Foo, C. 
S., Fortier, K., Frankish, A., Freese, P., Fu, S., Fu, X., Fu, Y., Fukuda-Yuzawa, Y., Fulciniti, M., Funnell, A. P., Gabdank, I., Galeev, T., Gao, M., Giron, C. G., Garvin, T. H., Gelboin-Burkhart, C. A., Georgolopoulos, G., Gerstein, M. B., Giardine, B. M., Gifford, D. K., Gilbert, D. M., Gilchrist, D. A., Gillespie, S., Gingeras, T. R., Gong, P., Gonzalez, A., Gonzalez, J. M., Good, P., Goren, A., Gorkin, D. U., Graveley, B. R., Gray, M., Greenblatt, J. F., Griffiths, E., Groudine, M. T., Grubert, F., Gu, M., Guigo, R., Guo, H., Guo, Y., Guo, Y., Gursoy, G., Gutierrez-Arcelus, M., Halow, J., Hardison, R. C., Hardy, M., Hariharan, M., Harmanci, A., Harrington, A., Harrow, J. L., Hashimoto, T. B., Hasz, R. D., Hatan, M., Haugen, E., Hayes, J. E., He, P., He, Y., Heidari, N., Hendrickson, D., Heuston, E. F., Hilton, J. A., Hitz, B. C., Hochman, A., Holgren, C., Hou, L., Hou, S., Hsiao, Y. E., Hsu, S., Huang, H., Hubbard, T. J., Huey, J., Hughes, T. R., Hunt, T., Ibarrientos, S., Issner, R., Iwata, M., Izuogu, O., Jaakkola, T., Jameel, N., Jansen, C., Jiang, L., Jiang, P., Johnson, A., Johnson, R., Jungreis, I., Kadaba, M., Kasowski, M., Kasparian, M., Kato, M., Kaul, R., Kawli, T., Kay, M., Keen, J. C., Keles, S., Keller, C. A., Kelley, D., Kellis, M., Kheradpour, P., Kim, D. S., Kirilusha, A., Klein, R. J., Knoechel, B., Kuan, S., Kulik, M. J., Kumar, S., Kundaje, A., Kutyavin, T., Lagarde, J., Lajoie, B. R., Lambert, N. J., Lazar, J., Lee, A. Y., Lee, D., Lee, E., Lee, J. W., Lee, K., Leslie, C. S., Levy, S., Li, B., Li, H., Li, N., Li, X., Li, Y. I., Li, Y., Li, Y., Li, Y., Lian, J., Libbrecht, M. W., Lin, S., Lin, Y., Liu, D., Liu, J., Liu, P., Liu, T., Liu, X. S., Liu, Y., Liu, Y., Long, M., Lou, S., Loveland, J., Lu, A., Lu, Y., Lecuyer, E., Ma, L., Mackiewicz, M., Mannion, B. J., Mannstadt, M., Manthravadi, D., Marinov, G. K., Martin, F. J., Mattei, E., McCue, K., McEown, M., McVicker, G., Meadows, S. K., Meissner, A., Mendenhall, E. M., Messer, C. 
L., Meuleman, W., Meyer, C., Miller, S., Milton, M. G., Mishra, T., Moore, D. E., Moore, H. M., Moore, J. E., Moore, S. H., Moran, J., Mortazavi, A., Mudge, J. M., Munshi, N., Murad, R., Myers, R. M., Nandakumar, V., Nandi, P., Narasimha, A. M., Narayanan, A. K., Naughton, H., Navarro, F. C., Navas, P., Nazarovs, J., Nelson, J., Neph, S., Neri, F. J., Nery, J. R., Nesmith, A. R., Newberry, J. S., Newberry, K. M., Ngo, V., Nguyen, R., Nguyen, T. B., Nguyen, T., Nishida, A., Noble, W. S., Novak, C. S., Novoa, E. M., Nunez, B., O'Donnell, C. W., Olson, S., Onate, K. C., Otterman, E., Ozadam, H., Pagan, M., Palden, T., Pan, X., Park, Y., Partridge, E. C., Paten, B., Pauli-Behn, F., Pazin, M. J., Pei, B., Pennacchio, L. A., Perez, A. R., Perry, E. H., Pervouchine, D. D., Phalke, N. N., Pham, Q., Phanstiel, D. H., Plajzer-Frick, I., Pratt, G. A., Pratt, H. E., Preissl, S., Pritchard, J. K., Pritykin, Y., Purcaro, M. J., Qin, Q., Quinones-Valdez, G., Rabano, I., Radovani, E., Raj, A., Rajagopal, N., Ram, O., Ramirez, L., Ramirez, R. N., Rausch, D., Raychaudhuri, S., Raymond, J., Razavi, R., Reddy, T. E., Reimonn, T. M., Ren, B., Reymond, A., Reynolds, A., Rhie, S. K., Rinn, J., Rivera, M., Rivera-Mulia, J. C., Roberts, B., Rodriguez, J. M., Rozowsky, J., Ryan, R., Rynes, E., Salins, D. N., Sandstrom, R., Sasaki, T., Sathe, S., Savic, D., Scavelli, A., Scheiman, J., Schlaffner, C., Schloss, J. A., Schmitges, F. W., See, L. H., Sethi, A., Setty, M., Shafer, A., Shan, S., Sharon, E., Shen, Q., Shen, Y., Sherwood, R. I., Shi, M., Shin, S., Shoresh, N., Siebenthall, K., Sisu, C., Slifer, T., Sloan, C. A., Smith, A., Snetkova, V., Snyder, M. P., Spacek, D. V., Srinivasan, S., Srivas, R., Stamatoyannopoulos, G., Stamatoyannopoulos, J. A., Stanton, R., Steffan, D., Stehling-Sun, S., Strattan, J. S., Su, A., Sundararaman, B., Suner, M., Syed, T., Szynkarek, M., Tanaka, F. Y., Tenen, D., Teng, M., Thomas, J. A., Toffey, D., Tress, M. L., Trout, D. 
E., Trynka, G., Tsuji, J., Upchurch, S. A., Ursu, O., Uszczynska-Ratajczak, B., Uziel, M. C., Valencia, A., Biber, B. V., van der Velde, A. G., Van Nostrand, E. L., Vaydylevich, Y., Vazquez, J., Victorsen, A., Vielmetter, J., Vierstra, J., Visel, A., Vlasova, A., Vockley, C. M., Volpi, S., Vong, S., Wang, H., Wang, M., Wang, Q., Wang, R., Wang, T., Wang, W., Wang, X., Wang, Y., Watson, N. K., Wei, X., Wei, Z., Weisser, H., Weissman, S. M., Welch, R., Welikson, R. E., Weng, Z., Westra, H., Whitaker, J. W., White, C., White, K. P., Wildberg, A., Williams, B. A., Wine, D., Witt, H. N., Wold, B., Wolf, M., Wright, J., Xiao, R., Xiao, X., Xu, J., Xu, J., Yan, K., Yan, Y., Yang, H., Yang, X., Yang, Y., Yardimci, G. G., Yee, B. A., Yeo, G. W., Young, T., Yu, T., Yue, F., Zaleski, C., Zang, C., Zeng, H., Zeng, W., Zerbino, D. R., Zhai, J., Zhan, L., Zhan, Y., Zhang, B., Zhang, J., Zhang, J., Zhang, K., Zhang, L., Zhang, P., Zhang, Q., Zhang, X., Zhang, Y., Zhang, Z., Zhao, Y., Zheng, Y., Zhong, G., Zhou, X., Zhu, Y., Zimmerman, J. 2020; 583 (7818): 693–98


    The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression1. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.

    View details for DOI 10.1038/s41586-020-2449-8

    View details for PubMedID 32728248

  • The Human Tumor Atlas Network: Charting Tumor Transitions across Space and Time at Single-Cell Resolution. Cell Rozenblatt-Rosen, O., Regev, A., Oberdoerffer, P., Nawy, T., Hupalowska, A., Rood, J. E., Ashenberg, O., Cerami, E., Coffey, R. J., Demir, E., Ding, L., Esplin, E. D., Ford, J. M., Goecks, J., Ghosh, S., Gray, J. W., Guinney, J., Hanlon, S. E., Hughes, S. K., Hwang, E. S., Iacobuzio-Donahue, C. A., Jane-Valbuena, J., Johnson, B. E., Lau, K. S., Lively, T., Mazzilli, S. A., Pe'er, D., Santagata, S., Shalek, A. K., Schapiro, D., Snyder, M. P., Sorger, P. K., Spira, A. E., Srivastava, S., Tan, K., West, R. B., Williams, E. H., Human Tumor Atlas Network, Aberle, D., Achilefu, S. I., Ademuyiwa, F. O., Adey, A. C., Aft, R. L., Agarwal, R., Aguilar, R. A., Alikarami, F., Allaj, V., Amos, C., Anders, R. A., Angelo, M. R., Anton, K., Ashenberg, O., Aster, J. C., Babur, O., Bahmani, A., Balsubramani, A., Barrett, D., Beane, J., Bender, D. E., Bernt, K., Berry, L., Betts, C. B., Bletz, J., Blise, K., Boire, A., Boland, G., Borowsky, A., Bosse, K., Bott, M., Boyden, E., Brooks, J., Bueno, R., Burlingame, E. A., Cai, Q., Campbell, J., Caravan, W., Cerami, E., Chaib, H., Chan, J. M., Chang, Y. H., Chatterjee, D., Chaudhary, O., Chen, A. A., Chen, B., Chen, C., Chen, C., Chen, F., Chen, Y., Chheda, M. G., Chin, K., Chiu, R., Chu, S., Chuaqui, R., Chun, J., Cisneros, L., Coffey, R. J., Colditz, G. A., Cole, K., Collins, N., Contrepois, K., Coussens, L. M., Creason, A. L., Crichton, D., Curtis, C., Davidsen, T., Davies, S. R., de Bruijn, I., Dellostritto, L., De Marzo, A., Demir, E., DeNardo, D. G., Diep, D., Ding, L., Diskin, S., Doan, X., Drewes, J., Dubinett, S., Dyer, M., Egger, J., Eng, J., Engelhardt, B., Erwin, G., Esplin, E. D., Esserman, L., Felmeister, A., Feiler, H. S., Fields, R. C., Fisher, S., Flaherty, K., Flournoy, J., Ford, J. M., Fortunato, A., Frangieh, A., Frye, J. L., Fulton, R. S., Galipeau, D., Gan, S., Gao, J., Gao, L., Gao, P., Gao, V. 
R., Geiger, T., George, A., Getz, G., Ghosh, S., Giannakis, M., Gibbs, D. L., Gillanders, W. E., Goecks, J., Goedegebuure, S. P., Gould, A., Gowers, K., Gray, J. W., Greenleaf, W., Gresham, J., Guerriero, J. L., Guha, T. K., Guimaraes, A. R., Guinney, J., Gutman, D., Hacohen, N., Hanlon, S., Hansen, C. R., Harismendy, O., Harris, K. A., Hata, A., Hayashi, A., Heiser, C., Helvie, K., Herndon, J. M., Hirst, G., Hodi, F., Hollmann, T., Horning, A., Hsieh, J. J., Hughes, S., Huh, W. J., Hunger, S., Hwang, S. E., Iacobuzio-Donahue, C. A., Ijaz, H., Izar, B., Jacobson, C. A., Janes, S., Jane-Valbuena, J., Jayasinghe, R. G., Jiang, L., Johnson, B. E., Johnson, B., Ju, T., Kadara, H., Kaestner, K., Kagan, J., Kalinke, L., Keith, R., Khan, A., Kibbe, W., Kim, A. H., Kim, E., Kim, J., Kolodzie, A., Kopytra, M., Kotler, E., Krueger, R., Krysan, K., Kundaje, A., Ladabaum, U., Lake, B. B., Lam, H., Laquindanum, R., Lau, K. S., Laughney, A. M., Lee, H., Lenburg, M., Leonard, C., Leshchiner, I., Levy, R., Li, J., Lian, C. G., Lim, K., Lin, J., Lin, Y., Liu, Q., Liu, R., Lively, T., Longabaugh, W. J., Longacre, T., Ma, C. X., Macedonia, M. C., Madison, T., Maher, C. A., Maitra, A., Makinen, N., Makowski, D., Maley, C., Maliga, Z., Mallo, D., Maris, J., Markham, N., Marks, J., Martinez, D., Mashl, R. J., Masilionais, I., Mason, J., Massague, J., Massion, P., Mattar, M., Mazurchuk, R., Mazutis, L., Mazzilli, S. A., McKinley, E. T., McMichael, J. F., Merrick, D., Meyerson, M., Miessner, J. R., Mills, G. B., Mills, M., Mondal, S. B., Mori, M., Mori, Y., Moses, E., Mosse, Y., Muhlich, J. L., Murphy, G. F., Navin, N. E., Nawy, T., Nederlof, M., Ness, R., Nevins, S., Nikolov, M., Nirmal, A. J., Nolan, G., Novikov, E., Oberdoerffer, P., O'Connell, B., Offin, M., Oh, S. T., Olson, A., Ooms, A., Ossandon, M., Owzar, K., Parmar, S., Patel, T., Patti, G. J., Pe'er, D., Pe'er, I., Peng, T., Persson, D., Petty, M., Pfister, H., Polyak, K., Pourfarhangi, K., Puram, S. 
V., Qiu, Q., Quintanal-Villalonga, A., Raj, A., Ramirez-Solano, M., Rashid, R., Reeb, A. N., Regev, A., Reid, M., Resnick, A., Reynolds, S. M., Riesterer, J. L., Rodig, S., Roland, J. T., Rosenfield, S., Rotem, A., Roy, S., Rozenblatt-Rosen, O., Rudin, C. M., Ryser, M. D., Santagata, S., Santi-Vicini, M., Sato, K., Schapiro, D., Schrag, D., Schultz, N., Sears, C. L., Sears, R. C., Sen, S., Sen, T., Shalek, A., Sheng, J., Sheng, Q., Shoghi, K. I., Shrubsole, M. J., Shyr, Y., Sibley, A. B., Siex, K., Simmons, A. J., Singer, D. S., Sivagnanam, S., Slyper, M., Snyder, M. P., Sokolov, A., Song, S., Sorger, P. K., Southard-Smith, A., Spira, A., Srivastava, S., Stein, J., Storm, P., Stover, E., Strand, S. H., Su, T., Sudar, D., Sullivan, R., Surrey, L., Suva, M., Tan, K., Terekhanova, N. V., Ternes, L., Thammavong, L., Thibault, G., Thomas, G. V., Thorsson, V., Todres, E., Tran, L., Tyler, M., Uzun, Y., Vachani, A., Van Allen, E., Vandekar, S., Veis, D. J., Vigneau, S., Vossough, A., Waanders, A., Wagle, N., Wang, L., Wendl, M. C., West, R., Williams, E. H., Wu, C., Wu, H., Wu, H., Wyczalkowski, M. A., Xie, Y., Yang, X., Yapp, C., Yu, W., Yuan, Y., Zhang, D., Zhang, K., Zhang, M., Zhang, N., Zhang, Y., Zhao, Y., Zhou, D. C., Zhou, Z., Zhu, H., Zhu, Q., Zhu, X., Zhu, Y., Zhuang, X. 2020; 181 (2): 236–49


    Crucial transitions in cancer, including tumor initiation, local expansion, metastasis, and therapeutic resistance, involve complex interactions between cells within the dynamic tumor ecosystem. Transformative single-cell genomics technologies and spatial multiplex in situ methods now provide an opportunity to interrogate this complexity at unprecedented resolution. The Human Tumor Atlas Network (HTAN), part of the National Cancer Institute (NCI) Cancer Moonshot Initiative, will establish a clinical, experimental, computational, and organizational framework to generate informative and accessible three-dimensional atlases of cancer transitions for a diverse set of tumor types. This effort complements both ongoing efforts to map healthy organs and previous large-scale cancer genomics approaches focused on bulk sequencing at a single point in time. Generating single-cell, multiparametric, longitudinal atlases and integrating them with clinical outcomes should help identify novel predictive biomarkers and features as well as therapeutically relevant cell types, cell states, and cellular interactions across transitions. The resulting tumor atlases should have a profound impact on our understanding of cancer biology and have the potential to improve cancer detection, prevention, and therapeutic discovery for better precision-medicine treatments of cancer patients and those at risk for cancer.

    View details for DOI 10.1016/j.cell.2020.03.053

    View details for PubMedID 32302568

  • Long-range single-molecule mapping of chromatin accessibility in eukaryotes. Nature methods Shipony, Z., Marinov, G. K., Swaffer, M. P., Sinnott-Armstrong, N. A., Skotheim, J. M., Kundaje, A., Greenleaf, W. J. 2020


    Mapping open chromatin regions has emerged as a widely used tool for identifying active regulatory elements in eukaryotes. However, existing approaches, limited by reliance on DNA fragmentation and short-read sequencing, cannot provide information about large-scale chromatin states or reveal coordination between the states of distal regulatory elements. We have developed a method for profiling the accessibility of individual chromatin fibers, a single-molecule long-read accessible chromatin mapping sequencing assay (SMAC-seq), enabling the simultaneous, high-resolution, single-molecule assessment of chromatin states at multikilobase length scales. Our strategy is based on combining the preferential methylation of open chromatin regions by DNA methyltransferases with low sequence specificity, in this case EcoGII, an N6-methyladenosine (m6A) methyltransferase, and the ability of nanopore sequencing to directly read DNA modifications. We demonstrate that aggregate SMAC-seq signals match bulk-level accessibility measurements, observe single-molecule nucleosome and transcription factor protection footprints, and quantify the correlation between chromatin states of distal genomic elements.

    View details for DOI 10.1038/s41592-019-0730-2

    View details for PubMedID 32042188

  • High-Throughput Discovery and Characterization of Human Transcriptional Effectors. Cell Tycko, J., DelRosso, N., Hess, G. T., Aradhana, Banerjee, A., Mukund, A., Van, M. V., Ego, B. K., Yao, D., Spees, K., Suzuki, P., Marinov, G. K., Kundaje, A., Bassik, M. C., Bintu, L. 2020


    Thousands of proteins localize to the nucleus; however, it remains unclear which contain transcriptional effectors. Here, we develop HT-recruit, a pooled assay where protein libraries are recruited to a reporter, and their transcriptional effects are measured by sequencing. Using this approach, we measure gene silencing and activation for thousands of domains. We find a relationship between repressor function and evolutionary age for the KRAB domains, discover that Homeodomain repressor strength is collinear with Hox genetic organization, and identify activities for several domains of unknown function. Deep mutational scanning of the CRISPRi KRAB maps the co-repressor binding surface and identifies substitutions that improve stability/silencing. By tiling 238 proteins, we find repressors as short as ten amino acids. Finally, we report new activator domains, including a divergent KRAB. These results provide a resource of 600 human proteins containing effectors and demonstrate a scalable strategy for assigning functions to protein domains.

    View details for DOI 10.1016/j.cell.2020.11.024

    View details for PubMedID 33326746

  • Maximum Likelihood with Bias-Corrected Calibration is Hard-To-Beat at Label Shift Adaptation Alexandari, A. M., Kundaje, A., Shrikumar, A., Daume, H., Singh, A. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2020
  • CRISPR screens in cancer spheroids identify 3D growth-specific vulnerabilities. Nature Han, K., Pierce, S. E., Li, A., Spees, K., Anderson, G. R., Seoane, J. A., Lo, Y. H., Dubreuil, M., Olivas, M., Kamber, R. A., Wainberg, M., Kostyrko, K., Kelly, M. R., Yousefi, M., Simpkins, S. W., Yao, D., Lee, K., Kuo, C. J., Jackson, P. K., Sweet-Cordero, A., Kundaje, A., Gentles, A. J., Curtis, C., Winslow, M. M., Bassik, M. C. 2020; 580 (7801): 136–41


    Cancer genomics studies have identified thousands of putative cancer driver genes1. Development of high-throughput and accurate models to define the functions of these genes is a major challenge. Here we devised a scalable cancer-spheroid model and performed genome-wide CRISPR screens in 2D monolayers and 3D lung-cancer spheroids. CRISPR phenotypes in 3D more accurately recapitulated those of in vivo tumours, and genes with differential sensitivities between 2D and 3D conditions were highly enriched for genes that are mutated in lung cancers. These analyses also revealed drivers that are essential for cancer growth in 3D and in vivo, but not in 2D. Notably, we found that carboxypeptidase D is responsible for removal of a C-terminal RKRR motif2 from the α-chain of the insulin-like growth factor 1 receptor that is critical for receptor activity. Carboxypeptidase D expression correlates with patient outcomes in patients with lung cancer, and loss of carboxypeptidase D reduced tumour growth. Our results reveal key differences between 2D and 3D cancer models, and establish a generalizable strategy for performing CRISPR screens in spheroids to reveal cancer vulnerabilities.

    View details for DOI 10.1038/s41586-020-2099-x

    View details for PubMedID 32238925

  • Landscape of cohesin-mediated chromatin loops in the human genome. Nature Grubert, F., Srivas, R., Spacek, D. V., Kasowski, M., Ruiz-Velasco, M., Sinnott-Armstrong, N., Greenside, P., Narasimha, A., Liu, Q., Geller, B., Sanghi, A., Kulik, M., Sa, S., Rabinovitch, M., Kundaje, A., Dalton, S., Zaugg, J. B., Snyder, M. 2020; 583 (7818): 737–43


    Physical interactions between distal regulatory elements have a key role in regulating gene expression, but the extent to which these interactions vary between cell types and contribute to cell-type-specific gene expression remains unclear. Here, to address these questions as part of phase III of the Encyclopedia of DNA Elements (ENCODE), we mapped cohesin-mediated chromatin loops, using chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), and analysed gene expression in 24 diverse human cell types, including core ENCODE cell lines. Twenty-eight per cent of all chromatin loops vary across cell types; these variations modestly correlate with changes in gene expression and are effective at grouping cell types according to their tissue of origin. The connectivity of genes corresponds to different functional classes, with housekeeping genes having few contacts, and dosage-sensitive genes being more connected to enhancer elements. This atlas of chromatin loops complements the diverse maps of regulatory architecture that comprise the ENCODE Encyclopedia, and will help to support emerging analyses of genome structure and function.

    View details for DOI 10.1038/s41586-020-2151-x

    View details for PubMedID 32728247

  • Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature Moore, J. E., Purcaro, M. J., Pratt, H. E., Epstein, C. B., Shoresh, N., Adrian, J., Kawli, T., Davis, C. A., Dobin, A., Kaul, R., Halow, J., Van Nostrand, E. L., Freese, P., Gorkin, D. U., Shen, Y., He, Y., Mackiewicz, M., Pauli-Behn, F., Williams, B. A., Mortazavi, A., Keller, C. A., Zhang, X. O., Elhajjajy, S. I., Huey, J., Dickel, D. E., Snetkova, V., Wei, X., Wang, X., Rivera-Mulia, J. C., Rozowsky, J., Zhang, J., Chhetri, S. B., Zhang, J., Victorsen, A., White, K. P., Visel, A., Yeo, G. W., Burge, C. B., Lécuyer, E., Gilbert, D. M., Dekker, J., Rinn, J., Mendenhall, E. M., Ecker, J. R., Kellis, M., Klein, R. J., Noble, W. S., Kundaje, A., Guigó, R., Farnham, P. J., Cherry, J. M., Myers, R. M., Ren, B., Graveley, B. R., Gerstein, M. B., Pennacchio, L. A., Snyder, M. P., Bernstein, B. E., Wold, B., Hardison, R. C., Gingeras, T. R., Stamatoyannopoulos, J. A., Weng, Z. 2020; 583 (7818): 699–710


    The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal, including phase II ENCODE1 and Roadmap Epigenomics2 data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

    View details for DOI 10.1038/s41586-020-2493-4

    View details for PubMedID 32728249

  • Homogeneity in the association of body mass index with type 2 diabetes across the UK Biobank: A Mendelian randomization study. PLoS medicine Wainberg, M., Mahajan, A., Kundaje, A., McCarthy, M. I., Ingelsson, E., Sinnott-Armstrong, N., Rivas, M. A. 2019; 16 (12): e1002982


    BACKGROUND: Lifestyle interventions to reduce body mass index (BMI) are critical public health strategies for type 2 diabetes prevention. While weight loss interventions have shown demonstrable benefit for high-risk and prediabetic individuals, we aimed to determine whether the same benefits apply to those at lower risk. METHODS AND FINDINGS: We performed a multi-stratum Mendelian randomization study of the effect size of BMI on diabetes odds in 287,394 unrelated individuals of self-reported white British ancestry in the UK Biobank, who were recruited from across the United Kingdom from 2006 to 2010 when they were between the ages of 40 and 69 years. Individuals were stratified on the following diabetes risk factors: BMI, diabetes family history, and genome-wide diabetes polygenic risk score. The main outcome measure was the odds ratio of diabetes per 1-kg/m² BMI reduction, in the full cohort and in each stratum. Diabetes prevalence increased sharply with BMI, family history of diabetes, and genetic risk. Conversely, predicted risk reduction from weight loss was strikingly similar across BMI and genetic risk categories. Weight loss was predicted to substantially reduce diabetes odds even among lower-risk individuals: for instance, a 1-kg/m² BMI reduction was associated with a 1.37-fold reduction (95% CI 1.12-1.68) in diabetes odds among non-overweight individuals (BMI < 25 kg/m²) without a family history of diabetes, similar to that in obese individuals (BMI ≥ 30 kg/m²) with a family history (1.21-fold reduction, 95% CI 1.13-1.29). A key limitation of this analysis is that the BMI-altering DNA sequence polymorphisms it studies represent cumulative predisposition over an individual's entire lifetime, and may consequently incorrectly estimate the risk modification potential of weight loss interventions later in life. CONCLUSIONS: In a population-scale cohort, lower BMI was consistently associated with reduced diabetes risk across BMI, family history, and genetic risk categories, suggesting all individuals can substantially reduce their diabetes risk through weight loss. Our results support the broad deployment of weight loss interventions to individuals at all levels of diabetes risk.

    View details for DOI 10.1371/journal.pmed.1002982

    View details for PubMedID 31821322

  • Network modelling of topological domains using Hi-C data. Annals of Applied Statistics Wang, Y., Sarkar, P., Ursu, O., Kundaje, A., Bickel, P. J. 2019; 13 (3): 1511–36
  • Matrix stiffness induces a tumorigenic phenotype in mammary epithelium through changes in chromatin accessibility. Nature biomedical engineering Stowers, R. S., Shcherbina, A., Israeli, J., Gruber, J. J., Chang, J., Nam, S., Rabiee, A., Teruel, M. N., Snyder, M. P., Kundaje, A., Chaudhuri, O. 2019


    In breast cancer, the increased stiffness of the extracellular matrix is a key driver of malignancy. Yet little is known about the epigenomic changes that underlie the tumorigenic impact of extracellular matrix mechanics. Here, we show in a three-dimensional culture model of breast cancer that stiff extracellular matrix induces a tumorigenic phenotype through changes in chromatin state. We found that increased stiffness yielded cells with more wrinkled nuclei and with increased lamina-associated chromatin, that cells cultured in stiff matrices displayed more accessible chromatin sites, which exhibited footprints of Sp1 binding, and that this transcription factor acts along with the histone deacetylases 3 and 8 to regulate the induction of stiffness-mediated tumorigenicity. Just as cell culture on soft environments or in them rather than on tissue-culture plastic better recapitulates the acinar morphology observed in mammary epithelium in vivo, mammary epithelial cells cultured on soft microenvironments or in them also more closely replicate the in vivo chromatin state. Our results emphasize the importance of culture conditions for epigenomic studies, and reveal that chromatin state is a critical mediator of mechanotransduction.

    View details for DOI 10.1038/s41551-019-0420-5

    View details for PubMedID 31285581

  • Predicting gene expression from plasma cell-free DNA using both the fragment length and fragment position St John, J. A., Gafni, E., White, B., Kannan, A., Hansen, L., Jaroszewicz, A., Kundaje, A., Boley, N. AMER ASSOC CANCER RESEARCH. 2019
  • The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Scientific reports Amemiya, H. M., Kundaje, A., Boyle, A. P. 2019; 9 (1): 9354


    Functional genomics assays based on high-throughput sequencing greatly expand our ability to understand the genome. Here, we define the ENCODE blacklist: a comprehensive set of regions in the human, mouse, worm, and fly genomes that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The removal of the ENCODE blacklist is an essential quality measure when analyzing functional genomics data.

    View details for DOI 10.1038/s41598-019-45839-z

    View details for PubMedID 31249361

  • The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nature biotechnology Avsec, Z., Kreuzhuber, R., Israeli, J., Xu, N., Cheng, J., Shrikumar, A., Banerjee, A., Kim, D. S., Beier, T., Urban, L., Kundaje, A., Stegle, O., Gagneur, J. 2019

    View details for DOI 10.1038/s41587-019-0140-0

    View details for PubMedID 31138913

  • Cell cycle dynamics of human pluripotent stem cells primed for differentiation. Stem cells (Dayton, Ohio) Shcherbina, A., Li, J., Narayanan, C., Greenleaf, W., Kundaje, A., Chetty, S. 2019


    Understanding the molecular properties of the cell cycle of human pluripotent stem cells (hPSCs) is critical for effectively promoting differentiation. Here, we use the Fluorescence Ubiquitin Cell Cycle Indicator (FUCCI) system adapted into hPSCs and perform RNA-sequencing on cell cycle sorted hPSCs primed and unprimed for differentiation. Gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner in cells primed for differentiation without altering genes associated with pluripotency. Furthermore, we identify an important role for PI3K signaling in regulating the early transitory states of hPSCs toward differentiation. SIGNIFICANCE STATEMENT: Generating differentiated cell types from human pluripotent stem cells (hPSCs) holds great therapeutic promise, but has proven to be challenging in practice. The cell cycle may play an important role in enhancing the differentiation potential of hPSCs. Here, the authors track and isolate hPSCs from different phases of the cell cycle and perform RNA-sequencing. The data show that gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner as hPSCs transition toward differentiation and highlight an important role for PI3K signaling in regulating these early transitory states. © AlphaMed Press 2019.

    View details for DOI 10.1002/stem.3041

    View details for PubMedID 31135093

  • Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses. Genome Research Benayoun, B. A., Pollina, E. A., Singh, P., Mahmoudi, S., Harel, I., Casey, K. M., Dulken, B. W., Kundaje, A., Brunet, A. 2019; 29 (4): 697–709
  • Measuring the reproducibility and quality of Hi-C data. Genome biology Yardimci, G. G., Ozadam, H., Sauria, M. E., Ursu, O., Yan, K., Yang, T., Chakraborty, A., Kaul, A., Lajoie, B. R., Song, F., Zhan, Y., Ay, F., Gerstein, M., Kundaje, A., Li, Q., Taylor, J., Yue, F., Dekker, J., Noble, W. S. 2019; 20 (1): 57


    BACKGROUND: Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. RESULTS: Using real and simulated data, we profile the performance of several recently proposed methods for assessing reproducibility of population Hi-C data, including HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep. By explicitly controlling noise and sparsity through simulations, we demonstrate the deficiencies of performing simple correlation analysis on pairs of matrices, and we show that methods developed specifically for Hi-C data produce better measures of reproducibility. We also show how to use established measures, such as the ratio of intra- to interchromosomal interactions, and novel ones, such as QuASAR-QC, to identify low-quality experiments. CONCLUSIONS: In this work, we assess reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices. Through this extensive validation and benchmarking of Hi-C data, we describe best practices for reproducibility and quality assessment of Hi-C experiments. We make all software publicly available to facilitate adoption in the community.

    View details for PubMedID 30890172

  • mtDNA Chromatin-like Organization Is Gradually Established during Mammalian Embryogenesis. iScience Marom, S., Blumberg, A., Kundaje, A., Mishmar, D. 2019; 12: 141–51


    Unlike the nuclear genome, the mammalian mitochondrial genome (mtDNA) is thought to be coated solely by mitochondrial transcription factor A (TFAM), whose binding sequence preferences are debated. Therefore, higher-order mtDNA organization is considered much less regulated than both the bacterial nucleoid and the nuclear chromatin. However, our recently identified conserved DNase footprinting pattern in human mtDNA, which co-localizes with regulatory elements and responds to physiological conditions, likely reflects a structured higher-order mtDNA organization. We hypothesized that this pattern emerges during embryogenesis. To test this hypothesis, we analyzed assay for transposase-accessible chromatin sequencing (ATAC-seq) results collected during the course of mouse and human early embryogenesis. Our results reveal, for the first time, a gradual and dynamic emergence of the adult mtDNA footprinting pattern during embryogenesis of both mammals. Taken together, our findings suggest that the structured adult chromatin-like mtDNA organization is gradually formed during mammalian embryogenesis.

    View details for PubMedID 30684873

  • Mitigation of off-target toxicity in CRISPR-Cas9 screens for essential non-coding elements. Nature communications Tycko, J., Wainberg, M., Marinov, G. K., Ursu, O., Hess, G. T., Ego, B. K., Aradhana, Li, A., Truong, A., Trevino, A. E., Spees, K., Yao, D., Kaplow, I. M., Greenside, P. G., Morgens, D. W., Phanstiel, D. H., Snyder, M. P., Bintu, L., Greenleaf, W. J., Kundaje, A., Bassik, M. C. 2019; 10 (1): 4063


    Pooled CRISPR-Cas9 screens are a powerful method for functionally characterizing regulatory elements in the non-coding genome, but off-target effects in these experiments have not been systematically evaluated. Here, we investigate Cas9, dCas9, and CRISPRi/a off-target activity in screens for essential regulatory elements. The sgRNAs with the largest effects in genome-scale screens for essential CTCF loop anchors in K562 cells were not single guide RNAs (sgRNAs) that disrupted gene expression near the on-target CTCF anchor. Rather, these sgRNAs had high off-target activity that, while only weakly correlated with absolute off-target site number, could be predicted by the recently developed GuideScan specificity score. Screens conducted in parallel with CRISPRi/a, which do not induce double-stranded DNA breaks, revealed that a distinct set of off-targets also cause strong confounding fitness effects with these epigenome-editing tools. Promisingly, filtering of CRISPRi libraries using GuideScan specificity scores removed these confounded sgRNAs and enabled identification of essential regulatory elements.

    View details for DOI 10.1038/s41467-019-11955-7

    View details for PubMedID 31492858

  • Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics (Oxford, England) Nair, S., Kim, D. S., Perricone, J., Kundaje, A. 2019; 35 (14): i108–i116


    Genome-wide profiles of chromatin accessibility and gene expression in diverse cellular contexts are critical to decipher the dynamics of transcriptional regulation. Recently, convolutional neural networks have been used to learn predictive cis-regulatory DNA sequence models of context-specific chromatin accessibility landscapes. However, these context-specific regulatory sequence models cannot generalize predictions across cell types. We introduce multi-modal, residual neural network architectures that integrate cis-regulatory sequence and context-specific expression of trans-regulators to predict genome-wide chromatin accessibility profiles across cellular contexts. We show that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor. We leverage this feature and employ novel strategies for training models to enhance genome-wide prediction of shared and context-specific chromatin accessible sites across cell types. We interpret the models to reveal insights into cis- and trans-regulation of chromatin dynamics across 123 diverse cellular contexts. The code is available online. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btz352

    View details for PubMedID 31510655

  • GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs. Bioinformatics (Oxford, England) Shrikumar, A., Prakash, E., Kundaje, A. 2019; 35 (14): i173–i182


    Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM) or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, non-redundant TF motifs. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell lines. Code and example notebooks to reproduce the results are available online. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btz322

    View details for PubMedID 31510661

  • Network modelling of topological domains using Hi-C data. The annals of applied statistics Wang, Y. X., Sarkar, P., Ursu, O., Kundaje, A., Bickel, P. J. 2019; 13 (3): 1511–36


    Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted, whereas the nodes in this case, that is, the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this nonexchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate the domains identified have desirable epigenetic features and compare them across different cell types.

    View details for DOI 10.1214/19-aoas1244

    View details for PubMedID 32968472

    View details for PubMedCentralID PMC7508461

  • Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PloS one Movva, R., Greenside, P., Marinov, G. K., Nair, S., Shrikumar, A., Kundaje, A. 2019; 14 (6): e0218073


    The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.

    View details for DOI 10.1371/journal.pone.0218073

    View details for PubMedID 31206543

  • Discovery of common and rare genetic risk variants for colorectal cancer. Nature genetics Huyghe, J. R., Bien, S. A., Harrison, T. A., Kang, H. M., Chen, S., Schmit, S. L., Conti, D. V., Qu, C., Jeon, J., Edlund, C. K., Greenside, P., Wainberg, M., Schumacher, F. R., Smith, J. D., Levine, D. M., Nelson, S. C., Sinnott-Armstrong, N. A., Albanes, D., Alonso, M. H., Anderson, K., Arnau-Collell, C., Arndt, V., Bamia, C., Banbury, B. L., Baron, J. A., Berndt, S. I., Bezieau, S., Bishop, D. T., Boehm, J., Boeing, H., Brenner, H., Brezina, S., Buch, S., Buchanan, D. D., Burnett-Hartman, A., Butterbach, K., Caan, B. J., Campbell, P. T., Carlson, C. S., Castellvi-Bel, S., Chan, A. T., Chang-Claude, J., Chanock, S. J., Chirlaque, M., Cho, S. H., Connolly, C. M., Cross, A. J., Cuk, K., Curtis, K. R., de la Chapelle, A., Doheny, K. F., Duggan, D., Easton, D. F., Elias, S. G., Elliott, F., English, D. R., Feskens, E. J., Figueiredo, J. C., Fischer, R., FitzGerald, L. M., Forman, D., Gala, M., Gallinger, S., Gauderman, W. J., Giles, G. G., Gillanders, E., Gong, J., Goodman, P. J., Grady, W. M., Grove, J. S., Gsur, A., Gunter, M. J., Haile, R. W., Hampe, J., Hampel, H., Harlid, S., Hayes, R. B., Hofer, P., Hoffmeister, M., Hopper, J. L., Hsu, W., Huang, W., Hudson, T. J., Hunter, D. J., Ibanez-Sanz, G., Idos, G. E., Ingersoll, R., Jackson, R. D., Jacobs, E. J., Jenkins, M. A., Joshi, A. D., Joshu, C. E., Keku, T. O., Key, T. J., Kim, H. R., Kobayashi, E., Kolonel, L. N., Kooperberg, C., Kuhn, T., Kury, S., Kweon, S., Larsson, S. C., Laurie, C. A., Le Marchand, L., Leal, S. M., Lee, S. C., Lejbkowicz, F., Lemire, M., Li, C. I., Li, L., Lieb, W., Lin, Y., Lindblom, A., Lindor, N. M., Ling, H., Louie, T. L., Mannisto, S., Markowitz, S. D., Martin, V., Masala, G., McNeil, C. E., Melas, M., Milne, R. L., Moreno, L., Murphy, N., Myte, R., Naccarati, A., Newcomb, P. A., Offit, K., Ogino, S., Onland-Moret, N. C., Pardini, B., Parfrey, P. S., Pearlman, R., Perduca, V., Pharoah, P. D., Pinchev, M., Platz, E. A., Prentice, R. L., Pugh, E., Raskin, L., Rennert, G., Rennert, H. S., Riboli, E., Rodriguez-Barranco, M., Romm, J., Sakoda, L. C., Schafmayer, C., Schoen, R. E., Seminara, D., Shah, M., Shelford, T., Shin, M., Shulman, K., Sieri, S., Slattery, M. L., Southey, M. C., Stadler, Z. K., Stegmaier, C., Su, Y., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Thomas, S. S., Toland, A. E., Trichopoulou, A., Ulrich, C. M., Van Den Berg, D. J., van Duijnhoven, F. J., Van Guelpen, B., van Kranen, H., Vijai, J., Visvanathan, K., Vodicka, P., Vodickova, L., Vymetalkova, V., Weigl, K., Weinstein, S. J., White, E., Win, A. K., Wolf, C. R., Wolk, A., Woods, M. O., Wu, A. H., Zaidi, S. H., Zanke, B. W., Zhang, Q., Zheng, W., Scacheri, P. C., Potter, J. D., Bassik, M. C., Kundaje, A., Casey, G., Moreno, V., Abecasis, G. R., Nickerson, D. A., Gruber, S. B., Hsu, L., Peters, U. 2018


    To further dissect the genetic architecture of colorectal cancer (CRC), we performed whole-genome sequencing of 1,439 cases and 720 controls, imputed discovered sequence variants and Haplotype Reference Consortium panel variants into genome-wide association study data, and tested for association in 34,869 cases and 29,051 controls. Findings were followed up in an additional 23,262 cases and 38,296 controls. We discovered a strongly protective 0.3% frequency variant signal at CHD1. In a combined meta-analysis of 125,478 individuals, we identified 40 new independent signals at P < 5 × 10⁻⁸, bringing the number of known independent signals for CRC to ~100. New signals implicate lower-frequency variants, Kruppel-like factors, Hedgehog signaling, Hippo-YAP signaling, long noncoding RNAs and somatic drivers, and support a role for immune function. Heritability analyses suggest that CRC risk is highly polygenic, and larger, more comprehensive studies enabling rare variant analysis will improve understanding of biology underlying this risk and influence personalized screening strategies and drug development.

    View details for PubMedID 30510241

  • Intertumoral Heterogeneity in SCLC Is Influenced by the Cell Type of Origin CANCER DISCOVERY Yang, D., Denny, S. K., Greenside, P. G., Chaikovsky, A. C., Brady, J. J., Ouadah, Y., Granja, J. M., Jahchan, N. S., Lim, J., Kwok, S., Kong, C. S., Berghoff, A. S., Schmitt, A., Reinhardt, H., Park, K., Preusser, M., Kundaje, A., Greenleaf, W. J., Sage, J., Winslow, M. M. 2018; 8 (10): 1316–31
  • GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs BIOINFORMATICS Ursu, O., Boley, N., Taranova, M., Wang, Y., Yardimci, G., Noble, W., Kundaje, A. 2018; 34 (16): 2701–7
  • A common pattern of DNase I footprinting throughout the human mtDNA unveils clues for a chromatin-like organization GENOME RESEARCH Blumberg, A., Danko, C. G., Kundaje, A., Mishmar, D. 2018; 28 (8): 1158–68


    Human mitochondrial DNA (mtDNA) is believed to lack chromatin and histones. Instead, it is coated solely by the transcription factor TFAM. We asked whether mtDNA packaging is more regulated than once thought. To address this, we analyzed DNase-seq experiments in 324 human cell types and found, for the first time, a pattern of 29 mtDNA Genomic footprinting (mt-DGF) sites shared by ∼90% of the samples. Their syntenic conservation in mouse DNase-seq experiments reflects selective constraints. Colocalization with known mtDNA regulatory elements, with G-quadruplex structures, in TFAM-poor sites (in HeLa cells) and with transcription pausing sites suggests a functional regulatory role for such mt-DGFs. An altered mt-DGF pattern in interleukin 3-treated CD34+ cells, certain tissue differences, and a significant prevalence change in fetal versus nonfetal samples offer first clues to their physiological importance. Taken together, human mtDNA has a conserved protein-DNA organization, which is likely involved in mtDNA regulation.

    View details for PubMedID 30002158

    View details for PubMedCentralID PMC6071632

  • Decoding regulatory sequence across skin differentiation with deep learning Kim, D., Risca, V., Chappell, J., Shi, M., Zhao, Z., Jung, N., Chang, H., Snyder, M., Greenleaf, W., Kundaje, A., Khavari, P. ELSEVIER SCIENCE INC. 2018: S135
  • GenomeDISCO: A concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics (Oxford, England) Ursu, O., Boley, N., Taranova, M., Wang, Y. X., Yardimci, G. G., Noble, W. S., Kundaje, A. 2018


    Motivation: The three-dimensional organization of chromatin plays a critical role in gene regulation and disease. High-throughput chromosome conformation capture experiments such as Hi-C are used to obtain genome-wide maps of 3D chromatin contacts. However, robust estimation of data quality and systematic comparison of these contact maps is challenging due to the multi-scale, hierarchical structure of chromatin contacts and the resulting properties of experimental noise in the data. Measuring concordance of contact maps is important for assessing reproducibility of replicate experiments and for modeling variation between different cellular contexts. Results: We introduce a concordance measure called GenomeDISCO (DIfferences between Smoothed COntact maps) for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments. The key idea is to smooth contact maps using random walks on the contact map graph, before estimating concordance. We use simulated datasets to benchmark GenomeDISCO's sensitivity to different types of noise that affect chromatin contact maps. When applied to a large collection of Hi-C datasets, GenomeDISCO accurately distinguishes biological replicates from samples obtained from different cell types. GenomeDISCO also generalizes to other chromosome conformation capture assays, such as HiChIP. Availability: Software implementing GenomeDISCO is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
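
    The key idea in the abstract above (smooth each contact map with a random walk, then compare) can be illustrated with a minimal sketch. This is not the GenomeDISCO implementation: the function names, the default number of walk steps, and the normalization by the average number of nodes with contacts are all assumptions made for illustration.

```python
import numpy as np

def transition_matrix(contacts):
    """Row-normalize a symmetric contact map into random-walk transition probabilities."""
    contacts = np.asarray(contacts, dtype=float)
    row_sums = contacts.sum(axis=1, keepdims=True)
    # Rows with no contacts stay all-zero instead of triggering a division by zero.
    return np.divide(contacts, row_sums, out=np.zeros_like(contacts), where=row_sums > 0)

def smoothed_concordance(map_a, map_b, steps=3):
    """Hypothetical concordance score: 1 minus the normalized L1 difference
    between the two maps after `steps` random-walk steps (an assumed default)."""
    smooth_a = np.linalg.matrix_power(transition_matrix(map_a), steps)
    smooth_b = np.linalg.matrix_power(transition_matrix(map_b), steps)
    # Assumed normalization: average number of nodes that have any contact.
    nodes_a = np.count_nonzero(np.asarray(map_a).sum(axis=1))
    nodes_b = np.count_nonzero(np.asarray(map_b).sum(axis=1))
    n_nodes = max(1.0, 0.5 * (nodes_a + nodes_b))
    difference = np.abs(smooth_a - smooth_b).sum() / n_nodes
    return 1.0 - difference

replicate = np.array([[0, 2, 1], [2, 0, 3], [1, 3, 0]])
other_sample = np.array([[0, 1, 5], [1, 0, 1], [5, 1, 0]])
print(smoothed_concordance(replicate, replicate))
print(smoothed_concordance(replicate, other_sample))
```

    Under these assumptions, identical maps score exactly 1, and the score falls as the smoothed maps diverge. Smoothing before comparison means that sparse, noisy entries are compared through their neighborhoods rather than cell by cell.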

    View details for PubMedID 29554289

  • ChIP-ping the branches of the tree: functional genomics and the evolution of eukaryotic gene regulation BRIEFINGS IN FUNCTIONAL GENOMICS Marinov, G. K., Kundaje, A. 2018; 17 (2): 116–37


    Advances in the methods for detecting protein-DNA interactions have played a key role in determining the directions of research into the mechanisms of transcriptional regulation. The most recent major technological transformation happened a decade ago, with the move from using tiling arrays [chromatin immunoprecipitation (ChIP)-on-Chip] to high-throughput sequencing (ChIP-seq) as a readout for ChIP assays. In addition to the numerous other ways in which it is superior to arrays, by eliminating the need to design and manufacture them, sequencing also opened the door to carrying out comparative analyses of genome-wide transcription factor occupancy across species and studying chromatin biology in previously less accessible model and nonmodel organisms, thus allowing us to understand the evolution and diversity of regulatory mechanisms in unprecedented detail. Here, we review the biological insights obtained from such studies in recent years and discuss anticipated future developments in the field.

    View details for DOI 10.1093/bfgp/ely004

    View details for Web of Science ID 000429027600006

    View details for PubMedID 29529131

  • Impact of regulatory variation across human iPSCs and differentiated cells GENOME RESEARCH Banovich, N. E., Li, Y. I., Raj, A., Ward, M. C., Greenside, P., Calderon, D., Tung, P., Burnett, J. E., Myrthil, M., Thomas, S. M., Burrows, C. K., Romero, I., Pavlovic, B. J., Kundaje, A., Pritchard, J. K., Gilad, Y. 2018; 28 (1): 122–31


    Induced pluripotent stem cells (iPSCs) are an essential tool for studying cellular differentiation and cell types that are otherwise difficult to access. We investigated the use of iPSCs and iPSC-derived cells to study the impact of genetic variation on gene regulation across different cell types and as models for studies of complex disease. To do so, we established a panel of iPSCs from 58 well-studied Yoruba lymphoblastoid cell lines (LCLs); 14 of these lines were further differentiated into cardiomyocytes. We characterized regulatory variation across individuals and cell types by measuring gene expression levels, chromatin accessibility, and DNA methylation. Our analysis focused on a comparison of inter-individual regulatory variation across cell types. While most cell-type-specific regulatory quantitative trait loci (QTLs) lie in chromatin that is open only in the affected cell types, we found that 20% of cell-type-specific regulatory QTLs are in shared open chromatin. This observation motivated us to develop a deep neural network to predict open chromatin regions from DNA sequence alone. Using this approach, we were able to use the sequences of segregating haplotypes to predict the effects of common SNPs on cell-type-specific chromatin accessibility.

    View details for PubMedID 29208628

  • Prediction of protein-ligand interactions from paired protein sequence motifs and ligand substructures. Pacific Symposium on Biocomputing Greenside, P., Hillenmeyer, M., Kundaje, A. 2018; 23: 20–31


    Identification of small molecule ligands that bind to proteins is a critical step in drug discovery. Computational methods have been developed to accelerate the prediction of protein-ligand binding, but often depend on 3D protein structures. As only a limited number of protein 3D structures have been resolved, the ability to predict protein-ligand interactions without relying on a 3D representation would be highly valuable. We use an interpretable confidence-rated boosting algorithm to predict protein-ligand interactions with high accuracy from ligand chemical substructures and protein 1D sequence motifs, without relying on 3D protein structures. We compare several protein motif definitions, assess generalization of our model's predictions to unseen proteins and ligands, demonstrate recovery of well established interactions and identify globally predictive protein-ligand motif pairs. By bridging biological and chemical perspectives, we demonstrate that it is possible to predict protein-ligand interactions using only motif-based features and that interpretation of these features can reveal new insights into the molecular mechanics underlying each interaction. Our work also lays a foundation to explore more predictive feature sets and sophisticated machine learning approaches as well as other applications, such as predicting unintended interactions or the effects of mutations.

    View details for PubMedID 29218866

  • Umap and Bismap: quantifying genome and methylome mappability. Nucleic acids research Karimzadeh, M., Ernst, C., Kundaje, A., Hoffman, M. M. 2018; 46 (20): e120


    Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding and chemical modifications. Every region in a genome assembly has a property called 'mappability', which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. These regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. Both to correct assumptions of uniformity in downstream analysis and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available for use with genome browsers.
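
    As a rough illustration of the mappability notion described above, the sketch below marks which k-mers of a toy sequence occur exactly once and then scores each base by the fraction of overlapping k-mers that are unique. It is not the Umap software: the function names are hypothetical, only the forward strand is considered, and reads are assumed to be error-free with a fixed length k.

```python
from collections import Counter

def unique_kmer_starts(genome, k):
    """For each start position, report whether the k-mer beginning there occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]

def per_base_mappability(genome, k):
    """Fraction of the k-mers overlapping each base that are unique in the sequence."""
    unique = unique_kmer_starts(genome, k)
    scores = []
    for pos in range(len(genome)):
        first = max(0, pos - k + 1)          # earliest k-mer start covering pos
        last = min(pos, len(genome) - k)     # latest k-mer start covering pos
        overlapping = unique[first:last + 1]
        scores.append(sum(overlapping) / len(overlapping) if overlapping else 0.0)
    return scores

print(unique_kmer_starts("AAAAC", 2))    # [False, False, False, True]
print(per_base_mappability("AAAAC", 2))  # [0.0, 0.0, 0.0, 0.5, 1.0]
```

    In the toy sequence, the repeated `AA` k-mers make the first three bases unmappable, while the final base is covered only by the unique k-mer `AC`.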

    View details for PubMedID 30169659

  • Differential analysis of chromatin accessibility and histone modifications for predicting mouse developmental enhancers. Nucleic acids research Fu, S., Wang, Q., Moore, J. E., Purcaro, M. J., Pratt, H. E., Fan, K., Gu, C., Jiang, C., Zhu, R., Kundaje, A., Lu, A., Weng, Z. 2018; 46 (21): 11184–201


    Enhancers are distal cis-regulatory elements that modulate gene expression. They are depleted of nucleosomes and enriched in specific histone modifications; thus, calling DNase-seq and histone mark ChIP-seq peaks can predict enhancers. We evaluated nine peak-calling algorithms for predicting enhancers validated by transgenic mouse assays. DNase and H3K27ac peaks were consistently more predictive than H3K4me1/2/3 and H3K9ac peaks. DFilter and Hotspot2 were the best DNase peak callers, while HOMER, MUSIC, MACS2, DFilter and F-seq were the best H3K27ac peak callers. We observed that the differential DNase or H3K27ac signals between two distant tissues increased the area under the precision-recall curve (PR-AUC) of DNase peaks by 17.5-166.7% and that of H3K27ac peaks by 7.1-22.2%. We further improved this differential signal method using multiple contrast tissues. Evaluated using a blind test, the differential H3K27ac signal method substantially improved PR-AUC from 0.48 to 0.75 for predicting heart enhancers. We further validated our approach using postnatal retina and cerebral cortex enhancers identified by massively parallel reporter assays, and observed improvements for both tissues. In summary, we compared nine peak callers and devised a superior method for predicting tissue-specific mouse developmental enhancers by reranking the called peaks.

    View details for PubMedID 30137428

  • Challenges and recommendations for epigenomics in precision health NATURE BIOTECHNOLOGY Carter, A. C., Chang, H. Y., Church, G., Dombkowski, A., Ecker, J. R., Gil, E., Giresi, P. G., Greely, H., Greenleaf, W. J., Hacohen, N., He, C., Hill, D., Ko, J., Kohane, I., Kundaje, A., Palmer, M., Snyder, M. P., Tung, J., Urban, A., Vidal, M., Wong, W. 2017; 35 (12): 1128–32

    View details for PubMedID 29220033

  • Chromatin accessibility dynamics reveal novel functional enhancers in C. elegans GENOME RESEARCH Daugherty, A. C., Yeo, R. W., Buenrostro, J. D., Greenleaf, W. J., Kundaje, A., Brunet, A. 2017; 27 (12): 2096–2107


    Chromatin accessibility, a crucial component of genome regulation, has primarily been studied in homogeneous and simple systems, such as isolated cell populations or early-development models. Whether chromatin accessibility can be assessed in complex, dynamic systems in vivo with high sensitivity remains largely unexplored. In this study, we use ATAC-seq to identify chromatin accessibility changes in a whole animal, the model organism Caenorhabditis elegans, from embryogenesis to adulthood. Chromatin accessibility changes between developmental stages are highly reproducible, recapitulate histone modification changes, and reveal key regulatory aspects of the epigenomic landscape throughout organismal development. We find that over 5000 distal noncoding regions exhibit dynamic changes in chromatin accessibility between developmental stages and could thereby represent putative enhancers. When tested in vivo, several of these putative enhancers indeed drive novel cell-type- and temporal-specific patterns of expression. Finally, by integrating transcription factor binding motifs in a machine learning framework, we identify EOR-1 as a unique transcription factor that may regulate chromatin dynamics during development. Our study provides a unique resource for C. elegans, a system in which the prevalence and importance of enhancers remain poorly characterized, and demonstrates the power of using whole organism chromatin accessibility to identify novel regulatory regions in complex systems.

    View details for PubMedID 29141961

  • Enrichment of colorectal cancer associations in functional regions: Insight for using epigenomics data in the analysis of whole genome sequence-imputed GWAS data PLOS ONE Bien, S. A., Auer, P. L., Harrison, T. A., Qu, C., Connolly, C. M., Greenside, P. G., Chen, S., Berndt, S. I., Bezieau, S., Kang, H. M., Huyghe, J., Brenner, H., Casey, G., Chan, A. T., Hopper, J. L., Banbury, B. L., Chang-Claude, J., Chanock, S. J., Haile, R. W., Hoffmeister, M., Fuchsberger, C., Jenkins, M. A., Leal, S. M., Lemire, M., Newcomb, P. A., Gallinger, S., Potter, J. D., Schoen, R. E., Slattery, M. L., Smith, J. D., Le Marchand, L., White, E., Zanke, B. W., Abecasis, G. R., Carlson, C. S., Peters, U., Nickerson, D. A., Kundaje, A., Hsu, L., GECCO CCFR 2017; 12 (11): e0186518


    The evaluation of less frequent genetic variants and their effect on complex disease poses new challenges for genomic research. To investigate whether epigenetic data can be used to inform aggregate rare-variant association methods (RVAM), we assessed whether variants more significantly associated with colorectal cancer (CRC) were preferentially located in non-coding regulatory regions, and whether enrichment was specific to colorectal tissues. Active regulatory elements (ARE) were mapped using data from 127 tissues and cell-types from NIH Roadmap Epigenomics and Encyclopedia of DNA Elements (ENCODE) projects. We investigated whether CRC association p-values were more significant for common variants 1) inside versus outside AREs, or 2) inside colorectal (CR) AREs versus AREs of other tissues and cell-types. We employed an integrative epigenomic RVAM for variants with allele frequency <1%. Gene sets were defined as ARE variants within 200 kilobases of a transcription start site (TSS) using either CR ARE or ARE from non-digestive tissues. CRC-set association p-values were used to evaluate enrichment of less frequent variant associations in CR ARE versus non-digestive ARE. ARE from 126/127 tissues and cell-types were significantly enriched for stronger CRC-variant associations. Strongest enrichment was observed for digestive tissues and immune cell types. CR-specific ARE were also enriched for stronger CRC-variant associations compared to ARE combined across non-digestive tissues (p-value = 9.6 × 10⁻⁴). Additionally, we found enrichment of stronger CRC association p-values for rare variant sets of CR ARE compared to non-digestive ARE (p-value = 0.029). Integrative epigenomic RVAM may enable discovery of less frequent variants associated with CRC, and ARE of digestive and immune tissues are most informative. Although distance-based aggregation of less frequent variants in CR ARE surrounding TSS showed modest enrichment, future association studies would likely benefit from joint analysis of transcriptomes and epigenomes to better link regulatory variation with target genes.

    View details for PubMedID 29161273

  • Vicus: Exploiting local structures to improve network-based analysis of biological data PLOS COMPUTATIONAL BIOLOGY Wang, B., Huang, L., Zhu, Y., Kundaje, A., Batzoglou, S., Goldenberg, A. 2017; 13 (10): e1005621


    Biological networks entail important topological features and patterns critical to understanding interactions within complicated biological systems. Despite great progress in understanding their structure, much more can be done to improve our inference and network analysis. Spectral methods play a key role in many network-based applications. Fundamental to spectral methods is the Laplacian, a matrix that captures the global structure of the network. Unfortunately, the Laplacian does not take into account intricacies of the network's local structure and is sensitive to noise in the network. These two properties are fundamental to biological networks and cannot be ignored. We propose an alternative matrix, Vicus. The Vicus matrix captures the local neighborhood structure of the network and thus is more effective at modeling biological interactions. We demonstrate the advantages of Vicus in the context of spectral methods by extensive empirical benchmarking on tasks such as single cell dimensionality reduction, protein module discovery and ranking genes for cancer subtyping. Our experiments show that using Vicus, spectral methods result in more accurate and robust performance in all of these tasks.

    View details for PubMedID 29023470

    View details for PubMedCentralID PMC5638230

  • Genome-scale measurement of off-target activity using Cas9 toxicity in high-throughput screens NATURE COMMUNICATIONS Morgens, D. W., Wainberg, M., Boyle, E. A., Ursu, O., Araya, C. L., Tsui, C. K., Haney, M. S., Hess, G. T., Han, K., Jeng, E. E., Li, A., Snyder, M. P., Greenleaf, W. J., Kundaje, A., Bassik, M. C. 2017; 8


    CRISPR-Cas9 screens are powerful tools for high-throughput interrogation of genome function, but can be confounded by nuclease-induced toxicity at both on- and off-target sites, likely due to DNA damage. Here, to test potential solutions to this issue, we design and analyse a CRISPR-Cas9 library with 10 variable-length guides per gene and thousands of negative controls targeting non-functional, non-genic regions (termed safe-targeting guides), in addition to non-targeting controls. We find this library has excellent performance in identifying genes affecting growth and sensitivity to the ricin toxin. The safe-targeting guides allow for proper control of toxicity from on-target DNA damage. Using this toxicity as a proxy to measure off-target cutting, we demonstrate with tens of thousands of guides both the nucleotide position-dependent sensitivity to single mismatches and the reduction of off-target cutting using truncated guides. Our results demonstrate a simple strategy for high-throughput evaluation of target specificity and nuclease toxicity in Cas9 screens.

    View details for DOI 10.1038/ncomms15178

    View details for PubMedID 28474669

  • Dynamic and stable enhancer-promoter contacts regulate epidermal terminal differentiation Lopez-Pajares, V., Rubin, A., Barajas, B., Furlan-Magaril, M., Mumbach, M., Greenleaf, W., Kundaje, A., Snyder, M., Chang, H., Fraser, P., Khavari, P. A. ELSEVIER SCIENCE INC. 2017: S80
  • Initiation of mtDNA transcription is followed by pausing, and diverges across human cell types and during evolution. Genome research Blumberg, A., Rice, E. J., Kundaje, A., Danko, C. G., Mishmar, D. 2017; 27 (3): 362-373


    Mitochondrial DNA (mtDNA) genes have long been known to be cotranscribed in polycistrons, yet it remains impossible to study nascent mtDNA transcripts quantitatively in vivo using existing tools. To this end, we used deep sequencing (GRO-seq and PRO-seq) and analyzed nascent mtDNA-encoded RNA transcripts in diverse human cell lines and metazoan organisms. Surprisingly, accurate detection of human mtDNA transcription initiation sites (TISs) in the heavy and light strands revealed a novel conserved transcription pausing site near the light-strand TIS. This pausing site correlated with the presence of a bacterial pausing sequence motif, with reduced SNP density, and with a DNase footprinting signal in all tested cells. Its location within conserved sequence block 3 (CSBIII), just upstream of the known transcription-replication transition point, suggests involvement in such transition. Analysis of nonhuman organisms enabled de novo mtDNA sequence assembly, as well as detection of previously unknown mtDNA TIS, pausing, and transcription termination sites with unprecedented accuracy. Whereas mammals (Pan troglodytes, Macaca mulatta, Rattus norvegicus, and Mus musculus) showed a human-like mtDNA transcription pattern, the invertebrate pattern (Drosophila melanogaster and Caenorhabditis elegans) profoundly diverged. Our approach paves the path toward in vivo, quantitative, reference sequence-free analysis of mtDNA transcription in all eukaryotes.

    View details for DOI 10.1101/gr.209924.116

    View details for PubMedID 28049628

  • Molecular definition of a metastatic lung cancer state reveals a targetable CD109-Janus kinase-Stat axis. Nature medicine Chuang, C., Greenside, P. G., Rogers, Z. N., Brady, J. J., Yang, D., Ma, R. K., Caswell, D. R., Chiou, S., Winters, A. F., Grüner, B. M., Ramaswami, G., Spencley, A. L., Kopecky, K. E., Sayles, L. C., Sweet-Cordero, E. A., Li, J. B., Kundaje, A., Winslow, M. M. 2017; 23 (3): 291-300


    Lung cancer is the leading cause of cancer deaths worldwide, with the majority of mortality resulting from metastatic spread. However, the molecular mechanism by which cancer cells acquire the ability to disseminate from primary tumors, seed distant organs, and grow into tissue-destructive metastases remains incompletely understood. We combined tumor barcoding in a mouse model of human lung adenocarcinoma with unbiased genomic approaches to identify a transcriptional program that confers metastatic ability and predicts patient survival. Small-scale in vivo screening identified several genes, including Cd109, that encode novel pro-metastatic factors. We uncovered signaling mediated by Janus kinases (Jaks) and the transcription factor Stat3 as a critical, pharmacologically targetable effector of CD109-driven lung cancer metastasis. In summary, by coupling the systematic genomic analysis of purified cancer cells in distinct malignant states from mouse models with extensive human validation, we uncovered several key regulators of metastatic ability, including an actionable pro-metastatic CD109-Jak-Stat3 axis.

    View details for DOI 10.1038/nm.4285

    View details for PubMedID 28191885

  • Predicting gene expression in massively parallel reporter assays: a comparative study. Human mutation Kreimer, A., Zeng, H., Edwards, M. D., Guo, Y., Tian, K., Shin, S., Welch, R., Wainberg, M., Mohan, R., Sinnott-Armstrong, N. A., Li, Y., Eraslan, G., Amin, T. B., Goke, J., Mueller, N. S., Kellis, M., Kundaje, A., Beer, M. A., Keles, S., Gifford, D. K., Yosef, N. 2017


    In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role "coded" in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.

    View details for DOI 10.1002/humu.23197

    View details for PubMedID 28220625

  • An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nature methods Corces, M. R., Trevino, A. E., Hamilton, E. G., Greenside, P. G., Sinnott-Armstrong, N. A., Vesuna, S., Satpathy, A. T., Rubin, A. J., Montine, K. S., Wu, B., Kathiria, A., Cho, S. W., Mumbach, M. R., Carter, A. C., Kasowski, M., Orloff, L. A., Risca, V. I., Kundaje, A., Khavari, P. A., Montine, T. J., Greenleaf, W. J., Chang, H. Y. 2017


    We present Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility profiling that works across multiple applications with substantial improvement of signal-to-background ratio and information content. The Omni-ATAC protocol generates chromatin accessibility profiles from archival frozen tissue samples and 50-μm sections, revealing the activities of disease-associated DNA elements in distinct human brain structures. The Omni-ATAC protocol enables the interrogation of personal regulomes in tissue context and translational studies.

    View details for PubMedID 28846090

  • Learning Important Features Through Propagating Activation Differences Shrikumar, A., Greenside, P., Kundaje, A., Precup, D., Teh, Y. W. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2017
  • Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements. Nature genetics Mumbach, M. R., Satpathy, A. T., Boyle, E. A., Dai, C., Gowen, B. G., Cho, S. W., Nguyen, M. L., Rubin, A. J., Granja, J. M., Kazane, K. R., Wei, Y., Nguyen, T., Greenside, P. G., Corces, M. R., Tycko, J., Simeonov, D. R., Suliman, N., Li, R., Xu, J., Flynn, R. A., Kundaje, A., Khavari, P. A., Marson, A., Corn, J. E., Quertermous, T., Greenleaf, W. J., Chang, H. Y. 2017


    The challenge of linking intergenic mutations to target genes has limited molecular understanding of human diseases. Here we show that H3K27ac HiChIP generates high-resolution contact maps of active enhancers and target genes in rare primary human T cell subtypes and coronary artery smooth muscle cells. Differentiation of naive T cells into T helper 17 cells or regulatory T cells creates subtype-specific enhancer-promoter interactions, specifically at regions of shared DNA accessibility. These data provide a principled means of assigning molecular functions to autoimmune and cardiovascular disease risk variants, linking hundreds of noncoding variants to putative gene targets. Target genes identified with HiChIP are further supported by CRISPR interference and activation at linked enhancers, by the presence of expression quantitative trait loci, and by allele-specific enhancer loops in patient-derived primary cells. The majority of disease-associated enhancers contact genes beyond the nearest gene in the linear genome, leading to a fourfold increase in the number of potential target genes for autoimmune and cardiovascular diseases.

    View details for PubMedID 28945252

  • Lineage-specific dynamic and pre-established enhancer-promoter contacts cooperate in terminal differentiation. Nature genetics Rubin, A. J., Barajas, B. C., Furlan-Magaril, M., Lopez-Pajares, V., Mumbach, M. R., Howard, I., Kim, D. S., Boxer, L. D., Cairns, J., Spivakov, M., Wingett, S. W., Shi, M., Zhao, Z., Greenleaf, W. J., Kundaje, A., Snyder, M., Chang, H. Y., Fraser, P., Khavari, P. A. 2017; 49 (10): 1522-1528


    Chromosome conformation is an important feature of metazoan gene regulation; however, enhancer-promoter contact remodeling during cellular differentiation remains poorly understood. To address this, genome-wide promoter capture Hi-C (CHi-C) was performed during epidermal differentiation. Two classes of enhancer-promoter contacts associated with differentiation-induced genes were identified. The first class ('gained') increased in contact strength during differentiation in concert with enhancer acquisition of the H3K27ac activation mark. The second class ('stable') were pre-established in undifferentiated cells, with enhancers constitutively marked by H3K27ac. The stable class was associated with the canonical conformation regulator cohesin, whereas the gained class was not, implying distinct mechanisms of contact formation and regulation. Analysis of stable enhancers identified a new, essential role for a constitutively expressed, lineage-restricted ETS-family transcription factor, EHF, in epidermal differentiation. Furthermore, neither class of contacts was observed in pluripotent cells, suggesting that lineage-specific chromatin structure is established in tissue progenitor cells and is further remodeled in terminal differentiation.

    View details for PubMedID 28805829

  • High-Throughput Characterization of Cascade type I-E CRISPR Guide Efficacy Reveals Unexpected PAM Diversity and Target Sequence Preferences. Genetics Fu, B. X., Wainberg, M., Kundaje, A., Fire, A. Z. 2017; 206 (4): 1727-1738


    Interactions between Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) RNAs and CRISPR-associated (Cas) proteins form an RNA-guided adaptive immune system in prokaryotes. The adaptive immune system utilizes segments of the genetic material of invasive foreign elements in the CRISPR locus. The loci are transcribed and processed to produce small CRISPR RNAs (crRNAs), with degradation of invading genetic material directed by a combination of complementarity between RNA and DNA and in some cases recognition of adjacent motifs called PAMs (Protospacer Adjacent Motifs). Here we describe a general, high-throughput procedure to test the efficacy of thousands of targets, applying this to the Escherichia coli type I-E Cascade (CRISPR-associated complex for antiviral defense) system. These studies were followed with reciprocal experiments in which the consequence of CRISPR activity was survival in the presence of a lytic phage. From the combined analysis of the Cascade system, we found that (i) type I-E Cascade PAM recognition is more expansive than previously reported, with at least 22 distinct PAMs, with many of the noncanonical PAMs having CRISPR-interference abilities similar to the canonical PAMs; (ii) PAM positioning appears precise, with no evidence for tolerance to PAM slippage in interference; and (iii) while increased guanine-cytosine (GC) content in the spacer is associated with higher CRISPR-interference efficiency, high GC content (>62.5%) decreases CRISPR-interference efficiency. Our findings provide a comprehensive functional profile of Cascade type I-E interference requirements and a method to assay spacer efficacy that can be applied to other CRISPR-Cas systems.

    View details for PubMedID 28634160

  • An atlas of transcriptional, chromatin accessibility, and surface marker changes in human mesoderm development SCIENTIFIC DATA Koh, P. W., Sinha, R., Barkal, A. A., Morganti, R. M., Chen, A., Weissman, I. L., Ang, L. T., Kundaje, A., Loh, K. M. 2016; 3


    Mesoderm is the developmental precursor to myriad human tissues including bone, heart, and skeletal muscle. Unravelling the molecular events through which these lineages become diversified from one another is integral to developmental biology and understanding changes in cellular fate. To this end, we developed an in vitro system to differentiate human pluripotent stem cells through primitive streak intermediates into paraxial mesoderm and its derivatives (somites, sclerotome, dermomyotome) and separately, into lateral mesoderm and its derivatives (cardiac mesoderm). Whole-population and single-cell analyses of these purified populations of human mesoderm lineages through RNA-seq, ATAC-seq, and high-throughput surface marker screens illustrated how transcriptional changes co-occur with changes in open chromatin and surface marker landscapes throughout human mesoderm development. This molecular atlas will facilitate study of human mesoderm development (which cannot be interrogated in vivo due to restrictions on human embryo studies) and provides a broad resource for the study of gene regulation in development at the single-cell level, knowledge that might one day be exploited for regenerative medicine.

    View details for DOI 10.1038/sdata.2016.109

    View details for PubMedID 27996962

  • The International Human Epigenome Consortium: A Blueprint for Scientific Collaboration and Discovery CELL Stunnenberg, H. G., Hirst, M., Int Human Epigenome Consortium 2016; 167 (5): 1145-1149


    The International Human Epigenome Consortium (IHEC) coordinates the generation of a catalog of high-resolution reference epigenomes of major primary human cell types. The studies now presented (see the Cell Press IHEC web portal) highlight the coordinated achievements of IHEC teams to gather and interpret comprehensive epigenomic datasets to gain insights into the epigenetic control of cell states relevant for human health and disease.

    View details for DOI 10.1016/j.cell.2016.11.007

    View details for Web of Science ID 000389470100004

    View details for PubMedID 27863232

  • Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nature genetics Corces, M. R., Buenrostro, J. D., Wu, B., Greenside, P. G., Chan, S. M., Koenig, J. L., Snyder, M. P., Pritchard, J. K., Kundaje, A., Greenleaf, W. J., Majeti, R., Chang, H. Y. 2016; 48 (10): 1193-1203


    We define the chromatin accessibility and transcriptional landscapes in 13 human primary blood cell types that span the hematopoietic hierarchy. Exploiting the finding that the enhancer landscape better reflects cell identity than mRNA levels, we enable 'enhancer cytometry' for enumeration of pure cell types from complex populations. We identify regulators governing hematopoietic differentiation and further show the lineage ontogeny of genetic elements linked to diverse human diseases. In acute myeloid leukemia (AML), chromatin accessibility uncovers unique regulatory evolution in cancer cells with a progressively increasing mutation burden. Single AML cells exhibit distinctive mixed regulome profiles corresponding to disparate developmental stages. A method to account for this regulatory heterogeneity identified cancer-specific deviations and implicated HOX factors as key regulators of preleukemic hematopoietic stem cell characteristics. Thus, regulome dynamics can provide diverse insights into hematopoietic development and disease.

    View details for DOI 10.1038/ng.3646

    View details for PubMedID 27526324

  • Characterization of the direct targets of FOXO transcription factors throughout evolution. Aging cell Webb, A. E., Kundaje, A., Brunet, A. 2016; 15 (4): 673-685


    FOXO transcription factors (FOXOs) are central regulators of lifespan across species, yet they also have cell-specific functions, including adult stem cell homeostasis and immune function. Direct targets of FOXOs have been identified genome-wide in several species and cell types. However, whether FOXO targets are specific to cell types and species or conserved across cell types and throughout evolution remains uncharacterized. Here, we perform a meta-analysis of direct FOXO targets across tissues and organisms, using data from mammals as well as Caenorhabditis elegans and Drosophila. We show that FOXOs bind cell type-specific targets, which have functions related to that particular cell. Interestingly, FOXOs also share targets across different tissues in mammals, and the function and even the identity of these shared mammalian targets are conserved in invertebrates. Evolutionarily conserved targets show enrichment for growth factor signaling, metabolism, stress resistance, and proteostasis, suggesting an ancestral, conserved role in the regulation of these processes. We also identify candidate cofactors at conserved FOXO targets that change in expression with age, including CREB and ETS family factors. This meta-analysis provides insight into the evolution of the FOXO network and highlights downstream genes and cofactors that may be particularly important for FOXO's conserved function in adult homeostasis and longevity.

    View details for DOI 10.1111/acel.12479

    View details for PubMedID 27061590

  • Mapping the Pairwise Choices Leading from Pluripotency to Human Bone, Heart, and Other Mesoderm Cell Types CELL Loh, K. M., Chen, A., Koh, P. W., Deng, T. Z., Sinha, R., Tsai, J. M., Barkal, A. A., Shen, K. Y., Jain, R., Morganti, R. M., Shyh-Chang, N., Fernhoff, N. B., George, B. M., Wernig, G., Salomon, R. E., Chen, Z., Vogel, H., Epstein, J. A., Kundaje, A., Talbot, W. S., Beachy, P. A., Ang, L. T., Weissman, I. L. 2016; 166 (2): 451-467


    Stem-cell differentiation to desired lineages requires navigating alternating developmental paths that often lead to unwanted cell types. Hence, comprehensive developmental roadmaps are crucial to channel stem-cell differentiation toward desired fates. To this end, here, we map bifurcating lineage choices leading from pluripotency to 12 human mesodermal lineages, including bone, muscle, and heart. We defined the extrinsic signals controlling each binary lineage decision, enabling us to logically block differentiation toward unwanted fates and rapidly steer pluripotent stem cells toward 80%-99% pure human mesodermal lineages at most branchpoints. This strategy enabled the generation of human bone and heart progenitors that could engraft in respective in vivo models. Mapping stepwise chromatin and single-cell gene expression changes in mesoderm development uncovered somite segmentation, a previously unobservable human embryonic event transiently marked by HOPX expression. Collectively, this roadmap enables navigation of mesodermal development to produce transplantable human tissue progenitors and uncover developmental processes.

    View details for DOI 10.1016/j.cell.2016.06.011

    View details for PubMedID 27419872

  • Using functional data from Roadmap Epigenomics to inform analysis of rare variants linked to gene expression in a large colorectal cancer study Bien, S. A., Harrison, T. A., Auer, P. L., Qu, F., Huyghe, J., Banbury, B., Greenside, P., Abecasis, G. R., Berndt, S. I., Bezieau, S., Brenner, H., Casey, G., Chan, A. T., Chang-Claude, J., Chen, S., Smith, J. D., Le Marchand, L., Carlson, C., Newcomb, P. A., Fuchsberger, C., Slattery, M. L., Kang, H. M., White, E., Potter, J., Gallinger, S. J., Hoffmeister, M., Gruber, S. B., Nickerson, D. A., Peters, U., Kundaje, A., Hsu, L. AMER ASSOC CANCER RESEARCH. 2016
  • Impact of the X Chromosome and sex on regulatory variation GENOME RESEARCH Kukurba, K. R., Parsana, P., Balliu, B., Smith, K. S., Zappala, Z., Knowles, D. A., Fave, M., Davis, J. R., Li, X., Zhu, X., Potash, J. B., Weissman, M. M., Shi, J., Kundaje, A., Levinson, D. F., Awadalla, P., Mostafavi, S., Battle, A., Montgomery, S. B. 2016; 26 (6): 768-777


    The X Chromosome, with its unique mode of inheritance, contributes to differences between the sexes at a molecular level, including sex-specific gene expression and sex-specific impact of genetic variation. Improving our understanding of these differences offers to elucidate the molecular mechanisms underlying sex-specific traits and diseases. However, to date, most studies have either ignored the X Chromosome or had insufficient power to test for the sex-specific impact of genetic variation. By analyzing whole blood transcriptomes of 922 individuals, we have conducted the first large-scale, genome-wide analysis of the impact of both sex and genetic variation on patterns of gene expression, including comparison between the X Chromosome and autosomes. We identified a depletion of expression quantitative trait loci (eQTL) on the X Chromosome, especially among genes under high selective constraint. In contrast, we discovered an enrichment of sex-specific regulatory variants on the X Chromosome. To resolve the molecular mechanisms underlying such effects, we generated chromatin accessibility data through ATAC-sequencing to connect sex-specific chromatin accessibility to sex-specific patterns of expression and regulatory variation. As sex-specific regulatory variants discovered in our study can inform sex differences in heritable disease prevalence, we integrated our data with genome-wide association study data for multiple immune traits identifying several traits with significant sex biases in genetic susceptibilities. Together, our study provides genome-wide insight into how genetic variation, the X Chromosome, and sex shape human gene regulation and disease.

    View details for DOI 10.1101/gr.197897.115

    View details for PubMedID 27197214

  • An Arntl2-Driven Secretome Enables Lung Adenocarcinoma Metastatic Self-Sufficiency CANCER CELL Brady, J. J., Chuang, C., Greenside, P. G., Rogers, Z. N., Murray, C. W., Caswell, D. R., Hartmann, U., Connolly, A. J., Sweet-Cordero, E. A., Kundaje, A., Winslow, M. M. 2016; 29 (5): 697-710


    The ability of cancer cells to establish lethal metastatic lesions requires the survival and expansion of single cancer cells at distant sites. The factors controlling the clonal growth ability of individual cancer cells remain poorly understood. Here, we show that high expression of the transcription factor ARNTL2 predicts poor lung adenocarcinoma patient outcome. Arntl2 is required for metastatic ability in vivo and clonal growth in cell culture. Arntl2 drives metastatic self-sufficiency by orchestrating the expression of a complex pro-metastatic secretome. We identify Clock as an Arntl2 partner and functionally validate the matricellular protein Smoc2 as a pro-metastatic secreted factor. These findings shed light on the molecular mechanisms that enable single cancer cells to form allochthonous tumors in foreign tissue environments.

    View details for DOI 10.1016/j.ccell.2016.03.003

    View details for PubMedID 27150038

  • Unsupervised Learning from Noisy Networks with Applications to Hi-C Data Wang, B., Zhu, J., Ursu, O., Pourshafeie, A., Batzoglou, S., Kundaje, A., Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2016
  • Regulatory analysis of the C. elegans genome with spatiotemporal resolution (vol 512, pg 400, 2014) NATURE Araya, C. L., Kawli, T., Kundaje, A., Jiang, L., Wu, B., Vafeados, D., Terrell, R., Weissdepp, P., Gevirtzman, L., Mace, D., Niu, W., Boyle, A. P., Xie, D., Ma, L., Murray, J. I., Reinke, V., Waterston, R. H., Snyder, M. 2015; 528 (7580): 152

    View details for DOI 10.1038/nature16075

    View details for Web of Science ID 000365606000069

    View details for PubMedID 26560031

  • H3K4me3 Breadth Is Linked to Cell Identity and Transcriptional Consistency (vol 158, pg 673, 2014) CELL Benayoun, B. A., Pollina, E. A., Ucar, D., Mahmoudi, S., Karra, K., Wong, E. D., Devarajan, K., Daugherty, A. C., Kundaje, A. B., Mancini, E., Hitz, B. C., Gupta, R., Rando, T. A., Baker, J. C., Snyder, M. P., Cherry, J., Brunet, A. 2015; 163 (5): 1281-U264

    View details for DOI 10.1016/j.cell.2015.10.051

    View details for Web of Science ID 000366044700024

    View details for PubMedID 28930648

  • Characterization of TCF21 Downstream Target Regions Identifies a Transcriptional Network Linking Multiple Independent Coronary Artery Disease Loci. PLoS genetics Sazonova, O., Zhao, Y., Nürnberg, S., Miller, C., Pjanic, M., Castano, V. G., Kim, J. B., Salfati, E. L., Kundaje, A. B., Bejerano, G., Assimes, T., Yang, X., Quertermous, T. 2015; 11 (5)


    To functionally link coronary artery disease (CAD) causal genes identified by genome wide association studies (GWAS), and to investigate the cellular and molecular mechanisms of atherosclerosis, we have used chromatin immunoprecipitation sequencing (ChIP-Seq) with the CAD associated transcription factor TCF21 in human coronary artery smooth muscle cells (HCASMC). Analysis of identified TCF21 target genes for enrichment of molecular and cellular annotation terms identified processes relevant to CAD pathophysiology, including "growth factor binding," "matrix interaction," and "smooth muscle contraction." We characterized the canonical binding sequence for TCF21 as CAGCTG, identified AP-1 binding sites in TCF21 peaks, and by conducting ChIP-Seq for JUN and JUND in HCASMC confirmed that there is significant overlap between TCF21 and AP-1 binding loci in this cell type. Expression quantitative trait variation mapped to target genes of TCF21 was significantly enriched among variants with low P-values in the GWAS analyses, suggesting a possible functional interaction between TCF21 binding and causal variants in other CAD disease loci. Separate enrichment analyses found over-representation of TCF21 target genes among CAD associated genes, and linkage disequilibrium between TCF21 peak variation and that found in GWAS loci, consistent with the hypothesis that TCF21 may affect disease risk through interaction with other disease associated loci. Interestingly, enrichment for TCF21 target genes was also found among other genome wide association phenotypes, including height and inflammatory bowel disease, suggesting a functional profile important for basic cellular processes in non-vascular tissues. Thus, data and analyses presented here suggest that study of GWAS transcription factors may be a highly useful approach to identifying disease gene interactions and thus pathways that may be relevant to complex disease etiology.

    View details for DOI 10.1371/journal.pgen.1005202

    View details for PubMedID 26020271

  • Fine mapping of type 1 diabetes susceptibility loci and evidence for colocalization of causal variants with lymphoid gene enhancers. Nature genetics Onengut-Gumuscu, S., Chen, W., Burren, O., Cooper, N. J., Quinlan, A. R., Mychaleckyj, J. C., Farber, E., Bonnie, J. K., Szpak, M., Schofield, E., Achuthan, P., Guo, H., Fortune, M. D., Stevens, H., Walker, N. M., Ward, L. D., Kundaje, A., Kellis, M., Daly, M. J., Barrett, J. C., Cooper, J. D., Deloukas, P., Todd, J. A., Wallace, C., Concannon, P., Rich, S. S. 2015; 47 (4): 381-386


    Genetic studies of type 1 diabetes (T1D) have identified 50 susceptibility regions, finding major pathways contributing to risk, with some loci shared across immune disorders. To make genetic comparisons across autoimmune disorders as informative as possible, a dense genotyping array, the Immunochip, was developed, from which we identified four new T1D-associated regions (P < 5 × 10^-8). A comparative analysis with 15 immune diseases showed that T1D is more similar genetically to other autoantibody-positive diseases, significantly most similar to juvenile idiopathic arthritis and significantly least similar to ulcerative colitis, and provided support for three additional new T1D risk loci. Using a Bayesian approach, we defined credible sets for the T1D-associated SNPs. The associated SNPs localized to enhancer sequences active in thymus, T and B cells, and CD34+ stem cells. Enhancer-promoter interactions can now be analyzed in these cell types to identify which particular genes and regulatory sequences are causal.

    View details for DOI 10.1038/ng.3245

    View details for PubMedID 25751624

    View details for Web of Science ID 000351922900014

    View details for PubMedCentralID PMC4380767

  • Reassessment of Piwi Binding to the Genome and Piwi Impact on RNA Polymerase II Distribution DEVELOPMENTAL CELL Lin, H., Chen, M., Kundaje, A., Valouev, A., Yin, H., Liu, N., Neuenkirchen, N., Zhong, M., Snyder, M. 2015; 32 (6): 772-774


    Drosophila Piwi was reported by Huang et al. (2013) to be guided by piRNAs to piRNA-complementary sites in the genome, which then recruits heterochromatin protein 1a and histone methyltransferase Su(Var)3-9 to the sites. Among additional findings, Huang et al. (2013) also reported Piwi binding sites in the genome and the reduction of RNA polymerase II in euchromatin but its increase in pericentric regions in piwi mutants. Marinov et al. (2015) disputed the validity of the Huang et al. bioinformatic pipeline that led to the last two claims. Here we report our independent reanalysis of the data using current bioinformatic methods. Our reanalysis agrees with Marinov et al. (2015) that Piwi's genomic targets still remain to be identified but confirms the Huang et al. claim that Piwi influences RNA polymerase II distribution in the genome. This Matters Arising Response addresses the Marinov et al. (2015) Matters Arising, published concurrently in this issue of Developmental Cell.

    View details for DOI 10.1016/j.devcel.2015.03.004

    View details for PubMedID 25805139

  • A comparative encyclopedia of DNA elements in the mouse genome NATURE Yue, F., Cheng, Y., Breschi, A., Vierstra, J., Wu, W., Ryba, T., Sandstrom, R., Ma, Z., Davis, C., Pope, B. D., Shen, Y., Pervouchine, D. D., Djebali, S., Thurman, R. E., Kaul, R., Rynes, E., Kirilusha, A., Marinov, G. K., Williams, B. A., Trout, D., Amrhein, H., Fisher-Aylor, K., Antoshechkin, I., DeSalvo, G., See, L., Fastuca, M., Drenkow, J., Zaleski, C., Dobin, A., Prieto, P., Lagarde, J., Bussotti, G., Tanzer, A., Denas, O., Li, K., Bender, M. A., Zhang, M., Byron, R., Groudine, M. T., McCleary, D., Pham, L., Ye, Z., Kuan, S., Edsall, L., Wu, Y., Rasmussen, M. D., Bansal, M. S., Kellis, M., Keller, C. A., Morrissey, C. S., Mishra, T., Jain, D., Dogan, N., Harris, R. S., Cayting, P., Kawli, T., Boyle, A. P., Euskirchen, G., Kundaje, A., Lin, S., Lin, Y., Jansen, C., Malladi, V. S., Cline, M. S., Erickson, D. T., Kirkup, V. M., Learned, K., Sloan, C. A., Rosenbloom, K. R., De Sousa, B. L., Beal, K., Pignatelli, M., Flicek, P., Lian, J., Kahveci, T., Lee, D., Kent, W. J., Santos, M. R., Herrero, J., Notredame, C., Johnson, A., Vong, S., Lee, K., Bates, D., Neri, F., Diegel, M., Canfield, T., Sabo, P. J., Wilken, M. S., Reh, T. A., Giste, E., Shafer, A., Kutyavin, T., Haugen, E., Dunn, D., Reynolds, A. P., Neph, S., Humbert, R., Hansen, R. S., de Bruijn, M., Selleri, L., Rudensky, A., Josefowicz, S., Samstein, R., Eichler, E. E., Orkin, S. H., Levasseur, D., Papayannopoulou, T., Chang, K., Skoultchi, A., Gosh, S., Disteche, C., Treuting, P., Wang, Y., Weiss, M. J., Blobel, G. A., Cao, X., Zhong, S., Wang, T., Good, P. J., Lowdon, R. F., Adams, L. B., Zhou, X., Pazin, M. J., Feingold, E. A., Wold, B., Taylor, J., Mortazavi, A., Weissman, S. M., Stamatoyannopoulos, J. A., Snyder, M. P., Guigo, R., Gingeras, T. R., Gilbert, D. M., Hardison, R. C., Beer, M. A., Ren, B. 2014; 515 (7527): 355-364


    The laboratory mouse shares the majority of its protein-coding genes with humans, making it the premier model organism in biomedical research, yet the two mammals differ in significant ways. To gain greater insights into both shared and species-specific transcriptional and cellular regulatory programs in the mouse, the Mouse ENCODE Consortium has mapped transcription, DNase I hypersensitivity, transcription factor binding, chromatin modifications and replication domains throughout the mouse genome in diverse cell and tissue types. By comparing with the human genome, we not only confirm substantial conservation in the newly annotated potential functional sequences, but also find a large degree of divergence of sequences involved in transcriptional regulation, chromatin state and higher order chromatin organization. Our results illuminate the wide range of evolutionary forces acting on genes and their regulatory regions, and provide a general resource for research into mammalian biology and mechanisms of human diseases.

    View details for DOI 10.1038/nature13992

    View details for Web of Science ID 000345770600034

    View details for PubMedID 25409824

  • Principles of regulatory information conservation between mouse and human NATURE Cheng, Y., Ma, Z., Kim, B., Wu, W., Cayting, P., Boyle, A. P., Sundaram, V., Xing, X., Dogan, N., Li, J., Euskirchen, G., Lin, S., Lin, Y., Visel, A., Kawli, T., Yang, X., Patacsil, D., Keller, C. A., Giardine, B., Kundaje, A., Wang, T., Pennacchio, L. A., Weng, Z., Hardison, R. C., Snyder, M. P. 2014; 515 (7527): 371-375


    To broaden our understanding of the evolution of gene regulation mechanisms, we generated occupancy profiles for 34 orthologous transcription factors (TFs) in human-mouse erythroid progenitor, lymphoblast and embryonic stem-cell lines. By combining the genome-wide transcription factor occupancy repertoires, associated epigenetic signals, and co-association patterns, here we deduce several evolutionary principles of gene regulatory features operating since the mouse and human lineages diverged. The genomic distribution profiles, primary binding motifs, chromatin states, and DNA methylation preferences are well conserved for TF-occupied sequences. However, the extent to which orthologous DNA segments are bound by orthologous TFs varies both among TFs and with genomic location: binding at promoters is more highly conserved than binding at distal elements. Notably, occupancy-conserved TF-occupied sequences tend to be pleiotropic; they function in several tissues and also co-associate with many TFs. Single nucleotide variants at sites with potential regulatory functions are enriched in occupancy-conserved TF-occupied sequences.

    View details for DOI 10.1038/nature13985

    View details for Web of Science ID 000345770600036

    View details for PubMedCentralID PMC4343047

  • Transcription Factors Bind Negatively Selected Sites within Human mtDNA Genes GENOME BIOLOGY AND EVOLUTION Blumberg, A., Sailaja, B. S., Kundaje, A., Levin, L., Dadon, S., Shmorak, S., Shaulian, E., Meshorer, E., Mishmar, D. 2014; 6 (10): 2634-2646


    Transcription of mitochondrial DNA (mtDNA)-encoded genes is thought to be regulated by a handful of dedicated transcription factors (TFs), suggesting that mtDNA genes are separately regulated from the nucleus. However, several TFs, with known nuclear activities, were found to bind mtDNA and regulate mitochondrial transcription. Additionally, mtDNA transcriptional regulatory elements, which were proved important in vitro, were harbored by a deletion that normally segregated among healthy individuals. Hence, mtDNA transcriptional regulation is more complex than once thought. Here, by analyzing ENCODE chromatin immunoprecipitation sequencing (ChIP-seq) data, we identified strong binding sites of three bona fide nuclear TFs (c-Jun, Jun-D, and CEBPb) within human mtDNA protein-coding genes. We validated the binding of two TFs by ChIP-quantitative polymerase chain reaction (c-Jun and Jun-D) and showed their mitochondrial localization by electron microscopy and subcellular fractionation. As a step toward investigating the functionality of these TF-binding sites (TFBS), we assessed signatures of selection. By analyzing 9,868 human mtDNA sequences encompassing all major global populations, we recorded genetic variants in tips and nodes of mtDNA phylogeny within the TFBS. We next calculated the effects of variants on binding motif prediction scores. Finally, the mtDNA variation pattern in predicted TFBS, occurring within ChIP-seq negative-binding sites, was compared with ChIP-seq positive-TFBS (CPR). Motifs within CPRs of c-Jun, Jun-D, and CEBPb harbored either only tip variants or their nodal variants retained high motif prediction scores. This reflects negative selection within mtDNA CPRs, thus supporting their functionality. Hence, human mtDNA-coding sequences may have dual roles, namely coding for genes yet possibly also possessing regulatory potential.

    View details for DOI 10.1093/gbe/evu210

    View details for PubMedID 25245407

  • Comparative analysis of regulatory information and circuits across distant species. Nature Boyle, A. P., Araya, C. L., Brdlik, C., Cayting, P., Cheng, C., Cheng, Y., Gardner, K., Hillier, L. W., Janette, J., Jiang, L., Kasper, D., Kawli, T., Kheradpour, P., Kundaje, A., Li, J. J., Ma, L., Niu, W., Rehm, E. J., Rozowsky, J., Slattery, M., Spokony, R., Terrell, R., Vafeados, D., Wang, D., Weisdepp, P., Wu, Y., Xie, D., Yan, K., Feingold, E. A., Good, P. J., Pazin, M. J., Huang, H., Bickel, P. J., Brenner, S. E., Reinke, V., Waterston, R. H., Gerstein, M., White, K. P., Kellis, M., Snyder, M. 2014; 512 (7515): 453-456


    Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.

    View details for DOI 10.1038/nature13668

    View details for PubMedID 25164757

  • Comparative analysis of metazoan chromatin organization. Nature Ho, J. W., Jung, Y. L., Liu, T., Alver, B. H., Lee, S., Ikegami, K., Sohn, K., Minoda, A., Tolstorukov, M. Y., Appert, A., Parker, S. C., Gu, T., Kundaje, A., Riddle, N. C., Bishop, E., Egelhofer, T. A., Hu, S. S., Alekseyenko, A. A., Rechtsteiner, A., Asker, D., Belsky, J. A., Bowman, S. K., Chen, Q. B., Chen, R. A., Day, D. S., Dong, Y., Dose, A. C., Duan, X., Epstein, C. B., Ercan, S., Feingold, E. A., Ferrari, F., Garrigues, J. M., Gehlenborg, N., Good, P. J., Haseley, P., He, D., Herrmann, M., Hoffman, M. M., Jeffers, T. E., Kharchenko, P. V., Kolasinska-Zwierz, P., Kotwaliwale, C. V., Kumar, N., Langley, S. A., Larschan, E. N., Latorre, I., Libbrecht, M. W., Lin, X., Park, R., Pazin, M. J., Pham, H. N., Plachetka, A., Qin, B., Schwartz, Y. B., Shoresh, N., Stempor, P., Vielle, A., Wang, C., Whittle, C. M., Xue, H., Kingston, R. E., Kim, J. H., Bernstein, B. E., Dernburg, A. F., Pirrotta, V., Kuroda, M. I., Noble, W. S., Tullius, T. D., Kellis, M., MacAlpine, D. M., Strome, S., Elgin, S. C., Liu, X. S., Lieb, J. D., Ahringer, J., Karpen, G. H., Park, P. J. 2014; 512 (7515): 449-452


    Genome function is dynamically regulated in part by chromatin, which consists of the histones, non-histone proteins and RNA molecules that package DNA. Studies in Caenorhabditis elegans and Drosophila melanogaster have contributed substantially to our understanding of molecular mechanisms of genome function in humans, and have revealed conservation of chromatin components and mechanisms. Nevertheless, the three organisms have markedly different genome sizes, chromosome architecture and gene organization. On human and fly chromosomes, for example, pericentric heterochromatin flanks single centromeres, whereas worm chromosomes have dispersed heterochromatin-like regions enriched in the distal chromosomal 'arms', and centromeres distributed along their lengths. To systematically investigate chromatin organization and associated gene regulation across species, we generated and analysed a large collection of genome-wide chromatin data sets from cell lines and developmental stages in worm, fly and human. Here we present over 800 new data sets from our ENCODE and modENCODE consortia, bringing the total to over 1,400. Comparison of combinatorial patterns of histone modifications, nuclear lamina-associated domains, organization of large-scale topological domains, chromatin environment at promoters and enhancers, nucleosome positioning, and DNA replication patterns reveals many conserved features of chromatin organization among the three organisms. We also find notable differences in the composition and locations of repressive chromatin. These data sets and analyses provide a rich resource for comparative and species-specific investigations of chromatin composition, organization and function.

    View details for DOI 10.1038/nature13415

    View details for PubMedID 25164756

  • Regulatory analysis of the C. elegans genome with spatiotemporal resolution. Nature Araya, C. L., Kawli, T., Kundaje, A., Jiang, L., Wu, B., Vafeados, D., Terrell, R., Weissdepp, P., Gevirtzman, L., Mace, D., Niu, W., Boyle, A. P., Xie, D., Ma, L., Murray, J. I., Reinke, V., Waterston, R. H., Snyder, M. 2014; 512 (7515): 400-405


    Discovering the structure and dynamics of transcriptional regulatory events in the genome with cellular and temporal resolution is crucial to understanding the regulatory underpinnings of development and disease. We determined the genomic distribution of binding sites for 92 transcription factors and regulatory proteins across multiple stages of Caenorhabditis elegans development by performing 241 ChIP-seq (chromatin immunoprecipitation followed by sequencing) experiments. Integration of regulatory binding and cellular-resolution expression data produced a spatiotemporally resolved metazoan transcription factor binding map. Using this map, we explore developmental regulatory circuits that encode combinatorial logic at the levels of co-binding and co-expression of transcription factors, characterizing the genomic coverage and clustering of regulatory binding, the binding preferences of, and biological processes regulated by, transcription factors, the global transcription factor co-associations and genomic subdomains that suggest shared patterns of regulation, and identifying key transcription factors and transcription factor co-associations for fate specification of individual lineages and cell types.

    View details for DOI 10.1038/nature13497

    View details for PubMedID 25164749

  • Reply to Brunet and Doolittle: Both selected effect and causal role elements can influence human biology and disease PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Kellis, M., Wold, B., Snyder, M. P., Bernstein, B. E., Kundaje, A., Marinov, G. K., Ward, L. D., Birney, E., Crawford, G. E., Dekker, J., Dunham, I., Elnitski, L. L., Farnham, P. J., Feingold, E. A., Gerstein, M., Giddings, M. C., Gilbert, D. M., Gingeras, T. R., Green, E. D., Guigo, R., Hubbard, T., Kent, J., Lieb, J. D., Myers, R. M., Pazin, M. J., Ren, B., Stamatoyannopoulos, J., Weng, Z., White, K. P., Hardison, R. C. 2014; 111 (33): E3366-E3366

    View details for DOI 10.1073/pnas.1410434111

    View details for Web of Science ID 000340438800004

    View details for PubMedID 25275169

    View details for PubMedCentralID PMC4143047

  • H3K4me3 Breadth Is Linked to Cell Identity and Transcriptional Consistency. Cell Benayoun, B. A., Pollina, E. A., Ucar, D., Mahmoudi, S., Karra, K., Wong, E. D., Devarajan, K., Daugherty, A. C., Kundaje, A. B., Mancini, E., Hitz, B. C., Gupta, R., Rando, T. A., Baker, J. C., Snyder, M. P., Cherry, J. M., Brunet, A. 2014; 158 (3): 673-688


    Trimethylation of histone H3 at lysine 4 (H3K4me3) is a chromatin modification known to mark the transcription start sites of active genes. Here, we show that H3K4me3 domains that spread more broadly over genes in a given cell type preferentially mark genes that are essential for the identity and function of that cell type. Using the broadest H3K4me3 domains as a discovery tool in neural progenitor cells, we identify novel regulators of these cells. Machine learning models reveal that the broadest H3K4me3 domains represent a distinct entity, characterized by increased marks of elongation. The broadest H3K4me3 domains also have more paused polymerase at their promoters, suggesting a unique transcriptional output. Indeed, genes marked by the broadest H3K4me3 domains exhibit enhanced transcriptional consistency and increased transcriptional levels, and perturbation of H3K4me3 breadth leads to changes in transcriptional consistency. Thus, H3K4me3 breadth contains information that could ensure transcriptional precision at key cell identity/function genes.

    View details for DOI 10.1016/j.cell.2014.06.027

    View details for PubMedID 25083876

  • Diverse patterns of genomic targeting by transcriptional regulators in Drosophila melanogaster GENOME RESEARCH Slattery, M., Ma, L., Spokony, R. F., Arthur, R. K., Kheradpour, P., Kundaje, A., Negre, N., Crofts, A., Ptashkin, R., Zieba, J., Ostapenko, A., Suchy, S., Victorsen, A., Jameel, N., Grundstad, A., Gao, W., Moran, J. R., Rehm, E., Grossman, R. L., Kellis, M., White, K. P. 2014; 24 (7): 1224-1235


    Annotation of regulatory elements and identification of the transcription-related factors (TRFs) targeting these elements are key steps in understanding how cells interpret their genetic blueprint and their environment during development, and how that process goes awry in the case of disease. One goal of the modENCODE (model organism ENCyclopedia of DNA Elements) Project is to survey a diverse sampling of TRFs, both DNA-binding and non-DNA-binding factors, to provide a framework for the subsequent study of the mechanisms by which transcriptional regulators target the genome. Here we provide an updated map of the Drosophila melanogaster regulatory genome based on the location of 84 TRFs at various stages of development. This regulatory map reveals a variety of genomic targeting patterns, including factors with strong preferences toward proximal promoter binding, factors that target intergenic and intronic DNA, and factors with distinct chromatin state preferences. The data also highlight the stringency of the Polycomb regulatory network, and show association of the Trithorax-like (Trl) protein with hotspots of DNA binding throughout development. Furthermore, the data identify more than 5800 instances in which TRFs target DNA regions with demonstrated enhancer activity. Regions of high TRF co-occupancy are more likely to be associated with open enhancers used across cell types, while lower TRF occupancy regions are associated with complex enhancers that are also regulated at the epigenetic level. Together these data serve as a resource for the research community in the continued effort to dissect transcriptional regulatory mechanisms directing Drosophila development.

    View details for DOI 10.1101/gr.168807.113

    View details for Web of Science ID 000338185000015

    View details for PubMedID 24985916

    View details for PubMedCentralID PMC4079976

  • Defining functional DNA elements in the human genome PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Kellis, M., Wold, B., Snyder, M. P., Bernstein, B. E., Kundaje, A., Marinov, G. K., Ward, L. D., Birney, E., Crawford, G. E., Dekker, J., Dunham, I., Elnitski, L. L., Farnham, P. J., Feingold, E. A., Gerstein, M., Giddings, M. C., Gilbert, D. M., Gingeras, T. R., Green, E. D., Guigo, R., Hubbard, T., Kent, J., Lieb, J. D., Myers, R. M., Pazin, M. J., Ren, B., Stamatoyannopoulos, J. A., Weng, Z., White, K. P., Hardison, R. C. 2014; 111 (17): 6131-6138


    With the completion of the human genome sequence, attention turned to identifying and annotating its functional DNA elements. As a complement to genetic and comparative genomics approaches, the Encyclopedia of DNA Elements Project was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types. The resulting genome-wide data reveal sites of biochemical activity with high positional resolution and cell type specificity that facilitate studies of gene regulation and interpretation of noncoding variants associated with human disease. However, the biochemically active regions cover a much larger fraction of the genome than do evolutionarily conserved regions, raising the question of whether nonconserved but biochemically active regions are truly functional. Here, we review the strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies. We also analyze the relationship between signal intensity, genomic coverage, and evolutionary conservation. Our results reinforce the principle that each approach provides complementary information and that we need to use combinations of all three to elucidate genome function in human biology and disease.

    View details for DOI 10.1073/pnas.1318948111

    View details for Web of Science ID 000335199000025

    View details for PubMedID 24753594

    View details for PubMedCentralID PMC4035993

  • Large-Scale Quality Analysis of Published ChIP-seq Data. G3 (Bethesda, Md.) Marinov, G. K., Kundaje, A., Park, P. J., Wold, B. J. 2014; 4 (2): 209-223


    ChIP-seq has become the primary method for identifying in vivo protein-DNA interactions on a genome-wide scale, with nearly 800 publications involving the technique appearing in PubMed as of December 2012. Individually and in aggregate, these data are an important and information-rich resource. However, uncertainties about data quality confound their use by the wider research community. Recently, the Encyclopedia of DNA Elements (ENCODE) project developed and applied metrics to objectively measure ChIP-seq data quality. The ENCODE quality analysis was useful for flagging datasets for closer inspection, eliminating or replacing poor data, and for driving changes in experimental pipelines. There had been no similarly systematic quality analysis of the large and disparate body of published ChIP-seq profiles. Here, we report a uniform analysis of vertebrate transcription factor ChIP-seq datasets in the Gene Expression Omnibus (GEO) repository as of April 1, 2012. The majority (55%) of datasets scored as being highly successful, but a substantial minority (20%) were of apparently poor quality, and another ∼25% were of intermediate quality. We discuss how different uses of ChIP-seq data are affected by specific aspects of data quality, and we highlight exceptional instances for which the metric values should not be taken at face value. Unexpectedly, we discovered that a significant subset of control datasets (i.e., no immunoprecipitation and mock immunoprecipitation samples) display an enrichment structure similar to successful ChIP-seq data. This can, in turn, affect peak calling and data interpretation. Published datasets identified here as high-quality comprise a large group that users can draw on for large-scale integrated analysis. In the future, ChIP-seq quality assessment similar to that used here could guide experimentalists at early stages in a study, provide useful input in the publication process, and be used to stratify ChIP-seq data for different community-wide uses.

    View details for DOI 10.1534/g3.113.008680

    View details for PubMedID 24347632

    View details for PubMedCentralID PMC3931556

  • Principles of regulatory information conservation between mouse and human. Nature Cheng, Y., Ma, Z., Kim, B. H., Wu, W., Cayting, P., Boyle, A. P., Sundaram, V., Xing, X., Dogan, N., Li, J., Euskirchen, G., Lin, S., Lin, Y., Visel, A., Kawli, T., Yang, X., Patacsil, D., Keller, C. A., Giardine, B., Kundaje, A., Wang, T., Pennacchio, L. A., Weng, Z., Hardison, R. C., Snyder, M. P. 2014; 515 (7527): 371-375


    To broaden our understanding of the evolution of gene regulation mechanisms, we generated occupancy profiles for 34 orthologous transcription factors (TFs) in human-mouse erythroid progenitor, lymphoblast and embryonic stem-cell lines. By combining the genome-wide transcription factor occupancy repertoires, associated epigenetic signals, and co-association patterns, here we deduce several evolutionary principles of gene regulatory features operating since the mouse and human lineages diverged. The genomic distribution profiles, primary binding motifs, chromatin states, and DNA methylation preferences are well conserved for TF-occupied sequences. However, the extent to which orthologous DNA segments are bound by orthologous TFs varies both among TFs and with genomic location: binding at promoters is more highly conserved than binding at distal elements. Notably, occupancy-conserved TF-occupied sequences tend to be pleiotropic; they function in several tissues and also co-associate with many TFs. Single nucleotide variants at sites with potential regulatory functions are enriched in occupancy-conserved TF-occupied sequences.

    View details for PubMedID 25409826

  • STAT3 Targets Suggest Mechanisms of Aggressive Tumorigenesis in Diffuse Large B-Cell Lymphoma G3-GENES GENOMES GENETICS Hardee, J., Ouyang, Z., Zhang, Y., Kundaje, A., Lacroute, P., Snyder, M. 2013; 3 (12): 2173-2185


    The signal transducer and activator of transcription 3 (STAT3) is a transcription factor that, when dysregulated, becomes a powerful oncogene found in many human cancers, including diffuse large B-cell lymphoma. Diffuse large B-cell lymphoma is the most common form of non-Hodgkin's lymphoma and has two major subtypes: germinal center B-cell-like and activated B-cell-like. Compared with the germinal center B-cell-like form, activated B-cell-like lymphomas respond much more poorly to current therapies and often exhibit overexpression or overactivation of STAT3. To investigate how STAT3 might contribute to this aggressive phenotype, we have integrated genome-wide studies of STAT3 DNA binding using chromatin immunoprecipitation-sequencing with whole-transcriptome profiling using RNA-sequencing. STAT3 binding sites are present near almost a third of all genes that differ in expression between the two subtypes, and examination of the affected genes identified previously undetected and clinically significant pathways downstream of STAT3 that drive oncogenesis. Novel treatments aimed at these pathways may increase the survivability of activated B-cell-like diffuse large B-cell lymphoma.

    View details for DOI 10.1534/g3.113.007674

    View details for PubMedID 24142927

  • Extensive Variation in Chromatin States Across Humans SCIENCE Kasowski, M., Kyriazopoulou-Panagiotopoulou, S., Grubert, F., Zaugg, J. B., Kundaje, A., Liu, Y., Boyle, A. P., Zhang, Q. C., Zakharia, F., Spacek, D. V., Li, J., Xie, D., Olarerin-George, A., Steinmetz, L. M., Hogenesch, J. B., Kellis, M., Batzoglou, S., Snyder, M. 2013; 342 (6159): 750-752


    The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.

    View details for DOI 10.1126/science.1242510

    View details for PubMedID 24136358

  • Integrative annotation of chromatin elements from ENCODE data NUCLEIC ACIDS RESEARCH Hoffman, M. M., Ernst, J., Wilder, S. P., Kundaje, A., Harris, R. S., Libbrecht, M., Giardine, B., Ellenbogen, P. M., Bilmes, J. A., Birney, E., Hardison, R. C., Dunham, I., Kellis, M., Noble, W. S. 2013; 41 (2): 827-841


    The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotation of non-coding regulatory elements correlate strongly with mammalian evolutionary constraint, and provide an unbiased approach for evaluating metrics of evolutionary constraint in human. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape.

    View details for DOI 10.1093/nar/gks1284

    View details for Web of Science ID 000314121100021

    View details for PubMedID 23221638

    View details for PubMedCentralID PMC3553955

  • Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors Wang, J., Zhuang, J., Iyer, S., Lin, X., Whitfield, T. W., Greven, M. C., Pierce, B. G., Dong, X., Kundaje, A., Cheng, Y., Rando, O. J., Birney, E., Myers, R. M., Noble, W. S., Snyder, M., Weng, Z. TAYLOR & FRANCIS INC. 2013: 49-50
  • Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors GENOME RESEARCH Wang, J., Zhuang, J., Iyer, S., Lin, X., Whitfield, T. W., Greven, M. C., Pierce, B. G., Dong, X., Kundaje, A., Cheng, Y., Rando, O. J., Birney, E., Myers, R. M., Noble, W. S., Snyder, M., Weng, Z. 2012; 22 (9): 1798-1812


    Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has become the dominant technique for mapping transcription factor (TF) binding regions genome-wide. We performed an integrative analysis centered around 457 ChIP-seq data sets on 119 human TFs generated by the ENCODE Consortium. We identified highly enriched sequence motifs in most data sets, revealing new motifs and validating known ones. The motif sites (TF binding sites) are highly conserved evolutionarily and show distinct footprints upon DNase I digestion. We frequently detected secondary motifs in addition to the canonical motifs of the TFs, indicating tethered binding and cobinding between multiple TFs. We observed significant position and orientation preferences between many cobinding TFs. Genes specifically expressed in a cell line are often associated with a greater occurrence of nearby TF binding in that cell line. We observed cell-line-specific secondary motifs that mediate the binding of the histone deacetylase HDAC2 and the enhancer-binding protein EP300. TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions, flanked by well-positioned nucleosomes, and many of these features show cell type specificity. The GC-richness may be beneficial for regulating TF binding because, when unoccupied by a TF, these regions are occupied by nucleosomes in vivo. We present the results of our analysis in a TF-centric web repository, Factorbook, and will continually update this repository as more ENCODE data are generated.

    View details for DOI 10.1101/gr.139105.112

    View details for Web of Science ID 000308272800020

    View details for PubMedID 22955990

    View details for PubMedCentralID PMC3431495

  • Long noncoding RNAs are rarely translated in two human cell lines GENOME RESEARCH Banfai, B., Jia, H., Khatun, J., Wood, E., Risk, B., Gundling, W. E., Kundaje, A., Gunawardena, H. P., Yu, Y., Xie, L., Krajewski, K., Strahl, B. D., Chen, X., Bickel, P., Giddings, M. C., Brown, J. B., Lipovich, L. 2012; 22 (9): 1646-1657


    Data from the Encyclopedia of DNA Elements (ENCODE) project show over 9640 human genome loci classified as long noncoding RNAs (lncRNAs), yet only ~100 have been deeply characterized to determine their role in the cell. To measure the protein-coding output from these RNAs, we jointly analyzed two recent data sets produced in the ENCODE project: tandem mass spectrometry (MS/MS) data mapping expressed peptides to their encoding genomic loci, and RNA-seq data generated by ENCODE in long polyA+ and polyA- fractions in the cell lines K562 and GM12878. We used the machine-learning algorithm RuleFit3 to regress the peptide data against RNA expression data. The most important covariate for predicting translation was, surprisingly, the Cytosol polyA- fraction in both cell lines. LncRNAs are ~13-fold less likely to produce detectable peptides than similar mRNAs, indicating that ~92% of GENCODE v7 lncRNAs are not translated in these two ENCODE cell lines. Intersecting 9640 lncRNA loci with 79,333 peptides yielded 85 unique peptides matching 69 lncRNAs. Most cases were due to a coding transcript misannotated as lncRNA. Two exceptions were an unprocessed pseudogene and a bona fide lncRNA gene, both with open reading frames (ORFs) compromised by upstream stop codons. All potentially translatable lncRNA ORFs had only a single peptide match, indicating low protein abundance and/or false-positive peptide matches. We conclude that with very few exceptions, ribosomes are able to distinguish coding from noncoding transcripts and, hence, that ectopic translation and cryptic mRNAs are rare in the human lncRNAome.

    View details for DOI 10.1101/gr.134767.111

    View details for Web of Science ID 000308272800007

    View details for PubMedID 22955977

    View details for PubMedCentralID PMC3431482

  • Modeling gene expression using chromatin features in various cellular contexts GENOME BIOLOGY Dong, X., Greven, M. C., Kundaje, A., Djebali, S., Brown, J. B., Cheng, C., Gingeras, T. R., Gerstein, M., Guigo, R., Birney, E., Weng, Z. 2012; 13 (9)


    Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines. We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA. Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.

    View details for DOI 10.1186/gb-2012-13-9-r53

    View details for Web of Science ID 000313182600006

    View details for PubMedID 22950368

    View details for PubMedCentralID PMC3491397

  • Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors GENOME BIOLOGY Yip, K. Y., Cheng, C., Bhardwaj, N., Brown, J. B., Leng, J., Kundaje, A., Rozowsky, J., Birney, E., Bickel, P., Snyder, M., Gerstein, M. 2012; 13 (9)


    Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors. As part of the consortium effort in providing a concise abstraction of the data for facilitating various types of downstream analyses, we constructed statistical models that capture the genomic features of three paired types of regions by machine-learning methods: firstly, regions with active or inactive binding; secondly, those with extremely high or low degrees of co-binding, termed HOT and LOT regions; and finally, regulatory modules proximal or distal to genes. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, many of which were validated experimentally. We further associated the predicted enhancers with potential target transcripts and the transcription factors involved. For HOT regions, we found a significant fraction of transcription factor binding without clear sequence motifs and showed that this observation could be related to strong DNA accessibility of these regions. Overall, the three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity. Our machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.

    View details for DOI 10.1186/gb-2012-13-9-r48

    View details for Web of Science ID 000313182600001

    View details for PubMedID 22950945

    View details for PubMedCentralID PMC3491392

  • A User's Guide to the Encyclopedia of DNA Elements (ENCODE) PLOS BIOLOGY Myers, R. M., Stamatoyannopoulos, J., Snyder, M., Dunham, I., Hardison, R. C., Bernstein, B. E., Gingeras, T. R., Kent, W. J., Birney, E., Wold, B., Crawford, G. E., Bernstein, B. E., Epstein, C. B., Shoresh, N., Ernst, J., Mikkelsen, T. S., Kheradpour, P., Zhang, X., Wang, L., Issner, R., Coyne, M. J., Durham, T., Ku, M., Thanh Truong, T., Ward, L. D., Altshuler, R. C., Lin, M. F., Kellis, M., Gingeras, T. R., Davis, C. A., Kapranov, P., Dobin, A., Zaleski, C., Schlesinger, F., Batut, P., Chakrabortty, S., Jha, S., Lin, W., Drenkow, J., Wang, H., Bell, K., Gao, H., Bell, I., Dumais, E., Dumais, J., Antonarakis, S. E., Ucla, C., Borel, C., Guigo, R., Djebali, S., Lagarde, J., Kingswood, C., Ribeca, P., Sammeth, M., Alioto, T., Merkel, A., Tilgner, H., Carninci, P., Hayashizaki, Y., Lassmann, T., Takahashi, H., Abdelhamid, R. F., Hannon, G., Fejes-Toth, K., Preall, J., Gordon, A., Sotirova, V., Reymond, A., Howald, C., Graison, E. A., Chrast, J., Ruan, Y., Ruan, X., Shahab, A., Poh, W. T., Wei, C., Crawford, G. E., Furey, T. S., Boyle, A. P., Sheffield, N. C., Song, L., Shibata, Y., Vales, T., Winter, D., Zhang, Z., London, D., Wang, T., Birney, E., Keefe, D., Iyer, V. R., Lee, B., McDaniell, R. M., Liu, Z., Battenhouse, A., Bhinge, A. A., Lieb, J. D., Grasfeder, L. L., Showers, K. A., Giresi, P. G., Kim, S. K., Shestak, C., Myers, R. M., Pauli, F., Reddy, T. E., Gertz, J., Partridge, E. C., Jain, P., Sprouse, R. O., Bansal, A., Pusey, B., Muratet, M. A., Varley, K. E., Bowling, K. M., Newberry, K. M., Nesmith, A. S., Dilocker, J. A., Parker, S. L., Waite, L. L., Thibeault, K., Roberts, K., Absher, D. M., Wold, B., Mortazavi, A., Williams, B., Marinov, G., Trout, D., Pepke, S., King, B., McCue, K., Kirilusha, A., DeSalvo, G., Fisher-Aylor, K., Amrhein, H., Vielmetter, J., Sherlock, G., Sidow, A., Batzoglou, S., Rauch, R., Kundaje, A., Libbrecht, M., Margulies, E. H., Parker, S. C., Elnitski, L., Green, E. D., Hubbard, T., Harrow, J., Searle, S., Kokocinski, F., Aken, B., Frankish, A., Hunt, T., Despacio-Reyes, G., Kay, M., Mukherjee, G., Bignell, A., Saunders, G., Boychenko, V., Brent, M., van Baren, M. J., Brown, R. H., Gerstein, M., Khurana, E., Balasubramanian, S., Zhang, Z., Lam, H., Cayting, P., Robilotto, R., Lu, Z., Guigo, R., Derrien, T., Tanzer, A., Knowles, D. G., Mariotti, M., Kent, W. J., Haussler, D., Harte, R., Diekhans, M., Kellis, M., Lin, M., Kheradpour, P., Ernst, J., Reymond, A., Howald, C., Graison, E. A., Chrast, J., Valencia, A., Tress, M., Manuel Rodriguez, J., Snyder, M., Landt, S. G., Raha, D., Shi, M., Euskirchen, G., Grubert, F., Kasowski, M., Lian, J., Cayting, P., Lacroute, P., Xu, Y., Monahan, H., Patacsil, D., Slifer, T., Yang, X., Charos, A., Reed, B., Wu, L., Auerbach, R. K., Habegger, L., Hariharan, M., Rozowsky, J., Abyzov, A., Weissman, S. M., Gerstein, M., Struhl, K., Lamarre-Vincent, N., Lindahl-Allen, M., Miotto, B., Moqtaderi, Z., Fleming, J. D., Newburger, P., Farnham, P. J., Frietze, S., O'Geen, H., Xu, X., Blahnik, K. R., Cao, A. R., Iyengar, S., Stamatoyannopoulos, J. A., Kaul, R., Thurman, R. E., Wang, H., Navas, P. A., Sandstrom, R., Sabo, P. J., Weaver, M., Canfield, T., Lee, K., Neph, S., Roach, V., Reynolds, A., Johnson, A., Rynes, E., Giste, E., Vong, S., Neri, J., Frum, T., Johnson, E. M., Nguyen, E. D., Ebersol, A. K., Sanchez, M. E., Sheffer, H. H., Lotakis, D., Haugen, E., Humbert, R., Kutyavin, T., Shafer, T., Dekker, J., Lajoie, B. R., Sanyal, A., Kent, W. J., Rosenbloom, K. R., Dreszer, T. R., Raney, B. J., Barber, G. P., Meyer, L. R., Sloan, C. A., Malladi, V. S., Cline, M. S., Learned, K., Swing, V. K., Zweig, A. S., Rhead, B., Fujita, P. A., Roskin, K., Karolchik, D., Kuhn, R. M., Haussler, D., Birney, E., Dunham, I., Wilder, S. P., Keefe, D., Sobral, D., Herrero, J., Beal, K., Lukk, M., Brazma, A., Vaquerizas, J. M., Luscombe, N. M., Bickel, P. J., Boley, N., Brown, J. B., Li, Q., Huang, H., Gerstein, M., Habegger, L., Sboner, A., Rozowsky, J., Auerbach, R. K., Yip, K. Y., Cheng, C., Yan, K., Bhardwaj, N., Wang, J., Lochovsky, L., Jee, J., Gibson, T., Leng, J., Du, J., Hardison, R. C., Harris, R. S., Song, G., Miller, W., Haussler, D., Roskin, K., Suh, B., Wang, T., Paten, B., Noble, W. S., Hoffman, M. M., Buske, O. J., Weng, Z., Dong, X., Wang, J., Xi, H., Tenenbaum, S. A., Doyle, F., Penalva, L. O., Chittur, S., Tullius, T. D., Parker, S. C., White, K. P., Karmakar, S., Victorsen, A., Jameel, N., Bild, N., Grossman, R. L., Snyder, M., Landt, S. G., Yang, X., Patacsil, D., Slifer, T., Dekker, J., Lajoie, B. R., Sanyal, A., Weng, Z., Whitfield, T. W., Wang, J., Collins, P. J., Trinklein, N. D., Partridge, E. C., Myers, R. M., Giddings, M. C., Chen, X., Khatun, J., Maier, C., Yu, Y., Gunawardena, H., Risk, B., Feingold, E. A., Lowdon, R. F., Dillon, L. A., Good, P. J. 2011; 9 (4)


    The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.

    View details for DOI 10.1371/journal.pbio.1001046

    View details for Web of Science ID 000289938900014

  • CP motifs, Hap1 and heme signaling Zhang, L., Leslie, C., Lee, H. C., Kundaje, A., Ie, E., Xin, X., Freund, Y., MEDIMOND MEDIMOND S R L. 2007: 45-+
  • A classification-based framework for predicting and analyzing gene regulatory response NIPS Workshop on New Problems and Methods in Computational Biology Kundaje, A., Middendorf, M., Shah, M., Wiggins, C. H., Freund, Y., Leslie, C. BIOMED CENTRAL LTD. 2006


    We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem--predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the Adaboost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree. In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a more shallow and interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data. Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast--the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors--and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code is available for download from

    View details for DOI 10.1186/1471-2105-7-S1-S5

    View details for Web of Science ID 000236765200005

    View details for PubMedID 16723008

    View details for PubMedCentralID PMC1810316

  • Statistical analysis of MPSS measurements: Application to the study of LPS-activated macrophage gene expression PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Stolovitzky, G. A., Kundaje, A., Held, G. A., Duggar, K. H., Haudenschild, C. D., Zhou, D., Vasicek, T. J., Smith, K. D., Aderem, A., Roach, J. C. 2005; 102 (5): 1402-1407


    Massively Parallel Signature Sequencing (MPSS), a recently developed high-throughput transcription profiling technology, has the ability to profile almost every transcript in a sample without requiring prior knowledge of the sequence of the transcribed genes. As is the case with DNA microarrays, effective data analysis depends crucially on understanding how noise affects measurements. We analyze the sources of noise in MPSS and present a quantitative model describing the variability between replicate MPSS assays. We use this model to construct statistical hypotheses that test whether an observed change in gene expression in a pair-wise comparison is significant. This analysis is then extended to the determination of the significance of changes in expression levels measured over the course of a time series of measurements. We apply these analytic techniques to the study of a time series of MPSS gene expression measurements on LPS-stimulated macrophages. To evaluate our statistical significance metrics, we compare our results with published data on macrophage activation measured by using Affymetrix GeneChips.

    View details for DOI 10.1073/pnas.0406555102

    View details for Web of Science ID 000226877300029

    View details for PubMedID 15668391

    View details for PubMedCentralID PMC547838

  • Motif discovery through predictive modeling of gene regulation 9th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2005) Middendorf, M., Kundaje, A., Shah, M., Freund, Y., Wiggins, C. H., Leslie, C. SPRINGER-VERLAG BERLIN. 2005: 538–552
  • Predicting genetic regulatory response using classification: Yeast stress response 1st Annual RECOMB Satellite Workshop on Regulatory Genomics Middendorf, M., Kundaje, A., Wiggins, C., Freund, Y., Leslie, C. SPRINGER-VERLAG BERLIN. 2005: 1–13
  • Predicting genetic regulatory response using classification BIOINFORMATICS Middendorf, M., Kundaje, A., Wiggins, C., Freund, Y., Leslie, C. 2004; 20: 232-240
  • Support vector machine (SVM) classification of multifocal visual evoked potential responses (mfVEP) from Glaucoma patients. Baroumand, F., Kundaje, A. B., Zhang, Leslie, C., Hood, D. C. ASSOC RESEARCH VISION OPHTHALMOLOGY INC. 2004: U106
  • Spectrogram analysis of genomes EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING Sussillo, D., Kundaje, A., Anastassiou, D. 2004; 2004 (1): 29-42