Anshul Kundaje's Profile | Stanford Profiles

Bio

Anshul Kundaje is Associate Professor of Genetics and Computer Science at Stanford University. The Kundaje lab develops machine learning models of gene regulation to decipher the genetic and molecular basis of disease. The lab has pioneered deep learning models and interpretation frameworks to decode the functional language encoded in DNA, RNA and proteins. Dr. Kundaje has led computational efforts of large genomics consortia including the ENCODE Project and the Roadmap Epigenomics Project. Dr. Kundaje is a recipient of the NIH Director's New Innovator Award, the Alfred Sloan Fellowship and the HUGO Chen Award of Excellence.

Academic Appointments

Associate Professor, Genetics
Associate Professor, Computer Science
Member, Bio-X
Faculty Affiliate, Institute for Human-Centered Artificial Intelligence (HAI)
Member, Wu Tsai Human Performance Alliance
Member, Maternal & Child Health Research Institute (MCHRI)
Member, Wu Tsai Neurosciences Institute

Honors & Awards

HUGO Chen Award of Excellence, Human Genome Organization (2019)
NIH Director's New Innovator Award, NIH (2016)
Alfred Sloan Foundation Research Fellowship, Alfred Sloan Foundation (2014-2016)

Boards, Advisory Committees, Professional Organizations

Advisor, National Human Genome Research Institute Genomic Data Science Working Group (2021 - Present)
Editorial Board, Journal of Computational Biology (2021 - Present)
Editorial Board, Genome Research (2020 - Present)
Advisor, NIH Director's Advisory Committee for Artificial Intelligence in Biomedical Research (2019 - 2021)

Current Research and Scholarly Interests

My laboratory develops innovative machine learning methods to predict and decode biological sequences, molecular interactions, and genetic variation. We have pioneered deep learning models and interpretation frameworks that decode DNA and RNA sequence syntax governing context-specific transcription factor binding, RNA binding protein interactions, chromatin accessibility, histone modifications, transcription initiation, gene expression, alternative polyadenylation, and RNA editing. Using these approaches, we have built regulatory models across thousands of cellular contexts in humans and mice, elucidating dynamic regulation during differentiation and cellular reprogramming. Our methodological contributions span regulatory element mapping, deciphering the cis-regulatory code, long-range regulatory interaction modeling, and predictive regulatory network construction. We have adapted protein language models to predict and design transcription factor effector domains and developed machine learning frameworks leveraging T-cell and B-cell repertoire sequences for disease diagnostics.

I have extensive leadership experience in collaborative genomics consortia. As principal investigator, I led integrative analyses for the Encyclopedia of DNA Elements (ENCODE) consortium and the Roadmap Epigenomics Project. Currently, I serve as steering committee co-chair of the Impact of Genomic Variation on Function (IGVF) consortium and co-lead the Data Analysis and Coordination Center for the Multi-omics in Health and Disease (MOHD) consortium. My team has developed standardized processing and quality control pipelines for bulk and single-cell molecular profiling data across ENCODE, Roadmap, IGVF, and MOHD initiatives.

Translating our regulatory models to biomedical applications, we dissect functional genetic variation in rare and complex diseases using large biobanks and genome sequencing projects. Our disease-focused collaborations span colorectal cancer (GECCO and HTAN consortia), cardiometabolic disorders (AMP-CMD, CZI Seed networks), neurodegenerative diseases (ADSP consortium), and neuropsychiatric conditions (PsychENCODE consortium).

We have also developed widely used software tools and web portals for mining and visualizing large-scale regulatory genomics data, facilitating community access to our resources and findings.

I have successfully mentored over 35 graduate students and postdocs who have gone on to leadership positions in academia (faculty at Carnegie Mellon, Michigan State, Memorial Sloan Kettering) and industry (Genentech, Illumina, NVIDIA), demonstrating our lab's commitment to training the next generation of computational biologists.

Projects

The Encyclopedia of DNA Elements (ENCODE) Project, Stanford University, MIT
The project generates a resource of cell-type specific genome-wide regulatory maps in the human genome. We develop statistical processing methods for next-generation sequencing-based functional genomic data and machine learning methods to predict regulatory events, learn combinatorial regulatory effects of transcription factors, and infer cell-type specific regulatory networks.

Location

Stanford, CA

For More Information:
- The ENCODE data portal
- The ENCODE portal at nature.com

The Roadmap Epigenomics Project, MIT (February 2012 - Present)

The project generates genome-wide epigenomic maps in 200 human cell types. We develop computational methods and analyses to infer cell-type specific regulatory elements (e.g. enhancers) and their activity states, learn cell-type specific regulatory networks and use these maps to interpret GWAS and disease studies.

Location

Boston, MA

2025-26 Courses

Cloud Computing for Biology and Healthcare
BMDS 222, CS 273C, GENE 222 (Spr)
Deep Learning in Genomics and Biomedicine
BMDS 273, CS 273B, GENE 236 (Spr)
Independent Studies (26)
- Advanced Reading and Research
  CS 499 (Aut, Win, Spr, Sum)
- Advanced Reading and Research
  CS 499P (Aut, Win, Spr, Sum)
- Biomedical Informatics Teaching Methods
  BMDS 295 (Aut, Win, Spr)
- Curricular Practical Training
  CS 390A (Aut, Win, Spr, Sum)
- Curricular Practical Training
  CS 390B (Aut, Win, Spr, Sum)
- Curricular Practical Training
  CS 390C (Win)
- Directed Reading
  BMDS 299 (Aut, Win, Spr, Sum)
- Directed Reading in Biophysics
  BIOPHYS 399 (Aut, Win, Spr, Sum)
- Directed Reading in Genetics
  GENE 299 (Aut, Win, Spr, Sum)
- Directed Reading in Neurosciences
  NEPR 299 (Aut, Win, Spr, Sum)
- Graduate Research
  BIOPHYS 300 (Aut, Win, Spr, Sum)
- Graduate Research
  GENE 399 (Aut, Win, Spr, Sum)
- Graduate Research
  NEPR 399 (Aut, Win, Spr, Sum)
- Independent Project
  CS 399 (Aut, Win, Spr, Sum)
- Independent Work
  CS 199 (Aut, Win, Spr, Sum)
- Independent Work
  CS 199P (Aut, Win, Spr, Sum)
- Medical Scholars Research
  BMDS 370 (Aut, Win, Spr)
- Medical Scholars Research
  GENE 370 (Aut, Win, Spr, Sum)
- Part-time Curricular Practical Training
  CS 390D (Aut, Win, Spr, Sum)
- Ph.D. Research
  CME 400 (Aut, Win, Spr, Sum)
- Research
  PHYSICS 490 (Sum)
- Senior Project
  CS 191 (Aut, Win, Spr)
- Supervised Study
  GENE 260 (Aut, Win, Spr, Sum)
- Supervised Undergraduate Research
  CS 195 (Aut, Win, Spr, Sum)
- Undergraduate Research
  GENE 199 (Aut, Win, Spr, Sum)
- Writing Intensive Senior Research Project
  CS 191W (Aut, Win, Spr)

Prior Year Courses
2024-25 Courses
- Cloud Computing for Biology and Healthcare
  BIOMEDIN 222, CS 273C, GENE 222 (Spr)
2023-24 Courses
- Cloud Computing for Biology and Healthcare
  BIOMEDIN 222, CS 273C, GENE 222 (Spr)
- Deep Learning in Genomics and Biomedicine
  BIODS 237, CS 273B (Spr)
2022-23 Courses
- Big Data for Biologists - Decoding Genomic Function
  HUMBIO 51 (Win)
- Cloud Computing for Biology and Healthcare
  BIOMEDIN 222, CS 273C, GENE 222 (Spr)
- Deep Learning in Genomics and Biomedicine
  BIODS 237, BIOMEDIN 273B, CS 273B, GENE 236 (Spr)
- Genetics and Developmental Biology Training Camp
  DBIO 200, GENE 200 (Aut)

Stanford Advisees

Arthur Deng
Doctoral Dissertation Reader (AC)
Shawn Cai, Meena Chakraborty, Benjamin Doughty, Tami Gjorgjieva, Michael Hayes, Maya Sheth, Elana Simon, Alp Tartici
Postdoctoral Faculty Sponsor
Pau Badia i Mompel, Seungbyn Baek, Mingze Dong, Adam He, Ruchir Rastogi, Isaac Vock, Lei Xiong
Doctoral Dissertation Advisor (AC)
Alejandro Buendia, Ziwei Chen, Salil Deshpande, Martin Kjellberg, Kamal Obbad, Valeh Valiollah Pour Amiri, Chang M. Yun, Chris Zou
Orals Evaluator
Minji Kang
Doctoral Dissertation Co-Advisor (AC)
Samuel Alber, Antony Chang, Nathaniel Diamant, Michal Gerasimiuk, Alexander Johansen, Minji Kang, Shouvik Mani, Owen Queen, Esther Robb, Jake Silberg, Arpita Singhal, Jason Tan, Nitya Thakkar, Zoe Wefers
Master's Program Advisor
Rohan Mehrotra, Isabel Michel, Sanjay Nagaraj, Jessie Ou, Emmy Thamakaison
Postdoctoral Research Mentor
Danila Bredikhin, Selin Jessa
Doctoral (Program)
Ziwei Chen, Anvita Gupta, Chiho Im, Riya Sinha

Graduate and Fellowship Programs

Biomedical Data Science (Master's Program)
Biomedical Data Science (PhD Program)
Genetics (PhD Program)

All Publications

Disease diagnostics using machine learning of B cell and T cell receptor sequences. Science (New York, N.Y.) Zaslavsky, M. E., Craig, E., Michuda, J. K., Sehgal, N., Ram-Mohan, N., Lee, J. Y., Nguyen, K. D., Hoh, R. A., Pham, T. D., Röltgen, K., Lam, B., Parsons, E. S., Macwana, S. R., DeJager, W., Drapeau, E. M., Roskin, K. M., Cunningham-Rundles, C., Moody, M. A., Haynes, B. F., Goldman, J. D., Heath, J. R., Chinthrajah, R. S., Nadeau, K. C., Pinsky, B. A., Blish, C. A., Hensley, S. E., Jensen, K., Meyer, E., Balboni, I., Utz, P. J., Merrill, J. T., Guthridge, J. M., James, J. A., Yang, S., Tibshirani, R., Kundaje, A., Boyd, S. D. 2025; 387 (6736): eadp2407

Abstract

Clinical diagnosis typically incorporates physical examination, patient history, various laboratory tests, and imaging studies but makes limited use of the human immune system's own record of antigen exposures encoded by receptors on B cells and T cells. We analyzed immune receptor datasets from 593 individuals to develop MAchine Learning for Immunological Diagnosis, an interpretive framework to screen for multiple illnesses simultaneously or precisely test for one condition. This approach detects specific infections, autoimmune disorders, vaccine responses, and disease severity differences. Human-interpretable features of the model recapitulate known immune responses to severe acute respiratory syndrome coronavirus 2, influenza, and human immunodeficiency virus, highlight antigen-specific receptors, and reveal distinct characteristics of systemic lupus erythematosus and type-1 diabetes autoreactivity. This analysis framework has broad potential for scientific and clinical interpretation of immune responses.

View details for DOI 10.1126/science.adp2407

View details for PubMedID 39977494

Transcription factor stoichiometry, motif affinity and syntax regulate single-cell chromatin dynamics during fibroblast reprogramming to pluripotency. bioRxiv : the preprint server for biology Nair, S., Ameen, M., Sundaram, L., Pampari, A., Schreiber, J., Balsubramani, A., Wang, Y. X., Burns, D., Blau, H. M., Karakikes, I., Wang, K. C., Kundaje, A. 2023

Abstract

Ectopic expression of OCT4, SOX2, KLF4 and MYC (OSKM) transforms differentiated cells into induced pluripotent stem cells. To refine our mechanistic understanding of reprogramming, especially during the early stages, we profiled chromatin accessibility and gene expression at single-cell resolution across a densely sampled time course of human fibroblast reprogramming. Using neural networks that map DNA sequence to ATAC-seq profiles at base-resolution, we annotated cell-state-specific predictive transcription factor (TF) motif syntax in regulatory elements, inferred affinity- and concentration-dependent dynamics of Tn5-bias corrected TF footprints, linked peaks to putative target genes, and elucidated rewiring of TF-to-gene cis-regulatory networks. Our models reveal that early in reprogramming, OSK, at supraphysiological concentrations, rapidly open transient regulatory elements by occupying non-canonical low-affinity binding sites. As OSK concentration falls, the accessibility of these transient elements decays as a function of motif affinity. We find that these OSK-dependent transient elements sequester the somatic TF AP-1. This redistribution is strongly associated with the silencing of fibroblast-specific genes within individual nuclei. Together, our integrated single-cell resource and models reveal insights into the cis-regulatory code of reprogramming at unprecedented resolution, connect TF stoichiometry and motif syntax to diversification of cell fate trajectories, and provide new perspectives on the dynamics and role of transient regulatory elements in somatic silencing.

View details for DOI 10.1101/2023.10.04.560808

View details for PubMedID 37873116

Integrative single-cell analysis of cardiogenesis identifies developmental trajectories and non-coding mutations in congenital heart disease. Cell Ameen, M., Sundaram, L., Shen, M., Banerjee, A., Kundu, S., Nair, S., Shcherbina, A., Gu, M., Wilson, K. D., Varadarajan, A., Vadgama, N., Balsubramani, A., Wu, J. C., Engreitz, J. M., Farh, K., Karakikes, I., Wang, K. C., Quertermous, T., Greenleaf, W. J., Kundaje, A. 2022; 185 (26): 4937

Abstract

To define the multi-cellular epigenomic and transcriptional landscape of cardiac cellular development, we generated single-cell chromatin accessibility maps of human fetal heart tissues. We identified eight major differentiation trajectories involving primary cardiac cell types, each associated with dynamic transcription factor (TF) activity signatures. We contrasted regulatory landscapes of iPSC-derived cardiac cell types and their in vivo counterparts, which enabled optimization of in vitro differentiation of epicardial cells. Further, we interpreted sequence-based deep learning models of cell-type-resolved chromatin accessibility profiles to decipher underlying TF motif lexicons. De novo mutations predicted to affect chromatin accessibility in arterial endothelium were enriched in congenital heart disease (CHD) cases vs. controls. In vitro studies in iPSCs validated the functional impact of identified variation on the predicted developmental cell types. This work thus defines the cell-type-resolved cis-regulatory sequence determinants of heart development and identifies disruption of cell type-specific regulatory elements in CHD.

View details for DOI 10.1016/j.cell.2022.11.028

View details for PubMedID 36563664

The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nature genetics Kim, D. S., Risca, V. I., Reynolds, D. L., Chappell, J., Rubin, A. J., Jung, N., Donohue, L. K., Lopez-Pajares, V., Kathiria, A., Shi, M., Zhao, Z., Deep, H., Sharmin, M., Rao, D., Lin, S., Chang, H. Y., Snyder, M. P., Greenleaf, W. J., Kundaje, A., Khavari, P. A. 2021

Abstract

Transcription factors bind DNA sequence motif vocabularies in cis-regulatory elements (CREs) to modulate chromatin state and gene expression during cell state transitions. A quantitative understanding of how motif lexicons influence dynamic regulatory activity has been elusive due to the combinatorial nature of the cis-regulatory code. To address this, we undertook multiomic data profiling of chromatin and expression dynamics across epidermal differentiation to identify 40,103 dynamic CREs associated with 3,609 dynamically expressed genes, then applied an interpretable deep-learning framework to model the cis-regulatory logic of chromatin accessibility. This analysis framework identified cooperative DNA sequence rules in dynamic CREs regulating synchronous gene modules with diverse roles in skin differentiation. Massively parallel reporter assay analysis validated temporal dynamics and cooperative cis-regulatory logic. Variants linked to human polygenic skin disease were enriched in these time-dependent combinatorial motif rules. This integrative approach shows the combinatorial cis-regulatory lexicon of epidermal differentiation and represents a general framework for deciphering the organizational principles of the cis-regulatory code of dynamic gene regulation.

View details for DOI 10.1038/s41588-021-00947-3

View details for PubMedID 34650237

A genome-wide atlas of co-essential modules assigns function to uncharacterized genes. Nature genetics Wainberg, M., Kamber, R. A., Balsubramani, A., Meyers, R. M., Sinnott-Armstrong, N., Hornburg, D., Jiang, L., Chan, J., Jian, R., Gu, M., Shcherbina, A., Dubreuil, M. M., Spees, K., Meuleman, W., Snyder, M. P., Bassik, M. C., Kundaje, A. 2021

Abstract

A central question in the post-genomic era is how genes interact to form biological pathways. Measurements of gene dependency across hundreds of cell lines have been used to cluster genes into 'co-essential' pathways, but this approach has been limited by ubiquitous false positives. In the present study, we develop a statistical method that enables robust identification of gene co-essentiality and yields a genome-wide set of functional modules. This atlas recapitulates diverse pathways and protein complexes, and predicts the functions of 108 uncharacterized genes. Validating top predictions, we show that TMEM189 encodes plasmanylethanolamine desaturase, a key enzyme for plasmalogen synthesis. We also show that C15orf57 encodes a protein that binds the AP2 complex, localizes to clathrin-coated pits and enables efficient transferrin uptake. Finally, we provide an interactive webtool for the community to explore our results, which establish co-essentiality profiling as a powerful resource for biological pathway identification and discovery of new gene functions.

View details for DOI 10.1038/s41588-021-00840-z

View details for PubMedID 33859415

Base-resolution models of transcription-factor binding reveal soft motif syntax. Nature genetics Avsec, Ž., Weilert, M., Shrikumar, A., Krueger, S., Alexandari, A., Dalal, K., Fropf, R., McAnany, C., Gagneur, J., Kundaje, A., Zeitlinger, J. 2021

Abstract

The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.

View details for DOI 10.1038/s41588-021-00782-6

View details for PubMedID 33603233

Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer's and Parkinson's diseases. Nature genetics Corces, M. R., Shcherbina, A., Kundu, S., Gloudemans, M. J., Fresard, L., Granja, J. M., Louie, B. H., Eulalio, T., Shams, S., Bagdatli, S. T., Mumbach, M. R., Liu, B., Montine, K. S., Greenleaf, W. J., Kundaje, A., Montgomery, S. B., Chang, H. Y., Montine, T. J. 2020

Abstract

Genome-wide association studies of neurological diseases have identified thousands of variants associated with disease phenotypes. However, most of these variants do not alter coding sequences, making it difficult to assign their function. Here, we present a multi-omic epigenetic atlas of the adult human brain through profiling of single-cell chromatin accessibility landscapes and three-dimensional chromatin interactions of diverse adult brain regions across a cohort of cognitively healthy individuals. We developed a machine-learning classifier to integrate this multi-omic framework and predict dozens of functional SNPs for Alzheimer's and Parkinson's diseases, nominating target genes and cell types for previously orphaned loci from genome-wide association studies. Moreover, we dissected the complex inverted haplotype of the MAPT (encoding tau) Parkinson's disease risk locus, identifying putative ectopic regulatory interactions in neurons that may mediate this disease association. This work expands understanding of inherited variation and provides a roadmap for the epigenomic dissection of causal regulatory variation in disease.

View details for DOI 10.1038/s41588-020-00721-x

View details for PubMedID 33106633

Opportunities and challenges for transcriptome-wide association studies. Nature genetics Wainberg, M., Sinnott-Armstrong, N., Mancuso, N., Barbeira, A. N., Knowles, D. A., Golan, D., Ermel, R., Ruusalepp, A., Quertermous, T., Hao, K., Bjorkegren, J. L. M., Im, H., Pasaniuc, B., Rivas, M. A., Kundaje, A. 2019; 51 (4): 592-599

View details for DOI 10.1038/s41588-019-0385-z

View details for Web of Science ID 000462767500005

Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics (Oxford, England) Greenside, P., Shimko, T., Fordyce, P., Kundaje, A. 2018; 34 (17): i629-i637

Abstract

Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Code is available at: https://github.com/kundajelab/dfim. Supplementary data are available at Bioinformatics online.

View details for DOI 10.1093/bioinformatics/bty575

View details for PubMedID 30423062

View details for PubMedCentralID PMC6129272

Opportunities and obstacles for deep learning in biology and medicine. Journal of the Royal Society Interface Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., Ferrero, E., Agapow, P., Zietz, M., Hoffman, M. M., Xie, W., Rosen, G. L., Lengerich, B. J., Israeli, J., Lanchantin, J., Woloszynek, S., Carpenter, A. E., Shrikumar, A., Xu, J., Cofer, E. M., Lavender, C. A., Turaga, S. C., Alexandari, A. M., Lu, Z., Harris, D. J., DeCaprio, D., Qi, Y., Kundaje, A., Peng, Y., Wiley, L. K., Segler, M. H. S., Boca, S. M., Swamidass, S., Huang, A., Gitter, A., Greene, C. S. 2018; 15 (141)

Abstract

Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

View details for PubMedID 29618526

View details for PubMedCentralID PMC5938574

Denoising genome-wide histone ChIP-seq with convolutional neural networks. Bioinformatics Koh, P., Pierson, E., Kundaje, A. 2017; 33 (14): i225-i233

Abstract

Chromatin immunoprecipitation sequencing (ChIP-seq) experiments are commonly used to obtain genome-wide profiles of histone modifications associated with different types of functional genomic elements. However, the quality of histone ChIP-seq data is affected by many experimental parameters such as the amount of input DNA, antibody specificity, ChIP enrichment and sequencing depth. Making accurate inferences from chromatin profiling experiments that involve diverse experimental parameters is challenging. We introduce a convolutional denoising algorithm, Coda, that uses convolutional neural networks to learn a mapping from suboptimal to high-quality histone ChIP-seq data. This overcomes various sources of noise and variability, substantially enhancing and recovering signal when applied to low-quality chromatin profiling datasets across individuals, cell types and species. Our method has the potential to improve data quality at reduced costs. More broadly, this approach (using a high-dimensional discriminative model to encode a generative noise process) is generally applicable to other biological domains where it is easy to generate noisy data but difficult to analytically characterize the noise or underlying data distribution. Code is available at: https://github.com/kundajelab/coda. Contact: akundaje@stanford.edu.

View details for PubMedID 28881977

Genetic Control of Chromatin States in Humans Involves Local and Distal Chromosomal Interactions. Cell Grubert, F., Zaugg, J. B., Kasowski, M., Ursu, O., Spacek, D. V., Martin, A. R., Greenside, P., Srivas, R., Phanstiel, D. H., Pekowska, A., Heidari, N., Euskirchen, G., Huber, W., Pritchard, J. K., Bustamante, C. D., Steinmetz, L. M., Kundaje, A., Snyder, M. 2015; 162 (5): 1051-1065

Abstract

Deciphering the impact of genetic variants on gene regulation is fundamental to understanding human disease. Although gene regulation often involves long-range interactions, it is unknown to what extent non-coding genetic variants influence distal molecular phenotypes. Here, we integrate chromatin profiling for three histone marks in lymphoblastoid cell lines (LCLs) from 75 sequenced individuals with LCL-specific Hi-C and ChIA-PET-based chromatin contact maps to uncover one of the largest collections of local and distal histone quantitative trait loci (hQTLs). Distal QTLs are enriched within topologically associated domains and exhibit largely concordant variation of chromatin state coordinated by proximal and distal non-coding genetic variants. Histone QTLs are enriched for common variants associated with autoimmune diseases and enable identification of putative target genes of disease-associated variants from genome-wide association studies. These analyses provide insights into how genetic variation can affect human disease phenotypes by coordinated changes in chromatin at interacting regulatory elements.

View details for DOI 10.1016/j.cell.2015.07.048

View details for Web of Science ID 000360589900015

View details for PubMedCentralID PMC4556133

Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer's disease. Nature Gjoneska, E., Pfenning, A. R., Mathys, H., Quon, G., Kundaje, A., Tsai, L., Kellis, M. 2015; 518 (7539): 365-369

Abstract

Alzheimer's disease (AD) is a severe age-related neurodegenerative disorder characterized by accumulation of amyloid-β plaques and neurofibrillary tangles, synaptic and neuronal loss, and cognitive decline. Several genes have been implicated in AD, but chromatin state alterations during neurodegeneration remain uncharacterized. Here we profile transcriptional and chromatin state dynamics across early and late pathology in the hippocampus of an inducible mouse model of AD-like neurodegeneration. We find a coordinated downregulation of synaptic plasticity genes and regulatory regions, and upregulation of immune response genes and regulatory regions, which are targeted by factors that belong to the ETS family of transcriptional regulators, including PU.1. Human regions orthologous to increasing-level enhancers show immune-cell-specific enhancer signatures as well as immune cell expression quantitative trait loci, while decreasing-level enhancer orthologues show fetal-brain-specific enhancer activity. Notably, AD-associated genetic variants are specifically enriched in increasing-level enhancer orthologues, implicating immune processes in AD predisposition. Indeed, increasing enhancers overlap known AD loci lacking protein-altering variants, and implicate additional loci that do not reach genome-wide significance. Our results reveal new insights into the mechanisms of neurodegeneration and establish the mouse as a useful model for functional studies of AD regulatory regions.

View details for DOI 10.1038/nature14252

View details for PubMedID 25693568

Integrative analysis of 111 reference human epigenomes. Nature Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi, A., Kheradpour, P., Zhang, Z., Wang, J., Ziller, M. J., Amin, V., Whitaker, J. W., Schultz, M. D., Ward, L. D., Sarkar, A., Quon, G., Sandstrom, R. S., Eaton, M. L., Wu, Y., Pfenning, A. R., Wang, X., Claussnitzer, M., Liu, Y., Coarfa, C., Harris, R. A., Shoresh, N., Epstein, C. B., Gjoneska, E., Leung, D., Xie, W., Hawkins, R. D., Lister, R., Hong, C., Gascard, P., Mungall, A. J., Moore, R., Chuah, E., Tam, A., Canfield, T. K., Hansen, R. S., Kaul, R., Sabo, P. J., Bansal, M. S., Carles, A., Dixon, J. R., Farh, K., Feizi, S., Karlic, R., Kim, A., Kulkarni, A., Li, D., Lowdon, R., Elliott, G., Mercer, T. R., Neph, S. J., Onuchic, V., Polak, P., Rajagopal, N., Ray, P., Sallari, R. C., Siebenthall, K. T., Sinnott-Armstrong, N. A., Stevens, M., Thurman, R. E., Wu, J., Zhang, B., Zhou, X., Beaudet, A. E., Boyer, L. A., De Jager, P. L., Farnham, P. J., Fisher, S. J., Haussler, D., Jones, S. J., Li, W., Marra, M. A., McManus, M. T., Sunyaev, S., Thomson, J. A., Tlsty, T. D., Tsai, L., Wang, W., Waterland, R. A., Zhang, M. Q., Chadwick, L. H., Bernstein, B. E., Costello, J. F., Ecker, J. R., Hirst, M., Meissner, A., Milosavljevic, A., Ren, B., Stamatoyannopoulos, J. A., Wang, T., Kellis, M. 2015; 518 (7539): 317-330

Abstract

The reference human genome sequence set the stage for studies of genetic variation and its association with human disease, but epigenomic studies lack a similar reference. To address this need, the NIH Roadmap Epigenomics Consortium generated the largest collection so far of human epigenomes for primary cells and tissues. Here we describe the integrative analysis of 111 reference human epigenomes generated as part of the programme, profiled for histone modification patterns, DNA accessibility, DNA methylation and RNA expression. We establish global maps of regulatory elements, define regulatory modules of coordinated activity, and their likely activators and repressors. We show that disease- and trait-associated genetic variants are enriched in tissue-specific epigenomic marks, revealing biologically relevant cell types for diverse human traits, and providing a resource for interpreting the molecular basis of human disease. Our results demonstrate the central role of epigenomic information for understanding gene regulation, cellular differentiation and human disease.

View details for DOI 10.1038/nature14248

View details for PubMedID 25693563
Architecture of the human regulatory network derived from ENCODE data NATURE Gerstein, M. B., Kundaje, A., Hariharan, M., Landt, S. G., Yan, K., Cheng, C., Mu, X. J., Khurana, E., Rozowsky, J., Alexander, R., Min, R., Alves, P., Abyzov, A., Addleman, N., Bhardwaj, N., Boyle, A. P., Cayting, P., Charos, A., Chen, D. Z., Cheng, Y., Clarke, D., Eastman, C., Euskirchen, G., Frietze, S., Fu, Y., Gertz, J., Grubert, F., Harmanci, A., Jain, P., Kasowski, M., Lacroute, P., Leng, J., Lian, J., Monahan, H., O'Geen, H., Ouyang, Z., Partridge, E. C., Patacsil, D., Pauli, F., Raha, D., Ramirez, L., Reddy, T. E., Reed, B., Shi, M., Slifer, T., Wang, J., Wu, L., Yang, X., Yip, K. Y., Zilberman-Schapira, G., Batzoglou, S., Sidow, A., Farnham, P. J., Myers, R. M., Weissman, S. M., Snyder, M. 2012; 489 (7414): 91-100

Abstract

Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.

View details for DOI 10.1038/nature11245

View details for PubMedID 22955619
An integrated encyclopedia of DNA elements in the human genome NATURE Dunham, I., Kundaje, A., Aldred, S. F., Collins, P. J., Davis, C., Doyle, F., Epstein, C. B., Frietze, S., Harrow, J., Kaul, R., Khatun, J., Lajoie, B. R., Landt, S. G., Lee, B., Pauli, F., Rosenbloom, K. R., Sabo, P., Safi, A., Sanyal, A., Shoresh, N., Simon, J. M., Song, L., Trinklein, N. D., Altshuler, R. C., Birney, E., Brown, J. B., Cheng, C., Djebali, S., Dong, X., Dunham, I., Ernst, J., Furey, T. S., Gerstein, M., Giardine, B., Greven, M., Hardison, R. C., Harris, R. S., Herrero, J., Hoffman, M. M., Iyer, S., Kellis, M., Khatun, J., Kheradpour, P., Kundaje, A., Lassmann, T., Li, Q., Lin, X., Marinov, G. K., Merkel, A., Mortazavi, A., Parker, S. C., Reddy, T. E., Rozowsky, J., Schlesinger, F., Thurman, R. E., Wang, J., Ward, L. D., Whitfield, T. W., Wilder, S. P., Wu, W., Xi, H. S., Yip, K. Y., Zhuang, J., Bernstein, B. E., Birney, E., Dunham, I., Green, E. D., Gunter, C., Snyder, M., Pazin, M. J., Lowdon, R. F., Dillon, L. A., Adams, L. B., Kelly, C. J., Zhang, J., Wexler, J. R., Green, E. D., Good, P. J., Feingold, E. A., Bernstein, B. E., Birney, E., Crawford, G. E., Dekker, J., Elnitski, L., Farnham, P. J., Gerstein, M., Giddings, M. C., Gingeras, T. R., Green, E. D., Guigo, R., Hardison, R. C., Hubbard, T. J., Kellis, M., Kent, W. J., Lieb, J. D., Margulies, E. H., Myers, R. M., Snyder, M., Stamatoyannopoulos, J. A., Tenenbaum, S. A., Weng, Z., White, K. P., Wold, B., Khatun, J., Yu, Y., Wrobel, J., Risk, B. A., Gunawardena, H. P., Kuiper, H. C., Maier, C. W., Xie, L., Chen, X., Giddings, M. C., Bernstein, B. E., Epstein, C. B., Shoresh, N., Ernst, J., Kheradpour, P., Mikkelsen, T. S., Gillespie, S., Goren, A., Ram, O., Zhang, X., Wang, L., Issner, R., Coyne, M. J., Durham, T., Ku, M., Truong, T., Ward, L. D., Altshuler, R. C., Eaton, M. L., Kellis, M., Djebali, S., Davis, C. A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A., Tanzer, A., Lagarde, J., Lin, W., Schlesinger, F., Xue, C., Marinov, G. K., Khatun, J., Williams, B. A., Zaleski, C., Rozowsky, J., Roeder, M., Kokocinski, F., Abdelhamid, R. F., Alioto, T., Antoshechkin, I., Baer, M. T., Batut, P., Bell, I., Bell, K., Chakrabortty, S., Chen, X., Chrast, J., Curado, J., Derrien, T., Drenkow, J., Dumais, E., Dumais, J., Duttagupta, R., Fastuca, M., Fejes-Toth, K., Ferreira, P., Foissac, S., Fullwood, M. J., Gao, H., Gonzalez, D., Gordon, A., Gunawardena, H. P., Howald, C., Jha, S., Johnson, R., Kapranov, P., King, B., Kingswood, C., Li, G., Luo, O. J., Park, E., Preall, J. B., Presaud, K., Ribeca, P., Risk, B. A., Robyr, D., Ruan, X., Sammeth, M., Sandhu, K. S., Schaeffer, L., See, L., Shahab, A., Skancke, J., Suzuki, A. M., Takahashi, H., Tilgner, H., Trout, D., Walters, N., Wang, H., Wrobel, J., Yu, Y., Hayashizaki, Y., Harrow, J., Gerstein, M., Hubbard, T. J., Reymond, A., Antonarakis, S. E., Hannon, G. J., Giddings, M. C., Ruan, Y., Wold, B., Carninci, P., Guigo, R., Gingeras, T. R., Rosenbloom, K. R., Sloan, C. A., Learned, K., Malladi, V. S., Wong, M. C., Barber, G., Cline, M. S., Dreszer, T. R., Heitner, S. G., Karolchik, D., Kent, W. J., Kirkup, V. M., Meyer, L. R., Long, J. C., Maddren, M., Raney, B. J., Furey, T. S., Song, L., Grasfeder, L. L., Giresi, P. G., Lee, B., Battenhouse, A., Sheffield, N. C., Simon, J. M., Showers, K. A., Safi, A., London, D., Bhinge, A. A., Shestak, C., Schaner, M. R., Kim, S. K., Zhang, Z. Z., Mieczkowski, P. A., Mieczkowska, J. O., Liu, Z., McDaniell, R. M., Ni, Y., Rashid, N. U., Kim, M. J., Adar, S., Zhang, Z., Wang, T., Winter, D., Keefe, D., Birney, E., Iyer, V. R., Lieb, J. D., Crawford, G. E., Li, G., Sandhu, K. S., Zheng, M., Wang, P., Luo, O. J., Shahab, A., Fullwood, M. J., Ruan, X., Ruan, Y., Myers, R. M., Pauli, F., Williams, B. A., Gertz, J., Marinov, G. K., Reddy, T. E., Vielmetter, J., Partridge, E. C., Trout, D., Varley, K. E., Gasper, C., Bansal, A., Pepke, S., Jain, P., Amrhein, H., Bowling, K. M., Anaya, M., Cross, M. K., King, B., Muratet, M. A., Antoshechkin, I., Newberry, K. M., McCue, K., Nesmith, A. S., Fisher-Aylor, K. I., Pusey, B., DeSalvo, G., Parker, S. L., Balasubramanian, S., Davis, N. S., Meadows, S. K., Eggleston, T., Gunter, C., Newberry, J. S., Levy, S. E., Absher, D. M., Mortazavi, A., Wong, W. H., Wold, B., Blow, M. J., Visel, A., Pennachio, L. A., Elnitski, L., Margulies, E. H., Parker, S. C., Petrykowska, H. M., Abyzov, A., Aken, B., Barrell, D., Barson, G., Berry, A., Bignell, A., Boychenko, V., Bussotti, G., Chrast, J., Davidson, C., Derrien, T., Despacio-Reyes, G., Diekhans, M., Ezkurdia, I., Frankish, A., Gilbert, J., Gonzalez, J. M., Griffiths, E., Harte, R., Hendrix, D. A., Howald, C., Hunt, T., Jungreis, I., Kay, M., Khurana, E., Kokocinski, F., Leng, J., Lin, M. F., Loveland, J., Lu, Z., Manthravadi, D., Mariotti, M., Mudge, J., Mukherjee, G., Notredame, C., Pei, B., Rodriguez, J. M., Saunders, G., Sboner, A., Searle, S., Sisu, C., Snow, C., Steward, C., Tanzer, A., Tapanari, E., Tress, M. L., van Baren, M. J., Walters, N., Washietl, S., Wilming, L., Zadissa, A., Zhang, Z., Brent, M., Haussler, D., Kellis, M., Valencia, A., Gerstein, M., Reymond, A., Guigo, R., Harrow, J., Hubbard, T. J., Landt, S. G., Frietze, S., Abyzov, A., Addleman, N., Alexander, R. P., Auerbach, R. K., Balasubramanian, S., Bettinger, K., Bhardwaj, N., Boyle, A. P., Cao, A. R., Cayting, P., Charos, A., Cheng, Y., Cheng, C., Eastman, C., Euskirchen, G., Fleming, J. D., Grubert, F., Habegger, L., Hariharan, M., Harmanci, A., Iyengar, S., Jin, V. X., Karczewski, K. J., Kasowski, M., Lacroute, P., Lam, H., Lamarre-Vincent, N., Leng, J., Lian, J., Lindahl-Allen, M., Min, R., Miotto, B., Monahan, H., Moqtaderi, Z., Mu, X. J., O'Geen, H., Ouyang, Z., Patacsil, D., Pei, B., Raha, D., Ramirez, L., Reed, B., Rozowsky, J., Sboner, A., Shi, M., Sisu, C., Slifer, T., Witt, H., Wu, L., Xu, X., Yan, K., Yang, X., Yip, K. Y., Zhang, Z., Struhl, K., Weissman, S. M., Gerstein, M., Farnham, P. J., Snyder, M., Tenenbaum, S. A., Penalva, L. O., Doyle, F., Karmakar, S., Landt, S. G., Bhanvadia, R. R., Choudhury, A., Domanus, M., Ma, L., Moran, J., Patacsil, D., Slifer, T., Victorsen, A., Yang, X., Snyder, M., White, K. P., Auer, T., Centanin, L., Eichenlaub, M., Gruhl, F., Heermann, S., Hoeckendorf, B., Inoue, D., Kellner, T., Kirchmaier, S., Mueller, C., Reinhardt, R., Schertel, L., Schneider, S., Sinn, R., Wittbrodt, B., Wittbrodt, J., Weng, Z., Whitfield, T. W., Wang, J., Collins, P. J., Aldred, S. F., Trinklein, N. D., Partridge, E. C., Myers, R. M., Dekker, J., Jain, G., Lajoie, B. R., Sanyal, A., Balasundaram, G., Bates, D. L., Byron, R., Canfield, T. K., Diegel, M. J., Dunn, D., Ebersol, A. K., Frum, T., Garg, K., Gist, E., Hansen, R. S., Boatman, L., Haugen, E., Humbert, R., Jain, G., Johnson, A. K., Johnson, E. M., Kutyavin, T. V., Lajoie, B. R., Lee, K., Lotakis, D., Maurano, M. T., Neph, S. J., Neri, F. V., Nguyen, E. D., Qu, H., Reynolds, A. P., Roach, V., Rynes, E., Sabo, P., Sanchez, M. E., Sandstrom, R. S., Sanyal, A., Shafer, A. O., Stergachis, A. B., Thomas, S., Thurman, R. E., Vernot, B., Vierstra, J., Vong, S., Wang, H., Weaver, M. A., Yan, Y., Zhang, M., Akey, J. M., Bender, M., Dorschner, M. O., Groudine, M., MacCoss, M. J., Navas, P., Stamatoyannopoulos, G., Kaul, R., Dekker, J., Stamatoyannopoulos, J. A., Dunham, I., Beal, K., Brazma, A., Flicek, P., Herrero, J., Johnson, N., Keefe, D., Lukk, M., Luscombe, N. M., Sobral, D., Vaquerizas, J. M., Wilder, S. P., Batzoglou, S., Sidow, A., Hussami, N., Kyriazopoulou-Panagiotopoulou, S., Libbrecht, M. W., Schaub, M. A., Kundaje, A., Hardison, R. C., Miller, W., Giardine, B., Harris, R. S., Wu, W., Bickel, P. J., Banfai, B., Boley, N. P., Brown, J. B., Huang, H., Li, Q., Li, J. J., Noble, W. S., Bilmes, J. A., Buske, O. J., Hoffman, M. M., Sahu, A. D., Kharchenko, P. V., Park, P. J., Baker, D., Taylor, J., Weng, Z., Iyer, S., Dong, X., Greven, M., Lin, X., Wang, J., Xi, H. S., Zhuang, J., Gerstein, M., Alexander, R. P., Balasubramanian, S., Cheng, C., Harmanci, A., Lochovsky, L., Min, R., Mu, X. J., Rozowsky, J., Yan, K., Yip, K. Y., Birney, E. 2012; 489 (7414): 57-74

Abstract

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

View details for DOI 10.1038/nature11247

View details for Web of Science ID 000308347000039

View details for PubMedID 22955616

View details for PubMedCentralID PMC3439153
Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements GENOME RESEARCH Kundaje, A., Kyriazopoulou-Panagiotopoulou, S., Libbrecht, M., Smith, C. L., Raha, D., Winters, E. E., Johnson, S. M., Snyder, M., Batzoglou, S., Sidow, A. 2012; 22 (9): 1735-1747

Abstract

Gene regulation at functional elements (e.g., enhancers, promoters, insulators) is governed by an interplay of nucleosome remodeling, histone modifications, and transcription factor binding. To enhance our understanding of gene regulation, the ENCODE Consortium has generated a wealth of ChIP-seq data on DNA-binding proteins and histone modifications. We additionally generated nucleosome positioning data on two cell lines, K562 and GM12878, by MNase digestion and high-depth sequencing. Here we relate 14 chromatin signals (12 histone marks, DNase, and nucleosome positioning) to the binding sites of 119 DNA-binding proteins across a large number of cell lines. We developed a new method for unsupervised pattern discovery, the Clustered AGgregation Tool (CAGT), which accounts for the inherent heterogeneity in signal magnitude, shape, and implicit strand orientation of chromatin marks. We applied CAGT on a total of 5084 data set pairs to obtain an exhaustive catalog of high-resolution patterns of histone modifications and nucleosome positioning signals around bound transcription factors. Our analyses reveal extensive heterogeneity in how histone modifications are deposited, and how nucleosomes are positioned around binding sites. With the exception of the CTCF/cohesin complex, asymmetry of nucleosome positioning is predominant. Asymmetry of histone modifications is also widespread, for all types of chromatin marks examined, including promoter, enhancer, elongation, and repressive marks. The fine-resolution signal shapes discovered by CAGT unveiled novel correlation patterns between chromatin marks, nucleosome positioning, and sequence content. Meta-analyses of the signal profiles revealed a common vocabulary of chromatin signals shared across multiple cell lines and binding proteins.

View details for DOI 10.1101/gr.136366.111

View details for PubMedID 22955985
ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia GENOME RESEARCH Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B. E., Bickel, P., Brown, J. B., Cayting, P., Chen, Y., DeSalvo, G., Epstein, C., Fisher-Aylor, K. I., Euskirchen, G., Gerstein, M., Gertz, J., Hartemink, A. J., Hoffman, M. M., Iyer, V. R., Jung, Y. L., Karmakar, S., Kellis, M., Kharchenko, P. V., Li, Q., Liu, T., Liu, X. S., Ma, L., Milosavljevic, A., Myers, R. M., Park, P. J., Pazin, M. J., Perry, M. D., Raha, D., Reddy, T. E., Rozowsky, J., Shoresh, N., Sidow, A., Slattery, M., Stamatoyannopoulos, J. A., Tolstorukov, M. Y., White, K. P., Xi, S., Farnham, P. J., Lieb, J. D., Wold, B. J., Snyder, M. 2012; 22 (9): 1813-1831

Abstract

Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.

View details for DOI 10.1101/gr.136184.111

View details for PubMedID 22955991
Linking disease associations with regulatory information in the human genome GENOME RESEARCH Schaub, M. A., Boyle, A. P., Kundaje, A., Batzoglou, S., Snyder, M. 2012; 22 (9): 1748-1759

Abstract

Genome-wide association studies have been successful in identifying single nucleotide polymorphisms (SNPs) associated with a large number of phenotypes. However, an associated SNP is likely part of a larger region of linkage disequilibrium. This makes it difficult to precisely identify the SNPs that have a biological link with the phenotype. We have systematically investigated the association of multiple types of ENCODE data with disease-associated SNPs and show that there is significant enrichment for functional SNPs among the currently identified associations. This enrichment is strongest when integrating multiple sources of functional information and when highest confidence disease-associated SNPs are used. We propose an approach that integrates multiple types of functional data generated by the ENCODE Consortium to help identify "functional SNPs" that may be associated with the disease phenotype. Our approach generates putative functional annotations for up to 80% of all previously reported associations. We show that for most associations, the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the reported association rather than the reported SNP itself. Our results show that the experimental data sets generated by the ENCODE Consortium can be successfully used to suggest functional hypotheses for variants associated with diseases and other phenotypes.

View details for DOI 10.1101/gr.136127.111

View details for PubMedID 22955986
A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast PLOS COMPUTATIONAL BIOLOGY Kundaje, A., Xin, X., Lan, C., Lianoglou, S., Zhou, M., Zhang, L., Leslie, C. 2008; 4 (11)

Abstract

Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included.

View details for DOI 10.1371/journal.pcbi.1000224

View details for Web of Science ID 000261480800016

View details for PubMedID 19008939

View details for PubMedCentralID PMC2573020
Learning regulatory programs that accurately predict differential expression with MEDUSA Workshop on Dialogue on Reverse Engineering Assessment and Methods Kundaje, A., Lianoglou, S., Li, X., Quigley, D., Arias, M., Wiggins, C. H., Zhang, L., Leslie, C. WILEY-BLACKWELL. 2007: 178–202

Abstract

Inferring gene regulatory networks from high-throughput genomic data is one of the central problems in computational biology. In this paper, we describe a predictive modeling approach for studying regulatory networks, based on a machine learning algorithm called MEDUSA. MEDUSA integrates promoter sequence, mRNA expression, and transcription factor occupancy data to learn gene regulatory programs that predict the differential expression of target genes. Instead of using clustering or correlation of expression profiles to infer regulatory relationships, MEDUSA determines condition-specific regulators and discovers regulatory motifs that mediate the regulation of target genes. In this way, MEDUSA meaningfully models biological mechanisms of transcriptional regulation. MEDUSA solves the problem of predicting the differential (up/down) expression of target genes by using boosting, a technique from statistical learning, which helps to avoid overfitting as the algorithm searches through the high-dimensional space of potential regulators and sequence motifs. Experimental results demonstrate that MEDUSA achieves high prediction accuracy on held-out experiments (test data), that is, data not seen in training. We also present context-specific analysis of MEDUSA regulatory programs for DNA damage and hypoxia, demonstrating that MEDUSA identifies key regulators and motifs in these processes. A central challenge in the field is the difficulty of validating reverse-engineered networks in the absence of a gold standard. Our approach of learning regulatory programs provides at least a partial solution for the problem: MEDUSA's prediction accuracy on held-out data gives a concrete and statistically sound way to validate how well the algorithm performs. With MEDUSA, statistical validation becomes a prerequisite for hypothesis generation and network building rather than a secondary consideration.

View details for DOI 10.1196/annals.1407.020

View details for Web of Science ID 000252037600013

View details for PubMedID 17934055
Combining sequence and time series expression data to learn transcriptional modules IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Kundaje, A., Middendorf, M., Gao, F., Wiggins, C., Leslie, C. 2005; 2 (3): 194-202

Abstract

Our goal is to cluster genes into transcriptional modules--sets of genes where similarity in expression is explained by common regulatory mechanisms at the transcriptional level. We want to learn modules from both time series gene expression data and genome-wide motif data that are now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules. Starting with a set of motifs representing known or putative regulatory elements (transcription factor binding sites) and the counts of occurrences of these motifs in each gene's promoter region, together with a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. We also present a technique based on the Jensen-Shannon entropy contributions of motifs in the learned model for associating the most significant motifs to each module. Thus, the algorithm gives a global approach for associating sets of regulatory elements to "modules" of genes with similar time series expression profiles. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course data sets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated to known cell cycle transcription factors.

View details for Web of Science ID 000235704200003

View details for PubMedID 17044183
Learning Important Features Through Propagating Activation Differences Proceedings of the 34th International Conference on Machine Learning, 70:3145-3153, 2017 Shrikumar, A., Greenside, P., Kundaje, A. 2017
An encyclopedia of human enhancer-gene regulatory interactions. Nature Gschwind, A. R., Mualim, K. S., Karbalayghareh, A., Sheth, M. U., Dey, K. K., Jagoda, E., Nurtdinov, R. N., Xi, W., Tan, A. S., Galante, J., Jones, H., Ma, X. R., Yao, D., Amgalan, D., Ray, J., Munger, C. J., Nasser, J., Avsec, Ž., James, B. T., Shamim, M. S., Durand, N. C., Rao, S. S., Mahajan, R., Doughty, B. R., Andreeva, K., Ulirsch, J. C., Fan, K., Perez, E. M., Nguyen, T. C., Kelley, D. R., Finucane, H. K., Moore, J. E., Weng, Z., Kellis, M., Bassik, M. C., Ustun, B., Price, A. L., Beer, M. A., Guigó, R., Stamatoyannopoulos, J. A., Lieberman Aiden, E., Greenleaf, W. J., Leslie, C. S., Steinmetz, L. M., Kundaje, A., Engreitz, J. M. 2026

Abstract

Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the effect of human genetic variation on disease. Here we create and evaluate a resource of more than 92 million enhancer-gene regulatory interactions across 1,458 biosamples covering 369 cell types and tissues, by integrating predictive models, chromatin states, three-dimensional contacts and large-scale genetic perturbations generated by the ENCODE Consortium. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,356 element-gene pairs measured in CRISPR perturbation experiments, more than 30,000 fine-mapped expression quantitative trait loci and 569 fine-mapped genome-wide association study (GWAS) variants linked to a probable causal gene. Using this framework, we develop ENCODE-rE2G, a predictive model achieving state-of-the-art performance across several prediction tasks, demonstrating that iterative perturbations and supervised machine learning can build increasingly accurate predictive models of enhancer regulation. Using ENCODE-rE2G, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, revealing global properties of enhancer networks, identifying differences in regulatory complexity across genes and improving analyses linking noncoding variants to target genes and cell types for common complex diseases. By interpreting the model, we find that beyond enhancer activity and three-dimensional enhancer-promoter contacts, additional features that guide enhancer-promoter communication include promoter class and enhancer-enhancer synergy. These genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.

View details for DOI 10.1038/s41586-026-10781-4

View details for PubMedID 42457959

View details for PubMedCentralID 7845138
Positional interpretation of cis-regulatory code and nucleosome organization with deep learning models. Nature communications McAnany, C. E., Weilert, M., Mehta, G., Kamulegeya, F., Gardner, J. M., Schreiber, J., Kundaje, A., Zeitlinger, J. 2026

Abstract

Sequence-to-function neural networks learn cis-regulatory sequence rules driving many types of genomic data. Interpreting these models to relate the sequence rules to underlying biological processes remains challenging, especially for complex genomic readouts such as MNase-seq, which maps nucleosome occupancy but is confounded by experimental bias. Here, we introduce pairwise influence by sequence attribution (PISA), which uses attribution to combinatorially decode which bases contributed to the readout at a specific genomic coordinate. PISA visualizes the effects of transcription factor motifs, detects undiscovered motifs with complex contribution patterns, and reveals experimental biases. By learning the bias for MNase-seq, PISA enables unprecedented nucleosome prediction models. These models allow the de novo discovery of nucleosome-positioning motifs and reveal the basis of Micro-C chromatin domain boundaries through systematic motif perturbations. Finally, these models allow the design of sequences with altered nucleosome configurations. These results show that PISA is a versatile tool that expands our ability to train and interpret sequence-to-function neural networks on genomics data and understand the underlying cis-regulatory code.

View details for DOI 10.1038/s41467-026-74807-1

View details for PubMedID 42401553
Decoding common and rare noncoding variant effects across cellular and developmental contexts. Nature genetics Marderstein, A. R., Kundu, S., Padhi, E. M., Deshpande, S., Wang, A., Robb, E., Sun, Y., Yun, C. M., Pomales-Matos, D., Xie, Y., Chang, S. H., Chin, I. M., Shah, A. J., Gardell, Z. A., Corces, M. R., Nachun, D., Jessa, S., Kundaje, A., Montgomery, S. B. 2026

Abstract

Interpreting how noncoding variants act in specific cell types across human development is a major challenge. Here we generated 3 billion predictions from deep learning sequence models of chromatin accessibility across diverse fetal and adult cellular contexts. These prioritized functional variants and revealed a dichotomy: common variants are more cell-type-specific, whereas ultra-rare variants had larger and broader effects across cell types, with the strongest evidence of purifying selection in fetal neurons. Leveraging these insights, we developed FLARE (Functional Lasso Analysis of Regulatory Evolution), which integrates evolutionary constraint to prioritize noncoding variants with extreme regulatory effects. FLARE provided a general framework for studying regulatory variation, from de novo mutations in childhood disorders to rare variants underlying outlier adult brain expression and common variants enriched for schizophrenia heritability. Together, these results demonstrate how integrating single-cell chromatin accessibility, population genetics and deep learning can identify regulatory variants that influence human development and disease.

View details for DOI 10.1038/s41588-026-02619-6

View details for PubMedID 42298188

View details for PubMedCentralID 7237642
Ribo-Tweezer: Rapid removal of ribosomal proteins reveals additional layers of post-transcriptional gene regulation. Molecular cell Chen, Y., Cheng, C. P., Cates, K., Marinov, G. K., Lantz, T. C., Yang, H., Liu, I., Genuth, N. R., Andronescu, C., Hung, V., Bermudez, A., Rothschild, D., Georgeson, J., Barocio, S. B., Kundaje, A., Pitteri, S., Ruggero, D., Barna, M. 2026

Abstract

The ribosome is a ribozyme, but it also acts as a dynamic regulator of gene expression. Although ribosomal protein (RP) composition varies, dissecting the functional contributions of individual RPs beyond their housekeeping roles is challenging because of the lack of tools for manipulation in situ. Here, we developed Ribo-Tweezer, a degron-based system directly tethered to mature ribosomes that enables rapid, reversible, and selective depletion of specific RPs. Using Ribo-Tweezer in mouse embryonic stem cells (mESC), we find a previously uncharacterized role for RACK1 in stem cell fate control via translational regulation of zinc-finger transcriptional networks and long interspersed nuclear element-1 (LINE1) expression. This translation-transcription coupling provides a mechanism by which translation control is further amplified in gene regulation. Distinct translational programs induced by RPLP0 and RPLP1 depletion further demonstrate RP-specific regulatory functions in translation. Together, these findings establish Ribo-Tweezer as a powerful platform that has illuminated selective functions for RPs in gene regulation, which gives biological meaning to ribosome heterogeneity.

View details for DOI 10.1016/j.molcel.2026.04.023

View details for PubMedID 42167236
Bio-BLIP: A Multimodal Architecture for Transferable Reasoning in Genomic Variant Interpretation. bioRxiv : the preprint server for biology Gupta, A., Kundaje, A., Buendia, A., Leskovec, J. 2026

Abstract

Developing scientific hypotheses in biology requires integrating heterogeneous evidence across DNA sequence, gene context, protein function, and prior literature. Existing multimodal AI systems expose biological evidence to reasoning models through textification or by projecting biological embeddings into fine-tuned language models. However, these models are typically highly optimized for the specific set of tasks for which they are fine-tuned. Here we present Bio-BLIP, a multimodal Q-former-based architecture that leverages biological embeddings and an LLM to generalize to complex reasoning tasks without task-specific fine-tuning. The key to Bio-BLIP is a new neural network architecture that integrates four data modalities (DNA, genes, proteins, and text) through a master Q-former model, which integrates the modality-specific information into a fixed-length prefix for the LLM backbone. Bio-BLIP is pretrained on the task of human genetic variant annotation and achieves a 29.8% increase in generating accurate variant features over frontier LLMs. We evaluate Bio-BLIP zero-shot on downstream genomic tasks of variant prioritization and target gene prediction. Bio-BLIP outperforms two alignment-free genomic language models on regulatory variant prioritization for Mendelian disease. Across the target gene prediction task, Bio-BLIP improves accuracy over LLMs by leveraging learned genomic variant knowledge in difficult cases. Our model produces rich, transparent reasoning traces. In biological domains characterized by multiple scales of data and varied downstream tasks, Bio-BLIP offers a step toward natively multimodal, generalizable reasoning.

View details for DOI 10.64898/2026.05.12.724740

View details for PubMedID 42182443

View details for PubMedCentralID PMC13192633
Prediction and functional interpretation of inter-chromosomal genome architecture from DNA sequence with TwinC. Nature communications Jha, A., Hristov, B., Wang, X., Wang, S., Greenleaf, W. J., Kundaje, A., Aiden, E. L., Bertero, A., Noble, W. S. 2026

Abstract

Three-dimensional nuclear DNA architecture comprises well-studied intra-chromosomal (cis) folding and less characterized inter-chromosomal (trans) interfaces. Current predictive models of 3D genome folding overlook trans-genome organization. We present TwinC, an interpretable convolutional neural network model that reliably predicts trans contacts measurable through proximity ligation-dependent (in situ and intact Hi-C) and independent (DNA SPRITE) genome-wide chromatin conformation assays. TwinC achieves high predictive accuracy (AUROC=0.80) on a cross-chromosomal test set from in situ and intact Hi-C experiments in heart tissue. Furthermore, we train TwinC using in situ Hi-C data from the widely used GM12878 cell line and validate its performance with orthogonal DNA SPRITE assay in the same cell type. Mechanistically, the neural network learns the importance of compartments, chromatin accessibility, clustered transcription factor binding, and G-quadruplexes in forming trans contacts. In summary, TwinC models and interprets trans genome architecture, illuminating this poorly understood aspect of gene regulation.

View details for DOI 10.1038/s41467-026-72031-5

View details for PubMedID 42009674
Multiomics and deep learning dissect regulatory syntax in human development. Nature Liu, B. B., Jessa, S., Kim, S. H., Ng, Y. T., Higashino, S. I., Marinov, G. K., Chen, D. C., Parks, B. E., Li, L., Nguyen, T. C., Wang, A. T., Wang, S. K., Tan, M. H., Tan, S. Y., Kosicki, M., Pennacchio, L. A., Ben-David, E., Pasca, A. M., Kundaje, A., Farh, K. K., Greenleaf, W. J. 2026

Abstract

Transcription factors establish cell identity during development by binding regulatory DNA in a sequence-specific manner, often promoting local chromatin accessibility and regulating gene expression1. Mapping accessible chromatin offers critical insights into transcriptional control, but available datasets for human development are restricted to bulk tissue, single organs or single modalities2. Here we present the Human Development Multiomic Atlas, a single-cell atlas of chromatin accessibility and gene expression from 817,740 fetal cells across 12 organs, spanning 203 cell types and more than 1 million candidate cis-regulatory elements, many of which exhibit organ-specific in vivo enhancer activity. Deep learning models trained to predict accessibility from local DNA sequence unravel a comprehensive lexicon of motifs that influence accessibility, including composite motifs exhibiting distinct syntactic constraints that are predicted to mediate transcription factor cooperativity. We identify 'hard' syntactic rules requiring precise motif spacing and orientation, 'soft' rules allowing flexible motif arrangements, and ubiquitous motifs inhibiting accessibility. Model-based interpretation of genetic variants reveals that disruption of motifs with positive and negative effects is associated with concordant effects on gene expression. Our work delineates how motif syntax governs cell-type-specific chromatin accessibility and provides a foundational resource for decoding cis-regulatory logic and interpreting genetic variation during human development.

View details for DOI 10.1038/s41586-026-10326-9

View details for PubMedID 41951735

View details for PubMedCentralID 8462829
TGF-β-pathway-based polygenic risk score modifies the association between red meat intake and colorectal cancer risk: Application of a novel pathway-based PRS method. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology Sanchez Mendez, J., Queme, B., Fu, Y., Morrison, J. L., Lewinger, J. P., Kawaguchi, E. S., Mi, H., Obón-Santacana, M., Moratalla-Navarro, F., Martín, V., Moreno, V., Qu, C., Huyghe, J. R., Newcomb, P. A., Phipps, A. I., Thomas, C. E., Conti, D. V., Wang, J., Platz, E. A., Visvanathan, K., Keku, T. O., Newton, C. C., Um, C. Y., Kundaje, A., Gunter, M. J., Dimou, N., Papadimitriou, N., van Duijnhoven, F. J., Männistö, S., Rennert, G., Wolk, A., Hoffmeister, M., Brenner, H., Tian, Y., Le Marchand, L., Bouras, E., Tsilidis, K. K., Bishop, D. T., MacInnis, R. J., Buchanan, D. D., Ulrich, C. M., Peoples, A. R., Pellatt, A., Li, L., Devall, M. A., Albanes, D., Berndt, S. I., Gruber, S. B., Ruiz-Narvaez, E., Song, M., Drew, D. A., Chan, A. T., Giannakis, M., Hsu, L., Peters, U., Stern, M. C., Gauderman, W. J. 2026

Abstract

Red and/or processed meat are established colorectal cancer (CRC) risk factors. Genome-wide association studies (GWAS) have reported over 200 variants associated with CRC risk. We used functional annotation data to identify subsets of variants within known pathways to construct pathway-based Polygenic Risk Scores (pPRS) to assess interactions with meat intake. A pooled sample of 30,812 cases and 40,504 CRC controls from 27 studies was analyzed. Quantiles for red and processed meat intake were constructed. 204 GWAS variants were annotated to genes with AnnoQ and assessed for overrepresentation in PANTHER-reported pathways. pPRSs were constructed from significantly overrepresented pathways. Covariate-adjusted logistic regression models evaluated interactions between pPRS and red or processed meat intake in relation to CRC risk. A total of 30 variants were overrepresented in four pathways: Presenilin-Alzheimer disease, Cadherin/WNT-signaling, Gonadotropin-releasing hormone receptor, and TGF-β signaling. We found a significant interaction between TGF-β-pPRS and red meat intake (ORint = 0.95; 95% CI = 0.92-0.98; p = 0.003). When variants in the TGF-β pathway were assessed, we observed significant interactions of red meat with rs2337113 (intron SMAD7 gene, Chr18), and rs2208603 (intergenic region BMP5, Chr6) (p = 0.0005 & 0.036, respectively). There was no evidence of pPRS x red meat interactions for other pathways or with processed meat. Conclusions: This pathway-based interaction analysis revealed a statistically significant interaction between variants in the TGF-β pathway and red meat consumption that impacts CRC risk. These findings shed light on the possible mechanistic link between red meat consumption and CRC risk.

View details for DOI 10.1158/1055-9965.EPI-25-1754

View details for PubMedID 41920173
Vascular smooth muscle cell state trajectories mediate molecular mechanisms of coronary disease risk. Nature communications Li, D. Y., Kundu, S., Cheng, P., Gu, W., Worssam, M. D., Jackson, W. R., Zhao, Q., Nguyen, T., Yu, A. M., Monteiro, J. P., Caceres, R. D., Dale, S., Palmisano, B. T., Weldy, C. S., Ramste, M., Kundu, R., Kundaje, A., Wirka, R. C., Quertermous, T. 2026

Abstract

Vascular smooth muscle cells contribute to heritable coronary artery disease risk and undergo complex transitions to multiple disease-related phenotypes. To investigate the genetic basis of these trajectories, we develop a dense timecourse single-cell transcriptomic and epigenetic map of atherosclerosis in a murine disease model accompanied by high-plex in situ spatial data. Using temporal data and probabilistic fate modeling, we identify key transcription factors that drive cell state changes through a combination of network-based prioritization and in silico transcription factor perturbation. Parallel knockout studies of validated coronary artery disease gene Tcf21 uncover its molecular mechanisms in smooth muscle cell transition, due in part to a role regulating the transition of smooth muscle cells in the secondary heart field. Integrating the murine atlas with human coronary artery disease genetics pinpoints smooth muscle cell phenotypes that mediate disease risk, highlighting causal disease mechanisms. Together, these studies resolve atherosclerosis trajectories at single-cell resolution and identify genetic causal transcriptomic and epigenomic mechanisms of coronary artery disease risk.

View details for DOI 10.1038/s41467-026-70530-z

View details for PubMedID 41844614
Short-Context Regulatory DNA Language Models with Motif-Discovery Regularization. bioRxiv : the preprint server for biology Patel, A., Kundaje, A. 2026

Abstract

Self-supervised DNA language models (DNALMs) are typically trained at massive scale on whole genomes and long contexts. However, regulatory sequence features are sparse, heterogeneous, and dominated by poorly conserved flexible syntax of short motifs, which can be difficult to learn from genome-wide self-supervision. As a result, annotation-agnostic, long-context DNALMs struggle to learn regulatory syntax and can underperform simpler baseline models on key regulatory tasks. We therefore introduce ARSENAL, a short-context masked DNA language model trained on a functionally enriched regulatory corpus and augmented with a novel regularizer that encourages motif discovery. ARSENAL improves recovery of diverse transcription factor motifs de novo and prediction of regulatory variant effects in the zero-shot setting compared to other DNALMs. Incorporating ARSENAL embeddings also improves supervised chromatin accessibility prediction over strong ab-initio baselines across multiple cell types and yields improved regulatory variant scoring. Finally, ARSENAL serves as a practical generative prior, enabling targeted regulatory sequence design under downstream functional constraints. All code can be found at https://github.com/kundajelab/regulatory_lm, and models and data can be found at https://sageb.io/4ZpEnk.

View details for DOI 10.64898/2026.02.05.703637

View details for PubMedID 41676719

View details for PubMedCentralID PMC12889687
Genetic risk factors modulate the association between physical activity and colorectal cancer. BMC medicine Peoples, A. R., Obón-Santacana, M., Kim, A. E., Kawaguchi, E. S., Fu, Y., Qu, C., Moratalla-Navarro, F., Morrison, J., Lin, Y., Arndt, V., Berndt, S. I., Bien, S. A., Bishop, D. T., Bouras, E., Brenner, H., Buchanan, D. D., Campbell, P. T., Chan, A. T., Chang-Claude, J., Conti, D. V., Corley, D. A., Devall, M. A., Dimou, N., Drew, D. A., Gruber, S. B., Gunter, M. J., Harlid, S., Harrison, T. A., Hoffmeister, M., Hsu, L., Huyghe, J. R., Keku, T. O., Kundaje, A., Lewinger, J. P., Li, L., Lynch, B. M., Le Marchand, L., Martín, V., Murphy, N., Newton, C. C., Ogino, S., Hardikar, S., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Pellatt, A. J., Pinchev, M., Platz, E. A., Potter, J. D., Rennert, G., Ruiz-Narvaez, E. A., Sakoda, L. C., Schoen, R. E., Shcherbina, A., Stern, M. C., Su, Y. R., Thomas, C. E., Tian, Y., Tsilidis, K. K., Um, C. Y., van Duijnhoven, F. J., Van Guelpen, B., Visvanathan, K., Wang, J., White, E., Wolk, A., Woods, M. O., Wu, A. H., Ulrich, C. M., Peters, U., Gauderman, W. J., Moreno, V. 2026

Abstract

Physical activity is an established protective factor for colorectal cancer (CRC), but it is unclear if genetic variants modify this effect. To investigate this possibility, we conducted a genome-wide gene-physical activity interaction analysis. Using logistic regression (1-d.f.), two-step screening and testing method (EDGE), and joint tests (3-d.f.), we analyzed interactions between common genetic variants across the genome and physical activity in relation to CRC risk. Self-reported physical activity levels were categorized as active (≥ 8.75 MET-h/wk) vs. inactive (< 8.75 MET-h/wk; 39,992 participants) and as study- and sex-specific quartiles of activity (42,602 participants). Physical activity was inversely associated with CRC risk overall (OR [active vs. inactive] = 0.85; 95% CI = 0.81-0.90). The two-step EDGE method identified an interaction between rs4779584, an intergenic variant near the GREM1 and SCG5 genes, and physical activity for CRC risk (p-interaction = 2.6 × 10⁻⁸). Stratification by genotype at this locus showed a significant reduction in CRC risk by 20% in active vs. inactive participants with the CC genotype (OR = 0.80; 95% CI = 0.75-0.85), but no significant physical activity-CRC associations among CT or TT carriers. When physical activity was modeled as quartiles, the 1-d.f. test identified that rs56906466, an intergenic variant near the KCNG1 gene, modified the association between physical activity and CRC (p-interaction = 3.5 × 10⁻⁸). Stratification at this locus showed that an increase in physical activity (highest vs. lowest quartile) was associated with a lower CRC risk solely among TT carriers (OR = 0.77; 95% CI = 0.72-0.82). In summary, we identified two genetic variants that modified the association between physical activity and CRC risk. One of them, related to GREM1 and SCG5, suggests that the bone morphogenetic protein (BMP)-related, inflammatory, and/or insulin signaling pathways may be involved in the protective association between physical activity and colorectal carcinogenesis.

View details for DOI 10.1186/s12916-026-04675-5

View details for PubMedID 41645200
Deep learning-guided design of cell type-specific AAV promoters. bioRxiv : the preprint server for biology Wang, S. K., Deng, B., Nair, S., Ren, X., Li, J., Tijerina, J., Prakhar, P., Luo, Z., Nnebe, C., Kim, S. H., Zhou, Y., Shah, S. H., Davis, A., Mahajan, R., Qiao, Y., Zhou, Y., Zhang, J., Xue, Y., Goldberg, J. L., Wei, W., Kundaje, A., Chang, H. Y., Wang, S. 2026

Abstract

Precise cell type targeting is critical for both clinical and experimental applications of adeno-associated viral (AAV) vectors, yet engineering vectors with cell type-specific activity remains a challenge. Here, we compared three strategies leveraging single-cell chromatin accessibility data to design cell type-specific AAV promoters, including a deep learning-based method to generate de novo regulatory sequences. When applied to target retinal ganglion cells or horizontal cells in mouse retina, deep learning-guided design consistently outperformed rational approaches, yielding synthetic promoters with stronger and more specific expression in vivo. Synthetic AAV promoters supported diverse transgenes, enabling the recording and ablation of targeted cells. Promoter activity was also maintained in human retinal organoids, suggesting that deep learning-designed sequences may be suitable for translation. Our findings highlight the potential of deep learning to synthesize cell type-specific AAV promoters and establish a versatile platform for cell type targeting with broad implications for gene therapy and basic research.

View details for DOI 10.64898/2026.01.13.699371

View details for PubMedID 41648586

View details for PubMedCentralID PMC12871160
Single-cell multiome and enhancer connectome of human retinal pigment epithelium and choroid nominate causal variants in macular degeneration. Cell reports Wang, S. K., Li, J., Nair, S., Kosaraju, R., Chen, Y., Zhang, Y., Kundaje, A., Liu, Y., Wang, N., Chang, H. Y. 2026; 45 (1): 116814

Abstract

Age-related macular degeneration (AMD) is a leading cause of vision loss worldwide. Genome-wide association studies (GWASs) of AMD have identified dozens of risk loci that may house disease targets. However, variants at these loci are largely noncoding, making it difficult to assess their function and whether they are causal. Here, we present a single-cell gene expression and chromatin accessibility atlas of human retinal pigment epithelium (RPE) and choroid to systematically analyze both coding and noncoding variants implicated in AMD. We employ HiChIP and activity-by-contact modeling to map enhancers in these tissues and predict cell and gene targets of risk variants. We further perform allele-specific self-transcribing active regulatory region sequencing (STARR-seq) to functionally test variant activity in RPE cells, including in the context of complement activation. Our work nominates pathogenic variants and mechanisms in AMD and offers a rich and accessible resource for studying diseases of the RPE and choroid.

View details for DOI 10.1016/j.celrep.2025.116814

View details for PubMedID 41528844
An expanded registry of candidate cis-regulatory elements. Nature Moore, J. E., Pratt, H. E., Fan, K., Phalke, N., Fisher, J., Elhajjajy, S. I., Andrews, G., Gao, M., Shedd, N., Fu, Y., Lacadie, M. C., Meza, J., Khandpekar, M., Ganna, M., Choudhury, E., Swofford, R., Phan, H., Ramirez, C. C., Campbell, M., Likhite, M., Farrell, N. P., Weimer, A. K., Pampari, A., Ramalingam, V., Reese, F., Borsari, B., Yu, X., Wattenberg, E., Ruiz-Romero, M., Razavi-Mohseni, M., Xu, J., Galeev, T., Colubri, A., Beer, M. A., Guigó, R., Gerstein, M. B., Engreitz, J. M., Ljungman, M., Reddy, T. E., Snyder, M. P., Epstein, C. B., Gaskell, E., Bernstein, B. E., Dickel, D. E., Visel, A., Pennacchio, L. A., Mortazavi, A., Kundaje, A., Weng, Z. 2026

Abstract

Mammalian genomes contain millions of regulatory elements that control the complex patterns of gene expression1. Previously, the ENCODE consortium mapped biochemical signals across hundreds of cell types and tissues and integrated these data to develop a registry containing 0.9 million human and 300,000 mouse candidate cis-regulatory elements (cCREs) annotated with potential functions2. Here we have expanded the registry to include 2.37 million human and 967,000 mouse cCREs, leveraging new ENCODE datasets and enhanced computational methods. This expanded registry covers hundreds of unique cell and tissue types, providing a comprehensive understanding of gene regulation. Functional characterization data from assays such as STARR-seq3, massively parallel reporter assay4, CRISPR perturbation5,6 and transgenic mouse assays7 have profiled more than 90% of human cCREs, revealing complex regulatory functions. We identified thousands of novel silencer cCREs and demonstrated their dual enhancer and silencer roles in different cellular contexts. Integrating the registry with other ENCODE annotations facilitates genetic variation interpretation and trait-associated gene identification, exemplified by the identification of KLF1 as a novel causal gene for red blood cell traits. This expanded registry is a valuable resource for studying the regulatory genome and its impact on health and disease.

View details for DOI 10.1038/s41586-025-09909-9

View details for PubMedID 41501460

View details for PubMedCentralID 11903340
JASPAR 2026: expansion of transcription factor binding profiles and integration of deep learning models. Nucleic acids research Ovek Baydar, D., Rauluseviciute, I., Aronsen, D. R., Blanc-Mathieu, R., Bonthuis, I., de Beukelaer, H., Ferenc, K., Jegou, A., Kumar, V., Lemma, R. B., Lucas, J., Pochon, M., Yun, C. M., Ramalingam, V., Deshpande, S. S., Patel, A., Marinov, G. K., Wang, A. T., Aguirre, A., Castro-Mondragon, J. A., Baranasic, D., Chèneby, J., Gundersen, S., Johansen, M., Khan, A., Kuijjer, M. L., Hovig, E., Lenhard, B., Sandelin, A., Vandepoele, K., Wasserman, W. W., Parcy, F., Kundaje, A., Mathelier, A. 2025

Abstract

JASPAR (https://jaspar.elixir.no/) is an open-access database that has provided high-quality, manually curated, and non-redundant DNA binding profiles for transcription factors (TFs) as position frequency matrices (PFMs) for over 20 years. We expanded the CORE (306 new profiles, 12% increase) and UNVALIDATED (433, 60% increase) collections with new PFMs and updated 13 existing profiles. We updated the TF binding site predictions and genome tracks for eight species. TF binding profile clusters and familial TF binding sites were updated accordingly. We integrate the inMOTIFin software to easily simulate regulatory sequences using JASPAR PFMs. To enrich TFs' annotations, we provide scientific literature-based human TF target information. Notably, this release features a deep learning (DL) collection, providing a paradigm shift in modeling and characterizing TF-DNA interactions with 1259 BPNet models trained on Homo sapiens ENCODE chromatin immunoprecipitation followed by sequencing (ChIP-seq) datasets from 240 TFs and interpreted to reveal predictive motif patterns for the models. The motifs associated with the same TF were clustered to provide a summary of the binding properties, resulting in 240 primary and 113 alternative motif patterns in the DL collection. The JASPAR 2026 collections lay a foundation for future endeavors in genomic research, serving the scientific community in uncovering the mechanisms of gene regulation.

View details for DOI 10.1093/nar/gkaf1209

View details for PubMedID 41325984
Polyclonal origins of human premalignant colorectal lesions. Nature Van Egeren, D., Schenck, R. O., Khan, A., Horning, A. M., Mo, S., Weiß, C. L., Esplin, E. D., Becker, W. R., Wu, S., Hanson, C., Barapour, N., Jiang, L., Contrepois, K., Lee, H., Nevins, S. A., Guha, T. K., Zhang, H., He, Z., Ma, Z., Monte, E., Karathanos, T. V., Laquindanum, R., Mills, M. A., Chaib, H., Chiu, R., Jian, R., Chan, J., Ellenberger, M., Bahmani, B., Michael, B., Weimer, A. K., Esplin, D. G., Lancaster, S., Shen, J., Ladabaum, U., Longacre, T. A., Kundaje, A., Greenleaf, W. J., Hu, Z., Ford, J. M., Snyder, M. P., Curtis, C. 2025

Abstract

Cancer is generally thought to be caused by expansion of a single mutant cell1. However, analyses of early colorectal cancer lesions suggest that tumors may instead originate from multiple, genetically distinct cell populations2,3. Detecting polyclonal tumor initiation is challenging in patients, as it requires profiling early-stage lesions before clonal sweeps obscure diversity. To investigate this, we analyzed normal colorectal mucosa, benign and dysplastic premalignant polyps, and malignant adenocarcinomas (123 samples) from six individuals with familial adenomatous polyposis (FAP). Individuals with FAP have a germline heterozygous APC mutation, predisposing them to colorectal cancer and numerous premalignant polyps by early adulthood4. Whole-genome and/or whole-exome sequencing revealed that many premalignant polyps (40% with benign histology and 28% with dysplasia) were composed of multiple genetic lineages that diverged early, consistent with polyclonal origins. This conclusion was reinforced by whole-genome sequencing of single crypts from multiple polyps in additional patients, which showed limited sharing of mutations among crypts within the same lesion. In some cases, multiple distinct APC mutations co-existed in different lineages of a single polyp, consistent with polyclonality. These findings reshape our understanding of early neoplastic events, demonstrating that tumor initiation can arise from the convergence of diverse mutant clones. They also suggest that cell-intrinsic growth advantages alone may not fully explain tumor initiation, highlighting the importance of microenvironmental and tissue-level factors in early cancer evolution.

View details for DOI 10.1038/s41586-025-09930-y

View details for PubMedID 41291291
Polyclonal origins of human premalignant colorectal lesions. bioRxiv : the preprint server for biology Van Egeren, D., Schenck, R. O., Khan, A., Horning, A. M., Mo, S., Weiß, C. L., Esplin, E. D., Becker, W. R., Wu, S., Hanson, C., Barapour, N., Jiang, L., Contrepois, K., Lee, H., Nevins, S. A., Guha, T. K., Zhang, H., He, Z., Ma, Z., Monte, E., Karathanos, T. V., Laquindanum, R., Mills, M. A., Chaib, H., Chiu, R., Jian, R., Chan, J., Ellenberger, M., Bahmani, B., Michael, B., Weimer, A. K., Esplin, D. G., Lancaster, S., Shen, J., Ladabaum, U., Longacre, T. A., Kundaje, A., Greenleaf, W. J., Hu, Z., Ford, J. M., Snyder, M. P., Curtis, C. 2025

Abstract

Cancer is generally thought to be caused by expansion of a single mutant cell1. However, analyses of early colorectal cancer lesions suggest that tumors may instead originate from multiple, genetically distinct cell populations2,3. Detecting polyclonal tumor initiation is challenging in patients, as it requires profiling early-stage lesions before clonal sweeps obscure diversity. To investigate this, we analyzed normal colorectal mucosa, benign and dysplastic premalignant polyps, and malignant adenocarcinomas (123 samples) from six individuals with familial adenomatous polyposis (FAP). Individuals with FAP have a germline heterozygous APC mutation, predisposing them to colorectal cancer and numerous premalignant polyps by early adulthood4. Whole-genome and/or whole-exome sequencing revealed that many premalignant polyps (40% with benign histology and 28% with dysplasia) were composed of multiple genetic lineages that diverged early, consistent with polyclonal origins. This conclusion was reinforced by whole-genome sequencing of single crypts from multiple polyps in additional patients, which showed limited sharing of mutations among crypts within the same lesion. In some cases, multiple distinct APC mutations co-existed in different lineages of a single polyp, consistent with polyclonality. These findings reshape our understanding of early neoplastic events, demonstrating that tumor initiation can arise from the convergence of diverse mutant clones. They also suggest that cell-intrinsic growth advantages alone may not fully explain tumor initiation, highlighting the importance of microenvironmental and tissue-level factors in early cancer evolution.

View details for DOI 10.1101/2025.09.08.674484

View details for PubMedID 41292759

View details for PubMedCentralID PMC12642608
Vascular smooth muscle cell atherosclerosis trajectories characterized at single cell resolution identify causal transcriptomic and epigenomic mechanisms of disease risk Li, D., Kundu, S., Cheng, P., Gu, W., Jackson, W., Zhao, Q., Nguyen, T., Worssam, M., Monteiro, J., Palmisano, B., Weldy, C., Kundu, R., Kundaje, A., Wirka, R., Quertermous, T. LIPPINCOTT WILLIAMS & WILKINS. 2025

View details for DOI 10.1161/circ.152.suppl_3.4370980

View details for Web of Science ID 001613807100034
The epigenomic landscape of single vascular cells reflects developmental origin and identifies disease risk loci Weldy, C., Kundu, S., Monteiro, J., Gu, W., Pedroza, A., Dalal, A., Worssam, M., Li, D., Palmisano, B., Zhao, Q., Sharma, D., Nguyen, T., Kundu, R., Fischbein, M., Engreitz, J., Kundaje, A., Cheng, P., Quertermous, T. LIPPINCOTT WILLIAMS & WILKINS. 2025

View details for DOI 10.1161/circ.152.suppl_3.4358524

View details for Web of Science ID 001613911000034
GREGoR: accelerating genomics for rare diseases. Nature Dawood, M., Heavner, B., Wheeler, M. M., Ungar, R. A., LoTempio, J., Wiel, L., Berger, S., Bernstein, J. A., Chong, J. X., Délot, E. C., Eichler, E. E., Lupski, J. R., Shojaie, A., Talkowski, M. E., Wagner, A. H., Wei, C. L., Wellington, C., Wheeler, M. T., Carvalho, C. M., Gibbs, R. A., Gifford, C. A., May, S., Miller, D. E., Rehm, H. L., Samocha, K. E., Sedlazeck, F. J., Vilain, E., O'Donnell-Luria, A., Posey, J. E., Chadwick, L. H., Bamshad, M. J., Montgomery, S. B. 2025; 647 (8089): 331-342

Abstract

Rare diseases are collectively common, affecting approximately 1 in 20 individuals worldwide. In recent years, rapid progress has been made in rare disease diagnostics due to advances in next-generation sequencing, development of new computational and functional genomics approaches to prioritize genes and variants and increased global sharing of clinical and genetic data. However, more than half of individuals suspected to have a rare disease lack a genetic diagnosis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was initiated to study thousands of challenging rare disease cases and families and apply, standardize and evaluate emerging genomics technologies and analytics to accelerate their adoption in clinical practice. Furthermore, all data generated, currently representing over 7,500 individuals from over 3,000 families, are rapidly made available to researchers worldwide through the Analysis, Visualization and Informatics Lab-space (AnVIL) to catalyse global efforts to develop approaches for genetic diagnoses in rare diseases. Most of these families have undergone previous clinical genetic testing but remained unsolved, with most being exome-negative. Here we describe the collaborative research framework, datasets and discoveries comprising GREGoR that will provide foundational resources and substrates for the future of rare disease genomics.

View details for DOI 10.1038/s41586-025-09613-8

View details for PubMedID 41224980

View details for PubMedCentralID 9119004
Deep learning the dynamic regulatory sequence code of cardiac organoid differentiation. bioRxiv : the preprint server for biology Metzl-Raz, E., Zhao, R., Deshpande, S., Powell, J., Porter, E. G., Zouaghi, Y., Liu, B. B., Kim, S. H., Abdi, I., Evergreen, I., Agarwal, M., Sheth, M. U., Rico, J., Miyamoto, M., Sanchez, J. M., Engreitz, J. M., Kundaje, A., Greenleaf, W. J., Gifford, C. A. 2025

Abstract

Defining the temporal gene regulatory programs that drive human organogenesis is essential for understanding the origins of congenital disease. We combined a time-resolved, single-cell multi-omic atlas of human iPSC-derived cardiac organoids with deep learning models that predict chromatin accessibility from DNA sequence, enabling the discovery of the regulatory syntax underlying early heart development. This framework uncovered cell-state-specific rules of cardiogenesis, including context-dependent activities of TEAD, HAND, and TBX transcription factor families, and linked these motifs to their target genes. We identified distinct programs guiding lineage divergence, such as ventricular versus pacemaker cardiomyocytes, and validated predictions by perturbing Myocardin (MYOCD), establishing its essential role in ventricular specification. Integration of chromatin, transcriptional, and genetic data further highlighted regulatory regions and disease-associated variants that perturb differentiation state transitions, supporting evidence that suggests congenital heart disease emerges early in development. This work bridges developmental gene regulation with disease genetics, providing a foundation for mechanistic and therapeutic insights into congenital diseases.

View details for DOI 10.1101/2025.10.15.680997

View details for PubMedID 41279701

View details for PubMedCentralID PMC12632746
A cell and transcriptome atlas of human arterial vasculature. Cell genomics Zhao, Q., Pedroza, A., Sharma, D., Gu, W., Dalal, A., Weldy, C., Jackson, W., Li, D. Y., Ryan, Y., Nguyen, T., Shad, R., Palmisano, B. T., Monteiro, J. P., Worssam, M., Berezwitz, A., Iyer, M., Shi, H., Kundu, R., Limbu, L., Kim, J. B., Kundaje, A., Fischbein, M., Wirka, R., Quertermous, T., Cheng, P. 2025: 101034

Abstract

Arterial segments show differing disease propensities, yet mechanisms remain unknown. We compiled a transcriptomic and spatial atlas of healthy human arterial cells across multiple segments to understand these differences. Arteries demonstrated a stereotyped pattern of cell-specific, segmental heterogeneity not captured by common marker genes. Arterial identities are encoded in fibroblast and smooth muscle cell (SMC) transcriptomes. Differentially expressed genes enrich for disease loci. Fibroblast gene expression enriches for a disproportionate number of disease loci, highlighting an underrecognized role for fibroblasts in disease risk. Cells of different segments cluster more by embryonic origin than anatomy. Global analysis of disease regulons in fibroblasts and SMCs identified developmental transcription factors that persist into adulthood, suggesting a functional role of these factors in disease. Lastly, the heterogeneity of non-coding transcriptomes rivals that of protein-coding transcriptomes. Differentially expressed lncRNAs enrich for genetic signals for vascular diseases, suggesting a role for lncRNAs in vascular disease.

View details for DOI 10.1016/j.xgen.2025.101034

View details for PubMedID 41086809
Using gene-environment interactions to explore pathways for colorectal cancer risk. EBioMedicine Bouras, E., Yu, R., Kim, A. E., Markozannes, G., Murphy, N., Albanes, D., Anderson, L. N., Barry, E. L., Berndt, S. I., Bishop, D. T., Brenner, H., Burnett-Hartman, A., Campbell, P. T., Carreras-Torres, R., Chan, A. T., Cheng, I., Devall, M. A., Diez-Obrero, V., Dimou, N., Drew, D. A., Gruber, S. B., Gsur, A., Hoffmeister, M., Hsu, L., Huyghe, J. R., Kawaguchi, E., Keku, T. O., Kundaje, A., Küry, S., Le Marchand, L., Lewinger, J. P., Li, L., Lynch, B. M., Moreno, V., Morrison, J. L., Newton, C. C., Obón-Santacana, M., Palmer, J. R., Papadimitriou, N., Pellatt, A. J., Peoples, A. R., Pharoah, P. D., Platz, E. A., Qu, C., Ruiz-Narvaez, E., Mendez, J. S., Schoen, R. E., Stern, M. C., Thomas, C. E., Tian, Y., Um, C. Y., Visvanathan, K., Vodicka, P., Vymetalkova, V., White, E., Wolk, A., Woods, M. O., Wu, A. H., Gunter, M. J., Gauderman, W. J., Peters, U., Evangelou, M., Tsilidis, K. K. 2025; 121: 105964

Abstract

Colorectal cancer (CRC) is a significant public health concern, highlighting the critical need for identifying novel intervention targets for its prevention. We conducted genome-wide interaction analyses for 15 exposures with established or putative CRC risk [body mass index (BMI), height, physical activity, smoking, type 2 diabetes, use of menopausal hormone therapy, non-steroidal anti-inflammatory drugs, and intake of alcohol, calcium, fibre, folate, fruits, processed meat, red meat, and vegetables], and used interaction estimates to explore pathways and genes underlying CRC risk. The adaptive combination of Bayes Factors (ADABF) and over-representation analysis (ORA) were used for pathway analyses, and findings were further investigated using publicly available resources [hallmarks of cancer, Open Targets Platform (OTP)]. A total of 1973 pathways using ADABF, and 840 pathways using ORA, out of the 2950 analysed, were enriched (P < 0.05) for at least one exposure, as well as 1227 genes within the enriched pathways. Data were available for 811/1227 coding genes in the OTP, 241 of which were supported by strong relative abundance of prior evidence (overall OTP score > 0.05). Fifty percent of the genes (617/1227) mapped to at least one hallmark of cancer, most of which (388/617) pertained to the Sustaining Proliferative Signalling hallmark. Our findings reflect previously established pathways for CRC risk and highlight the emerging importance of several less studied genes. Common pathways were found for several combinations of exposures, potentially suggesting common underlying mechanisms. The results of the present analysis provide a basis for further functional research. If confirmed, they may help elucidate the etiological associations between risk factors and CRC risk and ultimately inform personalized prevention strategies. This study was funded by Cancer Research UK (CRUK; grant number: PPRCPJT∖100005) and World Cancer Research Fund International (WCRF; IIG_FULL_2020_022). Funding for grant IIG_FULL_2020_022 was obtained from Wereld Kanker Onderzoek Fonds (WKOF) as part of the World Cancer Research Fund International grant programme. Full funding details for the individual consortia are provided in the acknowledgements.

View details for DOI 10.1016/j.ebiom.2025.105964

View details for PubMedID 41076992
Predicting chromatin conformation contact maps. PloS one Min, A., Schreiber, J., Kundaje, A., Noble, W. S. 2025; 20 (9): e0331124

Abstract

Over the past 15 years, a variety of next-generation sequencing assays have been developed for measuring the 3D conformation of DNA in the nucleus. Each of these assays gives, for a particular cell or tissue type, a distinct picture of 3D chromatin architecture. Accordingly, making sense of the relationship between genome structure and function requires teasing apart two closely related questions: how does chromatin 3D structure change from one cell type to the next, and how do different measurements of that structure differ from one another, even when the two assays are carried out in the same cell type? In this work, we assemble a collection of chromatin 3D datasets, each represented as a 2D contact map, spanning multiple assay types and cell types. We then build a machine learning model that predicts missing contact maps in this collection. We use the model to systematically explore how genome 3D architecture changes, at the level of compartments, domains, and loops, between cell types and between assay types.

View details for DOI 10.1371/journal.pone.0331124

View details for PubMedID 41021649
The regulatory landscape of nascent transcription in human health and disease. bioRxiv : the preprint server for biology Shah, S. R., Chen, Y., Leung, A. K., Navarro, P. V., Paramo, M. I., Gupta, J., Gurumurthy, A., Fite, R. F., Weimer, A. K., Du, Q., Mohyeldin, A. M., Egli, D., Creusot, R. J., Ryan, R. J., Snyder, M. P., Clark, A. G., Lis, J. T., Yu, H. 2025

Abstract

Transcriptional regulatory elements (TREs) orchestrate gene expression programs fundamental to cellular identity and transitions between physiological and pathological states. Decoding the regulatory logic of human biology requires resolving where, when, and how these elements are transcriptionally engaged. Here, we profiled the active transcriptional regulatory landscape across all major organ systems and a broad spectrum of developmental and disease states using PRO-cap, a high-resolution method that captures nascent transcription start sites with unprecedented sensitivity and specificity. This atlas of active TREs highlights elements shaped by their cellular contexts and evolutionary constraints, sheds light on the genetic architecture of human traits and diseases, and reveals how patterns of transcription initiation and pausing encode regulatory logic. In cancer, nascent transcription enables the delineation of lineage-specific regulatory states, metastatic adaptations, and the co-option of pre-existing programs. Together, these findings establish nascent transcription as a core dimension of gene regulation, illuminating principles that govern development, physiology, and disease.

View details for DOI 10.1101/2025.09.24.676871

View details for PubMedID 41040139
Sensitive, direct detection of non-coding off-target base editor unwinding and editing in primary cells. bioRxiv : the preprint server for biology Wang, T., Jessa, S., Marinov, G. K., Klemm, S., Kundaje, A., Greenleaf, W. J. 2025

Abstract

Base editors create precise nucleotide changes in DNA, but their off-target activity remains challenging to quantify. Here, we develop and deploy a direct, in cellulo sequencing assay that simultaneously measures both Cas9-mediated unwinding and deaminase editing of genomic DNA (beCasKAS). Our strategy nominates >460-fold more potential off-target sites than other methods by enriching for Cas9-dependent R-loops immediately preceding editing. Using beCasKAS in primary human T-cells, we observe that mRNA-encoded ABE8e and PAMless ABE8e-SpRY base editors have distinct off-target profiles that can be mitigated by optimizing mRNA dose. Finally, we combine beCasKAS with base-resolution deep learning models to risk-stratify off-target edits by their likelihood of epigenetic dysregulation. Collectively, beCasKAS offers a sensitive and facile tool to optimize the balance between base editor on- and off-target activity.

View details for DOI 10.1101/2025.09.25.678665

View details for PubMedID 41040263

View details for PubMedCentralID PMC12485731
Epigenomic landscape of single vascular cells reflects developmental origin and disease risk loci. Molecular systems biology Weldy, C. S., Kundu, S., Monteiro, J., Gu, W., Pedroza, A. J., Dalal, A. R., Worssam, M. D., Li, D., Palmisano, B., Zhao, Q., Sharma, D., Nguyen, T., Kundu, R., Fischbein, M. P., Engreitz, J., Kundaje, A. B., Cheng, P. P., Quertermous, T. 2025

Abstract

Vascular sites have distinct susceptibility to atherosclerosis and aneurysm, yet the epigenomic and transcriptomic underpinning of vascular site-specific disease risk is largely unknown. Here, we performed single-cell chromatin accessibility (scATACseq) and gene expression profiling (scRNAseq) of mouse vascular tissue from three vascular sites. Through interrogation of epigenomic enhancers and gene regulatory networks, we discovered key regulatory enhancers to be not only cell type-specific but also vascular site-specific. We identified epigenetic markers of embryonic origin including developmental transcription factors such as Tbx20, Hand2, Gata4, and Hoxb family members and discovered transcription factor motif accessibility to be vascular site-specific for smooth muscle, fibroblasts, and endothelial cells. We further integrated genome-wide association data for aortic dimension, and using ChromBPNet, a deep learning model that predicts variant effects on chromatin accessibility, we predicted variant effects across cell type and vascular site of origin, revealing genomic regions enriched for specific TF motif footprints, including MEF2A, SMAD3, and HAND2. This work supports a paradigm that cell type and vascular site-specific enhancers govern complex genetic drivers of disease risk.

View details for DOI 10.1038/s44320-025-00140-2

View details for PubMedID 40931195

View details for PubMedCentralID 3357908
Enhancer-targeting CRISPR screens at coronary artery disease loci suggest shared mechanisms of disease risk. medRxiv : the preprint server for health sciences Ramste, M., Weldy, C., Kundu, S., Zhao, Q., Li, D., Brand, K., Sharma, D., Ramste, A., Jagoda, E., Ray, J., Caceres, R. D., Galante, J., Gschwind, A. R., Lahtinen, N., Nguyen, T., Amrute, J. M., Park, C. Y., Kim, J. B., Kaikkonen, M. U., Stitziel, N. O., Steinmetz, L., Kundaje, A., Engreitz, J. M., Quertermous, T. 2025

Abstract

To systematically identify causal genetic mechanisms that confer risk for coronary artery disease (CAD) in GWAS loci, we mapped genome-wide variant-to-enhancer-to-gene (V2E2G) links in vascular smooth muscle cells (SMC). Enhancers identified by active chromatin features, and further prioritized by base-resolution deep learning models of chromatin accessibility in 108 CAD loci, were studied with CRISPRi targeting and Direct-Capture Targeted Perturb-seq (DC-TAP-seq) evaluation of 470 genes. Seventy-six V2E2G links were identified for 59 candidate CAD genes representing gene programs including epithelial-mesenchymal transformation, ubiquitination, and protein folding as well as BMP and TGFB signaling. Similar methods employed with an independent focused screen targeting one candidate locus at 9p21.3 identified 10 enhancers regulating expression of multiple genes at this location. Detailed molecular studies revealed that two enhancers mediating transcription factor binding and transcriptional regulation contribute to ancestry-specific and sex-specific risk for CAD and the surrogate biomarker vascular calcification. Together, these studies advance our identification of GWAS CAD V2E2G links across the genome, and specific mechanisms of risk at the complex 9p21.3 locus.

View details for DOI 10.1101/2025.08.28.25334684

View details for PubMedID 40950476

View details for PubMedCentralID PMC12424881
Genetic risk factors modulate the association between physical activity and colorectal cancer. Research square Peoples, A. R., Obón-Santacana, M., Kim, A. E., Kawaguchi, E. S., Fu, Y., Qu, C., Moratalla-Navarro, F., Morrison, J., Lin, Y., Arndt, V., Berndt, S. I., Bien, S. A., Bishop, D. T., Bouras, E., Brenner, H., Buchanan, D. D., Campbell, P. T., Chan, A. T., Chang-Claude, J., Conti, D. V., Corley, D. A., Devall, M. A., Dimou, N., Drew, D. A., Gruber, S. B., Gunter, M. J., Harlid, S., Harrison, T. A., Hoffmeister, M., Hsu, L., Huyghe, J. R., Keku, T. O., Kundaje, A., Lewinger, J. P., Li, L., Lynch, B. M., Marchand, L. L., Martín, V., Murphy, N., Newton, C. C., Ogino, S., Hardikar, S., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Pellatt, A. J., Pinchev, M., Platz, E. A., Potter, J. D., Rennert, G., Ruiz-Narvaez, E. A., Sakoda, L. C., Schoen, R. E., Shcherbina, A., Stern, M. C., Su, Y. R., Thomas, C. E., Tian, Y., Tsilidis, K. K., Um, C. Y., van Duijnhoven, F. J., van Guelpen, B., Visvanathan, K., Wang, J., White, E., Wolk, A., Woods, M. O., Wu, A. H., Ulrich, C. M., Peters, U., Gauderman, W. J., Moreno, V. 2025

Abstract

Physical activity (PA) is an established protective factor for colorectal cancer (CRC), but it is unclear if genetic variants modify this effect. To investigate this possibility, we conducted a genome-wide gene-PA interaction analysis. Using logistic regression and two-step and joint tests, we analyzed interactions between common genetic variants across the genome and PA in relation to CRC risk. Self-reported PA levels were categorized as active (≥ 8.75 MET-h/wk) vs. inactive (< 8.75 MET-h/wk) and as study- and sex-specific quartiles of activity. PA had an overall protective effect on CRC (OR [active vs. inactive] = 0.85; 95%CI = 0.81-0.90). The two-step GxE method identified an interaction between rs4779584, an intergenic variant near the GREM1 and SCG5 genes, and PA for CRC risk (p-interaction = 2.6×10⁻⁸). Stratification by genotype at this locus showed a significant reduction in CRC risk by 20% in active vs. inactive participants with the CC genotype (OR = 0.80; 95%CI = 0.75-0.85), but no significant PA-CRC association among CT or TT carriers. When PA was modeled as quartiles, the 1-d.f. GxE test identified that rs56906466, an intergenic variant near the KCNG1 gene, modified the association between PA and CRC (p-interaction = 3.5×10⁻⁸). Stratification at this locus showed that increase in PA (highest vs. lowest quartile) was associated with a lower CRC risk solely among TT carriers (OR = 0.77; 95%CI = 0.72-0.82). In summary, we identified two genetic variants that modified the association between PA and CRC risk. One of them, related to GREM1 and SCG5, suggests that the bone morphogenetic protein (BMP)-related, inflammatory, and/or insulin signaling pathways may be associated with the protective influence of PA on colorectal carcinogenesis.

View details for DOI 10.21203/rs.3.rs-7350654/v1

View details for PubMedID 40951278

View details for PubMedCentralID PMC12425077
Achieving inclusive healthcare through integrating education and research with AI and personalized curricula. Communications medicine Bahmani, A., Cha, K., Alavi, A., Dixit, A., Ross, A., Park, R., Goncalves, F., Ma, S., Saxman, P., Nair, R., Akhavan-Sarraf, R., Zhou, X., Wang, M., Contrepois, K., Li-Pook-Than, J., Monte, E., Rodriguez, D. J., Lai, J., Babu, M., Tondar, A., Schüssler-Fiorenza Rose, S. M., Akbari, I., Zhang, X., Yegnashankaran, K., Yracheta, J., Dale, K., Miller, A. D., Edmiston, S., McGhee, E. M., Nebeker, C., Wu, J. C., Kundaje, A., Snyder, M. 2025; 5 (1): 356

Abstract

Precision medicine promises significant health benefits but faces challenges such as complex data management and analytics, interdisciplinary collaboration, and education of researchers, healthcare professionals, and participants. Addressing these needs requires the integration of computational experts, engineers, designers, and healthcare professionals to develop user-friendly systems and shared terminologies. The widespread adoption of large language models (LLMs) such as Generative Pretrained Transformer (GPT) and Claude highlights the importance of making complex data accessible to non-specialists. We evaluated the Stanford Data Ocean (SDO) precision medicine training program's learning outcomes, AI Tutor performance, and learner satisfaction by assessing self-rated competency on key learning objectives through pre- and post-learning surveys, along with formative and summative assessment completion rates. We also analyzed AI Tutor accuracy and learners' self-reported satisfaction, and post-program academic and career impacts. Additionally, we demonstrated the capabilities of the AI Data Visualization tool. SDO demonstrates the ability to improve learning outcomes for learners from broad educational and socioeconomic backgrounds with the support of the AI Tutor. The AI Data Visualization tool enables learners to interpret multi-omics and wearable data and replicate research findings. SDO strives to mitigate challenges in precision medicine through a scalable, cloud-based platform that supports data management for various data types, advanced research, and personalized learning. SDO provides AI Tutors and AI-powered data visualization tools to enhance educational and research outcomes and make data analysis accessible to users from broad educational backgrounds. By extending engagement and cutting-edge research capabilities globally, SDO particularly benefits economically disadvantaged and historically marginalized communities, fostering interdisciplinary biomedical research and bridging the gap between education and practical application in the biomedical field.

View details for DOI 10.1038/s43856-025-01034-y

View details for PubMedID 40819118

View details for PubMedCentralID 9108683
DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA. ArXiv Patel, A., Singhal, A., Wang, A., Pampari, A., Kasowski, M., Kundaje, A. 2025

Abstract

Recent advances in self-supervised models for natural language, vision, and protein sequences have catalyzed the development of genomic DNA language models (DNALMs). These models aim to learn generalizable representations of diverse DNA elements, potentially enabling various downstream genomic prediction, interpretation and design tasks. However, existing benchmarks do not adequately assess the capabilities of DNALMs on an important class of non-coding DNA elements critical for regulating gene activity. Here, we introduce DART-Eval, a suite of representative benchmarks focused on regulatory DNA to evaluate performance of DNALMs across zero-shot, probed, and fine-tuned settings against contemporary ab initio models as baselines. DART-Eval addresses biologically relevant tasks including sequence motif discovery, cell-type specific regulatory activity prediction, and counterfactual prediction of regulatory genetic variants. Our systematic evaluations reveal that current annotation-agnostic DNALMs exhibit inconsistent performance and do not offer compelling gains over alternative baseline models for most tasks, despite requiring significantly more computational resources. We discuss potentially promising modeling, data curation, and evaluation strategies for the next generation of DNALMs. Our benchmark datasets and evaluation framework are available at https://github.com/kundajelab/DART-Eval.

View details for PubMedID 40799805

View details for PubMedCentralID PMC12340898
Cell type-specific purifying selection of synonymous mitochondrial DNA variation. Proceedings of the National Academy of Sciences of the United States of America Lareau, C. A., Maschmeyer, P., Yin, Y., Gutierrez, J. C., Dhindsa, R. S., Gribling-Burrer, A. S., Zielinski, S., Hsieh, Y. H., Nitsch, L., Dimitrova, V., Nalbant, B., Buquicchio, F. A., Abay, T., Stickels, R. R., Ulirsch, J. C., Yan, P., Wang, F., Miao, Z., Sandor, K., Daniel, B., Liu, V., Mendez, P. L., Knaus, P., Meyer, M., Greenleaf, W. J., Kundaje, A., Smyth, R. P., Munschauer, M., Ludwig, L. S., Satpathy, A. T. 2025; 122 (30): e2505704122

Abstract

While somatic variants are well-characterized drivers of tumor evolution, their influence on cellular fitness in nonmalignant contexts remains understudied. We identified a mosaic synonymous variant (m.7076A > G) in the mitochondrial DNA (mtDNA)-encoded cytochrome c-oxidase subunit 1 (MT-CO1, p.Gly391=), present at homoplasmy in 47% of immune cells from a healthy donor. Single-cell multiomics revealed strong, lineage-specific selection against the m.7076G allele in CD8+ effector memory T cells, but not other T cell subsets, mirroring patterns of purifying selection of pathogenic mtDNA alleles. The limited anticodon diversity of mitochondrial tRNAs forces m.7076G translation to rely on wobble pairing, unlike the Watson-Crick-Franklin pairing used for m.7076A. Mitochondrial ribosome profiling confirmed stalled translation of the m.7076G allele. Functional analyses demonstrated that the elevated translational and metabolic demands of short-lived effector T cells (SLECs) amplify dependence on MT-CO1, driving this selective pressure. These findings suggest that synonymous variants can alter codon syntax, impacting mitochondrial physiology in a cell type-specific manner.

View details for DOI 10.1073/pnas.2505704122

View details for PubMedID 40705423
In vivo mapping of mutagenesis sensitivity of human enhancers. Nature Kosicki, M., Zhang, B., Hecht, V., Pampari, A., Cook, L. E., Slaven, N., Akiyama, J. A., Plajzer-Frick, I., Novak, C. S., Kato, M., Tran, S., Hunter, R. D., von Maydell, K., Barton, S., Beckman, E., Zhu, Y., Dickel, D. E., Kundaje, A., Visel, A., Pennacchio, L. A. 2025

Abstract

Distant-acting enhancers are central to human development. However, our limited understanding of their functional sequence features prevents the interpretation of enhancer mutations in disease. Here we determined the functional sensitivity to mutagenesis of human developmental enhancers in vivo. Focusing on seven enhancers that are active in the developing brain, heart, limb and face, we created over 1,700 transgenic mice for over 260 mutagenized enhancer alleles. Systematic mutation of 12-base-pair blocks collectively altered each sequence feature in each enhancer at least once. We show that 69% of all blocks are required for normal in vivo activity, with mutations more commonly resulting in loss (60%) than in gain (9%) of function. Using predictive modelling, we annotated critical nucleotides at the base-pair resolution. The vast majority of motifs predicted by these machine learning models (88%) coincided with changes in in vivo function, and the models showed considerable sensitivity, identifying 59% of all functional blocks. Taken together, our results reveal that human enhancers contain a high density of sequence features that are required for their normal in vivo function and provide a rich resource for further exploration of human enhancer logic.

View details for DOI 10.1038/s41586-025-09182-w

View details for PubMedID 40533554

View details for PubMedCentralID 5123704
Red meat intake interacts with a TGF-β-pathway-based polygenic risk score to impact colorectal cancer risk: Application of a novel approach for polygenic risk score construction. medRxiv : the preprint server for health sciences Mendez, J. S., Queme, B., Fu, Y., Morrison, J., Lewinger, J. P., Kawaguchi, E., Mi, H., Obón-Santacana, M., Moratalla-Navarro, F., Martín, V., Moreno, V., Lin, Y., Bien, S. A., Qu, C., Su, Y. R., White, E., Harrison, T. A., Huyghe, J. R., Tangen, C. M., Newcomb, P. A., Phipps, A. I., Thomas, C. E., Conti, D. V., Wang, J., Platz, E. A., Keku, T. O., Newton, C. C., Um, C. Y., Kundaje, A., Shcherbina, A., Murphy, N., Gunter, M. J., Dimou, N., Papadimitriou, N., Bézieau, S., van Duijnhoven, F. J., Männistö, S., Rennert, G., Wolk, A., Hoffmeister, M., Brenner, H., Chang-Claude, J., Tian, Y., Le Marchand, L., Cotterchio, M., Tsilidis, K. K., Bishop, D. T., Melaku, Y. A., Lynch, B. M., Buchanan, D. D., Ulrich, C. M., Ose, J., Peoples, A. R., Pellatt, A. J., Li, L., Devall, M. A., Campbell, P. T., Albanes, D., Weinstein, S. J., Berndt, S. I., Gruber, S. B., Ruiz-Narvaez, E., Song, M., Joshi, A. D., Drew, D. A., Petrick, J. L., Chan, A. T., Giannakis, M., Hsu, L., Peters, U., Gauderman, W. J., Stern, M. C. 2025

Abstract

High intake of red and/or processed meat is an established colorectal cancer (CRC) risk factor. Genome-wide association studies (GWAS) have reported 204 variants (G) associated with CRC risk. We used functional annotation data to identify subsets of variants within known pathways and constructed pathway-based Polygenic Risk Scores (pPRS) to model pPRS x environment (E) interactions. A pooled sample of 30,812 cases and 40,504 CRC controls of European ancestry from 27 studies were analyzed. Quantiles for red and processed meat intake were constructed. The 204 GWAS variants were annotated to genes with AnnoQ and assessed for overrepresentation in PANTHER-reported pathways. pPRSs were constructed from significantly overrepresented pathways. Covariate-adjusted logistic regression models evaluated pPRSxE interactions with red or processed meat intake in relation to CRC risk. A total of 30 variants were overrepresented in four pathways: Alzheimer disease-presenilin, Cadherin/WNT-signaling, Gonadotropin-releasing hormone receptor, and TGF-β signaling. We found a significant interaction between TGF-β-pPRS and red meat intake (p = 0.003). When variants in the TGF-β pathway were assessed, significant interactions with red meat for rs2337113 (intron SMAD7 gene, Chr18), and rs2208603 (intergenic region BMP5, Chr6) (p = 0.013 & 0.011, respectively) were observed. We did not find evidence of pPRS x red meat interactions for other pathways or with processed meat. This pathway-based interaction analysis revealed a significant interaction between variants in the TGF-β pathway and red meat consumption that impacts CRC risk. These findings shed light on the possible mechanistic link between CRC risk and red meat consumption.

View details for DOI 10.1101/2025.06.13.25329599

View details for PubMedID 40568668

View details for PubMedCentralID PMC12191096
Multiomic profiling reveals that prostaglandin E2 reverses aged muscle stem cell dysfunction, leading to increased regeneration and strength. Cell stem cell Wang, Y. X., Palla, A. R., Ho, A. T., Robinson, D. C., Ravichandran, M., Markov, G. J., Mai, T., Still, C. 2., Balsubramani, A., Nair, S., Holbrook, C. A., Yang, A. V., Kraft, P. E., Su, S., Burns, D. M., Yucel, N. D., Qi, L. S., Kundaje, A., Blau, H. M. 2025

Abstract

Repair of muscle damage declines with age due to the accumulation of dysfunctional muscle stem cells (MuSCs). Here, we uncover that aged MuSCs have blunted prostaglandin E2 (PGE2)-EP4 receptor signaling, which causes precocious commitment and mitotic catastrophe. Treatment with PGE2 alters chromatin accessibility and overcomes the dysfunctional aged MuSC fate trajectory, increasing viability and triggering cell cycle re-entry. We employ neural network models to learn the complex logic of transcription factors driving the change in accessibility. After PGE2 treatment, we detect increased transcription factor binding at sites with CRE and E-box motifs and reduced binding at sites with AP1 motifs, overcoming the changes that occur with age. We find that short-term exposure of aged MuSCs to PGE2 augments their long-term regenerative capacity upon transplantation. Strikingly, PGE2 injections following myotoxin- or exercise-induced injury overcome the aged niche, leading to enhanced regenerative function of endogenous tissue-resident MuSCs and an increase in strength.

View details for DOI 10.1016/j.stem.2025.05.012

View details for PubMedID 40513560
Deep learning guided design of cell type-specific AAV promoters Wang, S. K., Nair, S., Deng, B., Ren, X., Shah, S., Kundaje, A., Chang, H. Y., Wang, S. ASSOC RESEARCH VISION OPHTHALMOLOGY INC. 2025

View details for Web of Science ID 001559990100037
Predicting expression-altering promoter mutations with deep learning. Science (New York, N.Y.) Jaganathan, K., Ersaro, N., Novakovsky, G., Wang, Y., James, T., Schwartzentruber, J., Fiziev, P., Kassam, I., Cao, F., Hawe, J., Cavanagh, H., Lim, A., Png, G., McRae, J., Banerjee, A., Kumar, A., Ulirsch, J., Zhang, Y., Aguet, F., Wainschtein, P., Sundaram, L., Salcedo, A., Kyriazopoulou Panagiotopoulou, S., Aghamirzaie, D., Padhi, E., Weng, Z., Dong, S., Smedley, D., Caulfield, M., O'Donnell-Luria, A., Rehm, H. L., Sanders, S. J., Kundaje, A., Montgomery, S. B., Ross, M. T., Farh, K. K. 2025: eads7373

Abstract

Only a minority of patients with rare genetic diseases are currently diagnosed by exome sequencing, suggesting that additional unrecognized pathogenic variants may reside in non-coding sequence. Here, we describe PromoterAI, a deep neural network that accurately identifies non-coding promoter variants which dysregulate gene expression. We show that promoter variants with predicted expression-altering consequences produce outlier expression at both RNA and protein levels in thousands of individuals, and that these variants experience strong negative selection in human populations. We observe that clinically relevant genes in rare disease patients are enriched for such variants and validate their functional impact through reporter assays. Our estimates suggest that promoter variation accounts for 6% of the genetic burden associated with rare diseases.

View details for DOI 10.1126/science.ads7373

View details for PubMedID 40440429
The epigenomic landscape of single vascular cells reflects developmental origin and identifies disease risk loci. bioRxiv : the preprint server for biology Weldy, C. S., Kundu, S., Monteiro, J., Gu, W., Pedroza, A. J., Dalal, A. R., Worssam, M. D., Li, D., Palmisano, B., Zhao, Q., Sharma, D., Nguyen, T., Kundu, R., Fischbein, M. P., Engreitz, J., Kundaje, A. B., Cheng, P. P., Quertermous, T. 2025

Abstract

Vascular sites have distinct susceptibility to atherosclerosis and aneurysm, yet the biological underpinning of vascular site-specific disease risk is largely unknown. Vascular tissues have different developmental origins that may influence global chromatin accessibility, and understanding differential chromatin accessibility, gene expression profiles, and gene regulatory networks (GRN) at single-cell resolution may give key insight into vascular site-specific disease risk. Here, we performed single-cell chromatin accessibility (scATACseq) and gene expression profiling (scRNAseq) of healthy adult mouse vascular tissue from three vascular sites: 1) aortic root and ascending aorta, 2) brachiocephalic and carotid artery, and 3) descending thoracic aorta. Through a comprehensive analysis at single-cell resolution, we discovered key regulatory enhancers to be not only cell type specific but also vascular site specific in vascular smooth muscle cells (SMC), fibroblasts, and endothelial cells. We identified epigenetic markers of embryonic origin with differential chromatin accessibility of key developmental transcription factors such as Tbx20, Hand2, Gata4, and Hoxb family members and discovered transcription factor motif accessibility to be cell type and vascular site specific. Notably, we found ascending fibroblasts to have distinct epigenomic patterns, highlighting SMAD2/3 function to suggest a differential susceptibility to TGFβ, a finding we confirmed through in vitro culture of primary adventitial fibroblasts. Finally, to understand how vascular site-specific enhancers may regulate human genetic risk for disease, we integrated genome-wide association study (GWAS) data for ascending and descending aortic dimension and used ChromBPNet, a base-resolution deep learning model of chromatin accessibility, to predict variant effects in SMC, fibroblasts, and endothelial cells within ascending aorta, carotid, and descending aorta sites of origin. We reveal that although cell type remains a primary influence on variant effects, vascular site modifies cell type transcription and highlights genomic regions that are enriched for specific TF motif footprints, including MEF2A, SMAD3, and HAND2. This work supports a paradigm that the epigenomic and transcriptomic landscapes of vascular cells are cell type and vascular site specific and that site-specific enhancers govern complex genetic drivers of disease risk.

View details for DOI 10.1101/2022.05.18.492517

View details for PubMedID 40655014

View details for PubMedCentralID PMC12247710
Rewriting regulatory DNA to dissect and reprogram gene expression. Cell Martyn, G. E., Montgomery, M. T., Jones, H., Guo, K., Doughty, B. R., Linder, J., Bisht, D., Xia, F., Cai, X. S., Chen, Z., Cochran, K., Lawrence, K. A., Munson, G., Pampari, A., Fulco, C. P., Sahni, N., Kelley, D. R., Lander, E. S., Kundaje, A., Engreitz, J. M. 2025

Abstract

Regulatory DNA provides a platform for transcription factor binding to encode cell-type-specific patterns of gene expression. However, the effects and programmability of regulatory DNA sequences remain difficult to map or predict. Here, we develop variant effects from flow-sorting experiments with CRISPR targeting screens (Variant-EFFECTS) to introduce hundreds of designed edits to endogenous regulatory DNA and quantify their effects on gene expression. We systematically dissect and reprogram 3 regulatory elements for 2 genes in 2 cell types. These data reveal endogenous binding sites with effects specific to genomic context, transcription factor motifs with cell-type-specific activities, and limitations of computational models for predicting the effect sizes of variants. We identify small edits that can tune gene expression over a large dynamic range, suggesting new possibilities for prime-editing-based therapeutics targeting regulatory DNA. Variant-EFFECTS provides a generalizable tool to dissect regulatory DNA and to identify genome editing reagents that tune gene expression in an endogenous context.

View details for DOI 10.1016/j.cell.2025.03.034

View details for PubMedID 40245860
An updated compendium and reevaluation of the evidence for nuclear transcription factor occupancy over the mitochondrial genome. PloS one Marinov, G. K., Ramalingam, V., Greenleaf, W. J., Kundaje, A. 2025; 20 (3): e0318796

Abstract

In most eukaryotes, mitochondrial organelles contain their own genome, usually circular, which is the remnant of the genome of the ancestral bacterial endosymbiont that gave rise to modern mitochondria. Mitochondrial genomes are dramatically reduced in their gene content due to the process of endosymbiotic gene transfer to the nucleus; as a result most mitochondrial proteins are encoded in the nucleus and imported into mitochondria. This includes the components of the dedicated mitochondrial transcription and replication systems and regulatory factors, which are entirely distinct from the information processing systems in the nucleus. However, since the 1990s several nuclear transcription factors have been reported to act in mitochondria, and previously we identified 8 human and 3 mouse transcription factors (TFs) with strong localized enrichment over the mitochondrial genome using ChIP-seq (Chromatin Immunoprecipitation) datasets from the second phase of the ENCODE (Encyclopedia of DNA Elements) Project Consortium. Here, we analyze the ENCODE compendium of TF ChIP-seq datasets, greatly expanded in the intervening decade (a total of 6,153 ChIP experiments for 942 proteins, of which 763 are sequence-specific TFs), combined with interpretative deep learning models of TF occupancy to create a comprehensive compendium of nuclear TFs that show evidence of association with the mitochondrial genome. We find some evidence for chrM occupancy for 50 nuclear TFs and two other proteins, with bZIP TFs emerging as most likely to be playing a role in mitochondria. However, we also observe that in cases where the same TF has been assayed with multiple antibodies and ChIP protocols, evidence for its chrM occupancy is not always reproducible. In the light of these findings, we discuss the evidential criteria for establishing chrM occupancy and reevaluate the overall compendium of putative mitochondrial-acting nuclear TFs.

View details for DOI 10.1371/journal.pone.0318796

View details for PubMedID 40163815
Single-cell multiome and enhancer connectome of human retinal pigment epithelium and choroid nominate pathogenic variants in age-related macular degeneration. bioRxiv : the preprint server for biology Wang, S. K., Li, J., Nair, S., Korasaju, R., Chen, Y., Zhang, Y., Kundaje, A., Liu, Y., Wang, N., Chang, H. Y. 2025

Abstract

Age-related macular degeneration (AMD) is a leading cause of vision loss worldwide. Genome-wide association studies (GWAS) of AMD have identified dozens of risk loci that may house disease targets. However, variants at these loci are largely noncoding, making it difficult to assess their function and whether they are causal. Here, we present a single-cell gene expression and chromatin accessibility atlas of human retinal pigment epithelium (RPE) and choroid to systematically analyze both coding and noncoding variants implicated in AMD. We employ HiChIP and Activity-by-Contact modeling to map enhancers in these tissues and predict cell and gene targets of risk variants. We further perform allele-specific self-transcribing active regulatory region sequencing (STARR-seq) to functionally test variant activity in RPE cells, including in the context of complement activation. Our work nominates new pathogenic variants and mechanisms in AMD and offers a rich and accessible resource for studying diseases of the RPE and choroid.

View details for DOI 10.1101/2025.03.21.644670

View details for PubMedID 40196652

View details for PubMedCentralID PMC11974679
CXCL12 drives natural variation in coronary artery anatomy across diverse populations. Cell Rios Coronado, P. E., Zhou, J., Fan, X., Zanetti, D., Naftaly, J. A., Prabala, P., Martínez Jaimes, A. M., Farah, E. N., Kundu, S., Deshpande, S. S., Evergreen, I., Kho, P. F., Ma, Q., Hilliard, A. T., Abramowitz, S., Pyarajan, S., Dochtermann, D., Damrauer, S. M., Chang, K. M., Levin, M. G., Winn, V. D., Paşca, A. M., Plomondon, M. E., Waldo, S. W., Tsao, P. S., Kundaje, A., Chi, N. C., Clarke, S. L., Red-Horse, K., Assimes, T. L. 2025

Abstract

Coronary arteries have a specific branching pattern crucial for oxygenating heart muscle. Among humans, there is natural variation in coronary anatomy with respect to perfusion of the inferior/posterior left heart, which can branch from either the right arterial tree, the left, or both - a phenotype known as coronary dominance. Using angiographic data for >60,000 US veterans of diverse ancestry, we conducted a genome-wide association study of coronary dominance, revealing moderate heritability and identifying ten significant loci. The strongest association occurred near CXCL12 in both European- and African-ancestry cohorts, with downstream analyses implicating effects on CXCL12 expression. We show that CXCL12 is expressed in human fetal hearts at the time dominance is established. Reducing Cxcl12 in mice altered coronary dominance and caused septal arteries to develop away from Cxcl12 expression domains. These findings indicate that CXCL12 patterns human coronary arteries, paving the way for "medical revascularization" through targeting developmental pathways.

View details for DOI 10.1016/j.cell.2025.02.005

View details for PubMedID 40049164
Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage. Cell genomics Naqvi, S., Kim, S., Tabatabaee, S., Pampari, A., Kundaje, A., Pritchard, J. K., Wysocka, J. 2025: 100780

Abstract

Deep learning models have advanced our ability to predict cell-type-specific chromatin patterns from transcription factor (TF) binding motifs, but their application to perturbed contexts remains limited. We applied transfer learning to predict how concentrations of the dosage-sensitive TFs TWIST1 and SOX9 affect regulatory element (RE) chromatin accessibility in facial progenitor cells, achieving near-experimental accuracy. High-affinity motifs that allow for heterotypic TF co-binding and are concentrated at the center of REs buffer against quantitative changes in TF dosage and predict unperturbed accessibility. Conversely, low-affinity or homotypic binding motifs distributed throughout REs drive sensitive responses with minimal impact on unperturbed accessibility. Both buffering and sensitizing features display purifying selection signatures. We validated these sequence features through reporter assays and demonstrated that TF-nucleosome competition can explain low-affinity motifs' sensitizing effects. This combination of transfer learning and quantitative chromatin response measurements provides a novel approach for uncovering additional layers of the cis-regulatory code.

View details for DOI 10.1016/j.xgen.2025.100780

View details for PubMedID 40020686
Mapping the regulatory effects of common and rare non-coding variants across cellular and developmental contexts in the brain and heart. bioRxiv : the preprint server for biology Marderstein, A. R., Kundu, S., Padhi, E. M., Deshpande, S., Wang, A., Robb, E., Sun, Y., Yun, C. M., Pomales-Matos, D., Xie, Y., Nachun, D., Jessa, S., Kundaje, A., Montgomery, S. B. 2025

Abstract

Whole genome sequencing has identified over a billion non-coding variants in humans, while GWAS has revealed the non-coding genome as a significant contributor to disease. However, prioritizing causal common and rare non-coding variants in human disease, and understanding how selective pressures have shaped the non-coding genome, remains a significant challenge. Here, we predicted the effects of 15 million variants with deep learning models trained on single-cell ATAC-seq across 132 cellular contexts in adult and fetal brain and heart, producing nearly two billion context-specific predictions. Using these predictions, we distinguish candidate causal variants underlying human traits and diseases and their context-specific effects. While common variant effects are more cell-type-specific, rare variants exert more cell-type-shared regulatory effects, with selective pressures particularly targeting variants affecting fetal brain neurons. To prioritize de novo mutations with extreme regulatory effects, we developed FLARE, a context-specific functional genomic model of constraint. FLARE outperformed other methods in prioritizing case mutations from autism-affected families near syndromic autism-associated genes; for example, identifying mutation outliers near CNTNAP2 that would be missed by alternative approaches. Overall, our findings demonstrate the potential of integrating single-cell maps with population genetics and deep learning-based variant effect prediction to elucidate mechanisms of development and disease, ultimately supporting the notion that genetic contributions to neurodevelopmental disorders are predominantly rare.

View details for DOI 10.1101/2025.02.18.638922

View details for PubMedID 40027628

View details for PubMedCentralID PMC11870466
Publisher Correction: The ENCODE Imputation Challenge: a critical assessment of methods for cross-cell type imputation of epigenomic profiles. Genome biology Schreiber, J. M., Boix, C. A., Wook Lee, J., Li, H., Guan, Y., Chang, C. C., Chang, J. C., Hawkins-Hooker, A., Schölkopf, B., Schweikert, G., Carulla, M. R., Canakoglu, A., Guzzo, F., Nanni, L., Masseroli, M., Carman, M. J., Pinoli, P., Hong, C., Yip, K. Y., Spence, J. P., Batra, S. S., Song, Y. S., Mahony, S., Zhang, Z., Tan, W., Shen, Y., Sun, Y., Shi, M., Adrian, J., Sandstrom, R. S., Farrell, N. P., Halow, J. M., Lee, K., Jiang, L., Yang, X., Epstein, C. B., Strattan, J. S., Bernstein, B. E., Snyder, M. P., Kellis, M., Noble, W. S., Kundaje, A. B. 2025; 26 (1): 31

View details for DOI 10.1186/s13059-025-03494-w

View details for PubMedID 39948633

View details for PubMedCentralID 10111747
MorPhiC Consortium: towards functional characterization of all human genes. Nature Adli, M., Przybyla, L., Burdett, T., Burridge, P. W., Cacheiro, P., Chang, H. Y., Engreitz, J. M., Gilbert, L. A., Greenleaf, W. J., Hsu, L., Huangfu, D., Hung, L. H., Kundaje, A., Li, S., Parkinson, H., Qiu, X., Robson, P., Schürer, S. C., Shojaie, A., Skarnes, W. C., Smedley, D., Studer, L., Sun, W., Vidović, D., Vierbuchen, T., White, B. S., Yeung, K. Y., Yue, F., Zhou, T. 2025; 638 (8050): 351-359

Abstract

Recent advances in functional genomics and human cellular models have substantially enhanced our understanding of the structure and regulation of the human genome. However, our grasp of the molecular functions of human genes remains incomplete and biased towards specific gene classes. The Molecular Phenotypes of Null Alleles in Cells (MorPhiC) Consortium aims to address this gap by creating a comprehensive catalogue of the molecular and cellular phenotypes associated with null alleles of all human genes using in vitro multicellular systems. In this Perspective, we present the strategic vision of the MorPhiC Consortium and discuss various strategies for generating null alleles, as well as the challenges involved. We describe the cellular models and scalable phenotypic readouts that will be used in the consortium's initial phase, focusing on 1,000 protein-coding genes. The resulting molecular and cellular data will be compiled into a catalogue of null-allele phenotypes. The methodologies developed in this phase will establish best practices for extending these approaches to all human protein-coding genes. The resources generated (including engineered cell lines, plasmids, phenotypic data, genomic information and computational tools) will be made available to the broader research community to facilitate deeper insights into human gene functions.

View details for DOI 10.1038/s41586-024-08243-w

View details for PubMedID 39939790

View details for PubMedCentralID 9903716
Author Correction: Global loss of promoter-enhancer connectivity and rebalancing of gene expression during early colorectal cancer carcinogenesis. Nature cancer Zhu, Y., Lee, H., White, S., Weimer, A. K., Monte, E., Horning, A., Nevins, S. A., Esplin, E. D., Paul, K., Krieger, G., Shipony, Z., Chiu, R., Laquindanum, R., Karathanos, T. V., Chua, M. W., Mills, M., Ladabaum, U., Longacre, T., Shen, J., Jaimovich, A., Lipson, D., Kundaje, A., Greenleaf, W. J., Curtis, C., Ford, J. M., Snyder, M. P. 2025

View details for DOI 10.1038/s43018-025-00915-4

View details for PubMedID 39865177
Artificial intelligence in molecular biology. Molecular cell Kundaje, A., Pollard, K. S., Ma, J., Chang, X., Chen, M., Rohs, R. 2025; 85 (2): 193-198

Abstract

In recent years, computational methods and artificial intelligence approaches have proven uniquely suited for studying patterns in molecular biology. In this focus issue, we spoke with researchers about using these tools to address various biological questions and explore both current implications and future possibilities.

View details for Web of Science ID 001416249000001

View details for PubMedID 39824161
ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants. bioRxiv : the preprint server for biology Pampari, A., Shcherbina, A., Kvon, E. Z., Kosicki, M., Nair, S., Kundu, S., Kathiria, A. S., Risca, V. I., Kuningas, K., Alasoo, K., Greenleaf, W. J., Pennacchio, L. A., Kundaje, A. 2025

Abstract

Despite extensive mapping of cis-regulatory elements (cREs) across cellular contexts with chromatin accessibility assays, the sequence syntax and genetic variants that regulate transcription factor (TF) binding and chromatin accessibility at context-specific cREs remain elusive. We introduce ChromBPNet, a deep learning DNA sequence model of base-resolution accessibility profiles that detects, learns and deconvolves assay-specific enzyme biases from regulatory sequence determinants of accessibility, enabling robust discovery of compact TF motif lexicons, cooperative motif syntax and precision footprints across assays and sequencing depths. Extensive benchmarks show that ChromBPNet, despite its lightweight design, is competitive with much larger contemporary models at predicting variant effects on chromatin accessibility, pioneer TF binding and reporter activity across assays, cell contexts and ancestry, while providing interpretation of disrupted regulatory syntax. ChromBPNet also helps prioritize and interpret regulatory variants that influence complex traits and rare diseases, thereby providing a powerful lens to decode regulatory DNA and genetic variation.

View details for DOI 10.1101/2024.12.25.630221

View details for PubMedID 39829783

View details for PubMedCentralID PMC11741299
Somatic Hypermutation Informed Vocabulary Encoder Representations. Im, C., Mikelov, A., Zhao, R., Kundaje, A., Boyd, S. D. edited by Knowles, D. A., Koo, P. K. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2025

View details for Web of Science ID 001676326200016
An Expanded Registry of Candidate cis-Regulatory Elements for Studying Transcriptional Regulation. bioRxiv : the preprint server for biology Moore, J. E., Pratt, H. E., Fan, K., Phalke, N., Fisher, J., Elhajjajy, S. I., Andrews, G., Gao, M., Shedd, N., Fu, Y., Lacadie, M. C., Meza, J., Ganna, M., Choudhury, E., Swofford, R., Farrell, N. P., Pampari, A., Ramalingam, V., Reese, F., Borsari, B., Yu, M., Wattenberg, E., Ruiz-Romero, M., Razavi-Mohseni, M., Xu, J., Galeev, T., Beer, M. A., Guigo, R., Gerstein, M., Engreitz, J., Ljungman, M., Reddy, T. E., Snyder, M. P., Epstein, C. B., Gaskell, E., Bernstein, B. E., Dickel, D. E., Visel, A., Pennacchio, L. A., Mortazavi, A., Kundaje, A., Weng, Z. 2024

Abstract

Mammalian genomes contain millions of regulatory elements that control the complex patterns of gene expression. Previously, the ENCODE Consortium mapped biochemical signals across many cell types and tissues and integrated these data to develop a Registry of 0.9 million human and 300 thousand mouse candidate cis-Regulatory Elements (cCREs) annotated with potential functions. We have expanded the Registry to include 2.35 million human and 927 thousand mouse cCREs, leveraging new ENCODE datasets and enhanced computational methods. This expanded Registry covers hundreds of unique cell and tissue types, providing a comprehensive understanding of gene regulation. Functional characterization data from assays like STARR-seq, MPRA, CRISPR perturbation, and transgenic mouse assays now cover over 90% of human cCREs, revealing complex regulatory functions. We identified thousands of novel silencer cCREs and demonstrated their dual enhancer/silencer roles in different cellular contexts. Integrating the Registry with other ENCODE annotations facilitates genetic variation interpretation and trait-associated gene identification, exemplified by discovering KLF1 as a novel causal gene for red blood cell traits. This expanded Registry is a valuable resource for studying the regulatory genome and its impact on health and disease.

View details for DOI 10.1101/2024.12.26.629296

View details for PubMedID 39763870
Molecular convergence of risk variants for congenital heart defects leveraging a regulatory map of the human fetal heart. medRxiv : the preprint server for health sciences Ma, X. R., Conley, S. D., Kosicki, M., Bredikhin, D., Cui, R., Tran, S., Sheth, M. U., Qiu, W. L., Chen, S., Kundu, S., Kang, H. Y., Amgalan, D., Munger, C. J., Duan, L., Dang, K., Rubio, O. M., Kany, S., Zamirpour, S., DePaolo, J., Padmanabhan, A., Olgin, J., Damrauer, S., Andersson, R., Gu, M., Priest, J. R., Quertermous, T., Qiu, X., Rabinovitch, M., Visel, A., Pennacchio, L., Kundaje, A., Glass, I. A., Gifford, C. A., Pirruccello, J. P., Goodyer, W. R., Engreitz, J. M. 2024

Abstract

Congenital heart defects (CHD) arise in part due to inherited genetic variants that alter genes and noncoding regulatory elements in the human genome. These variants are thought to act during fetal development to influence the formation of different heart structures. However, identifying the genes, pathways, and cell types that mediate these effects has been challenging due to the immense diversity of cell types involved in heart development as well as the superimposed complexities of interpreting noncoding sequences. As such, understanding the molecular functions of both noncoding and coding variants remains paramount to our fundamental understanding of cardiac development and CHD. Here, we created a gene regulation map of the healthy human fetal heart across developmental time, and applied it to interpret the functions of variants associated with CHD and quantitative cardiac traits. We collected single-cell multiomic data from 734,000 single cells sampled from 41 fetal hearts spanning post-conception weeks 6 to 22, enabling the construction of gene regulation maps in 90 cardiac cell types and states, including rare populations of cardiac conduction cells. Through an unbiased analysis of all 90 cell types, we find that both rare coding variants associated with CHD and common noncoding variants associated with valve traits converge to affect valvular interstitial cells (VICs). VICs are enriched for high expression of known CHD genes previously identified through mapping of rare coding variants. Eight CHD genes, as well as other genes in similar molecular pathways, are linked to common noncoding variants associated with other valve diseases or traits via enhancers in VICs. In addition, certain common noncoding variants impact enhancers with activities highly specific to particular subanatomic structures in the heart, illuminating how such variants can impact specific aspects of heart structure and function. Together, these results implicate new enhancers, genes, and cell types in the genetic etiology of CHD, identify molecular convergence of common noncoding and rare coding variants on VICs, and suggest a more expansive view of the cell types instrumental in genetic risk for CHD, beyond the working cardiomyocyte. This regulatory map of the human fetal heart will provide a foundational resource for understanding cardiac development, interpreting genetic variants associated with heart disease, and discovering targets for cell-type specific therapies.

View details for DOI 10.1101/2024.11.20.24317557

View details for PubMedID 39606363

View details for PubMedCentralID PMC11601760
The chromatin landscape of the histone-possessing Bacteriovorax bacteria. Genome research Marinov, G. K., Doughty, B., Kundaje, A., Greenleaf, W. J. 2024

Abstract

Histone proteins have traditionally been thought to be restricted to eukaryotes and most archaea, with eukaryotic nucleosomal histones deriving from their archaeal ancestors. In contrast, bacteria lack histones as a rule. However, histone proteins have recently been identified in a few bacterial clades, most notably the phylum Bdellovibrionota, and these histones have been proposed to exhibit a range of divergent features compared to histones in archaea and eukaryotes. However, no functional genomic studies of the properties of Bdellovibrionota chromatin have been carried out. In this work, we map the landscape of chromatin accessibility, active transcription and three-dimensional genome organization in a member of Bdellovibrionota (a Bacteriovorax strain). We find that, similar to what is observed in some archaea and in eukaryotes with compact genomes such as yeast, Bacteriovorax chromatin is characterized by preferential accessibility around promoter regions. Similar to eukaryotes, chromatin accessibility in Bacteriovorax positively correlates with gene expression. Mapping active transcription through single-strand DNA (ssDNA) profiling revealed that unlike in yeast, but similar to the state of mammalian and fly promoters, Bacteriovorax promoters exhibit very strong polymerase pausing. Finally, similar to that of other bacteria without histones, the Bacteriovorax genome exists in a three-dimensional (3D) configuration organized by the parABS system along the axis defined by replication origin and termination regions. These results provide a foundation for understanding the chromatin biology of the unique Bdellovibrionota bacteria and the functional diversity in chromatin organization across the tree of life.

View details for DOI 10.1101/gr.279418.124

View details for PubMedID 39572228
GENCODE 2025: reference gene annotation for human and mouse. Nucleic acids research Mudge, J. M., Carbonell-Sala, S., Diekhans, M., Martinez, J. G., Hunt, T., Jungreis, I., Loveland, J. E., Arnan, C., Barnes, I., Bennett, R., Berry, A., Bignell, A., Cerdán-Vélez, D., Cochran, K., Cortés, L. T., Davidson, C., Donaldson, S., Dursun, C., Fatima, R., Hardy, M., Hebbar, P., Hollis, Z., James, B. T., Jiang, Y., Johnson, R., Kaur, G., Kay, M., Mangan, R. J., Maquedano, M., Gómez, L. M., Mathlouthi, N., Merritt, R., Ni, P., Palumbo, E., Perteghella, T., Pozo, F., Raj, S., Sisu, C., Steed, E., Sumathipala, D., Suner, M. M., Uszczynska-Ratajczak, B., Wass, E., Yang, Y. T., Zhang, D., Finn, R. D., Gerstein, M., Guigó, R., Hubbard, T. J., Kellis, M., Kundaje, A., Paten, B., Tress, M. L., Birney, E., Martin, F. J., Frankish, A. 2024

Abstract

GENCODE produces comprehensive reference gene annotation for human and mouse. Entering its twentieth year, the project remains highly active as new technologies and methodologies allow us to catalog the genome at ever-increasing granularity. In particular, long-read transcriptome sequencing enables us to identify large numbers of missing transcripts and to substantially improve existing models, and our long non-coding RNA catalogs have undergone a dramatic expansion and reconfiguration as a result. Meanwhile, we are incorporating data from state-of-the-art proteomics and Ribo-seq experiments to fine-tune our annotation of translated sequences, while further insights into function can be gained from multi-genome alignments that grow richer as more species' genomes are sequenced. Such methodologies are combined into a fully integrated annotation workflow. However, the increasing complexity of our resources can present usability challenges, and we are resolving these with the creation of filtered genesets such as MANE Select and GENCODE Primary. The next challenge is to propagate annotations throughout multiple human and mouse genomes, as we enter the pangenome era. Our resources are freely available at our web portal www.gencodegenes.org, and via the Ensembl and UCSC genome browsers.

View details for DOI 10.1093/nar/gkae1078

View details for PubMedID 39565199
Deep Learning the Cis-Regulatory Code of Epigenomic Rewiring By Fusion Transcription Factors in Pediatric Leukemia. Mani, S., Grubert, F., Gratzinger, D., Kundaje, A., Kasowski, M. ELSEVIER. 2024: 4993

View details for DOI 10.1182/blood-2024-209601

View details for Web of Science ID 001407948200003
GENCODE: massively expanding the lncRNA catalog through capture long-read RNA sequencing. bioRxiv : the preprint server for biology Kaur, G., Perteghella, T., Carbonell-Sala, S., Gonzalez-Martinez, J., Hunt, T., Mądry, T., Jungreis, I., Arnan, C., Lagarde, J., Borsari, B., Sisu, C., Jiang, Y., Bennett, R., Berry, A., Cerdán-Vélez, D., Cochran, K., Vara, C., Davidson, C., Donaldson, S., Dursun, C., González-López, S., Gopal Das, S., Hardy, M., Hollis, Z., Kay, M., Montañés, J. C., Ni, P., Nurtdinov, R., Palumbo, E., Pulido-Quetglas, C., Suner, M. M., Yu, X., Zhang, D., Loveland, J. E., Albà, M. M., Diekhans, M., Tanzer, A., Mudge, J. M., Flicek, P., Martin, F. J., Gerstein, M., Kellis, M., Kundaje, A., Paten, B., Tress, M. L., Johnson, R., Uszczynska-Ratajczak, B., Frankish, A., Guigó, R. 2024

Abstract

Accurate and complete gene annotations are indispensable for understanding how genome sequences encode biological functions. For twenty years, the GENCODE consortium has developed reference annotations for the human and mouse genomes, becoming a foundation for biomedical and genomics communities worldwide. Nevertheless, collections of important yet poorly-understood gene classes like long non-coding RNAs (lncRNAs) remain incomplete and scattered across multiple, uncoordinated catalogs, slowing down progress in the field. To address these issues, GENCODE has undertaken the most comprehensive lncRNA annotation effort to date. This is founded on the manual annotation of full-length targeted long-read sequencing, on matched embryonic and adult tissues, of orthologous regions in human and mouse. Altogether 17,931 novel human genes (140,268 novel transcripts) and 22,784 novel mouse genes (136,169 novel transcripts) have been added to the GENCODE catalog representing a 2-fold and 6-fold increase in transcripts, respectively - the greatest increase since the sequencing of the human genome. Novel gene annotations display evolutionary constraints, have well-formed promoter regions, and link to phenotype-associated genetic variants. They greatly enhance the functional interpretability of the human genome, as they help explain millions of previously-mapped "orphan" omics measurements corresponding to transcription start sites, chromatin modifications and transcription factor binding sites. Crucially, our targeted design assigned human-mouse orthologs at a rate beyond previous studies, tripling the number of human disease-associated lncRNAs with mouse orthologs. The expanded and enhanced GENCODE lncRNA annotations mark a critical step towards deciphering the human and mouse genomes.

View details for DOI 10.1101/2024.10.29.620654

View details for PubMedID 39554180

View details for PubMedCentralID PMC11565817
Multiomic analysis of familial adenomatous polyposis reveals molecular pathways associated with early tumorigenesis. Nature cancer Esplin, E. D., Hanson, C., Wu, S., Horning, A. M., Barapour, N., Nevins, S. A., Jiang, L., Contrepois, K., Lee, H., Guha, T. K., Hu, Z., Laquindanum, R., Mills, M. A., Chaib, H., Chiu, R., Jian, R., Chan, J., Ellenberger, M., Becker, W. R., Bahmani, B., Khan, A., Michael, B., Weimer, A. K., Esplin, D. G., Shen, J., Lancaster, S., Monte, E., Karathanos, T. V., Ladabaum, U., Longacre, T. A., Kundaje, A., Curtis, C., Greenleaf, W. J., Ford, J. M., Snyder, M. P. 2024

Abstract

Familial adenomatous polyposis (FAP) is a genetic disease causing hundreds of premalignant polyps in affected persons and is an ideal model to study transitions of early precancer states to colorectal cancer (CRC). We performed deep multiomic profiling of 93 samples, including normal mucosa, benign polyps and dysplastic polyps, from six persons with FAP. Transcriptomic, proteomic, metabolomic and lipidomic analyses revealed a dynamic choreography of thousands of molecular and cellular events that occur during precancerous transitions toward cancer formation. These involve processes such as cell proliferation, immune response, metabolic alterations (including amino acids and lipids), hormones and extracellular matrix proteins. Interestingly, activation of the arachidonic acid pathway was found to occur early in hyperplasia; this pathway is targeted by aspirin and other nonsteroidal anti-inflammatory drugs, a preventative treatment under investigation in persons with FAP. Overall, our results reveal key genomic, cellular and molecular events during the earliest steps in CRC formation and potential mechanisms of pharmaceutical prophylaxis.

View details for DOI 10.1038/s43018-024-00831-z

View details for PubMedID 39478120

View details for PubMedCentralID 2706149
Global loss of promoter-enhancer connectivity and rebalancing of gene expression during early colorectal cancer carcinogenesis. Nature cancer Zhu, Y., Lee, H., White, S., Weimer, A. K., Monte, E., Horning, A., Nevins, S. A., Esplin, E. D., Paul, K., Krieger, G., Shipony, Z., Chiu, R., Laquindanum, R., Karathanos, T. V., Chua, M. W., Mills, M., Ladabaum, U., Longacre, T., Shen, J., Jaimovich, A., Lipson, D., Kundaje, A., Greenleaf, W. J., Curtis, C., Ford, J. M., Snyder, M. P. 2024

Abstract

Although three-dimensional (3D) genome architecture is crucial for gene regulation, its role in disease remains elusive. We traced the evolution and malignant transformation of colorectal cancer (CRC) by generating high-resolution chromatin conformation maps of 33 colon samples spanning different stages of early neoplastic growth in persons with familial adenomatous polyposis (FAP). Our analysis revealed a substantial progressive loss of genome-wide cis-regulatory connectivity at early malignancy stages, correlating with nonlinear gene regulation effects. Genes with high promoter-enhancer (P-E) connectivity in unaffected mucosa were not linked to elevated baseline expression but tended to be upregulated in advanced stages. Inhibiting highly connected promoters preferentially represses gene expression in CRC cells compared to normal colonic epithelial cells. Our results suggest a two-phase model whereby neoplastic transformation reduces P-E connectivity from a redundant state to a rate-limiting one for transcriptional levels, highlighting the intricate interplay between 3D genome architecture and gene regulation during early CRC progression.

View details for DOI 10.1038/s43018-024-00823-z

View details for PubMedID 39478119

View details for PubMedCentralID 7541718
An SLC12A9-dependent ion transport mechanism maintains lysosomal osmolarity. Developmental cell Levin-Konigsberg, R., Mitra, K., Spees, K., Nigam, A., Liu, K., Januel, C., Hivare, P., Arana, S. M., Prolo, L. M., Kundaje, A., Leonetti, M. D., Krishnan, Y., Bassik, M. C. 2024

Abstract

Ammonia is a ubiquitous, toxic by-product of cell metabolism. Its high membrane permeability and proton affinity cause ammonia to accumulate inside acidic lysosomes in its poorly membrane-permeant form: ammonium (NH4+). Ammonium buildup compromises lysosomal function, suggesting the existence of mechanisms that protect cells from ammonium toxicity. Here, we identified SLC12A9 as a lysosomal-resident protein that preserves organelle homeostasis by controlling ammonium and chloride levels. SLC12A9 knockout (KO) cells showed grossly enlarged lysosomes and elevated ammonium content. These phenotypes were reversed upon removal of the metabolic source of ammonium or dissipation of the lysosomal pH gradient. Lysosomal chloride increased in SLC12A9 KO cells, and chloride binding by SLC12A9 was required for ammonium transport. Our data indicate that SLC12A9 function is central for the handling of lysosomal ammonium and chloride, an unappreciated, fundamental mechanism of lysosomal physiology that may have special relevance in tissues with elevated ammonia, such as tumors.

View details for DOI 10.1016/j.devcel.2024.10.003

View details for PubMedID 39476838
Implication of complex structural genome variation in the genetic architecture of neuropsychiatric disorders: insights from human population analysis and from postmortem brains of individuals with psychiatric disorders. Zhou, B., Arthur, J., Guo, H., Kim, T., Huang, Y., Pattni, R., Song, G., Palejev, D., Dohna, H., Roussos, P., Kundaje, A., Hallmayer, J., Snyder, M., Wong, W., Urban, A. Elsevier. 2024: 93

View details for DOI 10.1016/j.euroneuro.2024.08.194

View details for Web of Science ID 001336799000163
Characterization of additive gene-environment interactions for colorectal cancer risk. Epidemiology (Cambridge, Mass.) Thomas, C. E., Lin, Y., Kim, M., Kawaguchi, E. S., Qu, C., Um, C. Y., Lynch, B. M., Van Guelpen, B., Tsilidis, K., Carreras-Torres, R., van Duijnhoven, F. J., Sakoda, L. C., Campbell, P. T., Tian, Y., Chang-Claude, J., Bézieau, S., Budiarto, A., Palmer, J. R., Newcomb, P. A., Casey, G., Le Marchand, L., Giannakis, M., Li, C. I., Gsur, A., Newton, C., Obón-Santacana, M., Moreno, V., Vodicka, P., Brenner, H., Hoffmeister, M., Pellatt, A. J., Schoen, R. E., Dimou, N., Murphy, N., Gunter, M. J., Castellví-Bel, S., Figueiredo, J. C., Chan, A. T., Song, M., Li, L., Bishop, D. T., Gruber, S. B., Baurley, J. W., Bien, S. A., Conti, D. V., Huyghe, J. R., Kundaje, A., Su, Y. R., Wang, J., Keku, T. O., Woods, M. O., Berndt, S. I., Chanock, S. J., Tangen, C. M., Wolk, A., Burnett-Hartman, A., Wu, A. H., White, E., Devall, M. A., Díez-Obrero, V., Drew, D. A., Giovannucci, E., Hidaka, A., Kim, A. E., Lewinger, J. P., Morrison, J., Ose, J., Papadimitriou, N., Pardamean, B., Peoples, A. R., Ruiz-Narvaez, E. A., Shcherbina, A., Stern, M. C., Chen, X., Thomas, D. C., Platz, E. A., Gauderman, W. J., Peters, U., Hsu, L. 2024

Abstract

Colorectal cancer (CRC) is a common, fatal cancer. Identifying subgroups who may benefit more from intervention is of critical public health importance. Previous studies have assessed multiplicative interaction between genetic risk scores and environmental factors, but few have assessed additive interaction, the relevant public health measure. Using resources from colorectal cancer consortia including 45,247 CRC cases and 52,671 controls, we assessed multiplicative and additive interaction (relative excess risk due to interaction, RERI) using logistic regression between 13 harmonized environmental factors and genetic risk score including 141 variants associated with CRC risk. There was no evidence of multiplicative interaction between environmental factors and genetic risk score. There was additive interaction where, for individuals with high genetic susceptibility, either heavy drinking [RERI = 0.24, 95% confidence interval, CI, (0.13, 0.36)], ever smoking [0.11 (0.05, 0.16)], high BMI [female 0.09 (0.05, 0.13), male 0.10 (0.05, 0.14)], or high red meat intake [highest versus lowest quartile 0.18 (0.09, 0.27)] was associated with excess CRC risk greater than that for individuals with average genetic susceptibility. Conversely, we estimate those with high genetic susceptibility may benefit more from reducing CRC risk with aspirin/NSAID use [-0.16 (-0.20, -0.11)] or higher intake of fruit, fiber, or calcium [highest quartile versus lowest quartile -0.12 (-0.18, -0.050); -0.16 (-0.23, -0.09); -0.11 (-0.18, -0.05), respectively] than those with average genetic susceptibility. Additive interaction is important to assess for identifying subgroups who may benefit from intervention. The subgroups identified in this study may help inform precision CRC prevention.

View details for DOI 10.1097/EDE.0000000000001795

View details for PubMedID 39316822
Prediction and functional interpretation of inter-chromosomal genome architecture from DNA sequence with TwinC. bioRxiv : the preprint server for biology Jha, A., Hristov, B., Wang, X., Wang, S., Greenleaf, W. J., Kundaje, A., Aiden, E. L., Bertero, A., Noble, W. S. 2024

Abstract

Three-dimensional nuclear DNA architecture comprises well-studied intra-chromosomal (cis) folding and less characterized inter-chromosomal (trans) interfaces. Current predictive models of 3D genome folding can effectively infer pairwise cis-chromatin interactions from the primary DNA sequence but generally ignore trans contacts. There is an unmet need for robust models of trans-genome organization that provide insights into their underlying principles and functional relevance. We present TwinC, an interpretable convolutional neural network model that reliably predicts trans contacts measurable through genome-wide chromatin conformation capture (Hi-C). TwinC uses a paired sequence design from replicate Hi-C experiments to learn single base pair relevance in trans interactions across two stretches of DNA. The method achieves high predictive accuracy (AUROC=0.80) on a cross-chromosomal test set from Hi-C experiments in heart tissue. Mechanistically, the neural network learns the importance of compartments, chromatin accessibility, clustered transcription factor binding and G-quadruplexes in forming trans contacts. In summary, TwinC models and interprets trans genome architecture, shedding light on this poorly understood aspect of gene regulation.

View details for DOI 10.1101/2024.09.16.613355

View details for PubMedID 39345598

View details for PubMedCentralID PMC11429679
Mutagenesis Sensitivity Mapping of Human Enhancers In Vivo. bioRxiv : the preprint server for biology Kosicki, M., Zhang, B., Pampari, A., Akiyama, J. A., Plajzer-Frick, I., Novak, C. S., Tran, S., Zhu, Y., Kato, M., Hunter, R. D., von Maydell, K., Barton, S., Beckman, E., Kundaje, A., Dickel, D. E., Visel, A., Pennacchio, L. A. 2024

Abstract

Distant-acting enhancers are central to human development. However, our limited understanding of their functional sequence features prevents the interpretation of enhancer mutations in disease. Here, we determined the functional sensitivity to mutagenesis of human developmental enhancers in vivo. Focusing on seven enhancers active in the developing brain, heart, limb and face, we created over 1700 transgenic mice for over 260 mutagenized enhancer alleles. Systematic mutation of 12-basepair blocks collectively altered each sequence feature in each enhancer at least once. We show that 69% of all blocks are required for normal in vivo activity, with mutations more commonly resulting in loss (60%) than in gain (9%) of function. Using predictive modeling, we annotated critical nucleotides at base-pair resolution. The vast majority of motifs predicted by these machine learning models (88%) coincided with changes to in vivo function, and the models showed considerable sensitivity, identifying 59% of all functional blocks. Taken together, our results reveal that human enhancers contain a high density of sequence features required for their normal in vivo function and provide a rich resource for further exploration of human enhancer logic.

View details for DOI 10.1101/2024.09.06.611737

View details for PubMedID 39282388

View details for PubMedCentralID PMC11398460
Flexible use of conserved motif vocabularies constrains genome access in cell type evolution. bioRxiv : the preprint server for biology Chai, C., Gibson, J., Li, P., Pampari, A., Patel, A., Kundaje, A., Wang, B. 2024

Abstract

Cell types evolve into a hierarchy with related types grouped into families. How cell type diversification is constrained by the stable separation between families over vast evolutionary times remains unknown. Here, integrating single-nucleus multiomic sequencing and deep learning, we show that hundreds of sequence features (motifs) divide into distinct sets associated with accessible genomes of specific cell type families. This division is conserved across highly divergent, early-branching animals including flatworms and cnidarians. While specific interactions between motifs delineate cell type relationships within families, surprisingly, these interactions are not conserved between species. Consistently, while deep learning models trained on one species can predict accessibility of other species' sequences, their predictions frequently rely on distinct, but synonymous, motif combinations. We propose that long-term stability of cell type families is maintained through genome access specified by conserved motif sets, or 'vocabularies', whereas cell types diversify through flexible use of motifs within each set.

View details for DOI 10.1101/2024.09.03.611027

View details for PubMedID 39282369

View details for PubMedCentralID PMC11398382
Deciphering the impact of genomic variation on function. Nature 2024; 633 (8028): 47-57

Abstract

Our genomes influence nearly every aspect of human biology, from molecular and cellular functions to phenotypes in health and disease. Studying the differences in DNA sequence between individuals (genomic variation) could reveal previously unknown mechanisms of human biology, uncover the basis of genetic predispositions to diseases, and guide the development of new diagnostic tools and therapeutic agents. Yet, understanding how genomic variation alters genome function to influence phenotype has proved challenging. To unlock these insights, we need a systematic and comprehensive catalogue of genome function and the molecular and cellular effects of genomic variants. Towards this goal, the Impact of Genomic Variation on Function (IGVF) Consortium will combine approaches in single-cell mapping, genomic perturbations and predictive modelling to investigate the relationships among genomic variation, genome function and phenotypes. IGVF will create maps across hundreds of cell types and states describing how coding variants alter protein activity, how noncoding variants change the regulation of gene expression, and how such effects connect through gene-regulatory and protein-interaction networks. These experimental data, computational predictions and accompanying standards and pipelines will be integrated into an open resource that will catalyse community efforts to explore how our genomes influence biology and disease across populations.

View details for DOI 10.1038/s41586-024-07510-0

View details for PubMedID 39232149

View details for PubMedCentralID 7405896
Multiplexed single-cell characterization of alternative polyadenylation regulators. Cell Kowalski, M. H., Wessels, H. H., Linder, J., Dalgarno, C., Mascio, I., Choudhary, S., Hartman, A., Hao, Y., Kundaje, A., Satija, R. 2024

Abstract

Most mammalian genes have multiple polyA sites, representing a substantial source of transcript diversity regulated by the cleavage and polyadenylation (CPA) machinery. To better understand how these proteins govern polyA site choice, we introduce CPA-Perturb-seq, a multiplexed perturbation screen dataset of 42 CPA regulators with a 3' scRNA-seq readout that enables transcriptome-wide inference of polyA site usage. We develop a framework to detect perturbation-dependent changes in polyadenylation and characterize modules of co-regulated polyA sites. We find groups of intronic polyA sites regulated by distinct components of the nuclear RNA life cycle, including elongation, splicing, termination, and surveillance. We train and validate a deep neural network (APARENT-Perturb) for tandem polyA site usage, delineating a cis-regulatory code that predicts perturbation response and reveals interactions between regulatory complexes. Our work highlights the potential for multiplexed single-cell perturbation screens to further our understanding of post-transcriptional regulation.

View details for DOI 10.1016/j.cell.2024.06.005

View details for PubMedID 38925112
An updated compendium and reevaluation of the evidence for nuclear transcription factor occupancy over the mitochondrial genome. bioRxiv : the preprint server for biology Marinov, G. K., Ramalingam, V., Greenleaf, W. J., Kundaje, A. 2024

Abstract

In most eukaryotes, mitochondrial organelles contain their own genome, usually circular, which is the remnant of the genome of the ancestral bacterial endosymbiont that gave rise to modern mitochondria. Mitochondrial genomes are dramatically reduced in their gene content due to the process of endosymbiotic gene transfer to the nucleus; as a result most mitochondrial proteins are encoded in the nucleus and imported into mitochondria. This includes the components of the dedicated mitochondrial transcription and replication systems and regulatory factors, which are entirely distinct from the information processing systems in the nucleus. However, since the 1990s several nuclear transcription factors have been reported to act in mitochondria, and previously we identified 8 human and 3 mouse transcription factors (TFs) with strong localized enrichment over the mitochondrial genome using ChIP-seq (Chromatin Immunoprecipitation) datasets from the second phase of the ENCODE (Encyclopedia of DNA Elements) Project Consortium. Here, we analyze the ENCODE compendium of TF ChIP-seq datasets, greatly expanded in the intervening decade (a total of 6,153 ChIP experiments for 942 proteins, of which 763 are sequence-specific TFs), combined with interpretative deep learning models of TF occupancy to create a comprehensive compendium of nuclear TFs that show evidence of association with the mitochondrial genome. We find some evidence for chrM occupancy for 50 nuclear TFs and two other proteins, with bZIP TFs emerging as most likely to be playing a role in mitochondria. However, we also observe that in cases where the same TF has been assayed with multiple antibodies and ChIP protocols, evidence for its chrM occupancy is not always reproducible. In the light of these findings, we discuss the evidential criteria for establishing chrM occupancy and reevaluate the overall compendium of putative mitochondrial-acting nuclear TFs.

View details for DOI 10.1101/2024.06.04.597442

View details for PubMedID 38895386

View details for PubMedCentralID PMC11185660
Two genome-wide interaction loci modify the association of nonsteroidal anti-inflammatory drugs with colorectal cancer. Science advances Drew, D. A., Kim, A. E., Lin, Y., Qu, C., Morrison, J., Lewinger, J. P., Kawaguchi, E., Wang, J., Fu, Y., Zemlianskaia, N., Díez-Obrero, V., Bien, S. A., Dimou, N., Albanes, D., Baurley, J. W., Wu, A. H., Buchanan, D. D., Potter, J. D., Prentice, R. L., Harlid, S., Arndt, V., Barry, E. L., Berndt, S. I., Bouras, E., Brenner, H., Budiarto, A., Burnett-Hartman, A., Campbell, P. T., Carreras-Torres, R., Casey, G., Chang-Claude, J., Conti, D. V., Devall, M. A., Figueiredo, J. C., Gruber, S. B., Gsur, A., Gunter, M. J., Harrison, T. A., Hidaka, A., Hoffmeister, M., Huyghe, J. R., Jenkins, M. A., Jordahl, K. M., Kundaje, A., Le Marchand, L., Li, L., Lynch, B. M., Murphy, N., Nassir, R., Newcomb, P. A., Newton, C. C., Obón-Santacana, M., Ogino, S., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Pellatt, A. J., Peoples, A. R., Platz, E. A., Rennert, G., Ruiz-Narvaez, E., Sakoda, L. C., Scacheri, P. C., Schmit, S. L., Schoen, R. E., Stern, M. C., Su, Y. R., Thomas, D. C., Tian, Y., Tsilidis, K. K., Ulrich, C. M., Um, C. Y., van Duijnhoven, F. J., Van Guelpen, B., White, E., Hsu, L., Moreno, V., Peters, U., Chan, A. T., Gauderman, W. J. 2024; 10 (22): eadk3121

Abstract

Regular, long-term aspirin use may act synergistically with genetic variants, particularly those in mechanistically relevant pathways, to confer a protective effect on colorectal cancer (CRC) risk. We leveraged pooled data from 52 clinical trial, cohort, and case-control studies that included 30,806 CRC cases and 41,861 controls of European ancestry to conduct a genome-wide interaction scan between regular aspirin/nonsteroidal anti-inflammatory drug (NSAID) use and imputed genetic variants. After adjusting for multiple comparisons, we identified statistically significant interactions between regular aspirin/NSAID use and variants in 6q24.1 (top hit rs72833769), which has evidence of influencing expression of TBC1D7 (a subunit of the TSC1-TSC2 complex, a key regulator of MTOR activity), and variants in 5p13.1 (top hit rs350047), which is associated with expression of PTGER4 (codes a cell surface receptor directly involved in the mode of action of aspirin). Genetic variants with functional impact may modulate the chemopreventive effect of regular aspirin use, and our study identifies putative previously unidentified targets for additional mechanistic interrogation.

View details for DOI 10.1126/sciadv.adk3121

View details for PubMedID 38809988
Dissecting the cis-regulatory syntax of transcription initiation with deep learning. bioRxiv : the preprint server for biology Cochran, K., Yin, M., Mantripragada, A., Schreiber, J., Marinov, G. K., Kundaje, A. 2024

Abstract

Despite extensive characterization of mammalian Pol II transcription, the DNA sequence determinants of transcription initiation at a third of human promoters and most enhancers remain poorly understood. Hence, we trained and interpreted a neural network called ProCapNet that accurately models base-resolution initiation profiles from PRO-cap experiments using local DNA sequence. ProCapNet learns sequence motifs with distinct effects on initiation rates and TSS positioning and uncovers context-specific cryptic initiator elements intertwined within other TF motifs. ProCapNet annotates predictive motifs in nearly all actively transcribed regulatory elements across multiple cell-lines, revealing a shared cis-regulatory logic across promoters and enhancers mediated by a highly epistatic sequence syntax of cooperative and competitive motif interactions. ProCapNet models of RAMPAGE profiles measuring steady-state RNA abundance at TSSs distill initiation signals on par with models trained directly on PRO-cap profiles. ProCapNet learns a largely cell-type-agnostic cis-regulatory code of initiation complementing sequence drivers of cell-type-specific chromatin state critical for accurate prediction of cell-type-specific transcription initiation.

View details for DOI 10.1101/2024.05.28.596138

View details for PubMedID 38853896
Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage. bioRxiv : the preprint server for biology Naqvi, S., Kim, S., Tabatabaee, S., Pampari, A., Kundaje, A., Pritchard, J. K., Wysocka, J. 2024

Abstract

Deep learning approaches have made significant advances in predicting cell type-specific chromatin patterns from the identity and arrangement of transcription factor (TF) binding motifs. However, most models have been applied in unperturbed contexts, precluding a predictive understanding of how chromatin state responds to TF perturbation. Here, we used transfer learning to train and interpret deep learning models that use DNA sequence to predict, with accuracy approaching experimental reproducibility, how the concentration of two dosage-sensitive TFs (TWIST1, SOX9) affects regulatory element (RE) chromatin accessibility in facial progenitor cells. High-affinity motifs that allow for heterotypic TF co-binding and are concentrated at the center of REs buffer against quantitative changes in TF dosage and strongly predict unperturbed accessibility. In contrast, motifs with low-affinity or homotypic binding distributed throughout REs lead to sensitive responses with minimal contributions to unperturbed accessibility. Both buffering and sensitizing features show signatures of purifying selection. We validated these predictive sequence features using reporter assays and showed that a biophysical model of TF-nucleosome competition can explain the sensitizing effect of low-affinity motifs. Our approach of combining transfer learning and quantitative measurements of the chromatin response to TF dosage therefore represents a powerful method to reveal additional layers of the cis-regulatory code.

View details for DOI 10.1101/2024.05.28.596078

View details for PubMedID 38853998
Using a comprehensive atlas and predictive models to reveal the complexity and evolution of brain-active regulatory elements. Science advances Pratt, H. E., Andrews, G., Shedd, N., Phalke, N., Li, T., Pampari, A., Jensen, M., Wen, C., Consortium, P., Gandal, M. J., Geschwind, D. H., Gerstein, M., Moore, J., Kundaje, A., Colubri, A., Weng, Z. 2024; 10 (21): eadj4452

Abstract

Most genetic variants associated with psychiatric disorders are located in noncoding regions of the genome. To investigate their functional implications, we integrate epigenetic data from the PsychENCODE Consortium and other published sources to construct a comprehensive atlas of candidate brain cis-regulatory elements. Using deep learning, we model these elements' sequence syntax and predict how binding sites for lineage-specific transcription factors contribute to cell type-specific gene regulation in various types of glia and neurons. The elements' evolutionary history suggests that new regulatory information in the brain emerges primarily via smaller sequence mutations within conserved mammalian elements rather than entirely new human- or primate-specific sequences. However, primate-specific candidate elements, particularly those active during fetal brain development and in excitatory neurons and astrocytes, are implicated in the heritability of brain-related human traits. Additionally, we introduce PsychSCREEN, a web-based platform offering interactive visualization of PsychENCODE-generated genetic and epigenetic data from diverse brain cell types in individuals with psychiatric disorders and healthy controls.

View details for DOI 10.1126/sciadv.adj4452

View details for PubMedID 38781344
Application of established computational techniques to identify potential SARS-CoV-2 Nsp14-MTase inhibitors in low data regimes. Digital Discovery Nigam, A., Hurley, M. F. D., Li, F., Konkolova, E., Klima, M., Trylcova, J., Pollice, R., Cinaroglu, S., Levin-Konigsberg, R., Handjaya, J., Schapira, M., Chau, I., Perveen, S., Ng, H., Kaniskan, H., Han, Y., Singh, S., Gorgulla, C., Kundaje, A., Jin, J., Voelz, V. A., Weber, J., Nencka, R., Boura, E., Vedadi, M., Aspuru-Guzik, A. 2024

View details for DOI 10.1039/d4dd00006d

View details for Web of Science ID 001230436200001
Genome-wide interaction study of dietary intake of fibre, fruits, and vegetables with risk of colorectal cancer. EBioMedicine Papadimitriou, N., Kim, A., Kawaguchi, E. S., Morrison, J., Diez-Obrero, V., Albanes, D., Berndt, S. I., Bézieau, S., Bien, S. A., Bishop, D. T., Bouras, E., Brenner, H., Buchanan, D. D., Campbell, P. T., Carreras-Torres, R., Chan, A. T., Chang-Claude, J., Conti, D. V., Devall, M. A., Dimou, N., Drew, D. A., Gruber, S. B., Harrison, T. A., Hoffmeister, M., Huyghe, J. R., Joshi, A. D., Keku, T. O., Kundaje, A., Küry, S., Le Marchand, L., Lewinger, J. P., Li, L., Lynch, B. M., Moreno, V., Newton, C. C., Obón-Santacana, M., Ose, J., Pellatt, A. J., Peoples, A. R., Platz, E. A., Qu, C., Rennert, G., Ruiz-Narvaez, E., Shcherbina, A., Stern, M. C., Su, Y. R., Thomas, D. C., Thomas, C. E., Tian, Y., Tsilidis, K. K., Ulrich, C. M., Um, C. Y., Visvanathan, K., Wang, J., White, E., Woods, M. O., Schmit, S. L., Macrae, F., Potter, J. D., Hopper, J. L., Peters, U., Murphy, N., Hsu, L., Gunter, M. J., Gauderman, W. J. 2024; 104: 105146

Abstract

Consumption of fibre, fruits and vegetables has been linked with lower colorectal cancer (CRC) risk. A genome-wide gene-environment (G × E) analysis was performed to test whether genetic variants modify these associations. A pooled sample of 45 studies including up to 69,734 participants (cases: 29,896; controls: 39,838) of European ancestry was included. To identify G × E interactions, we used the traditional 1-degree-of-freedom (DF) G × E test and, to improve power, a 2-step procedure and a 3-DF joint test that investigates the association between a genetic variant and dietary exposure, CRC risk and G × E interaction simultaneously. The 3-DF joint test revealed two significant loci with p-value < 5 × 10^-8. Rs4730274 close to the SLC26A3 gene showed an association with fibre (p-value: 2.4 × 10^-3) and G × fibre interaction with CRC (OR per quartile of fibre increase = 0.87, 0.80, and 0.75 for CC, TC, and TT genotype, respectively; G × E p-value: 1.8 × 10^-7). Rs1620977 in the NEGR1 gene showed an association with fruit intake (p-value: 1.0 × 10^-8) and G × fruit interaction with CRC (OR per quartile of fruit increase = 0.75, 0.65, and 0.56 for AA, AG, and GG genotype, respectively; G × E p-value: 0.029). We identified 2 loci associated with fibre and fruit intake that also modify the association of these dietary factors with CRC risk. Potential mechanisms include chronic inflammatory intestinal disorders and gut function. However, further studies are needed for mechanistic validation and replication of findings. Funding: National Institutes of Health, National Cancer Institute. Full funding details for the individual consortia are provided in acknowledgments.

View details for DOI 10.1016/j.ebiom.2024.105146

View details for PubMedID 38749303
Fine-mapping analysis including over 254,000 East Asian and European descendants identifies 136 putative colorectal cancer susceptibility genes. Nature communications Chen, Z., Guo, X., Tao, R., Huyghe, J. R., Law, P. J., Fernandez-Rozadilla, C., Ping, J., Jia, G., Long, J., Li, C., Shen, Q., Xie, Y., Timofeeva, M. N., Thomas, M., Schmit, S. L., Díez-Obrero, V., Devall, M., Moratalla-Navarro, F., Fernandez-Tajes, J., Palles, C., Sherwood, K., Briggs, S. E., Svinti, V., Donnelly, K., Farrington, S. M., Blackmur, J., Vaughan-Shaw, P. G., Shu, X. O., Lu, Y., Broderick, P., Studd, J., Harrison, T. A., Conti, D. V., Schumacher, F. R., Melas, M., Rennert, G., Obón-Santacana, M., Martín-Sánchez, V., Oh, J. H., Kim, J., Jee, S. H., Jung, K. J., Kweon, S. S., Shin, M. H., Shin, A., Ahn, Y. O., Kim, D. H., Oze, I., Wen, W., Matsuo, K., Matsuda, K., Tanikawa, C., Ren, Z., Gao, Y. T., Jia, W. H., Hopper, J. L., Jenkins, M. A., Win, A. K., Pai, R. K., Figueiredo, J. C., Haile, R. W., Gallinger, S., Woods, M. O., Newcomb, P. A., Duggan, D., Cheadle, J. P., Kaplan, R., Kerr, R., Kerr, D., Kirac, I., Böhm, J., Mecklin, J. P., Jousilahti, P., Knekt, P., Aaltonen, L. A., Rissanen, H., Pukkala, E., Eriksson, J. G., Cajuso, T., Hänninen, U., Kondelin, J., Palin, K., Tanskanen, T., Renkonen-Sinisalo, L., Männistö, S., Albanes, D., Weinstein, S. J., Ruiz-Narvaez, E., Palmer, J. R., Buchanan, D. D., Platz, E. A., Visvanathan, K., Ulrich, C. M., Siegel, E., Brezina, S., Gsur, A., Campbell, P. T., Chang-Claude, J., Hoffmeister, M., Brenner, H., Slattery, M. L., Potter, J. D., Tsilidis, K. K., Schulze, M. B., Gunter, M. J., Murphy, N., Castells, A., Castellví-Bel, S., Moreira, L., Arndt, V., Shcherbina, A., Bishop, D. T., Giles, G. G., Southey, M. C., Idos, G. E., McDonnell, K. J., Abu-Ful, Z., Greenson, J. K., Shulman, K., Lejbkowicz, F., Offit, K., Su, Y. R., Steinfelder, R., Keku, T. O., van Guelpen, B., Hudson, T. J., Hampel, H., Pearlman, R., Berndt, S. I., Hayes, R. B., Martinez, M. E., Thomas, S. S., Pharoah, P. D., Larsson, S. C., Yen, Y., Lenz, H. J., White, E., Li, L., Doheny, K. F., Pugh, E., Shelford, T., Chan, A. T., Cruz-Correa, M., Lindblom, A., Hunter, D. J., Joshi, A. D., Schafmayer, C., Scacheri, P. C., Kundaje, A., Schoen, R. E., Hampe, J., Stadler, Z. K., Vodicka, P., Vodickova, L., Vymetalkova, V., Edlund, C. K., Gauderman, W. J., Shibata, D., Toland, A., Markowitz, S., Kim, A., Chanock, S. J., van Duijnhoven, F., Feskens, E. J., Sakoda, L. C., Gago-Dominguez, M., Wolk, A., Pardini, B., FitzGerald, L. M., Lee, S. C., Ogino, S., Bien, S. A., Kooperberg, C., Li, C. I., Lin, Y., Prentice, R., Qu, C., Bézieau, S., Yamaji, T., Sawada, N., Iwasaki, M., Le Marchand, L., Wu, A. H., Qu, C., McNeil, C. E., Coetzee, G., Hayward, C., Deary, I. J., Harris, S. E., Theodoratou, E., Reid, S., Walker, M., Ooi, L. Y., Lau, K. S., Zhao, H., Hsu, L., Cai, Q., Dunlop, M. G., Gruber, S. B., Houlston, R. S., Moreno, V., Casey, G., Peters, U., Tomlinson, I., Zheng, W. 2024; 15 (1): 3557

Abstract

Genome-wide association studies (GWAS) have identified more than 200 common genetic variants independently associated with colorectal cancer (CRC) risk, but the causal variants and target genes are mostly unknown. We sought to fine-map all known CRC risk loci using GWAS data from 100,204 cases and 154,587 controls of East Asian and European ancestry. Our stepwise conditional analyses revealed 238 independent association signals of CRC risk, each with a set of credible causal variants (CCVs), of which 28 signals had a single CCV. Our cis-eQTL/mQTL and colocalization analyses using colorectal tissue-specific transcriptome and methylome data separately from 1299 and 321 individuals, along with functional genomic investigation, uncovered 136 putative CRC susceptibility genes, including 56 genes not previously reported. Analyses of single-cell RNA-seq data from colorectal tissues revealed 17 putative CRC susceptibility genes with distinct expression patterns in specific cell types. Analyses of whole exome sequencing data provided additional support for several target genes identified in this study as CRC susceptibility genes. Enrichment analyses of the 136 genes uncover pathways not previously linked to CRC risk. Our study substantially expanded association signals for CRC and provided additional insight into the biological mechanisms underlying CRC development.

View details for DOI 10.1038/s41467-024-47399-x

View details for PubMedID 38670944

View details for PubMedCentralID PMC11053150
Predicting chromatin conformation contact maps. bioRxiv : the preprint server for biology Min, A., Schreiber, J., Kundaje, A., Noble, W. S. 2024

Abstract

Over the past 15 years, a variety of next-generation sequencing assays have been developed for measuring the 3D conformation of DNA in the nucleus. Each of these assays gives, for a particular cell or tissue type, a distinct picture of 3D chromatin architecture. Accordingly, making sense of the relationship between genome structure and function requires teasing apart two closely related questions: how does chromatin 3D structure change from one cell type to the next, and how do different measurements of that structure differ from one another, even when the two assays are carried out in the same cell type? In this work, we assemble a collection of chromatin 3D datasets, each represented as a 2D contact map, spanning multiple assay types and cell types. We then build a machine learning model that predicts missing contact maps in this collection. We use the model to systematically explore how genome 3D architecture changes, at the level of compartments, domains, and loops, between cell types and between assay types.

View details for DOI 10.1101/2024.04.12.589240

View details for PubMedID 38645064

View details for PubMedCentralID PMC11030330
Genetic risk impacts the association of menopausal hormone therapy with colorectal cancer risk. British journal of cancer Tian, Y., Lin, Y., Qu, C., Arndt, V., Baurley, J. W., Berndt, S. I., Bien, S. A., Bishop, D. T., Brenner, H., Buchanan, D. D., Budiarto, A., Campbell, P. T., Carreras-Torres, R., Casey, G., Chan, A. T., Chen, R., Chen, X., Conti, D. V., Díez-Obrero, V., Dimou, N., Drew, D. A., Figueiredo, J. C., Gallinger, S., Giles, G. G., Gruber, S. B., Gunter, M. J., Harlid, S., Harrison, T. A., Hidaka, A., Hoffmeister, M., Huyghe, J. R., Jenkins, M. A., Jordahl, K. M., Joshi, A. D., Keku, T. O., Kawaguchi, E., Kim, A. E., Kundaje, A., Larsson, S. C., Marchand, L. L., Lewinger, J. P., Li, L., Moreno, V., Morrison, J., Murphy, N., Nan, H., Nassir, R., Newcomb, P. A., Obón-Santacana, M., Ogino, S., Ose, J., Pardamean, B., Pellatt, A. J., Peoples, A. R., Platz, E. A., Potter, J. D., Prentice, R. L., Rennert, G., Ruiz-Narvaez, E. A., Sakoda, L. C., Schoen, R. E., Shcherbina, A., Stern, M. C., Su, Y. R., Thibodeau, S. N., Thomas, D. C., Tsilidis, K. K., van Duijnhoven, F. J., Van Guelpen, B., Visvanathan, K., White, E., Wolk, A., Woods, M. O., Wu, A. H., Peters, U., Gauderman, W. J., Hsu, L., Chang-Claude, J. 2024

Abstract

Menopausal hormone therapy (MHT), a common treatment to relieve symptoms of menopause, is associated with a lower risk of colorectal cancer (CRC). To inform CRC risk prediction and MHT risk-benefit assessment, we aimed to evaluate the joint association of a polygenic risk score (PRS) for CRC and MHT on CRC risk. We used data from 28,486 postmenopausal women (11,519 cases and 16,967 controls) of European descent. A PRS based on 141 CRC-associated genetic variants was modeled as a categorical variable in quartiles. Multiplicative interaction between PRS and MHT use was evaluated using logistic regression. Additive interaction was measured using the relative excess risk due to interaction (RERI). 30-year cumulative risks of CRC for 50-year-old women according to MHT use and PRS were calculated. The reduction in odds ratios by MHT use was larger in women within the highest quartile of PRS compared to that in women within the lowest quartile of PRS (p-value = 2.7 × 10^-8). At the highest quartile of PRS, the 30-year CRC risk was statistically significantly lower for women taking any MHT than for women not taking any MHT, 3.7% (3.3%-4.0%) vs 6.1% (5.7%-6.5%) (difference 2.4%, P-value = 1.83 × 10^-14); these differences were also statistically significant but smaller in magnitude in the lowest PRS quartile, 1.6% (1.4%-1.8%) vs 2.2% (1.9%-2.4%) (difference 0.6%, P-value = 1.01 × 10^-3), indicating 4 times greater reduction in absolute risk associated with any MHT use in the highest compared to the lowest quartile of genetic CRC risk. MHT use has a greater impact on the reduction of CRC risk for women at higher genetic risk. These findings have implications for the development of risk prediction models for CRC and potentially for the consideration of genetic information in the risk-benefit assessment of MHT use.
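
The absolute-risk comparison in this abstract can be re-derived from the figures it reports. The sketch below is illustrative only and is not the authors' code. It recomputes the 30-year risk differences from the quoted point estimates and includes the standard definition of RERI (RR11 - RR10 - RR01 + 1). The function names and the relative risks passed to `reri` are hypothetical and do not come from the study.

```python
def risk_difference(risk_no_mht: float, risk_mht: float) -> float:
    """Absolute risk reduction in percentage points, rounded to one decimal."""
    return round(risk_no_mht - risk_mht, 1)


def reri(rr_both: float, rr_a_only: float, rr_b_only: float) -> float:
    """Relative excess risk due to interaction: RR11 - RR10 - RR01 + 1.

    A value of 0 corresponds to no interaction on the additive scale.
    """
    return rr_both - rr_a_only - rr_b_only + 1.0


# 30-year cumulative CRC risks (%) quoted in the abstract.
highest_quartile = risk_difference(6.1, 3.7)  # no MHT vs any MHT, top PRS quartile
lowest_quartile = risk_difference(2.2, 1.6)   # no MHT vs any MHT, bottom PRS quartile
ratio = round(highest_quartile / lowest_quartile, 1)

print(highest_quartile, lowest_quartile, ratio)

# Hypothetical relative risks, used only to show how the formula behaves.
example_reri = reri(rr_both=3.0, rr_a_only=2.0, rr_b_only=1.5)
print(example_reri)
```

The ratio of the two differences reproduces the "4 times greater reduction" stated in the abstract.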

View details for DOI 10.1038/s41416-024-02638-2

View details for PubMedID 38561434

View details for PubMedCentralID 7431089
Multicenter integrated analysis of noncoding CRISPRi screens. Nature methods Yao, D., Tycko, J., Oh, J. W., Bounds, L. R., Gosai, S. J., Lataniotis, L., Mackay-Smith, A., Doughty, B. R., Gabdank, I., Schmidt, H., Guerrero-Altamirano, T., Siklenka, K., Guo, K., White, A. D., Youngworth, I., Andreeva, K., Ren, X., Barrera, A., Luo, Y., Yardımcı, G. G., Tewhey, R., Kundaje, A., Greenleaf, W. J., Sabeti, P. C., Leslie, C., Pritykin, Y., Moore, J. E., Beer, M. A., Gersbach, C. A., Reddy, T. E., Shen, Y., Engreitz, J. M., Bassik, M. C., Reilly, S. K. 2024

Abstract

The ENCODE Consortium's efforts to annotate noncoding cis-regulatory elements (CREs) have advanced our understanding of gene regulatory landscapes. Pooled, noncoding CRISPR screens offer a systematic approach to investigate cis-regulatory mechanisms. The ENCODE4 Functional Characterization Centers conducted 108 screens in human cell lines, comprising >540,000 perturbations across 24.85 megabases of the genome. Using 332 functionally confirmed CRE-gene links in K562 cells, we established guidelines for screening endogenous noncoding elements with CRISPR interference (CRISPRi), including accurate detection of CREs that exhibit variable, often low, transcriptional effects. Benchmarking five screen analysis tools, we find that CASA produces the most conservative CRE calls and is robust to artifacts of low-specificity single guide RNAs. We uncover a subtle DNA strand bias for CRISPRi in transcribed regions with implications for screen design and analysis. Together, we provide an accessible data resource, predesigned single guide RNAs for targeting 3,275,697 ENCODE SCREEN candidate CREs with CRISPRi and screening guidelines to accelerate functional characterization of the noncoding genome.

View details for DOI 10.1038/s41592-024-02216-7

View details for PubMedID 38504114

View details for PubMedCentralID 3771521
Protocol for mapping the three-dimensional organization of dinoflagellate genomes. STAR protocols Marinov, G. K., Kundaje, A., Greenleaf, W. J., Grossman, A. R. 2024; 5 (2): 102941

Abstract

Dinoflagellate genomes often are very large and difficult to assemble, which has until recently precluded their analysis with modern functional genomic tools. Here, we present a protocol for mapping three-dimensional (3D) genome organization in dinoflagellates and using it for scaffolding their genome assemblies. We describe steps for crosslinking, nuclear lysis, denaturation, restriction digest, ligation, and DNA shearing and purification. We then detail procedures for sequencing library generation and computational analysis, including initial Hi-C read mapping and 3D-DNA scaffolding/assembly correction. For complete details on the use and execution of this protocol, please refer to Marinov et al. [1].

View details for DOI 10.1016/j.xpro.2024.102941

View details for PubMedID 38483898
Author Correction: Advances and prospects for the Human BioMolecular Atlas Program (HuBMAP). Nature cell biology Jain, S., Pei, L., Spraggins, J. M., Angelo, M., Carson, J. P., Gehlenborg, N., Ginty, F., Goncalves, J. P., Hagood, J. S., Hickey, J. W., Kelleher, N. L., Laurent, L. C., Lin, S., Lin, Y., Liu, H., Naba, A., Nakayasu, E. S., Qian, W., Radtke, A., Robson, P., Stockwell, B. R., Van de Plas, R., Vlachos, I. S., Zhou, M., HuBMAP Consortium, Borner, K., Snyder, M. P., Ahn, K. J., Allen, J., Anderson, D. M., Anderton, C. R., Curcio, C., Angelin, A., Arvanitis, C., Atta, L., Awosika-Olumo, D., Bahmani, A., Bai, H., Balderrama, K., Balzano, L., Bandyopadhyay, G., Bandyopadhyay, S., Bar-Joseph, Z., Barnhart, K., Barwinska, D., Becich, M., Becker, L., Becker, W., Bedi, K., Bendall, S., Benninger, K., Betancur, D., Bettinger, K., Billings, S., Blood, P., Bolin, D., Border, S., Bosse, M., Bramer, L., Brewer, M., Brusko, M., Bueckle, A., Burke, K., Burnum-Johnson, K., Butcher, E., Butterworth, E., Cai, L., Calandrelli, R., Caldwell, M., Campbell-Thompson, M., Cao, D., Cao-Berg, I., Caprioli, R., Caraccio, C., Caron, A., Carroll, M., Chadwick, C., Chen, A., Chen, D., Chen, F., Chen, H., Chen, J., Chen, L., Chen, L., Chiacchia, K., Cho, S., Chou, P., Choy, L., Cisar, C., Clair, G., Clarke, L., Clouthier, K. A., Colley, M. E., Conlon, K., Conroy, J., Contrepois, K., Corbett, A., Corwin, A., Cotter, D., Courtois, E., Cruz, A., Csonka, C., Czupil, K., Daiya, V., Dale, K., Davanagere, S. A., Dayao, M., de Caestecker, M. P., Decker, A., Deems, S., Degnan, D., Desai, T., Deshpande, V., Deutsch, G., Devlin, M., Diep, D., Dodd, C., Donahue, S., Dong, W., Dos Santos Peixoto, R., Duffy, M., Dufresne, M., Duong, T. E., Dutra, J., Eadon, M. T., El-Achkar, T. M., Enninful, A., Eraslan, G., Eshelman, D., Espin-Perez, A., Esplin, E. D., Esselman, A., Falo, L. D., Falo, L., Fan, J., Fan, R., Farrow, M. 
A., Farzad, N., Favaro, P., Fermin, J., Filiz, F., Filus, S., Fisch, K., Fisher, E., Fisher, S., Flowers, K., Flynn, W. F., Fogo, A. B., Fu, D. A., Fulcher, J., Fung, A., Furst, D., Gallant, M., Gao, F., Gao, Y., Gaulton, K., Gaut, J. P., Gee, J., Ghag, R. R., Ghazanfar, S., Ghose, S., Gisch, D., Gold, I., Gondalia, A., Gorman, B., Greenleaf, W., Greenwald, N., Gregory, B., Guo, R., Gupta, R., Hakimian, H., Haltom, J., Halushka, M., Han, K. S., Hanson, C., Harbury, P., Hardi, J., Harlan, L., Harris, R. C., Hartman, A., Heidari, E., Helfer, J., Helminiak, D., Hemberg, M., Henning, N., Herr, B. W., Ho, J., Holden-Wiltse, J., Hong, S., Hong, Y., Honick, B., Hood, G., Hu, P., Hu, Q., Huang, M., Huyck, H., Imtiaz, T., Isberg, O. G., Itkin, M., Jackson, D., Jacobs, M., Jain, Y., Jewell, D., Jiang, L., Jiang, Z. G., Johnston, S., Joshi, P., Ju, Y., Judd, A., Kagel, A., Kahn, A., Kalavros, N., Kalhor, K., Karagkouni, D., Karathanos, T., Karunamurthy, A., Katari, S., Kates, H., Kaushal, M., Keener, N., Keller, M., Kenney, M., Kern, C., Kharchenko, P., Kim, J., Kingsford, C., Kirwan, J., Kiselev, V., Kishi, J., Kitata, R. B., Knoten, A., Kollar, C., Krishnamoorthy, P., Kruse, A. R., Da, K., Kundaje, A., Kutschera, E., Kwon, Y., Lake, B. B., Lancaster, S., Langlieb, J., Lardenoije, R., Laronda, M., Laskin, J., Lau, K., Lee, H., Lee, M., Lee, M., Strekalova, Y. L., Li, D., Li, J., Li, J., Li, X., Li, Z., Liao, Y., Liaw, T., Lin, P., Lin, Y., Lindsay, S., Liu, C., Liu, Y., Liu, Y., Lott, M., Lotz, M., Lowery, L., Lu, P., Lu, X., Lucarelli, N., Lun, X., Luo, Z., Ma, J., Macosko, E., Mahajan, M., Maier, L., Makowski, D., Malek, M., Manthey, D., Manz, T., Margulies, K., Marioni, J., Martindale, M., Mason, C., Mathews, C., Maye, P., McCallum, C., McDonough, E., McDonough, L., Mcdowell, H., Meads, M., Medina-Serpas, M., Ferreira, R. M., Messinger, J., Metis, K., Migas, L. 
G., Miller, B., Mimar, S., Minor, B., Misra, R., Missarova, A., Mistretta, C., Moens, R., Moerth, E., Moffitt, J., Molla, G., Monroe, M., Monte, E., Morgan, M., Muraro, D., Murphy, B. R., Murray, E., Musen, M. A., Naglah, A., Nasamran, C., Neelakantan, T., Nevins, S., Nguyen, H., Nguyen, N., Nguyen, T., Nguyen, T., Nigra, D., Nofal, M., Nolan, G., Nwanne, G., O'Connor, M., Okuda, K., Olmer, M., O'Neill, K., Otaluka, N., Pang, M., Parast, M., Pasa-Tolic, L., Paten, B., Patterson, N. H., Peng, T., Phillips, G., Pichavant, M., Piehowski, P., Pilner, H., Pingry, E., Pita-Juarez, Y., Plevritis, S., Ploumakis, A., Pouch, A., Pryhuber, G., Puerto, J., Qaurooni, D., Qin, L., Quardokus, E. M., Rajbhandari, P., Rakow-Penner, R., Ramasamy, R., Read, D., Record, E. G., Reeves, D., Ricarte, A., Rodriguez-Soto, A., Ropelewski, A., Rosario, J., Roselkis, M., Rowe, D., Roy, T. K., Ruffalo, M., Ruschman, N., Sabo, A., Sachdev, N., Saka, S., Salamon, D., Sarder, P., Sasaki, H., Satija, R., Saunders, D., Sawka, R., Schey, K., Schlehlein, H., Scholten, D., Schultz, S., Schwartz, L., Schwenk, M., Scibek, R., Segre, A., Serrata, M., Shands, W., Shen, X., Shendure, J., Shephard, H., Shi, L., Shi, T., Shin, D., Shirey, B., Sibilla, M., Silber, M., Silverstein, J., Simmel, D., Simmons, A., Singhal, D., Sivajothi, S., Smits, T., Soncin, F., Song, Q., Stanley, V., Stuart, T., Su, H., Su, P., Sun, X., Surrette, C., Swahn, H., Tan, K., Teichmann, S., Tejomay, A., Tellides, G., Thomas, K., Thomas, T., Thompson, M., Tian, H., Tideman, L., Trapnell, C., Tsai, A. G., Tsai, C., Tsai, L., Tsui, E., Tsui, T., Tung, J., Turner, M., Uranic, J., Vaishnav, E. D., Varra, S. R., Vaskivskyi, V., Velickovic, D., Velickovic, M., Verheyden, J., Waldrip, J., Wallace, D., Wan, X., Wang, A., Wang, F., Wang, M., Wang, S., Wang, X., Wasserfall, C., Wayne, L., Webber, J., Weber, G. 
M., Wei, B., Wei, J., Weimer, A., Welling, J., Wen, X., Wen, Z., Williams, M., Winfree, S., Winograd, N., Woodard, A., Wright, D., Wu, F., Wu, P., Wu, Q., Wu, X., Xing, Y., Xu, T., Yang, M., Yang, M., Yap, J., Ye, D. H., Yin, P., Yuan, Z., Yun, C. J., Zahraei, A., Zemaitis, K., Zhang, B., Zhang, C., Zhang, C., Zhang, C., Zhang, K., Zhang, S., Zhang, T., Zhang, Y., Zhao, B., Zhao, W., Zheng, J. W., Zhong, S., Zhu, B., Zhu, C., Zhu, D., Zhu, Q., Zhu, Y. 2024

View details for DOI 10.1038/s41556-024-01384-0

View details for PubMedID 38429479
CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods GENOME BIOLOGY Jain, S., Bakolitsa, C., Brenner, S. E., Radivojac, P., Moult, J., Repo, S., Hoskins, R. A., Andreoletti, G., Barsky, D., Chellapan, A., Chu, H., Dabbiru, N., Kollipara, N. K., Ly, M., Neumann, A. J., Pal, L. R., Odell, E., Pandey, G., Peters-Petrulewicz, R. C., Srinivasan, R., Yee, S. F., Yeleswarapu, S., Zuhl, M., Adebali, O., Patra, A., Beer, M. A., Hosur, R., Peng, J., Bernard, B. M., Berry, M., Dong, S., Boyle, A. P., Adhikari, A., Chen, J., Hu, Z., Wang, R., Wang, Y., Miller, M., Wang, Y., Bromberg, Y., Turina, P., Capriotti, E., Han, J. J., Ozturk, K., Carter, H., Babbi, G., Bovo, S., Di Lena, P., Martelli, P., Savojardo, C., Casadio, R., Cline, M. S., De Baets, G., Bonache, S., Diez, O., Gutierrez-Enriquez, S., Fernandez, A., Montalban, G., Ootes, L., Ozkan, S., Padilla, N., Riera, C., De la Cruz, X., Diekhans, M., Huwe, P. J., Wei, Q., Xu, Q., Dunbrack, R. L., Gotea, V., Elnitski, L., Margolin, G., Fariselli, P., Kulakovskiy, I. V., Makeev, V. J., Penzar, D. D., Vorontsov, I. E., Favorov, A. V., Forman, J. R., Hasenahuer, M., Fornasari, M. S., Parisi, G., Avsec, Z., Celik, M. H., Thi Yen Duong Nguyen, Gagneur, J., Shi, F., Edwards, M. D., Guo, Y., Tian, K., Zeng, H., Gifford, D. K., Goke, J., Zaucha, J., Gough, J., Ritchie, G. R. S., Frankish, A., Mudge, J. M., Harrow, J., Young, E. L., Yu, Y., Huff, C. D., Murakami, K., Nagai, Y., Imanishi, T., Mungall, C. J., Jacobsen, J. O. B., Kim, D., Jeong, C., Jones, D. T., Li, M., Guthrie, V., Bhattacharya, R., Chen, Y., Douville, C., Fan, J., Kim, D., Masica, D., Niknafs, N., Sengupta, S., Tokheim, C., Turner, T. N., Yeo, H., Karchin, R., Shin, S., Welch, R., Keles, S., Li, Y., Kellis, M., Corbi-Verge, C., Strokach, A. V., Kim, P. M., Klein, T. E., Mohan, R., Sinnott-Armstrong, N. A., Wainberg, M., Kundaje, A., Gonzaludo, N., Mak, A. C. 
Y., Chhibber, A., Lam, H. Y. K., Dahary, D., Fishilevich, S., Lancet, D., Lee, I., Bachman, B., Katsonis, P., Lua, R. C., Wilson, S. J., Lichtarge, O., Bhat, R. R., Sundaram, L., Viswanath, V., Bellazzi, R., Nicora, G., Rizzo, E., Limongelli, I., Mezlini, A. M., Chang, R., Kim, S., Lai, C., O'Connor, R., Topper, S., van den Akker, J., Zhou, A. Y., Zimmer, A. D., Mishne, G., Bergquist, T. R., Breese, M. R., Guerrero, R. F., Jiang, Y., Kiga, N., Li, B., Mort, M., Pagel, K. A., Pejaver, V., Stamboulian, M. H., Thusberg, J., Mooney, S. D., Teerakulkittipong, N., Cao, C., Kundu, K., Yin, Y., Yu, C., Kleyman, M., Lin, C., Stackpole, M., Mount, S. M., Eraslan, G., Mueller, N. S., Naito, T., Rao, A. R., Azaria, J. R., Brodie, A., Ofran, Y., Garg, A., Pal, D., Hawkins-Hooker, A., Kenlay, H., Reid, J., Mucaki, E. J., Rogan, P. K., Schwarz, J. M., Searls, D. B., Lee, G., Seok, C., Kramer, A., Shah, S., Huang, C. V., Kirsch, J. F., Shatsky, M., Cao, Y., Chen, H., Karimi, M., Moronfoye, O., Sun, Y., Shen, Y., Shigeta, R., Ford, C. T., Nodzak, C., Uppal, A., Shi, X., Joseph, T., Kotte, S., Rana, S., Rao, A., Saipradeep, V. G., Sivadasan, N., Sunderam, U., Stanke, M., Su, A., Adzhubey, I., Jordan, D. M., Sunyaev, S., Rousseau, F., Schymkowitz, J., Van Durme, J., Tavtigian, S. V., Carraro, M., Giollo, M., Tosatto, S. C. E., Adato, O., Carmel, L., Cohen, N. E., Fenesh, I., Holtzer, I., Juven-Gershon, T., Unger, R., Niroula, A., Olatubosun, A., Valiaho, J., Yang, Y., Vihinen, M., Wahl, M. E., Chang, B., Chong, K., Hu, I., Sun, R., Wu, W., Xia, X., Zee, B. C., Wang, M. H., Wang, M., Wu, C., Lu, Y., Chen, K., Yang, Y., Yates, C. M., Kreimer, A., Yan, Z., Yosef, N., Zhao, H., Wei, Z., Yao, Z., Zhou, F., Folkman, L., Zhou, Y., Daneshjou, R., Altman, R. B., Inoue, F., Ahituv, N., Arkin, A. P., Lovisa, F., Bonvini, P., Bowdin, S., Gianni, S., Mantuano, E., Minicozzi, V., Novak, L., Pasquo, A., Pastore, A., Petrosino, M., Puglisi, R., Toto, A., Veneziano, L., Chiaraluce, R., Ball, M. 
P., Bobe, J. R., Church, G. M., Consalvi, V., Mort, M., Cooper, D. N., Buckley, B. A., Sheridan, M. B., Cutting, G. R., Scaini, M., Cygan, K. J., Fredericks, A. M., Glidden, D. T., Neil, C., Rhine, C. L., Fairbrother, W. G., Alontaga, A. Y., Fenton, A. W., Matreyek, K. A., Starita, L. M., Fowler, D. M., Loescher, B., Franke, A., Adamson, S. I., Graveley, B. R., Gray, J. W., Malloy, M. J., Kane, J. P., Kousi, M., Katsanis, N., Schubach, M., Kircher, M., Tang, P. L. F., Kwok, P., Lathrop, R. H., Clark, W. T., Yu, G. K., LeBowitz, J. H., Benedicenti, F., Bettella, E., Bigoni, S., Cesca, F., Mammi, I., Marino-Bus-Ije, C., Milani, D., Peron, A., Polli, R., Sartori, S., Stanzial, F., Ioldo, I., Turolla, L., Aspromonte, M. C., Bellini, M., Leonardi, E., Liu, X., Marshall, C., McCombie, W., Elefanti, L., Menin, C., Meyn, M., Murgia, A., Nadeau, K. C. Y., Neuhausen, S. L., Nussbaum, R. L., Pirooznia, M., Potash, J. B., Dimster-Denk, D. F., Rine, J. D., Sanford, J. R., Snyder, M., Tavtigian, S. V., Cole, A. G., Sun, S., Verby, M. W., Weile, J., Roth, F. P., Tewhey, R., Sabeti, P. C., Campagna, J., Refaat, M. M., Wojciak, J., Grubb, S., Schmitt, N., Shendure, J., Spurdle, A. B., Stavropoulos, D. J., Walton, N. A., Zandi, P. P., Ziv, E., Burke, W., Chen, F., Carr, L. R., Martinez, S., Paik, J., Harris-Wai, J., Yarborough, M., Fullerton, S. M., Koenig, B. A., McInnes, G., Shigaki, D., Chandonia, J., Furutsuki, M., Kasak, L., Yu, C., Chen, R., Cline, M. S., Pandey, G., Friedberg, I., Getz, G. A., Cong, Q., Kinch, L. N., Zhang, J., Grishin, N. V., Voskanian, A., Kann, M. G., Clark, W. T., Tran, E., Ioannidis, N. M., Hunter, J. M., Udani, R., Cai, B., Morgan, A. A., Sokolov, A., Stuart, J. M., Tavtigian, S. V., Minervini, G., Monzon, A. M., Batzoglou, S., Butte, A. J., Church, G. M., Greenblatt, M. S., Hart, R. K., Hernandez, R., Hubbard, T. J. P., Kahn, S., O'Donnell-Luria, A., Ng, P. C., Shon, J., Tavtigian, S. V., Veltman, J., Zook, J. 
M., Critical Assessment Genome 2024; 25 (1): 53

Abstract

The Critical Assessment of Genome Interpretation (CAGI) aims to advance the state-of-the-art for computational prediction of genetic variant impact, particularly where relevant to disease. The five complete editions of the CAGI community experiment comprised 50 challenges, in which participants made blind predictions of phenotypes from genetic data, and these were evaluated by independent assessors. Performance was particularly strong for clinical pathogenic variants, including some difficult-to-diagnose cases, and extends to interpretation of cancer-related variants. Missense variant interpretation methods were able to estimate biochemical effects with increasing accuracy. Assessment of methods for regulatory variants and complex trait disease risk was less definitive and indicates performance potentially suitable for auxiliary use in the clinic. Results show that while current methods are imperfect, they have major utility for research and clinical applications. Emerging methods and increasingly large, robust datasets for training and assessment promise further progress ahead.

View details for DOI 10.1186/s13059-023-03113-6

View details for Web of Science ID 001184832400002

View details for PubMedID 38389099

View details for PubMedCentralID PMC10882881
Genome-wide interaction study with smoking for colorectal cancer risk identifies novel genetic loci related to tumor suppression, inflammation and immune response Carreras-Torres, R., Kim, A. E., Lin, Y., Diez-Obrero, V., Bien, S. A., Qu, C., Wang, J., Dimou, N., Aglago, E. K., Bouras, E., Campbell, P. T., Casey, G., Chang-Claude, J., Drew, D. A., Gunter, M., Jordahl, K. M., Kawaguchi, E., Kundaje, A., Morrison, J. L., Murphy, N., Newcomb, P., Obon-Santacana, M., Papadimitriou, N., Peoples, A. R., Ruiz-Narvaez, E., Shcherbina, A., Stern, M. C., Su, Y., Tian, Y., Tsilidis, K. K., van Duijnhoven, F. J. B., Hsu, L., Peters, U., Moreno, V., Gauderman, W. SPRINGERNATURE. 2024: 772

View details for Web of Science ID 001147414903547
DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA Patel, A., Singhal, A., Wang, A., Pampari, A., Kasowski, M., Kundaje, A. edited by Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2024

View details for Web of Science ID 001633273000042
Detection and analysis of complex structural variation in human genomes across populations and in brains of donors with psychiatric disorders. Cell Zhou, B., Arthur, J. G., Guo, H., et al. 2024; Published online September 30, 2024

View details for DOI 10.1016/j.cell.2024.09.014
Rewriting regulatory DNA to dissect and reprogram gene expression. bioRxiv : the preprint server for biology Martyn, G. E., Montgomery, M. T., Jones, H., Guo, K., Doughty, B. R., Linder, J., Chen, Z., Cochran, K., Lawrence, K. A., Munson, G., Pampari, A., Fulco, C. P., Kelley, D. R., Lander, E. S., Kundaje, A., Engreitz, J. M. 2023

Abstract

Regulatory DNA sequences within enhancers and promoters bind transcription factors to encode cell type-specific patterns of gene expression. However, the regulatory effects and programmability of such DNA sequences remain difficult to map or predict because we have lacked scalable methods to precisely edit regulatory DNA and quantify the effects in an endogenous genomic context. Here we present an approach to measure the quantitative effects of hundreds of designed DNA sequence variants on gene expression, by combining pooled CRISPR prime editing with RNA fluorescence in situ hybridization and cell sorting (Variant-FlowFISH). We apply this method to mutagenize and rewrite regulatory DNA sequences in an enhancer and the promoter of PPIF in two immune cell lines. Of 672 variant-cell type pairs, we identify 497 that affect PPIF expression. These variants appear to act through a variety of mechanisms including disruption or optimization of existing transcription factor binding sites, as well as creation of de novo sites. Disrupting a single endogenous transcription factor binding site often led to large changes in expression (up to -40% in the enhancer, and -50% in the promoter). The same variant often had different effects across cell types and states, demonstrating a highly tunable regulatory landscape. We use these data to benchmark performance of sequence-based predictive models of gene regulation, and find that certain types of variants are not accurately predicted by existing models. Finally, we computationally design 185 small sequence variants (≤10 bp) and optimize them for specific effects on expression in silico. 84% of these rationally designed edits showed the intended direction of effect, and some had dramatic effects on expression (-100% to +202%). 
Variant-FlowFISH thus provides a powerful tool to map the effects of variants and transcription factor binding sites on gene expression, test and improve computational models of gene regulation, and reprogram regulatory DNA.

View details for DOI 10.1101/2023.12.20.572268

View details for PubMedID 38187584

View details for PubMedCentralID PMC10769263
Genome-wide gene-environment interaction analyses to understand the relationship between red meat and processed meat intake and colorectal cancer risk. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology Stern, M. C., Sanchez Mendez, J., Kim, A. E., Obón-Santacana, M., Moratalla-Navarro, F., Martín, V., Moreno, V., Lin, Y., Bien, S. A., Qu, C., Su, Y. R., White, E., Harrison, T. A., Huyghe, J. R., Tangen, C. M., Newcomb, P. A., Phipps, A. I., Thomas, C. E., Kawaguchi, E. S., Lewinger, J. P., Morrison, J. L., Conti, D. V., Wang, J., Thomas, D. C., Platz, E. A., Visvanathan, K., Keku, T. O., Newton, C. C., Um, C. Y., Kundaje, A., Shcherbina, A., Murphy, N., Gunter, M. J., Dimou, N., Papadimitriou, N., Bézieau, S., van Duijnhoven, F. J., Männistö, S., Rennert, G., Wolk, A., Hoffmeister, M., Brenner, H., Chang-Claude, J., Tian, Y., Le Marchand, L., Cotterchio, M., Tsilidis, K. K., Bishop, D. T., Melaku, Y. A., Lynch, B. M., Buchanan, D. D., Ulrich, C. M., Ose, J., Peoples, A. R., Pellatt, A. J., Li, L., Devall, M. A., Campbell, P. T., Albanes, D., Weinstein, S. J., Berndt, S. I., Gruber, S. B., Ruiz-Narvaez, E., Song, M., Joshi, A. D., Drew, D. A., Petrick, J. L., Chan, A. T., Giannakis, M., Peters, U., Hsu, L., Gauderman, W. J. 2023

Abstract

High red meat and/or processed meat consumption are established colorectal cancer (CRC) risk factors. We conducted a genome-wide gene-environment (GxE) interaction analysis to identify genetic variants that may modify these associations. A pooled sample of 29,842 CRC cases and 39,635 controls of European ancestry from 27 studies was included. Quantiles for red meat and processed meat intake were constructed from harmonized questionnaire data. Genotyping arrays were imputed to the Haplotype Reference Consortium. Two-step EDGE and joint tests of GxE interaction were utilized in our genome-wide scan. Meta-analyses confirmed positive associations between increased consumption of red meat and processed meat with CRC risk (per quartile red meat OR = 1.30; 95% CI = 1.21-1.41; processed meat OR = 1.40; 95% CI = 1.20-1.63). Two significant genome-wide GxE interactions for red meat consumption were found. Joint GxE tests revealed the rs4871179 SNP in chromosome 8 (downstream of HAS2); greater than median of consumption ORs = 1.38 (95% CI = 1.29-1.46), 1.20 (95% CI = 1.12-1.27), and 1.07 (95% CI = 0.95-1.19) for CC, CG and GG, respectively. The two-step EDGE method identified the rs35352860 SNP in chromosome 18 (SMAD7 intron); greater than median of consumption ORs = 1.18 (95% CI = 1.11-1.24), 1.35 (95% CI = 1.26-1.44), and 1.46 (95% CI = 1.26-1.69) for CC, CT, and TT, respectively. We propose two novel biomarkers that support the role of meat consumption with an increased risk of CRC. The reported GxE interactions may explain the increased risk of CRC in certain population subgroups.

View details for DOI 10.1158/1055-9965.EPI-23-0717

View details for PubMedID 38112776
Identification of constrained sequence elements across 239 primate genomes. Nature Kuderna, L. F., Ulirsch, J. C., Rashid, S., Ameen, M., Sundaram, L., Hickey, G., Cox, A. J., Gao, H., Kumar, A., Aguet, F., Christmas, M. J., Clawson, H., Haeussler, M., Janiak, M. C., Kuhlwilm, M., Orkin, J. D., Bataillon, T., Manu, S., Valenzuela, A., Bergman, J., Rouselle, M., Silva, F. E., Agueda, L., Blanc, J., Gut, M., de Vries, D., Goodhead, I., Harris, R. A., Raveendran, M., Jensen, A., Chuma, I. S., Horvath, J. E., Hvilsom, C., Juan, D., Frandsen, P., Schraiber, J. G., de Melo, F. R., Bertuol, F., Byrne, H., Sampaio, I., Farias, I., Valsecchi, J., Messias, M., da Silva, M. N., Trivedi, M., Rossi, R., Hrbek, T., Andriaholinirina, N., Rabarivola, C. J., Zaramody, A., Jolly, C. J., Phillips-Conroy, J., Wilkerson, G., Abee, C., Simmons, J. H., Fernandez-Duque, E., Kanthaswamy, S., Shiferaw, F., Wu, D., Zhou, L., Shao, Y., Zhang, G., Keyyu, J. D., Knauf, S., Le, M. D., Lizano, E., Merker, S., Navarro, A., Nadler, T., Khor, C. C., Lee, J., Tan, P., Lim, W. K., Kitchener, A. C., Zinner, D., Gut, I., Melin, A. D., Guschanski, K., Schierup, M. H., Beck, R. M., Karakikes, I., Wang, K. C., Umapathy, G., Roos, C., Boubli, J. P., Siepel, A., Kundaje, A., Paten, B., Lindblad-Toh, K., Rogers, J., Marques Bonet, T., Farh, K. K. 2023

Abstract

Noncoding DNA is central to our understanding of human gene regulation and complex diseases [1,2], and measuring the evolutionary sequence constraint can establish the functional relevance of putative regulatory elements in the human genome [3-9]. Identifying the genomic elements that have become constrained specifically in primates has been hampered by the faster evolution of noncoding DNA compared to protein-coding DNA [10], the relatively short timescales separating primate species [11], and the previously limited availability of whole-genome sequences [12]. Here we construct a whole-genome alignment of 239 species, representing nearly half of all extant species in the primate order. Using this resource, we identified human regulatory elements that are under selective constraint across primates and other mammals at a 5% false discovery rate. We detected 111,318 DNase I hypersensitivity sites and 267,410 transcription factor binding sites that are constrained specifically in primates but not across other placental mammals and validate their cis-regulatory effects on gene expression. These regulatory elements are enriched for human genetic variants that affect gene expression and complex traits and diseases. Our results highlight the important role of recent evolution in regulatory sequence elements differentiating primates, including humans, from other placental mammals.

View details for DOI 10.1038/s41586-023-06798-8

View details for PubMedID 38030727

View details for PubMedCentralID 1891336
An encyclopedia of enhancer-gene regulatory interactions in the human genome. bioRxiv : the preprint server for biology Gschwind, A. R., Mualim, K. S., Karbalayghareh, A., Sheth, M. U., Dey, K. K., Jagoda, E., Nurtdinov, R. N., Xi, W., Tan, A. S., Jones, H., Ma, X. R., Yao, D., Nasser, J., Avsec, Ž., James, B. T., Shamim, M. S., Durand, N. C., Rao, S. S., Mahajan, R., Doughty, B. R., Andreeva, K., Ulirsch, J. C., Fan, K., Perez, E. M., Nguyen, T. C., Kelley, D. R., Finucane, H. K., Moore, J. E., Weng, Z., Kellis, M., Bassik, M. C., Price, A. L., Beer, M. A., Guigó, R., Stamatoyannopoulos, J. A., Lieberman Aiden, E., Greenleaf, W. J., Leslie, C. S., Steinmetz, L. M., Kundaje, A., Engreitz, J. M. 2023

Abstract

Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease1-6. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium7. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.

View details for DOI 10.1101/2023.11.09.563812

View details for PubMedID 38014075

View details for PubMedCentralID PMC10680627
Latent human herpesvirus 6 is reactivated in CAR T cells. Nature Lareau, C. A., Yin, Y., Maurer, K., Sandor, K. D., Daniel, B., Yagnik, G., Peña, J., Crawford, J. C., Spanjaart, A. M., Gutierrez, J. C., Haradhvala, N. J., Riberdy, J. M., Abay, T., Stickels, R. R., Verboon, J. M., Liu, V., Buquicchio, F. A., Wang, F., Southard, J., Song, R., Li, W., Shrestha, A., Parida, L., Getz, G., Maus, M. V., Li, S., Moore, A., Roberts, Z. J., Ludwig, L. S., Talleur, A. C., Thomas, P. G., Dehghani, H., Pertel, T., Kundaje, A., Gottschalk, S., Roth, T. L., Kersten, M. J., Wu, C. J., Majzner, R. G., Satpathy, A. T. 2023

Abstract

Cell therapies have yielded durable clinical benefits for patients with cancer, but the risks associated with the development of therapies from manipulated human cells are understudied. For example, we lack a comprehensive understanding of the mechanisms of toxicities observed in patients receiving T cell therapies, including recent reports of encephalitis caused by reactivation of human herpesvirus 6 (HHV-6)1. Here, through petabase-scale viral genomics mining, we examine the landscape of human latent viral reactivation and demonstrate that HHV-6B can become reactivated in cultures of human CD4+ T cells. Using single-cell sequencing, we identify a rare population of HHV-6 'super-expressors' (about 1 in 300-10,000 cells) that possess high viral transcriptional activity, among research-grade allogeneic chimeric antigen receptor (CAR) T cells. By analysing single-cell sequencing data from patients receiving cell therapy products that are approved by the US Food and Drug Administration2 or are in clinical studies3-5, we identify the presence of HHV-6-super-expressor CAR T cells in patients in vivo. Together, the findings of our study demonstrate the utility of comprehensive genomics analyses in implicating cell therapy products as a potential source contributing to the lytic HHV-6 infection that has been reported in clinical trials1,6-8 and may influence the design and production of autologous and allogeneic cell therapies.

View details for DOI 10.1038/s41586-023-06704-2

View details for PubMedID 37938768

View details for PubMedCentralID 9827115
The chromatin landscape of the euryarchaeon Haloferax volcanii. Genome biology Marinov, G. K., Bagdatli, S. T., Wu, T., He, C., Kundaje, A., Greenleaf, W. J. 2023; 24 (1): 253

Abstract

BACKGROUND: Archaea, together with Bacteria, represent the two main divisions of life on Earth, with many of the defining characteristics of the more complex eukaryotes tracing their origin to evolutionary innovations first made in their archaeal ancestors. One of the most notable such features is nucleosomal chromatin, although archaeal histones and chromatin differ significantly from those of eukaryotes, not all archaea possess histones and it is not clear if histones are a main packaging component for all that do. Despite increased interest in archaeal chromatin in recent years, its properties have been little studied using genomic tools. RESULTS: Here, we adapt the ATAC-seq assay to archaea and use it to map the accessible landscape of the genome of the euryarchaeote Haloferax volcanii. We integrate the resulting datasets with genome-wide maps of active transcription and single-stranded DNA (ssDNA) and find that while H. volcanii promoters exist in a preferentially accessible state, unlike most eukaryotes, modulation of transcriptional activity is not associated with changes in promoter accessibility. Applying orthogonal single-molecule footprinting methods, we quantify the absolute levels of physical protection of H. volcanii and find that Haloferax chromatin is similarly or only slightly more accessible, in aggregate, than that of eukaryotes. We also evaluate the degree of coordination of transcription within archaeal operons and make the unexpected observation that some CRISPR arrays are associated with highly prevalent ssDNA structures. CONCLUSIONS: Our results provide the first comprehensive maps of chromatin accessibility and active transcription in Haloferax across conditions and thus a foundation for future functional studies of archaeal chromatin.

View details for DOI 10.1186/s13059-023-03095-5

View details for PubMedID 37932847
Transcriptomics and chromatin accessibility in multiple African population samples. bioRxiv : the preprint server for biology DeGorter, M. K., Goddard, P. C., Karakoc, E., Kundu, S., Yan, S. M., Nachun, D., Abell, N., Aguirre, M., Carstensen, T., Chen, Z., Durrant, M., Dwaracherla, V. R., Feng, K., Gloudemans, M. J., Hunter, N., Moorthy, M. P., Pomilla, C., Rodrigues, K. B., Smith, C. J., Smith, K. S., Ungar, R. A., Balliu, B., Fellay, J., Flicek, P., McLaren, P. J., Henn, B., McCoy, R. C., Sugden, L., Kundaje, A., Sandhu, M. S., Gurdasani, D., Montgomery, S. B. 2023

Abstract

Mapping the functional human genome and impact of genetic variants is often limited to European-descendent population samples. To aid in overcoming this limitation, we measured gene expression using RNA sequencing in lymphoblastoid cell lines (LCLs) from 599 individuals from six African populations to identify novel transcripts including those not represented in the hg38 reference genome. We used whole genomes from the 1000 Genomes Project and 164 Maasai individuals to identify 8,881 expression and 6,949 splicing quantitative trait loci (eQTLs/sQTLs), and 2,611 structural variants associated with gene expression (SV-eQTLs). We further profiled chromatin accessibility using ATAC-Seq in a subset of 100 representative individuals, to identify chromatin accessibility quantitative trait loci (caQTLs) and allele-specific chromatin accessibility, and provide predictions for the functional effect of 78.9 million variants on chromatin accessibility. Using this map of eQTLs and caQTLs we fine-mapped GWAS signals for a range of complex diseases. Combined, this work expands global functional genomic data to identify novel transcripts, functional elements and variants, understand population genetic history of molecular quantitative trait loci, and further resolve the genetic basis of multiple human traits and disease.

View details for DOI 10.1101/2023.11.04.564839

View details for PubMedID 37986808

View details for PubMedCentralID PMC10659267
The landscape of the histone-organized chromatin of Bdellovibrionota bacteria. bioRxiv : the preprint server for biology Marinov, G. K., Doughty, B., Kundaje, A., Greenleaf, W. J. 2023

Abstract

Histone proteins have traditionally been thought to be restricted to eukaryotes and most archaea, with eukaryotic nucleosomal histones deriving from their archaeal ancestors. In contrast, bacteria lack histones as a rule. However, histone proteins have recently been identified in a few bacterial clades, most notably the phylum Bdellovibrionota, and these histones have been proposed to exhibit a range of divergent features compared to histones in archaea and eukaryotes. However, no functional genomic studies of the properties of Bdellovibrionota chromatin have been carried out. In this work, we map the landscape of chromatin accessibility, active transcription and three-dimensional genome organization in a member of Bdellovibrionota (a Bacteriovorax strain). We find that, similar to what is observed in some archaea and in eukaryotes with compact genomes such as yeast, Bacteriovorax chromatin is characterized by preferential accessibility around promoter regions. Similar to eukaryotes, chromatin accessibility in Bacteriovorax positively correlates with gene expression. Mapping active transcription through single-strand DNA (ssDNA) profiling revealed that unlike in yeast, but similar to the state of mammalian and fly promoters, Bacteriovorax promoters exhibit very strong polymerase pausing. Finally, similar to that of other bacteria without histones, the Bacteriovorax genome exists in a three-dimensional (3D) configuration organized by the parABS system along the axis defined by replication origin and termination regions. These results provide a foundation for understanding the chromatin biology of the unique Bdellovibrionota bacteria and the functional diversity in chromatin organization across the tree of life.

View details for DOI 10.1101/2023.10.30.564843

View details for PubMedID 37961278

View details for PubMedCentralID PMC10634947
RNA polymerase II dynamics and mRNA stability feedback scale mRNA amounts with cell size. Cell Swaffer, M. P., Marinov, G. K., Zheng, H., Fuentes Valenzuela, L., Tsui, C. Y., Jones, A. W., Greenwood, J., Kundaje, A., Greenleaf, W. J., Reyes-Lamothe, R., Skotheim, J. M. 2023

Abstract

A fundamental feature of cellular growth is that total protein and RNA amounts increase with cell size to keep concentrations approximately constant. A key component of this is that global transcription rates increase in larger cells. Here, we identify RNA polymerase II (RNAPII) as the limiting factor scaling mRNA transcription with cell size in budding yeast, as transcription is highly sensitive to the dosage of RNAPII but not to other components of the transcriptional machinery. Our experiments support a dynamic equilibrium model where global RNAPII transcription at a given size is set by the mass action recruitment kinetics of unengaged nucleoplasmic RNAPII to the genome. However, this only drives a sub-linear increase in transcription with size, which is then partially compensated for by a decrease in mRNA decay rates as cells enlarge. Thus, limiting RNAPII and feedback on mRNA stability work in concert to scale mRNA amounts with cell size.

View details for DOI 10.1016/j.cell.2023.10.012

View details for PubMedID 37944513
CXCL12 regulates coronary artery dominance in diverse populations and links development to disease. medRxiv : the preprint server for health sciences Rios Coronado, P. E., Zanetti, D., Zhou, J., Naftaly, J. A., Prabala, P., Kho, P. F., Martínez Jaimes, A. M., Hilliard, A. T., Pyarajan, S., Dochtermann, D., Chang, K. M., Winn, V. D., Pașca, A. M., Plomondon, M. E., Waldo, S. W., Tsao, P. S., Clarke, S. L., Red-Horse, K., Assimes, T. L. 2023

Abstract

Mammalian cardiac muscle is supplied with blood by right and left coronary arteries that form branches covering both ventricles of the heart. Whether branches of the right or left coronary arteries wrap around to the inferior side of the left ventricle is variable in humans and termed right or left dominance. Coronary dominance is likely a heritable trait, but its genetic architecture has never been explored. Here, we present the first large-scale multi-ancestry genome-wide association study of dominance in 61,043 participants of the VA Million Veteran Program, including over 10,300 Africans and 4,400 Admixed Americans. Dominance was moderately heritable with ten loci reaching genome wide significance. The most significant mapped to the chemokine CXCL12 in both Europeans and Africans. Whole-organ imaging of human fetal hearts revealed that dominance is established during development in locations where CXCL12 is expressed. In mice, dominance involved the septal coronary artery, and its patterning was altered with Cxcl12 deficiency. Finally, we linked human dominance patterns with coronary artery disease through colocalization, genome-wide genetic correlation and Mendelian Randomization analyses. Together, our data supports CXCL12 as a primary determinant of coronary artery dominance in humans of diverse backgrounds and suggests that developmental patterning of arteries may influence one's susceptibility to ischemic heart disease.

View details for DOI 10.1101/2023.10.27.23297507

View details for PubMedID 37961706

View details for PubMedCentralID PMC10635223
Drug Discovery in Low Data Regimes: Leveraging a Computational Pipeline for the Discovery of Novel SARS-CoV-2 Nsp14-MTase Inhibitors. bioRxiv : the preprint server for biology Nigam, A., Hurley, M. F., Li, F., Konkoľová, E., Klíma, M., Trylčová, J., Pollice, R., Çinaroǧlu, S. S., Levin-Konigsberg, R., Handjaya, J., Schapira, M., Chau, I., Perveen, S., Ng, H. L., Ümit Kaniskan, H., Han, Y., Singh, S., Gorgulla, C., Kundaje, A., Jin, J., Voelz, V. A., Weber, J., Nencka, R., Boura, E., Vedadi, M., Aspuru-Guzik, A. 2023

Abstract

The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has led to significant global morbidity and mortality. A crucial viral protein, the non-structural protein 14 (nsp14), catalyzes the methylation of viral RNA and plays a critical role in viral genome replication and transcription. Due to the low mutation rate in the nsp region among various SARS-CoV-2 variants, nsp14 has emerged as a promising therapeutic target. However, discovering potential inhibitors remains a challenge. In this work, we introduce a computational pipeline for the rapid and efficient identification of potential nsp14 inhibitors by leveraging virtual screening and the NCI open compound collection, which contains 250,000 freely available molecules for researchers worldwide. The introduced pipeline provides a cost-effective and efficient approach for early-stage drug discovery by allowing researchers to evaluate promising molecules without incurring synthesis expenses. Our pipeline successfully identified seven promising candidates after experimentally validating only 40 compounds. Notably, we discovered NSC620333, a compound that exhibits a strong binding affinity to nsp14 with a dissociation constant of 427 ± 84 nM. In addition, we gained new insights into the structure and function of this protein through molecular dynamics simulations. We identified new conformational states of the protein and determined that residues Phe367, Tyr368, and Gln354 within the binding pocket serve as stabilizing residues for novel ligand interactions. We also found that metal coordination complexes are crucial for the overall function of the binding pocket. Lastly, we present the solved crystal structure of the nsp14-MTase complexed with SS148, a potent inhibitor of methyltransferase activity at the nanomolar level (IC50 value of 70 ± 6 nM). 
Our computational pipeline accurately predicted the binding pose of SS148, demonstrating its effectiveness and potential in accelerating drug discovery efforts against SARS-CoV-2 and other emerging viruses.

View details for DOI 10.1101/2023.10.03.560722

View details for PubMedID 37873443

View details for PubMedCentralID PMC10592886
Short tandem repeats bind transcription factors to tune eukaryotic gene expression. Science (New York, N.Y.) Horton, C. A., Alexandari, A. M., Hayes, M. G., Marklund, E., Schaepe, J. M., Aditham, A. K., Shah, N., Suzuki, P. H., Shrikumar, A., Afek, A., Greenleaf, W. J., Gordân, R., Zeitlinger, J., Kundaje, A., Fordyce, P. M. 2023; 381 (6664): eadd1250

Abstract

Short tandem repeats (STRs) are enriched in eukaryotic cis-regulatory elements and alter gene expression, yet how they regulate transcription remains unknown. We found that STRs modulate transcription factor (TF)-DNA affinities and apparent on-rates by about 70-fold by directly binding TF DNA-binding domains, with energetic impacts exceeding many consensus motif mutations. STRs maximize the number of weakly preferred microstates near target sites, thereby increasing TF density, with impacts well predicted by statistical mechanics. Confirming that STRs also affect TF binding in cells, neural networks trained only on in vivo occupancies predicted effects identical to those observed in vitro. Approximately 90% of TFs preferentially bound STRs that need not resemble known motifs, providing a cis-regulatory mechanism to target TFs to genomic sites.

View details for DOI 10.1126/science.add1250

View details for PubMedID 37733848
Genome-wide interaction analysis of folate for colorectal cancer risk. The American journal of clinical nutrition Bouras, E., Kim, A. E., Lin, Y., Morrison, J., Du, M., Albanes, D., Barry, E. L., Baurley, J. W., Berndt, S. I., Bien, S. A., Bishop, T. D., Brenner, H., Budiarto, A., Burnett-Hartman, A., Campbell, P. T., Carreras-Torres, R., Casey, G., Cenggoro, T. W., Chan, A. T., Chang-Claude, J., Conti, D. V., Cotterchio, M., Devall, M., Diez-Obrero, V., Dimou, N., Drew, D. A., Figueiredo, J. C., Giles, G. G., Gruber, S. B., Gunter, M. J., Harrison, T. A., Hidaka, A., Hoffmeister, M., Huyghe, J. R., Joshi, A. D., Kawaguchi, E. S., Keku, T. O., Kundaje, A., Le Marchand, L., Lewinger, J. P., Li, L., Lynch, B. M., Mahesworo, B., Männistö, S., Moreno, V., Murphy, N., Newcomb, P. A., Obón-Santacana, M., Ose, J., Palmer, J. R., Papadimitriou, N., Pardamean, B., Pellatt, A. J., Peoples, A. R., Platz, E. A., Potter, J. D., Qi, L., Qu, C., Rennert, G., Ruiz-Narvaez, E., Sakoda, L. C., Schmit, S. L., Shcherbina, A., Stern, M. C., Su, Y. R., Tangen, C. M., Thomas, D. C., Tian, Y., Um, C. Y., van Duijnhoven, F. J., Van Guelpen, B., Visvanathan, K., Wang, J., White, E., Wolk, A., Woods, M. O., Ulrich, C. M., Hsu, L., Gauderman, W. J., Peters, U., Tsilidis, K. K. 2023

Abstract

Epidemiological and experimental evidence suggests that higher folate intake is associated with a decreased colorectal cancer (CRC) risk; however, the mechanisms underlying this relationship are not fully understood. Genetic variation that may have a direct or indirect impact on folate metabolism can provide insights into folate's role in CRC. Our aim was to perform a genome-wide interaction analysis to identify genetic variants that may modify the association of folate on CRC risk. We applied traditional case-control logistic regression, joint 3-degree of freedom (3DF), and a two-step weighted hypothesis approach to test the interactions of common variants (allele frequency >1%) across the genome and dietary folate, folic acid supplement use, and total folate in relation to risk of CRC, in 30,550 cases and 42,336 controls from 51 studies from 3 genetic consortia (CCFR, CORECT, GECCO). Inverse associations of dietary, total folate, and folic acid supplement with CRC were found [odds ratio: 0.93 (95% confidence intervals [CI]: 0.90-0.96), and 0.91 (0.89-0.94) per quartile higher intake, and 0.82 (0.78-0.88) for users vs. non-users, respectively]. Interactions (P-interaction < 5×10⁻⁸) of folic acid supplement and variants in the 3p25.2 locus [in the region of Synapsin II (SYN2)/tissue inhibitor of metalloproteinase 4 (TIMP4)] were found using the traditional interaction analysis, with variant rs150924902 (located upstream to SYN2) showing the strongest interaction. In stratified analyses by rs150924902 genotypes, folate supplement was associated with decreased CRC risk among those carrying the TT genotype (OR = 0.82; 95% CI: 0.79-0.86) but increased CRC risk among those carrying the TA genotype (OR = 1.63; 95% CI: 1.29-2.05), suggesting a qualitative interaction (P-interaction = 1.4×10⁻⁸). No interactions were observed for dietary and total folate. Variation in 3p25.2 locus may modify the association of folate supplement with CRC risk. Experimental studies and studies incorporating other relevant -omics data are warranted to validate this finding.

View details for DOI 10.1016/j.ajcnut.2023.08.010

View details for PubMedID 37640106
Chromatin accessibility in the Drosophila embryo is determined by transcription factor pioneering and enhancer activation. Developmental cell Brennan, K. J., Weilert, M., Krueger, S., Pampari, A., Liu, H. Y., Yang, A. W., Morrison, J. A., Hughes, T. R., Rushlow, C. A., Kundaje, A., Zeitlinger, J. 2023

Abstract

Chromatin accessibility is integral to the process by which transcription factors (TFs) read out cis-regulatory DNA sequences, but it is difficult to differentiate between TFs that drive accessibility and those that do not. Deep learning models that learn complex sequence rules provide an unprecedented opportunity to dissect this problem. Using zygotic genome activation in Drosophila as a model, we analyzed high-resolution TF binding and chromatin accessibility data with interpretable deep learning and performed genetic validation experiments. We identify a hierarchical relationship between the pioneer TF Zelda and the TFs involved in axis patterning. Zelda consistently pioneers chromatin accessibility proportional to motif affinity, whereas patterning TFs augment chromatin accessibility in sequence contexts where they mediate enhancer activation. We conclude that chromatin accessibility occurs in two tiers: one through pioneering, which makes enhancers accessible but not necessarily active, and the second when the correct combination of TFs leads to enhancer activation.

View details for DOI 10.1016/j.devcel.2023.07.007

View details for PubMedID 37557175
The ENCODE Uniform Analysis Pipelines. Research square Hitz, B. C., Lee, J. W., Jolanki, O., Kagda, M. S., Graham, K., Sud, P., Gabdank, I., Strattan, J. S., Sloan, C. A., Dreszer, T., Rowe, L. D., Podduturi, N. R., Malladi, V. S., Chan, E. T., Davidson, J. M., Ho, M., Miyasato, S., Simison, M., Tanaka, F., Luo, Y., Whaling, I., Hong, E. L., Lee, B. T., Sandstrom, R., Rynes, E., Nelson, J., Nishida, A., Ingersoll, A., Buckley, M., Frerker, M., Kim, D. S., Boley, N., Trout, D., Dobin, A., Rahmanian, S., Wyman, D., Balderrama-Gutierrez, G., Reese, F., Durand, N. C., Dudchenko, O., Weisz, D., Rao, S. S., Blackburn, A., Gkountaroulis, D., Sadr, M., Olshansky, M., Eliaz, Y., Nguyen, D., Bochkov, I., Shamim, M. S., Mahajan, R., Aiden, E., Gingeras, T., Heath, S., Hirst, M., Kent, W. J., Kundaje, A., Mortazavi, A., Wold, B., Cherry, J. M. 2023

Abstract

The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/) is publicly available in GitHub, with images available on Dockerhub (https://hub.docker.com), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.

View details for DOI 10.21203/rs.3.rs-3111932/v1

View details for PubMedID 37503119

View details for PubMedCentralID PMC10371165
Advances and prospects for the Human BioMolecular Atlas Program (HuBMAP). Nature cell biology Jain, S., Pei, L., Spraggins, J. M., Angelo, M., Carson, J. P., Gehlenborg, N., Ginty, F., Gonçalves, J. P., Hagood, J. S., Hickey, J. W., Kelleher, N. L., Laurent, L. C., Lin, S., Lin, Y., Liu, H., Naba, A., Nakayasu, E. S., Qian, W. J., Radtke, A., Robson, P., Stockwell, B. R., Van de Plas, R., Vlachos, I. S., Zhou, M., Börner, K., Snyder, M. P. 2023

Abstract

The Human BioMolecular Atlas Program (HuBMAP) aims to create a multi-scale spatial atlas of the healthy human body at single-cell resolution by applying advanced technologies and disseminating resources to the community. As the HuBMAP moves past its first phase, creating ontologies, protocols and pipelines, this Perspective introduces the production phase: the generation of reference spatial maps of functional tissue units across many organs from diverse populations and the creation of mapping tools and infrastructure to advance biomedical research.

View details for DOI 10.1038/s41556-023-01194-w

View details for PubMedID 37468756

View details for PubMedCentralID 8238499
Chromatin accessibility dynamics of neurogenic niche cells reveal defects in neural stem cell adhesion and migration during aging. Nature aging Yeo, R. W., Zhou, O. Y., Zhong, B. L., Sun, E. D., Navarro Negredo, P., Nair, S., Sharmin, M., Ruetz, T. J., Wilson, M., Kundaje, A., Dunn, A. R., Brunet, A. 2023

Abstract

The regenerative potential of brain stem cell niches deteriorates during aging. Yet the mechanisms underlying this decline are largely unknown. Here we characterize genome-wide chromatin accessibility of neurogenic niche cells in vivo during aging. Interestingly, chromatin accessibility at adhesion and migration genes decreases with age in quiescent neural stem cells (NSCs) but increases with age in activated (proliferative) NSCs. Quiescent and activated NSCs exhibit opposing adhesion behaviors during aging: quiescent NSCs become less adhesive, whereas activated NSCs become more adhesive. Old activated NSCs also show decreased migration in vitro and diminished mobilization out of the niche for neurogenesis in vivo. Using tension sensors, we find that aging increases force-producing adhesions in activated NSCs. Inhibiting the cytoskeletal-regulating kinase ROCK reduces these adhesions, restores migration in old activated NSCs in vitro, and boosts neurogenesis in vivo. These results have implications for restoring the migratory potential of NSCs and for improving neurogenesis in the aged brain.

View details for DOI 10.1038/s43587-023-00449-3

View details for PubMedID 37443352

View details for PubMedCentralID 4683085
Single-cell multi-omics of mitochondrial DNA disorders reveals dynamics of purifying selection across human immune cells. Nature genetics Lareau, C. A., Dubois, S. M., Buquicchio, F. A., Hsieh, Y. H., Garg, K., Kautz, P., Nitsch, L., Praktiknjo, S. D., Maschmeyer, P., Verboon, J. M., Gutierrez, J. C., Yin, Y., Fiskin, E., Luo, W., Mimitou, E. P., Muus, C., Malhotra, R., Parikh, S., Fleming, M. D., Oevermann, L., Schulte, J., Eckert, C., Kundaje, A., Smibert, P., Vardhana, S. A., Satpathy, A. T., Regev, A., Sankaran, V. G., Agarwal, S., Ludwig, L. S. 2023

Abstract

Pathogenic mutations in mitochondrial DNA (mtDNA) compromise cellular metabolism, contributing to cellular heterogeneity and disease. Diverse mutations are associated with diverse clinical phenotypes, suggesting distinct organ- and cell-type-specific metabolic vulnerabilities. Here we establish a multi-omics approach to quantify deletions in mtDNA alongside cell state features in single cells derived from six patients across the phenotypic spectrum of single large-scale mtDNA deletions (SLSMDs). By profiling 206,663 cells, we reveal the dynamics of pathogenic mtDNA deletion heteroplasmy consistent with purifying selection and distinct metabolic vulnerabilities across T-cell states in vivo and validate these observations in vitro. By extending analyses to hematopoietic and erythroid progenitors, we reveal mtDNA dynamics and cell-type-specific gene regulatory adaptations, demonstrating the context-dependence of perturbing mitochondrial genomic integrity. Collectively, we report pathogenic mtDNA heteroplasmy dynamics of individual blood and immune cells across lineages, demonstrating the power of single-cell multi-omics for revealing fundamental properties of mitochondrial genetics.

View details for DOI 10.1038/s41588-023-01433-8

View details for PubMedID 37386249

View details for PubMedCentralID 3809581
Probing the diabetes and colorectal cancer relationship using gene - environment interaction analyses. British journal of cancer Dimou, N., Kim, A. E., Flanagan, O., Murphy, N., Diez-Obrero, V., Shcherbina, A., Aglago, E. K., Bouras, E., Campbell, P. T., Casey, G., Gallinger, S., Gruber, S. B., Jenkins, M. A., Lin, Y., Moreno, V., Ruiz-Narvaez, E., Stern, M. C., Tian, Y., Tsilidis, K. K., Arndt, V., Barry, E. L., Baurley, J. W., Berndt, S. I., Bézieau, S., Bien, S. A., Bishop, D. T., Brenner, H., Budiarto, A., Carreras-Torres, R., Cenggoro, T. W., Chan, A. T., Chang-Claude, J., Chanock, S. J., Chen, X., Conti, D. V., Dampier, C. H., Devall, M., Drew, D. A., Figueiredo, J. C., Giles, G. G., Gsur, A., Harrison, T. A., Hidaka, A., Hoffmeister, M., Huyghe, J. R., Jordahl, K., Kawaguchi, E., Keku, T. O., Larsson, S. C., Le Marchand, L., Lewinger, J. P., Li, L., Mahesworo, B., Morrison, J., Newcomb, P. A., Newton, C. C., Obon-Santacana, M., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Peoples, A. R., Pharoah, P. D., Platz, E. A., Potter, J. D., Rennert, G., Scacheri, P. C., Schoen, R. E., Su, Y. R., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Ulrich, C. M., Um, C. Y., van Duijnhoven, F. J., Visvanathan, K., Vodicka, P., Vodickova, L., White, E., Wolk, A., Woods, M. O., Qu, C., Kundaje, A., Hsu, L., Gauderman, W. J., Gunter, M. J., Peters, U. 2023

Abstract

Diabetes is an established risk factor for colorectal cancer. However, the mechanisms underlying this relationship still require investigation and it is not known if the association is modified by genetic variants. To address these questions, we undertook a genome-wide gene-environment interaction analysis. We used data from 3 genetic consortia (CCFR, CORECT, GECCO; 31,318 colorectal cancer cases/41,499 controls) and undertook genome-wide gene-environment interaction analyses with colorectal cancer risk, including interaction tests of genetics (G) × diabetes (1-degree of freedom; d.f.) and joint testing of G × diabetes, G-colorectal cancer association (2-d.f. joint test) and G-diabetes correlation (3-d.f. joint test). Based on the joint tests, we found that the association of diabetes with colorectal cancer risk is modified by loci on chromosomes 8q24.11 (rs3802177, SLC30A8 - OR_AA: 1.62, 95% CI: 1.34-1.96; OR_AG: 1.41, 95% CI: 1.30-1.54; OR_GG: 1.22, 95% CI: 1.13-1.31; p-value (3-d.f.): 5.46 × 10^-11) and 13q14.13 (rs9526201, LRCH1 - OR_GG: 2.11, 95% CI: 1.56-2.83; OR_GA: 1.52, 95% CI: 1.38-1.68; OR_AA: 1.13, 95% CI: 1.06-1.21; p-value (2-d.f.): 7.84 × 10^-09). These results suggest that variation in genes related to insulin signaling (SLC30A8) and immune function (LRCH1) may modify the association of diabetes with colorectal cancer risk and provide novel insights into the biology underlying the diabetes and colorectal cancer relationship.

View details for DOI 10.1038/s41416-023-02312-z

View details for PubMedID 37365285

View details for PubMedCentralID 6767750
A genetic locus within the FMN1/GREM1 gene region interacts with body mass index in colorectal cancer risk. Cancer research Aglago, E. K., Kim, A. E., Lin, Y., Qu, C., Evangelou, M., Ren, Y., Morrison, J., Albanes, D., Arndt, V., Barry, E. L., Baurley, J. W., Berndt, S. I., Bien, S. A., Bishop, D. T., Bouras, E., Brenner, H., Buchanan, D. D., Budiarto, A., Carreras-Torres, R., Casey, G., Cenggoro, T. W., Chan, A. T., Chang-Claude, J., Chen, X., Conti, D. V., Devall, M., Díez-Obrero, V., Dimou, N., Drew, D., Figueiredo, J. C., Gallinger, S., Giles, G. G., Gruber, S. B., Gsur, A., Gunter, M. J., Hampel, H., Harlid, S., Hidaka, A., Harrison, T. A., Hoffmeister, M., Huyghe, J. R., Jenkins, M. A., Jordahl, K., Joshi, A. D., Kawaguchi, E. S., Keku, T. O., Kundaje, A., Larsson, S. C., Le Marchand, L., Lewinger, J. P., Li, L., Lynch, B. M., Mahesworo, B., Mandic, M., Obón-Santacana, M., Moreno, V., Murphy, N., Nan, H., Nassir, R., Newcomb, P. A., Ogino, S., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Peoples, A. R., Platz, E. A., Potter, J. D., Prentice, R. L., Rennert, G., Ruiz-Narvaez, E., Sakoda, L. C., Scacheri, P. C., Schmit, S. L., Schoen, R. E., Shcherbina, A., Slattery, M. L., Stern, M. C., Su, Y. R., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Tian, Y., Ulrich, C. M., van Duijnhoven, F. J., Van Guelpen, B., Visvanathan, K., Vodicka, P., Wang, J., White, E., Wolk, A., Woods, M. O., Wu, A. H., Zemlianskaia, N., Hsu, L., Gauderman, W. J., Peters, U., Tsilidis, K. K., Campbell, P. T. 2023

Abstract

Colorectal cancer (CRC) risk can be impacted by genetic, environmental, and lifestyle factors, including diet and obesity. Gene-environment (G×E) interactions can provide biological insights into the effects of obesity on CRC risk. Here, we assessed potential genome-wide G×E interactions between body mass index (BMI) and common single nucleotide polymorphisms (SNPs) for CRC risk using data from 36,415 CRC cases and 48,451 controls from three international CRC consortia (CCFR, CORECT, and GECCO). The G×E tests included the conventional logistic regression using multiplicative terms (one-degree of freedom, 1DF test), the two-step EDGE method, and the joint 3DF test, each of which is powerful for detecting G×E interactions under specific conditions. BMI was associated with higher CRC risk. The two-step approach revealed a statistically significant G×BMI interaction located within the Formin 1/Gremlin 1 (FMN1/GREM1) gene region (rs58349661). This SNP was also identified by the 3DF test, with a suggestive statistical significance in the 1DF test. Among participants with the CC genotype of rs58349661, overweight and obesity categories were associated with higher CRC risk, whereas null associations were observed across BMI categories in those with the TT genotype. Using data from three large international consortia, this study discovered a locus in the FMN1/GREM1 gene region that interacts with BMI on the association with CRC risk. Further studies should examine the potential mechanisms through which this locus modifies the etiologic link between obesity and CRC.

View details for DOI 10.1158/0008-5472.CAN-22-3713

View details for PubMedID 37249599
De novo distillation of thermodynamic affinity from deep learning regulatory sequence models of in vivo protein-DNA binding. bioRxiv : the preprint server for biology Alexandari, A. M., Horton, C. A., Shrikumar, A., Shah, N., Li, E., Weilert, M., Pufall, M. A., Zeitlinger, J., Fordyce, P. M., Kundaje, A. 2023

Abstract

Transcription factors (TF) are proteins that bind DNA in a sequence-specific manner to regulate gene transcription. Despite their unique intrinsic sequence preferences, in vivo genomic occupancy profiles of TFs differ across cellular contexts. Hence, deciphering the sequence determinants of TF binding, both intrinsic and context-specific, is essential to understand gene regulation and the impact of regulatory, non-coding genetic variation. Biophysical models trained on in vitro TF binding assays can estimate intrinsic affinity landscapes and predict occupancy based on TF concentration and affinity. However, these models cannot adequately explain context-specific, in vivo binding profiles. Conversely, deep learning models, trained on in vivo TF binding assays, effectively predict and explain genomic occupancy profiles as a function of complex regulatory sequence syntax, albeit without a clear biophysical interpretation. To reconcile these complementary models of in vitro and in vivo TF binding, we developed Affinity Distillation (AD), a method that extracts thermodynamic affinities de novo from deep learning models of TF chromatin immunoprecipitation (ChIP) experiments by marginalizing away the influence of genomic sequence context. Applied to neural networks modeling diverse classes of yeast and mammalian TFs, AD predicts energetic impacts of sequence variation within and surrounding motifs on TF binding as measured by diverse in vitro assays with superior dynamic range and accuracy compared to motif-based methods. Furthermore, AD can accurately discern affinities of TF paralogs. Our results highlight thermodynamic affinity as a key determinant of in vivo binding, suggest that deep learning models of in vivo binding implicitly learn high-resolution affinity landscapes, and show that these affinities can be successfully distilled using AD. This new biophysical interpretation of deep learning models enables high-throughput in silico experiments to explore the influence of sequence context and variation on both intrinsic affinity and in vivo occupancy.

View details for DOI 10.1101/2023.05.11.540401

View details for PubMedID 37214836
CasKAS: direct profiling of genome-wide dCas9 and Cas9 specificity using ssDNA mapping. Genome biology Marinov, G. K., Kim, S. H., Bagdatli, S. T., Higashino, S. I., Trevino, A. E., Tycko, J., Wu, T., Bintu, L., Bassik, M. C., He, C., Kundaje, A., Greenleaf, W. J. 2023; 24 (1): 85

Abstract

Detecting and mitigating off-target activity is critical to the practical application of CRISPR-mediated genome and epigenome editing. While numerous methods have been developed to map Cas9 binding specificity genome-wide, they are generally time-consuming and/or expensive, and not applicable to catalytically dead CRISPR enzymes. We have developed CasKAS, a rapid, inexpensive, and facile assay for identifying off-target CRISPR enzyme binding and cleavage by chemically mapping the unwound single-stranded DNA structures formed upon binding of a sgRNA-loaded Cas9 protein. We demonstrate this method in both in vitro and in vivo contexts.

View details for DOI 10.1186/s13059-023-02930-z

View details for PubMedID 37085898

View details for PubMedCentralID PMC10120127
The ENCODE Imputation Challenge: a critical assessment of methods for cross-cell type imputation of epigenomic profiles. Genome biology Schreiber, J., Boix, C., Wook Lee, J., Li, H., Guan, Y., Chang, C. C., Chang, J. C., Hawkins-Hooker, A., Schölkopf, B., Schweikert, G., Carulla, M. R., Canakoglu, A., Guzzo, F., Nanni, L., Masseroli, M., Carman, M. J., Pinoli, P., Hong, C., Yip, K. Y., Spence, J. P., Batra, S. S., Song, Y. S., Mahony, S., Zhang, Z., Tan, W., Shen, Y., Sun, Y., Shi, M., Adrian, J., Sandstrom, R., Farrell, N., Halow, J., Lee, K., Jiang, L., Yang, X., Epstein, C., Strattan, J. S., Bernstein, B., Snyder, M., Kellis, M., Stafford, W., Kundaje, A. 2023; 24 (1): 79

Abstract

A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.

View details for DOI 10.1186/s13059-023-02915-y

View details for PubMedID 37072822

View details for PubMedCentralID PMC10111747
The ENCODE Uniform Analysis Pipelines. bioRxiv : the preprint server for biology Hitz, B. C., Jin-Wook, L., Jolanki, O., Kagda, M. S., Graham, K., Sud, P., Gabdank, I., Strattan, J. S., Sloan, C. A., Dreszer, T., Rowe, L. D., Podduturi, N. R., Malladi, V. S., Chan, E. T., Davidson, J. M., Ho, M., Miyasato, S., Simison, M., Tanaka, F., Luo, Y., Whaling, I., Hong, E. L., Lee, B. T., Sandstrom, R., Rynes, E., Nelson, J., Nishida, A., Ingersoll, A., Buckley, M., Frerker, M., Kim, D. S., Boley, N., Trout, D., Dobin, A., Rahmanian, S., Wyman, D., Balderrama-Gutierrez, G., Reese, F., Durand, N. C., Dudchenko, O., Weisz, D., Rao, S. S., Blackburn, A., Gkountaroulis, D., Sadr, M., Olshansky, M., Eliaz, Y., Nguyen, D., Bochkov, I., Shamim, M. S., Mahajan, R., Aiden, E., Gingeras, T., Heath, S., Hirst, M., Kent, W. J., Kundaje, A., Mortazavi, A., Wold, B., Cherry, J. M. 2023

Abstract

The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/) is publicly available in GitHub, with images available on Dockerhub (https://hub.docker.com), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.

View details for DOI 10.1101/2023.04.04.535623

View details for PubMedID 37066421

View details for PubMedCentralID PMC10104020
The polyclonal path to malignant transformation in familial adenomatous polyposis Schenck, R. O., Khan, A., Horning, A., Esplin, E. D., Monte, E., Wu, S., Hanson, C., Bararpour, N., Neves, S., Jiang, L., Contrepois, K., Lee, H., Guha, T. K., Hu, Z., Laquindanum, R., Mills, M. A., Chaib, H., Chiu, R., Jian, R., Chan, J., Ellenberger, M., Becker, W. R., Bahmani, B., Michael, B., Shen, J., Lancaster, S., Ladabaum, U., Kundaje, A., Longacre, T. A., Greenleaf, W. J., Ford, J. M., Snyder, M. P., Curtis, C. AMER ASSOC CANCER RESEARCH. 2023

View details for DOI 10.1158/1538-7445.AM2023-3497

View details for Web of Science ID 001008499100430
Genome-Wide Analyses Characterize Shared Heritability Among Cancers and Identify Novel Cancer Susceptibility Regions. Journal of the National Cancer Institute Lindström, S., Wang, L., Feng, H., Majumdar, A., Huo, S., Macdonald, J., Harrison, T., Turman, C., Chen, H., Mancuso, N., Bammler, T., Gallinger, S., Gruber, S. B., Gunter, M. J., Le Marchand, L., Moreno, V., Offit, K., de Vivo, I., O'Mara, T. A., Spurdle, A. B., Tomlinson, I., Fitzgerald, R., Gharahkhani, P., Gockel, I., Jankowski, J., Macgregor, S., Schumacher, J., Barnholtz-Sloan, J., Bondy, M. L., Houlston, R. S., Jenkins, R. B., Melin, B., Wrensch, M., Brennan, P., Christiani, D., Johansson, M., Mckay, J., Aldrich, M. C., Amos, C. I., Landi, M. T., Tardon, A., Bishop, D. T., Demenais, F., Goldstein, A. M., Iles, M. M., Kanetsky, P. A., Law, M. H., Amundadottir, L. T., Stolzenberg-Solomon, R., Wolpin, B. M., Klein, A., Petersen, G., Risch, H., Chanock, S. J., Purdue, M. P., Scelo, G., Pharoah, P., Kar, S., Hung, R. J., Pasaniuc, B., Kraft, P. 2023

Abstract

The shared inherited genetic contribution to risk of different cancers is not fully known. In this study, we leverage results from twelve cancer genome-wide association studies (GWAS) to quantify pair-wise genome-wide genetic correlations across cancers and identify novel cancer susceptibility loci. We collected GWAS summary statistics for twelve solid cancers based on 376,759 cancer cases and 532,864 controls of European ancestry. The included cancer types were breast, colorectal, endometrial, esophageal, glioma, head and neck, lung, melanoma, ovarian, pancreatic, prostate, and renal cancers. We conducted cross-cancer GWAS and transcriptome-wide association studies (TWAS) to discover novel cancer susceptibility loci. Finally, we assessed the extent of variant-specific pleiotropy among cancers at known and newly identified cancer susceptibility loci. We observed widespread but modest genome-wide genetic correlations across cancers. In cross-cancer GWAS and TWAS, we identified 15 novel cancer susceptibility loci. Additionally, we identified multiple variants at 77 distinct loci with strong evidence of being associated with at least two cancer types by testing for pleiotropy at known cancer susceptibility loci. Overall, these results suggest that some genetic risk variants are shared among cancers, though much of cancer heritability is cancer- and thus tissue-specific. The increase in statistical power associated with larger sample sizes in cross-disease analysis allows for the identification of novel susceptibility regions. Future studies incorporating data on multiple cancer types are likely to identify additional regions associated with the risk of multiple cancer types.

View details for DOI 10.1093/jnci/djad043

View details for PubMedID 36929942
Aberrant phase separation is a common killing strategy of positively charged peptides in biology and human disease. bioRxiv : the preprint server for biology Boeynaems, S., Ma, X. R., Yeong, V., Ginell, G. M., Chen, J. H., Blum, J. A., Nakayama, L., Sanyal, A., Briner, A., Haver, D. V., Pauwels, J., Ekman, A., Schmidt, H. B., Sundararajan, K., Porta, L., Lasker, K., Larabell, C., Hayashi, M. A., Kundaje, A., Impens, F., Obermeyer, A., Holehouse, A. S., Gitler, A. D. 2023

Abstract

Positively charged repeat peptides are emerging as key players in neurodegenerative diseases. These peptides can perturb diverse cellular pathways but a unifying framework for how such promiscuous toxicity arises has remained elusive. We used mass-spectrometry-based proteomics to define the protein targets of these neurotoxic peptides and found that they all share similar sequence features that drive their aberrant condensation with these positively charged peptides. We trained a machine learning algorithm to detect such sequence features and unexpectedly discovered that this mode of toxicity is not limited to human repeat expansion disorders but has evolved countless times across the tree of life in the form of cationic antimicrobial and venom peptides. We demonstrate that an excess in positive charge is necessary and sufficient for this killer activity, which we name 'polycation poisoning'. These findings reveal an ancient and conserved mechanism and inform ways to leverage its design rules for new generations of bioactive peptides.

View details for DOI 10.1101/2023.03.09.531820

View details for PubMedID 36945394

View details for PubMedCentralID PMC10028949
Author Correction: Deciphering colorectal cancer genetics through multi-omic analysis of 100,204 cases and 154,587 controls of European and east Asian ancestries. Nature genetics Fernandez-Rozadilla, C., Timofeeva, M., Chen, Z., Law, P., Thomas, M., Schmit, S., Diez-Obrero, V., Hsu, L., Fernandez-Tajes, J., Palles, C., Sherwood, K., Briggs, S., Svinti, V., Donnelly, K., Farrington, S., Blackmur, J., Vaughan-Shaw, P., Shu, X., Long, J., Cai, Q., Guo, X., Lu, Y., Broderick, P., Studd, J., Huyghe, J., Harrison, T., Conti, D., Dampier, C., Devall, M., Schumacher, F., Melas, M., Rennert, G., Obon-Santacana, M., Martin-Sanchez, V., Moratalla-Navarro, F., Oh, J. H., Kim, J., Jee, S. H., Jung, K. J., Kweon, S., Shin, M., Shin, A., Ahn, Y., Kim, D., Oze, I., Wen, W., Matsuo, K., Matsuda, K., Tanikawa, C., Ren, Z., Gao, Y., Jia, W., Hopper, J., Jenkins, M., Win, A. K., Pai, R., Figueiredo, J., Haile, R., Gallinger, S., Woods, M., Newcomb, P., Duggan, D., Cheadle, J., Kaplan, R., Maughan, T., Kerr, R., Kerr, D., Kirac, I., Bohm, J., Mecklin, L., Jousilahti, P., Knekt, P., Aaltonen, L., Rissanen, H., Pukkala, E., Eriksson, J., Cajuso, T., Hanninen, U., Kondelin, J., Palin, K., Tanskanen, T., Renkonen-Sinisalo, L., Zanke, B., Mannisto, S., Albanes, D., Weinstein, S., Ruiz-Narvaez, E., Palmer, J., Buchanan, D., Platz, E., Visvanathan, K., Ulrich, C., Siegel, E., Brezina, S., Gsur, A., Campbell, P., Chang-Claude, J., Hoffmeister, M., Brenner, H., Slattery, M., Potter, J., Tsilidis, K., Schulze, M., Gunter, M., Murphy, N., Castells, A., Castellvi-Bel, S., Moreira, L., Arndt, V., Shcherbina, A., Stern, M., Pardamean, B., Bishop, T., Giles, G., Southey, M., Idos, G., McDonnell, K., Abu-Ful, Z., Greenson, J., Shulman, K., Lejbkowicz, F., Offit, K., Su, Y., Steinfelder, R., Keku, T., van Guelpen, B., Hudson, T., Hampel, H., Pearlman, R., Berndt, S., Hayes, R., Martinez, M. E., Thomas, S., Corley, D., Pharoah, P., Larsson, S., Yen, Y., Lenz, H., White, E., Li, L., Doheny, K., Pugh, E., Shelford, T., Chan, A., Cruz-Correa, M., Lindblom, A., Hunter, D., Joshi, A., Schafmayer, C., Scacheri, P., Kundaje, A., Nickerson, D., Schoen, R., Hampe, J., Stadler, Z., Vodicka, P., Vodickova, L., Vymetalkova, V., Papadopoulos, N., Edlund, C., Gauderman, W., Thomas, D., Shibata, D., Toland, A., Markowitz, S., Kim, A., Chanock, S., van Duijnhoven, F., Feskens, E., Sakoda, L., Gago-Dominguez, M., Wolk, A., Naccarati, A., Pardini, B., FitzGerald, L., Lee, S. C., Ogino, S., Bien, S., Kooperberg, C., Li, C., Lin, Y., Prentice, R., Qu, C., Bezieau, S., Tangen, C., Mardis, E., Yamaji, T., Sawada, N., Iwasaki, M., Haiman, C., Le Marchand, L., Wu, A., Qu, C., McNeil, C., Coetzee, G., Hayward, C., Deary, I., Harris, S., Theodoratou, E., Reid, S., Walker, M., Ooi, L. Y., Moreno, V., Casey, G., Gruber, S., Tomlinson, I., Zheng, W., Dunlop, M., Houlston, R., Peters, U. 2023

View details for DOI 10.1038/s41588-023-01334-w

View details for PubMedID 36782065
CPA-Perturb-seq: Multiplexed single-cell characterization of alternative polyadenylation regulators. bioRxiv : the preprint server for biology Kowalski, M. H., Wessels, H., Linder, J., Choudhary, S., Hartman, A., Hao, Y., Mascio, I., Dalgarno, C., Kundaje, A., Satija, R. 2023

Abstract

Most mammalian genes have multiple polyA sites, representing a substantial source of transcript diversity that is governed by the cleavage and polyadenylation (CPA) regulatory machinery. To better understand how these proteins govern polyA site choice we introduce CPA-Perturb-seq, a multiplexed perturbation screen dataset of 42 known CPA regulators with a 3' scRNA-seq readout that enables transcriptome-wide inference of polyA site usage. We develop a statistical framework to specifically identify perturbation-dependent changes in intronic and tandem polyadenylation, and discover modules of co-regulated polyA sites exhibiting distinct functional properties. By training a multi-task deep neural network (APARENT-Perturb) on our dataset, we delineate a cis-regulatory code that predicts responsiveness to perturbation and reveals interactions between distinct regulatory complexes. Finally, we leverage our framework to re-analyze published scRNA-seq datasets, identifying new regulators that affect the relative abundance of alternatively polyadenylated transcripts, and characterizing extensive cellular heterogeneity in 3' UTR length amongst antibody-producing cells. Our work highlights the potential for multiplexed single-cell perturbation screens to further our understanding of post-transcriptional regulation in vitro and in vivo.

View details for DOI 10.1101/2023.02.09.527751

View details for PubMedID 36798324
Single-Molecule Mapping of Chromatin Accessibility Using NOMe-seq/dSMF. Methods in molecular biology (Clifton, N.J.) Hinks, M., Marinov, G. K., Kundaje, A., Bintu, L., Greenleaf, W. J. 2023; 2611: 101-119

Abstract

The bulk of gene expression regulation in most organisms is accomplished through the action of transcription factors (TFs) on cis-regulatory elements (CREs). In eukaryotes, these CREs are generally characterized by nucleosomal depletion and thus higher physical accessibility of DNA. Many methods exploit this property to map regions of high average accessibility, and thus putative active CREs, in bulk. However, these techniques do not provide information about coordinated patterns of accessibility along the same DNA molecule, nor do they map the absolute levels of occupancy/accessibility. SMF (Single-Molecule Footprinting) fills these gaps by leveraging recombinant DNA cytosine methyltransferases (MTase) to mark accessible locations on individual DNA molecules. In this chapter, we discuss current methods and important considerations for performing SMF experiments.

View details for DOI 10.1007/978-1-0716-2899-7_8

View details for PubMedID 36807067
TARTARUS: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design Nigam, A., Pollice, R., Tom, G., Jorner, K., Willes, J., Thiede, L., Kundaje, A., Aspuru-Guzik, A. edited by Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2023

View details for Web of Science ID 001229826601043
Simultaneous Single-Cell Profiling of the Transcriptome and Accessible Chromatin Using SHARE-seq. Methods in molecular biology (Clifton, N.J.) Kim, S. H., Marinov, G. K., Bagdatli, S. T., Higashino, S. I., Shipony, Z., Kundaje, A., Greenleaf, W. J. 2023; 2611: 187-230

Abstract

The ability to analyze the transcriptomic and epigenomic states of individual single cells has in recent years transformed our ability to measure and understand biological processes. Recent advancements have focused on increasing sensitivity and throughput to provide richer and deeper biological insights at the cellular level. The next frontier is the development of multiomic methods capable of analyzing multiple features from the same cell, such as the simultaneous measurement of the transcriptome and the chromatin accessibility of candidate regulatory elements. In this chapter, we discuss and describe SHARE-seq (Simultaneous high-throughput ATAC, and RNA expression with sequencing) for carrying out simultaneous chromatin accessibility and transcriptome measurements in single cells, together with the experimental and analytical considerations for achieving optimal results.

View details for DOI 10.1007/978-1-0716-2899-7_11

View details for PubMedID 36807070
Genome-Wide Mapping of Active Regulatory Elements Using ATAC-seq. Methods in molecular biology (Clifton, N.J.) Marinov, G. K., Shipony, Z., Kundaje, A., Greenleaf, W. J. 2023; 2611: 3-19

Abstract

Active cis-regulatory elements (cREs) in eukaryotes are characterized by nucleosomal depletion and, accordingly, higher accessibility. This property has turned out to be immensely useful for identifying cREs genome-wide and tracking their dynamics across different cellular states and is the basis of numerous methods taking advantage of the preferential enzymatic cleavage/labeling of accessible DNA. ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) has emerged as the most versatile and widely adaptable method and has been widely adopted as the standard tool for mapping open chromatin regions. Here, we discuss the current optimal practices and important considerations for carrying out ATAC-seq experiments, primarily in the context of mammalian systems.

View details for DOI 10.1007/978-1-0716-2899-7_1

View details for PubMedID 36807060
Genome-wide interaction study with smoking for colorectal cancer risk identifies novel genetic loci related to tumor suppression, inflammation and immune response. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology Carreras-Torres, R., Kim, A. E., Lin, Y., Díez-Obrero, V., Bien, S. A., Qu, C., Wang, J., Dimou, N., Aglago, E. K., Albanes, D., Arndt, V., Baurley, J. W., Berndt, S. I., Bézieau, S., Bishop, D. T., Bouras, E., Brenner, H., Budiarto, A., Campbell, P. T., Casey, G., Chan, A. T., Chang-Claude, J., Chen, X., Conti, D. V., Dampier, C. H., Devall, M. A., Drew, D. A., Figueiredo, J. C., Gallinger, S., Giles, G. G., Gruber, S. B., Gsur, A., Gunter, M. J., Harrison, T. A., Hidaka, A., Hoffmeister, M., Huyghe, J. R., Jenkins, M. A., Jordahl, K. M., Kawaguchi, E., Keku, T. O., Kundaje, A., Le Marchand, L., Lewinger, J. P., Li, L., Mahesworo, B., Morrison, J. L., Murphy, N., Nan, H., Nassir, R., Newcomb, P. A., Obón-Santacana, M., Ogino, S., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Peoples, A. R., Pharoah, P. D., Platz, E. A., Rennert, G., Ruiz-Narvaez, E., Sakoda, L. C., Scacheri, P. C., Schmit, S. L., Schoen, R. E., Shcherbina, A., Slattery, M. L., Stern, M. C., Su, Y. R., Tangen, C. M., Thomas, D. C., Tian, Y., Tsilidis, K. K., Ulrich, C. M., van Duijnhoven, F. J., Van Guelpen, B., Visvanathan, K., Vodicka, P., Cenggoro, T. W., Weinstein, S. J., White, E., Wolk, A., Woods, M. O., Hsu, L., Peters, U., Moreno, V., Gauderman, W. J. 2022

Abstract

Tobacco smoking is an established risk factor for colorectal cancer (CRC). However, genetically-defined population subgroups may have increased susceptibility to smoking-related effects on CRC. A genome-wide interaction scan was performed including 33,756 CRC cases and 44,346 controls from three genetic consortia. Evidence of an interaction was observed between smoking status (ever vs never smokers) and a locus on 3p12.1 (rs9880919, p = 4.58 × 10^-8), with higher associated risk in subjects carrying the GG genotype (OR 1.25, 95% CI 1.20-1.30) compared with the other genotypes (OR <1.17 for GA and AA). Among ever smokers, we observed interactions between smoking intensity (increase in 10 cigarettes smoked per day) and two loci on 6p21.33 (rs4151657, p = 1.72 × 10^-8) and 8q24.23 (rs7005722, p = 2.88 × 10^-8). Subjects carrying the rs4151657 TT genotype showed higher risk (OR 1.12, 95% CI 1.09-1.16) compared with the other genotypes (OR <1.06 for TC and CC). Similarly, higher risk was observed among subjects carrying the rs7005722 AA genotype (OR 1.17, 95% CI 1.07-1.28) compared with the other genotypes (OR <1.13 for AC and CC). Functional annotation revealed that SNPs in 3p12.1 and 6p21.33 loci were located in regulatory regions, and were associated with expression levels of nearby genes. Genetic models predicting gene expression revealed that smoking parameters were associated with lower CRC risk with higher expression levels of CADM2 (3p12.1) and ATF6B (6p21.33). Our study identified novel genetic loci that may modulate the risk for CRC of smoking status and intensity, linked to tumor suppression and immune response. These findings can guide potential prevention treatments.

View details for DOI 10.1158/1055-9965.EPI-22-0763

View details for PubMedID 36576985
Deciphering colorectal cancer genetics through multi-omic analysis of 100,204 cases and 154,587 controls of European and east Asian ancestries. Nature genetics Fernandez-Rozadilla, C., Timofeeva, M., Chen, Z., Law, P., Thomas, M., Schmit, S., Díez-Obrero, V., Hsu, L., Fernandez-Tajes, J., Palles, C., Sherwood, K., Briggs, S., Svinti, V., Donnelly, K., Farrington, S., Blackmur, J., Vaughan-Shaw, P., Shu, X. O., Long, J., Cai, Q., Guo, X., Lu, Y., Broderick, P., Studd, J., Huyghe, J., Harrison, T., Conti, D., Dampier, C., Devall, M., Schumacher, F., Melas, M., Rennert, G., Obón-Santacana, M., Martín-Sánchez, V., Moratalla-Navarro, F., Oh, J. H., Kim, J., Jee, S. H., Jung, K. J., Kweon, S. S., Shin, M. H., Shin, A., Ahn, Y. O., Kim, D. H., Oze, I., Wen, W., Matsuo, K., Matsuda, K., Tanikawa, C., Ren, Z., Gao, Y. T., Jia, W. H., Hopper, J., Jenkins, M., Win, A. K., Pai, R., Figueiredo, J., Haile, R., Gallinger, S., Woods, M., Newcomb, P., Duggan, D., Cheadle, J., Kaplan, R., Maughan, T., Kerr, R., Kerr, D., Kirac, I., Böhm, J., Mecklin, L. P., Jousilahti, P., Knekt, P., Aaltonen, L., Rissanen, H., Pukkala, E., Eriksson, J., Cajuso, T., Hänninen, U., Kondelin, J., Palin, K., Tanskanen, T., Renkonen-Sinisalo, L., Zanke, B., Männistö, S., Albanes, D., Weinstein, S., Ruiz-Narvaez, E., Palmer, J., Buchanan, D., Platz, E., Visvanathan, K., Ulrich, C., Siegel, E., Brezina, S., Gsur, A., Campbell, P., Chang-Claude, J., Hoffmeister, M., Brenner, H., Slattery, M., Potter, J., Tsilidis, K., Schulze, M., Gunter, M., Murphy, N., Castells, A., Castellví-Bel, S., Moreira, L., Arndt, V., Shcherbina, A., Stern, M., Pardamean, B., Bishop, T., Giles, G., Southey, M., Idos, G., McDonnell, K., Abu-Ful, Z., Greenson, J., Shulman, K., Lejbkowicz, F., Offit, K., Su, Y. R., Steinfelder, R., Keku, T., van Guelpen, B., Hudson, T., Hampel, H., Pearlman, R., Berndt, S., Hayes, R., Martinez, M. E., Thomas, S., Corley, D., Pharoah, P., Larsson, S., Yen, Y., Lenz, H. J., White, E., Li, L., Doheny, K., Pugh, E., Shelford, T., Chan, A., Cruz-Correa, M., Lindblom, A., Hunter, D., Joshi, A., Schafmayer, C., Scacheri, P., Kundaje, A., Nickerson, D., Schoen, R., Hampe, J., Stadler, Z., Vodicka, P., Vodickova, L., Vymetalkova, V., Papadopoulos, N., Edlund, C., Gauderman, W., Thomas, D., Shibata, D., Toland, A., Markowitz, S., Kim, A., Chanock, S., van Duijnhoven, F., Feskens, E., Sakoda, L., Gago-Dominguez, M., Wolk, A., Naccarati, A., Pardini, B., FitzGerald, L., Lee, S. C., Ogino, S., Bien, S., Kooperberg, C., Li, C., Lin, Y., Prentice, R., Qu, C., Bézieau, S., Tangen, C., Mardis, E., Yamaji, T., Sawada, N., Iwasaki, M., Haiman, C., Le Marchand, L., Wu, A., Qu, C., McNeil, C., Coetzee, G., Hayward, C., Deary, I., Harris, S., Theodoratou, E., Reid, S., Walker, M., Ooi, L. Y., Moreno, V., Casey, G., Gruber, S., Tomlinson, I., Zheng, W., Dunlop, M., Houlston, R., Peters, U. 2022

Abstract

Colorectal cancer (CRC) is a leading cause of mortality worldwide. We conducted a genome-wide association study meta-analysis of 100,204 CRC cases and 154,587 controls of European and east Asian ancestry, identifying 205 independent risk associations, of which 50 were unreported. We performed integrative genomic, transcriptomic and methylomic analyses across large bowel mucosa and other tissues. Transcriptome- and methylome-wide association studies revealed an additional 53 risk associations. We identified 155 high-confidence effector genes functionally linked to CRC risk, many of which had no previously established role in CRC. These have multiple different functions and specifically indicate that variation in normal colorectal homeostasis, proliferation, cell adhesion, migration, immunity and microbial interactions determines CRC risk. Cross-tissue analyses indicated that over a third of effector genes most probably act outside the colonic mucosa. Our findings provide insights into colorectal oncogenesis and highlight potential targets across tissues for new CRC treatment and chemoprevention strategies.

View details for DOI 10.1038/s41588-022-01222-9

View details for PubMedID 36539618
GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic acids research Frankish, A., Carbonell-Sala, S., Diekhans, M., Jungreis, I., Loveland, J. E., Mudge, J. M., Sisu, C., Wright, J. C., Arnan, C., Barnes, I., Banerjee, A., Bennett, R., Berry, A., Bignell, A., Boix, C., Calvet, F., Cerdan-Velez, D., Cunningham, F., Davidson, C., Donaldson, S., Dursun, C., Fatima, R., Giorgetti, S., Giron, C. G., Gonzalez, J. M., Hardy, M., Harrison, P. W., Hourlier, T., Hollis, Z., Hunt, T., James, B., Jiang, Y., Johnson, R., Kay, M., Lagarde, J., Martin, F. J., Gomez, L. M., Nair, S., Ni, P., Pozo, F., Ramalingam, V., Ruffier, M., Schmitt, B. M., Schreiber, J. M., Steed, E., Suner, M., Sumathipala, D., Sycheva, I., Uszczynska-Ratajczak, B., Wass, E., Yang, Y. T., Yates, A., Zafrulla, Z., Choudhary, J. S., Gerstein, M., Guigo, R., Hubbard, T. J., Kellis, M., Kundaje, A., Paten, B., Tress, M. L., Flicek, P. 2022

Abstract

GENCODE produces high quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

View details for DOI 10.1093/nar/gkac1071

View details for PubMedID 36420896
Deciphering the impact of genetic variation on human polyadenylation using APARENT2. Genome biology Linder, J., Koplik, S. E., Kundaje, A., Seelig, G. 2022; 23 (1): 232

Abstract

3'-end processing by cleavage and polyadenylation is an important and finely tuned regulatory process during mRNA maturation. Numerous genetic variants are known to cause or contribute to human disorders by disrupting the cis-regulatory code of polyadenylation signals. Yet, due to the complexity of this code, variant interpretation remains challenging. We introduce a residual neural network model, APARENT2, that can infer 3'-cleavage and polyadenylation from DNA sequence more accurately than any previous model. This model generalizes to the case of alternative polyadenylation (APA) for a variable number of polyadenylation signals. We demonstrate APARENT2's performance on several variant datasets, including functional reporter data and human 3' aQTLs from GTEx. We apply neural network interpretation methods to gain insights into disrupted or protective higher-order features of polyadenylation. We fine-tune APARENT2 on human tissue-resolved transcriptomic data to elucidate tissue-specific variant effects. By combining APARENT2 with models of mRNA stability, we extend aQTL effect size predictions to the entire 3' untranslated region. Finally, we perform in silico saturation mutagenesis of all human polyadenylation signals and compare the predicted effects of [Formula: see text] million variants against gnomAD. While loss-of-function variants were generally selected against, we also find specific clinical conditions linked to gain-of-function mutations. For example, we detect an association between gain-of-function mutations in the 3'-end and autism spectrum disorder. To experimentally validate APARENT2's predictions, we assayed clinically relevant variants in multiple cell lines, including microglia-derived cells. A sequence-to-function model based on deep residual learning enables accurate functional interpretation of genetic variants in polyadenylation signals and, when coupled with large human variation databases, elucidates the link between functional 3'-end mutations and human health.

View details for DOI 10.1186/s13059-022-02799-4

View details for PubMedID 36335397
The dynseq browser track shows context-specific features at nucleotide resolution. Nature genetics Nair, S., Barrett, A., Li, D., Raney, B. J., Lee, B. T., Kerpedjiev, P., Ramalingam, V., Pampari, A., Lekschas, F., Wang, T., Haeussler, M., Kundaje, A. 2022

View details for DOI 10.1038/s41588-022-01194-w

View details for PubMedID 36241719
Single-cell multiome of the human retina and deep learning nominate causal variants in complex eye diseases. Cell genomics Wang, S. K., Nair, S., Li, R., Kraft, K., Pampari, A., Patel, A., Kang, J. B., Luong, C., Kundaje, A., Chang, H. Y. 2022; 2 (8)

Abstract

Genome-wide association studies (GWASs) of eye disorders have identified hundreds of genetic variants associated with ocular disease. However, the vast majority of these variants are noncoding, making it challenging to interpret their function. Here we present a joint single-cell atlas of gene expression and chromatin accessibility of the adult human retina with more than 50,000 cells, which we used to analyze single-nucleotide polymorphisms (SNPs) implicated by GWASs of age-related macular degeneration, glaucoma, diabetic retinopathy, myopia, and type 2 macular telangiectasia. We integrate this atlas with a HiChIP enhancer connectome, expression quantitative trait loci (eQTL) data, and base-resolution deep learning models to predict noncoding SNPs with causal roles in eye disease, assess SNP impact on transcription factor binding, and define their known and novel target genes. Our efforts nominate pathogenic SNP-target gene interactions for multiple vision disorders and provide a potentially powerful resource for interpreting noncoding variation in the eye.

View details for DOI 10.1016/j.xgen.2022.100164

View details for PubMedID 36277849
Automated sequence-based annotation and interpretation of the human genome. Nature genetics Kundaje, A., Meuleman, W. 2022

View details for DOI 10.1038/s41588-022-01123-x

View details for PubMedID 35817978
Author Correction: Single-nucleus chromatin accessibility profiling highlights regulatory mechanisms of coronary artery disease risk. Nature genetics Turner, A. W., Hu, S. S., Mosquera, J. V., Ma, W. F., Hodonsky, C. J., Wong, D., Auguste, G., Song, Y., Sol-Church, K., Farber, E., Kundu, S., Kundaje, A., Lopez, N. G., Ma, L., Ghosh, S. K., Onengut-Gumuscu, S., Ashley, E. A., Quertermous, T., Finn, A. V., Leeper, N. J., Kovacic, J. C., Björkegren, J. L., Zang, C., Miller, C. L. 2022

View details for DOI 10.1038/s41588-022-01142-8

View details for PubMedID 35768727
Single-cell analyses define a continuum of cell state and composition changes in the malignant transformation of polyps to colorectal cancer. Nature genetics Becker, W. R., Nevins, S. A., Chen, D. C., Chiu, R., Horning, A. M., Guha, T. K., Laquindanum, R., Mills, M., Chaib, H., Ladabaum, U., Longacre, T., Shen, J., Esplin, E. D., Kundaje, A., Ford, J. M., Curtis, C., Snyder, M. P., Greenleaf, W. J. 2022

Abstract

To chart cell composition and cell state changes that occur during the transformation of healthy colon to precancerous adenomas to colorectal cancer (CRC), we generated single-cell chromatin accessibility profiles and single-cell transcriptomes from 1,000 to 10,000 cells per sample for 48 polyps, 27 normal tissues and 6 CRCs collected from patients with or without germline APC mutations. A large fraction of polyp and CRC cells exhibit a stem-like phenotype, and we define a continuum of epigenetic and transcriptional changes occurring in these stem-like cells as they progress from homeostasis to CRC. Advanced polyps contain increasing numbers of stem-like cells, regulatory T cells and a subtype of pre-cancer-associated fibroblasts. In the cancerous state, we observe T cell exhaustion, RUNX1-regulated cancer-associated fibroblasts and increasing accessibility associated with HNF4A motifs in epithelia. DNA methylation changes in sporadic CRC are strongly anti-correlated with accessibility changes along this continuum, further identifying regulatory markers for molecular staging of polyps.

View details for DOI 10.1038/s41588-022-01088-x

View details for PubMedID 35726067
Accelerating in-silico saturation mutagenesis using compressed sensing. Bioinformatics (Oxford, England) Schreiber, J., Nair, S., Balsubramani, A., Kundaje, A. 2022

Abstract

MOTIVATION: In-silico saturation mutagenesis (ISM) is a popular approach in computational genomics for calculating feature attributions on biological sequences that proceeds by systematically perturbing each position in a sequence and recording the difference in model output. However, this method can be slow because systematically perturbing each position requires performing a number of forward passes proportional to the length of the sequence being examined. RESULTS: In this work, we propose a modification of ISM that leverages the principles of compressed sensing to require only a constant number of forward passes, regardless of sequence length, when applied to models that contain operations with a limited receptive field, such as convolutions. Our method, named Yuzu, can reduce the time that ISM spends in convolution operations by several orders of magnitude and, consequently, Yuzu can speed up ISM on several commonly used architectures in genomics by over an order of magnitude. Notably, we found that Yuzu provides speedups that increase with the complexity of the convolution operation and the length of the sequence being analyzed, suggesting that Yuzu provides large benefits in realistic settings. AVAILABILITY: We have made this tool available at https://github.com/kundajelab/yuzu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

View details for DOI 10.1093/bioinformatics/btac385

View details for PubMedID 35678521
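
The baseline that the Yuzu abstract above describes, naive ISM, can be sketched in a few lines. This is an illustrative sketch only, not code from the Yuzu repository: `toy_model` is a hypothetical stand-in for a trained network, and the helper names and the example sequence are assumptions made for the illustration. It shows why the cost grows with sequence length: every position needs three mutant forward passes.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot array."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    encoded[np.arange(len(seq)), idx] = 1.0
    return encoded

def toy_model(x):
    """Hypothetical stand-in for a trained model: best match to the motif GATA."""
    motif = one_hot("GATA")
    scores = [np.sum(x[i:i + 4] * motif) for i in range(len(x) - 3)]
    return float(max(scores))

def naive_ism(model, seq):
    """Mutate every position to every alternative base and record the output change."""
    x = one_hot(seq)
    reference = model(x)
    n_forward = 1  # the reference pass
    deltas = np.zeros((len(seq), 4), dtype=np.float32)
    for pos in range(len(seq)):
        for base in range(4):
            if x[pos, base] == 1.0:
                continue  # reference base: delta stays 0
            mutant = x.copy()
            mutant[pos] = 0.0
            mutant[pos, base] = 1.0
            deltas[pos, base] = model(mutant) - reference
            n_forward += 1
    return deltas, n_forward

deltas, n_forward = naive_ism(toy_model, "CCGATACC")
print(n_forward)  # 1 reference pass + 3 mutants per position
```

For a sequence of length L this loop performs 3L + 1 forward passes, which is the linear cost the abstract refers to.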
Single-nucleus chromatin accessibility profiling highlights regulatory mechanisms of coronary artery disease risk. Nature genetics Turner, A. W., Hu, S. S., Mosquera, J. V., Ma, W. F., Hodonsky, C. J., Wong, D., Auguste, G., Song, Y., Sol-Church, K., Farber, E., Kundu, S., Kundaje, A., Lopez, N. G., Ma, L., Ghosh, S. K., Onengut-Gumuscu, S., Ashley, E. A., Quertermous, T., Finn, A. V., Leeper, N. J., Kovacic, J. C., Björkegren, J. L., Zang, C., Miller, C. L. 2022

Abstract

Coronary artery disease (CAD) is a complex inflammatory disease involving genetic influences across cell types. Genome-wide association studies have identified over 200 loci associated with CAD, where the majority of risk variants reside in noncoding DNA sequences impacting cis-regulatory elements. Here, we applied single-nucleus assay for transposase-accessible chromatin with sequencing to profile 28,316 nuclei across coronary artery segments from 41 patients with varying stages of CAD, which revealed 14 distinct cellular clusters. We mapped ~320,000 accessible sites across all cells, identified cell-type-specific elements and transcription factors, and prioritized functional CAD risk variants. We identified elements in smooth muscle cell transition states (for example, fibromyocytes) and functional variants predicted to alter smooth muscle cell- and macrophage-specific regulation of MRAS (3q22) and LIPA (10q23), respectively. We further nominated key driver transcription factors such as PRDM16 and TBX2. Together, this single-nucleus atlas provides a critical step towards interpreting regulatory mechanisms across the continuum of CAD risk.

View details for DOI 10.1038/s41588-022-01069-0

View details for PubMedID 35590109
Genome-Wide Interaction Analysis of Genetic Variants with Menopausal Hormone Therapy for Colorectal Cancer Risk. Journal of the National Cancer Institute Tian, Y., Kim, A. E., Bien, S. A., Lin, Y., Qu, C., Harrison, T., Carreras-Torres, R., Diez-Obrero, V., Dimou, N., Drew, D. A., Hidaka, A., Huyghe, J. R., Jordahl, K. M., Morrison, J., Murphy, N., Obon-Santacana, M., Ulrich, C. M., Ose, J., Peoples, A. R., Ruiz-Narvaez, E. A., Shcherbina, A., Stern, M., Su, Y., van Duijnhoven, F. J., Arndt, V., Baurley, J., Berndt, S. I., Bishop, D. T., Brenner, H., Buchanan, D. D., Chan, A. T., Figueiredo, J. C., Gallinger, S., Gruber, S. B., Harlid, S., Hoffmeister, M., Jenkins, M. A., Joshi, A. D., Keku, T. O., Larsson, S. C., Le Marchand, L., Li, L., Giles, G. G., Milne, R. L., Nan, H., Nassir, R., Ogino, S., Budiarto, A., Platz, E. A., Potter, J. D., Prentice, R. L., Rennert, G., Sakoda, L. C., Schoen, R. E., Slattery, M. L., Thibodeau, S. N., Van Guelpen, B., Visvanathan, K., White, E., Wolk, A., Woods, M. O., Wu, A. H., Campbell, P. T., Casey, G., Conti, D. V., Gunter, M. J., Kundaje, A., Lewinger, J. P., Moreno, V., Newcomb, P. A., Pardamean, B., Thomas, D. C., Tsilidis, K. K., Peters, U., Gauderman, W. J., Hsu, L., Chang-Claude, J. 2022

Abstract

BACKGROUND: The use of menopausal hormone therapy (MHT) may interact with genetic variants to influence colorectal cancer (CRC) risk. METHODS: We conducted a genome-wide gene-environment interaction analysis between single nucleotide polymorphisms and the use of any MHT, estrogen-only, and combined estrogen-progestogen therapy with CRC risk, among 28,486 postmenopausal women (11,519 cases and 16,967 controls) from 38 studies, using logistic regression, two-step method, and 2- or 3-degree-of-freedom (d.f.) joint test. A set-based score test was applied for rare genetic variants. RESULTS: The use of any MHT, estrogen-only and estrogen-progestogen were associated with a reduced CRC risk [odds ratio (OR) with 95% confidence interval (95% CI) of 0.71 (0.64-0.78), 0.65 (0.53-0.79), and 0.73 (0.59-0.90), respectively]. The two-step method identified a statistically significant interaction between a GRIN2B variant rs117868593 and MHT use, whereby MHT-associated CRC risk was significantly reduced in women with the GG genotype [0.68 (0.64-0.72)] but not within strata of GC or CC genotypes. A statistically significant interaction between a DCBLD1 intronic variant at 6q22.1 (rs10782186) and MHT use was identified by the 2-d.f. joint test. The MHT-associated CRC risk was reduced with increasing number of rs10782186-C alleles, showing ORs of 0.78 (0.70-0.87) for TT, 0.68 (0.63-0.73) for TC, and 0.66 (0.60-0.74) for CC genotypes. In addition, five genes in rare variant analysis showed suggestive interactions with MHT (two-sided P < 1.2 × 10^-4). CONCLUSION: Genetic variants that modify the association between MHT and CRC risk were identified, offering new insights into pathways of CRC carcinogenesis and potential mechanisms involved.

View details for DOI 10.1093/jnci/djac094

View details for PubMedID 35512400
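
The odds ratios and 95% confidence intervals in the abstract above come from logistic regression with study-level adjustment. As a much simpler illustration of what an "OR (95% CI)" figure expresses, the sketch below computes an unadjusted odds ratio with a Wald interval from a 2x2 table. The function name and the counts are hypothetical and are not taken from the study.

```python
import math

def odds_ratio_wald_ci(exposed_cases, exposed_controls,
                       unexposed_cases, unexposed_controls, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        exposed_controls * unexposed_cases
    )
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se_log_or = math.sqrt(
        1 / exposed_cases + 1 / exposed_controls
        + 1 / unexposed_cases + 1 / unexposed_controls
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
odds_ratio, lower, upper = odds_ratio_wald_ci(40, 60, 50, 50)
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

An interval that excludes 1 corresponds to a statistically significant association at the chosen level; stratifying the same calculation by genotype is the intuition behind the per-genotype ORs reported above, although the study's estimates are model-based rather than computed this way.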
Author Correction: Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature ENCODE Project Consortium, Moore, J. E., Purcaro, M. J., Pratt, H. E., Epstein, C. B., Shoresh, N., Adrian, J., Kawli, T., Davis, C. A., Dobin, A., Kaul, R., Halow, J., Van Nostrand, E. L., Freese, P., Gorkin, D. U., Shen, Y., He, Y., Mackiewicz, M., Pauli-Behn, F., Williams, B. A., Mortazavi, A., Keller, C. A., Zhang, X., Elhajjajy, S. I., Huey, J., Dickel, D. E., Snetkova, V., Wei, X., Wang, X., Rivera-Mulia, J. C., Rozowsky, J., Zhang, J., Chhetri, S. B., Zhang, J., Victorsen, A., White, K. P., Visel, A., Yeo, G. W., Burge, C. B., Lecuyer, E., Gilbert, D. M., Dekker, J., Rinn, J., Mendenhall, E. M., Ecker, J. R., Kellis, M., Klein, R. J., Noble, W. S., Kundaje, A., Guigo, R., Farnham, P. J., Cherry, J. M., Myers, R. M., Ren, B., Graveley, B. R., Gerstein, M. B., Pennacchio, L. A., Snyder, M. P., Bernstein, B. E., Wold, B., Hardison, R. C., Gingeras, T. R., Stamatoyannopoulos, J. A., Weng, Z., Abascal, F., Acosta, R., Addleman, N. J., Adrian, J., Afzal, V., Aken, B., Akiyama, J. A., Jammal, O. A., Amrhein, H., Anderson, S. M., Andrews, G. R., Antoshechkin, I., Ardlie, K. G., Armstrong, J., Astley, M., Banerjee, B., Barkal, A. A., Barnes, I. H., Barozzi, I., Barrell, D., Barson, G., Bates, D., Baymuradov, U. K., Bazile, C., Beer, M. A., Beik, S., Bender, M. A., Bennett, R., Bouvrette, L. P., Bernstein, B. E., Berry, A., Bhaskar, A., Bignell, A., Blue, S. M., Bodine, D. M., Boix, C., Boley, N., Borrman, T., Borsari, B., Boyle, A. P., Brandsmeier, L. A., Breschi, A., Bresnick, E. H., Brooks, J. A., Buckley, M., Burge, C. B., Byron, R., Cahill, E., Cai, L., Cao, L., Carty, M., Castanon, R. G., Castillo, A., Chaib, H., Chan, E. T., Chee, D. R., Chee, S., Chen, H., Chen, H., Chen, J., Chen, S., Cherry, J. M., Chhetri, S. B., Choudhary, J. S., Chrast, J., Chung, D., Clarke, D., Cody, N. A., Coppola, C. J., Coursen, J., D'Ippolito, A. 
M., Dalton, S., Danyko, C., Davidson, C., Davila-Velderrain, J., Davis, C. A., Dekker, J., Deran, A., DeSalvo, G., Despacio-Reyes, G., Dewey, C. N., Dickel, D. E., Diegel, M., Diekhans, M., Dileep, V., Ding, B., Djebali, S., Dobin, A., Dominguez, D., Donaldson, S., Drenkow, J., Dreszer, T. R., Drier, Y., Duff, M. O., Dunn, D., Eastman, C., Ecker, J. R., Edwards, M. D., El-Ali, N., Elhajjajy, S. I., Elkins, K., Emili, A., Epstein, C. B., Evans, R. C., Ezkurdia, I., Fan, K., Farnham, P. J., Farrell, N. P., Feingold, E. A., Ferreira, A., Fisher-Aylor, K., Fitzgerald, S., Flicek, P., Foo, C. S., Fortier, K., Frankish, A., Freese, P., Fu, S., Fu, X., Fu, Y., Fukuda-Yuzawa, Y., Fulciniti, M., Funnell, A. P., Gabdank, I., Galeev, T., Gao, M., Giron, C. G., Garvin, T. H., Gelboin-Burkhart, C. A., Georgolopoulos, G., Gerstein, M. B., Giardine, B. M., Gifford, D. K., Gilbert, D. M., Gilchrist, D. A., Gillespie, S., Gingeras, T. R., Gong, P., Gonzalez, A., Gonzalez, J. M., Good, P., Goren, A., Gorkin, D. U., Graveley, B. R., Gray, M., Greenblatt, J. F., Griffiths, E., Groudine, M. T., Grubert, F., Gu, M., Guigo, R., Guo, H., Guo, Y., Guo, Y., Gursoy, G., Gutierrez-Arcelus, M., Halow, J., Hardison, R. C., Hardy, M., Hariharan, M., Harmanci, A., Harrington, A., Harrow, J. L., Hashimoto, T. B., Hasz, R. D., Hatan, M., Haugen, E., Hayes, J. E., He, P., He, Y., Heidari, N., Hendrickson, D., Heuston, E. F., Hilton, J. A., Hitz, B. C., Hochman, A., Holgren, C., Hou, L., Hou, S., Hsiao, Y. E., Hsu, S., Huang, H., Hubbard, T. J., Huey, J., Hughes, T. R., Hunt, T., Ibarrientos, S., Issner, R., Iwata, M., Izuogu, O., Jaakkola, T., Jameel, N., Jansen, C., Jiang, L., Jiang, P., Johnson, A., Johnson, R., Jungreis, I., Kadaba, M., Kasowski, M., Kasparian, M., Kato, M., Kaul, R., Kawli, T., Kay, M., Keen, J. C., Keles, S., Keller, C. A., Kelley, D., Kellis, M., Kheradpour, P., Kim, D. S., Kirilusha, A., Klein, R. J., Knoechel, B., Kuan, S., Kulik, M. 
J., Kumar, S., Kundaje, A., Kutyavin, T., Lagarde, J., Lajoie, B. R., Lambert, N. J., Lazar, J., Lee, A. Y., Lee, D., Lee, E., Lee, J. W., Lee, K., Leslie, C. S., Levy, S., Li, B., Li, H., Li, N., Li, X., Li, Y. I., Li, Y., Li, Y., Li, Y., Lian, J., Libbrecht, M. W., Lin, S., Lin, Y., Liu, D., Liu, J., Liu, P., Liu, T., Liu, X. S., Liu, Y., Liu, Y., Long, M., Lou, S., Loveland, J., Lu, A., Lu, Y., Lecuyer, E., Ma, L., Mackiewicz, M., Mannion, B. J., Mannstadt, M., Manthravadi, D., Marinov, G. K., Martin, F. J., Mattei, E., McCue, K., McEown, M., McVicker, G., Meadows, S. K., Meissner, A., Mendenhall, E. M., Messer, C. L., Meuleman, W., Meyer, C., Miller, S., Milton, M. G., Mishra, T., Moore, D. E., Moore, H. M., Moore, J. E., Moore, S. H., Moran, J., Mortazavi, A., Mudge, J. M., Munshi, N., Murad, R., Myers, R. M., Nandakumar, V., Nandi, P., Narasimha, A. M., Narayanan, A. K., Naughton, H., Navarro, F. C., Navas, P., Nazarovs, J., Nelson, J., Neph, S., Neri, F. J., Nery, J. R., Nesmith, A. R., Newberry, J. S., Newberry, K. M., Ngo, V., Nguyen, R., Nguyen, T. B., Nguyen, T., Nishida, A., Noble, W. S., Novak, C. S., Novoa, E. M., Nunez, B., O'Donnell, C. W., Olson, S., Onate, K. C., Otterman, E., Ozadam, H., Pagan, M., Palden, T., Pan, X., Park, Y., Partridge, E. C., Paten, B., Pauli-Behn, F., Pazin, M. J., Pei, B., Pennacchio, L. A., Perez, A. R., Perry, E. H., Pervouchine, D. D., Phalke, N. N., Pham, Q., Phanstiel, D. H., Plajzer-Frick, I., Pratt, G. A., Pratt, H. E., Preissl, S., Pritchard, J. K., Pritykin, Y., Purcaro, M. J., Qin, Q., Quinones-Valdez, G., Rabano, I., Radovani, E., Raj, A., Rajagopal, N., Ram, O., Ramirez, L., Ramirez, R. N., Rausch, D., Raychaudhuri, S., Raymond, J., Razavi, R., Reddy, T. E., Reimonn, T. M., Ren, B., Reymond, A., Reynolds, A., Rhie, S. K., Rinn, J., Rivera, M., Rivera-Mulia, J. C., Roberts, B. S., Rodriguez, J. M., Rozowsky, J., Ryan, R., Rynes, E., Salins, D. 
N., Sandstrom, R., Sasaki, T., Sathe, S., Savic, D., Scavelli, A., Scheiman, J., Schlaffner, C., Schloss, J. A., Schmitges, F. W., See, L. H., Sethi, A., Setty, M., Shafer, A., Shan, S., Sharon, E., Shen, Q., Shen, Y., Sherwood, R. I., Shi, M., Shin, S., Shoresh, N., Siebenthall, K., Sisu, C., Slifer, T., Sloan, C. A., Smith, A., Snetkova, V., Snyder, M. P., Spacek, D. V., Srinivasan, S., Srivas, R., Stamatoyannopoulos, G., Stamatoyannopoulos, J. A., Stanton, R., Steffan, D., Stehling-Sun, S., Strattan, J. S., Su, A., Sundararaman, B., Suner, M., Syed, T., Szynkarek, M., Tanaka, F. Y., Tenen, D., Teng, M., Thomas, J. A., Toffey, D., Tress, M. L., Trout, D. E., Trynka, G., Tsuji, J., Upchurch, S. A., Ursu, O., Uszczynska-Ratajczak, B., Uziel, M. C., Valencia, A., Biber, B. V., van der Velde, A. G., Van Nostrand, E. L., Vaydylevich, Y., Vazquez, J., Victorsen, A., Vielmetter, J., Vierstra, J., Visel, A., Vlasova, A., Vockley, C. M., Volpi, S., Vong, S., Wang, H., Wang, M., Wang, Q., Wang, R., Wang, T., Wang, W., Wang, X., Wang, Y., Watson, N. K., Wei, X., Wei, Z., Weisser, H., Weissman, S. M., Welch, R., Welikson, R. E., Weng, Z., Westra, H., Whitaker, J. W., White, C., White, K. P., Wildberg, A., Williams, B. A., Wine, D., Witt, H. N., Wold, B., Wolf, M., Wright, J., Xiao, R., Xiao, X., Xu, J., Xu, J., Yan, K., Yan, Y., Yang, H., Yang, X., Yang, Y., Yardimci, G. G., Yee, B. A., Yeo, G. W., Young, T., Yu, T., Yue, F., Zaleski, C., Zang, C., Zeng, H., Zeng, W., Zerbino, D. R., Zhai, J., Zhan, L., Zhan, Y., Zhang, B., Zhang, J., Zhang, J., Zhang, K., Zhang, L., Zhang, P., Zhang, Q., Zhang, X., Zhang, Y., Zhang, Z., Zhao, Y., Zheng, Y., Zhong, G., Zhou, X., Zhu, Y., Zimmerman, J., Ai, R., Li, S. 2022

View details for DOI 10.1038/s41586-021-04226-3

View details for PubMedID 35474001
Author Correction: Perspectives on ENCODE. Nature ENCODE Project Consortium, Snyder, M. P., Gingeras, T. R., Moore, J. E., Weng, Z., Gerstein, M. B., Ren, B., Hardison, R. C., Stamatoyannopoulos, J. A., Graveley, B. R., Feingold, E. A., Pazin, M. J., Pagan, M., Gilchrist, D. A., Hitz, B. C., Cherry, J. M., Bernstein, B. E., Mendenhall, E. M., Zerbino, D. R., Frankish, A., Flicek, P., Myers, R. M., Abascal, F. B., Acosta, R., Addleman, N. J., Adrian, J., Afzal, V., Aken, B., Ai, R., Akiyama, J. A., Jammal, O. A., Amrhein, H., Anderson, S. M., Andrews, G. R., Antoshechkin, I., Ardlie, K. G., Armstrong, J., Astley, M., Banerjee, B., Barkal, A. A., Barnes, I. H., Barozzi, I., Barrell, D., Barson, G., Bates, D., Baymuradov, U. K., Bazile, C., Beer, M. A., Beik, S., Bender, M. A., Bennett, R., Bouvrette, L. P., Bernstein, B. E., Berry, A., Bhaskar, A., Bignell, A., Blue, S. M., Bodine, D. M., Boix, C., Boley, N., Borrman, T., Borsari, B., Boyle, A. P., Brandsmeier, L. A., Breschi, A., Bresnick, E. H., Brooks, J. A., Buckley, M., Burge, C. B., Byron, R., Cahill, E., Cai, L., Cao, L., Carty, M., Castanon, R. G., Castillo, A., Chaib, H., Chan, E. T., Chee, D. R., Chee, S., Chen, H., Chen, H., Chen, J., Chen, S., Cherry, J. M., Chhetri, S. B., Choudhary, J. S., Chrast, J., Chung, D., Clarke, D., Cody, N. A., Coppola, C. J., Coursen, J., D'Ippolito, A. M., Dalton, S., Danyko, C., Davidson, C., Davila-Velderrain, J., Davis, C. A., Dekker, J., Deran, A., DeSalvo, G., Despacio-Reyes, G., Dewey, C. N., Dickel, D. E., Diegel, M., Diekhans, M., Dileep, V., Ding, B., Djebali, S., Dobin, A., Dominguez, D., Donaldson, S., Drenkow, J., Dreszer, T. R., Drier, Y., Duff, M. O., Dunn, D., Eastman, C., Ecker, J. R., Edwards, M. D., El-Ali, N., Elhajjajy, S. I., Elkins, K., Emili, A., Epstein, C. B., Evans, R. C., Ezkurdia, I., Fan, K., Farnham, P. J., Farrell, N., Feingold, E. A., Ferreira, A., Fisher-Aylor, K., Fitzgerald, S., Flicek, P., Foo, C. 
S., Fortier, K., Frankish, A., Freese, P., Fu, S., Fu, X., Fu, Y., Fukuda-Yuzawa, Y., Fulciniti, M., Funnell, A. P., Gabdank, I., Galeev, T., Gao, M., Giron, C. G., Garvin, T. H., Gelboin-Burkhart, C. A., Georgolopoulos, G., Gerstein, M. B., Giardine, B. M., Gifford, D. K., Gilbert, D. M., Gilchrist, D. A., Gillespie, S., Gingeras, T. R., Gong, P., Gonzalez, A., Gonzalez, J. M., Good, P., Goren, A., Gorkin, D. U., Graveley, B. R., Gray, M., Greenblatt, J. F., Griffiths, E., Groudine, M. T., Grubert, F., Gu, M., Guigo, R., Guo, H., Guo, Y., Guo, Y., Gursoy, G., Gutierrez-Arcelus, M., Halow, J., Hardison, R. C., Hardy, M., Hariharan, M., Harmanci, A., Harrington, A., Harrow, J. L., Hashimoto, T. B., Hasz, R. D., Hatan, M., Haugen, E., Hayes, J. E., He, P., He, Y., Heidari, N., Hendrickson, D., Heuston, E. F., Hilton, J. A., Hitz, B. C., Hochman, A., Holgren, C., Hou, L., Hou, S., Hsiao, Y. E., Hsu, S., Huang, H., Hubbard, T. J., Huey, J., Hughes, T. R., Hunt, T., Ibarrientos, S., Issner, R., Iwata, M., Izuogu, O., Jaakkola, T., Jameel, N., Jansen, C., Jiang, L., Jiang, P., Johnson, A., Johnson, R., Jungreis, I., Kadaba, M., Kasowski, M., Kasparian, M., Kato, M., Kaul, R., Kawli, T., Kay, M., Keen, J. C., Keles, S., Keller, C. A., Kelley, D., Kellis, M., Kheradpour, P., Kim, D. S., Kirilusha, A., Klein, R. J., Knoechel, B., Kuan, S., Kulik, M. J., Kumar, S., Kundaje, A., Kutyavin, T., Lagarde, J., Lajoie, B. R., Lambert, N. J., Lazar, J., Lee, A. Y., Lee, D., Lee, E., Lee, J. W., Lee, K., Leslie, C. S., Levy, S., Li, B., Li, H., Li, N., Li, S., Li, X., Li, Y. I., Li, Y., Li, Y., Li, Y., Lian, J., Libbrecht, M. W., Lin, S., Lin, Y., Liu, D., Liu, J., Liu, P., Liu, T., Liu, X. S., Liu, Y., Liu, Y., Long, M., Lou, S., Loveland, J., Lu, A., Lu, Y., Lecuyer, E., Ma, L., Mackiewicz, M., Mannion, B. J., Mannstadt, M., Manthravadi, D., Marinov, G. K., Martin, F. J., Mattei, E., McCue, K., McEown, M., McVicker, G., Meadows, S. K., Meissner, A., Mendenhall, E. M., Messer, C. 
L., Meuleman, W., Meyer, C., Miller, S., Milton, M. G., Mishra, T., Moore, D. E., Moore, H. M., Moore, J. E., Moore, S. H., Moran, J., Mortazavi, A., Mudge, J. M., Munshi, N., Murad, R., Myers, R. M., Nandakumar, V., Nandi, P., Narasimha, A. M., Narayanan, A. K., Naughton, H., Navarro, F. C., Navas, P., Nazarovs, J., Nelson, J., Neph, S., Neri, F. J., Nery, J. R., Nesmith, A. R., Newberry, J. S., Newberry, K. M., Ngo, V., Nguyen, R., Nguyen, T. B., Nguyen, T., Nishida, A., Noble, W. S., Novak, C. S., Novoa, E. M., Nunez, B., O'Donnell, C. W., Olson, S., Onate, K. C., Otterman, E., Ozadam, H., Pagan, M., Palden, T., Pan, X., Park, Y., Partridge, E. C., Paten, B., Pauli-Behn, F., Pazin, M. J., Pei, B., Pennacchio, L. A., Perez, A. R., Perry, E. H., Pervouchine, D. D., Phalke, N. N., Pham, Q., Phanstiel, D. H., Plajzer-Frick, I., Pratt, G. A., Pratt, H. E., Preissl, S., Pritchard, J. K., Pritykin, Y., Purcaro, M. J., Qin, Q., Quinones-Valdez, G., Rabano, I., Radovani, E., Raj, A., Rajagopal, N., Ram, O., Ramirez, L., Ramirez, R. N., Rausch, D., Raychaudhuri, S., Raymond, J., Razavi, R., Reddy, T. E., Reimonn, T. M., Ren, B., Reymond, A., Reynolds, A., Rhie, S. K., Rinn, J., Rivera, M., Rivera-Mulia, J. C., Roberts, B., Rodriguez, J. M., Rozowsky, J., Ryan, R., Rynes, E., Salins, D. N., Sandstrom, R., Sasaki, T., Sathe, S., Savic, D., Scavelli, A., Scheiman, J., Schlaffner, C., Schloss, J. A., Schmitges, F. W., See, L. H., Sethi, A., Setty, M., Shafer, A., Shan, S., Sharon, E., Shen, Q., Shen, Y., Sherwood, R. I., Shi, M., Shin, S., Shoresh, N., Siebenthall, K., Sisu, C., Slifer, T., Sloan, C. A., Smith, A., Snetkova, V., Snyder, M. P., Spacek, D. V., Srinivasan, S., Srivas, R., Stamatoyannopoulos, G., Stamatoyannopoulos, J. A., Stanton, R., Steffan, D., Stehling-Sun, S., Strattan, J. S., Su, A., Sundararaman, B., Suner, M., Syed, T., Szynkarek, M., Tanaka, F. Y., Tenen, D., Teng, M., Thomas, J. A., Toffey, D., Tress, M. L., Trout, D. 
E., Trynka, G., Tsuji, J., Upchurch, S. A., Ursu, O., Uszczynska-Ratajczak, B., Uziel, M. C., Valencia, A., Biber, B. V., van der Velde, A. G., Van Nostrand, E. L., Vaydylevich, Y., Vazquez, J., Victorsen, A., Vielmetter, J., Vierstra, J., Visel, A., Vlasova, A., Vockley, C. M., Volpi, S., Vong, S., Wang, H., Wang, M., Wang, Q., Wang, R., Wang, T., Wang, W., Wang, X., Wang, Y., Watson, N. K., Wei, X., Wei, Z., Weisser, H., Weissman, S. M., Welch, R., Welikson, R. E., Weng, Z., Westra, H., Whitaker, J. W., White, C., White, K. P., Wildberg, A., Williams, B. A., Wine, D., Witt, H. N., Wold, B., Wolf, M., Wright, J., Xiao, R., Xiao, X., Xu, J., Xu, J., Yan, K., Yan, Y., Yang, H., Yang, X., Yang, Y., Yardimci, G. G., Yee, B. A., Yeo, G. W., Young, T., Yu, T., Yue, F., Zaleski, C., Zang, C., Zeng, H., Zeng, W., Zerbino, D. R., Zhai, J., Zhan, L., Zhan, Y., Zhang, B., Zhang, J., Zhang, J., Zhang, K., Zhang, L., Zhang, P., Zhang, Q., Zhang, X., Zhang, Y., Zhang, Z., Zhao, Y., Zheng, Y., Zhong, G., Zhou, X., Zhu, Y., Zimmerman, J. 2022

View details for DOI 10.1038/s41586-021-04213-8

View details for PubMedID 35474002
Beyond GWAS of Colorectal Cancer: Evidence of Interaction with Alcohol Consumption and Putative Causal Variant for the 10q24.2 Region. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology Jordahl, K. M., Shcherbina, A., Kim, A. E., Su, Y. R., Lin, Y., Wang, J., Qu, C., Albanes, D., Arndt, V., Baurley, J. W., Berndt, S. I., Bien, S. A., Bishop, D. T., Bouras, E., Brenner, H., Buchanan, D. D., Budiarto, A., Campbell, P. T., Carreras-Torres, R., Casey, G., Cenggoro, T. W., Chan, A. T., Conti, D. V., Dampier, C. H., Devall, M. A., Díez-Obrero, V., Dimou, N., Drew, D. A., Figueiredo, J. C., Gallinger, S., Giles, G. G., Gruber, S. B., Gsur, A., Gunter, M. J., Hampel, H., Harlid, S., Harrison, T. A., Hidaka, A., Hoffmeister, M., Huyghe, J. R., Jenkins, M. A., Joshi, A. D., Keku, T. O., Larsson, S. C., Le Marchand, L., Lewinger, J. P., Li, L., Mahesworo, B., Moreno, V., Morrison, J. L., Murphy, N., Nan, H., Nassir, R., Newcomb, P. A., Obón-Santacana, M., Ogino, S., Ose, J., Pai, R. K., Palmer, J. R., Papadimitriou, N., Pardamean, B., Peoples, A. R., Pharoah, P. D., Platz, E. A., Potter, J. D., Prentice, R. L., Rennert, G., Ruiz-Narvaez, E., Sakoda, L. C., Scacheri, P. C., Schmit, S. L., Schoen, R. E., Slattery, M. L., Stern, M. C., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Tian, Y., Tsilidis, K. K., Ulrich, C. M., van Duijnhoven, F. J., Van Guelpen, B., Visvanathan, K., Vodicka, P., White, E., Wolk, A., Woods, M. O., Wu, A. H., Zemlianskaia, N., Chang-Claude, J., Gauderman, W. J., Hsu, L., Kundaje, A., Peters, U. 2022: OF1-OF13

Abstract

Currently known associations between common genetic variants and colorectal cancer explain less than half of its heritability of 25%. As alcohol consumption has a J-shaped association with colorectal cancer risk, nondrinking and heavy drinking are both risk factors for colorectal cancer. Individual-level data was pooled from the Colon Cancer Family Registry, Colorectal Transdisciplinary Study, and Genetics and Epidemiology of Colorectal Cancer Consortium to compare nondrinkers (≤1 g/day) and heavy drinkers (>28 g/day) with light-to-moderate drinkers (1-28 g/day) in GxE analyses. To improve power, we implemented joint 2df and 3df tests and a novel two-step method that modifies the weighted hypothesis testing framework. We prioritized putative causal variants by predicting allelic effects using support vector machine models. For nondrinking as compared with light-to-moderate drinking, the hybrid two-step approach identified 13 significant SNPs with pairwise r² > 0.9 in the 10q24.2/COX15 region. When stratified by alcohol intake, the A allele of lead SNP rs2300985 has a dose-response increase in risk of colorectal cancer as compared with the G allele in light-to-moderate drinkers [OR for GA genotype = 1.11; 95% confidence interval (CI), 1.06-1.17; OR for AA genotype = 1.22; 95% CI, 1.14-1.31], but not in nondrinkers or heavy drinkers. Among the correlated candidate SNPs in the 10q24.2/COX15 region, rs1318920 was predicted to disrupt an HNF4 transcription factor binding motif. Our study suggests that the association with colorectal cancer in 10q24.2/COX15 observed in genome-wide association study is strongest in nondrinkers. We also identified rs1318920 as the putative causal regulatory variant for the region. The study identifies multifaceted evidence of a possible functional effect for rs1318920.

View details for DOI 10.1158/1055-9965.EPI-21-1003

View details for PubMedID 35438744
fastISM: Performant in-silico saturation mutagenesis for convolutional neural networks. Bioinformatics (Oxford, England) Nair, S., Shrikumar, A., Schreiber, J., Kundaje, A. 2022

Abstract

MOTIVATION: Deep learning models such as convolutional neural networks are able to accurately map biological sequences to associated functional readouts and properties by learning predictive de novo representations. In-silico saturation mutagenesis (ISM) is a popular feature attribution technique for inferring contributions of all characters in an input sequence to the model's predicted output. The main drawback of ISM is its runtime, as it involves multiple forward propagations of all possible mutations of each character in the input sequence through the trained model to predict the effects on the output. RESULTS: We present fastISM, an algorithm that speeds up ISM by a factor of over 10x for commonly used convolutional neural network architectures. fastISM is based on the observations that the majority of computation in ISM is spent in convolutional layers, and a single mutation only disrupts a limited region of intermediate layers, rendering most computation redundant. fastISM reduces the gap between backpropagation-based feature attribution methods and ISM. It far surpasses the runtime of backpropagation-based methods on multi-output architectures, making it feasible to run ISM on a large number of sequences. AVAILABILITY: An easy-to-use Keras/TensorFlow 2 implementation of fastISM is available at https://github.com/kundajelab/fastISM. fastISM can be installed using pip install fastism. A hands-on tutorial can be found at https://colab.research.google.com/github/kundajelab/fastISM/blob/master/notebooks/colab/DeepSEA.ipynb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
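As a rough illustration of the baseline that the abstract describes, the sketch below runs naive ISM: every position is mutated to each alternative base and the change in model output is recorded. It does not use the fastISM package or reproduce its optimizations. The `toy_model` function (a fixed 3-bp motif scorer) and the example sequence are hypothetical stand-ins for a trained network and real input.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot array."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]

def toy_model(x):
    """Hypothetical stand-in for a trained network: best match to the motif TGA."""
    motif = one_hot("TGA")
    scores = [np.sum(x[i:i + 3] * motif) for i in range(x.shape[0] - 2)]
    return float(max(scores))

def naive_ism(model, x):
    """Return an (L, 4) array of output changes for every single-base substitution."""
    ref = model(x)
    effects = np.zeros_like(x, dtype=float)
    for pos in range(x.shape[0]):
        for base in range(4):
            if x[pos, base] == 1:
                continue  # reference base: effect is left at 0
            mutant = x.copy()
            mutant[pos] = 0
            mutant[pos, base] = 1
            # One full forward pass per mutation: this is the cost ISM pays.
            effects[pos, base] = model(mutant) - ref
    return effects

x = one_hot("ACTGACC")
effects = naive_ism(toy_model, x)
```

Naive ISM needs 3L forward passes for a sequence of length L, which is the runtime the abstract identifies as the main drawback.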

View details for DOI 10.1093/bioinformatics/btac135

View details for PubMedID 35238376
Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network (vol 12, 3297, 2021) NATURE COMMUNICATIONS Grapotte, M., Saraswat, M., Bessiere, C., Menichelli, C., Ramilowski, J. A., Severin, J., Hayashizaki, Y., Itoh, M., Tagami, M., Murata, M., Kojima-Ishiyama, M., Noma, S., Noguchi, S., Kasukawa, T., Hasegawa, A., Suzuki, H., Nishiyori-Sueki, H., Frith, M. C., Chatelain, C., Carninci, P., de Hoon, M. J. L., Wasserman, W. W., Brehelin, L., Lecellier, C., FANTOM consortium 2022; 13 (1): 1200

View details for DOI 10.1038/s41467-022-28758-y

View details for Web of Science ID 000771136200018

View details for PubMedID 35232988

View details for PubMedCentralID PMC8888638
The chromatin organization of a chlorarachniophyte nucleomorph genome. Genome biology Marinov, G. K., Chen, X., Wu, T., He, C., Grossman, A. R., Kundaje, A., Greenleaf, W. J. 2022; 23 (1): 65

Abstract

BACKGROUND: Nucleomorphs are remnants of secondary endosymbiotic events between two eukaryote cells wherein the endosymbiont has retained its eukaryotic nucleus. Nucleomorphs have evolved at least twice independently, in chlorarachniophytes and cryptophytes, yet they have converged on a remarkably similar genomic architecture, characterized by the most extreme compression and miniaturization among all known eukaryotic genomes. Previous computational studies have suggested that nucleomorph chromatin likely exhibits a number of divergent features. RESULTS: In this work, we provide the first maps of open chromatin, active transcription, and three-dimensional organization for the nucleomorph genome of the chlorarachniophyte Bigelowiella natans. We find that the B. natans nucleomorph genome exists in a highly accessible state, akin to that of ribosomal DNA in some other eukaryotes, and that it is highly transcribed over its entire length, with few signs of polymerase pausing at transcription start sites (TSSs). At the same time, most nucleomorph TSSs show very strong nucleosome positioning. Chromosome conformation (Hi-C) maps reveal that nucleomorph chromosomes interact with one another at their telomeric regions and show the relative contact frequencies between the multiple genomic compartments of distinct origin that B. natans cells contain. CONCLUSIONS: We provide the first study of a nucleomorph genome using modern functional genomic tools, and derive numerous novel insights into the physical and functional organization of these unique genomes.

View details for DOI 10.1186/s13059-022-02639-5

View details for PubMedID 35232465
MITI minimum information guidelines for highly multiplexed tissue images. Nature methods Schapiro, D., Yapp, C., Sokolov, A., Reynolds, S. M., Chen, Y., Sudar, D., Xie, Y., Muhlich, J., Arias-Camison, R., Arena, S., Taylor, A. J., Nikolov, M., Tyler, M., Lin, J., Burlingame, E. A., Human Tumor Atlas Network, Chang, Y. H., Farhi, S. L., Thorsson, V., Venkatamohan, N., Drewes, J. L., Pe'er, D., Gutman, D. A., Herrmann, M. D., Gehlenborg, N., Bankhead, P., Roland, J. T., Herndon, J. M., Snyder, M. P., Angelo, M., Nolan, G., Swedlow, J. R., Schultz, N., Merrick, D. T., Mazzili, S. A., Cerami, E., Rodig, S. J., Santagata, S., Sorger, P. K., Abravanel, D. L., Achilefu, S., Ademuyiwa, F. O., Adey, A. C., Aft, R., Ahn, K. J., Alikarami, F., Alon, S., Ashenberg, O., Baker, E., Baker, G. J., Bandyopadhyay, S., Bayguinov, P., Beane, J., Becker, W., Bernt, K., Betts, C. B., Bletz, J., Blosser, T., Boire, A., Boland, G. M., Boyden, E. S., Bucher, E., Bueno, R., Cai, Q., Cambuli, F., Campbell, J., Cao, S., Caravan, W., Chaligne, R., Chan, J. M., Chasnoff, S., Chatterjee, D., Chen, A. A., Chen, C., Chen, C., Chen, B., Chen, F., Chen, S., Chheda, M. G., Chin, K., Cho, H., Chun, J., Cisneros, L., Coffey, R. J., Cohen, O., Colditz, G. A., Cole, K. A., Collins, N., Cotter, D., Coussens, L. M., Coy, S., Creason, A. L., Cui, Y., Zhou, D. C., Curtis, C., Davies, S. R., Bruijn, I., Delorey, T. M., Demir, E., Denardo, D., Diep, D., Ding, L., DiPersio, J., Dubinett, S. M., Eberlein, T. J., Eddy, J. A., Esplin, E. D., Factor, R. E., Fatahalian, K., Feiler, H. S., Fernandez, J., Fields, A., Fields, R. C., Fitzpatrick, J. A., Ford, J. M., Franklin, J., Fulton, B., Gaglia, G., Galdieri, L., Ganesh, K., Gao, J., Gaudio, B. L., Getz, G., Gibbs, D. L., Gillanders, W. E., Goecks, J., Goodwin, D., Gray, J. W., Greenleaf, W., Grimm, L. J., Gu, Q., Guerriero, J. L., Guha, T., Guimaraes, A. R., Gutierrez, B., Hacohen, N., Hanson, C. R., Harris, C. R., Hawkins, W. G., Heiser, C. 
N., Hoffer, J., Hollmann, T. J., Hsieh, J. J., Huang, J., Hunger, S. P., Hwang, E., Iacobuzio-Donahue, C., Iglesia, M. D., Islam, M., Izar, B., Jacobson, C. A., Janes, S., Jayasinghe, R. G., Jeudi, T., Johnson, B. E., Johnson, B. E., Ju, T., Kadara, H., Karnoub, E., Karpova, A., Khan, A., Kibbe, W., Kim, A. H., King, L. M., Kozlowski, E., Krishnamoorthy, P., Krueger, R., Kundaje, A., Ladabaum, U., Laquindanum, R., Lau, C., Lau, K. S., LeBoeuf, N. R., Lee, H., Lenburg, M., Leshchiner, I., Levy, R., Li, Y., Lian, C. G., Liang, W., Lim, K., Lin, Y., Liu, D., Liu, Q., Liu, R., Lo, J., Lo, P., Longabaugh, W. J., Longacre, T., Luckett, K., Ma, C., Maher, C., Maier, A., Makowski, D., Maley, C., Maliga, Z., Manoj, P., Maris, J. M., Markham, N., Marks, J. R., Martinez, D., Mashl, J., Masilionis, I., Massague, J., Mazurowski, M. A., McKinley, E. T., McMichael, J., Meyerson, M., Mills, G. B., Mitri, Z. I., Moorman, A., Mudd, J., Murphy, G. F., Deen, N. N., Navin, N. E., Nawy, T., Ness, R. M., Nevins, S., Nirmal, A. J., Novikov, E., Oh, S. T., Oldridge, D. A., Owzar, K., Pant, S. M., Park, W., Patti, G. J., Paul, K., Pelletier, R., Persson, D., Petty, C., Pfister, H., Polyak, K., Puram, S. V., Qiu, Q., Villalonga, A. Q., Ramirez, M. A., Rashid, R., Reeb, A. N., Reid, M. E., Remsik, J., Riesterer, J. L., Risom, T., Ritch, C. C., Rolong, A., Rudin, C. M., Ryser, M. D., Sato, K., Sears, C. L., Semenov, Y. R., Shen, J., Shoghi, K. I., Shrubsole, M. J., Shyr, Y., Sibley, A. B., Simmons, A. J., Sinha, A., Sivagnanam, S., Song, S., Southar-Smith, A., Spira, A. E., Cyr, J. S., Stefankiewicz, S., Storrs, E. P., Stover, E. H., Strand, S. H., Straub, C., Street, C., Su, T., Surrey, L. F., Suver, C., Tan, K., Terekhanova, N. V., Ternes, L., Thadi, A., Thomas, G., Tibshirani, R., Umeda, S., Uzun, Y., Vallius, T., Van Allen, E. R., Vandekar, S., Vega, P. N., Veis, D. J., Vennam, S., Verma, A., Vigneau, S., Wagle, N., Wahl, R., Walle, T., Wang, L., Warchol, S., Washington, M. 
K., Watson, C., Weimer, A. K., Wendl, M. C., West, R. B., White, S., Windon, A. L., Wu, H., Wu, C., Wu, Y., Wyczalkowski, M. A., Xu, J., Yao, L., Yu, W., Zhang, K., Zhu, X. 2022; 19 (3): 262-267

View details for DOI 10.1038/s41592-022-01415-4

View details for PubMedID 35277708
Short tandem repeats recruit transcription factors to tune eukaryotic gene expression Horton, C. A., Alexandari, A. M., Hayes, M. G., Schaepe, J. M., Marklund, E., Shah, N., Aditham, A. K., Shrikumar, A., Afek, A., Greenleaf, W. J., Gordan, R., Zeitlinger, J., Kundaje, A., Fordyce, P. M. CELL PRESS. 2022: 287A-288A

View details for Web of Science ID 000759523001660
Domain adaptive neural networks improve cross-species prediction of transcription factor binding. Genome research Cochran, K., Srivastava, D., Shrikumar, A., Balsubramani, A., Hardison, R. C., Kundaje, A., Mahony, S. 2022

Abstract

The intrinsic DNA sequence preferences and cell-type specific cooperative partners of transcription factors (TFs) are typically highly conserved. Hence, despite the rapid evolutionary turnover of individual TF binding sites, predictive sequence models of cell-type specific genomic occupancy of a TF in one species should generalize to closely matched cell types in a related species. To assess the viability of cross-species TF binding prediction, we train neural networks to discriminate ChIP-seq peak locations from genomic background and evaluate their performance within and across species. Cross-species predictive performance is consistently worse than within-species performance, which we show is caused in part by species-specific repeats. To account for this domain shift, we use an augmented network architecture to automatically discourage learning of training species-specific sequence features. This domain adaptation approach corrects for prediction errors on species-specific repeats and improves overall cross-species model performance. Our results demonstrate that cross-species TF binding prediction is feasible when models account for domain shifts driven by species-specific repeats.

View details for DOI 10.1101/gr.275394.121

View details for PubMedID 35042722
ZEB2 Shapes the Epigenetic Landscape of Atherosclerosis. Circulation Cheng, P., Wirka, R. C., Clarke, L., Zhao, Q., Kundu, R., Nguyen, T., Nair, S., Sharma, D., Kim, H., Shi, H., Assimes, T., Kim, J., Kundaje, A., Quertermous, T. 2022; 145 (6): 469–485

Abstract

Background: Smooth muscle cells (SMC) transition into a number of different phenotypes during atherosclerosis, including those that resemble fibroblasts and chondrocytes, and make up the majority of cells in the atherosclerotic plaque. To better understand the epigenetic and transcriptional mechanisms that mediate these cell state changes, and how they relate to risk for coronary artery disease (CAD), we have investigated the causality and function of transcription factors (TFs) at genome-wide associated loci. Methods: We employed CRISPR-Cas9 genome and epigenome editing to identify the causal gene and cell(s) for a complex CAD GWAS signal at 2q22.3. Subsequently, single-cell epigenetic and transcriptomic profiling in murine models and human coronary artery smooth muscle cells were employed to understand the cellular and molecular mechanism by which this CAD risk gene exerts its function. Results: CRISPR-Cas9 genome and epigenome editing showed that the complex CAD genetic signals within a genomic region at 2q22.3 lie within smooth muscle long-distance enhancers for ZEB2, a TF extensively studied in the context of epithelial mesenchymal transition (EMT) in development and cancer. ZEB2 regulates SMC phenotypic transition through chromatin remodeling that obviates accessibility and disrupts both Notch and TGFβ signaling, thus altering the epigenetic trajectory of SMC transitions. SMC specific loss of ZEB2 resulted in an inability of transitioning SMCs to turn off contractile programming and take on a fibroblast-like phenotype, but accelerated the formation of chondromyocytes, mirroring features of high-risk atherosclerotic plaques in human coronary arteries. Conclusions: These studies identify ZEB2 as a new CAD GWAS gene that affects features of plaque vulnerability through direct effects on the epigenome, providing a new therapeutic approach to target vascular disease.

View details for DOI 10.1161/CIRCULATIONAHA.121.057789

View details for PubMedID 34990206
The epigenomic landscape of single vascular cells reflects developmental origin and identifies disease risk loci bioRxiv Weldy, C. S., Cheng, P. P., Pedroza, A. J., Dalal, A. R., Sharma, D., Kim, H., Shi, H., Nguyen, T., Kundu, R. K., Fischbein, M. P., Quertermous, T. 2022

View details for DOI 10.1101/2022.05.18.492517
A Congenital Anemia Reveals Distinct Targeting Mechanisms for Master Transcription Factor GATA1. Blood Ludwig, L., Lareau, C. A., Bao, E. L., Liu, N., Utsugisawa, T., Tseng, A. M., Myers, S. A., Verboon, J. M., Ulirsch, J. C., Luo, W., Muus, C., Fiorini, C., Olive, M. E., Vockley, C. M., Munschauer, M., Hunter, A., Ogura, H., Yamamoto, T., Inada, H., Nakagawa, S., Ozono, S., Subramanian, V., Chiarle, R., Glader, B., Carr, S. A., Aryee, M. J., Kundaje, A., Orkin, S., Regev, A., McCavit, T., Kanno, H., Sankaran, V. G. 2022

Abstract

Master regulators, such as the hematopoietic transcription factor (TF) GATA1, play an essential role in orchestrating lineage commitment and differentiation. However, the precise mechanisms by which such TFs regulate transcription through interactions with specific cis-regulatory elements remain incompletely understood. Here, we describe a form of congenital hemolytic anemia caused by missense mutations in an intrinsically disordered region of GATA1, with a poorly understood role in transcriptional regulation. Through integrative functional approaches, we demonstrate that these mutations perturb GATA1 transcriptional activity by partially impairing nuclear localization and selectively altering precise chromatin occupancy by GATA1. These alterations in chromatin occupancy and concordant chromatin accessibility changes alter faithful gene expression, with failure to both effectively silence and activate select genes necessary for effective terminal red cell production. We demonstrate how disease-causing mutations can reveal regulatory mechanisms that enable the faithful genomic targeting of master TFs during cellular differentiation.

View details for DOI 10.1182/blood.2021013753

View details for PubMedID 35030251
Single-Molecule Multikilobase-Scale Profiling of Chromatin Accessibility Using m6A-SMAC-Seq and m6A-CpG-GpC-SMAC-Seq. Methods in molecular biology (Clifton, N.J.) Marinov, G. K., Shipony, Z., Kundaje, A., Greenleaf, W. J. 2022; 2458: 269-298

Abstract

A hallmark feature of active cis-regulatory elements (CREs) in eukaryotes is their nucleosomal depletion and, accordingly, higher accessibility to enzymatic treatment. This property has been the basis of a number of sequencing-based assays for genome-wide identification and tracking the activity of CREs across different biological conditions, such as DNase-seq, ATAC-seq, NOMe-seq, and others. However, the fragmentation of DNA inherent to many of these assays and the limited read length of short-read sequencing platforms have so far not allowed the simultaneous measurement of the chromatin accessibility state of CREs located distally from each other. The combination of labeling accessible DNA with DNA modifications and nanopore sequencing has made it possible to develop such assays. Here, we provide a detailed protocol for carrying out the SMAC-seq assay (Single-Molecule long-read Accessible Chromatin mapping sequencing), in its m6A-SMAC-seq and m6A-CpG-GpC-SMAC-seq variants, together with methods for data processing and analysis, and discuss key experimental and analytical considerations for working with SMAC-seq datasets.

View details for DOI 10.1007/978-1-0716-2140-0_15

View details for PubMedID 35103973
Transcriptional and chromatin-based partitioning mechanisms uncouple protein scaling from cell size. Molecular cell Swaffer, M. P., Kim, J., Chandler-Brown, D., Langhinrichs, M., Marinov, G. K., Greenleaf, W. J., Kundaje, A., Schmoller, K. M., Skotheim, J. M. 2021

Abstract

Biosynthesis scales with cell size such that protein concentrations generally remain constant as cells grow. As an exception, synthesis of the cell-cycle inhibitor Whi5 "sub-scales" with cell size so that its concentration is lower in larger cells to promote cell-cycle entry. Here, we find that transcriptional control uncouples Whi5 synthesis from cell size, and we identify histones as the major class of sub-scaling transcripts besides WHI5 by screening for similar genes. Histone synthesis is thereby matched to genome content rather than cell size. Such sub-scaling proteins are challenged by asymmetric cell division because proteins are typically partitioned in proportion to newborn cell volume. To avoid this fate, Whi5 uses chromatin-binding to partition similar protein amounts to each newborn cell regardless of cell size. Disrupting both Whi5 synthesis and chromatin-based partitioning weakens G1 size control. Thus, specific transcriptional and partitioning mechanisms determine protein sub-scaling to control cell size.

View details for DOI 10.1016/j.molcel.2021.10.007

View details for PubMedID 34731644
Cell-specific Chromatin Landscape Of Human Coronary Artery Resolves Mechanisms Of Disease Risk Turner, A. W., Hu, S., Mosquera, J., Ma, W., Hodonsky, C. J., Wong, D., Auguste, G. E., Sol-Church, K., Farber, E., Kundu, S., Kundaje, A. B., Lopez, N. G., Ma, L., Ghosh, S., Onengut-Gumuscu, S., Ashley, E. A., Quertermous, T., Finn, A., Leeper, N. J., Kovacic, J. C., Bjorkegren, J. L., Zang, C., Miller, C. L. LIPPINCOTT WILLIAMS & WILKINS. 2021

View details for DOI 10.1161/atvb.41.suppl_1.113

View details for Web of Science ID 000861072500072
Cell-free DNA fragments inform epigenomic mechanisms for early detection of breast cancer. Gafni, E., Harvey, A., Jaroszewicz, A., Solari, O., Landolin, J., Barbirou, M., Miller, A., Tonellato, P. J., Kundaje, A., Jeffrey, S. S., Curtis, C., Sledge, G. W., Giresi, P., Boley, N. AMER ASSOC CANCER RESEARCH. 2021

View details for Web of Science ID 000680263504022
AP-1 is a temporally regulated dual gatekeeper of reprogramming to pluripotency. Proceedings of the National Academy of Sciences of the United States of America Markov, G. J., Mai, T., Nair, S., Shcherbina, A., Wang, Y. X., Burns, D. M., Kundaje, A., Blau, H. M. 2021; 118 (23)

Abstract

Somatic cell transcription factors are critical to maintaining cellular identity and constitute a barrier to human somatic cell reprogramming; yet a comprehensive understanding of the mechanism of action is lacking. To gain insight, we examined epigenome remodeling at the onset of human nuclear reprogramming by profiling human fibroblasts after fusion with murine embryonic stem cells (ESCs). By assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) and chromatin immunoprecipitation sequencing we identified enrichment for the activator protein 1 (AP-1) transcription factor c-Jun at regions of early transient accessibility at fibroblast-specific enhancers. Expression of a dominant negative AP-1 mutant (dnAP-1) reduced accessibility and expression of fibroblast genes, overcoming the barrier to reprogramming. Remarkably, efficient reprogramming of human fibroblasts to induced pluripotent stem cells was achieved by transduction with vectors expressing SOX2, KLF4, and inducible dnAP-1, demonstrating that dnAP-1 can substitute for exogenous human OCT4. Mechanistically, we show that the AP-1 component c-Jun has two unexpected temporally distinct functions in human reprogramming: 1) to potentiate fibroblast enhancer accessibility and fibroblast-specific gene expression, and 2) to bind to and repress OCT4 as a complex with MBD3. Our findings highlight AP-1 as a previously unrecognized potent dual gatekeeper of the somatic cell state.

View details for DOI 10.1073/pnas.2104841118

View details for PubMedID 34088849
Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network NATURE COMMUNICATIONS Grapotte, M., Saraswat, M., Bessiere, C., Menichelli, C., Ramilowski, J. A., Severin, J., Hayashizaki, Y., Itoh, M., Tagami, M., Murata, M., Kojima-Ishiyama, M., Noma, S., Noguchi, S., Kasukawa, T., Hasegawa, A., Suzuki, H., Nishiyori-Sueki, H., Frith, M. C., Chatelain, C., Carninci, P., de Hoon, M. J. L., Wasserman, W. W., Brehelin, L., Lecellier, C., FANTOM Consortium 2021; 12 (1): 3297

Abstract

Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.

View details for DOI 10.1038/s41467-021-23143-7

View details for Web of Science ID 000660869500001

View details for PubMedID 34078885

View details for PubMedCentralID PMC8172540
Transcription-dependent domain-scale three-dimensional genome organization in the dinoflagellate Breviolum minutum. Nature genetics Marinov, G. K., Trevino, A. E., Xiang, T., Kundaje, A., Grossman, A. R., Greenleaf, W. J. 2021

Abstract

Dinoflagellate chromosomes represent a unique evolutionary experiment, as they exist in a permanently condensed, liquid crystalline state; are not packaged by histones; and contain genes organized into tandem gene arrays, with minimal transcriptional regulation. We analyze the three-dimensional genome of Breviolum minutum, and find large topological domains (dinoflagellate topologically associating domains, which we term 'dinoTADs') without chromatin loops, which are demarcated by convergent gene array boundaries. Transcriptional inhibition disrupts dinoTADs, implicating transcription-induced supercoiling as the primary topological force in dinoflagellates.

View details for DOI 10.1038/s41588-021-00848-5

View details for PubMedID 33927397
Publisher Correction: MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome biology Cheng, J., Celik, M. H., Kundaje, A., Gagneur, J. 2021; 22 (1): 107

View details for DOI 10.1186/s13059-021-02338-7

View details for PubMedID 33858505
Genome-wide enhancer maps link risk variants to disease genes. Nature Nasser, J., Bergman, D. T., Fulco, C. P., Guckelberger, P., Doughty, B. R., Patwardhan, T. A., Jones, T. R., Nguyen, T. H., Ulirsch, J. C., Lekschas, F., Mualim, K., Natri, H. M., Weeks, E. M., Munson, G., Kane, M., Kang, H. Y., Cui, A., Ray, J. P., Eisenhaure, T. M., Collins, R. L., Dey, K., Pfister, H., Price, A. L., Epstein, C. B., Kundaje, A., Xavier, R. J., Daly, M. J., Huang, H., Finucane, H. K., Hacohen, N., Lander, E. S., Engreitz, J. M. 2021

Abstract

Genome-wide association studies (GWAS) have identified thousands of noncoding loci that are associated with human diseases and complex traits, each of which could reveal insights into the mechanisms of disease [1]. Many of the underlying causal variants may affect enhancers [2,3], but we lack accurate maps of enhancers and their target genes to interpret such variants. We recently developed the activity-by-contact (ABC) model to predict which enhancers regulate which genes and validated the model using CRISPR perturbations in several cell types [4]. Here we apply this ABC model to create enhancer-gene maps in 131 human cell types and tissues, and use these maps to interpret the functions of GWAS variants. Across 72 diseases and complex traits, ABC links 5,036 GWAS signals to 2,249 unique genes, including a class of 577 genes that appear to influence multiple phenotypes through variants in enhancers that act in different cell types. In inflammatory bowel disease (IBD), causal variants are enriched in predicted enhancers by more than 20-fold in particular cell types such as dendritic cells, and ABC achieves higher precision than other regulatory methods at connecting noncoding variants to target genes. These variant-to-function maps reveal an enhancer that contains an IBD risk variant and that regulates the expression of PPIF to alter the membrane potential of mitochondria in macrophages. Our study reveals principles of genome regulation, identifies genes that affect IBD and provides a resource and generalizable strategy to connect risk variants of common diseases to their molecular and cellular functions.
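The abstract does not spell out the scoring rule. As described in the earlier activity-by-contact publication, each candidate element's score for a gene is its activity multiplied by its contact frequency with the gene's promoter, divided by the sum of that product over all candidate elements near the gene. The sketch below is a minimal illustration of that normalization only; the function name and the activity and contact values are invented for the example and are not taken from the study.

```python
def abc_scores(activity, contact):
    """Activity-by-contact scores for one gene.

    activity[i]: enhancer activity of element i (hypothetical units)
    contact[i]:  contact frequency between element i and the gene's promoter
    Returns each element's share of the total activity-by-contact signal.
    """
    products = [a * c for a, c in zip(activity, contact)]
    total = sum(products)
    if total == 0:
        raise ValueError("no element has nonzero activity and contact")
    return [p / total for p in products]

# Three made-up candidate elements near one gene.
activity = [10.0, 5.0, 1.0]
contact = [0.1, 0.4, 1.0]
scores = abc_scores(activity, contact)
```

Because the scores for a gene sum to one, a weak element in frequent contact with the promoter can outrank a strong element that rarely contacts it, as the second element does here.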

View details for DOI 10.1038/s41586-021-03446-x

View details for PubMedID 33828297
Genetic architectures of proximal and distal colorectal cancer are partly distinct. Gut Huyghe, J. R., Harrison, T. A., Bien, S. A., Hampel, H., Figueiredo, J. C., Schmit, S. L., Conti, D. V., Chen, S., Qu, C., Lin, Y., Barfield, R., Baron, J. A., Cross, A. J., Diergaarde, B., Duggan, D., Harlid, S., Imaz, L., Kang, H. M., Levine, D. M., Perduca, V., Perez-Cornago, A., Sakoda, L. C., Schumacher, F. R., Slattery, M. L., Toland, A. E., van Duijnhoven, F. J., Van Guelpen, B., Agudo, A., Albanes, D., Alonso, M. H., Anderson, K., Arnau-Collell, C., Arndt, V., Banbury, B. L., Bassik, M. C., Berndt, S. I., Bézieau, S., Bishop, D. T., Boehm, J., Boeing, H., Boutron-Ruault, M. C., Brenner, H., Brezina, S., Buch, S., Buchanan, D. D., Burnett-Hartman, A., Caan, B. J., Campbell, P. T., Carr, P. R., Castells, A., Castellví-Bel, S., Chan, A. T., Chang-Claude, J., Chanock, S. J., Curtis, K. R., de la Chapelle, A., Easton, D. F., English, D. R., Feskens, E. J., Gala, M., Gallinger, S. J., Gauderman, W. J., Giles, G. G., Goodman, P. J., Grady, W. M., Grove, J. S., Gsur, A., Gunter, M. J., Haile, R. W., Hampe, J., Hoffmeister, M., Hopper, J. L., Hsu, W. L., Huang, W. Y., Hudson, T. J., Jenab, M., Jenkins, M. A., Joshi, A. D., Keku, T. O., Kooperberg, C., Kühn, T., Küry, S., Le Marchand, L., Lejbkowicz, F., Li, C. I., Li, L., Lieb, W., Lindblom, A., Lindor, N. M., Männistö, S., Markowitz, S. D., Milne, R. L., Moreno, L., Murphy, N., Nassir, R., Offit, K., Ogino, S., Panico, S., Parfrey, P. S., Pearlman, R., Pharoah, P. D., Phipps, A. I., Platz, E. A., Potter, J. D., Prentice, R. L., Qi, L., Raskin, L., Rennert, G., Rennert, H. S., Riboli, E., Schafmayer, C., Schoen, R. E., Seminara, D., Song, M., Su, Y. R., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Trichopoulou, A., Ulrich, C. M., Visvanathan, K., Vodicka, P., Vodickova, L., Vymetalkova, V., Weigl, K., Weinstein, S. J., White, E., Wolk, A., Woods, M. O., Wu, A. H., Abecasis, G. R., Nickerson, D. A., Scacheri, P. C., Kundaje, A., Casey, G., Gruber, S. B., Hsu, L., Moreno, V., Hayes, R. B., Newcomb, P. A., Peters, U. 2021

Abstract

An understanding of the etiologic heterogeneity of colorectal cancer (CRC) is critical for improving precision prevention, including individualized screening recommendations and the discovery of novel drug targets and repurposable drug candidates for chemoprevention. Known differences in molecular characteristics and environmental risk factors among tumors arising in different locations of the colorectum suggest partly distinct mechanisms of carcinogenesis. The extent to which the contribution of inherited genetic risk factors for CRC differs by anatomical subsite of the primary tumor has not been examined. To identify new anatomical subsite-specific risk loci, we performed genome-wide association study (GWAS) meta-analyses including data of 48 214 CRC cases and 64 159 controls of European ancestry. We characterised effect heterogeneity at CRC risk loci using multinomial modelling. We identified 13 loci that reached genome-wide significance (p<5×10⁻⁸) and that were not reported by previous GWASs for overall CRC risk. Multiple lines of evidence support candidate genes at several of these loci. We detected substantial heterogeneity between anatomical subsites. Just over half (61) of 109 known and new risk variants showed no evidence for heterogeneity. In contrast, 22 variants showed association with distal CRC (including rectal cancer), but no evidence for association or an attenuated association with proximal CRC. For two loci, there was strong evidence for effects confined to proximal colon cancer. Genetic architectures of proximal and distal CRC are partly distinct. Studies of risk factors and mechanisms of carcinogenesis, and precision prevention strategies should take into consideration the anatomical subsite of the tumour.

View details for DOI 10.1136/gutjnl-2020-321534

View details for PubMedID 33632709
Chromatin and gene-regulatory dynamics of the developing human cerebral cortex at single-cell resolution. Cell Trevino, A. E., Müller, F., Andersen, J., Sundaram, L., Kathiria, A., Shcherbina, A., Farh, K., Chang, H. Y., Pașca, A. M., Kundaje, A., Pașca, S. P., Greenleaf, W. J. 2021

Abstract

Genetic perturbations of cortical development can lead to neurodevelopmental disease, including autism spectrum disorder (ASD). To identify genomic regions crucial to corticogenesis, we mapped the activity of gene-regulatory elements, generating a single-cell atlas of gene expression and chromatin accessibility both independently and jointly. This revealed waves of gene regulation by key transcription factors (TFs) across a nearly continuous differentiation trajectory, distinguished the expression programs of glial lineages, and identified lineage-determining TFs that exhibited strong correlation between linked gene-regulatory elements and expression levels. These highly connected genes adopted an active chromatin state in early differentiating cells, consistent with lineage commitment. Base-pair-resolution neural network models identified strong cell-type-specific enrichment of noncoding mutations predicted to be disruptive in a cohort of ASD individuals and identified frequently disrupted TF binding sites. This approach illustrates how cell-type-specific mapping can provide insights into the programs governing human development and disease.

View details for DOI 10.1016/j.cell.2021.07.039

View details for PubMedID 34390642
WILDS: A Benchmark of in-the-Wild Distribution Shifts Koh, P., Sagawa, S., Marklund, H., Xie, S., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R., Gao, I., Lee, T., David, E., Stavness, I., Guo, W., Earnshaw, B. A., Haque, I. S., Beery, S., Leskovec, J., Kundaje, A., Pierson, E., Levine, S., Finn, C., Liang, P. edited by Meila, M., Zhang, T. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2021

View details for Web of Science ID 000683104605062
Learning cis-regulatory principles of ADAR-based RNA editing from CRISPR-mediated mutagenesis. Nature communications Liu, X., Sun, T., Shcherbina, A., Li, Q., Jarmoskaite, I., Kappel, K., Ramaswami, G., Das, R., Kundaje, A., Li, J. B. 2021; 12 (1): 2165

Abstract

Adenosine-to-inosine (A-to-I) RNA editing catalyzed by ADAR enzymes occurs in double-stranded RNAs. Despite a compelling need towards predictive understanding of natural and engineered editing events, how the RNA sequence and structure determine the editing efficiency and specificity (i.e., cis-regulation) is poorly understood. We apply a CRISPR/Cas9-mediated saturation mutagenesis approach to generate libraries of mutations near three natural editing substrates at their endogenous genomic loci. We use machine learning to integrate diverse RNA sequence and structure features to model editing levels measured by deep sequencing. We confirm known features and identify new features important for RNA editing. Training and testing XGBoost algorithm within the same substrate yield models that explain 68 to 86 percent of substrate-specific variation in editing levels. However, the models do not generalize across substrates, suggesting complex and context-dependent regulation patterns. Our integrative approach can be applied to larger scale experiments towards deciphering the RNA editing code.

View details for DOI 10.1038/s41467-021-22489-2

View details for PubMedID 33846332
MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome biology Cheng, J., Çelik, M. H., Kundaje, A., Gagneur, J. 2021; 22 (1): 94

Abstract

We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting tissue-specific variations associated with genetic variants in most tissues of the GTEx dataset, with largest improvements on brain tissues. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. We foresee that MTSplice will aid interpreting variants associated with tissue-specific disorders.

View details for DOI 10.1186/s13059-021-02273-7

View details for PubMedID 33789710
Genetic effects on transcriptome profiles in colon epithelium provide functional insights for genetic risk loci. Cellular and molecular gastroenterology and hepatology Díez-Obrero, V., Dampier, C. H., Moratalla-Navarro, F., Devall, M., Plummer, S. J., Díez-Villanueva, A., Peters, U., Bien, S., Huyghe, J. R., Kundaje, A., Ibáñez-Sanz, G., Guinó, E., Obón-Santacana, M., Carreras-Torres, R., Casey, G., Moreno, V. 2021

Abstract

The association of genetic variation with tissue-specific gene expression and alternative splicing guides functional characterization of complex trait associated loci and may suggest novel genes implicated in disease. Here, we aimed to 1) generate reference profiles of colon mucosa gene expression and alternative splicing and compare them across colon subsites (ascending, transverse and descending), 2) identify expression and splicing quantitative trait loci (QTLs), 3) find traits for which identified QTLs contribute to single nucleotide polymorphism (SNP)-based heritability, 4) propose candidate effector genes, and 5) provide a web-based visualization resource. We collected colonic mucosal biopsies from 485 healthy adults and performed bulk RNA sequencing (RNA-Seq). We performed genome-wide SNP genotyping from blood leukocytes. Statistical approaches and bioinformatics software were used for QTL identification and downstream analyses. We provided a complete quantification of gene expression and alternative splicing across colon subsites and described their differences. We identified thousands of expression and splicing QTLs and defined their enrichment at genome-wide regulatory regions. We found that part of the SNP-based heritability of diseases affecting colon tissue, such as colorectal cancer and inflammatory bowel disease, but also of diseases affecting other tissues, such as psychiatric conditions, can be explained by the identified QTLs. We provided candidate effector genes for multiple phenotypes. Finally, we provided the Colon Transcriptome Explorer (CoTrEx). We provided the largest characterization to date of gene expression and splicing across colon subsites. Our findings provide greater etiological insight into complex traits and diseases influenced by transcriptomic changes in colon tissue.

View details for DOI 10.1016/j.jcmgh.2021.02.003

View details for PubMedID 33601062
Landscape of cohesin-mediated chromatin loops in the human genome Grubert, F., Srivas, R., Spacek, D., Kasowski, M., Ruiz-Velasco, M., Sinnott-Armstrong, N., Greenside, P., Narasimha, A., Liu, Q., Geller, B., Sanghi, A., Kulik, M., Sa, S., Rabinovitch, M., Kundaje, A., Dalton, S., Zaugg, J., Snyder, M. SPRINGERNATURE. 2020: 72

View details for Web of Science ID 000598482600137
Transparency and reproducibility in artificial intelligence. Nature Haibe-Kains, B., Adam, G. A., Hosny, A., Khodakarami, F., Massive Analysis Quality Control (MAQC) Society Board of Directors, Waldron, L., Wang, B., McIntosh, C., Goldenberg, A., Kundaje, A., Greene, C. S., Broderick, T., Hoffman, M. M., Leek, J. T., Korthauer, K., Huber, W., Brazma, A., Pineau, J., Tibshirani, R., Hastie, T., Ioannidis, J. P., Quackenbush, J., Aerts, H. J., Shraddha, T., Kusko, R., Sansone, S., Tong, W., Wolfinger, R. D., Mason, C. E., Jones, W., Dopazo, J., Furlanello, C. 2020; 586 (7829): E14–E16

View details for DOI 10.1038/s41586-020-2766-y

View details for PubMedID 33057217
Perspectives on ENCODE. Nature ENCODE Project Consortium, Snyder, M. P., Gingeras, T. R., Moore, J. E., Weng, Z., Gerstein, M. B., Ren, B., Hardison, R. C., Stamatoyannopoulos, J. A., Graveley, B. R., Feingold, E. A., Pazin, M. J., Pagan, M., Gilchrist, D. A., Hitz, B. C., Cherry, J. M., Bernstein, B. E., Mendenhall, E. M., Zerbino, D. R., Frankish, A., Flicek, P., Myers, R. M., Abascal, F., Acosta, R., Addleman, N. J., Adrian, J., Afzal, V., Aken, B., Akiyama, J. A., Jammal, O. A., Amrhein, H., Anderson, S. M., Andrews, G. R., Antoshechkin, I., Ardlie, K. G., Armstrong, J., Astley, M., Banerjee, B., Barkal, A. A., Barnes, I. H., Barozzi, I., Barrell, D., Barson, G., Bates, D., Baymuradov, U. K., Bazile, C., Beer, M. A., Beik, S., Bender, M. A., Bennett, R., Bouvrette, L. P., Bernstein, B. E., Berry, A., Bhaskar, A., Bignell, A., Blue, S. M., Bodine, D. M., Boix, C., Boley, N., Borrman, T., Borsari, B., Boyle, A. P., Brandsmeier, L. A., Breschi, A., Bresnick, E. H., Brooks, J. A., Buckley, M., Burge, C. B., Byron, R., Cahill, E., Cai, L., Cao, L., Carty, M., Castanon, R. G., Castillo, A., Chaib, H., Chan, E. T., Chee, D. R., Chee, S., Chen, H., Chen, H., Chen, J., Chen, S., Cherry, J. M., Chhetri, S. B., Choudhary, J. S., Chrast, J., Chung, D., Clarke, D., Cody, N. A., Coppola, C. J., Coursen, J., D'Ippolito, A. M., Dalton, S., Danyko, C., Davidson, C., Davila-Velderrain, J., Davis, C. A., Dekker, J., Deran, A., DeSalvo, G., Despacio-Reyes, G., Dewey, C. N., Dickel, D. E., Diegel, M., Diekhans, M., Dileep, V., Ding, B., Djebali, S., Dobin, A., Dominguez, D., Donaldson, S., Drenkow, J., Dreszer, T. R., Drier, Y., Duff, M. O., Dunn, D., Eastman, C., Ecker, J. R., Edwards, M. D., El-Ali, N., Elhajjajy, S. I., Elkins, K., Emili, A., Epstein, C. B., Evans, R. C., Ezkurdia, I., Fan, K., Farnham, P. J., Farrell, N., Feingold, E. A., Ferreira, A., Fisher-Aylor, K., Fitzgerald, S., Flicek, P., Foo, C. 
S., Fortier, K., Frankish, A., Freese, P., Fu, S., Fu, X., Fu, Y., Fukuda-Yuzawa, Y., Fulciniti, M., Funnell, A. P., Gabdank, I., Galeev, T., Gao, M., Giron, C. G., Garvin, T. H., Gelboin-Burkhart, C. A., Georgolopoulos, G., Gerstein, M. B., Giardine, B. M., Gifford, D. K., Gilbert, D. M., Gilchrist, D. A., Gillespie, S., Gingeras, T. R., Gong, P., Gonzalez, A., Gonzalez, J. M., Good, P., Goren, A., Gorkin, D. U., Graveley, B. R., Gray, M., Greenblatt, J. F., Griffiths, E., Groudine, M. T., Grubert, F., Gu, M., Guigo, R., Guo, H., Guo, Y., Guo, Y., Gursoy, G., Gutierrez-Arcelus, M., Halow, J., Hardison, R. C., Hardy, M., Hariharan, M., Harmanci, A., Harrington, A., Harrow, J. L., Hashimoto, T. B., Hasz, R. D., Hatan, M., Haugen, E., Hayes, J. E., He, P., He, Y., Heidari, N., Hendrickson, D., Heuston, E. F., Hilton, J. A., Hitz, B. C., Hochman, A., Holgren, C., Hou, L., Hou, S., Hsiao, Y. E., Hsu, S., Huang, H., Hubbard, T. J., Huey, J., Hughes, T. R., Hunt, T., Ibarrientos, S., Issner, R., Iwata, M., Izuogu, O., Jaakkola, T., Jameel, N., Jansen, C., Jiang, L., Jiang, P., Johnson, A., Johnson, R., Jungreis, I., Kadaba, M., Kasowski, M., Kasparian, M., Kato, M., Kaul, R., Kawli, T., Kay, M., Keen, J. C., Keles, S., Keller, C. A., Kelley, D., Kellis, M., Kheradpour, P., Kim, D. S., Kirilusha, A., Klein, R. J., Knoechel, B., Kuan, S., Kulik, M. J., Kumar, S., Kundaje, A., Kutyavin, T., Lagarde, J., Lajoie, B. R., Lambert, N. J., Lazar, J., Lee, A. Y., Lee, D., Lee, E., Lee, J. W., Lee, K., Leslie, C. S., Levy, S., Li, B., Li, H., Li, N., Li, X., Li, Y. I., Li, Y., Li, Y., Li, Y., Lian, J., Libbrecht, M. W., Lin, S., Lin, Y., Liu, D., Liu, J., Liu, P., Liu, T., Liu, X. S., Liu, Y., Liu, Y., Long, M., Lou, S., Loveland, J., Lu, A., Lu, Y., Lecuyer, E., Ma, L., Mackiewicz, M., Mannion, B. J., Mannstadt, M., Manthravadi, D., Marinov, G. K., Martin, F. J., Mattei, E., McCue, K., McEown, M., McVicker, G., Meadows, S. K., Meissner, A., Mendenhall, E. M., Messer, C. 
L., Meuleman, W., Meyer, C., Miller, S., Milton, M. G., Mishra, T., Moore, D. E., Moore, H. M., Moore, J. E., Moore, S. H., Moran, J., Mortazavi, A., Mudge, J. M., Munshi, N., Murad, R., Myers, R. M., Nandakumar, V., Nandi, P., Narasimha, A. M., Narayanan, A. K., Naughton, H., Navarro, F. C., Navas, P., Nazarovs, J., Nelson, J., Neph, S., Neri, F. J., Nery, J. R., Nesmith, A. R., Newberry, J. S., Newberry, K. M., Ngo, V., Nguyen, R., Nguyen, T. B., Nguyen, T., Nishida, A., Noble, W. S., Novak, C. S., Novoa, E. M., Nunez, B., O'Donnell, C. W., Olson, S., Onate, K. C., Otterman, E., Ozadam, H., Pagan, M., Palden, T., Pan, X., Park, Y., Partridge, E. C., Paten, B., Pauli-Behn, F., Pazin, M. J., Pei, B., Pennacchio, L. A., Perez, A. R., Perry, E. H., Pervouchine, D. D., Phalke, N. N., Pham, Q., Phanstiel, D. H., Plajzer-Frick, I., Pratt, G. A., Pratt, H. E., Preissl, S., Pritchard, J. K., Pritykin, Y., Purcaro, M. J., Qin, Q., Quinones-Valdez, G., Rabano, I., Radovani, E., Raj, A., Rajagopal, N., Ram, O., Ramirez, L., Ramirez, R. N., Rausch, D., Raychaudhuri, S., Raymond, J., Razavi, R., Reddy, T. E., Reimonn, T. M., Ren, B., Reymond, A., Reynolds, A., Rhie, S. K., Rinn, J., Rivera, M., Rivera-Mulia, J. C., Roberts, B., Rodriguez, J. M., Rozowsky, J., Ryan, R., Rynes, E., Salins, D. N., Sandstrom, R., Sasaki, T., Sathe, S., Savic, D., Scavelli, A., Scheiman, J., Schlaffner, C., Schloss, J. A., Schmitges, F. W., See, L. H., Sethi, A., Setty, M., Shafer, A., Shan, S., Sharon, E., Shen, Q., Shen, Y., Sherwood, R. I., Shi, M., Shin, S., Shoresh, N., Siebenthall, K., Sisu, C., Slifer, T., Sloan, C. A., Smith, A., Snetkova, V., Snyder, M. P., Spacek, D. V., Srinivasan, S., Srivas, R., Stamatoyannopoulos, G., Stamatoyannopoulos, J. A., Stanton, R., Steffan, D., Stehling-Sun, S., Strattan, J. S., Su, A., Sundararaman, B., Suner, M., Syed, T., Szynkarek, M., Tanaka, F. Y., Tenen, D., Teng, M., Thomas, J. A., Toffey, D., Tress, M. L., Trout, D. 
E., Trynka, G., Tsuji, J., Upchurch, S. A., Ursu, O., Uszczynska-Ratajczak, B., Uziel, M. C., Valencia, A., Biber, B. V., van der Velde, A. G., Van Nostrand, E. L., Vaydylevich, Y., Vazquez, J., Victorsen, A., Vielmetter, J., Vierstra, J., Visel, A., Vlasova, A., Vockley, C. M., Volpi, S., Vong, S., Wang, H., Wang, M., Wang, Q., Wang, R., Wang, T., Wang, W., Wang, X., Wang, Y., Watson, N. K., Wei, X., Wei, Z., Weisser, H., Weissman, S. M., Welch, R., Welikson, R. E., Weng, Z., Westra, H., Whitaker, J. W., White, C., White, K. P., Wildberg, A., Williams, B. A., Wine, D., Witt, H. N., Wold, B., Wolf, M., Wright, J., Xiao, R., Xiao, X., Xu, J., Xu, J., Yan, K., Yan, Y., Yang, H., Yang, X., Yang, Y., Yardimci, G. G., Yee, B. A., Yeo, G. W., Young, T., Yu, T., Yue, F., Zaleski, C., Zang, C., Zeng, H., Zeng, W., Zerbino, D. R., Zhai, J., Zhan, L., Zhan, Y., Zhang, B., Zhang, J., Zhang, J., Zhang, K., Zhang, L., Zhang, P., Zhang, Q., Zhang, X., Zhang, Y., Zhang, Z., Zhao, Y., Zheng, Y., Zhong, G., Zhou, X., Zhu, Y., Zimmerman, J. 2020; 583 (7818): 693–98

Abstract

The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.

View details for DOI 10.1038/s41586-020-2449-8

View details for PubMedID 32728248
The Human Tumor Atlas Network: Charting Tumor Transitions across Space and Time at Single-Cell Resolution. Cell Rozenblatt-Rosen, O., Regev, A., Oberdoerffer, P., Nawy, T., Hupalowska, A., Rood, J. E., Ashenberg, O., Cerami, E., Coffey, R. J., Demir, E., Ding, L., Esplin, E. D., Ford, J. M., Goecks, J., Ghosh, S., Gray, J. W., Guinney, J., Hanlon, S. E., Hughes, S. K., Hwang, E. S., Iacobuzio-Donahue, C. A., Jane-Valbuena, J., Johnson, B. E., Lau, K. S., Lively, T., Mazzilli, S. A., Pe'er, D., Santagata, S., Shalek, A. K., Schapiro, D., Snyder, M. P., Sorger, P. K., Spira, A. E., Srivastava, S., Tan, K., West, R. B., Williams, E. H., Human Tumor Atlas Network, Aberle, D., Achilefu, S. I., Ademuyiwa, F. O., Adey, A. C., Aft, R. L., Agarwal, R., Aguilar, R. A., Alikarami, F., Allaj, V., Amos, C., Anders, R. A., Angelo, M. R., Anton, K., Ashenberg, O., Aster, J. C., Babur, O., Bahmani, A., Balsubramani, A., Barrett, D., Beane, J., Bender, D. E., Bernt, K., Berry, L., Betts, C. B., Bletz, J., Blise, K., Boire, A., Boland, G., Borowsky, A., Bosse, K., Bott, M., Boyden, E., Brooks, J., Bueno, R., Burlingame, E. A., Cai, Q., Campbell, J., Caravan, W., Cerami, E., Chaib, H., Chan, J. M., Chang, Y. H., Chatterjee, D., Chaudhary, O., Chen, A. A., Chen, B., Chen, C., Chen, C., Chen, F., Chen, Y., Chheda, M. G., Chin, K., Chiu, R., Chu, S., Chuaqui, R., Chun, J., Cisneros, L., Coffey, R. J., Colditz, G. A., Cole, K., Collins, N., Contrepois, K., Coussens, L. M., Creason, A. L., Crichton, D., Curtis, C., Davidsen, T., Davies, S. R., de Bruijn, I., Dellostritto, L., De Marzo, A., Demir, E., DeNardo, D. G., Diep, D., Ding, L., Diskin, S., Doan, X., Drewes, J., Dubinett, S., Dyer, M., Egger, J., Eng, J., Engelhardt, B., Erwin, G., Esplin, E. D., Esserman, L., Felmeister, A., Feiler, H. S., Fields, R. C., Fisher, S., Flaherty, K., Flournoy, J., Ford, J. M., Fortunato, A., Frangieh, A., Frye, J. L., Fulton, R. S., Galipeau, D., Gan, S., Gao, J., Gao, L., Gao, P., Gao, V. 
R., Geiger, T., George, A., Getz, G., Ghosh, S., Giannakis, M., Gibbs, D. L., Gillanders, W. E., Goecks, J., Goedegebuure, S. P., Gould, A., Gowers, K., Gray, J. W., Greenleaf, W., Gresham, J., Guerriero, J. L., Guha, T. K., Guimaraes, A. R., Guinney, J., Gutman, D., Hacohen, N., Hanlon, S., Hansen, C. R., Harismendy, O., Harris, K. A., Hata, A., Hayashi, A., Heiser, C., Helvie, K., Herndon, J. M., Hirst, G., Hodi, F., Hollmann, T., Horning, A., Hsieh, J. J., Hughes, S., Huh, W. J., Hunger, S., Hwang, S. E., Iacobuzio-Donahue, C. A., Ijaz, H., Izar, B., Jacobson, C. A., Janes, S., Jane-Valbuena, J., Jayasinghe, R. G., Jiang, L., Johnson, B. E., Johnson, B., Ju, T., Kadara, H., Kaestner, K., Kagan, J., Kalinke, L., Keith, R., Khan, A., Kibbe, W., Kim, A. H., Kim, E., Kim, J., Kolodzie, A., Kopytra, M., Kotler, E., Krueger, R., Krysan, K., Kundaje, A., Ladabaum, U., Lake, B. B., Lam, H., Laquindanum, R., Lau, K. S., Laughney, A. M., Lee, H., Lenburg, M., Leonard, C., Leshchiner, I., Levy, R., Li, J., Lian, C. G., Lim, K., Lin, J., Lin, Y., Liu, Q., Liu, R., Lively, T., Longabaugh, W. J., Longacre, T., Ma, C. X., Macedonia, M. C., Madison, T., Maher, C. A., Maitra, A., Makinen, N., Makowski, D., Maley, C., Maliga, Z., Mallo, D., Maris, J., Markham, N., Marks, J., Martinez, D., Mashl, R. J., Masilionais, I., Mason, J., Massague, J., Massion, P., Mattar, M., Mazurchuk, R., Mazutis, L., Mazzilli, S. A., McKinley, E. T., McMichael, J. F., Merrick, D., Meyerson, M., Miessner, J. R., Mills, G. B., Mills, M., Mondal, S. B., Mori, M., Mori, Y., Moses, E., Mosse, Y., Muhlich, J. L., Murphy, G. F., Navin, N. E., Nawy, T., Nederlof, M., Ness, R., Nevins, S., Nikolov, M., Nirmal, A. J., Nolan, G., Novikov, E., Oberdoerffer, P., O'Connell, B., Offin, M., Oh, S. T., Olson, A., Ooms, A., Ossandon, M., Owzar, K., Parmar, S., Patel, T., Patti, G. J., Pe'er, D., Pe'er, I., Peng, T., Persson, D., Petty, M., Pfister, H., Polyak, K., Pourfarhangi, K., Puram, S. 
V., Qiu, Q., Quintanal-Villalonga, A., Raj, A., Ramirez-Solano, M., Rashid, R., Reeb, A. N., Regev, A., Reid, M., Resnick, A., Reynolds, S. M., Riesterer, J. L., Rodig, S., Roland, J. T., Rosenfield, S., Rotem, A., Roy, S., Rozenblatt-Rosen, O., Rudin, C. M., Ryser, M. D., Santagata, S., Santi-Vicini, M., Sato, K., Schapiro, D., Schrag, D., Schultz, N., Sears, C. L., Sears, R. C., Sen, S., Sen, T., Shalek, A., Sheng, J., Sheng, Q., Shoghi, K. I., Shrubsole, M. J., Shyr, Y., Sibley, A. B., Siex, K., Simmons, A. J., Singer, D. S., Sivagnanam, S., Slyper, M., Snyder, M. P., Sokolov, A., Song, S., Sorger, P. K., Southard-Smith, A., Spira, A., Srivastava, S., Stein, J., Storm, P., Stover, E., Strand, S. H., Su, T., Sudar, D., Sullivan, R., Surrey, L., Suva, M., Tan, K., Terekhanova, N. V., Ternes, L., Thammavong, L., Thibault, G., Thomas, G. V., Thorsson, V., Todres, E., Tran, L., Tyler, M., Uzun, Y., Vachani, A., Van Allen, E., Vandekar, S., Veis, D. J., Vigneau, S., Vossough, A., Waanders, A., Wagle, N., Wang, L., Wendl, M. C., West, R., Williams, E. H., Wu, C., Wu, H., Wu, H., Wyczalkowski, M. A., Xie, Y., Yang, X., Yapp, C., Yu, W., Yuan, Y., Zhang, D., Zhang, K., Zhang, M., Zhang, N., Zhang, Y., Zhao, Y., Zhou, D. C., Zhou, Z., Zhu, H., Zhu, Q., Zhu, X., Zhu, Y., Zhuang, X. 2020; 181 (2): 236–49

Abstract

Crucial transitions in cancer, including tumor initiation, local expansion, metastasis, and therapeutic resistance, involve complex interactions between cells within the dynamic tumor ecosystem. Transformative single-cell genomics technologies and spatial multiplex in situ methods now provide an opportunity to interrogate this complexity at unprecedented resolution. The Human Tumor Atlas Network (HTAN), part of the National Cancer Institute (NCI) Cancer Moonshot Initiative, will establish a clinical, experimental, computational, and organizational framework to generate informative and accessible three-dimensional atlases of cancer transitions for a diverse set of tumor types. This effort complements both ongoing efforts to map healthy organs and previous large-scale cancer genomics approaches focused on bulk sequencing at a single point in time. Generating single-cell, multiparametric, longitudinal atlases and integrating them with clinical outcomes should help identify novel predictive biomarkers and features as well as therapeutically relevant cell types, cell states, and cellular interactions across transitions. The resulting tumor atlases should have a profound impact on our understanding of cancer biology and have the potential to improve cancer detection, prevention, and therapeutic discovery for better precision-medicine treatments of cancer patients and those at risk for cancer.

View details for DOI 10.1016/j.cell.2020.03.053

View details for PubMedID 32302568
CRISPR screens in cancer spheroids identify 3D growth-specific vulnerabilities. Nature Han, K., Pierce, S. E., Li, A., Spees, K., Anderson, G. R., Seoane, J. A., Lo, Y. H., Dubreuil, M., Olivas, M., Kamber, R. A., Wainberg, M., Kostyrko, K., Kelly, M. R., Yousefi, M., Simpkins, S. W., Yao, D., Lee, K., Kuo, C. J., Jackson, P. K., Sweet-Cordero, A., Kundaje, A., Gentles, A. J., Curtis, C., Winslow, M. M., Bassik, M. C. 2020; 580 (7801): 136-141

Abstract

Cancer genomics studies have identified thousands of putative cancer driver genes. Development of high-throughput and accurate models to define the functions of these genes is a major challenge. Here we devised a scalable cancer-spheroid model and performed genome-wide CRISPR screens in 2D monolayers and 3D lung-cancer spheroids. CRISPR phenotypes in 3D more accurately recapitulated those of in vivo tumours, and genes with differential sensitivities between 2D and 3D conditions were highly enriched for genes that are mutated in lung cancers. These analyses also revealed drivers that are essential for cancer growth in 3D and in vivo, but not in 2D. Notably, we found that carboxypeptidase D is responsible for removal of a C-terminal RKRR motif from the α-chain of the insulin-like growth factor 1 receptor that is critical for receptor activity. Carboxypeptidase D expression correlates with patient outcomes in patients with lung cancer, and loss of carboxypeptidase D reduced tumour growth. Our results reveal key differences between 2D and 3D cancer models, and establish a generalizable strategy for performing CRISPR screens in spheroids to reveal cancer vulnerabilities.

View details for DOI 10.1038/s41586-020-2099-x

View details for PubMedID 32238925
CRISPR screens in cancer spheroids identify 3D growth-specific vulnerabilities NATURE Han, K., Pierce, S. E., Li, A., Spees, K., Anderson, G. R., Seoane, J. A., Lo, Y., Dubreuil, M., Olivas, M., Kamber, R. A., Wainberg, M., Kostyrko, K., Kelly, M. R., Yousefi, M., Simpkins, S. W., Yao, D., Lee, K., Kuo, C. J., Jackson, P. K., Sweet-Cordero, A., Kundaje, A., Gentles, A. J., Curtis, C., Winslow, M. M., Bassik, M. C. 2020

View details for DOI 10.1038/s41586-020-2099-x

View details for Web of Science ID 000519162500002
Long-range single-molecule mapping of chromatin accessibility in eukaryotes. Nature methods Shipony, Z., Marinov, G. K., Swaffer, M. P., Sinnott-Armstrong, N. A., Skotheim, J. M., Kundaje, A., Greenleaf, W. J. 2020

Abstract

Mapping open chromatin regions has emerged as a widely used tool for identifying active regulatory elements in eukaryotes. However, existing approaches, limited by reliance on DNA fragmentation and short-read sequencing, cannot provide information about large-scale chromatin states or reveal coordination between the states of distal regulatory elements. We have developed a method for profiling the accessibility of individual chromatin fibers, a single-molecule long-read accessible chromatin mapping sequencing assay (SMAC-seq), enabling the simultaneous, high-resolution, single-molecule assessment of chromatin states at multikilobase length scales. Our strategy is based on combining the preferential methylation of open chromatin regions by DNA methyltransferases with low sequence specificity, in this case EcoGII, an N6-methyladenosine (m6A) methyltransferase, and the ability of nanopore sequencing to directly read DNA modifications. We demonstrate that aggregate SMAC-seq signals match bulk-level accessibility measurements, observe single-molecule nucleosome and transcription factor protection footprints, and quantify the correlation between chromatin states of distal genomic elements.

View details for DOI 10.1038/s41592-019-0730-2

View details for PubMedID 32042188
High-Throughput Discovery and Characterization of Human Transcriptional Effectors. Cell Tycko, J., DelRosso, N., Hess, G. T., Aradhana, Banerjee, A., Mukund, A., Van, M. V., Ego, B. K., Yao, D., Spees, K., Suzuki, P., Marinov, G. K., Kundaje, A., Bassik, M. C., Bintu, L. 2020

Abstract

Thousands of proteins localize to the nucleus; however, it remains unclear which contain transcriptional effectors. Here, we develop HT-recruit, a pooled assay where protein libraries are recruited to a reporter, and their transcriptional effects are measured by sequencing. Using this approach, we measure gene silencing and activation for thousands of domains. We find a relationship between repressor function and evolutionary age for the KRAB domains, discover that Homeodomain repressor strength is collinear with Hox genetic organization, and identify activities for several domains of unknown function. Deep mutational scanning of the CRISPRi KRAB maps the co-repressor binding surface and identifies substitutions that improve stability/silencing. By tiling 238 proteins, we find repressors as short as ten amino acids. Finally, we report new activator domains, including a divergent KRAB. These results provide a resource of 600 human proteins containing effectors and demonstrate a scalable strategy for assigning functions to protein domains.

View details for DOI 10.1016/j.cell.2020.11.024

View details for PubMedID 33326746
Maximum Likelihood with Bias-Corrected Calibration is Hard-To-Beat at Label Shift Adaptation Alexandari, A. M., Kundaje, A., Shrikumar, A. edited by Daume, H., Singh, A. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2020

View details for Web of Science ID 000683178500022
Landscape of cohesin-mediated chromatin loops in the human genome. Nature Grubert, F., Srivas, R., Spacek, D. V., Kasowski, M., Ruiz-Velasco, M., Sinnott-Armstrong, N., Greenside, P., Narasimha, A., Liu, Q., Geller, B., Sanghi, A., Kulik, M., Sa, S., Rabinovitch, M., Kundaje, A., Dalton, S., Zaugg, J. B., Snyder, M. 2020; 583 (7818): 737–43

Abstract

Physical interactions between distal regulatory elements have a key role in regulating gene expression, but the extent to which these interactions vary between cell types and contribute to cell-type-specific gene expression remains unclear. Here, to address these questions as part of phase III of the Encyclopedia of DNA Elements (ENCODE), we mapped cohesin-mediated chromatin loops, using chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), and analysed gene expression in 24 diverse human cell types, including core ENCODE cell lines. Twenty-eight per cent of all chromatin loops vary across cell types; these variations modestly correlate with changes in gene expression and are effective at grouping cell types according to their tissue of origin. The connectivity of genes corresponds to different functional classes, with housekeeping genes having few contacts, and dosage-sensitive genes being more connected to enhancer elements. This atlas of chromatin loops complements the diverse maps of regulatory architecture that comprise the ENCODE Encyclopedia, and will help to support emerging analyses of genome structure and function.

View details for DOI 10.1038/s41586-020-2151-x

View details for PubMedID 32728247
Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature Moore, J. E., Purcaro, M. J., Pratt, H. E., Epstein, C. B., Shoresh, N., Adrian, J., Kawli, T., Davis, C. A., Dobin, A., Kaul, R., Halow, J., Van Nostrand, E. L., Freese, P., Gorkin, D. U., Shen, Y., He, Y., Mackiewicz, M., Pauli-Behn, F., Williams, B. A., Mortazavi, A., Keller, C. A., Zhang, X. O., Elhajjajy, S. I., Huey, J., Dickel, D. E., Snetkova, V., Wei, X., Wang, X., Rivera-Mulia, J. C., Rozowsky, J., Zhang, J., Chhetri, S. B., Zhang, J., Victorsen, A., White, K. P., Visel, A., Yeo, G. W., Burge, C. B., Lécuyer, E., Gilbert, D. M., Dekker, J., Rinn, J., Mendenhall, E. M., Ecker, J. R., Kellis, M., Klein, R. J., Noble, W. S., Kundaje, A., Guigó, R., Farnham, P. J., Cherry, J. M., Myers, R. M., Ren, B., Graveley, B. R., Gerstein, M. B., Pennacchio, L. A., Snyder, M. P., Bernstein, B. E., Wold, B., Hardison, R. C., Gingeras, T. R., Stamatoyannopoulos, J. A., Weng, Z. 2020; 583 (7818): 699–710

Abstract

The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

View details for DOI 10.1038/s41586-020-2493-4

View details for PubMedID 32728249
Homogeneity in the association of body mass index with type 2 diabetes across the UK Biobank: A Mendelian randomization study. PLoS medicine Wainberg, M., Mahajan, A., Kundaje, A., McCarthy, M. I., Ingelsson, E., Sinnott-Armstrong, N., Rivas, M. A. 2019; 16 (12): e1002982

Abstract

BACKGROUND: Lifestyle interventions to reduce body mass index (BMI) are critical public health strategies for type 2 diabetes prevention. While weight loss interventions have shown demonstrable benefit for high-risk and prediabetic individuals, we aimed to determine whether the same benefits apply to those at lower risk. METHODS AND FINDINGS: We performed a multi-stratum Mendelian randomization study of the effect size of BMI on diabetes odds in 287,394 unrelated individuals of self-reported white British ancestry in the UK Biobank, who were recruited from across the United Kingdom from 2006 to 2010 when they were between the ages of 40 and 69 years. Individuals were stratified on the following diabetes risk factors: BMI, diabetes family history, and genome-wide diabetes polygenic risk score. The main outcome measure was the odds ratio of diabetes per 1-kg/m² BMI reduction, in the full cohort and in each stratum. Diabetes prevalence increased sharply with BMI, family history of diabetes, and genetic risk. Conversely, predicted risk reduction from weight loss was strikingly similar across BMI and genetic risk categories. Weight loss was predicted to substantially reduce diabetes odds even among lower-risk individuals: for instance, a 1-kg/m² BMI reduction was associated with a 1.37-fold reduction (95% CI 1.12-1.68) in diabetes odds among non-overweight individuals (BMI < 25 kg/m²) without a family history of diabetes, similar to that in obese individuals (BMI ≥ 30 kg/m²) with a family history (1.21-fold reduction, 95% CI 1.13-1.29). A key limitation of this analysis is that the BMI-altering DNA sequence polymorphisms it studies represent cumulative predisposition over an individual's entire lifetime, and may consequently incorrectly estimate the risk modification potential of weight loss interventions later in life. CONCLUSIONS: In a population-scale cohort, lower BMI was consistently associated with reduced diabetes risk across BMI, family history, and genetic risk categories, suggesting all individuals can substantially reduce their diabetes risk through weight loss. Our results support the broad deployment of weight loss interventions to individuals at all levels of diabetes risk.

View details for DOI 10.1371/journal.pmed.1002982

View details for PubMedID 31821322
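As a small worked example of the arithmetic in the abstract above: an "x-fold reduction" in diabetes odds per 1-kg/m² BMI reduction corresponds to an odds ratio of 1/x. The sketch below only re-expresses the two point estimates quoted in the abstract; the function name is hypothetical and nothing here reproduces the study's Mendelian randomization analysis.

```python
def fold_reduction_to_odds_ratio(fold):
    """Convert an x-fold reduction in odds into the equivalent odds ratio (1/x)."""
    if fold <= 0:
        raise ValueError("fold reduction must be positive")
    return 1.0 / fold


# Point estimates quoted in the abstract (per 1-kg/m^2 BMI reduction).
lean_no_history = fold_reduction_to_odds_ratio(1.37)     # BMI < 25, no family history
obese_with_history = fold_reduction_to_odds_ratio(1.21)  # BMI >= 30, family history

print(round(lean_no_history, 2), round(obese_with_history, 2))
```

Under this conversion the two strata correspond to odds ratios of roughly 0.73 and 0.83 per unit of BMI reduction.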
NETWORK MODELLING OF TOPOLOGICAL DOMAINS USING HI-C DATA. The annals of applied statistics Wang, Y. X., Sarkar, P., Ursu, O., Kundaje, A., Bickel, P. J. 2019; 13 (3): 1511-1536

Abstract

Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted, whereas the nodes in this case, that is, the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this nonexchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate the domains identified have desirable epigenetic features and compare them across different cell types.

View details for DOI 10.1214/19-aoas1244

View details for PubMedID 32968472

View details for PubMedCentralID PMC7508461
NETWORK MODELLING OF TOPOLOGICAL DOMAINS USING HI-C DATA ANNALS OF APPLIED STATISTICS Wang, Y., Sarkar, P., Ursu, O., Kundaje, A., Bickel, P. J. 2019; 13 (3): 1511–36

View details for DOI 10.1214/19-AOAS1244

View details for Web of Science ID 000490874300008
GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs. Bioinformatics (Oxford, England) Shrikumar, A., Prakash, E., Kundaje, A. 2019; 35 (14): i173-i182

Abstract

Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM) or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, non-redundant TF motifs. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines. Code and example notebooks to reproduce results are at https://github.com/kundajelab/gkmexplain. Supplementary data are available at Bioinformatics online.

View details for DOI 10.1093/bioinformatics/btz322

View details for PubMedID 31510661
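For readers unfamiliar with the gapped k-mer features that the GkmExplain abstract above refers to, the following is a minimal sketch of the underlying representation: every length-l window of a sequence contributes one feature per choice of k informative positions, with the remaining positions masked. The function names, the `-` gap symbol, and the toy parameters are assumptions made for illustration; this is not the gkm-SVM kernel implementation (which uses far more efficient algorithms and normalization), and it is not GkmExplain itself.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l=4, k=2):
    """Count gapped k-mers: each length-l window with k kept positions, the rest masked as '-'."""
    counts = Counter()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for kept in combinations(range(l), k):
            kept_set = set(kept)
            pattern = "".join(c if i in kept_set else "-" for i, c in enumerate(window))
            counts[pattern] += 1
    return counts


def gkm_similarity(seq_a, seq_b, l=4, k=2):
    """Unnormalized inner product of the two gapped k-mer count vectors."""
    a = gapped_kmer_counts(seq_a, l, k)
    b = gapped_kmer_counts(seq_b, l, k)
    return sum(a[p] * b[p] for p in a if p in b)


# Toy sequences: a single mismatch still leaves several shared gapped features.
print(gkm_similarity("ACGT", "ACGT"), gkm_similarity("ACGT", "ACGA"))
```

The point of the gaps is visible in the toy call: two sequences differing at one base still share every gapped feature that masks the mismatching position.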
Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics (Oxford, England) Nair, S., Kim, D. S., Perricone, J., Kundaje, A. 2019; 35 (14): i108-i116

Abstract

Genome-wide profiles of chromatin accessibility and gene expression in diverse cellular contexts are critical to decipher the dynamics of transcriptional regulation. Recently, convolutional neural networks have been used to learn predictive cis-regulatory DNA sequence models of context-specific chromatin accessibility landscapes. However, these context-specific regulatory sequence models cannot generalize predictions across cell types. We introduce multi-modal, residual neural network architectures that integrate cis-regulatory sequence and context-specific expression of trans-regulators to predict genome-wide chromatin accessibility profiles across cellular contexts. We show that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor. We leverage this feature and employ novel strategies for training models to enhance genome-wide prediction of shared and context-specific chromatin accessible sites across cell types. We interpret the models to reveal insights into cis- and trans-regulation of chromatin dynamics across 123 diverse cellular contexts. The code is available at https://github.com/kundajelab/ChromDragoNN. Supplementary data are available at Bioinformatics online.

View details for DOI 10.1093/bioinformatics/btz352

View details for PubMedID 31510655
Matrix stiffness induces a tumorigenic phenotype in mammary epithelium through changes in chromatin accessibility. Nature biomedical engineering Stowers, R. S., Shcherbina, A., Israeli, J., Gruber, J. J., Chang, J., Nam, S., Rabiee, A., Teruel, M. N., Snyder, M. P., Kundaje, A., Chaudhuri, O. 2019

Abstract

In breast cancer, the increased stiffness of the extracellular matrix is a key driver of malignancy. Yet little is known about the epigenomic changes that underlie the tumorigenic impact of extracellular matrix mechanics. Here, we show in a three-dimensional culture model of breast cancer that stiff extracellular matrix induces a tumorigenic phenotype through changes in chromatin state. We found that increased stiffness yielded cells with more wrinkled nuclei and with increased lamina-associated chromatin, that cells cultured in stiff matrices displayed more accessible chromatin sites, which exhibited footprints of Sp1 binding, and that this transcription factor acts along with the histone deacetylases 3 and 8 to regulate the induction of stiffness-mediated tumorigenicity. Just as cell culture on soft environments or in them rather than on tissue-culture plastic better recapitulates the acinar morphology observed in mammary epithelium in vivo, mammary epithelial cells cultured on soft microenvironments or in them also more closely replicate the in vivo chromatin state. Our results emphasize the importance of culture conditions for epigenomic studies, and reveal that chromatin state is a critical mediator of mechanotransduction.

View details for DOI 10.1038/s41551-019-0420-5

View details for PubMedID 31285581
Predicting gene expression from plasma cell-free DNA using both the fragment length and fragment position St John, J. A., Gafni, E., White, B., Kannan, A., Hansen, L., Jaroszewicz, A., Kundaje, A., Boley, N. AMER ASSOC CANCER RESEARCH. 2019

View details for DOI 10.1158/1538-7445.SABCS18-4349

View details for Web of Science ID 000488279404232
The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Scientific reports Amemiya, H. M., Kundaje, A., Boyle, A. P. 2019; 9 (1): 9354

Abstract

Functional genomics assays based on high-throughput sequencing greatly expand our ability to understand the genome. Here, we define the ENCODE blacklist: a comprehensive set of regions in the human, mouse, worm, and fly genomes that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The removal of the ENCODE blacklist is an essential quality measure when analyzing functional genomics data.

View details for DOI 10.1038/s41598-019-45839-z

View details for PubMedID 31249361
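The blacklist-removal step described in the abstract above amounts to discarding any called region that overlaps a blacklisted interval. A minimal sketch follows, assuming BED-style 0-based half-open coordinates; the function names and the toy coordinates are hypothetical and are not taken from the actual ENCODE blacklist files or tooling.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def remove_blacklisted(peaks, blacklist):
    """Drop peaks that overlap any blacklist region on the same chromosome.

    Both arguments are lists of (chrom, start, end) tuples.
    """
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in peaks:
        regions = by_chrom.get(chrom, [])
        if not any(overlaps(start, end, s, e) for s, e in regions):
            kept.append((chrom, start, end))
    return kept


# Toy coordinates for illustration only (not real blacklist entries).
blacklist = [("chr1", 100, 200)]
peaks = [("chr1", 50, 100), ("chr1", 150, 250), ("chr2", 150, 250)]
filtered = remove_blacklisted(peaks, blacklist)
print(filtered)
```

The linear scan per chromosome keeps the sketch short; for genome-scale peak sets an interval tree or a sorted sweep would be the usual choice.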
The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nature biotechnology Avsec, Z., Kreuzhuber, R., Israeli, J., Xu, N., Cheng, J., Shrikumar, A., Banerjee, A., Kim, D. S., Beier, T., Urban, L., Kundaje, A., Stegle, O., Gagneur, J. 2019

View details for DOI 10.1038/s41587-019-0140-0

View details for PubMedID 31138913
Cell cycle dynamics of human pluripotent stem cells primed for differentiation. Stem cells (Dayton, Ohio) Shcherbina, A., Li, J., Narayanan, C., Greenleaf, W., Kundaje, A., Chetty, S. 2019

Abstract

Understanding the molecular properties of the cell cycle of human pluripotent stem cells (hPSCs) is critical for effectively promoting differentiation. Here, we use the Fluorescence Ubiquitin Cell Cycle Indicator (FUCCI) system adapted into hPSCs and perform RNA-sequencing on cell cycle sorted hPSCs primed and unprimed for differentiation. Gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner in cells primed for differentiation without altering genes associated with pluripotency. Furthermore, we identify an important role for PI3K signaling in regulating the early transitory states of hPSCs toward differentiation. SIGNIFICANCE STATEMENT: Generating differentiated cell types from human pluripotent stem cells (hPSCs) holds great therapeutic promise, but has proven to be challenging in practice. The cell cycle may play an important role in enhancing the differentiation potential of hPSCs. Here, the authors track and isolate hPSCs from different phases of the cell cycle and perform RNA-sequencing. The data show that gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner as hPSCs transition toward differentiation and highlight an important role for PI3K signaling in regulating these early transitory states. © AlphaMed Press 2019.

View details for DOI 10.1002/stem.3041

View details for PubMedID 31135093
Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses GENOME RESEARCH Benayoun, B. A., Pollina, E. A., Singh, P., Mahmoudi, S., Harel, I., Casey, K. M., Dulken, B. W., Kundaje, A., Brunet, A. 2019; 29 (4): 697–709

View details for DOI 10.1101/gr.240093.118

View details for Web of Science ID 000462858600016
Initiation of mtDNA transcription is followed by pausing, and diverges across human cell types and during evolution (vol 27, pg 362, 2017) GENOME RESEARCH Blumberg, A., Rice, E. J., Kundaje, A., Danko, C. G., Mishmar, D. 2019; 29 (4): 710

View details for DOI 10.1101/gr.248971.119

View details for Web of Science ID 000462858600018

View details for PubMedID 30936176

View details for PubMedCentralID PMC6442388
Measuring the reproducibility and quality of Hi-C data. Genome biology Yardimci, G. G., Ozadam, H., Sauria, M. E., Ursu, O., Yan, K., Yang, T., Chakraborty, A., Kaul, A., Lajoie, B. R., Song, F., Zhan, Y., Ay, F., Gerstein, M., Kundaje, A., Li, Q., Taylor, J., Yue, F., Dekker, J., Noble, W. S. 2019; 20 (1): 57

Abstract

BACKGROUND: Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. RESULTS: Using real and simulated data, we profile the performance of several recently proposed methods for assessing reproducibility of population Hi-C data, including HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep. By explicitly controlling noise and sparsity through simulations, we demonstrate the deficiencies of performing simple correlation analysis on pairs of matrices, and we show that methods developed specifically for Hi-C data produce better measures of reproducibility. We also show how to use established measures, such as the ratio of intra- to interchromosomal interactions, and novel ones, such as QuASAR-QC, to identify low-quality experiments. CONCLUSIONS: In this work, we assess reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices. Through this extensive validation and benchmarking of Hi-C data, we describe best practices for reproducibility and quality assessment of Hi-C experiments. We make all software publicly available at http://github.com/kundajelab/3DChromatin_ReplicateQC to facilitate adoption in the community.

View details for PubMedID 30890172
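One of the established quality measures named in the abstract above, the ratio of intra- to interchromosomal interactions, is simple enough to sketch directly. The input layout (a list of `(chrom_a, chrom_b, count)` tuples) and the function name are assumptions made for illustration; the abstract does not specify a threshold, so none is given here.

```python
def intra_inter_ratio(contacts):
    """Ratio of intrachromosomal to interchromosomal contact counts.

    contacts: list of (chrom_a, chrom_b, count) tuples.
    Returns float('inf') when no interchromosomal contacts are present.
    """
    intra = sum(count for a, b, count in contacts if a == b)
    inter = sum(count for a, b, count in contacts if a != b)
    if inter == 0:
        return float("inf")
    return intra / inter


# Toy counts for illustration only.
toy_contacts = [
    ("chr1", "chr1", 80),
    ("chr1", "chr2", 10),
    ("chr2", "chr2", 60),
    ("chr2", "chr3", 10),
]
print(intra_inter_ratio(toy_contacts))  # 140 / 20 = 7.0
```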
mtDNA Chromatin-like Organization Is Gradually Established during Mammalian Embryogenesis. iScience Marom, S., Blumberg, A., Kundaje, A., Mishmar, D. 2019; 12: 141–51

Abstract

Unlike the nuclear genome, the mammalian mitochondrial genome (mtDNA) is thought to be coated solely by mitochondrial transcription factor A (TFAM), whose binding sequence preferences are debated. Therefore, higher-order mtDNA organization is considered much less regulated than both the bacterial nucleoid and the nuclear chromatin. However, our recently identified conserved DNase footprinting pattern in human mtDNA, which co-localizes with regulatory elements and responds to physiological conditions, likely reflects a structured higher-order mtDNA organization. We hypothesized that this pattern emerges during embryogenesis. To test this hypothesis, we analyzed assay for transposase-accessible chromatin sequencing (ATAC-seq) results collected during the course of mouse and human early embryogenesis. Our results reveal, for the first time, a gradual and dynamic emergence of the adult mtDNA footprinting pattern during embryogenesis of both mammals. Taken together, our findings suggest that the structured adult chromatin-like mtDNA organization is gradually formed during mammalian embryogenesis.

View details for PubMedID 30684873
Mitigation of off-target toxicity in CRISPR-Cas9 screens for essential non-coding elements. Nature communications Tycko, J., Wainberg, M., Marinov, G. K., Ursu, O., Hess, G. T., Ego, B. K., Aradhana, Li, A., Truong, A., Trevino, A. E., Spees, K., Yao, D., Kaplow, I. M., Greenside, P. G., Morgens, D. W., Phanstiel, D. H., Snyder, M. P., Bintu, L., Greenleaf, W. J., Kundaje, A., Bassik, M. C. 2019; 10 (1): 4063

Abstract

Pooled CRISPR-Cas9 screens are a powerful method for functionally characterizing regulatory elements in the non-coding genome, but off-target effects in these experiments have not been systematically evaluated. Here, we investigate Cas9, dCas9, and CRISPRi/a off-target activity in screens for essential regulatory elements. The sgRNAs with the largest effects in genome-scale screens for essential CTCF loop anchors in K562 cells were not single guide RNAs (sgRNAs) that disrupted gene expression near the on-target CTCF anchor. Rather, these sgRNAs had high off-target activity that, while only weakly correlated with absolute off-target site number, could be predicted by the recently developed GuideScan specificity score. Screens conducted in parallel with CRISPRi/a, which do not induce double-stranded DNA breaks, revealed that a distinct set of off-targets also cause strong confounding fitness effects with these epigenome-editing tools. Promisingly, filtering of CRISPRi libraries using GuideScan specificity scores removed these confounded sgRNAs and enabled identification of essential regulatory elements.

View details for DOI 10.1038/s41467-019-11955-7

View details for PubMedID 31492858
Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PloS one Movva, R., Greenside, P., Marinov, G. K., Nair, S., Shrikumar, A., Kundaje, A. 2019; 14 (6): e0218073

Abstract

The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.

View details for DOI 10.1371/journal.pone.0218073

View details for PubMedID 31206543
Discovery of common and rare genetic risk variants for colorectal cancer. Nature genetics Huyghe, J. R., Bien, S. A., Harrison, T. A., Kang, H. M., Chen, S., Schmit, S. L., Conti, D. V., Qu, C., Jeon, J., Edlund, C. K., Greenside, P., Wainberg, M., Schumacher, F. R., Smith, J. D., Levine, D. M., Nelson, S. C., Sinnott-Armstrong, N. A., Albanes, D., Alonso, M. H., Anderson, K., Arnau-Collell, C., Arndt, V., Bamia, C., Banbury, B. L., Baron, J. A., Berndt, S. I., Bezieau, S., Bishop, D. T., Boehm, J., Boeing, H., Brenner, H., Brezina, S., Buch, S., Buchanan, D. D., Burnett-Hartman, A., Butterbach, K., Caan, B. J., Campbell, P. T., Carlson, C. S., Castellvi-Bel, S., Chan, A. T., Chang-Claude, J., Chanock, S. J., Chirlaque, M., Cho, S. H., Connolly, C. M., Cross, A. J., Cuk, K., Curtis, K. R., de la Chapelle, A., Doheny, K. F., Duggan, D., Easton, D. F., Elias, S. G., Elliott, F., English, D. R., Feskens, E. J., Figueiredo, J. C., Fischer, R., FitzGerald, L. M., Forman, D., Gala, M., Gallinger, S., Gauderman, W. J., Giles, G. G., Gillanders, E., Gong, J., Goodman, P. J., Grady, W. M., Grove, J. S., Gsur, A., Gunter, M. J., Haile, R. W., Hampe, J., Hampel, H., Harlid, S., Hayes, R. B., Hofer, P., Hoffmeister, M., Hopper, J. L., Hsu, W., Huang, W., Hudson, T. J., Hunter, D. J., Ibanez-Sanz, G., Idos, G. E., Ingersoll, R., Jackson, R. D., Jacobs, E. J., Jenkins, M. A., Joshi, A. D., Joshu, C. E., Keku, T. O., Key, T. J., Kim, H. R., Kobayashi, E., Kolonel, L. N., Kooperberg, C., Kuhn, T., Kury, S., Kweon, S., Larsson, S. C., Laurie, C. A., Le Marchand, L., Leal, S. M., Lee, S. C., Lejbkowicz, F., Lemire, M., Li, C. I., Li, L., Lieb, W., Lin, Y., Lindblom, A., Lindor, N. M., Ling, H., Louie, T. L., Mannisto, S., Markowitz, S. D., Martin, V., Masala, G., McNeil, C. E., Melas, M., Milne, R. L., Moreno, L., Murphy, N., Myte, R., Naccarati, A., Newcomb, P. A., Offit, K., Ogino, S., Onland-Moret, N. C., Pardini, B., Parfrey, P. S., Pearlman, R., Perduca, V., Pharoah, P. 
D., Pinchev, M., Platz, E. A., Prentice, R. L., Pugh, E., Raskin, L., Rennert, G., Rennert, H. S., Riboli, E., Rodriguez-Barranco, M., Romm, J., Sakoda, L. C., Schafmayer, C., Schoen, R. E., Seminara, D., Shah, M., Shelford, T., Shin, M., Shulman, K., Sieri, S., Slattery, M. L., Southey, M. C., Stadler, Z. K., Stegmaier, C., Su, Y., Tangen, C. M., Thibodeau, S. N., Thomas, D. C., Thomas, S. S., Toland, A. E., Trichopoulou, A., Ulrich, C. M., Van Den Berg, D. J., van Duijnhoven, F. J., Van Guelpen, B., van Kranen, H., Vijai, J., Visvanathan, K., Vodicka, P., Vodickova, L., Vymetalkova, V., Weigl, K., Weinstein, S. J., White, E., Win, A. K., Wolf, C. R., Wolk, A., Woods, M. O., Wu, A. H., Zaidi, S. H., Zanke, B. W., Zhang, Q., Zheng, W., Scacheri, P. C., Potter, J. D., Bassik, M. C., Kundaje, A., Casey, G., Moreno, V., Abecasis, G. R., Nickerson, D. A., Gruber, S. B., Hsu, L., Peters, U. 2018

Abstract

To further dissect the genetic architecture of colorectal cancer (CRC), we performed whole-genome sequencing of 1,439 cases and 720 controls, imputed discovered sequence variants and Haplotype Reference Consortium panel variants into genome-wide association study data, and tested for association in 34,869 cases and 29,051 controls. Findings were followed up in an additional 23,262 cases and 38,296 controls. We discovered a strongly protective 0.3% frequency variant signal at CHD1. In a combined meta-analysis of 125,478 individuals, we identified 40 new independent signals at P < 5 × 10⁻⁸, bringing the number of known independent signals for CRC to ~100. New signals implicate lower-frequency variants, Kruppel-like factors, Hedgehog signaling, Hippo-YAP signaling, long noncoding RNAs and somatic drivers, and support a role for immune function. Heritability analyses suggest that CRC risk is highly polygenic, and larger, more comprehensive studies enabling rare variant analysis will improve understanding of biology underlying this risk and influence personalized screening strategies and drug development.

View details for PubMedID 30510241
Intertumoral Heterogeneity in SCLC Is Influenced by the Cell Type of Origin CANCER DISCOVERY Yang, D., Denny, S. K., Greenside, P. G., Chaikovsky, A. C., Brady, J. J., Ouadah, Y., Granja, J. M., Jahchan, N. S., Lim, J., Kwok, S., Kong, C. S., Berghoff, A. S., Schmitt, A., Reinhardt, H., Park, K., Preusser, M., Kundaje, A., Greenleaf, W. J., Sage, J., Winslow, M. M. 2018; 8 (10): 1316–31

View details for DOI 10.1158/2159-8290.CD-17-0987

View details for Web of Science ID 000446398800012
GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs BIOINFORMATICS Ursu, O., Boley, N., Taranova, M., Wang, Y., Yardimci, G., Noble, W., Kundaje, A. 2018; 34 (16): 2701–7

View details for DOI 10.1093/bioinformatics/bty164

View details for Web of Science ID 000441730900001
A common pattern of DNase I footprinting throughout the human mtDNA unveils clues for a chromatin-like organization GENOME RESEARCH Blumberg, A., Danko, C. G., Kundaje, A., Mishmar, D. 2018; 28 (8): 1158–68

Abstract

Human mitochondrial DNA (mtDNA) is believed to lack chromatin and histones. Instead, it is coated solely by the transcription factor TFAM. We asked whether mtDNA packaging is more regulated than once thought. To address this, we analyzed DNase-seq experiments in 324 human cell types and found, for the first time, a pattern of 29 mtDNA Genomic footprinting (mt-DGF) sites shared by ∼90% of the samples. Their syntenic conservation in mouse DNase-seq experiments reflects selective constraints. Colocalization with known mtDNA regulatory elements, with G-quadruplex structures, in TFAM-poor sites (in HeLa cells) and with transcription pausing sites suggests a functional regulatory role for such mt-DGFs. An altered mt-DGF pattern in interleukin 3-treated CD34+ cells, certain tissue differences, and a significant prevalence change in fetal versus nonfetal samples offer first clues to their physiological importance. Taken together, human mtDNA has a conserved protein-DNA organization, which is likely involved in mtDNA regulation.

View details for PubMedID 30002158

View details for PubMedCentralID PMC6071632
Decoding regulatory sequence across skin differentiation with deep learning Kim, D., Risca, V., Chappell, J., Shi, M., Zhao, Z., Jung, N., Chang, H., Snyder, M., Greenleaf, W., Kundaje, A., Khavari, P. ELSEVIER SCIENCE INC. 2018: S135

View details for Web of Science ID 000431498600057
GenomeDISCO: A concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics (Oxford, England) Ursu, O., Boley, N., Taranova, M., Wang, Y. X., Yardimci, G. G., Noble, W. S., Kundaje, A. 2018

Abstract

Motivation: The three-dimensional organization of chromatin plays a critical role in gene regulation and disease. High-throughput chromosome conformation capture experiments such as Hi-C are used to obtain genome-wide maps of 3D chromatin contacts. However, robust estimation of data quality and systematic comparison of these contact maps is challenging due to the multi-scale, hierarchical structure of chromatin contacts and the resulting properties of experimental noise in the data. Measuring concordance of contact maps is important for assessing reproducibility of replicate experiments and for modeling variation between different cellular contexts. Results: We introduce a concordance measure called GenomeDISCO (DIfferences between Smoothed COntact maps) for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments. The key idea is to smooth contact maps using random walks on the contact map graph, before estimating concordance. We use simulated datasets to benchmark GenomeDISCO's sensitivity to different types of noise that affect chromatin contact maps. When applied to a large collection of Hi-C datasets, GenomeDISCO accurately distinguishes biological replicates from samples obtained from different cell types. GenomeDISCO also generalizes to other chromosome conformation capture assays, such as HiChIP. Availability: Software implementing GenomeDISCO is available at https://github.com/kundajelab/genomedisco. Contact: akundaje@stanford.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

View details for PubMedID 29554289
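The key idea stated in the GenomeDISCO abstract above, smoothing each contact map with random walks on its graph before comparing the pair, can be sketched as follows. This is an illustrative toy under stated assumptions, not the published implementation: the row normalization, the fixed number of steps, and the "1 minus average L1 difference" score are simplifications chosen here, and all function names are hypothetical. Plain Python lists are used to keep the sketch dependency-free; real contact maps would call for sparse matrices.

```python
def row_normalize(matrix):
    """Turn a contact matrix into a random-walk transition matrix (rows sum to 1)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_walk_smooth(contact_map, steps=3):
    """Smooth a contact map by taking `steps` (>= 1) random-walk steps on its graph."""
    transition = row_normalize(contact_map)
    smoothed = transition
    for _ in range(steps - 1):
        smoothed = matmul(smoothed, transition)
    return smoothed


def concordance(map_a, map_b, steps=3):
    """1 minus the average per-row L1 difference between the smoothed maps."""
    a = random_walk_smooth(map_a, steps)
    b = random_walk_smooth(map_b, steps)
    n = len(a)
    l1 = sum(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n))
    return 1.0 - l1 / n


# Toy symmetric contact maps for illustration only.
map_1 = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
map_2 = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
print(concordance(map_1, map_1), concordance(map_1, map_2))
```

Because each smoothed row is a probability distribution, the per-row L1 difference lies between 0 and 2, so this toy score ranges from -1 to 1 and equals 1 for identical maps.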
ChIP-ping the branches of the tree: functional genomics and the evolution of eukaryotic gene regulation BRIEFINGS IN FUNCTIONAL GENOMICS Marinov, G. K., Kundaje, A. 2018; 17 (2): 116–37

Abstract

Advances in the methods for detecting protein-DNA interactions have played a key role in determining the directions of research into the mechanisms of transcriptional regulation. The most recent major technological transformation happened a decade ago, with the move from using tiling arrays [chromatin immunoprecipitation (ChIP)-on-Chip] to high-throughput sequencing (ChIP-seq) as a readout for ChIP assays. In addition to the numerous other ways in which it is superior to arrays, by eliminating the need to design and manufacture them, sequencing also opened the door to carrying out comparative analyses of genome-wide transcription factor occupancy across species and studying chromatin biology in previously less accessible model and nonmodel organisms, thus allowing us to understand the evolution and diversity of regulatory mechanisms in unprecedented detail. Here, we review the biological insights obtained from such studies in recent years and discuss anticipated future developments in the field.

View details for DOI 10.1093/bfgp/ely004

View details for Web of Science ID 000429027600006

View details for PubMedID 29529131
Impact of regulatory variation across human iPSCs and differentiated cells GENOME RESEARCH Banovich, N. E., Li, Y. I., Raj, A., Ward, M. C., Greenside, P., Calderon, D., Tung, P., Burnett, J. E., Myrthil, M., Thomas, S. M., Burrows, C. K., Romero, I., Pavlovic, B. J., Kundaje, A., Pritchard, J. K., Gilad, Y. 2018; 28 (1): 122–31

Abstract

Induced pluripotent stem cells (iPSCs) are an essential tool for studying cellular differentiation and cell types that are otherwise difficult to access. We investigated the use of iPSCs and iPSC-derived cells to study the impact of genetic variation on gene regulation across different cell types and as models for studies of complex disease. To do so, we established a panel of iPSCs from 58 well-studied Yoruba lymphoblastoid cell lines (LCLs); 14 of these lines were further differentiated into cardiomyocytes. We characterized regulatory variation across individuals and cell types by measuring gene expression levels, chromatin accessibility, and DNA methylation. Our analysis focused on a comparison of inter-individual regulatory variation across cell types. While most cell-type-specific regulatory quantitative trait loci (QTLs) lie in chromatin that is open only in the affected cell types, we found that 20% of cell-type-specific regulatory QTLs are in shared open chromatin. This observation motivated us to develop a deep neural network to predict open chromatin regions from DNA sequence alone. Using this approach, we were able to use the sequences of segregating haplotypes to predict the effects of common SNPs on cell-type-specific chromatin accessibility.

View details for PubMedID 29208628
Prediction of protein-ligand interactions from paired protein sequence motifs and ligand substructures. Pacific Symposium on Biocomputing Greenside, P., Hillenmeyer, M., Kundaje, A. 2018; 23: 20–31

Abstract

Identification of small molecule ligands that bind to proteins is a critical step in drug discovery. Computational methods have been developed to accelerate the prediction of protein-ligand binding, but often depend on 3D protein structures. As only a limited number of protein 3D structures have been resolved, the ability to predict protein-ligand interactions without relying on a 3D representation would be highly valuable. We use an interpretable confidence-rated boosting algorithm to predict protein-ligand interactions with high accuracy from ligand chemical substructures and protein 1D sequence motifs, without relying on 3D protein structures. We compare several protein motif definitions, assess generalization of our model's predictions to unseen proteins and ligands, demonstrate recovery of well established interactions and identify globally predictive protein-ligand motif pairs. By bridging biological and chemical perspectives, we demonstrate that it is possible to predict protein-ligand interactions using only motif-based features and that interpretation of these features can reveal new insights into the molecular mechanics underlying each interaction. Our work also lays a foundation to explore more predictive feature sets and sophisticated machine learning approaches as well as other applications, such as predicting unintended interactions or the effects of mutations.

View details for PubMedID 29218866
Umap and Bismap: quantifying genome and methylome mappability. Nucleic acids research Karimzadeh, M., Ernst, C., Kundaje, A., Hoffman, M. M. 2018; 46 (20): e120

Abstract

Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding and chemical modifications. Every region in a genome assembly has a property called 'mappability', which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. These regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. Both to correct assumptions of uniformity in downstream analysis and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at https://bismap.hoffmanlab.org for use with genome browsers.
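
As a rough illustration of what "uniquely mappable" means, the sketch below marks each read-length window of a toy sequence as unique if its k-mer occurs exactly once. This is not the Umap algorithm: the function name is hypothetical, and the sketch ignores the reverse strand, sequencing errors, and bisulfite conversion.

```python
from collections import Counter

def unique_kmer_starts(genome, k):
    """For each start position, report whether the k-mer there occurs once.

    Hypothetical helper for illustration; forward strand only.
    """
    n_windows = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(n_windows))
    return [counts[genome[i:i + k]] == 1 for i in range(n_windows)]

flags = unique_kmer_starts("ACGTACGTTT", 4)
print(flags)  # [False, True, True, True, False, True, True]
```

Here the 4-mer ACGT starts at positions 0 and 4, so a read from either position could not be placed unambiguously, while the other five windows are unique.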

View details for PubMedID 30169659
Differential analysis of chromatin accessibility and histone modifications for predicting mouse developmental enhancers. Nucleic acids research Fu, S., Wang, Q., Moore, J. E., Purcaro, M. J., Pratt, H. E., Fan, K., Gu, C., Jiang, C., Zhu, R., Kundaje, A., Lu, A., Weng, Z. 2018; 46 (21): 11184–201

Abstract

Enhancers are distal cis-regulatory elements that modulate gene expression. They are depleted of nucleosomes and enriched in specific histone modifications; thus, calling DNase-seq and histone mark ChIP-seq peaks can predict enhancers. We evaluated nine peak-calling algorithms for predicting enhancers validated by transgenic mouse assays. DNase and H3K27ac peaks were consistently more predictive than H3K4me1/2/3 and H3K9ac peaks. DFilter and Hotspot2 were the best DNase peak callers, while HOMER, MUSIC, MACS2, DFilter and F-seq were the best H3K27ac peak callers. We observed that the differential DNase or H3K27ac signals between two distant tissues increased the area under the precision-recall curve (PR-AUC) of DNase peaks by 17.5-166.7% and that of H3K27ac peaks by 7.1-22.2%. We further improved this differential signal method using multiple contrast tissues. Evaluated using a blind test, the differential H3K27ac signal method substantially improved PR-AUC from 0.48 to 0.75 for predicting heart enhancers. We further validated our approach using postnatal retina and cerebral cortex enhancers identified by massively parallel reporter assays, and observed improvements for both tissues. In summary, we compared nine peak callers and devised a superior method for predicting tissue-specific mouse developmental enhancers by reranking the called peaks.

View details for PubMedID 30137428
Challenges and recommendations for epigenomics in precision health NATURE BIOTECHNOLOGY Carter, A. C., Chang, H. Y., Church, G., Dombkowski, A., Ecker, J. R., Gil, E., Giresi, P. G., Greely, H., Greenleaf, W. J., Hacohen, N., He, C., Hill, D., Ko, J., Kohane, I., Kundaje, A., Palmer, M., Snyder, M. P., Tung, J., Urban, A., Vidal, M., Wong, W. 2017; 35 (12): 1128–32

View details for PubMedID 29220033
Chromatin accessibility dynamics reveal novel functional enhancers in C. elegans GENOME RESEARCH Daugherty, A. C., Yeo, R. W., Buenrostro, J. D., Greenleaf, W. J., Kundaje, A., Brunet, A. 2017; 27 (12): 2096–2107

Abstract

Chromatin accessibility, a crucial component of genome regulation, has primarily been studied in homogeneous and simple systems, such as isolated cell populations or early-development models. Whether chromatin accessibility can be assessed in complex, dynamic systems in vivo with high sensitivity remains largely unexplored. In this study, we use ATAC-seq to identify chromatin accessibility changes in a whole animal, the model organism Caenorhabditis elegans, from embryogenesis to adulthood. Chromatin accessibility changes between developmental stages are highly reproducible, recapitulate histone modification changes, and reveal key regulatory aspects of the epigenomic landscape throughout organismal development. We find that over 5000 distal noncoding regions exhibit dynamic changes in chromatin accessibility between developmental stages and could thereby represent putative enhancers. When tested in vivo, several of these putative enhancers indeed drive novel cell-type- and temporal-specific patterns of expression. Finally, by integrating transcription factor binding motifs in a machine learning framework, we identify EOR-1 as a unique transcription factor that may regulate chromatin dynamics during development. Our study provides a unique resource for C. elegans, a system in which the prevalence and importance of enhancers remains poorly characterized, and demonstrates the power of using whole organism chromatin accessibility to identify novel regulatory regions in complex systems.

View details for PubMedID 29141961
Enrichment of colorectal cancer associations in functional regions: Insight for using epigenomics data in the analysis of whole genome sequence-imputed GWAS data PLOS ONE Bien, S. A., Auer, P. L., Harrison, T. A., Qu, C., Connolly, C. M., Greenside, P. G., Chen, S., Berndt, S. I., Bezieau, S., Kang, H. M., Huyghe, J., Brenner, H., Casey, G., Chan, A. T., Hopper, J. L., Banbury, B. L., Chang-Claude, J., Chanock, S. J., Haile, R. W., Hoffmeister, M., Fuchsberger, C., Jenkins, M. A., Leal, S. M., Lemire, M., Newcomb, P. A., Gallinger, S., Potter, J. D., Schoen, R. E., Slattery, M. L., Smith, J. D., Le Marchand, L., White, E., Zanke, B. W., Abecasis, G. R., Carlson, C. S., Peters, U., Nickerson, D. A., Kundaje, A., Hsu, L., GECCO CCFR 2017; 12 (11): e0186518

Abstract

The evaluation of less frequent genetic variants and their effect on complex disease poses new challenges for genomic research. To investigate whether epigenetic data can be used to inform aggregate rare-variant association methods (RVAM), we assessed whether variants more significantly associated with colorectal cancer (CRC) were preferentially located in non-coding regulatory regions, and whether enrichment was specific to colorectal tissues. Active regulatory elements (ARE) were mapped using data from 127 tissues and cell-types from NIH Roadmap Epigenomics and Encyclopedia of DNA Elements (ENCODE) projects. We investigated whether CRC association p-values were more significant for common variants 1) inside versus outside AREs, or 2) inside colorectal (CR) AREs versus AREs of other tissues and cell-types. We employed an integrative epigenomic RVAM for variants with allele frequency <1%. Gene sets were defined as ARE variants within 200 kilobases of a transcription start site (TSS) using either CR ARE or ARE from non-digestive tissues. CRC-set association p-values were used to evaluate enrichment of less frequent variant associations in CR ARE versus non-digestive ARE. ARE from 126/127 tissues and cell-types were significantly enriched for stronger CRC-variant associations. Strongest enrichment was observed for digestive tissues and immune cell types. CR-specific ARE were also enriched for stronger CRC-variant associations compared to ARE combined across non-digestive tissues (p-value = 9.6 × 10⁻⁴). Additionally, we found enrichment of stronger CRC association p-values for rare variant sets of CR ARE compared to non-digestive ARE (p-value = 0.029). Integrative epigenomic RVAM may enable discovery of less frequent variants associated with CRC, and ARE of digestive and immune tissues are most informative. Although distance-based aggregation of less frequent variants in CR ARE surrounding TSS showed modest enrichment, future association studies would likely benefit from joint analysis of transcriptomes and epigenomes to better link regulatory variation with target genes.

View details for PubMedID 29161273
Vicus: Exploiting local structures to improve network-based analysis of biological data PLOS COMPUTATIONAL BIOLOGY Wang, B., Huang, L., Zhu, Y., Kundaje, A., Batzoglou, S., Goldenberg, A. 2017; 13 (10): e1005621

Abstract

Biological networks entail important topological features and patterns critical to understanding interactions within complicated biological systems. Despite great progress in understanding their structure, much more can be done to improve our inference and network analysis. Spectral methods play a key role in many network-based applications. Fundamental to spectral methods is the Laplacian, a matrix that captures the global structure of the network. Unfortunately, the Laplacian does not take into account intricacies of the network's local structure and is sensitive to noise in the network. These two properties are fundamental to biological networks and cannot be ignored. We propose an alternative matrix, Vicus. The Vicus matrix captures the local neighborhood structure of the network and thus is more effective at modeling biological interactions. We demonstrate the advantages of Vicus in the context of spectral methods by extensive empirical benchmarking on tasks such as single cell dimensionality reduction, protein module discovery and ranking genes for cancer subtyping. Our experiments show that using Vicus, spectral methods result in more accurate and robust performance in all of these tasks.
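
For readers unfamiliar with the Laplacian mentioned in this abstract, the sketch below builds the standard unnormalized graph Laplacian L = D - A for a small network. It shows only the baseline matrix that the abstract contrasts with Vicus, not the Vicus matrix itself; the function name and the example network are illustrative.

```python
def graph_laplacian(adjacency):
    """Return L = D - A, where D holds each node's degree on the diagonal."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0) - adjacency[i][j]
         for j in range(n)]
        for i in range(n)
    ]

# A three-node path network: node 1 is connected to nodes 0 and 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(graph_laplacian(path))  # [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
```

Every row of L sums to zero, so each entry is tied to the degree of a whole node, which is one way to see why a few noisy edges can shift the spectrum.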

View details for PubMedID 29023470

View details for PubMedCentralID PMC5638230
Genome-scale measurement of off-target activity using Cas9 toxicity in high-throughput screens NATURE COMMUNICATIONS Morgens, D. W., Wainberg, M., Boyle, E. A., Ursu, O., Araya, C. L., Tsui, C. K., Haney, M. S., Hess, G. T., Han, K., Jeng, E. E., Li, A., Snyder, M. P., Greenleaf, W. J., Kundaje, A., Bassik, M. C. 2017; 8

Abstract

CRISPR-Cas9 screens are powerful tools for high-throughput interrogation of genome function, but can be confounded by nuclease-induced toxicity at both on- and off-target sites, likely due to DNA damage. Here, to test potential solutions to this issue, we design and analyse a CRISPR-Cas9 library with 10 variable-length guides per gene and thousands of negative controls targeting non-functional, non-genic regions (termed safe-targeting guides), in addition to non-targeting controls. We find this library has excellent performance in identifying genes affecting growth and sensitivity to the ricin toxin. The safe-targeting guides allow for proper control of toxicity from on-target DNA damage. Using this toxicity as a proxy to measure off-target cutting, we demonstrate with tens of thousands of guides both the nucleotide position-dependent sensitivity to single mismatches and the reduction of off-target cutting using truncated guides. Our results demonstrate a simple strategy for high-throughput evaluation of target specificity and nuclease toxicity in Cas9 screens.

View details for DOI 10.1038/ncomms15178

View details for PubMedID 28474669
Dynamic and stable enhancer-promoter contacts regulate epidermal terminal differentiation Lopez-Pajares, V., Rubin, A., Barajas, B., Furlan-Magaril, M., Mumbach, M., Greenleaf, W., Kundaje, A., Snyder, M., Chang, H., Fraser, P., Khavari, P. A. ELSEVIER SCIENCE INC. 2017: S80

View details for DOI 10.1016/j.jid.2017.02.483

View details for Web of Science ID 000406862400458
Initiation of mtDNA transcription is followed by pausing, and diverges across human cell types and during evolution. Genome research Blumberg, A., Rice, E. J., Kundaje, A., Danko, C. G., Mishmar, D. 2017; 27 (3): 362-373

Abstract

Mitochondrial DNA (mtDNA) genes have long been known to be cotranscribed in polycistrons, yet it remains impossible to study nascent mtDNA transcripts quantitatively in vivo using existing tools. To this end, we used deep sequencing (GRO-seq and PRO-seq) and analyzed nascent mtDNA-encoded RNA transcripts in diverse human cell lines and metazoan organisms. Surprisingly, accurate detection of human mtDNA transcription initiation sites (TISs) in the heavy and light strands revealed a novel conserved transcription pausing site near the light-strand TIS. This pausing site correlated with the presence of a bacterial pausing sequence motif, with reduced SNP density, and with a DNase footprinting signal in all tested cells. Its location within conserved sequence block 3 (CSBIII), just upstream of the known transcription-replication transition point, suggests involvement in such transition. Analysis of nonhuman organisms enabled de novo mtDNA sequence assembly, as well as detection of previously unknown mtDNA TIS, pausing, and transcription termination sites with unprecedented accuracy. Whereas mammals (Pan troglodytes, Macaca mulatta, Rattus norvegicus, and Mus musculus) showed a human-like mtDNA transcription pattern, the invertebrate pattern (Drosophila melanogaster and Caenorhabditis elegans) profoundly diverged. Our approach paves the path toward in vivo, quantitative, reference sequence-free analysis of mtDNA transcription in all eukaryotes.

View details for DOI 10.1101/gr.209924.116

View details for PubMedID 28049628
Molecular definition of a metastatic lung cancer state reveals a targetable CD109-Janus kinase-Stat axis. Nature medicine Chuang, C., Greenside, P. G., Rogers, Z. N., Brady, J. J., Yang, D., Ma, R. K., Caswell, D. R., Chiou, S., Winters, A. F., Grüner, B. M., Ramaswami, G., Spencley, A. L., Kopecky, K. E., Sayles, L. C., Sweet-Cordero, E. A., Li, J. B., Kundaje, A., Winslow, M. M. 2017; 23 (3): 291-300

Abstract

Lung cancer is the leading cause of cancer deaths worldwide, with the majority of mortality resulting from metastatic spread. However, the molecular mechanism by which cancer cells acquire the ability to disseminate from primary tumors, seed distant organs, and grow into tissue-destructive metastases remains incompletely understood. We combined tumor barcoding in a mouse model of human lung adenocarcinoma with unbiased genomic approaches to identify a transcriptional program that confers metastatic ability and predicts patient survival. Small-scale in vivo screening identified several genes, including Cd109, that encode novel pro-metastatic factors. We uncovered signaling mediated by Janus kinases (Jaks) and the transcription factor Stat3 as a critical, pharmacologically targetable effector of CD109-driven lung cancer metastasis. In summary, by coupling the systematic genomic analysis of purified cancer cells in distinct malignant states from mouse models with extensive human validation, we uncovered several key regulators of metastatic ability, including an actionable pro-metastatic CD109-Jak-Stat3 axis.

View details for DOI 10.1038/nm.4285

View details for PubMedID 28191885
Predicting gene expression in massively parallel reporter assays: a comparative study. Human mutation Kreimer, A., Zeng, H., Edwards, M. D., Guo, Y., Tian, K., Shin, S., Welch, R., Wainberg, M., Mohan, R., Sinnott-Armstrong, N. A., Li, Y., Eraslan, G., Amin, T. B., Goke, J., Mueller, N. S., Kellis, M., Kundaje, A., Beer, M. A., Keles, S., Gifford, D. K., Yosef, N. 2017

Abstract

In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role "coded" in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.

View details for DOI 10.1002/humu.23197

View details for PubMedID 28220625
An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nature methods Corces, M. R., Trevino, A. E., Hamilton, E. G., Greenside, P. G., Sinnott-Armstrong, N. A., Vesuna, S., Satpathy, A. T., Rubin, A. J., Montine, K. S., Wu, B., Kathiria, A., Cho, S. W., Mumbach, M. R., Carter, A. C., Kasowski, M., Orloff, L. A., Risca, V. I., Kundaje, A., Khavari, P. A., Montine, T. J., Greenleaf, W. J., Chang, H. Y. 2017

Abstract

We present Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility profiling that works across multiple applications with substantial improvement of signal-to-background ratio and information content. The Omni-ATAC protocol generates chromatin accessibility profiles from archival frozen tissue samples and 50-μm sections, revealing the activities of disease-associated DNA elements in distinct human brain structures. The Omni-ATAC protocol enables the interrogation of personal regulomes in tissue context and translational studies.

View details for PubMedID 28846090
Learning Important Features Through Propagating Activation Differences Shrikumar, A., Greenside, P., Kundaje, A. edited by Precup, D., Teh, Y. W. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2017

View details for Web of Science ID 000683309503025
Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements. Nature genetics Mumbach, M. R., Satpathy, A. T., Boyle, E. A., Dai, C., Gowen, B. G., Cho, S. W., Nguyen, M. L., Rubin, A. J., Granja, J. M., Kazane, K. R., Wei, Y., Nguyen, T., Greenside, P. G., Corces, M. R., Tycko, J., Simeonov, D. R., Suliman, N., Li, R., Xu, J., Flynn, R. A., Kundaje, A., Khavari, P. A., Marson, A., Corn, J. E., Quertermous, T., Greenleaf, W. J., Chang, H. Y. 2017

Abstract

The challenge of linking intergenic mutations to target genes has limited molecular understanding of human diseases. Here we show that H3K27ac HiChIP generates high-resolution contact maps of active enhancers and target genes in rare primary human T cell subtypes and coronary artery smooth muscle cells. Differentiation of naive T cells into T helper 17 cells or regulatory T cells creates subtype-specific enhancer-promoter interactions, specifically at regions of shared DNA accessibility. These data provide a principled means of assigning molecular functions to autoimmune and cardiovascular disease risk variants, linking hundreds of noncoding variants to putative gene targets. Target genes identified with HiChIP are further supported by CRISPR interference and activation at linked enhancers, by the presence of expression quantitative trait loci, and by allele-specific enhancer loops in patient-derived primary cells. The majority of disease-associated enhancers contact genes beyond the nearest gene in the linear genome, leading to a fourfold increase in the number of potential target genes for autoimmune and cardiovascular diseases.

View details for PubMedID 28945252
Lineage-specific dynamic and pre-established enhancer-promoter contacts cooperate in terminal differentiation. Nature genetics Rubin, A. J., Barajas, B. C., Furlan-Magaril, M., Lopez-Pajares, V., Mumbach, M. R., Howard, I., Kim, D. S., Boxer, L. D., Cairns, J., Spivakov, M., Wingett, S. W., Shi, M., Zhao, Z., Greenleaf, W. J., Kundaje, A., Snyder, M., Chang, H. Y., Fraser, P., Khavari, P. A. 2017; 49 (10): 1522–28

Abstract

Chromosome conformation is an important feature of metazoan gene regulation; however, enhancer-promoter contact remodeling during cellular differentiation remains poorly understood. To address this, genome-wide promoter capture Hi-C (CHi-C) was performed during epidermal differentiation. Two classes of enhancer-promoter contacts associated with differentiation-induced genes were identified. The first class ('gained') increased in contact strength during differentiation in concert with enhancer acquisition of the H3K27ac activation mark. The second class ('stable') were pre-established in undifferentiated cells, with enhancers constitutively marked by H3K27ac. The stable class was associated with the canonical conformation regulator cohesin, whereas the gained class was not, implying distinct mechanisms of contact formation and regulation. Analysis of stable enhancers identified a new, essential role for a constitutively expressed, lineage-restricted ETS-family transcription factor, EHF, in epidermal differentiation. Furthermore, neither class of contacts was observed in pluripotent cells, suggesting that lineage-specific chromatin structure is established in tissue progenitor cells and is further remodeled in terminal differentiation.

View details for PubMedID 28805829
High-Throughput Characterization of Cascade type I-E CRISPR Guide Efficacy Reveals Unexpected PAM Diversity and Target Sequence Preferences. Genetics Fu, B. X., Wainberg, M., Kundaje, A., Fire, A. Z. 2017; 206 (4): 1727–38

Abstract

Interactions between Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) RNAs and CRISPR-associated (Cas) proteins form an RNA-guided adaptive immune system in prokaryotes. The adaptive immune system utilizes segments of the genetic material of invasive foreign elements in the CRISPR locus. The loci are transcribed and processed to produce small CRISPR RNAs (crRNAs), with degradation of invading genetic material directed by a combination of complementarity between RNA and DNA and in some cases recognition of adjacent motifs called PAMs (Protospacer Adjacent Motifs). Here we describe a general, high-throughput procedure to test the efficacy of thousands of targets, applying this to the Escherichia coli type I-E Cascade (CRISPR-associated complex for antiviral defense) system. These studies were followed with reciprocal experiments in which the consequence of CRISPR activity was survival in the presence of a lytic phage. From the combined analysis of the Cascade system, we found that (i) type I-E Cascade PAM recognition is more expansive than previously reported, with at least 22 distinct PAMs, with many of the noncanonical PAMs having CRISPR-interference abilities similar to the canonical PAMs; (ii) PAM positioning appears precise, with no evidence for tolerance to PAM slippage in interference; and (iii) while increased guanine-cytosine (GC) content in the spacer is associated with higher CRISPR-interference efficiency, high GC content (>62.5%) decreases CRISPR-interference efficiency. Our findings provide a comprehensive functional profile of Cascade type I-E interference requirements and a method to assay spacer efficacy that can be applied to other CRISPR-Cas systems.

View details for PubMedID 28634160
An atlas of transcriptional, chromatin accessibility, and surface marker changes in human mesoderm development SCIENTIFIC DATA Koh, P. W., Sinha, R., Barkal, A. A., Morganti, R. M., Chen, A., Weissman, I. L., Ang, L. T., Kundaje, A., Loh, K. M. 2016; 3

Abstract

Mesoderm is the developmental precursor to myriad human tissues including bone, heart, and skeletal muscle. Unravelling the molecular events through which these lineages become diversified from one another is integral to developmental biology and understanding changes in cellular fate. To this end, we developed an in vitro system to differentiate human pluripotent stem cells through primitive streak intermediates into paraxial mesoderm and its derivatives (somites, sclerotome, dermomyotome) and separately, into lateral mesoderm and its derivatives (cardiac mesoderm). Whole-population and single-cell analyses of these purified populations of human mesoderm lineages through RNA-seq, ATAC-seq, and high-throughput surface marker screens illustrated how transcriptional changes co-occur with changes in open chromatin and surface marker landscapes throughout human mesoderm development. This molecular atlas will facilitate study of human mesoderm development (which cannot be interrogated in vivo due to restrictions on human embryo studies) and provides a broad resource for the study of gene regulation in development at the single-cell level, knowledge that might one day be exploited for regenerative medicine.

View details for DOI 10.1038/sdata.2016.109

View details for PubMedID 27996962
The International Human Epigenome Consortium: A Blueprint for Scientific Collaboration and Discovery CELL Stunnenberg, H. G., Hirst, M., Int Human Epigenome Consortium 2016; 167 (5): 1145-1149

Abstract

The International Human Epigenome Consortium (IHEC) coordinates the generation of a catalog of high-resolution reference epigenomes of major primary human cell types. The studies now presented (see the Cell Press IHEC web portal at http://www.cell.com/consortium/IHEC) highlight the coordinated achievements of IHEC teams to gather and interpret comprehensive epigenomic datasets to gain insights in the epigenetic control of cell states relevant for human health and disease.

View details for DOI 10.1016/j.cell.2016.11.007

View details for Web of Science ID 000389470100004

View details for PubMedID 27863232
Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nature genetics Corces, M. R., Buenrostro, J. D., Wu, B., Greenside, P. G., Chan, S. M., Koenig, J. L., Snyder, M. P., Pritchard, J. K., Kundaje, A., Greenleaf, W. J., Majeti, R., Chang, H. Y. 2016; 48 (10): 1193-1203

Abstract

We define the chromatin accessibility and transcriptional landscapes in 13 human primary blood cell types that span the hematopoietic hierarchy. Exploiting the finding that the enhancer landscape better reflects cell identity than mRNA levels, we enable 'enhancer cytometry' for enumeration of pure cell types from complex populations. We identify regulators governing hematopoietic differentiation and further show the lineage ontogeny of genetic elements linked to diverse human diseases. In acute myeloid leukemia (AML), chromatin accessibility uncovers unique regulatory evolution in cancer cells with a progressively increasing mutation burden. Single AML cells exhibit distinctive mixed regulome profiles corresponding to disparate developmental stages. A method to account for this regulatory heterogeneity identified cancer-specific deviations and implicated HOX factors as key regulators of preleukemic hematopoietic stem cell characteristics. Thus, regulome dynamics can provide diverse insights into hematopoietic development and disease.

View details for DOI 10.1038/ng.3646

View details for PubMedID 27526324
Characterization of the direct targets of FOXO transcription factors throughout evolution. Aging cell Webb, A. E., Kundaje, A., Brunet, A. 2016; 15 (4): 673-685

Abstract

FOXO transcription factors (FOXOs) are central regulators of lifespan across species, yet they also have cell-specific functions, including adult stem cell homeostasis and immune function. Direct targets of FOXOs have been identified genome-wide in several species and cell types. However, whether FOXO targets are specific to cell types and species or conserved across cell types and throughout evolution remains uncharacterized. Here, we perform a meta-analysis of direct FOXO targets across tissues and organisms, using data from mammals as well as Caenorhabditis elegans and Drosophila. We show that FOXOs bind cell type-specific targets, which have functions related to that particular cell. Interestingly, FOXOs also share targets across different tissues in mammals, and the function and even the identity of these shared mammalian targets are conserved in invertebrates. Evolutionarily conserved targets show enrichment for growth factor signaling, metabolism, stress resistance, and proteostasis, suggesting an ancestral, conserved role in the regulation of these processes. We also identify candidate cofactors at conserved FOXO targets that change in expression with age, including CREB and ETS family factors. This meta-analysis provides insight into the evolution of the FOXO network and highlights downstream genes and cofactors that may be particularly important for FOXO's conserved function in adult homeostasis and longevity.

View details for DOI 10.1111/acel.12479

View details for PubMedID 27061590
Mapping the Pairwise Choices Leading from Pluripotency to Human Bone, Heart, and Other Mesoderm Cell Types CELL Loh, K. M., Chen, A., Koh, P. W., Deng, T. Z., Sinha, R., Tsai, J. M., Barkal, A. A., Shen, K. Y., Jain, R., Morganti, R. M., Shyh-Chang, N., Fernhoff, N. B., George, B. M., Wernig, G., Salomon, R. E., Chen, Z., Vogel, H., Epstein, J. A., Kundaje, A., Talbot, W. S., Beachy, P. A., Ang, L. T., Weissman, I. L. 2016; 166 (2): 451-467

Abstract

Stem-cell differentiation to desired lineages requires navigating alternating developmental paths that often lead to unwanted cell types. Hence, comprehensive developmental roadmaps are crucial to channel stem-cell differentiation toward desired fates. To this end, here, we map bifurcating lineage choices leading from pluripotency to 12 human mesodermal lineages, including bone, muscle, and heart. We defined the extrinsic signals controlling each binary lineage decision, enabling us to logically block differentiation toward unwanted fates and rapidly steer pluripotent stem cells toward 80%-99% pure human mesodermal lineages at most branchpoints. This strategy enabled the generation of human bone and heart progenitors that could engraft in respective in vivo models. Mapping stepwise chromatin and single-cell gene expression changes in mesoderm development uncovered somite segmentation, a previously unobservable human embryonic event transiently marked by HOPX expression. Collectively, this roadmap enables navigation of mesodermal development to produce transplantable human tissue progenitors and uncover developmental processes.

View details for DOI 10.1016/j.cell.2016.06.011

View details for PubMedID 27419872
Using functional data from Roadmap Epigenomics to inform analysis of rare variants linked to gene expression in a large colorectal cancer study Bien, S. A., Harrison, T. A., Auer, P. L., Qu, F., Huyghe, J., Banbury, B., Greenside, P., Abecasis, G. R., Berndt, S. I., Bezieau, S., Brenner, H., Casey, G., Chan, A. T., Chang-Claude, J., Chen, S., Smith, J. D., Le Marchand, L., Carlson, C., Newcomb, P. A., Fuchsberger, C., Slattery, M. L., Kang, H. M., White, E., Potter, J., Gallinger, S. J., Hoffmeister, M., Gruber, S. B., Nickerson, D. A., Peters, U., Kundaje, A., Hsu, L. AMER ASSOC CANCER RESEARCH. 2016

View details for DOI 10.1158/1538-7445.AM2016-4489

View details for Web of Science ID 000389941705145
Impact of the X Chromosome and sex on regulatory variation GENOME RESEARCH Kukurba, K. R., Parsana, P., Balliu, B., Smith, K. S., Zappala, Z., Knowles, D. A., Fave, M., Davis, J. R., Li, X., Zhu, X., Potash, J. B., Weissman, M. M., Shi, J., Kundaje, A., Levinson, D. F., Awadalla, P., Mostafavi, S., Battle, A., Montgomery, S. B. 2016; 26 (6): 768-777

Abstract

The X Chromosome, with its unique mode of inheritance, contributes to differences between the sexes at a molecular level, including sex-specific gene expression and sex-specific impact of genetic variation. Improving our understanding of these differences offers to elucidate the molecular mechanisms underlying sex-specific traits and diseases. However, to date, most studies have either ignored the X Chromosome or had insufficient power to test for the sex-specific impact of genetic variation. By analyzing whole blood transcriptomes of 922 individuals, we have conducted the first large-scale, genome-wide analysis of the impact of both sex and genetic variation on patterns of gene expression, including comparison between the X Chromosome and autosomes. We identified a depletion of expression quantitative trait loci (eQTL) on the X Chromosome, especially among genes under high selective constraint. In contrast, we discovered an enrichment of sex-specific regulatory variants on the X Chromosome. To resolve the molecular mechanisms underlying such effects, we generated chromatin accessibility data through ATAC-sequencing to connect sex-specific chromatin accessibility to sex-specific patterns of expression and regulatory variation. As sex-specific regulatory variants discovered in our study can inform sex differences in heritable disease prevalence, we integrated our data with genome-wide association study data for multiple immune traits identifying several traits with significant sex biases in genetic susceptibilities. Together, our study provides genome-wide insight into how genetic variation, the X Chromosome, and sex shape human gene regulation and disease.

View details for DOI 10.1101/gr.197897.115

View details for PubMedID 27197214
An Arntl2-Driven Secretome Enables Lung Adenocarcinoma Metastatic Self-Sufficiency CANCER CELL Brady, J. J., Chuang, C., Greenside, P. G., Rogers, Z. N., Murray, C. W., Caswell, D. R., Hartmann, U., Connolly, A. J., Sweet-Cordero, E. A., Kundaje, A., Winslow, M. M. 2016; 29 (5): 697-710

Abstract

The ability of cancer cells to establish lethal metastatic lesions requires the survival and expansion of single cancer cells at distant sites. The factors controlling the clonal growth ability of individual cancer cells remain poorly understood. Here, we show that high expression of the transcription factor ARNTL2 predicts poor lung adenocarcinoma patient outcome. Arntl2 is required for metastatic ability in vivo and clonal growth in cell culture. Arntl2 drives metastatic self-sufficiency by orchestrating the expression of a complex pro-metastatic secretome. We identify Clock as an Arntl2 partner and functionally validate the matricellular protein Smoc2 as a pro-metastatic secreted factor. These findings shed light on the molecular mechanisms that enable single cancer cells to form allochthonous tumors in foreign tissue environments.

View details for DOI 10.1016/j.ccell.2016.03.003

View details for PubMedID 27150038
Unsupervised Learning from Noisy Networks with Applications to Hi-C Data Wang, B., Zhu, J., Ursu, O., Pourshafeie, A., Batzoglou, S., Kundaje, A. edited by Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2016

View details for Web of Science ID 000458973702038
Regulatory analysis of the C. elegans genome with spatiotemporal resolution (vol 512, pg 400, 2014) NATURE Araya, C. L., Kawli, T., Kundaje, A., Jiang, L., Wu, B., Vafeados, D., Terrell, R., Weissdepp, P., Gevirtzman, L., Mace, D., Niu, W., Boyle, A. P., Xie, D., Ma, L., Murray, J. I., Reinke, V., Waterston, R. H., Snyder, M. 2015; 528 (7580): 152

View details for DOI 10.1038/nature16075

View details for Web of Science ID 000365606000069

View details for PubMedID 26560031
H3K4me3 Breadth Is Linked to Cell Identity and Transcriptional Consistency (vol 158, pg 673, 2014) CELL Benayoun, B. A., Pollina, E. A., Ucar, D., Mahmoudi, S., Karra, K., Wong, E. D., Devarajan, K., Daugherty, A. C., Kundaje, A. B., Mancini, E., Hitz, B. C., Gupta, R., Rando, T. A., Baker, J. C., Snyder, M. P., Cherry, J., Brunet, A. 2015; 163 (5): 1281-U264

View details for DOI 10.1016/j.cell.2015.10.051

View details for Web of Science ID 000366044700024

View details for PubMedID 28930648
Characterization of TCF21 Downstream Target Regions Identifies a Transcriptional Network Linking Multiple Independent Coronary Artery Disease Loci. PLoS genetics Sazonova, O., Zhao, Y., Nürnberg, S., Miller, C., Pjanic, M., Castano, V. G., Kim, J. B., Salfati, E. L., Kundaje, A. B., Bejerano, G., Assimes, T., Yang, X., Quertermous, T. 2015; 11 (5)

Abstract

To functionally link coronary artery disease (CAD) causal genes identified by genome wide association studies (GWAS), and to investigate the cellular and molecular mechanisms of atherosclerosis, we have used chromatin immunoprecipitation sequencing (ChIP-Seq) with the CAD associated transcription factor TCF21 in human coronary artery smooth muscle cells (HCASMC). Analysis of identified TCF21 target genes for enrichment of molecular and cellular annotation terms identified processes relevant to CAD pathophysiology, including "growth factor binding," "matrix interaction," and "smooth muscle contraction." We characterized the canonical binding sequence for TCF21 as CAGCTG, identified AP-1 binding sites in TCF21 peaks, and by conducting ChIP-Seq for JUN and JUND in HCASMC confirmed that there is significant overlap between TCF21 and AP-1 binding loci in this cell type. Expression quantitative trait variation mapped to target genes of TCF21 was significantly enriched among variants with low P-values in the GWAS analyses, suggesting a possible functional interaction between TCF21 binding and causal variants in other CAD disease loci. Separate enrichment analyses found over-representation of TCF21 target genes among CAD associated genes, and linkage disequilibrium between TCF21 peak variation and that found in GWAS loci, consistent with the hypothesis that TCF21 may affect disease risk through interaction with other disease associated loci. Interestingly, enrichment for TCF21 target genes was also found among other genome wide association phenotypes, including height and inflammatory bowel disease, suggesting a functional profile important for basic cellular processes in non-vascular tissues. Thus, data and analyses presented here suggest that study of GWAS transcription factors may be a highly useful approach to identifying disease gene interactions and thus pathways that may be relevant to complex disease etiology.

View details for DOI 10.1371/journal.pgen.1005202

View details for PubMedID 26020271
Fine mapping of type 1 diabetes susceptibility loci and evidence for colocalization of causal variants with lymphoid gene enhancers. Nature genetics Onengut-Gumuscu, S., Chen, W., Burren, O., Cooper, N. J., Quinlan, A. R., Mychaleckyj, J. C., Farber, E., Bonnie, J. K., Szpak, M., Schofield, E., Achuthan, P., Guo, H., Fortune, M. D., Stevens, H., Walker, N. M., Ward, L. D., Kundaje, A., Kellis, M., Daly, M. J., Barrett, J. C., Cooper, J. D., Deloukas, P., Todd, J. A., Wallace, C., Concannon, P., Rich, S. S. 2015; 47 (4): 381-386

Abstract

Genetic studies of type 1 diabetes (T1D) have identified 50 susceptibility regions, finding major pathways contributing to risk, with some loci shared across immune disorders. To make genetic comparisons across autoimmune disorders as informative as possible, a dense genotyping array, the Immunochip, was developed, from which we identified four new T1D-associated regions (P < 5 × 10^-8). A comparative analysis with 15 immune diseases showed that T1D is more similar genetically to other autoantibody-positive diseases, significantly most similar to juvenile idiopathic arthritis and significantly least similar to ulcerative colitis, and provided support for three additional new T1D risk loci. Using a Bayesian approach, we defined credible sets for the T1D-associated SNPs. The associated SNPs localized to enhancer sequences active in thymus, T and B cells, and CD34+ stem cells. Enhancer-promoter interactions can now be analyzed in these cell types to identify which particular genes and regulatory sequences are causal.

View details for DOI 10.1038/ng.3245

View details for PubMedID 25751624

View details for Web of Science ID 000351922900014

View details for PubMedCentralID PMC4380767
Reassessment of Piwi Binding to the Genome and Piwi Impact on RNA Polymerase II Distribution DEVELOPMENTAL CELL Lin, H., Chen, M., Kundaje, A., Valouev, A., Yin, H., Liu, N., Neuenkirchen, N., Zhong, M., Snyder, M. 2015; 32 (6): 772-774

Abstract

Drosophila Piwi was reported by Huang et al. (2013) to be guided by piRNAs to piRNA-complementary sites in the genome, which then recruits heterochromatin protein 1a and histone methyltransferase Su(Var)3-9 to the sites. Among additional findings, Huang et al. (2013) also reported Piwi binding sites in the genome and the reduction of RNA polymerase II in euchromatin but its increase in pericentric regions in piwi mutants. Marinov et al. (2015) disputed the validity of the Huang et al. bioinformatic pipeline that led to the last two claims. Here we report our independent reanalysis of the data using current bioinformatic methods. Our reanalysis agrees with Marinov et al. (2015) that Piwi's genomic targets still remain to be identified but confirms the Huang et al. claim that Piwi influences RNA polymerase II distribution in the genome. This Matters Arising Response addresses the Marinov et al. (2015) Matters Arising, published concurrently in this issue of Developmental Cell.

View details for DOI 10.1016/j.devcel.2015.03.004

View details for PubMedID 25805139
A comparative encyclopedia of DNA elements in the mouse genome NATURE Yue, F., Cheng, Y., Breschi, A., Vierstra, J., Wu, W., Ryba, T., Sandstrom, R., Ma, Z., Davis, C., Pope, B. D., Shen, Y., Pervouchine, D. D., Djebali, S., Thurman, R. E., Kaul, R., Rynes, E., Kirilusha, A., Marinov, G. K., Williams, B. A., Trout, D., Amrhein, H., Fisher-Aylor, K., Antoshechkin, I., DeSalvo, G., See, L., Fastuca, M., Drenkow, J., Zaleski, C., Dobin, A., Prieto, P., Lagarde, J., Bussotti, G., Tanzer, A., Denas, O., Li, K., Bender, M. A., Zhang, M., Byron, R., Groudine, M. T., McCleary, D., Pham, L., Ye, Z., Kuan, S., Edsall, L., Wu, Y., Rasmussen, M. D., Bansal, M. S., Kellis, M., Keller, C. A., Morrissey, C. S., Mishra, T., Jain, D., Dogan, N., Harris, R. S., Cayting, P., Kawli, T., Boyle, A. P., Euskirchen, G., Kundaje, A., Lin, S., Lin, Y., Jansen, C., Malladi, V. S., Cline, M. S., Erickson, D. T., Kirkup, V. M., Learned, K., Sloan, C. A., Rosenbloom, K. R., De Sousa, B. L., Beal, K., Pignatelli, M., Flicek, P., Lian, J., Kahveci, T., Lee, D., Kent, W. J., Santos, M. R., Herrero, J., Notredame, C., Johnson, A., Vong, S., Lee, K., Bates, D., Neri, F., Diegel, M., Canfield, T., Sabo, P. J., Wilken, M. S., Reh, T. A., Giste, E., Shafer, A., Kutyavin, T., Haugen, E., Dunn, D., Reynolds, A. P., Neph, S., Humbert, R., Hansen, R. S., de Bruijn, M., Selleri, L., Rudensky, A., Josefowicz, S., Samstein, R., Eichler, E. E., Orkin, S. H., Levasseur, D., Papayannopoulou, T., Chang, K., Skoultchi, A., Gosh, S., Disteche, C., Treuting, P., Wang, Y., Weiss, M. J., Blobel, G. A., Cao, X., Zhong, S., Wang, T., Good, P. J., Lowdon, R. F., Adams, L. B., Zhou, X., Pazin, M. J., Feingold, E. A., Wold, B., Taylor, J., Mortazavi, A., Weissman, S. M., Stamatoyannopoulos, J. A., Snyder, M. P., Guigo, R., Gingeras, T. R., Gilbert, D. M., Hardison, R. C., Beer, M. A., Ren, B. 2014; 515 (7527): 355-?

Abstract

The laboratory mouse shares the majority of its protein-coding genes with humans, making it the premier model organism in biomedical research, yet the two mammals differ in significant ways. To gain greater insights into both shared and species-specific transcriptional and cellular regulatory programs in the mouse, the Mouse ENCODE Consortium has mapped transcription, DNase I hypersensitivity, transcription factor binding, chromatin modifications and replication domains throughout the mouse genome in diverse cell and tissue types. By comparing with the human genome, we not only confirm substantial conservation in the newly annotated potential functional sequences, but also find a large degree of divergence of sequences involved in transcriptional regulation, chromatin state and higher order chromatin organization. Our results illuminate the wide range of evolutionary forces acting on genes and their regulatory regions, and provide a general resource for research into mammalian biology and mechanisms of human diseases.

View details for DOI 10.1038/nature13992

View details for Web of Science ID 000345770600034
Principles of regulatory information conservation between mouse and human. Nature Cheng, Y., Ma, Z., Kim, B. H., Wu, W., Cayting, P., Boyle, A. P., Sundaram, V., Xing, X., Dogan, N., Li, J., Euskirchen, G., Lin, S., Lin, Y., Visel, A., Kawli, T., Yang, X., Patacsil, D., Keller, C. A., Giardine, B., Kundaje, A., Wang, T., Pennacchio, L. A., Weng, Z., Hardison, R. C., Snyder, M. P. 2014; 515 (7527): 371-5

Abstract

To broaden our understanding of the evolution of gene regulation mechanisms, we generated occupancy profiles for 34 orthologous transcription factors (TFs) in human-mouse erythroid progenitor, lymphoblast and embryonic stem-cell lines. By combining the genome-wide transcription factor occupancy repertoires, associated epigenetic signals, and co-association patterns, here we deduce several evolutionary principles of gene regulatory features operating since the mouse and human lineages diverged. The genomic distribution profiles, primary binding motifs, chromatin states, and DNA methylation preferences are well conserved for TF-occupied sequences. However, the extent to which orthologous DNA segments are bound by orthologous TFs varies both among TFs and with genomic location: binding at promoters is more highly conserved than binding at distal elements. Notably, occupancy-conserved TF-occupied sequences tend to be pleiotropic; they function in several tissues and also co-associate with many TFs. Single nucleotide variants at sites with potential regulatory functions are enriched in occupancy-conserved TF-occupied sequences.

View details for DOI 10.1038/nature13985

View details for PubMedID 25409826

View details for PubMedCentralID PMC4343047

View details for Web of Science ID 000345770600036
Transcription Factors Bind Negatively Selected Sites within Human mtDNA Genes GENOME BIOLOGY AND EVOLUTION Blumberg, A., Sailaja, B. S., Kundaje, A., Levin, L., Dadon, S., Shmorak, S., Shaulian, E., Meshorer, E., Mishmar, D. 2014; 6 (10): 2634-2646

Abstract

Transcription of mitochondrial DNA (mtDNA)-encoded genes is thought to be regulated by a handful of dedicated transcription factors (TFs), suggesting that mtDNA genes are separately regulated from the nucleus. However, several TFs, with known nuclear activities, were found to bind mtDNA and regulate mitochondrial transcription. Additionally, mtDNA transcriptional regulatory elements, which were proved important in vitro, were harbored by a deletion that normally segregated among healthy individuals. Hence, mtDNA transcriptional regulation is more complex than once thought. Here, by analyzing ENCODE chromatin immunoprecipitation sequencing (ChIP-seq) data, we identified strong binding sites of three bona fide nuclear TFs (c-Jun, Jun-D, and CEBPb) within human mtDNA protein-coding genes. We validated the binding of two TFs by ChIP-quantitative polymerase chain reaction (c-Jun and Jun-D) and showed their mitochondrial localization by electron microscopy and subcellular fractionation. As a step toward investigating the functionality of these TF-binding sites (TFBS), we assessed signatures of selection. By analyzing 9,868 human mtDNA sequences encompassing all major global populations, we recorded genetic variants in tips and nodes of mtDNA phylogeny within the TFBS. We next calculated the effects of variants on binding motif prediction scores. Finally, the mtDNA variation pattern in predicted TFBS, occurring within ChIP-seq negative-binding sites, was compared with ChIP-seq positive-TFBS (CPR). Motifs within CPRs of c-Jun, Jun-D, and CEBPb harbored either only tip variants or their nodal variants retained high motif prediction scores. This reflects negative selection within mtDNA CPRs, thus supporting their functionality. Hence, human mtDNA-coding sequences may have dual roles, namely coding for genes yet possibly also possessing regulatory potential.

View details for DOI 10.1093/gbe/evu210

View details for PubMedID 25245407
Comparative analysis of regulatory information and circuits across distant species. Nature Boyle, A. P., Araya, C. L., Brdlik, C., Cayting, P., Cheng, C., Cheng, Y., Gardner, K., Hillier, L. W., Janette, J., Jiang, L., Kasper, D., Kawli, T., Kheradpour, P., Kundaje, A., Li, J. J., Ma, L., Niu, W., Rehm, E. J., Rozowsky, J., Slattery, M., Spokony, R., Terrell, R., Vafeados, D., Wang, D., Weisdepp, P., Wu, Y., Xie, D., Yan, K., Feingold, E. A., Good, P. J., Pazin, M. J., Huang, H., Bickel, P. J., Brenner, S. E., Reinke, V., Waterston, R. H., Gerstein, M., White, K. P., Kellis, M., Snyder, M. 2014; 512 (7515): 453-456

Abstract

Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.

View details for DOI 10.1038/nature13668

View details for PubMedID 25164757
Comparative analysis of metazoan chromatin organization. Nature Ho, J. W., Jung, Y. L., Liu, T., Alver, B. H., Lee, S., Ikegami, K., Sohn, K., Minoda, A., Tolstorukov, M. Y., Appert, A., Parker, S. C., Gu, T., Kundaje, A., Riddle, N. C., Bishop, E., Egelhofer, T. A., Hu, S. S., Alekseyenko, A. A., Rechtsteiner, A., Asker, D., Belsky, J. A., Bowman, S. K., Chen, Q. B., Chen, R. A., Day, D. S., Dong, Y., Dose, A. C., Duan, X., Epstein, C. B., Ercan, S., Feingold, E. A., Ferrari, F., Garrigues, J. M., Gehlenborg, N., Good, P. J., Haseley, P., He, D., Herrmann, M., Hoffman, M. M., Jeffers, T. E., Kharchenko, P. V., Kolasinska-Zwierz, P., Kotwaliwale, C. V., Kumar, N., Langley, S. A., Larschan, E. N., Latorre, I., Libbrecht, M. W., Lin, X., Park, R., Pazin, M. J., Pham, H. N., Plachetka, A., Qin, B., Schwartz, Y. B., Shoresh, N., Stempor, P., Vielle, A., Wang, C., Whittle, C. M., Xue, H., Kingston, R. E., Kim, J. H., Bernstein, B. E., Dernburg, A. F., Pirrotta, V., Kuroda, M. I., Noble, W. S., Tullius, T. D., Kellis, M., MacAlpine, D. M., Strome, S., Elgin, S. C., Liu, X. S., Lieb, J. D., Ahringer, J., Karpen, G. H., Park, P. J. 2014; 512 (7515): 449-452

Abstract

Genome function is dynamically regulated in part by chromatin, which consists of the histones, non-histone proteins and RNA molecules that package DNA. Studies in Caenorhabditis elegans and Drosophila melanogaster have contributed substantially to our understanding of molecular mechanisms of genome function in humans, and have revealed conservation of chromatin components and mechanisms. Nevertheless, the three organisms have markedly different genome sizes, chromosome architecture and gene organization. On human and fly chromosomes, for example, pericentric heterochromatin flanks single centromeres, whereas worm chromosomes have dispersed heterochromatin-like regions enriched in the distal chromosomal 'arms', and centromeres distributed along their lengths. To systematically investigate chromatin organization and associated gene regulation across species, we generated and analysed a large collection of genome-wide chromatin data sets from cell lines and developmental stages in worm, fly and human. Here we present over 800 new data sets from our ENCODE and modENCODE consortia, bringing the total to over 1,400. Comparison of combinatorial patterns of histone modifications, nuclear lamina-associated domains, organization of large-scale topological domains, chromatin environment at promoters and enhancers, nucleosome positioning, and DNA replication patterns reveals many conserved features of chromatin organization among the three organisms. We also find notable differences in the composition and locations of repressive chromatin. These data sets and analyses provide a rich resource for comparative and species-specific investigations of chromatin composition, organization and function.

View details for DOI 10.1038/nature13415

View details for PubMedID 25164756
Regulatory analysis of the C. elegans genome with spatiotemporal resolution. Nature Araya, C. L., Kawli, T., Kundaje, A., Jiang, L., Wu, B., Vafeados, D., Terrell, R., Weissdepp, P., Gevirtzman, L., Mace, D., Niu, W., Boyle, A. P., Xie, D., Ma, L., Murray, J. I., Reinke, V., Waterston, R. H., Snyder, M. 2014; 512 (7515): 400-405

Abstract

Discovering the structure and dynamics of transcriptional regulatory events in the genome with cellular and temporal resolution is crucial to understanding the regulatory underpinnings of development and disease. We determined the genomic distribution of binding sites for 92 transcription factors and regulatory proteins across multiple stages of Caenorhabditis elegans development by performing 241 ChIP-seq (chromatin immunoprecipitation followed by sequencing) experiments. Integration of regulatory binding and cellular-resolution expression data produced a spatiotemporally resolved metazoan transcription factor binding map. Using this map, we explore developmental regulatory circuits that encode combinatorial logic at the levels of co-binding and co-expression of transcription factors, characterizing the genomic coverage and clustering of regulatory binding, the binding preferences of, and biological processes regulated by, transcription factors, the global transcription factor co-associations and genomic subdomains that suggest shared patterns of regulation, and identifying key transcription factors and transcription factor co-associations for fate specification of individual lineages and cell types.

View details for DOI 10.1038/nature13497

View details for PubMedID 25164749
Reply to Brunet and Doolittle: Both selected effect and causal role elements can influence human biology and disease PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Kellis, M., Wold, B., Snyder, M. P., Bernstein, B. E., Kundaje, A., Marinov, G. K., Ward, L. D., Birney, E., Crawford, G. E., Dekker, J., Dunham, I., Elnitski, L. L., Farnham, P. J., Feingold, E. A., Gerstein, M., Giddings, M. C., Gilbert, D. M., Gingeras, T. R., Green, E. D., Guigo, R., Hubbard, T., Kent, J., Lieb, J. D., Myers, R. M., Pazin, M. J., Ren, B., Stamatoyannopoulos, J., Weng, Z., White, K. P., Hardison, R. C. 2014; 111 (33): E3366-E3366

View details for DOI 10.1073/pnas.1410434111

View details for Web of Science ID 000340438800004

View details for PubMedID 25275169

View details for PubMedCentralID PMC4143047
H3K4me3 Breadth Is Linked to Cell Identity and Transcriptional Consistency. Cell Benayoun, B. A., Pollina, E. A., Ucar, D., Mahmoudi, S., Karra, K., Wong, E. D., Devarajan, K., Daugherty, A. C., Kundaje, A. B., Mancini, E., Hitz, B. C., Gupta, R., Rando, T. A., Baker, J. C., Snyder, M. P., Cherry, J. M., Brunet, A. 2014; 158 (3): 673-688

Abstract

Trimethylation of histone H3 at lysine 4 (H3K4me3) is a chromatin modification known to mark the transcription start sites of active genes. Here, we show that H3K4me3 domains that spread more broadly over genes in a given cell type preferentially mark genes that are essential for the identity and function of that cell type. Using the broadest H3K4me3 domains as a discovery tool in neural progenitor cells, we identify novel regulators of these cells. Machine learning models reveal that the broadest H3K4me3 domains represent a distinct entity, characterized by increased marks of elongation. The broadest H3K4me3 domains also have more paused polymerase at their promoters, suggesting a unique transcriptional output. Indeed, genes marked by the broadest H3K4me3 domains exhibit enhanced transcriptional consistency and increased transcriptional levels, and perturbation of H3K4me3 breadth leads to changes in transcriptional consistency. Thus, H3K4me3 breadth contains information that could ensure transcriptional precision at key cell identity/function genes.

View details for DOI 10.1016/j.cell.2014.06.027

View details for PubMedID 25083876
Diverse patterns of genomic targeting by transcriptional regulators in Drosophila melanogaster GENOME RESEARCH Slattery, M., Ma, L., Spokony, R. F., Arthur, R. K., Kheradpour, P., Kundaje, A., Negre, N., Crofts, A., Ptashkin, R., Zieba, J., Ostapenko, A., Suchy, S., Victorsen, A., Jameel, N., Grundstad, A., Gao, W., Moran, J. R., Rehm, E., Grossman, R. L., Kellis, M., White, K. P. 2014; 24 (7): 1224-1235

Abstract

Annotation of regulatory elements and identification of the transcription-related factors (TRFs) targeting these elements are key steps in understanding how cells interpret their genetic blueprint and their environment during development, and how that process goes awry in the case of disease. One goal of the modENCODE (model organism ENCyclopedia of DNA Elements) Project is to survey a diverse sampling of TRFs, both DNA-binding and non-DNA-binding factors, to provide a framework for the subsequent study of the mechanisms by which transcriptional regulators target the genome. Here we provide an updated map of the Drosophila melanogaster regulatory genome based on the location of 84 TRFs at various stages of development. This regulatory map reveals a variety of genomic targeting patterns, including factors with strong preferences toward proximal promoter binding, factors that target intergenic and intronic DNA, and factors with distinct chromatin state preferences. The data also highlight the stringency of the Polycomb regulatory network, and show association of the Trithorax-like (Trl) protein with hotspots of DNA binding throughout development. Furthermore, the data identify more than 5800 instances in which TRFs target DNA regions with demonstrated enhancer activity. Regions of high TRF co-occupancy are more likely to be associated with open enhancers used across cell types, while lower TRF occupancy regions are associated with complex enhancers that are also regulated at the epigenetic level. Together these data serve as a resource for the research community in the continued effort to dissect transcriptional regulatory mechanisms directing Drosophila development.

View details for DOI 10.1101/gr.168807.113

View details for Web of Science ID 000338185000015

View details for PubMedID 24985916

View details for PubMedCentralID PMC4079976
Defining functional DNA elements in the human genome PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Kellis, M., Wold, B., Snyder, M. P., Bernstein, B. E., Kundaje, A., Marinov, G. K., Ward, L. D., Birney, E., Crawford, G. E., Dekker, J., Dunham, I., Elnitski, L. L., Farnham, P. J., Feingold, E. A., Gerstein, M., Giddings, M. C., Gilbert, D. M., Gingeras, T. R., Green, E. D., Guigo, R., Hubbard, T., Kent, J., Lieb, J. D., Myers, R. M., Pazin, M. J., Ren, B., Stamatoyannopoulos, J. A., Weng, Z., White, K. P., Hardison, R. C. 2014; 111 (17): 6131-6138

Abstract

With the completion of the human genome sequence, attention turned to identifying and annotating its functional DNA elements. As a complement to genetic and comparative genomics approaches, the Encyclopedia of DNA Elements Project was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types. The resulting genome-wide data reveal sites of biochemical activity with high positional resolution and cell type specificity that facilitate studies of gene regulation and interpretation of noncoding variants associated with human disease. However, the biochemically active regions cover a much larger fraction of the genome than do evolutionarily conserved regions, raising the question of whether nonconserved but biochemically active regions are truly functional. Here, we review the strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies. We also analyze the relationship between signal intensity, genomic coverage, and evolutionary conservation. Our results reinforce the principle that each approach provides complementary information and that we need to use combinations of all three to elucidate genome function in human biology and disease.

View details for DOI 10.1073/pnas.1318948111

View details for Web of Science ID 000335199000025

View details for PubMedID 24753594

View details for PubMedCentralID PMC4035993
Large-Scale Quality Analysis of Published ChIP-seq Data. G3 (Bethesda, Md.) Marinov, G. K., Kundaje, A., Park, P. J., Wold, B. J. 2014; 4 (2): 209-223

Abstract

ChIP-seq has become the primary method for identifying in vivo protein-DNA interactions on a genome-wide scale, with nearly 800 publications involving the technique appearing in PubMed as of December 2012. Individually and in aggregate, these data are an important and information-rich resource. However, uncertainties about data quality confound their use by the wider research community. Recently, the Encyclopedia of DNA Elements (ENCODE) project developed and applied metrics to objectively measure ChIP-seq data quality. The ENCODE quality analysis was useful for flagging datasets for closer inspection, eliminating or replacing poor data, and for driving changes in experimental pipelines. There had been no similarly systematic quality analysis of the large and disparate body of published ChIP-seq profiles. Here, we report a uniform analysis of vertebrate transcription factor ChIP-seq datasets in the Gene Expression Omnibus (GEO) repository as of April 1, 2012. The majority (55%) of datasets scored as being highly successful, but a substantial minority (20%) were of apparently poor quality, and another ∼25% were of intermediate quality. We discuss how different uses of ChIP-seq data are affected by specific aspects of data quality, and we highlight exceptional instances for which the metric values should not be taken at face value. Unexpectedly, we discovered that a significant subset of control datasets (i.e., no immunoprecipitation and mock immunoprecipitation samples) display an enrichment structure similar to successful ChIP-seq data. This can, in turn, affect peak calling and data interpretation. Published datasets identified here as high-quality comprise a large group that users can draw on for large-scale integrated analysis. In the future, ChIP-seq quality assessment similar to that used here could guide experimentalists at early stages in a study, provide useful input in the publication process, and be used to stratify ChIP-seq data for different community-wide uses.

View details for DOI 10.1534/g3.113.008680

View details for PubMedID 24347632

View details for PubMedCentralID PMC3931556
STAT3 Targets Suggest Mechanisms of Aggressive Tumorigenesis in Diffuse Large B-Cell Lymphoma G3-GENES GENOMES GENETICS Hardee, J., Ouyang, Z., Zhang, Y., Kundaje, A., Lacroute, P., Snyder, M. 2013; 3 (12): 2173-2185

Abstract

The signal transducer and activator of transcription 3 (STAT3) is a transcription factor that, when dysregulated, becomes a powerful oncogene found in many human cancers, including diffuse large B-cell lymphoma. Diffuse large B-cell lymphoma is the most common form of non-Hodgkin's lymphoma and has two major subtypes: germinal center B-cell-like and activated B-cell-like. Compared with the germinal center B-cell-like form, activated B-cell-like lymphomas respond much more poorly to current therapies and often exhibit overexpression or overactivation of STAT3. To investigate how STAT3 might contribute to this aggressive phenotype, we have integrated genome-wide studies of STAT3 DNA binding using chromatin immunoprecipitation-sequencing with whole-transcriptome profiling using RNA-sequencing. STAT3 binding sites are present near almost a third of all genes that differ in expression between the two subtypes, and examination of the affected genes identified previously undetected and clinically significant pathways downstream of STAT3 that drive oncogenesis. Novel treatments aimed at these pathways may increase the survivability of activated B-cell-like diffuse large B-cell lymphoma.

View details for DOI 10.1534/g3.113.007674

View details for PubMedID 24142927
Extensive Variation in Chromatin States Across Humans SCIENCE Kasowski, M., Kyriazopoulou-Panagiotopoulou, S., Grubert, F., Zaugg, J. B., Kundaje, A., Liu, Y., Boyle, A. P., Zhang, Q. C., Zakharia, F., Spacek, D. V., Li, J., Xie, D., Olarerin-George, A., Steinmetz, L. M., Hogenesch, J. B., Kellis, M., Batzoglou, S., Snyder, M. 2013; 342 (6159): 750-752

Abstract

The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.

View details for DOI 10.1126/science.1242510

View details for PubMedID 24136358
Integrative annotation of chromatin elements from ENCODE data NUCLEIC ACIDS RESEARCH Hoffman, M. M., Ernst, J., Wilder, S. P., Kundaje, A., Harris, R. S., Libbrecht, M., Giardine, B., Ellenbogen, P. M., Bilmes, J. A., Birney, E., Hardison, R. C., Dunham, I., Kellis, M., Noble, W. S. 2013; 41 (2): 827-841

Abstract

The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotation of non-coding regulatory elements correlate strongly with mammalian evolutionary constraint, and provide an unbiased approach for evaluating metrics of evolutionary constraint in human. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape.

View details for DOI 10.1093/nar/gks1284

View details for Web of Science ID 000314121100021

View details for PubMedID 23221638

View details for PubMedCentralID PMC3553955
Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors Wang, J., Zhuang, J., Iyer, S., Lin, X., Whitfield, T. W., Greven, M. C., Pierce, B. G., Dong, X., Kundaje, A., Cheng, Y., Rando, O. J., Birney, E., Myers, R. M., Noble, W. S., Snyder, M., Weng, Z. TAYLOR & FRANCIS INC. 2013: 49-50

View details for DOI 10.1080/07391102.2013.786511

View details for Web of Science ID 000320149400077
Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors GENOME RESEARCH Wang, J., Zhuang, J., Iyer, S., Lin, X., Whitfield, T. W., Greven, M. C., Pierce, B. G., Dong, X., Kundaje, A., Cheng, Y., Rando, O. J., Birney, E., Myers, R. M., Noble, W. S., Snyder, M., Weng, Z. 2012; 22 (9): 1798-1812

Abstract

Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has become the dominant technique for mapping transcription factor (TF) binding regions genome-wide. We performed an integrative analysis centered around 457 ChIP-seq data sets on 119 human TFs generated by the ENCODE Consortium. We identified highly enriched sequence motifs in most data sets, revealing new motifs and validating known ones. The motif sites (TF binding sites) are highly conserved evolutionarily and show distinct footprints upon DNase I digestion. We frequently detected secondary motifs in addition to the canonical motifs of the TFs, indicating tethered binding and cobinding between multiple TFs. We observed significant position and orientation preferences between many cobinding TFs. Genes specifically expressed in a cell line are often associated with a greater occurrence of nearby TF binding in that cell line. We observed cell-line-specific secondary motifs that mediate the binding of the histone deacetylase HDAC2 and the enhancer-binding protein EP300. TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions, flanked by well-positioned nucleosomes, and many of these features show cell type specificity. The GC-richness may be beneficial for regulating TF binding because, when unoccupied by a TF, these regions are occupied by nucleosomes in vivo. We present the results of our analysis in a TF-centric web repository Factorbook (http://factorbook.org) and will continually update this repository as more ENCODE data are generated.

View details for DOI 10.1101/gr.139105.112

View details for Web of Science ID 000308272800020

View details for PubMedID 22955990

View details for PubMedCentralID PMC3431495
Long noncoding RNAs are rarely translated in two human cell lines GENOME RESEARCH Banfai, B., Jia, H., Khatun, J., Wood, E., Risk, B., Gundling, W. E., Kundaje, A., Gunawardena, H. P., Yu, Y., Xie, L., Krajewski, K., Strahl, B. D., Chen, X., Bickel, P., Giddings, M. C., Brown, J. B., Lipovich, L. 2012; 22 (9): 1646-1657

Abstract

Data from the Encyclopedia of DNA Elements (ENCODE) project show over 9640 human genome loci classified as long noncoding RNAs (lncRNAs), yet only ~100 have been deeply characterized to determine their role in the cell. To measure the protein-coding output from these RNAs, we jointly analyzed two recent data sets produced in the ENCODE project: tandem mass spectrometry (MS/MS) data mapping expressed peptides to their encoding genomic loci, and RNA-seq data generated by ENCODE in long polyA+ and polyA- fractions in the cell lines K562 and GM12878. We used the machine-learning algorithm RuleFit3 to regress the peptide data against RNA expression data. The most important covariate for predicting translation was, surprisingly, the Cytosol polyA- fraction in both cell lines. LncRNAs are ~13-fold less likely to produce detectable peptides than similar mRNAs, indicating that ~92% of GENCODE v7 lncRNAs are not translated in these two ENCODE cell lines. Intersecting 9640 lncRNA loci with 79,333 peptides yielded 85 unique peptides matching 69 lncRNAs. Most cases were due to a coding transcript misannotated as lncRNA. Two exceptions were an unprocessed pseudogene and a bona fide lncRNA gene, both with open reading frames (ORFs) compromised by upstream stop codons. All potentially translatable lncRNA ORFs had only a single peptide match, indicating low protein abundance and/or false-positive peptide matches. We conclude that with very few exceptions, ribosomes are able to distinguish coding from noncoding transcripts and, hence, that ectopic translation and cryptic mRNAs are rare in the human lncRNAome.

View details for DOI 10.1101/gr.134767.111

View details for Web of Science ID 000308272800007

View details for PubMedID 22955977

View details for PubMedCentralID PMC3431482
Modeling gene expression using chromatin features in various cellular contexts GENOME BIOLOGY Dong, X., Greven, M. C., Kundaje, A., Djebali, S., Brown, J. B., Cheng, C., Gingeras, T. R., Gerstein, M., Guigo, R., Birney, E., Weng, Z. 2012; 13 (9)

Abstract

Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines. We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA. Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.

View details for DOI 10.1186/gb-2012-13-9-r53

View details for Web of Science ID 000313182600006

View details for PubMedID 22950368

View details for PubMedCentralID PMC3491397
Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors GENOME BIOLOGY Yip, K. Y., Cheng, C., Bhardwaj, N., Brown, J. B., Leng, J., Kundaje, A., Rozowsky, J., Birney, E., Bickel, P., Snyder, M., Gerstein, M. 2012; 13 (9)

Abstract

Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors. As part of the consortium effort in providing a concise abstraction of the data for facilitating various types of downstream analyses, we constructed statistical models that capture the genomic features of three paired types of regions by machine-learning methods: firstly, regions with active or inactive binding; secondly, those with extremely high or low degrees of co-binding, termed HOT and LOT regions; and finally, regulatory modules proximal or distal to genes. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, many of which were validated experimentally. We further associated the predicted enhancers with potential target transcripts and the transcription factors involved. For HOT regions, we found a significant fraction of transcription factor binding without clear sequence motifs and showed that this observation could be related to strong DNA accessibility of these regions. Overall, the three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity. Our machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.

View details for DOI 10.1186/gb-2012-13-9-r48

View details for Web of Science ID 000313182600001

View details for PubMedID 22950945

View details for PubMedCentralID PMC3491392
A User's Guide to the Encyclopedia of DNA Elements (ENCODE) PLOS BIOLOGY Myers, R. M., Stamatoyannopoulos, J., Snyder, M., Dunham, I., Hardison, R. C., Bernstein, B. E., Gingeras, T. R., Kent, W. J., Birney, E., Wold, B., Crawford, G. E., Bernstein, B. E., Epstein, C. B., Shoresh, N., Ernst, J., Mikkelsen, T. S., Kheradpour, P., Zhang, X., Wang, L., Issner, R., Coyne, M. J., Durham, T., Ku, M., Thanh Truong, T., Ward, L. D., Altshuler, R. C., Lin, M. F., Kellis, M., Gingeras, T. R., Davis, C. A., Kapranov, P., Dobin, A., Zaleski, C., Schlesinger, F., Batut, P., Chakrabortty, S., Jha, S., Lin, W., Drenkow, J., Wang, H., Bell, K., Gao, H., Bell, I., Dumais, E., Dumais, J., Antonarakis, S. E., Ucla, C., Borel, C., Guigo, R., Djebali, S., Lagarde, J., Kingswood, C., Ribeca, P., Sammeth, M., Alioto, T., Merkel, A., Tilgner, H., Carninci, P., Hayashizaki, Y., Lassmann, T., Takahashi, H., Abdelhamid, R. F., Hannon, G., Fejes-Toth, K., Preall, J., Gordon, A., Sotirova, V., Reymond, A., Howald, C., Graison, E. A., Chrast, J., Ruan, Y., Ruan, X., Shahab, A., Poh, W. T., Wei, C., Crawford, G. E., Furey, T. S., Boyle, A. P., Sheffield, N. C., Song, L., Shibata, Y., Vales, T., Winter, D., Zhang, Z., London, D., Wang, T., Birney, E., Keefe, D., Iyer, V. R., Lee, B., McDaniell, R. M., Liu, Z., Battenhouse, A., Bhinge, A. A., Lieb, J. D., Grasfeder, L. L., Showers, K. A., Giresi, P. G., Kim, S. K., Shestak, C., Myers, R. M., Pauli, F., Reddy, T. E., Gertz, J., Partridge, E. C., Jain, P., Sprouse, R. O., Bansal, A., Pusey, B., Muratet, M. A., Varley, K. E., Bowling, K. M., Newberry, K. M., Nesmith, A. S., Dilocker, J. A., Parker, S. L., Waite, L. L., Thibeault, K., Roberts, K., Absher, D. M., Wold, B., Mortazavi, A., Williams, B., Marinov, G., Trout, D., Pepke, S., King, B., McCue, K., Kirilusha, A., DeSalvo, G., Fisher-Aylor, K., Amrhein, H., Vielmetter, J., Sherlock, G., Sidow, A., Batzoglou, S., Rauch, R., Kundaje, A., Libbrecht, M., Margulies, E. H., Parker, S. 
C., Elnitski, L., Green, E. D., Hubbard, T., Harrow, J., Searle, S., Kokocinski, F., Aken, B., Frankish, A., Hunt, T., Despacio-Reyes, G., Kay, M., Mukherjee, G., Bignell, A., Saunders, G., Boychenko, V., Brent, M., van Baren, M. J., Brown, R. H., Gerstein, M., Khurana, E., Balasubramanian, S., Zhang, Z., Lam, H., Cayting, P., Robilotto, R., Lu, Z., Guigo, R., Derrien, T., Tanzer, A., Knowles, D. G., Mariotti, M., Kent, W. J., Haussler, D., Harte, R., Diekhans, M., Kellis, M., Lin, M., Kheradpour, P., Ernst, J., Reymond, A., Howald, C., Graison, E. A., Chrast, J., Valencia, A., Tress, M., Manuel Rodriguez, J., Snyder, M., Landt, S. G., Raha, D., Shi, M., Euskirchen, G., Grubert, F., Kasowski, M., Lian, J., Cayting, P., Lacroute, P., Xu, Y., Monahan, H., Patacsil, D., Slifer, T., Yang, X., Charos, A., Reed, B., Wu, L., Auerbach, R. K., Habegger, L., Hariharan, M., Rozowsky, J., Abyzov, A., Weissman, S. M., Gerstein, M., Struhl, K., Lamarre-Vincent, N., Lindahl-Allen, M., Miotto, B., Moqtaderi, Z., Fleming, J. D., Newburger, P., Farnham, P. J., Frietze, S., O'Geen, H., Xu, X., Blahnik, K. R., Cao, A. R., Iyengar, S., Stamatoyannopoulos, J. A., Kaul, R., Thurman, R. E., Wang, H., Navas, P. A., Sandstrom, R., Sabo, P. J., Weaver, M., Canfield, T., Lee, K., Neph, S., Roach, V., Reynolds, A., Johnson, A., Rynes, E., Giste, E., Vong, S., Neri, J., Frum, T., Johnson, E. M., Nguyen, E. D., Ebersol, A. K., Sanchez, M. E., Sheffer, H. H., Lotakis, D., Haugen, E., Humbert, R., Kutyavin, T., Shafer, T., Dekker, J., Lajoie, B. R., Sanyal, A., Kent, W. J., Rosenbloom, K. R., Dreszer, T. R., Raney, B. J., Barber, G. P., Meyer, L. R., Sloan, C. A., Malladi, V. S., Cline, M. S., Learned, K., Swing, V. K., Zweig, A. S., Rhead, B., Fujita, P. A., Roskin, K., Karolchik, D., Kuhn, R. M., Haussler, D., Birney, E., Dunham, I., Wilder, S. P., Keefe, D., Sobral, D., Herrero, J., Beal, K., Lukk, M., Brazma, A., Vaquerizas, J. M., Luscombe, N. M., Bickel, P. J., Boley, N., Brown, J. 
B., Li, Q., Huang, H., Gerstein, M., Habegger, L., Sboner, A., Rozowsky, J., Auerbach, R. K., Yip, K. Y., Cheng, C., Yan, K., Bhardwaj, N., Wang, J., Lochovsky, L., Jee, J., Gibson, T., Leng, J., Du, J., Hardison, R. C., Harris, R. S., Song, G., Miller, W., Haussler, D., Roskin, K., Suh, B., Wang, T., Paten, B., Noble, W. S., Hoffman, M. M., Buske, O. J., Weng, Z., Dong, X., Wang, J., Xi, H., Tenenbaum, S. A., Doyle, F., Penalva, L. O., Chittur, S., Tullius, T. D., Parker, S. C., White, K. P., Karmakar, S., Victorsen, A., Jameel, N., Bild, N., Grossman, R. L., Snyder, M., Landt, S. G., Yang, X., Patacsil, D., Slifer, T., Dekker, J., Lajoie, B. R., Sanyal, A., Weng, Z., Whitfield, T. W., Wang, J., Collins, P. J., Trinklein, N. D., Partridge, E. C., Myers, R. M., Giddings, M. C., Chen, X., Khatun, J., Maier, C., Yu, Y., Gunawardena, H., Risk, B., Feingold, E. A., Lowdon, R. F., Dillon, L. A., Good, P. J. 2011; 9 (4)

Abstract

The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.

View details for DOI 10.1371/journal.pbio.1001046

View details for Web of Science ID 000289938900014
CP motifs, Hap1 and heme signaling Zhang, L., Leslie, C., Lee, H. C., Kundaje, A., Ie, E., Xin, X., Freund, Y., MEDIMOND MEDIMOND S R L. 2007: 45-+

View details for Web of Science ID 000251734400007
A classification-based framework for predicting and analyzing gene regulatory response NIPS Workshop on New Problems and Methods in Computational Biology Kundaje, A., Middendorf, M., Shah, M., Wiggins, C. H., Freund, Y., Leslie, C. BIOMED CENTRAL LTD. 2006

Abstract

We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem--predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the Adaboost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree. In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a more shallow and interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data. Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast--the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors--and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code is available for download from http://www.cs.columbia.edu/compbio/robust-geneclass.

View details for DOI 10.1186/1471-2105-7-S1-S5

View details for Web of Science ID 000236765200005

View details for PubMedID 16723008

View details for PubMedCentralID PMC1810316
Statistical analysis of MPSS measurements: Application to the study of LPS-activated macrophage gene expression PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Stolovitzky, G. A., Kundaje, A., Held, G. A., Duggar, K. H., Haudenschild, C. D., Zhou, D., Vasicek, T. J., Smith, K. D., Aderem, A., Roach, J. C. 2005; 102 (5): 1402-1407

Abstract

Massively Parallel Signature Sequencing (MPSS), a recently developed high-throughput transcription profiling technology, has the ability to profile almost every transcript in a sample without requiring prior knowledge of the sequence of the transcribed genes. As is the case with DNA microarrays, effective data analysis depends crucially on understanding how noise affects measurements. We analyze the sources of noise in MPSS and present a quantitative model describing the variability between replicate MPSS assays. We use this model to construct statistical hypotheses that test whether an observed change in gene expression in a pair-wise comparison is significant. This analysis is then extended to the determination of the significance of changes in expression levels measured over the course of a time series of measurements. We apply these analytic techniques to the study of a time series of MPSS gene expression measurements on LPS-stimulated macrophages. To evaluate our statistical significance metrics, we compare our results with published data on macrophage activation measured by using Affymetrix GeneChips.

View details for DOI 10.1073/pnas.0406555102

View details for Web of Science ID 000226877300029

View details for PubMedID 15668391

View details for PubMedCentralID PMC547838
Motif discovery through predictive modeling of gene regulation 9th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2005) Middendorf, M., Kundaje, A., Shah, M., Freund, Y., Wiggins, C. H., Leslie, C. SPRINGER-VERLAG BERLIN. 2005: 538–552

View details for Web of Science ID 000229741100041
Predicting genetic regulatory response using classification: Yeast stress response 1st Annual RECOMB Satellite Workshop on Regulatory Genomics Middendorf, M., Kundaje, A., Wiggins, C., Freund, Y., Leslie, C. SPRINGER-VERLAG BERLIN. 2005: 1–13

View details for Web of Science ID 000228721100001
Predicting genetic regulatory response using classification BIOINFORMATICS Middendorf, M., Kundaje, A., Wiggins, C., Freund, Y., Leslie, C. 2004; 20: 232-240

View details for DOI 10.1093/bioinformatics/bth923

View details for Web of Science ID 000208392400031
Support vector machine (SVM) classification of multifocal visual evoked potential responses (mfVEP) from Glaucoma patients. Baroumand, F., Kundaje, A. B., Zhang, Leslie, C., Hood, D. C. ASSOC RESEARCH VISION OPHTHALMOLOGY INC. 2004: U106

View details for Web of Science ID 000223338200506
Spectrogram analysis of genomes EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING Sussillo, D., Kundaje, A., Anastassiou, D. 2004; 2004 (1): 29-42

View details for Web of Science ID 000221189500004

