fastISM: Performant in-silico saturation mutagenesis for convolutional neural networks.
Bioinformatics (Oxford, England)
MOTIVATION: Deep learning models such as convolutional neural networks can accurately map biological sequences to associated functional readouts and properties by learning predictive de novo representations. In-silico saturation mutagenesis (ISM) is a popular feature attribution technique for inferring the contributions of all characters in an input sequence to the model's predicted output. The main drawback of ISM is its runtime, as it involves multiple forward propagations of all possible mutations of each character in the input sequence through the trained model to predict the effects on the output. RESULTS: We present fastISM, an algorithm that speeds up ISM by a factor of over 10x for commonly used convolutional neural network architectures. fastISM is based on the observations that the majority of computation in ISM is spent in convolutional layers, and that a single mutation only disrupts a limited region of intermediate layers, rendering most computation redundant. fastISM reduces the runtime gap between backpropagation-based feature attribution methods and ISM. On multi-output architectures it is far faster than backpropagation-based methods, making it feasible to run ISM on a large number of sequences. AVAILABILITY: An easy-to-use Keras/TensorFlow 2 implementation of fastISM is available at https://github.com/kundajelab/fastISM. fastISM can be installed using pip install fastism. A hands-on tutorial can be found at https://colab.research.google.com/github/kundajelab/fastISM/blob/master/notebooks/colab/DeepSEA.ipynb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btac135
View details for PubMedID 35238376
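The fastISM abstract above describes ISM as one forward pass per possible mutation and notes that a single mutation disturbs only a limited region of a convolutional layer. The following is a minimal sketch of naive ISM on a toy model, not the fastISM implementation or its API. The names `toy_model`, `conv_layer`, `FILTER`, `naive_ism`, and `changed_activations` are hypothetical and exist only for illustration.

```python
# Naive in-silico saturation mutagenesis (ISM) on a toy one-layer model.
# Everything here is an illustrative assumption, not code from fastISM.

ALPHABET = "ACGT"

# Hypothetical width-3 convolutional filter that responds to the motif "ACG".
# Rows are filter positions; columns follow ALPHABET order (A, C, G, T).
FILTER = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows."""
    return [[1.0 if base == a else 0.0 for a in ALPHABET] for base in seq]


def conv_layer(x):
    """Valid 1D convolution plus ReLU: one activation per length-3 window."""
    width = len(FILTER)
    out = []
    for start in range(len(x) - width + 1):
        s = sum(FILTER[k][c] * x[start + k][c]
                for k in range(width) for c in range(len(ALPHABET)))
        out.append(max(s, 0.0))
    return out


def toy_model(x):
    """Scalar output: the sum of all convolutional activations."""
    return sum(conv_layer(x))


def naive_ism(seq):
    """scores[i][j] = model(seq with position i set to ALPHABET[j]) - model(seq).

    Runs one full forward pass per (position, base) pair, which is the
    runtime cost that the abstract identifies as ISM's main drawback.
    """
    ref = toy_model(one_hot(seq))
    scores = []
    for i in range(len(seq)):
        row = []
        for base in ALPHABET:
            mutant = seq[:i] + base + seq[i + 1:]
            row.append(toy_model(one_hot(mutant)) - ref)
        scores.append(row)
    return scores


def changed_activations(seq, pos, base):
    """Indices of convolutional activations that differ after one mutation."""
    ref = conv_layer(one_hot(seq))
    mut = conv_layer(one_hot(seq[:pos] + base + seq[pos + 1:]))
    return [i for i, (a, b) in enumerate(zip(ref, mut)) if a != b]


seq = "TTACGTTTTT"
scores = naive_ism(seq)
affected = changed_activations(seq, 5, "A")
```

In this sketch a mutation at position `pos` can only change the windows that start at `pos - 2`, `pos - 1`, or `pos`, so every other activation could in principle be reused from the unmutated sequence. That locality is the redundancy the abstract refers to; how fastISM actually exploits it is described in the paper and repository, not here.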
ZEB2 Shapes the Epigenetic Landscape of Atherosclerosis.
Background: Smooth muscle cells (SMC) transition into a number of different phenotypes during atherosclerosis, including those that resemble fibroblasts and chondrocytes, and make up the majority of cells in the atherosclerotic plaque. To better understand the epigenetic and transcriptional mechanisms that mediate these cell state changes, and how they relate to risk for coronary artery disease (CAD), we have investigated the causality and function of transcription factors (TFs) at genome-wide associated loci. Methods: We employed CRISPR-Cas9 genome and epigenome editing to identify the causal gene and cell(s) for a complex CAD GWAS signal at 2q22.3. Subsequently, single-cell epigenetic and transcriptomic profiling in murine models and human coronary artery smooth muscle cells was employed to understand the cellular and molecular mechanism by which this CAD risk gene exerts its function. Results: CRISPR-Cas9 genome and epigenome editing showed that the complex CAD genetic signals within a genomic region at 2q22.3 lie within smooth muscle long-distance enhancers for ZEB2, a TF extensively studied in the context of epithelial-mesenchymal transition (EMT) in development and cancer. ZEB2 regulates SMC phenotypic transition through chromatin remodeling that obviates accessibility and disrupts both Notch and TGFβ signaling, thus altering the epigenetic trajectory of SMC transitions. SMC-specific loss of ZEB2 resulted in an inability of transitioning SMCs to turn off contractile programming and take on a fibroblast-like phenotype, but accelerated the formation of chondromyocytes, mirroring features of high-risk atherosclerotic plaques in human coronary arteries. Conclusions: These studies identify ZEB2 as a new CAD GWAS gene that affects features of plaque vulnerability through direct effects on the epigenome, providing a new therapeutic approach to target vascular disease.
View details for DOI 10.1161/CIRCULATIONAHA.121.057789
View details for PubMedID 34990206
AP-1 is a temporally regulated dual gatekeeper of reprogramming to pluripotency.
Proceedings of the National Academy of Sciences of the United States of America
2021; 118 (23)
Somatic cell transcription factors are critical to maintaining cellular identity and constitute a barrier to human somatic cell reprogramming; yet a comprehensive understanding of their mechanism of action is lacking. To gain insight, we examined epigenome remodeling at the onset of human nuclear reprogramming by profiling human fibroblasts after fusion with murine embryonic stem cells (ESCs). Using the assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) and chromatin immunoprecipitation sequencing, we identified enrichment for the activator protein 1 (AP-1) transcription factor c-Jun at regions of early transient accessibility at fibroblast-specific enhancers. Expression of a dominant-negative AP-1 mutant (dnAP-1) reduced accessibility and expression of fibroblast genes, overcoming the barrier to reprogramming. Remarkably, efficient reprogramming of human fibroblasts to induced pluripotent stem cells was achieved by transduction with vectors expressing SOX2, KLF4, and inducible dnAP-1, demonstrating that dnAP-1 can substitute for exogenous human OCT4. Mechanistically, we show that the AP-1 component c-Jun has two unexpected temporally distinct functions in human reprogramming: 1) to potentiate fibroblast enhancer accessibility and fibroblast-specific gene expression, and 2) to bind to and repress OCT4 as a complex with MBD3. Our findings highlight AP-1 as a previously unrecognized potent dual gatekeeper of the somatic cell state.
View details for DOI 10.1073/pnas.2104841118
View details for PubMedID 34088849
- Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts OXFORD UNIV PRESS. 2019: I108–I116
Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays.
2019; 14 (6): e0218073
The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset, which measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and to fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
View details for DOI 10.1371/journal.pone.0218073
View details for PubMedID 31206543
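The MPRA-DragoNN abstract above reports agreement between predicted and measured activity as a Spearman rank correlation. As a minimal sketch of what that statistic computes, the block below ranks both lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names and the example values are hypothetical and are not data from the paper.

```python
# Spearman rank correlation from first principles, standard library only.
# The numbers at the bottom are made up for illustration.


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted and measured regulatory activities.
predicted = [0.10, 0.40, 0.35, 0.80, 0.20]
measured = [0.05, 0.30, 0.50, 0.60, 0.45]
rho = spearman(predicted, measured)
```

Because only ranks enter the calculation, the statistic is unchanged by any monotonic rescaling of either list, which makes it a common choice when predicted and measured values are on different scales.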
- The Big Win Strategy on Multi-Value Network: An Improvement over AlphaZero Approach for 6x6 Othello ASSOC COMPUTING MACHINERY. 2018: 78–81