Academic Appointments


2020-21 Courses


Stanford Advisees


  • Doctoral Dissertation Reader (NonAC)
    Daniel Kang

All Publications


  • Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature Moore, J. E., Purcaro, M. J., Pratt, H. E., Epstein, C. B., Shoresh, N., Adrian, J., Kawli, T., Davis, C. A., Dobin, A., Kaul, R., Halow, J., Van Nostrand, E. L., Freese, P., Gorkin, D. U., Shen, Y., He, Y., Mackiewicz, M., Pauli-Behn, F., Williams, B. A., Mortazavi, A., Keller, C. A., Zhang, X. O., Elhajjajy, S. I., Huey, J., Dickel, D. E., Snetkova, V., Wei, X., Wang, X., Rivera-Mulia, J. C., Rozowsky, J., Zhang, J., Chhetri, S. B., Zhang, J., Victorsen, A., White, K. P., Visel, A., Yeo, G. W., Burge, C. B., Lécuyer, E., Gilbert, D. M., Dekker, J., Rinn, J., Mendenhall, E. M., Ecker, J. R., Kellis, M., Klein, R. J., Noble, W. S., Kundaje, A., Guigó, R., Farnham, P. J., Cherry, J. M., Myers, R. M., Ren, B., Graveley, B. R., Gerstein, M. B., Pennacchio, L. A., Snyder, M. P., Bernstein, B. E., Wold, B., Hardison, R. C., Gingeras, T. R., Stamatoyannopoulos, J. A., Weng, Z. 2020; 583 (7818): 699–710

    Abstract

    The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE1 and Roadmap Epigenomics2 data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

    View details for DOI 10.1038/s41586-020-2493-4

    View details for PubMedID 32728249

  • Inferring Multidimensional Rates of Aging from Cross-Sectional Data Pierson, E., Koh, P., Hashimoto, T., Koller, D., Leskovec, J., Eriksson, N., Liang, P., Chaudhuri, K., Sugiyama, M. MICROTOME PUBLISHING. 2019: 97–107
  • A Retrieve-and-Edit Framework for Predicting Structured Outputs Hashimoto, T. B., Guu, K., Oren, Y., Liang, P., Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
  • DNase-capture reveals differential transcription factor binding modalities. PloS one Kang, D., Sherwood, R., Barkal, A., Hashimoto, T., Engstrom, L., Gifford, D. 2017; 12 (12): e0187046

    Abstract

    We describe DNase-capture, an assay that increases the analytical resolution of DNase-seq by focusing its sequencing phase on selected genomic regions. We introduce a new method to compensate for capture bias called BaseNormal that allows for accurate recovery of transcription factor protection profiles from DNase-capture data. We show that these normalized data allow for nuanced detection of transcription factor binding heterogeneity with as few as dozens of sites.

    View details for PubMedID 29284001

    View details for PubMedCentralID PMC5746236

  • Unsupervised Transformation Learning via Convex Relaxations Hashimoto, T. B., Duchi, J. C., Liang, P., Guyon, Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
  • A synergistic DNA logic predicts genome-wide chromatin accessibility. Genome research Hashimoto, T., Sherwood, R. I., Kang, D. D., Rajagopal, N., Barkal, A. A., Zeng, H., Emons, B. J., Srinivasan, S., Jaakkola, T., Gifford, D. K. 2016; 26 (10): 1430–40

    Abstract

    Enhancers and promoters commonly occur in accessible chromatin characterized by depleted nucleosome contact; however, it is unclear how chromatin accessibility is governed. We show that log-additive cis-acting DNA sequence features can predict chromatin accessibility at high spatial resolution. We develop a new type of high-dimensional machine learning model, the Synergistic Chromatin Model (SCM), which when trained with DNase-seq data for a cell type is capable of predicting expected read counts of genome-wide chromatin accessibility at every base from DNA sequence alone, with the highest accuracy at hypersensitive sites shared across cell types. We confirm that a SCM accurately predicts chromatin accessibility for thousands of synthetic DNA sequences using a novel CRISPR-based method of highly efficient site-specific DNA library integration. SCMs are directly interpretable and reveal that a logic based on local, nonspecific synergistic effects, largely among pioneer TFs, is sufficient to predict a large fraction of cellular chromatin accessibility in a wide variety of cell types.

    View details for PubMedID 27456004

    View details for PubMedCentralID PMC5052050

  • GERV: a statistical method for generative evaluation of regulatory variants for transcription factor binding. Bioinformatics (Oxford, England) Zeng, H., Hashimoto, T., Kang, D. D., Gifford, D. K. 2016; 32 (4): 490–96

    Abstract

    The majority of disease-associated variants identified in genome-wide association studies reside in noncoding regions of the genome with regulatory roles. Thus being able to interpret the functional consequence of a variant is essential for identifying causal variants in the analysis of genome-wide association studies.We present GERV (generative evaluation of regulatory variants), a novel computational method for predicting regulatory variants that affect transcription factor binding. GERV learns a k-mer-based generative model of transcription factor binding from ChIP-seq and DNase-seq data, and scores variants by computing the change of predicted ChIP-seq reads between the reference and alternate allele. The k-mers learned by GERV capture more sequence determinants of transcription factor binding than a motif-based approach alone, including both a transcription factor's canonical motif and associated co-factor motifs. We show that GERV outperforms existing methods in predicting single-nucleotide polymorphisms associated with allele-specific binding. GERV correctly predicts a validated causal variant among linked single-nucleotide polymorphisms and prioritizes the variants previously reported to modulate the binding of FOXA1 in breast cancer cell lines. Thus, GERV provides a powerful approach for functionally annotating and prioritizing causal variants for experimental follow-up analysis.The implementation of GERV and related data are available at http://gerv.csail.mit.edu/.

    View details for PubMedID 26476779

    View details for PubMedCentralID PMC5860000

  • Cas9 Functionally Opens Chromatin. PloS one Barkal, A. A., Srinivasan, S., Hashimoto, T., Gifford, D. K., Sherwood, R. I. 2016; 11 (3): e0152683

    Abstract

    Using a nuclease-dead Cas9 mutant, we show that Cas9 reproducibly induces chromatin accessibility at previously inaccessible genomic loci. Cas9 chromatin opening is sufficient to enable adjacent binding and transcriptional activation by the settler transcription factor retinoic acid receptor at previously unbound motifs. Thus, we demonstrate a new use for Cas9 in increasing surrounding chromatin accessibility to alter local transcription factor binding.

    View details for PubMedID 27031353

    View details for PubMedCentralID PMC4816323

  • Cloning-free CRISPR. Stem cell reports Arbab, M., Srinivasan, S., Hashimoto, T., Geijsen, N., Sherwood, R. I. 2015; 5 (5): 908–17

    Abstract

    We present self-cloning CRISPR/Cas9 (scCRISPR), a technology that allows for CRISPR/Cas9-mediated genomic mutation and site-specific knockin transgene creation within several hours by circumventing the need to clone a site-specific single-guide RNA (sgRNA) or knockin homology construct for each target locus. We introduce a self-cleaving palindromic sgRNA plasmid and a short double-stranded DNA sequence encoding the desired locus-specific sgRNA into target cells, allowing them to produce a locus-specific sgRNA plasmid through homologous recombination. scCRISPR enables efficient generation of gene knockouts (∼88% mutation rate) at approximately one-sixth the cost of plasmid-based sgRNA construction with only 2 hr of preparation for each targeted site. Additionally, we demonstrate efficient site-specific knockin of GFP transgenes without any plasmid cloning or genome-integrated selection cassette in mouse and human embryonic stem cells (2%-4% knockin rate) through PCR-based addition of short homology arms. scCRISPR substantially lowers the bar on mouse and human transgenesis.

    View details for PubMedID 26527385

    View details for PubMedCentralID PMC4649464

  • Universal count correction for high-throughput sequencing. PLoS computational biology Hashimoto, T. B., Edwards, M. D., Gifford, D. K. 2014; 10 (3): e1003494

    Abstract

    We show that existing RNA-seq, DNase-seq, and ChIP-seq data exhibit overdispersed per-base read count distributions that are not matched to existing computational method assumptions. To compensate for this overdispersion we introduce a nonparametric and universal method for processing per-base sequencing read count data called FIXSEQ. We demonstrate that FIXSEQ substantially improves the performance of existing RNA-seq, DNase-seq, and ChIP-seq analysis tools when compared with existing alternatives.

    View details for PubMedID 24603409

    View details for PubMedCentralID PMC3945112

  • Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nature biotechnology Sherwood, R. I., Hashimoto, T., O'Donnell, C. W., Lewis, S., Barkal, A. A., van Hoff, J. P., Karun, V., Jaakkola, T., Gifford, D. K. 2014; 32 (2): 171–78

    Abstract

    We describe protein interaction quantitation (PIQ), a computational method for modeling the magnitude and shape of genome-wide DNase I hypersensitivity profiles to identify transcription factor (TF) binding sites. Through the use of machine-learning techniques, PIQ identified binding sites for >700 TFs from one DNase I hypersensitivity analysis followed by sequencing (DNase-seq) experiment with accuracy comparable to that of chromatin immunoprecipitation followed by sequencing (ChIP-seq). We applied PIQ to analyze DNase-seq data from mouse embryonic stem cells differentiating into prepancreatic and intestinal endoderm. We identified 120 and experimentally validated eight 'pioneer' TF families that dynamically open chromatin. Four pioneer TF families only opened chromatin in one direction from their motifs. Furthermore, we identified 'settler' TFs whose genomic binding is principally governed by proximity to open chromatin. Our results support a model of hierarchical TF binding in which directional and nondirectional pioneer activity shapes the chromatin landscape for population by settler TFs.

    View details for PubMedID 24441470

    View details for PubMedCentralID PMC3951735

  • Lineage-based identification of cellular states and expression programs. Bioinformatics (Oxford, England) Hashimoto, T., Jaakkola, T., Sherwood, R., Mazzoni, E. O., Wichterle, H., Gifford, D. 2012; 28 (12): i250–7

    Abstract

    We present a method, LineageProgram, that uses the developmental lineage relationship of observed gene expression measurements to improve the learning of developmentally relevant cellular states and expression programs. We find that incorporating lineage information allows us to significantly improve both the predictive power and interpretability of expression programs that are derived from expression measurements from in vitro differentiation experiments. The lineage tree of a differentiation experiment is a tree graph whose nodes describe all of the unique expression states in the input expression measurements, and edges describe the experimental perturbations applied to cells. Our method, LineageProgram, is based on a log-linear model with parameters that reflect changes along the lineage tree. Regularization with L(1) that based methods controls the parameters in three distinct ways: the number of genes change between two cellular states, the number of unique cellular states, and the number of underlying factors responsible for changes in cell state. The model is estimated with proximal operators to quickly discover a small number of key cell states and gene sets. Comparisons with existing factorization, techniques, such as singular value decomposition and non-negative matrix factorization show that our method provides higher predictive power in held, out tests while inducing sparse and biologically relevant gene sets.

    View details for PubMedID 22689769

    View details for PubMedCentralID PMC3371836