Stanford Advisors


Lab Affiliations


All Publications


  • scGraph: a graph neural network-based approach to automatically identify cell types. Bioinformatics (Oxford, England) Yin, Q., Liu, Q., Fu, Z., Zeng, W., Zhang, B., Zhang, X., Jiang, R., Lv, H. 2022

    Abstract

    MOTIVATION: Single cell technologies play a crucial role in revolutionizing biological research over the past decade, which strengthens our understanding in cell differentiation, development, and regulation from a single-cell level perspective. Single-cell RNA sequencing (scRNA-seq) is one of the most common single cell technologies, which enables probing transcriptional states in thousands of cells in one experiment. Identification of cell types from scRNA-seq measurements is a fundamental and crucial question to answer. Most previous studies directly take gene expression as input while ignoring the comprehensive gene-gene interactions.RESULTS: We propose scGraph, an automatic cell identification algorithm leveraging gene interaction relationships to enhance the performance of the cell type identification. ScGraph is based on a graph neural network to aggregate the information of interacting genes. In a series of experiments, we demonstrate that scGraph is accurate and outperforms eight comparison methods in the task of cell type identification. Moreover, scGraph automatically learns the gene interaction relationships from biological data and the pathway enrichment analysis shows consistent findings with previous analysis, providing insights on the analysis of regulatory mechanism.AVAILABILITY: scGraph is freely available at https://github.com/QijinYin/scGraph and https://figshare.com/articles/software/scGraph/17157743.SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btac199

    View details for PubMedID 35394015

  • DeepCAGE: Incorporating transcription factors in genome-wide prediction of chromatin accessibility. Genomics, proteomics & bioinformatics Liu, Q., Hua, K., Zhang, X., Wong, W. H., Jiang, R. 2022

    Abstract

    Although computational approaches have been complementing high-throughput biological experiments for the identification of functional regions in the human genome, it remains a great challenge to systematically decipher interactions between transcription factors and regulatory elements to achieve interpretable annotations of chromatin accessibility across diverse cellular contexts. To solve this problem, we propose DeepCAGE, a deep learning framework that integrates sequence information and binding status of transcription factors, for the accurate prediction of chromatin accessible regions at a genome-wide scale in a variety of cell types. DeepCAGE takes advantage of a densely connected deep convolutional neural network architecture to automatically learn sequence signatures of known chromatin accessible regions and then incorporates such features with expression levels and binding activities of human core transcription factors to predict novel chromatin accessible regions. In a series of systematic comparisons with existing methods, DeepCAGE exhibits superior performance in not only the classification but also the regression of chromatin accessibility signals. In a detailed analysis of transcription factor activities, DeepCAGE successfully extracts novel binding motifs and measures the contribution of a transcription factor to the regulation with respect to a specific locus in a certain cell type. When applied to whole-genome sequencing data analysis, our method successfully prioritizes putative deleterious variants underlying a human complex trait and thus provides insights into the understanding of disease-associated genetic variants. DeepCAGE can be downloaded from https://github.com/kimmo1019/DeepCAGE.

    View details for DOI 10.1016/j.gpb.2021.08.015

    View details for PubMedID 35293310

  • OpenAnnotate: a web server to annotate the chromatin accessibility of genomic regions. Nucleic acids research Chen, S., Liu, Q., Cui, X., Feng, Z., Li, C., Wang, X., Zhang, X., Wang, Y., Jiang, R. 2021; 49 (W1): W483-W490

    Abstract

    Chromatin accessibility, as a powerful marker of active DNA regulatory elements, provides valuable information for understanding regulatory mechanisms. The revolution in high-throughput methods has accumulated massive chromatin accessibility profiles in public repositories. Nevertheless, utilization of these data is hampered by cumbersome collection, time-consuming processing, and manual chromatin accessibility (openness) annotation of genomic regions. To fill this gap, we developed OpenAnnotate (http://health.tsinghua.edu.cn/openannotate/) as the first web server for efficiently annotating openness of massive genomic regions across various biosample types, tissues, and biological systems. In addition to the annotation resource from 2729 comprehensive profiles of 614 biosample types of human and mouse, OpenAnnotate provides user-friendly functionalities, ultra-efficient calculation, real-time browsing, intuitive visualization, and elaborate application notebooks. We show its unique advantages compared to existing databases and toolkits by effectively revealing cell type-specificity, identifying regulatory elements and 3D chromatin contacts, deciphering gene functional relationships, inferring functions of transcription factors, and unprecedentedly promoting single-cell data analyses. We anticipate OpenAnnotate will provide a promising avenue for researchers to construct a more holistic perspective to understand regulatory mechanisms.

    View details for DOI 10.1093/nar/gkab337

    View details for PubMedID 33999180

    View details for PubMedCentralID PMC8262705

  • Simultaneous deep generative modeling and clustering of single cell genomic data. Nature machine intelligence Liu, Q., Chen, S., Jiang, R., Wong, W. H. 2021; 3 (6): 536-544

    Abstract

    Recent advances in single-cell technologies, including single-cell ATAC-seq (scATAC-seq), have enabled large-scale profiling of the chromatin accessibility landscape at the single cell level. However, the characteristics of scATAC-seq data, including high sparsity and high dimensionality, have greatly complicated the computational analysis. Here, we proposed scDEC, a computational tool for single cell ATAC-seq analysis with deep generative neural networks. scDEC is built on a pair of generative adversarial networks (GANs), and is capable of learning the latent representation and inferring the cell labels, simultaneously. In a series of experiments, scDEC demonstrates superior performance over other tools in scATAC-seq analysis across multiple datasets and experimental settings. In downstream applications, we demonstrated that the generative power of scDEC helps to infer the trajectory and intermediate state of cells during differentiation and the latent features learned by scDEC can potentially reveal both biological cell types and within-cell-type variations. We also showed that it is possible to extend scDEC for the integrative analysis of multi-modal single cell data.

    View details for DOI 10.1038/s42256-021-00333-y

    View details for PubMedID 34179690

    View details for PubMedCentralID PMC8223760

  • Simultaneous deep generative modelling and clustering of single-cell genomic data NATURE MACHINE INTELLIGENCE Liu, Q., Chen, S., Jiang, R., Wong, W. 2021
  • Density estimation using deep generative neural networks. Proceedings of the National Academy of Sciences of the United States of America Liu, Q., Xu, J., Jiang, R., Wong, W. H. 2021; 118 (15)

    Abstract

    Density estimation is one of the fundamental problems in both statistics and machine learning. In this study, we propose Roundtrip, a computational framework for general-purpose density estimation based on deep generative neural networks. Roundtrip retains the generative power of deep generative models, such as generative adversarial networks (GANs) while it also provides estimates of density values, thus supporting both data generation and density estimation. Unlike previous neural density estimators that put stringent conditions on the transformation from the latent space to the data space, Roundtrip enables the use of much more general mappings where target density is modeled by learning a manifold induced from a base density (e.g., Gaussian distribution). Roundtrip provides a statistical framework for GAN models where an explicit evaluation of density values is feasible. In numerical experiments, Roundtrip exceeds state-of-the-art performance in a diverse range of density estimation tasks.

    View details for DOI 10.1073/pnas.2101344118

    View details for PubMedID 33833061

  • DeepHistone: a deep learning approach to predicting histone modifications Yin, Q., Wu, M., Liu, Q., Lv, H., Jiang, R. BMC. 2019: 193

    Abstract

    Quantitative detection of histone modifications has emerged in the recent years as a major means for understanding such biological processes as chromosome packaging, transcriptional activation, and DNA damage. However, high-throughput experimental techniques such as ChIP-seq are usually expensive and time-consuming, prohibiting the establishment of a histone modification landscape for hundreds of cell types across dozens of histone markers. These disadvantages have been appealing for computational methods to complement experimental approaches towards large-scale analysis of histone modifications.We proposed a deep learning framework to integrate sequence information and chromatin accessibility data for the accurate prediction of modification sites specific to different histone markers. Our method, named DeepHistone, outperformed several baseline methods in a series of comprehensive validation experiments, not only within an epigenome but also across epigenomes. Besides, sequence signatures automatically extracted by our method was consistent with known transcription factor binding sites, thereby giving insights into regulatory signatures of histone modifications. As an application, our method was shown to be able to distinguish functional single nucleotide polymorphisms from their nearby genetic variants, thereby having the potential to be used for exploring functional implications of putative disease-associated genetic variants.DeepHistone demonstrated the possibility of using a deep learning framework to integrate DNA sequence and experimental data for predicting epigenomic signals. With the state-of-the-art performance, DeepHistone was expected to shed light on a variety of epigenomic studies. DeepHistone is freely available in https://github.com/QijinYin/DeepHistone .

    View details for DOI 10.1186/s12864-019-5489-4

    View details for Web of Science ID 000464120900013

    View details for PubMedID 30967126

    View details for PubMedCentralID PMC6456942

  • Chromatin accessibility prediction via a hybrid deep convolutional neural network BIOINFORMATICS Liu, Q., Xia, F., Yin, Q., Jiang, R. 2018; 34 (5): 732–38

    Abstract

    A majority of known genetic variants associated with human-inherited diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole genome level and precisely decipher their implications in a comprehensive manner. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it still remains a big challenge to accurately annotate regulatory elements in the context of a specific cell type via automatic learning of the DNA sequence code from large-scale sequencing data. Indeed, the development of an accurate and interpretable model to learn the DNA sequence signature and further enable the identification of causative genetic variants has become essential in both genomic and genetic studies.We proposed Deopen, a hybrid framework mainly based on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparison with existing methods, we show the superior performance of our model in not only the classification of accessible regions against background sequences sampled at random, but also the regression of DNase-seq signals. Besides, we further visualize the convolutional kernels and show the match of identified sequence signatures and known motifs. We finally demonstrate the sensitivity of our model in finding causative noncoding variants in the analysis of a breast cancer dataset. We expect to see wide applications of Deopen with either public or in-house chromatin accessibility data in the annotation of the human genome and the identification of non-coding variants associated with diseases.Deopen is freely available at https://github.com/kimmo1019/Deopen.ruijiang@tsinghua.edu.cn.Supplementary data are available at Bioinformatics online.

    View details for PubMedID 29069282

  • A sequence-based method to predict the impact of regulatory variants using random forest BMC SYSTEMS BIOLOGY Liu, Q., Gan, M., Jiang, R. 2017; 11: 7

    Abstract

    Most disease-associated variants identified by genome-wide association studies (GWAS) exist in noncoding regions. In spite of the common agreement that such variants may disrupt biological functions of their hosting regulatory elements, it remains a great challenge to characterize the risk of a genetic variant within the implicated genome sequence. Therefore, it is essential to develop an effective computational model that is not only capable of predicting the potential risk of a genetic variant but also valid in interpreting how the function of the genome is affected with the occurrence of the variant.We developed a method named kmerForest that used a random forest classifier with k-mer counts to predict accessible chromatin regions purely based on DNA sequences. We demonstrated that our method outperforms existing methods in distinguishing known accessible chromatin regions from random genomic sequences. Furthermore, the performance of our method can further be improved with the incorporation of sequence conservation features. Based on this model, we assessed importance of the k-mer features by a series of permutation experiments, and we characterized the risk of a single nucleotide polymorphism (SNP) on the function of the genome using the difference between the importance of the k-mer features affected by the occurrence of the SNP. We conducted a series of experiments and showed that our model can well discriminate between pathogenic and normal SNPs. Particularly, our model correctly prioritized SNPs that are proved to be enriched for the binding sites of FOXA1 in breast cancer cell lines from previous studies.We presented a novel method to interpret functional genetic variants purely base on DNA sequences. The proposed k-mer based score offers an effective means of measuring the impact of SNPs on the function of the genome, and thus shedding light on the identification of genetic risk factors underlying complex traits and diseases.

    View details for DOI 10.1186/s12918-017-0389-1

    View details for Web of Science ID 000404915800002

    View details for PubMedID 28361702

    View details for PubMedCentralID PMC5374684