Professional Education

  • Doctor of Philosophy, Tsinghua University, Computer Science (2017)

Current Research and Scholarly Interests

Computational Biology, Machine Learning

All Publications

  • Physical exercise is a risk factor for amyotrophic lateral sclerosis: Convergent evidence from Mendelian randomisation, transcriptomics and risk genotypes. EBioMedicine Julian, T. H., Glascow, N., Barry, A. D., Moll, T., Harvey, C., Klimentidis, Y. C., Newell, M., Zhang, S., Snyder, M. P., Cooper-Knock, J., Shaw, P. J. 2021; 68: 103397


    BACKGROUND: Amyotrophic lateral sclerosis (ALS) is a universally fatal neurodegenerative disease. ALS is determined by gene-environment interactions and improved understanding of these interactions may lead to effective personalised medicine. The role of physical exercise in the development of ALS is currently controversial. METHODS: First, we dissected the exercise-ALS relationship in a series of two-sample Mendelian randomisation (MR) experiments. Next we tested for enrichment of ALS genetic risk within exercise-associated transcriptome changes. Finally, we applied a validated physical activity questionnaire in a small cohort of genetically selected ALS patients. FINDINGS: We present MR evidence supporting a causal relationship between genetic liability to frequent and strenuous leisure-time exercise and ALS using a liberal instrument (multiplicative random effects IVW, p=0.01). Transcriptomic analysis revealed that genes with altered expression in response to acute exercise are enriched with known ALS risk genes (permutation test, p=0.013) including C9ORF72, and with ALS-associated rare variants of uncertain significance. Questionnaire evidence revealed that age of onset is inversely proportional to historical physical activity for C9ORF72-ALS (Cox proportional hazards model, Wald test p=0.007, likelihood ratio test p=0.01, concordance=74%) but not for non-C9ORF72-ALS. Variability in average physical activity was lower in C9ORF72-ALS compared to both non-C9ORF72-ALS (F-test, p=0.002) and neurologically normal controls (F-test, p=0.049) which is consistent with a homogeneous effect of physical activity in all C9ORF72-ALS patients. INTERPRETATION: Our MR approach suggests a positive causal relationship between ALS and physical exercise. Exercise is likely to cause motor neuron injury only in patients with a risk-genotype. Consistent with this we have shown that ALS risk genes are activated in response to exercise. In particular, we propose that G4C2-repeat expansion of C9ORF72 predisposes to exercise-induced ALS. FUNDING: We acknowledge support from the Wellcome Trust (JCK, 216596/Z/19/Z), NIHR (PJS, NF-SI-0617-10077; IS-BRC-1215-20017) and NIH (MPS, CEGS5P50HG00773504, 1P50HL083800, 1R01HL101388, 1R01-HL122939, S10OD025212, P30DK116074, and UM1HG009442).

    View details for DOI 10.1016/j.ebiom.2021.103397

    View details for PubMedID 34051439
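
    The abstract above reports a multiplicative random effects IVW estimate. As a rough illustration of what that estimator computes in general, the sketch below combines per-variant Wald ratios with first-order inverse-variance weights and inflates the standard error by the residual heterogeneity when it exceeds 1. The function name `ivw_estimate` and the three-variant input are made up for illustration; they are not the authors' code or data.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Inverse-variance weighted MR estimate (hypothetical helper).

    Returns the pooled causal estimate and its standard error under a
    multiplicative random effects model.
    """
    # Per-variant Wald ratios: variant-outcome effect / variant-exposure effect
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    # First-order inverse-variance weights of the ratios
    weights = [(bx / se) ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total

    # Fixed-effect standard error of the pooled estimate
    se_fixed = math.sqrt(1.0 / total)

    # Cochran's Q measures heterogeneity between the per-variant ratios
    q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    dof = len(ratios) - 1
    # Multiplicative random effects: scale the variance by Q / dof, never below 1
    phi = max(1.0, q / dof) if dof > 0 else 1.0
    return estimate, se_fixed * math.sqrt(phi)


# Made-up summary statistics for three variants (illustration only)
beta_exposure = [0.10, 0.20, 0.40]
beta_outcome = [0.05, 0.10, 0.20]
se_outcome = [0.01, 0.01, 0.01]

estimate, se = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
print(estimate, se)
```

    With perfectly consistent ratios, as in this toy input, the heterogeneity term is zero and the result reduces to the fixed-effect estimate.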

  • Precision medicine in women with epilepsy: The challenge, systematic review, and future direction. Epilepsy & behavior : E&B Li, Y., Zhang, S., Snyder, M. P., Meador, K. J. 2021; 118: 107928


    Epilepsy is one of the most prevalent neurologic conditions, affecting almost 70 million people worldwide. In the United States, 1.3 million women with epilepsy (WWE) are in their active reproductive years. WWE face gender-specific challenges such as pregnancy, seizure exacerbation with hormonal pattern fluctuations, contraception, fertility, and menopause. Precision medicine, which applies state-of-the-art molecular profiling to diagnostic, prognostic, and therapeutic problems, has the potential to advance the care of WWE by precisely tailoring individualized management to each patient's needs. For example, antiseizure medications (ASMs) are among the most common teratogens prescribed to women of childbearing potential. Teratogens act in a dose-dependent manner on a susceptible genotype. However, the genotypes at risk for ASM-induced teratogenic deficits are unknown. Here we summarize current challenging issues for WWE, review the state-of-the-art tools for clinical precision medicine approaches, perform a systematic review of pharmacogenomic approaches in management for WWE, and discuss potential future directions in this field. We envision a future in which precision medicine enables a new practice style that puts focus on early detection, prediction, and targeted therapies for WWE.

    View details for DOI 10.1016/j.yebeh.2021.107928

    View details for PubMedID 33774354

  • Rare Variant Burden Analysis within Enhancers Identifies CAV1 as an ALS Risk Gene. Cell reports Cooper-Knock, J., Zhang, S., Kenna, K. P., Moll, T., Franklin, J. P., Allen, S., Nezhad, H. G., Iacoangeli, A., Yacovzada, N. Y., Eitan, C., Hornstein, E., Ehilak, E., Celadova, P., Bose, D., Farhan, S., Fishilevich, S., Lancet, D., Morrison, K. E., Shaw, C. E., Al-Chalabi, A., Project MinE ALS Sequencing Consortium, Veldink, J. H., Kirby, J., Snyder, M. P., Shaw, P. J., Blair, I., Wray, N., Kiernan, M., Neto, M. M., Chio, A., Cauchi, R., Robberecht, W., van Damme, P., Corcia, P., Couratier, P., Hardiman, O., McLaughlin, R., Gotkine, M., Drory, V., Ticozzi, N., Silani, V., Veldink, J., van den Berg, L., de Carvalho, M., Pardina, J. M., Povedano, M., Andersen, P., Wber, M., Basak, N., Al-Chalabi, A., Shaw, C., Shaw, P., Morrison, K., Landers, J., Glass, J. 2020; 33 (9): 108456


    Amyotrophic lateral sclerosis (ALS) is an incurable neurodegenerative disease. CAV1 and CAV2 organize membrane lipid rafts (MLRs) important for cell signaling and neuronal survival, and overexpression of CAV1 ameliorates ALS phenotypes in vivo. Genome-wide association studies localize a large proportion of ALS risk variants within the non-coding genome, but further characterization has been limited by lack of appropriate tools. By designing and applying a pipeline to identify pathogenic genetic variation within enhancer elements responsible for regulating gene expression, we identify disease-associated variation within CAV1/CAV2 enhancers, which replicate in an independent cohort. Discovered enhancer mutations reduce CAV1/CAV2 expression and disrupt MLRs in patient-derived cells, and CRISPR-Cas9 perturbation proximate to a patient mutation is sufficient to reduce CAV1/CAV2 expression in neurons. Additional enrichment of ALS-associated mutations within CAV1 exons positions CAV1 as an ALS risk gene. We propose CAV1/CAV2 overexpression as a personalized medicine target for ALS.

    View details for DOI 10.1016/j.celrep.2020.108456

    View details for PubMedID 33264630

  • DeepRibSt: a multi-feature convolutional neural network for predicting ribosome stalling MULTIMEDIA TOOLS AND APPLICATIONS Zhang, Y., Zhang, S., He, X., Lu, J., Gao, X. 2020
  • DeepHINT: understanding HIV-1 integration via deep learning with attention BIOINFORMATICS Hu, H., Xiao, A., Zhang, S., Li, Y., Shi, X., Jiang, T., Zhang, L., Zhang, L., Zeng, J. 2019; 35 (10): 1660–67
  • Gene-Environment Interaction in the Era of Precision Medicine CELL Li, J., Li, X., Zhang, S., Snyder, M. 2019; 177 (1): 38–44
  • Decoding the Genomics of Abdominal Aortic Aneurysm. Cell Li, J., Pan, C., Zhang, S., Spin, J. M., Deng, A., Leung, L. L., Dalman, R. L., Tsao, P. S., Snyder, M. 2018; 174 (6): 1361


    A key aspect of genomic medicine is to make individualized clinical decisions from personal genomes. We developed a machine-learning framework to integrate personal genomes and electronic health record (EHR) data and used this framework to study abdominal aortic aneurysm (AAA), a prevalent irreversible cardiovascular disease with unclear etiology. Performing whole-genome sequencing on AAA patients and controls, we demonstrated its predictive precision solely from personal genomes. By modeling personal genomes with EHRs, this framework quantitatively assessed the effectiveness of adjusting personal lifestyles given personal genome baselines, demonstrating its utility as a personal health management tool. We showed that this new framework agnostically identified genetic components involved in AAA, which were subsequently validated in human aortic tissues and in murine models. Our study presents a new framework for disease genome analysis, which can be used for both health management and understanding the biological architecture of complex diseases.

    View details for PubMedID 30193110

  • Reconstructing spatial organizations of chromosomes through manifold learning. Nucleic acids research Zhu, G., Deng, W., Hu, H., Ma, R., Zhang, S., Yang, J., Peng, J., Kaplan, T., Zeng, J. 2018


    Decoding the spatial organizations of chromosomes has crucial implications for studying eukaryotic gene regulation. Recently, chromosomal conformation capture based technologies, such as Hi-C, have been widely used to uncover the interaction frequencies of genomic loci in a high-throughput and genome-wide manner and provide new insights into the folding of three-dimensional (3D) genome structure. In this paper, we develop a novel manifold learning based framework, called GEM (Genomic organization reconstructor based on conformational Energy and Manifold learning), to reconstruct the three-dimensional organizations of chromosomes by integrating Hi-C data with biophysical feasibility. Unlike previous methods, which explicitly assume specific relationships between Hi-C interaction frequencies and spatial distances, our model directly embeds the neighboring affinities from Hi-C space into 3D Euclidean space. Extensive validations demonstrated that GEM not only greatly outperformed other state-of-the-art modeling methods but also provided physically and physiologically valid 3D representations of the organizations of chromosomes. Furthermore, for the first time, we apply the modeled chromatin structures to recover long-range genomic interactions missing from original Hi-C data.

    View details for DOI 10.1093/nar/gky065

    View details for PubMedID 29408992

  • A deep boosting based approach for capturing the sequence binding preferences of RNA-binding proteins from high-throughput CLIP-seq data. Nucleic acids research Li, S., Dong, F., Wu, Y., Zhang, S., Zhang, C., Liu, X., Jiang, T., Zeng, J. 2017


    Characterizing the binding behaviors of RNA-binding proteins (RBPs) is important for understanding their functional roles in gene expression regulation. However, current high-throughput experimental methods for identifying RBP targets, such as CLIP-seq and RNAcompete, usually suffer from the false negative issue. Here, we develop a deep boosting based machine learning approach, called DeBooster, to accurately model the binding sequence preferences and identify the corresponding binding targets of RBPs from CLIP-seq data. Comprehensive validation tests have shown that DeBooster can outperform other state-of-the-art approaches in RBP target prediction. In addition, we have demonstrated that DeBooster may provide new insights into understanding the regulatory functions of RBPs, including the binding effects of the RNA helicase MOV10 on mRNA degradation, the potentially different ADAR1 binding behaviors related to its editing activity, as well as the antagonizing effect of RBP binding on miRNA repression. Moreover, DeBooster may provide an effective index to investigate the effect of pathogenic mutations in RBP binding sites, especially those related to splicing events. We expect that DeBooster will be widely applied to analyze large-scale CLIP-seq experimental data and can provide a practically useful tool for novel biological discoveries in understanding the regulatory mechanisms of RBPs. The source code of DeBooster can be downloaded from

    View details for DOI 10.1093/nar/gkx492

    View details for PubMedID 28575488

  • Elastic restricted Boltzmann machines for cancer data analysis Quantitative Biology Zhang, S., Liang, M., Zhou, Z., Zhang, C., Chen, N., Chen, T., Zeng, J. 2017; 5 (2): 159-172
  • Analysis of Ribosome Stalling and Translation Elongation Dynamics by Deep Learning. Cell systems Zhang, S., Hu, H., Zhou, J., He, X., Jiang, T., Zeng, J. 2017; 5 (3): 212–20.e6


    Ribosome stalling is manifested by the local accumulation of ribosomes at specific codon positions of mRNAs. Here, we present ROSE, a deep learning framework to analyze high-throughput ribosome profiling data and estimate the probability of a ribosome stalling event occurring at each genomic location. Extensive validation tests on independent data demonstrated that ROSE possessed higher prediction accuracy than conventional prediction models, with an increase in the area under the receiver operating characteristic curve by up to 18.4%. In addition, genome-wide statistical analyses showed that ROSE predictions can be well correlated with diverse putative regulatory factors of ribosome stalling. Moreover, the genome-wide ribosome stalling landscapes of both human and yeast computed by ROSE recovered the functional interplays between ribosome stalling and cotranslational events in protein biogenesis, including protein targeting by the signal recognition particles and protein secondary structure formation. Overall, our study provides a novel method to complement the ribosome profiling techniques and further decipher the complex regulatory mechanisms underlying translation elongation dynamics encoded in the mRNA sequence.

    View details for DOI 10.1016/j.cels.2017.08.004

    View details for PubMedID 28957655

  • TITER: predicting translation initiation sites by deep learning. Bioinformatics (Oxford, England) Zhang, S., Hu, H., Jiang, T., Zhang, L., Zeng, J. 2017; 33 (14): i234–i242


    Translation initiation is a key step in the regulation of gene expression. In addition to the annotated translation initiation sites (TISs), the translation process may also start at multiple alternative TISs (including both AUG and non-AUG codons), which makes it challenging to predict TISs and study the underlying regulatory mechanisms. Meanwhile, the advent of several high-throughput sequencing techniques for profiling initiating ribosomes at single-nucleotide resolution, e.g. GTI-seq and QTI-seq, provides abundant data for systematically studying the general principles of translation initiation and the development of computational methods for TIS identification. We have developed a deep learning-based framework, named TITER, for accurately predicting TISs on a genome-wide scale based on QTI-seq data. TITER extracts the sequence features of translation initiation from the surrounding sequence contexts of TISs using a hybrid neural network and further integrates the prior preference of TIS codon composition into a unified prediction framework. Extensive tests demonstrated that TITER can greatly outperform the state-of-the-art prediction methods in identifying TISs. In addition, TITER was able to identify important sequence signatures for individual types of TIS codons, including a Kozak-sequence-like motif for AUG start codon. Furthermore, the TITER prediction score can be related to the strength of translation initiation in various biological scenarios, including the repressive effect of the upstream open reading frames on gene expression and the mutational effects influencing translation initiation efficiency. TITER is available as open-source software and can be downloaded from or data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btx247

    View details for PubMedID 28881981

  • Constructing Structure Ensembles of Intrinsically Disordered Proteins from Chemical Shift Data JOURNAL OF COMPUTATIONAL BIOLOGY Gong, H., Zhang, S., Wang, J., Gong, H., Zeng, J. 2016; 23 (5): 300-310


    Modeling the structural ensemble of intrinsically disordered proteins (IDPs), which lack fixed structures, is essential in understanding their cellular functions and revealing their regulation mechanisms in signaling pathways of related diseases (e.g., cancers and neurodegenerative disorders). Though the ensemble concept has been widely believed to be the most accurate way to depict 3D structures of IDPs, few of the traditional ensemble-based approaches effectively address the degeneracy problem that occurs when multiple solutions are consistent with experimental data and is the main challenge in the IDP ensemble construction task. In this article, based on a predefined conformational library, we formalize the structure ensemble construction problem into a least squares framework, which provides the optimal solution when the data constraints outnumber unknown variables. To deal with the degeneracy problem, we further propose a regularized regression approach based on the elastic net technique with the assumption that the weights to be estimated for individual structures in the ensemble are sparse. We have validated our methods through a reference ensemble approach as well as by testing the real biological data of three proteins, including alpha-synuclein, the translocation domain of Colicin N, and the K18 domain of Tau protein.

    View details for DOI 10.1089/cmb.2015.0184

    View details for Web of Science ID 000376080500002

    View details for PubMedID 27159632

    View details for PubMedCentralID PMC4876552
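
    The abstract above frames ensemble construction as a least squares fit over a conformational library, regularized with the elastic net so that most structure weights are zero. The sketch below is a generic elastic net solver by coordinate descent, not the authors' implementation. It assumes each column of `X` holds the values predicted for one candidate structure and `y` holds the measured values; `elastic_net`, `soft_threshold`, the penalty settings, and the toy numbers are all hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam=0.01, alpha=0.5, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 + lam*(alpha*||w||_1 + (1-alpha)/2*||w||^2).

    X is a list of rows (measurements) by columns (candidate structures).
    Assumes lam > 0 and alpha < 1 so the update denominator stays positive.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            # Correlation of column j with the residual that excludes j's own term
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            z = sum(c * c for c in col) / n
            new_w = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
            delta = new_w - w[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated weight
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_w
    return w


# Toy library: 4 measurements, 3 candidate structures (made-up numbers).
# The measurements are generated by the first structure alone.
X = [[1.0, 4.0, 1.0],
     [2.0, 3.0, -1.0],
     [3.0, 2.0, 1.0],
     [4.0, 1.0, -1.0]]
y = [1.0, 2.0, 3.0, 4.0]

weights = elastic_net(X, y)
print(weights)
```

    The L1 part of the penalty pushes the weights of structures that do not help explain the data toward zero, while the L2 part keeps the problem well posed when several structures predict similar values.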

  • A deep learning framework for modeling structural features of RNA-binding protein targets NUCLEIC ACIDS RESEARCH Zhang, S., Zhou, J., Hu, H., Gong, H., Chen, L., Cheng, C., Zeng, J. 2016; 44 (4)


    RNA-binding proteins (RBPs) play important roles in the post-transcriptional control of RNAs. Identifying RBP binding sites and characterizing RBP binding preferences are key steps toward understanding the basic mechanisms of post-transcriptional gene regulation. Though numerous computational methods have been developed for modeling RBP binding preferences, discovering a complete structural representation of the RBP targets by integrating their available structural features in all three dimensions is still a challenging task. In this paper, we develop a general and flexible deep learning framework for modeling structural binding preferences and predicting binding sites of RBPs, which takes (predicted) RNA tertiary structural information into account for the first time. Our framework constructs a unified representation that characterizes the structural specificities of RBP targets in all three dimensions, which can be further used to predict novel candidate binding sites and discover potential binding motifs. Through testing on real CLIP-seq datasets, we have demonstrated that our deep learning framework can automatically extract effective hidden structural features from the encoded raw sequence and structural profiles, and accurately predict RBP binding sites. In addition, we have conducted the first study to show that integrating the additional RNA tertiary structural features can improve the model performance in predicting RBP binding sites, especially for the polypyrimidine tract-binding protein (PTB), which also provides new evidence to support the view that RBPs may have specific tertiary structural binding preferences. In particular, the tests on the internal ribosome entry site (IRES) segments yield satisfactory results with experimental support from the literature and further demonstrate the necessity of incorporating RNA tertiary structural information into the prediction model. The source code of our approach can be found in

    View details for DOI 10.1093/nar/gkv1025

    View details for Web of Science ID 000371519700003

    View details for PubMedID 26467480

    View details for PubMedCentralID PMC4770198

  • Characterizing information spreading in online social networks Zhang, S., Xu, K., Chen, X., Liu, X. arXiv:1404.5562 [cs.SI]. 2014 17


    Online social networks (OSNs) are changing the way in which information spreads throughout the Internet. A deep understanding of information spreading in OSNs leads to both social and commercial benefits. In this paper, we characterize the dynamics of information spreading (e.g., how fast and widely the information spreads over time) in OSNs by developing a general and accurate model based on Interactive Markov Chains (IMCs) and mean-field theory. This model explicitly reveals the impacts of the network topology on information spreading in OSNs. Further, we extend our model to feature the time-varying user behaviors and the ever-changing information popularity. The complicated dynamic patterns of information spreading are captured by our model using six key parameters. Extensive tests based on Renren's dataset validate the accuracy of our model, demonstrating that it can characterize the dynamic patterns of video sharing in Renren precisely and predict future spreading tendency successfully.

  • Measurement and analysis of online social networks Chinese Journal of Computers Xu, K., Zhang, S., Chen, H., Li, H. 2014; 37 (1): 24