Doctor of Philosophy, Tsinghua University, Computer Science (2017)
Current Research and Scholarly Interests
Computational Biology, Machine Learning
Low expression of EXOSC2 protects against clinical COVID-19 and impedes SARS-CoV-2 replication.
Life science alliance
2023; 6 (1)
New therapeutic targets are a valuable resource for treatment of SARS-CoV-2 viral infection. Genome-wide association studies have identified risk loci associated with COVID-19, but many loci are associated with comorbidities and are not specific to host-virus interactions. Here, we identify and experimentally validate a link between reduced expression of EXOSC2 and reduced SARS-CoV-2 replication. EXOSC2 was one of the 332 host proteins examined, all of which interact directly with SARS-CoV-2 proteins. Aggregating COVID-19 genome-wide association studies statistics for gene-specific eQTLs revealed an association between increased expression of EXOSC2 and higher risk of clinical COVID-19. EXOSC2 interacts with Nsp8 which forms part of the viral RNA polymerase. EXOSC2 is a component of the RNA exosome, and here, LC-MS/MS analysis of protein pulldowns demonstrated interaction between the SARS-CoV-2 RNA polymerase and most of the human RNA exosome components. CRISPR/Cas9 introduction of nonsense mutations within EXOSC2 in Calu-3 cells reduced EXOSC2 protein expression and impeded SARS-CoV-2 replication without impacting cellular viability. Targeted depletion of EXOSC2 may be a safe and effective strategy to protect against clinical COVID-19.
View details for DOI 10.26508/lsa.202201449
View details for PubMedID 36241425
Systems analysis of de novo mutations in congenital heart diseases identified a protein network in the hypoplastic left heart syndrome.
Despite a strong genetic component, only a few genes have been identified in congenital heart diseases (CHDs). We introduced systems analyses to uncover the hidden organization on biological networks of mutations in CHDs and leveraged network analysis to integrate the protein interactome, patient exomes, and single-cell transcriptomes of the developing heart. We identified a CHD network regulating heart development and observed that a sub-network also regulates fetal brain development, thereby providing mechanistic insights into the clinical comorbidities between CHDs and neurodevelopmental conditions. At a small scale, we experimentally verified uncharacterized cardiac functions of several proteins. At a global scale, our study revealed developmental dynamics of the network and observed its association with the hypoplastic left heart syndrome (HLHS), which was further supported by the dysregulation of the network in HLHS endothelial cells. Overall, our work identified previously uncharacterized CHD factors and provided a generalizable framework applicable to studying many other complex diseases. A record of this paper's Transparent Peer Review process is included in the supplemental information.
View details for DOI 10.1016/j.cels.2022.09.001
View details for PubMedID 36167075
Deep learning-based pseudo-mass spectrometry imaging analysis for precision medicine.
Briefings in bioinformatics
Liquid chromatography-mass spectrometry (LC-MS)-based untargeted metabolomics provides systematic profiling of metabolites. Yet, its applications in precision medicine (disease diagnosis) have been limited by several challenges, including metabolite identification, information loss and low reproducibility. Here, we present the deep-learning-based Pseudo-Mass Spectrometry Imaging (deepPseudoMSI) project (https://www.deeppseudomsi.org/), which converts LC-MS raw data to pseudo-MS images and then processes them by deep learning for precision medicine, such as disease diagnosis. Extensive tests based on real data demonstrated the superiority of deepPseudoMSI over traditional approaches and the capacity of our method to achieve an accurate individualized diagnosis. Our framework lays the foundation for future metabolic-based precision medicine.
View details for DOI 10.1093/bib/bbac331
View details for PubMedID 35947990
Precision environmental health monitoring by longitudinal exposome and multi-omics profiling.
Conventional environmental health studies have primarily focused on limited environmental stressors at the population level, which lacks the power to dissect the complexity and heterogeneity of individualized environmental exposures. Here, as a pilot case study, we integrated deep-profiled longitudinal personal exposome and internal multi-omics to systematically investigate how the exposome shapes a single individual's phenome. We annotated thousands of chemical and biological components in the personal exposome cloud and found they were significantly correlated with thousands of internal biomolecules, which was further cross-validated using corresponding clinical data. Our results showed that agrochemicals and fungi predominated in the highly diverse and dynamic personal exposome, and the biomolecules and pathways related to the individual's immune system, kidney, and liver were highly associated with the personal external exposome. Overall, this data-driven longitudinal monitoring study shows the potential dynamic interactions between the personal exposome and internal multi-omics, as well as the impact of the exposome on precision health by producing abundant testable hypotheses.
View details for DOI 10.1101/gr.276521.121
View details for PubMedID 35667843
Multiomic analysis reveals cell-type-specific molecular determinants of COVID-19 severity.
The determinants of severe COVID-19 in healthy adults are poorly understood, which limits the opportunity for early intervention. We present a multiomic analysis using machine learning to characterize the genomic basis of COVID-19 severity. We use single-cell multiome profiling of human lungs to link genetic signals to cell-type-specific functions. We discover >1,000 risk genes across 19 cell types, which account for 77% of the SNP-based heritability for severe disease. Genetic risk is particularly focused within natural killer (NK) cells and T cells, placing the dysfunction of these cells upstream of severe disease. Mendelian randomization and single-cell profiling of human NK cells support the role of NK cells and further localize genetic risk to CD56bright NK cells, which are key cytokine producers during the innate immune response. Rare variant analysis confirms the enrichment of severe-disease-associated genetic variation within NK-cell risk genes. Our study provides insights into the pathogenesis of severe COVID-19 with potential therapeutic targets.
View details for DOI 10.1016/j.cels.2022.05.007
View details for PubMedID 35690068
Low expression of EXOSC2 protects against clinical COVID-19 and impedes SARS-CoV-2 replication.
bioRxiv : the preprint server for biology
New therapeutic targets are a valuable resource in the struggle to reduce the morbidity and mortality associated with the COVID-19 pandemic, caused by the SARS-CoV-2 virus. Genome-wide association studies (GWAS) have identified risk loci, but some loci are associated with co-morbidities and are not specific to host-virus interactions. Here, we identify and experimentally validate a link between reduced expression of EXOSC2 and reduced SARS-CoV-2 replication. EXOSC2 was one of 332 host proteins examined, all of which interact directly with SARS-CoV-2 proteins; EXOSC2 interacts with Nsp8, which forms part of the viral RNA polymerase. Lung-specific eQTLs were identified from GTEx (v7) for each of the 332 host proteins. Aggregating COVID-19 GWAS statistics for gene-specific eQTLs revealed an association between increased expression of EXOSC2 and higher risk of clinical COVID-19, which survived stringent multiple testing correction. EXOSC2 is a component of the RNA exosome and indeed, LC-MS/MS analysis of protein pulldowns demonstrated an interaction between the SARS-CoV-2 RNA polymerase and the majority of human RNA exosome components. CRISPR/Cas9 introduction of nonsense mutations within EXOSC2 in Calu-3 cells reduced EXOSC2 protein expression, impeded SARS-CoV-2 replication and upregulated oligoadenylate synthase (OAS) genes, which have been linked to a successful immune response against SARS-CoV-2. Reduced EXOSC2 expression did not reduce cellular viability. OAS gene expression changes occurred independent of infection and in the absence of significant upregulation of other interferon-stimulated genes (ISGs). Targeted depletion or functional inhibition of EXOSC2 may be a safe and effective strategy to protect at-risk individuals against clinical COVID-19.
View details for DOI 10.1101/2022.03.06.483172
View details for PubMedID 35291294
View details for PubMedCentralID PMC8923113
Genome-wide identification of the genetic basis of amyotrophic lateral sclerosis.
Amyotrophic lateral sclerosis (ALS) is a complex disease that leads to motor neuron death. Despite heritability estimates of 52%, genome-wide association studies (GWASs) have discovered relatively few loci. We developed a machine learning approach called RefMap, which integrates functional genomics with GWAS summary statistics for gene discovery. With transcriptomic and epigenetic profiling of motor neurons derived from induced pluripotent stem cells (iPSCs), RefMap identified 690 ALS-associated genes that represent a 5-fold increase in recovered heritability. Extensive conservation, transcriptome, network, and rare variant analyses demonstrated the functional significance of candidate genes in healthy and diseased motor neurons and brain tissues. Genetic convergence between common and rare variation highlighted KANK1 as a new ALS gene. Reproducing KANK1 patient mutations in human neurons led to neurotoxicity and demonstrated that TDP-43 mislocalization, a hallmark pathology of ALS, is downstream of axonal dysfunction. RefMap can be readily applied to other complex diseases.
View details for DOI 10.1016/j.neuron.2021.12.019
View details for PubMedID 35045337
Unbiased metabolome screen leads to personalized medicine strategy for amyotrophic lateral sclerosis.
2022; 4 (2): fcac069
Amyotrophic lateral sclerosis is a rapidly progressive neurodegenerative disease that affects 1/350 individuals in the United Kingdom. The cause of amyotrophic lateral sclerosis is unknown in the majority of cases. Two-sample Mendelian randomization enables causal inference between an exposure, such as the serum concentration of a specific metabolite, and disease risk. We obtained genome-wide association study summary statistics for serum concentrations of 566 metabolites which were population matched with a genome-wide association study of amyotrophic lateral sclerosis. For each metabolite, we performed Mendelian randomization using an inverse variance weighted estimate for significance testing. After stringent Bonferroni multiple testing correction, our unbiased screen revealed three metabolites that were significantly linked to the risk of amyotrophic lateral sclerosis: Estrone-3-sulphate and bradykinin were protective, which is consistent with literature describing a male preponderance of amyotrophic lateral sclerosis and a preventive effect of angiotensin-converting enzyme inhibitors which inhibit the breakdown of bradykinin. Serum isoleucine was positively associated with amyotrophic lateral sclerosis risk. All three metabolites were supported by robust Mendelian randomization measures and sensitivity analyses; estrone-3-sulphate and isoleucine were confirmed in a validation amyotrophic lateral sclerosis genome-wide association study. Estrone-3-sulphate is metabolized to the more active estradiol by the enzyme 17beta-hydroxysteroid dehydrogenase 1; further, Mendelian randomization demonstrated a protective effect of estradiol and rare variant analysis showed that missense variants within HSD17B1, the gene encoding 17beta-hydroxysteroid dehydrogenase 1, modify risk for amyotrophic lateral sclerosis. Finally, in a zebrafish model of C9ORF72-amyotrophic lateral sclerosis, we present evidence that estradiol is neuroprotective. 
Isoleucine is metabolized via methylmalonyl-CoA mutase encoded by the gene MMUT in a reaction that consumes vitamin B12. Multivariable Mendelian randomization revealed that the toxic effect of isoleucine is dependent on the depletion of vitamin B12; consistent with this, rare variants which reduce the function of MMUT are protective against amyotrophic lateral sclerosis. We propose that amyotrophic lateral sclerosis patients and family members with high serum isoleucine levels should be offered supplementation with vitamin B12.
View details for DOI 10.1093/braincomms/fcac069
View details for PubMedID 35441136
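The inverse variance weighted (IVW) estimate and Bonferroni correction named in the abstract above can be illustrated with a minimal sketch. It assumes a fixed-effect IVW model and a normal approximation for the p-value; the function names and the input numbers are hypothetical and are not taken from the study.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-SNP summary statistics.

    beta_exposure: SNP effects on the exposure (e.g. a serum metabolite)
    beta_outcome:  SNP effects on the outcome (e.g. disease risk)
    se_outcome:    standard errors of the outcome effects
    """
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx ** 2 for w, bx in zip(weights, beta_exposure))
    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    # Two-sided p-value under a standard normal approximation.
    p_value = math.erfc(abs(estimate / std_error) / math.sqrt(2.0))
    return estimate, std_error, p_value


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical summary statistics for three SNPs (illustration only).
est, se, p = ivw_estimate([0.1, 0.2, 0.3], [0.2, 0.4, 0.6], [0.05, 0.05, 0.05])
threshold = bonferroni_threshold(0.05, 566)  # 566 metabolites were screened
significant = p < threshold
```

A random-effects variant or additional sensitivity analyses would change the standard error and are not shown here.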
A review of Mendelian randomization in amyotrophic lateral sclerosis.
Brain : a journal of neurology
Amyotrophic lateral sclerosis (ALS) is a relatively common and rapidly progressive neurodegenerative disease which, in the majority of cases, is thought to be determined by a complex gene-environment interaction. Exponential growth in the number of performed genome-wide association studies (GWAS), combined with the advent of Mendelian randomization (MR) is opening significant new opportunities to identify environmental exposures which increase or decrease the risk of ALS. Each of these discoveries has the potential to shape new therapeutic interventions. However, to do so rigorous methodological standards must be applied in the performance of MR. We have performed a review of MR studies performed in ALS to date. We identified 20 MR studies, including evaluation of physical exercise, adiposity, cognitive performance, immune function, blood lipids, sleep behaviours, educational attainment, alcohol consumption, smoking and type 2 diabetes mellitus. We have evaluated each study using gold standard methodology supported by the MR literature and the STROBE-MR checklist. Where discrepancies exist between MR studies, we suggest the underlying reasons. A number of studies conclude that there is a causal link between blood lipids and risk of ALS; replication across different datasets and even different populations adds confidence. For other putative risk factors, such as smoking and immune function, MR studies have provided cause for doubt. We highlight the use of positive control analyses in choosing exposure SNPs to make up the MR instrument, use of SNP clumping to avoid false positive results due to SNPs in linkage, and the importance of multiple testing correction. We discuss the implications of survival bias for study of late age of onset diseases such as ALS, and make recommendations to mitigate this potentially important confounder. For MR to be useful to the ALS field, high methodological standards must be applied to ensure reproducibility. 
MR is already an impactful tool but poor quality studies will lead to incorrect interpretations by a field which includes non-statisticians, wasted resources and missed opportunities.
View details for DOI 10.1093/brain/awab420
View details for PubMedID 34791088
Advances in the genetic classification of amyotrophic lateral sclerosis.
Current opinion in neurology
PURPOSE OF REVIEW: Amyotrophic lateral sclerosis (ALS) is an archetypal complex disease wherein disease risk and severity are, for the majority of patients, the product of interaction between multiple genetic and environmental factors. We are in a period of unprecedented discovery with new large-scale genome-wide association study (GWAS) and accelerating discovery of risk genes. However, much of the observed heritability of ALS is undiscovered and we are not yet approaching elucidation of the total genetic architecture, which will be necessary for comprehensive disease subclassification. RECENT FINDINGS: We summarize recent developments and discuss the future. New machine learning models will help to address nonlinear genetic interactions. Statistical power for genetic discovery may be boosted by reducing the search-space using cell-specific epigenetic profiles and expanding our scope to include genetically correlated phenotypes. Structural variation, somatic heterogeneity and consideration of environmental modifiers represent significant challenges which will require integration of multiple technologies and a multidisciplinary approach, including clinicians, geneticists and pathologists. SUMMARY: The move away from fully penetrant Mendelian risk genes necessitates new experimental designs and new standards for validation. The challenges are significant, but the potential reward for successful disease subclassification is large-scale and effective personalized medicine.
View details for DOI 10.1097/WCO.0000000000000986
View details for PubMedID 34343141
Physical exercise is a risk factor for amyotrophic lateral sclerosis: Convergent evidence from Mendelian randomisation, transcriptomics and risk genotypes.
2021; 68: 103397
BACKGROUND: Amyotrophic lateral sclerosis (ALS) is a universally fatal neurodegenerative disease. ALS is determined by gene-environment interactions and improved understanding of these interactions may lead to effective personalised medicine. The role of physical exercise in the development of ALS is currently controversial. METHODS: First, we dissected the exercise-ALS relationship in a series of two-sample Mendelian randomisation (MR) experiments. Next, we tested for enrichment of ALS genetic risk within exercise-associated transcriptome changes. Finally, we applied a validated physical activity questionnaire in a small cohort of genetically selected ALS patients. FINDINGS: We present MR evidence supporting a causal relationship between genetic liability to frequent and strenuous leisure-time exercise and ALS using a liberal instrument (multiplicative random effects IVW, p=0.01). Transcriptomic analysis revealed that genes with altered expression in response to acute exercise are enriched with known ALS risk genes (permutation test, p=0.013) including C9ORF72, and with ALS-associated rare variants of uncertain significance. Questionnaire evidence revealed that age of onset is inversely proportional to historical physical activity for C9ORF72-ALS (Cox proportional hazards model, Wald test p=0.007, likelihood ratio test p=0.01, concordance=74%) but not for non-C9ORF72-ALS. Variability in average physical activity was lower in C9ORF72-ALS compared to both non-C9ORF72-ALS (F-test, p=0.002) and neurologically normal controls (F-test, p=0.049), which is consistent with a homogeneous effect of physical activity in all C9ORF72-ALS patients. INTERPRETATION: Our MR approach suggests a positive causal relationship between ALS and physical exercise. Exercise is likely to cause motor neuron injury only in patients with a risk-genotype. Consistent with this, we have shown that ALS risk genes are activated in response to exercise.
In particular, we propose that G4C2-repeat expansion of C9ORF72 predisposes to exercise-induced ALS. FUNDING: We acknowledge support from the Wellcome Trust (JCK, 216596/Z/19/Z), NIHR (PJS, NF-SI-0617-10077; IS-BRC-1215-20017) and NIH (MPS, CEGS 5P50HG00773504, 1P50HL083800, 1R01HL101388, 1R01-HL122939, S10OD025212, P30DK116074, and UM1HG009442).
View details for DOI 10.1016/j.ebiom.2021.103397
View details for PubMedID 34051439
Precision medicine in women with epilepsy: The challenge, systematic review, and future direction.
Epilepsy & behavior : E&B
2021; 118: 107928
Epilepsy is one of the most prevalent neurologic conditions, affecting almost 70 million people worldwide. In the United States, 1.3 million women with epilepsy (WWE) are in their active reproductive years. WWE face gender-specific challenges such as pregnancy, seizure exacerbation with hormonal pattern fluctuations, contraception, fertility, and menopause. Precision medicine, which applies state-of-the-art molecular profiling to diagnostic, prognostic, and therapeutic problems, has the potential to advance the care of WWE by precisely tailoring individualized management to each patient's needs. For example, antiseizure medications (ASMs) are among the most common teratogens prescribed to women of childbearing potential. Teratogens act in a dose-dependent manner on a susceptible genotype. However, the genotypes at risk for ASM-induced teratogenic deficits are unknown. Here we summarize current challenging issues for WWE, review the state-of-the-art tools for clinical precision medicine approaches, perform a systematic review of pharmacogenomic approaches in management for WWE, and discuss potential future directions in this field. We envision a future in which precision medicine enables a new practice style that puts focus on early detection, prediction, and targeted therapies for WWE.
View details for DOI 10.1016/j.yebeh.2021.107928
View details for PubMedID 33774354
Membrane lipid raft homeostasis is directly linked to neurodegeneration.
Essays in biochemistry
Age-associated neurodegenerative diseases such as amyotrophic lateral sclerosis (ALS), Parkinson's disease (PD) and Alzheimer's disease (AD) are an unmet health need, with significant economic and societal implications, and an ever-increasing prevalence. Membrane lipid rafts (MLRs) are specialised plasma membrane microdomains that provide a platform for intracellular trafficking and signal transduction, particularly within neurons. Dysregulation of MLRs leads to disruption of neurotrophic signalling and excessive apoptosis which mirrors the final common pathway for neuronal death in ALS, PD and AD. Sphingomyelinase (SMase) and phospholipase (PL) enzymes process components of MLRs and therefore play central roles in MLR homeostasis and in neurotrophic signalling. We review the literature linking SMase and PL enzymes to ALS, AD and PD with particular attention to attractive therapeutic targets, where functional manipulation has been successful in preclinical studies. We propose that dysfunction of these enzymes is upstream in the pathogenesis of neurodegenerative diseases and to support this we provide new evidence that ALS risk genes are enriched with genes involved in ceramide metabolism (P=0.019, OR = 2.54, Fisher exact test). Ceramide is a product of SMase action upon sphingomyelin within MLRs, and it also has a role as a second messenger in intracellular signalling pathways important for neuronal survival. Genetic risk is necessarily upstream in a late age of onset disease such as ALS. We propose that manipulation of MLR structure and function should be a focus of future translational research seeking to ameliorate neurodegenerative disorders.
View details for DOI 10.1042/EBC20210026
View details for PubMedID 34623437
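The gene-set enrichment statistic quoted in the abstract above (odds ratio with a Fisher exact test) can be illustrated with a minimal sketch. It assumes a one-sided test for over-representation; the abstract does not state which alternative was used, and the function name and the counts below are hypothetical rather than the study's data.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test on the 2x2 table [[a, b], [c, d]].

    a: genes in both sets (e.g. risk genes that are also pathway genes)
    b: risk genes outside the pathway
    c: pathway genes that are not risk genes
    d: genes in neither set
    Returns (sample odds ratio, P(overlap >= a) under the hypergeometric null).
    """
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    denominator = comb(total, col1)
    upper = min(row1, col1)
    p_value = sum(comb(row1, k) * comb(total - row1, col1 - k)
                  for k in range(a, upper + 1)) / denominator
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value


# Hypothetical counts (illustration only).
odds_ratio, p_value = fisher_exact_greater(3, 1, 1, 3)
```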
Rare Variant Burden Analysis within Enhancers Identifies CAV1 as an ALS Risk Gene.
2020; 33 (9): 108456
Amyotrophic lateral sclerosis (ALS) is an incurable neurodegenerative disease. CAV1 and CAV2 organize membrane lipid rafts (MLRs) important for cell signaling and neuronal survival, and overexpression of CAV1 ameliorates ALS phenotypes in vivo. Genome-wide association studies localize a large proportion of ALS risk variants within the non-coding genome, but further characterization has been limited by lack of appropriate tools. By designing and applying a pipeline to identify pathogenic genetic variation within enhancer elements responsible for regulating gene expression, we identify disease-associated variation within CAV1/CAV2 enhancers, which replicate in an independent cohort. Discovered enhancer mutations reduce CAV1/CAV2 expression and disrupt MLRs in patient-derived cells, and CRISPR-Cas9 perturbation proximate to a patient mutation is sufficient to reduce CAV1/CAV2 expression in neurons. Additional enrichment of ALS-associated mutations within CAV1 exons positions CAV1 as an ALS risk gene. We propose CAV1/CAV2 overexpression as a personalized medicine target for ALS.
View details for DOI 10.1016/j.celrep.2020.108456
View details for PubMedID 33264630
DeepRibSt: a multi-feature convolutional neural network for predicting ribosome stalling
MULTIMEDIA TOOLS AND APPLICATIONS
View details for DOI 10.1007/s11042-020-09598-8
View details for Web of Science ID 000568184100013
DeepHINT: understanding HIV-1 integration via deep learning with attention
2019; 35 (10): 1660–67
View details for DOI 10.1093/bioinformatics/bty842
View details for Web of Science ID 000469437800005
Gene-Environment Interaction in the Era of Precision Medicine
2019; 177 (1): 38–44
View details for DOI 10.1016/j.cell.2019.03.004
View details for Web of Science ID 000462034400011
Decoding the Genomics of Abdominal Aortic Aneurysm.
2018; 174 (6): 1361
A key aspect of genomic medicine is to make individualized clinical decisions from personal genomes. We developed a machine-learning framework to integrate personal genomes and electronic health record (EHR) data and used this framework to study abdominal aortic aneurysm (AAA), a prevalent irreversible cardiovascular disease with unclear etiology. Performing whole-genome sequencing on AAA patients and controls, we demonstrated its predictive precision solely from personal genomes. By modeling personal genomes with EHRs, this framework quantitatively assessed the effectiveness of adjusting personal lifestyles given personal genome baselines, demonstrating its utility as a personal health management tool. We showed that this new framework agnostically identified genetic components involved in AAA, which were subsequently validated in human aortic tissues and in murine models. Our study presents a new framework for disease genome analysis, which can be used for both health management and understanding the biological architecture of complex diseases.
View details for PubMedID 30193110
Reconstructing spatial organizations of chromosomes through manifold learning.
Nucleic acids research
Decoding the spatial organizations of chromosomes has crucial implications for studying eukaryotic gene regulation. Recently, chromosomal conformation capture based technologies, such as Hi-C, have been widely used to uncover the interaction frequencies of genomic loci in a high-throughput and genome-wide manner and provide new insights into the folding of three-dimensional (3D) genome structure. In this paper, we develop a novel manifold learning based framework, called GEM (Genomic organization reconstructor based on conformational Energy and Manifold learning), to reconstruct the three-dimensional organizations of chromosomes by integrating Hi-C data with biophysical feasibility. Unlike previous methods, which explicitly assume specific relationships between Hi-C interaction frequencies and spatial distances, our model directly embeds the neighboring affinities from Hi-C space into 3D Euclidean space. Extensive validations demonstrated that GEM not only greatly outperformed other state-of-the-art modeling methods but also provided physically and physiologically valid 3D representations of the organizations of chromosomes. Furthermore, we apply, for the first time, the modeled chromatin structures to recover long-range genomic interactions missing from original Hi-C data.
View details for DOI 10.1093/nar/gky065
View details for PubMedID 29408992
Analysis of Ribosome Stalling and Translation Elongation Dynamics by Deep Learning.
2017; 5 (3): 212-220.e6
Ribosome stalling is manifested by the local accumulation of ribosomes at specific codon positions of mRNAs. Here, we present ROSE, a deep learning framework to analyze high-throughput ribosome profiling data and estimate the probability of a ribosome stalling event occurring at each genomic location. Extensive validation tests on independent data demonstrated that ROSE possessed higher prediction accuracy than conventional prediction models, with an increase in the area under the receiver operating characteristic curve by up to 18.4%. In addition, genome-wide statistical analyses showed that ROSE predictions can be well correlated with diverse putative regulatory factors of ribosome stalling. Moreover, the genome-wide ribosome stalling landscapes of both human and yeast computed by ROSE recovered the functional interplays between ribosome stalling and cotranslational events in protein biogenesis, including protein targeting by the signal recognition particles and protein secondary structure formation. Overall, our study provides a novel method to complement the ribosome profiling techniques and further decipher the complex regulatory mechanisms underlying translation elongation dynamics encoded in the mRNA sequence.
View details for DOI 10.1016/j.cels.2017.08.004
View details for PubMedID 28957655
TITER: predicting translation initiation sites by deep learning.
Bioinformatics (Oxford, England)
2017; 33 (14): i234-i242
Translation initiation is a key step in the regulation of gene expression. In addition to the annotated translation initiation sites (TISs), the translation process may also start at multiple alternative TISs (including both AUG and non-AUG codons), which makes it challenging to predict TISs and study the underlying regulatory mechanisms. Meanwhile, the advent of several high-throughput sequencing techniques for profiling initiating ribosomes at single-nucleotide resolution, e.g. GTI-seq and QTI-seq, provides abundant data for systematically studying the general principles of translation initiation and for developing computational methods for TIS identification. We have developed a deep learning-based framework, named TITER, for accurately predicting TISs on a genome-wide scale based on QTI-seq data. TITER extracts the sequence features of translation initiation from the surrounding sequence contexts of TISs using a hybrid neural network and further integrates the prior preference of TIS codon composition into a unified prediction framework. Extensive tests demonstrated that TITER can greatly outperform the state-of-the-art prediction methods in identifying TISs. In addition, TITER was able to identify important sequence signatures for individual types of TIS codons, including a Kozak-sequence-like motif for AUG start codon. Furthermore, the TITER prediction score can be related to the strength of translation initiation in various biological scenarios, including the repressive effect of the upstream open reading frames on gene expression and the mutational effects influencing translation initiation efficiency. TITER is available as open-source software and can be downloaded from https://github.com/zhangsaithu/titer. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btx247
View details for PubMedID 28881981
A deep boosting based approach for capturing the sequence binding preferences of RNA-binding proteins from high-throughput CLIP-seq data.
Nucleic acids research
Characterizing the binding behaviors of RNA-binding proteins (RBPs) is important for understanding their functional roles in gene expression regulation. However, current high-throughput experimental methods for identifying RBP targets, such as CLIP-seq and RNAcompete, usually suffer from the false negative issue. Here, we develop a deep boosting based machine learning approach, called DeBooster, to accurately model the binding sequence preferences and identify the corresponding binding targets of RBPs from CLIP-seq data. Comprehensive validation tests have shown that DeBooster can outperform other state-of-the-art approaches in RBP target prediction. In addition, we have demonstrated that DeBooster may provide new insights into understanding the regulatory functions of RBPs, including the binding effects of the RNA helicase MOV10 on mRNA degradation, the potentially different ADAR1 binding behaviors related to its editing activity, as well as the antagonizing effect of RBP binding on miRNA repression. Moreover, DeBooster may provide an effective index to investigate the effect of pathogenic mutations in RBP binding sites, especially those related to splicing events. We expect that DeBooster will be widely applied to analyze large-scale CLIP-seq experimental data and can provide a practically useful tool for novel biological discoveries in understanding the regulatory mechanisms of RBPs. The source code of DeBooster can be downloaded from http://github.com/dongfanghong/deepboost.
View details for DOI 10.1093/nar/gkx492
View details for PubMedID 28575488
Elastic restricted Boltzmann machines for cancer data analysis
2017; 5 (2): 159-172
View details for DOI 10.1007/s40484-017-0092-7
Constructing Structure Ensembles of Intrinsically Disordered Proteins from Chemical Shift Data
JOURNAL OF COMPUTATIONAL BIOLOGY
2016; 23 (5): 300-310
Modeling the structural ensemble of intrinsically disordered proteins (IDPs), which lack fixed structures, is essential in understanding their cellular functions and revealing their regulation mechanisms in signaling pathways of related diseases (e.g., cancers and neurodegenerative disorders). Though the ensemble concept has been widely believed to be the most accurate way to depict 3D structures of IDPs, few of the traditional ensemble-based approaches effectively address the degeneracy problem, which occurs when multiple solutions are consistent with experimental data and is the main challenge in the IDP ensemble construction task. In this article, based on a predefined conformational library, we formalize the structure ensemble construction problem into a least squares framework, which provides the optimal solution when the data constraints outnumber unknown variables. To deal with the degeneracy problem, we further propose a regularized regression approach based on the elastic net technique with the assumption that the weights to be estimated for individual structures in the ensemble are sparse. We have validated our methods through a reference ensemble approach as well as by testing the real biological data of three proteins, including alpha-synuclein, the translocation domain of Colicin N, and the K18 domain of Tau protein.
View details for DOI 10.1089/cmb.2015.0184
View details for Web of Science ID 000376080500002
View details for PubMedID 27159632
View details for PubMedCentralID PMC4876552
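As an illustration of the regularized regression idea described in the abstract above, the following is a minimal sketch of elastic net fitting by coordinate descent. It is not the authors' implementation: the objective scaling, the function names `soft_threshold` and `elastic_net_weights`, the parameters `l1` and `l2`, and the toy matrix are all assumptions made for this example, and any additional constraints the paper may impose on the weights are not modeled. Here each column of `A` stands for the values predicted for one candidate conformation, `b` holds the measured values, and the returned weights are pushed toward sparsity by the L1 term.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t (the proximal step for the L1 penalty)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def elastic_net_weights(A, b, l1=0.1, l2=0.1, n_iter=200):
    """Minimize 0.5*||A w - b||^2 + l1*||w||_1 + 0.5*l2*||w||^2.

    A is a list of rows (observations x candidate structures), b a list of
    observed values. This objective scaling is an assumption of the sketch.
    """
    n_obs, n_conf = len(A), len(A[0])
    w = [0.0] * n_conf
    residual = list(b)  # equals b - A w, and w starts at zero
    col_sq = [sum(A[i][j] ** 2 for i in range(n_obs)) for j in range(n_conf)]

    for _ in range(n_iter):
        for j in range(n_conf):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot explain any data
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(A[i][j] * (residual[i] + A[i][j] * w[j])
                      for i in range(n_obs))
            new_w = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n_obs):
                    residual[i] -= A[i][j] * delta
                w[j] = new_w
    return w


if __name__ == "__main__":
    # Toy degenerate case: 2 measurements, 4 candidate structures, so plain
    # least squares has infinitely many exact solutions.
    A = [[1.0, 0.0, 1.0, 0.5],
         [0.0, 1.0, 1.0, 0.5]]
    b = [1.0, 1.0]
    print(elastic_net_weights(A, b, l1=0.01, l2=0.01))
```

The L2 term makes the objective strictly convex, so the minimizer is unique even when the unregularized least squares problem is degenerate; the L1 term is what drives individual weights to exactly zero.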
A deep learning framework for modeling structural features of RNA-binding protein targets
NUCLEIC ACIDS RESEARCH
2016; 44 (4)
RNA-binding proteins (RBPs) play important roles in the post-transcriptional control of RNAs. Identifying RBP binding sites and characterizing RBP binding preferences are key steps toward understanding the basic mechanisms of post-transcriptional gene regulation. Though numerous computational methods have been developed for modeling RBP binding preferences, discovering a complete structural representation of the RBP targets by integrating their available structural features in all three dimensions is still a challenging task. In this paper, we develop a general and flexible deep learning framework for modeling structural binding preferences and predicting binding sites of RBPs, which takes (predicted) RNA tertiary structural information into account for the first time. Our framework constructs a unified representation that characterizes the structural specificities of RBP targets in all three dimensions, which can be further used to predict novel candidate binding sites and discover potential binding motifs. Through testing on real CLIP-seq datasets, we have demonstrated that our deep learning framework can automatically extract effective hidden structural features from the encoded raw sequence and structural profiles, and accurately predict RBP binding sites. In addition, we have conducted the first study to show that integrating the additional RNA tertiary structural features can improve the model performance in predicting RBP binding sites, especially for the polypyrimidine tract-binding protein (PTB), which also provides new evidence to support the view that RBPs may have specific tertiary structural binding preferences. In particular, the tests on the internal ribosome entry site (IRES) segments yield satisfactory results with experimental support from the literature and further demonstrate the necessity of incorporating RNA tertiary structural information into the prediction model.
The source code of our approach can be found at https://github.com/thucombio/deepnet-rbp.
View details for DOI 10.1093/nar/gkv1025
View details for Web of Science ID 000371519700003
View details for PubMedID 26467480
View details for PubMedCentralID PMC4770198
Characterizing information spreading in online social networks
Online social networks (OSNs) are changing the way in which information spreads throughout the Internet. A deep understanding of information spreading in OSNs leads to both social and commercial benefits. In this paper, we characterize the dynamics of information spreading (e.g., how fast and widely the information spreads over time) in OSNs by developing a general and accurate model based on Interactive Markov Chains (IMCs) and mean-field theory. This model explicitly reveals the impacts of the network topology on information spreading in OSNs. Further, we extend our model to feature the time-varying user behaviors and the ever-changing information popularity. The complicated dynamic patterns of information spreading are captured by our model using six key parameters. Extensive tests based on Renren's dataset validate the accuracy of our model, demonstrating that it can characterize the dynamic patterns of video sharing in Renren precisely and predict future spreading tendency successfully.
Measurement and analysis of online social networks
Chinese Journal of Computers
2014; 37 (1): 24