All Publications

  • MAFN: multi-level attention fusion network for multimodal named entity recognition MULTIMEDIA TOOLS AND APPLICATIONS Zhou, X., Zhang, Y., Wang, Z., Lu, M., Liu, X. 2023
  • Improving genetic risk prediction across diverse population by disentangling ancestry representations. Communications biology Gyawali, P. K., Le Guen, Y., Liu, X., Belloy, M. E., Tang, H., Zou, J., He, Z. 2023; 6 (1): 964


    Risk prediction models using genetic data have seen increasing traction in genomics. However, most polygenic risk models were developed using data from participants with similar (mostly European) ancestry. This can lead to biases in the risk predictors, resulting in poor generalization when applied to minority populations and admixed individuals such as African Americans. To address this issue, which arises largely because the prediction models are biased by the underlying population structure, we propose a deep-learning framework that leverages data from diverse populations and disentangles ancestry from the phenotype-relevant information in its representation. The ancestry-disentangled representation can be used to build risk predictors that perform better across minority populations. We applied the proposed method to the analysis of Alzheimer's disease genetics. Compared with standard linear and nonlinear risk prediction methods, the proposed method substantially improves risk prediction in minority populations, including admixed individuals, without needing self-reported ancestry information.

    View details for DOI 10.1038/s42003-023-05352-6

    View details for PubMedID 37736834

  • GhostKnockoff inference empowers identification of putative causal variants in genome-wide association studies. Nature communications He, Z., Liu, L., Belloy, M. E., Le Guen, Y., Sossin, A., Liu, X., Qi, X., Ma, S., Gyawali, P. K., Wyss-Coray, T., Tang, H., Sabatti, C., Candes, E., Greicius, M. D., Ionita-Laza, I. 2022; 13 (1): 7209


    Recent advances in genome sequencing and imputation technologies provide an exciting opportunity to comprehensively study the contribution of genetic variants to complex phenotypes. However, our ability to translate genetic discoveries into mechanistic insights remains limited at this point. In this paper, we propose an efficient knockoff-based method, GhostKnockoff, for genome-wide association studies (GWAS) that leads to improved power and ability to prioritize putative causal variants relative to conventional GWAS approaches. The method requires only Z-scores from conventional GWAS and hence can be easily applied to enhance existing and future studies. The method can also be applied to meta-analysis of multiple GWAS allowing for arbitrary sample overlap. We demonstrate its performance using empirical simulations and two applications: (1) a meta-analysis for Alzheimer's disease comprising nine overlapping large-scale GWAS, whole-exome and whole-genome sequencing studies and (2) analysis of 1403 binary phenotypes from the UK Biobank data in 408,961 samples of European ancestry. Our results demonstrate that GhostKnockoff can identify putatively functional variants with weaker statistical effects that are missed by conventional association tests.

    View details for DOI 10.1038/s41467-022-34932-z

    View details for PubMedID 36418338
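
The selection step shared by knockoff-based methods can be sketched as follows. This is a minimal illustration of the generic knockoff filter threshold, not the GhostKnockoff implementation. It assumes a feature statistic `w[j]` has already been computed for each variant (large positive values meaning the original variant looks more important than its knockoff copy); how GhostKnockoff derives such statistics from GWAS Z-scores is not shown here. The function names, the `offset` parameter, and the example values are hypothetical.

```python
from typing import List, Sequence


def knockoff_threshold(w: Sequence[float], q: float, offset: int = 1) -> float:
    """Return the smallest t among the nonzero |w_j| such that
    (offset + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q.
    Returns infinity when no candidate threshold satisfies the bound."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # discoveries at threshold t
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w: Sequence[float], q: float, offset: int = 1) -> List[int]:
    """Indices of features whose statistic reaches the data-dependent threshold."""
    t = knockoff_threshold(w, q, offset)
    return [j for j, x in enumerate(w) if x >= t]


if __name__ == "__main__":
    # Made-up statistics for ten hypothetical variants, for illustration only.
    w_example = [5.0, 4.0, 3.0, 2.5, -0.5, 2.0, 1.5, -0.2, 0.1, 3.5]
    print(knockoff_threshold(w_example, q=0.2))
    print(knockoff_select(w_example, q=0.2))
```

The negative statistics act as an estimate of how many of the positive ones are false, which is why only the sign and magnitude of each `w[j]` are needed at this stage.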

  • A supervised protein complex prediction method with network representation learning and gene ontology knowledge. BMC bioinformatics Wang, X., Zhang, Y., Zhou, P., Liu, X. 2022; 23 (1): 300


    BACKGROUND: Protein complexes are essential for biologists to understand cell organization and function. In recent years, predicting complexes from protein-protein interaction (PPI) networks through computational methods has become a research hotspot, and many methods for protein complex prediction have been proposed. However, how to use the information of known protein complexes remains a fundamental open problem in protein complex prediction.

    RESULTS: To solve this problem, we propose a supervised learning method based on network representation learning and gene ontology knowledge, which can fully use the information of known protein complexes to predict new protein complexes. This method first constructs a weighted PPI network based on gene ontology knowledge and topology information, reducing the noise in the network. On this basis, the topological information of known protein complexes is extracted as features, and the supervised learning model SVCC is trained on these features. The SVCC model is then used to predict candidate protein complexes from the protein interaction network. Next, we use the network representation learning method to obtain the vector representation of each protein complex and train a random forest model. Finally, we use the random forest model to classify the candidate protein complexes to obtain the final predicted protein complexes. We evaluate the performance of the proposed method on two publicly available PPI data sets.

    CONCLUSIONS: Experimental results show that our method can effectively improve the performance of protein complex recognition compared with existing methods. In addition, we analyze the biological significance of protein complexes predicted by our method and other methods. The results show that the protein complexes predicted by our method have high biological significance.

    View details for DOI 10.1186/s12859-022-04850-4

    View details for PubMedID 35879648

  • A Scalable Embedding Based Neural Network Method for Discovering Knowledge From Biomedical Literature IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Sang, S., Liu, X., Chen, X., Zhao, D. 2022; 19 (3): 1294-1301


    The amount of biomedical literature is growing at an explosive speed, and much useful knowledge remains undiscovered in it. Classical information retrieval techniques provide access to explicit information in a given collection, but are not able to recognize implicit connections. Literature-based discovery (LBD) is characterized by uncovering hidden associations in non-interacting literature. It could significantly support scientific research by identifying new connections between biomedical entities. However, most of the existing approaches to LBD are not scalable and may not be sufficient to detect complex associations in non-directly-connected literature. In this article, we present a model which incorporates a biomedical knowledge graph, graph embedding, and deep learning methods for literature-based discovery. First, the relations between biomedical entities are extracted from biomedical abstracts, and a knowledge graph is constructed from these relations. Second, graph embedding techniques are applied to map the entities and relations in the knowledge graph into a low-dimensional vector space. Third, a bidirectional Long Short-Term Memory (BLSTM) network is trained on the entity associations represented by the pre-trained graph embeddings. Finally, the learned model is used for open and closed literature-based discovery tasks. The experimental results show that our method could not only effectively discover hidden associations between entities, but also reveal the corresponding mechanism of interactions. This suggests that incorporating knowledge graph and deep learning methods is an effective way to capture the underlying complex associations between entities hidden in the literature.

    View details for DOI 10.1109/TCBB.2020.3003947

    View details for Web of Science ID 000805807200006

    View details for PubMedID 32750871
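
The open and closed discovery tasks mentioned in the abstract above can be illustrated with a toy sketch. This is not the paper's embedding and BLSTM model; it swaps in the classic "ABC" co-occurrence idea (A relates to B, B relates to C, but A and C are never linked directly) on a plain undirected graph. All entity names and relation pairs below are placeholders invented for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(pairs: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected graph from extracted (entity, entity) relation pairs."""
    graph: Graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def open_discovery(graph: Graph, source: str) -> List[Tuple[str, int]]:
    """Open discovery: rank entities C that are not directly linked to `source`
    by the number of intermediate entities B connecting them."""
    neighbors = graph.get(source, set())
    counts: Dict[str, int] = defaultdict(int)
    for b in neighbors:
        for c in graph.get(b, set()):
            if c != source and c not in neighbors:
                counts[c] += 1
    # Most-supported candidates first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def closed_discovery(graph: Graph, source: str, target: str) -> List[str]:
    """Closed discovery: list the intermediates B linking a given A and C."""
    return sorted(graph.get(source, set()) & graph.get(target, set()))


if __name__ == "__main__":
    # Placeholder relations, not extracted from any real corpus.
    toy_pairs = [
        ("drug_X", "gene_1"),
        ("drug_X", "gene_2"),
        ("gene_1", "disease_Y"),
        ("gene_2", "disease_Y"),
        ("gene_2", "disease_Z"),
    ]
    g = build_graph(toy_pairs)
    print(open_discovery(g, "drug_X"))
    print(closed_discovery(g, "drug_X", "disease_Y"))
```

Counting shared intermediates is the simplest possible ranking; the abstract's point is that such direct path counting scales poorly and misses complex associations, which is what the learned embeddings are meant to address.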

  • KGSG: Knowledge Guided Syntactic Graph Model for Drug-Drug Interaction Extraction Du, W., Zhang, Y., Yang, M., Liu, D., Liu, X., Sun, M., Qi, G., Liu, K., Ren, J., Xu, B., Feng, Y., Liu, Y., Chen, Y. SPRINGER INTERNATIONAL PUBLISHING AG. 2022: 55-67
  • Geometric resistant polar quaternion discrete Fourier transform and its application in color image zero-hiding. ISA transactions Wang, C., Ma, B., Xia, Z., Li, J., Li, Q., Liu, X., Sang, S. 2021


    As a typical frequency-domain analysis method, the quaternion discrete Fourier transform (QDFT) has been widely used for information hiding in color images. However, because QDFT is sensitive to geometric attacks, existing QDFT-based information hiding schemes have limited ability to resist them. In this study, a novel geometrically resilient polar QDFT (PQDFT) is constructed and its properties are analyzed. Subsequently, a PQDFT-based color image zero-hiding scheme robust to geometric attacks is proposed for lossless copyright protection of color images. Experiments show that the scheme resists both geometric and common attacks, indicating better robustness than existing QDFT-based information hiding schemes and other leading-edge zero-hiding schemes.

    View details for DOI 10.1016/j.isatra.2021.06.019

    View details for PubMedID 34176603

  • Genome-wide analysis of common and rare variants via multiple knockoffs at biobank scale, with an application to Alzheimer disease genetics. American journal of human genetics He, Z., Le Guen, Y., Liu, L., Lee, J., Ma, S., Yang, A. C., Liu, X., Rutledge, J., Losada, P. M., Song, B., Belloy, M. E., Butler, R. R., Longo, F. M., Tang, H., Mormino, E. C., Wyss-Coray, T., Greicius, M. D., Ionita-Laza, I. 2021


    Knockoff-based methods have become increasingly popular due to their enhanced power for locus discovery and their ability to prioritize putative causal variants in a genome-wide analysis. However, because of the substantial computational cost for generating knockoffs, existing knockoff approaches cannot analyze millions of rare genetic variants in biobank-scale whole-genome sequencing and whole-genome imputed datasets. We propose a scalable knockoff-based method for the analysis of common and rare variants across the genome, KnockoffScreen-AL, that is applicable to biobank-scale studies with hundreds of thousands of samples and millions of genetic variants. The application of KnockoffScreen-AL to the analysis of Alzheimer disease (AD) in 388,051 WG-imputed samples from the UK Biobank resulted in 31 significant loci, including 14 loci that are missed by conventional association tests on these data. We perform replication studies in an independent meta-analysis of clinically diagnosed AD with 94,437 samples, and additionally leverage single-cell RNA-sequencing data with 143,793 single-nucleus transcriptomes from 17 control subjects and AD-affected individuals, and proteomics data from 735 control subjects and affected individuals with AD and related disorders to validate the genes at these significant loci. These multi-omics analyses show that 79.1% of the proximal genes at these loci and 76.2% of the genes at loci identified only by KnockoffScreen-AL exhibit at least suggestive signal (p < 0.05) in the scRNA-seq or proteomics analyses. We highlight a potentially causal gene in AD progression, EGFR, that shows significant differences in expression and protein levels between AD-affected individuals and healthy control subjects.

    View details for DOI 10.1016/j.ajhg.2021.10.009

    View details for PubMedID 34767756

  • Using Alias Sampling Strategy Based on Network Embeddings to Detect Protein Complexes IEEE ACCESS Liu, X., Sang, S., Wang, X. 2020; 8: 211773–83
  • Disease Gene Prediction Based on Heterogeneous Probabilistic Hypergraph Ranking Ding, F., Liu, A., Bai, C., Xu, B., Liu, X., Sang, S., Lin, H., Yang, Z., Wang, J., Kong, X., Zhao, Z., Xia, F., Yoo, I. H., Bi, J. B., Hu IEEE. 2019: 2021–28