Bio


I am a postdoctoral scholar in the Department of Neurology & Neurological Sciences at Stanford University, under the supervision of Dr. Zihuai He. Before that, I obtained my PhD in statistics from the University of Hong Kong under the supervision of Prof. Philip L.H. Yu and Prof. Guosheng Yin, and my bachelor's degree in statistics from Renmin University of China.

My research concentrates on preference learning, network data modeling, quantitative analysis of survival and public health data, high-dimensional statistical inference with geometric information, and statistical genetics.

Honors & Awards


  • Excellent Research Award, Department of Statistics and Actuarial Science, University of Hong Kong (2022)
  • Excellent Research Award, Department of Statistics and Actuarial Science, University of Hong Kong (2021)
  • Excellent Teaching Assistant Award, Department of Statistics and Actuarial Science, University of Hong Kong (2021)
  • Honorable Mention, Interdisciplinary Contest in Modeling (2017)
  • Runner-up, Beijing-Hong Kong Data Modeling Competition (2017)
  • First Prize, Contemporary Undergraduate Mathematical Contest in Modeling (Beijing) (2016)

Professional Education


  • Ph.D., University of Hong Kong, Statistics (2022)
  • B.S., Renmin University of China, Statistics (2018)

All Publications


  • In silico identification of putative causal genetic variants. bioRxiv : the preprint server for biology He, Z., Chu, B., Yang, J., Gu, J., Chen, Z., Liu, L., Morrison, T., Belloy, M. E., Qi, X., Hejazi, N., Mathur, M., Le Guen, Y., Tang, H., Hastie, T., Ionita-Laza, I., Sabatti, C., Candes, E. 2024

    Abstract

    Understanding the causal genetic architecture of complex phenotypes is essential for future research into disease mechanisms and potential therapies. Despite the widespread availability of genome-wide data, existing methods to analyze genetic data still primarily focus on marginal association models, which fall short of fully capturing the polygenic nature of complex traits and elucidating biological causal mechanisms. Here we present a computationally efficient causal inference framework for genome-wide detection of putative causal variants underlying genetic associations. Our approach utilizes summary statistics from potentially overlapping studies as input, constructs in silico knockoff copies of summary statistics as negative controls to attenuate confounding effects induced by linkage disequilibrium, and employs efficient ultrahigh-dimensional sparse regression to jointly model all genetic variants across the genome. Our method is computationally efficient, requiring less than 15 minutes on a single CPU to analyze genome-wide summary statistics. In applications to a meta-analysis of ten large-scale genetic studies of Alzheimer's disease (AD) we identified 82 loci associated with AD, including 37 additional loci missed by conventional GWAS pipeline via marginal association testing. The identified putative causal variants achieve state-of-the-art agreement with massively parallel reporter assays and CRISPR-Cas9 experiments. Additionally, we applied the method to a retrospective analysis of large-scale genome-wide association studies (GWAS) summary statistics from 2013 to 2022. Results reveal the method's capacity to robustly discover additional loci for polygenic traits beyond conventional GWAS and pinpoint potential causal variants underpinning each locus (on average, 22.7% more loci and 78.7% fewer proxy variants), contributing to a deeper understanding of complex genetic architectures in post-GWAS analyses. 
    We are making the discoveries and software freely available to the community and anticipate that routine end-to-end in silico identification of putative causal genetic variants will become an important tool that will facilitate downstream functional experiments and future research into disease etiology, as well as the exploration of novel therapeutic avenues.

    View details for DOI 10.1101/2024.02.28.582621

    View details for PubMedID 38464202

  • Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression. ArXiv Chen, Z., He, Z., Chu, B. B., Gu, J., Morrison, T., Sabatti, C., Candes, E. 2024

    Abstract

    Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs He et al. [2022] and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease, and evidence a significant improvement in power.

    View details for PubMedID 38463500

  • Omnibus test for restricted mean survival time based on influence function. Statistical methods in medical research Gu, J., Fan, Y., Yin, G. 2023: 9622802231158735

    Abstract

    The restricted mean survival time (RMST), which evaluates the expected survival time up to a pre-specified time point τ, has been widely used to summarize the survival distribution due to its robustness and straightforward interpretation. In comparative studies with time-to-event data, the RMST-based test has been utilized as an alternative to the classic log-rank test because the power of the log-rank test deteriorates when the proportional hazards assumption is violated. To overcome the challenge of selecting an appropriate time point τ, we develop an RMST-based omnibus Wald test to detect the survival difference between two groups throughout the study follow-up period. Treating a vector of RMSTs at multiple quantile-based time points as a statistical functional, we construct a Wald χ² test statistic and derive its asymptotic distribution using the influence function. We further propose a new procedure based on the influence function to estimate the asymptotic covariance matrix in contrast to the usual bootstrap method. Simulations under different scenarios validate the size of our RMST-based omnibus test and demonstrate its advantage over the existing tests in power, especially when the true survival functions cross within the study follow-up period. For illustration, the proposed test is applied to two real datasets, which demonstrate its power and applicability in various situations.

    View details for DOI 10.1177/09622802231158735

    View details for PubMedID 37015346

  • Analysis of Preferences in Social Networks ANNALS OF APPLIED STATISTICS Gu, B., Yu, P. H. 2023; 17 (1): 89-107
  • Bayesian Log-Rank Test AMERICAN STATISTICIAN Gu, J., Zhang, Y., Yin, G. 2023
  • 3D-Polishing for Triangular Mesh Compression of Point Cloud Data The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23) Gu, J., Yin, G. 2023: 557-566

    View details for DOI 10.1145/3580305.3599239

  • Jiaqi Gu and Guosheng Yin’s contribution to the Discussion of ‘Martingale Posterior Distributions’ by Fong, Holmes and Walker Journal of the Royal Statistical Society Series B (Statistical Methodology) Gu, J., Yin, G. 2023

    View details for DOI 10.1093/jrsssb/qkad092

  • Bayesian SIR model with change points with application to the Omicron wave in Singapore SCIENTIFIC REPORTS Gu, J., Yin, G. 2022; 12 (1): 20864

    Abstract

    The Omicron variant has led to a new wave of the COVID-19 pandemic worldwide, with unprecedented numbers of daily confirmed new cases in many countries and areas. To analyze the impact of society or policy changes on the development of the Omicron wave, the stochastic susceptible-infected-removed (SIR) model with change points is proposed to accommodate the situations where the transmission rate and the removal rate may vary significantly at change points. Bayesian inference based on a Markov chain Monte Carlo algorithm is developed to estimate both the locations of change points as well as the transmission rate and removal rate within each stage. Experiments on simulated data reveal the effectiveness of the proposed method, and several stages are detected in analyzing the Omicron wave data in Singapore.

    View details for DOI 10.1038/s41598-022-25473-y

    View details for Web of Science ID 000932261400072

    View details for PubMedID 36460721

    View details for PubMedCentralID PMC9718478

  • Triangular Concordance Learning of Networks JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Gu, J., Yin, G. 2022
  • Sparse concordance-based ordinal classification SCANDINAVIAN JOURNAL OF STATISTICS Fan, Y., Gu, J., Yin, G. 2022

    View details for DOI 10.1111/sjos.12606

    View details for Web of Science ID 000846860600001

  • Joint latent space models for ranking data and social network STATISTICS AND COMPUTING Gu, J., Yu, P. H. 2022; 32 (3)
  • Reconstructing the Kaplan-Meier Estimator as an M-estimator AMERICAN STATISTICIAN Gu, J., Fan, Y., Yin, G. 2022; 76 (1): 37-43
  • Crystallization Learning with the Delaunay Triangulation The 38th International Conference on Machine Learning Gu, J., Yin, G. 2021: 3854-3863
  • Analysis of ranking data WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL STATISTICS Yu, P. H., Gu, J., Xu, H. 2019; 11 (6)

    View details for DOI 10.1002/wics.1483

    View details for Web of Science ID 000489576600004

  • Fast Algorithm for Generalized Multinomial Models with Ranking Data The 36th International Conference on Machine Learning Gu, J., Yin, G. 2019: 2445-2453