Bio


I am a postdoctoral scholar in the Department of Neurology & Neurological Sciences at Stanford University, under the supervision of Dr. Zihuai He. Before that, I obtained my PhD in statistics under the supervision of Prof. Philip L.H. Yu and Prof. Guosheng Yin at the University of Hong Kong, and my bachelor's degree in statistics from Renmin University of China.

My research focuses on preference learning, network data modeling, quantitative analysis of survival and public health data, high-dimensional statistical inference with geometric information, and statistical genetics.

Honors & Awards


  • Excellent Research Award, Department of Statistics and Actuarial Science, University of Hong Kong (2022)
  • Excellent Research Award, Department of Statistics and Actuarial Science, University of Hong Kong (2021)
  • Excellent Teaching Assistant Award, Department of Statistics and Actuarial Science, University of Hong Kong (2021)
  • Honorable Mention, Interdisciplinary Contest in Modeling (2017)
  • Runner-up, Beijing-Hong Kong Data Modeling Competition (2017)
  • First Prize, Contemporary Undergraduate Mathematical Contest in Modeling (Beijing) (2016)

Professional Education


  • Ph.D., University of Hong Kong, Statistics (2022)
  • B.S., Renmin University of China, Statistics (2018)

All Publications


  • Second-order group knockoffs with applications to GWAS. Bioinformatics (Oxford, England) Chu, B. B., Gu, J., Chen, Z., Morrison, T., Candès, E., He, Z., Sabatti, C. 2024

    Abstract

    Conditional testing via the knockoff framework allows one to identify, among a large number of possible explanatory variables, those that carry unique information about an outcome of interest, and also provides a false discovery rate guarantee on the selection. This approach is particularly well suited to the analysis of genome-wide association studies (GWAS), which have the goal of identifying genetic variants that influence traits of medical relevance. While conditional testing can be both more powerful and precise than traditional GWAS analysis methods, its vanilla implementation encounters a difficulty common to all multivariate analysis methods: it is challenging to distinguish among multiple, highly correlated regressors. This impasse can be overcome by shifting the object of inference from single variables to groups of correlated variables. To achieve this, it is necessary to construct "group knockoffs." While successful examples are already documented in the literature, this paper substantially expands the set of algorithms and software for group knockoffs. We focus in particular on second-order knockoffs, for which we describe correlation matrix approximations that are appropriate for GWAS data and that result in considerable computational savings. We illustrate the effectiveness of the proposed methods with simulations and with the analysis of albuminuria data from the UK Biobank. The described algorithms are implemented in an open-source Julia package Knockoffs.jl. R and Python wrappers are available as knockoffsr and knockoffspy packages. Supplementary data are available from Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btae580

    View details for PubMedID 39340798

  • Unit information Dirichlet process prior. Biometrics Gu, J., Yin, G. 2024; 80 (3)

    Abstract

    Prior distributions, which represent one's belief in the distributions of unknown parameters before observing the data, impact Bayesian inference in a critical and fundamental way. With the ability to incorporate external information from expert opinions or historical datasets, the priors, if specified appropriately, can improve the statistical efficiency of Bayesian inference. In survival analysis, based on the concept of unit information (UI) under parametric models, we propose the unit information Dirichlet process (UIDP) as a new class of nonparametric priors for the underlying distribution of time-to-event data. By deriving the Fisher information in terms of the differential of the cumulative hazard function, the UIDP prior is formulated to match its prior UI with the weighted average of UI in historical datasets and thus can utilize both parametric and nonparametric information provided by historical datasets. With a Markov chain Monte Carlo algorithm, simulations and real data analysis demonstrate that the UIDP prior can adaptively borrow historical information and improve statistical efficiency in survival analysis.

    View details for DOI 10.1093/biomtc/ujae091

    View details for PubMedID 39248120

  • Summary statistics knockoffs inference with family-wise error rate control. Biometrics Yu, C. X., Gu, J., Chen, Z., He, Z. 2024; 80 (3)

    Abstract

    Testing multiple hypotheses of conditional independence with provable error rate control is a fundamental problem with various applications. To infer conditional independence with family-wise error rate (FWER) control when only summary statistics of marginal dependence are accessible, we adopt GhostKnockoff to directly generate knockoff copies of summary statistics and propose a new filter to select features conditionally dependent on the response. In addition, we develop a computationally efficient algorithm to greatly reduce the computational cost of knockoff copies generation without sacrificing power and FWER control. Experiments on simulated data and a real dataset of Alzheimer's disease genetics demonstrate the advantage of the proposed method over existing alternatives in both statistical power and computational efficiency.

    View details for DOI 10.1093/biomtc/ujae082

    View details for PubMedID 39222026

    View details for PubMedCentralID PMC11367731

  • In silico identification of putative causal genetic variants. bioRxiv : the preprint server for biology He, Z., Chu, B., Yang, J., Gu, J., Chen, Z., Liu, L., Morrison, T., Belloy, M. E., Qi, X., Hejazi, N., Mathur, M., Le Guen, Y., Tang, H., Hastie, T., Ionita-Laza, I., Sabatti, C., Candes, E. 2024

    Abstract

    Understanding the causal genetic architecture of complex phenotypes is essential for future research into disease mechanisms and potential therapies. Despite the widespread availability of genome-wide data, existing methods to analyze genetic data still primarily focus on marginal association models, which fall short of fully capturing the polygenic nature of complex traits and elucidating biological causal mechanisms. Here we present a computationally efficient causal inference framework for genome-wide detection of putative causal variants underlying genetic associations. Our approach utilizes summary statistics from potentially overlapping studies as input, constructs in silico knockoff copies of summary statistics as negative controls to attenuate confounding effects induced by linkage disequilibrium, and employs efficient ultrahigh-dimensional sparse regression to jointly model all genetic variants across the genome. Our method is computationally efficient, requiring less than 15 minutes on a single CPU to analyze genome-wide summary statistics. In applications to a meta-analysis of ten large-scale genetic studies of Alzheimer's disease (AD), we identified 82 loci associated with AD, including 37 additional loci missed by the conventional GWAS pipeline via marginal association testing. The identified putative causal variants achieve state-of-the-art agreement with massively parallel reporter assays and CRISPR-Cas9 experiments. Additionally, we applied the method to a retrospective analysis of large-scale genome-wide association studies (GWAS) summary statistics from 2013 to 2022. Results reveal the method's capacity to robustly discover additional loci for polygenic traits beyond conventional GWAS and pinpoint potential causal variants underpinning each locus (on average, 22.7% more loci and 78.7% fewer proxy variants), contributing to a deeper understanding of complex genetic architectures in post-GWAS analyses. We are making the discoveries and software freely available to the community and anticipate that routine end-to-end in silico identification of putative causal genetic variants will become an important tool that will facilitate downstream functional experiments and future research into disease etiology, as well as the exploration of novel therapeutic avenues.

    View details for DOI 10.1101/2024.02.28.582621

    View details for PubMedID 38464202

  • Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression. ArXiv Chen, Z., He, Z., Chu, B. B., Gu, J., Morrison, T., Sabatti, C., Candes, E. 2024

    Abstract

    Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al., 2022) and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease, and evidence a significant improvement in power.

    View details for PubMedID 38463500

  • Omnibus test for restricted mean survival time based on influence function. Statistical methods in medical research Gu, J., Fan, Y., Yin, G. 2023: 9622802231158735

    Abstract

    The restricted mean survival time (RMST), which evaluates the expected survival time up to a pre-specified time point τ, has been widely used to summarize the survival distribution due to its robustness and straightforward interpretation. In comparative studies with time-to-event data, the RMST-based test has been utilized as an alternative to the classic log-rank test because the power of the log-rank test deteriorates when the proportional hazards assumption is violated. To overcome the challenge of selecting an appropriate time point τ, we develop an RMST-based omnibus Wald test to detect the survival difference between two groups throughout the study follow-up period. Treating a vector of RMSTs at multiple quantile-based time points as a statistical functional, we construct a Wald χ² test statistic and derive its asymptotic distribution using the influence function. We further propose a new procedure based on the influence function to estimate the asymptotic covariance matrix in contrast to the usual bootstrap method. Simulations under different scenarios validate the size of our RMST-based omnibus test and demonstrate its advantage over the existing tests in power, especially when the true survival functions cross within the study follow-up period. For illustration, the proposed test is applied to two real datasets, which demonstrate its power and applicability in various situations.

    View details for DOI 10.1177/09622802231158735

    View details for PubMedID 37015346

  • Analysis of preferences in social networks. Annals of Applied Statistics Gu, B., Yu, P. H. 2023; 17 (1): 89-107
  • Bayesian Log-Rank Test. American Statistician Gu, J., Zhang, Y., Yin, G. 2023
  • 3D-Polishing for Triangular Mesh Compression of Point Cloud Data. The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23) Gu, J., Yin, G. 2023: 557-566

    View details for DOI 10.1145/3580305.3599239

  • Jiaqi Gu and Guosheng Yin’s contribution to the Discussion of ‘Martingale Posterior Distributions’ by Fong, Holmes and Walker. Journal of the Royal Statistical Society Series B (Statistical Methodology) Gu, J., Yin, G. 2023

    View details for DOI 10.1093/jrsssb/qkad092

  • Bayesian SIR model with change points with application to the Omicron wave in Singapore. Scientific Reports Gu, J., Yin, G. 2022; 12 (1): 20864

    Abstract

    The Omicron variant has led to a new wave of the COVID-19 pandemic worldwide, with unprecedented numbers of daily confirmed new cases in many countries and areas. To analyze the impact of society or policy changes on the development of the Omicron wave, the stochastic susceptible-infected-removed (SIR) model with change points is proposed to accommodate the situations where the transmission rate and the removal rate may vary significantly at change points. Bayesian inference based on a Markov chain Monte Carlo algorithm is developed to estimate both the locations of change points as well as the transmission rate and removal rate within each stage. Experiments on simulated data reveal the effectiveness of the proposed method, and several stages are detected in analyzing the Omicron wave data in Singapore.

    View details for DOI 10.1038/s41598-022-25473-y

    View details for Web of Science ID 000932261400072

    View details for PubMedID 36460721

    View details for PubMedCentralID PMC9718478

  • Triangular Concordance Learning of Networks. Journal of Computational and Graphical Statistics Gu, J., Yin, G. 2022
  • Sparse concordance-based ordinal classification. Scandinavian Journal of Statistics Fan, Y., Gu, J., Yin, G. 2022

    View details for DOI 10.1111/sjos.12606

    View details for Web of Science ID 000846860600001

  • Joint latent space models for ranking data and social network. Statistics and Computing Gu, J., Yu, P. H. 2022; 32 (3)
  • Reconstructing the Kaplan-Meier Estimator as an M-estimator. American Statistician Gu, J., Fan, Y., Yin, G. 2022; 76 (1): 37-43
  • Crystallization Learning with the Delaunay Triangulation. The 38th International Conference on Machine Learning Gu, J., Yin, G. 2021: 3854-3863
  • Analysis of ranking data. Wiley Interdisciplinary Reviews-Computational Statistics Yu, P. H., Gu, J., Xu, H. 2019; 11 (6)

    View details for DOI 10.1002/wics.1483

    View details for Web of Science ID 000489576600004

  • Fast Algorithm for Generalized Multinomial Models with Ranking Data. The 36th International Conference on Machine Learning Gu, J., Yin, G. 2019: 2445-2453