Bio


Jiaqi Hu is a Postdoctoral Scholar at Stanford University, supervised by Drs. Tim Assimes and Shoa Clarke. She received her PhD in Chronic Disease Epidemiology from the Yale School of Public Health in 2026 and her Bachelor of Arts degree from Peking University in 2021. Her research focuses on identifying genetic variants underlying complex diseases, applying polygenic scores for disease risk prediction and subtype identification, and integrating genetic, environmental, and clinical data to improve individual-level risk stratification.

All Publications


  • A novel two-sample Mendelian randomization framework integrating common and rare variants: application to assess the effect of HDL-C on preeclampsia risk BRIEFINGS IN BIOINFORMATICS Zhang, Y., Li, M., Haas, D. M., Bairey Merz, C., Workalemahu, T., Ryckman, K., Catov, J. M., Levine, L. D., Freedman, A., Saade, G. R., Hu, J., Zhao, H., Li, X., Liu, N., Yan, Q. 2026; 27 (1)

    Abstract

    Mendelian randomization (MR) has become an important technique for establishing causal relationships between risk factors and health outcomes. By using genetic variants as instrumental variables, it can mitigate bias due to confounding and reverse causation in observational studies. Current MR analyses have predominantly used common genetic variants as instruments, which represent only part of the genetic architecture of complex traits. Rare variants, which can have larger effect sizes and provide unique biological insights, have been understudied due to statistical and methodological challenges. We introduce MR-common and annotation-informed rare variants (MR-CARV), a novel framework integrating common and rare genetic variants in two-sample MR. This method leverages comprehensive genetic data made available by high-throughput sequencing technologies and large-scale consortia. Rare variants are aggregated into functional categories, such as gene-coding, gene-noncoding, and nongene regions, by leveraging variant annotations and biological impact as weights. The effects of rare variant sets are then estimated with STAARpipeline and combined with the estimated effects of common variants using existing MR methods. Simulation studies demonstrate that MR-CARV maintains robust type I error and achieves higher statistical power, with up to a 66.3% relative increase compared with existing methods based only on common variants. Consistent with these findings, application to real data on high-density lipoprotein cholesterol (HDL-C) and preeclampsia showed that MR-CARV [inverse variance weighted (IVW)] yielded a more precise and statistically significant effect estimate (-0.020, SE = 0.0102, P = .0470) than IVW using only common variants (-0.023, SE = 0.0123, P = .0659).

    View details for DOI 10.1093/bib/bbaf649

    View details for Web of Science ID 001655540500001

    View details for PubMedID 41499219

    View details for PubMedCentralID PMC12777983

  • Improving polygenic risk prediction performance by integrating electronic health records through phenotype embedding AMERICAN JOURNAL OF HUMAN GENETICS Xu, L., Zheng, W., Hu, J., Lin, Y., Zhao, J., Wang, G., Liu, T., Zhao, H. 2025; 112 (12): 3030-3045

    Abstract

    Large-scale biobanks provide comprehensive electronic health records (EHRs) that capture detailed clinical phenotypes, potentially enhancing disease prediction. However, traditional polygenic risk score (PRS) methods rely on simplified phenotype definitions or predefined trait sets, limiting their ability to represent the complex structures embedded within EHRs. To address this gap, we introduce EHR-embedding-enhanced PRS (EEPRS), leveraging phenotype embeddings derived from EHRs to improve PRSs using only genome-wide association study (GWAS) summary statistics. Employing embedding methods such as Word2Vec and GPT, we conducted EHR-embedding-based GWASs and identified a cardiovascular cluster via hierarchical clustering of genetic correlations. Across 41 traits in the UK Biobank, EEPRS consistently outperformed single-trait PRSs, particularly within this cluster. PRS-based phenome-wide association studies further demonstrated robust associations between EHR-embedding-based PRS and circulatory system diseases. We then developed EEPRS_optimal, a data-adaptive method that uses cross-validation to select the best embedding, yielding additional improvements. We also developed MTAG_EEPRS for multi-trait PRSs, which further improved prediction accuracy compared to single-trait PRSs and MTAG_PRS. Finally, we validated the benefits of EEPRS in the All of Us cohort for seven selected diseases. Overall, EEPRS represents a robust and interpretable framework, enhancing single-trait and multi-trait PRSs by integrating EHR embeddings.

    View details for DOI 10.1016/j.ajhg.2025.11.006

    View details for Web of Science ID 001636458400001

    View details for PubMedID 41349513

    View details for PubMedCentralID PMC12695032

  • Robust pleiotropy-decomposed polygenic scores identify distinct contributions to elevated coronary artery disease polygenic risk PLOS COMPUTATIONAL BIOLOGY Hu, J., Ye, Y., Zhang, C., Ruan, Y., Natarajan, P., Zhao, H. 2025; 21 (6): e1013191

    Abstract

    Polygenic risk scores (PRSs) have been shown to offer robust risk prediction for coronary artery disease (CAD). However, the global CAD PRS summarizes the joint effects of all the markers in the genome, masking potential genetic heterogeneity that may be important for disease interpretation and targeted interventions. Using summary-level data, we identified 43 significant CAD-related traits based on genetic correlations, and further classified them into eight pleiotropy clusters based on their biological functions. We then partitioned the genome into 2,353 near-independent regions. Variants in each region were assigned to the trait most genetically similar to CAD, and were then labeled with the corresponding pleiotropy cluster. We grouped variants without labels into a ninth, non-specific cluster. The Pleiotropy Decomposed (PD) PRSs for each of the nine clusters were calculated using variants assigned to each cluster for 407,903 samples of European ancestry from the UK Biobank (UKBB). We decomposed the CAD PRS into nine PD-PRSs and further stratified individuals with high CAD-PRS into nine subgroups. Each PD-PRS accounted for a higher proportion of the global CAD-PRS within its corresponding subgroup than in the remaining subjects with high CAD-PRS (e.g., 25.2% (0.07) vs. 10.06% (0.07) for lipids-PD-PRS). Additionally, these subgroups showed distinct clinical features. For example, in the lipids-related subgroup, lipoprotein(a) and LDL-cholesterol levels were 67.5% and 18.3% higher, respectively, compared to the remaining high-risk individuals. Furthermore, significant interactions were observed between blood pressure and BP PD-PRS, and between current smoking and respiratory system PD-PRS. Our findings suggest that PD-PRSs may reveal substantial genetic and phenotypic heterogeneity among individuals with high CAD-PRS. The unique PD-PRS compositions of each individual can highlight the relative importance of different pleiotropic regions.

    View details for DOI 10.1371/journal.pcbi.1013191

    View details for Web of Science ID 001517999500003

    View details for PubMedID 40570042

    View details for PubMedCentralID PMC12212871

  • Using clinical and genetic risk factors for risk prediction of 8 cancers in the UK Biobank JNCI CANCER SPECTRUM Hu, J., Ye, Y., Zhou, G., Zhao, H. 2024; 8 (2)

    Abstract

    Models with polygenic risk scores and clinical factors to predict risk of different cancers have been developed, but these models have been limited by the polygenic risk score-derivation methods and the incomplete selection of clinical variables. We used UK Biobank to train the best polygenic risk scores for 8 cancers (bladder, breast, colorectal, kidney, lung, ovarian, pancreatic, and prostate cancers) and select relevant clinical variables from 733 baseline traits through extreme gradient boosting (XGBoost). Combining polygenic risk scores and clinical variables, we developed Cox proportional hazards models for risk prediction in these cancers. Our models achieved high prediction accuracy for 8 cancers, with areas under the curve ranging from 0.618 (95% confidence interval = 0.581 to 0.655) for ovarian cancer to 0.831 (95% confidence interval = 0.817 to 0.845) for lung cancer. Additionally, our models could identify individuals at a high risk for developing cancer. For example, the risk of breast cancer for individuals in the top 5% score quantile was nearly 13 times greater than for individuals in the lowest 10%. Furthermore, we observed a higher proportion of individuals with high polygenic risk scores in the early-onset group but a higher proportion of individuals at high clinical risk in the late-onset group. Our models demonstrated the potential to predict cancer risk and identify high-risk individuals with great generalizability to different cancers. Our findings suggested that the polygenic risk score model is more predictive for the cancer risk of early-onset patients than for late-onset patients, while the clinical risk model is more predictive for late-onset patients. Meanwhile, combining polygenic risk scores and clinical risk factors has overall better predictive performance than using polygenic risk scores or clinical risk factors alone.

    View details for DOI 10.1093/jncics/pkae008

    View details for Web of Science ID 001180128800001

    View details for PubMedID 38366150

    View details for PubMedCentralID PMC10919929

  • Genomic risk prediction of cardiovascular diseases among type 2 diabetes patients in the UK Biobank FRONTIERS IN BIOINFORMATICS Ye, Y., Hu, J., Pang, F., Cui, C., Zhao, H. 2024; 3: 1320748

    Abstract

    Background: Polygenic risk score (PRS) has proved useful in predicting the risk of cardiovascular diseases (CVD) based on the genotypes of an individual, but most analyses have focused on disease onset in the general population. The usefulness of PRS to predict CVD risk among type 2 diabetes (T2D) patients remains unclear. Methods: We built a meta-PRSCVD upon the candidate PRSs developed from state-of-the-art PRS methods for three CVD subtypes of significant importance: coronary artery disease (CAD), ischemic stroke (IS), and heart failure (HF). To evaluate the prediction performance of the meta-PRSCVD, we restricted our analysis to 21,092 white British T2D patients in the UK Biobank, among whom 4,015 had CVD events. Results: The meta-PRSCVD was significantly associated with CVD risk, with a hazard ratio per standard deviation increase of 1.28 (95% CI: 1.23-1.33). The meta-PRSCVD alone predicted CVD incidence with an area under the receiver operating characteristic curve (AUC) of 0.57 (95% CI: 0.54-0.59). When restricted to early-onset patients (onset age ≤ 55), the AUC increased further to 0.61 (95% CI: 0.56-0.67). Conclusion: Our results highlight the potential role of genomic screening for secondary prevention of CVD among T2D patients, especially among early-onset patients.

    View details for DOI 10.3389/fbinf.2023.1320748

    View details for Web of Science ID 001142781000001

    View details for PubMedID 38239805

    View details for PubMedCentralID PMC10794561
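
Illustrative note: the first abstract above reports inverse variance weighted (IVW) estimates from two-sample Mendelian randomization. As a rough sketch of what a standard fixed-effect IVW estimate computes from per-variant summary statistics, the Python below may help. It is not the MR-CARV method and is not taken from any of the papers listed; the function name `ivw_estimate` and all input numbers are hypothetical, and the formula assumes independent variants and negligible uncertainty in the variant-exposure effects.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the exposure-outcome effect.

    Hypothetical helper for illustration only. Inputs are per-variant
    summary statistics: variant-exposure effects, variant-outcome effects,
    and standard errors of the variant-outcome effects.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")

    # Each variant's ratio estimate (by / bx) is weighted by bx^2 / se^2,
    # which simplifies the weighted mean to the two sums below.
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    if denominator == 0:
        raise ValueError("variant-exposure effects must not all be zero")
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))

    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    return estimate, std_error


# Made-up summary statistics for three variants (not from any study).
bx = [0.10, 0.20, 0.15]
by = [-0.002, -0.005, -0.003]
se = [0.001, 0.002, 0.0015]
estimate, std_error = ivw_estimate(bx, by, se)
print(f"IVW estimate = {estimate:.4f}, SE = {std_error:.4f}")
```

Under these assumptions, adding more informative instruments enlarges the denominator and so shrinks the standard error, which is one way to read the smaller SE reported when rare variant sets are combined with common variants.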