Jiayan Zhou
Postdoctoral Scholar, Cardiovascular Medicine
Boards, Advisory Committees, Professional Organizations
-
Reviewer, Journal of the American Medical Informatics Association (2024 - Present)
-
Editorial Board: Academic Editor, PeerJ Computer Science (2024 - Present)
-
Trusted Reviewer Board, Health and Quality of Life Outcomes (2024 - Present)
-
Reviewer, npj Digital Medicine (2024 - Present)
-
Reviewer, Circulation: Genomic and Precision Medicine (2023 - Present)
-
Reviewer, BMC Journals (2023 - Present)
-
Reviewer, Scientific Reports (2023 - Present)
-
Reviewer, Journal of Orthopaedic Surgery and Research (2023 - Present)
-
Reviewer, European Journal of Medical Research (2023 - Present)
-
Reviewer, Journal of Cancer Research and Clinical Oncology (2023 - Present)
-
Reviewer, Frontiers Journals (2023 - Present)
-
Reviewer, Analytical Cellular Pathology (2022 - Present)
-
Reviewer, Evidence-Based Complementary and Alternative Medicine (2022 - Present)
Professional Education
-
Doctor of Philosophy, The Pennsylvania State University, Pathobiology (Bioinformatics and Human Genetics) (2023)
-
Master of Applied Statistics, The Pennsylvania State University, Applied Statistics (2022)
-
Bachelor of Science, The Pennsylvania State University, Biochemistry and Molecular Biology (2018)
-
Bachelor of Science, The Pennsylvania State University, Immunology and Infectious Diseases (2018)
All Publications
-
A plasma proteomic signature for atherosclerotic cardiovascular disease risk prediction in the UK Biobank cohort.
medRxiv : the preprint server for health sciences
2024
Abstract
Background: While risk stratification for atherosclerotic cardiovascular disease (ASCVD) is essential for primary prevention, current clinical risk algorithms demonstrate variability and leave room for further improvement. The plasma proteome holds promise as a future diagnostic and prognostic tool that can accurately reflect complex human traits and disease processes. We assessed the ability of plasma proteins to predict ASCVD. Method: Clinical, genetic, and high-throughput plasma proteomic data were analyzed for association with ASCVD in a cohort of 41,650 UK Biobank participants. Selected features for analysis included clinical variables such as a UK-based cardiovascular clinical risk score (QRISK3) and lipid levels, 36 polygenic risk scores (PRSs), and Olink protein expression data of 2,920 proteins. We used least absolute shrinkage and selection operator (LASSO) regression to select features and compared area under the curve (AUC) statistics between data types. Randomized LASSO regression with a stability selection algorithm identified a smaller set of more robustly associated proteins. The benefit of plasma proteins over standard clinical variables, the QRISK3 score, and PRSs was evaluated through the derivation of Delta AUC values. We also assessed the incremental gain in model performance using proteomic datasets with varying numbers of proteins. To identify potential causal proteins for ASCVD, we conducted a two-sample Mendelian randomization (MR) analysis. Result: The mean age of our cohort was 56.0 years, 60.3% were female, and 9.8% developed incident ASCVD over a median follow-up of 6.9 years. A protein-only LASSO model selected 294 proteins and returned an AUC of 0.723 (95% CI 0.708-0.737). A clinical variable and PRS-only LASSO model selected 4 clinical variables and 20 PRSs and achieved an AUC of 0.726 (95% CI 0.712-0.741). The addition of the full proteomic dataset to clinical variables and PRSs resulted in a Delta AUC of 0.010 (95% CI 0.003-0.018).
Fifteen proteins selected by a stability selection algorithm offered improvement in ASCVD prediction over the QRISK3 risk score [Delta AUC: 0.013 (95% CI 0.005-0.021)]. Filtered and clustered versions of the full proteomic dataset (consisting of 600-1,500 proteins) performed comparably to the full dataset for ASCVD prediction. Using MR, we identified 11 proteins as potentially causal for ASCVD. Conclusion: A plasma proteomic signature performs well for incident ASCVD prediction but only modestly improves prediction over clinical and genetic factors. Further studies are warranted to better elucidate the clinical utility of this signature in predicting the risk of ASCVD over the standard practice of using the QRISK3 score.
View details for DOI 10.1101/2024.09.13.24313652
View details for PubMedID 39314942
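The Delta AUC comparison mentioned in the abstract above can be illustrated with a minimal sketch. It assumes Delta AUC is simply the AUC of an augmented model minus the AUC of a baseline model, both evaluated on the same participants. The toy labels and scores are hypothetical, and the helper `auc` is an illustrative rank-based implementation, not the authors' code; the confidence intervals reported in the abstract are not reproduced here.

```python
def auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count case/control pairs where the case scores higher; ties count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = incident event, 0 = no event.
labels = [1, 0, 1, 0, 0, 1]
baseline_scores = [0.6, 0.5, 0.4, 0.3, 0.7, 0.8]   # e.g. a clinical score alone
augmented_scores = [0.7, 0.4, 0.6, 0.3, 0.5, 0.9]  # e.g. clinical score plus proteins

# Delta AUC: gain in discrimination from the added predictors.
delta_auc = auc(augmented_scores, labels) - auc(baseline_scores, labels)
print(f"Delta AUC = {delta_auc:.3f}")
```

A positive value indicates that the augmented model ranks cases above non-cases more often than the baseline does.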
-
Plasma proteomic signatures for type 2 diabetes mellitus and related traits in the UK Biobank cohort.
medRxiv : the preprint server for health sciences
2024
Abstract
Aims/hypothesis: The plasma proteome holds promise as a diagnostic and prognostic tool that can accurately reflect complex human traits and disease processes. We assessed the ability of plasma proteins to predict type 2 diabetes mellitus (T2DM) and related traits. Methods: Clinical, genetic, and high-throughput proteomic data from three subcohorts of UK Biobank participants were analyzed for association with dual-energy x-ray absorptiometry (DXA) derived truncal fat (in the adiposity subcohort), estimated maximum oxygen consumption (VO2max) (in the fitness subcohort), and incident T2DM (in the T2DM subcohort). We used least absolute shrinkage and selection operator (LASSO) regression to assess the relative ability of non-proteomic and proteomic variables to associate with each trait by comparing variance explained (R²) and area under the curve (AUC) statistics between data types. Stability selection with randomized LASSO regression identified the most robustly associated proteins for each trait. The benefit of proteomic signatures (PSs) over QDiabetes, a T2DM clinical risk score, was evaluated through the derivation of delta (Delta) AUC values. We also assessed the incremental gain in model performance metrics using proteomic datasets with varying numbers of proteins. A series of two-sample Mendelian randomization (MR) analyses were conducted to identify potentially causal proteins for adiposity, fitness, and T2DM. Results: Across all three subcohorts, the mean age was 56.7 years and 54.9% were female. In the T2DM subcohort, 5.8% developed incident T2DM over a median follow-up of 7.6 years. LASSO-derived PSs increased the R² of truncal fat and VO2max over clinical and genetic factors by 0.074 and 0.057, respectively.
We observed a similar improvement in T2DM prediction over the QDiabetes score [Delta AUC: 0.016 (95% CI 0.008, 0.024)] when using a robust PS derived strictly from the T2DM outcome versus a model further augmented with non-overlapping proteins associated with adiposity and fitness. A small number of proteins (29 for truncal adiposity, 18 for VO2max, and 26 for T2DM) identified by stability selection algorithms offered most of the improvement in prediction of each outcome. Filtered and clustered versions of the full proteomic dataset supplied by the UK Biobank (ranging between 600-1,500 proteins) performed comparably to the full dataset for T2DM prediction. Using MR, we identified 4 proteins as potentially causal for adiposity, 1 as potentially causal for fitness, and 4 as potentially causal for T2DM. Conclusions/Interpretation: Plasma PSs modestly improve the prediction of incident T2DM over that possible with clinical and genetic factors. Further studies are warranted to better elucidate the clinical utility of these signatures in predicting the risk of T2DM over the standard practice of using the QDiabetes score. Candidate causally associated proteins identified through MR deserve further study as potential novel therapeutic targets for T2DM.
View details for DOI 10.1101/2024.09.13.24313501
View details for PubMedID 39314935
-
Large Language Models in Biomedical and Health Informatics: A Review with Bibliometric Analysis
JOURNAL OF HEALTHCARE INFORMATICS RESEARCH
2024
View details for DOI 10.1007/s41666-024-00171-8
View details for Web of Science ID 001312101600001
-
A novel temperature-controlled device with standardized manipulation improves chronic back pain mediated by modulating deep muscle thickness: A multicenter randomized controlled trial
CLINICAL AND TRANSLATIONAL DISCOVERY
2024; 4 (4)
View details for DOI 10.1002/ctd2.330
View details for Web of Science ID 001255534200001
-
The global clinical studies of long COVID.
International journal of infectious diseases : IJID : official publication of the International Society for Infectious Diseases
2024: 107105
Abstract
Long COVID refers to symptoms, signs, and conditions that persist after the initial phase of SARS-CoV-2 infection. The incidence of long COVID varies among regions (31% in North America, 44% in Europe, and 51% in Asia), which challenges healthcare systems, yet guidelines for its treatment are limited. With a growing number of nationwide, government-funded projects, such as the RECOVER initiative in the US and NIHR funding in the UK, an increasing number of ongoing clinical trials are investigating the efficacy of diverse therapies in reversing long COVID. A search of the WHO International Clinical Trial Registry Platform identified 587 clinical studies as long COVID studies. Among these, 312 studies (53.2%) are testing potential therapies. Most of the long COVID trials were conducted in the United States (58 trials [18.6%]), followed by India (55 trials [17.6%]) and Spain (20 trials [6.4%]). Interventions in these clinical trials include physical exercise, rehabilitation therapy, behavioral therapy, and pharmacological therapies including herbs, Paxlovid, and fluvoxamine. These trials aim to address long COVID symptoms and signs including fatigue, decreased pulmonary function, reduced cognitive function, and others. To date, only 11 of these 312 studies have published results, and unfortunately these were not confirmatory. Future studies should be designed to address sleep disorders, which were seldom included in registered clinical studies. Moreover, interventions aimed at treating the underlying pathophysiology of long COVID are also necessary but currently lacking.
View details for DOI 10.1016/j.ijid.2024.107105
View details for PubMedID 38782355
-
Infusion reactions to adeno-associated virus (AAV)-based gene therapy: Mechanisms, diagnostics, treatment and review of the literature
CLINICAL IMMUNOLOGY
2024; 262
View details for DOI 10.1016/j.clim.2024.110035
View details for Web of Science ID 001242487000001
-
Infusion reactions to adeno-associated virus (AAV)-based gene therapy: Mechanisms, diagnostics, treatment and review of the literature.
Journal of medical virology
2023; 95 (12): e29305
Abstract
The use of adeno-associated virus (AAV) vectors in gene therapy has demonstrated great potential in treating genetic disorders. However, infusion-associated reactions (IARs) pose a significant challenge to the safety and efficacy of AAV-based gene therapy. This review provides a comprehensive summary of the current understanding of IARs to AAV therapy, including their underlying mechanisms, clinical presentation, and treatment options. Toll-like receptor activation and subsequent production of pro-inflammatory cytokines are associated with IARs, stimulating neutralizing antibodies (Nabs) and T-cell responses that interfere with gene therapy. Risk factors for IARs include high titers of pre-existing Nabs, previous exposure to AAV, and specific comorbidities. Clinical presentation ranges from mild flu-like symptoms to severe anaphylaxis and can occur during or after AAV administration. There are no established guidelines for pre- and postadministration tests for AAV therapies, and routine laboratory requests are not standardized. Treatment options include corticosteroids, plasmapheresis, and supportive medications such as antihistamines and acetaminophen, but there is no consensus on the route of administration, dosage, and duration. This review highlights the inadequacy of current treatment regimens for IARs and the need for further research to improve the safety and efficacy of AAV-based gene therapy.
View details for DOI 10.1002/jmv.29305
View details for PubMedID 38116715
-
CXCL12 regulates coronary artery dominance in diverse populations and links development to disease.
medRxiv : the preprint server for health sciences
2023
Abstract
Mammalian cardiac muscle is supplied with blood by right and left coronary arteries that form branches covering both ventricles of the heart. Whether branches of the right or left coronary arteries wrap around to the inferior side of the left ventricle is variable in humans and termed right or left dominance. Coronary dominance is likely a heritable trait, but its genetic architecture has never been explored. Here, we present the first large-scale multi-ancestry genome-wide association study of dominance in 61,043 participants of the VA Million Veteran Program, including over 10,300 Africans and 4,400 Admixed Americans. Dominance was moderately heritable with ten loci reaching genome wide significance. The most significant mapped to the chemokine CXCL12 in both Europeans and Africans. Whole-organ imaging of human fetal hearts revealed that dominance is established during development in locations where CXCL12 is expressed. In mice, dominance involved the septal coronary artery, and its patterning was altered with Cxcl12 deficiency. Finally, we linked human dominance patterns with coronary artery disease through colocalization, genome-wide genetic correlation and Mendelian Randomization analyses. Together, our data supports CXCL12 as a primary determinant of coronary artery dominance in humans of diverse backgrounds and suggests that developmental patterning of arteries may influence one's susceptibility to ischemic heart disease.
View details for DOI 10.1101/2023.10.27.23297507
View details for PubMedID 37961706
View details for PubMedCentralID PMC10635223
-
Heat-stone massage for patients with chronic musculoskeletal pain: a protocol for multicenter randomized controlled trial.
Frontiers in medicine
2023; 10: 1215858
Abstract
Chronic musculoskeletal pain impairs the quality of life of approximately 1.71 billion people worldwide. Although pharmacological therapies play an important role in controlling chronic pain, overuse of opioids, persistent or recurrent symptoms, and pain-related disability burden still need to be addressed. Heat-stone massage uses heated stones to stimulate muscles and ligaments, followed by massage for relaxation, and can potentially treat chronic musculoskeletal pain. The efficacy and safety of heat-stone massage for patients with chronic musculoskeletal pain need to be determined. This multicenter, 2-arm, randomized, positive drug-controlled trial will include a total of 120 patients with chronic musculoskeletal pain. The intervention group will receive a 2-week heat-stone massage, 3 times per week, whereas the control group will receive the flurbiprofen plaster twice per day for 2 weeks. The primary end point is the change in Global Pain Scale from baseline to the end of the 2-week intervention. The secondary outcomes include pain severity (Numerical Rating Scale), pain acceptance (Chronic Pain Acceptance Questionnaire), self-management (Health Education Impact Questionnaire), self-efficacy (Pain Self-Efficacy Questionnaire), anxiety and depression (Hospital Anxiety and Depression Scale), and quality of life (Short Form-36). The intention-to-treat dataset will be used for analysis. Pain management remains a research topic that patients pay close attention to. This will be the first randomized clinical trial to evaluate whether heat-stone massage, a non-pharmacological therapy, is effective in chronic musculoskeletal pain management.
The results will provide evidence for a new option in daily practice. World Health Organization Chinese Clinical Trial Registry [ChiCTR2200065654; https://www.chictr.org.cn/showproj.html?proj=185403]; International Traditional Medicine Clinical Trial Registry [ITMCTR2022000104; http://itmctr.ccebtcm.org.cn/en-US/Home/ProjectView?pid=51776b6f-77b8-4811-9b5a-a0fec10f2cee].
View details for DOI 10.3389/fmed.2023.1215858
View details for PubMedID 37654653
View details for PubMedCentralID PMC10466406
-
Activation of GPR44 decreases severity of myeloid leukemia via specific targeting of leukemia initiating stem cells.
Cell reports
2023; 42 (7): 112794
Abstract
Relapse of acute myeloid leukemia (AML) remains a significant concern due to persistent leukemia-initiating stem cells (LICs) that are typically not targeted by most existing therapies. Using a murine AML model, human AML cell lines, and patient samples, we show that AML LICs are sensitive to endogenous and exogenous cyclopentenone prostaglandin-J (CyPG), Δ12-PGJ2, and 15d-PGJ2, which are increased upon dietary selenium supplementation via the cyclooxygenase-hematopoietic PGD synthase pathway. CyPGs are endogenous ligands for peroxisome proliferator-activated receptor gamma and GPR44 (CRTH2; PTGDR2). Deletion of GPR44 in a mouse model of AML exacerbated the disease suggesting that GPR44 activation mediates selenium-mediated apoptosis of LICs. Transcriptomic analysis of GPR44-/- LICs indicated that GPR44 activation by CyPGs suppressed KRAS-mediated MAPK and PI3K/AKT/mTOR signaling pathways, to enhance apoptosis. Our studies show the role of GPR44, providing mechanistic underpinnings of the chemopreventive and chemotherapeutic properties of selenium and CyPGs in AML.
View details for DOI 10.1016/j.celrep.2023.112794
View details for PubMedID 37459233
-
Dynamic assessment of the COVID-19 vaccine acceptance leveraging social media data
JOURNAL OF BIOMEDICAL INFORMATICS
2022; 129: 104054
Abstract
Vaccination is the most effective way to provide long-lasting immunity against viral infection; thus, rapid assessment of vaccine acceptance is a pressing challenge for health authorities. Prior studies have applied survey techniques to investigate vaccine acceptance, but these may be slow and expensive. This study investigates 29 million vaccine-related tweets from August 8, 2020 to April 19, 2021 and proposes a social media-based approach that derives a vaccine acceptance index (VAI) to quantify Twitter users' opinions on COVID-19 vaccination. This index is calculated based on opinion classifications identified with the aid of natural language processing techniques and provides a quantitative metric to indicate the level of vaccine acceptance across different geographic scales in the U.S. The VAI is easily calculated from the number of positive and negative tweets posted by specific users and groups of users; it can be compiled for regions such as counties or states to provide geospatial information; and it can be tracked over time to assess changes in vaccine acceptance as related to trends in the media and politics. At the national level, the VAI moved from negative to positive in 2020 and remained steady after January 2021. Through exploratory analysis of state- and county-level data, reliable assessments of VAI against subsequent vaccination rates could be made for counties with at least 30 users. The paper discusses information characteristics that enable consistent estimation of VAI. The findings support the use of social media to understand opinions and to offer a timely and cost-effective way to assess vaccine acceptance.
View details for DOI 10.1016/j.jbi.2022.104054
View details for Web of Science ID 000788753600001
View details for PubMedID 35331966
View details for PubMedCentralID PMC8935963
-
Novel EDGE encoding method enhances ability to identify genetic interactions
PLOS GENETICS
2021; 17 (6): e1009534
Abstract
Assumptions are made about the genetic model of single nucleotide polymorphisms (SNPs) when choosing a traditional genetic encoding: additive, dominant, and recessive. Furthermore, SNPs across the genome are unlikely to demonstrate identical genetic models. However, running SNP-SNP interaction analyses with every combination of encodings raises the multiple testing burden. Here, we present a novel and flexible encoding for genetic interactions, the elastic data-driven genetic encoding (EDGE), in which SNPs are assigned a heterozygous value based on the genetic model they demonstrate in a dataset prior to interaction testing. We assessed the power of EDGE to detect genetic interactions using 29 combinations of simulated genetic models and found it outperformed the traditional encoding methods across 10%, 30%, and 50% minor allele frequencies (MAFs). Further, EDGE maintained a low false-positive rate, while additive and dominant encodings demonstrated inflation. We evaluated EDGE and the traditional encodings with genetic data from the Electronic Medical Records and Genomics (eMERGE) Network for five phenotypes: age-related macular degeneration (AMD), age-related cataract, glaucoma, type 2 diabetes (T2D), and resistant hypertension. A multi-encoding genome-wide association study (GWAS) for each phenotype was performed using the traditional encodings, and the top results of the multi-encoding GWAS were considered for SNP-SNP interaction using the traditional encodings and EDGE. EDGE identified a novel SNP-SNP interaction for age-related cataract that no other method identified: rs7787286 (MAF: 0.041; intergenic region of chromosome 7)-rs4695885 (MAF: 0.34; intergenic region of chromosome 4) with a Bonferroni LRT p of 0.018. A SNP-SNP interaction was found in data from the UK Biobank within 25 kb of these SNPs using the recessive encoding: rs60374751 (MAF: 0.030) and rs6843594 (MAF: 0.34) (Bonferroni LRT p: 0.026). 
We recommend using EDGE to flexibly detect interactions between SNPs exhibiting diverse action.
View details for DOI 10.1371/journal.pgen.1009534
View details for Web of Science ID 000664356500001
View details for PubMedID 34086673
View details for PubMedCentralID PMC8208534
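The encodings contrasted in the abstract above can be sketched as lookup tables over the number of alternate-allele copies (0, 1, or 2). The additive, dominant, and recessive tables are the standard ones. The `edge_like` row is an assumption for illustration only: it places the heterozygote at a data-driven value between the two homozygotes, and `het_value` is a hypothetical parameter supplied by the caller, since the abstract does not state how EDGE estimates it.

```python
def encode(genotype, scheme, het_value=0.5):
    """Map an alternate-allele count (0, 1, 2) to a numeric code."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2")
    tables = {
        "additive": (0.0, 1.0, 2.0),         # each allele copy adds equally
        "dominant": (0.0, 1.0, 1.0),         # one copy acts like two
        "recessive": (0.0, 0.0, 1.0),        # effect needs two copies
        "edge_like": (0.0, het_value, 1.0),  # heterozygote value set per SNP (assumed form)
    }
    return tables[scheme][genotype]


# Encode the same three genotypes under each scheme.
for scheme in ("additive", "dominant", "recessive", "edge_like"):
    print(scheme, [encode(g, scheme, het_value=0.3) for g in (0, 1, 2)])
```

With `het_value` near 0 the flexible row behaves like the recessive table, and with `het_value` near 1 it behaves like the dominant table, which is why a single per-SNP value can stand in for testing every fixed encoding.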
-
Phenome-wide association studies on cardiovascular health and fatty acids considering phenotype quality control practices for epidemiological data.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2020; 25: 659-670
Abstract
Phenome-wide association studies (PheWAS) allow agnostic investigation of common genetic variants in relation to a variety of phenotypes but preserving the power of PheWAS requires careful phenotypic quality control (QC) procedures. While QC of genetic data is well-defined, no established QC practices exist for multi-phenotypic data. Manually imposing sample size restrictions, identifying variable types/distributions, and locating problems such as missing data or outliers is arduous in large, multivariate datasets. In this paper, we perform two PheWAS on epidemiological data and, utilizing the novel software CLARITE (CLeaning to Analysis: Reproducibility-based Interface for Traits and Exposures), showcase a transparent and replicable phenome QC pipeline which we believe is a necessity for the field. Using data from the Ludwigshafen Risk and Cardiovascular (LURIC) Health Study we ran two PheWAS, one on cardiac-related diseases and the other on polyunsaturated fatty acids levels. These phenotypes underwent a stringent quality control screen and were regressed on a genome-wide sample of single nucleotide polymorphisms (SNPs). Seven SNPs were significant in association with dihomo-γ-linolenic acid, of which five were within fatty acid desaturases FADS1 and FADS2. PheWAS is a useful tool to elucidate the genetic architecture of complex disease phenotypes within a single experimental framework. However, to reduce computational and multiple-comparisons burden, careful assessment of phenotype quality and removal of low-quality data is prudent. Herein we perform two PheWAS while applying a detailed phenotype QC process, for which we provide a replicable pipeline that is modifiable for application to other large datasets with heterogenous phenotypes. 
As investigation of complex traits continues beyond traditional genome-wide association studies (GWAS), such QC considerations and tools such as CLARITE are crucial in the analysis of non-genetic big data such as clinical measurements, lifestyle habits, and polygenic traits.
View details for PubMedID 31797636
-
Investigation of gene-gene interactions in cardiac traits and serum fatty acid levels in the LURIC Health Study
PLOS ONE
2020; 15 (9): e0238304
Abstract
Epistasis analysis elucidates the effects of gene-gene interactions (G×G) between multiple loci for complex traits. However, the large computational demands and the high multiple testing burden impede their discoveries. Here, we illustrate the utilization of two methods, main effect filtering based on individual GWAS results and biological knowledge-based modeling through Biofilter software, to reduce the number of interactions tested among single nucleotide polymorphisms (SNPs) for 15 cardiac-related traits and 14 fatty acids. We performed interaction analyses using the two filtering methods, adjusting for age, sex, body mass index (BMI), waist-hip ratio, and the first three principal components from genetic data, among 2,824 samples from the Ludwigshafen Risk and Cardiovascular (LURIC) Health Study. Using Biofilter, one interaction nearly met Bonferroni significance: an interaction between rs7735781 in XRCC4 and rs10804247 in XRCC5 was identified for venous thrombosis with a Bonferroni-adjusted likelihood ratio test (LRT) p: 0.0627. A total of 57 interactions were identified from main effect filtering for the cardiac traits G×G (10) and fatty acids G×G (47) at Bonferroni-adjusted LRT p < 0.05. For cardiac traits, the top interaction involved SNPs rs1383819 in SNTG1 and rs1493939 (138 kb from 5' of SAMD12) with Bonferroni-adjusted LRT p: 0.0228 which was significantly associated with history of arterial hypertension. For fatty acids, the top interaction between rs4839193 in KCND3 and rs10829717 in LOC107984002 with Bonferroni-adjusted LRT p: 2.28×10⁻⁵ was associated with 9-trans 12-trans octadecanoic acid, an omega-6 trans fatty acid. The model inflation factor for the interactions under different filtering methods was evaluated from the standard median and the linear regression approach. Here, we applied filtering approaches to identify numerous genetic interactions related to cardiac-related outcomes as potential targets for therapy.
The approaches described offer ways to detect epistasis in the complex traits and to improve precision medicine capability.
View details for DOI 10.1371/journal.pone.0238304
View details for Web of Science ID 000571887500145
View details for PubMedID 32915819
View details for PubMedCentralID PMC7485803
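The Bonferroni-adjusted p-values quoted in the abstract above follow the standard correction: multiply each raw p-value by the number of tests and cap the result at 1. The sketch below uses hypothetical raw p-values and a hypothetical function name `bonferroni_adjust`; it shows why filtering the set of tested interactions matters, since a smaller number of tests yields a smaller penalty.

```python
def bonferroni_adjust(p_values, n_tests=None):
    """Return Bonferroni-adjusted p-values: min(1, p * number of tests)."""
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values from three interaction tests.
raw = [0.01, 0.04, 0.5]
print(bonferroni_adjust(raw))

# The same smallest p-value under a much larger, unfiltered search space
# (1,000 tests) no longer passes a 0.05 threshold after adjustment.
print(bonferroni_adjust([0.01], n_tests=1000))
```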
-
Long Non-coding RNA TDRKH-AS1 Promotes Colorectal Cancer Cell Proliferation and Invasion Through the beta-Catenin Activated Wnt Signaling Pathway
FRONTIERS IN ONCOLOGY
2020; 10: 639
Abstract
Colorectal cancer (CRC) is a common cancer worldwide, with a low 5-year survival rate. Recently, long non-coding RNAs (lncRNAs) have been well-studied as oncogenes or tumor suppressors in multiple malignancies, including CRC. However, their biological functions and potential mechanisms in human cancer remain unclear. Here, we evaluated the expression of TDRKH-AS1 in CRC tissues and identified its potential targets. We found that TDRKH-AS1 is upregulated in the majority of CRC patients, and its upregulation is significantly correlated with malignant characteristics and dismal prognoses. High expression of TDRKH-AS1 can substantially promote cancer cell proliferation and invasion based on in vitro experiments. We also found that TDRKH-AS1 targets β-catenin in the Wnt signaling pathway to exert its carcinogenic activity. TDRKH-AS1 could serve as a promising prognostic predictor and a potential therapeutic target for further early diagnoses and treatments via a non-invasive method.
View details for DOI 10.3389/fonc.2020.00639
View details for Web of Science ID 000538517900001
View details for PubMedID 32670860
View details for PubMedCentralID PMC7326065
-
CLARITE Facilitates the Quality Control and Analysis Process for EWAS of Metabolic-Related Traits
FRONTIERS IN GENETICS
2019; 10: 1240
Abstract
While genome-wide association studies are an established method of identifying genetic variants associated with disease, environment-wide association studies (EWAS) highlight the contribution of nongenetic components to complex phenotypes. However, the lack of high-throughput quality control (QC) pipelines for EWAS data lends itself to analysis plans where the data are cleaned after a first-pass analysis, which can lead to bias, or are cleaned manually, which is arduous and susceptible to user error. We offer a novel software, CLeaning to Analysis: Reproducibility-based Interface for Traits and Exposures (CLARITE), as a tool to efficiently clean environmental data, perform regression analysis, and visualize results on a single platform through user-guided automation. It exists as both an R package and a Python package. Though CLARITE focuses on EWAS, it is intended to also improve the QC process for phenotypes and clinical lab measures for a variety of downstream analyses, including phenome-wide association studies and gene-environment interaction studies. With the goal of demonstrating the utility of CLARITE, we performed a novel EWAS in the National Health and Nutrition Examination Survey (NHANES) (N overall Discovery=9063, N overall Replication=9874) for body mass index (BMI) and over 300 environment variables post-QC, adjusting for sex, age, race, socioeconomic status, and survey year. The analysis used survey weights along with cluster and strata information in order to account for the complex survey design. Sixteen BMI results replicated at a Bonferroni corrected p < 0.05. The top replicating results were serum levels of γ-tocopherol (vitamin E) (Discovery Bonferroni p: 8.67×10⁻¹², Replication Bonferroni p: 2.70×10⁻⁹) and iron (Discovery Bonferroni p: 1.09×10⁻⁸, Replication Bonferroni p: 1.73×10⁻¹⁰). Results of this EWAS are important to consider for metabolic trait analysis, as BMI is tightly associated with these phenotypes.
As such, exposures predictive of BMI may be useful for covariate and/or interaction assessment of metabolic-related traits. CLARITE allows improved data quality for EWAS, gene-environment interactions, and phenome-wide association studies by establishing a high-throughput quality control infrastructure. Thus, CLARITE is recommended for studying the environmental factors underlying complex disease.
View details for DOI 10.3389/fgene.2019.01240
View details for Web of Science ID 000504982600001
View details for PubMedID 31921293
View details for PubMedCentralID PMC6930237