Dr. He received his PhD from the University of Michigan in 2016. Following postdoctoral training in biostatistics at Columbia University, he joined Stanford University as an assistant professor of neurology and of medicine in 2018. His research focuses on statistical genetics and the integrative analysis of omics data, with the aim of developing novel statistical and computational methodologies for the identification and interpretation of complex biological pathways involved in human diseases, particularly neurological disorders. His methodological interests include high-dimensional data analysis, correlated (longitudinal, familial) data analysis, and machine learning algorithms.
Honors & Awards
Rackham Pre-doctoral Fellowship Award, University of Michigan (2015)
Rackham Conference Travel Grant, University of Michigan (2013 - 2015)
Best Performance on the Qualifying Exam, University of Michigan (2013)
Ph.D., University of Michigan, Biostatistics (2016)
B.S., Tsinghua University, Mathematics and Physics (2010)
Multiple causal variants underlie genetic associations in humans.
Science (New York, N.Y.)
2022; 375 (6586): 1247-1254
Associations between genetic variation and traits are often in noncoding regions with strong linkage disequilibrium (LD), where a single causal variant is assumed to underlie the association. We applied a massively parallel reporter assay (MPRA) to functionally evaluate genetic variants in high, local LD for independent cis-expression quantitative trait loci (eQTL). We found that 17.7% of eQTLs exhibit more than one major allelic effect in tight LD. The detected regulatory variants were highly and specifically enriched for activating chromatin structures and allelic transcription factor binding. Integration of MPRA profiles with eQTL/complex trait colocalizations across 114 human traits and diseases identified causal variant sets demonstrating how genetic association signals can manifest through multiple, tightly linked causal variants.
View details for DOI 10.1126/science.abj5117
View details for PubMedID 35298243
Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies.
NPJ digital medicine
2021; 4 (1): 116
Labeling clinical data from electronic health records (EHR) in health systems requires extensive knowledge from human experts and painstaking review by clinicians. Furthermore, existing phenotyping algorithms are not uniformly applied across large datasets and can suffer from inconsistencies in case definitions across different algorithms. We describe here quantitative disease risk scores based on almost unsupervised methods that require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. We show applications to phenotypic data on approximately 100,000 individuals in eMERGE, and focus on several complex diseases, including Chronic Kidney Disease, Coronary Artery Disease, Type 2 Diabetes, Heart Failure, and a few others. We demonstrate that relative to existing approaches, the proposed methods have higher prediction accuracy, can better identify phenotypic features relevant to the disease under consideration, can perform better at clinical risk stratification, and can identify undiagnosed cases based on phenotypic features available in the EHR. Using genetic data from the eMERGE-seq panel that includes sequencing data for 109 genes on 21,363 individuals from multiple ethnicities, we also show how the new quantitative disease risk scores help improve the power of genetic association studies relative to the standard use of disease phenotypes. The results demonstrate the effectiveness of quantitative disease risk scores derived from rich phenotypic EHR databases to provide a more meaningful characterization of clinical risk for diseases of interest beyond the prevalent binary (case-control) classification.
View details for DOI 10.1038/s41746-021-00488-3
View details for PubMedID 34302027
Identification of putative causal loci in whole-genome sequencing data via knockoff statistics.
2021; 12 (1): 3152
The analysis of whole-genome sequencing studies is challenging due to the large number of rare variants in noncoding regions and the lack of natural units for testing. We propose a statistical method to detect and localize rare and common risk variants in whole-genome sequencing studies based on a recently developed knockoff framework. It can (1) prioritize causal variants over associations due to linkage disequilibrium thereby improving interpretability; (2) help distinguish the signal due to rare variants from shadow effects of significant common variants nearby; (3) integrate multiple knockoffs for improved power, stability, and reproducibility; and (4) flexibly incorporate state-of-the-art and future association tests to achieve the benefits proposed here. In applications to whole-genome sequencing data from the Alzheimer's Disease Sequencing Project (ADSP) and COPDGene samples from NHLBI Trans-Omics for Precision Medicine (TOPMed) Program we show that our method compared with conventional association tests can lead to substantially more discoveries.
View details for DOI 10.1038/s41467-021-22889-4
View details for PubMedID 34035245
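As a rough illustration of the knockoff idea summarized in the abstract above, the sketch below shows the standard single-knockoff "knockoff+" selection rule. It is not the multiple-knockoff procedure the paper proposes. It assumes each variant j already has a statistic W_j, defined as the importance score of the original variant minus that of its synthetic knockoff copy, so that a large positive W_j suggests a genuine signal. The function names and the toy values are hypothetical and chosen only for illustration.

```python
from typing import List


def knockoff_plus_threshold(w: List[float], q: float) -> float:
    """Return the smallest t with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    Returns infinity when no candidate threshold meets the target FDR level q.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of the statistics.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy count of false discoveries
        positives = sum(1 for x in w if x >= t)   # variants that would be selected
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def select_variants(w: List[float], q: float) -> List[int]:
    """Indices of variants whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Toy statistics (hypothetical): mostly positive values plus two small negatives.
toy_w = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 0.2, 2.0, 1.5, -0.1]
selected = select_variants(toy_w, q=0.25)
```

The negative statistics act as a built-in estimate of how many positive ones are false, which is why a stricter target level can leave nothing selected when there are only a handful of variants.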
Genome-wide analysis of common and rare variants via multiple knockoffs at biobank scale, with an application to Alzheimer disease genetics.
American journal of human genetics
Knockoff-based methods have become increasingly popular due to their enhanced power for locus discovery and their ability to prioritize putative causal variants in a genome-wide analysis. However, because of the substantial computational cost for generating knockoffs, existing knockoff approaches cannot analyze millions of rare genetic variants in biobank-scale whole-genome sequencing and whole-genome imputed datasets. We propose a scalable knockoff-based method for the analysis of common and rare variants across the genome, KnockoffScreen-AL, that is applicable to biobank-scale studies with hundreds of thousands of samples and millions of genetic variants. The application of KnockoffScreen-AL to the analysis of Alzheimer disease (AD) in 388,051 WG-imputed samples from the UK Biobank resulted in 31 significant loci, including 14 loci that are missed by conventional association tests on these data. We perform replication studies in an independent meta-analysis of clinically diagnosed AD with 94,437 samples, and additionally leverage single-cell RNA-sequencing data with 143,793 single-nucleus transcriptomes from 17 control subjects and AD-affected individuals, and proteomics data from 735 control subjects and affected individuals with AD and related disorders to validate the genes at these significant loci. These multi-omics analyses show that 79.1% of the proximal genes at these loci and 76.2% of the genes at loci identified only by KnockoffScreen-AL exhibit at least suggestive signal (p < 0.05) in the scRNA-seq or proteomics analyses. We highlight a potentially causal gene in AD progression, EGFR, that shows significant differences in expression and protein levels between AD-affected individuals and healthy control subjects.
View details for DOI 10.1016/j.ajhg.2021.10.009
View details for PubMedID 34767756
Powerful gene-based testing by integrating long-range chromatin interactions and knockoff genotypes.
Proceedings of the National Academy of Sciences of the United States of America
2021; 118 (47)
Gene-based tests are valuable techniques for identifying genetic factors in complex traits. Here, we propose a gene-based testing framework that incorporates data on long-range chromatin interactions, several recent technical advances for region-based tests, and leverages the knockoff framework for synthetic genotype generation for improved gene discovery. Through simulations and applications to genome-wide association studies (GWAS) and whole-genome sequencing data for multiple diseases and traits, we show that the proposed test increases the power over state-of-the-art gene-based tests in the literature, identifies genes that replicate in larger studies, and can provide a more narrow focus on the possible causal genes at a locus by reducing the confounding effect of linkage disequilibrium. Furthermore, our results show that incorporating genetic variation in distal regulatory elements tends to improve power over conventional tests. Results for UK Biobank and BioBank Japan traits are also available in a publicly accessible database that allows researchers to query gene-based results in an easy fashion.
View details for DOI 10.1073/pnas.2105191118
View details for PubMedID 34799441
A genome-wide scan statistic framework for whole-genome sequence data analysis.
2019; 10 (1): 3018
The analysis of whole-genome sequencing studies is challenging due to the large number of noncoding rare variants, our limited understanding of their functional effects, and the lack of natural units for testing. Here we propose a scan statistic framework, WGScan, to simultaneously detect the existence, and estimate the locations of association signals at genome-wide scale. WGScan can analytically estimate the significance threshold for a whole-genome scan; utilize summary statistics for a meta-analysis; incorporate functional annotations for enhanced discoveries in noncoding regions; and enable enrichment analyses using genome-wide summary statistics. Based on the analysis of whole genomes of 1,786 phenotypically discordant sibling pairs from the Simons Simplex Collection study for autism spectrum disorders, we derive genome-wide significance thresholds for whole genome sequencing studies and detect significant enrichments of regions showing associations with autism in promoter regions, functional categories related to autism, and enhancers predicted to regulate expression of autism associated genes.
View details for DOI 10.1038/s41467-019-11023-0
View details for PubMedID 31289270
A semi-supervised approach for predicting cell-type specific functional consequences of non-coding variation using MPRAs.
2018; 9 (1): 5199
Predicting the functional consequences of genetic variants in non-coding regions is a challenging problem. We propose here a semi-supervised approach, GenoNet, to jointly utilize experimentally confirmed regulatory variants (labeled variants), millions of unlabeled variants genome-wide, and more than a thousand cell/tissue type specific epigenetic annotations to predict functional consequences of non-coding variants. Through the application to several experimental datasets, we demonstrate that the proposed method significantly improves prediction accuracy compared to existing functional prediction methods at the tissue/cell type level, but especially so at the organism level. Importantly, we illustrate how the GenoNet scores can help in fine-mapping at GWAS loci, and in the discovery of disease associated genes in sequencing studies. As more comprehensive lists of experimentally validated variants become available over the next few years, semi-supervised methods like GenoNet can be used to provide increasingly accurate functional predictions for variants genome-wide and across a variety of cell/tissue types.
View details for PubMedID 30518757
Unified Sequence-Based Association Tests Allowing for Multiple Functional Annotations and Meta-analysis of Noncoding Variation in Metabochip Data
AMERICAN JOURNAL OF HUMAN GENETICS
2017; 101 (3): 340–52
Substantial progress has been made in the functional annotation of genetic variation in the human genome. Integrative analysis that incorporates such functional annotations into sequencing studies can aid the discovery of disease-associated genetic variants, especially those with unknown function and located outside protein-coding regions. Direct incorporation of one functional annotation as weight in existing dispersion and burden tests can suffer substantial loss of power when the functional annotation is not predictive of the risk status of a variant. Here, we have developed unified tests that can utilize multiple functional annotations simultaneously for integrative association analysis with efficient computational techniques. We show that the proposed tests significantly improve power when variant risk status can be predicted by functional annotations. Importantly, when functional annotations are not predictive of risk status, the proposed tests incur only minimal loss of power in relation to existing dispersion and burden tests, and under certain circumstances they can even have improved power by learning a weight that better approximates the underlying disease model in a data-adaptive manner. The tests can be constructed with summary statistics of existing dispersion and burden tests for sequencing data, therefore allowing meta-analysis of multiple studies without sharing individual-level data. We applied the proposed tests to a meta-analysis of noncoding rare variants in Metabochip data on 12,281 individuals from eight studies for lipid traits. By incorporating the Eigen functional score, we detected significant associations between noncoding rare variants in SLC22A3 and low-density lipoprotein and total cholesterol, associations that are missed by standard dispersion and burden tests.
View details for PubMedID 28844485
View details for PubMedCentralID PMC5590864
Set-Based Tests for the Gene-Environment Interaction in Longitudinal Studies
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2017; 112 (519): 966–78
We propose a generalized score type test for set-based inference for gene-environment interaction with longitudinally measured quantitative traits. The test is robust to misspecification of within subject correlation structure and has enhanced power compared to existing alternatives. Unlike tests for marginal genetic association, set-based tests for gene-environment interaction face the challenges of a potentially misspecified and high-dimensional main effect model under the null hypothesis. We show that our proposed test is robust to main effect misspecification of environmental exposure and genetic factors under the gene-environment independence condition. When genetic and environmental factors are dependent, the method of sieves is further proposed to eliminate potential bias due to a misspecified main effect of a continuous environmental exposure. A weighted principal component analysis approach is developed to perform dimension reduction when the number of genetic variants in the set is large relative to the sample size. The methods are motivated by an example from the Multi-Ethnic Study of Atherosclerosis (MESA), investigating interaction between measures of neighborhood environment and genetic regions on longitudinal measures of blood pressure over a study period of about seven years with 4 exams.
View details for PubMedID 29780190
View details for PubMedCentralID PMC5954413
Set-Based Tests for Genetic Association in Longitudinal Studies
2015; 71 (3): 606–15
Genetic association studies with longitudinal markers of chronic diseases (e.g., blood pressure, body mass index) provide a valuable opportunity to explore how genetic variants affect traits over time by utilizing the full trajectory of longitudinal outcomes. Since these traits are likely influenced by the joint effect of multiple variants in a gene, a joint analysis of these variants considering linkage disequilibrium (LD) may help to explain additional phenotypic variation. In this article, we propose a longitudinal genetic random field model (LGRF), to test the association between a phenotype measured repeatedly during the course of an observational study and a set of genetic variants. Generalized score type tests are developed, which we show are robust to misspecification of within-subject correlation, a feature that is desirable for longitudinal analysis. In addition, a joint test incorporating gene-time interaction is further proposed. Computational advancement is made for scalable implementation of the proposed methods in large-scale genome-wide association studies (GWAS). The proposed methods are evaluated through extensive simulation studies and illustrated using data from the Multi-Ethnic Study of Atherosclerosis (MESA). Our simulation results indicate substantial gain in power using LGRF when compared with two commonly used existing alternatives: (i) single marker tests using longitudinal outcome and (ii) existing gene-based tests using the average value of repeated measurements as the outcome.
View details for PubMedID 25854837
View details for PubMedCentralID PMC4601568
Modeling and Testing for Joint Association Using a Genetic Random Field Model
2014; 70 (3): 471–79
Substantial progress has been made in identifying single genetic variants predisposing to common complex diseases. Nonetheless, the genetic etiology of human diseases remains largely unknown. Human complex diseases are likely influenced by the joint effect of a large number of genetic variants instead of a single variant. The joint analysis of multiple genetic variants considering linkage disequilibrium (LD) and potential interactions can further enhance the discovery process, leading to the identification of new disease-susceptibility genetic variants. Motivated by development in spatial statistics, we propose a new statistical model based on the random field theory, referred to as a genetic random field model (GenRF), for joint association analysis with the consideration of possible gene-gene interactions and LD. Using a pseudo-likelihood approach, a GenRF test for the joint association of multiple genetic variants is developed, which has the following advantages: (1) accommodating complex interactions for improved performance; (2) natural dimension reduction; (3) boosting power in the presence of LD; and (4) computationally efficient. Simulation studies are conducted under various scenarios. The development has been focused on quantitative traits and robustness of the GenRF test to other traits, for example, binary traits, is also discussed. Compared with a commonly adopted kernel machine approach, SKAT, as well as other more standard methods, GenRF shows overall comparable performance and better performance in the presence of complex interactions. The method is further illustrated by an application to the Dallas Heart Study.
View details for PubMedID 24628067
Precision Care in Cardiac Arrest: ICECAP (PRECICECAP) Study Protocol and Informatics Approach.
BACKGROUND: Most trials in critical care have been neutral, in part because between-patient heterogeneity means not all patients respond identically to the same treatment. The Precision Care in Cardiac Arrest: Influence of Cooling duration on Efficacy in Cardiac Arrest Patients (PRECICECAP) study will apply machine learning to high-resolution, multimodality data collected from patients resuscitated from out-of-hospital cardiac arrest. We aim to discover novel biomarker signatures to predict the optimal duration of therapeutic hypothermia and 90-day functional outcomes. In parallel, we are developing a freely available software platform for standardized curation of intensive care unit-acquired data for machine learning applications.
METHODS: The Influence of Cooling duration on Efficacy in Cardiac Arrest Patients (ICECAP) study is a response-adaptive, dose-finding trial testing different durations of therapeutic hypothermia. Twelve ICECAP sites will collect data for PRECICECAP from multiple modalities routinely used after out-of-hospital cardiac arrest, including ICECAP case report forms, detailed medication data, cardiopulmonary and electroencephalographic waveforms, and digital imaging and communications in medicine files (DICOMs). We partnered with Moberg Analytics to develop a freely available software platform to allow high-resolution critical care data to be used efficiently and effectively. We will use an autoencoder neural network to create low-dimensional representations of all raw waveforms and derivative features, censored at rewarming to ensure clinical usability to guide optimal duration of hypothermia. We will also consider simple features that are historically considered to be important. Finally, we will create a supervised deep learning neural network algorithm to directly predict 90-day functional outcome from large sets of novel features.
RESULTS: PRECICECAP is currently enrolling and will be completed in late 2025.
CONCLUSIONS: Cardiac arrest is a heterogeneous disease that causes substantial morbidity and mortality. PRECICECAP will advance the overarching goal of titrating personalized neurocritical care on the basis of robust measures of individual need and treatment responsiveness. The software platform we develop will be broadly applicable to hospital-based research after acute illness or injury.
View details for DOI 10.1007/s12028-022-01464-9
View details for PubMedID 35229231
Challenges at the APOE locus: a robust quality control approach for accurate APOE genotyping.
Alzheimer's research & therapy
2022; 14 (1): 22
Genetic variants within the APOE locus may modulate Alzheimer's disease (AD) risk independently or in conjunction with APOE*2/3/4 genotypes. Identifying such variants and mechanisms would importantly advance our understanding of APOE pathophysiology and provide critical guidance for AD therapies aimed at APOE. The APOE locus however remains relatively poorly understood in AD, owing to multiple challenges that include its complex linkage structure and uncertainty in APOE*2/3/4 genotype quality. Here, we present a novel APOE*2/3/4 filtering approach and showcase its relevance on AD risk association analyses for the rs439401 variant, which is located 1801 base pairs downstream of APOE and has been associated with a potential regulatory effect on APOE. We used thirty-two AD-related cohorts, with genetic data from various high-density single-nucleotide polymorphism microarrays, whole-genome sequencing, and whole-exome sequencing. Study participants were filtered to be ages 60 and older, non-Hispanic, of European ancestry, and diagnosed as cognitively normal or AD (n = 65,701). Primary analyses investigated AD risk in APOE*4/4 carriers. Additional supporting analyses were performed in APOE*3/4 and 3/3 strata. Outcomes were compared under two different APOE*2/3/4 filtering approaches. Using more conventional APOE*2/3/4 filtering criteria (approach 1), we showed that, when in-phase with APOE*4, rs439401 was variably associated with protective effects on AD case-control status. However, when applying a novel filter that increases the certainty of the APOE*2/3/4 genotypes by applying more stringent criteria for concordance between the provided APOE genotype and imputed APOE genotype (approach 2), we observed that all significant effects were lost. We showed that careful consideration of APOE genotype and appropriate sample filtering were crucial to robustly interrogate the role of the APOE locus on AD risk. Our study presents a novel APOE filtering approach and provides important guidelines for research into the APOE locus, as well as for elucidating genetic interaction effects with APOE*2/3/4.
View details for DOI 10.1186/s13195-022-00962-4
View details for PubMedID 35120553
Sex-heterogenous effect on Alzheimer's disease risk at the BIN1 locus.
Alzheimer's & dementia : the journal of the Alzheimer's Association
2021; 17 Suppl 3: e053616
BACKGROUND: Among Alzheimer's Disease (AD) tier 1 genes, BIN1 shows the greatest sex-biased expression in GTEx RNASeq, notably in brain tissues. Fine-mapping studies suggest that the BIN1 locus harbors at least two independent risk variants.
METHOD: We considered a region ±200kb around BIN1 and performed sex-stratified analyses to identify genome-wide significant variants with a sex-heterogenous effect in imputed data from the AD Genetics Consortium. We ran conditional analyses on rs6733839 to show that variants with sex-heterogenous effects were independent from the lead variant at this locus. Additionally, we performed sex- and rs6733839-genotype-stratified analyses to understand which haplotype drives this sex-heterogenous effect on AD risk and on BIN1 expression in brain tissue from the ROSMAP study.
RESULT: Rs10200967 has a significant sex-heterogenous effect on AD risk and is genome-wide significant in females but not males (Table 1). In the conditional analysis the association remains significant (p_female = 6.5 × 10^-3, Table 2). The linkage disequilibrium between these two variants is low (r^2 = 0.12). The protective association of rs10200967 is strongest in females homozygous for the major allele of rs6733839 (p = 1.1 × 10^-3). Among individuals homozygous for the major allele of rs6733839, the effect of the interaction between rs10200967 dosage and sex on AD risk is significant (p = 3.2 × 10^-3). In the full sample, the three-way interaction between these two variants and sex is significant (p = 0.021, Table 3). The rs10200967 minor allele is associated with an increased expression in GTEx (p = 6.0 × 10^-15, Figure 1) and ROSMAP (p = 9.1 × 10^-3, Table 4). Among rs6733839 reference allele homozygotes, the rs10200967 interaction with sex on BIN1 expression is significant (p = 0.0495). In the full ROSMAP sample, the three-way interaction is trending significant (p = 0.062, Table 5). Interestingly, rs10200967 is located in a histone peak and a start-exon of a BIN1 transcript (Figure 1), reinforcing its putative regulatory role.
CONCLUSION: Our sex- and rs6733839-genotype-stratified analyses demonstrate that rs10200967 at the BIN1 locus is genome-wide significant, with a sex-heterogenous effect on AD risk and on BIN1 expression. These results support the growing consensus that there are two separate signals at the locus and suggest that rs10200967 contributes to the signal independent of rs6733839.
View details for DOI 10.1002/alz.053616
View details for PubMedID 35108924
APOE*4-stratified genome-wide association study of Alzheimer's disease in over 350,000 individuals.
Alzheimer's & dementia : the journal of the Alzheimer's Association
2021; 17 Suppl 3: e055905
BACKGROUND: APOE*4 is the strongest genetic risk factor for late-onset Alzheimer's disease (AD) and is highly pleiotropic, such that it may be considered as a biological factor that can affect overall genetic risk for AD. To advance our understanding of the genetic architecture of AD, we sought to perform the largest APOE*4-stratified genome-wide association study (GWAS) of AD.
METHOD: Twenty-five publicly available AD GWAS datasets provided case-control diagnoses for phase-1 samples (imputed to the HRC r1.1 reference panel). The UK Biobank provided subjects with family history of AD status, transformed into an AD phenotype as described previously (Jansen et al., 2019) for phase-2 samples. Linear mixed model regressions were performed on case-control status (LMM-BOLT v.2.3.4), adjusting for age (age-at-onset in cases; age-at-last-exam in controls), sex, APOE*4 and APOE*2 dosage, the first 12 genetic principal components, array/batch, cohort in phase-1, and assessment center in phase-2. In phase-3, phase-1 and phase-2 findings were combined using multivariate genome-wide meta-analysis (Jansen et al., 2019). APOE*4+ heterogeneity tests were evaluated per phase and meta-analyzed in phase-3.
RESULT: Participant demographics are in Table 1. Combining results from both APOE*4-stratified analyses, 106 lead variants across 98 loci passed suggestive significance (p < 10^-5; Figure 1). Although most variants reached only suggestive significance, 28 loci were previously reported at genome-wide significance in large-scale GWAS of AD (Kunkle et al., 2019, Jansen et al., 2019, Bellenguez et al., 2020), supporting that we identified potentially relevant AD loci. APOE*4-stratified effects were observed for 28 variants/loci covered across both phase-1 and phase-2 (N_APOE4+ = 19; N_APOE4- = 9; Table 2), and 25 variants/loci seen only in phase-2 (N_APOE4+ = 17; N_APOE4- = 8; Table 3). Notably, a genome-wide significant APOE*4+ heterogeneity effect was observed for the USP17L13 locus (a regulator of deubiquitination), while PPP1R12A, BRINP1, PCBD1, and SESN2 loci passed suggestive significance.
CONCLUSION: Our findings revealed novel AD risk loci/genes and characterized which of these were associated with AD risk differentially across APOE*4 status. This contributes highly to personalized genetic medicine and paves the way towards new potential AD drug targets. Ongoing work is adding samples for phase-1 analyses (imputing data to TOPMed) and pursuing both multi-omics and AD endophenotype validation efforts for variant prioritization.
View details for DOI 10.1002/alz.055905
View details for PubMedID 35108901
A Text-Based Intervention to Promote Literacy: An RCT.
BACKGROUND AND OBJECTIVES: Children entering kindergarten ready to learn are more likely to thrive. Inequitable access to high-quality, early educational settings creates early educational disparities. TipsByText, a text-message-based program for caregivers of young children, improves literacy of children in preschool, but efficacy for families without access to early childhood education was unknown.
METHODS: We conducted a randomized controlled trial with caregivers of 3- and 4-year-olds in 2 public pediatric clinics. Intervention caregivers received TipsByText 3 times a week for 7 months. At pre- and postintervention, we measured child literacy using the Phonological Awareness Literacy Screening Tool (PALS-PreK) and caregiver involvement using the Parent Child Interactivity Scale (PCI). We estimated effects on PALS-PreK and PCI using multivariable linear regression.
RESULTS: We enrolled 644 families, excluding 263 because of preschool participation. Compared with excluded children, those included in the study had parents with lower income and educational attainment and who were more likely to be Spanish speaking. Three-quarters of enrollees completed pre- and postintervention assessments. Postintervention PALS-PreK scores revealed an unadjusted treatment effect of 0.260 (P = .040); adjusting for preintervention score, child age, and caregiver language, treatment effect was 0.209 (P = .016), equating to 3 months of literacy gains. Effects were greater for firstborn children (0.282 vs 0.178), children in 2-parent families (0.262 vs 0.063), and 4-year-olds (0.436 vs 0.107). The overall effect on PCI was not significant (1.221, P = .124).
CONCLUSIONS: The health sector has unique access to difficult-to-reach young children. With this clinic-based texting intervention, we reached underresourced families and increased child literacy levels.
View details for DOI 10.1542/peds.2020-049648
View details for PubMedID 34544847
Multitrait GWAS to connect disease variants and biological mechanisms.
2021; 17 (8): e1009713
Genome-wide association studies (GWASs) have uncovered a wealth of associations between common variants and human phenotypes. Here, we present an integrative analysis of GWAS summary statistics from 36 phenotypes to decipher multitrait genetic architecture and its link with biological mechanisms. Our framework incorporates multitrait association mapping along with an investigation of the breakdown of genetic associations into clusters of variants harboring similar multitrait association profiles. Focusing on two subsets of immunity and metabolism phenotypes, we then demonstrate how genetic variants within clusters can be mapped to biological pathways and disease mechanisms. Finally, for the metabolism set, we investigate the link between gene cluster assignment and the success of drug targets in randomized controlled trials.
View details for DOI 10.1371/journal.pgen.1009713
View details for PubMedID 34460823
Do Steroids Matter? A Retrospective Review of Premedication for Taxane Chemotherapy and Hypersensitivity Reactions.
Journal of clinical oncology : official journal of the American Society of Clinical Oncology
PURPOSE: Despite the widespread use of the taxanes paclitaxel and docetaxel for a variety of cancers and their well-known association with hypersensitivity reactions (HSRs), there is still significant variation in the prescribing practices of steroids for premedication. Premedication almost always includes dexamethasone, which can be associated with multiple adverse effects if taken for extended periods of time. This study reviews the pattern of steroid premedication in patients who received paclitaxel or docetaxel at Stanford Cancer Institute between January 2010 and June 2020. METHODS: We used an electronic query of the electronic medical record followed up with a manual review of patient charts to ask whether we could find a correlation between steroid premedication dosing and the incidence or severity of HSRs with the first taxane dose. Variables considered included steroid dose and route, dose and type of taxane, clinical cancer group, sex, and race. RESULTS: Five thousand two hundred seventeen patients were identified as having received paclitaxel or docetaxel, and 3,181 met criteria for our analysis. There were 264 (8.3%) HSRs. In adjusted multivariate analysis, we found no correlation of HSR rate or severity among any of the variables evaluated except gynecology oncology clinic patients, who had an increased risk (hazard ratio [HR] 1.34) of HSRs overall and high-grade HSRs (HR 2.34), and female patients, who had a higher rate of HSRs overall (HR 1.26), but not high-grade HSRs. CONCLUSION: Neither dexamethasone dose nor route correlated with subsequent HSRs. Given the potential for adverse events from repeated high-dose steroids, our findings suggest that routine use of lower doses, such as a single 10 mg dose of dexamethasone, as premedication for taxanes to prevent HSRs is preferable to the current prescribing guidelines.
View details for DOI 10.1200/JCO.21.01200
View details for PubMedID 34357780
- Advances and challenges in quantitative delineation of the genetic architecture of complex traits QUANTITATIVE BIOLOGY 2021; 9 (2): 168-184
A novel age-informed approach for genetic association analysis in Alzheimer's disease.
Alzheimer's research & therapy
2021; 13 (1): 72
BACKGROUND: Many Alzheimer's disease (AD) genetic association studies disregard age or incorrectly account for it, hampering variant discovery. METHODS: Using simulated data, we compared the statistical power of several models: logistic regression on AD diagnosis adjusted and not adjusted for age; linear regression on a score integrating case-control status and age; and multivariate Cox regression on age-at-onset. We applied these models to real exome-wide data of 11,127 sequenced individuals (54% cases) and replicated suggestive associations in 21,631 genotype-imputed individuals (51% cases). RESULTS: Modeling variable AD risk across age results in 5-10% statistical power gain compared to logistic regression without age adjustment, while incorrect age adjustment leads to critical power loss. Applying our novel AD-age score and/or Cox regression, we discovered and replicated novel variants associated with AD on KIF21B, USH2A, RAB10, RIN3, and TAOK2 genes. CONCLUSION: Our AD-age score provides a simple means for statistical power gain and is recommended for future AD studies.
View details for DOI 10.1186/s13195-021-00808-5
View details for PubMedID 33794991
Administration of Dexamethasone for Bacterial Meningitis: An Unreliable Quality Measure.
2021; 11 (2): 101-106
To validate the use of administrative data to identify patients with bacterial meningitis and quantify the rate of dexamethasone administration as defined in the American Academy of Neurology Inpatient and Emergency Care Quality Measurement Set. The Vizient Clinical Data Base and Resource Manager was used to identify patients with International Classification of Diseases, Tenth Revision (ICD-10) codes for bacterial meningitis from October 2015 to June 2019. Chart review was performed on patients identified at a single quaternary-care hospital. The positive predictive value (PPV) of Vizient was determined. Demographic, clinical, and laboratory data were assessed using descriptive statistics. Of all hospitals that submitted complete data to Vizient during the study period, a median of 19 patients per hospital had ICD-10 codes for bacterial meningitis in the 45-month period. We identified 79 patients using Vizient at our institution of whom 69 had a diagnosis of bacterial meningitis confirmed by chart review (PPV = 87%). 15 patients were eligible to receive dexamethasone per the quality measurement set. Six of these patients (40%) received dexamethasone. It is feasible to use the Vizient Clinical Data Base and Resource Manager to identify patients with bacterial meningitis. Due to low prevalence across multiple institutions and high rate of exclusion criteria at our institution, this study suggests that the rate of dexamethasone administration in bacterial meningitis may be an unreliable indicator of quality of care provided by inpatient neurologists. The creation of a registry for hospitalized neurology patients could enhance development of future quality measures.
View details for DOI 10.1177/1941874420969556
View details for PubMedID 33791051
View details for PubMedCentralID PMC7958681
KLVS heterozygosity reduces brain amyloid in asymptomatic at-risk APOE4 carriers.
Neurobiology of aging
2021; 101: 123–29
KLOTHOVS heterozygosity (KLVSHET+) was recently shown to be associated with reduced risk of Alzheimer's disease (AD) in APOE4 carriers. Additional studies suggest that KLVSHET+ protects against amyloid burden in cognitively normal older subjects, but sample sizes were too small to draw definitive conclusions. We performed a well-powered meta-analysis across 5 independent studies, comprising 3581 pre-clinical participants ages 60-80, to investigate whether KLVSHET+ reduces the risk of having an amyloid-positive positron emission tomography scan. Analyses were stratified by APOE4 status. KLVSHET+ reduced the risk of amyloid positivity in APOE4 carriers (odds ratio = 0.67 [0.52-0.88]; p = 3.5 × 10^-3), but not in APOE4 non-carriers (odds ratio = 0.94 [0.73-1.21]; p = 0.63). The combination of APOE4 and KLVS genotypes should help enrich AD clinical trials for pre-symptomatic subjects at increased risk of developing amyloid aggregation and AD. KL-related pathways may help elucidate protective mechanisms against amyloid accumulation and merit exploration for novel AD drug targets. Future investigation of the biological mechanisms by which KL interacts with APOE4 and AD is warranted.
View details for DOI 10.1016/j.neurobiolaging.2021.01.008
View details for PubMedID 33610961
Treatment Practices and Outcomes in Continuous Spike and Wave During Slow Wave Sleep (CSWS): A Multicenter Collaboration.
The Journal of pediatrics
To determine how Continuous Spike and Wave during Slow Wave Sleep (CSWS) is currently managed and to compare the effectiveness of current treatment strategies using a database from 11 pediatric epilepsy centers in the United States. This retrospective study gathered information on baseline clinical characteristics, CSWS etiology, and treatment(s) in consecutive patients seen between 2014-2016 at 11 epilepsy referral centers. Treatments were categorized as benzodiazepines, steroids, other antiseizure medications (ASMs), or other therapies. Two measures of treatment response [clinical improvement as noted by the treating physician; and EEG improvement] were compared across therapies, controlling for baseline variables. 81 children underwent 153 treatment trials during the study period (68 trials of benzodiazepines, 25 of steroids, 45 of ASMs, 14 of other therapies). Children most frequently received benzodiazepines (62%) or ASMs (27%) as first line therapy. Treatment choice did not differ based on baseline clinical variables, nor did these variables correlate with outcome. After adjusting for baseline variables, children had a greater odds of clinical improvement with benzodiazepines (OR 3.32, 95% CI 1.57-7.04, P = .002) or steroids (OR 4.04, 95% CI 1.41-11.59, P = .01) than with ASMs and a greater odds of EEG improvement after steroids (OR 3.36, 95% CI 1.09-10.33, P = .03) than after ASMs. Benzodiazepines and ASMs are the most frequent initial therapy prescribed for CSWS in the United States. Our data suggest that ASMs are inferior to benzodiazepines and steroids and support earlier use of these therapies. Multicenter prospective studies that rigorously assess treatment protocols and outcomes are needed.
View details for DOI 10.1016/j.jpeds.2021.01.032
View details for PubMedID 33484700
Generalizable Sample-Efficient Siamese Autoencoder for Tinnitus Diagnosis in Listeners With Subjective Tinnitus
IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING
2021; 29: 1452-1461
Electroencephalogram (EEG)-based neurofeedback has been widely studied for tinnitus therapy in recent years. Most existing research relies on experts' cognitive prediction, and studies based on machine learning and deep learning are either data-hungry or not well generalizable to new subjects. In this paper, we propose a robust, data-efficient model for distinguishing tinnitus from the healthy state based on EEG-based tinnitus neurofeedback. We propose trend descriptor, a feature extractor with lower fineness, to reduce the effect of electrode noises on EEG signals, and a siamese encoder-decoder network boosted in a supervised manner to learn accurate alignment and to acquire high-quality transferable mappings across subjects and EEG signal channels. Our experiments show the proposed method significantly outperforms state-of-the-art algorithms when analyzing subjects' EEG neurofeedback to 90dB and 100dB sound, achieving an accuracy of 91.67%-94.44% in predicting tinnitus and control subjects in a subject-independent setting. Our ablation studies on mixed subjects and parameters show the method's stability in performance.
View details for DOI 10.1109/TNSRE.2021.3095298
View details for Web of Science ID 000678331300009
View details for PubMedID 34232883
An evolutionarily acquired microRNA shapes development of mammalian cortical projections.
Proceedings of the National Academy of Sciences of the United States of America
The corticospinal tract is unique to mammals and the corpus callosum is unique to placental mammals (eutherians). The emergence of these structures is thought to underpin the evolutionary acquisition of complex motor and cognitive skills. Corticospinal motor neurons (CSMN) and callosal projection neurons (CPN) are the archetypal projection neurons of the corticospinal tract and corpus callosum, respectively. Although a number of conserved transcriptional regulators of CSMN and CPN development have been identified in vertebrates, none are unique to mammals and most are coexpressed across multiple projection neuron subtypes. Here, we discover 17 CSMN-enriched microRNAs (miRNAs), 15 of which map to a single genomic cluster that is exclusive to eutherians. One of these, miR-409-3p, promotes CSMN subtype identity in part via repression of LMO4, a key transcriptional regulator of CPN development. In vivo, miR-409-3p is sufficient to convert deep-layer CPN into CSMN. This is a demonstration of an evolutionarily acquired miRNA in eutherians that refines cortical projection neuron subtype development. Our findings implicate miRNAs in the eutherians' increase in neuronal subtype and projection diversity, the anatomic underpinnings of their complex behavior.
View details for DOI 10.1073/pnas.2006700117
View details for PubMedID 33139574
- Administration of Dexamethasone for Bacterial Meningitis: An Unreliable Quality Measure NEUROHOSPITALIST 2020
Benchmarking Performance on Administration of Dexamethasone for Bacterial Meningitis
LIPPINCOTT WILLIAMS & WILKINS. 2020
View details for Web of Science ID 000536058003112
Interaction analysis under misspecification of main effects: Some common mistakes and simple solutions.
Statistics in medicine
The statistical practice of modeling interaction with two linear main effects and a product term is ubiquitous in the statistical and epidemiological literature. Most data modelers are aware that the misspecification of main effects can potentially cause severe type I error inflation in tests for interactions, leading to spurious detection of interactions. However, modeling practice has not changed. In this article, we focus on the specific situation where the main effects in the model are misspecified as linear terms and characterize its impact on common tests for statistical interaction. We then propose some simple alternatives that fix the issue of potential type I error inflation in testing interaction due to main effect misspecification. We show that when using the sandwich variance estimator for a linear regression model with a quantitative outcome and two independent factors, both the Wald and score tests asymptotically maintain the correct type I error rate. However, if the independence assumption does not hold or the outcome is binary, using the sandwich estimator does not fix the problem. We further demonstrate that flexibly modeling the main effect under a generalized additive model can largely reduce or often remove bias in the estimates and maintain the correct type I error rate for both quantitative and binary outcomes regardless of the independence assumption. We show, under the independence assumption and for a continuous outcome, overfitting and flexibly modeling the main effects does not lead to power loss asymptotically relative to a correctly specified main effect model. Our simulation study further demonstrates the empirical fact that using flexible models for the main effects does not result in a significant loss of power for testing interaction in general. Our results provide an improved understanding of the strengths and limitations for tests of interaction in the presence of main effect misspecification. Using data from a large biobank study "The Michigan Genomics Initiative", we present two examples of interaction analysis in support of our results.
View details for DOI 10.1002/sim.8505
View details for PubMedID 32101638
FUN-LDA: A Latent Dirichlet Allocation Model for Predicting Tissue-Specific Functional Effects of Noncoding Variation: Methods and Applications
AMERICAN JOURNAL OF HUMAN GENETICS
2018; 102 (5): 920–42
We describe a method based on a latent Dirichlet allocation model for predicting functional effects of noncoding genetic variants in a cell-type- and/or tissue-specific way (FUN-LDA). Using this unsupervised approach, we predict tissue-specific functional effects for every position in the human genome in 127 different tissues and cell types. We demonstrate the usefulness of our predictions by using several validation experiments. Using eQTL data from several sources, including the GTEx project, Geuvadis project, and TwinsUK cohort, we show that eQTLs in specific tissues tend to be most enriched among the predicted functional variants in relevant tissues in Roadmap. We further show how these integrated functional scores can be used for (1) deriving the most likely cell or tissue type causally implicated for a complex trait by using summary statistics from genome-wide association studies and (2) estimating a tissue-based correlation matrix of various complex traits. We found large enrichment of heritability in functional components of relevant tissues for various complex traits, and FUN-LDA yielded higher enrichment estimates than existing methods. Finally, using experimentally validated functional variants from the literature and variants possibly implicated in disease by previous studies, we rigorously compare FUN-LDA with state-of-the-art functional annotation methods and show that FUN-LDA has better prediction accuracy and higher resolution than these methods. In particular, our results suggest that tissue- and cell-type-specific functional prediction methods tend to have substantially better prediction accuracy than organism-level prediction methods. Scores for each position in the human genome and for each ENCODE and Roadmap tissue are available online (see Web Resources).
View details for PubMedID 29727691
View details for PubMedCentralID PMC5986983
Detecting Rare Mutations with Heterogeneous Effects Using a Family-Based Genetic Random Field Method.
2018; 210 (2): 463–76
The genetic etiology of many complex diseases is highly heterogeneous. A complex disease can be caused by multiple mutations within the same gene or mutations in multiple genes at various genomic loci. Although these disease-susceptibility mutations can be collectively common in the population, they are often individually rare or even private to certain families. Family-based studies are powerful for detecting rare variants enriched in families, which is an important feature for sequencing studies due to the heterogeneous nature of rare variants. In addition, family designs can provide robust protection against population stratification. Nevertheless, statistical methods for analyzing family-based sequencing data are underdeveloped, especially those accounting for heterogeneous etiology of complex diseases. In this article, we introduce a random field framework for detecting gene-phenotype associations in family-based sequencing studies, referred to as family-based genetic random field (FGRF). Similar to existing family-based association tests, FGRF could utilize within-family and between-family information separately or jointly to test an association. We demonstrate that FGRF has comparable statistical power with existing methods when there is no genetic heterogeneity, but can improve statistical power when there is genetic heterogeneity across families. The proposed method also shares the same advantages with the conventional family-based association tests (e.g., being robust to population stratification). Finally, we applied the proposed method to a sequencing data from the Minnesota Twin Family Study, and revealed several genes, including SAMD14, potentially associated with alcohol dependence.
View details for PubMedID 30104420
View details for PubMedCentralID PMC6216585
Rare-variant association tests in longitudinal studies, with an application to the Multi-Ethnic Study of Atherosclerosis (MESA)
2017; 41 (8): 801–10
Over the past few years, an increasing number of studies have identified rare variants that contribute to trait heritability. Due to the extreme rarity of some individual variants, gene-based association tests have been proposed to aggregate the genetic variants within a gene, pathway, or specific genomic region as opposed to a one-at-a-time single variant analysis. In addition, in longitudinal studies, statistical power to detect disease susceptibility rare variants can be improved through jointly testing repeatedly measured outcomes, which better describes the temporal development of the trait of interest. However, usual sandwich/model-based inference for sequencing studies with longitudinal outcomes and rare variants can produce deflated/inflated type I error rate without further corrections. In this paper, we develop a group of tests for rare-variant association based on outcomes with repeated measures. We propose new perturbation methods such that the type I error rate of the new tests is not only robust to misspecification of within-subject correlation, but also significantly improved for variants with extreme rarity in a study with small or moderate sample size. Through extensive simulation studies, we illustrate that substantially higher power can be achieved by utilizing longitudinal outcomes and our proposed finite sample adjustment. We illustrate our methods using data from the Multi-Ethnic Study of Atherosclerosis for exploring association of repeated measures of blood pressure with rare and common variants based on exome sequencing data on 6,361 individuals.
View details for PubMedID 29076270
View details for PubMedCentralID PMC5696115
Interaction between Social/Psychosocial Factors and Genetic Variants on Body Mass Index: A Gene-Environment Interaction Analysis in a Longitudinal Setting
INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH
2017; 14 (10)
Obesity, which develops over time, is one of the leading causes of chronic diseases such as cardiovascular disease. However, hundreds of BMI (body mass index)-associated genetic loci identified through large-scale genome-wide association studies (GWAS) only explain about 2.7% of BMI variation. Most common human traits are believed to be influenced by both genetic and environmental factors. Past studies suggest a variety of environmental features that are associated with obesity, including socioeconomic status and psychosocial factors. This study combines both gene/regions and environmental factors to explore whether social/psychosocial factors (childhood and adult socioeconomic status, social support, anger, chronic burden, stressful life events, and depressive symptoms) modify the effect of sets of genetic variants on BMI in European American and African American participants in the Health and Retirement Study (HRS). In order to incorporate longitudinal phenotype data collected in the HRS and investigate entire sets of single nucleotide polymorphisms (SNPs) within gene/region simultaneously, we applied a novel set-based test for gene-environment interaction in longitudinal studies (LGEWIS). Childhood socioeconomic status (parental education) was found to modify the genetic effect in the gene/region around SNP rs9540493 on BMI in European Americans in the HRS. The most significant SNP (rs9540488) by childhood socioeconomic status interaction within the rs9540493 gene/region was suggestively replicated in the Multi-Ethnic Study of Atherosclerosis (MESA) (p = 0.07).
View details for PubMedID 28961216
Testing Allele Transmission of an SNP Set Using a Family-Based Generalized Genetic Random Field Method
2016; 40 (4): 341–51
Family-based association studies are commonly used in genetic research because they can be robust to population stratification (PS). Recent advances in high-throughput genotyping technologies have produced a massive amount of genomic data in family-based studies. However, current family-based association tests are mainly focused on evaluating individual variants one at a time. In this article, we introduce a family-based generalized genetic random field (FB-GGRF) method to test the joint association between a set of autosomal SNPs (i.e., single-nucleotide polymorphisms) and disease phenotypes. The proposed method is a natural extension of a recently developed GGRF method for population-based case-control studies. It models offspring genotypes conditional on parental genotypes, and, thus, is robust to PS. Through simulations, we presented that under various disease scenarios the FB-GGRF has improved power over a commonly used family-based sequence kernel association test (FB-SKAT). Further, similar to GGRF, the proposed FB-GGRF method is asymptotically well-behaved, and does not require empirical adjustment of the type I error rates. We illustrate the proposed method using a study of congenital heart defects with family trios from the National Birth Defects Prevention Study (NBDPS).
View details for PubMedID 27061818
View details for PubMedCentralID PMC5061344
Risk Prediction Modeling of Sequencing Data Using a Forward Random Field Method
2016; 6: 21120
With the advance in high-throughput sequencing technology, it is feasible to investigate the role of common and rare variants in disease risk prediction. While the new technology holds great promise to improve disease prediction, the massive amount of data and low frequency of rare variants pose great analytical challenges on risk prediction modeling. In this paper, we develop a forward random field method (FRF) for risk prediction modeling using sequencing data. In FRF, subjects' phenotypes are treated as stochastic realizations of a random field on a genetic space formed by subjects' genotypes, and an individual's phenotype can be predicted by adjacent subjects with similar genotypes. The FRF method allows for multiple similarity measures and candidate genes in the model, and adaptively chooses the optimal similarity measure and disease-associated genes to reflect the underlying disease model. It also avoids the specification of the threshold of rare variants and allows for different directions and magnitudes of genetic effects. Through simulations, we demonstrate the FRF method attains higher or comparable accuracy over commonly used support vector machine based methods under various disease models. We further illustrate the FRF method with an application to the sequencing data obtained from the Dallas Heart Study.
View details for PubMedID 26892725
Association between Stress Response Genes and Features of Diurnal Cortisol Curves in the Multi-Ethnic Study of Atherosclerosis: A New Multi-Phenotype Approach for Gene-Based Association Tests
2015; 10 (5): e0126637
The hormone cortisol is likely to be a key mediator of the stress response that influences multiple physiologic systems that are involved in common chronic disease, including the cardiovascular system, the immune system, and metabolism. In this paper, a candidate gene approach was used to investigate genetic contributions to variability in multiple correlated features of the daily cortisol profile in a sample of European Americans, African Americans, and Hispanic Americans from the Multi-Ethnic Study of Atherosclerosis (MESA). We proposed and applied a new gene-level multiple-phenotype analysis and carried out a meta-analysis to combine the ethnicity specific results. This new analysis, instead of a more routine single marker-single phenotype approach identified a significant association between one gene (ADRB2) and cortisol features (meta-analysis p-value=0.0025), which was not identified by three other commonly used existing analytic strategies: 1. Single marker association tests involving each single cortisol feature separately; 2. Single marker association tests jointly testing for multiple cortisol features; 3. Gene-level association tests separately carried out for each single cortisol feature. The analytic strategies presented consider different hypotheses regarding genotype-phenotype association and imply different costs of multiple testing. The proposed gene-level analysis integrating multiple cortisol features across multiple ethnic groups provides new insights into the gene-cortisol association.
View details for PubMedID 25993632
A Powerful Nonparametric Statistical Framework for Family-Based Association Analyses
2015; 200 (1): 69–U140
Family-based study design is commonly used in genetic research. It has many ideal features, including being robust to population stratification (PS). With the advance of high-throughput technologies and ever-decreasing genotyping cost, it has become common for family studies to examine a large number of variants for their associations with disease phenotypes. The yield from the analysis of these family-based genetic data can be enhanced by adopting computationally efficient and powerful statistical methods. We propose a general framework of a family-based U-statistic, referred to as family-U, for family-based association studies. Unlike existing parametric-based methods, the proposed method makes no assumption of the underlying disease models and can be applied to various phenotypes (e.g., binary and quantitative phenotypes) and pedigree structures (e.g., nuclear families and extended pedigrees). By using only within-family information, it can offer robust protection against PS. In the absence of PS, it can also utilize additional information (i.e., between-family information) for power improvement. Through simulations, we demonstrated that family-U attained higher power over a commonly used method, family-based association tests, under various disease scenarios. We further illustrated the new method with an application to large-scale family data from the Framingham Heart Study. By utilizing additional information (i.e., between-family information), family-U confirmed a previous association of CHRNA5 with nicotine dependence.
View details for PubMedID 25745024
View details for PubMedCentralID PMC4423382
A Weighted U-Statistic for Genetic Association Analyses of Sequencing Data
2014; 38 (8): 699–708
With advancements in next-generation sequencing technology, a massive amount of sequencing data is generated, which offers a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, the high-dimensional sequencing data poses a great challenge for statistical analysis. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a Weighted U Sequencing test, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a nonparametric U-statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used sequence kernel association test (SKAT) method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL4 and very low density lipoprotein cholesterol.
View details for DOI 10.1002/gepi.21864
View details for Web of Science ID 000345292600005
View details for PubMedID 25331574
View details for PubMedCentralID PMC4236269
A Generalized Genetic Random Field Method for the Genetic Association Analysis of Sequencing Data
2014; 38 (3): 242–53
With the advance of high-throughput sequencing technologies, it has become feasible to investigate the influence of the entire spectrum of sequencing variations on complex human diseases. Although association studies utilizing the new sequencing technologies hold great promise to unravel novel genetic variants, especially rare genetic variants that contribute to human diseases, the statistical analysis of high-dimensional sequencing data remains a challenge. Advanced analytical methods are in great need to facilitate high-dimensional sequencing data analyses. In this article, we propose a generalized genetic random field (GGRF) method for association analyses of sequencing data. Like other similarity-based methods (e.g., SIMreg and SKAT), the new method has the advantages of avoiding the need to specify thresholds for rare variants and allowing for testing multiple variants acting in different directions and magnitude of effects. The method is built on the generalized estimating equation framework and thus accommodates a variety of disease phenotypes (e.g., quantitative and binary phenotypes). Moreover, it has a nice asymptotic property, and can be applied to small-scale sequencing data without need for small-sample adjustment. Through simulations, we demonstrate that the proposed GGRF attains an improved or comparable power over a commonly used method, SKAT, under various disease scenarios, especially when rare variants play a significant role in disease etiology. We further illustrate GGRF with an application to a real dataset from the Dallas Heart Study. By using GGRF, we were able to detect the association of two candidate genes, ANGPTL3 and ANGPTL4, with serum triglyceride.
View details for DOI 10.1002/gepi.21790
View details for Web of Science ID 000332700300007
View details for PubMedID 24482034
View details for PubMedCentralID PMC5241166