Dr. He received his PhD from the University of Michigan in 2016. Following postdoctoral training in biostatistics at Columbia University, he joined Stanford University as an assistant professor of neurology and of medicine in 2018. His research focuses on statistical genetics and the integrative analysis of omics data, with the aim of developing novel statistical and computational methodologies for the identification and interpretation of complex biological pathways involved in human diseases, particularly neurological disorders. His methodological interests include high-dimensional data analysis, correlated (longitudinal, familial) data analysis, and machine learning algorithms.
Honors & Awards
Rackham Pre-doctoral Fellowship Award, University of Michigan (2015)
Rackham Conference Travel Grant, University of Michigan (2013 - 2015)
Best Performance on the Qualifying Exam, University of Michigan (2013)
Ph.D., University of Michigan, Biostatistics (2016)
B.S., Tsinghua University, Mathematics and Physics (2010)
A genome-wide scan statistic framework for whole-genome sequence data analysis.
2019; 10 (1): 3018
The analysis of whole-genome sequencing studies is challenging due to the large number of noncoding rare variants, our limited understanding of their functional effects, and the lack of natural units for testing. Here we propose a scan statistic framework, WGScan, to simultaneously detect the existence, and estimate the locations of association signals at genome-wide scale. WGScan can analytically estimate the significance threshold for a whole-genome scan; utilize summary statistics for a meta-analysis; incorporate functional annotations for enhanced discoveries in noncoding regions; and enable enrichment analyses using genome-wide summary statistics. Based on the analysis of whole genomes of 1,786 phenotypically discordant sibling pairs from the Simons Simplex Collection study for autism spectrum disorders, we derive genome-wide significance thresholds for whole genome sequencing studies and detect significant enrichments of regions showing associations with autism in promoter regions, functional categories related to autism, and enhancers predicted to regulate expression of autism associated genes.
View details for DOI 10.1038/s41467-019-11023-0
View details for PubMedID 31289270
FUN-LDA: A Latent Dirichlet Allocation Model for Predicting Tissue-Specific Functional Effects of Noncoding Variation: Methods and Applications
AMERICAN JOURNAL OF HUMAN GENETICS
2018; 102 (5): 920–42
We describe a method based on a latent Dirichlet allocation model for predicting functional effects of noncoding genetic variants in a cell-type- and/or tissue-specific way (FUN-LDA). Using this unsupervised approach, we predict tissue-specific functional effects for every position in the human genome in 127 different tissues and cell types. We demonstrate the usefulness of our predictions by using several validation experiments. Using eQTL data from several sources, including the GTEx project, Geuvadis project, and TwinsUK cohort, we show that eQTLs in specific tissues tend to be most enriched among the predicted functional variants in relevant tissues in Roadmap. We further show how these integrated functional scores can be used for (1) deriving the most likely cell or tissue type causally implicated for a complex trait by using summary statistics from genome-wide association studies and (2) estimating a tissue-based correlation matrix of various complex traits. We found large enrichment of heritability in functional components of relevant tissues for various complex traits, and FUN-LDA yielded higher enrichment estimates than existing methods. Finally, using experimentally validated functional variants from the literature and variants possibly implicated in disease by previous studies, we rigorously compare FUN-LDA with state-of-the-art functional annotation methods and show that FUN-LDA has better prediction accuracy and higher resolution than these methods. In particular, our results suggest that tissue- and cell-type-specific functional prediction methods tend to have substantially better prediction accuracy than organism-level prediction methods. Scores for each position in the human genome and for each ENCODE and Roadmap tissue are available online (see Web Resources).
View details for PubMedID 29727691
View details for PubMedCentralID PMC5986983
A semi-supervised approach for predicting cell-type specific functional consequences of non-coding variation using MPRAs.
2018; 9 (1): 5199
Predicting the functional consequences of genetic variants in non-coding regions is a challenging problem. We propose here a semi-supervised approach, GenoNet, to jointly utilize experimentally confirmed regulatory variants (labeled variants), millions of unlabeled variants genome-wide, and more than a thousand cell/tissue type specific epigenetic annotations to predict functional consequences of non-coding variants. Through the application to several experimental datasets, we demonstrate that the proposed method significantly improves prediction accuracy compared to existing functional prediction methods at the tissue/cell type level, but especially so at the organism level. Importantly, we illustrate how the GenoNet scores can help in fine-mapping at GWAS loci, and in the discovery of disease associated genes in sequencing studies. As more comprehensive lists of experimentally validated variants become available over the next few years, semi-supervised methods like GenoNet can be used to provide increasingly accurate functional predictions for variants genome-wide and across a variety of cell/tissue types.
View details for PubMedID 30518757
Rare-variant association tests in longitudinal studies, with an application to the Multi-Ethnic Study of Atherosclerosis (MESA)
2017; 41 (8): 801–10
Over the past few years, an increasing number of studies have identified rare variants that contribute to trait heritability. Due to the extreme rarity of some individual variants, gene-based association tests have been proposed to aggregate the genetic variants within a gene, pathway, or specific genomic region as opposed to a one-at-a-time single variant analysis. In addition, in longitudinal studies, statistical power to detect disease susceptibility rare variants can be improved through jointly testing repeatedly measured outcomes, which better describes the temporal development of the trait of interest. However, usual sandwich/model-based inference for sequencing studies with longitudinal outcomes and rare variants can produce deflated/inflated type I error rate without further corrections. In this paper, we develop a group of tests for rare-variant association based on outcomes with repeated measures. We propose new perturbation methods such that the type I error rate of the new tests is not only robust to misspecification of within-subject correlation, but also significantly improved for variants with extreme rarity in a study with small or moderate sample size. Through extensive simulation studies, we illustrate that substantially higher power can be achieved by utilizing longitudinal outcomes and our proposed finite sample adjustment. We illustrate our methods using data from the Multi-Ethnic Study of Atherosclerosis for exploring association of repeated measures of blood pressure with rare and common variants based on exome sequencing data on 6,361 individuals.
View details for PubMedID 29076270
View details for PubMedCentralID PMC5696115
Unified Sequence-Based Association Tests Allowing for Multiple Functional Annotations and Meta-analysis of Noncoding Variation in Metabochip Data
AMERICAN JOURNAL OF HUMAN GENETICS
2017; 101 (3): 340–52
Substantial progress has been made in the functional annotation of genetic variation in the human genome. Integrative analysis that incorporates such functional annotations into sequencing studies can aid the discovery of disease-associated genetic variants, especially those with unknown function and located outside protein-coding regions. Direct incorporation of one functional annotation as weight in existing dispersion and burden tests can suffer substantial loss of power when the functional annotation is not predictive of the risk status of a variant. Here, we have developed unified tests that can utilize multiple functional annotations simultaneously for integrative association analysis with efficient computational techniques. We show that the proposed tests significantly improve power when variant risk status can be predicted by functional annotations. Importantly, when functional annotations are not predictive of risk status, the proposed tests incur only minimal loss of power in relation to existing dispersion and burden tests, and under certain circumstances they can even have improved power by learning a weight that better approximates the underlying disease model in a data-adaptive manner. The tests can be constructed with summary statistics of existing dispersion and burden tests for sequencing data, therefore allowing meta-analysis of multiple studies without sharing individual-level data. We applied the proposed tests to a meta-analysis of noncoding rare variants in Metabochip data on 12,281 individuals from eight studies for lipid traits. By incorporating the Eigen functional score, we detected significant associations between noncoding rare variants in SLC22A3 and low-density lipoprotein and total cholesterol, associations that are missed by standard dispersion and burden tests.
View details for PubMedID 28844485
View details for PubMedCentralID PMC5590864
Set-Based Tests for the Gene-Environment Interaction in Longitudinal Studies
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2017; 112 (519): 966–78
We propose a generalized score type test for set-based inference for gene-environment interaction with longitudinally measured quantitative traits. The test is robust to misspecification of within subject correlation structure and has enhanced power compared to existing alternatives. Unlike tests for marginal genetic association, set-based tests for gene-environment interaction face the challenges of a potentially misspecified and high-dimensional main effect model under the null hypothesis. We show that our proposed test is robust to main effect misspecification of environmental exposure and genetic factors under the gene-environment independence condition. When genetic and environmental factors are dependent, the method of sieves is further proposed to eliminate potential bias due to a misspecified main effect of a continuous environmental exposure. A weighted principal component analysis approach is developed to perform dimension reduction when the number of genetic variants in the set is large relative to the sample size. The methods are motivated by an example from the Multi-Ethnic Study of Atherosclerosis (MESA), investigating interaction between measures of neighborhood environment and genetic regions on longitudinal measures of blood pressure over a study period of about seven years with 4 exams.
View details for PubMedID 29780190
View details for PubMedCentralID PMC5954413
Set-Based Tests for Genetic Association in Longitudinal Studies
2015; 71 (3): 606–15
Genetic association studies with longitudinal markers of chronic diseases (e.g., blood pressure, body mass index) provide a valuable opportunity to explore how genetic variants affect traits over time by utilizing the full trajectory of longitudinal outcomes. Since these traits are likely influenced by the joint effect of multiple variants in a gene, a joint analysis of these variants considering linkage disequilibrium (LD) may help to explain additional phenotypic variation. In this article, we propose a longitudinal genetic random field model (LGRF), to test the association between a phenotype measured repeatedly during the course of an observational study and a set of genetic variants. Generalized score type tests are developed, which we show are robust to misspecification of within-subject correlation, a feature that is desirable for longitudinal analysis. In addition, a joint test incorporating gene-time interaction is further proposed. Computational advancement is made for scalable implementation of the proposed methods in large-scale genome-wide association studies (GWAS). The proposed methods are evaluated through extensive simulation studies and illustrated using data from the Multi-Ethnic Study of Atherosclerosis (MESA). Our simulation results indicate substantial gain in power using LGRF when compared with two commonly used existing alternatives: (i) single marker tests using longitudinal outcome and (ii) existing gene-based tests using the average value of repeated measurements as the outcome.
View details for PubMedID 25854837
View details for PubMedCentralID PMC4601568
Modeling and Testing for Joint Association Using a Genetic Random Field Model
2014; 70 (3): 471–79
Substantial progress has been made in identifying single genetic variants predisposing to common complex diseases. Nonetheless, the genetic etiology of human diseases remains largely unknown. Human complex diseases are likely influenced by the joint effect of a large number of genetic variants instead of a single variant. The joint analysis of multiple genetic variants considering linkage disequilibrium (LD) and potential interactions can further enhance the discovery process, leading to the identification of new disease-susceptibility genetic variants. Motivated by development in spatial statistics, we propose a new statistical model based on the random field theory, referred to as a genetic random field model (GenRF), for joint association analysis with the consideration of possible gene-gene interactions and LD. Using a pseudo-likelihood approach, a GenRF test for the joint association of multiple genetic variants is developed, which has the following advantages: (1) accommodating complex interactions for improved performance; (2) natural dimension reduction; (3) boosting power in the presence of LD; and (4) computationally efficient. Simulation studies are conducted under various scenarios. The development has been focused on quantitative traits and robustness of the GenRF test to other traits, for example, binary traits, is also discussed. Compared with a commonly adopted kernel machine approach, SKAT, as well as other more standard methods, GenRF shows overall comparable performance and better performance in the presence of complex interactions. The method is further illustrated by an application to the Dallas Heart Study.
View details for PubMedID 24628067
Interaction analysis under misspecification of main effects: Some common mistakes and simple solutions.
Statistics in medicine
The statistical practice of modeling interaction with two linear main effects and a product term is ubiquitous in the statistical and epidemiological literature. Most data modelers are aware that the misspecification of main effects can potentially cause severe type I error inflation in tests for interactions, leading to spurious detection of interactions. However, modeling practice has not changed. In this article, we focus on the specific situation where the main effects in the model are misspecified as linear terms and characterize its impact on common tests for statistical interaction. We then propose some simple alternatives that fix the issue of potential type I error inflation in testing interaction due to main effect misspecification. We show that when using the sandwich variance estimator for a linear regression model with a quantitative outcome and two independent factors, both the Wald and score tests asymptotically maintain the correct type I error rate. However, if the independence assumption does not hold or the outcome is binary, using the sandwich estimator does not fix the problem. We further demonstrate that flexibly modeling the main effect under a generalized additive model can largely reduce or often remove bias in the estimates and maintain the correct type I error rate for both quantitative and binary outcomes regardless of the independence assumption. We show, under the independence assumption and for a continuous outcome, overfitting and flexibly modeling the main effects does not lead to power loss asymptotically relative to a correctly specified main effect model. Our simulation study further demonstrates the empirical fact that using flexible models for the main effects does not result in a significant loss of power for testing interaction in general. Our results provide an improved understanding of the strengths and limitations for tests of interaction in the presence of main effect misspecification. Using data from a large biobank study "The Michigan Genomics Initiative", we present two examples of interaction analysis in support of our results.
View details for DOI 10.1002/sim.8505
View details for PubMedID 32101638
Detecting Rare Mutations with Heterogeneous Effects Using a Family-Based Genetic Random Field Method.
2018; 210 (2): 463–76
The genetic etiology of many complex diseases is highly heterogeneous. A complex disease can be caused by multiple mutations within the same gene or mutations in multiple genes at various genomic loci. Although these disease-susceptibility mutations can be collectively common in the population, they are often individually rare or even private to certain families. Family-based studies are powerful for detecting rare variants enriched in families, which is an important feature for sequencing studies due to the heterogeneous nature of rare variants. In addition, family designs can provide robust protection against population stratification. Nevertheless, statistical methods for analyzing family-based sequencing data are underdeveloped, especially those accounting for heterogeneous etiology of complex diseases. In this article, we introduce a random field framework for detecting gene-phenotype associations in family-based sequencing studies, referred to as family-based genetic random field (FGRF). Similar to existing family-based association tests, FGRF could utilize within-family and between-family information separately or jointly to test an association. We demonstrate that FGRF has comparable statistical power with existing methods when there is no genetic heterogeneity, but can improve statistical power when there is genetic heterogeneity across families. The proposed method also shares the same advantages with the conventional family-based association tests (e.g., being robust to population stratification). Finally, we applied the proposed method to sequencing data from the Minnesota Twin Family Study, and revealed several genes, including SAMD14, potentially associated with alcohol dependence.
View details for PubMedID 30104420
View details for PubMedCentralID PMC6216585
Interaction between Social/Psychosocial Factors and Genetic Variants on Body Mass Index: A Gene-Environment Interaction Analysis in a Longitudinal Setting
INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH
2017; 14 (10)
Obesity, which develops over time, is one of the leading causes of chronic diseases such as cardiovascular disease. However, hundreds of BMI (body mass index)-associated genetic loci identified through large-scale genome-wide association studies (GWAS) only explain about 2.7% of BMI variation. Most common human traits are believed to be influenced by both genetic and environmental factors. Past studies suggest a variety of environmental features that are associated with obesity, including socioeconomic status and psychosocial factors. This study combines both gene/regions and environmental factors to explore whether social/psychosocial factors (childhood and adult socioeconomic status, social support, anger, chronic burden, stressful life events, and depressive symptoms) modify the effect of sets of genetic variants on BMI in European American and African American participants in the Health and Retirement Study (HRS). In order to incorporate longitudinal phenotype data collected in the HRS and investigate entire sets of single nucleotide polymorphisms (SNPs) within gene/region simultaneously, we applied a novel set-based test for gene-environment interaction in longitudinal studies (LGEWIS). Childhood socioeconomic status (parental education) was found to modify the genetic effect in the gene/region around SNP rs9540493 on BMI in European Americans in the HRS. The most significant SNP (rs9540488) by childhood socioeconomic status interaction within the rs9540493 gene/region was suggestively replicated in the Multi-Ethnic Study of Atherosclerosis (MESA) (p = 0.07).
View details for PubMedID 28961216
Testing Allele Transmission of an SNP Set Using a Family-Based Generalized Genetic Random Field Method
2016; 40 (4): 341–51
Family-based association studies are commonly used in genetic research because they can be robust to population stratification (PS). Recent advances in high-throughput genotyping technologies have produced a massive amount of genomic data in family-based studies. However, current family-based association tests are mainly focused on evaluating individual variants one at a time. In this article, we introduce a family-based generalized genetic random field (FB-GGRF) method to test the joint association between a set of autosomal SNPs (i.e., single-nucleotide polymorphisms) and disease phenotypes. The proposed method is a natural extension of a recently developed GGRF method for population-based case-control studies. It models offspring genotypes conditional on parental genotypes, and, thus, is robust to PS. Through simulations, we showed that under various disease scenarios the FB-GGRF has improved power over a commonly used family-based sequence kernel association test (FB-SKAT). Further, similar to GGRF, the proposed FB-GGRF method is asymptotically well-behaved, and does not require empirical adjustment of the type I error rates. We illustrate the proposed method using a study of congenital heart defects with family trios from the National Birth Defects Prevention Study (NBDPS).
View details for PubMedID 27061818
View details for PubMedCentralID PMC5061344
Risk Prediction Modeling of Sequencing Data Using a Forward Random Field Method
2016; 6: 21120
With the advance in high-throughput sequencing technology, it is feasible to investigate the role of common and rare variants in disease risk prediction. While the new technology holds great promise to improve disease prediction, the massive amount of data and low frequency of rare variants pose great analytical challenges on risk prediction modeling. In this paper, we develop a forward random field method (FRF) for risk prediction modeling using sequencing data. In FRF, subjects' phenotypes are treated as stochastic realizations of a random field on a genetic space formed by subjects' genotypes, and an individual's phenotype can be predicted by adjacent subjects with similar genotypes. The FRF method allows for multiple similarity measures and candidate genes in the model, and adaptively chooses the optimal similarity measure and disease-associated genes to reflect the underlying disease model. It also avoids the specification of the threshold of rare variants and allows for different directions and magnitudes of genetic effects. Through simulations, we demonstrate the FRF method attains higher or comparable accuracy over commonly used support vector machine based methods under various disease models. We further illustrate the FRF method with an application to the sequencing data obtained from the Dallas Heart Study.
View details for PubMedID 26892725
Association between Stress Response Genes and Features of Diurnal Cortisol Curves in the Multi-Ethnic Study of Atherosclerosis: A New Multi-Phenotype Approach for Gene-Based Association Tests
2015; 10 (5): e0126637
The hormone cortisol is likely to be a key mediator of the stress response that influences multiple physiologic systems that are involved in common chronic disease, including the cardiovascular system, the immune system, and metabolism. In this paper, a candidate gene approach was used to investigate genetic contributions to variability in multiple correlated features of the daily cortisol profile in a sample of European Americans, African Americans, and Hispanic Americans from the Multi-Ethnic Study of Atherosclerosis (MESA). We proposed and applied a new gene-level multiple-phenotype analysis and carried out a meta-analysis to combine the ethnicity-specific results. This new analysis, instead of a more routine single marker-single phenotype approach, identified a significant association between one gene (ADRB2) and cortisol features (meta-analysis p-value = 0.0025), which was not identified by three other commonly used existing analytic strategies: (1) single marker association tests involving each single cortisol feature separately; (2) single marker association tests jointly testing for multiple cortisol features; and (3) gene-level association tests separately carried out for each single cortisol feature. The analytic strategies presented consider different hypotheses regarding genotype-phenotype association and imply different costs of multiple testing. The proposed gene-level analysis integrating multiple cortisol features across multiple ethnic groups provides new insights into the gene-cortisol association.
View details for PubMedID 25993632
A Powerful Nonparametric Statistical Framework for Family-Based Association Analyses
2015; 200 (1): 69–U140
Family-based study design is commonly used in genetic research. It has many ideal features, including being robust to population stratification (PS). With the advance of high-throughput technologies and ever-decreasing genotyping cost, it has become common for family studies to examine a large number of variants for their associations with disease phenotypes. The yield from the analysis of these family-based genetic data can be enhanced by adopting computationally efficient and powerful statistical methods. We propose a general framework of a family-based U-statistic, referred to as family-U, for family-based association studies. Unlike existing parametric-based methods, the proposed method makes no assumption of the underlying disease models and can be applied to various phenotypes (e.g., binary and quantitative phenotypes) and pedigree structures (e.g., nuclear families and extended pedigrees). By using only within-family information, it can offer robust protection against PS. In the absence of PS, it can also utilize additional information (i.e., between-family information) for power improvement. Through simulations, we demonstrated that family-U attained higher power over a commonly used method, family-based association tests, under various disease scenarios. We further illustrated the new method with an application to large-scale family data from the Framingham Heart Study. By utilizing additional information (i.e., between-family information), family-U confirmed a previous association of CHRNA5 with nicotine dependence.
View details for PubMedID 25745024
View details for PubMedCentralID PMC4423382
A Weighted U-Statistic for Genetic Association Analyses of Sequencing Data
2014; 38 (8): 699–708
With advancements in next-generation sequencing technology, a massive amount of sequencing data is generated, which offers a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, the high-dimensional sequencing data poses a great challenge for statistical analysis. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a Weighted U Sequencing test, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a nonparametric U-statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used sequence kernel association test (SKAT) method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL4 and very low density lipoprotein cholesterol.
View details for DOI 10.1002/gepi.21864
View details for Web of Science ID 000345292600005
View details for PubMedID 25331574
View details for PubMedCentralID PMC4236269
A Generalized Genetic Random Field Method for the Genetic Association Analysis of Sequencing Data
2014; 38 (3): 242–53
With the advance of high-throughput sequencing technologies, it has become feasible to investigate the influence of the entire spectrum of sequencing variations on complex human diseases. Although association studies utilizing the new sequencing technologies hold great promise to unravel novel genetic variants, especially rare genetic variants that contribute to human diseases, the statistical analysis of high-dimensional sequencing data remains a challenge. Advanced analytical methods are in great need to facilitate high-dimensional sequencing data analyses. In this article, we propose a generalized genetic random field (GGRF) method for association analyses of sequencing data. Like other similarity-based methods (e.g., SIMreg and SKAT), the new method has the advantages of avoiding the need to specify thresholds for rare variants and allowing for testing multiple variants acting in different directions and magnitude of effects. The method is built on the generalized estimating equation framework and thus accommodates a variety of disease phenotypes (e.g., quantitative and binary phenotypes). Moreover, it has a nice asymptotic property, and can be applied to small-scale sequencing data without need for small-sample adjustment. Through simulations, we demonstrate that the proposed GGRF attains an improved or comparable power over a commonly used method, SKAT, under various disease scenarios, especially when rare variants play a significant role in disease etiology. We further illustrate GGRF with an application to a real dataset from the Dallas Heart Study. By using GGRF, we were able to detect the association of two candidate genes, ANGPTL3 and ANGPTL4, with serum triglyceride.
View details for DOI 10.1002/gepi.21790
View details for Web of Science ID 000332700300007
View details for PubMedID 24482034
View details for PubMedCentralID PMC5241166