Dr. He received his PhD from the University of Michigan in 2016. Following postdoctoral training in biostatistics at Columbia University, he joined Stanford University as an assistant professor of neurology and of medicine in 2018. His research focuses on statistical genetics and integrative analysis of omics data, with the aim of developing novel statistical and computational methodologies for the identification and interpretation of complex biological pathways involved in human diseases, particularly neurological disorders. His methodological interests include high-dimensional data analysis, correlated (longitudinal, familial) data analysis, and machine learning algorithms.

Honors & Awards

  • Rackham Pre-doctoral Fellowship Award, University of Michigan (2015)
  • Rackham Conference Travel Grant, University of Michigan (2013 - 2015)
  • Best Performance on the Qualifying Exam, University of Michigan (2013)

Professional Education

  • Ph.D., University of Michigan, Biostatistics (2016)
  • B.S., Tsinghua University, Mathematics and Physics (2010)

All Publications

  • A genome-wide scan statistic framework for whole-genome sequence data analysis. Nature communications He, Z., Xu, B., Buxbaum, J., Ionita-Laza, I. 2019; 10 (1): 3018


    The analysis of whole-genome sequencing studies is challenging due to the large number of noncoding rare variants, our limited understanding of their functional effects, and the lack of natural units for testing. Here we propose a scan statistic framework, WGScan, to simultaneously detect the existence, and estimate the locations of association signals at genome-wide scale. WGScan can analytically estimate the significance threshold for a whole-genome scan; utilize summary statistics for a meta-analysis; incorporate functional annotations for enhanced discoveries in noncoding regions; and enable enrichment analyses using genome-wide summary statistics. Based on the analysis of whole genomes of 1,786 phenotypically discordant sibling pairs from the Simons Simplex Collection study for autism spectrum disorders, we derive genome-wide significance thresholds for whole genome sequencing studies and detect significant enrichments of regions showing associations with autism in promoter regions, functional categories related to autism, and enhancers predicted to regulate expression of autism associated genes.

    View details for DOI 10.1038/s41467-019-11023-0

    View details for PubMedID 31289270

  • FUN-LDA: A Latent Dirichlet Allocation Model for Predicting Tissue-Specific Functional Effects of Noncoding Variation: Methods and Applications AMERICAN JOURNAL OF HUMAN GENETICS Backenroth, D., He, Z., Kiryluk, K., Boeva, V., Pethukova, L., Khurana, E., Christiano, A., Buxbaum, J. D., Ionita-Laza, I. 2018; 102 (5): 920–42


    We describe a method based on a latent Dirichlet allocation model for predicting functional effects of noncoding genetic variants in a cell-type- and/or tissue-specific way (FUN-LDA). Using this unsupervised approach, we predict tissue-specific functional effects for every position in the human genome in 127 different tissues and cell types. We demonstrate the usefulness of our predictions by using several validation experiments. Using eQTL data from several sources, including the GTEx project, Geuvadis project, and TwinsUK cohort, we show that eQTLs in specific tissues tend to be most enriched among the predicted functional variants in relevant tissues in Roadmap. We further show how these integrated functional scores can be used for (1) deriving the most likely cell or tissue type causally implicated for a complex trait by using summary statistics from genome-wide association studies and (2) estimating a tissue-based correlation matrix of various complex traits. We found large enrichment of heritability in functional components of relevant tissues for various complex traits, and FUN-LDA yielded higher enrichment estimates than existing methods. Finally, using experimentally validated functional variants from the literature and variants possibly implicated in disease by previous studies, we rigorously compare FUN-LDA with state-of-the-art functional annotation methods and show that FUN-LDA has better prediction accuracy and higher resolution than these methods. In particular, our results suggest that tissue- and cell-type-specific functional prediction methods tend to have substantially better prediction accuracy than organism-level prediction methods. Scores for each position in the human genome and for each ENCODE and Roadmap tissue are available online (see Web Resources).

    View details for PubMedID 29727691

    View details for PubMedCentralID PMC5986983

  • A semi-supervised approach for predicting cell-type specific functional consequences of non-coding variation using MPRAs. Nature communications He, Z., Liu, L., Wang, K., Ionita-Laza, I. 2018; 9 (1): 5199


    Predicting the functional consequences of genetic variants in non-coding regions is a challenging problem. We propose here a semi-supervised approach, GenoNet, to jointly utilize experimentally confirmed regulatory variants (labeled variants), millions of unlabeled variants genome-wide, and more than a thousand cell/tissue type specific epigenetic annotations to predict functional consequences of non-coding variants. Through the application to several experimental datasets, we demonstrate that the proposed method significantly improves prediction accuracy compared to existing functional prediction methods at the tissue/cell type level, but especially so at the organism level. Importantly, we illustrate how the GenoNet scores can help in fine-mapping at GWAS loci, and in the discovery of disease associated genes in sequencing studies. As more comprehensive lists of experimentally validated variants become available over the next few years, semi-supervised methods like GenoNet can be used to provide increasingly accurate functional predictions for variants genome-wide and across a variety of cell/tissue types.

    View details for PubMedID 30518757

  • Rare-variant association tests in longitudinal studies, with an application to the Multi-Ethnic Study of Atherosclerosis (MESA) GENETIC EPIDEMIOLOGY He, Z., Lee, S., Zhang, M., Smith, J. A., Guo, X., Palmas, W., Kardia, S. R., Ionita-Laza, I., Mukherjee, B. 2017; 41 (8): 801–10


    Over the past few years, an increasing number of studies have identified rare variants that contribute to trait heritability. Due to the extreme rarity of some individual variants, gene-based association tests have been proposed to aggregate the genetic variants within a gene, pathway, or specific genomic region as opposed to a one-at-a-time single variant analysis. In addition, in longitudinal studies, statistical power to detect disease susceptibility rare variants can be improved through jointly testing repeatedly measured outcomes, which better describes the temporal development of the trait of interest. However, usual sandwich/model-based inference for sequencing studies with longitudinal outcomes and rare variants can produce deflated/inflated type I error rate without further corrections. In this paper, we develop a group of tests for rare-variant association based on outcomes with repeated measures. We propose new perturbation methods such that the type I error rate of the new tests is not only robust to misspecification of within-subject correlation, but also significantly improved for variants with extreme rarity in a study with small or moderate sample size. Through extensive simulation studies, we illustrate that substantially higher power can be achieved by utilizing longitudinal outcomes and our proposed finite sample adjustment. We illustrate our methods using data from the Multi-Ethnic Study of Atherosclerosis for exploring association of repeated measures of blood pressure with rare and common variants based on exome sequencing data on 6,361 individuals.

    View details for PubMedID 29076270

    View details for PubMedCentralID PMC5696115

  • Unified Sequence-Based Association Tests Allowing for Multiple Functional Annotations and Meta-analysis of Noncoding Variation in Metabochip Data AMERICAN JOURNAL OF HUMAN GENETICS He, Z., Xu, B., Lee, S., Ionita-Laza, I. 2017; 101 (3): 340–52


    Substantial progress has been made in the functional annotation of genetic variation in the human genome. Integrative analysis that incorporates such functional annotations into sequencing studies can aid the discovery of disease-associated genetic variants, especially those with unknown function and located outside protein-coding regions. Direct incorporation of one functional annotation as weight in existing dispersion and burden tests can suffer substantial loss of power when the functional annotation is not predictive of the risk status of a variant. Here, we have developed unified tests that can utilize multiple functional annotations simultaneously for integrative association analysis with efficient computational techniques. We show that the proposed tests significantly improve power when variant risk status can be predicted by functional annotations. Importantly, when functional annotations are not predictive of risk status, the proposed tests incur only minimal loss of power in relation to existing dispersion and burden tests, and under certain circumstances they can even have improved power by learning a weight that better approximates the underlying disease model in a data-adaptive manner. The tests can be constructed with summary statistics of existing dispersion and burden tests for sequencing data, therefore allowing meta-analysis of multiple studies without sharing individual-level data. We applied the proposed tests to a meta-analysis of noncoding rare variants in Metabochip data on 12,281 individuals from eight studies for lipid traits. By incorporating the Eigen functional score, we detected significant associations between noncoding rare variants in SLC22A3 and low-density lipoprotein and total cholesterol, associations that are missed by standard dispersion and burden tests.

    View details for PubMedID 28844485

    View details for PubMedCentralID PMC5590864

  • Set-Based Tests for the Gene-Environment Interaction in Longitudinal Studies JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION He, Z., Zhang, M., Lee, S., Smith, J. A., Kardia, S. R., Roux, V., Mukherjee, B. 2017; 112 (519): 966–78


    We propose a generalized score type test for set-based inference for gene-environment interaction with longitudinally measured quantitative traits. The test is robust to misspecification of within subject correlation structure and has enhanced power compared to existing alternatives. Unlike tests for marginal genetic association, set-based tests for gene-environment interaction face the challenges of a potentially misspecified and high-dimensional main effect model under the null hypothesis. We show that our proposed test is robust to main effect misspecification of environmental exposure and genetic factors under the gene-environment independence condition. When genetic and environmental factors are dependent, the method of sieves is further proposed to eliminate potential bias due to a misspecified main effect of a continuous environmental exposure. A weighted principal component analysis approach is developed to perform dimension reduction when the number of genetic variants in the set is large relative to the sample size. The methods are motivated by an example from the Multi-Ethnic Study of Atherosclerosis (MESA), investigating interaction between measures of neighborhood environment and genetic regions on longitudinal measures of blood pressure over a study period of about seven years with 4 exams.

    View details for PubMedID 29780190

    View details for PubMedCentralID PMC5954413

  • Set-Based Tests for Genetic Association in Longitudinal Studies BIOMETRICS He, Z., Zhang, M., Lee, S., Smith, J. A., Guo, X., Palmas, W., Kardia, S. R., Roux, A., Mukherjee, B. 2015; 71 (3): 606–15


    Genetic association studies with longitudinal markers of chronic diseases (e.g., blood pressure, body mass index) provide a valuable opportunity to explore how genetic variants affect traits over time by utilizing the full trajectory of longitudinal outcomes. Since these traits are likely influenced by the joint effect of multiple variants in a gene, a joint analysis of these variants considering linkage disequilibrium (LD) may help to explain additional phenotypic variation. In this article, we propose a longitudinal genetic random field model (LGRF), to test the association between a phenotype measured repeatedly during the course of an observational study and a set of genetic variants. Generalized score type tests are developed, which we show are robust to misspecification of within-subject correlation, a feature that is desirable for longitudinal analysis. In addition, a joint test incorporating gene-time interaction is further proposed. Computational advancement is made for scalable implementation of the proposed methods in large-scale genome-wide association studies (GWAS). The proposed methods are evaluated through extensive simulation studies and illustrated using data from the Multi-Ethnic Study of Atherosclerosis (MESA). Our simulation results indicate substantial gain in power using LGRF when compared with two commonly used existing alternatives: (i) single marker tests using longitudinal outcome and (ii) existing gene-based tests using the average value of repeated measurements as the outcome.

    View details for PubMedID 25854837

    View details for PubMedCentralID PMC4601568

  • Modeling and Testing for Joint Association Using a Genetic Random Field Model BIOMETRICS He, Z., Zhang, M., Zhan, X., Lu, Q. 2014; 70 (3): 471–79


    Substantial progress has been made in identifying single genetic variants predisposing to common complex diseases. Nonetheless, the genetic etiology of human diseases remains largely unknown. Human complex diseases are likely influenced by the joint effect of a large number of genetic variants instead of a single variant. The joint analysis of multiple genetic variants considering linkage disequilibrium (LD) and potential interactions can further enhance the discovery process, leading to the identification of new disease-susceptibility genetic variants. Motivated by development in spatial statistics, we propose a new statistical model based on the random field theory, referred to as a genetic random field model (GenRF), for joint association analysis with the consideration of possible gene-gene interactions and LD. Using a pseudo-likelihood approach, a GenRF test for the joint association of multiple genetic variants is developed, which has the following advantages: (1) accommodating complex interactions for improved performance; (2) natural dimension reduction; (3) boosting power in the presence of LD; and (4) computationally efficient. Simulation studies are conducted under various scenarios. The development has been focused on quantitative traits and robustness of the GenRF test to other traits, for example, binary traits, is also discussed. Compared with a commonly adopted kernel machine approach, SKAT, as well as other more standard methods, GenRF shows overall comparable performance and better performance in the presence of complex interactions. The method is further illustrated by an application to the Dallas Heart Study.

    View details for PubMedID 24628067

  • A Text-Based Intervention to Promote Literacy: An RCT. Pediatrics Chamberlain, L. J., Bruce, J., De La Cruz, M., Huffman, L., Steinberg, J. R., Bruguera, R., Peterson, J. W., Gardner, R. M., He, Z., Ordaz, Y., Connelly, E., Loeb, S. 2021


    BACKGROUND AND OBJECTIVES: Children entering kindergarten ready to learn are more likely to thrive. Inequitable access to high-quality, early educational settings creates early educational disparities. TipsByText, a text-message-based program for caregivers of young children, improves literacy of children in preschool, but efficacy for families without access to early childhood education was unknown. METHODS: We conducted a randomized controlled trial with caregivers of 3- and 4-year-olds in 2 public pediatric clinics. Intervention caregivers received TipsByText 3 times a week for 7 months. At pre- and postintervention, we measured child literacy using the Phonological Awareness Literacy Screening Tool (PALS-PreK) and caregiver involvement using the Parent Child Interactivity Scale (PCI). We estimated effects on PALS-PreK and PCI using multivariable linear regression. RESULTS: We enrolled 644 families, excluding 263 because of preschool participation. Compared with excluded children, those included in the study had parents with lower income and educational attainment and who were more likely to be Spanish speaking. Three-quarters of enrollees completed pre- and postintervention assessments. Postintervention PALS-PreK scores revealed an unadjusted treatment effect of 0.260 (P = .040); adjusting for preintervention score, child age, and caregiver language, treatment effect was 0.209 (P = .016), equating to 3 months of literacy gains. Effects were greater for firstborn children (0.282 vs 0.178), children in 2-parent families (0.262 vs 0.063), and 4-year-olds (0.436 vs 0.107). The overall effect on PCI was not significant (1.221, P = .124). CONCLUSIONS: The health sector has unique access to difficult-to-reach young children. With this clinic-based texting intervention, we reached underresourced families and increased child literacy levels.

    View details for DOI 10.1542/peds.2020-049648

    View details for PubMedID 34544847

  • Multitrait GWAS to connect disease variants and biological mechanisms. PLoS genetics Julienne, H., Laville, V., McCaw, Z. R., He, Z., Guillemot, V., Lasry, C., Ziyatdinov, A., Nerin, C., Vaysse, A., Lechat, P., Menager, H., Le Goff, W., Dube, M., Kraft, P., Ionita-Laza, I., Vilhjalmsson, B. J., Aschard, H. 2021; 17 (8): e1009713


    Genome-wide association studies (GWASs) have uncovered a wealth of associations between common variants and human phenotypes. Here, we present an integrative analysis of GWAS summary statistics from 36 phenotypes to decipher multitrait genetic architecture and its link with biological mechanisms. Our framework incorporates multitrait association mapping along with an investigation of the breakdown of genetic associations into clusters of variants harboring similar multitrait association profiles. Focusing on two subsets of immunity and metabolism phenotypes, we then demonstrate how genetic variants within clusters can be mapped to biological pathways and disease mechanisms. Finally, for the metabolism set, we investigate the link between gene cluster assignment and the success of drug targets in randomized controlled trials.

    View details for DOI 10.1371/journal.pgen.1009713

    View details for PubMedID 34460823

  • Do Steroids Matter? A Retrospective Review of Premedication for Taxane Chemotherapy and Hypersensitivity Reactions. Journal of clinical oncology : official journal of the American Society of Clinical Oncology Lansinger, O. M., Biedermann, S., He, Z., Colevas, A. D. 2021: JCO2101200


    PURPOSE: Despite the widespread use of the taxanes paclitaxel and docetaxel for a variety of cancers and their well-known association with hypersensitivity reactions (HSRs), there is still significant variation in the prescribing practices of steroids for premedication. Premedication almost always includes dexamethasone, which can be associated with multiple adverse effects if taken for extended periods of time. This study reviews the pattern of steroid premedication in patients who received paclitaxel or docetaxel at Stanford Cancer Institute between January 2010 and June 2020. METHODS: We used an electronic query of the electronic medical record followed up with a manual review of patient charts to ask whether we could find a correlation between steroid premedication dosing and the incidence or severity of HSRs with the first taxane dose. Variables considered included steroid dose and route, dose and type of taxane, clinical cancer group, sex, and race. RESULTS: Five thousand two hundred seventeen patients were identified as having received paclitaxel or docetaxel, and 3,181 met criteria for our analysis. There were 264 (8.3%) HSRs. In adjusted multivariate analysis, we found no correlation of HSR rate or severity among any of the variables evaluated except gynecology oncology clinic patients, who had an increased risk (hazard ratio [HR] 1.34) of HSRs overall and high-grade HSRs (HR 2.34), and female patients, who had a higher rate of HSRs overall (HR 1.26), but not high-grade HSRs. CONCLUSION: Neither dexamethasone dose nor route correlated with subsequent HSRs. Given the potential for adverse events from repeated high-dose steroids, our findings suggest that routine use of lower doses, such as a single 10 mg dose of dexamethasone, as premedication for taxanes to prevent HSRs is preferable to the current prescribing guidelines.

    View details for DOI 10.1200/JCO.21.01200

    View details for PubMedID 34357780

  • Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies. NPJ digital medicine Xu, D., Wang, C., Khan, A., Shang, N., He, Z., Gordon, A., Kullo, I. J., Murphy, S., Ni, Y., Wei, W., Gharavi, A., Kiryluk, K., Weng, C., Ionita-Laza, I. 2021; 4 (1): 116


    Labeling clinical data from electronic health records (EHR) in health systems requires extensive knowledge of human expert, and painstaking review by clinicians. Furthermore, existing phenotyping algorithms are not uniformly applied across large datasets and can suffer from inconsistencies in case definitions across different algorithms. We describe here quantitative disease risk scores based on almost unsupervised methods that require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. We show applications to phenotypic data on approximately 100,000 individuals in eMERGE, and focus on several complex diseases, including Chronic Kidney Disease, Coronary Artery Disease, Type 2 Diabetes, Heart Failure, and a few others. We demonstrate that relative to existing approaches, the proposed methods have higher prediction accuracy, can better identify phenotypic features relevant to the disease under consideration, can perform better at clinical risk stratification, and can identify undiagnosed cases based on phenotypic features available in the EHR. Using genetic data from the eMERGE-seq panel that includes sequencing data for 109 genes on 21,363 individuals from multiple ethnicities, we also show how the new quantitative disease risk scores help improve the power of genetic association studies relative to the standard use of disease phenotypes. The results demonstrate the effectiveness of quantitative disease risk scores derived from rich phenotypic EHR databases to provide a more meaningful characterization of clinical risk for diseases of interest beyond the prevalent binary (case-control) classification.

    View details for DOI 10.1038/s41746-021-00488-3

    View details for PubMedID 34302027

  • Advances and challenges in quantitative delineation of the genetic architecture of complex traits QUANTITATIVE BIOLOGY Tang, H., He, Z. 2021; 9 (2): 168-184
  • Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nature communications He, Z., Liu, L., Wang, C., Le Guen, Y., Lee, J., Gogarten, S., Lu, F., Montgomery, S., Tang, H., Silverman, E. K., Cho, M. H., Greicius, M., Ionita-Laza, I. 2021; 12 (1): 3152


    The analysis of whole-genome sequencing studies is challenging due to the large number of rare variants in noncoding regions and the lack of natural units for testing. We propose a statistical method to detect and localize rare and common risk variants in whole-genome sequencing studies based on a recently developed knockoff framework. It can (1) prioritize causal variants over associations due to linkage disequilibrium thereby improving interpretability; (2) help distinguish the signal due to rare variants from shadow effects of significant common variants nearby; (3) integrate multiple knockoffs for improved power, stability, and reproducibility; and (4) flexibly incorporate state-of-the-art and future association tests to achieve the benefits proposed here. In applications to whole-genome sequencing data from the Alzheimer's Disease Sequencing Project (ADSP) and COPDGene samples from NHLBI Trans-Omics for Precision Medicine (TOPMed) Program we show that our method compared with conventional association tests can lead to substantially more discoveries.

    View details for DOI 10.1038/s41467-021-22889-4

    View details for PubMedID 34035245

  • A novel age-informed approach for genetic association analysis in Alzheimer's disease. Alzheimer's research & therapy Le Guen, Y., Belloy, M. E., Napolioni, V., Eger, S. J., Kennedy, G., Tao, R., He, Z., Greicius, M. D., Alzheimers Disease Neuroimaging Initiative 2021; 13 (1): 72


    BACKGROUND: Many Alzheimer's disease (AD) genetic association studies disregard age or incorrectly account for it, hampering variant discovery. METHODS: Using simulated data, we compared the statistical power of several models: logistic regression on AD diagnosis adjusted and not adjusted for age; linear regression on a score integrating case-control status and age; and multivariate Cox regression on age-at-onset. We applied these models to real exome-wide data of 11,127 sequenced individuals (54% cases) and replicated suggestive associations in 21,631 genotype-imputed individuals (51% cases). RESULTS: Modeling variable AD risk across age results in 5-10% statistical power gain compared to logistic regression without age adjustment, while incorrect age adjustment leads to critical power loss. Applying our novel AD-age score and/or Cox regression, we discovered and replicated novel variants associated with AD on KIF21B, USH2A, RAB10, RIN3, and TAOK2 genes. CONCLUSION: Our AD-age score provides a simple means for statistical power gain and is recommended for future AD studies.

    View details for DOI 10.1186/s13195-021-00808-5

    View details for PubMedID 33794991

  • KLVS heterozygosity reduces brain amyloid in asymptomatic at-risk APOE4 carriers. Neurobiology of aging Belloy, M. E., Eger, S. J., Le Guen, Y., Napolioni, V., Deters, K. D., Yang, H., Scelsi, M. A., Porter, T., James, S., Wong, A., Schott, J. M., Sperling, R. A., Laws, S. M., Mormino, E. C., He, Z., Han, S. S., Altmann, A., Greicius, M. D., A4 Study Team, Insight 46 Study Team, Australian Imaging Biomarkers and Lifestyle (AIBL) Study, Alzheimer's Disease Neuroimaging Initiative 2021; 101: 123–29


    KLOTHOVS heterozygosity (KLVSHET+) was recently shown to be associated with reduced risk of Alzheimer's disease (AD) in APOE4 carriers. Additional studies suggest that KLVSHET+ protects against amyloid burden in cognitively normal older subjects, but sample sizes were too small to draw definitive conclusions. We performed a well-powered meta-analysis across 5 independent studies, comprising 3581 pre-clinical participants ages 60-80, to investigate whether KLVSHET+ reduces the risk of having an amyloid-positive positron emission tomography scan. Analyses were stratified by APOE4 status. KLVSHET+ reduced the risk of amyloid positivity in APOE4 carriers (odds ratio = 0.67 [0.52-0.88]; p = 3.5 × 10^-3), but not in APOE4 non-carriers (odds ratio = 0.94 [0.73-1.21]; p = 0.63). The combination of APOE4 and KLVS genotypes should help enrich AD clinical trials for pre-symptomatic subjects at increased risk of developing amyloid aggregation and AD. KL-related pathways may help elucidate protective mechanisms against amyloid accumulation and merit exploration for novel AD drug targets. Future investigation of the biological mechanisms by which KL interacts with APOE4 and AD is warranted.

    View details for DOI 10.1016/j.neurobiolaging.2021.01.008

    View details for PubMedID 33610961

  • Treatment Practices and Outcomes in Continuous Spike and Wave During Slow Wave Sleep (CSWS): A Multicenter Collaboration. The Journal of pediatrics Baumer, F. M., McNamara, N. A., Fine, A. L., Pestana-Knight, E., Shellhaas, R. A., He, Z., Arndt, D. H., Gaillard, W. D., Kelley, S. A., Nagan, M., Ostendorf, A. P., Singhal, N. S., Speltz, L., Chapman, K. E. 2021


    To determine how Continuous Spike and Wave during Slow Wave Sleep (CSWS) is currently managed and to compare the effectiveness of current treatment strategies using a database from 11 pediatric epilepsy centers in the United States. This retrospective study gathered information on baseline clinical characteristics, CSWS etiology, and treatment(s) in consecutive patients seen between 2014-2016 at 11 epilepsy referral centers. Treatments were categorized as benzodiazepines, steroids, other antiseizure medications (ASMs), or other therapies. Two measures of treatment response [clinical improvement as noted by the treating physician; and EEG improvement] were compared across therapies, controlling for baseline variables. 81 children underwent 153 treatment trials during the study period (68 trials of benzodiazepines, 25 of steroids, 45 of ASMs, 14 of other therapies). Children most frequently received benzodiazepines (62%) or ASMs (27%) as first line therapy. Treatment choice did not differ based on baseline clinical variables, nor did these variables correlate with outcome. After adjusting for baseline variables, children had a greater odds of clinical improvement with benzodiazepines (OR 3.32, 95% CI 1.57-7.04, P = .002) or steroids (OR 4.04, 95% CI 1.41-11.59, P = .01) than with ASMs and a greater odds of EEG improvement after steroids (OR 3.36, 95% CI 1.09-10.33, P = .03) than after ASMs. Benzodiazepines and ASMs are the most frequent initial therapy prescribed for CSWS in the United States. Our data suggest that ASMs are inferior to benzodiazepines and steroids and support earlier use of these therapies. Multicenter prospective studies that rigorously assess treatment protocols and outcomes are needed.

    View details for DOI 10.1016/j.jpeds.2021.01.032

    View details for PubMedID 33484700

  • Generalizable Sample-Efficient Siamese Autoencoder for Tinnitus Diagnosis in Listeners With Subjective Tinnitus IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING Liu, Z., Yao, L., Wang, X., Monaghan, J. M., Schaette, R., He, Z., McAlpine, D. 2021; 29: 1452-1461


    Electroencephalogram (EEG)-based neurofeedback has been widely studied for tinnitus therapy in recent years. Most existing research relies on experts' cognitive prediction, and studies based on machine learning and deep learning are either data-hungry or not well generalizable to new subjects. In this paper, we propose a robust, data-efficient model for distinguishing tinnitus from the healthy state based on EEG-based tinnitus neurofeedback. We propose trend descriptor, a feature extractor with lower fineness, to reduce the effect of electrode noises on EEG signals, and a siamese encoder-decoder network boosted in a supervised manner to learn accurate alignment and to acquire high-quality transferable mappings across subjects and EEG signal channels. Our experiments show the proposed method significantly outperforms state-of-the-art algorithms when analyzing subjects' EEG neurofeedback to 90dB and 100dB sound, achieving an accuracy of 91.67%-94.44% in predicting tinnitus and control subjects in a subject-independent setting. Our ablation studies on mixed subjects and parameters show the method's stability in performance.

    View details for DOI 10.1109/TNSRE.2021.3095298

    View details for Web of Science ID 000678331300009

    View details for PubMedID 34232883

  • Administration of Dexamethasone for Bacterial Meningitis: An Unreliable Quality Measure. The Neurohospitalist Dujari, S., Gummidipundi, S., He, Z., Gold, C. A. 2021; 11 (2): 101–6


    To validate the use of administrative data to identify patients with bacterial meningitis and quantify the rate of dexamethasone administration as defined in the American Academy of Neurology Inpatient and Emergency Care Quality Measurement Set. The Vizient Clinical Data Base and Resource Manager was used to identify patients with International Classification of Diseases, Tenth Revision (ICD-10) codes for bacterial meningitis from October 2015 to June 2019. Chart review was performed on patients identified at a single quaternary-care hospital. The positive predictive value (PPV) of Vizient was determined. Demographic, clinical, and laboratory data were assessed using descriptive statistics. Of all hospitals that submitted complete data to Vizient during the study period, a median of 19 patients per hospital had ICD-10 codes for bacterial meningitis in the 45-month period. We identified 79 patients using Vizient at our institution, of whom 69 had a diagnosis of bacterial meningitis confirmed by chart review (PPV = 87%). Fifteen patients were eligible to receive dexamethasone per the quality measurement set. Six of these patients (40%) received dexamethasone. It is feasible to use the Vizient Clinical Data Base and Resource Manager to identify patients with bacterial meningitis. Due to low prevalence across multiple institutions and high rate of exclusion criteria at our institution, this study suggests that the rate of dexamethasone administration in bacterial meningitis may be an unreliable indicator of quality of care provided by inpatient neurologists. The creation of a registry for hospitalized neurology patients could enhance development of future quality measures.

    View details for DOI 10.1177/1941874420969556

    View details for PubMedID 33791051

    View details for PubMedCentralID PMC7958681

  • An evolutionarily acquired microRNA shapes development of mammalian cortical projections. Proceedings of the National Academy of Sciences of the United States of America Diaz, J. L., Siththanandan, V. B., Lu, V., Gonzalez-Nava, N., Pasquina, L., MacDonald, J. L., Woodworth, M. B., Ozkan, A., Nair, R., He, Z., Sahni, V., Sarnow, P., Palmer, T. D., Macklis, J. D., Tharin, S. 2020


    The corticospinal tract is unique to mammals and the corpus callosum is unique to placental mammals (eutherians). The emergence of these structures is thought to underpin the evolutionary acquisition of complex motor and cognitive skills. Corticospinal motor neurons (CSMN) and callosal projection neurons (CPN) are the archetypal projection neurons of the corticospinal tract and corpus callosum, respectively. Although a number of conserved transcriptional regulators of CSMN and CPN development have been identified in vertebrates, none are unique to mammals and most are coexpressed across multiple projection neuron subtypes. Here, we discover 17 CSMN-enriched microRNAs (miRNAs), 15 of which map to a single genomic cluster that is exclusive to eutherians. One of these, miR-409-3p, promotes CSMN subtype identity in part via repression of LMO4, a key transcriptional regulator of CPN development. In vivo, miR-409-3p is sufficient to convert deep-layer CPN into CSMN. This is a demonstration of an evolutionarily acquired miRNA in eutherians that refines cortical projection neuron subtype development. Our findings implicate miRNAs in the eutherians' increase in neuronal subtype and projection diversity, the anatomic underpinnings of their complex behavior.

    View details for DOI 10.1073/pnas.2006700117

    View details for PubMedID 33139574

  • Benchmarking Performance on Administration of Dexamethasone for Bacterial Meningitis Dujari, S., Gummidipundi, S., He, Z., Gold, C. LIPPINCOTT WILLIAMS & WILKINS. 2020
  • Interaction analysis under misspecification of main effects: Some common mistakes and simple solutions. Statistics in medicine Zhang, M., Yu, Y., Wang, S., Salvatore, M., Fritsche, L. G., He, Z., Mukherjee, B. 2020


    The statistical practice of modeling interaction with two linear main effects and a product term is ubiquitous in the statistical and epidemiological literature. Most data modelers are aware that the misspecification of main effects can potentially cause severe type I error inflation in tests for interactions, leading to spurious detection of interactions. However, modeling practice has not changed. In this article, we focus on the specific situation where the main effects in the model are misspecified as linear terms and characterize its impact on common tests for statistical interaction. We then propose some simple alternatives that fix the issue of potential type I error inflation in testing interaction due to main effect misspecification. We show that when using the sandwich variance estimator for a linear regression model with a quantitative outcome and two independent factors, both the Wald and score tests asymptotically maintain the correct type I error rate. However, if the independence assumption does not hold or the outcome is binary, using the sandwich estimator does not fix the problem. We further demonstrate that flexibly modeling the main effect under a generalized additive model can largely reduce or often remove bias in the estimates and maintain the correct type I error rate for both quantitative and binary outcomes regardless of the independence assumption. We show, under the independence assumption and for a continuous outcome, overfitting and flexibly modeling the main effects does not lead to power loss asymptotically relative to a correctly specified main effect model. Our simulation study further demonstrates the empirical fact that using flexible models for the main effects does not result in a significant loss of power for testing interaction in general. Our results provide an improved understanding of the strengths and limitations for tests of interaction in the presence of main effect misspecification. Using data from a large biobank study "The Michigan Genomics Initiative", we present two examples of interaction analysis in support of our results.

    View details for DOI 10.1002/sim.8505

    View details for PubMedID 32101638

  • Detecting Rare Mutations with Heterogeneous Effects Using a Family-Based Genetic Random Field Method. Genetics Li, M., He, Z., Tong, X., Witte, J. S., Lu, Q. 2018; 210 (2): 463–76


    The genetic etiology of many complex diseases is highly heterogeneous. A complex disease can be caused by multiple mutations within the same gene or mutations in multiple genes at various genomic loci. Although these disease-susceptibility mutations can be collectively common in the population, they are often individually rare or even private to certain families. Family-based studies are powerful for detecting rare variants enriched in families, which is an important feature for sequencing studies due to the heterogeneous nature of rare variants. In addition, family designs can provide robust protection against population stratification. Nevertheless, statistical methods for analyzing family-based sequencing data are underdeveloped, especially those accounting for heterogeneous etiology of complex diseases. In this article, we introduce a random field framework for detecting gene-phenotype associations in family-based sequencing studies, referred to as family-based genetic random field (FGRF). Similar to existing family-based association tests, FGRF could utilize within-family and between-family information separately or jointly to test an association. We demonstrate that FGRF has comparable statistical power with existing methods when there is no genetic heterogeneity, but can improve statistical power when there is genetic heterogeneity across families. The proposed method also shares the same advantages with the conventional family-based association tests (e.g., being robust to population stratification). Finally, we applied the proposed method to sequencing data from the Minnesota Twin Family Study, and revealed several genes, including SAMD14, potentially associated with alcohol dependence.

    View details for PubMedID 30104420

    View details for PubMedCentralID PMC6216585

  • Interaction between Social/Psychosocial Factors and Genetic Variants on Body Mass Index: A Gene-Environment Interaction Analysis in a Longitudinal Setting INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH Zhao, W., Ware, E. B., He, Z., Kardia, S. R., Faul, J. D., Smith, J. A. 2017; 14 (10)


    Obesity, which develops over time, is one of the leading causes of chronic diseases such as cardiovascular disease. However, hundreds of BMI (body mass index)-associated genetic loci identified through large-scale genome-wide association studies (GWAS) only explain about 2.7% of BMI variation. Most common human traits are believed to be influenced by both genetic and environmental factors. Past studies suggest a variety of environmental features that are associated with obesity, including socioeconomic status and psychosocial factors. This study combines both gene/regions and environmental factors to explore whether social/psychosocial factors (childhood and adult socioeconomic status, social support, anger, chronic burden, stressful life events, and depressive symptoms) modify the effect of sets of genetic variants on BMI in European American and African American participants in the Health and Retirement Study (HRS). In order to incorporate longitudinal phenotype data collected in the HRS and investigate entire sets of single nucleotide polymorphisms (SNPs) within gene/region simultaneously, we applied a novel set-based test for gene-environment interaction in longitudinal studies (LGEWIS). Childhood socioeconomic status (parental education) was found to modify the genetic effect in the gene/region around SNP rs9540493 on BMI in European Americans in the HRS. The most significant SNP (rs9540488) by childhood socioeconomic status interaction within the rs9540493 gene/region was suggestively replicated in the Multi-Ethnic Study of Atherosclerosis (MESA) (p = 0.07).

    View details for PubMedID 28961216

  • Testing Allele Transmission of an SNP Set Using a Family-Based Generalized Genetic Random Field Method GENETIC EPIDEMIOLOGY Li, M., Li, J., He, Z., Lu, Q., Witte, J. S., Macleod, S. L., Hobbs, C. A., Cleves, M. A., Natl Birth Defects Prevention Stud 2016; 40 (4): 341–51


    Family-based association studies are commonly used in genetic research because they can be robust to population stratification (PS). Recent advances in high-throughput genotyping technologies have produced a massive amount of genomic data in family-based studies. However, current family-based association tests are mainly focused on evaluating individual variants one at a time. In this article, we introduce a family-based generalized genetic random field (FB-GGRF) method to test the joint association between a set of autosomal SNPs (i.e., single-nucleotide polymorphisms) and disease phenotypes. The proposed method is a natural extension of a recently developed GGRF method for population-based case-control studies. It models offspring genotypes conditional on parental genotypes, and, thus, is robust to PS. Through simulations, we showed that under various disease scenarios the FB-GGRF has improved power over a commonly used family-based sequence kernel association test (FB-SKAT). Further, similar to GGRF, the proposed FB-GGRF method is asymptotically well-behaved, and does not require empirical adjustment of the type I error rates. We illustrate the proposed method using a study of congenital heart defects with family trios from the National Birth Defects Prevention Study (NBDPS).

    View details for PubMedID 27061818

    View details for PubMedCentralID PMC5061344

  • Risk Prediction Modeling of Sequencing Data Using a Forward Random Field Method SCIENTIFIC REPORTS Wen, Y., He, Z., Li, M., Lu, Q. 2016; 6: 21120


    With the advance in high-throughput sequencing technology, it is feasible to investigate the role of common and rare variants in disease risk prediction. While the new technology holds great promise to improve disease prediction, the massive amount of data and low frequency of rare variants pose great analytical challenges on risk prediction modeling. In this paper, we develop a forward random field method (FRF) for risk prediction modeling using sequencing data. In FRF, subjects' phenotypes are treated as stochastic realizations of a random field on a genetic space formed by subjects' genotypes, and an individual's phenotype can be predicted by adjacent subjects with similar genotypes. The FRF method allows for multiple similarity measures and candidate genes in the model, and adaptively chooses the optimal similarity measure and disease-associated genes to reflect the underlying disease model. It also avoids the specification of the threshold of rare variants and allows for different directions and magnitudes of genetic effects. Through simulations, we demonstrate the FRF method attains higher or comparable accuracy over commonly used support vector machine based methods under various disease models. We further illustrate the FRF method with an application to the sequencing data obtained from the Dallas Heart Study.

    View details for PubMedID 26892725

  • Association between Stress Response Genes and Features of Diurnal Cortisol Curves in the Multi-Ethnic Study of Atherosclerosis: A New Multi-Phenotype Approach for Gene-Based Association Tests PLOS ONE He, Z., Payne, E. K., Mukherjee, B., Lee, S., Smith, J. A., Ware, E. B., Sanchez, B. N., Seeman, T. E., Kardia, S. R., Roux, A. 2015; 10 (5): e0126637


    The hormone cortisol is likely to be a key mediator of the stress response that influences multiple physiologic systems that are involved in common chronic disease, including the cardiovascular system, the immune system, and metabolism. In this paper, a candidate gene approach was used to investigate genetic contributions to variability in multiple correlated features of the daily cortisol profile in a sample of European Americans, African Americans, and Hispanic Americans from the Multi-Ethnic Study of Atherosclerosis (MESA). We proposed and applied a new gene-level multiple-phenotype analysis and carried out a meta-analysis to combine the ethnicity-specific results. This new analysis, instead of a more routine single marker-single phenotype approach, identified a significant association between one gene (ADRB2) and cortisol features (meta-analysis p-value=0.0025), which was not identified by three other commonly used existing analytic strategies: 1. Single marker association tests involving each single cortisol feature separately; 2. Single marker association tests jointly testing for multiple cortisol features; 3. Gene-level association tests separately carried out for each single cortisol feature. The analytic strategies presented consider different hypotheses regarding genotype-phenotype association and imply different costs of multiple testing. The proposed gene-level analysis integrating multiple cortisol features across multiple ethnic groups provides new insights into the gene-cortisol association.

    View details for PubMedID 25993632

  • A Powerful Nonparametric Statistical Framework for Family-Based Association Analyses GENETICS Li, M., He, Z., Schaid, D. J., Cleves, M. A., Nick, T. G., Lu, Q. 2015; 200 (1): 69–U140


    Family-based study design is commonly used in genetic research. It has many ideal features, including being robust to population stratification (PS). With the advance of high-throughput technologies and ever-decreasing genotyping cost, it has become common for family studies to examine a large number of variants for their associations with disease phenotypes. The yield from the analysis of these family-based genetic data can be enhanced by adopting computationally efficient and powerful statistical methods. We propose a general framework of a family-based U-statistic, referred to as family-U, for family-based association studies. Unlike existing parametric-based methods, the proposed method makes no assumption of the underlying disease models and can be applied to various phenotypes (e.g., binary and quantitative phenotypes) and pedigree structures (e.g., nuclear families and extended pedigrees). By using only within-family information, it can offer robust protection against PS. In the absence of PS, it can also utilize additional information (i.e., between-family information) for power improvement. Through simulations, we demonstrated that family-U attained higher power over a commonly used method, family-based association tests, under various disease scenarios. We further illustrated the new method with an application to large-scale family data from the Framingham Heart Study. By utilizing additional information (i.e., between-family information), family-U confirmed a previous association of CHRNA5 with nicotine dependence.

    View details for PubMedID 25745024

    View details for PubMedCentralID PMC4423382

  • A Weighted U-Statistic for Genetic Association Analyses of Sequencing Data GENETIC EPIDEMIOLOGY Wei, C., Li, M., He, Z., Vsevolozhskaya, O., Schaid, D. J., Lu, Q. 2014; 38 (8): 699–708


    With advancements in next-generation sequencing technology, a massive amount of sequencing data is generated, which offers a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, the high-dimensional sequencing data poses a great challenge for statistical analysis. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a Weighted U Sequencing test, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a nonparametric U-statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used sequence kernel association test (SKAT) method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL4 and very low density lipoprotein cholesterol.

    View details for DOI 10.1002/gepi.21864

    View details for Web of Science ID 000345292600005

    View details for PubMedID 25331574

    View details for PubMedCentralID PMC4236269

  • A Generalized Genetic Random Field Method for the Genetic Association Analysis of Sequencing Data GENETIC EPIDEMIOLOGY Li, M., He, Z., Zhang, M., Zhan, X., Wei, C., Elston, R. C., Lu, Q. 2014; 38 (3): 242–53


    With the advance of high-throughput sequencing technologies, it has become feasible to investigate the influence of the entire spectrum of sequencing variations on complex human diseases. Although association studies utilizing the new sequencing technologies hold great promise to unravel novel genetic variants, especially rare genetic variants that contribute to human diseases, the statistical analysis of high-dimensional sequencing data remains a challenge. Advanced analytical methods are in great need to facilitate high-dimensional sequencing data analyses. In this article, we propose a generalized genetic random field (GGRF) method for association analyses of sequencing data. Like other similarity-based methods (e.g., SIMreg and SKAT), the new method has the advantages of avoiding the need to specify thresholds for rare variants and allowing for testing multiple variants acting in different directions and magnitude of effects. The method is built on the generalized estimating equation framework and thus accommodates a variety of disease phenotypes (e.g., quantitative and binary phenotypes). Moreover, it has a nice asymptotic property, and can be applied to small-scale sequencing data without need for small-sample adjustment. Through simulations, we demonstrate that the proposed GGRF attains an improved or comparable power over a commonly used method, SKAT, under various disease scenarios, especially when rare variants play a significant role in disease etiology. We further illustrate GGRF with an application to a real dataset from the Dallas Heart Study. By using GGRF, we were able to detect the association of two candidate genes, ANGPTL3 and ANGPTL4, with serum triglyceride.

    View details for DOI 10.1002/gepi.21790

    View details for Web of Science ID 000332700300007

    View details for PubMedID 24482034

    View details for PubMedCentralID PMC5241166