Barbara E Engelhardt is a Senior Investigator at Gladstone Institutes and Professor at Stanford University in the Department of Biomedical Data Science. She received her B.S. (Symbolic Systems) and M.S. (Computer Science) from Stanford University and her PhD from UC Berkeley (EECS) advised by Prof. Michael I Jordan. She was a postdoctoral fellow with Prof. Matthew Stephens at the University of Chicago. She was an Assistant Professor at Duke University from 2011-2014, and an Assistant, Associate, and then Full Professor at Princeton University in Computer Science from 2014-2022. She has worked at Jet Propulsion Labs, Google Research, 23andMe, and Genomics plc. In her career, she received an NSF GRFP, the Google Anita Borg Scholarship, the SMBE Walter M. Fitch Prize (2004), a Sloan Faculty Fellowship, an NSF CAREER, and the ISCB Overton Prize (2021). Her research is focused on developing and applying models for structured biomedical data that capture patterns in the data, predict results of interventions to the system, assist with decision-making support, and prioritize experiments for design and engineering of biological systems.
Professor (Research), Department of Biomedical Data Science
Senior Investigator, Gladstone Institutes (2021 - Present)
Hierarchical Gaussian Processes and Mixtures of Experts to Model COVID-19 Patient Trajectories.
Pacific Symposium on Biocomputing
2022; 27: 266-277
Gaussian processes (GPs) are a versatile nonparametric model for nonlinear regression and have been widely used to study spatiotemporal phenomena. However, standard GPs offer limited interpretability and generalizability for datasets with naturally occurring hierarchies. With large-scale, rapidly-updating electronic health record (EHR) data, we want to study patient trajectories across diverse patient cohorts while preserving patient subgroup structure. In this work, we partition our cohort of over 2000 COVID-19 patients by sex and ethnicity. We develop and apply a hierarchical Gaussian process and a mixture of experts (MOE) hierarchical GP model to fit patient trajectories on clinical markers of disease progression. A case study for albumin, an effective predictor of COVID-19 patient outcomes, highlights the predictive performance of these models. These hierarchical spatiotemporal models of EHR data bring us a step closer toward our goal of building flexible approaches to capture patient data that can be used in real-time systems.
View details for PubMedID 34890155
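To make the idea of a hierarchical GP in the abstract above concrete, one generic two-level construction is sketched below. This is a textbook-style form and not necessarily the exact model of the paper; the symbols $g$, $f_j$, $k_g$, $k_f$, and $\sigma^2$ are assumptions introduced here for illustration.

```latex
% Shared subgroup-level trend g and patient-level trajectories f_j (illustrative notation)
\begin{aligned}
g(t) &\sim \mathcal{GP}\bigl(0,\; k_g(t, t')\bigr), \\
f_j(t) \mid g &\sim \mathcal{GP}\bigl(g(t),\; k_f(t, t')\bigr), \\
y_{jn} &= f_j(t_{jn}) + \varepsilon_{jn}, \qquad \varepsilon_{jn} \sim \mathcal{N}(0, \sigma^2), \\
\operatorname{Cov}\bigl(f_j(t), f_{j'}(t')\bigr) &= k_g(t, t') + \delta_{jj'}\, k_f(t, t').
\end{aligned}
```

Under this sketch, patients in the same subgroup share the covariance term $k_g$, while the term $k_f$ applies only within a single patient's trajectory.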
Telescoping bimodal latent Dirichlet allocation to identify expression QTLs across tissues.
Life science alliance
2022; 5 (12)
Expression quantitative trait loci (eQTLs), or single-nucleotide polymorphisms that affect average gene expression levels, provide important insights into context-specific gene regulation. Classic eQTL analyses use one-to-one association tests, which test gene-variant pairs individually and ignore correlations induced by gene regulatory networks and linkage disequilibrium. Probabilistic topic models, such as latent Dirichlet allocation, estimate latent topics for a collection of count observations. Prior multimodal frameworks that bridge genotype and expression data assume matched sample numbers between modalities. However, many data sets have a nested structure where one individual has several associated gene expression samples and a single germline genotype vector. Here, we build a telescoping bimodal latent Dirichlet allocation (TBLDA) framework to learn shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual's genotype. By using raw count data, our model avoids possible adulteration via normalization procedures. Ancestral structure is captured in a genotype-specific latent space, effectively removing it from shared components. Using GTEx v8 expression data across 10 tissues and genotype data, we show that the estimated topics capture meaningful and robust biological signal in both modalities and identify associations within and across tissue types. We identify 4,645 cis-eQTLs and 995 trans-eQTLs by conducting eQTL mapping between the most informative features in each topic. Our TBLDA model is able to identify associations using raw sequencing count data when the samples in two separate data modalities are matched one-to-many, as is often the case in biological data. Our code is freely available at https://github.com/gewirtz/TBLDA.
View details for DOI 10.26508/lsa.202101297
View details for PubMedID 35977827
View details for PubMedCentralID PMC9387650
Contrastive latent variable modeling with application to case-control sequencing experiments
Annals of Applied Statistics
2022; 16 (3): 1268-1291
Towards 'end-to-end' analysis and understanding of biological timecourse data.
The Biochemical journal
2022; 479 (11): 1257-1263
Petabytes of increasingly complex and multidimensional live cell and tissue imaging data are generated every year. These videos hold large promise for understanding biology at a deep and fundamental level, as they capture single-cell and multicellular events occurring over time and space. However, the current modalities for analysis and mining of these data are scattered and user-specific, preventing more unified analyses from being performed over different datasets and obscuring possible scientific insights. Here, we propose a unified pipeline for storage, segmentation, analysis, and statistical parametrization of live cell imaging datasets.
View details for DOI 10.1042/BCJ20220053
View details for PubMedID 35713413
View details for PubMedCentralID PMC9246344
Guiding Efficient, Effective, and Patient-Oriented Electrolyte Replacement in Critical Care: An Artificial Intelligence Reinforcement Learning Approach.
Journal of personalized medicine
2022; 12 (5)
Both provider- and protocol-driven electrolyte replacement have been linked to the over-prescription of ubiquitous electrolytes. Here, we describe the development and retrospective validation of a data-driven clinical decision support tool that uses reinforcement learning (RL) algorithms to recommend patient-tailored electrolyte replacement policies for ICU patients. We used electronic health records (EHR) data that originated from two institutions (UPHS; MIMIC-IV). The tool uses a set of patient characteristics, such as their physiological and pharmacological state, a pre-defined set of possible repletion actions, and a set of clinical goals to present clinicians with a recommendation for the route and dose of an electrolyte. RL-driven electrolyte repletion substantially reduces the frequency of magnesium and potassium replacements (up to 60%), adjusts the timing of interventions in all three electrolytes considered (potassium, magnesium, and phosphate), and shifts them towards orally administered repletion over intravenous replacement. This shift in recommended treatment limits risk of the potentially harmful effects of over-repletion and implies monetary savings. Overall, the RL-driven electrolyte repletion recommendations reduce excess electrolyte replacements and improve the safety, precision, efficacy, and cost of each electrolyte repletion event, while showing robust performance across patient cohorts and hospital systems.
View details for DOI 10.3390/jpm12050661
View details for PubMedID 35629084
View details for PubMedCentralID PMC9143326
Brain kernel: A new spatial covariance function for fMRI data.
2021; 245: 118580
A key problem in functional magnetic resonance imaging (fMRI) is to estimate spatial activity patterns from noisy high-dimensional signals. Spatial smoothing provides one approach to regularizing such estimates. However, standard smoothing methods ignore the fact that correlations in neural activity may fall off at different rates in different brain areas, or exhibit discontinuities across anatomical or functional boundaries. Moreover, such methods do not exploit the fact that widely separated brain regions may exhibit strong correlations due to bilateral symmetry or the network organization of brain regions. To capture this non-stationary spatial correlation structure, we introduce the brain kernel, a continuous covariance function for whole-brain activity patterns. We define the brain kernel in terms of a continuous nonlinear mapping from 3D brain coordinates to a latent embedding space, parametrized with a Gaussian process (GP). The brain kernel specifies the prior covariance between voxels as a function of the distance between their locations in embedding space. The GP mapping warps the brain nonlinearly so that highly correlated voxels are close together in latent space, and uncorrelated voxels are far apart. We estimate the brain kernel using resting-state fMRI data, and we develop an exact, scalable inference method based on block coordinate descent to overcome the challenges of high dimensionality (10-100K voxels). Finally, we illustrate the brain kernel's usefulness with applications to brain decoding and factor analysis with multiple task-based fMRI datasets.
View details for DOI 10.1016/j.neuroimage.2021.118580
View details for PubMedID 34740792
A self-exciting point process to study multicellular spatial signaling patterns
Proceedings of the National Academy of Sciences of the United States of America
2021; 118 (32)
Multicellular organisms rely on spatial signaling among cells to drive their organization, development, and response to stimuli. Several models have been proposed to capture the behavior of spatial signaling in multicellular systems, but existing approaches fail to capture both the autonomous behavior of single cells and the interactions of a cell with its neighbors simultaneously. We propose a spatiotemporal model of dynamic cell signaling based on Hawkes processes (self-exciting point processes) that model the signaling processes within a cell and spatial couplings between cells. With this cellular point process (CPP), we capture both the single-cell pathway activation rate and the magnitude and duration of signaling between cells relative to their spatial location. Furthermore, our model captures tissues composed of heterogeneous cell types with different bursting rates and signaling behaviors across multiple signaling proteins. We apply our model to epithelial cell systems that exhibit a range of autonomous and spatial signaling behaviors basally and under pharmacological exposure. Our model identifies known drug-induced signaling deficits, characterizes signaling changes across a wound front, and generalizes to multichannel observations.
View details for DOI 10.1073/pnas.2026123118
View details for Web of Science ID 000685043400011
View details for PubMedID 34362843
View details for PubMedCentralID PMC8364135
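As a rough illustration of the self-exciting idea behind the Hawkes process mentioned in the abstract above, the sketch below evaluates the conditional intensity of a univariate Hawkes process with an exponential kernel. It is a minimal, purely temporal example and not the cellular point process of the paper; the function name `hawkes_intensity`, the parameters `mu`, `alpha`, `beta`, and the event times are all hypothetical.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i)).

    mu is the baseline rate, alpha the jump added by each past event,
    and beta the rate at which that excitation decays (all illustrative).
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i)) for t_i in event_times if t_i < t
    )
    return mu + excitation


# Hypothetical event times for a single cell.
events = [1.0, 1.5, 4.0]
baseline = hawkes_intensity(0.5, events, mu=0.2, alpha=0.8, beta=1.0)
after_burst = hawkes_intensity(1.6, events, mu=0.2, alpha=0.8, beta=1.0)
print(baseline, after_burst)
```

Before any event the intensity equals the baseline rate; shortly after two closely spaced events it is elevated, which is the self-exciting behavior. A spatial version would add a term that couples each cell's intensity to events in neighboring cells.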
Joint analysis of expression levels and histological images identifies genes associated with tissue morphology
2021; 12 (1): 1609
Histopathological images are used to characterize complex phenotypes such as tumor stage. Our goal is to associate features of stained tissue images with high-dimensional genomic markers. We use convolutional autoencoders and sparse canonical correlation analysis (CCA) on paired histological images and bulk gene expression to identify subsets of genes whose expression levels in a tissue sample correlate with subsets of morphological features from the corresponding sample image. We apply our approach, ImageCCA, to two TCGA data sets, and find gene sets associated with the structure of the extracellular matrix and cell wall infrastructure, implicating uncharacterized genes in extracellular processes. We find sets of genes associated with specific cell types, including neuronal cells and cells of the immune system. We apply ImageCCA to the GTEx v6 data, and find image features that capture population variation in thyroid and in colon tissues associated with genetic variants (image morphology QTLs, or imQTLs), suggesting that genetic variation regulates population variation in tissue morphological traits.
View details for DOI 10.1038/s41467-021-21727-x
View details for Web of Science ID 000629648100001
View details for PubMedID 33707455
View details for PubMedCentralID PMC7952575
Optimal marker gene selection for cell type discrimination in single cell analyses
2021; 12 (1): 1186
Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of disassociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a data set given a hierarchy of cell types as labels, the markers found by our method improve the recovery of the cell type hierarchy with fewer markers than existing methods using a computationally efficient and principled optimization.
View details for DOI 10.1038/s41467-021-21453-4
View details for Web of Science ID 000621494500002
View details for PubMedID 33608535
View details for PubMedCentralID PMC7895823
COP-E-CAT: Cleaning and Organization Pipeline for EHR Computational and Analytic Tasks
Association for Computing Machinery
2021
Causal network inference from gene transcriptional time-series response to glucocorticoids
PLoS computational biology
2021; 17 (1): e1008223
Gene regulatory network inference is essential to uncover complex relationships among gene pathways and inform downstream experiments, ultimately enabling regulatory network re-engineering. Network inference from transcriptional time-series data requires accurate, interpretable, and efficient determination of causal relationships among thousands of genes. Here, we develop Bootstrap Elastic net regression from Time Series (BETS), a statistical framework based on Granger causality for the recovery of a directed gene network from transcriptional time-series data. BETS uses elastic net regression and stability selection from bootstrapped samples to infer causal relationships among genes. BETS is highly parallelized, enabling efficient analysis of large transcriptional data sets. We show competitive accuracy on a community benchmark, the DREAM4 100-gene network inference challenge, where BETS is one of the fastest among methods of similar performance and additionally infers whether causal effects are activating or inhibitory. We apply BETS to transcriptional time-series data of differentially-expressed genes from A549 cells exposed to glucocorticoids over a period of 12 hours. We identify a network of 2768 genes and 31,945 directed edges (FDR ≤ 0.2). We validate inferred causal network edges using two external data sources: Overexpression experiments on the same glucocorticoid system, and genetic variants associated with inferred edges in primary lung tissue in the Genotype-Tissue Expression (GTEx) v6 project. BETS is available as an open source software package at https://github.com/lujonathanh/BETS.
View details for DOI 10.1371/journal.pcbi.1008223
View details for Web of Science ID 000613893600001
View details for PubMedID 33513136
View details for PubMedCentralID PMC7875426
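The Granger-causality idea underlying the abstract above can be illustrated with a deliberately simplified sketch: a lagged variable x is said to Granger-cause y if adding x's past to a model of y reduces the prediction error. The code below uses plain least squares with a single lag on simulated data; it does not implement the elastic net, bootstrap, or stability selection that BETS uses, and the helper `lag1_rss` and all simulation settings are hypothetical.

```python
import random


def lag1_rss(y, x=None):
    """Residual sum of squares for y[t] ~ a*y[t-1] (+ b*x[t-1]), least squares, no intercept."""
    target, y_lag = y[1:], y[:-1]
    if x is None:
        a = sum(p * q for p, q in zip(y_lag, target)) / sum(p * p for p in y_lag)
        return sum((q - a * p) ** 2 for p, q in zip(y_lag, target))
    x_lag = x[:-1]
    # Solve the 2x2 normal equations with Cramer's rule.
    syy = sum(p * p for p in y_lag)
    sxx = sum(p * p for p in x_lag)
    sxy = sum(p * q for p, q in zip(y_lag, x_lag))
    sy = sum(p * q for p, q in zip(y_lag, target))
    sx = sum(p * q for p, q in zip(x_lag, target))
    det = syy * sxx - sxy * sxy
    a = (sy * sxx - sx * sxy) / det
    b = (sx * syy - sy * sxy) / det
    return sum((q - a * p - b * r) ** 2 for p, r, q in zip(y_lag, x_lag, target))


# Simulated series in which x drives y with a one-step delay (illustrative only).
random.seed(0)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 0.1))

# Fraction of residual error removed by adding the other series' past.
x_to_y = 1 - lag1_rss(y, x) / lag1_rss(y)
y_to_x = 1 - lag1_rss(x, y) / lag1_rss(x)
print(x_to_y, y_to_x)
```

In this simulation, lagged x removes most of the residual error in predicting y, while lagged y adds almost nothing to predicting x, which suggests a directed edge from x to y. With thousands of genes, regularization such as the elastic net becomes necessary because the number of candidate predictors exceeds the number of time points.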
ACE inhibition and cardiometabolic risk factors, lung ACE2 and TMPRSS2 gene expression, and plasma ACE2 levels: a Mendelian randomization study.
Royal Society open science
2020; 7 (11): 200958
Angiotensin-converting enzyme 2 (ACE2) and serine protease TMPRSS2 have been implicated in cell entry for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for coronavirus disease 2019 (COVID-19). The expression of ACE2 and TMPRSS2 in the lung epithelium might have implications for the risk of SARS-CoV-2 infection and severity of COVID-19. We use human genetic variants that proxy angiotensin-converting enzyme (ACE) inhibitor drug effects and cardiovascular risk factors to investigate whether these exposures affect lung ACE2 and TMPRSS2 gene expression and circulating ACE2 levels. We observed no consistent evidence of an association of genetically predicted serum ACE levels with any of our outcomes. There was weak evidence for an association of genetically predicted serum ACE levels with ACE2 gene expression in the Lung eQTL Consortium (p = 0.014), but this finding did not replicate. There was evidence of a positive association of genetic liability to type 2 diabetes mellitus with lung ACE2 gene expression in the Genotype-Tissue Expression (GTEx) study (p = 4 × 10⁻⁴) and with circulating plasma ACE2 levels in the INTERVAL study (p = 0.03), but not with lung ACE2 expression in the Lung eQTL Consortium study (p = 0.68). There were no associations of genetically proxied liability to the other cardiometabolic traits with any outcome. This study does not provide consistent evidence to support an effect of serum ACE levels (as a proxy for ACE inhibitors) or cardiometabolic risk factors on lung ACE2 and TMPRSS2 expression or plasma ACE2 levels.
View details for DOI 10.1098/rsos.200958
View details for PubMedID 33391794
View details for PubMedCentralID PMC7735342
The GTEx Consortium atlas of genetic regulatory effects across human tissues
Science
2020; 369 (6509): 1318-+
A robust nonlinear low-dimensional manifold for single cell RNA-seq data.
2020; 21 (1): 324
Modern developments in single-cell sequencing technologies enable broad insights into cellular state. Single-cell RNA sequencing (scRNA-seq) can be used to explore cell types, states, and developmental trajectories to broaden our understanding of cellular heterogeneity in tissues and organs. Analysis of these sparse, high-dimensional experimental results requires dimension reduction. Several methods have been developed to estimate low-dimensional embeddings for filtered and normalized single-cell data. However, methods have yet to be developed for unfiltered and unnormalized count data that estimate uncertainty in the low-dimensional space. We present a nonlinear latent variable model with robust, heavy-tailed error and adaptive kernel learning to estimate low-dimensional nonlinear structure in scRNA-seq data. Gene expression in a single cell is modeled as a noisy draw from a Gaussian process in high dimensions from low-dimensional latent positions. This model is called the Gaussian process latent variable model (GPLVM). We model residual errors with a heavy-tailed Student's t-distribution to estimate a manifold that is robust to technical and biological noise found in normalized scRNA-seq data. We compare our approach to common dimension reduction tools across a diverse set of scRNA-seq data sets to highlight our model's ability to enable important downstream tasks such as clustering, inferring cell developmental trajectories, and visualizing high throughput experiments on available experimental data. We show that our adaptive robust statistical approach to estimate a nonlinear manifold is well suited for raw, unfiltered gene counts from high-throughput sequencing technologies for visualization, exploration, and uncertainty estimation of cell states.
View details for DOI 10.1186/s12859-020-03625-z
View details for PubMedID 32693778
View details for PubMedCentralID PMC7374962
Sparse multi-output Gaussian processes for online medical time series prediction.
BMC medical informatics and decision making
2020; 20 (1): 152
For real-time monitoring of hospital patients, high-quality inference of patients' health status using all information available from clinical covariates and lab test results is essential to enable successful medical interventions and improve patient outcomes. Developing a computational framework that can learn from observational large-scale electronic health records (EHRs) and make accurate real-time predictions is a critical step. In this work, we develop and explore a Bayesian nonparametric model based on multi-output Gaussian process (GP) regression for hospital patient monitoring. We propose MedGP, a statistical framework that incorporates 24 clinical covariates and supports a rich reference data set from which relationships between observed covariates may be inferred and exploited for high-quality inference of patient state over time. To do this, we develop a highly structured sparse GP kernel to enable tractable computation over tens of thousands of time points while estimating correlations among clinical covariates, patients, and periodicity in patient observations. MedGP has a number of benefits over current methods, including (i) not requiring an alignment of the time series data, (ii) quantifying confidence regions in the predictions, (iii) exploiting a vast and rich database of patients, and (iv) inferring interpretable relationships among clinical covariates. We evaluate and compare results from MedGP on the task of online prediction for three patient subgroups from two medical data sets across 8,043 patients. We find MedGP improves online prediction over baseline and state-of-the-art methods for nearly all covariates across different disease subgroups and hospitals. The MedGP framework is robust and efficient in estimating the temporal dependencies from sparse and irregularly sampled medical time series data for online prediction. The publicly available code is at https://github.com/bee-hive/MedGP.
View details for DOI 10.1186/s12911-020-1069-4
View details for PubMedID 32641134
View details for PubMedCentralID PMC7341595
The Human Tumor Atlas Network: Charting Tumor Transitions across Space and Time at Single-Cell Resolution.
2020; 181 (2): 236–49
Crucial transitions in cancer, including tumor initiation, local expansion, metastasis, and therapeutic resistance, involve complex interactions between cells within the dynamic tumor ecosystem. Transformative single-cell genomics technologies and spatial multiplex in situ methods now provide an opportunity to interrogate this complexity at unprecedented resolution. The Human Tumor Atlas Network (HTAN), part of the National Cancer Institute (NCI) Cancer Moonshot Initiative, will establish a clinical, experimental, computational, and organizational framework to generate informative and accessible three-dimensional atlases of cancer transitions for a diverse set of tumor types. This effort complements both ongoing efforts to map healthy organs and previous large-scale cancer genomics approaches focused on bulk sequencing at a single point in time. Generating single-cell, multiparametric, longitudinal atlases and integrating them with clinical outcomes should help identify novel predictive biomarkers and features as well as therapeutically relevant cell types, cell states, and cellular interactions across transitions. The resulting tumor atlases should have a profound impact on our understanding of cancer biology and have the potential to improve cancer detection, prevention, and therapeutic discovery for better precision-medicine treatments of cancer patients and those at risk for cancer.
View details for DOI 10.1016/j.cell.2020.03.053
View details for PubMedID 32302568
Measuring the predictability of life outcomes with a scientific mass collaboration.
Proceedings of the National Academy of Sciences of the United States of America
How predictable are life trajectories? We investigated this question with a scientific mass collaboration using the common task method; 160 teams built predictive models for six life outcomes using data from the Fragile Families and Child Wellbeing Study, a high-quality birth cohort study. Despite using a rich dataset and applying machine-learning methods optimized for prediction, the best predictions were not very accurate and were only slightly better than those from a simple benchmark model. Within each outcome, prediction error was strongly associated with the family being predicted and weakly associated with the technique used to generate the prediction. Overall, these results suggest practical limits to the predictability of life outcomes in some settings and illustrate the value of mass collaborations in the social sciences.
View details for DOI 10.1073/pnas.1915006117
View details for PubMedID 32229555
netNMF-sc: leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis.
2020; 30 (2): 195-204
Single-cell RNA-sequencing (scRNA-seq) enables high-throughput measurement of RNA expression in single cells. However, because of technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells in a lower-dimensional space, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc learns a low-dimensional representation of scRNA-seq transcript counts using network-regularized non-negative matrix factorization. The network regularization takes advantage of prior knowledge of gene-gene interactions, encouraging pairs of genes with known interactions to be nearby each other in the low-dimensional representation. The resulting matrix factorization imputes gene abundance for both zero and nonzero counts and can be used to cluster cells into meaningful subpopulations. We show that netNMF-sc outperforms existing methods at clustering cells and estimating gene-gene covariance using both simulated and real scRNA-seq data, with increasing advantages at higher dropout rates (e.g., >60%). We also show that the results from netNMF-sc are robust to variation in the input network, with more representative networks leading to greater performance gains.
View details for DOI 10.1101/gr.251603.119
View details for PubMedID 31992614
View details for PubMedCentralID PMC7050525
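For readers unfamiliar with network-regularized matrix factorization, a generic graph-regularized NMF objective is written out below. It is a standard textbook form and may differ in detail from the exact objective used by netNMF-sc; the symbols $X$, $W$, $H$, $L$, $A$, and $\lambda$ are notation assumed here.

```latex
% X: genes x cells count matrix, W: genes x d, H: d x cells,
% A: symmetric gene-gene adjacency matrix, D: its degree matrix, L = D - A
\min_{W \ge 0,\; H \ge 0} \;
  \lVert X - W H \rVert_F^2
  \;+\; \lambda \, \operatorname{Tr}\!\bigl(W^{\top} L W\bigr),
\qquad
\operatorname{Tr}\!\bigl(W^{\top} L W\bigr)
  = \tfrac{1}{2} \sum_{i,j} A_{ij} \, \lVert w_i - w_j \rVert_2^2 .
```

The penalty is small when genes $i$ and $j$ that are connected in the network ($A_{ij} > 0$) have similar rows $w_i$ and $w_j$ of $W$, which is how prior knowledge of gene-gene interactions pulls interacting genes together in the low-dimensional representation.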
The impact of sex on gene expression across human tissues.
Science (New York, N.Y.)
2020; 369 (6509)
Many complex human phenotypes exhibit sex-differentiated characteristics. However, the molecular mechanisms underlying these differences remain largely unknown. We generated a catalog of sex differences in gene expression and in the genetic regulation of gene expression across 44 human tissue sources surveyed by the Genotype-Tissue Expression project (GTEx, v8 release). We demonstrate that sex influences gene expression levels and cellular composition of tissue samples across the human body. A total of 37% of all genes exhibit sex-biased expression in at least one tissue. We identify cis expression quantitative trait loci (eQTLs) with sex-differentiated effects and characterize their cellular origin. By integrating sex-biased eQTLs with genome-wide association study data, we identify 58 gene-trait associations that are driven by genetic regulation of gene expression in a single sex. These findings provide an extensive characterization of sex differences in the human transcriptome and its genetic regulation.
View details for DOI 10.1126/science.aba3066
View details for PubMedID 32913072
An Optimal Policy for Patient Laboratory Tests in Intensive Care Units.
Pacific Symposium on Biocomputing
2019; 24: 320-331
Laboratory testing is an integral tool in the management of patient care in hospitals, particularly in intensive care units (ICUs). There exists an inherent trade-off in the selection and timing of lab tests between considerations of the expected utility in clinical decision-making of a given test at a specific time, and the associated cost or risk it poses to the patient. In this work, we introduce a framework that learns policies for ordering lab tests which optimizes for this trade-off. Our approach uses batch off-policy reinforcement learning with a composite reward function based on clinical imperatives, applied to data that include examples of clinicians ordering labs for patients. To this end, we develop and extend principles of Pareto optimality to improve the selection of actions based on multiple reward function components while respecting typical procedural considerations and prioritization of clinical goals in the ICU. Our experiments show that we can estimate a policy that reduces the frequency of lab tests and optimizes timing to minimize information redundancy. We also find that the estimated policies typically suggest ordering lab tests well ahead of critical onsets (such as mechanical ventilation or dialysis) that depend on the lab results. We evaluate our approach by quantifying how these policies may initiate earlier onset of treatment.
View details for PubMedID 30864333
View details for PubMedCentralID PMC6417830
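The notion of Pareto optimality over multiple reward components, referred to in the abstract above, can be sketched as follows: an action is kept if no other action is at least as good on every component and strictly better on at least one. This is a generic illustration and not the paper's algorithm; the action names and reward values are hypothetical.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_optimal(actions):
    """Return the names of actions whose reward vectors are not dominated by any other action."""
    return [
        name
        for name, reward in actions.items()
        if not any(
            dominates(other, reward)
            for other_name, other in actions.items()
            if other_name != name
        )
    ]


# Hypothetical reward components per action: (information gain, negative cost).
actions = {
    "order_now": (0.9, -1.0),
    "order_later": (0.6, -0.5),
    "skip": (0.0, 0.0),
    "order_twice": (0.9, -2.0),
}
front = pareto_optimal(actions)
print(front)
```

Here `order_twice` is dropped because `order_now` gives the same information at lower cost; the remaining three actions each represent a different trade-off, and choosing among them requires an additional prioritization of the reward components.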
Statistical tests for detecting variance effects in quantitative trait studies.
Bioinformatics (Oxford, England)
2019; 35 (2): 200-210
Identifying variants, both discrete and continuous, that are associated with quantitative traits, or QTs, is the primary focus of quantitative genetics. Most current methods are limited to identifying mean effects, or associations between genotype or covariates and the mean value of a quantitative trait. It is possible, however, that a variant may affect the variance of the quantitative trait in lieu of, or in addition to, affecting the trait mean. Here, we develop a general methodology to identify covariates with variance effects on a quantitative trait using a Bayesian heteroskedastic linear regression model (BTH). We compare BTH with existing methods to detect variance effects across a large range of simulations drawn from scenarios common to the analysis of quantitative traits. We find that BTH and a double generalized linear model (dglm) outperform classical tests used for detecting variance effects in recent genomic studies. We show BTH and dglm are less likely to generate spurious discoveries through simulations and application to identifying methylation variance QTs and expression variance QTs. We identify four variance effects of sex in the Cardiovascular and Pharmacogenetics study. Our work is the first to offer a comprehensive view of variance-identifying methodology. We identify shortcomings in previously used methodology and provide a more conservative and robust alternative. We extend variance effect analysis to a wide array of covariates that enables a new statistical dimension in the study of sex- and age-specific quantitative trait effects. Code is available at https://github.com/b2du/bth. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/bty565
View details for PubMedID 29982387
View details for PubMedCentralID PMC6330007
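The distinction between a mean effect and a variance effect in the abstract above can be written as a simple heteroskedastic linear model. The log-linear variance form below is one common parameterization (similar in spirit to a double generalized linear model) and is not necessarily the parameterization used by BTH; the symbols $\beta_0$, $\beta$, $\gamma$, and $\sigma^2$ are assumed here for illustration.

```latex
% x_i: covariate (e.g., genotype or sex), y_i: quantitative trait
y_i = \beta_0 + \beta x_i + \varepsilon_i,
\qquad
\varepsilon_i \sim \mathcal{N}\!\bigl(0,\; \sigma^2 \exp(\gamma x_i)\bigr).
```

In this sketch, $\beta \neq 0$ corresponds to a mean effect of the covariate and $\gamma \neq 0$ to a variance effect; setting $\gamma = 0$ recovers ordinary homoskedastic linear regression.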
Fast Moment Estimation for Generalized Latent Dirichlet Models.
Journal of the American Statistical Association
2018; 113 (524): 1528-1540
We develop a generalized method of moments (GMM) approach for fast parameter estimation in a new class of Dirichlet latent variable models with mixed data types. Parameter estimation via GMM has computational and statistical advantages over alternative methods, such as expectation maximization, variational inference, and Markov chain Monte Carlo. A key computational advantage of our method, Moment Estimation for latent Dirichlet models (MELD), is that parameter estimation does not require instantiation of the latent variables. Moreover, performance is agnostic to distributional assumptions of the observations. We derive population moment conditions after marginalizing out the sample-specific Dirichlet latent variables. The moment conditions only depend on component mean parameters. We illustrate the utility of our approach on simulated data, comparing results from MELD to alternative methods, and we show the promise of our approach through the application to several datasets. Supplementary materials for this article are available online.
View details for DOI 10.1080/01621459.2017.1341839
View details for PubMedID 35875263
View details for PubMedCentralID PMC9302535
Glucocorticoid receptor recruits to enhancers and drives activation by motif-directed binding.
2018; 28 (9): 1272-1284
Glucocorticoids are potent steroid hormones that regulate immunity and metabolism by activating the transcription factor (TF) activity of glucocorticoid receptor (GR). Previous models have proposed that DNA binding motifs and sites of chromatin accessibility predetermine GR binding and activity. However, there are vast excesses of both features relative to the number of GR binding sites. Thus, these features alone are unlikely to account for the specificity of GR binding and activity. To identify genomic and epigenetic contributions to GR binding specificity and the downstream changes resultant from GR binding, we performed hundreds of genome-wide measurements of TF binding, epigenetic state, and gene expression across a 12-h time course of glucocorticoid exposure. We found that glucocorticoid treatment induces GR to bind to nearly all pre-established enhancers within minutes. However, GR binds to only a small fraction of the set of accessible sites that lack enhancer marks. Once GR is bound to enhancers, a combination of enhancer motif composition and interactions between enhancers then determines the strength and persistence of GR binding, which consequently correlates with dramatic shifts in enhancer activation. Over the course of several hours, highly coordinated changes in TF binding and histone modification occupancy occur specifically within enhancers, and these changes correlate with changes in the expression of nearby genes. Following GR binding, changes in the binding of other TFs precede changes in chromatin accessibility, suggesting that other TFs are also sensitive to genomic features beyond that of accessibility.
View details for DOI 10.1101/gr.233346.117
View details for PubMedID 30097539
View details for PubMedCentralID PMC6120625
Bayesian nonparametric discovery of isoforms and individual specific quantification.
2018; 9 (1): 1681
Most human protein-coding genes can be transcribed into multiple distinct mRNA isoforms. These alternative splicing patterns encourage molecular diversity, and dysregulation of isoform expression plays an important role in disease etiology. However, isoforms are difficult to characterize from short-read RNA-seq data because they share identical subsequences and occur in different frequencies across tissues and samples. Here, we develop BIISQ, a Bayesian nonparametric model for isoform discovery and individual specific quantification from short-read RNA-seq data. BIISQ does not require isoform reference sequences but instead estimates an isoform catalog shared across samples. We use stochastic variational inference for efficient posterior estimates and demonstrate superior precision and recall for simulations compared to state-of-the-art isoform reconstruction methods. BIISQ shows the most gains for low abundance isoforms, with 36% more isoforms correctly inferred at low coverage versus a multi-sample method and 170% more versus single-sample methods. We estimate isoforms in the GEUVADIS RNA-seq data and validate inferred isoforms by associating genetic variants with isoform ratios.
View details for DOI 10.1038/s41467-018-03402-w
View details for PubMedID 29703885
View details for PubMedCentralID PMC5923247
Clustering gene expression time series data using an infinite Gaussian process mixture model.
PLoS computational biology
2018; 14 (1): e1005896
Transcriptome-wide time series expression profiling is used to characterize the cellular response to environmental perturbations. The first step to analyzing transcriptional response data is often to cluster genes with similar responses. Here, we present a nonparametric model-based method, Dirichlet process Gaussian process mixture model (DPGP), which jointly models data clusters with a Dirichlet process and temporal dependencies with Gaussian processes. We demonstrate the accuracy of DPGP in comparison to state-of-the-art approaches using hundreds of simulated data sets. To further test our method, we apply DPGP to published microarray data from a microbial model organism exposed to stress and to novel RNA-seq data from a human cell line exposed to the glucocorticoid dexamethasone. We validate our clusters by examining local transcription factor binding and histone modifications. Our results demonstrate that jointly modeling cluster number and temporal dependencies can reveal shared regulatory mechanisms. DPGP software is freely available online at https://github.com/PrincetonUniversity/DP_GP_cluster.
View details for DOI 10.1371/journal.pcbi.1005896
View details for PubMedID 29337990
View details for PubMedCentralID PMC5786324
Co-expression networks reveal the tissue-specific regulation of transcription and splicing.
2017; 27 (11): 1843-1858
Gene co-expression networks capture biologically important patterns in gene expression data, enabling functional analyses of genes, discovery of biomarkers, and interpretation of genetic variants. Most network analyses to date have been limited to assessing correlation between total gene expression levels in a single tissue or small sets of tissues. Here, we built networks that additionally capture the regulation of relative isoform abundance and splicing, along with tissue-specific connections unique to each of a diverse set of tissues. We used the Genotype-Tissue Expression (GTEx) project v6 RNA sequencing data across 50 tissues and 449 individuals. First, we developed a framework called Transcriptome-Wide Networks (TWNs) for combining total expression and relative isoform levels into a single sparse network, capturing the interplay between the regulation of splicing and transcription. We built TWNs for 16 tissues and found that hubs in these networks were strongly enriched for splicing and RNA binding genes, demonstrating their utility in unraveling regulation of splicing in the human transcriptome. Next, we used a Bayesian biclustering model that identifies network edges unique to a single tissue to reconstruct Tissue-Specific Networks (TSNs) for 26 distinct tissues and 10 groups of related tissues. Finally, we found genetic variants associated with pairs of adjacent nodes in our networks, supporting the estimated network structures and identifying 20 genetic variants with distant regulatory impact on transcription and splicing. Our networks provide an improved understanding of the complex relationships of the human transcriptome across tissues.
View details for DOI 10.1101/gr.216721.116
View details for PubMedID 29021288
View details for PubMedCentralID PMC5668942
Expandable factor analysis.
2017; 104 (3): 649-663
Bayesian sparse factor models have proven useful for characterizing dependence in multivariate data, but scaling computation to large numbers of samples and dimensions is problematic. We propose expandable factor analysis for scalable inference in factor models when the number of factors is unknown. The method relies on a continuous shrinkage prior for efficient maximum a posteriori estimation of a low-rank and sparse loadings matrix. The structure of the prior leads to an estimation algorithm that accommodates uncertainty in the number of factors. We propose an information criterion to select the hyperparameters of the prior. Expandable factor analysis has better false discovery rates and true positive rates than its competitors across diverse simulation settings. We apply the proposed approach to a gene expression study of ageing in mice, demonstrating superior results relative to four competing methods.
View details for DOI 10.1093/biomet/asx030
View details for PubMedID 29430037
View details for PubMedCentralID PMC5793687
Detecting differential growth of microbial populations with Gaussian process regression.
2017; 27 (2): 320-333
Microbial growth curves are used to study differential effects of media, genetics, and stress on microbial population growth. Consequently, many modeling frameworks exist to capture microbial population growth measurements. However, current models are designed to quantify growth under conditions for which growth has a specific functional form. Extensions to these models are required to quantify the effects of perturbations, which often exhibit nonstandard growth curves. Rather than assume specific functional forms for experimental perturbations, we developed a general and robust model of microbial population growth curves using Gaussian process (GP) regression. GP regression modeling of high-resolution time-series growth data enables accurate quantification of population growth and allows explicit control of effects from other covariates such as genetic background. This framework substantially outperforms commonly used microbial population growth models, particularly when modeling growth data from environmentally stressed populations. We apply the GP growth model and develop statistical tests to quantify the differential effects of environmental perturbations on microbial growth across a large compendium of genotypes in archaea and yeast. This method accurately identifies known transcriptional regulators and implicates novel regulators of growth under standard and stress conditions in the model archaeal organism Halobacterium salinarum. For yeast, our method correctly identifies known phenotypes for a diversity of genetic backgrounds under cycloheximide stress and also detects previously unidentified oxidative stress sensitivity across a subset of strains. Together, these results demonstrate that the GP models are interpretable, recapitulating biological knowledge of growth response while providing new insights into the relevant parameters affecting microbial population growth.
View details for DOI 10.1101/gr.210286.116
View details for PubMedID 27864351
View details for PubMedCentralID PMC5287237
Genetic effects on gene expression across human tissues.
2017; 550 (7675): 204–13
Characterization of the molecular function of the human genome and its variation across individuals is essential for identifying the cellular mechanisms that underlie human genetic traits and diseases. The Genotype-Tissue Expression (GTEx) project aims to characterize variation in gene expression levels across individuals and diverse tissues of the human body, many of which are not easily accessible. Here we describe genetic effects on gene expression levels across 44 human tissues. We find that local genetic variation affects gene expression levels for the majority of genes, and we further identify inter-chromosomal genetic effects for 93 genes and 112 loci. On the basis of the identified genetic effects, we characterize patterns of tissue specificity, compare local and distal effects, and evaluate the functional properties of the genetic effects. We also demonstrate that multi-tissue, multi-individual data can be used to identify genes and pathways affected by human disease-associated variation, enabling a mechanistic interpretation of gene regulation and the genetic basis of disease.
View details for PubMedID 29022597
Context Specific and Differential Gene Co-expression Networks via Bayesian Biclustering.
PLoS computational biology
2016; 12 (7): e1004791
Identifying latent structure in high-dimensional genomic data is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-regulated genes that covary in all of the samples or in only a subset of the samples. Our biclustering method, BicMix, allows overcomplete representations of the data, computational tractability, and joint modeling of unknown confounders and biological signals. Compared with state-of-the-art biclustering methods, BicMix recovers latent structure with higher precision across diverse simulation scenarios. Further, we develop a principled method to recover context-specific gene co-expression networks from the estimated sparse biclustering matrices. We apply BicMix to breast cancer gene expression data and to gene expression data from a cardiovascular study cohort, and we recover gene co-expression networks that are differential across ER+ and ER- samples and across male and female samples. We apply BicMix to the Genotype-Tissue Expression (GTEx) pilot data, and we find tissue-specific gene networks. We validate these findings by using our tissue-specific networks to identify trans-eQTLs specific to one of four primary tissues.
View details for DOI 10.1371/journal.pcbi.1004791
View details for PubMedID 27467526
View details for PubMedCentralID PMC4965098
Meta-analysis of Genome-Wide Association Studies for Extraversion: Findings from the Genetics of Personality Consortium.
2016; 46 (2): 170-82
Extraversion is a relatively stable and heritable personality trait associated with numerous psychosocial, lifestyle and health outcomes. Despite its substantial heritability, no genetic variants have been detected in previous genome-wide association (GWA) studies, which may be due to relatively small sample sizes of those studies. Here, we report on a large meta-analysis of GWA studies for extraversion in 63,030 subjects in 29 cohorts. Extraversion item data from multiple personality inventories were harmonized across inventories and cohorts. No genome-wide significant associations were found at the single nucleotide polymorphism (SNP) level but there was one significant hit at the gene level for a long non-coding RNA site (LOC101928162). Genome-wide complex trait analysis in two large cohorts showed that the additive variance explained by common SNPs was not significantly different from zero, but polygenic risk scores, weighted using linkage information, significantly predicted extraversion scores in an independent cohort. These results show that extraversion is a highly polygenic personality trait, with an architecture possibly different from other complex human traits, including other personality traits. Future studies are required to further determine which genetic variants, by what modes of gene action, constitute the heritable nature of extraversion.
View details for DOI 10.1007/s10519-015-9735-5
View details for PubMedID 26362575
View details for PubMedCentralID PMC4751159
Meta-analysis of Genome-wide Association Studies for Neuroticism, and the Polygenic Association With Major Depressive Disorder.
2015; 72 (7): 642-50
Neuroticism is a pervasive risk factor for psychiatric conditions. It genetically overlaps with major depressive disorder (MDD) and is therefore an important phenotype for psychiatric genetics. The Genetics of Personality Consortium has created a resource for genome-wide association analyses of personality traits in more than 63,000 participants (including MDD cases). The objectives were to identify genetic variants associated with neuroticism by performing a meta-analysis of genome-wide association results based on 1000 Genomes imputation; to evaluate whether common genetic variants as assessed by single-nucleotide polymorphisms (SNPs) explain variation in neuroticism by estimating SNP-based heritability; and to examine whether SNPs that predict neuroticism also predict MDD. The study was a genome-wide association meta-analysis of 30 cohorts with genome-wide genotype, personality, and MDD data from the Genetics of Personality Consortium. The study included 63,661 participants from 29 discovery cohorts and 9786 participants from a replication cohort. Participants came from Europe, the United States, or Australia. Analyses were conducted between 2012 and 2014. The main measures were neuroticism scores harmonized across all 29 discovery cohorts by item response theory analysis, and clinical MDD case-control status in 2 of the cohorts. A genome-wide significant SNP was found on 3p14 in MAGI1 (rs35855737; P = 9.26 × 10⁻⁹ in the discovery meta-analysis). This association was not replicated (P = .32), but the SNP was still genome-wide significant in the meta-analysis of all 30 cohorts (P = 2.38 × 10⁻⁸). Common genetic variants explain 15% of the variance in neuroticism. Polygenic scores based on the meta-analysis of neuroticism in 27 cohorts significantly predicted neuroticism (1.09 × 10⁻¹² < P < .05) and MDD (4.02 × 10⁻⁹ < P < .05) in the 2 other cohorts. This study identifies a novel locus for neuroticism. The variant is located in a known gene that has been associated with bipolar disorder and schizophrenia in previous studies. In addition, the study shows that neuroticism is influenced by many genetic variants of small effect that are either common or tagged by common variants. These genetic variants also influence MDD. Future studies should confirm the role of the MAGI1 locus for neuroticism and further investigate the association of MAGI1 and the polygenic association to a range of other psychiatric disorders that are phenotypically correlated with neuroticism.
View details for DOI 10.1001/jamapsychiatry.2015.0554
View details for PubMedID 25993607
View details for PubMedCentralID PMC4667957
Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements.
2015; 16: 14
Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is critical to enable genome-wide analyses, but current approaches tackle average methylation within a locus and are often limited to specific genomic regions. We characterize genome-wide DNA methylation patterns, and show that correlation among CpG sites decays rapidly, making predictions solely based on neighboring sites challenging. We built a random forest classifier to predict methylation levels at CpG site resolution using features including neighboring CpG site methylation levels and genomic distance, co-localization with coding regions, CpG islands (CGIs), and regulatory elements from the ENCODE project. Our approach achieves 92% prediction accuracy of genome-wide methylation levels at single-CpG-site precision. The accuracy increases to 98% when restricted to CpG sites within CGIs and is robust across platform and cell-type heterogeneity. Our classifier outperforms other types of classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation, CGIs, co-localized DNase I hypersensitive sites, transcription factor binding sites, and histone modifications were found to be most predictive of methylation levels. Our observations of DNA methylation patterns led us to develop a classifier to predict DNA methylation levels at CpG site resolution with high accuracy. Furthermore, our method identified genomic features that interact with DNA methylation, suggesting mechanisms involved in DNA methylation modification and regulation, and linking diverse epigenetic processes.
View details for DOI 10.1186/s13059-015-0581-9
View details for PubMedID 25616342
View details for PubMedCentralID PMC4389802
Genetic variation associated with euphorigenic effects of d-amphetamine is associated with diminished risk for schizophrenia and attention deficit hyperactivity disorder.
Proceedings of the National Academy of Sciences of the United States of America
2014; 111 (16): 5968-73
Here, we extended our findings from a genome-wide association study of the euphoric response to d-amphetamine in healthy human volunteers by identifying enrichment between SNPs associated with response to d-amphetamine and SNPs associated with psychiatric disorders. We found that SNPs nominally associated (P ≤ 0.05 and P ≤ 0.01) with schizophrenia and attention deficit hyperactivity disorder were also nominally associated with d-amphetamine response. Furthermore, we found that the source of this enrichment was an excess of alleles that increased sensitivity to the euphoric effects of d-amphetamine and decreased susceptibility to schizophrenia and attention deficit hyperactivity disorder. In contrast, three negative control phenotypes (height, inflammatory bowel disease, and Parkinson disease) did not show this enrichment. Taken together, our results suggest that alleles identified using an acute challenge with a dopaminergic drug in healthy individuals can be used to identify alleles that confer risk for psychiatric disorders commonly treated with dopaminergic agonists and antagonists. More importantly, our results show the use of the enrichment approach as an alternative to stringent standards for genome-wide significance and suggest a relatively novel approach to the analysis of small cohorts in which intermediate phenotypes have been measured.
View details for DOI 10.1073/pnas.1318810111
View details for PubMedID 24711425
View details for PubMedCentralID PMC4000861
A statin-dependent QTL for GATM expression is associated with statin-induced myopathy.
2013; 502 (7471): 377-80
Statins are prescribed widely to lower plasma low-density lipoprotein (LDL) concentrations and cardiovascular disease risk and have been shown to have beneficial effects in a broad range of patients. However, statins are associated with an increased risk, albeit small, of clinical myopathy and type 2 diabetes. Despite evidence for substantial genetic influence on LDL concentrations, pharmacogenomic trials have failed to identify genetic variations with large effects on either statin efficacy or toxicity, and have produced little information regarding mechanisms that modulate statin response. Here we identify a downstream target of statin treatment by screening for the effects of in vitro statin exposure on genetic associations with gene expression levels in lymphoblastoid cell lines derived from 480 participants of a clinical trial of simvastatin treatment. This analysis identified six expression quantitative trait loci (eQTLs) that interacted with simvastatin exposure, including rs9806699, a cis-eQTL for the gene glycine amidinotransferase (GATM) that encodes the rate-limiting enzyme in creatine synthesis. We found this locus to be associated with incidence of statin-induced myotoxicity in two separate populations (meta-analysis odds ratio = 0.60). Furthermore, we found that GATM knockdown in hepatocyte-derived cell lines attenuated transcriptional response to sterol depletion, demonstrating that GATM may act as a functional link between statin-mediated lowering of cholesterol and susceptibility to statin-induced myopathy.
View details for DOI 10.1038/nature12508
View details for PubMedID 23995691
View details for PubMedCentralID PMC3933266
Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs.
2013; 9 (8): e1003649
Genetic variants in cis-regulatory elements or trans-acting regulators frequently influence the quantity and spatiotemporal distribution of gene transcription. Recent interest in expression quantitative trait locus (eQTL) mapping has paralleled the adoption of genome-wide association studies (GWAS) for the analysis of complex traits and disease in humans. Under the hypothesis that many GWAS associations tag non-coding SNPs with small effects, and that these SNPs exert phenotypic control by modifying gene expression, it has become common to interpret GWAS associations using eQTL data. To fully exploit the mechanistic interpretability of eQTL-GWAS comparisons, an improved understanding of the genetic architecture and causal mechanisms of cell type specificity of eQTLs is required. We address this need by performing an eQTL analysis in three parts: first we identified eQTLs from eleven studies on seven cell types; then we integrated eQTL data with cis-regulatory element (CRE) data from the ENCODE project; finally we built a set of classifiers to predict the cell type specificity of eQTLs. The cell type specificity of eQTLs is associated with eQTL SNP overlap with hundreds of cell type specific CRE classes, including enhancer, promoter, and repressive chromatin marks, regions of open chromatin, and many classes of DNA binding proteins. These associations provide insight into the molecular mechanisms generating the cell type specificity of eQTLs and the mode of regulation of corresponding eQTLs. Using a random forest classifier with cell specific CRE-SNP overlap as features, we demonstrate the feasibility of predicting the cell type specificity of eQTLs. We then demonstrate that CREs from a trait-associated cell type can be used to annotate GWAS associations in the absence of eQTL data for that cell type. We anticipate that such integrative, predictive modeling of cell specificity will improve our ability to understand the mechanistic basis of human complex phenotypic variation.
View details for DOI 10.1371/journal.pgen.1003649
View details for PubMedID 23935528
View details for PubMedCentralID PMC3731231
Stability selection for regression-based models of transcription factor-DNA binding specificity.
Bioinformatics (Oxford, England)
2013; 29 (13): i117-25
The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret. We propose novel regression-based models of TF-DNA binding specificity, trained using high resolution in vitro data from custom protein-binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max and Mad2) in their native genomic context. These high-throughput quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF-DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TFs with highly similar position weight matrices, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step toward better sequence-based models of individual TF-DNA binding specificity. Our code is available at http://genome.duke.edu/labs/gordan/ISMB2013. The PBM data used in this article are available in the Gene Expression Omnibus under accession number GSE47026.
View details for DOI 10.1093/bioinformatics/btt221
View details for PubMedID 23812975
View details for PubMedCentralID PMC3694650
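The independence assumption of a position weight matrix, which the abstract above identifies as a limitation, can be seen directly in how a site is scored: the score is a sum of per-position log-odds terms, so the penalty for two mismatches always equals the sum of the two single-mismatch penalties. The sketch below is a minimal illustration and not the regression models from the paper; the matrix values, the uniform background frequency, and the six-base consensus are hypothetical choices made for the example.

```python
import math

# Assumed uniform background frequency for each base.
BACKGROUND = 0.25


def make_column(preferred, p=0.85):
    """Build one hypothetical PWM column that favours a single base."""
    rest = (1.0 - p) / 3
    return {base: (p if base == preferred else rest) for base in "ACGT"}


# Toy six-position matrix built around an illustrative consensus site.
PWM = [make_column(base) for base in "CACGTG"]


def pwm_score(site, pwm=PWM):
    """Sum of per-position log2 odds; each base contributes independently."""
    if len(site) != len(pwm):
        raise ValueError("site length must match the number of PWM columns")
    return sum(
        math.log2(column[base] / BACKGROUND)
        for base, column in zip(site, pwm)
    )


consensus = pwm_score("CACGTG")
first_changed = pwm_score("AACGTG")
last_changed = pwm_score("CACGTT")
both_changed = pwm_score("AACGTT")

# Under a PWM the double-mismatch penalty is exactly the sum of the
# single-mismatch penalties, with no interaction between positions.
print(consensus - both_changed)
print((consensus - first_changed) + (consensus - last_changed))
```

Models that add di- and trinucleotide features, as the abstract describes, allow the combined penalty to differ from this sum.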
Genome-wide association study of d-amphetamine response in healthy volunteers identifies putative associations, including cadherin 13 (CDH13).
2012; 7 (8): e42646
Both the subjective response to d-amphetamine and the risk for amphetamine addiction are known to be heritable traits. Because subjective responses to drugs may predict drug addiction, identifying alleles that influence acute response may also provide insight into the genetic risk factors for drug abuse. We performed a Genome Wide Association Study (GWAS) for the subjective responses to amphetamine in 381 non-drug abusing healthy volunteers. Responses to amphetamine were measured using a double-blind, placebo-controlled, within-subjects design. We used sparse factor analysis to reduce the dimensionality of the data to ten factors. We identified several putative associations; the strongest was between a positive subjective drug-response factor and a SNP (rs3784943) in the 8th intron of cadherin 13 (CDH13; P = 4.58 × 10⁻⁸), a gene previously associated with a number of psychiatric traits including methamphetamine dependence. Additionally, we observed a putative association between a factor representing the degree of positive affect at baseline and a SNP (rs472402) in the 1st intron of steroid-5-alpha-reductase-α-polypeptide-1 (SRD5A1; P = 2.53 × 10⁻⁷), a gene whose protein product catalyzes the rate-limiting step in synthesis of the neurosteroid allopregnanolone. This SNP belongs to an LD-block that has been previously associated with the expression of SRD5A1 and differences in SRD5A1 enzymatic activity. The purpose of this study was to begin to explore the genetic basis of subjective responses to stimulant drugs using a GWAS approach in a modestly sized sample. Our approach provides a case study for analysis of high-dimensional intermediate pharmacogenomic phenotypes, which may be more tractable than clinical diagnoses.
View details for DOI 10.1371/journal.pone.0042646
View details for PubMedID 22952603
View details for PubMedCentralID PMC3429486
Genome-scale phylogenetic function annotation of large and diverse protein families.
2011; 21 (11): 1969-80
The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a revised approach (SIFTER version 2.0) that enables annotations on a genomic scale. SIFTER 2.0 produces equivalently precise predictions compared to the earlier version on a carefully studied family and on a collection of 100 protein families. We have added an approximation method to SIFTER 2.0 and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, previously inaccessible to the SIFTER framework because of the 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (where BLAST achieved 34.0%). Finally, we used SIFTER to annotate all of the Schizosaccharomyces pombe proteins with experimental functional characterizations, based on annotations from proteins in 46 fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, as compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these proteins. We use both precision-recall curves and ROC analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses. The code for SIFTER and protein family data are available at http://sifter.berkeley.edu.
View details for DOI 10.1101/gr.104687.109
View details for PubMedID 21784873
View details for PubMedCentralID PMC3205580
Understanding mechanisms underlying human gene expression variation with RNA sequencing.
2010; 464 (7289): 768-72
Understanding the genetic mechanisms underlying natural variation in gene expression is a central goal of both medical and evolutionary genetics, and studies of expression quantitative trait loci (eQTLs) have become an important tool for achieving this goal. Although all eQTL studies so far have assayed messenger RNA levels using expression microarrays, recent advances in RNA sequencing enable the analysis of transcript variation at unprecedented resolution. We sequenced RNA from 69 lymphoblastoid cell lines derived from unrelated Nigerian individuals that have been extensively genotyped by the International HapMap Project. By pooling data from all individuals, we generated a map of the transcriptional landscape of these cells, identifying extensive use of unannotated untranslated regions and more than 100 new putative protein-coding exons. Using the genotypes from the HapMap project, we identified more than a thousand genes at which genetic variation influences overall expression levels or splicing. We demonstrate that eQTLs near genes generally act by a mechanism involving allele-specific expression, and that variation that influences the inclusion of an exon is enriched within and near the consensus splice sites. Our results illustrate the power of high-throughput sequencing for the joint analysis of variation in transcription, splicing and allele-specific expression across individuals.
View details for DOI 10.1038/nature08872
View details for PubMedID 20220758
View details for PubMedCentralID PMC3089435
Protein molecular function prediction by Bayesian phylogenomics.
PLoS computational biology
2005; 1 (5): e45
We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5'-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.
View details for DOI 10.1371/journal.pcbi.0010045
View details for PubMedID 16217548
View details for PubMedCentralID PMC1246806