  • Pharmacogenetics at Scale: An Analysis of the UK Biobank. Clinical pharmacology and therapeutics McInnes, G., Lavertu, A., Sangkuhl, K., Klein, T. E., Whirl-Carrillo, M., Altman, R. B. 2020


    Pharmacogenetics (PGx) studies the influence of genetic variation on drug response. Clinically actionable associations inform guidelines created by the Clinical Pharmacogenetics Implementation Consortium (CPIC), but the broad impact of genetic variation on entire populations is not well-understood. We analyzed PGx allele and phenotype frequencies for 487,409 participants in the U.K. Biobank, the largest PGx study to date. For fourteen CPIC pharmacogenes known to influence human drug response, we find that 99.5% of individuals may have an atypical response to at least one drug; on average they may have an atypical response to 10.3 drugs. Nearly 24% of participants have been prescribed a drug for which they are predicted to have an atypical response. Non-European populations carry a greater frequency of variants that are predicted to be functionally deleterious; many of these are not captured by current PGx allele definitions. Strategies for detecting and interpreting rare variation will be critical for enabling broad application of pharmacogenetics.

  • Transfer learning enables prediction of CYP2D6 haplotype function. PLoS computational biology McInnes, G., Dalton, R., Sangkuhl, K., Whirl-Carrillo, M., Lee, S., Tsao, P. S., Gaedigk, A., Altman, R. B., Woodahl, E. L. 2020; 16 (11): e1008399


    Cytochrome P450 2D6 (CYP2D6) is a highly polymorphic gene whose protein product metabolizes more than 20% of clinically used drugs. Genetic variations in CYP2D6 are responsible for interindividual heterogeneity in drug response that can lead to drug toxicity and ineffective treatment, making CYP2D6 one of the most important pharmacogenes. Prediction of CYP2D6 phenotype relies on curation of literature-derived functional studies to assign a functional status to CYP2D6 haplotypes. As the number of large-scale sequencing efforts grows, new haplotypes continue to be discovered, and assignment of function is challenging to maintain. To address this challenge, we have trained a convolutional neural network to predict functional status of CYP2D6 haplotypes, called Hubble.2D6. Hubble.2D6 predicts haplotype function from sequence data and was trained using two pre-training steps with a combination of real and simulated data. We find that Hubble.2D6 predicts CYP2D6 haplotype functional status with 88% accuracy in a held-out test set and explains 47.5% of the variance in in vitro functional data among star alleles with unknown function. Hubble.2D6 may be a useful tool for assigning function to haplotypes with uncurated function, and used for screening individuals who are at risk of being poor metabolizers.

  • Assessing Digital Phenotyping to Enhance Genetic Studies of Human Diseases. American journal of human genetics DeBoever, C., Tanigawa, Y., Aguirre, M., McInnes, G., Lavertu, A., Rivas, M. A. 2020


    Population-scale biobanks that combine genetic data and high-dimensional phenotyping for a large number of participants provide an exciting opportunity to perform genome-wide association studies (GWAS) to identify genetic variants associated with diverse quantitative traits and diseases. A major challenge for GWAS in population biobanks is ascertaining disease cases from heterogeneous data sources such as hospital records, digital questionnaire responses, or interviews. In this study, we use genetic parameters, including genetic correlation, to evaluate whether GWAS performed using cases in the UK Biobank ascertained from hospital records, questionnaire responses, and family history of disease implicate similar disease genetics across a range of effect sizes. We find that hospital record and questionnaire GWAS largely identify similar genetic effects for many complex phenotypes and that combining together both phenotyping methods improves power to detect genetic associations. We also show that family history GWAS using cases ascertained on family history of disease agrees with combined hospital record and questionnaire GWAS and that family history GWAS has better power to detect genetic associations for some phenotypes. Overall, this work demonstrates that digital phenotyping and unstructured phenotype data can be combined with structured data such as hospital records to identify cases for GWAS in biobanks and improve the ability of such studies to identify genetic associations.

  • Predicting venous thromboembolism risk from exomes in the Critical Assessment of Genome Interpretation (CAGI) challenges. Human mutation McInnes, G., Daneshjou, R., Katsonis, P., Lichtarge, O., Srinivasan, R. G., Rana, S., Radivojac, P., Mooney, S. D., Pagel, K. A., Stamboulian, M., Jiang, Y., Capriotti, E., Wang, Y., Bromberg, Y., Bovo, S., Savojardo, C., Martelli, P. L., Casadio, R., Pal, L. R., Moult, J., Brenner, S., Altman, R. 2019


    Genetics play a key role in venous thromboembolism (VTE) risk, however established risk factors in European populations do not translate to individuals of African descent due to differences in allele frequencies between populations. As part of the fifth iteration of the Critical Assessment of Genome Interpretation, participants were asked to predict VTE status in exome data from African American subjects. Participants were provided with 103 unlabeled exomes from patients treated with warfarin for non-VTE causes or VTE and asked to predict which disease each subject had been treated for. Given the lack of training data, many participants opted to use unsupervised machine learning methods, clustering the exomes by variation in genes known to be associated with VTE. The best performing method using only VTE related genes achieved an AUC of 0.65. Here we discuss the range of methods used in the prediction of VTE from sequence data and explore some of the difficulties of conducting a challenge with known confounders. Additionally, we show that an existing genetic risk score for VTE that was developed in European subjects works well in African Americans.

  • Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics. Bioinformatics (Oxford, England) McInnes, G., Tanigawa, Y., DeBoever, C., Lavertu, A., Olivieri, J. E., Aguirre, M., Rivas, M. A. 2018


    Summary: Large biobanks linking phenotype to genotype have led to an explosion of genetic association studies across a wide range of phenotypes. Sharing the knowledge generated by these resources with the scientific community remains a challenge due to patient privacy and the vast amount of data. Here we present Global Biobank Engine (GBE), a web-based tool that enables exploration of the relationship between genotype and phenotype in biobank cohorts, such as the UK Biobank. GBE supports browsing for results from genome-wide association studies, phenome-wide association studies, gene-based tests, and genetic correlation between phenotypes. We envision GBE as a platform that facilitates the dissemination of summary statistics from biobanks to the scientific and clinical communities. Availability and implementation: GBE currently hosts data from the UK Biobank and can be found freely available at

  • Pharmacogenomics and big genomic data: from lab to clinic and back again. Human molecular genetics Lavertu, A., McInnes, G., Daneshjou, R., Whirl-Carrillo, M., Klein, T. E., Altman, R. B. 2018; 27 (R1): R72–R78


    The field of pharmacogenomics is an area of great potential for near-term human health impacts from the big genomic data revolution. Pharmacogenomics research momentum is building with numerous hypotheses currently being investigated through the integration of molecular profiles of different cell lines and large genomic data sets containing information on cellular and human responses to therapies. Additionally, the results of previous pharmacogenetic research efforts have been formulated into clinical guidelines that are beginning to impact how healthcare is conducted on the level of the individual patient. This trend will only continue with the recent release of new datasets containing linked genotype and electronic medical record data. This review discusses key resources available for pharmacogenomics and pharmacogenetics research and highlights recent work within the field.

  • Medical relevance of protein-truncating variants across 337,205 individuals in the UK Biobank study. Nature communications DeBoever, C., Tanigawa, Y., Lindholm, M. E., McInnes, G., Lavertu, A., Ingelsson, E., Chang, C., Ashley, E. A., Bustamante, C. D., Daly, M. J., Rivas, M. A. 2018; 9: 1612


    Protein-truncating variants can have profound effects on gene function and are critical for clinical genome interpretation and generating therapeutic hypotheses, but their relevance to medical phenotypes has not been systematically assessed. Here, we characterize the effect of 18,228 protein-truncating variants across 135 phenotypes from the UK Biobank and find 27 associations between medical phenotypes and protein-truncating variants in genes outside the major histocompatibility complex. We perform phenome-wide analyses and directly measure the effect in homozygous carriers, commonly referred to as "human knockouts," across medical phenotypes for genes implicated as being protective against disease or associated with at least one phenotype in our study. We find several genes with strong pleiotropic or non-additive effects. Our results illustrate the importance of protein-truncating variants in a variety of diseases.

  • Cloud-based Interactive Analytics for Terabytes of Genomic Variants Data. Bioinformatics Pan, C., McInnes, G., Deflaux, N., Snyder, M. P., Bingham, J., Datta, S., Tsao, P. S. 2017: 3709–15


    Large scale genomic sequencing is now widely used to decipher questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. With the quantity and diversity these data harbor, a robust and scalable data handling and analysis solution is desired. We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate such Big Data computing paradigms can provide orders of magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information. Our analysis framework is implemented in Google Cloud Platform and BigQuery. Codes are available at or data are available at Bioinformatics online.

