Russ B. Altman
Kenneth Fong Professor and Professor of Bioengineering, of Genetics, of Medicine, of Biomedical Data Science, Senior Fellow at the Stanford Institute for HAI and Professor, by courtesy, of Computer Science
Web page: https://rbaltman.people.stanford.edu
Bio
Russ Biagio Altman is the Kenneth Fong Professor of Bioengineering, Genetics, Medicine, Biomedical Data Science and (by courtesy) Computer Science, and past chairman of the Bioengineering Department at Stanford University. His primary research interests are in the application of computing (AI, data science and informatics) to problems relevant to medicine. He is particularly interested in methods for understanding drug action at molecular, cellular, organism and population levels. His lab studies how human genetic variation impacts drug response (e.g., http://www.pharmgkb.org/). Other work focuses on the analysis of biological molecules to understand the actions, interactions and adverse events of drugs (e.g., http://helix.stanford.edu/). He helps lead an FDA-supported Center of Excellence in Regulatory Science & Innovation.
Dr. Altman holds an AB from Harvard College, an MD from Stanford Medical School, and a PhD in Medical Information Sciences from Stanford. He received the U.S. Presidential Early Career Award for Scientists and Engineers and a National Science Foundation CAREER Award. He is a fellow of the American College of Physicians (ACP), the American College of Medical Informatics (ACMI), the American Institute of Medical and Biological Engineering (AIMBE), and the American Association for the Advancement of Science (AAAS). He is a member of the National Academy of Medicine. He is a past-president, founding board member, and fellow of the International Society for Computational Biology (ISCB), and a past-president of the American Society for Clinical Pharmacology & Therapeutics (ASCPT). He has chaired the Science Board advising the FDA commissioner, served on the NIH Director’s Advisory Committee, and co-chaired the IOM Drug Forum. He is an organizer of the annual Pacific Symposium on Biocomputing, and a founder of Personalis (NASDAQ: PSNL). Dr. Altman is board certified in Internal Medicine and in Clinical Informatics. He received the Stanford Medical School graduate teaching award in 2000 and 2020, and the mentorship award in 2014. He is the founding editor of the Annual Review of Biomedical Data Science, and hosts a podcast entitled “The Future of Everything.”
Academic Appointments
-
Professor, Bioengineering
-
Professor, Genetics
-
Professor, Medicine - Biomedical Informatics Research
-
Professor, Department of Biomedical Data Science
-
Senior Fellow, Institute for Human-Centered Artificial Intelligence (HAI)
-
Professor (By courtesy), Computer Science
-
Member, Bio-X
-
Member, Cardiovascular Institute
-
Member, Stanford Cancer Institute
-
Member, Wu Tsai Neurosciences Institute
Administrative Appointments
-
Faculty Director, SPADA (Stanford Predictives & Diagnostics Accelerator) (2016 - Present)
-
Faculty Director, 100 Year Study of Artificial Intelligence (2015 - Present)
-
Associate Director, Human-Centered Artificial Intelligence Institute (2018 - Present)
-
Member, Biomedical Library and Informatics Research Committee Study Section (NIH) (2002 - 2005)
-
President, International Society for Computational Biology (2000 - 2001)
-
President, American Society for Clinical Pharmacology and Therapeutics (2013 - 2014)
-
Director, Biomedical Informatics Training Program (2000 - 2018)
-
Chairman, Department of Bioengineering (2007 - 2012)
-
Chair, FDA Science Board (2013 - 2014)
-
Member, Advisory Committee to the Director (ACD), NIH (2013 - 2016)
Honors & Awards
-
Teaching Honor Roll, Tau Beta Pi (2020)
-
Excellence in Graduate Teaching Award, Stanford Biosciences (2020)
-
Fellow, American Association for the Advancement of Science (2014)
-
Stanford Medical School Mentorship Award, Stanford Medical School (2014)
-
Fellow, International Society for Computational Biology (2010)
-
Member, Institute of Medicine of the National Academies (2009)
-
Fellow, American Institute for Medical and Biological Engineering (2007)
-
Award for Excellence in Graduate Teaching, Stanford Medical School (2000)
-
Fellow, American College of Medical Informatics (1998)
-
Fellow, American College of Physicians (1998)
-
U.S. Presidential Early Career Award for Scientists & Engineers, NIH (1997)
-
Post-Doctoral Fellowship, Howard Hughes Medical Institute (1991)
Boards, Advisory Committees, Professional Organizations
-
Global Health Faculty Fellow, Center for Innovation in Global Health (CIGH) (2024 - Present)
-
Co-Founder, Personalis.com (2013 - Present)
-
Editor-in-Chief, Annual Review of Biomedical Data Science (2016 - Present)
-
Board of Directors, YouScript.com (2018 - Present)
-
Advisor, Vanderbilt University Medical School (2014 - Present)
-
Advisor, NIH Advisory Committee to the Director (ACD) (2013 - 2017)
-
Member, FDA Commissioner Science Board (2011 - 2014)
-
Co-Organizer, Pacific Symposium on Biocomputing (psb.stanford.edu) (1995 - Present)
Program Affiliations
-
Symbolic Systems Program
Professional Education
-
AB (summa cum laude), Harvard College, Biochemistry and Molecular Biology (1983)
-
PhD, Stanford University, Medical Information Sciences (1989)
-
MD, Stanford University, Medicine (1990)
Community and International Work
-
Host, "The Future of Everything with Russ Altman" Podcast, https://engineering.stanford.edu/magazine/collection/future-everything
Topic
Science & Technology Interviews
Partnering Organization(s)
Stanford Engineering
Populations Served
General Public
Location
International
Ongoing Project
Yes
Opportunities for Student Involvement
No
-
Physician, Pharmacogenomics Consult Service, Stanford Clinics, CA
Topic
Providing genetic advice about drug response
Partnering Organization(s)
Stanford Health Care
Populations Served
Local population
Location
International
Ongoing Project
Yes
Opportunities for Student Involvement
No
Patents
-
Nicholas Tatonetti, Russ B. Altman, Guy Haskin Fernald. "United States Patent 9305267 Signal detection algorithms to identify drug effects and drug interactions", The Board of Trustees of the Leland Stanford Junior University, Apr 5, 2016
-
Kathleen A. Thompson, Russ B. Altman, Oliver M. Duschka. "United States Patent 6178416 Method and apparatus for knowledgebase searching", Jan 23, 2001
-
Ramon M. Felciano, Russ B. Altman. "United States Patent 6052730 Method for monitoring and/or modifying web browsing sessions", The Board of Trustees of the Leland Stanford Junior University, Apr 18, 2000
Current Research and Scholarly Interests
I am interested in the application of computational technologies to problems in molecular biology of relevance to medicine. In particular, my laboratory focuses on drug response at the molecular level, working in three areas. First, we are building a comprehensive pharmacogenomics knowledge base (http://www.pharmgkb.org/) that provides access to information relating genotype to phenotype (in particular, how variation in genetics leads to variation in response to drugs). We are interested in collaboratively discovering and applying new pharmacogenomics knowledge. Second, we are interested in the analysis of three-dimensional biological structures. We have methods for analyzing protein structures to recognize and annotate active sites and binding sites, particularly in the context of interactions with small-molecule drugs. We are also interested in physics-based simulation of biological structures to understand how their dynamics impact their function (http://simbios.stanford.edu/). Finally, we are interested in computational methods for analyzing functional genomics information. We use natural language processing techniques for extracting and summarizing information in the literature, chemoinformatics methods for understanding small-molecule function, and machine learning & data mining techniques to understand the molecular responses to drugs.
2024-25 Courses
- Ethics in Bioengineering
BIOE 131, ETHICSOC 131X (Spr)
- Introduction to Biomedical Informatics Research Methodology
BIOE 212, BIOMEDIN 212, CS 272, GENE 212 (Spr)
- Principles of Pharmacogenomics
BIOMEDIN 224, GENE 224 (Aut, Spr)
- Representations and Algorithms for Computational Molecular Biology
BIOE 214, BIOMEDIN 214, CS 274, GENE 214 (Aut)
- Representations and Algorithms for Molecular Biology: Lectures
BIOMEDIN 216 (Aut)
Independent Studies (33)
- Advanced Reading and Research
CS 499 (Aut, Win, Spr, Sum)
- Advanced Reading and Research
CS 499P (Aut, Win, Spr, Sum)
- Bioengineering Problems and Experimental Investigation
BIOE 191 (Aut, Win, Spr, Sum)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390A (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390B (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390C (Aut, Win, Spr, Sum)
- Curricular Practical Training and Internship
GENE 290 (Aut, Win, Spr, Sum)
- Directed Investigation
BIOE 392 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Reading in Biophysics
BIOPHYS 399 (Aut, Win, Spr, Sum)
- Directed Reading in Genetics
GENE 299 (Aut, Win, Spr, Sum)
- Directed Study
BIOE 391 (Aut, Win, Spr, Sum)
- Graduate Research
BIOPHYS 300 (Aut, Win, Spr, Sum)
- Graduate Research
GENE 399 (Aut, Win, Spr, Sum)
- Graduate Research
IMMUNOL 399 (Aut, Win, Spr, Sum)
- Independent Project
CS 399 (Aut, Win, Spr, Sum)
- Independent Project
CS 399P (Aut, Win, Spr, Sum)
- Independent Work
CS 199 (Aut, Win, Spr, Sum)
- Independent Work
CS 199P (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
GENE 370 (Aut, Win, Spr, Sum)
- Part-time Curricular Practical Training
CS 390D (Aut, Win, Spr, Sum)
- Practical Training
BIOE 299B (Aut, Sum)
- Programming Service Project
CS 192 (Aut, Win, Spr, Sum)
- Research
PHYSICS 490 (Aut, Win, Spr, Sum)
- Senior Project
CS 191 (Aut, Win, Spr, Sum)
- Special Studies in Engineering
ENGR 199 (Aut, Win, Spr)
- Supervised Study
GENE 260 (Aut, Win, Spr, Sum)
- Supervised Undergraduate Research
CS 195 (Aut, Win, Spr, Sum)
- Undergraduate Research
GENE 199 (Aut, Win, Spr, Sum)
- Writing Intensive Senior Research Project
CS 191W (Aut, Win, Spr)
- Writing of Original Research for Engineers
ENGR 199W (Aut, Win, Spr, Sum)
Prior Year Courses
2023-24 Courses
- Ethics in Bioengineering
BIOE 131, ETHICSOC 131X (Spr)
- Introduction to Biomedical Data Science Research Methodology
BIOE 212, BIOMEDIN 212, CS 272, GENE 212 (Spr)
- Principles of Pharmacogenomics
BIOMEDIN 224, GENE 224 (Aut, Spr)
- Representations and Algorithms for Computational Molecular Biology
BIOE 214, BIOMEDIN 214, CS 274, GENE 214 (Aut)
- Representations and Algorithms for Molecular Biology: Lectures
BIOMEDIN 216 (Aut)
2022-23 Courses
- Ethics in Bioengineering
BIOE 131, ETHICSOC 131X (Spr)
- Introduction to Biomedical Data Science Research Methodology
BIOE 212, BIOMEDIN 212, CS 272, GENE 212 (Spr)
- Principles of Pharmacogenomics
BIOMEDIN 224, GENE 224 (Aut, Win, Spr, Sum)
- Representations and Algorithms for Computational Molecular Biology
BIOE 214, BIOMEDIN 214, CS 274, GENE 214 (Aut)
- Representations and Algorithms for Molecular Biology: Lectures
BIOMEDIN 216 (Aut)
2021-22 Courses
- Ethics in Bioengineering
BIOE 131, ETHICSOC 131X (Spr)
- Introduction to Biomedical Data Science Research Methodology
BIOE 212, BIOMEDIN 212, CS 272, GENE 212 (Spr)
- Principles of Pharmacogenomics
BIOMEDIN 224, GENE 224 (Aut, Win, Spr, Sum)
- Representations and Algorithms for Computational Molecular Biology
BIOE 214, BIOMEDIN 214, CS 274, GENE 214 (Aut)
- Representations and Algorithms for Molecular Biology: Lectures
BIOMEDIN 216 (Aut)
Stanford Advisees
-
Doctoral Dissertation Reader (AC)
Matthew Aguirre, Andy Chen, Rastko Ciric, Ibtihal Elfaki, Elliot Hershberg, Jessica Kain, Ziv Lautman, Trang Le, Samson Mataraso, Akshat Nigam, Courtney Smith
-
Postdoctoral Faculty Sponsor
Abdoul Jalil Djiberou Mahamadou, Artem Trotsyuk
-
Doctoral Dissertation Advisor (AC)
Stephanie Arteaga, Kristy Carpenter, Henry Cousins, Gowri Nayar, Issah Samori, Delaney Smith, Betty Xiong
-
Master's Program Advisor
Cathy Hou, Abhi Kumar, Nikhil Lyles, Eric Pan, Priyanka Shrestha, Neha Srivathsa, Serena Zhang
-
Doctoral Dissertation Co-Advisor (AC)
Jeonghyeon Kim
-
Undergraduate Major Advisor
John Wang
-
Doctoral (Program)
Yasa Baig, Erin Craig, Aviv Korman, Ashley Lewis, Kara Liu, Janella Schwab Lizarraga
Graduate and Fellowship Programs
-
Biomedical Data Science (Master's Program)
-
Biomedical Data Science (PhD Program)
All Publications
-
Heterogeneous network approaches to protein pathway prediction.
Computational and structural biotechnology journal
2024; 23: 2727-2739
Abstract
Understanding protein-protein interactions (PPIs) and the pathways they comprise is essential for comprehending cellular functions and their links to specific phenotypes. Despite the prevalence of molecular data generated by high-throughput sequencing technologies, a significant gap remains in translating this data into functional information regarding the series of interactions that underlie phenotypic differences. In this review, we present an in-depth analysis of heterogeneous network methodologies for modeling protein pathways, highlighting the critical role of integrating multifaceted biological data. It outlines the process of constructing these networks, from data representation to machine learning-driven predictions and evaluations. The work underscores the potential of heterogeneous networks in capturing the complexity of proteomic interactions, thereby offering enhanced accuracy in pathway prediction. This approach not only deepens our understanding of cellular processes but also opens up new possibilities in disease treatment and drug discovery by leveraging the predictive power of comprehensive proteomic data analysis.
View details for DOI 10.1016/j.csbj.2024.06.022
View details for PubMedID 39035835
View details for PubMedCentralID PMC11260399
-
Databases of ligand-binding pockets and protein-ligand interactions.
Computational and structural biotechnology journal
2024; 23: 1320-1338
Abstract
Many research groups and institutions have created a variety of databases curating experimental and predicted data related to protein-ligand binding. The landscape of available databases is dynamic, with new databases emerging and established databases becoming defunct. Here, we review the current state of databases that contain binding pockets and protein-ligand binding interactions. We have compiled a list of such databases, fifty-three of which are currently available for use. We discuss variation in how binding pockets are defined and summarize pocket-finding methods. We organize the fifty-three databases into subgroups based on goals and contents, and describe standard use cases. We also illustrate that pockets within the same protein are characterized differently across different databases. Finally, we assess critical issues of sustainability, accessibility and redundancy.
View details for DOI 10.1016/j.csbj.2024.03.015
View details for PubMedID 38585646
View details for PubMedCentralID PMC10997877
-
Heterogeneous network approaches to protein pathway prediction
COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL
2024; 23: 2727-2739
View details for DOI 10.1016/j.csbj.2024.06.022
View details for Web of Science ID 001262410400001
-
Which social media platforms facilitate monitoring the opioid crisis?
medRxiv : the preprint server for health sciences
2024
Abstract
Social media can provide real-time insight into trends in substance use, addiction, and recovery. Prior studies have used platforms such as Reddit and X (formerly Twitter), but evolving policies around data access have threatened these platforms' usability in research. We evaluate the potential of a broad set of platforms to detect emerging trends in the opioid epidemic. From these, we created a shortlist of 11 platforms, for which we documented official policies regulating drug-related discussion, data accessibility, geolocatability, and prior use in opioid-related studies. We quantified their volumes of opioid discussion, capturing informal language by including slang generated using a large language model. Beyond the most commonly used Reddit and X, the platforms with high potential for use in opioid-related surveillance are TikTok, YouTube, and Facebook. Leveraging many different social platforms, instead of a single platform, safeguards against sudden changes to data access and may better capture all populations that use opioids than any single platform.
View details for DOI 10.1101/2024.07.06.24310035
View details for PubMedID 39006412
-
Prospector Heads: Generalized Feature Attribution for Large Models & Data.
ArXiv
2024
Abstract
Feature attribution, the ability to localize regions of the input data that are relevant for classification, is an important capability for ML models in scientific and biomedical domains. Current methods for feature attribution, which rely on "explaining" the predictions of end-to-end classifiers, suffer from imprecise feature localization and are inadequate for use with small sample sizes and high-dimensional datasets due to computational challenges. We introduce prospector heads, an efficient and interpretable alternative to explanation-based attribution methods that can be applied to any encoder and any data modality. Prospector heads generalize across modalities through experiments on sequences (text), images (pathology), and graphs (protein structures), outperforming baseline attribution methods by up to 26.3 points in mean localization AUPRC. We also demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in input data. Through their high performance, flexibility, and generalizability, prospectors provide a framework for improving trust and transparency for ML models in complex domains.
View details for PubMedID 38947933
View details for PubMedCentralID PMC11213143
-
Elucidating the semantics-topology trade-off for knowledge inference-based pharmacological discovery.
Journal of biomedical semantics
2024; 15 (1): 5
Abstract
Leveraging AI for synthesizing the deluge of biomedical knowledge has great potential for pharmacological discovery with applications including developing new therapeutics for untreated diseases and repurposing drugs as emergent pandemic treatments. Creating knowledge graph representations of interacting drugs, diseases, genes, and proteins enables discovery via embedding-based ML approaches and link prediction. Previously, it has been shown that these predictive methods are susceptible to biases from network structure, namely that they are driven not by discovering nuanced biological understanding of mechanisms, but based on high-degree hub nodes. In this work, we study the confounding effect of network topology on biological relation semantics by creating an experimental pipeline of knowledge graph semantic and topological perturbations. We show that the drop in drug repurposing performance from ablating meaningful semantics increases by 21% and 38% when mitigating topological bias in two networks. We demonstrate that new methods for representing knowledge and inferring new knowledge must be developed for making use of biomedical semantics for pharmacological innovation, and we suggest fruitful avenues for their development.
View details for DOI 10.1186/s13326-024-00308-z
View details for PubMedID 38693563
-
Computational Approaches to Drug Repurposing: Methods, Challenges, and Opportunities.
Annual review of biomedical data science
2024
Abstract
Drug repurposing refers to the inference of therapeutic relationships between a clinical indication and existing compounds. As an emerging paradigm in drug development, drug repurposing enables more efficient treatment of rare diseases, stratified patient populations, and urgent threats to public health. However, prioritizing well-suited drug candidates from among a nearly infinite number of repurposing options continues to represent a significant challenge in drug development. Over the past decade, advances in genomic profiling, database curation, and machine learning techniques have enabled more accurate identification of drug repurposing candidates for subsequent clinical evaluation. This review outlines the major methodologic classes that these approaches comprise, which rely on (a) protein structure, (b) genomic signatures, (c) biological networks, and (d) real-world clinical data. We propose that realizing the full impact of drug repurposing methodologies requires a multidisciplinary understanding of each method's advantages and limitations with respect to clinical practice.
View details for DOI 10.1146/annurev-biodatasci-110123-025333
View details for PubMedID 38598857
-
Leveraging large-scale biobank EHRs to enhance pharmacogenetics of cardiometabolic disease medications.
medRxiv : the preprint server for health sciences
2024
Abstract
Electronic health records (EHRs) coupled with large-scale biobanks offer great promise to unravel the genetic underpinnings of treatment efficacy. However, medication-induced biomarker trajectories stemming from such records remain poorly studied. Here, we extract clinical and medication prescription data from EHRs and conduct GWAS and rare variant burden tests in the UK Biobank (discovery) and the All of Us program (replication) on ten cardiometabolic drug response outcomes including lipid response to statins, HbA1c response to metformin and blood pressure response to antihypertensives (N = 740-26,669). Our findings at genome-wide significance level recover previously reported pharmacogenetic signals and also include novel associations for lipid response to statins (N = 26,669) near LDLR and ZNF800. Importantly, these associations are treatment-specific and not associated with biomarker progression in medication-naive individuals. Furthermore, we demonstrate that individuals with higher genetically determined low-density and total cholesterol baseline levels experience increased absolute, albeit lower relative biomarker reduction following statin treatment. In summary, we systematically investigated the common and rare pharmacogenetic contribution to cardiometabolic drug response phenotypes in over 50,000 UK Biobank and All of Us participants with EHR and identified clinically relevant genetic predictors for improved personalized treatment strategies.
View details for DOI 10.1101/2024.04.06.24305415
View details for PubMedID 38633781
View details for PubMedCentralID PMC11023668
-
CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods
GENOME BIOLOGY
2024; 25 (1): 53
Abstract
The Critical Assessment of Genome Interpretation (CAGI) aims to advance the state-of-the-art for computational prediction of genetic variant impact, particularly where relevant to disease. The five complete editions of the CAGI community experiment comprised 50 challenges, in which participants made blind predictions of phenotypes from genetic data, and these were evaluated by independent assessors. Performance was particularly strong for clinical pathogenic variants, including some difficult-to-diagnose cases, and extends to interpretation of cancer-related variants. Missense variant interpretation methods were able to estimate biochemical effects with increasing accuracy. Assessment of methods for regulatory variants and complex trait disease risk was less definitive and indicates performance potentially suitable for auxiliary use in the clinic. Results show that while current methods are imperfect, they have major utility for research and clinical applications. Emerging methods and increasingly large, robust datasets for training and assessment promise further progress ahead.
View details for DOI 10.1186/s13059-023-03113-6
View details for Web of Science ID 001184832400002
View details for PubMedID 38389099
View details for PubMedCentralID PMC10882881
-
CRISPR-GPT: an LLM agent for automated design of gene-editing experiments
bioRxiv: the preprint server for biology
2024
View details for DOI 10.1101/2024.04.25.591003
-
DEEP LEARNING FOR LOCALIZED DETECTION OF OPTIC DISC HEMORRHAGES (vol 255, pg 161, 2023)
AMERICAN JOURNAL OF OPHTHALMOLOGY
2024; 257
View details for DOI 10.1016/j.ajo.2023.09.020
View details for Web of Science ID 001135019300001
-
A mitochondrial inside-out iron-calcium signal reveals drug targets for Parkinson's disease.
Cell reports
2023; 42 (12): 113544
Abstract
Dysregulated iron or Ca2+ homeostasis has been reported in Parkinson's disease (PD) models. Here, we discover a connection between these two metals at the mitochondria. Elevation of iron levels causes inward mitochondrial Ca2+ overflow, through an interaction of Fe2+ with mitochondrial calcium uniporter (MCU). In PD neurons, iron accumulation-triggered Ca2+ influx across the mitochondrial surface leads to spatially confined Ca2+ elevation at the outer mitochondrial membrane, which is subsequently sensed by Miro1, a Ca2+-binding protein. A Miro1 blood test distinguishes PD patients from controls and responds to drug treatment. Miro1-based drug screens in PD cells discover Food and Drug Administration-approved T-type Ca2+-channel blockers. Human genetic analysis reveals enrichment of rare variants in T-type Ca2+-channel subtypes associated with PD status. Our results identify a molecular mechanism in PD pathophysiology and drug targets and candidates coupled with a convenient stratification method.
View details for DOI 10.1016/j.celrep.2023.113544
View details for PubMedID 38060381
-
Integrative analyses highlight functional regulatory variants associated with neuropsychiatric diseases.
Nature genetics
2023
Abstract
Noncoding variants of presumed regulatory function contribute to the heritability of neuropsychiatric disease. A total of 2,221 noncoding variants connected to risk for ten neuropsychiatric disorders, including autism spectrum disorder, attention deficit hyperactivity disorder, bipolar disorder, borderline personality disorder, major depression, generalized anxiety disorder, panic disorder, post-traumatic stress disorder, obsessive-compulsive disorder and schizophrenia, were studied in developing human neural cells. Integrating epigenomic and transcriptomic data with massively parallel reporter assays identified differentially-active single-nucleotide variants (daSNVs) in specific neural cell types. Expression-gene mapping, network analyses and chromatin looping nominated candidate disease-relevant target genes modulated by these daSNVs. Follow-up integration of daSNV gene editing with clinical cohort analyses suggested that magnesium transport dysfunction may increase neuropsychiatric disease risk and indicated that common genetic pathomechanisms may mediate specific symptoms that are shared across multiple neuropsychiatric diseases.
View details for DOI 10.1038/s41588-023-01533-5
View details for PubMedID 37857935
View details for PubMedCentralID 4112379
-
Explainable protein function annotation using local structure embeddings.
bioRxiv : the preprint server for biology
2023
Abstract
The rapid expansion of protein sequence and structure databases has resulted in a significant number of proteins with ambiguous or unknown function. While advances in machine learning techniques hold great potential to fill this annotation gap, current methods for function prediction are unable to associate global function reliably to the specific residues responsible for that function. We address this issue by introducing PARSE (Protein Annotation by Residue-Specific Enrichment), a knowledge-based method which combines pre-trained embeddings of local structural environments with traditional statistical techniques to identify enriched functions with residue-level explainability. For the task of predicting the catalytic function of enzymes, PARSE achieves comparable or superior global performance to state-of-the-art machine learning methods (F1 score > 85%) while simultaneously annotating the specific residues involved in each function with much greater precision. Since it does not require supervised training, our method can make one-shot predictions for very rare functions and is not limited to a particular type of functional label (e.g. Enzyme Commission numbers or Gene Ontology codes). Finally, we leverage the AlphaFold Structure Database to perform functional annotation at a proteome scale. By applying PARSE to the dark proteome (predicted structures which cannot be classified into known structural families), we predict several novel bacterial metalloproteases. Each of these proteins shares a strongly conserved catalytic site despite highly divergent sequences and global folds, illustrating the value of local structure representations for new function discovery.
View details for DOI 10.1101/2023.10.13.562298
View details for PubMedID 37905033
View details for PubMedCentralID PMC10614799
-
A Holy Grail - The Prediction of Protein Structure.
The New England journal of medicine
2023
View details for DOI 10.1056/NEJMcibr2307735
View details for PubMedID 37732608
-
Stronger regulation of AI in biomedicine.
Science translational medicine
2023; 15 (713): eadi0336
Abstract
Regulatory agencies need to ensure the safety and equity of AI in biomedicine, and the time to do so is now.
View details for DOI 10.1126/scitranslmed.adi0336
View details for PubMedID 37703349
-
The phenotype-genotype reference map: Improving biobank data science through replication.
American journal of human genetics
2023
Abstract
Population-scale biobanks linked to electronic health record data provide vast opportunities to extend our knowledge of human genetics and discover new phenotype-genotype associations. Given their dense phenotype data, biobanks can also facilitate replication studies on a phenome-wide scale. Here, we introduce the phenotype-genotype reference map (PGRM), a set of 5,879 genetic associations from 523 GWAS publications that can be used for high-throughput replication experiments. PGRM phenotypes are standardized as phecodes, ensuring interoperability between biobanks. We applied the PGRM to five ancestry-specific cohorts from four independent biobanks and found evidence of robust replications across a wide array of phenotypes. We show how the PGRM can be used to detect data corruption and to empirically assess parameters for phenome-wide studies. Finally, we use the PGRM to explore factors associated with replicability of GWAS results.
View details for DOI 10.1016/j.ajhg.2023.07.012
View details for PubMedID 37607538
-
Associating biological context with protein-protein interactions through text mining at PubMed scale.
Journal of biomedical informatics
2023: 104474
Abstract
Inferring knowledge from known relationships between drugs, proteins, genes, and diseases has great potential for clinical impact, such as predicting which existing drugs could be repurposed to treat rare diseases. Incorporating key biological context such as cell type or tissue of action into representations of extracted biomedical knowledge is essential for principled pharmacological discovery. Existing global, literature-derived knowledge graphs of interactions between drugs, proteins, genes, and diseases lack this essential information. In this study, we frame the task of associating biological context with protein-protein interactions extracted from text as a classification task using syntactic, semantic, and novel meta-discourse features. We introduce the Insider corpora, which are automatically generated PubMed-scale corpora for training classifiers for the context association task. These corpora are created by searching for precise syntactic cues of cell type and tissue relevancy to extracted regulatory relations. We report F1 scores of 0.955 and 0.862 for identifying relevant cell types and tissues, respectively, for our identified relations. By classifying with this framework, we demonstrate that the problem of context association can be addressed using intuitive, interpretable features. We demonstrate the potential of this approach to enrich text-derived knowledge bases with biological detail by incorporating cell type context into a protein-protein network for dengue fever.
View details for DOI 10.1016/j.jbi.2023.104474
View details for PubMedID 37572825
-
Genetic Correlations Among Corneal Biophysical Parameters and Anthropometric Traits.
Translational vision science & technology
2023; 12 (8): 8
Abstract
Purpose: The genetic architecture of corneal dysfunction remains poorly understood. Epidemiological and clinical evidence suggests a relationship between corneal structural features and anthropometric measures. We used global and local genetic similarity analysis to identify genomic features that may underlie structural corneal dysfunction. Methods: We assembled genome-wide association study summary statistics for corneal features (central corneal thickness, corneal hysteresis [CH], corneal resistance factor [CRF], and the 3 mm index of keratometry) and anthropometric traits (body mass index, weight, and height) in Europeans. We calculated global genetic correlations (rg) between traits using linkage disequilibrium (LD) score regression and local genetic covariance using rho-HESS, which partitions the genome and performs regression with LD regions. Finally, we identified genes located within regions of significant genetic covariance and analyzed patterns of tissue expression and pathway enrichment. Results: Global LD score regression revealed significant negative correlations between height and both CH (rg = -0.12; P = 2.0 × 10^-7) and CRF (rg = -0.11; P = 6.9 × 10^-7). Local analysis revealed 68 genomic regions exhibiting significant local genetic covariance between CRF and height, containing 2874 unique genes. Pathway analysis of genes in regions with significant local rg revealed enrichment among signaling pathways with known keratoconus associations, including cadherin and Wnt signaling, as well as enrichment of genes modulated by copper and zinc ions. Conclusions: Corneal biophysical parameters and height share a common genomic architecture, which may facilitate identification of disease-associated genes and therapies for corneal ectasias. Translational Relevance: Local genetic covariance analysis enables the identification of associated genes and therapeutic targets for corneal ectatic disease.
View details for DOI 10.1167/tvst.12.8.8
View details for PubMedID 37561511
-
Integrative analysis of functional genomic screening and clinical data identifies a protective role for spironolactone in severe COVID-19.
Cell reports methods
2023; 3 (7): 100503
Abstract
We demonstrate that integrative analysis of CRISPR screening datasets enables network-based prioritization of prescription drugs modulating viral entry in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) by developing a network-based approach called Rapid proXimity Guidance for Repurposing Investigational Drugs (RxGRID). We use our results to guide a propensity-score-matched, retrospective cohort study of 64,349 COVID-19 patients, showing that a top candidate drug, spironolactone, is associated with improved clinical prognosis, measured by intensive care unit (ICU) admission and mechanical ventilation rates. Finally, we show that spironolactone exerts a dose-dependent inhibitory effect on viral entry in human lung epithelial cells. Our RxGRID method presents a computational framework, implemented as an open-source software package, enabling genomics researchers to identify drugs likely to modulate a molecular phenotype of interest based on high-throughput screening data. Our results, derived from this method and supported by experimental and clinical analysis, provide additional evidence for a potential protective role of the potassium-sparing diuretic spironolactone in severe COVID-19.
View details for DOI 10.1016/j.crmeth.2023.100503
View details for PubMedID 37529368
-
Deep learning for localized detection of optic disc hemorrhages.
American journal of ophthalmology
2023
Abstract
To develop an automated deep learning system for detecting the presence and location of disc hemorrhages in optic disc photographs. Development and testing of a deep learning algorithm. Optic disc photos (597 images with at least one disc hemorrhage and 1,075 images without any disc hemorrhage from 1,562 eyes) from five institutions were classified by expert graders based on the presence or absence of disc hemorrhage. The images were split into training (n=1,340), validation (n=167), and test (n=165) datasets. Two state-of-the-art deep learning algorithms based on either object-level detection or image-level classification were trained on the dataset. These models were compared to one another and against two independent glaucoma specialists. We evaluated model performance by the area under the receiver operating characteristic curve (AUC). AUCs were compared with the Hanley-McNeil method. The object detection model achieved an AUC of 0.936 (95% CI: 0.857-0.964) across all held-out images (n=165 photos), which was significantly superior to the image classification model (AUC: 0.845; 95% CI: 0.740-0.912; p=0.006). At an operating point selected for high specificity, the model achieved a specificity of 94.3% and a sensitivity of 70.0%, which was statistically indistinguishable from an expert clinician (p=0.7). At an operating point selected for high sensitivity, the model achieved a sensitivity of 96.7% and a specificity of 73.3%. An autonomous object detection model is superior to an image classification model for detecting disc hemorrhages and performed comparably to 2 clinicians.
View details for DOI 10.1016/j.ajo.2023.07.007
View details for PubMedID 37490992
-
Network-based machine learning for gene prioritization in primary open-angle glaucoma
ASSOC RESEARCH VISION OPHTHALMOLOGY INC. 2023
View details for Web of Science ID 001053758300036
-
Association between spironolactone use and COVID-19 outcomes in population-scale claims data: a retrospective cohort study.
medRxiv : the preprint server for health sciences
2023
Abstract
Background: Spironolactone has been proposed as a potential modulator of SARS-CoV-2 cellular entry. We aimed to measure the effect of spironolactone use on the risk of adverse outcomes following COVID-19 hospitalization. Methods: We performed a retrospective cohort study of COVID-19 outcomes for patients with or without exposure to spironolactone, using population-scale claims data from the Komodo Healthcare Map. We identified all patients with a hospital admission for COVID-19 in the study window, defining treatment status based on spironolactone prescription orders. The primary outcomes were progression to respiratory ventilation or mortality during the hospitalization. Odds ratios (OR) were estimated following either 1:1 propensity score matching (PSM) or multivariable regression. Subgroup analysis was performed based on age, gender, body mass index (BMI), and dominant SARS-CoV-2 variant. Findings: Among 898,303 eligible patients with a COVID-19-related hospitalization, 16,324 patients (1.8%) had a spironolactone prescription prior to hospitalization. 59,937 patients (6.7%) met the ventilation endpoint, and 26,515 patients (3.0%) met the mortality endpoint. Spironolactone use was associated with a significant reduction in odds of both ventilation (OR 0.82; 95% CI: 0.75-0.88; p < 0.001) and mortality (OR 0.88; 95% CI: 0.78-0.99; p = 0.033) in the PSM analysis, supported by the regression analysis. Spironolactone use was associated with significantly reduced odds of ventilation for all age groups, men, women, and non-obese patients, with the greatest protective effects in younger patients, men, and non-obese patients. Interpretation: Spironolactone use was associated with a protective effect against ventilation and mortality following COVID-19 infection, amounting to up to 64% of the protective effect of vaccination against ventilation and consistent with an androgen-dependent mechanism. The findings warrant initiation of large-scale randomized controlled trials to establish a potential therapeutic role for spironolactone in COVID-19 patients.
View details for DOI 10.1101/2023.02.28.23286515
View details for PubMedID 36909470
-
Using GPT-3 to Build a Lexicon of Drugs of Abuse Synonyms for Social Media Pharmacovigilance.
Biomolecules
2023; 13 (2)
Abstract
Drug abuse is a serious problem in the United States, with over 90,000 drug overdose deaths nationally in 2020. A key step in combating drug abuse is detecting, monitoring, and characterizing its trends over time and location, also known as pharmacovigilance. While federal reporting systems accomplish this to a degree, they often have high latency and incomplete coverage. Social-media-based pharmacovigilance has zero latency, is easily accessible and unfiltered, and benefits from drug users being willing to share their experiences online pseudo-anonymously. However, unlike highly structured official data sources, social media text is rife with misspellings and slang, making automated analysis difficult. Generative Pretrained Transformer 3 (GPT-3) is a large autoregressive language model specialized for few-shot learning that was trained on text from the entire internet. We demonstrate that GPT-3 can be used to generate slang and common misspellings of terms for drugs of abuse. We repeatedly queried GPT-3 for synonyms of drugs of abuse and filtered the generated terms using automated Google searches and cross-references to known drug names. When generated terms for alprazolam were manually labeled, we found that our method produced 269 synonyms for alprazolam, 221 of which were new discoveries not included in an existing drug lexicon for social media. We repeated this process for 98 drugs of abuse, of which 22 are widely-discussed drugs of abuse, building a lexicon of colloquial drug synonyms that can be used for pharmacovigilance on social media.
View details for DOI 10.3390/biom13020387
View details for PubMedID 36830756
-
Multilingual translation for zero-shot biomedical classification using BioTranslator.
Nature communications
2023; 14 (1): 738
Abstract
Existing annotation paradigms rely on controlled vocabularies, where each data instance is classified into one term from a predefined set of controlled vocabularies. This paradigm restricts the analysis to concepts that are known and well-characterized. Here, we present the novel multilingual translation method BioTranslator to address this problem. BioTranslator takes a user-written textual description of a new concept and then translates this description to a non-text biological data instance. The key idea of BioTranslator is to develop a multilingual translation framework, where multiple modalities of biological data are all translated to text. We demonstrate how BioTranslator enables the identification of novel cell types using only a textual description and how BioTranslator can be further generalized to protein function prediction and drug target identification. Our tool frees scientists from limiting their analyses within predefined controlled vocabularies, enabling them to interact with biological data using free text.
View details for DOI 10.1038/s41467-023-36476-2
View details for PubMedID 36759510
-
Mapping transcriptional heterogeneity and metabolic networks in fatty livers at single-cell resolution.
iScience
2023; 26 (1): 105802
Abstract
Non-alcoholic fatty liver disease is a heterogeneous disease with unclear underlying molecular mechanisms. Here, we perform single-cell RNA sequencing of hepatocytes and hepatic non-parenchymal cells to map the lipid signatures in mice with non-alcoholic fatty liver disease (NAFLD). We uncover previously unidentified clusters of hepatocytes characterized by either high or low Srebp1 expression. Surprisingly, the canonical lipid synthesis driver Srebp1 is not predictive of hepatic lipid accumulation, suggestive of other drivers of lipid metabolism. By combining transcriptional data at single-cell resolution with computational network analyses, we find that NAFLD is associated with high constitutive androstane receptor (CAR) expression. Mechanistically, CAR interacts with four functional modules: cholesterol homeostasis, bile acid metabolism, fatty acid metabolism, and estrogen response. Nuclear expression of CAR positively correlates with steatohepatitis in human livers. These findings demonstrate significant cellular differences in lipid signatures and identify functional networks linked to hepatic steatosis in mice and humans.
View details for DOI 10.1016/j.isci.2022.105802
View details for PubMedID 36636354
View details for PubMedCentralID PMC9830221
-
Genetic association studies using disease liabilities from deep neural networks.
medRxiv : the preprint server for health sciences
2023
Abstract
The case-control study is a widely used method for investigating the genetic landscape of binary traits. However, the health-related outcome or disease status of participants in long-term, prospective cohort studies such as the UK Biobank is subject to change. Here, we develop an approach for the genetic association study leveraging disease liabilities computed from a deep patient phenotyping framework (AI-based liability). Analyzing 44 common traits in 261,807 participants from the UK Biobank, we identified novel loci compared to the conventional case-control (CC) association studies. Our results showed that combining liability scores with CC status was more powerful than the CC-GWAS in detecting independent genetic loci across different diseases. This boost in statistical power was further reflected in increased SNP-based heritability estimates. Moreover, polygenic risk scores calculated from AI-based liabilities better identified newly diagnosed cases in the 2022 release of the UK Biobank that served as controls in the 2019 version (6.2% percentile rank increase on average). These findings demonstrate the utility of deep neural networks that are able to model disease liabilities from high-dimensional phenotypic data in large-scale population cohorts. Our pipeline of genome-wide association studies with disease liabilities can be applied to other biobanks with rich phenotype and genotype data.
View details for DOI 10.1101/2023.01.18.23284383
View details for PubMedID 36712099
-
Detecting Contradictory COVID-19 Drug Efficacy Claims from Biomedical Literature
ASSOC COMPUTATIONAL LINGUISTICS-ACL. 2023: 694-713
View details for Web of Science ID 001181088800061
-
Promises and challenges in pharmacoepigenetics.
Cambridge prisms, Precision medicine
2023; 1: e18
Abstract
Pharmacogenetics, the study of how interindividual genetic differences affect drug response, does not explain all observed heritable variance in drug response. Epigenetic mechanisms, such as DNA methylation and histone acetylation, may account for some of the unexplained variance. Epigenetic mechanisms modulate gene expression, can be suitable drug targets, and can impact the action of nonepigenetic drugs. Pharmacoepigenetics is the field that studies the relationship between epigenetic variability and drug response. Much of this research focuses on compounds targeting epigenetic mechanisms, called epigenetic drugs, which are used to treat cancers, immune disorders, and other diseases. Several studies also suggest an epigenetic role in classical drug response; however, we know little about this area. The amount of information correlating epigenetic biomarkers to molecular datasets has recently expanded due to technological advances, and novel computational approaches have emerged to better identify and predict epigenetic interactions. We propose that the relationship between epigenetics and classical drug response may be examined using data already available by (1) finding regions of epigenetic variance, (2) pinpointing key epigenetic biomarkers within these regions, and (3) mapping these biomarkers to a drug-response phenotype. This approach expands on existing knowledge to generate putative pharmacoepigenetic relationships, which can be tested experimentally. Epigenetic modifications are involved in disease and drug response. Therefore, understanding how epigenetic drivers impact the response to classical drugs is important for improving drug design and administration to better treat disease.
View details for DOI 10.1017/pcm.2023.6
View details for PubMedID 37560024
-
COLLAPSE: A representation learning framework for identification and characterization of protein structural sites.
Protein science : a publication of the Protein Society
2022: e4541
Abstract
The identification and characterization of the structural sites which contribute to protein function are crucial for understanding biological mechanisms, evaluating disease risk, and developing targeted therapies. However, the quantity of known protein structures is rapidly outpacing our ability to functionally annotate them. Existing methods for function prediction either do not operate on local sites, suffer from high false positive or false negative rates, or require large site-specific training datasets, necessitating the development of new computational methods for annotating functional sites at scale. We present COLLAPSE (Compressed Latents Learned from Aligned Protein Structural Environments), a framework for learning deep representations of protein sites. COLLAPSE operates directly on the 3D positions of atoms surrounding a site and uses evolutionary relationships between homologous proteins as a self-supervision signal, enabling learned embeddings to implicitly capture structure-function relationships within each site. Our representations generalize across disparate tasks in a transfer learning context, achieving state-of-the-art performance on standardized benchmarks (protein-protein interactions and mutation stability) and on the prediction of functional sites from the Prosite database. We use COLLAPSE to search for similar sites across large protein datasets and to annotate proteins based on a database of known functional sites. These methods demonstrate that COLLAPSE is computationally efficient, tunable, and interpretable, providing a general-purpose platform for computational protein analysis.
View details for DOI 10.1002/pro.4541
View details for PubMedID 36519247
-
POPDx: an automated framework for patient phenotyping across 392 246 individuals in the UK Biobank study.
Journal of the American Medical Informatics Association : JAMIA
2022
Abstract
For the UK Biobank, standardized phenotype codes are associated with patients who have been hospitalized but are missing for many patients who have been treated exclusively in an outpatient setting. We describe a method for phenotype recognition that imputes phenotype codes for all UK Biobank participants. POPDx (Population-based Objective Phenotyping by Deep Extrapolation) is a bilinear machine learning framework for simultaneously estimating the probabilities of 1538 phenotype codes. We extracted phenotypic and health-related information of 392 246 individuals from the UK Biobank for POPDx development and evaluation. A total of 12 803 ICD-10 diagnosis codes of the patients were converted to 1538 phecodes as gold standard labels. The POPDx framework was evaluated and compared to other available methods on automated multiphenotype recognition. POPDx can predict phenotypes that are rare or even unobserved in training. We demonstrate substantial improvement of automated multiphenotype recognition across 22 disease categories, and its application in identifying key epidemiological features associated with each phenotype. POPDx helps provide well-defined cohorts for downstream studies. It is a general-purpose method that can be applied to other biobanks with diverse but incomplete data.
View details for DOI 10.1093/jamia/ocac226
View details for PubMedID 36469791
-
Gene set proximity analysis: expanding gene set enrichment analysis through learned geometric embeddings, with drug-repurposing applications in COVID-19.
Bioinformatics (Oxford, England)
2022
Abstract
MOTIVATION: Gene set analysis methods rely on knowledge-based representations of genetic interactions in the form of both gene set collections and protein-protein interaction (PPI) networks. However, explicit representations of genetic interactions often fail to capture complex interdependencies among genes, limiting the analytic power of such methods. RESULTS: We propose an extension of gene set enrichment analysis to a latent embedding space reflecting PPI network topology, called gene set proximity analysis (GSPA). Compared with existing methods, GSPA provides improved ability to identify disease-associated pathways in disease-matched gene expression datasets, while improving reproducibility of enrichment statistics for similar gene sets. GSPA is statistically straightforward, reducing to a version of traditional gene set enrichment analysis through a single user-defined parameter. We apply our method to identify novel drug associations with SARS-CoV-2 viral entry. Finally, we validate our drug association predictions through retrospective clinical analysis of claims data from 8 million patients, supporting a role for gabapentin as a risk factor and metformin as a protective factor for severe COVID-19. AVAILABILITY: GSPA is available for download as a command-line Python package at https://github.com/henrycousins/gspa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btac735
View details for PubMedID 36394254
-
Functional genomics of OCTN2 variants informs protein-specific variant effect predictor for Carnitine Transporter Deficiency.
Proceedings of the National Academy of Sciences of the United States of America
2022; 119 (46): e2210247119
Abstract
Genetic variants in SLC22A5, encoding the membrane carnitine transporter OCTN2, cause the rare metabolic disorder Carnitine Transporter Deficiency (CTD). CTD is potentially lethal but actionable if detected early, with confirmatory diagnosis involving sequencing of SLC22A5. Interpretation of missense variants of uncertain significance (VUSs) is a major challenge. In this study, we sought to characterize the largest set to date (n = 150) of OCTN2 variants identified in diverse ancestral populations, with the goals of furthering our understanding of the mechanisms leading to OCTN2 loss-of-function (LOF) and creating a protein-specific variant effect prediction model for OCTN2 function. Uptake assays with 14C-carnitine revealed that 105 variants (70%) significantly reduced transport of carnitine compared to wild-type OCTN2, and 37 variants (25%) severely reduced function to less than 20%. All ancestral populations harbored LOF variants; 62% of green fluorescent protein (GFP)-tagged variants impaired OCTN2 localization to the plasma membrane of human embryonic kidney (HEK293T) cells, and subcellular localization significantly associated with function, revealing a major LOF mechanism of interest for CTD. With these data, we trained a model to classify variants as functional (>20% function) or LOF (<20% function). Our model outperformed existing state-of-the-art methods as evaluated by multiple performance metrics, with mean area under the receiver operating characteristic curve (AUROC) of 0.895 ± 0.025. In summary, in this study we generated a rich dataset of OCTN2 variant function and localization, revealed important disease-causing mechanisms, and improved upon machine learning-based prediction of OCTN2 variant function to aid in variant interpretation in the diagnosis and treatment of CTD.
View details for DOI 10.1073/pnas.2210247119
View details for PubMedID 36343260
-
A cis-regulatory lexicon of DNA motif combinations mediating cell-type-specific gene regulation.
Cell genomics
2022; 2 (11)
Abstract
Gene expression is controlled by transcription factors (TFs) that bind cognate DNA motif sequences in cis-regulatory elements (CREs). The combinations of DNA motifs acting within homeostasis and disease, however, are unclear. Gene expression, chromatin accessibility, TF footprinting, and H3K27ac-dependent DNA looping data were generated and a random-forest-based model was applied to identify 7,531 cell-type-specific cis-regulatory modules (CRMs) across 15 diploid human cell types. A co-enrichment framework within CRMs nominated 838 cell-type-specific, recurrent heterotypic DNA motif combinations (DMCs), which were functionally validated using massively parallel reporter assays. Cancer cells engaged DMCs linked to neoplasia-enabling processes operative in normal cells while also activating new DMCs only seen in the neoplastic state. This integrative approach identifies cell-type-specific cis-regulatory combinatorial DNA motifs in diverse normal and diseased human cells and represents a general framework for deciphering cis-regulatory sequence logic in gene regulation.
View details for DOI 10.1016/j.xgen.2022.100191
View details for PubMedID 36742369
-
A network paradigm predicts drug synergistic effects using downstream protein-protein interactions.
CPT: pharmacometrics & systems pharmacology
2022
Abstract
In some cases, drug combinations affect adverse outcome phenotypes by binding the same protein; however, drug-binding proteins are associated through protein-protein interaction (PPI) networks within the cell, suggesting that drug phenotypes may result from long-range network effects. We first used PPI network analysis to classify drugs based on proteins downstream of their targets and next predicted drug combination effects where drugs shared network proteins but had distinct binding proteins (e.g., targets, enzymes, or transporters). By classifying drugs using their downstream proteins, we had an 80.7% sensitivity for predicting rare drug combination effects documented in gold-standard datasets. We further measured the effect of predicted drug combinations on adverse outcome phenotypes using novel observational studies in the electronic health record. We tested predictions for 60 network-drug classes on seven adverse outcomes and measured changes in clinical outcomes for predicted combinations. These results demonstrate a novel paradigm for anticipating drug synergistic effects using proteins downstream of drug targets.
View details for DOI 10.1002/psp4.12861
View details for PubMedID 36204824
-
Contexts and contradictions: a roadmap for computational drug repurposing with knowledge inference.
Briefings in bioinformatics
2022
Abstract
The cost of drug development continues to rise and may be prohibitive in cases of unmet clinical need, particularly for rare diseases. Artificial intelligence-based methods are promising in their potential to discover new treatment options. The task of drug repurposing hypothesis generation is well-posed as a link prediction problem in a knowledge graph (KG) of interacting drugs, proteins, genes and disease phenotypes. KGs derived from biomedical literature are semantically rich and up-to-date representations of scientific knowledge. Inference methods on scientific KGs can be confounded by unspecified contexts and contradictions. Extracting context enables incorporation of relevant pharmacokinetic and pharmacodynamic detail, such as tissue specificity of interactions. Contradictions in biomedical KGs may arise when contexts are omitted or due to contradicting research claims. In this review, we describe challenges to creating literature-scale representations of pharmacological knowledge and survey current approaches toward incorporating context and resolving contradictions.
View details for DOI 10.1093/bib/bbac268
View details for PubMedID 35817308
-
Genetic Correlations between Corneal Biophysical Parameters and Anthropomorphic Traits
ASSOC RESEARCH VISION OPHTHALMOLOGY INC. 2022
View details for Web of Science ID 000844437004080
-
Construction of disease-specific cytokine profiles by associating disease genes with immune responses.
PLoS computational biology
2022; 18 (4): e1009497
Abstract
The pathogenesis of many inflammatory diseases is a coordinated process involving metabolic dysfunctions and immune response, usually modulated by the production of cytokines and associated inflammatory molecules. In this work, we seek to understand how genes involved in pathogenesis, which are often not associated with the immune system in an obvious way, communicate with the immune system. We have embedded a network of human protein-protein interactions (PPI) from the STRING database with 14,707 human genes using feature learning that captures high confidence edges. We have found that our predicted Association Scores derived from the features extracted from STRING's high confidence edges are useful for predicting novel connections between genes, thus enabling the construction of a full map of predicted associations for all possible pairs between 14,707 human genes. In particular, we analyzed the pattern of associations for 126 cytokines and found that the six patterns of cytokine interaction with human genes are consistent with their functional classifications. To define the disease-specific roles of cytokines we have collected gene sets for 11,944 diseases from DisGeNET. We used these gene sets to predict disease-specific gene associations with cytokines by calculating the normalized average Association Scores between disease-associated gene sets and the 126 cytokines; this creates a unique profile of inflammatory genes (both known and predicted) for each disease. We validated our predicted cytokine associations by comparing them to known associations for 171 diseases. The predicted cytokine profiles correlate (p-value<0.0003) with the known ones in 95 diseases. We further characterized the profiles of each disease by calculating an "Inflammation Score" that summarizes different modes of immune responses. Finally, by analyzing subnetworks formed between disease-specific pathogenesis genes, hormones, receptors, and cytokines, we identified the key genes responsible for interactions between pathogenesis and inflammatory responses. These genes and the corresponding cytokines used by different immune disorders suggest unique targets for drug discovery.
View details for DOI 10.1371/journal.pcbi.1009497
View details for PubMedID 35404985
-
Protein sequence design with a learned potential.
Nature communications
2022; 13 (1): 746
Abstract
The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design.
View details for DOI 10.1038/s41467-022-28313-9
View details for PubMedID 35136054
-
Training data composition affects performance of protein structure analysis algorithms.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2022; 27: 10-21
Abstract
The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug development. The accuracy of such models in deployment is directly influenced by training data quality. The use of different experimental methods for protein structure determination may introduce bias into the training data. In this work, we evaluate the magnitude of this effect across three distinct tasks: estimation of model accuracy, protein sequence design, and catalytic residue prediction. Most protein structures are derived from X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM); we trained each model on datasets consisting of either all three structure types or of only X-ray data. We find that across these tasks, models consistently perform worse on test sets derived from NMR and cryo-EM than they do on test sets of structures derived from X-ray crystallography, but that the difference can be mitigated when NMR and cryo-EM structures are included in the training set. Importantly, we show that including all three types of structures in the training set does not degrade test performance on X-ray structures, and in some cases even increases it. Finally, we examine the relationship between model performance and the biophysical properties of each method, and recommend that the biochemistry of the task of interest should be considered when composing training sets.
View details for PubMedID 34890132
-
Recommendations for achieving interoperable and shareable medical data in the USA.
Communications medicine
2022; 2: 86
Abstract
Easy access to large quantities of accurate health data is required to understand medical and scientific information in real-time; evaluate public health measures before, during, and after times of crisis; and prevent medical errors. Introducing a system in the USA that allows for efficient access to such health data and ensures auditability of data facts, while avoiding data silos, will require fundamental changes in current practices. Here, we recommend the implementation of standardized data collection and transmission systems, universal identifiers for individual patients and end users, a reference standard infrastructure to support calibration and integration of laboratory results from equivalent tests, and modernized working practices. Requiring comprehensive and binding standards, rather than incentivizing voluntary and often piecemeal efforts for data exchange, will allow us to achieve the analytical information environment that patients need.
View details for DOI 10.1038/s43856-022-00148-x
View details for PubMedID 35865358
-
Challenges and opportunities in network-based solutions for biological questions.
Briefings in bioinformatics
2021
Abstract
Network biology is useful for modeling complex biological phenomena; it has attracted attention with the advent of novel graph-based machine learning methods. However, biological applications of network methods often suffer from inadequate follow-up. In this perspective, we discuss obstacles for contemporary network approaches (particularly focusing on challenges representing biological concepts, applying machine learning methods, and interpreting and validating computational findings about biology) in an effort to catalyze actionable biological discovery.
View details for DOI 10.1093/bib/bbab437
View details for PubMedID 34849568
-
Quantifying the Severity of Adverse Drug Reactions Using Social Media: Network Analysis.
Journal of medical Internet research
2021; 23 (10): e27714
Abstract
BACKGROUND: Adverse drug reactions (ADRs) affect the health of hundreds of thousands of individuals annually in the United States, with associated costs of hundreds of billions of dollars. The monitoring and analysis of the severity of ADRs is limited by the current qualitative and categorical systems of severity classification. Previous efforts have generated quantitative estimates for a subset of ADRs but were limited in scope because of the time and costs associated with the efforts. OBJECTIVE: The aim of this study is to increase the number of ADRs for which there are quantitative severity estimates while improving the quality of these severity estimates. METHODS: We present a semisupervised approach that estimates ADR severity by using social media word embeddings to construct a lexical network of ADRs and perform label propagation. We used this method to estimate the severity of 28,113 ADRs, representing 12,198 unique ADR concepts from the Medical Dictionary for Regulatory Activities. RESULTS: Our Severity of Adverse Events Derived from Reddit (SAEDR) scores have good correlations with real-world outcomes. The SAEDR scores had Spearman correlations of 0.595, 0.633, and -0.748 for death, serious outcome, and no outcome, respectively, with ADR case outcomes in the Food and Drug Administration Adverse Event Reporting System. We investigated different methods for defining initial seed term sets and evaluated their impact on the severity estimates. We analyzed severity distributions for ADRs based on their appearance in boxed warning drug label sections, as well as for ADRs with sex-specific associations. We found that ADRs discovered in the postmarketing period had significantly greater severity than those discovered during the clinical trial (P<.001). We created quantitative drug-risk profile (DRIP) scores for 968 drugs that had a Spearman correlation of 0.377 with drugs ranked by the Food and Drug Administration Adverse Event Reporting System cases resulting in death, where the given drug was the primary suspect. CONCLUSIONS: Our SAEDR and DRIP scores are well correlated with the real-world outcomes of the entities they represent and have demonstrated utility in pharmacovigilance research. We make the SAEDR scores for 12,198 ADRs and the DRIP scores for 968 drugs publicly available to enable more quantitative analysis of pharmacovigilance data.
View details for DOI 10.2196/27714
View details for PubMedID 34673524
-
Leveraging the Cell Ontology to classify unseen cell types.
Nature communications
2021; 12 (1): 5556
Abstract
Single cell technologies are rapidly generating large amounts of data that enable us to understand biological systems at single-cell resolution. However, joint analysis of datasets generated by independent labs remains challenging due to a lack of consistent terminology to describe cell types. Here, we present OnClass, an algorithm and accompanying software for automatically classifying cells into cell types that are part of the controlled vocabulary that forms the Cell Ontology. A key advantage of OnClass is its capability to classify cells into cell types not present in the training data because it uses the Cell Ontology graph to infer cell type relationships. Furthermore, OnClass can be used to identify marker genes for all the Cell Ontology categories, regardless of whether the cell types are present or absent in the training data. This suggests that OnClass goes beyond a simple annotation tool for single cell datasets: it is the first algorithm capable of identifying marker genes specific to all terms of the Cell Ontology, and it offers the possibility of refining the Cell Ontology using a data-centric approach.
View details for DOI 10.1038/s41467-021-25725-x
View details for PubMedID 34548483
-
PhenClust, a standalone tool for identifying trends within sets of biological phenotypes using semantic similarity and the Unified Medical Language System metathesaurus.
JAMIA open
2021; 4 (3): ooab079
Abstract
Objectives: We sought to cluster biological phenotypes using semantic similarity and create an easy-to-install, stable, and reproducible tool. Materials and Methods: We generated Phenotype Clustering (PhenClust), a novel application of semantic similarity for interpreting biological phenotype associations, using the Unified Medical Language System (UMLS) metathesaurus, demonstrated the tool's application, and developed Docker containers with stable installations of two UMLS versions. Results: PhenClust identified disease clusters for drug network-associated phenotypes and a meta-analysis of drug target candidates. The Dockerized containers eliminated the requirement that the user install the UMLS metathesaurus. Discussion: Clustering phenotypes summarized all phenotypes associated with a drug network and two drug candidates. Docker containers can support dissemination and reproducibility of tools that are otherwise limited due to insufficient software support. Conclusion: PhenClust can improve interpretation of high-throughput biological analyses where many phenotypes are associated with a query, and the Dockerized PhenClust achieved our objective of decreasing installation complexity.
View details for DOI 10.1093/jamiaopen/ooab079
View details for PubMedID 34541463
-
Genome-wide Association Studies in Pharmacogenomics.
Clinical pharmacology and therapeutics
2021
Abstract
The increasing availability of genotype data linked with information about drug-response phenotypes has enabled genome-wide association studies (GWAS) that uncover genetic determinants of drug response. GWAS have discovered associations between genetic variants and both drug efficacy and adverse drug reactions. Despite these successes, the design of GWAS in pharmacogenomics faces unique challenges. In this review we analyze the last decade of GWAS in pharmacogenomics. We review trends in publications over time, including the drugs and drug classes studied and the clinical phenotypes used. Several data sharing consortia have contributed substantially to the PGx GWAS literature. We anticipate increased focus on biobanks and highlight phenotypes that would best enable future pharmacogenomics discoveries.
View details for DOI 10.1002/cpt.2349
View details for PubMedID 34185318
-
Distinct clinical phenotypes for Crohn's disease derived from patient surveys.
BMC gastroenterology
2021; 21 (1): 160
Abstract
BACKGROUND: Defining clinical phenotypes provides opportunities for new diagnostics and may provide insights into early intervention and disease prevention. There is increasing evidence that patient-derived health data may contain information that complements traditional methods of clinical phenotyping. The utility of these data for defining meaningful phenotypic groups is of great interest because social media and online resources make it possible to query large cohorts of patients with health conditions. METHODS: We evaluated the degree to which patient-reported categorical data is useful for discovering subclinical phenotypes and evaluated its utility for discovering new measures of disease severity, treatment response and genetic architecture. Specifically, we examined the responses of 1961 patients with inflammatory bowel disease to questionnaires in search of sub-phenotypes. We applied machine learning methods to identify novel subtypes of Crohn's disease and studied their associations with drug responses. RESULTS: Using the patients' self-reported information, we identified two subpopulations of Crohn's disease; these subpopulations differ in disease severity, associations with smoking, and genetic transmission patterns. We also identified distinct features of drug response for the two Crohn's disease subtypes. These subtypes show a trend towards differential genotype signatures. CONCLUSION: Our findings suggest that patient-defined data can have unplanned utility for defining disease subtypes and may be useful for guiding treatment approaches.
View details for DOI 10.1186/s12876-021-01740-6
View details for PubMedID 33836648
-
Opportunities and challenges for the computational interpretation of rare variation in clinically important genes.
American journal of human genetics
2021; 108 (4): 535–48
Abstract
Genome sequencing is enabling precision medicine: tailoring treatment to the unique constellation of variants in an individual's genome. The impact of recurrent pathogenic variants is often understood; however, there is a long tail of rare genetic variants that are uncharacterized. The problem of uncharacterized rare variation is especially acute when it occurs in genes of known clinical importance with functionally consequential variants and associated mechanisms. Variants of uncertain significance (VUSs) in these genes are discovered at a rate that outpaces current ability to classify them with databases of previous cases, experimental evaluation, and computational predictors. Clinicians are thus left without guidance about the significance of variants that may have actionable consequences. Computational prediction of the impact of rare genetic variation is increasingly becoming an important capability. In this paper, we review the technical and ethical challenges of interpreting the function of rare variants in two settings: inborn errors of metabolism in newborns and pharmacogenomics. We propose a framework for a genomic learning healthcare system with an initial focus on early-onset treatable disease in newborns and actionable pharmacogenomics. We argue that (1) a genomic learning healthcare system must allow for continuous collection and assessment of rare variants, (2) emerging machine learning methods will enable algorithms to predict the clinical impact of rare variants on protein function, and (3) ethical considerations must inform the construction and deployment of all rare-variation triage strategies, particularly with respect to health disparities arising from unbalanced ancestry representation.
View details for DOI 10.1016/j.ajhg.2021.03.003
View details for PubMedID 33798442
-
Large-scale labeling and assessment of sex bias in publicly available expression data.
BMC bioinformatics
2021; 22 (1): 168
Abstract
BACKGROUND: Women are at more than 1.5-fold higher risk for clinically relevant adverse drug events. While this higher prevalence is partially due to gender-related effects, biological sex differences likely also impact drug response. Publicly available gene expression databases provide a unique opportunity for examining drug response at a cellular level. However, missingness and heterogeneity of metadata prevent large-scale identification of drug exposure studies and limit assessments of sex bias. To address this, we trained organism-specific models to infer sample sex from gene expression data, and used entity normalization to map metadata cell line and drug mentions to existing ontologies. Using this method, we inferred sex labels for 450,371 human and 245,107 mouse microarray and RNA-seq samples from refine.bio. RESULTS: Overall, we find slight female bias (52.1%) in human samples and (62.5%) male bias in mouse samples; this corresponds to a majority of mixed sex studies in humans and single sex studies in mice, split between female-only and male-only (25.8% vs. 18.9% in human and 21.6% vs. 31.1% in mouse, respectively). In drug studies, we find limited evidence for sex-sampling bias overall; however, specific categories of drugs, including human cancer and mouse nervous system drugs, are enriched in female-only and male-only studies, respectively. We leverage our expression-based sex labels to further examine the complexity of cell line sex and assess the frequency of metadata sex label misannotations (2-5%). CONCLUSIONS: Our results demonstrate limited overall sex bias, while highlighting high bias in specific subfields and underscoring the importance of including sex labels to better understand the underlying biology. We make our inferred and normalized labels, along with flags for misannotated samples, publicly available to catalyze the routine use of sex as a study variable in future analyses.
View details for DOI 10.1186/s12859-021-04070-2
View details for PubMedID 33784977
-
Search and visualization of gene-drug-disease interactions for pharmacogenomics and precision medicine research using GeneDive.
Journal of biomedical informatics
2021: 103732
Abstract
BACKGROUND: Understanding the relationships between genes, drugs, and disease states is at the core of pharmacogenomics. Two leading approaches for identifying these relationships in medical literature are human expert led manual curation efforts and modern data mining based automated approaches. The former generates small amounts of high-quality data, and the latter offers large volumes of mixed quality data. The algorithmically extracted relationships are often accompanied by supporting evidence, such as confidence scores, source articles, and surrounding contexts (excerpts) from the articles, that can be used as data quality indicators. Tools that can leverage these quality indicators to help the user gain access to larger and high-quality data are needed. APPROACH: We introduce GeneDive, a web application for pharmacogenomics researchers and precision medicine practitioners that makes gene, disease, and drug interactions data easily accessible and usable. GeneDive is designed to meet three key objectives: (1) provide functionality to manage the information-overload problem and facilitate easy assimilation of supporting evidence, (2) support longitudinal and exploratory research investigations, and (3) offer integration of user-provided interactions data without requiring data sharing. RESULTS: GeneDive offers multiple search modalities, visualizations, and other features that guide the user efficiently to the information of their interest. To facilitate exploratory research, GeneDive makes the supporting evidence and context for each interaction readily available and allows the data quality threshold to be controlled by the user as per their risk tolerance level. The interactive search-visualization loop enables relationship discoveries between diseases, genes, and drugs that might not be explicitly described in literature but are emergent from the source medical corpus and deductive reasoning. The ability to utilize the user's data either in combination with the GeneDive native datasets or in isolation promotes richer data-driven exploration and discovery. These functionalities along with GeneDive's applicability for precision medicine, bringing the knowledge contained in biomedical literature to bear on particular clinical situations and improving patient care, are illustrated through detailed use cases. CONCLUSION: GeneDive is a comprehensive, broad-use biological interactions browser. The GeneDive application and information about its underlying system architecture are available at http://www.genedive.net. A GeneDive Docker image is also available for download at this URL, allowing users to (1) import their own interaction data securely and privately; and (2) generate and test hypotheses across their own and other datasets.
View details for DOI 10.1016/j.jbi.2021.103732
View details for PubMedID 33737208
-
Modeling drug response using network-based personalized treatment prediction (NetPTP) with applications to inflammatory bowel disease.
PLoS computational biology
2021; 17 (2): e1008631
Abstract
For many prevalent complex diseases, treatment regimens are frequently ineffective. For example, despite multiple available immunomodulators and immunosuppressants, inflammatory bowel disease (IBD) remains difficult to treat. Heterogeneity in the disease across patients makes it challenging to select the optimal treatment regimens, and some patients do not respond to any of the existing treatment choices. Drug repurposing strategies for IBD have had limited clinical success and have not typically offered individualized patient-level treatment recommendations. In this work, we present NetPTP, a Network-based Personalized Treatment Prediction framework which models measured drug effects from gene expression data and applies them to patient samples to generate personalized ranked treatment lists. To accomplish this, we combine publicly available network, drug target, and drug effect data to generate treatment rankings using patient data. These ranked lists can then be used to prioritize existing treatments and discover new therapies for individual patients. We demonstrate how NetPTP captures and models drug effects, and we apply our framework to individual IBD samples to provide novel insights into IBD treatment.
View details for DOI 10.1371/journal.pcbi.1008631
View details for PubMedID 33544718
-
A New Era in Pharmacovigilance: Towards real world data and digital monitoring.
Clinical pharmacology and therapeutics
2021
Abstract
Adverse drug reactions (ADRs) are a major concern for patients, clinicians, and regulatory agencies. The discovery of serious ADRs leading to substantial morbidity and mortality has resulted in mandatory Phase IV clinical trials, black box warnings, and withdrawal of drugs from the market. Real World Data, data collected during routine clinical care, is being adopted by innovators, regulators, payors, and providers to inform decision making throughout the product life cycle. We outline several different approaches to modern pharmacovigilance, including spontaneous reporting databases, electronic health record monitoring and research frameworks, social media surveillance, and the use of digital devices. Some of these platforms are well established while others are still emerging, or experimental. We highlight both the potential opportunity, as well as the existing challenges within these pharmacovigilance systems that have already begun to impact the drug development process, as well as the landscape of postmarket drug safety monitoring. Further research and investment into different and complementary pharmacovigilance systems is needed to ensure the continued safety of pharmacotherapy.
View details for DOI 10.1002/cpt.2172
View details for PubMedID 33492663
-
Repurposing Biomedical Informaticians for COVID-19.
Journal of biomedical informatics
2021: 103673
Abstract
The COVID-19 pandemic is an unprecedented challenge to the biomedical research community at the intersection of great uncertainty due to the novelty of the virus and extremely high stakes due to the large global death count. The global quarantine shut-downs complicated scientific matters because many laboratories were closed down unless they were actively doing COVID-19 related research, making repurposing of activities difficult for many biomedical researchers. Biomedical informaticians, who have been primarily able to continue their research through remote work and video conferencing, have been able to maintain normal activities. In addition to continuing ongoing studies, there has been great grassroots interest in helping in the fight against COVID-19. In this commentary, we describe several projects that arose from this desire to help, and the lessons that the authors learned along the way. We then offer some insights into how these lessons might be applied to make scientific progress more efficient in future crisis scenarios.
View details for DOI 10.1016/j.jbi.2021.103673
View details for PubMedID 33486067
-
Drug Response Pharmacogenetics for 200,000 UK Biobank Participants.
Pacific Symposium on Biocomputing
2021; 26: 184–95
Abstract
Pharmacogenetics studies how genetic variation leads to variability in drug response. Guidelines for selecting the right drug and right dose for patients based on their genetics are clinically effective, but are widely unused. For some drugs, the normal clinical decision making process may lead to the optimal dose of a drug that minimizes side effects and maximizes effectiveness. Without measurements of genotype, physicians and patients may adjust dosage in a manner that reflects the underlying genetics. The emergence of genetic data linked to longitudinal clinical data in large biobanks offers an opportunity to confirm known pharmacogenetic interactions as well as discover novel associations by investigating outcomes from normal clinical practice. Here we use the UK Biobank to search for pharmacogenetic interactions among 200 drugs and 9 genes among 200,000 participants. We identify associations between pharmacogene phenotypes and drug maintenance dose as well as differential drug response phenotypes. We find support for several known drug-gene associations as well as novel pharmacogenetic interactions.
View details for PubMedID 33691016
-
Analyzing the vast coronavirus literature with CoronaCentral.
Proceedings of the National Academy of Sciences of the United States of America
2021; 118 (23)
Abstract
The SARS-CoV-2 pandemic has caused a surge in research exploring all aspects of the virus and its effects on human health. The overwhelming publication rate means that researchers are unable to keep abreast of the literature. To ameliorate this, we present the CoronaCentral resource that uses machine learning to process the research literature on SARS-CoV-2 together with SARS-CoV and MERS-CoV. We categorize the literature into useful topics and article types and enable analysis of the contents, pace, and emphasis of research during the crisis with integration of Altmetric data. These topics include therapeutics, disease forecasting, as well as growing areas such as "long COVID" and studies of inequality. This resource, available at https://coronacentral.ai, is updated daily.
View details for DOI 10.1073/pnas.2100766118
View details for PubMedID 34016708
-
Drug Response Pharmacogenetics for 200,000 UK Biobank Participants
WORLD SCIENTIFIC PUBL CO PTE LTD. 2021: 184-195
View details for Web of Science ID 000759784400018
-
Pharmacogenetics at Scale: An Analysis of the UK Biobank.
Clinical pharmacology and therapeutics
2020
Abstract
Pharmacogenetics (PGx) studies the influence of genetic variation on drug response. Clinically actionable associations inform guidelines created by the Clinical Pharmacogenetics Implementation Consortium (CPIC), but the broad impact of genetic variation on entire populations is not well-understood. We analyzed PGx allele and phenotype frequencies for 487,409 participants in the U.K. Biobank, the largest PGx study to date. For fourteen CPIC pharmacogenes known to influence human drug response, we find that 99.5% of individuals may have an atypical response to at least one drug; on average they may have an atypical response to 10.3 drugs. Nearly 24% of participants have been prescribed a drug for which they are predicted to have an atypical response. Non-European populations carry a greater frequency of variants that are predicted to be functionally deleterious; many of these are not captured by current PGx allele definitions. Strategies for detecting and interpreting rare variation will be critical for enabling broad application of pharmacogenetics.
View details for DOI 10.1002/cpt.2122
View details for PubMedID 33237584
-
Transfer learning enables prediction of CYP2D6 haplotype function.
PLoS computational biology
2020; 16 (11): e1008399
Abstract
Cytochrome P450 2D6 (CYP2D6) is a highly polymorphic gene whose protein product metabolizes more than 20% of clinically used drugs. Genetic variations in CYP2D6 are responsible for interindividual heterogeneity in drug response that can lead to drug toxicity and ineffective treatment, making CYP2D6 one of the most important pharmacogenes. Prediction of CYP2D6 phenotype relies on curation of literature-derived functional studies to assign a functional status to CYP2D6 haplotypes. As the number of large-scale sequencing efforts grows, new haplotypes continue to be discovered, and assignment of function is challenging to maintain. To address this challenge, we have trained a convolutional neural network to predict functional status of CYP2D6 haplotypes, called Hubble.2D6. Hubble.2D6 predicts haplotype function from sequence data and was trained using two pre-training steps with a combination of real and simulated data. We find that Hubble.2D6 predicts CYP2D6 haplotype functional status with 88% accuracy in a held-out test set and explains 47.5% of the variance in in vitro functional data among star alleles with unknown function. Hubble.2D6 may be a useful tool for assigning function to haplotypes with uncurated function and for screening individuals who are at risk of being poor metabolizers.
View details for DOI 10.1371/journal.pcbi.1008399
View details for PubMedID 33137098
-
OrderRex clinical user testing: a randomized trial of recommender system decision support on simulated cases.
Journal of the American Medical Informatics Association : JAMIA
2020
Abstract
OBJECTIVE: To assess usability and usefulness of a machine learning-based order recommender system applied to simulated clinical cases. MATERIALS AND METHODS: 43 physicians entered orders for 5 simulated clinical cases using a clinical order entry interface with or without access to a previously developed automated order recommender system. Cases were randomly allocated to the recommender system in a 3:2 ratio. A panel of clinicians scored whether the orders placed were clinically appropriate. Our primary outcome included the difference in clinical appropriateness scores. Secondary outcomes included total number of orders, case time, and survey responses. RESULTS: Clinical appropriateness scores per order were comparable for cases randomized to the order recommender system (mean difference -0.11 order per score, 95% CI: [-0.41, 0.20]). Physicians using the recommender placed more orders (median 16 vs 15 orders, incidence rate ratio 1.09, 95% CI: [1.01-1.17]). Case times were comparable with the recommender system. Order suggestions generated from the recommender system were more likely to match physician needs than standard manual search options. Physicians used recommender suggestions in 98% of available cases. Approximately 95% of participants agreed the system would be useful for their workflows. DISCUSSION: User testing with a simulated electronic medical record interface can assess the value of machine learning and clinical decision support tools for clinician usability and acceptance before live deployments. CONCLUSIONS: Clinicians can use and accept machine learned clinical order recommendations integrated into an electronic order entry interface in a simulated setting. The clinical appropriateness of orders entered was comparable even when supported by automated recommendations.
View details for DOI 10.1093/jamia/ocaa190
View details for PubMedID 33106874
-
MARS: discovering novel cell types across heterogeneous single-cell experiments.
Nature methods
2020
Abstract
Although tremendous effort has been put into cell-type annotation, identification of previously uncharacterized cell types in heterogeneous single-cell RNA-seq data remains a challenge. Here we present MARS, a meta-learning approach for identifying and annotating known as well as new cell types. MARS overcomes the heterogeneity of cell types by transferring latent cell representations across multiple datasets. MARS uses deep learning to learn a cell embedding function as well as a set of landmarks in the cell embedding space. The method has a unique ability to discover cell types that have never been seen before and annotate experiments that are as yet unannotated. We apply MARS to a large mouse cell atlas and show its ability to accurately identify cell types, even when it has never seen them before. Further, MARS automatically generates interpretable names for new cell types by probabilistically defining a cell type in the embedding space.
View details for DOI 10.1038/s41592-020-00979-3
View details for PubMedID 33077966
-
PharmGKB tutorial for pharmacogenomics of drugs potentially used in the context of COVID-19.
Clinical pharmacology and therapeutics
2020
Abstract
Pharmacogenomics is a key area of precision medicine which is already being implemented in some health systems and may help guide clinicians towards effective therapies for individual patients. Over the last two decades, the Pharmacogenomics Knowledgebase (PharmGKB) has built a unique repository of pharmacogenomic knowledge, including annotations of clinical guideline and regulator-approved drug labels in addition to evidence-based drug pathways and annotations of the scientific literature. All of this knowledge is freely accessible on the PharmGKB website. In the first of a series of PharmGKB tutorials, we introduce the PharmGKB COVID-19 portal and, using examples of drugs found in the portal, demonstrate some of the main features of PharmGKB. This paper is intended as a resource to help users become quickly acquainted with the wealth of information stored in PharmGKB.
View details for DOI 10.1002/cpt.2067
View details for PubMedID 32978778
-
Sex-specific genetic effects across biomarkers.
European journal of human genetics : EJHG
2020
Abstract
Sex differences have been shown in laboratory biomarkers; however, the extent to which this is due to genetics is unknown. In this study, we infer sex-specific genetic parameters (heritability and genetic correlation) across 33 quantitative biomarker traits in 181,064 females and 156,135 males from the UK Biobank study. We apply a Bayesian Mixture Model, Sex Effects Mixture Model (SEMM), to Genome-wide Association Study summary statistics in order to (1) estimate the contributions of sex to the genetic variance of these biomarkers and (2) identify variants whose statistical association with these traits is sex-specific. We find that the genetics of most biomarker traits are shared between males and females, with the notable exception of testosterone, where we identify 119 female and 445 male-specific variants. These include protein-altering variants in steroid hormone production genes (POR, UGT2B7). Using the sex-specific variants as genetic instruments for Mendelian randomization, we find evidence for causal links between testosterone levels and height, body mass index, waist and hip circumference, and type 2 diabetes. We also show that sex-specific polygenic risk score models for testosterone outperform a combined model. Overall, these results demonstrate that while sex has a limited role in the genetics of most biomarker traits, sex plays an important role in testosterone genetics.
View details for DOI 10.1038/s41431-020-00712-w
View details for PubMedID 32873964
-
Scientific considerations for global drug development.
Science translational medicine
2020; 12 (554)
Abstract
Requiring regional or in-country confirmatory clinical trials before approval of drugs already approved elsewhere delays access to medicines in low- and middle-income countries and raises drug costs. Here, we discuss the scientific and technological advances that may reduce the need for in-country or in-region clinical trials for drugs approved in other countries and limitations of these advances that could necessitate in-region clinical studies.
View details for DOI 10.1126/scitranslmed.aax2550
View details for PubMedID 32727913
-
Gaussian embedding for large-scale gene set analysis
NATURE MACHINE INTELLIGENCE
2020; 2 (7): 387–95
View details for DOI 10.1038/s42256-020-0193-2
View details for Web of Science ID 000567371800007
-
Gaussian Embedding for Large-scale Gene Set Analysis.
Nature machine intelligence
2020; 2 (7): 387-395
Abstract
Gene sets, including protein complexes and signaling pathways, have proliferated greatly, in large part as a result of high-throughput biological data. Leveraging gene sets to gain insight into biological discovery requires computational methods for converting them into a useful form for available machine learning models. Here, we study the problem of embedding gene sets as compact features that are compatible with available machine learning codes. We present Set2Gaussian, a novel network-based gene set embedding approach, which represents each gene set as a multivariate Gaussian distribution rather than a single point in the low-dimensional space, according to the proximity of these genes in a protein-protein interaction network. We demonstrate that Set2Gaussian improves gene set member identification, accurately stratifies tumors, and finds concise gene sets for gene set enrichment analysis. We further show how Set2Gaussian allows us to identify a previously unknown clinical prognostic and predictive subnetwork around NEFM in sarcoma, which we validate in independent cohorts.
View details for DOI 10.1038/s42256-020-0193-2
View details for PubMedID 32968711
View details for PubMedCentralID PMC7505077
-
Homology modeling of TMPRSS2 yields candidate drugs that may inhibit entry of SARS-CoV-2 into human cells.
ChemRxiv : the preprint server for chemistry
2020
Abstract
The most rapid path to discovering treatment options for the novel coronavirus SARS-CoV-2 is to find existing medications that are active against the virus. We have focused on identifying repurposing candidates for the transmembrane serine protease family member II (TMPRSS2), which is critical for entry of coronaviruses into cells. Using known 3D structures of close homologs, we created seven homology models. We also identified a set of serine protease inhibitor drugs, generated several conformations of each, and docked them into our models. We used three known chemical (non-drug) inhibitors and one validated inhibitor of TMPRSS2 in MERS as benchmark compounds and found six compounds with predicted high binding affinity in the range of the known inhibitors. We also showed that a previously published weak inhibitor, Camostat, had a significantly lower binding score than our six compounds. All six compounds are anticoagulants with significant and potentially dangerous clinical effects and side effects. Nonetheless, if these compounds significantly inhibit SARS-CoV-2 infection, they could represent a potentially useful clinical tool.
View details for DOI 10.26434/chemrxiv.12009582
View details for PubMedID 32511288
View details for PubMedCentralID PMC7263764
-
PharmGKB summary: lamotrigine pathway, pharmacokinetics and pharmacodynamics.
Pharmacogenetics and genomics
2020
View details for DOI 10.1097/FPC.0000000000000397
View details for PubMedID 32187155
-
A Literature-Based Knowledge Graph Embedding Method for Identifying Drug Repurposing Opportunities in Rare Diseases.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2020; 25: 463–74
Abstract
Millions of Americans are affected by rare diseases, many of which have poor survival rates. However, the small market size of individual rare diseases, combined with the time and capital requirements of pharmaceutical R&D, has hindered the development of new drugs for these cases. A promising alternative is drug repurposing, whereby existing FDA-approved drugs might be used to treat diseases different from their original indications. In order to generate drug repurposing hypotheses in a systematic and comprehensive fashion, it is essential to integrate information from across the literature of pharmacology, genetics, and pathology. To this end, we leverage a newly developed knowledge graph, the Global Network of Biomedical Relationships (GNBR). GNBR is a large, heterogeneous knowledge graph comprising drug, disease, and gene (or protein) entities linked by a small set of semantic themes derived from the abstracts of biomedical literature. We apply a knowledge graph embedding method that explicitly models the uncertainty associated with literature-derived relationships and uses link prediction to generate drug repurposing hypotheses. This approach achieves high performance on a gold-standard test set of known drug indications (AUROC = 0.89) and is capable of generating novel repurposing hypotheses, which we independently validate using external literature sources and protein interaction networks. Finally, we demonstrate the ability of our model to produce explanations of its predictions.
View details for PubMedID 31797619
-
Analyzing the vast coronavirus literature with CoronaCentral.
bioRxiv : the preprint server for biology
2020
Abstract
The global SARS-CoV-2 pandemic has caused a surge in research exploring all aspects of the virus and its effects on human health. The overwhelming rate of publications means that human researchers are unable to keep abreast of the research. To ameliorate this, we present the CoronaCentral resource, which uses machine learning to process the research literature on SARS-CoV-2 along with articles on SARS-CoV and MERS-CoV. We break the literature down into useful categories and enable analysis of the contents, pace, and emphasis of research during the crisis. These categories cover therapeutics and forecasting, as well as growing areas such as "Long Covid" and studies of inequality and misinformation. Using this data, we compare the topics that appear in original research articles with those in commentaries and other article types. Finally, using Altmetric data, we identify the topics that have gained the most media attention. This resource, available at https://coronacentral.ai, is updated multiple times per day and provides an easy-to-navigate system to find papers in different categories, focussing on different aspects of the virus along with currently trending articles.
View details for DOI 10.1101/2020.12.21.423860
View details for PubMedID 33398279
View details for PubMedCentralID PMC7781314
-
Pathway and network embedding methods for prioritizing psychiatric drugs
WORLD SCIENTIFIC PUBL CO PTE LTD. 2020: 671-682
View details for Web of Science ID 000702064500059
-
A Literature-Based Knowledge Graph Embedding Method for Identifying Drug Repurposing Opportunities in Rare Diseases
WORLD SCIENTIFIC PUBL CO PTE LTD. 2020: 463-474
View details for Web of Science ID 000702064500041
-
PGxMine: Text mining for curation of PharmGKB
WORLD SCIENTIFIC PUBL CO PTE LTD. 2020: 611-622
View details for Web of Science ID 000702064500054
-
Variant Interpretation in Current Pharmacogenetic Testing.
Journal of personalized medicine
2020; 10 (4)
Abstract
In the current marketplace, there are now more than a dozen commercial companies providing pharmacogenetic tests. The companies vary in the panel of genes they test and the variants they are able to screen for. The reports generated by these companies provide phenotypic interpretations of pharmacogenes and clinically actionable gene-drug interactions based on internally curated data and proprietary algorithms. The freedom to choose the types of evidence to include versus exclude in interpreting genomics has created reporting discrepancies in the industry. The case report presented here reveals the discordant phenotype analysis provided by two pharmacogenetic testing companies. The uncertainty and unnecessary distress experienced by the patient highlight the need for consensus in phenotype reporting within the industry.
View details for DOI 10.3390/jpm10040204
View details for PubMedID 33142667
-
Extending TextAE for annotation of non-contiguous entities.
Genomics & informatics
2020; 18 (2): e15
Abstract
Named entity recognition tools are used to identify mentions of biomedical entities in free text and are essential components of high-quality information retrieval and extraction systems. Without good entity recognition, methods will mislabel searched text and will miss important information or identify spurious text that will frustrate users. Most tools do not capture non-contiguous entities, which are separate spans of text that together refer to an entity, e.g., the entity "type 1 diabetes" in the phrase "type 1 and type 2 diabetes." This type is commonly found in biomedical texts, especially in lists, where multiple biomedical entities are named in shortened form to avoid repeating words. Most text annotation systems that enable users to view and edit entity annotations do not support non-contiguous entities. Therefore, experts cannot even visualize non-contiguous entities, let alone annotate them to build valuable datasets for machine learning methods. To combat this problem and as part of the BLAH6 hackathon, we extended the TextAE platform to allow visualization and annotation of non-contiguous entities. This enables users to add new subspans to existing entities by selecting additional text. We integrate this new functionality with TextAE's existing editing functionality to allow easy changes to entity annotation and editing of relation annotations involving non-contiguous entities, with importing and exporting to the PubAnnotation format. Finally, we roughly quantify the problem across the entire accessible biomedical literature to highlight that there are a substantial number of non-contiguous entities that appear in lists that would be missed by most text mining systems.
View details for DOI 10.5808/GI.2020.18.2.e15
View details for PubMedID 32634869
-
Geographic Distribution of US Cohorts Used to Train Deep Learning Algorithms.
JAMA
2020; 324 (12): 1212–13
View details for DOI 10.1001/jama.2020.12067
View details for PubMedID 32960230
-
Pharmacogenomics in Asian subpopulations and impacts on commonly prescribed medications.
Clinical and translational science
2020
Abstract
Asians as a group comprise over 60% of the world's population. There is an incredible amount of diversity in Asian and admixed populations that has not been studied in a pharmacogenetic context. The known pharmacogenetic differences in Asian subgroups generally represent previously known variants that are present at much lower or higher frequencies in Asians compared to other populations. This review aims to summarize the main drugs and known genes that appear to have differences in their pharmacogenetic properties in certain Asian populations. Evidence-based guidelines and summary statistics from the Food and Drug Administration (FDA) and the Clinical Pharmacogenetics Implementation Consortium (CPIC) were analyzed for ethnic differences in outcomes. Implicated drugs included commonly prescribed drugs such as warfarin, clopidogrel, carbamazepine, and allopurinol. The majority of these associations are due to Asians more commonly being CYP2C19 poor metabolizers and carriers of the HLA-B*15:02 allele. The relative risk increase varied between genes and drugs but could exceed 100-fold in Asians, such as the 172-fold increase in risk of SJS and TEN with carbamazepine use among HLA-B*15:02 carriers. The effects ranged from relatively benign reactions such as reduced drug efficacy to severe cutaneous reactions. These reactions are severe and prevalent enough to warrant pharmacogenetic testing and appropriate changes in dosing and medication choice for at-risk populations. Further studies should be done on Asian cohorts to more fully understand pharmacogenetic variants in these populations and how such differences may influence drug response.
View details for DOI 10.1111/cts.12771
View details for PubMedID 32100936
-
Classifying non-small cell lung cancer types and transcriptomic subtypes using convolutional neural networks.
Journal of the American Medical Informatics Association : JAMIA
2020; 27 (5): 757–69
Abstract
Non-small cell lung cancer is a leading cause of cancer death worldwide, and histopathological evaluation plays the primary role in its diagnosis. However, the morphological patterns associated with the molecular subtypes have not been systematically studied. To bridge this gap, we developed a quantitative histopathology analytic framework to identify the types and gene expression subtypes of non-small cell lung cancer objectively. We processed whole-slide histopathology images of lung adenocarcinoma (n = 427) and lung squamous cell carcinoma patients (n = 457) in the Cancer Genome Atlas. We built convolutional neural networks to classify histopathology images, evaluated their performance by the areas under the receiver-operating characteristic curves (AUCs), and validated the results in an independent cohort (n = 125). To establish neural networks for quantitative image analyses, we first built convolutional neural network models to identify tumor regions from adjacent dense benign tissues (AUCs > 0.935) and recapitulated expert pathologists' diagnosis (AUCs > 0.877), with the results validated in an independent cohort (AUCs = 0.726-0.864). We further demonstrated that quantitative histopathology morphology features identified the major transcriptomic subtypes of both adenocarcinoma and squamous cell carcinoma (P < .01). Our study is the first to classify the transcriptomic subtypes of non-small cell lung cancer using fully automated machine learning methods. Our approach does not rely on prior pathology knowledge and can discover novel clinically relevant histopathology patterns objectively. The developed procedure is generalizable to other tumor types or diseases.
View details for DOI 10.1093/jamia/ocz230
View details for PubMedID 32364237
-
Extracting chemical reactions from text using Snorkel.
BMC bioinformatics
2020; 21 (1): 217
Abstract
Enzymatic and chemical reactions are key for understanding biological processes in cells. Curated databases of chemical reactions exist but these databases struggle to keep up with the exponential growth of the biomedical literature. Conventional text mining pipelines provide tools to automatically extract entities and relationships from the scientific literature, and partially replace expert curation, but such machine learning frameworks often require a large amount of labeled training data and thus lack scalability for both larger document corpora and new relationship types. We developed an application of Snorkel, a weakly supervised learning framework, for extracting chemical reaction relationships from biomedical literature abstracts. For this work, we defined a chemical reaction relationship as the transformation of chemical A to chemical B. We built and evaluated our system on small annotated sets of chemical reaction relationships from two corpora: curated bacteria-related abstracts from the MetaCyc database (MetaCyc_Corpus) and a more general set of abstracts annotated with MeSH (Medical Subject Headings) term Bacteria (Bacteria_Corpus; a superset of MetaCyc_Corpus). For the MetaCyc_Corpus, we obtained 84% precision and 41% recall (55% F1 score). Extending to the more general Bacteria_Corpus decreased precision to 62% with only a four-point drop in recall to 37% (46% F1 score). Overall, the Bacteria_Corpus contained two orders of magnitude more candidate chemical reaction relationships (nine million candidates vs 68,000 candidates) and had a larger class imbalance (2.5% positives vs 5% positives) as compared to the MetaCyc_Corpus. In total, we extracted 6871 chemical reaction relationships from nine million candidates in the Bacteria_Corpus. With this work, we built a database of chemical reaction relationships from almost 900,000 scientific abstracts without a large training set of labeled annotations. Further, we showed the generalizability of our initial application built on MetaCyc documents enriched with chemical reactions to a general set of articles related to bacteria.
View details for DOI 10.1186/s12859-020-03542-1
View details for PubMedID 32460703
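As a quick arithmetic check on the figures in the abstract above, the reported F1 scores can be reproduced from the quoted precision and recall, taking F1 as their harmonic mean (the standard definition; the abstract does not state which formula was used, so this is an assumption). The function name `f1_score` below is a hypothetical helper for illustration and is not taken from the paper's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition, assumed here)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract, expressed as fractions.
metacyc_f1 = f1_score(0.84, 0.41)   # MetaCyc_Corpus
bacteria_f1 = f1_score(0.62, 0.37)  # Bacteria_Corpus

print(f"MetaCyc_Corpus F1:  {metacyc_f1:.2f}")   # 0.55
print(f"Bacteria_Corpus F1: {bacteria_f1:.2f}")  # 0.46
```

Both values round to the 55% and 46% stated in the abstract, so the reported numbers are internally consistent under this definition.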
-
Genome-wide association study of platelet reactivity and cardiovascular response in patients treated with clopidogrel: a study by the International Clopidogrel Pharmacogenomics Consortium (ICPC).
Clinical pharmacology and therapeutics
2020
Abstract
Antiplatelet response to clopidogrel shows wide variation, and poor response is correlated with adverse clinical outcomes. CYP2C19 loss-of-function alleles play an important role in this response, but account for only a small proportion of variability in response to clopidogrel. An aim of the International Clopidogrel Pharmacogenomics Consortium (ICPC) is to identify other genetic determinants of clopidogrel pharmacodynamics and clinical response. A genome-wide association study (GWAS) was performed using DNA from 2,750 European ancestry individuals, using adenosine diphosphate (ADP) induced platelet reactivity and major cardiovascular and cerebrovascular events as outcome parameters. GWAS for platelet reactivity revealed a strong signal for CYP2C19*2 (p-value=1.67e-33). After correction for CYP2C19*2 no other SNP reached genome-wide significance. GWAS for a combined clinical endpoint of cardiovascular death, myocardial infarction, or stroke (5.0% event rate) or a combined endpoint of cardiovascular death or myocardial infarction (4.7% event rate) showed no significant results, although in coronary artery disease, percutaneous coronary intervention, and acute coronary syndrome subgroups, mutations in SCOS5P1, CDC42BPA and CTRAC1 showed genome-wide significance (lowest p-values: 1.07e-09, 4.53e-08 and 2.60e-10, respectively). CYP2C19*2 is the strongest genetic determinant of on-clopidogrel platelet reactivity. We identified three novel associations in clinical outcome subgroups, suggestive for each of these outcomes.
View details for DOI 10.1002/cpt.1911
View details for PubMedID 32472697
-
PGxMine: Text mining for curation of PharmGKB.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2020; 25: 611–22
Abstract
Precision medicine tailors treatment to individuals' personal data including differences in their genome. The Pharmacogenomics Knowledgebase (PharmGKB) provides highly curated information on the effect of genetic variation on drug response and side effects for a wide range of drugs. PharmGKB's scientific curators triage, review and annotate a large number of papers each year but the task is challenging. We present the PGxMine resource, a text-mined resource of pharmacogenomic associations from all accessible published literature to assist in the curation of PharmGKB. We developed a supervised machine learning pipeline to extract associations between a variant (DNA and protein changes, star alleles and dbSNP identifiers) and a chemical. PGxMine covers 452 chemicals and 2,426 variants and contains 19,930 mentions of pharmacogenomic associations across 7,170 papers. An evaluation by PharmGKB curators found that 57 of the top 100 associations not found in PharmGKB led to 83 curatable papers and a further 24 associations would likely lead to curatable papers through citations. The results can be viewed at https://pgxmine.pharmgkb.org/ and code can be downloaded at https://github.com/jakelever/pgxmine.
View details for PubMedID 31797632
-
Pathway and network embedding methods for prioritizing psychiatric drugs.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2020; 25: 671–82
Abstract
One in five Americans experience mental illness, and roughly 75% of psychiatric prescriptions do not successfully treat the patient's condition. Extensive evidence implicates genetic factors and signaling disruption in the pathophysiology of these diseases. Changes in transcription often underlie this molecular pathway dysregulation; individual patient transcriptional data can improve the efficacy of diagnosis and treatment. Recent large-scale genomic studies have uncovered shared genetic modules across multiple psychiatric disorders - providing an opportunity for an integrated multi-disease approach for diagnosis. Moreover, network-based models informed by gene expression can represent pathological biological mechanisms and suggest new genes for diagnosis and treatment. Here, we use patient gene expression data from multiple studies to classify psychiatric diseases, integrate knowledge from expert-curated databases and publicly available experimental data to create augmented disease-specific gene sets, and use these to recommend disease-relevant drugs. From Gene Expression Omnibus, we extract expression data from 145 cases of schizophrenia, 82 cases of bipolar disorder, 190 cases of major depressive disorder, and 307 shared controls. We use pathway-based approaches to predict psychiatric disease diagnosis with a random forest model (78% accuracy) and derive important features to augment available drug and disease signatures. Using protein-protein-interaction networks and embedding-based methods, we build a pipeline to prioritize treatments for psychiatric diseases that achieves a 3.4-fold improvement over a background model. Thus, we demonstrate that gene-expression-derived pathway features can diagnose psychiatric diseases and that molecular insights derived from this classification task can inform treatment prioritization for psychiatric diseases.
View details for PubMedID 31797637
-
Wiring Minds: Successfully applying AI to biomedicine requires innovators trained in contrasting cultures
NATURE
2019; 576 (7787): S62–S63
View details for Web of Science ID 000513817800002
View details for PubMedID 31853071
-
PharmGKB summary: very important pharmacogene information for CACNA1S.
Pharmacogenetics and genomics
2019
View details for DOI 10.1097/FPC.0000000000000393
View details for PubMedID 31851124
-
PharmGKB summary: sertraline pathway, pharmacokinetics.
Pharmacogenetics and genomics
2019
View details for DOI 10.1097/FPC.0000000000000392
View details for PubMedID 31851125
-
Maria-I: A Deep-Learning Approach for Accurate Prediction of MHC Class I Tumor Neoantigen Presentation
AMER SOC HEMATOLOGY. 2019
View details for DOI 10.1182/blood-2019-129334
View details for Web of Science ID 000518218500130
-
Retro-2 protects cells from ricin toxicity by inhibiting ASNA1-mediated ER targeting and insertion of tail-anchored proteins.
eLife
2019; 8
Abstract
The small molecule Retro-2 prevents ricin toxicity through a poorly-defined mechanism of action (MOA), which involves halting retrograde vesicle transport to the endoplasmic reticulum (ER). CRISPRi genetic interaction analysis revealed Retro-2 activity resembles disruption of the transmembrane domain recognition complex (TRC) pathway, which mediates post-translational ER-targeting and insertion of tail-anchored (TA) proteins, including SNAREs required for retrograde transport. Cell-based and in vitro assays show that Retro-2 blocks delivery of newly-synthesized TA-proteins to the ER-targeting factor ASNA1 (TRC40). An ASNA1 point mutant identified using CRISPR-mediated mutagenesis abolishes both the cytoprotective effect of Retro-2 against ricin and its inhibitory effect on ASNA1-mediated ER-targeting. Together, our work explains how Retro-2 prevents retrograde trafficking of toxins by inhibiting TA-protein targeting, describes a general CRISPR strategy for predicting the MOA of small molecules, and paves the way for drugging the TRC pathway to treat broad classes of viruses known to be inhibited by Retro-2.
View details for DOI 10.7554/eLife.48434
View details for PubMedID 31674906
-
Successfully applying AI to biomedicine requires innovators trained in contrasting cultures
NATURE
2019; 574 (7779): S62–S63
View details for Web of Science ID 000509545200002
-
RedMed: Extending drug lexicons for social media applications.
Journal of biomedical informatics
2019: 103307
Abstract
Social media has been identified as a promising potential source of information for pharmacovigilance. The adoption of social media data has been hindered by the massive and noisy nature of the data. Initial attempts to use social media data have relied on exact text matches to drugs of interest, and therefore suffer from the gap between formal drug lexicons and the informal nature of social media. The Reddit comment archive represents an ideal corpus for bridging this gap. We trained a word embedding model, RedMed, to facilitate the identification and retrieval of health entities from Reddit data. We compare the performance of our model trained on a consumer-generated corpus against publicly available models trained on expert-generated corpora. Our automated classification pipeline achieves an accuracy of 0.88 and a specificity of > 0.9 across four different term classes. Of all drug mentions, an average of 79% (±0.5%) were exact matches to a generic or trademark drug name, 14% (±0.5%) were misspellings, 6.4% (±0.3%) were synonyms, and 0.13% (±0.05%) were pill marks. We find that our system captures an additional 20% of mentions; these would have been missed by approaches that rely solely on exact string matches. We provide a lexicon of misspellings and synonyms for 2,978 drugs and a word embedding model trained on a health-oriented subset of Reddit.
View details for DOI 10.1016/j.jbi.2019.103307
View details for PubMedID 31627020
-
Atrial Fibrillation Burden Signature and Near-Term Prediction of Stroke: A Machine Learning Analysis.
Circulation. Cardiovascular quality and outcomes
2019; 12 (10): e005595
Abstract
BACKGROUND: Atrial fibrillation (AF) increases the risk of stroke 5-fold and there is rising interest to determine if AF severity or burden can further risk stratify these patients, particularly for near-term events. Using continuous remote monitoring data from cardiac implantable electronic devices, we sought to evaluate if machine learned signatures of AF burden could provide prognostic information on near-term risk of stroke when compared to conventional risk scores. METHODS AND RESULTS: We retrospectively identified Veterans Health Administration serviced patients with cardiac implantable electronic device remote monitoring data and at least one day of device-registered AF. The first 30 days of remote monitoring in nonstroke controls were compared against the past 30 days of remote monitoring before stroke in cases. We trained 3 types of models on our data: (1) convolutional neural networks, (2) random forest, and (3) L1 regularized logistic regression (LASSO). We calculated the CHA2DS2-VASc score for each patient and compared its performance against machine learned indices based on AF burden in separate test cohorts. Finally, we investigated the effect of combining our AF burden models with CHA2DS2-VASc. We identified 3114 nonstroke controls and 71 stroke cases, with no significant differences in baseline characteristics. Random forest performed the best in the test data set (area under the curve [AUC]=0.662) and convolutional neural network in the validation data set (AUC=0.702), whereas CHA2DS2-VASc had an AUC of 0.5 or less in both data sets. Combining CHA2DS2-VASc with random forest and convolutional neural network yielded a validation AUC of 0.696 and test AUC of 0.634, yielding the highest average AUC on nontraining data. CONCLUSIONS: This proof-of-concept study found that machine learning and ensemble methods that incorporate daily AF burden signature provided incremental prognostic value for risk stratification beyond CHA2DS2-VASc for near-term risk of stroke.
View details for DOI 10.1161/CIRCOUTCOMES.118.005595
View details for PubMedID 31610712
-
Pharmacogenomic Polygenic Response Score Predicts Ischemic Events and Cardiovascular Mortality in Clopidogrel-Treated Patients.
European heart journal. Cardiovascular pharmacotherapy
2019
Abstract
AIMS: Clopidogrel is prescribed for the prevention of atherothrombotic events. While investigations have identified genetic determinants of inter-individual variability in on-treatment platelet inhibition (e.g. CYP2C19*2), evidence that these variants have clinical utility to predict major adverse cardiovascular events remains controversial. METHODS AND RESULTS: We assessed the impact of 31 candidate gene polymorphisms on ADP-stimulated platelet reactivity in 3,391 clopidogrel-treated coronary artery disease patients of the International Clopidogrel Pharmacogenomics Consortium (ICPC). The influence of these polymorphisms on cardiovascular events (CVE) was tested in 2,134 ICPC patients (N=129 events) in whom clinical event data were available. Several variants were associated with on-treatment ADP-stimulated platelet reactivity (CYP2C19*2, P=8.8×10^-54; CES1 G143E, P=1.3×10^-16; CYP2C19*17, P=9.5×10^-10; CYP2B6 1294+53C>T, P=3.0×10^-4; CYP2B6 516G>T, P=1.0×10^-3; CYP2C9*2, P=1.2×10^-3; and CYP2C9*3, P=1.5×10^-3). While no individual variant was associated with CVEs, generation of a pharmacogenomic polygenic response score (PgxRS) revealed that patients who carried a greater number of alleles associated with increased on-treatment platelet reactivity were more likely to experience CVEs (beta=0.17, SE 0.06, P=0.01) and cardiovascular-related death (beta=0.43, SE 0.16, P=0.007). Patients who carried 8 or more risk alleles were significantly more likely to experience CVEs (OR=1.78, 95%CI 1.14-2.76, P=0.01) and cardiovascular death (OR=4.39, 95%CI 1.35-14.27, P=0.01) compared to patients who carried 6 or fewer of these alleles. CONCLUSION: Several polymorphisms impact clopidogrel response and PgxRS is a predictor of cardiovascular outcomes. Additional investigations that identify novel determinants of clopidogrel response and validate polygenic models may facilitate future precision medicine strategies.
View details for DOI 10.1093/ehjcvp/pvz045
View details for PubMedID 31504375
-
PharmGKB summary: methylphenidate pathway, pharmacokinetics/pharmacodynamics
PHARMACOGENETICS AND GENOMICS
2019; 29 (6): 136–54
View details for DOI 10.1097/FPC.0000000000000376
View details for Web of Science ID 000474099000003
-
Pharmacogenomics Clinical Annotation Tool (PharmCAT).
Clinical pharmacology and therapeutics
2019
Abstract
Pharmacogenomics (PGx) decision support and return of results is an active area of precision medicine. One challenge of implementing PGx is extracting genomic variants and assigning haplotypes in order to apply prescribing recommendations and information from CPIC, FDA, PharmGKB, etc. PharmCAT (1) extracts variants specified in guidelines from a genetic dataset derived from sequencing or genotyping technologies; (2) infers haplotypes and diplotypes; and (3) generates a report containing genotype/diplotype-based annotations and guideline recommendations. We describe PharmCAT and a pilot validation project comparing results for 1000 Genomes sequences of Coriell samples with corresponding Genetic Testing Reference Materials Coordination Program (GeT-RM) sample characterization. PharmCAT was highly concordant with the GeT-RM data. PharmCAT is available in GitHub to evaluate, test and report results back to the community. As precision medicine becomes more prevalent, our ability to consistently, accurately, and clearly define and report PGx annotations and prescribing recommendations is critical.
View details for DOI 10.1002/cpt.1568
View details for PubMedID 31306493
-
PharmGKB summary: Ondansetron and tropisetron pathways, pharmacokinetics and pharmacodynamics
PHARMACOGENETICS AND GENOMICS
2019; 29 (4): 91–97
View details for DOI 10.1097/FPC.0000000000000369
View details for Web of Science ID 000466783000004
-
Effect of CYP4F2, VKORC1, and CYP2C9 in Influencing Coumarin Dose: A Single-Patient Data Meta-Analysis in More Than 15,000 Individuals
CLINICAL PHARMACOLOGY & THERAPEUTICS
2019; 105 (6): 1477–91
View details for DOI 10.1002/cpt.1323
View details for Web of Science ID 000467751900030
-
Predicting venous thromboembolism risk from exomes in the Critical Assessment of Genome Interpretation (CAGI) challenges.
Human mutation
2019
Abstract
Genetics play a key role in venous thromboembolism (VTE) risk; however, established risk factors in European populations do not translate to individuals of African descent due to differences in allele frequencies between populations. As part of the fifth iteration of the Critical Assessment of Genome Interpretation, participants were asked to predict VTE status in exome data from African American subjects. Participants were provided with 103 unlabeled exomes from patients treated with warfarin for non-VTE causes or VTE and asked to predict which disease each subject had been treated for. Given the lack of training data, many participants opted to use unsupervised machine learning methods, clustering the exomes by variation in genes known to be associated with VTE. The best performing method using only VTE related genes achieved an AUC of 0.65. Here we discuss the range of methods used in the prediction of VTE from sequence data and explore some of the difficulties of conducting a challenge with known confounders. Additionally, we show that an existing genetic risk score for VTE that was developed in European subjects works well in African Americans.
View details for DOI 10.1002/humu.23825
View details for PubMedID 31140652
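For readers unfamiliar with the AUC figure quoted above, the sketch below computes it in its pairwise form: the fraction of (case, control) pairs in which the case receives the higher score, with ties counted as half. The scores and labels are hypothetical and are not CAGI challenge data, and the challenge's own evaluation code is not described in the abstract.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(cases) * len(controls))

# Hypothetical predicted scores and true labels (1 = case, 0 = control).
example_auc = auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which gives context for a reported value of 0.65.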
-
Standardized Biogeographic Grouping System for Annotating Populations in Pharmacogenetic Research
CLINICAL PHARMACOLOGY & THERAPEUTICS
2019; 105 (5): 1256–62
View details for DOI 10.1002/cpt.1322
View details for Web of Science ID 000466750900030
-
High precision protein functional site detection using 3D convolutional neural networks.
Bioinformatics (Oxford, England)
2019; 35 (9): 1503–12
Abstract
MOTIVATION: Accurate annotation of protein functions is fundamental for understanding molecular and cellular physiology. Data-driven methods hold promise for systematically deriving rules underlying the relationship between protein structure and function. However, the choice of protein structural representation is critical. Pre-defined biochemical features emphasize certain aspects of protein properties while ignoring others, and therefore may fail to capture critical information in complex protein sites. RESULTS: In this paper, we present a general framework that applies 3D convolutional neural networks (3DCNNs) to structure-based protein functional site detection. The framework can extract task-dependent features automatically from the raw atom distributions. We benchmarked our method against other methods and demonstrate better or comparable performance for site detection. Our deep 3DCNNs achieved an average recall of 0.955 at a precision threshold of 0.99 on PROSITE families, detected 98.89 and 92.88% of nitric oxide synthase and TRYPSIN-like enzyme sites in Catalytic Site Atlas, and showed good performance on challenging cases where sequence motifs are absent but a function is known to exist. Finally, we inspected the individual contributions of each atom to the classification decisions and show that our models successfully recapitulate known 3D features within protein functional sites. AVAILABILITY AND IMPLEMENTATION: The 3DCNN models described in this paper are available at https://simtk.org/projects/fscnn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
View details for PubMedID 31051039
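The "recall at a precision threshold" metric quoted above can be illustrated with a minimal sketch. It assumes one common definition (the highest recall over all score cutoffs whose precision meets the threshold), which may differ from the paper's exact procedure; the scores and labels are hypothetical.

```python
def recall_at_precision(scores, labels, min_precision):
    """Highest recall over all score cutoffs whose precision is >= min_precision."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive example is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        # Consume every example tied at this score so each cutoff is well defined.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            j += 1
        precision = true_pos / (true_pos + false_pos)
        if precision >= min_precision:
            best_recall = max(best_recall, true_pos / total_positives)
        i = j
    return best_recall

# Hypothetical classifier scores and true labels (1 = functional site).
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
example_labels = [1, 1, 0, 1, 0]
strict = recall_at_precision(example_scores, example_labels, 0.99)
relaxed = recall_at_precision(example_scores, example_labels, 0.75)
```

In this toy example the strict threshold admits only the two top-ranked sites, while relaxing it recovers the third at the cost of one false positive.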
-
PharmGKB summary: methylphenidate pathway, pharmacokinetics/pharmacodynamics.
Pharmacogenetics and genomics
2019
View details for PubMedID 30950912
-
Research Projects Supported by the University of California, San Francisco-Stanford Center of Excellence in Regulatory Science and Innovation
CLINICAL PHARMACOLOGY & THERAPEUTICS
2019; 105 (4): 815–18
View details for DOI 10.1002/cpt.1308
View details for Web of Science ID 000461888300013
-
Pocket similarity identifies selective estrogen receptor modulators as microtubule modulators at the taxane site
NATURE COMMUNICATIONS
2019; 10
View details for DOI 10.1038/s41467-019-08965-w
View details for Web of Science ID 000460125400015
-
Pharmacogenomics in dermatology: tools for understanding gene-drug associations.
Seminars in cutaneous medicine and surgery
2019; 38 (1): E19–E24
Abstract
Pharmacogenomics aims to associate human genetic variability with differences in drug phenotypes in order to tailor drug treatment to individual patients. The massive amount of genetic data generated from large cohorts of patients with variable drug phenotypes has led to advances in this field. Understanding the application of pharmacogenomics in dermatology could inform clinical practice and provide insight for future research. The Pharmacogenomics Knowledge Base and the Clinical Pharmacogenetics Implementation Consortium are among the resources to help clinicians and researchers navigate the many gene-drug associations that have already been discovered. The implementation of clinical pharmacogenomics within health care systems remains an area of ongoing development. This review provides an introduction to the field of pharmacogenomics and to current pharmacogenomics resources using examples of gene-drug associations relevant to the field of dermatology.
View details for DOI 10.12788/j.sder.2019.009
View details for PubMedID 31051019
-
Research Projects Supported by the University of California, San Francisco-Stanford Center of Excellence in Regulatory Science and Innovation.
Clinical pharmacology and therapeutics
2019
View details for PubMedID 30773618
-
PharmGKB summary: Ondansetron and tropisetron pathways, pharmacokinetics and pharmacodynamics.
Pharmacogenetics and genomics
2019
View details for PubMedID 30672837
-
Computational analysis of kinase inhibitor selectivity using structural knowledge
BIOINFORMATICS
2019; 35 (2): 235–42
View details for DOI 10.1093/bioinformatics/bty582
View details for Web of Science ID 000459314900007
-
The association of obesity and coronary artery disease genes with response to SSRIs treatment in major depression.
Journal of neural transmission (Vienna, Austria : 1996)
2019
Abstract
Selective serotonin reuptake inhibitors (SSRIs) are first-line antidepressants for the treatment of major depressive disorder (MDD). However, treatment response during an initial therapeutic trial is often poor and is difficult to predict. Heterogeneity of response to SSRIs in depressed patients is partly driven by co-occurring somatic disorders such as coronary artery disease (CAD) and obesity. CAD and obesity may also be associated with metabolic side effects of SSRIs. In this study, we assessed the association of CAD and obesity with treatment response to SSRIs in patients with MDD using a polygenic score (PGS) approach. Additionally, we performed cross-trait meta-analyses to pinpoint genetic variants underpinning the relationship of CAD and obesity with SSRIs treatment response. First, PGSs were calculated at different p value thresholds (PT) for obesity and CAD. Next, binary logistic regression was applied to evaluate the association of the PGSs to SSRIs treatment response in a discovery sample (ISPC, N=865), and in a replication cohort (STAR*D, N=1,878). Finally, a cross-trait GWAS meta-analysis was performed by combining summary statistics. We show that the PGSs for CAD and obesity were inversely associated with SSRIs treatment response. At the most significant thresholds, the PGS for CAD and body mass index accounted for 1.3% and 0.8% of the observed variability in treatment response to SSRIs, respectively. In the cross-trait meta-analyses, we identified (1) 14 genetic loci (including NEGR1, CADM2, PMAIP1, PARK2) that are associated with both obesity and SSRIs treatment response; (2) five genetic loci (LINC01412, PHACTR1, CDKN2B, ATXN2, KCNE2) with effects on CAD and SSRIs treatment response. Our findings indicate that the genetic variants of CAD and obesity are linked to SSRIs treatment response in MDD. A better SSRIs treatment response might be achieved through a stratified allocation of treatment for MDD patients with a genetic risk for obesity or CAD.
View details for PubMedID 30610379
-
Essential Characteristics of Pharmacogenomics Study Publications
CLINICAL PHARMACOLOGY & THERAPEUTICS
2019; 105 (1): 86–91
View details for DOI 10.1002/cpt.1279
View details for Web of Science ID 000454618200017
-
A deep learning framework to predict binding preference of RNA constituents on protein surface.
Nature communications
2019; 10 (1): 4941
Abstract
Protein-RNA interaction plays important roles in post-transcriptional regulation. However, the task of predicting these interactions given a protein structure is difficult. Here we show that, by leveraging a deep learning model NucleicNet, attributes such as binding preference of RNA backbone constituents and different bases can be predicted from local physicochemical characteristics of protein structure surface. On a diverse set of challenging RNA-binding proteins, including Fem-3-binding-factor 2, Argonaute 2 and Ribonuclease III, NucleicNet can accurately recover interaction modes discovered by structural biology experiments. Furthermore, we show that, without seeing any in vitro or in vivo assay data, NucleicNet can still achieve consistency with experiments, including RNAcompete, Immunoprecipitation Assay, and siRNA Knockdown Benchmark. NucleicNet can thus serve to provide quantitative fitness of RNA sequences for given binding pockets or to predict potential binding pockets and binding RNAs for previously unknown RNA binding proteins.
View details for DOI 10.1038/s41467-019-12920-0
View details for PubMedID 31666519
-
Examining the Use of Real-World Evidence in the Regulatory Process.
Clinical pharmacology and therapeutics
2019
Abstract
The 21st Century Cures Act passed by the United States Congress mandates the Food and Drug Administration to develop guidance to evaluate the use of real-world evidence (RWE) to support the regulatory process. RWE has generated important medical discoveries, especially in areas where traditional clinical trials would be unethical or infeasible. However, RWE suffers from several issues that hinder its ability to provide proof of treatment efficacy at a level comparable to randomized controlled trials. In this review article, we summarized the advantages and limitations of RWE, identified the key opportunities for RWE, and pointed the way forward to maximize the potential of RWE for regulatory purposes.
View details for DOI 10.1002/cpt.1658
View details for PubMedID 31562770
-
#Science: The potential and the challenges of utilizing social media and other electronic communication platforms in health care.
Clinical and translational science
2019
Abstract
Electronic communication is becoming increasingly popular worldwide, as evidenced by its widespread and rapidly growing use. In medicine, however, it remains a novel approach to reach out to patients. Yet it has the potential to further improve current health care. Electronic platforms could support therapy adherence and communication between physicians and patients. The power of social media as well as other electronic devices can improve adherence as evidenced by the development of the app bant. Additionally, systemic analysis of social media content by Screenome can identify health events not always captured by regular health care. By better identifying these health care events we can improve our current health care system as we will be able to better tailor care to patients' needs. All these techniques are a valuable component of modern health care and will help us into the future of increasingly digital health care.
View details for DOI 10.1111/cts.12687
View details for PubMedID 31392837
-
Pocket similarity identifies selective estrogen receptor modulators as microtubule modulators at the taxane site.
Nature communications
2019; 10 (1): 1033
Abstract
Taxanes are a family of natural products with a broad spectrum of anticancer activity. This activity is mediated by interaction with the taxane site of beta-tubulin, leading to microtubule stabilization and cell death. Although widely used in the treatment of breast cancer and other malignancies, existing taxane-based therapies including paclitaxel and the second-generation docetaxel are currently limited by severe adverse effects and dose-limiting toxicity. To discover taxane site modulators, we employ a computational binding site similarity screen of > 14,000 drug-like pockets from PDB, revealing an unexpected similarity between the estrogen receptor and the beta-tubulin taxane binding pocket. Evaluation of nine selective estrogen receptor modulators (SERMs) via cellular and biochemical assays confirms taxane site interaction, microtubule stabilization, and cell proliferation inhibition. Our study demonstrates that SERMs can modulate microtubule assembly and raises the possibility of an estrogen receptor-independent mechanism for inhibiting cell proliferation.
View details for PubMedID 30833575
-
Graph Convolutional Neural Networks for Predicting Drug-Target Interactions.
Journal of chemical information and modeling
2019
Abstract
Accurate determination of target-ligand interactions is crucial in the drug discovery process. In this paper, we propose a graph-convolutional (Graph-CNN) framework for predicting protein-ligand interactions. First, we built an unsupervised graph-autoencoder to learn fixed-size representations of protein pockets from a set of representative druggable protein binding sites. Second, we trained two Graph-CNNs to automatically extract features from pocket graphs and 2D ligand graphs, respectively, driven by binding classification labels. We demonstrate that graph-autoencoders can learn fixed-size representations for protein pockets of varying sizes and the Graph-CNN framework can effectively capture protein-ligand binding interactions without relying on target-ligand complexes. Across several metrics, Graph-CNNs achieved better or comparable performance to 3DCNN ligand-scoring, AutoDock Vina, RF-Score, and NNScore on common virtual screening benchmark data sets. Visualization of key pocket residues and ligand atoms contributing to the classification decisions confirms that our networks are able to detect important interface residues and ligand atoms within the pockets and ligands, respectively.
View details for DOI 10.1021/acs.jcim.9b00628
View details for PubMedID 31580672
-
The association of obesity and coronary artery disease genes with response to SSRIs treatment in major depression
JOURNAL OF NEURAL TRANSMISSION
2019; 126 (1): 35–45
View details for DOI 10.1007/s00702-018-01966-x
View details for Web of Science ID 000458148200005
-
PathFXweb: a web application for identifying drug safety and efficacy phenotypes.
Bioinformatics (Oxford, England)
2019
Abstract
Limited efficacy and intolerable safety limit therapeutic development, and identifying potential liabilities earlier in development could significantly improve this process. Computational approaches that aggregate data from multiple sources and consider a drug's pathway effects could help identify these liabilities earlier. Such computational methods must be accessible to a variety of users beyond computational scientists, especially regulators and industry scientists, in order to impact the therapeutic development process. We have previously developed and published PathFX, an algorithm for identifying drug networks and phenotypes for understanding drug associations to safety and efficacy. Here we present a streamlined and easy-to-use PathFX web application that allows users to search for drug networks and associated phenotypes. We have also added visualization and phenotype clustering to improve functionality and interpretability of PathFXweb (https://www.pathfxweb.net/). Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btz419
View details for PubMedID 31114840
-
Predicting HLA class II antigen presentation through integrated deep learning.
Nature biotechnology
2019
Abstract
Accurate prediction of antigen presentation by human leukocyte antigen (HLA) class II molecules would be valuable for vaccine development and cancer immunotherapies. Current computational methods trained on in vitro binding data are limited by insufficient training data and algorithmic constraints. Here we describe MARIA (major histocompatibility complex analysis with recurrent integrated architecture; https://maria.stanford.edu/ ), a multimodal recurrent neural network for predicting the likelihood of antigen presentation from a gene of interest in the context of specific HLA class II alleles. In addition to in vitro binding measurements, MARIA is trained on peptide HLA ligand sequences identified by mass spectrometry, expression levels of antigen genes and protease cleavage signatures. Because it leverages these diverse training data and our improved machine learning framework, MARIA (area under the curve = 0.89-0.92) outperformed existing methods in validation datasets. Across independent cancer neoantigen studies, peptides with high MARIA scores are more likely to elicit strong CD4+ T cell responses. MARIA allows identification of immunogenic epitopes in diverse cancers and autoimmune disease.
View details for DOI 10.1038/s41587-019-0280-2
View details for PubMedID 31611695
-
The effect of CYP4F2, VKORC1 and CYP2C9 in influencing coumarin dose. A single patient data meta-analysis in more than 15,000 individuals.
Clinical pharmacology and therapeutics
2018
Abstract
The CYP4F2 gene is known to influence mean coumarin dose. The aim of the present study was to undertake a meta-analysis at individual patients' level to capture the possible effect of ethnicity, gene-gene interaction or other drugs on the association and to verify if inclusion of CYP4F2*3 variant into dosing algorithms improves the prediction of mean coumarin dose. We asked the authors of our previous meta-analysis (30 articles) and of 38 new articles retrieved by a systematic review to send us individual patients' data. The final collection consists of 15,754 patients split into a derivation and validation cohort. The CYP4F2*3 polymorphism was consistently associated with an increase in mean coumarin dose (+9%; 95%CI 7-10%), with a higher effect in females, in patients taking acenocoumarol and in Whites. The inclusion of the CYP4F2*3 in dosing algorithms slightly improved the prediction of stable coumarin dose. New pharmacogenetic equations potentially useful for clinical practice were derived.
View details for PubMedID 30506689
-
PathFX provides mechanistic insights into drug efficacy and safety for regulatory review and therapeutic development.
PLoS computational biology
2018; 14 (12): e1006614
Abstract
Failure to demonstrate efficacy and safety issues are important reasons that drugs do not reach the market. An incomplete understanding of how drugs exert their effects hinders regulatory and pharmaceutical industry projections of a drug's benefits and risks. Signaling pathways mediate drug response and while many signaling molecules have been characterized for their contribution to disease or their role in drug side effects, our knowledge of these pathways is incomplete. To better understand all signaling molecules involved in drug response and the phenotype associations of these molecules, we created a novel method, PathFX, a non-commercial entity, to identify these pathways and drug-related phenotypes. We benchmarked PathFX by identifying drugs' marketed disease indications and reported a sensitivity of 41%, a 2.7-fold improvement over similar approaches. We then used PathFX to strengthen signals for drug-adverse event pairs occurring in the FDA Adverse Event Reporting System (FAERS) and also identified opportunities for drug repurposing for new diseases based on interaction paths that associated a marketed drug to that disease. By discovering molecular interaction pathways, PathFX improved our understanding of drug associations to safety and efficacy phenotypes. The algorithm may provide a new means to improve regulatory and therapeutic development decisions.
View details for PubMedID 30532240
-
Standardized biogeographic grouping system for annotating populations in pharmacogenetic research.
Clinical pharmacology and therapeutics
2018
Abstract
The varying frequencies of pharmacogenetic alleles between populations have important implications for the impact of these alleles in different populations. Current population grouping methods to communicate these patterns are insufficient as they are inconsistent and fail to reflect the global distribution of genetic variability. To facilitate and standardize the reporting of variability in pharmacogenetic allele frequencies, we present seven geographically-defined groups: American, Central/South Asian, East Asian, European, Near Eastern, Oceanian, and Sub-Saharan African, and two admixed groups: African American/Afro-Caribbean and Latino. These nine groups are defined by global autosomal genetic structure and based on data from large-scale sequencing initiatives. We recognize that broadly grouping global populations is an oversimplification of human diversity and does not capture complex social and cultural identity. However, these groups meet a key need in pharmacogenetics research by enabling consistent communication of the scale of variability in global allele frequencies and are now used by PharmGKB.
View details for PubMedID 30506572
-
PathFX provides mechanistic insights into drug efficacy and safety for regulatory review and therapeutic development
PLOS COMPUTATIONAL BIOLOGY
2018; 14 (12)
View details for DOI 10.1371/journal.pcbi.1006614
View details for Web of Science ID 000454835100024
-
Essential characteristics of pharmacogenomics study publications.
Clinical pharmacology and therapeutics
2018
Abstract
Pharmacogenomics (PGx) can be seen as a model for biomedical studies: it includes all disease areas of interest, spans in vitro studies to clinical trials, while focusing on the relationships between genes and drugs and the resulting phenotypes. This review will examine different characteristics of PGx study publications and provide examples of excellence in framing PGx questions and reporting their resulting data in a way that maximizes the knowledge that can be built upon them.
View details for PubMedID 30406943
-
The Pioglitazone Trek via Human PPAR Gamma: From Discovery to a Medicine at the FDA and Beyond.
Frontiers in pharmacology
2018; 9: 1093
Abstract
For almost two decades, pioglitazone has been prescribed primarily to prevent and treat insulin resistance in some type 2 diabetic patients. In this review, we trace the path to discovery of pioglitazone as a thiazolidinedione compound, the glitazone tracks through the regulatory agencies, the trek to molecular agonism in the nucleus and the binding of pioglitazone to the nuclear receptor PPAR gamma. Given the rise in consumption of pioglitazone in T2D patients worldwide and the increased number of clinical trials currently testing alternate medical uses for this drug, there is also merit to some reflection on the reported adverse effects. Going forward, it is imperative to continue investigations into the mechanisms of actions of pioglitazone, the potential of glitazone drugs to contribute to unmet needs in complex diseases associated with the dynamics of adaptive homeostasis, and also the routes to minimizing adverse effects in every-day patients throughout the world.
View details for DOI 10.3389/fphar.2018.01093
View details for PubMedID 30337873
View details for PubMedCentralID PMC6180177
-
The Pioglitazone Trek via Human PPAR Gamma: From Discovery to a Medicine at the FDA and Beyond
FRONTIERS IN PHARMACOLOGY
2018; 9
View details for DOI 10.3389/fphar.2018.01093
View details for Web of Science ID 000446339600001
-
PharmGKB summary: oxycodone pathway, pharmacokinetics
PHARMACOGENETICS AND GENOMICS
2018; 28 (10): 230–37
View details for PubMedID 30222708
-
Data-driven human transcriptomic modules determined by independent component analysis
BMC BIOINFORMATICS
2018; 19
View details for DOI 10.1186/s12859-018-2338-4
View details for Web of Science ID 000444941300001
-
Data-driven human transcriptomic modules determined by independent component analysis.
BMC bioinformatics
2018; 19 (1): 327
Abstract
BACKGROUND: Analyzing the human transcriptome is crucial in advancing precision medicine, and the plethora of over half a million human microarray samples in the Gene Expression Omnibus (GEO) has enabled us to better characterize biological processes at the molecular level. However, transcriptomic analysis is challenging because the data is inherently noisy and high-dimensional. Gene set analysis is currently widely used to alleviate the issue of high dimensionality, but the user-defined choice of gene sets can introduce bias in results. In this paper, we advocate the use of a fixed set of transcriptomic modules for such analysis. We apply independent component analysis to the large collection of microarray data in GEO in order to discover reproducible transcriptomic modules that can be used as features for machine learning. We evaluate the usability of these modules across six studies, and demonstrate (1) their usage as features for sample classification, and also their robustness in dealing with small training sets, (2) their regularization of data when clustering samples and (3) the biological relevancy of differentially expressed features. RESULTS: We identified 139 reproducible transcriptomic modules, which we term fundamental components (FCs). In studies with fewer than 50 samples, FC-space classification models outperformed their gene-space counterparts, with higher sensitivity (p<0.01). The models also had higher accuracy and negative predictive value (p<0.01) for small data sets (fewer than 30 samples). Additionally, we observed a reduction in batch effects when data is clustered in the FC-space. Finally, we found that differentially expressed FCs mapped to GO terms that were also identified via traditional gene-based approaches. CONCLUSIONS: The 139 FCs provide biologically-relevant summarization of transcriptomic data, and their performance in low sample settings suggests that they should be employed in such studies in order to harness the data efficiently.
View details for PubMedID 30223787
-
PharmGKB summary: clozapine pathway, pharmacokinetics
PHARMACOGENETICS AND GENOMICS
2018; 28 (9): 214–22
View details for PubMedID 30134346
-
Machine learning in chemoinformatics and drug discovery
DRUG DISCOVERY TODAY
2018; 23 (8): 1538–46
View details for DOI 10.1016/j.drudis.2018.05.010
View details for Web of Science ID 000443787600011
-
A global network of biomedical relationships derived from text
BIOINFORMATICS
2018; 34 (15): 2614–24
View details for DOI 10.1093/bioinformatics/bty114
View details for Web of Science ID 000440967900012
-
Computational Analysis of Kinase Inhibitor Selectivity using Structural Knowledge.
Bioinformatics (Oxford, England)
2018
Abstract
Motivation: Kinases play a significant role in diverse disease signaling pathways and understanding kinase inhibitor selectivity, the tendency of drugs to bind to off-targets, remains a top priority for kinase inhibitor design and clinical safety assessment. Traditional approaches for kinase selectivity analysis using biochemical activity and binding assays are useful but can be costly and are often limited by the kinases that are available. On the other hand, current computational kinase selectivity prediction methods are computationally intensive and can rarely achieve sufficient accuracy for large-scale kinome-wide inhibitor selectivity profiling. Results: Here, we present a KinomeFEATURE database for kinase binding site similarity search by comparing protein microenvironments characterized using diverse physicochemical descriptors. Initial selectivity prediction of 15 known kinase inhibitors achieved >90% accuracy and demonstrated improved performance in comparison to commonly used kinase inhibitor selectivity prediction methods. Additional kinase ATP binding site similarity assessment (120 binding sites) identified 55 kinases with significant promiscuity and revealed unexpected inhibitor cross-activities between PKR and FGFR2 kinases. Kinome-wide selectivity profiling of 11 kinase drug candidates predicted novel as well as experimentally validated off-targets and suggested structural mechanisms of kinase cross-activities. Our study demonstrated potential utilities of our approach for large-scale kinase inhibitor selectivity profiling that could contribute to kinase drug development and safety assessment. Availability: The KinomeFEATURE database is available at https://simtk.org/projects/kdb. Supplementary information: Supplementary data are available at Bioinformatics online.
View details for PubMedID 29985971
-
PharmGKB: A worldwide resource for pharmacogenomic information
WILEY INTERDISCIPLINARY REVIEWS-SYSTEMS BIOLOGY AND MEDICINE
2018; 10 (4)
View details for DOI 10.1002/wsbm.1417
View details for Web of Science ID 000435287900002
-
Machine learning in chemoinformatics and drug discovery.
Drug discovery today
2018
Abstract
Chemoinformatics is an established discipline focusing on extracting, processing and extrapolating meaningful data from chemical structures. With the rapid explosion of chemical 'big' data from HTS and combinatorial synthesis, machine learning has become an indispensable tool for drug designers to mine chemical information from large compound databases to design drugs with important biological properties. To process the chemical data, we first reviewed multiple processing layers in the chemoinformatics pipeline followed by the introduction of commonly used machine learning models in drug discovery and QSAR analysis. Here, we present basic principles and recent case studies to demonstrate the utility of machine learning techniques in chemoinformatics analyses; and we discuss limitations and future directions to guide further development in this evolving field.
View details for PubMedID 29750902
-
Pharmacogenomics and big genomic data: from lab to clinic and back again.
Human molecular genetics
2018; 27 (R1): R72–R78
Abstract
The field of pharmacogenomics is an area of great potential for near-term human health impacts from the big genomic data revolution. Pharmacogenomics research momentum is building with numerous hypotheses currently being investigated through the integration of molecular profiles of different cell lines and large genomic data sets containing information on cellular and human responses to therapies. Additionally, the results of previous pharmacogenetic research efforts have been formulated into clinical guidelines that are beginning to impact how healthcare is conducted on the level of the individual patient. This trend will only continue with the recent release of new datasets containing linked genotype and electronic medical record data. This review discusses key resources available for pharmacogenomics and pharmacogenetics research and highlights recent work within the field.
View details for PubMedID 29635477
-
Pharmacogenomics and big genomic data: from lab to clinic and back again
HUMAN MOLECULAR GENETICS
2018; 27 (R1): R72–R78
View details for DOI 10.1093/hmg/ddy116
View details for Web of Science ID 000431884200012
-
PharmGKB summary: atazanavir pathway, pharmacokinetics/pharmacodynamics
PHARMACOGENETICS AND GENOMICS
2018; 28 (5): 127–37
View details for PubMedID 29517518
View details for PubMedCentralID PMC5910198
-
PharmGKB summary: clobazam pathway, pharmacokinetics
PHARMACOGENETICS AND GENOMICS
2018; 28 (4): 110–15
View details for PubMedID 29517622
-
Genome-wide and candidate gene approaches of clopidogrel efficacy using pharmacodynamic and clinical end points-Rationale and design of the International Clopidogrel Pharmacogenomics Consortium (ICPC)
AMERICAN HEART JOURNAL
2018; 198: 152–59
Abstract
The P2Y12 receptor inhibitor clopidogrel is widely used in patients with acute coronary syndrome, percutaneous coronary intervention, or ischemic stroke. Platelet inhibition by clopidogrel shows wide interpatient variability, and high on-treatment platelet reactivity is a risk factor for atherothrombotic events, particularly in high-risk populations. CYP2C19 polymorphism plays an important role in this variability, but heritability estimates suggest that additional genetic variants remain unidentified. The aim of the International Clopidogrel Pharmacogenomics Consortium (ICPC) is to identify genetic determinants of clopidogrel pharmacodynamics and clinical response. Based on the data published on www.clinicaltrials.gov, clopidogrel intervention studies containing genetic and platelet function data were identified for participation. Lead investigators were invited to share DNA samples, platelet function test results, patient characteristics, and cardiovascular outcomes to perform candidate gene and genome-wide studies. In total, 17 study sites from 13 countries participate in the ICPC, contributing individual patient data from 8,829 patients. Available adenosine diphosphate-stimulated platelet function tests included vasodilator-stimulated phosphoprotein assay, light transmittance aggregometry, and the VerifyNow P2Y12 assay. A proof-of-principle analysis based on genotype data provided by each group showed a strong and consistent association between CYP2C19*2 and platelet reactivity (P value = 5.1 × 10^-40). The ICPC aims to identify new loci influencing clopidogrel efficacy by using state-of-the-art genetic approaches in a large cohort of clopidogrel-treated patients to better understand the genetic basis of on-treatment response variability.
View details for PubMedID 29653637
-
A probabilistic pathway score (PROPS) for classification with applications to inflammatory bowel disease
BIOINFORMATICS
2018; 34 (6): 985–93
Abstract
Gene-based supervised machine learning classification models have been widely used to differentiate disease states, predict disease progression and determine effective treatment options. However, many of these classifiers are sensitive to noise and frequently do not replicate in external validation sets. For complex, heterogeneous diseases, these classifiers are further limited by being unable to capture varying combinations of genes that lead to the same phenotype. Pathway-based classification can overcome these challenges by using robust, aggregate features to represent biological mechanisms. In this work, we developed a novel pathway-based approach, PRObabilistic Pathway Score, which uses genes to calculate individualized pathway scores for classification. Unlike previous individualized pathway-based classification methods that use gene sets, we incorporate gene interactions using probabilistic graphical models to more accurately represent the underlying biology and achieve better performance. We apply our method to differentiate two similar complex diseases, ulcerative colitis (UC) and Crohn's disease (CD), which are the two main types of inflammatory bowel disease (IBD). Using five IBD datasets, we compare our method against four gene-based and four alternative pathway-based classifiers in distinguishing CD from UC. We demonstrate superior classification performance and provide biological insight into the top pathways separating CD from UC. PROPS is available as an R package, which can be downloaded at http://simtk.org/home/props or on Bioconductor. Contact: rbaltman@stanford.edu. Supplementary data are available at Bioinformatics online.
View details for PubMedID 29048458
View details for PubMedCentralID PMC5860179
-
Association of the Polygenic Scores for Personality Traits and Response to Selective Serotonin Reuptake Inhibitors in Patients with Major Depressive Disorder
FRONTIERS IN PSYCHIATRY
2018; 9: 65
Abstract
Studies reported a strong genetic correlation between the Big Five personality traits and major depressive disorder (MDD). Moreover, personality traits are thought to be associated with response to antidepressant treatment that might partly be mediated by genetic factors. In this study, we examined whether polygenic scores (PGSs) derived from the Big Five personality traits predict treatment response and remission in patients with MDD who were prescribed selective serotonin reuptake inhibitors (SSRIs). In addition, we performed meta-analyses of genome-wide association studies (GWASs) on these traits to identify genetic variants underpinning the cross-trait polygenic association. The PGS analysis was performed using data from two cohorts: the Pharmacogenomics Research Network Antidepressant Medication Pharmacogenomic Study (PGRN-AMPS, n = 529) and the International SSRI Pharmacogenomics Consortium (ISPC, n = 865). The cross-trait GWAS meta-analyses were conducted by combining GWAS summary statistics on SSRI treatment outcome and on the personality traits. The results showed that the PGSs for openness and neuroticism were associated with SSRI treatment outcomes at p < 0.05 across PT thresholds in both cohorts. A significant association was also found between the PGS for conscientiousness and SSRI treatment response in the PGRN-AMPS sample. In the cross-trait GWAS meta-analyses, we identified eight loci associated with (a) SSRI response and conscientiousness near the YEATS4 gene and (b) SSRI remission and neuroticism near the PRAG1, MSRA, XKR6, ELAVL2, PLXNC1, PLEKHM1, and BRUNOL4 genes. An assessment of a polygenic load for personality traits may assist in conjunction with clinical data to predict whether MDD patients might respond favorably to SSRIs.
View details for PubMedID 29559929
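The polygenic scores described in the abstract above are conventionally computed as a weighted sum of allele dosages, restricted to variants whose GWAS p-value falls below a chosen threshold (the "PT thresholds" mentioned). A minimal sketch of that general calculation follows; the variant IDs, effect sizes, p-values and the function name are hypothetical illustrations, not the study's data or code.

```python
from typing import Dict, Tuple


def polygenic_score(
    dosages: Dict[str, int],
    gwas: Dict[str, Tuple[float, float]],
    p_threshold: float,
) -> float:
    """Sum effect size x allele dosage over variants passing the p-value threshold.

    dosages: variant ID -> number of effect alleles carried (0, 1 or 2).
    gwas:    variant ID -> (effect size, p-value) from GWAS summary statistics.
    """
    total = 0.0
    for variant, (beta, p_value) in gwas.items():
        # Skip variants above the threshold or not genotyped in this patient.
        if p_value <= p_threshold and variant in dosages:
            total += beta * dosages[variant]
    return total


# Hypothetical summary statistics and one hypothetical patient genotype.
gwas = {"rsA": (0.5, 1e-6), "rsB": (-0.25, 0.01), "rsC": (0.125, 0.4)}
patient = {"rsA": 2, "rsB": 1, "rsC": 2}

for threshold in (1e-5, 0.05, 1.0):
    print(threshold, polygenic_score(patient, gwas, threshold))
```

Scoring the same patient at several thresholds, as in the loop, mirrors the practice of reporting associations across a range of PT values rather than at a single cutoff.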
-
Biological and functional relevance of CASP predictions
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS
2018; 86: 374–86
Abstract
Our goal is to answer the question: compared with experimental structures, how useful are predicted models for functional annotation? We assessed the functional utility of predicted models by comparing the performances of a suite of methods for functional characterization on the predictions and the experimental structures. We identified 28 sites in 25 protein targets to perform functional assessment. These 28 sites included nine sites with known ligand binding (holo-sites), nine sites that are expected or suggested by experimental authors for small molecule binding (apo-sites), and ten sites containing important motifs, loops, or key residues with important disease-associated mutations. We evaluated the utility of the predictions by comparing their microenvironments to the experimental structures. Overall structural quality correlates with functional utility. However, the best-ranked predictions (global) may not have the best functional quality (local). Our assessment provides an ability to discriminate between predictions with high structural quality. When assessing ligand-binding sites, most prediction methods have higher performance on apo-sites than holo-sites. Some servers show consistently high performance for certain types of functional sites. Finally, many functional sites are associated with protein-protein interaction. We also analyzed biologically relevant features from the protein assemblies of two targets where the active site spanned the protein-protein interface. For the assembly targets, we find that the features in the models are mainly determined by the choice of template.
View details for PubMedID 28975675
View details for PubMedCentralID PMC5820171
-
Mendelian Disease Associations Reveal Novel Insights into Inflammatory Bowel Disease
INFLAMMATORY BOWEL DISEASES
2018; 24 (3): 471–81
Abstract
Monogenic diseases have been shown to contribute to complex disease risk and may hold new insights into the underlying biological mechanism of Inflammatory Bowel Disease (IBD). We analyzed Mendelian disease associations with IBD using over 55 million patients from Optum's deidentified electronic health records database. Using the significant Mendelian diseases, we performed pathway enrichment analysis and constructed a model using gene expression datasets to differentiate Crohn's disease (CD), ulcerative colitis (UC), and healthy patient samples. We found that 50 Mendelian diseases were significantly associated with IBD, with 40 being significantly associated with both CD and UC. Our results for CD replicated those from previous studies. Pathways that were enriched consisted mainly of immune and metabolic processes with a focus on tolerance and oxidative stress. Our 3-way classifier for UC, CD, and healthy samples yielded an accuracy of 72%. Mendelian diseases that are significantly associated with IBD may reveal novel insights into the genetic architecture of IBD.
View details for PubMedID 29462399
-
Application of a Dynamic Map for Learning, Communicating, Navigating, and Improving Therapeutic Development
CTS-CLINICAL AND TRANSLATIONAL SCIENCE
2018; 11 (2): 166–74
Abstract
Drug discovery and development is commonly schematized as a "pipeline," and, although appreciated by drug developers to be a useful oversimplification, this cartology may perpetuate inaccurate notions of straightforwardness and is of minimal utility for process engineering to improve efficiency. To create a more granular schema, a group of drug developers, researchers, patient advocates, and regulators developed a crowdsourced atlas of the steps involved in translating basic discoveries into health interventions, annotated with the steps that are particularly prone to difficulty or failure. This Drug Discovery, Development, and Deployment Map (4DM), provides a network view of the process, which will be useful for communication and education to those new to the field, orientation and navigation of individual projects, and prioritization of technology development and re-engineering endeavors to improve efficiency and effectiveness. The 4DM is freely available for utilization, modification, and further development by stakeholders across the translational ecosystem.
View details for PubMedID 29271559
View details for PubMedCentralID PMC5866991
-
Reversals and limitations on high-intensity, life-sustaining treatments
PLOS ONE
2018; 13 (2): e0190569
Abstract
Critically ill patients often receive high-intensity life sustaining treatments (LST) in the intensive care unit (ICU), although they can be ineffective and eventually undesired. Determining the risk factors associated with reversals in LST goals can improve patient and provider appreciation for the natural history and epidemiology of critical care and inform decision making around the (continued) use of LSTs. This is a single institution retrospective cohort study of patients receiving life sustaining treatment in an academic tertiary hospital from 2009 to 2013. Deidentified patient electronic medical record data was collected via the clinical data warehouse to study the outcomes of treatment limiting Comfort Care and do-not-resuscitate (DNR) orders. Extended multivariable Cox regression models were used to estimate the association of patient and clinical factors with subsequent treatment limiting orders. 10,157 patients received life-sustaining treatment while initially Full Code (allowing all resuscitative measures). Of these, 770 (8.0%) transitioned to Comfort Care (with discontinuation of any life-sustaining treatments) while 1,669 (16%) patients received new DNR orders that reflect preferences to limit further life-sustaining treatment options. Patients who were older (hazard ratio (HR) 1.37 [95% CI 1.28-1.47] per decade), with cerebrovascular disease (HR 2.18 [95% CI 1.69-2.81]), treated by the Medical ICU (HR 1.92 [95% CI 1.49-2.49]) and Hematology-Oncology (HR 1.87 [95% CI 1.27-2.74]) services, receiving vasoactive infusions (HR 1.76 [95% CI 1.28, 2.43]) or continuous renal replacement (HR 1.83 [95% CI 1.34, 2.48]) were more likely to transition to Comfort Care. Any new DNR orders were more likely for patients who were older (HR 1.43 [95% CI 1.38-1.48] per decade), female (HR 1.30 [95% CI 1.17-1.44]), with cerebrovascular disease (HR 1.45 [95% CI 1.25-1.67]) or metastatic solid cancers (HR 1.92 [95% CI 1.48-2.49]), or treated by Medical ICU (HR 1.63 [95% CI 1.42-1.86]), Hematology-Oncology (HR 1.63 [95% CI 1.33-1.98]) and Cardiac Care Unit-Heart Failure (HR 1.41 [95% CI 1.15-1.72]). Decisions to reverse or limit treatment goals occur after more than 1 in 13 trials of LST and are associated with older age, female sex, receipt of non-ventilator forms of LST, cerebrovascular disease, and treatment by certain medical specialty services.
View details for PubMedID 29489814
-
A global network of biomedical relationships derived from text.
Bioinformatics (Oxford, England)
2018
Abstract
Motivation: The biomedical community's collective understanding of how chemicals, genes, and phenotypes interact is distributed across the text of over 24 million research articles. These interactions offer insights into the mechanisms behind higher order biochemical phenomena, such as drug-drug interactions and variations in drug response across individuals. To assist their curation at scale, we must understand what relationship types are possible and map unstructured natural language descriptions onto these structured classes. Methods: We used NCBI's PubTator annotations to identify instances of chemical, gene and disease names in Medline abstracts and applied the Stanford dependency parser to find connecting dependency paths between pairs of entities in single sentences. We combined a published ensemble biclustering algorithm (EBC) with hierarchical clustering to group the dependency paths into semantically-related categories, which we annotated with labels, or "themes" ("inhibition" and "activation", for example). We evaluated our theme assignments against six human-curated databases: DrugBank, Reactome, SIDER, the Therapeutic Target Database (TTD), OMIM, and PharmGKB. Results: Clustering revealed 10 broad themes for chemical-gene relationships, 7 for chemical-disease, 10 for gene-disease and 9 for gene-gene. In most cases, enriched themes corresponded directly to known database relationships. Our final dataset, represented as a network, contained 37,491 thematically-labeled chemical-gene edges, 2,021,192 chemical-disease edges, 136,206 gene-disease edges, and 41,418 gene-gene edges, each representing a single-sentence description of an interaction from somewhere in the literature. Availability: The complete network is available on Zenodo (https://zenodo.org/record/1035500). We have also provided the full set of dependency paths connecting biomedical entities in Medline abstracts, with associated sentences, for future use by the biomedical research community. Contact: bethany.percha@mssm.edu. Supplementary Information: Supplementary data are available at Bioinformatics online.
View details for PubMedID 29490008
-
PharmGKB: A worldwide resource for pharmacogenomic information.
Wiley interdisciplinary reviews. Systems biology and medicine
2018
Abstract
As precision medicine becomes increasingly relevant in healthcare, the field of pharmacogenomics (PGx) also continues to gain prominence in the clinical setting. Leading institutions have begun to implement PGx testing and the amount of published PGx literature increases yearly. The Pharmacogenomics Knowledgebase (PharmGKB; www.pharmgkb.org) is one of the foremost worldwide resources for PGx knowledge, and the organization has been adapting and refocusing its mission along with the current revolution in genomic medicine. The PharmGKB website provides a diverse array of PGx information, from annotations of the primary literature to guidelines for adjusting drug treatment based on genetic information. It is freely available and accessible to everyone from researchers to clinicians to everyday citizens. PharmGKB was founded over 17 years ago, but continues to be a vital resource for the entire PGx community and the general public. This article is categorized under: Translational, Genomic, and Systems Medicine > Translational Medicine.
View details for PubMedID 29474005
-
A dynamic map for learning, communicating, navigating and improving therapeutic development
NATURE REVIEWS DRUG DISCOVERY
2018; 17 (2): 151–53
View details for DOI 10.1038/nrd.2017.217
View details for Web of Science ID 000424402500017
-
A dynamic map for learning, communicating, navigating and improving therapeutic development.
Nature reviews. Drug discovery
2018; 17 (2): 150
View details for DOI 10.1038/nrd.2017.217
View details for PubMedID 29269942
-
Challenges for Training Translational Researchers in the Era of Ubiquitous Data
CLINICAL PHARMACOLOGY & THERAPEUTICS
2018; 103 (2): 171–73
Abstract
Our ability to collect data at every stage of the translational pipeline creates great opportunities for formulating hypotheses both "upstream" (towards clinical implementation) and "downstream" (back to basic discovery). Translational researchers therefore must integrate information at multiple scales to both generate and test hypotheses; to some extent, they must all be comfortable with the basics of "big data" analyses. This increased focus on data-driven science requires an understanding of basic experimental and clinical data collection, an understanding that likely cannot efficiently be gathered through traditional apprenticeship models. Thus, new curricula are required to ensure that next-generation scientists have a new combination of skills required for integrating data to catalyze discovery.
View details for PubMedID 29134624
-
Biomarkers: Delivering on the expectation of molecularly driven, quantitative health
EXPERIMENTAL BIOLOGY AND MEDICINE
2018; 243 (3): 313–22
Abstract
Biomarkers are the pillars of precision medicine and are delivering on expectations of molecular, quantitative health. These features have made clinical decisions more precise and personalized, but require a high bar for validation. Biomarkers have improved health outcomes in a few areas such as cancer, pharmacogenetics, and safety. Burgeoning big data research infrastructure, the internet of things, and increased patient participation will accelerate discovery in the many areas that have not yet realized the full potential of biomarkers for precision health. Here we review themes of biomarker discovery, current implementations of biomarkers for precision health, and future opportunities and challenges for biomarker discovery. Impact statement: Precision medicine evolved because of the understanding that human disease is molecularly driven and is highly variable across patients. This understanding has made biomarkers, a diverse class of biological measurements, more relevant for disease diagnosis, monitoring, and selection of treatment strategy. Biomarkers' impact on precision medicine can be seen in cancer, pharmacogenomics, and safety. The successes in these cases suggest many more applications for biomarkers and a greater impact for precision medicine across the spectrum of human disease. The authors assess the status of biomarker-guided medical practice by analyzing themes for biomarker discovery, reviewing the impact of these markers in the clinic, and highlighting future and ongoing challenges for biomarker discovery. This work is timely and relevant, as the molecular, quantitative approach of precision medicine is spreading to many disease indications.
View details for PubMedID 29199461
-
Systematic target function annotation of human transcription factors
BMC BIOLOGY
2018; 16: 4
Abstract
Transcription factors (TFs), the key players in transcriptional regulation, have attracted great experimental attention, yet the functions of most human TFs remain poorly understood. Recent capabilities in genome-wide protein binding profiling have stimulated systematic studies of the hierarchical organization of human gene regulatory network and DNA-binding specificity of TFs, shedding light on combinatorial gene regulation. We show here that these data also enable a systematic annotation of the biological functions and functional diversity of TFs. We compiled a human gene regulatory network for 384 TFs covering the 146,096 TF-target gene (TF-TG) relationships, extracted from over 850 ChIP-seq experiments as well as the literature. By integrating this network of TF-TF and TF-TG relationships with 3715 functional concepts from six sources of gene function annotations, we obtained over 9000 confident functional annotations for 279 TFs. We observe extensive connectivity between TFs and Mendelian diseases, GWAS phenotypes, and pharmacogenetic pathways. Further, we show that TFs link apparently unrelated functions, even when the two functions do not share common genes. Finally, we analyze the pleiotropic functions of TFs and suggest that the increased number of upstream regulators contributes to the functional pleiotropy of TFs. Our computational approach is complementary to focused experimental studies on TF functions, and the resulting knowledge can guide experimental design for the discovery of unknown roles of TFs in human disease and drug response.
View details for PubMedID 29325558
-
Chemical reaction vector embeddings: towards predicting drug metabolism in the human gut microbiome.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2018; 23: 56–67
Abstract
Bacteria in the human gut have the ability to activate, inactivate, and reactivate drugs with both intended and unintended effects. For example, the drug digoxin is reduced to the inactive metabolite dihydrodigoxin by the gut Actinobacterium E. lenta, and patients colonized with high levels of drug metabolizing strains may have limited response to the drug. Understanding the complete space of drugs that are metabolized by the human gut microbiome is critical for predicting bacteria-drug relationships and their effects on individual patient response. Discovery and validation of drug metabolism via bacterial enzymes has yielded >50 drugs after nearly a century of experimental research. However, there are limited computational tools for screening drugs for potential metabolism by the gut microbiome. We developed a pipeline for comparing and characterizing chemical transformations using continuous vector representations of molecular structure learned using unsupervised representation learning. We applied this pipeline to chemical reaction data from MetaCyc to characterize the utility of vector representations for chemical reaction transformations. After clustering molecular and reaction vectors, we performed enrichment analyses and queries to characterize the space. We detected enriched enzyme names, Gene Ontology terms, and Enzyme Commission (EC) classes within reaction clusters. In addition, we queried reactions against drug-metabolite transformations known to be metabolized by the human gut microbiome. The top results for these known drug transformations contained similar substructure modifications to the original drug pair. This work enables high throughput screening of drugs and their resulting metabolites against chemical reactions common to gut bacteria.
View details for PubMedID 29218869
-
GeneDive: A gene interaction search and visualization tool to facilitate precision medicine
WORLD SCIENTIFIC PUBL CO PTE LTD. 2018: 590–601
Abstract
Obtaining relevant information about gene interactions is critical for understanding disease processes and treatment. With the rise in text mining approaches, the volume of such biomedical data is rapidly increasing, thereby creating a new problem for the users of this data: information overload. A tool for efficient querying and visualization of biomedical data that helps researchers understand the underlying biological mechanisms for diseases and drug responses, and ultimately helps patients, is sorely needed. To this end we have developed GeneDive, a web-based information retrieval, filtering, and visualization tool for large volumes of gene interaction data. GeneDive offers various features and modalities that guide the user through the search process to efficiently reach the information of their interest. GeneDive currently processes over three million gene-gene interactions with response times within a few seconds. For over half of the curated gene sets sourced from four prominent databases, more than 80% of the gene set members are recovered by GeneDive. In the near future, GeneDive will seamlessly accommodate other interaction types, such as gene-drug and gene-disease interactions, thus enabling full exploration of topics such as precision medicine. The GeneDive application and information about its underlying system architecture are available at http://www.genedive.net.
View details for Web of Science ID 000461831500054
View details for PubMedID 29218917
View details for PubMedCentralID PMC5807065
-
Improving the explainability of Random Forest classifier - user centered approach
WORLD SCIENTIFIC PUBL CO PTE LTD. 2018: 204–15
Abstract
Machine Learning (ML) methods are now influencing major decisions about patient care, new medical methods, and drug development, and their use and importance are rapidly increasing in all areas. However, these ML methods are inherently complex and often difficult to understand and explain, resulting in barriers to their adoption and validation. Our work (RFEX) focuses on enhancing Random Forest (RF) classifier explainability by developing easy-to-interpret explainability summary reports from trained RF classifiers as a way to improve the explainability for (often non-expert) users. RFEX is implemented and extensively tested on Stanford FEATURE data, where RF is tasked with predicting functional sites in 3D molecules based on their electrochemical signatures (features). In developing the RFEX method, we apply a user-centered approach driven by explainability questions and requirements collected through discussions with interested practitioners. We performed formal usability testing with 13 expert and non-expert users to verify RFEX usefulness. Analysis of the RFEX explainability report and user feedback indicates its usefulness in significantly increasing explainability and user confidence in RF classification on FEATURE data. Notably, RFEX summary reports easily reveal that one needs very few (2-6, depending on the model) top-ranked features to achieve 90% or better of the accuracy obtained when all 480 features are used.
View details for Web of Science ID 000461831500019
View details for PubMedID 29218882
View details for PubMedCentralID PMC5728671
-
Chemical reaction vector embeddings: towards predicting drug metabolism in the human gut microbiome
WORLD SCIENTIFIC PUBL CO PTE LTD. 2018: 56–67
View details for Web of Science ID 000461831500006
-
Expanding a radiology lexicon using contextual patterns in radiology reports.
Journal of the American Medical Informatics Association : JAMIA
2018
Abstract
Distributional semantics algorithms, which learn vector space representations of words and phrases from large corpora, identify related terms based on contextual usage patterns. We hypothesize that distributional semantics can speed up lexicon expansion in a clinical domain, radiology, by unearthing synonyms from the corpus. We apply word2vec, a distributional semantics software package, to the text of radiology notes to identify synonyms for RadLex, a structured lexicon of radiology terms. We stratify performance by term category, term frequency, number of tokens in the term, vector magnitude, and the context window used in vector building. Ranking candidates based on distributional similarity to a target term results in high curation efficiency: on a ranked list of 775 249 terms, >50% of synonyms occurred within the first 25 terms. Synonyms are easier to find if the target term is a phrase rather than a single word, if it occurs at least 100× in the corpus, and if its vector magnitude is between 4 and 5. Some RadLex categories, such as anatomical substances, are easier to identify synonyms for than others. The unstructured text of clinical notes contains a wealth of information about human diseases and treatment patterns. However, searching and retrieving information from clinical notes often suffer due to variations in how similar concepts are described in the text. Biomedical lexicons address this challenge, but are expensive to produce and maintain. Distributional semantics algorithms can assist lexicon curation, saving researchers time and money.
View details for PubMedID 29329435
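The abstract above describes ranking candidate terms by distributional similarity to a target term so that a curator only inspects the top of the list. A minimal sketch of that ranking step is given below, assuming cosine similarity as the measure (the abstract does not name the measure used); the tiny three-dimensional vectors and the function names are hypothetical stand-ins for learned word2vec embeddings, not the study's data or code.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(
    target: str, vectors: Dict[str, Sequence[float]], top_k: int = 25
) -> List[str]:
    """Return up to top_k terms, most similar to the target first."""
    scored = [
        (term, cosine(vectors[target], vec))
        for term, vec in vectors.items()
        if term != target  # the target is not its own synonym candidate
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scored[:top_k]]


# Hypothetical toy embeddings; real vectors would come from a trained model.
vectors = {
    "lung": [1.0, 0.0, 0.2],
    "pulmonary": [0.9, 0.1, 0.3],
    "femur": [0.0, 1.0, 0.1],
    "hepatic": [0.1, 0.2, 1.0],
}

print(rank_candidates("lung", vectors))
```

The default `top_k=25` echoes the abstract's observation that more than half of synonyms appeared within the first 25 ranked terms; it is a curation cutoff, not a property of the similarity measure.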
-
A E2F4/p107 complex regulates LDHA in the mechanism of chemotherapy resistance in human colorectal cancer stem cells
WILEY. 2018: 426
View details for Web of Science ID 000422694003088
-
Association of Omics Features with Histopathology Patterns in Lung Adenocarcinoma
CELL SYSTEMS
2017; 5 (6): 620-+
Abstract
Adenocarcinoma accounts for more than 40% of lung malignancy, and microscopic pathology evaluation is indispensable for its diagnosis. However, how histopathology findings relate to molecular abnormalities remains largely unknown. Here, we obtained H&E-stained whole-slide histopathology images, pathology reports, RNA sequencing, and proteomics data of 538 lung adenocarcinoma patients from The Cancer Genome Atlas and used these to identify molecular pathways associated with histopathology patterns. We report cell-cycle regulation and nucleotide binding pathways underpinning tumor cell dedifferentiation, and we predicted histology grade using transcriptomics and proteomics signatures (area under curve >0.80). We built an integrative histopathology-transcriptomics model to generate better prognostic predictions for stage I patients (p = 0.0182 ± 0.0021) compared with gene expression or histopathology studies alone, and the results were replicated in an independent cohort (p = 0.0220 ± 0.0070). These results motivate the integration of histopathology and omics data to investigate molecular mechanisms of pathology findings and enhance clinical prognostic prediction.
View details for PubMedID 29153840
View details for PubMedCentralID PMC5746468
-
Pharmacogenomics-Based Point-of-Care Clinical Decision Support Significantly Alters Drug Prescribing
CLINICAL PHARMACOLOGY & THERAPEUTICS
2017; 102 (5): 859–69
Abstract
Changes in behavior are necessary to apply genomic discoveries to practice. We prospectively studied medication changes made by providers representing eight different medicine specialty clinics whose patients had submitted to preemptive pharmacogenomic genotyping. An institutional clinical decision support (CDS) system provided pharmacogenomic results using traffic light alerts: green = genomically favorable, yellow = genomic caution, red = high risk. The influence of pharmacogenomic alerts on prescribing behaviors was the primary endpoint. In all, 2,279 outpatient encounters were analyzed. Independent of other potential prescribing mediators, medications with high pharmacogenomic risk were changed significantly more often than prescription drugs lacking pharmacogenomic information (odds ratio (OR) = 26.2 (9.0-75.3), P < 0.0001). Medications with cautionary pharmacogenomic information were also changed more frequently (OR = 2.4 (1.7-3.5), P < 0.0001). No pharmacogenomically high-risk medications were prescribed during the entire study when physicians consulted the CDS tool. Pharmacogenomic information improved prescribing in patterns aimed at reducing patient risk, demonstrating that enhanced prescription decision-making is achievable through clinical integration of genomic medicine.
View details for PubMedID 28398598
View details for PubMedCentralID PMC5636653
-
PharmGKB summary: very important pharmacogene information for ABCG2
PHARMACOGENETICS AND GENOMICS
2017; 27 (11): 420–27
View details for PubMedID 28858993
View details for PubMedCentralID PMC5788016
-
Working toward precision medicine: Predicting phenotypes from exomes in the Critical Assessment of Genome Interpretation (CAGI) challenges
HUMAN MUTATION
2017; 38 (9): 1182–92
Abstract
Precision medicine aims to predict a patient's disease risk and best therapeutic options by using that individual's genetic sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. For CAGI 4, three challenges involved using exome-sequencing data: Crohn's disease, bipolar disorder, and warfarin dosing. Previous CAGI challenges included prior versions of the Crohn's disease challenge. Here, we discuss the range of techniques used for phenotype prediction as well as the methods used for assessing predictive models. Additionally, we outline some of the difficulties associated with making predictions and evaluating them. The lessons learned from the exome challenges can be applied to both research and clinical efforts to improve phenotype prediction from genotype. In addition, these challenges serve as a vehicle for sharing clinical and research exome data in a secure manner with scientists who have a broad range of expertise, contributing to a collaborative effort to advance our understanding of genotype-phenotype relationships.
View details for DOI 10.1002/humu.23280
View details for Web of Science ID 000407861100014
View details for PubMedID 28634997
View details for PubMedCentralID PMC5600620
-
PharmGKB summary: pazopanib pathway, pharmacokinetics
PHARMACOGENETICS AND GENOMICS
2017; 27 (8): 307–12
View details for PubMedID 28678138
View details for PubMedCentralID PMC5862561
-
Shallow Representation Learning via Kernel PCA Improves QSAR Modelability
JOURNAL OF CHEMICAL INFORMATION AND MODELING
2017; 57 (8): 1859–67
Abstract
Linear models offer a robust, flexible, and computationally efficient set of tools for modeling quantitative structure-activity relationships (QSARs) but have been eclipsed in performance by nonlinear methods. Support vector machines (SVMs) and neural networks are currently among the most popular and accurate QSAR methods because they learn new representations of the data that greatly improve modelability. In this work, we use shallow representation learning to improve the accuracy of L1 regularized logistic regression (LASSO) and meet the performance of Tanimoto SVM. We embedded chemical fingerprints in Euclidean space using Tanimoto (a.k.a. Jaccard) similarity kernel principal component analysis (KPCA) and compared the effects on LASSO and SVM model performance for predicting the binding activities of chemical compounds against 102 virtual screening targets. We observed similar performance and patterns of improvement for LASSO and SVM. We also empirically measured model training and cross-validation times to show that KPCA used in concert with LASSO classification is significantly faster than linear SVM over a wide range of training set sizes. Our work shows that powerful linear QSAR methods can match nonlinear methods and demonstrates a modular approach to nonlinear classification that greatly enhances QSAR model prototyping facility, flexibility, and transferability.
View details for PubMedID 28727421
View details for PubMedCentralID PMC5942586
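As a rough illustration of the similarity measure named in the abstract above, the sketch below computes Tanimoto (Jaccard) similarity between chemical fingerprints and assembles the pairwise kernel matrix that a kernel PCA step would take as input. Representing fingerprints as Python sets of on-bit indices, the function names, and the toy fingerprints are all assumptions made for illustration. This is not the authors' implementation, and the eigendecomposition step of KPCA is omitted.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity: shared on-bits / distinct on-bits."""
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union


def kernel_matrix(fingerprints: List[Set[int]]) -> List[List[float]]:
    """Symmetric matrix of pairwise Tanimoto similarities."""
    return [[tanimoto(x, y) for y in fingerprints] for x in fingerprints]


# Hypothetical toy fingerprints (sets of on-bit indices), not real compounds.
toy_fingerprints = [{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}]
K = kernel_matrix(toy_fingerprints)

for row in K:
    print(["%.3f" % value for value in row])
```

In a fuller pipeline, a matrix like `K` would be centered and eigendecomposed to obtain the Euclidean embedding that a linear model is then trained on.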
-
3D deep convolutional neural networks for amino acid environment similarity analysis
BMC BIOINFORMATICS
2017; 18: 302
Abstract
Central to protein biology is the understanding of how structural elements give rise to observed function. The surfeit of protein structural data enables development of computational methods to systematically derive rules governing structural-functional relationships. However, performance of these methods depends critically on the choice of protein structural representation. Most current methods rely on features that are manually selected based on knowledge about protein structures. These are often general-purpose but not optimized for the specific application of interest. In this paper, we present a general framework that applies 3D convolutional neural network (3DCNN) technology to structure-based protein analysis. The framework automatically extracts task-specific features from the raw atom distribution, driven by supervised labels. As a pilot study, we use our network to analyze local protein microenvironments surrounding the 20 amino acids, and predict the amino acids most compatible with environments within a protein structure. To further validate the power of our method, we construct two amino acid substitution matrices from the prediction statistics and use them to predict effects of mutations in T4 lysozyme structures. Our deep 3DCNN achieves a two-fold increase in prediction accuracy compared to models that employ conventional hand-engineered features and successfully recapitulates known information about similar and different microenvironments. Models built from our predictions and substitution matrices achieve an 85% accuracy predicting outcomes of the T4 lysozyme mutation variants. Our substitution matrices contain rich information relevant to mutation analysis compared to well-established substitution matrices. Finally, we present a visualization method to inspect the individual contributions of each atom to the classification decisions. End-to-end trained deep learning networks consistently outperform methods using hand-engineered features, suggesting that the 3DCNN framework is well suited for analysis of protein microenvironments and may be useful for other protein structural analyses.
View details for PubMedID 28615003
-
PharmGKB summary: sorafenib pathways
PHARMACOGENETICS AND GENOMICS
2017; 27 (6): 240-246
View details for DOI 10.1097/FPC.0000000000000279
View details for Web of Science ID 000400664400006
View details for PubMedID 28362716
-
Decaying relevance of clinical data towards future decisions in data-driven inpatient clinical order sets.
International journal of medical informatics
2017; 102: 71-79
Abstract
Determine how varying longitudinal historical training data can impact prediction of future clinical decisions. Estimate the "decay rate" of clinical data source relevance. We trained a clinical order recommender system, analogous to Netflix or Amazon's "Customers who bought A also bought B..." product recommenders, based on a tertiary academic hospital's structured electronic health record data. We used this system to predict future (2013) admission orders based on different subsets of historical training data (2009 through 2012), relative to existing human-authored order sets. Predicting future (2013) inpatient orders is more accurate with models trained on just one month of recent (2012) data than with 12 months of older (2009) data (ROC AUC 0.91 vs. 0.88, precision 27% vs. 22%, recall 52% vs. 43%, all P < 10^-10). Algorithmically learned models from even the older (2009) data were still more effective than existing human-authored order sets (ROC AUC 0.81, precision 16%, recall 35%). Training with more longitudinal data (2009-2012) was no better than using only the most recent (2012) data, unless applying a decaying weighting scheme with a "half-life" of data relevance of about 4 months. Clinical practice patterns (automatically) learned from electronic health record data can vary substantially across years. Gold standards for clinical decision support are elusive moving targets, reinforcing the need for automated methods that can adapt to evolving information. Prioritizing small amounts of recent data is more effective than using larger amounts of older data towards future clinical predictions.
View details for DOI 10.1016/j.ijmedinf.2017.03.006
View details for PubMedID 28495350
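To make the "half-life" idea in the abstract above concrete, here is a minimal sketch of an exponential decay weighting in which an observation loses half its weight every 4 months. The exponential form, the function names, and the toy counts are assumptions for illustration; the abstract reports only the approximate half-life, not the exact weighting formula.

```python
def decay_weight(age_months: float, half_life_months: float = 4.0) -> float:
    """Weight of an observation that is `age_months` old.

    Assumed form: the weight halves every `half_life_months`.
    """
    return 0.5 ** (age_months / half_life_months)


def weighted_count(observations, half_life_months: float = 4.0) -> float:
    """Sum counts after down-weighting older data.

    `observations` is a list of (age_months, count) pairs (hypothetical layout).
    """
    return sum(
        count * decay_weight(age, half_life_months)
        for age, count in observations
    )


# Hypothetical example: the same order seen 10 times now, 4 and 8 months ago.
history = [(0, 10), (4, 10), (8, 10)]
print(weighted_count(history))
```

Under this assumed scheme, data from a year ago contribute only one eighth of the weight of current data, which is consistent with the abstract's finding that recent data dominate predictive value.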
-
PharmGKB summary: voriconazole pathway, pharmacokinetics
PHARMACOGENETICS AND GENOMICS
2017; 27 (5): 201-209
View details for DOI 10.1097/FPC.0000000000000276
View details for Web of Science ID 000398829200005
View details for PubMedID 28277330
-
Artificial intelligence (AI) systems for interpreting complex medical datasets
CLINICAL PHARMACOLOGY & THERAPEUTICS
2017; 101 (5): 585–86
Abstract
Advances in machine intelligence have created powerful capabilities in algorithms that find hidden patterns in data, classify objects based on their measured characteristics, and associate similar patients/diseases/drugs based on common features. However, artificial intelligence (AI) applications in medical data have several technical challenges: complex and heterogeneous datasets, noisy medical datasets, and explaining their output to users. There are also social challenges related to intellectual property, data provenance, regulatory issues, economics, and liability.
View details for PubMedID 28182259
-
Opportunities for developing therapies for rare genetic diseases: focus on gain-of-function and allostery.
Orphanet journal of rare diseases
2017; 12 (1): 61
Abstract
Advances in next generation sequencing technologies have revolutionized our ability to discover the causes of rare genetic diseases. However, developing treatments for these diseases remains challenging. In fact, when we systematically analyze the US FDA orphan drug list, we find that only 8% of rare diseases have an FDA-designated drug. Our approach leverages three primary insights: first, diseases with gain-of-function mutations and late onset are more likely to have drug options; second, drugs are more often inhibitors than activators; and third, some disease-causing proteins can be rescued by allosteric activators in diseases due to loss-of-function mutations. We have developed a pipeline that combines natural language processing and human curation to mine promising targets for drug development from the Online Mendelian Inheritance in Man (OMIM) database. This pipeline targets diseases caused by well-characterized gain-of-function mutations or loss-of-function proteins with known allosteric activators. Applying this pipeline across thousands of rare genetic diseases, we discover 34 rare genetic diseases that are promising candidates for drug development. Our analysis has revealed uneven coverage of rare diseases in the current US FDA orphan drug space. Diseases with gain-of-function mutations or loss-of-function mutations and known allosteric activators should be prioritized for drug treatments.
View details for DOI 10.1186/s13023-017-0614-4
View details for PubMedID 28412959
-
Development of an automated assessment tool for MedWatch reports in the FDA adverse event reporting system.
Journal of the American Medical Informatics Association
2017
Abstract
As the US Food and Drug Administration (FDA) receives over a million adverse event reports associated with medication use every year, a system is needed to aid FDA safety evaluators in identifying reports most likely to demonstrate causal relationships to the suspect medications. We combined text mining with machine learning to construct and evaluate such a system to identify medication-related adverse event reports. FDA safety evaluators assessed 326 reports for medication-related causality. We engineered features from these reports and constructed random forest, L1 regularized logistic regression, and support vector machine models. We evaluated model accuracy and further assessed utility by generating report rankings that represented a prioritized report review process. Our random forest model showed the best performance in report ranking and accuracy, with an area under the receiver operating characteristic curve of 0.66. The generated report ordering assigns reports with a higher probability of medication-related causality a higher rank and is significantly correlated to a perfect report ordering, with a Kendall's tau of 0.24 (P = .002). Our models produced prioritized report orderings that enable FDA safety evaluators to focus on reports that are more likely to contain valuable medication-related adverse event information. Applying our models to all FDA adverse event reports has the potential to streamline the manual review process and greatly reduce reviewer workload.
View details for DOI 10.1093/jamia/ocx022
View details for PubMedID 28371826
-
Imputing gene expression to maximize platform compatibility
BIOINFORMATICS
2017; 33 (4): 522-528
Abstract
Microarray measurements of gene expression constitute a large fraction of publicly shared biological data, and are available in the Gene Expression Omnibus (GEO). Many studies use GEO data to shape hypotheses and improve statistical power. Within GEO, the Affymetrix HG-U133A and HG-U133 Plus 2.0 are the two most commonly used microarray platforms for human samples; the HG-U133 Plus 2.0 platform contains 54 220 probes and the HG-U133A array contains a proper subset (21 722 probes). When different platforms are involved, the subset of common genes is most easily compared. This approach results in the exclusion of substantial measured data and can limit downstream analysis. To predict the expression values for the genes unique to the HG-U133 Plus 2.0 platform, we constructed a series of gene expression inference models based on genes common to both platforms. Our model predicts gene expression values that are within the variability observed in controlled replicate studies and are highly correlated with measured data. Using six previously published studies, we also demonstrate the improved performance of the enlarged feature space generated by our model in downstream analysis. The gene inference model described in this paper is available as an R package (affyImpute), which can be downloaded at http://simtk.org/home/affyimpute. Contact: rbaltman@stanford.edu. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btw664
View details for Web of Science ID 000397264100008
View details for PubMedCentralID PMC5408923
-
PharmGKB summary: Macrolide antibiotic pathway, pharmacokinetics/pharmacodynamics.
Pharmacogenetics and genomics
2017
View details for DOI 10.1097/FPC.0000000000000270
View details for PubMedID 28146011
View details for PubMedCentralID PMC5346035
-
"The Pharmacogenomics Research Network Translational Pharmacogenetics Program: Outcomes and Metrics of Pharmacogenetic Implementations Across Diverse Healthcare Systems".
Clinical pharmacology & therapeutics
2017
Abstract
Numerous pharmacogenetic clinical guidelines and recommendations have been published, but barriers have hindered the clinical implementation of pharmacogenetics. The Translational Pharmacogenetics Program (TPP) of the NIH Pharmacogenomics Research Network was established in 2011 to catalog and contribute to the development of pharmacogenetic implementations at eight US healthcare systems, with the goal to disseminate real-world solutions for the barriers to clinical pharmacogenetic implementation. The TPP collected and normalized pharmacogenetic implementation metrics through June 2015, including gene-drug pairs implemented, interpretations of alleles and diplotypes, numbers of tests performed and actionable results, and workflow diagrams. TPP participant institutions developed diverse solutions to overcome many barriers, but the use of Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines provided some consistency among the institutions. The TPP also collected some pharmacogenetic implementation outcomes (scientific, educational, financial, and informatics), which may inform healthcare systems seeking to implement their own pharmacogenetic testing programs.
View details for DOI 10.1002/cpt.630
View details for PubMedID 28090649
-
Cohort-specific imputation of gene expression improves prediction of warfarin dose for African Americans.
Genome medicine
2017; 9 (1): 98
Abstract
Genome-wide association studies are useful for discovering genotype-phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into "gene level" effects. Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression, on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort.
View details for PubMedID 29178968
-
Flexible Analog Search with Kernel PCA Embedded Molecule Vectors.
Computational and structural biotechnology journal
2017; 15: 320-327
Abstract
Studying analog series to find structural transformations that enhance the activity and ADME properties of lead compounds is an important part of drug development. Matched molecular pair (MMP) search is a powerful tool for analog analysis that imitates researchers' ability to select pairs of compounds that differ only by small well-defined transformations. Abstraction is a challenge for existing MMP search algorithms, which can result in the omission of relevant, inexact MMPs, and inclusion of irrelevant, contextually dissimilar MMPs. In this work, we present a new method for MMP search that returns approximate results and enables flexible control over abstraction of contextual information. We illustrate the concepts and mechanics of our method with a series of exemplar MMP queries, and then benchmark search accuracy using MMPs found by fragment indexing. We show that we can search for MMPs in a context dependent manner, and accurately approximate context independent fragment index based MMP search over a range of fingerprint and dataset conditions. Our method can be used to search for pairwise correspondences among analog sets and bolster MMP datasets where data is missing or incomplete.
View details for DOI 10.1016/j.csbj.2017.03.003
View details for PubMedID 28458783
-
PharmGKB summary: ivacaftor pathway, pharmacokinetics/pharmacodynamics
PHARMACOGENETICS AND GENOMICS
2017; 27 (1): 39-42
View details for DOI 10.1097/FPC.0000000000000246
View details for Web of Science ID 000390851300005
View details for PubMedCentralID PMC5140711
-
STAMS: STRING-assisted module search for genome wide association studies and application to autism.
Bioinformatics
2016; 32 (24): 3815-3822
Abstract
Analyzing genome wide association data in the context of biological pathways helps us understand how genetic variation influences phenotype and increases power to find associations. However, the utility of pathway-based analysis tools is hampered by undercuration and reliance on a distribution of signal across all of the genes in a pathway. Methods that combine genome wide association results with genetic networks to infer the key phenotype-modulating subnetworks combat these issues, but have primarily been limited to network definitions with yes/no labels for gene-gene interactions. A recent method (EW_dmGWAS) incorporates a biological network with weighted edge probability by requiring a secondary phenotype-specific expression dataset. In this article, we combine an algorithm for weighted-edge module searching and a probabilistic interaction network in order to develop a method, STAMS, for recovering modules of genes with strong associations to the phenotype and probable biologic coherence. Our method builds on EW_dmGWAS but does not require a secondary expression dataset and performs better in six test cases. We show that our algorithm improves over EW_dmGWAS and standard gene-based analysis by measuring precision and recall of each method on separately identified associations. In the Wellcome Trust Rheumatoid Arthritis study, STAMS-identified modules were more enriched for separately identified associations than EW_dmGWAS (STAMS P-value = 3.0 × 10^-4; EW_dmGWAS P-value = 0.8). We demonstrate that the area under the Precision-Recall curve is 5.9 times higher with STAMS than EW_dmGWAS run on the Wellcome Trust Type 1 Diabetes data. STAMS is implemented as an R package and is freely available at https://simtk.org/projects/stams. Contact: rbaltman@stanford.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
View details for PubMedID 27542772
-
PharmGKB summary: very important pharmacogene information for MT-RNR1.
Pharmacogenetics and genomics
2016; 26 (12): 558-567
View details for PubMedID 27654872
-
The International SSRI Pharmacogenomics Consortium (ISPC): a genome-wide association study of antidepressant treatment response.
Translational psychiatry
2016; 6 (11)
View details for DOI 10.1038/tp.2016.187
View details for PubMedID 27801898
View details for PubMedCentralID PMC5314112
-
Exploring the Heritability of Pharmacogene Expression
WILEY-BLACKWELL. 2016: 653
View details for Web of Science ID 000386034800127
-
Imputing Gene Expression to Maximize Platform Compatibility.
Bioinformatics
2016
Abstract
Microarray measurements of gene expression constitute a large fraction of publicly shared biological data, and are available in the Gene Expression Omnibus (GEO). Many studies use GEO data to shape hypotheses and improve statistical power. Within GEO, the Affymetrix HG-U133A and HG-U133 Plus 2.0 are the two most commonly used microarray platforms for human samples; the HG-U133 Plus 2.0 platform contains 54 220 probes and the HG-U133A array contains a proper subset (21 722 probes). When different platforms are involved, the subset of common genes is most easily compared. This approach results in the exclusion of substantial measured data and can limit downstream analysis. To predict the expression values for the genes unique to the HG-U133 Plus 2.0 platform, we constructed a series of gene expression inference models based on genes common to both platforms. Our model predicts gene expression values that are within the variability observed in controlled replicate studies and are highly correlated with measured data. Using six previously published studies, we also demonstrate the improved performance of the enlarged feature space generated by our model in downstream analysis. The gene inference model described in this paper is available as an R package (affyImpute), which can be downloaded at http://simtk.org/home/affyimpute. Contact: rbaltman@stanford.edu. Supplementary data are available at Bioinformatics online.
View details for PubMedID 27797771
-
Estimation of Maximum Recommended Therapeutic Dose Using Predicted Promiscuity and Potency.
Clinical and translational science
2016
Abstract
We report a simple model that predicts the maximum recommended therapeutic dose (MRTD) of small molecule drugs based on an assessment of likely protein-drug interactions. Previously, we reported methods for computational estimation of drug promiscuity and potency. We used these concepts to build a linear model derived from 238 small molecular drugs to predict MRTD. We applied this model successfully to predict MRTDs for 16 nonsteroidal antiinflammatory drugs (NSAIDs) and 14 antiretroviral drugs. Of note, based on the estimated promiscuity of low-dose drugs (and active chemicals), we identified 83 proteins as "high-risk off-targets" (HROTs) that are often associated with low doses; the evaluation of interactions with HROTs may be useful during early phases of drug discovery. Our model helps explain the MRTD for drugs with severe adverse reactions caused by interactions with HROTs.
View details for DOI 10.1111/cts.12422
View details for PubMedID 27736015
View details for PubMedCentralID PMC5161261
-
Computing disease incidence, prevalence and comorbidity from electronic medical records.
Journal of biomedical informatics
2016; 63: 108-111
Abstract
Electronic medical records (EMR) represent a convenient source of coded medical data, but disease patterns found in EMRs may be biased when compared to surveys based on sampling. In this communication we draw attention to complications that arise when using EMR data to calculate disease prevalence, incidence, age of onset, and disease comorbidity. We review known solutions to these problems and identify challenges for future work.
View details for DOI 10.1016/j.jbi.2016.08.005
View details for PubMedID 27498067
-
Response to Open Peer Commentaries on "Human Germline CRISPR-Cas Modification: Toward a Regulatory Framework".
American journal of bioethics
2016; 16 (10): W1-2
View details for DOI 10.1080/15265161.2016.1214308
View details for PubMedID 27653416
-
Predicting inpatient clinical order patterns with probabilistic topic models vs conventional order sets.
Journal of the American Medical Informatics Association
2016
Abstract
Build probabilistic topic model representations of hospital admissions processes and compare the ability of such models to predict clinical order patterns as compared to preconstructed order sets. The authors evaluated the first 24 hours of structured electronic health record data for >10K inpatients. Drawing an analogy between structured items (e.g., clinical orders) and words in a text document, the authors performed latent Dirichlet allocation probabilistic topic modeling. These topic models use initial clinical information to predict clinical orders for a separate validation set of >4K patients. The authors evaluated these topic model-based predictions vs existing human-authored order sets by area under the receiver operating characteristic curve, precision, and recall for subsequent clinical orders. Existing order sets predict clinical orders used within 24 hours with area under the receiver operating characteristic curve 0.81, precision 16%, and recall 35%. This can be improved to 0.90, 24%, and 47% (P < 10^-20) by using probabilistic topic models to summarize clinical data into up to 32 topics. Many of these latent topics yield natural clinical interpretations (e.g., "critical care," "pneumonia," "neurologic evaluation"). Existing order sets tend to provide nonspecific, process-oriented aid, with usability limitations impairing more precise, patient-focused support. Algorithmic summarization has the potential to breach this usability barrier by automatically inferring patient context, but with potential tradeoffs in interpretability. Probabilistic topic modeling provides an automated approach to detect thematic trends in patient care and generate decision support content. A potential use case finds related clinical orders for decision support.
View details for DOI 10.1093/jamia/ocw136
View details for PubMedID 27655861
-
PharmGKB summary: ivacaftor pathway, pharmacokinetics/pharmacodynamics.
Pharmacogenetics and genomics
2016
View details for PubMedID 27636560
-
Population-specific single-nucleotide polymorphism confers increased risk of venous thromboembolism in African Americans.
Molecular genetics & genomic medicine
2016; 4 (5): 513-520
Abstract
African Americans have a higher incidence of venous thromboembolism (VTE) than European descent individuals. However, the typical genetic risk factors in populations of European descent are nearly absent in African Americans, and population-specific genetic factors influencing the higher VTE rate are not well characterized. We performed a candidate gene analysis on an exome-sequenced African American family with recurrent VTE and identified a variant in Protein S (PROS1) V510M (rs138925964). We assessed the population impact of PROS1 V510M using a multicenter African American cohort of 306 cases with VTE compared to 370 controls. Additionally, we compared our case cohort to a background population cohort of 2203 African Americans in the NHLBI GO Exome Sequencing Project (ESP). In the African American family with recurrent VTE, we found prior laboratories for our cases indicating low free Protein S levels, providing functional support for PROS1 V510M as the causative mutation. Additionally, this variant was significantly enriched in the VTE cases of our multicenter case-control study (Fisher's Exact Test, P = 0.0041, OR = 4.62, 95% CI: 1.51-15.20; allele frequencies - cases: 2.45%, controls: 0.54%). Similarly, PROS1 V510M was also enriched in our VTE case cohort compared to African Americans in the ESP cohort (Fisher's Exact Test, P = 0.010, OR = 2.28, 95% CI: 1.26-4.10). We found a variant, PROS1 V510M, in an African American family with VTE and clinical laboratory abnormalities in Protein S. Additionally, we found that this variant conferred increased risk of VTE in a case-control study of African Americans. In the ESP cohort, the variant is nearly absent in ESP European descent subjects (n = 3, allele frequency: 0.03%). Additionally, in 1000 Genomes Phase 3 data, the variant only appears in African descent populations. Thus, PROS1 V510M is a population-specific genetic risk factor for VTE in African Americans.
View details for DOI 10.1002/mgg3.226
View details for PubMedID 27652279
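The case-control odds ratio quoted in the abstract above can be approximately reproduced from the reported allele frequencies. The sketch below back-calculates allele counts under two assumptions that the abstract does not state: every individual contributes two genotyped alleles, and the reported percentages are rounded from whole allele counts. The function name and the derived counts are illustrative only, and the sample cross-product ratio shown here is not the Fisher's exact test itself.

```python
def odds_ratio(case_alt: int, case_total: int,
               ctrl_alt: int, ctrl_total: int) -> float:
    """Sample (cross-product) odds ratio from a 2x2 allele-count table."""
    case_ref = case_total - case_alt
    ctrl_ref = ctrl_total - ctrl_alt
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


# Cohort sizes from the abstract; two alleles per person is an assumption.
case_alleles = 2 * 306
ctrl_alleles = 2 * 370

# Back-calculated variant allele counts from the reported frequencies
# (2.45% of cases, 0.54% of controls); these counts are inferred, not reported.
case_alt = round(0.0245 * case_alleles)
ctrl_alt = round(0.0054 * ctrl_alleles)

estimate = odds_ratio(case_alt, case_alleles, ctrl_alt, ctrl_alleles)
print(case_alt, ctrl_alt, round(estimate, 2))
```

Under these assumptions the estimate lands close to the reported OR of 4.62; the published confidence interval and P value come from the exact test and are not recomputed here.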
-
PharmGKB summary: isoniazid pathway, pharmacokinetics.
Pharmacogenetics and genomics
2016; 26 (9): 436-444
View details for DOI 10.1097/FPC.0000000000000232
View details for PubMedID 27232112
View details for PubMedCentralID PMC4970941
-
Human induced pluripotent stem cell-derived cardiomyocytes recapitulate the predilection of breast cancer patients to doxorubicin-induced cardiotoxicity
NATURE MEDICINE
2016; 22 (5): 547-556
Abstract
Doxorubicin is an anthracycline chemotherapy agent effective in treating a wide range of malignancies, but it causes a dose-related cardiotoxicity that can lead to heart failure in a subset of patients. At present, it is not possible to predict which patients will be affected by doxorubicin-induced cardiotoxicity (DIC). Here we demonstrate that patient-specific human induced pluripotent stem cell-derived cardiomyocytes (hiPSC-CMs) can recapitulate the predilection to DIC of individual patients at the cellular level. hiPSC-CMs derived from individuals with breast cancer who experienced DIC were consistently more sensitive to doxorubicin toxicity than hiPSC-CMs from patients who did not experience DIC, with decreased cell viability, impaired mitochondrial and metabolic function, impaired calcium handling, decreased antioxidant pathway activity, and increased reactive oxygen species production. Taken together, our data indicate that hiPSC-CMs are a suitable platform to identify and characterize the genetic basis and molecular mechanisms of DIC.
View details for DOI 10.1038/nm.4087
View details for PubMedID 27089514
-
Constraints on Biological Mechanism from Disease Comorbidity Using Electronic Medical Records and Database of Genetic Variants
PLOS COMPUTATIONAL BIOLOGY
2016; 12 (4)
Abstract
Patterns of disease co-occurrence that deviate from statistical independence may represent important constraints on biological mechanism, which sometimes can be explained by shared genetics. In this work we study the relationship between disease co-occurrence and commonly shared genetic architecture of disease. Records of pairs of diseases were combined from two different electronic medical systems (Columbia, Stanford), and compared to a large database of published disease-associated genetic variants (VARIMED); data on 35 disorders were available across all three sources, which include medical records for over 1.2 million patients and variants from over 17,000 publications. Based on the sources in which they appeared, disease pairs were categorized as having predominant clinical, genetic, or both kinds of manifestations. Confounding effects of age on disease incidence were controlled for by only comparing diseases when they fall in the same cluster of similarly shaped incidence patterns. We find that disease pairs that are overrepresented in both electronic medical record systems and in VARIMED come from two main disease classes, autoimmune and neuropsychiatric. We furthermore identify specific genes that are shared within these disease groups.
View details for DOI 10.1371/journal.pcbi.1004885
View details for Web of Science ID 000376584400019
View details for PubMedID 27115429
View details for PubMedCentralID PMC4846031
-
PharmGKB summary: very important pharmacogene information for RYR1
PHARMACOGENETICS AND GENOMICS
2016; 26 (3): 138-144
View details for DOI 10.1097/FPC.0000000000000198
View details for Web of Science ID 000373526700005
View details for PubMedID 26709912
View details for PubMedCentralID PMC4738161
-
OrderRex: clinical order decision support and outcome predictions by data-mining electronic medical records.
Journal of the American Medical Informatics Association
2016; 23 (2): 339-348
Abstract
To answer a "grand challenge" in clinical decision support, the authors produced a recommender system that automatically data-mines inpatient decision support from electronic medical records (EMR), analogous to Netflix or Amazon.com's product recommender. EMR data were extracted from 1 year of hospitalizations (>18K patients with >5.4M structured items including clinical orders, lab results, and diagnosis codes). Association statistics were counted for the ∼1.5K most common items to drive an order recommender. The authors assessed the recommender's ability to predict hospital admission orders and outcomes based on initial encounter data from separate validation patients. Compared to a reference benchmark of using the overall most common orders, the recommender using temporal relationships improves precision at 10 recommendations from 33% to 38% (P < 10^-10) for hospital admission orders. Relative risk-based association methods improve inverse frequency weighted recall from 4% to 16% (P < 10^-16). The framework yields a prediction receiver operating characteristic area under curve (c-statistic) of 0.84 for 30 day mortality, 0.84 for 1 week need for ICU life support, 0.80 for 1 week hospital discharge, and 0.68 for 30-day readmission. Recommender results quantitatively improve on reference benchmarks and qualitatively appear clinically reasonable. The method assumes that aggregate decision making converges appropriately, but ongoing evaluation is necessary to discern common behaviors from "correct" ones. Collaborative filtering recommender algorithms generate clinical decision support that is predictive of real practice patterns and clinical outcomes. Incorporating temporal relationships improves accuracy. Different evaluation metrics satisfy different goals (predicting likely events vs. "interesting" suggestions).
View details for DOI 10.1093/jamia/ocv091
View details for PubMedID 26198303
-
Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features.
Nature communications
2016; 7: 12474
Abstract
Lung cancer is the most prevalent cancer worldwide, and histopathological assessment is indispensable for its diagnosis. However, human evaluation of pathology slides cannot accurately predict patients' prognoses. In this study, we obtain 2,186 haematoxylin and eosin stained histopathology whole-slide images of lung adenocarcinoma and squamous cell carcinoma patients from The Cancer Genome Atlas (TCGA), and 294 additional images from Stanford Tissue Microarray (TMA) Database. We extract 9,879 quantitative image features and use regularized machine-learning methods to select the top features and to distinguish shorter-term survivors from longer-term survivors with stage I adenocarcinoma (P<0.003) or squamous cell carcinoma (P=0.023) in the TCGA data set. We validate the survival prediction framework with the TMA cohort (P<0.036 for both tumour types). Our results suggest that automatically derived image features can predict the prognosis of lung cancer patients and thereby contribute to precision oncology. Our methods are extensible to histopathology images of other organs.
View details for DOI 10.1038/ncomms12474
View details for PubMedID 27527408
-
Large-scale extraction of gene interactions from full-text literature using DeepDive.
Bioinformatics (Oxford, England)
2016; 32 (1): 106-13
Abstract
A complete repository of gene-gene interactions is key for understanding cellular processes, human disease and drug response. These gene-gene interactions include both protein-protein interactions and transcription factor interactions. The majority of known interactions are found in the biomedical literature. Interaction databases, such as BioGRID and ChEA, annotate these gene-gene interactions; however, curation becomes difficult as the literature grows exponentially. DeepDive is a trained system for extracting information from a variety of sources, including text. In this work, we used DeepDive to extract both protein-protein and transcription factor interactions from over 100,000 full-text PLOS articles. We built an extractor for gene-gene interactions that identified candidate gene-gene relations within an input sentence. For each candidate relation, DeepDive computed a probability that the relation was a correct interaction. We evaluated this system against the Database of Interacting Proteins and against randomly curated extractions. Our system achieved 76% precision and 49% recall in extracting direct and indirect interactions involving gene symbols co-occurring in a sentence. For randomly curated extractions, the system achieved between 62% and 83% precision based on direct or indirect interactions, as well as sentence-level and document-level precision. Overall, our system extracted 3356 unique gene pairs using 724 features from over 100,000 full-text articles. Application source code is publicly available at https://github.com/edoughty/deepdive_genegene_app. Contact: russ.altman@stanford.edu. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btv476
View details for PubMedID 26338771
View details for PubMedCentralID PMC4681986
-
DYNAMICALLY EVOLVING CLINICAL PRACTICES AND IMPLICATIONS FOR PREDICTING MEDICAL DECISIONS
WORLD SCIENTIFIC PUBL CO PTE LTD. 2016: 195–206
View details for Web of Science ID 000386326200019
-
Current Progress in Bioinformatics 2016
BRIEFINGS IN BIOINFORMATICS
2016; 17 (1): 1
View details for DOI 10.1093/bib/bbv105
View details for Web of Science ID 000369219800001
View details for PubMedID 26628559
-
STAMS: STRING-assisted module search for genome wide association studies and application to autism
Bioinformatics
2016: 3815–22
Abstract
Analyzing genome wide association data in the context of biological pathways helps us understand how genetic variation influences phenotype and increases power to find associations. However, the utility of pathway-based analysis tools is hampered by undercuration and reliance on a distribution of signal across all of the genes in a pathway. Methods that combine genome wide association results with genetic networks to infer the key phenotype-modulating subnetworks combat these issues, but have primarily been limited to network definitions with yes/no labels for gene-gene interactions. A recent method (EW_dmGWAS) incorporates a biological network with weighted edge probability by requiring a secondary phenotype-specific expression dataset. In this article, we combine an algorithm for weighted-edge module searching and a probabilistic interaction network in order to develop a method, STAMS, for recovering modules of genes with strong associations to the phenotype and probable biologic coherence. Our method builds on EW_dmGWAS but does not require a secondary expression dataset and performs better in six test cases. We show that our algorithm improves over EW_dmGWAS and standard gene-based analysis by measuring precision and recall of each method on separately identified associations. In the Wellcome Trust Rheumatoid Arthritis study, STAMS-identified modules were more enriched for separately identified associations than EW_dmGWAS (STAMS P-value = 3.0 × 10^-4; EW_dmGWAS P-value = 0.8). We demonstrate that the area under the Precision-Recall curve is 5.9 times higher with STAMS than EW_dmGWAS run on the Wellcome Trust Type 1 Diabetes data. STAMS is implemented as an R package and is freely available at https://simtk.org/projects/stams. Contact: rbaltman@stanford.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btw530
View details for PubMedCentralID PMC5167061
-
SEPARATING THE CAUSES AND CONSEQUENCES IN DISEASE TRANSCRIPTOME.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2016; 21: 381–92
Abstract
The causes of complex diseases are multifactorial and the phenotypes of complex diseases are typically heterogeneous, posing significant challenges for both the experiment design and statistical inference in the study of such diseases. Transcriptome profiling can potentially provide key insights on the pathogenesis of diseases, but the signals from the disease causes and consequences are intertwined, leaving it to speculation which are likely causal. Genome-wide association study on the other hand provides direct evidence on the potential genetic causes of diseases, but it does not provide a comprehensive view of disease pathogenesis, and it has difficulties in detecting the weak signals from individual genes. Here we propose an approach, diseaseExPatho, that combines transcriptome data, regulome knowledge, and GWAS results if available, for separating the causes and consequences in the disease transcriptome. DiseaseExPatho computationally deconvolutes the expression data into gene expression modules, hierarchically ranks the modules based on regulome using a novel algorithm, and given GWAS data, it directly labels the potential causal gene modules based on their correlations with genome-wide gene-disease associations. Strikingly, we observed that the putative causal modules are not necessarily differentially expressed in disease, while the other modules can show strong differential expression without enrichment of top GWAS variations. On the other hand, we showed that the regulatory network based module ranking prioritized the putative causal modules consistently in 6 diseases. We suggest that the approach is applicable to other common and rare complex diseases to prioritize causal pathways with or without genome-wide association studies.
View details for PubMedID 26776202
-
DYNAMICALLY EVOLVING CLINICAL PRACTICES AND IMPLICATIONS FOR PREDICTING MEDICAL DECISIONS.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2016; 21: 195-206
Abstract
Automatically data-mining clinical practice patterns from electronic health records (EHR) can enable prediction of future practices as a form of clinical decision support (CDS). Our objective is to determine the stability of learned clinical practice patterns over time and what implication this has when using varying longitudinal historical data sources towards predicting future decisions. We trained an association rule engine for clinical orders (e.g., labs, imaging, medications) using structured inpatient data from a tertiary academic hospital. Comparing top order associations per admission diagnosis from training data in 2009 vs. 2012, we find practice variability from unstable diagnoses with rank biased overlap (RBO)<0.35 (e.g., pneumonia) to stable admissions for planned procedures (e.g., chemotherapy, surgery) with comparatively high RBO>0.6. Predicting admission orders for future (2013) patients with associations trained on recent (2012) vs. older (2009) data improved accuracy evaluated by area under the receiver operating characteristic curve (ROC-AUC) 0.89 to 0.92, precision at ten (positive predictive value of the top ten predictions against actual orders) 30% to 37%, and weighted recall (sensitivity) at ten 2.4% to 13% (P<10^-10). Training with more longitudinal data (2009-2012) was no better than only using recent (2012) data. Secular trends in practice patterns likely explain why smaller but more recent training data is more accurate at predicting future practices.
View details for PubMedID 26776186
-
Towards Clinical Bioinformatics: Redux 2015.
Yearbook of medical informatics
2016: S6-7
Abstract
In 2004, medical informatics as a scientific community recognized an emerging field of "clinical bioinformatics" that included work bringing bioinformatics data and knowledge into the clinic. In the intervening decade, "translational biomedical informatics" has emerged as the umbrella term for the work that brings together biological entities and clinical entities. The major challenges continue: understanding the clinical significance of basic 'omics' (and other) measurements, and communicating this to increasingly empowered patients/consumers who often have access to this information outside usual medical channels. It has become clear that basic molecular information must be combined with environmental and lifestyle data to fully define, predict, and manage health status.
View details for DOI 10.15265/IYS-2016-s007
View details for PubMedID 27199190
-
Large-scale extraction of gene interactions from full-text literature using DeepDive
BIOINFORMATICS
2016; 32 (1): 106-113
Abstract
A complete repository of gene-gene interactions is key for understanding cellular processes, human disease and drug response. These gene-gene interactions include both protein-protein interactions and transcription factor interactions. The majority of known interactions are found in the biomedical literature. Interaction databases, such as BioGRID and ChEA, annotate these gene-gene interactions; however, curation becomes difficult as the literature grows exponentially. DeepDive is a trained system for extracting information from a variety of sources, including text. In this work, we used DeepDive to extract both protein-protein and transcription factor interactions from over 100,000 full-text PLOS articles. We built an extractor for gene-gene interactions that identified candidate gene-gene relations within an input sentence. For each candidate relation, DeepDive computed a probability that the relation was a correct interaction. We evaluated this system against the Database of Interacting Proteins and against randomly curated extractions. Our system achieved 76% precision and 49% recall in extracting direct and indirect interactions involving gene symbols co-occurring in a sentence. For randomly curated extractions, the system achieved between 62% and 83% precision based on direct or indirect interactions, as well as sentence-level and document-level precision. Overall, our system extracted 3356 unique gene pairs using 724 features from over 100,000 full-text articles. Application source code is publicly available at https://github.com/edoughty/deepdive_genegene_app. Contact: russ.altman@stanford.edu. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btv476
View details for Web of Science ID 000368357800013
View details for PubMedCentralID PMC4681986
-
PharmGKB summary: succinylcholine pathway, pharmacokinetics/pharmacodynamics
PHARMACOGENETICS AND GENOMICS
2015; 25 (12): 622-630
View details for DOI 10.1097/FPC.0000000000000170
View details for Web of Science ID 000364626100006
View details for PubMedID 26398623
View details for PubMedCentralID PMC4631707
-
Human Germline CRISPR-Cas Modification: Toward a Regulatory Framework.
American journal of bioethics
2015; 15 (12): 25-29
Abstract
CRISPR germline editing therapies (CGETs) hold unprecedented potential to eradicate hereditary disorders. However, the prospect of altering the human germline has sparked a debate over the safety, efficacy, and morality of CGETs, triggering a funding moratorium by the NIH. There is an urgent need for practical paths for the evaluation of these capabilities. We propose a model regulatory framework for CGET research, clinical development, and distribution. Our model takes advantage of existing legal and regulatory institutions but adds elevated scrutiny at each stage of CGET development to accommodate the unique technical and ethical challenges posed by germline editing.
View details for DOI 10.1080/15265161.2015.1104160
View details for PubMedID 26632357
-
Unmet needs: Research helps regulators do their jobs
SCIENCE TRANSLATIONAL MEDICINE
2015; 7 (315)
Abstract
A plethora of innovative new medical products along with the need to apply modern technologies to medical-product evaluation has spurred seminal opportunities in regulatory sciences. Here, we provide eight examples of regulatory science research for diverse products. Opportunities abound, particularly in data science and precision health.
View details for DOI 10.1126/scitranslmed.aac4369
View details for Web of Science ID 000366135900002
View details for PubMedID 26606966
-
Informatics: Make sense of health data
NATURE
2015; 527 (7576): 31–32
View details for PubMedID 26536942
-
Personalization in practice
SCIENCE
2015; 350 (6258): 282–83
View details for PubMedID 26472898
-
Sequence to Medical Phenotypes: A Framework for Interpretation of Human Whole Genome DNA Sequence Data.
PLoS genetics
2015; 11 (10)
Abstract
High throughput sequencing has facilitated a precipitous drop in the cost of genomic sequencing, prompting predictions of a revolution in medicine via genetic personalization of diagnostic and therapeutic strategies. There are significant barriers to realizing this goal that are related to the difficult task of interpreting personal genetic variation. A comprehensive, widely accessible application for interpretation of whole genome sequence data is needed. Here, we present a series of methods for identification of genetic variants and genotypes with clinical associations, phasing genetic data and using Mendelian inheritance for quality control, and providing predictive genetic information about risk for rare disease phenotypes and response to pharmacological therapy in single individuals and father-mother-child trios. We demonstrate application of these methods for disease and drug response prognostication in whole genome sequence data from twelve unrelated adults, and for disease gene discovery in one father-mother-child trio with apparently simplex congenital ventricular arrhythmia. In doing so we identify clinically actionable inherited disease risk and drug response genotypes in pre-symptomatic individuals. We also nominate a new candidate gene in congenital arrhythmia, ATP2B4, and provide experimental evidence of a regulatory role for variants discovered using this framework.
View details for DOI 10.1371/journal.pgen.1005496
View details for PubMedID 26448358
-
Sequence to Medical Phenotypes: A Framework for Interpretation of Human Whole Genome DNA Sequence Data
PLOS GENETICS
2015; 11 (10)
Abstract
High throughput sequencing has facilitated a precipitous drop in the cost of genomic sequencing, prompting predictions of a revolution in medicine via genetic personalization of diagnostic and therapeutic strategies. There are significant barriers to realizing this goal that are related to the difficult task of interpreting personal genetic variation. A comprehensive, widely accessible application for interpretation of whole genome sequence data is needed. Here, we present a series of methods for identification of genetic variants and genotypes with clinical associations, phasing genetic data and using Mendelian inheritance for quality control, and providing predictive genetic information about risk for rare disease phenotypes and response to pharmacological therapy in single individuals and father-mother-child trios. We demonstrate application of these methods for disease and drug response prognostication in whole genome sequence data from twelve unrelated adults, and for disease gene discovery in one father-mother-child trio with apparently simplex congenital ventricular arrhythmia. In doing so we identify clinically actionable inherited disease risk and drug response genotypes in pre-symptomatic individuals. We also nominate a new candidate gene in congenital arrhythmia, ATP2B4, and provide experimental evidence of a regulatory role for variants discovered using this framework.
View details for DOI 10.1371/journal.pgen.1005496
View details for Web of Science ID 000364401600008
View details for PubMedID 26448358
View details for PubMedCentralID PMC4598191
-
PharmGKB summary: peginterferon-alpha pathway
PHARMACOGENETICS AND GENOMICS
2015; 25 (9): 465-474
View details for DOI 10.1097/FPC.0000000000000158
View details for Web of Science ID 000359645700006
View details for PubMedID 26111151
-
High Resolution Prediction of Calcium-Binding Sites in 3D Protein Structures Using FEATURE.
Journal of chemical information and modeling
2015; 55 (8): 1663-1672
Abstract
Metal-binding proteins are ubiquitous in biological systems ranging from enzymes to cell surface receptors. Among the various biologically active metal ions, calcium plays a large role in regulating cellular and physiological changes. With the increasing number of high-quality crystal structures of proteins associated with their metal ion ligands, many groups have built models to identify Ca²⁺ sites in proteins, utilizing information such as structure, geometry, or homology to do the inference. We present a FEATURE-based approach in building such a model and show that our model is able to discriminate between nonsites and calcium-binding sites with a very high precision of more than 98%. We demonstrate the high specificity of our model by applying it to test sets constructed from other ions. We also introduce an algorithm to convert high scoring regions into specific site predictions and demonstrate the usage by scanning a test set of 91 calcium-binding protein structures (190 calcium sites). The algorithm has a recall of more than 93% on the test set with predictions found within 3 Å of the actual sites.
View details for DOI 10.1021/acs.jcim.5b00367
View details for PubMedID 26226489
-
Assessment of the Radiation Effects of Cardiac CT Angiography Using Protein and Genetic Biomarkers
JACC-CARDIOVASCULAR IMAGING
2015; 8 (8): 873-884
View details for DOI 10.1016/j.jcmg.2015.04.016
View details for Web of Science ID 000359895400001
View details for PubMedID 26210695
-
PharmGKB summary: pathways of acetaminophen metabolism at the therapeutic versus toxic doses.
Pharmacogenetics and genomics
2015; 25 (8): 416-26
View details for DOI 10.1097/FPC.0000000000000150
View details for PubMedID 26049587
View details for PubMedCentralID PMC4498995
-
An ontology for Autism Spectrum Disorder (ASD) to infer ASD phenotypes from Autism Diagnostic Interview-Revised data
JOURNAL OF BIOMEDICAL INFORMATICS
2015; 56: 333-347
Abstract
Our goal is to create an ontology that will allow data integration and reasoning with subject data to classify subjects, and based on this classification, to infer new knowledge on Autism Spectrum Disorder (ASD) and related neurodevelopmental disorders (NDD). We take a first step toward this goal by extending an existing autism ontology to allow automatic inference of ASD phenotypes and Diagnostic & Statistical Manual of Mental Disorders (DSM) criteria based on subjects' Autism Diagnostic Interview-Revised (ADI-R) assessment data. Knowledge regarding diagnostic instruments, ASD phenotypes and risk factors was added to augment an existing autism ontology via Ontology Web Language class definitions and semantic web rules. We developed a custom Protégé plugin for enumerating combinatorial OWL axioms to support the many-to-many relations of ADI-R items to diagnostic categories in the DSM. We utilized a reasoner to infer whether 2642 subjects, whose data was obtained from the Simons Foundation Autism Research Initiative, meet DSM-IV-TR (DSM-IV) and DSM-5 diagnostic criteria based on their ADI-R data. We extended the ontology by adding 443 classes and 632 rules that represent phenotypes, along with their synonyms, environmental risk factors, and frequency of comorbidities. Applying the rules on the data set showed that the method produced accurate results: the true positive and true negative rates for inferring autistic disorder diagnosis according to DSM-IV criteria were 1 and 0.065, respectively; the true positive rate for inferring ASD based on DSM-5 criteria was 0.94. The ontology allows automatic inference of subjects' disease phenotypes and diagnosis with high accuracy. The ontology may benefit future studies by serving as a knowledge base for ASD. In addition, by adding knowledge of related NDDs, commonalities and differences in manifestations and risk factors could be automatically inferred, contributing to the understanding of ASD pathophysiology.
View details for DOI 10.1016/j.jbi.2015.06.026
View details for Web of Science ID 000359752100030
View details for PubMedID 26151311
View details for PubMedCentralID PMC4532604
-
High Resolution Prediction of Calcium-Binding Sites in 3D Protein Structures Using FEATURE
JOURNAL OF CHEMICAL INFORMATION AND MODELING
2015; 55 (8): 1663-1672
Abstract
Metal-binding proteins are ubiquitous in biological systems ranging from enzymes to cell surface receptors. Among the various biologically active metal ions, calcium plays a large role in regulating cellular and physiological changes. With the increasing number of high-quality crystal structures of proteins associated with their metal ion ligands, many groups have built models to identify Ca²⁺ sites in proteins, utilizing information such as structure, geometry, or homology to do the inference. We present a FEATURE-based approach in building such a model and show that our model is able to discriminate between nonsites and calcium-binding sites with a very high precision of more than 98%. We demonstrate the high specificity of our model by applying it to test sets constructed from other ions. We also introduce an algorithm to convert high scoring regions into specific site predictions and demonstrate the usage by scanning a test set of 91 calcium-binding protein structures (190 calcium sites). The algorithm has a recall of more than 93% on the test set with predictions found within 3 Å of the actual sites.
View details for DOI 10.1021/acs.jcim.5b00367
View details for Web of Science ID 000360322800016
View details for PubMedCentralID PMC4731830
-
PharmGKB summary: pathways of acetaminophen metabolism at the therapeutic versus toxic doses
PHARMACOGENETICS AND GENOMICS
2015; 25 (8): 416-426
View details for DOI 10.1097/FPC.0000000000000150
View details for Web of Science ID 000357993500007
View details for PubMedCentralID PMC4498995
-
Potential Adverse Effects of Anesthesia in Children Reply
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION
2015; 314 (4): 409
View details for PubMedID 26219066
-
Relating Essential Proteins to Drug Side-Effects Using Canonical Component Analysis: A Structure-Based Approach.
Journal of chemical information and modeling
2015; 55 (7): 1483-1494
Abstract
The molecular mechanism of many drug side-effects is unknown and difficult to predict. Previous methods for explaining side-effects have focused on known drug targets and their pathways. However, low affinity binding to proteins that are not usually considered drug targets may also drive side-effects. In order to assess these alternative targets, we used the 3D structures of 563 essential human proteins systematically to predict binding to 216 drugs. We first benchmarked our affinity predictions with available experimental data. We then combined singular value decomposition and canonical component analysis (SVD-CCA) to predict side-effects based on these novel target profiles. Our method predicts side-effects with good accuracy (average AUC: 0.82 for side effects present in <50% of drug labels). We also noted that side-effect frequency is the most important feature for prediction and can confound efforts at elucidating mechanism; our method allows us to remove the contribution of frequency and isolate novel biological signals. In particular, our analysis produces 2768 triplet associations between 50 essential proteins, 99 drugs, and 77 side-effects. Although experimental validation is difficult because many of our essential proteins do not have validated assays, we nevertheless attempted to validate a subset of these associations using experimental assay data. Our focus on essential proteins allows us to find potential associations that would likely be missed if we used recognized drug targets. Our associations provide novel insights about the molecular mechanisms of drug side-effects and highlight the need for expanded experimental efforts to investigate drug binding to proteins more broadly.
View details for DOI 10.1021/acs.jcim.5b00030
View details for PubMedID 26121262
-
Achieving high-sensitivity for clinical applications using augmented exome sequencing
GENOME MEDICINE
2015; 7
Abstract
Whole exome sequencing is increasingly used for the clinical evaluation of genetic disease, yet the variation of coverage and sensitivity over medically relevant parts of the genome remains poorly understood. Several sequencing-based assays continue to provide coverage that is inadequate for clinical assessment. Using sequence data obtained from the NA12878 reference sample and pre-defined lists of medically-relevant protein-coding and noncoding sequences, we compared the breadth and depth of coverage obtained among four commercial exome capture platforms and whole genome sequencing. In addition, we evaluated the performance of an augmented exome strategy, ACE, that extends coverage in medically relevant regions and enhances coverage in areas that are challenging to sequence. Leveraging reference call-sets, we also examined the effects of improved coverage on variant detection sensitivity. We observed coverage shortfalls with each of the conventional exome-capture and whole-genome platforms across several medically interpretable genes. These gaps included areas of the genome required for reporting recently established secondary findings (ACMG) and known disease-associated loci. The augmented exome strategy recovered many of these gaps, resulting in improved coverage in these areas. At clinically-relevant coverage levels (100 % bases covered at ≥20×), ACE improved coverage among genes in the medically interpretable genome (>90 % covered relative to 10-78 % with other platforms), the set of ACMG secondary finding genes (91 % covered relative to 4-75 % with other platforms) and a subset of variants known to be associated with human disease (99 % covered relative to 52-95 % with other platforms). Improved coverage translated into improvements in sensitivity, with ACE variant detection sensitivities (>97.5 % SNVs, >92.5 % InDels) exceeding that observed with conventional whole-exome and whole-genome platforms. Clinicians should consider analytical performance when making clinical assessments, given that even a few missed variants can lead to reporting false negative results. An augmented exome strategy provides a level of coverage not achievable with other platforms, thus addressing concerns regarding the lack of sensitivity in clinically important regions. In clinical applications where comprehensive coverage of medically interpretable areas of the genome requires higher localized sequencing depth, an augmented exome approach offers both cost and performance advantages over other sequencing-based tests.
View details for DOI 10.1186/s13073-015-0197-4
View details for Web of Science ID 000359428300001
View details for PubMedID 26269718
View details for PubMedCentralID PMC4534066
-
PharmGKB summary: Efavirenz pathway, pharmacokinetics
PHARMACOGENETICS AND GENOMICS
2015; 25 (7): 363-376
View details for DOI 10.1097/FPC.0000000000000145
View details for Web of Science ID 000356370900005
View details for PubMedID 25966836
View details for PubMedCentralID PMC4461466
-
Learning the Structure of Biomedical Relationships from Unstructured Text
PLOS COMPUTATIONAL BIOLOGY
2015; 11 (7)
Abstract
The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually-curated sets of (1) pharmacogenomic relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways than established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining.
View details for DOI 10.1371/journal.pcbi.1004216
View details for Web of Science ID 000360620100003
View details for PubMedCentralID PMC4517797
-
Learning the Structure of Biomedical Relationships from Unstructured Text.
PLoS computational biology
2015; 11 (7)
Abstract
The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually-curated sets of (1) pharmacogenomic relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways than established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining.
View details for DOI 10.1371/journal.pcbi.1004216
View details for PubMedID 26219079
View details for PubMedCentralID PMC4517797
-
Evidence for Clinical Implementation of Pharmacogenomics in Cardiac Drugs
MAYO CLINIC PROCEEDINGS
2015; 90 (6): 716-729
Abstract
To comprehensively assess the pharmacogenomic evidence of routinely used drugs for clinical utility. Between January 2, 2011, and May 31, 2013, we assessed 71 drugs by identifying all drug/genetic variant combinations with published clinical pharmacogenomic evidence. Literature supporting each drug/variant pair was assessed for study design and methods, outcomes, statistical significance, and clinical relevance. Proposed clinical summaries were formally scored using a modified AGREE (Appraisal of Guidelines for Research and Evaluation) II instrument, including recommendation for or against guideline implementation. Positive pharmacogenomic findings were identified for 51 of 71 cardiovascular drugs (71.8%), representing 884 unique drug/variant pairs from 597 publications. After analysis for quality and clinical relevance, 92 drug/variant pairs were proposed for translation into clinical summaries, encompassing 23 drugs (32.4% of drugs reviewed). All were recommended for clinical implementation using AGREE II, with mean ± SD overall quality scores of 5.18±0.91 (of 7.0; range, 3.67-7.0). Drug guidelines had highest mean ± SD scores in AGREE II domain 1 (Scope) (91.9±6.1 of 100) and moderate but still robust mean ± SD scores in domain 3 (Rigor) (73.1±11.1), domain 4 (Clarity) (67.8±12.5), and domain 5 (Applicability) (65.8±10.0). Clopidogrel (CYP2C19), metoprolol (CYP2D6), simvastatin (rs4149056), dabigatran (rs2244613), hydralazine (rs1799983, rs1799998), and warfarin (CYP2C9/VKORC1) were distinguished by the highest scores. Seven of the 9 most commonly prescribed drugs warranted translation guidelines summarizing clinical pharmacogenomic information. Considerable clinically actionable pharmacogenomic information for cardiovascular drugs exists, supporting the idea that consideration of such information when prescribing is warranted.
View details for DOI 10.1016/j.mayocp.2015.03.016
View details for Web of Science ID 000355557900008
View details for PubMedID 26046407
View details for PubMedCentralID PMC4475352
-
Genomics in the clinic: ethical and policy challenges in clinical next-generation sequencing programs at early adopter USA institutions.
Personalized medicine
2015; 12 (3): 269-282
Abstract
Next-generation sequencing (NGS) technologies are poised to revolutionize clinical diagnosis and treatment, but raise significant ethical and policy challenges. This review examines NGS program challenges through a synthesis of published literature, website and conference presentation content, and interviews at early-adopting institutions in the USA. Institutions are proactively addressing policy challenges related to the management and technical aspects of program development. However, ethical challenges related to patient-related aspects have not been fully addressed. These complex challenges present opportunities to develop comprehensive and standardized regulations across programs. Understanding the strengths, weaknesses and current practices of evolving NGS program approaches are important considerations for institutions developing NGS services, policymakers regulating or funding NGS programs and physicians and patients considering NGS services.
View details for DOI 10.2217/pme.14.88
View details for PubMedID 29771644
-
Distribute AI benefits fairly
NATURE
2015; 521 (7553): 417–18
View details for Web of Science ID 000355286600015
-
The International SSRI Pharmacogenomics Consortium (ISPC): a genome-wide association study of antidepressant treatment response
TRANSLATIONAL PSYCHIATRY
2015; 5
Abstract
Response to treatment with selective serotonin reuptake inhibitors (SSRIs) varies considerably between patients. The International SSRI Pharmacogenomics Consortium (ISPC) was formed with the primary goal of identifying genetic variation that may contribute to response to SSRI treatment of major depressive disorder. A genome-wide association study of 4-week treatment outcomes, measured using the 17-item Hamilton Rating Scale for Depression (HRSD-17), was performed using data from 865 subjects from seven sites. The primary outcomes were percent change in HRSD-17 score and response, defined as at least 50% reduction in HRSD-17. Data from two prior studies, the Pharmacogenomics Research Network Antidepressant Medication Pharmacogenomics Study (PGRN-AMPS) and the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, were used for replication, and a meta-analysis of the three studies was performed (N=2394). Although many top association signals in the ISPC analysis map to interesting candidate genes, none were significant at the genome-wide level and the associations were not replicated using PGRN-AMPS and STAR*D data. The top association result in the meta-analysis of response represents SNPs 5′ upstream of the neuregulin-1 gene, NRG1 (P = 1.20E-06). NRG1 is involved in many aspects of brain development, including neuronal maturation, and variations in this gene have been shown to be associated with increased risk for mental disorders, particularly schizophrenia. Replication and functional studies of these findings are warranted.
View details for DOI 10.1038/tp.2015.47
View details for Web of Science ID 000367655600002
View details for PubMedCentralID PMC4462610
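The two primary outcomes named in the abstract above (percent change in HRSD-17 score, and response defined as at least a 50% reduction) can be written out as a short Python sketch. The function names and the example scores are hypothetical; the abstract does not specify how baselines or missing visits were handled, so none of that is modelled here.

```python
def percent_change(baseline: float, week4: float) -> float:
    """Percent change in score from baseline to week 4 (negative = improvement)."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (week4 - baseline) / baseline


def is_responder(baseline: float, week4: float, threshold: float = 50.0) -> bool:
    """True when the score fell by at least `threshold` percent of baseline."""
    return -percent_change(baseline, week4) >= threshold


# Invented scores for two illustrative subjects.
change_a = percent_change(24, 12)   # score halved
change_b = percent_change(24, 13)   # just short of a 50% reduction
print(change_a, is_responder(24, 12))
print(round(change_b, 1), is_responder(24, 13))
```

The first outcome is continuous and the second is a binary threshold on it, so a subject just short of the cut-off counts as a non-responder despite substantial improvement.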
-
The International SSRI Pharmacogenomics Consortium (ISPC): a genome-wide association study of antidepressant treatment response.
Translational psychiatry
2015; 5: e553
Abstract
Response to treatment with selective serotonin reuptake inhibitors (SSRIs) varies considerably between patients. The International SSRI Pharmacogenomics Consortium (ISPC) was formed with the primary goal of identifying genetic variation that may contribute to response to SSRI treatment of major depressive disorder. A genome-wide association study of 4-week treatment outcomes, measured using the 17-item Hamilton Rating Scale for Depression (HRSD-17), was performed using data from 865 subjects from seven sites. The primary outcomes were percent change in HRSD-17 score and response, defined as at least 50% reduction in HRSD-17. Data from two prior studies, the Pharmacogenomics Research Network Antidepressant Medication Pharmacogenomics Study (PGRN-AMPS) and the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, were used for replication, and a meta-analysis of the three studies was performed (N=2394). Although many top association signals in the ISPC analysis map to interesting candidate genes, none were significant at the genome-wide level and the associations were not replicated using PGRN-AMPS and STAR*D data. The top association result in the meta-analysis of response represents SNPs 5′ upstream of the neuregulin-1 gene, NRG1 (P = 1.20E-06). NRG1 is involved in many aspects of brain development, including neuronal maturation, and variations in this gene have been shown to be associated with increased risk for mental disorders, particularly schizophrenia. Replication and functional studies of these findings are warranted.
View details for DOI 10.1038/tp.2015.47
View details for PubMedID 25897834
View details for PubMedCentralID PMC4462610
-
Neurotoxicity of Generic Anesthesia Agents in Infants and Children: An Orphan Research Question in Search of a Sponsor
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION
2015; 313 (15): 1515–16
View details for PubMedID 25898045
-
PharmGKB summary: very important pharmacogene information for human leukocyte antigen B.
Pharmacogenetics and genomics
2015; 25 (4): 205-221
View details for DOI 10.1097/FPC.0000000000000118
View details for PubMedID 25647431
View details for PubMedCentralID PMC4356642
-
PharmGKB summary: very important pharmacogene information for CFTR.
Pharmacogenetics and genomics
2015; 25 (3): 149-156
View details for DOI 10.1097/FPC.0000000000000112
View details for PubMedID 25514096
View details for PubMedCentralID PMC4336773
-
Predicting cancer drug response: advancing the DREAM.
Cancer discovery
2015; 5 (3): 237-238
Abstract
The DREAM challenge is a community effort to assess current capabilities in systems biology. Two recent challenges focus on cancer cell drug sensitivity and drug synergism, and highlight strengths and weaknesses of current approaches. Cancer Discov; 5(3); 237-8. ©2015 AACR.
View details for DOI 10.1158/2159-8290.CD-15-0093
View details for PubMedID 25623160
-
Variations in the Binding Pocket of an Inhibitor of the Bacterial Division Protein FtsZ across Genotypes and Species
PLOS COMPUTATIONAL BIOLOGY
2015; 11 (3)
Abstract
The recent increase in antibiotic resistance in pathogenic bacteria calls for new approaches to drug-target selection and drug development. Targeting the mechanisms of action of proteins involved in bacterial cell division bypasses problems associated with increasingly ineffective variants of older antibiotics; to this end, the essential bacterial cytoskeletal protein FtsZ is a promising target. Recent work on its allosteric inhibitor, PC190723, revealed in vitro activity on Staphylococcus aureus FtsZ and in vivo antimicrobial activities. However, the mechanism of drug action and its effect on FtsZ in other bacterial species are unclear. Here, we examine the structural environment of the PC190723 binding pocket using PocketFEATURE, a statistical method that scores the similarity between pairs of small-molecule binding sites based on 3D structure information about the local microenvironment, and molecular dynamics (MD) simulations. We observed that species and nucleotide-binding state have significant impacts on the structural properties of the binding site, with substantially disparate microenvironments for bacterial species not from the Staphylococcus genus. Based on PocketFEATURE analysis of MD simulations of S. aureus FtsZ bound to GTP or with mutations that are known to confer PC190723 resistance, we predict that PC190723 strongly prefers to bind Staphylococcus FtsZ in the nucleotide-bound state. Furthermore, MD simulations of an FtsZ dimer indicated that polymerization may enhance PC190723 binding. Taken together, our results demonstrate that a drug-binding pocket can vary significantly across species, genetic perturbations, and in different polymerization states, yielding important information for the further development of FtsZ inhibitors.
View details for DOI 10.1371/journal.pcbi.1004117
View details for Web of Science ID 000352195700026
View details for PubMedID 25811761
View details for PubMedCentralID PMC4374959
-
Variations in the binding pocket of an inhibitor of the bacterial division protein FtsZ across genotypes and species.
PLoS computational biology
2015; 11 (3)
Abstract
The recent increase in antibiotic resistance in pathogenic bacteria calls for new approaches to drug-target selection and drug development. Targeting the mechanisms of action of proteins involved in bacterial cell division bypasses problems associated with increasingly ineffective variants of older antibiotics; to this end, the essential bacterial cytoskeletal protein FtsZ is a promising target. Recent work on its allosteric inhibitor, PC190723, revealed in vitro activity on Staphylococcus aureus FtsZ and in vivo antimicrobial activities. However, the mechanism of drug action and its effect on FtsZ in other bacterial species are unclear. Here, we examine the structural environment of the PC190723 binding pocket using PocketFEATURE, a statistical method that scores the similarity between pairs of small-molecule binding sites based on 3D structure information about the local microenvironment, and molecular dynamics (MD) simulations. We observed that species and nucleotide-binding state have significant impacts on the structural properties of the binding site, with substantially disparate microenvironments for bacterial species not from the Staphylococcus genus. Based on PocketFEATURE analysis of MD simulations of S. aureus FtsZ bound to GTP or with mutations that are known to confer PC190723 resistance, we predict that PC190723 strongly prefers to bind Staphylococcus FtsZ in the nucleotide-bound state. Furthermore, MD simulations of an FtsZ dimer indicated that polymerization may enhance PC190723 binding. Taken together, our results demonstrate that a drug-binding pocket can vary significantly across species, genetic perturbations, and in different polymerization states, yielding important information for the further development of FtsZ inhibitors.
View details for DOI 10.1371/journal.pcbi.1004117
View details for PubMedID 25811761
View details for PubMedCentralID PMC4374959
-
PharmGKB summary: ibuprofen pathways
PHARMACOGENETICS AND GENOMICS
2015; 25 (2): 96-106
View details for DOI 10.1097/FPC.0000000000000113
View details for Web of Science ID 000347393200006
View details for PubMedID 25502615
View details for PubMedCentralID PMC4355401
-
INTERACTIVE GENOTYPE-BASED DOSING GUIDELINES.
WILEY-BLACKWELL. 2015: S61
View details for Web of Science ID 000348730500184
-
Enabling the curation of your pharmacogenetic study.
Clinical pharmacology & therapeutics
2015; 97 (2): 116-119
Abstract
As pharmacogenomics becomes integrated into clinical practice, curation of published studies becomes increasingly important. At the Pharmacogenomics Knowledgebase (PharmGKB; www.pharmgkb.org), pharmacogenetic associations reported in published articles are manually curated and evaluated. Standard terminologies are used, making findings uniform and unambiguous. Lack of information, clarity, or standards in the original report can make it difficult or impossible to curate. We provide 10 rules to help authors ensure that their results are accurately captured and integrated.
View details for DOI 10.1002/cpt.15
View details for PubMedID 25670512
-
Using "big data" to dissect clinical heterogeneity.
Circulation
2015; 131 (3): 232-233
View details for DOI 10.1161/CIRCULATIONAHA.114.014106
View details for PubMedID 25601948
-
Data-Mining Electronic Medical Records for Clinical Order Recommendations: Wisdom of the Crowd or Tyranny of the Mob?
AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Science
2015; 2015: 435-439
Abstract
Uncertainty and variability is pervasive in medical decision making with insufficient evidence-based medicine and inconsistent implementation where established knowledge exists. Clinical decision support constructs like order sets help distribute expertise, but are constrained by knowledge-based development. We previously produced a data-driven order recommender system to automatically generate clinical decision support content from structured electronic medical record data on >19K hospital patients. We now present the first structured validation of such automatically generated content against an objective external standard by assessing how well the generated recommendations correspond to orders referenced as appropriate in clinical practice guidelines. For example scenarios of chest pain, gastrointestinal hemorrhage, and pneumonia in hospital patients, the automated method identifies guideline reference orders with ROC AUCs (c-statistics) (0.89, 0.95, 0.83) that improve upon statistical prevalence benchmarks (0.76, 0.74, 0.73) and pre-existing human-expert authored order sets (0.81, 0.77, 0.73) (P<10^-30 in all cases). We demonstrate that data-driven, automatically generated clinical decision support content can reproduce and optimize top-down constructs like order sets while largely avoiding inappropriate and irrelevant recommendations. This will be even more important when extrapolating to more typical clinical scenarios where well-defined external standards and decision support do not exist.
View details for PubMedID 26306281
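The ROC AUC (c-statistic) used in the abstract above to compare recommenders is the probability that a randomly chosen guideline-referenced order is scored higher than a randomly chosen non-referenced order, with ties counted as one half. A minimal Python sketch of that definition follows; the scores and labels are invented for illustration and do not come from the study.

```python
from typing import Sequence


def c_statistic(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC computed directly from its pairwise definition.

    scores: recommender score per candidate order (higher = more recommended).
    labels: 1 if the order is a guideline reference order, else 0.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Invented scores for six candidate orders; 1 marks a guideline reference order.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = c_statistic(scores, labels)
print(round(auc, 3))
```

The quadratic loop is fine for a sketch; for large candidate lists a rank-based computation gives the same value more efficiently.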
-
TRAINING THE NEXT GENERATION OF QUANTITATIVE BIOLOGISTS IN THE ERA OF BIG DATA
WORLD SCIENTIFIC PUBL CO PTE LTD. 2015: 488–92
Abstract
The following sections are included: Workshop Focus, Workshop Contributions and References.
View details for Web of Science ID 000461835500048
View details for PubMedID 25592609
-
A TWENTIETH ANNIVERSARY TRIBUTE TO PSB
WORLD SCIENTIFIC PUBL CO PTE LTD. 2015: 1–7
View details for Web of Science ID 000461835500001
-
Ranking adverse drug reactions with crowdsourcing.
Journal of medical Internet research
2015; 17 (3): e80
Abstract
There is no publicly available resource that provides the relative severity of adverse drug reactions (ADRs). Such a resource would be useful for several applications, including assessment of the risks and benefits of drugs and improvement of patient-centered care. It could also be used to triage predictions of drug adverse events. The intent of the study was to rank ADRs according to severity. We used Internet-based crowdsourcing to rank ADRs according to severity. We assigned 126,512 pairwise comparisons of ADRs to 2589 Amazon Mechanical Turk workers and used these comparisons to rank order 2929 ADRs. There is good correlation (rho=.53) between the mortality rates associated with ADRs and their rank. Our ranking highlights severe drug-ADR predictions, such as cardiovascular ADRs for raloxifene and celecoxib. It also triages genes associated with severe ADRs such as epidermal growth-factor receptor (EGFR), associated with glioblastoma multiforme, and SCN1A, associated with epilepsy. ADR ranking lays a first stepping stone in personalized drug risk assessment. Ranking of ADRs using crowdsourcing may have useful clinical and financial implications, and should be further investigated in the context of health care decision making.
View details for DOI 10.2196/jmir.3962
View details for PubMedID 25800813
View details for PubMedCentralID PMC4387295
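The abstract above turns pairwise severity judgments into a rank order but does not say which aggregation method was used. As a rough illustration of the general idea only, the Python sketch below ranks items by the fraction of comparisons in which they were judged more severe. The win-fraction rule, the function name and the example judgments are all assumptions made for illustration, not the study's algorithm or data.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rank_by_win_fraction(comparisons: Iterable[Tuple[str, str]]) -> List[str]:
    """Rank items from most to least severe.

    Each comparison is (judged_more_severe, judged_less_severe).
    Items are ordered by wins / appearances; ties break alphabetically.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for more_severe, less_severe in comparisons:
        wins[more_severe] += 1
        appearances[more_severe] += 1
        appearances[less_severe] += 1
    return sorted(
        appearances,
        key=lambda item: (-wins[item] / appearances[item], item),
    )


# Invented worker judgments: (more severe, less severe).
judgments = [
    ("cardiac arrest", "rash"),
    ("cardiac arrest", "nausea"),
    ("rash", "nausea"),
    ("rash", "nausea"),
    ("nausea", "rash"),  # workers can disagree
]
ranking = rank_by_win_fraction(judgments)
print(ranking)
```

A simple win fraction ignores how strong each opponent was; models that account for that (for example Bradley-Terry style fits) are a common alternative when comparisons are sparse.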
-
Achieving high-sensitivity for clinical applications using augmented exome sequencing.
Genome medicine
2015; 7 (1): 71
Abstract
Whole exome sequencing is increasingly used for the clinical evaluation of genetic disease, yet the variation of coverage and sensitivity over medically relevant parts of the genome remains poorly understood. Several sequencing-based assays continue to provide coverage that is inadequate for clinical assessment. Using sequence data obtained from the NA12878 reference sample and pre-defined lists of medically-relevant protein-coding and noncoding sequences, we compared the breadth and depth of coverage obtained among four commercial exome capture platforms and whole genome sequencing. In addition, we evaluated the performance of an augmented exome strategy, ACE, that extends coverage in medically relevant regions and enhances coverage in areas that are challenging to sequence. Leveraging reference call-sets, we also examined the effects of improved coverage on variant detection sensitivity. We observed coverage shortfalls with each of the conventional exome-capture and whole-genome platforms across several medically interpretable genes. These gaps included areas of the genome required for reporting recently established secondary findings (ACMG) and known disease-associated loci. The augmented exome strategy recovered many of these gaps, resulting in improved coverage in these areas. At clinically-relevant coverage levels (100 % bases covered at ≥20×), ACE improved coverage among genes in the medically interpretable genome (>90 % covered relative to 10-78 % with other platforms), the set of ACMG secondary finding genes (91 % covered relative to 4-75 % with other platforms) and a subset of variants known to be associated with human disease (99 % covered relative to 52-95 % with other platforms).
Improved coverage translated into improvements in sensitivity, with ACE variant detection sensitivities (>97.5 % SNVs, >92.5 % InDels) exceeding that observed with conventional whole-exome and whole-genome platforms. Clinicians should consider analytical performance when making clinical assessments, given that even a few missed variants can lead to reporting false negative results. An augmented exome strategy provides a level of coverage not achievable with other platforms, thus addressing concerns regarding the lack of sensitivity in clinically important regions. In clinical applications where comprehensive coverage of medically interpretable areas of the genome requires higher localized sequencing depth, an augmented exome approach offers both cost and performance advantages over other sequencing-based tests.
View details for DOI 10.1186/s13073-015-0197-4
View details for PubMedID 26269718
-
Genomics in the clinic: ethical and policy challenges in clinical next-generation sequencing programs at early adopter USA institutions
PERSONALIZED MEDICINE
2015; 12 (3): 269-282
View details for DOI 10.2217/PME.14.88
View details for Web of Science ID 000355751600011
-
Ranking adverse drug reactions with crowdsourcing.
Journal of medical Internet research
2015; 17 (3)
Abstract
There is no publicly available resource that provides the relative severity of adverse drug reactions (ADRs). Such a resource would be useful for several applications, including assessment of the risks and benefits of drugs and improvement of patient-centered care. It could also be used to triage predictions of drug adverse events. The intent of the study was to rank ADRs according to severity. We used Internet-based crowdsourcing to rank ADRs according to severity. We assigned 126,512 pairwise comparisons of ADRs to 2589 Amazon Mechanical Turk workers and used these comparisons to rank order 2929 ADRs. There is good correlation (rho=.53) between the mortality rates associated with ADRs and their rank. Our ranking highlights severe drug-ADR predictions, such as cardiovascular ADRs for raloxifene and celecoxib. It also triages genes associated with severe ADRs such as epidermal growth-factor receptor (EGFR), associated with glioblastoma multiforme, and SCN1A, associated with epilepsy. ADR ranking lays a first stepping stone in personalized drug risk assessment. Ranking of ADRs using crowdsourcing may have useful clinical and financial implications, and should be further investigated in the context of health care decision making.
View details for DOI 10.2196/jmir.3962
View details for PubMedID 25800813
View details for PubMedCentralID PMC4387295
-
A twentieth anniversary tribute to PSB.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2015; 20: 1-7
Abstract
PSB brings together top researchers from around the world to exchange research results and address open issues in all aspects of computational biology. PSB 2015 marks the twentieth anniversary of PSB. Reaching a milestone year is an accomplishment well worth celebrating. It is long enough to have seen big changes occur, but recent enough to be relevant for today. As PSB celebrates twenty years of service, we would like to take this opportunity to congratulate the PSB community for your success. We would also like the community to join us in a time of celebration and reflection on this accomplishment.
View details for PubMedID 25592562
-
PharmGKB summary: very important pharmacogene information for CYP4F2
PHARMACOGENETICS AND GENOMICS
2015; 25 (1): 41-47
View details for DOI 10.1097/FPC.0000000000000100
View details for Web of Science ID 000346632900006
View details for PubMedID 25370453
View details for PubMedCentralID PMC4261059
-
A community computational challenge to predict the activity of pairs of compounds
NATURE BIOTECHNOLOGY
2014; 32 (12): 1213-+
Abstract
Recent therapeutic successes have renewed interest in drug combinations, but experimental screening approaches are costly and often identify only small numbers of synergistic combinations. The DREAM consortium launched an open challenge to foster the development of in silico methods to computationally rank 91 compound pairs, from the most synergistic to the most antagonistic, based on gene-expression profiles of human B cells treated with individual compounds at multiple time points and concentrations. Using scoring metrics based on experimental dose-response curves, we assessed 32 methods (31 community-generated approaches and SynGen), four of which performed significantly better than random guessing. We highlight similarities between the methods. Although the accuracy of predictions was not optimal, we find that computational prediction of compound-pair activity is possible, and that community challenges can be useful to advance the field of in silico compound-synergy prediction.
View details for DOI 10.1038/nbt.3052
View details for Web of Science ID 000346156800023
View details for PubMedID 25419740
View details for PubMedCentralID PMC4399794
-
A community effort to assess and improve drug sensitivity prediction algorithms
NATURE BIOTECHNOLOGY
2014; 32 (12): 1202-U57
Abstract
Predicting the best treatment strategy from genomic information is a core goal of precision medicine. Here we focus on predicting drug response based on a cohort of genomic, epigenomic and proteomic profiling data sets measured in human breast cancer cell lines. Through a collaborative effort between the National Cancer Institute (NCI) and the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, we analyzed a total of 44 drug sensitivity prediction algorithms. The top-performing approaches modeled nonlinear relationships and incorporated biological pathway information. We found that gene expression microarrays consistently provided the best predictive power of the individual profiling data sets; however, performance was increased by including multiple, independent data sets. We discuss the innovations underlying the top-performing methodology, Bayesian multitask MKL, and we provide detailed descriptions of all methods. This study establishes benchmarks for drug sensitivity prediction and identifies approaches that can be leveraged for the development of new methods.
View details for DOI 10.1038/nbt.2877
View details for Web of Science ID 000346156800022
View details for PubMedID 24880487
View details for PubMedCentralID PMC4547623
-
PharmGKB summary: gemcitabine pathway
PHARMACOGENETICS AND GENOMICS
2014; 24 (11): 564-574
View details for DOI 10.1097/FPC.0000000000000086
View details for Web of Science ID 000343666300003
View details for PubMedCentralID PMC4189987
-
PharmGKB summary: gemcitabine pathway.
Pharmacogenetics and genomics
2014; 24 (11): 564-574
View details for DOI 10.1097/FPC.0000000000000086
View details for PubMedID 25162786
-
Genetic variant in folate homeostasis is associated with lower warfarin dose in African Americans
BLOOD
2014; 124 (14): 2298-2305
Abstract
The anticoagulant warfarin has >30 million prescriptions per year in the United States. Doses can vary 20-fold between patients, and incorrect dosing can result in serious adverse events. Variation in warfarin pharmacokinetic and pharmacodynamic genes, such as CYP2C9 and VKORC1, do not fully explain the dose variability in African Americans. To identify additional genetic contributors to warfarin dose, we exome sequenced 103 African Americans on stable doses of warfarin at extremes (≤ 35 and ≥ 49 mg/week). We found an association between lower warfarin dose and a population-specific regulatory variant, rs7856096 (P = 1.82 × 10^-8, minor allele frequency = 20.4%), in the folate homeostasis gene folylpolyglutamate synthase (FPGS). We replicated this association in an independent cohort of 372 African American subjects whose stable warfarin doses represented the full dosing spectrum (P = .046). In a combined cohort, adding rs7856096 to the International Warfarin Pharmacogenetic Consortium pharmacogenetic dosing algorithm resulted in a 5.8 mg/week (P = 3.93 × 10^-5) decrease in warfarin dose for each allele carried. The variant overlaps functional elements and was associated (P = .01) with FPGS gene expression in lymphoblastoid cell lines derived from combined HapMap African populations (N = 326). Our results provide the first evidence linking genetic variation in folate homeostasis to warfarin response.
View details for DOI 10.1182/blood-2014-04-568436
View details for Web of Science ID 000342763900023
View details for PubMedCentralID PMC4183989
-
Genetic variant in folate homeostasis is associated with lower warfarin dose in African Americans.
Blood
2014; 124 (14): 2298-2305
View details for DOI 10.1182/blood-2014-04-568436
View details for PubMedID 25079360
-
PharmGKB summary: uric acid-lowering drugs pathway, pharmacodynamics.
Pharmacogenetics and genomics
2014; 24 (9): 464-476
View details for DOI 10.1097/FPC.0000000000000058
View details for PubMedID 24915143
-
PharmGKB summary: very important pharmacogene information for N-acetyltransferase 2.
Pharmacogenetics and genomics
2014; 24 (8): 409-425
View details for DOI 10.1097/FPC.0000000000000062
View details for PubMedID 24892773
-
Interpreting the CYP2D6 results from the International Tamoxifen Pharmacogenetics Consortium.
Clinical pharmacology & therapeutics
2014; 96 (2): 144-146
View details for DOI 10.1038/clpt.2014.100
View details for PubMedID 25056393
View details for PubMedCentralID PMC4147833
-
PharmGKB summary: tramadol pathway
PHARMACOGENETICS AND GENOMICS
2014; 24 (7): 374-380
View details for DOI 10.1097/FPC.0000000000000057
View details for Web of Science ID 000337727000007
View details for PubMedID 24849324
View details for PubMedCentralID PMC4100774
-
PharmGKB summary: very important pharmacogene information for SLC22A1.
Pharmacogenetics and genomics
2014; 24 (6): 324-328
View details for DOI 10.1097/FPC.0000000000000048
View details for PubMedID 24681965
View details for PubMedCentralID PMC4035531
-
Integrating systems biology sources illuminates drug action.
Clinical pharmacology & therapeutics
2014; 95 (6): 663-669
Abstract
There are significant gaps in our understanding of the pathways by which drugs act. This incomplete knowledge limits our ability to use mechanistic molecular information rationally to repurpose drugs, understand their side effects, and predict their interactions with other drugs. Here, we present DrugRouter, a novel method for generating drug-specific pathways of action by linking target genes, disease genes, and pharmacogenes using gene interaction networks. We construct pathways for more than a hundred drugs and show that the genes included in our pathways (i) co-occur with the query drug in the literature, (ii) significantly overlap or are adjacent to known drug-response pathways, and (iii) are adjacent to genes that are hits in genome-wide association studies assessing drug response. Finally, these computed pathways suggest novel drug-repositioning opportunities (e.g., statins for follicular thyroid cancer), gene-side effect associations, and gene-drug interactions. Thus, DrugRouter generates hypotheses about drug actions using systems biology data.
View details for DOI 10.1038/clpt.2014.51
View details for PubMedID 24577151
View details for PubMedCentralID PMC4029855
-
Reconstruction of the Mouse Otocyst and Early Neuroblast Lineage at Single-Cell Resolution
CELL
2014; 157 (4): 964-978
Abstract
The otocyst harbors progenitors for most cell types of the mature inner ear. Developmental lineage analyses and gene expression studies suggest that distinct progenitor populations are compartmentalized to discrete axial domains in the early otocyst. Here, we conducted highly parallel quantitative RT-PCR measurements on 382 individual cells from the developing otocyst and neuroblast lineages to assay 96 genes representing established otic markers, signaling-pathway-associated transcripts, and novel otic-specific genes. By applying multivariate cluster, principal component, and network analyses to the data matrix, we were able to readily distinguish the delaminating neuroblasts and to describe progressive states of gene expression in this population at single-cell resolution. It further established a three-dimensional model of the otocyst in which each individual cell can be precisely mapped into spatial expression domains. Our bioinformatic modeling revealed spatial dynamics of different signaling pathways active during early neuroblast development and prosensory domain specification.
View details for DOI 10.1016/j.cell.2014.03.036
View details for Web of Science ID 000335765500022
View details for PubMedID 24768691
-
PharmGKB summary: abacavir pathway.
Pharmacogenetics and genomics
2014; 24 (5): 276-282
View details for DOI 10.1097/FPC.0000000000000040
View details for PubMedID 24625462
-
Genotype-Guided Dosing of Vitamin K Antagonists
NEW ENGLAND JOURNAL OF MEDICINE
2014; 370 (18): 1762–63
View details for Web of Science ID 000335405200021
View details for PubMedID 24804303
-
Genotype-guided dosing of vitamin K antagonists.
New England journal of medicine
2014; 370 (18): 1762-1763
View details for DOI 10.1056/NEJMc1402521#SA4
View details for PubMedID 24785217
-
Guidelines for investigating causality of sequence variants in human disease
NATURE
2014; 508 (7497): 469-476
Abstract
The discovery of rare genetic variants is accelerating, and clear guidelines for distinguishing disease-causing sequence variants from the many potentially functional variants present in any human genome are urgently needed. Without rigorous standards we risk an acceleration of false-positive reports of causality, which would impede the translation of genomic research findings into the clinical diagnostic setting and hinder biological understanding of disease. Here we discuss the key challenges of assessing sequence variants in human disease, integrating both gene-level and variant-level support for causality. We propose guidelines for summarizing confidence in variant pathogenicity and highlight several areas that require further resource development.
View details for DOI 10.1038/nature13127
View details for Web of Science ID 000334741600026
View details for PubMedID 24759409
-
Knowledge-based fragment binding prediction.
PLoS computational biology
2014; 10 (4): e1003589
Abstract
Target-based drug discovery must assess many drug-like compounds for potential activity. Focusing on low-molecular-weight compounds (fragments) can dramatically reduce the chemical search space. However, approaches for determining protein-fragment interactions have limitations. Experimental assays are time-consuming, expensive, and not always applicable. At the same time, computational approaches using physics-based methods have limited accuracy. With increasing high-resolution structural data for protein-ligand complexes, there is now an opportunity for data-driven approaches to fragment binding prediction. We present FragFEATURE, a machine learning approach to predict small molecule fragments preferred by a target protein structure. We first create a knowledge base of protein structural environments annotated with the small molecule substructures they bind. These substructures have low-molecular weight and serve as a proxy for fragments. FragFEATURE then compares the structural environments within a target protein to those in the knowledge base to retrieve statistically preferred fragments. It merges information across diverse ligands with shared substructures to generate predictions. Our results demonstrate FragFEATURE's ability to rediscover fragments corresponding to the ligand bound with 74% precision and 82% recall on average. For many protein targets, it identifies high scoring fragments that are substructures of known inhibitors. FragFEATURE thus predicts fragments that can serve as inputs to fragment-based drug design or serve as refinement criteria for creating target-specific compound libraries for experimental or computational screening.
View details for DOI 10.1371/journal.pcbi.1003589
View details for PubMedID 24762971
View details for PubMedCentralID PMC3998881
-
High Precision Prediction of Functional Sites in Protein Structures
PLOS ONE
2014; 9 (3)
Abstract
We address the problem of assigning biological function to solved protein structures. Computational tools play a critical role in identifying potential active sites and informing screening decisions for further lab analysis. A critical parameter in the practical application of computational methods is the precision, or positive predictive value. Precision measures the level of confidence the user should have in a particular computed functional assignment. Low precision annotations lead to futile laboratory investigations and waste scarce research resources. In this paper we describe an advanced version of the protein function annotation system FEATURE, which achieved 99% precision and average recall of 95% across 20 representative functional sites. The system uses a Support Vector Machine classifier operating on the microenvironment of physicochemical features around an amino acid. We also compared performance of our method with state-of-the-art sequence-level annotator Pfam in terms of precision, recall and localization. To our knowledge, no other functional site annotator has been rigorously evaluated against these key criteria. The software and predictive models are incorporated into the WebFEATURE service at http://feature.stanford.edu/wf4.0-beta.
View details for DOI 10.1371/journal.pone.0091240
View details for Web of Science ID 000332858400048
View details for PubMedID 24632601
View details for PubMedCentralID PMC3954699
-
Clinical interpretation and implications of whole-genome sequencing.
JAMA : the journal of the American Medical Association
2014; 311 (10): 1035-1045
Abstract
Whole-genome sequencing (WGS) is increasingly applied in clinical medicine and is expected to uncover clinically significant findings regardless of sequencing indication. To examine coverage and concordance of clinically relevant genetic variation provided by WGS technologies; to quantitate inherited disease risk and pharmacogenomic findings in WGS data and resources required for their discovery and interpretation; and to evaluate clinical action prompted by WGS findings. An exploratory study of 12 adult participants recruited at Stanford University Medical Center who underwent WGS between November 2011 and March 2012. A multidisciplinary team reviewed all potentially reportable genetic findings. Five physicians proposed initial clinical follow-up based on the genetic findings. Genome coverage and sequencing platform concordance in different categories of genetic disease risk, person-hours spent curating candidate disease-risk variants, interpretation agreement between trained curators and disease genetics databases, burden of inherited disease risk and pharmacogenomic findings, and burden and interrater agreement of proposed clinical follow-up. Depending on sequencing platform, 10% to 19% of inherited disease genes were not covered to accepted standards for single nucleotide variant discovery. Genotype concordance was high for previously described single nucleotide genetic variants (99%-100%) but low for small insertion/deletion variants (53%-59%). Curation of 90 to 127 genetic variants in each participant required a median of 54 minutes (range, 5-223 minutes) per genetic variant, resulted in moderate classification agreement between professionals (Gross κ, 0.52; 95% CI, 0.40-0.64), and reclassified 69% of genetic variants cataloged as disease causing in mutation databases to variants of uncertain or lesser significance.
Two to 6 personal disease-risk findings were discovered in each participant, including 1 frameshift deletion in the BRCA1 gene implicated in hereditary breast and ovarian cancer. Physician review of sequencing findings prompted consideration of a median of 1 to 3 initial diagnostic tests and referrals per participant, with fair interrater agreement about the suitability of WGS findings for clinical follow-up (Fleiss κ, 0.24; P < .001). In this exploratory study of 12 volunteer adults, the use of WGS was associated with incomplete coverage of inherited disease genes, low reproducibility of detection of genetic variation with the highest potential clinical effects, and uncertainty about clinically reportable findings. In certain cases, WGS will identify clinically actionable genetic variants warranting early medical intervention. These issues should be considered when determining the role of WGS in clinical medicine.
View details for DOI 10.1001/jama.2014.1717
View details for PubMedID 24618965
View details for PubMedCentralID PMC4119063
-
PharmGKB summary: very important pharmacogene information for UGT1A1
PHARMACOGENETICS AND GENOMICS
2014; 24 (3): 177-183
View details for DOI 10.1097/FPC.0000000000000024
View details for Web of Science ID 000331209100006
View details for PubMedID 24492252
View details for PubMedCentralID PMC4091838
-
Environmental and State-Level Regulatory Factors Affect the Incidence of Autism and Intellectual Disability
PLOS COMPUTATIONAL BIOLOGY
2014; 10 (3)
Abstract
Many factors affect the risks for neurodevelopmental maladies such as autism spectrum disorders (ASD) and intellectual disability (ID). To compare environmental, phenotypic, socioeconomic and state-policy factors in a unified geospatial framework, we analyzed the spatial incidence patterns of ASD and ID using an insurance claims dataset covering nearly one third of the US population. Following epidemiologic evidence, we used the rate of congenital malformations of the reproductive system as a surrogate for environmental exposure of parents to unmeasured developmental risk factors, including toxins. Adjusted for gender, ethnic, socioeconomic, and geopolitical factors, the ASD incidence rates were strongly linked to population-normalized rates of congenital malformations of the reproductive system in males (an increase in ASD incidence by 283% for every percent increase in incidence of malformations, 95% CI: [91%, 576%], p<6×10⁻⁵). Such congenital malformations were barely significant for ID (94% increase, 95% CI: [1%, 250%], p = 0.0384). Other congenital malformations in males (excluding those affecting the reproductive system) appeared to significantly affect both phenotypes: 31.8% ASD rate increase (CI: [12%, 52%], p<6×10⁻⁵), and 43% ID rate increase (CI: [23%, 67%], p<6×10⁻⁵). Furthermore, the state-mandated rigor of diagnosis of ASD by a pediatrician or clinician for consideration in the special education system was predictive of a considerable decrease in ASD and ID incidence rates (98.6%, CI: [28%, 99.99%], p = 0.02475 and 99% CI: [68%, 99.99%], p = 0.00637 respectively). Thus, the observed spatial variability of both ID and ASD rates is associated with environmental and state-level regulatory factors; the magnitude of influence of compound environmental predictors was approximately three times greater than that of state-level incentives.
The estimated county-level random effects exhibited marked spatial clustering, strongly indicating existence of as yet unidentified localized factors driving apparent disease incidence. Finally, we found that the rates of ASD and ID at the county level were weakly but significantly correlated (Pearson product-moment correlation 0.0589, p = 0.00101), while for females the correlation was much stronger (0.197, p<2.26×10⁻¹⁶).
View details for DOI 10.1371/journal.pcbi.1003518
View details for Web of Science ID 000336509000034
View details for PubMedID 24625521
View details for PubMedCentralID PMC3952819
-
Coherent functional modules improve transcription factor target identification, cooperativity prediction, and disease association.
PLoS genetics
2014; 10 (2)
Abstract
Transcription factors (TFs) are fundamental controllers of cellular regulation that function in a complex and combinatorial manner. Accurate identification of a transcription factor's targets is essential to understanding the role that factors play in disease biology. However, due to a high false positive rate, identifying coherent functional target sets is difficult. We have created an improved mapping of targets by integrating ChIP-Seq data with 423 functional modules derived from 9,395 human expression experiments. We identified 5,002 TF-module relationships, significantly improved TF target prediction, and found 30 high-confidence TF-TF associations, of which 14 are known. Importantly, we also connected TFs to diseases through these functional modules and identified 3,859 significant TF-disease relationships. As an example, we found a link between MEF2A and Crohn's disease, which we validated in an independent expression dataset. These results show the power of combining expression data and ChIP-Seq data to remove noise and better extract the associations between TFs, functional modules, and disease.
View details for DOI 10.1371/journal.pgen.1004122
View details for PubMedID 24516403
View details for PubMedCentralID PMC3916285
-
CYP2D6 Genotype and Adjuvant Tamoxifen: Meta-Analysis of Heterogeneous Study Populations
CLINICAL PHARMACOLOGY & THERAPEUTICS
2014; 95 (2): 216-227
Abstract
The International Tamoxifen Pharmacogenomics Consortium was established to address the controversy regarding cytochrome P450 2D6 (CYP2D6) status and clinical outcomes in tamoxifen therapy. We performed a meta-analysis on data from 4,973 tamoxifen-treated patients (12 globally distributed sites). Using strict eligibility requirements (postmenopausal women with estrogen receptor-positive breast cancer, receiving 20 mg/day tamoxifen for 5 years, criterion 1), CYP2D6 poor metabolizer status was associated with poorer invasive disease-free survival (IDFS: hazard ratio = 1.25; 95% confidence interval = 1.06, 1.47; P = 0.009). However, CYP2D6 status was not statistically significant when tamoxifen duration, menopausal status, and annual follow-up were not specified (criterion 2, n = 2,443; P = 0.25) or when no exclusions were applied (criterion 3, n = 4,935; P = 0.38). Although CYP2D6 is a strong predictor of IDFS using strict inclusion criteria, because the results are not robust to inclusion criteria (these were not defined a priori), prospective studies are necessary to fully establish the value of CYP2D6 genotyping in tamoxifen therapy.
View details for DOI 10.1038/clpt.2013.186
View details for PubMedID 24060820
-
PharmGKB summary: ifosfamide pathways, pharmacokinetics and pharmacodynamics.
Pharmacogenetics and genomics
2014; 24 (2): 133-138
View details for DOI 10.1097/FPC.0000000000000019
View details for PubMedID 24401834
-
Investigating Ligand-Modulation of GPCR Activation Pathways
CELL PRESS. 2014: 14A
View details for DOI 10.1016/j.bpj.2013.11.130
View details for Web of Science ID 000337000400071
-
Variation in the Binding Pocket of an Inhibitor of the Bacterial Division Protein FtsZ Across Genotypes, Nucleotide States, and Species
CELL PRESS. 2014: 474A–475A
View details for DOI 10.1016/j.bpj.2013.11.2684
View details for Web of Science ID 000337000402635
-
Automated physician order recommendations and outcome predictions by data-mining electronic medical records.
AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Science
2014; 2014: 206-210
Abstract
The meaningful use of electronic medical records (EMR) will come from effective clinical decision support (CDS) applied to physician orders, the concrete manifestation of clinical decision making. CDS development is currently limited by a top-down approach, requiring manual production and limited end-user awareness. A statistical data-mining alternative automatically extracts expertise as association statistics from structured EMR data (>5.4M data elements from >19K inpatient encounters). This powers an order recommendation system analogous to commercial systems (e.g., Amazon.com's "Customers who bought this…"). Compared to a standard benchmark, the association method improves order prediction precision from 26% to 37% (p<0.01). Introducing an inverse frequency weighted recall metric demonstrates a quantifiable improvement from 3% to 17% (p<0.01) in recommending more specifically relevant orders. The system also predicts clinical outcomes, such as 30 day mortality and 1 week ICU intervention, with ROC AUC of 0.88 and 0.78 respectively, comparable to state-of-the-art prognosis scores.
View details for PubMedID 25717414
-
BUILDING THE NEXT GENERATION OF QUANTITATIVE BIOLOGISTS
WORLD SCIENTIFIC PUBL CO PTE LTD. 2014: 417–21
Abstract
Many colleges and universities across the globe now offer bachelors, masters, and doctoral degrees, along with certificate programs in bioinformatics. While there is some consensus surrounding curricula competencies, programs vary greatly in their core foci, with some leaning heavily toward the biological sciences and others toward quantitative areas. This allows prospective students to choose a program that best fits their interests and career goals. In the digital age, most scientific fields are facing an enormous growth of data, and as a consequence, the goals and challenges of bioinformatics are rapidly changing; this requires that bioinformatics education also change. In this workshop, we seek to ascertain current trends in bioinformatics education by asking the question, "What are the core competencies all bioinformaticians should have at the end of their training, and how successful have programs been in placing students in desired careers?"
View details for Web of Science ID 000461865200038
View details for PubMedID 24297567
View details for PubMedCentralID PMC3935419
-
Identifying Druggable Targets by Protein Microenvironments Matching: Application to Transcription Factors
CPT-PHARMACOMETRICS & SYSTEMS PHARMACOLOGY
2014; 3 (1)
View details for DOI 10.1038/psp.2013.66
View details for Web of Science ID 000218887300007
-
PATH-SCAN: A REPORTING TOOL FOR IDENTIFYING CLINICALLY ACTIONABLE VARIANTS
WORLD SCIENTIFIC PUBL CO PTE LTD. 2014: 229–40
View details for Web of Science ID 000461865200022
-
Identifying druggable targets by protein microenvironments matching: application to transcription factors.
CPT: pharmacometrics & systems pharmacology
2014; 3
Abstract
Druggability of a protein is its potential to be modulated by drug-like molecules. It is important in the target selection phase. We hypothesize that: (i) known drug-binding sites contain advantageous physicochemical properties for drug binding, or "druggable microenvironments" and (ii) given a target, the presence of multiple druggable microenvironments similar to those seen previously is associated with a high likelihood of druggability. We developed DrugFEATURE to quantify druggability by assessing the microenvironments in potential small-molecule binding sites. We benchmarked DrugFEATURE using two data sets. One data set measures druggability using NMR-based screening. DrugFEATURE correlates well with this metric. The second data set is based on historical drug discovery outcomes. Using the DrugFEATURE cutoffs derived from the first, we accurately discriminated druggable and difficult targets in the second. We further identified novel druggable transcription factors with implications for cancer therapy. DrugFEATURE provides useful insight for drug discovery, by evaluating druggability and suggesting specific regions for interacting with drug-like molecules. CPT: Pharmacometrics Systems Pharmacology (2014) 3, e93; doi:10.1038/psp.2013.66; published online 22 January 2014.
View details for DOI 10.1038/psp.2013.66
View details for PubMedID 24452614
View details for PubMedCentralID PMC3910014
-
PharmGKB summary: mycophenolic acid pathway
PHARMACOGENETICS AND GENOMICS
2014; 24 (1): 73-79
View details for DOI 10.1097/FPC.0000000000000010
View details for Web of Science ID 000328629800009
View details for PubMedID 24220207
-
Path-scan: a reporting tool for identifying clinically actionable variants.
Pacific Symposium on Biocomputing
2014; 19: 229-240
Abstract
The American College of Medical Genetics and Genomics (ACMG) recently released guidelines regarding the reporting of incidental findings in sequencing data. Given the availability of Direct to Consumer (DTC) genetic testing and the falling cost of whole exome and genome sequencing, individuals will increasingly have the opportunity to analyze their own genomic data. We have developed a web-based tool, PATH-SCAN, which annotates individual genomes and exomes for ClinVar designated pathogenic variants found within the genes from the ACMG guidelines. Because mutations in these genes predispose individuals to conditions with actionable outcomes, our tool will allow individuals or researchers to identify potential risk variants in order to consult physicians or genetic counselors for further evaluation. Moreover, our tool allows individuals to anonymously submit their pathogenic burden, so that we can crowd source the collection of quantitative information regarding the frequency of these variants. We tested our tool on 1092 publicly available genomes from the 1000 Genomes project, 163 genomes from the Personal Genome Project, and 15 genomes from a clinical genome sequencing research project. Excluding the most commonly seen variant in 1000 Genomes, about 20% of all genomes analyzed had a ClinVar designated pathogenic variant that required further evaluation.
View details for PubMedID 24297550
-
Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways
NATURE CHEMISTRY
2014; 6 (1): 15-21
Abstract
Simulations can provide tremendous insight into the atomistic details of biological mechanisms, but micro- to millisecond timescales are historically only accessible on dedicated supercomputers. We demonstrate that cloud computing is a viable alternative that brings long-timescale processes within reach of a broader community. We used Google's Exacycle cloud-computing platform to simulate two milliseconds of dynamics of a major drug target, the G-protein-coupled receptor β2AR. Markov state models aggregate independent simulations into a single statistical model that is validated by previous computational and experimental results. Moreover, our models provide an atomistic description of the activation of a G-protein-coupled receptor and reveal multiple activation pathways. Agonists and inverse agonists interact differentially with these pathways, with profound implications for drug design.
View details for DOI 10.1038/NCHEM.1821
View details for Web of Science ID 000328951000007
View details for PubMedID 24345941
-
PharmGKB summary: venlafaxine pathway
PHARMACOGENETICS AND GENOMICS
2014; 24 (1): 62-72
View details for DOI 10.1097/FPC.0000000000000003
View details for Web of Science ID 000328629800008
View details for PubMedID 24128936
-
Identifying phenotypic signatures of neuropsychiatric disorders from electronic medical records.
Journal of the American Medical Informatics Association
2013; 20 (e2): e297-305
Abstract
Mental illness is the leading cause of disability in the USA, but boundaries between different mental illnesses are notoriously difficult to define. Electronic medical records (EMRs) have recently emerged as a powerful new source of information for defining the phenotypic signatures of specific diseases. We investigated how EMR-based text mining and statistical analysis could elucidate the phenotypic boundaries of three important neuropsychiatric illnesses: autism, bipolar disorder, and schizophrenia. We analyzed the medical records of over 7000 patients at two facilities using an automated text-processing pipeline to annotate the clinical notes with Unified Medical Language System codes and then searching for enriched codes, and associations among codes, that were representative of the three disorders. We used dimensionality-reduction techniques on individual patient records to understand individual-level phenotypic variation within each disorder, as well as the degree of overlap among disorders. We demonstrate that automated EMR mining can be used to extract relevant drugs and phenotypes associated with neuropsychiatric disorders and characteristic patterns of associations among them. Patient-level analyses suggest a clear separation between autism and the other disorders, while revealing significant overlap between schizophrenia and bipolar disorder. They also enable localization of individual patients within the phenotypic 'landscape' of each disorder. Because EMRs reflect the realities of patient care rather than idealized conceptualizations of disease states, we argue that automated EMR mining can help define the boundaries between different mental illnesses, facilitate cohort building for clinical and genomic studies, and reveal how clear expert-defined disease boundaries are in practice.
View details for DOI 10.1136/amiajnl-2013-001933
View details for PubMedID 23956017
View details for PubMedCentralID PMC3861917
-
PharmGKB summary: very important pharmacogene information for cytochrome P450, family 2, subfamily C, polypeptide 8
PHARMACOGENETICS AND GENOMICS
2013; 23 (12): 721-728
View details for DOI 10.1097/FPC.0b013e3283653b27
View details for Web of Science ID 000326971400009
View details for PubMedID 23962911
-
Genome Wide Analysis of Drug-Induced Torsades de Pointes: Lack of Common Variants with Large Effect Sizes
PLOS ONE
2013; 8 (11)
Abstract
Marked prolongation of the QT interval on the electrocardiogram associated with the polymorphic ventricular tachycardia Torsades de Pointes is a serious adverse event during treatment with antiarrhythmic drugs and other culprit medications, and is a common cause for drug relabeling and withdrawal. Although clinical risk factors have been identified, the syndrome remains unpredictable in an individual patient. Here we used genome-wide association analysis to search for common predisposing genetic variants. Cases of drug-induced Torsades de Pointes (diTdP), treatment-tolerant controls, and general population controls were ascertained across multiple sites using common definitions, and genotyped on the Illumina 610k or 1M-Duo BeadChips. Principal Components Analysis was used to select 216 Northwestern European diTdP cases and 771 ancestry-matched controls, including treatment-tolerant and general population subjects. With these sample sizes, there is 80% power to detect a variant at genome-wide significance with minor allele frequency of 10% and conferring an odds ratio of ≥2.7. Tests of association were carried out for each single nucleotide polymorphism (SNP) by logistic regression adjusting for gender and population structure. No SNP reached genome-wide significance; the variant with the lowest P value was rs2276314, a non-synonymous coding variant in C18orf21 (p = 3×10⁻⁷, odds ratio = 2, 95% confidence intervals: 1.5-2.6). The haplotype formed by rs2276314 and a second SNP, rs767531, was significantly more frequent in controls than cases (p = 3×10⁻⁹). Expanding the number of controls and a gene-based analysis did not yield significant associations. This study argues that common genomic variants do not contribute importantly to risk for drug-induced Torsades de Pointes across multiple drugs.
View details for DOI 10.1371/journal.pone.0078511
View details for Web of Science ID 000326656200047
View details for PubMedID 24223155
View details for PubMedCentralID PMC3819377
-
PharmGKB summary: tamoxifen pathway, pharmacokinetics.
Pharmacogenetics and genomics
2013; 23 (11): 643-647
View details for DOI 10.1097/FPC.0b013e3283656bc1
View details for PubMedID 23962908
-
PharmGKB summary: very important pharmacogene information for the epidermal growth factor receptor.
Pharmacogenetics and genomics
2013; 23 (11): 636-642
View details for DOI 10.1097/FPC.0b013e3283655091
View details for PubMedID 23962910
-
Using molecular features of xenobiotics to predict hepatic gene expression response.
Journal of chemical information and modeling
2013; 53 (10): 2765-2773
Abstract
Despite recent advances in molecular medicine and rational drug design, many drugs still fail because toxic effects arise at the cellular and tissue level. In order to better understand these effects, cellular assays can generate high-throughput measurements of gene expression changes induced by small molecules. However, our understanding of how the chemical features of small molecules influence gene expression is very limited. Therefore, we investigated the extent to which chemical features of small molecules can reliably be associated with significant changes in gene expression. Specifically, we analyzed the gene expression response of rat liver cells to 170 different drugs and searched for genes whose expression could be related to chemical features alone. Surprisingly, we can predict the up-regulation of 87 genes (increased expression of at least 1.5 times compared to controls). We show an average cross-validation predictive area under the receiver operating characteristic curve (AUROC) of 0.7 or greater for each of these 87 genes. We applied our method to an external data set of rat liver gene expression response to a novel drug and achieved an AUROC of 0.7. We also validated our approach by predicting up-regulation of Cytochrome P450 1A2 (CYP1A2) in three drugs known to induce CYP1A2 that were not in our data set. Finally, a detailed analysis of the CYP1A2 predictor allowed us to identify which fragments made significant contributions to the predictive scores.
View details for DOI 10.1021/ci3005868
View details for PubMedID 24010729
View details for PubMedCentralID PMC3810861
-
PharmGKB summary: cyclosporine and tacrolimus pathways
PHARMACOGENETICS AND GENOMICS
2013; 23 (10): 563-585
View details for DOI 10.1097/FPC.0b013e328364db84
View details for Web of Science ID 000324527600007
View details for PubMedID 23922006
View details for PubMedCentralID PMC4119065
-
A method for inferring medical diagnoses from patient similarities
BMC MEDICINE
2013; 11
Abstract
Clinical decision support systems assist physicians in interpreting complex patient data. However, they typically operate on a per-patient basis and do not exploit the extensive latent medical knowledge in electronic health records (EHRs). The emergence of large EHR systems offers the opportunity to integrate population information actively into these tools. Here, we assess the ability of a large corpus of electronic records to predict individual discharge diagnoses. We present a method that exploits similarities between patients along multiple dimensions to predict the eventual discharge diagnoses. Using demographic, initial blood and electrocardiography measurements, as well as medical history of hospitalized patients from two independent hospitals, we obtained high performance in cross-validation (area under the curve >0.88) and correctly predicted at least one diagnosis among the top ten predictions for more than 84% of the patients tested. Importantly, our method provides accurate predictions (>0.86 precision in cross validation) for major disease categories, including infectious and parasitic diseases, endocrine and metabolic diseases and diseases of the circulatory system. Our performance applies to both chronic and acute diagnoses. Our results suggest that one can harness the wealth of population-based information embedded in electronic health records for patient-specific predictive tasks.
View details for DOI 10.1186/1741-7015-11-194
View details for Web of Science ID 000324133500001
View details for PubMedCentralID PMC3844462
-
PharmGKB summary: methylene blue pathway
PHARMACOGENETICS AND GENOMICS
2013; 23 (9): 498-508
View details for DOI 10.1097/FPC.0b013e32836498f4
View details for Web of Science ID 000323220200007
View details for PubMedID 23913015
-
Genetic variants associated with warfarin dose in African-American individuals: a genome-wide association study.
Lancet
2013; 382 (9894): 790-796
Abstract
BACKGROUND: VKORC1 and CYP2C9 are important contributors to warfarin dose variability, but explain less variability for individuals of African descent than for those of European or Asian descent. We aimed to identify additional variants contributing to warfarin dose requirements in African Americans. METHODS: We did a genome-wide association study of discovery and replication cohorts. Samples from African-American adults (aged ≥18 years) who were taking a stable maintenance dose of warfarin were obtained at International Warfarin Pharmacogenetics Consortium (IWPC) sites and the University of Alabama at Birmingham (Birmingham, AL, USA). Patients enrolled at IWPC sites but who were not used for discovery made up the independent replication cohort. All participants were genotyped. We did a stepwise conditional analysis, conditioning first for VKORC1 -1639G→A, followed by the composite genotype of CYP2C9*2 and CYP2C9*3. We prespecified a genome-wide significance threshold of p<5×10⁻⁸ in the discovery cohort and p<0·0038 in the replication cohort. FINDINGS: The discovery cohort contained 533 participants and the replication cohort 432 participants. After the prespecified conditioning in the discovery cohort, we identified an association between a novel single nucleotide polymorphism in the CYP2C cluster on chromosome 10 (rs12777823) and warfarin dose requirement that reached genome-wide significance (p=1·51×10⁻⁸). This association was confirmed in the replication cohort (p=5·04×10⁻⁵); analysis of the two cohorts together produced a p value of 4·5×10⁻¹². Individuals heterozygous for the rs12777823 A allele need a dose reduction of 6·92 mg/week and those homozygous 9·34 mg/week. Regression analysis showed that the inclusion of rs12777823 significantly improves warfarin dose variability explained by the IWPC dosing algorithm (21% relative improvement).
INTERPRETATION: A novel CYP2C single nucleotide polymorphism exerts a clinically relevant effect on warfarin dose in African Americans, independent of CYP2C9*2 and CYP2C9*3. Incorporation of this variant into pharmacogenetic dosing algorithms could improve warfarin dose prediction in this population. FUNDING: National Institutes of Health, American Heart Association, Howard Hughes Medical Institute, Wisconsin Network for Health Research, and the Wellcome Trust.
View details for DOI 10.1016/S0140-6736(13)60681-9
View details for PubMedID 23755828
-
PharmGKB summary: diuretics pathway, pharmacodynamics
PHARMACOGENETICS AND GENOMICS
2013; 23 (8): 449-453
View details for DOI 10.1097/FPC.0b013e3283636822
View details for Web of Science ID 000323226500009
View details for PubMedID 23788015
-
The Pharmacogenomics Research Network Translational Pharmacogenetics Program: Overcoming Challenges of Real-World Implementation
CLINICAL PHARMACOLOGY & THERAPEUTICS
2013; 94 (2): 207-210
View details for DOI 10.1038/clpt.2013.59
View details for Web of Science ID 000322064400018
View details for PubMedID 23588301
-
Challenges in the Pharmacogenomic Annotation of Whole Genomes
CLINICAL PHARMACOLOGY & THERAPEUTICS
2013; 94 (2): 211-213
View details for DOI 10.1038/clpt.2013.111
View details for Web of Science ID 000322064400019
View details for PubMedID 23708745
-
K-Means for Parallel Architectures Using All-Prefix-Sum Sorting and Updating Steps
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
2013; 24 (8): 1602-1612
View details for DOI 10.1109/TPDS.2012.234
View details for Web of Science ID 000321153600012
-
Pathway analysis of genome-wide data improves warfarin dose prediction
BMC GENOMICS
2013; 14
Abstract
Many genome-wide association studies focus on associating single loci with target phenotypes. However, in the setting of rare variation, accumulating sufficient samples to assess these associations can be difficult. Moreover, multiple variations in a gene or a set of genes within a pathway may all contribute to the phenotype, suggesting that the aggregation of variations found over the gene or pathway may be useful for improving the power to detect associations. Here, we present a method for aggregating single nucleotide polymorphisms (SNPs) along biologically relevant pathways in order to seek genetic associations with phenotypes. Our method uses all available genetic variants and does not remove those in linkage disequilibrium (LD). Instead, it uses a novel SNP weighting scheme to down-weight the contributions of correlated SNPs. We apply our method to three cohorts of patients taking warfarin: two European descent cohorts and an African American cohort. Although the clinical covariates and key pharmacogenetic loci for warfarin have been characterized, our association metric identifies a significant association with mutations distributed throughout the pathway of warfarin metabolism. We improve dose prediction after using all known clinical covariates and pharmacogenetic variants in VKORC1 and CYP2C9. In particular, we find that at least 1% of the missing heritability in warfarin dose may be due to the aggregated effects of variations in the warfarin metabolic pathway, even though the SNPs do not individually show a significant association. Our method allows researchers to study aggregative SNP effects in an unbiased manner by not preselecting SNPs. It retains all the available information by accounting for LD-structure through weighting, which eliminates the need for LD pruning.
View details for DOI 10.1186/1471-2164-14-S3-S11
View details for Web of Science ID 000319869500011
View details for PubMedID 23819817
-
Collective judgment predicts disease-associated single nucleotide variants
BMC GENOMICS
2013; 14
Abstract
In recent years the number of human genetic variants deposited into the publicly available databases has been increasing exponentially. The latest version of dbSNP, for example, contains ~50 million validated Single Nucleotide Variants (SNVs). SNVs make up most of human variation and are often the primary causes of disease. The non-synonymous SNVs (nsSNVs) result in single amino acid substitutions and may affect protein function, often causing disease. Although several methods for the detection of nsSNV effects have already been developed, the consistent increase in annotated data is offering the opportunity to improve prediction accuracy. Here we present a new approach for the detection of disease-associated nsSNVs (Meta-SNP) that integrates four existing methods: PANTHER, PhD-SNP, SIFT and SNAP. We first tested the accuracy of each method using a dataset of 35,766 disease-annotated mutations from 8,667 proteins extracted from the SwissVar database. The four methods reached overall accuracies of 64%-76% with a Matthews correlation coefficient (MCC) of 0.38-0.53. We then used the outputs of these methods to develop a machine learning based approach that discriminates between disease-associated and polymorphic variants (Meta-SNP). In testing, the combined method reached 79% overall accuracy and 0.59 MCC, ~3% higher accuracy and ~0.05 higher correlation with respect to the best-performing method. Moreover, for the hardest-to-define subset of nsSNVs, i.e. variants for which half of the predictors disagreed with the other half, Meta-SNP attained 8% higher accuracy than the best predictor. Here we find that the Meta-SNP algorithm achieves better performance than the best single predictor. This result suggests that the methods used for the prediction of variant-disease associations are orthogonal, encoding different biologically relevant relationships. Careful combination of predictions from various resources is therefore a good strategy for the selection of high reliability predictions. Indeed, for the subset of nsSNVs where all predictors were in agreement (46% of all nsSNVs in the set), our method reached 87% overall accuracy and 0.73 MCC. Meta-SNP server is freely accessible at http://snps.biofold.org/meta-snp.
View details for DOI 10.1186/1471-2164-14-S3-S2
View details for Web of Science ID 000319869500002
View details for PubMedID 23819846
View details for PubMedCentralID PMC3839641
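The Meta-SNP abstract above reports both overall accuracy and Matthews correlation coefficient (MCC). As a minimal sketch of how the two metrics are computed from a binary confusion matrix, the helper names `accuracy` and `mcc` and the counts in the usage note are hypothetical illustrations, not code or data from the paper.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). By convention this sketch returns 0.0 when any
    marginal total is zero and the coefficient is undefined.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts: 100 variants, 80 of them classified correctly.
    print(accuracy(tp=40, tn=40, fp=10, fn=10))
    print(mcc(tp=40, tn=40, fp=10, fn=10))
```

With the hypothetical counts above, accuracy is 0.8 and MCC is 0.6. Unlike accuracy, MCC uses all four cells of the confusion matrix, so it is less easily inflated when one class dominates the data set.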
-
WS-SNPs&GO: a web server for predicting the deleterious effect of human protein variants using functional annotation
BMC GENOMICS
2013; 14
Abstract
SNPs&GO is a method for the prediction of deleterious Single Amino acid Polymorphisms (SAPs) using protein functional annotation. In this work, we present the web server implementation of SNPs&GO (WS-SNPs&GO). The server is based on Support Vector Machines (SVM) and, for a given protein, its input comprises the sequence and/or its three-dimensional structure (when available), a set of target variations and its functional Gene Ontology (GO) terms. For each protein variation, the server outputs the probability that the variation is associated with human disease. The server consists of two main components, including updated versions of the sequence-based SNPs&GO (recently scored as one of the best algorithms for predicting deleterious SAPs) and of the structure-based SNPs&GO(3d) programs. Sequence and structure based algorithms are extensively tested on a large set of annotated variations extracted from the SwissVar database. Selecting a balanced dataset with more than 38,000 SAPs, the sequence-based approach achieves 81% overall accuracy, 0.61 correlation coefficient and an Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve of 0.88. For the subset of ~6,600 variations mapped on protein structures available at the Protein Data Bank (PDB), the structure-based method scores with 84% overall accuracy, 0.68 correlation coefficient, and 0.91 AUC. When tested on a new blind set of variations, the results of the server are 79% and 83% overall accuracy for the sequence-based and structure-based inputs, respectively. WS-SNPs&GO is a valuable tool that includes in a unique framework information derived from protein sequence, structure, evolutionary profile, and protein function. WS-SNPs&GO is freely available at http://snps.biofold.org/snps-and-go.
View details for DOI 10.1186/1471-2164-14-S3-S6
View details for Web of Science ID 000319869500006
View details for PubMedID 23819482
View details for PubMedCentralID PMC3665478
-
Web-scale pharmacovigilance: listening to signals from the crowd
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2013; 20 (3): 404-408
Abstract
Adverse drug events cause substantial morbidity and mortality and are often discovered after a drug comes to market. We hypothesized that Internet users may provide early clues about adverse drug events via their online information-seeking. We conducted a large-scale study of Web search log data gathered during 2010. We pay particular attention to the specific drug pairing of paroxetine and pravastatin, whose interaction was reported to cause hyperglycemia after the time period of the online logs used in the analysis. We also examine sets of drug pairs known to be associated with hyperglycemia and those not associated with hyperglycemia. We find that anonymized signals on drug interactions can be mined from search logs. Compared to analyses of other sources such as electronic health records (EHR), logs are inexpensive to collect and mine. The results demonstrate that logs of the search activities of populations of computer users can contribute to drug safety surveillance.
View details for DOI 10.1136/amiajnl-2012-001482
View details for Web of Science ID 000317477500001
View details for PubMedID 23467469
View details for PubMedCentralID PMC3628066
-
Valproic acid pathway: pharmacokinetics and pharmacodynamics
PHARMACOGENETICS AND GENOMICS
2013; 23 (4): 236-241
View details for DOI 10.1097/FPC.0b013e32835ea0b2
View details for Web of Science ID 000316109700008
View details for PubMedID 23407051
-
Informatics confronts drug-drug interactions
TRENDS IN PHARMACOLOGICAL SCIENCES
2013; 34 (3): 178-184
Abstract
Drug-drug interactions (DDIs) are an emerging threat to public health. Recent estimates indicate that DDIs cause nearly 74,000 emergency room visits and 195,000 hospitalizations each year in the USA. Current approaches to DDI discovery, which include Phase IV clinical trials and post-marketing surveillance, are insufficient for detecting many DDIs and do not alert the public to potentially dangerous DDIs before a drug enters the market. Recent work has applied state-of-the-art computational and statistical methods to the problem of DDIs. Here we review recent developments that encompass a range of informatics approaches in this domain, from the construction of databases for efficient searching of known DDIs to the prediction of novel DDIs based on data from electronic medical records, adverse event reports, scientific abstracts, and other sources. We also explore why DDIs are so difficult to detect and what the future holds for informatics-based approaches to DDI discovery.
View details for DOI 10.1016/j.tips.2013.01.006
View details for Web of Science ID 000316833900008
View details for PubMedID 23414686
-
Personal genomic measurements: the opportunity for information integration.
Clinical pharmacology & therapeutics
2013; 93 (1): 21-23
Abstract
High-throughput genomic measurements initially emerged for research purposes but are now entering the clinic. The challenge for clinicians is to integrate imperfect genomic measurements with other information sources so as to estimate as closely as possible the probabilities of clinical events (diagnoses, treatment responses, prognoses). Population-based data provide a priori probabilities that can be combined with individual measurements to compute a posteriori estimates using Bayes' rule. Thus, the integration of population science with individual genomic measurements will enable the practice of personalized medicine.
View details for DOI 10.1038/clpt.2012.203
View details for PubMedID 23241835
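The abstract above describes combining a population-based prior with an individual measurement through Bayes' rule. A minimal sketch of that calculation for a single binary test follows; the function name `posterior_probability` and the prevalence, sensitivity, and specificity values in the usage note are hypothetical assumptions for illustration and do not come from the article.

```python
def posterior_probability(prior: float, sensitivity: float, specificity: float) -> float:
    """Return P(condition | positive result) using Bayes' rule.

    prior        population-based a priori probability of the condition
    sensitivity  P(positive result | condition present)
    specificity  P(negative result | condition absent)
    """
    for name, value in (("prior", prior),
                        ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    evidence = true_positive + false_positive  # P(positive result)
    if evidence == 0.0:
        raise ValueError("a positive result is impossible under these inputs")
    return true_positive / evidence


if __name__ == "__main__":
    # Hypothetical numbers: 1% prevalence, 99% sensitivity, 95% specificity.
    print(posterior_probability(prior=0.01, sensitivity=0.99, specificity=0.95))
```

With these hypothetical inputs the posterior is 1/6 (about 0.17): even a fairly accurate measurement yields a modest a posteriori probability when the population prior is low, which is why the prior matters for interpreting imperfect genomic measurements.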
-
A method for inferring medical diagnoses from patient similarities.
BMC medicine
2013; 11: 194
View details for DOI 10.1186/1741-7015-11-194
View details for PubMedID 24004670
View details for PubMedCentralID PMC3844462
-
Improving data and knowledge management to better integrate health care and research.
Journal of internal medicine
2013
View details for DOI 10.1111/joim.12105
View details for PubMedID 23808970
-
Inferring the semantic relationships of words within an ontology using random indexing: applications to pharmacogenomics.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
2013; 2013: 1123-1132
Abstract
The biomedical literature presents a uniquely challenging text mining problem. Sentences are long and complex, the subject matter is highly specialized with a distinct vocabulary, and producing annotated training data for this domain is time consuming and expensive. In this environment, unsupervised text mining methods that do not rely on annotated training data are valuable. Here we investigate the use of random indexing, an automated method for producing vector-space semantic representations of words from large, unlabeled corpora, to address the problem of term normalization in sentences describing drugs and genes. We show that random indexing produces similarity scores that capture some of the structure of PHARE, a manually curated ontology of pharmacogenomics concepts. We further show that random indexing can be used to identify likely word candidates for inclusion in the ontology, and can help localize these new labels among classes and roles within the ontology.
View details for PubMedID 24551397
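Random indexing, the technique named in the abstract above, can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' pipeline: it assumes sliding-window co-occurrence contexts and sparse ternary index vectors, and the dimensionality, window size, function names, and example sentences are all hypothetical.

```python
import math
import random
from collections import defaultdict


def make_index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: a few random positions set to +1 or -1."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(sentences, dim=512, nonzeros=8, window=2, seed=0):
    """Build word vectors by summing the index vectors of nearby words."""
    rng = random.Random(seed)
    # Step 1: give every distinct word a fixed random index vector.
    index = {}
    for sentence in sentences:
        for word in sentence:
            if word not in index:
                index[word] = make_index_vector(dim, nonzeros, rng)
    # Step 2: accumulate neighbours' index vectors into each word's context vector.
    context = defaultdict(lambda: [0] * dim)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start, stop = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j == i:
                    continue
                target = context[word]
                for k, value in enumerate(index[sentence[j]]):
                    target[k] += value
    return dict(context)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    corpus = [
        ["warfarin", "inhibits", "vkorc1"],
        ["coumadin", "inhibits", "vkorc1"],
        ["aspirin", "blocks", "cox1"],
    ]
    vectors = random_indexing(corpus)
    print(cosine(vectors["warfarin"], vectors["coumadin"]))
    print(cosine(vectors["warfarin"], vectors["aspirin"]))
```

In this toy corpus "warfarin" and "coumadin" occur in identical contexts, so their context vectors coincide and their similarity is high, while "aspirin" shares no context words with them. No annotated training data is needed, which is the property the abstract highlights.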
-
Proceedings of Pacific Symposium on Biocomputing 2011.
edited by Altman, R., Dunker, K., Hunter, L.
2013
-
Mining for clinical expertise in (undocumented) order sets to power an order suggestion system.
AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science
2013; 2013: 34-38
Abstract
Physician orders, the concrete manifestation of clinical decision making, are enhanced by the distribution of clinical expertise in the form of order sets and corollary orders. Conventional order sets are top-down distributed by committees of experts, limited by the cost of manual development, maintenance, and limited end-user awareness. An alternative explored here applies statistical data-mining to physician order data (>330K order instances from >1.4K inpatient encounters) to extract clinical expertise from the bottom-up. This powers a corollary order suggestion engine using techniques analogous to commercial product recommendation systems (e.g., Amazon.com's "Customers who bought this…" feature). Compared to a simple benchmark, the item-based association method illustrated here improves order prediction precision from 13% to 18% and further to 28% by incorporating information on the temporal relationship between orders. Incorporating statistics on conditional order frequency ratios further refines recommendations beyond just "common" orders to those relevant to a specific clinical context.
View details for PubMedID 24303232
-
PharmGKB: the Pharmacogenomics Knowledge Base.
Methods in molecular biology (Clifton, N.J.)
2013; 1015: 311-320
Abstract
The Pharmacogenomics Knowledge Base, PharmGKB, is an interactive tool for researchers investigating how genetic variation affects drug response. The PharmGKB Web site, http://www.pharmgkb.org, displays genotype, molecular, and clinical knowledge integrated into pathway representations and Very Important Pharmacogene (VIP) summaries with links to additional external resources. Users can search and browse the knowledgebase by genes, variants, drugs, diseases, and pathways. Registration is free to the entire research community, but subject to agreement to use for research purposes only and not to redistribute. Registered users can access and download data to aid in the design of future pharmacogenetics and pharmacogenomics studies.
View details for DOI 10.1007/978-1-62703-435-7_20
View details for PubMedID 23824865
-
Pathway analysis of genome-wide data improves warfarin dose prediction.
BMC genomics
2013; 14: S11-?
Abstract
Many genome-wide association studies focus on associating single loci with target phenotypes. However, in the setting of rare variation, accumulating sufficient samples to assess these associations can be difficult. Moreover, multiple variations in a gene or a set of genes within a pathway may all contribute to the phenotype, suggesting that the aggregation of variations found over the gene or pathway may be useful for improving the power to detect associations.Here, we present a method for aggregating single nucleotide polymorphisms (SNPs) along biologically relevant pathways in order to seek genetic associations with phenotypes. Our method uses all available genetic variants and does not remove those in linkage disequilibrium (LD). Instead, it uses a novel SNP weighting scheme to down-weight the contributions of correlated SNPs. We apply our method to three cohorts of patients taking warfarin: two European descent cohorts and an African American cohort. Although the clinical covariates and key pharmacogenetic loci for warfarin have been characterized, our association metric identifies a significant association with mutations distributed throughout the pathway of warfarin metabolism. We improve dose prediction after using all known clinical covariates and pharmacogenetic variants in VKORC1 and CYP2C9. In particular, we find that at least 1% of the missing heritability in warfarin dose may be due to the aggregated effects of variations in the warfarin metabolic pathway, even though the SNPs do not individually show a significant association.Our method allows researchers to study aggregative SNP effects in an unbiased manner by not preselecting SNPs. It retains all the available information by accounting for LD-structure through weighting, which eliminates the need for LD pruning.
View details for DOI 10.1186/1471-2164-14-S3-S11
View details for PubMedID 23819817
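The abstract above describes down-weighting correlated SNPs rather than pruning them, but it does not give the weighting formula. The following is only a minimal sketch of the general idea under an assumed scheme: each SNP's weight is the reciprocal of its summed pairwise r² with every SNP in the pathway, itself included. The function names `ld_weights` and `pathway_score` and the toy numbers are hypothetical and are not taken from the paper.

```python
def ld_weights(r2):
    """Return one weight per SNP from a symmetric matrix of pairwise r^2 values.

    Assumed scheme (not the paper's): w_i = 1 / sum_j r2[i][j], so a SNP that
    is strongly correlated with many others contributes less. The diagonal is
    1.0, which keeps every row sum at or above 1.
    """
    return [1.0 / sum(row) for row in r2]


def pathway_score(per_snp_stats, weights):
    """Weighted mean of per-SNP association statistics across one pathway."""
    total_weight = sum(weights)
    return sum(w * s for w, s in zip(weights, per_snp_stats)) / total_weight


# Hypothetical toy data: SNPs 0 and 1 are in perfect LD, SNP 2 is independent.
r2 = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
weights = ld_weights(r2)
score = pathway_score([2.0, 2.0, 4.0], weights)
print(weights, score)
```

In this toy case the two perfectly correlated SNPs share the weight of a single independent SNP, so no variant is discarded and the duplicated signal is not counted twice.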
-
Impact of the CYP4F2 p.V433M Polymorphism on Coumarin Dose Requirement: Systematic Review and Meta-Analysis
CLINICAL PHARMACOLOGY & THERAPEUTICS
2012; 92 (6): 746-756
Abstract
A systematic review and a meta-analysis were performed to quantify the accumulated information from genetic association studies investigating the impact of the CYP4F2 rs2108622 (p.V433M) polymorphism on coumarin dose requirement. An additional aim was to explore the contribution of the CYP4F2 variant in comparison with, as well as after stratification for, the VKORC1 and CYP2C9 variants. Thirty studies involving 9,470 participants met prespecified inclusion criteria. Compared with CC homozygotes, T-allele carriers required an 8.3% (95% confidence interval (CI): 5.6-11.1%; P < 0.0001) higher mean daily coumarin dose to reach a stable international normalized ratio (INR). There was no evidence of publication bias. Heterogeneity among studies was present (I² = 43%). Our results show that the CYP4F2 p.V433M polymorphism is associated with interindividual variability in response to coumarin drugs, but with a low effect size that is confirmed to be lower than those contributed by VKORC1 and CYP2C9 polymorphisms.
View details for DOI 10.1038/clpt.2012.184
View details for Web of Science ID 000311283400016
View details for PubMedID 23132553
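The heterogeneity figure quoted in the abstract above is an I² statistic. As a minimal sketch of how such a value is usually obtained, the code below computes Cochran's Q under a fixed-effect, inverse-variance model and then I² = max(0, (Q - df) / Q). The study effects and variances are hypothetical toy values, not data from the meta-analysis, and the function names are illustrative.

```python
def cochran_q(effects, variances):
    """Return (Q, pooled_effect) using inverse-variance fixed-effect weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return q, pooled


def i_squared(q, n_studies):
    """Percentage of total variation attributed to between-study heterogeneity."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical per-study effects (percent dose difference) and their variances.
effects = [5.0, 10.0, 15.0]
variances = [4.0, 4.0, 4.0]

q, pooled = cochran_q(effects, variances)
i2 = i_squared(q, len(effects))
print(f"Q = {q:.2f}, pooled = {pooled:.2f}, I^2 = {i2:.1f}%")
```

I² is truncated at zero because Q can fall below its degrees of freedom by chance when studies agree closely.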
-
Chapter 7: Pharmacogenomics
PLOS COMPUTATIONAL BIOLOGY
2012; 8 (12)
Abstract
There is great variation in drug-response phenotypes, and a "one size fits all" paradigm for drug delivery is flawed. Pharmacogenomics is the study of how human genetic information impacts drug response, and it aims to improve efficacy and reduce side effects. In this article, we provide an overview of pharmacogenetics, including pharmacokinetics (PK), pharmacodynamics (PD), gene and pathway interactions, and off-target effects. We describe methods for discovering genetic factors in drug response, including genome-wide association studies (GWAS), expression analysis, and other methods such as chemoinformatics and natural language processing (NLP). We cover the practical applications of pharmacogenomics both in the pharmaceutical industry and in a clinical setting. In drug discovery, pharmacogenomics can be used to aid lead identification, anticipate adverse events, and assist in drug repurposing efforts. Moreover, pharmacogenomic discoveries show promise as important elements of physician decision support. Finally, we consider the ethical, regulatory, and reimbursement challenges that remain for the clinical implementation of pharmacogenomics.
View details for DOI 10.1371/journal.pcbi.1002817
View details for Web of Science ID 000312901500023
View details for PubMedID 23300409
View details for PubMedCentralID PMC3531317
-
PharmGKB summary: zidovudine pathway
PHARMACOGENETICS AND GENOMICS
2012; 22 (12): 891-894
View details for DOI 10.1097/FPC.0b013e32835879a8
View details for Web of Science ID 000311031800008
View details for PubMedID 22960662
-
Introduction to Translational Bioinformatics Collection
PLOS COMPUTATIONAL BIOLOGY
2012; 8 (12)
View details for DOI 10.1371/journal.pcbi.1002796
View details for Web of Science ID 000312901500006
View details for PubMedID 23300404
View details for PubMedCentralID PMC3531318
-
Metformin pathways: pharmacokinetics and pharmacodynamics
PHARMACOGENETICS AND GENOMICS
2012; 22 (11): 820-827
View details for DOI 10.1097/FPC.0b013e3283559b22
View details for Web of Science ID 000309977100008
View details for PubMedID 22722338
View details for PubMedCentralID PMC3651676
-
Very important pharmacogene summary for VDR
PHARMACOGENETICS AND GENOMICS
2012; 22 (10): 758-763
View details for DOI 10.1097/FPC.0b013e328354455c
View details for Web of Science ID 000309115000007
View details for PubMedID 22588316
View details for PubMedCentralID PMC3678550
-
Implementing Personalized Medicine: Development of a Cost-Effective Customized Pharmacogenetics Genotyping Array
CLINICAL PHARMACOLOGY & THERAPEUTICS
2012; 92 (4): 437-439
Abstract
Although there is increasing evidence to support the implementation of pharmacogenetics in certain clinical scenarios, the adoption of this approach has been limited. The advent of preemptive and inexpensive testing of critical pharmacogenetic variants may overcome barriers to adoption. We describe the design of a customized array built for the personalized-medicine programs of the University of Florida and Stanford University. We selected key variants for the array using the clinical annotations of the Pharmacogenomics Knowledgebase (PharmGKB), and we included variants in drug metabolism and transporter genes along with other pharmacogenetically important variants.
View details for DOI 10.1038/clpt.2012.125
View details for Web of Science ID 000309017000017
View details for PubMedID 22910441
View details for PubMedCentralID PMC3454443
-
Pharmacogenomics Knowledge for Personalized Medicine
CLINICAL PHARMACOLOGY & THERAPEUTICS
2012; 92 (4): 414-417
Abstract
The Pharmacogenomics Knowledgebase (PharmGKB) is a resource that collects, curates, and disseminates information about the impact of human genetic variation on drug responses. It provides clinically relevant information, including dosing guidelines, annotated drug labels, and potentially actionable gene-drug associations and genotype-phenotype relationships. Curators assign levels of evidence to variant-drug associations using well-defined criteria based on careful literature review. Thus, PharmGKB is a useful source of high-quality information supporting personalized medicine-implementation projects.
View details for DOI 10.1038/clpt.2012.96
View details for Web of Science ID 000309017000009
View details for PubMedID 22992668
View details for PubMedCentralID PMC3660037
-
The state of the art in text mining and natural language processing for pharmacogenomics
JOURNAL OF BIOMEDICAL INFORMATICS
2012; 45 (5): 825-826
View details for DOI 10.1016/j.jbi.2012.08.001
View details for Web of Science ID 000309146200001
-
Mice lacking the β2 adrenergic receptor have a unique genetic profile before and after focal brain ischaemia.
ASN neuro
2012; 4 (5)
Abstract
The role of the β2AR (β2 adrenergic receptor) after stroke is unclear as pharmacological manipulations of the β2AR have produced contradictory results. We previously showed that mice deficient in the β2AR (β2KO) had smaller infarcts compared with WT (wild-type) mice (FVB) after MCAO (middle cerebral artery occlusion), a model of stroke. To elucidate mechanisms of this neuroprotection, we evaluated changes in gene expression using microarrays comparing differences before and after MCAO, and differences between genotypes. Genes associated with inflammation and cell deaths were enriched after MCAO in both genotypes, and we identified several genes not previously shown to increase following ischaemia (Ccl9, Gem and Prg4). In addition to networks that were similar between genotypes, one network with a central core of GPCR (G-protein-coupled receptor) and including biological functions such as carbohydrate metabolism, small molecule biochemistry and inflammation was identified in FVB mice but not in β2KO mice. Analysis of differences between genotypes revealed 11 genes differentially expressed by genotype both before and after ischaemia. We demonstrate greater Glo1 protein levels and lower Pmaip/Noxa mRNA levels in β2KO mice in both sham and MCAO conditions. As both genes are implicated in NF-κB (nuclear factor κB) signalling, we measured p65 activity and TNFα (tumour necrosis factor α) levels 24 h after MCAO. MCAO-induced p65 activation and post-ischaemic TNFα production were both greater in FVB compared with β2KO mice. These results suggest that loss of β2AR signalling results in a neuroprotective phenotype in part due to decreased NF-κB signalling, decreased inflammation and decreased apoptotic signalling in the brain.
View details for DOI 10.1042/AN20110020
View details for PubMedID 22867428
View details for PubMedCentralID PMC3436074
-
PharmGKB summary: very important pharmacogene information for cytochrome P-450, family 2, subfamily A, polypeptide 6
PHARMACOGENETICS AND GENOMICS
2012; 22 (9): 695-708
View details for DOI 10.1097/FPC.0b013e3283540217
View details for PubMedID 22547082
-
PharmGKB summary: very important pharmacogene information for GSTT1
PHARMACOGENETICS AND GENOMICS
2012; 22 (8): 646-651
View details for DOI 10.1097/FPC.0b013e3283527c02
View details for Web of Science ID 000306483500009
View details for PubMedID 22643671
View details for PubMedCentralID PMC3395771
-
Bioinformatics and variability in drug response: a protein structural perspective
JOURNAL OF THE ROYAL SOCIETY INTERFACE
2012; 9 (72): 1409-1437
Abstract
Marketed drugs frequently perform worse in clinical practice than in the clinical trials on which their approval is based. Many therapeutic compounds are ineffective for a large subpopulation of patients to whom they are prescribed; worse, a significant fraction of patients experience adverse effects more severe than anticipated. The unacceptable risk-benefit profile for many drugs mandates a paradigm shift towards personalized medicine. However, prior to adoption of patient-specific approaches, it is useful to understand the molecular details underlying variable drug response among diverse patient populations. Over the past decade, progress in structural genomics led to an explosion of available three-dimensional structures of drug target proteins while efforts in pharmacogenetics offered insights into polymorphisms correlated with differential therapeutic outcomes. Together these advances provide the opportunity to examine how altered protein structures arising from genetic differences affect protein-drug interactions and, ultimately, drug response. In this review, we first summarize structural characteristics of protein targets and common mechanisms of drug interactions. Next, we describe the impact of coding mutations on protein structures and drug response. Finally, we highlight tools for analysing protein structures and protein-drug interactions and discuss their application for understanding altered drug responses associated with protein structural variants.
View details for DOI 10.1098/rsif.2011.0843
View details for Web of Science ID 000304437400001
View details for PubMedID 22552919
-
PharmGKB summary: very important pharmacogene information for CYP3A5
PHARMACOGENETICS AND GENOMICS
2012; 22 (7): 555-558
View details for DOI 10.1097/FPC.0b013e328351d47f
View details for Web of Science ID 000305429900009
View details for PubMedID 22407409
-
Editorial: Current progress in Bioinformatics 2012
BRIEFINGS IN BIOINFORMATICS
2012; 13 (4): 393-394
View details for DOI 10.1093/bib/bbs042
View details for Web of Science ID 000306925000001
View details for PubMedID 22833494
-
PharmGKB summary: phenytoin pathway
PHARMACOGENETICS AND GENOMICS
2012; 22 (6): 466-470
View details for DOI 10.1097/FPC.0b013e32834aeedb
View details for Web of Science ID 000303769700007
View details for PubMedID 22569204
View details for PubMedCentralID PMC3349446
-
Translational Bioinformatics: Linking the Molecular World to the Clinical World
CLINICAL PHARMACOLOGY & THERAPEUTICS
2012; 91 (6): 994-1000
Abstract
Translational bioinformatics represents the union of translational medicine and bioinformatics. Translational medicine moves basic biological discoveries from the research bench into the patient-care setting and uses clinical observations to inform basic biology. It focuses on patient care, including the creation of new diagnostics, prognostics, prevention strategies, and therapies based on biological discoveries. Bioinformatics involves algorithms to represent, store, and analyze basic biological data, including DNA sequence, RNA expression, and protein and small-molecule abundance within cells. Translational bioinformatics spans these two fields; it involves the development of algorithms to analyze basic molecular and cellular data with an explicit goal of affecting clinical care.
View details for DOI 10.1038/clpt.2012.49
View details for Web of Science ID 000304245800017
View details for PubMedID 22549287
-
PharmGKB summary: caffeine pathway
PHARMACOGENETICS AND GENOMICS
2012; 22 (5): 389-395
View details for DOI 10.1097/FPC.0b013e3283505d5e
View details for Web of Science ID 000302783800008
View details for PubMedID 22293536
View details for PubMedCentralID PMC3381939
-
Using ODIN for a PharmGKB revalidation experiment
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2012
Abstract
The need for efficient text-mining tools that support curation of the biomedical literature is ever increasing. In this article, we describe an experiment aimed at verifying whether a text-mining tool capable of extracting meaningful relationships among domain entities can be successfully integrated into the curation workflow of a major biological database. We evaluate in particular (i) the usability of the system's interface, as perceived by users, and (ii) the correlation of the ranking of interactions, as provided by the text-mining system, with the choices of the curators.
View details for DOI 10.1093/database/bas021
View details for Web of Science ID 000304924100001
View details for PubMedID 22529178
View details for PubMedCentralID PMC3332569
-
Celecoxib pathways: pharmacokinetics and pharmacodynamics
PHARMACOGENETICS AND GENOMICS
2012; 22 (4): 310-318
View details for DOI 10.1097/FPC.0b013e32834f94cb
View details for Web of Science ID 000301537400010
View details for PubMedID 22336956
View details for PubMedCentralID PMC3303994
-
Personal Omics Profiling Reveals Dynamic Molecular and Medical Phenotypes
CELL
2012; 148 (6): 1293-1307
Abstract
Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here, we present an integrative personal omics profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14 month period. Our iPOP analysis revealed various medical risks, including type 2 diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high-coverage genomic and transcriptomic data, which provide the basis of our iPOP, revealed extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity.
View details for DOI 10.1016/j.cell.2012.02.009
View details for PubMedID 22424236
-
Data-Driven Prediction of Drug Effects and Interactions
SCIENCE TRANSLATIONAL MEDICINE
2012; 4 (125)
Abstract
Adverse drug events remain a leading cause of morbidity and mortality around the world. Many adverse events are not detected during clinical trials before a drug receives approval for use in the clinic. Fortunately, as part of postmarketing surveillance, regulatory agencies and other institutions maintain large collections of adverse event reports, and these databases present an opportunity to study drug effects from patient population data. However, confounding factors such as concomitant medications, patient demographics, patient medical histories, and reasons for prescribing a drug often are uncharacterized in spontaneous reporting systems, and these omissions can limit the use of quantitative signal detection methods used in the analysis of such data. Here, we present an adaptive data-driven approach for correcting these factors in cases for which the covariates are unknown or unmeasured and combine this approach with existing methods to improve analyses of drug effects using three test data sets. We also present a comprehensive database of drug effects (Offsides) and a database of drug-drug interaction side effects (Twosides). To demonstrate the biological use of these new resources, we used them to identify drug targets, predict drug indications, and discover drug class interactions. We then corroborated 47 (P < 0.0001) of the drug class interactions using an independent analysis of electronic medical records. Our analysis suggests that combined treatment with selective serotonin reuptake inhibitors and thiazides is associated with significantly increased incidence of prolonged QT intervals. We conclude that confounding effects from covariates in observational clinical data can be controlled in data analyses and thus improve the detection and prediction of adverse drug effects and interactions.
View details for DOI 10.1126/scitranslmed.3003377
View details for Web of Science ID 000301538300005
View details for PubMedID 22422992
View details for PubMedCentralID PMC3382018
-
PharmGKB summary: very important pharmacogene information for G6PD
PHARMACOGENETICS AND GENOMICS
2012; 22 (3): 219-228
View details for DOI 10.1097/FPC.0b013e32834eb313
View details for Web of Science ID 000300409800008
View details for PubMedID 22237549
View details for PubMedCentralID PMC3382019
-
Simbios: an NIH national center for physics-based simulation of biological structures
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2012; 19 (2): 186-189
Abstract
Physics-based simulation provides a powerful framework for understanding biological form and function. Simulations can be used by biologists to study macromolecular assemblies and by clinicians to design treatments for diseases. Simulations help biomedical researchers understand the physical constraints on biological systems as they engineer novel drugs, synthetic tissues, medical devices, and surgical interventions. Although individual biomedical investigators make outstanding contributions to physics-based simulation, the field has been fragmented. Applications are typically limited to a single physical scale, and individual investigators usually must create their own software. These conditions created a major barrier to advancing simulation capabilities. In 2004, we established a National Center for Physics-Based Simulation of Biological Structures (Simbios) to help integrate the field and accelerate biomedical research. In 6 years, Simbios has become a vibrant national center, with collaborators in 16 states and eight countries. Simbios focuses on problems at both the molecular scale and the organismal level, with a long-term goal of uniting these in accurate multiscale simulations.
View details for DOI 10.1136/amiajnl-2011-000488
View details for Web of Science ID 000300768100009
View details for PubMedID 22081222
View details for PubMedCentralID PMC3277621
-
PharmGKB summary: very important pharmacogene information for cytochrome P450, family 2, subfamily C, polypeptide 19
PHARMACOGENETICS AND GENOMICS
2012; 22 (2): 159-165
View details for DOI 10.1097/FPC.0b013e32834d4962
View details for Web of Science ID 000299310600008
View details for PubMedID 22027650
View details for PubMedCentralID PMC3349992
-
PharmGKB summary: very important pharmacogene information for CYP1A2
PHARMACOGENETICS AND GENOMICS
2012; 22 (1): 73-77
View details for DOI 10.1097/FPC.0b013e32834c6efd
View details for Web of Science ID 000298249500009
View details for PubMedID 21989077
View details for PubMedCentralID PMC3346273
-
SYSTEMS PHARMACOGENOMICS-BRIDGING THE GAP
WORLD SCIENTIFIC PUBL CO PTE LTD. 2012: 442
View details for Web of Science ID 000407150800044
- Interpretome: a freely available, modular, and secure personal genome interpretation engine. 2012
- Discovery and explanation of drug-drug interactions via text mining. 2012
- Chapter 7: Pharmacogenomics. PLoS Comput Biol., PMCID: PMC3531317. 2012; 8 (12): e1002817
-
Mice lacking the beta 2 adrenergic receptor have a unique genetic profile before and after focal brain ischaemia
ASN NEURO
2012; 4 (5): 343-356
View details for DOI 10.1042/AN20110020
View details for Web of Science ID 000308887200005
-
A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2012; 19 (1): 79-85
Abstract
Adverse drug events (ADEs) are common and account for 770 000 injuries and deaths each year and drug interactions account for as much as 30% of these ADEs. Spontaneous reporting systems routinely collect ADEs from patients on complex combinations of medications and provide an opportunity to discover unexpected drug interactions. Unfortunately, current algorithms for such "signal detection" are limited by underreporting of interactions that are not expected. We present a novel method to identify latent drug interaction signals in the case of underreporting. We identified eight clinically significant adverse events. We used the FDA's Adverse Event Reporting System to build profiles for these adverse events based on the side effects of drugs known to produce them. We then looked for pairs of drugs that match these single-drug profiles in order to predict potential interactions. We evaluated these interactions in two independent data sets and also through a retrospective analysis of the Stanford Hospital electronic medical records. We identified 171 novel drug interactions (for eight adverse event categories) that are significantly enriched for known drug interactions (p=0.0009) and used the electronic medical record for independently testing drug interaction hypotheses using multivariate statistical models with covariates. Our method provides an option for detecting hidden interactions in spontaneous reporting systems by using side effect profiles to infer the presence of unreported adverse events.
View details for DOI 10.1136/amiajnl-2011-000214
View details for Web of Science ID 000298848100012
View details for PubMedID 21676938
View details for PubMedCentralID PMC3240755
-
Interpretome: a freely available, modular, and secure personal genome interpretation engine.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2012: 339-350
Abstract
The decreasing cost of genotyping and genome sequencing has ushered in an era of genomic personalized medicine. More than 100,000 individuals have been genotyped by direct-to-consumer genetic testing services, which offer a glimpse into the interpretation and exploration of a personal genome. However, these interpretations, which require extensive manual curation, are subject to the preferences of the company and are not customizable by the individual. Academic institutions teaching personalized medicine, as well as genetic hobbyists, may prefer to customize their analysis and have full control over the content and method of interpretation. We present the Interpretome, a system for private genome interpretation, which contains all genotype information in client-side interpretation scripts, supported by server-side databases. We provide state-of-the-art analyses for teaching clinical implications of personal genomics, including disease risk assessment and pharmacogenomics. Additionally, we have implemented client-side algorithms for ancestry inference, demonstrating the power of these methods without excessive computation. Finally, the modular nature of the system allows for plugin capabilities for custom analyses. This system will allow for personal genome exploration without compromising privacy, facilitating hands-on courses in genomics and personalized medicine.
View details for PubMedID 22174289
-
Discovery and explanation of drug-drug interactions via text mining.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2012: 410-421
Abstract
Drug-drug interactions (DDIs) can occur when two drugs interact with the same gene product. Most available information about gene-drug relationships is contained within the scientific literature, but is dispersed over a large number of publications, with thousands of new publications added each month. In this setting, automated text mining is an attractive solution for identifying gene-drug relationships and aggregating them to predict novel DDIs. In previous work, we have shown that gene-drug interactions can be extracted from Medline abstracts with high fidelity - we extract not only the genes and drugs, but also the type of relationship expressed in individual sentences (e.g. metabolize, inhibit, activate and many others). We normalize these relationships and map them to a standardized ontology. In this work, we hypothesize that we can combine these normalized gene-drug relationships, drawn from a very broad and diverse literature, to infer DDIs. Using a training set of established DDIs, we have trained a random forest classifier to score potential DDIs based on the features of the normalized assertions extracted from the literature that relate two drugs to a gene product. The classifier recognizes the combinations of relationships, drugs and genes that are most associated with the gold standard DDIs, correctly identifying 79.8% of assertions relating interacting drug pairs and 78.9% of assertions relating noninteracting drug pairs. Most significantly, because our text processing method captures the semantics of individual gene-drug relationships, we can construct mechanistic pharmacological explanations for the newly-proposed DDIs. We show how our classifier can be used to explain known DDIs and to uncover new DDIs that have not yet been reported.
View details for PubMedID 22174296
-
From pharmacogenomic knowledge acquisition to clinical applications: the PharmGKB as a clinical pharmacogenomic biomarker resource
BIOMARKERS IN MEDICINE
2011; 5 (6): 795-806
Abstract
The mission of the Pharmacogenomics Knowledge Base (PharmGKB; www.pharmgkb.org) is to collect, encode and disseminate knowledge about the impact of human genetic variations on drug responses. It is an important worldwide resource of clinical pharmacogenomic biomarkers available to all. The PharmGKB website has evolved to highlight our knowledge curation and aggregation over our previous emphasis on collecting primary data. This review summarizes the methods we use to drive this expanded scope of 'Knowledge Acquisition to Clinical Applications', the new features available on our website and our future goals.
View details for DOI 10.2217/BMM.11.94
View details for Web of Science ID 000298488200009
View details for PubMedID 22103613
View details for PubMedCentralID PMC3339046
-
Using Multiple Microenvironments to Find Similar Ligand-Binding Sites: Application to Kinase Inhibitor Binding
PLOS COMPUTATIONAL BIOLOGY
2011; 7 (12)
Abstract
The recognition of cryptic small-molecular binding sites in protein structures is important for understanding off-target side effects and for recognizing potential new indications for existing drugs. Current methods focus on the geometry and detailed chemical interactions within putative binding pockets, but may not recognize distant similarities where dynamics or modified interactions allow one ligand to bind apparently divergent binding pockets. In this paper, we introduce an algorithm that seeks similar microenvironments within two binding sites, and assesses overall binding site similarity by the presence of multiple shared microenvironments. The method has relatively weak geometric requirements (to allow for conformational change or dynamics in both the ligand and the pocket) and uses multiple biophysical and biochemical measures to characterize the microenvironments (to allow for diverse modes of ligand binding). We term the algorithm PocketFEATURE, since it focuses on pockets using the FEATURE system for characterizing microenvironments. We validate PocketFEATURE first by showing that it can better discriminate sites that bind similar ligands from those that do not, and by showing that we can recognize FAD-binding sites on a proteome scale with Area Under the Curve (AUC) of 92%. We then apply PocketFEATURE to evolutionarily distant kinases, for which the method recognizes several proven distant relationships, and predicts unexpected shared ligand binding. Using experimental data from ChEMBL and Ambit, we show that at high significance level, 40 kinase pairs are predicted to share ligands. Some of these pairs offer new opportunities for inhibiting two proteins in a single pathway.
View details for DOI 10.1371/journal.pcbi.1002326
View details for Web of Science ID 000299167800043
View details for PubMedID 22219723
View details for PubMedCentralID PMC3248393
-
PharmGKB summary: carbamazepine pathway
PHARMACOGENETICS AND GENOMICS
2011; 21 (12): 906-910
View details for DOI 10.1097/FPC.0b013e328348c6f2
View details for Web of Science ID 000296799900016
View details for PubMedID 21738081
View details for PubMedCentralID PMC3349991
-
PharmGKB summary: citalopram pharmacokinetics pathway
PHARMACOGENETICS AND GENOMICS
2011; 21 (11): 769-772
View details for DOI 10.1097/FPC.0b013e328346063f
View details for Web of Science ID 000296146400010
View details for PubMedID 21546862
View details for PubMedCentralID PMC3349993
-
PharmGKB summary: methotrexate pathway
PHARMACOGENETICS AND GENOMICS
2011; 21 (10): 679-686
View details for DOI 10.1097/FPC.0b013e328343dd93
View details for Web of Science ID 000294808900008
View details for PubMedID 21317831
View details for PubMedCentralID PMC3139712
-
Clinical Pharmacogenetics Implementation Consortium Guidelines for CYP2C9 and VKORC1 Genotypes and Warfarin Dosing
CLINICAL PHARMACOLOGY & THERAPEUTICS
2011; 90 (4): 625-629
Abstract
Warfarin is a widely used anticoagulant with a narrow therapeutic index and large interpatient variability in the dose required to achieve target anticoagulation. Common genetic variants in the cytochrome P450-2C9 (CYP2C9) and vitamin K-epoxide reductase complex (VKORC1) enzymes, in addition to known nongenetic factors, account for ~50% of warfarin dose variability. The purpose of this article is to assist in the interpretation and use of CYP2C9 and VKORC1 genotype data for estimating therapeutic warfarin dose to achieve an INR of 2-3, should genotype results be available to the clinician. The Clinical Pharmacogenetics Implementation Consortium (CPIC) of the National Institutes of Health Pharmacogenomics Research Network develops peer-reviewed gene-drug guidelines that are published and updated periodically on http://www.pharmgkb.org based on new developments in the field.
View details for DOI 10.1038/clpt.2011.185
View details for Web of Science ID 000295119200035
View details for PubMedID 21900891
View details for PubMedCentralID PMC3187550
-
A new disease-specific machine learning approach for the prediction of cancer-causing missense variants
GENOMICS
2011; 98 (4): 310-317
Abstract
High-throughput genotyping and sequencing techniques are rapidly and inexpensively providing large amounts of human genetic variation data. Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability and have been implicated in several human diseases, including cancer. Amino acid mutations resulting from non-synonymous SNPs in coding regions may generate protein functional changes that affect cell proliferation. In this study, we developed a machine learning approach to predict cancer-causing missense variants. We present a Support Vector Machine (SVM) classifier trained on a set of 3163 cancer-causing variants and an equal number of neutral polymorphisms. The method achieves 93% overall accuracy, a correlation coefficient of 0.86, and an area under the ROC curve of 0.98. When compared with other previously developed algorithms such as SIFT and CHASM, our method results in higher prediction accuracy and correlation coefficient in identifying cancer-causing variants.
View details for DOI 10.1016/j.ygeno.2011.06.010
View details for Web of Science ID 000295896300011
View details for PubMedID 21763417
View details for PubMedCentralID PMC3371640
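The accuracy and correlation coefficient quoted in this abstract can be related with a short sketch. It assumes that "correlation coefficient" means the Matthews correlation coefficient and that sensitivity and specificity are both 0.93; neither assumption is stated in the abstract, and the confusion-matrix counts below are hypothetical values derived from those assumptions, not results from the paper.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Balanced set size taken from the abstract: 3163 variants per class.
n_per_class = 3163

# ASSUMPTION (not from the abstract): sensitivity = specificity = 0.93.
tp = round(0.93 * n_per_class)   # hypothetical true positives
fn = n_per_class - tp            # hypothetical false negatives
tn = round(0.93 * n_per_class)   # hypothetical true negatives
fp = n_per_class - tn            # hypothetical false positives

acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy = {acc:.3f}, MCC = {mcc:.3f}")
```

On a balanced set with equal sensitivity and specificity, the Matthews coefficient reduces to 2 × accuracy − 1, which is why an accuracy of 0.93 would go together with a coefficient near 0.86 under these assumptions.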
-
Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence
PLOS GENETICS
2011; 7 (9)
Abstract
Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation. Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of genomes from a nuclear family with history of familial thrombophilia. We demonstrate that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites to the lowest median resolution demonstrated to date (< 1,000 base pairs). We use family inheritance state analysis to control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific, family-based approaches to interpretation of genetic variation are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.
View details for DOI 10.1371/journal.pgen.1002280
View details for PubMedID 21935354
-
Fast Flexible Modeling of RNA Structure Using Internal Coordinates
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
2011; 8 (5): 1247-1257
Abstract
Modeling the structure and dynamics of large macromolecules remains a critical challenge. Molecular dynamics (MD) simulations are expensive because they model every atom independently, and are difficult to combine with experimentally derived knowledge. Assembly of molecules using fragments from libraries relies on the database of known structures and thus may not work for novel motifs. Coarse-grained modeling methods have yielded good results on large molecules but can suffer from difficulties in creating more detailed full atomic realizations. There is therefore a need for molecular modeling algorithms that remain chemically accurate and economical for large molecules, do not rely on fragment libraries, and can incorporate experimental information. RNABuilder works in the internal coordinate space of dihedral angles and thus has time requirements proportional to the number of moving parts rather than the number of atoms. It provides accurate physics-based response to applied forces, but also allows user-specified forces for incorporating experimental information. A particular strength of RNABuilder is that all Leontis-Westhof basepairs can be specified as primitives by the user to be satisfied during model construction. We apply RNABuilder to predict the structure of an RNA molecule with 160 bases from its secondary structure, as well as experimental information. Our model matches the known structure to 10.2 Angstroms RMSD and has low computational expense.
View details for DOI 10.1109/TCBB.2010.104
View details for Web of Science ID 000292681800008
View details for PubMedID 21778523
-
PharmGKB summary: very important pharmacogene information for PTGS2
PHARMACOGENETICS AND GENOMICS
2011; 21 (9): 607-613
View details for DOI 10.1097/FPC.0b013e3283415515
View details for Web of Science ID 000293731200012
View details for PubMedID 21063235
View details for PubMedCentralID PMC3141084
-
CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms
BIOINFORMATICS
2011; 27 (16): 2322-2323
Abstract
Data clustering techniques are an essential component of a good data analysis toolbox. Many current bioinformatics applications are inherently compute-intense and work with very large datasets. Sequential algorithms are inadequate for providing the necessary performance. For this reason, we have created Clustering Algorithms for Massively Parallel Architectures, Including GPU Nodes (CAMPAIGN), a central resource for data clustering algorithms and tools that are implemented specifically for execution on massively parallel processing architectures. CAMPAIGN is a library of data clustering algorithms and tools, written in 'C for CUDA' for Nvidia GPUs. The library provides up to two orders of magnitude speed-up over respective CPU-based clustering algorithms and is intended as an open-source resource. New modules from the community will be accepted into the library, and its layout is such that it can easily be extended to promising future platforms such as OpenCL. Releases of the CAMPAIGN library are freely available for download under the LGPL from https://simtk.org/home/campaign. Source code can also be obtained through anonymous subversion access as described on https://simtk.org/scm/?group_id=453. Contact: kjk33@cantab.net
View details for DOI 10.1093/bioinformatics/btr386
View details for Web of Science ID 000293620800028
View details for PubMedID 21712246
View details for PubMedCentralID PMC3150041
-
Cooperative transcription factor associations discovered using regulatory variation
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2011; 108 (32): 13353-13358
Abstract
Regulation of gene expression at the transcriptional level is achieved by complex interactions of transcription factors operating at their target genes. Dissecting the specific combination of factors that bind each target is a significant challenge. Here, we describe in detail the Allele Binding Cooperativity test, which uses variation in transcription factor binding among individuals to discover combinations of factors and their targets. We developed the ALPHABIT (a large-scale process to hunt for allele binding interacting transcription factors) pipeline, which includes statistical analysis of binding sites followed by experimental validation, and demonstrate that this method predicts transcription factors that associate with NFκB. Our method successfully identifies factors that have been known to work with NFκB (E2A, STAT1, IRF2), but whose global coassociation and sites of cooperative action were not known. In addition, we identify a unique coassociation (EBF1) that had not been reported previously. We present a general approach for discovering combinatorial models of regulation and advance our understanding of the genetic basis of variation in transcription factor binding.
View details for DOI 10.1073/pnas.1103105108
View details for Web of Science ID 000293691400076
View details for PubMedID 21828005
View details for PubMedCentralID PMC3156166
-
Platelet aggregation pathway
PHARMACOGENETICS AND GENOMICS
2011; 21 (8): 516-521
View details for DOI 10.1097/FPC.0b013e3283406323
View details for Web of Science ID 000292634200009
View details for PubMedID 20938371
View details for PubMedCentralID PMC3134593
-
RNA molecules with conserved catalytic cores but variable peripheries fold along unique energetically optimized pathways
RNA-A PUBLICATION OF THE RNA SOCIETY
2011; 17 (8): 1589-1603
Abstract
Functional and kinetic constraints must be efficiently balanced during the folding process of all biopolymers. To understand how homologous RNA molecules with different global architectures fold into a common core structure we determined, under identical conditions, the folding mechanisms of three phylogenetically divergent group I intron ribozymes. These ribozymes share a conserved functional core defined by topologically equivalent tertiary motifs but differ in their primary sequence, size, and structural complexity. Time-resolved hydroxyl radical probing of the backbone solvent accessible surface and catalytic activity measurements integrated with structural-kinetic modeling reveal that each ribozyme adopts a unique strategy to attain the conserved functional fold. The folding rates are not dictated by the size or the overall structural complexity, but rather by the strength of the constituent tertiary motifs which, in turn, govern the structure, stability, and lifetime of the folding intermediates. A fundamental general principle of RNA folding emerges from this study: The dominant folding flux always proceeds through an optimally structured kinetic intermediate that has sufficient stability to act as a nucleating scaffold while retaining enough conformational freedom to avoid kinetic trapping. Our results also suggest a potential role of naturally selected peripheral A-minor interactions in balancing RNA structural stability with folding efficiency.
View details for DOI 10.1261/rna.2694811
View details for Web of Science ID 000292843000016
View details for PubMedID 21712400
View details for PubMedCentralID PMC3153981
-
Improving the prediction of disease-related variants using protein three-dimensional structure
European Conference on Computational Biology (ECCB)/Workshop on Annotation, Interpretation and Management of Mutations (AIMM)
BIOMED CENTRAL LTD. 2011
Abstract
Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability. Non-synonymous SNPs occurring in coding regions result in single amino acid polymorphisms (SAPs) that may affect protein function and lead to pathology. Several methods attempt to estimate the impact of SAPs using different sources of information. Although sequence-based predictors have shown good performance, the quality of these predictions can be further improved by introducing new features derived from three-dimensional protein structures. In this paper, we present a structure-based machine learning approach for predicting disease-related SAPs. We have trained a Support Vector Machine (SVM) on a set of 3,342 disease-related mutations and 1,644 neutral polymorphisms from 784 protein chains. We use SVM input features derived from the protein's sequence, structure, and function. After dataset balancing, the structure-based method (SVM-3D) reaches an overall accuracy of 85%, a correlation coefficient of 0.70, and an area under the receiver operating characteristic curve (AUC) of 0.92. When compared with a similar sequence-based predictor, SVM-3D results in an increase of the overall accuracy and AUC by 3%, and correlation coefficient by 0.06. The robustness of this improvement has been tested on different datasets, and in all cases SVM-3D performs better than previously developed methods, even when compared with PolyPhen2, which explicitly takes protein structure information as input. This work demonstrates that structural information can increase the accuracy of disease-related SAP identification. Our results also quantify the magnitude of improvement on a large dataset. This improvement is in agreement with previously observed results, where structure information enhanced the prediction of protein stability changes upon mutation. Although the limited structural information contained in the Protein Data Bank restricts the application and the performance of our structure-based method, we expect that SVM-3D will achieve higher accuracy when more structural data become available.
View details for Web of Science ID 000303930500003
View details for PubMedID 21992054
View details for PubMedCentralID PMC3194195
-
Doxorubicin pathways: pharmacodynamics and adverse effects
PHARMACOGENETICS AND GENOMICS
2011; 21 (7): 440-446
View details for DOI 10.1097/FPC.0b013e32833ffb56
View details for Web of Science ID 000291633300011
View details for PubMedID 21048526
View details for PubMedCentralID PMC3116111
-
Bioinformatics challenges for personalized medicine
BIOINFORMATICS
2011; 27 (13): 1741-1748
Abstract
Widespread availability of low-cost, full genome sequencing will introduce new challenges for bioinformatics. This review outlines recent developments in sequencing technologies and genome analysis methods for application in personalized medicine. New methods are needed in four areas to realize the potential of personalized medicine: (i) processing large-scale robust genomic data; (ii) interpreting the functional effect and the impact of genomic variation; (iii) integrating systems data to relate complex genetic interactions with phenotypes; and (iv) translating these discoveries into medical practice. Contact: russ.altman@stanford.edu
View details for DOI 10.1093/bioinformatics/btr295
View details for Web of Science ID 000291752600050
View details for PubMedID 21596790
View details for PubMedCentralID PMC3117361
-
Detecting Drug Interactions From Adverse-Event Reports: Interaction Between Paroxetine and Pravastatin Increases Blood Glucose Levels
CLINICAL PHARMACOLOGY & THERAPEUTICS
2011; 90 (1): 133-142
Abstract
The lipid-lowering agent pravastatin and the antidepressant paroxetine are among the most widely prescribed drugs in the world. Unexpected interactions between them could have important public health implications. We mined the US Food and Drug Administration's (FDA's) Adverse Event Reporting System (AERS) for side-effect profiles involving glucose homeostasis and found a surprisingly strong signal for comedication with pravastatin and paroxetine. We retrospectively evaluated changes in blood glucose in 104 patients with diabetes and 135 without diabetes who had received comedication with these two drugs, using data in electronic medical record (EMR) systems of three geographically distinct sites. We assessed the mean random blood glucose levels before and after treatment with the drugs. We found that pravastatin and paroxetine, when administered together, had a synergistic effect on blood glucose. The average increase was 19 mg/dl (1.0 mmol/l) overall, and in those with diabetes it was 48 mg/dl (2.7 mmol/l). In contrast, neither drug administered singly was associated with such changes in glucose levels. An increase in glucose levels is not a general effect of combined therapy with selective serotonin reuptake inhibitors (SSRIs) and statins.
View details for DOI 10.1038/clpt.2011.83
View details for Web of Science ID 000291853800023
View details for PubMedID 21613990
View details for PubMedCentralID PMC3216673
-
2010 Translational bioinformatics year in review
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2011; 18 (4): 358-366
Abstract
A review of 2010 research in translational bioinformatics provides much to marvel at. We have seen notable advances in personal genomics, pharmacogenetics, and sequencing. At the same time, the infrastructure for the field has burgeoned. While acknowledging that, according to researchers, the members of this field tend to be overly optimistic, the authors predict a bright future.
View details for DOI 10.1136/amiajnl-2011-000328
View details for Web of Science ID 000292061700004
View details for PubMedID 21672905
View details for PubMedCentralID PMC3128418
-
PharmGKB summary: dopamine receptor D2
PHARMACOGENETICS AND GENOMICS
2011; 21 (6): 350-356
View details for DOI 10.1097/FPC.0b013e32833ee605
View details for Web of Science ID 000290431200007
View details for PubMedID 20736885
View details for PubMedCentralID PMC3091980
-
PharmGKB summary: cytochrome P450, family 2, subfamily J, polypeptide 2: CYP2J2
PHARMACOGENETICS AND GENOMICS
2011; 21 (5): 308-311
View details for DOI 10.1097/FPC.0b013e32833d1011
View details for Web of Science ID 000289460200009
View details for PubMedID 20739908
View details for PubMedCentralID PMC3086341
-
Databases in the Area of Pharmacogenetics
HUMAN MUTATION
2011; 32 (5): 526-531
Abstract
In the area of pharmacogenetics and personalized health care, databases that provide information on the occurrence and consequences of variant genes encoding drug metabolizing enzymes, drug transporters, drug targets, and other proteins of importance for drug response or toxicity are of critical value for scientists, physicians, and industry. The primary outcome of the pharmacogenomic field is the identification of biomarkers that can predict drug toxicity and drug response, thereby individualizing and improving drug treatment of patients. The drug in question and the polymorphic gene exerting the impact are the main issues to be searched for in the databases. Here, we review the databases that provide useful information in this respect, of benefit for the development of the pharmacogenomic field.
View details for DOI 10.1002/humu.21454
View details for Web of Science ID 000289984100006
View details for PubMedID 21309040
View details for PubMedCentralID PMC3352027
-
Remote Thioredoxin Recognition Using Evolutionary Conservation and Structural Dynamics
STRUCTURE
2011; 19 (4): 461-470
Abstract
The thioredoxin family of oxidoreductases plays an important role in redox signaling and control of protein function. Not only are thioredoxins linked to a variety of disorders, but their stable structure has also seen application in protein engineering. Both sequence-based and structure-based tools exist for thioredoxin identification, but remote homolog detection remains a challenge. We developed a thioredoxin predictor using the approach of integrating sequence with structural information. We combined a sequence-based Hidden Markov Model (HMM) with a molecular dynamics enhanced structure-based recognition method (dynamic FEATURE, DF). This hybrid method (HMMDF) has high precision and recall (0.90 and 0.95, respectively) compared with HMM (0.92 and 0.87, respectively) and DF (0.82 and 0.97, respectively). Dynamic FEATURE is sensitive but struggles to resolve closely related protein families, while HMM identifies these evolutionary differences by compromising sensitivity. Our method applied to structural genomics targets makes a strong prediction of a novel thioredoxin.
View details for DOI 10.1016/j.str.2011.02.007
View details for Web of Science ID 000289592600005
View details for PubMedID 21481770
View details for PubMedCentralID PMC3075543
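One way to compare the three precision/recall pairs quoted in this abstract on a single scale is the F1 score (the harmonic mean of precision and recall). The abstract does not report F1; the sketch below is an illustrative calculation on the published precision and recall values only, and the function and variable names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract.
reported = {
    "HMMDF": (0.90, 0.95),  # hybrid method
    "HMM": (0.92, 0.87),    # sequence-based Hidden Markov Model
    "DF": (0.82, 0.97),     # dynamic FEATURE
}

# F1 is a derived summary, not a figure from the paper.
f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}
best = max(f1, key=f1.get)

for name, value in f1.items():
    print(f"{name}: F1 = {value:.3f}")
print(f"highest F1: {best}")
```

Under this summary the hybrid method scores highest, which is consistent with the abstract's description of HMM trading sensitivity for precision and dynamic FEATURE doing the reverse.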
-
PharmGKB summary: fluoropyrimidine pathways
PHARMACOGENETICS AND GENOMICS
2011; 21 (4): 237-242
View details for DOI 10.1097/FPC.0b013e32833c6107
View details for Web of Science ID 000288444500010
View details for PubMedID 20601926
View details for PubMedCentralID PMC3098754
-
Very important pharmacogene summary: ABCB1 (MDR1, P-glycoprotein)
PHARMACOGENETICS AND GENOMICS
2011; 21 (3): 152-161
View details for DOI 10.1097/FPC.0b013e3283385a1c
View details for Web of Science ID 000286971900007
View details for PubMedID 20216335
View details for PubMedCentralID PMC3098758
-
Pharmacogenomics: "Noninferiority" Is Sufficient for Initial Implementation
CLINICAL PHARMACOLOGY & THERAPEUTICS
2011; 89 (3): 348-350
Abstract
Recent clinical annotation of a whole-genome sequence suggests that pharmacogenomics (PGx) may be ready for clinical implementation now. This conclusion rests on the recognition that PGx has greatly mitigated risks as compared with using genomics for assessment of disease risk. Failure to recognize these differences can produce unrealistic cost-benefit scenarios and impractical standards of evidence. In many cases, pharmacogenetic tests need only reach reasonable expectations of noninferiority (compared with current prescribing practices) to merit use.
View details for DOI 10.1038/clpt.2010.310
View details for Web of Science ID 000287439600011
View details for PubMedID 21326263
-
PharmGKB: very important pharmacogene - HMGCR
PHARMACOGENETICS AND GENOMICS
2011; 21 (2): 98-101
View details for DOI 10.1097/FPC.0b013e328336c81b
View details for Web of Science ID 000286096000006
View details for PubMedID 20084049
View details for PubMedCentralID PMC3098759
-
Pharmacogenomics: will the promise be fulfilled?
NATURE REVIEWS GENETICS
2011; 12 (1): 69-73
Abstract
Tools such as genome resequencing and genome-wide association studies have recently been used to uncover a number of variants that affect drug toxicity and efficacy, as well as potential drug targets. But how much closer are we to incorporating pharmacogenomics into routine clinical practice? Five experts discuss how far we have come, and highlight the technological, informatics, educational and practical obstacles that stand in the way of realizing genome-driven medicine.
View details for DOI 10.1038/nrg2920
View details for Web of Science ID 000285410500012
View details for PubMedID 21116304
View details for PubMedCentralID PMC3098748
-
Structural insights into pre-translocation ribosome motions.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2011: 205-211
Abstract
Subsequent to the peptidyl transfer step of the translation elongation cycle, the initially formed pre-translocation ribosome, which we refer to here as R(1), undergoes a ratchet-like intersubunit rotation in order to sample a rotated conformation, referred to here as R(F), that is an obligatory intermediate in the translocation of tRNAs and mRNA through the ribosome during the translocation step of the translation elongation cycle. R(F) and the R(1) to R(F) transition are currently the subject of intense research, driven in part by the potential for developing novel antibiotics which trap R(F) or confound the R(1) to R(F) transition. Currently lacking a 3D atomic structure of the R(F) endpoint of the transition, as well as a preliminary conformational trajectory connecting R(1) and R(F), the dynamics of the mechanistically crucial R(1) to R(F) transition remain elusive. The current literature reports fitting of only a few ribosomal RNA (rRNA) and ribosomal protein (r-protein) components into cryogenic electron microscopy (cryo-EM) reconstructions of the Escherichia coli ribosome in R(F). In this work we now fit the entire Thermus thermophilus 16S and 23S rRNAs and most of the remaining T. thermophilus r-proteins into a cryo-EM reconstruction of the E. coli ribosome in R(F) in order to build an almost complete model of the T. thermophilus ribosome in R(F) thus allowing a more detailed view of this crucial conformation. The resulting model validates key predictions from the published literature; in particular it recovers intersubunit bridges known to be maintained throughout the R(1) to R(F) transition and results in new intersubunit bridges that are predicted to exist only in R(F). In addition, we use a recently reported E. coli ribosome structure, apparently trapped in an intermediate state along the R(1) to R(F) transition pathway, referred to here as R(2), as a guide to generate a T. thermophilus ribosome in the R(2) state.
This demonstrates a multiresolution method for morphing large complexes and provides us with a structural model of R(2) in the species of interest. The generated structural models form the basis for probing the motion of the deacylated tRNA bound at the peptidyl-tRNA binding site (P site) of the pre-translocation ribosome as it moves from its so-called classical P/P configuration to its so-called hybrid P/E configuration as part of the R(1) to R(F) transition. We create a dynamic model of this process which provides structural insights into the functional significance of R(2) as well as detailed atomic information to guide the design of further experiments. The results suggest extensibility to other steps of protein synthesis as well as to spatially larger systems.
View details for PubMedID 21121048
- Improving the prediction of disease-related variants using protein three-dimensional structure. BMC Bioinformatics; 12 Suppl 4: S3. Epub 2011 Jul 5. PMCID: PMC3194195. 2011
- Perspective: 2010 Translational bioinformatics year in review. JAMIA. PMCID: PMC3128418. 2011; 18 (4): 358-366
-
Integration and publication of heterogeneous text-mined relationships on the Semantic Web.
Journal of biomedical semantics
2011; 2: S10-?
Abstract
Advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and the complexity of natural language in expressing similar relationships cause the extracted relationships to be highly heterogeneous, which makes the construction of knowledge bases difficult and poses a challenge in using these for data mining or question answering. We report on the semi-automatic construction of the PHARE relationship ontology (the PHArmacogenomic RElationships Ontology) consisting of 200 curated relations from over 40,000 heterogeneous relationships extracted via text-mining. These heterogeneous relations are then mapped to the PHARE ontology using synonyms, entity descriptions and hierarchies of entities and roles. Once mapped, relationships can be normalized and compared using the structure of the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast the manual procedure with a fully automated approach using WordNet to quantify the degree of integration enabled by iterative curation and refinement of the PHARE ontology. The result of such integration is a repository of normalized biomedical relationships, named PHARE-KB, which can be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network. The PHARE ontology serves as a common semantic framework to integrate more than 40,000 relationships pertinent to pharmacogenomics. The PHARE ontology forms the foundation of a knowledge base named PHARE-KB. Once populated with relationships, PHARE-KB (i) can be visualized in the form of a biological network to guide human tasks such as database curation and (ii) can be queried programmatically to guide bioinformatics applications such as the prediction of molecular interactions. PHARE is available at http://purl.bioontology.org/ontology/PHARE.
View details for DOI 10.1186/2041-1480-2-S2-S10
View details for PubMedID 21624156
View details for PubMedCentralID PMC3102890
-
Bisphosphonates pathway
PHARMACOGENETICS AND GENOMICS
2011; 21 (1): 50-53
View details for DOI 10.1097/FPC.0b013e328335729c
View details for Web of Science ID 000285331700007
View details for PubMedID 20023594
View details for PubMedCentralID PMC3086066
-
Content-based microarray search using differential expression profiles
BMC BIOINFORMATICS
2010; 11
Abstract
With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations. We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3. Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.
View details for DOI 10.1186/1471-2105-11-603
View details for Web of Science ID 000286192100001
View details for PubMedID 21172034
View details for PubMedCentralID PMC3022631
-
KCNH2 pharmacogenomics summary
PHARMACOGENETICS AND GENOMICS
2010; 20 (12): 775-777
View details for DOI 10.1097/FPC.0b013e3283349e9c
View details for Web of Science ID 000284148300006
View details for PubMedID 20150828
View details for PubMedCentralID PMC3086352
-
Independent component analysis: Mining microarray data for fundamental human gene expression modules
JOURNAL OF BIOMEDICAL INFORMATICS
2010; 43 (6): 932-944
Abstract
As public microarray repositories rapidly accumulate gene expression data, these resources contain increasingly valuable information about cellular processes in human biology. This presents a unique opportunity for intelligent data mining methods to extract information about the transcriptional modules underlying these biological processes. Modeling cellular gene expression as a combination of functional modules, we use independent component analysis (ICA) to derive 423 fundamental components of human biology from a 9395-array compendium of heterogeneous expression data. Annotation using the Gene Ontology (GO) suggests that while some of these components represent known biological modules, others may describe biology not well characterized by existing manually-curated ontologies. In order to understand the biological functions represented by these modules, we investigate the mechanism of the preclinical anti-cancer drug parthenolide (PTL) by analyzing the differential expression of our fundamental components. Our method correctly identifies known pathways and predicts that N-glycan biosynthesis and T-cell receptor signaling may contribute to PTL response. The fundamental gene modules we describe have the potential to provide pathway-level insight into new gene expression datasets.
View details for DOI 10.1016/j.jbi.2010.07.001
View details for Web of Science ID 000285036700009
View details for PubMedID 20619355
View details for PubMedCentralID PMC2991480
-
Using text to build semantic networks for pharmacogenomics
JOURNAL OF BIOMEDICAL INFORMATICS
2010; 43 (6): 1009-1019
Abstract
Most pharmacogenomics knowledge is contained in the text of published studies, and is thus not available for automated computation. Natural Language Processing (NLP) techniques for extracting relationships in specific domains often rely on hand-built rules and domain-specific ontologies to achieve good performance. In a new and evolving field such as pharmacogenomics (PGx), rules and ontologies may not be available. Recent progress in syntactic NLP parsing in the context of a large corpus of pharmacogenomics text provides new opportunities for automated relationship extraction. We describe an ontology of PGx relationships built starting from a lexicon of key pharmacogenomic entities and a syntactic parse of more than 87 million sentences from 17 million MEDLINE abstracts. We used the syntactic structure of PGx statements to systematically extract commonly occurring relationships and to map them to a common schema. Our extracted relationships have a precision of 70-87.7% and involve not only key PGx entities such as genes, drugs, and phenotypes (e.g., VKORC1, warfarin, clotting disorder), but also critical entities that are frequently modified by these key entities (e.g., VKORC1 polymorphism, warfarin response, clotting disorder treatment). The result of our analysis is a network of 40,000 relationships between more than 200 entity types with clear semantics. This network is used to guide the curation of PGx knowledge and provide a computable resource for knowledge discovery.
View details for DOI 10.1016/j.jbi.2010.08.005
View details for Web of Science ID 000285036700017
View details for PubMedID 20723615
View details for PubMedCentralID PMC2991587
-
SLC19A1 pharmacogenomics summary
PHARMACOGENETICS AND GENOMICS
2010; 20 (11): 708-715
View details for DOI 10.1097/FPC.0b013e32833eca92
View details for Web of Science ID 000282965100006
View details for PubMedID 20811316
View details for PubMedCentralID PMC2956130
-
An integrative method for scoring candidate genes from association studies: application to warfarin dosing
AMIA Summit on Translational Bioinformatics
BIOMED CENTRAL LTD. 2010
Abstract
A key challenge in pharmacogenomics is the identification of genes whose variants contribute to drug response phenotypes, which can include severe adverse effects. Pharmacogenomics GWAS attempt to elucidate genotypes predictive of drug response. However, the size of these studies has severely limited their power and potential application. We propose a novel knowledge integration and SNP aggregation approach for identifying genes impacting drug response. Our SNP aggregation method characterizes the degree to which uncommon alleles of a gene are associated with drug response. We first use pre-existing knowledge sources to rank pharmacogenes by their likelihood to affect drug response. We then define a summary score for each gene based on allele frequencies and train linear and logistic regression classifiers to predict drug response phenotypes. We applied our method to a published warfarin GWAS data set comprising 181 individuals. We find that our method can increase the power of the GWAS to identify both VKORC1 and CYP2C9 as warfarin pharmacogenes, where the original analysis had only identified VKORC1. Additionally, we find that our method can be used to discriminate between low-dose (AUROC=0.886) and high-dose (AUROC=0.764) responders. Our method offers a new route for candidate pharmacogene discovery from pharmacogenomics GWAS, and serves as a foundation for future work in methods for predictive pharmacogenomics.
View details for Web of Science ID 000290218700009
View details for PubMedID 21044367
View details for PubMedCentralID PMC2967750
-
The utility of general purpose versus specialty clinical databases for research: Warfarin dose estimation from extracted clinical variables
JOURNAL OF BIOMEDICAL INFORMATICS
2010; 43 (5): 747-751
Abstract
There is debate about the utility of clinical data warehouses for research. Using a clinical warfarin dosing algorithm derived from research-quality data, we evaluated the data quality of both a general-purpose database and a coagulation-specific database. We evaluated the functional utility of these repositories by using data extracted from them to predict warfarin dose. We reasoned that high-quality clinical data would predict doses nearly as accurately as research data, while poor-quality clinical data would predict doses less accurately. We evaluated the Mean Absolute Error (MAE) in predicted weekly dose as a metric of data quality. The MAE was comparable between the clinical gold standard (10.1 mg/wk) and the specialty database (10.4 mg/wk), but the MAE for the clinical warehouse was 40% greater (14.1 mg/wk). Our results indicate that the research utility of clinical data collected in focused clinical settings is greater than that of data collected during general-purpose clinical care.
View details for DOI 10.1016/j.jbi.2010.03.014
View details for Web of Science ID 000281927200010
View details for PubMedID 20363365
View details for PubMedCentralID PMC2928873
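The mean absolute error used as the data-quality metric in the abstract above can be illustrated with a minimal sketch. The function name and the dose values are hypothetical and chosen only for illustration; they are not study data. The final lines reproduce the "40% greater" comparison from the two MAE values the abstract reports.

```python
def mean_absolute_error(predicted, actual):
    """Mean absolute difference between predicted and actual weekly doses (mg/wk)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical weekly doses in mg/wk (illustrative only, not study data).
predicted_doses = [35.0, 28.0, 42.0, 21.0]
actual_doses = [30.0, 35.0, 40.0, 25.0]
mae = mean_absolute_error(predicted_doses, actual_doses)

# Relative increase in MAE, using the two values reported in the abstract.
gold_standard_mae = 10.1  # mg/wk, clinical gold standard
warehouse_mae = 14.1      # mg/wk, general-purpose clinical warehouse
relative_increase = (warehouse_mae - gold_standard_mae) / gold_standard_mae
print(f"MAE = {mae:.2f} mg/wk; warehouse MAE is {relative_increase:.0%} greater")
```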
-
VKORC1 Pharmacogenomics Summary
PHARMACOGENETICS AND GENOMICS
2010; 20 (10): 642-644
View details for DOI 10.1097/FPC.0b013e32833433b6
View details for Web of Science ID 000281830900010
View details for PubMedID 19940803
View details for PubMedCentralID PMC3086043
-
Recent progress in automatically extracting information from the pharmacogenomic literature
PHARMACOGENOMICS
2010; 11 (10): 1467-1489
Abstract
The biomedical literature holds our understanding of pharmacogenomics, but it is dispersed across many journals. In order to integrate our knowledge, connect important facts across publications and generate new hypotheses we must organize and encode the contents of the literature. By creating databases of structured pharmacogenomic knowledge, we can make the value of the literature much greater than the sum of the individual reports. We can, for example, generate candidate gene lists or interpret surprising hits in genome-wide association studies. Text mining automatically adds structure to the unstructured knowledge embedded in millions of publications, and recent years have seen a surge in work on biomedical text mining, some specific to pharmacogenomics literature. These methods enable extraction of specific types of information and can also provide answers to general, systemic queries. In this article, we describe the main tasks of text mining in the context of pharmacogenomics, summarize recent applications and anticipate the next phase of text mining applications.
View details for DOI 10.2217/PGS.10.136
View details for Web of Science ID 000284199500014
View details for PubMedID 21047206
View details for PubMedCentralID PMC3035632
-
Thiopurine pathway
PHARMACOGENETICS AND GENOMICS
2010; 20 (9): 573-574
View details for DOI 10.1097/FPC.0b013e328334338f
View details for Web of Science ID 000281295500008
View details for PubMedID 19952870
View details for PubMedCentralID PMC3098750
-
Turning limited experimental information into 3D models of RNA
RNA-A PUBLICATION OF THE RNA SOCIETY
2010; 16 (9): 1769-1778
Abstract
Our understanding of RNA functions in the cell is evolving rapidly. As for proteins, the detailed three-dimensional (3D) structure of RNA is often key to understanding its function. Although crystallography and nuclear magnetic resonance (NMR) can determine the atomic coordinates of some RNA structures, many 3D structures present technical challenges that make these methods difficult to apply. The great flexibility of RNA, its charged backbone, dearth of specific surface features, and propensity for kinetic traps all conspire with its long folding time to challenge in silico methods for physics-based folding. On the other hand, base-pairing interactions (either in runs to form helices or isolated tertiary contacts) and motifs are often available from relatively low-cost experiments or informatics analyses. We present RNABuilder, a novel code that uses internal coordinate mechanics to satisfy user-specified base pairing and steric forces under chemical constraints. The code recapitulates the topology and characteristic L-shape of tRNA and obtains an accurate noncrystallographic structure of the Tetrahymena ribozyme P4/P6 domain. The algorithm scales nearly linearly with molecule size, opening the door to the modeling of significantly larger structures.
View details for DOI 10.1261/rna.2112110
View details for Web of Science ID 000281003900006
View details for PubMedID 20651028
View details for PubMedCentralID PMC2924536
-
Maternal-fetal and neonatal pharmacogenomics: a review of current literature
JOURNAL OF PERINATOLOGY
2010; 30 (9): 571-579
Abstract
Pharmacogenomics, the study of specific genetic variations and their effect on drug response, will likely give rise to many applications in maternal-fetal and neonatal medicine; yet, an understanding of these applications in the field of obstetrics and gynecology and neonatal pediatrics is not widespread. This review describes the underpinnings of the field of pharmacogenomics and summarizes the current pharmacogenomic inquiries in relation to maternal-fetal medicine, including studies on various fetal and neonatal genetic cytochrome P450 (CYP) enzyme variants and their role in drug toxicities (for example, codeine metabolism, sepsis and selective serotonin reuptake inhibitor (SSRI) toxicity). Potential future directions, including alternative drug classification, improvements in drug efficacy and non-invasive pharmacogenomic testing, will also be explored.
View details for DOI 10.1038/jp.2009.183
View details for Web of Science ID 000281388500002
View details for PubMedID 19924131
View details for PubMedCentralID PMC3098749
-
PharmGKB summary: very important pharmacogene information for CYP2B6
PHARMACOGENETICS AND GENOMICS
2010; 20 (8): 520-523
View details for DOI 10.1097/FPC.0b013e32833947c2
View details for Web of Science ID 000279865400007
View details for PubMedID 20648701
View details for PubMedCentralID PMC3086041
-
Extending and evaluating a warfarin dosing algorithm that includes CYP4F2 and pooled rare variants of CYP2C9
PHARMACOGENETICS AND GENOMICS
2010; 20 (7): 407-413
Abstract
Warfarin dosing remains challenging because of its narrow therapeutic window and large variability in dose response. We sought to analyze new factors involved in its dosing and to evaluate eight dosing algorithms, including two developed by the International Warfarin Pharmacogenetics Consortium (IWPC). We enrolled 108 patients on chronic warfarin therapy and obtained complete clinical and pharmacy records; we genotyped single nucleotide polymorphisms relevant to the VKORC1, CYP2C9, and CYP4F2 genes using integrated fluidic circuits made by Fluidigm. When applying the IWPC pharmacogenetic algorithm to our cohort of patients, the percentage of patients within 1 mg/d of the therapeutic warfarin dose increases from 54% to 63% using clinical factors only, or from 38% using a fixed-dose approach. CYP4F2 adds 4% to the fraction of the variability in dose (R²) explained by the IWPC pharmacogenetic algorithm (P<0.05). Importantly, we show that pooling rare variants substantially increases the R² for CYP2C9 (rare variants: P=0.0065, R²=6%; common variants: P=0.0034, R²=7%; rare and common variants: P=0.00018, R²=12%), indicating that relatively rare variants not genotyped in genome-wide association studies may be important. In addition, the IWPC pharmacogenetic algorithm and the Gage (2008) algorithm perform best (IWPC: R²=50%; Gage: R²=49%), and all pharmacogenetic algorithms outperform the IWPC clinical equation (R²=22%). VKORC1 and CYP2C9 genotypes did not affect long-term variability in dose. Finally, the Fluidigm platform, a novel warfarin genotyping method, showed 99.65% concordance between different operators and instruments. CYP4F2 and pooled rare variants of CYP2C9 significantly improve the ability to estimate warfarin dose.
View details for DOI 10.1097/FPC.0b013e328338bac2
View details for Web of Science ID 000278879400001
View details for PubMedID 20442691
View details for PubMedCentralID PMC3098751
-
Clopidogrel pathway
PHARMACOGENETICS AND GENOMICS
2010; 20 (7): 463-465
View details for DOI 10.1097/FPC.0b013e3283385420
View details for Web of Science ID 000278879400009
View details for PubMedID 20440227
View details for PubMedCentralID PMC3086847
-
Very important pharmacogene summary: thiopurine S-methyltransferase
PHARMACOGENETICS AND GENOMICS
2010; 20 (6): 401-405
View details for DOI 10.1097/FPC.0b013e3283352860
View details for Web of Science ID 000277594800007
View details for PubMedID 20154640
View details for PubMedCentralID PMC3086840
-
Clinical implementation of pharmacogenomics: overcoming genetic exceptionalism
LANCET ONCOLOGY
2010; 11 (6): 507-509
View details for DOI 10.1016/S1470-2045(10)70097-8
View details for Web of Science ID 000279019500008
View details for PubMedID 20413348
-
Challenges in the clinical application of whole-genome sequencing
LANCET
2010; 375 (9727): 1749-1751
View details for DOI 10.1016/S0140-6736(10)60599-5
View details for Web of Science ID 000277890200036
View details for PubMedID 20434765
-
Warfarin pharmacogenetics: a single VKORC1 polymorphism is predictive of dose across 3 racial groups
BLOOD
2010; 115 (18): 3827-3834
Abstract
Warfarin-dosing algorithms incorporating CYP2C9 and VKORC1 -1639G>A improve dose prediction compared with algorithms based solely on clinical and demographic factors. However, these algorithms better capture dose variability among whites than Asians or blacks. Herein, we evaluate whether other VKORC1 polymorphisms and haplotypes explain additional variation in warfarin dose beyond that explained by VKORC1 -1639G>A among Asians (n = 1103), blacks (n = 670), and whites (n = 3113). Participants were recruited from 11 countries as part of the International Warfarin Pharmacogenetics Consortium effort. Evaluation of the effects of individual VKORC1 single nucleotide polymorphisms (SNPs) and haplotypes on warfarin dose used both univariate and multivariable linear regression. VKORC1 -1639G>A and 1173C>T individually explained the greatest variance in dose in all 3 racial groups. Incorporation of additional VKORC1 SNPs or haplotypes did not further improve dose prediction. VKORC1 explained greater variability in dose among whites than blacks and Asians. Differences in the percentage of variance in dose explained by VKORC1 across race were largely accounted for by the frequency of the -1639A (or 1173T) allele. Thus, clinicians should recognize that, although at a population level the contribution of VKORC1 toward dose requirements is higher in whites than in nonwhites, genotype predicts similar dose requirements across racial groups.
View details for DOI 10.1182/blood-2009-12-255992
View details for Web of Science ID 000277335900027
View details for PubMedID 20203262
View details for PubMedCentralID PMC2865873
-
Clinical assessment incorporating a personal genome
LANCET
2010; 375 (9725): 1525-1535
Abstract
The cost of genomic information has fallen steeply, but the clinical translation of genetic risk estimates remains unclear. We aimed to undertake an integrated analysis of a complete human genome in a clinical context. We assessed a patient with a family history of vascular disease and early sudden death. Clinical assessment included analysis of this patient's full genome sequence, risk prediction for coronary artery disease, screening for causes of sudden cardiac death, and genetic counselling. Genetic analysis included the development of novel methods for the integration of whole genome and clinical risk. Disease and risk analysis focused on prediction of genetic risk of variants associated with mendelian disease, recognised drug responses, and pathogenicity for novel variants. We queried disease-specific mutation databases and pharmacogenomics databases to identify genes and mutations with known associations with disease and drug response. We estimated post-test probabilities of disease by applying likelihood ratios derived from integration of multiple common variants to age-appropriate and sex-appropriate pre-test probabilities. We also accounted for gene-environment interactions and conditionally dependent risks. Analysis of 2.6 million single nucleotide polymorphisms and 752 copy number variations showed increased genetic risk for myocardial infarction, type 2 diabetes, and some cancers. We discovered rare variants in three genes that are clinically associated with sudden cardiac death: TMEM43, DSP, and MYBPC3. A variant in LPA was consistent with a family history of coronary artery disease. The patient had a heterozygous null mutation in CYP2C19 suggesting probable clopidogrel resistance, several variants associated with a positive response to lipid-lowering therapy, and variants in CYP4F2 and VKORC1 that suggest he might have a low initial dosing requirement for warfarin. Many variants of uncertain importance were reported. Although challenges remain, our results suggest that whole-genome sequencing can yield useful and clinically relevant information for individual patients. Funding: National Institute of General Medical Sciences; National Heart, Lung And Blood Institute; National Human Genome Research Institute; Howard Hughes Medical Institute; National Library of Medicine, Lucile Packard Foundation for Children's Health; Hewlett Packard Foundation; Breetwor Family Foundation.
View details for Web of Science ID 000277655100025
View details for PubMedID 20435227
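The abstract above describes estimating post-test probabilities by applying likelihood ratios to pre-test probabilities. A minimal sketch of that standard odds-based calculation follows. It assumes the likelihood ratios are independent and can simply be multiplied, which is a simplification (the abstract states that the study also accounted for conditionally dependent risks). The function name and all numbers are hypothetical.

```python
def post_test_probability(pre_test_probability, likelihood_ratios):
    """Convert a pre-test probability to a post-test probability.

    Assumes independent likelihood ratios (an illustrative simplification).
    """
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must be strictly between 0 and 1")
    # Probability -> odds.
    odds = pre_test_probability / (1.0 - pre_test_probability)
    # Each likelihood ratio scales the odds.
    for ratio in likelihood_ratios:
        odds *= ratio
    # Odds -> probability.
    return odds / (1.0 + odds)


# Hypothetical example: 20% pre-test probability and two variant-derived ratios.
example_probability = post_test_probability(0.20, [2.0, 1.5])
print(f"post-test probability: {example_probability:.3f}")
```

A likelihood ratio above 1 raises the probability and one below 1 lowers it; an empty list leaves the pre-test probability unchanged.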
-
Vascular endothelial growth factor pathway
PHARMACOGENETICS AND GENOMICS
2010; 20 (5): 346-349
View details for DOI 10.1097/FPC.0b013e3283364ed7
View details for Web of Science ID 000276704800009
View details for PubMedID 20124951
View details for PubMedCentralID PMC3086058
-
Cytochrome P450 2C9-CYP2C9
PHARMACOGENETICS AND GENOMICS
2010; 20 (4): 277-281
View details for DOI 10.1097/FPC.0b013e3283349e84
View details for Web of Science ID 000276373800008
View details for PubMedID 20150829
View details for PubMedCentralID PMC3201766
-
Teaching computers to read the pharmacogenomics literature ... so you don't have to
PHARMACOGENOMICS
2010; 11 (4): 515-518
View details for DOI 10.2217/PGS.10.48
View details for Web of Science ID 000276769300010
View details for PubMedID 20350132
View details for PubMedCentralID PMC3478760
-
Pharmacogenomics and bioinformatics: PharmGKB
PHARMACOGENOMICS
2010; 11 (4): 501-505
Abstract
The NIH initiated the PharmGKB in April 2000. The primary mission was to create a repository of primary data, tools to track associations between genes and drugs, and to catalog the location and frequency of genetic variations known to impact drug response. Over the past 10 years, new technologies have shifted research from candidate gene pharmacogenetics to phenotype-based pharmacogenomics with a consequent explosion of data. PharmGKB has refocused on curating knowledge rather than housing primary genotype and phenotype data, and now, captures more complex relationships between genes, variants, drugs, diseases and pathways. Going forward, the challenges are to provide the tools and knowledge to plan and interpret genome-wide pharmacogenomics studies, predict gene-drug relationships based on shared mechanisms and support data-sharing consortia investigating clinical applications of pharmacogenomics.
View details for DOI 10.2217/PGS.10.15
View details for Web of Science ID 000276769300008
View details for PubMedID 20350130
View details for PubMedCentralID PMC3098752
-
DNATwist: A Web-Based Tool for Teaching Middle and High School Students About Pharmacogenomics
CLINICAL PHARMACOLOGY & THERAPEUTICS
2010; 87 (4): 393-395
Abstract
DNATwist is a Web-based learning tool (available at http://www.dnatwist.org) that explains pharmacogenomics concepts to middle- and high-school students. Its features include (i) a focus on drug responses of interest to teenagers (e.g., alcohol intolerance), (ii) reusable graphical interfaces that reduce extension costs, and (iii) explanations of molecular and cellular drug responses. In testing, students found the tool and topic understandable and engaging. The tool is being modified for use at the Tech Museum of Innovation in California.
View details for DOI 10.1038/clpt.2009.303
View details for Web of Science ID 000276506900009
View details for PubMedID 20305671
View details for PubMedCentralID PMC3098756
-
Using Pre-existing Microarray Datasets to Increase Experimental Power: Application to Insulin Resistance
PLOS COMPUTATIONAL BIOLOGY
2010; 6 (3)
Abstract
Although they have become a widely used experimental technique for identifying differentially expressed (DE) genes, DNA microarrays are notorious for generating noisy data. A common strategy for mitigating the effects of noise is to perform many experimental replicates. This approach is often costly and sometimes impossible given limited resources; thus, analytical methods are needed which increase accuracy at no additional cost. One inexpensive source of microarray replicates comes from prior work: to date, data from hundreds of thousands of microarray experiments are in the public domain. Although these data assay a wide range of conditions, they cannot be used directly to inform any particular experiment and are thus ignored by most DE gene methods. We present the SVD Augmented Gene expression Analysis Tool (SAGAT), a mathematically principled, data-driven approach for identifying DE genes. SAGAT increases the power of a microarray experiment by using observed coexpression relationships from publicly available microarray datasets to reduce uncertainty in individual genes' expression measurements. We tested the method on three well-replicated human microarray datasets and demonstrate that use of SAGAT increased effective sample sizes by as many as 2.72 arrays. We applied SAGAT to unpublished data from a microarray study investigating transcriptional responses to insulin resistance, resulting in a 50% increase in the number of significant genes detected. We evaluated 11 (58%) of these genes experimentally using qPCR, confirming the directions of expression change for all 11 and statistical significance for three. Use of SAGAT revealed coherent biological changes in three pathways: inflammation, differentiation, and fatty acid synthesis, furthering our molecular understanding of a type 2 diabetes risk factor. 
We envision SAGAT as a means to maximize the potential for biological discovery from subtle transcriptional responses, and we provide it as a freely available software package that is immediately applicable to any human microarray study.
View details for DOI 10.1371/journal.pcbi.1000718
View details for Web of Science ID 000278125200026
View details for PubMedID 20361040
View details for PubMedCentralID PMC2845644
-
PharmGKB very important pharmacogene: SLCO1B1
PHARMACOGENETICS AND GENOMICS
2010; 20 (3): 211-216
View details for DOI 10.1097/FPC.0b013e328333b99c
View details for Web of Science ID 000275061200007
View details for PubMedID 19952871
View details for PubMedCentralID PMC3086841
-
Identification of recurring protein structure microenvironments and discovery of novel functional sites around CYS residues
BMC STRUCTURAL BIOLOGY
2010; 10
Abstract
The emergence of structural genomics presents significant challenges in the annotation of biologically uncharacterized proteins. Unfortunately, our ability to analyze these proteins is restricted by the limited catalog of known molecular functions and their associated 3D motifs. In order to identify novel 3D motifs that may be associated with molecular functions, we employ an unsupervised, two-phase clustering approach that combines k-means and hierarchical clustering with knowledge-informed cluster selection and annotation methods. We applied the approach to approximately 20,000 cysteine-based protein microenvironments (3D regions 7.5 Å in radius) and identified 70 interesting clusters, some of which represent known motifs (e.g. metal binding and phosphatase activity), and some of which are novel, including several zinc binding sites. Detailed annotation results are available online for all 70 clusters at http://feature.stanford.edu/clustering/cys. The use of microenvironments instead of backbone geometric criteria enables flexible exploration of protein function space, and detection of recurring motifs that are discontinuous in sequence and diverse in structure. Clustering microenvironments may thus help to functionally characterize novel proteins and better understand the protein structure-function relationship.
View details for DOI 10.1186/1472-6807-10-4
View details for Web of Science ID 000275410900001
View details for PubMedID 20122268
View details for PubMedCentralID PMC2833161
-
PharmGKB summary: very important pharmacogene information for angiotensin-converting enzyme
PHARMACOGENETICS AND GENOMICS
2010; 20 (2): 143-146
View details for DOI 10.1097/FPC.0b013e3283339bf3
View details for Web of Science ID 000274306700011
View details for PubMedID 19898265
View details for PubMedCentralID PMC3098760
-
Extraction of genotype-phenotype-drug relationships from text: from entity recognition to bioinformatics application.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2010: 485-487
Abstract
Advances in concept recognition and natural language parsing have led to the development of various tools that enable the identification of biomedical entities and relationships between them in text. The aim of the Genotype-Phenotype-Drug Relationship Extraction from Text workshop (or GPD-Rx workshop) is to examine the current state of the art and discuss the next steps for making the extraction of relationships between biomedical entities integral to the curation and knowledge management workflow in Pharmacogenomics. The workshop will focus particularly on the extraction of Genotype-Phenotype, Genotype-Drug, and Phenotype-Drug relationships that are of interest to Pharmacogenomics. Extracting and structuring such text-mined relationships is key to supporting the evaluation and validation of multiple hypotheses that emerge from high-throughput translational studies spanning multiple measurement modalities. In order to advance this agenda, it is essential that existing relationship extraction methods be compared to one another and that a community-wide benchmark corpus emerges, against which future methods can be compared. The workshop aims to bring together researchers working on the automatic or semi-automatic extraction of relationships between biomedical entities from research literature in order to identify the key groups interested in creating such a benchmark.
View details for PubMedID 19904832
- Predicting RNA structure by multiple template homology modeling. edited by Altman, R., Dunker, K., Hunter, L. 2010
- Improving the prediction of pharmacogenes using text-derived drug-gene relationships. edited by Altman, R., Dunker, K., Hunter, L. 2010
- Proceedings of Pacific Symposium on Biocomputing 2010. edited by Altman, R., Dunker, K., Hunter, L. 2010
- Extraction of genotype-phenotype-drug relationships from text: from entity recognition to bioinformatics application. edited by Altman, R., Dunker, K., Hunter, L. 2010
- An integrative method for scoring candidate genes from association studies: application to warfarin dosing. BMC Bioinformatics., 11 Suppl 9:S9. PMCID: PMC2967750. 2010
-
Predicting RNA structure by multiple template homology modeling.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2010: 216-227
Abstract
Despite the importance of 3D structure to understand the myriad functions of RNAs in cells, most RNA molecules remain out of reach of crystallographic and NMR methods. However, certain structural information such as base pairing and some tertiary contacts can be determined readily for many RNAs by bioinformatics or relatively low cost experiments. Further, because RNA structure is highly modular, it is possible to deduce local 3D structure from the solved structures of evolutionarily related RNAs or even unrelated RNAs that share the same module. RNABuilder is a software package that generates model RNA structures by treating the kinematics and forces at separate, multiple levels of resolution. Kinematically, bonds in bases, certain stretches of residues, and some entire molecules are rigid while other bonds remain flexible. Forces act on the rigid bases and selected individual atoms. Here we use RNABuilder to predict the structure of the 200-nucleotide Azoarcus group I intron by homology modeling against fragments of the distantly-related Twort and Tetrahymena group I introns and by incorporating base pairing forces where necessary. In the absence of any information from the solved Azoarcus intron crystal structure, the model accurately depicts the global topology, secondary and tertiary connections, and gives an overall RMSD value of 4.6 Å relative to the crystal structure. The accuracy of the model is even higher in the intron core (RMSD = 3.5 Å), whereas deviations are modestly larger for peripheral regions that differ more substantially between the different introns. These results lay the groundwork for using this approach for larger and more diverse group I introns, as well as for still larger RNAs and RNA-protein complexes such as group II introns and the ribosomal subunits.
View details for PubMedID 19908374
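The RMSD values quoted in the abstract above measure how far model atoms lie from the corresponding crystal-structure atoms. A minimal sketch of the calculation follows, assuming the two coordinate sets are already superimposed and list matching atoms in the same order. Real structure comparisons normally perform an optimal superposition first, which this sketch omits. The function name and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z).

    Assumes the structures are already superimposed; no alignment is performed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total_squared_distance = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total_squared_distance += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total_squared_distance / len(coords_a))


# Hypothetical coordinates in angstroms: the model is shifted 1 unit along z.
model_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reference_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
example_rmsd = rmsd(model_atoms, reference_atoms)
print(f"RMSD = {example_rmsd:.2f}")
```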
-
Very important pharmacogene summary ADRB2
PHARMACOGENETICS AND GENOMICS
2010; 20 (1): 64-69
View details for DOI 10.1097/FPC.0b013e328333dae6
View details for Web of Science ID 000273307600008
View details for PubMedID 19927042
View details for PubMedCentralID PMC3098753
-
Editorial: Current progress in Bioinformatics 2010
BRIEFINGS IN BIOINFORMATICS
2010; 11 (1): 1-2
View details for DOI 10.1093/bib/bbq001
View details for Web of Science ID 000273866500001
View details for PubMedID 20097719
-
Improving the prediction of pharmacogenes using text-derived drug-gene relationships.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2010: 305-314
Abstract
A critical goal of pharmacogenomics research is to identify genes that can explain variation in drug response. We have previously reported a method that creates a genome-scale ranking of genes likely to interact with a drug. The algorithm uses information about drug structure and indications of use to rank the genes. Although the algorithm has good performance, its performance depends on a curated set of drug-gene relationships that is expensive to create and difficult to maintain. In this work, we assess the utility of text mining in extracting a network of drug-gene relationships automatically. This provides a valuable aggregate source of knowledge, subsequently used as input into the algorithm that ranks potential pharmacogenes. Using a drug-gene network created from sentence-level co-occurrence in the full text of scientific articles, we compared the performance to that of a network created by manual curation of those articles. Under a wide range of conditions, we show that a knowledge base derived from text-mining the literature performs as well as, and sometimes better than, a high-quality, manually curated knowledge base. We conclude that we can use relationships mined automatically from the literature as a knowledgebase for pharmacogenomics relationships. Additionally, when relationships are missed by text mining, our system can accurately extrapolate new relationships with 77.4% precision.
View details for PubMedID 19908383
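The abstract above builds its drug-gene network from sentence-level co-occurrence. A minimal sketch of that idea follows: count how often a drug name and a gene name appear in the same sentence. The sentence splitter, the tokenizer, the lexicons and the example text are simplifying assumptions for illustration and are not the pipeline used in the paper.

```python
import re
from collections import Counter
from itertools import product


def cooccurrence_counts(text, drugs, genes):
    """Count sentences in which each (drug, gene) pair appears together."""
    counts = Counter()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Whole-token matching avoids confusing CYP2C9 with CYP2C19.
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence.lower()))
        found_drugs = [drug for drug in drugs if drug.lower() in tokens]
        found_genes = [gene for gene in genes if gene.lower() in tokens]
        for pair in product(found_drugs, found_genes):
            counts[pair] += 1
    return counts


# Hypothetical lexicons and text (illustrative only).
example_text = (
    "Warfarin dose depends on VKORC1 and CYP2C9. "
    "Clopidogrel is activated by CYP2C19. "
    "Warfarin is metabolized by CYP2C9."
)
pair_counts = cooccurrence_counts(
    example_text,
    drugs=["warfarin", "clopidogrel"],
    genes=["VKORC1", "CYP2C9", "CYP2C19"],
)
for (drug, gene), count in sorted(pair_counts.items()):
    print(drug, gene, count)
```

Pairs that never share a sentence are absent from the result, so the counts can be read directly as edge weights of a drug-gene network.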
-
Knowledge-based instantiation of full atomic detail into coarse-grain RNA 3D structural models
BIOINFORMATICS
2009; 25 (24): 3259-3266
Abstract
The recent development of methods for modeling RNA 3D structures using coarse-grain approaches creates a need to bridge low- and high-resolution modeling methods. Although they contain topological information, coarse-grain models lack atomic detail, which limits their utility for some applications. We have developed a method for adding full atomic detail to coarse-grain models of RNA 3D structures. Our method [Coarse to Atomic (C2A)] uses geometries observed in known RNA crystal structures. Our method rebuilds full atomic detail from ideal coarse-grain backbones taken from crystal structures to within 1.87-3.31 Å RMSD of the full atomic crystal structure. When starting from coarse-grain models generated by the modeling tool NAST, our method builds full atomic structures that are within 1.00 Å RMSD of the starting structure. The resulting full atomic structures can be used as starting points for higher resolution modeling, thus bridging high- and low-resolution approaches to modeling RNA 3D structure. Code for the C2A method, as well as the examples discussed in this article, are freely available at www.simtk.org/home/c2a. Contact: russ.altman@stanford.edu
View details for DOI 10.1093/bioinformatics/btp576
View details for Web of Science ID 000272464000008
View details for PubMedID 19812110
View details for PubMedCentralID PMC2788923
-
Prediction of calcium-binding sites by combining loop-modeling with machine learning
BMC STRUCTURAL BIOLOGY
2009; 9
Abstract
Protein ligand-binding sites in the apo state exhibit structural flexibility. This flexibility often frustrates methods for structure-based recognition of these sites because it leads to the absence of electron density for these critical regions, particularly when they are in surface loops. Methods for recognizing functional sites in these missing loops would be useful for recovering additional functional information. We report a hybrid approach for recognizing calcium-binding sites in disordered regions. Our approach combines loop modeling with a machine learning method (FEATURE) for structure-based site recognition. For validation, we compared the performance of our method on known calcium-binding sites for which there are both holo and apo structures. When loops in the apo structures are rebuilt using modeling methods, FEATURE identifies 14 out of 20 crystallographically proven calcium-binding sites. It only recognizes 7 out of 20 calcium-binding sites in the initial apo crystal structures. We applied our method to unstructured loops in proteins from SCOP families known to bind calcium in order to discover potential cryptic calcium binding sites. We built 2745 missing loops and evaluated them for potential calcium binding. We made 102 predictions of calcium-binding sites. Ten predictions are consistent with independent experimental verifications. We found indirect experimental evidence for 14 other predictions. The remaining 78 predictions are novel predictions, some with intriguing potential biological significance. In particular, we see an enrichment of beta-sheet folds with predicted calcium binding sites in the connecting loops on the surface that may be important for calcium-mediated function switches. Protein crystal structures are a potentially rich source of functional information. When loops are missing in these structures, we may be losing important information about binding sites and active sites. We have shown that limited loop modeling (e.g. loops less than 17 residues) combined with pattern matching algorithms can recover functions and propose putative conformations associated with these functions.
View details for DOI 10.1186/1472-6807-9-72
View details for Web of Science ID 000273849100001
View details for PubMedID 20003365
View details for PubMedCentralID PMC2808310
-
Taxane pathway
PHARMACOGENETICS AND GENOMICS
2009; 19 (12): 979-983
View details for DOI 10.1097/FPC.0b013e3283335277
View details for Web of Science ID 000272310800008
View details for PubMedID 21151855
View details for PubMedCentralID PMC2998989
-
Selective serotonin reuptake inhibitors pathway
PHARMACOGENETICS AND GENOMICS
2009; 19 (11): 907-909
View details for DOI 10.1097/FPC.0b013e32833132cb
View details for Web of Science ID 000271602800010
View details for PubMedID 19741567
View details for PubMedCentralID PMC2896866
-
Generating Genome-Scale Candidate Gene Lists for Pharmacogenomics
CLINICAL PHARMACOLOGY & THERAPEUTICS
2009; 86 (2): 183-189
Abstract
A critical task in pharmacogenomics is identifying genes that may be important modulators of drug response. High-throughput experimental methods are often plagued by false positives and do not take advantage of existing knowledge. Candidate gene lists can usefully summarize existing knowledge, but they are expensive to generate manually and may therefore have incomplete coverage. We have developed a method that ranks 12,460 genes in the human genome on the basis of their potential relevance to a specific query drug and its putative indications. Our method uses known gene-drug interactions, networks of gene-gene interactions, and available measures of drug-drug similarity. It ranks genes by building a local network of known interactions and assessing the similarity of the query drug (by both structure and indication) with drugs that interact with gene products in the local network. In a comprehensive benchmark, our method achieves an overall area under the curve of 0.82. To showcase our method, we found novel gene candidates for warfarin, gefitinib, carboplatin, and gemcitabine, and we provide the molecular hypotheses for these predictions.
View details for DOI 10.1038/clpt.2009.42
View details for Web of Science ID 000268565100019
View details for PubMedID 19369935
View details for PubMedCentralID PMC2729176
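The benchmark figure quoted in the abstract above (an area under the curve of 0.82) is a ranking metric. As a minimal sketch of how such a number is typically computed, and not the authors' evaluation code, the AUC can be obtained from the rank-sum identity: it is the fraction of (positive, negative) pairs in which the positive item receives the higher score, with ties counted as one half. The function name `auc_from_scores` and the toy scores and labels below are hypothetical.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    scores: higher means "ranked as more relevant".
    labels: 1 for a known positive (e.g., a known gene-drug link), 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy ranking: six genes, three of them known positives.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 1, 0, 0]
toy_auc = auc_from_scores(scores, labels)  # 7 of 9 pairs are ordered correctly
```

The quadratic pairwise loop is chosen for clarity; for genome-scale lists a sort-based rank computation would give the same value more efficiently.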
-
A Double-Blind, Randomized, Saline-Controlled Study of the Efficacy and Safety of EUFLEXXA (R) for Treatment of Painful Osteoarthritis of the Knee, With an Open-Label Safety Extension (The FLEXX Trial)
SEMINARS IN ARTHRITIS AND RHEUMATISM
2009; 39 (1): 1-9
Abstract
To report the FLEXX trial, the first well-controlled study assessing the safety and efficacy of Euflexxa (1% sodium hyaluronate; IA-BioHA) therapy for knee osteoarthritis (OA) at 26 weeks. This was a randomized, double-blind, multicenter, saline-controlled study. Subjects with chronic knee OA were randomized to 3 weekly intra-articular (IA) injections of either buffered saline (IA-SA) or IA-BioHA (20 mg/2 ml). The primary efficacy outcome was the subject-recorded difference in least-squares means between IA-BioHA and IA-SA in subjects' change from baseline to week 26 following a 50-foot walk test, measured via 100-mm visual analog scale (VAS). Secondary outcome measures included Osteoarthritis Research Society International responder index, Western Ontario McMaster University Osteoarthritis Index VA 3.1 subscales, patient global assessment, rescue medication, and health-related quality of life (HRQoL) by the SF-36. Safety was assessed by monitoring and reporting vital signs, physical examination of the target knee following injection, adverse events, and concomitant medications. Five hundred eighty-eight subjects were randomized to either IA-BioHA (n = 293) or IA-SA (n = 295), with an 88% 26-week completion rate. No statistical differences were noted between the treatment groups at baseline. In the IA-BioHA group, mean VAS scores decreased by 25.7 mm, compared with 18.5 mm in the IA-SA group. This corresponded to a median reduction of 53% from baseline for IA-BioHA and a 38% reduction for IA-SA. The difference in least-squares means was -6.6 mm (P = 0.002). Secondary outcome measures were consistent with significant improvement in Osteoarthritis Research Society International responder index, HRQoL, and function. Both IA-SA and IA-BioHA injections were well tolerated, with a low incidence of adverse events that were equally distributed between groups. Injection-site reactions were reported by 1 (<1%) subject in the IA-SA group and 2 (1%) in the IA-BioHA group. IA-BioHA therapy resulted in significant OA knee pain relief at 26 weeks compared with IA-SA. Subjects treated with IA-BioHA also experienced significant improvements in joint function, treatment satisfaction, and HRQoL.
View details for DOI 10.1016/j.semarthrit.2009.04.001
View details for Web of Science ID 000268735900001
View details for PubMedID 19539353
-
Improving Structure-Based Function Prediction Using Molecular Dynamics
STRUCTURE
2009; 17 (7): 919-929
Abstract
The number of molecules with solved three-dimensional structure but unknown function is increasing rapidly. Particularly problematic are novel folds with little detectable similarity to molecules of known function. Experimental assays can determine the functions of such molecules, but are time-consuming and expensive. Computational approaches can identify potential functional sites; however, these approaches generally rely on single static structures and do not use information about dynamics. In fact, structural dynamics can enhance function prediction: we coupled molecular dynamics simulations with structure-based function prediction algorithms that identify Ca²⁺ binding sites. When applied to 11 challenging proteins, both methods showed substantial improvement in performance, revealing 22 more sites in one case and 12 more in the other, with a modest increase in apparent false positives. Thus, we show that treating molecules as dynamic entities improves the performance of structure-based function prediction methods.
View details for DOI 10.1016/j.str.2009.05.010
View details for Web of Science ID 000268214500004
View details for PubMedID 19604472
View details for PubMedCentralID PMC2748254
-
Antiestrogen pathway (aromatase inhibitor)
PHARMACOGENETICS AND GENOMICS
2009; 19 (7): 554-555
View details for DOI 10.1097/FPC.0b013e32832e0ec1
View details for Web of Science ID 000267619000008
View details for PubMedID 19512956
View details for PubMedCentralID PMC2756763
-
Codeine and morphine pathway
PHARMACOGENETICS AND GENOMICS
2009; 19 (7): 556-558
View details for DOI 10.1097/FPC.0b013e32832e0eac
View details for Web of Science ID 000267619000009
View details for PubMedID 19512957
-
Platinum pathway
PHARMACOGENETICS AND GENOMICS
2009; 19 (7): 563-564
View details for DOI 10.1097/FPC.0b013e32832e0ed7
View details for Web of Science ID 000267619000011
View details for PubMedID 19525887
-
Cytochrome P450 2D6
PHARMACOGENETICS AND GENOMICS
2009; 19 (7): 559-562
View details for DOI 10.1097/FPC.0b013e32832e0e97
View details for Web of Science ID 000267619000010
View details for PubMedID 19512959
-
Etoposide pathway
PHARMACOGENETICS AND GENOMICS
2009; 19 (7): 552-553
View details for DOI 10.1097/FPC.0b013e32832e0e7f
View details for Web of Science ID 000267619000007
View details for PubMedID 19512958
-
Direct-to-Consumer Genetic Testing: Failure Is Not an Option
CLINICAL PHARMACOLOGY & THERAPEUTICS
2009; 86 (1): 15-17
Abstract
Direct-to-consumer genetic testing is an unavoidable consequence of our ability to cheaply and accurately measure the genome. Some are troubled by the loss of control over how and when this information is disclosed to individuals, but it is difficult to imagine any way to prevent the wide availability of these data. Therefore, the key challenge is to set up social, educational, and technical means to support individuals who have access to their genome.
View details for DOI 10.1038/clpt.2009.63
View details for Web of Science ID 000267225200003
View details for PubMedID 19536117
View details for PubMedCentralID PMC3086846
-
New feature: pathways and important genes from PharmGKB
PHARMACOGENETICS AND GENOMICS
2009; 19 (6): 403-403
View details for DOI 10.1097/FPC.0b013e32832b16ba
View details for Web of Science ID 000266575500001
-
Very important pharmacogene summary: sulfotransferase 1A1
PHARMACOGENETICS AND GENOMICS
2009; 19 (6): 404-406
View details for DOI 10.1097/FPC.0b013e32832e042e
View details for Web of Science ID 000266575500002
View details for PubMedID 19451861
-
Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text
1st Summit on Translational Bioinformatics
BIOMED CENTRAL LTD. 2009
Abstract
Pharmacogenomics studies the relationship between genetic variation and the variation in drug response phenotypes. The field is rapidly gaining importance: it promises drugs targeted to particular subpopulations based on genetic background. The pharmacogenomics literature has expanded rapidly, but is dispersed in many journals. It is challenging, therefore, to identify important associations between drugs and molecular entities, particularly genes and gene variants, and thus these critical connections are often lost. Text mining techniques can allow us to convert the free-style text to a computable, searchable format in which pharmacogenomic concepts (such as genes, drugs, polymorphisms, and diseases) are identified, and important links between these concepts are recorded. Availability of full text articles as input into text mining engines is key, as literature abstracts often do not contain sufficient information to identify these pharmacogenomic associations. Thus, building on a tool called Textpresso, we have created the Pharmspresso tool to assist in identifying important pharmacogenomic facts in full text articles. Pharmspresso parses text to find references to human genes, polymorphisms, drugs and diseases and their relationships. It presents these as a series of marked-up text fragments, in which key concepts are visually highlighted. To evaluate Pharmspresso, we used a gold standard of 45 human-curated articles. Pharmspresso identified 78%, 61%, and 74% of target gene, polymorphism, and drug concepts, respectively. Pharmspresso is a text analysis tool that extracts pharmacogenomic concepts from the literature automatically and thus captures our current understanding of gene-drug interactions in a computable form. We have made Pharmspresso available at http://pharmspresso.stanford.edu.
View details for Web of Science ID 000265602500007
View details for PubMedID 19208194
View details for PubMedCentralID PMC2646239
-
Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters
RNA-A PUBLICATION OF THE RNA SOCIETY
2009; 15 (2): 189-199
Abstract
Understanding the function of complex RNA molecules depends critically on understanding their structure. However, creating three-dimensional (3D) structural models of RNA remains a significant challenge. We present a protocol (the nucleic acid simulation tool [NAST]) for RNA modeling that uses an RNA-specific knowledge-based potential in a coarse-grained molecular dynamics engine to generate plausible 3D structures. We demonstrate NAST's capabilities by using only secondary structure and tertiary contact predictions to generate, cluster, and rank structures. Representative structures in the best ranking clusters averaged 8.0 ± 0.3 Å and 16.3 ± 1.0 Å RMSD for the yeast phenylalanine tRNA and the P4-P6 domain of the Tetrahymena thermophila group I intron, respectively. The coarse-grained resolution allows us to model large molecules such as the 158-residue P4-P6 or the 388-residue T. thermophila group I intron. One advantage of NAST is the ability to rank clusters of structurally similar decoys based on their compatibility with experimental data. We successfully used ideal small-angle X-ray scattering data and both ideal and experimental solvent accessibility data to select the best cluster of structures for both tRNA and P4-P6. Finally, we used NAST to build in missing loops in the crystal structures of the Azoarcus and Twort ribozymes, and to incorporate crystallographic data into the Michel-Westhof model of the T. thermophila group I intron, creating an integrated model of the entire molecule. Our software package is freely available at https://simtk.org/home/nast.
View details for DOI 10.1261/rna.1270809
View details for Web of Science ID 000262463200001
View details for PubMedID 19144906
View details for PubMedCentralID PMC2648710
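The RMSD values reported in the abstract above compare modeled and reference coordinates. The following is a minimal sketch of one common way to compute such a value, assuming one bead per residue and optimal rigid-body superposition by the Kabsch algorithm; the abstract does not state the exact procedure used, so the function name `kabsch_rmsd`, the one-bead representation, and the random test coordinates are illustrative assumptions only.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Both sets are centered, the best rotation of P onto Q is found from the
    SVD of their covariance matrix (Kabsch algorithm), and the root-mean-square
    distance between matched points is returned.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    P0 = P - P.mean(axis=0)          # remove translation
    Q0 = Q - Q.mean(axis=0)
    H = P0.T @ Q0                    # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P0 @ R.T                 # rotate P onto Q
    return float(np.sqrt(np.mean(np.sum((P_rot - Q0) ** 2, axis=1))))


# Hypothetical coarse-grained model: 76 beads (one per residue, tRNA-sized).
rng = np.random.default_rng(0)
P = rng.normal(size=(76, 3)) * 10.0

# A rotated and translated copy should superimpose exactly.
theta = np.deg2rad(30.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
Q = P @ Rz.T + np.array([5.0, -2.0, 1.0])
rmsd_same = kabsch_rmsd(P, Q)
```

Superposition before measuring distance matters here because two models of the same fold can sit in arbitrary orientations; without it, the number would mostly reflect placement rather than shape.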
-
TOWARDS A CYTOKINE-CELL INTERACTION KNOWLEDGEBASE OF THE ADAPTIVE IMMUNE SYSTEM
Pacific Symposium on Biocomputing
WORLD SCIENTIFIC PUBL CO PTE LTD. 2009: 439–450
Abstract
The immune system of higher organisms is, by any standard, complex. To date, using reductionist techniques, immunologists have elucidated many of the basic principles of how the immune system functions, yet our understanding is still far from complete. In an era of high throughput measurements, it is already clear that the scientific knowledge we have accumulated has itself grown larger than our ability to cope with it, and thus it is increasingly important to develop bioinformatics tools with which to navigate the complexity of the information that is available to us. Here, we describe ImmuneXpresso, an information extraction system, tailored for parsing the primary literature of immunology and relating it to experimental data. The immune system is very much dependent on the interactions of various white blood cells with each other, either in synaptic contacts, at a distance using cytokines or chemokines, or both. Therefore, as a first approximation, we used ImmuneXpresso to create a literature derived network of interactions between cells and cytokines. Integration of cell-specific gene expression data facilitates cross-validation of cytokine mediated cell-cell interactions and suggests novel interactions. We evaluate the performance of our automatically generated multi-scale model against existing manually curated data, and show how this system can be used to guide experimentalists in interpreting multi-scale, experimental data. Our methodology is scalable and can be generalized to other systems.
View details for Web of Science ID 000263639700041
View details for PubMedID 19209721
- The International Warfarin Pharmacogenetics Consortium. Warfarin Dosing Using Clinical and Pharmacogenetic Data. New England Journal of Medicine. 2009; 360 (8): 753-64
- Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text. BMC Bioinformatics. 2009; 10 Suppl 2: S6. PMCID: PMC2646239.
- Proceedings of Pacific Symposium on Biocomputing 2009. edited by Altman, R., Dunker, K., Hunter, L. 2009
-
New feature: pathways and important genes from PharmGKB.
Pharmacogenetics and genomics
2009; 19 (6): 403
View details for PubMedID 20161212
View details for PubMedCentralID PMC2715563
-
Predicting drug side-effects by chemical systems biology
GENOME BIOLOGY
2009; 10 (9)
Abstract
New approaches to predicting ligand similarity and protein interactions can explain unexpected observations of drug inefficacy or side-effects.
View details for DOI 10.1186/gb-2009-10-9-238
View details for Web of Science ID 000271425300004
View details for PubMedID 19723347
View details for PubMedCentralID PMC2768971
-
A general framework for dose optimization.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
2009; 2009: 656-660
Abstract
Dose optimization is a ubiquitous challenge in clinical practice and includes both pharmacologic and non-pharmacologic interventions. Methods for the statistical assessment of optimum dosing are lacking. We developed a generic framework for dose titration and demonstrated its application in two domains. Optimum warfarin dose was estimated from clinical titration data. In addition, cardiac pacemaker interval optimization was conducted using three conventional techniques. For both data types, optima were obtained from mathematical functions fit to the raw data. The precision of the estimated optima was quantified using bootstrapping. In pacing optimization, the observed precision varied significantly among the techniques, suggesting that impedance cardiography is superior to commonly used echocardiographic methods. The average 95% confidence interval of the estimated optimum warfarin dose was ±18%, suggesting that titration within this range is of limited utility. By identifying statistically ineffective interventions, objective analysis of optimization data may both improve outcomes and reduce healthcare costs.
View details for PubMedID 20351936
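The abstract above describes obtaining an optimum from a function fit to raw titration data and quantifying its precision by bootstrapping. A minimal sketch of that general idea follows. It assumes a quadratic response curve and a percentile bootstrap, neither of which is specified in the abstract, and the names `fitted_optimum` and `bootstrap_optimum` as well as the synthetic data are hypothetical.

```python
import numpy as np


def fitted_optimum(x, y):
    """Fit y = a*x^2 + b*x + c and return the vertex x* = -b / (2a)."""
    a, b, _ = np.polyfit(x, y, 2)   # coefficients, highest degree first
    return -b / (2.0 * a)


def bootstrap_optimum(x, y, n_boot=500, seed=0):
    """Return (point estimate, lower, upper) for the fitted optimum.

    The bounds are the 2.5th and 97.5th percentiles of the optimum over
    n_boot resamples of the (x, y) pairs drawn with replacement.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    optima = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample observation pairs
        optima[i] = fitted_optimum(x[idx], y[idx])
    lo, hi = np.percentile(optima, [2.5, 97.5])
    return fitted_optimum(x, y), float(lo), float(hi)


# Synthetic titration data (assumption): response peaks at a dose of 6 units.
data_rng = np.random.default_rng(1)
dose = np.linspace(2.0, 10.0, 40)
response = -(dose - 6.0) ** 2 + data_rng.normal(0.0, 1.0, size=dose.size)

best, lo, hi = bootstrap_optimum(dose, response)
```

A wide interval relative to the clinically relevant step size would suggest, in the spirit of the abstract, that further titration within that interval is unlikely to be distinguishable from noise.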
-
Efficient Algorithms to Explore Conformation Spaces of Flexible Protein Loops
7th International Workshop on Algorithms in Bioinformatics (WABI 2007)
IEEE COMPUTER SOC. 2008: 534–45
Abstract
Several applications in biology - e.g., incorporation of protein flexibility in ligand docking algorithms, interpretation of fuzzy X-ray crystallographic data, and homology modeling - require computing the internal parameters of a flexible fragment (usually, a loop) of a protein in order to connect its termini to the rest of the protein without causing any steric clash. One must often sample many such conformations in order to explore and adequately represent the conformational range of the studied loop. While sampling must be fast, it is made difficult by the fact that two conflicting constraints - kinematic closure and clash avoidance - must be satisfied concurrently. This paper describes two efficient and complementary sampling algorithms to explore the space of closed clash-free conformations of a flexible protein loop. The "seed sampling" algorithm samples broadly from this space, while the "deformation sampling" algorithm uses seed conformations as starting points to explore the conformation space around them at a finer grain. Computational results are presented for various loops ranging from 5 to 25 residues. More specific results also show that the combination of the sampling algorithms with a functional site prediction software (FEATURE) makes it possible to compute and recognize calcium-binding loop conformations. The sampling algorithms are implemented in a toolkit (LoopTK), which is available at https://simtk.org/home/looptk.
View details for DOI 10.1109/TCBB.2008.96
View details for Web of Science ID 000260433100007
View details for PubMedID 18989041
View details for PubMedCentralID PMC2794838
-
PharmGKB: an integrated resource of pharmacogenomic data and knowledge.
Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.]
2008; Chapter 14: Unit14 7-?
Abstract
The PharmGKB is a publicly available online resource that aims to facilitate understanding how genetic variation contributes to variation in drug response. It is not only a repository of pharmacogenomics primary data, but it also provides fully curated knowledge including drug pathways, annotated pharmacogene summaries, and relationships among genes, drugs, and diseases. This unit describes how to navigate the PharmGKB Web site to retrieve detailed information on genes and important variants, as well as their relationship to drugs and diseases. It also includes protocols on our drug-centered pathway, annotated pharmacogene summaries, and our Web services for downloading the underlying data. Workflow on how to use PharmGKB to facilitate design of the pharmacogenomic study is also described in this unit.
View details for DOI 10.1002/0471250953.bi1407s23
View details for PubMedID 18819074
-
The Simbios National Center: Systems biology in motion
PROCEEDINGS OF THE IEEE
2008; 96 (8): 1266-1280
Abstract
Physics-based simulation is needed to understand the function of biological structures and can be applied across a wide range of scales, from molecules to organisms. Simbios (the National Center for Physics-Based Simulation of Biological Structures, http://www.simbios.stanford.edu/) is one of seven NIH-supported National Centers for Biomedical Computation. This article provides an overview of the mission and achievements of Simbios, and describes its place within systems biology. Understanding the interactions between various parts of a biological system and integrating this information to understand how biological systems function is the goal of systems biology. Many important biological systems comprise complex structural systems whose components interact through the exchange of physical forces, and whose movement and function is dictated by those forces. In particular, systems that are made of multiple identifiable components that move relative to one another in a constrained manner are multibody systems. Simbios' focus is creating methods for their simulation. Simbios is also investigating the biomechanical forces that govern fluid flow through deformable vessels, a central problem in cardiovascular dynamics. In this application, the system is governed by the interplay of classical forces, but the motion is distributed smoothly through the materials and fluids, requiring the use of continuum methods. In addition to the research aims, Simbios is working to disseminate information, software and other resources relevant to biological systems in motion.
View details for DOI 10.1109/JPROC.2008.925454
View details for Web of Science ID 000257860800004
View details for PubMedID 20107615
View details for PubMedCentralID PMC2811325
-
High-throughput single-nucleotide structural mapping by capillary automated footprinting analysis
NUCLEIC ACIDS RESEARCH
2008; 36 (11)
Abstract
The use of capillary electrophoresis with fluorescently labeled nucleic acids revolutionized DNA sequencing, effectively fueling the genomic revolution. We present an application of this technology for the high-throughput structural analysis of nucleic acids by chemical and enzymatic mapping ('footprinting'). We achieve the throughput and data quality necessary for genomic-scale structural analysis by combining fluorophore labeling of nucleic acids with novel quantitation algorithms. We implemented these algorithms in the CAFA (capillary automated footprinting analysis) open-source software that is downloadable gratis from https://simtk.org/home/cafa. The accuracy, throughput and reproducibility of CAFA analysis are demonstrated using hydroxyl radical footprinting of RNA. The versatility of CAFA is illustrated by dimethyl sulfate mapping of RNA secondary structure and DNase I mapping of a protein binding to a specific sequence of DNA. Our experimental and computational approach facilitates the acquisition of high-throughput chemical probing data for solution structural analysis of nucleic acids.
View details for DOI 10.1093/nar/gkn267
View details for Web of Science ID 000257188700033
View details for PubMedID 18477638
View details for PubMedCentralID PMC2441812
-
Interview: Russ Altman speaks to Shreeya Nanda, Commissioning Editor.
Pharmacogenomics
2008; 9 (6): 663-665
Abstract
Russ Biagio Altman is a professor of bioengineering, genetics, and medicine (and of computer science by courtesy) and chairman of the Bioengineering Department at Stanford University, CA, USA. His primary research interests are in the application of computing technology to basic molecular biological problems of relevance to medicine. He is currently developing techniques for collaborative scientific computation over the internet, including novel user interfaces to biological data, particularly for pharmacogenomics. Other work focuses on the analysis of functional microenvironments within macromolecules and the application of algorithms for determining the structure, dynamics and function of biological macromolecules. Dr Altman holds an MD from Stanford Medical School, a PhD in medical information sciences from Stanford, and an AB from Harvard College, MA, USA. He has been the recipient of the US Presidential Early Career Award for Scientists and Engineers and a National Science Foundation CAREER Award. He is a fellow of the American College of Physicians and the American College of Medical Informatics. He is a past-president and founding board member of the International Society for Computational Biology and an organizer of the annual Pacific Symposium on Biocomputing. He leads one of seven NIH-supported National Centers for Biomedical Computation, focusing on physics-based simulation of biological structures. He won the Stanford Medical School graduate teaching award in 2000.
View details for DOI 10.2217/14622416.9.6.663
View details for PubMedID 18518843
-
iTools: A Framework for Classification, Categorization and Integration of Computational Biology Resources
PLOS ONE
2008; 3 (5)
Abstract
The advancement of the computational biology field hinges on progress in three fundamental directions: the development of new computational algorithms, the availability of informatics resource management infrastructures and the capability of tools to interoperate and synergize. There is an explosion in algorithms and tools for computational biology, which makes it difficult for biologists to find, compare and integrate such resources. We describe a new infrastructure, iTools, for managing the query, traversal and comparison of diverse computational biology resources. Specifically, iTools stores information about three types of resources: data, software tools and web-services. The iTools design, implementation and resource meta-data content reflect the broad research, computational, applied and scientific expertise available at the seven National Centers for Biomedical Computing. iTools provides a system for classification, categorization and integration of different computational biology resources across space-and-time scales, biomedical problems, computational infrastructures and mathematical foundations. A large number of resources are already iTools-accessible to the community and this infrastructure is rapidly growing. iTools includes human and machine interfaces to its resource meta-data repository. Investigators or computer programs may utilize these interfaces to search, compare, expand, revise and mine meta-data descriptions of existent computational biology resources. We propose two ways to browse and display the iTools dynamic collection of resources. The first one is based on an ontology of computational biology resources, and the second one is derived from hyperbolic projections of manifolds or complex structures onto planar discs. iTools is an open source project both in terms of the source code development as well as its meta-data content. iTools employs a decentralized, portable, scalable and lightweight framework for long-term resource management. We demonstrate several applications of iTools as a framework for integrated bioinformatics. iTools and the complete details about its specifications, usage and interfaces are available at the iTools web page http://iTools.ccb.ucla.edu.
View details for DOI 10.1371/journal.pone.0002265
View details for Web of Science ID 000262268500012
View details for PubMedID 18509477
View details for PubMedCentralID PMC2386255
-
M-BISON: Microarray-based integration of data sources using networks
BMC BIOINFORMATICS
2008; 9
Abstract
The accurate detection of differentially expressed (DE) genes has become a central task in microarray analysis. Unfortunately, the noise level and experimental variability of microarrays can be limiting. While a number of existing methods partially overcome these limitations by incorporating biological knowledge in the form of gene groups, these methods sacrifice gene-level resolution. This loss of precision can be inappropriate, especially if the desired output is a ranked list of individual genes. To address this shortcoming, we developed M-BISON (Microarray-Based Integration of data SOurces using Networks), a formal probabilistic model that integrates background biological knowledge with microarray data to predict individual DE genes. M-BISON improves signal detection on a range of simulated data, particularly when using very noisy microarray data. We also applied the method to the task of predicting heat shock-related differentially expressed genes in S. cerevisiae, using an hsf1 mutant microarray dataset and conserved yeast DNA sequence motifs. Our results demonstrate that M-BISON improves the analysis quality and makes predictions that are easy to interpret in concert with incorporated knowledge. Specifically, M-BISON increases the AUC of DE gene prediction from 0.541 to 0.623 when compared to a method using only microarray data, and M-BISON outperforms a related method, GeneRank. Furthermore, by analyzing M-BISON predictions in the context of the background knowledge, we identified YHR124W as a potentially novel player in the yeast heat shock response. This work provides a solid foundation for the principled integration of imperfect biological knowledge with gene expression data and other high-throughput data sources.
View details for DOI 10.1186/1471-2105-9-214
View details for Web of Science ID 000256421800001
View details for PubMedID 18439292
View details for PubMedCentralID PMC2396182
-
The chemical genomic portrait of yeast: Uncovering a phenotype for all genes
SCIENCE
2008; 320 (5874): 362-365
Abstract
Genetics aims to understand the relation between genotype and phenotype. However, because complete deletion of most yeast genes (approximately 80%) has no obvious phenotypic consequence in rich medium, it is difficult to study their functions. To uncover phenotypes for this nonessential fraction of the genome, we performed 1144 chemical genomic assays on the yeast whole-genome heterozygous and homozygous deletion collections and quantified the growth fitness of each deletion strain in the presence of chemical or environmental stress conditions. We found that 97% of gene deletions exhibited a measurable growth phenotype, suggesting that nearly all genes are essential for optimal growth in at least one condition.
View details for DOI 10.1126/science.1150021
View details for Web of Science ID 000255026100040
View details for PubMedID 18420932
View details for PubMedCentralID PMC2794835
-
PharmGKB and the International Warfarin Pharmacogenetics Consortium: The changing role for pharmacogenomic databases and single-drug pharmacogenetics
HUMAN MUTATION
2008; 29 (4): 456-460
Abstract
PharmGKB, the pharmacogenetics and pharmacogenomics knowledge base (www.pharmgkb.org) is a publicly available online resource dedicated to the dissemination of how genetic variation leads to variation in drug responses. The goals of PharmGKB are to describe relationships between genes, drugs, and diseases, and to generate knowledge to catalyze pharmacogenetic and pharmacogenomic research. PharmGKB delivers knowledge in the form of curated literature annotations, drug pathway diagrams, and very important pharmacogene (VIP) summaries. Recently, PharmGKB has embraced a new role--broker of pharmacogenomic data for data sharing consortia. In particular, we have helped create the International Warfarin Pharmacogenetics Consortium (IWPC), which is devoted to pooling genotype and phenotype data relevant to the anticoagulant warfarin. PharmGKB has embraced the challenge of continuing to maintain its original mission while taking an active role in the formation of pharmacogenetic consortia.
View details for DOI 10.1002/humu.20731
View details for Web of Science ID 000254800400002
View details for PubMedID 18330919
-
Structural inference of native and partially folded RNA by high-throughput contact mapping
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2008; 105 (11): 4144-4149
Abstract
The biological behaviors of ribozymes, riboswitches, and numerous other functional RNA molecules are critically dependent on their tertiary folding and their ability to sample multiple functional states. The conformational heterogeneity and partially folded nature of most of these states has rendered their characterization by high-resolution structural approaches difficult or even intractable. Here we introduce a method to rapidly infer the tertiary helical arrangements of large RNA molecules in their native and non-native solution states. Multiplexed hydroxyl radical (.OH) cleavage analysis (MOHCA) enables the high-throughput detection of numerous pairs of contacting residues via random incorporation of radical cleavage agents followed by two-dimensional gel electrophoresis. We validated this technology by recapitulating the unfolded and native states of a well studied model RNA, the P4-P6 domain of the Tetrahymena ribozyme, at subhelical resolution. We then applied MOHCA to a recently discovered third state of the P4-P6 RNA that is stabilized by high concentrations of monovalent salt and whose partial order precludes conventional techniques for structure determination. The three-dimensional portrait of a compact, non-native RNA state reveals a well ordered subset of native tertiary contacts, in contrast to the dynamic but otherwise similar molten globule states of proteins. With its applicability to nearly any solution state, we expect MOHCA to be a powerful tool for illuminating the many functional structures of large RNA molecules and RNA/protein complexes.
View details for DOI 10.1073/pnas.0709032105
View details for Web of Science ID 000254263300015
View details for PubMedID 18322008
View details for PubMedCentralID PMC2393762
-
MScanner: a classifier for retrieving medline citations
BMC BIOINFORMATICS
2008; 9
Abstract
Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains. MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic.
In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92. MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu.
View details for DOI 10.1186/1471-2105-9-108
View details for Web of Science ID 000254012100001
View details for PubMedID 18284683
View details for PubMedCentralID PMC2263023
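The ranking idea in the MScanner abstract above (a Bayesian classifier over MeSH terms and journal names, returning citations by decreasing relevance) can be illustrated with a minimal sketch. This is not the MScanner implementation: the function names, the smoothing pseudocount, and the toy PubMed IDs are all assumptions made for illustration, and a log-odds score stands in for the probability of relevance.

```python
import math
from collections import Counter


def train_feature_weights(relevant_docs, background_docs, pseudocount=1.0):
    """Return a log-odds weight per feature (e.g. a MeSH term or journal name).

    Each document is a set of feature strings. The pseudocount is an
    illustrative smoothing choice, not a value taken from the paper.
    """
    rel_counts = Counter(f for doc in relevant_docs for f in set(doc))
    bg_counts = Counter(f for doc in background_docs for f in set(doc))
    n_rel, n_bg = len(relevant_docs), len(background_docs)
    weights = {}
    for feature in set(rel_counts) | set(bg_counts):
        p_rel = (rel_counts[feature] + pseudocount) / (n_rel + 2 * pseudocount)
        p_bg = (bg_counts[feature] + pseudocount) / (n_bg + 2 * pseudocount)
        weights[feature] = math.log(p_rel / p_bg)
    return weights


def score(doc, weights):
    """Sum the weights of the features present; unseen features contribute 0."""
    return sum(weights.get(f, 0.0) for f in set(doc))


def rank(candidates, weights):
    """Return candidate IDs ordered by decreasing score."""
    return sorted(candidates, key=lambda pmid: score(candidates[pmid], weights),
                  reverse=True)


# Toy data: the IDs and feature sets below are hypothetical.
relevant = [{"Pharmacogenetics", "Warfarin"}, {"Pharmacogenetics", "Hum Mutat"}]
background = [{"Yeast"}, {"RNA Folding"}, {"Yeast", "Science"}]
candidates = {"1": {"Pharmacogenetics"}, "2": {"Yeast"}, "3": {"Unseen"}}

weights = train_feature_weights(relevant, background)
ranking = rank(candidates, weights)
```

Restricting each document to a small set of categorical features is what keeps scoring cheap enough to run over millions of records; the sketch mirrors that design choice only in spirit.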
-
An XML-based interchange format for genotype-phenotype data
HUMAN MUTATION
2008; 29 (2): 212-219
Abstract
Recent advances in high-throughput genotyping and phenotyping have accelerated the creation of pharmacogenomic data. Consequently, the community requires standard formats to exchange large amounts of diverse information. To facilitate the transfer of pharmacogenomics data between databases and analysis packages, we have created a standard XML (eXtensible Markup Language) schema that describes both genotype and phenotype data as well as associated metadata. The schema accommodates information regarding genes, drugs, diseases, experimental methods, genomic/RNA/protein sequences, subjects, subject groups, and literature. The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB; www.pharmgkb.org) has used this XML schema for more than 5 years to accept and process submissions containing more than 1,814,139 SNPs on 20,797 subjects using 8,975 assays. Although developed in the context of pharmacogenomics, the schema is of general utility for exchange of genotype and phenotype data. We have written syntactic and semantic validators to check documents using this format. The schema and code for validation is available to the community at http://www.pharmgkb.org/schema/index.html (last accessed: 8 October 2007).
View details for DOI 10.1002/humu.20662
View details for Web of Science ID 000253033000002
View details for PubMedID 17994540
-
The pharmacogenetics and pharmacogenomics knowledge base: accentuating the knowledge
NUCLEIC ACIDS RESEARCH
2008; 36: D913-D918
Abstract
PharmGKB is a knowledge base that captures the relationships between drugs, diseases/phenotypes and genes involved in pharmacokinetics (PK) and pharmacodynamics (PD). This information includes literature annotations, primary data sets, PK and PD pathways, and expert-generated summaries of PK/PD relationships between drugs, diseases/phenotypes and genes. PharmGKB's website is designed to effectively disseminate knowledge to meet the needs of our users. PharmGKB currently has literature annotations documenting the relationship of over 500 drugs, 450 diseases and 600 variant genes. In order to meet the needs of whole genome studies, PharmGKB has added new functionalities, including browsing the variant display by chromosome and cytogenetic locations, allowing the user to view variants not located within a gene. We have developed new infrastructure for handling whole genome data, including increased methods for quality control and tools for comparison across other data sources, such as dbSNP, JSNP and HapMap data. PharmGKB has also added functionality to accept, store, display and query high throughput SNP array data. These changes allow us to capture more structured information on phenotypes for better cataloging and comparison of data. PharmGKB is available at www.pharmgkb.org.
View details for DOI 10.1093/nar/gkm1009
View details for Web of Science ID 000252545400160
View details for PubMedID 18032438
View details for PubMedCentralID PMC2238877
- The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science. 2008; 320 (5874): 362-5. PMCID: PMC2794835
- PharmGKB and the International Warfarin Pharmacogenetics Consortium: the changing role for pharmacogenomic databases and single-drug pharmacogenetics. Hum Mutat. 2008; 29 (4): 456-60
- Combining molecular dynamics and machine learning to improve protein function recognition edited by Altman, R., Dunker, K., Hunter, L. 2008
- Proceedings of Pacific Symposium on Biocomputing 2008. edited by Altman, R., Dunker, K., Hunter, L. 2008
- Structural inference of native and partially folded RNA by high-throughput contact mapping. 2008
- The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications. BMC Genomics., 9 Suppl 2:S2. PMCID: PMC2559884. 2008
-
The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications.
BMC genomics
2008; 9 (Suppl 2): S2
Abstract
Structural genomics efforts contribute new protein structures that often lack significant sequence and fold similarity to known proteins. Traditional sequence and structure-based methods may not be sufficient to annotate the molecular functions of these structures. Techniques that combine structural and functional modeling can be valuable for functional annotation. FEATURE is a flexible framework for modeling and recognition of functional sites in macromolecular structures. Here, we present an overview of the main components of the FEATURE framework, and describe the recent developments in its use. These include automating training sets selection to increase functional coverage, coupling FEATURE to structural diversity generating methods such as molecular dynamics simulations and loop modeling methods to improve performance, and using FEATURE in large-scale modeling and structure determination efforts.
View details for DOI 10.1186/1471-2164-9-S2-S2
View details for PubMedID 18831785
View details for PubMedCentralID PMC2559884
-
Semiautomated and rapid quantification of nucleic acid footprinting and structure mapping experiments
NATURE PROTOCOLS
2008; 3 (9): 1395-1401
Abstract
We have developed protocols for rapidly quantifying the band intensities from nucleic acid chemical mapping gels at single-nucleotide resolution. These protocols are implemented in the software SAFA (semi-automated footprinting analysis) that can be downloaded without charge from http://safa.stanford.edu. The protocols implemented in SAFA have five steps: (i) lane identification, (ii) gel rectification, (iii) band assignment, (iv) model fitting and (v) band-intensity normalization. SAFA enables the rapid quantitation of gel images containing thousands of discrete bands, thereby eliminating a bottleneck to the analysis of chemical mapping experiments. An experienced user of the software can quantify a gel image in approximately 20 min. Although SAFA was developed to analyze hydroxyl radical (•OH) footprints, it effectively quantifies the gel images obtained with other types of chemical mapping probes. We also present a series of tutorial movies that illustrate the best practices and different steps in the SAFA analysis as a supplement to this protocol.
View details for DOI 10.1038/nprot.2008.134
View details for Web of Science ID 000258424100003
View details for PubMedID 18772866
View details for PubMedCentralID PMC2652576
-
Combining molecular dynamics and machine learning to improve protein function recognition.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2008: 332-343
Abstract
As structural genomics efforts succeed in solving protein structures with novel folds, the number of proteins with known structures but unknown functions increases. Although experimental assays can determine the functions of some of these molecules, they can be expensive and time consuming. Computational approaches can assist in identifying potential functions of these molecules. Possible functions can be predicted based on sequence similarity, genomic context, expression patterns, structure similarity, and combinations of these. We investigated whether simulations of protein dynamics can expose functional sites that are not apparent to the structure-based function prediction methods in static crystal structures. Focusing on Ca2+ binding, we coupled a machine learning tool that recognizes functional sites, FEATURE, with Molecular Dynamics (MD) simulations. Treating molecules as dynamic entities can improve the ability of structure-based function prediction methods to annotate possible functional sites.
View details for PubMedID 18229697
-
Commentaries on "Informatics and Medicine: From Molecules to Populations"
METHODS OF INFORMATION IN MEDICINE
2008; 47 (4): 296-317
Abstract
To discuss interdisciplinary research and education in the context of informatics and medicine by commenting on the paper of Kuhn et al. "Informatics and Medicine: From Molecules to Populations". Inviting an international group of experts in biomedical and health informatics and related disciplines to comment on this paper. The commentaries include a wide range of reasoned arguments and original position statements which, while strongly endorsing the educational needs identified by Kuhn et al., also point out fundamental challenges that are very specific to the unusual combination of scientific, technological, personal and social problems characterizing biomedical informatics. They point to the ultimate objectives of managing difficult human health problems, which are unlikely to yield to technological solutions alone. The psychological, societal, and environmental components of health and disease are emphasized by several of the commentators, setting the stage for further debate and constructive suggestions.
View details for Web of Science ID 000258751400003
View details for PubMedID 18690363
-
The SeqFEATURE library of 3D functional site models: comparison to existing methods and applications to protein function annotation
GENOME BIOLOGY
2008; 9 (1)
Abstract
Structural genomics efforts have led to increasing numbers of novel, uncharacterized protein structures with low sequence identity to known proteins, resulting in a growing need for structure-based function recognition tools. Our method, SeqFEATURE, robustly models protein functions described by sequence motifs using a structural representation. We built a library of models that shows good performance compared to other methods. In particular, SeqFEATURE demonstrates significant improvement over other methods when sequence and structural similarity are low.
View details for DOI 10.1186/gb-2008-9-1-r8
View details for Web of Science ID 000253779800016
View details for PubMedID 18197987
View details for PubMedCentralID PMC2395245
-
The ethics of characterizing difference: guiding principles on using racial categories in human genetics
GENOME BIOLOGY
2008; 9 (7)
Abstract
We are a multidisciplinary group of Stanford faculty who propose ten principles to guide the use of racial and ethnic categories when characterizing group differences in research into human genetic variation.
View details for DOI 10.1186/gb-2008-9-7-404
View details for Web of Science ID 000258773600005
View details for PubMedID 18638359
View details for PubMedCentralID PMC2530857
-
Text mining for biology - the way forward: opinions from leading scientists
GENOME BIOLOGY
2008; 9
Abstract
This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify several broad themes, including the possibility of fusing literature and biological databases through text mining; the need for user interfaces tailored to different classes of users and supporting community-based annotation; the importance of scaling text mining technology and inserting it into larger workflows; and suggestions for additional challenge evaluations, new applications, and additional resources needed to make progress.
View details for DOI 10.1186/gb-2008-9-S2-S7
View details for Web of Science ID 000278173900007
View details for PubMedID 18834498
View details for PubMedCentralID PMC2559991
-
Robust recognition of zinc binding sites in proteins
PROTEIN SCIENCE
2008; 17 (1): 54-65
Abstract
Metals play a variety of roles in biological processes, and hence their presence in a protein structure can yield vital functional information. Because the residues that coordinate a metal often undergo conformational changes upon binding, detection of binding sites based on simple geometric criteria in proteins without bound metal is difficult. However, aspects of the physicochemical environment around a metal binding site are often conserved even when this structural rearrangement occurs. We have developed a Bayesian classifier using known zinc binding sites as positive training examples and nonmetal binding regions that nonetheless contain residues frequently observed in zinc sites as negative training examples. In order to allow variation in the exact positions of atoms, we average a variety of biochemical and biophysical properties in six concentric spherical shells around the site of interest. At a specificity of 99.8%, this method achieves 75.5% sensitivity in unbound proteins at a positive predictive value of 73.6%. We also test its accuracy on predicted protein structures obtained by homology modeling using templates with 30%-50% sequence identity to the target sequences. At a specificity of 99.8%, we correctly identify at least one zinc binding site in 65.5% of modeled proteins. Thus, in many cases, our model is accurate enough to identify metal binding sites in proteins of unknown structure for which no high sequence identity homologs of known structure exist. Both the source code and a Web interface are available to the public at http://feature.stanford.edu/metals.
View details for DOI 10.1110/ps.073138508
View details for Web of Science ID 000251834500007
View details for PubMedID 18042678
View details for PubMedCentralID PMC2144590
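The abstract above describes averaging properties in six concentric spherical shells around a site of interest, so that small shifts in atom positions do not change the description much. A minimal sketch of that averaging step follows. It is an illustration only, not the published code: the function name, the single scalar property per atom, and the 1.25 Å shell width are assumptions (the abstract states only that six shells are used).

```python
import math


def shell_averages(site, atoms, n_shells=6, shell_width=1.25):
    """Average a per-atom property within concentric shells around `site`.

    `site` is an (x, y, z) tuple; `atoms` is a list of ((x, y, z), value)
    pairs. Shell k covers distances in [k * shell_width, (k + 1) * shell_width).
    The shell width is an assumed value; empty shells are reported as 0.0.
    """
    sums = [0.0] * n_shells
    counts = [0] * n_shells
    for position, value in atoms:
        distance = math.dist(site, position)
        index = int(distance // shell_width)
        if index < n_shells:  # ignore atoms beyond the outermost shell
            sums[index] += value
            counts[index] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Hypothetical atoms: two in the innermost shell, one in the third shell,
# and one far outside the outermost shell.
example_atoms = [
    ((1.0, 0.0, 0.0), 2.0),
    ((0.0, 1.0, 0.0), 4.0),
    ((3.0, 0.0, 0.0), 1.0),
    ((100.0, 0.0, 0.0), 9.0),
]
features = shell_averages((0.0, 0.0, 0.0), example_atoms)
```

In a classifier of the kind described, one such vector per property would be concatenated into the feature vector for the site; how the properties are chosen and weighted is outside the scope of this sketch.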
-
PharmGKB: UNDERSTANDING THE EFFECTS OF INDIVIDUAL GENETIC VARIANTS
DRUG METABOLISM REVIEWS
2008; 40 (4): 539-551
Abstract
The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB: http://www.pharmgkb.org) is devoted to disseminating primary data and knowledge in pharmacogenetics and pharmacogenomics. We are annotating the genes that are most important for drug response and present this information in the form of Very Important Pharmacogene (VIP) summaries, pathway diagrams, and curated literature. The PharmGKB currently contains information on over 500 drugs, 500 diseases, and 700 genes with genotyped variants. New features focus on capturing the phenotypic consequences of individual genetic variants. These features link variant genotypes to phenotypes, increase the breadth of pharmacogenomics literature curated, and visualize single-nucleotide polymorphisms on a gene's three-dimensional protein structure.
View details for DOI 10.1080/03602530802413338
View details for Web of Science ID 000260325500002
View details for PubMedID 18949600
View details for PubMedCentralID PMC2677552
-
Predicting allosteric communication in myosin via a pathway of conserved residues
JOURNAL OF MOLECULAR BIOLOGY
2007; 373 (5): 1361-1373
Abstract
We present a computational method that predicts a pathway of residues that mediate protein allosteric communication. The pathway is predicted using only a combination of distance constraints between contiguous residues and evolutionary data. We applied this analysis to find pathways of conserved residues connecting the myosin ATP binding site to the lever arm. These pathway residues may mediate the allosteric communication that couples ATP hydrolysis to the lever arm recovery stroke. Having examined pre-stroke conformations of Dictyostelium, scallop, and chicken myosin II as well as Dictyostelium myosin I, we observed a conserved pathway traversing switch II and the relay helix, which is consistent with the understood need for allosteric communication in this conformation. We also examined post-rigor and rigor conformations across several myosin species. Although initial residues of these paths are more heterogeneous, all but one of these paths traverse a consistent set of relay helix residues to reach the beginning of the lever arm. We discuss our results in the context of structural elements and reported mutational experiments, which substantiate the significance of the pre-stroke pathways. Our method provides a simple, computationally efficient means of predicting a set of residues that mediate allosteric communication. We provide a refined, downloadable application and source code (on https://simtk.org) to share this tool with the wider community (https://simtk.org/home/allopathfinder).
View details for DOI 10.1016/j.jmb.2007.08.059
View details for Web of Science ID 000250712600021
View details for PubMedID 17900617
View details for PubMedCentralID PMC2128046
-
The education potential of the pharmacogenetics and pharmacogenomics knowledge base (PharmGKB)
CLINICAL PHARMACOLOGY & THERAPEUTICS
2007; 82 (4): 472-475
Abstract
The pharmacogenetics and pharmacogenomics knowledge base (PharmGKB, http://www.pharmgkb.org) is a publicly available internet resource dedicated to the integration, annotation, and aggregation of pharmacogenomic knowledge. PharmGKB is a repository for pharmacogenetic and pharmacogenomic data, and curators provide integrated knowledge in terms of gene summaries, pathways, and annotated literature. Although PharmGKB is primarily directed toward catalyzing new research, it also has utility as a source of information for education about pharmacogenomics.
View details for DOI 10.1038/sj.clpt.6100332
View details for Web of Science ID 000249636500024
View details for PubMedID 17713470
-
Ontological issues in pharmacogenomics
MONIST
2007; 90 (4): 523-533
View details for Web of Science ID 000259179300003
-
Current progress in bioinformatics 2007
BRIEFINGS IN BIOINFORMATICS
2007; 8 (5): 277-278
View details for DOI 10.1093/bib/bbm041
View details for Web of Science ID 000251034700001
View details for PubMedID 17724063
-
Using surface envelopes to constrain molecular modeling
PROTEIN SCIENCE
2007; 16 (7): 1266-1273
Abstract
Molecular density information (as measured by electron microscopic reconstructions or crystallographic density maps) can be a powerful source of information for molecular modeling. Molecular density constrains models by specifying where atoms should and should not be. Low-resolution density information can often be obtained relatively quickly, and there is a need for methods that use it effectively. We have previously described a method for scoring molecular models with surface envelopes to discriminate between plausible and implausible fits. We showed that we could successfully filter out models with the wrong shape based on this discrimination power. Ideally, however, surface information should be used during the modeling process to constrain the conformations that are sampled. In this paper, we describe an extension of our method for using shape information during computational modeling. We use the envelope scoring metric as part of an objective function in a global optimization that also optimizes distances and angles while avoiding collisions. We systematically tested surface representations of proteins (using all nonhydrogen heavy atoms) with different abundance of distance information and showed that the root mean square deviation (RMSD) of models built with envelope information is consistently improved, particularly in data sets with relatively small sets of short-range distances.
View details for DOI 10.1110/ps.062733407
View details for Web of Science ID 000247465400004
View details for PubMedID 17586766
View details for PubMedCentralID PMC2206696
-
Genetic nondiscrimination legislation: a critical prerequisite for pharmacogenomics data sharing
PHARMACOGENOMICS
2007; 8 (5): 519-519
View details for DOI 10.2217/14622416.8.5.519
View details for Web of Science ID 000246464800017
View details for PubMedID 17465717
-
Coplanar and coaxial orientations of RNA bases and helices
RNA-A PUBLICATION OF THE RNA SOCIETY
2007; 13 (5): 643-650
Abstract
Electrostatic interactions, base-pairing, and especially base-stacking dominate RNA three-dimensional structures. In an A-form RNA helix, base-stacking results in nearly perfect parallel orientations of all bases in the helix. Interestingly, when an RNA structure containing multiple helices is visualized at the atomic level, it is often possible to find an orientation such that only the edges of most bases are visible. This suggests that a general aspect of higher level RNA structure is a coplanar arrangement of base-normal vectors. We have analyzed all solved RNA crystal structures to determine the degree to which RNA base-normal vectors are globally coplanar. Using a statistical test based on the Watson-Girdle distribution, we determined that 330 out of 331 known RNA structures show statistically significant (p < 0.05; false discovery rate [FDR] = 0.05) coplanar normal vector orientations. Not surprisingly, 94% of the helices in RNA show bipolar arrangements of their base-normal vectors (p < 0.05). This allows us to compute a mean axis for each helix and compare their orientations within an RNA structure. This analysis revealed that 62% (208/331) of the RNA structures exhibit statistically significant coaxial packing of helices (p < 0.05, FDR = 0.08). Further analysis reveals that the bases in hairpin loops and junctions are also generally planar. This work demonstrates coplanar base orientation and coaxial helix packing as an emergent behavior of RNA structure and may be useful as a structural modeling constraint.
View details for DOI 10.1261/rna.381407
View details for Web of Science ID 000245882400002
View details for PubMedID 17339576
View details for PubMedCentralID PMC1852812
-
Distinct contribution of electrostatics, initial conformational ensemble, and macromolecular stability in RNA folding
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2007; 104 (17): 7045-7050
Abstract
We distinguish the contribution of the electrostatic environment, initial conformational ensemble, and macromolecular stability on the folding mechanism of a large RNA using a combination of time-resolved "Fast Fenton" hydroxyl radical footprinting and exhaustive kinetic modeling. This integrated approach allows us to define the folding landscape of the L-21 Tetrahymena thermophila group I intron structurally and kinetically from its earliest steps with unprecedented accuracy. Distinct parallel pathways leading the RNA to its native form upon its Mg(2+)-induced folding are observed. The structures of the intermediates populating the pathways are not affected by variation of the concentration and type of background monovalent ions (electrostatic environment) but are altered by a mutation that destabilizes one domain of the ribozyme. Experiments starting from different conformational ensembles but folding under identical conditions show that whereas the electrostatic environment modulates molecular flux through different pathways, the initial conformational ensemble determines the partitioning of the flux. This study showcases a robust approach for the development of kinetic models from collections of local structural probes.
View details for DOI 10.1073/pnas.0608765104
View details for Web of Science ID 000246024700031
View details for PubMedID 17438287
View details for PubMedCentralID PMC1855354
-
PharmGKB: a logical home for knowledge relating genotype to drug response phenotype
NATURE GENETICS
2007; 39 (4): 426-426
View details for Web of Science ID 000245271200003
View details for PubMedID 17392795
View details for PubMedCentralID PMC3203536
-
The Pharmacogenetics Research Network: From SNP discovery to clinical drug response
CLINICAL PHARMACOLOGY & THERAPEUTICS
2007; 81 (3): 328-345
Abstract
The NIH Pharmacogenetics Research Network (PGRN) is a collaborative group of investigators with a wide range of research interests, but all attempting to correlate drug response with genetic variation. Several research groups concentrate on drugs used to treat specific medical disorders (asthma, depression, cardiovascular disease, addiction of nicotine, and cancer), whereas others are focused on specific groups of proteins that interact with drugs (membrane transporters and phase II drug-metabolizing enzymes). The diverse scientific information is stored and annotated in a publicly accessible knowledge base, the Pharmacogenetics and Pharmacogenomics Knowledge base (PharmGKB). This report highlights selected achievements and scientific approaches as well as hypotheses about future directions of each of the groups within the PGRN. Seven major topics are included: informatics (PharmGKB), cardiovascular, pulmonary, addiction, cancer, transport, and metabolism.
View details for DOI 10.1038/sj.clpt.6100087
View details for Web of Science ID 000244850300011
View details for PubMedID 17339863
-
Biomedical informatics training at Stanford in the 21st century
JOURNAL OF BIOMEDICAL INFORMATICS
2007; 40 (1): 55-58
Abstract
The Stanford Biomedical Informatics training program began with a focus on clinical informatics, and has now evolved into a general program of biomedical informatics training, including clinical informatics, bioinformatics and imaging informatics. The program offers PhD, MS, distance MS, and certificate programs, and is now affiliated with an undergraduate major in biomedical computation. Current dynamics include (1) increased activity in informatics within other training programs in biology and the information sciences, (2) increased desire among informatics students to gain laboratory experience, (3) increased demand for computational collaboration among biomedical researchers, and (4) interaction with the newly formed Department of Bioengineering at Stanford University. The core focus on research training (the development and application of novel informatics methods for biomedical research) keeps the program centered in the midst of this period of growth and diversification.
View details for DOI 10.1016/j.jbi.2006.02.005
View details for Web of Science ID 000243216000007
View details for PubMedID 16564233
-
The PharmGKB: integration, aggregation, and annotation of pharmacogenomic data and knowledge
CLINICAL PHARMACOLOGY & THERAPEUTICS
2007; 81 (1): 21-24
Abstract
The Pharmacogenetics and Pharmacogenomics Knowledge Base, PharmGKB (http://www.pharmgkb.org), curates pharmacogenetic and pharmacogenomic information to generate knowledge concerning the relationships among genes, drugs, and diseases, and the effects of gene variation on these relationships. PharmGKB curators collect information on genotype-phenotype relationships both from the literature and from the deposition of primary research data into our database. Their goal is to catalyze pharmacogenetic and pharmacogenomic research.
View details for DOI 10.1038/sj.clpt.6100048
View details for Web of Science ID 000242874200010
View details for PubMedID 17185992
-
The FEATURE framework for protein function annotation: modelling new functions, improving performance, and extending to novel applications
BMC GENOMICS
2007; 9
Abstract
Structural genomics efforts contribute new protein structures that often lack significant sequence and fold similarity to known proteins. Traditional sequence and structure-based methods may not be sufficient to annotate the molecular functions of these structures. Techniques that combine structural and functional modeling can be valuable for functional annotation. FEATURE is a flexible framework for modeling and recognition of functional sites in macromolecular structures. Here, we present an overview of the main components of the FEATURE framework, and describe the recent developments in its use. These include automating training set selection to increase functional coverage, coupling FEATURE to methods that generate structural diversity, such as molecular dynamics simulations and loop modeling, to improve performance, and using FEATURE in large-scale modeling and structure determination efforts.
View details for DOI 10.1186/1471-2164-9-S2-S2
View details for Web of Science ID 000206244200003
View details for PubMedCentralID PMC2559884
- In Current Pharmacogenomics. Bentham Science Publishers. 2007
- PharmGKB: integration, aggregation, and annotation of pharmacogenomic data and knowledge. Clin Pharmacol Ther. 2007; 81 (1): 21-4
- The education potential of the pharmacogenetics and pharmacogenomics knowledge base (PharmGKB). Clin Pharmacol Ther. 2007; 82 (4): 472-5
- Clustering protein environments for function prediction: finding PROSITE motifs in 3D. BMC Bioinformatics. 2007; 8 Suppl 4: S10. PMCID: PMC1892080
- Proceedings of Pacific Symposium on Biocomputing 2007. edited by Altman, R., Dunker, K., Hunter, L. 2007
- Distinct contribution of electrostatics, initial conformational ensemble, and macromolecular stability in RNA folding. 2007
- ST Weiss and I Zineh for the Pharmacogenetics Research Network. The Pharmacogenetics Research Network: From SNP Discovery to Clinical Drug Response. Clinical Pharmacology & Therapeutics. 2007; 81: 328-345
-
Clustering protein environments for function prediction: finding PROSITE motifs in 3D
2nd Automated Function Prediction Meeting
BIOMED CENTRAL LTD. 2007
Abstract
Structural genomics initiatives are producing increasing numbers of three-dimensional (3D) structures for which there is little functional information. Structure-based annotation of molecular function is therefore becoming critical. We previously presented FEATURE, a method for describing microenvironments around functional sites in proteins. However, FEATURE uses supervised machine learning and so is limited to building models for sites of known importance and location. We hypothesized that there are a large number of sites in proteins that are associated with function that have not yet been recognized. Toward that end, we have developed a method for clustering protein microenvironments in order to evaluate the potential for discovering novel sites that have not been previously identified. We have prototyped a computational method for rapid clustering of millions of microenvironments in order to discover residues whose surrounding environments are similar and which may therefore share a functional or structural role. We clustered nearly 2,000,000 environments from 9,600 protein chains and defined 4,550 clusters. As a preliminary validation, we asked whether known 3D environments associated with PROSITE motifs were "rediscovered". We found examples of clusters highly enriched for residues that share PROSITE sequence motifs. Our results demonstrate that we can cluster protein environments successfully using a simplified representation and a K-means clustering algorithm. The rediscovery of known 3D motifs allows us to calibrate the size and intercluster distances that characterize useful clusters. This information will then allow us to find new clusters with similar characteristics that represent novel structural or functional sites.
View details for Web of Science ID 000247557800010
View details for PubMedID 17570144
View details for PubMedCentralID PMC1892080
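The abstract above names K-means as the clustering step. A minimal sketch of plain (Lloyd's) K-means follows, assuming each microenvironment has already been reduced to a short numeric vector. The function names, the toy vectors, and the evenly spaced initialization are illustrative assumptions and are not taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's K-means; returns (labels, centroids).

    Assumption: centroids are seeded from evenly spaced input points so the
    sketch is deterministic. Real pipelines usually use random or k-means++
    seeding instead.
    """
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2-D "environment" vectors forming two well-separated groups.
toy_environments = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans(toy_environments, k=2)
```

On the toy data the first three vectors end up in one cluster and the last three in the other. At the scale the abstract reports (millions of environments), a vectorized or library implementation would be the practical choice.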
-
Extracting Subject Demographic Information From Abstracts of Randomized Clinical Trial Reports
12th World Congress on Health (Medical) Informatics
I O S PRESS. 2007: 550–554
Abstract
In order to make more informed healthcare decisions, consumers need information systems that deliver accurate and reliable information about their illnesses and potential treatments. Reports of randomized clinical trials (RCTs) provide reliable medical evidence about the efficacy of treatments. Current methods to access, search for, and retrieve RCTs are keyword-based, time-consuming, and suffer from poor precision. Personalized semantic search and medical evidence summarization aim to solve this problem. The performance of these approaches may improve if they have access to study subject descriptors (e.g. age, gender, and ethnicity), trial sizes, and diseases/symptoms studied. We have developed a novel method to automatically extract such subject demographic information from RCT abstracts. We used text classification augmented with a Hidden Markov Model to identify sentences containing subject demographics, and subsequently these sentences were parsed using Natural Language Processing techniques to extract relevant information. Our results show accuracy levels of 82.5%, 92.5%, and 92.0% for extraction of subject descriptors, trial sizes, and diseases/symptoms descriptors respectively.
View details for PubMedID 17911777
-
Integrating large-scale genotype and phenotype data
OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY
2006; 10 (4): 545-554
Abstract
With the completion of the Human Genome Project, a new emphasis is focusing on the sequence variation and the resulting phenotype. The amount of data available from genomic studies addressing this relationship is rapidly growing. In order to analyze these data as a whole, they need to be integrated, aggregated and annotated in a timely manner. The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) assembles and disseminates these data and their associated metadata that are needed for unambiguous identification and replication. Assembling these data in a timely manner is challenging, and the scalability of these data produces major challenges for a knowledge base such as PharmGKB. However, it is only through rapid global meta-annotation of these data that we will understand the relationship between specific genotype(s) and the related phenotype. PharmGKB has confronted these challenges, and these experiences and solutions can benefit all genome communities.
View details for Web of Science ID 000243893500009
View details for PubMedID 17233563
-
Pharmacogenomics: Challenges and opportunities
ANNALS OF INTERNAL MEDICINE
2006; 145 (10): 749-757
Abstract
The outcome of drug therapy is often unpredictable, ranging from beneficial effects to lack of efficacy to serious adverse effects. Variations in single genes are one well-recognized cause of such unpredictability, defining the field of pharmacogenetics (see Glossary). Such variations may involve genes controlling drug metabolism, drug transport, disease susceptibility, or drug targets. The sequencing of the human genome and the cataloguing of variants across human genomes are the enabling resources for the nascent field of pharmacogenomics (see Glossary), which tests the idea that genomic variability underlies variability in drug responses. However, there are many challenges that must be overcome to apply rapidly accumulating genomic information to understand variable drug responses, including defining candidate genes and pathways; relating disease genes to drug response genes; precisely defining drug response phenotypes; and addressing analytic, ethical, and technological issues involved in generation and management of large drug response data sets. Overcoming these challenges holds the promise of improving new drug development and ultimately individualizing the selection of appropriate drugs and dosages for individual patients.
View details for Web of Science ID 000242387100004
View details for PubMedID 17116919
-
Annual progress in bioinformatics 2006
BRIEFINGS IN BIOINFORMATICS
2006; 7 (3): 209-210
View details for DOI 10.1093/bib/bbl029
View details for Web of Science ID 000240964500001
-
The incidentalome - A threat to genomic medicine
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION
2006; 296 (2): 212-215
View details for Web of Science ID 000238946500027
View details for PubMedID 16835427
-
Local kinetic measures of macromolecular structure reveal partitioning among multiple parallel pathways from the earliest steps in the folding of a large RNA molecule
JOURNAL OF MOLECULAR BIOLOGY
2006; 358 (4): 1179-1190
Abstract
At the heart of the RNA folding problem are the number, structures, and relationships among the intermediates that populate the folding pathways of most large RNA molecules. Unique insight into the structural dynamics of these intermediates can be gleaned from the time-dependent changes in local probes of macromolecular conformation (e.g. reports on individual nucleotide solvent accessibility offered by hydroxyl radical (•OH) footprinting). Local measures distributed around a macromolecule individually illuminate the ensemble of separate changes that constitute a folding reaction. Folding pathway reconstruction from a multitude of these individual measures is daunting due to the combinatorial explosion of possible kinetic models as the number of independent local measures increases. Fortunately, clustering of time progress curves sufficiently reduces the dimensionality of the data so as to make reconstruction computationally tractable. The most likely folding topology and intermediates can then be identified by exhaustively enumerating all possible kinetic models on a super-computer grid. The folding pathways and measures of the relative flux through them were determined for Mg²⁺- and Na⁺-mediated folding of the Tetrahymena thermophila group I intron using this combined experimental and computational approach. The flux during Mg²⁺-mediated folding is divided among numerous parallel pathways. In contrast, the flux during the Na⁺-mediated reaction is predominantly restricted through three pathways, one of which is without detectable passage through intermediates. Under both conditions, the folding reaction is highly parallel with no single pathway accounting for more than 50% of the molecular flux. This suggests that RNA folding is non-sequential under a variety of different experimental conditions even at the earliest stages of folding.
This study provides a template for the systematic analysis of the time-evolution of RNA structure from ensembles of local measures that will illuminate the chemical and physical characteristics of each step in the process. The applicability of this analysis approach to other macromolecules is discussed.
View details for DOI 10.1016/j.jmb.2006.02.075
View details for Web of Science ID 000237567000021
View details for PubMedID 16574145
View details for PubMedCentralID PMC2621361
-
Delivering diverse data to multiple audiences: the PharmGKB model
SCIENTIST
2006; 20 (4): 49-50
View details for Web of Science ID 000236528700024
-
Choosing SNPs using feature selection.
Journal of bioinformatics and computational biology
2006; 4 (2): 241-257
Abstract
A major challenge for genomewide disease association studies is the high cost of genotyping large numbers of single nucleotide polymorphisms (SNPs). The correlations between SNPs, however, make it possible to select a parsimonious set of informative SNPs, known as "tagging" SNPs, able to capture most variation in a population. Considerable research interest has recently focused on the development of methods for finding such SNPs. In this paper, we present an efficient method for finding tagging SNPs. The method does not involve computation-intensive search for SNP subsets but discards redundant SNPs using a feature selection algorithm. In contrast to most existing methods, the method presented here does not limit itself to using only correlations between SNPs in local groups. By using correlations that occur across different chromosomal regions, the method can reduce the number of globally redundant SNPs. Experimental results show that the number of tagging SNPs selected by our method is smaller than by using block-based methods. Supplementary website: http://htsnp.stanford.edu/FSFS/.
View details for PubMedID 16819782
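The idea of discarding redundant SNPs by correlation, without restricting the comparison to local blocks, can be sketched as a single greedy pass. This is only an illustration: the r² measure, the 0.8 threshold, the 0/1/2 genotype coding, and every name below are assumptions, and the paper's actual feature selection algorithm may differ.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP is treated as uncorrelated
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def select_tag_snps(genotypes, threshold=0.8):
    """Keep a SNP only if no already-kept SNP predicts it (r^2 >= threshold).

    `genotypes` maps SNP name -> list of 0/1/2 minor-allele counts, one entry
    per individual. Every pair is compared regardless of genomic position,
    so redundancy across distant regions is also removed.
    """
    kept = []
    for name, column in genotypes.items():
        if all(r_squared(column, genotypes[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical genotypes for six individuals.
toy_genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 2],  # identical to snp_a
    "snp_c": [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated with snp_a
    "snp_d": [0, 0, 1, 2, 2, 1],  # uncorrelated with snp_a
}
tag_snps = select_tag_snps(toy_genotypes)
```

Here `snp_b` and `snp_c` are dropped because `snp_a` already captures them, while `snp_d` is kept. The result of a greedy pass depends on the order in which SNPs are visited.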
-
The RNA Ontology Consortium: An open invitation to the RNA community
RNA-A PUBLICATION OF THE RNA SOCIETY
2006; 12 (4): 533-541
Abstract
The aim of the RNA Ontology Consortium (ROC) is to create an integrated conceptual framework, an RNA Ontology (RO), with a common, dynamic, controlled, and structured vocabulary to describe and characterize RNA sequences, secondary structures, three-dimensional structures, and dynamics pertaining to RNA function. The RO should produce tools for clear communication about RNA structure and function for multiple uses, including the integration of RNA electronic resources into the Semantic Web. These tools should allow the accurate description in computer-interpretable form of the coupling between RNA architecture, function, and evolution. The purposes for creating the RO are, therefore, (1) to integrate sequence and structural databases; (2) to allow different computational tools to interoperate; (3) to create powerful software tools that bring advanced computational methods to the bench scientist; and (4) to facilitate precise searches for all relevant information pertaining to RNA. For example, one initial objective of the ROC is to define, identify, and classify RNA structural motifs described in the literature or appearing in databases and to agree on a computer-interpretable definition for each of these motifs. To achieve these aims, the ROC will foster communication and promote collaboration among RNA scientists by coordinating frequent face-to-face workshops to discuss, debate, and resolve difficult conceptual issues. These meeting opportunities will create new directions at various levels of RNA research. The ROC will work closely with the PDB/NDB structural databases and the Gene, Sequence, and Open Biomedical Ontology Consortia to integrate the RO with existing biological ontologies to extend existing content while maintaining interoperability.
View details for DOI 10.1261/rna.2343206
View details for Web of Science ID 000236700200001
View details for PubMedID 16484377
View details for PubMedCentralID PMC1421088
-
Pharmacogenomics: The relevance of emerging genotyping technologies.
MLO: medical laboratory observer
2006; 38 (3): 24-?
View details for PubMedID 16610446
-
Drug targets for Plasmodium falciparum: A post-genomic review/survey
MINI-REVIEWS IN MEDICINAL CHEMISTRY
2006; 6 (2): 177-202
Abstract
Over 300 million cases of malaria each year cause significant morbidity and mortality. Growing drug-resistance among the Plasmodia that cause malaria motivates the development of additional anti-malarial drugs. This review summarizes the current state of knowledge about potential drug targets for malaria. The recently sequenced malaria genome data clarifies parasite metabolic pathways, and more metabolic targets have been identified.
View details for Web of Science ID 000235327300007
View details for PubMedID 16472186
-
A call for the creation of personalized medicine databases
NATURE REVIEWS DRUG DISCOVERY
2006; 5 (1): 23-26
Abstract
The success of the Human Genome Project raised expectations that the knowledge gained would lead to improved insight into human health and disease, identification of new drug targets and, eventually, a breakthrough in healthcare management. However, the realization of these expectations has been hampered by the lack of essential data on genotype–drug-response phenotype associations. We therefore propose a follow-up to the Human Genome Project: forming global consortia devoted to archiving and analysing group and individual patient data on associations between genotypes and drug-response phenotypes. Here, we discuss the rationale for such personalized medicine databases, and the key practical and ethical issues that need to be addressed in their establishment.
View details for DOI 10.1038/nrd1931
View details for Web of Science ID 000234555300014
View details for PubMedID 16374513
-
Physics-based simulation of biological structures
3rd IEEE International Symposium on Biomedical Imaging
IEEE. 2006: 802–803
View details for Web of Science ID 000244446000202
- Proceedings of Pacific Symposium on Biocomputing 2006. edited by Altman, R., Dunker, K., Hunter, L. 2006
-
Structural characterization of proteins using residue environments
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS
2005; 61 (4): 741-747
Abstract
A primary challenge for structural genomics is the automated functional characterization of protein structures. We have developed a sequence-independent method called S-BLEST (Structure-Based Local Environment Search Tool) for the annotation of previously uncharacterized protein structures. S-BLEST encodes the local environment of an amino acid as a vector of structural property values. It has been applied to all amino acids in a nonredundant database of protein structures to generate a searchable structural resource. Given a query amino acid from an experimentally determined or modeled structure, S-BLEST quickly identifies similar amino acid environments using a K-nearest neighbor search. In addition, the method gives an estimation of the statistical significance of each result. We validated S-BLEST on X-ray crystal structures from the ASTRAL 40 nonredundant dataset. We then applied it to 86 crystallographically determined proteins in the Protein Data Bank (PDB) with unknown function and with no significant sequence neighbors in the PDB. S-BLEST was able to associate 20 proteins with at least one local structural neighbor and identify the amino acid environments that are most similar between those neighbors.
View details for DOI 10.1002/prot.20661
View details for Web of Science ID 000233691100005
View details for PubMedID 16245324
View details for PubMedCentralID PMC2483305
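The K-nearest neighbor lookup described in the abstract above can be illustrated with a brute-force search over environment vectors. This is a sketch under stated assumptions: the Euclidean metric, the three-component vectors, and the `env_*` labels are hypothetical, and the significance estimate mentioned in the abstract is not reproduced here.

```python
import heapq
import math


def k_nearest(query, database, k=3):
    """Return the k (distance, label) pairs closest to `query`.

    `database` maps a label to a property vector of the same length as
    `query`. Brute force is used for clarity; a spatial index would be the
    usual choice for a database covering every residue of many structures.
    """
    scored = ((math.dist(query, vector), label) for label, vector in database.items())
    return heapq.nsmallest(k, scored)


# Hypothetical environment vectors (three structural properties each).
toy_database = {
    "env_a": (0.0, 0.0, 0.0),
    "env_b": (1.0, 0.0, 0.0),
    "env_c": (5.0, 5.0, 5.0),
    "env_d": (0.0, 2.0, 0.0),
}
toy_neighbors = k_nearest((0.0, 0.0, 0.5), toy_database, k=2)
```

For the toy query, `env_a` is the closest environment (distance 0.5), followed by `env_b`.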
-
Time to organize the bioinformatics resourceome
PLOS COMPUTATIONAL BIOLOGY
2005; 1 (7): 531-533
View details for DOI 10.1371/journal.pcbi.0010076
View details for Web of Science ID 000239480500002
View details for PubMedID 16738704
View details for PubMedCentralID PMC1323464
-
Health-information altruists - A potentially critical resource
NEW ENGLAND JOURNAL OF MEDICINE
2005; 353 (19): 2074-2077
View details for Web of Science ID 000233119600015
View details for PubMedID 16282184
-
Using Petri net tools to study properties and dynamics of biological systems
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2005; 12 (2): 181-199
Abstract
Petri Nets (PNs) and their extensions are promising methods for modeling and simulating biological systems. We surveyed PN formalisms and tools and compared them based on their mathematical capabilities as well as by their appropriateness to represent typical biological processes. We measured the ability of these tools to model specific features of biological systems and answer a set of biological questions that we defined. We found that different tools are required to provide all capabilities that we assessed. We created software to translate a generic PN model into most of the formalisms and tools discussed. We have also made available three models and suggest that a library of such models would catalyze progress in qualitative modeling via PNs. Development and wide adoption of common formats would enable researchers to share models and use different tools to analyze them without the need to convert to proprietary formats.
View details for DOI 10.1197/jamia.M1637
View details for Web of Science ID 000227842000009
View details for PubMedID 15561791
View details for PubMedCentralID PMC551550
-
SAFA: Semi-automated footprinting analysis software for high-throughput quantification of nucleic acid footprinting experiments
RNA-A PUBLICATION OF THE RNA SOCIETY
2005; 11 (3): 344-354
Abstract
Footprinting is a powerful and widely used tool for characterizing the structure, thermodynamics, and kinetics of nucleic acid folding and ligand binding reactions. However, quantitative analysis of the gel images produced by footprinting experiments is tedious and time-consuming, due to the absence of informatics tools specifically designed for footprinting analysis. We have developed SAFA, a semi-automated footprinting analysis software package that achieves accurate gel quantification while reducing the time to analyze a gel from several hours to 15 min or less. The increase in analysis speed is achieved through a graphical user interface that implements a novel methodology for lane and band assignment, called "gel rectification," and an optimized band deconvolution algorithm. The SAFA software yields results that are consistent with published methodologies and reduces the investigator-dependent variability compared to less automated methods. These software developments simplify the analysis procedure for a footprinting gel and can therefore facilitate the use of quantitative footprinting techniques in nucleic acid laboratories that otherwise might not have considered their use. Further, the increased throughput provided by SAFA may allow a more comprehensive understanding of molecular interactions. The software and documentation are freely available for download at http://safa.stanford.edu.
View details for DOI 10.1261/rna.7214405
View details for Web of Science ID 000227190000011
View details for PubMedID 15701734
View details for PubMedCentralID PMC1262685
-
A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2005; 12 (2): 121-129
Abstract
Biomedical databases summarize current scientific knowledge, but they generally require years of laborious curation effort to build, focusing on identifying pertinent literature and data in the voluminous biomedical literature. It is difficult to manually extract useful information embedded in the large volumes of literature, and automated intelligent text analysis tools are becoming increasingly essential to assist in these curation activities. The goal of the authors was to develop an automated method to identify articles in Medline citations that contain pharmacogenetics data pertaining to gene-drug relationships. The authors built and evaluated several candidate statistical models that characterize pharmacogenetics articles in terms of word usage and the profile of Medical Subject Headings (MeSH) used in those articles. The best-performing model was used to scan the entire Medline article database (11 million articles) to identify candidate pharmacogenetics articles. A sampling of the articles identified from scanning Medline was reviewed by a pharmacologist to assess the precision of the method. The authors' approach identified 4,892 pharmacogenetics articles in the literature with 92% precision. Their automated method took a fraction of the time to acquire these articles compared with the time expected to be taken to accumulate them manually. The authors have built a Web resource (http://pharmdemo.stanford.edu/pharmdb/main.spy) to provide access to their results. A statistical classification approach can screen the primary literature for pharmacogenetics articles with high precision. Such methods may assist curators in acquiring pertinent literature in building biomedical databases.
View details for DOI 10.1197/jamia.M1640
View details for Web of Science ID 000227842000003
View details for PubMedID 15561790
View details for PubMedCentralID PMC551544
-
Biomedical term mapping databases
NUCLEIC ACIDS RESEARCH
2005; 33: D289-D293
Abstract
Longer words and phrases are frequently mapped onto a shorter form such as abbreviations or acronyms for efficiency of communication. These abbreviations are pervasive in all aspects of biology and medicine and as the amount of biomedical literature grows, so does the number of abbreviations and the average number of definitions per abbreviation. Even more confusing, different authors will often abbreviate the same word/phrase differently. This ambiguity impedes our ability to retrieve information, integrate databases and mine textual databases for content. Efforts to standardize nomenclature, especially those doing so retrospectively, need to be aware of different abbreviatory mappings and spelling variations. To address this problem, there have been several efforts to develop computer algorithms to identify the mapping of terms between short and long form within a large body of literature. To date, four such algorithms have been applied to create online databases that comprehensively map biomedical terms and abbreviations within MEDLINE: ARGH (http://lethargy.swmed.edu/ARGH/argh.asp), the Stanford Biomedical Abbreviation Server (http://bionlp.stanford.edu/abbreviation/), AcroMed (http://medstract.med.tufts.edu/acro1.1/index.htm) and SaRAD (http://www.hpl.hp.com/research/idl/projects/abbrev.html). In addition to serving as useful computational tools, these databases serve as valuable references that help biologists keep up with an ever-expanding vocabulary of terms.
View details for DOI 10.1093/nar/gki137
View details for Web of Science ID 000226524300059
View details for PubMedID 15608198
View details for PubMedCentralID PMC540091
-
Challenges in creating an infrastructure for physics-based simulation of biological structures
IEEE Computational Systems Bioinformatics Conference
IEEE COMPUTER SOC. 2005: 3–3
View details for Web of Science ID 000231800100001
- Proceedings of Pacific Symposium on Biocomputing 2005. edited by Altman, R., Dunker, K., Hunter, L. 2005
- Choosing SNPs Using Feature Selection. 2005
- Introduction to ontologies in biomedicine: from powertools to assistants. In Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. Wiley Online Library. 2005: 1
- PharmGKB: The Pharmacogenetics and Pharmacogenomics Knowledge Base. Pharmacogenomics: Methods and Applications edited by Innocenti, F. Totowa: Humana Press. 2005: 177–192
- PharmGKB: The Pharmacogenetics and Pharmacogenomics Knowledge Base. edited by Innocenti, F. 2005
-
Choosing SNPs using feature selection
IEEE Computational Systems Bioinformatics Conference
IEEE COMPUTER SOC. 2005: 301–309
Abstract
A major challenge for genomewide disease association studies is the high cost of genotyping large numbers of single nucleotide polymorphisms (SNPs). The correlations between SNPs, however, make it possible to select a parsimonious set of informative SNPs, known as "tagging" SNPs, able to capture most variation in a population. Considerable research interest has recently focused on the development of methods for finding such SNPs. In this paper, we present an efficient method for finding tagging SNPs. The method does not involve computation-intensive search for SNP subsets but discards redundant SNPs using a feature selection algorithm. In contrast to most existing methods, the method presented here does not limit itself to using only correlations between SNPs in local groups. By using correlations that occur across different chromosomal regions, the method can reduce the number of globally redundant SNPs. Experimental results show that the number of tagging SNPs selected by our method is smaller than by using block-based methods.
View details for Web of Science ID 000231800100034
View details for PubMedID 16447987
-
PharmGKB: the pharmacogenetics and pharmacogenomics knowledge base.
Methods in molecular biology (Clifton, N.J.)
2005; 311: 179-191
Abstract
The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) is an interactive tool for researchers investigating how genetic variation affects drug response. The PharmGKB web site, www.pharmgkb.org, displays genotype, molecular, and clinical primary data integrated with literature, pathway representations, protocol information, and links to additional external resources. Users can search and browse the knowledge base by genes, drugs, diseases, and pathways. Registration is free to the entire research community but subject to an agreement to respect the rights and privacy of the individuals whose information is contained within the database. Registered users can access and download primary data to aid in the design of future pharmacogenetics and pharmacogenomics studies.
View details for PubMedID 16100408
-
Finding haplotype tagging SNPs by use of principal components analysis
AMERICAN JOURNAL OF HUMAN GENETICS
2004; 75 (5): 850-861
Abstract
The immense volume and rapid growth of human genomic data, especially single nucleotide polymorphisms (SNPs), present special challenges for both biomedical researchers and automatic algorithms. One such challenge is to select an optimal subset of SNPs, commonly referred to as "haplotype tagging SNPs" (htSNPs), to capture most of the haplotype diversity of each haplotype block or gene-specific region. This information-reduction process facilitates cost-effective genotyping and, subsequently, genotype-phenotype association studies. It also has implications for assessing the risk of identifying research subjects on the basis of SNP information deposited in public domain databases. We have investigated methods for selecting htSNPs by use of principal components analysis (PCA). These methods first identify eigenSNPs and then map them to actual SNPs. We evaluated two mapping strategies, greedy discard and varimax rotation, by assessing the ability of the selected htSNPs to reconstruct genotypes of non-htSNPs. We also compared these methods with two other htSNP finders, one of which is PCA based. We applied these methods to three experimental data sets and found that the PCA-based methods tend to select the smallest set of htSNPs to achieve a 90% reconstruction precision.
View details for Web of Science ID 000224303500010
View details for PubMedID 15389393
-
Computational functional genomics
IEEE SIGNAL PROCESSING MAGAZINE
2004; 21 (6): 62-69
View details for Web of Science ID 000225031500008
-
Tools for loading MEDLINE into a local relational database
BMC BIOINFORMATICS
2004; 5
Abstract
Researchers who use MEDLINE for text mining, information extraction, or natural language processing may benefit from having a copy of MEDLINE that they can manage locally. The National Library of Medicine (NLM) distributes MEDLINE in eXtensible Markup Language (XML)-formatted text files, but it is difficult to query MEDLINE in that format. We have developed software tools to parse the MEDLINE data files and load their contents into a relational database. Although the task is conceptually straightforward, the size and scope of MEDLINE make the task nontrivial. Given the increasing importance of text analysis in biology and medicine, we believe a local installation of MEDLINE will provide helpful computing infrastructure for researchers. We developed three software packages that parse and load MEDLINE, and ran each package to install separate instances of the MEDLINE database. For each installation, we collected data on loading time and disk-space utilization to provide examples of the process in different settings. Settings differed in terms of commercial database-management system (IBM DB2 or Oracle 9i), processor (Intel or Sun), programming language of installation software (Java or Perl), and methods employed in different versions of the software. The loading times for the three installations were 76 hours, 196 hours, and 132 hours, and disk-space utilization was 46.3 GB, 37.7 GB, and 31.6 GB, respectively. Loading times varied due to a variety of differences among the systems. Loading time also depended on whether data were written to intermediate files or not, and on whether input files were processed in sequence or in parallel. Disk-space utilization depended on the number of MEDLINE files processed, amount of indexing, and whether abstracts were stored as character large objects or truncated. Relational database (RDBMS) technology supports indexing and querying of very large datasets, and can accommodate a locally stored version of MEDLINE.
RDBMS systems support a wide range of queries and facilitate certain tasks that are not directly supported by the application programming interface to PubMed. Because there is variation in hardware, software, and network infrastructures across sites, we cannot predict the exact time required for a user to load MEDLINE, but our results suggest that performance of the software is reasonable. Our database schemas and conversion software are publicly available at http://biotext.berkeley.edu.
View details for DOI 10.1186/1471-2105-5-146
View details for Web of Science ID 000225769500002
View details for PubMedID 15471541
View details for PubMedCentralID PMC524480
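The parse-and-load step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' software (which was written in Java and Perl against IBM DB2 and Oracle 9i): it swaps in Python and an in-memory SQLite database, and the `citation` table, the `load_citations` function, and the sample record are hypothetical. The element names (`MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`) follow the MEDLINE XML layout.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical one-record sample in the MEDLINE citation layout (placeholder content).
SAMPLE_XML = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <ArticleTitle>Placeholder title</ArticleTitle>
      <Abstract><AbstractText>Placeholder abstract text.</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""


def load_citations(xml_text, conn):
    """Parse citations from XML text and insert them into a simple table."""
    # Illustrative schema only; the published schemas are far richer.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation "
        "(pmid INTEGER PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = []
    for cit in root.iter("MedlineCitation"):
        pmid = int(cit.findtext("PMID"))
        title = cit.findtext("Article/ArticleTitle", default="")
        abstract = cit.findtext("Article/Abstract/AbstractText", default="")
        rows.append((pmid, title, abstract))
    conn.executemany("INSERT OR REPLACE INTO citation VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
n_loaded = load_citations(SAMPLE_XML, conn)
# Once loaded, the text can be queried with ordinary SQL.
hits = conn.execute(
    "SELECT pmid FROM citation WHERE abstract LIKE ?", ("%abstract%",)
).fetchall()
```

At MEDLINE scale a loader would stream each file rather than parse it whole in memory, and batch its inserts; the sketch only shows the shape of the task.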
-
Approaches for protecting privacy in the genomic era
GENETIC ENGINEERING NEWS
2004; 24 (17): 6-?
View details for Web of Science ID 000224359100003
-
Extracting and characterizing gene-drug relationships from the literature
PHARMACOGENETICS
2004; 14 (9): 577-586
Abstract
A fundamental task of pharmacogenetics is to collect and classify relationships between genes and drugs. Currently, this useful information has not been comprehensively aggregated in any database and remains scattered throughout the published literature. Although there are efforts to collect this information manually, they are limited by the size of the published literature on gene-drug relationships. Therefore, we investigated computational methods to extract and characterize pharmacogenetic relationships between genes and drugs from the literature. We first evaluated the effectiveness of the co-occurrence method in identifying related genes and drugs. We then used supervised machine learning algorithms to classify the relationships between genes and drugs from the Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) into five categories that have been defined by active pharmacogenetic researchers as relevant to their work. The final co-occurrence algorithm was able to extract 78% of the related genes and drugs that were published in a review article from the literature. Our algorithm subsequently classified the relationships between genes and drugs from the PharmGKB into five categories with 74% accuracy. We have made the data available on a supplementary website at http://bionlp.stanford.edu/genedrug/. Gene-drug relationships can be accurately extracted from text and classified into categories. Although the relationships that we have identified do not capture the details and fine distinctions often made in the literature, these methods will help scientists to track the ever-growing literature and create information resources to support future discoveries.
View details for Web of Science ID 000224107300002
View details for PubMedID 15475731
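The co-occurrence idea mentioned in the abstract above can be sketched as counting how often a gene name and a drug name appear in the same sentence. This is a toy illustration under stated assumptions, not the authors' algorithm: the function name `cooccurrence_counts`, the whitespace tokenization, and the three example sentences are all invented for the sketch.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sentences, genes, drugs):
    """Count sentences in which each (gene, drug) pair appears together."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenization: lowercase, drop commas and periods, split on spaces.
        tokens = set(sentence.lower().replace(",", " ").replace(".", " ").split())
        found_genes = [g for g in genes if g.lower() in tokens]
        found_drugs = [d for d in drugs if d.lower() in tokens]
        counts.update(product(found_genes, found_drugs))
    return counts


# Invented example sentences (not drawn from the study's corpus).
sentences = [
    "CYP2D6 variants alter codeine metabolism.",
    "Codeine response depends on CYP2D6 activity.",
    "TPMT genotype guides mercaptopurine dosing.",
]
pairs = cooccurrence_counts(
    sentences, ["CYP2D6", "TPMT"], ["codeine", "mercaptopurine"]
)
```

Pairs with higher counts are candidate relationships; a real system would also need name normalization and a threshold or statistical test, neither of which is shown here.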
-
Genomic research and human subject privacy
SCIENCE
2004; 305 (5681): 183-183
View details for Web of Science ID 000222501000030
View details for PubMedID 15247459
-
An "omics" view of drug development
DRUG DEVELOPMENT RESEARCH
2004; 62 (2): 81-85
View details for DOI 10.1002/ddr.10370
View details for Web of Science ID 000225497400003
-
Training the next generation of informaticians: The impact of "BISTI" and bioinformatics - A report from the American College of Medical Informatics
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2004; 11 (3): 167-172
Abstract
In 2002-2003, the American College of Medical Informatics (ACMI) undertook a study of the future of informatics training. This project capitalized on the rapidly expanding interest in the role of computation in basic biological research, well characterized in the National Institutes of Health (NIH) Biomedical Information Science and Technology Initiative (BISTI) report. The defining activity of the project was the three-day 2002 Annual Symposium of the College. A committee, comprised of the authors of this report, subsequently carried out activities, including interviews with a broader informatics and biological sciences constituency, collation and categorization of observations, and generation of recommendations. The committee viewed biomedical informatics as an interdisciplinary field, combining basic informational and computational sciences with application domains, including health care, biological research, and education. Consequently, effective training in informatics, viewed from a national perspective, should encompass four key elements: (1) curricula that integrate experiences in the computational sciences and application domains rather than just concatenating them; (2) diversity among trainees, with individualized, interdisciplinary cross-training allowing each trainee to develop key competencies that he or she does not initially possess; (3) direct immersion in research and development activities; and (4) exposure across the wide range of basic informational and computational sciences. Informatics training programs that implement these features, irrespective of their funding sources, will meet and exceed the challenges raised by the BISTI report, and optimally prepare their trainees for careers in a field that continues to evolve.
View details for Web of Science ID 000221546700001
View details for PubMedID 14764617
View details for PubMedCentralID PMC400513
-
Computational analysis of Plasmodium falciparum metabolism: Organizing genomic information to facilitate drug discovery
GENOME RESEARCH
2004; 14 (5): 917-924
Abstract
Identification of novel targets for the development of more effective antimalarial drugs and vaccines is a primary goal of the Plasmodium genome project. However, deciding which gene products are ideal drug/vaccine targets remains a difficult task. Currently, a systematic disruption of every single gene in Plasmodium is technically challenging. Hence, we have developed a computational approach to prioritize potential targets. A pathway/genome database (PGDB) integrates pathway information with information about the complete genome of an organism. We have constructed PlasmoCyc, a PGDB for Plasmodium falciparum 3D7, using its annotated genomic sequence. In addition to the annotations provided in the genome database, we add 956 additional annotations to proteins annotated as "hypothetical" using the GeneQuiz annotation system. We apply a novel computational algorithm to PlasmoCyc to identify 216 "chokepoint enzymes." All three clinically validated drug targets are chokepoint enzymes. A total of 87.5% of proposed drug targets with biological evidence in the literature are chokepoint reactions. Therefore, identifying chokepoint enzymes represents one systematic way to identify potential metabolic drug targets.
View details for DOI 10.1101/gr.2050304
View details for Web of Science ID 000221171700016
View details for PubMedID 15078855
View details for PubMedCentralID PMC479120
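The abstract above does not define a chokepoint. The sketch below assumes the common reading, in which a chokepoint reaction is the only reaction that consumes some metabolite or the only one that produces some metabolite. The function `find_chokepoints` and the toy three-reaction network are hypothetical and are not taken from PlasmoCyc.

```python
from collections import defaultdict


def find_chokepoints(reactions):
    """Return reactions that uniquely consume or uniquely produce a metabolite.

    `reactions` maps a reaction name to a (substrates, products) pair of sets.
    """
    consumers = defaultdict(set)  # metabolite -> reactions that consume it
    producers = defaultdict(set)  # metabolite -> reactions that produce it
    for name, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(name)
        for metabolite in products:
            producers[metabolite].add(name)

    chokepoints = set()
    for table in (consumers, producers):
        for rxns in table.values():
            if len(rxns) == 1:  # exactly one reaction handles this metabolite
                chokepoints |= rxns
    return chokepoints


# Toy network: R1 and R2 are redundant routes from A to B; only R3 uses B and makes C.
reactions = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"B"}),
    "R3": ({"B"}, {"C"}),
}
chokepoints = find_chokepoints(reactions)
```

In this toy network only R3 qualifies, because blocking it leaves B with no consumer and C with no producer, whereas R1 and R2 can substitute for each other.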
-
Eukaryotic regulatory element conservation analysis and identification using comparative genomics
GENOME RESEARCH
2004; 14 (3): 451-458
Abstract
Comparative genomics is a promising approach to the challenging problem of eukaryotic regulatory element identification, because functional noncoding sequences may be conserved across species from evolutionary constraints. We systematically analyzed known human and Saccharomyces cerevisiae regulatory elements and discovered that human regulatory elements are more conserved between human and mouse than are background sequences. Although S. cerevisiae regulatory elements do not appear to be more conserved by comparison of S. cerevisiae to Schizosaccharomyces pombe, they are more conserved when compared with multiple other yeast genomes (Saccharomyces paradoxus, Saccharomyces mikatae, and Saccharomyces bayanus). Based on these analyses, we developed a sequence-motif-finding algorithm called CompareProspector, which extends Gibbs sampling by biasing the search in regions conserved across species. Using human-mouse comparison, CompareProspector identified known motifs for transcription factors Mef2, Myf, Srf, and Sp1 from a set of human-muscle-specific genes. It also discovered the NFAT motif from genes up-regulated by CD28 stimulation in T-cells, which implies the direct involvement of NFAT in mediating the CD28 stimulatory signal. Using Caenorhabditis elegans-Caenorhabditis briggsae comparison, CompareProspector found the PHA-4 motif and the UNC-86 motif. CompareProspector outperformed many other computational motif-finding programs, demonstrating the power of comparative genomics-based biased sampling in eukaryotic regulatory element identification.
View details for Web of Science ID 000189389100013
View details for PubMedID 14993210
-
Editorial: Building successful biological databases
BRIEFINGS IN BIOINFORMATICS
2004; 5 (1): 4-5
View details for Web of Science ID 000222244300001
View details for PubMedID 15153301
-
GAPSCORE: finding gene and protein names one word at a time
BIOINFORMATICS
2004; 20 (2): 216-225
Abstract
New high-throughput technologies have accelerated the accumulation of knowledge about genes and proteins. However, much knowledge is still stored as written natural language text. Therefore, we have developed a new method, GAPSCORE, to identify gene and protein names in text. GAPSCORE scores words based on a statistical model of gene names that quantifies their appearance, morphology and context. We evaluated GAPSCORE against the Yapex data set and achieved an F-score of 82.5% (83.3% recall, 81.5% precision) for partial matches and 57.6% (58.5% recall, 56.7% precision) for exact matches. Since the method is statistical, users can choose score cutoffs that adjust the performance according to their needs. GAPSCORE is available at http://bionlp.stanford.edu/gapscore/
View details for DOI 10.1093/bioinformatics/btg393
View details for Web of Science ID 000188389700012
View details for PubMedID 14734313
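The F-score quoted in the abstract above combines precision and recall into one number. The sketch below assumes the balanced form (the harmonic mean of the two), which the abstract does not state explicitly; the function name `f_score` is illustrative. It is applied to the exact-match precision and recall as reported.

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Exact-match precision and recall as reported in the abstract (percent).
exact = f_score(56.7, 58.5)
```

Because the inputs are already rounded to one decimal place, values recomputed this way can differ slightly from a score computed on the raw counts.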
-
Using surface envelopes for discrimination of molecular models
PROTEIN SCIENCE
2004; 13 (1): 15-24
Abstract
Shape information about macromolecules is increasingly available but is difficult to use in modeling efforts. We demonstrate that shape information alone can often distinguish structural models of biological macromolecules. By using a data structure called a surface envelope (SE) to represent the shape of the molecule, we propose a method that generates a fitness score for the shape of a particular molecular model. This score correlates well with root mean squared deviation (RMSD) of the model to the known test structures and can be used to filter models in decoy sets. The scoring method requires both alignment of the model to the SE in three-dimensional space and assessment of the degree to which atoms in the model fill the SE. Alignment combines a hybrid algorithm using principal components and a previously published iterated closest point algorithm. We test our method against models generated from random atom perturbation from crystal structures, published decoy sets used in structure prediction, and models created from the trajectories of atoms in molecular modeling runs. We also test our alignment algorithm against experimental electron microscopic data from rice dwarf virus. The alignment performance is reliable, and we show a high correlation between model RMSD and score function. This correlation is stronger for molecular models with greater oblong character (as measured by the ratio of largest to smallest principal component).
View details for DOI 10.1110/ps.03385504
View details for Web of Science ID 000187587700002
View details for PubMedID 14691217
View details for PubMedCentralID PMC2286533
-
Modeling and analyzing biomedical processes using workflow/Petri Net models and tools
11th World Congress on Medical Informatics
I O S PRESS. 2004: 74–78
Abstract
Computer simulation enables system developers to execute a model of an actual or theoretical system on a computer and analyze the execution output. We have been exploring the use of Petri Net (PN) tools to study the behavior of systems that are represented using three kinds of biomedical models: a biological workflow model used to represent biological processes, and two different computer-interpretable models of health care processes that are derived from clinical guidelines. We developed and implemented software that maps the three models into a single underlying process model (workflow), which is then converted into PNs in formats that are readable by several PN simulation and analysis tools. These analysis tools enabled us to simulate and study the behavior of two biomedical systems: a Malaria parasite invading a host cell, and patients undergoing management of chronic cough.
View details for Web of Science ID 000226723300016
View details for PubMedID 15360778
- Proceedings of Pacific Symposium on Biocomputing 2004. edited by Altman, R., Dunker, K., Hunter, L. 2004
-
A resource to acquire and summarize pharmacogenetics knowledge in the literature
11th World Congress on Medical Informatics
I O S PRESS. 2004: 793–797
Abstract
To determine how genetic variations contribute to variations in drug response, we need to know the genes that are related to drugs of interest. But there are no publicly available databases of known gene-drug relationships, and it is time-consuming to search the literature for this information. We have developed a resource to support the storage, summarization, and dissemination of key gene-drug interactions of relevance to pharmacogenetics. Extracting all gene-drug relationships from the literature is a daunting task, so we distributed a tool to acquire this knowledge from the scientific community. We also developed a categorization scheme to classify gene-drug relationships according to the type of pharmacogenetic evidence that supports them. Our resource (http://www.pharmgkb.org/home/project-community.jsp) can be queried by gene or drug, and it summarizes gene-drug relationships, categories of evidence, and supporting literature. This resource is growing, containing entries for 138 genes and 215 drugs of pharmacogenetic significance, and is a core component of PharmGKB, a pharmacogenetics knowledge base (http://www.pharmgkb.org).
View details for Web of Science ID 000226723300159
View details for PubMedID 15360921
-
PharmGKB: the pharmacogenetics and pharmacogenomics knowledge base
PHARMACOGENOMICS JOURNAL
2004; 4 (1): 1-1
View details for DOI 10.1038/sj.tpj.6500230
View details for Web of Science ID 000220143500001
View details for PubMedID 14735107
-
Ribosomal dynamics inferred from variations in experimental measurements
RNA-A PUBLICATION OF THE RNA SOCIETY
2003; 9 (11): 1301-1307
Abstract
The crystal structures of the ribosome reveal remarkable complexity and provide a starting set of snapshots with which to understand the dynamics of translation. To augment the static crystallographic models with dynamic information present in crosslink, footprint, and cleavage data, we examined 2691 proximity measurements and focused on the subset that was apparently incompatible with >40 published crystal structures. The measurements from this subset generally involve regions of the structure that are functionally conserved and structurally flexible. Local movements in the crystallographic states of the ribosome that would satisfy biochemical proximity measurements show coherent patterns suggesting alternative conformations of the ribosome. Three different types of data obtained for the two subunits display similar "mismatching" patterns, suggesting that the signals are robust and real. In particular, there is an indication of coherent motion in the decoding region within the 30S subunit and central protuberance and surrounding areas of the 50S subunit. Directions of rearrangements fluctuate around the proposed path of tRNA translocation and the plane parallel to the interface of the two subunits. Our results demonstrate that systematic combination and analysis of noisy, apparently incompatible data sources can provide biologically useful signals about structural dynamics.
View details for Web of Science ID 000186175900001
View details for PubMedID 14561879
-
MutDB: annotating human variation with functionally relevant data
BIOINFORMATICS
2003; 19 (14): 1858-1860
Abstract
We have developed a resource, MutDB (http://mutdb.org/), to aid in determining which single nucleotide polymorphisms (SNPs) are likely to alter the function of their associated protein product. MutDB contains protein structure annotations and comparative genomic annotations for 8000 disease-associated mutations and SNPs found in the UCSC Annotated Genome and the human RefSeq gene set. MutDB provides interactive mutation maps at the gene and protein levels, and allows for ranking of their predicted functional consequences based on conservation in multiple sequence alignments. http://mutdb.org/ Supplementary information: http://mutdb.org/about/about.html
View details for DOI 10.1093/bioinformatics/btg241
View details for Web of Science ID 000185701100022
View details for PubMedID 14512363
-
Investigating hypoxic tumor physiology through gene expression patterns
ONCOGENE
2003; 22 (37): 5907-5914
Abstract
Clinical evidence shows that tumor hypoxia is an independent prognostic indicator of poor patient outcome. Hypoxic tumors have altered physiologic processes, including increased regions of angiogenesis, increased local invasion, increased distant metastasis and altered apoptotic programs. Since hypoxia is a potent controller of gene expression, identifying hypoxia-regulated genes is a means to investigate the molecular response to hypoxic stress. Traditional experimental approaches have identified physiologic changes in hypoxic cells. Recent studies have identified hypoxia-responsive genes that may define the mechanism(s) underlying these physiologic changes. For example, the regulation of glycolytic genes by hypoxia can explain some characteristics of the Warburg effect. The converse of this logic is also true. By identifying new classes of hypoxia-regulated gene(s), we can infer the physiologic pressures that require the induction of these genes and their protein products. Furthermore, these physiologically driven hypoxic gene expression changes give us insight as to the poor outcome of patients with hypoxic tumors. Approximately 1-1.5% of the genome is transcriptionally responsive to hypoxia. However, there is significant heterogeneity in the transcriptional response to hypoxia between different cell types. Moreover, the coordinated change in the expression of families of genes supports the model of physiologic pressure leading to expression changes. Understanding the evolutionary pressure to develop a 'hypoxic response' provides a framework to investigate the biology of the hypoxic tumor microenvironment.
View details for DOI 10.1038/sj.onc.1206703
View details for Web of Science ID 000185086100017
View details for PubMedID 12947397
-
Pharmacokinetics of oral gallium maltolate administered in a single or multiple dose schedule in patients with Paget's disease of bone or primary hyperparathyroidism: A pilot study.
AMER SOC BONE & MINERAL RES. 2003: S391
View details for Web of Science ID 000186080501584
-
Large scale study of protein domain distribution in the context of alternative splicing
NUCLEIC ACIDS RESEARCH
2003; 31 (16): 4828-4835
Abstract
Alternative splicing plays an important role in processes such as development, differentiation and cancer. With the recent increase in the estimates of the number of human genes that undergo alternative splicing from 5 to 35-59%, it is becoming critical to develop a better understanding of its functional consequences and regulatory mechanisms. We conducted a large scale study of the distribution of protein domains in a curated data set of several thousand genes and identified protein domains disproportionately distributed among alternatively spliced genes. We also identified a number of protein domains that tend to be spliced out. Both the proteins having the disproportionately distributed domains as well as those with spliced-out domains are predominantly involved in the processes of cell communication, signaling, development and apoptosis. These proteins function mostly as enzymes, signal transducers and receptors. Somewhat surprisingly, 28% of all occurrences of spliced-out domains are not effected by straightforward exclusion of exons coding for the domains but by inclusion or exclusion of other exons to shift the reading frame while retaining the exons coding for the domains in the final transcripts.
View details for DOI 10.1093/nar/gkg668
View details for Web of Science ID 000184783000020
View details for PubMedID 12907725
View details for PubMedCentralID PMC169920
-
The computational analysis of scientific literature to define and recognize gene expression clusters
NUCLEIC ACIDS RESEARCH
2003; 31 (15): 4553-4560
Abstract
A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present a computational method that leverages the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in the analysis of gene expression data offers an opportunity to incorporate functional information about the genes when defining expression clusters. We have created a method that associates gene expression profiles with known biological functions. Our method has two steps. First, we apply hierarchical clustering to the given gene expression data set. Secondly, we use text from abstracts about genes to (i) resolve hierarchical cluster boundaries to optimize the functional coherence of the clusters and (ii) recognize those clusters that are most functionally coherent. In the case where a gene has not been investigated and therefore lacks primary literature, articles about well-studied homologous genes are added as references. We apply our method to two large gene expression data sets with different properties. The first contains measurements for a subset of well-studied Saccharomyces cerevisiae genes with multiple literature references, and the second contains newly discovered genes in Drosophila melanogaster; many have no literature references at all. In both cases, we are able to rapidly define and identify the biologically relevant gene expression profiles without manual intervention. In both cases, we identified novel clusters that were not noted by the original investigators.
View details for DOI 10.1093/nar/gkg636
View details for Web of Science ID 000184532900040
View details for PubMedID 12888516
View details for PubMedCentralID PMC169898
-
Microenvironment analysis and identification of magnesium binding sites in RNA
NUCLEIC ACIDS RESEARCH
2003; 31 (15): 4450-4460
Abstract
Interactions with magnesium (Mg2+) ions are essential for RNA folding and function. The locations and function of bound Mg2+ ions are difficult to characterize both experimentally and computationally. In particular, the P456 domain of the Tetrahymena thermophila group I intron, and a 58 nt 23S rRNA from Escherichia coli have been important systems for studying the role of Mg2+ binding in RNA, but characteristics of all the binding sites remain unclear. We therefore investigated the Mg2+ binding capabilities of these RNA systems using a computational approach to identify and further characterize their Mg2+ binding sites. The approach is based on the FEATURE algorithm, reported previously for microenvironment analysis of protein functional sites. We have determined novel physicochemical descriptions of site-bound and diffusely bound Mg2+ ions in RNA that are useful for prediction. Electrostatic calculations using the Non-Linear Poisson Boltzmann (NLPB) equation provided further evidence for the locations of site-bound ions. We confirmed the locations of experimentally determined sites and further differentiated between classes of ion binding. We also identified potentially important, high scoring sites in the group I intron that are not currently annotated as Mg2+ binding sites. We note their potential function and believe they deserve experimental follow-up.
View details for DOI 10.1093/nar/gkg471
View details for Web of Science ID 000184532900029
View details for PubMedID 12888505
View details for PubMedCentralID PMC169872
-
A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae)
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2003; 100 (14): 8348-8353
Abstract
Genomic sequencing is no longer a novelty, but gene function annotation remains a key challenge in modern biology. A variety of functional genomics experimental techniques are available, from classic methods such as affinity precipitation to advanced high-throughput techniques such as gene expression microarrays. In the future, more disparate methods will be developed, further increasing the need for integrated computational analysis of data generated by these studies. We address this problem with MAGIC (Multisource Association of Genes by Integration of Clusters), a general framework that uses formal Bayesian reasoning to integrate heterogeneous types of high-throughput biological data (such as large-scale two-hybrid screens and multiple microarray analyses) for accurate gene function prediction. The system formally incorporates expert knowledge about relative accuracies of data sources to combine them within a normative framework. MAGIC provides a belief level with its output that allows the user to vary the stringency of predictions. We applied MAGIC to Saccharomyces cerevisiae genetic and physical interactions, microarray, and transcription factor binding sites data and assessed the biological relevance of gene groupings using Gene Ontology annotations produced by the Saccharomyces Genome Database. We found that by creating functional groupings based on heterogeneous data types, MAGIC improved accuracy of the groupings compared with microarray analysis alone. We describe several of the biological gene groupings identified.
View details for DOI 10.1073/pnas.0832373100
View details for Web of Science ID 000184222500057
View details for PubMedID 12826619
View details for PubMedCentralID PMC166232
-
Inclusion of textual documentation in the analysis of multidimensional data sets: Application to gene expression data
MACHINE LEARNING
2003; 52 (1-2): 119-145
View details for Web of Science ID 000183199900007
-
WebFEATURE: an interactive web tool for identifying and visualizing functional sites on macromolecular structures
NUCLEIC ACIDS RESEARCH
2003; 31 (13): 3324-3327
Abstract
WebFEATURE (http://feature.stanford.edu/webfeature/) is a web-accessible structural analysis tool that allows users to scan query structures for functional sites in both proteins and nucleic acids. WebFEATURE is the public interface to the scanning algorithm of the FEATURE package, a supervised learning algorithm for creating and identifying 3D, physicochemical motifs in molecular structures. Given an input structure or Protein Data Bank identifier (PDB ID), and a statistical model of a functional site, WebFEATURE will return rank-scored 'hits' in 3D space that identify regions in the structure where similar distributions of physicochemical properties occur relative to the site model. Users can visualize and interactively manipulate scored hits and the query structure in web browsers that support the Chime plug-in. Alternatively, results can be downloaded and visualized through other freely available molecular modeling tools, like RasMol, PyMOL and Chimera. A major application of WebFEATURE is in rapid annotation of function to structures in the context of structural genomics.
View details for DOI 10.1093/nar/gkg553
View details for Web of Science ID 000183832900010
View details for PubMedID 12824318
View details for PubMedCentralID PMC168960
-
Identification of promoter regions in the human genome by using a retroviral plasmid library-based functional reporter gene assay
GENOME RESEARCH
2003; 13 (7): 1765-1774
Abstract
Attempts to identify regulatory sequences in the human genome have involved experimental and computational methods such as cross-species sequence comparisons and the detection of transcription factor binding-site motifs in coexpressed genes. Although these strategies provide information on which genomic regions are likely to be involved in gene regulation, they do not give information on their functions. We have developed a functional selection for promoter regions in the human genome that uses a retroviral plasmid library-based system. This approach enriches for and detects promoter function of isolated DNA fragments in an in vitro cell culture assay. By using this method, we have discovered likely promoters of known and predicted genes, as well as many other putative promoter regions based on the presence of features such as CpG islands. Comparison of sequences of 858 plasmid clones selected by this assay with the human genome draft sequence indicates that a significantly higher percentage of sequences align to the 500-bp segment upstream of the transcription start sites of known genes than would be expected from random genomic sequences. We also observed enrichment for putative promoter regions of genes predicted in at least two annotation databases and for clones overlapping with CpG islands. Functional validation of randomly selected clones enriched by this method showed that a large fraction of these putative promoters can drive the expression of a reporter gene in transient transfection experiments. This method promises to be a useful genome-wide function-based approach that can complement existing methods to look for promoters.
View details for DOI 10.1101/gr.529803
View details for Web of Science ID 000183970000023
View details for PubMedID 12805274
-
Genetic sequence data for pharmacogenomics
CURRENT OPINION IN DRUG DISCOVERY & DEVELOPMENT
2003; 6 (3): 297-303
Abstract
Pharmacogenetics is the study of how variation in human genes leads to variation in response to drugs. Pharmacogenomics is the term applied to large-scale genomic approaches to pharmacogenetics, and it is currently characterized chiefly by the use of high-throughput DNA sequencing to identify sequence variations in pharmacologically important genes. Genes of interest for pharmacogenomics include genes involved in drug metabolism and transport, as well as genes that are drug targets. The past year has seen an increasing number of systematic surveys of genetic variation that establish reliable baseline measurements of sequence variation--at least in coding and promoter regions. These surveys form the basis for determination of population frequencies, genetic linkage studies and association studies relating genotype with drug response phenotypes of interest.
View details for Web of Science ID 000183571800002
View details for PubMedID 12833660
-
A functional analysis of disease-associated mutations in the androgen receptor gene
NUCLEIC ACIDS RESEARCH
2003; 31 (8)
Abstract
Mutations in the androgen receptor (AR) are associated with a variety of diseases including androgen insensitivity syndrome and prostate cancer, but the way in which these mutations cause disease is poorly understood. We present a method for distinguishing likely disease-causing mutations from mutations that are merely associated with disease but have no causal role. Our method uses a measure of nucleotide conservation, and we find that conservation often correlates with severity of the clinical phenotype. Further, by only including mutations whose pathogenicity has been proven experimentally, this correlation is enhanced in the case of prostate cancer-associated mutations. Our method provides a means for assessing the significance of single nucleotide polymorphisms (SNPs) and cancer-associated mutations.
View details for DOI 10.1093/nar/gng042
View details for Web of Science ID 000182161400002
View details for PubMedID 12682377
View details for PubMedCentralID PMC153754
-
Recognizing complex, asymmetric functional sites in protein structures using a Bayesian scoring function.
Journal of bioinformatics and computational biology
2003; 1 (1): 119-138
Abstract
The increase in known three-dimensional protein structures enables us to build statistical profiles of important functional sites in protein molecules. These profiles can then be used to recognize sites in large-scale automated annotations of new protein structures. We report an improved FEATURE system which recognizes functional sites in protein structures. FEATURE defines multi-level physico-chemical properties and recognizes sites based on the spatial distribution of these properties in the sites' microenvironments. It uses a Bayesian scoring function to compare a query region with the statistical profile built from known examples of sites and control nonsites. We have previously shown that FEATURE can accurately recognize calcium-binding sites and have reported interesting results scanning for calcium-binding sites in the entire Protein Data Bank. Here we report the ability of the improved FEATURE to characterize and recognize geometrically complex and asymmetric sites such as ATP-binding sites and disulfide bond-forming sites. FEATURE does not rely on conserved residues or conserved residue geometry of the sites. We also demonstrate that, in the absence of a statistical profile of the sites, FEATURE can use an artificially constructed profile based on a priori knowledge to recognize the sites in new structures, using redoxin active sites as an example.
View details for PubMedID 15290784
-
Complexities of managing biomedical information.
Omics : a journal of integrative biology
2003; 7 (1): 127-129
View details for PubMedID 12831574
-
A literature-based method for assessing the functional coherence of a gene group
BIOINFORMATICS
2003; 19 (3): 396-401
Abstract
Many experimental and algorithmic approaches in biology generate groups of genes that need to be examined for related functional properties. For example, gene expression profiles are frequently organized into clusters of genes that may share functional properties. We evaluate a method, neighbor divergence per gene (NDPG), that uses scientific literature to assess whether a group of genes are functionally related. The method requires only a corpus of documents and an index connecting the documents to genes. We evaluate NDPG on 2796 functional groups generated by the Gene Ontology consortium in four organisms: mouse, fly, worm and yeast. NDPG finds functional coherence in 96, 92, 82 and 45% of the groups (at 99.9% specificity) in yeast, mouse, fly and worm respectively.
View details for DOI 10.1093/bioinformatics/btg002
View details for Web of Science ID 000181303000011
View details for PubMedID 12584126
View details for PubMedCentralID PMC2669934
-
Mining heterogeneous ribosomal structure data
47th Annual Meeting of the Biophysical-Society
CELL PRESS. 2003: 463A–463A
View details for Web of Science ID 000183123802264
-
Knowledge acquisition, consistency checking and concurrency control for Gene Ontology (GO)
BIOINFORMATICS
2003; 19 (2): 241-248
Abstract
A critical element of the computational infrastructure required for functional genomics is a shared language for communicating biological data and knowledge. The Gene Ontology (GO; http://www.geneontology.org) provides a taxonomy of concepts and their attributes for annotating gene products. As GO increases in size, its ongoing construction and maintenance becomes more challenging. In this paper, we assess the applicability of a Knowledge Base Management System (KBMS), Protégé-2000, to the maintenance and development of GO. We transferred GO to Protégé-2000 in order to evaluate its suitability for GO. The graphical user interface supported browsing and editing of GO. Tools for consistency checking identified minor inconsistencies in GO and opportunities to reduce redundancy in its representation. The Protégé Axiom Language proved useful for checking ontological consistency. The PROMPT tool allowed us to track changes to GO. Using Protégé-2000, we tested our ability to make changes and extensions to GO to refine the semantics of attributes and classify more concepts. Gene Ontology in Protégé-2000 and the associated code are located at http://smi.stanford.edu/projects/helix/gokbms/. Protégé-2000 is available from http://protege.stanford.edu.
View details for Web of Science ID 000180913600011
View details for PubMedID 12538245
-
Defining bioinformatics and structural bioinformatics.
Methods of biochemical analysis
2003; 44: 3-14
View details for PubMedID 12647379
-
Automatic construction of 3D structural motifs for protein function prediction
2nd International Computational Systems Bioinformatics Conference
IEEE COMPUTER SOC. 2003: 613–614
View details for Web of Science ID 000188997700136
- Proceedings of Pacific Symposium on Biocomputing 2003. edited by Altman, R., Dunker, K., Hunter, L. 2003
- Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). 2003
- Preface. Bioinformatics and Functional Genomics. 2003
- A Personalized and Automated dbSNP Surveillance System. 2003
- The expanding scope of bioinformatics: sequence analysis and beyond. Heredity 2003; 5 (90): 345
- Recognizing Complex, Asymmetric Functional Sites in Protein Structures Using a Bayesian Scoring Function. Journal of Bioinformatics and Computational Biology 2003; 1 (1): 119-138
-
A personalized and automated dbSNP surveillance system
2nd International Computational Systems Bioinformatics Conference
IEEE COMPUTER SOC. 2003: 132–136
Abstract
The development of high throughput techniques and large-scale studies in the biological sciences has given rise to an explosive growth in both the volume and types of data available to researchers. A surveillance system that monitors data repositories and reports changes helps manage the data overload. We developed a dbSNP surveillance system (URL: http://www.pharmgkb.org/do/serve?id=tools.surveillance.dbsnp) that performs surveillance on the dbSNP database and alerts users to new information. The system is notable because it is personalized and fully automated. Each registered user has a list of genes to follow and receives notification of new entries concerning these genes. The system integrates data from dbSNP, LocusLink, PharmGKB, and Genbank to position SNPs on reference sequences and classify SNPs into categories such as synonymous and non-synonymous SNPs. The system uses data warehousing, object model-based data integration, object-oriented programming, and a platform-neutral data access mechanism.
View details for Web of Science ID 000188997700026
View details for PubMedID 16452787
-
Automated construction of structural motifs for predicting functional sites on protein structures.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2003: 204-215
Abstract
Structural genomics initiatives are beginning to rapidly generate vast numbers of protein structures. For many of the structures, functions are not yet determined and high-throughput methods for determining function are necessary. Although there has been extensive work in function prediction at the sequence level, predicting function at the structure level may provide better sensitivity and predictive value. We describe a method to predict functional sites by automatically creating three dimensional structural motifs from amino acid sequence motifs. These structural motifs perform comparably well with manually generated structural motifs and perform better than sequence motifs. Automatically generated structural motifs can be used for structural-genomic scale function prediction on protein structures.
View details for PubMedID 12603029
-
Indexing pharmacogenetic knowledge on the World Wide Web
PHARMACOGENETICS
2003; 13 (1): 3-5
View details for Web of Science ID 000180584000002
View details for PubMedID 12544507
-
Qualitative models of molecular function: Linking genetic polymorphisms of tRNA to their functional sequelae
PROCEEDINGS OF THE IEEE
2002; 90 (12): 1875-1886
View details for DOI 10.1109/JPROC.2002.805304
View details for Web of Science ID 000179509700007
-
Creating an online dictionary of abbreviations from MEDLINE
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2002; 9 (6): 612-620
Abstract
The growth of the biomedical literature presents special challenges for both human readers and automatic algorithms. One such challenge derives from the common and uncontrolled use of abbreviations in the literature. Each additional abbreviation increases the effective size of the vocabulary for a field. Therefore, to create an automatically generated and maintained lexicon of abbreviations, we have developed an algorithm to match abbreviations in text with their expansions. Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune. We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database. On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database. We have developed an algorithm to identify abbreviations from text. We are making this available as a public abbreviation server at http://abbreviation.stanford.edu/.
View details for DOI 10.1197/jamia.M1139
View details for Web of Science ID 000178914400005
View details for PubMedID 12386112
View details for PubMedCentralID PMC349378
-
Nonparametric methods for identifying differentially expressed genes in microarray data
BIOINFORMATICS
2002; 18 (11): 1454-1461
Abstract
Gene expression experiments provide a fast and systematic way to identify disease markers relevant to clinical care. In this study, we address the problem of robust identification of differentially expressed genes from microarray data. Differentially expressed genes, or discriminator genes, are genes with significantly different expression in two user-defined groups of microarray experiments. We compare three model-free approaches: (1) nonparametric t-test, (2) Wilcoxon (or Mann-Whitney) rank sum test, and (3) a heuristic method based on high Pearson correlation to a perfectly differentiating gene ('ideal discriminator method'). We systematically assess the performance of each method based on simulated and biological data under varying noise levels and p-value cutoffs. All methods exhibit very low false positive rates and identify a large fraction of the differentially expressed genes in simulated data sets with noise level similar to that of actual data. Overall, the rank sum test appears most conservative, which may be advantageous when the computationally identified genes need to be tested biologically. However, if a more inclusive list of markers is desired, a higher p-value cutoff or the nonparametric t-test may be appropriate. When applied to data from lung tumor and lymphoma data sets, the methods identify biologically relevant differentially expressed genes that allow clear separation of groups in question. Thus the methods described and evaluated here provide a convenient and robust way to identify differentially expressed genes for further biological and clinical analysis.
View details for Web of Science ID 000179249800008
View details for PubMedID 12424116
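As an illustration of two of the three approaches named in the abstract above, here is a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the rank sum z-score uses the textbook normal approximation with midranks and no tie correction, and the "ideal discriminator" is assumed to be a 0/1 template vector over the two groups. The nonparametric t-test is not covered.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no discriminating signal
    return sxy / math.sqrt(sxx * syy)


def ideal_discriminator_score(expression, labels):
    """Correlate one gene's expression with a 0/1 group template.

    Assumption: the 'perfectly differentiating gene' is modeled as the
    label vector itself (0 for one group, 1 for the other).
    """
    return pearson(expression, [float(label) for label in labels])


def rank_sum_z(group_a, group_b):
    """Wilcoxon rank sum z-score for group_a (normal approximation)."""
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    n1, n2 = len(group_a), len(group_b)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j for ties
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (rank_sum_a - mean) / sd
```

A gene would be flagged when the absolute z-score or the absolute correlation passes a chosen cutoff; how that cutoff maps to a p-value is left out of this sketch.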
-
Using text analysis to identify functionally coherent gene groups
GENOME RESEARCH
2002; 12 (10): 1582-1590
Abstract
The analysis of large-scale genomic information (such as sequence data or expression patterns) frequently involves grouping genes on the basis of common experimental features. Often, as with gene expression clustering, there are too many groups to easily identify the functionally relevant ones. One valuable source of information about gene function is the published literature. We present a method, neighbor divergence, for assessing whether the genes within a group share a common biological function based on their associated scientific literature. The method uses statistical natural language processing techniques to interpret biological text. It requires only a corpus of documents relevant to the genes being studied (e.g., all genes in an organism) and an index connecting the documents to appropriate genes. Given a group of genes, neighbor divergence assigns a numerical score indicating how "functionally coherent" the gene group is from the perspective of the published literature. We evaluate our method by testing its ability to distinguish 19 known functional gene groups from 1900 randomly assembled groups. Neighbor divergence achieves 79% sensitivity at 100% specificity, comparing favorably to other tested methods. We also apply neighbor divergence to previously published gene expression clusters to assess its ability to recognize gene groups that had been manually identified as representative of a common function.
View details for DOI 10.1101/gr.116402
View details for Web of Science ID 000178396400014
View details for PubMedID 12368251
View details for PubMedCentralID PMC187532
-
Promises of text processing: natural language processing meets AI
DRUG DISCOVERY TODAY
2002; 7 (19): 992-993
View details for Web of Science ID 000178338600006
View details for PubMedID 12546913
-
Emerging scientific applications in data mining
COMMUNICATIONS OF THE ACM
2002; 45 (8): 54-58
View details for Web of Science ID 000177087200015
-
Determining the genomic locations of repetitive DNA sequences with a whole-genome microarray: IS6110 in Mycobacterium tuberculosis
JOURNAL OF CLINICAL MICROBIOLOGY
2002; 40 (6): 2192-2198
Abstract
The mycobacterial insertion sequence IS6110 has been exploited extensively as a clonal marker in molecular epidemiologic studies of tuberculosis. In addition, it has been hypothesized that this element is an important driving force behind genotypic variability that may have phenotypic consequences. We present here a novel, DNA microarray-based methodology, designated SiteMapping, that simultaneously maps the locations and orientations of multiple copies of IS6110 within the genome. To investigate the sensitivity, accuracy, and limitations of the technique, it was applied to eight Mycobacterium tuberculosis strains for which complete or partial IS6110 insertion site information had been determined previously. SiteMapping correctly located 64% (38 of 59) of the IS6110 copies predicted by restriction fragment length polymorphism analysis. The technique is highly specific; 97% of the predicted insertion sites were true insertions. Eight previously unknown insertions were identified and confirmed by PCR or sequencing. The performance could be improved by modifications in the experimental protocol and in the approach to data analysis. SiteMapping has general applicability and demonstrates an expansion in the applications of microarrays that complements conventional approaches in the study of genome architecture.
View details for DOI 10.1128/JCM.40.6.2192-2198.2002
View details for Web of Science ID 000176159200048
View details for PubMedID 12037086
View details for PubMedCentralID PMC130717
-
Modelling biological processes using workflow and Petri Net models
BIOINFORMATICS
2002; 18 (6): 825-837
Abstract
Biological processes can be considered at many levels of detail, ranging from atomic mechanism to general processes such as cell division, cell adhesion or cell invasion. The experimental study of protein function and gene regulation typically provides information at many levels. The representation of hierarchical process knowledge in biology is therefore a major challenge for bioinformatics. To represent high-level processes in the context of their component functions, we have developed a graphical knowledge model for biological processes that supports methods for qualitative reasoning. We assessed eleven diverse models that were developed in the fields of software engineering, business, and biology, to evaluate their suitability for representing and simulating biological processes. Based on this assessment, we combined the best aspects of two models: Workflow/Petri Net and a biological concept model. The Workflow model can represent nesting and ordering of processes, the structural components that participate in the processes, and the roles that they play. It also maps to Petri Nets, which allow verification of formal properties and qualitative simulation. The biological concept model, TAMBIS, provides a framework for describing biological entities that can be mapped to the workflow model. We tested our model by representing malaria parasites invading host erythrocytes, and composed queries, in five general classes, to discover relationships among processes and structural components. We used reachability analysis to answer queries about the dynamic aspects of the model. The model is available at http://smi.stanford.edu/projects/helix/pubs/process-model/.
View details for Web of Science ID 000176553400006
View details for PubMedID 12075018
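The reachability analysis mentioned in the abstract above can be illustrated with a small Python sketch of a place/transition Petri net explored by breadth-first search. This is a generic textbook construction, not the published model: the place names, transition names and the two-step "attach, then invade" toy net are hypothetical and chosen only to echo the erythrocyte invasion example.

```python
from collections import deque


def enabled(marking, inputs):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n for place, n in inputs.items())


def fire(marking, inputs, outputs):
    """Return the new marking after consuming inputs and producing outputs."""
    new = dict(marking)
    for place, n in inputs.items():
        new[place] = new.get(place, 0) - n
    for place, n in outputs.items():
        new[place] = new.get(place, 0) + n
    return new


def freeze(marking):
    """Hashable form of a marking that ignores empty places."""
    return tuple(sorted((p, n) for p, n in marking.items() if n))


def reachable_markings(initial, transitions, limit=10000):
    """Breadth-first enumeration of markings reachable from `initial`.

    `limit` caps the search because unbounded nets have infinitely
    many markings.
    """
    seen = {freeze(initial)}
    queue = deque([initial])
    while queue and len(seen) < limit:
        marking = queue.popleft()
        for inputs, outputs in transitions.values():
            if enabled(marking, inputs):
                nxt = fire(marking, inputs, outputs)
                key = freeze(nxt)
                if key not in seen:
                    seen.add(key)
                    queue.append(nxt)
    return seen


# Hypothetical toy net: a parasite attaches to an erythrocyte, then invades.
TOY_TRANSITIONS = {
    "attach": ({"free_parasite": 1, "erythrocyte": 1}, {"attached": 1}),
    "invade": ({"attached": 1}, {"invaded": 1}),
}
TOY_INITIAL = {"free_parasite": 1, "erythrocyte": 1}
```

A query such as "can the invaded state ever occur?" then reduces to checking whether a marking containing a token in `invaded` appears in `reachable_markings(TOY_INITIAL, TOY_TRANSITIONS)`.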
-
RNAML: A standard syntax for exchanging RNA information
RNA-A PUBLICATION OF THE RNA SOCIETY
2002; 8 (6): 707-717
Abstract
Analyzing a single data set using multiple RNA informatics programs often requires a file format conversion between each pair of programs, significantly hampering productivity. To facilitate the interoperation of these programs, we propose a syntax to exchange basic RNA molecular information. This RNAML syntax allows for the storage and the exchange of information about RNA sequence and secondary and tertiary structures. The syntax permits the description of higher level information about the data including, but not restricted to, base pairs, base triples, and pseudoknots. A class-oriented approach allows us to represent data common to a given set of RNA molecules, such as a sequence alignment and a consensus secondary structure. Documentation about experiments and computations, as well as references to journals and external databases, are included in the syntax. The chief challenge in creating such a syntax was to determine the appropriate scope of usage and to ensure extensibility as new needs will arise. The syntax complies with the eXtensible Markup Language (XML) recommendations, a widely accepted standard for syntax specifications. In addition to the various generic packages that exist to read and interpret XML formats, an XML processor was developed and put in the open-source MC-Core library for nucleic acid and protein structure computer manipulation.
View details for DOI 10.1017/S1355838202028017
View details for Web of Science ID 000176277100001
View details for PubMedID 12088144
View details for PubMedCentralID PMC1370290
-
Mining biochemical information: Lessons taught by the ribosome
RNA-A PUBLICATION OF THE RNA SOCIETY
2002; 8 (3): 279-289
Abstract
The publication of the crystal structures of the ribosome offers an opportunity to retrospectively evaluate the information content of hundreds of qualitative biochemical and biophysical studies of these structures. We assessed the correspondence between more than 2,500 experimental proximity measurements and the distances observed in the ribosomal crystals. Although detailed experimental procedures and protocols are unique in almost each analyzed paper, the data can be grouped into subsets with similar patterns and analyzed in an integrative fashion. We found that, for crosslinking, footprinting, and cleavage data, the corresponding distances observed in crystal structures generally did not exceed the maximum values expected (from the estimated length of the agent and maximal anticipated deviations from the conformations found in crystals). However, the distribution of distances had heavier tails than those typically assumed when building three-dimensional models, and the fraction of incompatible distances was greater than expected. Some of these incompatibilities can be attributed to the experimental methods used. In addition, the accuracy of these procedures appears to be sensitive to the different reactivities, flexibilities, and interactions among the components. These findings demonstrate the necessity of a very careful analysis of data used for structural modeling and consideration of all possible parameters that could potentially influence the quality of measurements. We conclude that experimental proximity measurements can provide useful distance information for structural modeling, but with a broad distribution of inferred distance ranges. We also conclude that development of automated modeling approaches would benefit from better annotations of experimental data for detection and interpretation of their significance.
View details for DOI 10.1017/S135583820202407X
View details for Web of Science ID 000175155500002
View details for PubMedID 12003488
View details for PubMedCentralID PMC1370250
-
Challenges for biomedical informatics and pharmacogenomics
ANNUAL REVIEW OF PHARMACOLOGY AND TOXICOLOGY
2002; 42: 113-133
Abstract
Pharmacogenomics requires the integration and analysis of genomic, molecular, cellular, and clinical data, and it thus offers a remarkable set of challenges to biomedical informatics. These include infrastructural challenges such as the creation of data models and databases for storing these data, the integration of these data with external databases, the extraction of information from natural language text, and the protection of databases with sensitive information. There are also scientific challenges in creating tools to support gene expression analysis, three-dimensional structural analysis, and comparative genomic analysis. In this review, we summarize the current uses of informatics within pharmacogenomics and show how the technical challenges that remain for biomedical informatics are typical of those that will be confronted in the postgenomic era.
View details for Web of Science ID 000174038800007
View details for PubMedID 11807167
-
Modeling molecular function and failure: Misreading of genetic code by the ribosome
CELL PRESS. 2002: 167A–168A
View details for Web of Science ID 000173252700826
- Qualitative models of molecular function: linking genetic polymorphisms of tRNA to their functional sequelae. edited by Akay, M. 2002
- Representing genetic sequence data for pharmacogenomics: an evolutionary approach using ontological and relational models. Bioinformatics, 18 Suppl 1 2002: S207-S215
- Using Binning to Maintain Confidentiality of Medical Data. 2002
- Scoring Functions Sensitive to Alignment Error Have a More Difficult Search: A Paradox for Threading. In Structures and Mechanisms ACS Publications. 2002: 309–320
- Proceedings of Pacific Symposium on Biocomputing 2002. edited by Altman, R., Dunker, K., Hunter, L. 2002
- Preface. Microarrays For An Integrative Genomics. 2002: xii-xv
- Emerging Scientific Applications in Data Mining. Communications of the ACM 2002; 45 (8): 54-58
-
Scoring functions sensitive to alignment error have a more difficult search: A paradox for threading
Symposium held in Honor of William N Lipscomb on Structures and Mechanisms
AMER CHEMICAL SOC. 2002: 309–320
View details for Web of Science ID 000181756700019
-
Using binning to maintain confidentiality of medical data
Annual Symposium of the American-Medical-Informatics-Association
HANLEY & BELFUS INC MED PUBLISHERS. 2002: 454–458
Abstract
Biomedical informatics in general and pharmacogenomics in particular require a research platform that simultaneously enables discovery while protecting research subjects' privacy and information confidentiality. The development of inexpensive DNA sequencing and analysis technologies promises unprecedented database access to very specific information about individuals. To allow analysis of this data without compromising the research subjects' privacy, we must develop methods for removing identifying information from medical and genomic data. In this paper, we build upon the idea that binned database records are more difficult to trace back to individuals. We represent symbolic and numeric data hierarchically, and bin them by generalizing the records. We measure the information loss due to binning using an information theoretic measure called mutual information. The results show that we can bin the data to different levels of precision and use the bin size to control the tradeoff between privacy and data resolution.
View details for Web of Science ID 000189418100092
View details for PubMedID 12463865
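The tradeoff described in the abstract above can be made concrete with a short Python sketch that bins numeric values and measures how much information the bins retain about the originals using empirical mutual information. This is an assumption-laden illustration, not the paper's procedure: the fixed-width binning rule, the function names and the example ages are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits from paired samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def bin_numeric(value, width):
    """Generalize a number to the closed interval of `width` that contains it."""
    low = (value // width) * width
    return (low, low + width - 1)


# Hypothetical example: eight distinct ages generalized to decades.
ages = [21, 22, 23, 24, 31, 32, 33, 34]
decades = [bin_numeric(a, 10) for a in ages]

bits_original = mutual_information(ages, ages)    # information in the raw values
bits_retained = mutual_information(ages, decades)  # information kept after binning
bits_lost = bits_original - bits_retained
```

Wider bins make records harder to trace back to individuals but retain fewer bits, which is the privacy versus resolution tradeoff the abstract refers to.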
-
Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature
GENOME RESEARCH
2002; 12 (1): 203-214
Abstract
Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We compared three document classification methods (maximum entropy modeling, naïve Bayes classification, and nearest-neighbor classification) to the problem of associating a set of GO codes (for biological process) to literature abstracts and thus to the genes associated with the abstracts. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72% when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.
View details for Web of Science ID 000173064900022
View details for PubMedID 11779846
-
Automating data acquisition into ontologies from pharmacogenetics relational data sources using declarative object definitions and XML.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2002: 88-99
Abstract
Ontologies are useful for organizing large numbers of concepts having complex relationships, such as the breadth of genetic and clinical knowledge in pharmacogenomics. But because ontologies change and knowledge evolves, it is time consuming to maintain stable mappings to external data sources that are in relational format. We propose a method for interfacing ontology models with data acquisition from external relational data sources. This method uses a declarative interface between the ontology and the data source, and this interface is modeled in the ontology and implemented using XML schema. Data is imported from the relational source into the ontology using XML, and data integrity is checked by validating the XML submission with an XML schema. We have implemented this approach in PharmGKB (http://www.pharmgkb.org/), a pharmacogenetics knowledge base. Our goals were to (1) import genetic sequence data, collected in relational format, into the pharmacogenetics ontology, and (2) automate the process of updating the links between the ontology and data acquisition when the ontology changes. We tested our approach by linking PharmGKB with data acquisition from a relational model of genetic sequence information. The ontology subsequently evolved, and we were able to rapidly update our interface with the external data and continue acquiring the data. Similar approaches may be helpful for integrating other heterogeneous information sources in order to make the diversity of pharmacogenetics data amenable to computational analysis.
View details for PubMedID 11928521
-
Representing genetic sequence data for pharmacogenomics: an evolutionary approach using ontological and relational models.
Bioinformatics
2002; 18: S207-15
Abstract
The information model chosen to store biological data affects the types of queries possible, database performance, and difficulty in updating that information model. Genetic sequence data for pharmacogenetics studies can be complex, and the best information model to use may change over time. As experimental and analytical methods change, and as biological knowledge advances, the data storage requirements and types of queries needed may also change. We developed a model for genetic sequence and polymorphism data, and used XML Schema to specify the elements and attributes required for this model. We implemented this model as an ontology in a frame-based representation and as a relational model in a database system. We collected genetic data from two pharmacogenetics resequencing studies, and formulated queries useful for analysing these data. We compared the ontology and relational models in terms of query complexity, performance, and difficulty in changing the information model. Our results demonstrate benefits of evolving the schema for storing pharmacogenetics data: ontologies perform well in early design stages as the information model changes rapidly and simplify query formulation, while relational models offer improved query speed once the information model and types of queries needed stabilize.
View details for PubMedID 12169549
-
Ontology development for a pharmacogenetics knowledge base.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2002: 65-76
Abstract
Research directed toward discovering how genetic factors influence a patient's response to drugs requires coordination of data produced from laboratory experiments, computational methods, and clinical studies. A public repository of pharmacogenetic data should accelerate progress in the field of pharmacogenetics by organizing and disseminating public datasets. We are developing a pharmacogenetics knowledge base (PharmGKB) to support the storage and retrieval of both experimental data and conceptual knowledge. PharmGKB is an Internet-based resource that integrates complex biological, pharmacological, and clinical data in such a way that researchers can submit their data and users can retrieve information to investigate genotype-phenotype correlations. Successful management of the names, meaning, and organization of concepts used within the system is crucial. We have selected a frame-based knowledge-representation system for development of an ontology of concepts and relationships that represent the domain and that permit storage of experimental data. Preliminary experience shows that the ontology we have developed for gene-sequence data allows us to accept, store, and query data submissions.
View details for PubMedID 11928517
-
PharmGKB: The Pharmacogenetics Knowledge Base
NUCLEIC ACIDS RESEARCH
2002; 30 (1): 163-165
Abstract
The Pharmacogenetics Knowledge Base (PharmGKB; http://www.pharmgkb.org/) contains genomic, phenotype and clinical information collected from ongoing pharmacogenetic studies. Tools to browse, query, download, submit, edit and process the information are available to registered research network members. A subset of the tools is publicly available. PharmGKB currently contains over 150 genes under study, 14 Coriell populations and a large ontology of pharmacogenetics concepts. The pharmacogenetic concepts and the experimental data are interconnected by a set of relations to form a knowledge base of information for pharmacogenetic researchers. The information in PharmGKB, and its associated tools for processing that information, are tailored for leading-edge pharmacogenetics research. The PharmGKB project was initiated in April 2000 and the first version of the knowledge base went online in February 2001.
View details for Web of Science ID 000173077100041
View details for PubMedID 11752281
View details for PubMedCentralID PMC99138
-
Diversity of gene expression in adenocarcinoma of the lung
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2001; 98 (24): 13784-13789
Abstract
The global gene expression profiles for 67 human lung tumors representing 56 patients were examined by using 24,000-element cDNA microarrays. Subdivision of the tumors based on gene expression patterns faithfully recapitulated morphological classification of the tumors into squamous, large cell, small cell, and adenocarcinoma. The gene expression patterns made possible the subclassification of adenocarcinoma into subgroups that correlated with the degree of tumor differentiation as well as patient survival. Gene expression analysis thus promises to extend and refine standard pathologic analysis.
View details for PubMedID 11707590
-
Challenges for intelligent systems in biology
IEEE INTELLIGENT SYSTEMS
2001; 16 (6): 14-18
View details for Web of Science ID 000172527000005
-
Missing value estimation methods for DNA microarrays
BIOINFORMATICS
2001; 17 (6): 520-525
Abstract
Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
View details for Web of Science ID 000169404700005
View details for PubMedID 11395428
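The weighted K-nearest-neighbors idea named in the abstract above can be illustrated with a minimal sketch. This is not the published KNNimpute code: the function name `knn_impute`, the root-mean-squared distance over shared columns, and the inverse-distance weighting are all assumptions chosen for illustration.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries using a weighted average of the k nearest rows.

    matrix: list of rows (genes), each a list of floats or None (missing).
    Assumed distance: root mean squared difference over columns observed
    in both rows. Assumed weights: 1 / distance.
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(matrix):
                # A neighbor must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum(shared) / len(shared))
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # leave the entry missing if no row is usable
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            # Small constant avoids division by zero for identical rows.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = total / sum(weights)
    return filled


# Hypothetical toy matrix: 3 genes x 3 conditions, one missing value.
toy = [[1.0, 2.0, None],
       [1.0, 2.0, 3.0],
       [10.0, 20.0, 30.0]]
print(knn_impute(toy, k=1)[0][2])
```

With `k=1` the missing entry is taken from the single closest row, so the toy example fills it from the second gene rather than the distant third one.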
-
Whole-genome expression analysis: challenges beyond clustering
CURRENT OPINION IN STRUCTURAL BIOLOGY
2001; 11 (3): 340-347
Abstract
Measuring the expression of most or all of the genes in a biological system raises major analytic challenges. A wealth of recent reports uses microarray expression data to examine diverse biological phenomena - from basic processes in model organisms to complex aspects of human disease. After an initial flurry of methods for clustering the data on the basis of similarity, the field has recognized some longer-term challenges. Firstly, there are efforts to understand the sources of noise and variation in microarray experiments in order to increase the biological signal. Secondly, there are efforts to combine expression data with other sources of information to improve the range and quality of conclusions that can be drawn. Finally, techniques are now emerging to reconstruct networks of genetic interactions in order to create integrated and systematic models of biological systems.
View details for Web of Science ID 000169375000013
View details for PubMedID 11406385
-
Basic microarray analysis: grouping and feature reduction
TRENDS IN BIOTECHNOLOGY
2001; 19 (5): 189-193
Abstract
DNA microarray technologies are useful for addressing a broad range of biological problems - including the measurement of mRNA expression levels in target cells. These studies typically produce large data sets that contain measurements on thousands of genes under hundreds of conditions. There is a critical need to summarize this data and to pick out the important details. The most common activities, therefore, are to group together microarray data and to reduce the number of features. Both of these activities can be done using only the raw microarray data (unsupervised methods) or using external information that provides labels for the microarray data (supervised methods). We briefly review supervised and unsupervised methods for grouping and reducing data in the context of a publicly available suite of tools called CLEAVER, and illustrate their application on a representative data set collected to study lymphoma.
View details for Web of Science ID 000168716800008
View details for PubMedID 11301132
-
Including biological literature improves homology search.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2001: 374-383
Abstract
Annotating the tremendous amount of sequence information being generated requires accurate automated methods for recognizing homology. Although sequence similarity is only one of many indicators of evolutionary homology, it is often the only one used. Here we find that supplementing sequence similarity with information from biomedical literature is successful in increasing the accuracy of homology search results. We modified the PSI-BLAST algorithm to use literature similarity in each iteration of its database search. The modified algorithm is evaluated and compared to standard PSI-BLAST in searching for homologous proteins. The performance of the modified algorithm achieved 32% recall with 95% precision, while the original one achieved 33% recall with 84% precision; the literature similarity requirement preserved the sensitive characteristic of the PSI-BLAST algorithm while improving the precision.
View details for PubMedID 11262956
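The recall and precision figures quoted in the abstract above follow the usual set-based definitions. A minimal sketch, assuming hypothetical protein identifiers and a helper named `precision_recall` that is not part of PSI-BLAST:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of search hits.

    precision = true positives / number of hits returned
    recall    = true positives / number of true homologs
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
hits = ["P1", "P2", "P3", "P9"]
homologs = ["P1", "P2", "P3", "P4", "P5", "P6"]
precision, recall = precision_recall(hits, homologs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Filtering hits by an extra criterion such as literature similarity can only remove hits, so it tends to raise precision while leaving recall equal or slightly lower, which matches the direction of the trade-off the abstract reports.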
- Using metacomputing tools to facilitate large scale analyses of biological databases. edited by Altman, R., Dunker, K., Hunter, L. 2001
- Challenges for intelligent systems in biology. IEEE Intelligent Systems 2001; 16 (6): 14-18
- Proceedings of Pacific Symposium on Biocomputing 2001. edited by Altman, R., Dunker, K., Hunter, L. 2001
-
ViewFeature: integrated feature analysis and visualization.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2001: 240-250
Abstract
Visualization interfaces for high performance computing systems pose special problems due to the complexity and volume of data these systems manipulate. In the post-genomic era, scientists must be able to quickly gain insight into structure-function problems, and require flexible computing environments to quickly create interfaces that link the relevant tools. Feature, a program for analyzing protein sites, takes a set of 3-dimensional structures and creates statistical models of sites of structural or functional significance. Until now, Feature has provided no support for visualization, which can make understanding its results difficult. We have developed an extension to the molecular visualization program Chimera that integrates Feature's statistical models and site predictions with 3-dimensional structures viewed in Chimera. We call this extension ViewFeature, and it is designed to help users understand the structural features that define a site of interest. We applied ViewFeature in an analysis of the enolase superfamily, a functionally distinct class of proteins that share a common fold, the alpha/beta barrel, in order to gain a more complete understanding of the conserved physical properties of this superfamily. In particular, we wanted to define the structural determinants that distinguish the enolase superfamily active site scaffold from other alpha/beta barrel superfamilies and particularly from other metal-binding alpha/beta barrel proteins. Through the use of ViewFeature, we have found that the C-terminal domain of the enolase superfamily does not differ at the scaffold level from metal-binding alpha/beta barrels. We are, however, able to differentiate between the metal-binding sites of alpha/beta barrels and those of other metal-binding proteins. We describe the overall architectural features of enolases in a radius of 10 Angstroms around the active site.
View details for PubMedID 11262944
-
Using meta computing tools to facilitate large-scale analyses of biological databases.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2001: 360-371
Abstract
Given the high rate at which biological data are being collected and made public, it is essential that computational tools be developed that are capable of efficiently accessing and analyzing these data. High-performance distributed computing resources can play a key role in enabling large-scale analyses of biological databases. We use a distributed computing environment, Legion, to enable large-scale computations on the Protein Data Bank (PDB). In particular, we employ the Feature program to scan all protein structures in the PDB in search of unrecognized potential cation binding sites. We evaluate the efficiency of Legion's parallel execution capabilities and analyze the initial biological implications that result from having a site annotation scan of the entire PDB. We discuss four interesting proteins with unannotated, high-scoring candidate cation binding sites.
View details for PubMedID 11262955
-
Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenetics Research Network and Knowledge Base.
PHARMACOGENOMICS JOURNAL
2001; 1 (3): 167-170
View details for PubMedID 11908751
-
Constrained global optimization for estimating molecular structure from atomic distances
JOURNAL OF COMPUTATIONAL BIOLOGY
2001; 8 (5): 523-547
Abstract
Finding optimal three-dimensional molecular configurations based on a limited amount of experimental and/or theoretical data requires efficient nonlinear optimization algorithms. Optimization methods must be able to find atomic configurations that are close to the absolute, or global, minimum error and also satisfy known physical constraints such as minimum separation distances between atoms (based on van der Waals interactions). The most difficult obstacles in these types of problems are that 1) using a limited amount of input data leads to many possible local optima and 2) introducing physical constraints, such as minimum separation distances, helps to limit the search space but often makes convergence to a global minimum more difficult. We introduce a constrained global optimization algorithm that is robust and efficient in yielding near-optimal three-dimensional configurations that are guaranteed to satisfy known separation constraints. The algorithm uses an atom-based approach that reduces the dimensionality and allows for tractable enforcement of constraints while maintaining good global convergence properties. We evaluate the new optimization algorithm using synthetic data from the yeast phenylalanine tRNA and several proteins, all with known crystal structure taken from the Protein Data Bank. We compare the results to commonly applied optimization methods, such as distance geometry, simulated annealing, continuation, and smoothing. We show that compared to other optimization approaches, our algorithm is able to combine sparse input data with physical constraints in an efficient manner to yield structures with lower root mean squared deviation.
View details for Web of Science ID 000171950200005
View details for PubMedID 11694181
-
Biomedical computation at Stanford University: a larger umbrella for the future
M D COMPUTING
2000; 17 (6): 35-37
View details for Web of Science ID 000165970200020
View details for PubMedID 11189759
-
The interactions between clinical informatics and bioinformatics: A case study
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2000; 7 (5): 439-443
Abstract
For the past decade, Stanford Medical Informatics has combined clinical informatics and bioinformatics research and training in an explicit way. The interest in applying informatics techniques to both clinical problems and problems in basic science can be traced to the Dendral project in the 1960s. Having bioinformatics and clinical informatics in the same academic unit is still somewhat unusual and can lead to clashes of clinical and basic science cultures. Nevertheless, the benefits of this organization have recently become clear, as the landscape of academic medicine in the next decades has begun to emerge. The author provides examples of technology transfer between clinical informatics and bioinformatics that illustrate how they complement each other.
View details for PubMedID 10984462
-
Calculation of the relative geometry of tRNAs in the ribosome from directed hydroxyl-radical probing data
RNA-A PUBLICATION OF THE RNA SOCIETY
2000; 6 (2): 220-232
Abstract
The many interactions of tRNA with the ribosome are fundamental to protein synthesis. During the peptidyl transferase reaction, the acceptor ends of the aminoacyl and peptidyl tRNAs must be in close proximity to allow peptide bond formation, and their respective anticodons must base pair simultaneously with adjacent trinucleotide codons on the mRNA. The two tRNAs in this state can be arranged in two nonequivalent general configurations called the R and S orientations, many versions of which have been proposed for the geometry of tRNAs in the ribosome. Here, we report the combined use of computational analysis and tethered hydroxyl-radical probing to constrain their arrangement. We used Fe(II) tethered to the 5' end of anticodon stem-loop analogs (ASLs) of tRNA and to the 5' end of deacylated tRNA(Phe) to generate hydroxyl radicals that probe proximal positions in the backbone of adjacent tRNAs in the 70S ribosome. We inferred probe-target distances from the resulting RNA strand cleavage intensities and used these to calculate the mutual arrangement of A-site and P-site tRNAs in the ribosome, using three different structure estimation algorithms. The two tRNAs are constrained to the S configuration with an angle of about 45 degrees between the respective planes of the molecules. The terminal phosphates of 3'CCA are separated by 23 Å when using the tRNA crystal conformations, and the anticodon arms of the two tRNAs are sufficiently close to interact with adjacent codons in mRNA.
View details for Web of Science ID 000085267900007
View details for PubMedID 10688361
View details for PubMedCentralID PMC1369908
-
Generating interactive molecular documentaries using a library of graphical actions.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2000: 266-277
Abstract
Paper-based publishing of scientific articles limits the types of presentations that can be used. The emergence of electronic publishing has created opportunities to increase the range of formats available for conveying scientific content. We introduce the Graphical Explanation Markup Language, GEML, implemented as an XML format for defining molecular documentaries which exploit the interactive capabilities of electronic publishing. GEML builds upon existing molecular structure definitions such as the Protein Data Bank (PDB) standard file format. GEML provides a library of gestures (or actions) commonly used for structural explanations, and is extensible. XML allows us to separate explicit statements about how to highlight a molecular structure from the implementation of these instructions. We also present GEIS (Generator of Explanatory Interactive Systems), a program that takes as input a GEML documentary definition file and produces all the files necessary for an interactive, web-based molecular documentary. To demonstrate GEML and GEIS, we constructed a documentary capturing the difficult 3D notions expressed in two selected published reports about human topoisomerase I. We have created a prototype Java application, GEMLBuilder, as an editor of GEML files.
View details for PubMedID 10902175
-
The new peer review
Annual Symposium of the American-Medical-Informatics-Association
HANLEY & BELFUS INC. 2000: 433–437
Abstract
It is widely recognized that the Internet has fundamentally changed the dynamics of publication, and in particular, it is clear that there is no effective way to control the release of any web-based publication. The scientific and lay literature is now accessible to the public with unprecedented ease. Recent proposals to start a life sciences online repository of preprints highlight the trend towards "publish first, review later" that seems to be emerging. Does this mean that the peer review process is dead? It certainly suggests that there is a need for a change in how the process works. We discuss currently available technologies to enable the implementation of a new, distributed peer review process benefiting multiple user communities.
View details for Web of Science ID 000170207500089
View details for PubMedID 11079920
View details for PubMedCentralID PMC2244085
- Proceedings of Pacific Symposium on Biocomputing 2000. edited by Altman, R., Dunker, K., Hunter, L. 2000
- Calculation of the relative geometry of tRNAs in the ribosome from directed hydroxyl-radical probing data. RNA, PMCID: PMC1369908. 2000; 6: 220-232
- Bioinformatics. Medical Informatics: Computer Applications in Health Care edited by Shortliffe, T., Wiederhold, G., Fagan, L. Heidelberg: Springer-Verlag. 2000: 638–660
- National Research Council Panel. Networking Health: Prescriptions for the Internet. Washington, DC: National Academy Press. 2000: 1
-
Pattern recognition of genomic features with microarrays: site typing of Mycobacterium tuberculosis strains.
Proceedings. International Conference on Intelligent Systems for Molecular Biology
2000; 8: 286-295
Abstract
Mycobacterium tuberculosis (M. tb.) strains differ in the number and locations of a transposon-like insertion sequence known as IS6110. Accurate detection of this sequence can be used as a fingerprint for individual strains, but can be difficult because of noisy data. In this paper, we propose a non-parametric discriminant analysis method for predicting the locations of the IS6110 sequence from microarray data. Polymerase chain reaction extension products generated from primers specific for the insertion sequence are hybridized to a microarray containing targets corresponding to each open reading frame in M. tb. To test for insertion sites, we use microarray intensity values extracted from small windows of contiguous open reading frames. Rank-transformation of spot intensities and first-order differences in local windows provide enough information to reliably determine the presence of an insertion sequence. The nonparametric approach outperforms all other methods tested in this study.
View details for PubMedID 10977090
-
Computational modeling of structural experimental data
RNA-LIGAND INTERACTIONS PT A
2000; 317: 470-491
View details for Web of Science ID 000087898000028
View details for PubMedID 10829296
-
Principal components analysis to summarize microarray experiments: application to sporulation time series.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2000: 455-466
Abstract
A series of microarray experiments produces observations of differential expression for thousands of genes across multiple conditions. It is often not clear whether a set of experiments are measuring fundamentally different gene expression states or are measuring similar states created through different mechanisms. It is useful, therefore, to define a core set of independent features for the expression states that allow them to be compared directly. Principal components analysis (PCA) is a statistical technique for determining the key variables in a multidimensional data set that explain the differences in the observations, and can be used to simplify the analysis and visualization of multidimensional data sets. We show that application of PCA to expression data (where the experimental conditions are the variables, and the gene expression measurements are the observations) allows us to summarize the ways in which gene responses vary under different conditions. Examination of the components also provides insight into the underlying factors that are measured in the experiments. We applied PCA to the publicly released yeast sporulation data set (Chu et al. 1998). In that work, 7 different measurements of gene expression were made over time. PCA on the time-points suggests that much of the observed variability in the experiment can be summarized in just 2 components--i.e. 2 variables capture most of the information. These components appear to represent (1) overall induction level and (2) change in induction level over time. We also examined the clusters proposed in the original paper, and show how they are manifested in principal component space. Our results are available on the internet at http://www.smi.stanford.edu/project/helix/PCArray.
View details for PubMedID 10902193
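The setup described in the abstract above, with experimental conditions as variables and genes as observations, can be sketched in a few lines. This is an illustrative sketch only: the function name `first_principal_component`, the use of power iteration on the sample covariance matrix, and the toy data are assumptions, not the analysis pipeline used in the paper.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, explained_fraction) for the leading component.

    rows: observations (genes) x variables (conditions or time points).
    Uses power iteration on the sample covariance matrix (an assumed,
    simple alternative to a full eigendecomposition).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Non-uniform start vector, to reduce the chance of starting
    # orthogonal to the leading eigenvector.
    v = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance in the data
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                     for a in range(p))
    total = sum(cov[a][a] for a in range(p))
    return v, (eigenvalue / total if total else 0.0)


# Hypothetical toy data: 4 genes measured at 2 time points.
genes = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
loadings, fraction = first_principal_component(genes)
print(loadings, fraction)
```

In this toy data the genes differ only in overall level, so the leading component has equal loadings on both time points and accounts for essentially all of the variance.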
-
AI in medicine - The spectrum of challenges from managed care to molecular medicine
AI MAGAZINE
1999; 20 (3): 67-77
View details for Web of Science ID 000083035800004
-
Automated diagnosis of data-model conflicts using metadata
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
1999; 6 (5): 374-392
Abstract
The authors describe a methodology for helping computational biologists diagnose discrepancies they encounter between experimental data and the predictions of scientific models. The authors call these discrepancies data-model conflicts. They have built a prototype system to help scientists resolve these conflicts in a more systematic, evidence-based manner. In computational biology, data-model conflicts are the result of complex computations in which data and models are transformed and evaluated. Increasingly, the data, models, and tools employed in these computations come from diverse and distributed resources, contributing to a widening gap between the scientist and the original context in which these resources were produced. This contextual rift can contribute to the misuse of scientific data or tools and amplifies the problem of diagnosing data-model conflicts. The authors' hypothesis is that systematic collection of metadata about a computational process can help bridge the contextual rift and provide information for supporting automated diagnosis of these conflicts. The methodology involves three major steps. First, the authors decompose the data-model evaluation process into abstract functional components. Next, they use this process decomposition to enumerate the possible causes of the data-model conflict and direct the acquisition of diagnostically relevant metadata. Finally, they use evidence statically and dynamically generated from the metadata collected to identify the most likely causes of the given conflict. They describe how these methods are implemented in a knowledge-based system called GRENDEL and show how GRENDEL can be used to help diagnose conflicts between experimental data and computationally built structural models of the 30S ribosomal subunit.
View details for Web of Science ID 000082447300006
View details for PubMedID 10495098
-
RiboWeb: An ontology-based system for collaborative molecular biology
IEEE INTELLIGENT SYSTEMS & THEIR APPLICATIONS
1999; 14 (5): 68-76
View details for Web of Science ID 000082944500017
-
Sophia: A flexible, Web-based knowledge server
IEEE INTELLIGENT SYSTEMS & THEIR APPLICATIONS
1999; 14 (4): 79-85
View details for Web of Science ID 000081953300015
-
Are predicted structures good enough to preserve functional sites?
STRUCTURE
1999; 7 (6): 643-650
Abstract
A principal goal of structure prediction is the elucidation of function. We have studied the ability of computed models to preserve the microenvironments of functional sites. In particular, 653 model structures of a calcium-binding protein (generated using an ab initio folding protocol) were analyzed, and the degree to which calcium-binding sites were recognizable was assessed. While some model structures preserve the calcium-binding microenvironments, many others, including some with low root mean square deviations (rmsds) from the crystal structure of the native protein, do not. There is a very weak correlation between the overall rmsd of a structure and the preservation of calcium-binding sites. Only when the quality of the model structure is high (rmsd less than 2 Å for atoms in the 7 Å local neighborhood around calcium) does the modeling of the binding sites become reliable. Protein structure prediction methods need to be assessed in terms of their preservation of functional sites. High-resolution structures are necessary for identifying binding sites such as calcium-binding sites.
View details for Web of Science ID 000080967100007
View details for PubMedID 10404593
-
Using imperfect secondary structure predictions to improve molecular structure computations
BIOINFORMATICS
1999; 15 (1): 53-65
Abstract
Until ab initio structure prediction methods are perfected, the estimation of structure for protein molecules will depend on combining multiple sources of experimental and theoretical data. Secondary structure predictions are a particularly useful source of structural information, but are currently only approximately 70% correct, on average. Structure computation algorithms which incorporate secondary structure information must therefore have methods for dealing with predictions that are imperfect. We have modified our algorithm for probabilistic least squares structural computations to accept 'disjunctive' constraints, in which a constraint is provided as a set of possible values, each weighted with a probability. Thus, when a helix is predicted, the distances associated with a helix are given most of the weight, but some weights can be allocated to the other possibilities (strand and coil). We have tested a variety of strategies for this weighting scheme in conjunction with a baseline synthetic set of sparse distance data, and compared it with strategies which do not use disjunctive constraints. Naive interpretations in which predictions were taken as 100% correct led to poor-quality structures. Interpretations that allow disjunctive constraints are quite robust, and even relatively poor predictions (58% correct) can significantly increase the quality of computed structures (almost halving the RMS error from the known structure). Secondary structure predictions can be used to improve the quality of three-dimensional structural computations. In fact, when interpreted appropriately, imperfect predictions can provide almost as much improvement as perfect predictions in three-dimensional structure calculations.
View details for Web of Science ID 000079090200006
View details for PubMedID 10068692
- Proceedings of Pacific Symposium on Biocomputing 1999. edited by Altman, R., Dunker, K., Hunter, L. 1999
- RiboWeb: An Ontology-Based System for Collaborative Molecular Biology. IEEE Intelligent Systems and Their Applications 1999; 14 (5): 68-76
- AI in medicine: The spectrum of challenges from managed care to molecular medicine. AI Magazine 1999; 20 (3): 67-77
-
Hierarchical organization of molecular structure computations
JOURNAL OF COMPUTATIONAL BIOLOGY
1998; 5 (3): 409-422
Abstract
The task of computing molecular structure from combinations of experimental and theoretical constraints is expensive because of the large number of estimated parameters (the 3D coordinates of each atom) and the rugged landscape of many objective functions. For large molecular ensembles with multiple protein and nucleic acid components, the problem of maintaining tractability in structural computations becomes critical. A well-known strategy for solving difficult problems is divide-and-conquer. For molecular computations, there are two ways in which problems can be divided: (1) using the natural hierarchy within biological macromolecules (taking advantage of primary sequence, secondary structural subunits and tertiary structural motifs, when they are known); and (2) using the hierarchy that results from analyzing the distribution of structural constraints (providing information about which substructures are constrained to one another). In this paper, we show that these two hierarchies can be complementary and can provide information for efficient decomposition of structural computations. We demonstrate five methods for building such hierarchies--two automated heuristics that use both natural and empirical hierarchies, one knowledge-based process using both hierarchies, one method based on the natural hierarchy alone, and for completeness one random hierarchy oblivious to auxiliary information--and apply them to a data set for the procaryotic 30S ribosomal subunit using our probabilistic least squares structure estimation algorithm. We show that the three methods that combine natural hierarchies with empirical hierarchies create decompositions which increase the efficiency of computations by as much as 50-fold. There is only half this gain when using the natural decomposition alone, while the random hierarchy suggests that a speedup of about five can be expected just by virtue of having a decomposition. Although the knowledge-based method performs marginally better, the automatic heuristics are easier to use, scale more reliably to larger problems, and can match the performance of knowledge-based methods if provided with basic structural information.
View details for Web of Science ID 000075921100005
View details for PubMedID 9773341
-
Reuse, CORBA, and knowledge-based systems
INTERNATIONAL JOURNAL OF HUMAN-COMPUTER STUDIES
1998; 49 (4): 523-546
View details for Web of Science ID 000076973800010
-
Bioinformatics in support of molecular medicine
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
1998: 53-61
Abstract
Bioinformatics studies two important information flows in modern biology. The first is the flow of genetic information from the DNA of an individual organism up to the characteristics of a population of such organisms (with an eventual passage of information back to the genetic pool, as encoded within DNA). The second is the flow of experimental information from observed biological phenomena to models that explain them, and then to new experiments in order to test these models. The discipline of bioinformatics has its roots in a number of activities, including the organization of DNA sequence and protein three-dimensional structural data collections in the 1960's and 1970's. It has become a booming academic and industrial enterprise with the introduction of biological experiments that rapidly produce massive amounts of data (such as the multiple genome sequencing projects, the large scale analysis of gene expression, and the large scale analysis of protein-protein interactions). Basic biological science has always had an impact on clinical medicine (and clinical medical information systems), and is creating a new generation of epidemiologic, diagnostic, prognostic, and treatment modalities. Bioinformatics efforts that appear to be wholly geared towards basic science are likely to become relevant to clinical informatics in the coming decade. For example, DNA sequence information and sequence annotations will appear in the medical chart with increasing frequency. The algorithms developed for research in bioinformatics will soon become part of clinical information systems.
View details for Web of Science ID 000171768600009
View details for PubMedID 9929182
- Bioinformatics in Support of Molecular Medicine. 1998
- MHCWeb: Converting a WWW Database into a Knowledge-based Collaborative Environment. 1998
- A Curriculum for Bioinformatics: The Time is Ripe. Bioinformatics 1998; 14 (7): 549-550
- Graphical Style Sheets: Towards Reusable Representations of Biomedical Graphics. 1998
- Determination of the Spatial Distribution of Protein Structure Using Solution Data. edited by Jaroszewski, J., Schaumburg, K., Kofod, H. 1998
- SOPHIA: Providing Basic Knowledge Services with a Common DBMS. edited by Borgida, A., Chaudhri, V., Staudt, M. 1998
- Updated Bibliography Using the RELATED ARTICLES Function within PubMed. 1998
- Probabilistic and Statistical Descriptions of Protein Structure. Computational Biology: Pattern Analysis and Machine Learning Methods edited by Salzberg, S., Searls, D., Kasif, S. London, UK: Elsevier Science. 1998: 207–225
- The Hierarchical Organization of Molecular Structure Computation. In: RECOMB-98 New York: ACM Press. 1998: 51–59
- Proceedings of Pacific Symposium on Biocomputing 1998. edited by Altman, R., Dunker, K., Hunter, L. 1998
- PROTEAN: Deriving Protein Structure from Constraints. Blackboard Systems edited by Engelmore, R., Morgan, A. Wokingham: Addison-Wesley. 1998: 417–431
- The Hierarchical Organization of Molecular Structure Computations. Journal of Computational Biology 1998; 5 (3): 409-422
-
Updating a bibliography using the RELATED ARTICLES function within PubMed
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
1998: 750-754
Abstract
Comprehensive bibliographies are useful for conducting reviews of the literature, and for assessing the progress within a field. These bibliographies may be broad and inclusive, or focused and precise in their inclusion criteria. In either case, the task of maintaining a complete bibliography within a particular area of research is made difficult by the diversity, complexity and huge volume of newly published literature. In an effort to effectively and automatically retrieve relevant literature, different search strategies and indexing tools have been developed, including the RELATED ARTICLES function provided with the PubMed system. In this paper, we report a program for incremental updates of a bibliography using the PubMed RELATED ARTICLES function. Given a highly specialized starting bibliography of experimental measurements of the structure of the 30S bacterial ribosomal subunit, the system was applied to find additional relevant references. For this particular task, the system has a recall of 75%, a strict precision of 32% and a partial precision of 42%. Our results are notable because although the RELATED ARTICLES function is purely statistical, it is nonetheless able to select a very narrowly defined set of articles from the literature. We discuss the tradeoffs between having a user evaluate many articles of possible interest in a single session, versus asking a user to evaluate a small set of articles on a periodic basis.
View details for Web of Science ID 000171768600146
View details for PubMedID 9929319
-
A surface measure for probabilistic structural computations.
Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology
1998; 6: 148-156
Abstract
Computing three-dimensional structures from sparse experimental constraints requires methods for combining heterogeneous sources of information, such as distances, angles, and measures of total volume, shape, and surface. For some types of information, such as distances between atoms, numerous methods are available for computing structures that satisfy the provided constraints. It is more difficult, however, to use information about the degree to which an atom is on the surface or buried as a useful constraint during structure computations. Surface measures have been used as accept/reject criteria for previously computed structures, but this is not an efficient strategy. In this paper, we investigate the efficacy of applying a surface measure in the computation of molecular structure, using a method of probabilistic least square computations which facilitates the introduction of multiple, noisy, heterogeneous data sources. For this purpose, we introduce a simple purely geometrical measure of surface proximity called maximal conic view (MCV). MCV is efficiently computable and differentiable, and is hence well suited to driving a structural optimization method based, in part, on surface data. As an initial validation, we show that MCV correlates well with known measures for total exposed surface area. We use this measure in our experiments to show that information about surface proximity (derived from theory or experiment, for example) can be added to a set of distance measurements to increase significantly the quality of the computed structure. In particular, when 30 to 50 percent of all possible short-range distances are provided, the addition of surface information improves the quality of the computed structure (as measured by RMS fit) by as much as 80 percent. Our results demonstrate that knowledge of which atoms are on the surface and which are buried can be used as a powerful constraint in estimating molecular structure.
View details for PubMedID 9783220
-
MHCWeb: Converting a WWW database into a knowledge-based collaborative environment
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
1998: 947-951
Abstract
The World Wide Web (WWW) is useful for distributing scientific data. Most existing web data resources organize their information either in structured flat files or relational databases with basic retrieval capabilities. For databases with one or a few simple relations, these approaches are successful, but they can be cumbersome when there is a data model involving multiple relations between complex data. We believe that knowledge-based resources offer a solution in these cases. Knowledge bases have explicit declarations of the concepts in the domain, along with the relations between them. They are usually organized hierarchically, and provide a global data model with a controlled vocabulary. We have created the OWEB architecture for building online scientific data resources using knowledge bases. OWEB provides a shell for structuring data, providing secure and shared access, and creating computational modules for processing and displaying data. In this paper, we describe the translation of the online immunological database MHCPEP into an OWEB system called MHCWeb. This effort involved building a conceptual model for the data, creating a controlled terminology for the legal values for different types of data, and then translating the original data into the new structure. The OWEB environment allows for flexible access to the data by both users and computer programs.
View details for Web of Science ID 000171768600185
View details for PubMedID 9929358
-
Recognizing protein binding sites using statistical descriptions of their 3D environments.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
1998: 497-508
Abstract
We have developed a new method for recognizing sites in three-dimensional protein structures. Our method is based on our previously reported algorithm for creating descriptions of protein microenvironments using physical and chemical properties at multiple levels of detail (including features at the atomic, chemical group, residue, and secondary structural levels). The recognition method takes three inputs: a set of sites that share some structural or functional role, a set of control nonsites that lack this role, and a single query site. The values of properties for the query site are compared to the distributions of values for both sites and nonsites to determine the group to which it is most similar. A log-odds scoring function, based on Bayes' Rule, computes a score that indicates the likelihood that the query region is a site of interest. In this paper, we apply the method to the task of identifying calcium binding sites in proteins. Cross-validation analysis shows that this recognition approach has high sensitivity and specificity. We also describe the results of scanning four calcium binding proteins (with the calcium removed) using a three-dimensional grid of probe points at 2 Å spacing. The probe points that have high scores cluster around the true calcium binding sites, with the highest scoring points at or near the binding sites. The method fails in only one case where a calcium binding site is created by four proteins in the crystal lattice, and is thus not recognizable within the crystallographic asymmetric unit. Our results show that property-based descriptions can be used for recognizing protein sites in unannotated structures.
View details for PubMedID 9697207
-
RNA secondary structure as a reusable interface to biological information resources (Reprinted from Gene-Combis, vol 190, pg GC59-GC70, 1997)
GENE
1997; 190 (2): GC59–GC70
View details for DOI 10.1016/S0378-1119(96)00855-4
View details for Web of Science ID A1997XC85400014
-
Informatics in the care of patients: Ten notable challenges
WESTERN JOURNAL OF MEDICINE
1997; 166 (2): 118-122
Abstract
What is medical informatics, and why should practicing physicians care about it? Medical informatics is the study of the concepts and conceptual relationships within biomedical information and how they can be harnessed for practical applications. In the past decade, the field has exploded as health professionals recognize the importance of strategic information management and the inadequacies of traditional tools for information storage, retrieval, and analysis. At the same time that medical informatics has established a presence within many academic and industrial research facilities, its goals and methods have become less clear to practicing physicians. In this article, I outline 10 challenges in medical informatics that provide a framework for understanding developments in the field. These challenges have been divided into those relating to infrastructure, specific performance, and evaluation. The primary goals of medical informatics, as for any other branch of biomedical research, are to improve the overall health of patients by combining basic scientific and engineering insights with the useful application of these insights to important problems.
View details for Web of Science ID A1997WR20700003
View details for PubMedID 9109328
-
LPFC: An Internet library of protein family core structures
PROTEIN SCIENCE
1997; 6 (1): 246-248
Abstract
As the number of protein molecules with known, high-resolution structures increases, it becomes necessary to organize these structures for rapid retrieval, comparison, and analysis. The Protein Data Bank (PDB) currently contains nearly 5,000 entries and is growing exponentially. Most new structures are similar structurally to ones reported previously and can be grouped into families. As the number of members in each family increases, it becomes possible to summarize, statistically, the commonalities and differences within each family. We reported previously a method for finding the atoms in a family alignment that have low spatial variance and those that have higher spatial variance (i.e., the "core" atoms that have the same relative position in all family members and the "non-core" atoms that do not). The core structures we compute have biological significance and provide an excellent quantitative and visual summary of a multiple structural alignment. In order to extend their utility, we have constructed a library of protein family cores, accessible over the World Wide Web at http://www-smi.stanford.edu/projects/helix/LPFC/. This library is generated automatically with publicly available computer programs requiring only a set of multiple alignments as input. It contains quantitative analysis of the spatial variation of atoms within each protein family, the coordinates of the average core structures derived from the families, and display files (in bitmap and VRML formats). Here, we describe the resource and illustrate its applicability by comparing three multiple alignments of the globin family. These three alignments are found to be similar, but with some significant differences related to the diversity of family members and the specific method used for alignment.
View details for Web of Science ID A1997WD20100027
View details for PubMedID 9007997
- RiboWeb: Linking Structural Computations to a Knowledge Base of Published Experimental Data. 1997
- Standardized Representations of the Literature: Combining Diverse Sources of Ribosomal Data. 1997
- Using the Radial Distribution of Physical Features to Compare Amino Acid Environments. edited by Altman, R., Dunker, K., Hunter, L. 1997
- Proceedings of Pacific Symposium on Biocomputing 1997. edited by Altman, R., Dunker, K., Hunter, L. 1997
-
RNA secondary structure as a reusable interface to biological information resources
GENE-COMBIS
1997; 190: GC59-GC70
Abstract
The dissemination of biological information has become critically dependent on the Internet and World Wide Web (WWW), which enable distributed access to information in a platform independent manner. The mode of interaction between biologists and on-line information resources, however, has been mostly limited to simple interface technologies such as hypertext links, tables and forms. The introduction of platform-independent runtime environments facilitates the development of more sophisticated WWW-based user interfaces. Until recently, most such interfaces have been tightly coupled to the underlying computation engines, and not separated as reusable components. We believe that many subdisciplines of biology have intuitive and familiar graphical representations of knowledge that can serve as multipurpose user interface elements. We call such graphical idioms "domain graphics". In order to illustrate the power of such graphics, we have built a reusable interface based on the standard two-dimensional (2D) layout of RNA secondary structure. The interface can be used to represent any pre-computed layout of RNA, and takes as parameters the sets of actions to be performed as a user interacts with the interface. It can provide to any associated application program information about the base, helix, or subsequence selected by the user. We show the versatility of this interface by using it as a special purpose interface to BLAST, Medline and the RNA MFOLD search/compute engines. These demonstrations are available at: http://www-smi.stanford.edu/projects/helix/pubs/gene-combis-96/
View details for PubMedID 9197551
-
Standardized representations of the literature: Combining diverse sources of ribosomal data
5th International Conference on Intelligent Systems for Molecular Biology (ISMB-97)
AMER ASSOC ARTIFICIAL INTELLIGENCE. 1997: 15–24
Abstract
We are building a knowledge base (KB) of published structural data on the 30S ribosomal subunit in prokaryotes. Our KB is distinguished by a standardized representation of biological experiments and their results, in a reusable format. It can be accessed by computer programs that exploit the rich interconnections within the data. The KB is designed to support the construction of 3D models of the 30S subunit, as well as the analysis and extension of relevant functional and phylogenetic information. Most published information about the structure of the ubiquitous ribosome focuses on E. coli as a model system. At the same time, thousands of RNA sequences for the ribosome have been gathered and cataloged. The volume and complexity of these data can complicate attempts to separate structural data peculiar to E. coli from data of universal relevance. We have written an application that dynamically queries the KB and the Ribosome Database Project, a repository of ribosomal RNA sequences from other organisms, in order to assess the relevance of structural data to particular organisms. The application uses the RDP alignment to determine whether a set of data refer primarily to conserved, mismatched, or gapped positions. For a set of 16 representative articles evaluated over 211 sequences, 73% of observations have unambiguous translations from E. coli to the other organisms, 21% have somewhat ambiguous translations, and 6% have no translations. There is a wide variation in these numbers over different articles and organisms, confirming that some articles report structural information specific to E. coli while others report information that is quite general.
View details for Web of Science ID 000072320000002
View details for PubMedID 9322010
-
RIBOWEB: Linking structural computations to a knowledge base of published experimental data
5th International Conference on Intelligent Systems for Molecular Biology (ISMB-97)
AMER ASSOC ARTIFICIAL INTELLIGENCE. 1997: 84–87
Abstract
The world wide web (WWW) has become critical for storing and disseminating biological data. It offers an additional opportunity, however, to support distributed computation and sharing of results. Currently, computational analysis tools are often separated from the data in a manner that makes iterative hypothesis testing cumbersome. We hypothesize that the cycle of scientific reasoning (using data to build models, and evaluating models in light of data) can be facilitated with resources that link computations with semantic models of the data. Riboweb is an on-line knowledge-based resource that supports the creation of three-dimensional models of the 30S ribosomal subunit. It has three components: (I) a knowledge base containing representations of the essential physical components and published structural data, (II) computational modules that use the knowledge base to build or analyze structural models, and (III) a web-based user interface that supports multiple users, sessions and computations. We have built a prototype of Riboweb, and have used it to refine a rough model of the central domain of the 30S subunit from E. coli. Our results suggest that sophisticated and integrated computational capabilities can be delivered to biologists using this simple three-component architecture.
View details for Web of Science ID 000072320000011
View details for PubMedID 9322019
-
Using the radial distributions of physical features to compare amino acid environments and align amino acid sequences.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
1997: 465-476
Abstract
We have performed a comprehensive analysis of the microenvironments surrounding the twenty amino acids. Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments. We describe the amino acid environments with a set of 21 features summarizing atomic, chemical group, residue, and secondary structural features. The environments are divided into radial shells of 1 Å thickness to represent the distance of the features from the amino acid C beta atoms. We make the results of our analysis available graphically over the world wide web. To illustrate the validity and utility of our analysis, we used the amino acid comparative profiles to construct a substitution matrix, the WAC matrix, based on a simple summary of the computed environmental differences. We compared our matrix to BLOSUM62 and PAM250 in BLAST searches with query sequences selected from 39 protein families found in the PROSITE database. Although BLOSUM62 was the most sensitive matrix overall, our matrix was more sensitive for some families, and exhibited overall performance similar to PAM250. Our results suggest that the radial distribution of biochemical and biophysical features is useful for comparing amino acid environments, and that similarity matrices based on the geometric distribution of features around amino acids may produce improved search sensitivity.
View details for PubMedID 9390315
-
Computational methods for defining the allowed conformational space of 16S rRNA based on chemical footprinting data
RNA-A PUBLICATION OF THE RNA SOCIETY
1996; 2 (9): 851-866
Abstract
Structural models for 16S ribosomal RNA have been proposed based on combinations of crosslinking, chemical protection, shape, and phylogenetic evidence. These models have been based for the most part on independent data sets and different sets of modeling assumptions. In order to evaluate such models meaningfully, methods are required to explicitly model the spatial certainty with which individual structural components are positioned by specific data sets. In this report, we use a constraint satisfaction algorithm to explicitly assess the location of the secondary structural elements of the 16S RNA, as well as the certainty with which these elements can be positioned. The algorithm initially assumes that these helical elements can occupy any position and orientation and then systematically eliminates those positions and orientations that do not satisfy formally parameterized interpretations of structural constraints. Using a conservative interpretation of the hydroxyl radical footprinting data, the positions of the ribosomal proteins as defined by neutron diffraction studies, and the secondary structure of 16S rRNA, the location of the RNA secondary structural elements can be defined with an average precision of 25 Å (ranging from 12.8 to 56.3 Å). The uncertainty in individual helix positions is both heterogeneous and dependent upon the number of constraints imposed on the helix. The topology of the resulting model is consistent with previous models based on independent approaches. The result of our computation is a conservative upper bound on the possible positions of the RNA secondary structural elements allowed by this data set, and provides a suitable starting point for refinement with other sources of data or different sets of modeling assumptions.
View details for Web of Science ID A1996VH69500001
View details for PubMedID 8809013
-
Constraining volume by matching the moments of a distance distribution
COMPUTER APPLICATIONS IN THE BIOSCIENCES
1996; 12 (4): 319-326
Abstract
The problem of computing a molecular structure from a set of distances arises in the interpretation of NMR data as well as other experimental methods that yield distance information. Techniques for computing structures must find conformations consistent with the distance data. There are often other constraints on the structure that must be satisfied as well. One of the most problematic constraints is the constraint on the total volume occupied by the atoms. In this paper, we use the first two moments (mean and variance) of an estimated distance distribution to constrain the volume of a computed structure. We show that a probabilistic algorithm for matching the first two moments of the estimated distance distribution significantly improves the quality of the solution, especially when the distance information alone is not sufficient to define the structure precisely. We also show that our method is not sensitive to small errors in the estimates of mean and variance of the distance distribution. Finally, we demonstrate the use of this constraint in computing a low-resolution structure of the 30S prokaryotic ribosomal subunit. Quantitative analysis of our results allows us to assess the information content contained in constraints on volume, and to show that in some cases addition of a volume constraint adds information roughly equivalent to doubling the number of input distances. Our results also demonstrate the flexibility of probabilistic representations of structural constraints, and the importance of including volume information to constrain structural computations, especially in the case of sparse data.
View details for Web of Science ID A1996VM02500008
View details for PubMedID 8902359
-
Images in clinical medicine. Knotted umbilical cord.
New England journal of medicine
1996; 334 (9): 573-?
View details for PubMedID 8569825
-
Knotted umbilical cord
NEW ENGLAND JOURNAL OF MEDICINE
1996; 334 (9): 573-573
View details for Web of Science ID A1996TW69600005
-
An evaluation of the TransFER model for sharing clinical decision-support applications.
Proceedings : a conference of the American Medical Informatics Association / ... AMIA Annual Fall Symposium. AMIA Fall Symposium
1996: 468-472
Abstract
TransFER is a formal model designed to facilitate the sharing of decision-support applications across institutions with heterogeneous clinical databases. The TransFER model provides a mechanism to automatically customize database queries based on a reference schema of clinical data and an encoded set of database mappings. In this paper, we describe the elements of the TransFER model and we present the results of a formal evaluation we conducted to assess the utility and generality of the model. The results suggest that the TransFER model has significant potential for automating query translation and facilitating application sharing, but that further work on the representation of temporal semantics, on the modeling of missing data, and on the optimization of complex queries is required.
View details for PubMedID 8947710
-
Using the radial distributions of physical features to compare amino acid environments and align amino acid sequences
2nd Pacific Symposium on Biocomputing (PSB)
WORLD SCIENTIFIC PUBL CO PTE LTD. 1996: 465–476
View details for Web of Science ID A1996BH75M00047
- Conserved Features in the Active Site of Nonhomologous Serine Proteases. Folding & Design 1996; 1 (5): 371-379
-
A programming course in bioinformatics for computer and information science students.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
1996: 73-84
Abstract
We have created a course entitled "Representations and Algorithms for Computational Molecular Biology" with three specific goals in mind. First, we want to provide a technical introduction for computer science and medical information science students to the challenges of computing with molecular biology data, particularly the advantages of having easy access to real-world data sets. Second, we want to equip the students with the skills required of productive research assistants in molecular biology computing research projects. Finally, we want to provide a showcase for local investigators to describe their work in the context of a course that provides adequate background information. In order to achieve these goals, we have created a programming course, in which three major projects and six smaller assignments are assigned during the quarter. We stress fundamental representations and algorithms during the first part of the course in lectures given by the core faculty, and then have more focused lectures in which faculty research interests are highlighted. The course stressed issues of structural molecular biology, in order to better motivate the critical issues in sequence analysis. The culmination of the course was a challenge to the students to use a version of protein threading to predict which members of a set of unknown sequences were globins. The course was well received, and has been made a core requirement in the Medical Information Sciences program.
View details for PubMedID 9390224
-
Lamprey: tracking users on the World Wide Web.
Proceedings : a conference of the American Medical Informatics Association / ... AMIA Annual Fall Symposium. AMIA Fall Symposium
1996: 757-761
Abstract
Tracking individual web sessions provides valuable information about user behavior. This information can be used for general purpose evaluation of web-based user interfaces to biomedical information systems. To this end, we have developed Lamprey, a tool for doing quantitative and qualitative analysis of Web-based user interfaces. Lamprey can be used from any conforming browser, and does not require modification of server or client software. By rerouting WWW navigation through a centralized filter, Lamprey collects the sequence and timing of hyperlinks used by individual users to move through the web. Instead of providing marginal statistics, it retains the full information required to recreate a user session. We have built Lamprey as a standard Common Gateway Interface (CGI) that works with all standard WWW browsers and servers. In this paper, we describe Lamprey and provide a short demonstration of this approach for evaluating web usage patterns.
View details for PubMedID 8947767
-
Conserved features in the active site of nonhomologous serine proteases
FOLDING & DESIGN
1996; 1 (5): 371-379
Abstract
Serine protease activity is critical for many biological processes and has arisen independently in a few different protein families. It is not clear, though, the degree to which these protease families share common biochemical and biophysical properties. We have used a computer program to study the properties that are shared by four serine protease active sites with no overall structural or sequence homology. The program systematically compares the region around the catalytic histidines from the four proteins with a set of noncatalytic histidines, used as controls. It reports the three-dimensional locations and level of statistical significance for those properties that distinguish the catalytic histidines from the noncatalytic ones. The method of analysis is general and can be applied easily to other active sites of interest. As expected, some of the reported properties correspond to previously known features of the serine protease active site, including the catalytic triad and the oxyanion hole. Novel properties are also found, including the spatial distribution of charged, polar, and hydrophobic groups arranged to stabilize the catalytic residues, and a relative abundance of some residues (Val, Tyr, Leu, and Gly) around the active site. Our findings show that in addition to some properties common to all the proteases examined, there are a set of preferred, but not required, properties that can be reliably observed only by aligning the sites and comparing them with carefully selected statistical controls.
View details for Web of Science ID A1996WC40600007
View details for PubMedID 9080183
-
Using a measure of structural variation to define a core for the globins
COMPUTER APPLICATIONS IN THE BIOSCIENCES
1995; 11 (6): 633-644
Abstract
As the database of three-dimensional protein structures expands, it becomes possible to classify related structures into families. Some of these families, such as the globins, have enough members to allow statistical analysis of conserved features. Previously, we have shown that a probabilistic representation based on means and variances can be useful for defining structural cores for large families. These cores contain the subset of atoms that are in essentially the same relative positions in all members of the family. In addition to defining a core, our method creates an ordered list of atoms, ranked by their structural variation. In applying our core-finding procedure to the globins, we find that helices A, B, G and H form a structural core with low variance. These helices fold early in the folding pathway, and superimpose well with helices in the helix-turn-helix repressor protein family. The non-core helices (F and the parts of other helices that interact with it) are associated with the functional differences among the globins, and are encoded within a separate exon. We have also compared the variability measure implicit in our core structures with measures of sequence variability, using a procedure for measuring sequence variability that helps correct for the biased sampling in the databanks. We find, somewhat surprisingly, that sequence variation does not appear to correlate with structural variation.
View details for Web of Science ID A1995TR87100009
View details for PubMedID 8808580
-
AVERAGE CORE STRUCTURES AND VARIABILITY MEASURES FOR PROTEIN FAMILIES - APPLICATION TO THE IMMUNOGLOBULINS
JOURNAL OF MOLECULAR BIOLOGY
1995; 251 (1): 161-175
Abstract
A variety of methods are currently available for creating multiple alignments, and these can be used to define and characterize families of related proteins, such as the globins or the immunoglobulins. We have developed a method for using a multiple alignment to identify an average structural "core", a subset of atoms with low structural variation. We show how the means and variances of core-atom positions summarize the commonalities and differences within a family, making them particularly useful in compiling libraries of protein folds. We show further how it is possible to describe the rotation and translation relating two core structures, as in two domains of a multi-domain protein, in a consistent fashion in terms of a "mean" transformation and a deviation about this mean. Once determined, our average core structures (with their implicit measure of structural variation) allow us to define a measure of structural similarity more informative than the usual root-mean-square (RMS) deviation in atomic position, i.e. a "better RMS." Our average structures also permit straightforward comparisons between variation in structure and sequence at each position in a family. We have applied our core-finding methodology in detail to the immunoglobulin family. We find that the structural variability we observe just within the VL and VH domains anticipates the variability that others have observed throughout the whole immunoglobulin superfamily; that a core definition based on sequence conservation, somewhat surprisingly, does not agree with one based on structural similarity; and that the cores of the VL and VH domains vary about 5 degrees in relative orientation across the known structures.
View details for Web of Science ID A1995RN00200014
View details for PubMedID 7643385
-
METHODS FOR DISPLAYING MACROMOLECULAR STRUCTURAL UNCERTAINTY - APPLICATION TO THE GLOBINS
JOURNAL OF MOLECULAR GRAPHICS & MODELLING
1995; 13 (3): 142-152
Abstract
Most molecular graphics programs ignore any uncertainty in the atomic coordinates being displayed. Structures are displayed in terms of perfect points, spheres, and lines with no uncertainty. However, all experimental methods for defining structures, and many methods for predicting and comparing structures, associate uncertainties with each atomic coordinate. We have developed graphical representations that highlight these uncertainties. These representations are encapsulated in a new interactive display program, PROTEAND. PROTEAND represents structural uncertainty in three ways: (1) The traditional way: The program shows a collection of structures as superposed and overlapped stick-figure models. (2) Ellipsoids: At each atom position, the program shows an ellipsoid derived from a three-dimensional Gaussian model of uncertainty. This probabilistic model provides additional information about the relationship between atoms that can be displayed as a correlation matrix. (3) Rigid-body volumes: Using clouds of dots, the program can show the range of rigid-body motion of selected substructures, such as individual alpha helices. We illustrate the utility of these display modalities by applying PROTEAND to the globin family of proteins, and show that certain types of structural variation are best illustrated with different methods of display.
View details for Web of Science ID A1995RL45300002
View details for PubMedID 7577841
-
A PROBABILISTIC APPROACH TO DETERMINING BIOLOGICAL STRUCTURE - INTEGRATING UNCERTAIN DATA SOURCES
INTERNATIONAL JOURNAL OF HUMAN-COMPUTER STUDIES
1995; 42 (6): 593–616
View details for DOI 10.1006/ijhc.1995.1026
View details for Web of Science ID A1995RT03000003
-
CHARACTERIZING THE MICROENVIRONMENT SURROUNDING PROTEIN SITES
PROTEIN SCIENCE
1995; 4 (4): 622-635
Abstract
Sites are microenvironments within a biomolecular structure, distinguished by their structural or functional role. A site can be defined by a three-dimensional location and a local neighborhood around this location in which the structure or function exists. We have developed a computer system to facilitate structural analysis (both qualitative and quantitative) of biomolecular sites. Our system automatically examines the spatial distributions of biophysical and biochemical properties, and reports those regions within a site where the distribution of these properties differs significantly from control nonsites. The properties range from simple atom-based characteristics such as charge to polypeptide-based characteristics such as type of secondary structure. Our analysis of sites uses non-sites as controls, providing a baseline for the quantitative assessment of the significance of the features that are uncovered. In this paper, we use radial distributions of properties to study three well-known sites (the binding sites for calcium, the milieu of disulfide bridges, and the serine protease active site). We demonstrate that the system automatically finds many of the previously described features of these sites and augments these features with some new details. In some cases, we cannot confirm the statistical significance of previously reported features. Our results demonstrate that analysis of protein structure is sensitive to assumptions about background distributions, and that these distributions should be considered explicitly during structural analyses.
View details for Web of Science ID A1995QU44000004
View details for PubMedID 7613462
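The radial-distribution comparison described in this abstract can be illustrated with a minimal sketch. It counts atoms carrying a property in concentric shells around a site center and reports the per-shell difference between the mean site profile and the mean nonsite profile. The function names, the input layout, and the shell parameters are hypothetical; the abstract does not specify the statistical test used, so none is included here.

```python
import math


def shell_counts(center, atoms, shell_width, n_shells):
    """Count atoms with a property in concentric shells around a center.

    center:      (x, y, z) location of the site (or control nonsite).
    atoms:       list of ((x, y, z), has_property) pairs.
    shell_width: radial thickness of each shell (hypothetical parameter).
    n_shells:    number of shells; atoms beyond the last shell are ignored.
    """
    counts = [0] * n_shells
    for position, has_property in atoms:
        if not has_property:
            continue
        shell = int(math.dist(center, position) // shell_width)
        if shell < n_shells:
            counts[shell] += 1
    return counts


def mean_profile(profiles):
    """Average a list of equal-length shell-count profiles, shell by shell."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def profile_difference(site_profiles, nonsite_profiles):
    """Per-shell difference between mean site and mean nonsite counts."""
    site_mean = mean_profile(site_profiles)
    nonsite_mean = mean_profile(nonsite_profiles)
    return [s - n for s, n in zip(site_mean, nonsite_mean)]
```

A positive entry in the result means the property is more common at that distance around sites than around the control nonsites; deciding whether the difference is significant would need a test chosen by the analyst.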
-
Characterizing oriented protein structural sites using biochemical properties.
Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology
1995; 3: 12-20
Abstract
A protein site is a region of a three-dimensional protein structure with a distinguishing functional or structural role. Certain sites recur in different protein structures (for example catalytic sites, calcium binding sites, and some types of turns), but maintain critical shared features. To facilitate the analysis of such protein sites, we have developed a computer system for analyzing the spatial distributions of biochemical properties around a site. The system takes a set of similar sites and a set of control nonsites, and finds differences between them. Specifically, it compares distributions of the properties surrounding the sites with those surrounding the nonsites, and reports statistically significant differences. In this paper, we use our method to analyze the features in the active site of the serine protease enzymes. We compare the use of radial distributions (shells) with 3-D grids (blocks) in the analysis of the active site. We demonstrate three different strategies for focusing attention on significant findings, based on properties of interest, spatial volumes of interest, and on the level of statistical significance. Finally, we show that the program automatically identifies conserved sequential, secondary structural and biophysical features of the serine protease active site, using noncatalytic histidine residues as a control environment.
View details for PubMedID 7584427
-
SHAPE-BASED MODELS FOR INTERACTIVE SEGMENTATION OF MEDICAL IMAGES
Conference on Image Processing - Medical Imaging 1995
SPIE - INT SOC OPTICAL ENGINEERING. 1995: 771–780
View details for Web of Science ID A1995BD19R00076
- Computing the Structure of Large Complexes: Applying Constraint Satisfaction Techniques to Modeling the 16S Ribosomal RNA. Biomolecular NMR Spectroscopy edited by Markley, J., Opella, S. London: Oxford University Press.. 1995: 279–299
- Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology (Cambridge, England). edited by Rawlings, C., Clark, D., Altman, R. 1995
- A Probabilistic Approach to Determining Biological Structure: Integrating Uncertain Data Sources. International Journal of Human Computer Studies 1995; 42: 593-616
-
Finding an average core structure: application to the globins.
Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology
1994; 2: 19-27
Abstract
We present a procedure for automatically identifying from a set of aligned protein structures a subset of atoms with only a small amount of structural variation, i.e., a core. We apply this procedure to the globin family of proteins. Based purely on the results of the procedure, we show that the globin fold can be divided into two parts. The part with greater structural variation consists of the residues near the heme (the F helix and parts of the G and H helices), and the part with lesser structural variation (the core) forms a structural framework similar to that of the repressor protein (A, B, and E helices and remainder of the G and H helices). Such a division is consistent with many other structural and biochemical findings. In addition, we find further partitions within the core that may have biological significance. Finally, using the structural core of the globin family as a reference point, we have compared structural variation to sequence variation and shown that a core definition based on sequence conservation does not necessarily agree with one based on structural similarity.
View details for PubMedID 7584390
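The core-finding idea in this abstract can be illustrated with a minimal sketch. It assumes the structures are already aligned and superposed, and it separates core from non-core with a fixed variance cutoff. Both are simplifications, and the names `positional_variance`, `find_core`, and `cutoff` are hypothetical rather than taken from the paper.

```python
def positional_variance(positions):
    """Mean squared distance of one atom's positions from their centroid.

    positions: list of (x, y, z) tuples, one per structure in the family.
    """
    n = len(positions)
    centroid = [sum(p[i] for p in positions) / n for i in range(3)]
    squared = [
        sum((p[i] - centroid[i]) ** 2 for i in range(3)) for p in positions
    ]
    return sum(squared) / n


def find_core(family, cutoff):
    """Rank atoms by structural variation and keep the low-variance ones.

    family: list of structures; each structure is a list of (x, y, z)
            tuples, with atom k in one structure aligned to atom k in
            every other structure (assumed to be superposed already).
    cutoff: hypothetical variance threshold separating core from non-core.
    Returns (ranked, core): atom indices in order of increasing variance,
    and the subset whose variance is at or below the cutoff.
    """
    n_atoms = len(family[0])
    variances = [
        positional_variance([structure[k] for structure in family])
        for k in range(n_atoms)
    ]
    ranked = sorted(range(n_atoms), key=lambda k: variances[k])
    core = [k for k in ranked if variances[k] <= cutoff]
    return ranked, core
```

The ranked list corresponds to the ordering of atoms by structural variation mentioned in the abstract; a fuller treatment would re-superpose the structures on the current core after each change, which this sketch omits.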
-
PARALLEL PROTEIN STRUCTURE DETERMINATION FROM UNCERTAIN DATA
Supercomputing 94
I E E E, COMPUTER SOC PRESS. 1994: 570–579
View details for Web of Science ID A1994BC13B00068
- Compositional Characteristics of Disordered Regions in Proteins. Protein and Peptide Letters 1994; 2 (1): 120-127
- Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (Stanford, CA). edited by Altman, R., Brutlag, D., Karp, P. 1994
-
PROBABILISTIC CONSTRAINT SATISFACTION - APPLICATION TO RADIOSURGERY
18th Annual Symposium on Computer Applications in Medical Care - Transforming Information, Changing Health Care
BMJ PUBLISHING GROUP. 1994: 780–784
Abstract
Although quite successful in a variety of settings, standard optimization approaches can have drawbacks within medical applications. For example, they often provide a single solution which is difficult to explain, or which cannot be incrementally modified using secondary "soft" constraints that are difficult to encode within the optimization. In order to address these issues, we have developed a probabilistic optimization technique that allows the user to enter prior probability distributions (Gaussian) for the parameters to be optimized as well as for the constraints on the parameters. Our technique combines the prior distributions with the constraints using Bayes' rule. The algorithm produces not only a set of parameter values, but variances on these values and covariances showing the correlations between parameters. We have applied this method to the problem of planning a radiosurgical ablation of brain tumors. The radiation plan should maximize dose to tumor, minimize dose to surrounding areas, and provide an even distribution of dosage across the tumor. It also should be explainable to and modifiable by the expert physicians based on external considerations. We have compared the results of our method with the standard linear programming approach.
View details for Web of Science ID A1994QF21600137
View details for PubMedID 7950031
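Combining a Gaussian prior with a constraint through Bayes' rule can be illustrated in the simplest possible case: one parameter and one noisy linear constraint. This is a textbook scalar update offered only as a sketch, not the paper's algorithm; the function name and arguments are hypothetical, and a single parameter has no covariances, so that part of the method is not shown.

```python
def gaussian_update(mean, var, h, z, noise_var):
    """Combine a Gaussian prior on x with one noisy linear constraint.

    Prior:      x ~ N(mean, var)
    Constraint: z = h * x + noise, with noise ~ N(0, noise_var)
    Returns the posterior (mean, variance) given by Bayes' rule.
    """
    gain = var * h / (h * h * var + noise_var)
    post_mean = mean + gain * (z - h * mean)
    post_var = (1.0 - gain * h) * var
    return post_mean, post_var
```

A loose ("soft") constraint is expressed by a large `noise_var`, which pulls the posterior only slightly away from the prior; a tight constraint uses a small `noise_var`.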
-
EXTRACTION OF SNOMED CONCEPTS FROM MEDICAL RECORD TEXTS
18th Annual Symposium on Computer Applications in Medical Care - Transforming Information, Changing Health Care
BMJ PUBLISHING GROUP. 1994: 179–183
Abstract
Clinicians have traditionally documented patient data using natural language text. With the increasing prevalence of computer systems in health care, an increasing amount of medical record text will be stored electronically. However, for such textual documents to be indexed, shared, and processed adequately by computers, it will be important to be able to identify concepts in the documents using a common medical terminology. Automated methods for extracting concepts in a standard terminology would enhance retrieval and analysis of medical record data. This paper discusses a method for extracting concepts from medical record documents using the medical terminology SNOMED-III (Systematized Nomenclature of Human and Veterinary Medicine, Version III). The technique employs a linear least squares fit that maps training set phrases to SNOMED concepts. This mapping can be used for unknown text inputs in the same domain as the training set to predict SNOMED concepts that are contained in the document. We have implemented the method in the domain of congestive heart failure for history and physical exam texts. Our system has a reasonable response time. We tested the system over a range of thresholds. The system performed with 90% sensitivity and 83% specificity at the lowest threshold, and 42% sensitivity and 99.9% specificity at the highest threshold.
View details for Web of Science ID A1994QF21600033
View details for PubMedID 7949915
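The threshold trade-off reported in this abstract can be illustrated with a short sketch that computes sensitivity and specificity for a set of scored candidate concepts. The data layout (a score per candidate concept and a set of concepts truly present) and the function name are hypothetical, and the numbers in the abstract are not reproduced here.

```python
def sensitivity_specificity(scores, present, threshold):
    """Sensitivity and specificity of concept predictions at a threshold.

    scores:    dict mapping each candidate concept to its predicted score.
    present:   set of concepts that truly occur in the document.
    threshold: a concept is predicted when its score is >= threshold.
    Concepts in `present` without a score count as missed.
    """
    predicted = {c for c, s in scores.items() if s >= threshold}
    absent = set(scores) - present
    true_positives = len(predicted & present)
    true_negatives = len(absent - predicted)
    sensitivity = true_positives / len(present) if present else float("nan")
    specificity = true_negatives / len(absent) if absent else float("nan")
    return sensitivity, specificity
```

Lowering the threshold predicts more concepts, which raises sensitivity and lowers specificity; raising it has the opposite effect.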
-
Constraint satisfaction techniques for modeling large complexes: application to the central domain of 16S ribosomal RNA.
Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology
1994; 2: 10-18
Abstract
Standard experimental techniques for determining the structure of small to moderately-sized molecules are difficult to apply to large macromolecular complexes. These complexes, consisting of multiple protein and/or nucleic acid components, can contain many thousands of atoms and the experimental techniques used to study them provide relatively sparse structural information with significant measurement uncertainty. Computational technologies are required to reduce the conformational search space and synthesize the data in order to produce the structures or (more usually) sets of structures compatible with the data. In this paper, we show that a method based on the constraint satisfaction paradigm produces a three-dimensional topology for the central domain of the 16S ribosomal RNA that is generally consistent with interactively built models, although differing in significant ways. The modeling incorporates information about secondary structure of the nucleic acid, neutron diffraction data about the relative positions and uncertainties of the proteins, and protection experiments indicating proximities of segments of RNA to specific protein subunits. Unlike previously proposed models, our model contains explicit information about the range of positions for each subunit that are compatible with the data. The system uses a grid search, checks distances in a direction-dependent manner, uses disjunctive distance constraints, and checks for volume overlap violations.
View details for PubMedID 7584378
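The combination of grid search and disjunctive distance constraints mentioned in this abstract can be illustrated with a small sketch. Each constraint is a list of alternatives, and a candidate grid position is kept when, for every constraint, at least one alternative is satisfied. The names and the constraint encoding are hypothetical; direction-dependent checks and volume-overlap tests are not shown.

```python
import math


def feasible_positions(grid_points, constraints):
    """Keep grid points that satisfy every disjunctive distance constraint.

    grid_points: iterable of candidate (x, y, z) positions.
    constraints: list of disjunctions; each disjunction is a list of
                 (anchor, low, high) alternatives, satisfied when the
                 distance from the point to at least one anchor lies
                 in the closed interval [low, high].
    """
    return [
        point
        for point in grid_points
        if all(
            any(low <= math.dist(point, anchor) <= high
                for anchor, low, high in disjunction)
            for disjunction in constraints
        )
    ]
```

The surviving points describe the range of positions compatible with the constraints, rather than a single best position.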
-
TOWARDS A STANDARD QUERY MODEL FOR SHARING DECISION-SUPPORT APPLICATIONS
18th Annual Symposium on Computer Applications in Medical Care - Transforming Information, Changing Health Care
BMJ PUBLISHING GROUP. 1994: 325–331
Abstract
Many clinical decision-support applications are created in a centralized manner, but distributed widely for local use. When such applications include queries to electronic patient databases, the queries must be translated to conform to local database specifications. Because no well-defined standard model of clinical data exists, the translation process is ad hoc, costly, and error-prone. In this paper, we propose an abstract formalism, called the Standard Query Model Framework, for specifying a standard clinical data model and for supporting the automated and reliable translation of queries that appear in shared decision-support applications. We present the components of this framework, discuss their desirable features, and describe a prototype that we have developed for relational patient databases. We also highlight the outstanding research issues relevant to our approach.
View details for Web of Science ID A1994QF21600059
View details for PubMedID 7949944
-
A SURVEY OF PATIENT ACCESS TO ELECTRONIC MAIL - ATTITUDES, BARRIERS, AND OPPORTUNITIES
18th Annual Symposium on Computer Applications in Medical Care - Transforming Information, Changing Health Care
BMJ PUBLISHING GROUP. 1994: 15–19
Abstract
The use of electronic mail (e-mail) is increasing among both physicians and patients, although there is limited information in the literature about how patients might use e-mail to communicate with their physician. In our university-based internal medicine clinic, we have studied attitudes toward and access to e-mail among patients. A survey of 444 patients in our clinic showed that 46% of patients in the clinic use e-mail, and 89% of those with e-mail use it at work. Fifty-one percent would use e-mail all or most of the time to communicate with the clinic if it were available, and many of the communications that currently take place by phone could be replaced by e-mail. Barriers to e-mail use include privacy concerns among patients who use e-mail in the workplace, choosing the appropriate tasks for e-mail, and methods for efficiently triaging electronic messages in the clinic.
View details for Web of Science ID A1994QF21600004
View details for PubMedID 7949909
-
Probabilistic constraint satisfaction with structural models: application to organ modeling by radial contours.
Proceedings / the ... Annual Symposium on Computer Application [sic] in Medical Care. Symposium on Computer Applications in Medical Care
1993: 492-496
Abstract
One of the key challenges within medical information sciences is the development of useful models for biological structure and its variability. Many biomedical problems involve the elucidation of structure (for example, from experimental data or from imaging studies), and structural models can often drive the process of inferring precise structure from data. Ideally, model-driven data interpretation combines knowledge about the generic features of a class of biological structures (as contained within a model) with data that provide specific information (often noisy) about a particular instance of the class. In this paper we briefly discuss model-driven determination of biological structure as an example of a structural constraint satisfaction problem. We describe a probabilistic implementation of structural constraint satisfaction, and show that our formulation of a particular organ modeling technology (Radial Contour Models) exhibits promising performance. Our results demonstrate the utility of probabilistic models for the solution of structural constraint satisfaction problems.
View details for PubMedID 8130522
-
Probabilistic structure calculations: a three-dimensional tRNA structure from sequence correlation data.
Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology
1993; 1: 12-20
Abstract
Algorithms based on probability theory can address issues of uncertainty directly through their representational framework and their theory for data combination. In this paper, we discuss the advantages of probabilistic formulations for molecular-structure calculations, describe one implementation of such a formulation, and show its performance on a data set derived from analysis of the statistical correlations within a set of aligned transfer RNA sequences. By assigning reasonable physical interpretations to certain statistical correlations, we are able to calculate three-dimensional structures for tRNA from a random starting structure. The constraints that we use are associated with different variances, and so their effects are not uniform, and must be reconciled by a probabilistic algorithm to yield the most likely structure. As might be predicted, the uncertainty in the position for each base is a function of both the number and strength of the constraints, and is reflected in the variances in atomic position calculated by the algorithm. For example, the hinge region in the tRNA is shown to be the most uncertain. In addition, the algorithm retains information about positional covariation that is useful for understanding the relationships between different parts of the structure. These experiments also demonstrate that we can define a single-sphere representation for each base that is useful for nucleic acid structural calculations in the same way that alpha-carbon representations are useful for protein structural calculations.
View details for PubMedID 7584327
-
STRUCTURAL UNCERTAINTY OF PROTEINS IN SOLUTION BY NMR - A REEVALUATION OF THE STRUCTURE OF THE LAC REPRESSOR HEADPIECE
APPLIED MAGNETIC RESONANCE
1993; 4 (4): 441-460
View details for Web of Science ID A1993LP80900004
-
A SYSTEMATIC COMPARISON OF 3 STRUCTURE DETERMINATION METHODS FROM NMR DATA - DEPENDENCE UPON QUALITY AND QUANTITY OF DATA
JOURNAL OF BIOMOLECULAR NMR
1992; 2 (4): 373-388
Abstract
We have systematically examined how the quality of NMR protein structures depends on (1) the number of NOE distance constraints, (2) their assumed precision, (3) the method of structure calculation and (4) the size of the protein. The test sets of distance constraints have been derived from the crystal structures of crambin (5 kDa) and staphylococcal nuclease (17 kDa). Three methods of structure calculation have been compared: Distance Geometry (DGEOM), Restrained Molecular Dynamics (XPLOR) and the Double Iterated Kalman Filter (DIKF). All three methods can reproduce the general features of the starting structure under all conditions tested. In many instances the apparent precision of the calculated structure (as measured by the RMS dispersion from the average) is greater than its accuracy (as measured by the RMS deviation of the average structure from the starting crystal structure). The global RMS deviations from the reference structures decrease exponentially as the number of constraints is increased, and after using about 30% of all potential constraints, the errors asymptotically approach a limiting value. Increasing the assumed precision of the constraints has the same qualitative effect as increasing the number of constraints. For comparable numbers of constraints/residue, the precision of the calculated structure is less for the larger than for the smaller protein, regardless of the method of calculation. The accuracy of the average structure calculated by Restrained Molecular Dynamics is greater than that of structures obtained by purely geometric methods (DGEOM and DIKF).
View details for Web of Science ID A1992JF96900006
View details for PubMedID 1511237
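The distinction this abstract draws between apparent precision and accuracy can be made concrete with a short sketch. It assumes the calculated structures are already superposed on a common frame, and it takes "RMS dispersion from the average" to mean the root mean square, over structures, of each structure's RMS deviation from the mean structure; that reading and all function names are assumptions.

```python
import math


def rmsd(a, b):
    """RMS deviation between two structures with matching atom order."""
    squared = [math.dist(p, q) ** 2 for p, q in zip(a, b)]
    return math.sqrt(sum(squared) / len(squared))


def mean_structure(ensemble):
    """Atom-by-atom average of a list of superposed structures."""
    n = len(ensemble)
    return [
        tuple(sum(atom[i] for atom in atoms) / n for i in range(3))
        for atoms in zip(*ensemble)
    ]


def precision_and_accuracy(ensemble, reference):
    """Return (dispersion about the average, deviation of average from reference).

    A small first value means the ensemble looks precise; a small second
    value means the average structure is accurate relative to the reference.
    """
    average = mean_structure(ensemble)
    dispersion = math.sqrt(
        sum(rmsd(s, average) ** 2 for s in ensemble) / len(ensemble)
    )
    deviation = rmsd(average, reference)
    return dispersion, deviation
```

When the first value is smaller than the second, the ensemble is tighter than it is correct, which is the situation the abstract describes as apparent precision exceeding accuracy.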
-
THE SOLUTION STRUCTURES OF ESCHERICHIA-COLI-TRP REPRESSOR AND TRP APOREPRESSOR AT AN INTERMEDIATE RESOLUTION
EUROPEAN JOURNAL OF BIOCHEMISTRY
1991; 202 (1): 53-66
Abstract
We have determined the solution structures and examined the dynamics of the Escherichia coli trp repressor (a 25-kDa dimer), with and without the co-repressor L-tryptophan, from NMR data. This is the largest protein structure thus far determined by NMR. To obtain a set of data sufficient for a structure determination it was essential to resort to isotopic spectral editing. Line broadening observed in this molecular mass range precludes for the most part the measurement of coupling constants and stereospecific assignments, with the inevitable result that the attainable resolution of the final structure will be somewhat lower than the resolution reported for smaller proteins and peptides. Nevertheless the general topology of the protein can be deduced from the subsets of NOEs defining the secondary and tertiary structure, providing a basis for further refinement using the full set of NOEs and energy minimization. We report here (a) an intermediate resolution structure that can be deduced from NMR data, covalent, angular and van-der-Waals constraints only, without resort to detailed energy calculations, and (b) the limits of uncertainty within which this structure is valid. An examination of these structures combined with backbone amide exchange data shows that even at this resolution three important conclusions can be drawn: (a) the protein structure changes upon binding tryptophan; (b) the putative DNA binding region is much more flexible than the core of the molecule, with backbone amide proton exchange rates 1000 times faster than in the core; (c) the binding of tryptophan stabilizes the repressor molecule, which is reflected in both the appearance of additional NOEs, and in the slowing of backbone proton exchange rates by factors of 3-10. Sequence-specific 1H-NMR assignments and the secondary structure of the holorepressor (L-tryptophan-bound form) have been reported previously [C. H. Arrowsmith, R. Pachter, R. B. Altman, S. B. Iyer & O. Jardetzky (1990) Biochemistry 29, 6332-6341]. Those for the trp aporepressor (L-tryptophan-free form), made using the same methods and conditions as described in the cited paper, are reported here. The secondary structure of the aporepressor was calculated from sequential and medium-range NOEs and is the same as reported for the holorepressor except that helix E is shorter. The tertiary solution structures for both forms of the repressor were calculated from long-range NOE data. (ABSTRACT TRUNCATED AT 400 WORDS)
View details for Web of Science ID A1991GP84100006
View details for PubMedID 1935980
-
COMPARISON OF THE NMR SOLUTION STRUCTURES OF CYCLOSPORINE-A DETERMINED BY DIFFERENT TECHNIQUES
JOURNAL OF MAGNETIC RESONANCE
1991; 92 (3): 468-479
View details for Web of Science ID A1991FJ10000002
- Determination of Large Protein Structures from NMR Data: Definition of the Solution Structure of the TRP Repressor. Computational Aspects of the Study of Biological Macromolecules by NMR Spectroscopy edited by Hoch, J., Poulsen, F., Redfield, C. New York: Plenum Publishing Corp.. 1991: 363–374
-
DETERMINATION OF LARGE PROTEIN STRUCTURES FROM NMR DATA - DEFINITION OF THE SOLUTION STRUCTURE OF THE TRP REPRESSOR
NATO ADVANCED RESEARCH WORKSHOP ON COMPUTATIONAL ASPECTS OF THE STUDY OF BIOLOGICAL MACROMOLECULES BY NUCLEAR MAGNETIC RESONANCE SPECTROSCOPY
PLENUM PRESS DIV PLENUM PUBLISHING CORP. 1991: 363–374
View details for Web of Science ID A1991BV14N00028
-
SEQUENCE-SPECIFIC H-1-NMR ASSIGNMENTS AND SECONDARY STRUCTURE IN SOLUTION OF ESCHERICHIA-COLI TRP REPRESSOR
BIOCHEMISTRY
1990; 29 (27): 6332-6341
Abstract
Sequence-specific 1H NMR assignments are reported for the active L-tryptophan-bound form of Escherichia coli trp repressor. The repressor is a symmetric dimer of 107 residues per monomer; thus at 25 kDa, this is the largest protein for which such detailed sequence-specific assignments have been made. At this molecular mass the broad line widths of the NMR resonances preclude the use of assignment methods based on 1H-1H scalar coupling. Our assignment strategy centers on two-dimensional nuclear Overhauser spectroscopy (NOESY) of a series of selectively deuterated repressor analogues. A new methodology was developed for analysis of the spectra on the basis of the effects of selective deuteration on cross-peak intensities in the NOESY spectra. A total of 90% of the backbone amide protons have been assigned, and 70% of the alpha and side-chain proton resonances are assigned. The local secondary structure was calculated from sequential and medium-range backbone NOEs with the double-iterated Kalman filter method [Altman, R. B., & Jardetzky, O. (1989) Methods Enzymol. 177, 218-246]. The secondary structure agrees with that of the crystal structure [Schevitz, R., Otwinowski, Z., Joachimiak, A., Lawson, C. L., & Sigler, P. B. (1985) Nature 317, 782], except that the solution state is somewhat more disordered in the DNA binding region and in the N-terminal region of the first alpha-helix. Since the repressor is a symmetric dimer, long-range intersubunit NOEs were distinguished from intrasubunit interactions by formation of heterodimers between two appropriate selectively deuterated proteins and comparison of the resulting NOESY spectrum with that of each selectively deuterated homodimer. Thus, from spectra of three heterodimers, long-range NOEs between eight pairs of residues were identified as intersubunit NOEs, and two additional long-range intrasubunit NOEs were assigned.
View details for Web of Science ID A1990DN23200002
View details for PubMedID 2207078
- PROTEAN - Part II: Molecular Structure Determination from Uncertain Data. Quantitative Computer Program Exchange Bulletin 1990; 4 (10): 596
- PROTEAN - Part I: Generating Ensembles of Stylized Molecular Fragments using Uncertain Constraints. Quantitative Computer Program Exchange Bulletin 1990; 4 (10): 596
-
NMR AND PROTEIN-STRUCTURE
BIOFIZIKA
1989; 34 (5): 763-771
View details for Web of Science ID A1989AU61400004
-
DETERMINATION OF STRUCTURAL UNCERTAINTY FROM NMR AND OTHER DATA - THE LAC REPRESSOR HEADPIECE
NATO ADVANCED STUDY INST AND 10TH COURSE OF THE INTERNATIONAL SCHOOL OF PURE AND APPLIED BIOSTRUCTURE : PROTEIN STRUCTURE AND ENGINEERING
PLENUM PRESS DIV PLENUM PUBLISHING CORP. 1989: 79–95
View details for Web of Science ID A1989BQ48W00006
-
NMR AND PROTEIN-STRUCTURE
24TH CONGRESS AMPERE : MAGNETIC RESONANCE AND RELATED PHENOMENA
ELSEVIER SCIENCE PUBL B V. 1989: 401–412
View details for Web of Science ID A1989BR03P00029
- NMR and Protein Structure. Biofizika 1989; 5 (34): 763-771
- The Determination of Structural Uncertainty from NMR and Other Data: The Lac Repressor Headpiece. Protein Structure and Engineering. edited by Jardetzky, O. New York: Plenum Publishing Corp.. 1989: 1
- The Heuristic Refinement Method for the Determination of the Solution Structure of Proteins from NMR Data. Nuclear Magnetic Resonance, Part B: Structure and Mechanisms (Methods in Enzymology) edited by Oppenheimer, N., James, T. New York: Academic Press.. 1989: 218–247
- Artificial Intelligence Techniques and NMR Spectroscopy: Application to the Structure of Proteins in Solution. Nuclear Magnetic Resonance: The Principles and Applications of NMR Spectroscopy and Imaging to Biomedical Research edited by Pettegrew, J. New York: Springer-Verlag.. 1989: 99–123
-
HEURISTIC REFINEMENT METHOD FOR DETERMINATION OF SOLUTION STRUCTURE OF PROTEINS FROM NUCLEAR-MAGNETIC-RESONANCE DATA
METHODS IN ENZYMOLOGY
1989; 177: 218-?
View details for Web of Science ID A1989CW82100011
View details for PubMedID 2691845
-
HEURISTIC REFINEMENT METHOD FOR THE DERIVATION OF PROTEIN SOLUTION STRUCTURES - VALIDATION ON CYTOCHROME B-562
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES
1988; 28 (4): 194-210
Abstract
A method is described for determining the family of protein structures compatible with solution data obtained primarily from nuclear magnetic resonance (NMR) spectroscopy. Starting with all possible conformations, the method systematically excludes conformations until the remaining structures are only those compatible with the data. The apparent computational intractability of this approach is reduced by assembling the protein in pieces, by considering the protein at several levels of abstraction, by utilizing constraint satisfaction methods to consider only a few atoms at a time, and by utilizing artificial intelligence methods of heuristic control to decide which actions will exclude the most conformations. Example results are presented for simulated NMR data from the known crystal structure of cytochrome b562 (103 residues). For 10 sample backbones an average root-mean-square deviation from the crystal of 4.1 Å was found for all alpha-carbon atoms and 2.8 Å for helix alpha-carbons alone. The 10 backbones define the family of all structures compatible with the data and provide nearly correct starting structures for adjustment by any of the current structure determination methods.
View details for Web of Science ID A1988R230100006
View details for PubMedID 3235473
- The Heuristic Refinement Method for the Derivation of Protein Solution Structures: Validation on Cytochrome-b562. Journal of Chemical Info. & Computer Sciences 1988; 4 (28): 194-210
- Positive Strand RNA Viruses 1987
-
NEW STRATEGIES FOR THE DETERMINATION OF MACROMOLECULAR STRUCTURE IN SOLUTION
JOURNAL OF BIOCHEMISTRY
1986; 100 (6): 1403-1423
Abstract
Non-crystallographic approaches to the determination of protein structure must solve the problem of insufficient and low information content experimental data. Most successful methods augment experimentation with theoretical constraints (for example, potential energy functions or optimization error metrics). We believe it is important to separate the contributions of experimentation and theory in the construction of protein structure. The PROTEAN system defines protein topology on the basis of experimental data alone. Its performance on three data sets, derived from the lac-repressor headpiece of E. coli, sperm whale myoglobin, and domain 1 of bacteriophage T4 lysozyme, indicates that there may be families of related conformations that are consistent with the experimental data. These conformations provide insight into the strengths and weaknesses in the data sets. They also provide a set of structures with which to begin theoretical refinements. We outline here a strategy which maintains a clear distinction between refinements based on theory and those based on experiment, and thus allows a careful analysis of the properties of such refinement methods.
View details for Web of Science ID A1986F079500001
View details for PubMedID 3553167
- PROTEAN: A New Method of Deriving Solution Structures of Proteins. Bulletin of Magnetic Resonance 1986; 8: 111-119
-
QUATERNARY STRUCTURAL-CHANGES IN ASPARTATE CARBAMOYLTRANSFERASE OF ESCHERICHIA-COLI AT PH 8.3 AND PH 5.8
BIOCHEMICAL AND BIOPHYSICAL RESEARCH COMMUNICATIONS
1982; 108 (2): 592-595
View details for Web of Science ID A1982PJ75900022
View details for PubMedID 6756403