Dr. Michel Dumontier is an Associate Professor of Medicine (Biomedical Informatics) at Stanford University. His research focuses on the development of computational methods to increase our understanding of how living systems respond to chemical agents. At the core of the research program is the development and use of Semantic Web technologies to formally represent and reason about data and services so as (1) to facilitate the publishing, sharing and discovery of scientific knowledge, (2) to enable the formulation and evaluation of scientific hypotheses and (3) to create and make available computational methods to investigate the structure, function and behaviour of living systems. Dr. Dumontier serves as a co-chair for the World Wide Web Consortium Semantic Web for Health Care and Life Sciences Interest Group (W3C HCLSIG) and is the Scientific Director for Bio2RDF, a widely recognized open-source project to create and provide linked data for life sciences.

Boards, Advisory Committees, Professional Organizations

  • Chair, World Wide Web Consortium (W3C) Semantic Web for Health Care and Life Sciences Interest Group (2011 - Present)
  • Advisory Committee Representative, World Wide Web Consortium (W3C) (2014 - Present)

Professional Education

  • Postdoc, Samuel Lunenfeld Research Institute, Bioinformatics (2005)
  • BSc, University of Manitoba, Biochemistry (1998)
  • PhD, University of Toronto, Biochemistry (Bioinformatics) (2004)

Current Research and Scholarly Interests

The Dumontier laboratory for biomedical knowledge discovery develops computational methods to better understand how living systems respond to chemical agents. We use semantic technologies to integrate and analyze large biomedical data and enable knowledge-based discoveries in biology, biochemistry and medicine. Our major research interests include i) drug repositioning using large scale animal model data, ii) elucidating the mechanism by which complex phenotypes (e.g. side effects) arise from consumption of pharmaceutical products, iii) determining the extent to which drug metabolic products contribute to toxicity, iv) optimizing novel drug therapeutic regimes so as to minimize undesirable side effects and v) understanding the systemic basis of an altered response due to genetic variation. We develop novel methods to accurately capture, publish, discover and re-use biomedical data, ontologies and services using formal knowledge representation and automated reasoning.

Graduate and Fellowship Programs

  • Biomedical Informatics (PhD Program)

All Publications

  • Making Digital Artifacts on the Web Verifiable and Reliable IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING Kuhn, T., Dumontier, M. 2015; 27 (9): 2390-2400
  • PubChemRDF: towards the semantic annotation of PubChem compound and substance databases JOURNAL OF CHEMINFORMATICS Fu, G., Batchelor, C., Dumontier, M., Hastings, J., Willighagen, E., Bolton, E. 2015; 7


    PubChem is an open repository for chemical structures, biological activities and biomedical annotations. Semantic Web technologies are emerging as an increasingly important approach to distribute and integrate scientific data. Exposing PubChem data to Semantic Web services may help enable automated data integration and management, as well as facilitate interoperable web applications. This work, one of a series covering the PubChemRDF project, describes an approach to translate PubChem Substance and Compound information into Resource Description Framework (RDF) format. Basic examples are provided to demonstrate its use. The aim of this effort is to provide two new primary benefits to researchers in a cost-effective manner. Firstly, we aim to remove the inherent limitations of using the web-based resource PubChem by allowing a researcher to use readily available semantic technologies (namely, RDF triple stores and their corresponding SPARQL query engines) to query and analyze PubChem data on local computing resources. Secondly, this work intends to help improve data sharing, analysis, and integration of PubChem data to resources external to NCBI and across scientific domains, by means of the association of PubChem data to existing ontological frameworks, including CHEMical INFormation ontology, Semanticscience Integrated Ontology, and others. With the goal of semantically describing information available in the PubChem archive, pre-existing ontological frameworks were used, rather than creating new ones. Semantic relationships between compounds and substances, chemical descriptors associated with compounds and substances, interrelationships between chemicals, as well as provenance and attribute metadata of substances are described. Graphical abstract: Schematic representation of the semantic links for PubChem compounds and substances.

    View details for DOI 10.1186/s13321-015-0084-4

    View details for Web of Science ID 000357782400001

    View details for PubMedID 26175801

  • Toward a complete dataset of drug-drug interaction information from publicly available sources JOURNAL OF BIOMEDICAL INFORMATICS Ayvaz, S., Horn, J., Hassanzadeh, O., Zhu, Q., Stan, J., Tatonetti, N. P., Vilar, S., Brochhausen, M., Samwald, M., Rastegar-Mojarad, M., Dumontier, M., Boyce, R. D. 2015; 55: 206-217


    Although potential drug-drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources including 5 clinically-oriented information sources, 4 Natural Language Processing (NLP) Corpora, and 5 Bioinformatics/Pharmacovigilance information sources. As a comprehensive PDDI source, the merged dataset might benefit the pharmacovigilance text mining community by making it possible to compare the representativeness of NLP corpora for PDDI text extraction tasks, and specifying elements that can be useful for future PDDI extraction purposes. An analysis of the overlap between and across the data sources showed that there was little overlap. Even comprehensive PDDI lists such as DrugBank, KEGG, and the NDF-RT had less than 50% overlap with each other. Moreover, all of the comprehensive lists had incomplete coverage of two data sources that focus on PDDIs of interest in most clinical settings. Based on this information, we think that systems that provide access to the comprehensive lists, such as APIs into RxNorm, should be careful to inform users that the lists may be incomplete with respect to PDDIs that drug experts suggest clinicians be aware of. In spite of the low degree of overlap, several dozen cases were identified where PDDI information provided in drug product labeling might be augmented by the merged dataset. Moreover, the combined dataset was also shown to improve the performance of an existing PDDI NLP pipeline and a recently published PDDI pharmacovigilance protocol. Future work will focus on improvement of the methods for mapping between PDDI information sources, identifying methods to improve the use of the merged dataset in PDDI NLP algorithms, integrating high-quality PDDI information from the merged dataset into Wikidata, and making the combined dataset accessible as Semantic Web Linked Data.

    View details for DOI 10.1016/j.jbi.2015.04.006

    View details for Web of Science ID 000356564700020

    View details for PubMedID 25917055

  • SPARQL-enabled identifier conversion with BIOINFORMATICS Wimalaratne, S. M., Bolleman, J., Juty, N., Katayama, T., Dumontier, M., Redaschi, N., Le Novere, N., Hermjakob, H., Laibe, C. 2015; 31 (11): 1875-1877


    On the semantic web, in life sciences in particular, data is often distributed via multiple resources. Each of these sources is likely to use its own Internationalized Resource Identifier (IRI) for conceptually the same resource or database record. The lack of correspondence between identifiers introduces a barrier when executing federated SPARQL queries across life science data. We introduce a novel SPARQL-based service to enable on-the-fly integration of life science data. This service uses the identifier patterns defined in the Registry to generate a plurality of identifier variants, which can then be used to match source identifiers with target identifiers. We demonstrate the utility of this identifier integration approach by answering queries across major producers of life science Linked Data. The SPARQL-based identifier conversion service is available without restriction at

    View details for DOI 10.1093/bioinformatics/btv064

    View details for Web of Science ID 000356625300033

    View details for PubMedID 25638809
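
    The identifier-variant idea in the abstract above can be illustrated with a small sketch. This is not the published service: the `URI_PATTERNS` table and the function names are hypothetical, and the three UniProt templates are assumed here only for illustration, whereas the real service takes its patterns from a registry and exposes them through SPARQL.

```python
# Hypothetical table of URI templates per data collection (an assumption
# for illustration; the actual service reads its patterns from a registry).
URI_PATTERNS = {
    "uniprot": [
        "http://identifiers.org/uniprot/{id}",
        "http://purl.uniprot.org/uniprot/{id}",
        "http://bio2rdf.org/uniprot:{id}",
    ],
}


def split_uri(uri):
    """Return (collection, local_id) if the URI matches a known template, else None."""
    for collection, templates in URI_PATTERNS.items():
        for template in templates:
            # Every template in this sketch ends with the {id} placeholder,
            # so the text before it is a fixed prefix.
            prefix = template.split("{id}")[0]
            if uri.startswith(prefix) and len(uri) > len(prefix):
                return collection, uri[len(prefix):]
    return None


def variants(uri):
    """List every known spelling of the record that `uri` refers to."""
    parsed = split_uri(uri)
    if parsed is None:
        return [uri]  # unknown pattern: nothing to convert
    collection, local_id = parsed
    return [t.format(id=local_id) for t in URI_PATTERNS[collection]]


if __name__ == "__main__":
    for v in variants("http://purl.uniprot.org/uniprot/P12345"):
        print(v)
```

    Generating all variants of a source identifier lets a federated query match records in a target dataset that spells the same identifier differently.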

  • GFVO: the Genomic Feature and Variation Ontology PEERJ Baran, J., Durgahee, B. S., Eilbeck, K., Antezana, E., Hoehndorf, R., Dumontier, M. 2015; 3


    Falling costs in genomic laboratory experiments have led to a steady increase of genomic feature and variation data. Multiple genomic data formats exist for sharing these data, and whilst they are similar, they are addressing slightly different data viewpoints and are consequently not fully compatible with each other. The fragmentation of data format specifications makes it hard to integrate and interpret data for further analysis with information from multiple data providers. As a solution, a new ontology is presented here for annotating and representing genomic feature and variation dataset contents. The Genomic Feature and Variation Ontology (GFVO) specifically addresses genomic data as it is regularly shared using the GFF3 (incl. FASTA), GTF, GVF and VCF file formats. GFVO simplifies data integration and enables linking of genomic annotations across datasets through common semantics of genomic types and relations. Availability and implementation. The latest stable release of the ontology is available via its base URI; previous and development versions are available at the ontology's GitHub repository:; versions of the ontology are indexed through BioPortal (without external class-/property-equivalences due to BioPortal release 4.10 limitations); examples and reference documentation is provided on a separate web-page: GFVO version 1.0.2 is licensed under the CC0 1.0 Universal license ( and therefore de facto within the public domain; the ontology can be appropriated without attribution for commercial and non-commercial use.

    View details for DOI 10.7717/peerj.933

    View details for Web of Science ID 000354192400004

    View details for PubMedID 26019997

  • Ranking Adverse Drug Reactions With Crowdsourcing JOURNAL OF MEDICAL INTERNET RESEARCH Gottlieb, A., Hoehndorf, R., Dumontier, M., Altman, R. B. 2015; 17 (3)

    View details for DOI 10.2196/jmir.3962

    View details for Web of Science ID 000356780900020

  • Pharmacogenomic knowledge representation, reasoning and genome-based clinical decision support based on OWL 2 DL ontologies BMC MEDICAL INFORMATICS AND DECISION MAKING Samwald, M., Gimenez, J. A., Boyce, R. D., Freimuth, R. R., Adlassnig, K., Dumontier, M. 2015; 15


    Every year, hundreds of thousands of patients experience treatment failure or adverse drug reactions (ADRs), many of which could be prevented by pharmacogenomic testing. However, the primary knowledge needed for clinical pharmacogenomics is currently dispersed over disparate data structures and captured in unstructured or semi-structured formalizations. This is a source of potential ambiguity and complexity, making it difficult to create reliable information technology systems for enabling clinical pharmacogenomics. We developed Web Ontology Language (OWL) ontologies and automated reasoning methodologies to meet the following goals: 1) provide a simple and concise formalism for representing pharmacogenomic knowledge, 2) find errors and insufficient definitions in pharmacogenomic knowledge bases, 3) automatically assign alleles and phenotypes to patients, 4) match patients to clinically appropriate pharmacogenomic guidelines and clinical decision support messages and 5) facilitate the detection of inconsistencies and overlaps between pharmacogenomic treatment guidelines from different sources. We evaluated different reasoning systems and tested our approach with a large collection of publicly available genetic profiles. Our methodology proved to be a novel and useful choice for representing, analyzing and using pharmacogenomic data. The Genomic Clinical Decision Support (Genomic CDS) ontology represents 336 SNPs with 707 variants; 665 haplotypes related to 43 genes; 22 rules related to drug-response phenotypes; and 308 clinical decision support rules. OWL reasoning identified CDS rules with overlapping target populations but differing treatment recommendations. Only a modest number of clinical decision support rules were triggered for a collection of 943 public genetic profiles. We found significant performance differences across available OWL reasoners. The ontology-based framework we developed can be used to represent, organize and reason over the growing wealth of pharmacogenomic knowledge, as well as to identify errors, inconsistencies and insufficient definitions in source data sets or individual patient data. Our study highlights both advantages and potential practical issues with such an ontology-based approach.

    View details for DOI 10.1186/s12911-015-0130-1

    View details for Web of Science ID 000350000300001

    View details for PubMedID 25880555

  • An evidence-based approach to identify aging-related genes in Caenorhabditis elegans BMC BIOINFORMATICS Callahan, A., Cifuentes, J. J., Dumontier, M. 2015; 16
  • Finding Our Way through Phenotypes PLOS BIOLOGY Deans, A. R., Lewis, S. E., Huala, E., Anzaldo, S. S., Ashburner, M., Balhoff, J. P., Blackburn, D. C., Blake, J. A., Burleigh, J. G., Chanet, B., Cooper, L. D., Courtot, M., Csoesz, S., Cui, H., Dahdul, W., Das, S., Dececchi, T. A., Dettai, A., Diogo, R., Druzinsky, R. E., Dumontier, M., Franz, N. M., Friedrich, F., Gkouto, G. V., Haendel, M., Harmon, L. J., Hayamizu, T. F., He, Y., Hines, H. M., Ibrahim, N., Jackson, L. M., Jaiswal, P., James-Zorn, C., Koehler, S., Lecointre, G., Lapp, H., Lawrence, C. J., Le Novere, N., Lundberg, J. G., Macklin, J., Mast, A. R., Midford, P. E., Miko, I., Mungall, C. J., Oellrich, A., Osumi-Sutherland, D., Parkinson, H., Ramirez, M. J., Richter, S., Robinson, P. N., Ruttenberg, A., Schulz, K. S., Segerdell, E., Seltmann, K. C., Sharkey, M. J., Smith, A. D., Smith, B., Specht, C. D., Squires, R. B., Thacker, R. W., Thessen, A., Fernandez-Triana, J., Vihinen, M., Vize, P. D., Vogt, L., Wall, C. E., Walls, R. L., Westerfeld, M., Wharton, R. A., Wirkner, C. S., Woolley, J. B., Yoder, M. J., Zorn, A. M., Mabee, P. M. 2015; 13 (1)


    Despite a large and multifaceted effort to understand the vast landscape of phenotypic data, their current form inhibits productive data analysis. The lack of a community-wide, consensus-based, human- and machine-interpretable language for describing phenotypes and their genomic and environmental contexts is perhaps the most pressing scientific bottleneck to integration across many key fields in biology, including genomics, systems biology, development, medicine, evolution, ecology, and systematics. Here we survey the current phenomics landscape, including data resources and handling, and the progress that has been made to accurately capture relevant data descriptions for phenotypes. We present an example of the kind of integration across domains that computable phenotypes would enable, and we call upon the broader biology community, publishers, and relevant funding agencies to support efforts to surmount today's data barriers and facilitate analytical reproducibility.

    View details for DOI 10.1371/journal.pbio.1002033

    View details for Web of Science ID 000349169900001

    View details for PubMedID 25562316

  • An Ebola virus-centered knowledge base. Database : the journal of biological databases and curation Kamdar, M. R., Dumontier, M. 2015; 2015


    Ebola virus (EBOV), of the family Filoviridae viruses, is a NIAID category A, lethal human pathogen. It is responsible for causing Ebola virus disease (EVD) that is a severe hemorrhagic fever and has a cumulative death rate of 41% in the ongoing epidemic in West Africa. There is an ever-increasing need to consolidate and make available all the knowledge that we possess on EBOV, even if it is conflicting or incomplete. This would enable biomedical researchers to understand the molecular mechanisms underlying this disease and help develop tools for efficient diagnosis and effective treatment. In this article, we present our approach for the development of an Ebola virus-centered Knowledge Base (Ebola-KB) using Linked Data and Semantic Web Technologies. We retrieve and aggregate knowledge from several open data sources, web services and biomedical ontologies. This knowledge is transformed to RDF, linked to the Bio2RDF datasets and made available through a SPARQL 1.1 Endpoint. Ebola-KB can also be explored using an interactive Dashboard visualizing the different perspectives of this integrated knowledge. We showcase how different competency questions, asked by domain users researching the druggability of EBOV, can be formulated as SPARQL Queries or answered using the Ebola-KB Dashboard. Database URL:

    View details for DOI 10.1093/database/bav049

    View details for PubMedID 26055098

  • Automating Identification of Multiple Chronic Conditions in Clinical Practice Guidelines. AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Science Leung, T. I., Jalal, H., Zulman, D. M., Dumontier, M., Owens, D. K., Musen, M. A., Goldstein, M. K. 2015; 2015: 456-460


    Many clinical practice guidelines (CPGs) are intended to provide evidence-based guidance to clinicians on a single disease, and are frequently considered inadequate when caring for patients with multiple chronic conditions (MCC), or two or more chronic conditions. It is unclear to what degree disease-specific CPGs provide guidance about MCC. In this study, we develop a method for extracting knowledge from single-disease chronic condition CPGs to determine how frequently they mention commonly co-occurring chronic diseases. We focus on 15 highly prevalent chronic conditions. We use publicly available resources, including a repository of guideline summaries from the National Guideline Clearinghouse to build a text corpus, a data dictionary of ICD-9 codes from the Medicare Chronic Conditions Data Warehouse (CCW) to construct an initial list of disease terms, and disease synonyms from the National Center for Biomedical Ontology to enhance the list of disease terms. First, for each disease guideline, we determined the frequency of comorbid condition mentions (a disease-comorbidity pair) by exactly matching disease synonyms in the text corpus. Then, we developed an annotated reference standard using a sample subset of guidelines. We used this reference standard to evaluate our approach. Then, we compared the co-prevalence of common pairs of chronic conditions from Medicare CCW data to the frequency of disease-comorbidity pairs in CPGs. Our results show that some disease-comorbidity pairs occur more frequently in CPGs than others. Sixty-one (29.0%) of 210 possible disease-comorbidity pairs occurred zero times; for example, no guideline on chronic kidney disease mentioned depression, while heart failure guidelines mentioned ischemic heart disease the most frequently. Our method adequately identifies comorbid chronic conditions in CPG recommendations with precision 0.82, recall 0.75, and F-measure 0.78. Our work identifies knowledge currently embedded in the free text of clinical practice guideline recommendations and provides an initial view of the extent to which CPGs mention common comorbid conditions. Knowledge extracted from CPG text in this way may be useful to inform gaps in guideline recommendations regarding MCC and therefore identify potential opportunities for guideline improvement.
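
    The figures quoted in the abstract above can be checked with a little arithmetic. The sketch below is a reader's check, not the authors' code. It assumes that the 210 pairs are the ordered (guideline disease, comorbidity) pairs over the 15 conditions and that the F-measure is the balanced F1 score; the abstract states neither explicitly.

```python
def ordered_pairs(n_conditions):
    """Number of (disease, comorbidity) pairs where the two conditions differ."""
    return n_conditions * (n_conditions - 1)


def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pairs = ordered_pairs(15)            # assumed reading of "210 possible pairs"
    share_never_mentioned = 61 / pairs   # pairs that occurred zero times
    f1 = f_measure(0.82, 0.75)           # precision and recall from the abstract

    print(f"possible pairs: {pairs}")
    print(f"never mentioned: {share_never_mentioned:.1%}")
    print(f"F-measure: {f1:.2f}")
```

    Under these assumptions the script reproduces the reported 210 pairs, 29.0% and 0.78.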

    View details for PubMedID 26306285

  • Ranking adverse drug reactions with crowdsourcing. Journal of medical Internet research Gottlieb, A., Hoehndorf, R., Dumontier, M., Altman, R. B. 2015; 17 (3)


    There is no publicly available resource that provides the relative severity of adverse drug reactions (ADRs). Such a resource would be useful for several applications, including assessment of the risks and benefits of drugs and improvement of patient-centered care. It could also be used to triage predictions of drug adverse events. The intent of the study was to rank ADRs according to severity. We used Internet-based crowdsourcing to rank ADRs according to severity. We assigned 126,512 pairwise comparisons of ADRs to 2589 Amazon Mechanical Turk workers and used these comparisons to rank order 2929 ADRs. There is good correlation (rho=.53) between the mortality rates associated with ADRs and their rank. Our ranking highlights severe drug-ADR predictions, such as cardiovascular ADRs for raloxifene and celecoxib. It also triages genes associated with severe ADRs such as epidermal growth-factor receptor (EGFR), associated with glioblastoma multiforme, and SCN1A, associated with epilepsy. ADR ranking lays a first stepping stone in personalized drug risk assessment. Ranking of ADRs using crowdsourcing may have useful clinical and financial implications, and should be further investigated in the context of health care decision making.

    View details for DOI 10.2196/jmir.3962

    View details for PubMedID 25800813

  • Automatically exposing OpenLifeData via SADI semantic Web Services JOURNAL OF BIOMEDICAL SEMANTICS Rodriguez Gonzalez, A., Callahan, A., Cruz-Toledo, J., Garcia, A., Egana Aranguren, M., Dumontier, M., Wilkinson, M. D. 2014; 5
  • Bridging Islands of Information to Establish an Integrated Knowledge Base of Drugs and Health Outcomes of Interest DRUG SAFETY Boyce, R. D., Ryan, P. B., Noren, G. N., Schuemie, M. J., Reich, C., Duke, J., Tatonetti, N. P., Trifiro, G., Harpaz, R., Overhage, J. M., Hartzema, A. G., Khayter, M., Voss, E. A., Lambert, C. G., Huser, V., Dumontier, M. 2014; 37 (8): 557-567


    The entire drug safety enterprise has a need to search, retrieve, evaluate, and synthesize scientific evidence more efficiently. This discovery and synthesis process would be greatly accelerated through access to a common framework that brings all relevant information sources together within a standardized structure. This presents an opportunity to establish an open-source community effort to develop a global knowledge base, one that brings together and standardizes all available information for all drugs and all health outcomes of interest (HOIs) from all electronic sources pertinent to drug safety. To make this vision a reality, we have established a workgroup within the Observational Health Data Sciences and Informatics (OHDSI, collaborative. The workgroup's mission is to develop an open-source standardized knowledge base for the effects of medical products and an efficient procedure for maintaining and expanding it. The knowledge base will make it simpler for practitioners to access, retrieve, and synthesize evidence so that they can reach a rigorous and accurate assessment of causal relationships between a given drug and HOI. Development of the knowledge base will proceed with the measurable goal of supporting an efficient and thorough evidence-based assessment of the effects of 1,000 active ingredients across 100 HOIs. This non-trivial task will result in a high-quality and generally applicable drug safety knowledge base. It will also yield a reference standard of drug-HOI pairs that will enable more advanced methodological research that empirically evaluates the performance of drug safety analysis methods.

    View details for DOI 10.1007/s40264-014-0189-0

    View details for Web of Science ID 000344614700001

    View details for PubMedID 24985530

  • Mouse model phenotypes provide information about human drug targets BIOINFORMATICS Hoehndorf, R., Hiebert, T., Hardy, N. W., Schofield, P. N., Gkoutos, G. V., Dumontier, M. 2014; 30 (5): 719-725


    Methods for computational drug target identification use information from diverse information sources to predict or prioritize drug targets for known drugs. One set of resources that has been relatively neglected for drug repurposing is animal model phenotype. We investigate the use of mouse model phenotypes for drug target identification. To achieve this goal, we first integrate mouse model phenotypes and drug effects, and then systematically compare the phenotypic similarity between mouse models and drug effect profiles. We find a high similarity between phenotypes resulting from loss-of-function mutations and drug effects resulting from the inhibition of a protein through a drug action, and demonstrate how this approach can be used to suggest candidate drug targets. Analysis code and supplementary data files are available on the project Web site at

    View details for DOI 10.1093/bioinformatics/btt613

    View details for Web of Science ID 000332259300018

    View details for PubMedID 24158600

  • BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains JOURNAL OF BIOMEDICAL SEMANTICS Katayama, T., Wilkinson, M. D., Aoki-Kinoshita, K. F., Kawashima, S., Yamamoto, Y., Yamaguchi, A., Okamoto, S., Kawano, S., Kim, J., Wang, Y., Wu, H., Kano, Y., Ono, H., Bono, H., Kocbek, S., Aerts, J., Akune, Y., Antezana, E., Arakawa, K., Aranda, B., Baran, J., Bolleman, J., Bonnal, R. J., Buttigieg, P. L., Campbell, M. P., Chen, Y., Chiba, H., Cock, P. J., Cohen, K. B., Constantin, A., Duck, G., Dumontier, M., Fujisawa, T., Fujiwara, T., Goto, N., Hoehndorf, R., Igarashi, Y., Itaya, H., Ito, M., Iwasaki, W., Kalas, M., Katoda, T., Kim, T., Kokubu, A., Komiyama, Y., Kotera, M., Laibe, C., Lapp, H., Luetteke, T., Marshall, M. S., Mori, T., Mori, H., Morita, M., Murakami, K., Nakao, M., Narimatsu, H., Nishide, H., Nishimura, Y., Nystrom-Persson, J., Ogishima, S., Okamura, Y., Okuda, S., Oshita, K., Packer, N. H., Prins, P., Ranzinger, R., Rocca-Serra, P., Sansone, S., Sawaki, H., Shin, S., Splendiani, A., Strozzi, F., Tadaka, S., Toukach, P., Uchiyama, I., Umezaki, M., Vos, R., Whetzel, P. L., Yamada, I., Yamasaki, C., Yamashita, R., York, W. S., Zmasek, C. M., Kawamoto, S., Takagi, T. 2014; 5


    The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.

    View details for DOI 10.1186/2041-1480-5-5

    View details for Web of Science ID 000343707900001

    View details for PubMedID 24495517

  • The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery. Journal of biomedical semantics Dumontier, M., Baker, C. J., Baran, J., Callahan, A., Chepelev, L., Cruz-Toledo, J., Del Rio, N. R., Duck, G., Furlong, L. I., Keath, N., Klassen, D., McCusker, J. P., Queralt-Rosinach, N., Samwald, M., Villanueva-Rosales, N., Wilkinson, M. D., Hoehndorf, R. 2014; 5 (1): 14-?


    The Semanticscience Integrated Ontology (SIO) is an ontology to facilitate biomedical knowledge discovery. SIO features a simple upper level comprised of essential types and relations for the rich description of arbitrary (real, hypothesized, virtual, fictional) objects, processes and their attributes. SIO specifies simple design patterns to describe and associate qualities, capabilities, functions, quantities, and informational entities including textual, geometrical, and mathematical entities, and provides specific extensions in the domains of chemistry, biology, biochemistry, and bioinformatics. SIO provides an ontological foundation for the Bio2RDF linked data for the life sciences project and is used for semantic integration and discovery for SADI-based semantic web services. SIO is freely available to all users under a creative commons by attribution license. See website for further information:

    View details for DOI 10.1186/2041-1480-5-14

    View details for PubMedID 24602174

  • Drug-drug interaction data source survey and linking. AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Science Ayvaz, S., Zhu, Q., Hochheiser, H., Brochhausen, M., Horn, J., Dumontier, M., Samwald, M., Boyce, R. D. 2014; 2014: 16-?


    As an initial step towards the goal of a common data model for potential drug-drug interactions, we surveyed the data elements provided by the publicly available sources. Our analysis found that there is very little overlap between or across publicly available resources and that the information provided is very heterogeneous.

    View details for PubMedID 25717393

  • Evaluation of research in biomedical ontologies BRIEFINGS IN BIOINFORMATICS Hoehndorf, R., Dumontier, M., Gkoutos, G. V. 2013; 14 (6): 696-712


    Ontologies are now pervasive in biomedicine, where they serve as a means to standardize terminology, to enable access to domain knowledge, to verify data consistency and to facilitate integrative analyses over heterogeneous biomedical data. For this purpose, research on biomedical ontologies applies theories and methods from diverse disciplines such as information management, knowledge representation, cognitive science, linguistics and philosophy. Depending on the desired applications in which ontologies are being applied, the evaluation of research in biomedical ontologies must follow different strategies. Here, we provide a classification of research problems in which ontologies are being applied, focusing on the use of ontologies in basic and translational research, and we demonstrate how research results in biomedical ontologies can be evaluated. The evaluation strategies depend on the desired application and measure the success of using an ontology for a particular biomedical problem. For many applications, the success can be quantified, thereby facilitating the objective evaluation and comparison of research in biomedical ontology. The objective, quantifiable comparison of research results based on scientific applications opens up the possibility for systematically improving the utility of ontologies in biomedical research.

    View details for DOI 10.1093/bib/bbs053

    View details for Web of Science ID 000327436000003

    View details for PubMedID 22962340

  • Evaluation of the OQuaRE framework for ontology quality EXPERT SYSTEMS WITH APPLICATIONS Duque-Ramos, A., Tomas Fernandez-Breis, J., Iniesta, M., Dumontier, M., Egana Aranguren, M., Schulz, S., Aussenac-Gilles, N., Stevens, R. 2013; 40 (7): 2696-2703
  • Ontology-Based Querying with Bio2RDF's Linked Open Data. Journal of biomedical semantics Callahan, A., Cruz-Toledo, J., Dumontier, M. 2013; 4: S1-?


    A key activity for life scientists in this post "-omics" age involves searching for and integrating biological data from a multitude of independent databases. However, our ability to find relevant data is hampered by non-standard web and database interfaces backed by an enormous variety of data formats. This heterogeneity presents an overwhelming barrier to the discovery and reuse of resources which have been developed at great public expense. To address this issue, the open-source Bio2RDF project promotes a simple convention to integrate diverse biological data using Semantic Web technologies. However, querying Bio2RDF remains difficult due to the lack of uniformity in the representation of Bio2RDF datasets. We describe an update to Bio2RDF that includes tighter integration across 19 new and updated RDF datasets. All available open-source scripts were first consolidated to a single GitHub repository and then redeveloped using a common API that generates normalized IRIs using a centralized dataset registry. We then mapped dataset specific types and relations to the Semanticscience Integrated Ontology (SIO) and demonstrate simplified federated queries across multiple Bio2RDF endpoints. This coordinated release marks an important milestone for the Bio2RDF open source linked data framework. Principally, it improves the quality of linked data in the Bio2RDF network and makes it easier to access or recreate the linked data locally. We hope to continue improving the Bio2RDF network of linked data by identifying priority databases and increasing the vocabulary coverage to additional dataset vocabularies beyond SIO.

    View details for DOI 10.1186/2041-1480-4-S1-S1

    View details for PubMedID 23735196

  • Selected papers from the 15th Annual Bio-Ontologies Special Interest Group Meeting. Journal of biomedical semantics Soldatova, L. N., Sansone, S., Dumontier, M., Shah, N. H. 2013; 4: I1-?


    Over the past 15 years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in bio-ontologies development, its applications to biomedicine and more generally the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The seven papers and the commentary selected for this supplement span a wide range of topics including: web-based querying over multiple ontologies, integration of data, annotating patent records, NCBO Web services, ontology developments for probabilistic reasoning and for physiological processes, and analysis of the progress of annotation and structural GO changes.

    View details for DOI 10.1186/2041-1480-4-S1-I1

    View details for PubMedID 23735191

  • State of the art and open challenges in community-driven knowledge curation JOURNAL OF BIOMEDICAL INFORMATICS Groza, T., Tudorache, T., Dumontier, M. 2013; 46 (1): 1-4

    View details for DOI 10.1016/j.jbi.2012.11.007

    View details for Web of Science ID 000315362300001

    View details for PubMedID 23219718

  • The SADI Personal Health Lens: A Web Browser-Based System for Identifying Personally Relevant Drug Interactions. JMIR research protocols Vandervalk, B., McCarthy, E. L., Cruz-Toledo, J., Klein, A., Baker, C. J., Dumontier, M., Wilkinson, M. D. 2013; 2 (1)


    The Web provides widespread access to vast quantities of health-related information that can improve quality-of-life through better understanding of personal symptoms, medical conditions, and available treatments. Unfortunately, identifying a credible and personally relevant subset of information can be a time-consuming and challenging task for users without a medical background. The objective of the Personal Health Lens system is to aid users when reading health-related webpages by providing warnings about personally relevant drug interactions. More broadly, we wish to present a prototype for a novel, generalizable approach to facilitating interactions between a patient, their practitioner(s), and the Web. We utilized a distributed, Semantic Web-based architecture for recognizing personally dangerous drugs consisting of: (1) a private, local triple store of personal health information, (2) Semantic Web services, following the Semantic Automated Discovery and Integration (SADI) design pattern, for text mining and identifying substance interactions, (3) a bookmarklet to trigger analysis of a webpage and annotate it with personalized warnings, and (4) a semantic query that acts as an abstract template of the analytical workflow to be enacted by the system. A prototype implementation of the system is provided in the form of a Java standalone executable JAR file. The JAR file bundles all components of the system: the personal health database, locally-running versions of the SADI services, and a JavaScript bookmarklet that triggers analysis of a webpage. In addition, the demonstration includes a hypothetical personal health profile, allowing the system to be used immediately without configuration. Usage instructions are provided. The main strength of the Personal Health Lens system is its ability to organize medical information and to present it to the user in a personalized and contextually relevant manner. While this prototype was limited to a single knowledge domain (drug/drug interactions), the proposed architecture is generalizable, and could act as the foundation for much richer personalized-health-Web clients, while importantly providing a novel and personalizable mechanism for clinical experts to inject their expertise into the browsing experience of their patients in the form of customized semantic queries and ontologies.

    View details for DOI 10.2196/resprot.2315

    View details for PubMedID 23612187

  • An RDF/OWL Knowledge Base for Query Answering and Decision Support in Clinical Pharmacogenetics MEDINFO 2013: PROCEEDINGS OF THE 14TH WORLD CONGRESS ON MEDICAL AND HEALTH INFORMATICS, PTS 1 AND 2 Samwald, M., Freimuth, R., Luciano, J. S., Lin, S., Powers, R. L., Marshall, M. S., Adlassnig, K., Dumontier, M., Boyce, R. D. 2013; 192: 539-542


    Genetic testing for personalizing pharmacotherapy is bound to become an important part of clinical routine. To address associated issues with data management and quality, we are creating a semantic knowledge base for clinical pharmacogenetics. The knowledge base is made up of three components: an expressive ontology formalized in the Web Ontology Language (OWL 2 DL), a Resource Description Framework (RDF) model for capturing detailed results of manual annotation of pharmacogenomic information in drug product labels, and an RDF conversion of relevant biomedical datasets. Our work goes beyond the state of the art in that it makes both automated reasoning as well as query answering as simple as possible, and the reasoning capabilities go beyond the capabilities of previously described ontologies.

    View details for DOI 10.3233/978-1-61499-289-9-539

    View details for Web of Science ID 000341021700110

    View details for PubMedID 23920613

  • Linked Data in Drug Discovery IEEE INTERNET COMPUTING Dumontier, M., Wild, D. J. 2012; 16 (6): 68-71
  • Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics BIOINFORMATICS Hoehndorf, R., Dumontier, M., Gkoutos, G. V. 2012; 28 (16): 2169-2175


    Many complex diseases are the result of abnormal pathway functions instead of single abnormalities. Disease diagnosis and intervention strategies must target these pathways while minimizing the interference with normal physiological processes. Large-scale identification of disease pathways and chemicals that may be used to perturb them requires the integration of information about drugs, genes, diseases and pathways. This information is currently distributed over several pharmacogenomics databases. An integrated analysis of the information in these databases can reveal disease pathways and facilitate novel biomedical analyses. We demonstrate how to integrate pharmacogenomics databases through integration of the biomedical ontologies that are used as meta-data in these databases. The additional background knowledge in these ontologies can then be used to enable novel analyses. We identify disease pathways using a novel multi-ontology enrichment analysis over the Human Disease Ontology, and we identify significant associations between chemicals and pathways using an enrichment analysis over a chemical ontology. The drug-pathway and disease-pathway associations are a valuable resource for research in disease and drug mechanisms and can be used to improve computational drug repurposing.

    View details for DOI 10.1093/bioinformatics/bts350

    View details for Web of Science ID 000307501100011

    View details for PubMedID 22711793

  • Aptamer base: a collaborative knowledge base to describe aptamers and SELEX experiments DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION Cruz-Toledo, J., McKeague, M., Zhang, X., Giamberardino, A., McConnell, E., Francis, T., DeRosa, M. C., Dumontier, M. 2012


    Over the past several decades, rapid developments in both molecular and information technology have collectively increased our ability to understand molecular recognition. One emerging area of interest in molecular recognition research includes the isolation of aptamers. Aptamers are single-stranded nucleic acid or amino acid polymers that recognize and bind to targets with high affinity and selectivity. While research has focused on collecting aptamers and their interactions, most of the information regarding experimental methods remains in the unstructured and textual format of peer reviewed publications. To address this, we present the Aptamer Base, a database that provides detailed, structured information about the experimental conditions under which aptamers were selected and their binding affinity quantified. The open collaborative nature of the Aptamer Base provides the community with a unique resource that can be updated and curated in a decentralized manner, thereby accommodating the ever evolving field of aptamer research. DATABASE URL:

    View details for DOI 10.1093/database/bas006

    View details for Web of Science ID 000304920200004

    View details for PubMedID 22434840

  • Self-organizing ontology of biochemically relevant small molecules BMC BIOINFORMATICS Chepelev, L. L., Hastings, J., Ennis, M., Steinbeck, C., Dumontier, M. 2012; 13


    The advent of high-throughput experimentation in biochemistry has led to the generation of vast amounts of chemical data, necessitating the development of novel analysis, characterization, and cataloguing techniques and tools. Recently, a movement to publicly release such data has advanced biochemical structure-activity relationship research, while providing new challenges, the biggest being the curation, annotation, and classification of this information to facilitate useful biochemical pattern analysis. Unfortunately, the human resources currently employed by the organizations supporting these efforts (e.g. ChEBI) are expanding linearly, while new useful scientific information is being released in a seemingly exponential fashion. Compounding this, currently existing chemical classification and annotation systems are not amenable to automated classification, formal and transparent chemical class definition axiomatization, facile class redefinition, or novel class integration, thus further limiting chemical ontology growth by necessitating human involvement in curation. Clearly, there is a need for the automation of this process, especially for novel chemical entities of biological interest. To address this, we present a formal framework based on Semantic Web technologies for the automatic design of chemical ontology which can be used for automated classification of novel entities. We demonstrate the automatic self-assembly of a structure-based chemical ontology based on 60 MeSH and 40 ChEBI chemical classes. This ontology is then used to classify 200 compounds with an accuracy of 92.7%. We extend these structure-based classes with molecular feature information and demonstrate the utility of our framework for classification of functionally relevant chemicals. Finally, we discuss an iterative approach that we envision for future biochemical ontology development. We conclude that the proposed methodology can ease the burden of chemical data annotators and dramatically increase their productivity. We anticipate that the use of formal logic in our proposed framework will make chemical classification criteria more transparent to humans and machines alike and will thus facilitate predictive and integrative bioactivity model development.

    View details for DOI 10.1186/1471-2105-13-3

    View details for Web of Science ID 000299825400001

    View details for PubMedID 22221313

  • Building an HIV data mashup using Bio2RDF BRIEFINGS IN BIOINFORMATICS Nolin, M., Dumontier, M., Belleau, F., Corbeil, J. 2012; 13 (1): 98-106


    We present an update to the Bio2RDF Linked Data Network, which now comprises ∼30 billion statements across 30 data sets. Significant changes to the framework include the accommodation of global mirrors, offline data processing and new search and integration services. The utility of this new network of knowledge is illustrated through a Bio2RDF-based mashup with microarray gene expression results and interaction data obtained from the HIV-1, Human Protein Interaction Database (HHPID) with respect to the infection of human macrophages with the human immunodeficiency virus type 1 (HIV-1).

    View details for DOI 10.1093/bib/bbr003

    View details for Web of Science ID 000298888200007

    View details for PubMedID 22223742

  • Semantically enabling pharmacogenomic data for the realization of personalized medicine PHARMACOGENOMICS Samwald, M., Coulet, A., Huerga, I., Powers, R. L., Luciano, J. S., Freimuth, R. R., Whipple, F., Pichler, E., Prud'hommeaux, E., Dumontier, M., Marshall, M. S. 2012; 13 (2): 201-212


    Understanding how each individual's genetics and physiology influence pharmaceutical response is crucial to the realization of personalized medicine, and the discovery and validation of pharmacogenomic biomarkers is key to its success. However, integration of genotype and phenotype knowledge in medical information systems remains a critical challenge. The inability to easily and accurately integrate the results of biomolecular studies with patients' medical records and clinical reports prevents us from realizing the full potential of pharmacogenomic knowledge for both drug development and clinical practice. Herein, we describe approaches using Semantic Web technologies, in which pharmacogenomic knowledge relevant to drug development and medical decision support is represented in such a way that it can be efficiently accessed both by software and human experts. We suggest that this approach increases the utility of data, and that such computational technologies will become an essential part of personalized medicine, alongside diagnostics and pharmaceutical products.

    View details for DOI 10.2217/PGS.11.179

    View details for Web of Science ID 000299864000016

    View details for PubMedID 22256869

  • Selected papers from the 14th Annual Bio-Ontologies Special Interest Group Meeting. Journal of biomedical semantics Soldatova, L. N., Sansone, S., Dumontier, M., Shah, N. H. 2012; 3: I1-?


    Over the past 14 years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in bio-ontologies development, its applications to biomedicine and more generally the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The seven papers selected for this supplement span a wide range of topics including: web-based querying over multiple ontologies, integration of data from wikis, innovative methods of annotating and mining electronic health records, advances in annotating web documents and biomedical literature, quality control of ontology alignments, and the ontology support for predictive models about toxicity and open access to the toxicity data.

    View details for DOI 10.1186/2041-1480-3-S1-I1

    View details for PubMedID 22541591

  • The Chemical Information Ontology: Provenance and Disambiguation for Chemical Data on the Biological Semantic Web PLOS ONE Hastings, J., Chepelev, L., Willighagen, E., Adams, N., Steinbeck, C., Dumontier, M. 2011; 6 (10)


    Cheminformatics is the application of informatics techniques to solve chemical problems in silico. There are many areas in biology where cheminformatics plays an important role in computational research, including metabolism, proteomics, and systems biology. One critical aspect in the application of cheminformatics in these fields is the accurate exchange of data, which is increasingly accomplished through the use of ontologies. Ontologies are formal representations of objects and their properties using a logic-based ontology language. Many such ontologies are currently being developed to represent objects across all the domains of science. Ontologies enable the definition, classification, and support for querying objects in a particular domain, enabling intelligent computer applications to be built which support the work of scientists both within the domain of interest and across interrelated neighbouring domains. Modern chemical research relies on computational techniques to filter and organise data to maximise research productivity. The objects which are manipulated in these algorithms and procedures, as well as the algorithms and procedures themselves, enjoy a kind of virtual life within computers. We will call these information entities. Here, we describe our work in developing an ontology of chemical information entities, with a primary focus on data-driven research and the integration of calculated properties (descriptors) of chemical entities within a semantic web context. Our ontology distinguishes algorithmic, or procedural information from declarative, or factual information, and renders of particular importance the annotation of provenance to calculated data. The Chemical Information Ontology is being developed as an open collaborative project. More details, together with a downloadable OWL file, are available at (license: CC-BY-SA).

    View details for DOI 10.1371/journal.pone.0025513

    View details for Web of Science ID 000295943000034

    View details for PubMedID 21991315

  • Controlled vocabularies and semantics in systems biology MOLECULAR SYSTEMS BIOLOGY Courtot, M., Juty, N., Knuepfer, C., Waltemath, D., Zhukova, A., Draeger, A., Dumontier, M., Finney, A., Golebiewski, M., Hastings, J., Hoops, S., Keating, S., Kell, D. B., Kerrien, S., Lawson, J., Lister, A., Lu, J., Machne, R., Mendes, P., Pocock, M., Rodriguez, N., Villeger, A., Wilkinson, D. J., Wimalaratne, S., Laibe, C., Hucka, M., Le Novere, N. 2011; 7


    The use of computational modeling to describe and analyze biological systems is at the heart of systems biology. Model structures, simulation descriptions and numerical results can be encoded in structured formats, but there is an increasing need to provide an additional semantic layer. Semantic information adds meaning to components of structured descriptions to help identify and interpret them unambiguously. Ontologies are one of the tools frequently used for this purpose. We describe here three ontologies created specifically to address the needs of the systems biology community. The Systems Biology Ontology (SBO) provides semantic information about the model components. The Kinetic Simulation Algorithm Ontology (KiSAO) supplies information about existing algorithms available for the simulation of systems biology models, their characterization and interrelationships. The Terminology for the Description of Dynamics (TEDDY) categorizes dynamical features of the simulation results and general systems behavior. The provision of semantic information extends a model's longevity and facilitates its reuse. It provides useful insight into the biology of modeled processes, and may be used to make informed decisions on subsequent simulation experiments.

    View details for DOI 10.1038/msb.2011.77

    View details for Web of Science ID 000296652600009

    View details for PubMedID 22027554

  • Integrating systems biology models and biomedical ontologies BMC SYSTEMS BIOLOGY Hoehndorf, R., Dumontier, M., Gennari, J. H., Wimalaratne, S., de Bono, B., Cook, D. L., Gkoutos, G. V. 2011; 5


    Systems biology is an approach to biology that emphasizes the structure and dynamic behavior of biological systems and the interactions that occur within them. To succeed, systems biology crucially depends on the accessibility and integration of data across domains and levels of granularity. Biomedical ontologies were developed to facilitate such an integration of data and are often used to annotate biosimulation models in systems biology. We provide a framework to integrate representations of in silico systems biology with those of in vivo biology as described by biomedical ontologies and demonstrate this framework using the Systems Biology Markup Language. We developed the SBML Harvester software that automatically converts annotated SBML models into OWL and we apply our software to those biosimulation models that are contained in the BioModels Database. We utilize the resulting knowledge base for complex biological queries that can bridge levels of granularity, verify models based on the biological phenomenon they represent and provide a means to establish a basic qualitative layer on which to express the semantics of biosimulation models. We establish an information flow between biomedical ontologies and biosimulation models and we demonstrate that the integration of annotated biosimulation models and biomedical ontologies enables the verification of models as well as expressive queries. Establishing a bi-directional information flow between systems biology and biomedical ontologies has the potential to enable large-scale analyses of biological systems that span levels of granularity from molecules to organisms.

    View details for DOI 10.1186/1752-0509-5-124

    View details for Web of Science ID 000294781500001

    View details for PubMedID 21835028

  • MoSuMo: A Semantic Web service to generate electrostatic potentials across solvent excluded protein surfaces and binding pockets COMPUTERS & GRAPHICS-UK Gawronski, A., Dumontier, M. 2011; 35 (4): 823-830
  • Prototype semantic infrastructure for automated small molecule classification and annotation in lipidomics BMC BIOINFORMATICS Chepelev, L. L., Riazanov, A., Kouznetsov, A., Low, H. S., Dumontier, M., Baker, C. J. 2011; 12


    The development of high-throughput experimentation has led to astronomical growth in biologically relevant lipids and lipid derivatives identified, screened, and deposited in numerous online databases. Unfortunately, efforts to annotate, classify, and analyze these chemical entities have largely remained in the hands of human curators using manual or semi-automated protocols, leaving many novel entities unclassified. Since chemical function is often closely linked to structure, accurate structure-based classification and annotation of chemical entities is imperative to understanding their functionality. As part of an exploratory study, we have investigated the utility of semantic web technologies in automated chemical classification and annotation of lipids. Our prototype framework consists of two components: an ontology and a set of federated web services that operate upon it. The formal lipid ontology we use here extends a part of the LiPrO ontology and draws on the lipid hierarchy in the LIPID MAPS database, as well as literature-derived knowledge. The federated semantic web services that operate upon this ontology are deployed within the Semantic Annotation, Discovery, and Integration (SADI) framework. Structure-based lipid classification is enacted by two core services. Firstly, a structural annotation service detects and enumerates relevant functional groups for a specified chemical structure. A second service reasons over lipid ontology class descriptions using the attributes obtained from the annotation service and identifies the appropriate lipid classification. We extend the utility of these core services by combining them with additional SADI services that retrieve associations between lipids and proteins and identify publications related to specified lipid types. We analyze the performance of SADI-enabled eicosanoid classification relative to the LIPID MAPS classification and reflect on the contribution of our integrative methodology in the context of high-throughput lipidomics. Our prototype framework is capable of accurate automated classification of lipids and facile integration of lipid class information with additional data obtained with SADI web services. The potential of programming-free integration of external web services through the SADI framework offers an opportunity for development of powerful novel applications in lipidomics. We conclude that semantic web technologies can provide an accurate and versatile means of classification and annotation of lipids.

    View details for DOI 10.1186/1471-2105-12-303

    View details for Web of Science ID 000294361500001

    View details for PubMedID 21791100

  • Interoperability between Biomedical Ontologies through Relation Expansion, Upper-Level Ontologies and Automatic Reasoning PLOS ONE Hoehndorf, R., Dumontier, M., Oellrich, A., Rebholz-Schuhmann, D., Schofield, P. N., Gkoutos, G. V. 2011; 6 (7)


    Researchers design ontologies as a means to accurately annotate and integrate experimental data across heterogeneous and disparate data- and knowledge bases. Formal ontologies make the semantics of terms and relations explicit such that automated reasoning can be used to verify the consistency of knowledge. However, many biomedical ontologies do not sufficiently formalize the semantics of their relations and are therefore limited with respect to automated reasoning for large scale data integration and knowledge discovery. We describe a method to improve automated reasoning over biomedical ontologies and identify several thousand contradictory class definitions. Our approach aligns terms in biomedical ontologies with foundational classes in a top-level ontology and formalizes composite relations as class expressions. We describe the semi-automated repair of contradictions and demonstrate expressive queries over interoperable ontologies. Our work forms an important cornerstone for data integration, automatic inference and knowledge discovery based on formal representations of knowledge. Our results and analysis software are available at

    View details for DOI 10.1371/journal.pone.0022006

    View details for Web of Science ID 000292812400024

    View details for PubMedID 21789201

  • Chemical Entity Semantic Specification: Knowledge representation for efficient semantic cheminformatics and facile data integration JOURNAL OF CHEMINFORMATICS Chepelev, L. L., Dumontier, M. 2011; 3


    Over the past several centuries, chemistry has permeated virtually every facet of human lifestyle, enriching fields as diverse as medicine, agriculture, manufacturing, warfare, and electronics, among numerous others. Unfortunately, application-specific, incompatible chemical information formats and representation strategies have emerged as a result of such diverse adoption of chemistry. Although a number of efforts have been dedicated to unifying the computational representation of chemical information, disparities between the various chemical databases still persist and stand in the way of cross-domain, interdisciplinary investigations. Through a common syntax and formal semantics, Semantic Web technology offers the ability to accurately represent, integrate, reason about and query across diverse chemical information. Here we specify and implement the Chemical Entity Semantic Specification (CHESS) for the representation of polyatomic chemical entities, their substructures, bonds, atoms, and reactions using Semantic Web technologies. CHESS provides means to capture aspects of their corresponding chemical descriptors, connectivity, functional composition, and geometric structure while specifying mechanisms for data provenance. We demonstrate that using our readily extensible specification, it is possible to efficiently integrate multiple disparate chemical data sources, while retaining appropriate correspondence of chemical descriptors, with very little additional effort. We demonstrate the impact of some of our representational decisions on the performance of chemically-aware knowledgebase searching and rudimentary reaction candidate selection. Finally, we provide access to the tools necessary to carry out chemical entity encoding in CHESS, along with a sample knowledgebase. By harnessing the power of Semantic Web technologies with CHESS, it is possible to provide a means of facile cross-domain chemical knowledge integration with full preservation of data correspondence and provenance. Our representation builds on existing cheminformatics technologies and, by the virtue of RDF specification, remains flexible and amenable to application- and domain-specific annotations without compromising chemical data integration. We conclude that the adoption of a consistent and semantically-enabled chemical specification is imperative for surviving the coming chemical data deluge and supporting systems science research.

    View details for DOI 10.1186/1758-2946-3-20

    View details for Web of Science ID 000300226500001

    View details for PubMedID 21595881

  • Semantic Web integration of Cheminformatics resources with the SADI framework JOURNAL OF CHEMINFORMATICS Chepelev, L. L., Dumontier, M. 2011; 3


    The diversity and the largely independent nature of chemical research efforts over the past half century are, most likely, the major contributors to the current poor state of chemical computational resource and database interoperability. While open software for chemical format interconversion and database entry cross-linking have partially addressed database interoperability, computational resource integration is hindered by the great diversity of software interfaces, languages, access methods, and platforms, among others. This has, in turn, translated into limited reproducibility of computational experiments and the need for application-specific computational workflow construction and semi-automated enactment by human experts, especially where emerging interdisciplinary fields, such as systems chemistry, are pursued. Fortunately, the advent of the Semantic Web, and the very recent introduction of RESTful Semantic Web Services (SWS) may present an opportunity to integrate all of the existing computational and database resources in chemistry into a machine-understandable, unified system that draws on the entirety of the Semantic Web. We have created a prototype framework of Semantic Automated Discovery and Integration (SADI) framework SWS that exposes the QSAR descriptor functionality of the Chemistry Development Kit. Since each of these services has formal ontology-defined input and output classes, and each service consumes and produces RDF graphs, clients can automatically reason about the services and available reference information necessary to complete a given overall computational task specified through a simple SPARQL query. We demonstrate this capability by carrying out QSAR analysis backed by a simple formal ontology to determine whether a given molecule is drug-like. Further, we discuss parameter-based control over the execution of SADI SWS. Finally, we demonstrate the value of computational resource envelopment as SADI services through service reuse and ease of integration of computational functionality into formal ontologies. The work we present here may trigger a major paradigm shift in the distribution of computational resources in chemistry. We conclude that envelopment of chemical computational resources as SADI SWS facilitates interdisciplinary research by enabling the definition of computational problems in terms of ontologies and formal logical statements instead of cumbersome and application-specific tasks and workflows.

    View details for DOI 10.1186/1758-2946-3-16

    View details for Web of Science ID 000300226200001

    View details for PubMedID 21575200

  • A common layer of interoperability for biomedical ontologies based on OWL EL BIOINFORMATICS Hoehndorf, R., Dumontier, M., Oellrich, A., Wimalaratne, S., Rebholz-Schuhmann, D., Schofield, P., Gkoutos, G. V. 2011; 27 (7): 1001-1008


    Ontologies are essential in biomedical research due to their ability to semantically integrate content from different scientific databases and resources. Their application improves capabilities for querying and mining biological knowledge. An increasing number of ontologies is being developed for this purpose, and considerable effort is invested into formally defining them in order to represent their semantics explicitly. However, current biomedical ontologies do not facilitate data integration and interoperability yet, since reasoning over these ontologies is very complex and cannot be performed efficiently or is even impossible. We propose the use of less expressive subsets of ontology representation languages to enable efficient reasoning and achieve the goal of genuine interoperability between ontologies. We present and evaluate EL Vira, a framework that transforms OWL ontologies into the OWL EL subset, thereby enabling the use of tractable reasoning. We illustrate which OWL constructs and inferences are kept and lost following the conversion and demonstrate the performance gain of reasoning indicated by the significant reduction of processing time. We applied EL Vira to the open biomedical ontologies and provide a repository of ontologies resulting from this conversion. EL Vira creates a common layer of ontological interoperability that, for the first time, enables the creation of software solutions that can employ biomedical ontologies to perform inferences and answer complex queries to support scientific analyses. Availability and implementation: The EL Vira software is available from and converted OBO ontologies and their mappings are available from

    View details for DOI 10.1093/bioinformatics/btr058

    View details for Web of Science ID 000289162000017

    View details for PubMedID 21343142

  • The Translational Medicine Ontology and Knowledge Base: driving personalized medicine by bridging the gap between bench and bedside. Journal of biomedical semantics Luciano, J. S., Andersson, B., Batchelor, C., Bodenreider, O., Clark, T., Denney, C. K., Domarew, C., Gambet, T., Harland, L., Jentzsch, A., Kashyap, V., Kos, P., Kozlovsky, J., Lebo, T., Marshall, S. M., McCusker, J. P., McGuinness, D. L., Ogbuji, C., Pichler, E., Powers, R. L., Prud'hommeaux, E., Samwald, M., Schriml, L., Tonellato, P. J., Whetzel, P. L., Zhao, J., Stephens, S., Dumontier, M. 2011; 2: S1-?


    Translational medicine requires the integration of knowledge using heterogeneous data from health care to the life sciences. Here, we describe a collaborative effort to produce a prototype Translational Medicine Knowledge Base (TMKB) capable of answering questions relating to clinical practice and pharmaceutical drug discovery. We developed the Translational Medicine Ontology (TMO) as a unifying ontology to integrate chemical, genomic and proteomic data with disease, treatment, and electronic health records. We demonstrate the use of Semantic Web technologies in the integration of patient and biomedical data, and reveal how such a knowledge base can aid physicians in providing tailored patient care and facilitate the recruitment of patients into active clinical trials. Thus, patients, physicians and researchers may explore the knowledge base to better understand therapeutic options, efficacy, and mechanisms of action. This work takes an important step in using Semantic Web technologies to facilitate integration of relevant, distributed, external sources and progress towards a computational platform to support personalized medicine. TMO can be downloaded from and TMKB can be accessed at

    View details for DOI 10.1186/2041-1480-2-S2-S1

    View details for PubMedID 21624155

  • HyQue: evaluating hypotheses using Semantic Web technologies. Journal of biomedical semantics Callahan, A., Dumontier, M., Shah, N. H. 2011; 2: S3-?


    Key to the success of e-Science is the ability to computationally evaluate expert-composed hypotheses for validity against experimental data. Researchers face the challenge of collecting, evaluating and integrating large amounts of diverse information to compose and evaluate a hypothesis. Confronted with rapidly accumulating data, researchers currently do not have the software tools to undertake the required information integration tasks. We present HyQue, a Semantic Web tool for querying scientific knowledge bases with the purpose of evaluating user submitted hypotheses. HyQue features a knowledge model to accommodate diverse hypotheses structured as events and represented using Semantic Web languages (RDF/OWL). Hypothesis validity is evaluated against experimental and literature-sourced evidence through a combination of SPARQL queries and evaluation rules. Inference over OWL ontologies (for type specifications, subclass assertions and parthood relations) and retrieval of facts stored as Bio2RDF linked data provide support for a given hypothesis. We evaluate hypotheses of varying levels of detail about the genetic network controlling galactose metabolism in Saccharomyces cerevisiae to demonstrate the feasibility of deploying such semantic computing tools over a growing body of structured knowledge in Bio2RDF. HyQue is a query-based hypothesis evaluation system that can currently evaluate hypotheses about the galactose metabolism in S. cerevisiae. Hypotheses as well as the supporting or refuting data are represented in RDF and directly linked to one another allowing scientists to browse from data to hypothesis and vice versa. HyQue hypotheses and data are available at

    View details for DOI 10.1186/2041-1480-2-S2-S3

    View details for PubMedID 21624158

  • Integration and publication of heterogeneous text-mined relationships on the Semantic Web. Journal of biomedical semantics Coulet, A., Garten, Y., Dumontier, M., Altman, R. B., Musen, M. A., Shah, N. H. 2011; 2: S10-?


    Advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and the complexity of natural language in expressing similar relationships causes the extracted relationships to be highly heterogeneous, which makes the construction of knowledge bases difficult and poses a challenge in using these for data mining or question answering. We report on the semi-automatic construction of the PHARE relationship ontology (the PHArmacogenomic RElationships Ontology) consisting of 200 curated relations from over 40,000 heterogeneous relationships extracted via text-mining. These heterogeneous relations are then mapped to the PHARE ontology using synonyms, entity descriptions and hierarchies of entities and roles. Once mapped, relationships can be normalized and compared using the structure of the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast the manual procedure with a fully automated approach using WordNet to quantify the degree of integration enabled by iterative curation and refinement of the PHARE ontology. The result of such integration is a repository of normalized biomedical relationships, named PHARE-KB, which can be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network. The PHARE ontology serves as a common semantic framework to integrate more than 40,000 relationships pertinent to pharmacogenomics. The PHARE ontology forms the foundation of a knowledge base named PHARE-KB. Once populated with relationships, PHARE-KB (i) can be visualized in the form of a biological network to guide human tasks such as database curation and (ii) can be queried programmatically to guide bioinformatics applications such as the prediction of molecular interactions. PHARE is available at

    View details for DOI 10.1186/2041-1480-2-S2-S10

    View details for PubMedID 21624156

  • The RNA Ontology (RNAO): An ontology for integrating RNA sequence and structure data APPLIED ONTOLOGY Hoehndorf, R., Batchelor, C., Bittner, T., Dumontier, M., Eilbeck, K., Knight, R., Mungall, C. J., Richardson, J. S., Stombaugh, J., Westhof, E., Zirbel, C. L., Leontis, N. B. 2011; 6 (1): 53-89
  • Towards an interoperable information infrastructure providing decision support for genomic medicine. Studies in health technology and informatics Samwald, M., Stenzhorn, H., Dumontier, M., Marshall, M. S., Luciano, J., Adlassnig, K. 2011; 169: 165-169


    Genetic dispositions play a major role in individual disease risk and treatment response. Genomic medicine, in which medical decisions are refined by genetic information of particular patients, is becoming increasingly important. Here we describe our work and future visions around the creation of a distributed infrastructure for pharmacogenetic data and medical decision support, based on industry standards such as the Web Ontology Language (OWL) and the Arden Syntax.

    View details for PubMedID 21893735

  • Computational approaches toward the design of pools for the in vitro selection of complex aptamers RNA-A PUBLICATION OF THE RNA SOCIETY Luo, X., McKeague, M., Pitre, S., Dumontier, M., Green, J., Golshani, A., DeRosa, M. C., Dehne, F. 2010; 16 (11): 2252-2262


    It is well known that using random RNA/DNA sequences for SELEX experiments will generally yield low-complexity structures. Early experimental results suggest that having a structurally diverse library, which, for instance, includes high-order junctions, may prove useful in finding new functional motifs. Here, we develop two computational methods to generate sequences that exhibit higher structural complexity and can be used to increase the overall structural diversity of initial pools for in vitro selection experiments. Random Filtering selectively increases the number of five-way junctions in RNA/DNA pools, and Genetic Filtering designs RNA/DNA pools to a specified structure distribution, whether uniform or otherwise. We show that using our computationally designed DNA pool greatly improves access to highly complex sequence structures for SELEX experiments (without losing our ability to select for common one-way and two-way junction sequences).

    View details for DOI 10.1261/rna.2102210

    View details for Web of Science ID 000283047900020

    View details for PubMedID 20870801

  • Relations as patterns: bridging the gap between OBO and OWL BMC BIOINFORMATICS Hoehndorf, R., Oellrich, A., Dumontier, M., Kelso, J., Rebholz-Schuhmann, D., Herre, H. 2010; 11


    Most biomedical ontologies are represented in the OBO Flatfile Format, which is an easy-to-use graph-based ontology language. The semantics of the OBO Flatfile Format 1.2 enforces a strict predetermined interpretation of relationship statements between classes. It does not allow flexible specifications that provide better approximations of the intuitive understanding of the considered relations. If relations cannot be accurately expressed then ontologies built upon them may contain false assertions and hence lead to false inferences. Ontologies in the OBO Foundry must formalize the semantics of relations according to the OBO Relationship Ontology (RO). Therefore, being able to accurately express the intended meaning of relations is of crucial importance. Since the Web Ontology Language (OWL) is an expressive language with a formal semantics, it is suitable to define the meaning of relations accurately. We developed a method to provide definition patterns for relations between classes using OWL and describe a novel implementation of the RO based on this method. We implemented our extension in software that converts ontologies in the OBO Flatfile Format to OWL, and also provide a prototype to extract relational patterns from OWL ontologies using automated reasoning. The conversion software is freely available at, and can be accessed via a web interface. Explicitly defining relations permits their use in reasoning software and leads to a more flexible and powerful way of representing biomedical ontologies. Using the extended language and semantics avoids several mistakes commonly made in formalizing biomedical ontologies, and can be used to automatically detect inconsistencies. The use of our method enables the use of graph-based ontologies in OWL, and makes complex OWL ontologies accessible in a graph-based form. Thereby, our method provides the means to gradually move the representation of biomedical ontologies into formal knowledge representation languages that incorporate an explicit semantics. Our method facilitates the use of OWL-based software in the back-end while ontology curators may continue to develop ontologies with an OBO-style front-end.

    View details for DOI 10.1186/1471-2105-11-441

    View details for Web of Science ID 000282631900001

    View details for PubMedID 20807438

  • Chemical entity semantic specification: Knowledge representation for efficient semantic cheminformatics and facile data integration ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY Chepelev, L. L., Dumontier, M. 2010; 240
  • Semantic envelopment of cheminformatics resources with SADI ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY Chepelev, L. L., Willighagen, E., Dumontier, M. 2010; 240
  • CHEMINF: Community-developed ontology of chemical information and algorithms ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY Chepelev, L. L., Hastings, J., Willighagen, E., Adams, N., Steinbeck, C., Murray-Rust, P., Dumontier, M. 2010; 240
  • Modeling and querying graphical representations of statistical data JOURNAL OF WEB SEMANTICS Dumontier, M., Ferres, L., Villanueva-Rosales, N. 2010; 8 (2-3): 241-254
  • RKB: a Semantic Web knowledge base for RNA. Journal of biomedical semantics Cruz-Toledo, J., Dumontier, M., Parisien, M., Major, F. 2010; 1: S2-?


    Increasingly sophisticated knowledge about RNA structure and function requires an inclusive knowledge representation that facilitates the integration of independently generated information arising from such efforts as genome sequencing projects, microarray analyses, structure determination and RNA SELEX experiments. While RNAML, an XML-based representation, has been proposed as an exchange format for a select subset of information, it lacks domain-specific semantics that are essential for answering questions that require expert knowledge. Here, we describe an RNA knowledge base (RKB) for structure-based knowledge using RDF/OWL Semantic Web technologies. RKB extends a number of ontologies and contains basic terminology for nucleic acid composition along with context/model-specific structural features such as sugar conformations, base pairings and base stackings. RKB (available at is populated with PDB entries and MC-Annotate structural annotation. We show queries to the RKB using description logic reasoning, thus opening the door to question answering over independently published RNA knowledge using Semantic Web technologies.

    View details for DOI 10.1186/2041-1480-1-S1-S2

    View details for PubMedID 20626922

  • Realism for scientific ontologies FORMAL ONTOLOGY IN INFORMATION SYSTEMS (FOIS 2010) Dumontier, M., Hoehndorf, R. 2010; 209: 387-399
  • Towards pharmacogenomics knowledge discovery with the semantic web BRIEFINGS IN BIOINFORMATICS Dumontier, M., Villanueva-Rosales, N. 2009; 10 (2): 153-163


    Pharmacogenomics aims to understand pharmacological response with respect to genetic variation. Essential to the delivery of better health care is the use of pharmacogenomics knowledge to answer questions about therapeutic, pharmacological or genetic aspects. Several XML markup languages have been developed to capture pharmacogenomic and related information so as to facilitate data sharing. However, recent advances in semantic web technologies have presented exciting new opportunities for pharmacogenomics knowledge discovery by representing the information with machine understandable semantics. Progress in this area is illustrated with reference to the personalized medicine project that aims to facilitate pharmacogenomics knowledge discovery through intuitive knowledge capture and sophisticated question answering using automated reasoning over expressive ontologies.

    View details for DOI 10.1093/bib/bbn056

    View details for Web of Science ID 000264388500005

    View details for PubMedID 19240125

  • yOWL: An ontology-driven knowledge base for yeast biologists JOURNAL OF BIOMEDICAL INFORMATICS Villanueva-Rosales, N., Dumontier, M. 2008; 41 (5): 779-789


    Knowledge management is an ongoing challenge for the biological community such that large, diverse and continuously growing information requires more sophisticated methods to store, integrate and query their knowledge. The semantic web initiative provides a new knowledge engineering framework to represent, share and discover information. In this paper, we describe our efforts towards the development of an ontology-based knowledge base, including aspects from ontology design and population using "semantic" data mashup, to automated reasoning and semantic query answering. Based on yeast data obtained from the Saccharomyces Genome Database and UniProt, we discuss the challenges encountered during the building of the knowledge base and how they were overcome.

    View details for DOI 10.1016/j.jbi.2008.05.001

    View details for Web of Science ID 000260137300010

    View details for PubMedID 18562252

  • Global investigation of protein-protein interactions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide sequences NUCLEIC ACIDS RESEARCH Pitre, S., North, C., Alamgir, M., Jessulat, M., Chan, A., Luo, X., Green, J. R., Dumontier, M., Dehne, F., Golshani, A. 2008; 36 (13): 4286-4294


    Protein-protein interaction (PPI) maps provide insight into cellular biology and have received considerable attention in the post-genomic era. While large-scale experimental approaches have generated large collections of experimentally determined PPIs, technical limitations preclude certain PPIs from detection. Recently, we demonstrated that yeast PPIs can be computationally predicted using re-occurring short polypeptide sequences between known interacting protein pairs. However, the computational requirements and low specificity made this method unsuitable for large-scale investigations. Here, we report an improved approach, which exhibits a specificity of approximately 99.95% and executes 16,000 times faster. Importantly, we report the first all-to-all sequence-based computational screen of PPIs in yeast, Saccharomyces cerevisiae, in which we identify 29,589 high confidence interactions of approximately 2 × 10^7 possible pairs. Of these, 14,438 PPIs have not been previously reported and may represent novel interactions. In particular, these results reveal a richer set of membrane protein interactions, not readily amenable to experimental investigations. From the novel PPIs, a novel putative protein complex comprised largely of membrane proteins was revealed. In addition, two novel gene functions were predicted and experimentally confirmed to affect the efficiency of non-homologous end-joining, providing further support for the usefulness of the identified PPIs in biological investigations.

    View details for DOI 10.1093/nar/gkn390

    View details for Web of Science ID 000257964500014

    View details for PubMedID 18586826

  • GridCell: a stochastic particle-based biological system simulator BMC SYSTEMS BIOLOGY Boulianne, L., Al Assaad, S., Dumontier, M., Gross, W. J. 2008; 2


    Realistic biochemical simulators aim to improve our understanding of many biological processes that would be otherwise very difficult to monitor in experimental studies. Increasingly accurate simulators may provide insights into the regulation of biological processes due to stochastic or spatial effects. We have developed GridCell as a three-dimensional simulation environment for investigating the behaviour of biochemical networks under a variety of spatial influences including crowding, recruitment and localization. GridCell enables the tracking and characterization of individual particles, leading to insights on the behaviour of low copy number molecules participating in signaling networks. The simulation space is divided into a discrete 3D grid that provides ideal support for particle collisions without distance calculation and particle search. SBML support enables existing networks to be simulated and visualized. The user interface provides intuitive navigation that facilitates insights into species behaviour across spatial and temporal dimensions. We demonstrate the effect of crowding on a Michaelis-Menten system. GridCell is an effective stochastic particle simulator designed to track the progress of individual particles in a three-dimensional space in which spatial influences such as crowding, co-localization and recruitment may be investigated.

    View details for DOI 10.1186/1752-0509-2-66

    View details for Web of Science ID 000258870400001

    View details for PubMedID 18651956
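    The GridCell abstract above notes that a discrete 3D grid supports particle collisions without distance calculation. A minimal sketch of that general idea follows; it is not GridCell's code, and the function names, the random-walk move rule, the periodic boundary and the "same voxel means collision" rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def step(particles, size, rng):
    """Move each particle one voxel along a random axis (assumed move rule),
    wrapping around at the box edge (assumed periodic boundary)."""
    moved = []
    for x, y, z in particles:
        pos = [x, y, z]
        axis = rng.randrange(3)
        pos[axis] = (pos[axis] + rng.choice((-1, 1))) % size
        moved.append(tuple(pos))
    return moved

def find_collisions(particles):
    """Group particle indices by voxel. A voxel holding two or more particles
    counts as a collision, so no pairwise distances are ever computed."""
    occupancy = defaultdict(list)
    for idx, voxel in enumerate(particles):
        occupancy[voxel].append(idx)
    return {voxel: ids for voxel, ids in occupancy.items() if len(ids) > 1}

rng = random.Random(0)
particles = [tuple(rng.randrange(4) for _ in range(3)) for _ in range(20)]
particles = step(particles, size=4, rng=rng)
collisions = find_collisions(particles)
```

    Because particles are keyed by integer voxel coordinates, collision detection is a single pass over the particle list rather than a comparison of every pair.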

  • Computational Methods For Predicting Protein-Protein Interactions PROTEIN - PROTEIN INTERACTION Pitre, S., Alamgir, M., Green, J. R., Dumontier, M., Dehne, F., Golshani, A. 2008; 110: 247-267


    Protein-protein interactions (PPIs) play a critical role in many cellular functions. A number of experimental techniques have been applied to discover PPIs; however, these techniques are expensive in terms of time, money, and expertise. There are also large discrepancies between the PPI data collected by the same or different techniques in the same organism. We therefore turn to computational techniques for the prediction of PPIs. Computational techniques have been applied to the collection, indexing, validation, analysis, and extrapolation of PPI data. This chapter will focus on computational prediction of PPI, reviewing a number of techniques including PIPE, developed in our own laboratory. For comparison, the conventional large-scale approaches to predict PPIs are also briefly discussed. The chapter concludes with a discussion of the limitations of both experimental and computational methods of determining PPIs.

    View details for DOI 10.1007/10_2007_089

    View details for Web of Science ID 000260375400011

    View details for PubMedID 18202838

  • Semantic Annotation and Question Answering of Statistical Graphs MICAI 2008: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS Dumontier, M., Ferres, L., Villanueva-Rosales, N. 2008; 5317: 100-110
  • Chemical knowledge for the Semantic Web DATA INTEGRATION IN THE LIFE SCIENCES, PROCEEDINGS Konyk, M., De Leon, A., Dumontier, M. 2008; 5109: 169-176
  • Semantic Query Answering with Time-Series Graphs 2007 11TH IEEE INTERNATIONAL ENTERPRISE DISTRIBUTED OBJECT COMPUTING CONFERENCE WORKSHOPS Ferres, L., Dumontier, M., Villanueva-Rosales, N. 2007: 117-124
  • Domain-based small molecule binding site annotation BMC BIOINFORMATICS Snyder, K. A., Feldman, H. J., Dumontier, M., Salama, J. J., Hogue, C. W. 2006; 7


    Accurate small molecule binding site information for a protein can facilitate studies in drug docking, drug discovery and function prediction, but small molecule binding site protein sequence annotation is sparse. The Small Molecule Interaction Database (SMID), a database of protein domain-small molecule interactions, was created using structural data from the Protein Data Bank (PDB). More importantly it provides a means to predict small molecule binding sites on proteins with a known or unknown structure and unlike prior approaches, removes large numbers of false positive hits arising from transitive alignment errors, non-biologically significant small molecules and crystallographic conditions that overpredict ion binding sites. Using a set of co-crystallized protein-small molecule structures as a starting point, SMID interactions were generated by identifying protein domains that bind to small molecules, using NCBI's Reverse Position Specific BLAST (RPS-BLAST) algorithm. SMID records are available for viewing at The SMID-BLAST tool provides accurate transitive annotation of small-molecule binding sites for proteins not found in the PDB. Given a protein sequence, SMID-BLAST identifies domains using RPS-BLAST and then lists potential small molecule ligands based on SMID records, as well as their aligned binding sites. A heuristic ligand score is calculated based on E-value, ligand residue identity and domain entropy to assign a level of confidence to hits found. SMID-BLAST predictions were validated against a set of 793 experimental small molecule interactions from the PDB, of which 472 (60%) of predicted interactions identically matched the experimental small molecule and of these, 344 had greater than 80% of the binding site residues correctly identified. Further, we estimate that 45% of predictions which were not observed in the PDB validation set may be true positives. By focusing on protein domain-small molecule interactions, SMID is able to cluster similar interactions and detect subtle binding patterns that would not otherwise be obvious. Using SMID-BLAST, small molecule targets can be predicted for any protein sequence, with the only limitation being that the small molecule must exist in the PDB. Validation results and specific examples within illustrate that SMID-BLAST has a high degree of accuracy in terms of predicting both the small molecule ligand and binding site residue positions for a query protein.

    View details for DOI 10.1186/1471-2105-7-152

    View details for Web of Science ID 000236762000001

    View details for PubMedID 16545112

  • CO: A chemical ontology for identification of functional groups and semantic comparison of small molecules FEBS LETTERS Feldman, H. J., Dumontier, M., Ling, S., Haider, N., Hogue, C. W. 2005; 579 (21): 4685-4691


    A novel chemical ontology based on chemical functional groups automatically, objectively assigned by a computer program, was developed to categorize small molecules. It has been applied to PubChem and the small molecule interaction database to demonstrate its utility as a basic pharmacophore search system. Molecules can be compared using a semantic similarity score based on functional group assignments rather than 3D shape, which succeeds in identifying small molecules known to bind a common binding site. This ontology will serve as a powerful tool for searching chemical databases and identifying key functional groups responsible for biological activities.

    View details for DOI 10.1016/j.febslet.2005.07.039

    View details for Web of Science ID 000231625800022

    View details for PubMedID 16098521

  • Armadillo: Domain boundary prediction by amino acid composition JOURNAL OF MOLECULAR BIOLOGY Dumontier, M., Yao, R., Feldman, H. J., Hogue, C. W. 2005; 350 (5): 1061-1073


    The identification and annotation of protein domains provides a critical step in the accurate determination of molecular function. Both computational and experimental methods of protein structure determination may be deterred by large multi-domain proteins or flexible linker regions. Knowledge of domains and their boundaries may reduce the experimental cost of protein structure determination by allowing researchers to work on a set of smaller and possibly more successful alternatives. Current domain prediction methods often rely on sequence similarity to conserved domains and as such are poorly suited to detect domain structure in poorly conserved or orphan proteins. We present here a simple computational method to identify protein domain linkers and their boundaries from sequence information alone. Our domain predictor, Armadillo (, uses any amino acid index to convert a protein sequence to a smoothed numeric profile from which domains and domain boundaries may be predicted. We derived an amino acid index called the domain linker propensity index (DLI) from the amino acid composition of domain linkers using a non-redundant structure dataset. The index indicates that Pro and Gly show a propensity for linker residues while small hydrophobic residues do not. Armadillo predicts domain linker boundaries from Z-score distributions and obtains 35% sensitivity with DLI in a two-domain, single-linker dataset (within +/-20 residues from linker). The combination of DLI and an entropy-based amino acid index increases the overall Armadillo sensitivity to 56% for two domain proteins. Moreover, Armadillo achieves 37% sensitivity for multi-domain proteins, surpassing most other prediction methods. Armadillo provides a simple, but effective method by which prediction of domain boundaries can be obtained with reasonable sensitivity. 
Armadillo should prove to be a valuable tool for rapidly delineating protein domains in poorly conserved proteins or those with no sequence neighbors. As a first-line predictor, domain meta-predictors could yield improved results with Armadillo predictions.

    View details for DOI 10.1016/j.jmb.2005.05.037

    View details for Web of Science ID 000230701300019

    View details for PubMedID 15978619
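    The Armadillo abstract above describes converting a protein sequence to a smoothed numeric profile with an amino acid index and predicting linker boundaries from Z-score distributions. A minimal sketch of that pipeline follows, under stated assumptions: the index values in TOY_INDEX are invented for illustration and are not the published DLI, and the window size, the Z-score threshold and all function names are hypothetical rather than Armadillo's actual parameters.

```python
from statistics import mean, pstdev

# Hypothetical index values for illustration only; NOT the published DLI.
TOY_INDEX = {"P": 1.0, "G": 0.8, "S": 0.3, "A": -0.5, "V": -0.6, "L": -0.4}

def smoothed_profile(sequence, index, window=5):
    """Average per-residue index values over a centered sliding window
    (the window is truncated at the sequence ends)."""
    values = [index.get(res, 0.0) for res in sequence]
    half = window // 2
    return [
        mean(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]

def z_scores(profile):
    """Standardize the profile; a flat profile yields all zeros."""
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        return [0.0] * len(profile)
    return [(v - mu) / sigma for v in profile]

def candidate_linker_positions(sequence, index, window=5, threshold=1.0):
    """Return positions whose smoothed score is at least `threshold`
    standard deviations above the profile mean (assumed decision rule)."""
    zs = z_scores(smoothed_profile(sequence, index, window))
    return [i for i, z in enumerate(zs) if z >= threshold]

# Toy sequence: two hydrophobic blocks joined by a Pro/Gly-rich stretch.
seq = "AVLAVLAVLAPGPGPGSAVLAVLAVLA"
positions = candidate_linker_positions(seq, TOY_INDEX)
```

    With these toy values the flagged positions fall inside the Pro/Gly-rich stretch, which mirrors the abstract's observation that Pro and Gly show a propensity for linker residues.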

  • The Biomolecular Interaction Network Database and related tools 2005 update NUCLEIC ACIDS RESEARCH Alfarano, C., Andrade, C. E., Anthony, K., Bahroos, N., Bajec, M., Bantoft, K., Betel, D., Bobechko, B., Boutilier, K., Burgess, E., Buzadzija, K., Cavero, R., D'Abreo, C., Donaldson, I., Dorairajoo, D., Dumontier, M. J., Dumontier, M. R., Earles, V., Farrall, R., Feldman, H., Garderman, E., Gong, Y., Gonzaga, R., Grytsan, V., Gryz, E., Gu, V., Haldorsen, E., Halupa, A., Haw, R., Hrvojic, A., Hurrell, L., Isserlin, R., Jack, F., Juma, F., Khan, A., Kon, T., Konopinsky, S., Le, V., Lee, E., Ling, S., Magidin, M., Moniakis, J., Montojo, J., Moore, S., Muskat, B., Ng, I., Paraiso, J. P., Parker, B., Pintilie, G., Pirone, R., Salama, J. J., Sgro, S., Shan, T., Shu, Y., Siew, J., Skinner, D., Snyder, K., Stasiuk, R., Strumpf, D., Tuekam, B., Tao, S., Wang, Z., White, M., Willis, R., Wolting, C., Wong, S., Wrong, A., Xin, C., Yao, R., Yates, B., Zhang, S., Zheng, K., Pawson, T., Ouellette, B. F., Hogue, C. W. 2005; 33: D418-D424


    The Biomolecular Interaction Network Database (BIND) ( archives biomolecular interaction, reaction, complex and pathway information. Our aim is to curate the details about molecular interactions that arise from published experimental research and to provide this information, as well as tools to enable data analysis, freely to researchers worldwide. BIND data are curated into a comprehensive machine-readable archive of computable information and provides users with methods to discover interactions and molecular mechanisms. BIND has worked to develop new methods for visualization that amplify the underlying annotation of genes and proteins to facilitate the study of molecular interaction networks. BIND has maintained an open database policy since its inception in 1999. Data growth has proceeded at a tremendous rate, approaching over 100 000 records. New services provided include a new BIND Query and Submission interface, a Standard Object Access Protocol service and the Small Molecule Interaction Database ( that allows users to determine probable small molecule binding sites of new sequences and examine conserved binding residues.

    View details for DOI 10.1093/nar/gki051

    View details for Web of Science ID 000226524300086

    View details for PubMedID 15608229

  • Hardware-accelerated protein identification for mass spectrometry RAPID COMMUNICATIONS IN MASS SPECTROMETRY Alex, A. T., Dumontier, M., Rose, J. S., Hogue, C. W. 2005; 19 (6): 833-837


    An ongoing issue in mass spectrometry is the time it takes to search DNA sequences with MS/MS peptide fragments (see, e.g., Choudary et al., Proteomics 2001; 1: 651-667). Search times are far longer than spectra acquisition time, and parallelization of search software on clusters requires doubling the size of a conventional computing cluster to cut the search time in half. Field programmable gate arrays (FPGAs) are used to create hardware-accelerated algorithms that reduce operating costs and improve search speed compared to large clusters. We present a novel hardware design that takes full spectra and computes 6-frame translation word searches on DNA databases at a rate of approximately 3 billion base pairs per second, with queries of up to 10 amino acids in length and arbitrary wildcard positions. Hardware post-processing identifies in silico tryptic peptides and scores them using a variety of techniques including mass frequency expected values. With faster FPGAs, protein identifications from the human genome can be achieved in less than a second, which makes the design well suited to a number of proteome-scale applications.

    View details for DOI 10.1002/rcm.1853

    View details for Web of Science ID 000227519500013

    View details for PubMedID 15723443
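As a software illustration of the "6-frame translation word search" mentioned in the abstract above, the following is a minimal sketch only. It is not the FPGA design from the paper: the function names, the `X` wildcard character and the use of a regular expression are all assumptions made for illustration, and the standard genetic code is assumed.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA codon by codon; unknown codons become '?'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "?") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


def search(dna, query, wildcard="X"):
    """Find a short peptide query in all six frames.

    The wildcard character (hypothetical choice: 'X') matches any single
    translated symbol, including a stop ('*').
    """
    body = "".join("." if ch == wildcard else re.escape(ch) for ch in query)
    pattern = re.compile(f"(?=({body}))")  # lookahead reports overlapping hits
    hits = []
    for name, protein in six_frames(dna).items():
        for match in pattern.finditer(protein):
            hits.append((name, match.start(), match.group(1)))
    return hits
```

For example, `search("ATGGCCAAGTAA", "MXK")` reports a hit for `MAK` at position 0 of frame `+1`. A hardware implementation would stream the database past many such comparators in parallel rather than scanning frame by frame.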

  • Species-specific protein sequence and fold optimizations BMC BIOINFORMATICS Dumontier, M., Michalickova, K., Hogue, C. W. 2002; 3


    An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes. Environmental niche was determined to be a significant factor in variability from correspondence analysis using the amino acid composition of over 360,000 predicted open reading frames (ORFs) from 17 archaea, 76 bacteria and 7 eukaryote complete genomes. Additionally, we found clusters of phylogenetically unrelated archaea and bacteria that share similar environments by amino acid composition clustering. Composition analyses of conservative, domain-based homology modeling suggested an enrichment of small hydrophobic residues Ala, Gly, Val and charged residues Asp, Glu, His and Arg across all genomes. However, larger aromatic residues Phe, Trp and Tyr are reduced in folds, and these results were not affected by low complexity biases. We derived two simple log-odds scoring functions from ORFs (CG) and folds (CF) for each of the complete genomes. CF achieved an average cross-validation success rate of 85 +/- 8% whereas the CG detected 73 +/- 9% species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available online. Analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events.

    View details for Web of Science ID 000181476800039

    View details for PubMedID 12487631
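The abstract above mentions log-odds scoring functions derived from amino acid composition but does not give their form. The sketch below shows one common way such a score could be built, assuming a per-residue log ratio of species frequency to background frequency with add-one pseudocounts. The function names, the pseudocount and the choice of background are illustrative assumptions, not the published CG or CF functions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences, pseudocount=1.0):
    """Relative amino acid frequencies with a pseudocount (assumed value)."""
    counts = Counter(ch for seq in sequences for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def log_odds_table(species_seqs, background_seqs):
    """Per-residue log(f_species / f_background)."""
    fs = composition(species_seqs)
    fb = composition(background_seqs)
    return {a: math.log(fs[a] / fb[a]) for a in AMINO_ACIDS}


def score(sequence, table):
    """Sum the per-residue log-odds; unknown symbols contribute zero."""
    return sum(table.get(ch, 0.0) for ch in sequence)
```

Under this scheme a sequence would be assigned to the genome whose table gives it the highest score, which is one way to read "competing against all other" scoring functions in the abstract.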

  • NBLAST: a cluster variant of BLAST for NxN comparisons BMC BIOINFORMATICS Dumontier, M., Hogue, C. W. 2002; 3


    The BLAST algorithm compares biological sequences to one another in order to determine shared motifs and common ancestry. However, the comparison of all non-redundant (NR) sequences against all other NR sequences is a computationally intensive task. We developed NBLAST as a cluster computer implementation of the BLAST family of sequence comparison programs for the purpose of generating pre-computed BLAST alignments and neighbour lists of NR sequences. NBLAST performs the heuristic BLAST algorithm and generates an exhaustive database of alignments, but it computes only the upper triangle of the N² possible alignments, where N is the number of sequences to be compared. A task-partitioning algorithm allows for cluster computing across all cluster nodes and the NBLAST master process produces a BLAST sequence alignment database and a list of sequence neighbours for each sequence record. The resulting sequence alignment and neighbour databases are used to serve the SeqHound query system through a C/C++ and PERL Application Programming Interface (API). NBLAST offers a local alternative to the NCBI's remote Entrez system for pre-computed BLAST alignments and neighbour queries. On our 216-processor 450 MHz PIII cluster, NBLAST requires ~24 hrs to compute neighbours for 850000 proteins currently in the non-redundant protein database.

    View details for Web of Science ID 000181476800013

    View details for PubMedID 12019022
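The upper-triangle idea in the NBLAST abstract above can be sketched as follows. Because an alignment of sequence i against j covers j against i, only pairs with i < j need to be computed, which is N(N-1)/2 pairs if self-comparisons are excluded. Whether NBLAST includes the diagonal, and how its task-partitioning algorithm actually divides the work, are not stated here, so the contiguous near-equal split below is purely an assumption.

```python
from itertools import combinations, islice


def upper_triangle_pairs(n):
    """Yield index pairs (i, j) with i < j for n sequences."""
    return combinations(range(n), 2)


def partition(n, workers):
    """Split the upper-triangle pairs into near-equal contiguous chunks.

    Hypothetical scheme for illustration; not NBLAST's actual partitioner.
    """
    total = n * (n - 1) // 2
    base, extra = divmod(total, workers)
    pairs = upper_triangle_pairs(n)
    chunks = []
    for worker in range(workers):
        size = base + (1 if worker < extra else 0)
        chunks.append(list(islice(pairs, size)))
    return chunks
```

For example, `partition(5, 3)` distributes the 10 pairs among three workers in chunks of 4, 3 and 3, so each worker runs roughly the same number of comparisons.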

  • SeqHound: biological sequence and structure database as a platform for bioinformatics research BMC BIOINFORMATICS Michalickova, K., Bader, G. D., Dumontier, M., Lieu, H., Betel, D., Isserlin, R., Hogue, C. W. 2002; 3


    SeqHound has been developed as an integrated biological sequence, taxonomy, annotation and 3-D structure database system. It provides a high-performance server platform for bioinformatics research in a locally-hosted environment. SeqHound is based on the National Center for Biotechnology Information data model and programming tools. It offers daily updated contents of all Entrez sequence databases in addition to 3-D structural data and information about sequence redundancies, sequence neighbours, taxonomy, complete genomes, functional annotation including Gene Ontology terms and literature links to PubMed. SeqHound is accessible via a web server through a Perl, C or C++ remote API or an optimized local API. It provides functionality necessary to retrieve specialized subsets of sequences, structures and structural domains. Sequences may be retrieved in FASTA, GenBank, ASN.1 and XML formats. Structures are available in ASN.1, XML and PDB formats. Emphasis has been placed on complete genomes, taxonomy, domain and functional annotation as well as 3-D structural functionality in the API, while fielded text indexing functionality remains under development. SeqHound also offers a streamlined WWW interface for simple web-user queries. The system has proven useful in several published bioinformatics projects such as the BIND database and offers a cost-effective infrastructure for research. SeqHound will continue to develop and be provided as a service of the Blueprint Initiative at the Samuel Lunenfeld Research Institute. The source code and examples are available under the terms of the GNU public license at the Sourceforge site in the SLRI Toolkit.

    View details for Web of Science ID 000181476800032

    View details for PubMedID 12401134