Dr. Michel Dumontier is an Associate Professor of Medicine (Biomedical Informatics) at Stanford University. His research focuses on the development of computational methods to increase our understanding of how living systems respond to chemical agents. At the core of the research program is the development and use of Semantic Web technologies to formally represent and reason about data and services so as (1) to facilitate the publishing, sharing and discovery of scientific knowledge, (2) to enable the formulation and evaluation scientific hypotheses and (3) to create and make available computational methods to investigate the structure, function and behaviour of living systems. Dr. Dumontier serves as a co-chair for the World Wide Web Consortium Semantic Web for Health Care and Life Sciences Interest Group (W3C HCLSIG) and is the Scientific Director for Bio2RDF, a widely recognized open-source project to create and provide linked data for life sciences.
Boards, Advisory Committees, Professional Organizations
Chair, World Wide Web Consortium (W3C) Semantic Web for Health Care and Life Sciences Interest Group (2011 - Present)
Postdoc, Samuel Lunenfeld Research Institute, Bioinformatics (2005)
BSc, University of Manitoba, Biochemistry (1998)
PhD, University of Toronto, Biochemistry (Bioinformatics) (2004)
Current Research and Scholarly Interests
The Dumontier laboratory for biomedical knowledge discovery develops computational methods to better understand how living systems respond to chemical agents. We use semantic technologies to integrate and analyze large biomedical data and enable knowledge-based discoveries in biology, biochemistry and medicine. Our major research interests include i) drug repositioning using large scale animal model data, ii) elucidating the mechanism by which complex phenotypes (e.g. side effects) arise from consumption of pharmaceutical products, iii) determining the extent to which drug metabolic products contribute to toxicity, iv) optimizing novel drug therapeutic regimes so as to minimize undesireable side effects and v) understanding the systemic basis of an altered response due to genetic variation. We develop novel methods to accurately capture, publish, discover and re-use biomedical data, ontologies and services using formal knowledge representation and automated reasoning.
- An evidence-based approach to identify aging-related genes in Caenorhabditis elegans BMC BIOINFORMATICS 2015; 16
Finding Our Way through Phenotypes.
2015; 13 (1)
Despite a large and multifaceted effort to understand the vast landscape of phenotypic data, their current form inhibits productive data analysis. The lack of a community-wide, consensus-based, human- and machine-interpretable language for describing phenotypes and their genomic and environmental contexts is perhaps the most pressing scientific bottleneck to integration across many key fields in biology, including genomics, systems biology, development, medicine, evolution, ecology, and systematics. Here we survey the current phenomics landscape, including data resources and handling, and the progress that has been made to accurately capture relevant data descriptions for phenotypes. We present an example of the kind of integration across domains that computable phenotypes would enable, and we call upon the broader biology community, publishers, and relevant funding agencies to support efforts to surmount today's data barriers and facilitate analytical reproducibility.
View details for DOI 10.1371/journal.pbio.1002033
View details for PubMedID 25562316
Bridging Islands of Information to Establish an Integrated Knowledge Base of Drugs and Health Outcomes of Interest
2014; 37 (8): 557-567
The entire drug safety enterprise has a need to search, retrieve, evaluate, and synthesize scientific evidence more efficiently. This discovery and synthesis process would be greatly accelerated through access to a common framework that brings all relevant information sources together within a standardized structure. This presents an opportunity to establish an open-source community effort to develop a global knowledge base, one that brings together and standardizes all available information for all drugs and all health outcomes of interest (HOIs) from all electronic sources pertinent to drug safety. To make this vision a reality, we have established a workgroup within the Observational Health Data Sciences and Informatics (OHDSI, http://ohdsi.org) collaborative. The workgroup's mission is to develop an open-source standardized knowledge base for the effects of medical products and an efficient procedure for maintaining and expanding it. The knowledge base will make it simpler for practitioners to access, retrieve, and synthesize evidence so that they can reach a rigorous and accurate assessment of causal relationships between a given drug and HOI. Development of the knowledge base will proceed with the measureable goal of supporting an efficient and thorough evidence-based assessment of the effects of 1,000 active ingredients across 100 HOIs. This non-trivial task will result in a high-quality and generally applicable drug safety knowledge base. It will also yield a reference standard of drug-HOI pairs that will enable more advanced methodological research that empirically evaluates the performance of drug safety analysis methods.
View details for DOI 10.1007/s40264-014-0189-0
View details for Web of Science ID 000344614700001
View details for PubMedID 24985530
BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains
JOURNAL OF BIOMEDICAL SEMANTICS
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
View details for DOI 10.1186/2041-1480-5-5
View details for Web of Science ID 000343707900001
View details for PubMedID 24495517
The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery.
Journal of biomedical semantics
2014; 5 (1): 14-?
The Semanticscience Integrated Ontology (SIO) is an ontology to facilitate biomedical knowledge discovery. SIO features a simple upper level comprised of essential types and relations for the rich description of arbitrary (real, hypothesized, virtual, fictional) objects, processes and their attributes. SIO specifies simple design patterns to describe and associate qualities, capabilities, functions, quantities, and informational entities including textual, geometrical, and mathematical entities, and provides specific extensions in the domains of chemistry, biology, biochemistry, and bioinformatics. SIO provides an ontological foundation for the Bio2RDF linked data for the life sciences project and is used for semantic integration and discovery for SADI-based semantic web services. SIO is freely available to all users under a creative commons by attribution license. See website for further information: http://sio.semanticscience.org.
View details for DOI 10.1186/2041-1480-5-14
View details for PubMedID 24602174
Drug-drug interaction data source survey and linking.
AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Science
2014; 2014: 16-?
As an initial step towards the goal of a common data model for potential drug-drug interactions, we surveyed the data elements provided by the publicly available sources. Our analysis found that there is very little overlap between or across publicly available resources and that the information provided is very heterogeneous.
View details for PubMedID 25717393
Mouse model phenotypes provide information about human drug targets.
Bioinformatics (Oxford, England)
2014; 30 (5): 719-25
Methods for computational drug target identification use information from diverse information sources to predict or prioritize drug targets for known drugs. One set of resources that has been relatively neglected for drug repurposing is animal model phenotype.We investigate the use of mouse model phenotypes for drug target identification. To achieve this goal, we first integrate mouse model phenotypes and drug effects, and then systematically compare the phenotypic similarity between mouse models and drug effect profiles. We find a high similarity between phenotypes resulting from loss-of-function mutations and drug effects resulting from the inhibition of a protein through a drug action, and demonstrate how this approach can be used to suggest candidate drug targets.Analysis code and supplementary data files are available on the project Web site at https://drugeffects.googlecode.com.
View details for DOI 10.1093/bioinformatics/btt613
View details for PubMedID 24158600
- Evaluation of the OQuaRE framework for ontology quality EXPERT SYSTEMS WITH APPLICATIONS 2013; 40 (7): 2696-2703
Selected papers from the 15th Annual Bio-Ontologies Special Interest Group Meeting.
Journal of biomedical semantics
2013; 4: I1-?
Over the 15 years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in the bio-ontologies development, its applications to biomedicine and more generally the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The seven papers and the commentary selected for this supplement span a wide range of topics including: web-based querying over multiple ontologies, integration of data, annotating patent records, NCBO Web services, ontology developments for probabilistic reasoning and for physiological processes, and analysis of the progress of annotation and structural GO changes.
View details for DOI 10.1186/2041-1480-4-S1-I1
View details for PubMedID 23735191
- State of the art and open challenges in community-driven knowledge curation JOURNAL OF BIOMEDICAL INFORMATICS 2013; 46 (1): 1-4
The SADI Personal Health Lens: A Web Browser-Based System for Identifying Personally Relevant Drug Interactions.
JMIR research protocols
2013; 2 (1): e14
View details for DOI 10.2196/resprot.2315
View details for PubMedID 23612187
An RDF/OWL knowledge base for query answering and decision support in clinical pharmacogenetics.
Studies in health technology and informatics
2013; 192: 539-42
Genetic testing for personalizing pharmacotherapy is bound to become an important part of clinical routine. To address associated issues with data management and quality, we are creating a semantic knowledge base for clinical pharmacogenetics. The knowledge base is made up of three components: an expressive ontology formalized in the Web Ontology Language (OWL 2 DL), a Resource Description Framework (RDF) model for capturing detailed results of manual annotation of pharmacogenomic information in drug product labels, and an RDF conversion of relevant biomedical datasets. Our work goes beyond the state of the art in that it makes both automated reasoning as well as query answering as simple as possible, and the reasoning capabilities go beyond the capabilities of previously described ontologies.
View details for PubMedID 23920613
Evaluation of research in biomedical ontologies.
Briefings in bioinformatics
2013; 14 (6): 696-712
Ontologies are now pervasive in biomedicine, where they serve as a means to standardize terminology, to enable access to domain knowledge, to verify data consistency and to facilitate integrative analyses over heterogeneous biomedical data. For this purpose, research on biomedical ontologies applies theories and methods from diverse disciplines such as information management, knowledge representation, cognitive science, linguistics and philosophy. Depending on the desired applications in which ontologies are being applied, the evaluation of research in biomedical ontologies must follow different strategies. Here, we provide a classification of research problems in which ontologies are being applied, focusing on the use of ontologies in basic and translational research, and we demonstrate how research results in biomedical ontologies can be evaluated. The evaluation strategies depend on the desired application and measure the success of using an ontology for a particular biomedical problem. For many applications, the success can be quantified, thereby facilitating the objective evaluation and comparison of research in biomedical ontology. The objective, quantifiable comparison of research results based on scientific applications opens up the possibility for systematically improving the utility of ontologies in biomedical research.
View details for DOI 10.1093/bib/bbs053
View details for PubMedID 22962340
Ontology-Based Querying with Bio2RDF's Linked Open Data.
Journal of biomedical semantics
2013; 4 Suppl 1: S1
A key activity for life scientists in this post "-omics" age involves searching for and integrating biological data from a multitude of independent databases. However, our ability to find relevant data is hampered by non-standard web and database interfaces backed by an enormous variety of data formats. This heterogeneity presents an overwhelming barrier to the discovery and reuse of resources which have been developed at great public expense.To address this issue, the open-source Bio2RDF project promotes a simple convention to integrate diverse biological data using Semantic Web technologies. However, querying Bio2RDF remains difficult due to the lack of uniformity in the representation of Bio2RDF datasets.We describe an update to Bio2RDF that includes tighter integration across 19 new and updated RDF datasets. All available open-source scripts were first consolidated to a single GitHub repository and then redeveloped using a common API that generates normalized IRIs using a centralized dataset registry. We then mapped dataset specific types and relations to the Semanticscience Integrated Ontology (SIO) and demonstrate simplified federated queries across multiple Bio2RDF endpoints.This coordinated release marks an important milestone for the Bio2RDF open source linked data framework. Principally, it improves the quality of linked data in the Bio2RDF network and makes it easier to access or recreate the linked data locally. We hope to continue improving the Bio2RDF network of linked data by identifying priority databases and increasing the vocabulary coverage to additional dataset vocabularies beyond SIO.
View details for DOI 10.1186/2041-1480-4-S1-S1
View details for PubMedID 23735196
Linked Data in Drug Discovery
IEEE INTERNET COMPUTING
2012; 16 (6): 68-71
View details for Web of Science ID 000310665100010
Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics
2012; 28 (16): 2169-2175
Many complex diseases are the result of abnormal pathway functions instead of single abnormalities. Disease diagnosis and intervention strategies must target these pathways while minimizing the interference with normal physiological processes. Large-scale identification of disease pathways and chemicals that may be used to perturb them requires the integration of information about drugs, genes, diseases and pathways. This information is currently distributed over several pharmacogenomics databases. An integrated analysis of the information in these databases can reveal disease pathways and facilitate novel biomedical analyses.We demonstrate how to integrate pharmacogenomics databases through integration of the biomedical ontologies that are used as meta-data in these databases. The additional background knowledge in these ontologies can then be used to enable novel analyses. We identify disease pathways using a novel multi-ontology enrichment analysis over the Human Disease Ontology, and we identify significant associations between chemicals and pathways using an enrichment analysis over a chemical ontology. The drug-pathway and disease-pathway associations are a valuable resource for research in disease and drug mechanisms and can be used to improve computational drug repurposing.http://pharmgkb-owl.googlecode.com
View details for DOI 10.1093/bioinformatics/bts350
View details for Web of Science ID 000307501100011
View details for PubMedID 22711793
Aptamer base: a collaborative knowledge base to describe aptamers and SELEX experiments
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
Over the past several decades, rapid developments in both molecular and information technology have collectively increased our ability to understand molecular recognition. One emerging area of interest in molecular recognition research includes the isolation of aptamers. Aptamers are single-stranded nucleic acid or amino acid polymers that recognize and bind to targets with high affinity and selectivity. While research has focused on collecting aptamers and their interactions, most of the information regarding experimental methods remains in the unstructured and textual format of peer reviewed publications. To address this, we present the Aptamer Base, a database that provides detailed, structured information about the experimental conditions under which aptamers were selected and their binding affinity quantified. The open collaborative nature of the Aptamer Base provides the community with a unique resource that can be updated and curated in a decentralized manner, thereby accommodating the ever evolving field of aptamer research. DATABASE URL: http://aptamer.freebase.com.
View details for DOI 10.1093/database/bas006
View details for Web of Science ID 000304920200004
View details for PubMedID 22434840
Self-organizing ontology of biochemically relevant small molecules
The advent of high-throughput experimentation in biochemistry has led to the generation of vast amounts of chemical data, necessitating the development of novel analysis, characterization, and cataloguing techniques and tools. Recently, a movement to publically release such data has advanced biochemical structure-activity relationship research, while providing new challenges, the biggest being the curation, annotation, and classification of this information to facilitate useful biochemical pattern analysis. Unfortunately, the human resources currently employed by the organizations supporting these efforts (e.g. ChEBI) are expanding linearly, while new useful scientific information is being released in a seemingly exponential fashion. Compounding this, currently existing chemical classification and annotation systems are not amenable to automated classification, formal and transparent chemical class definition axiomatization, facile class redefinition, or novel class integration, thus further limiting chemical ontology growth by necessitating human involvement in curation. Clearly, there is a need for the automation of this process, especially for novel chemical entities of biological interest.To address this, we present a formal framework based on Semantic Web technologies for the automatic design of chemical ontology which can be used for automated classification of novel entities. We demonstrate the automatic self-assembly of a structure-based chemical ontology based on 60 MeSH and 40 ChEBI chemical classes. This ontology is then used to classify 200 compounds with an accuracy of 92.7%. We extend these structure-based classes with molecular feature information and demonstrate the utility of our framework for classification of functionally relevant chemicals. Finally, we discuss an iterative approach that we envision for future biochemical ontology development.We conclude that the proposed methodology can ease the burden of chemical data annotators and dramatically increase their productivity. We anticipate that the use of formal logic in our proposed framework will make chemical classification criteria more transparent to humans and machines alike and will thus facilitate predictive and integrative bioactivity model development.
View details for DOI 10.1186/1471-2105-13-3
View details for Web of Science ID 000299825400001
View details for PubMedID 22221313
Selected papers from the 14th Annual Bio-Ontologies Special Interest Group Meeting.
Journal of biomedical semantics
2012; 3: I1-?
Over the 14 years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in the bio-ontologies development, its applications to biomedicine and more generally the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The seven papers selected for this supplement span a wide range of topics including: web-based querying over multiple ontologies, integration of data from wikis, innovative methods of annotating and mining electronic health records, advances in annotating web documents and biomedical literature, quality control of ontology alignments, and the ontology support for predictive models about toxicity and open access to the toxicity data.
View details for DOI 10.1186/2041-1480-3-S1-I1
View details for PubMedID 22541591
Semantically enabling pharmacogenomic data for the realization of personalized medicine.
2012; 13 (2): 201-12
Understanding how each individual's genetics and physiology influences pharmaceutical response is crucial to the realization of personalized medicine and the discovery and validation of pharmacogenomic biomarkers is key to its success. However, integration of genotype and phenotype knowledge in medical information systems remains a critical challenge. The inability to easily and accurately integrate the results of biomolecular studies with patients' medical records and clinical reports prevents us from realizing the full potential of pharmacogenomic knowledge for both drug development and clinical practice. Herein, we describe approaches using Semantic Web technologies, in which pharmacogenomic knowledge relevant to drug development and medical decision support is represented in such a way that it can be efficiently accessed both by software and human experts. We suggest that this approach increases the utility of data, and that such computational technologies will become an essential part of personalized medicine, alongside diagnostics and pharmaceutical products.
View details for DOI 10.2217/pgs.11.179
View details for PubMedID 22256869
Building an HIV data mashup using Bio2RDF
BRIEFINGS IN BIOINFORMATICS
2012; 13 (1): 98-106
We present an update to the Bio2RDF Linked Data Network, which now comprises ∼30 billion statements across 30 data sets. Significant changes to the framework include the accommodation of global mirrors, offline data processing and new search and integration services. The utility of this new network of knowledge is illustrated through a Bio2RDF-based mashup with microarray gene expression results and interaction data obtained from the HIV-1, Human Protein Interaction Database (HHPID) with respect to the infection of human macrophages with the human immunodeficiency virus type 1 (HIV-1).
View details for DOI 10.1093/bib/bbr003
View details for Web of Science ID 000298888200007
View details for PubMedID 22223742
The Chemical Information Ontology: Provenance and Disambiguation for Chemical Data on the Biological Semantic Web
2011; 6 (10)
Cheminformatics is the application of informatics techniques to solve chemical problems in silico. There are many areas in biology where cheminformatics plays an important role in computational research, including metabolism, proteomics, and systems biology. One critical aspect in the application of cheminformatics in these fields is the accurate exchange of data, which is increasingly accomplished through the use of ontologies. Ontologies are formal representations of objects and their properties using a logic-based ontology language. Many such ontologies are currently being developed to represent objects across all the domains of science. Ontologies enable the definition, classification, and support for querying objects in a particular domain, enabling intelligent computer applications to be built which support the work of scientists both within the domain of interest and across interrelated neighbouring domains. Modern chemical research relies on computational techniques to filter and organise data to maximise research productivity. The objects which are manipulated in these algorithms and procedures, as well as the algorithms and procedures themselves, enjoy a kind of virtual life within computers. We will call these information entities. Here, we describe our work in developing an ontology of chemical information entities, with a primary focus on data-driven research and the integration of calculated properties (descriptors) of chemical entities within a semantic web context. Our ontology distinguishes algorithmic, or procedural information from declarative, or factual information, and renders of particular importance the annotation of provenance to calculated data. The Chemical Information Ontology is being developed as an open collaborative project. More details, together with a downloadable OWL file, are available at http://code.google.com/p/semanticchemistry/ (license: CC-BY-SA).
View details for DOI 10.1371/journal.pone.0025513
View details for Web of Science ID 000295943000034
View details for PubMedID 21991315
Controlled vocabularies and semantics in systems biology
MOLECULAR SYSTEMS BIOLOGY
The use of computational modeling to describe and analyze biological systems is at the heart of systems biology. Model structures, simulation descriptions and numerical results can be encoded in structured formats, but there is an increasing need to provide an additional semantic layer. Semantic information adds meaning to components of structured descriptions to help identify and interpret them unambiguously. Ontologies are one of the tools frequently used for this purpose. We describe here three ontologies created specifically to address the needs of the systems biology community. The Systems Biology Ontology (SBO) provides semantic information about the model components. The Kinetic Simulation Algorithm Ontology (KiSAO) supplies information about existing algorithms available for the simulation of systems biology models, their characterization and interrelationships. The Terminology for the Description of Dynamics (TEDDY) categorizes dynamical features of the simulation results and general systems behavior. The provision of semantic information extends a model's longevity and facilitates its reuse. It provides useful insight into the biology of modeled processes, and may be used to make informed decisions on subsequent simulation experiments.
View details for DOI 10.1038/msb.2011.77
View details for Web of Science ID 000296652600009
View details for PubMedID 22027554
Integrating systems biology models and biomedical ontologies
BMC SYSTEMS BIOLOGY
Systems biology is an approach to biology that emphasizes the structure and dynamic behavior of biological systems and the interactions that occur within them. To succeed, systems biology crucially depends on the accessibility and integration of data across domains and levels of granularity. Biomedical ontologies were developed to facilitate such an integration of data and are often used to annotate biosimulation models in systems biology.We provide a framework to integrate representations of in silico systems biology with those of in vivo biology as described by biomedical ontologies and demonstrate this framework using the Systems Biology Markup Language. We developed the SBML Harvester software that automatically converts annotated SBML models into OWL and we apply our software to those biosimulation models that are contained in the BioModels Database. We utilize the resulting knowledge base for complex biological queries that can bridge levels of granularity, verify models based on the biological phenomenon they represent and provide a means to establish a basic qualitative layer on which to express the semantics of biosimulation models.We establish an information flow between biomedical ontologies and biosimulation models and we demonstrate that the integration of annotated biosimulation models and biomedical ontologies enables the verification of models as well as expressive queries. Establishing a bi-directional information flow between systems biology and biomedical ontologies has the potential to enable large-scale analyses of biological systems that span levels of granularity from molecules to organisms.
View details for DOI 10.1186/1752-0509-5-124
View details for Web of Science ID 000294781500001
View details for PubMedID 21835028
- MoSuMo: A Semantic Web service to generate electrostatic potentials across solvent excluded protein surfaces and binding pockets COMPUTERS & GRAPHICS-UK 2011; 35 (4): 823-830
Prototype semantic infrastructure for automated small molecule classification and annotation in lipidomics
The development of high-throughput experimentation has led to astronomical growth in biologically relevant lipids and lipid derivatives identified, screened, and deposited in numerous online databases. Unfortunately, efforts to annotate, classify, and analyze these chemical entities have largely remained in the hands of human curators using manual or semi-automated protocols, leaving many novel entities unclassified. Since chemical function is often closely linked to structure, accurate structure-based classification and annotation of chemical entities is imperative to understanding their functionality.As part of an exploratory study, we have investigated the utility of semantic web technologies in automated chemical classification and annotation of lipids. Our prototype framework consists of two components: an ontology and a set of federated web services that operate upon it. The formal lipid ontology we use here extends a part of the LiPrO ontology and draws on the lipid hierarchy in the LIPID MAPS database, as well as literature-derived knowledge. The federated semantic web services that operate upon this ontology are deployed within the Semantic Annotation, Discovery, and Integration (SADI) framework. Structure-based lipid classification is enacted by two core services. Firstly, a structural annotation service detects and enumerates relevant functional groups for a specified chemical structure. A second service reasons over lipid ontology class descriptions using the attributes obtained from the annotation service and identifies the appropriate lipid classification. We extend the utility of these core services by combining them with additional SADI services that retrieve associations between lipids and proteins and identify publications related to specified lipid types. We analyze the performance of SADI-enabled eicosanoid classification relative to the LIPID MAPS classification and reflect on the contribution of our integrative methodology in the context of high-throughput lipidomics.Our prototype framework is capable of accurate automated classification of lipids and facile integration of lipid class information with additional data obtained with SADI web services. The potential of programming-free integration of external web services through the SADI framework offers an opportunity for development of powerful novel applications in lipidomics. We conclude that semantic web technologies can provide an accurate and versatile means of classification and annotation of lipids.
View details for DOI 10.1186/1471-2105-12-303
View details for Web of Science ID 000294361500001
View details for PubMedID 21791100
Interoperability between Biomedical Ontologies through Relation Expansion, Upper-Level Ontologies and Automatic Reasoning
2011; 6 (7)
Researchers design ontologies as a means to accurately annotate and integrate experimental data across heterogeneous and disparate data- and knowledge bases. Formal ontologies make the semantics of terms and relations explicit such that automated reasoning can be used to verify the consistency of knowledge. However, many biomedical ontologies do not sufficiently formalize the semantics of their relations and are therefore limited with respect to automated reasoning for large scale data integration and knowledge discovery. We describe a method to improve automated reasoning over biomedical ontologies and identify several thousand contradictory class definitions. Our approach aligns terms in biomedical ontologies with foundational classes in a top-level ontology and formalizes composite relations as class expressions. We describe the semi-automated repair of contradictions and demonstrate expressive queries over interoperable ontologies. Our work forms an important cornerstone for data integration, automatic inference and knowledge discovery based on formal representations of knowledge. Our results and analysis software are available at http://bioonto.de/pmwiki.php/Main/ReasonableOntologies.
View details for DOI 10.1371/journal.pone.0022006
View details for Web of Science ID 000292812400024
View details for PubMedID 21789201
Chemical Entity Semantic Specification: Knowledge representation for efficient semantic cheminformatics and facile data integration
JOURNAL OF CHEMINFORMATICS
Over the past several centuries, chemistry has permeated virtually every facet of human lifestyle, enriching fields as diverse as medicine, agriculture, manufacturing, warfare, and electronics, among numerous others. Unfortunately, application-specific, incompatible chemical information formats and representation strategies have emerged as a result of such diverse adoption of chemistry. Although a number of efforts have been dedicated to unifying the computational representation of chemical information, disparities between the various chemical databases still persist and stand in the way of cross-domain, interdisciplinary investigations. Through a common syntax and formal semantics, Semantic Web technology offers the ability to accurately represent, integrate, reason about and query across diverse chemical information.Here we specify and implement the Chemical Entity Semantic Specification (CHESS) for the representation of polyatomic chemical entities, their substructures, bonds, atoms, and reactions using Semantic Web technologies. CHESS provides means to capture aspects of their corresponding chemical descriptors, connectivity, functional composition, and geometric structure while specifying mechanisms for data provenance. We demonstrate that using our readily extensible specification, it is possible to efficiently integrate multiple disparate chemical data sources, while retaining appropriate correspondence of chemical descriptors, with very little additional effort. We demonstrate the impact of some of our representational decisions on the performance of chemically-aware knowledgebase searching and rudimentary reaction candidate selection. Finally, we provide access to the tools necessary to carry out chemical entity encoding in CHESS, along with a sample knowledgebase.By harnessing the power of Semantic Web technologies with CHESS, it is possible to provide a means of facile cross-domain chemical knowledge integration with full preservation of data correspondence and provenance. Our representation builds on existing cheminformatics technologies and, by the virtue of RDF specification, remains flexible and amenable to application- and domain-specific annotations without compromising chemical data integration. We conclude that the adoption of a consistent and semantically-enabled chemical specification is imperative for surviving the coming chemical data deluge and supporting systems science research.
View details for DOI 10.1186/1758-2946-3-20
View details for Web of Science ID 000300226500001
View details for PubMedID 21595881
Semantic Web integration of Cheminformatics resources with the SADI framework
JOURNAL OF CHEMINFORMATICS
The diversity and the largely independent nature of chemical research efforts over the past half century are, most likely, the major contributors to the current poor state of chemical computational resource and database interoperability. While open software for chemical format interconversion and database entry cross-linking have partially addressed database interoperability, computational resource integration is hindered by the great diversity of software interfaces, languages, access methods, and platforms, among others. This has, in turn, translated into limited reproducibility of computational experiments and the need for application-specific computational workflow construction and semi-automated enactment by human experts, especially where emerging interdisciplinary fields, such as systems chemistry, are pursued. Fortunately, the advent of the Semantic Web, and the very recent introduction of RESTful Semantic Web Services (SWS) may present an opportunity to integrate all of the existing computational and database resources in chemistry into a machine-understandable, unified system that draws on the entirety of the Semantic Web.We have created a prototype framework of Semantic Automated Discovery and Integration (SADI) framework SWS that exposes the QSAR descriptor functionality of the Chemistry Development Kit. Since each of these services has formal ontology-defined input and output classes, and each service consumes and produces RDF graphs, clients can automatically reason about the services and available reference information necessary to complete a given overall computational task specified through a simple SPARQL query. We demonstrate this capability by carrying out QSAR analysis backed by a simple formal ontology to determine whether a given molecule is drug-like. Further, we discuss parameter-based control over the execution of SADI SWS. Finally, we demonstrate the value of computational resource envelopment as SADI services through service reuse and ease of integration of computational functionality into formal ontologies.The work we present here may trigger a major paradigm shift in the distribution of computational resources in chemistry. We conclude that envelopment of chemical computational resources as SADI SWS facilitates interdisciplinary research by enabling the definition of computational problems in terms of ontologies and formal logical statements instead of cumbersome and application-specific tasks and workflows.
View details for DOI 10.1186/1758-2946-3-16
View details for Web of Science ID 000300226200001
View details for PubMedID 21575200
A common layer of interoperability for biomedical ontologies based on OWL EL
2011; 27 (7): 1001-1008
Ontologies are essential in biomedical research due to their ability to semantically integrate content from different scientific databases and resources. Their application improves capabilities for querying and mining biological knowledge. An increasing number of ontologies is being developed for this purpose, and considerable effort is invested into formally defining them in order to represent their semantics explicitly. However, current biomedical ontologies do not facilitate data integration and interoperability yet, since reasoning over these ontologies is very complex and cannot be performed efficiently or is even impossible. We propose the use of less expressive subsets of ontology representation languages to enable efficient reasoning and achieve the goal of genuine interoperability between ontologies.We present and evaluate EL Vira, a framework that transforms OWL ontologies into the OWL EL subset, thereby enabling the use of tractable reasoning. We illustrate which OWL constructs and inferences are kept and lost following the conversion and demonstrate the performance gain of reasoning indicated by the significant reduction of processing time. We applied EL Vira to the open biomedical ontologies and provide a repository of ontologies resulting from this conversion. EL Vira creates a common layer of ontological interoperability that, for the first time, enables the creation of software solutions that can employ biomedical ontologies to perform inferences and answer complex queries to support scientific analyses. Availability and implementation: The EL Vira software is available from http://el-vira.googlecode.com and converted OBO ontologies and their mappings are available from http://bioonto.gen.cam.ac.uk/el-ont.
View details for DOI 10.1093/bioinformatics/btr058
View details for Web of Science ID 000289162000017
View details for PubMedID 21343142
Integration and publication of heterogeneous text-mined relationships on the Semantic Web.
Journal of biomedical semantics
2011; 2: S10-?
Advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and the complexity of natural language in expressing similar relationships causes the extracted relationships to be highly heterogeneous, which makes the construction of knowledge bases difficult and poses a challenge in using these for data mining or question answering.We report on the semi-automatic construction of the PHARE relationship ontology (the PHArmacogenomic RElationships Ontology) consisting of 200 curated relations from over 40,000 heterogeneous relationships extracted via text-mining. These heterogeneous relations are then mapped to the PHARE ontology using synonyms, entity descriptions and hierarchies of entities and roles. Once mapped, relationships can be normalized and compared using the structure of the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast the manual procedure with a fully automated approach using WordNet to quantify the degree of integration enabled by iterative curation and refinement of the PHARE ontology. The result of such integration is a repository of normalized biomedical relationships, named PHARE-KB, which can be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network.The PHARE ontology serves as a common semantic framework to integrate more than 40,000 relationships pertinent to pharmacogenomics. The PHARE ontology forms the foundation of a knowledge base named PHARE-KB. Once populated with relationships, PHARE-KB (i) can be visualized in the form of a biological network to guide human tasks such as database curation and (ii) can be queried programmatically to guide bioinformatics applications such as the prediction of molecular interactions. PHARE is available at http://purl.bioontology.org/ontology/PHARE.
View details for DOI 10.1186/2041-1480-2-S2-S10
View details for PubMedID 21624156
Towards an interoperable information infrastructure providing decision support for genomic medicine.
Studies in health technology and informatics
2011; 169: 165-9
Genetic dispositions play a major role in individual disease risk and treatment response. Genomic medicine, in which medical decisions are refined by genetic information of particular patients, is becoming increasingly important. Here we describe our work and future visions around the creation of a distributed infrastructure for pharmacogenetic data and medical decision support, based on industry standards such as the Web Ontology Language (OWL) and the Arden Syntax.
View details for PubMedID 21893735
The Translational Medicine Ontology and Knowledge Base: driving personalized medicine by bridging the gap between bench and bedside.
Journal of biomedical semantics
2011; 2 Suppl 2: S1
Translational medicine requires the integration of knowledge using heterogeneous data from health care to the life sciences. Here, we describe a collaborative effort to produce a prototype Translational Medicine Knowledge Base (TMKB) capable of answering questions relating to clinical practice and pharmaceutical drug discovery.We developed the Translational Medicine Ontology (TMO) as a unifying ontology to integrate chemical, genomic and proteomic data with disease, treatment, and electronic health records. We demonstrate the use of Semantic Web technologies in the integration of patient and biomedical data, and reveal how such a knowledge base can aid physicians in providing tailored patient care and facilitate the recruitment of patients into active clinical trials. Thus, patients, physicians and researchers may explore the knowledge base to better understand therapeutic options, efficacy, and mechanisms of action.This work takes an important step in using Semantic Web technologies to facilitate integration of relevant, distributed, external sources and progress towards a computational platform to support personalized medicine.TMO can be downloaded from http://code.google.com/p/translationalmedicineontology and TMKB can be accessed at http://tm.semanticscience.org/sparql.
View details for DOI 10.1186/2041-1480-2-S2-S1
View details for PubMedID 21624155
- The RNA Ontology (RNAO): An ontology for integrating RNA sequence and structure data APPLIED ONTOLOGY 2011; 6 (1): 53-89
HyQue: evaluating hypotheses using Semantic Web technologies.
Journal of biomedical semantics
2011; 2: S3-?
Key to the success of e-Science is the ability to computationally evaluate expert-composed hypotheses for validity against experimental data. Researchers face the challenge of collecting, evaluating and integrating large amounts of diverse information to compose and evaluate a hypothesis. Confronted with rapidly accumulating data, researchers currently do not have the software tools to undertake the required information integration tasks.We present HyQue, a Semantic Web tool for querying scientific knowledge bases with the purpose of evaluating user submitted hypotheses. HyQue features a knowledge model to accommodate diverse hypotheses structured as events and represented using Semantic Web languages (RDF/OWL). Hypothesis validity is evaluated against experimental and literature-sourced evidence through a combination of SPARQL queries and evaluation rules. Inference over OWL ontologies (for type specifications, subclass assertions and parthood relations) and retrieval of facts stored as Bio2RDF linked data provide support for a given hypothesis. We evaluate hypotheses of varying levels of detail about the genetic network controlling galactose metabolism in Saccharomyces cerevisiae to demonstrate the feasibility of deploying such semantic computing tools over a growing body of structured knowledge in Bio2RDF.HyQue is a query-based hypothesis evaluation system that can currently evaluate hypotheses about the galactose metabolism in S. cerevisiae. Hypotheses as well as the supporting or refuting data are represented in RDF and directly linked to one another allowing scientists to browse from data to hypothesis and vice versa. HyQue hypotheses and data are available at http://semanticscience.org/projects/hyque.
View details for DOI 10.1186/2041-1480-2-S2-S3
View details for PubMedID 21624158
Computational approaches toward the design of pools for the in vitro selection of complex aptamers
RNA-A PUBLICATION OF THE RNA SOCIETY
2010; 16 (11): 2252-2262
It is well known that using random RNA/DNA sequences for SELEX experiments will generally yield low-complexity structures. Early experimental results suggest that having a structurally diverse library, which, for instance, includes high-order junctions, may prove useful in finding new functional motifs. Here, we develop two computational methods to generate sequences that exhibit higher structural complexity and can be used to increase the overall structural diversity of initial pools for in vitro selection experiments. Random Filtering selectively increases the number of five-way junctions in RNA/DNA pools, and Genetic Filtering designs RNA/DNA pools to a specified structure distribution, whether uniform or otherwise. We show that using our computationally designed DNA pool greatly improves access to highly complex sequence structures for SELEX experiments (without losing our ability to select for common one-way and two-way junction sequences).
View details for DOI 10.1261/rna.2102210
View details for Web of Science ID 000283047900020
View details for PubMedID 20870801
Relations as patterns: bridging the gap between OBO and OWL
most biomedical ontologies are represented in the OBO Flatfile Format, which is an easy-to-use graph-based ontology language. The semantics of the OBO Flatfile Format 1.2 enforces a strict predetermined interpretation of relationship statements between classes. It does not allow flexible specifications that provide better approximations of the intuitive understanding of the considered relations. If relations cannot be accurately expressed then ontologies built upon them may contain false assertions and hence lead to false inferences. Ontologies in the OBO Foundry must formalize the semantics of relations according to the OBO Relationship Ontology (RO). Therefore, being able to accurately express the intended meaning of relations is of crucial importance. Since the Web Ontology Language (OWL) is an expressive language with a formal semantics, it is suitable to de ne the meaning of relations accurately.we developed a method to provide definition patterns for relations between classes using OWL and describe a novel implementation of the RO based on this method. We implemented our extension in software that converts ontologies in the OBO Flatfile Format to OWL, and also provide a prototype to extract relational patterns from OWL ontologies using automated reasoning. The conversion software is freely available at http://bioonto.de/obo2owl, and can be accessed via a web interface.explicitly defining relations permits their use in reasoning software and leads to a more flexible and powerful way of representing biomedical ontologies. Using the extended langua0067e and semantics avoids several mistakes commonly made in formalizing biomedical ontologies, and can be used to automatically detect inconsistencies. The use of our method enables the use of graph-based ontologies in OWL, and makes complex OWL ontologies accessible in a graph-based form. Thereby, our method provides the means to gradually move the representation of biomedical ontologies into formal knowledge representation languages that incorporates an explicit semantics. Our method facilitates the use of OWL-based software in the back-end while ontology curators may continue to develop ontologies with an OBO-style front-end.
View details for DOI 10.1186/1471-2105-11-441
View details for Web of Science ID 000282631900001
View details for PubMedID 20807438
Chemical entity semantic specification: Knowledge representation for efficient semantic cheminformatics and facile data integration
ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY
View details for Web of Science ID 000208164701526
Semantic envelopment of cheminformatics resources with SADI
ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY
View details for Web of Science ID 000208164701503
CHEMINF: Community-developed ontology of chemical information and algorithms
ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY
View details for Web of Science ID 000208164701525
- Modeling and querying graphical representations of statistical data JOURNAL OF WEB SEMANTICS 2010; 8 (2-3): 241-254
RKB: a Semantic Web knowledge base for RNA.
Journal of biomedical semantics
2010; 1 Suppl 1: S2
Increasingly sophisticated knowledge about RNA structure and function requires an inclusive knowledge representation that facilitates the integration of independently -generated information arising from such efforts as genome sequencing projects, microarray analyses, structure determination and RNA SELEX experiments. While RNAML, an XML-based representation, has been proposed as an exchange format for a select subset of information, it lacks domain-specific semantics that are essential for answering questions that require expert knowledge. Here, we describe an RNA knowledge base (RKB) for structure-based knowledge using RDF/OWL Semantic Web technologies. RKB extends a number of ontologies and contains basic terminology for nucleic acid composition along with context/model-specific structural features such as sugar conformations, base pairings and base stackings. RKB (available at http://semanticscience.org/projects/rkb) is populated with PDB entries and MC-Annotate structural annotation. We show queries to the RKB using description logic reasoning, thus opening the door to question answering over independently-published RNA knowledge using Semantic Web technologies.
View details for DOI 10.1186/2041-1480-1-S1-S2
View details for PubMedID 20626922
Towards pharmacogenomics knowledge discovery with the semantic web
BRIEFINGS IN BIOINFORMATICS
2009; 10 (2): 153-163
Pharmacogenomics aims to understand pharmacological response with respect to genetic variation. Essential to the delivery of better health care is the use of pharmacogenomics knowledge to answer questions about therapeutic, pharmacological or genetic aspects. Several XML markup languages have been developed to capture pharmacogenomic and related information so as to facilitate data sharing. However, recent advances in semantic web technologies have presented exciting new opportunities for pharmacogenomics knowledge discovery by representing the information with machine understandable semantics. Progress in this area is illustrated with reference to the personalized medicine project that aims to facilitate pharmacogenomics knowledge discovery through intuitive knowledge capture and sophisticated question answering using automated reasoning over expressive ontologies.
View details for DOI 10.1093/bib/bbn056
View details for Web of Science ID 000264388500005
View details for PubMedID 19240125
MODELING TRYPTIC DIGESTION ON THE CELL BE PROCESSOR
2009 IEEE 22ND CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING, VOLS 1 AND 2
View details for Web of Science ID 000271823600052
yOWL: An ontology-driven knowledge base for yeast biologists
JOURNAL OF BIOMEDICAL INFORMATICS
2008; 41 (5): 779-789
Knowledge management is an ongoing challenge for the biological community such that large, diverse and continuously growing information requires more sophisticated methods to store, integrate and query their knowledge. The semantic web initiative provides a new knowledge engineering framework to represent, share and discover information. In this paper, we describe our efforts towards the development of an ontology-based knowledge base, including aspects from ontology design and population using "semantic" data mashup, to automated reasoning and semantic query answering. Based on yeast data obtained from the Saccharomyces Genome Database and UniProt, we discuss the challenges encountered during the building of the knowledge base and how they were overcome.
View details for DOI 10.1016/j.jbi.2008.05.001
View details for Web of Science ID 000260137300010
View details for PubMedID 18562252
Global investigation of protein-protein interactions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide sequences
NUCLEIC ACIDS RESEARCH
2008; 36 (13): 4286-4294
Protein-protein interaction (PPI) maps provide insight into cellular biology and have received considerable attention in the post-genomic era. While large-scale experimental approaches have generated large collections of experimentally determined PPIs, technical limitations preclude certain PPIs from detection. Recently, we demonstrated that yeast PPIs can be computationally predicted using re-occurring short polypeptide sequences between known interacting protein pairs. However, the computational requirements and low specificity made this method unsuitable for large-scale investigations. Here, we report an improved approach, which exhibits a specificity of approximately 99.95% and executes 16,000 times faster. Importantly, we report the first all-to-all sequence-based computational screen of PPIs in yeast, Saccharomyces cerevisiae in which we identify 29,589 high confidence interactions of approximately 2 x 10(7) possible pairs. Of these, 14,438 PPIs have not been previously reported and may represent novel interactions. In particular, these results reveal a richer set of membrane protein interactions, not readily amenable to experimental investigations. From the novel PPIs, a novel putative protein complex comprised largely of membrane proteins was revealed. In addition, two novel gene functions were predicted and experimentally confirmed to affect the efficiency of non-homologous end-joining, providing further support for the usefulness of the identified PPIs in biological investigations.
View details for DOI 10.1093/nar/gkn390
View details for Web of Science ID 000257964500014
View details for PubMedID 18586826
GridCell: a stochastic particle-based biological system simulator
BMC SYSTEMS BIOLOGY
Realistic biochemical simulators aim to improve our understanding of many biological processes that would be otherwise very difficult to monitor in experimental studies. Increasingly accurate simulators may provide insights into the regulation of biological processes due to stochastic or spatial effects.We have developed GridCell as a three-dimensional simulation environment for investigating the behaviour of biochemical networks under a variety of spatial influences including crowding, recruitment and localization. GridCell enables the tracking and characterization of individual particles, leading to insights on the behaviour of low copy number molecules participating in signaling networks. The simulation space is divided into a discrete 3D grid that provides ideal support for particle collisions without distance calculation and particle search. SBML support enables existing networks to be simulated and visualized. The user interface provides intuitive navigation that facilitates insights into species behaviour across spatial and temporal dimensions. We demonstrate the effect of crowing on a Michaelis-Menten system.GridCell is an effective stochastic particle simulator designed to track the progress of individual particles in a three-dimensional space in which spatial influences such as crowding, co-localization and recruitment may be investigated.
View details for DOI 10.1186/1752-0509-2-66
View details for Web of Science ID 000258870400001
View details for PubMedID 18651956
Computational Methods For Predicting Protein-Protein Interactions
PROTEIN - PROTEIN INTERACTION
2008; 110: 247-267
Protein-protein interactions (PPIs) play a critical role in many cellular functions. A number of experimental techniques have been applied to discover PPIs; however, these techniques are expensive in terms of time, money, and expertise. There are also large discrepancies between the PPI data collected by the same or different techniques in the same organism. We therefore turn to computational techniques for the prediction of PPIs. Computational techniques have been applied to the collection, indexing, validation, analysis, and extrapolation of PPI data. This chapter will focus on computational prediction of PPI, reviewing a number of techniques including PIPE, developed in our own laboratory. For comparison, the conventional large-scale approaches to predict PPIs are also briefly discussed. The chapter concludes with a discussion of the limitations of both experimental and computational methods of determining PPIs.
View details for DOI 10.1007/10_2007_089
View details for Web of Science ID 000260375400011
View details for PubMedID 18202838
Semantic Annotation and Question Answering of Statistical Graphs
MICAI 2008: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS
2008; 5317: 100-110
View details for Web of Science ID 000261873400009
Chemical knowledge for the Semantic Web
DATA INTEGRATION IN THE LIFE SCIENCES, PROCEEDINGS
2008; 5109: 169-176
View details for Web of Science ID 000257304400017
Semantic Query Answering with Time-Series Graphs
2007 11TH IEEE INTERNATIONAL ENTERPRISE DISTRIBUTED OBJECT COMPUTING CONFERENCE WORKSHOPS
View details for Web of Science ID 000270892600018
Domain-based small molecule binding site annotation
Accurate small molecule binding site information for a protein can facilitate studies in drug docking, drug discovery and function prediction, but small molecule binding site protein sequence annotation is sparse. The Small Molecule Interaction Database (SMID), a database of protein domain-small molecule interactions, was created using structural data from the Protein Data Bank (PDB). More importantly it provides a means to predict small molecule binding sites on proteins with a known or unknown structure and unlike prior approaches, removes large numbers of false positive hits arising from transitive alignment errors, non-biologically significant small molecules and crystallographic conditions that overpredict ion binding sites.Using a set of co-crystallized protein-small molecule structures as a starting point, SMID interactions were generated by identifying protein domains that bind to small molecules, using NCBI's Reverse Position Specific BLAST (RPS-BLAST) algorithm. SMID records are available for viewing at http://smid.blueprint.org. The SMID-BLAST tool provides accurate transitive annotation of small-molecule binding sites for proteins not found in the PDB. Given a protein sequence, SMID-BLAST identifies domains using RPS-BLAST and then lists potential small molecule ligands based on SMID records, as well as their aligned binding sites. A heuristic ligand score is calculated based on E-value, ligand residue identity and domain entropy to assign a level of confidence to hits found. SMID-BLAST predictions were validated against a set of 793 experimental small molecule interactions from the PDB, of which 472 (60%) of predicted interactions identically matched the experimental small molecule and of these, 344 had greater than 80% of the binding site residues correctly identified. Further, we estimate that 45% of predictions which were not observed in the PDB validation set may be true positives.By focusing on protein domain-small molecule interactions, SMID is able to cluster similar interactions and detect subtle binding patterns that would not otherwise be obvious. Using SMID-BLAST, small molecule targets can be predicted for any protein sequence, with the only limitation being that the small molecule must exist in the PDB. Validation results and specific examples within illustrate that SMID-BLAST has a high degree of accuracy in terms of predicting both the small molecule ligand and binding site residue positions for a query protein.
View details for DOI 10.1186/1471-2105-7-152
View details for Web of Science ID 000236762000001
View details for PubMedID 16545112
CO: A chemical ontology for identification of functional groups and semantic comparison of small molecules
2005; 579 (21): 4685-4691
A novel chemical ontology based on chemical functional groups automatically, objectively assigned by a computer program, was developed to categorize small molecules. It has been applied to PubChem and the small molecule interaction database to demonstrate its utility as a basic pharmacophore search system. Molecules can be compared using a semantic similarity score based on functional group assignments rather than 3D shape, which succeeds in identifying small molecules known to bind a common binding site. This ontology will serve as a powerful tool for searching chemical databases and identifying key functional groups responsible for biological activities.
View details for DOI 10.1016/j.febslet.2005.07.039
View details for Web of Science ID 000231625800022
View details for PubMedID 16098521
Armadillo: Domain boundary prediction by amino acid composition
JOURNAL OF MOLECULAR BIOLOGY
2005; 350 (5): 1061-1073
The identification and annotation of protein domains provides a critical step in the accurate determination of molecular function. Both computational and experimental methods of protein structure determination may be deterred by large multi-domain proteins or flexible linker regions. Knowledge of domains and their boundaries may reduce the experimental cost of protein structure determination by allowing researchers to work on a set of smaller and possibly more successful alternatives. Current domain prediction methods often rely on sequence similarity to conserved domains and as such are poorly suited to detect domain structure in poorly conserved or orphan proteins. We present here a simple computational method to identify protein domain linkers and their boundaries from sequence information alone. Our domain predictor, Armadillo (http://armadillo.blueprint.org), uses any amino acid index to convert a protein sequence to a smoothed numeric profile from which domains and domain boundaries may be predicted. We derived an amino acid index called the domain linker propensity index (DLI) from the amino acid composition of domain linkers using a non-redundant structure dataset. The index indicates that Pro and Gly show a propensity for linker residues while small hydrophobic residues do not. Armadillo predicts domain linker boundaries from Z-score distributions and obtains 35% sensitivity with DLI in a two-domain, single-linker dataset (within +/-20 residues from linker). The combination of DLI and an entropy-based amino acid index increases the overall Armadillo sensitivity to 56% for two domain proteins. Moreover, Armadillo achieves 37% sensitivity for multi-domain proteins, surpassing most other prediction methods. Armadillo provides a simple, but effective method by which prediction of domain boundaries can be obtained with reasonable sensitivity. Armadillo should prove to be a valuable tool for rapidly delineating protein domains in poorly conserved proteins or those with no sequence neighbors. As a first-line predictor, domain meta-predictors could yield improved results with Armadillo predictions.
View details for DOI 10.1016/j.jmb.2005.05.037
View details for Web of Science ID 000230701300019
View details for PubMedID 15978619
The Biomolecular Interaction Network Database and related tools 2005 update
NUCLEIC ACIDS RESEARCH
2005; 33: D418-D424
The Biomolecular Interaction Network Database (BIND) (http://bind.ca) archives biomolecular interaction, reaction, complex and pathway information. Our aim is to curate the details about molecular interactions that arise from published experimental research and to provide this information, as well as tools to enable data analysis, freely to researchers worldwide. BIND data are curated into a comprehensive machine-readable archive of computable information and provides users with methods to discover interactions and molecular mechanisms. BIND has worked to develop new methods for visualization that amplify the underlying annotation of genes and proteins to facilitate the study of molecular interaction networks. BIND has maintained an open database policy since its inception in 1999. Data growth has proceeded at a tremendous rate, approaching over 100 000 records. New services provided include a new BIND Query and Submission interface, a Standard Object Access Protocol service and the Small Molecule Interaction Database (http://smid.blueprint.org) that allows users to determine probable small molecule binding sites of new sequences and examine conserved binding residues.
View details for DOI 10.1093/nar/gki051
View details for Web of Science ID 000226524300086
View details for PubMedID 15608229
Hardware-accelerated protein identification for mass spectrometry
RAPID COMMUNICATIONS IN MASS SPECTROMETRY
2005; 19 (6): 833-837
An ongoing issue in mass spectrometry is the time it takes to search DNA sequences with MS/MS peptide fragments (see, e.g., Choudary et al., Proteomics 2001; 1: 651-667.) Search times are far longer than spectra acquisition time, and parallelization of search software on clusters requires doubling the size of a conventional computing cluster to cut the search time in half. Field programmable gate arrays (FPGAs) are used to create hardware-accelerated algorithms that reduce operating costs and improve search speed compared to large clusters. We present a novel hardware design that takes full spectra and computes 6-frame translation word searches on DNA databases at a rate of approximately 3 billion base pairs per second, with queries of up to 10 amino acids in length and arbitrary wildcard positions. Hardware post-processing identifies in silico tryptic peptides and scores them using a variety of techniques including mass frequency expected values. With faster FPGAs protein identifications from the human genome can be achieved in less than a second, and this makes it an ideal solution for a number of proteome-scale applications.
View details for DOI 10.1002/rcm.1853
View details for Web of Science ID 000227519500013
View details for PubMedID 15723443
Species-specific protein sequence and fold optimizations
An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes.Environmental niche was determined to be a significant factor in variability from correspondence analysis using the amino acid composition of over 360,000 predicted open reading frames (ORFs) from 17 archaea, 76 bacteria and 7 eukaryote complete genomes. Additionally, we found clusters of phylogenetically unrelated archaea and bacteria that share similar environments by amino acid composition clustering. Composition analyses of conservative, domain-based homology modeling suggested an enrichment of small hydrophobic residues Ala, Gly, Val and charged residues Asp, Glu, His and Arg across all genomes. However, larger aromatic residues Phe, Trp and Tyr are reduced in folds, and these results were not affected by low complexity biases. We derived two simple log-odds scoring functions from ORFs (CG) and folds (CF) for each of the complete genomes. CF achieved an average cross-validation success rate of 85 +/- 8% whereas the CG detected 73 +/- 9% species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available at http://genome.mshri.on.ca.Our analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events.
View details for Web of Science ID 000181476800039
View details for PubMedID 12487631
NBLAST: a cluster variant of BLAST for NxN comparisons
The BLAST algorithm compares biological sequences to one another in order to determine shared motifs and common ancestry. However, the comparison of all non-redundant (NR) sequences against all other NR sequences is a computationally intensive task. We developed NBLAST as a cluster computer implementation of the BLAST family of sequence comparison programs for the purpose of generating pre-computed BLAST alignments and neighbour lists of NR sequences.NBLAST performs the heuristic BLAST algorithm and generates an exhaustive database of alignments, but it only computes alignments (i.e. the upper triangle) of a possible N2 alignments, where N is the set of all sequences to be compared. A task-partitioning algorithm allows for cluster computing across all cluster nodes and the NBLAST master process produces a BLAST sequence alignment database and a list of sequence neighbours for each sequence record. The resulting sequence alignment and neighbour databases are used to serve the SeqHound query system through a C/C++ and PERL Application Programming Interface (API).NBLAST offers a local alternative to the NCBI's remote Entrez system for pre-computed BLAST alignments and neighbour queries. On our 216-processor 450 MHz PIII cluster, NBLAST requires ~24 hrs to compute neighbours for 850000 proteins currently in the non-redundant protein database.
View details for Web of Science ID 000181476800013
View details for PubMedID 12019022
SeqHound: biological sequence and structure database as a platform for bioinformatics research
SeqHound has been developed as an integrated biological sequence, taxonomy, annotation and 3-D structure database system. It provides a high-performance server platform for bioinformatics research in a locally-hosted environment.SeqHound is based on the National Center for Biotechnology Information data model and programming tools. It offers daily updated contents of all Entrez sequence databases in addition to 3-D structural data and information about sequence redundancies, sequence neighbours, taxonomy, complete genomes, functional annotation including Gene Ontology terms and literature links to PubMed. SeqHound is accessible via a web server through a Perl, C or C++ remote API or an optimized local API. It provides functionality necessary to retrieve specialized subsets of sequences, structures and structural domains. Sequences may be retrieved in FASTA, GenBank, ASN.1 and XML formats. Structures are available in ASN.1, XML and PDB formats. Emphasis has been placed on complete genomes, taxonomy, domain and functional annotation as well as 3-D structural functionality in the API, while fielded text indexing functionality remains under development. SeqHound also offers a streamlined WWW interface for simple web-user queries.The system has proven useful in several published bioinformatics projects such as the BIND database and offers a cost-effective infrastructure for research. SeqHound will continue to develop and be provided as a service of the Blueprint Initiative at the Samuel Lunenfeld Research Institute. The source code and examples are available under the terms of the GNU public license at the Sourceforge site http://sourceforge.net/projects/slritools/ in the SLRI Toolkit.
View details for Web of Science ID 000181476800032
View details for PubMedID 12401134