Dr. Nigam Shah is associate professor of Medicine (Biomedical Informatics) at Stanford University, Assistant Director of the Center for Biomedical Informatics Research, and a core member of the Biomedical Informatics Graduate Program. Dr. Shah's research focuses on combining machine learning and prior knowledge in medical ontologies to enable use cases of the learning health system.

Dr. Shah received the AMIA New Investigator Award for 2013 and the Stanford Biosciences Faculty Teaching Award for outstanding teaching in his graduate class on “Data driven medicine”. Dr. Shah was elected into the American College of Medical Informatics (ACMI) in 2015 and was inducted into the American Society for Clinical Investigation (ASCI) in 2016. He holds an MBBS from Baroda Medical College, India, a PhD from Penn State University and completed postdoctoral training at Stanford University.

Administrative Appointments

  • Co-director - Informatics, Stanford Center for Clinical and Translational Research and Education (Spectrum) (2017 - Present)
  • Assistant Director, Stanford Center for Biomedical Informatics Research (BMIR) (2013 - Present)
  • Member, Cancer Institute Informatics Steering Committee (2011 - Present)
  • Scientific Program Chair, AMIA Summit on Translational Bioinformatics (2011 - 2012)
  • Advisory Committee Member, Stanford Center for Clinical Informatics (2011 - 2012)

Honors & Awards

  • Outstanding paper award, Summit on Translational Bioinformatics (March 2008)
  • Outstanding paper award, AMIA Summit on Translational Bioinformatics (March 2009)
  • Distinguished paper award, AMIA Summit on Translational Bioinformatics (March 2011)
  • Biosciences Faculty Award recognizing outstanding teaching contributions, Stanford School of Medicine (June 2012)
  • Ramoni Best paper award, AMIA Summit on Translational Bioinformatics (March 2013)
  • New Investigator Award, American Medical Informatics Association (AMIA) (November 2013)

Professional Education

  • Postdoctoral, Stanford University, Biomedical Informatics (2007)
  • PhD, The Pennsylvania State University, Molecular Medicine (2005)
  • MBBS, Baroda Medical College, Medicine (1999)

Current Research and Scholarly Interests

We develop methods to analyze large unstructured data sets for data-driven medicine. We use ontologies to annotate, index and analyze Big Data in biomedicine for enabling data-driven decision making in medicine and health care. Our research group is part of the Center for Biomedical Informatics Research at Stanford and the National Center for Biomedical Ontology.

Data driven medicine: The goal is to combine machine learning, text-mining, and prior knowledge in medical ontologies to discover hidden trends, build risk models, drive data-driven decision making, and support comparative effectiveness studies. We have developed methods that transform unstructured patient notes into a de-identified, temporally ordered patient-feature matrix (picture each row as a patient and each column as a medical concept, with 1 = present and 0 = absent). With the resulting high-throughput data, we can monitor for adverse drug events, learn drug-drug interactions, identify off-label drug usage, generate practice-based evidence for difficult-to-test clinical hypotheses, generate phenotypic fingerprints, and build predictive models. We also combine multiple information sources for drug safety surveillance, work that was recently the focus of a commentary titled Advancing the Science of Pharmacovigilance.
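To make the patient-feature matrix concrete, here is a minimal sketch in Python. It assumes the notes have already been de-identified and annotated with concepts; the patient IDs, concept names, day offsets, and function names are all hypothetical and chosen only for illustration, not taken from the group's actual pipeline.

```python
# Hypothetical, already de-identified concept mentions:
# (patient_id, day_offset_from_first_visit, concept). All values are made up.
mentions = [
    ("p1", 0, "metformin"),
    ("p1", 30, "hypoglycemia"),
    ("p2", 5, "metformin"),
    ("p3", 2, "warfarin"),
    ("p3", 9, "bleeding"),
]


def build_patient_feature_matrix(mentions):
    """Return (patients, concepts, matrix); matrix[i][j] is 1 if patient i
    has at least one mention of concept j, and 0 otherwise."""
    patients = sorted({p for p, _, _ in mentions})
    concepts = sorted({c for _, _, c in mentions})
    row = {p: i for i, p in enumerate(patients)}
    col = {c: j for j, c in enumerate(concepts)}
    matrix = [[0] * len(concepts) for _ in patients]
    for p, _, c in mentions:
        matrix[row[p]][col[c]] = 1
    return patients, concepts, matrix


def first_mention_day(mentions):
    """Earliest day offset per (patient, concept), which keeps the temporal order."""
    first = {}
    for p, day, c in mentions:
        key = (p, c)
        if key not in first or day < first[key]:
            first[key] = day
    return first


patients, concepts, matrix = build_patient_feature_matrix(mentions)
for p, values in zip(patients, matrix):
    print(p, dict(zip(concepts, values)))
```

Keeping the first-mention day alongside the binary matrix is one simple way to ask temporal questions, such as whether a drug mention precedes an adverse event mention for the same patient.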

Annotation Analytics: To understand the “gene lists” that result from analysis of high-throughput data, researchers routinely use Gene Ontology (GO) based analyses. With available methods for automated annotation and the existence of over 200 biomedical ontologies, we can move beyond GO alone to enrichment analysis using disease ontologies.
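As an illustration of what an enrichment analysis computes, the sketch below uses a one-sided hypergeometric test, a common choice for this task but an assumption here rather than a description of the group's own method. The function name and all counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` items are drawn without replacement from
    `population` items, of which `annotated` carry the ontology term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes overall, 150 annotated with a disease term,
# a gene list of 200, of which 8 carry that term.
p_value = hypergeom_enrichment_p(20000, 150, 200, 8)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value suggests the term appears in the gene list more often than chance would predict. In practice many terms are tested at once, so a multiple-testing correction would normally follow.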

All Publications

  • Predicting patient 'cost blooms' in Denmark: a longitudinal population-based study. BMJ open Tamang, S., Milstein, A., Sørensen, H. T., Pedersen, L., Mackey, L., Betterton, J., Janson, L., Shah, N. 2017; 7 (1)


    To compare the ability of standard versus enhanced models to predict future high-cost patients, especially those who move from a lower to the upper decile of per capita healthcare expenditures within 1 year, that is, 'cost bloomers'. We developed alternative models to predict being in the upper decile of healthcare expenditures in year 2 of a sample, based on data from year 1. Our 6 alternative models ranged from a standard cost-prediction model with 4 variables (ie, traditional model features), to our largest enhanced model with 1053 non-traditional model features. To quantify any increases in predictive power that enhanced models achieved over standard tools, we compared the prospective predictive performance of each model. We used the population of Western Denmark between 2004 and 2011 (2 146 801 individuals) to predict future high-cost patients and characterise high-cost patient subgroups. Using the most recent 2-year period (2010-2011) for model evaluation, our whole-population model used a cohort of 1 557 950 individuals with a full year of active residency in year 1 (2010). Our cost-bloom model excluded the 155 795 individuals who were already high cost at the population level in year 1, resulting in 1 402 155 individuals for prediction of cost bloomers in year 2 (2011). Using unseen data from a future year, we evaluated each model's prospective predictive performance by calculating the ratio of predicted high-cost patient expenditures to the actual high-cost patient expenditures in Year 2, that is, cost capture. Our best enhanced model achieved a 21% and 30% improvement in cost capture over a standard diagnosis-based model for predicting population-level high-cost patients and cost bloomers, respectively. In combination with modern statistical learning methods for analysing large data sets, models enhanced with a large and diverse set of features led to better performance, especially for predicting future cost bloomers.

    View details for DOI 10.1136/bmjopen-2016-011580

    View details for PubMedID 28077408

    View details for PubMedCentralID PMC5253526

  • Thematic issue of the Second combined Bio-ontologies and Phenotypes Workshop JOURNAL OF BIOMEDICAL SEMANTICS Verspoor, K., Oellrich, A., Collier, N., Groza, T., Rocca-Serra, P., Soldatova, L., Dumontier, M., Shah, N. 2016; 7


    This special issue covers selected papers from the 18th Bio-Ontologies Special Interest Group meeting and Phenotype Day, which took place at the Intelligent Systems for Molecular Biology (ISMB) conference in Dublin in 2015. The papers presented in this collection range from descriptions of software tools supporting ontology development and annotation of objects with ontology terms, to applications of text mining for structured relation extraction involving diseases and phenotypes, to detailed proposals for new ontologies and mapping of existing ontologies. Together, the papers consider a range of representational issues in bio-ontology development, and demonstrate the applicability of bio-ontologies to support biological and clinical knowledge-based decision making and analysis. The full set of papers in the Thematic Issue is available at .

    View details for DOI 10.1186/s13326-016-0108-7

    View details for Web of Science ID 000391059800001

    View details for PubMedID 27955708

  • The use of machine learning for the identification of peripheral artery disease and future mortality risk. Journal of vascular surgery Ross, E. G., Shah, N. H., Dalman, R. L., Nead, K. T., Cooke, J. P., Leeper, N. J. 2016; 64 (5): 1515-1522 e3


    A key aspect of the precision medicine effort is the development of informatics tools that can analyze and interpret "big data" sets in an automated and adaptive fashion while providing accurate and actionable clinical information. The aims of this study were to develop machine learning algorithms for the identification of disease and the prognostication of mortality risk and to determine whether such models perform better than classical statistical analyses. Focusing on peripheral artery disease (PAD), patient data were derived from a prospective, observational study of 1755 patients who presented for elective coronary angiography. We employed multiple supervised machine learning algorithms and used diverse clinical, demographic, imaging, and genomic information in a hypothesis-free manner to build models that could identify patients with PAD and predict future mortality. Comparison was made to standard stepwise linear regression models. Our machine-learned models outperformed stepwise logistic regression models both for the identification of patients with PAD (area under the curve, 0.87 vs 0.76, respectively; P = .03) and for the prediction of future mortality (area under the curve, 0.76 vs 0.65, respectively; P = .10). Both machine-learned models were markedly better calibrated than the stepwise logistic regression models, thus providing more accurate disease and mortality risk estimates. Machine learning approaches can produce more accurate disease classification and prediction models. These tools may prove clinically useful for the automated identification of patients with highly morbid diseases for which aggressive risk factor management can improve outcomes.

    View details for DOI 10.1016/j.jvs.2016.04.026

    View details for PubMedID 27266594

  • Influence of age on androgen deprivation therapy-associated Alzheimer's disease SCIENTIFIC REPORTS Nead, K. T., Gaskin, G., Chester, C., Swisher-McClure, S., Dudley, J. T., Leeper, N. J., Shah, N. H. 2016; 6


    We recently found an association between androgen deprivation therapy (ADT) and Alzheimer's disease. As Alzheimer's disease is a disease of advanced age, we hypothesize that older individuals on ADT may be at greatest risk. We conducted a retrospective multi-institutional analysis among 16,888 individuals with prostate cancer using an informatics approach. We tested the effect of ADT on Alzheimer's disease using Kaplan-Meier age stratified analyses in a propensity score matched cohort. We found a lower cumulative probability of remaining Alzheimer's disease-free between non-ADT users age ≥70 versus those age <70 years (p < 0.001) and between ADT versus non-ADT users ≥70 years (p = 0.034). The 5-year probability of developing Alzheimer's disease was 2.9%, 1.9% and 0.5% among ADT users ≥70, non-ADT users ≥70 and individuals <70 years, respectively. Compared to younger individuals, older men on ADT may have the greatest absolute Alzheimer's disease risk. Future work should investigate the ADT-Alzheimer's disease association in advanced age populations given the greater potential clinical impact.

    View details for DOI 10.1038/srep35695

    View details for Web of Science ID 000385588200002

    View details for PubMedID 27752112

    View details for PubMedCentralID PMC5067668

  • Impact of Predicting Health Care Utilization Via Web Search Behavior: A Data-Driven Analysis JOURNAL OF MEDICAL INTERNET RESEARCH Agarwal, V., Zhang, L., Zhu, J., Fang, S., Cheng, T., Hong, C., Shah, N. H. 2016; 18 (9): 241-253


    By recent estimates, the steady rise in health care costs has deprived more than 45 million Americans of health care services and has encouraged health care providers to better understand the key drivers of health care utilization from a population health management perspective. Prior studies suggest the feasibility of mining population-level patterns of health care resource utilization from observational analysis of Internet search logs; however, the utility of the endeavor to the various stakeholders in a health ecosystem remains unclear. The aim was to carry out a closed-loop evaluation of the utility of health care use predictions using the conversion rates of advertisements that were displayed to the predicted future utilizers as a surrogate. The statistical models to predict the probability of a user's future visit to a medical facility were built using effective predictors of health care resource utilization, extracted from a deidentified dataset of geotagged mobile Internet search logs representing searches made by users of the Baidu search engine between March 2015 and May 2015. We inferred presence within the geofence of a medical facility from location and duration information from users' search logs and putatively assigned medical facility visit labels to qualifying search logs. We constructed a matrix of general, semantic, and location-based features from search logs of users that had 42 or more search days preceding a medical facility visit as well as from search logs of users that had no medical visits and trained statistical learners for predicting future medical visits. We then carried out a closed-loop evaluation of the utility of health care use predictions using the show conversion rates of advertisements displayed to the predicted future utilizers. In the context of behaviorally targeted advertising, wherein health care providers are interested in minimizing their cost per conversion, the association between show conversion rate and predicted utilization score served as a surrogate measure of the model's utility. We obtained the highest area under the curve (0.796) in medical visit prediction with our random forests model and daywise features. Ablating feature categories one at a time showed that the model performance worsened the most when location features were dropped. An online evaluation in which advertisements were served to users who had a high predicted probability of a future medical visit showed a 3.96% increase in the show conversion rate. Results from our experiments done in a research setting suggest that it is possible to accurately predict future patient visits from geotagged mobile search logs. Results from the offline and online experiments on the utility of health utilization predictions suggest that such prediction can have utility for health care providers.

    View details for DOI 10.2196/jmir.6240

    View details for Web of Science ID 000384107200020

    View details for PubMedID 27655225

    View details for PubMedCentralID PMC5052461

  • The digital revolution in phenotyping BRIEFINGS IN BIOINFORMATICS Oellrich, A., Collier, N., Groza, T., Rebholz-Schuhmann, D., Shah, N., Bodenreider, O., Boland, M. R., Georgiev, I., Liu, H., Livingston, K., Luna, A., Mallon, A., Manda, P., Robinson, P. N., Rustici, G., Simon, M., Wang, L., Winnenburg, R., Dumontier, M. 2016; 17 (5): 819-830


    Phenotypes have gained increased notoriety in the clinical and biological domain owing to their application in numerous areas such as the discovery of disease genes and drug targets, phylogenetics and pharmacogenomics. Phenotypes, defined as observable characteristics of organisms, can be seen as one of the bridges that lead to a translation of experimental findings into clinical applications and thereby support 'bench to bedside' efforts. However, to build this translational bridge, a common and universal understanding of phenotypes is required that goes beyond domain-specific definitions. To achieve this ambitious goal, a digital revolution is ongoing that enables the encoding of data in computer-readable formats and the data storage in specialized repositories, ready for integration, enabling translational research. While phenome research is an ongoing endeavor, the true potential hidden in the currently available data still needs to be unlocked, offering exciting opportunities for the forthcoming years. Here, we provide insights into the state-of-the-art in digital phenotyping, by means of representing, acquiring and analyzing phenotype data. In addition, we provide visions of this field for future research work that could enable better applications of phenotype data.

    View details for DOI 10.1093/bib/bbv083

    View details for Web of Science ID 000386971500008

    View details for PubMedID 26420780

  • Reply to R.L. Bowen et al, M. Froehner et al, J.L. Leow et al, and C. Brady et al. Journal of clinical oncology Nead, K. T., Gaskin, G., Chester, C., Swisher-McClure, S., Dudley, J. T., Leeper, N. J., Shah, N. H. 2016; 34 (23): 2804-2805

    View details for DOI 10.1200/JCO.2016.67.9449

    View details for PubMedID 27298415

    View details for PubMedCentralID PMC5019764

  • Characterizing treatment pathways at scale using the OHDSI network PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Hripcsak, G., Ryan, P. B., Duke, J. D., Shah, N. H., Park, R. W., Huser, V., Suchard, M. A., Schuemie, M. J., DeFalco, F. J., Perotte, A., Banda, J. M., Reich, C. G., Schilling, L. M., Matheny, M. E., Meeker, D., Pratt, N., Madigan, D. 2016; 113 (27): 7329-7336


    Observational research promises to complement experimental research by providing large, diverse populations that would be infeasible for an experiment. Observational research can test its own clinical hypotheses, and observational studies also can contribute to the design of experiments and inform the generalizability of experimental research. Understanding the diversity of populations and the variance in care is one component. In this study, the Observational Health Data Sciences and Informatics (OHDSI) collaboration created an international data network with 11 data sources from four countries, including electronic health records and administrative claims data on 250 million patients. All data were mapped to common data standards, patient privacy was maintained by using a distributed model, and results were aggregated centrally. Treatment pathways were elucidated for type 2 diabetes mellitus, hypertension, and depression. The pathways revealed that the world is moving toward more consistent therapy over time across diseases and across locations, but significant heterogeneity remains among sources, pointing to challenges in generalizing clinical trial results. Diabetes favored a single first-line medication, metformin, to a much greater extent than hypertension or depression. About 10% of diabetes and depression patients and almost 25% of hypertension patients followed a treatment pathway that was unique within the cohort. Aside from factors such as sample size and underlying population (academic medical center versus general population), electronic health records data and administrative claims data revealed similar results. Large-scale international observational research is feasible.

    View details for DOI 10.1073/pnas.1510502113

    View details for Web of Science ID 000379021700036

    View details for PubMedID 27274072

  • Generalized enrichment analysis improves the detection of adverse drug events from the biomedical literature BMC BIOINFORMATICS Winnenburg, R., Shah, N. H. 2016; 17


    Identification of associations between marketed drugs and adverse events from the biomedical literature assists drug safety monitoring efforts. Assessing the significance of such literature-derived associations and determining the granularity at which they should be captured remains a challenge. Here, we assess how defining a selection of adverse event terms from MeSH, based on information content, can improve the detection of adverse events for drugs and drug classes. We analyze a set of 105,354 candidate drug adverse event pairs extracted from article indexes in MEDLINE. First, we harmonize extracted adverse event terms by aggregating them into higher-level MeSH terms based on the terms' information content. Then, we determine statistical enrichment of adverse events associated with drug and drug classes using a conditional hypergeometric test that adjusts for dependencies among associated terms. We compare our results with methods based on disproportionality analysis (proportional reporting ratio, PRR) and quantify the improvement in signal detection with our generalized enrichment analysis (GEA) approach using a gold standard of drug-adverse event associations spanning 174 drugs and four events. For single drugs, the best GEA method (Precision: .92/Recall: .71/F1-measure: .80) outperforms the best PRR based method (.69/.69/.69) on all four adverse event outcomes in our gold standard. For drug classes, our GEA performs similarly (.85/.69/.74) when increasing the level of abstraction for adverse event terms. Finally, on examining the 1609 individual drugs in our MEDLINE set, which map to chemical substances in ATC, we find signals for 1379 drugs (10,122 unique adverse event associations) on applying GEA with p < 0.005. We present an approach based on generalized enrichment analysis that can be used to detect associations between drugs, drug classes and adverse events at a given level of granularity, at the same time correcting for known dependencies among events. Our study demonstrates the use of GEA, and the importance of choosing appropriate abstraction levels to complement current drug safety methods. We provide an R package for exploration of alternative abstraction levels of adverse event terms based on information content.

    View details for DOI 10.1186/s12859-016-1080-z

    View details for Web of Science ID 000378846600002

    View details for PubMedID 27333889

  • Predictive modeling of risk factors and complications of cataract surgery. European journal of ophthalmology Gaskin, G. L., Pershing, S., Cole, T. S., Shah, N. H. 2016; 26 (4): 328-337


    Cataract surgery is generally safe; however, severe complications exist. Preexisting conditions are known to predispose patients to intraoperative and postoperative complications. This study quantifies the relationship between aggregated preoperative risk factors and cataract surgery complications, and builds a model predicting outcomes on an individual level, given a constellation of patient characteristics. This study utilized a retrospective cohort of patients age 40 years or older who received cataract surgery. Risk factors, complications, and demographic information were extracted from the Electronic Health Record based on International Classification of Diseases, 9th edition codes, Current Procedural Terminology codes, drug prescription information, and text data mining. We used a bootstrapped least absolute shrinkage and selection operator model to identify highly associated variables. We built random forest classifiers for each complication to create predictive models. Our data corroborated existing literature, including the association of intraoperative complications, complex cataract surgery, black race, and/or prior eye surgery with increased risk of any postoperative complications. We also found other, less well-described risk factors, including diabetes mellitus, young age (<60 years), and hyperopia, as risk factors for complex cataract surgery and intraoperative and postoperative complications. Our predictive models outperformed existing published models. The aggregated risk factors and complications described here can guide new avenues of research and provide specific, personalized risk assessment for a patient considering cataract surgery. Furthermore, the predictive capacity of our models can enable risk stratification of patients, which has utility as a teaching tool as well as informing quality/value-based reimbursements.

    View details for DOI 10.5301/ejo.5000706

    View details for PubMedID 26692059

    View details for PubMedCentralID PMC4930873

  • Statin Intensity or Achieved LDL? Practice-based Evidence for the Evaluation of New Cholesterol Treatment Guidelines PLOS ONE Ross, E. G., Shah, N., Leeper, N. 2016; 11 (5)


    The recently updated American College of Cardiology/American Heart Association cholesterol treatment guidelines outline a paradigm shift in the approach to cardiovascular risk reduction. One major change included a recommendation that practitioners prescribe fixed dose statin regimens rather than focus on specific LDL targets. The goal of this study was to determine whether achieved LDL or statin intensity was more strongly associated with major adverse cardiac events (MACE) using practice-based data from electronic health records (EHR). We analyzed the EHR data of more than 40,000 adult patients on statin therapy between 1995 and 2013. Demographic and clinical variables were extracted from coded data and unstructured clinical text. To account for treatment selection bias we performed propensity score stratification as well as 1:1 propensity score matched analyses. Conditional Cox proportional hazards modeling was used to identify variables associated with MACE. We identified 7,373 adults with complete data whose cholesterol appeared to be actively managed. In a stratified propensity score analysis of the entire cohort over 3.3 years of follow-up, achieved LDL was a significant predictor of MACE outcome (Hazard Ratio 1.1; 95% confidence interval, 1.05-1.2; P < 0.0004), while statin intensity was not. In a 1:1 propensity score matched analysis performed to more aggressively control for covariate balance between treatment groups, achieved LDL remained significantly associated with MACE (HR 1.3; 95% CI, 1.03-1.7; P = 0.03) while treatment intensity again was not a significant predictor. Using EHR data we found that on-treatment achieved LDL level was a significant predictor of MACE. Statin intensity alone was not associated with outcomes. These findings imply that despite recent guidelines, achieved LDL levels are clinically important and LDL titration strategies warrant further investigation in clinical trials.

    View details for DOI 10.1371/journal.pone.0154952

    View details for Web of Science ID 000376882500009

    View details for PubMedID 27227451

  • Learning statistical models of phenotypes using noisy labeled training data. Journal of the American Medical Informatics Association Agarwal, V., Podchiyska, T., Banda, J. M., Goel, V., Leung, T. I., Minty, E. P., Sweeney, T. E., Gyang, E., Shah, N. H. 2016


    Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of utilizing semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record. We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1 penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard. Our models for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.90, 0.89, and 0.86, 0.89, respectively. Local implementations of the previously validated rule-based definitions for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.96, 0.92 and 0.84, 0.87, respectively. We have demonstrated feasibility of learning phenotype models using imperfectly labeled data for a chronic and acute phenotype. Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach. Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.

    View details for DOI 10.1093/jamia/ocw028

    View details for PubMedID 27174893

    View details for PubMedCentralID PMC5070523

  • Postmarket Surveillance of Point-of-Care Glucose Meters through Analysis of Electronic Medical Records CLINICAL CHEMISTRY Schroeder, L. F., Giacherio, D., Gianchandani, R., Engoren, M., Shah, N. H. 2016; 62 (5): 716-724


    The electronic medical record (EMR) holds a promising source of data for active postmarket surveillance of diagnostic accuracy, particularly for point-of-care (POC) devices. Through a comparison with prospective bedside and laboratory accuracy studies, we demonstrate the validity of active surveillance via an EMR data mining method [Data Mining EMRs to Evaluate Coincident Testing (DETECT)], comparing POC glucose results to near-in-time central laboratory glucose results. The Roche ACCU-CHEK Inform II® POC glucose meter was evaluated in a laboratory validation study (n = 73), a prospective bedside intensive care unit (ICU) study (n = 124), and with DETECT (n = 852-27 503). For DETECT, the EMR was queried for POC and central laboratory glucose results with filtering based on bedside collection timestamps, central laboratory time delays, patient location, time period, absence of repeat testing, and presence of peripheral lines. DETECT and the bedside ICU study produced similar estimates of average bias (4.5 vs 5.0 mg/dL) and relative random error (6.3% vs 5.6%), with overlapping CIs. For glucose <100 mg/dL, the laboratory validation study estimated a lower relative random error of 3.6%. POC average bias correlated with central laboratory turnaround times, consistent with 4.8 mg · dL⁻¹ · h⁻¹ glycolysis. After glycolysis adjustment, average bias was estimated by the bedside ICU study at -0.4 mg/dL (CI, -1.6 to 0.9) and DETECT at -0.7 (CI, -1.3 to 0.2), and percentage POC results occurring outside Clinical Laboratory Standards Institute quality goals were 2.4% and 4.8%, respectively. This study validates DETECT for estimating POC glucose meter accuracy compared with a prospective bedside ICU study and establishes it as a reliable postmarket surveillance methodology.

    View details for DOI 10.1373/clinchem.2015.251827

    View details for Web of Science ID 000375173400014

    View details for PubMedID 26988586

  • RegenBase: a knowledge base of spinal cord injury biology for translational research DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION Callahan, A., Abeyruwan, S. W., Al-Ali, H., Sakurai, K., Ferguson, A. R., Popovich, P. G., Shah, N. H., Visser, U., Bixby, J. L., Lemmon, V. P. 2016


    Spinal cord injury (SCI) research is a data-rich field that aims to identify the biological mechanisms resulting in loss of function and mobility after SCI, as well as develop therapies that promote recovery after injury. SCI experimental methods, data and domain knowledge are locked in the largely unstructured text of scientific publications, making large scale integration with existing bioinformatics resources and subsequent analysis infeasible. The lack of standard reporting for experiment variables and results also makes experiment replicability a significant challenge. To address these challenges, we have developed RegenBase, a knowledge base of SCI biology. RegenBase integrates curated literature-sourced facts and experimental details, raw assay data profiling the effect of compounds on enzyme activity and cell growth, and structured SCI domain knowledge in the form of the first ontology for SCI, using Semantic Web representation languages and frameworks. RegenBase uses consistent identifier schemes and data representations that enable automated linking among RegenBase statements and also to other biological databases and electronic resources. By querying RegenBase, we have identified novel biological hypotheses linking the effects of perturbagens to observed behavioral outcomes after SCI. RegenBase is publicly available for browsing, querying and download. Database URL:

    View details for DOI 10.1093/database/baw040

    View details for Web of Science ID 000374094100001

    View details for PubMedID 27055827

  • Comparing high-dimensional confounder control methods for rapid cohort studies from electronic health records JOURNAL OF COMPARATIVE EFFECTIVENESS RESEARCH Low, Y. S., Gallego, B., Shah, N. H. 2016; 5 (2): 179-192


    Electronic health records (EHR), containing rich clinical histories of large patient populations, can provide evidence for clinical decisions when evidence from trials and literature is absent. To enable such observational studies from EHR in real time, particularly in emergencies, rapid confounder control methods that can handle numerous variables and adjust for biases are imperative. This study compares the performance of 18 automatic confounder control methods. Methods include propensity scores, direct adjustment by machine learning, similarity matching and resampling in two simulated and one real-world EHR datasets. Direct adjustment by lasso regression and ensemble models involving multiple resamples have performance comparable to expert-based propensity scores and thus may help provide real-time EHR-based evidence for timely clinical decisions.

    View details for DOI 10.2217/cer.15.53

    View details for Web of Science ID 000372475700007

    View details for PubMedID 26634383

  • Androgen Deprivation Therapy and Future Alzheimer's Disease Risk. Journal of clinical oncology Nead, K. T., Gaskin, G., Chester, C., Swisher-McClure, S., Dudley, J. T., Leeper, N. J., Shah, N. H. 2016; 34 (6): 566-571


    To test the association of androgen deprivation therapy (ADT) in the treatment of prostate cancer with subsequent Alzheimer's disease risk. We used a previously validated and implemented text-processing pipeline to analyze electronic medical record data in a retrospective cohort of patients at Stanford University and Mt. Sinai hospitals. Specifically, we extracted International Classification of Diseases-9th revision diagnosis and Current Procedural Terminology codes, medication lists, and positive-present mentions of drug and disease concepts from all clinical notes. We then tested the effect of ADT on risk of Alzheimer's disease using 1:5 propensity score-matched and traditional multivariable-adjusted Cox proportional hazards models. The duration of ADT use was also tested for association with Alzheimer's disease risk. There were 16,888 individuals with prostate cancer meeting all inclusion and exclusion criteria, with 2,397 (14.2%) receiving ADT during a median follow-up period of 2.7 years (interquartile range, 1.0-5.4 years). Propensity score-matched analysis (hazard ratio, 1.88; 95% CI, 1.10 to 3.20; P = .021) and traditional multivariable-adjusted Cox regression analysis (hazard ratio, 1.66; 95% CI, 1.05 to 2.64; P = .031) both supported a statistically significant association between ADT use and Alzheimer's disease risk. We also observed a statistically significant increased risk of Alzheimer's disease with increasing duration of ADT (P = .016). Our results support an association between the use of ADT in the treatment of prostate cancer and an increased risk of Alzheimer's disease in a general population cohort. This study demonstrates the utility of novel methods to analyze electronic medical record data to generate practice-based evidence.

    View details for DOI 10.1200/JCO.2015.63.6266

    View details for PubMedID 26644522

    View details for PubMedCentralID PMC5070576

  • Reply. Gastroenterology Shah, N. H., Cooke, J. P., Leeper, N. J. 2016; 150 (2): 528-?

    View details for DOI 10.1053/j.gastro.2015.12.017

    View details for PubMedID 26721609

  • An unsupervised learning method to identify reference intervals from a clinical database. Journal of biomedical informatics Poole, S., Schroeder, L. F., Shah, N. 2016; 59: 276-284


    Reference intervals are critical for the interpretation of laboratory results. The development of reference intervals using traditional methods is time consuming and costly. An alternative approach, known as an a posteriori method, requires an expert to enumerate diagnoses and procedures that can affect the measurement of interest. We develop a method, LIMIT, to use laboratory test results from a clinical database to identify ICD9 codes that are associated with extreme laboratory results, thus automating the a posteriori method. LIMIT was developed using sodium serum levels, and validated using potassium serum levels, both tests for which harmonized reference intervals already exist. To test LIMIT, reference intervals for total hemoglobin in whole blood were learned, and were compared with the hemoglobin reference intervals found using an existing a posteriori approach. In addition, prescription of iron supplements were used to identify individuals whose hemoglobin levels were low enough for a clinician to choose to take action. This prescription data indicating clinical action was then used to estimate the validity of the hemoglobin reference interval sets. Results show that LIMIT produces usable reference intervals for sodium, potassium and hemoglobin laboratory tests. The hemoglobin intervals produced using the data driven approaches consistently had higher positive predictive value and specificity in predicting an iron supplement prescription than the existing intervals. LIMIT represents a fast and inexpensive solution for calculating reference intervals, and shows that it is possible to use laboratory results and coded diagnoses to learn laboratory test reference intervals from clinical data warehouses.

    View details for DOI 10.1016/j.jbi.2015.12.010

    View details for PubMedID 26707631

    View details for PubMedCentralID PMC4792744

  • Feasibility of Prioritizing Drug-Drug-Event Associations Found in Electronic Health Records. Drug safety Banda, J. M., Callahan, A., Winnenburg, R., Strasberg, H. R., Cami, A., Reis, B. Y., Vilar, S., Hripcsak, G., Dumontier, M., Shah, N. H. 2016; 39 (1): 45-57


    Several studies have demonstrated the ability to detect adverse events potentially related to multiple drug exposure via data mining. However, the number of putative associations produced by such computational approaches is typically large, making experimental validation difficult. We theorized that those potential associations for which there is evidence from multiple complementary sources are more likely to be true, and explored this idea using a published database of drug-drug-adverse event associations derived from electronic health records (EHRs). We prioritized drug-drug-event associations derived from EHRs using four sources of information: (1) public databases, (2) sources of spontaneous reports, (3) literature, and (4) non-EHR drug-drug interaction (DDI) prediction methods. After pre-filtering the associations by removing those found in public databases, we devised a ranking for associations based on the support from the remaining sources, and evaluated the results of this rank-based prioritization. We collected information for 5983 putative EHR-derived drug-drug-event associations involving 345 drugs and ten adverse events from four data sources and four prediction methods. Only seven drug-drug-event associations (<0.5%) had support from the majority of evidence sources, and about one third (1777) had support from at least one of the evidence sources. Our proof-of-concept method for scoring putative drug-drug-event associations from EHRs offers a systematic and reproducible way of prioritizing associations for further study. Our findings also quantify the agreement (or lack thereof) among complementary sources of evidence for drug-drug-event associations and highlight the challenges of developing a robust approach for prioritizing signals of these associations.

    View details for DOI 10.1007/s40264-015-0352-2

    View details for PubMedID 26446143

  • A curated and standardized adverse drug event resource to accelerate drug safety research. Scientific data Banda, J. M., Evans, L., Vanguri, R. S., Tatonetti, N. P., Ryan, P. B., Shah, N. H. 2016; 3: 160026-?


    Identification of adverse drug reactions (ADRs) during the post-marketing phase is one of the most important goals of drug safety surveillance. Spontaneous reporting systems (SRS) data, which are the mainstay of traditional drug safety surveillance, are used for hypothesis generation and to validate the newer approaches. The publicly available US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) data require substantial curation before they can be used appropriately, and applying different strategies for data cleaning and normalization can have material impact on analysis results. We provide a curated and standardized version of FAERS removing duplicate case records, applying standardized vocabularies with drug names mapped to RxNorm concepts and outcomes mapped to SNOMED-CT concepts, and pre-computed summary statistics about drug-outcome relationships for general consumption. This publicly available resource, along with the source code, will accelerate drug safety research by reducing the amount of time spent performing data management on the source FAERS reports, improving the quality of the underlying data, and enabling standardized analyses using common vocabularies.

    View details for DOI 10.1038/sdata.2016.26

    View details for PubMedID 27193236

  • Rapid identification of slow healing wounds. Wound repair and regeneration Jung, K., Covington, S., Sen, C. K., Januszyk, M., Kirsner, R. S., Gurtner, G. C., Shah, N. H. 2016; 24 (1): 181-188


    Chronic nonhealing wounds have a prevalence of 2% in the United States, and cost an estimated $50 billion annually. Accurate stratification of wounds for risk of slow healing may help guide treatment and referral decisions. We have applied modern machine learning methods and feature engineering to develop a predictive model for delayed wound healing that uses information collected during routine care in outpatient wound care centers. Patient and wound data was collected at 68 outpatient wound care centers operated by Healogics Inc. in 26 states between 2009 and 2013. The dataset included basic demographic information on 59,953 patients, as well as both quantitative and categorical information on 180,696 wounds. Wounds were split into training and test sets by randomly assigning patients to training and test sets. Wounds were considered delayed with respect to healing time if they took more than 15 weeks to heal after presentation at a wound care center. Eleven percent of wounds in this dataset met this criterion. Prognostic models were developed on training data available in the first week of care to predict delayed healing wounds. A held out subset of the training set was used for model selection, and the final model was evaluated on the test set to evaluate discriminative power and calibration. The model achieved an area under the curve of 0.842 (95% confidence interval 0.834-0.847) for the delayed healing outcome and a Brier reliability score of 0.00018. Early, accurate prediction of delayed healing wounds can improve patient care by allowing clinicians to increase the aggressiveness of intervention in patients most at risk.

    View details for DOI 10.1111/wrr.12384

    View details for PubMedID 26606167

  • DISCOVERING PATIENT PHENOTYPES USING GENERALIZED LOW RANK MODELS. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Schuler, A., Liu, V., Wan, J., Callahan, A., Udell, M., Stark, D. E., Shah, N. H. 2016; 21: 144-155


    The practice of medicine is predicated on discovering commonalities or distinguishing characteristics among patients to inform corresponding treatment. Given a patient grouping (hereafter referred to as a phenotype), clinicians can implement a treatment pathway accounting for the underlying cause of disease in that phenotype. Traditionally, phenotypes have been discovered by intuition, experience in practice, and advancements in basic science, but these approaches are often heuristic, labor intensive, and can take decades to produce actionable knowledge. Although our understanding of disease has progressed substantially in the past century, there are still important domains in which our phenotypes are murky, such as in behavioral health or in hospital settings. To accelerate phenotype discovery, researchers have used machine learning to find patterns in electronic health records, but have often been thwarted by missing data, sparsity, and data heterogeneity. In this study, we use a flexible framework called Generalized Low Rank Modeling (GLRM) to overcome these barriers and discover phenotypes in two sources of patient data. First, we analyze data from the 2010 Healthcare Cost and Utilization Project National Inpatient Sample (NIS), which contains upwards of 8 million hospitalization records consisting of administrative codes and demographic information. Second, we analyze a small (N=1746), local dataset documenting the clinical progression of autism spectrum disorder patients using granular features from the electronic health record, including text from physician notes. We demonstrate that low rank modeling successfully captures known and putative phenotypes in these vastly different datasets.

    View details for PubMedID 26776181

    View details for PubMedCentralID PMC4836913

  • Rapid identification of slow healing wounds WOUND REPAIR AND REGENERATION Jung, K., Covington, S., Sen, C. K., Januszyk, M., Kirsner, R. S., Gurtner, G. C., Shah, N. H. 2016; 24 (1): 181-188

    View details for DOI 10.1111/wrr.12384

    View details for Web of Science ID 000372925500018

  • Special issue on bio-ontologies and phenotypes JOURNAL OF BIOMEDICAL SEMANTICS Soldatova, L. N., Collier, N., Oellrich, A., Groza, T., Verspoor, K., Rocca-Serra, P., Dumontier, M., Shah, N. H. 2015; 6

    View details for DOI 10.1186/s13326-015-0040-2

    View details for Web of Science ID 000366682300001

    View details for PubMedID 26682035

  • Implications of non-stationarity on predictive modeling using EHRs JOURNAL OF BIOMEDICAL INFORMATICS Jung, K., Shah, N. H. 2015; 58: 168-174

    View details for DOI 10.1016/j.jbi.2015.10.006

    View details for Web of Science ID 000366791000018

    View details for PubMedID 26483171

  • A method for systematic discovery of adverse drug events from clinical notes. Journal of the American Medical Informatics Association Wang, G., Jung, K., Winnenburg, R., Shah, N. H. 2015; 22 (6): 1196-1204


    Adverse drug events (ADEs) are undesired harmful effects resulting from use of a medication, and occur in 30% of hospitalized patients. The authors have developed a data-mining method for systematic, automated detection of ADEs from electronic medical records. This method uses the text from 9.5 million clinical notes, along with prior knowledge of drug usages and known ADEs, as inputs. These inputs are further processed into statistics used by a discriminative classifier which outputs the probability that a given drug-disorder pair represents a valid ADE association. Putative ADEs identified by the classifier are further filtered for positive support in 2 independent, complementary data sources. The authors evaluate this method by assessing support for the predictions in other curated data sources, including a manually curated, time-indexed reference standard of label change events. This method uses a classifier that achieves an area under the curve of 0.94 on a held out test set. The classifier is used on 2 362 950 possible drug-disorder pairs comprised of 1602 unique drugs and 1475 unique disorders for which we had data, resulting in 240 high-confidence, well-supported drug-AE associations. Eighty-seven of them (36%) are supported in at least one of the resources that have information that was not available to the classifier. This method demonstrates the feasibility of systematic post-marketing surveillance for ADEs using electronic medical records, a key component of the learning healthcare system.

    View details for DOI 10.1093/jamia/ocv102

    View details for PubMedID 26232442

  • Proton pump inhibitors and vascular function: A prospective cross-over pilot study VASCULAR MEDICINE Ghebremariam, Y. T., Cooke, J. P., Khan, F., Thakker, R. N., Chang, P., Shah, N. H., Nead, K. T., Leeper, N. J. 2015; 20 (4): 309-316


    Proton pump inhibitors (PPIs) are commonly used drugs for the treatment of gastric reflux. Recent retrospective cohorts and large database studies have raised concern that the use of PPIs is associated with increased cardiovascular (CV) risk. However, there is no prospective clinical study evaluating whether the use of PPIs directly causes CV harm. We conducted a controlled, open-label, cross-over pilot study among 21 adults aged 18 and older who are healthy (n=11) or have established clinical cardiovascular disease (n=10). Study subjects were assigned to receive a PPI (Prevacid; 30 mg) or a placebo pill once daily for 4 weeks. After a 2-week washout period, participants were crossed over to receive the alternate treatment for the ensuing 4 weeks. Subjects underwent evaluation of vascular function (by the EndoPAT technique) and had plasma levels of asymmetric dimethylarginine (ADMA, an endogenous inhibitor of endothelial function previously implicated in PPI-mediated risk) measured prior to and after each treatment interval. We observed a marginal inverse correlation between the EndoPAT score and plasma levels of ADMA (r = -0.364). Subjects experienced a greater worsening in plasma ADMA levels while on PPI than on placebo, and this trend was more pronounced amongst those subjects with a history of vascular disease. However, these trends did not reach statistical significance, and PPI use was also not associated with an impairment in flow-mediated vasodilation during the course of this study. In conclusion, in this open-label, cross-over pilot study conducted among healthy subjects and coronary disease patients, PPI use did not significantly influence vascular endothelial function. Larger, long-term and blinded trials are needed to mechanistically explain the correlation between PPI use and adverse clinical outcomes, which has recently been reported in retrospective cohort studies.

    View details for DOI 10.1177/1358863X14568444

    View details for Web of Science ID 000359414300001

    View details for PubMedID 25835348

  • Analyzing Information Seeking and Drug-Safety Alert Response by Health Care Professionals as New Methods for Surveillance JOURNAL OF MEDICAL INTERNET RESEARCH Callahan, A., Pernek, I., Stiglic, G., Leskovec, J., Strasberg, H. R., Shah, N. H. 2015; 17 (8)

    View details for DOI 10.2196/jmir.4427

    View details for Web of Science ID 000360306600007

  • Proton Pump Inhibitor Usage and the Risk of Myocardial Infarction in the General Population PLOS ONE Shah, N. H., LePendu, P., Bauer-Mehren, A., Ghebremariam, Y. T., Iyer, S. V., Marcus, J., Nead, K. T., Cooke, J. P., Leeper, N. J. 2015; 10 (6)

    View details for DOI 10.1371/journal.pone.0124653

    View details for Web of Science ID 000355979500007

    View details for PubMedID 26061035

  • Detecting unplanned care from clinician notes in electronic health records. Journal of oncology practice / American Society of Clinical Oncology Tamang, S., Patel, M. I., Blayney, D. W., Kuznetsov, J., Finlayson, S. G., Vetteth, Y., Shah, N. 2015; 11 (3): e313-9


    Reduction in unplanned episodes of care, such as emergency department visits and unplanned hospitalizations, are important quality outcome measures. However, many events are only documented in free-text clinician notes and are labor intensive to detect by manual medical record review. We studied 308,096 free-text machine-readable documents linked to individual entries in our electronic health records, representing care for patients with breast, GI, or thoracic cancer, whose treatment was initiated at one academic medical center, Stanford Health Care (SHC). Using a clinical text-mining tool, we detected unplanned episodes documented in clinician notes (for non-SHC visits) or in coded encounter data for SHC-delivered care and the most frequent symptoms documented in emergency department (ED) notes. Combined reporting increased the identification of patients with one or more unplanned care visits by 32% (15% using coded data; 20% using all the data) among patients with 3 months of follow-up and by 21% (23% using coded data; 28% using all the data) among those with 1 year of follow-up. Based on the textual analysis of SHC ED notes, pain (75%), followed by nausea (54%), vomiting (47%), infection (36%), fever (28%), and anemia (27%), were the most frequent symptoms mentioned. Pain, nausea, and vomiting co-occur in 35% of all ED encounter notes. The text-mining methods we describe can be applied to automatically review free-text clinician notes to detect unplanned episodes of care mentioned in these notes. These methods have broad application for quality improvement efforts in which events of interest occur outside of a network that allows for patient data sharing.

    View details for DOI 10.1200/JOP.2014.002741

    View details for PubMedID 25980019

    View details for PubMedCentralID PMC4438112

  • Comment on: "Zoo or savannah? Choice of training ground for evidence-based pharmacovigilance". Drug safety Harpaz, R., DuMouchel, W., Shah, N. H. 2015; 38 (1): 113-114

    View details for DOI 10.1007/s40264-014-0245-9

    View details for PubMedID 25432779

  • Analyzing Information Seeking and Drug-Safety Alert Response by Health Care Professionals as New Methods for Surveillance. Journal of medical Internet research Callahan, A., Pernek, I., Stiglic, G., Leskovec, J., Strasberg, H. R., Shah, N. H. 2015; 17 (8)


    Patterns in general consumer online search logs have been used to monitor health conditions and to predict health-related activities, but the multiple contexts within which consumers perform online searches make significant associations difficult to interpret. Physician information-seeking behavior has typically been analyzed through survey-based approaches and literature reviews. Activity logs from health care professionals using online medical information resources are thus a valuable yet relatively untapped resource for large-scale medical surveillance. To analyze health care professionals' information-seeking behavior and assess the feasibility of measuring drug-safety alert response from the usage logs of an online medical information resource. Using two years (2011-2012) of usage logs from UpToDate, we measured the volume of searches related to medical conditions with significant burden in the United States, as well as the seasonal distribution of those searches. We quantified the relationship between searches and resulting page views. Using a large collection of online mainstream media articles and Web log posts we also characterized the uptake of a Food and Drug Administration (FDA) alert via changes in UpToDate search activity compared with general online media activity related to the subject of the alert. Diseases and symptoms dominate UpToDate searches. Some searches result in page views of only short duration, while others consistently result in longer-than-average page views. The response to an FDA alert for Celexa, characterized by a change in UpToDate search activity, differed considerably from general online media activity. Changes in search activity appeared later and persisted longer in UpToDate logs. The volume of searches and page view durations related to Celexa before the alert also differed from those after the alert. Understanding the information-seeking behavior associated with online evidence sources can offer insight into the information needs of health professionals and enable large-scale medical surveillance. Our Web log mining approach has the potential to monitor responses to FDA alerts at a national level. Our findings can also inform the design and content of evidence-based medical information resources such as UpToDate.

    View details for DOI 10.2196/jmir.4427

    View details for PubMedID 26293444

  • Bringing cohort studies to the bedside: framework for a "green button" to support clinical decision-making JOURNAL OF COMPARATIVE EFFECTIVENESS RESEARCH Gallego, B., Walter, S. R., Day, R. O., Dunn, A. G., Sivaraman, V., Shah, N., Longhurst, C. A., Coiera, E. 2015; 4 (3): 191-197

    View details for DOI 10.2217/cer.15.12

    View details for Web of Science ID 000355701500002

  • Proton Pump Inhibitor Usage and the Risk of Myocardial Infarction in the General Population. PloS one Shah, N. H., LePendu, P., Bauer-Mehren, A., Ghebremariam, Y. T., Iyer, S. V., Marcus, J., Nead, K. T., Cooke, J. P., Leeper, N. J. 2015; 10 (6)


    Proton pump inhibitors (PPIs) have been associated with adverse clinical outcomes amongst clopidogrel users after an acute coronary syndrome. Recent pre-clinical results suggest that this risk might extend to subjects without any prior history of cardiovascular disease. We explore this potential risk in the general population via data-mining approaches. Using a novel approach for mining clinical data for pharmacovigilance, we queried over 16 million clinical documents on 2.9 million individuals to examine whether PPI usage was associated with cardiovascular risk in the general population. In multiple data sources, we found gastroesophageal reflux disease (GERD) patients exposed to PPIs to have a 1.16 fold increased association (95% CI 1.09-1.24) with myocardial infarction (MI). Survival analysis in a prospective cohort found a two-fold (HR = 2.00; 95% CI 1.07-3.78; P = 0.031) increase in association with cardiovascular mortality. We found that this association exists regardless of clopidogrel use. We also found that H2 blockers, an alternate treatment for GERD, were not associated with increased cardiovascular risk; had they been in place, such pharmacovigilance algorithms could have flagged this risk as early as the year 2000. Consistent with our pre-clinical findings that PPIs may adversely impact vascular function, our data-mining study supports the association of PPI exposure with risk for MI in the general population. These data provide an example of how a combination of experimental studies and data-mining approaches can be applied to prioritize drug safety signals for further investigation.

    View details for DOI 10.1371/journal.pone.0124653

    View details for PubMedID 26061035

  • A formal concept analysis and semantic query expansion cooperation to refine health outcomes of interest. BMC medical informatics and decision making Curé, O. C., Maurer, H., Shah, N. H., Le Pendu, P. 2015; 15: S8-?


    Electronic Health Records (EHRs) are frequently used by clinicians and researchers to search for, extract, and analyze groups of patients by defining Health Outcomes of Interest (HOI). The definition of an HOI is generally considered a complex and time consuming task for health care professionals. In our clinical note-based pharmacovigilance research, we often operate upon potentially hundreds of ontologies at once, expand query inputs, and we also increase the search space over clinical text as well as structured data. Such a method requires specifying an initial set of seed concepts, which are based on concept unique identifiers. This paper presents a novel method based on Formal Concept Analysis (FCA) and Semantic Query Expansion (SQE) to assist the end-user in defining their seed queries and in refining the expanded search space that it encompasses. We evaluate our method over a gold-standard corpus from the 2008 i2b2 Obesity Challenge. This experimentation emphasizes positive results for sensitivity and specificity measures. Our new approach provides better recall with high precision of the obtained results. The most promising aspect of this approach consists in the discovery of positive results not present in our Obesity NLP reference set. Together with a Web graphical user interface, our FCA and SQE cooperation ends up being an efficient approach for refining health outcomes of interest using plain terms. We consider that this approach can be extended to support other domains such as cohort building tools.

    View details for DOI 10.1186/1472-6947-15-S1-S8

    View details for PubMedID 26043839

  • Analyzing search behavior of healthcare professionals for drug safety surveillance. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Odgers, D. J., Harpaz, R., Callahan, A., Stiglic, G., Shah, N. H. 2015; 20: 306-317


    Post-market drug safety surveillance is hugely important and is a significant challenge despite the existence of adverse event (AE) reporting systems. Here we describe a preliminary analysis of search logs from healthcare professionals as a source for detecting adverse drug events. We annotate search log query terms with biomedical terminologies for drugs and events, and then perform a statistical analysis to identify associations among drugs and events within search sessions. We evaluate our approach using two different types of reference standards consisting of known adverse drug events (ADEs) and negative controls. Our approach achieves a discrimination accuracy of 0.85 in terms of the area under the receiver operator curve (AUC) for the reference set of well-established ADEs and an AUC of 0.68 for the reference set of recently labeled ADEs. We also find that the majority of associations in the reference sets have support in the search log data. Despite these promising results additional research is required to better understand users' search behavior, biasing factors, and the overall utility of analyzing healthcare professional search logs for drug safety surveillance.

    View details for PubMedID 25592591

  • Functional evaluation of out-of-the-box text-mining tools for data-mining tasks. Journal of the American Medical Informatics Association Jung, K., LePendu, P., Iyer, S., Bauer-Mehren, A., Percha, B., Shah, N. H. 2015; 22 (1): 121-131


    The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications. We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice.

    View details for DOI 10.1136/amiajnl-2014-002902

    View details for PubMedID 25336595

  • A time-indexed reference standard of adverse drug reactions. Scientific data Harpaz, R., Odgers, D., Gaskin, G., DuMouchel, W., Winnenburg, R., Bodenreider, O., Ripple, A., Szarfman, A., Sorbello, A., Horvitz, E., White, R. W., Shah, N. H. 2014; 1: 140043


    Undetected adverse drug reactions (ADRs) pose a major burden on the health system. Data mining methodologies designed to identify signals of novel ADRs are of deep importance for drug safety surveillance. The development and evaluation of these methodologies requires proper reference benchmarks. While progress has recently been made in developing such benchmarks, our understanding of the performance characteristics of the data mining methodologies is limited because existing benchmarks do not support prospective performance evaluations. We address this shortcoming by providing a reference standard to support prospective performance evaluations. The reference standard was systematically curated from drug labeling revisions, such as new warnings, which were issued and communicated by the US Food and Drug Administration in 2013. The reference standard includes 62 positive test cases and 75 negative controls, and covers 44 drugs and 38 events. We provide usage guidance and empirical support for the reference standard by applying it to analyze two data sources commonly mined for drug safety surveillance.

    View details for DOI 10.1038/sdata.2014.43

    View details for PubMedID 25632348

    View details for PubMedCentralID PMC4306188

  • Toward personalizing treatment for depression: predicting diagnosis and severity JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Huang, S. H., LePendu, P., Iyer, S. V., Tai-Seale, M., Carrell, D., Shah, N. H. 2014; 21 (6): 1069-1075
  • Repurposing cAMP-Modulating Medications to Promote beta-Cell Replication MOLECULAR ENDOCRINOLOGY Zhao, Z., Low, Y. S., Armstrong, N. A., Ryu, J. H., Sun, S. A., Arvanites, A. C., Hollister-Lock, J., Shah, N. H., Weir, G. C., Annes, J. P. 2014; 28 (10): 1682-1697


    Loss of β-cell mass is a cardinal feature of diabetes. Consequently, developing medications to promote β-cell regeneration is a priority. 3'-5'-Cyclic adenosine monophosphate (cAMP) is an intracellular second messenger that modulates β-cell replication. We investigated whether medications that increase cAMP stability or synthesis selectively stimulate β-cell growth. To identify cAMP stabilizing medications that promote β-cell replication we performed high-content screening of a phosphodiesterase-inhibitor (PDE-I) library. PDE3, 4 and 10 inhibitors, including dipyridamole, were found to promote β-cell replication in an adenosine receptor-dependent manner. Dipyridamole's action is specific for β-cells and not α-cells. Next we demonstrated that norepinephrine (NE), a physiologic suppressor of cAMP synthesis in β-cells, impairs β-cell replication via activation of α2-adrenergic receptors. Accordingly, mirtazapine, an α2-adrenergic receptor antagonist and antidepressant, prevents NE-dependent suppression of β-cell replication. Interestingly, NE's growth-suppressive effect is modulated by endogenously expressed catecholamine-inactivating enzymes (COMT and MAO) and is dominant over the growth-promoting effects of PDE-Is. Treatment with dipyridamole and/or mirtazapine promotes β-cell replication in mice and treatment with dipyridamole is associated with reduced glucose levels in humans. This work provides new mechanistic insights into cAMP-dependent growth regulation of β-cells and highlights the potential of commonly prescribed medications to influence β-cell growth.

    View details for DOI 10.1210/me.2014-1120

    View details for Web of Science ID 000346837000010

    View details for PubMedID 25083741

  • Text Mining for Adverse Drug Events: the Promise, Challenges, and State of the Art DRUG SAFETY Harpaz, R., Callahan, A., Tamang, S., Low, Y., Odgers, D., Finlayson, S., Jung, K., LePendu, P., Shah, N. H. 2014; 37 (10): 777-790
  • Toward Enhanced Pharmacovigilance Using Patient-Generated Data on the Internet CLINICAL PHARMACOLOGY & THERAPEUTICS White, R. W., Harpaz, R., Shah, N. H., DuMouchel, W., Horvitz, E. 2014; 96 (2): 239-246


    The promise of augmenting pharmacovigilance with patient-generated data drawn from the Internet was called out by a scientific committee charged with conducting a review of the current and planned pharmacovigilance practices of the US Food and Drug Administration (FDA). To this end, we present a study on harnessing behavioral data drawn from Internet search logs to detect adverse drug reactions (ADRs). By analyzing search queries collected from 80 million consenting users and by using a widely recognized benchmark of ADRs, we found that the performance of ADR detection via search logs is comparable and complementary to detection based on the FDA's adverse event reporting system (AERS). We show that by jointly leveraging data from the AERS and search logs, the accuracy of ADR detection can be improved by 19% relative to the use of each data source independently. The results suggest that leveraging nontraditional sources such as online search logs could supplement existing pharmacovigilance approaches.

    View details for DOI 10.1038/clpt.2014.77

    View details for Web of Science ID 000339602900035

    View details for PubMedID 24713590

  • A 'green button' for using aggregate patient data at the point of care. Health affairs Longhurst, C. A., Harrington, R. A., Shah, N. H. 2014; 33 (7): 1229-1235


    Randomized controlled trials have traditionally been the gold standard against which all other sources of clinical evidence are measured. However, the cost of conducting these trials can be prohibitive. In addition, evidence from the trials frequently rests on narrow patient-inclusion criteria and thus may not generalize well to real clinical situations. Given the increasing availability of comprehensive clinical data in electronic health records (EHRs), some health system leaders are now advocating for a shift away from traditional trials and toward large-scale retrospective studies, which can use practice-based evidence that is generated as a by-product of clinical processes. Other thought leaders in clinical research suggest that EHRs should be used to lower the cost of trials by integrating point-of-care randomization and data capture into clinical processes. We believe that a successful learning health care system will require both approaches, and we suggest a model that resolves this escalating tension: a "green button" function within EHRs to help clinicians leverage aggregate patient data for decision making at the point of care. Giving clinicians such a tool would support patient care decisions in the absence of gold-standard evidence and would help prioritize clinical questions for which EHR-enabled randomization should be carried out. The privacy rule in the Health Insurance Portability and Accountability Act (HIPAA) of 1996 may require revision to support this novel use of patient data.

    View details for DOI 10.1377/hlthaff.2014.0099

    View details for PubMedID 25006150

  • Response to letters regarding article, "unexpected effect of proton pump inhibitors: elevation of the cardiovascular risk factor asymmetric dimethylarginine". Circulation Ghebremariam, Y. T., Lee, J. C., LePendu, P., Erlanson, D. A., Slaviero, A., Shah, N. H., Leiper, J. M., Cooke, J. P. 2014; 129 (13)

    View details for DOI 10.1161/CIRCULATIONAHA.114.009343

    View details for PubMedID 24687654

  • Mining clinical text for signals of adverse drug-drug interactions. Journal of the American Medical Informatics Association Iyer, S. V., Harpaz, R., LePendu, P., Bauer-Mehren, A., Shah, N. H. 2014; 21 (2): 353-362


    Electronic health records (EHRs) are increasingly being used to complement the FDA Adverse Event Reporting System (FAERS) and to enable active pharmacovigilance. Over 30% of all adverse drug reactions are caused by drug-drug interactions (DDIs) and result in significant morbidity every year, making their early identification vital. We present an approach for identifying DDI signals directly from the textual portion of EHRs. We recognize mentions of drug and event concepts from over 50 million clinical notes from two sites to create a timeline of concept mentions for each patient. We then use adjusted disproportionality ratios to identify significant drug-drug-event associations among 1165 drugs and 14 adverse events. To validate our results, we evaluate our performance on a gold standard of 1698 DDIs curated from existing knowledge bases, as well as with signaling DDI associations directly from FAERS using established methods. Our method achieves good performance, as measured by our gold standard (area under the receiver operator characteristic (ROC) curve >80%), on two independent EHR datasets and the performance is comparable to that of signaling DDIs from FAERS. We demonstrate the utility of our method for early detection of DDIs and for identifying alternatives for risky drug combinations. Finally, we publish a first of its kind database of population event rates among patients on drug combinations based on an EHR corpus. It is feasible to identify DDI signals and estimate the rate of adverse events among patients on drug combinations, directly from clinical text; this could have utility in prioritizing drug interaction surveillance as well as in clinical decision support.

    View details for DOI 10.1136/amiajnl-2013-001612

    View details for PubMedID 24158091

  • Automated detection of off-label drug use. PloS one Jung, K., LePendu, P., Chen, W. S., Iyer, S. V., Readhead, B., Dudley, J. T., Shah, N. H. 2014; 9 (2)


    Off-label drug use, defined as use of a drug in a manner that deviates from its approved use defined by the drug's FDA label, is problematic because such uses have not been evaluated for safety and efficacy. Studies estimate that 21% of prescriptions are off-label, and only 27% of those have evidence of safety and efficacy. We describe a data-mining approach for systematically identifying off-label usages using features derived from free text clinical notes and features extracted from two databases on known usage (Medi-Span and DrugBank). We trained a highly accurate predictive model that detects novel off-label uses among 1,602 unique drugs and 1,472 unique indications. We validated 403 predicted uses across independent data sources. Finally, we prioritize well-supported novel usages for further investigation on the basis of drug safety and cost.

    View details for DOI 10.1371/journal.pone.0089324

    View details for PubMedID 24586689

  • Building the graph of medicine from millions of clinical narratives. Scientific data Finlayson, S. G., LePendu, P., Shah, N. H. 2014; 1: 140032


    Electronic health records (EHR) represent a rich and relatively untapped resource for characterizing the true nature of clinical practice and for quantifying the degree of inter-relatedness of medical entities such as drugs, diseases, procedures and devices. We provide a unique set of co-occurrence matrices, quantifying the pairwise mentions of 3 million terms mapped onto 1 million clinical concepts, calculated from the raw text of 20 million clinical notes spanning 19 years of data. Co-frequencies were computed by means of a parallelized annotation, hashing, and counting pipeline that was applied over clinical notes from Stanford Hospitals and Clinics. The co-occurrence matrix quantifies the relatedness among medical concepts which can serve as the basis for many statistical tests, and can be used to directly compute Bayesian conditional probabilities, association rules, as well as a range of test statistics such as relative risks and odds ratios. This dataset can be leveraged to quantitatively assess comorbidity, drug-drug, and drug-disease patterns for a range of clinical, epidemiological, and financial applications.

    View details for DOI 10.1038/sdata.2014.32

    View details for PubMedID 25977789

  • Profiling risk factors for chronic uveitis in juvenile idiopathic arthritis: a new model for EHR-based research PEDIATRIC RHEUMATOLOGY Cole, T. S., Frankovich, J., Iyer, S., LePendu, P., Bauer-Mehren, A., Shah, N. H. 2013; 11

    View details for DOI 10.1186/1546-0096-11-45

    View details for Web of Science ID 000328822300001

    View details for PubMedID 24299016

  • Mining the ultimate phenome repository NATURE BIOTECHNOLOGY Shah, N. H. 2013; 31 (12): 1095-1097

    View details for DOI 10.1038/nbt.2757

    View details for Web of Science ID 000328251900020

    View details for PubMedID 24316646

  • Identifying phenotypic signatures of neuropsychiatric disorders from electronic medical records. Journal of the American Medical Informatics Association Lyalina, S., Percha, B., LePendu, P., Iyer, S. V., Altman, R. B., Shah, N. H. 2013; 20 (e2): e297-305


    Mental illness is the leading cause of disability in the USA, but boundaries between different mental illnesses are notoriously difficult to define. Electronic medical records (EMRs) have recently emerged as a powerful new source of information for defining the phenotypic signatures of specific diseases. We investigated how EMR-based text mining and statistical analysis could elucidate the phenotypic boundaries of three important neuropsychiatric illnesses: autism, bipolar disorder, and schizophrenia. We analyzed the medical records of over 7000 patients at two facilities using an automated text-processing pipeline to annotate the clinical notes with Unified Medical Language System codes and then searching for enriched codes, and associations among codes, that were representative of the three disorders. We used dimensionality-reduction techniques on individual patient records to understand individual-level phenotypic variation within each disorder, as well as the degree of overlap among disorders. We demonstrate that automated EMR mining can be used to extract relevant drugs and phenotypes associated with neuropsychiatric disorders and characteristic patterns of associations among them. Patient-level analyses suggest a clear separation between autism and the other disorders, while revealing significant overlap between schizophrenia and bipolar disorder. They also enable localization of individual patients within the phenotypic 'landscape' of each disorder. Because EMRs reflect the realities of patient care rather than idealized conceptualizations of disease states, we argue that automated EMR mining can help define the boundaries between different mental illnesses, facilitate cohort building for clinical and genomic studies, and reveal how clear expert-defined disease boundaries are in practice.

    View details for DOI 10.1136/amiajnl-2013-001933

    View details for PubMedID 23956017

  • A Nondegenerate Code of Deleterious Variants in Mendelian Loci Contributes to Complex Disease Risk CELL Blair, D. R., Lyttle, C. S., Mortensen, J. M., Bearden, C. F., Jensen, A. B., Khiabanian, H., Melamed, R., Rabadan, R., Bernstam, E. V., Brunak, S., Jensen, L. J., Nicolae, D., Shah, N. H., Grossman, R. L., Cox, N. J., White, K. P., Rzhetsky, A. 2013; 155 (1): 70-80


    Although countless highly penetrant variants have been associated with Mendelian disorders, the genetic etiologies underlying complex diseases remain largely unresolved. By mining the medical records of over 110 million patients, we examine the extent to which Mendelian variation contributes to complex disease risk. We detect thousands of associations between Mendelian and complex diseases, revealing a nondegenerate, phenotypic code that links each complex disorder to a unique collection of Mendelian loci. Using genome-wide association results, we demonstrate that common variants associated with complex diseases are enriched in the genes indicated by this "Mendelian code." Finally, we detect hundreds of comorbidity associations among Mendelian disorders, and we use probabilistic genetic modeling to demonstrate that Mendelian variants likely contribute nonadditively to the risk for a subset of complex diseases. Overall, this study illustrates a complementary approach for mapping complex disease loci and provides unique predictions concerning the etiologies of specific diseases.

    View details for DOI 10.1016/j.cell.2013.08.030

    View details for Web of Science ID 000324916700010

    View details for PubMedID 24074861

  • Response to "Logistic regression in signal detection: another piece added to the puzzle". Clinical pharmacology & therapeutics Harpaz, R., DuMouchel, W., LePendu, P., Bauer-Mehren, A., Ryan, P., Shah, N. H. 2013; 94 (3): 313

    View details for DOI 10.1038/clpt.2013.125

    View details for PubMedID 23756371

  • Unexpected effect of proton pump inhibitors: elevation of the cardiovascular risk factor asymmetric dimethylarginine. Circulation Ghebremariam, Y. T., LePendu, P., Lee, J. C., Erlanson, D. A., Slaviero, A., Shah, N. H., Leiper, J., Cooke, J. P. 2013; 128 (8): 845-853


    Proton pump inhibitors (PPIs) are gastric acid-suppressing agents widely prescribed for the treatment of gastroesophageal reflux disease. Recently, several studies in patients with acute coronary syndrome have raised the concern that use of PPIs in these patients may increase their risk of major adverse cardiovascular events. The mechanism of this possible adverse effect is not known. Whether the general population might also be at risk has not been addressed. Plasma asymmetrical dimethylarginine (ADMA) is an endogenous inhibitor of nitric oxide synthase. Elevated plasma ADMA is associated with increased risk for cardiovascular disease, likely because of its attenuation of the vasoprotective effects of endothelial nitric oxide synthase. We find that PPIs elevate plasma ADMA levels and reduce nitric oxide levels and endothelium-dependent vasodilation in a murine model and ex vivo human tissues. PPIs increase ADMA because they bind to and inhibit dimethylarginine dimethylaminohydrolase, the enzyme that degrades ADMA. We present a plausible biological mechanism to explain the association of PPIs with increased major adverse cardiovascular events in patients with unstable coronary syndromes. Of concern, this adverse mechanism is also likely to extend to the general population using PPIs. This finding compels additional clinical investigations and pharmacovigilance directed toward understanding the cardiovascular risk associated with the use of the PPIs in the general population.

    View details for DOI 10.1161/CIRCULATIONAHA.113.003602

    View details for PubMedID 23825361

  • Performance of pharmacovigilance signal-detection algorithms for the FDA adverse event reporting system. Clinical pharmacology & therapeutics Harpaz, R., DuMouchel, W., LePendu, P., Bauer-Mehren, A., Ryan, P., Shah, N. H. 2013; 93 (6): 539-546


    Signal-detection algorithms (SDAs) are recognized as vital tools in pharmacovigilance. However, their performance characteristics are generally unknown. By leveraging a unique gold standard recently made public by the Observational Medical Outcomes Partnership (OMOP) and by conducting a unique systematic evaluation, we provide new insights into the diagnostic potential and characteristics of SDAs that are routinely applied to the US Food and Drug Administration (FDA) Adverse Event Reporting System (AERS). We find that SDAs can attain reasonable predictive accuracy in signaling adverse events. Two performance classes emerge, indicating that the class of approaches that address confounding and masking effects benefits safety surveillance. Our study shows that not all events are equally detectable, suggesting that specific events might be monitored more effectively using other data sources. We provide performance guidelines for several operating scenarios to inform the trade-off between sensitivity and specificity for specific use cases. We also propose an approach and demonstrate its application in identifying optimal signaling thresholds, given specific misclassification tolerances.

    View details for DOI 10.1038/clpt.2013.24

    View details for PubMedID 23571771

  • Pharmacovigilance using clinical notes. Clinical pharmacology & therapeutics LePendu, P., Iyer, S. V., Bauer-Mehren, A., Harpaz, R., Mortensen, J. M., Podchiyska, T., Ferris, T. A., Shah, N. H. 2013; 93 (6): 547-555


    With increasing adoption of electronic health records (EHRs), there is an opportunity to use the free-text portion of EHRs for pharmacovigilance. We present novel methods that annotate the unstructured clinical notes and transform them into a deidentified patient-feature matrix encoded using medical terminologies. We demonstrate the use of the resulting high-throughput data for detecting drug-adverse event associations and adverse events associated with drug-drug interactions. We show that these methods flag adverse events early (in most cases before an official alert), allow filtering of spurious signals by adjusting for potential confounding, and compile prevalence information. We argue that analyzing large volumes of free-text clinical notes enables drug safety surveillance using a yet untapped data source. Such data mining can be used for hypothesis generation and for rapid analysis of suspected adverse event risk.

    View details for DOI 10.1038/clpt.2013.47

    View details for PubMedID 23571773

  • Combing signals from spontaneous reports and electronic health records for detection of adverse drug reactions JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Harpaz, R., Vilar, S., DuMouchel, W., Salmasian, H., Haerian, K., Shah, N. H., Chase, H. S., Friedman, C. 2013; 20 (3): 413-419


    Data-mining algorithms that can produce accurate signals of potentially novel adverse drug reactions (ADRs) are a central component of pharmacovigilance. We propose a signal-detection strategy that combines the adverse event reporting system (AERS) of the Food and Drug Administration and electronic health records (EHRs) by requiring signaling in both sources. We claim that this approach leads to improved accuracy of signal detection when the goal is to produce a highly selective ranked set of candidate ADRs. Our investigation was based on over 4 million AERS reports and information extracted from 1.2 million EHR narratives. Well-established methodologies were used to generate signals from each source. The study focused on ADRs related to three high-profile serious adverse reactions. A reference standard of over 600 established and plausible ADRs was created and used to evaluate the proposed approach against a comparator. The combined signaling system achieved a statistically significant large improvement over AERS (baseline) in the precision of top ranked signals. The average improvement ranged from 31% to almost threefold for different evaluation categories. Using this system, we identified a new association between the agent, rasburicase, and the adverse event, acute pancreatitis, which was supported by clinical review. The results provide promising initial evidence that combining AERS with EHRs via the framework of replicated signaling can improve the accuracy of signal detection for certain operating scenarios. The use of additional EHR data is required to further evaluate the capacity and limits of this system and to extend the generalizability of these results.

    View details for DOI 10.1136/amiajnl-2012-000930

    View details for Web of Science ID 000317477500003

    View details for PubMedID 23118093

  • Web-scale pharmacovigilance: listening to signals from the crowd JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION White, R. W., Tatonetti, N. P., Shah, N. H., Altman, R. B., Horvitz, E. 2013; 20 (3): 404-408


    Adverse drug events cause substantial morbidity and mortality and are often discovered after a drug comes to market. We hypothesized that Internet users may provide early clues about adverse drug events via their online information-seeking. We conducted a large-scale study of Web search log data gathered during 2010. We pay particular attention to the specific drug pairing of paroxetine and pravastatin, whose interaction was reported to cause hyperglycemia after the time period of the online logs used in the analysis. We also examine sets of drug pairs known to be associated with hyperglycemia and those not associated with hyperglycemia. We find that anonymized signals on drug interactions can be mined from search logs. Compared to analyses of other sources such as electronic health records (EHR), logs are inexpensive to collect and mine. The results demonstrate that logs of the search activities of populations of computer users can contribute to drug safety surveillance.

    View details for DOI 10.1136/amiajnl-2012-001482

    View details for Web of Science ID 000317477500001

    View details for PubMedID 23467469

    View details for PubMedCentralID PMC3628066

  • Selected papers from the 15th Annual Bio-Ontologies Special Interest Group Meeting. Journal of biomedical semantics Soldatova, L. N., Sansone, S., Dumontier, M., Shah, N. H. 2013; 4: I1


    Over the past 15 years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in bio-ontologies development, its applications to biomedicine and, more generally, the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The seven papers and the commentary selected for this supplement span a wide range of topics including: web-based querying over multiple ontologies, integration of data, annotating patent records, NCBO Web services, ontology developments for probabilistic reasoning and for physiological processes, and analysis of the progress of annotation and structural GO changes.

    View details for DOI 10.1186/2041-1480-4-S1-I1

    View details for PubMedID 23735191

  • STOP using just GO: a multi-ontology hypothesis generation tool for high throughput experimentation BMC BIOINFORMATICS Wittkop, T., Teravest, E., Evani, U. S., Fleisch, K. M., Berman, A. E., Powell, C., Shah, N. H., Mooney, S. D. 2013; 14


    Gene Ontology (GO) enrichment analysis remains one of the most common methods for hypothesis generation from high throughput datasets. However, we believe that researchers strive to test other hypotheses that fall outside of GO. Here, we developed and evaluated a tool for hypothesis generation from gene or protein lists using ontological concepts present in manually curated text that describes those genes and proteins. As a consequence we have developed the method Statistical Tracking of Ontological Phrases (STOP) that expands the realm of testable hypotheses in gene set enrichment analyses by integrating automated annotations of genes to terms from over 200 biomedical ontologies. While not as precise as manually curated terms, we find that the additional enriched concepts have value when coupled with traditional enrichment analyses using curated terms. Multiple ontologies have been developed for gene and protein annotation; by using a dataset of both manually curated GO terms and automatically recognized concepts from curated text, we can expand the realm of hypotheses that can be discovered. The web application STOP is available at

    View details for DOI 10.1186/1471-2105-14-53

    View details for Web of Science ID 000318030400001

    View details for PubMedID 23409969

  • Practice-based evidence: profiling the safety of cilostazol by text-mining of clinical notes. PloS one Leeper, N. J., Bauer-Mehren, A., Iyer, S. V., LePendu, P., Olson, C., Shah, N. H. 2013; 8 (5)


    Peripheral arterial disease (PAD) is a growing problem with few available therapies. Cilostazol is the only FDA-approved medication with a class I indication for intermittent claudication, but carries a black box warning due to concerns for increased cardiovascular mortality. To assess the validity of this black box warning, we employed a novel text-analytics pipeline to quantify the adverse events associated with Cilostazol use in a clinical setting, including patients with congestive heart failure (CHF). We analyzed the electronic medical records of 1.8 million subjects from the Stanford clinical data warehouse spanning 18 years using a novel text-mining/statistical analytics pipeline. We identified 232 PAD patients taking Cilostazol and created a control group of 1,160 PAD patients not taking this drug using 1∶5 propensity-score matching. Over a mean follow up of 4.2 years, we observed no association between Cilostazol use and any major adverse cardiovascular event including stroke (OR = 1.13, CI [0.82, 1.55]), myocardial infarction (OR = 1.00, CI [0.71, 1.39]), or death (OR = 0.86, CI [0.63, 1.18]). Cilostazol was not associated with an increase in any arrhythmic complication. We also identified a subset of CHF patients who were prescribed Cilostazol despite its black box warning, and found that it did not increase mortality in this high-risk group of patients. This proof of principle study shows the potential of text-analytics to mine clinical data warehouses to uncover 'natural experiments' such as the use of Cilostazol in CHF patients. We envision this method will have broad applications for examining difficult to test clinical hypotheses and to aid in post-marketing drug safety surveillance. Moreover, our observations argue for a prospective study to examine the validity of a drug safety warning that may be unnecessarily limiting the use of an efficacious therapy.

    View details for DOI 10.1371/journal.pone.0063499

    View details for PubMedID 23717437

  • Mining Biomedical Ontologies and Data Using RDF Hypergraphs 12th International Conference on Machine Learning and Applications (ICMLA) Liu, H., Dou, D., Jin, R., LePendu, P., Shah, N. IEEE. 2013: 141–146
  • Automated Detection of Systematic Off-label Drug Use in Free Text of Electronic Medical Records. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science Jung, K., LePendu, P., Shah, N. 2013; 2013: 94-98


    Off-label use of a drug occurs when it is used in a manner that deviates from its FDA label. Studies estimate that 21% of prescriptions are off-label, with only 27% of those uses supported by evidence of safety and efficacy. We have developed methods to detect population level off-label usage using computationally efficient annotation of free text from clinical notes to generate features encoding empirical information about drug-disease mentions. By including additional features encoding prior knowledge about drugs, diseases, and known usage, we trained a highly accurate predictive model that was used to detect novel candidate off-label usages in a very large clinical corpus. We show that the candidate uses are plausible and can be prioritized for further analysis in terms of safety and efficacy.

    View details for PubMedID 24303308

  • Profiling risk factors for chronic uveitis in juvenile idiopathic arthritis: a new model for EHR-based research. Pediatric rheumatology online journal Cole, T. S., Frankovich, J., Iyer, S., LePendu, P., Bauer-Mehren, A., Shah, N. H. 2013; 11 (1): 45


    Juvenile idiopathic arthritis is the most common rheumatic disease in children. Chronic uveitis is a common and serious comorbid condition of juvenile idiopathic arthritis, with insidious presentation and potential to cause blindness. Knowledge of clinical associations will improve risk stratification. Based on clinical observation, we hypothesized that allergic conditions are associated with chronic uveitis in juvenile idiopathic arthritis patients. This study is a retrospective cohort study using Stanford's clinical data warehouse containing data from Lucile Packard Children's Hospital from 2000-2011 to analyze patient characteristics associated with chronic uveitis in a large juvenile idiopathic arthritis cohort. Clinical notes in patients under 16 years of age were processed via a validated text analytics pipeline. Bivariate-associated variables were used in a multivariate logistic regression adjusted for age, gender, and race. Previously reported associations were evaluated to validate our methods. The main outcome measure was presence of terms indicating allergy or allergy medications use overrepresented in juvenile idiopathic arthritis patients with chronic uveitis. Residual text features were then used in unsupervised hierarchical clustering to compare clinical text similarity between patients with and without uveitis. Previously reported associations with uveitis in juvenile idiopathic arthritis patients (earlier age at arthritis diagnosis, oligoarticular-onset disease, antinuclear antibody status, history of psoriasis) were reproduced in our study. Use of allergy medications and terms describing allergic conditions were independently associated with chronic uveitis. The association with allergy drugs when adjusted for known associations remained significant (OR 2.54, 95% CI 1.22-5.4). This study shows the potential of using a validated text analytics pipeline on clinical data warehouses to examine practice-based evidence for evaluating hypotheses formed during patient care. Our study reproduces four known associations with uveitis development in juvenile idiopathic arthritis patients, and reports a new association between allergic conditions and chronic uveitis in juvenile idiopathic arthritis patients.

    View details for DOI 10.1186/1546-0096-11-45

    View details for PubMedID 24299016

  • Network analysis of unstructured EHR data for clinical research. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science Bauer-Mehren, A., LePendu, P., Iyer, S. V., Harpaz, R., Leeper, N. J., Shah, N. H. 2013; 2013: 14-18


    In biomedical research, network analysis provides a conceptual framework for interpreting data from high-throughput experiments. For example, protein-protein interaction networks have been successfully used to identify candidate disease genes. Recently, advances in clinical text processing and the increasing availability of clinical data have enabled analogous analyses on data from electronic medical records. We constructed networks of diseases, drugs, medical devices and procedures using concepts recognized in clinical notes from the Stanford clinical data warehouse. We demonstrate the use of the resulting networks for clinical research informatics in two ways, cohort construction and outcomes analysis, by examining the safety of cilostazol in peripheral artery disease patients as a use case. We show that the network-based approaches can be used for constructing patient cohorts as well as for analyzing differences in outcomes by comparing with standard methods, and discuss the advantages offered by network-based approaches.

    View details for PubMedID 24303229

  • Chapter 9: Analyses Using Disease Ontologies PLOS COMPUTATIONAL BIOLOGY Shah, N. H., Cole, T., Musen, M. A. 2012; 8 (12)


    Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of "significant genes." One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene set, and is widely used to make sense of the results of high-throughput experiments. The canonical example of enrichment analysis is when the output dataset is a list of genes differentially expressed in some condition. To determine the biological relevance of a lengthy gene list, the usual solution is to perform enrichment analysis with the GO. We can aggregate the annotating GO concepts for each gene in this list, and arrive at a profile of the biological processes or mechanisms affected by the condition under study. While GO has been the principal target for enrichment analysis, the methods of enrichment analysis are generalizable. We can conduct the same sort of profiling along other ontologies of interest. Just as scientists can ask "Which biological process is over-represented in my set of interesting genes or proteins?" we can also ask "Which disease (or class of diseases) is over-represented in my set of interesting genes or proteins?". For example, by annotating known protein mutations with disease terms from the ontologies in BioPortal, Mort et al. recently identified a class of diseases (blood coagulation disorders) that were associated with a 14-fold depletion in substitutions at O-linked glycosylation sites. With the availability of tools for automatic annotation of datasets with terms from disease ontologies, there is no reason to restrict enrichment analyses to the GO. In this chapter, we will discuss methods to perform enrichment analysis using any ontology available in the biomedical domain. We will review the general methodology of enrichment analysis, the associated challenges, and discuss the novel translational analyses enabled by the existence of public, national computational infrastructure and by the use of disease ontologies in such analyses.

    View details for DOI 10.1371/journal.pcbi.1002827

    View details for Web of Science ID 000312901500032

    View details for PubMedID 23300417
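
    The over-representation question posed in the abstract above is commonly framed as a one-sided hypergeometric test. The Python sketch below illustrates that framing under stated assumptions: the function name `enrichment_p_value` and all counts are hypothetical, and the chapter itself may present other statistics or multiple-testing corrections that are not shown here.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value for over-representation.

    population: number of genes in the background set
    annotated:  background genes annotated with the ontology term
    sample:     size of the gene set of interest
    hits:       genes in the set of interest annotated with the term

    Returns P(X >= hits) when `sample` genes are drawn without
    replacement from the background.
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts for illustration only: 12 of 200 selected genes
# carry a disease term that annotates 300 of 20,000 background genes.
p = enrichment_p_value(population=20000, annotated=300, sample=200, hits=12)
print(f"p = {p:.3g}")
```

    Because the test only needs a background annotation set and a set of interest, the same function applies whether the term comes from the GO or from a disease ontology.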

  • Mining the pharmacogenomics literature-a survey of the state of the art BRIEFINGS IN BIOINFORMATICS Hahn, U., Cohen, K. B., Garten, Y., Shah, N. H. 2012; 13 (4): 460-494


    This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.

    View details for DOI 10.1093/bib/bbs018

    View details for Web of Science ID 000306925000007

    View details for PubMedID 22833496

  • Using ontology-based annotation to profile disease research JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Liu, Y., Coulet, A., LePendu, P., Shah, N. H. 2012; 19 (E1): E177-E186


    Profiling the allocation and trend of research activity is of interest to funding agencies, administrators, and researchers. However, the lack of a common classification system hinders the comprehensive and systematic profiling of research activities. This study introduces ontology-based annotation as a method to overcome this difficulty. Analyzing over a decade of funding data and publication data, the trends of disease research are profiled across topics, across institutions, and over time. This study introduces and explores the notions of research sponsorship and allocation and shows that leaders of research activity can be identified within specific disease areas of interest, such as those with high mortality or high sponsorship. The funding profiles of disease topics readily cluster themselves in agreement with the ontology hierarchy and closely mirror the funding agency priorities. Finally, four temporal trends are identified among research topics. This work utilizes disease ontology (DO)-based annotation to profile effectively the landscape of biomedical research activity. By using DO in this manner a use-case driven mechanism is also proposed to evaluate the utility of classification hierarchies.

    View details for DOI 10.1136/amiajnl-2011-000631

    View details for Web of Science ID 000314151400029

    View details for PubMedID 22494789

  • Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Wu, S. T., Liu, H., Li, D., Tao, C., Musen, M. A., Chute, C. G., Shah, N. H. 2012; 19 (E1): E149-E156


    To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources. Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data. For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106426 and 94788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms. The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.

    View details for DOI 10.1136/amiajnl-2011-000744

    View details for Web of Science ID 000314151400025

    View details for PubMedID 22493050

    View details for PubMedCentralID PMC3392861

  • Novel Data-Mining Methodologies for Adverse Drug Event Discovery and Analysis CLINICAL PHARMACOLOGY & THERAPEUTICS Harpaz, R., Dumouchel, W., Shah, N. H., Madigan, D., Ryan, P., Friedman, C. 2012; 91 (6): 1010-1021


    An important goal of the health system is to identify new adverse drug events (ADEs) in the postapproval period. Data-mining methods that can transform data into meaningful knowledge to inform patient safety have proven essential for this purpose. New opportunities have emerged to harness data sources that have not been used within the traditional framework. This article provides an overview of recent methodological innovations and data sources used to support ADE discovery and analysis.

    View details for DOI 10.1038/clpt.2012.50

    View details for Web of Science ID 000304245800019

    View details for PubMedID 22549283

  • The coming age of data-driven medicine: translational bioinformatics' next frontier JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Shah, N. H., Tenenbaum, J. D. 2012; 19 (E1): E2-E4

    View details for DOI 10.1136/amiajnl-2012-000969

    View details for Web of Science ID 000314151400002

    View details for PubMedID 22718035

  • The National Center for Biomedical Ontology JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Musen, M. A., Noy, N. F., Shah, N. H., Whetzel, P. L., Chute, C. G., Story, M., Smith, B. 2012; 19 (2): 190-195


    The National Center for Biomedical Ontology is now in its seventh year. The goals of this National Center for Biomedical Computing are to: create and maintain a repository of biomedical ontologies and terminologies; build tools and web services to enable the use of ontologies and terminologies in clinical and translational research; educate their trainees and the scientific community broadly about biomedical ontology and ontology-based technology and best practices; and collaborate with a variety of groups who develop and use ontologies and terminologies in biomedicine. The centerpiece of the National Center for Biomedical Ontology is a web-based resource known as BioPortal. BioPortal makes available for research in computationally useful forms more than 270 of the world's biomedical ontologies and terminologies, and supports a wide range of web services that enable investigators to use the ontologies to annotate and retrieve data, to generate value sets and special-purpose lexicons, and to perform advanced analytics on a wide range of biomedical data.

    View details for DOI 10.1136/amiajnl-2011-000523

    View details for Web of Science ID 000300768100010

    View details for PubMedID 22081220

    View details for PubMedCentralID PMC3277625

  • Translational bioinformatics embraces big data. Yearbook of medical informatics Shah, N. H. 2012; 7 (1): 130-134


    We review the latest trends and major developments in translational bioinformatics in the year 2011-2012. Our emphasis is on highlighting the key events in the field and pointing at promising research areas for the future. The key take-home points are: • Translational informatics is ready to revolutionize human health and healthcare using large-scale measurements on individuals. • Data-centric approaches that compute on massive amounts of data (often called "Big Data") to discover patterns and to make clinically relevant predictions will gain adoption. • Research that bridges the latest multimodal measurement technologies with large amounts of electronic healthcare data is increasing; and is where new breakthroughs will occur.

    View details for PubMedID 22890354

  • Using temporal patterns in medical records to discern adverse drug events from indications. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science Liu, Y., LePendu, P., Iyer, S., Shah, N. H. 2012; 2012: 47-56


    Researchers estimate that electronic health record systems record roughly 2 million ambulatory adverse drug events and that patients suffer from adverse drug events in roughly 30% of hospital stays. Some have recently used structured databases of patient medical records and health insurance claims, going beyond the current paradigm of using spontaneous reporting systems like AERS, to detect drug-safety signals. However, most efforts do not use the free text from clinical notes in monitoring for drug-safety signals. We hypothesize that drug-disease co-occurrences, extracted from ontology-based annotations of the clinical notes, can be examined for statistical enrichment and used for drug safety surveillance. When analyzing such co-occurrences of drugs and diseases, one major challenge is to differentiate whether the disease in a drug-disease pair represents an indication or an adverse event. We demonstrate that it is possible to make this distinction by combining the frequency distribution of the drug, the disease, and the drug-disease pair as well as the temporal ordering of the drugs and diseases in each pair across more than one million patients.

    View details for PubMedID 22779050

    View details for PubMedCentralID PMC3392062
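
    The temporal-ordering idea in the abstract above can be illustrated with a small Python sketch: for one drug-disease pair, compute the fraction of patients whose first drug mention precedes their first disease mention. The function name `fraction_drug_first`, the input layout, and the dates are assumptions made for illustration. This shows only the ordering feature; the paper combines it with frequency distributions in a way not reproduced here.

```python
from datetime import date


def fraction_drug_first(first_mentions):
    """Fraction of patients whose first drug mention precedes the disease.

    first_mentions: iterable of (drug_date, disease_date) tuples, one per
    patient whose notes mention both the drug and the disease.
    Returns None when there are no such patients.
    """
    pairs = list(first_mentions)
    if not pairs:
        return None
    drug_first = sum(1 for drug_date, disease_date in pairs
                     if drug_date < disease_date)
    return drug_first / len(pairs)


# Hypothetical first-mention dates for four patients (illustration only).
patients = [
    (date(2010, 1, 1), date(2010, 6, 1)),   # drug, then disease
    (date(2011, 3, 1), date(2011, 1, 1)),   # disease, then drug
    (date(2012, 1, 1), date(2012, 2, 1)),   # drug, then disease
    (date(2009, 5, 5), date(2009, 5, 5)),   # same day: not counted as drug-first
]
print(fraction_drug_first(patients))
```

    A fraction near 1 is more consistent with the disease being an adverse event, while a fraction near 0 is more consistent with an indication. Any threshold between the two would need to be chosen and validated against known pairs.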

  • Selected papers from the 14th Annual Bio-Ontologies Special Interest Group Meeting. Journal of biomedical semantics Soldatova, L. N., Sansone, S., Dumontier, M., Shah, N. H. 2012; 3: I1


    Over the past 14 years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in bio-ontologies development, its applications to biomedicine and more generally the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The seven papers selected for this supplement span a wide range of topics including: web-based querying over multiple ontologies, integration of data from wikis, innovative methods of annotating and mining electronic health records, advances in annotating web documents and biomedical literature, quality control of ontology alignments, and ontology support for predictive models about toxicity and open access to the toxicity data.

    View details for DOI 10.1186/2041-1480-3-S1-I1

    View details for PubMedID 22541591

  • Annotation Analysis for Testing Drug Safety Signals using Unstructured Clinical Notes. Journal of biomedical semantics LePendu, P., Iyer, S. V., Fairon, C., Shah, N. H. 2012; 3: S5


    The electronic surveillance for adverse drug events is largely based upon the analysis of coded data from reporting systems. Yet, the vast majority of electronic health data lies embedded within the free text of clinical notes and is not gathered into centralized repositories. With the increasing access to large volumes of electronic medical data, in particular the clinical notes, it may be possible to computationally encode and to test drug safety signals in an active manner. We describe the application of simple annotation tools on clinical text and the mining of the resulting annotations to compute the risk of getting a myocardial infarction for patients with rheumatoid arthritis that take Vioxx. Our analysis clearly reveals elevated risks for myocardial infarction in rheumatoid arthritis patients taking Vioxx (odds ratio 2.06) before 2005. Our results show that it is possible to apply annotation analysis methods for testing hypotheses about drug safety using electronic medical records.

    View details for DOI 10.1186/2041-1480-3-S1-S5

    View details for PubMedID 22541596

  • Analyzing patterns of drug use in clinical notes for patient safety. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science LePendu, P., Liu, Y., Iyer, S., Udell, M. R., Shah, N. H. 2012; 2012: 63-70


    Doctors prescribe drugs for indications that are not FDA approved. Research indicates that 21% of prescriptions filled are for off-label indications. Of those, more than 73% lack supporting scientific evidence. Traditional drug safety alerts may not cover usages that are not FDA approved. Therefore, analyzing patterns of off-label drug usage in the clinical setting is an important step toward reducing the incidence of adverse events and for improving patient safety. We applied term extraction tools on the clinical notes of a million patients to compile a database of statistically significant patterns of drug use. We validated some of the usage patterns learned from the data against sources of known on-label and off-label use. Given our ability to quantify adverse event risks using the clinical notes, this will enable us to address patient safety because we can now rank-order off-label drug use and prioritize the search for their adverse event profiles.

    View details for PubMedID 22779054

    View details for PubMedCentralID PMC3392046

  • Enabling enrichment analysis with the Human Disease Ontology. Journal of biomedical informatics LePendu, P., Musen, M. A., Shah, N. H. 2011; 44: S31-8


    Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of "significant genes." One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene set, and is widely used to make sense of the results of high-throughput experiments. Our goal is to develop and apply general enrichment analysis methods to profile other sets of interest, such as patient cohorts from the electronic medical record, using a variety of ontologies including SNOMED CT, MedDRA, RxNorm, and others. Although it is possible to perform enrichment analysis using ontologies other than the GO, a key pre-requisite is the availability of a background set of annotations to enable the enrichment calculation. In the case of the GO, this background set is provided by the Gene Ontology Annotations. In the current work, we describe: (i) a general method that uses hand-curated GO annotations as a starting point for creating background datasets for enrichment analysis using other ontologies; and (ii) a gene-disease background annotation set - that enables disease-based enrichment - to demonstrate feasibility of our method.

    View details for DOI 10.1016/j.jbi.2011.04.007

    View details for PubMedID 21550421

  • NCBO Resource Index: Ontology-based search and mining of biomedical resources JOURNAL OF WEB SEMANTICS Jonquet, C., LePendu, P., Falconer, S., Coulet, A., Noy, N. F., Musen, M. A., Shah, N. H. 2011; 9 (3): 316-324


    The volume of publicly available data in biomedicine is constantly increasing. However, these data are stored in different formats and on different platforms. Integrating these data will enable us to accelerate the pace of medical discoveries by providing scientists with a unified view of this diverse information. Under the auspices of the National Center for Biomedical Ontology (NCBO), we have developed the Resource Index, a growing, large-scale ontology-based index of more than twenty heterogeneous biomedical resources. The resources come from a variety of repositories maintained by organizations from around the world. We use a set of over 200 publicly available ontologies contributed by researchers in various domains to annotate the elements in these resources. We use the semantics that the ontologies encode, such as different properties of classes, the class hierarchies, and the mappings between ontologies, in order to improve the search experience for the Resource Index user. Our user interface enables scientists to search the multiple resources quickly and efficiently using domain terms, without even being aware that there is semantics "under the hood."

    View details for DOI 10.1016/j.websem.2011.06.005

    View details for Web of Science ID 000300169800007

  • BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications NUCLEIC ACIDS RESEARCH Whetzel, P. L., Noy, N. F., Shah, N. H., Alexander, P. R., Nyulas, C., Tudorache, T., Musen, M. A. 2011; 39: W541-W545


    The National Center for Biomedical Ontology (NCBO) is one of the National Centers for Biomedical Computing funded under the NIH Roadmap Initiative. Contributing to the national computing infrastructure, NCBO has developed BioPortal, a web portal that provides access to a library of biomedical ontologies and terminologies via the NCBO Web services. BioPortal enables community participation in the evaluation and evolution of ontology content by providing features to add mappings between terms, to add comments linked to specific ontology terms and to provide ontology reviews. The NCBO Web services enable this functionality and provide a uniform mechanism to access ontologies from a variety of knowledge representation formats, such as Web Ontology Language (OWL) and Open Biological and Biomedical Ontologies (OBO) format. The Web services provide multi-layered access to the ontology content, from getting all terms in an ontology to retrieving metadata about a term. Users can easily incorporate the NCBO Web services into software applications to generate semantically aware applications and to facilitate structured data collection.

    View details for DOI 10.1093/nar/gkr469

    View details for Web of Science ID 000292325300088

    View details for PubMedID 21672956

  • Computationally translating molecular discoveries into tools for medicine: translational bioinformatics articles now featured in JAMIA JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Butte, A. J., Shah, N. H. 2011; 18 (4): 352-353

    View details for DOI 10.1136/amiajnl-2011-000343

    View details for Web of Science ID 000292061700002

    View details for PubMedID 21672904

  • Integration and publication of heterogeneous text-mined relationships on the Semantic Web. Journal of biomedical semantics Coulet, A., Garten, Y., Dumontier, M., Altman, R. B., Musen, M. A., Shah, N. H. 2011; 2: S10


    Advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and the complexity of natural language in expressing similar relationships causes the extracted relationships to be highly heterogeneous, which makes the construction of knowledge bases difficult and poses a challenge in using these for data mining or question answering. We report on the semi-automatic construction of the PHARE relationship ontology (the PHArmacogenomic RElationships Ontology) consisting of 200 curated relations from over 40,000 heterogeneous relationships extracted via text-mining. These heterogeneous relations are then mapped to the PHARE ontology using synonyms, entity descriptions and hierarchies of entities and roles. Once mapped, relationships can be normalized and compared using the structure of the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast the manual procedure with a fully automated approach using WordNet to quantify the degree of integration enabled by iterative curation and refinement of the PHARE ontology. The result of such integration is a repository of normalized biomedical relationships, named PHARE-KB, which can be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network. The PHARE ontology serves as a common semantic framework to integrate more than 40,000 relationships pertinent to pharmacogenomics. The PHARE ontology forms the foundation of a knowledge base named PHARE-KB. Once populated with relationships, PHARE-KB (i) can be visualized in the form of a biological network to guide human tasks such as database curation and (ii) can be queried programmatically to guide bioinformatics applications such as the prediction of molecular interactions. PHARE is available at

    View details for DOI 10.1186/2041-1480-2-S2-S10

    View details for PubMedID 21624156

    View details for PubMedCentralID PMC3102890

  • Mapping between the OBO and OWL ontology languages. Journal of biomedical semantics Tirmizi, S. H., Aitken, S., Moreira, D. A., Mungall, C., Sequeda, J., Shah, N. H., Miranker, D. P. 2011; 2: S3


    Ontologies are commonly used in biomedicine to organize concepts to describe domains such as anatomies, environments, experiments, taxonomies, etc. NCBO BioPortal currently hosts about 180 different biomedical ontologies. These ontologies have been mainly expressed in either the Open Biomedical Ontology (OBO) format or the Web Ontology Language (OWL). OBO emerged from the Gene Ontology, and supports most of the biomedical ontology content. In comparison, OWL is a Semantic Web language, and is supported by the World Wide Web Consortium together with integral query languages, rule languages and distributed infrastructure for information interchange. These features are highly desirable for the OBO content as well. A convenient method for leveraging these features for OBO ontologies is by transforming OBO ontologies to OWL. We have developed a methodology for translating OBO ontologies to OWL using the organization of the Semantic Web itself to guide the work. The approach reveals that the constructs of OBO can be grouped together to form a similar layer cake. Thus we were able to decompose the problem into two parts. Most OBO constructs have easy and obvious equivalence to a construct in OWL. A small subset of OBO constructs requires deeper consideration. We have defined transformations for all constructs in an effort to foster a standard common mapping between OBO and OWL. Our mapping produces OWL-DL, a Description Logics based subset of OWL with desirable computational properties for efficiency and correctness. Our Java implementation of the mapping is part of the official Gene Ontology project source. Our transformation system provides a lossless roundtrip mapping for OBO ontologies, i.e. an OBO ontology may be translated to OWL and back without loss of knowledge. In addition, it provides a roadmap for bridging the gap between the two ontology languages in order to enable the use of ontology content in a language independent manner.

    View details for DOI 10.1186/2041-1480-2-S1-S3

    View details for PubMedID 21388572

  • Selected papers from the 13th Annual Bio-Ontologies Special Interest Group Meeting. Journal of biomedical semantics Soldatova, L. N., Sansone, S., Stephens, S. M., Shah, N. H. 2011; 2: I1


    Over the years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in the application of ontologies and more generally the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The ten papers selected for this supplement are extended versions of the original papers presented at the 2010 SIG. The papers span a wide range of topics including practical solutions for data and knowledge integration for translational medicine, hypothesis-based querying, understanding kidney and urinary pathways, mining the pharmacogenomics literature; theoretical research into the orthogonality of biomedical ontologies, the representation of diseases, the representation of research hypotheses, the combination of ontologies and natural language processing for an annotation framework, the generation of textual definitions, and the discovery of gene interaction networks.

    View details for DOI 10.1186/2041-1480-2-S2-I1

    View details for PubMedID 21624154

  • HyQue: evaluating hypotheses using Semantic Web technologies. Journal of biomedical semantics Callahan, A., Dumontier, M., Shah, N. H. 2011; 2: S3-?


    Key to the success of e-Science is the ability to computationally evaluate expert-composed hypotheses for validity against experimental data. Researchers face the challenge of collecting, evaluating and integrating large amounts of diverse information to compose and evaluate a hypothesis. Confronted with rapidly accumulating data, researchers currently do not have the software tools to undertake the required information integration tasks. We present HyQue, a Semantic Web tool for querying scientific knowledge bases with the purpose of evaluating user submitted hypotheses. HyQue features a knowledge model to accommodate diverse hypotheses structured as events and represented using Semantic Web languages (RDF/OWL). Hypothesis validity is evaluated against experimental and literature-sourced evidence through a combination of SPARQL queries and evaluation rules. Inference over OWL ontologies (for type specifications, subclass assertions and parthood relations) and retrieval of facts stored as Bio2RDF linked data provide support for a given hypothesis. We evaluate hypotheses of varying levels of detail about the genetic network controlling galactose metabolism in Saccharomyces cerevisiae to demonstrate the feasibility of deploying such semantic computing tools over a growing body of structured knowledge in Bio2RDF. HyQue is a query-based hypothesis evaluation system that can currently evaluate hypotheses about the galactose metabolism in S. cerevisiae. Hypotheses as well as the supporting or refuting data are represented in RDF and directly linked to one another allowing scientists to browse from data to hypothesis and vice versa. HyQue hypotheses and data are available at

    View details for DOI 10.1186/2041-1480-2-S2-S3

    View details for PubMedID 21624158

  • Using text to build semantic networks for pharmacogenomics JOURNAL OF BIOMEDICAL INFORMATICS Coulet, A., Shah, N. H., Garten, Y., Musen, M., Altman, R. B. 2010; 43 (6): 1009-1019


    Most pharmacogenomics knowledge is contained in the text of published studies, and is thus not available for automated computation. Natural Language Processing (NLP) techniques for extracting relationships in specific domains often rely on hand-built rules and domain-specific ontologies to achieve good performance. In a new and evolving field such as pharmacogenomics (PGx), rules and ontologies may not be available. Recent progress in syntactic NLP parsing in the context of a large corpus of pharmacogenomics text provides new opportunities for automated relationship extraction. We describe an ontology of PGx relationships built starting from a lexicon of key pharmacogenomic entities and a syntactic parse of more than 87 million sentences from 17 million MEDLINE abstracts. We used the syntactic structure of PGx statements to systematically extract commonly occurring relationships and to map them to a common schema. Our extracted relationships have a 70-87.7% precision and involve not only key PGx entities such as genes, drugs, and phenotypes (e.g., VKORC1, warfarin, clotting disorder), but also critical entities that are frequently modified by these key entities (e.g., VKORC1 polymorphism, warfarin response, clotting disorder treatment). The result of our analysis is a network of 40,000 relationships between more than 200 entity types with clear semantics. This network is used to guide the curation of PGx knowledge and provide a computable resource for knowledge discovery.

    View details for DOI 10.1016/j.jbi.2010.08.005

    View details for Web of Science ID 000285036700017

    View details for PubMedID 20723615

    View details for PubMedCentralID PMC2991587

  • The BioPAX community standard for pathway data sharing NATURE BIOTECHNOLOGY Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., Wu, G., D'Eustachio, P., Schaefer, C., Luciano, J., Schacherer, F., Martinez-Flores, I., Hu, Z., Jimenez-Jacinto, V., Joshi-Tope, G., Kandasamy, K., Lopez-Fuentes, A. C., Mi, H., Pichler, E., Rodchenkov, I., Splendiani, A., Tkachev, S., Zucker, J., Gopinath, G., Rajasimha, H., Ramakrishnan, R., Shah, I., Syed, M., Anwar, N., Babur, O., Blinov, M., Brauner, E., Corwin, D., Donaldson, S., Gibbons, F., Goldberg, R., Hornbeck, P., Luna, A., Murray-Rust, P., Neumann, E., Reubenacker, O., Samwald, M., van Iersel, M., Wimalaratne, S., Allen, K., Braun, B., Whirl-Carrillo, M., Cheung, K., Dahlquist, K., Finney, A., Gillespie, M., Glass, E., Gong, L., Haw, R., Honig, M., Hubaut, O., Kane, D., Krupa, S., Kutmon, M., Leonard, J., Marks, D., Merberg, D., Petri, V., Pico, A., Ravenscroft, D., Ren, L., Shah, N., Sunshine, M., Tang, R., Whaley, R., Letovksy, S., Buetow, K. H., Rzhetsky, A., Schachter, V., Sobral, B. S., Dogrusoz, U., McWeeney, S., Aladjem, M., Birney, E., Collado-Vides, J., Goto, S., Hucka, M., Le Novere, N., Maltsev, N., Pandey, A., Thomas, P., Wingender, E., Karp, P. D., Sander, C., Bader, G. D. 2010; 28 (9): 935-942


    Biological Pathway Exchange (BioPAX) is a standard language to represent biological pathways at the molecular and cellular level and to facilitate the exchange of pathway data. The rapid growth of the volume of pathway data has spurred the development of databases and computational tools to aid interpretation; however, use of these data is hampered by the current fragmentation of pathway information across many databases with incompatible formats. BioPAX, which was created through a community process, solves this problem by making pathway data substantially easier to collect, index, interpret and share. BioPAX can represent metabolic and signaling pathways, molecular and genetic interactions and gene regulation networks. Using BioPAX, millions of interactions, organized into thousands of pathways, from many organisms are available from a growing number of databases. This large amount of pathway data in a computable form will support visualization, analysis and biological discovery.

    View details for DOI 10.1038/nbt.1666

    View details for Web of Science ID 000281719100019

    View details for PubMedID 20829833

  • A UIMA wrapper for the NCBO annotator BIOINFORMATICS Roeder, C., Jonquet, C., Shah, N. H., Baumgartner, W. A., Verspoor, K., Hunter, L. 2010; 26 (14): 1800-1801


    The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator, an ontology-based annotation service, to make it available as a component in UIMA workflows. This wrapper is freely available on the web as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows.

    View details for DOI 10.1093/bioinformatics/btq250

    View details for Web of Science ID 000279474400025

    View details for PubMedID 20505005

  • In Silico Functional Profiling of Human Disease-Associated and Polymorphic Amino Acid Substitutions HUMAN MUTATION Mort, M., Evani, U. S., Krishnan, V. G., Kamati, K. K., Baenziger, P. H., Bagchi, A., Peters, B. J., Sathyesh, R., Li, B., Sun, Y., Xue, B., Shah, N. H., Kann, M. G., Cooper, D. N., Radivojac, P., Mooney, S. D. 2010; 31 (3): 335-346


    An important challenge in translational bioinformatics is to understand how genetic variation gives rise to molecular changes at the protein level that can precipitate both monogenic and complex disease. To this end, we compiled datasets of human disease-associated amino acid substitutions (AAS) in the contexts of inherited monogenic disease, complex disease, functional polymorphisms with no known disease association, and somatic mutations in cancer, and compared them with respect to predicted functional sites in proteins. Using the sequence homology-based tool SIFT to estimate the proportion of deleterious AAS in each dataset, only complex disease AAS were found to be indistinguishable from neutral polymorphic AAS. Monogenic disease AAS predicted to be nondeleterious by SIFT were characterized by a significant enrichment for inherited AAS within solvent accessible residues, regions of intrinsic protein disorder, and an association with the loss or gain of various posttranslational modifications. Sites of structural and/or functional interest were therefore surmised to constitute useful additional features with which to identify the molecular disruptions caused by deleterious AAS. A range of bioinformatic tools, designed to predict structural and functional sites in protein sequences, were then employed to demonstrate that intrinsic biases exist in terms of the distribution of different types of human AAS with respect to specific structural, functional and pathological features. Our Web tool, designed to potentiate the functional profiling of novel AAS, has been made available at

    View details for DOI 10.1002/humu.21192

    View details for Web of Science ID 000275419900014

    View details for PubMedID 20052762

  • Selected papers from the 12th annual Bio-Ontologies meeting. Journal of biomedical semantics Soldatova, L. N., Lord, P., Sansone, S., Stephens, S. M., Shah, N. H. 2010; 1: I1-?

    View details for DOI 10.1186/2041-1480-1-S1-I1

    View details for PubMedID 20626920

  • Optimize First, Buy Later: Analyzing Metrics to Ramp-Up Very Large Knowledge Bases 9th International Semantic Web Conference LePendu, P., Noy, N. F., Jonquet, C., Alexander, P. R., Shah, N. H., Musen, M. A. SPRINGER-VERLAG BERLIN. 2010: 486–501
  • The Lexicon Builder Web service: Building Custom Lexicons from two hundred Biomedical Ontologies. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Parai, G. K., Jonquet, C., Xu, R., Musen, M. A., Shah, N. H. 2010; 2010: 587-591


    Domain specific biomedical lexicons are extensively used by researchers for natural language processing tasks. Currently these lexicons are created manually by expert curators and there is a pressing need for automated methods to compile such lexicons. The Lexicon Builder Web service addresses this need and reduces the investment of time and effort involved in lexicon maintenance. The service has three components: Inclusion - selects one or several ontologies (or their branches) and includes preferred names and synonym terms; Exclusion - filters terms based on the term's Medline frequency, syntactic type, UMLS semantic type and match with stopwords; Output - aggregates information, handles compression and output formats. Evaluation demonstrates that the service has high accuracy and runtime performance. It is currently being evaluated for several use cases to establish its utility in biomedical information processing tasks. The Lexicon Builder promotes collaboration, sharing and standardization of lexicons amongst researchers by automating the creation, maintenance and cross referencing of custom lexicons.

    View details for PubMedID 21347046

  • Building a biomedical ontology recommender web service. Journal of biomedical semantics Jonquet, C., Musen, M. A., Shah, N. H. 2010; 1: S1-?


    Researchers in biomedical informatics use ontologies and terminologies to annotate their data in order to facilitate data integration and translational discoveries. As the use of ontologies for annotation of biomedical datasets has risen, a common challenge is to identify ontologies that are best suited to annotating specific datasets. The number and variety of biomedical ontologies is large, and it is cumbersome for a researcher to figure out which ontology to use. We present the Biomedical Ontology Recommender web service. The system uses textual metadata or a set of keywords describing a domain of interest and suggests appropriate ontologies for annotating or representing the data. The service makes a decision based on three criteria. The first one is coverage, or the ontologies that provide most terms covering the input text. The second is connectivity, or the ontologies that are most often mapped to by other ontologies. The final criterion is size, or the number of concepts in the ontologies. The service scores the ontologies as a function of scores of the annotations created using the National Center for Biomedical Ontology (NCBO) Annotator web service. We used all the ontologies from the UMLS Metathesaurus and the NCBO BioPortal. We compare and contrast our Recommender by an exhaustive functional comparison to previously published efforts. We evaluate and discuss the results of several recommendation heuristics in the context of three real world use cases. The best recommendation heuristics, rated 'very relevant' by expert evaluators, are the ones based on coverage and connectivity criteria. The Recommender service (alpha version) is available to the community and is embedded into BioPortal.

    View details for DOI 10.1186/2041-1480-1-S1-S1

    View details for PubMedID 20626921
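The coverage criterion described in the abstract above can be illustrated with a minimal sketch. This is not the Recommender's actual scoring, which the abstract says is derived from NCBO Annotator annotation scores. The toy ontologies, the single-word matching, and the tie-break on ontology size are all assumptions made for illustration.

```python
def coverage_scores(text, ontologies):
    """Count how many distinct input words each ontology covers.

    `ontologies` maps a (hypothetical) ontology name to a set of
    single-word terms; real ontologies have multi-word terms and synonyms.
    """
    tokens = set(text.lower().split())
    return {
        name: len(tokens & {term.lower() for term in terms})
        for name, terms in ontologies.items()
    }


def recommend(text, ontologies):
    """Rank ontology names by coverage, highest first.

    Ties are broken by smaller ontology size, then by name. The tie-break
    is an assumption of this sketch, not a rule taken from the paper.
    """
    scores = coverage_scores(text, ontologies)
    return sorted(scores, key=lambda n: (-scores[n], len(ontologies[n]), n))


# Toy data: hypothetical ontology names and terms.
toy_ontologies = {
    "skin_ontology": {"melanoma", "skin", "nevus"},
    "heart_ontology": {"heart", "valve"},
}
ranking = recommend("melanoma of the skin", toy_ontologies)
```

A connectivity criterion could be added in the same style by counting, for each ontology, how many mappings from other ontologies point to it.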

  • An ontology-neutral framework for enrichment analysis. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Tirrell, R., Evani, U., Berman, A. E., Mooney, S. D., Musen, M. A., Shah, N. H. 2010; 2010: 797-801


    Advanced statistical methods used to analyze high-throughput data (e.g. gene-expression assays) result in long lists of "significant genes." One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene-set, and is relevant for and extensible to data analysis with other high-throughput measurement modalities such as proteomics, metabolomics, and tissue-microarray assays. With the availability of tools for automatic ontology-based annotation of datasets with terms from biomedical ontologies besides the GO, we need not restrict enrichment analysis to the GO. We describe RANSUM (Rich Annotation Summarizer), which performs enrichment analysis using any ontology in the National Center for Biomedical Ontology's (NCBO) BioPortal. We outline the methodology of enrichment analysis, the associated challenges, and discuss novel analyses enabled by RANSUM.

    View details for PubMedID 21347088
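Enrichment analysis as outlined in the abstract above is commonly carried out with a hypergeometric test. The following is a minimal sketch of that general technique, assuming a one-sided test for over-representation only. It is not RANSUM's implementation, and the term and gene names are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric tail P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes annotated with the term
    sample:     number of significant genes
    hits:       significant genes annotated with the term
    """
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


def rank_terms(annotations, significant, background):
    """Return (term, hits, p_value) tuples sorted by ascending p-value."""
    background = set(background)
    significant = set(significant) & background
    results = []
    for term, genes in annotations.items():
        genes = set(genes) & background
        hits = len(genes & significant)
        p = enrichment_p_value(len(background), len(genes), len(significant), hits)
        results.append((term, hits, p))
    return sorted(results, key=lambda r: r[2])


# Toy data: ten hypothetical genes, two hypothetical ontology terms.
background = {f"g{i}" for i in range(1, 11)}
annotations = {
    "term_a": {"g1", "g2", "g3", "g4", "g5"},
    "term_b": {"g6", "g7"},
}
ranked = rank_terms(annotations, {"g1", "g2", "g3", "g4", "g5"}, background)
```

In practice the p-values would also need correction for multiple testing, since one test is run per ontology term.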

  • Extraction of genotype-phenotype-drug relationships from text: from entity recognition to bioinformatics application. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Coulet, A., Shah, N., Hunter, L., Barral, C., Altman, R. B. 2010: 485-487


    Advances in concept recognition and natural language parsing have led to the development of various tools that enable the identification of biomedical entities and relationships between them in text. The aim of the Genotype-Phenotype-Drug Relationship Extraction from Text workshop (or GPD-Rx workshop) is to examine the current state of the art and discuss the next steps for making the extraction of relationships between biomedical entities integral to the curation and knowledge management workflow in Pharmacogenomics. The workshop will focus particularly on the extraction of Genotype-Phenotype, Genotype-Drug, and Phenotype-Drug relationships that are of interest to Pharmacogenomics. Extracting and structuring such text-mined relationships is key to supporting the evaluation and the validation of multiple hypotheses that emerge from high throughput translational studies spanning multiple measurement modalities. In order to advance this agenda, it is essential that existing relationship extraction methods be compared to one another and that a community-wide benchmark corpus emerges against which future methods can be compared. The workshop aims to bring together researchers working on the automatic or semi-automatic extraction of relationships between biomedical entities from research literature in order to identify the key groups interested in creating such a benchmark.

    View details for PubMedID 19904832

    View details for PubMedCentralID PMC3501138

  • A Comprehensive Analysis of Five Million UMLS Metathesaurus Terms Using Eighteen Million MEDLINE Citations. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Xu, R., Musen, M. A., Shah, N. H. 2010; 2010: 907-911


    The Unified Medical Language System (UMLS) Metathesaurus is widely used for biomedical natural language processing (NLP) tasks. In this study, we systematically analyzed UMLS Metathesaurus terms by analyzing their occurrences in over 18 million MEDLINE abstracts. Our goals were: 1. analyze the frequency and syntactic distribution of Metathesaurus terms in MEDLINE; 2. create a filtered UMLS Metathesaurus based on the MEDLINE analysis; 3. augment the UMLS Metathesaurus where each term is associated with metadata on its MEDLINE frequency and syntactic distribution statistics. After MEDLINE frequency-based filtering, the augmented UMLS Metathesaurus contains 518,835 terms and is roughly 13% of its original size. We have shown that the syntactic and frequency information is useful to identify errors in the Metathesaurus. This filtered and augmented UMLS Metathesaurus can potentially be used to improve efficiency and precision of UMLS-based information retrieval and NLP tasks.

    View details for PubMedID 21347110

  • BioPortal: ontologies and integrated data resources at the click of a mouse NUCLEIC ACIDS RESEARCH Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D. L., Storey, M., Chute, C. G., Musen, M. A. 2009; 37: W170-W173


    Biomedical ontologies provide essential domain knowledge to drive data integration, information retrieval, data annotation, natural-language processing and decision support. BioPortal is an open repository of biomedical ontologies that provides access via Web services and Web browsers to ontologies developed in OWL, RDF, OBO format and Protégé frames. BioPortal functionality includes the ability to browse, search and visualize ontologies. The Web interface also facilitates community-based participation in the evaluation and evolution of ontology content by providing features to add notes to ontology terms, mappings between terms and ontology reviews based on criteria such as usability, domain coverage, quality of content, and documentation and support. BioPortal also enables integrated search of biomedical data resources such as the Gene Expression Omnibus (GEO) and ArrayExpress, through the annotation and indexing of these resources with ontologies in BioPortal. Thus, BioPortal not only provides investigators, clinicians, and developers 'one-stop shopping' to programmatically access biomedical ontologies, but also provides support to integrate data from a variety of biomedical resources.

    View details for DOI 10.1093/nar/gkp440

    View details for Web of Science ID 000267889100031

    View details for PubMedID 19483092

    View details for PubMedCentralID PMC2703982

  • Ontology-driven indexing of public datasets for translational bioinformatics 1st Summit on Translational Bioinformatics Shah, N. H., Jonquet, C., Chiang, A. P., Butte, A. J., Chen, R., Musen, M. A. BIOMED CENTRAL LTD. 2009


    The volume of publicly available genomic scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. These annotations are not mapped to concepts in any ontology, making it difficult to integrate these datasets across repositories. We have previously developed methods to map text-annotations of tissue microarrays to concepts in the NCI thesaurus and SNOMED-CT. In this work we generalize our methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate the utility of our methods by processing annotations of datasets in the Gene Expression Omnibus. We demonstrate that we enable ontology-based querying and integration of tissue and gene expression microarray data. We enable identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data. Based on this work we have built a prototype system for ontology based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts to annotate and index them with concepts from appropriate ontologies. The key functionality of this system is to enable users to locate biomedical data resources related to particular ontology concepts.

    View details for DOI 10.1186/1471-2105-10-S2-S1

    View details for Web of Science ID 000265602500002

    View details for PubMedID 19208184

    View details for PubMedCentralID PMC2646250

  • The open biomedical annotator. Summit on translational bioinformatics Jonquet, C., Shah, N. H., Musen, M. A. 2009; 2009: 56-60


    The range of publicly available biomedical data is enormous and is expanding fast. This expansion means that researchers now face a hurdle to extracting the data they need from the large amounts of data that are available. Biomedical researchers have turned to ontologies and terminologies to structure and annotate their data with ontology concepts for better search and retrieval. However, this annotation process cannot be easily automated and often requires expert curators. In addition, there is a lack of easy-to-use systems that facilitate the use of ontologies for annotation. This paper presents the Open Biomedical Annotator (OBA), an ontology-based Web service that annotates public datasets with biomedical ontology concepts based on their textual metadata. The biomedical community can use the annotator service to tag datasets automatically with ontology terms (from UMLS and NCBO BioPortal ontologies). Such annotations facilitate translational discoveries by integrating annotated data.

    View details for PubMedID 21347171

  • What Four Million Mappings Can Tell You about Two Hundred Ontologies 8th International Semantic Web Conference Ghazvinian, A., Noy, N. F., Jonquet, C., Shah, N., Musen, M. A. SPRINGER-VERLAG BERLIN. 2009: 229–242
  • BioPortal: ontologies and data resources with the click of a mouse. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Musen, M. A., Shah, N. H., Noy, N. F., Dai, B. Y., Dorf, M., Griffith, N., Buntrok, J., Jonquet, C., Montegut, M. J., Rubin, D. L. 2008: 1223-1224

    View details for PubMedID 18999306

  • A system for ontology-based annotation of biomedical data 5th International Workshop on Data Integration in the Life Sciences Jonquet, C., Musen, M. A., Shah, N. SPRINGER-VERLAG BERLIN. 2008: 144–152
  • Comparison of ontology-based semantic-similarity measures. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Lee, W., Shah, N., Sundlass, K., Musen, M. 2008: 384-388


    Semantic-similarity measures quantify concept similarities in a given ontology. Potential applications for these measures include search, data mining, and knowledge discovery in database or decision-support systems that utilize ontologies. To date, there have not been comparisons of the different semantic-similarity approaches on a single ontology. Such a comparison can offer insight on the validity of different approaches. We compared 3 approaches to semantic-similarity metrics (which rely on expert opinion, ontologies only, and information content) with 4 metrics applied to SNOMED-CT. We found that there was poor agreement between the metrics based on information content and the ontology-only metric. The metric based only on the ontology structure correlated most with expert opinion. Our results suggest that metrics based on the ontology only may be preferable to information-content-based metrics, and point to the need for more research on validating the different approaches.

    View details for PubMedID 18999312
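An ontology-only similarity metric of the kind compared in the abstract above can be sketched as a path-length measure over is-a links. The formula `1 / (1 + path length)` and the toy hierarchy below are illustrative assumptions. They are not necessarily the metric used in the paper, and the concept names are not SNOMED-CT identifiers.

```python
from collections import deque

# Toy is-a hierarchy: child concept -> list of parent concepts (hypothetical).
PARENTS = {
    "disease": [],
    "heart disease": ["disease"],
    "lung disease": ["disease"],
    "myocardial infarction": ["heart disease"],
    "pneumonia": ["lung disease"],
}


def path_length(a, b, parents=PARENTS):
    """Shortest number of is-a edges between two concepts, or None."""
    # Treat is-a links as undirected edges for the purpose of path finding.
    adjacency = {}
    for child, child_parents in parents.items():
        adjacency.setdefault(child, set())
        for parent in child_parents:
            adjacency[child].add(parent)
            adjacency.setdefault(parent, set()).add(child)
    if a not in adjacency or b not in adjacency:
        return None
    seen = {a}
    queue = deque([(a, 0)])
    while queue:  # breadth-first search
        node, distance = queue.popleft()
        if node == b:
            return distance
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


def path_similarity(a, b, parents=PARENTS):
    """Similarity in (0, 1]; 0.0 when the concepts are not connected."""
    distance = path_length(a, b, parents)
    return 0.0 if distance is None else 1.0 / (1.0 + distance)
```

Information-content-based metrics differ in that they also need term frequencies from a corpus, which is why the two families can disagree on the same ontology.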

  • UMLS-Query: a perl module for querying the UMLS. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Shah, N. H., Musen, M. A. 2008: 652-656


    The Metathesaurus from the Unified Medical Language System (UMLS) is a widely used ontology resource, which is mostly used in a relational database form for terminology research, mapping and information indexing. A significant proportion of UMLS users use a MySQL installation of the Metathesaurus and the Perl programming language as their access mechanism. We describe UMLS-Query, a Perl module that provides functions for retrieving concept identifiers, mapping text-phrases to Metathesaurus concepts and graph traversal in the Metathesaurus stored in a MySQL database. UMLS-Query can be used to build applications for semi-automated sample annotation, terminology based browsers for tissue sample databases and for terminology research. We describe the results of such uses of UMLS-Query and present the module for others to use.

    View details for PubMedID 18998805

  • The Stanford Tissue Microarray Database NUCLEIC ACIDS RESEARCH Marinelli, R. J., Montgomery, K., Liu, C. L., Shah, N. H., Prapong, W., Nitzberg, M., Zachariah, Z. K., Sherlock, G. J., Natkunam, Y., West, R. B., van de Rijn, M., Brown, P. O., Ball, C. A. 2008; 36: D871-D877


    The Stanford Tissue Microarray Database (TMAD) is a public resource for disseminating annotated tissue images and associated expression data. Stanford University pathologists, researchers and their collaborators worldwide use TMAD for designing, viewing, scoring and analyzing their tissue microarrays. The use of tissue microarrays allows hundreds of human tissue cores to be simultaneously probed by antibodies to detect protein abundance (Immunohistochemistry; IHC), or by labeled nucleic acids (in situ hybridization; ISH) to detect transcript abundance. TMAD archives multi-wavelength fluorescence and bright-field images of tissue microarrays for scoring and analysis. As of July 2007, TMAD contained 205 161 images archiving 349 distinct probes on 1488 tissue microarray slides. Of these, 31 306 images for 68 probes on 125 slides have been released to the public. To date, 12 publications have been based on these raw public data. TMAD incorporates the NCI Thesaurus ontology for searching tissues in the cancer domain. Image processing researchers can extract images and scores for training and testing classification algorithms. The production server uses the Apache HTTP Server, Oracle Database and Perl application code. Source code is available to interested researchers under a no-cost license.

    View details for DOI 10.1093/nar/gkm861

    View details for Web of Science ID 000252545400154

    View details for PubMedID 17989087

    View details for PubMedCentralID PMC2238948

  • Biomedical ontologies: a functional perspective BRIEFINGS IN BIOINFORMATICS Rubin, D. L., Shah, N. H., Noy, N. F. 2008; 9 (1): 75-90


    The information explosion in biology makes it difficult for researchers to stay abreast of current biomedical knowledge and to make sense of the massive amounts of online information. Ontologies--specifications of the entities, their attributes and relationships among the entities in a domain of discourse--are increasingly enabling biomedical researchers to accomplish these tasks. In fact, bio-ontologies are beginning to proliferate in step with accruing biological data. The myriad of ontologies being created not only enables researchers to solve some of the problems in handling the data explosion but also introduces new challenges. One of the key difficulties in realizing the full potential of ontologies in biomedical research is the isolation of the various communities involved: some workers spend their career developing ontologies and ontology-related tools, while few researchers (biologists and physicians) know how ontologies can accelerate their research. The objective of this review is to give an overview of biomedical ontology in practical terms by providing a functional perspective--describing how bio-ontologies can be and are being used. As biomedical scientists begin to recognize the many different ways ontologies enable biomedical research, they will drive the emergence of new computer applications that will help them exploit the wealth of research data now at their fingertips.

    View details for DOI 10.1093/bib/bbm059

    View details for Web of Science ID 000251864600008

    View details for PubMedID 18077472

  • The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration NATURE BIOTECHNOLOGY Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S., Scheuermann, R. H., Shah, N., Whetzel, P. L., Lewis, S. 2007; 25 (11): 1251-1255


    The value of any kind of data is greatly enhanced when it exists in a form that allows it to be integrated with other data. One approach to integration is through the annotation of multiple bodies of data using common controlled vocabularies or 'ontologies'. Unfortunately, the very success of this approach has led to a proliferation of ontologies, which itself creates obstacles to integration. The Open Biomedical Ontologies (OBO) consortium is pursuing a strategy to overcome this problem. Existing OBO ontologies, including the Gene Ontology, are undergoing coordinated reform, and new ontologies are being created on the basis of an evolving set of shared principles governing ontology development. The result is an expanding family of ontologies designed to be interoperable and logically well formed and to incorporate accurate representations of biological reality. We describe this OBO Foundry initiative and provide guidelines for those who might wish to become involved.

    View details for DOI 10.1038/nbt1346

    View details for Web of Science ID 000251086500025

    View details for PubMedID 17989687

  • Current progress in network research: toward reference networks for key model organisms BRIEFINGS IN BIOINFORMATICS Srinivasan, B. S., Shah, N. H., Flannick, J. A., Abeliuk, E., Novak, A. F., Batzoglou, S. 2007; 8 (5): 318-332


    The collection of multiple genome-scale datasets is now routine, and the frontier of research in systems biology has shifted accordingly. Rather than clustering a single dataset to produce a static map of functional modules, the focus today is on data integration, network alignment, interactive visualization and ontological markup. Because of the intrinsic noisiness of high-throughput measurements, statistical methods have been central to this effort. In this review, we briefly survey available datasets in functional genomics, review methods for data integration and network alignment, and describe recent work on using network models to guide experimental validation. We explain how the integration and validation steps spring from a Bayesian description of network uncertainty, and conclude by describing an important near-term milestone for systems biology: the construction of a set of rich reference networks for key model organisms.

    View details for DOI 10.1093/bib/bbm038

    View details for Web of Science ID 000251034700005

    View details for PubMedID 17728341

  • Annotation and query of tissue microarray data using the NCI Thesaurus BMC BIOINFORMATICS Shah, N. H., Rubin, D. L., Espinosa, I., Montgomery, K., Musen, M. A. 2007; 8


    The Stanford Tissue Microarray Database (TMAD) is a repository of data serving a consortium of pathologists and biomedical researchers. The tissue samples in TMAD are annotated with multiple free-text fields, specifying the pathological diagnoses for each sample. These text annotations are not structured according to any ontology, making future integration of this resource with other biological and clinical data difficult. We developed methods to map these annotations to the NCI thesaurus. Using the NCI-T we can effectively represent annotations for about 86% of the samples. We demonstrate how this mapping enables ontology driven integration and querying of tissue microarray data. We have deployed the mapping and ontology driven querying tools at the TMAD site for general use. We have demonstrated that we can effectively map the diagnosis-related terms describing a sample in TMAD to the NCI-T. The NCI thesaurus terms have a wide coverage and provide terms for about 86% of the samples. In our opinion the NCI thesaurus can facilitate integration of this resource with other biological data.

    View details for DOI 10.1186/1471-2105-8-296

    View details for Web of Science ID 000249734300001

    View details for PubMedID 17686183

    View details for PubMedCentralID PMC1988837

  • Interpretation errors related to the GO annotation file format. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Moreira, D. A., Shah, N. H., Musen, M. A. 2007: 538-542


    The Gene Ontology (GO) is the most widely used ontology for creating biomedical annotations. GO annotations are statements associating a biological entity with a GO term. These statements comprise a large dataset of biological knowledge that is used widely in biomedical research. GO annotations are available as "gene association files" from the GO website in a tab-delimited file format (GO Annotation File Format) composed of rows of 15 tab-delimited fields. This simple format lacks the knowledge representation (KR) capabilities to represent the semantic relationships between fields unambiguously. This paper demonstrates that this KR shortcoming leads users to interpret the files in ways that can be erroneous. We propose a complementary format to represent GO annotation files as knowledge bases using the W3C recommended Web Ontology Language (OWL).

    View details for PubMedID 18693894
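
    The 15-field row layout described in the abstract above can be illustrated with a minimal parsing sketch. The column names follow the GAF 1.0 ordering; the helper names `parse_gaf_line` and `is_negated` and the sample row are hypothetical and not taken from the paper. The `NOT` qualifier is shown only as one example of a field that changes how the rest of a row should be read; the abstract does not say which interpretation errors the paper examines.

```python
from typing import Dict

# Column names for the 15-field GO Annotation File layout (GAF 1.0 ordering).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def parse_gaf_line(line: str) -> Dict[str, str]:
    """Split one tab-delimited annotation row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(
            f"expected {len(GAF_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(GAF_COLUMNS, fields))


def is_negated(record: Dict[str, str]) -> bool:
    """True when the qualifier column contains NOT (values are pipe-separated)."""
    return "NOT" in record["qualifier"].split("|")


# Invented sample row, for illustration only (not a real annotation).
SAMPLE_ROW = "\t".join([
    "TAIR", "locus:0000001", "GENE1", "NOT", "GO:0005634",
    "PMID:0000000", "IDA", "", "C", "hypothetical protein",
    "", "protein", "taxon:3702", "20070101", "TAIR",
])

record = parse_gaf_line(SAMPLE_ROW)
```

    A reader who looks only at `db_object_symbol` and `go_id` would conclude that GENE1 is annotated to GO:0005634, while `is_negated(record)` shows that the row asserts the opposite. A flat row leaves that relationship between fields implicit.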

  • Using annotations from controlled vocabularies to find meaningful associations 4th International Workshop on Data Integration in the Life Sciences Lee, W., Raschid, L., Srinivasan, P., Shah, N., Rubin, D., Noy, N. SPRINGER-VERLAG BERLIN. 2007: 247–263
  • Searching Ontologies Based on Content: Experiments in the Biomedical Domain 4th International Conference on Knowledge Capture Alani, H., Noy, N. F., Shah, N., Shadbolt, N., Musen, M. A. ASSOC COMPUTING MACHINERY. 2007: 55–62
  • A case study in pathway knowledgebase verification BMC BIOINFORMATICS Racunas, S. A., Shah, N. H., Fedoroff, N. V. 2006; 7


    Biological databases and pathway knowledge-bases are proliferating rapidly. We are developing software tools for computer-aided hypothesis design and evaluation, and we would like our tools to take advantage of the information stored in these repositories. But before we can reliably use a pathway knowledge-base as a data source, we need to proofread it to ensure that it can fully support computer-aided information integration and inference. We design a series of logical tests to detect potential problems we might encounter using a particular knowledge-base, the Reactome database, with a particular computer-aided hypothesis evaluation tool, HyBrow. We develop an explicit formal language from the language implicit in the Reactome data format and specify a logic to evaluate models expressed using this language. We use the formalism of finite model theory in this work. We then use this logic to formulate tests for desirable properties (such as completeness, consistency, and well-formedness) for pathways stored in Reactome. We apply these tests to the publicly available Reactome releases (releases 10 through 14) and compare the results, which highlight Reactome's steady improvement in terms of decreasing inconsistencies. We also investigate and discuss Reactome's potential for supporting computer-aided inference tools. The case study described in this work demonstrates that it is possible to use our model theory based approach to identify problems one might encounter using a knowledge-base to support hypothesis evaluation tools. The methodology we use is general and is in no way restricted to the specific knowledge-base employed in this case study. Future application of this methodology will enable us to compare pathway resources with respect to the generic properties such resources will need to possess if they are to support automated reasoning.

    View details for DOI 10.1186/1471-2105-7-196

    View details for Web of Science ID 000239302400001

    View details for PubMedID 16603083

  • Ontology-based annotation and query of tissue microarray data. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Shah, N. H., Rubin, D. L., Supekar, K. S., Musen, M. A. 2006: 709-713


    The Stanford Tissue Microarray Database (TMAD) is a repository of data amassed by a consortium of pathologists and biomedical researchers. The TMAD data are annotated with multiple free-text fields, specifying the pathological diagnoses for each tissue sample. These annotations are spread out over multiple text fields and are not structured according to any ontology, making it difficult to integrate this resource with other biological and clinical data. We developed methods to map these annotations to the NCI thesaurus and the SNOMED-CT ontologies. Using these two ontologies we can effectively represent about 80% of the annotations in a structured manner. This mapping offers the ability to perform ontology driven querying of the TMAD data. We also found that 40% of annotations can be mapped to terms from both ontologies, providing the potential to align the two ontologies based on experimental data. Our approach provides the basis for a data-driven ontology alignment by mapping annotations of experimental data.

    View details for PubMedID 17238433

    View details for PubMedCentralID PMC1839511

  • Temporal evolution of the Arabidopsis oxidative stress response PLANT MOLECULAR BIOLOGY Mahalingam, R., Shah, N., Scrymgeour, A., Fedoroff, N. 2005; 57 (5): 709-730


    We have carried out a detailed analysis of the changes in gene expression levels in Arabidopsis thaliana ecotype Columbia (Col-0) plants during and for 6 h after exposure to ozone (O3) at 350 parts per billion (ppb) for 6 h. This O3 exposure is sufficient to induce a marked transcriptional response and an oxidative burst, but not to cause substantial tissue damage in Col-0 wild-type plants and is within the range encountered in some major metropolitan areas. We have developed analytical and visualization tools to automate the identification of expression profile groups with common gene ontology (GO) annotations based on the sub-cellular localization and function of the proteins encoded by the genes, as well as to automate promoter analysis for such gene groups. We describe the application of these methods to identify stress-induced genes whose transcript abundance is likely to be controlled by common regulatory mechanisms and summarize our findings in a temporal model of the stress response.

    View details for DOI 10.1007/s11103-005-2860-4

    View details for Web of Science ID 000231220400007

    View details for PubMedID 15988565

  • HyBrow: a prototype system for computer-aided hypothesis evaluation. Bioinformatics Racunas, S. A., Shah, N. H., Albert, I., Fedoroff, N. V. 2004; 20: i257-64


    Experimental design, hypothesis-testing and model-building in the current data-rich environment require biologists to collect, evaluate and integrate large amounts of information of many disparate kinds. Developing a unified framework for the representation and conceptual integration of biological data and processes is a major challenge in bioinformatics because of the variety of available data and the different levels of detail at which biological processes can be considered. We have developed the HyBrow (Hypothesis Browser) system as a prototype bioinformatics tool for designing hypotheses and evaluating them for consistency with existing knowledge. HyBrow consists of a modeling framework with the ability to accommodate diverse biological information sources, an event-based ontology for representing biological processes at different levels of detail, a database to query information in the ontology and programs to perform hypothesis design and evaluation. We demonstrate the HyBrow prototype using the galactose gene network in Saccharomyces cerevisiae as our test system, and evaluate alternative hypotheses for consistency with stored

    View details for PubMedID 15262807

  • HyBrow: a prototype system for computer-aided hypothesis evaluation BIOINFORMATICS Racunas, S. A., Shah, N. H., Albert, I., Fedoroff, N. V. 2004; 20: 257-264
  • CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology BIOINFORMATICS Shah, N. H., Fedoroff, N. V. 2004; 20 (7): 1196-1197


    Analysis of microarray data most often produces lists of genes with similar expression patterns, which are then subdivided into functional categories for biological interpretation. Such functional categorization is most commonly accomplished using Gene Ontology (GO) categories. Although there are several programs that identify and analyze functional categories for human, mouse and yeast genes, none of them accepts Arabidopsis thaliana data. In order to address this need for the A. thaliana community, we have developed a program that retrieves GO annotations for A. thaliana genes and performs functional category analysis for lists of genes selected by the user.

    View details for DOI 10.1093/bioinformatics/bth056

    View details for Web of Science ID 000221139700024

    View details for PubMedID 14764555
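
    Functional category analysis of a gene list, as described in the abstract above, is often framed as an over-representation test. The sketch below uses a one-sided hypergeometric test, which is an assumption made for illustration: the abstract does not state which statistic CLENCH applies. The function name `hypergeom_enrichment_p` and its parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(
    population: int,
    annotated: int,
    cluster: int,
    cluster_annotated: int,
) -> float:
    """One-sided hypergeometric p-value for GO category over-representation.

    population        -- number of genes in the background set
    annotated         -- background genes carrying the GO category
    cluster           -- number of genes in the user-selected list
    cluster_annotated -- genes in the list that carry the GO category

    Returns P(X >= cluster_annotated) when `cluster` genes are drawn
    without replacement from the background.
    """
    upper = min(annotated, cluster)
    total = comb(population, cluster)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster - k)
        for k in range(cluster_annotated, upper + 1)
    )
    return tail / total


# Toy numbers: 3 of 10 background genes carry the category,
# and all 3 genes in the selected list carry it.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
```

    In practice the test would be repeated for every GO category represented in the list, so the resulting p-values would need a multiple-testing correction before interpretation.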

  • A finite model theory for biological hypotheses IEEE Computational Systems Bioinformatics Conference (CSB 2004) Racunas, S., Griffin, C., Shah, N. IEEE COMPUTER SOC. 2004: 616–620
  • A tool-kit for cDNA microarray and promoter analysis BIOINFORMATICS Shah, N. H., King, D. C., Shah, P. N., Fedoroff, N. V. 2003; 19 (14): 1846-1848


    We describe two sets of programs for expediting routine tasks in analysis of cDNA microarray data and promoter sequences. The first set permits bad data points to be flagged with respect to a number of parameters and performs normalization in three different ways. It allows combining of result files into comprehensive data sets, evaluation of the quality of both technical and biological replicates and row and/or column standardization of data matrices. The second set supports mapping ESTs in the genome, identifying the corresponding genes and recovering their promoters, analyzing promoters for transcription factor binding sites, and visual representation of the results. The programs are designed primarily for Arabidopsis thaliana researchers, but can be adapted readily for other model systems. Availability and Supplementary information:

    View details for DOI 10.1093/bioinformatics/btg253

    View details for Web of Science ID 000185701100017

    View details for PubMedID 14512358
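
    The row and column standardization of data matrices mentioned in the abstract above can be sketched as follows. This assumes z-score standardization (subtract the mean, divide by the population standard deviation), which the abstract does not specify; the function names `standardize_rows` and `standardize_columns` are hypothetical and do not come from the tool-kit itself.

```python
from statistics import mean, pstdev
from typing import List

Matrix = List[List[float]]


def standardize_rows(matrix: Matrix) -> Matrix:
    """Z-score each row: subtract the row mean, divide by the row std dev.

    Rows with zero variance are mapped to all zeros to avoid division by zero.
    """
    result = []
    for row in matrix:
        mu = mean(row)
        sd = pstdev(row)
        result.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    return result


def standardize_columns(matrix: Matrix) -> Matrix:
    """Z-score each column by transposing, standardizing rows, transposing back."""
    transposed = [list(col) for col in zip(*matrix)]
    standardized = standardize_rows(transposed)
    return [list(row) for row in zip(*standardized)]


# Toy expression matrix: rows are genes, columns are arrays (invented values).
expression = [
    [1.0, 2.0, 3.0],
    [5.0, 5.0, 5.0],
]
by_row = standardize_rows(expression)
by_column = standardize_columns(expression)
```

    Row standardization puts each gene's profile on a common scale across arrays, while column standardization does the same for each array across genes.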

  • A contradiction-based framework for testing gene regulation hypotheses 2nd International Computational Systems Bioinformatics Conference Racunas, S., Shah, N., Fedoroff, N. V. IEEE COMPUTER SOC. 2003: 634–638
  • Characterizing the stress/defense transcriptome of Arabidopsis GENOME BIOLOGY Mahalingam, R., Gomez-Buitrago, A., Eckardt, N., Shah, N., Guevara-Garcia, A., Day, P., Raina, R., Fedoroff, N. V. 2003; 4 (3)


    To understand the gene networks that underlie plant stress and defense responses, it is necessary to identify and characterize the genes that respond both initially and as the physiological response to the stress or pathogen develops. We used PCR-based suppression subtractive hybridization to identify Arabidopsis genes that are differentially expressed in response to ozone, bacterial and oomycete pathogens and the signaling molecules salicylic acid (SA) and jasmonic acid. We identified a total of 1,058 differentially expressed genes from eight stress cDNA libraries. Digital northern analysis revealed that 55% of the stress-inducible genes are rarely transcribed in unstressed plants and 17% of them were not previously represented in Arabidopsis expressed sequence tag databases. More than two-thirds of the genes in the stress cDNA collection have not been identified in previous studies as stress/defense response genes. Several stress-responsive cis-elements showed a statistically significant over-representation in the promoters of the genes in the stress cDNA collection. These include W- and G-boxes, the SA-inducible element, the abscisic acid response element and the TGA motif. The stress cDNA collection comprises a broad repertoire of stress-responsive genes encoding proteins that are involved in both the initial and subsequent stages of the physiological response to abiotic stress and pathogens. This set of stress-, pathogen- and hormone-modulated genes is an important resource for understanding the genetic interactions underlying stress signaling and responses and may contribute to the characterization of the stress transcriptome through the construction of standardized specialized arrays.

    View details for Web of Science ID 000182694200009

    View details for PubMedID 12620105

  • StressDB: A locally installable web-based relational microarray database designed for small user communities COMPARATIVE AND FUNCTIONAL GENOMICS Mitra, M., Shah, N., Mueller, L., Pin, S., Fedoroff, N. 2002; 3 (2): 91-96


    We have built a microarray database, StressDB, for management of microarray data from our studies on stress-modulated genes in Arabidopsis. StressDB provides small user groups with a locally installable web-based relational microarray database. It has a simple and intuitive architecture and has been designed for cDNA microarray technology users. StressDB uses Windows 2000 as the centralized database server with Oracle 8i as the relational database management system. It allows users to manage microarray data and data-related biological information over the Internet using a web browser. The source-code is currently available on request from the authors and will soon be made freely available for downloading from our website at

    View details for DOI 10.1002/cfg.153

    View details for Web of Science ID 000175388900001

    View details for PubMedID 18628845