I use data science and informatics techniques to study human diseases and their impact on population health outcomes and healthcare spending, to enable new knowledge discovery, and to build next-generation informatics tools for population health management and measurement. I bring over fifteen years of experience with large and diverse population health datasets, including population-based registers in Denmark and the US, the Department of Veterans Affairs Corporate Data Warehouse, the Rheumatology Informatics System for Effectiveness, Stanford and UCSF electronic medical records, administrative healthcare claims, and activity monitoring data. I have also developed natural language processing tools for a variety of biomedical use cases. Paired with the practical skills and knowledge that I have gained through working within integrated delivery systems across the US, my extensive training in computer science, biology, and health services research uniquely positions me to build next-generation tools to support integrated health delivery systems and population health.
As an Instructor in the Department of Biomedical Data Science at Stanford, I manage a small research group in which I mentor students and advanced trainees at all levels, both within the School of Medicine and more broadly across the University. I also lead the Stanford Working Group, Stats for Social Good.
Instructor, Biomedical Data Science
Senior Data Scientist, Department of Veterans Affairs (2020 - Present)
Assistant Director of Data Science, Center for Population Health Sciences (2019 - Present)
Postdoctoral Training, Stanford School of Medicine, Biomedical Informatics (2015)
Doctor of Philosophy, Graduate Center, City University of New York (CUNY), Computer Science (2013)
Master of Science, Brooklyn College, CUNY, Computer Science and Health Science (2006)
Bachelor of Science, Brooklyn College, CUNY, Biology
A case for developing domain-specific vocabularies for extracting suicide factors from healthcare notes.
Journal of psychiatric research
2022; 151: 328-338
The onset and persistence of life events (LE) such as housing instability, job instability, and reduced social connection have been shown to increase risk of suicide. Predictive models for suicide risk have low sensitivity to many of these factors due to under-reporting in structured electronic health records (EHR) data. In this study, we show how natural language processing (NLP) can help identify LE in clinical notes at higher rates than reported medical codes. We compare domain-specific lexicons formulated from Unified Medical Language System (UMLS) selection, content analysis by subject matter experts (SME) and the Gravity Project, to data-driven expansion through contextual word embedding using Word2Vec. Our analysis covers EHR from the Veterans Affairs (VA) Corporate Data Warehouse (CDW) and measures the prevalence of LE across time for patients with known underlying cause of death in the National Death Index (NDI). We found that NLP methods had higher sensitivity of detecting LE relative to structured EHR (S-EHR) variables. We observed that, on average, suicide cases had higher rates of LE over time when compared to patients who died of non-suicide related causes with no previous history of diagnosed mental illness. When used to discriminate these outcomes, the inclusion of NLP derived variables increased the concentration of LE along the top 0.1%, 0.5% and 1% of predicted risk. LE were less informative when discriminating suicide death from non-suicide related death for patients with diagnosed mental illness.
View details for DOI 10.1016/j.jpsychires.2022.04.009
View details for PubMedID 35533516
Development of a natural language processing system for extracting rheumatoid arthritis outcomes from clinical notes using the national RISE registry.
Arthritis care & research
OBJECTIVE: To accelerate the use of outcome measures in rheumatology, we developed and evaluated a natural language processing (NLP) pipeline for extracting these measures from free-text outpatient rheumatology notes within the ACR's Rheumatology Informatics System for Effectiveness (RISE) registry. METHODS: We included all patients in RISE (2015 to 2018). The NLP pipeline extracted scores corresponding to eight measures of RA disease activity (DA) and functional status (FS) documented in outpatient rheumatology notes. Score extraction performance was evaluated by chart review, and we assessed agreement with scores documented in structured data. We conducted an external validation of our NLP pipeline using data from rheumatology notes from an academic medical center that is not included in the RISE registry. RESULTS: We processed over 34 million notes from 854,628 patients, 158 practices, and 24 EHR systems from RISE. Manual chart review revealed a sensitivity, positive predictive value (PPV), and F1 score of 95%, 87%, and 91%, respectively. Substantial agreement was observed between scores extracted from RISE notes and scores derived from structured data (kappa: 0.43-0.68 among DA and 0.86-0.98 among FS measures). In the external validation, we found a sensitivity, PPV, and F1 score of 92%, 69%, and 79%, respectively. CONCLUSIONS: We developed an NLP pipeline to extract RA outcome measures from a national registry of notes from multiple EHR systems and found it to have good internal and external validity. This pipeline can facilitate measurement of clinical and patient reported outcomes for use in research and quality measurement.
View details for DOI 10.1002/acr.24869
View details for PubMedID 35157365
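As a quick arithmetic check on the abstract above, the F1 score is the harmonic mean of sensitivity (recall) and PPV (precision). The sketch below is illustrative only and is not the authors' evaluation code; the function name `f1_score` is a hypothetical helper, and the inputs are the rounded percentages reported in the abstract.

```python
def f1_score(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity (recall) and PPV (precision)."""
    if sensitivity + ppv == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Rounded values taken from the abstract (chart review and external validation).
internal = f1_score(0.95, 0.87)
external = f1_score(0.92, 0.69)

print(f"internal F1 = {internal:.2f}")
print(f"external F1 = {external:.2f}")
```

Under these rounded inputs the computed values agree with the reported F1 scores of 91% and 79%.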
Unstructured clinical notes within the 24 hours since admission predict short, mid & long-term mortality in adult ICU patients.
2022; 17 (1): e0262182
Mortality prediction for intensive care unit (ICU) patients is crucial for improving outcomes and efficient utilization of resources. Accessibility of electronic health records (EHR) has enabled data-driven predictive modeling using machine learning. However, very few studies rely solely on unstructured clinical notes from the EHR for mortality prediction. In this work, we propose a framework to predict short, mid, and long-term mortality in adult ICU patients using unstructured clinical notes from the MIMIC III database, natural language processing (NLP), and machine learning (ML) models. Depending on the statistical description of the patients' length of stay, we define the short-term as 48-hour and 4-day period, the mid-term as 7-day and 10-day period, and the long-term as 15-day and 30-day period after admission. We found that by only using clinical notes within the 24 hours of admission, our framework can achieve a high area under the receiver operating characteristics (AU-ROC) score for short, mid and long-term mortality prediction tasks. The test AU-ROC scores are 0.87, 0.83, 0.83, 0.82, 0.82, and 0.82 for 48-hour, 4-day, 7-day, 10-day, 15-day, and 30-day period mortality prediction, respectively. We also provide a comparative study among three types of feature extraction techniques from NLP: frequency-based technique, fixed embedding-based technique, and dynamic embedding-based technique. Lastly, we provide an interpretation of the NLP-based predictive models using feature-importance scores.
View details for DOI 10.1371/journal.pone.0262182
View details for PubMedID 34990485
Natural Language Processing Tool for Extraction of Patient-Reported Outcomes from a National Multi-Electronic Health Records Registry
WILEY. 2021: 3955-3957
View details for Web of Science ID 000744545207208
Association of alpha1-Blocker Receipt With 30-Day Mortality and Risk of Intensive Care Unit Admission Among Adults Hospitalized With Influenza or Pneumonia in Denmark.
JAMA network open
2021; 4 (2): e2037053
Importance: Alpha 1-adrenergic receptor blocking agents (alpha1-blockers) have been reported to have protective benefits against hyperinflammation and cytokine storm syndrome, conditions that are associated with mortality in patients with coronavirus disease 2019 and other severe respiratory tract infections. However, studies of the association of alpha1-blockers with outcomes among human participants with respiratory tract infections are scarce. Objective: To examine the association between the receipt of alpha1-blockers and outcomes among adult patients hospitalized with influenza or pneumonia. Design, Setting, and Participants: This population-based cohort study used data from Danish national registries to identify individuals 40 years and older who were hospitalized with influenza or pneumonia between January 1, 2005, and November 30, 2018, with follow-up through December 31, 2018. In the main analyses, patients currently receiving alpha1-blockers were compared with those not receiving alpha1-blockers (defined as patients with no prescription for an alpha1-blocker filled within 365 days before the index date) and those currently receiving 5alpha-reductase inhibitors. Propensity scores were used to address confounding factors and to compute weighted risks, absolute risk differences, and risk ratios. Data were analyzed from April 21 to December 21, 2020. Exposures: Current receipt of alpha1-blockers compared with nonreceipt of alpha1-blockers and with current receipt of 5alpha-reductase inhibitors. Main Outcomes and Measures: Death within 30 days of hospital admission and risk of intensive care unit (ICU) admission. Results: A total of 528 467 adult patients (median age, 75.0 years; interquartile range, 64.4-83.6 years; 273 005 men [51.7%]) were hospitalized with influenza or pneumonia in Denmark between 2005 and 2018. Of those, 21 772 patients (4.1%) were currently receiving alpha1-blockers compared with a population of 22 117 patients not receiving alpha1-blockers who were weighted to the propensity score distribution of those receiving alpha1-blockers. In the propensity score-weighted analyses, patients receiving alpha1-blockers had lower 30-day mortality (15.9%) compared with patients not receiving alpha1-blockers (18.5%), with a corresponding risk difference of -2.7% (95% CI, -3.2% to -2.2%) and a risk ratio (RR) of 0.85 (95% CI, 0.83-0.88). The risk of ICU admission was 7.3% among patients receiving alpha1-blockers and 7.7% among those not receiving alpha1-blockers (risk difference, -0.4% [95% CI, -0.8% to 0%]; RR, 0.95 [95% CI, 0.90-1.00]). A comparison between 18 280 male patients currently receiving alpha1-blockers and 18 228 propensity score-weighted male patients currently receiving 5alpha-reductase inhibitors indicated that those receiving alpha1-blockers had lower 30-day mortality (risk difference, -2.0% [95% CI, -3.4% to -0.6%]; RR, 0.89 [95% CI, 0.82-0.96]) and a similar risk of ICU admission (risk difference, -0.3% [95% CI, -1.4% to 0.7%]; RR, 0.96 [95% CI, 0.83-1.10]). Conclusions and Relevance: This cohort study's findings suggest that the receipt of alpha1-blockers is associated with protective benefits among adult patients hospitalized with influenza or pneumonia.
View details for DOI 10.1001/jamanetworkopen.2020.37053
View details for PubMedID 33566109
Ten Rules for Conducting Retrospective Pharmacoepidemiological Analyses: Example COVID-19 Study.
Frontiers in pharmacology
2021; 12: 700776
Since the beginning of the COVID-19 pandemic, pharmaceutical treatment hypotheses have abounded, each requiring careful evaluation. A randomized controlled trial generally provides the most credible evaluation of a treatment, but the efficiency and effectiveness of the trial depend on the existing evidence supporting the treatment. The researcher must therefore compile a body of evidence justifying the use of time and resources to further investigate a treatment hypothesis in a trial. An observational study can provide this evidence, but the lack of randomized exposure and the researcher's inability to control treatment administration and data collection introduce significant challenges. A proper analysis of observational health care data thus requires contributions from experts in a diverse set of topics ranging from epidemiology and causal analysis to relevant medical specialties and data sources. Here we summarize these contributions as 10 rules that serve as an end-to-end introduction to retrospective pharmacoepidemiological analyses of observational health care data using a running example of a hypothetical COVID-19 study. A detailed supplement presents a practical how-to guide for following each rule. When carefully designed and properly executed, a retrospective pharmacoepidemiological analysis framed around these rules will inform the decisions of whether and how to investigate a treatment hypothesis in a randomized controlled trial. This work has important implications for any future pandemic by prescribing what we can and should do while the world waits for global vaccine distribution.
View details for DOI 10.3389/fphar.2021.700776
View details for PubMedID 34393782
Application of Text Mining Methods to Identify Lupus Nephritis from Electronic Health Records
View details for Web of Science ID 000587568500258
Risk of primary urological and genital cancers following incident breast cancer: a Danish population-based cohort study.
Breast cancer research and treatment
PURPOSE: The prevalence of breast cancer survivors has increased due to dissemination of population-based mammographic screening and improved treatments. Recent changes in anti-hormonal therapies for breast cancer may have modified the risks of subsequent urological and genital cancers. We examine the risk of subsequent primary urological and genital cancers in patients with incident breast cancer compared with risks in the general population. METHODS: Using population-based Danish medical registries, we identified a cohort of women with primary breast cancer (1990-2017). We followed them from one year after their breast cancer diagnosis until any subsequent urological or genital cancer diagnosis. We computed incidence rates and standardized incidence ratios (SIRs) with 95% confidence intervals (CIs) as the observed number of cancers relative to the expected number based on national incidence rates (by sex, age, and calendar year). RESULTS: Among 84,972 patients with breast cancer (median age 61 years), we observed 623 urological cancers and 1397 genital cancers during a median follow-up of 7.4 years. The incidence rate per 100,000 person-years was stable during follow-up (83 for urological cancers and 176 for genital cancers). The SIR was increased for ovarian cancer (1.37, 95% CI 1.23-1.52) and uterine cancer (1.37, 95% CI 1.25-1.50), but only during the pre-aromatase inhibitor era (before 2007). Moreover, the SIR of kidney cancer was increased (1.52, 95% CI 1.15-1.97), but only during 2007-2017. The SIR for urinary bladder cancer was marginally increased (1.15, 95% CI 1.04-1.28) with no temporal effects. No associations were observed for cervical cancer. CONCLUSION: Breast cancer survivors had higher risks of uterine and ovarian cancer than expected, but only before 2007, and of kidney cancer, but only after 2007. The risk of urinary bladder cancer was moderately increased without temporal effects, and we observed no association with cervical cancer.
View details for DOI 10.1007/s10549-020-05879-w
View details for PubMedID 32845432
A Machine Learning Approach to Identifying Changes in Suicidal Language.
Suicide & life-threatening behavior
OBJECTIVE: With early identification and intervention, many suicidal deaths are preventable. Tools that include machine learning methods have been able to identify suicidal language. This paper examines the persistence of this suicidal language up to 30 days after discharge from care. METHOD: In a multi-center study, 253 subjects were enrolled into either suicidal or control cohorts. Their responses to standardized instruments and interviews were analyzed using machine learning algorithms. Subjects were re-interviewed approximately 30 days later, and their language was compared to the original language to determine the presence of suicidal ideation. RESULTS: The results show that language characteristics used to classify suicidality at the initial encounter are still present in the speech 30 days later (AUC = 89% (95% CI: 85-95%), p < .0001) and that algorithms trained on the second interviews could also identify the subjects that produced the first interviews (AUC = 85% (95% CI: 81-90%), p < .0001). CONCLUSIONS: This approach explores the stability of suicidal language. When using advanced computational methods, the results show that a patient's language is similar 30 days after first captured, while responses to standard measures change. This can be useful when developing methods that identify the data-based phenotype of a subject.
View details for DOI 10.1111/sltb.12642
View details for PubMedID 32484597
Risk of primary gastrointestinal cancers following incident non-metastatic breast cancer: a Danish population-based cohort study.
BMJ open gastroenterology
2020; 7 (1)
OBJECTIVE: We examined the risk of primary gastrointestinal cancers in women with breast cancer and compared this risk with that of the general population. DESIGN: Using population-based Danish registries, we conducted a cohort study of women with incident non-metastatic breast cancer (1990-2017). We computed cumulative cancer incidences and standardised incidence ratios (SIRs). RESULTS: Among 84 972 patients with breast cancer, we observed 2340 gastrointestinal cancers. After 20 years of follow-up, the cumulative incidence of gastrointestinal cancers was 4%, driven mainly by colon cancers. Only risk of stomach cancer was continually increased beyond 1 year following breast cancer. The SIR for colon cancer was neutral during 2-5 years of follow-up and approximately 1.2-fold increased thereafter. For cancer of the oesophagus, the SIR was increased only during 6-10 years. There was a weak association with pancreas cancer beyond 10 years. Between 1990-2006 and 2007-2017, the 1-10 years SIR estimate decreased and reached unity for upper gastrointestinal cancers (oesophagus, stomach, and small intestine). For lower gastrointestinal cancers (colon, rectum, and anal canal), the SIR estimate was increased only after 2007. No temporal effects were observed for the remaining gastrointestinal cancers. Treatment effects were negligible. CONCLUSION: Breast cancer survivors were at increased risk of oesophagus and stomach cancer, but only before 2007. The risk of colon cancer was increased, but only after 2007.
View details for DOI 10.1136/bmjgast-2020-000413
View details for PubMedID 32611556
The incidence of hematologic cancers after breast cancer. A 35-year population-based cohort study in Denmark
WILEY. 2019: 75
View details for Web of Science ID 000481785600147
Risk of primary urological and genital cancers following incident breast cancer: A Danish population-based cohort study
WILEY. 2019: 79
View details for Web of Science ID 000481785600155
Risk of primary gastrointestinal cancers following incident breast cancer: A Danish population-based cohort study
WILEY. 2019: 81
View details for Web of Science ID 000481785600159
- Stress Disorders and Dementia in the Danish Population AMERICAN JOURNAL OF EPIDEMIOLOGY 2019; 188 (3): 493–99
Using natural language processing to construct a metastatic breast cancer cohort from linked cancer registry and electronic medical records data.
2019; 2 (4): 528–37
Most population-based cancer databases lack information on metastatic recurrence. Electronic medical records (EMR) and cancer registries contain complementary information on cancer diagnosis, treatment and outcome, yet are rarely used synergistically. To construct a cohort of metastatic breast cancer (MBC) patients, we applied natural language processing techniques within a semisupervised machine learning framework to linked EMR-California Cancer Registry (CCR) data. We studied all female patients treated at Stanford Health Care with an incident breast cancer diagnosis from 2000 to 2014. Our database consisted of structured fields and unstructured free-text clinical notes from EMR, linked to CCR, a component of the Surveillance, Epidemiology and End Results Program (SEER). We identified de novo MBC patients from CCR and extracted information on distant recurrences from patient notes in EMR. Furthermore, we trained a regularized logistic regression model for recurrent MBC classification and evaluated its performance on a gold standard set of 146 patients. There were 11 459 breast cancer patients in total and the median follow-up time was 96.3 months. We identified 1886 MBC patients, 512 (27.1%) of whom were de novo MBC patients and 1374 (72.9%) were recurrent MBC patients. Our final MBC classifier achieved an area under the receiver operating characteristic curve (AUC) of 0.917, with sensitivity 0.861, specificity 0.878, and accuracy 0.870. To enable population-based research on MBC, we developed a framework for retrospective case detection combining EMR and CCR data. Our classifier achieved good AUC, sensitivity, and specificity without expert-labeled examples.
View details for DOI 10.1093/jamiaopen/ooz040
View details for PubMedID 32025650
View details for PubMedCentralID PMC6994019
Stress Disorders and Dementia in the Danish Population.
American journal of epidemiology
There is an association between stress and dementia. However, less is known about dementia among persons with varied stress responses and sex differences in these associations. This population-based cohort study examined dementia among persons with a range of clinician-diagnosed stress disorders, and the interaction between stress disorders and sex in predicting dementia, in Denmark from 1995 to 2011. This study included Danes 40 years or older with a stress disorder diagnosis (n=47,047) and a matched comparison cohort (n=232,141) without a stress disorder diagnosis from 1995 through 2011. Diagnoses were culled from national registries. We used Cox proportional-hazards regression to estimate associations between stress disorders and dementia. Risk of dementia was higher for persons with stress disorders than for persons without such diagnosis; adjusted hazard ratios ranged from 1.6 to 2.8. There was evidence of an interaction between sex and stress disorders in predicting dementia, with a greater rate of dementia among men with stress disorders except posttraumatic stress disorder, for which women had a greater rate. Results support existing evidence of an association between stress and dementia. This study contributes novel information regarding dementia risk across a range of stress responses, and interactions between stress disorders and sex.
View details for PubMedID 30576420
Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data
JAMA INTERNAL MEDICINE
2018; 178 (11): 1544–47
A promise of machine learning in health care is the avoidance of biases in diagnosis and treatment; a computer algorithm could objectively synthesize and interpret the data in the medical record. Integration of machine learning with clinical decision support tools, such as computerized alerts or diagnostic support, may offer physicians and others who provide health care targeted and timely information that can improve clinical decisions. Machine learning algorithms, however, may also be subject to biases. The biases include those related to missing data and patients not identified by algorithms, sample size and underestimation, and misclassification and measurement error. There is concern that biases and deficiencies in the data used by machine learning algorithms may contribute to socioeconomic disparities in health care. This Special Communication outlines the potential biases that may be introduced into machine learning-based clinical decision support tools that use electronic health record data and proposes potential solutions to the problems of overreliance on automation, algorithms based on biased data, and algorithms that do not provide information that is clinically meaningful. Existing health care disparities should not be amplified by thoughtless or excessive reliance on machines.
View details for PubMedID 30128552
Scalable Electronic Phenotyping For Studying Patient Comorbidities.
AMIA ... Annual Symposium proceedings. AMIA Symposium
2018; 2018: 740–49
Over 75 million Americans have multiple concurrent chronic conditions and medical decision making for these patients is mostly based on retrospective cohort studies. Current methods to generate cohorts of patients with comorbidities are neither scalable nor generalizable. We propose a supervised machine learning algorithm for learning comorbidity phenotypes without requiring manually created training sets. First, we generated myocardial infarction (MI) and type-2 diabetes (T2DM) patient cohorts using ICD9-based imperfectly labeled samples upon which LASSO logistic regression models were trained. Second, we assessed the effects of training sample size, inclusion of physician input, and inclusion of clinical text features on model performance. Using ICD9 codes as our labeling heuristic, we achieved comparable performance to models created using keywords as labeling heuristic. We found that expert input and higher training sample sizes could compensate for the lack of clinical text derived features. However, our best performing model included clinical text as features with a large training sample size.
View details for PubMedID 30815116
SynthNotes: A Generator Framework for High-volume, High-fidelity Synthetic Mental Health Notes
IEEE. 2018: 951–58
View details for Web of Science ID 000468499301003
Performance of Machine Learning Methods Using Electronic Medical Records to Predict Varicella Zoster Virus Infection
View details for Web of Science ID 000411824106394
Predicting patient 'cost blooms' in Denmark: a longitudinal population-based study.
2017; 7 (1)
To compare the ability of standard versus enhanced models to predict future high-cost patients, especially those who move from a lower to the upper decile of per capita healthcare expenditures within 1 year, that is, 'cost bloomers'. We developed alternative models to predict being in the upper decile of healthcare expenditures in year 2 of a sample, based on data from year 1. Our 6 alternative models ranged from a standard cost-prediction model with 4 variables (ie, traditional model features), to our largest enhanced model with 1053 non-traditional model features. To quantify any increases in predictive power that enhanced models achieved over standard tools, we compared the prospective predictive performance of each model. We used the population of Western Denmark between 2004 and 2011 (2 146 801 individuals) to predict future high-cost patients and characterise high-cost patient subgroups. Using the most recent 2-year period (2010-2011) for model evaluation, our whole-population model used a cohort of 1 557 950 individuals with a full year of active residency in year 1 (2010). Our cost-bloom model excluded the 155 795 individuals who were already high cost at the population level in year 1, resulting in 1 402 155 individuals for prediction of cost bloomers in year 2 (2011). Using unseen data from a future year, we evaluated each model's prospective predictive performance by calculating the ratio of predicted high-cost patient expenditures to the actual high-cost patient expenditures in Year 2, that is, cost capture. Our best enhanced model achieved a 21% and 30% improvement in cost capture over a standard diagnosis-based model for predicting population-level high-cost patients and cost bloomers, respectively. In combination with modern statistical learning methods for analysing large data sets, models enhanced with a large and diverse set of features led to better performance, especially for predicting future cost bloomers.
View details for DOI 10.1136/bmjopen-2016-011580
View details for PubMedID 28077408
View details for PubMedCentralID PMC5253526
Enhanced Quality Measurement Event Detection: An Application to Physician Reporting.
EGEMS (Washington, DC)
2017; 5 (1): 5
The wide-scale adoption of electronic health records (EHRs) has increased the availability of routinely collected clinical data in electronic form that can be used to improve the reporting of quality of care. However, the bulk of information in the EHR is in unstructured form (e.g., free-text clinical notes) and not amenable to automated reporting. Traditional methods are based on structured diagnostic and billing data that provide efficient, but inaccurate or incomplete summaries of actual or relevant care processes and patient outcomes. To assess the feasibility and benefit of implementing enhanced EHR-based physician quality measurement and reporting, which includes the analysis of unstructured free-text clinical notes, we conducted a retrospective study to compare traditional and enhanced approaches for reporting ten physician quality measures from multiple National Quality Strategy domains. We found that our enhanced approach enabled the calculation of five Physician Quality and Performance System measures not measurable in billing or diagnostic codes and resulted in over a five-fold increase in event detection at an average precision of 88 percent (95 percent CI: 83-93 percent). Our work suggests that enhanced EHR-based quality measurement can increase event detection for establishing value-based payment arrangements and can expedite quality reporting for physician practices, which are increasingly burdened by the process of manual chart review for quality reporting.
View details for PubMedID 29881731
New Paradigms for Patient-Centered Outcomes Research in Electronic Medical Records: An Example of Detecting Urinary Incontinence Following Prostatectomy.
EGEMS (Washington, DC)
2016; 4 (3): 1231-?
National initiatives to develop quality metrics emphasize the need to include patient-centered outcomes. Patient-centered outcomes are complex, require documentation of patient communications, and have not been routinely collected by healthcare providers. The widespread implementation of electronic medical records (EHR) offers opportunities to assess patient-centered outcomes within the routine healthcare delivery system. The objective of this study was to test the feasibility and accuracy of identifying patient-centered outcomes within the EHR. Data from patients with localized prostate cancer undergoing prostatectomy were used to develop and test algorithms to accurately identify patient-centered outcomes in post-operative EHRs - we used urinary incontinence as the use case. Standard data mining techniques were used to extract and annotate free text and structured data to assess urinary incontinence recorded within the EHRs. A total of 5,349 prostate cancer patients were identified in our EHR system between 1998 and 2013. Among these EHRs, 30.3% had a text mention of urinary incontinence within 90 days post-operative compared to less than 1.0% with a structured data field for urinary incontinence (i.e. ICD-9 code). Our workflow had good precision and recall for urinary incontinence (positive predictive value: 0.73 and sensitivity: 0.84). Our data indicate that important patient-centered outcomes, such as urinary incontinence, are being captured in EHRs as free text and highlight the long-standing importance of accurate clinician documentation. Standard data mining algorithms can accurately and efficiently identify these outcomes in existing EHRs; the complete assessment of these outcomes is essential to move practice into the patient-centered realm of healthcare.
View details for DOI 10.13063/2327-9214.1231
View details for PubMedID 27347492
Detecting unplanned care from clinician notes in electronic health records.
Journal of oncology practice / American Society of Clinical Oncology
2015; 11 (3): e313-9
Reductions in unplanned episodes of care, such as emergency department visits and unplanned hospitalizations, are important quality outcome measures. However, many events are only documented in free-text clinician notes and are labor intensive to detect by manual medical record review. We studied 308,096 free-text machine-readable documents linked to individual entries in our electronic health records, representing care for patients with breast, GI, or thoracic cancer, whose treatment was initiated at one academic medical center, Stanford Health Care (SHC). Using a clinical text-mining tool, we detected unplanned episodes documented in clinician notes (for non-SHC visits) or in coded encounter data for SHC-delivered care and the most frequent symptoms documented in emergency department (ED) notes. Combined reporting increased the identification of patients with one or more unplanned care visits by 32% (15% using coded data; 20% using all the data) among patients with 3 months of follow-up and by 21% (23% using coded data; 28% using all the data) among those with 1 year of follow-up. Based on the textual analysis of SHC ED notes, pain (75%), followed by nausea (54%), vomiting (47%), infection (36%), fever (28%), and anemia (27%), were the most frequent symptoms mentioned. Pain, nausea, and vomiting co-occur in 35% of all ED encounter notes. The text-mining methods we describe can be applied to automatically review free-text clinician notes to detect unplanned episodes of care mentioned in these notes. These methods have broad application for quality improvement efforts in which events of interest occur outside of a network that allows for patient data sharing.
View details for DOI 10.1200/JOP.2014.002741
View details for PubMedID 25980019
View details for PubMedCentralID PMC4438112
Text Mining for Adverse Drug Events: the Promise, Challenges, and State of the Art
2014; 37 (10): 777-790
Text mining is the computational process of extracting meaningful information from large amounts of unstructured text. It is emerging as a tool to leverage underutilized data sources that can improve pharmacovigilance, including the objective of adverse drug event (ADE) detection and assessment. This article provides an overview of recent advances in pharmacovigilance driven by the application of text mining, and discusses several data sources (such as biomedical literature, clinical narratives, product labeling, social media, and Web search logs) that are amenable to text mining for pharmacovigilance. Given the state of the art, it appears text mining can be applied to extract useful ADE-related information from multiple textual sources. Nonetheless, further research is required to address remaining technical challenges associated with the text mining methodologies, and to conclusively determine the relative contribution of each textual source to improving pharmacovigilance.
View details for DOI 10.1007/s40264-014-0218-z
View details for Web of Science ID 000344615300005
View details for PubMedCentralID PMC4217510