Current Role at Stanford


I'm currently a staff research scientist in the Shah Lab at Stanford and a research scientist at Snorkel AI. My work sits at the intersection of computer science and medical informatics. My research interests include:

• Machine learning with limited labeled data, e.g., weak supervision, self-supervision, and few-shot learning
• Multimodal learning, e.g., combining text, imaging, video, and electronic health record data for improving clinical outcome prediction
• Human-in-the-loop machine learning systems
• Knowledge graphs and their use in improving representation learning

Projects


  • Weakly supervised classification of rare aortic valve malformations using unlabeled cardiac MRI sequences, Stanford University (1/1/2018 - 7/1/2019)

    This work explores training deep learning models for detecting cardiac pathologies using large-scale, unlabeled MRI video data available as part of the UK Biobank.

    Location

    Stanford, CA

    Collaborators

    • Christopher Ré, Associate Professor, Stanford University
    • Euan Ashley, Professor, Stanford University Cardiology
    • James Priest, Adjunct Clinical Assistant Professor, Stanford University Medical Center

  • Snorkel: Rapid Training Data Creation with Weak Supervision, Stanford University (6/1/2016 - 12/1/2018)

    Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8x faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8x speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

    Location

    Stanford, CA

    Collaborators

    • Alex Ratner, PhD Student, Stanford University
    • Stephen Bach, Assistant Professor, Brown University
    • Henry Ehrenberg, Software Engineer, Facebook
    • Sen Wu, PhD Student, Stanford University
    • Christopher Ré, Associate Professor, Computer Science, Stanford University
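
    As a concrete illustration of the labeling-function workflow this project describes, here is a minimal sketch using the open-source snorkel package (v0.9+); the task, heuristics, and data below are illustrative stand-ins, not the deployed pipeline.

        # Minimal Snorkel sketch: write noisy labeling functions, then let the
        # label model denoise their votes without ground-truth labels.
        import pandas as pd
        from snorkel.labeling import labeling_function, PandasLFApplier
        from snorkel.labeling.model import LabelModel

        ABSTAIN, NORMAL, ABNORMAL = -1, 0, 1

        @labeling_function()
        def lf_mentions_stenosis(x):
            # Keyword heuristic with unknown accuracy; abstains when unsure.
            return ABNORMAL if "stenosis" in x.text.lower() else ABSTAIN

        @labeling_function()
        def lf_unremarkable_exam(x):
            return NORMAL if "unremarkable" in x.text.lower() else ABSTAIN

        df_train = pd.DataFrame({"text": [
            "Severe aortic stenosis noted on echo.",
            "Cardiac exam unremarkable.",
        ]})

        # Apply the LFs to produce an (examples x LFs) matrix of votes.
        applier = PandasLFApplier(lfs=[lf_mentions_stenosis, lf_unremarkable_exam])
        L_train = applier.apply(df=df_train)

        # Estimate LF accuracies/correlations and emit probabilistic labels,
        # which then supervise a discriminative end model.
        label_model = LabelModel(cardinality=2, verbose=False)
        label_model.fit(L_train, n_epochs=100, seed=123)
        probs = label_model.predict_proba(L_train)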

Service, Volunteer and Community Work


  • Co-organizer for Machine Learning for Health Workshop @ NeurIPS, NeurIPS (12/2016 - 12/2018)

    Location

    Stanford, CA

  • Area Chair @ Machine Learning for Healthcare Conference (MLHC), Stanford University (2019 - 2021)

    Location

    Palo Alto, CA

All Publications


  • Ontology-driven weak supervision for clinical entity classification in electronic health records. Nature Communications Fries, J. A., Steinberg, E., Khattar, S., Fleming, S. L., Posada, J., Callahan, A., Shah, N. H. 2021; 12 (1): 2017

    Abstract

    In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g. the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time consuming and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. Our approach, unlike hand-labeled notes, is easy to share and modify, while offering performance comparable to learning from manually labeled training data. In this work, we validate our framework on six benchmark tasks and demonstrate Trove's ability to analyze the records of patients visiting the emergency department at Stanford Health Care for COVID-19 presenting symptoms and risk factors.

    View details for DOI 10.1038/s41467-021-22328-4

    View details for PubMedID 33795682
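
    The core idea above, using ontologies as a source of weak supervision, can be sketched in a few lines. This is an illustrative simplification, not Trove's actual interface; the term set and rule below are hypothetical.

        # Ontology-driven weak supervision sketch: an ontology-derived
        # dictionary and an expert rule each cast noisy votes per span.
        ABSTAIN, NOT_DISORDER, DISORDER = -1, 0, 1

        # Hypothetical term set, e.g. derived from UMLS "Disorder" semantic types.
        DISORDER_TERMS = {"fever", "cough", "dyspnea", "pneumonia"}

        def lf_ontology_match(span_text: str) -> int:
            """Vote DISORDER when the span matches an ontology term."""
            return DISORDER if span_text.lower() in DISORDER_TERMS else ABSTAIN

        def lf_negation_rule(sentence: str, span_text: str) -> int:
            """Expert rule: negated mentions are not active disorders."""
            return NOT_DISORDER if f"no {span_text.lower()}" in sentence.lower() else ABSTAIN

        votes = [lf_ontology_match("cough"),
                 lf_negation_rule("Patient reports no cough today.", "cough")]
        # In pipelines like this, a label model aggregates such votes into
        # probabilistic span labels used to train the final tagger.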

  • Weakly supervised classification of rare aortic valve malformations using unlabeled cardiac MRI sequences. Nature Communications Fries, J. A., Varma, P., Chen, V. S., Xiao, K., Tejeda, H., Saha, P., Dunnmon, J., Chubb, H., Maskatia, S., Fiterau, M., Delp, S., Ashley, E., Ré, C., Priest, J. R. 2019; 10
  • Snorkel: Rapid Training Data Creation with Weak Supervision. Proceedings of the VLDB Endowment Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., Re, C. 2017; 11 (3): 269–82

    Abstract

    Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

    View details for PubMedID 29770249

  • The Stanford Medicine data science ecosystem for clinical and translational research. JAMIA Open Callahan, A., Ashley, E., Datta, S., Desai, P., Ferris, T. A., Fries, J. A., Halaas, M., Langlotz, C. P., Mackey, S., Posada, J. D., Pfeffer, M. A., Shah, N. H. 2023; 6 (3): ooad054

    Abstract

    To describe the infrastructure, tools, and services developed at Stanford Medicine to maintain its data science ecosystem and research patient data repository for clinical and translational research. The data science ecosystem, dubbed the Stanford Data Science Resources (SDSR), includes infrastructure and tools to create, search, retrieve, and analyze patient data, as well as services for data deidentification, linkage, and processing to extract high-value information from healthcare IT systems. Data are made available via self-service and concierge access, on HIPAA-compliant secure computing infrastructure supported by in-depth user training. The Stanford Medicine Research Data Repository (STARR) functions as the SDSR data integration point, and includes electronic medical records, clinical images, text, bedside monitoring data and HL7 messages. SDSR tools include tools for electronic phenotyping, cohort building, and a search engine for patient timelines. The SDSR supports patient data collection, reproducible research, and teaching using healthcare data, and facilitates industry collaborations and large-scale observational studies. Research patient data repositories and their underlying data science infrastructure are essential to realizing a learning health system and advancing the mission of academic medical centers. Challenges to maintaining the SDSR include ensuring sufficient financial support while providing researchers and clinicians with maximal access to data and digital infrastructure, balancing tool development with user training, and supporting the diverse needs of users. Our experience maintaining the SDSR offers a case study for academic medical centers developing data science and research informatics infrastructure.

    View details for DOI 10.1093/jamiaopen/ooad054

    View details for PubMedID 37545984

    View details for PubMedCentralID PMC10397535

  • Self-supervised machine learning using adult inpatient data produces effective models for pediatric clinical prediction tasks. Journal of the American Medical Informatics Association (JAMIA) Lemmon, J., Guo, L. L., Steinberg, E., Morse, K. E., Fleming, S. L., Aftandilian, C., Pfohl, S. R., Posada, J. D., Shah, N., Fries, J., Sung, L. 2023

    Abstract

    Development of electronic health records (EHR)-based machine learning models for pediatric inpatients is challenged by limited training data. Self-supervised learning using adult data may be a promising approach to creating robust pediatric prediction models. The primary objective was to determine whether a self-supervised model trained in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients, for pediatric inpatient clinical prediction tasks. This retrospective cohort study used EHR data and included patients with at least one admission to an inpatient unit. One admission per patient was randomly selected. Adult inpatients were 18 years or older while pediatric inpatients were more than 28 days and less than 18 years. Admissions were temporally split into training (January 1, 2008 to December 31, 2019), validation (January 1, 2020 to December 31, 2020), and test (January 1, 2021 to August 1, 2022) sets. Primary comparison was a self-supervised model trained in adult inpatients versus count-based logistic regression models trained in pediatric inpatients. Primary outcome was mean area-under-the-receiver-operating-characteristic-curve (AUROC) for 11 distinct clinical outcomes. Models were evaluated in pediatric inpatients. When evaluated in pediatric inpatients, mean AUROC of self-supervised model trained in adult inpatients (0.902) was noninferior to count-based logistic regression models trained in pediatric inpatients (0.868) (mean difference = 0.034, 95% CI=0.014-0.057; P < .001 for noninferiority and P = .006 for superiority). Self-supervised learning in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients. This finding suggests transferability of self-supervised models trained in adult patients to pediatric patients, without requiring costly model retraining.

    View details for DOI 10.1093/jamia/ocad175

    View details for PubMedID 37639620

  • The shaky foundations of large language models and foundation models for electronic health records. npj Digital Medicine Wornow, M., Xu, Y., Thapa, R., Patel, B., Steinberg, E., Fleming, S., Pfeffer, M. A., Fries, J., Shah, N. H. 2023; 6 (1): 135

    Abstract

    The success of foundation models such as ChatGPT and AlphaFold has spurred significant interest in building similar models for electronic medical records (EMRs) to improve patient care and hospital operations. However, recent hype has obscured critical gaps in our understanding of these models' capabilities. In this narrative review, we examine 84 foundation models trained on non-imaging EMR data (i.e., clinical text and/or structured data) and create a taxonomy delineating their architectures, training data, and potential use cases. We find that most models are trained on small, narrowly-scoped clinical datasets (e.g., MIMIC-III) or broad, public biomedical corpora (e.g., PubMed) and are evaluated on tasks that do not provide meaningful insights on their usefulness to health systems. Considering these findings, we propose an improved evaluation framework for measuring the benefits of clinical foundation models that is more closely grounded to metrics that matter in healthcare.

    View details for DOI 10.1038/s41746-023-00879-8

    View details for PubMedID 37516790

    View details for PubMedCentralID PMC8371605

  • EHR foundation models improve robustness in the presence of temporal distribution shift. Scientific Reports Guo, L. L., Steinberg, E., Fleming, S. L., Posada, J., Lemmon, J., Pfohl, S. R., Shah, N., Fries, J., Sung, L. 2023; 13 (1): 3767

    Abstract

    Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informative global patterns that can improve the robustness of task-specific models. The objective was to evaluate the utility of EHR foundation models in improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction models. Transformer- and gated recurrent unit-based foundation models were pretrained on EHR of up to 1.8M patients (382M coded events) collected within pre-determined year groups (e.g., 2009-2012) and were subsequently used to construct patient representations for patients admitted to inpatient units. These representations were used to train logistic regression models to predict hospital mortality, long length of stay, 30-day readmission, and ICU admission. We compared our EHR foundation models with baseline logistic regression models learned on count-based representations (count-LR) in ID and OOD year groups. Performance was measured using area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve, and absolute calibration error. Both transformer and recurrent-based foundation models generally showed better ID and OOD discrimination relative to count-LR and often exhibited less decay in tasks where there is observable degradation of discrimination performance (average AUROC decay of 3% for transformer-based foundation model vs. 7% for count-LR after 5-9 years). In addition, the performance and robustness of transformer-based foundation models continued to improve as pretraining set size increased. These results suggest that pretraining EHR foundation models at scale is a useful approach for developing clinical prediction models that perform well in the presence of temporal distribution shift.

    View details for DOI 10.1038/s41598-023-30820-8

    View details for PubMedID 36882576

  • Evaluation of Feature Selection Methods for Preserving Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine. Methods of Information in Medicine Lemmon, J., Guo, L. L., Posada, J., Pfohl, S. R., Fries, J., Fleming, S. L., Aftandilian, C., Shah, N., Sung, L. 2023

    Abstract

    BACKGROUND: Temporal dataset shift can cause degradation in model performance as discrepancies between training and deployment data grow over time. The primary objective was to determine whether parsimonious models produced by specific feature selection methods are more robust to temporal dataset shift as measured by out-of-distribution (OOD) performance, while maintaining in-distribution (ID) performance. METHODS: Our dataset consisted of intensive care unit patients from MIMIC-IV categorized by year groups (2008-2010, 2011-2013, 2014-2016, and 2017-2019). We trained baseline models using L2-regularized logistic regression on 2008-2010 to predict in-hospital mortality, long length of stay (LOS), sepsis, and invasive ventilation in all year groups. We evaluated three feature selection methods: L1-regularized logistic regression (L1), Remove and Retrain (ROAR), and causal feature selection. We assessed whether a feature selection method could maintain ID performance (2008-2010) and improve OOD performance (2017-2019). We also assessed whether parsimonious models retrained on OOD data performed as well as oracle models trained on all features in the OOD year group. RESULTS: The baseline model showed significantly worse OOD performance with the long LOS and sepsis tasks when compared with the ID performance. L1 and ROAR retained 3.7 to 12.6% of all features, whereas causal feature selection generally retained fewer features. Models produced by L1 and ROAR exhibited similar ID and OOD performance as the baseline models. The retraining of these models on 2017-2019 data using features selected from training on 2008-2010 data generally reached parity with oracle models trained directly on 2017-2019 data using all available features. Causal feature selection led to heterogeneous results with the superset maintaining ID performance while improving OOD calibration only on the long LOS task. CONCLUSIONS: While model retraining can mitigate the impact of temporal dataset shift on parsimonious models produced by L1 and ROAR, new methods are required to proactively improve temporal robustness.

    View details for DOI 10.1055/s-0043-1762904

    View details for PubMedID 36812932

  • Investigating real-world consequences of biases in commonly used clinical calculators. The American Journal of Managed Care Yoo, R. M., Dash, D., Lu, J. H., Genkins, J. Z., Rabbani, N., Fries, J. A., Shah, N. H. 2023; 29 (1): e1-e7

    Abstract

    OBJECTIVES: To evaluate whether one summary metric of calculator performance sufficiently conveys equity across different demographic subgroups, as well as to evaluate how calculator predictive performance affects downstream health outcomes. STUDY DESIGN: We evaluate 3 commonly used clinical calculators - Model for End-Stage Liver Disease (MELD), CHA2DS2-VASc, and simplified Pulmonary Embolism Severity Index (sPESI) - on the cohort extracted from the Stanford Medicine Research Data Repository, following the cohort selection process as described in respective calculator derivation papers. METHODS: We quantified the predictive performance of the 3 clinical calculators across sex and race. Then, using the clinical guidelines that guide care based on these calculators' output, we quantified potential disparities in subsequent health outcomes. RESULTS: Across the examined subgroups, the MELD calculator exhibited worse performance for female and White populations, CHA2DS2-VASc calculator for the male population, and sPESI for the Black population. The extent to which such performance differences translated into differential health outcomes depended on the distribution of the calculators' scores around the thresholds used to trigger a care action via the corresponding guidelines. In particular, under the old guideline for CHA2DS2-VASc, among those who would not have been offered anticoagulant therapy, the Hispanic subgroup exhibited the highest rate of stroke. CONCLUSIONS: Clinical calculators, even when they do not include variables such as sex and race as inputs, can have very different care consequences across those subgroups. These differences in health care outcomes across subgroups can be explained by examining the distribution of scores and their calibration around the thresholds encoded in the accompanying care guidelines.

    View details for DOI 10.37765/ajmc.2023.89306

    View details for PubMedID 36716157

  • Perspective Toward Machine Learning Implementation in Pediatric Medicine: Mixed Methods Study. JMIR Medical Informatics Alexander, N., Aftandilian, C., Guo, L. L., Plenert, E., Posada, J., Fries, J., Fleming, S., Johnson, A., Shah, N., Sung, L. 2022; 10 (11): e40039

    Abstract

    BACKGROUND: Given the costs of machine learning implementation, a systematic approach to prioritizing which models to implement into clinical practice may be valuable. OBJECTIVE: The primary objective was to determine the health care attributes respondents at 2 pediatric institutions rate as important when prioritizing machine learning model implementation. The secondary objective was to describe their perspectives on implementation using a qualitative approach. METHODS: In this mixed methods study, we distributed a survey to health system leaders, physicians, and data scientists at 2 pediatric institutions. We asked respondents to rank the following 5 attributes in terms of implementation usefulness: the clinical problem was common, the clinical problem caused substantial morbidity and mortality, risk stratification led to different actions that could reasonably improve patient outcomes, reducing physician workload, and saving money. Important attributes were those ranked as first or second most important. Individual qualitative interviews were conducted with a subsample of respondents. RESULTS: Among 613 eligible respondents, 275 (44.9%) responded. Qualitative interviews were conducted with 17 respondents. The most common important attributes were risk stratification leading to different actions (205/275, 74.5%) and clinical problem causing substantial morbidity or mortality (177/275, 64.4%). The attributes considered least important were reducing physician workload and saving money. Qualitative interviews consistently prioritized implementations that improved patient outcomes. CONCLUSIONS: Respondents prioritized machine learning model implementation where risk stratification would lead to different actions and clinical problems that caused substantial morbidity and mortality. Implementations that improved patient outcomes were prioritized. These results can help provide a framework for machine learning model implementation.

    View details for DOI 10.2196/40039

    View details for PubMedID 36394938

  • Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Scientific Reports Guo, L. L., Pfohl, S. R., Fries, J., Johnson, A. E., Posada, J., Aftandilian, C., Shah, N., Sung, L. 2022; 12 (1): 2726

    Abstract

    Temporal dataset shift associated with changes in healthcare over time is a barrier to deploying machine learning-based clinical decision support systems. Algorithms that learn robust models by estimating invariant properties across time periods for domain generalization (DG) and unsupervised domain adaptation (UDA) might be suitable to proactively mitigate dataset shift. The objective was to characterize the impact of temporal dataset shift on clinical prediction models and benchmark DG and UDA algorithms on improving model robustness. In this cohort study, intensive care unit patients from the MIMIC-IV database were categorized by year groups (2008-2010, 2011-2013, 2014-2016 and 2017-2019). Tasks were predicting mortality, long length of stay, sepsis and invasive ventilation. Feedforward neural networks were used as prediction models. The baseline experiment trained models using empirical risk minimization (ERM) on 2008-2010 (ERM[08-10]) and evaluated them on subsequent year groups. DG experiment trained models using algorithms that estimated invariant properties using 2008-2016 and evaluated them on 2017-2019. UDA experiment leveraged unlabelled samples from 2017 to 2019 for unsupervised distribution matching. DG and UDA models were compared to ERM[08-16] models trained using 2008-2016. Main performance measures were area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve and absolute calibration error. Threshold-based metrics including false-positives and false-negatives were used to assess the clinical impact of temporal dataset shift and its mitigation strategies. In the baseline experiments, dataset shift was most evident for sepsis prediction (maximum AUROC drop, 0.090; 95% confidence interval (CI), 0.080-0.101). Considering a scenario of 100 consecutively admitted patients showed that ERM[08-10] applied to 2017-2019 was associated with one additional false-negative among 11 patients with sepsis, when compared to the model applied to 2008-2010. When compared with ERM[08-16], DG and UDA experiments failed to produce more robust models (range of AUROC difference, -0.003 to 0.050). In conclusion, DG and UDA failed to produce more robust models compared to ERM in the setting of temporal dataset shift. Alternate approaches are required to preserve model performance over time in clinical medicine.

    View details for DOI 10.1038/s41598-022-06484-1

    View details for PubMedID 35177653
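
    The baseline experiment above, training on one era and measuring degradation on later ones, can be sketched with a simple temporal split. This is an illustrative outline on synthetic data, not the study code.

        # Train on an early year group, then track AUROC across later groups.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score

        rng = np.random.default_rng(0)
        groups = {g: (rng.normal(size=(500, 20)), rng.integers(0, 2, 500))
                  for g in ["2008-2010", "2011-2013", "2014-2016", "2017-2019"]}

        X_tr, y_tr = groups["2008-2010"]          # ERM on the earliest era
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

        for era, (X, y) in groups.items():        # ID vs. OOD discrimination
            print(era, round(roc_auc_score(y, model.predict_proba(X)[:, 1]), 3))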

  • Dataset Debt in Biomedical Language Modeling. Fries, J., Seelam, N., Altay, G., Weber, L., Kang, M., Datta, D., Su, R., Garda, S., Wang, B., Ott, S., Samwald, M., Kusa, W. Association for Computational Linguistics (ACL). 2022: 137-145
  • PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts. Bach, S. H., Sanh, V., Yong, Z., Webson, A., Raffel, C., Nayak, N., Sharma, A., Kim, T., Bari, M., Fevry, T., Alyafeai, Z., Dey, M., Santilli, A., Sun, Z., Ben-David, S., Xu, C., Chhablani, G., Wang, H., Fries, J., Al-shaibani, M. S., Sharma, S., Thakker, U., Almubarak, K., Tang, X., Radev, D., Jiang, M., Rush, A. M. Association for Computational Linguistics (ACL). 2022: 93-104
  • Systematic Review of Approaches to Preserve Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine. Applied Clinical Informatics Guo, L. L., Pfohl, S. R., Fries, J., Posada, J., Fleming, S. L., Aftandilian, C., Shah, N., Sung, L. 2021; 12 (4): 808-815

    Abstract

    OBJECTIVE: The change in performance of machine learning models over time as a result of temporal dataset shift is a barrier to machine learning-derived models facilitating decision-making in clinical practice. Our aim was to describe technical procedures used to preserve the performance of machine learning models in the presence of temporal dataset shifts. METHODS: Studies were included if they were fully published articles that used machine learning and implemented a procedure to mitigate the effects of temporal dataset shift in a clinical setting. We described how dataset shift was measured, the procedures used to preserve model performance, and their effects. RESULTS: Of 4,457 potentially relevant publications identified, 15 were included. The impact of temporal dataset shift was primarily quantified using changes, usually deterioration, in calibration or discrimination. Calibration deterioration was more common (n=11) than discrimination deterioration (n=3). Mitigation strategies were categorized as model level or feature level. Model-level approaches (n=15) were more common than feature-level approaches (n=2), with the most common approaches being model refitting (n=12), probability calibration (n=7), model updating (n=6), and model selection (n=6). In general, all mitigation strategies were successful at preserving calibration but not uniformly successful in preserving discrimination. CONCLUSION: There was limited research in preserving the performance of machine learning models in the presence of temporal dataset shift in clinical medicine. Future research could focus on the impact of dataset shift on clinical decision making, benchmark the mitigation strategies on a wider range of datasets and tasks, and identify optimal strategies for specific settings.

    View details for DOI 10.1055/s-0041-1735184

    View details for PubMedID 34470057

  • Assessment of Extractability and Accuracy of Electronic Health Record Data for Joint Implant Registries. JAMA Network Open Giori, N. J., Radin, J., Callahan, A., Fries, J. A., Halilaj, E., Re, C., Delp, S. L., Shah, N. H., Harris, A. H. 2021; 4 (3): e211728

    Abstract

    Importance: Implant registries provide valuable information on the performance of implants in a real-world setting, yet they have traditionally been expensive to establish and maintain. Electronic health records (EHRs) are widely used and may include the information needed to generate clinically meaningful reports similar to a formal implant registry. Objectives: To quantify the extractability and accuracy of registry-relevant data from the EHR and to assess the ability of these data to track trends in implant use and the durability of implants (hereafter referred to as implant survivorship), using data stored since 2000 in the EHR of the largest integrated health care system in the United States. Design, Setting, and Participants: Retrospective cohort study of a large EHR of veterans who had 45,351 total hip arthroplasty procedures in Veterans Health Administration hospitals from 2000 to 2017. Data analysis was performed from January 1, 2000, to December 31, 2017. Exposures: Total hip arthroplasty. Main Outcomes and Measures: Number of total hip arthroplasty procedures extracted from the EHR, trends in implant use, and relative survivorship of implants. Results: A total of 45,351 total hip arthroplasty procedures were identified from 2000 to 2017 with 192,805 implant parts. Data completeness improved over time. After 2014, 85% of prosthetic heads, 91% of shells, 81% of stems, and 85% of liners used in the Veterans Health Administration health care system were identified by part number. Revision burden and trends in metal vs ceramic prosthetic femoral head use were found to reflect data from the American Joint Replacement Registry. Recalled implants were obvious negative outliers in implant survivorship using Kaplan-Meier curves. Conclusions and Relevance: Although loss to follow-up remains a challenge that requires additional attention to improve the quantitative nature of calculated implant survivorship, we conclude that data collected during routine clinical care and stored in the EHR of a large health system over 18 years were sufficient to provide clinically meaningful data on trends in implant use and to identify poor implants that were subsequently recalled. This automated approach was low cost and had no reporting burden. This low-cost, low-overhead method to assess implant use and performance within a large health care setting may be useful to internal quality assurance programs and, on a larger scale, to postmarket surveillance of implant performance.

    View details for DOI 10.1001/jamanetworkopen.2021.1728

    View details for PubMedID 33720372

  • Estimating the efficacy of symptom-based screening for COVID-19. npj Digital Medicine Callahan, A., Steinberg, E., Fries, J. A., Gombar, S., Patel, B., Corbin, C. K., Shah, N. H. 2020; 3 (1): 95

    Abstract

    There is substantial interest in using presenting symptoms to prioritize testing for COVID-19 and establish symptom-based surveillance. However, little is currently known about the specificity of COVID-19 symptoms. To assess the feasibility of symptom-based screening for COVID-19, we used data from tests for common respiratory viruses and SARS-CoV-2 in our health system to measure the ability to correctly classify virus test results based on presenting symptoms. Based on these results, symptom-based screening may not be an effective strategy to identify individuals who should be tested for SARS-CoV-2 infection or to obtain a leading indicator of new COVID-19 cases.

    View details for DOI 10.1038/s41746-020-0300-0

    View details for PubMedID 33597700

  • Measure what matters: Counts of hospitalized patients are a better metric for health system capacity planning for a reopening. Journal of the American Medical Informatics Association (JAMIA) Kashyap, S., Gombar, S., Yadlowsky, S., Callahan, A., Fries, J., Pinsky, B. A., Shah, N. H. 2020

    Abstract

    OBJECTIVE: Responding to the COVID-19 pandemic requires accurate forecasting of health system capacity requirements using readily available inputs. We examined whether testing and hospitalization data could help quantify the anticipated burden on the health system given a shelter-in-place (SIP) order. MATERIALS AND METHODS: 16,103 SARS-CoV-2 RT-PCR tests were performed on 15,807 patients at Stanford facilities between March 2 and April 11, 2020. We analyzed the fraction of tested patients that were confirmed positive for COVID-19, the fraction of those needing hospitalization, and the fraction requiring ICU admission over the 40 days between March 2 and April 11, 2020. RESULTS: We find a marked slowdown in the hospitalization rate within ten days of SIP even as cases continued to rise. We also find a shift towards younger patients in the age distribution of those testing positive for COVID-19 over the four weeks of SIP. The impact of this shift is a divergence between increasing positive case confirmations and slowing new hospitalizations, both of which affect the demand on health systems. CONCLUSION: Without using local hospitalization rates and the age distribution of positive patients, current models are likely to overestimate the resource burden of COVID-19. It is imperative that health systems start using these data to quantify effects of SIP and aid reopening planning.

    View details for DOI 10.1093/jamia/ocaa076

    View details for PubMedID 32548636

  • Assessing the accuracy of automatic speech recognition for psychotherapy. npj Digital Medicine Miner, A. S., Haque, A., Fries, J. A., Fleming, S. L., Wilfley, D. E., Wilson, G. T., Milstein, A., Jurafsky, D., Arnow, B. A., Agras, W. S., Fei-Fei, L., Shah, N. H. 2020; 3: 82

    Abstract

    Accurate transcription of audio recordings in psychotherapy would improve therapy effectiveness, clinician training, and safety monitoring. Although automatic speech recognition software is commercially available, its accuracy in mental health settings has not been well described. It is unclear which metrics and thresholds are appropriate for different clinical use cases, which may range from population descriptions to individual safety monitoring. Here we show that automatic speech recognition is feasible in psychotherapy, but further improvements in accuracy are needed before widespread use. Our HIPAA-compliant automatic speech recognition system demonstrated a transcription word error rate of 25%. For depression-related utterances, sensitivity was 80% and positive predictive value was 83%. For clinician-identified harm-related sentences, the word error rate was 34%. These results suggest that automatic speech recognition may support understanding of language patterns and subgroup variation in existing treatments but may not be ready for individual-level safety surveillance.

    View details for DOI 10.1038/s41746-020-0285-8

    View details for PubMedID 32550644

    View details for PubMedCentralID PMC7270106

  • Language models are an effective representation learning technique for electronic health record data. Journal of Biomedical Informatics Steinberg, E., Jung, K., Fries, J. A., Corbin, C. K., Pfohl, S. R., Shah, N. H. 2020: 103637

    Abstract

    Widespread adoption of electronic health records (EHRs) has fueled the use of machine learning to build prediction models for various clinical outcomes. However, this process is often constrained by having a relatively small number of patient records for training the model. We demonstrate that using patient representation schemes inspired by techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, where only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.

    View details for DOI 10.1016/j.jbi.2020.103637

    View details for PubMedID 33290879
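
    The transfer pattern described above can be sketched as follows; encode_patients is a hypothetical stand-in for a language-model-style encoder pretrained on all patient timelines, and the data here are synthetic.

        # Pretrain a representation on the full population, then fit a small
        # task-specific classifier on the limited labeled subset.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def encode_patients(event_sequences):
            # Placeholder for a pretrained sequence encoder over coded EHR
            # events; returns one fixed-length vector per patient.
            rng = np.random.default_rng(0)
            return rng.normal(size=(len(event_sequences), 128))

        sequences = [["E11.9", "I10"], ["J18.9"], ["I10", "N18.3"]] * 100
        X = encode_patients(sequences)                        # all patients
        y = np.random.default_rng(1).integers(0, 2, len(X))   # toy labels

        # Only a small labeled cohort is available for the downstream task.
        clf = LogisticRegression(max_iter=1000).fit(X[:50], y[:50])
        risk = clf.predict_proba(X[50:])[:, 1]                # predicted risk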

  • Snorkel: rapid training data creation with weak supervision. The VLDB Journal Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., Re, C. 2020; 29 (2): 709–30

    Abstract

    Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

    View details for DOI 10.1007/s00778-019-00552-1

    View details for PubMedID 32214778

  • The accuracy vs. coverage trade-off in patient-facing diagnosis models. AMIA Joint Summits on Translational Science Proceedings Kannan, A., Fries, J. A., Kramer, E., Chen, J. J., Shah, N., Amatriain, X. 2020; 2020: 298–307

    Abstract

    A third of adults in America use the Internet to diagnose medical concerns, and online symptom checkers are increasingly part of this process. These tools are powered by diagnosis models similar to clinical decision support systems, with the primary difference being the coverage of symptoms and diagnoses. To be useful to patients and physicians, these models must have high accuracy while covering a meaningful space of symptoms and diagnoses. To the best of our knowledge, this paper is the first to study the trade-off between the coverage of the model and its performance for diagnosis. To this end, we learn diagnosis models with different coverage from EHR data. We find a 1% drop in top-3 accuracy for every 10 diseases added to the coverage. We also observe that model complexity does not affect performance, with linear models performing as well as neural networks.

    View details for PubMedID 32477649

  • Cardiac Imaging of Aortic Valve Area from 34,287 UK Biobank Participants Reveals Novel Genetic Associations and Shared Genetic Comorbidity with Multiple Disease Phenotypes. Circulation: Genomic and Precision Medicine Córdova-Palomera, A., Tcheandjieu, C., Fries, J., Varma, P., Chen, V. S., Fiterau, M., Xiao, K., Tejeda, H., Keavney, B., Cordell, H. J., Tanigawa, Y., Venkataraman, G., Rivas, M., Ré, C., Ashley, E. A., Priest, J. R. 2020

    Abstract

    Background - The aortic valve is an important determinant of cardiovascular physiology and anatomic location of common human diseases. Methods - From a sample of 34,287 white British-ancestry participants, we estimated functional aortic valve area by planimetry from prospectively obtained cardiac MRI sequences of the aortic valve. Aortic valve area measurements were submitted to genome-wide association testing, followed by polygenic risk scoring and phenome-wide screening to identify genetic comorbidities. Results - A genome-wide association study of aortic valve area in these UK Biobank participants showed three significant associations, indexed by rs71190365 (chr13:50764607, DLEU1, p=1.8×10⁻⁹), rs35991305 (chr12:94191968, CRADD, p=3.4×10⁻⁸) and chr17:45013271:C:T (GOSR2, p=5.6×10⁻⁸). Replication on an independent set of 8,145 unrelated European-ancestry participants showed consistent effect sizes in all three loci, although rs35991305 did not meet nominal significance. We constructed a polygenic risk score for aortic valve area, which in a separate cohort of 311,728 individuals without imaging demonstrated that smaller aortic valve area is predictive of increased risk for aortic valve disease (Odds Ratio 1.14, p=2.3×10⁻⁶). After excluding subjects with a medical diagnosis of aortic valve stenosis (remaining n=308,683 individuals), phenome-wide association of >10,000 traits showed multiple links between the polygenic score for aortic valve disease and key health-related comorbidities involving the cardiovascular system and autoimmune disease. Genetic correlation analysis supports a shared genetic etiology between aortic valve area and birthweight along with other cardiovascular conditions. Conclusions - These results illustrate the use of automated phenotyping of cardiac imaging data from the general population to investigate the genetic etiology of aortic valve disease, perform clinical prediction, and uncover new clinical and genetic correlates of cardiac anatomy.

    View details for DOI 10.1161/CIRCGEN.120.003014

    View details for PubMedID 33125279

  • Medical device surveillance with electronic health records. npj Digital Medicine Callahan, A., Fries, J. A., Ré, C., Huddleston, J. I., Giori, N. J., Delp, S., Shah, N. H. 2019; 2: 94

    Abstract

    Post-market medical device surveillance is a challenge facing manufacturers, regulatory agencies, and health care providers. Electronic health records are valuable sources of real-world evidence for assessing device safety and tracking device-related patient outcomes over time. However, distilling this evidence remains challenging, as information is fractured across clinical notes and structured records. Modern machine learning methods for machine reading promise to unlock increasingly complex information from text, but face barriers due to their reliance on large and expensive hand-labeled training sets. To address these challenges, we developed and validated state-of-the-art deep learning methods that identify patient outcomes from clinical notes without requiring hand-labeled training data. Using hip replacements-one of the most common implantable devices-as a test case, our methods accurately extracted implant details and reports of complications and pain from electronic health records with up to 96.3% precision, 98.5% recall, and 97.4% F1, improved classification performance by 12.8-53.9% over rule-based methods, and detected over six times as many complication events compared to using structured data alone. Using these additional events to assess complication-free survivorship of different implant systems, we found significant variation between implants, including for risk of revision surgery, which could not be detected using coded data alone. Patients with revision surgeries had more hip pain mentions in the post-hip replacement, pre-revision period compared to patients with no evidence of revision surgery (mean hip pain mentions 4.97 vs. 3.23; t = 5.14; p < 0.001). Some implant models were associated with higher or lower rates of hip pain mentions. Our methods complement existing surveillance mechanisms by requiring orders of magnitude less hand-labeled training data, offering a scalable solution for national medical device surveillance using electronic health records.

    View details for DOI 10.1038/s41746-019-0168-z

    View details for PubMedID 31583282

    View details for PubMedCentralID PMC6761113

  • Multi-Resolution Weak Supervision for Sequential Data. Sala, F., Varma, P., Fries, J., Fu, D. Y., Sagawa, S., Khattar, S., Ramamoorthy, A., Xiao, K., Fatahalian, K., Priest, J., Re, C. Advances in Neural Information Processing Systems (NeurIPS). 2019
  • ShortFuse: Biomedical Time Series Representations in the Presence of Structured Information. Proceedings of Machine Learning Research Fiterau, M., Bhooshan, S., Fries, J., Bournhonesque, C., Hicks, J., Halilaj, E., Ré, C., Delp, S. 2017; 68: 59–74

    Abstract

    In healthcare applications, temporal variables that encode movement, health status and longitudinal patient evolution are often accompanied by rich structured information such as demographics, diagnostics and medical exam data. However, current methods do not jointly optimize over structured covariates and time series in the feature extraction process. We present ShortFuse, a method that boosts the accuracy of deep learning models for time series by explicitly modeling temporal interactions and dependencies with structured covariates. ShortFuse introduces hybrid convolutional and LSTM cells that incorporate the covariates via weights that are shared across the temporal domain. ShortFuse outperforms competing models by 3% on two biomedical applications, forecasting osteoarthritis-related cartilage degeneration and predicting surgical outcomes for cerebral palsy patients, matching or exceeding the accuracy of models that use features engineered by domain experts.

    View details for PubMedID 30882086
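
    As a rough sketch of fusing structured covariates with a temporal model, the snippet below conditions an LSTM on static covariates shared across all timesteps. This is a common simplification for illustration, not the paper's exact hybrid cells.

        # Condition a temporal model on static covariates (PyTorch).
        import torch
        import torch.nn as nn

        class CovariateConditionedLSTM(nn.Module):
            def __init__(self, ts_dim, cov_dim, hidden_dim, n_classes):
                super().__init__()
                # Project covariates once; the result is shared across time.
                self.cov_proj = nn.Linear(cov_dim, hidden_dim)
                self.lstm = nn.LSTM(ts_dim + hidden_dim, hidden_dim, batch_first=True)
                self.head = nn.Linear(hidden_dim, n_classes)

            def forward(self, x, cov):
                # x: (batch, time, ts_dim); cov: (batch, cov_dim)
                c = self.cov_proj(cov).unsqueeze(1).expand(-1, x.size(1), -1)
                out, _ = self.lstm(torch.cat([x, c], dim=-1))
                return self.head(out[:, -1])   # classify from last hidden state

        model = CovariateConditionedLSTM(ts_dim=8, cov_dim=5, hidden_dim=32, n_classes=2)
        logits = model(torch.randn(4, 50, 8), torch.randn(4, 5))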

  • Brundlefly at SemEval-2016 Task 12: Recurrent Neural Networks vs. Joint Inference for Clinical Temporal Information Extraction. Fries, J. A. 2016: 1274–79

    View details for DOI 10.18653/v1/S16-1198