All Publications


  • Measuring Equity in Readmission as an Assessment of Hospital Performance. JAMA Gallo, R. J., Santiago, C. 2024

    View details for DOI 10.1001/jama.2024.4351

    View details for PubMedID 38648050

  • Affiliation Bias in Peer Review of Abstracts. JAMA Gallo, R. J., Savage, T., Chen, J. H. 2024; 331 (14): 1234-1235

    View details for DOI 10.1001/jama.2024.3520

    View details for PubMedID 38592392

  • Effectiveness of an Artificial Intelligence-Enabled Intervention for Detecting Clinical Deterioration. JAMA internal medicine Gallo, R. J., Shieh, L., Smith, M., Marafino, B. J., Geldsetzer, P., Asch, S. M., Shum, K., Lin, S., Westphal, J., Hong, G., Li, R. C. 2024

    Abstract

    Inpatient clinical deterioration is associated with substantial morbidity and mortality but may be easily missed by clinicians. Early warning scores have been developed to alert clinicians to patients at high risk of clinical deterioration, but there is limited evidence for their effectiveness. To evaluate the effectiveness of an artificial intelligence deterioration model-enabled intervention to reduce the risk of escalations in care among hospitalized patients using a study design that facilitates stronger causal inference. This cohort study used a regression discontinuity design that controlled for confounding and was based on Epic Deterioration Index (EDI; Epic Systems Corporation) prediction model scores. Compared with other observational research, the regression discontinuity design facilitates causal analysis. Hospitalized adults were included from 4 general internal medicine units in 1 academic hospital from January 17, 2021, through November 16, 2022. An artificial intelligence deterioration model-enabled intervention, consisting of alerts based on an EDI score threshold with an associated collaborative workflow among nurses and physicians. The primary outcome was escalations in care, including rapid response team activation, transfer to the intensive care unit, or cardiopulmonary arrest during hospitalization. During the study, 9938 patients were admitted to 1 of the 4 units, with 963 patients (median [IQR] age, 76.1 [64.2-86.2] years; 498 males [52.3%]) included within the primary regression discontinuity analysis. The median (IQR) Elixhauser Comorbidity Index score in the primary analysis cohort was 10 (0-24). The intervention was associated with a -10.4-percentage point (95% CI, -20.1 to -0.8 percentage points; P = .03) absolute risk reduction in the primary outcome for patients at the EDI score threshold. There was no evidence of a discontinuity in measured confounders at the EDI score threshold. Using a regression discontinuity design, this cohort study found that the implementation of an artificial intelligence deterioration model-enabled intervention was associated with a significantly decreased risk of escalations in care among inpatients. These results provide evidence for the effectiveness of this intervention and support its further expansion and testing in other care settings.

    View details for DOI 10.1001/jamainternmed.2024.0084

    View details for PubMedID 38526472

  • Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study. medRxiv : the preprint server for health sciences Goh, E., Gallo, R., Hom, J., Strong, E., Weng, Y., Kerman, H., Cool, J., Kanjee, Z., Parsons, A. S., Ahuja, N., Horvitz, E., Yang, D., Milstein, A., Olson, A. P., Rodman, A., Chen, J. H. 2024

    Abstract

    Diagnostic errors are common and cause significant morbidity. Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves diagnostic reasoning. To assess the impact of the GPT-4 LLM on physicians' diagnostic reasoning compared to conventional resources. Multi-center, randomized clinical vignette study. The study was conducted using remote video conferencing with physicians across the country and in-person participation across multiple academic medical institutions. Resident and attending physicians with training in family medicine, internal medicine, or emergency medicine. Participants were randomized to access GPT-4 in addition to conventional diagnostic resources or to just conventional resources. They were allocated 60 minutes to review up to six clinical vignettes adapted from established diagnostic reasoning exams. The primary outcome was diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps. Secondary outcomes included time spent per case and final diagnosis. 50 physicians (26 attendings, 24 residents) participated, with an average of 5.2 cases completed per participant. The median diagnostic reasoning score per case was 76.3 percent (IQR 65.8 to 86.8) for the GPT-4 group and 73.7 percent (IQR 63.2 to 84.2) for the conventional resources group, with an adjusted difference of 1.6 percentage points (95% CI -4.4 to 7.6; p=0.60). The median time spent on cases for the GPT-4 group was 519 seconds (IQR 371 to 668 seconds), compared to 565 seconds (IQR 456 to 788 seconds) for the conventional resources group, with a time difference of -82 seconds (95% CI -195 to 31; p=0.20). GPT-4 alone scored 15.5 percentage points (95% CI 1.5 to 29, p=0.03) higher than the conventional resources group. In a clinical vignette-based study, the availability of GPT-4 to physicians as a diagnostic aid did not significantly improve clinical reasoning compared to conventional resources, although it may improve components of clinical reasoning such as efficiency. GPT-4 alone demonstrated higher performance than both physician groups, suggesting opportunities for further improvement in physician-AI collaboration in clinical practice.

    View details for DOI 10.1101/2024.03.12.24303785

    View details for PubMedID 38559045

    View details for PubMedCentralID PMC10980135

  • Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ digital medicine Savage, T., Nayak, A., Gallo, R., Rangan, E., Chen, J. H. 2024; 7 (1): 20

    Abstract

    One of the major barriers to using large language models (LLMs) in medicine is the perception that they use uninterpretable methods to make clinical decisions that are inherently different from the cognitive processes of clinicians. In this manuscript we develop diagnostic reasoning prompts to study whether LLMs can imitate clinical reasoning while accurately forming a diagnosis. We find that GPT-4 can be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy. This is significant because an LLM that can imitate clinical reasoning to provide an interpretable rationale offers physicians a means to evaluate whether an LLM's response is likely correct and can be trusted for patient care. Prompting methods that use diagnostic reasoning have the potential to mitigate the "black box" limitations of LLMs, bringing them one step closer to safe and effective use in medicine.

    View details for DOI 10.1038/s41746-024-01010-1

    View details for PubMedID 38267608

    View details for PubMedCentralID PMC9931230

  • Things We Do for No Reason™: Routine early PEG tube placement for dysphagia after acute stroke. Journal of hospital medicine Gallo, R. J., Wang, J. E., Madill, E. S. 2024

    View details for DOI 10.1002/jhm.13263

    View details for PubMedID 38180160

  • ChatGPT Influence on Medical Decision-Making, Bias, and Equity: A Randomized Study of Clinicians Evaluating Clinical Vignettes. medRxiv : the preprint server for health sciences Goh, E., Bunning, B., Khoong, E., Gallo, R., Milstein, A., Centola, D., Chen, J. H. 2023

    Abstract

    In a randomized, pre-post intervention study, we evaluated the influence of a large language model (LLM) generative AI system on accuracy of physician decision-making and bias in healthcare. 50 US-licensed physicians reviewed a video clinical vignette, featuring actors representing different demographics (a White male or a Black female) with chest pain. Participants were asked to answer clinical questions around triage, risk, and treatment based on these vignettes, then asked to reconsider after receiving advice generated by ChatGPT+ (GPT4). The primary outcome was the accuracy of clinical decisions based on pre-established evidence-based guidelines. Results showed that physicians are willing to change their initial clinical impressions given AI assistance, and that this led to a significant improvement in clinical decision-making accuracy in a chest pain evaluation scenario without introducing or exacerbating existing race or gender biases. A survey of physician participants indicates that the majority expect LLM tools to play a significant role in clinical decision making.

    View details for DOI 10.1101/2023.11.24.23298844

    View details for PubMedID 38076944

    View details for PubMedCentralID PMC10705632

  • K Grant Funding to Internal Medicine Specialties. Journal of general internal medicine Gallo, R. J., Asch, S. M., Chan, D. C. 2023

    View details for DOI 10.1007/s11606-023-08483-y

    View details for PubMedID 37904071

  • Administrative Coding Versus Laboratory Diagnosis of Inpatient Hypoglycemia. Diabetes care Gallo, R. J., Fang, D. Z., Heidenreich, P. A. 2023

    View details for DOI 10.2337/dc23-0053

    View details for PubMedID 37068271

  • Addition of Coronary Artery Calcium Scores to Primary Prevention Risk Estimation Models-Primum Non Nocere. JAMA internal medicine Gallo, R. J., Brown, D. L. 2022

    View details for DOI 10.1001/jamainternmed.2022.1258

    View details for PubMedID 35467702