Robert J. Gallo's Profile | Stanford Profiles

Stanford Advisors

Ranak Trivedi, Postdoctoral Faculty Sponsor

All Publications

Physician clinical decision modification and bias assessment in a randomized controlled trial of AI assistance. Communications medicine Goh, E., Bunning, B., Khoong, E. C., Gallo, R. J., Milstein, A., Centola, D., Chen, J. H. 2025; 5 (1): 59

Abstract

Artificial intelligence assistance in clinical decision making shows promise, but concerns exist about potential exacerbation of demographic biases in healthcare. This study aims to evaluate how physician clinical decisions and biases are influenced by AI assistance in a chest pain triage scenario. A randomized, pre-post intervention study was conducted with 50 US-licensed physicians who reviewed standardized chest pain video vignettes featuring either a white male or Black female patient. Participants answered clinical questions about triage, risk assessment, and treatment before and after receiving GPT-4 generated recommendations. Clinical decision accuracy was evaluated against evidence-based guidelines. Here we show that physicians are willing to modify their clinical decisions based on GPT-4 assistance, leading to improved accuracy scores from 47% to 65% in the white male patient group and 63% to 80% in the Black female patient group. The accuracy improvement occurs without introducing or exacerbating demographic biases, with both groups showing similar magnitudes of improvement (18%). A post-study survey indicates that 90% of physicians expect AI tools to play a significant role in future clinical decision making. Physician clinical decision making can be augmented by AI assistance while maintaining equitable care across patient demographics. These findings suggest a path forward for AI clinical decision support that improves medical care without amplifying healthcare disparities.

View details for DOI 10.1038/s43856-025-00781-2

View details for PubMedID 40038550

View details for PubMedCentralID 10582782
GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nature medicine Goh, E., Gallo, R. J., Strong, E., Weng, Y., Kerman, H., Freed, J. A., Cool, J. A., Kanjee, Z., Lane, K. P., Parsons, A. S., Ahuja, N., Horvitz, E., Yang, D., Milstein, A., Olson, A. P., Hom, J., Chen, J. H., Rodman, A. 2025

Abstract

While large language models (LLMs) have shown promise in diagnostic reasoning, their impact on management reasoning, which involves balancing treatment decisions and testing strategies while managing risk, is unknown. This prospective, randomized, controlled trial assessed whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources. From November 2023 to April 2024, 92 practicing physicians were randomized to use either GPT-4 plus conventional resources or conventional resources alone to answer five expert-developed clinical vignettes in a simulated setting. All cases were based on real, de-identified patient encounters, with information revealed sequentially to mirror the nature of clinical environments. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case. Physicians using the LLM scored significantly higher compared to those using conventional resources (mean difference = 6.5%, 95% confidence interval (CI) = 2.7 to 10.2, P < 0.001). LLM users spent more time per case (mean difference = 119.3 s, 95% CI = 17.4 to 221.2, P = 0.02). There was no significant difference between LLM-augmented physicians and LLM alone (-0.9%, 95% CI = -9.0 to 7.2, P = 0.8). LLM assistance can improve physician management reasoning in complex clinical vignettes compared to conventional resources and should be validated in real clinical practice. ClinicalTrials.gov registration: NCT06208423.

View details for DOI 10.1038/s41591-024-03456-y

View details for PubMedID 39910272

View details for PubMedCentralID 10273128
Inpatient Metformin Utilization and Post-hospitalization Clinical Outcomes: An Observational Cohort Study. Journal of general internal medicine Gallo, R. J., Lin, S., Fang, D. Z., Glassman, P. A., Sahay, A., Heidenreich, P. A. 2025

Abstract

Metformin is the first-line treatment for diabetes, with multiple long-term benefits. However, there is limited evidence for its use in the inpatient setting, and clinical guidelines have historically recommended holding oral diabetes medications during acute hospitalization. While studies have not found evidence of harm from continuing metformin during hospitalization, withholding may lead to unnecessary insulin prescriptions, which in turn may lead to hypoglycemia events after discharge and other associated complications. To investigate the association between metformin use during hospitalization and post-hospitalization outcomes. Observational cohort study from January 2016 to January 2022, emulating a target trial. Adults with type 2 diabetes admitted to a Veterans Health Administration hospital for common medical conditions. Continuation of an outpatient metformin prescription during hospitalization. Hypoglycemia within 90 days of discharge. Secondary outcomes included insulin prescriptions at discharge, 90-day readmissions, and 90-day mortality. The propensity-matched cohort included 67,162 hospitalizations, equally split between those who did and did not have metformin continued during hospitalization. Within 90 days of hospital discharge, those that received metformin had lower risk of hypoglycemia (1.5% vs 1.8%; OR 0.83, 95% CI 0.73-0.93; p = 0.003), readmissions (29.4% vs 30.6%; OR 0.96, 95% CI 0.92-1.00; p = 0.03), and mortality (6.4% vs 7.4%; OR 0.86, 95% CI 0.80-0.92; p < 0.001). Patients receiving metformin also had lower risk of insulin prescriptions at discharge (18.5% vs 20.3%; OR 0.89, 95% CI 0.84-0.95; p < 0.001). Continuation of metformin during hospitalization for patients with type 2 diabetes was associated with decreased risk of post-hospitalization insulin prescriptions and 90-day hypoglycemia, readmissions, and mortality. These findings question clinical guideline recommendations to hold metformin during hospitalization.

View details for DOI 10.1007/s11606-025-09384-y

View details for PubMedID 39900873

View details for PubMedCentralID 2681039
Clinical entity augmented retrieval for clinical information extraction. NPJ digital medicine Lopez, I., Swaminathan, A., Vedula, K., Narayanan, S., Nateghi Haredasht, F., Ma, S. P., Liang, A. S., Tate, S., Maddali, M., Gallo, R. J., Shah, N. H., Chen, J. H. 2025; 8 (1): 45

Abstract

Large language models (LLMs) with retrieval-augmented generation (RAG) have improved information extraction over previous methods, yet their reliance on embeddings often leads to inefficient retrieval. We introduce CLinical Entity Augmented Retrieval (CLEAR), a RAG pipeline that retrieves information using entities. We compared CLEAR to embedding RAG and full-note approaches for extracting 18 variables using six LLMs across 20,000 clinical notes. Average F1 scores were 0.90, 0.86, and 0.79; inference times were 4.95, 17.41, and 20.08 s per note; average model queries were 1.68, 4.94, and 4.18 per note; and average input tokens were 1.1k, 3.8k, and 6.1k per note for CLEAR, embedding RAG, and full-note approaches, respectively. In conclusion, CLEAR utilizes clinical entities for information retrieval and achieves >70% reduction in token usage and inference time with improved performance compared to modern methods.

View details for DOI 10.1038/s41746-024-01377-1

View details for PubMedID 39828800

View details for PubMedCentralID 4287068
Establishing best practices in large language model research: an application to repeat prompting. Journal of the American Medical Informatics Association : JAMIA Gallo, R. J., Baiocchi, M., Savage, T. R., Chen, J. H. 2024

Abstract

We aimed to demonstrate the importance of establishing best practices in large language model research, using repeat prompting as an illustrative example. Using data from a prior study investigating potential model bias in peer review of medical abstracts, we compared methods that ignore correlation in model outputs from repeated prompting with a random effects method that accounts for this correlation. High correlation within groups was found when repeatedly prompting the model, with intraclass correlation coefficient of 0.69. Ignoring the inherent correlation in the data led to over 100-fold inflation of effective sample size. After appropriately accounting for this issue, the authors' results reverse from a small but highly significant finding to no evidence of model bias. The establishment of best practices for LLM research is urgently needed, as demonstrated in this case where accounting for repeat prompting in analyses was critical for accurate study conclusions.

View details for DOI 10.1093/jamia/ocae294

View details for PubMedID 39656836

View details for Web of Science ID 001373049500001
Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. Journal of the American Medical Informatics Association : JAMIA Savage, T., Wang, J., Gallo, R., Boukil, A., Patel, V., Safavi-Naini, S. A., Soroush, A., Chen, J. H. 2024

Abstract

The inability of large language models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to estimate uncertainty in ways that are useful to physician-users. Evaluate the ability for uncertainty proxies to quantify LLM confidence when performing diagnosis and treatment selection tasks by assessing the properties of discrimination and calibration. We examined confidence elicitation (CE), token-level probability (TLP), and sample consistency (SC) proxies across GPT3.5, GPT4, Llama2, and Llama3. Uncertainty proxies were evaluated against 3 datasets of open-ended patient scenarios. SC discrimination outperformed TLP and CE methods. SC by sentence embedding achieved the highest discriminative performance (ROC AUC 0.68-0.79), yet with poor calibration. SC by GPT annotation achieved the second-best discrimination (ROC AUC 0.66-0.74) with accurate calibration. Verbalized confidence (CE) was found to consistently overestimate model confidence. SC is the most effective method for estimating LLM uncertainty of the proxies evaluated. SC by sentence embedding can effectively estimate uncertainty if the user has a set of reference cases with which to re-calibrate their results, while SC by GPT annotation is the more effective method if the user does not have reference cases and requires accurate raw calibration. Our results confirm LLMs are consistently over-confident when verbalizing their confidence (CE).

View details for DOI 10.1093/jamia/ocae254

View details for PubMedID 39396184
Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA network open Goh, E., Gallo, R., Hom, J., Strong, E., Weng, Y., Kerman, H., Cool, J. A., Kanjee, Z., Parsons, A. S., Ahuja, N., Horvitz, E., Yang, D., Milstein, A., Olson, A. P., Rodman, A., Chen, J. H. 2024; 7 (10): e2440969

Abstract

Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves physician diagnostic reasoning. To assess the effect of an LLM on physicians' diagnostic reasoning compared with conventional resources. A single-blind randomized clinical trial was conducted from November 29 to December 29, 2023. Using remote video conferencing and in-person participation across multiple academic medical institutions, physicians with training in family medicine, internal medicine, or emergency medicine were recruited. Participants were randomized to either access the LLM in addition to conventional diagnostic resources or conventional resources only, stratified by career stage. Participants were allocated 60 minutes to review up to 6 clinical vignettes. The primary outcome was performance on a standardized rubric of diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps, validated and graded via blinded expert consensus. Secondary outcomes included time spent per case (in seconds) and final diagnosis accuracy. All analyses followed the intention-to-treat principle. A secondary exploratory analysis evaluated the standalone performance of the LLM by comparing the primary outcomes between the LLM alone group and the conventional resource group. Fifty physicians (26 attendings, 24 residents; median years in practice, 3 [IQR, 2-8]) participated virtually as well as at 1 in-person site. The median diagnostic reasoning score per case was 76% (IQR, 66%-87%) for the LLM group and 74% (IQR, 63%-84%) for the conventional resources-only group, with an adjusted difference of 2 percentage points (95% CI, -4 to 8 percentage points; P = .60). The median time spent per case for the LLM group was 519 (IQR, 371-668) seconds, compared with 565 (IQR, 456-788) seconds for the conventional resources group, with a time difference of -82 (95% CI, -195 to 31; P = .20) seconds. The LLM alone scored 16 percentage points (95% CI, 2-30 percentage points; P = .03) higher than the conventional resources group. In this trial, the availability of an LLM to physicians as a diagnostic aid did not significantly improve clinical reasoning compared with conventional resources. The LLM alone demonstrated higher performance than both physician groups, indicating the need for technology and workforce development to realize the potential of physician-artificial intelligence collaboration in clinical practice. ClinicalTrials.gov Identifier: NCT06157944.

View details for DOI 10.1001/jamanetworkopen.2024.40969

View details for PubMedID 39466245
Large Language Model Influence on Management Reasoning: A Randomized Controlled Trial. medRxiv : the preprint server for health sciences Goh, E., Gallo, R., Strong, E., Weng, Y., Kerman, H., Freed, J., Cool, J. A., Kanjee, Z., Lane, K. P., Parsons, A. S., Ahuja, N., Horvitz, E., Yang, D., Milstein, A., Olson, A. P., Hom, J., Chen, J. H., Rodman, A. 2024

Abstract

Large language model (LLM) artificial intelligence (AI) systems have shown promise in diagnostic reasoning, but their utility in management reasoning with no clear right answers is unknown. To determine whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources. Prospective, randomized controlled trial conducted from 30 November 2023 to 21 April 2024. Multi-institutional study from Stanford University, Beth Israel Deaconess Medical Center, and the University of Virginia involving physicians from across the United States. 92 practicing attending physicians and residents with training in internal medicine, family medicine, or emergency medicine. Five expert-developed clinical case vignettes were presented with multiple open-ended management questions and scoring rubrics created through a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (e.g., UpToDate, Google), or conventional resources alone. The primary outcome was difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case. Physicians using the LLM scored higher compared to those using conventional resources (mean difference 6.5%, 95% CI 2.7-10.2, p<0.001). Significant improvements were seen in management decisions (6.1%, 95% CI 2.5-9.7, p=0.001), diagnostic decisions (12.1%, 95% CI 3.1-21.0, p=0.009), and case-specific (6.2%, 95% CI 2.4-9.9, p=0.002) domains. GPT-4 users spent more time per case (mean difference 119.3 seconds, 95% CI 17.4-221.2, p=0.02). There was no significant difference between GPT-4-augmented physicians and GPT-4 alone (-0.9%, 95% CI -9.0 to 7.2, p=0.8). LLM assistance improved physician management reasoning compared to conventional resources, with particular gains in contextual and patient-specific decision-making. These findings indicate that LLMs can augment management decision-making in complex cases. ClinicalTrials.gov Identifier: NCT06208423; https://classic.clinicaltrials.gov/ct2/show/NCT06208423. Question: Does large language model (LLM) assistance improve physician performance on complex management reasoning tasks compared to conventional resources? Findings: In this randomized controlled trial of 92 physicians, participants using GPT-4 achieved higher scores on management reasoning compared to those using conventional resources (e.g., UpToDate). Meaning: LLM assistance enhances physician management reasoning performance in complex cases with no clear right answers.

View details for DOI 10.1101/2024.08.05.24311485

View details for PubMedID 39148822

View details for PubMedCentralID PMC11326321
Clinical Evaluations of Early Warning Scores-Reply. JAMA internal medicine Gallo, R. J., Geldsetzer, P., Li, R. C. 2024

View details for DOI 10.1001/jamainternmed.2024.3053

View details for PubMedID 39037786
Measuring Equity in Readmission as an Assessment of Hospital Performance. JAMA Gallo, R. J., Santiago, C. 2024

View details for DOI 10.1001/jama.2024.4351

View details for PubMedID 38648050
Affiliation Bias in Peer Review of Abstracts. JAMA Gallo, R. J., Savage, T., Chen, J. H. 2024; 331 (14): 1234-1235

View details for DOI 10.1001/jama.2024.3520

View details for PubMedID 38592392
Effectiveness of an Artificial Intelligence-Enabled Intervention for Detecting Clinical Deterioration. JAMA internal medicine Gallo, R. J., Shieh, L., Smith, M., Marafino, B. J., Geldsetzer, P., Asch, S. M., Shum, K., Lin, S., Westphal, J., Hong, G., Li, R. C. 2024

Abstract

Inpatient clinical deterioration is associated with substantial morbidity and mortality but may be easily missed by clinicians. Early warning scores have been developed to alert clinicians to patients at high risk of clinical deterioration, but there is limited evidence for their effectiveness. To evaluate the effectiveness of an artificial intelligence deterioration model-enabled intervention to reduce the risk of escalations in care among hospitalized patients using a study design that facilitates stronger causal inference. This cohort study used a regression discontinuity design that controlled for confounding and was based on Epic Deterioration Index (EDI; Epic Systems Corporation) prediction model scores. Compared with other observational research, the regression discontinuity design facilitates causal analysis. Hospitalized adults were included from 4 general internal medicine units in 1 academic hospital from January 17, 2021, through November 16, 2022. An artificial intelligence deterioration model-enabled intervention, consisting of alerts based on an EDI score threshold with an associated collaborative workflow among nurses and physicians. The primary outcome was escalations in care, including rapid response team activation, transfer to the intensive care unit, or cardiopulmonary arrest during hospitalization. During the study, 9938 patients were admitted to 1 of the 4 units, with 963 patients (median [IQR] age, 76.1 [64.2-86.2] years; 498 males [52.3%]) included within the primary regression discontinuity analysis. The median (IQR) Elixhauser Comorbidity Index score in the primary analysis cohort was 10 (0-24). The intervention was associated with a -10.4-percentage point (95% CI, -20.1 to -0.8 percentage points; P = .03) absolute risk reduction in the primary outcome for patients at the EDI score threshold. There was no evidence of a discontinuity in measured confounders at the EDI score threshold. Using a regression discontinuity design, this cohort study found that the implementation of an artificial intelligence deterioration model-enabled intervention was associated with a significantly decreased risk of escalations in care among inpatients. These results provide evidence for the effectiveness of this intervention and support its further expansion and testing in other care settings.

View details for DOI 10.1001/jamainternmed.2024.0084

View details for PubMedID 38526472
Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study. medRxiv : the preprint server for health sciences Goh, E., Gallo, R., Hom, J., Strong, E., Weng, Y., Kerman, H., Cool, J., Kanjee, Z., Parsons, A. S., Ahuja, N., Horvitz, E., Yang, D., Milstein, A., Olson, A. P., Rodman, A., Chen, J. H. 2024

Abstract

Diagnostic errors are common and cause significant morbidity. Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves diagnostic reasoning. To assess the impact of the GPT-4 LLM on physicians' diagnostic reasoning compared to conventional resources. Multi-center, randomized clinical vignette study. The study was conducted using remote video conferencing with physicians across the country and in-person participation across multiple academic medical institutions. Resident and attending physicians with training in family medicine, internal medicine, or emergency medicine. Participants were randomized to access GPT-4 in addition to conventional diagnostic resources or to just conventional resources. They were allocated 60 minutes to review up to six clinical vignettes adapted from established diagnostic reasoning exams. The primary outcome was diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps. Secondary outcomes included time spent per case and final diagnosis. 50 physicians (26 attendings, 24 residents) participated, with an average of 5.2 cases completed per participant. The median diagnostic reasoning score per case was 76.3 percent (IQR 65.8 to 86.8) for the GPT-4 group and 73.7 percent (IQR 63.2 to 84.2) for the conventional resources group, with an adjusted difference of 1.6 percentage points (95% CI -4.4 to 7.6; p=0.60). The median time spent on cases for the GPT-4 group was 519 seconds (IQR 371 to 668 seconds), compared to 565 seconds (IQR 456 to 788 seconds) for the conventional resources group, with a time difference of -82 seconds (95% CI -195 to 31; p=0.20). GPT-4 alone scored 15.5 percentage points (95% CI 1.5 to 29, p=0.03) higher than the conventional resources group. In a clinical vignette-based study, the availability of GPT-4 to physicians as a diagnostic aid did not significantly improve clinical reasoning compared to conventional resources, although it may improve components of clinical reasoning such as efficiency. GPT-4 alone demonstrated higher performance than both physician groups, suggesting opportunities for further improvement in physician-AI collaboration in clinical practice.

View details for DOI 10.1101/2024.03.12.24303785

View details for PubMedID 38559045

View details for PubMedCentralID PMC10980135
Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ digital medicine Savage, T., Nayak, A., Gallo, R., Rangan, E., Chen, J. H. 2024; 7 (1): 20

Abstract

One of the major barriers to using large language models (LLMs) in medicine is the perception they use uninterpretable methods to make clinical decisions that are inherently different from the cognitive processes of clinicians. In this manuscript we develop diagnostic reasoning prompts to study whether LLMs can imitate clinical reasoning while accurately forming a diagnosis. We find that GPT-4 can be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy. This is significant because an LLM that can imitate clinical reasoning to provide an interpretable rationale offers physicians a means to evaluate whether an LLM's response is likely correct and can be trusted for patient care. Prompting methods that use diagnostic reasoning have the potential to mitigate the "black box" limitations of LLMs, bringing them one step closer to safe and effective use in medicine.

View details for DOI 10.1038/s41746-024-01010-1

View details for PubMedID 38267608

View details for PubMedCentralID 9931230
Things We Do for No Reason™: Routine early PEG tube placement for dysphagia after acute stroke. Journal of hospital medicine Gallo, R. J., Wang, J. E., Madill, E. S. 2024

View details for DOI 10.1002/jhm.13263

View details for PubMedID 38180160
ChatGPT Influence on Medical Decision-Making, Bias, and Equity: A Randomized Study of Clinicians Evaluating Clinical Vignettes. medRxiv : the preprint server for health sciences Goh, E., Bunning, B., Khoong, E., Gallo, R., Milstein, A., Centola, D., Chen, J. H. 2023

Abstract

In a randomized, pre-post intervention study, we evaluated the influence of a large language model (LLM) generative AI system on accuracy of physician decision-making and bias in healthcare. 50 US-licensed physicians reviewed a video clinical vignette, featuring actors representing different demographics (a White male or a Black female) with chest pain. Participants were asked to answer clinical questions around triage, risk, and treatment based on these vignettes, then asked to reconsider after receiving advice generated by ChatGPT+ (GPT4). The primary outcome was the accuracy of clinical decisions based on pre-established evidence-based guidelines. Results showed that physicians are willing to change their initial clinical impressions given AI assistance, and that this led to a significant improvement in clinical decision-making accuracy in a chest pain evaluation scenario without introducing or exacerbating existing race or gender biases. A survey of physician participants indicates that the majority expect LLM tools to play a significant role in clinical decision making.

View details for DOI 10.1101/2023.11.24.23298844

View details for PubMedID 38076944

View details for PubMedCentralID PMC10705632
K Grant Funding to Internal Medicine Specialties. Journal of general internal medicine Gallo, R. J., Asch, S. M., Chan, D. C. 2023

View details for DOI 10.1007/s11606-023-08483-y

View details for PubMedID 37904071
Administrative Coding Versus Laboratory Diagnosis of Inpatient Hypoglycemia. Diabetes care Gallo, R. J., Fang, D. Z., Heidenreich, P. A. 2023

View details for DOI 10.2337/dc23-0053

View details for PubMedID 37068271
Addition of Coronary Artery Calcium Scores to Primary Prevention Risk Estimation Models-Primum Non Nocere. JAMA internal medicine Gallo, R. J., Brown, D. L. 2022

View details for DOI 10.1001/jamainternmed.2022.1258

View details for PubMedID 35467702