Ethan Goh, MD, MS
Senior Research Engineer, Med/BMIR
All Publications
-
Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2026; 31: 400-416
Abstract
This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford's eConsult team, we assess frontier models, including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro, for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model's ability to prioritize clinically salient information within the time constraints of real-world physician communication. Limitations include reliance on Stanford-specific templates and concordance-based grading, which may not capture all clinically reasonable outputs.
View details for DOI 10.1142/9789819824755_0028
View details for PubMedID 41758156
-
Automated Evaluation of Large Language Model Response Concordance with Human Specialist Responses on Physician-to-Physician eConsult Cases.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2026; 31: 372-387
Abstract
Specialist consults in primary care and inpatient settings typically address complex clinical questions beyond standard guidelines. eConsults have been developed as a way for specialist physicians to review cases asynchronously and provide clinical answers without a formal patient encounter. Meanwhile, large language models (LLMs) have approached human-level performance on structured clinical tasks, but their real-world effectiveness requires evaluation, which is bottlenecked by time-intensive manual physician review. To address this, we evaluate two automated methods: LLM-as-judge and a decompose-then-verify framework that breaks down AI answers into verifiable claims against human eConsult responses. Using 40 real-world physician-to-physician eConsults, we compared AI-generated responses to human answers using both physician raters and automated tools. LLM-as-judge outperformed decompose-then-verify, achieving human-level concordance assessment with an F1-score of 0.89 (95% CI: 0.750, 0.960) and Cohen's kappa of 0.75 (95% CI: 0.47, 0.90), comparable to physician inter-rater agreement (κ = 0.69-0.90; 95% CI 0.43-1.0).
View details for DOI 10.1142/9789819824755_0026
View details for PubMedID 41758154
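The two agreement metrics reported above, F1-score and Cohen's kappa, can be sketched for binary concordant/discordant judgments. The labels below are made up for illustration, not study data:

```python
# Illustrative computation of F1 and Cohen's kappa over binary ratings.
# The rating vectors are hypothetical, not the study's data.
def f1_score(truth, pred):
    """F1 over binary labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    return 2 * tp / (2 * tp + fp + fn)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary raters."""
    n = len(a)
    p_obs = sum(1 for x, y in zip(a, b) if x == y) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                      # marginal "yes" rates
    p_exp = pa * pb + (1 - pa) * (1 - pb)                # agreement by chance
    return (p_obs - p_exp) / (1 - p_exp)

# 1 = "AI answer concordant with the specialist answer", per rater
physician_rater = [1, 1, 0, 0, 1, 0, 1, 1]
llm_judge       = [1, 1, 0, 1, 1, 0, 1, 0]
f1 = f1_score(physician_rater, llm_judge)         # 0.8
kappa = cohens_kappa(physician_rater, llm_judge)  # ≈ 0.467
```

Kappa discounts the agreement two raters would reach by chance, which is why it is the standard companion to raw concordance in inter-rater comparisons like this one.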
-
"I Double Checked It with My Own Knowledge:" Physician Perspectives on the Use of AI Chatbots for Clinical Decision-Making.
Journal of general internal medicine
2026
Abstract
AI chatbots are proliferating in healthcare systems. It is essential to explore how physicians use these tools in order to understand their influence on clinical care and outcomes. Our goal was to understand how physicians conceive of and incorporate AI into clinical decision-making. We conducted semi-structured interviews with generalist physicians from inpatient and outpatient settings in the USA. Prior to the interview, participants were asked to use an AI chatbot, ChatGPT-4, to complete three mock clinical cases. Physicians were interviewed regarding their perspectives on the AI chatbot. Interviews were conducted via video conference, where they were recorded and transcribed, and analyzed using reflexive thematic analysis. We interviewed 22 physicians with 2-32 years of experience (median = 3 years). We identified a central organizing concept of "physician as filter" defining how physicians used the AI chatbot. This idea was composed of four themes. Theme 1: Physicians perceive clinical decision-making as a problem-solving activity, applying internally held knowledge to externally gathered information. Theme 2: AI chatbot systems are part of a continuum of information resources. Theme 3: Trust in the AI chatbot's outputs depends on the user's own clinical knowledge. Theme 4: Clinical decision-making is understood as the personalization of clinical knowledge and context. AI chatbots may help physicians with formulating a clinical problem and generating a hypothesis by expanding their repertoire of possible cases. Despite the "wealth of information" provided by AI chatbots, physician trust in the outputs is limited, especially when AI chatbots do not provide references. Physician users described filtering chatbot outputs, using their own clinical knowledge and experience, to determine what information is relevant. In describing how providers perceive AI chatbots, we hope to guide further investigation of physician AI interaction and chatbot development that facilitates improved clinical reasoning.
View details for DOI 10.1007/s11606-025-10145-0
View details for PubMedID 41563674
-
Holistic evaluation of large language models for medical tasks with MedHELM.
Nature medicine
2026
Abstract
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, a clinician-validated taxonomy organizing medical AI applications into five categories that mirror real clinical tasks: clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis), and administration (scheduling, workflow coordination). These encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, a systematic comparison of nine frontier LLMs (Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3, and o3-mini) using an automated LLM-jury evaluation method. Our LLM-jury uses multiple AI evaluators to assess model outputs against expert-defined criteria. Advanced reasoning models (DeepSeek R1, o3-mini) demonstrated superior performance with win rates of 66%, although Claude 3.5 Sonnet achieved comparable results at 15% lower computational cost. These results not only highlight current model capabilities but also demonstrate how MedHELM could enable evidence-based selection of medical AI systems for healthcare applications.
View details for DOI 10.1038/s41591-025-04151-2
View details for PubMedID 41559415
View details for PubMedCentralID PMC10916499
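The LLM-jury method described above (multiple AI evaluators assessing one model output against expert-defined criteria) reduces, in its simplest form, to a vote over per-juror judgments. The juror names and pass/fail scheme below are hypothetical simplifications, not the MedHELM implementation:

```python
# Toy reduction of an LLM-jury: several AI evaluators grade one model output
# against expert-defined criteria, and the verdict aggregates their votes.
# Juror names and the pass/fail scheme are hypothetical simplifications.
from collections import Counter

def jury_verdict(votes):
    """Majority vote over per-juror judgments; a tie counts as a fail."""
    tally = Counter(votes.values())
    return tally["pass"] > tally["fail"]

votes = {"juror_a": "pass", "juror_b": "pass", "juror_c": "fail"}
verdict = jury_verdict(votes)  # True: two of three jurors passed the output
```

Real juries typically average graded scores per criterion rather than casting binary votes, but the aggregation-across-evaluators structure is the same.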
-
A typology of physician input approaches to using AI chatbots for clinical decision-making.
NPJ digital medicine
2025
Abstract
Recent studies have found that physicians with access to a large language model (LLM) chatbot during clinical reasoning tests may score no better than, and sometimes worse than, the same chatbot performing alone with an input that included the entire clinical case. This study explores how physicians approach using LLM chatbots during clinical reasoning tasks and whether the amount of clinical case content included in the input affects performance. We conducted semi-structured interviews with U.S. physicians on experiences using an LLM chatbot and developed a typology based on input patterns. We then analyzed physician chat logs from two randomized controlled trials, coding each clinical case to an input approach type. Lastly, we used a linear mixed-effects model to compare the case scores of different input approach types. We identified four input approach types based on patterns of content amount: copy-paster (entire case), selective copy-paster (pieces of a case), summarizer (user-generated case summary), and searcher (short queries). Copy-pasting and searching were utilized most. No single type was associated with scoring higher on clinical cases. Other factors such as different prompting strategies, cognitive engagement, and interpretation of the outputs may have more impact and should be explored in future studies.
View details for DOI 10.1038/s41746-025-02184-y
View details for PubMedID 41350807
-
A typology of physician input approaches to using AI chatbots for clinical decision-making: a mixed methods study.
medRxiv : the preprint server for health sciences
2025
Abstract
Large language model (LLM) chatbots demonstrate high degrees of accuracy, yet recent studies found that physicians using these same chatbots may score no better than, and sometimes worse than, the chatbot performing alone with researcher-curated prompts on clinical reasoning tests. It is unknown how physicians approach inputting information into chatbots. This study aimed to identify how physicians interacted with LLM chatbots on clinical reasoning tasks to create a typology of input approaches, exploring whether input approach type was associated with improved clinical reasoning performance. We carried out a mixed methods study in three steps. First, we conducted semi-structured interviews with U.S. physicians on experiences using an LLM chatbot and analyzed transcripts using the Framework Method to develop a typology based on input patterns. Next, we analyzed the chat logs of physicians who used a chatbot while solving clinical cases, categorizing each case to an input approach type. Lastly, we used a linear mixed-effects model to compare each input approach type with performance on the clinical cases. We identified four input approach types based on patterns of "content amount": copy-paster (entire case), selective copy-paster (pieces of a case), summarizer (user-generated case summary), and searcher (short queries). Copy-pasting and searching were utilized most. No single type was associated with scoring higher on clinical cases. This study adds to our understanding of how physicians approach using chatbots and identifies ways in which physicians intuitively interact with chatbots. Purposeful training and support is needed to help physicians effectively use emerging AI technologies and realize their potential for supporting safe and effective medical decision-making in practice.
View details for DOI 10.1101/2025.07.23.25332002
View details for PubMedID 40778141
View details for PubMedCentralID PMC12330456
-
Systematic Exploration of Hospital Cost Variability: A Conformal Prediction-Based Outlier Detection Method for Electronic Health Records.
AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science
2025; 2025: 187-195
Abstract
Marked variability in inpatient hospitalization costs poses significant challenges to healthcare quality, resource allocation, and patient outcomes. Traditional methods like Diagnosis-Related Groups (DRGs) aid in cost management but lack practical solutions for enhancing hospital care value. We introduce a novel methodology for outlier detection in Electronic Health Records (EHRs) using Conformal Prediction. This approach identifies and prioritizes areas for optimizing high-value care processes. Unlike conventional predictive models that neglect uncertainty, our method employs Conformal Quantile Regression (CQR) to generate robust prediction intervals, offering a comprehensive view of cost variability. By integrating Conformal Prediction with machine learning models, healthcare professionals can more accurately pinpoint opportunities for quality and efficiency improvements. Our framework systematically evaluates unexplained hospital cost variations and generates interpretable hypotheses for refining clinical practices associated with atypical costs. This data-driven approach offers a systematic method to generate clinically sound hypotheses that may inform processes to enhance care quality and optimize resource utilization.
View details for PubMedID 40502259
View details for PubMedCentralID PMC12150741
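The conformalized quantile regression (CQR) step the abstract describes can be sketched in a few lines. The predicted quantile bounds and cost figures below are hypothetical stand-ins for a fitted quantile-regression model's outputs, not the paper's data or code:

```python
# Sketch of split-conformal calibration around quantile predictions (CQR).
# The "predicted" lower/upper percentile costs are hypothetical stand-ins
# for a fitted quantile-regression model; none of this is the paper's data.
import math

def cqr_adjustment(y_cal, lo_cal, hi_cal, alpha=0.2):
    """Conformity scores E_i = max(lo_i - y_i, y_i - hi_i); return the
    calibration quantile Q that widens (or shrinks) every interval so that
    roughly (1 - alpha) of future costs fall inside [lo - Q, hi + Q]."""
    scores = sorted(max(l - y, y - h) for y, l, h in zip(y_cal, lo_cal, hi_cal))
    n = len(scores)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return scores[k]

def flag_outliers(y, lo, hi, q):
    """A hospitalization is a cost outlier if it escapes its adjusted interval."""
    return [not (l - q <= yi <= h + q) for yi, l, h in zip(y, lo, hi)]

# Hypothetical calibration set: observed costs with predicted quantile bounds.
y_cal = [10.0, 12.0, 9.5, 11.0, 13.0, 10.5, 12.5, 9.0, 11.5, 10.8]
lo_cal, hi_cal = [8.0] * 10, [13.0] * 10
q = cqr_adjustment(y_cal, lo_cal, hi_cal, alpha=0.2)

# Two new admissions: a typical cost and one far outside its interval.
flags = flag_outliers([10.0, 30.0], [8.0, 8.0], [13.0, 13.0], q)  # [False, True]
```

The key property, which motivates its use for outlier detection here, is that the coverage guarantee holds without assuming the cost model is correct; only the calibration step has to be done on held-out cases.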
-
From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis.
medRxiv : the preprint server for health sciences
2025
Abstract
Early studies of large language models (LLMs) in clinical settings have largely treated artificial intelligence (AI) as a tool rather than an active collaborator. As LLMs now demonstrate expert-level diagnostic performance, the focus shifts from whether AI can offer valuable suggestions to how it can be effectively integrated into physicians' diagnostic workflows. We conducted a randomized controlled trial (n=70 clinicians) to evaluate the value of employing a custom GPT system designed to engage collaboratively with clinicians on diagnostic reasoning challenges. The collaborative design began with independent diagnostic assessments from both the clinician and the AI. These were then combined in an AI-generated synthesis that integrated the two perspectives, highlighting points of agreement and disagreement and offering commentary on each. We evaluated two workflow variants: one where the AI provided an initial opinion (AI-first), and another where it followed the clinician's assessment (AI-second). Clinicians using either collaborative workflow outperformed those using traditional tools, achieving average accuracies of 85% (AI-first) and 82% (AI-second), compared to 75% with traditional resources (p < 0.0004 and p < 0.00001; mean differences = 9.8% and 6.8%; 95% CI = 4.6%-15% and 4.0%-9.6%). Performance did not differ significantly between workflows or from the AI-alone score of 90%. These results underscore the value of collaborative AI systems that complement clinician expertise and foster effective coordination between human and machine reasoning in diagnostic decision-making.
View details for DOI 10.1101/2025.06.07.25329176
View details for PubMedID 40502554
-
Physician clinical decision modification and bias assessment in a randomized controlled trial of AI assistance.
Communications medicine
2025; 5 (1): 59
Abstract
Artificial intelligence assistance in clinical decision making shows promise, but concerns exist about potential exacerbation of demographic biases in healthcare. This study aims to evaluate how physician clinical decisions and biases are influenced by AI assistance in a chest pain triage scenario. A randomized, pre-post intervention study was conducted with 50 US-licensed physicians who reviewed standardized chest pain video vignettes featuring either a white male or Black female patient. Participants answered clinical questions about triage, risk assessment, and treatment before and after receiving GPT-4-generated recommendations. Clinical decision accuracy was evaluated against evidence-based guidelines. Here we show that physicians are willing to modify their clinical decisions based on GPT-4 assistance, leading to improved accuracy scores from 47% to 65% in the white male patient group and 63% to 80% in the Black female patient group. The accuracy improvement occurs without introducing or exacerbating demographic biases, with both groups showing similar magnitudes of improvement (18%). A post-study survey indicates that 90% of physicians expect AI tools to play a significant role in future clinical decision making. Physician clinical decision making can be augmented by AI assistance while maintaining equitable care across patient demographics. These findings suggest a path forward for AI clinical decision support that improves medical care without amplifying healthcare disparities.
View details for DOI 10.1038/s43856-025-00781-2
View details for PubMedID 40038550
View details for PubMedCentralID PMC10582782
-
GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial.
Nature medicine
2025
Abstract
While large language models (LLMs) have shown promise in diagnostic reasoning, their impact on management reasoning, which involves balancing treatment decisions and testing strategies while managing risk, is unknown. This prospective, randomized, controlled trial assessed whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources. From November 2023 to April 2024, 92 practicing physicians were randomized to use either GPT-4 plus conventional resources or conventional resources alone to answer five expert-developed clinical vignettes in a simulated setting. All cases were based on real, de-identified patient encounters, with information revealed sequentially to mirror the nature of clinical environments. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case. Physicians using the LLM scored significantly higher compared to those using conventional resources (mean difference = 6.5%, 95% confidence interval (CI) = 2.7 to 10.2, P < 0.001). LLM users spent more time per case (mean difference = 119.3 s, 95% CI = 17.4 to 221.2, P = 0.02). There was no significant difference between LLM-augmented physicians and LLM alone (-0.9%, 95% CI = -9.0 to 7.2, P = 0.8). LLM assistance can improve physician management reasoning in complex clinical vignettes compared to conventional resources and should be validated in real clinical practice. ClinicalTrials.gov registration: NCT06208423.
View details for DOI 10.1038/s41591-024-03456-y
View details for PubMedID 39910272
View details for PubMedCentralID PMC10273128
-
Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial.
JAMA network open
2024; 7 (10): e2440969
Abstract
Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves physician diagnostic reasoning. To assess the effect of an LLM on physicians' diagnostic reasoning compared with conventional resources, a single-blind randomized clinical trial was conducted from November 29 to December 29, 2023. Using remote video conferencing and in-person participation across multiple academic medical institutions, physicians with training in family medicine, internal medicine, or emergency medicine were recruited. Participants were randomized to either access the LLM in addition to conventional diagnostic resources or conventional resources only, stratified by career stage. Participants were allocated 60 minutes to review up to 6 clinical vignettes. The primary outcome was performance on a standardized rubric of diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps, validated and graded via blinded expert consensus. Secondary outcomes included time spent per case (in seconds) and final diagnosis accuracy. All analyses followed the intention-to-treat principle. A secondary exploratory analysis evaluated the standalone performance of the LLM by comparing the primary outcomes between the LLM alone group and the conventional resource group. Fifty physicians (26 attendings, 24 residents; median years in practice, 3 [IQR, 2-8]) participated virtually as well as at 1 in-person site. The median diagnostic reasoning score per case was 76% (IQR, 66%-87%) for the LLM group and 74% (IQR, 63%-84%) for the conventional resources-only group, with an adjusted difference of 2 percentage points (95% CI, -4 to 8 percentage points; P = .60). The median time spent per case for the LLM group was 519 (IQR, 371-668) seconds, compared with 565 (IQR, 456-788) seconds for the conventional resources group, with a time difference of -82 seconds (95% CI, -195 to 31; P = .20). The LLM alone scored 16 percentage points (95% CI, 2-30 percentage points; P = .03) higher than the conventional resources group. In this trial, the availability of an LLM to physicians as a diagnostic aid did not significantly improve clinical reasoning compared with conventional resources. The LLM alone demonstrated higher performance than both physician groups, indicating the need for technology and workforce development to realize the potential of physician-artificial intelligence collaboration in clinical practice. ClinicalTrials.gov Identifier: NCT06157944.
View details for DOI 10.1001/jamanetworkopen.2024.40969
View details for PubMedID 39466245
-
Large Language Model Influence on Management Reasoning: A Randomized Controlled Trial.
medRxiv : the preprint server for health sciences
2024
Abstract
Large language model (LLM) artificial intelligence (AI) systems have shown promise in diagnostic reasoning, but their utility in management reasoning with no clear right answers is unknown. To determine whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources, we conducted a prospective, randomized controlled trial from 30 November 2023 to 21 April 2024. This multi-institutional study from Stanford University, Beth Israel Deaconess Medical Center, and the University of Virginia involved physicians from across the United States: 92 practicing attending physicians and residents with training in internal medicine, family medicine, or emergency medicine. Five expert-developed clinical case vignettes were presented with multiple open-ended management questions and scoring rubrics created through a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (e.g., UpToDate, Google), or conventional resources alone. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case. Physicians using the LLM scored higher compared to those using conventional resources (mean difference 6.5%, 95% CI 2.7-10.2, p<0.001). Significant improvements were seen in the management decisions (6.1%, 95% CI 2.5-9.7, p=0.001), diagnostic decisions (12.1%, 95% CI 3.1-21.0, p=0.009), and case-specific (6.2%, 95% CI 2.4-9.9, p=0.002) domains. GPT-4 users spent more time per case (mean difference 119.3 seconds, 95% CI 17.4-221.2, p=0.02). There was no significant difference between GPT-4-augmented physicians and GPT-4 alone (-0.9%, 95% CI -9.0 to 7.2, p=0.8). LLM assistance improved physician management reasoning compared to conventional resources, with particular gains in contextual and patient-specific decision-making. These findings indicate that LLMs can augment management decision-making in complex cases. ClinicalTrials.gov Identifier: NCT06208423; https://classic.clinicaltrials.gov/ct2/show/NCT06208423. Question: Does large language model (LLM) assistance improve physician performance on complex management reasoning tasks compared to conventional resources? Findings: In this randomized controlled trial of 92 physicians, participants using GPT-4 achieved higher scores on management reasoning compared to those using conventional resources (e.g., UpToDate). Meaning: LLM assistance enhances physician management reasoning performance in complex cases with no clear right answers.
View details for DOI 10.1101/2024.08.05.24311485
View details for PubMedID 39148822
View details for PubMedCentralID PMC11326321
-
Preliminary Data: A Randomized Controlled Trial of Large Language Model Assistance with Complex Diagnostic Reasoning
SPRINGER. 2024: S828-S829
View details for Web of Science ID 001433572702263
-
Right Patient, Right Specialist, Right Time: Retrieval Augmented Generation for Specialty Referral Routing.
AMIA ... Annual Symposium proceedings. AMIA Symposium
2024; 2024: 443-450
Abstract
We present an embedding-based retrieval system that automatically directs physician clinical questions to the most relevant specialist-curated question template, which is necessary for the specialist to provide a clinically relevant response. The system utilizes MPNet, a transformer-based model, to generate dense vector representations of both clinical queries and 24 predefined clinical templates. Given a clinical question, the system computes cosine similarity between the query and template embeddings to retrieve the most relevant matches. When validated against real-world, retrospective eConsults across five specialties, the system accurately identified the most relevant template in 87% of cases (success@1) and included it in the top three results 99% of the time (success@3). Automating specialty selection and clinical question referrals reduces the administrative burden on physicians, minimizes care delivery delays, and improves specialist responses by providing proper context.
View details for PubMedID 41726438
View details for PubMedCentralID PMC12919621
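The retrieval step the abstract describes, ranking templates by cosine similarity between query and template embeddings, can be sketched in a few lines. The 3-dimensional vectors below are toy stand-ins for MPNet sentence embeddings, and the template names are hypothetical:

```python
# Sketch of the template-retrieval step: embed the clinical question, then
# rank templates by cosine similarity. Toy 3-d vectors stand in for MPNet
# sentence embeddings; the template names are hypothetical.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, templates, k=3):
    """Template ids sorted best-first, as used for success@k evaluation."""
    ranked = sorted(templates, key=lambda t: cosine(query_vec, templates[t]),
                    reverse=True)
    return ranked[:k]

templates = {
    "cardiology_chest_pain": [0.9, 0.1, 0.0],
    "endocrine_thyroid":     [0.1, 0.9, 0.1],
    "derm_rash":             [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of an eConsult question
best = top_k(query, templates, k=1)[0]  # "cardiology_chest_pain"
```

success@1 then asks whether `best` matches the specialist-chosen template, and success@3 whether that template appears anywhere in the top three results.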
-
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records.
Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence
2024; 38 (20): 22021-22030
Abstract
The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialties), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and an 8.3% drop in accuracy moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. MedAlign is provided under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.
View details for DOI 10.1609/aaai.v38i20.30205
View details for PubMedID 41584261
View details for PubMedCentralID PMC12826664
-
Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study.
medRxiv : the preprint server for health sciences
2024
Abstract
Diagnostic errors are common and cause significant morbidity. Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves diagnostic reasoning. To assess the impact of the GPT-4 LLM on physicians' diagnostic reasoning compared to conventional resources, we conducted a multi-center, randomized clinical vignette study using remote video conferencing with physicians across the country and in-person participation across multiple academic medical institutions. Participants were resident and attending physicians with training in family medicine, internal medicine, or emergency medicine. Participants were randomized to access GPT-4 in addition to conventional diagnostic resources or to just conventional resources. They were allocated 60 minutes to review up to six clinical vignettes adapted from established diagnostic reasoning exams. The primary outcome was diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps. Secondary outcomes included time spent per case and final diagnosis. 50 physicians (26 attendings, 24 residents) participated, with an average of 5.2 cases completed per participant. The median diagnostic reasoning score per case was 76.3 percent (IQR 65.8 to 86.8) for the GPT-4 group and 73.7 percent (IQR 63.2 to 84.2) for the conventional resources group, with an adjusted difference of 1.6 percentage points (95% CI -4.4 to 7.6; p=0.60). The median time spent on cases for the GPT-4 group was 519 seconds (IQR 371 to 668 seconds), compared to 565 seconds (IQR 456 to 788 seconds) for the conventional resources group, with a time difference of -82 seconds (95% CI -195 to 31; p=0.20). GPT-4 alone scored 15.5 percentage points (95% CI 1.5 to 29, p=0.03) higher than the conventional resources group. In a clinical vignette-based study, the availability of GPT-4 to physicians as a diagnostic aid did not significantly improve clinical reasoning compared to conventional resources, although it may improve components of clinical reasoning such as efficiency. GPT-4 alone demonstrated higher performance than both physician groups, suggesting opportunities for further improvement in physician-AI collaboration in clinical practice.
View details for DOI 10.1101/2024.03.12.24303785
View details for PubMedID 38559045
View details for PubMedCentralID PMC10980135
-
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records
edited by Wooldridge, M., Dy, J., Natarajan, S.
ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE. 2024: 22021-22030
View details for Web of Science ID 001239985800017
-
ChatGPT Influence on Medical Decision-Making, Bias, and Equity: A Randomized Study of Clinicians Evaluating Clinical Vignettes.
medRxiv : the preprint server for health sciences
2023
Abstract
In a randomized, pre-post intervention study, we evaluated the influence of a large language model (LLM) generative AI system on accuracy of physician decision-making and bias in healthcare. 50 US-licensed physicians reviewed a video clinical vignette, featuring actors representing different demographics (a White male or a Black female) with chest pain. Participants were asked to answer clinical questions around triage, risk, and treatment based on these vignettes, then asked to reconsider after receiving advice generated by ChatGPT+ (GPT4). The primary outcome was the accuracy of clinical decisions based on pre-established evidence-based guidelines. Results showed that physicians are willing to change their initial clinical impressions given AI assistance, and that this led to a significant improvement in clinical decision-making accuracy in a chest pain evaluation scenario without introducing or exacerbating existing race or gender biases. A survey of physician participants indicates that the majority expect LLM tools to play a significant role in clinical decision making.
View details for DOI 10.1101/2023.11.24.23298844
View details for PubMedID 38076944
View details for PubMedCentralID PMC10705632
-
Remote evaluation of NAVIFY Oncology Hub using clinical simulation
LIPPINCOTT WILLIAMS & WILKINS. 2023
View details for Web of Science ID 001053772001507
ORCID: https://orcid.org/0009-0001-7491-4257