Clinical Focus


  • Regional Anesthesia and Acute Pain Medicine
  • Anesthesiology

Academic Appointments


  • Clinical Assistant Professor, Anesthesiology, Perioperative and Pain Medicine

Professional Education


  • Board Certification: American Board of Anesthesiology, Anesthesiology (2024)
  • Fellowship: Stanford University Anesthesiology Fellowships, CA (2024)
  • Residency, Stanford Health Care, Anesthesiology (2023)
  • Internship, Stanford Health Care, Internal Medicine (2020)
  • M.D., Perelman School of Medicine, University of Pennsylvania (2019)
  • M.B.A., The Wharton School, Health Care Management (2019)
  • A.B., Harvard College, Economics (2012)

All Publications


  • Holistic evaluation of large language models for medical tasks with MedHELM. Nature medicine Bedi, S., Cui, H., Fuentes, M., Unell, A., Wornow, M., Banda, J. M., Kotecha, N., Keyes, T., Mai, Y., Oez, M., Qiu, H., Jain, S., Schettini, L., Kashyap, M., Fries, J. A., Swaminathan, A., Chung, P., Haredasht, F. N., Lopez, I., Aali, A., Tse, G., Nayak, A., Vedak, S., Jain, S. S., Patel, B., Fayanju, O., Shah, S., Goh, E., Yao, D. H., Soetikno, B., Reis, E., Gatidis, S., Divi, V., Capasso, R., Saralkar, R., Chiang, C. C., Jindal, J., Pham, T., Ghoddusi, F., Lin, S., Chiou, A. S., Hong, H. J., Roy, M., Gensheimer, M. F., Patel, H., Schulman, K., Dash, D., Char, D., Downing, L., Grolleau, F., Black, K., Mieso, B., Zahedivash, A., Yim, W. W., Sharma, H., Lee, T., Kirsch, H., Lee, J., Ambers, N., Lugtu, C., Sharma, A., Mawji, B., Alekseyev, A., Zhou, V., Kakkar, V., Helzer, J., Revri, A., Bannett, Y., Daneshjou, R., Chen, J., Alsentzer, E., Morse, K., Ravi, N., Aghaeepour, N., Kennedy, V., Chaudhari, A., Wang, T., Koyejo, S., Lungren, M. P., Horvitz, E., Liang, P., Pfeffer, M. A., Shah, N. H. 2026

    Abstract

    While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, a clinician-validated taxonomy organizing medical AI applications into five categories that mirror real clinical tasks: clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis) and administration (scheduling, workflow coordination). These encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, systematic comparison of nine frontier LLMs (Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3 and o3-mini) using an automated LLM-jury evaluation method. Our LLM-jury uses multiple AI evaluators to assess model outputs against expert-defined criteria. Advanced reasoning models (DeepSeek R1, o3-mini) demonstrated superior performance with win rates of 66%, although Claude 3.5 Sonnet achieved comparable results at 15% lower computational cost. These results not only highlight current model capabilities but also demonstrate how MedHELM could enable evidence-based selection of medical AI systems for healthcare applications.
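
    The LLM-jury method described above reduces to a small aggregation loop: several judge models each score an output against expert-defined criteria, the scores are averaged, and head-to-head win rates are computed over paired tasks. A minimal sketch of that loop follows; the judge names, rubric wording, and the query_judge helper are illustrative assumptions, not MedHELM's actual prompts or harness:

    ```python
    # Illustrative sketch of an LLM-jury evaluation loop (not the MedHELM code).
    # `query_judge` is a hypothetical helper standing in for a real model client.

    from statistics import mean

    JUDGES = ["judge-model-a", "judge-model-b", "judge-model-c"]  # hypothetical

    RUBRIC = (
        "Rate the response from 1 (poor) to 5 (excellent) against these "
        "expert-defined criteria: clinical accuracy, completeness, and safety."
    )

    def query_judge(judge: str, prompt: str) -> float:
        """Placeholder: send the prompt to one judge model, parse a numeric score."""
        raise NotImplementedError("wire up a real model API client here")

    def jury_score(task_input: str, model_output: str) -> float:
        """Average independent rubric scores from several judge models."""
        prompt = f"{RUBRIC}\n\nTask:\n{task_input}\n\nResponse:\n{model_output}"
        return mean(query_judge(judge, prompt) for judge in JUDGES)

    def win_rate(scores_a: list[float], scores_b: list[float]) -> float:
        """Fraction of paired tasks on which model A outscores model B."""
        wins = sum(a > b for a, b in zip(scores_a, scores_b))
        return wins / len(scores_a)
    ```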

    View details for DOI 10.1038/s41591-025-04151-2

    View details for PubMedID 41559415

    View details for PubMedCentralID PMC10916499

  • Physician Perspectives on Large Language Models in Healthcare: A Cross-Sectional Survey Study. Applied clinical informatics Hong, H. J., Shah, N., Pfeffer, M. A., Lehmann, L. S. 2025

    Abstract

    This study aims to evaluate physicians' practices and perspectives regarding large language models (LLMs) in healthcare settings. A cross-sectional survey study was conducted between May and July 2024 comparing physician perspectives at two major academic medical centers (AMCs), one with institutional LLM access and one without. Participants included both clinical faculty and trainees recruited through departmental leadership and snowball sampling. Primary outcomes were current LLM use frequency, ranked importance of evaluation metrics, liability concerns, and preferred learning topics. Among 306 respondents (217 attending physicians [70.9%], 80 trainees [26.1%]), 197 (64.4%) reported using LLMs. The AMC with institutional LLM access reported significantly lower liability concerns (49.2% vs. 66.7% reporting high concern; a 17.5-percentage-point difference [95% CI, 6.8-28.2]; P=.0082). Accuracy was prioritized across all specialties (median rank 1.0 [IQR, 1.0-2.0]). Of the respondents, 287 physicians (94%) requested additional training. Key learning priorities were clinical applications (206 [71.9%]) and risk management (181 [63.1%]). Despite widespread personal use, only 8 physicians (2.6%) recommended LLMs to patients. Notable specialty and demographic variations emerged, with younger physicians showing higher enthusiasm but also elevated legal concerns. This survey study provides insights into physicians' current usage patterns and perspectives on LLMs. Liability concerns appear to be lessened in settings with institutional LLM access. The findings suggest opportunities for medical centers to consider when developing LLM-related policies and educational programs.
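
    The headline comparison above follows from a standard two-proportion normal approximation. A minimal sketch of the arithmetic; the per-site sample sizes are not reported in the abstract, so the equal 153/153 split below is an assumption for illustration, which is why the computed interval only approximates the reported [6.8, 28.2]:

    ```python
    # Reproducing the liability-concern comparison: 49.2% vs. 66.7% reporting
    # high concern, a 17.5-percentage-point difference. Per-site sample sizes
    # are assumed (153 each, an even split of 306), so the 95% CI is approximate.

    import math

    p1, n1 = 0.667, 153  # AMC without institutional LLM access (assumed n)
    p2, n2 = 0.492, 153  # AMC with institutional LLM access (assumed n)

    diff = p1 - p2  # 0.175 -> 17.5 percentage points
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se

    print(f"difference: {diff * 100:.1f} points "
          f"(95% CI, {lo * 100:.1f}-{hi * 100:.1f})")
    # -> difference: 17.5 points (95% CI, 6.6-28.4)
    ```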

    View details for DOI 10.1055/a-2735-0527

    View details for PubMedID 41167595

  • Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records. NEJM AI Chung, P., et al. 2025; 3 (1)

    View details for DOI 10.1056/AIdbp2500418

  • Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA Bedi, S., Liu, Y., Orr-Ewing, L., Dash, D., Koyejo, S., Callahan, A., Fries, J. A., Wornow, M., Swaminathan, A., Lehmann, L. S., Hong, H. J., Kashyap, M., Chaurasia, A. R., Shah, N. R., Singh, K., Tazbaz, T., Milstein, A., Pfeffer, M. A., Shah, N. H. 2024

    Abstract

    Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas. To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty. A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024. Studies evaluating 1 or more LLMs in health care were included. Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty. Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented. Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.
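
    The review's reviewers categorized studies via keyword searches. A minimal sketch of that style of classification; the category names and keyword lists below are illustrative assumptions, not the paper's actual coding scheme:

    ```python
    # Illustrative keyword-based study categorization, in the spirit of the
    # review's method. Keyword lists are hypothetical stand-ins.

    NLP_TASK_KEYWORDS = {
        "question answering": ["licensing exam", "multiple choice", "usmle"],
        "summarization": ["summarize", "summarization", "discharge summary"],
        "conversational dialogue": ["chatbot", "dialogue", "conversation"],
    }

    def categorize(abstract: str) -> list[str]:
        """Return every task category whose keywords appear in the text."""
        text = abstract.lower()
        return [
            category
            for category, keywords in NLP_TASK_KEYWORDS.items()
            if any(keyword in text for keyword in keywords)
        ]

    # Example: a study that evaluates an LLM on exam questions.
    print(categorize("We evaluate GPT-4 on USMLE multiple choice questions."))
    # -> ['question answering']
    ```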

    View details for DOI 10.1001/jama.2024.21700

    View details for PubMedID 39405325

    View details for PubMedCentralID PMC11480901

  • Enhancing the Readability of Preoperative Patient Instructions Using Large Language Models. Anesthesiology Hong, H. J., Schmiesing, C. A., Goodell, A. J. 2024; 141 (3): 608-610

    View details for DOI 10.1097/ALN.0000000000005122

    View details for PubMedID 39136480

  • Artificial Intelligence in Perioperative Care: Opportunities and Challenges. Anesthesiology Han, L., Char, D. S., Aghaeepour, N. 2024; 141 (2): 379-387

    View details for DOI 10.1097/ALN.0000000000005013

    View details for PubMedID 38980160

  • Engaging Housestaff as Informatics Collaborators: Educational and Operational Opportunities. Applied clinical informatics Shenson, J. A., Jankovic, I., Hong, H. J., Weia, B., White, L., Chen, J. H., Eisenberg, M. 2021; 12 (5): 1150-1156

    Abstract

    BACKGROUND: In academic hospitals, housestaff (interns, residents, and fellows) are a core user group of clinical information technology (IT) systems, yet are often relegated to being recipients of change, rather than active partners in system improvement. These information systems are an integral part of health care delivery and formal efforts to involve and educate housestaff are nascent. OBJECTIVE: This article develops a sustainable forum for effective engagement of housestaff in hospital informatics initiatives and creates opportunities for professional development. METHODS: A housestaff-led IT council was created within an academic medical center and integrated with informatics and graduate medical education leadership. The Council was designed to provide a venue for hands-on clinical informatics educational experiences to housestaff across all specialties. RESULTS: In the first year, five housestaff co-chairs and 50 members were recruited. More than 15 projects were completed with substantial improvements made to clinical systems impacting more than 1,300 housestaff and with touchpoints to nearly 3,000 staff members. Council leadership was integrally involved in hospital governance committees and became the go-to source for housestaff input on informatics efforts. Positive experiences informed members' career development toward informatics roles. Key lessons learned in building for success are discussed. CONCLUSION: The council model has effectively engaged housestaff as learners, local champions, and key informatics collaborators, with positive impact for the participating members and the institution. Requiring few resources for implementation, the model should be replicable at other institutions.

    View details for DOI 10.1055/s-0041-1740258

    View details for PubMedID 34879406