Bio


Bethel R. Mieso, MD, is a general pediatrician and clinical informatics fellow at Stanford Medicine whose work sits at the intersection of operational informatics, artificial intelligence ethics, pediatric care, and health equity. Dr. Mieso has played a key role in the enterprise-wide rollout of DAX Copilot at Stanford, leading ethical and regulatory guidance, trainee deployment, and patient-facing education. She has led a post-deployment evaluation of program director AI scribe policies across training programs, with findings informing strategic guidance for GME leaders nationwide, work that extends to her contributions to a national multi-institutional collaborative on AI in graduate medical education. Her research centers patient and family perspectives on ambient AI scribes in pediatric settings, shaping how health systems approach consent, communication, and trust with AI-assisted care.

Dr. Mieso's work merges operational informatics with strategic AI implementation: streamlining clinical workflows, reducing provider burden, and ensuring that emerging technologies serve patients equitably. She holds a BS in Biology from San Jose State University, an MD from Case Western Reserve University School of Medicine, and completed her pediatrics residency at Stanford Medicine.

Clinical Focus


  • Clinical Informatics
  • General Pediatrics
  • Pediatrics

Boards, Advisory Committees, Professional Organizations


  • Member, American Medical Informatics Association (2023 - Present)
  • Member, American Academy of Pediatrics (2021 - Present)

Professional Education


  • Fellowship: Stanford University Clinical Informatics Fellowship (2026) CA
  • Board Certification: American Board of Pediatrics, Pediatrics (2024)
  • Residency: Stanford University Pediatric Residency at Lucile Packard Children's Hospital (2024) CA
  • Medical Education: Case Western Reserve University School of Medicine (2021) OH
  • Bachelor of Science, San Jose State University, Biology (2014)

All Publications


  • Holistic evaluation of large language models for medical tasks with MedHELM. Nature medicine Bedi, S., Cui, H., Fuentes, M., Unell, A., Wornow, M., Banda, J. M., Kotecha, N., Keyes, T., Mai, Y., Oez, M., Qiu, H., Jain, S., Schettini, L., Kashyap, M., Fries, J. A., Swaminathan, A., Chung, P., Haredasht, F. N., Lopez, I., Aali, A., Tse, G., Nayak, A., Vedak, S., Jain, S. S., Patel, B., Fayanju, O., Shah, S., Goh, E., Yao, D. H., Soetikno, B., Reis, E., Gatidis, S., Divi, V., Capasso, R., Saralkar, R., Chiang, C. C., Jindal, J., Pham, T., Ghoddusi, F., Lin, S., Chiou, A. S., Hong, H. J., Roy, M., Gensheimer, M. F., Patel, H., Schulman, K., Dash, D., Char, D., Downing, L., Grolleau, F., Black, K., Mieso, B., Zahedivash, A., Yim, W. W., Sharma, H., Lee, T., Kirsch, H., Lee, J., Ambers, N., Lugtu, C., Sharma, A., Mawji, B., Alekseyev, A., Zhou, V., Kakkar, V., Helzer, J., Revri, A., Bannett, Y., Daneshjou, R., Chen, J., Alsentzer, E., Morse, K., Ravi, N., Aghaeepour, N., Kennedy, V., Chaudhari, A., Wang, T., Koyejo, S., Lungren, M. P., Horvitz, E., Liang, P., Pfeffer, M. A., Shah, N. H. 2026

    Abstract

    While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, a clinician-validated taxonomy organizing medical AI applications into five categories that mirror real clinical tasks: clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis) and administration (scheduling, workflow coordination). These encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, systematic comparison of nine frontier LLMs (Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3 and o3-mini) using an automated LLM-jury evaluation method. Our LLM-jury uses multiple AI evaluators to assess model outputs against expert-defined criteria. Advanced reasoning models (DeepSeek R1, o3-mini) demonstrated superior performance with win rates of 66%, although Claude 3.5 Sonnet achieved comparable results at 15% lower computational cost. These results not only highlight current model capabilities but also demonstrate how MedHELM could enable evidence-based selection of medical AI systems for healthcare applications.

    View details for DOI 10.1038/s41591-025-04151-2

    View details for PubMedID 41559415

    View details for PubMedCentralID 10916499

  • Decoding the Reference Letter: Strategies to Reduce Unintentional Gender Bias in Letters of Recommendation. MedEdPORTAL : the journal of teaching and learning resources Mieso, B. R., Barnett, J. F., Otero, T. M., Berquist, S. W., Perez, F. D., Han, P., Bhargava, S., Atasuntseva, A., Yemane, L. 2024; 20: 11419

    Abstract

    There is a growing body of literature on gender bias in letters of recommendation (LORs) in academic medicine and the negative effect of bias on promotion and career advancement. Thus, increasing knowledge about gender bias and developing skills to mitigate it is important for advancing gender equity in medicine. This workshop aims to provide participants with knowledge about linguistic bias (focused on gender), how to recognize it, and strategies to apply to mitigate it when writing LORs. We developed an interactive 60-minute workshop for faculty and graduate medical education program directors consisting of didactics, reflection exercises, and group activities. We used a postworkshop survey to evaluate the effectiveness of the workshop. Descriptive statistics were used to analyze Likert-scale questions and a thematic content analysis for open-ended prompts. We presented the workshop four times (two local and two national conferences) with one in-person and one virtual format for each. There were 50 participants who completed a postworkshop survey out of 74 total participants (68% response rate). Ninety-nine percent of participants felt the workshop met its educational objectives, and 100% felt it was a valuable use of their time. Major themes described for intended behavior change included utilization of the gender bias calculator, mindful use and balance of agentic versus communal traits, closer attention to letter length, and dissemination of this knowledge to colleagues. This workshop was an effective method for helping participants recognize gender bias when writing LORs and learn strategies to mitigate it.

    View details for DOI 10.15766/mep_2374-8265.11419

    View details for PubMedID 38974126

    View details for PubMedCentralID PMC11224141

  • Mobile Phone Applications to Support Breastfeeding Among African-American Women: a Scoping Review. Journal of racial and ethnic health disparities Mieso, B., Neudecker, M., Furman, L. 2022; 9 (1): 32-51

    Abstract

    Racial disparities persist with respect to breastfeeding. The use of health e-technology is increasing, with promise for a role in improving breastfeeding outcomes. We undertook a scoping review of both individual breastfeeding apps and the literature on breastfeeding apps to map the available evidence on app-based breastfeeding support for African-American mothers. A systematic search of online databases identified 241 English language papers published on or before June 2020 that included e-technology in support of breastfeeding. We included those that (1) described individual human subjects research studies utilizing any research design, (2) described app-based breastfeeding support, and (3) could be pertinent for African-American mothers, and assessed for inclusion and relevance for this population. We also searched app stores for breastfeeding apps, and evaluated features with a rubric. Our aim was to identify if gaps exist relative to breastfeeding support for African-Americans. Of the 15 publications meeting inclusion criteria, 9 focused on app development, 4 examined user experience, and 3 examined breastfeeding outcomes with use of an app (one study overlapped categories). The percentage of African-American participants ranged from 100% (2 studies) to none (7 studies); 3 studies (20%) focused on African-American mothers' breastfeeding experience. Of 77 apps that met inclusion criteria, just one was both breastfeeding-focused by content and targeted for African-Americans by picture predominance. The quality of studies was generally high and many included African-American participants, but research focused on breastfeeding apps specifically for African-American mothers/parents is limited, creating a meaningful gap in the literature.

    View details for DOI 10.1007/s40615-020-00927-z

    View details for PubMedID 33219430

    View details for PubMedCentralID 6715261

  • Beyond Statistics: Uncovering the Roots of Racial Disparities in Breastfeeding. Pediatrics Mieso, B. R., Burrow, H., Lam, S. K. 2021; 147 (5)

    View details for DOI 10.1542/peds.2020-037887

    View details for PubMedID 33833073