Keith Morse

Clinical Associate Professor, Pediatrics

Practices at Stanford Medicine Children's Health

Bio

Keith Morse, MD, MBA, serves as the Chief Medical Informatics Officer (CMIO) for Stanford Medicine Children's Health. He practices clinically as a pediatric hospitalist and is the Program Director for Stanford's Clinical Informatics fellowship.

Clinical Focus

Pediatric Hospital Medicine

Academic Appointments

Clinical Associate Professor, Pediatrics

Administrative Appointments

Chief Medical Informatics Officer, Stanford Medicine Children's Health (2025 - Present)
Program Director, Clinical Informatics Fellowship, Stanford Medicine (2024 - Present)

Professional Education

Board Certification: American Board of Preventive Medicine, Clinical Informatics (2021)
Medical Education: Sidney Kimmel Medical College Thomas Jefferson University (2015) PA
Fellowship: Stanford Hospital and Clinics (2020) CA
Board Certification: American Board of Pediatrics, Pediatrics (2018)
Residency: Phoenix Children's Hospital Pediatric Residency (2018) AZ
Fellowship, Stanford University, Clinical Informatics (2020)
Residency, Phoenix Children's Hospital, Pediatrics (2018)
MD, Jefferson Medical College (2015)
MBA, Washington University in St. Louis (2009)

Contact

Academic
keith.morse@stanford.edu

Department: Peds/Hospital Medicine
Position: Clinical Associate Professor

Clinical (Primary)
Division of Pediatric Hospital Medicine
300 Pasteur Dr MC 5776
Stanford, CA 94305
(650) 736-4423 (office)
(650) 736-6690 (fax)

All Publications

Bridging the Gap: Consensus-Based Considerations for AI Usefulness in Healthcare. The American journal of bioethics : AJOB Salwei, M. E., Morse, K., Saria, S., Shah, N. H., Bedoya, A., Beyer, M., Munoz Del Rio, A., Chornenky, D., Lin, A., Ruparel, S., Kortsch, D., Barbarooah, P., Hanger, M., Beecy, A. N., Elmore, M., Economou-Zavlanos, N. J. 2026; 26 (2): 1-6

View details for DOI 10.1080/15265161.2026.2617850

View details for PubMedID 41678674
Holistic evaluation of large language models for medical tasks with MedHELM. Nature medicine Bedi, S., Cui, H., Fuentes, M., Unell, A., Wornow, M., Banda, J. M., Kotecha, N., Keyes, T., Mai, Y., Oez, M., Qiu, H., Jain, S., Schettini, L., Kashyap, M., Fries, J. A., Swaminathan, A., Chung, P., Haredasht, F. N., Lopez, I., Aali, A., Tse, G., Nayak, A., Vedak, S., Jain, S. S., Patel, B., Fayanju, O., Shah, S., Goh, E., Yao, D. H., Soetikno, B., Reis, E., Gatidis, S., Divi, V., Capasso, R., Saralkar, R., Chiang, C. C., Jindal, J., Pham, T., Ghoddusi, F., Lin, S., Chiou, A. S., Hong, H. J., Roy, M., Gensheimer, M. F., Patel, H., Schulman, K., Dash, D., Char, D., Downing, L., Grolleau, F., Black, K., Mieso, B., Zahedivash, A., Yim, W. W., Sharma, H., Lee, T., Kirsch, H., Lee, J., Ambers, N., Lugtu, C., Sharma, A., Mawji, B., Alekseyev, A., Zhou, V., Kakkar, V., Helzer, J., Revri, A., Bannett, Y., Daneshjou, R., Chen, J., Alsentzer, E., Morse, K., Ravi, N., Aghaeepour, N., Kennedy, V., Chaudhari, A., Wang, T., Koyejo, S., Lungren, M. P., Horvitz, E., Liang, P., Pfeffer, M. A., Shah, N. H. 2026

Abstract

While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, a clinician-validated taxonomy organizing medical AI applications into five categories that mirror real clinical tasks: clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis) and administration (scheduling, workflow coordination). These encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, systematic comparison of nine frontier LLMs (Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3 and o3-mini) using an automated LLM-jury evaluation method. Our LLM-jury uses multiple AI evaluators to assess model outputs against expert-defined criteria. Advanced reasoning models (DeepSeek R1, o3-mini) demonstrated superior performance with win rates of 66%, although Claude 3.5 Sonnet achieved comparable results at 15% lower computational cost. These results not only highlight current model capabilities but also demonstrate how MedHELM could enable evidence-based selection of medical AI systems for healthcare applications.

View details for DOI 10.1038/s41591-025-04151-2

View details for PubMedID 41559415

View details for PubMedCentralID 10916499
Artificial intelligence-generated draft replies to patient messages in pediatrics. JAMIA open Liang, A. S., Vedak, S., Dussaq, A., Yao, D., Villarreal, J. A., Thomas, S., Chen, N., Townsend, T., Pageler, N. M., Morse, K. 2025; 8 (6): ooaf159

Abstract

Objectives: This study describes the utilization and experiences of artificial intelligence (AI)-generated draft responses to patient messages in pediatric ambulatory clinicians and contextualizes their experiences in relation to those of adult specialty clinicians. Materials and Methods: A prospective pilot was conducted from September 2023 to August 2024 in 2 pediatric clinics (General Pediatric and Adolescent Medicine) and 2 obstetric clinics (Reproductive Endocrinology and Infertility and General Obstetrics) within an academic health system in Northern California. Participants included physician, nurse, and medical assistant volunteers. The intervention involved a feature utilizing large language models embedded in the electronic health record to generate draft responses. Proportion of AI-generated draft used was collected, as were prepilot and follow-up surveys. Results: A total of 61 clinicians (26 pediatric, 35 obstetric) enrolled, with 46 (75%) completing both surveys. Pediatric clinicians utilized 13.3% (95% CI, 12.3%-14.4%) of AI-generated drafts, and usage rates when responding to patients vs their proxies were similar (15% vs 12.9%, P=.24). Despite using AI-generated drafts significantly less than obstetric clinicians (18.3% [17.2%-19.5%], P<.0001), pediatric clinicians reported a significant reduction in perceived task load (NASA Task Load Index: 59.9-50.9, P=.04) and were more likely to recommend the tool (LTR: 7.0 vs 5.2, P=.04). Discussion and Conclusion: Pediatric clinicians used AI-generated drafts at a rate within previously reported ranges in adult specialties and experienced utility. These findings suggest this tool has potential for enhancing efficiency and reducing task load in pediatric care.

View details for DOI 10.1093/jamiaopen/ooaf159

View details for PubMedID 41293120
Incremental Healthcare Utilization and Outcomes for Pediatric Cancer VS. Non-Cancer Patients: Leveraging the Observational Medical Outcomes Partnership Common Data Model for Multi-Center Research Chen, Y., Yi, M., Aftandilian, C., Beauchemin, M., Cunningham, C., Larimer, E., Oberg, J., May, B., Morse, K., Natarajan, K., Noyd, D., Soucek, V., Khurana, R., Sim, P., Guo, L., Sung, L. WILEY. 2025: S472-S473

View details for Web of Science ID 001671528703216
A multifaceted approach to advancing data quality and fitness standards in multi-institutional networks. Journal of the American Medical Informatics Association : JAMIA Razzaghi, H., Dickinson, K., Wieand, K., Boss, S., Weidlich, H., Huang, Y., Morse, K., Mutyala, S. K., Nandagopal, J. P., Viswanathan, K., Forrest, C. B., Bailey, L. C. 2025

Abstract

To construct a data quality (DQ) system that incorporates combinations of methods to evaluate data characteristics and analytic fitness across research questions for multiple uses. Drawing from experience of other data quality programs, network data extraction needs, and recurring study requirements, we developed 5 standards to guide development of a modular, multifaceted data quality system. These included annotation and documentation, ability to measure research readiness, reproducibility across networks, flexibility for the user, and interpretability to research and project teams. Implementation of checks based on these principles focused on reusability and interactive visualization of results. We identified 10 check types producing over 444 check applications and deployed them in 2 multi-institutional networks. Check types span structural conformance to a data model, utility for common research needs, and study-specific customization. All check types are customizable without dependencies between them. A dashboard visualizes results, permitting adjustments based on number of data sources, need for source masking, and the user's focus. All components can be applied as written to any data source using OMOP and are readily modified for other data models. We have extended previous work through our novel and multifaceted approach to data quality assessment, addressing needs in both network data improvement and research usage. We developed a capable and deployable system rather than tailoring to specific use cases. Our novel DQ assessment system provides essential components for future standardization and collaboration to improve fitness of clinical data for intended use.

View details for DOI 10.1093/jamia/ocaf181

View details for PubMedID 41128352
A natural language processing pipeline for identifying pediatric long COVID symptoms and functional impacts in freeform clinical notes: a RECOVER study. JAMIA open Bunnell, H. T., Reedy, C., Lorman, V., Jhaveri, R., Rivera-Sepulveda, A., Salamon, K. S., Patel, P. B., Morse, K. E., Davenport, M. A., Cowell, L. G., Utidjian, L., Christakis, D. A., Rao, S., Sills, M. R., Case, A., Mendonca, E. A., Taylor, B. W., Rutter, J., Martinez, A. T., Letts, R., Bailey, L. C., Forrest, C. B. 2025; 8 (5): ooaf089

Abstract

To develop a natural language processing (NLP) pipeline for unstructured electronic health record (EHR) data to identify symptoms and functional impacts associated with Long COVID in children. We analyzed 48 287 outpatient progress notes from 10 618 pediatric patients from 12 institutions. We evaluated notes obtained 28 to 179 days after a COVID-19 diagnosis or positive test. Two samples were examined: patients with evidence of Long COVID and patients with acute COVID but no evidence of Long COVID based on diagnostic codes. The pipeline identified clinical concepts associated with 21 symptoms and 4 functional impact categories. Subject matter experts (SMEs) screened a sample of 4586 terms from the NLP output to assess pipeline accuracy. Prevalence and concordance of each of the 25 concepts was compared between the 2 patient samples. A binary assertion measure comparing SME and NLP assertions showed moderate accuracy (N = 4133; F1 = .80) and improved substantially when only high-confidence SME assertions were considered (N = 2043; F1 = .90). Overall, the 25 Long COVID concept categories were markedly more prevalent in the presumptive Long COVID cohort, and differences were noted between concepts identified in notes versus structured data. This preliminary analysis illustrates the additional insight into a syndrome such as Long COVID gained from incorporating notes data, characterizing symptoms and functional impacts. These data support the importance of incorporating NLP methodology when possible into designing computable phenotypes and to accurately characterize patients with Long COVID.

View details for DOI 10.1093/jamiaopen/ooaf089

View details for PubMedID 40918941

View details for PubMedCentralID PMC12409404
AI in Health Care: The Leadership Role of Board-Certified Clinical Informaticists. Applied clinical informatics Morse, K. E., Pageler, N. M., Shah, N. H., Townsend, T., Sharp, C., Pfeffer, M. A. 2025; 16 (3): 612-613

View details for DOI 10.1055/a-2556-4698

View details for PubMedID 40602801

View details for PubMedCentralID PMC12221687
QUEST-AI: A System for Question Generation, Verification, and Refinement using AI for USMLE-Style Exams. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Bedi, S., Fleming, S. L., Chiang, C. C., Morse, K., Kumar, A., Patel, B., Jindal, J. A., Davenport, C., Yamaguchi, C., Shah, N. H. 2025; 30: 54-69

Abstract

The United States Medical Licensing Examination (USMLE) is a critical step in assessing the competence of future physicians, yet the process of creating exam questions and study materials is both time-consuming and costly. While Large Language Models (LLMs), such as OpenAI's GPT-4, have demonstrated proficiency in answering medical exam questions, their potential in generating such questions remains underexplored. This study presents QUEST-AI, a novel system that utilizes LLMs to (1) generate USMLE-style questions, (2) identify and flag incorrect questions, and (3) correct errors in the flagged questions. We evaluated this system's output by constructing a test set of 50 LLM-generated questions mixed with 50 human-generated questions and conducting a two-part assessment with three physicians and two medical students. The assessors attempted to distinguish between LLM and human-generated questions and evaluated the validity of the LLM-generated content. A majority of exam questions generated by QUEST-AI were deemed valid by a panel of three clinicians, with strong correlations between performance on LLM-generated and human-generated questions. This pioneering application of LLMs in medical education could significantly increase the ease and efficiency of developing USMLE-style medical exam content, offering a cost-effective and accessible alternative for exam preparation.

View details for DOI 10.1142/9789819807024_0005

View details for PubMedID 40299581
Kidney Function Following COVID-19 in Children and Adolescents. JAMA network open Li, L., Zhou, T., Lu, Y., Chen, J., Lei, Y., Wu, Q., Arnold, J., Becich, M. J., Bisyuk, Y., Blecker, S., Chrischilles, E., Christakis, D. A., Geary, C. R., Jhaveri, R., Lenert, L., Liu, M., Mirhaji, P., Morizono, H., Mosa, A. S., Onder, A. M., Patel, R., Smoyer, W. E., Taylor, B. W., Williams, D. A., Dixon, B. P., Flynn, J. T., Gluck, C., Harshman, L. A., Mitsnefes, M. M., Modi, Z. J., Pan, C. G., Patel, H. P., Verghese, P. S., Forrest, C. B., Denburg, M. R., Chen, Y. 2025; 8 (4): e254129

Abstract

It remains unclear whether children and adolescents with SARS-CoV-2 infection are at heightened risk for long-term kidney complications. To investigate whether SARS-CoV-2 infection is associated with an increased risk of postacute kidney outcomes among pediatric patients, including those with preexisting kidney disease or acute kidney injury (AKI). This retrospective cohort study used data from 19 health institutions in the National Institutes of Health Researching COVID to Enhance Recovery (RECOVER) initiative from March 1, 2020, to May 1, 2023 (follow-up ≤2 years completed December 1, 2024; index date cutoff, December 1, 2022). Participants included children and adolescents (aged <21 years) with at least 1 baseline visit (24 months to 7 days before the index date) and at least 1 follow-up visit (28 to 179 days after the index date). SARS-CoV-2 infection, determined by positive laboratory test results (polymerase chain reaction, antigen, or serologic) or relevant clinical diagnoses. A comparison group included children with documented negative test results and no history of SARS-CoV-2 infection. Outcomes included new-onset chronic kidney disease (CKD) stage 2 or higher or CKD stage 3 or higher among those without preexisting CKD; composite kidney events (≥50% decline in estimated glomerular filtration rate [eGFR], eGFR ≤15 mL/min/1.73 m2, dialysis, transplant, or end-stage kidney disease diagnosis), and at least 30%, 40%, or 50% eGFR decline among those with preexisting CKD or acute-phase AKI. Hazard ratios (HRs) were estimated using Cox proportional hazards regression models with propensity score stratification. Among 1 900 146 pediatric patients (487 378 with and 1 412 768 without COVID-19), 969 937 (51.0%) were male, the mean (SD) age was 8.2 (6.2) years, and a range of comorbidities was represented. SARS-CoV-2 infection was associated with higher risk of new-onset CKD stage 2 or higher (HR, 1.17; 95% CI, 1.12-1.22) and CKD stage 3 or higher (HR, 1.35; 95% CI, 1.13-1.62). In those with preexisting CKD, COVID-19 was associated with an increased risk of composite kidney events (HR, 1.15; 95% CI, 1.04-1.27) at 28 to 179 days. Children with acute-phase AKI had elevated HRs (1.29; 95% CI, 1.21-1.38) at 90 to 179 days for composite outcomes. In this large US cohort study of children and adolescents, SARS-CoV-2 infection was associated with a higher risk of adverse postacute kidney outcomes, particularly among those with preexisting CKD or AKI, suggesting the need for vigilant long-term monitoring.

View details for DOI 10.1001/jamanetworkopen.2025.4129

View details for PubMedID 40214993

View details for PubMedCentralID PMC11992607
Natural Language Processing: Set to Transform Pediatric Research. Hospital pediatrics Bannett, Y., Bassett, H. K., Morse, K. E. 2024

View details for DOI 10.1542/hpeds.2024-008115

View details for PubMedID 39679589
Large Language Model Responses to Adolescent Patient and Proxy Messages. JAMA pediatrics Tse, G., Zahedivash, A., Anoshiravani, A., Carlson, J., Haberkorn, W., Morse, K. E. 2024

View details for DOI 10.1001/jamapediatrics.2024.4438

View details for PubMedID 39495530
Standing on FURM Ground: A Framework for Evaluating Fair, Useful, and Reliable AI Models in Health Care Systems NEJM CATALYST INNOVATIONS IN CARE DELIVERY Callahan, A., McElfresh, D., Banda, J. M., Bunney, G., Char, D., Chen, J., Corbin, C. K., Dash, D., Downing, N. L., Jain, S. S., Kotecha, N., Masterson, J., Mello, M. M., Morse, K., Nallan, S., Pandya, A., Revri, A., Sharma, A., Sharp, C., Thapa, R., Wornow, M., Youssef, A., Pfeffer, M. A., Shah, N. H. 2024; 5 (10)

View details for DOI 10.1056/CAT.24.0131

View details for Web of Science ID 001422126900001
Accuracy of a Proprietary Large Language Model in Labeling Obstetric Incident Reports. Joint Commission journal on quality and patient safety Johnson, J., Brown, C., Lee, G., Morse, K. 2024

Abstract

BACKGROUND: Using the data collected through incident reporting systems is challenging, as it is a large volume of primarily qualitative information. Large language models (LLMs), such as ChatGPT, provide novel capabilities in text summarization and labeling that could support safety data trending and early identification of opportunities to prevent patient harm. This study assessed the capability of a proprietary LLM (GPT-3.5) to automatically label a cross-sectional sample of real-world obstetric incident reports. METHODS: A sample of 370 incident reports submitted to inpatient obstetric units between December 2022 and May 2023 was extracted. Human-annotated labels were assigned by a clinician reviewer and considered gold standard. The LLM was prompted to label incident reports relying solely on its pretrained knowledge and information included in the prompt. Primary outcomes assessed were sensitivity, specificity, positive predictive value, and negative predictive value. A secondary outcome assessed the human-perceived quality of the model's justification for the label(s) applied. RESULTS: The LLM demonstrated the ability to label incident reports with high sensitivity and specificity. The model applied a total of 79 labels compared to the reviewer's 49 labels. Overall sensitivity for the model was 85.7%, and specificity was 97.9%. Positive and negative predictive values were 53.2% and 99.6%, respectively. For 60.8% of labels, the reviewer approved of the model's justification for applying the label. CONCLUSION: The proprietary LLM demonstrated the ability to label obstetric incident reports with high sensitivity and specificity. LLMs offer the potential to enable more efficient use of data from incident reporting systems.

View details for DOI 10.1016/j.jcjq.2024.08.001

View details for PubMedID 39256071
A multi-center study on the adaptability of a shared foundation model for electronic health records. NPJ digital medicine Guo, L. L., Fries, J., Steinberg, E., Fleming, S. L., Morse, K., Aftandilian, C., Posada, J., Shah, N., Sung, L. 2024; 7 (1): 171

Abstract

Foundation models are transforming artificial intelligence (AI) in healthcare by providing modular components adaptable for various downstream tasks, making AI development more scalable and cost-effective. Foundation models for structured electronic health records (EHR), trained on coded medical records from millions of patients, demonstrated benefits including increased performance with fewer training labels, and improved robustness to distribution shifts. However, questions remain on the feasibility of sharing these models across hospitals and their performance in local tasks. This multi-center study examined the adaptability of a publicly accessible structured EHR foundation model (FMSM), trained on 2.57 M patient records from Stanford Medicine. Experiments used EHR data from The Hospital for Sick Children (SickKids) and Medical Information Mart for Intensive Care (MIMIC-IV). We assessed both adaptability via continued pretraining on local data, and task adaptability compared to baselines of locally training models from scratch, including a local foundation model. Evaluations on 8 clinical prediction tasks showed that adapting the off-the-shelf FMSM matched the performance of gradient boosting machines (GBM) locally trained on all data while providing a 13% improvement in settings with few task-specific training labels. Continued pretraining on local data showed FMSM required fewer than 1% of training examples to match the fully trained GBM's performance, and was 60 to 90% more sample-efficient than training local foundation models from scratch. Our findings demonstrate that adapting EHR foundation models across hospitals provides improved prediction performance at less cost, underscoring the utility of base foundation models as modular components to streamline the development of healthcare AI.

View details for DOI 10.1038/s41746-024-01166-w

View details for PubMedID 38937550

View details for PubMedCentralID 10396962
Using a Large Language Model to Identify Adolescent Patient Portal Account Access by Guardians. JAMA network open Liang, A. S., Vedak, S., Dussaq, A., Yao, D. H., Morse, K., Ip, W., Pageler, N. M. 2024; 7 (6): e2418454

View details for DOI 10.1001/jamanetworkopen.2024.18454

View details for PubMedID 38916895
Systematic data quality assessment of electronic health record data to evaluate study-specific fitness: Report from the PRESERVE research study. PLOS digital health Razzaghi, H., Goodwin Davies, A., Boss, S., Bunnell, H. T., Chen, Y., Chrischilles, E. A., Dickinson, K., Hanauer, D., Huang, Y., Ilunga, K. T., Katsoufis, C., Lehmann, H., Lemas, D. J., Matthews, K., Mendonca, E. A., Morse, K., Ranade, D., Rosenman, M., Taylor, B., Walters, K., Denburg, M. R., Forrest, C. B., Bailey, L. C. 2024; 3 (6): e0000527

Abstract

Study-specific data quality testing is an essential part of minimizing analytic errors, particularly for studies making secondary use of clinical data. We applied a systematic and reproducible approach for study-specific data quality testing to the analysis plan for PRESERVE, a 15-site, EHR-based observational study of chronic kidney disease in children. This approach integrated widely adopted data quality concepts with healthcare-specific evaluation methods. We implemented two rounds of data quality assessment. The first produced high-level evaluation using aggregate results from a distributed query, focused on cohort identification and main analytic requirements. The second focused on extended testing of row-level data centralized for analysis. We systematized reporting and cataloguing of data quality issues, providing institutional teams with prioritized issues for resolution. We tracked improvements and documented anomalous data for consideration during analyses. The checks we developed identified 115 and 157 data quality issues in the two rounds, involving completeness, data model conformance, cross-variable concordance, consistency, and plausibility, extending traditional data quality approaches to address more complex stratification and temporal patterns. Resolution efforts focused on higher priority issues, given finite study resources. In many cases, institutional teams were able to correct data extraction errors or obtain additional data, avoiding exclusion of 2 institutions entirely and resolving 123 other gaps. Other results identified complexities in measures of kidney function, bearing on the study's outcome definition. Where limitations such as these are intrinsic to clinical data, the study team must account for them in conducting analyses. This study rigorously evaluated fitness of data for intended use. The framework is reusable and built on a strong theoretical underpinning. Significant data quality issues that would have otherwise delayed analyses or made data unusable were addressed. This study highlights the need for teams combining subject-matter and informatics expertise to address data quality when working with real world data.

View details for DOI 10.1371/journal.pdig.0000527

View details for PubMedID 38935590

View details for PubMedCentralID PMC11210795
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records. Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence Fleming, S. L., Lozano, A., Haberkorn, W. J., Jindal, J. A., Reis, E., Thapa, R., Blankemeier, L., Genkins, J. Z., Steinberg, E., Nayak, A., Patel, B., Chiang, C. C., Callahan, A., Huo, Z., Gatidis, S., Adams, S., Fayanju, O., Shah, S. J., Savage, T., Goh, E., Chaudhari, A. S., Aghaeepour, N., Sharp, C., Pfeffer, M. A., Liang, P., Chen, J. H., Morse, K. E., Brunskill, E. P., Fries, J. A., Shah, N. H. 2024; 38 (20): 22021-22030

Abstract

The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and 8.3% drop in accuracy moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. MedAlign is provided under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.

View details for DOI 10.1609/aaai.v38i20.30205

View details for PubMedID 41584261

View details for PubMedCentralID PMC12826664
Learning competing risks across multiple hospitals: one-shot distributed algorithms. Journal of the American Medical Informatics Association : JAMIA Zhang, D., Tong, J., Jing, N., Yang, Y., Luo, C., Lu, Y., Christakis, D. A., Güthe, D., Hornig, M., Kelleher, K. J., Morse, K. E., Rogerson, C. M., Divers, J., Carroll, R. J., Forrest, C. B., Chen, Y. 2024

Abstract

To characterize the complex interplay between multiple clinical conditions in a time-to-event analysis framework using data from multiple hospitals, we developed two novel one-shot distributed algorithms for competing risk models (ODACoR). By applying our algorithms to the EHR data from eight national children's hospitals, we quantified the impacts of a wide range of risk factors on the risk of post-acute sequelae of SARS-COV-2 (PASC) among children and adolescents. Our ODACoR algorithms are effectively executed due to their devised simplicity and communication efficiency. We evaluated our algorithms via extensive simulation studies as applications to quantification of the impacts of risk factors for PASC among children and adolescents using data from eight children's hospitals including the Children's Hospital of Philadelphia, Cincinnati Children's Hospital Medical Center, Children's Hospital of Colorado covering over 6.5 million pediatric patients. The accuracy of the estimation was assessed by comparing the results from our ODACoR algorithms with the estimators derived from the meta-analysis and the pooled data. The meta-analysis estimator showed a high relative bias (∼40%) when the clinical condition is relatively rare (∼0.5%), whereas ODACoR algorithms exhibited a substantially lower relative bias (∼0.2%). The estimated effects from our ODACoR algorithms were on par with the estimates from the pooled data, suggesting the high reliability of our federated learning algorithms. In contrast, the meta-analysis estimate failed to identify risk factors such as age, gender, chronic conditions history, and obesity, compared to the pooled data. Our proposed ODACoR algorithms are communication-efficient, highly accurate, and suitable to characterize the complex interplay between multiple clinical conditions. Our study demonstrates that our ODACoR algorithms are communication-efficient and can be widely applicable for analyzing multiple clinical conditions in a time-to-event analysis framework.

View details for DOI 10.1093/jamia/ocae027

View details for PubMedID 38456459
Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare. BMC medical informatics and decision making Guo, L. L., Morse, K. E., Aftandilian, C., Steinberg, E., Fries, J., Posada, J., Fleming, S. L., Lemmon, J., Jessa, K., Shah, N., Sung, L. 2024; 24 (1): 51

Abstract

Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored. The primary objective was to describe lab- and diagnosis-based labels for 7 selected outcomes at three institutions. Secondary objectives were to describe agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels. This study included three cohorts: SickKids from The Hospital for Sick Children, and StanfordPeds and StanfordAdults from Stanford Medicine. We included seven clinical outcomes with lab-based definitions: acute kidney injury, hyperkalemia, hypoglycemia, hyponatremia, anemia, neutropenia and thrombocytopenia. For each outcome, we created four lab-based labels (abnormal, mild, moderate and severe) based on test result and one diagnosis-based label. Proportion of admissions with a positive label were presented for each outcome stratified by cohort. Using lab-based labels as the gold standard, agreement using Cohen's Kappa, sensitivity and specificity were calculated for each lab-based severity level. The number of admissions included were: SickKids (n = 59,298), StanfordPeds (n = 24,639) and StanfordAdults (n = 159,985). The proportion of admissions with a positive diagnosis-based label was significantly higher for StanfordPeds compared to SickKids across all outcomes, with odds ratio (99.9% confidence interval) for abnormal diagnosis-based label ranging from 2.2 (1.7-2.7) for neutropenia to 18.4 (10.1-33.4) for hyperkalemia. Lab-based labels were more similar by institution. When using lab-based labels as the gold standard, Cohen's Kappa and sensitivity were lower at SickKids for all severity levels compared to StanfordPeds. Across multiple outcomes, diagnosis codes were consistently different between the two pediatric institutions. This difference was not explained by differences in test results. These results may have implications for machine learning model development and deployment.

View details for DOI 10.1186/s12911-024-02449-8

View details for PubMedID 38355486

View details for PubMedCentralID PMC10868117
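As an illustration of the agreement statistics named in the abstract above (Guo et al., 2024), the following is a minimal Python sketch, not the authors' code. It assumes binary per-admission labels with the lab-based label treated as the gold standard; the function name `agreement_metrics` and the toy labels in the usage note are hypothetical.

```python
def agreement_metrics(gold, pred):
    """Compare a diagnosis-based label (pred) against a lab-based label (gold).

    Both inputs are equal-length sequences of 0/1 values. This sketch assumes
    each class appears at least once in gold and that chance agreement is
    below 1; otherwise the divisions below are undefined.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    n = tp + fp + fn + tn

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    # Cohen's Kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)

    return {"sensitivity": sensitivity, "specificity": specificity, "kappa": kappa}
```

For example, `agreement_metrics([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0, 0, 1])` compares ten made-up admissions in which the diagnosis code captures only half of the lab-positive cases.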
Evaluation of a Large Language Model to Identify Confidential Content in Adolescent Encounter Notes. JAMA pediatrics Rabbani, N., Brown, C., Bedgood, M., Goldstein, R. L., Carlson, J. L., Pageler, N. M., Morse, K. E. 2024

View details for DOI 10.1001/jamapediatrics.2023.6032

View details for PubMedID 38252434

View details for PubMedCentralID PMC10804277
MEDALIGN: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records Fleming, S. L., Lozano, A., Haberkorn, W. J., Jindal, J. A., Reis, E., Thapa, R., Blankemeier, L., Genkins, J. Z., Steinberg, E., Nayak, A., Patel, B., Chiang, C., Callahan, A., Huo, Z., Gatidis, S., Adams, S., Fayanju, O., Shah, S. J., Savage, T., Goh, E., Chaudhari, A. S., Aghaeepour, N., Sharp, C., Pfeffer, M. A., Liang, P., Chen, J. H., Morse, K. E., Brunskill, E. P., Fries, J. A., Shah, N. H. edited by Wooldridge, M., Dy, J., Natarajan, S. ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE. 2024: 22021-22030

View details for Web of Science ID 001239985800017
Self-supervised machine learning using adult inpatient data produces effective models for pediatric clinical prediction tasks. Journal of the American Medical Informatics Association : JAMIA Lemmon, J., Guo, L. L., Steinberg, E., Morse, K. E., Fleming, S. L., Aftandilian, C., Pfohl, S. R., Posada, J. D., Shah, N., Fries, J., Sung, L. 2023

Abstract

Development of electronic health records (EHR)-based machine learning models for pediatric inpatients is challenged by limited training data. Self-supervised learning using adult data may be a promising approach to creating robust pediatric prediction models. The primary objective was to determine whether a self-supervised model trained in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients, for pediatric inpatient clinical prediction tasks. This retrospective cohort study used EHR data and included patients with at least one admission to an inpatient unit. One admission per patient was randomly selected. Adult inpatients were 18 years or older while pediatric inpatients were more than 28 days and less than 18 years. Admissions were temporally split into training (January 1, 2008 to December 31, 2019), validation (January 1, 2020 to December 31, 2020), and test (January 1, 2021 to August 1, 2022) sets. Primary comparison was a self-supervised model trained in adult inpatients versus count-based logistic regression models trained in pediatric inpatients. Primary outcome was mean area-under-the-receiver-operating-characteristic-curve (AUROC) for 11 distinct clinical outcomes. Models were evaluated in pediatric inpatients. When evaluated in pediatric inpatients, mean AUROC of self-supervised model trained in adult inpatients (0.902) was noninferior to count-based logistic regression models trained in pediatric inpatients (0.868) (mean difference = 0.034, 95% CI=0.014-0.057; P < .001 for noninferiority and P = .006 for superiority). Self-supervised learning in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients. This finding suggests transferability of self-supervised models trained in adult patients to pediatric patients, without requiring costly model retraining.

View details for DOI 10.1093/jamia/ocad175

View details for PubMedID 37639620
Pseudo-randomized testing of a discharge medication alert to reduce free-text prescribing. Applied clinical informatics Rabbani, N., Ho, M., Dash, D., Calway, T., Morse, K., Chadwick, W. 2023

Abstract

Pseudo-randomized testing can be applied to perform rigorous yet practical evaluations of clinical decision support tools. We apply this methodology to an interruptive alert aimed at reducing free-text prescriptions. Using free-text instead of structured computerized provider order entry elements can cause medication errors and inequity in care by bypassing medication-based clinical decision support tools and hindering automated translation of prescription instructions. Evaluate the effectiveness of an interruptive alert at reducing free-text prescriptions via pseudo-randomized testing using native electronic health records (EHR) functionality. Two versions of an EHR alert triggered when a provider attempted to sign a discharge free-text prescription. The visible version displayed an interruptive alert to the user, and a silent version triggered in the background, serving as a control. Providers were assigned to the visible and silent arms based on even/odd EHR provider IDs. The proportion of encounters with a free-text prescription was calculated across the groups. Alert trigger rates were compared in process control charts. Free-text prescriptions were analyzed to identify prescribing patterns. Over the 28 week study period, 143 providers triggered 695 alerts (345 visible and 350 silent). The proportions of encounters with free-text prescriptions were 83% (266/320) and 90% (273/303) in the intervention and control groups respectively (p-value = 0.01). For the active alert, median time to action was 31 seconds. Alert trigger rates between groups were similar over time. Ibuprofen, oxycodone, steroid tapers, and oncology-related prescriptions accounted for most free-text prescriptions. A majority of these prescriptions originated from user preference lists. An interruptive alert was associated with a modest reduction in free-text prescriptions. Furthermore, the majority of these prescriptions could have been reproduced using structured order entry fields. Targeting user preference lists shows promise for future intervention.

View details for DOI 10.1055/a-2068-6940

View details for PubMedID 37015344
A Natural Language Processing Model to Identify Confidential Content in Adolescent Clinical Notes. Applied clinical informatics Rabbani, N., Bedgood, M., Brown, C., Steinberg, E., Goldstein, R., Carlson, J., Pageler, N., Morse, K. 2023

Abstract

BACKGROUND: The 21st Century Cures Act mandates the immediate, electronic release of health information to patients. However, in the case of adolescents, special consideration is required to ensure that confidentiality is maintained. The detection of confidential content in clinical notes may support operational efforts to preserve adolescent confidentiality while implementing information sharing. OBJECTIVE: Determine if a natural language processing (NLP) algorithm can identify confidential content in adolescent clinical progress notes. METHODS: 1,200 outpatient adolescent progress notes written between 2016 and 2019 were manually annotated to identify confidential content. Labeled sentences from this corpus were featurized and used to train a two-part logistic regression model, which provides both sentence-level and note-level probability estimates that a given text contains confidential content. This model was prospectively validated on a set of 240 progress notes written in May 2022. It was subsequently deployed in a pilot intervention to augment an ongoing operational effort to identify confidential content in progress notes. Note-level probability estimates were used to triage notes for review and sentence-level probability estimates were used to highlight high-risk portions of those notes to aid the manual reviewer. RESULTS: The prevalence of notes containing confidential content was 21% (255/1200) and 22% (53/240) in the train/test and validation cohorts. The ensemble logistic regression model achieved an AUROC of 90% and 88% in the test and validation cohorts. Its use in a pilot intervention identified outlier documentation practices and demonstrated efficiency gains over completely manual note review. DISCUSSION: An NLP algorithm can identify confidential content in progress notes with high accuracy. Its human-in-the-loop deployment in clinical operations augmented an ongoing operational effort to identify confidential content in adolescent progress notes. These findings suggest NLP may be used to support efforts to preserve adolescent confidentiality in the wake of the information blocking mandate.

View details for DOI 10.1055/a-2051-9764

View details for PubMedID 36898410
The Prevalence of Confidential Content in Adolescent Progress Notes Prior to the 21st Century Cures Act Information Blocking Mandate. Applied clinical informatics Bedgood, M., Rabbani, N., Brown, C., Goldstein, R., Carlson, J. L., Steinberg, E., Powell, A., Pageler, N. M., Morse, K. 2023; 14 (2): 337-344

Abstract

The 21st Century Cures Act information blocking final rule mandated the immediate and electronic release of health care data in 2020. There is anecdotal concern that a significant amount of information is documented in notes that would breach adolescent confidentiality if released electronically to a guardian. The purpose of this study was to quantify the prevalence of confidential information, based on California laws, within progress notes for adolescent patients that would be released electronically and assess differences in prevalence across patient demographics. This is a single-center retrospective chart review of outpatient progress notes written between January 1, 2016, and December 31, 2019, at a large suburban academic pediatric network. Notes were labeled into one of three confidential domains by five expert reviewers trained on a rubric defining confidential information for adolescents derived from California state law. Participants included a random sampling of eligible patients aged 12 to 17 years old at the time of note creation. Secondary analysis included prevalence of confidentiality across age, gender, language spoken, and patient race. Of 1,200 manually reviewed notes, 255 notes (21.3%) (95% confidence interval: 19-24%) contained confidential information. There was a similar distribution among gender and age and a majority of English speaking (83.9%) and white or Caucasian patients (41.2%) in the cohort. Confidential information was more likely to be found in notes for females (p < 0.05) as well as for English-speaking patients (p < 0.05). Older patients had a higher probability of notes containing confidential information (p < 0.05). This study demonstrates that there is a significant risk to breach adolescent confidentiality if historical progress notes are released electronically to proxies without further review or redaction. With increased sharing of health care data, there is a need to protect the privacy of the adolescents and prevent potential breaches of confidentiality.

View details for DOI 10.1055/s-0043-1767682

View details for PubMedID 37137339

View details for PubMedCentralID PMC10156443
A machine learning-based phenotype for long COVID in children: An EHR-based study from the RECOVER program. PloS one Lorman, V., Razzaghi, H., Song, X., Morse, K., Utidjian, L., Allen, A. J., Rao, S., Rogerson, C., Bennett, T. D., Morizono, H., Eckrich, D., Jhaveri, R., Huang, Y., Ranade, D., Pajor, N., Lee, G. M., Forrest, C. B., Bailey, L. C. 2023; 18 (8): e0289774

Abstract

As clinical understanding of pediatric Post-Acute Sequelae of SARS CoV-2 (PASC) develops, and hence the clinical definition evolves, it is desirable to have a method to reliably identify patients who are likely to have post-acute sequelae of SARS CoV-2 (PASC) in health systems data. In this study, we developed and validated a machine learning algorithm to classify which patients have PASC (distinguishing between Multisystem Inflammatory Syndrome in Children (MIS-C) and non-MIS-C variants) from a cohort of patients with positive SARS-CoV-2 test results in pediatric health systems within the PEDSnet EHR network. Patient features included in the model were selected from conditions, procedures, performance of diagnostic testing, and medications using a tree-based scan statistic approach. We used an XGBoost model, with hyperparameters selected through cross-validated grid search, and model performance was assessed using 5-fold cross-validation. Model predictions and feature importance were evaluated using Shapley Additive exPlanation (SHAP) values. The model provides a tool for identifying patients with PASC and an approach to characterizing PASC using diagnosis, medication, laboratory, and procedure features in health systems data. Using appropriate threshold settings, the model can be used to identify PASC patients in health systems data at higher precision for inclusion in studies or at higher recall in screening for clinical trials, especially in settings where PASC diagnosis codes are used less frequently or less reliably. Analysis of how specific features contribute to the classification process may assist in gaining a better understanding of features that are associated with PASC diagnoses.

View details for DOI 10.1371/journal.pone.0289774

View details for PubMedID 37561683
User-centred design for machine learning in health care: a case study from care management. BMJ health & care informatics Seneviratne, M. G., Li, R. C., Schreier, M., Lopez-Martinez, D., Patel, B. S., Yakubovich, A., Kemp, J. B., Loreaux, E., Gamble, P., El-Khoury, K., Vardoulakis, L., Wong, D., Desai, J., Chen, J. H., Morse, K. E., Downing, N. L., Finger, L. T., Chen, M., Shah, N. 2022; 29 (1)

Abstract

OBJECTIVES: Few machine learning (ML) models are successfully deployed in clinical practice. One of the common pitfalls across the field is inappropriate problem formulation: designing ML to fit the data rather than to address a real-world clinical pain point. METHODS: We introduce a practical toolkit for user-centred design consisting of four questions covering: (1) solvable pain points, (2) the unique value of ML (eg, automation and augmentation), (3) the actionability pathway and (4) the model's reward function. This toolkit was implemented in a series of six participatory design workshops with care managers in an academic medical centre. RESULTS: Pain points amenable to ML solutions included outpatient risk stratification and risk factor identification. The endpoint definitions, triggering frequency and evaluation metrics of the proposed risk scoring model were directly influenced by care manager workflows and real-world constraints. CONCLUSIONS: Integrating user-centred design early in the ML life cycle is key for configuring models in a clinically actionable way. This toolkit can guide problem selection and influence choices about the technical setup of the ML problem.

View details for DOI 10.1136/bmjhci-2022-100656

View details for PubMedID 36220304
Assessment of Adherence to Reporting Guidelines by Commonly Used Clinical Prediction Models From a Single Vendor: A Systematic Review. JAMA network open Lu, J. H., Callahan, A., Patel, B. S., Morse, K. E., Dash, D., Pfeffer, M. A., Shah, N. H. 2022; 5 (8): e2227779

Abstract

Importance: Various model reporting guidelines have been proposed to ensure clinical prediction models are reliable and fair. However, no consensus exists about which model details are essential to report, and commonalities and differences among reporting guidelines have not been characterized. Furthermore, how well documentation of deployed models adheres to these guidelines has not been studied. Objectives: To assess information requested by model reporting guidelines and whether the documentation for commonly used machine learning models developed by a single vendor provides the information requested. Evidence Review: MEDLINE was queried using machine learning model card and reporting machine learning from November 4 to December 6, 2020. References were reviewed to find additional publications, and publications without specific reporting recommendations were excluded. Similar elements requested for reporting were merged into representative items. Four independent reviewers and 1 adjudicator assessed how often documentation for the most commonly used models developed by a single vendor reported the items. Findings: From 15 model reporting guidelines, 220 unique items were identified that represented the collective reporting requirements. Although 12 items were commonly requested (requested by 10 or more guidelines), 77 items were requested by just 1 guideline. Documentation for 12 commonly used models from a single vendor reported a median of 39% (IQR, 37%-43%; range, 31%-47%) of items from the collective reporting requirements. Many of the commonly requested items had 100% reporting rates, including items concerning outcome definition, area under the receiver operating characteristics curve, internal validation, and intended clinical use. Several items reported half the time or less related to reliability, such as external validation, uncertainty measures, and strategy for handling missing data. Other frequently unreported items related to fairness (summary statistics and subgroup analyses, including for race and ethnicity or sex). Conclusions and Relevance: These findings suggest that consistent reporting recommendations for clinical predictive models are needed for model developers to share necessary information for model deployment. The many published guidelines would, collectively, require reporting more than 200 items. Model documentation from 1 vendor reported the most commonly requested items from model reporting guidelines. However, areas for improvement were identified in reporting items related to model reliability and fairness. This analysis led to feedback to the vendor, which motivated updates to the documentation for future users.

View details for DOI 10.1001/jamanetworkopen.2022.27779

View details for PubMedID 35984654
Monitoring Approaches for a Pediatric Chronic Kidney Disease Machine Learning Model. Applied clinical informatics Morse, K. E., Brown, C., Fleming, S., Todd, I., Powell, A., Russell, A., Scheinker, D., Sutherland, S. M., Lu, J., Watkins, B., Shah, N. H., Pageler, N. M., Palma, J. P. 2022; 13 (2): 431-438

Abstract

OBJECTIVE: The purpose of this study is to evaluate the ability of three metrics to monitor for a reduction in performance of a chronic kidney disease (CKD) model deployed at a pediatric hospital. METHODS: The CKD risk model estimates a patient's risk of developing CKD 3 to 12 months following an inpatient admission. The model was developed on a retrospective dataset of 4,879 admissions from 2014 to 2018, then run silently on 1,270 admissions from April to October, 2019. Three metrics were used to monitor its performance during the silent phase: (1) standardized mean differences (SMDs); (2) performance of a "membership model"; and (3) response distribution analysis. Observed patient outcomes for the 1,270 admissions were used to calculate prospective model performance and the ability of the three metrics to detect performance changes. RESULTS: The deployed model had an area under the receiver-operator curve (AUROC) of 0.63 in the prospective evaluation, which was a significant decrease from an AUROC of 0.76 on retrospective data (p=0.033). Among the three metrics, SMDs were significantly different for 66/75 (88%) of the model's input variables (p <0.05) between retrospective and deployment data. The membership model was able to discriminate between the two settings (AUROC=0.71, p <0.0001) and the response distributions were significantly different (p <0.0001) for the two settings. CONCLUSION: This study suggests that the three metrics examined could provide early indication of performance deterioration in deployed models' performance.

View details for DOI 10.1055/s-0042-1746168

View details for PubMedID 35508197
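The first monitoring metric in the abstract above, the standardized mean difference (SMD), can be sketched for a single continuous input variable as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the abstract does not give the exact SMD formulation, so the common pooled-standard-deviation form is assumed here, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(reference, deployment):
    """SMD of one continuous feature between two datasets.

    Assumed form: difference in means divided by the square root of the
    average of the two sample variances. Each input needs at least two
    values, and the two variances must not both be zero.
    """
    pooled_sd = sqrt((variance(reference) + variance(deployment)) / 2)
    return (mean(deployment) - mean(reference)) / pooled_sd
```

In a monitoring setting, `reference` would hold a feature's values in the retrospective development data and `deployment` the same feature's values observed after go-live; a large absolute SMD flags a shift in that input.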
Ensuring Adolescent Patient Portal Confidentiality in the Age of the Cures Act Final Rule. The Journal of adolescent health : official publication of the Society for Adolescent Medicine Xie, J., McPherson, T., Powell, A., Fong, P., Hogan, A., Ip, W., Morse, K., Carlson, J. L., Lee, T., Pageler, N. 2021

Abstract

PURPOSE: Managing confidential adolescent health information in patient portals presents unique challenges. Adolescent patients and guardians electronically access medical records and communicate with providers via portals. In confidential matters like sexual health, ensuring confidentiality is crucial. A key aspect of confidential portals is ensuring that the account is registered to and utilized by the intended user. Inappropriately registered or guardian-accessed adolescent portal accounts may lead to confidentiality breaches. METHODS: We used a quality improvement framework to develop screening methodologies to flag guardian-accessible accounts. Accounts of patients aged 12-17 were flagged via manual review of account emails and natural language processing of portal messages. We implemented a reconciliation program to correct affected accounts' registered email. Clinics were notified about sign-up errors and educated on sign-up workflow. An electronic alert was created to check the adolescent's email prior to account activation. RESULTS: After initial screening, 2,307 of 3,701 (62%) adolescent accounts were flagged as registered with a guardian's email. Those accounts were notified to resolve their logins. After five notifications over 8 weeks, 266 of 2,307 accounts (12%) were corrected; the remaining 2,041 (88%) were deactivated. CONCLUSIONS: The finding that 62% of adolescent portal accounts were used/accessed by guardians has significant confidentiality implications. In the context of the Cures Act Final Rule and increased information sharing, our institution's experience with ensuring appropriate access to adolescent portal accounts is necessary, timely, and relevant. This study highlights ways to improve patient portal confidentiality and prompts institutions caring for adolescents to review their systems and processes.

View details for DOI 10.1016/j.jadohealth.2021.09.009

View details for PubMedID 34666956
Assessment of Prevalence of Adolescent Patient Portal Account Access by Guardians. JAMA network open Ip, W., Yang, S., Parker, J., Powell, A., Xie, J., Morse, K., Aikens, R. C., Lee, J., Gill, M., Vundavalli, S., Huang, Y., Huang, J., Chen, J. H., Hoffman, J., Kuelbs, C., Pageler, N. 2021; 4 (9): e2124733

Abstract

Importance: Patient portals can be configured to allow confidential communication for adolescents' sensitive health care information. Guardian access of adolescent patient portal accounts could compromise adolescents' confidentiality. Objective: To estimate the prevalence of guardian access to adolescent patient portals at 3 academic children's hospitals. Design, Setting, and Participants: A cross-sectional study to estimate the prevalence of guardian access to adolescent patient portal accounts was conducted at 3 academic children's hospitals. Adolescent patients (aged 13-18 years) with access to their patient portal account with at least 1 outbound message from their portal during the study period were included. A rule-based natural language processing algorithm was used to analyze all portal messages from June 1, 2014, to February 28, 2020, and identify any message sent by guardians. The sensitivity and specificity of the algorithm at each institution was estimated through manual review of a stratified subsample of patient accounts. The overall proportion of accounts with guardian access was estimated after correcting for the sensitivity and specificity of the natural language processing algorithm. Exposures: Use of patient portal. Main Outcome and Measures: Percentage of adolescent portal accounts indicating guardian access. Results: A total of 3429 eligible adolescent accounts containing 25 642 messages across 3 institutions were analyzed. A total of 1797 adolescents (52%) were female and mean (SD) age was 15.6 (1.6) years. The percentage of adolescent portal accounts with apparent guardian access ranged from 52% to 57% across the 3 institutions. After correcting for the sensitivity and specificity of the algorithm based on manual review of 200 accounts per institution, an estimated 64% (95% CI, 59%-69%) to 76% (95% CI, 73%-88%) of accounts with outbound messages were accessed by guardians across the 3 institutions. Conclusions and Relevance: In this study, more than half of adolescent accounts with outbound messages were estimated to have been accessed by guardians at least once. These findings have implications for health systems intending to rely on separate adolescent accounts to protect adolescent confidentiality.

View details for DOI 10.1001/jamanetworkopen.2021.24733

View details for PubMedID 34529064
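The abstract above reports a prevalence estimate corrected for the sensitivity and specificity of the detection algorithm, without naming the estimator. One standard way to make such a correction is the Rogan-Gladen estimator, sketched below in Python. Treating it as the study's method is an assumption, and the function name and example inputs are hypothetical.

```python
def corrected_prevalence(apparent, sensitivity, specificity):
    """Rogan-Gladen estimate of true prevalence from an imperfect classifier.

    apparent:    fraction of accounts the algorithm flagged (0 to 1)
    sensitivity: algorithm sensitivity estimated from manual review
    specificity: algorithm specificity estimated from manual review
    """
    denominator = sensitivity + specificity - 1
    if denominator <= 0:
        # The classifier is no better than chance, so no correction is possible.
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (apparent + specificity - 1) / denominator
    # Clip to the valid range, since sampling noise can push the raw value outside it.
    return min(1.0, max(0.0, estimate))
```

With illustrative inputs, `corrected_prevalence(0.55, 0.80, 0.95)` yields roughly 0.67: when the algorithm misses a fifth of true cases but rarely raises false flags, the corrected prevalence is higher than the apparent one, which is the direction of adjustment the abstract describes.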
A survey of extant organizational and computational setups for deploying predictive models in health systems. Journal of the American Medical Informatics Association : JAMIA Kashyap, S., Morse, K. E., Patel, B., Shah, N. H. 2021

Abstract

OBJECTIVE: Artificial intelligence (AI) and machine learning (ML) enabled healthcare is now feasible for many health systems, yet little is known about effective strategies of system architecture and governance mechanisms for implementation. Our objective was to identify the different computational and organizational setups that early-adopter health systems have utilized to integrate AI/ML clinical decision support (AI-CDS) and scrutinize their trade-offs. MATERIALS AND METHODS: We conducted structured interviews with health systems with AI deployment experience about their organizational and computational setups for deploying AI-CDS at point of care. RESULTS: We contacted 34 health systems and interviewed 20 healthcare sites (58% response rate). Twelve (60%) sites used the native electronic health record vendor configuration for model development and deployment, making it the most common shared infrastructure. Nine (45%) sites used alternative computational configurations which varied significantly. Organizational configurations for managing AI-CDS were distinguished by how they identified model needs, built and implemented models, and were separable into 3 major types: Decentralized translation (n=10, 50%), IT Department led (n=2, 10%), and AI in Healthcare (AIHC) Team (n=8, 40%). DISCUSSION: No singular computational configuration enables all current use cases for AI-CDS. Health systems need to consider their desired applications for AI-CDS and whether investment in extending the off-the-shelf infrastructure is needed. Each organizational setup confers trade-offs for health systems planning strategies to implement AI-CDS. CONCLUSION: Health systems will be able to use this framework to understand strengths and weaknesses of alternative organizational and computational setups when designing their strategy for artificial intelligence.

View details for DOI 10.1093/jamia/ocab154

View details for PubMedID 34423364
Quantifying Discharge Medication Reconciliation Errors at 2 Pediatric Hospitals. Pediatric quality & safety Morse, K. E., Chadwick, W. A., Paul, W., Haaland, W., Pageler, N. M., Tarrago, R. 2021; 6 (4): e436

Abstract

Introduction: Medication reconciliation errors (MREs) are common and can lead to significant patient harm. Quality improvement efforts to identify and reduce these errors typically rely on resource-intensive chart reviews or adverse event reporting. Quantifying these errors hospital-wide is complicated and rarely done. The purpose of this study is to define a set of 6 MREs that can be easily identified across an entire healthcare organization and report their prevalence at 2 pediatric hospitals. Methods: An algorithmic analysis of discharge medication lists and confirmation by clinician reviewers was used to find the prevalence of the 6 discharge MREs at 2 pediatric hospitals. These errors represent deviations from the standards for medication instruction completeness, clarity, and safety. The 6 error types are Duplication, Missing Route, Missing Dose, Missing Frequency, Unlisted Medication, and See Instructions errors. Results: This study analyzed 67,339 discharge medications and detected MREs commonly at both hospitals. For Institution A, a total of 4,234 errors were identified, with 29.9% of discharges containing at least one error and an average of 0.7 errors per discharge. For Institution B, a total of 5,942 errors were identified, with 42.2% of discharges containing at least 1 error and an average of 1.6 errors per discharge. The most common error types were Duplication and See Instructions errors. Conclusion: The presented method shows these MREs to be a common finding in pediatric care. This work offers a tool to strengthen hospital-wide quality improvement efforts to reduce pediatric medication errors.

View details for DOI 10.1097/pq9.0000000000000436

View details for PubMedID 34345749
Digital Symptom Checker Usage and Triage: Population-Based Descriptive Study in a Large North American Integrated Health System. Journal of medical Internet research Morse, K. E., Ostberg, N. P., Jones, V. G., Chan, A. S. 2020

Abstract

BACKGROUND: Pressure on the United States (US) healthcare system has been increasing due to a combination of aging populations, rising healthcare expenditures and, most recently, the COVID-19 pandemic. Responses are hindered in part by a reliance on a limited supply of highly trained healthcare professionals, creating a need for scalable technological solutions. Digital symptom checkers are artificial intelligence (AI)-supported software tools that use a conversational "chatbot" format to support rapid diagnosis and consistent triage. The COVID-19 pandemic has brought new attention to these tools, with the need to avoid face-to-face contact and preserve urgent care capacity. However, evidence-based deployment of these chatbots requires an understanding of user demographics and associated triage recommendations generated by a large, general population. OBJECTIVE: In this study we evaluate the user demographics and levels of triage acuity provided by one symptom checker chatbot deployed in partnership with a large integrated health system in the US. METHODS: Population-based descriptive study including all online symptom assessments completed on the website and patient portal of the Sutter Health system (24 hospitals in Northern California) from April 24th, 2019 to February 1st, 2020. User demographics were compared to relevant US Census population data. RESULTS: A total of 26,646 symptom assessments were completed during the study period. Most assessments (17,816/26,646, 66.9%) were completed by female users. Mean user age was 34.3 years (SD: 14.4 years), compared to a median age of 37.3 years of the general population. The most common initial symptom was 'abdominal pain' (2,060/26,646, 7.7%). A substantial portion (12,357/26,646, 46.4%) was completed outside of typical physician office hours. Most users were advised to seek medical care the same day (7,299/26,646, 27.4%) or within 2-3 days (6,301/26,646, 23.6%). Over one quarter of assessments required a high degree of urgency (7,723/26,646, 29.0%). CONCLUSIONS: Users of the symptom checker chatbot were broadly representative of our patient population, though skewed towards younger and female users. Triage recommendations are comparable to those of nurse-staffed phone triage lines. While the emergence of COVID-19 increases the enthusiasm for remote medical assessment tools, it is important to take an evidence-based approach to their deployment.

View details for DOI 10.2196/20549

View details for PubMedID 33170799
Estimate the hidden deployment cost of predictive models to improve patient care. Nature medicine Morse, K. E., Bagley, S. C., Shah, N. H. 2020; 26 (1): 18–19

View details for DOI 10.1038/s41591-019-0651-8

View details for PubMedID 31932778
Your Patient Has a New Health App? Start With Its Data Source. Journal of participatory medicine Morse, K. E., Schremp, J., Pageler, N. M., Palma, J. P. 2019; 11 (2): e14288

Abstract

Recent regulatory and technological advances have enabled a new era of health apps that are controlled by patients and contain valuable health information. These health apps will be numerous and use novel interfaces that appeal to patients but will likely be unfamiliar to practitioners. We posit that understanding the origin of the health data is the most meaningful and versatile way for physicians to understand and effectively use these apps in patient care. This will allow providers to better support patients and encourage patient engagement in their own care.

View details for DOI 10.2196/14288

View details for PubMedID 33055064

View details for PubMedCentralID PMC7434101
Hospital-Level Variation in Practice Patterns and Patient Outcomes for Pediatric Patients Hospitalized With Functional Constipation. Hospital pediatrics Librizzi, J., Flores, S., Morse, K., Kelleher, K., Carter, J., Bode, R. 2017; 7 (6): 320-327

Abstract

Constipation is a common pediatric condition with a prevalence of 3% to 5% in children aged 4 to 17 years. Currently, there are no evidence-based guidelines for the management of pediatric patients hospitalized with constipation. The primary objective was to evaluate practice patterns and patient outcomes for the hospital management of functional constipation in US children's hospitals.

We conducted a multicenter, retrospective cohort study of children aged 0 to 18 years hospitalized for functional constipation from 2012 to 2014 by using the Pediatric Health Information System. Patients were included by using constipation and other related diagnoses as classified by International Classification of Diseases, Ninth Revision. Patients with complex chronic conditions were excluded. Outcome measures included percentage of hospitalizations due to functional constipation, therapies used, length of stay, and 90-day readmission rates. Statistical analysis included means with 95% confidence intervals for individual hospital outcomes.

A total of 14 243 hospitalizations were included, representing 12 804 unique patients. The overall percentage of hospitalizations due to functional constipation was 0.65% (range: 0.19%-1.41%, P < .0001). The percentage of patients receiving the following treatment during their hospitalization included: electrolyte laxatives: 40% to 96%; sodium phosphate enema: 0% to 64%; mineral oil enema: 0% to 61%; glycerin suppository: 0% to 37%; bisacodyl: 0% to 47%; senna: 0% to 23%; and docusate: 0% to 11%. Mean length of stay was 1.97 days (range: 1.31-2.73 days, P < .0001). Mean 90-day readmission rate was 3.78% (range: 0.95%-7.53%, P < .0001).

There is significant variation in practice patterns and clinical outcomes for pediatric patients hospitalized with functional constipation across US children's hospitals. Collaborative initiatives to adopt evidence-based best practices guidelines could help standardize the hospital management of pediatric functional constipation.

View details for DOI 10.1542/hpeds.2016-0101

View details for PubMedID 28522604
