Professional Education

  • Master of Medicine, Universität Basel (2017)
  • Doctor of Medicine, Universität Basel (2018)
  • Bachelor of Medicine, Universität Basel (2014)
  • PhD, ETH Zurich, Machine Learning for Healthcare (2021)
  • MD, University of Basel (2018)

All Publications

  • Foundation models for generalist medical artificial intelligence. Nature Moor, M., Banerjee, O., Abad, Z. S., Krumholz, H. M., Leskovec, J., Topol, E. J., Rajpurkar, P. 2023; 616 (7956): 259-265


    The exceptionally rapid development of highly flexible, reusable artificial intelligence (AI) models is likely to usher in newfound capabilities in medicine. We propose a new paradigm for medical AI, which we refer to as generalist medical AI (GMAI). GMAI models will be capable of carrying out a diverse set of tasks using very little or no task-specific labelled data. Built through self-supervision on large, diverse datasets, GMAI will flexibly interpret different combinations of medical modalities, including data from imaging, electronic health records, laboratory results, genomics, graphs or medical text. Models will in turn produce expressive outputs such as free-text explanations, spoken recommendations or image annotations that demonstrate advanced medical reasoning abilities. Here we identify a set of high-impact potential applications for GMAI and lay out specific technical capabilities and training datasets necessary to enable them. We expect that GMAI-enabled applications will challenge current strategies for regulating and validating AI devices for medicine and will shift practices associated with the collection of large medical datasets.

    View details for DOI 10.1038/s41586-023-05881-4

    View details for PubMedID 37045921

    View details for PubMedCentralID PMC9792464

  • Almanac - Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI Zakka, C., Shad, R., Chaurasia, A., Dalal, A. R., Kim, J. L., Moor, M., Fong, R., Phillips, C., Alexander, K., Ashley, E., Boyd, J., Boyd, K., Hirsch, K., Langlotz, C., Lee, R., Melia, J., Nelson, J., Sallam, K., Tullis, S., Vogelsong, M. A., Cunningham, J. P., Hiesinger, W. 2024; 1 (2)


    Large language models (LLMs) have recently shown impressive zero-shot capabilities, whereby they can use auxiliary data, without the availability of task-specific training examples, to complete a variety of natural language tasks, such as summarization, dialogue generation, and question answering. However, despite many promising applications of LLMs in clinical medicine, adoption of these models has been limited by their tendency to generate incorrect and sometimes even harmful statements. We tasked a panel of eight board-certified clinicians and two health care practitioners with evaluating Almanac, an LLM framework augmented with retrieval capabilities from curated medical resources for medical guideline and treatment recommendations. The panel compared responses from Almanac and standard LLMs (ChatGPT-4, Bing, and Bard) versus a novel data set of 314 clinical questions spanning nine medical specialties. Almanac showed a significant improvement in performance compared with the standard LLMs across axes of factuality, completeness, user preference, and adversarial safety. Our results show the potential for LLMs with access to domain-specific corpora to be effective in clinical decision-making. The findings also underscore the importance of carefully testing LLMs before deployment to mitigate their shortcomings. (Funded by the National Institutes of Health, National Heart, Lung, and Blood Institute.)

    View details for DOI 10.1056/aioa2300068

    View details for PubMedID 38343631

    View details for PubMedCentralID PMC10857783

  • Development and Validation of a Model to Quantify Injury Severity in Real Time. JAMA network open Choi, J., Vendrow, E. B., Moor, M., Spain, D. A. 2023; 6 (10): e2336196


    Quantifying injury severity is integral to trauma care benchmarking, decision-making, and research, yet the most prevalent metric to quantify injury severity, the Injury Severity Score (ISS), is impractical to use in real time. The objective was to develop and validate a practical model that uses a limited number of injury patterns to quantify injury severity in real time through 3 intuitive outcomes. In this cohort study for prediction model development and validation, training, development, and internal validation cohorts comprised 223 545, 74 514, and 74 514 admission encounters, respectively, of adults (age ≥18 years) with a primary diagnosis of traumatic injury hospitalized more than 2 days (2017-2018 National Inpatient Sample). The external validation cohort comprised 3855 adults admitted to a level I trauma center who met criteria for the 2 highest of the institution's 3 trauma activation levels. Three outcomes were hospital length of stay, probability of discharge disposition to a facility, and probability of inpatient mortality. The prediction performance metric for length of stay was mean absolute error. Prediction performance metrics for discharge disposition and inpatient mortality were average precision, precision, recall, specificity, F1 score, and area under the receiver operating characteristic curve (AUROC). Calibration was evaluated using calibration plots. Shapley additive explanations analysis and bee swarm plots facilitated model explainability analysis. The Length of Stay, Disposition, Mortality (LDM) Injury Index (the model) comprised a multitask deep learning model trained, developed, and internally validated on a data set of 372 573 traumatic injury encounters (mean [SD] age = 68.7 [19.3] years, 56.6% female). The model used 176 potential injuries to output 3 interpretable outcomes: the predicted hospital length of stay, probability of discharge to a facility, and probability of inpatient mortality. For the external validation set, the ISS predicted length of stay with a mean absolute error of 4.16 (95% CI, 4.13-4.20) days. Compared with the ISS, the model had comparable external validation set discrimination performance (facility discharge AUROC: 0.67 [95% CI, 0.67-0.68] vs 0.65 [95% CI, 0.65-0.66]; recall: 0.59 [95% CI, 0.58-0.61] vs 0.59 [95% CI, 0.58-0.60]; specificity: 0.66 [95% CI, 0.66-0.66] vs 0.62 [95% CI, 0.60-0.63]; mortality AUROC: 0.83 [95% CI, 0.81-0.84] vs 0.82 [95% CI, 0.82-0.82]; recall: 0.74 [95% CI, 0.72-0.77] vs 0.75 [95% CI, 0.75-0.76]; specificity: 0.81 [95% CI, 0.81-0.81] vs 0.76 [95% CI, 0.75-0.77]). The model had excellent calibration for predicting facility discharge disposition, but overestimated inpatient mortality. Explainability analysis found the inputs influencing model predictions matched intuition. In this cohort study using a limited number of injury patterns, the model quantified injury severity using 3 intuitive outcomes. Further study is required to evaluate the model at scale.

    View details for DOI 10.1001/jamanetworkopen.2023.36196

    View details for PubMedID 37812422

  • Predicting sepsis using deep learning across international sites: a retrospective development and validation study. EClinicalMedicine Moor, M., Bennett, N., Plečko, D., Horn, M., Rieck, B., Meinshausen, N., Bühlmann, P., Borgwardt, K. 2023; 62: 102124


    When sepsis is detected, organ damage may have progressed to irreversible stages, leading to poor prognosis. The use of machine learning for predicting sepsis early has shown promise; however, international validations are missing. This was a retrospective, observational, multi-centre cohort study. We developed and externally validated a deep learning system for the prediction of sepsis in the intensive care unit (ICU). Our analysis represents the first international, multi-centre in-ICU cohort study for sepsis prediction using deep learning to our knowledge. Our dataset contains 136,478 unique ICU admissions, representing a refined and harmonised subset of four large ICU databases comprising data collected from ICUs in the US, the Netherlands, and Switzerland between 2001 and 2016. Using the international consensus definition Sepsis-3, we derived hourly-resolved sepsis annotations, amounting to 25,694 (18.8%) patient stays with sepsis. We compared our approach to clinical baselines as well as machine learning baselines and performed an extensive internal and external statistical validation within and across databases, reporting area under the receiver-operating-characteristic curve (AUC). Averaged over sites, our model was able to predict sepsis with an AUC of 0.846 (95% confidence interval [CI], 0.841-0.852) on a held-out validation cohort internal to each site, and an AUC of 0.761 (95% CI, 0.746-0.770) when validating externally across sites. Given access to a small fine-tuning set (10% per site), the transfer to target sites was improved to an AUC of 0.807 (95% CI, 0.801-0.813). Our model raised 1.4 false alerts per true alert and detected 80% of the septic patients 3.7 h (95% CI, 3.0-4.3) prior to the onset of sepsis, opening a vital window for intervention. By monitoring clinical and laboratory measurements in a retrospective simulation of a real-time prediction scenario, a deep learning system for the detection of sepsis generalised to previously unseen ICU cohorts, internationally. This study was funded by the Personalized Health and Related Technologies (PHRT) strategic focus area of the ETH domain.

    View details for DOI 10.1016/j.eclinm.2023.102124

    View details for PubMedID 37588623

    View details for PubMedCentralID PMC10425671

  • Almanac: Retrieval-Augmented Language Models for Clinical Medicine. Research square Zakka, C., Chaurasia, A., Shad, R., Dalal, A. R., Kim, J. L., Moor, M., Alexander, K., Ashley, E., Boyd, J., Boyd, K., Hirsch, K., Langlotz, C., Nelson, J., Hiesinger, W. 2023


    Large-language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In this study, we develop Almanac, a large language model framework augmented with retrieval capabilities for medical guideline and treatment recommendations. Performance on a novel dataset of clinical scenarios (n= 130) evaluated by a panel of 5 board-certified and resident physicians demonstrates significant increases in factuality (mean of 18% at p-value < 0.05) across all specialties, with improvements in completeness and safety. Our results demonstrate the potential for large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings.

    View details for DOI 10.21203/

    View details for PubMedID 37205549

    View details for PubMedCentralID PMC10187428