My current research focuses on deep neural networks that learn from multimodal clinical data, including images and clinical information. I would like to combine these primary computer vision algorithms with large language models and EHR encoding models in order to integrate them into the clinical workflow, potentially as a virtual assistant.

Honors & Awards

  • DFG Walter Benjamin Award (fellowship), Deutsche Forschungsgemeinschaft (2023)
  • DAAD RISE worldwide (fellowship), German Academic Exchange Service (2019)
  • Merit Scholarship, Kurt Hahn Foundation (2013)
  • Athletic Scholarship, Mercersburg Academy (2010)

Professional Education

  • Bachelor of Medicine, School Undefined 3, Medicine, combined BS & MD (2021)
  • Dr. med., Technical University of Munich, Germany, Radiology (2022)
  • MD, Technical University of Munich, Germany, pre-clinical and clinical studies (2021)

All Publications

  • Non-inferiority of deep learning ischemic stroke segmentation on non-contrast CT within 16-hours compared to expert neuroradiologists. Scientific reports Ostmeier, S., Axelrod, B., Verhaaren, B. F., Christensen, S., Mahammedi, A., Liu, Y., Pulli, B., Li, L., Zaharchuk, G., Heit, J. J. 2023; 13 (1): 16153


    We determined whether a convolutional neural network (CNN) deep learning model can segment acute ischemic changes on non-contrast CT as accurately as neuroradiologists. Non-contrast CT (NCCT) examinations from 232 acute ischemic stroke patients who were enrolled in the DEFUSE 3 trial were included in this study. Three experienced neuroradiologists independently segmented hypodensity that reflected the ischemic core on each scan. The segmentations of the neuroradiologist with the most experience (expert A) served as the ground truth for deep learning model training. The segmentations of two additional neuroradiologists (experts B and C) were used for testing. The 232 studies were randomly split into training and test sets. The training set was further randomly divided into 5 folds with training and validation sets. A 3-dimensional CNN architecture was trained and optimized to predict the segmentations of expert A from NCCT. The performance of the model was assessed using a set of volume, overlap, and distance metrics using non-inferiority thresholds of 20%, 3 ml, and 3 mm, respectively. The optimized model trained on expert A was compared to test experts B and C. We used a one-sided Wilcoxon signed-rank test to test for the non-inferiority of the model-expert agreement compared to the inter-expert agreement. The final model for the ischemic core segmentation task reached 0.46 ± 0.09 Surface Dice at a tolerance of 5 mm and 0.47 ± 0.13 Dice when trained on expert A. Compared to the two test neuroradiologists, the model-expert agreement was non-inferior to the inter-expert agreement, [Formula: see text]. Therefore, the CNN accurately delineates the hypodense ischemic core on NCCT in acute ischemic stroke patients with an accuracy comparable to neuroradiologists.

    View details for DOI 10.1038/s41598-023-42961-x

    View details for PubMedID 37752162
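The non-inferiority comparison described in the abstract above can be illustrated with a minimal sketch. It shifts each paired difference (model-expert agreement minus inter-expert agreement) by a margin and applies a one-sided Wilcoxon signed-rank test using the normal approximation, without tie or continuity correction. The Dice values, the margin of 0.20, and all function names are hypothetical; none of them are taken from the paper or its code.

```python
import math

def wilcoxon_greater(diffs):
    """One-sided Wilcoxon signed-rank test, H1: median(diffs) > 0.

    Uses the normal approximation without tie or continuity correction.
    Returns (W+, p-value).
    """
    d = [x for x in diffs if x != 0.0]          # zero differences are dropped
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1              # average 1-based rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return w_plus, p_value

def non_inferiority(model_expert, inter_expert, margin):
    """Test H1: model-expert agreement is not worse than inter-expert by `margin`."""
    shifted = [m - e + margin for m, e in zip(model_expert, inter_expert)]
    return wilcoxon_greater(shifted)

# Hypothetical per-case Dice scores (higher is better).
model_expert = [0.45, 0.50, 0.40, 0.55, 0.48, 0.52, 0.47, 0.43, 0.51, 0.46]
inter_expert = [0.50, 0.48, 0.45, 0.53, 0.50, 0.55, 0.44, 0.47, 0.49, 0.50]

w_plus, p_value = non_inferiority(model_expert, inter_expert, margin=0.20)
print(f"W+ = {w_plus}, one-sided p = {p_value:.4f}")
```

A small p-value here rejects the hypothesis that the model-expert agreement is worse than the inter-expert agreement by more than the margin. For metrics where lower is better (volume or distance errors), the sign of the difference would be flipped.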

  • USE-Evaluator: Performance metrics for medical image segmentation models supervised by uncertain, small or empty reference annotations in neuroimaging. Medical image analysis Ostmeier, S., Axelrod, B., Isensee, F., Bertels, J., Mlynash, M., Christensen, S., Lansberg, M. G., Albers, G. W., Sheth, R., Verhaaren, B. F., Mahammedi, A., Li, L. J., Zaharchuk, G., Heit, J. J. 2023; 90: 102927


    Performance metrics for medical image segmentation models are used to measure the agreement between the reference annotation and the predicted segmentation. Usually, overlap metrics, such as the Dice, are used to evaluate the performance of these models so that results are comparable. However, there is a mismatch between the distributions of cases and the difficulty level of segmentation tasks in public data sets compared to clinical practice. Common metrics used to assess performance fail to capture the impact of this mismatch, particularly when dealing with datasets in clinical settings that involve challenging segmentation tasks, pathologies with low signal, and reference annotations that are uncertain, small, or empty. Limitations of common metrics may result in ineffective machine learning research in designing and optimizing models. To effectively evaluate the clinical value of such models, it is essential to consider factors such as the uncertainty associated with reference annotations, the ability to accurately measure performance regardless of the size of the reference annotation volume, and the classification of cases where reference annotations are empty. We study how uncertain, small, and empty reference annotations influence the value of metrics on an in-house stroke data set regardless of the model. We examine metric behavior on the predictions of a standard deep learning framework in order to identify suitable metrics in such a setting. We compare our results to the BRATS 2019 and Spinal Cord public data sets. We show how uncertain, small, or empty reference annotations require a rethinking of the evaluation. The evaluation code was released to encourage further analysis of this topic.

    View details for DOI 10.1016/

    View details for PubMedID 37672900
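The sensitivity of overlap metrics to small and empty reference annotations, as discussed in the abstract above, can be seen in a minimal sketch. It computes the Dice coefficient on sets of voxel indices. The `dice` helper and the toy masks are hypothetical and are not the released evaluation code.

```python
def dice(reference, prediction):
    """Dice overlap between two sets of voxel indices.

    Returns None when both sets are empty, because the ratio is 0/0.
    """
    ref, pred = set(reference), set(prediction)
    denom = len(ref) + len(pred)
    if denom == 0:
        return None
    return 2 * len(ref & pred) / denom

# Empty reference: undefined if the prediction is also empty, and 0.0 as
# soon as a single voxel is predicted, regardless of how small the error is.
both_empty = dice(set(), set())
one_false_positive = dice(set(), {0})

# The same one-voxel shift is penalised very differently by lesion size.
small_lesion = dice(range(0, 2), range(1, 3))
large_lesion = dice(range(0, 1000), range(1, 1001))

print(both_empty, one_false_positive, small_lesion, large_lesion)
```

In this toy setting an identical one-voxel offset costs half of the score for a two-voxel reference but almost nothing for a thousand-voxel reference, which is one reason empty and small references may need separate handling.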

  • Functional Outcome Prediction in Acute Ischemic Stroke Using a Fused Imaging and Clinical Deep Learning Model. Stroke Liu, Y., Yu, Y., Ouyang, J., Jiang, B., Yang, G., Ostmeier, S., Wintermark, M., Michel, P., Liebeskind, D. S., Lansberg, M. G., Albers, G. W., Zaharchuk, G. 2023


    Predicting long-term clinical outcome based on early acute ischemic stroke information is valuable for prognostication, resource management, clinical trials, and patient expectations. Current methods require subjective decisions about which imaging features to assess and may require time-consuming postprocessing. This study's goal was to predict ordinal 90-day modified Rankin Scale (mRS) score in acute ischemic stroke patients by fusing a Deep Learning model of diffusion-weighted images and clinical information from the acute period. A total of 640 acute ischemic stroke patients who underwent magnetic resonance imaging within 1 to 7 days poststroke and had 90-day mRS follow-up data were randomly divided into 70% (n=448) for model training, 15% (n=96) for validation, and 15% (n=96) for internal testing. Additionally, external testing on a cohort from Lausanne University Hospital (n=280) was performed to further evaluate model generalization. Accuracy for ordinal mRS, accuracy within ±1 mRS category, mean absolute prediction error, and determination of unfavorable outcome (mRS score >2) were evaluated for clinical only, imaging only, and 2 fused clinical-imaging models. The fused models demonstrated superior performance in predicting ordinal mRS score and unfavorable outcome in both internal and external test cohorts when compared with the clinical and imaging models. For the internal test cohort, the top fused model had the highest area under the curve of 0.92 for unfavorable outcome prediction and the lowest mean absolute error (0.96 [95% CI, 0.77-1.16]), with the highest proportion of mRS score predictions within ±1 category (79% [95% CI, 71%-88%]). On the external Lausanne University Hospital cohort, the best fused model had an area under the curve of 0.90 for unfavorable outcome prediction and outperformed other models with a mean absolute error of 0.90 (95% CI, 0.79-1.01) and the highest percentage of mRS score predictions within ±1 category (83% [95% CI, 78%-87%]). A Deep Learning-based imaging model fused with clinical variables can be used to predict 90-day stroke outcome with reduced subjectivity and user burden.

    View details for DOI 10.1161/STROKEAHA.123.044072

    View details for PubMedID 37485663
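The split sizes and the ordinal evaluation measures named in the abstract above can be made concrete with a minimal sketch. It checks the 70/15/15 arithmetic for 640 patients and computes exact accuracy, accuracy within ±1 category, mean absolute error, and the unfavorable-outcome label (mRS > 2). The function names and the toy mRS scores are hypothetical, not study data.

```python
def split_sizes(n, percents=(70, 15, 15)):
    """Integer sizes of the train/validation/test partitions for n cases."""
    return [n * p // 100 for p in percents]

def ordinal_metrics(y_true, y_pred):
    """Exact accuracy, accuracy within one category, and mean absolute error."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    return {
        "accuracy": sum(e == 0 for e in errors) / n,
        "within_one": sum(e <= 1 for e in errors) / n,
        "mae": sum(errors) / n,
    }

def unfavorable(scores, threshold=2):
    """Binarise mRS scores: True when the score is above the threshold."""
    return [s > threshold for s in scores]

sizes = split_sizes(640)

# Toy 90-day mRS scores on the 0-6 scale (illustrative values only).
y_true = [0, 1, 2, 3, 4, 5, 6, 2]
y_pred = [0, 2, 2, 5, 4, 4, 6, 3]

metrics = ordinal_metrics(y_true, y_pred)
agree = sum(t == p for t, p in zip(unfavorable(y_true), unfavorable(y_pred)))

print(sizes, metrics, agree / len(y_true))
```

Accuracy within ±1 category is more forgiving than exact accuracy, since a prediction of mRS 4 for a true score of 5 counts as correct. The binarised outcome discards ordinal information but matches the dichotomy commonly used in stroke trials.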

  • Prediction of delayed cerebral ischemia after cerebral aneurysm rupture using explainable machine learning approach. Interventional neuroradiology : journal of peritherapeutic neuroradiology, surgical procedures and related neurosciences Taghavi, R. M., Zhu, G., Wintermark, M., Kuraitis, G. M., Sussman, E. S., Pulli, B., Biniam, B., Ostmeier, S., Steinberg, G. K., Heit, J. J. 2023: 15910199231170411


    Aneurysmal subarachnoid hemorrhage results in significant mortality and disability, which is worsened by the development of delayed cerebral ischemia. Tests to identify patients with delayed cerebral ischemia prospectively are of high interest. We created a machine learning system based on clinical variables to predict delayed cerebral ischemia in aneurysmal subarachnoid hemorrhage patients. We also determined which variables have the most impact on delayed cerebral ischemia prediction using the SHapley Additive exPlanations method. 500 aneurysmal subarachnoid hemorrhage patients were identified and 369 met inclusion criteria: 70 patients developed delayed cerebral ischemia (delayed cerebral ischemia+) and 299 did not (delayed cerebral ischemia-). The algorithm was trained based upon age, sex, hypertension (HTN), diabetes, hyperlipidemia, congestive heart failure, coronary artery disease, smoking history, family history of aneurysm, Fisher Grade, Hunt and Hess score, and external ventricular drain placement. Random Forest was selected for this project, and the prediction outcome of the algorithm was delayed cerebral ischemia+. SHapley Additive exPlanations was used to visualize each feature's contribution to the model prediction. The Random Forest machine learning algorithm predicted delayed cerebral ischemia: accuracy 80.65% (95% CI: 72.62-88.68), area under the curve 0.780 (95% CI: 0.696-0.864), sensitivity 12.5% (95% CI: -3.7 to 28.7), specificity 94.81% (95% CI: 89.85-99.77), PPV 33.3% (95% CI: -4.39 to 71.05), and NPV 84.1% (95% CI: 76.38-91.82). SHapley Additive exPlanations values demonstrated that age, external ventricular drain placement, Fisher Grade, Hunt and Hess score, and HTN had the highest predictive values for delayed cerebral ischemia. Lower age, absence of hypertension, higher Hunt and Hess score, higher Fisher Grade, and external ventricular drain placement increased risk of delayed cerebral ischemia. Machine learning models based upon clinical variables predict delayed cerebral ischemia with high specificity and good accuracy.

    View details for DOI 10.1177/15910199231170411

    View details for PubMedID 37070145
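The negative lower confidence bounds reported in the abstract above (for example, -3.7% for sensitivity) are the kind of result a normal-approximation (Wald) interval gives for a small proportion estimated from few cases. The sketch below shows the effect and contrasts it with the Wilson score interval, which stays within [0, 1]. The counts of 2 detected cases out of 16 are hypothetical, chosen only so that the point estimate equals 12.5%, and whether the authors used a Wald interval is an assumption.

```python
import math

def wald_ci(successes, total, z=1.96):
    """Normal-approximation interval for a proportion; may leave [0, 1]."""
    p = successes / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, p - half, p + half

def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a proportion; always within [0, 1]."""
    p = successes / total
    scale = 1 + z * z / total
    center = (p + z * z / (2 * total)) / scale
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / scale
    return p, center - half, center + half

# Hypothetical counts: 2 true positives among 16 positive test cases.
p, wald_lo, wald_hi = wald_ci(2, 16)
_, wilson_lo, wilson_hi = wilson_ci(2, 16)

print(f"sensitivity = {p:.1%}")
print(f"Wald 95% CI   = {wald_lo:.1%} to {wald_hi:.1%}")
print(f"Wilson 95% CI = {wilson_lo:.1%} to {wilson_hi:.1%}")
```

Because a proportion cannot be negative, a bounded interval such as Wilson's is usually easier to interpret when the number of positive cases is small.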

  • Random Expert Sampling for Deep Learning Segmentation of Acute Ischemic Stroke on Non-contrast CT arXiv Ostmeier, S. 2023
  • Iodine concentration of healthy lymph nodes of the neck, axilla and groin in Dual Energy Computed Tomography Ostmeier, S. Technical University Munich. 2022
  • Iodine concentration of healthy lymph nodes of neck, axilla, and groin in dual-energy computed tomography ACTA RADIOLOGICA Sauter, A. P., Ostmeier, S., Nadjiri, J., Deniffel, D., Rummeny, E. J., Pfeiffer, D. 2020; 61 (11): 1505-1511


    Lymph nodes (LN) are examined in every computed tomography (CT) scan. Until now, an evaluation has only been possible based on morphological criteria. With dual-energy CT (DECT) systems, iodine concentration (IC) can be measured, which could result in an improved diagnostic evaluation of LNs. To define standard values for IC of cervical, axillary, and inguinal LNs in DECT. Imaging data of 297 patients who received a DECT scan of the neck, thorax, abdomen-pelvis, or a combination of those in a portal-venous phase were retrospectively collected from the institutional PACS. Patients had no current history of malignancy, inflammation, or trauma in the examined region. For each examined region, the data of 99 patients were used. The IC of the three largest LNs, the main artery, the main vein, and a local muscle of the examined area was measured, respectively. Normalization of the IC of LNs to the artery, vein, muscle, or a combination of those did not lead to a decreased value range. The smallest range and confidence interval (CI) of IC was found when using absolute values of IC for each region. The following mean values (95% CI) for IC of LNs were found: 2.09 mg/mL (2.00-2.18 mg/mL) for neck, 1.24 mg/mL (1.16-1.33 mg/mL) for axilla, and 1.11 mg/mL (1.04-1.17 mg/mL) for groin. The present study suggests that standard values for IC of LNs in dual-layer CT could be used to differentiate between healthy and pathological lymph nodes, taking into account the contrast injection protocol used.

    View details for DOI 10.1177/0284185120903448

    View details for Web of Science ID 000514045200001

    View details for PubMedID 32064891