Sophie Ostmeier's Profile | Stanford Profiles

Bio

My current research is in deep neural networks that learn from multimodal clinical data including images and clinical information. I would like to combine these primary computer vision algorithms with large language models/EHR encoding models in order to integrate them into the clinical workflow, potentially as a virtual assistant.

Honors & Awards

DFG Walter Benjamin Award (fellowship), Deutsche Forschungsgemeinschaft (2023)
DAAD RISE worldwide (fellowship), German Academic Exchange Service (2019)
Merit Scholarship, Kurt Hahn Foundation (2013)
Athletic Scholarship, Mercersburg Academy (2010)

Education & Certifications

Dr. med., Technical University of Munich, Germany, Radiology (2022)
MD, Technical University of Munich, Germany, pre-clinical and clinical studies (2021)

Contact

Academic
sostm@stanford.edu

University - Student
Department: Computer Science
Position: Graduate

Additional Info

Mail Code: 5659
ORCID:
https://orcid.org/0000-0003-3097-7042

All Publications

Prediction of Ischemic Stroke Functional Outcomes from Acute-Phase Noncontrast CT and Clinical Information. Radiology Liu, Y., Yu, Y., Ouyang, J., Jiang, B., Ostmeier, S., Wang, J., Lu-Liang, S., Yang, Y., Yang, G., Michel, P., Liebeskind, D. S., Lansberg, M., Moseley, M. E., Heit, J. J., Wintermark, M., Albers, G., Zaharchuk, G. 2024; 313 (1): e240137

Abstract

Background Clinical outcome prediction based on acute-phase ischemic stroke data is valuable for planning health care resources, designing clinical trials, and setting patient expectations. Existing methods require individualized features and often involve manually engineered, time-consuming postprocessing activities. Purpose To predict the 90-day modified Rankin Scale (mRS) score with a deep learning (DL) model fusing noncontrast-enhanced CT (NCCT) and clinical information from the acute phase of stroke. Materials and Methods This retrospective study included data from six patient datasets from four multicenter trials and two registries. The DL-based imaging and clinical model was trained by using NCCT data obtained 1-7 days after baseline imaging and clinical data (age; sex; baseline and 24-hour National Institutes of Health Stroke Scale scores; and history of hypertension, diabetes, and atrial fibrillation). This model was compared with models based on either NCCT or clinical information alone. Model-specific mRS score prediction accuracy, mRS score accuracy within 1 point of the actual mRS score, mean absolute error (MAE), and performance in identifying unfavorable outcomes (mRS score, >2) were evaluated. Results A total of 1335 patients (median age, 71 years; IQR, 60-80 years; 674 female patients) were included for model development and testing through sixfold cross validation, with distributions of 979, 133, and 223 patients across training, validation, and test sets in each of the six cross-validation folds, respectively. The fused model achieved an MAE of 0.94 (95% CI: 0.89, 0.98) for predicting the specific mRS score, outperforming the imaging-only (MAE, 1.10; 95% CI: 1.05, 1.16; P < .001) and the clinical information-only (MAE, 1.00; 95% CI: 0.94, 1.05; P = .04) models. 
The fused model achieved an area under the receiver operating characteristic curve (AUC) of 0.91 (95% CI: 0.89, 0.92) for predicting unfavorable outcomes, outperforming the clinical information-only model (AUC, 0.88; 95% CI: 0.87, 0.90; P < .001) and the imaging-only model (AUC, 0.85; 95% CI: 0.84, 0.87; P < .001). Conclusion A fused DL-based NCCT and clinical model outperformed an imaging-only model and a clinical-information-only model in predicting 90-day mRS scores. © RSNA, 2024 Supplemental material is available for this article. See also the editorial by Lee in this issue.
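The evaluation measures named in this abstract (exact mRS accuracy, accuracy within 1 point, MAE, and the unfavorable-outcome cut-off of mRS > 2) can be illustrated with a minimal sketch. The function name `mrs_metrics` and the toy scores are hypothetical and are not taken from the paper; the study reports AUC for unfavorable outcomes, whereas this sketch only checks thresholded agreement as a simplification.

```python
def mrs_metrics(actual, predicted):
    """Summarize ordinal mRS predictions (integer scores 0-6)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    n = len(actual)
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return {
        # Fraction of patients whose exact mRS score was predicted.
        "accuracy": sum(e == 0 for e in errors) / n,
        # Fraction of predictions within 1 point of the actual score.
        "accuracy_within_1": sum(e <= 1 for e in errors) / n,
        # Mean absolute error in mRS points.
        "mae": sum(errors) / n,
        # Agreement on the binary label "unfavorable outcome" (mRS > 2).
        "unfavorable_agreement": sum(
            (a > 2) == (p > 2) for a, p in zip(actual, predicted)
        ) / n,
    }


# Hypothetical toy data, for illustration only.
actual = [0, 2, 3, 5, 6, 1]
predicted = [1, 2, 5, 4, 6, 3]
metrics = mrs_metrics(actual, predicted)
print(metrics)
```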

View details for DOI 10.1148/radiol.240137

View details for PubMedID 39404632
Merlin: A Vision Language Foundation Model for 3D Computed Tomography. Research square Blankemeier, L., Cohen, J. P., Kumar, A., Veen, D. V., Gardezi, S., Paschali, M., Chen, Z., Delbrouck, J. B., Reis, E., Truyts, C., Bluethgen, C., Jensen, M., Ostmeier, S., Varma, M., Valanarasu, J., Fang, Z., Huo, Z., Nabulsi, Z., Ardila, D., Weng, W. H., Junior, E. A., Ahuja, N., Fries, J., Shah, N., Johnston, A., Boutin, R., Wentland, A., Langlotz, C., Hom, J., Gatidis, S., Chaudhari, A. 2024

Abstract

Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current shortage of both general and specialized radiologists, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies while simultaneously using the images to extract novel physiological insights. Prior state-of-the-art approaches for automated medical image interpretation leverage vision language models (VLMs) that utilize both the image and the corresponding textual radiology reports. However, current medical VLMs are generally limited to 2D images and short reports. To overcome these shortcomings for abdominal CT interpretation, we introduce Merlin - a 3D VLM that leverages both structured electronic health records (EHR) and unstructured radiology reports for pretraining without requiring additional manual annotations. We train Merlin using a high-quality clinical dataset of paired CT scans (6+ million images from 15,331 CTs), EHR diagnosis codes (1.8+ million codes), and radiology reports (6+ million tokens). We comprehensively evaluate Merlin on 6 task types and 752 individual tasks. The non-adapted (off-the-shelf) tasks include zero-shot findings classification (31 findings), phenotype classification (692 phenotypes), and zero-shot cross-modal retrieval (image to findings and image to impressions), while model-adapted tasks include 5-year chronic disease prediction (6 diseases), radiology report generation, and 3D semantic segmentation (20 organs). We perform internal validation on a test set of 5,137 CTs, and external validation on 7,000 clinical CTs and on two public CT datasets (VerSe, TotalSegmentator). Beyond these clinically relevant evaluations, we assess the efficacy of various network architectures and training strategies to show that Merlin performs favorably compared with existing task-specific baselines.
We derive data scaling laws to empirically assess training data needs for requisite downstream task performance. Furthermore, unlike conventional VLMs that require hundreds of GPUs for training, we perform all training on a single GPU. This computationally efficient design can help democratize foundation model training, especially for health systems with compute constraints. We plan to release our trained models, code, and dataset, pending manual removal of all protected health information.

View details for DOI 10.21203/rs.3.rs-4546309/v1

View details for PubMedID 38978576

View details for PubMedCentralID PMC11230513
Multicenter US clinical experience with the Scepter Mini balloon catheter. Interventional neuroradiology : journal of peritherapeutic neuroradiology, surgical procedures and related neurosciences Salem, M. M., Kelmer, P., Sioutas, G. S., Ostmeier, S., Hoang, A., Cortez, G., El Naamani, K., Abbas, R., Hanel, R., Tanweer, O., Srinivasan, V. M., Jabbour, P., Kan, P., Jankowitz, B. T., Heit, J. J., Burkhardt, J. K. 2024: 15910199241246135

Abstract

Distal navigability and imprecise delivery of embolic agents are two limitations encountered during liquid embolization of cerebrospinal lesions. The dual-lumen Scepter Mini balloon (SMB) microcatheter was introduced to overcome these limitations of conventional microcatheters, with a few small single-center reports suggesting favorable results. A series of consecutive patients undergoing SMB-assisted endovascular embolization were extracted from prospectively maintained registries in seven North-American centers (November 2019 to September 2022). Fifty-four patients undergoing 55 embolization procedures utilizing SMB were included (median age 58.5; 48.1% females). Cranial dural arteriovenous fistula embolization was the most common indication (54.5%) followed by cranial arteriovenous malformation (27.3%). Staged/pre-operative embolization was done in 36.4% of cases, and 83.6% of procedures used Onyx-18. Most procedures utilized a transarterial approach (89.1%), and SMB-induced arterial-flow arrest concurrently with transvenous embolization was used in 10.9% of procedures. Femoral access/triaxial setups were utilized in the majority of procedures (65.5% and 60%, respectively). The median vessel diameter where the balloon was inflated was 1.8 mm, with a median of 1.5 cc of injected embolic material per procedure. Technical failures occurred in 5.5% of cases requiring aborting/replacement with other devices without clinical sequelae in any of the patients, with SMB-related procedural complications of 3.6% without clinical sequelae. Radiographic imaging follow-up was available in 76.9% of the patients (median follow-up 3.8 months), with complete occlusion (100%) or >50% occlusion in 92.5% of the cases, and unplanned retreatments in 1.8%. The SMB microcatheter is a useful new adjunctive device for balloon-assisted embolization of cerebrospinal lesions with a high technical success rate, favorable outcomes, and a reasonable safety profile.

View details for DOI 10.1177/15910199241246135

View details for PubMedID 38613371
Random expert sampling for deep learning segmentation of acute ischemic stroke on non-contrast CT. Journal of neurointerventional surgery Ostmeier, S., Axelrod, B., Liu, Y., Yu, Y., Jiang, B., Yuen, N., Pulli, B., Verhaaren, B. F., Kaka, H., Wintermark, M., Michel, P., Mahammedi, A., Federau, C., Lansberg, M. G., Albers, G. W., Moseley, M. E., Zaharchuk, G., Heit, J. J. 2024

Abstract

Outlining acutely infarcted tissue on non-contrast CT is a challenging task for which human inter-reader agreement is limited. We explored two different methods for training a supervised deep learning algorithm: one that used a segmentation defined by majority vote among experts and another that trained randomly on separate individual expert segmentations. The data set consisted of 260 non-contrast CT studies in 233 patients with acute ischemic stroke recruited from the multicenter DEFUSE 3 (Endovascular Therapy Following Imaging Evaluation for Ischemic Stroke 3) trial. Additional external validation was performed using 33 patients with matched stroke onset times from the University Hospital Lausanne. A benchmark U-Net was trained on the reference annotations of three experienced neuroradiologists to segment ischemic brain tissue using majority vote and random expert sampling training schemes. The medians of volume, overlap, and distance segmentation metrics were determined for agreement in lesion segmentations between (1) three experts, (2) the majority model and each expert, and (3) the random model and each expert. The two-sided Wilcoxon signed rank test was used to compare performances (1) to (2) and (1) to (3). We further compared volumes with the 24 hour follow-up diffusion weighted imaging (DWI, final infarct core) and correlations with clinical outcome (modified Rankin Scale (mRS) at 90 days) with the Spearman method. The random model outperformed the inter-expert agreement ((1) to (2)) and the majority model ((1) to (3)) (dice 0.51±0.04 vs 0.36±0.05 (P<0.0001) vs 0.45±0.05 (P<0.0001)). The random model predicted volume correlated with clinical outcome (0.19, P<0.05), whereas the median expert volume and majority model volume did not.
There was no significant difference when comparing the volume correlations between random model, median expert volume, and majority model to 24 hour follow-up DWI volume (P>0.05, n=51). The random model for ischemic injury delineation on non-contrast CT surpassed the inter-expert agreement ((1) to (2)) and the performance of the majority model ((1) to (3)). We showed that the volumetric measures of the random model were consistent with 24 hour follow-up DWI.
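The two training-target schemes compared in this abstract can be sketched as follows. This is an illustrative assumption about how such targets could be built, not the authors' implementation: the abstract does not state how or how often an expert is sampled, and the names `majority_vote` and `random_expert` as well as the toy masks are hypothetical.

```python
import random


def majority_vote(masks):
    """Voxel-wise majority over an odd number of flat binary expert masks."""
    n_experts = len(masks)
    return [1 if 2 * sum(voxels) > n_experts else 0 for voxels in zip(*masks)]


def random_expert(masks, rng):
    """Return one expert's mask chosen uniformly at random.

    Assumption: a training loop would call this each time a scan is drawn,
    so the model sees different experts' annotations over time.
    """
    return rng.choice(masks)


# Hypothetical annotations of the same 5-voxel scan by three experts.
experts = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
]

rng = random.Random(0)  # seeded only to make the sketch reproducible
majority_target = majority_vote(experts)
sampled_target = random_expert(experts, rng)
print(majority_target, sampled_target)
```

The majority target collapses disagreement into a single mask, while random sampling keeps each expert's annotation as a possible target.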

View details for DOI 10.1136/jnis-2023-021283

View details for PubMedID 38302420
Non-inferiority of deep learning ischemic stroke segmentation on non-contrast CT within 16-hours compared to expert neuroradiologists. Scientific reports Ostmeier, S., Axelrod, B., Verhaaren, B. F., Christensen, S., Mahammedi, A., Liu, Y., Pulli, B., Li, L., Zaharchuk, G., Heit, J. J. 2023; 13 (1): 16153

Abstract

We determined if a convolutional neural network (CNN) deep learning model can accurately segment acute ischemic changes on non-contrast CT compared to neuroradiologists. Non-contrast CT (NCCT) examinations from 232 acute ischemic stroke patients who were enrolled in the DEFUSE 3 trial were included in this study. Three experienced neuroradiologists independently segmented hypodensity that reflected the ischemic core on each scan. The neuroradiologist with the most experience (expert A) served as the ground truth for deep learning model training. Two additional neuroradiologists' (experts B and C) segmentations were used for data testing. The 232 studies were randomly split into training and test sets. The training set was further randomly divided into 5 folds with training and validation sets. A 3-dimensional CNN architecture was trained and optimized to predict the segmentations of expert A from NCCT. The performance of the model was assessed using a set of volume, overlap, and distance metrics using non-inferiority thresholds of 20%, 3 ml, and 3 mm, respectively. The optimized model trained on expert A was compared to test experts B and C. We used a one-sided Wilcoxon signed-rank test to test for the non-inferiority of the model-expert compared to the inter-expert agreement. The final model reached a performance of 0.46 ± 0.09 Surface Dice at Tolerance 5 mm and 0.47 ± 0.13 Dice for the ischemic core segmentation task when trained on expert A. Compared to the two test neuroradiologists, the model-expert agreement was non-inferior to the inter-expert agreement, [Formula: see text]. Therefore, the CNN accurately delineates the hypodense ischemic core on NCCT in acute ischemic stroke patients with an accuracy comparable to neuroradiologists.

View details for DOI 10.1038/s41598-023-42961-x

View details for PubMedID 37752162
USE-Evaluator: Performance metrics for medical image segmentation models supervised by uncertain, small or empty reference annotations in neuroimaging. Medical image analysis Ostmeier, S., Axelrod, B., Isensee, F., Bertels, J., Mlynash, M., Christensen, S., Lansberg, M. G., Albers, G. W., Sheth, R., Verhaaren, B. F., Mahammedi, A., Li, L. J., Zaharchuk, G., Heit, J. J. 2023; 90: 102927

Abstract

Performance metrics for medical image segmentation models are used to measure the agreement between the reference annotation and the predicted segmentation. Usually, overlap metrics, such as the Dice, are used as a metric to evaluate the performance of these models in order for results to be comparable. However, there is a mismatch between the distributions of cases and the difficulty level of segmentation tasks in public data sets compared to clinical practice. Common metrics used to assess performance fail to capture the impact of this mismatch, particularly when dealing with datasets in clinical settings that involve challenging segmentation tasks, pathologies with low signal, and reference annotations that are uncertain, small, or empty. Limitations of common metrics may result in ineffective machine learning research in designing and optimizing models. To effectively evaluate the clinical value of such models, it is essential to consider factors such as the uncertainty associated with reference annotations, the ability to accurately measure performance regardless of the size of the reference annotation volume, and the classification of cases where reference annotations are empty. We study how uncertain, small, and empty reference annotations influence the value of metrics on a stroke in-house data set regardless of the model. We examine metrics behavior on the predictions of a standard deep learning framework in order to identify suitable metrics in such a setting. We compare our results to the BRATS 2019 and Spinal Cord public data sets. We show how uncertain, small, or empty reference annotations require a rethinking of the evaluation. The evaluation code was released to encourage further analysis of this topic https://github.com/SophieOstmeier/UncertainSmallEmpty.git.
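A small sketch may help show why the Dice score behaves poorly for small or empty reference annotations, as this abstract argues. It is not the USE-Evaluator code (see the repository linked above for that); the function `dice` and the toy masks are hypothetical, and returning `None` for two empty masks is one possible convention among several.

```python
def dice(reference, prediction):
    """Dice overlap for flat binary masks.

    Returns None when both masks are empty, because 0/0 is undefined and
    such cases are better treated as a classification question.
    """
    ref_sum = sum(reference)
    pred_sum = sum(prediction)
    if ref_sum + pred_sum == 0:
        return None
    intersection = sum(r * p for r, p in zip(reference, prediction))
    return 2 * intersection / (ref_sum + pred_sum)


# Small lesion: a one-voxel shift removes all overlap.
small = dice([1, 0, 0, 0, 0], [0, 1, 0, 0, 0])
# Larger lesion: the same one-voxel shift costs far less.
large = dice([1, 1, 1, 1, 0], [0, 1, 1, 1, 1])
# Empty reference: a single false-positive voxel already gives 0.
empty_ref = dice([0, 0, 0, 0, 0], [0, 0, 1, 0, 0])
# Both empty: undefined.
both_empty = dice([0, 0, 0, 0, 0], [0, 0, 0, 0, 0])
print(small, large, empty_ref, both_empty)
```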

View details for DOI 10.1016/j.media.2023.102927

View details for PubMedID 37672900
Functional Outcome Prediction in Acute Ischemic Stroke Using a Fused Imaging and Clinical Deep Learning Model. Stroke Liu, Y., Yu, Y., Ouyang, J., Jiang, B., Yang, G., Ostmeier, S., Wintermark, M., Michel, P., Liebeskind, D. S., Lansberg, M. G., Albers, G. W., Zaharchuk, G. 2023

Abstract

Predicting long-term clinical outcome based on the early acute ischemic stroke information is valuable for prognostication, resource management, clinical trials, and patient expectations. Current methods require subjective decisions about which imaging features to assess and may require time-consuming postprocessing. This study's goal was to predict ordinal 90-day modified Rankin Scale (mRS) score in acute ischemic stroke patients by fusing a Deep Learning model of diffusion-weighted imaging images and clinical information from the acute period. A total of 640 acute ischemic stroke patients who underwent magnetic resonance imaging within 1 to 7 days poststroke and had 90-day mRS follow-up data were randomly divided into 70% (n=448) for model training, 15% (n=96) for validation, and 15% (n=96) for internal testing. Additionally, external testing on a cohort from Lausanne University Hospital (n=280) was performed to further evaluate model generalization. Accuracy for ordinal mRS, accuracy within ±1 mRS category, mean absolute prediction error, and determination of unfavorable outcome (mRS score >2) were evaluated for clinical only, imaging only, and 2 fused clinical-imaging models. The fused models demonstrated superior performance in predicting ordinal mRS score and unfavorable outcome in both internal and external test cohorts when compared with the clinical and imaging models. For the internal test cohort, the top fused model had the highest area under the curve of 0.92 for unfavorable outcome prediction and the lowest mean absolute error (0.96 [95% CI, 0.77-1.16]), with the highest proportion of mRS score predictions within ±1 category (79% [95% CI, 71%-88%]).
On the external Lausanne University Hospital cohort, the best fused model had an area under the curve of 0.90 for unfavorable outcome prediction and outperformed other models with a mean absolute error of 0.90 (95% CI, 0.79-1.01), and the highest percentage of mRS score predictions within ±1 category (83% [95% CI, 78%-87%]). A Deep Learning-based imaging model fused with clinical variables can be used to predict 90-day stroke outcome with reduced subjectivity and user burden.

View details for DOI 10.1161/STROKEAHA.123.044072

View details for PubMedID 37485663
Prediction of delayed cerebral ischemia after cerebral aneurysm rupture using explainable machine learning approach. Interventional neuroradiology : journal of peritherapeutic neuroradiology, surgical procedures and related neurosciences Taghavi, R. M., Zhu, G., Wintermark, M., Kuraitis, G. M., Sussman, E. S., Pulli, B., Biniam, B., Ostmeier, S., Steinberg, G. K., Heit, J. J. 2023: 15910199231170411

Abstract

Aneurysmal subarachnoid hemorrhage results in significant mortality and disability, which is worsened by the development of delayed cerebral ischemia. Tests to identify patients with delayed cerebral ischemia prospectively are of high interest. We created a machine learning system based on clinical variables to predict delayed cerebral ischemia in aneurysmal subarachnoid hemorrhage patients. We also determined which variables have the most impact on delayed cerebral ischemia prediction using the SHapley Additive exPlanations method. 500 aneurysmal subarachnoid hemorrhage patients were identified and 369 met inclusion criteria: 70 patients developed delayed cerebral ischemia (delayed cerebral ischemia+) and 299 did not (delayed cerebral ischemia-). The algorithm was trained based upon age, sex, hypertension (HTN), diabetes, hyperlipidemia, congestive heart failure, coronary artery disease, smoking history, family history of aneurysm, Fisher Grade, Hunt and Hess score, and external ventricular drain placement. Random Forest was selected for this project, and prediction outcome of the algorithm was delayed cerebral ischemia+. SHapley Additive exPlanations was used to visualize each feature's contribution to the model prediction. The Random Forest machine learning algorithm predicted delayed cerebral ischemia: accuracy 80.65% (95% CI: 72.62-88.68), area under the curve 0.780 (95% CI: 0.696-0.864), sensitivity 12.5% (95% CI: -3.7 to 28.7), specificity 94.81% (95% CI: 89.85-99.77), PPV 33.3% (95% CI: -4.39 to 71.05), and NPV 84.1% (95% CI: 76.38-91.82). SHapley Additive exPlanations values demonstrated that age, external ventricular drain placement, Fisher Grade, Hunt and Hess score, and HTN had the highest predictive values for delayed cerebral ischemia.
Lower age, absence of hypertension, higher Hunt and Hess score, higher Fisher Grade, and external ventricular drain placement increased risk of delayed cerebral ischemia. Machine learning models based upon clinical variables predict delayed cerebral ischemia with high specificity and good accuracy.
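Some of the confidence intervals quoted in this abstract have negative lower bounds, which a proportion cannot take. One common way such bounds arise is the normal-approximation (Wald) interval applied to a small sample; whether the authors used this method is an assumption, not something the abstract states. The sketch below uses hypothetical counts (2 of 16) that are not taken from the paper.

```python
import math


def wald_interval(successes, total, z=1.96):
    """Normal-approximation (Wald) 95% interval for a proportion.

    The bounds are not clipped to [0, 1], so small samples with a
    proportion near 0 or 1 can yield impossible values.
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


# Hypothetical counts for illustration: 2 true positives among 16 cases.
low, high = wald_interval(2, 16)
print(f"{low:.3f} to {high:.3f}")
```

Intervals such as Wilson or Clopper-Pearson stay within 0 and 1 and are often preferred for small samples.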

View details for DOI 10.1177/15910199231170411

View details for PubMedID 37070145
Iodine concentration of healthy lymph nodes of the neck, axilla and groin in Dual Energy Computed Tomography Ostmeier, S. Technical University Munich. 2022
Iodine concentration of healthy lymph nodes of neck, axilla, and groin in dual-energy computed tomography ACTA RADIOLOGICA Sauter, A. P., Ostmeier, S., Nadjiri, J., Deniffel, D., Rummeny, E. J., Pfeiffer, D. 2020; 61 (11): 1505-1511

Abstract

Lymph nodes (LN) are examined in every computed tomography (CT) scan. Until now, evaluation has been possible only on the basis of morphological criteria. With dual-energy CT (DECT) systems, iodine concentration (IC) can be measured, which could result in an improved diagnostic evaluation of LNs. To define standard values for IC of cervical, axillary, and inguinal LNs in DECT. Imaging data of 297 patients who received a DECT scan of the neck, thorax, abdomen-pelvis, or a combination of those in a portal-venous phase were retrospectively collected from the institutional PACS. No history of malignancy, inflammation, or trauma in the examined region was present. For each examined region, the data of 99 patients were used. The IC of the three largest LNs, the main artery, the main vein, and a local muscle of the examined area was measured, respectively. Normalization of the IC of LNs to the artery, vein, muscle, or a combination of those did not lead to a decreased value-range. The smallest range and confidence interval (CI) of IC was found when using absolute values of IC for each region. Hereby, mean values (95% CI) for IC of LN were found: 2.09 mg/mL (2.00-2.18 mg/mL) for neck, 1.24 mg/mL (1.16-1.33 mg/mL) for axilla, and 1.11 mg/mL (1.04-1.17 mg/mL) for groin. The present study suggests standard values for IC of LNs in dual-layer CT could be used to differentiate between healthy and pathological lymph nodes, considering the used contrast injection protocol.

View details for DOI 10.1177/0284185120903448

View details for Web of Science ID 000514045200001

View details for PubMedID 32064891

Sophie Ostmeier

Masters Student in Computer Science, admitted Autumn 2024
