Bio


I'm a visiting scholar at the Stanford AIMI Center, working at the intersection of artificial intelligence and medicine. My purpose is to contribute to our understanding of intelligence, and I believe our best chance of achieving that is through AI.



Research highlights:

- Published BRAX, the Brazilian Chest X-ray Dataset - https://www.nature.com/articles/s41597-022-01608-8

- Open-sourced the PyTorch implementation of ConVIRT (Zhang et al.), a contrastive learning method for radiology images and text that predates CLIP (see the loss sketch after this list) - https://github.com/edreisMD/ConVIRT-pytorch

- Released BHX (Brain Hemorrhage Extended), a dataset of brain hemorrhage bounding-box annotations - https://physionet.org/content/bhx-brain-bounding-box
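
As a pointer to what the ConVIRT implementation involves, here is a minimal sketch of its bidirectional image-text contrastive (InfoNCE) objective in PyTorch. The temperature and direction weighting mirror the paper's defaults, but the variable names and shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

def convirt_loss(img_emb, txt_emb, tau=0.1, lam=0.75):
    """Bidirectional InfoNCE loss as in ConVIRT (Zhang et al.).

    img_emb, txt_emb: (N, D) projected embeddings of N paired
    image/text examples; tau is the temperature and lam weights the
    image-to-text direction against the text-to-image direction.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                     # (N, N) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # match image i to text i
    loss_t2i = F.cross_entropy(logits.t(), targets)  # match text i to image i
    return lam * loss_i2t + (1 - lam) * loss_t2i

loss = convirt_loss(torch.randn(8, 128), torch.randn(8, 128))
```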



At Hospital Israelita Albert Einstein:

- Started the Health Story project, a medical history timeline to support research and a more personalized clinical practice

- Led the development of AI algorithms for diseases of national importance (tuberculosis, COVID-19, and melanoma) and for head CT findings



All Publications


  • Automated abdominal CT contrast phase detection using an interpretable and open-source artificial intelligence algorithm. European Radiology. Reis, E. P., Blankemeier, L., Zambrano Chaves, J. M., Jensen, M. E., Yao, S., Truyts, C. A., Willis, M. H., Adams, S., Amaro, E., Boutin, R. D., Chaudhari, A. S. 2024

    Abstract

    To develop and validate an open-source artificial intelligence (AI) algorithm to accurately detect contrast phases in abdominal CT scans. This retrospective study developed an AI algorithm trained on 739 abdominal CT exams from 2016 to 2021, from 200 unique patients, covering 1545 axial series. We segmented five key anatomic structures (aorta, portal vein, inferior vena cava, renal parenchyma, and renal pelvis) using TotalSegmentator, a deep learning-based tool for multi-organ segmentation, with a rule-based approach to extract the renal pelvis. Radiomics features were extracted from the anatomical structures for use in a gradient-boosting classifier to identify four contrast phases: non-contrast, arterial, venous, and delayed. Internal and external validation were performed using the F1 score and other classification metrics, with the external dataset "VinDr-Multiphase CT". The training dataset consisted of 172 patients (mean age, 70 years ± 8; 22% women), and the internal test set included 28 patients (mean age, 68 years ± 8; 14% women). In internal validation, the classifier achieved an accuracy of 92.3%, with an average F1 score of 90.7%. In external validation, the algorithm maintained an accuracy of 90.1%, with an average F1 score of 82.6%. Shapley feature attribution analysis indicated that renal and vascular radiodensity values were the most important features for phase classification. An open-source and interpretable AI algorithm accurately detects contrast phases in abdominal CT scans, with high accuracy and F1 scores in internal and external validation, confirming its generalization capability. Contrast phase detection in abdominal CT scans is a critical step for downstream AI applications, for deploying algorithms in the clinical setting, and for quantifying imaging biomarkers, ultimately allowing for better diagnostics and increased access to diagnostic imaging. Digital Imaging and Communications in Medicine (DICOM) labels are often inaccurate for determining the phase of an abdominal CT scan; AI can accurately discriminate the contrast phase, and accurate phase determination aids downstream AI applications and biomarker quantification.

    View details for DOI 10.1007/s00330-024-10769-6

    View details for PubMedID 38683384

    View details for PubMedCentralID 9700820
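
The pipeline in this paper lends itself to a compact illustration. The sketch below, with synthetic stand-ins for the CT volume, masks, and training features, shows the general shape of the approach: radiodensity statistics per segmented structure feeding a gradient-boosting classifier. The actual study used TotalSegmentator masks and a richer radiomics feature set; scikit-learn's GradientBoostingClassifier is only a plausible stand-in for the classifier.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

PHASES = ["non-contrast", "arterial", "venous", "delayed"]
STRUCTURES = ["aorta", "portal_vein", "inferior_vena_cava",
              "renal_parenchyma", "renal_pelvis"]

def radiodensity_features(ct_hu, masks):
    """Mean/std HU inside each segmented structure. In the paper the
    masks come from TotalSegmentator (plus a rule-based renal-pelvis
    step) and the radiomics feature set is richer than shown here."""
    feats = []
    for name in STRUCTURES:
        voxels = ct_hu[masks[name]]
        feats += [voxels.mean(), voxels.std()]
    return np.asarray(feats)

# Synthetic stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
ct = rng.normal(50, 30, size=(40, 128, 128))         # fake HU volume
masks = {s: rng.random((40, 128, 128)) < 0.01 for s in STRUCTURES}
X = rng.normal(size=(200, 2 * len(STRUCTURES)))      # fake training features
y = rng.integers(0, len(PHASES), size=200)           # fake phase labels

clf = GradientBoostingClassifier().fit(X, y)
x_new = radiodensity_features(ct, masks).reshape(1, -1)
print("predicted phase:", PHASES[clf.predict(x_new)[0]])
```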

  • Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine. Van Veen, D., Van Uden, C., Blankemeier, L., Delbrouck, J. B., Aali, A., Bluethgen, C., Pareek, A., Polacin, M., Reis, E. P., Seehofnerová, A., Rohatgi, N., Hosamani, P., Collins, W., Ahuja, N., Langlotz, C. P., Hom, J., Gatidis, S., Pauly, J., Chaudhari, A. S. 2024

    Abstract

    Analyzing vast textual data and summarizing key information from electronic health records imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown promise in natural language processing (NLP) tasks, their effectiveness on a diverse range of clinical summarization tasks remains unproven. Here we applied adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks: radiology reports, patient questions, progress notes and doctor-patient dialogue. Quantitative assessments with syntactic, semantic and conceptual NLP metrics reveal trade-offs between models and adaptation methods. A clinical reader study with 10 physicians evaluated summary completeness, correctness and conciseness; in most cases, summaries from our best-adapted LLMs were deemed either equivalent (45%) or superior (36%) compared with summaries from medical experts. The ensuing safety analysis highlights challenges faced by both LLMs and medical experts, as we connect errors to potential medical harm and categorize types of fabricated information. Our research provides evidence of LLMs outperforming medical experts in clinical text summarization across multiple tasks. This suggests that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care.

    View details for DOI 10.1038/s41591-024-02855-5

    View details for PubMedID 38413730

    View details for PubMedCentralID 5593724
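
One family of adaptation methods the paper examines is prompting an instruction-tuned model. As a rough illustration of the radiology-report task (findings in, impression out), the sketch below prompts FLAN-T5, one of the model families studied; the specific checkpoint size and prompt wording here are illustrative, not the paper's exact configuration.

```python
from transformers import pipeline

# Zero-shot prompting sketch for one of the four clinical tasks:
# summarizing radiology findings into an impression.
summarizer = pipeline("text2text-generation", model="google/flan-t5-small")

findings = ("Heart size is normal. Lungs are clear without focal "
            "consolidation, pleural effusion, or pneumothorax.")
prompt = ("Summarize the radiology report findings into an impression "
          f"with minimal text.\n\nFindings: {findings}\n\nImpression:")
print(summarizer(prompt, max_new_tokens=40)[0]["generated_text"])
```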

  • Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts. Research Square. Veen, D. V., Uden, C. V., Blankemeier, L., Delbrouck, J. B., Aali, A., Bluethgen, C., Pareek, A., Polacin, M., Reis, E. P., Seehofnerova, A., Rohatgi, N., Hosamani, P., Collins, W., Ahuja, N., Langlotz, C., Hom, J., Gatidis, S., Pauly, J., Chaudhari, A. 2023

    Abstract

    Sifting through vast textual data and summarizing key information from electronic health records (EHR) imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown immense promise in natural language processing (NLP) tasks, their efficacy on a diverse range of clinical summarization tasks has not yet been rigorously demonstrated. In this work, we apply domain adaptation methods to eight LLMs, spanning six datasets and four distinct clinical summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Our thorough quantitative assessment reveals trade-offs between models and adaptation methods in addition to instances where recent advances in LLMs may not improve results. Further, in a clinical reader study with ten physicians, we show that summaries from our best-adapted LLMs are preferable to human summaries in terms of completeness and correctness. Our ensuing qualitative analysis highlights challenges faced by both LLMs and human experts. Lastly, we correlate traditional quantitative NLP metrics with reader study scores to enhance our understanding of how these metrics align with physician preferences. Our research marks the first evidence of LLMs outperforming human experts in clinical text summarization across multiple tasks. This implies that integrating LLMs into clinical workflows could alleviate documentation burden, empowering clinicians to focus more on personalized patient care and the inherently human aspects of medicine.

    View details for DOI 10.21203/rs.3.rs-3483777/v1

    View details for PubMedID 37961377

    View details for PubMedCentralID PMC10635391
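
The metric-alignment analysis mentioned in this preprint boils down to correlating automated scores with physician ratings. A minimal sketch, with made-up numbers standing in for real study data:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-summary scores: an automated NLP metric (e.g., a
# semantic similarity score) vs. a physician reader score for the
# same summaries.
metric_scores = np.array([0.41, 0.55, 0.32, 0.71, 0.64, 0.48])
reader_scores = np.array([2, 4, 1, 5, 4, 3])  # e.g., 5-point Likert

rho, p = spearmanr(metric_scores, reader_scores)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
```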

  • Skeletal Muscle Area on CT: Determination of an Optimal Height Scaling Power and Testing for Mortality Risk Prediction. AJR. American Journal of Roentgenology. Blankemeier, L., Yao, L., Long, J., Reis, E. P., Lenchik, L., Chaudhari, A. S., Boutin, R. D. 2023

    Abstract

    BACKGROUND: Sarcopenia is commonly assessed on CT using the skeletal muscle index (SMI), calculated as skeletal muscle area (SMA) at L3 divided by patient height squared (i.e., a height scaling power of 2). OBJECTIVE: To determine the optimal height scaling power for SMA measurements on CT, and to test the influence of the derived optimal scaling power on the utility of SMI in predicting all-cause mortality. METHODS: This retrospective study included 16,575 patients (mean age, 56.4 years; 6985 men, 9590 women) who underwent abdominal CT from December 2012 through October 2018. SMA at L3 was determined using automated software. The sample was stratified into 5459 patients without major medical conditions (using ICD-9 and ICD-10 codes) for determining an optimal height scaling power, and 11,116 patients with major medical conditions for testing this power. The optimal scaling power was determined by allometric analysis (whereby regression coefficients were fitted to log-linear sex-specific models relating height to SMA) and by analysis of the statistical independence of SMI from height across scaling powers. Cox proportional hazards models were used to test the influence of the derived optimal scaling power on the utility of SMI in predicting all-cause mortality. RESULTS: In allometric analysis, the regression coefficient of log(height) in patients ≤40 years was 1.02 in men and 1.08 in women, and in patients >40 years was 1.07 in men and 1.10 in women (all p<.05 vs a regression coefficient of 2). In analyses of the statistical independence of SMI from height, the optimal height scaling power (i.e., the power yielding correlations closest to 0) was, in patients ≤40 years, 0.97 in men and 1.08 in women, and in patients >40 years, 1.03 in men and 1.09 in women. In the Cox model used for testing, SMI predicted all-cause mortality with a greater concordance index using a height scaling power of 1 than of 2 in men (0.675 vs 0.663, p<.001) and women (0.664 vs 0.653, p<.001). CONCLUSION: The findings support a height scaling power of 1, rather than the conventional power of 2, for SMI computation. CLINICAL IMPACT: A revised height scaling power for SMI could impact the utility of CT-based sarcopenia diagnoses in risk assessment.

    View details for DOI 10.2214/AJR.23.29889

    View details for PubMedID 37877596
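
The allometric analysis at the heart of this paper is a log-linear regression: fit log(SMA) = a + b*log(height), and the slope b is the height power under which SMI = SMA / height^b is statistically independent of height. A self-contained sketch on simulated data with a true power of 1 (the numbers are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(1.70, 0.08, 1000)                    # meters
sma = 45 * height ** 1.0 * rng.lognormal(0, 0.15, 1000)  # cm^2, true power = 1

# Allometric fit: slope of log(SMA) on log(height) estimates the power.
b, a = np.polyfit(np.log(height), np.log(sma), 1)
print(f"estimated scaling power: {b:.2f}")               # ~1, not 2

# SMI computed with the fitted power is uncorrelated with height.
smi = sma / height ** round(b)
print(f"corr(SMI, height): {np.corrcoef(smi, height)[0, 1]:.3f}")  # ~0
```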

  • Evaluating progress in automatic chest X-ray radiology report generation. Patterns (New York, N.Y.) Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E. P., Fonseca, E. K., Lee, H. M., Abad, Z. S., Ng, A. Y., Langlotz, C. P., Venugopal, V. K., Rajpurkar, P. 2023; 4 (9): 100802

    Abstract

    Artificial intelligence (AI) models for automatic generation of narrative radiology reports from images have the potential to enhance efficiency and reduce the workload of radiologists. However, evaluating the correctness of these reports requires metrics that can capture clinically pertinent differences. In this study, we investigate the alignment between automated metrics and radiologists' scoring of errors in report generation. We address the limitations of existing metrics by proposing new metrics, RadGraph F1 and RadCliQ, which demonstrate stronger correlation with radiologists' evaluations. In addition, we analyze the failure modes of the metrics to understand their limitations and provide guidance for metric selection and interpretation. This study establishes RadGraph F1 and RadCliQ as meaningful metrics for guiding future research in radiology report generation.

    View details for DOI 10.1016/j.patter.2023.100802

    View details for PubMedID 37720336

    View details for PubMedCentralID PMC10499844
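
RadGraph F1 scores a generated report by the overlap of its extracted clinical entities and relations with those of the reference report. The schematic below computes an entity-level F1 over hand-written extractions; a real implementation would obtain the entities from the RadGraph model rather than by hand.

```python
# Schematic RadGraph-F1-style computation: F1 overlap between clinical
# entities extracted from a generated report and a reference report.
def f1(pred: set, ref: set) -> float:
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

ref_entities = {("effusion", "present"), ("pneumothorax", "absent"),
                ("cardiomegaly", "present")}
gen_entities = {("effusion", "present"), ("pneumothorax", "absent")}
print(f"entity-level F1: {f1(gen_entities, ref_entities):.2f}")
```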

  • Deep COVID DeteCT: an international experience on COVID-19 lung detection and prognosis using chest CT. NPJ Digital Medicine. Lee, E. H., Zheng, J., Colak, E., Mohammadzadeh, M., Houshmand, G., Bevins, N., Kitamura, F., Altinmakas, E., Reis, E. P., Kim, J. K., Klochko, C., Han, M., Moradian, S., Mohammadzadeh, A., Sharifian, H., Hashemi, H., Firouznia, K., Ghanaati, H., Gity, M., Doğan, H., Salehinejad, H., Alves, H., Seekins, J., Abdala, N., Atasoy, Ç., Pouraliakbar, H., Maleki, M., Wong, S. S., Yeom, K. W. 2021; 4 (1): 11

    Abstract

    Coronavirus disease 2019 (COVID-19) presents open questions about how we clinically diagnose and assess disease course. Recently, chest computed tomography (CT) has shown utility for COVID-19 diagnosis. In this study, we developed Deep COVID DeteCT (DCD), a deep learning convolutional neural network (CNN) that uses the entire chest CT volume to automatically distinguish COVID-19 (COVID+) from non-COVID-19 (COVID-) pneumonia and normal controls. We discuss training strategies and differences in performance across 13 international institutions and 8 countries. The inclusion of non-China sites in training significantly improved classification performance, with areas under the curve (AUC) and accuracies above 0.8 on most test sites. Furthermore, using available follow-up scans, we investigate methods to track patient disease course and predict prognosis.

    View details for DOI 10.1038/s41746-020-00369-1

    View details for PubMedID 33514852
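
The core design choice in DCD is feeding the entire CT volume to a 3-D convolutional network. The toy model below illustrates that volumetric setup and the three-way output (COVID+, non-COVID pneumonia, normal); it does not reproduce the actual DCD architecture.

```python
import torch
import torch.nn as nn

class TinyCT3D(nn.Module):
    """Minimal 3-D CNN sketch for whole-volume chest CT classification."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.AdaptiveAvgPool3d(1),    # pool the whole volume to one vector
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):               # x: (batch, 1, depth, height, width)
        return self.head(self.features(x).flatten(1))

logits = TinyCT3D()(torch.randn(2, 1, 32, 64, 64))
print(logits.shape)                     # torch.Size([2, 3])
```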