Bio


I'm a visiting scholar at the Stanford AIMI Center, working at the intersection of artificial intelligence and medicine. My purpose is to contribute to our understanding of intelligence, and our best chance of achieving this is through AI.



Research highlights:

- Published BRAX, the Brazilian Chest X-ray Dataset - https://www.nature.com/articles/s41597-022-01608-8

- Open-sourced a PyTorch implementation of ConVIRT (Y. Zhang et al.), a contrastive learning method for radiologic images and text that predates CLIP - https://github.com/edreisMD/ConVIRT-pytorch

- Released Brain Hemorrhage Extended (BHX), a dataset of brain hemorrhage annotations - https://physionet.org/content/bhx-brain-bounding-box



At Hospital Israelita Albert Einstein:

- Started the Health Story project, a medical history timeline to support research and a more personalized clinical practice

- Led the development of AI algorithms for applications of national importance: tuberculosis, COVID, melanoma, and head CT

All Publications


  • Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts. Research square Veen, D. V., Uden, C. V., Blankemeier, L., Delbrouck, J. B., Aali, A., Bluethgen, C., Pareek, A., Polacin, M., Reis, E. P., Seehofnerova, A., Rohatgi, N., Hosamani, P., Collins, W., Ahuja, N., Langlotz, C., Hom, J., Gatidis, S., Pauly, J., Chaudhari, A. 2023

    Abstract

    Sifting through vast textual data and summarizing key information from electronic health records (EHR) imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown immense promise in natural language processing (NLP) tasks, their efficacy on a diverse range of clinical summarization tasks has not yet been rigorously demonstrated. In this work, we apply domain adaptation methods to eight LLMs, spanning six datasets and four distinct clinical summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Our thorough quantitative assessment reveals trade-offs between models and adaptation methods in addition to instances where recent advances in LLMs may not improve results. Further, in a clinical reader study with ten physicians, we show that summaries from our best-adapted LLMs are preferable to human summaries in terms of completeness and correctness. Our ensuing qualitative analysis highlights challenges faced by both LLMs and human experts. Lastly, we correlate traditional quantitative NLP metrics with reader study scores to enhance our understanding of how these metrics align with physician preferences. Our research marks the first evidence of LLMs outperforming human experts in clinical text summarization across multiple tasks. This implies that integrating LLMs into clinical workflows could alleviate documentation burden, empowering clinicians to focus more on personalized patient care and the inherently human aspects of medicine.

    View details for DOI 10.21203/rs.3.rs-3483777/v1

    View details for PubMedID 37961377

    View details for PubMedCentralID PMC10635391

  • Skeletal Muscle Area on CT: Determination of an Optimal Height Scaling Power and Testing for Mortality Risk Prediction. AJR. American journal of roentgenology Blankemeier, L., Yao, L., Long, J., Reis, E. P., Lenchik, L., Chaudhari, A. S., Boutin, R. D. 2023

    Abstract

    BACKGROUND: Sarcopenia is commonly assessed on CT using the skeletal muscle index (SMI), calculated as skeletal muscle area (SMA) at L3 divided by patient height squared (i.e., height scaling power of 2). OBJECTIVE: To determine the optimal height scaling power for SMA measurements on CT, and to test the influence of the derived optimal scaling power on the utility of SMI in predicting all-cause mortality. METHODS: This retrospective study included 16,575 patients (mean age, 56.4 years; 6985 men, 9590 women) who underwent abdominal CT from December 2012 through October 2018. SMA at L3 was determined using automated software. The sample was stratified into 5459 patients without major medical conditions (using ICD-9 and ICD-10 codes) for determining an optimal height scaling power, and 11,116 patients with major medical conditions for testing this power. The optimal scaling power was determined by allometric analysis (whereby regression coefficients were fitted to log-linear sex-specific models relating height to SMA) and by analysis of statistical independence of SMI from height across scaling powers. Cox proportional hazards models were used to test the derived optimal scaling power's influence on utility of SMI in predicting all-cause mortality. RESULTS: In allometric analysis, the regression coefficient of log(height) in patients ≤40 years was 1.02 in men and 1.08 in women, and in patients >40 years was 1.07 in men and 1.10 in women (all p<.05 vs regression coefficient of 2). In analyses for statistical independence of SMI from height, the optimal height scaling power (i.e., those yielding correlations closest to 0) was, in patients ≤40 years, 0.97 in men and 1.08 in women, and in patients >40 years, 1.03 in men and 1.09 in women. In the Cox model used for testing, SMI predicted all-cause mortality with greater concordance index using a height scaling power of 1 than 2 in men (0.675 vs 0.663, p<.001) and women (0.664 vs 0.653, p<.001). 
    CONCLUSION: The findings support a height scaling power of 1, rather than conventional power of 2, for SMI computation. CLINICAL IMPACT: A revised height scaling power for SMI could impact the utility of CT-based sarcopenia diagnoses in risk assessment.

    View details for DOI 10.2214/AJR.23.29889

    View details for PubMedID 37877596

  • Evaluating progress in automatic chest X-ray radiology report generation. Patterns (New York, N.Y.) Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E. P., Fonseca, E. K., Lee, H. M., Abad, Z. S., Ng, A. Y., Langlotz, C. P., Venugopal, V. K., Rajpurkar, P. 2023; 4 (9): 100802

    Abstract

    Artificial intelligence (AI) models for automatic generation of narrative radiology reports from images have the potential to enhance efficiency and reduce the workload of radiologists. However, evaluating the correctness of these reports requires metrics that can capture clinically pertinent differences. In this study, we investigate the alignment between automated metrics and radiologists' scoring of errors in report generation. We address the limitations of existing metrics by proposing new metrics, RadGraph F1 and RadCliQ, which demonstrate stronger correlation with radiologists' evaluations. In addition, we analyze the failure modes of the metrics to understand their limitations and provide guidance for metric selection and interpretation. This study establishes RadGraph F1 and RadCliQ as meaningful metrics for guiding future research in radiology report generation.

    View details for DOI 10.1016/j.patter.2023.100802

    View details for PubMedID 37720336

    View details for PubMedCentralID PMC10499844

  • Deep COVID DeteCT: an international experience on COVID-19 lung detection and prognosis using chest CT. NPJ digital medicine Lee, E. H., Zheng, J., Colak, E., Mohammadzadeh, M., Houshmand, G., Bevins, N., Kitamura, F., Altinmakas, E., Reis, E. P., Kim, J. K., Klochko, C., Han, M., Moradian, S., Mohammadzadeh, A., Sharifian, H., Hashemi, H., Firouznia, K., Ghanaati, H., Gity, M., Doğan, H., Salehinejad, H., Alves, H., Seekins, J., Abdala, N., Atasoy, Ç., Pouraliakbar, H., Maleki, M., Wong, S. S., Yeom, K. W. 2021; 4 (1): 11

    Abstract

    The Coronavirus disease 2019 (COVID-19) presents open questions in how we clinically diagnose and assess disease course. Recently, chest computed tomography (CT) has shown utility for COVID-19 diagnosis. In this study, we developed Deep COVID DeteCT (DCD), a deep learning convolutional neural network (CNN) that uses the entire chest CT volume to automatically predict COVID-19 (COVID+) from non-COVID-19 (COVID-) pneumonia and normal controls. We discuss training strategies and differences in performance across 13 international institutions and 8 countries. The inclusion of non-China sites in training significantly improved classification performance with area under the curve (AUCs) and accuracies above 0.8 on most test sites. Furthermore, using available follow-up scans, we investigate methods to track patient disease course and predict prognosis.

    View details for DOI 10.1038/s41746-020-00369-1

    View details for PubMedID 33514852