Bio


Sanmi Koyejo is an Assistant Professor in the Department of Computer Science at Stanford University and an adjunct Associate Professor at the University of Illinois at Urbana-Champaign. He leads the Stanford Trustworthy AI Research (STAIR) lab, which develops measurement-theoretic foundations for trustworthy AI systems, spanning AI evaluation science, algorithmic accountability, and privacy-preserving machine learning, with applications to healthcare and scientific discovery. His research on AI capabilities evaluation has challenged conventional understanding in the field, including work on measurement frameworks cited in the 2024 Economic Report of the President.

Koyejo has received the Presidential Early Career Award for Scientists and Engineers (PECASE), the Skip Ellis Early Career Award, an Alfred P. Sloan Research Fellowship, an NSF CAREER Award, and multiple outstanding paper awards at flagship venues, including NeurIPS and ACL. He has delivered keynote presentations at major conferences, including ECCV and FAccT. He also serves in key leadership roles, including as Board President of Black in AI and as a member of the Board of Directors of the Neural Information Processing Systems Foundation, among other positions in professional organizations advancing AI research and broadening participation in the field.

All Publications


  • Holistic evaluation of large language models for medical tasks with MedHELM. Nature medicine Bedi, S., Cui, H., Fuentes, M., Unell, A., Wornow, M., Banda, J. M., Kotecha, N., Keyes, T., Mai, Y., Oez, M., Qiu, H., Jain, S., Schettini, L., Kashyap, M., Fries, J. A., Swaminathan, A., Chung, P., Haredasht, F. N., Lopez, I., Aali, A., Tse, G., Nayak, A., Vedak, S., Jain, S. S., Patel, B., Fayanju, O., Shah, S., Goh, E., Yao, D. H., Soetikno, B., Reis, E., Gatidis, S., Divi, V., Capasso, R., Saralkar, R., Chiang, C. C., Jindal, J., Pham, T., Ghoddusi, F., Lin, S., Chiou, A. S., Hong, H. J., Roy, M., Gensheimer, M. F., Patel, H., Schulman, K., Dash, D., Char, D., Downing, L., Grolleau, F., Black, K., Mieso, B., Zahedivash, A., Yim, W. W., Sharma, H., Lee, T., Kirsch, H., Lee, J., Ambers, N., Lugtu, C., Sharma, A., Mawji, B., Alekseyev, A., Zhou, V., Kakkar, V., Helzer, J., Revri, A., Bannett, Y., Daneshjou, R., Chen, J., Alsentzer, E., Morse, K., Ravi, N., Aghaeepour, N., Kennedy, V., Chaudhari, A., Wang, T., Koyejo, S., Lungren, M. P., Horvitz, E., Liang, P., Pfeffer, M. A., Shah, N. H. 2026

    Abstract

    While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, a clinician-validated taxonomy organizing medical AI applications into five categories that mirror real clinical tasks: clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis), and administration (scheduling, workflow coordination). These encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, a systematic comparison of nine frontier LLMs (Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3, and o3-mini) using an automated LLM-jury evaluation method. Our LLM-jury uses multiple AI evaluators to assess model outputs against expert-defined criteria. Advanced reasoning models (DeepSeek R1, o3-mini) demonstrated superior performance with win rates of 66%, although Claude 3.5 Sonnet achieved comparable results at 15% lower computational cost. These results not only highlight current model capabilities but also demonstrate how MedHELM could enable evidence-based selection of medical AI systems for healthcare applications.

    DOI: 10.1038/s41591-025-04151-2
    PubMed ID: 41559415
    PMCID: 10916499

  • Shaping AI's Impact on Billions of Lives COMMUNICATIONS OF THE ACM Cuellar, M., Dean, J., Doshi-Velez, F., Hennessy, J., Konwinski, A., Koyejo, S., Moiloa, P., Pierson, E., Patterson, D. 2026; 69 (1): 54-65

    DOI: 10.1145/3746132
    Web of Science ID: 001650706200006

  • The inadequacy of offline large language model evaluations: A need to account for personalization in model behavior. Patterns (New York, N.Y.) Wang, A., Ho, D. E., Koyejo, S. 2025; 6 (12): 101397

    Abstract

    Standard offline evaluations for language models fail to capture how these models actually behave in practice, where personalization fundamentally alters model behavior. In this work, we provide empirical evidence showcasing this phenomenon by comparing offline evaluations to field evaluations conducted by having 800 real users of ChatGPT and Gemini pose benchmark and other questions to their chat interfaces.

    DOI: 10.1016/j.patter.2025.101397
    PubMed ID: 41472831
    PMCID: PMC12745978

  • TIMER: temporal instruction modeling and evaluation for longitudinal clinical records. NPJ digital medicine Cui, H., Unell, A., Chen, B., Fries, J. A., Alsentzer, E., Koyejo, S., Shah, N. H. 2025; 8 (1): 577

    Abstract

    Electronic health records (EHRs) contain rich longitudinal information for clinical decision-making, yet LLMs struggle to reason across patient timelines. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a method to improve LLMs' temporal reasoning over multi-visit EHRs through time-aware instruction tuning. TIMER grounds LLMs in patient-specific temporal contexts by linking each instruction-response pair to specific timestamps, ensuring temporal fidelity throughout the training process. Evaluations show that TIMER-tuned models outperform conventional medical instruction-tuned approaches by 6.6% in completeness on clinician-curated benchmarks, with distribution-matched training demonstrating advantages of up to 6.5% in temporal reasoning. Qualitative analyses reveal that TIMER enhances temporal boundary adherence, trend detection, and chronological precision, which are necessary for applications such as disease trajectory modeling and treatment response monitoring. Overall, TIMER provides a methodological basis for developing LLMs that can effectively engage with the inherently longitudinal nature of data for patient care. Code is available at TIMER.

    DOI: 10.1038/s41746-025-01965-9
    PubMed ID: 41006898
    PMCID: PMC12475073

  • Advancing science- and evidence-based AI policy. Science (New York, N.Y.) Bommasani, R., Arora, S., Chayes, J., Choi, Y., Cuéllar, M. F., Fei-Fei, L., Ho, D. E., Jurafsky, D., Koyejo, S., Lakkaraju, H., Narayanan, A., Nelson, A., Pierson, E., Pineau, J., Singer, S., Varoquaux, G., Venkatasubramanian, S., Stoica, I., Liang, P., Song, D. 2025; 389 (6759): 459-461

    Abstract

    Policy must be informed by, but also facilitate the generation of, scientific evidence.

    DOI: 10.1126/science.adu8449
    PubMed ID: 40743343

  • Rethinking machine unlearning for large language models NATURE MACHINE INTELLIGENCE Liu, S., Yao, Y., Jia, J., Casper, S., Baracaldo, N., Hase, P., Yao, Y., Liu, C., Xu, X., Li, H., Varshney, K. R., Bansal, M., Koyejo, S., Liu, Y. 2025
  • Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA Bedi, S., Liu, Y., Orr-Ewing, L., Dash, D., Koyejo, S., Callahan, A., Fries, J. A., Wornow, M., Swaminathan, A., Lehmann, L. S., Hong, H. J., Kashyap, M., Chaurasia, A. R., Shah, N. R., Singh, K., Tazbaz, T., Milstein, A., Pfeffer, M. A., Shah, N. H. 2024

    Abstract

    Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas. To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty. A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024. Studies evaluating 1 or more LLMs in health care were included. Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty. Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented. Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity, along with deployment considerations, received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.

    DOI: 10.1001/jama.2024.21700
    PubMed ID: 39405325
    PMCID: PMC11480901

  • Evaluating anti-LGBTQIA+ medical bias in large language models. PLOS digital health Chang, C. T., Srivathsa, N., Bou-Khalil, C., Swaminathan, A., Lunn, M. R., Mishra, K., Koyejo, S., Daneshjou, R. 2025; 4 (9): e0001001

    Abstract

    Large Language Models (LLMs) are increasingly deployed in clinical settings for tasks ranging from patient communication to decision support. While these models demonstrate race-based and binary gender biases, anti-LGBTQIA+ bias remains understudied despite documented healthcare disparities affecting these populations. In this work, we evaluated the potential of LLMs to propagate anti-LGBTQIA+ medical bias and misinformation. We prompted 4 LLMs (Gemini 1.5 Flash, Claude 3 Haiku, GPT-4o, Stanford Medicine Secure GPT [GPT-4.0]) with 38 prompts consisting of explicit questions and synthetic clinical notes created by medically trained reviewers and LGBTQIA+ health experts. The prompts were paired with and without LGBTQIA+ identity terms and explored clinical situations across two axes: (i) situations where historical bias has been observed versus not observed, and (ii) situations where LGBTQIA+ identity is relevant to clinical care versus not relevant. Medically trained reviewers evaluated LLM responses for appropriateness (safety, privacy, hallucination/accuracy, and bias) and clinical utility. We found that all 4 LLMs generated inappropriate responses for prompts with and without LGBTQIA+ identity terms. The proportion of inappropriate responses ranged from 43% to 62% for prompts mentioning LGBTQIA+ identities versus 47% to 65% for those without. The most common reason for inappropriate classification tended to be hallucination/accuracy, followed by bias or safety. Qualitatively, we observed differential bias patterns, with LGBTQIA+ prompts eliciting more severe bias. The average clinical utility score for inappropriate responses was lower than for appropriate responses (2.6 versus 3.7 on a 5-point Likert scale). Future work should focus on tailoring output formats to stated use cases, decreasing sycophancy and reliance on extraneous information in the prompt, and improving accuracy and decreasing bias for LGBTQIA+ patients. We present our prompts and annotated responses as a benchmark for evaluation of future models. Content warning: This paper includes prompts and model-generated responses that may be offensive.

    DOI: 10.1371/journal.pdig.0001001
    PubMed ID: 40920790

  • Fidelity of Medical Reasoning in Large Language Models. JAMA network open Bedi, S., Jiang, Y., Chung, P., Koyejo, S., Shah, N. 2025; 8 (8): e2526021

    DOI: 10.1001/jamanetworkopen.2025.26021
    PubMed ID: 40779272

  • Advancing oil and gas emissions assessment through large language model data extraction ENERGY AND AI Chen, Z., Zhong, R., Long, W., Tanga, H., Wang, A., Liu, Z., Yang, X., Ren, B., Littlefield, J., Koyejo, S., Masnadi, M. S., Brandt, A. R. 2025; 20
  • The Reality of AI and Biorisk Peppin, A., Reuel, A., Casper, S., Jones, E., Strait, A., Anwar, U., Agrawal, A., Kapoor, S., Koyejo, S., Pellat, M., Bommasani, R., Frosst, N., Hooker, S., ACM ASSOC COMPUTING MACHINERY. 2025: 763-771
  • More than Marketing? On the Information Value of AI Benchmarks for Practitioners Hardy, A., Reuel, A., Meimandi, K., Soder, L., Griffith, A., Asmar, D. M., Koyejo, S., Bernstein, M. S., Kochenderfer, M., ACM ASSOC COMPUTING MACHINERY. 2025: 1032-1047
  • Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs Wang, A., Phan, M., Ho, D. E., Koyejo, S. edited by Che, W., Nabende, J., Shutova, E., Pilehvar, M. T. ASSOC COMPUTATIONAL LINGUISTICS-ACL. 2025: 6867-6893
  • Publisher Correction: Increasing the presence of BIPOC researchers in computational science. Nature computational science Chen, C. Y., Christoffels, A., Dube, R., Enos, K., Gilbert, J. E., Koyejo, S., Leigh, J., Liquido, C., McKee, A., Noe, K., Peng, T., Taiuru, K. 2024

    DOI: 10.1038/s43588-024-00710-8
    PubMed ID: 39354103

  • Increasing the presence of BIPOC researchers in computational science. Nature computational science Chen, C. Y., Christoffels, A., Dube, R., Enos, K., Gilbert, J. E., Koyejo, S., Leigh, J., Liquido, C., McKee, A., Noe, K., Peng, T. Q., Taiuru, K. 2024; 4 (9): 646-653

    DOI: 10.1038/s43588-024-00693-6
    PubMed ID: 39317763

  • Artificial Intelligence, Social Responsibility, and the Roles of the University COMMUNICATIONS OF THE ACM Bosch, N., Chan, A., Davis, J. L., Gutierrez, R., He, J., Karahalios, K., Koyejo, S., Loui, M. C., Mendenhall, R., Sanfilippo, M., Tong, H., Varshney, L. R., Wang, Y. 2024; 67 (8): 22-25

    DOI: 10.1145/3640541
    Web of Science ID: 001293981400007

  • Single-Trial Detection and Classification of Event-Related Optical Signals for a Brain-Computer Interface Application. Bioengineering (Basel, Switzerland) Chiou, N., Günal, M., Koyejo, S., Perpetuini, D., Chiarelli, A. M., Low, K. A., Fabiani, M., Gratton, G. 2024; 11 (8)

    Abstract

    Event-related optical signals (EROS) measure fast modulations in the brain's optical properties related to neuronal activity. EROS offer a high spatial and temporal resolution and can be used for brain-computer interface (BCI) applications. However, the ability to classify single-trial EROS remains unexplored. This study evaluates the performance of neural network methods for single-trial classification of motor response-related EROS. EROS activity was obtained from a high-density recording montage covering the motor cortex during a two-choice reaction time task involving responses with the left or right hand. This study utilized a convolutional neural network (CNN) approach to extract spatiotemporal features from EROS data and perform classification of left and right motor responses. Subject-specific classifiers trained on EROS phase data outperformed those trained on intensity data, reaching an average single-trial classification accuracy of around 63%. Removing low-frequency noise from intensity data is critical for achieving discriminative classification results with this measure. Our results indicate that deep learning with high-spatial-resolution signals, such as EROS, can be successfully applied to single-trial classifications.

    DOI: 10.3390/bioengineering11080781
    PubMed ID: 39199739
    PMCID: PMC11351476

  • Bridging gaps in automated acute myocardial infarction detection between high-income and low-income countries PLOS GLOBAL PUBLIC HEALTH Chiou, N., Koyejo, S., Ngaruiya, C. 2024; 4 (6): e0003240

    DOI: 10.1371/journal.pgph.0003240
    Web of Science ID: 001418792700001
    PubMed ID: 38941326

  • Author Correction: Opportunistic detection of type 2 diabetes using deep learning from frontal chest radiographs. Nature communications Pyrros, A., Borstelmann, S. M., Mantravadi, R., Zaiman, Z., Thomas, K., Price, B., Greenstein, E., Siddiqui, N., Willis, M., Shulhan, I., Hines-Shah, J., Horowitz, J. M., Nikolaidis, P., Lungren, M. P., Rodríguez-Fernández, J. M., Gichoya, J. W., Koyejo, S., Flanders, A. E., Khandwala, N., Gupta, A., Garrett, J. W., Cohen, J. P., Layden, B. T., Pickhardt, P. J., Galanter, W. 2024; 15 (1): 4817

    DOI: 10.1038/s41467-024-49184-2
    PubMed ID: 38844459
    PMCID: PMC11156917

  • Impact of biased models in the context of fairness towards patients, and how to avoid or minimise biases in our datasets Koyejo, S. ELSEVIER IRELAND LTD. 2024: S46
  • Latent Multimodal Functional Graphical Model Estimation JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Tsai, K., Zhao, B., Koyejo, S., Kolar, M. 2024; 119 (547): 2217-2229
  • The Case for Globalizing Fairness: A Mixed Methods Study on Colonialism, AI, and Health in Africa Asiedu, M., Dieng, A., Haykel, I., Rostamzadeh, N., Pfohl, S., Nagpal, C., Nagawa, M., Oppong, A., Koyejo, S., Heller, K., ACM ASSOC COMPUTING MACHINERY. 2024
  • Adaptive Compression in Federated Learning via Side Information Isik, B., Pase, F., Gunduz, D., Koyejo, S., Weissman, T., Zorzi, M. edited by Dasgupta, S., Mandt, S., Li, Y. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2024
  • Invariant Aggregator for Defending against Federated Backdoor Attacks Wang, X., Dimitriadis, D., Koyejo, S., Tople, S. edited by Dasgupta, S., Mandt, S., Li, Y. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2024
  • Proxy Methods for Domain Adaptation Tsai, K., Pfohl, S. R., Salaudeen, O., Chiou, N., Kusner, M. J., DAmour, A., Koyejo, S., Gretton, A. edited by Dasgupta, S., Mandt, S., Li, Y. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2024
  • Causally Inspired Regularization Enables Domain General Representations Salaudeen, O., Koyejo, S. edited by Dasgupta, S., Mandt, S., Li, Y. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2024
  • Towards Trustworthy Large Language Models Koyejo, S., Li, B. ASSOC COMPUTING MACHINERY. 2024: 1126-1127
  • Bayesian Optimization for Crop Genetics with Scalable Probabilistic Models Azam, R., Truong, S. T., Fernandes, S. B., Leakey, A. D. B., Lipka, A., El-Kebir, M., Koyejo, S. edited by Antoran, J., Naesseth, C. A. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2024: 30-44
  • Disentangling Fact from Grid Cell Fiction in Trained Deep Path Integrators. ArXiv Schaeffer, R., Khona, M., Koyejo, S., Fiete, I. R. 2023

    Abstract

    Work on deep learning-based models of grid cells suggests that grid cells generically and robustly arise from optimizing networks to path integrate, i.e., track one's spatial position by integrating self-velocity signals. In previous work [27], we challenged this path integration hypothesis by showing that deep neural networks trained to path integrate almost always do so, but almost never learn grid-like tuning unless separately inserted by researchers via mechanisms unrelated to path integration. In this work, we restate the key evidence substantiating these insights, then address a response to [27] by authors of one of the path integration hypothesis papers [32]. First, we show that the response misinterprets our work, indirectly confirming our points. Second, we evaluate the response's preferred "unified theory for the origin of grid cells" in trained deep path integrators [31, 33, 34] and show that it is at best "occasionally suggestive," not exact or comprehensive. We finish by considering why assessing model quality through prediction of biological neural activity by regression of activity in deep networks [23] can lead to the wrong conclusions.

    PubMed ID: 38106458

  • Longitudinal assessment of demographic representativeness in the Medical Imaging and Data Resource Center open data commons JOURNAL OF MEDICAL IMAGING Whitney, H. M., Baughan, N., Myers, K. J., Drukker, K., Gichoya, J., Bower, B., Chen, W., Gruszauskas, N., Kalpathy-Cramer, J., Koyejo, S., Sa, R. C., Sahiner, B., Zhang, Z., Giger, M. L. 2023; 10 (6): 61105

    Abstract

    The Medical Imaging and Data Resource Center (MIDRC) open data commons was launched to accelerate the development of artificial intelligence (AI) algorithms to help address the COVID-19 pandemic. The purpose of this study was to quantify longitudinal representativeness of the demographic characteristics of the primary MIDRC dataset compared to the United States general population (US Census) and COVID-19 positive case counts from the Centers for Disease Control and Prevention (CDC). The Jensen-Shannon distance (JSD), a measure of similarity of two distributions, was used to longitudinally measure the representativeness of the distribution of (1) all unique patients in the MIDRC data to the 2020 US Census and (2) all unique COVID-19 positive patients in the MIDRC data to the case counts reported by the CDC. The distributions were evaluated in the demographic categories of age at index, sex, race, ethnicity, and the combination of race and ethnicity. Representativeness of the MIDRC data by ethnicity and the combination of race and ethnicity was impacted by the percentage of CDC case counts for which this information was not reported. The distributions by sex and race have retained their level of representativeness over time. The representativeness of the open medical imaging datasets in the curated public data commons at MIDRC has evolved over time as the number of contributing institutions and the overall number of subjects have grown. The use of metrics such as the JSD to support measurement of representativeness is one step needed for fair and generalizable AI algorithm development.

    DOI: 10.1117/1.JMI.10.6.061105
    Web of Science ID: 001139907400011
    PubMed ID: 37469387
    PMCID: PMC10353566
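    As an illustration of the similarity measure used in this study, a minimal Jensen-Shannon distance computation (base-2 logarithms, so values fall in [0, 1]) can be sketched as follows; the demographic shares below are hypothetical and not taken from the MIDRC data:

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions.

    Uses base-2 logarithms, so the result lies in [0, 1]:
    0 means identical distributions, 1 means disjoint support.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

# Hypothetical demographic shares (dataset vs. census); illustrative only.
dataset_shares = [0.60, 0.25, 0.10, 0.05]
census_shares = [0.58, 0.24, 0.12, 0.06]
print(f"JSD = {jensen_shannon_distance(dataset_shares, census_shares):.4f}")
```

    A smaller distance indicates that the dataset's demographic distribution more closely tracks the reference population.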

  • Toward fairness in artificial intelligence for medical image analysis: identification and mitigation of potential biases in the roadmap from data collection to model deployment JOURNAL OF MEDICAL IMAGING Drukker, K., Chen, W., Gichoya, J., Gruszauskas, N., Kalpathy-Cramer, J., Koyejo, S., Myers, K., Sa, R. C., Sahiner, B., Whitney, H., Zhang, Z., Giger, M. 2023; 10 (6): 061104

    Abstract

    There is increasing interest in developing medical imaging-based machine learning methods, also known as medical imaging artificial intelligence (AI), for the detection, diagnosis, prognosis, and risk assessment of disease, with the goal of clinical implementation. These tools are intended to help improve traditional human decision-making in medical imaging. However, biases introduced in the steps toward clinical deployment may impede their intended function, potentially exacerbating inequities: medical imaging AI can propagate or amplify biases introduced in the many steps from model inception to deployment, resulting in a systematic difference in the treatment of different groups. Recognizing and addressing these sources of bias is essential for algorithmic fairness and trustworthiness and contributes to a just and equitable deployment of AI in medical imaging. Our multi-institutional team included medical physicists, medical imaging artificial intelligence/machine learning (AI/ML) researchers, experts in AI/ML bias, statisticians, physicians, and scientists from regulatory bodies. We identified sources of bias in AI/ML and mitigation strategies for these biases, and we developed recommendations for best practices in medical imaging AI/ML development. Five main steps along the roadmap of medical imaging AI/ML were identified: (1) data collection, (2) data preparation and annotation, (3) model development, (4) model evaluation, and (5) model deployment. Within these steps, or bias categories, we identified 29 sources of potential bias, many of which can impact multiple steps, as well as mitigation strategies. Our findings provide a valuable resource to researchers, clinicians, and the public at large.

    DOI: 10.1117/1.JMI.10.6.061104
    Web of Science ID: 001139907400013
    PubMed ID: 37125409
    PMCID: PMC10129875

  • Opportunistic detection of type 2 diabetes using deep learning from frontal chest radiographs. Nature communications Pyrros, A., Borstelmann, S. M., Mantravadi, R., Zaiman, Z., Thomas, K., Price, B., Greenstein, E., Siddiqui, N., Willis, M., Shulhan, I., Hines-Shah, J., Horowitz, J. M., Nikolaidis, P., Lungren, M. P., Rodríguez-Fernández, J. M., Gichoya, J. W., Koyejo, S., Flanders, A. E., Khandwala, N., Gupta, A., Garrett, J. W., Cohen, J. P., Layden, B. T., Pickhardt, P. J., Galanter, W. 2023; 14 (1): 4039

    Abstract

    Deep learning (DL) models can harness electronic health records (EHRs) to predict diseases and extract radiologic findings for diagnosis. With ambulatory chest radiographs (CXRs) frequently ordered, we investigated detecting type 2 diabetes (T2D) by combining radiographic and EHR data using a DL model. Our model, developed from 271,065 CXRs and 160,244 patients, was tested on a prospective dataset of 9,943 CXRs. Here we show the model effectively detected T2D with a ROC AUC of 0.84 and a 16% prevalence. The algorithm flagged 1,381 cases (14%) as suspicious for T2D. External validation at a distinct institution yielded a ROC AUC of 0.77, with 5% of patients subsequently diagnosed with T2D. Explainable AI techniques revealed correlations between specific adiposity measures and high predictivity, suggesting CXRs' potential for enhanced T2D screening.

    DOI: 10.1038/s41467-023-39631-x
    PubMed ID: 37419921
    PMCID: PMC10328953
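    The ROC AUC reported in this study is equivalent to the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative one. A minimal pairwise computation, using made-up scores rather than the study's data, can be sketched as:

```python
def roc_auc(labels, scores):
    """ROC AUC via the pairwise-comparison definition.

    Equals the probability that a random positive outscores a random
    negative, counting ties as half. O(n^2), fine for illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up model scores for four patients (1 = has T2D, 0 = does not).
labels = [0, 0, 1, 1]
scores = [0.10, 0.35, 0.30, 0.90]
print(roc_auc(labels, scores))  # 0.75: one positive is outscored by one negative
```

    An AUC of 0.5 corresponds to chance-level ranking; the 0.84 reported above means a positive case outscores a negative one 84% of the time.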

  • Fast Optical Signals for Real-Time Retinotopy and Brain Computer Interface. Bioengineering (Basel, Switzerland) Perpetuini, D., Gunal, M., Chiou, N., Koyejo, S., Mathewson, K., Low, K. A., Fabiani, M., Gratton, G., Chiarelli, A. M. 2023; 10 (5)

    Abstract

    A brain-computer interface (BCI) allows users to control external devices through brain activity. Portable neuroimaging techniques, such as near-infrared (NIR) imaging, are suitable for this goal. NIR imaging has been used to measure rapid changes in brain optical properties associated with neuronal activation, namely fast optical signals (FOS) with good spatiotemporal resolution. However, FOS have a low signal-to-noise ratio, limiting their BCI application. Here FOS were acquired with a frequency-domain optical system from the visual cortex during visual stimulation consisting of a rotating checkerboard wedge, flickering at 5 Hz. We used measures of photon count (Direct Current, DC light intensity) and time of flight (phase) at two NIR wavelengths (690 nm and 830 nm) combined with a machine learning approach for fast estimation of visual-field quadrant stimulation. The input features of a cross-validated support vector machine classifier were computed as the average modulus of the wavelet coherence between each channel and the average response among all channels in 512 ms time windows. An above chance performance was obtained when differentiating visual stimulation quadrants (left vs. right or top vs. bottom) with the best classification accuracy of ~63% (information transfer rate of ~6 bits/min) when classifying the superior and inferior stimulation quadrants using DC at 830 nm. The method is the first attempt to provide generalizable retinotopy classification relying on FOS, paving the way for the use of FOS in real-time BCI.

    DOI: 10.3390/bioengineering10050553
    PubMed ID: 37237623

  • One Policy is Enough: Parallel Exploration with a Single Policy is Near-Optimal for Reward-Free Reinforcement Learning Cisneros-Velarde, P., Lyu, B., Koyejo, S., Kolar, M. edited by Ruiz, F., Dy, J., VanDeMeent, J. W. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2023
  • Finite-sample Guarantees for Nash Q-learning with Linear Function Approximation Cisneros-Velarde, P., Koyejo, S. edited by Evans, R. J., Shpitser JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2023: 424-432
  • Unraveling the Connections between Privacy and Certified Robustness in Federated Learning Against Poisoning Attacks Xie, C., Long, Y., Chen, P., Li, Q., Koyejo, S., Li, B., ACM ASSOC COMPUTING MACHINERY. 2023: 1511-1525
  • Self-Supervised Learning of Representations for Space Generates Multi-Modular Grid Cells Schaeffer, R., Khona, M., Ma, T., Eyzaguirre, C., Koyejo, S., Fiete, I. edited by Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2023
  • DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora, S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S., Song, D., Li, B. edited by Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2023
  • Pairwise Ranking Losses of Click-Through Rates Prediction for Welfare Maximization in Ad Auctions Lyu, B., Feng, Z., Robertson, Z., Koyejo, S. edited by Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2023
  • Adapting to Latent Subgroup Shifts via Concepts and Proxies Alabdulmohsin, I., Chiou, N., D'Amour, A., Gretton, A., Koyejo, S., Kusner, M. J., Pfohl, S. R., Salaudeen, O., Schrouff, J., Tsai, K. edited by Ruiz, F., Dy, J., VanDeMeent, J. W. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2023
  • Fair Wrapping for Black-box Predictions Soen, A., Alabdulmohsin, I., Koyejo, S., Mansour, Y., Moorosi, N., Nock, R. edited by Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2022