Bio


Sanmi Koyejo is an Assistant Professor in the Department of Computer Science at Stanford University and an adjunct Associate Professor at the University of Illinois at Urbana-Champaign. He leads the Stanford Trustworthy AI Research (STAIR) lab, which develops measurement-theoretic foundations for trustworthy AI systems, spanning AI evaluation science, algorithmic accountability, and privacy-preserving machine learning, with applications to healthcare and scientific discovery. His research on AI capabilities evaluation has challenged conventional understanding in the field, including work on measurement frameworks cited in the 2024 Economic Report of the President.

Koyejo has received the Presidential Early Career Award for Scientists and Engineers (PECASE), the Skip Ellis Early Career Award, an Alfred P. Sloan Research Fellowship, an NSF CAREER Award, and multiple outstanding paper awards at flagship venues, including NeurIPS and ACL. He has delivered keynote presentations at major conferences, including ECCV and FAccT. He also serves in key leadership roles, including as Board President of Black in AI and as a member of the Board of Directors of the Neural Information Processing Systems Foundation, among other positions in professional organizations advancing AI research and broadening participation in the field.

All Publications


  • Holistic evaluation of large language models for medical tasks with MedHELM. Nature medicine Bedi, S., Cui, H., Fuentes, M., Unell, A., Wornow, M., Banda, J. M., Kotecha, N., Keyes, T., Mai, Y., Oez, M., Qiu, H., Jain, S., Schettini, L., Kashyap, M., Fries, J. A., Swaminathan, A., Chung, P., Haredasht, F. N., Lopez, I., Aali, A., Tse, G., Nayak, A., Vedak, S., Jain, S. S., Patel, B., Fayanju, O., Shah, S., Goh, E., Yao, D. H., Soetikno, B., Reis, E., Gatidis, S., Divi, V., Capasso, R., Saralkar, R., Chiang, C. C., Jindal, J., Pham, T., Ghoddusi, F., Lin, S., Chiou, A. S., Hong, H. J., Roy, M., Gensheimer, M. F., Patel, H., Schulman, K., Dash, D., Char, D., Downing, L., Grolleau, F., Black, K., Mieso, B., Zahedivash, A., Yim, W. W., Sharma, H., Lee, T., Kirsch, H., Lee, J., Ambers, N., Lugtu, C., Sharma, A., Mawji, B., Alekseyev, A., Zhou, V., Kakkar, V., Helzer, J., Revri, A., Bannett, Y., Daneshjou, R., Chen, J., Alsentzer, E., Morse, K., Ravi, N., Aghaeepour, N., Kennedy, V., Chaudhari, A., Wang, T., Koyejo, S., Lungren, M. P., Horvitz, E., Liang, P., Pfeffer, M. A., Shah, N. H. 2026

    Abstract

    While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, a clinician-validated taxonomy organizing medical AI applications into five categories that mirror real clinical tasks: clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis), and administration (scheduling, workflow coordination). These encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, a systematic comparison of nine frontier LLMs (Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3, and o3-mini) using an automated LLM-jury evaluation method. Our LLM-jury uses multiple AI evaluators to assess model outputs against expert-defined criteria. Advanced reasoning models (DeepSeek R1, o3-mini) demonstrated superior performance with win rates of 66%, although Claude 3.5 Sonnet achieved comparable results at 15% lower computational cost. These results not only highlight current model capabilities but also demonstrate how MedHELM could enable evidence-based selection of medical AI systems for healthcare applications.

    DOI: 10.1038/s41591-025-04151-2
    PubMed ID: 41559415
    PMCID: 10916499

  • Shaping AI's Impact on Billions of Lives COMMUNICATIONS OF THE ACM Cuellar, M., Dean, J., Doshi-Velez, F., Hennessy, J., Konwinski, A., Koyejo, S., Moiloa, P., Pierson, E., Patterson, D. 2026; 69 (1): 54-65

    DOI: 10.1145/3746132
    Web of Science ID: 001650706200006

  • The inadequacy of offline large language model evaluations: A need to account for personalization in model behavior. Patterns (New York, N.Y.) Wang, A., Ho, D. E., Koyejo, S. 2025; 6 (12): 101397

    Abstract

    Standard offline evaluations for language models fail to capture how these models actually behave in practice, where personalization fundamentally alters model behavior. In this work, we provide empirical evidence showcasing this phenomenon by comparing offline evaluations to field evaluations conducted by having 800 real users of ChatGPT and Gemini pose benchmark and other questions to their chat interfaces.

    DOI: 10.1016/j.patter.2025.101397
    PubMed ID: 41472831
    PMCID: PMC12745978

  • TIMER: temporal instruction modeling and evaluation for longitudinal clinical records. NPJ digital medicine Cui, H., Unell, A., Chen, B., Fries, J. A., Alsentzer, E., Koyejo, S., Shah, N. H. 2025; 8 (1): 577

    Abstract

    Electronic health records (EHRs) contain rich longitudinal information for clinical decision-making, yet LLMs struggle to reason across patient timelines. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a method to improve LLMs' temporal reasoning over multi-visit EHRs through time-aware instruction tuning. TIMER grounds LLMs in patient-specific temporal contexts by linking each instruction-response pair to specific timestamps, ensuring temporal fidelity throughout the training process. Evaluations show that TIMER-tuned models outperform conventional medical instruction-tuned approaches by 6.6% in completeness on clinician-curated benchmarks, with distribution-matched training demonstrating advantages of up to 6.5% in temporal reasoning. Qualitative analyses reveal that TIMER enhances temporal boundary adherence, trend detection, and chronological precision, which are necessary for applications such as disease trajectory modeling and treatment response monitoring. Overall, TIMER provides a methodological basis for developing LLMs that can effectively engage with the inherently longitudinal nature of data for patient care. Code is available at TIMER.

    DOI: 10.1038/s41746-025-01965-9
    PubMed ID: 41006898
    PMCID: PMC12475073

  • Advancing science- and evidence-based AI policy. Science (New York, N.Y.) Bommasani, R., Arora, S., Chayes, J., Choi, Y., Cuéllar, M. F., Fei-Fei, L., Ho, D. E., Jurafsky, D., Koyejo, S., Lakkaraju, H., Narayanan, A., Nelson, A., Pierson, E., Pineau, J., Singer, S., Varoquaux, G., Venkatasubramanian, S., Stoica, I., Liang, P., Song, D. 2025; 389 (6759): 459-461

    Abstract

    Policy must be informed by, but also facilitate the generation of, scientific evidence.

    DOI: 10.1126/science.adu8449
    PubMed ID: 40743343

  • Rethinking machine unlearning for large language models NATURE MACHINE INTELLIGENCE Liu, S., Yao, Y., Jia, J., Casper, S., Baracaldo, N., Hase, P., Yao, Y., Liu, C., Xu, X., Li, H., Varshney, K. R., Bansal, M., Koyejo, S., Liu, Y. 2025
  • Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA Bedi, S., Liu, Y., Orr-Ewing, L., Dash, D., Koyejo, S., Callahan, A., Fries, J. A., Wornow, M., Swaminathan, A., Lehmann, L. S., Hong, H. J., Kashyap, M., Chaurasia, A. R., Shah, N. R., Singh, K., Tazbaz, T., Milstein, A., Pfeffer, M. A., Shah, N. H. 2024

    Abstract

    Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas. To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty. A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024. Studies evaluating 1 or more LLMs in health care were included. Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty. Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented. Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity, along with deployment considerations, received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.

    DOI: 10.1001/jama.2024.21700
    PubMed ID: 39405325
    PMCID: PMC11480901

  • Evaluating anti-LGBTQIA+ medical bias in large language models. PLOS digital health Chang, C. T., Srivathsa, N., Bou-Khalil, C., Swaminathan, A., Lunn, M. R., Mishra, K., Koyejo, S., Daneshjou, R. 2025; 4 (9): e0001001

    Abstract

    Large Language Models (LLMs) are increasingly deployed in clinical settings for tasks ranging from patient communication to decision support. While these models demonstrate race-based and binary gender biases, anti-LGBTQIA+ bias remains understudied despite documented healthcare disparities affecting these populations. In this work, we evaluated the potential of LLMs to propagate anti-LGBTQIA+ medical bias and misinformation. We prompted 4 LLMs (Gemini 1.5 Flash, Claude 3 Haiku, GPT-4o, Stanford Medicine Secure GPT [GPT-4.0]) with 38 prompts consisting of explicit questions and synthetic clinical notes created by medically trained reviewers and LGBTQIA+ health experts. The prompts were paired with and without LGBTQIA+ identity terms and explored clinical situations across two axes: (i) situations where historical bias has been observed versus not observed, and (ii) situations where LGBTQIA+ identity is relevant to clinical care versus not relevant. Medically trained reviewers evaluated LLM responses for appropriateness (safety, privacy, hallucination/accuracy, and bias) and clinical utility. We found that all 4 LLMs generated inappropriate responses for prompts with and without LGBTQIA+ identity terms. The proportion of inappropriate responses ranged from 43% to 62% for prompts mentioning LGBTQIA+ identities versus 47% to 65% for those without. The most common reason for inappropriate classification tended to be hallucination/accuracy, followed by bias or safety. Qualitatively, we observed differential bias patterns, with LGBTQIA+ prompts eliciting more severe bias. The average clinical utility score for inappropriate responses was lower than for appropriate responses (2.6 versus 3.7 on a 5-point Likert scale). Future work should focus on tailoring output formats to stated use cases, decreasing sycophancy and reliance on extraneous information in the prompt, and improving accuracy and decreasing bias for LGBTQIA+ patients. We present our prompts and annotated responses as a benchmark for evaluation of future models. Content warning: This paper includes prompts and model-generated responses that may be offensive.

    DOI: 10.1371/journal.pdig.0001001
    PubMed ID: 40920790

  • Fidelity of Medical Reasoning in Large Language Models. JAMA network open Bedi, S., Jiang, Y., Chung, P., Koyejo, S., Shah, N. 2025; 8 (8): e2526021

    DOI: 10.1001/jamanetworkopen.2025.26021
    PubMed ID: 40779272

  • Advancing oil and gas emissions assessment through large language model data extraction ENERGY AND AI Chen, Z., Zhong, R., Long, W., Tanga, H., Wang, A., Liu, Z., Yang, X., Ren, B., Littlefield, J., Koyejo, S., Masnadi, M. S., Brandt, A. R. 2025; 20
  • The Reality of AI and Biorisk Peppin, A., Reuel, A., Casper, S., Jones, E., Strait, A., Anwar, U., Agrawal, A., Kapoor, S., Koyejo, S., Pellat, M., Bommasani, R., Frosst, N., Hooker, S., ACM ASSOC COMPUTING MACHINERY. 2025: 763-771
  • More than Marketing? On the Information Value of AI Benchmarks for Practitioners Hardy, A., Reuel, A., Meimandi, K., Soder, L., Griffith, A., Asmar, D. M., Koyejo, S., Bernstein, M. S., Kochenderfer, M., ACM ASSOC COMPUTING MACHINERY. 2025: 1032-1047
  • Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs Wang, A., Phan, M., Ho, D. E., Koyejo, S. edited by Che, W., Nabende, J., Shutova, E., Pilehvar, M. T. ASSOC COMPUTATIONAL LINGUISTICS-ACL. 2025: 6867-6893
  • Publisher Correction: Increasing the presence of BIPOC researchers in computational science. Nature computational science Chen, C. Y., Christoffels, A., Dube, R., Enos, K., Gilbert, J. E., Koyejo, S., Leigh, J., Liquido, C., McKee, A., Noe, K., Peng, T., Taiuru, K. 2024

    DOI: 10.1038/s43588-024-00710-8
    PubMed ID: 39354103

  • Increasing the presence of BIPOC researchers in computational science. Nature computational science Chen, C. Y., Christoffels, A., Dube, R., Enos, K., Gilbert, J. E., Koyejo, S., Leigh, J., Liquido, C., McKee, A., Noe, K., Peng, T. Q., Taiuru, K. 2024; 4 (9): 646-653

    DOI: 10.1038/s43588-024-00693-6
    PubMed ID: 39317763

  • Artificial Intelligence, Social Responsibility, and the Roles of the University COMMUNICATIONS OF THE ACM Bosch, N., Chan, A., Davis, J. L., Gutierrez, R., He, J., Karahalios, K., Koyejo, S., Loui, M. C., Mendenhall, R., Sanfilippo, M., Tong, H., Varshney, L. R., Wang, Y. 2024; 67 (8): 22-25

    DOI: 10.1145/3640541
    Web of Science ID: 001293981400007

  • Single-Trial Detection and Classification of Event-Related Optical Signals for a Brain-Computer Interface Application. Bioengineering (Basel, Switzerland) Chiou, N., Günal, M., Koyejo, S., Perpetuini, D., Chiarelli, A. M., Low, K. A., Fabiani, M., Gratton, G. 2024; 11 (8)

    Abstract

    Event-related optical signals (EROS) measure fast modulations in the brain's optical properties related to neuronal activity. EROS offer a high spatial and temporal resolution and can be used for brain-computer interface (BCI) applications. However, the ability to classify single-trial EROS remains unexplored. This study evaluates the performance of neural network methods for single-trial classification of motor response-related EROS. EROS activity was obtained from a high-density recording montage covering the motor cortex during a two-choice reaction time task involving responses with the left or right hand. This study utilized a convolutional neural network (CNN) approach to extract spatiotemporal features from EROS data and perform classification of left and right motor responses. Subject-specific classifiers trained on EROS phase data outperformed those trained on intensity data, reaching an average single-trial classification accuracy of around 63%. Removing low-frequency noise from intensity data is critical for achieving discriminative classification results with this measure. Our results indicate that deep learning with high-spatial-resolution signals, such as EROS, can be successfully applied to single-trial classifications.

    DOI: 10.3390/bioengineering11080781
    PubMed ID: 39199739
    PMCID: PMC11351476

  • Bridging gaps in automated acute myocardial infarction detection between high-income and low-income countries PLOS GLOBAL PUBLIC HEALTH Chiou, N., Koyejo, S., Ngaruiya, C. 2024; 4 (6): e0003240

    DOI: 10.1371/journal.pgph.0003240
    Web of Science ID: 001418792700001
    PubMed ID: 38941326

  • Author Correction: Opportunistic detection of type 2 diabetes using deep learning from frontal chest radiographs. Nature communications Pyrros, A., Borstelmann, S. M., Mantravadi, R., Zaiman, Z., Thomas, K., Price, B., Greenstein, E., Siddiqui, N., Willis, M., Shulhan, I., Hines-Shah, J., Horowitz, J. M., Nikolaidis, P., Lungren, M. P., Rodríguez-Fernández, J. M., Gichoya, J. W., Koyejo, S., Flanders, A. E., Khandwala, N., Gupta, A., Garrett, J. W., Cohen, J. P., Layden, B. T., Pickhardt, P. J., Galanter, W. 2024; 15 (1): 4817

    DOI: 10.1038/s41467-024-49184-2
    PubMed ID: 38844459
    PMCID: PMC11156917

  • Impact of biased models in the context of fairness towards patients, and how to avoid or minimise biases in our datasets Koyejo, S. ELSEVIER IRELAND LTD. 2024: S46
  • Latent Multimodal Functional Graphical Model Estimation JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Tsai, K., Zhao, B., Koyejo, S., Kolar, M. 2024; 119 (547): 2217-2229
  • The Case for Globalizing Fairness: A Mixed Methods Study on Colonialism, AI, and Health in Africa Asiedu, M., Dieng, A., Haykel, I., Rostamzadeh, N., Pfohl, S., Nagpal, C., Nagawa, M., Oppong, A., Koyejo, S., Heller, K., ACM ASSOC COMPUTING MACHINERY. 2024
  • Adaptive Compression in Federated Learning via Side Information Isik, B., Pase, F., Gunduz, D., Koyejo, S., Weissman, T., Zorzi, M. edited by Dasgupta, S., Mandt, S., Li, Y. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2024
  • Invariant Aggregator for Defending against Federated Backdoor Attacks Wang, X., Dimitriadis, D., Koyejo, S., Tople, S. edited by Dasgupta, S., Mandt, S., Li, Y. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2024
  • Proxy Methods for Domain Adaptation Tsai, K., Pfohl, S. R., Salaudeen, O., Chiou, N., Kusner, M. J., DAmour, A., Koyejo, S., Gretton, A. edited by Dasgupta, S., Mandt, S., Li, Y. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2024
  • Causally Inspired Regularization Enables Domain General Representations Salaudeen, O., Koyejo, S. edited by Dasgupta, S., Mandt, S., Li, Y. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2024
  • Towards Trustworthy Large Language Models Koyejo, S., Li, B. ASSOC COMPUTING MACHINERY. 2024: 1126-1127
  • Bayesian Optimization for Crop Genetics with Scalable Probabilistic Models Azam, R., Truong, S. T., Fernandes, S. B., Leakey, A. D. B., Lipka, A., El-Kebir, M., Koyejo, S. edited by Antoran, J., Naesseth, C. A. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2024: 30-44
  • Disentangling Fact from Grid Cell Fiction in Trained Deep Path Integrators. ArXiv Schaeffer, R., Khona, M., Koyejo, S., Fiete, I. R. 2023

    Abstract

    Work on deep learning-based models of grid cells suggests that grid cells generically and robustly arise from optimizing networks to path integrate, i.e., track one's spatial position by integrating self-velocity signals. In previous work [27], we challenged this path integration hypothesis by showing that deep neural networks trained to path integrate almost always do so, but almost never learn grid-like tuning unless separately inserted by researchers via mechanisms unrelated to path integration. In this work, we restate the key evidence substantiating these insights, then address a response to [27] by authors of one of the path integration hypothesis papers [32]. First, we show that the response misinterprets our work, indirectly confirming our points. Second, we evaluate the response's preferred "unified theory for the origin of grid cells" in trained deep path integrators [31, 33, 34] and show that it is at best "occasionally suggestive," not exact or comprehensive. We finish by considering why assessing model quality through prediction of biological neural activity by regression of activity in deep networks [23] can lead to the wrong conclusions.

    PubMed ID: 38106458

  • Longitudinal assessment of demographic representativeness in the Medical Imaging and Data Resource Center open data commons JOURNAL OF MEDICAL IMAGING Whitney, H. M., Baughan, N., Myers, K. J., Drukker, K., Gichoya, J., Bower, B., Chen, W., Gruszauskas, N., Kalpathy-Cramer, J., Koyejo, S., Sa, R. C., Sahiner, B., Zhang, Z., Giger, M. L. 2023; 10 (6): 61105

    Abstract

    The Medical Imaging and Data Resource Center (MIDRC) open data commons was launched to accelerate the development of artificial intelligence (AI) algorithms to help address the COVID-19 pandemic. The purpose of this study was to quantify longitudinal representativeness of the demographic characteristics of the primary MIDRC dataset compared to the United States general population (US Census) and COVID-19 positive case counts from the Centers for Disease Control and Prevention (CDC). The Jensen-Shannon distance (JSD), a measure of similarity of two distributions, was used to longitudinally measure the representativeness of the distribution of (1) all unique patients in the MIDRC data to the 2020 US Census and (2) all unique COVID-19 positive patients in the MIDRC data to the case counts reported by the CDC. The distributions were evaluated in the demographic categories of age at index, sex, race, ethnicity, and the combination of race and ethnicity. Representativeness of the MIDRC data by ethnicity and the combination of race and ethnicity was impacted by the percentage of CDC case counts for which this information was not reported. The distributions by sex and race have retained their level of representativeness over time. The representativeness of the open medical imaging datasets in the curated public data commons at MIDRC has evolved over time as the number of contributing institutions and the overall number of subjects have grown. The use of metrics such as the JSD to support measurement of representativeness is one step needed for fair and generalizable AI algorithm development.

    DOI: 10.1117/1.JMI.10.6.061105
    Web of Science ID: 001139907400011
    PubMed ID: 37469387
    PMCID: PMC10353566
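    As an illustration of the similarity measure used in this study, a minimal Jensen-Shannon distance computation (base-2 logarithms, so values fall in [0, 1]) can be sketched as follows; the demographic shares below are hypothetical and not taken from the MIDRC data:

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions.

    Uses base-2 logarithms, so the result lies in [0, 1]:
    0 means identical distributions, 1 means disjoint support.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

# Hypothetical demographic shares (dataset vs. census); illustrative only.
dataset_shares = [0.60, 0.25, 0.10, 0.05]
census_shares = [0.58, 0.24, 0.12, 0.06]
print(f"JSD = {jensen_shannon_distance(dataset_shares, census_shares):.4f}")
```

    A smaller distance indicates that the dataset's demographic distribution more closely tracks the reference population.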

  • Toward fairness in artificial intelligence for medical image analysis: identification and mitigation of potential biases in the roadmap from data collection to model deployment JOURNAL OF MEDICAL IMAGING Drukker, K., Chen, W., Gichoya, J., Gruszauskas, N., Kalpathy-Cramer, J., Koyejo, S., Myers, K., Sa, R. C., Sahiner, B., Whitney, H., Zhang, Z., Giger, M. 2023; 10 (6): 061104

    Abstract

    There is increasing interest in developing medical imaging-based machine learning methods, also known as medical imaging artificial intelligence (AI), for the detection, diagnosis, prognosis, and risk assessment of disease, with the goal of clinical implementation. These tools are intended to help improve traditional human decision-making in medical imaging. However, biases introduced in the steps toward clinical deployment may impede their intended function, potentially exacerbating inequities: medical imaging AI can propagate or amplify biases introduced in the many steps from model inception to deployment, resulting in a systematic difference in the treatment of different groups. Recognizing and addressing these sources of bias is essential for algorithmic fairness and trustworthiness and contributes to a just and equitable deployment of AI in medical imaging. Our multi-institutional team included medical physicists, medical imaging artificial intelligence/machine learning (AI/ML) researchers, experts in AI/ML bias, statisticians, physicians, and scientists from regulatory bodies. We identified sources of bias in AI/ML and mitigation strategies for these biases, and we developed recommendations for best practices in medical imaging AI/ML development. Five main steps along the roadmap of medical imaging AI/ML were identified: (1) data collection, (2) data preparation and annotation, (3) model development, (4) model evaluation, and (5) model deployment. Within these steps, or bias categories, we identified 29 sources of potential bias, many of which can impact multiple steps, as well as mitigation strategies. Our findings provide a valuable resource to researchers, clinicians, and the public at large.

    DOI: 10.1117/1.JMI.10.6.061104
    Web of Science ID: 001139907400013
    PubMed ID: 37125409
    PMCID: PMC10129875

  • Opportunistic detection of type 2 diabetes using deep learning from frontal chest radiographs. Nature communications Pyrros, A., Borstelmann, S. M., Mantravadi, R., Zaiman, Z., Thomas, K., Price, B., Greenstein, E., Siddiqui, N., Willis, M., Shulhan, I., Hines-Shah, J., Horowitz, J. M., Nikolaidis, P., Lungren, M. P., Rodríguez-Fernández, J. M., Gichoya, J. W., Koyejo, S., Flanders, A. E., Khandwala, N., Gupta, A., Garrett, J. W., Cohen, J. P., Layden, B. T., Pickhardt, P. J., Galanter, W. 2023; 14 (1): 4039

    Abstract

    Deep learning (DL) models can harness electronic health records (EHRs) to predict diseases and extract radiologic findings for diagnosis. With ambulatory chest radiographs (CXRs) frequently ordered, we investigated detecting type 2 diabetes (T2D) by combining radiographic and EHR data using a DL model. Our model, developed from 271,065 CXRs and 160,244 patients, was tested on a prospective dataset of 9,943 CXRs. Here we show the model effectively detected T2D with a ROC AUC of 0.84 and a 16% prevalence. The algorithm flagged 1,381 cases (14%) as suspicious for T2D. External validation at a distinct institution yielded a ROC AUC of 0.77, with 5% of patients subsequently diagnosed with T2D. Explainable AI techniques revealed correlations between specific adiposity measures and high predictivity, suggesting CXRs' potential for enhanced T2D screening.

    DOI: 10.1038/s41467-023-39631-x
    PubMed ID: 37419921
    PMCID: PMC10328953
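    The ROC AUC reported in this study is equivalent to the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative one. A minimal pairwise computation, using made-up scores rather than the study's data, can be sketched as:

```python
def roc_auc(labels, scores):
    """ROC AUC via the pairwise-comparison definition.

    Equals the probability that a random positive outscores a random
    negative, counting ties as half. O(n^2), fine for illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up model scores for four patients (1 = has T2D, 0 = does not).
labels = [0, 0, 1, 1]
scores = [0.10, 0.35, 0.30, 0.90]
print(roc_auc(labels, scores))  # 0.75: one positive is outscored by one negative
```

    An AUC of 0.5 corresponds to chance-level ranking; the 0.84 reported above means a positive case outscores a negative one 84% of the time.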

  • Fast Optical Signals for Real-Time Retinotopy and Brain Computer Interface. Bioengineering (Basel, Switzerland) Perpetuini, D., Gunal, M., Chiou, N., Koyejo, S., Mathewson, K., Low, K. A., Fabiani, M., Gratton, G., Chiarelli, A. M. 2023; 10 (5)

    Abstract

    A brain-computer interface (BCI) allows users to control external devices through brain activity. Portable neuroimaging techniques, such as near-infrared (NIR) imaging, are suitable for this goal. NIR imaging has been used to measure rapid changes in brain optical properties associated with neuronal activation, namely fast optical signals (FOS) with good spatiotemporal resolution. However, FOS have a low signal-to-noise ratio, limiting their BCI application. Here FOS were acquired with a frequency-domain optical system from the visual cortex during visual stimulation consisting of a rotating checkerboard wedge, flickering at 5 Hz. We used measures of photon count (Direct Current, DC light intensity) and time of flight (phase) at two NIR wavelengths (690 nm and 830 nm) combined with a machine learning approach for fast estimation of visual-field quadrant stimulation. The input features of a cross-validated support vector machine classifier were computed as the average modulus of the wavelet coherence between each channel and the average response among all channels in 512 ms time windows. An above chance performance was obtained when differentiating visual stimulation quadrants (left vs. right or top vs. bottom) with the best classification accuracy of ~63% (information transfer rate of ~6 bits/min) when classifying the superior and inferior stimulation quadrants using DC at 830 nm. The method is the first attempt to provide generalizable retinotopy classification relying on FOS, paving the way for the use of FOS in real-time BCI.

    DOI: 10.3390/bioengineering10050553
    PubMed ID: 37237623

  • One Policy is Enough: Parallel Exploration with a Single Policy is Near-Optimal for Reward-Free Reinforcement Learning Cisneros-Velarde, P., Lyu, B., Koyejo, S., Kolar, M. edited by Ruiz, F., Dy, J., VanDeMeent, J. W. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2023
  • Finite-sample Guarantees for Nash Q-learning with Linear Function Approximation Cisneros-Velarde, P., Koyejo, S. edited by Evans, R. J., Shpitser JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2023: 424-432
  • Unraveling the Connections between Privacy and Certified Robustness in Federated Learning Against Poisoning Attacks Xie, C., Long, Y., Chen, P., Li, Q., Koyejo, S., Li, B., ACM ASSOC COMPUTING MACHINERY. 2023: 1511-1525
  • Self-Supervised Learning of Representations for Space Generates Multi-Modular Grid Cells Schaeffer, R., Khona, M., Ma, T., Eyzaguirre, C., Koyejo, S., Fiete, I. edited by Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2023
  • DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora, S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S., Song, D., Li, B. edited by Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2023
  • Pairwise Ranking Losses of Click-Through Rates Prediction for Welfare Maximization in Ad Auctions Lyu, B., Feng, Z., Robertson, Z., Koyejo, S. edited by Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2023
  • Adapting to Latent Subgroup Shifts via Concepts and Proxies Alabdulmohsin, I., Chiou, N., D'Amour, A., Gretton, A., Koyejo, S., Kusner, M. J., Pfohl, S. R., Salaudeen, O., Schrouff, J., Tsai, K. edited by Ruiz, F., Dy, J., VanDeMeent, J. W. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2023
  • Fair Wrapping for Black-box Predictions Soen, A., Alabdulmohsin, I., Koyejo, S., Mansour, Y., Moorosi, N., Nock, R. edited by Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2022