Steven Feng
Ph.D. Student in Computer Science, admitted Autumn 2022
Bio
I'm a Stanford Computer Science PhD student and NSERC PGS-D scholar, working with the Stanford AI Lab and Stanford NLP Group. I am co-advised by Michael C. Frank and Noah Goodman as part of the Language & Cognition (LangCog) and Computation & Cognition (CoCo) Labs. I am grateful to receive support from Amazon Science, Microsoft AFMR, and StabilityAI.
My ultimate goal is to blend knowledge from multiple disciplines to advance AI research. My current research centers on aligning foundation models' learning and capabilities with those of humans, particularly in reasoning, generalization, and efficiency. I have explored ways to improve the controllability of language and visual generation models, and to integrate structured and multimodal information to enhance their reasoning capabilities.
I'm investigating psychologically and cognitively inspired methods for continual learning, self-improvement, and advanced reasoning in foundation models. I'm also exploring methods to bridge the data efficiency gap between human and model learning while shedding further light on human cognitive models and our efficient language and vision acquisition capabilities.
Previously, I was a master's student at Carnegie Mellon University (CMU), where I worked with Eduard Hovy and Malihe Alikhani on language generation, data augmentation, and commonsense reasoning. Before that, I was an undergraduate student at the University of Waterloo, where I worked with Jesse Hoey on dialogue agents and text generation.
My research contributions have been recognized with several publications at major conferences and a best paper award at INLG 2021. I also received an Honorable Mention for the Jessie W.H. Zou Memorial Award and the CRA Outstanding Undergraduate Researcher Award.
I am a co-instructor for the Stanford CS25 Transformers course, and mentor and advise several students. I also led the organization of CtrlGen, a controllable generation workshop at NeurIPS 2021, and was involved in the GEM benchmark and workshop for NLG evaluation.
In my free time, I enjoy gaming, playing the piano and guitar, martial arts, and table tennis. I am also the founder and president of the Stanford Piano Society.
2024-25 Courses
- CS 25 (Spr): Transformers United V5
Prior Year Courses
2023-24 Courses
- CS 25 (Aut, Spr): Transformers United V4
2022-23 Courses
- CS 25 (Win): Transformers United V2
All Publications
A benchmark of expert-level academic questions to assess AI capabilities
Nature
2026; 649 (8099): 1139-1146
Abstract
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding, limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
DOI: 10.1038/s41586-025-09962-4
PubMed ID: 41606155
ORCID: https://orcid.org/0000-0003-2122-8736