Megha Srivastava
Ph.D. Student in Computer Science, admitted Autumn 2020
All Publications
-
Training large language models on narrow tasks can lead to broad misalignment.
Nature
2026; 649 (8097): 584-589
Abstract
The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment [1]. Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information [2,3]. Here we analyse an unexpected phenomenon we observed in our previous work: fine-tuning an LLM on the narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding [4]. For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across multiple state-of-the-art LLMs, including OpenAI's GPT-4o and Alibaba Cloud's Qwen2.5-Coder-32B-Instruct, with misaligned responses observed in as many as 50% of cases. We present systematic experiments characterizing this effect and synthesize findings from subsequent studies. These results highlight the risk that narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs. Our experiments shed light on some of the mechanisms leading to emergent misalignment, but many aspects remain unresolved. More broadly, these findings underscore the need for a mature science of alignment, which can predict when and why interventions may induce misaligned behaviour.
DOI: 10.1038/s41586-025-09937-5
PubMed ID: 41535488
PubMed Central ID: PMC12804084
-
Shared Autonomy for Proximal Teaching
IEEE. 2025: 232-241
DOI: 10.1109/HRI61500.2025.10973807
Web of Science ID: 001492540600025
-
Do Users Write More Insecure Code with AI Assistants?
Association for Computing Machinery (ACM). 2023: 2785-2799
DOI: 10.1145/3576915.3623157
Web of Science ID: 001124987202053
-
Question Generation for Adaptive Education
Association for Computational Linguistics (ACL). 2021: 692-701
Web of Science ID: 000694699200088
-
Mathematical Notions vs. Human Perception of Fairness: A Descriptive Approach to Fairness for Machine Learning
Association for Computing Machinery (ACM). 2019: 2459-2468
DOI: 10.1145/3292500.3330664
Web of Science ID: 000485562502053