In-Training Defenses against Emergent Misalignment in Language Models
Published on arXiv, 2025
Recommended citation: David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Lucie Flek, Florian Mai. (2025). "In-Training Defenses against Emergent Misalignment in Language Models." arXiv:2508.06249. https://arxiv.org/abs/2508.06249
We study practical in-training safeguards against emergent misalignment, evaluating regularization, safe subspaces, and data interleaving.
