In-Training Defenses against Emergent Misalignment in Language Models

Published on arXiv, 2025

Recommended citation: David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Lucie Flek, Florian Mai. (2025). "In-Training Defenses against Emergent Misalignment in Language Models." arXiv:2508.06249. https://arxiv.org/abs/2508.06249

We study practical in-training safeguards against emergent misalignment, evaluating regularization, safe subspaces, and data interleaving.
