Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Published in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

Recommended citation: Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, Felix Stollenwerk, David Kaczér, Florian Mai, Lucie Flek, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Patrick Schramowski, Michael Fromm, Kristian Kersting. (2025). "Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models." EMNLP 2025. https://arxiv.org/abs/2505.22232

We introduce JQL, a multilingual data curation pipeline that distills LLM annotations into lightweight annotators for scalable cross-lingual filtering.
