CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

Published in the Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Access paper here

Recommended citation: Thuat Nguyen, Chien Nguyen, Viet Lai, Hieu Man, Nghia Ngo, Franck Dernoncourt, Ryan Rossi, Thien Nguyen, "CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages." Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024.