---
license: llama3
datasets:
- MaLA-LM/mala-monolingual-split
- MaLA-LM/mala-code-reasoning-v2
- MaLA-LM/mala-bilingual-translation-corpus
base_model:
- meta-llama/Llama-3-8B
library_name: transformers
pipeline_tag: text-generation
---

# Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

## Model Description

**EMMA-500 Llama 3 8B** is a state-of-the-art multilingual language model designed to improve language representation, especially in low-resource languages, through continual pre-training on the **Llama 3 8B** architecture. Leveraging the **[MaLA Corpus](https://huggingface.co/collections/MaLA-LM/mala-corpus-66e05127641a51de34d39529)**, which spans over 500 languages and is augmented with books, code, instruction data, and papers, EMMA-500 excels in multilingual tasks like commonsense reasoning, machine translation, and text classification.

- Project Website: https://mala-lm.github.io/emma-500-gen2.html
- Paper: https://arxiv.org/abs/2506.00469

---

### Model Details

- **Architecture**: Built on Llama 3 8B with enhanced language adaptation through continual pre-training.
- **Languages**: Supports **546 languages** with substantial training data (over 100k tokens each).
- **Data Mix**: A diverse [bilingual mix](https://mala-lm.github.io/static/images/mix-bilingual.png) of text from domains like code, books, instruction data, and papers.
- **Total Tokens**: 671B

**EMMA-500 series**

- 🤗[MaLA-LM/emma-500-llama2-7b](https://huggingface.co/MaLA-LM/emma-500-llama2-7b): CPT model trained on monolingual data mix in 500+ languages
- 🤗[MaLA-LM/emma-500-llama3-8b-mono](https://huggingface.co/MaLA-LM/emma-500-llama3-8b-mono): CPT model trained on monolingual data mix in 500+ languages
- 🤗[MaLA-LM/emma-500-llama3-8b-bi](https://huggingface.co/MaLA-LM/emma-500-llama3-8b-bi): CPT model trained on monolingual data mix in 500+ languages + bilingual translation data in 2,500+ language pairs
- 🤗[MaLA-LM/emma-500-llama3.1-8b-mono](https://huggingface.co/MaLA-LM/emma-500-llama3.1-8b-mono): CPT model trained on monolingual data mix in 500+ languages
- 🤗[MaLA-LM/emma-500-llama3.1-8b-bi](https://huggingface.co/MaLA-LM/emma-500-llama3.1-8b-bi): CPT model trained on monolingual data mix in 500+ languages + bilingual translation data in 2,500+ language pairs

---

### Data Access

🤗[MaLA Corpus Dataset Collection](https://huggingface.co/collections/MaLA-LM/mala-corpus-66e05127641a51de34d39529)

- MaLA monolingual corpus: 🤗[MaLA-LM/mala-monolingual-split](https://huggingface.co/datasets/MaLA-LM/mala-monolingual-split)
- MaLA bilingual translation corpus: 🤗[MaLA-LM/mala-bilingual-translation-corpus](https://huggingface.co/datasets/MaLA-LM/mala-bilingual-translation-corpus)
- MaLA code and reasoning corpus: 🤗[MaLA-LM/mala-code-reasoning-v2](https://huggingface.co/datasets/MaLA-LM/mala-code-reasoning-v2)
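The corpora above are hosted on the Hugging Face Hub and can be browsed or loaded with the 🤗 `datasets` library. Below is a minimal sketch for streaming the monolingual corpus; the split name and per-language configuration are assumptions, so check the dataset pages linked above for the actual layout.

```python
# Minimal sketch: stream a MaLA corpus from the Hub with 🤗 Datasets.
# The split name ("train") and any per-language configuration are assumptions;
# see the dataset pages linked above for the actual layout.
from datasets import load_dataset

# Stream to avoid downloading the full corpus locally.
mono = load_dataset("MaLA-LM/mala-monolingual-split", split="train", streaming=True)

# Inspect a few examples.
for i, example in enumerate(mono):
    print(example)
    if i == 2:
        break
```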
---

### Usage

You can use **EMMA-500** for multilingual text generation. Below is an example of generating text with the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MaLA-LM/emma-500-llama3-8b-bi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize a prompt and generate a continuation.
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Use Cases and Limitations

- Intended for massively multilingual NLP tasks, e.g., machine translation.
- May show performance regression on some tasks and for high-resource languages.
- Not intended for real-world deployment, especially in high-stakes domains.

---

## Citation

If you find this model useful, please cite the paper below.

```
@article{ji2025emma2,
  title={Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data},
  author={Shaoxiong Ji and Zihao Li and Jaakko Paavola and Indraneil Paul and Hengyu Luo and Jörg Tiedemann},
  year={2025},
  journal={arXiv preprint arXiv:2506.00469},
  url={https://arxiv.org/abs/2506.00469},
}
```

See the [paper](https://arxiv.org/abs/2409.17892) below for the preceding EMMA-500 model trained on Llama 2 (🤗[MaLA-LM/emma-500-llama2-7b](https://huggingface.co/MaLA-LM/emma-500-llama2-7b)).

```
@article{ji2024emma500enhancingmassivelymultilingual,
  title={{EMMA}-500: Enhancing Massively Multilingual Adaptation of Large Language Models},
  author={Shaoxiong Ji and Zihao Li and Indraneil Paul and Jaakko Paavola and Peiqin Lin and Pinzhen Chen and Dayyán O'Brien and Hengyu Luo and Hinrich Schütze and Jörg Tiedemann and Barry Haddow},
  year={2024},
  journal={arXiv preprint arXiv:2409.17892},
  url={https://arxiv.org/abs/2409.17892},
}
```