Updated comparative models list

README.md CHANGED

@@ -28,7 +28,7 @@ See the Swallow Model Index section to find other model variants.

# Release History

-- **
+- **June 25, 2025**: Released [Llama-3.1-Swallow-8B-Instruct-v0.5](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5) and [Llama-3.1-Swallow-8B-v0.5](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-v0.5).
- **March 10, 2025**: Released [Llama-3.3-Swallow-70B-Instruct-v0.4](https://huggingface.co/tokyotech-llm/Llama-3.3-Swallow-70B-Instruct-v0.4) and [Llama-3.3-Swallow-70B-v0.4](https://huggingface.co/tokyotech-llm/Llama-3.3-Swallow-70B-v0.4).
- **December 30, 2024**: Released [Llama-3.1-Swallow-70B-Instruct-v0.3](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3).
- **December 23, 2024**: Released [Llama-3.1-Swallow-8B-Instruct-v0.3](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3).

@@ -62,60 +62,56 @@ The website [https://swallow-llm.github.io/](https://swallow-llm.github.io/index

|Model|coding|extraction|humanities|math|reasoning|roleplay|stem|writing|JMTAvg|
|---|---|---|---|---|---|---|---|---|---|
-
-| Qwen2-7B-Instruct
-| Qwen2.5-7B-Instruct
-| Tanuki-8B-dpo-v1.0
-| Llama 3 8B Instruct
-| Llama 3.1 8B Instruct
-| Llama 3 Youko 8B Instruct
-| Llama-3-ELYZA-JP-8B
-| Llama 3 heron brain 8B v0.3
-| Llama 3 Swallow 8B Instruct | 0.
-| Llama 3.1 Swallow 8B Instruct v0.
-| Llama 3.1 Swallow 8B Instruct v0.
-| Llama 3.1 Swallow 8B Instruct v0.
+| llm-jp-3-7.2b-instruct3 | 0.358 | 0.597 | 0.812 | 0.386 | 0.438 | 0.766 | 0.622 | 0.721 | 0.588 |
+| Qwen2-7B-Instruct | 0.512 | 0.771 | 0.719 | 0.687 | 0.514 | 0.683 | 0.563 | 0.717 | 0.646 |
+| Qwen2.5-7B-Instruct | 0.599 | 0.741 | 0.719 | 0.637 | 0.541 | 0.744 | 0.624 | 0.713 | 0.665 |
+| Tanuki-8B-dpo-v1.0 | 0.461 | 0.597 | 0.562 | 0.495 | 0.377 | 0.589 | 0.509 | 0.643 | 0.529 |
+| Llama 3 8B Instruct | 0.467 | 0.706 | 0.692 | 0.310 | 0.433 | 0.542 | 0.532 | 0.546 | 0.529 |
+| Llama 3.1 8B Instruct | 0.420 | **0.830** | 0.550 | 0.514 | 0.349 | 0.502 | 0.479 | 0.504 | 0.519 |
+| Llama 3 Youko 8B Instruct | 0.464 | 0.757 | 0.769 | 0.414 | 0.487 | 0.695 | 0.583 | 0.753 | 0.616 |
+| Llama-3-ELYZA-JP-8B | 0.389 | 0.706 | 0.647 | 0.426 | **0.613** | 0.684 | 0.533 | 0.697 | 0.587 |
+| Llama 3 heron brain 8B v0.3 | 0.362 | 0.566 | 0.602 | 0.315 | 0.426 | 0.586 | 0.567 | 0.550 | 0.497 |
+| Llama 3.1 Swallow 8B Instruct v0.1 | 0.427 | 0.738 | 0.675 | 0.527 | 0.453 | 0.615 | 0.593 | 0.624 | 0.581 |
+| Llama 3.1 Swallow 8B Instruct v0.2 | 0.534 | 0.748 | 0.705 | 0.565 | 0.475 | 0.646 | 0.579 | 0.646 | 0.612 |
+| Llama 3.1 Swallow 8B Instruct v0.3 | **0.562** | 0.756 | **0.869** | **0.610** | 0.512 | 0.783 | 0.748 | 0.803 | 0.705 |
+| Llama 3.1 Swallow 8B Instruct v0.5 | 0.551 | 0.814 | 0.847 | 0.568 | 0.577 | **0.796** | **0.770** | **0.832** | **0.719** |

### Japanese tasks

|Model|JCom.|JEMHopQA|NIILC|JSQuAD|XL-Sum|MGSM|WMT20-en-ja|WMT20-ja-en|JMMLU|JHumanEval|Ja Avg|
|---|---|---|---|---|---|---|---|---|---|---|---|
-
-
-
-
-
-
-| Llama 3 8B Instruct
-| Llama
-| Llama 3
-| Llama
-| Llama 3
-| Llama 3 Swallow 8B Instruct | 0.
-| Llama 3.1 Swallow 8B Instruct v0.
-| Llama 3.1 Swallow 8B Instruct v0.2 | **0.9294** | 0.5601 | **0.5988** | 0.9148 | 0.1372 | 0.5280 | **0.2878** | 0.2270 | 0.5504 | 0.4079 | **0.5141** |
-| Llama 3.1 Swallow 8B Instruct v0.3 | 0.9240 | 0.5174 | 0.5825 | 0.8954 | 0.1902 | 0.5480 | 0.2809 | 0.2278 | 0.5445 | 0.3945 | 0.5105 |
+| llm-jp-3-7.2b-instruct3 | 0.780 | 0.297 | 0.570 | 0.882 | 0.132 | 0.344 | 0.251 | 0.189 | 0.422 | 0.196 | 0.406 |
+| Qwen2-7B-Instruct | 0.888 | 0.390 | 0.379 | 0.897 | 0.126 | 0.576 | 0.206 | 0.190 | 0.571 | 0.555 | 0.478 |
+| Qwen2.5-7B-Instruct | 0.915 | 0.429 | 0.391 | 0.891 | 0.168 | 0.632 | 0.211 | 0.192 | 0.623 | 0.532 | 0.498 |
+| Tanuki-8B-dpo-v1.0 | 0.278 | 0.284 | 0.370 | 0.670 | 0.102 | 0.428 | 0.238 | 0.183 | 0.306 | 0.251 | 0.311 |
+| Llama 3 8B Instruct | 0.880 | 0.417 | 0.385 | 0.891 | 0.126 | 0.424 | 0.214 | 0.202 | 0.468 | 0.296 | 0.430 |
+| Llama 3.1 8B Instruct | 0.880 | 0.447 | 0.407 | 0.886 | 0.148 | 0.516 | 0.218 | 0.200 | 0.509 | 0.488 | 0.470 |
+| Llama 3 Youko 8B Instruct | 0.921 | 0.481 | 0.517 | 0.899 | 0.209 | 0.472 | 0.256 | 0.191 | 0.469 | 0.262 | 0.468 |
+| Llama-3-ELYZA-JP-8B | 0.897 | 0.498 | 0.496 | 0.906 | 0.168 | 0.436 | 0.250 | 0.185 | 0.487 | 0.388 | 0.471 |
+| Llama 3 heron brain 8B v0.3 | 0.923 | 0.493 | 0.569 | 0.906 | **0.218** | 0.456 | 0.277 | 0.217 | 0.499 | 0.318 | 0.488 |
+| Llama 3.1 Swallow 8B Instruct v0.1 | 0.924 | **0.587** | 0.574 | **0.917** | 0.138 | 0.508 | 0.282 | 0.228 | 0.530 | 0.366 | 0.505 |
+| Llama 3.1 Swallow 8B Instruct v0.2 | 0.929 | 0.560 | 0.599 | 0.915 | 0.137 | 0.528 | 0.288 | 0.227 | 0.550 | 0.408 | 0.514 |
+| Llama 3.1 Swallow 8B Instruct v0.3 | 0.924 | 0.528 | 0.583 | 0.896 | 0.191 | 0.532 | 0.281 | 0.229 | 0.544 | 0.394 | 0.510 |
+| Llama 3.1 Swallow 8B Instruct v0.5 | **0.937** | 0.511 | **0.606** | 0.900 | 0.174 | **0.604** | **0.293** | **0.230** | **0.581** | **0.496** | **0.533** |

### English tasks

|Model|OpenBookQA|TriviaQA|HellaSWAG|SQuAD2.0|XWINO|MMLU|GSM8K|MATH|BBH|HumanEval|En Avg|
|---|---|---|---|---|---|---|---|---|---|---|---|
-
-
-
-
-
-
-| Llama 3 8B Instruct | 0.
-| Llama
-| Llama 3
-| Llama
-| Llama 3
-| Llama 3 Swallow 8B Instruct | 0.
-| Llama 3.1 Swallow 8B Instruct v0.
-| Llama 3.1 Swallow 8B Instruct v0.2 | 0.3800 | 0.6252 | 0.6031 | 0.3667 | 0.8886 | 0.6346 | 0.6202 | 0.6487 | 0.4738 | 0.5823 |
-| Llama 3.1 Swallow 8B Instruct v0.3 | 0.3920 | 0.6295 | 0.5937 | 0.3638 | 0.8830 | 0.6280 | 0.6149 | 0.6282 | 0.4457 | 0.5754 |
+| llm-jp-3-7.2b-instruct3 | 0.328 | 0.479 | 0.563 | 0.501 | 0.876 | 0.462 | 0.264 | 0.028 | 0.420 | 0.219 | 0.414 |
+| Qwen2-7B-Instruct | 0.396 | 0.547 | 0.615 | 0.593 | 0.886 | 0.707 | 0.626 | 0.504 | 0.304 | 0.643 | 0.582 |
+| Qwen2.5-7B-Instruct | 0.428 | 0.519 | 0.624 | 0.569 | 0.877 | 0.742 | 0.739 | 0.688 | 0.217 | 0.636 | 0.604 |
+| Tanuki-8B-dpo-v1.0 | 0.334 | 0.283 | 0.469 | 0.501 | 0.816 | 0.377 | 0.487 | 0.178 | 0.333 | 0.288 | 0.406 |
+| Llama 3 8B Instruct | 0.388 | 0.670 | 0.583 | 0.611 | 0.892 | 0.657 | 0.745 | 0.306 | 0.646 | 0.554 | 0.605 |
+| Llama 3.1 8B Instruct | 0.366 | 0.699 | 0.592 | 0.600 | 0.904 | 0.680 | 0.743 | 0.376 | 0.690 | 0.624 | 0.627 |
+| Llama 3 Youko 8B Instruct | 0.406 | 0.613 | 0.599 | 0.559 | 0.897 | 0.596 | 0.563 | 0.152 | 0.401 | 0.287 | 0.507 |
+| Llama-3-ELYZA-JP-8B | 0.318 | 0.551 | 0.523 | 0.600 | 0.882 | 0.587 | 0.558 | 0.164 | 0.321 | 0.449 | 0.495 |
+| Llama 3 heron brain 8B v0.3 | 0.362 | 0.656 | 0.569 | 0.581 | 0.901 | 0.621 | 0.578 | 0.222 | 0.641 | 0.380 | 0.551 |
+| Llama 3.1 Swallow 8B Instruct v0.1 | 0.388 | 0.649 | 0.615 | 0.598 | 0.891 | 0.624 | 0.605 | 0.236 | 0.642 | 0.379 | 0.563 |
+| Llama 3.1 Swallow 8B Instruct v0.2 | 0.380 | 0.625 | 0.603 | 0.607 | 0.887 | 0.634 | 0.620 | 0.264 | 0.649 | 0.474 | 0.574 |
+| Llama 3.1 Swallow 8B Instruct v0.3 | 0.396 | 0.629 | 0.593 | 0.570 | 0.884 | 0.629 | 0.622 | 0.266 | 0.626 | 0.445 | 0.566 |
+| Llama 3.1 Swallow 8B Instruct v0.5 | 0.396 | 0.638 | 0.603 | 0.581 | 0.889 | 0.663 | 0.717 | 0.368 | 0.628 | 0.554 | 0.604 |
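The JMTAvg, Ja Avg, and En Avg columns in the three tables above are consistent with a plain unweighted mean over each row's task scores. A minimal Python check that re-derives the three averages for the Llama 3.1 Swallow 8B Instruct v0.5 rows, assuming that simple-mean rule (the rule is inferred from the published figures, not stated in this diff):

```python
# Re-derive the reported averages for Llama 3.1 Swallow 8B Instruct v0.5,
# assuming each Avg column is the unweighted mean of that row's task scores.
rows = {
    "JMTAvg": ([0.551, 0.814, 0.847, 0.568, 0.577, 0.796, 0.770, 0.832], 0.719),
    "Ja Avg": ([0.937, 0.511, 0.606, 0.900, 0.174, 0.604, 0.293, 0.230, 0.581, 0.496], 0.533),
    "En Avg": ([0.396, 0.638, 0.603, 0.581, 0.889, 0.663, 0.717, 0.368, 0.628, 0.554], 0.604),
}
for name, (scores, reported) in rows.items():
    mean = sum(scores) / len(scores)
    # Reported values are rounded to three decimals, so allow half a ULP of that.
    assert abs(mean - reported) < 5e-4, (name, mean)
    print(f"{name}: computed {mean:.4f}, reported {reported}")
```

All three assertions pass, which also confirms the tables' column counts (8 MT-Bench categories, 10 Japanese tasks, 10 English tasks).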
## Evaluation Benchmarks

@@ -128,7 +124,7 @@ We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifac

- Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
- Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
-- Judge: `gpt-
+- Judge: `gpt-4o-2024-08-06`
- Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
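
Concretely, the judge's absolute 1-10 scores can be mapped onto the 0-1 scale by dividing by ten, averaging within a run, and then averaging across the five runs. A short sketch under those assumptions (the normalization rule and the `five_runs` data are hypothetical illustrations, not code from this repository):

```python
# Hedged sketch of the scoring described above: assume the judge
# (gpt-4o-2024-08-06) returns an absolute 1-10 score per question;
# normalize to 0-1 by dividing by 10, then average over five runs.
def run_score(judge_scores: list[int]) -> float:
    """Mean normalized score for a single evaluation run."""
    return sum(s / 10 for s in judge_scores) / len(judge_scores)

# Hypothetical judge outputs: five runs over the same three questions.
five_runs = [[7, 8, 6], [7, 7, 6], [8, 8, 7], [6, 7, 7], [7, 8, 8]]
final = sum(run_score(r) for r in five_runs) / len(five_runs)
print(f"final score: {final:.3f}")  # 0.713 on this toy data
```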
### Japanese evaluation benchmarks