Updated comparative models list

README.md CHANGED

@@ -28,7 +28,7 @@ See the Swallow Model Index section to find other model variants.

# Release History

-- **
+- **June 25, 2025**: Released [Llama-3.1-Swallow-8B-Instruct-v0.5](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5) and [Llama-3.1-Swallow-8B-v0.5](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-v0.5).
- **March 10, 2025**: Released [Llama-3.3-Swallow-70B-Instruct-v0.4](https://huggingface.co/tokyotech-llm/Llama-3.3-Swallow-70B-Instruct-v0.4) and [Llama-3.3-Swallow-70B-v0.4](https://huggingface.co/tokyotech-llm/Llama-3.3-Swallow-70B-v0.4).
- **December 30, 2024**: Released [Llama-3.1-Swallow-70B-Instruct-v0.3](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3).
- **December 23, 2024**: Released [Llama-3.1-Swallow-8B-Instruct-v0.3](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3).

@@ -62,60 +62,56 @@ The website [https://swallow-llm.github.io/](https://swallow-llm.github.io/index

|Model|coding|extraction|humanities|math|reasoning|roleplay|stem|writing|JMTAvg|
|---|---|---|---|---|---|---|---|---|---|
-
-| Qwen2-7B-Instruct
-| Qwen2.5-7B-Instruct
-| Tanuki-8B-dpo-v1.0
-| Llama 3 8B Instruct
-| Llama 3.1 8B Instruct
-| Llama 3 Youko 8B Instruct
-| Llama-3-ELYZA-JP-8B
-| Llama 3 heron brain 8B v0.3
-| Llama 3 Swallow 8B Instruct | 0.
-| Llama 3.1 Swallow 8B Instruct v0.
-| Llama 3.1 Swallow 8B Instruct v0.
-| Llama 3.1 Swallow 8B Instruct v0.
+| llm-jp-3-7.2b-instruct3 | 0.358 | 0.597 | 0.812 | 0.386 | 0.438 | 0.766 | 0.622 | 0.721 | 0.588 |
+| Qwen2-7B-Instruct | 0.512 | 0.771 | 0.719 | 0.687 | 0.514 | 0.683 | 0.563 | 0.717 | 0.646 |
+| Qwen2.5-7B-Instruct | 0.599 | 0.741 | 0.719 | 0.637 | 0.541 | 0.744 | 0.624 | 0.713 | 0.665 |
+| Tanuki-8B-dpo-v1.0 | 0.461 | 0.597 | 0.562 | 0.495 | 0.377 | 0.589 | 0.509 | 0.643 | 0.529 |
+| Llama 3 8B Instruct | 0.467 | 0.706 | 0.692 | 0.310 | 0.433 | 0.542 | 0.532 | 0.546 | 0.529 |
+| Llama 3.1 8B Instruct | 0.420 | **0.830** | 0.550 | 0.514 | 0.349 | 0.502 | 0.479 | 0.504 | 0.519 |
+| Llama 3 Youko 8B Instruct | 0.464 | 0.757 | 0.769 | 0.414 | 0.487 | 0.695 | 0.583 | 0.753 | 0.616 |
+| Llama-3-ELYZA-JP-8B | 0.389 | 0.706 | 0.647 | 0.426 | **0.613** | 0.684 | 0.533 | 0.697 | 0.587 |
+| Llama 3 heron brain 8B v0.3 | 0.362 | 0.566 | 0.602 | 0.315 | 0.426 | 0.586 | 0.567 | 0.550 | 0.497 |
+| Llama 3.1 Swallow 8B Instruct v0.1 | 0.427 | 0.738 | 0.675 | 0.527 | 0.453 | 0.615 | 0.593 | 0.624 | 0.581 |
+| Llama 3.1 Swallow 8B Instruct v0.2 | 0.534 | 0.748 | 0.705 | 0.565 | 0.475 | 0.646 | 0.579 | 0.646 | 0.612 |
+| Llama 3.1 Swallow 8B Instruct v0.3 | **0.562** | 0.756 | **0.869** | **0.610** | 0.512 | 0.783 | 0.748 | 0.803 | 0.705 |
+| Llama 3.1 Swallow 8B Instruct v0.5 | 0.551 | 0.814 | 0.847 | 0.568 | 0.577 | **0.796** | **0.770** | **0.832** | **0.719** |

### Japanese tasks

|Model|JCom.|JEMHopQA|NIILC|JSQuAD|XL-Sum|MGSM|WMT20-en-ja|WMT20-ja-en|JMMLU|JHumanEval|Ja Avg|
|---|---|---|---|---|---|---|---|---|---|---|---|
-
-
-
-
-
-
-| Llama 3 8B Instruct
-| Llama
-| Llama 3
-| Llama
-| Llama 3
-| Llama 3 Swallow 8B Instruct | 0.
-| Llama 3.1 Swallow 8B Instruct v0.
-| Llama 3.1 Swallow 8B Instruct v0.2 | **0.9294** | 0.5601 | **0.5988** | 0.9148 | 0.1372 | 0.5280 | **0.2878** | 0.2270 | 0.5504 | 0.4079 | **0.5141** |
-| Llama 3.1 Swallow 8B Instruct v0.3 | 0.9240 | 0.5174 | 0.5825 | 0.8954 | 0.1902 | 0.5480 | 0.2809 | 0.2278 | 0.5445 | 0.3945 | 0.5105 |
+| llm-jp-3-7.2b-instruct3 | 0.780 | 0.297 | 0.570 | 0.882 | 0.132 | 0.344 | 0.251 | 0.189 | 0.422 | 0.196 | 0.406 |
+| Qwen2-7B-Instruct | 0.888 | 0.390 | 0.379 | 0.897 | 0.126 | 0.576 | 0.206 | 0.190 | 0.571 | 0.555 | 0.478 |
+| Qwen2.5-7B-Instruct | 0.915 | 0.429 | 0.391 | 0.891 | 0.168 | 0.632 | 0.211 | 0.192 | 0.623 | 0.532 | 0.498 |
+| Tanuki-8B-dpo-v1.0 | 0.278 | 0.284 | 0.370 | 0.670 | 0.102 | 0.428 | 0.238 | 0.183 | 0.306 | 0.251 | 0.311 |
+| Llama 3 8B Instruct | 0.880 | 0.417 | 0.385 | 0.891 | 0.126 | 0.424 | 0.214 | 0.202 | 0.468 | 0.296 | 0.430 |
+| Llama 3.1 8B Instruct | 0.880 | 0.447 | 0.407 | 0.886 | 0.148 | 0.516 | 0.218 | 0.200 | 0.509 | 0.488 | 0.470 |
+| Llama 3 Youko 8B Instruct | 0.921 | 0.481 | 0.517 | 0.899 | 0.209 | 0.472 | 0.256 | 0.191 | 0.469 | 0.262 | 0.468 |
+| Llama-3-ELYZA-JP-8B | 0.897 | 0.498 | 0.496 | 0.906 | 0.168 | 0.436 | 0.250 | 0.185 | 0.487 | 0.388 | 0.471 |
+| Llama 3 heron brain 8B v0.3 | 0.923 | 0.493 | 0.569 | 0.906 | **0.218** | 0.456 | 0.277 | 0.217 | 0.499 | 0.318 | 0.488 |
+| Llama 3.1 Swallow 8B Instruct v0.1 | 0.924 | **0.587** | 0.574 | **0.917** | 0.138 | 0.508 | 0.282 | 0.228 | 0.530 | 0.366 | 0.505 |
+| Llama 3.1 Swallow 8B Instruct v0.2 | 0.929 | 0.560 | 0.599 | 0.915 | 0.137 | 0.528 | 0.288 | 0.227 | 0.550 | 0.408 | 0.514 |
+| Llama 3.1 Swallow 8B Instruct v0.3 | 0.924 | 0.528 | 0.583 | 0.896 | 0.191 | 0.532 | 0.281 | 0.229 | 0.544 | 0.394 | 0.510 |
+| Llama 3.1 Swallow 8B Instruct v0.5 | **0.937** | 0.511 | **0.606** | 0.900 | 0.174 | **0.604** | **0.293** | **0.230** | **0.581** | **0.496** | **0.533** |

### English tasks

|Model|OpenBookQA|TriviaQA|HellaSWAG|SQuAD2.0|XWINO|MMLU|GSM8K|MATH|BBH|HumanEval|En Avg|
|---|---|---|---|---|---|---|---|---|---|---|---|
-
-
-
-
-
-
-| Llama 3 8B Instruct | 0.
-| Llama
-| Llama 3
-| Llama
-| Llama 3
-| Llama 3 Swallow 8B Instruct | 0.
-| Llama 3.1 Swallow 8B Instruct v0.
-| Llama 3.1 Swallow 8B Instruct v0.2 | 0.3800 | 0.6252 | 0.6031 | 0.3667 | 0.8886 | 0.6346 | 0.6202 | 0.6487 | 0.4738 | 0.5823 |
-| Llama 3.1 Swallow 8B Instruct v0.3 | 0.3920 | 0.6295 | 0.5937 | 0.3638 | 0.8830 | 0.6280 | 0.6149 | 0.6282 | 0.4457 | 0.5754 |
+| llm-jp-3-7.2b-instruct3 | 0.328 | 0.479 | 0.563 | 0.501 | 0.876 | 0.462 | 0.264 | 0.028 | 0.420 | 0.219 | 0.414 |
+| Qwen2-7B-Instruct | 0.396 | 0.547 | 0.615 | 0.593 | 0.886 | 0.707 | 0.626 | 0.504 | 0.304 | 0.643 | 0.582 |
+| Qwen2.5-7B-Instruct | 0.428 | 0.519 | 0.624 | 0.569 | 0.877 | 0.742 | 0.739 | 0.688 | 0.217 | 0.636 | 0.604 |
+| Tanuki-8B-dpo-v1.0 | 0.334 | 0.283 | 0.469 | 0.501 | 0.816 | 0.377 | 0.487 | 0.178 | 0.333 | 0.288 | 0.406 |
+| Llama 3 8B Instruct | 0.388 | 0.670 | 0.583 | 0.611 | 0.892 | 0.657 | 0.745 | 0.306 | 0.646 | 0.554 | 0.605 |
+| Llama 3.1 8B Instruct | 0.366 | 0.699 | 0.592 | 0.600 | 0.904 | 0.680 | 0.743 | 0.376 | 0.690 | 0.624 | 0.627 |
+| Llama 3 Youko 8B Instruct | 0.406 | 0.613 | 0.599 | 0.559 | 0.897 | 0.596 | 0.563 | 0.152 | 0.401 | 0.287 | 0.507 |
+| Llama-3-ELYZA-JP-8B | 0.318 | 0.551 | 0.523 | 0.600 | 0.882 | 0.587 | 0.558 | 0.164 | 0.321 | 0.449 | 0.495 |
+| Llama 3 heron brain 8B v0.3 | 0.362 | 0.656 | 0.569 | 0.581 | 0.901 | 0.621 | 0.578 | 0.222 | 0.641 | 0.380 | 0.551 |
+| Llama 3.1 Swallow 8B Instruct v0.1 | 0.388 | 0.649 | 0.615 | 0.598 | 0.891 | 0.624 | 0.605 | 0.236 | 0.642 | 0.379 | 0.563 |
+| Llama 3.1 Swallow 8B Instruct v0.2 | 0.380 | 0.625 | 0.603 | 0.607 | 0.887 | 0.634 | 0.620 | 0.264 | 0.649 | 0.474 | 0.574 |
+| Llama 3.1 Swallow 8B Instruct v0.3 | 0.396 | 0.629 | 0.593 | 0.570 | 0.884 | 0.629 | 0.622 | 0.266 | 0.626 | 0.445 | 0.566 |
+| Llama 3.1 Swallow 8B Instruct v0.5 | 0.396 | 0.638 | 0.603 | 0.581 | 0.889 | 0.663 | 0.717 | 0.368 | 0.628 | 0.554 | 0.604 |
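The JMTAvg, Ja Avg, and En Avg columns in the three tables above are consistent with a plain unweighted mean over each row's task scores. A minimal Python check that re-derives the three averages for the Llama 3.1 Swallow 8B Instruct v0.5 rows, assuming that simple-mean rule (the rule is inferred from the published figures, not stated in this diff):

```python
# Re-derive the reported averages for Llama 3.1 Swallow 8B Instruct v0.5,
# assuming each Avg column is the unweighted mean of that row's task scores.
rows = {
    "JMTAvg": ([0.551, 0.814, 0.847, 0.568, 0.577, 0.796, 0.770, 0.832], 0.719),
    "Ja Avg": ([0.937, 0.511, 0.606, 0.900, 0.174, 0.604, 0.293, 0.230, 0.581, 0.496], 0.533),
    "En Avg": ([0.396, 0.638, 0.603, 0.581, 0.889, 0.663, 0.717, 0.368, 0.628, 0.554], 0.604),
}
for name, (scores, reported) in rows.items():
    mean = sum(scores) / len(scores)
    # Reported values are rounded to three decimals, so allow half a ULP of that.
    assert abs(mean - reported) < 5e-4, (name, mean)
    print(f"{name}: computed {mean:.4f}, reported {reported}")
```

All three assertions pass, which also confirms the tables' column counts (8 MT-Bench categories, 10 Japanese tasks, 10 English tasks).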
## Evaluation Benchmarks

@@ -128,7 +124,7 @@ We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifac

- Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
- Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
-- Judge: `gpt-
+- Judge: `gpt-4o-2024-08-06`
- Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
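
Concretely, the judge's absolute 1-10 scores can be mapped onto the 0-1 scale by dividing by ten, averaging within a run, and then averaging across the five runs. A short sketch under those assumptions (the normalization rule and the `five_runs` data are hypothetical illustrations, not code from this repository):

```python
# Hedged sketch of the scoring described above: assume the judge
# (gpt-4o-2024-08-06) returns an absolute 1-10 score per question;
# normalize to 0-1 by dividing by 10, then average over five runs.
def run_score(judge_scores: list[int]) -> float:
    """Mean normalized score for a single evaluation run."""
    return sum(s / 10 for s in judge_scores) / len(judge_scores)

# Hypothetical judge outputs: five runs over the same three questions.
five_runs = [[7, 8, 6], [7, 7, 6], [8, 8, 7], [6, 7, 7], [7, 8, 8]]
final = sum(run_score(r) for r in five_runs) / len(five_runs)
print(f"final score: {final:.3f}")  # 0.713 on this toy data
```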
### Japanese evaluation benchmarks