# W8A8-FP Qwen/Qwen3-8B model

- Developed by: namgyu-youn
- License: apache-2.0
- Quantized from Model: Qwen/Qwen3-8B
- Quantization Method: W8A8-FP
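As a rough illustration of why W8A8-FP matters, FP8 weights take one byte each versus two for BF16, roughly halving the weight footprint. A back-of-the-envelope sketch (the parameter count is an assumed approximation, not an official figure, and activation/KV-cache memory is ignored):

```python
# Rough weight-memory estimate for W8A8-FP (FP8 weights) vs. BF16.
# PARAMS is an assumed approximate parameter count for Qwen3-8B.
PARAMS = 8.2e9
BYTES_BF16 = 2   # bytes per weight in BF16
BYTES_FP8 = 1    # bytes per weight in FP8

bf16_gib = PARAMS * BYTES_BF16 / 1024**3
fp8_gib = PARAMS * BYTES_FP8 / 1024**3

print(f"BF16 weights: ~{bf16_gib:.1f} GiB, FP8 weights: ~{fp8_gib:.1f} GiB")
```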
## Model Performance

### A. Accuracy (lm-eval)

```shell
# MMLU accuracy command (set `pretrained=` to the model under test)
lm_eval --model hf --model_args pretrained=Qwen/Qwen3-8B --tasks mmlu --device cuda:0 --batch_size 8 --limit 100
```
#### Original Model

| Groups | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.7542 | ± | 0.0055 |
| - humanities | 2 | none |  | acc | ↑ | 0.7577 | ± | 0.0112 |
| - other | 2 | none |  | acc | ↑ | 0.7408 | ± | 0.0116 |
| - social sciences | 2 | none |  | acc | ↑ | 0.8333 | ± | 0.0105 |
| - stem | 2 | none |  | acc | ↑ | 0.7111 | ± | 0.0101 |
#### Quantized Model

| Groups | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.7498 | ± | 0.0055 |
| - humanities | 2 | none |  | acc | ↑ | 0.7508 | ± | 0.0114 |
| - other | 2 | none |  | acc | ↑ | 0.7392 | ± | 0.0116 |
| - social sciences | 2 | none |  | acc | ↑ | 0.8267 | ± | 0.0107 |
| - stem | 2 | none |  | acc | ↑ | 0.7079 | ± | 0.0101 |
#### Summary

| Benchmark | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A8-FP |
|---|---|---|
| mmlu | 0.7542 | 0.7498 |
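To put the quantization loss in perspective: the overall MMLU drop is smaller than the reported standard error of the original score, a quick check one can reproduce from the numbers above:

```python
# Overall MMLU accuracy of original vs. quantized model,
# values copied from the tables above.
orig_acc = 0.7542
quant_acc = 0.7498
stderr = 0.0055  # reported stderr of the overall mmlu score

drop = orig_acc - quant_acc
print(f"Absolute drop: {drop:.4f} ({drop / orig_acc:.2%} relative)")
print(f"Within one stderr: {drop <= stderr}")
```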
### B. Throughput (vLLM)

#### Original Model

```shell
vllm bench throughput --model Qwen/Qwen3-8B --input-len 256 --output-len 256 --num-prompts 100
```

#### Quantized Model

```shell
vllm bench throughput --model namgyu-youn/Qwen3-8B-W8A8-FP --input-len 256 --output-len 256 --num-prompts 100
```
#### Summary

| Benchmark | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A8-FP |
|---|---|---|
| Throughput (tok/s) | - | - |
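The throughput cells are left blank pending a benchmark run. Once the run completes, overall tokens/s follows directly from the benchmark configuration (100 prompts, 256 input + 256 output tokens each) and the elapsed wall-clock time. A minimal sketch with a hypothetical elapsed time:

```python
def tokens_per_second(num_prompts: int, input_len: int, output_len: int,
                      elapsed_s: float) -> float:
    """Total (input + output) tokens processed per second of wall-clock time."""
    return num_prompts * (input_len + output_len) / elapsed_s

# Hypothetical example: the 100-prompt config above finishing in 20 s.
print(tokens_per_second(100, 256, 256, 20.0))  # 2560.0
```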
### C. Latency (vLLM)

#### Original Model

```shell
vllm bench latency --model Qwen/Qwen3-8B --input-len 256 --output-len 256 --batch-size 1
```

#### Quantized Model

```shell
vllm bench latency --model namgyu-youn/Qwen3-8B-W8A8-FP --input-len 256 --output-len 256 --batch-size 1
```
#### Summary

| Benchmark | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A8-FP |
|---|---|---|
| Latency (ms) | - | - |
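The latency cells likewise await a benchmark run. With `--batch-size 1`, dividing end-to-end latency by the number of generated tokens gives a rough per-token figure (prefill time is ignored, so this slightly overestimates the inter-token latency). A minimal sketch with a hypothetical measurement:

```python
def per_token_latency_ms(e2e_latency_s: float, output_len: int) -> float:
    """Approximate per-output-token latency in ms (prefill time ignored)."""
    return e2e_latency_s * 1000.0 / output_len

# Hypothetical example: 256 output tokens generated in 5.12 s end to end.
print(per_token_latency_ms(5.12, 256))  # 20.0
```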
## Resources