W8A8-FP Qwen/Qwen3-8B model

  • Developed by: namgyu-youn
  • License: apache-2.0
  • Quantized from: Qwen/Qwen3-8B
  • Quantization method: W8A8-FP (8-bit floating-point (FP8) weights and activations)
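
Assuming the checkpoint carries its quantization config (as FP8 exports for vLLM typically do), it can be served directly with a recent vLLM build on FP8-capable hardware; a minimal sketch (port and request parameters are illustrative):

```bash
# Start an OpenAI-compatible server; vLLM reads the FP8 scheme from the model files
vllm serve namgyu-youn/Qwen3-8B-W8A8-FP

# Query it (default port 8000)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "namgyu-youn/Qwen3-8B-W8A8-FP", "prompt": "Hello", "max_tokens": 32}'
```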

Model Performance

A. Accuracy (lm-eval)

```bash
# MMLU accuracy via lm-eval; --limit 100 caps the number of examples per subtask
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3-8B \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size 8 \
  --limit 100
```
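
For the quantized model, the same command applies with only the model path swapped:

```bash
lm_eval --model hf \
  --model_args pretrained=namgyu-youn/Qwen3-8B-W8A8-FP \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size 8 \
  --limit 100
```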

Original Model

| Groups            | Version | Filter | n-shot | Metric | Value  | Stderr   |
|-------------------|---------|--------|--------|--------|--------|----------|
| mmlu              | 2       | none   |        | acc ↑  | 0.7542 | ± 0.0055 |
| - humanities      | 2       | none   |        | acc ↑  | 0.7577 | ± 0.0112 |
| - other           | 2       | none   |        | acc ↑  | 0.7408 | ± 0.0116 |
| - social sciences | 2       | none   |        | acc ↑  | 0.8333 | ± 0.0105 |
| - stem            | 2       | none   |        | acc ↑  | 0.7111 | ± 0.0101 |

Quantized Model

| Groups            | Version | Filter | n-shot | Metric | Value  | Stderr   |
|-------------------|---------|--------|--------|--------|--------|----------|
| mmlu              | 2       | none   |        | acc ↑  | 0.7498 | ± 0.0055 |
| - humanities      | 2       | none   |        | acc ↑  | 0.7508 | ± 0.0114 |
| - other           | 2       | none   |        | acc ↑  | 0.7392 | ± 0.0116 |
| - social sciences | 2       | none   |        | acc ↑  | 0.8267 | ± 0.0107 |
| - stem            | 2       | none   |        | acc ↑  | 0.7079 | ± 0.0101 |

Summary

| Benchmark | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A8-FP |
|-----------|---------------|------------------------------|
| mmlu      | 0.7542        | 0.7498                       |

B. Throughput (vLLM)

Original Model

```bash
vllm bench throughput --model Qwen/Qwen3-8B --input-len 256 --output-len 256 --num-prompts 100
```

Quantized Model

```bash
vllm bench throughput --model namgyu-youn/Qwen3-8B-W8A8-FP --input-len 256 --output-len 256 --num-prompts 100
```
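
The tokens/s figure printed at the end of each run can be recorded in the summary below; depending on the vLLM version, an `--output-json` flag may also be available (check `vllm bench throughput --help`) to save the results to a file:

```bash
# Optionally dump results to JSON for later comparison (flag availability varies by vLLM version)
vllm bench throughput --model namgyu-youn/Qwen3-8B-W8A8-FP \
  --input-len 256 --output-len 256 --num-prompts 100 \
  --output-json w8a8-fp-throughput.json
```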

Summary

| Benchmark          | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A8-FP |
|--------------------|---------------|------------------------------|
| Throughput (tok/s) | -             | -                            |

C. Latency (vLLM)

Original Model

```bash
vllm bench latency --model Qwen/Qwen3-8B --input-len 256 --output-len 256 --batch-size 1
```

Quantized Model

```bash
vllm bench latency --model namgyu-youn/Qwen3-8B-W8A8-FP --input-len 256 --output-len 256 --batch-size 1
```

Summary

| Benchmark    | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A8-FP |
|--------------|---------------|------------------------------|
| Latency (ms) | -             | -                            |
