GLM-4.7-NVFP4

Format: NVFP4 — optimal partial quantization of weights & activations to NVFP4.
Base model: zai-org/GLM-4.7
How it was made: AutoQuantized with NVIDIA Model-Optimizer (NVFP4), using the default calibration mix. (cnn_dailymail and nemotron-post-training-dataset-v2)

Check the original model card for information about this model.

MMLU Benchmark Results: Salyut1/GLM-4.7-NVFP4

Summary Table

Groups	Version	Metric	Value	Stderr
MMLU (Total)	2	acc ↑	0.8348	± 0.0030
Social Sciences	2	acc ↑	0.9051	± 0.0052
Other	2	acc ↑	0.8684	± 0.0058
STEM	2	acc ↑	0.8351	± 0.0064
Humanities	2	acc ↑	0.7664	± 0.0059

STEM

Tasks	Metric	Value	Stderr
High School Biology	acc ↑	0.9516	± 0.0122
College Biology	acc ↑	0.9514	± 0.0180
Astronomy	acc ↑	0.9474	± 0.0182
High School Computer Science	acc ↑	0.9300	± 0.0256
Conceptual Physics	acc ↑	0.9064	± 0.0190
Elementary Mathematics	acc ↑	0.8862	± 0.0164
Electrical Engineering	acc ↑	0.8690	± 0.0281
High School Statistics	acc ↑	0.8565	± 0.0239
College Computer Science	acc ↑	0.8400	± 0.0368
Anatomy	acc ↑	0.8296	± 0.0325
High School Physics	acc ↑	0.7947	± 0.0330
High School Chemistry	acc ↑	0.7882	± 0.0287
Machine Learning	acc ↑	0.7679	± 0.0401
College Physics	acc ↑	0.7647	± 0.0422
Abstract Algebra	acc ↑	0.6800	± 0.0469
College Chemistry	acc ↑	0.6800	± 0.0469
College Mathematics	acc ↑	0.6800	± 0.0469
High School Mathematics	acc ↑	0.6481	± 0.0291

Social Sciences

Tasks	Metric	Value	Stderr
High School Government/Politics	acc ↑	0.9793	± 0.0103
High School Microeconomics	acc ↑	0.9706	± 0.0110
High School Psychology	acc ↑	0.9523	± 0.0091
Human Sexuality	acc ↑	0.9313	± 0.0222
Sociology	acc ↑	0.9204	± 0.0191
High School Geography	acc ↑	0.9192	± 0.0194
High School Macroeconomics	acc ↑	0.9000	± 0.0152
US Foreign Policy	acc ↑	0.9000	± 0.0302
Professional Psychology	acc ↑	0.8725	± 0.0135
Security Studies	acc ↑	0.8653	± 0.0219
Public Relations	acc ↑	0.7636	± 0.0407
Econometrics	acc ↑	0.7544	± 0.0405

Humanities

Tasks	Metric	Value	Stderr
High School US History	acc ↑	0.9461	± 0.0159
High School World History	acc ↑	0.9367	± 0.0158
World Religions	acc ↑	0.9064	± 0.0223
Prehistory	acc ↑	0.8981	± 0.0168
International Law	acc ↑	0.8926	± 0.0283
Jurisprudence	acc ↑	0.8889	± 0.0304
Logical Fallacies	acc ↑	0.8834	± 0.0252
High School European History	acc ↑	0.8788	± 0.0255
Moral Disputes	acc ↑	0.8699	± 0.0181
Philosophy	acc ↑	0.8617	± 0.0196
Formal Logic	acc ↑	0.7460	± 0.0389
Professional Law	acc ↑	0.6610	± 0.0121
Moral Scenarios	acc ↑	0.6425	± 0.0160

Other

Tasks	Metric	Value	Stderr
Medical Genetics	acc ↑	0.9800	± 0.0141
Marketing	acc ↑	0.9530	± 0.0139
Miscellaneous	acc ↑	0.9374	± 0.0087
Professional Medicine	acc ↑	0.9301	± 0.0155
Clinical Knowledge	acc ↑	0.9057	± 0.0180
Nutrition	acc ↑	0.9052	± 0.0168
Management	acc ↑	0.8932	± 0.0306
Business Ethics	acc ↑	0.8600	± 0.0349
Computer Security	acc ↑	0.8600	± 0.0349
Human Aging	acc ↑	0.8161	± 0.0260
College Medicine	acc ↑	0.7977	± 0.0306
Professional Accounting	acc ↑	0.7624	± 0.0254
Global Facts	acc ↑	0.6500	± 0.0479
Virology	acc ↑	0.5723	± 0.0385

sglang Inference Note:

vim /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py

change the code in 1637 line like this:

# Validate weight scales
assert_dim = 2 if layer.moe_runner_config.is_gated else 1
for name, weight_scale in [
    ("w13", layer.w13_weight_scale),
    ("w2", layer.w2_weight_scale),
]:
    pass
    #assert (
    #    weight_scale.shape[assert_dim] % 16 == 0
    #), f"Expected {name}_weight_scale.dim({assert_dim}) to be divisible by 16"
    #assert (
    #    weight_scale.dtype == torch.float8_e4m3fn
    #), f"{name} Weight Blockscale must be represented as FP8-E4M3"

deploy command GLM-4.7-NVFP4 on sglang:

python3 -m sglang.launch_server --model-path  GLM-4.7-NVFP4/   --quantization modelopt_fp4  --tp 8 --attention-backend flashinfer

perf

We performed deployment on 8x 5090s, and the stress test performance data is provided below.

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   25.7834 |
+-----------------------------------+-----------+
| Number of concurrency             |    1      |
+-----------------------------------+-----------+
| Total requests                    |    1      |
+-----------------------------------+-----------+
| Succeed requests                  |    1      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |   39.7154 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |   59.5731 |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.0388 |
+-----------------------------------+-----------+
| Average latency (s)               |   25.7834 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    0.7891 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0244 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0244 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request | 1024      |
+-----------------------------------+-----------+
2025-12-26 07:12:02 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.7891  |  0.024  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     25%     |  0.7891  | 0.0241  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     50%     |  0.7891  | 0.0243  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     66%     |  0.7891  | 0.0244  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     75%     |  0.7891  | 0.0244  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     80%     |  0.7891  | 0.0246  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     90%     |  0.7891  |  0.025  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     95%     |  0.7891  | 0.0257  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     98%     |  0.7891  | 0.0267  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     99%     |  0.7891  | 0.0273  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |



Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   36.4068 |
+-----------------------------------+-----------+
| Number of concurrency             |    8      |
+-----------------------------------+-----------+
| Total requests                    |    8      |
+-----------------------------------+-----------+
| Succeed requests                  |    8      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  225.013  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |  337.519  |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.2197 |
+-----------------------------------+-----------+
| Average latency (s)               |   36.3904 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    2.4183 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0332 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0332 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request | 1024      |
+-----------------------------------+-----------+
2025-12-26 07:14:21 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  1.4982  | 0.0301  |  0.0326  |   36.2968   |     512      |     1024      |    28.1277     |    42.1915    |
|     25%     |  2.1396  | 0.0322  |  0.0326  |   36.403    |     512      |     1024      |    28.1287     |    42.1931    |
|     50%     |  2.141   | 0.0327  |  0.0335  |   36.4039   |     512      |     1024      |    28.1291     |    42.1936    |
|     66%     |  3.0959  | 0.0329  |  0.0335  |   36.4041   |     512      |     1024      |    28.1295     |    42.1943    |
|     75%     |  3.0961  |  0.033  |  0.0335  |   36.4045   |     512      |     1024      |    28.1305     |    42.1958    |
|     80%     |  3.0961  | 0.0331  |  0.0335  |   36.4045   |     512      |     1024      |    28.1305     |    42.1958    |
|     90%     |  3.0962  | 0.0336  |  0.0341  |   36.4054   |     512      |     1024      |    28.2119     |    42.3178    |
|     95%     |  3.0962  | 0.0342  |  0.0341  |   36.4054   |     512      |     1024      |    28.2119     |    42.3178    |
|     98%     |  3.0962  | 0.0355  |  0.0341  |   36.4054   |     512      |     1024      |    28.2119     |    42.3178    |
|     99%     |  3.0962  | 0.0363  |  0.0341  |   36.4054   |     512      |     1024      |    28.2119     |    42.3178    |



Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   43.0028 |
+-----------------------------------+-----------+
| Number of concurrency             |   16      |
+-----------------------------------+-----------+
| Total requests                    |   16      |
+-----------------------------------+-----------+
| Succeed requests                  |   16      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  380.998  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |  571.498  |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.3721 |
+-----------------------------------+-----------+
| Average latency (s)               |   42.8878 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    2.933  |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0391 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.039  |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request | 1024      |
+-----------------------------------+-----------+
2025-12-26 07:17:55 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  2.1016  | 0.0354  |  0.0384  |   42.8554   |     512      |     1024      |    23.8153     |    35.7229    |
|     25%     |  2.104   | 0.0358  |  0.0384  |   42.8585   |     512      |     1024      |    23.8899     |    35.8348    |
|     50%     |  3.0384  | 0.0371  |  0.0389  |   42.8599   |     512      |     1024      |     23.892     |    35.8381    |
|     66%     |  3.5629  |  0.04   |  0.0391  |   42.8631   |     512      |     1024      |    23.8925     |    35.8387    |
|     75%     |  3.5643  | 0.0407  |  0.0398  |   42.9931   |     512      |     1024      |    23.8927     |    35.8391    |
|     80%     |  3.5643  |  0.041  |  0.0398  |   42.9931   |     512      |     1024      |    23.8927     |    35.8391    |
|     90%     |   3.65   | 0.0417  |  0.0398  |   42.9976   |     512      |     1024      |    23.8943     |    35.8415    |
|     95%     |  3.6512  | 0.0425  |  0.0408  |   42.9981   |     512      |     1024      |    23.9443     |    35.9165    |
|     98%     |  3.6512  | 0.0435  |  0.0408  |   42.9981   |     512      |     1024      |    23.9443     |    35.9165    |
|     99%     |  3.6512  | 0.0449  |  0.0408  |   42.9981   |     512      |     1024      |    23.9443     |    35.9165    |


Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   51.9487 |
+-----------------------------------+-----------+
| Number of concurrency             |   32      |
+-----------------------------------+-----------+
| Total requests                    |   32      |
+-----------------------------------+-----------+
| Succeed requests                  |   32      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  630.776  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |  946.164  |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.616  |
+-----------------------------------+-----------+
| Average latency (s)               |   51.9342 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    3.4479 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0474 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0474 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request | 1024      |
+-----------------------------------+-----------+
2025-12-26 07:20:36 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  1.5371  | 0.0434  |  0.0458  |   51.9326   |     512      |     1024      |     19.714     |    29.5709    |
|     25%     |  2.1556  | 0.0445  |  0.0463  |   51.9352   |     512      |     1024      |    19.7147     |    29.5721    |
|     50%     |  3.386   | 0.0456  |  0.0475  |   51.9383   |     512      |     1024      |    19.7158     |    29.5737    |
|     66%     |  4.6167  | 0.0464  |  0.0481  |   51.9401   |     512      |     1024      |    19.7168     |    29.5752    |
|     75%     |  4.618   | 0.0469  |  0.0487  |   51.9412   |     512      |     1024      |    19.7172     |    29.5757    |
|     80%     |  5.0425  | 0.0472  |  0.0487  |   51.9414   |     512      |     1024      |    19.7172     |    29.5758    |
|     90%     |  5.0448  | 0.0482  |  0.0493  |   51.9429   |     512      |     1024      |    19.7179     |    29.5768    |
|     95%     |  5.125   | 0.0491  |  0.0493  |   51.9448   |     512      |     1024      |    19.7193     |    29.5789    |
|     98%     |  5.1261  | 0.0503  |  0.0498  |   51.9463   |     512      |     1024      |    19.7633     |    29.645     |
|     99%     |  5.1261  | 0.0511  |  0.0498  |   51.9463   |     512      |     1024      |    19.7633     |    29.645     |

Downloads last month: 6,194

Model tree for Tengyunw/GLM-4.7-NVFP4

Base model

zai-org/GLM-4.7

Quantized

(41)

this model

Tengyunw
/

GLM-4.7-NVFP4