GLM-4.7-NVFP4

Format: NVFP4 — optimal partial quantization of weights & activations to NVFP4.
Base model: zai-org/GLM-4.7
How it was made: AutoQuantized with NVIDIA Model-Optimizer (NVFP4), using the default calibration mix. (cnn_dailymail and nemotron-post-training-dataset-v2)

Check the original model card for information about this model.


MMLU Benchmark Results: Salyut1/GLM-4.7-NVFP4

Summary Table

Groups Version Metric Value Stderr
MMLU (Total) 2 acc ↑ 0.8348 ± 0.0030
Social Sciences 2 acc ↑ 0.9051 ± 0.0052
Other 2 acc ↑ 0.8684 ± 0.0058
STEM 2 acc ↑ 0.8351 ± 0.0064
Humanities 2 acc ↑ 0.7664 ± 0.0059

STEM

Tasks n-shot Metric Value Stderr
High School Biology 0 acc ↑ 0.9516 ± 0.0122
College Biology 0 acc ↑ 0.9514 ± 0.0180
Astronomy 0 acc ↑ 0.9474 ± 0.0182
High School Computer Science 0 acc ↑ 0.9300 ± 0.0256
Conceptual Physics 0 acc ↑ 0.9064 ± 0.0190
Elementary Mathematics 0 acc ↑ 0.8862 ± 0.0164
Electrical Engineering 0 acc ↑ 0.8690 ± 0.0281
High School Statistics 0 acc ↑ 0.8565 ± 0.0239
College Computer Science 0 acc ↑ 0.8400 ± 0.0368
Anatomy 0 acc ↑ 0.8296 ± 0.0325
High School Physics 0 acc ↑ 0.7947 ± 0.0330
High School Chemistry 0 acc ↑ 0.7882 ± 0.0287
Machine Learning 0 acc ↑ 0.7679 ± 0.0401
College Physics 0 acc ↑ 0.7647 ± 0.0422
Abstract Algebra 0 acc ↑ 0.6800 ± 0.0469
College Chemistry 0 acc ↑ 0.6800 ± 0.0469
College Mathematics 0 acc ↑ 0.6800 ± 0.0469
High School Mathematics 0 acc ↑ 0.6481 ± 0.0291

Social Sciences

Tasks n-shot Metric Value Stderr
High School Government/Politics 0 acc ↑ 0.9793 ± 0.0103
High School Microeconomics 0 acc ↑ 0.9706 ± 0.0110
High School Psychology 0 acc ↑ 0.9523 ± 0.0091
Human Sexuality 0 acc ↑ 0.9313 ± 0.0222
Sociology 0 acc ↑ 0.9204 ± 0.0191
High School Geography 0 acc ↑ 0.9192 ± 0.0194
High School Macroeconomics 0 acc ↑ 0.9000 ± 0.0152
US Foreign Policy 0 acc ↑ 0.9000 ± 0.0302
Professional Psychology 0 acc ↑ 0.8725 ± 0.0135
Security Studies 0 acc ↑ 0.8653 ± 0.0219
Public Relations 0 acc ↑ 0.7636 ± 0.0407
Econometrics 0 acc ↑ 0.7544 ± 0.0405

Humanities

Tasks n-shot Metric Value Stderr
High School US History 0 acc ↑ 0.9461 ± 0.0159
High School World History 0 acc ↑ 0.9367 ± 0.0158
World Religions 0 acc ↑ 0.9064 ± 0.0223
Prehistory 0 acc ↑ 0.8981 ± 0.0168
International Law 0 acc ↑ 0.8926 ± 0.0283
Jurisprudence 0 acc ↑ 0.8889 ± 0.0304
Logical Fallacies 0 acc ↑ 0.8834 ± 0.0252
High School European History 0 acc ↑ 0.8788 ± 0.0255
Moral Disputes 0 acc ↑ 0.8699 ± 0.0181
Philosophy 0 acc ↑ 0.8617 ± 0.0196
Formal Logic 0 acc ↑ 0.7460 ± 0.0389
Professional Law 0 acc ↑ 0.6610 ± 0.0121
Moral Scenarios 0 acc ↑ 0.6425 ± 0.0160

Other

Tasks n-shot Metric Value Stderr
Medical Genetics 0 acc ↑ 0.9800 ± 0.0141
Marketing 0 acc ↑ 0.9530 ± 0.0139
Miscellaneous 0 acc ↑ 0.9374 ± 0.0087
Professional Medicine 0 acc ↑ 0.9301 ± 0.0155
Clinical Knowledge 0 acc ↑ 0.9057 ± 0.0180
Nutrition 0 acc ↑ 0.9052 ± 0.0168
Management 0 acc ↑ 0.8932 ± 0.0306
Business Ethics 0 acc ↑ 0.8600 ± 0.0349
Computer Security 0 acc ↑ 0.8600 ± 0.0349
Human Aging 0 acc ↑ 0.8161 ± 0.0260
College Medicine 0 acc ↑ 0.7977 ± 0.0306
Professional Accounting 0 acc ↑ 0.7624 ± 0.0254
Global Facts 0 acc ↑ 0.6500 ± 0.0479
Virology 0 acc ↑ 0.5723 ± 0.0385

sglang Inference Note:

vim /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py

change the code in 1637 line like this:

# Validate weight scales
assert_dim = 2 if layer.moe_runner_config.is_gated else 1
for name, weight_scale in [
    ("w13", layer.w13_weight_scale),
    ("w2", layer.w2_weight_scale),
]:
    pass
    #assert (
    #    weight_scale.shape[assert_dim] % 16 == 0
    #), f"Expected {name}_weight_scale.dim({assert_dim}) to be divisible by 16"
    #assert (
    #    weight_scale.dtype == torch.float8_e4m3fn
    #), f"{name} Weight Blockscale must be represented as FP8-E4M3"

deploy command GLM-4.7-NVFP4 on sglang:

python3 -m sglang.launch_server --model-path  GLM-4.7-NVFP4/   --quantization modelopt_fp4  --tp 8 --attention-backend flashinfer

perf

We performed deployment on 8x 5090s, and the stress test performance data is provided below.

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   25.7834 |
+-----------------------------------+-----------+
| Number of concurrency             |    1      |
+-----------------------------------+-----------+
| Total requests                    |    1      |
+-----------------------------------+-----------+
| Succeed requests                  |    1      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |   39.7154 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |   59.5731 |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.0388 |
+-----------------------------------+-----------+
| Average latency (s)               |   25.7834 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    0.7891 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0244 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0244 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request | 1024      |
+-----------------------------------+-----------+
2025-12-26 07:12:02 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.7891  |  0.024  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     25%     |  0.7891  | 0.0241  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     50%     |  0.7891  | 0.0243  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     66%     |  0.7891  | 0.0244  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     75%     |  0.7891  | 0.0244  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     80%     |  0.7891  | 0.0246  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     90%     |  0.7891  |  0.025  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     95%     |  0.7891  | 0.0257  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     98%     |  0.7891  | 0.0267  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |
|     99%     |  0.7891  | 0.0273  |  0.0244  |   25.7834   |     512      |     1024      |    39.7154     |    59.5731    |



Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   36.4068 |
+-----------------------------------+-----------+
| Number of concurrency             |    8      |
+-----------------------------------+-----------+
| Total requests                    |    8      |
+-----------------------------------+-----------+
| Succeed requests                  |    8      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  225.013  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |  337.519  |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.2197 |
+-----------------------------------+-----------+
| Average latency (s)               |   36.3904 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    2.4183 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0332 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0332 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request | 1024      |
+-----------------------------------+-----------+
2025-12-26 07:14:21 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  1.4982  | 0.0301  |  0.0326  |   36.2968   |     512      |     1024      |    28.1277     |    42.1915    |
|     25%     |  2.1396  | 0.0322  |  0.0326  |   36.403    |     512      |     1024      |    28.1287     |    42.1931    |
|     50%     |  2.141   | 0.0327  |  0.0335  |   36.4039   |     512      |     1024      |    28.1291     |    42.1936    |
|     66%     |  3.0959  | 0.0329  |  0.0335  |   36.4041   |     512      |     1024      |    28.1295     |    42.1943    |
|     75%     |  3.0961  |  0.033  |  0.0335  |   36.4045   |     512      |     1024      |    28.1305     |    42.1958    |
|     80%     |  3.0961  | 0.0331  |  0.0335  |   36.4045   |     512      |     1024      |    28.1305     |    42.1958    |
|     90%     |  3.0962  | 0.0336  |  0.0341  |   36.4054   |     512      |     1024      |    28.2119     |    42.3178    |
|     95%     |  3.0962  | 0.0342  |  0.0341  |   36.4054   |     512      |     1024      |    28.2119     |    42.3178    |
|     98%     |  3.0962  | 0.0355  |  0.0341  |   36.4054   |     512      |     1024      |    28.2119     |    42.3178    |
|     99%     |  3.0962  | 0.0363  |  0.0341  |   36.4054   |     512      |     1024      |    28.2119     |    42.3178    |



Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   43.0028 |
+-----------------------------------+-----------+
| Number of concurrency             |   16      |
+-----------------------------------+-----------+
| Total requests                    |   16      |
+-----------------------------------+-----------+
| Succeed requests                  |   16      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  380.998  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |  571.498  |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.3721 |
+-----------------------------------+-----------+
| Average latency (s)               |   42.8878 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    2.933  |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0391 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.039  |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request | 1024      |
+-----------------------------------+-----------+
2025-12-26 07:17:55 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  2.1016  | 0.0354  |  0.0384  |   42.8554   |     512      |     1024      |    23.8153     |    35.7229    |
|     25%     |  2.104   | 0.0358  |  0.0384  |   42.8585   |     512      |     1024      |    23.8899     |    35.8348    |
|     50%     |  3.0384  | 0.0371  |  0.0389  |   42.8599   |     512      |     1024      |     23.892     |    35.8381    |
|     66%     |  3.5629  |  0.04   |  0.0391  |   42.8631   |     512      |     1024      |    23.8925     |    35.8387    |
|     75%     |  3.5643  | 0.0407  |  0.0398  |   42.9931   |     512      |     1024      |    23.8927     |    35.8391    |
|     80%     |  3.5643  |  0.041  |  0.0398  |   42.9931   |     512      |     1024      |    23.8927     |    35.8391    |
|     90%     |   3.65   | 0.0417  |  0.0398  |   42.9976   |     512      |     1024      |    23.8943     |    35.8415    |
|     95%     |  3.6512  | 0.0425  |  0.0408  |   42.9981   |     512      |     1024      |    23.9443     |    35.9165    |
|     98%     |  3.6512  | 0.0435  |  0.0408  |   42.9981   |     512      |     1024      |    23.9443     |    35.9165    |
|     99%     |  3.6512  | 0.0449  |  0.0408  |   42.9981   |     512      |     1024      |    23.9443     |    35.9165    |


Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   51.9487 |
+-----------------------------------+-----------+
| Number of concurrency             |   32      |
+-----------------------------------+-----------+
| Total requests                    |   32      |
+-----------------------------------+-----------+
| Succeed requests                  |   32      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  630.776  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |  946.164  |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.616  |
+-----------------------------------+-----------+
| Average latency (s)               |   51.9342 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    3.4479 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0474 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0474 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request | 1024      |
+-----------------------------------+-----------+
2025-12-26 07:20:36 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  1.5371  | 0.0434  |  0.0458  |   51.9326   |     512      |     1024      |     19.714     |    29.5709    |
|     25%     |  2.1556  | 0.0445  |  0.0463  |   51.9352   |     512      |     1024      |    19.7147     |    29.5721    |
|     50%     |  3.386   | 0.0456  |  0.0475  |   51.9383   |     512      |     1024      |    19.7158     |    29.5737    |
|     66%     |  4.6167  | 0.0464  |  0.0481  |   51.9401   |     512      |     1024      |    19.7168     |    29.5752    |
|     75%     |  4.618   | 0.0469  |  0.0487  |   51.9412   |     512      |     1024      |    19.7172     |    29.5757    |
|     80%     |  5.0425  | 0.0472  |  0.0487  |   51.9414   |     512      |     1024      |    19.7172     |    29.5758    |
|     90%     |  5.0448  | 0.0482  |  0.0493  |   51.9429   |     512      |     1024      |    19.7179     |    29.5768    |
|     95%     |  5.125   | 0.0491  |  0.0493  |   51.9448   |     512      |     1024      |    19.7193     |    29.5789    |
|     98%     |  5.1261  | 0.0503  |  0.0498  |   51.9463   |     512      |     1024      |    19.7633     |    29.645     |
|     99%     |  5.1261  | 0.0511  |  0.0498  |   51.9463   |     512      |     1024      |    19.7633     |    29.645     |
Downloads last month
12
Safetensors
Model size
177B params
Tensor type
BF16
·
F32
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Tengyunw/GLM-4.7-NVFP4

Base model

zai-org/GLM-4.7
Quantized
(24)
this model

Datasets used to train Tengyunw/GLM-4.7-NVFP4