---
base_model: TheDrummer/Behemoth-123B-v2.2
tags:
- nvfp4
- quantized
- vllm
- dgx
license: apache-2.0
---

# Behemoth-123B-NVFP4

NVFP4 quantized version of [TheDrummer/Behemoth-123B-v2.2](https://huggingface.co/TheDrummer/Behemoth-123B-v2.2), optimized for NVIDIA DGX/Hopper+ architectures.

## Quantization Details

- **Format:** NVFP4 (4-bit floating point)
- **Quantized using:** NVIDIA TensorRT Model Optimizer 0.35.0
- **Hardware:** 2× NVIDIA H200 SXM (188GB each)
- **Original size:** 245GB (BF16) → **66GB (NVFP4)**
- **Compatible with:** vLLM v0.10+, NVIDIA NGC containers

A sketch of the quantization flow appears in the Quantization Sketch section at the end of this card.

## Usage

```bash
vllm serve tbhot3ww/Behemoth-123B-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```

A minimal Python client is shown in the Client Example section at the end of this card.

## Performance Benchmarks (NVIDIA GB10, 128GB unified memory)

| Sequences | Throughput (tok/s) | Per-Seq (tok/s) | KV Cache Use | Notes |
|-----------|--------------------|-----------------|--------------|-------|
| 12 | 32.4 | 2.7 | 1.6% | No queuing |
| 64 | 166.4 | 2.6 | 9.5% | Linear scaling |
| 128 | 307.1 | 2.4 | 21.3% | Sweet spot |
| 256 | 485.8 | 1.9 | 45.4% | Pre-queue limit |
| 512 | 665.5 | 1.3 | 88.9% | Near capacity |
| 768 | 768 (peak) / 424 (avg) | 0.6 | 100% | Queued batching, 6m2s |

*All tests: 200 tokens generated per sequence. Per-Seq is total throughput divided by the number of concurrent sequences.*

## Original Model

See the [base model card](https://huggingface.co/TheDrummer/Behemoth-123B-v2.2) for architecture details, training data, and usage guidelines.
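
## Quantization Sketch

The checkpoint was produced with NVIDIA TensorRT Model Optimizer 0.35.0, but the exact recipe was not published. The following is a minimal sketch of the standard `nvidia-modelopt` NVFP4 post-training quantization flow, using the stock `mtq.NVFP4_DEFAULT_CFG` recipe and `export_hf_checkpoint` helper. The calibration text, sample count, and export directory are placeholders, not the settings used for this repo.

```python
# Minimal NVFP4 PTQ sketch with nvidia-modelopt. Illustrative only:
# calibration data and sizes below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

MODEL_ID = "TheDrummer/Behemoth-123B-v2.2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# 123B in BF16 needs ~245GB of GPU memory; device_map="auto" shards
# the model across the available GPUs (2x H200 in the original run).
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder calibration set; a real run would use a few hundred
# representative samples.
calib_texts = ["The quick brown fox jumps over the lazy dog."] * 16

def forward_loop(model):
    # Run calibration text through the model so modelopt can collect
    # the activation statistics needed to pick FP4 block scales.
    with torch.no_grad():
        for text in calib_texts:
            inputs = tokenizer(text, return_tensors="pt").to(model.device)
            model(**inputs)

# Apply the default NVFP4 recipe (4-bit floating point weights/activations).
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-layout checkpoint that vLLM's modelopt_fp4
# quantization loader can read.
export_hf_checkpoint(model, export_dir="Behemoth-123B-NVFP4")
```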
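
## Client Example

The `vllm serve` command in the Usage section exposes an OpenAI-compatible API, by default at `http://localhost:8000/v1`. A minimal completion request, assuming the default port and no API key configured:

```python
# Query the vLLM server via its OpenAI-compatible API.
from openai import OpenAI

# vLLM ignores the API key unless one was set at server startup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="tbhot3ww/Behemoth-123B-NVFP4",
    prompt="Write a short story about a dragon who collects maps.",
    max_tokens=200,
    temperature=0.8,
)
print(response.choices[0].text)
```

If the model's chat template is preferred, the same client can call `client.chat.completions.create` with a `messages` list instead of a raw prompt.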