---
base_model:
- HiDream-ai/HiDream-I1-Full
base_model_relation: quantized
pipeline_tag: text-to-image
tags:
- dfloat11
- df11
- lossless compression
- 70% size, 100% accuracy
---
# DFloat11 Compressed Model: `HiDream-ai/HiDream-I1-Full`
This is a **DFloat11 losslessly compressed** version of the original `HiDream-ai/HiDream-I1-Full` model. It reduces model size by **~30%** compared to the original BFloat16 model, while maintaining **bit-identical outputs** and supporting **efficient GPU inference**.
🔥🔥🔥 Thanks to DFloat11 compression, HiDream-I1-Full can now run smoothly on a single 32GB GPU without any quality loss. 🔥🔥🔥
### 📊 Performance Comparison
| Metric | HiDream-I1-Full (BFloat16) | HiDream-I1-Full (DFloat11) |
| ----------------------------------------------- | ------------------- | ------------------- |
| Model Size | 34.21 GB | 24.19 GB |
| Peak GPU Memory (1024×1024 image generation) | 35.61 GB | 26.42 GB |
| Generation Time (A100 GPU) | 140 seconds | 161 seconds |
### 🔧 How to Use
1. Install or upgrade the DFloat11 pip package *(installs the CUDA kernel automatically; requires a CUDA-compatible GPU and an existing PyTorch installation)*:
```bash
pip install -U "dfloat11[cuda12]"
# or if you have CUDA version 11:
# pip install -U "dfloat11[cuda11]"
```
2. Install or upgrade the diffusers library:
```bash
pip install -U diffusers
```
3. Run the following example in Python to generate an image with the DFloat11 model:
```python
import torch
from transformers import AutoTokenizer
from diffusers import HiDreamImagePipeline
from dfloat11 import DFloat11Model
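
# Load the tokenizer and the DFloat11-compressed Llama-3.1-8B-Instruct that serves as HiDream's fourth text encoder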
tokenizer_4 = AutoTokenizer.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11")
text_encoder_4 = DFloat11Model.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11", device="cpu")
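
# The pipeline reads the encoder's intermediate hidden states, so expose them via the config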
text_encoder_4.config.output_hidden_states = True
text_encoder_4.config.output_attentions = True
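
# Build the BFloat16 HiDream pipeline with the compressed text encoder plugged in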
pipe = HiDreamImagePipeline.from_pretrained(
"HiDream-ai/HiDream-I1-Full",
tokenizer_4=tokenizer_4,
text_encoder_4=text_encoder_4,
torch_dtype=torch.bfloat16,
)
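
# Swap the transformer's BFloat16 weights for the DFloat11-compressed version in place;
# the CUDA kernel decompresses them on the fly during inference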
DFloat11Model.from_pretrained(
"DFloat11/HiDream-I1-Full-DF11",
device="cpu",
bfloat16_model=pipe.transformer,
)
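
# Offload idle components to CPU to reduce peak GPU memory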
pipe.enable_model_cpu_offload()
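
# Generate a 1024×1024 image; the fixed seed makes the run reproducible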
image = pipe(
'A cat wearing a vintage astronaut suit, floating inside a spaceship and gazing out the window at Earth.',
height=1024,
width=1024,
guidance_scale=5.0,
num_inference_steps=50,
generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("output.png")
```
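If GPU memory is still tight, `diffusers` also provides `pipe.enable_sequential_cpu_offload()` as a slower but more aggressive alternative to `enable_model_cpu_offload()`; use one or the other, not both.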
### 🔍 How It Works
We apply **Huffman coding** to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
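To see why the exponent field compresses so well, here is a minimal sketch that estimates its Shannon entropy. It is an illustration only (not part of the `dfloat11` package), and it uses Gaussian random weights in place of a real checkpoint; trained weights show a similarly skewed exponent distribution:
```python
import torch

# Illustration only: Gaussian random weights stand in for a real checkpoint;
# trained model weights show a similarly skewed exponent distribution.
weights = torch.randn(1_000_000).to(torch.bfloat16)

# Reinterpret the 16-bit patterns and extract the 8 exponent bits (bits 14..7).
bits = weights.view(torch.int16).to(torch.int32)
exponents = (bits >> 7) & 0xFF

# Shannon entropy of the exponent field, in bits per weight.
counts = torch.bincount(exponents, minlength=256).float()
probs = counts[counts > 0] / counts.sum()
entropy = -(probs * probs.log2()).sum().item()

print(f"exponent entropy: {entropy:.2f} bits (vs. 8 bits stored)")
```
Replacing the 8 stored exponent bits with a ~2.6-bit Huffman code brings each weight from 16 bits down to roughly 11 (1 sign + 7 mantissa + ~2.6 exponent bits), which is where the ~30% size reduction comes from.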
The result is a model that is **~30% smaller**, delivers **bit-identical outputs**, and achieves performance **comparable to the original** BFloat16 model.
Learn more in our [research paper](https://arxiv.org/abs/2504.11651).
### 📄 Learn More
* **Paper**: [70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float](https://arxiv.org/abs/2504.11651)
* **GitHub**: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)
* **HuggingFace**: [https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)