---
base_model:
- HiDream-ai/HiDream-I1-Full
base_model_relation: quantized
pipeline_tag: text-to-image
tags:
- dfloat11
- df11
- lossless compression
- 70% size, 100% accuracy
---

# DFloat11 Compressed Model: `HiDream-ai/HiDream-I1-Full`

This is a **DFloat11 losslessly compressed** version of the original `HiDream-ai/HiDream-I1-Full` model. It reduces model size by **30%** compared to the original BFloat16 model, while maintaining **bit-identical outputs** and supporting **efficient GPU inference**.

🔥🔥🔥 Thanks to DFloat11 compression, HiDream-I1-Full can now run smoothly on a single 32 GB GPU without any quality loss. 🔥🔥🔥

### 📊 Performance Comparison

| Metric | HiDream-I1-Full (BFloat16) | HiDream-I1-Full (DFloat11) |
| ------ | -------------------------- | -------------------------- |
| Model Size | 34.21 GB | 24.19 GB |
| Peak GPU Memory<br>(1024×1024 image generation) | 35.61 GB | 26.42 GB |
| Generation Time<br>(A100 GPU) | 140 seconds | 161 seconds |
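For reference, numbers like these can be approximated with PyTorch's built-in memory statistics. The following is a minimal measurement sketch, not the official benchmark script: it assumes `pipe` has already been set up as in the usage example below, and note that `torch.cuda.max_memory_allocated` reports the caching allocator's peak rather than total VRAM usage.

```python
import time

import torch

# Hypothetical measurement sketch; assumes `pipe` is the HiDreamImagePipeline
# constructed as shown in the "How to Use" section below.
torch.cuda.reset_peak_memory_stats()

start = time.time()
image = pipe(
    "A cat wearing a vintage astronaut suit.",
    height=1024,
    width=1024,
    guidance_scale=5.0,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
elapsed = time.time() - start

# Peak memory held by PyTorch's caching allocator, a close proxy for the
# "Peak GPU Memory" column above.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Generation time: {elapsed:.0f} s, peak GPU memory: {peak_gb:.2f} GB")
```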
### 🔧 How to Use

1. Install or upgrade the DFloat11 pip package *(installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed)*:

```bash
pip install -U dfloat11[cuda12]
# or if you have CUDA version 11:
# pip install -U dfloat11[cuda11]
```

2. Install or upgrade the diffusers library:

```bash
pip install -U diffusers
```

3. Use the DFloat11 model with the following example code:

```python
import torch
from transformers import AutoTokenizer
from diffusers import HiDreamImagePipeline
from dfloat11 import DFloat11Model

# Load the tokenizer and the DF11-compressed Llama-3.1-8B text encoder
# that HiDream uses as its fourth text encoder.
tokenizer_4 = AutoTokenizer.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11")
text_encoder_4 = DFloat11Model.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11", device="cpu")
text_encoder_4.config.output_hidden_states = True
text_encoder_4.config.output_attentions = True

pipe = HiDreamImagePipeline.from_pretrained(
    "HiDream-ai/HiDream-I1-Full",
    tokenizer_4=tokenizer_4,
    text_encoder_4=text_encoder_4,
    torch_dtype=torch.bfloat16,
)

# Load the DF11-compressed transformer weights into the pipeline;
# they are decompressed on the fly on the GPU during inference.
DFloat11Model.from_pretrained(
    "DFloat11/HiDream-I1-Full-DF11",
    device="cpu",
    bfloat16_model=pipe.transformer,
)
# Offload idle pipeline components to CPU to reduce peak GPU memory.
pipe.enable_model_cpu_offload()

image = pipe(
    'A cat wearing a vintage astronaut suit, floating inside a spaceship and gazing out the window at Earth.',
    height=1024,
    width=1024,
    guidance_scale=5.0,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

image.save("output.png")
```

### 🔍 How It Works

We apply **Huffman coding** to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.

The result is a model that is **~30% smaller**, delivers **bit-identical outputs**, and achieves performance **comparable to the original** BFloat16 model. Learn more in our [research paper](https://arxiv.org/abs/2504.11651).
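To see why the exponent bits compress so well, here is a small self-contained sketch (illustrative only, not the DFloat11 implementation) that estimates the empirical entropy of the exponent field of a BFloat16 tensor. Gaussian-like weights concentrate on a handful of exponent values, so a Huffman code over this distribution approaches the entropy instead of the 8 bits stored:

```python
import math
from collections import Counter

import torch

# Toy illustration (not the DFloat11 kernel): estimate how much information
# the 8 exponent bits of BFloat16 values actually carry.
weights = torch.randn(1_000_000).to(torch.bfloat16)  # stand-in for real model weights

# Reinterpret each BF16 value as 16 raw bits and extract bits 7..14 (the exponent field).
raw = weights.view(torch.int16).to(torch.int32) & 0xFFFF
exponents = ((raw >> 7) & 0xFF).tolist()

counts = Counter(exponents)
n = len(exponents)
entropy = -sum(c / n * math.log2(c / n) for c in counts.values())

print(f"Distinct exponent values: {len(counts)} of 256 possible")
print(f"Empirical entropy: {entropy:.2f} bits (vs. 8 bits stored)")
```

On Gaussian-like data this typically reports an entropy of roughly 2 to 3 bits, in line with the ~2.6 bits observed for real model weights in the paper.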
### 📄 Learn More

* **Paper**: [70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float](https://arxiv.org/abs/2504.11651)
* **GitHub**: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)
* **HuggingFace**: [https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)