---
base_model:
  - HiDream-ai/HiDream-I1-Full
base_model_relation: quantized
pipeline_tag: text-to-image
tags:
- dfloat11
- df11
- lossless compression
- 70% size, 100% accuracy
---

# DFloat11 Compressed Model: `HiDream-ai/HiDream-I1-Full`

This is a **DFloat11 losslessly compressed** version of the original `HiDream-ai/HiDream-I1-Full` model. It reduces model size by **30%** compared to the original BFloat16 model, while maintaining **bit-identical outputs** and supporting **efficient GPU inference**.

🔥🔥🔥 Thanks to DFloat11 compression, HiDream-I1-Full can now run smoothly on a single 32GB GPU without any quality loss. 🔥🔥🔥

### 📊 Performance Comparison

| Metric                                          | HiDream-I1-Full (BFloat16) | HiDream-I1-Full (DFloat11) |
| ----------------------------------------------- | ------------------- | ------------------- |
| Model Size                                      | 34.21 GB            | 24.19 GB            |
| Peak GPU Memory<br>(1024×1024 image generation) | 35.61 GB            | 26.42 GB            |
| Generation Time<br>(A100 GPU)                   | 140 seconds          | 161 seconds          |
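The ratios implied by the table can be checked with a few lines of arithmetic. The sketch below only re-uses the numbers reported above; the dictionary keys and variable names are illustrative and not part of any library.

```python
# Figures copied from the table above (GB and seconds).
bf16 = {"size_gb": 34.21, "peak_gb": 35.61, "time_s": 140}
df11 = {"size_gb": 24.19, "peak_gb": 26.42, "time_s": 161}

# Ratio of DFloat11 to BFloat16 for each metric.
size_ratio = df11["size_gb"] / bf16["size_gb"]
peak_ratio = df11["peak_gb"] / bf16["peak_gb"]
time_ratio = df11["time_s"] / bf16["time_s"]

print(f"model size:      {size_ratio:.1%} of BFloat16")
print(f"peak GPU memory: {peak_ratio:.1%} of BFloat16")
print(f"generation time: {time_ratio:.2f}x BFloat16")
```

With these figures, the compressed model is roughly 71% of the original size and uses roughly 74% of the peak memory, at about 1.15× the generation time.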

### 🔧 How to Use

1. Install or upgrade the DFloat11 pip package *(installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed)*:

    ```bash
    pip install -U "dfloat11[cuda12]"
    # or if you have CUDA version 11:
    # pip install -U "dfloat11[cuda11]"
    ```

2. Install or upgrade the diffusers library:

    ```bash
    pip install -U diffusers
    ```

3. To use the DFloat11 model, run the following example code in Python:

    ```python
    import torch
    from transformers import AutoTokenizer
    from diffusers import HiDreamImagePipeline
    from dfloat11 import DFloat11Model

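    # Load the Llama-3.1-8B-Instruct tokenizer and text encoder from the DFloat11-compressed repository.
    tokenizer_4 = AutoTokenizer.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11")
    text_encoder_4 = DFloat11Model.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11", device="cpu")
    text_encoder_4.config.output_hidden_states = True
    text_encoder_4.config.output_attentions = True

    # Build the pipeline from the original BFloat16 repository, passing in the compressed text encoder.
    pipe = HiDreamImagePipeline.from_pretrained(
        "HiDream-ai/HiDream-I1-Full",
        tokenizer_4=tokenizer_4,
        text_encoder_4=text_encoder_4,
        torch_dtype=torch.bfloat16,
    )
    # Apply the DFloat11-compressed weights to the pipeline's transformer (the return value is not used here).
    DFloat11Model.from_pretrained(
        "DFloat11/HiDream-I1-Full-DF11",
        device="cpu",
        bfloat16_model=pipe.transformer,
    )
    # Offload idle pipeline components to the CPU to reduce peak GPU memory.
    pipe.enable_model_cpu_offload()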

    image = pipe(
        'A cat wearing a vintage astronaut suit, floating inside a spaceship and gazing out the window at Earth.',
        height=1024,
        width=1024,
        guidance_scale=5.0,
        num_inference_steps=50,
        generator=torch.Generator("cuda").manual_seed(0),
    ).images[0]
    image.save("output.png")
    ```


### 🔍 How It Works

We apply **Huffman coding** to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
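The entropy argument can be illustrated with a small, standard-library-only sketch. It is not the DFloat11 implementation: the weights are synthetic Gaussian samples (an assumption, not real model weights), BFloat16 conversion is approximated by truncating float32, and the helper names `bf16_exponent`, `entropy_bits`, and `huffman_code_lengths` are hypothetical. It measures the Shannon entropy of the 8 exponent bits and the average length of a Huffman code built for them.

```python
import heapq
import math
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8 exponent bits of x after truncating float32 to BFloat16."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    bits16 = bits32 >> 16        # BFloat16 keeps the top 16 bits of float32
    return (bits16 >> 7) & 0xFF  # layout: 1 sign | 8 exponent | 7 mantissa


def entropy_bits(counts: Counter) -> float:
    """Shannon entropy (bits per symbol) of the empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(counts: Counter) -> dict:
    """Code length per symbol for a Huffman code built from the counts."""
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:        # every merge adds one bit to these symbols
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


# Synthetic stand-in for model weights (assumption: zero-mean Gaussian, std 0.02).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(100_000)]

counts = Counter(bf16_exponent(w) for w in weights)
entropy = entropy_bits(counts)
lengths = huffman_code_lengths(counts)
avg_len = sum(counts[s] * lengths[s] for s in counts) / sum(counts.values())

# Sign (1 bit) and mantissa (7 bits) are kept as-is; only the exponent is coded.
bits_per_weight = 1 + 7 + avg_len
print(f"exponent entropy:        {entropy:.2f} bits (out of 8)")
print(f"Huffman average length:  {avg_len:.2f} bits")
print(f"bits per weight:         {bits_per_weight:.2f} (vs. 16 for BFloat16)")
```

The exact numbers depend on the weight distribution, so the output for a real checkpoint will differ from this synthetic case. The point of the sketch is only that the exponent's entropy sits well below 8 bits, which is what makes a variable-length code worthwhile.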

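The result is a model that is **~30% smaller** and delivers **bit-identical outputs**, with a modest speed cost: in the benchmark above, generation takes 161 seconds instead of 140 seconds on an A100 GPU (about 15% longer) compared to the original BFloat16 model.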

Learn more in our [research paper](https://arxiv.org/abs/2504.11651).

### 📄 Learn More

* **Paper**: [70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float](https://arxiv.org/abs/2504.11651)
* **GitHub**: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)
* **HuggingFace**: [https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)