FLUX.2-dev 2-bit HQQ (Half-Quadratic Quantization)

2-bit quantized variant of FLUX.2-dev by Black Forest Labs, compressed with the HQQ toolkit.
All of the linear layers in the Transformer and Text Encoder (Mistral Small 3) components have been replaced with HQQ-quantized weights.
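
As a rough illustration of what that replacement does, quantizing a single nn.Linear layer with HQQ looks something like this (a minimal sketch using the same quantization settings as the inference code below; the layer sizes are arbitrary and purely illustrative):

import torch
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

quant_config = BaseQuantizeConfig(nbits=2, group_size=64, axis=1)

# Stand-in linear layer (sizes are arbitrary, for illustration only)
linear = torch.nn.Linear(4096, 4096, bias=False)

# HQQLinear quantizes the weights on construction (initialize=True by default)
# and dequantizes on the fly during forward passes.
hqq_layer = HQQLinear(linear, quant_config=quant_config, compute_dtype=torch.bfloat16, device="cuda")

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
print(hqq_layer(x).shape)  # torch.Size([1, 4096])
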
To use, make sure to install the following libraries:

pip install git+https://github.com/huggingface/diffusers.git@main
pip install "transformers>=4.53.1"
pip install -U hqq
pip install accelerate huggingface_hub safetensors

Plus torch, naturally, installed however is appropriate for your device/CUDA setup.
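
A quick sanity check that the environment is in place (a small sketch; the exact minimum versions are the ones listed above):

import torch, hqq, diffusers, transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("hqq:", getattr(hqq, "__version__", "installed"))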

INFERENCE

(Sorry, but you may have to reconstruct the pipe on-the-fly, as they say...)

import torch
import hqq
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from transformers import AutoModel
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

def replace_with_hqq(model, quant_config):
    """
    Recursively replaces nn.Linear layers with HQQLinear layers.
    This must match the exact logic used during quantization.
    """
    for name, child in model.named_children():
        if isinstance(child, torch.nn.Linear):
            # Create an empty HQQ layer. initialize=False skips quantizing the
            # bf16 weights here; the pre-quantized tensors are loaded from the
            # state dict further below.
            hqq_layer = HQQLinear(
                child,
                quant_config=quant_config,
                compute_dtype=torch.bfloat16,
                device="cuda",
                initialize=False
            )
            setattr(model, name, hqq_layer)
        else:
            replace_with_hqq(child, quant_config)

hqq_config = BaseQuantizeConfig(
    nbits=2,
    group_size=64,
    axis=1 
)

model_id = "AlekseyCalvin/FLUX2_dev_2bit_hqq"

print("Loading Text Encoder (Mistral)...")
# Initialize skeleton
text_encoder = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.2-dev", # Load config from base model
    subfolder="text_encoder",
    torch_dtype=torch.bfloat16
)
# Swap layers
replace_with_hqq(text_encoder, hqq_config)
# Load quantized weights
te_path = hf_hub_download(model_id, filename="text_encoder/model.safetensors")
te_state_dict = load_file(te_path)
text_encoder.load_state_dict(te_state_dict)
text_encoder = text_encoder.to("cuda")

print("Loading Transformer (Flux 2)...")
# Initialize skeleton
transformer = Flux2Transformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.2-dev", 
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)
# Swap layers
replace_with_hqq(transformer, hqq_config)
# Load quantized weights
tr_path = hf_hub_download(model_id, filename="transformer/diffusion_pytorch_model.safetensors")
tr_state_dict = load_file(tr_path)
transformer.load_state_dict(tr_state_dict)
transformer = transformer.to("cuda")

print("Assembling Pipeline...")
pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    transformer=transformer,
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
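
# Note: enable_model_cpu_offload() trades speed for VRAM by shuttling components
# between CPU and GPU as needed. If your GPU can hold both quantized models at
# once, you could (untested here) keep everything resident instead:
# pipe.to("cuda")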

print("Ready for Inference!")
prompt = "A photo of a sneaky koala hiding behind book stacks at a library, calm snowy landscape visible through large window in the backdrop..."
image = pipe(prompt, guidance_scale=4, num_inference_steps=40).images[0]
image.save("KoalaTesting.png")
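
For reproducible outputs, you can also pass a seeded generator (standard diffusers usage, not specific to this quantization):

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(prompt, guidance_scale=4, num_inference_steps=40, generator=generator).images[0]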

If the above doesn't work, try the inference method described in the HQQ Git repo.
If neither works, please leave a comment. I will do more testing soon and revise if need be.
Crucially: HQQ should work with PEFT/LoRA inference and training.
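
As a rough sketch of that LoRA direction (untested against this exact checkpoint; the target_modules names are illustrative and should be checked against the actual Flux.2 transformer submodule names before training):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Illustrative attention projection names; verify against the real module names.
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

# Recent peft releases have included LoRA support for HQQ-quantized linear layers,
# so the adapters attach on top of the 2-bit weights.
transformer = get_peft_model(transformer, lora_config)
transformer.print_trainable_parameters()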

MORE INFO:

HQQ documentation at Hugging Face.
HQQ Git repo with further info and code.
Blog post about HQQ, originally published by the Mobius Labs team (reposted at Dropbox.tech).
