FLUX.2-dev 2-bit HQQ (Half-Quadratic Quantization)

2-bit quantized variant of FLUX.2-dev by Black Forest Labs, compressed with the HQQ toolkit.
All of the linear layers in the Transformer and Text Encoder (Mistral Small 3) components have been replaced with HQQ-quantized weights.
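
As a rough illustration of what that replacement does, quantizing a single nn.Linear layer with HQQ looks something like this (a minimal sketch using the same quantization settings as the inference code below; the layer sizes are arbitrary and purely illustrative):

import torch
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

quant_config = BaseQuantizeConfig(nbits=2, group_size=64, axis=1)

# Stand-in linear layer (sizes are arbitrary, for illustration only)
linear = torch.nn.Linear(4096, 4096, bias=False)

# HQQLinear quantizes the weights on construction (initialize=True by default)
# and dequantizes on the fly during forward passes.
hqq_layer = HQQLinear(linear, quant_config=quant_config, compute_dtype=torch.bfloat16, device="cuda")

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
print(hqq_layer(x).shape)  # torch.Size([1, 4096])
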
To use, make sure to install the following libraries:

pip install git+https://github.com/huggingface/diffusers.git@main
pip install "transformers>=4.53.1"
pip install -U hqq
pip install accelerate huggingface_hub safetensors

Plus torch, naturally, installed however is appropriate for your device/CUDA setup.
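
A quick sanity check that the environment is in place (a small sketch; the exact minimum versions are the ones listed above):

import torch, hqq, diffusers, transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("hqq:", getattr(hqq, "__version__", "installed"))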

INFERENCE

(Sorry, but you may have to reconstruct the pipe on-the-fly, as they say...)

import torch
import hqq
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from transformers import AutoModel
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

def replace_with_hqq(model, quant_config):
    """
    Recursively replaces nn.Linear layers with HQQLinear layers.
    This must match the exact logic used during quantization.
    """
    for name, child in model.named_children():
        if isinstance(child, torch.nn.Linear):
            # Create an empty HQQ layer. initialize=False skips quantizing the
            # bf16 weights here; the pre-quantized tensors are loaded from the
            # state dict further below.
            hqq_layer = HQQLinear(
                child,
                quant_config=quant_config,
                compute_dtype=torch.bfloat16,
                device="cuda",
                initialize=False
            )
            setattr(model, name, hqq_layer)
        else:
            replace_with_hqq(child, quant_config)

hqq_config = BaseQuantizeConfig(
    nbits=2,
    group_size=64,
    axis=1 
)

model_id = "AlekseyCalvin/FLUX2_dev_2bit_hqq"

print("Loading Text Encoder (Mistral)...")
# Initialize skeleton
text_encoder = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.2-dev", # Load config from base model
    subfolder="text_encoder",
    torch_dtype=torch.bfloat16
)
# Swap layers
replace_with_hqq(text_encoder, hqq_config)
# Load quantized weights
te_path = hf_hub_download(model_id, filename="text_encoder/model.safetensors")
te_state_dict = load_file(te_path)
text_encoder.load_state_dict(te_state_dict)
text_encoder = text_encoder.to("cuda")

print("Loading Transformer (Flux 2)...")
# Initialize skeleton
transformer = Flux2Transformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.2-dev", 
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)
# Swap layers
replace_with_hqq(transformer, hqq_config)
# Load quantized weights
tr_path = hf_hub_download(model_id, filename="transformer/diffusion_pytorch_model.safetensors")
tr_state_dict = load_file(tr_path)
transformer.load_state_dict(tr_state_dict)
transformer = transformer.to("cuda")

print("Assembling Pipeline...")
pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    transformer=transformer,
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
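
# Note: enable_model_cpu_offload() trades speed for VRAM by shuttling components
# between CPU and GPU as needed. If your GPU can hold both quantized models at
# once, you could (untested here) keep everything resident instead:
# pipe.to("cuda")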

print("Ready for Inference!")
prompt = "A photo of a sneaky koala hiding behind book stacks at a library, calm snowy landscape visible through large window in the backdrop..."
image = pipe(prompt, guidance_scale=4, num_inference_steps=40).images[0]
image.save("KoalaTesting.png")
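
For reproducible outputs, you can also pass a seeded generator (standard diffusers usage, not specific to this quantization):

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(prompt, guidance_scale=4, num_inference_steps=40, generator=generator).images[0]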

If the above doesn't work, try the inference method described in the HQQ Git repo.
If neither works, please leave a comment. I will do more testing soon and revise if need be.
Crucially: HQQ should work with PEFT/LoRA inference and training.
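
As a rough sketch of that LoRA direction (untested against this exact checkpoint; the target_modules names are illustrative and should be checked against the actual Flux.2 transformer submodule names before training):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Illustrative attention projection names; verify against the real module names.
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

# Recent peft releases have included LoRA support for HQQ-quantized linear layers,
# so the adapters attach on top of the 2-bit weights.
transformer = get_peft_model(transformer, lora_config)
transformer.print_trainable_parameters()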

MORE INFO:

HQQ documentation at Hugging Face.
HQQ Git repo with further info and code.
Blog post about HQQ, originally published by the Mobius Labs team (reposted at Dropbox.tech).
