FLUX.2-dev 2-bit HQQ (Half-Quadratic Quantization)
A 2-bit quantized variant of FLUX.2-dev by Black Forest Labs, compacted with the HQQ (Half-Quadratic Quantization) toolkit.
All linear layers in the Transformer and Text Encoder (Mistral3-small) components have been replaced with 2-bit HQQ-quantized weights.
To use, make sure to install the following libraries:
pip install git+https://github.com/huggingface/diffusers.git@main
pip install "transformers>=4.53.1"
pip install -U hqq
pip install accelerate huggingface_hub safetensors
Plus torch, naturally, installed/compiled however suits your device.
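Optionally, a quick sanity check that the key libraries import cleanly (nothing below is specific to this repo):

# Optional sanity check: confirm the required libraries import and print their versions.
import torch, transformers, diffusers, hqq  # importing hqq just confirms the install
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| diffusers:", diffusers.__version__)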
INFERENCE
(Sorry, but you may have to re-construct the pipe on-the-fly, as they say...)
import torch
import hqq
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from transformers import AutoModel
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
def replace_with_hqq(model, quant_config):
    """
    Recursively replaces nn.Linear layers with HQQLinear layers.
    This must match the exact logic used during quantization.
    """
    for name, child in model.named_children():
        if isinstance(child, torch.nn.Linear):
            # Create an empty (uninitialized) HQQ layer; the pre-quantized
            # weights from this repo are loaded into it afterwards.
            hqq_layer = HQQLinear(
                child,
                quant_config=quant_config,
                compute_dtype=torch.bfloat16,
                device="cuda",
                initialize=False
            )
            setattr(model, name, hqq_layer)
        else:
            replace_with_hqq(child, quant_config)
# These settings must match the config used to quantize this repo's weights:
# 2-bit, group_size=64, grouped along axis=1.
hqq_config = BaseQuantizeConfig(
    nbits=2,
    group_size=64,
    axis=1
)
model_id = "AlekseyCalvin/FLUX2_dev_2bit_hqq"
print("Loading Text Encoder (Mistral)...")
# Initialize skeleton
text_encoder = AutoModel.from_pretrained(
"black-forest-labs/FLUX.2-dev", # Load config from base model
subfolder="text_encoder",
torch_dtype=torch.bfloat16
)
# Swap layers
replace_with_hqq(text_encoder, hqq_config)
# Load quantized weights
te_path = hf_hub_download(model_id, filename="text_encoder/model.safetensors")
te_state_dict = load_file(te_path)
text_encoder.load_state_dict(te_state_dict)
text_encoder = text_encoder.to("cuda")
print("Loading Transformer (Flux 2)...")
# Initialize skeleton
transformer = Flux2Transformer2DModel.from_pretrained(
"black-forest-labs/FLUX.2-dev",
subfolder="transformer",
torch_dtype=torch.bfloat16
)
# Swap layers
replace_with_hqq(transformer, hqq_config)
# Load quantized weights
tr_path = hf_hub_download(model_id, filename="transformer/diffusion_pytorch_model.safetensors")
tr_state_dict = load_file(tr_path)
transformer.load_state_dict(tr_state_dict)
transformer = transformer.to("cuda")
print("Assembling Pipeline...")
pipe = Flux2Pipeline.from_pretrained(
"black-forest-labs/FLUX.2-dev",
transformer=transformer,
text_encoder=text_encoder,
torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
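# If VRAM is still tight, pipe.enable_sequential_cpu_offload() trades speed for lower memory;
# assumption: the standard diffusers offload hooks also apply to Flux2Pipeline.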
print("Ready for Inference!")
prompt = "A photo of a sneaky koala hiding behind book stacks at a library, calm snowy landscape visible through large window in the backdrop..."
image = pipe(prompt, guidance_scale=4, num_inference_steps=40).images[0]
image.save("KoalaTesting.png")
If the above doesn't work, try the model-level inference method from the HQQ Git repo; a rough sketch of that approach follows below...
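For orientation, here is a rough, untested sketch of that model-level approach, following the HQQ repo README. AutoHQQHFModel targets transformers-style models (so, the text encoder here); whether it handles the diffusers transformer or this repo's exact file layout is untested, and the local save directory name is just a placeholder.

# Hedged sketch of the HQQ repo's model-level helpers (transformers-style models only).
import torch
from transformers import AutoModel
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

quant_config = BaseQuantizeConfig(nbits=2, group_size=64, axis=1)

# Quantize a freshly loaded bf16 text encoder in place...
text_encoder = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.2-dev", subfolder="text_encoder", torch_dtype=torch.bfloat16
)
AutoHQQHFModel.quantize_model(
    text_encoder, quant_config=quant_config, compute_dtype=torch.bfloat16, device="cuda"
)

# ...then (optionally) save it and reload it later without re-quantizing.
AutoHQQHFModel.save_quantized(text_encoder, "flux2_text_encoder_hqq")  # placeholder local dir
text_encoder = AutoHQQHFModel.from_quantized(
    "flux2_text_encoder_hqq", compute_dtype=torch.bfloat16, device="cuda"
)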
If neither works, please leave a comment. I will do more testing soon and revise if need be.
Crucially: HQQ should work with PEFT/LoRA inference + training (see the sketch below).
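Here is a minimal, untested sketch of attaching LoRA adapters to the HQQ-quantized transformer with the peft library (requires a peft version with HQQ support; the rank and target-module names are assumptions about FLUX.2's attention projections, not something this repo ships):

# Hedged sketch: inject LoRA adapters into the HQQ-quantized transformer via peft.
from peft import LoraConfig, inject_adapter_in_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed FLUX.2 attention projections
)
transformer = inject_adapter_in_model(lora_config, transformer)
# Only the LoRA weights are float parameters here; the 2-bit HQQ base weights stay as-is.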
MORE INFO:
HQQ documentation at Hugging Face.
HQQ git repo with further info and code.
Blog post about HQQ, originally published by the Mobius team (reposted at Dropbox.tech).