# ACE-Step 1.5 GGUF
Pre-quantized GGUF models for acestep.cpp, a portable C++17 implementation of the ACE-Step 1.5 music generation pipeline using GGML.
Text + lyrics in, stereo 48kHz WAV out. Runs on CPU, CUDA, Metal, Vulkan.
## Quick start
```bash
git clone --recurse-submodules https://github.com/ServeurpersoCom/acestep.cpp
cd acestep.cpp
pip install huggingface_hub
./models.sh   # downloads Q8_0 turbo essentials (~7.7 GB)

mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)
cd ..

cat > /tmp/request.json << 'EOF'
{
  "caption": "Upbeat pop rock with driving guitars and catchy hooks",
  "inference_steps": 8,
  "shift": 3.0,
  "vocal_language": "fr"
}
EOF

# LLM: generate lyrics + audio codes (writes /tmp/request0.json, consumed below)
./build/ace-qwen3 \
    --request /tmp/request.json \
    --model models/acestep-5Hz-lm-4B-Q8_0.gguf

# DiT + VAE: synthesize audio
./build/dit-vae \
    --request /tmp/request0.json \
    --text-encoder models/Qwen3-Embedding-0.6B-Q8_0.gguf \
    --dit models/acestep-v15-turbo-Q8_0.gguf \
    --vae models/vae-BF16.gguf
```
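The build above targets CUDA. For the other backends listed at the top, the standard GGML CMake switches should apply; this is a sketch based on upstream GGML option names, so check the acestep.cpp README for the exact flags:

```bash
# CPU only: the default, no backend flag needed
cmake ..

# Apple Metal (macOS) -- GGML_METAL is the upstream GGML option name
cmake .. -DGGML_METAL=ON

# Vulkan
cmake .. -DGGML_VULKAN=ON
```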
## Download options
```bash
./models.sh                # Q8_0 turbo essentials (~7.7 GB)
./models.sh --all          # every model, every quant (~97 GB)
./models.sh --quant BF16   # full precision
./models.sh --quant Q6_K   # pick a quant
./models.sh --sft          # add SFT DiT variant
./models.sh --shifts       # add shift1/shift3/continuous variants
./models.sh --lm 0.6B      # smaller LM (fast, lower quality)
```
Or download individual files manually from the files tab.
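Manual downloads can also be scripted with the stock Hugging Face CLI. A sketch, assuming the GGUF files sit at the root of this repo; the four files below are presumably the same Q8_0 turbo essentials that models.sh fetches (their sizes in the tables below sum to roughly 7.7 GB):

```bash
# Fetch the Q8_0 turbo essentials into models/
huggingface-cli download Serveurperso/ACE-Step-1.5-GGUF \
    Qwen3-Embedding-0.6B-Q8_0.gguf \
    acestep-5Hz-lm-4B-Q8_0.gguf \
    acestep-v15-turbo-Q8_0.gguf \
    vae-BF16.gguf \
    --local-dir models
```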
## Available models
### Text encoder
| File | Quant | Size |
|---|---|---|
| Qwen3-Embedding-0.6B-BF16.gguf | BF16 | 1.2 GB |
| Qwen3-Embedding-0.6B-Q8_0.gguf | Q8_0 | 748 MB |
### LM (Qwen3 causal, audio code generation)
| File | Params | Quant | Size |
|---|---|---|---|
| acestep-5Hz-lm-4B-BF16.gguf | 4B | BF16 | 7.9 GB |
| acestep-5Hz-lm-4B-Q8_0.gguf | 4B | Q8_0 | 4.2 GB |
| acestep-5Hz-lm-4B-Q6_K.gguf | 4B | Q6_K | 3.3 GB |
| acestep-5Hz-lm-4B-Q5_K_M.gguf | 4B | Q5_K_M | 2.9 GB |
| acestep-5Hz-lm-1.7B-BF16.gguf | 1.7B | BF16 | 3.5 GB |
| acestep-5Hz-lm-1.7B-Q8_0.gguf | 1.7B | Q8_0 | 1.9 GB |
| acestep-5Hz-lm-0.6B-BF16.gguf | 0.6B | BF16 | 1.3 GB |
| acestep-5Hz-lm-0.6B-Q8_0.gguf | 0.6B | Q8_0 | 677 MB |
The small LMs (0.6B/1.7B) ship only as BF16 and Q8_0; they are too small to survive aggressive quantization. The 4B LM has no Q4_K_M because it breaks audio code generation.
### DiT (flow matching diffusion transformer)
Available for all 6 variants: turbo, sft, base, turbo-shift1, turbo-shift3, turbo-continuous.
| Quant | Size per variant |
|---|---|
| BF16 | 4.5 GB |
| Q8_0 | 2.4 GB |
| Q6_K | 1.9 GB |
| Q5_K_M | 1.6 GB |
| Q4_K_M | 1.4 GB |
Turbo preset: 8 steps, no CFG. SFT/Base preset: 32-50 steps, CFG 7.0.
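For the SFT/base presets, a request might look like the sketch below. Only caption, inference_steps, shift, and vocal_language appear in the quick-start example; the guidance_scale key for the CFG 7.0 setting is a hypothetical field name, so check the acestep.cpp documentation for the real one:

```bash
# Hypothetical request for a non-turbo DiT: more steps, CFG enabled.
# NOTE: "guidance_scale" is an assumed key, not confirmed by this card.
cat > /tmp/request-sft.json << 'EOF'
{
  "caption": "Upbeat pop rock with driving guitars and catchy hooks",
  "inference_steps": 32,
  "shift": 3.0,
  "guidance_scale": 7.0
}
EOF
```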
### VAE
| File | Size |
|---|---|
| vae-BF16.gguf | 322 MB |
Always BF16 (small, bandwidth-bound, quality-critical).
## Pipeline
### ace-qwen3 (Qwen3 causal LM, 0.6B/1.7B/4B)

- Phase 1 (if needed): CoT generates bpm, keyscale, timesignature, and lyrics
- Phase 2: audio codes (5Hz tokens, FSQ vocabulary)
- Both phases are batched: N sequences per forward pass, weights read once
- CFG with a dual KV cache per batch element (cond + uncond)
- Output: request0.json .. requestN-1.json
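After a batched run, each emitted request file can be fed to the dit-vae stage (described next). A sketch reusing the quick-start model paths:

```bash
# Synthesize every LM output in turn; the request[0-9]*.json glob
# deliberately skips the original /tmp/request.json input.
for req in /tmp/request[0-9]*.json; do
    ./build/dit-vae \
        --request "$req" \
        --text-encoder models/Qwen3-Embedding-0.6B-Q8_0.gguf \
        --dit models/acestep-v15-turbo-Q8_0.gguf \
        --vae models/vae-BF16.gguf
done
```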
### dit-vae

1. BPE tokenization
2. Qwen3-Embedding (28-layer text encoder)
3. CondEncoder (8-layer lyric + 4-layer timbre + text_proj)
4. FSQ detokenizer (audio codes -> source latents)
5. DiT (24-layer flow matching, Euler steps)
6. VAE (AutoencoderOobleck, tiled decode)
7. Stereo 48kHz WAV
Both stages support batching (`--batch N`) for parallel generation. LM batching produces different songs; DiT batching produces subtle variations of the same piece, since each batch element starts from different initial noise.
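Concretely, assuming the same flags as in the quick start:

```bash
# Four different songs from one prompt: batch at the LM stage
# (emits request0.json .. request3.json).
./build/ace-qwen3 --batch 4 \
    --request /tmp/request.json \
    --model models/acestep-5Hz-lm-4B-Q8_0.gguf

# Four takes on the same piece: batch at the DiT stage instead,
# where only the initial noise differs per batch element.
./build/dit-vae --batch 4 \
    --request /tmp/request0.json \
    --text-encoder models/Qwen3-Embedding-0.6B-Q8_0.gguf \
    --dit models/acestep-v15-turbo-Q8_0.gguf \
    --vae models/vae-BF16.gguf
```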
## Acknowledgements
Independent C++/GGML implementation based on ACE-Step 1.5 by ACE Studio and StepFun. All model weights are theirs; this is a native inference backend.
## Links
- [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp) - source code
- ACE-Step 1.5 - original Python implementation
- ACE-Step model hub - original weights