KansenSakura-Erosion-RP-12b — Quantized (compressed-tensors for vLLM)
This repository hosts quantized runtime builds of `Retreatcost/KansenSakura-Erosion-RP-12b`, repackaged for vLLM using the compressed-tensors format.
TL;DR
- vLLM-ready compressed-tensors quantization (INT8 weights / 16-bit activations, W8A16).
- Calibrated with 512 chat samples of up to 2048 tokens each from `neuralmagic/LLM_compression_calibration`.
- Uses AWQ (weight-only, group size 128) on linear layers, with `lm_head` left in higher precision.
Lineage & merges (base model)
- Base model: `Retreatcost/KansenSakura-Erosion-RP-12b` — a 12B RP-oriented model in the KansenSakura series, described by the author as the latest “Erosion” iteration with more immersive prose and psychological themes.
- Upstream merges: Retreatcost’s ecosystem notes that Impish-LongPen-12B (itself a Karcher merge of `SuperbEmphasis/MN-12b-RP-Ink-RP-Longform` and `Sicarius-Prototyping/Impish_Longtail_12B`) is used inside KansenSakura-Erosion-RP-12b.
- Merge ancestry: Hugging Face metadata also lists `Retreatcost/Shisa-K-sakurization` as a base for Erosion, meaning this quant inherits a layered merge stack rather than a single-source finetune.
This repo does not change any of that training/merge behavior — it only changes how the weights are stored for efficient inference.
Revisions & Branches
The `main` branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.
- main — placeholder / landing page
- W8A16 — INT8 weights / 16-bit activations (compressed-tensors)
Quick links
- main: https://huggingface.co/TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors/tree/main
- W8A16: https://huggingface.co/TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors/tree/W8A16
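To fetch a specific branch programmatically, `huggingface_hub`'s `snapshot_download` accepts a `revision` argument (a minimal sketch; the printed path is wherever the Hub cache places the files):

```python
from huggingface_hub import snapshot_download

# Download only the W8A16 branch; main is just a landing page.
path = snapshot_download(
    repo_id="TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    revision="W8A16",
)
print(path)  # local directory with the sharded safetensors + configs
```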
Repository contents (per revision)
- Sharded quantized weights (`*.safetensors`) + index (`model.safetensors.index.json`)
- `config.json` with compressed-tensors metadata (`quantization_config`, `weight_format`, etc.)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)
- Optional: `chat_template.jinja` (inherits the base model’s chat style)
Exact file lists may differ between branches — see Files and versions on the model page.
Quantization & calibration details
All details below are taken from the quantization script used to produce this build.
Quantization scheme
- Library / flow: `llmcompressor` one-shot pipeline (`oneshot`) with an `AWQModifier`.
- Targets: `Linear` layers only (`targets=["Linear"]`).
- Ignored layers: `["lm_head"]` (left in higher precision).
- Weights: `num_bits=8` (INT8 weights), `symmetric=True`, `strategy="group"` with `group_size=128`.
- Weight-only quantization (no input/output activation quantization configured).
- Runtime format: saved with `save_compressed=True` so vLLM reads the compressed-tensors layout directly.
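For reference, a recipe along these lines reproduces those settings (a minimal sketch, not the exact script; it assumes a recent `llmcompressor` where `AWQModifier` accepts an explicit `config_groups` mapping):

```python
from llmcompressor.modifiers.awq import AWQModifier

# Sketch of the recipe described above (field names follow compressed-tensors).
recipe = AWQModifier(
    targets=["Linear"],   # quantize linear layers only
    ignore=["lm_head"],   # keep the output head in higher precision
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 8,       # INT8 weights
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
                # no input/output activation quantization -> W8A16 at runtime
            },
        }
    },
)
```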
Calibration dataset & preprocessing
- Dataset: `neuralmagic/LLM_compression_calibration`, split `"train"`. Each example has a `messages` field with a structured chat conversation (a list of `{role, content}` dicts).
- Samples used: `NUM_CALIBRATION_SAMPLES = 512` — a randomly shuffled subset with fixed seed `42`.
- Sequence length: `MAX_SEQUENCE_LENGTH = 2048` — tokenization truncates sequences at 2048 tokens and does not pad, to keep calibration close to realistic RP-style context lengths.
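Loading and subsetting that dataset looks roughly like this (a sketch using the constants above):

```python
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512

# Shuffle with the fixed seed, then keep the first 512 conversations.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
```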
Preprocessing pipeline
- For each example, the script calls `tokenizer.apply_chat_template(messages, tokenize=False)` to render the conversation using the base model’s chat template.
- The rendered text is then tokenized with `max_length=2048`, `truncation=True`, `padding=False`, `add_special_tokens=False`.
- The processed dataset is passed into `oneshot(...)` with `max_seq_length=MAX_SEQUENCE_LENGTH` and `num_calibration_samples=NUM_CALIBRATION_SAMPLES`.
This yields per-group weight scales (group size 128) that are representative of RP-style, multi-turn chat rather than generic plain-text corpora.
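Put together, the preprocessing and one-shot call look roughly like this (a sketch continuing from the recipe and dataset snippets above; the output directory name is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot

MODEL_ID = "Retreatcost/KansenSakura-Erosion-RP-12b"
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def preprocess(example):
    # Render the chat with the base model's template, then tokenize.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,  # AWQModifier from the sketch above
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# save_compressed=True writes the compressed-tensors layout vLLM expects.
model.save_pretrained("KansenSakura-Erosion-RP-12b-W8A16", save_compressed=True)
tokenizer.save_pretrained("KansenSakura-Erosion-RP-12b-W8A16")
```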
Context length
- Calibration context: up to 2048 tokens, as described above.
- Model context window: inherited from `Retreatcost/KansenSakura-Erosion-RP-12b` (see that card for the authoritative max context size). This quantization does not change positional embeddings or RoPE scaling; it only changes the numeric representation of the weights.
Quickstart — vLLM (compressed-tensors)
Install vLLM (recent version recommended):
```bash
pip install vllm
```
Serve (adjust to your hardware):
```bash
# Weights live in the W8A16 branch; main is only a landing page.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors \
  --revision W8A16 \
  --quantization compressed-tensors \
  --tensor-parallel-size 4 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
```
Example Chat Completions request:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    "messages": [
      {"role": "system", "content": "You are KansenSakura-Erosion, a creative RP assistant. Follow user instructions and respect user safety constraints."},
      {"role": "user", "content": "Describe the setting and tone for the opening scene of our story."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
```
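The same endpoint works with the OpenAI Python client (a sketch assuming the server above on localhost:8000; vLLM ignores the API key by default, so any placeholder works):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are KansenSakura-Erosion, a creative RP assistant."},
        {"role": "user", "content": "Describe the setting and tone for the opening scene of our story."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```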
Note: `compressed-tensors` is a vLLM runtime format. Loading directly with vanilla 🤗 Transformers is not supported. For Transformers, use a compatible GGUF/GPTQ/AWQ export or the full-precision base model.
Prompting / chat template
This package uses the same chat template and tokenizer as `Retreatcost/KansenSakura-Erosion-RP-12b`.
- Use a concise system message to describe style, safety constraints, and what kinds of RP you expect.
- Provide clear user turns; for multi-step scenes, outline beats or constraints in bullet points.
If a `chat_template.jinja` file is present, libraries that support `apply_chat_template` (e.g., Transformers or vLLM’s OpenAI wrapper) will format the messages automatically.
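For offline prompt construction, the rendering step looks like this (a sketch; the message content is illustrative, and the tokenizer is pulled from the W8A16 branch since `main` carries no artifacts):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    revision="W8A16",  # tokenizer artifacts live in the per-quant branches
)

messages = [
    {"role": "system", "content": "Concise RP narration; keep scenes grounded."},
    {"role": "user", "content": "Open the scene at a rain-soaked train platform."},
]

# Render the conversation with the inherited chat template.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```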
Intended use & safety note
KansenSakura-series models are role-play focused and may generate intense or mature themes (as indicated by the base repo’s “not for all audiences” flag). This quantization:
- Does not change underlying behavior or content tendencies.
- Only changes weight representation for faster / cheaper inference.
You are responsible for applying your own content filters, jailbreak protections, and safety policies, especially in shared or public deployments.
Hardware tips
- 12B models run best on multi-GPU setups or a single high-VRAM GPU when using compressed-tensors.
- Throughput at long context is dominated by KV-cache memory; tune `--max-model-len` and batch size for your hardware.
- Prefer BF16 where supported; otherwise FP16.
- Enable NVLink / high-bandwidth interconnects for better tensor-parallel scaling.
License & usage
This distribution inherits the licenses, terms, and content policies of:
- Base model: `Retreatcost/KansenSakura-Erosion-RP-12b`
- Any upstream merges listed on that model’s card (e.g., `Impish-LongPen-12B`, `Shisa-K-sakurization`, etc.)
Use of this quantized model constitutes acceptance of those upstream terms.
Changelog
- v1 (current) — Initial compressed-tensors W8A16 quantization of `Retreatcost/KansenSakura-Erosion-RP-12b` with 512-sample / 2048-token AWQ calibration and vLLM-ready packaging.