KansenSakura-Erosion-RP-12b — Quantized (compressed-tensors for vLLM)
This repository hosts quantized runtime builds of `Retreatcost/KansenSakura-Erosion-RP-12b`, repackaged for vLLM using the compressed-tensors format.
TL;DR
- vLLM-ready compressed-tensors quantization (INT8 weights / 16-bit activations, W8A16).
- Calibrated with 512 chat samples of up to 2048 tokens each from `neuralmagic/LLM_compression_calibration`.
- Uses AWQ (weight-only, group size 128) on linear layers, with `lm_head` left in higher precision.
Lineage & merges (base model)
- Base model: `Retreatcost/KansenSakura-Erosion-RP-12b` — a 12B RP-oriented model in the KansenSakura series, described by the author as the latest “Erosion” iteration with more immersive prose and psychological themes.
- Upstream merges: Retreatcost’s ecosystem notes that Impish-LongPen-12B (itself a Karcher merge of `SuperbEmphasis/MN-12b-RP-Ink-RP-Longform` and `Sicarius-Prototyping/Impish_Longtail_12B`) is used inside KansenSakura-Erosion-RP-12b.
- Merge ancestry: Hugging Face metadata also lists `Retreatcost/Shisa-K-sakurization` as a base for Erosion, meaning this quant inherits a layered merge stack rather than a single-source finetune.
This repo does not change any of that training/merge behavior — it only changes how the weights are stored for efficient inference.
Revisions & Branches
The `main` branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.
- main — placeholder / landing page
- W8A16 — INT8 weights / 16-bit activations (compressed-tensors)
Quick links
- main: https://huggingface.co/TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors/tree/main
- W8A16: https://huggingface.co/TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors/tree/W8A16
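To fetch a specific branch programmatically, `huggingface_hub`'s `snapshot_download` accepts a `revision` argument (a minimal sketch; the printed path is wherever the Hub cache places the files):

```python
from huggingface_hub import snapshot_download

# Download only the W8A16 branch; main is just a landing page.
path = snapshot_download(
    repo_id="TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    revision="W8A16",
)
print(path)  # local directory with the sharded safetensors + configs
```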
Repository contents (per revision)
- Sharded quantized weights (`*.safetensors`) + index (`model.safetensors.index.json`)
- `config.json` with compressed-tensors metadata (`quantization_config`, `weight_format`, etc.)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)
- Optional: `chat_template.jinja` (inherits the base model’s chat style)
Exact file lists may differ between branches — see Files and versions on the model page.
Quantization & calibration details
All details below are taken from the quantization script used to produce this build.
Quantization scheme
- Library / flow: `llmcompressor` one-shot pipeline (`oneshot`) with an `AWQModifier`.
- Targets: `Linear` layers only (`targets=["Linear"]`).
- Ignored layers: `["lm_head"]` (left in higher precision).
- Weights: `num_bits=8` (INT8 weights), `symmetric=True`, `strategy="group"` with `group_size=128`.
- Weight-only quantization (no input/output activation quantization configured).
- Runtime format: saved with `save_compressed=True` so vLLM reads the compressed-tensors layout directly.
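For reference, a recipe along these lines reproduces those settings (a minimal sketch, not the exact script; it assumes a recent `llmcompressor` where `AWQModifier` accepts an explicit `config_groups` mapping):

```python
from llmcompressor.modifiers.awq import AWQModifier

# Sketch of the recipe described above (field names follow compressed-tensors).
recipe = AWQModifier(
    targets=["Linear"],   # quantize linear layers only
    ignore=["lm_head"],   # keep the output head in higher precision
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 8,       # INT8 weights
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
                # no input/output activation quantization -> W8A16 at runtime
            },
        }
    },
)
```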
Calibration dataset & preprocessing
- Dataset: `neuralmagic/LLM_compression_calibration`, split `"train"`. Each example has a `messages` field with a structured chat conversation (a list of `{role, content}` dicts).
- Samples used: `NUM_CALIBRATION_SAMPLES = 512` — a randomly shuffled subset with fixed seed `42`.
- Sequence length: `MAX_SEQUENCE_LENGTH = 2048` — tokenization truncates sequences at 2048 tokens and does not pad, to keep calibration close to realistic RP-style context lengths.
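Loading and subsetting that dataset looks roughly like this (a sketch using the constants above):

```python
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512

# Shuffle with the fixed seed, then keep the first 512 conversations.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
```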
Preprocessing pipeline
- For each example, the script calls `tokenizer.apply_chat_template(messages, tokenize=False)` to render the conversation using the base model’s chat template.
- The rendered text is then tokenized with `max_length=2048`, `truncation=True`, `padding=False`, `add_special_tokens=False`.
- The processed dataset is passed into `oneshot(...)` with `max_seq_length=MAX_SEQUENCE_LENGTH` and `num_calibration_samples=NUM_CALIBRATION_SAMPLES`.
This yields per-group weight scales (group size 128) that are representative of RP-style, multi-turn chat rather than generic plain-text corpora.
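Put together, the preprocessing and one-shot call look roughly like this (a sketch continuing from the recipe and dataset snippets above; the output directory name is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot

MODEL_ID = "Retreatcost/KansenSakura-Erosion-RP-12b"
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def preprocess(example):
    # Render the chat with the base model's template, then tokenize.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,  # AWQModifier from the sketch above
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# save_compressed=True writes the compressed-tensors layout vLLM expects.
model.save_pretrained("KansenSakura-Erosion-RP-12b-W8A16", save_compressed=True)
tokenizer.save_pretrained("KansenSakura-Erosion-RP-12b-W8A16")
```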
Context length
- Calibration context: up to 2048 tokens, as described above.
- Model context window: inherited from `Retreatcost/KansenSakura-Erosion-RP-12b` (see that card for the authoritative max context size). This quantization does not change positional embeddings or RoPE scaling; it only changes the numeric representation of the weights.
Quickstart — vLLM (compressed-tensors)
Install vLLM (recent version recommended):
```bash
pip install vllm
```
Serve (adjust to your hardware):
```bash
# Weights live in the W8A16 branch; main is only a landing page.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors \
  --revision W8A16 \
  --quantization compressed-tensors \
  --tensor-parallel-size 4 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
```
Example Chat Completions request:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    "messages": [
      {"role": "system", "content": "You are KansenSakura-Erosion, a creative RP assistant. Follow user instructions and respect user safety constraints."},
      {"role": "user", "content": "Describe the setting and tone for the opening scene of our story."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
```
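The same endpoint works with the OpenAI Python client (a sketch assuming the server above on localhost:8000; vLLM ignores the API key by default, so any placeholder works):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are KansenSakura-Erosion, a creative RP assistant."},
        {"role": "user", "content": "Describe the setting and tone for the opening scene of our story."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```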
Note: `compressed-tensors` is a vLLM runtime format. Loading directly with vanilla 🤗 Transformers is not supported. For Transformers, use a compatible GGUF/GPTQ/AWQ export or the full-precision base model.
Prompting / chat template
This package uses the same chat template and tokenizer as `Retreatcost/KansenSakura-Erosion-RP-12b`.
- Use a concise system message to describe style, safety constraints, and what kinds of RP you expect.
- Provide clear user turns; for multi-step scenes, outline beats or constraints in bullet points.
If a `chat_template.jinja` file is present, libraries that support `apply_chat_template` (e.g., Transformers or vLLM’s OpenAI wrapper) will format the messages automatically.
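For offline prompt construction, the rendering step looks like this (a sketch; the message content is illustrative, and the tokenizer is pulled from the W8A16 branch since `main` carries no artifacts):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    revision="W8A16",  # tokenizer artifacts live in the per-quant branches
)

messages = [
    {"role": "system", "content": "Concise RP narration; keep scenes grounded."},
    {"role": "user", "content": "Open the scene at a rain-soaked train platform."},
]

# Render the conversation with the inherited chat template.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```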
Intended use & safety note
KansenSakura-series models are role-play focused and may generate intense or mature themes (as indicated by the base repo’s “not for all audiences” flag). This quantization:
- Does not change underlying behavior or content tendencies.
- Only changes weight representation for faster / cheaper inference.
You are responsible for applying your own content filters, jailbreak protections, and safety policies, especially in shared or public deployments.
Hardware tips
- 12B models run best on multi-GPU setups or a single high-VRAM GPU when using compressed-tensors.
- Throughput at long context is dominated by KV-cache memory; tune `--max-model-len` and batch size for your hardware.
- Prefer BF16 where supported; otherwise FP16.
- Enable NVLink / high-bandwidth interconnects for better tensor-parallel scaling.
License & usage
This distribution inherits the licenses, terms, and content policies of:
- Base model: `Retreatcost/KansenSakura-Erosion-RP-12b`
- Any upstream merges listed on that model’s card (e.g., `Impish-LongPen-12B`, `Shisa-K-sakurization`, etc.)
Use of this quantized model constitutes acceptance of those upstream terms.
Changelog
- v1 (current) — Initial compressed-tensors W8A16 quantization of `Retreatcost/KansenSakura-Erosion-RP-12b` with 512-sample / 2048-token AWQ calibration and vLLM-ready packaging.