KansenSakura-Erosion-RP-12b — Quantized (compressed-tensors for vLLM)

This repository hosts quantized runtime builds of
Retreatcost/KansenSakura-Erosion-RP-12b, repackaged for vLLM using the compressed-tensors format.

TL;DR

  • vLLM-ready compressed-tensors quantization (INT8 weights / 16-bit activations, W8A16).
  • Calibrated with 512 chat samples of up to 2048 tokens each from neuralmagic/LLM_compression_calibration.
  • Uses AWQ (weight-only, group size 128) on linear layers, with lm_head left in higher precision.

Lineage & merges (base model)

  • Base model: Retreatcost/KansenSakura-Erosion-RP-12b — a 12B RP-oriented model in the KansenSakura series, described by the author as the latest “Erosion” iteration with more immersive prose and psychological themes.
  • Upstream merges: Retreatcost’s ecosystem notes that Impish-LongPen-12B (itself a Karcher merge of SuperbEmphasis/MN-12b-RP-Ink-RP-Longform and Sicarius-Prototyping/Impish_Longtail_12B) is used inside KansenSakura-Erosion-RP-12b.
  • Merge ancestry: Hugging Face metadata also lists Retreatcost/Shisa-K-sakurization as a base for Erosion, meaning this quant inherits a layered merge stack rather than a single-source finetune.

This repo does not change any of that training/merge behavior — it only changes how the weights are stored for efficient inference.


Revisions & Branches

The main branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.

  • main — placeholder / landing page
  • W8A16 — INT8 weights / 16-bit activations (compressed-tensors)

Repository contents (per revision)

  • Sharded quantized weights (*.safetensors) + index (model.safetensors.index.json)
  • config.json with compressed-tensors metadata (quantization_config, weight_format, etc.)
  • Tokenizer artifacts (tokenizer.json, tokenizer.model, merges/vocab as applicable)
  • Optional: chat_template.jinja (inherits the base model’s chat style)

Exact file lists may differ between branches — see Files and versions on the model page.
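
As a convenience, here is a minimal sketch of fetching one quant branch locally with huggingface_hub (the revision argument selects the branch; W8A16 is the branch listed above):

from huggingface_hub import snapshot_download

# Download the W8A16 branch (main is only a landing page with no weights).
local_dir = snapshot_download(
    repo_id="TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    revision="W8A16",
)
print(local_dir)  # path containing the sharded safetensors, config, and tokenizer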


Quantization & calibration details

All details below come from the quantization script used to produce this release.

Quantization scheme

  • Library / flow: llmcompressor one-shot pipeline (oneshot) with an AWQModifier.
  • Targets: Linear layers only (targets=["Linear"]).
  • Ignored layers: ["lm_head"] (left in higher precision).
  • Weights:
    • num_bits=8 (INT8 weights)
    • symmetric=True
    • strategy="group" with group_size=128
    • Weight-only quantization (no input/output activation quantization configured).
  • Runtime format: Saved with save_compressed=True so vLLM reads the compressed-tensors layout directly.
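
The parameters above translate into an llmcompressor recipe roughly like the following sketch. This is an assumed reconstruction, not the verbatim script: field names follow current llmcompressor conventions, and model, tokenizer, ds, and SAVE_DIR are placeholders (ds is the calibration set built in the next section).

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    ignore=["lm_head"],              # keep lm_head in higher precision
    config_groups={
        "group_0": {
            "targets": ["Linear"],   # quantize linear layers only
            "weights": {
                "num_bits": 8,       # INT8 weights
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
                # no input/output activation quantization configured -> W8A16
            },
        }
    },
)

oneshot(
    model=model,                     # the loaded base model (placeholder)
    dataset=ds,                      # preprocessed calibration set (see below)
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)  # compressed-tensors layout
tokenizer.save_pretrained(SAVE_DIR)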

Calibration dataset & preprocessing

  • Dataset: neuralmagic/LLM_compression_calibration, split "train".
    • This dataset contains a messages field with structured chat conversations (lists of {role, content} dicts).
  • Samples used:
    • NUM_CALIBRATION_SAMPLES = 512 — randomly shuffled subset with fixed seed 42.
  • Sequence length:
    • MAX_SEQUENCE_LENGTH = 2048 — tokenization truncates sequences at 2048 tokens and does not pad, to keep calibration close to realistic RP-style context lengths.

Preprocessing pipeline

  1. For each example, the script calls tokenizer.apply_chat_template(messages, tokenize=False) to render the conversation using the base model’s chat template.
  2. The rendered text is then tokenized with:
    • max_length=2048, truncation=True, padding=False, add_special_tokens=False.
  3. The processed dataset is passed into oneshot(...) with:
    • max_seq_length=MAX_SEQUENCE_LENGTH
    • num_calibration_samples=NUM_CALIBRATION_SAMPLES.

This yields per-group weight scales (group size 128) that are representative of RP-style, multi-turn chat rather than generic plain-text corpora.
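
A sketch of that preprocessing, assuming the standard datasets/transformers APIs (variable names are illustrative):

from datasets import load_dataset
from transformers import AutoTokenizer

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

tokenizer = AutoTokenizer.from_pretrained("Retreatcost/KansenSakura-Erosion-RP-12b")

# Shuffle with a fixed seed and keep the first 512 samples.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    # Render the structured chat with the base model's template, then tokenize.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)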


Context length

  • Calibration context: up to 2048 tokens, as described above.
  • Model context window: inherited from Retreatcost/KansenSakura-Erosion-RP-12b (see that card for the authoritative max context size). This quantization does not change positional embeddings or RoPE scaling; it only changes the numeric representation of weights.

Quickstart — vLLM (compressed-tensors)

Install vLLM (recent version recommended):

pip install vllm

Serve (adjust to your hardware):

CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 4 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16

Example Chat Completions request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    "messages": [
      {"role": "system", "content": "You are KansenSakura-Erosion, a creative RP assistant. Follow user instructions and respect user safety constraints."},
      {"role": "user", "content": "Describe the setting and tone for the opening scene of our story."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
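
The same request from Python, using the openai client pointed at the local server (the api_key value is arbitrary unless you configured one):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are a creative RP assistant."},
        {"role": "user", "content": "Describe the setting and tone for the opening scene of our story."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)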

Note: compressed-tensors is a vLLM runtime format. Loading directly with vanilla 🤗 Transformers is not supported.
For Transformers, use a compatible GGUF/GPTQ/AWQ export or the full-precision base model.


Prompting / chat template

This package uses the same chat template and tokenizer as Retreatcost/KansenSakura-Erosion-RP-12b.

  • Use a concise system message to describe style, safety constraints, and what kinds of RP you expect.
  • Provide clear user turns; for multi-step scenes, outline beats or constraints in bullet points.

If a chat_template.jinja file is present, libraries that support apply_chat_template (e.g., Transformers or vLLM’s OpenAI wrapper) will automatically format the messages.
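
For example, a minimal rendering check (the tokenizer loads fine from this repo even though the quantized weights themselves require vLLM):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/KansenSakura-Erosion-RP-12b_Compressed-Tensors",
    revision="W8A16",
)
messages = [
    {"role": "system", "content": "Concise style and safety guidance goes here."},
    {"role": "user", "content": "Open the first scene."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the exact string the model is conditioned on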


Intended use & safety note

KansenSakura-series models are role-play focused and may generate intense or mature themes (as indicated by the base repo’s “not for all audiences” flag). This quantization:

  • Does not change underlying behavior or content tendencies.
  • Only changes weight representation for faster / cheaper inference.

You are responsible for applying your own content filters, jailbreak protections, and safety policies, especially in shared or public deployments.


Hardware tips

  • At W8A16, the 12B weights occupy roughly 12–13 GB, so a single 24 GB GPU is typically sufficient; tensor parallelism across GPUs mainly buys throughput and KV-cache headroom.
  • Throughput at long context is dominated by KV-cache memory; tune --max-model-len and batch size for your hardware.
  • Prefer BF16 where supported; otherwise FP16.
  • Enable NVLink / high-bandwidth interconnects for better tensor-parallel scaling.

License & usage

This distribution inherits the licenses, terms, and content policies of:

  • Base model: Retreatcost/KansenSakura-Erosion-RP-12b
  • Any upstream merges listed on that model’s card (e.g., Impish-LongPen-12B, Shisa-K-sakurization, etc.).

Use of this quantized model constitutes acceptance of those upstream terms.


Changelog

  • v1 (current) — Initial compressed-tensors W8A16 quantization of Retreatcost/KansenSakura-Erosion-RP-12b with 512-sample / 2048-token AWQ calibration and vLLM-ready packaging.