Kompress: ModernBERT Token Compressor for LLM Context Windows

Kompress compresses text in LLM context windows so agents can do more with less. It's a drop-in replacement for LLMLingua-2 that's higher quality and 2.3x faster.

Results

Metric                          Kompress                 LLMLingua-2
Quality (Claude-judged, 1-10)   6.9                      6.2
Latency (median)                49ms                     113ms
Speed                           2.3x faster              baseline
Model size                      600MB (150M params)      710MB (179M params)
Max sequence length             8,192 tokens             512 tokens
Architecture                    ModernBERT-base (2024)   mBERT (2018)

Quality on Real Agent Data

Eval Set                      Kompress   LLMLingua-2
Unstructured NL text          6.9/10     6.2/10
Claude Code sessions (text)   6.7/10     5.2/10
Claude Code sessions (raw)    6.6/10     5.2/10

Quality scores are judged by Claude Sonnet 4.6: "Can an LLM fully understand and act on the compressed version?" (1-10 scale).

How It Works

Kompress is a dual-head ModernBERT model trained to classify each token as keep or discard:

  • Token head: Binary classifier (keep/discard per token via argmax)
  • Span head: 1D CNN that identifies important regions, boosts borderline tokens in critical spans

The model decides how much to compress based on content density — no fixed compression ratio.
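The two heads combine at inference time roughly as follows. This is an illustrative numpy sketch, not the project's actual code — the 0.5 threshold and the boost value are assumptions. The point it shows: tokens the token head is unsure about get rescued when the span head marks their region as important, and since whatever clears the threshold is kept, the compression ratio follows content density rather than a fixed target.

```python
import numpy as np

def select_tokens(token_logits, span_scores, boost=0.15):
    """Combine the two heads into per-token keep/discard decisions.

    token_logits: (n, 2) array of [discard, keep] logits from the token head.
    span_scores:  (n,) array in [0, 1] from the span head (CNN + sigmoid).
    """
    # Softmax over the two classes to get a keep probability per token.
    shifted = token_logits - token_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    keep_prob = exp[:, 1] / exp.sum(axis=1)

    # Boost tokens that fall inside high-scoring spans, rescuing
    # borderline tokens in critical regions.
    boosted = keep_prob + boost * (span_scores > 0.5)

    # No fixed ratio: everything that clears the threshold is kept.
    return boosted > 0.5

logits = np.array([[2.0, -1.0],   # clearly discard
                   [0.1, -0.1],   # borderline, inside a span -> rescued
                   [-1.0, 2.0]])  # clearly keep
spans = np.array([0.2, 0.9, 0.8])
print(select_tokens(logits, spans))  # [False  True  True]
```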

Example

ORIGINAL (98 words):
After investigating the memory leak, I traced it to the event listener
registration in the WebSocket handler. Every time a client connects, we
register a new listener on the global event bus, but when the client
disconnects, the cleanup function only removes the WebSocket connection
from the pool — it doesn't unregister the event listener. Over time,
these orphaned listeners accumulate and each one holds a reference to
the connection's closure, which in turn holds the entire request context.
The fix is straightforward: store the listener reference at connection
time and explicitly remove it in the disconnect handler.

COMPRESSED (59 words, 60% kept):
investigating memory leak, traced event listener registration WebSocket
handler. Every time client connects, register new listener global event
bus, client disconnects, cleanup function only removes WebSocket
connection pool — doesn't unregister event listener. Over time, orphaned
listeners accumulate each one holds reference connection's closure, holds
entire request context. fix straightforward: store listener reference
connection time explicitly remove disconnect handler.

An LLM can fully understand and act on the compressed version.

Usage

from kompress.inference.pytorch_runner import KompressRunner

# Auto-downloads from HuggingFace on first use
runner = KompressRunner()

result = runner.compress("Your long text here...")
print(result.compressed)       # Compressed text
print(result.compression_ratio) # e.g., 0.62
print(result.tokens_saved)      # Number of tokens saved

With Headroom (LLM Proxy)

pip install headroom-ai

Kompress is built into Headroom as the default text compressor. It auto-downloads and runs on every API request that passes through the proxy.

Training

Architecture

  • Base: answerdotai/ModernBERT-base (149M params, 8192 token context)
  • Token head: Linear(768, 2) — binary keep/discard classifier
  • Span head: Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid
  • Total: 150M params
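The heads add very little on top of the 149M-parameter base. A quick parameter count — using the standard weight-plus-bias formulas for Linear and Conv1d layers; the helper names are mine — shows the two heads together contribute under 1M parameters, consistent with the 150M total:

```python
def linear_params(n_in, n_out):
    # Weight matrix plus bias vector.
    return n_in * n_out + n_out

def conv1d_params(c_in, c_out, k):
    # Kernel weights (c_in * c_out * k) plus one bias per output channel.
    return c_in * c_out * k + c_out

token_head = linear_params(768, 2)                                   # 1,538
span_head = conv1d_params(768, 256, 5) + conv1d_params(256, 1, 3)    # 984,065
print(token_head + span_head)  # 985,603 — under 1M on top of the base
```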

Data

215K extractive compression labels from 8 diverse datasets, labeled by Claude Sonnet 4.6:

Dataset         Count   Type
LMSYS-Chat-1M   57K     LLM conversations
CNN/DailyMail   50K     News articles
WikiHow         50K     How-to guides
MeetingBank     50K     Meeting transcripts
XSum            47K     News articles
GovReport       25K     Government reports
ArXiv           25K     Academic papers
SAMSum          14K     Dialogues

Labeling Approach

Key insight: the labels must be strictly extractive — a subset of original words in original order. Previous versions failed because the labeling LLM rephrased text, causing alignment failures (5-12% keep ratio instead of the intended 40-60%).

The fix: prompt Claude to "select words like highlighting with a marker" rather than "compress this text." This ensures every word in the compressed output exists in the original, and the greedy alignment recovers 95%+ of the intended labels.
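Because the selection is strictly extractive, alignment reduces to a single greedy left-to-right scan over the original tokens. A minimal sketch (illustrative, not the project's implementation):

```python
def greedy_align(original_tokens, selected_tokens):
    """Recover per-token keep/discard labels from an extractive selection.

    Since the selection is a subset of the original words in original
    order, one left-to-right pass suffices: label a token "keep" when it
    matches the next unconsumed selected word, else "discard".
    """
    labels = []
    i = 0  # cursor into the selected words
    for tok in original_tokens:
        if i < len(selected_tokens) and tok == selected_tokens[i]:
            labels.append(1)  # keep
            i += 1
        else:
            labels.append(0)  # discard
    return labels

orig = "the fix is straightforward store the listener reference".split()
kept = "fix straightforward store listener reference".split()
print(greedy_align(orig, kept))  # [0, 1, 0, 1, 1, 0, 1, 1]
```

Greedy matching can still mislabel repeated words (consuming an earlier duplicate instead of the intended occurrence), which is one reason recovery is 95%+ rather than exact.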

Training Details

  • 3 epochs, batch size 32, learning rate 2e-5
  • BF16 mixed precision on NVIDIA H100
  • HuggingFace Trainer with warmup + cosine schedule
  • ~3 hours training time
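The bullets above map onto a HuggingFace `TrainingArguments` configuration along these lines. This is a sketch: the argument names are standard Trainer options, but any value not stated in the card (e.g. `warmup_ratio`) is an assumption.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kompress-base",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    bf16=True,                    # BF16 mixed precision (H100)
    lr_scheduler_type="cosine",   # warmup + cosine schedule
    warmup_ratio=0.1,             # assumed; the card only says "warmup"
)
```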

Comparison with LLMLingua-2

                    Kompress               LLMLingua-2
Architecture        ModernBERT (2024)      mBERT (2018)
Max context         8,192 tokens           512 tokens
Training data       215K from 8 datasets   41K from MeetingBank
Labeling model      Claude Sonnet 4.6      GPT-4
Compression style   Content-adaptive       Fixed ratio
Quality             6.9/10                 6.2/10
Latency             49ms                   113ms

Limitations

  • English only (ModernBERT is English-focused)
  • Best on natural language text; structured data (JSON, code, logs) should use specialized compressors
  • Compression ratio varies by content (60-80% kept for dense text, 40-60% for verbose text)

License

Apache 2.0
