Kompress: ModernBERT Token Compressor for LLM Context Windows

Kompress compresses text in LLM context windows so agents can do more with less. It's a drop-in replacement for LLMLingua-2 that's higher quality and 2.3x faster.

Results

Metric                          Kompress                 LLMLingua-2
Quality (Claude-judged, 1-10)   6.9                      6.2
Latency (median)                49ms                     113ms
Speed                           2.3x faster              baseline
Model size                      600MB (150M params)      710MB (179M params)
Max sequence length             8,192 tokens             512 tokens
Architecture                    ModernBERT-base (2024)   mBERT (2018)

Quality on Real Agent Data

Eval Set                      Kompress   LLMLingua-2
Unstructured NL text          6.9/10     6.2/10
Claude Code sessions (text)   6.7/10     5.2/10
Claude Code sessions (raw)    6.6/10     5.2/10

Quality scores are judged by Claude Sonnet 4.6: "Can an LLM fully understand and act on the compressed version?" (1-10 scale).

How It Works

Kompress is a dual-head ModernBERT model trained to classify each token as keep or discard:

  • Token head: Binary classifier (keep/discard per token via argmax)
  • Span head: 1D CNN that identifies important regions, boosts borderline tokens in critical spans

The model decides how much to compress based on content density — no fixed compression ratio.
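The two heads combine at inference time roughly as follows. This is an illustrative numpy sketch, not the project's actual code — the 0.5 threshold and the boost value are assumptions. The point it shows: tokens the token head is unsure about get rescued when the span head marks their region as important, and since whatever clears the threshold is kept, the compression ratio follows content density rather than a fixed target.

```python
import numpy as np

def select_tokens(token_logits, span_scores, boost=0.15):
    """Combine the two heads into per-token keep/discard decisions.

    token_logits: (n, 2) array of [discard, keep] logits from the token head.
    span_scores:  (n,) array in [0, 1] from the span head (CNN + sigmoid).
    """
    # Softmax over the two classes to get a keep probability per token.
    shifted = token_logits - token_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    keep_prob = exp[:, 1] / exp.sum(axis=1)

    # Boost tokens that fall inside high-scoring spans, rescuing
    # borderline tokens in critical regions.
    boosted = keep_prob + boost * (span_scores > 0.5)

    # No fixed ratio: everything that clears the threshold is kept.
    return boosted > 0.5

logits = np.array([[2.0, -1.0],   # clearly discard
                   [0.1, -0.1],   # borderline, inside a span -> rescued
                   [-1.0, 2.0]])  # clearly keep
spans = np.array([0.2, 0.9, 0.8])
print(select_tokens(logits, spans))  # [False  True  True]
```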

Example

ORIGINAL (98 words):
After investigating the memory leak, I traced it to the event listener
registration in the WebSocket handler. Every time a client connects, we
register a new listener on the global event bus, but when the client
disconnects, the cleanup function only removes the WebSocket connection
from the pool — it doesn't unregister the event listener. Over time,
these orphaned listeners accumulate and each one holds a reference to
the connection's closure, which in turn holds the entire request context.
The fix is straightforward: store the listener reference at connection
time and explicitly remove it in the disconnect handler.

COMPRESSED (59 words, 60% kept):
investigating memory leak, traced event listener registration WebSocket
handler. Every time client connects, register new listener global event
bus, client disconnects, cleanup function only removes WebSocket
connection pool — doesn't unregister event listener. Over time, orphaned
listeners accumulate each one holds reference connection's closure, holds
entire request context. fix straightforward: store listener reference
connection time explicitly remove disconnect handler.

An LLM can fully understand and act on the compressed version.

Usage

from kompress.inference.pytorch_runner import KompressRunner

# Auto-downloads from HuggingFace on first use
runner = KompressRunner()

result = runner.compress("Your long text here...")
print(result.compressed)       # Compressed text
print(result.compression_ratio) # e.g., 0.62
print(result.tokens_saved)      # Number of tokens saved

With Headroom (LLM Proxy)

pip install headroom-ai

Kompress is built into Headroom as the default text compressor. It auto-downloads and runs on every API request that passes through the proxy.

Training

Architecture

  • Base: answerdotai/ModernBERT-base (149M params, 8192 token context)
  • Token head: Linear(768, 2) — binary keep/discard classifier
  • Span head: Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid
  • Total: 150M params
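The heads add very little on top of the 149M-parameter base. A quick parameter count — using the standard weight-plus-bias formulas for Linear and Conv1d layers; the helper names are mine — shows the two heads together contribute under 1M parameters, consistent with the 150M total:

```python
def linear_params(n_in, n_out):
    # Weight matrix plus bias vector.
    return n_in * n_out + n_out

def conv1d_params(c_in, c_out, k):
    # Kernel weights (c_in * c_out * k) plus one bias per output channel.
    return c_in * c_out * k + c_out

token_head = linear_params(768, 2)                                   # 1,538
span_head = conv1d_params(768, 256, 5) + conv1d_params(256, 1, 3)    # 984,065
print(token_head + span_head)  # 985,603 — under 1M on top of the base
```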

Data

215K extractive compression labels from 8 diverse datasets, labeled by Claude Sonnet 4.6:

Dataset         Count   Type
LMSYS-Chat-1M   57K     LLM conversations
CNN/DailyMail   50K     News articles
WikiHow         50K     How-to guides
MeetingBank     50K     Meeting transcripts
XSum            47K     News articles
GovReport       25K     Government reports
ArXiv           25K     Academic papers
SAMSum          14K     Dialogues

Labeling Approach

Key insight: the labels must be strictly extractive — a subset of original words in original order. Previous versions failed because the labeling LLM rephrased text, causing alignment failures (5-12% keep ratio instead of the intended 40-60%).

The fix: prompt Claude to "select words like highlighting with a marker" rather than "compress this text." This ensures every word in the compressed output exists in the original, and the greedy alignment recovers 95%+ of the intended labels.
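Because the selection is strictly extractive, alignment reduces to a single greedy left-to-right scan over the original tokens. A minimal sketch (illustrative, not the project's implementation):

```python
def greedy_align(original_tokens, selected_tokens):
    """Recover per-token keep/discard labels from an extractive selection.

    Since the selection is a subset of the original words in original
    order, one left-to-right pass suffices: label a token "keep" when it
    matches the next unconsumed selected word, else "discard".
    """
    labels = []
    i = 0  # cursor into the selected words
    for tok in original_tokens:
        if i < len(selected_tokens) and tok == selected_tokens[i]:
            labels.append(1)  # keep
            i += 1
        else:
            labels.append(0)  # discard
    return labels

orig = "the fix is straightforward store the listener reference".split()
kept = "fix straightforward store listener reference".split()
print(greedy_align(orig, kept))  # [0, 1, 0, 1, 1, 0, 1, 1]
```

Greedy matching can still mislabel repeated words (consuming an earlier duplicate instead of the intended occurrence), which is one reason recovery is 95%+ rather than exact.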

Training Details

  • 3 epochs, batch size 32, learning rate 2e-5
  • BF16 mixed precision on NVIDIA H100
  • HuggingFace Trainer with warmup + cosine schedule
  • ~3 hours training time
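The bullets above map onto a HuggingFace `TrainingArguments` configuration along these lines. This is a sketch: the argument names are standard Trainer options, but any value not stated in the card (e.g. `warmup_ratio`) is an assumption.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kompress-base",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    bf16=True,                    # BF16 mixed precision (H100)
    lr_scheduler_type="cosine",   # warmup + cosine schedule
    warmup_ratio=0.1,             # assumed; the card only says "warmup"
)
```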

Comparison with LLMLingua-2

                    Kompress               LLMLingua-2
Architecture        ModernBERT (2024)      mBERT (2018)
Max context         8,192 tokens           512 tokens
Training data       215K from 8 datasets   41K from MeetingBank
Labeling model      Claude Sonnet 4.6      GPT-4
Compression style   Content-adaptive       Fixed ratio
Quality             6.9/10                 6.2/10
Latency             49ms                   113ms

Limitations

  • English only (ModernBERT is English-focused)
  • Best on natural language text; structured data (JSON, code, logs) should use specialized compressors
  • Compression ratio varies by content (60-80% kept for dense text, 40-60% for verbose text)

License

Apache 2.0
