# Kompress: ModernBERT Token Compressor for LLM Context Windows
Kompress compresses text in LLM context windows so agents can do more with less. It's a drop-in replacement for LLMLingua-2 that's higher quality and 2.3x faster.
## Results
| Metric | Kompress | LLMLingua-2 |
|---|---|---|
| Quality (Claude-judged, 1-10) | 6.9 | 6.2 |
| Latency (median, Apple Silicon) | 49ms | 113ms |
| Speed | 2.3x faster | baseline |
| Model size | 600MB (150M params) | 710MB (179M params) |
| Max sequence length | 8,192 tokens | 512 tokens |
| Architecture | ModernBERT-base (2024) | mBERT (2018) |
## Quality on Real Agent Data
| Eval Set | Kompress | LLMLingua-2 |
|---|---|---|
| Unstructured NL text | 6.9/10 | 6.2/10 |
| Claude Code sessions (text) | 6.7/10 | 5.2/10 |
| Claude Code sessions (raw) | 6.6/10 | 5.2/10 |
Quality scores are judged by Claude Sonnet 4.6: "Can an LLM fully understand and act on the compressed version?" (1-10 scale).
## How It Works
Kompress is a dual-head ModernBERT model trained to classify each token as keep or discard:
- Token head: Binary classifier (keep/discard per token via argmax)
- Span head: 1D CNN that identifies important regions, boosts borderline tokens in critical spans
The model decides how much to compress based on content density — no fixed compression ratio.
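The two-head decision rule can be sketched as follows. This is a minimal illustration of the idea (argmax from the token head, plus a span-head boost for borderline tokens); the `span_boost` and `margin` thresholds are made up for the example, not the released values:

```python
import numpy as np

def select_tokens(token_logits, span_scores, span_boost=0.5, margin=1.0):
    """Combine the two heads: keep a token if the token head says keep,
    or if it is borderline (small logit margin) but sits inside a
    high-scoring span. Thresholds here are illustrative only."""
    token_logits = np.asarray(token_logits, dtype=float)  # (n, 2): [discard, keep]
    span_scores = np.asarray(span_scores, dtype=float)    # (n,): span head output in [0, 1]
    keep = token_logits.argmax(axis=1) == 1               # plain per-token argmax
    borderline = np.abs(token_logits[:, 1] - token_logits[:, 0]) < margin
    boosted = borderline & (span_scores > span_boost)     # rescue borderline tokens in key spans
    return keep | boosted

# Token 0 is a clear discard, token 1 a clear keep; token 2 is borderline
# but lies in an important span, so the span head rescues it.
logits = [[2.0, -1.0], [-1.0, 2.0], [0.1, -0.2]]
spans = [0.1, 0.9, 0.8]
print(select_tokens(logits, spans).tolist())  # [False, True, True]
```

Because the keep decision is per token rather than top-k, the output length falls out of the content itself, which is what makes the ratio content-adaptive.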
### Example
**ORIGINAL (98 words):**

```text
After investigating the memory leak, I traced it to the event listener
registration in the WebSocket handler. Every time a client connects, we
register a new listener on the global event bus, but when the client
disconnects, the cleanup function only removes the WebSocket connection
from the pool — it doesn't unregister the event listener. Over time,
these orphaned listeners accumulate and each one holds a reference to
the connection's closure, which in turn holds the entire request context.
The fix is straightforward: store the listener reference at connection
time and explicitly remove it in the disconnect handler.
```

**COMPRESSED (59 words, 60% kept):**

```text
investigating memory leak, traced event listener registration WebSocket
handler. Every time client connects, register new listener global event
bus, client disconnects, cleanup function only removes WebSocket
connection pool — doesn't unregister event listener. Over time, orphaned
listeners accumulate each one holds reference connection's closure, holds
entire request context. fix straightforward: store listener reference
connection time explicitly remove disconnect handler.
```
An LLM can fully understand and act on the compressed version.
## Usage

```python
from kompress.inference.pytorch_runner import KompressRunner

# Auto-downloads from HuggingFace on first use
runner = KompressRunner()
result = runner.compress("Your long text here...")

print(result.compressed)          # Compressed text
print(result.compression_ratio)   # e.g., 0.62
print(result.tokens_saved)        # Number of tokens saved
```
### With Headroom (LLM Proxy)

```shell
pip install headroom-ai
```
Kompress is built into Headroom as the default text compressor. It auto-downloads and runs on every API request that passes through the proxy.
## Training
### Architecture

- Base: `answerdotai/ModernBERT-base` (149M params, 8,192-token context)
- Token head: Linear(768, 2), a binary keep/discard classifier
- Span head: Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid
- Total: 150M params
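As a sanity check on the parameter budget, the two heads listed above add under a million parameters (weights plus biases) on top of the 149M base, which is where the 150M total comes from:

```python
# Parameter counts for the two heads described above (weights + biases).
token_head = 768 * 2 + 2              # Linear(768, 2)
span_conv1 = 768 * 256 * 5 + 256      # Conv1d(768 -> 256, kernel size 5)
span_conv2 = 256 * 1 * 3 + 1          # Conv1d(256 -> 1, kernel size 3)
heads = token_head + span_conv1 + span_conv2

print(heads)                          # 985603, i.e. ~0.99M extra parameters
print(149_000_000 + heads)            # ~150M total, matching the spec above
```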
### Data
215K extractive compression labels from 8 diverse datasets, labeled by Claude Sonnet 4.6:
| Dataset | Count | Type |
|---|---|---|
| LMSYS-Chat-1M | 57K | LLM conversations |
| CNN/DailyMail | 50K | News articles |
| WikiHow | 50K | How-to guides |
| MeetingBank | 50K | Meeting transcripts |
| XSum | 47K | News articles |
| GovReport | 25K | Government reports |
| ArXiv | 25K | Academic papers |
| SAMSum | 14K | Dialogues |
### Labeling Approach
Key insight: the labels must be strictly extractive — a subset of original words in original order. Previous versions failed because the labeling LLM rephrased text, causing alignment failures (5-12% keep ratio instead of the intended 40-60%).
The fix: prompt Claude to "select words like highlighting with a marker" rather than "compress this text." This ensures every word in the compressed output exists in the original, and the greedy alignment recovers 95%+ of the intended labels.
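The greedy alignment step can be sketched as below: walk the original left to right, matching each compressed word to its next occurrence, and emit per-word keep labels. This is a word-level illustration of the idea (the real pipeline aligns tokenizer tokens); any compressed word that cannot be matched in order, i.e. one the labeling LLM rephrased, is simply dropped, which is exactly the alignment failure the extractive prompt avoids:

```python
def greedy_align(original_words, compressed_words):
    """Greedily match each compressed word to its next occurrence in the
    original, producing binary keep labels per original word. Words the
    labeler rephrased never match and contribute no label."""
    labels = [0] * len(original_words)
    i = 0
    for w in compressed_words:
        while i < len(original_words) and original_words[i] != w:
            i += 1
        if i < len(original_words):
            labels[i] = 1
            i += 1

    return labels

orig = "the fix is straightforward store the listener reference".split()
comp = "fix straightforward store listener reference".split()
print(greedy_align(orig, comp))  # [0, 1, 0, 1, 1, 0, 1, 1]
```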
### Training Details
- 3 epochs, batch size 32, learning rate 2e-5
- BF16 mixed precision on NVIDIA H100
- HuggingFace Trainer with warmup + cosine schedule
- ~3 hours training time
## Comparison with LLMLingua-2
| | Kompress | LLMLingua-2 |
|---|---|---|
| Architecture | ModernBERT (2024) | mBERT (2018) |
| Max context | 8,192 tokens | 512 tokens |
| Training data | 215K from 8 datasets | 41K from MeetingBank |
| Labeling model | Claude Sonnet 4.6 | GPT-4 |
| Compression style | Content-adaptive | Fixed ratio |
| Quality | 6.9/10 | 6.2/10 |
| Latency | 49ms | 113ms |
## Limitations
- English only (ModernBERT is English-focused)
- Best on natural language text; structured data (JSON, code, logs) should use specialized compressors
- Compression ratio varies by content (60-80% kept for dense text, 40-60% for verbose text)
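Given the limitation on structured data, a caller might route content before compressing. The helper below is a hypothetical heuristic, not part of Kompress: it skips JSON payloads and code- or log-like text, and its checks and thresholds are illustrative only:

```python
import json

def should_compress(text):
    """Heuristic router (illustrative, not part of Kompress): return False
    for structured payloads that a prose compressor is not designed for."""
    stripped = text.strip()
    # A parseable JSON object/array is structured data, not prose.
    if stripped[:1] in "{[":
        try:
            json.loads(stripped)
            return False
        except ValueError:
            pass
    # Many lines ending in ; { } suggest code or logs rather than prose.
    lines = [l for l in stripped.splitlines() if l.strip()]
    codeish = sum(1 for l in lines if l.rstrip().endswith((";", "{", "}")))
    if lines and codeish / len(lines) > 0.3:
        return False
    return True

print(should_compress('{"a": 1}'))                                  # False
print(should_compress("After investigating the memory leak, ..."))  # True
```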
## License
Apache 2.0
## Evaluation results

Self-reported:

- Quality Score (Claude-judged): 6.9
- LLMLingua-2 Quality Score: 6.2
- Latency (median, Apple Silicon): 49ms
- LLMLingua-2 Latency: 113ms