AI21-Jamba-Reasoning-3B GGUF Models
Model Generation Details
This model was generated using llama.cpp at commit 03792ad93.
Quantization Beyond the IMatrix
I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides.
In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the --tensor-type option in llama.cpp to manually "bump" important layers to higher precision. You can see the implementation here:
👉 Layer bumping with llama.cpp
While this does increase model file size, it significantly improves precision for a given quantization level.
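For illustration only, here is a rough Python sketch of how such a layer bump could be driven through llama.cpp's llama-quantize and its --tensor-type option. The tensor-name patterns, target types, imatrix file, and GGUF file names are placeholders, not the exact recipe used to produce these files.

import subprocess

# Illustrative sketch: quantize to Q4_K_M while bumping selected tensor groups
# to higher-precision types. All patterns and file names below are placeholders,
# and the pattern=type syntax is an assumption about the --tensor-type option.
cmd = [
    "./llama-quantize",
    "--imatrix", "imatrix.dat",             # importance matrix computed beforehand
    "--tensor-type", "attn_v=q8_0",         # example: keep attention V weights at higher precision
    "--tensor-type", "ffn_down=q6_k",       # example: bump the FFN down-projection
    "AI21-Jamba-Reasoning-3B-f16.gguf",     # unquantized input (placeholder name)
    "AI21-Jamba-Reasoning-3B-Q4_K_M.gguf",  # output file (placeholder name)
    "Q4_K_M",                               # base quantization type
]
subprocess.run(cmd, check=True)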
I'd love your feedback—have you tried this? How does it perform for you?
Click here to get info on choosing the right GGUF model format
Introduction
AI21’s Jamba Reasoning 3B is a top-performing reasoning model that packs leading scores on intelligence benchmarks and highly efficient processing into a compact 3B build.
Read the full blog post here.
Key Advantages
Fast: Optimized for efficient sequence processing
The hybrid design combines Transformer attention with Mamba (a state-space model). Mamba layers are more efficient for sequence processing, while attention layers capture complex dependencies. This mix reduces memory overhead, improves throughput, and makes the model run smoothly on laptops, GPUs, and even mobile devices, while maintaining impressive quality.
Smart: Leading intelligence scores
The model outperforms competitors such as Gemma 3 4B, Llama 3.2 3B, and Granite 4.0 Micro on a combined intelligence score that averages six standard benchmarks.
Scalable: Handles very long contexts
Unlike most compact models, Jamba Reasoning 3B supports extremely long contexts. Mamba layers allow the model to process inputs without storing massive attention caches, so it scales to 256K tokens while keeping inference practical. This makes it suitable for edge deployment as well as datacenter workloads.

Model Details
- Number of Parameters: 3B
- Number of Layers: 28 (26 Mamba, 2 Attention)
- Number of Attention Heads: 20 (MQA: 20 query heads, 1 shared KV head)
- Vocabulary Size: 64K
- Context Length: 256K
- Architecture: Hybrid Transformer–Mamba with efficient attention and long-context support
- Developed by: AI21
- Supported languages: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic and Hebrew
- Intelligence benchmark results:
| Model | MMLU-Pro | Humanity’s Last Exam | IFBench |
|---|---|---|---|
| DeepSeek R1 Distill Qwen 1.5B | 27.0% | 3.3% | 13.0% |
| Phi-4 mini | 47.0% | 4.2% | 21.0% |
| Granite 4.0 Micro | 44.7% | 5.1% | 24.8% |
| Llama 3.2 3B | 35.0% | 5.2% | 26.0% |
| Gemma 3 4B | 42.0% | 5.2% | 28.0% |
| Qwen 3 1.7B | 57.0% | 4.8% | 27.0% |
| Qwen 3 4B | 70.0% | 5.1% | 33.0% |
| Jamba Reasoning 3B | 61.0% | 6.0% | 52.0% |
Quickstart
Run the model locally
Please reference the GGUF model card here.
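As a minimal local-inference sketch, the GGUF files can also be loaded with llama-cpp-python (assuming a build recent enough to include Jamba support); the quant file name, context size, and GPU offload settings below are placeholders.

from llama_cpp import Llama

# Load a local GGUF file; the file name and settings are placeholders.
llm = Llama(
    model_path="AI21-Jamba-Reasoning-3B-Q4_K_M.gguf",
    n_ctx=8192,        # raise for long-context use, at the cost of memory
    n_gpu_layers=-1,   # offload all layers if a GPU is available; use 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Classify this support ticket as Critical, Medium, or Low: 'App crashes when uploading files >50MB.'"}],
    temperature=0.6,
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])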
Run the model with vLLM
For best results, we recommend using vLLM version 0.11.0 or higher and enabling --mamba-ssm-cache-dtype=float32:
pip install "vllm>=0.11.0"
Using vLLM in online server mode:
vllm serve "ai21labs/AI21-Jamba-Reasoning-3B" --mamba-ssm-cache-dtype float32 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes
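Once the server is up, it exposes an OpenAI-compatible API. Here is a minimal client sketch; the base URL assumes vLLM's default port 8000, so adjust it to your setup.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ai21labs/AI21-Jamba-Reasoning-3B",
    messages=[{"role": "user", "content": "Classify this support ticket as Critical, Medium, or Low: 'App crashes when uploading files >50MB.'"}],
    temperature=0.6,
    max_tokens=1024,
)
print(response.choices[0].message.content)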
Using vLLM in offline mode:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = "ai21labs/AI21-Jamba-Reasoning-3B"

llm = LLM(model=model,
          tensor_parallel_size=1,
          mamba_ssm_cache_dtype="float32")

tokenizer = AutoTokenizer.from_pretrained(model)

messages = [
    {"role": "user", "content": "You are analyzing customer support tickets to decide which need escalation.\nTicket 1: 'App crashes when uploading files >50MB.'\nTicket 2: 'Forgot password, can’t log in.'\nTicket 3: 'Billing page missing enterprise pricing.'\nClassify each ticket as Critical, Medium, or Low and explain your reasoning.\n"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampling_params = SamplingParams(temperature=0.6, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
Run the model with Transformers
pip install "transformers>=4.54.0"
pip install flash-attn --no-build-isolation
pip install "causal-conv1d>=1.2.0"
pip install mamba-ssm
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-Reasoning-3B",
                                             dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Reasoning-3B")

messages = [
    {"role": "user", "content": "You are analyzing customer support tickets to decide which need escalation.\nTicket 1: 'App crashes when uploading files >50MB.'\nTicket 2: 'Forgot password, can’t log in.'\nTicket 3: 'Billing page missing enterprise pricing.'\nClassify each ticket as Critical, Medium, or Low and explain your reasoning.\n"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

outputs = model.generate(**tokenizer(prompts, return_tensors="pt").to(model.device), do_sample=True, temperature=0.6, max_new_tokens=4096)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
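Since this is a reasoning model (the vLLM command above uses --reasoning-parser deepseek_r1), the raw output may include a thinking trace before the final answer. Below is a small post-processing sketch, assuming DeepSeek-R1-style <think>...</think> tags; the tag format is an assumption, so adjust it to the model's actual output.

# Hypothetical helper: split a reasoning trace from the final answer, assuming the
# model emits <think>...</think> tags (implied by the deepseek_r1 reasoning parser
# used in the vLLM serve command above).
def split_reasoning(text: str, close_tag: str = "</think>"):
    if close_tag in text:
        reasoning, answer = text.split(close_tag, 1)
        return reasoning.strip(), answer.strip()
    return "", text.strip()

reasoning, answer = split_reasoning(generated_text)
print(answer)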
Training Details
We trained the model in multiple stages, each designed to strengthen reasoning and long-context performance. The process began with large-scale pre-training on a diverse corpus of natural documents. We then mid-trained on ~0.5T tokens of math and code, while extending the context length to 32K tokens. During this stage we also applied a Mamba-specific long-context method, which we found to significantly improve long-context abilities.
To improve reasoning, tool use, and instruction following, we applied cold-start distillation: supervised fine-tuning with a 32K window and direct preference optimization with a 64K window. Finally, we enhanced reasoning performance further through online reinforcement learning with RLVR, targeting tasks such as code generation, mathematical problem solving, structured output, and information extraction.
Reinforcement “Fine-Tuning”
Full support for training Jamba through VeRL will be available soon. AI21 has introduced several improvements to the VeRL framework (https://github.com/volcengine/verl), including new capabilities for training hybrid models, and stability improvements for GRPO training. These improvements will soon be available to the open source community.
License
Apache 2.0
Citation
- Blog post: Read the full blog post here
🚀 If you find these models useful
Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks:
The full open source code for the Quantum Network Monitor Service is available in my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models, if you want to do it yourself, in GGUFModelBuilder.
💬 How to test:
Choose an AI assistant type:
- TurboLLM (GPT-4.1-mini)
- HugLLM (Hugging Face open-source models)
- TestLLM (Experimental CPU-only)
What I’m Testing
I’m pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
- Automated Nmap security scans
- Quantum-readiness checks
- Network Monitoring tasks
🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads in a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ ~30s load time (slow inference, but no API costs). No token limit, since the cost is low.
- 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate!
Other Assistants
🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .net code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security Audits
- Penetration testing (Nmap/Metasploit)
🔵 HugLLM – Latest Open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.
💡 Example commands you could test:
"Give me info on my websites SSL certificate""Check if my server is using quantum safe encyption for communication""Run a comprehensive security audit on my server"- '"Create a cmd processor to .. (what ever you want)" Note you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution!
Final Word
I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful.
If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
I'm also open to job opportunities or sponsorship.
Thank you! 😊