QORA - Native Rust LLM Inference Engine

Pure Rust inference engine for the SmolLM3-3B language model. No Python runtime, no CUDA, no external dependencies. Single executable + quantized weights = portable AI on any machine.

Now with GPU acceleration! Auto-detects Vulkan-compatible GPUs for ~4.8x faster inference, with intelligent fallback to CPU.

Overview

Property	Value
Engine	QORA (Pure Rust)
Base Model	SmolLM3-3B (HuggingFaceTB/SmolLM3-3B)
Parameters	3.07 Billion
Quantization	Q4 (4-bit symmetric, group_size=32)
Model Size	1.68 GB (Q4)
Executable	~37 MB (GPU+CPU) / ~7 MB (CPU-only)
Context Length	65,536 tokens (up to 128K with YARN)
Platform	Windows x86_64 (GPU auto-detect + CPU fallback)
GPU Backend	Vulkan (via Burn wgpu)

GPU Acceleration

QORA automatically detects Vulkan-compatible GPUs and uses them for inference. If no GPU is available or VRAM is insufficient, it falls back to CPU seamlessly.

GPU Requirements

Requirement	Value
Minimum VRAM	~2.5 GB (Q4 weights + KV cache + activations)
GPU API	Vulkan 1.1+
Tested On	GTX 1660 SUPER (6 GB VRAM)

GPU Performance

Metric	GPU	CPU	Speedup
Decode Speed	~4.1 tok/s	~0.86 tok/s	~4.8x
VRAM Usage	~2.3 GB	—	—

VRAM Safety

QORA includes intelligent VRAM management:

Pre-flight check: Probes GPU with a 256 MB test allocation before loading the full model
Estimated VRAM: Prints estimated VRAM requirement before loading
Panic recovery: If the GPU runs out of memory during inference, catches the error and falls back to CPU
Manual override: Use --cpu flag to skip GPU and run on CPU directly

Architecture

SmolLM3-3B is a decoder-only transformer with several advanced features:

Component	Details
Layers	36 decoder layers
Hidden Size	2,048
Attention Heads	16 (Query) / 4 (KV) — Grouped Query Attention
Head Dimension	128
MLP (Intermediate)	11,008 (SwiGLU: gate + up + down)
Vocabulary	128,256 tokens
Normalization	RMSNorm (eps=1e-6)
Position Encoding	NoPE scheme — RoPE on every 4th layer only (9/36 layers)
RoPE Theta	5,000,000
Activation	SiLU (Sigmoid Linear Unit)
Embeddings	Tied (input = output projection)

Instruction Model

No Extended Thinking

Evaluation results of non reasoning models and reasoning models in no thinking mode. We highlight the best and second-best scores in bold.

Category	Metric	QORA-LLM-3B	Qwen2.5-3B	Llama3.1-3B	Qwen3-1.7B	Qwen3-4B
High school math competition	AIME 2025	9.3	2.9	0.3	8.0	17.1
Math problem-solving	GSM-Plus	72.8	74.1	59.2	68.3	82.1
Competitive programming	LiveCodeBench v4	15.2	10.5	3.4	15.0	24.9
Graduate-level reasoning	GPQA Diamond	35.7	32.2	29.4	31.8	44.4
Instruction following	IFEval	76.7	65.6	71.6	74.0	68.9
Alignment	MixEval Hard	26.9	27.6	24.9	24.3	31.6
Tool Calling	BFCL	92.3	-	92.3 *	89.5	95.0
Multilingual Q&A	Global MMLU	53.5	50.54	46.8	49.5	65.1

(*): this is a tool calling finetune

Extended Thinking

Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:

Category	Metric	QORA-LLM-3B	Qwen3-1.7B	Qwen3-4B
High school math competition	AIME 2025	36.7	30.7	58.8
Math problem-solving	GSM-Plus	83.4	79.4	88.2
Competitive programming	LiveCodeBench v4	30.0	34.4	52.9
Graduate-level reasoning	GPQA Diamond	41.7	39.9	55.3
Instruction following	IFEval	71.2	74.2	85.4
Alignment	MixEval Hard	30.8	33.9	38.0
Tool Calling	BFCL	88.8	88.8	95.5
Multilingual Q&A	Global MMLU	64.1	62.3	73.3

Base Pre-Trained Model

English benchmarks

Note: All evaluations are zero-shot unless stated otherwise. For Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.

Category	Metric	QORA-LLM-3B	Qwen2.5-3B	Llama3-3.2B	Qwen3-1.7B-Base	Qwen3-4B-Base
Reasoning & Commonsense	HellaSwag	76.15	74.19	75.52	60.52	74.37
	ARC-CF (Average)	65.61	59.81	58.58	55.88	62.11
	Winogrande	58.88	61.41	58.72	57.06	59.59
	CommonsenseQA	55.28	49.14	60.60	48.98	52.99
Knowledge & Understanding	MMLU-CF (Average)	44.13	42.93	41.32	39.11	47.65
	MMLU Pro CF	19.61	16.66	16.42	18.04	24.92
	MMLU Pro MCF	32.70	31.32	25.07	30.39	41.07
	PIQA	78.89	78.35	78.51	75.35	77.58
	OpenBookQA	40.60	40.20	42.00	36.40	42.40
	BoolQ	78.99	73.61	75.33	74.46	74.28
Math & Code
Coding & math	HumanEval+	30.48	34.14	25.00	43.29	54.87
	MBPP+	52.91	52.11	38.88	59.25	63.75
	MATH (4-shot)	46.10	40.10	7.44	41.64	51.20
	GSM8k (5-shot)	67.63	70.13	25.92	65.88	74.14
Long context
	Ruler 32k	76.35	75.93	77.58	70.63	83.98
	Ruler 64k	67.85	64.90	72.93	57.18	60.29
	Ruler 128k	61.03	62.23	71.30	43.03	47.23

Multilingual benchmarks

Category	Metric	QORA-LLM-3B	Qwen2.5-3B	Llama3.2 3B	Qwen3 1.7B Base	Qwen3 4B Base
Main supported languages
French	MLMM Hellaswag	63.94	57.47	57.66	51.26	61.00
	Belebele	51.00	51.55	49.22	49.44	55.00
	Global MMLU (CF)	38.37	34.22	33.71	34.94	41.80
	Flores-200 (5-shot)	62.85	61.38	62.89	58.68	65.76
Spanish	MLMM Hellaswag	65.85	58.25	59.39	52.40	61.85
	Belebele	47.00	48.88	47.00	47.56	50.33
	Global MMLU (CF)	38.51	35.84	35.60	34.79	41.22
	Flores-200 (5-shot)	48.25	50.00	44.45	46.93	50.16
German	MLMM Hellaswag	59.56	49.99	53.19	46.10	56.43
	Belebele	48.44	47.88	46.22	48.00	53.44
	Global MMLU (CF)	35.10	33.19	32.60	32.73	38.70
	Flores-200 (5-shot)	56.60	50.63	54.95	52.58	50.48
Italian	MLMM Hellaswag	62.49	53.21	54.96	48.72	58.76
	Belebele	46.44	44.77	43.88	44.00	48.78
	Global MMLU (CF)	36.99	33.91	32.79	35.37	39.26
	Flores-200 (5-shot)	52.65	54.87	48.83	48.37	49.11
Portuguese	MLMM Hellaswag	63.22	57.38	56.84	50.73	59.89
	Belebele	47.67	49.22	45.00	44.00	50.00
	Global MMLU (CF)	36.88	34.72	33.05	35.26	40.66
	Flores-200 (5-shot)	60.93	57.68	54.28	56.58	63.43

The model has also been trained on Arabic (standard), Chinese and Russian data, but has seen fewer tokens in these languages compared to the 6 above. We report the performance on these langages for information.

Category	Metric	QORA-LLM-3B	Qwen2.5-3B	Llama3.2 3B	Qwen3 1.7B Base	Qwen3 4B Base
Other supported languages
Arabic	Belebele	40.22	44.22	45.33	42.33	51.78
	Global MMLU (CF)	28.57	28.81	27.67	29.37	31.85
	Flores-200 (5-shot)	40.22	39.44	44.43	35.82	39.76
Chinese	Belebele	43.78	44.56	49.56	48.78	53.22
	Global MMLU (CF)	36.16	33.79	39.57	38.56	44.55
	Flores-200 (5-shot)	29.17	33.21	31.89	25.70	32.50
Russian	Belebele	47.44	45.89	47.44	45.22	51.44
	Global MMLU (CF)	36.51	32.47	34.52	34.83	38.80
	Flores-200 (5-shot)	47.13	48.74	50.74	54.70	60.53

Key Architectural Innovation: NoPE (No Position Encoding)

SmolLM3 uses a 3:1 NoPE ratio — 75% of layers have no positional encoding at all. Only layers 3, 7, 11, 15, 19, 23, 27, 31, 35 apply RoPE. This reduces computational overhead and enables better long-context generalization.

Files

model/
  qora.exe          — ~37 MB    Inference engine (GPU+CPU, single binary)
  model.qora        — 1.68 GB   Q4 quantized weights (4-bit)
  tokenizer.json    — 16.4 MB   Tokenizer vocabulary
  config.json       — 540 B     Model configuration
  README.md         — This file

Quick Start

For the fastest results, use --no-think --greedy:

.\qora.exe --prompt "What is X?" --no-think --greedy

This skips the thinking phase and uses deterministic decoding — you get a direct answer immediately.

Tip: Think mode produces better answers for complex questions (math, coding, reasoning) but uses 100-300+ tokens just for thinking before the answer appears. For simple factual questions, --no-think is much faster.

Usage

# Fastest: direct answer, no thinking, deterministic
qora.exe --prompt "What is the capital of France?" --no-think --greedy

# Fast: direct answer with some randomness
qora.exe --prompt "Tell me about Mars" --no-think

# Full quality: long reply mode, no thinking
qora.exe --prompt "Tell me story of Bitcoin and it's creation" --no-think --greedy

# See what the model is thinking
qora.exe --prompt "What is 2+2?" --show-think

# Force CPU (skip GPU auto-detect)
qora.exe --prompt "Hello" --cpu

# Control output length
qora.exe --prompt "Tell me a story" --max-tokens 512

# Raw text completion (no chat template)
qora.exe --prompt "Once upon a time" --raw --max-tokens 128

# Custom model path
qora.exe --load path/to/model.qora --prompt "Hello"

CLI Arguments

Flag	Default	Description
`--load <path>`	`model/model.qora`	Path to .qora binary
`--prompt <text>`	"Hello, how are you?"	Input prompt
`--max-tokens <n>`	1024	Maximum tokens to generate
`--no-think`	off	Disable thinking mode (faster, direct answers)
`--greedy`	off	Greedy decoding (temperature=0, deterministic)
`--show-think`	off	Display thinking content on stderr
`--raw`	off	Raw text completion (no chat template)
`--cpu`	off	Force CPU inference (skip GPU auto-detect)

Speed Tips

Mode	Speed (GPU)	Speed (CPU)	Best For
`--no-think --greedy`	~4.1 tok/s	~1 tok/s	Fastest. Simple factual questions.
`--no-think`	~4.1 tok/s	~1 tok/s	Fast with variety. General questions.
`--show-think`	~4.1 tok/s	~1 tok/s	See reasoning. Complex questions.
(default think mode)	~4.1 tok/s	~1 tok/s	Best quality but thinking uses 100-300+ tokens before answer appears.

Performance Benchmarks

Inference Speed

Tested on i5-11500 (6C/12T, AVX-512), 16GB RAM, GTX 1660 SUPER (6GB), Windows 11.

Metric	GPU	CPU
Model Load (binary)	~3-17s	~3-17s
Prefill Speed	—	~1.3-2.2 tok/s
Decode Speed	~4.1 tok/s	~1.1 tok/s
Single Decode Step	~244ms	~830ms
Memory (Q4)	~2.3 GB VRAM	1,681 MB RAM
Memory (F16)	~6 GB VRAM	~6,000 MB RAM

Quality Test Results

All tests run with --greedy --no-think for deterministic, reproducible results.

Test 1: Factual Knowledge

Prompt: "What is the capital of France? Answer in one sentence."

Output:

The capital of France is Paris.

Metric	Value
Correctness	PASS
Tokens Generated	7
Prefill	123 tokens in 269.1s

Test 2: Mathematical Reasoning

Prompt: "Solve step by step: If a train travels at 60 km/h for 2.5 hours, how far does it travel?"

Output:

To solve this problem, we can use the formula for distance:

Distance = Speed x Time

Given: Speed = 60 km/h Time = 2.5 hours

Step 1: Plug the values into the formula Distance = 60 km/h x 2.5 hours

Step 2: Perform the multiplication Distance = 150 km

So, the train travels 150 kilometers in 2.5 hours.

Metric	Value
Correctness	PASS (150 km is correct)
Tokens Generated	110
Decode Speed	0.20 tok/s

Test 3: Code Generation

Prompt: "Write a Python function to check if a number is prime. Include docstring."

Output:

def is_prime(n):
    """
    Checks if a number is prime.

    Parameters:
    n (int): The number to check.

    Returns:
    bool: True if the number is prime, False otherwise.

    Raises:
    ValueError: If n is less than 1, since prime numbers are
    defined for natural numbers greater than 1.
    """
    if n < 1:
        raise ValueError("Prime numbers are defined for natural numbers greater than 1.")
    if n == 1 or n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True

Metric	Value
Correctness	PARTIAL (n==1 should return False; otherwise correct algorithm)
Tokens Generated	200
Decode Speed	0.29 tok/s
Code Quality	Good structure, docstring, error handling, efficient trial division

Test 4: Explanation / Simplification

Prompt: "Explain quantum computing in 3 sentences for a 10-year-old."

Output:

Quantum computing is like a super-fast version of regular computers that use tiny particles called qubits to perform calculations much faster than today's computers. In quantum computing, these qubits can be in multiple states at once, allowing them to solve complex problems much more quickly than classical computers. However, quantum computing is still a new and complex field that is still being developed, so it's like a futuristic computer that is still in the process of learning and growing!

Metric	Value
Correctness	PASS
Age-Appropriate	Yes (simple language, relatable analogy)
Tokens Generated	92
Decode Speed	0.24 tok/s

Test 5: Science Knowledge

Prompt: "What are the three laws of thermodynamics? Be concise."

Output:

First Law of Thermodynamics: The law of energy conservation states that energy cannot be created or destroyed, only converted from one form to another.

Second Law of Thermodynamics: This law describes the direction of spontaneous processes, stating that the entropy (a measure of disorder or randomness) of an isolated system always increases over time, or that the entropy of a closed system will always increase or remain constant.

Third Law of Thermodynamics: This law states that as the temperature of a system approaches absolute zero, the entropy of the system approaches a constant value...

Metric	Value
Correctness	PASS (all three laws accurately described)
Tokens Generated	128
Decode Speed	0.25 tok/s

Test 6: Logical Reasoning

Prompt: "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly? Explain your reasoning."

Output:

Yes, we can conclude that some roses fade quickly based on the given information.

Reasoning:

All roses are flowers (given).

Some flowers fade quickly (given).

Since roses are a subset of flowers, any characteristic or property that applies to flowers can also be applied to roses.

Therefore, since some flowers fade quickly, and roses are flowers, it logically follows that some roses also fade quickly.

Metric	Value
Correctness	FAIL (this is a syllogistic fallacy — "some flowers fade" does not imply roses specifically fade)
Tokens Generated	128
Decode Speed	0.25 tok/s
Note	This is a known weakness of small LMs — the correct answer is "No, we cannot conclude this"

Quality Summary

Test Category	Result	Score
Factual Knowledge	PASS	1/1
Math Reasoning	PASS	1/1
Code Generation	PARTIAL	0.5/1
Explanation	PASS	1/1
Science Knowledge	PASS	1/1
Logical Reasoning	FAIL	0/1
Total		4.5/6 (75%)

Published Benchmark Scores (SmolLM3-3B Base Model)

Official scores from the HuggingFace model card. QORA runs the same weights with Q4 quantization (minimal accuracy loss).

Reasoning and Commonsense

Benchmark	SmolLM3-3B	Qwen2.5-3B	Llama3.2-3B	Qwen3-4B
HellaSwag	76.15	74.19	75.52	74.37
ARC-CF	65.61	59.81	58.58	62.11
BoolQ	78.99	73.61	75.33	74.28
PIQA	78.89	78.35	78.51	77.58
Winogrande	58.88	61.41	58.72	59.59
CommonsenseQA	55.28	49.14	60.60	52.99

Knowledge and Understanding

Benchmark	SmolLM3-3B	Qwen2.5-3B	Llama3.2-3B	Qwen3-4B
MMLU-CF	44.13	42.93	41.32	47.65
MMLU Pro CF	19.61	16.66	16.42	24.92
MMLU Pro MCF	32.70	31.32	25.07	41.07
OpenBookQA	40.60	40.20	42.00	42.40

Math and Code

Benchmark	SmolLM3-3B	Qwen2.5-3B	Llama3.2-3B	Qwen3-4B
HumanEval+	30.48	34.14	25.00	54.87
MBPP+	52.91	52.11	38.88	63.75
MATH (4-shot)	46.10	40.10	7.44	51.20
GSM8K (5-shot)	67.63	70.13	25.92	74.14

Instruction Following (Chat Model)

Benchmark	SmolLM3-3B	Qwen2.5-3B	Llama3.1-3B	Qwen3-4B
IFEval	76.7	65.6	71.6	68.9
AIME 2025	9.3	2.9	0.3	17.1
GSM-Plus	72.8	74.1	59.2	82.1
LiveCodeBench	15.2	10.5	3.4	24.9
GPQA Diamond	35.7	32.2	29.4	44.4
Global MMLU	53.5	50.54	46.8	65.1
BFCL (Tools)	92.3	—	92.3	95.0

Extended Thinking Mode

Benchmark	No Think	With Think	Improvement
AIME 2025	9.3	36.7	+295%
GSM-Plus	72.8	83.4	+15%
LiveCodeBench	15.2	30.0	+97%
GPQA Diamond	35.7	41.7	+17%
Global MMLU	53.5	64.1	+20%

Long Context

Benchmark	SmolLM3-3B	Qwen2.5-3B	Llama3.2-3B	Qwen3-4B
RULER 32K	76.35	75.93	77.58	83.98
RULER 64K	67.85	64.90	72.93	60.29
RULER 128K	61.03	62.23	71.30	47.23

Multilingual (HellaSwag)

Language	SmolLM3-3B	Qwen2.5-3B	Llama3.2-3B	Qwen3-4B
French	63.94	57.47	57.66	61.00
Spanish	65.85	58.25	59.39	61.85
German	59.56	49.99	53.19	56.43
Italian	62.49	53.21	54.96	58.76
Portuguese	63.22	57.38	56.84	59.89

Model Comparison

Model	Params	Format	Size on Disk	Best At
QORA (SmolLM3-3B)	3.07B	Q4	1.68 GB	Reasoning, multilingual, instruction following
Qwen2.5-3B	3B	—	~6 GB	Math (GSM8K), Winogrande
Llama3.2-3B	3.2B	—	~6 GB	Long context (128K), CommonsenseQA
Qwen3-4B	4B	—	~8 GB	Overall best (larger model), math, code

Why SmolLM3-3B?

Best-in-class reasoning among 3B models (HellaSwag 76.15, ARC 65.61, BoolQ 78.99)
Best instruction following (IFEval 76.7) — beats even Qwen3-4B
Best multilingual performance among 3B models across 5 European languages
Thinking mode boosts AIME by 295% — competitive reasoning from a 3B model
128K context support with strong RULER scores

Technical Details

GPU Inference

QORA uses the Burn framework's wgpu backend with Vulkan for GPU acceleration:

Q4 on GPU: Weights are uploaded as Burn quantized tensors (Q4S + PackedU32). Matmul performs on-the-fly dequantization on the GPU — no need to decompress the full model into VRAM.
KV Cache: Stored as f32 tensors on GPU, concatenated each step.
Sampling: Logits are transferred to CPU for top-p/temperature sampling.
128MB stack thread: GPU inference runs in a dedicated thread with 128MB stack to handle Burn's deep lazy computation graphs.

Quantization

QORA uses symmetric 4-bit quantization with group_size=32:

Each group of 32 float values is quantized to 4-bit integers
One f32 scale factor per group
Total: 4 bits/weight + 1 bit/weight overhead = ~5 bits effective
Memory reduction: 32-bit -> ~5 bits = 6.4x compression

Inference Pipeline

1. Model Load    — Read .qora binary (Q4 weights + f16 norms)
2. GPU Detect    — Probe Vulkan GPU, check VRAM, fallback to CPU if needed
3. Upload        — Transfer Q4 weights to GPU (or keep on CPU)
4. Tokenize      — Encode prompt with chat template
5. Prefill       — Process full prompt through 36 layers (batched)
6. Decode Loop   — Generate tokens one at a time:
   a. Embedding lookup
   b. 36x: RMSNorm -> Attention (GQA, KV cache) -> RMSNorm -> SwiGLU MLP
   c. Final RMSNorm -> LM head (tied weights)
   d. Sample (top-p, temperature)
7. Detokenize    — Decode token IDs back to text

Sampling Parameters

Parameter	Default	Description
Temperature	0.6	Controls randomness (0 = greedy)
Top-P	0.95	Nucleus sampling threshold
Repetition Penalty	1.1	Discourages repeating recent tokens
Max Tokens	1024	Maximum generation length

QORA Model Family

Engine	Model	Params	Size (Q4)	Purpose
QORA	SmolLM3-3B	3.07B	1.68 GB	Text generation, reasoning, chat
QORA-TTS	Qwen3-TTS	—	—	Text-to-speech synthesis
QORA-Vision (Image)	SigLIP 2 Base	86M	58 MB	Image embeddings, zero-shot classification
QORA-Vision (Video)	ViViT Base	89M	60 MB	Video action classification (400 classes)

All engines are pure Rust, single-binary executables with no Python dependencies. QORA supports GPU acceleration via Vulkan with automatic CPU fallback.

Building from Source

cd QOR3B

# CPU-only build
cargo build --release

# GPU-enabled build (Vulkan)
cargo build --release --features gpu

Dependencies

burn — Rust deep learning framework (GPU via wgpu/Vulkan backend)
rayon — Thread pool for parallel GEMV, attention, and lm_head
half — F16 support
serde / serde_json — Config parsing
tokenizers — HuggingFace tokenizer

License

The QORA inference engine is custom-built. The SmolLM3-3B model weights are released under the SmolLM3 License by HuggingFace.

Built with QORA — Pure Rust AI Inference

Downloads last month: 236

Model tree for qoranet/QORA-LLM-3B

Base model

HuggingFaceTB/SmolLM3-3B-Base

Finetuned

HuggingFaceTB/SmolLM3-3B

Finetuned

(103)

this model

Evaluation results

Accuracy on HellaSwag
self-reported

76.150
Accuracy on ARC-Challenge
self-reported

65.610
Accuracy on BoolQ
self-reported

78.990
Accuracy on MMLU
self-reported

44.130
Accuracy on GSM8K
self-reported

67.630
pass@1 on HumanEval+
self-reported

30.480
Accuracy on MATH
self-reported

46.100