---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
base_model:
  - PowerInfer/SmallThinker-21BA3B-Instruct
---

## Introduction

SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models, co-developed by IPADS and the School of AI at Shanghai Jiao Tong University together with Zenergize AI. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to your personal devices, without relying on the cloud.

## Performance

Note: The model is trained mainly on English.

| Model | MMLU | GPQA-Diamond | MATH-500 | IFEval | LiveBench | HumanEval | Average |
|:------|:----:|:------------:|:--------:|:------:|:---------:|:---------:|:-------:|
| SmallThinker-21BA3B-Instruct | 84.43 | 55.05 | 82.4 | 85.77 | 60.3 | 89.63 | 76.26 |
| Gemma3-12B-it | 78.52 | 34.85 | 82.4 | 74.68 | 44.5 | 82.93 | 66.31 |
| Qwen3-14B | 84.82 | 50.0 | 84.6 | 85.21 | 59.5 | 88.41 | 75.42 |
| Qwen3-30B-A3B | 85.1 | 44.4 | 84.4 | 84.29 | 58.8 | 90.24 | 74.54 |
| Qwen3-8B | 81.79 | 38.89 | 81.6 | 83.92 | 49.5 | 85.9 | 70.26 |
| Phi-4-14B | 84.58 | 55.45 | 80.2 | 63.22 | 42.4 | 87.2 | 68.84 |

For the MMLU evaluation, we use a 0-shot CoT setting.

All models are evaluated in non-thinking mode.
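
For reference, here is a minimal sketch of what a 0-shot chain-of-thought multiple-choice prompt can look like. The exact template behind the scores above is not specified in this card, so the question and instruction wording below are purely illustrative:

```python
# Illustrative 0-shot CoT prompt for an MMLU-style multiple-choice question.
# The wording is hypothetical, not the actual evaluation harness.
question = "What is the return type of Python's len() builtin?"
choices = ["A. str", "B. int", "C. float", "D. bytes"]

prompt = (
    question
    + "\n"
    + "\n".join(choices)
    + "\nThink step by step, then give the final answer as a single letter."
)
messages = [{"role": "user", "content": prompt}]
# These messages can be fed to tokenizer.apply_chat_template(...) exactly as in
# the Transformers example in "How to Run" below.
```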

## Speed

Decoding speed in tokens/s on each device (higher is better):

| Model | Memory (GiB) | i9 14900 | 1+13 8ge4 | rk3588 (16G) | Raspberry Pi 5 |
|:------|:------------:|:--------:|:---------:|:------------:|:--------------:|
| SmallThinker 21B + sparse | 11.47 | 30.19 | 23.03 | 10.84 | 6.61 |
| SmallThinker 21B + sparse, limited memory | 8 (limit) | 20.30 | 15.50 | 8.56 | - |
| Qwen3 30B A3B | 16.20 | 33.52 | 20.18 | 9.07 | - |
| Qwen3 30B A3B, limited memory | 8 (limit) | 10.11 | 0.18 | 6.32 | - |
| Gemma 3n E2B | 1 (theoretical) | 36.88 | 27.06 | 12.50 | 6.66 |
| Gemma 3n E4B | 2 (theoretical) | 21.93 | 16.58 | 7.37 | 4.01 |

Note: The i9 14900 and the 1+13 8ge4 use 4 threads; the other devices use the number of threads that achieves the maximum speed. All models here have been quantized to q4_0. You can deploy SmallThinker with offloading support using PowerInfer.
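
For intuition on the memory column: q4_0 packs each block of 32 weights into 18 bytes (16 bytes of 4-bit values plus a 2-byte fp16 scale), i.e. roughly 4.5 bits per weight. A rough back-of-envelope sketch (embeddings, norms, and runtime buffers add some overhead on top):

```python
# Rough q4_0 footprint estimate for a 21B-parameter model. q4_0 stores 32
# weights per 18-byte block (16 bytes of 4-bit values + a 2-byte fp16 scale).
total_params = 21e9
bits_per_weight = 18 * 8 / 32                       # 4.5 bits per weight
size_gib = total_params * bits_per_weight / 8 / 2**30
print(f"~{size_gib:.1f} GiB")                       # ~11.0 GiB, in line with the
                                                    # 11.47 GiB measured above
```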

## Model Card

| **Architecture** | Mixture-of-Experts (MoE) |
|:---:|:---:|
| **Total Parameters** | 21B |
| **Activated Parameters** | 3B |
| **Number of Layers** | 52 |
| **Attention Hidden Dimension** | 2560 |
| **MoE Hidden Dimension** (per Expert) | 768 |
| **Number of Attention Heads** | 28 |
| **Number of KV Heads** | 4 |
| **Number of Experts** | 64 |
| **Selected Experts per Token** | 6 |
| **Vocabulary Size** | 151,936 |
| **Context Length** | 16K |
| **Attention Mechanism** | GQA |
| **Activation Function** | ReGLU |
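
To make the MoE numbers concrete, below is a toy top-k routing sketch in PyTorch using the dimensions from the table (hidden size 2560, expert size 768, 64 experts, 6 selected per token, ReGLU experts). It is an illustration for intuition only, not SmallThinker's actual layer implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReGLUExpert(nn.Module):
    """One feed-forward expert with a ReGLU activation: relu(gate(x)) * up(x)."""
    def __init__(self, hidden_dim: int, expert_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_dim, expert_dim, bias=False)
        self.up_proj = nn.Linear(hidden_dim, expert_dim, bias=False)
        self.down_proj = nn.Linear(expert_dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.relu(self.gate_proj(x)) * self.up_proj(x))

class TopKMoELayer(nn.Module):
    """Toy MoE layer: route each token to its top-k experts and mix the outputs."""
    def __init__(self, hidden_dim=2560, expert_dim=768, num_experts=64, top_k=6):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            ReGLUExpert(hidden_dim, expert_dim) for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, hidden_dim)
        probs = F.softmax(self.router(x), dim=-1)             # (num_tokens, num_experts)
        weights, indices = probs.topk(self.top_k, dim=-1)     # (num_tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True) # renormalize over selected
        out = torch.zeros_like(x)
        # Only the 6 selected experts run per token -- this is why just ~3B of
        # the 21B total parameters are activated for any given token.
        for t in range(x.size(0)):
            for w, e in zip(weights[t], indices[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

layer = TopKMoELayer()
print(layer(torch.randn(4, 2560)).shape)  # torch.Size([4, 2560])
```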
## How to Run

### Transformers

`transformers==4.53.3` is required; we are actively working to support the latest version. The following code snippet illustrates how to use the model to generate content from given inputs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "PowerInfer/SmallThinker-21BA3B-Instruct"
device = "cuda"

# Load the tokenizer and model; trust_remote_code is required because the
# repository ships custom model code.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
model_inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(device)

model_outputs = model.generate(
    model_inputs,
    do_sample=True,
    max_new_tokens=1024,
)

# Strip the prompt tokens so that only the newly generated tokens are decoded.
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```

### ModelScope

`ModelScope` adopts a Python API similar to (though not entirely identical to) `Transformers`. For basic usage, simply change the first import line of the code above as follows:

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
```

## Statement

- Due to the constraints of its model size and training data, SmallThinker's responses may contain factual inaccuracies, biases, or outdated information.
- Users bear full responsibility for independently evaluating and verifying the accuracy and appropriateness of all generated content.
- SmallThinker does not possess genuine comprehension or consciousness and cannot express personal opinions or value judgments.