Qwen3-1.7B-f16-GGUF
This is a GGUF-quantized version of the Qwen/Qwen3-1.7B language model: a balanced 1.7-billion-parameter LLM designed for efficient local inference with strong reasoning and multilingual capabilities.
Converted for use with llama.cpp, LM Studio, OpenWebUI, GPT4All, and more.
Available Quantizations (from f16)
These variants were built from an f16 base model to ensure consistency across quant levels.
NEW: I have added a custom quantization called Q3_HIFI. It is higher quality and smaller than the standard Q3_K_M, at nearly the same speed.
It is listed under the 'f16' options because it's not an officially recognised type (at the moment).
Q3_HIFI
Pros:
- Best quality, with the lowest perplexity of 17.65 (21.4% better than Q3_K_M, 26.7% better than Q3_K_S)
- Smaller than Q3_K_M (993.5 vs 1017.9 MiB) while being significantly better quality
- Uses intelligent layer-sensitive quantization (Q3_HIFI on sensitive layers, mixed q3_K/q4_K elsewhere)
- Most consistent results (lowest standard deviation in perplexity: ±0.16)
Cons:
- Slowest inference at 411.1 TPS (3.4% slower than Q3_K_S)
- Custom quantization may have less community support
Best for: Production deployments where output quality matters, tasks requiring accuracy (reasoning, coding, complex instructions), or when you want the best quality-to-size ratio.
You can read more about how it compares to Q3_K_M and Q3_K_S here: Q3_Quantisation_Comparison.md
You can also view a cross-model comparison of the Q3_HIFI type here.
| Level | Speed | Size | Recommendation |
|---|---|---|---|
| Q2_K | Fastest | 880 MB | DO NOT USE. Did not return results for most questions. |
| Q3_K_S | Fast | 1.0 GB | Got good results across all question types. |
| Q3_K_M | Fast | 1.07 GB | Not recommended; did not appear in the top 3 models on any question. |
| Q4_K_S | Fast | 1.24 GB | Runner-up. Got very good results across all question types. |
| Q4_K_M | Fast | 1.28 GB | Got good results across all question types. |
| Q5_K_S | Medium | 1.44 GB | Made some appearances in the top 3; good for low-temperature questions. |
| Q5_K_M | Medium | 1.47 GB | Not recommended; did not appear in the top 3 models on any question. |
| Q6_K | Slow | 1.67 GB | Made some appearances in the top 3 across a range of temperatures. |
| Q8_0 | Slow | 2.17 GB | Best overall model. Highly recommended for all query types. |
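If you want to fetch a single file from the command line, here is a minimal sketch using huggingface-cli; the filename pattern (Qwen3-1.7B-f16:&lt;QUANT&gt;.gguf) is assumed from the download link in the Ollama section below, so adjust it if the repository layout differs:

```bash
# Install the Hugging Face CLI, then pull just the quant you want (Q4_K_S shown).
# Filename is assumed to follow the Qwen3-1.7B-f16:<QUANT>.gguf pattern used elsewhere in this card.
pip install -U "huggingface_hub[cli]"
huggingface-cli download geoffmunn/Qwen3-1.7B-f16 "Qwen3-1.7B-f16:Q4_K_S.gguf" --local-dir .
```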
Why Use a 1.7B Model?
The Qwen3-1.7B model offers a compelling middle ground between ultra-lightweight and full-scale language models, delivering:
- Noticeably better coherence and reasoning than 0.5B–1B models
- Fast CPU inference with minimal latency, ideal for real-time applications
- Quantized variants that fit in ~3–4 GB RAM, making it suitable for low-end laptops, tablets, or edge devices
- Strong multilingual and coding support inherited from the Qwen3 family
It's ideal for:
- Responsive on-device assistants with more natural conversation flow
- Lightweight agent systems that require step-by-step logic
- Educational projects or hobbyist experiments with meaningful capability
- Prototyping AI features before scaling to larger models
Choose Qwen3-1.7B when you need more expressiveness and reliability than a sub-1B model provides, but still demand efficiency, offline operation, and low resource usage.
Build notes
All of these models (including Q3_HIFI) were built using these commands:
mkdir build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DLLAMA_CURL=OFF
cmake --build build --config Release -j
NOTE: Vulkan support is specifically turned off here because Vulkan performance was much worse in testing; if you want Vulkan support, you can rebuild llama.cpp yourself with -DGGML_VULKAN=ON.
The quantisation for Q3_HIFI also used a 5000-chunk imatrix (importance matrix) file for extra precision. You can re-use it here: Qwen3-1.7B-f16-imatrix-5000.gguf
You can use the Q3_HIFI GitHub repository to build it from source if you're interested (use the Q3_HIFI branch): https://github.com/geoffmunn/llama.cpp.
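For reference, here is a minimal sketch of the imatrix generation and quantisation steps using stock llama.cpp tooling; the calibration text file and the f16 source filename are assumptions, and the exact Q3_HIFI invocation lives in the branch linked above:

```bash
# 1) Build an importance matrix from a calibration corpus (calibration.txt is an assumed filename).
./build/bin/llama-imatrix \
  -m Qwen3-1.7B-f16.gguf \
  -f calibration.txt \
  -o Qwen3-1.7B-f16-imatrix-5000.gguf \
  --chunks 5000

# 2) Quantize the f16 model using that imatrix (Q4_K_M shown here).
#    Standard types work with stock llama.cpp; Q3_HIFI needs the custom branch linked above.
./build/bin/llama-quantize \
  --imatrix Qwen3-1.7B-f16-imatrix-5000.gguf \
  Qwen3-1.7B-f16.gguf \
  Qwen3-1.7B-f16:Q4_K_M.gguf \
  Q4_K_M
```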
Model analysis and rankings
I have run each of these models across 6 questions and ranked them all based on the quality of the answers. Qwen3-1.7B:Q8_0 is the best model across all question types, but you could use a smaller model such as Qwen3-1.7B:Q4_K_S and still get excellent results.
You can read the results here: Qwen3-1.7b-f16-analysis.md
If you find this useful, please give the project a ❤️ like.
Usage
Load this model using:
- OpenWebUI – self-hosted AI interface with RAG & tools
- LM Studio – desktop app with GPU support and chat templates
- GPT4All – private, local AI chatbot (offline-first)
- Or directly via llama.cpp (see the example below)
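For a quick llama.cpp test from the terminal, a minimal sketch (use whichever quant you downloaded; the sampling values mirror the Modelfile further down):

```bash
# Interactive chat with the Q8_0 quant; adjust -m for the level you chose.
# -cnv enables conversation mode, -c 4096 matches the num_ctx used in the Ollama Modelfile below.
./build/bin/llama-cli -m Qwen3-1.7B-f16:Q8_0.gguf -cnv -c 4096 --temp 0.6 --top-p 0.95 --top-k 20
```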
Each quantized model includes its own README.md and shares a common MODELFILE for optimal configuration.
Importing directly into Ollama should work, but you might encounter this error: Error: invalid character '<' looking for beginning of value.
In this case try these steps:
- Download the quantised file you want: wget https://huggingface.co/geoffmunn/Qwen3-1.7B-f16/resolve/main/Qwen3-1.7B-f16%3AQ8_0.gguf (replace the quantised version with the one you want)
- Run nano Modelfile and enter these details (again, replacing Q8_0 with the version you want):
FROM ./Qwen3-1.7B-f16:Q8_0.gguf
# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant
TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
The num_ctx value has been reduced to 4096 to increase speed significantly.
- Then run this command:
ollama create Qwen3-1.7B-f16:Q8_0 -f Modelfile
You will now see "Qwen3-1.7B-f16:Q8_0" in your Ollama model list.
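Once the create step finishes, a quick smoke test (model name as registered above):

```bash
# One-shot prompt against the newly imported model
ollama run Qwen3-1.7B-f16:Q8_0 "Summarise what a GGUF file is in two sentences."
```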
These import steps are also useful if you want to customise the default parameters or system prompt.
Author
Geoff Munn (@geoffmunn)
Hugging Face Profile
Disclaimer
This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.