ShuaiBai623 committed
Commit e3423a3 · verified · 1 Parent(s): 76d308a

Update README.md

Files changed (1):
  1. README.md +45 -0
README.md CHANGED
@@ -77,7 +77,52 @@ This is the GGUF-format weight repository for Qwen3-VL-235B-A22B-Thinking.
  **Pure text performance**
  ![](https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-VL/table_thinking_text.jpg)

+ ## How to Use
+
+ To use these models with `llama.cpp`, please make sure you are running the **latest version**, either by [building from source](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) or by downloading the most recent [release](https://github.com/ggml-org/llama.cpp/releases/tag/b6907) for your platform.
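+
+ As a rough sketch, a minimal CPU-only build from source looks like this (see the build documentation linked above for GPU backends and platform-specific options):
+
+ ```bash
+ # Build llama.cpp with CMake; the resulting binaries (llama-mtmd-cli,
+ # llama-server, llama-quantize, ...) end up in build/bin/.
+ git clone https://github.com/ggml-org/llama.cpp
+ cd llama.cpp
+ cmake -B build
+ cmake --build build --config Release -j
+ ```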
+
+ You can run inference via the command line or through a web-based chat interface.
+
+ ### CLI Inference (`llama-mtmd-cli`)
+
+ For example, to run Qwen3-VL-2B-Instruct with an FP16 vision encoder and a Q8_0-quantized LLM:
+
+ ```bash
+ llama-mtmd-cli \
+   -m path/to/Qwen3VL-2B-Instruct-Q8_0.gguf \
+   --mmproj path/to/mmproj-Qwen3VL-2B-Instruct-F16.gguf \
+   --image test.jpeg \
+   -p "What is the publisher name of the newspaper?" \
+   --temp 0.7 --top-k 20 --top-p 0.8 -n 1024
+ ```
+
+ ### Web Chat (using `llama-server`)
+
+ To serve Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI:
+
+ ```bash
+ llama-server \
+   -m path/to/Qwen3VL-235B-A22B-Instruct-Q4_K_M-split-00001-of-00003.gguf \
+   --mmproj path/to/mmproj-Qwen3VL-235B-A22B-Instruct-Q8_0.gguf
+ ```
+
+ > **Tip**: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.
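+
+ If you do want a single file instead, the shards can usually be merged with the `llama-gguf-split` tool that ships with llama.cpp. A possible invocation (the merged output path is just an example):
+
+ ```bash
+ # Merge all shards referenced by the first split into one GGUF file.
+ llama-gguf-split --merge \
+   path/to/Qwen3VL-235B-A22B-Instruct-Q4_K_M-split-00001-of-00003.gguf \
+   path/to/Qwen3VL-235B-A22B-Instruct-Q4_K_M-merged.gguf
+ ```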
+
+ Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the [official documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
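+
+ As a sketch of an API call (the image URL and prompt here are placeholders; recent llama-server builds accept `image_url` content parts when an `--mmproj` file is loaded):
+
+ ```bash
+ curl http://localhost:8080/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "messages": [
+       {
+         "role": "user",
+         "content": [
+           {"type": "image_url", "image_url": {"url": "https://example.com/test.jpeg"}},
+           {"type": "text", "text": "What is the publisher name of the newspaper?"}
+         ]
+       }
+     ]
+   }'
+ ```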
+
+ ### Quantize Your Custom Model
+
+ You can further quantize the FP16 weights to other precision levels. For example, to quantize the model to 2-bit:
+
+ ```bash
+ # Quantize to 2-bit (IQ2_XXS) using 8 threads
+ llama-quantize \
+   path/to/Qwen3VL-235B-A22B-Instruct-F16.gguf \
+   path/to/Qwen3VL-235B-A22B-Instruct-IQ2_XXS.gguf \
+   iq2_xxs 8
+ ```
+
+ For a full list of supported quantization types and detailed instructions, refer to the [quantization documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md).
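+
+ For very low-bit types such as IQ2_XXS, quality is usually much better when an importance matrix is supplied. A rough sketch, assuming your build includes the `llama-imatrix` tool and that `calibration.txt` is a calibration text file of your own:
+
+ ```bash
+ # 1) Compute an importance matrix from calibration text.
+ llama-imatrix \
+   -m path/to/Qwen3VL-235B-A22B-Instruct-F16.gguf \
+   -f calibration.txt \
+   -o imatrix.dat
+
+ # 2) Quantize using the importance matrix.
+ llama-quantize \
+   --imatrix imatrix.dat \
+   path/to/Qwen3VL-235B-A22B-Instruct-F16.gguf \
+   path/to/Qwen3VL-235B-A22B-Instruct-IQ2_XXS.gguf \
+   iq2_xxs 8
+ ```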
 
  ## Citation