When does the GGUF version get released?
https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated/tree/main/GGUF
Can this GGUF be uploaded to Ollama? I failed to create a custom model from the GGUF and mmproj files locally.
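For context, this is roughly what I tried. It is just the usual two-FROM Modelfile pattern used for other vision GGUFs; the filenames are placeholders for whatever quant and mmproj files sit in the GGUF folder, and I don't know whether Ollama's runtime supports this architecture yet.

# Sketch of the attempted import (filenames are placeholders)
cat > Modelfile <<'EOF'
FROM ./Huihui-Qwen3-VL-8B-Instruct-abliterated-Q4_K_M.gguf
FROM ./mmproj-Huihui-Qwen3-VL-8B-Instruct-abliterated-F16.gguf
EOF
ollama create huihui-qwen3-vl-8b-abliterated -f Modelfile
ollama run huihui-qwen3-vl-8b-abliterated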
This stuff does not work. It may work if you run it on Windows and get lucky that your system happens to support this hack job of an implementation. It's not huihui's fault, and I can't blame the llama.cpp forks for trying, but it's too early. I wasted many hours on this and it's useless. Wait until the main official repo implements it, because they at least know what they are doing.

Until then, just use the censored FP8 version and run it with sglang or vLLM; they don't need that much VRAM. Qwen's official FP8 quant runs this at 16.6 GB VRAM, not just at server launch but during actual use too. I doubt the GGUF will perform that efficiently. Once the visual side works as well as it does in sglang, GGUFs make sense, but until then, why waste the time when it will land in the official repo in a few days or next week anyway?

A note on the FP8 version: it runs SLOWER than the bf16 version of the same model, but it needs less VRAM. So I am waiting for real GGUF support and a model that's not censored.

For reference, this is the sglang launch command I use:
python -m sglang.launch_server \
  --model "$MODEL_PATH" \
  --host 127.0.0.1 \
  --port 30000 \
  --trust-remote-code \
  --context-length 90000 \
  --mem-fraction-static 0.55 \
  --tp 1 \
  --enable-multimodal \
  --chunked-prefill-size 4096 \
  --attention-backend fa3 \
  --moe-runner-backend auto
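Once the server is up, it speaks the OpenAI-compatible API on that port, so a quick smoke test looks like the sketch below. The model name and image URL are placeholders; the "model" field generally has to match the path you passed to --model.

curl http://127.0.0.1:30000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Qwen/Qwen3-VL-8B-Instruct-FP8",
        "messages": [
          {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
            {"type": "text", "text": "Describe this image."}
          ]}
        ],
        "max_tokens": 256
      }'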
Thank you very much, the solution works perfectly. I managed to convert one into an Ollama model.