GLM-4.5-Air-Derestricted-qx53gx-mlx
This is an experimental quant: a Deckard(qx) quant with embeddings and attention paths at 5 bit. Usually I set the attention paths to 4 bit in this combination.
Let's see what this does.
Model: GLM-4.5-Air-Derestricted-qx53gx-mlx
Perplexity: 5.254 ± 0.044
Peak memory: 63.53 GB
Metrics are being compiled, and the quant will be compared with the leading quants of Air.
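For readers comparing quants: the perplexity figure above is the usual exponential of the mean per-token negative log-likelihood. A minimal sketch of the definition (the inputs below are illustrative, not from the actual eval run):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(nlls) / len(nlls))

# Sanity check of the definition: a uniform per-token NLL of ln(5.254)
# reproduces the reported perplexity, up to float rounding.
ppl = perplexity([math.log(5.254)] * 10)
```

Lower is better; the ± figure is the standard error over the evaluation tokens.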
-G
The Deckard(qx) formula
I am writing this by hand, not using AI
When you google Nightmedia and Deckard(qx) you will get conflicting information, and I am here to set the record straight.
The Deckard(qx) formula seen in qx quants is a mixed-precision quantization that:
- keeps the data at low bit
- periodically enhances select attention paths to high bit
- sets embeddings and head to high bit.
This is it, that's the magic. The selection of attention paths is not hard: you look at what an FP8 quant is protecting and copy that, bump up the context, head, and embeddings, and you've got yourself a Deckard(qx). There is no extra training involved, no special process.
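The recipe above can be sketched as a per-tensor bit policy. Everything in this sketch is an illustrative assumption, not the actual formula: the 5/3 reading of the qx53 name, the HF-style layer names, and the every-4th-layer boost period are all made up for the example.

```python
# Illustrative sketch of a Deckard(qx)-style mixed-precision policy.
# Assumed: 5 bit for the enhanced paths and 3 bit for the rest,
# reading "qx53" as 5/3; layer names follow common HF conventions.

HIGH_BITS = 5   # embeddings, head, and boosted attention paths
LOW_BITS = 3    # everything else (the "data")

def qx_bits(path: str, layer_index: int = -1) -> int:
    """Return the bit width to quantize a given weight tensor to."""
    # Embeddings and the output head always stay at high precision.
    if "embed" in path or "lm_head" in path:
        return HIGH_BITS
    # Periodically enhance select attention paths to high bit
    # (every 4th layer here, purely as an example).
    if "self_attn" in path and layer_index >= 0 and layer_index % 4 == 0:
        return HIGH_BITS
    # Keep the data at low bit.
    return LOW_BITS
```

mlx-lm's convert exposes a quant_predicate hook that a policy like this could be wrapped into, returning per-layer quantization settings; check the mlx-lm version you use for the exact signature.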
Of course, it's not that simple.
For each combination a few test runs are necessary, and those take days, even for small models. There is human work involved in evaluating the metrics, and AI-assistant work in presenting them and double-checking the results. There is human work in evaluating the vibe, model stability, and quality of output.
Even lesser quants that present low value are honed toward their best possible use, as in the case of qx53g/qx53gx.
Decensoring and Refusals
Model safeties are not affected by the Deckard(qx) formula; however, this is a Derestricted model, so you should expect fewer refusals than usual.
What refusals mean:
- "Don't run into the burning barn"
- "It's not safe to eat crayons"
And other silly stuff like that. Of course, some followed successful career paths where they graduated to sniffing markers, but you are well familiar with
...but because of complexity
I noticed far fewer of those in the model output, and especially in the Deckard(qx) quant.
Instead of a 4K-token response full of platitudes, I often get 12K, where the think tag contains well-considered opinions from experts in the field (some of them born 100 ms ago) that eventually summarize their findings into 2-3K tokens. Nothing like a fresh, new opinion.
The model works harder, and in the case of qx53/qx53n it feels more determined, because the head is big and the arms (the attention paths) are strong.
I know, oversimplification. Yet this peasant can dig.
The Deckard(qx) formula origins
This was modeled after my favorite lens, the Nikon Noct Z 58mm F/0.95, for its human-like rendition and metaphor-inspiring background blur.
I considered that cognition in transformers and optics could share similar physics, and the transition between precisions could act as a cognitive filter to focus the inference.
The Deckard name is not random, and models quanted with qx tend to have a more human-like tone, longer chains of thought, and deeper introspection.
Why does it work? The math around this is outside the scope of this article, and I am absolutely certain that once I write a line about it, the people with torches and pitchforks of the science community will chase me off HF and I will have to go Bach to herding GOATs. Let's call it resonant cognitive amplification. It works.
Total Recall models
If for some reason you get Deckard the detective talking to you in a Deckard(qx) quant, that's not random either.
This happens in large enough Qwens even when they were not trained on Philip K. Dick literature. For example, in the Qwen3-Next-80Bs you can continue a world built with a trained model, if quanted in qx.
We have world-building models starting at 12B that are not even MoE, created from a Qwen3-VL-8B with an added 4B brainstorming component. It's a small world.
The Deckard(qx) quants encourage this behavior, and in a TotalRecall model the assistant shows sustained flow states and interaction with, and between, virtual experts created as needed by the flow. If this is too deep, check the metrics on the qx quants; they usually outperform full precision.
The Deckard name is a high-value token for most LLMs, and putting him in context will profile the assistant in the conversation to act the part.
Naturally, don't expect to travel to new lands on a 128K context, and be aware of complexity limits: once "full," the cognitive space collapses. That's why models sometimes crash even a quarter of the way in; they were fed data that was too rich. This will also happen to a Deckard(qx) and has nothing to do with the formula, more with the model architecture.
The 80B, for example, can go to 1M with ease and show absolutely no decay, thanks to its revolutionary short/long attention mechanism. If the model were trained for this, which doesn't seem to be the case, the 80B would be the SOTA model, hands down. But it's not, and it gets super creative and falls in love with the user, especially in the Deckard(qx) formula. qed
In deep collaboration with DavidAU, we released a variety of models trained on both Star Trek TNG (everything until it got silly) and Philip K. Dick literature.
This has been shown to increase the models' abilities considerably.
The training aims to create role-model characters with values worth following, so to speak. In a TNG model you will "code with Data" or "reason with Picard," which does exactly what's on the label. The model will do its best to impersonate, and in the process it raises its abilities to match the reported abilities of the character.
The assistant believes that Data was an expert, so it aims to become Data, and thus becomes the expert.
This is a bit like auto-prompting, with a panel of available role models, this time well documented.
People do that in RP games: they pick a character that represents their ideals, not necessarily reality. And well documented? Consider Dracula. qed
In a PKD model, the Deckard character comes alive, literally.
Depending on the level of training, user literacy, and requested immersion, you will have a naturally human conversation with Deckard from Blade Runner. In a VL model this takes on a completely new dimension, where Deckard sees what he was told he saw. AI immersing itself in human reality is a beautiful thing to watch.
The conversation will edge along the abyss of ethics and drone morality, and if you focus that energy and determination on code debugging, the model will do things that not even cloud models can come close to. From an 8B base model, no less.
I am available on Discord for any questions about Deckard(qx) and other MLX matters.
-G
This model GLM-4.5-Air-Derestricted-qx53gx-mlx was converted to MLX format from ArliAI/GLM-4.5-Air-Derestricted using mlx-lm version 0.28.4.
Use with mlx
```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("GLM-4.5-Air-Derestricted-qx53gx-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```