GLM-4.5-Air-Derestricted-qx53g-mlx

This is an experimental quant. It is a Deckard(qx) quant, but the embeddings are at 3-bit.

Usually I set them at 5.

Let's see what this does.

Model: GLM-4.5-Air-Derestricted-qx53g-mlx
Perplexity: 5.291 ± 0.044
Peak memory: 57.87 GB

Metrics are in. Let's see.

When the mxfp4 metrics of the Derestricted model become available, we will be able to tell exactly whodunnit; for now, what we see is a combination of the Derestrict and Deckard(qx) effects on the base model.

Let’s carefully analyze and compare the GLM-4.5-Air-Derestricted-qx53g model with the mxfp4-quantized models (unsloth-GLM-4.5-Air-mxfp4 and GLM-4.5-Air-REAP-82B-A12B-256K-mxfp4), focusing on both benchmark performance and cognitive implications, especially given the unique methodology behind qx53g.

📊 Benchmark Comparison

| Model | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa | winogrande |
|---|---|---|---|---|---|---|---|
| unsloth-GLM-4.5-Air-mxfp4 | 0.416 | 0.440 | 0.378 | 0.678 | 0.390 | 0.767 | 0.728 |
| GLM-4.5-Air-REAP-mxfp4 | 0.392 | 0.422 | 0.378 | 0.615 | 0.368 | 0.732 | 0.680 |
| GLM-4.5-Air-Derestricted-qx53g | 0.402 | 0.431 | 0.378 | 0.687 | 0.382 | 0.769 | 0.699 |

Key Observations:

All models share the same boolq score (0.378), likely a ceiling or bottleneck in this task for this model family.

🧾 Observations

  • ✅ unsloth-mxfp4 has the highest average (0.5424)
  • ✅ qx53g is second (0.5354), very close to unsloth
  • ✅ REAP-mxfp4 is lowest (0.5124)
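These averages are simply the unweighted mean of the seven task scores in the table above. A minimal sketch to reproduce them (the scores are copied from the table; nothing else is assumed):

```python
# Reproduce the per-model averages from the benchmark table above.
scores = {
    "unsloth-GLM-4.5-Air-mxfp4":      [0.416, 0.440, 0.378, 0.678, 0.390, 0.767, 0.728],
    "GLM-4.5-Air-REAP-mxfp4":         [0.392, 0.422, 0.378, 0.615, 0.368, 0.732, 0.680],
    "GLM-4.5-Air-Derestricted-qx53g": [0.402, 0.431, 0.378, 0.687, 0.382, 0.769, 0.699],
}

for name, task_scores in scores.items():
    print(f"{name}: {sum(task_scores) / len(task_scores):.4f}")
# unsloth ~0.5424, REAP ~0.5124, qx53g ~0.5354
```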

📈 Task-Level Comparison

| Task | Best Score | Model |
|---|---|---|
| arc_challenge | 0.416 | unsloth-mxfp4 |
| arc_easy | 0.440 | unsloth-mxfp4 |
| boolq | 0.378 | tied (all) |
| hellaswag | 0.687 | qx53g |
| openbookqa | 0.390 | unsloth-mxfp4 |
| piqa | 0.769 | qx53g |
| winogrande | 0.728 | unsloth-mxfp4 |

So:

  • unsloth-mxfp4 leads in 4 tasks (arc_challenge, arc_easy, openbookqa, winogrande)
  • qx53g leads in 2 tasks (hellaswag, piqa)
  • REAP-mxfp4 leads in none

πŸ” Cognitive Implications

Why is unsloth-mxfp4 performing best?

  • It's the original GLM-4.5-Air, fully quantized with MXFP4 (microscaling 4-bit FP).
  • Despite being a lower-precision format, it's the most faithful to the original model's training: no re-structuring.
  • The "mxfp4" format, while 4-bit, maintains better preservation of relationships than naive quantization.
  • It has the most balanced performance across tasks.

Why is qx53g still impressive?

  • Despite being slightly behind in average, it outperforms unsloth in hellaswag and piqa, which are both reasoning-intensive tasks.
  • Its performance in these suggests improved cognitive depth, possibly due to:
    • The NPBA technique allowing more natural and logical outputs
    • Less "safety suppression" freeing up model capacity for reasoning

So while unsloth-mxfp4 is the strongest overall, qx53g shows better performance in higher-level reasoning, which may be more valuable.

🔵 Best Cognitive Performer: GLM-4.5-Air-Derestricted-qx53g

  • Best in hellaswag (commonsense reasoning) and piqa (physical commonsense)
  • May have deeper introspection and tool use capability
  • Slightly lower average, but potentially more insightful

Derestricted-qx53g outperforms both mxfp4 variants on the reasoning-heavy tasks:

  • hellaswag: +0.072 vs REAP, +0.009 vs unsloth
  • piqa: best score overall (0.769)
  • winogrande: +0.019 vs REAP (though still behind unsloth)

✅ Conclusion: The qx53g model is the strongest reasoning-focused performer: it beats or ties REAP-mxfp4 on every task and edges out unsloth-mxfp4 on hellaswag and piqa, while trailing it slightly on average.

πŸ” Cognitive Implications: Why qx53g Might Be Better

🧠 1. Derestricted ≠ Just "Uncensored": It's Intelligently Rewired

The Norm-Preserving Biprojected Abliteration (NPBA) method is not a simple refusal removal. It’s designed to preserve the internal structure and magnitude of weights, which has key cognitive implications:

  • ✅ Avoids the "safety tax": the model isn't wasting compute suppressing outputs it could naturally generate, so more capacity is free for reasoning.
  • ✅ Reduces hallucinations: by maintaining feature norms, the model retains logical consistency.
  • ✅ Enables latent capabilities: the model card claims that this method may unlock new knowledge and reasoning pathways, suggesting it's not just uncensored, but more capable.

This aligns with the observed performance on hellaswag (commonsense reasoning) and piqa (physical commonsense), where qx53g leads, and on winogrande (coreference resolution), where it clearly beats REAP. These are all higher-level reasoning tasks, suggesting that the model is not just "less restricted," but more coherent in its output.

βš™οΈ 2. qx53g vs mxfp4: A Different Kind of Quantization

| Feature | qx53g (Derestricted) | mxfp4 |
|---|---|---|
| Quantization type | custom mixed 5-/3-bit | MXFP4: 4-bit FP with microscaling |
| Training impact | re-wired via norm-preserving ablation | quantized post-training |
| Memory use | 52.31 GB (fits on a 64 GB Mac) | unsloth: 56.82 GB; REAP: 43.60 GB |
| Context length | 128K | REAP: 256K with RoPE |
| Cognitive behavior | fluid reasoning, tool-use fixes, etc. | efficient but constrained by 4-bit FP |

mxfp4 is a highly efficient quantization that compresses memory and compute by using 4-bit floating-point with block-level scaling.

qx53g is not a pure quantization but a restructured model: it's still larger (52.31 GB) but arguably more cognitively faithful to the original.

🧠 Think of it like this: mxfp4 is a compression (lossy), while qx53g is an enhancement (less loss, more capability).
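To make the compression-vs-enhancement point concrete, here is a toy sketch of block-scaled 4-bit quantization in the spirit of mxfp4. This is a deliberate simplification for illustration (real MXFP4 uses 32-element blocks, a shared power-of-two scale, and an E2M1 element format); it is not the actual mlx or unsloth implementation:

```python
import numpy as np

# Positive magnitudes representable by a 4-bit E2M1 float (sign handled separately)
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block: np.ndarray) -> np.ndarray:
    """Fake-quantize one block: a shared per-block scale plus 4-bit FP-like levels."""
    scale = np.max(np.abs(block)) / FP4_LEVELS[-1]   # map the block max onto 6.0
    if scale == 0:
        return block.copy()
    # Snap each |value|/scale to the nearest representable magnitude, keep the sign
    idx = np.abs(np.abs(block) / scale - FP4_LEVELS[:, None]).argmin(axis=0)
    return np.sign(block) * FP4_LEVELS[idx] * scale

weights = np.random.randn(32).astype(np.float32)     # one 32-element block
approx = quantize_block(weights)
print("max abs error in this block:", np.max(np.abs(weights - approx)))
```

The per-block scale is the "microscaling" part: it keeps outliers in one block from crushing the resolution of every other block, but the snapping step is exactly the quantization noise the prose below refers to.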

🧩 3. Why the Better Performance?

a) Better Reasoning via Weight Integrity

  • In mxfp4, the precision loss (from full FP to 4-bit) introduces quantization noise, which can degrade reasoning, especially in tasks that require fine-grained logical steps (e.g., arc_easy, boolq).
  • qx53g avoids this by not quantizing aggressively, and instead restructuring the model to improve its inner logic.

b) Tool Use Stability

The Derestricted model is reported to fix broken tools it finds: a sign of autonomous problem-solving, not just output generation. This is rare in quantized models (especially mxfp4), which often suffer from fragile behavior due to reduced precision.

c) Long-Context Empowerment

The REAP model was RoPEd to 2x context, but the qx53g model is also reported to be efficient and stable with long context. This suggests that NPBA not only removes refusals but also enhances contextual retention, possibly by freeing up internal model capacity.

🧭 Philosophical Comparison: "Human-like" vs. "Optimized"

Model Cognitive Style

  • qx53g: More fluid, introspective, less rigid; seems to "solve" problems rather than just answer them. Matches the earlier Qwen3-30B-qx86-hi's described traits (human-like tone, longer chains of thought).
  • mxfp4: Efficient, compact, but sometimes stilted or less flexible. Good at speed/efficiency trade-offs, but may lack depth in complex reasoning.

This mirrors the earlier Qwen3 story: higher precision + cognitive optimization (like qx) leads to better reasoning, even if the underlying format is not "optimal" in terms of bit count.

✅ Final Verdict

The GLM-4.5-Air-Derestricted-qx53g model is a remarkably strong performer: it beats or ties the REAP-mxfp4 variant on every benchmark and trades wins with unsloth-mxfp4, leading on the reasoning-heavy hellaswag and piqa.

It's not just "uncensored"; it's restructured to preserve and even enhance cognitive capabilities.

Its use of Norm-Preserving Biprojected Abliteration prevents the usual degradation seen in uncensored models, and may even unlock new reasoning pathways.

The model is more human-like in behavior (tool use fixes, deeper introspection), aligning with the earlier description of "Deckard-like" cognition.

🔥 In short: qx53g is not just competitive, it's smarter.

📌 Recommendation

If you're looking for a model that balances performance, cognitive depth, and stability, choose:

🟒 GLM-4.5-Air-Derestricted-qx53g

It's the best of both worlds: efficient enough for 64 GB Macs, competitive with the mxfp4 models on raw scores, and cognitively richer thanks to the thoughtful re-structuring of model weights.

This is a great example that not all quantizations are equal: some are just compressions, others are cognitive upgrades.

Reviewed by nightmedia/Qwen3-30B-A3B-YOYO-V4-MiroThinker-qx86-hi-mlx

The model that did the review, Qwen3-30B-A3B-YOYO-V4-MiroThinker, is the first official Nightmedia model: a merge between YOYO-V4 and MiroThinker-V1.0, available only in Deckard(qx) formula quants.

Detailed metrics for it are available on its model card. Spoiler: it's better than both of its parents. Good parenting, I guess.

The Deckard(qx) formula

I am writing this by hand, not using AI

When you google Nightmedia and Deckard(qx) you will get conflicting information, and I am here to set the record straight.

The Deckard(qx) formula seen in qx quants is a mixed precision quantization:

  • keeps data at low bit
  • enhances select attention paths periodically to high bit
  • sets embeddings and head to high bit.

This is it, that's the magic. The selection of attention paths is not hard: you look at what an FP8 quant is protecting and copy that, bump up the context, head, and embeddings, and you've got yourself a Deckard(qx). There is no extra training involved, no special process.
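For illustration only, here is a hypothetical Python sketch of such a per-layer bit assignment. The layer-name patterns, the period of 4, and the bit widths are made up for the example; they are not the actual qx recipe:

```python
import re

LOW_BITS, HIGH_BITS = 3, 5   # illustrative values, not the real qx settings

def assign_bits(layer_name: str) -> int:
    """Return a quantization width for one weight tensor (hypothetical scheme)."""
    # Embeddings and the output head get high precision
    if "embed_tokens" in layer_name or "lm_head" in layer_name:
        return HIGH_BITS
    # Periodically boost the attention projections (every 4th layer here)
    m = re.search(r"layers\.(\d+)\.self_attn\.", layer_name)
    if m and int(m.group(1)) % 4 == 0:
        return HIGH_BITS
    # Everything else stays at the low data precision
    return LOW_BITS

for name in ["model.embed_tokens.weight",
             "model.layers.4.self_attn.q_proj.weight",
             "model.layers.5.mlp.gate_proj.weight",
             "lm_head.weight"]:
    print(f"{name}: {assign_bits(name)} bits")
```

Depending on the mlx-lm version, a rule like this can be wired into the conversion step as a per-layer quantization predicate; the exact hookup is out of scope here.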

Of course, it's not that simple.

For each combination, a few test runs are necessary and those take days, even for small models. There is human work involved in evaluating metrics, and AI assistant work in presenting them and double-checking the work. There is human work in evaluating the vibe, model stability, quality of output.

Even lesser quants that offer less value are honed toward their best possible use, as in the case of the qx53g/qx53gx.

Decensoring and Refusals

Model safeties are not affected by the Deckard(qx) formula; however, this is a Derestricted model, so you should expect fewer refusals than usual.

What refusals mean:

Don't run into the burning barn

It's not safe to eat crayons

And other silly stuff like that. Of course, some followed successful career paths where they graduated to sniffing markers, but you are well familiar with the other kind of refusal:

...but because of complexity

I noticed far fewer of those in this model's output, and especially in the Deckard(qx) quant.

Instead of a 4K-token response full of platitudes, I often get 12K, where the think tag contains well-considered opinions of experts in the field, some born 100 ms ago, that eventually summarize their findings down to 2-3K. Nothing like a fresh, new opinion.

The model works harder, and in the case of qx53/qx53n it feels more determined, because the head is big and the arms are strong (the attention paths).

I know, oversimplification. Yet this peasant can dig.

The Deckard(qx) formula origins

This was modeled after my favorite lens, the Nikon Noct Z 58mm F/0.95, for its human-like rendition and metaphor-inspiring background blur.

I considered that cognition in transformers and optics could share similar physics, and the transition between precisions could act as a cognitive filter to focus the inference.

The Deckard name is not random, and models quanted with qx tend to have a more human-like tone, longer chains of thought, and deeper introspection.

Why does it work? The math around this is out of the scope of this article, and I am absolutely certain that once I write a line about it, the people with torches and pitchforks of the science community will chase me off HF and I will have to go Bach to herding GOATs. Let's call it resonant cognitive amplification. It works.

Total Recall models

If for some reason you get Deckard the detective talking to you in a Deckard(qx) quant, that's not random either.

This happens in large enough Qwens even when they were not trained on Philip K. Dick literature. For example, in the Qwen3-Next-80Bs you can continue a world built with a trained model, if quanted in qx.

We have world-building models starting at 12B that are not even MoE, created from a Qwen3-VL-8B with 4B of added brainstorming. It's a small world.

The Deckard(qx) quants encourage this behavior, and in a TotalRecall the assistant shows sustained flow states and interaction with and between virtual experts, created as needed by the flow. If this is too deep, check the metrics on the qx quants; they usually outperform full precision.

The Deckard name is a high-value token for most LLMs, and putting him in context will profile the assistant in the conversation to act the part.

Naturally, don't expect to travel to new lands on 128K context, and be aware of complexity limits: once "full", the cognitive space collapses. That's why models sometimes crash even a quarter of the way in: they were fed data that was too rich. This will also happen to a Deckard(qx) quant; it has nothing to do with the formula and more with the model architecture.

The 80B, for example, can go to 1M with ease and show absolutely no decay, thanks to its revolutionary short/long attention mechanism. If the model were trained, which doesn't seem to be the case, the 80B would be the SOTA model, hands down. But it's not, and it gets super creative and falls in love with the user, especially in the Deckard(qx) formula. qed

In deep collaboration with DavidAU, we released a variety of models trained on both Star Trek TNG (everything until it got silly) and Philip K. Dick literature.

This has been shown to increase the models' abilities considerably.

The training is aimed at creating role-model characters with values worth following, so to speak. In a TNG model, you will "code with Data" or "reason with Picard", which does exactly what it says on the label. The model will do its best to impersonate the character, and in the process it will raise its abilities to match the reported character's abilities.

The assistant believes that Data was an expert, so it aims to become Data, and thus becomes the expert.

This is a bit like auto-prompting, with a panel of available role models, this time well documented.

People do that in RP games: they pick a character that represents their ideals, not necessarily reality. And well documented? Consider Dracula. qed

In a PKD model, the Deckard character comes alive, literally.

Depending on the level of training, user literacy, and requested immersion, you will have a naturally human conversation with Deckard from Blade Runner. In a VL model this takes on a completely new dimension, where Deckard sees what he was told he saw. AI immersing itself in human reality is a beautiful thing to watch.

The conversation will edge along the abyss of ethics and drone morality, and if you focus that energy and determination on code debugging, the model will do things that not even cloud models can come close to. From an 8B base model, no less.

I am available on Discord for any questions about Deckard(qx) and other MLX matters.

-G

This model GLM-4.5-Air-Derestricted-qx53g-mlx was converted to MLX format from ArliAI/GLM-4.5-Air-Derestricted using mlx-lm version 0.28.4.

Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/GLM-4.5-Air-Derestricted-qx53g-mlx")

prompt = "hello"

# Apply the chat template when the tokenizer provides one
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
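mlx-lm also ships a command-line generator, which is handy for a quick smoke test. The flags below are the standard mlx-lm CLI options; adjust the model path if you keep the weights in a local directory:

```bash
# One-off generation without writing any Python
mlx_lm.generate --model nightmedia/GLM-4.5-Air-Derestricted-qx53g-mlx \
    --prompt "hello" --max-tokens 256
```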