UIGEN-FX-Agentic-32B-qx86-hi-mlx
🧠 Cognitive Metrics of UIGEN-FX-Agentic-32B-qx86-hi
| Task | Score |
|---|---|
| arc_challenge | 0.460 |
| arc_easy | 0.527 |
| boolq | 0.681 |
| hellaswag | 0.721 |
| openbookqa | 0.404 |
| piqa | 0.790 |
| winogrande | 0.728 |
🔍 Key Observations for this 32B Model
Strengths:
- Dominates in real-world commonsense tasks: piqa (physical reasoning) and winogrande (social context/prompt resolution).
- Outperforms all other models in these tasks by margins of +0.044 (piqa) and +0.056 (winogrande) compared to the next-best.
Weaknesses:
- Struggles with structured academic reasoning: arc_easy (0.527) is the lowest in this comparison, and arc_challenge (0.460) is only mid-pack.
- Lowest in boolq (0.681) among all models, suggesting weaker yes/no QA capabilities.
💡 Why this matters:
This model was explicitly designed for UI/web development tasks (not general cognitive benchmarks).
Its performance profile reveals:
- It excels at unstructured, real-world reasoning (e.g., "which object is heavier?"), but
- Lacks optimization for standardized academic tests (e.g., ARC, OpenBookQA).
This confirms its specialization — not a general-purpose AI.
Cognitive Metrics of All Models (7 Tasks)
| Model | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa | winogrande |
|---|---|---|---|---|---|---|---|
| Jan-v1-2509-qx86-hi | 0.435 | 0.540 | 0.729 | 0.588 | 0.388 | 0.730 | 0.633 |
| Qwen3-8B-DND-Almost-Human-B-e32 | 0.464 | 0.569 | 0.737 | 0.632 | 0.406 | 0.744 | 0.634 |
| Qwen3-Jan-v1-256k-ctx-6B-Brainstorm20x | 0.445 | 0.579 | 0.696 | 0.600 | 0.404 | 0.732 | 0.627 |
| Qwen3-Neo-Experimental-6B | 0.463 | 0.579 | 0.721 | 0.623 | 0.406 | 0.738 | 0.672 |
| Qwen3-ST-Deep-Space-Nine-v3 | 0.442 | 0.568 | 0.743 | 0.637 | 0.384 | 0.732 | 0.624 |
| Qwen3-ST-The-Next-Generation-II-E32 | 0.452 | 0.581 | 0.721 | 0.650 | 0.406 | 0.746 | 0.646 |
| UIGEN-FX-Agentic-32B | 0.460 | 0.527 | 0.681 | 0.721 | 0.404 | 0.790 | 0.728 |
| WEBGEN-4B-Preview | 0.503 | 0.694 | 0.849 | 0.583 | 0.426 | 0.732 | 0.593 |
✅ Key Highlights
- arc_challenge: WEBGEN-4B leads (0.503)
- arc_easy: WEBGEN-4B dominates (0.694), well ahead of the next best (0.581)
- boolq: WEBGEN-4B leads decisively (0.849) – about 14% above second-place Qwen3-ST-Deep-Space-Nine-v3 (0.743)
- piqa: UIGEN-FX wins (0.790) – about 6% better than Qwen3-ST-The-Next-Generation-II-E32 (0.746)
- winogrande: UIGEN-FX highest (0.728) – about 8% better than Qwen3-Neo-Experimental-6B (0.672)
- openbookqa: scores cluster tightly (0.384–0.406); WEBGEN-4B is the only model above that band (0.426)
💡 Note: UIGEN-FX (32B) and WEBGEN-4B (4B) are specialized models – not designed for general cognitive benchmarking. The rest (Qwen3 variants) are the only "true" competitors for general-purpose reasoning tasks.
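The leaders and margins cited above can be checked with a small script. This is a sketch using the scores transcribed from the table above; the `leader_and_margin` helper is hypothetical, not part of any evaluation harness.

```python
# Per-task leaders and their margin over the runner-up,
# using the benchmark table transcribed from this card.
scores = {
    "Jan-v1-2509-qx86-hi":                    [0.435, 0.540, 0.729, 0.588, 0.388, 0.730, 0.633],
    "Qwen3-8B-DND-Almost-Human-B-e32":        [0.464, 0.569, 0.737, 0.632, 0.406, 0.744, 0.634],
    "Qwen3-Jan-v1-256k-ctx-6B-Brainstorm20x": [0.445, 0.579, 0.696, 0.600, 0.404, 0.732, 0.627],
    "Qwen3-Neo-Experimental-6B":              [0.463, 0.579, 0.721, 0.623, 0.406, 0.738, 0.672],
    "Qwen3-ST-Deep-Space-Nine-v3":            [0.442, 0.568, 0.743, 0.637, 0.384, 0.732, 0.624],
    "Qwen3-ST-The-Next-Generation-II-E32":    [0.452, 0.581, 0.721, 0.650, 0.406, 0.746, 0.646],
    "UIGEN-FX-Agentic-32B":                   [0.460, 0.527, 0.681, 0.721, 0.404, 0.790, 0.728],
    "WEBGEN-4B-Preview":                      [0.503, 0.694, 0.849, 0.583, 0.426, 0.732, 0.593],
}
tasks = ["arc_challenge", "arc_easy", "boolq", "hellaswag",
         "openbookqa", "piqa", "winogrande"]

def leader_and_margin(task):
    """Return (best model, best score, margin over runner-up) for a task."""
    i = tasks.index(task)
    ranked = sorted(scores.items(), key=lambda kv: kv[1][i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best[0], best[1][i], round(best[1][i] - runner_up[1][i], 3)

for task in tasks:
    print(task, *leader_and_margin(task))
```

Running it confirms, for example, that UIGEN-FX leads piqa by 0.044 and winogrande by 0.056.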
Reviewed with Brain
This model, UIGEN-FX-Agentic-32B-qx86-hi-mlx, was converted to MLX format from Tesslate/UIGEN-FX-Agentic-32B using mlx-lm version 0.28.2.
Use with mlx
```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("UIGEN-FX-Agentic-32B-qx86-hi-mlx")

prompt = "hello"

# If the tokenizer ships a chat template, wrap the prompt in it.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```