Model Card for Qemma-Q14B

Gap Envelope Integral

  • My mathematical formulation, which uses space projections to "measure" the jump between points of discontinuity in non-differentiable functions (a reference definition is sketched below).
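
The formulation itself is not reproduced on this card; for reference only, the quantity such a measure targets at a discontinuity x_0 is conventionally the difference of the one-sided limits:

% Reference definition only; not the Gap Envelope Integral itself,
% which is not spelled out on this card.
J(x_0) = \lim_{\varepsilon \to 0^{+}} \left[ f(x_0 + \varepsilon) - f(x_0 - \varepsilon) \right]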

Redux

  • This model is the result of an additional merge between Qemma-redux and Qwen3-14B, with RoPE scaling added on top.

Additionally

  • Fusion logic was updated to support per-layer fusion and post-fusion embedding alignment.
  • Qemma is a HuggingFace-native hybrid model that merges Gemma-3 (1B) and Qwen-3 (14B) at the weight level (no adapters).
  • Design: Gemma MLP/body + Qwen attention/head, projected and aligned to Gemma’s hidden size. The model is then SFT-tuned for stepwise reasoning.
  • This variant uses YaRN-based RoPE scaling (1:* ratio) with max_position_embeddings = 524288; a config-inspection sketch follows this list.
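
The exact YaRN scaling factor isn't stated here; a quick, non-destructive way to see what actually shipped in the checkpoint is to read it back from the configuration (standard transformers attributes; no weights are downloaded):

from transformers import AutoConfig

# Inspect the published configuration without loading any weights.
cfg = AutoConfig.from_pretrained("reaperdoesntknow/Qemma-Q14B")
print(cfg.max_position_embeddings)         # expected 524288 per this card
print(getattr(cfg, "rope_scaling", None))  # YaRN parameters, if present in config.json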

Quick start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qemma-Q14B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

text = (
    "<|user|>"
    "What makes the sky blue?."
    "<|assistant|>"
    "<think><reasoning_step>"
)

# Tokenize the prompt and move tensors to the model's device.
inputs = tokenizer(text, return_tensors="pt", max_length=64, padding='max_length', truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, min_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What’s inside

  • Architecture:
  • Gemma-3 backbone (26 layers, hidden 1152, MLP 6912)
  • Qwen-style attention regrouped to Gemma's 4 heads × 256 head_dim (source Qwen config: hidden_size=5120, intermediate_size=17408, num_attention_heads=40, num_key_value_heads=8, head_dim=128, num_hidden_layers=40)
  • Tokenizer: Gemma-3 tokenizer and chat template (see chat_template.jinja); a hedged apply_chat_template example follows this list.
  • Training: SFT for instruction following and stepwise reasoning.
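
Because the card ships a Gemma-3 chat template, the quick-start prompt can also be built with the tokenizer's own template instead of hand-written markers. The snippet below reuses tokenizer and model from the quick start; the role names are an assumption about what chat_template.jinja expects:

messages = [{"role": "user", "content": "What makes the sky blue?"}]

# Render the prompt through chat_template.jinja and tokenize in one step.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # open the assistant turn for generation
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))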

Intended use & limitations

Use: research, instruction following, coding assistance, analysis, and further SFT/RLHF. Limitations: may hallucinate; not suitable for safety-critical, medical, legal, or financial decisions. Follow the applicable dataset and model licenses.

Training procedure

  • ~512 warm-start steps on HuggingFaceH4/ultrachat_200k, plus a small post-fusion training round (8 steps) to encourage embedding realignment.
  • ~256 SFT steps on TIGER-Lab/MathInstruct + HuggingFaceH4/ultrachat_200k (an illustrative TRL sketch follows this list).
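
TRL 0.25.0 is listed under framework versions, so the SFT stage was presumably a stock TRL run. The sketch below only illustrates that setup: the output directory, batch size, and single-dataset mix are placeholders chosen here, not values taken from the card.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative data: the card lists ultrachat_200k and MathInstruct.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

config = SFTConfig(
    output_dir="qemma-sft",          # hypothetical path
    max_steps=256,                   # matches the ~256 SFT steps above
    per_device_train_batch_size=1,   # illustrative
    bf16=True,
)

trainer = SFTTrainer(
    model="reaperdoesntknow/Qemma-Q14B",
    args=config,
    train_dataset=dataset,
)
trainer.train()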

Framework versions

  • TRL: 0.25.0
  • Transformers: 4.57.1
  • PyTorch: 2.8.0+cpu
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Citations

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}