Draft Models
Tiny "draft" models for speculative decoding.
A 0.4B-parameter draft model (for speculative decoding) for use with Mistral-Large-Instruct-2411 and Mistral-Large-Instruct-2407.
See Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF for the models in GGUF format for use with llama.cpp.
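To use the GGUF draft with llama.cpp, pass it alongside the main model via the draft-model option. The example below is only a sketch: the GGUF filenames are placeholders, and the draft-related flags vary between llama.cpp versions, so check the --help output of your build:

> ./llama-server \
      -m  ./Mistral-Large-Instruct-2411-Q4_K_M.gguf \
      -md ./Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-Q4_0.gguf \
      -c 32768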
The current config.json is set for a context length of up to 32k tokens. To enable YaRN for longer contexts, add a "rope_scaling" section to config.json, e.g. for 64k context:

"max_position_embeddings": 65536,
...
"rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
},

or for 128k context:

"max_position_embeddings": 131072,
...
"rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
},
NOTE: Because llama.cpp uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the rope_scaling configuration when processing long contexts is actually required.
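If you prefer to script the config.json edit, a minimal sketch (assuming the local folder path below, and the 64k / factor-2.0 variant) could look like:

import json

# Hypothetical local path to the draft model folder.
config_path = "Mistral-Large-Instruct-2411-DRAFT-0.4B/config.json"

with open(config_path) as f:
    config = json.load(f)

# Enable "static" YaRN scaling for up to 64k context (factor 2.0 over the
# original 32k training context); use 131072 / factor 4.0 for 128k instead.
config["max_position_embeddings"] = 65536
config["rope_scaling"] = {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

The untrained draft model was created by transplanting the Mistral-Large-Instruct-2411 tokenizer onto Qwen2.5-0.5B-Instruct using transplant_vocab.py: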
> python ./transplant_vocab.py \
./Qwen2.5-0.5B-Instruct \
./Mistral-Large-Instruct-2411 \
./Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED \
--override "<unk>" "<|endoftext|>" \
--override "<s>" "<|endoftext|>" \
--override "</s>" "<|im_end|>" \
--override "[INST]" "<|im_start|>user\n" \
--override "[/INST]" "<|im_end|><|im_start|>assistant\n" \
--override "[TOOL_CALLS]" "<tool_call>" \
--override "[AVAILABLE_TOOLS]" "<tools>" \
--override "[/AVAILABLE_TOOLS]" "</tools>" \
--override "[TOOL_RESULTS]" "<tool_response>" \
--override "[/TOOL_RESULTS]" "</tool_response>" \
--override "[IMG]" "<|vision_start|>" \
--override "[PREFIX]" "<|fim_prefix|>" \
--override "[MIDDLE]" "<|fim_middle|>" \
--override "[SUFFIX]" "<|fim_suffix|>" \
--override "[IMG_BREAK]" "<|vision_pad|>" \
--override "[IMG_END]" "<|vision_end|>" \
--override "[SYSTEM_PROMPT]" "<|im_start|>system\n" \
--override "[/SYSTEM_PROMPT]" "<|im_end|>" \
--override "[TOOL_CONTENT]" "<tool_response>"
Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'Mistral-Large-Instruct-2411'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'Mistral-Large-Instruct-2411'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'...
Input model configuration:
- Target vocabulary size : 32768 (used = 32768, unused = 0)
- Donor vocabulary size : 151936
- Donor num layers : 24 (tied embeddings = True)
- Donor hidden size : 896
- Donor attention heads : 14
- Donor intermediate size : 4864 (ratio = 1:5.4)
- Donor total parameters : 494032768 (0.49B)
-- Embedding parameters : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)
Processing 3 automatic token overrides:
✔ 'bos_token_id' : 1 '<s>' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 2 '</s>' → [151645] '<|im_end|>'
✘ 'pad_token_id' : Not found for target model
Processing 19 manual token overrides:
✔ 0 : '<unk>' → [151643] '<|endoftext|>'
✔ 1 : '<s>' → [151643] '<|endoftext|>'
✔ 2 : '</s>' → [151645] '<|im_end|>'
✔ 3 : '[INST]' → [151644, 872, 198] '<|im_start|>user\n'
✔ 4 : '[/INST]' → [151645, 151644, 77091, 198] '<|im_end|><|im_start|>assistant\n'
✔ 5 : '[TOOL_CALLS]' → [151657] '<tool_call>'
✔ 6 : '[AVAILABLE_TOOLS]' → [27, 15918, 29] '<tools>'
✔ 7 : '[/AVAILABLE_TOOLS]' → [522, 15918, 29] '</tools>'
✔ 8 : '[TOOL_RESULTS]' → [27, 14172, 9655, 29] '<tool_response>'
✔ 9 : '[/TOOL_RESULTS]' → [522, 14172, 9655, 29] '</tool_response>'
✔ 10 : '[IMG]' → [151652] '<|vision_start|>'
✔ 11 : '[PREFIX]' → [151659] '<|fim_prefix|>'
✔ 12 : '[MIDDLE]' → [151660] '<|fim_middle|>'
✔ 13 : '[SUFFIX]' → [151661] '<|fim_suffix|>'
✔ 14 : '[IMG_BREAK]' → [151654] '<|vision_pad|>'
✔ 15 : '[IMG_END]' → [151653] '<|vision_end|>'
✔ 16 : '[SYSTEM_PROMPT]' → [151644, 8948, 198] '<|im_start|>system\n'
✔ 17 : '[/SYSTEM_PROMPT]' → [151645] '<|im_end|>'
✔ 18 : '[TOOL_CONTENT]' → [27, 14172, 9655, 29] '<tool_response>'
NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...
Transplanting tokens: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32768/32768 [00:09<00:00, 3311.13token/s]
Transplant mappings:
- 1 to 1 : 29370 (90%)
- 2 to 1 : 2445 (7.5%)
- 3 to 1 : 170 (0.52%)
- 4 to 1 : 29 (0.089%)
- 5 to 1 : 3 (0.0092%)
- 6 to 1 : 93 (0.28%)
- 7 to 1 : 658 (2%)
Head initialized with:
- Copies : 29370 (90%)
- Means : 3398 (10%)
- Zeros : 0 (0%)
Output model configuration:
- Output vocabulary size : 32768
- Output num layers : 24 (tied embeddings = False)
- Output hidden size : 896
- Output attention heads : 14
- Output intermediate size : 4864 (ratio = 1:5.4)
- Output total parameters : 416618368 (0.42B)
-- Embedding parameters : 58720256 (0.06B)
-- Non-embedding parameters : 357898112 (0.36B)
Saving model and tokenizer to 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED' folder
Patching 'torch_dtype' in 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype
Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
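The mapping statistics above reflect the basic idea behind the transplant: each token in the target (Mistral) vocabulary is re-encoded with the donor (Qwen2.5) tokenizer, and its new embedding is built from the donor rows it maps to. The sketch below illustrates that idea only; it is not the actual transplant_vocab.py implementation, and the function and argument names are made up:

import torch

def transplant_embeddings(target_tokens, donor_tokenizer, donor_embed):
    """Build a new embedding matrix for the target vocabulary: a direct
    copy of the donor row for 1-to-1 mappings, the mean of the donor rows
    for n-to-1 mappings, and zeros if the text cannot be encoded."""
    hidden_size = donor_embed.shape[1]
    new_embed = torch.zeros(len(target_tokens), hidden_size, dtype=donor_embed.dtype)
    for i, token_text in enumerate(target_tokens):
        donor_ids = donor_tokenizer.encode(token_text, add_special_tokens=False)
        if len(donor_ids) == 1:
            new_embed[i] = donor_embed[donor_ids[0]]           # "copy"
        elif len(donor_ids) > 1:
            new_embed[i] = donor_embed[donor_ids].mean(dim=0)  # "mean"
        # else: left as zeros
    return new_embed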
The model was then trained with the configuration below; the instruction data used the output field only, formatted just between <s> and </s> tags.
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================
model_dir = 'models/Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED'
output_dir = 'finetuned'
# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================
full_fine_tune = true
# =======================
# OPTIMIZER CONFIGURATION
# =======================
lr = 5e-5
# ======================
# TRAINING CONFIGURATION
# ======================
sequence_len = 32768
gradient_accumulation_steps = 10 # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step
# =====================
# DATASET CONFIGURATION
# =====================
drop_tails = true
[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'
[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'
[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
I used six RTX A6000 GPUs across three nodes, hence the effective batch size of 60 (6 GPUs × 10 gradient-accumulation steps = 60).
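For reference, a multi-node run driven by a TOML config like the one above is typically launched via the DeepSpeed launcher with a hostfile listing the three nodes (two GPUs each). The training script name, its flags, and the config filename below are assumptions that depend on the trainer being used, so treat this purely as a sketch:

> deepspeed --hostfile ./hostfile \
      train.py --deepspeed --config ./draft-0.4b.toml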