Draft Models
Tiny "draft" models for speculative decoding.
A 0.4B-parameter draft model (for speculative decoding) for use with Mistral-Large-Instruct-2411 and Mistral-Large-Instruct-2407.
See Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF for the models in GGUF format for use with llama.cpp.
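To use the GGUF draft with llama.cpp, pass it alongside the main model via the draft-model option. The example below is only a sketch: the GGUF filenames are placeholders, and the draft-related flags vary between llama.cpp versions, so check the --help output of your build:

> ./llama-server \
      -m  ./Mistral-Large-Instruct-2411-Q4_K_M.gguf \
      -md ./Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-Q4_0.gguf \
      -c 32768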
The current config.json is set for a context length of up to 32k tokens. To enable YaRN for longer contexts, add a "rope_scaling" section to config.json, e.g. for 64k context:

"max_position_embeddings": 65536,
...
"rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
},

or for 128k context:

"max_position_embeddings": 131072,
...
"rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
},
NOTE: Because llama.cpp uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the rope_scaling configuration when processing long contexts is actually required.
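If you prefer to script the config.json edit, a minimal sketch (assuming the local folder path below, and the 64k / factor-2.0 variant) could look like:

import json

# Hypothetical local path to the draft model folder.
config_path = "Mistral-Large-Instruct-2411-DRAFT-0.4B/config.json"

with open(config_path) as f:
    config = json.load(f)

# Enable "static" YaRN scaling for up to 64k context (factor 2.0 over the
# original 32k training context); use 131072 / factor 4.0 for 128k instead.
config["max_position_embeddings"] = 65536
config["rope_scaling"] = {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

The untrained draft model was created by transplanting the Mistral-Large-Instruct-2411 tokenizer onto Qwen2.5-0.5B-Instruct using transplant_vocab.py: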
> python ./transplant_vocab.py \
./Qwen2.5-0.5B-Instruct \
./Mistral-Large-Instruct-2411 \
./Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED \
--override "<unk>" "<|endoftext|>" \
--override "<s>" "<|endoftext|>" \
--override "</s>" "<|im_end|>" \
--override "[INST]" "<|im_start|>user\n" \
--override "[/INST]" "<|im_end|><|im_start|>assistant\n" \
--override "[TOOL_CALLS]" "<tool_call>" \
--override "[AVAILABLE_TOOLS]" "<tools>" \
--override "[/AVAILABLE_TOOLS]" "</tools>" \
--override "[TOOL_RESULTS]" "<tool_response>" \
--override "[/TOOL_RESULTS]" "</tool_response>" \
--override "[IMG]" "<|vision_start|>" \
--override "[PREFIX]" "<|fim_prefix|>" \
--override "[MIDDLE]" "<|fim_middle|>" \
--override "[SUFFIX]" "<|fim_suffix|>" \
--override "[IMG_BREAK]" "<|vision_pad|>" \
--override "[IMG_END]" "<|vision_end|>" \
--override "[SYSTEM_PROMPT]" "<|im_start|>system\n" \
--override "[/SYSTEM_PROMPT]" "<|im_end|>" \
--override "[TOOL_CONTENT]" "<tool_response>"
Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'Mistral-Large-Instruct-2411'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'Mistral-Large-Instruct-2411'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'...
Input model configuration:
- Target vocabulary size : 32768 (used = 32768, unused = 0)
- Donor vocabulary size : 151936
- Donor num layers : 24 (tied embeddings = True)
- Donor hidden size : 896
- Donor attention heads : 14
- Donor intermediate size : 4864 (ratio = 1:5.4)
- Donor total parameters : 494032768 (0.49B)
-- Embedding parameters : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)
Processing 3 automatic token overrides:
✔ 'bos_token_id' : 1 '<s>' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 2 '</s>' → [151645] '<|im_end|>'
✘ 'pad_token_id' : Not found for target model
Processing 19 manual token overrides:
✔ 0 : '<unk>' → [151643] '<|endoftext|>'
✔ 1 : '<s>' → [151643] '<|endoftext|>'
✔ 2 : '</s>' → [151645] '<|im_end|>'
✔ 3 : '[INST]' → [151644, 872, 198] '<|im_start|>user\n'
✔ 4 : '[/INST]' → [151645, 151644, 77091, 198] '<|im_end|><|im_start|>assistant\n'
✔ 5 : '[TOOL_CALLS]' → [151657] '<tool_call>'
✔ 6 : '[AVAILABLE_TOOLS]' → [27, 15918, 29] '<tools>'
✔ 7 : '[/AVAILABLE_TOOLS]' → [522, 15918, 29] '</tools>'
✔ 8 : '[TOOL_RESULTS]' → [27, 14172, 9655, 29] '<tool_response>'
✔ 9 : '[/TOOL_RESULTS]' → [522, 14172, 9655, 29] '</tool_response>'
✔ 10 : '[IMG]' → [151652] '<|vision_start|>'
✔ 11 : '[PREFIX]' → [151659] '<|fim_prefix|>'
✔ 12 : '[MIDDLE]' → [151660] '<|fim_middle|>'
✔ 13 : '[SUFFIX]' → [151661] '<|fim_suffix|>'
✔ 14 : '[IMG_BREAK]' → [151654] '<|vision_pad|>'
✔ 15 : '[IMG_END]' → [151653] '<|vision_end|>'
✔ 16 : '[SYSTEM_PROMPT]' → [151644, 8948, 198] '<|im_start|>system\n'
✔ 17 : '[/SYSTEM_PROMPT]' → [151645] '<|im_end|>'
✔ 18 : '[TOOL_CONTENT]' → [27, 14172, 9655, 29] '<tool_response>'
NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...
Transplanting tokens: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32768/32768 [00:09<00:00, 3311.13token/s]
Transplant mappings:
- 1 to 1 : 29370 (90%)
- 2 to 1 : 2445 (7.5%)
- 3 to 1 : 170 (0.52%)
- 4 to 1 : 29 (0.089%)
- 5 to 1 : 3 (0.0092%)
- 6 to 1 : 93 (0.28%)
- 7 to 1 : 658 (2%)
Head initialized with:
- Copies : 29370 (90%)
- Means : 3398 (10%)
- Zeros : 0 (0%)
Output model configuration:
- Output vocabulary size : 32768
- Output num layers : 24 (tied embeddings = False)
- Output hidden size : 896
- Output attention heads : 14
- Output intermediate size : 4864 (ratio = 1:5.4)
- Output total parameters : 416618368 (0.42B)
-- Embedding parameters : 58720256 (0.06B)
-- Non-embedding parameters : 357898112 (0.36B)
Saving model and tokenizer to 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED' folder
Patching 'torch_dtype' in 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype
Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
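The mapping statistics above reflect the basic idea behind the transplant: each token in the target (Mistral) vocabulary is re-encoded with the donor (Qwen2.5) tokenizer, and its new embedding is built from the donor rows it maps to. The sketch below illustrates that idea only; it is not the actual transplant_vocab.py implementation, and the function and argument names are made up:

import torch

def transplant_embeddings(target_tokens, donor_tokenizer, donor_embed):
    """Build a new embedding matrix for the target vocabulary: a direct
    copy of the donor row for 1-to-1 mappings, the mean of the donor rows
    for n-to-1 mappings, and zeros if the text cannot be encoded."""
    hidden_size = donor_embed.shape[1]
    new_embed = torch.zeros(len(target_tokens), hidden_size, dtype=donor_embed.dtype)
    for i, token_text in enumerate(target_tokens):
        donor_ids = donor_tokenizer.encode(token_text, add_special_tokens=False)
        if len(donor_ids) == 1:
            new_embed[i] = donor_embed[donor_ids[0]]           # "copy"
        elif len(donor_ids) > 1:
            new_embed[i] = donor_embed[donor_ids].mean(dim=0)  # "mean"
        # else: left as zeros
    return new_embed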
The model was then trained with the configuration below; the instruction data used the output field only, formatted just between <s> and </s> tags.
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================
model_dir = 'models/Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED'
output_dir = 'finetuned'
# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================
full_fine_tune = true
# =======================
# OPTIMIZER CONFIGURATION
# =======================
lr = 5e-5
# ======================
# TRAINING CONFIGURATION
# ======================
sequence_len = 32768
gradient_accumulation_steps = 10 # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step
# =====================
# DATASET CONFIGURATION
# =====================
drop_tails = true
[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'
[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'
[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
I used six RTX A6000 GPUs across three nodes, hence the effective batch size of 60 (6 GPUs × 10 gradient-accumulation steps = 60).
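For reference, a multi-node run driven by a TOML config like the one above is typically launched via the DeepSpeed launcher with a hostfile listing the three nodes (two GPUs each). The training script name, its flags, and the config filename below are assumptions that depend on the trainer being used, so treat this purely as a sketch:

> deepspeed --hostfile ./hostfile \
      train.py --deepspeed --config ./draft-0.4b.toml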