# Mistral 12B – CPT (Continual Pretraining with LoRA)
- **Model type:** Causal Language Model
- **Base model:** mistralai/Mistral-Nemo-Instruct-2407
- **License:** Apache 2.0
- **Framework:** Axolotl
## Overview

`mistral-12b-cpt` is a continual-pretrained version of the Mistral NeMo 12B Instruct model. This CPT phase extends the model's factual knowledge and energy-domain understanding using scientific, governmental, news, and encyclopedic text. Training was executed on the Leonardo EuroHPC system using Axolotl with DeepSpeed ZeRO-1 for efficient large-scale distributed fine-tuning.
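If the LoRA adapter is distributed separately from the base weights, it can be loaded with PEFT roughly as in the sketch below. This is a minimal example rather than an official usage snippet, and the adapter repository id is a placeholder.

```python
# Minimal sketch: load the base model and apply the CPT LoRA adapter with PEFT.
# "your-org/mistral-12b-cpt" is a placeholder adapter repository id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-Nemo-Instruct-2407"
adapter_id = "your-org/mistral-12b-cpt"  # placeholder, replace with the real repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

prompt = "Grid-scale battery storage helps balance renewable generation by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```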
## Training Setup

- **Objective:** Unsupervised continual pretraining (language modeling)
- **Adapter type:** LoRA
- **Precision:** bfloat16
- **Hardware:** 8 nodes × 2 × NVIDIA A100 64 GB GPUs
- **Framework:** Axolotl + DeepSpeed + PyTorch 2.5.1 + CUDA 12.1
- **Runtime:** 24 h
- **Checkpoints:** 5 per epoch
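For reference, the ZeRO-1 + bf16 setup described above maps onto the Hugging Face Trainer/DeepSpeed integration roughly as follows. This is an illustrative sketch, not the Axolotl configuration actually used; the output directory is a placeholder.

```python
# Minimal sketch of an equivalent DeepSpeed ZeRO-1 + bf16 setup with the HF Trainer API.
# Values mirror the setup described above; this is not the production configuration.
from transformers import TrainingArguments

deepspeed_zero1 = {
    "zero_optimization": {"stage": 1},
    "bf16": {"enabled": True},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(
    output_dir="outputs/mistral-12b-cpt",  # placeholder output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    bf16=True,
    gradient_checkpointing=True,
    deepspeed=deepspeed_zero1,
)
```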
## Dataset

| Dataset | Description |
|---|---|
| `arxiv.jsonl` | Scientific and technical papers |
| `gov.jsonl` | Government and policy documents |
| `news.jsonl` | News articles |
| `wiki.jsonl` | Wikipedia text |
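Each corpus is a JSONL file; for a run like this they are typically loaded and merged into a single pretraining stream. A minimal sketch with the `datasets` library, assuming local file paths and a `text` field per record (both assumptions):

```python
# Minimal sketch: load the four JSONL corpora and merge them into one
# pretraining dataset. File paths and the "text" field name are assumptions.
from datasets import load_dataset, concatenate_datasets

files = ["arxiv.jsonl", "gov.jsonl", "news.jsonl", "wiki.jsonl"]
splits = [load_dataset("json", data_files=f, split="train") for f in files]
corpus = concatenate_datasets(splits).shuffle(seed=42)

print(corpus)
print(corpus[0]["text"][:200])  # assumes each record carries a "text" field
```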
## Hyperparameters
| Parameter | Value |
|---|---|
| Sequence length | 2048 |
| Micro batch size | 2 |
| Gradient accumulation | 2 |
| Epochs | 10 |
| Max steps | 10000 |
| Learning rate | 0.0002 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA targets | q_proj, k_proj, v_proj, o_proj |
| Gradient checkpointing | ✓ |
| Flash attention | ✓ |
| Loss watchdog (threshold/patience) | 5.0 / 3 |
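The LoRA settings in the table correspond to a PEFT configuration roughly like the sketch below (illustrative only; the actual run was configured through Axolotl):

```python
# Minimal sketch of a PEFT LoRA configuration matching the table above.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407", torch_dtype=torch.bfloat16
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```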
## Tokenizer

- **Tokenizer type:** AutoTokenizer
- **Pad token:** `<|end_of_text|>`
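A minimal sketch of how the tokenizer and pad token described above could be set up, assuming `<|end_of_text|>` is present in the base tokenizer's vocabulary:

```python
# Minimal sketch: load the base tokenizer and configure the pad token used for this run.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
tokenizer.pad_token = "<|end_of_text|>"  # assumes this token exists in the vocabulary

batch = tokenizer(
    ["Grid congestion drives redispatch costs.", "Short sample."],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape, tokenizer.pad_token_id)
```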