---
library_name: transformers
license: mit
datasets:
- Geraldine/Ead-Instruct-4k-Distilled
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
---
# Gemini-Distill-Qwen2.5-0.5B-ead: Qwen2.5-0.5B-Instruct Fine-Tuned on EAD/XML (Distilled from Gemini-2.0-Flash-Thinking-Exp)
## Model Description
This model is a fine-tuned version of **Qwen2.5-0.5B-Instruct**, trained via knowledge distillation from **Gemini-2.0-Flash-Thinking-Exp**. The goal of this fine-tuning process is to teach the model to reason through and generate **Encoded Archival Description (EAD/XML)** outputs.
It follows a structured reasoning approach:
1. **First**, the model provides detailed reasoning.
2. **Then**, it outputs the final **EAD/XML** response.
This structure ensures that the model justifies its output before producing the archival XML format, improving interpretability and accuracy.
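For illustration only (the wording and records below are invented for this card, not drawn from the training data), a response to a `<controlaccess>` prompt might look like:
```
Reasoning: The user wants an example of <controlaccess> content. In EAD,
<controlaccess> groups controlled access points such as <persname>, <subject>,
and <geogname>, so the answer should nest those child elements.

<controlaccess>
  <persname source="lcnaf">Hugo, Victor, 1802-1885</persname>
  <subject source="lcsh">French literature--19th century</subject>
  <geogname>Paris (France)</geogname>
</controlaccess>
```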
---
## Training Details
### **Dataset**
- Dataset: [Geraldine/Ead-Instruct-4k-Distilled](https://huggingface.co/datasets/Geraldine/Ead-Instruct-4k-Distilled)
- **Columns** (see the loading sketch after this list):
- `tag`: EAD/XML element
- `prompt`: User query
- `reasoning`: Gemini-generated reasoning traces
- `final_output`: EAD/XML archival response
- `completion`: Concatenation of `reasoning` and `final_output`
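A minimal loading sketch using the 🤗 `datasets` library (the `train` split name is an assumption of this example):
```python
from datasets import load_dataset

# Load the distillation dataset from the Hugging Face Hub
ds = load_dataset("Geraldine/Ead-Instruct-4k-Distilled", split="train")  # split name assumed

# Inspect the columns described above
print(ds.column_names)   # ['tag', 'prompt', 'reasoning', 'final_output', 'completion']
print(ds[0]["prompt"])
```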
### **Training Process**
- **Hardware:** NVIDIA A100-SXM4-80GB
- **Distillation Source:** Gemini-2.0-Flash-Thinking-Exp
- **Model trained on:** User prompt → Gemini reasoning traces → Final EAD/XML response
- **Tokenization Strategy** (see the masking sketch after this list):
- **Assistant (reasoning):** marks the start of the reasoning section
- **Assistant (final answer):** marks the start of the EAD/XML output
- Labels are masked (`-100`) for all tokens before the reasoning phase, so the loss is computed only on the reasoning and the final answer
- **Training Hyperparameters** (see the configuration sketch after this list):
- **Batch Size:** 4 (per device) with gradient accumulation (steps=2)
- **Max Sequence Length:** 4096 tokens
- **Precision:** bf16
- **Epochs:** 5
- **Gradient Checkpointing:** Enabled (reduces memory usage)
- **Dataloader Efficiency:** dataloader_pin_memory=True, dataloader_num_workers=4
- **Warmup Steps:** 100
- **Checkpointing:** Model saved at every epoch, with a maximum of 2 saved checkpoints (save_total_limit=2)
- **Evaluation Strategy:** Evaluates after each epoch (eval_strategy="epoch")
- **Logging:** Logs stored in ./logs
- **Other:** dataloader_drop_last=False to preserve all batches
This setup balances performance and memory efficiency, leveraging gradient accumulation and checkpointing for stable training on long sequences.
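The label masking described above can be sketched as follows (a minimal sketch; how the start of the reasoning section is located, e.g. by searching for the assistant marker in the chat template, is an assumption of this illustration):
```python
import torch

IGNORE_INDEX = -100  # value ignored by PyTorch's CrossEntropyLoss

def build_labels(input_ids: torch.Tensor, reasoning_start: int) -> torch.Tensor:
    """Mask every token before the reasoning section so the loss is
    computed only on the assistant's reasoning and final EAD/XML answer."""
    labels = input_ids.clone()
    labels[:reasoning_start] = IGNORE_INDEX
    return labels
```
The hyperparameters listed above map onto 🤗 `TrainingArguments` roughly as follows (a configuration sketch, not the exact notebook settings; `output_dir` and any unspecified options are assumptions):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",        # assumed path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,     # effective batch size of 8
    num_train_epochs=5,
    bf16=True,
    gradient_checkpointing=True,       # saves memory on long sequences
    warmup_steps=100,
    save_strategy="epoch",             # checkpoint at every epoch
    save_total_limit=2,
    eval_strategy="epoch",
    logging_dir="./logs",
    dataloader_pin_memory=True,
    dataloader_num_workers=4,
    dataloader_drop_last=False,
)
```
Note that the 4096-token maximum sequence length is enforced at tokenization time rather than through `TrainingArguments`.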
### **Training notebook**
[https://www.kaggle.com/code/geraldinegeoffroy/ead-distilled-qwen2-5-0-5b-instruct](https://www.kaggle.com/code/geraldinegeoffroy/ead-distilled-qwen2-5-0-5b-instruct)
---
## Model Usage
### **Load Model**
To use the model with the 🤗 Transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
```
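Note: `device_map="auto"` requires the `accelerate` package to be installed.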
### **Inference Example**
```python
# tokenizer and model are already loaded in the previous block
prompt = "Give me an example of <controlaccess> content."
messages = [
    {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
    {"role": "user", "content": prompt}
]

# Build the chat-formatted prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the reasoning trace followed by the EAD/XML answer
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024
)

# Strip the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
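Because the model was trained on `completion` targets (the `reasoning` trace concatenated with the `final_output`), the decoded response should contain the reasoning first, followed by the EAD/XML answer.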
---
## Limitations & Future Improvements
- **Training Data Size:** The dataset consists of **4,000 distilled samples**, which may limit generalization.
- **Inference Speed:** Ensure that **Sliding Window Attention (SWA) is disabled**, as it may slow down inference.
- To disable: `model.config.sliding_window = None`
- **Potential Future Steps:**
- Fine-tuning on larger datasets
- Exploring LoRA/QLoRA for efficient parameter tuning
---
## Citation & Acknowledgments
If you use this model in research or production, please cite:
```
@misc{geoffroy2025ead,
  author = {Géraldine Geoffroy},
  title = {Qwen2.5-0.5B-Instruct Fine-Tuned on EAD/XML},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead}
}
```