---
library_name: transformers
license: mit
datasets:
- Geraldine/Ead-Instruct-4k-Distilled
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
---

# Gemini-Distill-Qwen2.5-0.5B-ead: Qwen2.5-0.5B-Instruct Fine-Tuned on EAD/XML (Distilled from Gemini-2.0-Flash-Thinking-Exp)

## Model Description

This model is a fine-tuned version of **Qwen2.5-0.5B-Instruct**, trained via knowledge distillation from **Gemini-2.0-Flash-Thinking-Exp**. The goal of this fine-tuning is to teach the model to reason through and generate **Encoded Archival Description (EAD/XML)** outputs.

It follows a structured reasoning approach:

1. **First**, the model provides detailed reasoning.
2. **Then**, it outputs the final **EAD/XML** response.

This structure leads the model to justify its output before producing the archival XML, improving interpretability and accuracy.
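
For illustration only, a response about `<controlaccess>` would end in a short EAD fragment like the hypothetical one below, preceded by a free-form reasoning paragraph; the exact phrasing and length of both parts depend on the distilled training data:

```xml
<!-- Hypothetical fragment: element names follow EAD, but the content is invented for illustration. -->
<controlaccess>
  <subject source="lcsh">Printing--France--History--16th century</subject>
  <persname>Estienne, Robert, 1503-1559</persname>
  <geogname>Paris (France)</geogname>
</controlaccess>
```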

---

## Training Details

### **Dataset**

- Dataset: [Geraldine/Ead-Instruct-4k-Distilled](https://huggingface.co/datasets/Geraldine/Ead-Instruct-4k-Distilled)
- **Columns** (see the loading sketch below):
  - `tag`: EAD/XML element
  - `prompt`: User query
  - `reasoning`: Gemini-generated reasoning traces
  - `final_output`: EAD/XML archival response
  - `completion`: Concatenation of `reasoning` and `final_output`
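
A minimal sketch of how to load and inspect the dataset with the 🤗 `datasets` library (the `train` split name is an assumption; the column names are those listed above):

```python
from datasets import load_dataset

# Load the distilled instruction dataset from the Hugging Face Hub
# (the "train" split name is an assumption; adjust if the dataset exposes another split).
ds = load_dataset("Geraldine/Ead-Instruct-4k-Distilled", split="train")

# Each record pairs a user prompt with a Gemini reasoning trace and the final EAD/XML answer.
example = ds[0]
print(example["tag"])           # EAD/XML element the prompt is about
print(example["prompt"])        # user query
print(example["reasoning"])     # Gemini-generated reasoning trace
print(example["final_output"])  # EAD/XML archival response
print(example["completion"])    # reasoning + final_output concatenated
```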

### **Training Process**

- **Hardware:** NVIDIA A100-SXM4-80GB
- **Distillation Source:** Gemini-2.0-Flash-Thinking-Exp
- **Trained on:** User prompt → Gemini reasoning trace → Final EAD/XML response
- **Tokenization Strategy** (see the masking sketch after this list):
  - **Assistant (reasoning):** marks the start of the reasoning section
  - **Assistant (final answer):** marks the start of the XML output
  - Label tokens before the reasoning phase (the prompt) are masked with `-100`, so the loss is computed only on the assistant's reasoning and final EAD/XML answer
- **Training Hyperparameters** (see the `TrainingArguments` sketch below):
  - **Batch Size:** 4 (per device) with gradient accumulation (steps=2)
  - **Max Sequence Length:** 4096 tokens
  - **Precision:** bf16
  - **Epochs:** 5
  - **Gradient Checkpointing:** enabled (reduces memory usage)
  - **Dataloader Efficiency:** `dataloader_pin_memory=True`, `dataloader_num_workers=4`
  - **Warmup Steps:** 100
  - **Checkpointing:** model saved at every epoch, keeping at most 2 checkpoints (`save_total_limit=2`)
  - **Evaluation Strategy:** evaluation after each epoch (`eval_strategy="epoch"`)
  - **Logging:** logs stored in `./logs`
  - **Other:** `dataloader_drop_last=False` to preserve all batches
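
The exact preprocessing lives in the training notebook linked below; the snippet here is only a minimal sketch of the label-masking idea, assuming the training text is a chat-formatted prompt followed by the `completion` column (`build_example` is an illustrative helper, not the notebook's actual code):

```python
import torch
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

def build_example(prompt_text: str, completion_text: str, max_length: int = 4096):
    """Tokenize prompt + completion and mask the prompt tokens so the loss
    is only computed on the reasoning and the final EAD/XML answer."""
    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion_text, add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + completion_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + completion_ids)[:max_length]

    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.ones(len(input_ids), dtype=torch.long),
    }
```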

This setup balances performance and memory efficiency, leveraging gradient accumulation and checkpointing for stable training on long sequences.
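
Taken together, the hyperparameters above roughly correspond to a 🤗 `TrainingArguments` configuration like the sketch below (not the exact notebook settings; `output_dir` is an assumed placeholder and any unlisted argument keeps its library default):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",        # assumed placeholder
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    bf16=True,
    gradient_checkpointing=True,
    warmup_steps=100,
    save_strategy="epoch",
    save_total_limit=2,
    eval_strategy="epoch",
    logging_dir="./logs",
    dataloader_pin_memory=True,
    dataloader_num_workers=4,
    dataloader_drop_last=False,
)
# The 4096-token max sequence length is enforced at tokenization time, not here.
```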

### **Training notebook**

[https://www.kaggle.com/code/geraldinegeoffroy/ead-distilled-qwen2-5-0-5b-instruct](https://www.kaggle.com/code/geraldinegeoffroy/ead-distilled-qwen2-5-0-5b-instruct)

---

## Model Usage

### **Load Model**

To use the model with the 🤗 Transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
```

### **Inference Example**

```python
prompt = "Give me an example of <controlaccess> content."
messages = [
    {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024
)
# Strip the prompt tokens so only the newly generated completion is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

---

## Limitations & Future Improvements

- **Training Data Size:** The dataset consists of **4,000 distilled samples**, which may limit generalization.
- **Inference Speed:** Ensure that **Sliding Window Attention (SWA) is disabled**, as it may slow down inference.
  - To disable: `model.config.sliding_window = None`
- **Potential Future Steps:**
  - Fine-tuning on larger datasets
  - Exploring LoRA/QLoRA for efficient parameter tuning

---

## Citation & Acknowledgments

If you use this model in research or production, please cite:

```bibtex
@misc{geoffroy2025qwen25ead,
  author = {Géraldine Geoffroy},
  title = {Qwen2.5-0.5B-Instruct Fine-Tuned on EAD/XML},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead}
}
```