lbourdois committed
Commit bd6d3e2 · verified · 1 Parent(s): 5a11b4d

Improve language tag

Hi! As the model is multilingual, this is a PR to add languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed. I was therefore only able to add these 13 languages.

Files changed (1)
  1. README.md +138 -127
README.md CHANGED
@@ -1,128 +1,139 @@
  ---
  library_name: transformers
  license: mit
  datasets:
  - Geraldine/Ead-Instruct-4k-Distilled
  language:
- - fr
- - en
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
  base_model:
  - Qwen/Qwen2.5-0.5B-Instruct
  ---

The rest of README.md is unchanged by this PR:

# Gemini-Distill-Qwen2.5-0.5B-ead: Qwen2.5-0.5B-Instruct Fine-Tuned on EAD/XML (Distilled from Gemini-2.0-Flash-Thinking-Exp)

## Model Description
This model is a fine-tuned version of **Qwen2.5-0.5B-Instruct**, trained via knowledge distillation from **Gemini-2.0-Flash-Thinking-Exp**. The goal of this fine-tuning process is to teach the model to reason through and generate **Encoded Archival Description (EAD/XML)** outputs.

It follows a structured reasoning approach:
1. **First**, the model provides detailed reasoning.
2. **Then**, it outputs the final **EAD/XML** response.

This structure ensures that the model justifies its output before producing the archival XML format, improving interpretability and accuracy.

---
## Training Details

### **Dataset**
- Dataset: [Geraldine/Ead-Instruct-4k-Distilled](https://huggingface.co/datasets/Geraldine/Ead-Instruct-4k-Distilled)
- **Columns:**
  - `tag`: EAD/XML element
  - `prompt`: User query
  - `reasoning`: Gemini-generated reasoning traces
  - `final_output`: EAD/XML archival response
  - `completion`: Concatenation of `reasoning` and `final_output`
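
A minimal sketch of loading this dataset and inspecting the columns with the 🤗 `datasets` library (the `train` split name is an assumption; check the dataset card for the actual splits):

```python
# Load the distillation dataset and look at one record.
from datasets import load_dataset

ds = load_dataset("Geraldine/Ead-Instruct-4k-Distilled", split="train")

print(ds.column_names)            # tag, prompt, reasoning, final_output, completion
print(ds[0]["prompt"])            # a user query about an EAD/XML element
print(ds[0]["completion"][:500])  # reasoning followed by the final EAD/XML output
```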

### **Training Process**
- **Hardware:** NVIDIA A100-SXM4-80GB
- **Distillation Source:** Gemini-2.0-Flash-Thinking-Exp
- **Model trained on:** User prompt → Gemini reasoning traces → Final EAD/XML response
- **Tokenization Strategy:**
  - **Assistant (reasoning):** The start of the reasoning section
  - **Assistant (final answer):** The start of the XML output
  - Labels are masked (`-100`) before the reasoning phase, so the loss is computed only on the assistant completion (see the sketch below)
- **Training Hyperparameters:**
  - **Batch Size:** 4 (per device) with gradient accumulation (steps=2)
  - **Max Sequence Length:** 4096 tokens
  - **Precision:** bf16
  - **Epochs:** 5
  - **Gradient Checkpointing:** Enabled (reduces memory usage)
  - **Dataloader Efficiency:** `dataloader_pin_memory=True`, `dataloader_num_workers=4`
  - **Warmup Steps:** 100
  - **Checkpointing:** Model saved at every epoch, with a maximum of 2 saved checkpoints (`save_total_limit=2`)
  - **Evaluation Strategy:** Evaluates after each epoch (`eval_strategy="epoch"`)
  - **Logging:** Logs stored in `./logs`
  - **Other:** `dataloader_drop_last=False` to preserve all batches

This setup balances performance and memory efficiency, leveraging gradient accumulation and checkpointing for stable training on long sequences.
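
As a rough illustration of the label masking and hyperparameters listed above, here is a minimal sketch built on the 🤗 `Trainer` API. It is not the original training notebook: how the prompt is assembled there (e.g. via the chat template) and the exact argument values may differ.

```python
# Minimal sketch (not the original notebook): tokenize prompt + completion,
# mask the prompt tokens with -100 so the loss only covers the assistant
# completion (reasoning followed by the final EAD/XML), then train with
# the settings listed above.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

base_model = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto")

MAX_LEN = 4096  # max sequence length from the card

def build_features(example):
    # Prompt tokens are masked in the labels; completion tokens are learned.
    prompt_ids = tokenizer(example["prompt"], add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(
        example["completion"] + tokenizer.eos_token, add_special_tokens=False
    )["input_ids"]
    input_ids = (prompt_ids + completion_ids)[:MAX_LEN]
    labels = ([-100] * len(prompt_ids) + completion_ids)[:MAX_LEN]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids), "labels": labels}

args = TrainingArguments(
    output_dir="./checkpoints",        # assumption; not stated on the card
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=5,
    bf16=True,
    gradient_checkpointing=True,
    warmup_steps=100,
    save_strategy="epoch",
    save_total_limit=2,
    eval_strategy="epoch",
    logging_dir="./logs",
    dataloader_pin_memory=True,
    dataloader_num_workers=4,
    dataloader_drop_last=False,
)

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...,
#                   data_collator=...)  # needs a collator that pads input_ids and labels
# trainer.train()
```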

### **Training notebook**

[https://www.kaggle.com/code/geraldinegeoffroy/ead-distilled-qwen2-5-0-5b-instruct](https://www.kaggle.com/code/geraldinegeoffroy/ead-distilled-qwen2-5-0-5b-instruct)

---
## Model Usage

### **Load Model**
To use the model with the 🤗 Transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
```

### **Inference Example**
```python
prompt = "Give me an example of <controlaccess> content."
messages = [
    {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

---
## Limitations & Future Improvements
- **Training Data Size:** The dataset consists of **4,000 distilled samples**, which may limit generalization.
- **Inference Speed:** Ensure that **Sliding Window Attention (SWA) is disabled**, as it may slow down inference.
  - To disable: `model.config.sliding_window = None`
- **Potential Future Steps:**
  - Fine-tuning on larger datasets
  - Exploring LoRA/QLoRA for efficient parameter tuning (see the sketch below)
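
For the LoRA direction, a minimal sketch with the `peft` library; the rank, alpha, and target modules below are illustrative assumptions, not settings used for this model:

```python
# Illustrative LoRA sketch with peft: wrap the base model with low-rank
# adapters on the Qwen2 attention projections so only the adapters train.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", torch_dtype="auto")

lora_config = LoraConfig(
    r=16,                 # adapter rank (assumption)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```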

---
## Citation & Acknowledgments
If you use this model in research or production, please cite:
```bibtex
@misc{your-citation,
  author = {Géraldine Geoffroy},
  title = {Qwen2.5-0.5B-Instruct Fine-Tuned on EAD/XML},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead}
}
```