---
license: cc-by-nc-3.0
datasets:
- google/fleurs
metrics:
- wer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---
# Whisper Fine-tuning for Cebuano Language

This project provides a configurable way to fine-tune OpenAI's Whisper model on the Cebuano language using the Google FLEURS dataset (`ceb_ph`).

## Features

- **Flexible Configuration**: All parameters are configurable through YAML files
- **Multi-GPU Support**: Automatic detection and support for multiple GPUs
- **Dynamic Language Selection**: Train on any subset of supported languages
- **On-the-fly Processing**: Efficient memory usage with dynamic audio preprocessing
- **Comprehensive Evaluation**: Automatic evaluation on test sets

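"On-the-fly processing" here means audio is preprocessed lazily, batch by batch, instead of materializing the whole preprocessed dataset in memory. A minimal sketch of the idea (function names are illustrative, not the project's actual API):

```python
def lazy_batches(examples, batch_size, preprocess):
    """Yield preprocessed batches one at a time, so only one
    batch of audio features lives in memory at any moment."""
    batch = []
    for example in examples:
        batch.append(preprocess(example))  # e.g. resample + log-mel features
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```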
## Configuration

All parameters are configurable through the `config.yaml` file. This configuration is specifically set up for Cebuano language training using the Google FLEURS dataset.

### Model Configuration
- Model checkpoint (default: `openai/whisper-large-v3`)
- Maximum target length for sequences

### Dataset Configuration
- Uses the Google FLEURS Cebuano (`ceb_ph`) dataset
- Dataset sources and splits
- Language-specific settings
- Training subset ratio (25% of the data, for faster training)

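The 25% training-subset option can be pictured as a deterministic random sample of the training split (an illustration only; the actual implementation in `finetune.py` may differ):

```python
import random

def select_training_subset(examples, ratio=0.25, seed=42):
    """Deterministically sample a fraction of the training
    examples, keeping their original order."""
    rng = random.Random(seed)
    n = max(1, int(len(examples) * ratio))
    indices = sorted(rng.sample(range(len(examples)), n))
    return [examples[i] for i in indices]
```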
### Training Configuration
- Learning rate, batch sizes, training steps
- Multi-GPU vs. single-GPU settings
- Evaluation and logging parameters

### Environment Configuration
- CPU core limits
- Environment variables for optimization

### Pushing to Hub
- Pushing to the Hugging Face Hub is disabled by default. You can enable it by setting `push_to_hub: true` in your config file.

## Usage

### Basic Usage
```bash
python finetune.py --config config.yaml
```

### Custom Configuration
```bash
python finetune.py --config my_custom_config.yaml
```

### Multi-GPU Training
Because the training set is very small (around 2.5 hours of audio), multi-GPU training is not recommended.

## Configuration File Structure

The `config.yaml` file is organized into the following sections:

1. **model**: Model checkpoint and sequence length settings
2. **output**: Output directory configuration
3. **environment**: Environment variables and CPU settings
4. **audio**: Audio processing settings (sampling rate)
5. **languages**: Cebuano language configuration
6. **datasets**: Google FLEURS Cebuano dataset configuration
7. **training**: All training hyperparameters
8. **data_processing**: Data processing settings

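Put together, the file might be shaped roughly like this (a hedged sketch: the section names come from the list above, but the individual keys and values are illustrative assumptions, not the shipped defaults):

```yaml
model:
  checkpoint: openai/whisper-large-v3   # per the Model Configuration section
  max_target_length: 225                # illustrative value
output:
  dir: ./whisper-cebuano                # illustrative path
environment:
  num_cpu_cores: 8                      # illustrative limit
audio:
  sampling_rate: 16000                  # Whisper's expected input rate
languages:
  - ceb
datasets:
  name: google/fleurs
  config: ceb_ph
  train_subset_ratio: 0.25              # 25% of data, as noted above
training:
  learning_rate: 1.0e-5                 # illustrative hyperparameters
  per_device_train_batch_size: 8
  max_steps: 2000
  push_to_hub: false                    # disabled by default, per the text
data_processing:
  on_the_fly: true
```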
## Customizing Your Training

### Adjusting Training Parameters
Modify the `training` section in `config.yaml`:
- Change the learning rate, batch sizes, or training steps
- Adjust evaluation frequency
- Configure multi-GPU settings

### Environment Optimization
Adjust the `environment` section to optimize for your system:
- Set CPU core limits
- Configure memory usage settings

## Training Commands

### Basic Training (single GPU)
```bash
python finetune.py
```

## Inference Guide

After training your model, you can use the provided `inference.py` script for speech recognition:

```bash
python inference.py
```

The inference script includes:
- Model loading from the trained checkpoint
- Audio preprocessing pipeline
- Text generation with proper formatting
- Support for Cebuano language transcription

### Using the Trained Model

The inference script automatically handles:
- Loading the fine-tuned model weights
- Audio preprocessing at the proper sampling rate
- Generating transcriptions for Cebuano speech
- Output formatting for evaluation metrics

## Dependencies

Install the required packages:
```bash
pip install -r requirements.txt
```

Key dependencies:
- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics)

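Configuration loading with PyYAML might look like this (a minimal sketch; the default path and key names are assumptions, not the exact `finetune.py` code):

```python
import yaml

def load_config(path="config.yaml"):
    """Parse the YAML configuration file into a plain dict."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)
```

Sections are then accessed as nested dicts, e.g. `load_config()["training"]["learning_rate"]` (key names illustrative).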
## Evaluation Results
| Language | Metric | Fine-tuned | Zero-shot |
|----------|:------:|-----------:|----------:|
| Cebuano  | WER    | 16.10%     | 47.33%    |
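For reference, the WER reported above follows the standard edit-distance formulation. A minimal illustration (the project itself uses the `evaluate` library, not this code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```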

**Note**: If you encounter issues running `finetune.py`, you can fall back to the `finetune-backup.py` file, which contains the original hardcoded configuration.