---
license: cc-by-nc-3.0
datasets:
- google/fleurs
metrics:
- wer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---
# Whisper Fine-tuning for Cebuano Language

This project provides a configurable pipeline for fine-tuning OpenAI's Whisper model on the Cebuano language using the Google FLEURS dataset (`ceb_ph`).

## Features

- **Flexible Configuration**: All parameters are configurable through YAML files
- **Multi-GPU Support**: Automatic detection and support for multiple GPUs
- **Dynamic Language Selection**: Train on any subset of supported languages
- **On-the-fly Processing**: Efficient memory usage with dynamic audio preprocessing (see the sketch after this list)
- **Comprehensive Evaluation**: Automatic evaluation on test sets
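
As a rough illustration of the on-the-fly processing mentioned above, here is a minimal sketch using Hugging Face Datasets' `set_transform`; the exact column names and preprocessing steps in `finetune.py` may differ.

```python
from datasets import load_dataset, Audio
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")

dataset = load_dataset("google/fleurs", "ceb_ph", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    # Compute log-mel features and label ids at access time, so features
    # never have to be materialized for the whole dataset up front.
    audio = [a["array"] for a in batch["audio"]]
    batch["input_features"] = processor(
        audio, sampling_rate=16_000
    ).input_features
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch

# set_transform applies `prepare` lazily on every access instead of
# precomputing and caching features in advance.
dataset.set_transform(prepare)
```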

## Configuration

All parameters are configurable through the `config.yaml` file. This configuration is specifically set up for Cebuano language training using the Google FLEURS dataset.

### Model Configuration
- Model checkpoint (default: `openai/whisper-large-v3`)
- Maximum target length for generated sequences (see the sketch below)
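
For reference, these two settings map onto standard `transformers` calls roughly as follows; this is a sketch, and the actual key names in `config.yaml` may differ.

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_checkpoint = "openai/whisper-large-v3"
max_target_length = 448  # illustrative; Whisper's decoder supports at most 448 tokens

model = WhisperForConditionalGeneration.from_pretrained(model_checkpoint)
processor = WhisperProcessor.from_pretrained(model_checkpoint)

# Cap generated sequence length to the configured maximum.
model.generation_config.max_length = max_target_length
```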

### Dataset Configuration
- Uses Google FLEURS Cebuano (ceb_ph) dataset
- Dataset sources and splits
- Language-specific settings
- Training subset ratio (25% of the data for faster training; see the loading sketch below)
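
Loading the dataset and taking the 25% training subset might look like the following sketch; the shuffling seed and selection strategy are assumptions, not the project's exact code.

```python
from datasets import load_dataset

fleurs_ceb = load_dataset("google/fleurs", "ceb_ph")

# Keep 25% of the training split for faster experimentation.
train = fleurs_ceb["train"].shuffle(seed=42)
train = train.select(range(int(0.25 * len(train))))

print(f"{len(train)} training examples, {len(fleurs_ceb['test'])} test examples")
```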

### Training Configuration
- Learning rate, batch sizes, training steps
- Multi-GPU vs single GPU settings
- Evaluation and logging parameters (see the sketch below)
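
These hyperparameters typically end up in a `Seq2SeqTrainingArguments` object; the values below are illustrative placeholders, not the ones shipped in `config.yaml`.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-ceb",  # hypothetical output directory
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    max_steps=2000,
    eval_strategy="steps",  # named evaluation_strategy in older transformers releases
    eval_steps=200,
    logging_steps=50,
    predict_with_generate=True,
    fp16=True,
)
```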

### Environment Configuration
- CPU core limits
- Environment variables for optimization

### Pushing to Hub
- By default, the configuration does not push to the Hugging Face Hub. You can enable pushing by setting `push_to_hub: true` in your config file.

## Usage

### Basic Usage
```bash
python finetune.py --config config.yaml
```

### Custom Configuration
```bash
python finetune.py --config my_custom_config.yaml
```

### Multi-GPU Training
Since there is very little training data (around 2.5 hours), multi-GPU training is not recommended.

## Configuration File Structure

The `config.yaml` file is organized into the following sections (a loading sketch follows the list):

1. **model**: Model checkpoint and sequence length settings
2. **output**: Output directory configuration
3. **environment**: Environment variables and CPU settings
4. **audio**: Audio processing settings (sampling rate)
5. **languages**: Cebuano language configuration
6. **datasets**: Google FLEURS Cebuano dataset configuration
7. **training**: All training hyperparameters
8. **data_processing**: Data processing settings
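
Loading such a file with PyYAML is straightforward; the nested key names below are assumptions about the file's layout, not guaranteed to match it.

```python
import yaml  # provided by the PyYAML package

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Each top-level key corresponds to one of the sections listed above.
model_cfg = config["model"]
training_cfg = config["training"]
print(model_cfg.get("checkpoint", "openai/whisper-large-v3"))  # "checkpoint" is an assumed key name
```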

## Customizing Your Training

### Adjusting Training Parameters
Modify the `training` section in `config.yaml`:
- Change learning rate, batch sizes, or training steps
- Adjust evaluation frequency
- Configure multi-GPU settings

### Environment Optimization
Adjust the `environment` section to optimize for your system:
- Set CPU core limits
- Configure memory usage settings (see the sketch below)
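
A minimal sketch of such environment tuning, assuming the limits are applied before any heavy libraries spin up their thread pools; the exact variables `finetune.py` sets may differ.

```python
import os

# Thread limits must be set before torch/numpy initialize their pools.
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["MKL_NUM_THREADS"] = "8"
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # silence tokenizer fork warnings

import torch

torch.set_num_threads(8)
```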

## Training Commands

### Basic Training
```bash
python finetune.py
```

When run without an explicit `--config` argument, the script falls back to the default `config.yaml`. Given the small dataset, single-GPU training is the recommended setup.

## Inference Guide

After training your model, you can use the provided `inference.py` script for speech recognition:

```bash
python inference.py
```

The inference script includes:
- Model loading from the trained checkpoint
- Audio preprocessing pipeline
- Text generation with proper formatting
- Support for Cebuano language transcription

### Using the Trained Model

The inference script automatically handles:
- Loading the fine-tuned model weights
- Audio preprocessing with proper sampling rate
- Generating transcriptions for Cebuano speech
- Output formatting for evaluation metrics (see the pipeline sketch below)
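
If you prefer not to use `inference.py` directly, a fine-tuned checkpoint can also be loaded with the standard `transformers` pipeline; `./whisper-large-v3-ceb` and `sample_cebuano.wav` below are hypothetical paths.

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="./whisper-large-v3-ceb",  # hypothetical path to your fine-tuned weights
    chunk_length_s=30,               # handle clips longer than Whisper's 30 s window
)

# Accepts a path to an audio file; resampling is handled internally.
result = asr("sample_cebuano.wav")
print(result["text"])
```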

## Dependencies

Install required packages:
```bash
pip install -r requirements.txt
```

Key dependencies:
- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics; see the sketch below)
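
The WER and CER numbers reported below can be computed with the `evaluate` library; the strings here are toy examples, not real model output.

```python
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

predictions = ["nindot ang adlaw karon"]       # toy hypothesis
references = ["nindot kaayo ang adlaw karon"]  # toy reference

print("WER:", wer.compute(predictions=predictions, references=references))
print("CER:", cer.compute(predictions=predictions, references=references))
```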

## Zero-shot Results
WER of the base `openai/whisper-large-v3` model on the Cebuano test set before fine-tuning. Whisper has no native Cebuano language token, so decoding uses a borrowed language ID or automatic language detection:

| Language ID | Metric | Error Rate |
|-------------|:------:|-----------:|
| Khmer       |  WER   |       355% |
| Tagalog     |  WER   |     40.14% |
| Auto        |  WER   |     40.13% |

## Evaluation Results
CER of the fine-tuned model on the same test set:

| Language ID | Metric | Error Rate |
|-------------|:------:|-----------:|
| Tagalog     |  CER   |     13.42% |
| Auto        |  CER   |     13.40% |


**Note**: If you encounter issues running `finetune.py`, you can use the `finetune-backup.py` file, which contains the original hardcoded configuration.