pengyizhou committed
Commit b1b221d · 1 Parent(s): 702a986

update README

Files changed (5)
  1. README.md +138 -0
  2. config.yaml +107 -0
  3. finetune-backup.py +243 -0
  4. finetune.py +673 -0
  5. inference.py +89 -0
README.md ADDED
# Whisper Fine-tuning for Khmer Language

This project provides a configurable way to fine-tune OpenAI's Whisper model specifically on the Khmer language using the Google FLEURS dataset (km_kh).

## Features

- **Flexible Configuration**: All parameters are configurable through YAML files
- **Multi-GPU Support**: Automatic detection and support for multiple GPUs
- **Dynamic Language Selection**: Train on any subset of supported languages
- **On-the-fly Processing**: Efficient memory usage with dynamic audio preprocessing
- **Comprehensive Evaluation**: Automatic evaluation on test sets

## Configuration

All parameters are configurable through the `config.yaml` file. The provided configuration is set up specifically for Khmer language training using the Google FLEURS dataset.

### Model Configuration
- Model checkpoint (default: `openai/whisper-large-v3`)
- Maximum target length for sequences

### Dataset Configuration
- Uses the Google FLEURS Khmer (km_kh) dataset
- Dataset sources and splits
- Language-specific settings
- Training subset ratio (25% of the data, for faster training)

### Training Configuration
- Learning rate, batch sizes, training steps
- Multi-GPU vs. single-GPU settings
- Evaluation and logging parameters

### Environment Configuration
- CPU core limits
- Environment variables for optimization

### Pushing to Hub
- I have set the configuration to not push to the Hugging Face Hub by default; you can enable this by setting `push_to_hub: true` in the `training` section of your config file.

## Usage

### Basic Usage
```bash
python finetune.py --config config.yaml
```

### Custom Configuration
```bash
python finetune.py --config my_custom_config.yaml
```

### Multi-GPU Training
Since there is only a small amount of training data (around 2.5 hours), multi-GPU training is not recommended. If you still want to try it, the docstring in `finetune.py` describes launching with `torchrun --nproc_per_node=2` or `accelerate launch`.

## Configuration File Structure

The `config.yaml` file is organized into the following sections (a minimal sketch of how the script reads them follows this list):

1. **model**: Model checkpoint and sequence length settings
2. **output**: Output directory configuration
3. **environment**: Environment variables and CPU settings
4. **audio**: Audio processing settings (sampling rate)
5. **languages**: Khmer language configuration
6. **datasets**: Google FLEURS Khmer dataset configuration
7. **training**: All training hyperparameters
8. **data_processing**: Data processing settings

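The snippet below is a minimal sketch of how `finetune.py` consumes this file: it loads the YAML with PyYAML (`yaml.safe_load`, as in the script's `load_config`) and reads a few of the keys listed above. The printouts are illustrative only; the section and key names match the provided `config.yaml`.

```python
# Minimal sketch: load config.yaml the same way finetune.py does (yaml.safe_load)
# and read a few of the documented sections. The print statements are illustrative.
import yaml

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

print(config["model"]["checkpoint"])                     # e.g. "openai/whisper-large-v3"
print(config["output"]["output_dir"])                    # e.g. "./whisper-fleurs-km_kh-small"
print(config["training"]["learning_rate"])               # e.g. 1.0e-5
print(config["languages"]["khmer"]["fleurs_language"])   # "km_kh"
```
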
## Customizing Your Training

### Adjusting Training Parameters
Modify the `training` section in `config.yaml` (a sketch for generating a variant config follows this section):
- Change learning rate, batch sizes, or training steps
- Adjust evaluation frequency
- Configure multi-GPU settings

### Environment Optimization
Adjust the `environment` section to optimize for your system:
- Set CPU core limits
- Configure memory usage settings

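As a hedged example of the "Custom Configuration" workflow above, the sketch below copies `config.yaml`, overrides a couple of `training` values, and writes a new file that can be passed to `finetune.py --config`. The overridden values and the `my_custom_config.yaml` filename are illustrations, not recommendations.

```python
# Illustrative sketch: derive a custom config from config.yaml using PyYAML.
# The overridden values and the output filename are examples only.
import yaml

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

config["training"]["learning_rate"] = 5.0e-6   # smaller learning rate
config["training"]["max_steps"] = 1600         # train for more steps
config["training"]["eval_steps"] = 200         # evaluate less often

with open("my_custom_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# Then run: python finetune.py --config my_custom_config.yaml
```
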
## Training Commands

### Basic Training
The provided `config.yaml` is specifically configured for Khmer language training on the Google FLEURS dataset, and `finetune.py` falls back to it when no `--config` argument is given, so basic (single-GPU) training is simply:

```bash
python finetune.py
```

## Inference Guide

After training your model, you can use the provided `inference.py` script for speech recognition:

```bash
python inference.py
```

The inference script includes:
- Model loading from the trained checkpoint
- Audio preprocessing pipeline
- Text generation with proper formatting
- Support for Khmer language transcription

### Using the Trained Model

The inference script automatically handles the following (a standalone single-file sketch is shown after this list):
- Loading the fine-tuned model weights
- Audio preprocessing with the proper sampling rate
- Generating transcriptions for Khmer speech
- Output formatting for evaluation metrics

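The following is a minimal, hedged sketch of transcribing a single Khmer audio file with the fine-tuned weights, modeled on what `inference.py` does with the `transformers` pipeline. The model ID matches the one used in `inference.py`; `audio.wav` is a placeholder path.

```python
# Minimal transcription sketch based on inference.py: load the fine-tuned model,
# reuse the base Whisper processor for Khmer, and run the ASR pipeline on one file.
# "audio.wav" is a placeholder; change model_id if you trained your own checkpoint.
import torch
from transformers import AutoModelForSpeechSeq2Seq, WhisperProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "pengyizhou/whisper-fleurs-km_kh"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=dtype, use_safetensors=True)
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3", language="khmer")

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=dtype,
    chunk_length_s=30,
    device=device,
)

print(asr("audio.wav")["text"])
```
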
## Dependencies

Install required packages:
```bash
pip install -r requirements.txt
```

Key dependencies:
- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics)

## Evaluation Results

| Language | Metric | Error Rate |
|----------|:------:|-----------:|
| Khmer    |  CER   |     33.18% |

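For reference, the error rate is computed the way `inference.py` does it: corpus-level CER (and WER) with `jiwer` over lower-cased, stripped reference/prediction pairs. A condensed sketch with placeholder strings:

```python
# Sketch of the metric computation used in inference.py: corpus-level CER/WER
# with jiwer over normalized (lower-cased, stripped) reference/prediction pairs.
from jiwer import cer, wer

refs  = ["សួស្តី", "អរគុណ"]   # placeholder reference transcriptions
preds = ["សួស្តី", "អរគុន"]   # placeholder predictions (second has a character error)

refs  = [r.lower().strip() for r in refs]
preds = [p.lower().strip() for p in preds]

print(f"CER: {cer(refs, preds) * 100:.2f}%")
print(f"WER: {wer(refs, preds) * 100:.2f}%")
```
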
**Note**: If you encounter issues running `finetune.py`, you can use the `finetune-backup.py` file, which contains the original hardcoded configuration.
config.yaml ADDED
# Configuration for training only on Khmer (FLEURS km_kh) data
# Fine-tuning Whisper on the Khmer language using the Google FLEURS dataset

# Model Configuration
model:
  checkpoint: "openai/whisper-large-v3"
  max_target_length: 448

# Output Configuration
output:
  output_dir: "./whisper-fleurs-km_kh-small"

# Environment Configuration
environment:
  max_cpu_cores: 20
  test_cpu_cores: 20
  omp_num_threads: "20"
  mkl_num_threads: "20"
  openblas_num_threads: "20"
  veclib_maximum_threads: "20"
  numexpr_num_threads: "20"
  tokenizers_parallelism: "false"
  transformers_no_tf: "1"

# Audio Processing Configuration
audio:
  sampling_rate: 16000

# Language Configurations - Khmer only
languages:

  khmer:
    whisper_language: "khmer"
    fleurs_language: "km_kh"
    text_key: "transcription"
    train_subset_ratio: 0.25  # Use only 25% of the training data for faster training/experimentation

# Dataset Configurations - Khmer FLEURS
datasets:

  khmer:
    source: "google/fleurs"
    language_code: "km_kh"
    splits:
      train: "train"
      validation: "validation"
      test: "test"
    trust_remote_code: true

# Training Configuration
training:
  # Basic training parameters
  learning_rate: 1.0e-5
  warmup_steps: 100
  max_steps: 800

  # Batch size and accumulation
  # Note: finetune.py reads a `multi_gpu:` block with the same keys when more
  # than one GPU is visible; only the single-GPU settings are provided here.
  single_gpu:
    per_device_train_batch_size: 16
    per_device_eval_batch_size: 16
    gradient_accumulation_steps: 1

  # Optimization settings
  gradient_checkpointing: true
  fp16: true

  # Evaluation settings
  eval_strategy: "steps"
  eval_steps: 100
  predict_with_generate: true
  generation_max_length: 225

  # Saving and logging
  save_steps: 100
  logging_steps: 25
  save_total_limit: 3

  # Model selection
  load_best_model_at_end: true
  metric_for_best_model: "cer"  # Using CER for Khmer (character-based language)
  greater_is_better: false

  # Reporting
  report_to:
    - "tensorboard"

  # Hub settings
  push_to_hub: false

  # Multi-GPU specific settings
  dataloader_drop_last: true
  ddp_find_unused_parameters: false

# Data Processing Configuration
data_processing:
  # Random seed for reproducibility
  seed: 42

  # Columns to remove during standardization
  columns_to_remove:
    - "id"
    - "num_samples"
    - "path"
    - "speaker_id"
    - "chapter_id"
    - "segment_id"
finetune-backup.py ADDED
#!/usr/bin/env python
# finetune-backup.py (original hardcoded configuration)

"""
Fine-tune openai/whisper-large-v3 on the FLEURS Khmer dataset (config: "km_kh").
Based on the Hugging Face blog: https://huggingface.co/blog/fine-tune-whisper
"""

import os

os.environ["TRANSFORMERS_NO_TF"] = "1"

import torch
from datasets import load_dataset, Audio
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)
import ipdb
import evaluate


from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyways
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


# → Choose device (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Configuration
LANGUAGE = "km_kh"           # FLEURS config for Khmer
LANGUAGE_WHISPER = "khmer"   # Whisper config for Khmer
MODEL_CHECKPOINT = "openai/whisper-large-v3"
OUTPUT_DIR = f"./whisper-fleurs-{LANGUAGE}-small"
TRAIN_SPLIT = "train"
VALID_SPLIT = "validation"
TEST_SPLIT = "test"
MAX_TARGET_LENGTH = 448

# 2. Load FLEURS Dataset (audio at 16 kHz)
raw_datasets = load_dataset("google/fleurs", LANGUAGE,
                            split={"train": TRAIN_SPLIT,
                                   "validation": VALID_SPLIT,
                                   "test": TEST_SPLIT})


# Cast “audio” column to 16 kHz
for split in ["train", "validation", "test"]:
    raw_datasets[split] = raw_datasets[split].cast_column("audio", Audio(sampling_rate=16_000))

# Subsample the training split for faster experimentation
# (this keeps the 75% "test" side of the split; finetune.py's config keeps 25%)
raw_datasets["train"] = raw_datasets["train"].train_test_split(test_size=0.75, seed=42)["test"]

# 3. Load Whisper Processor & Model
processor = WhisperProcessor.from_pretrained(MODEL_CHECKPOINT, language=LANGUAGE_WHISPER)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_CHECKPOINT)
model.to(device)

# 4. Preprocessing Function
#    - Extract log-Mel features from audio
#    - Tokenize the target transcription
def preprocess_batch(batch):
    # batch["audio"] is a list of {"array": np.ndarray, ...} entries at 16 kHz
    audio_arrays = [example["array"] for example in batch["audio"]]
    # 4a. Feature extraction (log-Mel + normalization)
    inputs = processor.feature_extractor(
        audio_arrays,
        sampling_rate=16_000,
        return_tensors="pt"
    )
    # 4b. Tokenize (labels) using the Whisper tokenizer
    #     We could prefix with a target-language ID token if necessary,
    #     but for FLEURS the default Whisper language-ID tokens should suffice.
    labels = processor.tokenizer(
        batch["transcription"],
        return_tensors="pt",
        padding="longest",
        truncation=True,
        max_length=MAX_TARGET_LENGTH
    )
    # ipdb.set_trace()
    # keep the keys the trainer expects:
    inputs["labels"] = labels.input_ids
    return inputs

# 5. Apply preprocessing to train/validation/test
#    - Remove all non-audio columns after mapping
train_dataset = raw_datasets["train"].map(
    preprocess_batch,
    remove_columns=raw_datasets["train"].column_names,
    batched=True,
    batch_size=16,  # adjust batch_size to your memory
)

# ipdb.set_trace()
eval_dataset = raw_datasets["validation"].map(
    preprocess_batch,
    remove_columns=raw_datasets["validation"].column_names,
    batched=True,
    batch_size=8,
)

test_dataset = raw_datasets["test"].map(
    preprocess_batch,
    remove_columns=raw_datasets["test"].column_names,
    batched=True,
    batch_size=8,
)

# 6. Data Collator
#    This pads input_features and labels to the maximum length in the batch,
#    and replaces padding token IDs in labels with -100 to ignore them in the loss computation.
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

# 7. Metrics: WER & CER (using Hugging Face Evaluate)
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def compute_metrics(pred):
    """
    pred.predictions: raw token IDs from generate()
    pred.label_ids: token IDs used as labels
    """
    # 7a. decode predictions → strings
    pred_ids = pred.predictions
    # ensure we skip special tokens
    pred_str = processor.batch_decode(pred_ids,
                                      skip_special_tokens=True)
    # 7b. decode references → strings, replacing -100 with the padding token ID
    label_ids = pred.label_ids
    # replace -100 with pad_token_id so that the tokenizer does not crash
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    ref_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # lowercase & strip
    pred_str = [s.lower().strip() for s in pred_str]
    ref_str = [s.lower().strip() for s in ref_str]

    wer_score = wer_metric.compute(predictions=pred_str, references=ref_str)
    cer_score = cer_metric.compute(predictions=pred_str, references=ref_str)
    return {"wer": wer_score, "cer": cer_score}


"""
# 8. Training Arguments (earlier epoch-based variant, kept for reference)
training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=4,   # reduce if you OOM; or increase if large GPU
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,   # to simulate a larger batch
    evaluation_strategy="steps",
    eval_steps=500,                  # evaluate every 500 steps
    logging_steps=250,
    save_steps=1000,
    num_train_epochs=3,
    learning_rate=1e-5,
    warmup_steps=500,
    fp16=True,                       # use mixed precision if supported
    predict_with_generate=True,      # for computing WER/CER we need generate()
    save_total_limit=2,
    push_to_hub=False,
)
"""
training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=100,
    max_steps=800,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=448,
    save_steps=100,
    eval_steps=100,
    logging_steps=10,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True
)


# 9. Initialize Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,  # feature_extractor + tokenizer packed in processor
    compute_metrics=compute_metrics,
)

# 10. Fine-tune
if __name__ == "__main__":
    # 10a. Train
    trainer.train()

    # 10b. Evaluate on TEST split
    print("\n***** Evaluating on TEST split *****")
    test_metrics = trainer.predict(test_dataset, metric_key_prefix="test")
    print(f"Test WER: {test_metrics.metrics['test_wer']*100:.2f}%")
    print(f"Test CER: {test_metrics.metrics['test_cer']*100:.2f}%")
finetune.py ADDED
#!/usr/bin/env python
# finetune_whisper_mix_datasets.py

"""
Fine-tune openai/whisper-large-v3 on mixed datasets from different languages:
- FLEURS Cebuano (ceb_ph)
- FLEURS Khmer (km_kh)
- Switchboard1 English
- WenetSpeech Chinese
- Eng-Indon-CS
- Eng-Malay-CS
Based on the Hugging Face blog: https://huggingface.co/blog/fine-tune-whisper

To run this script on multiple GPUs, you have several options:

1. **Automatic Multi-GPU (DataParallel-style):**
   python finetune_whisper_mix_datasets.py

   The script will automatically detect and use all available GPUs.

2. **Distributed Training with torchrun (Recommended for 2+ GPUs):**
   torchrun --nproc_per_node=2 finetune_whisper_mix_datasets.py

   This uses DistributedDataParallel, which is more efficient.

3. **Distributed Training with accelerate (Alternative):**
   accelerate launch --num_processes=2 finetune_whisper_mix_datasets.py

   Requires: pip install accelerate

Note: With 2 GPUs, the effective batch size becomes:
   per_device_batch_size * num_gpus * gradient_accumulation_steps
   = 24 * 2 * 1 = 48 (compared to 32 with a single GPU)

CPU Core Limiting:
   The script automatically limits CPU usage to 20 cores using environment variables.
   You can also set these manually before running:
   export OMP_NUM_THREADS=20
   export MKL_NUM_THREADS=20
   export NUMEXPR_NUM_THREADS=20
   python finetune_whisper_mix_datasets.py
"""

import os
import random
import io
import yaml
import argparse
from itertools import chain

# Load configuration from YAML file
def load_config(config_path):
    with open(config_path, 'r') as file:
        return yaml.safe_load(file)

# Parse command line arguments
parser = argparse.ArgumentParser(description='Fine-tune Whisper on mixed datasets')
parser.add_argument('--config', type=str, default='config.yaml',
                    help='Path to configuration YAML file')
args = parser.parse_args()

# Load configuration
config = load_config(args.config)

# Set environment variables from config
env_config = config['environment']
os.environ["OMP_NUM_THREADS"] = env_config['omp_num_threads']
os.environ["MKL_NUM_THREADS"] = env_config['mkl_num_threads']
os.environ["OPENBLAS_NUM_THREADS"] = env_config['openblas_num_threads']
os.environ["VECLIB_MAXIMUM_THREADS"] = env_config['veclib_maximum_threads']
os.environ["NUMEXPR_NUM_THREADS"] = env_config['numexpr_num_threads']
os.environ["TOKENIZERS_PARALLELISM"] = env_config['tokenizers_parallelism']
os.environ["TRANSFORMERS_NO_TF"] = env_config['transformers_no_tf']

import torch
from datasets import load_dataset, Audio, concatenate_datasets, Dataset
from torch.utils.data import Dataset as TorchDataset
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)
import ipdb
import evaluate
import numpy as np

# Multi-GPU setup
if torch.cuda.device_count() > 1:
    print(f"Setting up for {torch.cuda.device_count()} GPUs")
    # Enable distributed training environment variables if not already set
    if "LOCAL_RANK" not in os.environ:
        os.environ["LOCAL_RANK"] = "0"
    if "WORLD_SIZE" not in os.environ:
        os.environ["WORLD_SIZE"] = str(torch.cuda.device_count())


from dataclasses import dataclass
from typing import Any, Dict, List, Union

class WhisperOnTheFlyDataset(TorchDataset):
    """Custom dataset that preprocesses audio on-the-fly during training"""

    def __init__(self, dataset, processors, main_processor, max_target_length, audio_config):
        self.dataset = dataset
        self.processors = processors
        self.main_processor = main_processor
        self.max_target_length = max_target_length
        self.sampling_rate = audio_config['sampling_rate']

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        # Process audio
        audio_sample = item["audio"]
        audio_data = self._process_audio(audio_sample)

        # Extract features with the main processor
        inputs = self.main_processor.feature_extractor(
            audio_data,
            sampling_rate=self.sampling_rate,
            return_tensors="pt"
        )

        # Pick the text field for the language
        lang = item["language"]
        if lang in ["cebuano", "khmer"]:
            text = item["transcription"]
        else:  # english, chinese
            text = item["text"]

        # Tokenize with the appropriate processor
        if lang == "cebuano":
            labels = self.processors["cebuano"].tokenizer(
                text,
                return_tensors="pt",
                padding=False,
                truncation=True,
                max_length=self.max_target_length
            )
        elif lang == "khmer":
            labels = self.processors["khmer"].tokenizer(
                text,
                return_tensors="pt",
                padding=False,
                truncation=True,
                max_length=self.max_target_length
            )
        elif lang == "english":
            labels = self.processors["english"].tokenizer(
                text,
                return_tensors="pt",
                padding=False
            )
        elif lang == "chinese":
            labels = self.processors["chinese"].tokenizer(
                text,
                return_tensors="pt",
                padding=False
            )
        elif lang == "indonesian":
            labels = self.processors["indonesian"].tokenizer(
                text,
                return_tensors="pt",
                padding=False
            )
        else:  # Malay
            labels = self.processors["malay"].tokenizer(
                text,
                return_tensors="pt",
                padding=False
            )

        return {
            "input_features": inputs.input_features.squeeze(0),
            "labels": labels.input_ids.squeeze(0),
            "language": lang
        }

    def _process_audio(self, audio_sample):
        """Process audio sample into a numpy array"""
        import librosa

        if isinstance(audio_sample, dict):
            if "array" in audio_sample:
                return audio_sample["array"]
            elif "bytes" in audio_sample and audio_sample["bytes"] is not None:
                audio_array, _ = librosa.load(io.BytesIO(audio_sample["bytes"]), sr=self.sampling_rate)
                return audio_array
            elif "path" in audio_sample:
                audio_array, _ = librosa.load(audio_sample["path"], sr=self.sampling_rate)
                return audio_array
            else:
                raise ValueError(f"Unknown audio dict format: {audio_sample.keys()}")
        elif isinstance(audio_sample, str):
            audio_array, _ = librosa.load(audio_sample, sr=self.sampling_rate)
            return audio_array
        else:
            return audio_sample

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyways
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

# → Choose device (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Extract configuration values
MODEL_CHECKPOINT = config['model']['checkpoint']
OUTPUT_DIR = config['output']['output_dir']
MAX_TARGET_LENGTH = config['model']['max_target_length']

# CPU usage configuration for dataset preprocessing
MAX_CPU_CORES = config['environment']['max_cpu_cores']
TEST_CPU_CORES = config['environment']['test_cpu_cores']

# Language configurations for each dataset
DATASET_CONFIGS = config['languages']

print("Loading datasets...")

# Load datasets for each language dynamically based on configuration
datasets = {}
dataset_configs = config['datasets']
audio_config = config['audio']

# Get the list of enabled languages from both the languages and datasets config
enabled_languages = set(config['languages'].keys()) & set(config['datasets'].keys())
print(f"Enabled languages: {list(enabled_languages)}")

def load_fleurs_dataset(lang_name, lang_config, dataset_config):
    """Load FLEURS dataset for a language"""
    print(f"Loading FLEURS {lang_name.title()}...")
    lang_datasets = load_dataset(
        dataset_config['source'],
        dataset_config['language_code'],
        split={k: v for k, v in dataset_config['splits'].items()},
        trust_remote_code=dataset_config['trust_remote_code']
    )
    # DON'T decode audio yet - keep it compressed until preprocessing
    for split in dataset_config['splits'].keys():
        lang_datasets[split] = lang_datasets[split].cast_column("audio", Audio(sampling_rate=audio_config['sampling_rate'], decode=False))

    # Use a subset of the training data if specified
    if 'train_subset_ratio' in lang_config:
        train_subset_ratio = lang_config['train_subset_ratio']
        lang_datasets["train"] = lang_datasets["train"].train_test_split(test_size=1-train_subset_ratio, seed=config['data_processing']['seed'])["train"]

    return lang_datasets

def load_simple_dataset(lang_name, dataset_config):
    """Load a simple dataset with train/validation/test splits"""
    print(f"Loading {lang_name.title()}...")
    lang_dataset = load_dataset(dataset_config['source'], split={k: v for k, v in dataset_config['splits'].items()})
    return lang_dataset

def load_english_dataset(lang_config, dataset_config):
    """Load the English dataset with a custom train/validation split"""
    print("Loading English...")
    swb_train = load_dataset(dataset_config['train_dataset'], split=dataset_config['train_split'], streaming=dataset_config['streaming'])
    swb_test = load_dataset(dataset_config['test_dataset'], split=dataset_config['test_split'], streaming=dataset_config['streaming'])
    # Split into train/validation
    validation_size = lang_config['validation_size']
    swb_val = swb_train.take(validation_size)
    swb_train = swb_train.skip(validation_size)
    return {
        "train": swb_train,
        "validation": swb_val,
        "test": swb_test
    }

def load_chinese_dataset(dataset_config):
    """Load the Chinese dataset with multiple test splits"""
    print("Loading Chinese...")
    wenet_train = load_dataset(dataset_config['train_dataset'], streaming=dataset_config['streaming'])
    wenet_valid = load_dataset(dataset_config['validation_dataset'], dataset_config['validation_config'], split="validation", streaming=dataset_config['streaming'])
    wenet_testnet = load_dataset(dataset_config['test_net_dataset'], dataset_config['test_net_config'], split="test", streaming=dataset_config['streaming'])
    wenet_testmeeting = load_dataset(dataset_config['test_meeting_dataset'], dataset_config['test_meeting_config'], split="test", streaming=dataset_config['streaming'])
    return {
        "train": wenet_train["train"],
        "validation": wenet_valid,
        "test_net": wenet_testnet,
        "test_meeting": wenet_testmeeting
    }

# Load datasets for each enabled language
for lang in enabled_languages:
    lang_config = config['languages'][lang]
    dataset_config = dataset_configs[lang]

    if lang in ['cebuano', 'khmer']:
        # FLEURS datasets
        datasets[lang] = load_fleurs_dataset(lang, lang_config, dataset_config)
    elif lang == 'english':
        # English with custom validation split
        datasets[lang] = load_english_dataset(lang_config, dataset_config)
    elif lang == 'chinese':
        # Chinese with multiple test splits
        datasets[lang] = load_chinese_dataset(dataset_config)
    elif lang in ['indonesian', 'malay']:
        # Simple datasets with standard splits
        datasets[lang] = load_simple_dataset(lang, dataset_config)
    else:
        print(f"Warning: Unknown language {lang}, treating as simple dataset")
        datasets[lang] = load_simple_dataset(lang, dataset_config)

print("Setting up processors...")

# Create processors for each enabled language
processors = {}
for lang in enabled_languages:
    lang_config = config['languages'][lang]
    processors[lang] = WhisperProcessor.from_pretrained(
        MODEL_CHECKPOINT,
        language=lang_config["whisper_language"]
    )

# Use the first available processor as the main one, preferring English if available
if "english" in processors:
    main_processor = processors["english"]
elif processors:
    main_processor = processors[list(processors.keys())[0]]
else:
    raise ValueError("No processors created. Check your language configuration.")
model = WhisperForConditionalGeneration.from_pretrained(MODEL_CHECKPOINT)

# Multi-GPU handling
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs for training")
    # The model will be automatically distributed by the Trainer
    model.to(device)
else:
    model.to(device)
print("Adding language labels to raw datasets...")

# Remove existing language columns and add our own consistent language labels for each enabled language
for lang in enabled_languages:
    lang_datasets = datasets[lang]

    # Handle different dataset structures
    if isinstance(lang_datasets, dict):
        # Dataset with explicit splits (train/validation/test)
        for split_name, split_dataset in lang_datasets.items():
            if split_dataset is not None:
                # Remove existing language column if it exists
                columns_to_remove = [col for col in split_dataset.column_names if col.lower() in ["language", "lang"]]
                if columns_to_remove:
                    print(f"Removing existing language column(s) {columns_to_remove} from {lang} {split_name}")
                    datasets[lang][split_name] = split_dataset.remove_columns(columns_to_remove)

                # Add our consistent language label
                datasets[lang][split_name] = datasets[lang][split_name].add_column("language", [lang] * len(datasets[lang][split_name]))
    else:
        # Single dataset object - this shouldn't happen with the current structure but handle gracefully
        print(f"Warning: Unexpected dataset structure for {lang}")
        continue


print("Combining raw datasets before preprocessing...")

# Ensure all datasets have compatible schemas before concatenation
def standardize_dataset_schema(dataset, dataset_name):
    """Standardize dataset schema to ensure compatibility for concatenation"""
    print(f"Standardizing schema for {dataset_name}...")

    # Keep audio compressed until preprocessing - only set the sampling rate
    if "audio" in dataset.column_names:
        print(f"  Setting audio feature type to {audio_config['sampling_rate']}Hz (compressed) for {dataset_name}")
        dataset = dataset.cast_column("audio", Audio(sampling_rate=audio_config['sampling_rate'], decode=False))

    # Remove problematic columns that might have different types
    columns_to_remove = []
    for col in dataset.column_names:
        if col in config['data_processing']['columns_to_remove']:
            columns_to_remove.append(col)

    if columns_to_remove:
        print(f"  Removing incompatible columns: {columns_to_remove}")
        dataset = dataset.remove_columns(columns_to_remove)

    return dataset

# Standardize all training datasets dynamically
print("Standardizing training datasets...")
raw_train_datasets = []
for lang in enabled_languages:
    if "train" in datasets[lang]:
        std_dataset = standardize_dataset_schema(datasets[lang]["train"], f"{lang}_train")
        raw_train_datasets.append(std_dataset)

# Standardize validation datasets dynamically
print("Standardizing validation datasets...")
raw_val_datasets = []
for lang in enabled_languages:
    if "validation" in datasets[lang]:
        std_dataset = standardize_dataset_schema(datasets[lang]["validation"], f"{lang}_val")
        raw_val_datasets.append(std_dataset)

# Combine datasets if we have any
if raw_train_datasets:
    print("Combining training datasets...")
    combined_raw_train = concatenate_datasets(raw_train_datasets)
    combined_raw_train = combined_raw_train.shuffle(seed=config['data_processing']['seed'])
else:
    raise ValueError("No training datasets found. Check your configuration.")

if raw_val_datasets:
    print("Combining validation datasets...")
    combined_raw_val = concatenate_datasets(raw_val_datasets)
    combined_raw_val = combined_raw_val.shuffle(seed=config['data_processing']['seed'])
else:
    print("Warning: No validation datasets found. Training without validation.")
    combined_raw_val = None

print("Creating on-the-fly datasets (no preprocessing stored to disk)...")

# Create on-the-fly datasets instead of preprocessing and storing
combined_train_dataset = WhisperOnTheFlyDataset(
    combined_raw_train,
    processors,
    main_processor,
    MAX_TARGET_LENGTH,
    audio_config
)

# Only create a validation dataset if we have validation data
if combined_raw_val is not None:
    combined_val_dataset = WhisperOnTheFlyDataset(
        combined_raw_val,
        processors,
        main_processor,
        MAX_TARGET_LENGTH,
        audio_config
    )
else:
    combined_val_dataset = None

print("Creating on-the-fly test datasets...")

# Create on-the-fly test datasets dynamically
processed_datasets = {}

for lang in enabled_languages:
    processed_datasets[lang] = {}

    # Handle different test split structures for different languages
    if lang == "chinese":
        # Chinese has multiple test splits
        if "test_net" in datasets[lang]:
            processed_datasets[lang]["test_net"] = WhisperOnTheFlyDataset(
                datasets[lang]["test_net"],
                processors,
                main_processor,
                MAX_TARGET_LENGTH,
                audio_config
            )
        if "test_meeting" in datasets[lang]:
            processed_datasets[lang]["test_meeting"] = WhisperOnTheFlyDataset(
                datasets[lang]["test_meeting"],
                processors,
                main_processor,
                MAX_TARGET_LENGTH,
                audio_config
            )
    else:
        # Standard test split
        if "test" in datasets[lang]:
            processed_datasets[lang]["test"] = WhisperOnTheFlyDataset(
                datasets[lang]["test"],
                processors,
                main_processor,
                MAX_TARGET_LENGTH,
                audio_config
            )

# Data Collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=main_processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

# Metrics: WER & CER (using Hugging Face Evaluate)
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def compute_metrics(pred):
    """
    Compute WER and CER metrics for predictions
    """
    pred_ids = pred.predictions
    pred_str = main_processor.batch_decode(pred_ids, skip_special_tokens=True)

    label_ids = pred.label_ids
    label_ids[label_ids == -100] = main_processor.tokenizer.pad_token_id
    ref_str = main_processor.batch_decode(label_ids, skip_special_tokens=True)

    # lowercase & strip
    pred_str = [s.lower().strip() for s in pred_str]
    ref_str = [s.lower().strip() for s in ref_str]

    wer_score = wer_metric.compute(predictions=pred_str, references=ref_str)
    cer_score = cer_metric.compute(predictions=pred_str, references=ref_str)
    return {"wer": wer_score, "cer": cer_score}

# Check for multi-GPU setup
num_gpus = torch.cuda.device_count()
print(f"Number of available GPUs: {num_gpus}")

# Get training configuration
training_config = config['training']

# Adjust batch size and gradient accumulation for multi-GPU
if num_gpus > 1:
    # With multiple GPUs, use the multi-GPU configuration
    # (requires a `multi_gpu:` block under `training` in the config file)
    gpu_config = training_config['multi_gpu']
    per_device_batch_size = gpu_config['per_device_train_batch_size']
    per_device_eval_batch_size = gpu_config['per_device_eval_batch_size']
    gradient_accumulation_steps = gpu_config['gradient_accumulation_steps']
    print(f"Multi-GPU training detected. Using {num_gpus} GPUs.")
else:
    # Single GPU configuration
    gpu_config = training_config['single_gpu']
    per_device_batch_size = gpu_config['per_device_train_batch_size']
    per_device_eval_batch_size = gpu_config['per_device_eval_batch_size']
    gradient_accumulation_steps = gpu_config['gradient_accumulation_steps']
    print("Single GPU training.")

# Training Arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=per_device_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=training_config['learning_rate'],
    warmup_steps=training_config['warmup_steps'],
    max_steps=training_config['max_steps'],
    gradient_checkpointing=training_config['gradient_checkpointing'],
    fp16=training_config['fp16'],
    eval_strategy=training_config['eval_strategy'],
    per_device_eval_batch_size=per_device_eval_batch_size,
    predict_with_generate=training_config['predict_with_generate'],
    generation_max_length=training_config['generation_max_length'],
    save_steps=training_config['save_steps'],
    eval_steps=training_config['eval_steps'],
    logging_steps=training_config['logging_steps'],
    report_to=training_config['report_to'],
    load_best_model_at_end=training_config['load_best_model_at_end'],
    metric_for_best_model=training_config['metric_for_best_model'],
    greater_is_better=training_config['greater_is_better'],
    push_to_hub=training_config['push_to_hub'],
    save_total_limit=training_config['save_total_limit'],
    # Multi-GPU specific settings
    dataloader_drop_last=training_config['dataloader_drop_last'],
    ddp_find_unused_parameters=training_config['ddp_find_unused_parameters'],
)

# Initialize Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=combined_train_dataset,
    eval_dataset=combined_val_dataset,
    data_collator=data_collator,
    tokenizer=main_processor.feature_extractor,
    compute_metrics=compute_metrics,
)

def evaluate_on_test_sets():
    """Evaluate the model on all test sets from enabled languages"""
    print("\n" + "="*60)
    print("EVALUATING ON ALL TEST SETS")
    print("="*60)

    results = {}

    for lang in enabled_languages:
        if lang in processed_datasets:
            lang_results = {}

            if lang == "chinese":
                # Chinese has multiple test splits
                if "test_net" in processed_datasets[lang]:
                    print("\n***** Evaluating on WenetSpeech Chinese TEST_NET *****")
                    chi_testnet_metrics = trainer.predict(processed_datasets[lang]["test_net"], metric_key_prefix=f"test_{lang}_net")
                    print(f"Chinese TEST_NET WER: {chi_testnet_metrics.metrics[f'test_{lang}_net_wer']*100:.2f}%")
                    print(f"Chinese TEST_NET CER: {chi_testnet_metrics.metrics[f'test_{lang}_net_cer']*100:.2f}%")
                    lang_results["test_net"] = chi_testnet_metrics.metrics

                if "test_meeting" in processed_datasets[lang]:
                    print("\n***** Evaluating on WenetSpeech Chinese TEST_MEETING *****")
                    chi_testmeet_metrics = trainer.predict(processed_datasets[lang]["test_meeting"], metric_key_prefix=f"test_{lang}_meeting")
                    print(f"Chinese TEST_MEETING WER: {chi_testmeet_metrics.metrics[f'test_{lang}_meeting_wer']*100:.2f}%")
                    print(f"Chinese TEST_MEETING CER: {chi_testmeet_metrics.metrics[f'test_{lang}_meeting_cer']*100:.2f}%")
                    lang_results["test_meeting"] = chi_testmeet_metrics.metrics
            else:
                # Standard test split
                if "test" in processed_datasets[lang]:
                    print(f"\n***** Evaluating on {lang.title()} test set *****")
                    test_metrics = trainer.predict(processed_datasets[lang]["test"], metric_key_prefix=f"test_{lang}")
                    print(f"{lang.title()} Test WER: {test_metrics.metrics[f'test_{lang}_wer']*100:.2f}%")
                    print(f"{lang.title()} Test CER: {test_metrics.metrics[f'test_{lang}_cer']*100:.2f}%")
                    lang_results["test"] = test_metrics.metrics

            results[lang] = lang_results

    # Summary
    print("\n" + "="*60)
    print("SUMMARY OF ALL TEST RESULTS")
    print("="*60)

    for lang in enabled_languages:
        if lang in results:
            if lang == "chinese":
                if "test_net" in results[lang]:
                    wer = results[lang]["test_net"][f"test_{lang}_net_wer"] * 100
                    cer = results[lang]["test_net"][f"test_{lang}_net_cer"] * 100
                    print(f"Chinese-NET: WER={wer:.2f}% | CER={cer:.2f}%")
                if "test_meeting" in results[lang]:
                    wer = results[lang]["test_meeting"][f"test_{lang}_meeting_wer"] * 100
                    cer = results[lang]["test_meeting"][f"test_{lang}_meeting_cer"] * 100
                    print(f"Chinese-MTG: WER={wer:.2f}% | CER={cer:.2f}%")
            else:
                if "test" in results[lang]:
                    wer = results[lang]["test"][f"test_{lang}_wer"] * 100
                    cer = results[lang]["test"][f"test_{lang}_cer"] * 100
                    print(f"{lang.title():12}: WER={wer:.2f}% | CER={cer:.2f}%")

    return results

if __name__ == "__main__":
    print(f"Total training samples: {len(combined_train_dataset)}")
    if combined_val_dataset is not None:
        print(f"Total validation samples: {len(combined_val_dataset)}")
    print("Starting training...")

    # Fine-tune the model
    trainer.train()

    # Evaluate on all test sets
    evaluate_on_test_sets()
inference.py ADDED
#!/usr/bin/env python

# pip install transformers datasets torch soundfile jiwer

from datasets import load_dataset, Audio
from transformers import AutoModelForSpeechSeq2Seq, WhisperProcessor, pipeline
import torch
from jiwer import wer as jiwer_wer
from jiwer import cer as jiwer_cer

# 1. Load the FLEURS Khmer (km_kh) test set, cast to 16 kHz audio
ds = load_dataset("google/fleurs", "km_kh", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# model_id = "openai/whisper-large-v3"
model_id = "pengyizhou/whisper-fleurs-km_kh"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
whisper_model = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(whisper_model, language="khmer")

# 2. Build the ASR pipeline around the fine-tuned model and the base Whisper processor
asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    chunk_length_s=30,
    batch_size=64,
    max_new_tokens=440,
    device=device,
    no_repeat_ngram_size=3,   # Prevent repeating 3-grams
    repetition_penalty=1.0,   # Penalize repetitions (>1.0 reduces repetition)
    length_penalty=1.0,       # Control length preference
    num_beams=1,              # Greedy decoding (set >1 for beam search)
    do_sample=False,          # Disable sampling for deterministic output
    early_stopping=False,     # Only relevant when num_beams > 1
    suppress_tokens=[],
)


# 3. Batch-wise transcription function
def transcribe_batch(batch):
    # `batch["audio"]` is a list of {"array": np.ndarray, ...}
    inputs = [ex["array"] for ex in batch["audio"]]
    outputs = asr(inputs)  # returns a list of dicts with "text"
    # lower-case and strip to normalize for CER
    preds = [out["text"].lower().strip() for out in outputs]
    return {"prediction": preds}

# 4. Map over the dataset in chunks of 64 examples at a time
result = ds.map(
    transcribe_batch,
    batched=True,
    batch_size=64,  # feed 64 audios at a time; the pipeline sub-batches internally
    remove_columns=ds.column_names
)

# 5. Compute corpus-level CER and WER with jiwer
# refs = "\n".join(t.lower().strip() for t in ds["transcription"])
# preds = "\n".join(t for t in result["prediction"])
# score = jiwer_cer(refs, preds)
refs = [t.lower().strip() for t in ds["transcription"]]
preds = [t for t in result["prediction"]]
score_cer = jiwer_cer(refs, preds)
score_wer = jiwer_wer(refs, preds)

print(f"CER on FLEURS km_kh: {score_cer*100:.2f}%")
print(f"WER on FLEURS km_kh: {score_wer*100:.2f}%")

with open("./km_kh_finetune.pred", "w") as pred_results:
    for pred in preds:
        pred_results.write("{}\n".format(pred))

with open("./km_kh.ref", "w") as ref_results:
    for ref in refs:
        ref_results.write("{}\n".format(ref))