---
base_model:
- facebook/mms-lid-256
datasets:
- ai4bharat/IndicVoices
- mozilla-foundation/common_voice_11_0
language:
- hi
- ur
- en
- ta
- te
- ne
- kn
- ml
- mr
- bn
license: cc-by-nc-4.0
metrics:
- accuracy
pipeline_tag: audio-classification
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- speaker_dialect_classification
library_name: transformers
---

# MMS-LID-256 for Regional Language Classification in India

## Model Description
This model implements classification of regional languages spoken in India, as described in **Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe**.

GitHub repository: https://github.com/tiantiaf0627/voxlect

The included languages spoken in India are:
```python
label_list = [
    "assamese", "bengali", "bodo", "dogri", "english", "gujarati",
    "hindi", "kannada", "kashmiri", "konkani", "maithili", "malayalam",
    "manipuri", "marathi", "nepali", "odia", "punjabi", "sanskrit",
    "santali", "sindhi", "tamil", "telugu", "urdu"
]
```

## How to use this model

### Download repo
```bash
git clone git@github.com:tiantiaf0627/voxlect
```

### Install the package
```bash
conda create -n voxlect python=3.8
conda activate voxlect
cd voxlect
pip install -e .
```

### Load the model
```python
# Load libraries
import torch
import torch.nn.functional as F
from src.model.dialect.mms_dialect import MMSWrapper

# Find device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model from Hugging Face
model = MMSWrapper.from_pretrained("tiantiaf/voxlect-indic-lid-mms-lid-256").to(device)
model.eval()
```

### Prediction
```python
# Label list (in the same order as the model outputs)
label_list = [
    "assamese", "bengali", "bodo", "dogri", "english", "gujarati",
    "hindi", "kannada", "kashmiri", "konkani", "maithili", "malayalam",
    "manipuri", "marathi", "nepali", "odia", "punjabi", "sanskrit",
    "santali", "sindhi", "tamil", "telugu", "urdu"
]

# Load data; here we use zeros as a placeholder.
# Our training data filters out audio shorter than 3 seconds (unreliable predictions)
# and longer than 15 seconds (computational constraints), so prepare your audio
# as mono-channel, 16 kHz waveforms of at most 15 seconds.
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits, embeddings = model(data, return_feature=True)

# Probability and predicted language
dialect_prob = F.softmax(logits, dim=1)
print(label_list[torch.argmax(dialect_prob, dim=1).detach().cpu().item()])
```

For sketches of preparing a real recording and inspecting the top-k predictions, see the examples at the end of this card.

## Responsible Use
Users should respect the privacy and consent of data subjects, and adhere to the relevant laws and regulations in their jurisdictions when using Voxlect.

❌ **Out-of-Scope Use**
- Clinical or diagnostic applications
- Surveillance
- Privacy-invasive applications
- Commercial use (the model is released under CC BY-NC 4.0)

## Contact
If you have any questions, please contact Tiantian Feng (tiantiaf@usc.edu).

## Citation
If you like our work or use the models in your work, kindly cite the following. We appreciate your recognition!
```bibtex
@article{feng2025voxlect,
  title={Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe},
  author={Feng, Tiantian and Huang, Kevin and Xu, Anfeng and Shi, Xuan and Lertpetchpun, Thanathai and Lee, Jihwan and Lee, Yoonjeong and Byrd, Dani and Narayanan, Shrikanth},
  journal={arXiv preprint arXiv:2508.01691},
  year={2025}
}
```
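
## Example: preparing a real recording (sketch)

The prediction example above feeds zeros as placeholder input. The sketch below shows one way to convert a real recording into the expected format (mono, 16 kHz, at most 15 seconds). It is a minimal sketch, assuming `torchaudio` is installed; the path `your_audio.wav` and the helper `prepare_audio` are hypothetical names introduced here for illustration only.

```python
import torch
import torchaudio

def prepare_audio(path: str, target_sr: int = 16000, max_seconds: int = 15) -> torch.Tensor:
    # Hypothetical helper, not part of the Voxlect package.
    # Load the recording; waveform has shape (channels, samples).
    waveform, sr = torchaudio.load(path)
    # Downmix to a single (mono) channel.
    waveform = waveform.mean(dim=0, keepdim=True)
    # Resample to 16 kHz if needed.
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=target_sr)
    # Truncate to at most 15 seconds.
    return waveform[:, : target_sr * max_seconds]

# Reuses `device` and `model` from the loading example above.
data = prepare_audio("your_audio.wav").float().to(device)
logits, embeddings = model(data, return_feature=True)
```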
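
## Example: inspecting the top-k predictions (sketch)

If you want more than the single most likely label, you can rank the full probability distribution. The sketch below reuses `dialect_prob` and `label_list` from the prediction example and prints the three most probable languages with `torch.topk`; the choice of `k=3` is arbitrary.

```python
import torch

# dialect_prob has shape (1, num_languages); take the first (and only) row.
top_probs, top_idx = torch.topk(dialect_prob[0], k=3)
for prob, idx in zip(top_probs.tolist(), top_idx.tolist()):
    print(f"{label_list[idx]}: {prob:.3f}")
```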