Param-1-7B-MoE
Param-1-7B-MoE is a multilingual large language model developed under the Param-1 family as part of BharatGen – A Suite of Generative AI Technologies for India. With 7 billion parameters and a Mixture of Experts (MoE) architecture, the model is designed to better understand and generate text across English, Hindi, and 14 additional Indian languages.
The model is pretrained from scratch with a strong focus on linguistic diversity, cultural context, and large-scale multilingual representation relevant to India.
Key Highlights
- 7B parameter Mixture of Experts (MoE) language model
- Multilingual: English, Hindi + 14 Indian languages
- Trained on 4 trillion tokens
- Uses 64 specialized experts, dynamically activated per token
- Supports context lengths of up to 4096 tokens
- Designed as a pretrained (PT) base model for downstream fine-tuning
Supported Languages
In addition to English and Hindi, the model has been trained on data from the following 14 Indian languages:
- Assamese
- Bengali
- Gujarati
- Kannada
- Maithili
- Malayalam
- Marathi
- Nepali
- Oriya
- Punjabi
- Sanskrit
- Sindhi
- Tamil
- Telugu
This broad language coverage enables better performance in region-specific applications and improves inclusivity across India’s linguistic landscape.
Model Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "bharatgenai/Param-1-7B"

# trust_remote_code=True lets the custom model code shipped with the checkpoint load
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Tokenize the prompt and move it to the model's device
prompt = "Your prompt here"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation from the base model
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
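Because the model is multilingual, the same pipeline works for Indic-language prompts. The sketch below reuses the `tokenizer` and `model` loaded above with a hypothetical Hindi prompt; the prompt text and generation length are illustrative only. Note that `device_map="auto"` requires the `accelerate` package, and since this is a pretrained base model the output is a text continuation rather than a chat-style answer.

```python
# Illustrative only: a Hindi prompt run through the same sampling settings
hindi_prompt = "भारत की प्रमुख भाषाएँ कौन-कौन सी हैं?"  # "What are the major languages of India?"
inputs = tokenizer(hindi_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```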
Model Architecture
- Architecture: Transformer (Decoder-only) with Mixture of Experts (MoE)
- Number of parameters: ~7B
- Active parameters: 1.04B
- Number of experts: 64
- Expert routing: Token-level sparse activation, Top-K = 8 experts
- Maximum sequence length: 4096
- Positional embeddings: Rotary (RoPE)
- Attention mechanism: Optimized attention with modern activation techniques
- Precision: bf16-mixed
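To make the routing described above (64 experts, token-level sparse activation, Top-K = 8) concrete, here is a minimal PyTorch sketch of Top-K expert routing. It is not the Param-1-7B-MoE implementation; the layer sizes, gating scheme, and class name are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Illustrative token-level Top-K expert routing; not the actual Param-1-7B-MoE code."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # Router produces one logit per expert for every token
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                                # (num_tokens, num_experts)
        gate, idx = torch.topk(logits, self.top_k, dim=-1)     # keep the 8 highest-scoring experts per token
        gate = F.softmax(gate, dim=-1)                         # normalize gates over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                         # naive loops; real MoE kernels batch tokens per expert
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += gate[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(16, 1024)      # 16 token embeddings
print(ToyTopKMoE()(tokens).shape)   # torch.Size([16, 1024])
```

Because only 8 of the 64 experts run for each token, the per-token compute corresponds to roughly 1.04B active parameters even though the full model holds ~7B.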
Training Data
The model is pretrained on a large-scale multilingual corpus totaling 4 trillion tokens.
Dataset Composition
PT-1: Pre-Training Phase 1
- English + Hindi: ~`500B` tokens
- Code + Math: ~`500B` tokens
- 14 Indian languages: ~`1.0T` tokens
PT-2: Pre-Training Phase 2
Note: All figures represent net token counts. Multilingual, SFT, RL, and replay data are re-expressions of the same underlying knowledge and are not additive.
- General Knowledge: ~`470B` tokens
- Technical Knowledge (Research + STEM + Philosophy + India-focused research): ~`270B` tokens
- Education, Exams, Domain Specialisation, Math, Benchmarks & Code: ~`330B` tokens
- Conversational Knowledge (Forums, Q&A): ~`10B` tokens
- Indic Multilingual Expansion (16 languages; synthetic + OCR + translation + personas): ~`920B` tokens
- Alignment & Stabilization (within budget):
  - SFT-style: ~`45B` tokens
  - RL / Preference / Safety: ~`4.9B` tokens
  - PT-1 replay (low-resource stabilization): ~`95B` tokens
  - LLM Self-Identity & BharatGen knowledge: ~`0.1B` tokens
Total PT-2 Tokens: ~`2.0T`
The data mixture was curated and balanced using CLIMB (Clustering-based Iterative Data Mixture Bootstrapping), an advanced data filtering and mixing technique from NVIDIA. This ensures high-quality training signals and fair representation across languages.
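For rough intuition only, the toy sketch below iteratively adjusts per-cluster sampling weights toward a proxy objective, which is the general idea behind clustering-based mixture search. The cluster count, scoring function, and search loop are all hypothetical and do not reproduce NVIDIA's CLIMB implementation.

```python
# Toy illustration (not CLIMB itself): search for per-cluster sampling weights
# that maximize a stand-in proxy score.
import numpy as np

rng = np.random.default_rng(0)
num_clusters = 8                                       # hypothetical number of data clusters
weights = np.full(num_clusters, 1.0 / num_clusters)    # start from a uniform mixture

def proxy_score(mix):
    """Stand-in for training a small proxy model on the mixture and scoring it on held-out tasks."""
    quality = np.linspace(0.2, 1.0, num_clusters)      # hypothetical per-cluster usefulness
    return float(mix @ quality)

for step in range(20):
    # Propose small perturbations of the current mixture and keep the best-scoring candidate
    candidates = np.abs(weights + 0.05 * rng.standard_normal((16, num_clusters)))
    candidates /= candidates.sum(axis=1, keepdims=True)    # renormalize to valid mixtures
    scores = [proxy_score(c) for c in candidates]
    weights = candidates[int(np.argmax(scores))]

print(np.round(weights, 3))   # final sampling proportions over clusters
```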
Training Details
- Training framework: NVIDIA NeMo
- Training infrastructure: Yotta Shakti Cloud
- Hardware: NVIDIA H100 GPUs
Limitations
- This is a pretrained base model and may require fine-tuning for instruction-following or chat-based applications.
- Model outputs may reflect biases present in large-scale multilingual web data.
- Performance may vary across low-resource domains and specialized tasks.
License
This model is released under the BharatGen non-commercial license.
Please refer to the LICENSE file for detailed terms and conditions.