Param-1-7B-MoE
Param-1-7B-MoE is a multilingual large language model developed under the Param-1 family as part of BharatGen – A Suite of Generative AI Technologies for India. With 7 billion parameters and a Mixture of Experts (MoE) architecture, the model is designed to better understand and generate text across English, Hindi, and 14 additional Indian languages.
The model is pretrained from scratch with a strong focus on linguistic diversity, cultural context, and large-scale multilingual representation relevant to India.
Key Highlights
- 7B parameter Mixture of Experts (MoE) language model
- Multilingual: English, Hindi + 14 Indian languages
- Trained on 4 trillion tokens
- Uses 64 specialized experts, dynamically activated per token
- Supports context lengths of up to 4096 tokens
- Designed as a pretrained (PT) base model for downstream fine-tuning
Supported Languages
In addition to English and Hindi, the model has been trained on data from the following 14 Indian languages:
- Assamese
- Bengali
- Gujarati
- Kannada
- Maithili
- Malayalam
- Marathi
- Nepali
- Oriya
- Punjabi
- Sanskrit
- Sindhi
- Tamil
- Telugu
This broad language coverage enables better performance in region-specific applications and improves inclusivity across India’s linguistic landscape.
Model Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "bharatgenai/Param-1-7B"

# trust_remote_code=True lets the custom model code shipped with the checkpoint load
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Tokenize the prompt and move it to the model's device
prompt = "Your prompt here"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation from the base model
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
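Because the model is multilingual, the same pipeline works for Indic-language prompts. The sketch below reuses the `tokenizer` and `model` loaded above with a hypothetical Hindi prompt; the prompt text and generation length are illustrative only. Note that `device_map="auto"` requires the `accelerate` package, and since this is a pretrained base model the output is a text continuation rather than a chat-style answer.

```python
# Illustrative only: a Hindi prompt run through the same sampling settings
hindi_prompt = "भारत की प्रमुख भाषाएँ कौन-कौन सी हैं?"  # "What are the major languages of India?"
inputs = tokenizer(hindi_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```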
Model Architecture
- Architecture: Transformer (Decoder-only) with Mixture of Experts (MoE)
- Number of parameters: ~7B
- Active parameters: 1.04B
- Number of experts: 64
- Expert routing: Token-level sparse activation, Top-K = 8 experts
- Maximum sequence length: 4096
- Positional embeddings: Rotary (RoPE)
- Attention mechanism: Optimized attention with modern activation techniques
- Precision: bf16-mixed
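To make the routing described above (64 experts, token-level sparse activation, Top-K = 8) concrete, here is a minimal PyTorch sketch of Top-K expert routing. It is not the Param-1-7B-MoE implementation; the layer sizes, gating scheme, and class name are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Illustrative token-level Top-K expert routing; not the actual Param-1-7B-MoE code."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # Router produces one logit per expert for every token
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                                # (num_tokens, num_experts)
        gate, idx = torch.topk(logits, self.top_k, dim=-1)     # keep the 8 highest-scoring experts per token
        gate = F.softmax(gate, dim=-1)                         # normalize gates over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                         # naive loops; real MoE kernels batch tokens per expert
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += gate[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(16, 1024)      # 16 token embeddings
print(ToyTopKMoE()(tokens).shape)   # torch.Size([16, 1024])
```

Because only 8 of the 64 experts run for each token, the per-token compute corresponds to roughly 1.04B active parameters even though the full model holds ~7B.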
Training Data
The model is pretrained on a large-scale multilingual corpus totaling 4 trillion tokens.
Dataset Composition
PT-1: Pre-Training Phase 1
- English + Hindi: ~`500B` tokens
- Code + Math: ~`500B` tokens
- 14 Indian languages: ~`1.0T` tokens
PT-2: Pre-Training Phase 2
Note: All figures represent net token counts. Multilingual, SFT, RL, and replay data are re-expressions of the same underlying knowledge and are not additive.
- General Knowledge: ~`470B` tokens
- Technical Knowledge (Research + STEM + Philosophy + India-focused research): ~`270B` tokens
- Education, Exams, Domain Specialisation, Math, Benchmarks & Code: ~`330B` tokens
- Conversational Knowledge (Forums, Q&A): ~`10B` tokens
- Indic Multilingual Expansion (16 languages; synthetic + OCR + translation + personas): ~`920B` tokens
- Alignment & Stabilization (within budget):
  - SFT-style: ~`45B` tokens
  - RL / Preference / Safety: ~`4.9B` tokens
  - PT-1 replay (low-resource stabilization): ~`95B` tokens
  - LLM Self-Identity & BharatGen knowledge: ~`0.1B` tokens
Total PT-2 Tokens: ~`2.0T`
The data mixture was curated and balanced using CLIMB (Clustering-based Iterative Data Mixture Bootstrapping), an advanced data filtering and mixing technique from NVIDIA. This ensures high-quality training signals and fair representation across languages.
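For rough intuition only, the toy sketch below iteratively adjusts per-cluster sampling weights toward a proxy objective, which is the general idea behind clustering-based mixture search. The cluster count, scoring function, and search loop are all hypothetical and do not reproduce NVIDIA's CLIMB implementation.

```python
# Toy illustration (not CLIMB itself): search for per-cluster sampling weights
# that maximize a stand-in proxy score.
import numpy as np

rng = np.random.default_rng(0)
num_clusters = 8                                       # hypothetical number of data clusters
weights = np.full(num_clusters, 1.0 / num_clusters)    # start from a uniform mixture

def proxy_score(mix):
    """Stand-in for training a small proxy model on the mixture and scoring it on held-out tasks."""
    quality = np.linspace(0.2, 1.0, num_clusters)      # hypothetical per-cluster usefulness
    return float(mix @ quality)

for step in range(20):
    # Propose small perturbations of the current mixture and keep the best-scoring candidate
    candidates = np.abs(weights + 0.05 * rng.standard_normal((16, num_clusters)))
    candidates /= candidates.sum(axis=1, keepdims=True)    # renormalize to valid mixtures
    scores = [proxy_score(c) for c in candidates]
    weights = candidates[int(np.argmax(scores))]

print(np.round(weights, 3))   # final sampling proportions over clusters
```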
Training Details
- Training framework: NVIDIA NeMo
- Training infrastructure: Yotta Shakti Cloud
- Hardware: NVIDIA H100 GPUs
Limitations
- This is a pretrained base model and may require fine-tuning for instruction-following or chat-based applications.
- Model outputs may reflect biases present in large-scale multilingual web data.
- Performance may vary across low-resource domains and specialized tasks.
License
This model is released under the BharatGen non-commercial license.
Please refer to the LICENSE file for detailed terms and conditions.