Zenith-70B-p300 (V1-Tenstorrent-Blackhole-p300)

Flagship 70B parameter model optimized for Tenstorrent p300a hardware, based on DeepSeek-R1-Distill-Llama-70B.

Features

  • 70B Parameters: Largest Zenith model with maximum capability
  • p300a Optimized: Specifically tuned for Tenstorrent p300a (dual-chip, 64 cores, 64GB GDDR6)
  • Ring Attention: 32K context window with efficient chunked attention
  • Hybrid MoE: Mixture of Experts for sparse activation (12 experts, top-2 routing)
  • EQ Adapter: Advanced emotional intelligence
  • DeepSeek-R1 Distilled: Knowledge distilled from DeepSeek-R1
  • Tensor/Pipeline Parallelism: Optimized for distributed training on p300
  • NoC Optimization: Efficient chip-to-chip communication
  • Ollama Compatible: Ready for deployment

Hardware Requirements

Training

  • Tenstorrent p300a: 2 chips, 16 RISC-V cores per chip (32 total)
  • Memory: 64GB GDDR6 (fully utilized)
  • Storage: 2TB+ NVMe SSD
  • Note: Full fine-tuning requires all memory; LoRA/QLoRA strongly recommended

Inference

  • p300a: Full 32K context supported
  • Standard GPU: 80GB+ VRAM (H100, A100 80GB) for full model
  • Consumer GPUs: Not feasible for full model; use QLoRA with reduced context

Quick Start

Installation

cd Zenith/V1-Tenstorrent-Blackhole-p300/70B
pip install -r requirements.txt

Training

IMPORTANT: 70B is extremely large. Use LoRA or QLoRA exclusively.

# QLoRA (4-bit) - Recommended
python train.py \
  --base_model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --train_data ./data/train.json \
  --use_qlora \
  --use_lora \
  --lora_r 8 \
  --lora_alpha 16 \
  --epochs 1 \
  --batch_size 1 \
  --gradient_accumulation_steps 32 \
  --learning_rate 5e-6 \
  --use_ring_attention \
  --max_seq_length 32768 \
  --tensor_parallel_size 8 \
  --pipeline_parallel_size 4 \
  --use_noc_optimization \
  --mixed_precision bf16 \
  --use_quality_filter \
  --use_curriculum

# LoRA (8-bit) - Alternative
python train.py \
  --base_model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --train_data ./data/train.json \
  --use_lora \
  --lora_r 8 \
  --lora_alpha 16 \
  --epochs 1 \
  --batch_size 1 \
  --gradient_accumulation_steps 32 \
  --learning_rate 5e-6 \
  ...

Do NOT attempt full fine-tuning unless you have specialized hardware beyond p300.

Inference

# Interactive mode
python inference.py --checkpoint ./outputs/checkpoint-final

# Single prompt (long context)
python inference.py \
  --checkpoint ./outputs/checkpoint-final \
  --prompt "Analyze this 30K document and extract key insights..." \
  --max_new_tokens 2048 \
  --temperature 0.55

Ollama Deployment

# Build model (requires ~40GB disk space for the 4-bit quantized 70B; full-precision weights are ~140GB)
ollama create zenith-70b-p300 -f Modelfile

# Run
ollama run zenith-70b-p300 "Explain the implications of Gödel's incompleteness theorems"

# Long context
ollama run zenith-70b-p300 "Read this 32K document and provide a comprehensive summary: [paste text]"

Architecture

Model Configuration

from configs.zenith_config import get_70b_config

config = get_70b_config()
print(config)

Key Parameters:

  • hidden_size: 8192
  • num_layers: 64
  • num_heads: 64
  • num_experts: 12 (configurable)
  • moe_top_k: 2
  • max_seq_len: 32768
  • use_ring_attention: True
  • ring_attention_chunk_size: 8192
  • ring_attention_overlap: 2048

p300 Optimizations

  • TP=8: 8 cores/chip for tensor parallelism
  • PP=4: 4 cores/chip for pipeline parallelism
  • NoC: Optimized inter-chip communication
  • Ring Attention: 32K context without OOM
  • BF16: Native mixed precision

MoE Configuration

config.num_experts = 12
config.moe_top_k = 2
config.moe_load_balancing_weight = 0.01
config.moe_capacity_factor = 1.0

  • 12 experts per MoE layer
  • Top-2 routing (2 experts active per token)
  • Load balancing loss
  • 60% of middle layers use MoE (38 out of 64 layers)
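The top-2 routing and load-balancing loss above can be sketched in plain Python. This is a minimal illustration of the standard softmax top-k gating scheme with a Switch-style balancing loss, not necessarily the exact implementation in train.py:

```python
import math

NUM_EXPERTS = 12
TOP_K = 2

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(gate_logits, top_k=TOP_K):
    """Pick the top-k experts for one token and renormalize their gates."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    denom = sum(probs[i] for i in chosen)
    return {i: probs[i] / denom for i in chosen}  # expert index -> mixing weight

def load_balancing_loss(all_probs, tokens_per_expert, num_tokens):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i,
    where f_i is the fraction of tokens routed to expert i and
    P_i is the mean gate probability for expert i.
    Equals 1.0 when routing is perfectly balanced."""
    n = len(tokens_per_expert)
    loss = 0.0
    for i in range(n):
        f_i = tokens_per_expert[i] / num_tokens
        p_i = sum(p[i] for p in all_probs) / num_tokens
        loss += f_i * p_i
    return n * loss
```

With 12 experts and top-2 routing, each token activates only 2 expert FFNs, which is the sparse-activation saving; moe_load_balancing_weight scales the auxiliary loss when it is added to the language-modeling loss.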

EQ Adapter

config.use_eq_adapter = True
config.eq_adapter_hidden_size = 64
config.eq_loss_weight = 0.03

  • Frustration detection
  • 8-emotion classification
  • Fused after attention
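A common way to realize an adapter like this is a residual bottleneck inserted after the attention block. The sketch below assumes that structure (down-project to eq_adapter_hidden_size, nonlinearity, up-project, residual add); the actual fusion details in the Zenith code may differ:

```python
def linear(x, w):
    """Matrix-vector product; w is a list of rows (out_dim x in_dim)."""
    return [sum(wi[j] * x[j] for j in range(len(x))) for wi in w]

def relu(x):
    return [max(0.0, v) for v in x]

def eq_adapter(hidden, w_down, w_up):
    """Bottleneck adapter fused after attention:
    hidden + up(relu(down(hidden))).
    In the 70B config the bottleneck (rows of w_down) would be 64
    against a hidden size of 8192; tiny dims work the same way."""
    z = relu(linear(hidden, w_down))
    out = linear(z, w_up)
    return [h + o for h, o in zip(hidden, out)]
```

Because the adapter is residual, initializing the up-projection to zero makes it an identity at the start of training, so it cannot hurt the base model before it has learned anything.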

Data Processing

OpenThoughts Integration

from data.openthoughts_processor import OpenThoughtsProcessor, OpenThoughtsConfig

ot_config = OpenThoughtsConfig(
    dataset_name="open-thoughts/OpenThoughts3-1.2M",
    streaming=True,
    max_seq_length=32768,
    quality_filtering=True,
    curriculum_learning=True,
    tokenizer=tokenizer
)
processor = OpenThoughtsProcessor(ot_config)

Curriculum Stages

  1. Foundation: quality_score > 0.8, length 1024-8192
  2. Reasoning: has_thoughts=True, CoT depth > 3
  3. Code: domain='code', complexity > 0.5
  4. Full: all samples
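The four stages above amount to per-stage predicates over sample metadata. A minimal sketch, with field names (quality_score, num_tokens, has_thoughts, cot_depth, domain, complexity) that are illustrative rather than the processor's actual schema:

```python
def stage_filter(sample, stage):
    """Return True if a sample belongs in the given curriculum stage."""
    if stage == "foundation":
        return sample["quality_score"] > 0.8 and 1024 <= sample["num_tokens"] <= 8192
    if stage == "reasoning":
        return sample["has_thoughts"] and sample["cot_depth"] > 3
    if stage == "code":
        return sample["domain"] == "code" and sample["complexity"] > 0.5
    if stage == "full":
        return True  # final stage trains on everything
    raise ValueError(f"unknown stage: {stage}")
```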

Quality Filtering

  • Length: 512-32000 tokens
  • Language: English
  • Repetition: < 15%
  • Coherence: > 0.7
  • Structure: Valid
  • Thought quality: CoT depth > 3
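Two of the filters above (length and repetition) are easy to sketch. The repetition metric here, the fraction of n-grams that repeat an earlier n-gram, is one common choice and is an assumption about how the "< 15%" threshold is measured:

```python
def repetition_ratio(tokens, n=3):
    """Fraction of n-grams that are repeats of an earlier n-gram.
    A filter with the <15% rule above would reject texts where
    this exceeds 0.15."""
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    seen, repeats = set(), 0
    for g in grams:
        if g in seen:
            repeats += 1
        else:
            seen.add(g)
    return repeats / len(grams)

def passes_length(num_tokens):
    """Length filter: 512-32000 tokens, per the list above."""
    return 512 <= num_tokens <= 32000
```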

Advanced Features

Ring Attention

--use_ring_attention \
--ring_chunk_size 8192 \
--ring_overlap 2048

Enables 32K context on limited memory.
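The memory saving comes from processing the sequence in overlapping chunks, so peak attention state scales with chunk_size + overlap rather than the full sequence. A sketch of the chunk layout implied by the flags above (the exact overlap semantics in the implementation may differ):

```python
def attention_chunks(seq_len, chunk_size, overlap):
    """Chunk boundaries for chunked/ring attention: each chunk covers
    its own tokens plus `overlap` tokens of left context.
    Requires seq_len % chunk_size == 0 (see Troubleshooting)."""
    assert seq_len % chunk_size == 0, "seq_len must be a multiple of chunk_size"
    chunks = []
    for start in range(0, seq_len, chunk_size):
        ctx_start = max(0, start - overlap)
        chunks.append((ctx_start, start + chunk_size))
    return chunks
```

With seq_len=32768, chunk_size=8192, overlap=2048 this yields 4 chunks, each attending over at most 10240 tokens instead of 32768.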

Distributed Training

export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=2

torchrun --nproc_per_node=2 --nnodes=1 train.py ...

Mixed Precision

--mixed_precision bf16

Gradient Checkpointing

--gradient_checkpointing

Reduces memory by ~60%.

Testing

python test_model.py

Tests:

  • Model creation
  • Forward pass
  • p300 optimizations
  • MoE configuration
  • Ring attention
  • EQ adapter
  • Generation
  • Gradient flow

Evaluation

python -m evaluation.benchmark \
  --model_path ./outputs/checkpoint-final \
  --benchmarks humaneval mbpp gsm8k math truthfulqa

Deployment

Ollama

ollama create zenith-70b-p300 -f Modelfile
ollama run zenith-70b-p300 "Your prompt here"

vLLM

python -m vllm.entrypoints.openai.api_server \
  --model ./outputs/checkpoint-final \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000

Hugging Face TGI

docker run --gpus all -p 8080:80 \
  -v ./outputs/checkpoint-final:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data \
  --max-input-length 32768 \
  --max-total-tokens 36864

Troubleshooting

Out of Memory

  • Use QLoRA (4-bit) or LoRA (8-bit)
  • Reduce batch size to 1 and increase gradient accumulation to keep the effective batch size
  • Reduce max_seq_length to 16384
  • Enable gradient checkpointing
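The batch-size advice follows from simple arithmetic: activation memory scales with the micro-batch, while the optimizer sees the accumulated batch. A sketch (function name is illustrative):

```python
def effective_batch(micro_batch, accum_steps, data_parallel=1):
    """Effective optimizer batch = micro-batch x accumulation steps
    x data-parallel replicas. Activation memory depends only on the
    micro-batch, which is why batch_size 1 with accumulation 32
    fits where batch_size 32 would not."""
    return micro_batch * accum_steps * data_parallel
```

The recommended training flags (--batch_size 1, --gradient_accumulation_steps 32) therefore give an effective batch of 32 per replica.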

Slow Training

  • Data loading bottleneck? Use SSD, pre-tokenize
  • Increase batch size if possible
  • Reduce gradient accumulation
  • Use mixed precision
  • Ensure NoC optimization is on

Poor Quality

  • Use curriculum learning
  • Apply quality filtering
  • Train more epochs (2-3 if possible)
  • Lower learning rate (1e-6 to 5e-6)
  • Use more high-quality data
  • Ensure base model is correct (DeepSeek-R1-Distill-Llama-70B)

Ring Attention Issues

  • Check: max_seq_length % ring_chunk_size == 0
  • Reduce ring_overlap if memory tight
  • Chunk size >= 2048
  • Update transformers

Performance

Configuration                  Memory   Speed (tokens/s)   Quality
QLoRA (r=8, 2K context)        ~10GB    80-120             ~95%
LoRA (r=8, 2K context)         ~16GB    60-90              ~98%
Ring attention (32K context)   +20%     25-45              Enables long context

Note: Full 70B requires ~140GB VRAM, not feasible on p300 without quantization.
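The note above follows from bytes-per-parameter arithmetic. A quick sanity check, ignoring KV cache, activations, and quantization overhead (scales and zero-points add a few percent in practice):

```python
def weight_gb(params, bits):
    """Approximate weight footprint in GB: params * bits / 8 bytes."""
    return params * bits / 8 / 1e9

P = 70e9
fp16_gb = weight_gb(P, 16)  # bf16/fp16 weights: ~140 GB
int4_gb = weight_gb(P, 4)   # 4-bit quantized weights: ~35 GB before overhead
```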

Citation

@misc{zenith-70b-p300-2025,
  title={Zenith-70B-p300: A Tenstorrent-Optimized 70B Model with Ring Attention and MoE},
  author={Zenith Project},
  year={2025}
}

License

[Specify]

Support

  • README.md for quick reference
  • FINETUNE_GUIDE.md for detailed instructions
  • configs/zenith_config.py for configuration
  • Open issues with logs