Zenith-70B-p300 (V1-Tenstorrent-Blackhole-p300)

Flagship 70B parameter model optimized for Tenstorrent p300a hardware, based on DeepSeek-R1-Distill-Llama-70B.

Features

  • 70B Parameters: Largest Zenith model with maximum capability
  • p300a Optimized: Specifically tuned for Tenstorrent p300a (dual-chip, 64 cores, 64GB GDDR6)
  • Ring Attention: 32K context window with efficient chunked attention
  • Hybrid MoE: Mixture of Experts for sparse activation (12 experts, top-2 routing)
  • EQ Adapter: Advanced emotional intelligence
  • DeepSeek-R1 Distilled: Knowledge distilled from DeepSeek-R1
  • Tensor/Pipeline Parallelism: Optimized for distributed training on p300
  • NoC Optimization: Efficient chip-to-chip communication
  • Ollama Compatible: Ready for deployment

Hardware Requirements

Training

  • Tenstorrent p300a: 2 chips, 16 RISC-V cores per chip (32 total)
  • Memory: 64GB GDDR6 (fully utilized)
  • Storage: 2TB+ NVMe SSD
  • Note: Full fine-tuning requires all memory; LoRA/QLoRA strongly recommended

Inference

  • p300a: Full 32K context supported
  • Standard GPU: 80GB+ VRAM (H100, A100 80GB) for full model
  • Consumer GPUs: Not feasible for full model; use QLoRA with reduced context

Quick Start

Installation

cd Zenith/V1-Tenstorrent-Blackhole-p300/70B
pip install -r requirements.txt

Training

IMPORTANT: 70B is extremely large. Use LoRA or QLoRA exclusively.

# QLoRA (4-bit) - Recommended
python train.py \
  --base_model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --train_data ./data/train.json \
  --use_qlora \
  --use_lora \
  --lora_r 8 \
  --lora_alpha 16 \
  --epochs 1 \
  --batch_size 1 \
  --gradient_accumulation_steps 32 \
  --learning_rate 5e-6 \
  --use_ring_attention \
  --max_seq_length 32768 \
  --tensor_parallel_size 8 \
  --pipeline_parallel_size 4 \
  --use_noc_optimization \
  --mixed_precision bf16 \
  --use_quality_filter \
  --use_curriculum

# LoRA (8-bit) - Alternative
python train.py \
  --base_model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --train_data ./data/train.json \
  --use_lora \
  --lora_r 8 \
  --lora_alpha 16 \
  --epochs 1 \
  --batch_size 1 \
  --gradient_accumulation_steps 32 \
  --learning_rate 5e-6 \
  ...

Do NOT attempt full fine-tuning unless you have specialized hardware beyond p300.

Inference

# Interactive mode
python inference.py --checkpoint ./outputs/checkpoint-final

# Single prompt (long context)
python inference.py \
  --checkpoint ./outputs/checkpoint-final \
  --prompt "Analyze this 30K document and extract key insights..." \
  --max_new_tokens 2048 \
  --temperature 0.55

Ollama Deployment

# Build model (requires ~40GB disk space for the 4-bit quantized 70B; full-precision weights are ~140GB)
ollama create zenith-70b-p300 -f Modelfile

# Run
ollama run zenith-70b-p300 "Explain the implications of Gödel's incompleteness theorems"

# Long context
ollama run zenith-70b-p300 "Read this 32K document and provide a comprehensive summary: [paste text]"

Architecture

Model Configuration

from configs.zenith_config import get_70b_config

config = get_70b_config()
print(config)

Key Parameters:

  • hidden_size: 8192
  • num_layers: 64
  • num_heads: 64
  • num_experts: 12 (configurable)
  • moe_top_k: 2
  • max_seq_len: 32768
  • use_ring_attention: True
  • ring_attention_chunk_size: 8192
  • ring_attention_overlap: 2048

p300 Optimizations

  • TP=8: 8 cores/chip for tensor parallelism
  • PP=4: 4 cores/chip for pipeline parallelism
  • NoC: Optimized inter-chip communication
  • Ring Attention: 32K context without OOM
  • BF16: Native mixed precision

MoE Configuration

config.num_experts = 12
config.moe_top_k = 2
config.moe_load_balancing_weight = 0.01
config.moe_capacity_factor = 1.0

  • 12 experts per MoE layer
  • Top-2 routing (2 experts active per token)
  • Load balancing loss
  • 60% of middle layers use MoE (38 out of 64 layers)
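The top-2 routing and load-balancing loss above can be sketched in plain Python. This is a minimal illustration of the standard softmax top-k gating scheme with a Switch-style balancing loss, not necessarily the exact implementation in train.py:

```python
import math

NUM_EXPERTS = 12
TOP_K = 2

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(gate_logits, top_k=TOP_K):
    """Pick the top-k experts for one token and renormalize their gates."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    denom = sum(probs[i] for i in chosen)
    return {i: probs[i] / denom for i in chosen}  # expert index -> mixing weight

def load_balancing_loss(all_probs, tokens_per_expert, num_tokens):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i,
    where f_i is the fraction of tokens routed to expert i and
    P_i is the mean gate probability for expert i.
    Equals 1.0 when routing is perfectly balanced."""
    n = len(tokens_per_expert)
    loss = 0.0
    for i in range(n):
        f_i = tokens_per_expert[i] / num_tokens
        p_i = sum(p[i] for p in all_probs) / num_tokens
        loss += f_i * p_i
    return n * loss
```

With 12 experts and top-2 routing, each token activates only 2 expert FFNs, which is the sparse-activation saving; moe_load_balancing_weight scales the auxiliary loss when it is added to the language-modeling loss.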

EQ Adapter

config.use_eq_adapter = True
config.eq_adapter_hidden_size = 64
config.eq_loss_weight = 0.03

  • Frustration detection
  • 8-emotion classification
  • Fused after attention
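A common way to realize an adapter like this is a residual bottleneck inserted after the attention block. The sketch below assumes that structure (down-project to eq_adapter_hidden_size, nonlinearity, up-project, residual add); the actual fusion details in the Zenith code may differ:

```python
def linear(x, w):
    """Matrix-vector product; w is a list of rows (out_dim x in_dim)."""
    return [sum(wi[j] * x[j] for j in range(len(x))) for wi in w]

def relu(x):
    return [max(0.0, v) for v in x]

def eq_adapter(hidden, w_down, w_up):
    """Bottleneck adapter fused after attention:
    hidden + up(relu(down(hidden))).
    In the 70B config the bottleneck (rows of w_down) would be 64
    against a hidden size of 8192; tiny dims work the same way."""
    z = relu(linear(hidden, w_down))
    out = linear(z, w_up)
    return [h + o for h, o in zip(hidden, out)]
```

Because the adapter is residual, initializing the up-projection to zero makes it an identity at the start of training, so it cannot hurt the base model before it has learned anything.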

Data Processing

OpenThoughts Integration

from data.openthoughts_processor import OpenThoughtsProcessor, OpenThoughtsConfig

ot_config = OpenThoughtsConfig(
    dataset_name="open-thoughts/OpenThoughts3-1.2M",
    streaming=True,
    max_seq_length=32768,
    quality_filtering=True,
    curriculum_learning=True,
    tokenizer=tokenizer
)
processor = OpenThoughtsProcessor(ot_config)

Curriculum Stages

  1. Foundation: quality_score > 0.8, length 1024-8192
  2. Reasoning: has_thoughts=True, CoT depth > 3
  3. Code: domain='code', complexity > 0.5
  4. Full: all samples
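The four stages above amount to per-stage predicates over sample metadata. A minimal sketch, with field names (quality_score, num_tokens, has_thoughts, cot_depth, domain, complexity) that are illustrative rather than the processor's actual schema:

```python
def stage_filter(sample, stage):
    """Return True if a sample belongs in the given curriculum stage."""
    if stage == "foundation":
        return sample["quality_score"] > 0.8 and 1024 <= sample["num_tokens"] <= 8192
    if stage == "reasoning":
        return sample["has_thoughts"] and sample["cot_depth"] > 3
    if stage == "code":
        return sample["domain"] == "code" and sample["complexity"] > 0.5
    if stage == "full":
        return True  # final stage trains on everything
    raise ValueError(f"unknown stage: {stage}")
```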

Quality Filtering

  • Length: 512-32000 tokens
  • Language: English
  • Repetition: < 15%
  • Coherence: > 0.7
  • Structure: Valid
  • Thought quality: CoT depth > 3
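Two of the filters above (length and repetition) are easy to sketch. The repetition metric here, the fraction of n-grams that repeat an earlier n-gram, is one common choice and is an assumption about how the "< 15%" threshold is measured:

```python
def repetition_ratio(tokens, n=3):
    """Fraction of n-grams that are repeats of an earlier n-gram.
    A filter with the <15% rule above would reject texts where
    this exceeds 0.15."""
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    seen, repeats = set(), 0
    for g in grams:
        if g in seen:
            repeats += 1
        else:
            seen.add(g)
    return repeats / len(grams)

def passes_length(num_tokens):
    """Length filter: 512-32000 tokens, per the list above."""
    return 512 <= num_tokens <= 32000
```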

Advanced Features

Ring Attention

--use_ring_attention \
--ring_chunk_size 8192 \
--ring_overlap 2048

Enables 32K context on limited memory.
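The memory saving comes from processing the sequence in overlapping chunks, so peak attention state scales with chunk_size + overlap rather than the full sequence. A sketch of the chunk layout implied by the flags above (the exact overlap semantics in the implementation may differ):

```python
def attention_chunks(seq_len, chunk_size, overlap):
    """Chunk boundaries for chunked/ring attention: each chunk covers
    its own tokens plus `overlap` tokens of left context.
    Requires seq_len % chunk_size == 0 (see Troubleshooting)."""
    assert seq_len % chunk_size == 0, "seq_len must be a multiple of chunk_size"
    chunks = []
    for start in range(0, seq_len, chunk_size):
        ctx_start = max(0, start - overlap)
        chunks.append((ctx_start, start + chunk_size))
    return chunks
```

With seq_len=32768, chunk_size=8192, overlap=2048 this yields 4 chunks, each attending over at most 10240 tokens instead of 32768.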

Distributed Training

export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=2

torchrun --nproc_per_node=2 --nnodes=1 train.py ...

Mixed Precision

--mixed_precision bf16

Gradient Checkpointing

--gradient_checkpointing

Reduces memory by ~60%.

Testing

python test_model.py

Tests:

  • Model creation
  • Forward pass
  • p300 optimizations
  • MoE configuration
  • Ring attention
  • EQ adapter
  • Generation
  • Gradient flow

Evaluation

python -m evaluation.benchmark \
  --model_path ./outputs/checkpoint-final \
  --benchmarks humaneval mbpp gsm8k math truthfulqa

Deployment

Ollama

ollama create zenith-70b-p300 -f Modelfile
ollama run zenith-70b-p300 "Your prompt here"

vLLM

python -m vllm.entrypoints.openai.api_server \
  --model ./outputs/checkpoint-final \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000

Hugging Face TGI

docker run --gpus all -p 8080:80 \
  -v ./outputs/checkpoint-final:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data \
  --max-input-length 32768 \
  --max-total-tokens 36864

Troubleshooting

Out of Memory

  • Use QLoRA (4-bit) or LoRA (8-bit)
  • Reduce batch size to 1 and increase gradient accumulation to keep the effective batch size
  • Reduce max_seq_length to 16384
  • Enable gradient checkpointing
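The batch-size advice follows from simple arithmetic: activation memory scales with the micro-batch, while the optimizer sees the accumulated batch. A sketch (function name is illustrative):

```python
def effective_batch(micro_batch, accum_steps, data_parallel=1):
    """Effective optimizer batch = micro-batch x accumulation steps
    x data-parallel replicas. Activation memory depends only on the
    micro-batch, which is why batch_size 1 with accumulation 32
    fits where batch_size 32 would not."""
    return micro_batch * accum_steps * data_parallel
```

The recommended training flags (--batch_size 1, --gradient_accumulation_steps 32) therefore give an effective batch of 32 per replica.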

Slow Training

  • Data loading bottleneck? Use SSD, pre-tokenize
  • Increase batch size if possible
  • Reduce gradient accumulation
  • Use mixed precision
  • Ensure NoC optimization is on

Poor Quality

  • Use curriculum learning
  • Apply quality filtering
  • Train more epochs (2-3 if possible)
  • Lower learning rate (1e-6 to 5e-6)
  • Use more high-quality data
  • Ensure base model is correct (DeepSeek-R1-Distill-Llama-70B)

Ring Attention Issues

  • Check: max_seq_length % ring_chunk_size == 0
  • Reduce ring_overlap if memory tight
  • Chunk size >= 2048
  • Update transformers

Performance

Configuration                  Memory   Speed (tokens/s)   Quality
QLoRA (r=8, 2K context)        ~10GB    80-120             ~95%
LoRA (r=8, 2K context)         ~16GB    60-90              ~98%
Ring attention (32K context)   +20%     25-45              Enables long context

Note: Full 70B requires ~140GB VRAM, not feasible on p300 without quantization.
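The note above follows from bytes-per-parameter arithmetic. A quick sanity check, ignoring KV cache, activations, and quantization overhead (scales and zero-points add a few percent in practice):

```python
def weight_gb(params, bits):
    """Approximate weight footprint in GB: params * bits / 8 bytes."""
    return params * bits / 8 / 1e9

P = 70e9
fp16_gb = weight_gb(P, 16)  # bf16/fp16 weights: ~140 GB
int4_gb = weight_gb(P, 4)   # 4-bit quantized weights: ~35 GB before overhead
```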

Citation

@misc{zenith-70b-p300-2025,
  title={Zenith-70B-p300: A Tenstorrent-Optimized 70B Model with Ring Attention and MoE},
  author={Zenith Project},
  year={2025}
}

License

[Specify]

Support

  • README.md for quick reference
  • FINETUNE_GUIDE.md for detailed instructions
  • configs/zenith_config.py for configuration
  • Open issues with logs