🎧 Adversarial 3.8B - Foundation Model (Ready for Post-Training)

A 3.8-billion-parameter foundation language model trained from scratch, awaiting post-training with advanced random streaming techniques



💡 What is This Model?

This is a foundation model trained from scratch on raw text data. It has learned general language patterns but requires post-training to become task-aware.


⚠️ IMPORTANT: This is a foundation model only. Direct use will produce random/incoherent outputs. You need to post-train it first using the advanced streaming system included in this repository.
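You can still load the foundation checkpoint directly to verify it works; below is a minimal sketch. Note that at this stage generations are raw language-model continuations, not task-aware answers.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the foundation checkpoint only to verify it loads; do not expect task-aware answers yet.
model = AutoModelForCausalLM.from_pretrained(
    "Pacific-Prime/adversarial_3.8.0",
    torch_dtype="bfloat16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Pacific-Prime/adversarial_3.8.0")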


🎯 Training Pipeline

Current Status: Foundation Model ✓

1. ✅ Pre-training (DONE - This Model)
   └─> Foundation model trained from scratch
   └─> Learned general language patterns
   └─> 3.8B parameters, 5 epochs, loss: 1.32

2. ⏳ Post-Training (TODO - Use Our System!)
   └─> Use: post_training_streaming_advanced.py
   └─> Techniques: Stratified + Curriculum + Reservoir
   └─> Result: Task-aware model (code errors, debugging)

3. 🔄 Fine-Tuning (Optional)
   └─> Further specialize for your domain

πŸ—οΈ Architecture

Model Type: LlamaForCausalLM
Hidden Size: 4096
Layers: 20
Attention Heads: 32 (GQA with 8 KV heads)
Intermediate Size: 11008
Context Length: 8192
Vocabulary Size: 34,783
Precision: BF16
Parameters: 3.8B
Model Size: ~7.3 GB

Architecture Highlights:

  • Grouped Query Attention (GQA) - Faster inference
  • 8K context length - Long document support
  • BF16 precision - Optimal balance of speed and quality
  • Custom tokenizer - 34,783 vocabulary
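
For reference, the numbers above correspond roughly to the following LlamaConfig. This is an illustrative sketch only; the config.json shipped with the model is authoritative.

from transformers import LlamaConfig

# Illustrative config built from the table above (not the exact shipped config.json)
config = LlamaConfig(
    vocab_size=34_783,
    hidden_size=4096,
    num_hidden_layers=20,
    num_attention_heads=32,
    num_key_value_heads=8,         # GQA: 8 KV heads shared by 32 query heads
    intermediate_size=11008,
    max_position_embeddings=8192,  # 8K context
    torch_dtype="bfloat16"
)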

📊 Pre-Training Details (Already Done)

Hardware & Framework

  • GPUs: 2x NVIDIA H200 (141GB VRAM each)
  • Framework: DeepSpeed ZeRO-2 + Gradient Checkpointing
  • Batch Size: 64 effective (4 per GPU × 2 GPUs × 8 accumulation)
  • Training Time: ~12 hours
  • Precision: BF16 mixed precision

Training Configuration

Sequence Length: 2048 tokens
Epochs: 5
Learning Rate: 3e-4 (WarmupDecayLR)
Optimizer: AdamW (weight decay: 0.01)
Total Steps: 47,900
Final Loss: 1.32
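
The settings above roughly translate to a DeepSpeed ZeRO-2 configuration like the following. This is a sketch for orientation, not the exact file used for pre-training; the warmup length and gradient clipping value are assumptions.

# Sketch of a DeepSpeed ZeRO-2 config matching the settings above
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,      # 4 x 2 GPUs x 8 = 64 effective batch size
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "gradient_clipping": 1.0,              # assumed; not stated in this README
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 3e-4, "weight_decay": 0.01}
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_num_steps": 1000,      # assumed warmup length
            "total_num_steps": 47_900
        }
    }
}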

Training Metrics

Total Steps: 47,900
Initial Loss: 11.22 (random weights)
Final Loss: 1.32 (converged)
Training Time: 11h 40min
GPU Utilization: 95%+

Loss Progression:

Epoch 0.0: 11.22 ██████████████████████████████
Epoch 1.0:  8.50 ████████████████████
Epoch 2.0:  5.20 ████████████
Epoch 3.0:  3.10 ███████
Epoch 4.0:  2.00 ████
Epoch 5.0:  1.32 ██  ← Converged!

🚀 How to Use This Model

Step 1: Post-Training (REQUIRED)

Use our advanced random streaming system to post-train this foundation model.

The post-training scripts are available alongside this model and can be used with the DJ3.8B dataset:

# Use the advanced system (RECOMMENDED)
python3 post_training_streaming_advanced.py

What is Random Streaming?

Instead of loading the entire dataset into memory, our system uses intelligent random sampling:

Traditional Post-Training:
├─ Load ALL data (12.4M examples) → 16 GB RAM
├─ Sequential processing
└─ Fixed order, possible duplicates

Random Streaming v3.0:
├─ Load CHUNKS dynamically → <1 GB RAM
├─ Random sampling from 12.4M examples
├─ Smart techniques:
│   ✓ Stratified: Balanced categories
│   ✓ Curriculum: Easy → Hard progression
│   ✓ Reservoir: Zero duplicates
└─ Result: 2x faster, better performance
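
For intuition, here is a minimal sketch of the chunked random-streaming idea. It is not the code in post_training_streaming_advanced.py, just an illustration of reading a Parquet file chunk by chunk and sampling randomly within each chunk.

import random
import pyarrow.parquet as pq

def stream_random_samples(parquet_path, num_samples, chunk_rows=10_000, seed=42):
    """Yield up to num_samples rows without ever holding the full dataset in memory."""
    rng = random.Random(seed)
    pf = pq.ParquetFile(parquet_path)
    drawn = 0
    for batch in pf.iter_batches(batch_size=chunk_rows):  # one chunk in RAM at a time
        rows = batch.to_pylist()
        rng.shuffle(rows)                                  # random order within the chunk
        for row in rows:
            if drawn >= num_samples:
                return
            drawn += 1
            yield row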

Configuration Example:

# Edit post_training_streaming_advanced.py

MODEL_NAME = "Pacific-Prime/adversarial_3.8.0"  # This foundation model
PARQUET_PATH = "your_data.parquet"  # Your training data
OUTPUT_DIR = "./post_trained_model"

# Use PowerfulDataset (combines all techniques)
DATASET_TYPE = "powerful"
NUM_SAMPLES = 50_000  # Per epoch
NUM_EPOCHS = 5        # Typically 3-5 epochs

# Enable advanced features
ENABLE_STRATIFIED = True    # Balanced distribution
ENABLE_CURRICULUM = True    # Easy → Hard
ENABLE_RESERVOIR = True     # No duplicates

STRATA_WEIGHTS = {
    "syntax_error": 0.40,
    "logical_error": 0.40,
    "incomplete": 0.20
}

Run Post-Training:

python3 post_training_streaming_advanced.py

Expected Results:

  • Training time: 6-12 hours (2x faster than traditional)
  • Samples needed: 150K-250K (50% less than traditional)
  • Final model: Task-aware, ready for use

Step 2: After Post-Training - Inference

Once post-trained, use the model:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load post-trained model
model = AutoModelForCausalLM.from_pretrained(
    "./post_trained_model",  # Your output dir
    torch_dtype="bfloat16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./post_trained_model")

# Generate
prompt = "Fix this Python code:\ndef test():\n    print('hello'"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Step 3 (Optional): Further Fine-Tuning

Want to specialize even more? Fine-tune the post-trained model:

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load post-trained model
model = AutoModelForCausalLM.from_pretrained("./post_trained_model")

# Fine-tune on your custom domain
training_args = TrainingArguments(
    output_dir="./specialized_model",
    num_train_epochs=2,
    learning_rate=1e-5,  # Low LR
)

trainer = Trainer(model=model, args=training_args, ...)  # supply your train_dataset (and data_collator) here
trainer.train()

🎓 Post-Training System Features

🎯 Stratified Sampling

Ensures balanced distribution across categories:

Your Dataset (imbalanced):
├─ Syntax errors: 45%
├─ Logical errors: 35%
└─ Incomplete code: 20%

Post-Training (balanced):
├─ Syntax errors: 33%
├─ Logical errors: 33%
└─ Incomplete code: 34%

Result: Better performance on ALL categories
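
A simplified sketch of the stratified idea, using the category weights from the configuration above (illustrative, not the library implementation):

import random

def stratified_sample(examples_by_category, strata_weights, num_samples, seed=42):
    """Draw a batch whose category mix follows strata_weights instead of the raw dataset mix."""
    rng = random.Random(seed)
    batch = []
    for category, weight in strata_weights.items():
        pool = examples_by_category[category]
        batch.extend(rng.choices(pool, k=int(num_samples * weight)))
    rng.shuffle(batch)
    return batch

# e.g. stratified_sample(data, {"syntax_error": 0.40, "logical_error": 0.40, "incomplete": 0.20}, 50_000)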

📚 Curriculum Learning

Progressive difficulty for stable training:

Epoch 1-2: Easy examples
  └─ Loss: 2.5 → 1.8 (fast learning)

Epoch 3-4: Medium examples
  └─ Loss: 1.8 → 1.2 (steady progress)

Epoch 5: Hard examples
  └─ Loss: 1.2 → 0.8 (fine-tuning)
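
A minimal sketch of a linear curriculum schedule. It assumes each example carries a difficulty score in [0, 1]; how difficulty is actually scored is up to the post-training script.

def curriculum_pool(examples, epoch, num_epochs):
    """Restrict sampling to examples at or below the current difficulty ceiling (linear schedule)."""
    max_difficulty = (epoch + 1) / num_epochs   # 0.2, 0.4, ..., 1.0 over 5 epochs
    return [ex for ex in examples if ex["difficulty"] <= max_difficulty]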

💧 Reservoir Sampling

Zero duplicates across all epochs:

Traditional:
├─ Total: 250K samples
└─ Unique: ~210K (40K duplicates wasted)

Reservoir Streaming:
├─ Total: 250K samples
└─ Unique: 250K (100% unique!) ✓
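
Reservoir sampling (Algorithm R) is the standard way to draw a fixed-size uniform sample from a stream in a single pass, which is how duplicates are avoided. A minimal sketch:

import random

def reservoir_sample(stream, k, seed=42):
    """Keep k items; every element of the stream ends up in the sample with equal probability."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)   # replace a slot with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir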

📈 Expected Performance

After Post-Training

With our advanced streaming system, expect:

| Metric         | Traditional | Random Streaming v3.0 | Improvement |
|----------------|-------------|-----------------------|-------------|
| Training Time  | ~24 hours   | ~12 hours             | 2x faster   |
| Samples Needed | 500K        | 250K                  | 50% less    |
| Memory Usage   | 16 GB       | <1 GB                 | 16x less    |
| Duplicates     | ~40K        | 0                     | 100% unique |

Post-Trained Model Capabilities

After post-training on code error tasks:

| Task                    | Expected Accuracy | Notes               |
|-------------------------|-------------------|---------------------|
| Syntax Error Detection  | 90-95%            | High precision      |
| Logical Error Detection | 85-90%            | Complex reasoning   |
| Code Completion         | 80-85%            | Context-aware       |
| Error Explanation       | 85-90%            | Clear descriptions  |

📚 Documentation & Resources

Post-Training System Documentation (DJ3.8B Dataset)

Training Code

The post-training scripts implement advanced random streaming techniques:

  • post_training_streaming.py - Basic v2.0 random streaming
  • post_training_streaming_advanced.py - Advanced v3.0 (recommended)
  • demo_advanced.py - Test all features

These scripts work with the DJ3.8B Dataset to post-train this foundation model.

External Resources


πŸ› οΈ Advanced Post-Training

Using PowerfulStreamingDataset (Recommended)

from post_training_streaming_advanced import PowerfulStreamingDataset, post_train_advanced

# Run complete post-training
post_train_advanced(
    model_name="Pacific-Prime/adversarial_3.8.0",  # This foundation model
    parquet_path="your_data.parquet",
    output_dir="./post_trained_model",

    # Dataset configuration
    dataset_type="powerful",  # All features combined
    num_samples=50_000,       # Per epoch
    num_epochs=5,

    # Advanced features
    enable_stratified=True,   # Balanced
    enable_curriculum=True,   # Progressive
    enable_reservoir=True,    # No duplicates

    curriculum_schedule="linear",
    strata_weights={
        "syntax_error": 0.40,
        "logical_error": 0.40,
        "incomplete": 0.20
    }
)

Available Dataset Types

Choose based on your needs:

# 1. Basic random (fast, simple)
dataset_type = "random"

# 2. Balanced distribution
dataset_type = "stratified"

# 3. Progressive difficulty
dataset_type = "curriculum"

# 4. No duplicates
dataset_type = "reservoir"

# 5. ALL features (BEST)
dataset_type = "powerful"

βš™οΈ Technical Specifications

Model Files

adversarial_3.8.0/
├── config.json              # Model configuration
├── generation_config.json   # Generation settings
├── model.safetensors        # Model weights (~7.3 GB)
├── tokenizer.json           # Fast tokenizer
├── tokenizer_config.json    # Tokenizer settings
├── special_tokens_map.json  # Special tokens
└── README.md                # This file
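
To fetch these files locally (for example before post-training on a separate node), the standard Hugging Face Hub download applies:

from huggingface_hub import snapshot_download

# Downloads config.json, model.safetensors and the tokenizer files listed above
local_dir = snapshot_download(repo_id="Pacific-Prime/adversarial_3.8.0")
print(local_dir)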

System Requirements

Post-Training:

  • GPU: 1x A100 (80GB) or 2x RTX 4090
  • RAM: 64 GB
  • Storage: 20 GB (input data + output model)
  • Time: 6-12 hours (with advanced streaming)

Inference (after post-training):

  • GPU: 1x RTX 3090 / 4090 (24GB)
  • RAM: 16 GB
  • BF16 precision recommended

🎯 Use Cases (After Post-Training)

Direct Use Cases

  1. Code Error Detection & Fixing

    • Syntax errors, logical bugs, incomplete code
    • Best for adversarial programming patterns
  2. Programming Education

    • Explaining errors to students
    • Pedagogical code analysis
  3. Code Review Assistance

    • Identifying potential issues
    • Suggesting improvements
  4. Debugging Support

    • Error explanation in plain language
    • Step-by-step debugging guidance

Example Data Format

Your post-training dataset should look like this:

{
    "messages": [
        {
            "role": "user",
            "content": "Fix this code:\ndef test():\n    print('hello'"
        },
        {
            "role": "assistant",
            "content": "Fixed code:\n```python\ndef test():\n    print('hello')\n```"
        }
    ]
}
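
How each record is rendered into a training string depends on the chat template your post-training script applies; the sketch below uses hypothetical role markers purely to illustrate the flattening step.

def render_example(record):
    """Flatten one {"messages": [...]} record into a single training string (hypothetical template)."""
    parts = []
    for msg in record["messages"]:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}")   # role markers are illustrative only
    return "\n".join(parts)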

🔬 Reproducibility

Exact Post-Training Reproduction

To reproduce results:

# 1. Edit configuration in post_training_streaming_advanced.py
# Set:
# - MODEL_NAME = "Pacific-Prime/adversarial_3.8.0"
# - PARQUET_PATH = "path/to/DJ3.8B.parquet"
# - DATASET_TYPE = "powerful"
# - NUM_SAMPLES = 50_000
# - NUM_EPOCHS = 5
# - SEED = 42

# 2. Run
python3 post_training_streaming_advanced.py

Reproducibility Guarantees:

  • ✅ Same seed → Same samples in same order
  • ✅ Same hyperparameters → Same convergence
  • ✅ Deterministic training (within GPU variance)
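
If you modify the scripts, seeding everything explicitly keeps runs comparable; a minimal sketch:

from transformers import set_seed

set_seed(42)   # seeds Python, NumPy and PyTorch (incl. CUDA) in one call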

📄 License

This model is released under the Apache 2.0 License.

  • ✅ Commercial use allowed
  • ✅ Modification allowed
  • ✅ Distribution allowed
  • ✅ Private use allowed
  • ⚠️ Must include license and copyright notice

πŸ™ Acknowledgments

  • DeepSpeed - Efficient distributed training
  • vLLM - Fast inference engine
  • Hugging Face - Transformers library
  • Random Streaming v3.0 - Advanced post-training techniques

📞 Contact & Support

For questions or issues:


🎯 Quick Start Checklist

  • Read this README completely
  • Understand this is a foundation model (needs post-training)
  • Prepare your post-training dataset (see format above)
  • Review post-training configurations
  • Run python3 post_training_streaming_advanced.py
  • Test the post-trained model
  • (Optional) Fine-tune for your specific domain
  • Deploy with vLLM for production

Need help? See ADVANCED_GUIDE.md for complete documentation.


Foundation model ready for post-training with Advanced Random Streaming v3.0! 🎧

Start with: python3 post_training_streaming_advanced.py 🚀
