OLMoE-1B-7B DPO with DoRA

A preference-aligned version of OLMoE-1B-7B, fine-tuned with DoRA (Weight-Decomposed Low-Rank Adaptation) adapters and aligned with DPO (Direct Preference Optimization).

What's This?

This model was fine-tuned in two stages:

  1. SFT on 20K examples from OpenHermes-2.5 for instruction-following
  2. DPO on 10K preference pairs from UltraFeedback for alignment

The adapter combines DoRA (LoRA extended with a learned magnitude decomposition) with QLoRA-style 4-bit quantization for memory-efficient training on consumer GPUs.
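
As a reference point, here is a minimal configuration sketch of that setup using PEFT and bitsandbytes. This is not the exact training script; the rank, alpha, target modules, and quantization settings are taken from the Training Details table below.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA: load the base MoE model in 4-bit NF4 with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "1024m/OLMoE-1B-7B-0924-Base",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# DoRA: a standard LoRA config with use_dora=True (requires a recent PEFT release)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

The resulting PEFT model can then be handed to standard trainers for the two stages above (for example TRL's SFTTrainer for stage 1 and DPOTrainer for stage 2; a DPO sketch appears under Training Details).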

Available Formats

We provide two versions for different use cases:

1. LoRA Adapter (This Repo) - 8.3MB

Lightweight adapter weights for PEFT loading. Perfect for experimentation and memory-constrained environments.

2. Merged Weights - demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo-merged

Full merged model ready for production inference with vLLM. Recommended for deployment.

Quick Start

Option A: Use Merged Model (Recommended for Production)

Best for fast inference with vLLM:

# Serve with vLLM
vllm serve demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo-merged \
  --max-model-len 4096 \
  --dtype bfloat16

# Inference
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo-merged",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }' | jq -r '.choices[0].message.content'

Option B: Use LoRA Adapter (For Experimentation)

Load adapter with PEFT:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load base + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "1024m/OLMoE-1B-7B-0924-Base",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")
model = PeftModel.from_pretrained(base_model, "demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")
model.eval()

# Generate
messages = [{"role": "user", "content": "What is machine learning?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
    
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note: vLLM does not yet support LoRA adapters for the OLMoE architecture. For vLLM inference, use the merged model instead.

Python Inference (vLLM Server)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo-merged",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."}
    ],
    max_tokens=100,
    temperature=0.8
)

print(response.choices[0].message.content)

Merge Adapter Yourself

Want to create your own merged model?

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in bfloat16 (the training precision) so the merged checkpoint is saved in half precision
base = AutoModelForCausalLM.from_pretrained(
    "1024m/OLMoE-1B-7B-0924-Base",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")
merged = model.merge_and_unload()

# Save
merged.save_pretrained("./my_merged_model")
tokenizer = AutoTokenizer.from_pretrained("demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")
tokenizer.save_pretrained("./my_merged_model")

Training Details

Parameter            Value
-------------------  -------------------------------------
Base Model           OLMoE-1B-7B-0924
LoRA Rank            16
LoRA Alpha           32
Target Modules       q_proj, v_proj
Quantization         4-bit NF4 + double quantization
Training Precision   bfloat16
SFT Data             20K samples (OpenHermes-2.5)
DPO Data             10K preference pairs (UltraFeedback)
DPO Beta             0.1
Learning Rate        5e-5
Batch Size           36 (effective)
Hardware             2× NVIDIA A40 80GB
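
For stage 2, the following is a hedged sketch of a DPO run with TRL that mirrors the hyperparameters above. It assumes model and tokenizer are the DoRA-wrapped model and tokenizer from the earlier sketch, preference_pairs is a dataset with prompt/chosen/rejected columns, and the split of the effective batch size of 36 across devices and accumulation steps is illustrative (only the effective size is documented). Exact argument names vary across TRL versions.

from trl import DPOConfig, DPOTrainer

dpo_args = DPOConfig(
    output_dir="olmoe-dora-dpo",
    beta=0.1,                       # DPO beta from the table above
    learning_rate=5e-5,
    per_device_train_batch_size=2,  # illustrative split: 2 GPUs x 2 x 9 accumulation = 36 effective
    gradient_accumulation_steps=9,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,                    # PEFT/DoRA model; the frozen base acts as the reference policy
    ref_model=None,
    args=dpo_args,
    train_dataset=preference_pairs,
    processing_class=tokenizer,
)
trainer.train()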

Key Techniques

  • DoRA: Weight decomposition for better adapter performance than standard LoRA
  • QLoRA: 4-bit quantization for memory-efficient training
  • DPO: Direct preference optimization without reward model
  • Magnitude Vector Freezing: Prevents gradient issues during DPO training (see the sketch below)
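
The magnitude-freezing step can be reproduced with a few lines of PEFT-level code. This is an illustrative sketch: the parameter name lora_magnitude_vector follows current PEFT naming for DoRA and may differ in other versions.

# Freeze DoRA magnitude vectors so DPO only updates the low-rank directions
frozen = 0
for name, param in model.named_parameters():
    if "lora_magnitude_vector" in name:
        param.requires_grad = False
        frozen += 1
print(f"Froze {frozen} DoRA magnitude parameters")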

Chat Template

User:
<message>


Assistant:
<response>

Roles supported: system, user, assistant
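
To see exactly how a conversation is rendered (including a system message), you can apply the bundled chat template directly; the token layout comes from chat_template.jinja in this repo.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is DPO?"},
]

# Render the prompt string the model expects, with the assistant turn opened
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)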

Files Included

  • adapter_config.json - PEFT adapter configuration
  • adapter_model.safetensors - Trained adapter weights (~8.3MB)
  • tokenizer.json - Tokenizer vocabulary
  • tokenizer_config.json - Tokenizer configuration
  • chat_template.jinja - Chat formatting template
  • special_tokens_map.json - Special token mappings

License

Apache 2.0. Please also check the license of the base model.

Citation

If you use this model, please cite the original datasets and frameworks:


Note: This adapter requires the base model to function. For production deployment, we recommend the merged version, which can be served directly with vLLM.
