OLMoE-1B-7B DPO with DoRA
A preference-aligned version of OLMoE-1B-7B trained with DoRA (Weight-Decomposed Low-Rank Adaptation) and DPO (Direct Preference Optimization).
What's This?
This model was fine-tuned in two stages:
- SFT on 20K examples from OpenHermes-2.5 for instruction-following
- DPO on 10K preference pairs from UltraFeedback for alignment
The adapter combines DoRA (LoRA with an additional learned magnitude decomposition) with QLoRA-style 4-bit quantization, keeping training within reach of consumer GPUs.
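For reference, here is a minimal sketch of what this DoRA + QLoRA setup looks like with PEFT and bitsandbytes, using the rank, alpha, target modules, and quantization settings from the Training Details table below. It is not the actual training script; the authoritative adapter settings live in adapter_config.json, and use_dora requires a recent PEFT release.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
# QLoRA-style 4-bit NF4 quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "1024m/OLMoE-1B-7B-0924-Base",
    quantization_config=bnb_config,
    device_map="auto",
)
# DoRA adapter: low-rank LoRA matrices plus a learned magnitude vector
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,  # enables weight-decomposed LoRA (DoRA)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, peft_config)
model.print_trainable_parameters()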
Available Formats
We provide two versions for different use cases:
1. LoRA Adapter (This Repo) - 8.3MB
Lightweight adapter weights for PEFT loading. Perfect for experimentation and memory-constrained environments.
2. Merged Weights - demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo-merged
Full merged model ready for production inference with vLLM. Recommended for deployment.
Quick Start
Option A: Use Merged Model (Recommended for Production)
Best for fast inference with vLLM:
# Serve with vLLM
vllm serve demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo-merged \
--max-model-len 4096 \
--dtype bfloat16
# Inference
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo-merged",
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
"max_tokens": 200,
"temperature": 0.7
}' | jq -r '.choices[0].message.content'
Option B: Use LoRA Adapter (For Experimentation)
Load adapter with PEFT:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
# Load base + adapter
base_model = AutoModelForCausalLM.from_pretrained(
"1024m/OLMoE-1B-7B-0924-Base",
device_map="auto",
torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")
model = PeftModel.from_pretrained(base_model, "demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")
model.eval()
# Generate
messages = [{"role": "user", "content": "What is machine learning?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Note: vLLM does not yet support LoRA adapters for the OLMoE architecture. For vLLM inference, use the merged model.
Python Inference (vLLM Server)
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="dummy"
)
response = client.chat.completions.create(
model="demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo-merged",
messages=[
{"role": "user", "content": "Write a haiku about machine learning."}
],
max_tokens=100,
temperature=0.8
)
print(response.choices[0].message.content)
Merge Adapter Yourself
Want to create your own merged model?
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load and merge
base = AutoModelForCausalLM.from_pretrained("1024m/OLMoE-1B-7B-0924-Base", device_map="auto")
model = PeftModel.from_pretrained(base, "demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")
merged = model.merge_and_unload()
# Save
merged.save_pretrained("./my_merged_model")
tokenizer = AutoTokenizer.from_pretrained("demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")
tokenizer.save_pretrained("./my_merged_model")
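The saved directory can then be served locally with vLLM in the same way as the published merged repo:
vllm serve ./my_merged_model \
--max-model-len 4096 \
--dtype bfloat16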
Training Details
| Parameter | Value |
|---|---|
| Base Model | OLMoE-1B-7B-0924 |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, v_proj |
| Quantization | 4-bit NF4 + Double Quant |
| Training Precision | bfloat16 |
| SFT Data | 20K samples (OpenHermes-2.5) |
| DPO Data | 10K preference pairs (UltraFeedback) |
| DPO Beta | 0.1 |
| Learning Rate | 5e-5 |
| Batch Size | 36 (effective) |
| Hardware | 2× NVIDIA A40 |
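For orientation, a rough sketch of how the DPO stage maps onto TRL with the hyperparameters above. This is not the actual training script: the per-device batch size and accumulation steps are assumptions (only the effective batch size of 36 is documented), the 10K subsample is illustrative, and argument names such as processing_class vs. tokenizer differ across TRL versions. `model` is the SFT'd DoRA/QLoRA PEFT model from the earlier stage.
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import DPOConfig, DPOTrainer
tokenizer = AutoTokenizer.from_pretrained("1024m/OLMoE-1B-7B-0924-Base")
# 10K preference pairs per the table; the actual sampling is not documented
prefs = (
    load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
    .shuffle(seed=42)
    .select(range(10_000))
)
dpo_args = DPOConfig(
    output_dir="olmoe-dora-dpo",
    beta=0.1,
    learning_rate=5e-5,
    per_device_train_batch_size=6,   # assumed split of the
    gradient_accumulation_steps=3,   # documented effective batch size of 36
    bf16=True,
)
trainer = DPOTrainer(
    model=model,                 # PEFT model carried over from the SFT stage
    args=dpo_args,
    train_dataset=prefs,         # prompt / chosen / rejected pairs
    processing_class=tokenizer,  # use tokenizer= on older TRL releases
)
trainer.train()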
Key Techniques
- DoRA: Weight decomposition for better adapter performance than standard LoRA
- QLoRA: 4-bit quantization for memory-efficient training
- DPO: Direct preference optimization without reward model
- Magnitude Vector Freezing: Prevents gradient issues during DPO training
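A hedged sketch of the last point: with a PEFT DoRA adapter, the magnitude parameters can be frozen before DPO so that only the low-rank matrices receive gradients. In current PEFT these parameters contain lora_magnitude_vector in their names, though the naming may change between versions.
# Freeze DoRA magnitude vectors ahead of DPO training
for name, param in model.named_parameters():
    if "lora_magnitude_vector" in name:
        param.requires_grad_(False)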
Chat Template
User:
<message>
Assistant:
<response>
Roles supported: system, user, assistant
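The template ships as chat_template.jinja and is applied via tokenizer.apply_chat_template, for example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is machine learning?"},
]
# Renders the conversation into the User:/Assistant: format above and
# appends the assistant prefix so generation starts in the right place
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)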
Files Included
- adapter_config.json - PEFT adapter configuration
- adapter_model.safetensors - Trained adapter weights (~8.3MB)
- tokenizer.json - Tokenizer vocabulary
- tokenizer_config.json - Tokenizer configuration
- chat_template.jinja - Chat formatting template
- special_tokens_map.json - Special token mappings
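To inspect the adapter configuration (rank, alpha, target modules, DoRA flag) without downloading the base model, PeftConfig can read adapter_config.json directly:
from peft import PeftConfig
cfg = PeftConfig.from_pretrained("demonlxrd/olmoe-openhermes-ultrafeedback-dora-dpo")
print(cfg.r, cfg.lora_alpha, cfg.target_modules, getattr(cfg, "use_dora", None))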
License
Apache 2.0. Please also check the license of the base model.
Citation
If you use this model, please cite the original datasets and frameworks:
- OpenHermes-2.5: teknium/OpenHermes-2.5
- UltraFeedback: HuggingFaceH4/ultrafeedback_binarized
- TRL: HuggingFace TRL
Note: This adapter requires the base model to function. For production use, we recommend the merged version for better performance with vLLM.