Active filters: rlhf
BlankZ/ragen-checkpoint-step-600-bf16
Text Generation • 2B • Updated • 4
BlankZ/ragen-checkpoint-step-900-bf16
Text Generation • 2B • Updated • 1
BlankZ/ragen-checkpoint-step-1200-bf16
Text Generation • 2B • Updated • 1
Text Generation • 0.6B • Updated • 6 • 2
gandhiraketla277/demo-lora-reward-model
Text Generation • Updated
Nirav-Madhani/gemma3-270m-grpo-math
Text Generation • 0.3B • Updated • 4
Reinforcement Learning • 5B • Updated • 12 • 1
Arc-Intelligence/ATLAS-8B-Thinking
Text Generation • 8B • Updated • 31 • 4
SakaiSec/ATLAS-8B-Thinking-Q4_K_M-GGUF
Text Generation • 8B • Updated • 4
mradermacher/ATLAS-8B-Thinking-GGUF
Reinforcement Learning • 8B • Updated • 1.08k • 1
mradermacher/ATLAS-8B-Thinking-i1-GGUF
Reinforcement Learning • 8B • Updated • 152 • 1
Asap7772/Qwen3-4B-second-stage-DPO-lr-1e-6-beta-0.1-loss-sigmoid-rpo-1.0-ckpt-135
196k • Updated • 1
Asap7772/Qwen3-4B-second-stage-DPO-lr-1e-6-beta-0.01-loss-sigmoid-rpo-1.0-ckpt-135
196k • Updated
Asap7772/Qwen3-4B-second-stage-DPO-lr-1e-6-beta-0.05-loss-sigmoid-rpo-1.0-ckpt-135
196k • Updated
Asap7772/Qwen3-4B-second-stage-DPO-lr-1e-7-beta-0.1-loss-sigmoid-rpo-1.0-ckpt-135
196k • Updated • 1
Asap7772/Qwen3-4B-second-stage-DPO-lr-1e-7-beta-0.01-loss-sigmoid-rpo-1.0-ckpt-135
196k • Updated
Asap7772/Qwen3-4B-second-stage-DPO-lr-1e-7-beta-0.05-loss-sigmoid-rpo-1.0-ckpt-135
196k • Updated
Tanaybh/gpt2-rlhf-anthropic
Text Generation • 0.1B • Updated • 2
mradermacher/gpt2-rlhf-anthropic-GGUF
0.1B • Updated • 99
nabeelshan/rlhf-gpt2-pipeline
Text Generation • Updated
Schrieffer/Llama-SARM-4B-PostSAEPretrain
Feature Extraction • 5B • Updated • 2 • 1
Text Generation • 0.1B • Updated • 5
Vibudhbh/gpt2-rlhf-implementation
Text Generation • 0.1B • Updated • 6
mradermacher/gpt2-rlhf-implementation-GGUF
0.1B • Updated • 177
mzhaoshuai/Mistral-7B-v0.1-conf-refalign
Text Generation • Updated
geoffmunn/Qwen3-4B-SafeRL-GGUF
Text Generation • 4B • Updated • 15
orville-wang/Transit-R1-3B
Text Generation • 3B • Updated • 2 • 1
mradermacher/Transit-R1-3B-GGUF
3B • Updated • 17
samhitha2601/llama3.2-3b-ppo
Reinforcement Learning • Updated
samhitha2601/llama3.2-3b-ppo-critic
Reinforcement Learning • Updated