Improve model card: Add `library_name`, expanded description, GitHub link, and usage

#1 · opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +57 -9
README.md CHANGED
@@ -1,20 +1,69 @@
  ---
  base_model: Qwen/Qwen2.5-3B
- license: apache-2.0
  datasets:
- - math
+ - math
+ language:
+ - en
+ license: apache-2.0
  metrics:
- - accuracy
+ - accuracy
  pipeline_tag: text-generation
- language:
- - en
+ library_name: transformers
  ---

  # Qwen2.5-3B-Intuitor-MATH-1EPOCH

- **Description:**
+ This model is an Intuitor-fine-tuned version of Qwen2.5-3B trained on the MATH dataset, as presented in the paper [Learning to Reason without External Rewards](https://huggingface.co/papers/2505.19590).
+
+ ## Introduction
+
+ **Intuitor** is a reinforcement learning method that fine-tunes large language models (LLMs) using *self-certainty*—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm we call **Reinforcement Learning from Internal Feedback (RLIF)**.
+
+ **Reinforcement Learning from Internal Feedback (RLIF)** is a training framework where language models learn *without any external rewards, gold labels, or verifiers*. Instead, models improve by optimizing *intrinsic signals*—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.
+
+ Intuitor instantiates RLIF by using **self-certainty**—a model's confidence measured via KL divergence to uniform—as an intrinsic reward in the GRPO policy optimization algorithm.
+
+ For more details, see the [project's GitHub repository](https://github.com/sunblaze-ucb/Intuitor).
+
+ ## Usage
+
+ You can use this model with the Hugging Face `transformers` library.

- An Intuitor-fine-tuned version of Qwen2.5-3B trained on the MATH dataset.
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,  # or torch.float16 depending on your GPU
+     device_map="auto"
+ )
+
+ messages = [
+     {"role": "user", "content": "What is the capital of France?"},
+ ]
+
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     model_inputs.input_ids,
+     max_new_tokens=50,
+     temperature=0.7,
+     do_sample=True
+ )
+
+ output = tokenizer.decode(generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)
+ print(output)
+ ```

  ---

@@ -27,5 +76,4 @@ An Intuitor-fine-tuned version of Qwen2.5-3B trained on the MATH dataset.
  journal = {arXiv preprint arXiv:2505.19590},
  year = {2025}
  }
- ```
-
+ ```
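
As context for reviewers, the sketch below is not part of this PR. It shows one way to estimate the self-certainty quantity the card's Introduction refers to: the model's confidence measured as a KL divergence to the uniform distribution over the vocabulary, averaged over the tokens of an answer, here taken as KL(U || p). The direction of the KL term, the example prompt and answer, and the variable names are assumptions made for illustration; the exact definition used by Intuitor is given in the paper and the GitHub repository.

```python
# Illustrative only (not part of this PR): a rough estimate of the
# "self-certainty" signal described in the model card, taken here as the
# average KL divergence from a uniform distribution over the vocabulary to
# the model's next-token distribution, KL(U || p), over the answer tokens.
import math

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Hypothetical prompt/answer pair, chosen only to demonstrate the computation.
prompt = "What is 7 * 8?"
answer = " 7 * 8 = 56."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(full_ids).logits  # (1, sequence_length, vocab_size)

# Logits at position t predict token t + 1, so the distributions that score the
# answer tokens start one position before the answer begins. (Tokenizing
# prompt + answer jointly may shift the boundary slightly; fine for a sketch.)
answer_start = prompt_ids.shape[1]
answer_logits = logits[0, answer_start - 1 : full_ids.shape[1] - 1].float()
log_probs = F.log_softmax(answer_logits, dim=-1)

# KL(U || p) at each position = -log(V) - mean_v log p(v); average over the answer.
vocab_size = log_probs.shape[-1]
kl_to_uniform = -math.log(vocab_size) - log_probs.mean(dim=-1)
print(f"self-certainty (mean KL to uniform): {kl_to_uniform.mean().item():.3f}")
```

Under this reading, a more peaked next-token distribution yields a larger value, so maximizing it rewards confident generations, which matches the card's description of the intrinsic reward that Intuitor plugs into GRPO.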