Improve model card: Add `library_name`, expanded description, GitHub link, and usage

#1 · opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +57 -9
README.md CHANGED
@@ -1,20 +1,69 @@
  ---
  base_model: Qwen/Qwen2.5-3B
- license: apache-2.0
  datasets:
- - math
+ - math
+ language:
+ - en
+ license: apache-2.0
  metrics:
- - accuracy
+ - accuracy
  pipeline_tag: text-generation
- language:
- - en
+ library_name: transformers
  ---

  # Qwen2.5-3B-Intuitor-MATH-1EPOCH

- **Description:**
+ This model is an Intuitor-fine-tuned version of Qwen2.5-3B trained on the MATH dataset, as presented in the paper [Learning to Reason without External Rewards](https://huggingface.co/papers/2505.19590).
+
+ ## Introduction
+
+ **Intuitor** is a reinforcement learning method that fine-tunes large language models (LLMs) using *self-certainty*—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm we call **Reinforcement Learning from Internal Feedback (RLIF)**.
+
+ **Reinforcement Learning from Internal Feedback (RLIF)** is a training framework where language models learn *without any external rewards, gold labels, or verifiers*. Instead, models improve by optimizing *intrinsic signals*—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.
+
+ Intuitor instantiates RLIF by using **self-certainty**—a model's confidence measured via KL divergence to uniform—as an intrinsic reward in the GRPO policy optimization algorithm.
+
+ For more details, see the [project's GitHub repository](https://github.com/sunblaze-ucb/Intuitor).
+
+ ## Usage
+
+ You can use this model with the Hugging Face `transformers` library.

- An Intuitor-fine-tuned version of Qwen2.5-3B trained on the MATH dataset.
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,  # or torch.float16 depending on your GPU
+     device_map="auto"
+ )
+
+ messages = [
+     {"role": "user", "content": "What is the capital of France?"},
+ ]
+
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     model_inputs.input_ids,
+     max_new_tokens=50,
+     temperature=0.7,
+     do_sample=True
+ )
+
+ output = tokenizer.decode(generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)
+ print(output)
+ ```

  ---

@@ -27,5 +76,4 @@ An Intuitor-fine-tuned version of Qwen2.5-3B trained on the MATH dataset.
  journal = {arXiv preprint arXiv:2505.19590},
  year = {2025}
  }
- ```
-
+ ```
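
As context for reviewers, the sketch below is not part of this PR. It shows one way to estimate the self-certainty quantity the card's Introduction refers to: the model's confidence measured as a KL divergence to the uniform distribution over the vocabulary, averaged over the tokens of an answer, here taken as KL(U || p). The direction of the KL term, the example prompt and answer, and the variable names are assumptions made for illustration; the exact definition used by Intuitor is given in the paper and the GitHub repository.

```python
# Illustrative only (not part of this PR): a rough estimate of the
# "self-certainty" signal described in the model card, taken here as the
# average KL divergence from a uniform distribution over the vocabulary to
# the model's next-token distribution, KL(U || p), over the answer tokens.
import math

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Hypothetical prompt/answer pair, chosen only to demonstrate the computation.
prompt = "What is 7 * 8?"
answer = " 7 * 8 = 56."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(full_ids).logits  # (1, sequence_length, vocab_size)

# Logits at position t predict token t + 1, so the distributions that score the
# answer tokens start one position before the answer begins. (Tokenizing
# prompt + answer jointly may shift the boundary slightly; fine for a sketch.)
answer_start = prompt_ids.shape[1]
answer_logits = logits[0, answer_start - 1 : full_ids.shape[1] - 1].float()
log_probs = F.log_softmax(answer_logits, dim=-1)

# KL(U || p) at each position = -log(V) - mean_v log p(v); average over the answer.
vocab_size = log_probs.shape[-1]
kl_to_uniform = -math.log(vocab_size) - log_probs.mean(dim=-1)
print(f"self-certainty (mean KL to uniform): {kl_to_uniform.mean().item():.3f}")
```

Under this reading, a more peaked next-token distribution yields a larger value, so maximizing it rewards confident generations, which matches the card's description of the intrinsic reward that Intuitor plugs into GRPO.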