---
pipeline_tag: image-text-to-text
library_name: transformers
license: apache-2.0
---

# PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

<div align="center">
  <a href='https://arxiv.org/abs/2512.04784'><img src='https://img.shields.io/badge/ArXiv-red?logo=arxiv'></a>  &nbsp;
  <a href='https://x-gengroup.github.io/HomePage_PaCo-RL/'><img src='https://img.shields.io/badge/ProjectPage-purple?logo=github'></a> &nbsp;
  <a href="https://github.com/X-GenGroup/PaCo-RL"><img src="https://img.shields.io/badge/Code-9E95B7?logo=github"></a> &nbsp; 
  <a href='https://huggingface.co/collections/X-GenGroup/paco-rl'><img src='https://img.shields.io/badge/Data & Model-green?logo=huggingface'></a> &nbsp;
</div>

**PaCo-RL** is a comprehensive framework designed for consistent image generation using reinforcement learning. It tackles the challenges of preserving identities, styles, and logical coherence across multiple images, which is crucial for applications such as storytelling and character design.

This model is presented in the paper [PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling](https://huggingface.co/papers/2512.04784).

- **Project Page**: https://x-gengroup.github.io/HomePage_PaCo-RL/
- **Code Repository**: https://github.com/X-GenGroup/PaCo-RL
- **Data & Models Collection**: https://huggingface.co/collections/X-GenGroup/paco-rl

## Overview

PaCo-RL argues that reinforcement learning offers a promising alternative for learning complex and subjective visual criteria in a data-free manner. The framework combines a specialized consistency reward model with an efficient RL algorithm.

### Key Components

-   **PaCo-Reward**: A pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and Chain-of-Thought (CoT) reasoning.
-   **PaCo-GRPO**: An efficient RL algorithm that leverages a novel resolution-decoupled optimization strategy to substantially reduce RL training cost, together with a log-tamed multi-reward aggregation mechanism that keeps reward optimization balanced and stable (a toy illustration of such an aggregation follows below).
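
The exact log-tamed aggregation is defined in the paper; purely as a hedged illustration, the sketch below shows one way a log-compressed combination of several bounded rewards could look. The function name, reward names, and weights are hypothetical and are not taken from the PaCo-RL codebase.

```python
import math

def log_tamed_aggregate(rewards, weights=None):
    """Combine several per-sample rewards (e.g. consistency, text alignment)
    into one scalar. log1p compresses large values so that no single reward
    signal dominates the aggregate."""
    weights = weights or {name: 1.0 for name in rewards}
    return sum(weights[name] * math.log1p(max(v, 0.0)) for name, v in rewards.items())

# Toy usage with made-up reward values in [0, 1]
score = log_tamed_aggregate({"consistency": 0.92, "text_alignment": 0.55, "aesthetic": 0.70})
print(f"Aggregated reward: {score:.4f}")
```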

Extensive experiments show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability.

<div align="center">
  <img src="https://github.com/X-GenGroup/PaCo-RL/raw/main/assets/dataset_pipeline.png" alt="PaCo-RL Overview" width="800"/>
</div>



## Example Usage
For detailed installation instructions, reward model training (PaCo-Reward), and full RL training (PaCo-GRPO), please refer to the [official GitHub repository](https://github.com/X-GenGroup/PaCo-RL), which provides comprehensive documentation for each component. The snippet below scores the consistency of two images with the PaCo-Reward LoRA adapter loaded on top of Qwen2.5-VL-7B-Instruct.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# Load base model
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "X-GenGroup/PaCo-Reward-7B-Lora"
)

image1 = 'https://huggingface.co/X-GenGroup/PaCo-Reward-7B/resolve/main/images/image_1.jpg'
image2 = 'https://huggingface.co/X-GenGroup/PaCo-Reward-7B/resolve/main/images/image_2.jpg'

main_prompt = 'Generate multiple images portraying a medical scene of a dentist in scrubs. The images should include activities such as explaining oral hygiene to a patient, taking X-rays of teeth, cleaning teeth in a dental office, and filling a cavity during an appointment. The setting should depict a realistic dental clinic.'
text_prompt = (
    f"Given two subfigures generated based on the theme: \"{main_prompt}\", "
    f"do the two images maintain consistency in terms of style, logic and identity? "
    f"Answer \"Yes\" and \"No\" first, and then provide detailed reasons."
)

# Example: Compare whether two images are visually consistent
messages_1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image1},
            {"type": "image", "image": image2},
            {"type": "text", "text": text_prompt},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages_1, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages_1)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Calculate consistency score
# Get logits for first token
with torch.no_grad():
    outputs = model(**inputs)
    first_token_logits = outputs.logits[0, -1, :]  # Last position of prompt

# Get token IDs for "Yes" and "No"
yes_id = processor.tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = processor.tokenizer.encode("No", add_special_tokens=False)[0]

# Calculate probability: softmax over the two candidate tokens (numerically stable)
yes_prob = torch.softmax(
    torch.stack([first_token_logits[yes_id], first_token_logits[no_id]]), dim=0
)[0]

# Note: scores from the merged PaCo-Reward-7B and this base + LoRA setup may differ slightly due to numerical precision
print(f"Consistency Score (Yes Conditional Probability): {yes_prob.item():.4f}")

# Inference: Generate detailed reasons
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
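
The snippet above scores a single image pair. For a set of generated images, one simple (hypothetical, not prescribed by the repository) way to obtain a set-level consistency score is to average the Yes-probability over all unordered pairs:

```python
from itertools import combinations

def set_consistency_score(images, score_pair):
    """Average the pairwise Yes-probability over all unordered pairs.
    `score_pair(img_a, img_b)` is any callable returning the consistency
    score computed as in the snippet above."""
    pairs = list(combinations(images, 2))
    if not pairs:
        return 1.0  # a single image is trivially self-consistent
    return sum(score_pair(a, b) for a, b in pairs) / len(pairs)
```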


## Model Zoo

The PaCo-RL framework includes several models available on Hugging Face:

| Model | Type | HuggingFace |
|-------|------|-------------|
| **PaCo-Reward-7B** | Reward Model | [🤗 Link](https://huggingface.co/X-GenGroup/PaCo-Reward-7B) |
| **PaCo-Reward-7B-Lora** | Reward Model (LoRA) | [🤗 Link](https://huggingface.co/X-GenGroup/PaCo-Reward-7B-Lora) |
| **PaCo-FLUX.1-dev** | T2I Model (LoRA) | [🤗 Link](https://huggingface.co/X-GenGroup/PaCo-FLUX.1-dev-Lora) |
| **PaCo-FLUX.1-Kontext-dev** | Image Editing Model (LoRA) | [🤗 Link](https://huggingface.co/X-GenGroup/PaCo-FLUX.1-Kontext-Lora) |
| **PaCo-QwenImage-Edit** | Image Editing Model (LoRA) | [🤗 Link](https://huggingface.co/X-GenGroup/PaCo-Qwen-Image-Edit-Lora) |
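
The reward models are loaded as shown in the example above. For the generation and editing LoRAs, a typical diffusers-style loading pattern might look like the sketch below; this is an assumption based on the standard diffusers LoRA workflow rather than the repository's documented procedure, so check the GitHub repository for the supported path.

```python
import torch
from diffusers import FluxPipeline

# Assumed loading pattern for the text-to-image LoRA; see the official repo
# for the supported workflow and any custom inference code.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("X-GenGroup/PaCo-FLUX.1-dev-Lora")
pipe.to("cuda")

image = pipe(
    "A dentist in scrubs explaining oral hygiene to a patient in a realistic dental clinic",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("paco_flux_sample.png")
```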

## Acknowledgement

Our work is built upon [Flow-GRPO](https://github.com/yifan123/flow_grpo), [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [vLLM](https://github.com/vllm-project/vllm), and [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL). We sincerely thank the authors for their valuable contributions to the community.

## Citation

```bibtex
@misc{ping2025pacorladvancingreinforcementlearning,
      title={PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling}, 
      author={Bowen Ping and Chengyou Jia and Minnan Luo and Changliang Xia and Xin Shen and Zhuohang Dang and Hangwei Qian},
      year={2025},
      eprint={2512.04784},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.04784}, 
}
```

<div align="center">
  <sub>⭐ Star us on GitHub if you find PaCo-RL helpful!</sub>
</div>