Model Card for MA-RLHF

This repository contains the official checkpoint for Reinforcement Learning From Human Feedback with Macro Actions (MA-RLHF).

Model Description

MA-RLHF is a framework that integrates macro actions into conventional RLHF. A macro action is a sequence of tokens or a higher-level language construct whose boundary is set by a termination condition, such as an n-gram-based, perplexity-based, or parsing-based rule. By introducing macro actions into RLHF, we reduce the number of decision points and shorten decision trajectories, alleviating the credit assignment problem caused by long temporal distances.
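
To make the n-gram case concrete, here is a minimal sketch of a fixed-length termination rule. It assumes that the `Fixed5` suffix in the checkpoint names denotes macro actions of five tokens each. The function name `fixed_ngram_macro_actions` is hypothetical and is not taken from the MA-RLHF codebase; it only illustrates how grouping tokens reduces the number of decision points.

```python
from typing import List, Sequence


def fixed_ngram_macro_actions(token_ids: Sequence[int], n: int) -> List[List[int]]:
    """Group consecutive tokens into macro actions of at most n tokens each.

    Hypothetical helper for illustration: a macro action terminates after
    every n tokens; the last one may be shorter.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [list(token_ids[i:i + n]) for i in range(0, len(token_ids), n)]


# Stand-in for the token ids of a 12-token response (not real model output).
token_ids = list(range(12))
macro_actions = fixed_ngram_macro_actions(token_ids, n=5)

print(len(token_ids), "token-level decision points")
print(len(macro_actions), "macro-action decision points")
```

With n = 5, a 12-token response is split into three macro actions, so credit is assigned over three decisions instead of twelve.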

Model	Checkpoint	Base Model	Dataset
TLDR-Gemma-2B-MA-PPO-Fixed5	🤗 HF Link	google/gemma-2b	openai/summarize_from_feedback
TLDR-Gemma-7B-MA-PPO-Fixed5	🤗 HF Link	google/gemma-7b	openai/summarize_from_feedback
TLDR-Gemma-2-27B-MA-PPO-Fixed5	🤗 HF Link	google/gemma-2-27b	openai/summarize_from_feedback
HH-RLHF-Gemma-2B-MA-PPO-Fixed5	🤗 HF Link	google/gemma-2b	Dahoas/full-hh-rlhf
HH-RLHF-Gemma-7B-MA-PPO-Fixed5	🤗 HF Link	google/gemma-7b	Dahoas/full-hh-rlhf
APPS-Gemma-2B-MA-PPO-Fixed10	🤗 HF Link	google/codegemma-2b	codeparrot/apps
APPS-Gemma-7B-MA-PPO-Fixed10	🤗 HF Link	google/codegemma-7b-it	codeparrot/apps

Model Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "baidu/HH-RLHF-Gemma-7B-MA-PPO-Fixed5"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype='auto', trust_remote_code=True)

input_text = """
Human: Would you be able to explain the differences between the Spanish
and Italian language? Assistant: Of course. Can you tell me more about
the specific areas where you’re interested in knowing more? Human: I’m
thinking between the Spanish spoken in Mexico and Italian spoken in Italy.
Assistant: 
"""

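# The tokenizer returns a dict-like batch (input_ids and attention_mask), not just ids.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20)
# The decoded text contains the prompt followed by the generated continuation.
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)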

print(response)
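
The prompt in the example above is a flat transcript of alternating `Human:` and `Assistant:` turns that ends with an open `Assistant:` turn. A small helper can assemble such a transcript from a list of turns. This is only a sketch: the function name `build_hh_prompt` is hypothetical, and it assumes that separating turns with a single space is acceptable, whereas the exact whitespace convention used during training may differ.

```python
from typing import List, Tuple


def build_hh_prompt(turns: List[Tuple[str, str]]) -> str:
    """Join (speaker, text) turns and finish with an open Assistant turn.

    Hypothetical helper; the separator between turns is an assumption.
    """
    for speaker, _ in turns:
        if speaker not in ("Human", "Assistant"):
            raise ValueError(f"unknown speaker: {speaker!r}")
    parts = [f"{speaker}: {text.strip()}" for speaker, text in turns]
    parts.append("Assistant:")  # left open for the model to complete
    return " ".join(parts)


turns = [
    ("Human", "Would you be able to explain the differences between the Spanish and Italian language?"),
    ("Assistant", "Of course. Can you tell me more about the specific areas where you're interested in knowing more?"),
    ("Human", "I'm thinking between the Spanish spoken in Mexico and Italian spoken in Italy."),
]
prompt = build_hh_prompt(turns)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of `input_text`.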

Citation

@inproceedings{
  chai2025marlhf,
  title={{MA}-{RLHF}: Reinforcement Learning from Human Feedback with Macro Actions},
  author={Yekun Chai and Haoran Sun and Huang Fang and Shuohuan Wang and Yu Sun and Hua Wu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=WWXjMYZxfH}
}
