1. Introduction

RoleRM is a fine-tuned evaluator for role-playing tasks. Its training data is available as the Annotated Role-playing Evaluation Dataset, and further details can be found in Crab.

2. Six Aspect-specific Metrics

  • Language Fluency: Pertains to the natural and fluent communication style, independent of grammatical strictness or contextual background.

  • Language Relevance: Focuses on the ability to stay on topic and respond appropriately, essentially testing the capacity to follow instructions.

  • Role Language: Evaluates whether the text reflects the vocabulary and tone specific to roles, including appropriate actions.

  • Role Knowledge: Involves a deep understanding of both general knowledge and information specific to the roles, ensuring accurate and informed role portrayal.

  • Emotional Expression: Reviews the suitability of emotions, emotional intelligence, and empathy expressed in context with the role's traits.

  • Interactive Engagement: Measures the text's ability to draw the user in, encouraging ongoing interaction and contributing dynamically to the dialogue.
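
RoleRM is prompted to return one integer score per aspect on a 0-3 scale (the criteria are embedded in the evaluation prompt shown in Section 4). Below is a minimal sketch of how this rubric might be represented in code; the identifiers ASPECTS, SCORE_LEVELS, and describe_scores are illustrative and not part of the released model.

# A sketch of the six-aspect rubric as plain data. Aspect names follow this
# section; score levels follow the 0-3 criteria in the evaluation prompt.
ASPECTS = [
    "Language Fluency",
    "Language Relevance",
    "Role Language",
    "Role Knowledge",
    "Emotional Expression",
    "Interactive Engagement",
]

SCORE_LEVELS = {
    0: "Negative, poor performance, long-winded",
    1: "Does not reflect the indicator or does not quite meet the standards",
    2: "More in line with standards but still has some defects",
    3: "Perfectly meets the criteria",
}

def describe_scores(scores):
    """Pair six 0-3 scores with their aspect names and level descriptions."""
    assert len(scores) == len(ASPECTS)
    return {aspect: (s, SCORE_LEVELS[s]) for aspect, s in zip(ASPECTS, scores)}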

3. Performance

Figure 1: Spearman and Pearson correlations with human evaluations for the proposed RoleRM, ChatGPT, PairEval, G-Eval, and GPTScore. Scores are averaged over all aspects before computing the correlations.

Figure 2: Comparison between RoleRM and ChatGPT. We report the mean absolute error (MAE) between human annotations and each evaluator's scores.
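
The correlations and MAE in Figures 1 and 2 can be reproduced along the following lines, assuming per-sample scores (averaged over the six aspects) from human annotators and from an evaluator are available; the arrays below are illustrative placeholders, not the actual evaluation data.

# Sketch: Spearman/Pearson correlation with human scores and MAE against
# human annotations. The score arrays are illustrative placeholders.
import numpy as np
from scipy.stats import pearsonr, spearmanr

human = np.array([2.5, 1.0, 3.0, 2.0, 1.5])   # human scores, averaged over aspects
rolerm = np.array([2.3, 1.2, 2.8, 2.1, 1.4])  # evaluator scores, averaged over aspects

spearman_rho, _ = spearmanr(human, rolerm)
pearson_r, _ = pearsonr(human, rolerm)
mae = np.mean(np.abs(human - rolerm))

print(f"Spearman: {spearman_rho:.3f}  Pearson: {pearson_r:.3f}  MAE: {mae:.3f}")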

4. Usage

 


from transformers import AutoTokenizer, AutoModelForCausalLM

bot_name = "Hermione"
bot_personality = "Intelligent, curious, respectful, and eager to learn"
bot_description = "Hermione and Hagrid were in the Forbidden Forest, walking on a narrow path surrounded by trees. Hermione looked around carefully, fascinated by the dense forest. Hagrid was leading the way, pointing out various creatures and telling her about their habits and characteristics."
cp = "None"

user_name = "Hagrid"
user_description = "Hagrid is the Care of Magical Creatures teacher at Hogwarts. He is a half-giant with a great love for all creatures, magical or not."
relation = "Teacher and student"
scene = "Hermione and Hagrid are in the Forbidden Forest, exploring and learning about the various magical creatures that live there."
 
current_Dialogue1 = "(round1) human: Now, this here is a Bowtruckle, Hermione. They're very small, only about the size of a twig, and they're very shy. They usually live in trees and are very good at camouflaging themselves. You have to be very careful when handling them because they have very sharp fingers. Hermione, do you like them?  \nbot: (Hermione shook her head) No, not really. I'm sorry, Hagrid. I don't mean to offend you." 
current_Dialogue2 = "(round2) human: (Hagrid looked slightly disappointed but continued) That's alright, Hermione. Everyone has different tastes. Let's move on. This here is an Acromantula. They're giant spiders, Hermione. Very dangerous if you get too close. They can grow up to ten feet in diameter and have eight sharp legs. \nbot: (Hermione shuddered at the sight) You're brave, Hagrid. I couldn't get anywhere near that thing."

def build_inputs_prompt(current_Dialogue):
    inputs_prompt = f""" 
    # Role: Dialogue Quality Evaluation Expert
    ## Goal: You need to score the utterance of the bot in the current dialogue based on the following 6 targets:
    1. Language Fluency: This score evaluates the fluency and naturalness of the language, making the text feel organic, lifelike,
    and not rigid or stilted. The focus here is solely on the overall smoothness and flow of the language, without considering the
    specific content. The goal is to evaluate how natural and conversational the language sounds, irrespective of the grammatical
    correctness. However, the bot is allowed to be syntactically incoherent when engaging in everyday colloquialisms or
    expressing emotions such as excitement and nervousness.
    2. Language Relevance: This score evaluates how well the bot responds to the current topic, staying focused and relevant
    without introducing irrelevant information. The key consideration is whether the bot’s response correctly addresses the
    specific instructions or questions posed, regardless of the content or quality of the response itself. For example, if the
    answer of the bot is irrelevant to the topic of the current conversation, or if the answer is too long-winded, it should be
    given a low score.
    3. Role Language: This score evaluates how well the language used by the bot in the dialogue matches their established
    personality and traits. The focus is on whether the bot speaks in a style consistent with their individual personalities, creating
    a natural and authentic conversation. This rating considers only the overall language style, not the content or accuracy of the
    responses. For example, if the bot exhibits everyday colloquial expressions that fit the style of the character, it should be
    given a high score; if the bot uses formal language in everyday conversations, it should be given a low score.
    4. Role Knowledge: This score evaluates the bot’s understanding and use of common sense (basic knowledge) and role
    knowledge (as well as related background). If the bot speaks against what they are supposed to know, they should be
    scored low.
    5. Emotional Expression: This score evaluates how well the bot’s emotional responses, including expressions of empathy
    and emotional intelligence, align with their established personality and the context of the dialogue. If the bot’s emotional
    responses (actions or expressions) are inappropriate/stiff or out of character, it should be given a low score.
    6. Interactive Engagement: This score evaluates how engaging and motivating the bot’s dialogue is, encouraging the user to
    continue the conversation. The focus is on the overall conversational flow and interactivity, without considering the use of
    specialized vocabulary or any mismatches in communication styles. If the bot ends the dialogue with a question, it should
    receive a high score.

    The scoring criteria for the above six targets are as follows:
    0 - Negative, poor performance, long-winded
    1 - Dialogue does not reflect the indicator or does not quite meet the standards
    2 - More in line with standards but still has some defects
    3 - Perfectly meets the criteria

    ## The information of the bot is as follows: bot’s name: {bot_name}
    bot personality: {bot_personality}
    bot description: {bot_description}  
    Reference speaking style: {cp}

    ## Current scenario Interlocutor: {user_name}, {user_description} Relationship with bot: {relation} Scene: {scene}
    ## The historical dialogue is as follows:
    history
    Please score the above six targets (with a range of 0-3, separated by spaces) in response to bot.name (i.e. bot)’s utterance in
    the current dialogue.
    ## Current Dialogue: {current_Dialogue}
    """

    return inputs_prompt


path = "HeAAAAA/RoleRM"

model = AutoModelForCausalLM.from_pretrained(path).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(path)


inputs_prompt1 = build_inputs_prompt(current_Dialogue1) 
inputs_prompt2 = build_inputs_prompt(current_Dialogue2) 

inputs = tokenizer(inputs_prompt1, return_tensors="pt").to("cuda")  # move inputs to the model's device
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id  
)

new_tokens = outputs[0][inputs['input_ids'].shape[1]:]  
new_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

print(new_text)

inputs = tokenizer(inputs_prompt2, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id  
)

new_tokens = outputs[0][inputs['input_ids'].shape[1]:]  
new_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

print(new_text)
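
Since the prompt asks for the six scores as space-separated integers between 0 and 3, the generated text can be parsed and averaged as sketched below (averaging over aspects matches the aggregation used in the Performance section). The exact output format may vary, so this helper is illustrative only.

# Sketch: extract the six 0-3 scores from the generated text and average them.
import re

scores = [int(s) for s in re.findall(r"\b[0-3]\b", new_text)[:6]]
if len(scores) == 6:
    print("Per-aspect scores:", scores)
    print("Average score:", sum(scores) / len(scores))
else:
    print("Could not parse six scores from:", new_text)

The describe_scores helper sketched in Section 2 can then attach aspect names and level descriptions to these values.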

  

5. Citation

@inproceedings{he2025Crab,
  title={Crab: A Novel Configurable Role-Playing LLM with Assessing Benchmark},
  author={He, Kai and Huang, Yucheng and Wang, Wenqing and Ran, Delong and Sheng, Dongming and Huang, Junxuan and Lin, Qika and Xu, Jiaxing and Liu, Wenqiang and Feng, Mengling},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year={2025}
}