Vietnamese Legal Reasoning Model - GRPO Phase 2 (Hard Difficulty)

🏛️ Model Description

This is a Phase 2 Vietnamese legal reasoning model, fine-tuned with Group Relative Policy Optimization (GRPO) on hard-difficulty Vietnamese legal question-answering data. It continues from the Phase 1 checkpoint and targets more complex syllogistic reasoning in challenging Vietnamese legal scenarios.

🎯 Base Model

Base: thangvip/qwen3-4b-vietnamese-legal-grpo
Architecture: Qwen 3 (4B parameters)
Language: Vietnamese
Specialization: Advanced legal reasoning and syllogism (Hard difficulty)
Training Phase: Phase 2 - Hard Level QA

🔥 Key Features

✅ Phase 2 Training: Continues from the Phase 1 model on hard-difficulty legal questions
✅ Syllogistic Reasoning: Structured legal arguments (Major Premise → Minor Premise → Conclusion)
✅ Vietnamese Legal Domain: Trained on Vietnamese legal texts and Q&A
✅ GRPO Optimization: Policy optimization against a multi-component reward to improve reasoning
✅ Citation Support: Generates responses with legal citations
✅ Structured Output: Uses XML-like tags for organized responses
✅ Extended Completion Length: Trained with completions of up to 8192 tokens for complex reasoning chains

📊 Model Architecture

Parameters: ~4B
Vocabulary Size: 151936
Hidden Size: 2560
Layers: 36
Attention Heads: 32
Max Completion Length: 8192 tokens (extended for complex reasoning)

🚀 Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "thangvip/qwen3-4b-vietnamese-legal-grpo-phase-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Format your legal question
system_prompt = """Bạn là một chuyên gia pháp lý. Hãy trả lời câu hỏi bằng cách sử dụng phương pháp lập luận tam đoạn luận (syllogism).

Trước tiên, hãy suy nghĩ về vấn đề trong thẻ <think></think>.

Sau đó, trả lời theo định dạng sau:
<answer>
<major_premise>[Quy định pháp luật chung]</major_premise>
<minor_premise>[Sự kiện cụ thể trong câu hỏi]</minor_premise>
<conclusion>[Áp dụng quy định vào sự kiện để đưa ra kết luận]</conclusion>
</answer>

Hãy đảm bảo trích dẫn chính xác các điều luật liên quan."""

question = "Một công ty có nghĩa vụ gì khi sa thải nhân viên do tái cơ cấu?"

# Create conversation
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question}
]

# Generate response
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,  # Raise (up to the 8192-token training limit) for long reasoning chains
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

Pipeline Usage

import torch
from transformers import pipeline

# Create text generation pipeline
generator = pipeline(
    "text-generation",
    model="thangvip/qwen3-4b-vietnamese-legal-grpo-phase-2",
    tokenizer="thangvip/qwen3-4b-vietnamese-legal-grpo-phase-2",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Generate legal reasoning
prompt = "Câu hỏi: Quyền và nghĩa vụ của người thuê nhà khi hợp đồng thuê hết hạn?"
result = generator(prompt, max_new_tokens=512, temperature=0.7)
print(result[0]['generated_text'])

🎯 Training Details

Training Procedure

Method: Group Relative Policy Optimization (GRPO)
Base Model: thangvip/qwen3-4b-vietnamese-legal-grpo
Training Phase: Phase 2 - Hard Difficulty
Training Steps: ~1500 (exact count not recorded)
Learning Rate: N/A
Batch Size: N/A
Max Completion Length: 8192 tokens (doubled for complex reasoning)
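GRPO scores each sampled completion relative to the other completions generated for the same prompt, rather than against a learned value function. The sketch below illustrates that group-relative normalization in plain Python. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; this card does not state the exact normalization used in training.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Illustrative only: eps and the population standard deviation are
    assumptions, not settings taken from this model's training run.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # spread of rewards inside the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled for one legal question.
advantages = group_relative_advantages([0.9, 0.5, 0.5, 0.1])
print(advantages)
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below the mean receive a negative one. When every completion in a group earns the same reward, all advantages are zero and the group contributes no learning signal.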

Training Data

Domain: Vietnamese legal question-answering (Hard difficulty)
Format: Syllogistic reasoning pairs
Structure: Question → Structured legal reasoning response
Difficulty Level: Hard - Complex multi-step legal reasoning scenarios
Dataset: hard_level_qa.jsonl

Two-Phase Training Approach

Phase 1: Initial GRPO training on normal difficulty questions
- Base model: thangvip/qwen3-4b-legal-pretrain-synthetic-8k or similar
- Dataset: normal_level_qa.jsonl
- Max completion: 4096 tokens
Phase 2 (This Model): Continued training on hard difficulty questions with extended context window
- Base model: Phase 1 trained model (thangvip/qwen3-4b-vietnamese-legal-grpo)
- Dataset: hard_level_qa.jsonl
- Max completion: 8192 tokens (doubled for complex reasoning)
- Training steps: ~1500 steps

This progressive training approach allows the model to first master basic legal reasoning before tackling more complex, multi-step legal problems.

Reward System

The GRPO reward is a weighted combination of six components:

Correctness (35%): Factual accuracy against reference answers
Format Compliance (20%): Proper use of syllogistic structure
Citation Accuracy (15%): Relevant and accurate legal citations
Reasoning Quality (15%): Quality of legal reasoning process
Hallucination Penalty (10%): Penalty for unsupported claims
Length Penalty (5%): Penalty for exceeding maximum token length
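The weights above sum to 100%, so one way to read them is as a weighted sum over per-component scores. The following is a minimal sketch under that reading, not the training code: it assumes every component is reported as a score in [0, 1], with the two penalty components expressed so that 1.0 means "no penalty". The dictionary keys and the `combined_reward` function are hypothetical names.

```python
# Weights copied from the reward list; the key names are hypothetical.
REWARD_WEIGHTS = {
    "correctness": 0.35,
    "format_compliance": 0.20,
    "citation_accuracy": 0.15,
    "reasoning_quality": 0.15,
    "hallucination": 0.10,  # assumed: 1.0 means no unsupported claims
    "length": 0.05,         # assumed: 1.0 means within the token limit
}


def combined_reward(scores):
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    missing = set(REWARD_WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(REWARD_WEIGHTS[name] * scores[name] for name in REWARD_WEIGHTS)


# Hypothetical completion: correct and well formatted, but with weaker
# citations and reasoning.
example_scores = {
    "correctness": 1.0,
    "format_compliance": 1.0,
    "citation_accuracy": 0.5,
    "reasoning_quality": 0.5,
    "hallucination": 1.0,
    "length": 1.0,
}
print(combined_reward(example_scores))
```

How each component score is actually computed (for example, how correctness is judged against the reference answer) is not described in this card, so those scorers are left out of the sketch.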

📝 Expected Output Format

The model generates structured responses in this format:

<think>
[Internal reasoning about the legal question]
</think>

<answer>
<major_premise>
[General legal rule or principle applicable to the situation]
</major_premise>

<minor_premise>
[Specific facts from the question that relate to the legal rule]
</minor_premise>

<conclusion>
[Legal conclusion that follows logically from applying the rule to the facts]
</conclusion>
</answer>
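Because the output follows a fixed tag layout, downstream code can split a response into its parts with a few regular expressions. The helper below is a minimal sketch: `parse_response` and `_extract` are hypothetical names, and the function returns `None` whenever an expected tag is missing, so malformed generations can be detected and retried or discarded.

```python
import re


def _extract(tag, text):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(text):
    """Split a response into think / major_premise / minor_premise / conclusion.

    Returns None if the <think> block, the <answer> block, or any of the
    three syllogism tags inside <answer> is missing.
    """
    think = _extract("think", text)
    answer = _extract("answer", text)
    if think is None or answer is None:
        return None
    parsed = {"think": think}
    for tag in ("major_premise", "minor_premise", "conclusion"):
        content = _extract(tag, answer)
        if content is None:
            return None
        parsed[tag] = content
    return parsed


# Hypothetical response text used only to exercise the parser.
sample = """<think>Identify the governing rule.</think>

<answer>
<major_premise>General rule</major_premise>
<minor_premise>Specific facts</minor_premise>
<conclusion>Rule applied to facts</conclusion>
</answer>"""
print(parse_response(sample))
```

If the decoded text in your setup does not include the `<think>` block, the check on `think` can be relaxed while the syllogism tags stay strict.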

🎯 Use Cases

Complex Legal Education: Teaching advanced legal reasoning methodology for difficult cases
Advanced Legal Research: Preliminary analysis of complex legal questions
Multi-step Legal Analysis: Structured legal argument generation for intricate scenarios
Legal Consultation: Initial legal guidance for challenging cases (with human review)
Legal Training: Demonstrating proper syllogistic reasoning for complex legal problems

⚠️ Limitations

Domain Specific: Optimized for Vietnamese legal context
Educational Purpose: Should not replace professional legal advice
Fact Checking Required: Always verify legal citations and conclusions
Extended Generation: May produce lengthy responses (up to 8192 tokens) for complex questions
Phase 2 Training: Built on the Phase 1 model, so it inherits that model's capabilities and limitations
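Since citations must be verified by hand, it can help to list every article a response cites before checking each one against the statute text. The sketch below collects article numbers written in the form "Điều <number>". The `cited_articles` function and the pattern are illustrative assumptions; they do not cover clauses ("Khoản"), points ("Điểm"), or the name of the cited law, and they do not check that a citation is correct.

```python
import re
import unicodedata

# Matches "Điều <number>". The pattern is NFC-normalized so it compares
# reliably against NFC-normalized input text.
ARTICLE_PATTERN = re.compile(
    unicodedata.normalize("NFC", r"Điều\s+(\d+)"), flags=re.IGNORECASE
)


def cited_articles(text):
    """Return distinct cited article numbers in order of first appearance."""
    normalized = unicodedata.normalize("NFC", text)
    seen = []
    for number in ARTICLE_PATTERN.findall(normalized):
        if number not in seen:
            seen.append(number)
    return seen


# Hypothetical conclusion text, used only to exercise the helper.
sample_text = "Theo Điều 42 và Điều 47 Bộ luật Lao động 2019, căn cứ điều 42 nêu trên."
print(cited_articles(sample_text))
```

The resulting list is only a checklist for human review: each article still has to be compared with the current version of the law.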

📄 Citation

If you use this model, please cite:

@misc{vietnamese-legal-grpo-2024,
  title={Vietnamese Legal Reasoning Model with GRPO},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/thangvip/qwen3-4b-vietnamese-legal-grpo-phase-2}
}

🤝 Contributing

Contributions are welcome! Please see our contributing guidelines.

📜 License

This model is released under the Apache 2.0 License.

🙏 Acknowledgments

TRL Team: For the GRPO implementation
Qwen Team: For the excellent base model
Hugging Face: For the transformers library and model hosting

Note: This model is for educational and research purposes. Always consult qualified legal professionals for actual legal advice.

Model tree: thangvip/qwen3-4b-legal-pretrain-synthetic-8k (base model) → thangvip/qwen3-4b-vietnamese-legal-grpo (finetuned) → thangvip/qwen3-4b-vietnamese-legal-grpo-phase-2 (finetuned, this model)