# Legal Pegasus – BillSum Fine-Tuned

A fine-tuned version of NSI's Legal Pegasus for abstractive summarization of legal and legislative documents.

This model fine-tunes nsi319/legal-pegasus, a Pegasus model pretrained on legal text, on the BillSum dataset plus additional cleaned summaries to generate concise, context-aware, and structured legal summaries. It improves coherence, handling of domain terminology, and section-wise reasoning in long-form legal and policy text.
## Base Model

This model builds on [nsi319/legal-pegasus](https://huggingface.co/nsi319/legal-pegasus), which was pretrained on large-scale legal corpora including:
- Statutes
- Case law
- Legislative documents
- Regulatory material
This provides strong legal-domain grounding before fine-tuning.
## Fine-Tuning Dataset
- BillSum (US Congressional + California bills)
- Additional cleaned legal-style summaries
- Documents range from 2k to 12k+ tokens
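For reference, here is a minimal sketch of loading BillSum with the Hugging Face `datasets` library; the public dataset exposes `text`, `summary`, and `title` fields with `train` / `test` / `ca_test` splits. The additional cleaned summaries used for fine-tuning are not part of this public dataset.

```python
from datasets import load_dataset

# BillSum: US Congressional bills (train/test) plus California bills (ca_test)
billsum = load_dataset("billsum")

example = billsum["train"][0]
print(example["text"][:300])     # full bill text
print(example["summary"][:300])  # human-written reference summary
```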
## Training Configuration
| Setting | Value |
|---|---|
| Base model | nsi319/legal-pegasus |
| Epochs | 8 |
| Learning rate | 2e-5 |
| Optimizer | AdamW |
| Weight decay | 0.01 |
| Batch size | 1 |
| Gradient accumulation | 4 |
| Max input length | 1024 tokens |
| Max summary length | 256 tokens |
| FP16 | Yes |
| Warmup | 500 steps |
| Logging steps | 50 |
Training was performed on a single Kaggle P100 GPU (16 GB).
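The table maps onto roughly the following `Seq2SeqTrainingArguments`; this is a hedged reconstruction, not the exact training script. AdamW is the `Trainer` default optimizer, and the input/summary length caps are applied during tokenization and generation rather than here.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="legal-pegasus-billsum",  # placeholder path
    num_train_epochs=8,
    learning_rate=2e-5,
    weight_decay=0.01,                   # AdamW weight decay
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,       # effective batch size of 4
    warmup_steps=500,
    logging_steps=50,
    fp16=True,                           # mixed precision on the P100
    predict_with_generate=True,
)
```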
## Evaluation Metrics

### ROUGE Scores (Test Set)
| Metric | F1 |
|---|---|
| ROUGE-1 | 0.5554 |
| ROUGE-2 | 0.3531 |
| ROUGE-L | 0.4178 |
### BERTScore (Semantic Similarity)
| Metric | Score |
|---|---|
| Precision | 0.8841 |
| Recall | 0.8943 |
| F1 | 0.8864 |
BERTScore is emphasized because legal summarization requires semantic preservation rather than exact lexical overlap.
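As an illustration, both metric families can be computed with the `evaluate` library; the placeholder lists below stand in for model outputs and gold summaries from the test set.

```python
import evaluate

predictions = ["The bill amends the Act to ..."]  # model-generated summaries (placeholder)
references = ["This bill amends the Act to ..."]  # reference summaries (placeholder)

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
print(sum(bert_scores["f1"]) / len(bert_scores["f1"]))  # mean BERTScore F1
```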
## Long-Document Summarization Strategy

Pegasus supports inputs of roughly 1,024 tokens, so long legal documents (3k–30k tokens) were handled using:
- Sentence/paragraph splitting
- Token-based chunking
- Sliding-window segmentation
- Chunk-wise summarization
- Second-pass "summary-of-summaries" rewriting
This enables effective summarization well beyond the backbone's context limit.
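A minimal sketch of this chunk-then-merge pipeline is shown below, assuming a `tokenizer` and `model` loaded as in the usage example further down; the chunk size and overlap are illustrative choices, not the exact values used in training.

```python
def summarize_long(text, tokenizer, model, chunk_tokens=1024, overlap=128):
    ids = tokenizer(text, truncation=False)["input_ids"]
    step = chunk_tokens - overlap  # sliding window with token overlap

    # First pass: summarize each chunk independently.
    partial = []
    for start in range(0, len(ids), step):
        chunk_text = tokenizer.decode(
            ids[start:start + chunk_tokens], skip_special_tokens=True
        )
        inputs = tokenizer(
            chunk_text, return_tensors="pt", truncation=True, max_length=1024
        )
        out = model.generate(inputs["input_ids"], num_beams=5, max_length=256)
        partial.append(tokenizer.decode(out[0], skip_special_tokens=True))

    # Second pass: rewrite the concatenated partial summaries
    # into a single "summary of summaries".
    merged = " ".join(partial)
    inputs = tokenizer(merged, return_tensors="pt", truncation=True, max_length=1024)
    out = model.generate(inputs["input_ids"], num_beams=5, max_length=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```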
## Intended Use
This model is intended for:
- Legal document summarization
- Bill/policy analysis
- Legislative NLP pipelines
- AI assistants for law students
- Preprocessing for downstream legal reasoning tasks
## Limitations
- English only
- Long documents require external chunking
- May simplify dense legal definitions
- Not suitable for producing legal citations or case-law cross-references
- Not intended for production-grade legal decisions
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Anurag33Gaikwad/legal-pegasus-billsum-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = """Your long legal or legislative text here..."""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=5,
    max_length=256,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
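Note that the tokenizer call above truncates inputs longer than 1,024 tokens; for full-length bills, apply the chunking strategy from the long-document section first.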