📘 Legal Pegasus – BillSum Fine-Tuned

Fine-tuned version of NSI's Legal Pegasus for abstractive summarization of legal and legislative documents.

This model fine-tunes nsi319/legal-pegasus, a Pegasus model pretrained on legal text, on the BillSum dataset plus additional cleaned summaries to generate concise, context-aware, and structured legal summaries.
Relative to the base model, it improves coherence, handling of domain terminology, and section-wise reasoning in long-form legal and policy text.


🧠 Base Model

This model builds on:

👉 nsi319/legal-pegasus
Pretrained on large-scale legal corpora, including:

  • Statutes
  • Case law
  • Legislative documents
  • Regulatory material

This provides strong legal-domain grounding before fine-tuning.


📚 Fine-Tuning Dataset

  • BillSum (US Congressional + California bills; a loading sketch follows this list)
  • Additional cleaned legal-style summaries
  • Documents range from 2k to 12k+ tokens
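
For reference, the public BillSum release on the Hugging Face Hub can be loaded with the datasets library. This is a minimal sketch covering only BillSum itself; the additional cleaned summaries are not part of the public dataset.

from datasets import load_dataset

# BillSum ships three splits: train/test (US Congressional bills)
# and ca_test (California bills)
billsum = load_dataset("billsum")

example = billsum["train"][0]
print(example["title"])
print(example["text"][:500])     # source bill text
print(example["summary"][:300])  # reference summary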

βš™οΈ Training Configuration

Setting                  Value
----------------------   --------------------
Base model               nsi319/legal-pegasus
Epochs                   8
Learning rate            2e-5
Optimizer                AdamW
Weight decay             0.01
Batch size               1
Gradient accumulation    4 steps
Max input length         1024 tokens
Max summary length       256 tokens
FP16                     Yes
Warmup                   500 steps
Logging interval         50 steps

Training was performed on a single Kaggle P100 GPU (16 GB).
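
The table above maps onto the standard transformers training API roughly as follows. This is a hedged sketch, not the exact training script: AdamW is the Trainer default, the output path is hypothetical, and the max input/summary lengths are applied during tokenization.

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "nsi319/legal-pegasus"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

args = Seq2SeqTrainingArguments(
    output_dir="legal-pegasus-billsum",  # hypothetical output path
    num_train_epochs=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4
    warmup_steps=500,
    logging_steps=50,
    fp16=True,
    predict_with_generate=True,
)

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=args,
#     train_dataset=tokenized_train,  # inputs truncated to 1024, labels to 256 tokens
#     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
# trainer.train()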


🧪 Evaluation Metrics

ROUGE Scores (Test Set)

Metric    F1
-------   -------
ROUGE-1   ~0.5554
ROUGE-2   ~0.3531
ROUGE-L   ~0.4178

BERTScore (Semantic Similarity)

Metric      Score
---------   ------
Precision   0.8841
Recall      0.8943
F1          0.8864

BERTScore is emphasized since legal summarization requires semantic preservation rather than lexical overlap.
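
Scores of this kind can be reproduced with the evaluate library. A minimal sketch, assuming predictions and references are parallel lists of generated and gold test-set summaries:

import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["generated summary ..."]  # model outputs on the test set
references = ["gold summary ..."]        # BillSum reference summaries

rouge_scores = rouge.compute(predictions=predictions, references=references)
print(rouge_scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}

bert = bertscore.compute(predictions=predictions, references=references, lang="en")
print(sum(bert["f1"]) / len(bert["f1"]))  # mean BERTScore F1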


πŸ—οΈ Long-Document Summarization Strategy

Pegasus accepts inputs of ~1024 tokens, so longer legal documents (3k–30k tokens) were handled using:

  • Sentence/paragraph splitting
  • Token-based chunking
  • Sliding-window segmentation
  • Chunk-wise summarization
  • Second-pass "summary-of-summaries" rewriting

This enables effective summarization far beyond the backbone's context limit; a sketch of the approach is shown below.
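
A minimal sketch of the chunk-then-rewrite stage. summarize_long is an illustrative helper, not part of this repository; it uses fixed-size token windows with overlap and a second summary-of-summaries pass.

def summarize_long(text, model, tokenizer, window=1024, overlap=128):
    """Summarize each overlapping token window, then summarize the summaries."""
    ids = tokenizer(text, truncation=False)["input_ids"]
    stride = window - overlap
    windows = [ids[i:i + window] for i in range(0, len(ids), stride)]

    def summarize(chunk_text):
        enc = tokenizer(chunk_text, return_tensors="pt",
                        truncation=True, max_length=1024)
        out = model.generate(enc["input_ids"], num_beams=5,
                             max_length=256, early_stopping=True)
        return tokenizer.decode(out[0], skip_special_tokens=True)

    # First pass: one partial summary per window
    partial = [summarize(tokenizer.decode(w, skip_special_tokens=True))
               for w in windows]

    # Second pass: rewrite the concatenated partial summaries into one summary
    return summarize(" ".join(partial))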


📌 Intended Use

This model is intended for:

  • Legal document summarization
  • Bill/policy analysis
  • Legislative NLP pipelines
  • AI assistants for law students
  • Preprocessing for downstream legal reasoning tasks

⚠️ Limitations

  • English only
  • Long documents require external chunking (see the strategy above)
  • May over-simplify dense legal definitions
  • Not suitable for legal citation lookup or case-law cross-referencing
  • Not intended for production-grade legal decision-making

🔧 Usage Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Anurag33Gaikwad/legal-pegasus-billsum-summarization"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = """Your long legal or legislative text here…"""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=5,
    max_length=256,
    early_stopping=True
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
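
Equivalently, the high-level pipeline API handles tokenization and decoding internally; extra keyword arguments are forwarded to generate:

from transformers import pipeline

summarizer = pipeline("summarization", model=model_name)
result = summarizer(text, max_length=256, num_beams=5, truncation=True)
print(result[0]["summary_text"])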