📘 Legal Pegasus – BillSum Fine-Tuned

Fine-tuned version of NSI's Legal Pegasus for abstractive summarization of legal and legislative documents.

This model fine-tunes nsi319/legal-pegasus, a Pegasus model pretrained on legal text, on the BillSum dataset plus additional cleaned summaries to generate concise, context-aware, and structured legal summaries.
Relative to the base model, it improves coherence, handling of domain terminology, and section-wise reasoning in long-form legal and policy text.


🧠 Base Model

This model builds on:

👉 nsi319/legal-pegasus
Pretrained on large-scale legal corpora, including:

  • Statutes
  • Case law
  • Legislative documents
  • Regulatory material

This provides strong legal-domain grounding before fine-tuning.


📚 Fine-Tuning Dataset

  • BillSum (US Congressional + California bills; a loading sketch follows this list)
  • Additional cleaned legal-style summaries
  • Documents range from 2k to 12k+ tokens
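
For reference, the public BillSum release on the Hugging Face Hub can be loaded with the datasets library. This is a minimal sketch covering only BillSum itself; the additional cleaned summaries are not part of the public dataset.

from datasets import load_dataset

# BillSum ships three splits: train/test (US Congressional bills)
# and ca_test (California bills)
billsum = load_dataset("billsum")

example = billsum["train"][0]
print(example["title"])
print(example["text"][:500])     # source bill text
print(example["summary"][:300])  # reference summary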

βš™οΈ Training Configuration

Setting                  Value
----------------------   --------------------
Base model               nsi319/legal-pegasus
Epochs                   8
Learning rate            2e-5
Optimizer                AdamW
Weight decay             0.01
Batch size               1
Gradient accumulation    4 steps
Max input length         1024 tokens
Max summary length       256 tokens
FP16                     Yes
Warmup                   500 steps
Logging interval         50 steps

Training was performed on a single Kaggle P100 GPU (16 GB).
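
The table above maps onto the standard transformers training API roughly as follows. This is a hedged sketch, not the exact training script: AdamW is the Trainer default, the output path is hypothetical, and the max input/summary lengths are applied during tokenization.

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "nsi319/legal-pegasus"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

args = Seq2SeqTrainingArguments(
    output_dir="legal-pegasus-billsum",  # hypothetical output path
    num_train_epochs=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4
    warmup_steps=500,
    logging_steps=50,
    fp16=True,
    predict_with_generate=True,
)

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=args,
#     train_dataset=tokenized_train,  # inputs truncated to 1024, labels to 256 tokens
#     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
# trainer.train()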


🧪 Evaluation Metrics

ROUGE Scores (Test Set)

Metric    F1
-------   -------
ROUGE-1   ~0.5554
ROUGE-2   ~0.3531
ROUGE-L   ~0.4178

BERTScore (Semantic Similarity)

Metric      Score
---------   ------
Precision   0.8841
Recall      0.8943
F1          0.8864

BERTScore is emphasized since legal summarization requires semantic preservation rather than lexical overlap.
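
Scores of this kind can be reproduced with the evaluate library. A minimal sketch, assuming predictions and references are parallel lists of generated and gold test-set summaries:

import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["generated summary ..."]  # model outputs on the test set
references = ["gold summary ..."]        # BillSum reference summaries

rouge_scores = rouge.compute(predictions=predictions, references=references)
print(rouge_scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}

bert = bertscore.compute(predictions=predictions, references=references, lang="en")
print(sum(bert["f1"]) / len(bert["f1"]))  # mean BERTScore F1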


πŸ—οΈ Long-Document Summarization Strategy

Pegasus accepts inputs of ~1024 tokens, so longer legal documents (3k–30k tokens) were handled using:

  • Sentence/paragraph splitting
  • Token-based chunking
  • Sliding-window segmentation
  • Chunk-wise summarization
  • Second-pass "summary-of-summaries" rewriting

This enables effective summarization far beyond the backbone's context limit; a sketch of the approach is shown below.
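
A minimal sketch of the chunk-then-rewrite stage. summarize_long is an illustrative helper, not part of this repository; it uses fixed-size token windows with overlap and a second summary-of-summaries pass.

def summarize_long(text, model, tokenizer, window=1024, overlap=128):
    """Summarize each overlapping token window, then summarize the summaries."""
    ids = tokenizer(text, truncation=False)["input_ids"]
    stride = window - overlap
    windows = [ids[i:i + window] for i in range(0, len(ids), stride)]

    def summarize(chunk_text):
        enc = tokenizer(chunk_text, return_tensors="pt",
                        truncation=True, max_length=1024)
        out = model.generate(enc["input_ids"], num_beams=5,
                             max_length=256, early_stopping=True)
        return tokenizer.decode(out[0], skip_special_tokens=True)

    # First pass: one partial summary per window
    partial = [summarize(tokenizer.decode(w, skip_special_tokens=True))
               for w in windows]

    # Second pass: rewrite the concatenated partial summaries into one summary
    return summarize(" ".join(partial))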


📌 Intended Use

This model is intended for:

  • Legal document summarization
  • Bill/policy analysis
  • Legislative NLP pipelines
  • AI assistants for law students
  • Preprocessing for downstream legal reasoning tasks

⚠️ Limitations

  • English only
  • Long documents require external chunking (see the strategy above)
  • May over-simplify dense legal definitions
  • Not suitable for legal citation lookup or case-law cross-referencing
  • Not intended for production-grade legal decision-making

🔧 Usage Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Anurag33Gaikwad/legal-pegasus-billsum-summarization"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = """Your long legal or legislative text here…"""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=5,
    max_length=256,
    early_stopping=True
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
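
Equivalently, the high-level pipeline API handles tokenization and decoding internally; extra keyword arguments are forwarded to generate:

from transformers import pipeline

summarizer = pipeline("summarization", model=model_name)
result = summarizer(text, max_length=256, num_beams=5, truncation=True)
print(result[0]["summary_text"])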