πŸ“˜ Legal LED – Long-Document BillSum Summarizer

A fine-tuned version of nsi319’s Legal LED for summarizing long legal and legislative documents.

This model fine-tunes nsi319/legal-led-base-16384, an LED (Longformer Encoder-Decoder) model pretrained on legal text with a 16,384-token context window.
Legal LED is adapted to legal corpora such as case law, statutes, regulatory materials, and legislative documents, making it more reliable than the vanilla LED for legal NLP.

This fine-tuned version is optimized for summarizing long, complex legal texts such as US bills, policy documents, and multi-section legislation.


🧠 Base Model

This model extends:

πŸ‘‰ nsi319/legal-led-base-16384

The base model has:

  • Longformer sparse attention for 16,384-token sequences
  • Legal-domain pretraining on:
    • court judgments
    • legislation
    • legal commentary
    • regulatory filings
  • Strong domain adaptation prior to fine-tuning

This gives LED excellent performance on structured legal documents.


πŸ“š Fine-Tuning Dataset

  • BillSum (US Congress + California bills)
  • Additional cleaned legal-style summaries
  • Documents ranged from 3k to 30k tokens
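
BillSum is available on the Hugging Face Hub, so the bill/summary pairs can be inspected directly. A minimal loading sketch (the additional cleaned legal-style summaries are not part of the public dataset):

from datasets import load_dataset

# BillSum ships with US Congressional bill splits (train/test)
# plus a California bill split (ca_test).
dataset = load_dataset("billsum")

sample = dataset["train"][0]
print(sample["text"][:300])     # full bill text
print(sample["summary"][:300])  # human-written reference summary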

βš™οΈ Training Configuration

Setting                 Value
----------------------  -------------------------------
Base model              nsi319/legal-led-base-16384
Epochs                  6
Batch size              2
Gradient accumulation   2
Learning rate           1e-5
Optimizer               AdamW
Weight decay            0.01
FP16                    Yes
Warmup steps            500
Max input length        4096 tokens
Max output length       512 tokens
Attention               Global attention on first token
Scheduler               Linear

Training was performed on an NVIDIA P100 (16 GB VRAM) via Kaggle.
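
The exact training script is not published; a hedged sketch of how the table above maps onto Hugging Face Seq2SeqTrainingArguments (output_dir is a placeholder):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="legal-led-billsum",   # placeholder path
    num_train_epochs=6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,               # AdamW is the Trainer default
    weight_decay=0.01,
    fp16=True,
    warmup_steps=500,
    lr_scheduler_type="linear",
    predict_with_generate=True,       # generate summaries during eval
)

# The 4096-token input and 512-token output limits are applied during
# tokenization and generation rather than in these arguments.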


πŸ§ͺ Evaluation Metrics

Training Progress

Epoch   Training Loss   Validation Loss
-----   -------------   ---------------
1       1.39            1.33
2       1.17            1.26
3       1.19            1.23
4       1.15            1.18
5       1.03            1.16
6       1.02            1.16

ROUGE (document-level test set)

Metric    F1
-------   ------
ROUGE-1   0.5179
ROUGE-2   0.3432
ROUGE-L   0.4067

BERTScore

Metric      Score
---------   ------
Precision   0.9015
Recall      0.8868
F1          0.8936
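
A sketch of how such scores can be computed with the evaluate library (the exact evaluation script is not published; the strings below are toy placeholders):

import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["This bill amends the act to extend funding through 2025."]
references = ["The bill extends funding under the act through fiscal year 2025."]

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))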

πŸ—οΈ Long-Document Summarization Strategy

Legal LED supports long contexts (~16k tokens), but many legal bills exceed that.
To summarize documents up to 30k tokens, this pipeline was used:

  • Length-adaptive chunking
  • Paragraph grouping
  • Sliding-window segmentation
  • Chunk-wise LED summarization
  • Top-K reranking using BERTScore
  • Final second-pass LED rewriting

This improves semantic cohesion and section preservation.
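
A simplified sketch of this pipeline, assuming illustrative chunk sizes (paragraph grouping and the BERTScore reranking step are omitted for brevity):

from transformers import AutoTokenizer, LEDForConditionalGeneration
import torch

model_name = "Anurag33Gaikwad/legal-led-billsum-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

def summarize(text, max_input=4096, max_output=512):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_input)
    # LED expects global attention on at least the first token.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        global_attention_mask=global_attention_mask,
        num_beams=5,
        max_length=max_output,
    )
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def sliding_chunks(text, chunk_tokens=3500, overlap=200):
    # Sliding-window segmentation over the tokenized document.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = chunk_tokens - overlap
    return [tokenizer.decode(ids[i:i + chunk_tokens]) for i in range(0, len(ids), step)]

document = "..."  # a bill longer than the 4096-token input window
chunk_summaries = [summarize(chunk) for chunk in sliding_chunks(document)]
final_summary = summarize(" ".join(chunk_summaries))  # second-pass LED rewrite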


πŸ“Œ Intended Use

Ideal for:

  • Legislative document summarization
  • Legal policy analysis
  • Long-form legal NLP applications
  • AI assistants for lawyers or students
  • Preprocessing for legal research systems

⚠️ Limitations

  • English only
  • Requires chunking for documents >16k tokens
  • May simplify definitions too aggressively
  • Not suitable for citation extraction or case-law reasoning
  • Not intended for legal decision-making

πŸ”§ Usage Example

from transformers import AutoTokenizer, LEDForConditionalGeneration
import torch

model_name = "Anurag33Gaikwad/legal-led-billsum-summarization"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

text = """Your long legal or legislative document here..."""

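# Truncate to 4096 tokens, matching the fine-tuning input length.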
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=4096,
)

# LED requires global attention on the first token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    num_beams=5,
    max_length=512,
    early_stopping=True,
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))