# Legal LED: Long-Document BillSum Summarizer
A fine-tuned version of nsi319's Legal LED for summarizing long legal and legislative documents.
This model fine-tunes nsi319/legal-led-base-16384, an LED (Longformer Encoder-Decoder) model pretrained on legal text with a 16,384-token context window.
Legal LED is specifically adapted to legal corpora such as case law, statutes, regulatory materials, and legislative documents, making it more reliable than vanilla LED for legal NLP.
This fine-tuned version is optimized for summarizing long, complex legal text such as US bills, policy documents, and multi-section legislative documents.
## Base Model
This model extends [nsi319/legal-led-base-16384](https://huggingface.co/nsi319/legal-led-base-16384).
The base model has:
- Longformer sparse attention for sequences up to 16,384 tokens
- Legal-domain pretraining on:
  - court judgments
  - legislation
  - legal commentary
  - regulatory filings
- Strong domain adaptation prior to fine-tuning
This gives the model strong performance on structured legal documents.
## Fine-Tuning Dataset
- BillSum (US Congress + California bills)
- Additional cleaned legal-style summaries
- Documents ranged from 3k to 30k tokens
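For reference, the public BillSum portion can be loaded from the Hugging Face Hub. A minimal sketch, assuming the standard `billsum` dataset id; the additional cleaned legal-style summaries mentioned above are not part of this public dataset:

```python
# Sketch of loading BillSum from the Hugging Face Hub (dataset id assumed
# to be "billsum"); the extra cleaned summaries are not included here.
from datasets import load_dataset

billsum = load_dataset("billsum")                    # US Congress bills: train/test splits
ca_bills = load_dataset("billsum", split="ca_test")  # California bills

example = billsum["train"][0]
print(example["title"])
print(example["text"][:300])     # full bill text
print(example["summary"][:300])  # human-written reference summary
```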
## Training Configuration
| Setting | Value |
|---|---|
| Base model | nsi319/legal-led-base-16384 |
| Epochs | 6 total |
| Batch size | 2 |
| Gradient accumulation | 2 |
| Learning rate | 1e-5 |
| Optimizer | AdamW |
| Weight decay | 0.01 |
| FP16 | Yes |
| Warmup steps | 500 |
| Max input length | 4096 tokens |
| Max output length | 512 tokens |
| Attention | Global attention on first token |
| Scheduler | Linear |
Training was performed on an NVIDIA P100 (16 GB VRAM) via Kaggle.
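For readers who want to reproduce a similar run, the table above maps naturally onto Hugging Face `Seq2SeqTrainingArguments`. A minimal sketch under that assumption; `output_dir` and the generation settings are illustrative, not the exact original script:

```python
# Minimal sketch mapping the table above onto Seq2SeqTrainingArguments.
# AdamW and the linear schedule are the library defaults here.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="legal-led-billsum",  # hypothetical output path
    num_train_epochs=6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    weight_decay=0.01,
    fp16=True,
    warmup_steps=500,
    lr_scheduler_type="linear",
    predict_with_generate=True,      # generate summaries during evaluation
    generation_max_length=512,
)
```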
## Evaluation Metrics
### Training Progress
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.39 | 1.33 |
| 2 | 1.17 | 1.26 |
| 3 | 1.19 | 1.23 |
| 4 | 1.15 | 1.18 |
| 5 | 1.03 | 1.16 |
| 6 | 1.02 | 1.16 |
### ROUGE (document test set)
| Metric | F1 |
|---|---|
| ROUGE-1 | 0.5179 |
| ROUGE-2 | 0.3432 |
| ROUGE-L | 0.4067 |
### BERTScore
| Metric | Score |
|---|---|
| Precision | 0.9015 |
| Recall | 0.8868 |
| F1 | 0.8936 |
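Metrics like these can be computed with the `evaluate` library. A minimal sketch, with `predictions` and `references` as placeholders for the model's generated summaries and the gold summaries:

```python
# Minimal sketch of computing ROUGE and BERTScore with the `evaluate` library;
# predictions/references are placeholders, not the actual test-set outputs.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["The bill amends ..."]  # placeholder model outputs
references = ["This bill amends ..."]  # placeholder reference summaries

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```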
## Long-Document Summarization Strategy
Legal LED supports long contexts (~16k tokens), but many legal bills exceed that.
To summarize documents of up to 30k tokens, the following pipeline was used (see the sketch below):
1. Length-adaptive chunking
2. Paragraph grouping
3. Sliding-window segmentation
4. Chunk-wise LED summarization
5. Top-K reranking using BERTScore
6. Final second-pass LED rewriting
This improves semantic cohesion and section preservation.
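A minimal sketch of this pipeline follows. The chunk size, stride, and `top_k` values are assumed, as is the use of the `bert-score` package for reranking; for brevity it uses a fixed-size sliding window in place of the length-adaptive chunking and paragraph-grouping steps, and reuses the generation settings from the usage example further down:

```python
# Illustrative sketch of the chunk -> summarize -> rerank -> rewrite pipeline.
# chunk_tokens, stride, and top_k are assumed values, not the original setup.
import torch
from bert_score import score as bertscore
from transformers import AutoTokenizer, LEDForConditionalGeneration

model_name = "Anurag33Gaikwad/legal-led-billsum-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

def led_summarize(text: str, max_input: int = 4096, max_output: int = 512) -> str:
    """Single LED pass with global attention on the first token."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_input)
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    ids = model.generate(
        inputs["input_ids"],
        global_attention_mask=global_attention_mask,
        num_beams=5,
        max_length=max_output,
        early_stopping=True,
    )
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def sliding_chunks(text: str, chunk_tokens: int = 3500, stride: int = 500) -> list[str]:
    """Sliding-window segmentation over the tokenized document."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = chunk_tokens - stride
    return [
        tokenizer.decode(ids[i : i + chunk_tokens])
        for i in range(0, max(len(ids) - stride, 1), step)
    ]

def summarize_long(document: str, top_k: int = 4) -> str:
    chunks = sliding_chunks(document)
    partials = [led_summarize(c) for c in chunks]  # chunk-wise LED summaries
    # Rerank partial summaries by BERTScore F1 against their source chunks
    _, _, f1 = bertscore(partials, chunks, lang="en")
    ranked = [s for _, s in sorted(zip(f1.tolist(), partials), reverse=True)]
    # Second LED pass rewrites the top-K partials into one cohesive summary
    return led_summarize(" ".join(ranked[:top_k]))
```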
## Intended Use
Ideal for:
- Legislative document summarization
- Legal policy analysis
- Long-form legal NLP applications
- AI assistants for lawyers or students
- Preprocessing for legal research systems
## Limitations
- English only
- Requires chunking for documents >16k tokens
- May simplify definitions too aggressively
- Not suitable for citation extraction or case-law reasoning
- Not intended for legal decision-making
## Usage Example
```python
from transformers import AutoTokenizer, LEDForConditionalGeneration
import torch

model_name = "Anurag33Gaikwad/legal-led-billsum-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

text = """Your long legal or legislative document here..."""

inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=4096,
)

# LED requires global attention on the first token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    global_attention_mask=global_attention_mask,
    num_beams=5,
    max_length=512,
    early_stopping=True,
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```