πŸ“˜ Legal LED – Long-Document BillSum Summarizer

A fine-tuned version of nsi319’s Legal LED for summarizing long legal and legislative documents.

This model fine-tunes nsi319/legal-led-base-16384, an LED (Longformer Encoder-Decoder) model pretrained on legal text with a 16,384-token context window.
Legal LED is adapted to legal corpora such as case law, statutes, regulatory materials, and legislative documents, making it more reliable than the vanilla LED for legal NLP.

This fine-tuned version is optimized for summarizing long, complex legal texts such as US bills, policy documents, and multi-section legislation.


🧠 Base Model

This model extends:

πŸ‘‰ nsi319/legal-led-base-16384

The base model has:

  • Longformer sparse attention for 16,384-token sequences
  • Legal-domain pretraining on:
    • court judgments
    • legislation
    • legal commentary
    • regulatory filings
  • Strong domain adaptation prior to fine-tuning

This gives LED excellent performance on structured legal documents.


πŸ“š Fine-Tuning Dataset

  • BillSum (US Congress + California bills)
  • Additional cleaned legal-style summaries
  • Documents ranged from 3k to 30k tokens
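
BillSum is available on the Hugging Face Hub, so the bill/summary pairs can be inspected directly. A minimal loading sketch (the additional cleaned legal-style summaries are not part of the public dataset):

from datasets import load_dataset

# BillSum ships with US Congressional bill splits (train/test)
# plus a California bill split (ca_test).
dataset = load_dataset("billsum")

sample = dataset["train"][0]
print(sample["text"][:300])     # full bill text
print(sample["summary"][:300])  # human-written reference summary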

βš™οΈ Training Configuration

Setting                 Value
----------------------  -------------------------------
Base model              nsi319/legal-led-base-16384
Epochs                  6
Batch size              2
Gradient accumulation   2
Learning rate           1e-5
Optimizer               AdamW
Weight decay            0.01
FP16                    Yes
Warmup steps            500
Max input length        4096 tokens
Max output length       512 tokens
Attention               Global attention on first token
Scheduler               Linear

Training was performed on an NVIDIA P100 (16 GB VRAM) via Kaggle.
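
The exact training script is not published; a hedged sketch of how the table above maps onto Hugging Face Seq2SeqTrainingArguments (output_dir is a placeholder):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="legal-led-billsum",   # placeholder path
    num_train_epochs=6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,               # AdamW is the Trainer default
    weight_decay=0.01,
    fp16=True,
    warmup_steps=500,
    lr_scheduler_type="linear",
    predict_with_generate=True,       # generate summaries during eval
)

# The 4096-token input and 512-token output limits are applied during
# tokenization and generation rather than in these arguments.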


πŸ§ͺ Evaluation Metrics

Training Progress

Epoch   Training Loss   Validation Loss
-----   -------------   ---------------
1       1.39            1.33
2       1.17            1.26
3       1.19            1.23
4       1.15            1.18
5       1.03            1.16
6       1.02            1.16

ROUGE (document-level test set)

Metric    F1
-------   ------
ROUGE-1   0.5179
ROUGE-2   0.3432
ROUGE-L   0.4067

BERTScore

Metric      Score
---------   ------
Precision   0.9015
Recall      0.8868
F1          0.8936
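
A sketch of how such scores can be computed with the evaluate library (the exact evaluation script is not published; the strings below are toy placeholders):

import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["This bill amends the act to extend funding through 2025."]
references = ["The bill extends funding under the act through fiscal year 2025."]

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))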

πŸ—οΈ Long-Document Summarization Strategy

Legal LED supports long contexts (~16k tokens), but many legal bills exceed that.
To summarize documents up to 30k tokens, this pipeline was used:

  • Length-adaptive chunking
  • Paragraph grouping
  • Sliding-window segmentation
  • Chunk-wise LED summarization
  • Top-K reranking using BERTScore
  • Final second-pass LED rewriting

This improves semantic cohesion and section preservation.
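
A simplified sketch of this pipeline, assuming illustrative chunk sizes (paragraph grouping and the BERTScore reranking step are omitted for brevity):

from transformers import AutoTokenizer, LEDForConditionalGeneration
import torch

model_name = "Anurag33Gaikwad/legal-led-billsum-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

def summarize(text, max_input=4096, max_output=512):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_input)
    # LED expects global attention on at least the first token.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        global_attention_mask=global_attention_mask,
        num_beams=5,
        max_length=max_output,
    )
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def sliding_chunks(text, chunk_tokens=3500, overlap=200):
    # Sliding-window segmentation over the tokenized document.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = chunk_tokens - overlap
    return [tokenizer.decode(ids[i:i + chunk_tokens]) for i in range(0, len(ids), step)]

document = "..."  # a bill longer than the 4096-token input window
chunk_summaries = [summarize(chunk) for chunk in sliding_chunks(document)]
final_summary = summarize(" ".join(chunk_summaries))  # second-pass LED rewrite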


πŸ“Œ Intended Use

Ideal for:

  • Legislative document summarization
  • Legal policy analysis
  • Long-form legal NLP applications
  • AI assistants for lawyers or students
  • Preprocessing for legal research systems

⚠️ Limitations

  • English only
  • Requires chunking for documents >16k tokens
  • May simplify definitions too aggressively
  • Not suitable for citation extraction or case-law reasoning
  • Not intended for legal decision-making

πŸ”§ Usage Example

from transformers import AutoTokenizer, LEDForConditionalGeneration
import torch

model_name = "Anurag33Gaikwad/legal-led-billsum-summarization"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

text = """Your long legal or legislative document here..."""

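# Truncate to 4096 tokens, matching the fine-tuning input length.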
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=4096,
)

# LED requires global attention on the first token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    num_beams=5,
    max_length=512,
    early_stopping=True,
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))