Update README.md

fbe3271 verified 5 months ago

5.94 kB

	---
	base_model: unsloth/Qwen2.5-1.5B-Instruct
	language:
	- de
	- fr
	- it
	license: apache-2.0
	tags:
	- text-generation-inference
	- transformers
	- unsloth
	- qwen2
	- trl
	datasets:
	- ipst/slds
	metrics:
	- bertscore
	- bleu
	- rouge
	---
	# Model Card for Qwen2.5-1.5B-Instruct-SLDS

	## Model Summary

	This model is a Qwen2.5-1.5B-Instruct fine-tuned on the Swiss Landmark Decisions Summarization (SLDS) dataset.
	SLDS is a multilingual dataset of 20,000 Swiss Federal Supreme Court decisions (1954–2024), each paired with headnotes in German, French, and Italian, resulting in ~60,000 decision–headnote pairs.

	The model is optimized for legal abstractive summarization and is capable of producing concise, legally structured headnotes.
	It can be used for both monolingual and cross-lingual summarization tasks.

	This model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

	[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

	---

	## Intended Use

	- Primary Task: Judicial summarization (decision → headnote generation).
	- Languages: German (`de`), French (`fr`), Italian (`it`).
	- Scenarios:
	- Monolingual summarization: e.g., German decision → German headnote.
	- Cross-lingual summarization: e.g., German decision → French headnote.
	- Legal research support: assisting in retrieval and navigation of court decisions.

	Not intended for:
	- Replacing human legal expertise.
	- Serving as an authoritative legal source.
	- Automated legal advice or decision-making.

	---

	## Training Data

	- Dataset: [Swiss Landmark Decisions Summarization (SLDS)](https://huggingface.co/datasets/ipst/slds).
	- Size: ~20K decisions, ~60K decision–headnote pairs.
	- Splits: Train (1954–2021), Validation (2022), Test (2023–2024).
	- Source: [Swiss Federal Supreme Court](https://www.bger.ch).

	---

	## Training Procedure

	- Base Models:
	- Qwen2.5 family (0.5B–14B)
	- Llama 3.2 (3B)
	- Phi-3.5-mini

	- Fine-tuning Objective: Conditional generation (decision → headnote).
	- Evaluation Metrics:
	- Lexical: ROUGE-1/2/L, BLEU, BERTScore.
	- Domain-specific: LLM-as-a-Judge framework (DeepSeek V3) assessing five rubrics: accuracy, completeness, clarity, legal citations, and considerations.

	---

	## Model Performance

	On the SLDS test set (2023–2024):

	\| Model \| Setting \| BERTScore ↑ \| BLEU ↑ \| ROUGE-1 ↑ \| ROUGE-2 ↑ \| ROUGE-L ↑ \| JUDGE ↑ \|
	\|:--- \|:--- \|:--- \|:--- \|:--- \|:--- \|:--- \|:--- \|
	\| [Phi-3.5-mini](https://huggingface.co/ipst/Phi-3.5-mini-instruct-SLDS) \| fine-tuned \| 11.24 ± 3.82 \| 34.84 ± 0.41 \| 31.20 ± 2.08 \| 14.11 ± 1.27 \| 20.96 ± 1.35 \| 15.25 ± 2.32 \|
	\| [Llama 3.2B](https://huggingface.co/ipst/Llama-3.2-3B-Instruct-SLDS) \| fine-tuned \| 15.20 ± 4.40 \| 21.89 ± 0.42 \| 31.89 ± 2.34 \| 14.87 ± 1.61 \| 22.49 ± 1.60 \| 18.47 ± 2.99 \|
	\| [Qwen2.5 0.5B](https://huggingface.co/ipst/Qwen2.5-0.5B-Instruct-SLDS) \| fine-tuned \| -1.37 ± 3.85 \| 32.20 ± 0.35 \| 23.87 ± 1.68 \| 9.46 ± 0.94 \| 17.37 ± 1.09 \| 5.80 ± 1.26 \|
	\| [Qwen2.5 1.5B](https://huggingface.co/ipst/Qwen2.5-1.5B-Instruct-SLDS) \| fine-tuned \| 19.81 ± 2.72 \| 36.79 ± 0.34 \| 33.03 ± 1.73 \| 14.14 ± 1.08 \| 22.67 ± 1.13 \| 15.92 ± 2.27 \|
	\| [Qwen2.5 3B](https://huggingface.co/ipst/Qwen2.5-3B-Instruct-SLDS) \| fine-tuned \| 23.23 ± 2.80 \| 38.42 ± 0.34 \| 35.18 ± 1.79 \| 15.66 ± 1.23 \| 24.10 ± 1.17 \| 20.31 ± 2.66 \|
	\| [Qwen2.5 7B](https://huggingface.co/ipst/Qwen2.5-7B-Instruct-SLDS) \| fine-tuned \| 29.59 ± 1.97 \| 41.40 ± 0.34 \| 39.24 ± 1.59 \| 18.26 ± 1.25 \| 26.44 ± 1.15 \| 28.37 ± 3.07 \|
	\| [Qwen2.5 14B](https://huggingface.co/ipst/Qwen2.5-14B-Instruct-SLDS) \| fine-tuned \| 32.48 ± 1.98 \| 41.80 ± 0.37 \| 40.04 ± 1.74 \| 19.99 ± 1.41 \| 28.00 ± 1.28 \| 31.38 ± 3.19 \|
	\| GPT-4o \| one-shot \| 30.44 ± 1.74 \| 31.89 ± 0.25 \| 42.12 ± 1.79 \| 18.92 ± 1.22 \| 25.92 ± 1.05 \| 39.70 ± 2.66 \|
	\| Claude 3.5 Sonnet \| one-shot \| 5.53 ± 2.00 \| 21.88 ± 0.25 \| 41.86 ± 1.64 \| 19.23 ± 1.19 \| 27.67 ± 1.20 \| 41.25 ± 2.90 \|
	\| DeepSeek-R1 \| one-shot \| 20.28 ± 1.45 \| 22.37 ± 0.18 \| 38.30 ± 1.82 \| 15.97 ± 0.85 \| 21.03 ± 0.84 \| 42.28 ± 2.21 \|
	\| o3-mini \| one-shot \| 14.18 ± 1.31 \| 20.55 ± 0.17 \| 34.77 ± 1.43 \| 11.92 ± 0.69 \| 18.21 ± 0.67 \| 34.82 ± 2.41 \|

	- Lexical metrics: Fine-tuned models outperform in overlap-based scores.
	- LLM-judge scores: Larger proprietary and reasoning models outperform in legal precision.

	---

	## Limitations

	- Language imbalance: German decisions dominate, while Italian remains underrepresented.
	- Biases: Headnotes reflect judicial style and conventions, not neutral summaries.
	- Evaluation mismatch: ROUGE and BLEU may not fully capture legal accuracy.
	- Overfitting risk: Models may overfit to formulaic headnote structures.
	- Cross-lingual difficulty: Some models struggle with non-monolingual headnote generation.

	---

	## Ethical Considerations

	- Sensitive information: All data is anonymized by the Swiss Federal Supreme Court before publication.
	- Legal risk: Generated headnotes must not be used as official legal advice.
	- Fair use: Ensure attribution when reusing outputs.

	---

	## How to Cite

	If you use this model, please cite the dataset paper:

	```bibtex
	@article{rolshoven2025slds,
	title={Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland},
	author={Luca Rolshoven and Vishvaksenan Rasiah and Srinanda Brügger Bose and Sarah Hostettler and Lara Burkhalter and Matthias Stürmer and Joel Niklaus},
	year={2025},
	eprint={2410.13456},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2410.13456},
	}
	```