---
library_name: transformers
tags:
- token-classification
- ner
- korean
- court-judgment
- de-identification
license: cc-by-nc-sa-4.0
language: ko
---
# Model Card for SNU Thunder-DeID
## Model Summary
**SNU Thunder-DeID** is a family of transformer encoder-based language models developed for Named Entity Recognition (NER)-based de-identification of Korean court judgments.
Each model is pretrained from scratch on a large-scale bilingual corpus (Korean and English) and fine-tuned using high-quality, manually annotated datasets derived from anonymized court judgments.
The models are designed to identify and label personal identifiers and quasi-identifiers in a token classification setting, supporting accurate and privacy-preserving processing of Korean court judgments.
The SNU Thunder-DeID models are released in three sizes:
- SNU Thunder-DeID-340M (this repository)
- [SNU Thunder-DeID-750M](https://huggingface.co/thunder-research-group/SNU_Thunder-DeID-750M)
- [SNU Thunder-DeID-1.5B](https://huggingface.co/thunder-research-group/SNU_Thunder-DeID-1.5B)
## Intended Use
The SNU Thunder-DeID models are intended to support:
- **De-identification** of Korean court judgments
- **NER tasks** focused on court judgment entities
- Fine-tuning for privacy-preserving AI systems
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
model = AutoModelForTokenClassification.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
inputs = tokenizer("""피고인 이규성은 서울대학교 데이터사이언스대학원 박사과정에 재학 중이며, 같은 연구실 소속 함성은, 박현지와 함께 AI 모델 비식별화와 관련된 연구를 진행 중이다.
그는 해당 기술이 이미 여러 공공기관 및 대기업으로부터 상용화 제안을 받고 있다고 허위로 주장하며, 커뮤니티 사이트 ‘에브리타임’에 “비식별화 기술 투자자 모집”이라는 제목의 글을 게시하였다.
해당 글에는 “이미 검증된 알고리즘, 선점 투자 시 지분 우선 배정”, “특허 수익 배분 예정” 등의 문구와 함께 자신 명의의 우리은행 계좌 (9429-424-343942)를 기재하고,
1인당 10만 원의 초기 투자금을 요구하였다. 이에 따라 이규성은 손영준, 조경제, 이동영, 소연경, 석지헌 등 5명으로부터 총 50만 원을 송금받아 편취하였다.""", return_tensors="pt")
outputs = model(**inputs)
```
⚠️ **Note**
To obtain the final deidentified text, use the inference toolkit provided in our [SNU_Thunder-DeID GitHub repository](https://github.com/mcrl/SNU_Thunder-DeID).
The toolkit handles the full postprocessing pipeline, including:
- `id2label` and `label2id` mappings
- token-to-text alignment
- entity merging, whitespace recovery, and formatting
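If you only need a quick look at the raw predictions before running the toolkit, the final step reduces to an argmax over each token's logits followed by an `id2label` lookup. The sketch below uses hypothetical label names and hand-written logits as stand-ins; a real run would use `outputs.logits` and `model.config.id2label`:

```python
# Minimal sketch of mapping per-token logits to label strings.
# The id2label mapping and logits below are illustrative stand-ins,
# not the model's actual label set.
id2label = {0: "O", 1: "B-PERSON", 2: "I-PERSON"}  # hypothetical subset

# One row of fake logits per token (3 tokens x 3 labels).
logits = [
    [2.1, 0.3, -1.0],   # most likely "O"
    [0.1, 3.2, 0.4],    # most likely "B-PERSON"
    [-0.5, 0.2, 2.8],   # most likely "I-PERSON"
]

def decode(logits, id2label):
    """Argmax each token's logits and map the index to its label string."""
    return [id2label[max(range(len(row)), key=row.__getitem__)] for row in logits]

print(decode(logits, id2label))  # -> ['O', 'B-PERSON', 'I-PERSON']
```

This yields per-token labels only; span merging, whitespace recovery, and placeholder substitution are handled by the toolkit.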
# Model Details
## Model Architecture
All SNU Thunder-DeID models are based on the [DeBERTa-v2](https://huggingface.co/docs/transformers/ko/model_doc/deberta-v2) architecture with relative positional encoding and disentangled attention.
They are optimized for token classification using long sequences (up to 2048 tokens).
| Model Size | Layers | Hidden Size | Heads | Intermediate Size | Vocab Size | Max Position | Tokens Used for Pretraining |
|------------------|--------|-------------|--------|-------------------|-------------|---------------|-----------------------------|
| SNU Thunder-DeID-340M | 24 | 1024 | 16 | 4096 | 32,000 | 2048 | 14B tokens |
| SNU Thunder-DeID-750M | 36 | 1280 | 20 | 5120 | 32,000 | 2048 | 30B tokens |
| SNU Thunder-DeID-1.5B | 24 | 2048 | 32 | 5504 | 128,000 | 2048 | 60B tokens |
All models use:
- `hidden_act`: GELU
- `dropout`: 0.1
- `pos_att_type`: `p2c|c2p` (position-to-content and content-to-position attention)
- `relative_attention`: True
- `tokenizer`: Custom BPE + MeCab-ko tokenizer, trained from scratch on Korean court judgment data
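The hyperparameters above can be expressed as a `DebertaV2Config` (shown here for the 340M model). The listed values come from the table; any setting not shown falls back to DeBERTa-v2 defaults, which may differ from the released checkpoint's actual configuration:

```python
from transformers import DebertaV2Config

# Sketch of the 340M configuration using the hyperparameters listed above.
# Unlisted settings keep DeBERTa-v2 defaults and are assumptions here.
config = DebertaV2Config(
    vocab_size=32_000,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=2048,
    relative_attention=True,
    pos_att_type=["p2c", "c2p"],
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
print(config.hidden_size, config.num_hidden_layers)  # -> 1024 24
```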
## Tokenizer
All SNU Thunder-DeID models use a **custom tokenizer** trained from scratch on a large-scale Korean corpus.
The tokenizer combines:
- [**MeCab-ko**](https://bitbucket.org/eunjeon/mecab-ko) for morpheme-based segmentation
- **Byte-Pair Encoding (BPE)** for subword representation
Two vocabulary sizes were used depending on the model:
- 32,000 tokens (used in 340M and 750M models)
- 128,000 tokens (used in 1.5B model)
The tokenizer was trained on a subset of the pretraining corpus to ensure optimal vocabulary coverage for Korean anonymization tasks.
## Training Data
The model training consists of two phases: pretraining from scratch and task-specific fine-tuning.
### Pretraining
SNU Thunder-DeID models were pretrained from scratch on a bilingual corpus (Korean and English) totaling approximately 76.7GB,
using 14B / 30B / 60B tokens for the 340M, 750M, and 1.5B models respectively.
### Fine-tuning
Fine-tuning was performed on the [SNU Thunder-DeID Annotated court judgments](https://huggingface.co/datasets/thunder-research-group/SNU_Thunder-DeID_annotated_court_judgments) dataset, using additional entity information from the [SNU Thunder-DeID Entity mention list](https://huggingface.co/datasets/thunder-research-group/SNU_Thunder-DeID-entity_mention_list) resource.
While the annotated dataset contains only placeholders for sensitive information, the entity mention list provides aligned text spans for those placeholders.
This alignment enables full token-level supervision for NER training.
- **4,500** anonymized and manually annotated court judgment texts
- Covers three major criminal case types: *fraud*, *crime of violence*, and *indecent act by compulsion*
- **27,402** labeled entity spans, using a **three-tiered taxonomy** of **595 entity labels** tailored for Korean judicial anonymization
- Annotations are inserted in-line using special tokens for structured NER training
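The placeholder-to-mention alignment can be sketched as follows. The `<<LABEL>>` placeholder format and the label names are illustrative only, not the dataset's actual annotation scheme; the point is how an aligned mention yields token-level BIO tags:

```python
import re

# Hypothetical placeholder format: <<LABEL>> markers inside anonymized text.
# The real dataset uses its own scheme; this only illustrates how aligned
# mentions enable token-level BIO supervision.
text = "피고인 <<PERSON>>은 <<ORG>>에 재학 중이다."
mentions = {"PERSON": "이규성", "ORG": "서울대학교"}

tokens, tags = [], []
for piece in re.split(r"(<<[A-Z]+>>)", text):
    m = re.fullmatch(r"<<([A-Z]+)>>", piece)
    if m:
        # Substitute the aligned mention and tag its tokens B-/I-LABEL.
        label = m.group(1)
        span = mentions[label].split()
        tokens.extend(span)
        tags.extend([f"B-{label}"] + [f"I-{label}"] * (len(span) - 1))
    else:
        # Everything outside placeholders is tagged O (simple whitespace split).
        for tok in piece.split():
            tokens.append(tok)
            tags.append("O")

print(list(zip(tokens, tags)))
```

A real pipeline would additionally align these word-level tags with subword tokens from the model's tokenizer.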
## Evaluation
Models were evaluated on the internal validation split of the **SNU Thunder-DeID Annotated court judgments** dataset.
| Metric | 340M | 750M | 1.5B |
|-----------------------------|--------|--------|--------|
| Binary Token-Level Micro F1 | 0.9894 | 0.9891 | 0.9910 |
| Token-Level Micro F1 | 0.8917 | 0.8862 | 0.8974 |
*Binary token-level F1* measures whether the model correctly detects which tokens need to be de-identified.
*Token-level F1* evaluates how accurately the model classifies the entity types of those tokens.
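As a toy illustration of the difference between the two metrics (labels hypothetical): binary F1 collapses every entity label to a single "entity" class, so a token tagged with the wrong entity type still counts as detected, while token-level F1 requires the exact label:

```python
def micro_f1(gold, pred):
    """Micro-averaged token-level F1, ignoring the 'O' (non-entity) class."""
    tp = sum(g == p != "O" for g, p in zip(gold, pred))
    fp = sum(p != "O" and g != p for g, p in zip(gold, pred))
    fn = sum(g != "O" and g != p for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def binarize(tags):
    """Collapse all entity labels into one generic 'ENT' class."""
    return ["ENT" if t != "O" else "O" for t in tags]

# One token mislabeled with the wrong entity type (I-PER predicted as B-LOC).
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "B-LOC", "O", "B-LOC"]

print(micro_f1(binarize(gold), binarize(pred)))  # -> 1.0 (detected everything)
print(round(micro_f1(gold, pred), 4))            # -> 0.6667 (one wrong type)
```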
## Limitations
- Trained only on criminal court cases — not guaranteed to generalize to civil or administrative rulings
- Designed for Korean texts — not applicable to other languages or domains
- Not suitable for identifying sensitive content outside of structured NER targets
## Ethical Considerations
- The model is trained on already-anonymized court documents
- Deployment in real-world settings should still include human oversight and legal compliance checks
## License
This repository contains original work licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (**CC BY-NC-SA 4.0**):
https://creativecommons.org/licenses/by-nc-sa/4.0/

Portions of this repository (including the tokenizer vocabulary and/or model weights)
are derived from Meta Llama 3.1 and are subject to the Meta Llama 3.1 Community License:
https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE
## Citation
If you use this model, please cite:
```bibtex
@misc{hahm2025thunderdeidaccurateefficientdeidentification,
title={Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments},
author={Sungen Hahm and Heejin Kim and Gyuseong Lee and Hyunji Park and Jaejin Lee},
year={2025},
eprint={2506.15266},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.15266},
}
```
## Contact
If you have questions or issues, contact:
**snullm@aces.snu.ac.kr**