|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- token-classification |
|
|
- ner |
|
|
- korean |
|
|
- court-judgment |
|
|
- de-identification |
|
|
license: cc-by-nc-sa-4.0 |
|
|
language: ko |
|
|
--- |
|
|
|
|
|
# Model Card for SNU Thunder-DeID |
|
|
|
|
|
|
|
|
|
|
|
|
|
## Model Summary |
|
|
**SNU Thunder-DeID** is a family of transformer encoder-based language models developed for Named Entity Recognition (NER)-based de-identification of Korean court judgments. |
|
|
Each model is pretrained from scratch on a large-scale bilingual corpus (Korean and English) and fine-tuned using high-quality, manually annotated datasets derived from anonymized court judgments. |
|
|
The models are designed to identify and label personal identifiers and quasi-identifiers in a token classification setting, supporting accurate, privacy-preserving processing of Korean court judgments.
|
|
|
|
|
|
|
|
The SNU Thunder-DeID models are released in three sizes: |
|
|
- SNU Thunder-DeID-340M (this model)
|
|
- [SNU Thunder-DeID-750M](https://huggingface.co/thunder-research-group/SNU_Thunder-DeID-750M) |
|
|
- [SNU Thunder-DeID-1.5B](https://huggingface.co/thunder-research-group/SNU_Thunder-DeID-1.5B) |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
The SNU Thunder-DeID models are intended to support: |
|
|
- **De-identification** of Korean court judgments |
|
|
- **NER tasks** focused on court judgment entities
|
|
- Fine-tuning for privacy-preserving AI systems |
|
|
|
|
|
## How to Use |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and the token-classification model
tokenizer = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
model = AutoModelForTokenClassification.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")

# Example input: a fictional Korean court judgment excerpt
text = """피고인 이규성은 서울대학교 데이터사이언스대학원 박사과정에 재학 중이며, 같은 연구실 소속 함성은, 박현지와 함께 AI 모델 비식별화와 관련된 연구를 진행 중이다.
그는 해당 기술이 이미 여러 공공기관 및 대기업으로부터 상용화 제안을 받고 있다고 허위로 주장하며, 커뮤니티 사이트 ‘에브리타임’에 “비식별화 기술 투자자 모집”이라는 제목의 글을 게시하였다.
해당 글에는 “이미 검증된 알고리즘, 선점 투자 시 지분 우선 배정”, “특허 수익 배분 예정” 등의 문구와 함께 자신 명의의 우리은행 계좌 (9429-424-343942)를 기재하고,
1인당 10만 원의 초기 투자금을 요구하였다. 이에 따라 이규성은 손영준, 조경제, 이동영, 소연경, 석지헌 등 5명으로부터 총 50만 원을 송금받아 편취하였다."""

inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
|
|
``` |
|
|
⚠️ **Note** |
|
|
To obtain the final de-identified text, use the inference toolkit provided in our [SNU_Thunder-DeID GitHub repository](https://github.com/mcrl/SNU_Thunder-DeID).
|
|
The toolkit handles the full postprocessing pipeline, including: |
|
|
- `id2label` and `label2id` mappings |
|
|
- token-to-text alignment |
|
|
- entity merging, whitespace recovery, and formatting |
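For a rough idea of what that decoding involves, here is a minimal sketch that maps the raw logits from the snippet above back to label strings via the model's `id2label` mapping. It assumes a standard `O` label for non-entity tokens and skips the toolkit's entity merging and whitespace recovery.

```python
import torch

# Continuing from the snippet above. This is a sketch only; use the official
# toolkit to reconstruct properly formatted, de-identified text.
with torch.no_grad():
    logits = model(**inputs).logits                    # (batch, seq_len, num_labels)

label_ids = logits.argmax(dim=-1)[0]                   # best label id per token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

for token, label_id in zip(tokens, label_ids.tolist()):
    label = model.config.id2label[label_id]
    if label != "O":                                   # assumes 'O' marks non-entities
        print(token, label)
```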
|
|
|
|
|
|
|
|
# Model Details |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
All SNU Thunder-DeID models are based on the [DeBERTa-v2](https://huggingface.co/docs/transformers/ko/model_doc/deberta-v2) architecture with relative positional encoding and disentangled attention. |
|
|
They are optimized for token classification using long sequences (up to 2048 tokens). |
|
|
|
|
|
| Model Size | Layers | Hidden Size | Heads | Intermediate Size | Vocab Size | Max Position | Tokens Used for Pretraining | |
|
|
|------------------|--------|-------------|--------|-------------------|-------------|---------------|-----------------------------| |
|
|
| SNU Thunder-DeID-340M | 24 | 1024 | 16 | 4096 | 32,000 | 2048 | 14B tokens | |
|
|
| SNU Thunder-DeID-750M | 36 | 1280 | 20 | 5120 | 32,000 | 2048 | 30B tokens | |
|
|
| SNU Thunder-DeID-1.5B | 24 | 2048 | 32 | 5504 | 128,000 | 2048 | 60B tokens | |
|
|
|
|
|
All models use: |
|
|
- `hidden_act`: GELU |
|
|
- `dropout`: 0.1 |
|
|
- `pos_att_type`: `p2c|c2p` (position-to-content and content-to-position attention) |
|
|
- `relative_attention`: True |
|
|
- `tokenizer`: Custom BPE + MeCab-ko tokenizer, trained from scratch on Korean court judgment data |
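For reference, the hyperparameters in the table above roughly translate into the following `DebertaV2Config`, shown here for the 340M model. This is an illustrative sketch; the `config.json` shipped with each checkpoint is authoritative.

```python
from transformers import DebertaV2Config

# Approximate SNU Thunder-DeID-340M hyperparameters, taken from the table above.
config = DebertaV2Config(
    vocab_size=32000,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=2048,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    relative_attention=True,
    pos_att_type=["p2c", "c2p"],
)
```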
|
|
|
|
|
|
|
|
## Tokenizer |
|
|
|
|
|
All SNU Thunder-DeID models use a **custom tokenizer** trained from scratch on a large-scale Korean corpus. |
|
|
The tokenizer combines: |
|
|
- [**MeCab-ko**](https://bitbucket.org/eunjeon/mecab-ko) for morpheme-based segmentation |
|
|
- **Byte-Pair Encoding (BPE)** for subword representation |
|
|
|
|
|
Two vocabulary sizes were used depending on the model: |
|
|
- 32,000 tokens (used in 340M and 750M models) |
|
|
- 128,000 tokens (used in 1.5B model) |
|
|
|
|
|
The tokenizer was trained on a subset of the pretraining corpus to provide broad vocabulary coverage for Korean anonymization tasks.
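To see the segmentation in practice, the tokenizer can be inspected directly; the exact output below is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")

print(tokenizer.vocab_size)  # 32000 for the 340M and 750M models
# Tokenize a short sentence: "The defendant appeared at the Seoul Central District Court."
print(tokenizer.tokenize("피고인은 서울중앙지방법원에 출석하였다."))
```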
|
|
|
|
|
|
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model training consists of two phases: pretraining from scratch and task-specific fine-tuning. |
|
|
|
|
|
### Pretraining |
|
|
SNU Thunder-DeID models were pretrained from scratch on a bilingual (Korean and English) corpus of approximately 76.7 GB,
using 14B, 30B, and 60B tokens for the 340M, 750M, and 1.5B models, respectively.
|
|
|
|
|
### Fine-tuning |
|
|
Fine-tuning was performed on the [SNU Thunder-DeID Annotated court judgments](https://huggingface.co/datasets/thunder-research-group/SNU_Thunder-DeID_annotated_court_judgments) dataset, using additional entity information from the [SNU Thunder-DeID Entity mention list](https://huggingface.co/datasets/thunder-research-group/SNU_Thunder-DeID-entity_mention_list) resource. |
|
|
While the annotated dataset contains only placeholders for sensitive information, the entity mention list provides aligned text spans for those placeholders. |
|
|
This alignment enables full token-level supervision for NER training. |
|
|
- **4,500** anonymized and manually annotated court judgment texts |
|
|
- Covers three major criminal case types: *fraud*, *crime of violence*, and *indecent act by compulsion* |
|
|
- **27,402** labeled entity spans, using a **three-tiered taxonomy** of **595 entity labels** tailored for Korean judicial anonymization |
|
|
- Annotations are inserted in-line using special tokens for structured NER training |
|
|
|
|
|
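As a purely hypothetical illustration of this alignment (the released datasets define their own placeholder and annotation formats), substituting a mention back into a placeholder yields the character span needed for token-level labels:

```python
# Hypothetical placeholder format for illustration only; the actual datasets
# define their own annotation scheme.
annotated = "피고인 {PERSON}은 투자금을 요구하였다."  # "The defendant {PERSON} demanded investment money."
mentions = {"PERSON": "이규성"}

start = annotated.index("{PERSON}")
surface = annotated.replace("{PERSON}", mentions["PERSON"])
end = start + len(mentions["PERSON"])

# Tokenizer offsets overlapping [start, end) would receive B-/I-PERSON labels.
print(surface)       # 피고인 이규성은 투자금을 요구하였다.
print((start, end))  # (4, 7)
```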
|
|
|
|
|
## Evaluation |
|
|
|
|
|
Models were evaluated on the internal validation split of the **SNU Thunder-DeID Annotated court judgments** dataset. |
|
|
|
|
|
| Metric | 340M | 750M | 1.5B | |
|
|
|-----------------------------|--------|--------|--------| |
|
|
| Binary Token-Level Micro F1 | 0.9894 | 0.9891 | 0.9910 | |
|
|
| Token-Level Micro F1 | 0.8917 | 0.8862 | 0.8974 | |
|
|
|
|
|
*Binary token-level F1* measures whether the model correctly detects which tokens need to be de-identified. |
|
|
*Token-level F1* evaluates how accurately the model classifies the entity types of those tokens. |
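To make the two metrics concrete, the sketch below computes both from per-token gold and predicted labels. The label names are hypothetical and the implementation assumes an `O` label for non-entity tokens; the actual evaluation code may differ in details.

```python
def micro_f1(gold, pred, binary=False):
    """Token-level micro F1. With binary=True, every non-'O' label collapses
    into a single 'needs de-identification' class."""
    if binary:
        gold = ["ENT" if g != "O" else "O" for g in gold]
        pred = ["ENT" if p != "O" else "O" for p in pred]
    tp = sum(g == p != "O" for g, p in zip(gold, pred))
    fp = sum(p != "O" and g != p for g, p in zip(gold, pred))
    fn = sum(g != "O" and g != p for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical label sequences for a five-token example
gold = ["O", "B-PERSON", "I-PERSON", "O", "B-ORG"]
pred = ["O", "B-PERSON", "I-ORG",    "O", "B-ORG"]
print(micro_f1(gold, pred, binary=True))  # 1.0: every entity token was detected
print(micro_f1(gold, pred))               # ~0.67: one token's entity type is wrong
```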
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained only on criminal court cases; generalization to civil or administrative rulings is not guaranteed

- Designed for Korean text; not applicable to other languages or domains

- Not suitable for identifying sensitive content outside its structured NER label set
|
|
|
|
|
|
|
|
## Ethical Considerations |
|
|
- The model is trained on already-anonymized court documents |
|
|
- Deployment in real-world settings should still include human oversight and legal compliance checks
|
|
|
|
|
|
|
|
## License |
|
|
This repository contains original work licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (**CC BY-NC-SA 4.0**).

Portions of this repository (including the tokenizer vocabulary and/or model weights)
are derived from Meta Llama 3.1 and are subject to the Meta Llama 3.1 Community License.

- Meta Llama 3.1 Community License:

  https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE

- Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License:

  https://creativecommons.org/licenses/by-nc-sa/4.0/
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
```bibtex |
|
|
@misc{hahm2025thunderdeidaccurateefficientdeidentification, |
|
|
title={Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments}, |
|
|
author={Sungen Hahm and Heejin Kim and Gyuseong Lee and Hyunji Park and Jaejin Lee}, |
|
|
year={2025}, |
|
|
eprint={2506.15266}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2506.15266}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contact |
|
|
|
|
|
If you have questions or issues, contact: |
|
|
**snullm@aces.snu.ac.kr** |