againeureka committed on
Commit
f5aa5d1
·
verified ·
1 Parent(s): ef93388

Create README.md

Files changed (1)
  1. README.md +78 -31
README.md CHANGED
@@ -1,50 +1,97 @@
  ---
- tags:
- - generated_from_trainer
- model-index:
- - name: klue_roberta_base_for_legal
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # klue_roberta_base_for_legal

- This model is a fine-tuned version of [klue/roberta-base](https://huggingface.co/klue/roberta-base) on an unknown dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 5e-05
- - train_batch_size: 18
- - eval_batch_size: 8
- - seed: 1
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - num_epochs: 5

- ### Training results

- ### Framework versions

- - Transformers 4.28.1
- - Pytorch 2.0.0+cu117
- - Datasets 2.12.0
- - Tokenizers 0.13.3

  ---
+ language:
+ - ko
+ metrics:
+ - accuracy
+ library_name: transformers
  ---

+ # KLUE RoBERTa-base for legal documents

+ <!-- Provide a quick summary of what the model is/does. -->

+ - This model is the KLUE/RoBERTa-base model retrained on the file legal_text_merged02_light.txt, which consists of court judgments.

+ ## Model Details

+ ### Model Description

+ <!-- Provide a longer summary of what this model is. -->

+ - **Developed by:** J.Park @ KETI
+ - **Model type:** klue/roberta-base
+ - **Language(s) (NLP):** Korean
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** klue/roberta-base

+ ### Training method

+ ```python
+ from transformers import (
+     AutoTokenizer,
+     DataCollatorForLanguageModeling,
+     LineByLineTextDataset,
+     RobertaForMaskedLM,
+     Trainer,
+     TrainingArguments,
+ )
+
+ # Assumed values: the original snippet leaves these four names undefined.
+ base_model = "klue/roberta-base"
+ base_tokenizer = "klue/roberta-base"
+ fpath_dataset = "legal_text_merged02_light.txt"
+ output_dir = "klue_roberta_base_for_legal"
+
+ model = RobertaForMaskedLM.from_pretrained(base_model)
+ tokenizer = AutoTokenizer.from_pretrained(base_tokenizer)
+
+ # One training example per non-empty line, truncated to 512 tokens.
+ # (LineByLineTextDataset is deprecated in newer transformers releases.)
+ dataset = LineByLineTextDataset(
+     tokenizer=tokenizer,
+     file_path=fpath_dataset,
+     block_size=512,
+ )
+
+ # Select 15% of the tokens as masked-language-modeling targets.
+ data_collator = DataCollatorForLanguageModeling(
+     tokenizer=tokenizer, mlm=True, mlm_probability=0.15
+ )
+
+ training_args = TrainingArguments(
+     output_dir=output_dir,
+     overwrite_output_dir=True,
+     num_train_epochs=5,
+     per_device_train_batch_size=18,
+     save_steps=100,  # write a checkpoint every 100 steps
+     save_total_limit=2,  # keep only the two most recent checkpoints
+     seed=1,
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     data_collator=data_collator,
+     train_dataset=dataset,
+ )
+
+ train_metrics = trainer.train()
+
+ trainer.save_model(output_dir)
+ trainer.push_to_hub()
+ ```
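
The `mlm_probability=0.15` argument passed to `DataCollatorForLanguageModeling` controls how many tokens become prediction targets. As a rough illustration only, the sketch below reimplements the usual BERT-style rule in plain Python: of the selected tokens, about 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. The function name `mask_tokens`, the `mask_id` and `vocab_size` values, and the toy token ids are hypothetical and do not come from the model card; the real collator works on tensors and reads these values from the tokenizer.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) using the 80/10/10 masking rule.

    labels holds the original token at selected positions and -100
    elsewhere, which is the index the loss function ignores.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; other tokens are selected
        # independently with probability mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Toy usage with made-up ids (0 and 2 stand in for special tokens).
example_ids = [0, 1500, 2731, 984, 12045, 77, 2]
masked, targets = mask_tokens(
    example_ids, mask_id=4, vocab_size=32000,
    special_ids={0, 2}, rng=random.Random(1),
)
print(masked)
print(targets)
```

Because selection is random per token, short lines may end up with no masked positions at all in a given batch; the 15% figure only holds on average.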
+
+ ### Training configuration
+
+ - number of epochs
+ ```python
+ epochs = 50
+ ```
+
+ - JSON file
+ ```json
+ [
+   {
+     "basemodel": "againeureka/klue_roberta_base_for_legal",
+     "basetokenizer": "klue/roberta-base",
+     "trainmodel": "againeureka/toulmin_classifier8_klue_roberta_base_retrained6",
+     "batchsize": 92,
+     "epochs": 50,
+     "push_to_hub": true,
+     "is_on": true
+   }
+ ]
+ ```
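
A configuration list like the one above could be consumed as follows. This is a minimal sketch under stated assumptions: the configuration is taken to be valid JSON (double-quoted keys, `true`, a numeric `epochs`), and `is_on` is assumed to mark which entries should actually be run. The helper name `active_runs` and the inline `CONFIG_JSON` string are hypothetical; the model card does not show how the file is read.

```python
import json

# Inline copy of the configuration, written as valid JSON (assumption).
CONFIG_JSON = """
[
  {
    "basemodel": "againeureka/klue_roberta_base_for_legal",
    "basetokenizer": "klue/roberta-base",
    "trainmodel": "againeureka/toulmin_classifier8_klue_roberta_base_retrained6",
    "batchsize": 92,
    "epochs": 50,
    "push_to_hub": true,
    "is_on": true
  }
]
"""


def active_runs(config_text):
    """Parse the JSON list and keep only entries whose is_on flag is true."""
    entries = json.loads(config_text)
    return [entry for entry in entries if entry.get("is_on")]


for run in active_runs(CONFIG_JSON):
    # Each entry names the starting checkpoint, the tokenizer, and the
    # repository the trained model would be pushed to.
    print(run["basemodel"], "->", run["trainmodel"],
          f"(epochs={run['epochs']}, batch size={run['batchsize']})")
```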