# Slovak RoBERTa Masked Language Model
### 83M parameters in the small model
Medium and Large models coming soon!
Pretrained RoBERTa tokenizer vocab and merges included.
---
## Training params:
- **Dataset**:
8 GB Slovak monolingual dataset, including ParaCrawl (monolingual portion), OSCAR, and several gigabytes of my own scraped and cleaned data.
- **Preprocessing**:
Tokenized with a pretrained ByteLevelBPETokenizer trained on the same dataset. Uncased, with `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` special tokens.
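The preprocessing step above can be sketched with the `tokenizers` library. The corpus, vocab size, and minimum frequency below are illustrative assumptions, not the values used for this model:

```python
# Sketch of training an uncased byte-level BPE tokenizer with
# RoBERTa-style special tokens. The corpus, vocab_size, and
# min_frequency are illustrative, not this model's actual settings.
from tokenizers import ByteLevelBPETokenizer

corpus = [
    "Mnoho ľudí tu žije.",  # tiny stand-in for the 8 GB corpus
    "Ako sa máte?",
]

tokenizer = ByteLevelBPETokenizer(lowercase=True)  # uncased
tokenizer.train_from_iterator(
    corpus,
    vocab_size=1000,
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

encoding = tokenizer.encode("Ako sa máte?")
print(encoding.tokens)
```

Calling `tokenizer.save_model(directory)` afterwards writes the `vocab.json` and `merges.txt` files of the kind bundled with this model.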
- **Evaluation results**:
- Mnoho ľudí tu<mask> ("Many people here <mask>")
* žije.
* žijú.
* je.
* trpí.
- Ako sa<mask> ("How <mask>", as in "How are you?")
* máte
* máš
* má
* hovorí
- Plážová sezóna pod Zoborom patrí medzi<mask> obdobia. ("The beach season below Zobor ranks among the <mask> periods.")
* ročné
* najkrajšie
* najobľúbenejšie
* najnáročnejšie
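The candidate lists above are top-k fill-mask predictions: the model scores every vocabulary item at the masked position, and the card shows the highest-probability completions. A minimal sketch of that selection step, using a toy vocabulary and made-up logits rather than real model output:

```python
import math

# Toy vocabulary and made-up logits for the masked position in
# "Mnoho ľudí tu<mask>" -- illustrative numbers, not real model output.
vocab = ["žije.", "žijú.", "je.", "trpí.", "spí."]
logits = [4.1, 3.7, 3.2, 2.9, 0.5]

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Keep the four most probable completions, highest first.
top_k = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)[:4]
for token, p in top_k:
    print(f"{token}\t{p:.3f}")
```

In practice the same result comes from running the model through a fill-mask pipeline, which applies exactly this softmax-and-rank step over the full vocabulary.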
- **Limitations**:
The current model is fairly small, although it performs well. It is meant to be fine-tuned on downstream tasks, e.g. part-of-speech tagging, question answering, or any task in GLUE or SuperGLUE.
- **Credit**:
If you use this or any of my models in research or professional work, please credit me, Christopher Brousseau, in said work.