# Slovak RoBERTa Masked Language Model

### 83 million parameters in the small model

Medium and large models are coming soon!

The pretrained RoBERTa tokenizer vocab and merges files are included.

---

## Training params:
- **Dataset**:
  8 GB Slovak monolingual dataset, including ParaCrawl (monolingual), OSCAR, and several gigabytes of text that I collected and cleaned myself.
- **Preprocessing**:
  Tokenized with a ByteLevelBPETokenizer trained on the same dataset. Uncased, with `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` special tokens.
- **Evaluation results**:
  - Mnoho ľudí tu<mask>
    * žije.
    * žijú.
    * je.
    * trpí.
  - Ako sa<mask>
    * máte
    * máš
    * má
    * hovorí
  - Plážová sezóna pod Zoborom patrí medzi<mask> obdobia.
    * ročné
    * najkrajšie
    * najobľúbenejšie
    * najnáročnejšie
    
- **Limitations**:
  The current model is fairly small, though it works very well. It is meant to be fine-tuned on downstream tasks, e.g. part-of-speech tagging, question answering, or anything in GLUE or SuperGLUE.
  
- **Credit**:
  If you use this or any of my models in research or professional work, please credit me, Christopher Brousseau, in that work.
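
The masked-word examples under the evaluation results can be reproduced with a fill-mask pipeline. The snippet below is a minimal sketch, not the author's own evaluation script: it assumes the Hugging Face `transformers` library is installed, and the helper names `top_tokens`, `load_fill_mask`, and `demo` are hypothetical. The model ID is not stated in this README, so the caller has to supply this model's Hub ID or a local path.

```python
from typing import Callable, Dict, List

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token, as used in the examples above


def top_tokens(fill_mask: Callable[..., List[Dict]], text: str, k: int = 4) -> List[str]:
    """Return the k highest-scoring fillers for the single mask in `text`."""
    if text.count(MASK_TOKEN) != 1:
        raise ValueError(f"expected exactly one {MASK_TOKEN} in the prompt")
    # A fill-mask pipeline returns one dict per candidate, ordered by score.
    predictions = fill_mask(text, top_k=k)
    return [p["token_str"].strip() for p in predictions]


def load_fill_mask(model_id: str):
    """Build a fill-mask pipeline; `model_id` is a Hub ID or a local path."""
    # Imported lazily so that `top_tokens` itself has no third-party dependency.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def demo(model_id: str) -> None:
    """Print the top predictions for two of the prompts listed above."""
    fill_mask = load_fill_mask(model_id)
    for prompt in ("Mnoho ľudí tu<mask>", "Ako sa<mask>"):
        print(prompt, "->", top_tokens(fill_mask, prompt))
```

Calling `demo(...)` with this model's ID prints four candidates per prompt. No space is placed before `<mask>` because a byte-level BPE token normally carries its own leading space, which is why `top_tokens` strips whitespace from each prediction.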