# Slovak RoBERTa Masked Language Model

### 83M parameters in the small model

Medium and Large models are coming soon!

The pretrained RoBERTa tokenizer vocab and merges files are included.
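The files should load directly with the `transformers` library. A minimal sketch, assuming a placeholder repo id in place of this model's actual id:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# "path/to/slovak-roberta-small" is a placeholder, not the real repo id.
model_id = "path/to/slovak-roberta-small"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```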
---

## Training params:
- **Dataset**:
  An 8 GB Slovak monolingual dataset combining the monolingual portion of ParaCrawl, OSCAR, and several gigabytes of my own collected and cleaned text.
- **Preprocessing**:
  Tokenized with a ByteLevelBPETokenizer trained on the same dataset. Uncased, with the `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` special tokens.
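  A minimal recreation of the tokenizer training described above, using the Hugging Face `tokenizers` library; the corpus path and vocab size are assumptions, not values from this card:

  ```python
  from tokenizers import ByteLevelBPETokenizer

  # Uncased, byte-level BPE, as described above.
  tokenizer = ByteLevelBPETokenizer(lowercase=True)
  tokenizer.train(
      files=["sk_corpus.txt"],  # placeholder path to the Slovak corpus
      vocab_size=50_265,        # assumed; the size RoBERTa commonly uses
      special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
  )
  tokenizer.save_model("tokenizer_out")  # writes vocab.json and merges.txt
  ```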
- **Evaluation results** (fill-mask examples, with English glosses; a sketch for reproducing them follows the list):
  - Mnoho ľudí tu <mask> ("Many people here <mask>")
    * žije. ("live", singular agreement)
    * žijú. ("live", plural agreement)
    * je. ("is/are")
    * trpí. ("suffer(s)")
  - Ako sa <mask> ("How <mask>")
    * máte ("How are you?", formal)
    * máš ("How are you?", informal)
    * má ("How is he/she?")
    * hovorí ("How do you say ...?")
  - Plážová sezóna pod Zoborom patrí medzi <mask> obdobia. ("The beach season below Zobor is among the <mask> periods.")
    * ročné ("annual"; "ročné obdobia" = "seasons of the year")
    * najkrajšie ("most beautiful")
    * najobľúbenejšie ("most popular")
    * najnáročnejšie ("most demanding")
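  A minimal sketch for reproducing these predictions with the fill-mask pipeline, again assuming a placeholder repo id:

  ```python
  from transformers import pipeline

  # "path/to/slovak-roberta-small" is a placeholder, not the real repo id.
  fill_mask = pipeline("fill-mask", model="path/to/slovak-roberta-small")

  # Prints the top predicted tokens and their scores for one example above.
  for pred in fill_mask("Mnoho ľudí tu <mask>"):
      print(pred["token_str"], round(pred["score"], 4))
  ```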
- **Limitations**:
  The current model is fairly small, although it works very well. It is meant to be fine-tuned on downstream tasks, e.g. part-of-speech tagging, question answering, or anything in GLUE or SuperGLUE.
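  A minimal fine-tuning sketch for token classification (e.g. POS tagging), assuming a placeholder repo id and the 17 Universal POS tags as the label set:

  ```python
  from transformers import AutoModelForTokenClassification

  model = AutoModelForTokenClassification.from_pretrained(
      "path/to/slovak-roberta-small",  # placeholder repo id
      num_labels=17,                   # assumed: the 17 Universal POS tags
  )
  # From here, train with the standard transformers Trainer on a
  # labeled Slovak POS dataset.
  ```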
- **Credit**:
  If you use this or any of my models in research or professional work, please credit me, Christopher Brousseau, in that work.