Fill-Mask · Transformers · Safetensors · modernbert
akseli-reunamo committed (verified)
Commit 90b889e · 1 Parent(s): aae90e2

Update README.md

Files changed (1)
  1. README.md +58 -143
README.md CHANGED
@@ -21,164 +21,79 @@ library_name: transformers
21
  <img src="images/finnish_modernbert.png" alt="Finnish ModernBERT" width="600" height="600">
22
 
23
  # Finnish ModernBERT Model Card
24
- Finnish ModernBERT large is an encoder model following the ModernBERT architecture, pretrained on Finnish, Swedish, English, Code, Latin, and Northern Sámi. It was trained on 400B tokens.
 
25
  Training was conducted on the [LUMI supercomputer](https://www.lumi-supercomputer.eu/).
26
- The project aimed to train multilingual encoder models that support long context and all official Finnish languages¹. The model can theoretically extrapolate to a context length of 128,000 tokens.
27
 
28
  ¹Multiple Sámi languages are spoken in Finland, but Northern Sámi is the most widespread and thus included in the training data. English is not the official language of Finland, but it is widely used. Latin was included for potential clinical use.
 
 
 
29
  ## Table of Contents
30
  1. [Model Overview](#model-overview)
31
- 2. [Training](#training)
32
- 3. [Training data](#training-data)
33
- 4. [Evaluation results](#evaluation-results)
34
- 5. [Ethical Considerations and Limitations](#ethical-considerations-and-limitations)
35
- 6. [Acknowledgements](#acknowledgements)
36
- 7. [Licence](#licence)
37
- 8. [Citation information](#citation-information)
 
38
  ## Model Overview
39
  | Hyperparameter | Value |
40
  | :------------- | :----: |
41
  | n_parameters | 401M |
42
  | n_layers | 28 |
43
- | RoPE theta | 10,000 / 1,000,000 |
44
- | vocab_size | 55,616 |
45
- | sequence_length | 16,000 / 128,000 |
46
- ## Training
47
- Pretraining was done using Distributed Data Parallelism, AdamW with ZeroRedundancyOptimizer, and the WSD learning rate schedule.
48
- The model was trained with a learning rate of 3e-4, a sequence length of 1024, and a RoPE theta of 10,000 for 350B tokens over 117,300 steps.
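As a rough illustration of this setup (not the project's actual training code), the sketch below wraps a stand-in module in DistributedDataParallel and shards AdamW states with PyTorch's ZeroRedundancyOptimizer; only the 3e-4 learning rate comes from this card, everything else is a placeholder.

```python
# Minimal sketch of DDP + AdamW sharded with ZeroRedundancyOptimizer.
# Launch with torchrun; the Linear layer stands in for the ModernBERT encoder.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")           # one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = DDP(torch.nn.Linear(8, 8).cuda())         # placeholder model
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,            # optimizer states sharded across ranks
    lr=3e-4,                                      # stable-phase learning rate from this card
)
```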
49
- ### Long context training
50
- The model was trained with a learning rate of 5e-5, increasing the context length from 1024 to 16,000 in six stages,
51
- where each sequence length was trained for an equal number of tokens, totaling 40B tokens over 16,560 steps.
52
- RoPE theta in global layers was increased to 1,000,000. Long documents were sampled from the original data in the distribution below:
53
- |Sequence length | % |
54
- |:-----:| :-----: |
55
- |<1000|21|
56
- |1000-10000|78|
57
- |10000-16000|1|
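For orientation only, the snippet below expresses such a context-extension setup with the Hugging Face ModernBertConfig; the field names (`global_rope_theta`, `local_rope_theta`, `max_position_embeddings`) are assumptions about the transformers API rather than something specified in this card.

```python
# Illustrative only: context-extension settings written as a ModernBertConfig.
# Field names are assumed from the transformers ModernBERT implementation;
# verify them against your installed version.
from transformers import ModernBertConfig

config = ModernBertConfig(
    vocab_size=55_616,               # Model Overview table
    num_hidden_layers=28,            # Model Overview table
    max_position_embeddings=16_000,  # extended from 1,024 in six stages
    global_rope_theta=1_000_000.0,   # RoPE theta raised in global-attention layers
    local_rope_theta=10_000.0,       # local layers keep the original theta of 10,000
)
print(config.max_position_embeddings, config.global_rope_theta)
```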
58
 
59
- ### Annealing
60
- For the learning rate decay phase, the dataset was swapped for a high-quality subset.
61
- The RoPE theta and context length were kept the same as in long context training.
62
- The model was annealed for 10B tokens over 4,139 steps using the 1-sqrt learning rate decay schedule.
64
- ## Training data
65
- All pretraining data (excluding the annealing data) were globally exact-deduplicated and had PII removed.
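The exact deduplication step could look roughly like the hash-based sketch below; this is a generic illustration, not the pipeline actually used for this model.

```python
# Generic sketch of global exact deduplication: identical documents are dropped
# by hashing their normalized text. Not the project's actual pipeline.
import hashlib

def dedupe_exact(documents):
    seen = set()
    for doc in documents:
        key = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc

corpus = ["Sama dokumentti.", "Sama dokumentti.", "Eri dokumentti."]
print(list(dedupe_exact(corpus)))  # the exact duplicate is removed
```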
66
- ### Pretraining data
67
- #### Data by language
68
- |Language| Tokens | % |
69
- |:-----:| :-----: |:-:|
70
- |Code| 14.12B | 3.6 |
71
- |English| 80.77B | 20.7 |
72
- |Finnish| 209.09B | 53.6 |
73
- |Latin| 0.94B | 0.3 |
74
- |Northern Sámi|1.07B | 0.3 |
75
- |Swedish| 80.09B | 20.5 |
76
- |Cross-lingual | 3.98B | 1.0 |
77
- |Total|390B|100|
78
- #### Individual datasets
79
- |Language| Dataset | Notes | Sampling fraction | Tokens |
80
- |:-----:| :-----: | :---: | :----: | :----: |
81
- |Code| Starcoder| GitHub issues | 0.83 | 12.8B |
82
- |Code| SmolLM | PythonEdu (score 5) | 30 | 1.4B |
83
- |English| British Library | - | 1 | 1.9B |
84
- |English| Europarl | English subset | 5 | 0.06B |
85
- |English| FineWeb-Edu fortified | - | 0.5 | 69.5B |
86
- |English| Natural Instructions | - | 1| 0.7B |
87
- |English| peS2o | - | 0.13 | 51.9B |
88
- |English| PubMed Central | - | 0.1 | 22.1B |
89
- |English| PubMed Abstracts | - | 1 | 3.8B |
90
- |English| Wikipedia | Dump 20241101| 9 | 3.8B |
91
- |Finnish| CC-fi | FinGPT | 4 | 10.8B |
92
- |Finnish| CulturaX | Finnish subset | 3.7 | 16.9B |
93
- |Finnish| HPLT 2.0 | Finnish subset | 3.7 | 19.1B |
94
- |Finnish| nlfcl-fi | Finnish subset | 6 | 0.02B |
95
- |Finnish| Europarl | Finnish subset | 6 | 0.12B |
96
- |Finnish| Lönnrot | FinGPT | 6 | 0.13B |
97
- |Finnish| Reddit-Fi | FinGPT | 6 | 0.11B |
98
- |Finnish| Suomi24 | FinGPT | 6 | 3.27B |
99
- |Finnish| Wikipedia | Dump 20241101| 30 | 0.13B |
100
- |Finnish| Yle | FinGPT | 30 | 0.22B |
101
- |Finnish| Ylilauta | - | 30 | 0.22B |
102
- |Latin| CulturaX | Latin subset | 30 | 0.03B |
103
- |Northern Sámi| Glot500 | Northern Sámi subset | 30 | 0.004B |
104
- |Northern Sámi| saami-web | - | 30 | 0.017B |
105
- |Northern Sámi| SALT | - | 30 | 0.015B |
106
- |Swedish| CulturaX | Swedish subset | 1.09 | 28.7B |
107
- |Swedish| Europarl | Swedish subset | 5 | 0.05B |
108
- |Swedish| fstc | - | 5 | 0.002B |
109
- |Swedish| HPLT 2.0 | Swedish subset | 1.05 | 35.8B |
110
- |Swedish| nlfcl-sv | Swedish subset | 5 | 0.014B |
111
- |Swedish| Wikipedia | Dump 20241101| 30 | 0.27B |
112
- |Swedish| Yle |Swedish subset | 30 | 0.27B |
113
- |Cross-lingual| Tatoeba | English-Finnish | 0.62 | 1.07B|
114
- |Cross-lingual| OPUS | English-Northern Sámi | 30 | 5K |
115
- |Cross-lingual| Tatoeba | English-Swedish | 0.57 | 1.15B |
116
- |Cross-lingual| Tatoeba | Finnish-English | 0.62 | 1.06B |
117
- |Cross-lingual| OPUS | Finnish-Northern Sámi| 30 | 12K |
118
- |Cross-lingual| Tatoeba | Finnish-Swedish| 5.7 | 0.12B |
119
- |Cross-lingual| OPUS | Northern Sámi-English| 30 | 5K |
120
- |Cross-lingual| OPUS | Northern Sámi-Finnish| 30 | 12K |
121
- |Cross-lingual| OPUS | Northern Sámi-Swedish| 30 | 0.8K |
122
- |Cross-lingual| Tatoeba | Swedish-English| 0.58 | 1.15B |
123
- |Cross-lingual| Tatoeba | Swedish-Finnish| 5.7 | 0.12B |
124
- |Cross-lingual| OPUS | Swedish-Northern Sámi| 30 | 0.8K |
125
- ### Annealing data
126
- Details coming soon.
127
- ## Evaluation results
128
- A complete set of evaluations is coming soon. A limited set of assessments using a modified version of [EuroEval](https://euroeval.com/) is presented in the tables below.
129
- For each model, five learning rates were tested against the validation set, and the F1 score was used as a metric to determine the optimal learning rate.
130
- Results are the means of 10 iterations on the bootstrapped versions of the training and test sets.
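As a sketch of this reporting scheme, scores would be aggregated over bootstrapped train/test resamples as below; `evaluate_once` is a hypothetical stand-in for a single EuroEval-style fine-tune-and-score run, not part of the actual benchmark code.

```python
# Sketch of mean±std reporting over 10 bootstrapped train/test resamples.
# `evaluate_once` is a hypothetical placeholder for one evaluation run.
import random
import statistics

def evaluate_once(train, test, seed):
    return 0.75 + random.Random(seed).uniform(-0.02, 0.02)  # placeholder F1 score

def bootstrapped_report(train, test, iterations=10):
    scores = []
    for i in range(iterations):
        rng = random.Random(i)
        boot_train = [rng.choice(train) for _ in train]  # resample with replacement
        boot_test = [rng.choice(test) for _ in test]
        scores.append(evaluate_once(boot_train, boot_test, seed=i))
    return statistics.mean(scores), statistics.stdev(scores)

mean, std = bootstrapped_report(train=list(range(100)), test=list(range(50)))
print(f"{mean:.2f}±{std:.2f}")
```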
131
 
132
- Results indicate that Finnish ModernBERT is competitive with other multilingual models in short-context settings and performs best on tasks that do not involve token-level predictions.
133
- ### Finnish
134
- | Model | scala-fi | scandisent-fi | turku-ner-fi | tydiqa-fi | Params (M) |
135
- | --- | --- | --- | --- | --- | --- |
136
- | FacebookAI/xlm-roberta-large | mcc: 50.84±3.76 \| macro_f1: 74.32±2.41 | mcc: 90.39±1.12 \| macro_f1: 95.18±0.56 | **micro_f1_no_misc: 84.31±1.35** \| **micro_f1: 81.93±1.07** | f1: 56.66±5.70 \| em: 35.34±4.34 | 561.2 |
137
- | TurkuNLP/bert-base-finnish-cased-v1 | mcc: 47.16±5.27 \| macro_f1: 72.98±2.47 | mcc: 90.16±0.50 \| macro_f1: 95.08±0.25 | micro_f1_no_misc: 82.04±1.33 \| micro_f1: 79.35±0.94 | f1: 56.20±1.42 \| em: 35.68±1.82 | 125.2 |
138
- | TurkuNLP/bert-large-finnish-cased-v1 | **mcc: 58.81±2.46** \| **macro_f1: 78.91±1.23** | **mcc: 91.69±0.60** \| **macro_f1: 95.85±0.30** | micro_f1_no_misc: 77.57±1.43 \| micro_f1: 74.50±1.74 | f1: 59.91±1.19 \| em: 39.10±1.18 | 355.2 |
139
- | TurkuNLP/finnish-modernbert-base | mcc: 24.81±6.66 \| macro_f1: 61.46±3.62 | mcc: 84.59±1.80 \| macro_f1: 92.26±0.89 | micro_f1_no_misc: 56.17±4.80 \| micro_f1: 56.03±4.91 | f1: 30.04±1.27 \| em: 14.22±1.25 | 143.4 |
140
- | TurkuNLP/finnish-modernbert-large | mcc: 51.88±3.07 \| macro_f1: 75.39±1.91 | mcc: 88.02±2.33 \| macro_f1: 93.99±1.18 | micro_f1_no_misc: 71.11±1.83 \| micro_f1: 70.47±1.44 | f1: 43.45±2.92 \| em: 23.47±2.90 | 401.3 |
141
- | TurkuNLP/finnish-modernbert-large-seq-len-1024-117300-annealed | mcc: 49.81±4.13 \| macro_f1: 74.58±2.10 | mcc: 88.50±2.88 \| macro_f1: 94.22±1.47 | micro_f1_no_misc: 71.16±2.41 \| micro_f1: 70.58±2.01 | f1: 42.40±3.43 \| em: 22.17±2.78 | 401.3 |
142
- | TurkuNLP/finnish-modernbert-tiny | mcc: 4.94±1.95 \| macro_f1: 51.89±1.24 | mcc: 76.15±1.93 \| macro_f1: 88.05±0.97 | micro_f1_no_misc: 52.45±1.23 \| micro_f1: 53.81±1.05 | f1: 29.63±0.42 \| em: 14.59±0.58 | 51.6 |
143
- | intfloat/multilingual-e5-large | mcc: 12.06±4.33 \| macro_f1: 54.51±3.19 | mcc: 90.77±0.70 \| macro_f1: 95.37±0.36 | micro_f1_no_misc: 80.55±1.28 \| micro_f1: 78.08±1.14 | **f1: 60.87±1.77** \| **em: 39.98±1.78** | 559.9 |
144
- ### Swedish
145
- | Model | scala-sv | scandiqa-sv | suc3 | swerec | Params (M) |
146
- | --- | --- | --- | --- | --- | --- |
147
- | AI-Sweden-Models/roberta-large-1160k | **mcc: 76.24±1.30** \| **macro_f1: 87.74±0.72** | **f1: 53.13±0.86** \| **em: 46.76±1.08** | **micro_f1_no_misc: 79.27±2.28** \| micro_f1: 76.65±2.03 | mcc: 77.43±0.65 \| macro_f1: 76.11±1.73 | 355.4 |
148
- | FacebookAI/xlm-roberta-large | mcc: 72.61±2.84 \| macro_f1: 85.79±1.42 | f1: 47.91±1.23 \| em: 41.40±1.00 | micro_f1_no_misc: 79.12±1.13 \| **micro_f1: 76.69±1.14** | mcc: 75.34±0.60 \| macro_f1: 70.16±2.52 | 561.2 |
149
- | TurkuNLP/finnish-modernbert-base | mcc: 58.79±2.50 \| macro_f1: 78.96±1.22 | f1: 29.98±2.03 \| em: 23.35±2.22 | micro_f1_no_misc: 51.67±3.10 \| micro_f1: 53.42±3.09 | mcc: 63.10±3.20 \| macro_f1: 62.47±4.03 | 143.4 |
150
- | TurkuNLP/finnish-modernbert-large | mcc: 69.42±3.72 \| macro_f1: 84.50±2.01 | f1: 34.26±0.85 \| em: 27.46±0.86 | micro_f1_no_misc: 59.99±2.42 \| micro_f1: 60.27±2.05 | mcc: 71.01±2.11 \| macro_f1: 71.36±1.14 | 401.3 |
151
- | TurkuNLP/finnish-modernbert-large-seq-len-1024-117300-annealed | mcc: 66.97±2.66 \| macro_f1: 83.38±1.36 | f1: 38.83±2.12 \| em: 32.53±2.09 | micro_f1_no_misc: 59.65±1.64 \| micro_f1: 59.91±1.33 | mcc: 70.18±3.77 \| macro_f1: 69.85±4.05 | 401.3 |
152
- | TurkuNLP/finnish-modernbert-tiny | mcc: 11.31±3.88 \| macro_f1: 54.81±2.30 | f1: 27.19±0.82 \| em: 19.54±0.97 | micro_f1_no_misc: 48.06±2.18 \| micro_f1: 49.55±1.87 | mcc: 63.73±1.75 \| macro_f1: 63.98±1.64 | 51.6 |
153
- | intfloat/multilingual-e5-large | mcc: 49.79±11.17 \| macro_f1: 73.39±6.85 | f1: 52.23±0.90 \| em: 44.44±1.34 | micro_f1_no_misc: 77.37±1.84 \| micro_f1: 75.75±1.76 | **mcc: 79.13±1.03** \| **macro_f1: 77.44±2.85** | 559.9 |
154
- ### English
155
- | Model | conll-en | scala-en | squad | sst5 | Params (M) |
156
- | --- | --- | --- | --- | --- | --- |
157
- | FacebookAI/xlm-roberta-large | micro_f1_no_misc: 88.74±1.06 \| micro_f1: 88.12±0.94 | mcc: 34.33±15.56 \| macro_f1: 64.04±9.79 | f1: 70.42±0.84 \| em: 57.34±0.82 | mcc: 58.86±1.33 \| macro_f1: 58.07±2.23 | 561.2 |
158
- | TurkuNLP/finnish-modernbert-base | micro_f1_no_misc: 70.64±2.52 \| micro_f1: 72.96±1.99 | mcc: 14.04±3.08 \| macro_f1: 56.21±1.86 | f1: 29.36±6.50 \| em: 18.20±5.63 | mcc: 33.81±3.80 \| macro_f1: 46.50±2.77 | 143.4 |
159
- | TurkuNLP/finnish-modernbert-large | micro_f1_no_misc: 79.73±1.29 \| micro_f1: 80.90±1.11 | mcc: 50.98±3.90 \| macro_f1: 74.94±2.06 | f1: 55.98±2.65 \| em: 40.35±2.57 | mcc: 37.08±5.53 \| macro_f1: 49.38±4.69 | 401.3 |
160
- | TurkuNLP/finnish-modernbert-large-seq-len-1024-117300-annealed | micro_f1_no_misc: 79.15±0.60 \| micro_f1: 80.20±0.47 | mcc: 46.82±5.34 \| macro_f1: 72.62±2.64 | f1: 58.70±1.98 \| em: 42.86±1.95 | mcc: 38.60±3.48 \| macro_f1: 51.67±3.58 | 401.3 |
161
- | TurkuNLP/finnish-modernbert-tiny | micro_f1_no_misc: 68.71±1.09 \| micro_f1: 71.02±0.89 | mcc: 4.72±2.12 \| macro_f1: 51.47±1.40 | f1: 12.00±0.47 \| em: 4.96±0.43 | mcc: 21.24±4.35 \| macro_f1: 40.46±2.94 | 51.6 |
162
- | intfloat/multilingual-e5-large | micro_f1_no_misc: 90.83±0.49 \| micro_f1: 90.08±0.41 | mcc: 37.27±8.82 \| macro_f1: 68.10±4.43 | f1: 72.19±0.85 \| em: 58.64±0.76 | **mcc: 65.11±0.97** \| **macro_f1: 64.68±2.38** | 559.9 |
163
- | microsoft/deberta-v3-base | **micro_f1_no_misc: 91.05±0.53** \| **micro_f1: 90.46±0.54** | **mcc: 64.68±1.29** \| **macro_f1: 81.85±0.67** | **f1: 75.68±0.86** \| **em: 62.80±0.98** | mcc: 62.03±1.05 \| macro_f1: 60.52±3.55 | 183.8 |
164
  ## Ethical Considerations and Limitations
165
- Finnish ModernBERT may produce representations that reflect biases and patterns present in its training data.
166
  The training data were not filtered for toxic, harmful, or offensive content to serve various use cases.
167
- ## Acknowledgements
168
- We thank [CSC](https://csc.fi/), the IT Center for Science in Finland, for the computational resources. We thank [The Language Bank of Finland](https://www.kielipankki.fi/) for additional resources for Finnish, Finland-Swedish, and Swedish.
169
- This research was also supported by the [HPLT project](https://hplt-project.org/) and the [Finnish Cultural Foundation](https://skr.fi/en/).
170
 
171
  ## Licence
172
- Finnish ModernBERT large is released under the Apache 2.0 license.
 
173
  ## Citation information
174
- Preprint coming soon. If you need to cite this work, please use the citation below:
175
- ```
176
- @misc {finnish_modernbert_2025,
177
- author = { Reunamo, Akseli and Pyysalo, Sampo },
178
- title = { Finnish-ModernBert: A Family of ModernBerts for Finnish languages },
179
- year = 2025,
180
- url = {https://huggingface.co/collections/TurkuNLP/finnish-modernberts-685bb5f2ab4d39d6480a16d4},
181
- publisher = { Hugging Face }
182
  }
183
- ```
184
-
 
21
  <img src="images/finnish_modernbert.png" alt="Finnish ModernBERT" width="600" height="600">
22
 
23
  # Finnish ModernBERT Model Card
24
+ Finnish ModernBERT large is an encoder model following the ModernBERT architecture, pretrained on Finnish, Swedish, English, Code, Latin, and Northern Sámi.
25
+ It was trained on 399.2B tokens. **This model can theoretically extrapolate to a context length of 128K tokens.**
26
  Training was conducted on the [LUMI supercomputer](https://www.lumi-supercomputer.eu/).
27
+ The project aimed to train multilingual encoder models that support long context and all official Finnish languages¹.
28
 
29
  ¹Multiple Sámi languages are spoken in Finland, but Northern Sámi is the most widespread and thus included in the training data. English is not the official language of Finland, but it is widely used. Latin was included for potential clinical use.
30
+
31
+ Full descriptions of training, data and evaluation are available in the [article](https://arxiv.org/abs/2511.09213).
32
+
33
  ## Table of Contents
34
  1. [Model Overview](#model-overview)
35
+ 2. [Evaluation](#evaluation)
36
+ 3. [Data](#data)
37
+ 4. [Training](#training)
38
+ 5. [Ethical Considerations and Limitations](#ethical-considerations-and-limitations)
39
+ 6. [Acknowledgements](#acknowledgements)
40
+ 7. [Licence](#licence)
41
+ 8. [Citation information](#citation-information)
42
+
43
  ## Model Overview
44
  | Hyperparameter | Value |
45
  | :------------- | :----: |
46
  | n_parameters | 401M |
47
  | n_layers | 28 |
48
+ | RoPE base | 10K / 1M |
49
+ | vocab_size | 56K |
50
+ | sequence_length | 16K / 128K |
51
+
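For quick experimentation, a minimal fill-mask usage sketch is shown below. The repository id is an assumption based on the evaluation tables in this card's history, and the mask token is read from the tokenizer rather than hard-coded.

```python
# Minimal usage sketch with the Transformers fill-mask pipeline.
# Repository id assumed from the evaluation tables; adjust if needed.
from transformers import pipeline

fill = pipeline("fill-mask", model="TurkuNLP/finnish-modernbert-large")
mask = fill.tokenizer.mask_token  # avoid hard-coding the mask token

for pred in fill(f"Helsinki on Suomen {mask}."):
    print(f"{pred['token_str']!r}\t{pred['score']:.3f}")
```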
52
+ ## Evaluation
53
+ The Finnish ModernBERT models were competitive with other multilingual models ([XLM-R-large](https://huggingface.co/FacebookAI/xlm-roberta-large) and [mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base))
54
+ on short-context NLU tasks in Finnish, Swedish, and English, where
55
+ XLM-R-large was the strongest model. The Finnish ModernBERTs were the strongest multilingual encoder models on out-of-domain retrieval
56
+ tasks, outperforming the others by a large margin.
57
+
58
+ ## Data
59
+ We used text datasets from diverse sources, including web crawls, news, scientific articles, classical literature, historical texts, Wikipedia, forums, and authoritative sources.
60
+ Sources underwent various levels of pre-processing, including the removal of low-quality text or boilerplate, PII removal, and deduplication.
61
+ Note that more datasets were used for training than are listed in this repository's metadata.
 
62
 
63
+ ## Training
64
+ Pretraining was done using Distributed Data Parallelism, AdamW with ZeroRedundancyOptimizer, and the WSD learning rate schedule.
65
+ We describe the training as a three-step process (the corresponding learning-rate schedule is sketched after the list):
66
+ 1. The model's parameters are first optimized for short-context token representations (stable phase)
67
+ 2. The token representations are then refined for longer-range dependencies (context extension phase)
68
+ 3. The representations are finally reinforced on the kinds of inputs we expect the model to be used for (annealing phase)
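As referenced above, the sketch below maps these three steps onto a single learning-rate schedule using the step counts and learning rates reported for the large model elsewhere in this card (117,300 stable steps at 3e-4; 16,560 context-extension steps at 5e-5; 4,139 annealing steps with 1-sqrt decay). Warmup is omitted because its length is not stated, and the assumption that annealing decays from 5e-5 is ours; treat this as an illustration, not the training code.

```python
# Illustrative three-phase schedule built from the step counts reported in this card.
# Warmup is omitted (length not stated); annealing is assumed to decay from 5e-5.
import math

STABLE_STEPS, STABLE_LR = 117_300, 3e-4
EXTEND_STEPS, EXTEND_LR = 16_560, 5e-5
ANNEAL_STEPS = 4_139

def learning_rate(step: int) -> float:
    if step < STABLE_STEPS:                    # 1. stable phase, sequence length 1,024
        return STABLE_LR
    step -= STABLE_STEPS
    if step < EXTEND_STEPS:                    # 2. context extension to 16K
        return EXTEND_LR
    step -= EXTEND_STEPS
    progress = min(step / ANNEAL_STEPS, 1.0)   # 3. annealing: 1-sqrt decay
    return EXTEND_LR * (1.0 - math.sqrt(progress))

for s in (0, 117_299, 120_000, 133_860, 137_998):
    print(s, f"{learning_rate(s):.2e}")
```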
70
  ## Ethical Considerations and Limitations
71
+ The Finnish ModernBERTs' training data include sources regarded as biased and harmful, and the models' outputs may mirror these biases.
72
  The training data were not filtered for toxic, harmful, or offensive content to serve various use cases.
73
+ The representations produced by the models should not be used without
74
+ caution and without evaluating their effects on vulnerable population groups.
 
75
 
76
  ## Licence
77
+ Finnish ModernBERT large is released under the Apache 2.0 license.
78
+
79
+ ## Acknowledgements
80
+ We acknowledge CSC, IT Center for Science, Finland, for awarding this project access to the LUMI supercomputer,
81
+ owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium. We acknowledge the HPLT project for supporting this research.
82
+ This project has received funding from
83
+ the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101070350,
84
+ and it has also received funding from the Finnish Cultural Foundation.
85
+ We thank [The Language Bank of Finland](https://www.kielipankki.fi/) for additional resources for Finnish, Finland-Swedish, and Swedish.
86
+
87
  ## Citation information
88
+ If you use the Finnish ModernBERTs or need to reference this work, please use the citation below:
89
+ ~~~
90
+ @misc{reunamo2025pretrainingfinnishmodernberts,
91
+ title={Pretraining Finnish ModernBERTs},
92
+ author={Akseli Reunamo and Laura-Maria Peltonen and Hans Moen and Sampo Pyysalo},
93
+ year={2025},
94
+ eprint={2511.09213},
95
+ archivePrefix={arXiv},
96
+ primaryClass={cs.CL},
97
+ url={https://arxiv.org/abs/2511.09213},
98
  }
99
+ ~~~