manu committed on
Commit 3d4c773 · 1 Parent(s): 99bfaed

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +9 -2
README.md CHANGED
@@ -3,9 +3,10 @@
  Example sentence: `This is a test sentence. On va voir comment elle est gérée .... 123 + 56 = 2567. Let's go! Imagine I have code 4 spaces.
  and a backslash!! Eléonore est un prénom français. __name__ isInstance`
 
- Encoded sentence: `['▁Th', 'is', '▁is', '▁a', '▁test', '▁sent', 'ence.', '▁On', '▁va', '▁voir', '▁comment', '▁elle', '▁est', '▁g', 'érée', '▁....', '▁', '1', '2', '3', '▁+', '▁', '5', '6', '▁=', '▁', '2', '5', '6', '7', '.', "▁Let's", '▁go', '!', '▁Imagine', '▁I', '▁h', 'ave', '▁code', '▁', '▁', '▁', '▁', '4', '▁sp', 'aces.', '▁and', '▁a', '▁', '▁', '▁', '▁', '▁', '▁back', 's', 'l', 'ash', '!!', '▁E', '', 'on', 'ore', '▁est', '▁un', '▁prénom', '▁français.', '▁', '__', 'n', 'ame', '__', '▁is', 'In', 'st', 'ance']`
 
- Decoded sentence: `<s> This is a test sentence. On va voir comment elle est gérée .... 123 + 56 = 2567. Let's go! Imagine I have code 4 spaces. and a backslash!! Eléonore est un prénom français. __name__ isInstance`
 
 
  ## Usage
  ```python
@@ -25,6 +26,12 @@ The dataset consists of french, english and code samples
 
  More info on the dataset can be found [here](https://huggingface.co/datasets/manu/tok-corpus-shuffled)
 
  ## Tokenizer Configs
  Build from scratch: True
 
 
  Example sentence: `This is a test sentence. On va voir comment elle est gérée .... 123 + 56 = 2567. Let's go! Imagine I have code 4 spaces.
  and a backslash!! Eléonore est un prénom français. __name__ isInstance`
 
+ Encoded sentence: `['▁This', '▁is', '▁a', '▁test', '▁sent', 'ence.', '▁On', '▁va', '▁voir', '▁comment', '▁elle', '▁est', '▁g', 'érée', '▁....', '▁', '1', '2', '3', '▁+', '▁', '5', '6', '▁=', '▁', '2', '5', '6', '7', '.', "▁Let's", '▁go', '!', '▁Im', 'ag', 'ine', 'I', '▁have', '▁code', '▁', '▁', '▁', '▁', '4', '▁spaces', '.\n', '▁and', '▁a', '▁', '▁', '▁', '▁', '▁', '▁back', 'sl', 'ash', '!!', '▁El', 'éon', 'ore', '▁est', '▁un', '▁prénom', '▁français.', '▁__name__', '▁is', 'Instance']`
 
+ Decoded sentence: `<s> This is a test sentence. On va voir comment elle est gérée .... 123 + 56 = 2567. Let's go! Imagine I have code 4 spaces.
+ and a backslash!! Eléonore est un prénom français. __name__ isInstance`
 
  ## Usage
  ```python
 
 
  More info on the dataset can be found [here](https://huggingface.co/datasets/manu/tok-corpus-shuffled)
 
+ For speed purposes, the tokenizer was trained on a sample of the dataset. Only the first samples were selected.
+
+ Sample size: 5000000
+
+ Size of Sampled: 19.0 GB
+
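The added README lines say only that the first samples of the dataset were selected. A minimal sketch of that kind of head-sampling, assuming the corpus is available as a Python iterable of text samples; `take_first` and the toy `corpus` are hypothetical names for illustration, not the training script used for this commit:

```python
from itertools import islice
from typing import Iterable, Iterator


def take_first(samples: Iterable[str], n: int) -> Iterator[str]:
    """Return an iterator over at most the first n samples, in original order."""
    return islice(samples, n)


# Toy stand-in for the corpus (assumption). The README reports a sample size
# of 5000000; 3 is used here only to keep the example small.
corpus = (f"sample {i}" for i in range(10))
head = list(take_first(corpus, 3))
print(head)
```

Taking only the head is cheap because nothing past the first `n` items is read. Whether the result is representative depends on the dataset already being shuffled, which the `tok-corpus-shuffled` name suggests but this diff does not confirm.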
  ## Tokenizer Configs
  Build from scratch: True
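
The `▁` prefix in the encoded sentences above matches the SentencePiece-style convention in which `▁` marks a preceding space. As a rough sketch, assuming that convention holds for this tokenizer, pieces can be mapped back to text by concatenating them and replacing `▁` with a space. `detokenize` is a hypothetical helper, not the tokenizer's own decode method, and it ignores special tokens such as the `<s>` seen in the decoded sentence:

```python
def detokenize(tokens: list[str]) -> str:
    """Join SentencePiece-style pieces and turn the '▁' marker back into spaces."""
    text = "".join(tokens).replace("▁", " ")
    # Drop the single leading space produced by the first word-initial piece.
    return text[1:] if text.startswith(" ") else text


# First pieces of the new encoded sentence shown in the diff.
pieces = ['▁This', '▁is', '▁a', '▁test', '▁sent', 'ence.']
print(detokenize(pieces))  # This is a test sentence.
```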