---
license: mit
language:
- en
tags:
- leetspeak
- text2text-generation
- byt5
- decoder
- translation
- normalization
datasets:
- wikitext
- eli5
metrics:
- bleu
- cer
pipeline_tag: translation
model-index:
- name: ByT5 Leetspeak Decoder V3
  results:
  - task:
      type: translation
      name: Leetspeak Decoding
    metrics:
    - type: accuracy
      name: Mixed-Number Accuracy
      value: 100.0
    - type: accuracy
      name: Basic Leet Accuracy
      value: 100.0
---

# ByT5 Leetspeak Decoder V3 (Production)

**The definitive byte-level translator for leetspeak, internet slang, and visual character obfuscation.**

Built on `google/byt5-base`, **V3** represents a major shift in training methodology from previous versions. It uses **Curriculum Learning** and **Adversarial Filtering** to resolve the context ambiguity between leetspeak numbers (e.g., "2" meaning "to") and actual quantities (e.g., "2 cats").

## Key Improvements in V3

| Feature | V2 (Legacy) | V3 (Current) |
| :--- | :--- | :--- |
| **Mixed-Number Context** | Struggled (~74%) | **100.0% Accuracy** |
| **Basic Leet Decoding** | 85% | **100.0% Accuracy** |
| **Visual Obfuscation** | Moderate | **High** (handles `|<1||`, `|-|`, etc.) |
| **Output Style** | Casual/Slang-heavy | **Formal/Standard English** |
| **Final Eval Loss** | 0.84 | **0.3812** |

### The "Number Problem" Solved
V3 is the first model in this series to perfectly distinguish between numbers used as letters and numbers used as quantities within the same sentence.
* **Input:** `1t5 2 l8 4 2 people`
* **V2 Output:** *It's to late for to people.* (Fail)
* **V3 Output:** *It is too late for 2 people.* (Pass)

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ilyyeees/byt5-leetspeak-decoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def decode_leet(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs, 
        max_length=256,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test Cases
print(decode_leet("1t5 2 l8 4 th4t"))         
# Output: It is too late for that.

print(decode_leet("1 g0t 100 p01nt5 0n 1t"))   
# Output: I got 100 points on it. (Preserves the '100' but decodes the rest)

print(decode_leet("idk wh4t 2 d0 tbh"))      
# Output: I don't know what to do to be honest. (Expands abbreviations)
```
## Training Methodology

V3 was trained on 2x NVIDIA RTX 5090s using a custom Reverse-Corruption Pipeline:

1. **Clean Base:** High-quality English from WikiText and ELI5 to ground the model in correct grammar.
2. **LLM Adversarial Corruption:** We used Qwen 2.5 72B to generate "Hard Negatives": specific leetspeak patterns that previous model versions failed to decode.
3. **Curriculum Learning:** The model was trained in phases of increasing difficulty, starting with simple character swaps and ending with complex visual noise and mixed-number ambiguity.
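The reverse-corruption idea can be sketched as a simple pair generator: start from clean English and apply leet substitutions, with a corruption-probability knob standing in for the curriculum schedule. The substitution table, probabilities, and helper names below are illustrative assumptions, not the card's actual pipeline:

```python
import random

# Hypothetical substitution table for illustration; the real training map
# used for V3 is not published on this card.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def corrupt(text, p=0.7, seed=None):
    """Replace mappable characters with leet equivalents with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        sub = LEET_MAP.get(ch.lower())
        out.append(sub if sub is not None and rng.random() < p else ch)
    return "".join(out)

def make_pair(clean, difficulty):
    """Curriculum knob: higher difficulty means heavier corruption."""
    return corrupt(clean, p=difficulty), clean

# (corrupted input, clean target) pair for seq2seq training
pair = make_pair("its too late for that", difficulty=0.9)
```

In a curriculum setup, early epochs would sample pairs at low `difficulty` and later epochs at high `difficulty`, mirroring the phased training described above.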

## Limitations & Bias

- **Formalization Bias:** Because V3 was trained on high-quality datasets (Wiki/ELI5), it has a bias toward formal English. It may expand casual slang into formal prose (e.g., converting `ngl` to *not gonna lie* or `idk` to *I don't know*). It generally avoids outputting slang words like *gonna* or *wanna* unless strongly prompted.
- **Short Inputs:** Extremely short, ambiguous inputs (1-2 characters) may be interpreted as standard English rather than leetspeak due to the conservative decoding threshold.
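If you want to avoid sending ambiguous or clearly non-leet inputs through the model at all, a crude client-side pre-filter can estimate how "leet" a string looks. The character set and threshold below are assumptions for illustration and are not part of the model:

```python
# Hypothetical client-side pre-filter (not part of the model): estimate the
# fraction of non-space characters that are common leet substitutes.
LEET_CHARS = set("01234578|<>@$")

def leet_ratio(text):
    """Fraction of non-space characters that are typical leet symbols."""
    stripped = text.replace(" ", "")
    if not stripped:
        return 0.0
    return sum(c in LEET_CHARS for c in stripped) / len(stripped)

def looks_like_leet(text, threshold=0.15):
    """Illustrative threshold; tune it on your own traffic."""
    return leet_ratio(text) >= threshold
```

A caller could skip decoding when `looks_like_leet` returns `False`, trading a little recall for fewer spurious rewrites of ordinary English.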

## Links

- **GitHub Repository:** ilyyeees/leet-speak-decoder
- **V2 Model (Legacy):** byt5-leetspeak-decoder-v2