mjbommar committed
Commit 967e04b · verified · 1 Parent(s): 45aefdc

Upload magic-bert-50m-classification model files

README.md ADDED
@@ -0,0 +1,324 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ library_name: transformers
+ tags:
+ - binary-analysis
+ - file-type-detection
+ - byte-level
+ - classification
+ - mime-type
+ - security
+ pipeline_tag: text-classification
+ base_model: magic-bert-50m-mlm
+ model-index:
+ - name: magic-bert-50m-classification
+   results:
+   - task:
+       type: text-classification
+       name: File Type Classification
+     metrics:
+     - name: Probing Accuracy
+       type: accuracy
+       value: 89.7
+     - name: Silhouette Score
+       type: silhouette
+       value: 0.55
+     - name: F1 (Weighted)
+       type: f1
+       value: 0.886
+ ---
+
+ # Magic-BERT 50M Classification
+
+ A BERT-style transformer model fine-tuned for binary file type classification. This model classifies binary files into 106 MIME types based on their content structure.
+
+ ## Why Not Just Use libmagic?
+
+ For intact files starting at byte 0, libmagic works well. But libmagic matches *signatures at fixed offsets*. Magic-BERT learns *structural patterns* throughout the file, enabling use cases where you don't have clean file boundaries:
+
+ - **Network streams**: Classifying packet payloads mid-connection, before headers arrive
+ - **Disk forensics**: Identifying file types during carving, when scanning raw disk images without filesystem metadata
+ - **Fragment analysis**: Working with partial files, slack space, or corrupted data
+ - **Adversarial contexts**: Detecting file types when magic bytes are stripped, spoofed, or deliberately misleading
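
To make the fixed-offset limitation concrete, here is a minimal sketch of signature matching. The `SIGNATURES` table and the `sniff_by_signature` helper are hypothetical illustrations and not libmagic's actual rule set; only the PDF and PNG magic bytes themselves are well-known constants.

```python
from typing import Optional

# A tiny, hypothetical signature table: (offset, magic bytes, MIME type).
# Real libmagic rules are far richer; this only illustrates the fixed-offset idea.
SIGNATURES = [
    (0, b"%PDF-", "application/pdf"),
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
]


def sniff_by_signature(data: bytes) -> Optional[str]:
    """Return a MIME type if a known signature sits at its expected offset."""
    for offset, magic, mime in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return mime
    return None


whole_file = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
fragment = whole_file[9:]  # same content, but the 9-byte header line is missing

print(sniff_by_signature(whole_file))  # application/pdf
print(sniff_by_signature(fragment))    # None
```

The fragment still contains recognizably PDF-like structure (`obj`, `/Type`, `endobj`), which is the kind of interior evidence a learned model can use when the header is unavailable.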
+
+ ## Model Description
+
+ This model extends magic-bert-50m-mlm with contrastive learning fine-tuning to produce embeddings optimized for file type discrimination. It uses a projection head and classifier trained with supervised contrastive loss.
+
+ | Property | Value |
+ |----------|-------|
+ | Parameters | 59M (+ 0.4M classifier head) |
+ | Hidden Size | 512 |
+ | Projection Dimension | 256 |
+ | Number of Classes | 106 MIME types |
+ | Base Model | magic-bert-50m-mlm |
+
+ ### Tokenizer
+
+ The tokenizer uses the Binary BPE methodology introduced in [Bommarito (2025)](https://arxiv.org/abs/2511.17573). The original Binary BPE tokenizers (available at [mjbommar/binary-tokenizer-001-64k](https://huggingface.co/mjbommar/binary-tokenizer-001-64k)) were trained exclusively on executable binaries (ELF, PE, Mach-O). This tokenizer uses the same BPE training approach but was trained on a diverse corpus spanning 106 file types.
+
+ ## Intended Uses
+
+ **Primary use cases:**
+ - File type classification from binary content
+ - MIME type detection without relying on file extensions
+ - Embedding-based file similarity search
+ - Security analysis and malware triage
+
+ **Example tasks:**
+ - Identifying file types in network traffic
+ - Classifying files with missing or incorrect extensions
+ - Building file type indexes for large archives
+
+ ## Detailed Use Cases
+
+ ### Network Traffic Analysis
+ When inspecting packet payloads, you often see file data mid-stream: TCP reassembly may give you bytes 1500-3000 of a PDF before you ever see byte 0. Traditional signature matching fails here. Classification embeddings can identify file types from interior content.
+
+ ### Disk Forensics & File Carving
+ During disk image analysis, you scan raw bytes looking for file boundaries. Tools like Scalpel rely on header/footer signatures, but many files lack clear footers. This model can score byte ranges for file type probability, helping identify carved fragments or validate carving results.
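
One way to produce the byte ranges to score is a sliding window over the raw image. The sketch below is a hypothetical helper: the 512-byte window mirrors the `f.read(512)` call in the How to Use section, while the 256-byte stride is an arbitrary assumption. Each yielded chunk would then be decoded with latin-1 and passed to the tokenizer and model.

```python
from typing import Iterator, Tuple


def iter_windows(data: bytes, window: int = 512, stride: int = 256) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, chunk) pairs covering `data` with overlapping windows."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not data:
        return
    # The last window starts at len(data) - window (or 0 for short inputs).
    last_start = max(len(data) - window, 0)
    for offset in range(0, last_start + 1, stride):
        yield offset, data[offset:offset + window]


image = bytes(range(256)) * 8  # 2048 bytes standing in for a raw disk image
offsets = [offset for offset, _ in iter_windows(image)]
print(offsets)  # [0, 256, 512, 768, 1024, 1280, 1536]
```

Overlapping windows trade extra inference cost for a better chance that a file boundary does not fall exactly between two scored ranges.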
+
+ ### Incident Response
+ Malware often strips or modifies magic bytes to evade detection. Polyglot files (valid as multiple types) exploit signature-based tools. Learning structural patterns provides a second opinion that doesn't rely solely on the first few bytes.
+
+ ### Similarity Search
+ The embedding space (256-dimensional, L2-normalized) enables similarity search across file collections: "find files structurally similar to this sample" for malware clustering, duplicate detection, or content-based retrieval.
+
+ ## MLM vs Classification: Two-Phase Training
+
+ This is the **Phase 2 (Classification)** model built on Magic-BERT. The training pipeline has two phases:
+
+ | Phase | Model | Task | Purpose |
+ |-------|-------|------|---------|
+ | Phase 1 | magic-bert-50m-mlm | Masked Language Modeling | Learn byte-level patterns and file structure |
+ | **Phase 2** | **This model** | Contrastive Learning | Optimize embeddings for file type discrimination |
+
+ ### Two-Phase Training
+
+ | Phase | Steps | Learning Rate | Objective |
+ |-------|-------|---------------|-----------|
+ | 1: MLM Pre-training | 100,000 | 1e-4 | Masked Language Modeling |
+ | 2: Contrastive Fine-tuning | 50,000 | 1e-6 | Supervised Contrastive Loss |
+
+ **Phase 2 specifics:**
+ - Frozen: Embeddings + first 4 transformer layers
+ - Learning rate: 100x lower than Phase 1
+ - Objective: Pull same-MIME-type samples together, push different types apart
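
The README does not spell out the exact loss, so the following is only a sketch of the commonly used supervised contrastive formulation (Khosla et al., 2020), written in plain Python for readability. The temperature of 0.1 and the function name are assumptions, and the embeddings are assumed to be L2-normalized.

```python
import math
from typing import List, Sequence


def supervised_contrastive_loss(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
    temperature: float = 0.1,
) -> float:
    """Average SupCon loss over anchors that have at least one positive."""
    n = len(embeddings)
    losses: List[float] = []
    for i in range(n):
        # Temperature-scaled similarity between anchor i and every other sample.
        sims = {
            j: sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(sum(math.exp(s - max_sim) for s in sims.values()))
        losses.append(-sum(sims[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


labels = [0, 0, 1, 1]
clustered = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # classes well separated
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]      # classes interleaved
print(supervised_contrastive_loss(clustered, labels) < supervised_contrastive_loss(mixed, labels))  # True
```

The loss is near zero when same-label samples coincide and far apart from other labels, and large when a sample's nearest neighbours carry a different label, which is the "pull together, push apart" behaviour described in the list above.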
+
+ ## Evaluation Results
+
+ ### Classification Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | Linear Probe Accuracy | 89.7% |
+ | F1 (Macro) | 0.787 |
+ | F1 (Weighted) | 0.886 |
+
+ ### Embedding Quality
+
+ | Metric | Value |
+ |--------|-------|
+ | Silhouette Score | 0.55 |
+ | Separation Ratio | 3.60 |
+ | Intra-class Distance | 12.6 |
+ | Inter-class Distance | 45.2 |
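
The README does not define these metrics. A plausible reading, assumed here, is that the separation ratio is the mean inter-class distance divided by the mean intra-class distance; the rounded table values give 45.2 / 12.6 ≈ 3.59, consistent with the reported 3.60 up to rounding. The sketch below computes the three quantities from mean pairwise Euclidean distances on toy 2-D points; `separation_stats` is a hypothetical helper, not part of the released code.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def separation_stats(
    points: Sequence[Sequence[float]], labels: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (mean intra-class distance, mean inter-class distance, inter/intra ratio)."""
    intra: List[float] = []
    inter: List[float] = []
    for i, j in combinations(range(len(points)), 2):
        dist = math.dist(points[i], points[j])
        # Same label contributes to intra-class spread, different label to inter-class gap.
        (intra if labels[i] == labels[j] else inter).append(dist)
    mean_intra = sum(intra) / len(intra)
    mean_inter = sum(inter) / len(inter)
    return mean_intra, mean_inter, mean_inter / mean_intra


points = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
labels = [0, 0, 1, 1]
mean_intra, mean_inter, ratio = separation_stats(points, labels)
print(mean_intra, round(mean_inter, 3), round(ratio, 3))
```

A ratio well above 1 means classes sit further from each other than their own members are spread out.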
+
+ ### MLM Capability (Retained)
+
+ | Metric | Value |
+ |--------|-------|
+ | Fill-mask Top-1 | 41.8% |
+ | Perplexity | 1.32 |
+
+ This model retains moderate fill-mask capability, making it suitable for hybrid tasks that need both classification and byte prediction.
+
+ ## Supported MIME Types (106 Classes)
+
+ The model classifies files into 106 MIME types across these categories:
+
+ | Category | Count | Examples |
+ |----------|-------|----------|
+ | application/ | 41 | PDF, ZIP, GZIP, Office docs, executables |
+ | text/ | 24 | Python, C, Java, HTML, XML, shell scripts |
+ | image/ | 18 | PNG, JPEG, GIF, WebP, TIFF, PSD |
+ | video/ | 9 | MP4, WebM, MKV, AVI, MOV |
+ | audio/ | 8 | MP3, FLAC, WAV, OGG, M4A |
+ | font/ | 3 | SFNT, WOFF, WOFF2 |
+ | other | 3 | biosig/atf, inode/x-empty, message/rfc822 |
+
+ <details>
+ <summary>Click to expand full MIME type list</summary>
+
+ **application/** (41 types):
+ - application/SIMH-tape-data, application/encrypted, application/gzip
+ - application/javascript, application/json, application/msword
+ - application/mxf, application/octet-stream, application/pdf
+ - application/pgp-keys, application/postscript
+ - application/vnd.microsoft.portable-executable, application/vnd.ms-excel
+ - application/vnd.ms-opentype, application/vnd.ms-powerpoint
+ - application/vnd.oasis.opendocument.spreadsheet
+ - application/vnd.openxmlformats-officedocument.* (3 variants)
+ - application/vnd.rn-realmedia, application/vnd.wordperfect
+ - application/wasm, application/x-7z-compressed, application/x-archive
+ - application/x-bzip2, application/x-coff, application/x-dbf
+ - application/x-dosexec, application/x-executable
+ - application/x-gettext-translation, application/x-ms-ne-executable
+ - application/x-ndjson, application/x-object, application/x-ole-storage
+ - application/x-sharedlib, application/x-shockwave-flash
+ - application/x-tar, application/x-wine-extension-ini
+ - application/zip, application/zlib, application/zstd
+
+ **text/** (24 types):
+ - text/csv, text/html, text/plain, text/rtf, text/troff
+ - text/x-Algol68, text/x-asm, text/x-c, text/x-c++
+ - text/x-diff, text/x-file, text/x-fortran, text/x-java
+ - text/x-m4, text/x-makefile, text/x-msdos-batch, text/x-perl
+ - text/x-php, text/x-po, text/x-ruby, text/x-script.python
+ - text/x-shellscript, text/x-tex, text/xml
+
+ **image/** (18 types):
+ - image/bmp, image/fits, image/gif, image/heif, image/jpeg
+ - image/png, image/svg+xml, image/tiff, image/vnd.adobe.photoshop
+ - image/vnd.microsoft.icon, image/webp, image/x-eps, image/x-exr
+ - image/x-jp2-codestream, image/x-portable-bitmap
+ - image/x-portable-greymap, image/x-tga, image/x-xpixmap
+
+ **video/** (9 types):
+ - video/3gpp, video/mp4, video/mpeg, video/quicktime, video/webm
+ - video/x-ivf, video/x-matroska, video/x-ms-asf, video/x-msvideo
+
+ **audio/** (8 types):
+ - audio/amr, audio/flac, audio/mpeg, audio/ogg, audio/x-ape
+ - audio/x-hx-aac-adts, audio/x-m4a, audio/x-wav
+
+ **font/** (3 types):
+ - font/sfnt, font/woff, font/woff2
+
+ **other** (3 types):
+ - biosig/atf, inode/x-empty, message/rfc822
+
+ </details>
+
+ ## How to Use
+
+ ```python
+ from transformers import AutoTokenizer
+ from safetensors.torch import load_file
+ import torch
+ import json
+
+ # Load tokenizer and MIME mapping
+ tokenizer = AutoTokenizer.from_pretrained("path/to/magic-bert-50m-classification")
+ with open("path/to/magic-bert-50m-classification/mime_type_mapping.json") as f:
+     mime_mapping = json.load(f)
+ id_to_mime = {int(k): v for k, v in mime_mapping.items()}
+
+ # Load model
+ from modeling_magic_bert import MagicBERTForSequenceClassification
+ from configuration_magic_bert import MagicBERTConfig
+
+ config = MagicBERTConfig.from_pretrained("path/to/magic-bert-50m-classification")
+ model = MagicBERTForSequenceClassification(config)
+
+ # Load base model weights
+ state_dict = load_file("path/to/magic-bert-50m-classification/model.safetensors")
+ model.load_state_dict(state_dict, strict=False)
+
+ # Load contrastive head weights
+ contrastive_dict = load_file("path/to/magic-bert-50m-classification/contrastive_head.safetensors")
+ model.projection.load_state_dict({k.replace("projection.", ""): v for k, v in contrastive_dict.items() if "projection" in k})
+ model.classifier.load_state_dict({k.replace("classifier.", ""): v for k, v in contrastive_dict.items() if "classifier" in k})
+
+ model.eval()
+
+ # Classify a file
+ with open("example.pdf", "rb") as f:
+     data = f.read(512)
+
+ # Decode bytes to string using latin-1 (preserves all byte values 0-255)
+ text = data.decode("latin-1")
+
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predicted_id = outputs.logits.argmax(-1).item()
+
+ print(f"Predicted MIME type: {id_to_mime[predicted_id]}")
+ ```
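
The latin-1 decoding step matters because it is a lossless, one-to-one mapping between byte values and code points. The short standard-library check below illustrates that property and contrasts it with UTF-8, which rejects arbitrary binary data; it is a demonstration only and not part of the model's code.

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# latin-1 maps each byte value to the Unicode code point with the same number.
text = all_bytes.decode("latin-1")
assert len(text) == 256
assert [ord(ch) for ch in text] == list(range(256))

# The mapping is reversible, so no byte is lost or altered.
assert text.encode("latin-1") == all_bytes

# UTF-8, by contrast, rejects many byte sequences outright.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```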
+
+ ### Getting Embeddings for Similarity Search
+
+ ```python
+ # Get normalized projection embeddings
+ with torch.no_grad():
+     embeddings = model.get_embeddings(inputs["input_ids"], inputs["attention_mask"])
+ # embeddings shape: [batch_size, 256], L2 normalized
+
+ # Compute cosine similarity between files
+ # (embeddings1 and embeddings2 are get_embeddings outputs for two batches of files)
+ similarity = torch.mm(embeddings1, embeddings2.T)
+ ```
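
Because the embeddings are L2-normalized, a plain dot product equals cosine similarity, which is why a single matrix multiply is enough. The sketch below shows the same idea in plain Python with toy 3-dimensional vectors standing in for the 256-dimensional embeddings; `l2_normalize`, `dot`, and `rank_by_similarity` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(
    query: Sequence[float], corpus: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) pairs, most similar first."""
    scores = [(idx, dot(query, item)) for idx, item in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


query = l2_normalize([1.0, 2.0, 0.0])
corpus = [l2_normalize(v) for v in ([2.0, 4.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.0, 0.0])]
ranking = rank_by_similarity(query, corpus)
print([idx for idx, _ in ranking])  # [0, 2, 1]
```

For large collections, the same ranking is usually delegated to a vector index rather than computed exhaustively.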
+
+ ## Limitations
+
+ 1. **Position bias:** Best performance when content starts at position 0. Accuracy degrades for content at higher offsets.
+
+ 2. **Class imbalance:** Performance varies by file type. Common formats (PDF, PNG, ZIP) perform better than rare formats.
+
+ 3. **Ambiguous types:** Some file types share similar structure (e.g., ZIP-based formats like DOCX, XLSX, JAR), which can cause confusion.
+
+ 4. **Encrypted content:** Cannot classify encrypted or compressed content that lacks recognizable patterns.
+
+ ## Architecture: Absolute vs Rotary Position Embeddings
+
+ This model uses **absolute position embeddings**, where each position (0-511) has a learned embedding vector. An alternative is **Rotary Position Embeddings (RoPE)**, used by the RoFormer variant.
+
+ | Metric | Magic-BERT (this) | RoFormer |
+ |--------|-------------------|----------|
+ | Classification Accuracy | 89.7% | **93.7%** |
+ | Silhouette Score | 0.55 | **0.663** |
+ | F1 (Weighted) | 0.886 | **0.933** |
+ | Fill-mask Retention | **41.8%** | 14.5% |
+ | Parameters | 59M | **42M** |
+
+ Magic-BERT retains better fill-mask capability after classification fine-tuning, making it suitable when both tasks are needed. For pure classification, consider the RoFormer variant.
+
+ ## Model Selection Guide
+
+ | Use Case | Recommended Model | Reason |
+ |----------|-------------------|--------|
+ | Classification + fill-mask | **This model** | Retains 41.8% fill-mask capability |
+ | Fill-mask / byte prediction | magic-bert-50m-mlm | Best perplexity (1.05) |
+ | Research baseline | magic-bert-50m-mlm | Established BERT architecture |
+ | **Production classification** | **magic-bert-50m-roformer-classification** | Highest accuracy (93.7%), efficient (42M params) |
+
+ ## Related Models
+
+ - **magic-bert-50m-mlm**: Base model before classification fine-tuning
+ - **magic-bert-50m-roformer-mlm**: RoFormer variant with rotary position embeddings
+ - **magic-bert-50m-roformer-classification**: RoFormer variant with higher classification accuracy (93.7%, recommended for production)
+
+ ## Related Work
+
+ This model builds on the Binary BPE tokenization approach:
+
+ - **Binary BPE Paper**: [Bommarito (2025)](https://arxiv.org/abs/2511.17573) introduced byte-level BPE tokenization for binary analysis, demonstrating 2-3x compression over raw bytes for executable content.
+ - **Binary BPE Tokenizers**: Pre-trained tokenizers for executables are available at [mjbommar/binary-tokenizer-001-64k](https://huggingface.co/mjbommar/binary-tokenizer-001-64k).
+
+ **Key difference**: The original Binary BPE work focused on executable binaries (ELF, PE, Mach-O). Magic-BERT extends this to general file type understanding across 106 diverse formats, using a tokenizer trained on the broader dataset.
+
+ ## Citation
+
+ A paper describing Magic-BERT, the training methodology, and the dataset is forthcoming.
+
+ ```bibtex
+ @article{bommarito2025binarybpe,
+   title={Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis},
+   author={Bommarito, Michael J., II},
+   journal={arXiv preprint arXiv:2511.17573},
+   year={2025}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "model_type": "magic-bert",
+   "architectures": [
+     "MagicBERTForSequenceClassification"
+   ],
+   "vocab_size": 32768,
+   "hidden_size": 512,
+   "num_hidden_layers": 8,
+   "num_attention_heads": 8,
+   "intermediate_size": 2048,
+   "hidden_dropout_prob": 0.1,
+   "attention_probs_dropout_prob": 0.1,
+   "max_position_embeddings": 512,
+   "pad_token_id": 2,
+   "hidden_act": "gelu",
+   "layer_norm_eps": 1e-12,
+   "torch_dtype": "float32",
+   "transformers_version": "4.57.0",
+   "auto_map": {
+     "AutoConfig": "configuration_magic_bert.MagicBERTConfig",
+     "AutoModel": "modeling_magic_bert.MagicBERTModel",
+     "AutoModelForMaskedLM": "modeling_magic_bert.MagicBERTForMaskedLM",
+     "AutoModelForSequenceClassification": "modeling_magic_bert.MagicBERTForSequenceClassification"
+   },
+   "num_labels": 106,
+   "contrastive_projection_dim": 256
+ }
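
As a sanity check on the "59M" figure in the README, the parameter count can be estimated from these config values and the layer layout in `modeling_magic_bert.py`. This is a back-of-the-envelope sketch that assumes the main checkpoint stores the embeddings, the 8 encoder layers, and an untied MLM head in float32; the diff does not list the tensor names, so that composition is an assumption.

```python
# Values copied from config.json.
vocab_size = 32768
hidden_size = 512
num_hidden_layers = 8
intermediate_size = 2048
max_position_embeddings = 512


def linear(n_in: int, n_out: int) -> int:
    """Parameters in a dense layer with bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden_size  # weight + bias

embeddings = (
    vocab_size * hidden_size                 # token embeddings
    + max_position_embeddings * hidden_size  # absolute position embeddings
    + layer_norm                             # embedding LayerNorm
)
per_layer = (
    4 * linear(hidden_size, hidden_size)       # query, key, value, attention output
    + linear(hidden_size, intermediate_size)   # feed-forward up-projection
    + linear(intermediate_size, hidden_size)   # feed-forward down-projection
    + 2 * layer_norm                           # post-attention and post-FFN norms
)
encoder = num_hidden_layers * per_layer
mlm_head = linear(hidden_size, vocab_size)     # untied output projection

total = embeddings + encoder + mlm_head
print(f"{total:,} parameters, {total * 4:,} bytes as float32")
# 59,069,440 parameters, 236,277,760 bytes as float32
```

Under these assumptions the estimate lands about 14 KB below the 236,291,992-byte size in the `model.safetensors` LFS pointer, a gap that would be consistent with the file's metadata header.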
configuration_magic_bert.py ADDED
@@ -0,0 +1,40 @@
+ """MagicBERT configuration for HuggingFace transformers."""
+
+ from transformers import PretrainedConfig
+
+
+ class MagicBERTConfig(PretrainedConfig):
+     """Configuration class for MagicBERT model.
+
+     MagicBERT is a BERT-style transformer model designed for binary file
+     type classification. It uses a byte-level BPE tokenizer with a 32K vocabulary.
+     """
+
+     model_type = "magic-bert"
+
+     def __init__(
+         self,
+         vocab_size: int = 32768,
+         hidden_size: int = 512,
+         num_hidden_layers: int = 8,
+         num_attention_heads: int = 8,
+         intermediate_size: int = 2048,
+         hidden_dropout_prob: float = 0.1,
+         attention_probs_dropout_prob: float = 0.1,
+         max_position_embeddings: int = 512,
+         pad_token_id: int = 2,
+         hidden_act: str = "gelu",
+         layer_norm_eps: float = 1e-12,
+         **kwargs,
+     ):
+         super().__init__(pad_token_id=pad_token_id, **kwargs)
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.intermediate_size = intermediate_size
+         self.hidden_dropout_prob = hidden_dropout_prob
+         self.attention_probs_dropout_prob = attention_probs_dropout_prob
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_act = hidden_act
+         self.layer_norm_eps = layer_norm_eps
contrastive_head.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f737801d763977c076592b9689dfd994994af17769fb78cdcb4ea72d26ea9a20
+ size 1685408
mime_type_mapping.json ADDED
@@ -0,0 +1,108 @@
+ {
+   "0": "application/SIMH-tape-data",
+   "1": "application/encrypted",
+   "2": "application/gzip",
+   "3": "application/javascript",
+   "4": "application/json",
+   "5": "application/msword",
+   "6": "application/mxf",
+   "7": "application/octet-stream",
+   "8": "application/pdf",
+   "9": "application/pgp-keys",
+   "10": "application/postscript",
+   "11": "application/vnd.microsoft.portable-executable",
+   "12": "application/vnd.ms-excel",
+   "13": "application/vnd.ms-opentype",
+   "14": "application/vnd.ms-powerpoint",
+   "15": "application/vnd.oasis.opendocument.spreadsheet",
+   "16": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
+   "17": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
+   "18": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+   "19": "application/vnd.rn-realmedia",
+   "20": "application/vnd.wordperfect",
+   "21": "application/wasm",
+   "22": "application/x-7z-compressed",
+   "23": "application/x-archive",
+   "24": "application/x-bzip2",
+   "25": "application/x-coff",
+   "26": "application/x-dbf",
+   "27": "application/x-dosexec",
+   "28": "application/x-executable",
+   "29": "application/x-gettext-translation",
+   "30": "application/x-ms-ne-executable",
+   "31": "application/x-ndjson",
+   "32": "application/x-object",
+   "33": "application/x-ole-storage",
+   "34": "application/x-sharedlib",
+   "35": "application/x-shockwave-flash",
+   "36": "application/x-tar",
+   "37": "application/x-wine-extension-ini",
+   "38": "application/zip",
+   "39": "application/zlib",
+   "40": "application/zstd",
+   "41": "audio/amr",
+   "42": "audio/flac",
+   "43": "audio/mpeg",
+   "44": "audio/ogg",
+   "45": "audio/x-ape",
+   "46": "audio/x-hx-aac-adts",
+   "47": "audio/x-m4a",
+   "48": "audio/x-wav",
+   "49": "biosig/atf",
+   "50": "font/sfnt",
+   "51": "font/woff",
+   "52": "font/woff2",
+   "53": "image/bmp",
+   "54": "image/fits",
+   "55": "image/gif",
+   "56": "image/heif",
+   "57": "image/jpeg",
+   "58": "image/png",
+   "59": "image/svg+xml",
+   "60": "image/tiff",
+   "61": "image/vnd.adobe.photoshop",
+   "62": "image/vnd.microsoft.icon",
+   "63": "image/webp",
+   "64": "image/x-eps",
+   "65": "image/x-exr",
+   "66": "image/x-jp2-codestream",
+   "67": "image/x-portable-bitmap",
+   "68": "image/x-portable-greymap",
+   "69": "image/x-tga",
+   "70": "image/x-xpixmap",
+   "71": "inode/x-empty",
+   "72": "message/rfc822",
+   "73": "text/csv",
+   "74": "text/html",
+   "75": "text/plain",
+   "76": "text/rtf",
+   "77": "text/troff",
+   "78": "text/x-Algol68",
+   "79": "text/x-asm",
+   "80": "text/x-c",
+   "81": "text/x-c++",
+   "82": "text/x-diff",
+   "83": "text/x-file",
+   "84": "text/x-fortran",
+   "85": "text/x-java",
+   "86": "text/x-m4",
+   "87": "text/x-makefile",
+   "88": "text/x-msdos-batch",
+   "89": "text/x-perl",
+   "90": "text/x-php",
+   "91": "text/x-po",
+   "92": "text/x-ruby",
+   "93": "text/x-script.python",
+   "94": "text/x-shellscript",
+   "95": "text/x-tex",
+   "96": "text/xml",
+   "97": "video/3gpp",
+   "98": "video/mp4",
+   "99": "video/mpeg",
+   "100": "video/quicktime",
+   "101": "video/webm",
+   "102": "video/x-ivf",
+   "103": "video/x-matroska",
+   "104": "video/x-ms-asf",
+   "105": "video/x-msvideo"
+ }
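
The mapping keys are JSON strings, so they need converting to integers before they can be indexed with the model's predicted class ID. The sketch below uses a small hand-copied excerpt of the mapping to show that conversion and a per-category tally; in practice the full file would be read with `json.load`. The variable names are illustrative.

```python
from collections import Counter

# A small excerpt of mime_type_mapping.json, copied by hand for this example.
mime_mapping = {
    "8": "application/pdf",
    "38": "application/zip",
    "43": "audio/mpeg",
    "58": "image/png",
    "75": "text/plain",
    "93": "text/x-script.python",
    "98": "video/mp4",
}

# JSON object keys are strings; the classifier emits integer class IDs.
id_to_mime = {int(class_id): mime for class_id, mime in mime_mapping.items()}

# Group classes by top-level media type (the part before the slash).
by_category = Counter(mime.split("/", 1)[0] for mime in id_to_mime.values())

print(id_to_mime[58])                                   # image/png
print(by_category["application"], by_category["text"])  # 2 2
```

Running the same tally over the full mapping is a quick way to cross-check the category counts quoted in the README table.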
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a3923cd4384639bde231f53f2b40822cc71fdc920d43bf4b97a5b6edafad3d2c
+ size 236291992
modeling_magic_bert.py ADDED
@@ -0,0 +1,346 @@
1
+ """MagicBERT model implementation for HuggingFace transformers.
2
+
3
+ This module provides HuggingFace-compatible implementations of MagicBERT,
4
+ a BERT-style model trained for binary file type understanding.
5
+ """
6
+
7
+ import math
8
+ from dataclasses import dataclass
9
+ from typing import Optional, Tuple, Union
10
+
11
+ import torch
12
+ import torch.nn as nn
13
+ import torch.nn.functional as F
14
+ from transformers import PreTrainedModel
15
+ from transformers.modeling_outputs import (
16
+ MaskedLMOutput,
17
+ SequenceClassifierOutput,
18
+ BaseModelOutput,
19
+ )
20
+
21
+ try:
22
+ from .configuration_magic_bert import MagicBERTConfig
23
+ except ImportError:
24
+ from configuration_magic_bert import MagicBERTConfig
25
+
26
+
27
+ class MagicBERTEmbeddings(nn.Module):
28
+ """MagicBERT embeddings: token + position embeddings."""
29
+
30
+ def __init__(self, config: MagicBERTConfig):
31
+ super().__init__()
32
+ self.token_embeddings = nn.Embedding(
33
+ config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id
34
+ )
35
+ self.position_embeddings = nn.Embedding(
36
+ config.max_position_embeddings, config.hidden_size
37
+ )
38
+ self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
39
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
40
+
41
+ self.register_buffer(
42
+ "position_ids",
43
+ torch.arange(config.max_position_embeddings).expand((1, -1)),
44
+ persistent=False,
45
+ )
46
+
47
+ def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
48
+ batch_size, seq_length = input_ids.shape
49
+ token_embeds = self.token_embeddings(input_ids)
50
+ position_ids = self.position_ids[:, :seq_length]
51
+ position_embeds = self.position_embeddings(position_ids)
52
+ embeddings = token_embeds + position_embeds
53
+ embeddings = self.layer_norm(embeddings)
54
+ embeddings = self.dropout(embeddings)
55
+ return embeddings
56
+
57
+
58
+ class MagicBERTAttention(nn.Module):
59
+ """Multi-head self-attention."""
60
+
61
+ def __init__(self, config: MagicBERTConfig):
62
+ super().__init__()
63
+ self.num_attention_heads = config.num_attention_heads
64
+ self.attention_head_size = config.hidden_size // config.num_attention_heads
65
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
66
+
67
+ self.query = nn.Linear(config.hidden_size, self.all_head_size)
68
+ self.key = nn.Linear(config.hidden_size, self.all_head_size)
69
+ self.value = nn.Linear(config.hidden_size, self.all_head_size)
70
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
71
+
72
+ def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
73
+ new_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
74
+ x = x.view(new_shape)
75
+ return x.permute(0, 2, 1, 3)
76
+
77
+ def forward(
78
+ self,
79
+ hidden_states: torch.Tensor,
80
+ attention_mask: Optional[torch.Tensor] = None,
81
+ ) -> torch.Tensor:
82
+ query_layer = self.transpose_for_scores(self.query(hidden_states))
83
+ key_layer = self.transpose_for_scores(self.key(hidden_states))
84
+ value_layer = self.transpose_for_scores(self.value(hidden_states))
85
+
86
+ attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
87
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
88
+
89
+ if attention_mask is not None:
90
+ attention_mask = attention_mask[:, None, None, :]
91
+ attention_scores = attention_scores + (1.0 - attention_mask) * -10000.0
92
+
93
+ attention_probs = F.softmax(attention_scores, dim=-1)
94
+ attention_probs = self.dropout(attention_probs)
95
+ context = torch.matmul(attention_probs, value_layer)
96
+ context = context.permute(0, 2, 1, 3).contiguous()
97
+ new_shape = context.size()[:-2] + (self.all_head_size,)
98
+ context = context.view(new_shape)
99
+ return context
100
+
101
+
102
+ class MagicBERTLayer(nn.Module):
103
+ """Single transformer layer."""
104
+
105
+ def __init__(self, config: MagicBERTConfig):
106
+ super().__init__()
107
+ self.attention = MagicBERTAttention(config)
108
+ self.attention_output = nn.Linear(config.hidden_size, config.hidden_size)
109
+ self.attention_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
110
+ self.attention_dropout = nn.Dropout(config.hidden_dropout_prob)
111
+
112
+ self.intermediate = nn.Linear(config.hidden_size, config.intermediate_size)
113
+ self.output = nn.Linear(config.intermediate_size, config.hidden_size)
114
+ self.output_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
115
+ self.output_dropout = nn.Dropout(config.hidden_dropout_prob)
116
+
117
+ def forward(
118
+ self,
119
+ hidden_states: torch.Tensor,
120
+ attention_mask: Optional[torch.Tensor] = None,
121
+ ) -> torch.Tensor:
122
+ # Self-attention with residual
123
+ attention_output = self.attention(hidden_states, attention_mask)
124
+ attention_output = self.attention_output(attention_output)
125
+ attention_output = self.attention_dropout(attention_output)
126
+ attention_output = self.attention_norm(hidden_states + attention_output)
127
+
128
+ # Feed-forward with residual
129
+ intermediate_output = self.intermediate(attention_output)
130
+ intermediate_output = F.gelu(intermediate_output)
131
+ layer_output = self.output(intermediate_output)
132
+ layer_output = self.output_dropout(layer_output)
133
+ layer_output = self.output_norm(attention_output + layer_output)
134
+ return layer_output
135
+
136
+
137
+ class MagicBERTEncoder(nn.Module):
138
+ """Stack of transformer layers."""
139
+
140
+ def __init__(self, config: MagicBERTConfig):
141
+ super().__init__()
142
+ self.layers = nn.ModuleList(
143
+ [MagicBERTLayer(config) for _ in range(config.num_hidden_layers)]
144
+ )
145
+
146
+ def forward(
147
+ self,
148
+ hidden_states: torch.Tensor,
149
+ attention_mask: Optional[torch.Tensor] = None,
150
+ ) -> torch.Tensor:
151
+ for layer in self.layers:
152
+ hidden_states = layer(hidden_states, attention_mask)
153
+ return hidden_states
154
+
155
+
156
+ class MagicBERTPreTrainedModel(PreTrainedModel):
157
+ """Base class for MagicBERT models."""
158
+
159
+ config_class = MagicBERTConfig
160
+ base_model_prefix = "magic_bert"
161
+ supports_gradient_checkpointing = False
162
+
163
+ def _init_weights(self, module):
164
+ if isinstance(module, nn.Linear):
165
+ module.weight.data.normal_(mean=0.0, std=0.02)
166
+ if module.bias is not None:
167
+ module.bias.data.zero_()
168
+ elif isinstance(module, nn.Embedding):
169
+ module.weight.data.normal_(mean=0.0, std=0.02)
170
+ if module.padding_idx is not None:
171
+ module.weight.data[module.padding_idx].zero_()
172
+ elif isinstance(module, nn.LayerNorm):
173
+ module.bias.data.zero_()
174
+ module.weight.data.fill_(1.0)
175
+
176
+
177
+ class MagicBERTModel(MagicBERTPreTrainedModel):
178
+ """MagicBERT base model outputting raw hidden states."""
179
+
180
+ def __init__(self, config: MagicBERTConfig):
181
+ super().__init__(config)
182
+ self.config = config
183
+ self.embeddings = MagicBERTEmbeddings(config)
184
+ self.encoder = MagicBERTEncoder(config)
185
+ self.post_init()
186
+
187
+ def forward(
188
+ self,
189
+ input_ids: torch.Tensor,
190
+ attention_mask: Optional[torch.Tensor] = None,
191
+ token_type_ids: Optional[torch.Tensor] = None, # Ignored, for tokenizer compatibility
192
+ return_dict: Optional[bool] = None,
193
+ ) -> Union[Tuple[torch.Tensor, torch.Tensor], BaseModelOutput]:
194
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
195
+
196
+ hidden_states = self.embeddings(input_ids)
197
+ sequence_output = self.encoder(hidden_states, attention_mask)
198
+ pooled_output = sequence_output[:, 0, :]
199
+
200
+ if not return_dict:
201
+ return (sequence_output, pooled_output)
202
+
203
+ return BaseModelOutput(
204
+ last_hidden_state=sequence_output,
205
+ hidden_states=None,
206
+ attentions=None,
207
+ )
208
+
209
+
210
+ class MagicBERTForMaskedLM(MagicBERTPreTrainedModel):
+     """MagicBERT for masked language modeling (fill-mask task)."""
+
+     def __init__(self, config: MagicBERTConfig):
+         super().__init__(config)
+         self.config = config
+         self.embeddings = MagicBERTEmbeddings(config)
+         self.encoder = MagicBERTEncoder(config)
+         self.mlm_head = nn.Linear(config.hidden_size, config.vocab_size)
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,  # Ignored, for tokenizer compatibility
+         labels: Optional[torch.Tensor] = None,
+         return_dict: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor, ...], MaskedLMOutput]:
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         hidden_states = self.embeddings(input_ids)
+         sequence_output = self.encoder(hidden_states, attention_mask)
+         logits = self.mlm_head(sequence_output)
+
+         loss = None
+         if labels is not None:
+             # Positions labeled -100 are excluded from the loss.
+             loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
+             loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
+
+         if not return_dict:
+             output = (logits,)
+             return ((loss,) + output) if loss is not None else output
+
+         return MaskedLMOutput(
+             loss=loss,
+             logits=logits,
+             hidden_states=None,
+             attentions=None,
+         )
+
+     def get_embeddings(
+         self,
+         input_ids: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         pooling: str = "cls",
+     ) -> torch.Tensor:
+         """Get embeddings for downstream tasks.
+
+         Args:
+             input_ids: Input token IDs
+             attention_mask: Attention mask
+             pooling: Pooling strategy ("cls" or "mean")
+
+         Returns:
+             Pooled embeddings [batch_size, hidden_size]
+         """
+         hidden_states = self.embeddings(input_ids)
+         sequence_output = self.encoder(hidden_states, attention_mask)
+
+         if pooling == "cls":
+             return sequence_output[:, 0, :]
+         elif pooling == "mean":
+             if attention_mask is not None:
+                 # Average over attended positions only; the clamp guards
+                 # against division by zero for an all-zero mask.
+                 mask = attention_mask.unsqueeze(-1).float()
+                 return (sequence_output * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
+             return sequence_output.mean(dim=1)
+         else:
+             raise ValueError(f"Unknown pooling: {pooling}")
+
+
+ class MagicBERTForSequenceClassification(MagicBERTPreTrainedModel):
+     """MagicBERT for sequence classification (file type classification)."""
+
+     def __init__(self, config: MagicBERTConfig):
+         super().__init__(config)
+         self.config = config
+         self.num_labels = getattr(config, "num_labels", 106)
+
+         self.embeddings = MagicBERTEmbeddings(config)
+         self.encoder = MagicBERTEncoder(config)
+
+         # Projection head (for contrastive learning compatibility)
+         projection_dim = getattr(config, "contrastive_projection_dim", 256)
+         self.projection = nn.Sequential(
+             nn.Linear(config.hidden_size, config.hidden_size),
+             nn.ReLU(),
+             nn.Linear(config.hidden_size, projection_dim),
+         )
+         self.classifier = nn.Linear(projection_dim, self.num_labels)
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,  # Ignored, for tokenizer compatibility
+         labels: Optional[torch.Tensor] = None,
+         return_dict: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor, ...], SequenceClassifierOutput]:
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         hidden_states = self.embeddings(input_ids)
+         sequence_output = self.encoder(hidden_states, attention_mask)
+         pooled_output = sequence_output[:, 0, :]
+
+         # The classifier operates on unit-length (L2-normalized) projections.
+         projections = self.projection(pooled_output)
+         projections = F.normalize(projections, p=2, dim=1)
+         logits = self.classifier(projections)
+
+         loss = None
+         if labels is not None:
+             loss_fct = nn.CrossEntropyLoss()
+             loss = loss_fct(logits, labels)
+
+         if not return_dict:
+             output = (logits,)
+             return ((loss,) + output) if loss is not None else output
+
+         return SequenceClassifierOutput(
+             loss=loss,
+             logits=logits,
+             hidden_states=None,
+             attentions=None,
+         )
+
+     def get_embeddings(
+         self,
+         input_ids: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+     ) -> torch.Tensor:
+         """Get normalized projection embeddings for similarity search."""
+         hidden_states = self.embeddings(input_ids)
+         sequence_output = self.encoder(hidden_states, attention_mask)
+         pooled_output = sequence_output[:, 0, :]
+         projections = self.projection(pooled_output)
+         return F.normalize(projections, p=2, dim=1)
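
Because `get_embeddings` on the classification model returns L2-normalized projections, the dot product of two embeddings equals their cosine similarity. The following is a minimal sketch of a similarity lookup built on that property, not part of the committed files. The helper name `top_k_similar` is hypothetical, and random tensors stand in for real model output; the width of 256 matches the default `contrastive_projection_dim` used in the code above.

```python
import torch
import torch.nn.functional as F


def top_k_similar(query: torch.Tensor, index: torch.Tensor, k: int = 3):
    """Return (scores, indices) of the k rows of `index` closest to `query`.

    Hypothetical helper. Both inputs are assumed to be L2-normalized,
    so the dot product is the cosine similarity.
    """
    scores = index @ query  # shape: [num_items]
    return torch.topk(scores, k=min(k, index.shape[0]))


torch.manual_seed(0)

# Random stand-ins for get_embeddings(...) output: [num_files, 256]
index = F.normalize(torch.randn(5, 256), p=2, dim=1)

# Query: a slightly perturbed copy of row 2, re-normalized to unit length
query = F.normalize(index[2] + 0.01 * torch.randn(256), p=2, dim=0)

scores, indices = top_k_similar(query, index, k=3)
print(indices.tolist(), scores.tolist())
```

With real data, `index` would hold the stacked embeddings of known files and `query` the embedding of the file being looked up. A matrix product is enough at small scale; a dedicated nearest-neighbor index would be the usual choice for large collections.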
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "mask_token": "[MASK]",
+   "cls_token": "[CLS]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }