Mitchins committed on
Commit e78f911 · verified · 1 Parent(s): c7024c9

Upload folder using huggingface_hub

Files changed (7)
  1. README.md +259 -0
  2. config.json +26 -0
  3. model.py +297 -0
  4. model.safetensors +3 -0
  5. pytorch_model.bin +3 -0
  6. test_results.csv +19 -0
  7. test_results.json +182 -0
README.md ADDED
@@ -0,0 +1,259 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ library_name: pytorch
+ tags:
+ - text-classification
+ - fiction-detection
+ - byte-level
+ - cnn
+ datasets:
+ - HuggingFaceTB/cosmopedia
+ - BEE-spoke-data/gutenberg-en-v1-clean
+ - common-pile/arxiv_abstracts
+ - ccdv/cnn_dailymail
+ metrics:
+ - accuracy
+ - f1
+ - roc_auc
+ model-index:
+ - name: TinyByteCNN-Fiction-Detector
+   results:
+   - task:
+       type: text-classification
+       name: Fiction vs Non-Fiction Classification
+     dataset:
+       name: Custom Fiction/Non-Fiction Dataset
+       type: custom
+       split: validation
+     metrics:
+     - type: accuracy
+       value: 99.91
+       name: Validation Accuracy
+     - type: f1
+       value: 99.91
+       name: F1 Score
+     - type: roc_auc
+       value: 99.99
+       name: ROC AUC
+ ---
+
+ # TinyByteCNN Fiction vs Non-Fiction Detector
+
+ A lightweight, byte-level CNN model for detecting fiction vs non-fiction text with 99.91% validation accuracy.
+
+ ## Model Description
+
+ TinyByteCNN is a compact byte-level convolutional neural network for binary classification of fiction vs non-fiction text. The model operates directly on UTF-8 byte sequences, so it needs no tokenizer or vocabulary and accepts any UTF-8 input, although the training data is primarily English.
+
+ ### Architecture Highlights
+
+ - **Model Size**: 942,313 parameters (~3.6MB in float32)
+ - **Input**: Raw UTF-8 bytes (max 4096 bytes ≈ 512 words)
+ - **Architecture**: Depthwise-separable 1D CNN with Squeeze-Excitation
+ - **Receptive Field**: ~2.8KB covering multi-paragraph context
+ - **Key Features**:
+   - Stride-2 stem plus 4 stages with progressive downsampling (32x reduction overall)
+   - Dilated convolutions for larger receptive field
+   - SE attention modules for channel recalibration
+   - Global average + max pooling head
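
The ~2.8KB receptive field quoted above can be reproduced from the layer settings in `model.py` (a kernel-5, stride-2 stem followed by kernel-7 blocks with the listed strides and dilations). The sketch below applies the usual recurrence for stacked 1D convolutions; the helper name `receptive_field` is illustrative and not part of the repository, and the 1x1 pointwise and shortcut convolutions are left out because they do not widen the field.

```python
def receptive_field(layers):
    """Receptive field (in input bytes) of stacked 1D convolutions.

    `layers` is a sequence of (kernel, stride, dilation) tuples.
    """
    rf, jump = 1, 1  # jump = spacing between adjacent outputs, in input bytes
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf, jump


# Stem: kernel 5, stride 2. Blocks: kernel 7, stride 2 on the first block of each stage.
stage_dilations = [[1, 2], [1, 2], [1, 2, 4], [1, 2, 8]]
layers = [(5, 2, 1)]
for dilations in stage_dilations:
    for i, d in enumerate(dilations):
        layers.append((7, 2 if i == 0 else 1, d))

rf, total_stride = receptive_field(layers)
print(rf, total_stride)  # 2825 32
```

The result, 2825 bytes with a total stride of 32, matches both the "~2.8KB" and the "32x reduction" figures.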
+
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+ - Automated content categorization for libraries and archives
+ - Fiction/non-fiction filtering for content platforms
+ - Educational content classification
+ - Writing style analysis
+ - Content recommendation systems
+
+ ### Limitations
+ - **Personal narratives**: May misclassify personal journal entries and memoirs as fiction (~97% fiction confidence observed on some journal entries, although the travel-journal sample in `test_results.csv` is classified as Non-Fiction)
+ - **Mixed content**: Struggles with creative non-fiction and narrative journalism
+ - **Length**: Optimized for 512-4096 byte inputs; longer texts should be chunked
+ - **Language**: Primarily trained on English text
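
Because inputs are truncated at 4096 bytes, longer documents need to be split before classification. One possible approach, sketched here with a hypothetical helper `chunk_utf8` that is not part of the repository, cuts on byte counts and backs off so that no multi-byte UTF-8 character is split. How per-chunk scores are combined (for example, by averaging) is not specified by the model card and is left to the caller.

```python
def chunk_utf8(text, max_bytes=4096):
    """Split text into pieces whose UTF-8 encoding is at most max_bytes long."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 (the longest UTF-8 character)")
    data = text.encode("utf-8")
    chunks, start = [], 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Back off while `end` points at a continuation byte (0b10xxxxxx),
        # so a multi-byte character is never cut in half.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks


pieces = chunk_utf8("a" + "é" * 3000)  # 6001 bytes in UTF-8
print([len(p.encode("utf-8")) for p in pieces])
```

Each piece can then be passed to `preprocess_text` on its own.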
+
+ ## Training Data
+
+ The dataset contains 85,000 samples (60k train, 15k validation, 10k test) drawn from:
+
+ ### Fiction Sources (50%)
+ 1. **Cosmopedia Stories** (HuggingFaceTB/cosmopedia)
+    - Synthetic fiction stories
+    - License: Apache 2.0
+
+ 2. **Project Gutenberg** (BEE-spoke-data/gutenberg-en-v1-clean)
+    - Classic literature
+    - License: Public Domain
+
+ 3. **Reddit WritingPrompts**
+    - Community-generated creative writing
+    - Via synthetic alternatives
+
+ ### Non-Fiction Sources (50%)
+ 1. **Cosmopedia Educational** (HuggingFaceTB/cosmopedia)
+    - Textbooks, WikiHow, educational blogs
+    - License: Apache 2.0
+
+ 2. **Scientific Papers** (common-pile/arxiv_abstracts)
+    - Academic abstracts and introductions
+    - License: Various (permissive)
+
+ 3. **News Articles** (ccdv/cnn_dailymail)
+    - CNN and Daily Mail articles
+    - License: Apache 2.0
+
+ ## Training Procedure
+
+ ### Preprocessing
+ - Unicode NFC normalization
+ - Whitespace normalization (runs of 3 or more whitespace characters are collapsed; max 2 consecutive spaces)
+ - UTF-8 byte encoding
+ - Zero-padding/truncation to 4096 bytes
+
+ ### Training Hyperparameters
+ - **Optimizer**: AdamW (lr=3e-3, betas=(0.9, 0.98), weight_decay=0.01)
+ - **Schedule**: Cosine decay with 5% warmup
+ - **Batch Size**: 32
+ - **Epochs**: 10
+ - **Label Smoothing**: 0.05
+ - **Gradient Clipping**: 1.0
+ - **Device**: Apple M-series (MPS)
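
To make the schedule concrete, here is a minimal pure-Python sketch of cosine decay with a 5% warmup at the stated peak learning rate. The linear warmup shape, the decay to zero, and the step count (60,000 training samples / batch size 32 × 10 epochs = 18,750 steps) are assumptions; the card does not state them, and `lr_at` is a hypothetical helper rather than the training code.

```python
import math


def lr_at(step, total_steps=18_750, peak_lr=3e-3, warmup_frac=0.05):
    """Learning rate at a given optimizer step (0-indexed)."""
    warmup_steps = int(total_steps * warmup_frac)  # 937 steps for the defaults
    if step < warmup_steps:
        # Assumed linear ramp from near zero up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr towards zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 936, 9_000, 18_749):
    print(step, f"{lr_at(step):.6f}")
```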
+
+ ## Evaluation Results
+
+ ### Validation Set (15,000 samples)
+ | Metric | Value |
+ |--------|-------|
+ | Accuracy | 99.91% |
+ | F1 Score | 0.9991 |
+ | ROC AUC | 0.9999 |
+ | Loss | 0.1194 |
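
A loss of 0.1194 next to 99.91% accuracy may look inconsistent, but it is roughly what label smoothing would produce if the loss was computed against smoothed targets. The calculation below assumes binary cross-entropy with targets smoothed as y(1 - ε) + ε/2 at ε = 0.05; the card does not say how smoothing was applied, so treat this as a plausibility check only. The helper name `smoothed_bce_floor` is illustrative.

```python
import math


def smoothed_bce_floor(eps=0.05):
    """Lowest achievable binary cross-entropy when targets are smoothed to eps/2 and 1 - eps/2."""
    t = 1.0 - eps / 2.0  # smoothed positive target, 0.975 for eps = 0.05
    # The loss is minimised when the predicted probability equals the target,
    # which leaves the entropy of the smoothed target distribution.
    return -(t * math.log(t) + (1.0 - t) * math.log(1.0 - t))


print(round(smoothed_bce_floor(), 4))  # 0.1169
```

Under the same assumption the model would be pushed towards probabilities near 0.975, which is in line with the ~97.8% confidences reported for most test samples.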
+
+ ### Test Samples by Category (12 curated samples)
+
+ | Category | Samples | Accuracy | Avg Confidence |
+ |----------|---------|----------|----------------|
+ | General Fiction | 3 | 100% | 91.4% |
+ | Textbook | 3 | 100% | 97.8% |
+ | News Articles | 3 | 100% | 97.9% |
+ | Journal Articles | 3 | 100% | 97.6% |
+ | **Overall** | **12** | **100%** | **96.2%** |
+
+ The model achieved perfect classification across all categories, including diverse journal types (financial news, scientific research, and personal travel logs).
+
+ ### Detailed Test Results
+
+ #### ✅ All 12 Samples Correctly Classified
+
+ **Fiction Samples (3/3):**
+ 1. Lighthouse keeper narrative → Fiction (79.8% conf)
+ 2. Time travel story → Fiction (97.2% conf)
+ 3. Detective mystery → Fiction (97.3% conf)
+
+ **Textbook Samples (3/3):**
+ 1. Photosynthesis (Biology) → Non-Fiction (97.8% conf)
+ 2. Fundamental theorem (Calculus) → Non-Fiction (97.8% conf)
+ 3. Market equilibrium (Economics) → Non-Fiction (97.9% conf)
+
+ **News Articles (3/3):**
+ 1. Federal Reserve decision → Non-Fiction (97.8% conf)
+ 2. City homeless initiative → Non-Fiction (97.9% conf)
+ 3. Exoplanet discovery → Non-Fiction (97.9% conf)
+
+ **Journal Articles (3/3):**
+ 1. Wall Street Journal (Financial) → Non-Fiction (97.7% conf)
+ 2. Nature Scientific Reports → Non-Fiction (97.7% conf)
+ 3. Personal Travel Journal → Non-Fiction (97.5% conf)
+
+ ## How to Use
+
+ ### PyTorch
+
+ ```python
+ import torch
+ import numpy as np
+ from model import TinyByteCNN, preprocess_text
+
+ # Load model
+ model = TinyByteCNN.from_pretrained("username/tinybytecnn-fiction-detector")
+ model.eval()
+
+ # Prepare text
+ text = "Your text here..."
+ input_bytes = preprocess_text(text)  # Returns tensor of shape [1, 4096]
+
+ # Predict
+ with torch.no_grad():
+     logits = model(input_bytes)
+     probability = torch.sigmoid(logits).item()
+
+ if probability > 0.5:
+     print(f"Non-Fiction (confidence: {probability:.1%})")
+ else:
+     print(f"Fiction (confidence: {1-probability:.1%})")
+ ```
+
+ ### Batch Processing
+
+ ```python
+ import torch
+ from model import preprocess_text
+
+ def classify_texts(texts, model, batch_size=32):
+     results = []
+     for i in range(0, len(texts), batch_size):
+         batch = texts[i:i+batch_size]
+         # preprocess_text returns [1, 4096]; concatenate along dim 0 to get [B, 4096]
+         # (torch.stack would produce [B, 1, 4096], which the model cannot consume)
+         inputs = torch.cat([preprocess_text(t) for t in batch], dim=0)
+
+         with torch.no_grad():
+             logits = model(inputs)
+             probs = torch.sigmoid(logits)
+
+         for text, prob in zip(batch, probs):
+             results.append({
+                 'text': text[:100] + '...',
+                 'class': 'Non-Fiction' if prob > 0.5 else 'Fiction',
+                 'confidence': prob.item() if prob > 0.5 else 1-prob.item()
+             })
+
+     return results
+ ```
+
+ ## Training Infrastructure
+
+ - **Hardware**: Apple M-series with 8GB MPS memory limit
+ - **Training Time**: ~20 minutes
+ - **Framework**: PyTorch 2.0+
+
+ ## Environmental Impact
+
+ - **Hardware Type**: Apple Silicon M-series
+ - **Hours used**: 0.33
+ - **Carbon Emitted**: Minimal (ARM-based efficiency; ~10W average over 0.33 hours is about 3.3 Wh)
+
+ ## Citation
+
+ ```bibtex
+ @misc{tinybytecnn-fiction-2024,
+   title={TinyByteCNN Fiction vs Non-Fiction Detector},
+   author={Mitchell Currie},
+   year={2024},
+   publisher={HuggingFace},
+   url={https://huggingface.co/username/tinybytecnn-fiction-detector}
+ }
+ ```
+
+ ## Acknowledgments
+
+ This model uses data from:
+ - HuggingFace Team (Cosmopedia dataset)
+ - Project Gutenberg
+ - Common Pile contributors
+ - CNN/Daily Mail dataset creators
+
+ ## License
+
+ Apache 2.0
+
+ ## Contact
+
+ For questions or issues, please open an issue on the [model repository](https://huggingface.co/username/tinybytecnn-fiction-detector).
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "architectures": ["TinyByteCNN"],
+   "model_type": "byte_cnn",
+   "task": "text-classification",
+   "num_labels": 2,
+   "id2label": {
+     "0": "Fiction",
+     "1": "Non-Fiction"
+   },
+   "label2id": {
+     "Fiction": 0,
+     "Non-Fiction": 1
+   },
+   "max_seq_len": 4096,
+   "vocab_size": 256,
+   "embed_dim": 32,
+   "widths": [128, 192, 256, 320],
+   "use_gn": false,
+   "head_drop": 0.1,
+   "stochastic_depth": 0.05,
+   "num_parameters": 942313,
+   "torch_dtype": "float32",
+   "validation_accuracy": 99.91,
+   "validation_f1": 0.9991,
+   "validation_auc": 0.9999
+ }
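
`from_pretrained` in `model.py` turns this file into an attribute-style object before building the model. Below is a small stand-alone sketch of the same idea. It uses `types.SimpleNamespace` instead of the dynamically created class in the repository; that substitution and the helper name `load_config` are assumptions made for illustration.

```python
import json
from types import SimpleNamespace


def load_config(source):
    """Parse config.json text into an object with attribute access."""
    return SimpleNamespace(**json.loads(source))


# Subset of the fields shown above, inlined so the example runs on its own.
example = '{"vocab_size": 256, "embed_dim": 32, "widths": [128, 192, 256, 320], "use_gn": false}'
config = load_config(example)
print(config.widths[-1], config.use_gn)  # 320 False
```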
model.py ADDED
@@ -0,0 +1,297 @@
+ """
+ TinyByteCNN Model for Fiction vs Non-Fiction Classification
+ """
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import numpy as np
+ import unicodedata
+ import re
+ from typing import Union, List
+
+
+ class SE(nn.Module):
+     """Squeeze-Excitation module"""
+     def __init__(self, c, r=8):
+         super().__init__()
+         m = max(c // r, 4)
+         self.fc1 = nn.Linear(c, m)
+         self.fc2 = nn.Linear(m, c)
+
+     def forward(self, x):
+         # x: [B, C, T]
+         s = x.mean(dim=-1)  # [B, C]
+         s = F.silu(self.fc1(s))
+         s = torch.sigmoid(self.fc2(s))  # [B, C]
+         return x * s.unsqueeze(-1)
+
+
+ class SepResBlock(nn.Module):
+     """Separable Residual Block with SE attention"""
+     def __init__(self, c_in, c_out, k=7, stride=1, dilation=1, use_gn=False, se_ratio=8, drop=0.0):
+         super().__init__()
+         Norm = (lambda c: nn.GroupNorm(32, c)) if use_gn else nn.BatchNorm1d
+
+         self.dw = nn.Conv1d(c_in, c_in, k, stride=stride, dilation=dilation,
+                             padding=((k-1)//2)*dilation, groups=c_in, bias=False)
+         self.bn1 = Norm(c_in)
+         self.pw = nn.Conv1d(c_in, c_out, 1, bias=False)
+         self.bn2 = Norm(c_out)
+         self.se = SE(c_out, se_ratio)
+         self.drop = nn.Dropout(p=drop)
+
+         self.proj = None
+         if stride != 1 or c_in != c_out:
+             self.proj = nn.Conv1d(c_in, c_out, 1, stride=stride, bias=False)
+
+     def forward(self, x):
+         y = self.dw(x)
+         y = F.silu(self.bn1(y))
+         y = self.pw(y)
+         y = self.bn2(y)
+         y = self.se(y)
+         if self.proj is not None:
+             x = self.proj(x)
+         y = self.drop(y)
+         return F.silu(x + y)
+
+
+ class TinyByteCNN(nn.Module):
+     """TinyByteCNN for Fiction vs Non-Fiction Classification"""
+
+     def __init__(self, config=None):
+         super().__init__()
+
+         # Default configuration
+         if config is None:
+             config = type('Config', (), {
+                 'vocab_size': 256,
+                 'embed_dim': 32,
+                 'widths': [128, 192, 256, 320],
+                 'use_gn': False,
+                 'head_drop': 0.1,
+                 'stochastic_depth': 0.05
+             })()
+
+         self.config = config
+
+         # Embedding layer for bytes
+         self.embed = nn.Embedding(config.vocab_size, config.embed_dim)
+
+         # Stem convolution
+         self.stem = nn.Conv1d(config.embed_dim, config.widths[0], 5, stride=2, padding=2, bias=False)
+         self.bn0 = nn.BatchNorm1d(config.widths[0]) if not config.use_gn else nn.GroupNorm(32, config.widths[0])
+
+         # Build stages: (number of blocks, width, per-block dilations)
+         cfg = [
+             (2, config.widths[0], [1, 2]),
+             (2, config.widths[1], [1, 2]),
+             (3, config.widths[2], [1, 2, 4]),
+             (3, config.widths[3], [1, 2, 8])
+         ]
+
+         stages = []
+         c_prev = config.widths[0]
+         for blocks, c, ds in cfg:
+             for i in range(blocks):
+                 stride = 2 if i == 0 else 1
+                 d = ds[i]
+                 stages.append(SepResBlock(c_prev, c, k=7, stride=stride, dilation=d,
+                                           use_gn=config.use_gn, drop=config.stochastic_depth))
+                 c_prev = c
+
+         self.stages = nn.Sequential(*stages)
+
+         # Classification head
+         self.head = nn.Sequential(
+             nn.Dropout(p=config.head_drop),
+             nn.Linear(2 * config.widths[-1], 1)
+         )
+
+     def forward(self, x_bytes):
+         """
+         Args:
+             x_bytes: [B, T] uint8 tensor of byte values
+         Returns:
+             logits: [B] tensor of binary classification logits
+         """
+         x = self.embed(x_bytes.long())  # [B, T, E]
+         x = x.transpose(1, 2).contiguous()  # [B, E, T]
+         x = F.silu(self.bn0(self.stem(x)))  # [B, C0, T/2]
+         x = self.stages(x)  # [B, C, T/32]
+
+         # Global pooling
+         avg = x.mean(dim=-1)
+         mx = x.amax(dim=-1)
+         feats = torch.cat([avg, mx], dim=1)
+
+         logits = self.head(feats).squeeze(1)
+         return logits
+
+     @classmethod
+     def from_pretrained(cls, path_or_repo, use_safetensors=True):
+         """Load pretrained model (supports both .bin and .safetensors)"""
+         import os
+         from pathlib import Path
+
+         # Determine if it's a file or directory/repo
+         if os.path.isdir(path_or_repo):
+             # Directory path - look for model files
+             base_path = Path(path_or_repo)
+             safetensors_path = base_path / "model.safetensors"
+             pytorch_path = base_path / "pytorch_model.bin"
+
+             if use_safetensors and safetensors_path.exists():
+                 # Load from safetensors
+                 from safetensors.torch import load_file
+                 state_dict = load_file(str(safetensors_path))
+
+                 # Load config if available
+                 config_path = base_path / "config.json"
+                 if config_path.exists():
+                     import json
+                     with open(config_path) as f:
+                         config_dict = json.load(f)
+                     config = type('Config', (), config_dict)()
+                 else:
+                     config = None
+
+                 model = cls(config)
+                 model.load_state_dict(state_dict)
+                 return model
+             elif pytorch_path.exists():
+                 checkpoint = torch.load(pytorch_path, weights_only=False, map_location='cpu')
+         elif os.path.isfile(path_or_repo):
+             if path_or_repo.endswith('.safetensors'):
+                 from safetensors.torch import load_file
+                 state_dict = load_file(path_or_repo)
+                 model = cls()
+                 model.load_state_dict(state_dict)
+                 return model
+             else:
+                 checkpoint = torch.load(path_or_repo, weights_only=False, map_location='cpu')
+         else:
+             # HuggingFace hub loading
+             from huggingface_hub import hf_hub_download
+
+             if use_safetensors:
+                 try:
+                     model_file = hf_hub_download(repo_id=path_or_repo, filename="model.safetensors")
+                     from safetensors.torch import load_file
+                     state_dict = load_file(model_file)
+                     model = cls()
+                     model.load_state_dict(state_dict)
+                     return model
+                 except Exception:
+                     pass  # Fall back to pytorch format
+
+             model_file = hf_hub_download(repo_id=path_or_repo, filename="pytorch_model.bin")
+             checkpoint = torch.load(model_file, weights_only=False, map_location='cpu')
+
+         # Load from checkpoint (pytorch format)
+         if 'checkpoint' in locals():
+             config = checkpoint.get('config', None)
+             model = cls(config)
+             state_dict = checkpoint.get('model_state_dict', checkpoint)
+             model.load_state_dict(state_dict)
+             return model
+
+     def save_pretrained(self, save_path):
+         """Save model to directory"""
+         import os
+         os.makedirs(save_path, exist_ok=True)
+
+         torch.save({
+             'model_state_dict': self.state_dict(),
+             'config': self.config
+         }, os.path.join(save_path, 'pytorch_model.bin'))
+
+
+ def preprocess_text(text: str, max_len: int = 4096) -> torch.Tensor:
+     """
+     Preprocess text to bytes for model input
+
+     Args:
+         text: Input text string
+         max_len: Maximum sequence length (default 4096)
+
+     Returns:
+         Tensor of shape [1, max_len] containing byte values
+     """
+     # Unicode NFC normalize
+     text = unicodedata.normalize('NFC', text)
+
+     # Replace \r\n → \n
+     text = text.replace('\r\n', '\n')
+
+     # Collapse runs of whitespace to at most 2
+     text = re.sub(r'\s{3,}', ' ', text)
+
+     # Convert to bytes
+     text_bytes = text.encode('utf-8', errors='ignore')
+
+     # Pad or truncate to max_len
+     input_ids = np.zeros(max_len, dtype=np.uint8)
+     input_ids[:min(len(text_bytes), max_len)] = list(text_bytes[:max_len])
+
+     return torch.from_numpy(input_ids).unsqueeze(0)  # Add batch dimension
+
+
+ def classify_text(text: Union[str, List[str]], model=None, device='cpu'):
+     """
+     Classify text as fiction or non-fiction
+
+     Args:
+         text: Single string or list of strings to classify
+         model: Pre-loaded model (optional)
+         device: Device to run on ('cpu', 'cuda', 'mps')
+
+     Returns:
+         Dictionary with predictions and confidence scores
+     """
+     if model is None:
+         model = TinyByteCNN.from_pretrained("fiction_classifier_hf")
+
+     model = model.to(device)
+     model.eval()
+
+     # Handle single text or batch
+     if isinstance(text, str):
+         texts = [text]
+     else:
+         texts = text
+
+     results = []
+
+     for t in texts:
+         input_ids = preprocess_text(t).to(device)
+
+         with torch.no_grad():
+             logits = model(input_ids)
+             prob = torch.sigmoid(logits).item()
+
+         pred_class = "Non-Fiction" if prob > 0.5 else "Fiction"
+         confidence = prob if prob > 0.5 else (1 - prob)
+
+         results.append({
+             'text': t[:100] + '...' if len(t) > 100 else t,
+             'prediction': pred_class,
+             'confidence': confidence,
+             'probability_nonfiction': prob
+         })
+
+     return results[0] if isinstance(text, str) else results
+
+
+ if __name__ == "__main__":
+     # Example usage
+     sample_text = "The detective's coffee had gone cold hours ago, but she hardly noticed."
+
+     # Load and use model
+     model = TinyByteCNN.from_pretrained("fiction_model_output_cnn/best_model.pt")
+     result = classify_text(sample_text, model)
+
+     print(f"Text: {result['text']}")
+     print(f"Prediction: {result['prediction']}")
+     print(f"Confidence: {result['confidence']:.1%}")
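
The 942,313-parameter figure in the README and `config.json` can be cross-checked against the layer definitions above without loading PyTorch. The following is a hand-rolled count under the default configuration with BatchNorm (`use_gn` false); the function names are illustrative and not part of `model.py`.

```python
def se_params(c, r=8):
    """Parameters of the SE module: two Linear layers with biases."""
    m = max(c // r, 4)
    return (c * m + m) + (m * c + c)


def block_params(c_in, c_out, stride, k=7):
    """Parameters of one SepResBlock."""
    n = c_in * k           # depthwise conv, no bias
    n += 2 * c_in          # BatchNorm scale and shift
    n += c_in * c_out      # pointwise conv, no bias
    n += 2 * c_out         # BatchNorm scale and shift
    n += se_params(c_out)
    if stride != 1 or c_in != c_out:
        n += c_in * c_out  # 1x1 projection on the shortcut
    return n


def total_params(vocab=256, embed=32, widths=(128, 192, 256, 320), blocks=(2, 2, 3, 3)):
    n = vocab * embed               # byte embedding
    n += embed * widths[0] * 5      # stem conv (kernel 5, no bias)
    n += 2 * widths[0]              # stem BatchNorm
    c_prev = widths[0]
    for c, depth in zip(widths, blocks):
        for i in range(depth):
            n += block_params(c_prev, c, stride=2 if i == 0 else 1)
            c_prev = c
    n += 2 * widths[-1] + 1         # linear head on concatenated avg + max features
    return n


print(total_params())  # 942313
```

At 4 bytes per float32 weight this comes to 3,769,252 bytes, consistent with the "~3.6MB" quoted in the README.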
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e788bf5427b996650f8e657b05615078bdb3f0e778f23eb5059a2566b92e8a2a
+ size 3821900
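
The three lines above are a Git LFS pointer rather than the weights themselves: they record the SHA-256 digest and byte size of the real file. A downloaded copy can be checked against the pointer roughly as follows; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and only the standard library is used.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer into a dict with 'version', 'algo', 'digest' and 'size'."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(data, pointer):
    """Check that raw file bytes match the size and SHA-256 digest in the pointer."""
    return len(data) == pointer["size"] and hashlib.sha256(data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:e788bf5427b996650f8e657b05615078bdb3f0e778f23eb5059a2566b92e8a2a\n"
    "size 3821900\n"
)
print(pointer["algo"], pointer["size"])  # sha256 3821900
```

For a multi-gigabyte file the digest would be better computed in chunks; reading the whole file into memory is acceptable at this size.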
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c21fe2faa7c6707c40c83b6c866dbd93f9437e2a67c558330408d4085448f1b6
+ size 3862846
test_results.csv ADDED
@@ -0,0 +1,19 @@
1
+ sample_id,category,true_label,predicted_label,confidence,probability_nonfiction,correct,text_preview
2
+ general_fiction_1,Fiction,Fiction,Fiction,0.7979161292314529,0.20208387076854706,True,"The old lighthouse keeper squinted through the salt-stained window, watching the storm gather streng..."
3
+ general_fiction_2,Fiction,Fiction,Fiction,0.9722247179597616,0.02777528204023838,True,"Marcus never believed in second chances until the morning he woke up in his childhood bedroom, seven..."
4
+ general_fiction_3,Fiction,Fiction,Fiction,0.9732040446251631,0.02679595537483692,True,"The detective's coffee had gone cold hours ago, but she hardly noticed. The case files spread across..."
5
+ childrens_stories_1,Fiction,Fiction,Fiction,0.9714287109673023,0.028571289032697678,True,Benny the bunny had a very important problem. His favorite carrot was stuck at the top of the talles...
6
+ childrens_stories_2,Fiction,Fiction,Fiction,0.972626393660903,0.027373606339097023,True,"Princess Luna loved to paint, but there was one big problem - all her paintings came to life at midn..."
7
+ childrens_stories_3,Fiction,Fiction,Fiction,0.9603868946433067,0.03961310535669327,True,Tommy's grandpa had a secret. Hidden in his workshop was a pair of goggles that could let you see in...
8
+ fantasy_stories_1,Fiction,Fiction,Fiction,0.9737914837896824,0.02620851621031761,True,The ancient runes on Kaelen's sword began to glow with an otherworldly blue light as she approached ...
9
+ fantasy_stories_2,Fiction,Fiction,Fiction,0.9677380956709385,0.03226190432906151,True,"Elara discovered she could weave moonlight into solid form quite by accident, during the Festival of..."
10
+ fantasy_stories_3,Fiction,Fiction,Fiction,0.9729745481163263,0.027025451883673668,True,"In the dragon markets of Valengard, memories were currency and dreams could be bottled like wine. Th..."
11
+ textbook_1,Non-Fiction,Non-Fiction,Non-Fiction,0.9782394766807556,0.9782394766807556,True,The process of photosynthesis can be divided into two main stages: the light-dependent reactions and...
12
+ textbook_2,Non-Fiction,Non-Fiction,Non-Fiction,0.9783691167831421,0.9783691167831421,True,The fundamental theorem of calculus establishes the relationship between differentiation and integra...
13
+ textbook_3,Non-Fiction,Non-Fiction,Non-Fiction,0.9790801405906677,0.9790801405906677,True,"Market equilibrium occurs at the intersection of supply and demand curves, where the quantity demand..."
14
+ news_1,Non-Fiction,Non-Fiction,Non-Fiction,0.9778554439544678,0.9778554439544678,True,The Federal Reserve announced Tuesday its decision to maintain interest rates at their current level...
15
+ news_2,Non-Fiction,Non-Fiction,Non-Fiction,0.9786907434463501,0.9786907434463501,True,"City officials unveiled a comprehensive plan Wednesday to address the growing homeless crisis, alloc..."
16
+ news_3,Non-Fiction,Non-Fiction,Non-Fiction,0.9789323210716248,0.9789323210716248,True,Scientists at the European Space Observatory have discovered three potentially habitable exoplanets ...
17
+ journal_entries_1,Non-Fiction,Non-Fiction,Non-Fiction,0.9772427678108215,0.9772427678108215,True,"Wall Street Journal - March 15, 2024: Technology stocks led a broad market rally Thursday as investo..."
18
+ journal_entries_2,Non-Fiction,Non-Fiction,Non-Fiction,0.9768512845039368,0.9768512845039368,True,"Nature Scientific Reports - September 2024: In this study, we investigated the correlation between m..."
19
+ journal_entries_3,Non-Fiction,Non-Fiction,Non-Fiction,0.9745306968688965,0.9745306968688965,True,"Personal Travel Journal - January 8, 2024: Day 3 in Kyoto. Visited Fushimi Inari shrine early this m..."
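
The per-category figures in the README can be recomputed from this file. The sketch below groups rows by `sample_id` with the numeric suffix removed and reports accuracy and mean confidence per group. It runs on an inlined three-column excerpt of the rows above so that it is self-contained; `summarize` is a hypothetical helper, not part of the repository.

```python
import csv
import io
from collections import defaultdict

# Excerpt of test_results.csv (three columns only) so the example runs on its own.
EXCERPT = """sample_id,confidence,correct
general_fiction_1,0.7979161292314529,True
general_fiction_2,0.9722247179597616,True
general_fiction_3,0.9732040446251631,True
"""


def summarize(csv_text):
    """Return {group: (accuracy, mean_confidence)} keyed by sample_id without its numeric suffix."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = row["sample_id"].rsplit("_", 1)[0]
        groups[group].append((row["correct"] == "True", float(row["confidence"])))
    return {
        g: (sum(ok for ok, _ in rows) / len(rows), sum(c for _, c in rows) / len(rows))
        for g, rows in groups.items()
    }


summary = summarize(EXCERPT)
accuracy, mean_conf = summary["general_fiction"]
print(f"{accuracy:.0%} {mean_conf:.1%}")  # 100% 91.4%
```

Note that the file holds 18 rows, including children's and fantasy stories that the README's 12-sample table does not list.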
test_results.json ADDED
@@ -0,0 +1,182 @@
1
+ [
2
+ {
3
+ "category": "Fiction",
4
+ "sample_id": "general_fiction_1",
5
+ "true_label": "Fiction",
6
+ "predicted_label": "Fiction",
7
+ "confidence": 0.7979161292314529,
8
+ "probability_nonfiction": 0.20208387076854706,
9
+ "correct": true,
10
+ "text_preview": "The old lighthouse keeper squinted through the salt-stained window, watching the storm gather streng..."
11
+ },
12
+ {
13
+ "category": "Fiction",
14
+ "sample_id": "general_fiction_2",
15
+ "true_label": "Fiction",
16
+ "predicted_label": "Fiction",
17
+ "confidence": 0.9722247179597616,
18
+ "probability_nonfiction": 0.02777528204023838,
19
+ "correct": true,
20
+ "text_preview": "Marcus never believed in second chances until the morning he woke up in his childhood bedroom, seven..."
21
+ },
22
+ {
23
+ "category": "Fiction",
24
+ "sample_id": "general_fiction_3",
25
+ "true_label": "Fiction",
26
+ "predicted_label": "Fiction",
27
+ "confidence": 0.9732040446251631,
28
+ "probability_nonfiction": 0.02679595537483692,
29
+ "correct": true,
30
+ "text_preview": "The detective's coffee had gone cold hours ago, but she hardly noticed. The case files spread across..."
31
+ },
32
+ {
33
+ "category": "Fiction",
34
+ "sample_id": "childrens_stories_1",
35
+ "true_label": "Fiction",
36
+ "predicted_label": "Fiction",
37
+ "confidence": 0.9714287109673023,
38
+ "probability_nonfiction": 0.028571289032697678,
39
+ "correct": true,
40
+ "text_preview": "Benny the bunny had a very important problem. His favorite carrot was stuck at the top of the talles..."
41
+ },
42
+ {
43
+ "category": "Fiction",
44
+ "sample_id": "childrens_stories_2",
45
+ "true_label": "Fiction",
46
+ "predicted_label": "Fiction",
47
+ "confidence": 0.972626393660903,
48
+ "probability_nonfiction": 0.027373606339097023,
49
+ "correct": true,
50
+ "text_preview": "Princess Luna loved to paint, but there was one big problem - all her paintings came to life at midn..."
51
+ },
52
+ {
53
+ "category": "Fiction",
54
+ "sample_id": "childrens_stories_3",
55
+ "true_label": "Fiction",
56
+ "predicted_label": "Fiction",
57
+ "confidence": 0.9603868946433067,
58
+ "probability_nonfiction": 0.03961310535669327,
59
+ "correct": true,
60
+ "text_preview": "Tommy's grandpa had a secret. Hidden in his workshop was a pair of goggles that could let you see in..."
61
+ },
62
+ {
63
+ "category": "Fiction",
64
+ "sample_id": "fantasy_stories_1",
65
+ "true_label": "Fiction",
66
+ "predicted_label": "Fiction",
67
+ "confidence": 0.9737914837896824,
68
+ "probability_nonfiction": 0.02620851621031761,
69
+ "correct": true,
70
+ "text_preview": "The ancient runes on Kaelen's sword began to glow with an otherworldly blue light as she approached ..."
71
+ },
72
+ {
73
+ "category": "Fiction",
74
+ "sample_id": "fantasy_stories_2",
75
+ "true_label": "Fiction",
76
+ "predicted_label": "Fiction",
77
+ "confidence": 0.9677380956709385,
78
+ "probability_nonfiction": 0.03226190432906151,
79
+ "correct": true,
80
+ "text_preview": "Elara discovered she could weave moonlight into solid form quite by accident, during the Festival of..."
81
+ },
82
+ {
83
+ "category": "Fiction",
84
+ "sample_id": "fantasy_stories_3",
85
+ "true_label": "Fiction",
86
+ "predicted_label": "Fiction",
87
+ "confidence": 0.9729745481163263,
88
+ "probability_nonfiction": 0.027025451883673668,
89
+ "correct": true,
90
+ "text_preview": "In the dragon markets of Valengard, memories were currency and dreams could be bottled like wine. Th..."
91
+ },
92
+ {
93
+ "category": "Non-Fiction",
94
+ "sample_id": "textbook_1",
95
+ "true_label": "Non-Fiction",
96
+ "predicted_label": "Non-Fiction",
97
+ "confidence": 0.9782394766807556,
98
+ "probability_nonfiction": 0.9782394766807556,
99
+ "correct": true,
100
+ "text_preview": "The process of photosynthesis can be divided into two main stages: the light-dependent reactions and..."
101
+ },
102
+ {
103
+ "category": "Non-Fiction",
104
+ "sample_id": "textbook_2",
105
+ "true_label": "Non-Fiction",
106
+ "predicted_label": "Non-Fiction",
107
+ "confidence": 0.9783691167831421,
108
+ "probability_nonfiction": 0.9783691167831421,
109
+ "correct": true,
110
+ "text_preview": "The fundamental theorem of calculus establishes the relationship between differentiation and integra..."
111
+ },
112
+ {
113
+ "category": "Non-Fiction",
114
+ "sample_id": "textbook_3",
115
+ "true_label": "Non-Fiction",
116
+ "predicted_label": "Non-Fiction",
117
+ "confidence": 0.9790801405906677,
118
+ "probability_nonfiction": 0.9790801405906677,
119
+ "correct": true,
120
+ "text_preview": "Market equilibrium occurs at the intersection of supply and demand curves, where the quantity demand..."
121
+ },
122
+ {
123
+ "category": "Non-Fiction",
124
+ "sample_id": "news_1",
125
+ "true_label": "Non-Fiction",
126
+ "predicted_label": "Non-Fiction",
127
+ "confidence": 0.9778554439544678,
128
+ "probability_nonfiction": 0.9778554439544678,
129
+ "correct": true,
130
+ "text_preview": "The Federal Reserve announced Tuesday its decision to maintain interest rates at their current level..."
131
+ },
132
+ {
133
+ "category": "Non-Fiction",
134
+ "sample_id": "news_2",
135
+ "true_label": "Non-Fiction",
136
+ "predicted_label": "Non-Fiction",
137
+ "confidence": 0.9786907434463501,
138
+ "probability_nonfiction": 0.9786907434463501,
139
+ "correct": true,
140
+ "text_preview": "City officials unveiled a comprehensive plan Wednesday to address the growing homeless crisis, alloc..."
141
+ },
142
+ {
143
+ "category": "Non-Fiction",
144
+ "sample_id": "news_3",
145
+ "true_label": "Non-Fiction",
146
+ "predicted_label": "Non-Fiction",
147
+ "confidence": 0.9789323210716248,
148
+ "probability_nonfiction": 0.9789323210716248,
149
+ "correct": true,
150
+ "text_preview": "Scientists at the European Space Observatory have discovered three potentially habitable exoplanets ..."
151
+ },
152
+ {
153
+ "category": "Non-Fiction",
154
+ "sample_id": "journal_entries_1",
155
+ "true_label": "Non-Fiction",
156
+ "predicted_label": "Non-Fiction",
157
+ "confidence": 0.9772427678108215,
158
+ "probability_nonfiction": 0.9772427678108215,
159
+ "correct": true,
160
+ "text_preview": "Wall Street Journal - March 15, 2024: Technology stocks led a broad market rally Thursday as investo..."
161
+ },
162
+ {
163
+ "category": "Non-Fiction",
164
+ "sample_id": "journal_entries_2",
165
+ "true_label": "Non-Fiction",
166
+ "predicted_label": "Non-Fiction",
167
+ "confidence": 0.9768512845039368,
168
+ "probability_nonfiction": 0.9768512845039368,
169
+ "correct": true,
170
+ "text_preview": "Nature Scientific Reports - September 2024: In this study, we investigated the correlation between m..."
171
+ },
172
+ {
173
+ "category": "Non-Fiction",
174
+ "sample_id": "journal_entries_3",
175
+ "true_label": "Non-Fiction",
176
+ "predicted_label": "Non-Fiction",
177
+ "confidence": 0.9745306968688965,
178
+ "probability_nonfiction": 0.9745306968688965,
179
+ "correct": true,
180
+ "text_preview": "Personal Travel Journal - January 8, 2024: Day 3 in Kyoto. Visited Fushimi Inari shrine early this m..."
181
+ }
182
+ ]