Commit 6d93be6 by Mitchins · verified · 1 parent: 17799eb

Upload folder using huggingface_hub

.DS_Store ADDED
Binary file (6.15 kB).
 
README.md ADDED
@@ -0,0 +1,305 @@
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - text-classification
+ - narrative-analysis
+ - fiction
+ - genre-classification
+ - literary-analysis
+ - creative-writing
+ - longformer
+ - narrative-embeddings
+ datasets:
+ - Mitchins/fiction-genre-validation-52
+ metrics:
+ - accuracy
+ model-index:
+ - name: Longformer Fiction Genre Classifier
+   results:
+   - task:
+       type: text-classification
+       name: Narrative Genre Classification
+     dataset:
+       name: Fiction Genre Validation Set (52 Stories)
+       type: Mitchins/fiction-genre-validation-52
+     metrics:
+     - type: accuracy
+       value: 67.31
+       name: Accuracy
+ library_name: transformers
+ pipeline_tag: text-classification
+ ---
+
+ # Longformer Fiction Genre Classifier
+
+ ## Model Description
+
+ This is **not** "yet another genre classifier."
+
+ This model recognizes **narrative semantic genres** in long-form fiction, detecting what a story *feels like* and what narrative machinery it uses, rather than simply tagging marketing categories or book descriptions.
+
+ ### What Makes This Different
+
+ Most genre classifiers online:
+ - Train on 100-1000 samples (Goodreads tags, book blurbs)
+ - Use bag-of-words or TF-IDF (no deep understanding)
+ - Focus on short text (tweets, descriptions)
+ - Perform single-label classification only
+ - Are never tested on actual novels
+ - Offer no sliding-window inference
+ - Predict bookstore shelves, not literary modes
+
+ **This model:**
+ - ✅ **Narrative-aware**: Trained on actual story text, not marketing copy
+ - ✅ **Long-context transformer**: Longformer architecture (4096 token windows)
+ - ✅ **Curriculum-trained**: Progressive training from short scenes → long narratives
+ - ✅ **Evaluated on real fiction**: Tested on commercial novels and diverse short stories
+ - ✅ **Window-based inference**: Produces genre heatmaps across a book
+ - ✅ **Semantic genre detection**: Identifies literary modes (tone, structure, diction)
+ - ✅ **Catches nuance**: Distinguishes political sci-fi from space opera, literary fantasy from epic fantasy
+
+ ### What This Model Does
+
+ Instead of asking "What shelf would this book go on?", it asks:
+
+ - What narrative modes is this text using?
+ - What emotional tone and pacing patterns appear?
+ - What socio-political structures and themes are present?
+ - What genre conventions guide the storytelling?
+
+ **Example**: *A Memory Called Empire* → The model correctly identifies it as science_fiction + literary + romance, not just "space opera." That's **literary-correct**, not bookstore-correct.
+
+ ## Intended Uses
+
+ ### Primary Use Cases
+
+ 1. **Fiction RAG systems**: Cluster and retrieve by narrative style/tone
+ 2. **Book recommendation engines**: "Find books that feel like X"
+ 3. **Writing assistants**: Analyze draft chapters for genre consistency
+ 4. **Story AI agents**: Condition generation on narrative mode
+ 5. **Dataset curation**: Filter fiction corpora by semantic genre
+ 6. **Subgenre classification**: Build specialized heads (e.g., "cozy mystery", "grimdark fantasy")
+ 7. **Narrative embeddings**: Use hidden states for "similar writing" search
+ 8. **Genre arc analysis**: Track how genre shifts across a book's chapters
+
+ ### What This Model Is NOT For
+
+ - ❌ Classifying book blurbs or marketing copy (trained on story text)
+ - ❌ Single-sentence genre detection (needs narrative context)
+ - ❌ Non-fiction classification (trained exclusively on fiction)
+ - ❌ Multi-label prediction (the model is trained to pick one dominant genre, although it does output a probability for every genre)
+
+ ## Model Architecture
+
+ - **Base Model**: `allenai/longformer-base-4096`
+ - **Architecture**: Longformer (efficient self-attention for long documents)
+ - **Max Sequence Length**: 4096 tokens
+ - **Parameters**: ~149M (backbone) + classification head
+ - **Training Strategy**: Curriculum learning (short → long)
+ - **Genres**: 13 semantic categories
+
+ ### Genre Labels
+
+ The model predicts **13 semantic narrative genres**:
+
+ - **adventure**: Narrative mode, not bookstore category
+ - **contemporary**: Narrative mode, not bookstore category
+ - **crime**: Narrative mode, not bookstore category
+ - **fantasy**: Narrative mode, not bookstore category
+ - **historical**: Narrative mode, not bookstore category
+ - **horror**: Narrative mode, not bookstore category
+ - **literary**: Narrative mode, not bookstore category
+ - **mystery**: Narrative mode, not bookstore category
+ - **romance**: Narrative mode, not bookstore category
+ - **science_fiction**: Narrative mode, not bookstore category
+ - **thriller**: Narrative mode, not bookstore category
+ - **war**: Narrative mode, not bookstore category
+ - **western**: Narrative mode, not bookstore category
+
+ These represent **literary modes** and **narrative structures**, not marketing labels.
+
+ ## Training Data
+
+ - **Training set**: Diverse corpus of fiction excerpts and scenes
+ - **Curriculum strategy**: Progressive training from 500-token scenes to 4000-token chapters
+ - **Validation set**: 52 original short stories (4 per genre × 13 genres)
+ - **Writing styles**: Literary, indie, and blockbuster prose
+
+ The training emphasizes:
+ 1. Narrative structure and pacing
+ 2. Diction and tone
+ 3. Thematic elements
+ 4. Character dynamics and focalization
+ 5. Genre conventions and tropes
+
+ ## Performance
+
+ ### Evaluation Results (52-Story Validation Set)
+
+ **Overall Accuracy**: 67.31% (35/52 stories correct)
+
+ This was achieved at only **8.6% through training** (checkpoint 2000/23226 steps), suggesting significant room for improvement with full training.
+
+ ### Genre-Specific Performance
+
+ | Tier | Accuracy range | Genres |
+ |------|----------------|--------|
+ | Excellent (≥75%) | 75-100% | war (100%), western (100%), romance (100%), science_fiction (75%), horror (75%), literary (75%), adventure (75%), historical (75%), mystery (75%) |
+ | Good (50-74%) | 50-74% | contemporary (50%), fantasy (50%) |
+ | Challenging (<50%) | <50% | crime (25%), thriller (0%) |
+
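The overall figure can be cross-checked against the per-genre numbers. The sketch below assumes 4 validation stories per genre (as stated for the validation set), so each per-genre percentage maps to a whole number of correct stories. The `correct` dictionary is derived from the percentages in the table, not read from a results file.

```python
# Correct stories per genre, derived from the per-genre percentages
# (assumption: 4 validation stories per genre, so 25% -> 1, 50% -> 2, ...).
STORIES_PER_GENRE = 4
correct = {
    "adventure": 3, "contemporary": 2, "crime": 1, "fantasy": 2,
    "historical": 3, "horror": 3, "literary": 3, "mystery": 3,
    "romance": 4, "science_fiction": 3, "thriller": 0, "war": 4,
    "western": 4,
}

total_correct = sum(correct.values())
total_stories = STORIES_PER_GENRE * len(correct)
overall_accuracy = total_correct / total_stories

# Fraction of training completed at the evaluated checkpoint
training_progress = 2000 / 23226

print(f"{total_correct}/{total_stories} = {overall_accuracy:.2%}")
print(f"training progress: {training_progress:.1%}")
```

The per-genre counts sum to the 35/52 reported above, which suggests the per-genre percentages and the overall accuracy describe the same evaluation run.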
+ ### Known Limitations
+
+ 1. **Thriller confusion**: Model struggles to distinguish thriller from mystery/crime (0% accuracy)
+ 2. **Crime vs. Mystery**: Confuses criminal perspective (crime) with investigative perspective (mystery)
+ 3. **Character-driven blur**: Literary, contemporary, and romance can overlap when character-focused
+ 4. **Experimental prose**: Indie/experimental writing styles reduce accuracy slightly
+
+ ### Strengths
+
+ - Excellent at genres with clear setting/tone markers (war, western, sci-fi, horror)
+ - Handles literary fiction with nuanced themes
+ - Distinguishes romance as narrative driver vs. subplot
+ - Recognizes historical context as central vs. background
+
+ ## How to Use
+
+ ### Basic Classification
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load the tokenizer and model, then switch to inference mode (disables dropout)
+ model_name = "Mitchins/longformer-fiction-genre"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+ model.eval()
+
+ # Classify a text
+ text = """Your story text here (can be up to 4096 tokens)"""
+
+ # Anything beyond 4096 tokens is truncated
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
+ with torch.no_grad():  # no gradients needed for inference
+     outputs = model(**inputs)
+
+ # Get prediction
+ probs = torch.softmax(outputs.logits, dim=-1)
+ predicted_class = torch.argmax(probs, dim=-1).item()
+ predicted_genre = model.config.id2label[predicted_class]
+ confidence = probs[0][predicted_class].item()
+
+ print(f"Genre: {predicted_genre} ({confidence:.2%} confidence)")
+ ```
+
+ ### Windowed Book Classification
+
+ For full novels, use sliding windows to analyze genre distribution:
+
+ ```python
+ from collections import Counter
+
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ def classify_book_windowed(text, window_size=3500, stride=1750):
+     """
+     Classify a full book using overlapping windows.
+     Returns genre distribution across the book.
+     """
+     tokenizer = AutoTokenizer.from_pretrained("Mitchins/longformer-fiction-genre")
+     model = AutoModelForSequenceClassification.from_pretrained("Mitchins/longformer-fiction-genre")
+     model.eval()
+
+     # Tokenize full text (special tokens are added per window below)
+     tokens = tokenizer.encode(text, add_special_tokens=False)
+
+     # Create overlapping windows
+     windows = []
+     for i in range(0, len(tokens), stride):
+         window = tokens[i:i + window_size]
+         if len(window) > 100:  # Skip tiny windows
+             windows.append(window)
+         if i + window_size >= len(tokens):
+             break
+
+     # Classify each window
+     genre_votes = []
+     for window_tokens in windows:
+         # Wrap the window in <s> ... </s>: the classification head reads the first (<s>) position
+         input_ids = torch.tensor(
+             [[tokenizer.cls_token_id] + window_tokens + [tokenizer.sep_token_id]]
+         )
+         with torch.no_grad():
+             outputs = model(input_ids=input_ids)
+         pred = torch.argmax(outputs.logits, dim=-1).item()
+         genre_votes.append(model.config.id2label[pred])
+
+     # Aggregate results (one vote per window)
+     return Counter(genre_votes)
+
+ # Usage
+ with open("your_book.txt", "r", encoding="utf-8") as f:
+     book_text = f.read()
+
+ genre_dist = classify_book_windowed(book_text)
+ print("Genre distribution:", genre_dist)
+ ```
+
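Majority voting over windows discards each window's confidence. One alternative, sketched here as an assumption rather than as the repository's inference script, is to keep the softmax probabilities of every window and average them. The per-window rows are also the raw data for a genre heatmap across the book. The helper name `average_window_probs` and the toy numbers are hypothetical.

```python
from typing import Dict, List

def average_window_probs(
    window_probs: List[List[float]], id2label: Dict[int, str]
) -> Dict[str, float]:
    """Average per-window class probabilities into one book-level distribution."""
    if not window_probs:
        return {}
    n_windows = len(window_probs)
    n_classes = len(window_probs[0])
    means = [
        sum(row[c] for row in window_probs) / n_windows for c in range(n_classes)
    ]
    # Sort so the dominant genres come first
    ranked = sorted(range(n_classes), key=lambda c: means[c], reverse=True)
    return {id2label[c]: means[c] for c in ranked}

# Toy example with 3 classes and 2 windows (made-up numbers)
labels = {0: "literary", 1: "romance", 2: "science_fiction"}
probs = [[0.2, 0.1, 0.7], [0.5, 0.2, 0.3]]
book_dist = average_window_probs(probs, labels)
print(book_dist)
```

With the real model, each row could be produced inside the window loop as `torch.softmax(outputs.logits, dim=-1)[0].tolist()`, and `model.config.id2label` would supply the label mapping.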
+ ### Extract Narrative Embeddings
+
+ Use the model's hidden states as narrative embeddings:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained("Mitchins/longformer-fiction-genre")
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "Mitchins/longformer-fiction-genre",
+     output_hidden_states=True
+ )
+ model.eval()
+
+ text = """Your story text here"""
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Get final layer embedding (mean pooling over tokens)
+ hidden_states = outputs.hidden_states[-1]  # Last layer
+ embedding = hidden_states.mean(dim=1)      # Mean pool -> one 768-dim vector per text
+
+ # Use for similarity search, clustering, etc.
+ ```
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{longformer_fiction_genre,
+   title={Longformer Fiction Genre Classifier: A Narrative Semantic Genre Model},
+   author={Mitchell Currie},
+   year={2024},
+   publisher={HuggingFace},
+   howpublished={\url{https://huggingface.co/Mitchins/longformer-fiction-genre}}
+ }
+ ```
+
+ ## Related Resources
+
+ - **Validation Dataset**: [fiction-genre-validation-52](https://huggingface.co/datasets/Mitchins/fiction-genre-validation-52)
+ - **Evaluation Results**: See FINAL_EVALUATION_52_STORIES.md in the repository
+ - **Training Details**: See training documentation for curriculum strategy
+ - **Inference Script**: Windowed classification script available in repository
+
+ ## Model Card Authors
+
+ Mitchell Currie
+
+ ## License
+
+ MIT License
+
+ ---
+
+ ## Future Directions
+
+ This model represents a **building block** for narrative intelligence:
+
+ 1. **Fiction-trained multimodal RAG**: Combine with embeddings for narrative retrieval
+ 2. **Subgenre specialization**: Fine-tune heads for niche genres (cozy mystery, progression fantasy)
+ 3. **Genre-aware generation**: Condition story generation on semantic genre
+ 4. **Cross-genre detection**: Extend to multi-label for genre-blending
+ 5. **Style transfer**: Use embeddings to guide prose style transformation
+ 6. **Narrative arc tracking**: Analyze how genre/tone evolves through a story
+
+ **Welcome to Narrative ML. This is just the beginning. 🧠📚**
config.json ADDED
@@ -0,0 +1,73 @@
+ {
+   "architectures": [
+     "LongformerForSequenceClassification"
+   ],
+   "attention_mode": "longformer",
+   "attention_probs_dropout_prob": 0.1,
+   "attention_window": [
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512
+   ],
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "adventure",
+     "1": "contemporary",
+     "2": "crime",
+     "3": "fantasy",
+     "4": "historical",
+     "5": "horror",
+     "6": "literary",
+     "7": "mystery",
+     "8": "romance",
+     "9": "science_fiction",
+     "10": "thriller",
+     "11": "war",
+     "12": "western"
+   },
+   "ignore_attention_mask": false,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "adventure": 0,
+     "contemporary": 1,
+     "crime": 2,
+     "fantasy": 3,
+     "historical": 4,
+     "horror": 5,
+     "literary": 6,
+     "mystery": 7,
+     "romance": 8,
+     "science_fiction": 9,
+     "thriller": 10,
+     "war": 11,
+     "western": 12
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 4098,
+   "model_type": "longformer",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "onnx_export": false,
+   "pad_token_id": 1,
+   "problem_type": "single_label_classification",
+   "sep_token_id": 2,
+   "torch_dtype": "float32",
+   "transformers_version": "4.54.0",
+   "type_vocab_size": 1,
+   "vocab_size": 50265
+ }
final_eval_results.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "eval_loss": 0.07908330112695694,
+   "eval_accuracy": 0.981617379931701,
+   "eval_f1_macro": 0.9816376629462489,
+   "eval_f1_weighted": 0.9816396709662533,
+   "eval_runtime": 2373.738,
+   "eval_samples_per_second": 11.596,
+   "eval_steps_per_second": 0.966,
+   "epoch": 1.3562386980108498
+ }
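The throughput fields allow a rough sanity check on the size of this evaluation run. Assuming the usual meaning of these Trainer-style fields (samples per second = samples / runtime), the values imply an evaluation set in the tens of thousands of samples, far larger than the 52-story validation set behind the README's 67.31% figure, so the two accuracies are probably not measured on the same data. The results below are estimates from rounded values, not reported counts.

```python
eval_runtime = 2373.738        # seconds, from final_eval_results.json
samples_per_second = 11.596
steps_per_second = 0.966

# Estimated totals (assumption: rate = count / runtime)
approx_samples = eval_runtime * samples_per_second
approx_steps = eval_runtime * steps_per_second
approx_batch_size = samples_per_second / steps_per_second

print(f"~{approx_samples:.0f} samples, ~{approx_steps:.0f} steps, "
      f"~{approx_batch_size:.1f} samples per step")
```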
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ecc09bf51c9c12b33542915c616bbafaed4be179b325cc992719df1300cb5b31
+ size 594712020
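As a rough consistency check, the pointer's `size` can be related to the parameter count: with `float32` weights (see `torch_dtype` in `config.json`), each parameter takes 4 bytes. The sketch below ignores the small safetensors header, so it slightly overestimates the count.

```python
size_bytes = 594_712_020   # from the Git LFS pointer
bytes_per_param = 4        # float32

# Upper-bound estimate: the file also contains a small metadata header
approx_params = size_bytes / bytes_per_param
print(f"~{approx_params / 1e6:.1f}M parameters")
```

The estimate is in line with the "~149M (backbone) + classification head" stated in the README.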
phase3_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "train_runtime": 1.7731,
+   "train_samples_per_second": 279440.99,
+   "train_steps_per_second": 5822.721,
+   "total_flos": 4.4142344258995814e+17,
+   "train_loss": 0.0,
+   "epoch": 1.3562386980108498
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50264": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "LongformerTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:93397c42c0040f6662c80addcd9b88f9479cde906b58393e345c3b95389c2de8
+ size 5841
vocab.json ADDED
The diff for this file is too large to render. See raw diff