Longformer Fiction Genre Classifier
Model Description
This model assigns semantic narrative genres to long-form fiction. Rather than predicting marketing categories or analyzing book descriptions, it identifies narrative modes from the story structure, tone, diction, and thematic elements of the fiction text itself.
Approach
The model is trained on story text (not blurbs or metadata) and uses a Longformer architecture to handle long contexts (up to 4096 tokens). It was trained with curriculum learning, progressively moving from short scenes to full chapters. For inference on complete books, sliding windows can be used to produce genre distributions across the text.
Key differences from typical genre classifiers:
- Trained on narrative text rather than short descriptions
- 4096-token context window (enough for full chapters)
- Curriculum learning approach (short to long)
- Tested on commercial novels and diverse short stories
- Supports windowed inference for book-length texts
Model Architecture
- Base Model: allenai/longformer-base-4096
- Architecture: Longformer with efficient self-attention for long documents
- Max Sequence Length: 4096 tokens
- Parameters: ~149M (backbone) + classification head
- Training Strategy: Curriculum learning (500-token scenes to 4000-token chapters)
- Genres: 13 semantic categories
Genre Labels
The model predicts 13 semantic narrative genres representing literary modes rather than bookstore categories:
adventure, contemporary, crime, fantasy, historical, horror, literary, mystery, romance, science_fiction, thriller, war, western
Training Data
- Training corpus: Fiction excerpts and scenes spanning multiple genres
- Curriculum strategy: Progressive training from 500-token scenes to 4000-token chapters
- Validation set: 52 original short stories (4 per genre, multiple writing styles)
- Focus areas: Narrative structure, pacing, diction, tone, thematic elements, character dynamics
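The curriculum strategy above can be pictured as a series of stages with growing length caps. The sketch below is illustrative only: the 500 and 4000 endpoints come from this card, while the intermediate caps, the cumulative staging, and the helper name `curriculum_stages` are assumptions rather than the actual training schedule.

```python
from typing import Dict, List, Sequence


def curriculum_stages(
    token_lengths: Sequence[int],
    caps: Sequence[int] = (500, 1000, 2000, 4000),  # intermediate caps are assumed
) -> Dict[int, List[int]]:
    """Map each length cap to the indices of examples that fit under it.

    Stages are cumulative (an assumption): a later stage still contains the
    short examples from earlier stages.
    """
    stages: Dict[int, List[int]] = {}
    for cap in caps:
        stages[cap] = [i for i, n in enumerate(token_lengths) if n <= cap]
    return stages


# Toy token counts for five hypothetical excerpts.
lengths = [450, 3900, 800, 1500, 500]
stages = curriculum_stages(lengths)
for cap, indices in stages.items():
    print(f"cap={cap}: {len(indices)} examples")
```

Training would then visit the stages in order of increasing cap, so the model sees short scenes before full chapters.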
Performance
Evaluation Results
Overall Accuracy: 73.08% (38/52 stories correct)
Inference Speed: Average 0.198s per story (±0.076s)
Per-Genre Performance
| Genre | Accuracy |
|---|---|
| adventure | 100% (4/4) |
| literary | 100% (4/4) |
| romance | 100% (4/4) |
| war | 100% (4/4) |
| western | 100% (4/4) |
| historical | 75% (3/4) |
| horror | 75% (3/4) |
| mystery | 75% (3/4) |
| science_fiction | 75% (3/4) |
| crime | 50% (2/4) |
| fantasy | 50% (2/4) |
| contemporary | 25% (1/4) |
| thriller | 25% (1/4) |
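As a consistency check, the overall accuracy can be recomputed from the per-genre counts in the table above. This is plain arithmetic on the reported numbers, not a rerun of the evaluation.

```python
# Correct predictions per genre, copied from the table above (4 stories each).
correct = {
    "adventure": 4, "literary": 4, "romance": 4, "war": 4, "western": 4,
    "historical": 3, "horror": 3, "mystery": 3, "science_fiction": 3,
    "crime": 2, "fantasy": 2, "contemporary": 1, "thriller": 1,
}
STORIES_PER_GENRE = 4

total_correct = sum(correct.values())
total_stories = STORIES_PER_GENRE * len(correct)
overall_accuracy = total_correct / total_stories

print(f"{total_correct}/{total_stories} = {overall_accuracy:.2%}")  # 38/52 = 73.08%
```

With four stories per genre, a single story moves a genre's accuracy by 25 percentage points, so the per-genre figures are coarse.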
Known Limitations
- Contemporary fiction: Low accuracy (25%, 1 of 4 stories); often misclassified as literary or crime
- Thriller classification: Low accuracy (25%, 1 of 4 stories); confused with crime, horror, and literary
- Crime vs. Mystery confusion: Some overlap between criminal and investigative perspectives
- Literary over-prediction: Model occasionally defaults to literary for complex character-driven narratives
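Since several of these confusions involve neighbouring genres, it may help to look at the runner-up label as well as the top prediction. The sketch below assumes the class probabilities have already been collected into a plain dict of label to probability (for example from the softmax output and `model.config.id2label`). The helper name `top_k_genres` and the numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def top_k_genres(probabilities: Dict[str, float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k most probable genres, highest first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up probabilities for illustration; not real model output.
example = {"crime": 0.41, "mystery": 0.37, "thriller": 0.12, "literary": 0.10}
print(top_k_genres(example))
```

A small gap between the first and second label, as in this toy example, suggests the text sits between two genres rather than clearly in one.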
Strengths
- Strong performance on genres with distinctive setting/tone markers
- Handles literary fiction with complex themes
- Distinguishes romance as primary driver vs. subplot element
- Recognizes when historical setting is central vs. background
Usage
Basic Classification
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Mitchins/longformer-fiction-genre-13g"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = """Your story text here (up to 4096 tokens)"""
# Longer inputs are truncated to the model's 4096-token limit.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Inference only, so gradients are not needed.
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()
predicted_genre = model.config.id2label[predicted_class]
confidence = probs[0][predicted_class].item()
print(f"Genre: {predicted_genre} ({confidence:.2%} confidence)")
Windowed Classification for Full Books
from collections import Counter

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "Mitchins/longformer-fiction-genre-13g"

def classify_book_windowed(text, window_size=3500, stride=1750):
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()

    # Tokenize the whole book once without special tokens; they are added per window.
    tokens = tokenizer.encode(text, add_special_tokens=False)
    genre_votes = []
    for i in range(0, len(tokens), stride):
        window = tokens[i:i + window_size]
        if len(window) < 100:
            continue  # too short to classify reliably
        # Wrap each window in <s> ... </s>, as the tokenizer does for a single sequence.
        input_ids = [tokenizer.cls_token_id] + window + [tokenizer.sep_token_id]
        inputs = {"input_ids": torch.tensor([input_ids])}
        with torch.no_grad():
            outputs = model(**inputs)
        pred = torch.argmax(outputs.logits, dim=-1).item()
        genre_votes.append(model.config.id2label[pred])
        if i + window_size >= len(tokens):
            break  # this window already reached the end of the book
    return Counter(genre_votes)

# Usage
with open("book.txt", "r", encoding="utf-8") as f:
    book_text = f.read()
genre_distribution = classify_book_windowed(book_text)
print("Genre distribution:", genre_distribution)
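`classify_book_windowed` returns raw vote counts. To compare books of different lengths, it can be convenient to turn those counts into fractions. The sketch below is one way to do that. The helper name `votes_to_distribution` and the example counts are hypothetical.

```python
from collections import Counter
from typing import Dict


def votes_to_distribution(votes: Counter) -> Dict[str, float]:
    """Turn window vote counts into fractions that sum to 1, largest first."""
    total = sum(votes.values())
    if total == 0:
        return {}
    return {genre: count / total for genre, count in votes.most_common()}


# Made-up vote counts for illustration; not real model output.
example_votes = Counter({"mystery": 6, "crime": 3, "thriller": 1})
distribution = votes_to_distribution(example_votes)
print(distribution)
```

Note that hard votes discard each window's confidence. Averaging the per-window softmax probabilities instead would be an alternative design, at the cost of keeping the full probability vector for every window.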
Extracting Narrative Embeddings
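import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Mitchins/longformer-fiction-genre-13g"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    output_hidden_states=True
)
text = """Your story text here (up to 4096 tokens)"""
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
# Extract embeddings from final layer: shape (batch, seq_len, hidden_size)
hidden_states = outputs.hidden_states[-1]
# Mean-pool over tokens; for padded batches, mask out padding before averaging
embedding = hidden_states.mean(dim=1)
# Use for similarity search, clustering, etc.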
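For similarity search, a common choice is cosine similarity between two pooled embeddings. The sketch below uses plain Python lists so it runs without a GPU or model download. It assumes each embedding has been converted to a list of floats first (for example with `embedding[0].tolist()`). The helper name `cosine_similarity` and the toy vectors are illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real embeddings from this model are much longer.
same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Values near 1 indicate vectors pointing the same way, and values near 0 indicate unrelated directions. Whether this tracks narrative similarity for a given corpus would need to be checked empirically.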
Use Cases
- Fiction retrieval systems (RAG) that cluster by narrative style
- Book recommendation based on narrative characteristics
- Writing analysis tools for genre consistency
- Dataset curation and filtering by semantic genre
- Building specialized subgenre classifiers on top of this base model
- Narrative similarity search
- Genre arc analysis across chapters
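As one illustration of the last item, a genre arc can be summarised by collapsing per-chapter predictions into runs. The sketch assumes one predicted label per chapter has already been obtained. The helper name `genre_arc` and the labels are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def genre_arc(chapter_genres: Sequence[str]) -> List[Tuple[str, int, int]]:
    """Collapse per-chapter labels into (genre, first_chapter, last_chapter) runs.

    Chapters are numbered from 1.
    """
    arc: List[Tuple[str, int, int]] = []
    chapter = 1
    for genre, run in groupby(chapter_genres):
        length = len(list(run))
        arc.append((genre, chapter, chapter + length - 1))
        chapter += length
    return arc


# Made-up per-chapter labels for illustration; not real model output.
labels = ["mystery", "mystery", "thriller", "thriller", "thriller", "romance"]
arc = genre_arc(labels)
print(arc)
```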
Future Directions
Potential extensions include training specialized subgenre classifier heads or adapting to more recent architectures like DeBERTa.
Citation
@model{longformer_fiction_genre,
  title={Longformer Fiction Genre Classifier},
  author={Mitchell Currie},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/Mitchins/longformer-fiction-genre-13g}}
}
Related Resources
- Validation Dataset - 52 original stories used for evaluation
- Detailed evaluation results available in model repository
License
MIT License