Longformer Fiction Genre Classifier

Model Description

This model classifies long-form fiction into semantic narrative genres. Rather than predicting marketing categories from blurbs or book descriptions, it identifies narrative modes directly from the fiction text by analyzing story structure, tone, diction, and thematic elements.

Approach

The model is trained on story text (not blurbs or metadata) and uses a Longformer architecture to handle long contexts (up to 4096 tokens). It was trained with curriculum learning, progressively moving from short scenes to full chapters. For inference on complete books, sliding windows can be used to produce genre distributions across the text.

Key differences from typical genre classifiers:

  • Trained on narrative text rather than short descriptions
  • 4096 token context window (full chapters)
  • Curriculum learning approach (short to long)
  • Tested on commercial novels and diverse short stories
  • Supports windowed inference for book-length texts

Model Architecture

  • Base Model: allenai/longformer-base-4096
  • Architecture: Longformer with efficient self-attention for long documents
  • Max Sequence Length: 4096 tokens
  • Parameters: ~149M (backbone) + classification head
  • Training Strategy: Curriculum learning (500-token scenes to 4000-token chapters)
  • Genres: 13 semantic categories

Genre Labels

The model predicts 13 semantic narrative genres representing literary modes rather than bookstore categories:

adventure, contemporary, crime, fantasy, historical, horror, literary, mystery, romance, science_fiction, thriller, war, western
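
The label names and their integer indices are stored in the model config; a quick way to list them (assuming the checkpoint ships the standard transformers id2label mapping, which the usage examples below also rely on):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Mitchins/longformer-fiction-genre-13g")
# id2label maps class indices to the 13 genre names listed above
for idx, label in sorted(config.id2label.items()):
    print(idx, label)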

Training Data

  • Training corpus: Fiction excerpts and scenes spanning multiple genres
  • Curriculum strategy: Progressive training from 500-token scenes to 4000-token chapters (a schematic sketch of this schedule follows this list)
  • Validation set: 52 original short stories (4 per genre, multiple writing styles)
  • Focus areas: Narrative structure, pacing, diction, tone, thematic elements, character dynamics
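
The training code itself is not released with this card, so the schedule below is only a minimal sketch of what the described 500-to-4000-token curriculum could look like with a standard transformers Trainer loop. The stage boundaries, epoch counts, and the raw_dataset object are illustrative assumptions, not the actual recipe.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Illustrative stages only: progressively longer excerpts, as described above.
# The actual stage boundaries and epoch counts for this checkpoint are not documented here.
CURRICULUM = [
    {"max_length": 512,  "epochs": 2},   # short scenes (~500 tokens)
    {"max_length": 1024, "epochs": 2},
    {"max_length": 2048, "epochs": 1},
    {"max_length": 4096, "epochs": 1},   # full chapters (~4000 tokens)
]

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=13
)

for stage in CURRICULUM:
    # `raw_dataset` is a hypothetical datasets.Dataset with "text" and "label" columns
    train_ds = raw_dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True,
                                max_length=stage["max_length"]),
        batched=True,
    )
    args = TrainingArguments(
        output_dir=f"ckpt-len{stage['max_length']}",
        num_train_epochs=stage["epochs"],
        per_device_train_batch_size=2,
    )
    Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer).train()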

Performance

Evaluation Results

Overall Accuracy: 73.08% (38/52 stories correct)

Inference Speed: Average 0.198s per story (±0.076s)

Per-Genre Performance

Genre              Accuracy
adventure          100% (4/4)
literary           100% (4/4)
romance            100% (4/4)
war                100% (4/4)
western            100% (4/4)
historical          75% (3/4)
horror              75% (3/4)
mystery             75% (3/4)
science_fiction     75% (3/4)
crime               50% (2/4)
fantasy             50% (2/4)
contemporary        25% (1/4)
thriller            25% (1/4)

Known Limitations

  1. Contemporary fiction: Lower accuracy (25%), often misclassified as literary or crime
  2. Thriller classification: Lower accuracy (25%), confused with crime, horror, and literary
  3. Crime vs. Mystery confusion: Some overlap between criminal and investigative perspectives
  4. Literary over-prediction: Model occasionally defaults to literary for complex character-driven narratives

Strengths

  • Strong performance on genres with distinctive setting/tone markers
  • Handles literary fiction with complex themes
  • Distinguishes romance as primary driver vs. subplot element
  • Recognizes when historical setting is central vs. background

Usage

Basic Classification

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "Mitchins/longformer-fiction-genre-13g"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = """Your story text here (up to 4096 tokens)"""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()
predicted_genre = model.config.id2label[predicted_class]
confidence = probs[0][predicted_class].item()

print(f"Genre: {predicted_genre} ({confidence:.2%} confidence)")

Windowed Classification for Full Books

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from collections import Counter

def classify_book_windowed(text, window_size=3500, stride=1750):
    tokenizer = AutoTokenizer.from_pretrained("Mitchins/longformer-fiction-genre-13g")
    model = AutoModelForSequenceClassification.from_pretrained("Mitchins/longformer-fiction-genre-13g")

    # Tokenize the whole book once, without special tokens; they are re-added per window
    tokens = tokenizer.encode(text, add_special_tokens=False)

    genre_votes = []
    for i in range(0, len(tokens), stride):
        window = tokens[i:i + window_size]
        if len(window) < 100:  # skip trailing fragments too short to classify reliably
            continue

        # Prepend <s> and append </s> so each window matches the training-time input format
        input_ids = [tokenizer.cls_token_id] + window + [tokenizer.sep_token_id]
        with torch.no_grad():
            outputs = model(input_ids=torch.tensor([input_ids]))
        pred = torch.argmax(outputs.logits, dim=-1).item()
        genre_votes.append(model.config.id2label[pred])

        if i + window_size >= len(tokens):
            break

    return Counter(genre_votes)

# Usage
with open("book.txt", "r") as f:
    book_text = f.read()

genre_distribution = classify_book_windowed(book_text)
print("Genre distribution:", genre_distribution)

Extracting Narrative Embeddings

# Reload the model with hidden states exposed; the tokenizer and `text`
# from the Basic Classification example above are reused here
model = AutoModelForSequenceClassification.from_pretrained(
    "Mitchins/longformer-fiction-genre-13g",
    output_hidden_states=True
)

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the final hidden layer into a single document-level vector
hidden_states = outputs.hidden_states[-1]  # (batch, seq_len, hidden_size)
embedding = hidden_states.mean(dim=1)      # (batch, hidden_size)

# Use for similarity search, clustering, etc.
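
As one usage example, two stories' embeddings can be compared directly. A minimal sketch, assuming the model and tokenizer loaded above and two placeholder strings story_a and story_b:

import torch.nn.functional as F

def embed(text):
    # Hypothetical helper wrapping the steps above: tokenize, forward pass, mean-pool
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[-1].mean(dim=1)

# Cosine similarity between two narrative embeddings (higher = more similar)
similarity = F.cosine_similarity(embed(story_a), embed(story_b)).item()
print(f"Narrative similarity: {similarity:.3f}")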

Use Cases

  • Fiction retrieval systems (RAG) that cluster by narrative style
  • Book recommendation based on narrative characteristics
  • Writing analysis tools for genre consistency
  • Dataset curation and filtering by semantic genre
  • Building specialized subgenre classifiers on top of this base model
  • Narrative similarity search
  • Genre arc analysis across chapters (see the sketch after this list)
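
For the genre-arc use case, each chapter can be classified on its own and the resulting sequence inspected. A minimal sketch, assuming a tokenizer and model loaded as in Basic Classification and a hypothetical chapters list of strings:

def genre_arc(chapters, tokenizer, model):
    # Classify each chapter independently and return the genre sequence in order
    arc = []
    for chapter in chapters:
        inputs = tokenizer(chapter, return_tensors="pt", truncation=True, max_length=4096)
        with torch.no_grad():
            outputs = model(**inputs)
        pred = torch.argmax(outputs.logits, dim=-1).item()
        arc.append(model.config.id2label[pred])
    return arc

# e.g. ['mystery', 'mystery', 'thriller', 'thriller', 'romance']
print(genre_arc(chapters, tokenizer, model))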

Future Directions

Potential extensions include training specialized subgenre classifier heads or adapting to more recent architectures like DeBERTa.

Citation

@misc{longformer_fiction_genre,
  title={Longformer Fiction Genre Classifier},
  author={Mitchell Currie},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/Mitchins/longformer-fiction-genre-13g}}
}

Related Resources

  • Validation Dataset - 52 original stories used for evaluation
  • Detailed evaluation results available in model repository

License

MIT License
