---
title: Cloze Reader
emoji: π
colorFrom: yellow
colorTo: gray
sdk: docker
pinned: true
---
# Cloze Reader

An interactive reading comprehension game using AI to generate cloze (fill-in-the-blank) exercises from public domain literature.
## Overview
The Cloze Reader transforms passages from Project Gutenberg into adaptive vocabulary exercises. It uses Gemma-3 models to select contextually appropriate words for deletion and provide hints through a chat interface. Rather than generating novel text, the system surfaces forgotten public domain literature and invites sustained engagement with specific texts.
## Historical Context

**Educational cloze testing (1953):** Wilson L. Taylor introduced the cloze procedure, systematically deleting words from passages to measure reading comprehension. It became standard in U.S. educational assessment by the 1960s.

**Masked language modeling (2018):** BERT and subsequent models independently rediscovered cloze methodology as a training objective, randomly masking tokens and predicting them from context.

**This project** uses language models trained on prediction tasks to generate prediction exercises for human readers. While Gemma-3 uses next-token prediction rather than masked language modeling, the system demonstrates how assessment and training methodologies are now instrumentalized through identical computational systems.
## Architecture

```
Page Load → app.js
├── bookDataService.js → Hugging Face Datasets API (manu/project_gutenberg)
├── clozeGameEngine.js → Game logic and word selection
├── aiService.js → Gemma-3-27b word generation and hints
└── leaderboardService.js → localStorage persistence

User Flow:
├── Input validation → app.js
├── Chat help → chatInterface.js
├── Answer submission → clozeGameEngine.js scoring
└── Level progression → Round advancement
```
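A minimal sketch of how `app.js` might wire these modules together. The module paths come from the diagram above, but the export names (`fetchPassage`, `createRound`, `scoreAnswers`, `saveScore`) and return shapes are assumptions for illustration, not the actual API:

```javascript
// Hypothetical wiring sketch -- real export names in the modules may differ.
import { fetchPassage } from './bookDataService.js';
import { createRound, scoreAnswers } from './clozeGameEngine.js';
import { saveScore } from './leaderboardService.js';

async function startRound(level) {
  const passage = await fetchPassage(level);        // HF Datasets stream, local fallback
  const round = await createRound(passage, level);  // AI-assisted blank selection
  document.querySelector('#passage').textContent = round.textWithBlanks;
  return round;
}

async function submitAnswers(round, answers) {
  const result = scoreAnswers(round, answers);      // level-aware pass/fail rules
  if (result.passed) saveScore(result);             // persist to localStorage
  return result;
}
```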
## Key Modules

- `bookDataService.js`: Streams from 70,000+ Project Gutenberg texts with local fallback classics
- `aiService.js`: Gemma-3-27b (OpenRouter) for production; Gemma-3-12b on port 1234 for local deployment
- `clozeGameEngine.js`: Level-aware difficulty, word selection, content quality filtering
- `chatInterface.js`: Socratic hints per blank with persistent conversation history
- `leaderboardService.js`: Top 10 high scores and player stats via localStorage (see the sketch below)
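A minimal sketch of the localStorage persistence, assuming a hypothetical storage key and record shape (the module's actual schema may differ):

```javascript
// Hypothetical key name and record shape -- illustration only.
const KEY = 'clozeReaderLeaderboard';

function saveScore(player, score) {
  const board = JSON.parse(localStorage.getItem(KEY) ?? '[]');
  board.push({ player, score, date: Date.now() });
  board.sort((a, b) => b.score - a.score);                       // highest first
  localStorage.setItem(KEY, JSON.stringify(board.slice(0, 10))); // keep top 10
}

function topScores() {
  return JSON.parse(localStorage.getItem(KEY) ?? '[]');
}
```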
## Difficulty System
- Levels 1-5: 1 blank, easier vocab (4-7 letters), full hints
- Levels 6-10: 2 blanks, medium vocab (4-10 letters), partial hints
- Levels 11+: 3 blanks, challenging vocab (5-14 letters), minimal hints
Scoring: a round passes with the single blank correct at 1 blank, both correct at 2 blanks, and all but one correct at 3+ blanks (see the sketch below).
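A sketch of that pass/fail rule (the function name is hypothetical):

```javascript
// Implements the scoring rule described above.
function roundPassed(correctCount, blankCount) {
  if (blankCount <= 2) return correctCount === blankCount; // 1-2 blanks: all correct
  return correctCount >= blankCount - 1;                   // 3+ blanks: at most one miss
}

roundPassed(1, 1); // true
roundPassed(1, 2); // false -- both must be correct
roundPassed(2, 3); // true  -- all but one suffices
```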
## Data Pipeline

**Primary source:** `manu/project_gutenberg` on Hugging Face (70,000+ texts, continuously updated)

**Content processing:**
- Removes Project Gutenberg metadata, chapter headers, page numbers
- Statistical quality filtering: caps ratio, punctuation density, sentence structure
- Pattern detection for dictionaries, technical material, references
- Passages scoring above a quality threshold of 3 are rejected (see the sketch below)
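A sketch of what the statistical filtering could look like. The specific heuristics and weights are assumptions; only the reject-above-3 rule comes from the description above:

```javascript
// Illustrative quality scoring -- thresholds and weights are assumptions.
function qualityScore(passage) {
  let score = 0;
  const caps = (passage.match(/[A-Z]/g) ?? []).length / passage.length;
  const punct = (passage.match(/[^\w\s]/g) ?? []).length / passage.length;
  if (caps > 0.3) score += 2;                     // caps ratio: likely headers or tables
  if (punct > 0.15) score += 2;                   // punctuation density: references, lists
  if (!/[.!?]\s+[A-Z]/.test(passage)) score += 1; // no sentence structure detected
  if (/^\d+\.\s/m.test(passage)) score += 2;      // numbered entries: dictionary/reference
  return score;
}

const usable = (passage) => qualityScore(passage) <= 3; // reject when score > 3
```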
**Level-aware selection** (sketched after this list):
- Levels 1-2: 1900s texts
- Levels 3-4: 1800s texts
- Levels 5+: Any period
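A sketch of the level-to-period mapping; the function name and exact date ranges are illustrative assumptions:

```javascript
// Maps game level to the publication-period filter described above.
function allowedPeriod(level) {
  if (level <= 2) return { earliest: 1900, latest: 1999 }; // 1900s texts
  if (level <= 4) return { earliest: 1800, latest: 1899 }; // 1800s texts
  return { earliest: null, latest: null };                 // any period
}
```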
## Technology Stack

**Frontend:** Vanilla JavaScript ES6 modules, no build process

**Backend:** FastAPI for static serving and secure API key injection

**Models:**
- Production: Gemma-3-27b via OpenRouter
- Local: Gemma-3-12b on port 1234 (LM Studio, Ollama, or any OpenAI-compatible server); see the sketch below

**State:** localStorage only (no backend database)
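Endpoint selection for the two deployment modes might look like the following. The OpenRouter URL follows its documented OpenAI-compatible API; the model slugs and function name are assumptions:

```javascript
// Picks a chat-completions endpoint based on the ?local=true query parameter.
function modelEndpoint() {
  const local = new URLSearchParams(window.location.search).get('local') === 'true';
  return local
    ? { url: 'http://localhost:1234/v1/chat/completions', model: 'gemma-3-12b' }
    : { url: 'https://openrouter.ai/api/v1/chat/completions', model: 'google/gemma-3-27b-it' };
}
```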
## Quick Start

### Docker (Recommended)

```bash
docker build -t cloze-reader .
docker run -p 7860:7860 -e OPENROUTER_API_KEY=your_key cloze-reader
# Access at http://localhost:7860
```
### Local Development

```bash
# With FastAPI
pip install -r requirements.txt
python app.py
# Access at http://localhost:7860

# Simple HTTP server
python -m http.server 8000
# Access at http://localhost:8000
```
### Local LLM

```bash
# Start an LLM server on port 1234 (LM Studio, etc.), then access:
http://localhost:8000?local=true
```
## Environment Variables

- `OPENROUTER_API_KEY`: Required for production (get one from openrouter.ai)
- `HF_API_KEY`: Optional, for Hugging Face APIs
- `HF_TOKEN`: Optional, for Hub leaderboard sync
## Development Commands

```bash
make install      # Install Python and Node.js dependencies
make dev          # Start dev server (simple HTTP)
make dev-python   # Start FastAPI dev server
make docker-build # Build Docker image
make docker-run   # Run container
make docker-dev   # Full Docker dev environment
make clean        # Clean build artifacts
make logs         # View container logs
make stop         # Stop containers
```
## Design Philosophy
- Vanilla JS, no build step: Keeps code visible and modifiable
- Open-weight Gemma models: Enables local deployment and inspection
- Streaming from Project Gutenberg: Reproducible without proprietary content
- Local LLM support: No API dependency
- No backend database: Full client-side auditability
- Mid-century aesthetic: Temporal distance from contemporary algorithmic systems
## Error Handling

**AI Service:**
- Retry with exponential backoff (up to 3 attempts)
- Response extraction hierarchy (`message.content` → `reasoning` → `reasoning_details` → regex), sketched after this list
- Manual word selection fallback
- Generic hint generation fallback
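A sketch of the retry and extraction logic described above. The response fields follow the OpenAI-style shape; the backoff base, function names, and regex fallback are illustrative assumptions:

```javascript
// Retry with exponential backoff, up to `attempts` tries.
async function callModel(url, apiKey, payload, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify(payload),
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return extractText(await res.json());
    } catch (err) {
      if (i === attempts - 1) throw err;                        // caller falls back to manual selection
      await new Promise((r) => setTimeout(r, 1000 * 2 ** i));   // 1s, 2s, 4s backoff
    }
  }
}

// Walks the extraction hierarchy until something yields text.
function extractText(data) {
  const msg = data.choices?.[0]?.message ?? {};
  return (
    msg.content ||                                              // 1. standard completion text
    msg.reasoning ||                                            // 2. reasoning field
    msg.reasoning_details?.[0]?.text ||                         // 3. structured reasoning details
    JSON.stringify(data).match(/"content"\s*:\s*"([^"]*)"/)?.[1] || // 4. last-resort regex
    null
  );
}
```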
**Content Service:**
- HF API availability check before streaming
- Preloaded book cache
- 10 embedded classics guarantee offline functionality
- Quality validation retry with different passages
- 15-second request timeout with sequential processing fallback
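A sketch of the 15-second timeout using `AbortController` (the helper name is an assumption):

```javascript
// Aborts the request if it exceeds the timeout; the caller can then fall
// back to sequential processing as described above.
async function fetchWithTimeout(url, ms = 15000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```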
## Critical Questions
- What happens when training and assessment methodologies use identical computational systems?
- Can algorithmic selection trained on internet-scale data capture pedagogical intent?
- When both humans and models solve prediction tasks using similar heuristics, where is comprehension?
- What's gained/lost when authority shifts from institutional expertise to interrogable algorithms?
- What does deep engagement with finite texts mean in an age of infinite algorithmic generation?
- How do we surface public domain texts that have been appropriated relentlessly as training data?
## References
- Matsumori, A., et al. (2023). CLOZER: Generating open cloze questions with masked language models. EMNLP.
- Ondov, B., et al. (2024). Masked language models as natural generators for cloze questions. NAACL.
- Zhang, Y., & Hashimoto, K. (2021). What do language models learn about the structure of their language? ACL.
## Attribution

Created by Zach Muhlbauer at the CUNY Graduate Center.

- Development space: [huggingface.co/spaces/milwright/cloze-reader](https://huggingface.co/spaces/milwright/cloze-reader)
- Dataset: [`manu/project_gutenberg`](https://huggingface.co/datasets/manu/project_gutenberg)