---
title: Cloze Reader
emoji: 📚
colorFrom: yellow
colorTo: gray
sdk: docker
pinned: true
---

An interactive reading comprehension game using AI to generate cloze (fill-in-the-blank) exercises from public domain literature.

## Overview

The Cloze Reader transforms passages from Project Gutenberg into adaptive vocabulary exercises. It uses Gemma-3 models to select contextually appropriate words for deletion and provide hints through a chat interface. Rather than generating novel text, the system surfaces forgotten public domain literature and invites sustained engagement with specific texts.

## Historical Context

**Educational cloze testing (1953)**: Wilson L. Taylor introduced the cloze procedure, systematically deleting words from passages to measure reading comprehension. It became standard in U.S. educational assessment by the 1960s.

**Masked language modeling (2018)**: BERT and subsequent models rediscovered cloze methodology independently as a training objective, randomly masking tokens and predicting from context.

**This project**: Uses language models trained on prediction tasks to generate prediction exercises for human readers. While Gemma-3 uses next-token prediction rather than masked language modeling, the system demonstrates how assessment and training methodologies are now instrumentalized through identical computational systems.

## Architecture

```tree
Page Load → app.js
├─ bookDataService.js → Hugging Face Datasets API (manu/project_gutenberg)
├─ clozeGameEngine.js → Game logic and word selection
├─ aiService.js → Gemma-3-27b word generation and hints
└─ leaderboardService.js → localStorage persistence

User Flow:
├─ Input validation → app.js
├─ Chat help → chatInterface.js
├─ Answer submission → clozeGameEngine.js scoring
└─ Level progression → Round advancement
```

### Key Modules

- **bookDataService.js**: Streams passages from 70,000+ Project Gutenberg texts, with embedded classics as a local fallback
- **aiService.js**: Gemma-3-27b (OpenRouter) for production; Gemma-3-12b on port 1234 for local deployment
- **clozeGameEngine.js**: Level-aware difficulty, word selection, content quality filtering
- **chatInterface.js**: Socratic hints per blank with persistent conversation history
- **leaderboardService.js**: Top 10 high scores and player stats persisted via localStorage (sketched below)
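
A minimal sketch of the localStorage persistence, assuming a storage key and record shape that may differ from the actual leaderboardService.js:

```javascript
// Illustrative only: the key name and record shape are assumptions.
const KEY = 'clozeReaderLeaderboard';

function saveScore(player, score) {
  const board = JSON.parse(localStorage.getItem(KEY) || '[]');
  board.push({ player, score, date: Date.now() });
  board.sort((a, b) => b.score - a.score);                       // highest first
  localStorage.setItem(KEY, JSON.stringify(board.slice(0, 10))); // keep top 10
}

function topScores() {
  return JSON.parse(localStorage.getItem(KEY) || '[]');
}
```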

### Difficulty System

- **Levels 1-5**: 1 blank, easier vocab (4-7 letters), full hints
- **Levels 6-10**: 2 blanks, medium vocab (4-10 letters), partial hints
- **Levels 11+**: 3 blanks, challenging vocab (5-14 letters), minimal hints

Scoring: a round passes only if the single blank is correct (1 blank), both blanks are correct (2 blanks), or all but one are correct (3+ blanks); see the sketch below.
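
A minimal sketch of this rule; `roundPassed` is an illustrative name, not the actual clozeGameEngine.js API:

```javascript
// Pass/fail check for a round: 1-2 blanks require all correct,
// 3+ blanks tolerate a single miss.
function roundPassed(correctCount, blankCount) {
  const allowedMisses = blankCount >= 3 ? 1 : 0;
  return blankCount - correctCount <= allowedMisses;
}

roundPassed(1, 1); // true
roundPassed(1, 2); // false: both blanks must be correct
roundPassed(2, 3); // true: all but one suffices at 3+ blanks
```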

## Data Pipeline

**Primary source**: [manu/project_gutenberg](https://huggingface.co/datasets/manu/project_gutenberg) on Hugging Face (70,000+ texts, continuously updated)
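
As a sketch, one way a browser client can pull a row from this dataset is the Hugging Face datasets-server rows endpoint; the `config`, `split`, and column name below are assumptions, not necessarily what bookDataService.js requests:

```javascript
// Assumed parameters (config=en, column `text`); verify against the dataset viewer.
async function fetchSampleText() {
  const url = 'https://datasets-server.huggingface.co/rows'
    + '?dataset=manu%2Fproject_gutenberg&config=en&split=train&offset=0&length=1';
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HF Datasets API unavailable: ${res.status}`);
  const { rows } = await res.json();
  return rows[0].row.text; // raw book text for downstream passage extraction
}
```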

**Content processing**:

- Removes Project Gutenberg metadata, chapter headers, page numbers
- Statistical quality filtering: caps ratio, punctuation density, sentence structure
- Pattern detection for dictionaries, technical material, references
- Passages scoring above a quality threshold of 3 are rejected (sketched below)
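
A sketch of what such filtering can look like; the metrics and weights here are assumptions, not the production thresholds:

```javascript
// Illustrative heuristics only; the actual scoring may differ.
function qualityScore(passage) {
  let score = 0;
  const letters = (passage.match(/[a-zA-Z]/g) || []).length;
  const caps = (passage.match(/[A-Z]/g) || []).length;
  if (letters && caps / letters > 0.3) score += 2;  // caps-heavy headers
  const punct = (passage.match(/[^\w\s]/g) || []).length;
  if (punct / passage.length > 0.15) score += 2;    // tables, references, dictionaries
  const sentences = passage.split(/[.!?]+/).filter(s => s.trim());
  const avgWords = sentences.reduce((n, s) => n + s.trim().split(/\s+/).length, 0)
    / (sentences.length || 1);
  if (avgWords < 5) score += 2;                     // fragmented, list-like prose
  return score;
}

const usable = (passage) => qualityScore(passage) <= 3; // reject when score > 3
```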

**Level-aware selection**:

- Levels 1-2: 1900s texts
- Levels 3-4: 1800s texts
- Levels 5+: Any period

## Technology Stack

**Frontend**: Vanilla JavaScript ES6 modules, no build process

**Backend**: FastAPI for static serving and secure API key injection

**Models**:

- Production: Gemma-3-27b via OpenRouter
- Local: Gemma-3-12b on port 1234 (LM Studio, ollama, or any OpenAI-compatible server)

**State**: localStorage only (no backend database)

## Quick Start

### Docker (Recommended)

```bash
docker build -t cloze-reader .
docker run -p 7860:7860 -e OPENROUTER_API_KEY=your_key cloze-reader
# Access at http://localhost:7860
```

### Local Development

```bash
# With FastAPI
pip install -r requirements.txt
python app.py
# Access at http://localhost:7860

# Simple HTTP server
python -m http.server 8000
# Access at http://localhost:8000
```

### Local LLM

```bash
# Start LLM server on port 1234 (LM Studio, etc.)
# Then open the app with the local flag:
# http://localhost:8000?local=true
```
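
Internally, the `?local=true` flag would steer the client toward the OpenAI-compatible endpoint; a sketch, with base URLs and model IDs as assumptions about aiService.js:

```javascript
// Endpoint and model names are illustrative, not the actual configuration.
const useLocal = new URLSearchParams(window.location.search).get('local') === 'true';
const API_BASE = useLocal
  ? 'http://localhost:1234/v1'       // LM Studio / ollama OpenAI-compatible server
  : 'https://openrouter.ai/api/v1';  // OpenRouter production endpoint
const MODEL = useLocal ? 'gemma-3-12b' : 'google/gemma-3-27b-it';
```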

## Environment Variables

- `OPENROUTER_API_KEY`: Required for production (get from [openrouter.ai](https://openrouter.ai))
- `HF_API_KEY`: Optional, for Hugging Face APIs
- `HF_TOKEN`: Optional, for Hub leaderboard sync

## Development Commands

```bash
make install          # Install Python and Node.js dependencies
make dev             # Start dev server (simple HTTP)
make dev-python      # Start FastAPI dev server
make docker-build    # Build Docker image
make docker-run      # Run container
make docker-dev      # Full Docker dev environment
make clean           # Clean build artifacts
make logs            # View container logs
make stop            # Stop containers
```

## Design Philosophy

- **Vanilla JS, no build step**: Keeps code visible and modifiable
- **Open-weight Gemma models**: Enables local deployment and inspection
- **Streaming from Project Gutenberg**: Reproducible without proprietary content
- **Local LLM support**: No API dependency
- **No backend database**: Full client-side auditability
- **Mid-century aesthetic**: Temporal distance from contemporary algorithmic systems

## Error Handling

**AI Service**:

1. Retry with exponential backoff (up to 3 attempts; sketched after this list)
2. Response extraction hierarchy (message.content → reasoning → reasoning_details → regex)
3. Manual word selection fallback
4. Generic hint generation fallback
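
A sketch of step 1; the wrapper name and shape are illustrative, not the actual aiService.js code:

```javascript
// Retry an async call with exponential backoff: 1s, 2s, 4s, ...
async function withRetry(fn, attempts = 3, baseDelayMs = 1000) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err;  // retries exhausted
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
}
```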

**Content Service**:

1. HF API availability check before streaming
2. Preloaded book cache
3. 10 embedded classics guarantee offline functionality
4. Quality validation retry with different passages
5. 15-second request timeout with sequential processing fallback (sketched below)
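
A sketch of step 5's timeout using AbortController; the function name is illustrative:

```javascript
// Abort the fetch if no response arrives within timeoutMs.
async function fetchWithTimeout(url, timeoutMs = 15000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```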

## Critical Questions

1. What happens when training and assessment methodologies use identical computational systems?
2. Can algorithmic selection trained on internet-scale data capture pedagogical intent?
3. When both humans and models solve prediction tasks using similar heuristics, where is comprehension?
4. What's gained/lost when authority shifts from institutional expertise to interrogable algorithms?
5. What does deep engagement with finite texts mean in an age of infinite algorithmic generation?
6. How do we surface public domain texts that have been appropriated relentlessly as training data?

## References

- Matsumori, A., et al. (2023). CLOZER: Generating open cloze questions with masked language models. EMNLP.
- Ondov, B., et al. (2024). Masked language models as natural generators for cloze questions. NAACL.
- Zhang, Y., & Hashimoto, K. (2021). What do language models learn about the structure of their language? ACL.

## Attribution

Created by [Zach Muhlbauer](https://huggingface.co/milwright) at CUNY Graduate Center.

Development space: [huggingface.co/spaces/milwright/cloze-reader](https://huggingface.co/spaces/milwright/cloze-reader)

Dataset: [manu/project_gutenberg](https://huggingface.co/datasets/manu/project_gutenberg)