hadadrjt committed commit 906428b (0 parents)

Pocket TTS: Initial experimental.

Files changed (4):
  1. Dockerfile +10 -0
  2. LICENSE +13 -0
  3. README.md +11 -0
  4. app.py +1053 -0
Dockerfile ADDED
@@ -0,0 +1,10 @@
+ #
+ # SPDX-FileCopyrightText: Hadad <hadad@linuxmail.org>
+ # SPDX-License-Identifier: Apache-2.0
+ #
+
+ FROM hadadrjt/pocket-tts:hf
+
+ WORKDIR /app
+
+ COPY app.py .
LICENSE ADDED
@@ -0,0 +1,13 @@
+ Copyright (c) 2025 Hadad <hadad@linuxmail.org>
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md ADDED
@@ -0,0 +1,11 @@
+ ---
+ title: kyutai/pocket-tts
+ short_description: Pocket TTS optimized for Hugging Face Spaces on CPU
+ license: apache-2.0
+ emoji: ⚡
+ colorFrom: gray
+ colorTo: yellow
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ ---
app.py ADDED
@@ -0,0 +1,1053 @@
+ """
+ ============================================================================
+ AI-GENERATED CODE
+ ============================================================================
+ """
+
+ """
+ Pocket TTS Web Application
+ ==========================
+
+ A Gradio-based web interface for the Pocket TTS text-to-speech model.
+ This application provides an intuitive interface for generating speech
+ from text using either preset voices or voice cloning capabilities.
+
+ Features:
+ ---------
+ - Multiple preset voice options
+ - Voice cloning from uploaded audio files
+ - Configurable generation parameters (temperature, LSD steps, etc.)
+ - Real-time character counting and validation
+ - Temporary file management with automatic cleanup
+ - Thread-safe generation state management
+
+ Usage:
+ ------
+ Run this script directly to launch the web application:
+     $ python app.py
+
+ The application will be available at http://localhost:7860
+ """
+
+ import os
+ import time
+ import torch
+ import tempfile
+ import threading
+ import scipy.io.wavfile
+ import gradio as gr
+ from pocket_tts import TTSModel
+
+
+ # =============================================================================
+ # ENVIRONMENT CONFIGURATION
+ # =============================================================================
+ # Configure PyTorch threading behavior
+ torch.set_num_threads(2)  # Intra-op parallelism threads
+ torch.set_num_interop_threads(2)  # Inter-op parallelism threads
+
+
+ # =============================================================================
+ # APPLICATION CONSTANTS
+ # =============================================================================
+ # Define all configurable constants and default values used throughout
+ # the application. These values control model behavior, UI constraints,
+ # and resource management policies.
+
+ # Available preset voice options for speech generation
+ AVAILABLE_VOICES = [
+     "alba",
+     "marius",
+     "javert",
+     "jean",
+     "fantine",
+     "cosette",
+     "eponine",
+     "azelma"
+ ]
+
+ # Default configuration values
+ DEFAULT_VOICE = "alba"  # Default preset voice selection
+ DEFAULT_MODEL_VARIANT = "b6369a24"  # Model variant identifier
+ DEFAULT_TEMPERATURE = 0.7  # Generation temperature
+ DEFAULT_LSD_DECODE_STEPS = 1  # Latent space decode steps
+ DEFAULT_EOS_THRESHOLD = -4.0  # End-of-sequence detection threshold
+ DEFAULT_NOISE_CLAMP = 0.0  # Noise clamping value (0 = disabled)
+ DEFAULT_FRAMES_AFTER_EOS = 10  # Additional frames after EOS
+
+ # Input constraints and resource management
+ MAXIMUM_INPUT_LENGTH = 1000  # Maximum text input characters
+ TEMPORARY_FILE_LIFETIME_SECONDS = 7200  # Temp file retention (2 hours)
+
+ # Voice mode selection options
+ VOICE_MODE_PRESET = "Preset Voices"  # Use predefined voice
+ VOICE_MODE_CLONE = "Voice Cloning"  # Clone voice from audio
+
+ # Example prompts with associated voice presets for demonstration
+ EXAMPLE_PROMPTS_WITH_VOICES = [
+     {
+         "text": "The quick brown fox jumps over the lazy dog near the riverbank.",
+         "voice": "alba"
+     },
+     {
+         "text": "Welcome to the future of text to speech technology powered by artificial intelligence.",
+         "voice": "marius"
+     },
+     {
+         "text": "Technology continues to push the boundaries of what we thought was possible.",
+         "voice": "javert"
+     },
+     {
+         "text": "The weather today is absolutely beautiful and perfect for a relaxing walk outside.",
+         "voice": "fantine"
+     },
+     {
+         "text": "Science and innovation are transforming how we interact with the world around us.",
+         "voice": "jean"
+     }
+ ]
+
+
+ # =============================================================================
+ # THREAD SYNCHRONIZATION
+ # =============================================================================
+ # Global state management for thread-safe generation operations.
+ # These locks and flags prevent concurrent generation requests and
+ # enable graceful cancellation of ongoing operations.
+
+ generation_state_lock = threading.Lock()  # Lock for generation state access
+ is_currently_generating = False  # Flag indicating active generation
+ stop_generation_requested = False  # Flag for stop request signaling
+
+ # Temporary file registry for cleanup management
+ temporary_files_registry = {}  # Maps file paths to creation timestamps
+ temporary_files_lock = threading.Lock()  # Lock for registry access
+
+
+ # =============================================================================
+ # TEXT-TO-SPEECH MANAGER CLASS
+ # =============================================================================
+
+ class TextToSpeechManager:
+     """
+     Manages TTS model lifecycle and speech generation operations.
+
+     This class handles model loading, configuration caching, voice state
+     management, and audio generation. It implements lazy loading and
+     caching strategies to optimize performance and memory usage.
+
+     Attributes:
+         loaded_model: Currently loaded TTS model instance
+         current_configuration: Dict of current model configuration
+         voice_state_cache: Cache of computed voice states for preset voices
+
+     Example:
+         >>> manager = TextToSpeechManager()
+         >>> manager.load_or_get_model("b6369a24", 0.7, 1, None, -4.0)
+         >>> voice_state = manager.get_voice_state_for_preset("alba")
+         >>> audio = manager.generate_audio("Hello world", voice_state, 10, False)
+     """
+
+     def __init__(self):
+         """Initialize the TTS manager with empty state."""
+         self.loaded_model = None
+         self.current_configuration = {}
+         self.voice_state_cache = {}
+
+     def load_or_get_model(
+         self,
+         model_variant,
+         temperature,
+         lsd_decode_steps,
+         noise_clamp,
+         eos_threshold
+     ):
+         """
+         Load a TTS model or return cached instance if configuration matches.
+
+         This method implements lazy loading with configuration-based caching.
+         If the requested configuration differs from the currently loaded model,
+         a new model instance is created and the voice state cache is cleared.
+
+         Args:
+             model_variant: Model variant identifier string
+             temperature: Generation temperature (float, 0.1-2.0)
+             lsd_decode_steps: Number of LSD decode steps (int, 1-20)
+             noise_clamp: Maximum noise value or None to disable
+             eos_threshold: End-of-sequence detection threshold (float)
+
+         Returns:
+             TTSModel: Loaded and configured TTS model instance
+         """
+         # Process and validate input parameters with defaults
+         processed_variant = str(model_variant or DEFAULT_MODEL_VARIANT).strip()
+         processed_temperature = float(temperature) if temperature is not None else DEFAULT_TEMPERATURE
+         processed_lsd_steps = int(lsd_decode_steps) if lsd_decode_steps is not None else DEFAULT_LSD_DECODE_STEPS
+         processed_noise_clamp = float(noise_clamp) if noise_clamp and float(noise_clamp) > 0 else None
+         processed_eos_threshold = float(eos_threshold) if eos_threshold is not None else DEFAULT_EOS_THRESHOLD
+
+         # Build configuration dictionary for comparison
+         requested_configuration = {
+             "variant": processed_variant,
+             "temp": processed_temperature,
+             "lsd_decode_steps": processed_lsd_steps,
+             "noise_clamp": processed_noise_clamp,
+             "eos_threshold": processed_eos_threshold
+         }
+
+         # Load new model if configuration changed or no model loaded
+         if self.loaded_model is None or self.current_configuration != requested_configuration:
+             self.loaded_model = TTSModel.load_model(**requested_configuration)
+             self.current_configuration = requested_configuration
+             self.voice_state_cache = {}  # Clear cache on model change
+
+         return self.loaded_model
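The caching strategy in `load_or_get_model` boils down to one pattern: keep the expensive object alongside the configuration dict it was built from, and rebuild only when the requested dict differs. A minimal standalone sketch of that pattern (the `loader` callable here is a stand-in, not the real `TTSModel.load_model`):

```python
# Configuration-keyed lazy loading: the expensive factory runs only
# when the requested configuration differs from the cached one.
class ConfigCachedLoader:
    def __init__(self, loader):
        self.loader = loader      # expensive factory, e.g. a model loader
        self.instance = None      # currently loaded object
        self.configuration = {}   # configuration it was loaded with

    def load_or_get(self, **config):
        if self.instance is None or self.configuration != config:
            self.instance = self.loader(**config)
            self.configuration = config
        return self.instance

calls = []  # record every time the "expensive" loader actually runs
loader = ConfigCachedLoader(lambda **cfg: calls.append(cfg) or dict(cfg))

a = loader.load_or_get(variant="b6369a24", temp=0.7)
b = loader.load_or_get(variant="b6369a24", temp=0.7)  # same config: cache hit
c = loader.load_or_get(variant="b6369a24", temp=0.9)  # config changed: reload
```

Two identical requests share one instance; changing any value triggers a reload, which is also the moment the app clears its voice-state cache.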
+
+     def get_voice_state_for_preset(self, voice_name):
+         """
+         Get or compute voice state for a preset voice.
+
+         Voice states are cached to avoid redundant computation for
+         frequently used preset voices.
+
+         Args:
+             voice_name: Name of the preset voice (must be in AVAILABLE_VOICES)
+
+         Returns:
+             Voice state tensor for the specified preset voice
+         """
+         # Validate voice name and fall back to default if invalid
+         validated_voice = voice_name if voice_name in AVAILABLE_VOICES else DEFAULT_VOICE
+
+         # Compute and cache voice state if not already cached
+         if validated_voice not in self.voice_state_cache:
+             self.voice_state_cache[validated_voice] = self.loaded_model.get_state_for_audio_prompt(
+                 audio_conditioning=validated_voice,
+                 truncate=False
+             )
+
+         return self.voice_state_cache[validated_voice]
+
+     def get_voice_state_for_clone(self, audio_file_path):
+         """
+         Compute voice state from an uploaded audio file for voice cloning.
+
+         Unlike preset voices, cloned voice states are not cached as they
+         are typically unique per request.
+
+         Args:
+             audio_file_path: Path to the uploaded audio file
+
+         Returns:
+             Voice state tensor extracted from the audio file
+         """
+         return self.loaded_model.get_state_for_audio_prompt(
+             audio_conditioning=audio_file_path,
+             truncate=False
+         )
+
+     def generate_audio(self, text_content, voice_state, frames_after_eos, enable_custom_frames):
+         """
+         Generate speech audio from text using the specified voice state.
+
+         Args:
+             text_content: Text string to convert to speech
+             voice_state: Pre-computed voice state tensor
+             frames_after_eos: Number of frames to generate after EOS
+             enable_custom_frames: Whether to use custom frame count
+
+         Returns:
+             torch.Tensor: Generated audio waveform
+         """
+         # Apply custom frames setting if enabled
+         processed_frames = int(frames_after_eos) if enable_custom_frames else None
+
+         return self.loaded_model.generate_audio(
+             model_state=voice_state,
+             text_to_generate=text_content,
+             frames_after_eos=processed_frames,
+             copy_state=True
+         )
+
+     def save_audio_to_file(self, audio_tensor):
+         """
+         Save generated audio tensor to a temporary WAV file.
+
+         The file is registered for automatic cleanup after the configured
+         lifetime expires.
+
+         Args:
+             audio_tensor: PyTorch tensor containing audio waveform
+
+         Returns:
+             str: Path to the saved temporary WAV file
+         """
+         # Convert tensor to numpy array for scipy
+         audio_numpy_data = audio_tensor.numpy()
+         audio_sample_rate = self.loaded_model.sample_rate
+
+         # Create temporary file and write audio data
+         output_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+         scipy.io.wavfile.write(output_file.name, audio_sample_rate, audio_numpy_data)
+
+         # Register file for cleanup tracking
+         with temporary_files_lock:
+             temporary_files_registry[output_file.name] = time.time()
+
+         return output_file.name
+
+
+ # Create global TTS manager instance
+ text_to_speech_manager = TextToSpeechManager()
+
+
+ # =============================================================================
+ # UTILITY FUNCTIONS
+ # =============================================================================
+
+ def cleanup_expired_temporary_files():
+     """
+     Remove temporary files that have exceeded their lifetime.
+
+     This function is called periodically to prevent disk space exhaustion
+     from accumulated temporary audio files. Files older than
+     TEMPORARY_FILE_LIFETIME_SECONDS are removed from disk and registry.
+     """
+     current_timestamp = time.time()
+     expired_files = []
+
+     with temporary_files_lock:
+         # Identify expired files
+         for file_path, creation_timestamp in list(temporary_files_registry.items()):
+             if current_timestamp - creation_timestamp > TEMPORARY_FILE_LIFETIME_SECONDS:
+                 expired_files.append(file_path)
+
+         # Remove expired files from disk and registry
+         for file_path in expired_files:
+             try:
+                 if os.path.exists(file_path):
+                     os.remove(file_path)
+                 del temporary_files_registry[file_path]
+             except Exception:
+                 pass  # Silently ignore deletion errors
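The cleanup policy is purely time-based: the registry maps each path to its creation timestamp, and anything older than the lifetime is dropped. The expiry rule can be exercised on its own with synthetic timestamps, no filesystem involved (paths and times below are made up for illustration):

```python
# Standalone sketch of the TTL expiry rule used by
# cleanup_expired_temporary_files: registry entries older than
# `lifetime` seconds (relative to `now`) are removed and returned.
def expire_entries(registry, now, lifetime):
    expired = [path for path, created in registry.items()
               if now - created > lifetime]
    for path in expired:
        del registry[path]
    return expired

# Three fake entries created at t=0, t=6000, and t=7100 seconds;
# checked at t=7201 with the app's 7200-second lifetime.
registry = {"/tmp/a.wav": 0.0, "/tmp/b.wav": 6000.0, "/tmp/c.wav": 7100.0}
expired = expire_entries(registry, now=7201.0, lifetime=7200.0)
```

Only the first entry has aged past the 2-hour window, so it alone is expired; the real function additionally deletes the file from disk under the registry lock.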
+
+
+ def validate_text_input(text_content):
+     """
+     Validate and clean text input for speech generation.
+
+     Args:
+         text_content: Raw text input from user
+
+     Returns:
+         tuple: (is_valid: bool, result: str)
+             - If valid: (True, cleaned_text)
+             - If invalid: (False, error_message or empty string)
+     """
+     # Check for None or non-string input
+     if not text_content or not isinstance(text_content, str):
+         return False, ""
+
+     # Clean whitespace
+     cleaned_text = text_content.strip()
+
+     # Check for empty content
+     if not cleaned_text:
+         return False, ""
+
+     # Check length constraint
+     if len(cleaned_text) > MAXIMUM_INPUT_LENGTH:
+         return False, f"Input exceeds maximum length of {MAXIMUM_INPUT_LENGTH} characters."
+
+     return True, cleaned_text
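The `(is_valid, result)` contract above is worth pinning down with example calls, since the second element means different things on each path: cleaned text on success, an error message on length violations, and an empty string otherwise. The function is reproduced standalone here (with the constant inlined) purely to demonstrate the contract:

```python
# Same validation logic as validate_text_input, lifted out with the
# constant inlined so the return contract can be shown with examples.
MAXIMUM_INPUT_LENGTH = 1000

def validate_text_input(text_content):
    if not text_content or not isinstance(text_content, str):
        return False, ""                      # missing/non-string: no message
    cleaned_text = text_content.strip()
    if not cleaned_text:
        return False, ""                      # whitespace-only: no message
    if len(cleaned_text) > MAXIMUM_INPUT_LENGTH:
        return False, f"Input exceeds maximum length of {MAXIMUM_INPUT_LENGTH} characters."
    return True, cleaned_text                 # valid: stripped text

ok, text = validate_text_input("  Hello world  ")
bad, msg = validate_text_input("x" * 1001)
```

Callers key off this distinction: `perform_speech_generation` raises the message when one exists and falls back to a generic error when the payload is empty.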
+
+
+ def request_generation_stop():
+     """
+     Signal a request to stop the current generation.
+
+     Returns:
+         gr.update: Update to disable the stop button
+     """
+     global stop_generation_requested
+     stop_generation_requested = True
+     return gr.update(interactive=False)
+
+
+ # =============================================================================
+ # SPEECH GENERATION FUNCTION
+ # =============================================================================
+
+ def perform_speech_generation(
+     text_input,
+     voice_mode_selection,
+     voice_preset_selection,
+     voice_clone_audio_file,
+     model_variant,
+     lsd_decode_steps,
+     temperature,
+     noise_clamp,
+     eos_threshold,
+     frames_after_eos,
+     enable_custom_frames
+ ):
+     """
+     Perform the complete speech generation workflow.
+
+     This function orchestrates the entire generation process including:
+     validation, model loading, voice state preparation, audio generation,
+     and file saving. It handles thread safety and stop requests.
+
+     Args:
+         text_input: Text to convert to speech
+         voice_mode_selection: "Preset Voices" or "Voice Cloning"
+         voice_preset_selection: Selected preset voice name
+         voice_clone_audio_file: Path to uploaded audio for cloning
+         model_variant: Model variant identifier
+         lsd_decode_steps: Number of LSD decode steps
+         temperature: Generation temperature
+         noise_clamp: Noise clamping value
+         eos_threshold: End-of-sequence threshold
+         frames_after_eos: Frames to generate after EOS
+         enable_custom_frames: Whether to use custom frame count
+
+     Returns:
+         str or None: Path to generated audio file, or None if stopped
+
+     Raises:
+         gr.Error: On validation failure or generation error
+     """
+     global is_currently_generating, stop_generation_requested
+
+     # Run cleanup before starting new generation
+     cleanup_expired_temporary_files()
+
+     # Validate text input
+     is_valid, validation_result = validate_text_input(text_input)
+
+     if not is_valid:
+         if validation_result:
+             raise gr.Error(validation_result)
+         raise gr.Error("Please enter valid text to generate speech.")
+
+     # Validate voice cloning audio if in clone mode
+     if voice_mode_selection == VOICE_MODE_CLONE and not voice_clone_audio_file:
+         raise gr.Error("Please upload an audio file for voice cloning.")
+
+     # Acquire generation lock
+     with generation_state_lock:
+         if is_currently_generating:
+             raise gr.Error("A generation is already in progress. Please wait.")
+         is_currently_generating = True
+         stop_generation_requested = False
+
+     try:
+         # Load or retrieve cached model
+         text_to_speech_manager.load_or_get_model(
+             model_variant,
+             temperature,
+             lsd_decode_steps,
+             noise_clamp,
+             eos_threshold
+         )
+
+         # Check for stop request after model loading
+         if stop_generation_requested:
+             return None
+
+         # Prepare voice state based on mode
+         if voice_mode_selection == VOICE_MODE_CLONE:
+             voice_state = text_to_speech_manager.get_voice_state_for_clone(voice_clone_audio_file)
+         else:
+             voice_state = text_to_speech_manager.get_voice_state_for_preset(voice_preset_selection)
+
+         # Check for stop request after voice state preparation
+         if stop_generation_requested:
+             return None
+
+         # Generate audio from text
+         generated_audio = text_to_speech_manager.generate_audio(
+             validation_result,
+             voice_state,
+             frames_after_eos,
+             enable_custom_frames
+         )
+
+         # Check for stop request after generation
+         if stop_generation_requested:
+             return None
+
+         # Save audio to temporary file
+         output_file_path = text_to_speech_manager.save_audio_to_file(generated_audio)
+
+         return output_file_path
+
+     except gr.Error:
+         raise
+
+     except Exception as generation_error:
+         raise gr.Error(f"Speech generation failed: {str(generation_error)}")
+
+     finally:
+         # Always release generation lock
+         with generation_state_lock:
+             is_currently_generating = False
+             stop_generation_requested = False
+
+
+ # =============================================================================
+ # UI STATE MANAGEMENT FUNCTIONS
+ # =============================================================================
+
+ def check_generate_button_state(text_content):
+     """
+     Update generate button interactivity based on text validity.
+
+     Args:
+         text_content: Current text input content
+
+     Returns:
+         gr.update: Update with interactive state
+     """
+     is_valid, _ = validate_text_input(text_content)
+     return gr.update(interactive=is_valid)
+
+
+ def calculate_character_count_display(text_content):
+     """
+     Generate HTML for character count display with color coding.
+
+     Args:
+         text_content: Current text input content
+
+     Returns:
+         str: HTML string for character count display
+     """
+     character_count = len(text_content) if text_content else 0
+
+     # Use error color if over limit
+     display_color = (
+         "var(--error-text-color)"
+         if character_count > MAXIMUM_INPUT_LENGTH
+         else "var(--body-text-color-subdued)"
+     )
+
+     return f"<div style='text-align: right; padding: 4px 0;'><span style='color: {display_color}; font-size: 0.85em;'>{character_count} / {MAXIMUM_INPUT_LENGTH}</span></div>"
+
+
+ def determine_clear_button_visibility_idle(text_content, audio_output):
+     """
+     Determine clear button visibility based on content state.
+
+     Args:
+         text_content: Current text input content
+         audio_output: Current audio output value
+
+     Returns:
+         gr.update: Update with visibility state
+     """
+     has_text_content = bool(text_content and text_content.strip())
+     has_audio_output = audio_output is not None
+     should_show_clear = has_text_content or has_audio_output
+     return gr.update(visible=should_show_clear)
+
+
+ def update_voice_mode_visibility(voice_mode_value):
+     """
+     Update visibility of voice selection containers based on mode.
+
+     Args:
+         voice_mode_value: Selected voice mode
+
+     Returns:
+         tuple: (preset_container_update, clone_container_update)
+     """
+     if voice_mode_value == VOICE_MODE_CLONE:
+         return gr.update(visible=False), gr.update(visible=True)
+     else:
+         return gr.update(visible=True), gr.update(visible=False)
+
+
+ def switch_to_generating_state():
+     """
+     Switch UI to generation-in-progress state.
+
+     Returns:
+         tuple: Updates for (generate_button, stop_button, clear_button)
+     """
+     return (
+         gr.update(visible=False),  # Hide generate button
+         gr.update(visible=True, interactive=True),  # Show stop button
+         gr.update(visible=False)  # Hide clear button
+     )
+
+
+ def switch_to_idle_state(text_content, audio_output):
+     """
+     Switch UI back to idle state after generation.
+
+     Args:
+         text_content: Current text input content
+         audio_output: Current audio output value
+
+     Returns:
+         tuple: Updates for (generate_button, stop_button, clear_button)
+     """
+     has_text_content = bool(text_content and text_content.strip())
+     has_audio_output = audio_output is not None
+     should_show_clear = has_text_content or has_audio_output
+
+     return (
+         gr.update(visible=True),  # Show generate button
+         gr.update(visible=False),  # Hide stop button
+         gr.update(visible=should_show_clear)  # Show clear if content exists
+     )
+
+
+ def perform_clear_action():
+     """
+     Clear all input and output fields.
+
+     Returns:
+         tuple: Reset values for all clearable components
+     """
+     return (
+         "",  # Clear text input
+         None,  # Clear audio output
+         gr.update(visible=False),  # Hide clear button
+         VOICE_MODE_PRESET,  # Reset voice mode
+         DEFAULT_VOICE,  # Reset voice preset
+         None  # Clear clone audio
+     )
+
+
+ # =============================================================================
+ # EXAMPLE HANDLING FUNCTIONS
+ # =============================================================================
+
+ def create_example_handler(example_text, example_voice):
+     """
+     Create a handler function for example button clicks.
+
+     Args:
+         example_text: Example text to set
+         example_voice: Example voice to select
+
+     Returns:
+         function: Handler that sets example values
+     """
+     def set_example_values():
+         return example_text, VOICE_MODE_PRESET, example_voice
+     return set_example_values
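The factory function above exists for a reason beyond tidiness: building per-example handlers in a loop with a bare lambda over the loop variable late-binds, so every button would end up wired to the last example. The factory freezes each value at creation time. A minimal illustration of the difference (strings are placeholders):

```python
# Late binding: each lambda closes over the same loop variable, so
# all three handlers observe its final value after the loop ends.
late_bound = [lambda: text for text in ["first", "second", "third"]]

# Factory pattern (as in create_example_handler): each returned
# closure gets its own binding, frozen at the call that created it.
def make_handler(text):
    def handler():
        return text
    return handler

early_bound = [make_handler(text) for text in ["first", "second", "third"]]
```

This is why the UI wiring calls `create_example_handler(example["text"], example["voice"])` once per example instead of sharing a lambda.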
642
+
643
+
644
+ def format_example_button_label(example_text, example_voice, max_text_length=40):
645
+ """
646
+ Format example button label with voice and truncated text.
647
+
648
+ Args:
649
+ example_text: Full example text
650
+ example_voice: Voice name
651
+ max_text_length: Maximum text length before truncation
652
+
653
+ Returns:
654
+ str: Formatted button label
655
+ """
656
+ truncated_text = (
657
+ example_text[:max_text_length] + "..."
658
+ if len(example_text) > max_text_length
659
+ else example_text
660
+ )
661
+ return f"[{example_voice}] {truncated_text}"
662
+
663
+
664
+ # =============================================================================
665
+ # GRADIO APPLICATION DEFINITION
666
+ # =============================================================================
667
+
668
+ with gr.Blocks() as application:
669
+
670
+ # -------------------------------------------------------------------------
671
+ # SIDEBAR SECTION
672
+ # -------------------------------------------------------------------------
673
+ # Contains project information, description, and credits
674
+
675
+ with gr.Sidebar():
676
+ gr.HTML(
677
+ """
678
+ <h1>Audio Generation Playground part of the
679
+ <a href="https://huggingface.co/spaces/hadadxyz/ai" target="_blank">
680
+ Demo Playground</a>, and the
681
+ <a href="https://huggingface.co/umint" target="_blank">
682
+ UltimaX Intelligence</a> project.</h1><br />
683
+
684
+ This space runs the <b><a href="https://huggingface.co/kyutai/pocket-tts"
685
+ target="_blank">Pocket TTS</a></b> model from <b>Kyutai</b>.<br /><br />
686
+
687
+ A lightweight text-to-speech (TTS) application designed to run
688
+ efficiently on CPUs. Forget about the hassle of using GPUs and
689
+ web APIs serving TTS models.<br /><br />
690
+
691
+ Additionally, this space runs with a custom Docker image to
692
+ maximize the model's potential and has been optimized for the
693
+ limited scope of Hugging Face Spaces.<br /><br />
694
+
695
+ ⚠️ This space was created entirely by the
696
+ <b><a href="https://huggingface.co/hadadrjt/JARVIS" target="_blank">
697
+ J.A.R.V.I.S.</a></b> model operating in autonomous agent mode.
698
+ All code was generated by AI without human review.<br /><br />
699
+
700
+ This is an experimental space and is not part of production.
701
+ There may be minor bugs since the code was generated by AI.
702
+ However, none have been found so far.<br /><br />
703
+
704
+ If you find a bug, please report it in the community tab.<br /><br />
705
+
706
+ <b>Like this project? You can support me by buying a
707
+ <a href="https://ko-fi.com/hadad" target="_blank">coffee</a></b>
708
+ """
709
+ )
710
+
711
+ # -------------------------------------------------------------------------
712
+ # AUDIO OUTPUT SECTION
713
+ # -------------------------------------------------------------------------
714
+
715
+ audio_output_component = gr.Audio(
716
+ label="Generated Speech Output",
717
+ type="filepath",
718
+ interactive=False,
719
+ show_download_button=True
720
+ )
721
+
722
+ # -------------------------------------------------------------------------
723
+ # VOICE SELECTION SECTION
724
+ # -------------------------------------------------------------------------
725
+
726
+ with gr.Accordion("🎭 Voice Selection", open=True):
727
+ # Voice mode selector (preset vs cloning)
728
+ voice_mode_radio = gr.Radio(
729
+ label="Voice Mode",
730
+ choices=[VOICE_MODE_PRESET, VOICE_MODE_CLONE],
731
+ value=VOICE_MODE_PRESET,
732
+ info="Choose between preset voices or clone a voice from uploaded audio"
733
+ )
734
+
735
+ # Container for preset voice selection
736
+ with gr.Column(visible=True) as preset_voice_container:
737
+ voice_preset_dropdown = gr.Dropdown(
738
+ label="Select Preset Voice",
739
+ choices=AVAILABLE_VOICES,
740
+ value=DEFAULT_VOICE
741
+ )
742
+
743
+ # Container for voice cloning audio upload
744
+ with gr.Column(visible=False) as clone_voice_container:
745
+ voice_clone_audio_input = gr.Audio(
746
+ label="Upload Audio for Voice Cloning",
747
+ type="filepath"
748
+ )
749
+
750
+ # -------------------------------------------------------------------------
751
+ # GENERATION PARAMETERS SECTION
752
+ # -------------------------------------------------------------------------
753
+
754
+ with gr.Accordion("⚙️ Generation Parameters", open=False):
755
+ with gr.Row():
756
+ temperature_slider = gr.Slider(
757
+ label="Temperature",
758
+ minimum=0.1,
759
+ maximum=2.0,
760
+ step=0.05,
761
+ value=DEFAULT_TEMPERATURE,
762
+ info="Higher values produce more expressive speech"
763
+ )
764
+ lsd_decode_steps_slider = gr.Slider(
765
+ label="LSD Decode Steps",
766
+ minimum=1,
767
+ maximum=20,
768
+ step=1,
769
+ value=DEFAULT_LSD_DECODE_STEPS,
770
+ info="More steps may improve quality but slower"
771
+ )
772
+
773
+ with gr.Row():
774
+ noise_clamp_slider = gr.Slider(
775
+ label="Noise Clamp",
776
+ minimum=0.0,
777
+ maximum=2.0,
778
+ step=0.05,
779
+ value=DEFAULT_NOISE_CLAMP,
780
+ info="Maximum noise sampling value (0 = disabled)"
781
+ )
782
+ eos_threshold_slider = gr.Slider(
783
+ label="End of Sequence Threshold",
784
+ minimum=-10.0,
785
+ maximum=0.0,
786
+ step=0.25,
787
+ value=DEFAULT_EOS_THRESHOLD,
788
+ info="Smaller values cause earlier completion"
789
+ )
790
+
791
+ # -------------------------------------------------------------------------
+ # ADVANCED SETTINGS SECTION
+ # -------------------------------------------------------------------------
+
+ with gr.Accordion("🔧 Advanced Settings", open=False):
+ model_variant_textbox = gr.Textbox(
+ label="Model Variant Identifier",
+ value=DEFAULT_MODEL_VARIANT,
+ info="Model signature for generation"
+ )
+
+ with gr.Row():
+ enable_custom_frames_checkbox = gr.Checkbox(
+ label="Enable Custom Frames After EOS",
+ value=False,
+ info="Manually control post-EOS frame generation"
+ )
+ frames_after_eos_slider = gr.Slider(
+ label="Frames After EOS",
+ minimum=0,
+ maximum=100,
+ step=1,
+ value=DEFAULT_FRAMES_AFTER_EOS,
+ info="Additional frames after end-of-sequence (80ms per frame)"
+ )
+
+ # -------------------------------------------------------------------------
+ # TEXT INPUT SECTION
+ # -------------------------------------------------------------------------
+
+ text_input_component = gr.Textbox(
+ label="Prompt",
+ placeholder="Enter the text you want to convert to speech...",
+ lines=3,
+ max_lines=20,
+ max_length=MAXIMUM_INPUT_LENGTH,
+ autoscroll=True
+ )
+
+ # Character count display
+ character_count_display = gr.HTML(
+ f"<div style='text-align: right; padding: 4px 0;'><span style='color: var(--body-text-color-subdued); font-size: 0.85em;'>0 / {MAXIMUM_INPUT_LENGTH}</span></div>"
+ )
+
+ # -------------------------------------------------------------------------
+ # ACTION BUTTONS SECTION
+ # -------------------------------------------------------------------------
+
+ # Primary generate button
+ generate_button = gr.Button(
+ "🎙️ Generate Speech",
+ variant="primary",
+ size="lg",
+ interactive=False
+ )
+
+ # Stop button (visible during generation)
+ stop_button = gr.Button(
+ "⏹️ Stop Generation",
+ variant="stop",
+ size="lg",
+ visible=False
+ )
+
+ # Clear button (visible when content exists)
+ clear_button = gr.Button(
+ "🗑️ Clear",
+ variant="secondary",
+ size="lg",
+ visible=False
+ )
+
+ # -------------------------------------------------------------------------
+ # EXAMPLE PROMPTS SECTION
+ # -------------------------------------------------------------------------
+
+ gr.HTML("""
+ <div style="padding: 16px 0 8px 0;">
+ <h3 style="margin: 0 0 8px 0; font-size: 1.1em;">💡 Example Prompts</h3>
+ <p style="margin: 0; opacity: 0.7; font-size: 0.9em;">Click any example to generate speech with its assigned voice</p>
+ </div>
+ """)
+
+ # Create example buttons and collect them for event binding
+ example_buttons_list = []
+
+ with gr.Row():
+ example_button_0 = gr.Button(
+ format_example_button_label(
+ EXAMPLE_PROMPTS_WITH_VOICES[0]["text"],
+ EXAMPLE_PROMPTS_WITH_VOICES[0]["voice"]
+ ),
+ size="sm",
+ variant="secondary"
+ )
+ example_buttons_list.append(example_button_0)
+
+ example_button_1 = gr.Button(
+ format_example_button_label(
+ EXAMPLE_PROMPTS_WITH_VOICES[1]["text"],
+ EXAMPLE_PROMPTS_WITH_VOICES[1]["voice"]
+ ),
+ size="sm",
+ variant="secondary"
+ )
+ example_buttons_list.append(example_button_1)
+
+ with gr.Row():
+ example_button_2 = gr.Button(
+ format_example_button_label(
+ EXAMPLE_PROMPTS_WITH_VOICES[2]["text"],
+ EXAMPLE_PROMPTS_WITH_VOICES[2]["voice"]
+ ),
+ size="sm",
+ variant="secondary"
+ )
+ example_buttons_list.append(example_button_2)
+
+ example_button_3 = gr.Button(
+ format_example_button_label(
+ EXAMPLE_PROMPTS_WITH_VOICES[3]["text"],
+ EXAMPLE_PROMPTS_WITH_VOICES[3]["voice"]
+ ),
+ size="sm",
+ variant="secondary"
+ )
+ example_buttons_list.append(example_button_3)
+
+ with gr.Row():
+ example_button_4 = gr.Button(
+ format_example_button_label(
+ EXAMPLE_PROMPTS_WITH_VOICES[4]["text"],
+ EXAMPLE_PROMPTS_WITH_VOICES[4]["voice"]
+ ),
+ size="sm",
+ variant="secondary"
+ )
+ example_buttons_list.append(example_button_4)
+
+ # -------------------------------------------------------------------------
+ # EVENT HANDLERS AND BINDINGS
+ # -------------------------------------------------------------------------
+
+ # Define input components list for generation function
+ generation_inputs = [
+ text_input_component,
+ voice_mode_radio,
+ voice_preset_dropdown,
+ voice_clone_audio_input,
+ model_variant_textbox,
+ lsd_decode_steps_slider,
+ temperature_slider,
+ noise_clamp_slider,
+ eos_threshold_slider,
+ frames_after_eos_slider,
+ enable_custom_frames_checkbox
+ ]
+
+ # Voice mode change handler
+ voice_mode_radio.change(
+ fn=update_voice_mode_visibility,
+ inputs=[voice_mode_radio],
+ outputs=[preset_voice_container, clone_voice_container]
+ )
+
+ # Text input change handlers
+ text_input_component.change(
+ fn=calculate_character_count_display,
+ inputs=[text_input_component],
+ outputs=[character_count_display]
+ )
+
+ text_input_component.change(
+ fn=check_generate_button_state,
+ inputs=[text_input_component],
+ outputs=[generate_button]
+ )
+
+ text_input_component.change(
+ fn=determine_clear_button_visibility_idle,
+ inputs=[text_input_component, audio_output_component],
+ outputs=[clear_button]
+ )
+
+ # Audio output change handler
+ audio_output_component.change(
+ fn=determine_clear_button_visibility_idle,
+ inputs=[text_input_component, audio_output_component],
+ outputs=[clear_button]
+ )
+
+ # Generate button click handler chain
+ generate_button.click(
+ fn=switch_to_generating_state,
+ outputs=[generate_button, stop_button, clear_button]
+ ).then(
+ fn=perform_speech_generation,
+ inputs=generation_inputs,
+ outputs=[audio_output_component]
+ ).then(
+ fn=switch_to_idle_state,
+ inputs=[text_input_component, audio_output_component],
+ outputs=[generate_button, stop_button, clear_button]
+ ).then(
+ fn=check_generate_button_state,
+ inputs=[text_input_component],
+ outputs=[generate_button]
+ )
+
+ # Stop button handler
+ stop_button.click(
+ fn=request_generation_stop,
+ outputs=[stop_button]
+ )
+
+ # Clear button handler
+ clear_button.click(
+ fn=perform_clear_action,
+ outputs=[
+ text_input_component,
+ audio_output_component,
+ clear_button,
+ voice_mode_radio,
+ voice_preset_dropdown,
+ voice_clone_audio_input
+ ]
+ )
+
+ # Example button handlers
+ for button_index, example_button in enumerate(example_buttons_list):
+ example_text = EXAMPLE_PROMPTS_WITH_VOICES[button_index]["text"]
+ example_voice = EXAMPLE_PROMPTS_WITH_VOICES[button_index]["voice"]
+
+ example_button.click(
+ fn=create_example_handler(example_text, example_voice),
+ outputs=[text_input_component, voice_mode_radio, voice_preset_dropdown]
+ ).then(
+ fn=switch_to_generating_state,
+ outputs=[generate_button, stop_button, clear_button]
+ ).then(
+ fn=perform_speech_generation,
+ inputs=generation_inputs,
+ outputs=[audio_output_component]
+ ).then(
+ fn=switch_to_idle_state,
+ inputs=[text_input_component, audio_output_component],
+ outputs=[generate_button, stop_button, clear_button]
+ ).then(
+ fn=check_generate_button_state,
+ inputs=[text_input_component],
+ outputs=[generate_button]
+ )
+
+
+ # =============================================================================
+ # APPLICATION ENTRY POINT
+ # =============================================================================
+
+ if __name__ == "__main__":
+ application.launch(
+ server_name="0.0.0.0",
+ share=False
+ )