STT Timing Settings
Configure speech-to-text recognition timing, VAD parameters, and dispatch thresholds for optimal transcription accuracy and translation pipeline responsiveness.
Settings Reference
| Setting | Default | Description |
| --- | --- | --- |
| commit_merge_ms | 1500 | Buffer VAD commits for this duration before translating; merges short fragments into coherent chunks. |
| stability_timeout_ms | 2000 | Wait for the partial transcript to remain unchanged for this duration before triggering translation, as a fallback to VAD. |
| tts_segment_pause_ms | 0 | Pause between consecutive TTS audio segments emitted to the frontend; the frontend reads this value to synchronize playback. |
| max_accumulation_ms | 8000 | Maximum time to buffer new words during continuous speech before force-dispatching for translation; prevents stalls during long utterances. |
| vad_threshold | 0.5 | Voice Activity Detection sensitivity (0–1, higher = stricter noise rejection); passed to ElevenLabs Scribe as a query parameter. |
| vad_silence_threshold_secs | 1.0 | Seconds of silence required before VAD commits the current utterance; passed to ElevenLabs Scribe. |
| min_speech_duration_ms | 100 | Ignore speech segments shorter than this duration; helps filter noise and clicks. |
| min_silence_duration_ms | 100 | Minimum silence gap (in ms) required between speech segments; prevents spurious fragment splits. |
| flush_on_sentence_boundary | true | When true, dispatch only complete sentences (ending in .?!;) rather than flushing all buffered text at once. |
| min_chars_before_dispatch | 40 | Minimum character count before a chunk is dispatched for translation; prevents tiny, low-confidence fragments. |
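For reference, the full settings object can be expressed as a TypeScript shape. This is a sketch inferred from the table above, not the actual backend type definition:

```typescript
// Hypothetical TypeScript shape for the STT timing settings; field names and
// defaults mirror the table above.
interface SttTimingSettings {
  commit_merge_ms: number;
  stability_timeout_ms: number;
  tts_segment_pause_ms: number;
  max_accumulation_ms: number;
  vad_threshold: number; // 0–1, higher = stricter noise rejection
  vad_silence_threshold_secs: number;
  min_speech_duration_ms: number;
  min_silence_duration_ms: number;
  flush_on_sentence_boundary: boolean;
  min_chars_before_dispatch: number;
}

const DEFAULT_STT_TIMING: SttTimingSettings = {
  commit_merge_ms: 1500,
  stability_timeout_ms: 2000,
  tts_segment_pause_ms: 0,
  max_accumulation_ms: 8000,
  vad_threshold: 0.5,
  vad_silence_threshold_secs: 1.0,
  min_speech_duration_ms: 100,
  min_silence_duration_ms: 100,
  flush_on_sentence_boundary: true,
  min_chars_before_dispatch: 40,
};
```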
Tuning Guide
- For responsive real-time translation: Reduce commit_merge_ms to 800–1200 ms and max_accumulation_ms to 5000–6000 ms; this lowers latency at the cost of smaller chunks.
- For sermon/lecture mode (longer utterances): Increase max_accumulation_ms to 10000–15000 ms and min_chars_before_dispatch to 100–200; this allows fuller sentences to accumulate before dispatch.
- For noisy environments: Raise vad_threshold to 0.6–0.8 to filter background noise; increase min_speech_duration_ms to 150–300 ms.
- For quiet environments: Lower vad_threshold to 0.3–0.4 and vad_silence_threshold_secs to 0.5–0.8 to catch subtle speech.
- For prayer/song pauses: Use the broadcast_pause event to pause audio input and clear the TTS queue; broadcast_resume resumes.
- To control TTS pacing: Adjust tts_segment_pause_ms (the frontend reads this to add inter-segment silence); also tune low_water_hold_ms in the tts_pipeline config.
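As an illustration, the noisy-environment preset above could be applied through the documented POST /admin/stt-timing partial-update endpoint. The applyPreset helper and the chosen values are hypothetical (values picked from the ranges in the tuning guide; requires Node 18+ for the global fetch):

```typescript
// Hypothetical preset values taken from the tuning-guide ranges above.
const NOISY_PRESET = {
  vad_threshold: 0.7,          // tuning guide suggests 0.6–0.8
  min_speech_duration_ms: 200, // tuning guide suggests 150–300 ms
};

// Illustrative helper: send a partial update; the server merges it with the
// existing settings and responds with the full merged settings object.
async function applyPreset(baseUrl: string, jwtToken: string) {
  const res = await fetch(`${baseUrl}/admin/stt-timing`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Cookie: `auth=${jwtToken}`,
    },
    body: JSON.stringify(NOISY_PRESET),
  });
  return res.json();
}
```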
API Endpoints
GET /admin/stt-timing
Retrieve current STT timing settings.
```bash
curl -X GET http://localhost:3001/admin/stt-timing \
  -H "Cookie: auth=<jwt_token>" \
  -H "Content-Type: application/json"

# Response:
{
  "settings": {
    "commit_merge_ms": 1500,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 0,
    "max_accumulation_ms": 8000,
    "vad_threshold": 0.5,
    "vad_silence_threshold_secs": 1.0,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true,
    "min_chars_before_dispatch": 40
  }
}
```
POST /admin/stt-timing
Update one or more STT timing settings. Partial updates are merged with existing values.
```bash
curl -X POST http://localhost:3001/admin/stt-timing \
  -H "Cookie: auth=<jwt_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "max_accumulation_ms": 10000,
    "min_chars_before_dispatch": 60,
    "vad_threshold": 0.6,
    "tts_segment_pause_ms": 500
  }'

# Response:
{
  "settings": {
    "commit_merge_ms": 1500,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 500,
    "max_accumulation_ms": 10000,
    "vad_threshold": 0.6,
    "vad_silence_threshold_secs": 1.0,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true,
    "min_chars_before_dispatch": 60
  }
}
```
Real-Time Socket Events
When STT timing settings are updated via the admin API, the server broadcasts the new tts_segment_pause_ms to all connected clients:
```javascript
// Received by all WebSocket clients:
socket.on('stt_timing', (data) => {
  const { tts_segment_pause_ms } = data;
  // Frontend uses this value to add inter-segment silence during playback
  console.log(`STT segment pause: ${tts_segment_pause_ms}ms`);
});
```
Implementation Details
- VAD Auto-Commit: ElevenLabs Scribe fires committed_transcript automatically based on vad_silence_threshold_secs; multiple commits are buffered for commit_merge_ms before translation.
- Stability Fallback: If the partial transcript remains unchanged for stability_timeout_ms, translation is triggered independently (a backup for slow VAD).
- Accumulation Timer: During continuous speech, new words are dispatched every max_accumulation_ms to prevent pipeline stalls; the min_chars_before_dispatch threshold is respected.
- Sentence Boundary Detection: When flush_on_sentence_boundary is true, the system splits on .?!; punctuation and dispatches only complete sentences, keeping unfinished text for the next cycle.
- Admin-Configurable at Runtime: Changes via POST /admin/stt-timing are persisted to Redis and applied immediately to all new sessions; existing sessions inherit the new values on the next dispatch event.
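The sentence-boundary flush described above can be sketched as a pure function. This is an assumed reading of the behavior, not the actual implementation:

```typescript
// Sketch of sentence-boundary flushing: dispatch the text up to the last
// sentence-ending punctuation mark (. ? ! ;) if it meets the minimum length,
// and keep the unfinished tail in the buffer for the next cycle.
function splitOnSentenceBoundary(
  buffer: string,
  minChars: number,
): { dispatch: string; remainder: string } {
  // Greedy match from the start to the LAST sentence-ending punctuation.
  const match = buffer.match(/^[\s\S]*[.?!;]/);
  if (!match || match[0].trim().length < minChars) {
    // No complete sentence yet, or too short to dispatch confidently.
    return { dispatch: "", remainder: buffer };
  }
  return {
    dispatch: match[0].trim(),
    remainder: buffer.slice(match[0].length).trimStart(),
  };
}
```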
Related Configuration Files
- config/application.yaml → elevenlabs.stt_model (e.g., scribe_v2_realtime)
- config/application.yaml → tts_pipeline (initial buffer segments, low-water-mark hold duration)
- Redis key: setting:stt_timing (persisted state)
- Socket event: stt_timing → emitted to all clients on connect and when admin updates settings
Authentication: JWT cookie-based admin auth middleware. Requires the is_admin flag or a role that grants the required permission. Unauthenticated requests return 401; authenticated requests without sufficient permissions return 403.
API Keys Management
Retrieve all configured API keys and their status (masked for security).
Update one or more API keys. Valid keys: elevenlabs, anthropic, deepl, libretranslate, google.
Body:
```json
{
  "elevenlabs": "string",
  "anthropic": "string",
  "deepl": "string",
  "libretranslate": "string",
  "google": "string"
}
```
Retrieve the Gemini API key (returns masked value).
Voice Management
Scan ElevenLabs for all available voices and detect new voices not yet in the allowed list.
Get the list of voice IDs allowed for user selection (null → all voices allowed).
Set the allowed voice IDs. Broadcasts update to all connected clients via Socket.IO.
Body:
```json
{
  "voiceIds": ["voice_id_1", "voice_id_2"]
}
```
Feature Flags
Get all feature flags merged from YAML defaults & Redis overrides.
Get a single feature flag value.
Set a feature flag value. Broadcasts updated flags to all clients via Socket.IO.
TTS & STT Settings
Retrieve TTS voice settings (stability, similarity boost, style, speed, speaker boost).
Update TTS settings. Partial updates are merged with existing settings.
Body:
```json
{
  "stability": 0.5,
  "similarity_boost": 0.75,
  "style": 0.0,
  "speed": 1.0,
  "use_speaker_boost": true
}
```
Get STT timing configuration (VAD thresholds, commit merge delays, min character dispatch threshold).
Update STT timing settings dynamically (applied to all active sessions immediately).
Body:
```json
{
  "commit_merge_ms": 1500,
  "stability_timeout_ms": 2000,
  "tts_segment_pause_ms": 0,
  "max_accumulation_ms": 8000,
  "vad_threshold": 0.5,
  "vad_silence_threshold_secs": 1.0,
  "min_speech_duration_ms": 100,
  "min_silence_duration_ms": 100,
  "flush_on_sentence_boundary": true,
  "min_chars_before_dispatch": 40
}
```
Get video call STT/TTS settings (separate from live-translation settings).
Update video call settings (stability, commit merge, translation provider).
Body:
```json
{
  "stability_ms": 500,
  "commit_merge_ms": 50,
  "translation_provider": "claude"
}
```
Languages
Get the current active language pair (e.g., ["en", "ru"]).
Set the active language pair. Must provide exactly 2 language codes. Broadcasts to all clients via Socket.IO.
Body:
```json
{
  "languages": ["en", "ru"]
}
```
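The "exactly 2 language codes" rule above can be validated with a small type guard. This is an illustrative sketch, not the server's actual validation code:

```typescript
// Illustrative validation: the active language pair must be exactly two
// non-empty language-code strings.
function validateLanguagePair(languages: unknown): languages is [string, string] {
  return (
    Array.isArray(languages) &&
    languages.length === 2 &&
    languages.every((l) => typeof l === "string" && l.length > 0)
  );
}
```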
Get the pool of languages available for viewer selection.
Set the available language pool. Viewers can only select from this list. Broadcasts pool & active pair to all clients.
Body:
```json
{
  "languages": ["en", "ru", "uk"]
}
```
Translation Provider
Get the currently active translation provider and list of available providers.
Set the active translation provider. Valid values: google, deepl, claude, libretranslate.
Body:
```json
{
  "provider": "google"
}
```
Get the currently selected Claude model for translation & available Claude models.
Set the Claude model to use for translation (when provider is claude).
Body:
```json
{
  "model": "claude-3-5-sonnet-20241022"
}
```
Audio Device
Get the admin-selected audio input device (overrides viewer's local selection).
Set the admin-selected audio input device. Broadcasts to all connected clients via Socket.IO.
Body:
```json
{
  "deviceId": "device_id_string",
  "label": "Microphone Name"
}
```
Broadcast Schedule
Get the list of upcoming broadcast schedule events.
Set the broadcast schedule. Past events are automatically expired on broadcast start.
Body:
```json
{
  "events": [
    {
      "id": "uuid",
      "title": "Morning Service",
      "datetime": "2024-01-15T09:00:00Z",
      "description": "Optional description"
    }
  ]
}
```
TTS Preview & Generation
Generate TTS audio for a text sample. Returns MP3 audio stream with Content-Type: audio/mpeg.
Body:
```json
{
  "text": "Sample text to speak",
  "voiceId": "optional_voice_id"
}
```
Generate a biblical sermon excerpt via Gemini Flash AI (admin test endpoint).
Body:
```json
{
  "apiKey": "optional_gemini_api_key",
  "language": "en|ru|uk",
  "sentences": 1-20
}
```
Voice Training & Cloning
Clone a voice from browser microphone recordings (base64-encoded audio blobs). Returns cloned voice details.
Body:
```json
{
  "name": "New Voice Name",
  "clips": ["base64_audio_blob_1", "base64_audio_blob_2"],
  "mimeType": "audio/webm"
}
```
Clone a voice from a YouTube video. Extracts N×30s clips using yt-dlp & ffmpeg, then uploads to ElevenLabs.
Body:
```json
{
  "name": "New Voice Name",
  "youtubeUrl": "https://www.youtube.com/watch?v=...",
  "clipCount": 3,
  "startOffset": 0
}
```
Monitoring & Diagnostics
Get hallucination detection stats & log of detected hallucinated transcripts.
Clear the hallucination detection log.
Get recent translation log entries (original, translated, timings, provider).
Clear the translation log.
Get current broadcast & stream queue depths for performance monitoring.
Broadcast Session History
Get list of all past broadcast sessions with metadata (start time, duration, word count, error count).
Get full details of a single broadcast session including all transcripts with timings.
Export session data in JSON, CSV, or plain text format. Query param: format=json|csv|txt (default: json).
User Management
Get all users (requires user_management permission). Password hashes & avatar data stripped.
Update a user's admin status and/or assigned roles (requires user_management permission).
Body:
```json
{
  "isAdmin": true,
  "roleId": "optional_single_role_id",
  "roleIds": ["role_id_1", "role_id_2"]
}
```
Reset a user's password (requires user_management permission). Password must be at least 6 characters.
Body:
```json
{
  "password": "new_password"
}
```
Delete a user account (requires user_management permission). Cannot delete your own account.
Roles & Permissions
Get the list of all available permissions (requires user_management permission).
Get all defined roles with their permissions (requires user_management permission).
Create a new role with specified permissions (requires user_management permission). Returns 409 if role name already exists.
Body:
```json
{
  "name": "Moderator",
  "permissions": ["user_management", "broadcast_control"]
}
```
Update a role's name and permissions (requires user_management permission).
Body:
```json
{
  "name": "Updated Role Name",
  "permissions": ["permission_1", "permission_2"]
}
```
Delete a role (requires user_management permission).
SDK
Uses the official @elevenlabs/elevenlabs-js SDK (v2). The client is lazy-loaded on first use.
Speech-to-Text (Scribe v2 Realtime)
Connects via native WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime. Handles:
- VAD-based commit buffering with configurable merge window
- Stability timeout fallback for stalled VAD
- Text validation (EN/RU/UK character regex filtering)
- Partial and final transcript emission
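The EN/RU/UK character filtering could look roughly like the following; the exact character class is an assumption (note that the Ukrainian letters і, ї, є, ґ fall outside the Cyrillic а-я range and must be listed explicitly):

```typescript
// Assumed character whitelist: Latin, Russian, and Ukrainian letters, digits,
// whitespace, and common punctuation. The real service's regex may differ.
const VALID_TEXT = /^[a-zA-Zа-яА-ЯёЁіїєґІЇЄҐ0-9\s.,!?;:'"()-]+$/;

function isValidTranscript(text: string): boolean {
  const trimmed = text.trim();
  return trimmed.length > 0 && VALID_TEXT.test(trimmed);
}
```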
Text-to-Speech
Uses client.textToSpeech.stream() with the eleven_multilingual_v2 model. Audio is collected into a Buffer and emitted as base64 MP3.
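Collecting the stream into a Buffer and emitting base64 MP3, as described, might be sketched like this; the async-iterable stream shape is an assumption about the SDK's return type:

```typescript
// Sketch: drain an audio stream chunk by chunk, then encode the concatenated
// bytes as a base64 string for emission over Socket.IO.
async function collectToBase64(stream: AsyncIterable<Uint8Array>): Promise<string> {
  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks).toString("base64");
}
```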
Voice Management
- client.voices.getAll() — fetches all voices from the account
- Admin can filter which voices are available to viewers
- Voice cloning via IVC API (from recordings or YouTube)
Key File
backend/src/services/elevenlabs.service.ts
Provider Details
Google Translate
Google Cloud Translation API v2. Fast (~200ms), deterministic, and reliable. Requires GOOGLE_TRANSLATE_API_KEY with the Cloud Translation API enabled in Google Cloud Console. Ensure the API key has no HTTP referrer restrictions (server-side requests have no referrer).
File: backend/src/services/google-translate.service.ts
LibreTranslate
Self-hosted in Docker. No API key required by default. Provides language detection and translation via REST API.
File: backend/src/services/libretranslate.service.ts
DeepL
Premium translation API. Auto-detects free vs. paid endpoint based on the API key format.
File: backend/src/services/deepl.service.ts
Claude (Anthropic)
AI-powered translation using claude-haiku-4-5 for speed. Includes language detection and auto-flip logic.
File: backend/src/services/claude-translate.service.ts
Routing
Provider routing is handled by backend/src/services/translation.provider.ts:
- Try admin-selected primary provider
- On failure, try configured fallback provider
- LibreTranslate is always the last-resort fallback
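The fallback chain above can be sketched as follows; the provider interface shown is hypothetical, not the actual translation.provider.ts API:

```typescript
// Hypothetical provider interface for illustration.
type Translate = (text: string, from: string, to: string) => Promise<string>;

interface Provider {
  name: string;
  translate: Translate;
}

// Try each provider in order (primary, configured fallback, LibreTranslate
// last); rethrow the final error only if every provider fails.
async function translateWithFallback(
  providers: Provider[],
  text: string,
  from: string,
  to: string,
): Promise<string> {
  let lastError: unknown;
  for (const p of providers) {
    try {
      return await p.translate(text, from, to);
    } catch (err) {
      lastError = err; // fall through to the next provider in the chain
    }
  }
  throw lastError;
}
```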
Connection
Uses ioredis with automatic retry strategy. Falls back to in-memory/YAML defaults if Redis is unavailable.
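A typical ioredis retryStrategy returns the delay in milliseconds before the next reconnection attempt; the specific backoff values here are illustrative, not necessarily what redis.service.ts uses:

```typescript
// Illustrative ioredis retry strategy: linear backoff capped at 3 s.
// ioredis calls this with the current attempt count and waits the returned
// number of milliseconds before reconnecting.
function retryStrategy(times: number): number {
  return Math.min(times * 100, 3000);
}

// Usage sketch: new Redis({ retryStrategy, password: process.env.REDIS_PASSWORD })
```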
Key Patterns
| Pattern | Example | Purpose |
| --- | --- | --- |
| flag:&lt;name&gt; | flag:youtube_input | Feature flag boolean values |
| setting:&lt;name&gt; | setting:tts_settings | JSON settings objects |
Key File
backend/src/services/redis.service.ts
Local Development
Use docker-compose.local.yml for Redis and LibreTranslate only (backend/frontend run natively):
```bash
docker compose -f docker-compose.local.yml up -d
```
Production
Use docker-compose.yml for all services:
```bash
docker compose up -d --build
```
Services
| Service | Image | Port | Notes |
| --- | --- | --- | --- |
| frontend | node:24-alpine + Nginx | 80 (exposed) | Serves React build, proxies API/WS to backend |
| backend | node:24-alpine | 3001 (internal) | Express + Socket.io server |
| redis | redis:7-alpine | 6379 (internal) | Feature flags and settings store |
| libretranslate | libretranslate/libretranslate | 5000 (internal) | Self-hosted translation engine |
Configuration
```
ELEVENLABS_API_KEY=sk-your-production-key
ADMIN_PASSWORD=strong-secure-password
FRONTEND_URL=https://translate.example.com
APP_ENV=prod
REDIS_PASSWORD=redis-secret
```
Deploy
```bash
docker compose up -d --build
```
Reverse Proxy
When running behind Nginx or another reverse proxy:
- Set LISTEN_PORT in .env (e.g., 8080)
- Proxy pass to localhost:8080
- Important: Ensure WebSocket upgrades are forwarded for the /socket.io/ path
```nginx
server {
    listen 443 ssl;
    server_name translate.example.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```
Monitoring
```bash
# Check all services
docker compose ps

# View backend logs
docker compose logs -f backend

# Health check
curl http://localhost:3001/api/health
```