Ship live translations
with confidence
A production-ready full-stack Node.js + React application for live EN↔RU↔UK translation with automatic language detection and voice synthesis.
- ⚙ Installation: Set up the project locally with Docker, Redis, and LibreTranslate in minutes.
- ▦ Architecture: Understand the STT → Translation → TTS pipeline and real-time Socket.io communication.
- ▶ Live Translation: Stream from YouTube or microphone with automatic EN/RU/UK language detection and voice output.
- 📜 Biblical Simulator: Test the full pipeline with AI-generated biblical passages in King James, Church Slavonic, or Ukrainian style.
- 🎤 Voice Training: Clone custom voices from microphone recordings or YouTube videos using ElevenLabs IVC.
Prerequisites
- Node.js 20+: runtime for the backend and build tools
- Docker + Docker Compose: for the Redis and LibreTranslate services
- yt-dlp + ffmpeg: required for YouTube audio extraction
- ElevenLabs API key: for speech-to-text and text-to-speech
Clone & Configure
git clone https://github.com/Pzharyuk/live-translator-node.git && cd live-translator-node
cp .env.example .env
Edit .env and set your API key:
ELEVENLABS_API_KEY=sk-your-key-here
ADMIN_PASSWORD=your-secure-password
Start Infrastructure
# Start Redis + LibreTranslate
docker compose -f docker-compose.local.yml up -d
# Wait for LibreTranslate to download language models (~500 MB)
docker logs -f $(docker ps -qf "name=libretranslate") 2>&1 | grep -i "running"
Start Backend
cd backend
npm install
npm run dev # nodemon watches for changes
Start Frontend
cd frontend
npm install
npm run dev # Vite hot-reload on localhost:5173
✓ You're all set!
Open http://localhost:5173 — log in with user / changeme and you will be redirected to /translate. Admin panel: http://localhost:5173/admin (admin password: admin123).
System Overview
- Frontend: React 19 + Vite, Socket.io client, Web Audio API
- Backend: Express + Socket.io, TypeScript, service layer (connected to the frontend via Socket.io)
- ElevenLabs: Scribe v2 (STT), TTS streaming, voice cloning
- Translation: LibreTranslate (self-hosted), DeepL (premium API), Claude / Anthropic (AI)
- Redis: feature flags, settings store
- Anthropic: Biblical Simulator, Claude translation
- DeepL: free & pro tiers, auto endpoint detection
Data Flow
1. Audio input (mic / YouTube / simulator)
2. PCM 16-bit LE @ 16 kHz sent as Socket.io chunks
3. ElevenLabs Scribe v2 WebSocket STT
4. Commit merge buffer (2.5 s VAD aggregation)
5. Translation provider (LibreTranslate / DeepL / Claude)
6. ElevenLabs TTS voice synthesis (streaming)
7. Audio playback (queued, with a 600 ms pause between segments)
Key Architecture Decisions
Two-layer Language Detection
LibreTranslate's /detect endpoint returns 0-confidence for short Cyrillic phrases. The app uses script-based pre-detection (Unicode 0x0400–0x04FF = Cyrillic) combined with ElevenLabs Scribe's language_code output for reliable EN/RU/UK auto-detection.
VAD Commit Merging
Voice Activity Detection can fire aggressively on speaker breathing. Commits are buffered for 2.5 seconds before translation to merge fragments into meaningful phrases.
Feature Flag Merging
YAML config defaults are merged with Redis runtime overrides. Redis values take priority, falling back to YAML if Redis is unavailable.
API Key Hierarchy
Keys resolve in order: Runtime Cache → Redis → Config File → Empty. This allows hot-swapping keys without restarts.
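A minimal TypeScript sketch of this resolution order (the helper, cache, and Redis key names here are illustrative assumptions, not the app's actual code):

import Redis from "ioredis";

const redis = new Redis();
const runtimeCache = new Map<string, string>();
declare const config: { apiKeys?: Record<string, string> }; // parsed from YAML at startup

async function resolveApiKey(name: string): Promise<string> {
  const cached = runtimeCache.get(name);
  if (cached) return cached;                           // 1. runtime cache
  const fromRedis = await redis.get(`apikey:${name}`); // 2. Redis (hypothetical key name)
  if (fromRedis) {
    runtimeCache.set(name, fromRedis);
    return fromRedis;
  }
  return config.apiKeys?.[name] ?? "";                 // 3. config file, else empty
}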
Connection Lifecycle
- Client sends start_session with source type (mic or youtube) and optional voiceId
- Backend opens a WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime
- For YouTube: spawns yt-dlp | ffmpeg child processes to extract PCM audio
- For microphone: awaits audio_chunk events from the frontend
Audio Streaming
Audio chunks are sent to Scribe as JSON messages:
{
  "message_type": "input_audio_chunk",
  "audio_base_64": "UklGR..." // PCM 16-bit LE, 16kHz, mono
}
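For illustration, a hedged sketch of producing that payload from a Float32 Web Audio block (the helper is hypothetical; the message shape matches the JSON above):

function toPcm16Base64(samples: Float32Array): string {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;        // scale to signed 16-bit
  }
  // Int16Array is little-endian on mainstream platforms, matching PCM 16-bit LE.
  return Buffer.from(pcm.buffer).toString("base64");
}

// scribeSocket.send(JSON.stringify({
//   message_type: "input_audio_chunk",
//   audio_base_64: toPcm16Base64(chunk),
// }));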
Scribe Responses
| Response Type | Meaning | Action |
| --- | --- | --- |
| partial_transcript | Live partial text (speculative) | Emitted as a non-final transcript event |
| committed_transcript | VAD fired — complete phrase | Buffered for the commit merge window |
Commit Merge Buffer
After receiving a committed_transcript, the backend waits 2.5 seconds (COMMIT_MERGE_MS) to collect additional commits before translating. This prevents fragmented translations from aggressive VAD.
Stability Timeout
If VAD stalls (no new commits), a 3.5 second fallback timer (STABILITY_TIMEOUT_MS) fires to translate whatever new text has accumulated, preventing indefinite silence.
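An illustrative TypeScript sketch of these two timers working together (simplified: the constants mirror the values above, the wiring is an assumption; in the real service the stability timer also watches partial transcripts):

const COMMIT_MERGE_MS = 2500;      // commit merge window
const STABILITY_TIMEOUT_MS = 3500; // fallback when VAD stalls

let pending: string[] = [];
let mergeTimer: ReturnType<typeof setTimeout> | null = null;
let stabilityTimer: ReturnType<typeof setTimeout> | null = null;

function onCommittedTranscript(text: string): void {
  pending.push(text);
  if (mergeTimer) clearTimeout(mergeTimer);
  mergeTimer = setTimeout(flush, COMMIT_MERGE_MS);            // wait for more commits
  if (!stabilityTimer) {
    stabilityTimer = setTimeout(flush, STABILITY_TIMEOUT_MS); // fire even if commits stall
  }
}

function flush(): void {
  if (mergeTimer) clearTimeout(mergeTimer);
  if (stabilityTimer) clearTimeout(stabilityTimer);
  mergeTimer = stabilityTimer = null;
  if (pending.length === 0) return;
  const phrase = pending.join(" "); // merged fragments become one phrase
  pending = [];
  translateAndSynthesize(phrase);   // hand off to the provider chain
}

declare function translateAndSynthesize(text: string): void;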
Text Validation
Before translation, text is validated against EN/RU/UK character regex patterns. This filters out hallucinated text from the STT model (common with silence or background noise).
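A sketch of such a filter, assuming simple per-script regexes (the exact patterns in the codebase may differ):

// Accept text only if it contains letters from the scripts the app expects.
const EXPECTED_SCRIPTS = /[A-Za-z\u0400-\u04FF]/; // Latin or Cyrillic (EN/RU/UK)

function isLikelyRealSpeech(text: string): boolean {
  const trimmed = text.trim();
  if (trimmed.length === 0) return false; // silence often yields empty commits
  return EXPECTED_SCRIPTS.test(trimmed);  // reject out-of-script hallucinations
}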
Provider Chain
The system supports three translation providers with automatic fallback:
- LibreTranslate (default): self-hosted, no API key required. Runs in Docker alongside the app. Best for privacy and cost.
- DeepL (premium): high-quality translations. Supports both free and paid API tiers. Auto-detects the endpoint.
- Claude (AI): Anthropic's Claude for context-aware translations. Uses claude-haiku-4-5 for speed.
Fallback Logic
1. Try primary provider (admin-selected)
2. If primary fails → try configured fallback
3. If fallback fails → try LibreTranslate (last resort)
4. If all fail → emit error event
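A condensed TypeScript sketch of this chain (the interface is an assumption for illustration; the actual routing lives in translation.provider.ts):

interface TranslationProvider {
  name: string;
  translate(text: string, source: string, target: string): Promise<string>;
}

async function translateWithFallback(
  text: string,
  source: string,
  target: string,
  providers: TranslationProvider[], // [primary, fallback, libretranslate]
): Promise<string> {
  for (const provider of providers) {
    try {
      return await provider.translate(text, source, target);
    } catch {
      // swallow and try the next provider in the chain
    }
  }
  throw new Error("translation_failed"); // surfaced to the client as an error event
}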
Language Detection
The app uses a two-layer auto-detection approach:
Layer 1: Script-based Pre-detection
Before calling any translation API, the backend checks Unicode character scripts:
- Cyrillic characters (Unicode 0x0400–0x04FF): if >50% of matched letters are Cyrillic, the text is detected as Russian
- Latin characters: detected as English
- This avoids low-confidence results from LibreTranslate's /detect endpoint on short text (see the sketch below)
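A sketch of the Layer 1 check under the stated assumptions (count Cyrillic letters among all letters and pick a side at the 50% mark):

function detectScript(text: string): "ru" | "en" {
  const letters = text.match(/\p{L}/gu) ?? [];        // all Unicode letters
  const cyrillic = letters.filter((ch) => /[\u0400-\u04FF]/.test(ch)).length;
  return cyrillic > letters.length / 2 ? "ru" : "en"; // >50% Cyrillic → Russian
}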
Layer 2: STT Language Code
When the auto_language_detect flag is enabled, ElevenLabs Scribe returns a language_code with each transcript commit. The backend uses this to correctly route EN/RU/UK without relying solely on script detection.
Note: For LibreTranslate, Russian and Ukrainian Cyrillic text are both passed with source ru, since LibreTranslate handles Ukrainian acceptably via the Russian model. DeepL and Claude distinguish Ukrainian natively and handle uk as a proper source language.
Language Gating
Detected languages are checked against the admin-approved pool. If a detected language isn't in the allowed set, the translation is rejected to prevent hallucinated language outputs.
TTS Pipeline
After translation, the text is sent to ElevenLabs TTS:
const stream = await client.textToSpeech.stream(voiceId, {
  text: translatedText,
  model_id: "eleven_multilingual_v2",
  output_format: "mp3_44100_128",
  voice_settings: {
    stability: 0.5,
    similarity_boost: 0.75,
    style: 0.0,
    speed: 1.0,
    use_speaker_boost: true
  }
});
Audio Delivery
TTS audio is streamed to a Buffer, then emitted as a base64-encoded MP3 via the tts_audio Socket.io event.
Frontend Playback Queue
The frontend maintains an audio queue to prevent overlapping playback:
- Received tts_audio events are queued
- Each segment plays to completion before the next starts
- A configurable pause (600ms default) is inserted between segments
- The pause duration is controlled by tts_segment_pause_ms, adjustable in the admin panel (see the sketch below)
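A browser-side sketch of such a queue (names are illustrative; the real component wires this into its Socket.io tts_audio handler):

const queue: string[] = [];   // base64 MP3 payloads from tts_audio events
let playing = false;
const SEGMENT_PAUSE_MS = 600; // mirrors tts_segment_pause_ms from the admin panel

function enqueueTts(base64Mp3: string): void {
  queue.push(base64Mp3);
  if (!playing) playNext();
}

function playNext(): void {
  const next = queue.shift();
  if (!next) {
    playing = false;
    return;
  }
  playing = true;
  const audio = new Audio(`data:audio/mpeg;base64,${next}`);
  // Wait for the segment to finish, then pause before starting the next one.
  audio.onended = () => setTimeout(playNext, SEGMENT_PAUSE_MS);
  void audio.play();
}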
Microphone Input
- User selects "Mic" tab and chooses a TTS voice
- Browser captures audio via the Web Audio API's ScriptProcessor
- PCM 16-bit LE at 16kHz sample rate sent to backend via Socket.io
- Backend pipes audio to ElevenLabs Scribe v2 Realtime WebSocket
- Language auto-detected (EN/RU/UK), text translated and synthesized
- TTS audio returned and played back with inter-segment pauses
YouTube Input
- User pastes a YouTube URL (live stream or video)
- Backend spawns yt-dlp | ffmpeg child processes
- Audio extracted as PCM stream (16kHz, 16-bit LE, mono)
- Piped to Scribe v2, same pipeline as microphone
- Stream ends when YouTube content ends or user stops
User Interface
The user view features a dark cavern theme with:
- Waveform visualizer — Canvas-based bar chart with orange gradient and cyan tips
- Transcript display — White translated text scrolls upward with fade masks
- Partial transcript — Shown in italic orange while STT is processing
- Source tabs — Toggle between Mic and YouTube (controlled by feature flags)
How It Works
The backend uses yt-dlp and ffmpeg as child processes to extract audio from YouTube URLs:
yt-dlp (best audio) → ffmpeg (PCM 16kHz 16-bit LE mono) → Scribe v2
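A hedged Node sketch of wiring these processes together (the flags are standard yt-dlp/ffmpeg options; the surrounding function is illustrative):

import { spawn } from "node:child_process";

function openYoutubeAudio(url: string): NodeJS.ReadableStream {
  const ytdlp = spawn("yt-dlp", ["-f", "bestaudio", "-o", "-", url]);
  const ffmpeg = spawn("ffmpeg", [
    "-i", "pipe:0", // read yt-dlp's output from stdin
    "-f", "s16le",  // raw PCM, 16-bit little-endian
    "-ar", "16000", // 16 kHz, as Scribe requires
    "-ac", "1",     // mono
    "pipe:1",       // write PCM to stdout
  ]);
  ytdlp.stdout.pipe(ffmpeg.stdin);
  return ffmpeg.stdout; // stream of PCM chunks for Scribe
}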
Supported Sources
- Live streams — Translates in real-time as the stream progresses
- Regular videos — Processes the full audio track
- Any URL supported by yt-dlp (YouTube, etc.)
Requirements
Both yt-dlp and ffmpeg must be installed and available in the system PATH. On macOS:
brew install yt-dlp ffmpeg
⚠ Feature Flag Required
YouTube input is controlled by the youtube_input feature flag. Enable it in the admin panel to show the YouTube tab in the user view.
Overview
The Biblical Transcript Simulator is an admin-only feature that generates biblical text passages using Anthropic's Claude API, then routes them through the full translation pipeline. This provides a hands-free way to test STT → Translation → TTS without a live audio source.
Language Styles
| Language | Style | Example |
| --- | --- | --- |
| en | King James English | "In the beginning was the Word..." |
| ru | Church Slavonic Russian | "В начале было Слово..." |
| uk | Traditional Ukrainian | "На початку було Слово..." |
Flow
- Admin provides an Anthropic API key and selects a language
- Backend calls Claude with streaming (uses claude-sonnet-4-6)
- Claude generates 6-8 biblical passages, 3-5 sentences each
- Stream is buffered until 140+ characters AND complete sentences
- Chunks are emitted with 1800ms smooth pacing between them
- Each chunk flows through the standard pipeline:
  - Emitted as transcript (isFinal: true)
  - Auto-translated via the configured provider
  - TTS synthesized and audio returned
- Frontend plays audio with the standard inter-segment pause
💡 Feature Flag
Enable biblical_simulator in the admin feature flags panel. The Anthropic API key is provided at runtime in the UI — it's never stored in config files.
Overview
Voice Training uses ElevenLabs' Instant Voice Cloning (IVC) API to create custom voices from audio samples. Once cloned, the voice appears in the voice selector immediately.
From Microphone
- Open the Voice Training section in the admin panel
- Record multiple audio clips using your browser microphone
- Provide a name for the voice
- Clips are uploaded to ElevenLabs IVC API
- Cloned voice is available for TTS immediately
From YouTube
- Paste a YouTube URL in the Voice Training section
- Backend extracts N × 30-second clips via yt-dlp + ffmpeg
- Clips are uploaded to the ElevenLabs IVC API
- The resulting voice is stored in your ElevenLabs account
⚠ ElevenLabs Account
Cloned voices are stored in your ElevenLabs account, not locally. Ensure your plan supports voice cloning.
Concepts
| Concept | Description |
| --- | --- |
| Active Language Pair | The current pair used for translation (e.g., EN ↔ RU, EN ↔ UK, or RU ↔ UK). Set by admin. |
| Available Languages | The pool of languages viewers can select from (if user_language_selector is enabled). |
Admin Controls
- Change the active language pair via the admin panel
- Changes broadcast to all connected clients in real-time
- Manage the available languages pool for viewer selection
Viewer Selection
When the user_language_selector feature flag is enabled, viewers can override the admin-set language pair by selecting their own preferred languages from the available pool.
Overview
Two people can video call each other through the app, each speaking their own language. The app transcribes, translates, and synthesizes speech in real-time so each participant hears the other in their language.
Feature flag: Video calls are gated behind the video_translation flag. Enable it in the admin panel or set video_translation: true in your YAML config.
How It Works
- Create a room — Person A selects their language, picks a TTS voice, and clicks "Create Room". A 6-character room code is generated.
- Share the code — Person A shares the room code with Person B (copy button provided).
- Join the room — Person B enters the code, selects their language and TTS voice, and clicks "Join".
- WebRTC connection — The app establishes a peer-to-peer video connection via WebRTC (signaled through Socket.io). Video flows directly between browsers.
- Audio translation — Each participant's microphone audio is simultaneously:
- Sent to the peer via WebRTC (but muted on their end)
- Captured as PCM chunks and sent to the backend via Socket.io for STT
- Translation pipeline — Each participant has their own independent Scribe STT session. Transcribed text is translated to the other participant's language, then synthesized via ElevenLabs TTS and sent back to the peer.
- Playback — The peer hears the TTS translation instead of the raw audio. Translated transcript is displayed below the video.
Architecture
Person A (Browser) Server Person B (Browser)
├─ getUserMedia ├─ Socket.io ├─ getUserMedia
├─ WebRTC P2P ═══video═══►│ (signaling) ◄═══ ├─ WebRTC P2P
│ │ │
├─ PCM chunks ──Socket.io─►├─ ScribeA(STT) │
│ │ ↓ translate │
│ │ ↓ TTS ───────────►├─ Plays TTS
│ │ │
│ Plays TTS ◄─────────────├─ ScribeB(STT) ◄───├─ PCM chunks
│ (remote video muted) │ ↓ translate │ (remote video muted)
└──────────────────────────┴────────────────────┘
Socket Events
| Event | Direction | Purpose |
| --- | --- | --- |
| video_create_room | C→S | Create a new room with language + voice |
| video_room_created | S→C | Returns the 6-char room code |
| video_join_room | C→S | Join an existing room |
| video_room_joined | S→C | Sent to both participants, triggers WebRTC |
| video_signal_offer/answer/ice | C↔S | WebRTC signaling relay |
| video_audio_chunk | C→S | PCM audio for STT processing |
| video_transcript | S→C | Transcript sent to the speaker |
| video_translation | S→C | Translation sent to the listener |
| video_tts_audio | S→C | TTS audio sent to the listener |
| video_leave_room | C→S | Leave the room |
| video_room_closed | S→C | Notifies the peer when the other participant leaves |
Room Lifecycle
- Rooms are stored in Redis under the key video_room:{code} with a 4-hour TTL (see the sketch below)
- Maximum 2 participants per room
- When one participant disconnects, the other is notified and the call ends
- Scribe sessions are automatically cleaned up on disconnect
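A sketch of room creation under these rules (the room shape and code generator are simplified assumptions; the key name and TTL match the description above):

import Redis from "ioredis";

const redis = new Redis();
const ROOM_TTL_SECONDS = 4 * 60 * 60; // 4-hour TTL

async function createRoom(language: string, voiceId: string): Promise<string> {
  const code = Math.random().toString(36).slice(2, 8).toUpperCase(); // 6-char code
  const room = { participants: [] as string[], language, voiceId };
  await redis.set(`video_room:${code}`, JSON.stringify(room), "EX", ROOM_TTL_SECONDS);
  return code;
}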
Feature Flags
Feature flags control the availability of major features across the application. They are stored in config/application.yaml and can be overridden at runtime via Redis (managed through the Admin UI). On startup, the application merges YAML defaults with any Redis overrides, allowing administrators to toggle features without redeploying.
Flag Configuration
| Flag | Default | Description |
| --- | --- | --- |
| youtube_input | true | Enable YouTube URL input as an audio source for transcription and translation. |
| mic_input | true | Enable microphone audio input for real-time transcription and translation. |
| auto_language_detect | true | Automatically detect the source language from audio; when disabled, use a fixed source language. |
| user_language_selector | false | Allow end-users to select their own language pair from the available pool. |
| audio_device_selector | true | Display the microphone/input device selection dropdown in the UI. |
| video_translation | false | Enable the /video route for real-time video call translation with low-latency TTS. |
| video_voice_cloning | false | Premium feature: show the Clone Voice button in the /video lobby for instant voice training. |
| broadcast | false | Enable the /broadcast route — public receiver page for watching translated live broadcasts. |
| translate | false | Enable the /translate route — live translator page for running personal translation sessions. |
Storage & Override Mechanism
Feature flags are persisted in Redis under the key prefix flag:flagname, with values stored as 'true' or 'false' strings. On each client connection, the server merges config defaults with Redis overrides, allowing live updates without redeployment. When an administrator changes a flag via the Admin UI, the server broadcasts the updated flags to all connected Socket.IO clients via the feature_flags event.
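A minimal sketch of that merge (the key prefix and string encoding come from the description above; the helper itself is an assumption):

import Redis from "ioredis";

const redis = new Redis();

// Merge YAML defaults with Redis overrides; Redis wins when a key is present.
async function getMergedFlags(
  defaults: Record<string, boolean>,
): Promise<Record<string, boolean>> {
  const merged = { ...defaults };
  for (const name of Object.keys(defaults)) {
    const override = await redis.get(`flag:${name}`); // stored as 'true' / 'false'
    if (override !== null) merged[name] = override === "true";
  }
  return merged;
}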
API Reference
GET /admin/flags
Returns merged flags (YAML defaults + Redis overrides)
Response: { flags: { youtube_input: true, mic_input: true, ... } }
POST /admin/flags/:flag
Set a single flag at runtime
Body: { value: boolean }
Response: { flag: "youtube_input", value: true }
GET /admin/flags/:flag
Get a single flag value
Response: { flag: "youtube_input", value: true }
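For example, toggling a flag from the command line (assuming the admin JWT cookie obtained at login, as described under Authentication below):

curl -X POST http://localhost:3001/admin/flags/youtube_input \
  -H "Cookie: jwt=<token>" \
  -H "Content-Type: application/json" \
  -d '{"value": false}'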
Client Socket Events
All connected clients receive the merged feature flags upon connection and whenever an administrator updates any flag:
socket.on('feature_flags', (data) => {
  // data = { youtube_input: true, mic_input: true, ... }
  // Update UI visibility & routing based on flag state
});
File Structure
| File | Purpose |
| --- | --- |
| config/application.yaml | Base defaults for all environments |
| config/application-local.yaml | Local development overrides (localhost URLs) |
| config/application-prod.yaml | Production overrides (Docker service names) |
The APP_ENV environment variable (local or prod) determines which overlay file is loaded on top of the base config.
Full Configuration Reference
server:
  port: 3001
  cors_origin: "http://localhost:5173"

elevenlabs:
  api_key: "${ELEVENLABS_API_KEY}"
  default_voice_id: "kxj9qk6u5PfI0ITgJwO0"
  tts_model: "eleven_multilingual_v2"
  tts_settings:
    stability: 0.5
    similarity_boost: 0.75
    style: 0.0
    speed: 1.0
    use_speaker_boost: true
  stt_model: "scribe_v2"

anthropic:
  api_key: "${ANTHROPIC_API_KEY}"

deepl:
  api_key: "${DEEPL_API_KEY}"

libretranslate:
  url: "http://libretranslate:5000"
  api_key: ""

redis:
  host: "redis"
  port: 6379
  password: ""

feature_flags:
  youtube_input: true
  mic_input: true
  auto_language_detect: true
  user_language_selector: false
  audio_device_selector: true
  video_translation: false
  video_voice_cloning: false
  broadcast: false

audio:
  sample_rate: 16000
  channels: 1
  chunk_duration_ms: 250

translation:
  source_lang: "auto"
  target_lang_en: "en"
  target_lang_ru: "ru"
  provider: "libretranslate"
  fallback: "libretranslate"
Environment Variable Interpolation
YAML values using ${VAR_NAME} syntax are automatically replaced with the corresponding environment variable at startup.
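A minimal sketch of how such interpolation can be implemented (the actual loader may differ, e.g. in how unset variables are handled):

// Replace ${VAR_NAME} placeholders with environment values (empty string if unset).
function interpolateEnv(value: string): string {
  return value.replace(/\$\{(\w+)\}/g, (_match, name) => process.env[name] ?? "");
}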
TTS Settings
API Endpoints
GET /admin/tts-settings
Response:
{
  "settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.0,
    "speed": 1.0,
    "use_speaker_boost": true
  }
}
POST /admin/tts-settings
Request Body:
{
  "stability": 0.5,
  "similarity_boost": 0.75,
  "style": 0.0,
  "speed": 1.0,
  "use_speaker_boost": true
}
Response: Same as GET
TTS Voice Settings
ElevenLabs text-to-speech quality parameters. Settings persist across restarts via Redis.
| Setting | Range | Default | Description |
| --- | --- | --- | --- |
| stability | 0.0 – 1.0 | 0.5 | Voice consistency; higher = more consistent but less expressive. |
| similarity_boost | 0.0 – 1.0 | 0.75 | How closely the voice matches the original sample; higher = closer match. |
| style | 0.0 – 1.0 | 0.0 | Exaggeration of voice style; 0 = neutral, higher = more stylized. |
| speed | 0.5 – 2.0 | 1.0 | Playback speed multiplier; 1.0 = normal, <1.0 = slower, >1.0 = faster. |
| use_speaker_boost | boolean | true | Apply ElevenLabs speaker boost for clearer, more professional audio output. |
STT Timing Settings
Speech-to-text recognition and translation dispatch timing. Controls when transcribed audio is sent to translation. Settings persist via Redis.
GET /admin/stt-timing
Response:
{
  "settings": {
    "commit_merge_ms": 2500,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 250,
    "max_accumulation_ms": 10000,
    "vad_threshold": 0.5,
    "vad_silence_threshold_secs": 1.5,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true
  }
}
POST /admin/stt-timing
Request Body: (same fields as response)
Response: Same as GET
| Setting | Range | Default | Description |
| --- | --- | --- | --- |
| commit_merge_ms | 500 – 5000 | 2500 | Buffer VAD commits for this duration (ms) before translating — merges sentence fragments into complete thoughts. |
| stability_timeout_ms | 500 – 5000 | 2000 | Wait for the partial transcript to remain unchanged for this duration (ms) before translating — fires if continuous speech stabilizes. |
| tts_segment_pause_ms | 0 – 1000 | 250 | Pause between consecutive audio clip playback (ms) — the frontend uses this for natural pacing. |
| max_accumulation_ms | 2000 – 30000 | 10000 | Maximum time to accumulate new words during continuous speech before force-dispatching for translation — prevents stalling during long speaker turns. |
| vad_threshold | 0.0 – 1.0 | 0.5 | Voice Activity Detection sensitivity; higher = stricter noise filtering, fewer false positives. |
| vad_silence_threshold_secs | 0.5 – 3.0 | 1.5 | Silence duration (seconds) that triggers a VAD commit — a speaker pause sends the transcript to translation. |
| min_speech_duration_ms | 50 – 500 | 100 | Ignore speech shorter than this (ms) — filters brief noise/coughs. |
| min_silence_duration_ms | 50 – 500 | 100 | Minimum silence gap (ms) between words — prevents fragmentation of fluent speech. |
| flush_on_sentence_boundary | boolean | true | Split the commit buffer at sentence-ending punctuation (.?!;) instead of flushing all at once — improves streaming quality. |
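To make flush_on_sentence_boundary concrete, a small sketch of splitting a buffer at sentence-ending punctuation (an illustrative helper, not the app's exact code):

// Everything up to the last .?!; boundary is ready to translate now;
// the remainder keeps buffering until more text arrives.
function splitAtSentenceBoundary(buffer: string): { ready: string; rest: string } {
  const match = buffer.match(/^[\s\S]*[.?!;]/); // greedy: up to the last boundary
  if (!match) return { ready: "", rest: buffer };
  return {
    ready: match[0].trim(),
    rest: buffer.slice(match[0].length).trimStart(),
  };
}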
Video Call Settings
Real-time video call translation parameters — lower latency than broadcast mode.
GET /admin/video-settings
Response:
{
  "stability_ms": 500,
  "commit_merge_ms": 50,
  "translation_provider": "claude"
}
POST /admin/video-settings
Request Body: (partial update allowed)
{
  "stability_ms": 500,
  "commit_merge_ms": 50,
  "translation_provider": "deepl"
}
Response: Same as GET
| Setting | Range | Default | Description |
| --- | --- | --- | --- |
| stability_ms | 200 – 2000 | 500 | Wait for a stable partial transcript (ms) before translating — a shorter timeout for snappier response in video calls. |
| commit_merge_ms | 10 – 500 | 50 | Merge VAD commits over this window (ms) — very short for minimal latency. |
| translation_provider | libretranslate / claude / deepl / google | claude | Translation provider for the video call pipeline (independent of the broadcast/translate provider). |
Application Configuration Defaults
Set in config/application.yaml. Runtime TTS settings start from these defaults; admin changes persist to Redis.
| Config Key | Value | Description |
| --- | --- | --- |
| elevenlabs.tts_model | eleven_multilingual_v2 | ElevenLabs TTS model for broadcast & translate routes; the fast route uses eleven_flash_v2_5. |
| elevenlabs.stt_model | scribe_v2_realtime | ElevenLabs speech-to-text WebSocket model; realtime protocol. |
| elevenlabs.default_voice_id | kxj9qk6u5PfI0ITgJwO0 | Default voice when the user does not specify one. |
| audio.sample_rate | 16000 Hz | PCM audio sample rate for STT (must be 16 kHz for ElevenLabs Scribe). |
| audio.channels | 1 (mono) | Audio channel count. |
| audio.chunk_duration_ms | 250 ms | Frontend ScriptProcessor buffer window — sends audio to STT at this interval. |
| translation.translate_workers | 2 | Parallel translation workers (Stage 1 of the TTS pipeline); more workers reduce latency but don't affect output order. |
Notes
- Persistence: All TTS, STT, and video settings are stored in Redis on change. They load at server startup, so runtime admin changes survive restarts.
- STT Dispatch Logic: Text is sent for translation when any of these fire:
  - VAD commits after vad_silence_threshold_secs of silence (buffered, flushed after commit_merge_ms)
  - Partial text is stable (unchanged) for stability_timeout_ms
  - max_accumulation_ms elapses during continuous speech
- Sentence Boundary Flush: When flush_on_sentence_boundary=true, the commit buffer splits at .?!; punctuation — text before the boundary is translated immediately, the remainder waits for more input.
- Video vs. Broadcast: Video calls use dedicated video_call_settings (lower latency, independent provider). Broadcast & /translate routes use the main TTS/STT settings.
- Audio Event Filtering: Scribe transcripts are cleaned of audio event tags like "(laughter)" and "(speaks in foreign language)" before translation.
STT Timing Settings
Configure speech-to-text recognition timing, voice activity detection (VAD), and translation dispatch behavior.
| Setting | Default | Description |
| --- | --- | --- |
| commit_merge_ms | 2500 | Buffer VAD commits for this duration (ms) before translating — merges short speech fragments. |
| stability_timeout_ms | 2000 | Wait for partial text to remain unchanged for this duration (ms) before triggering translation via the stability timer. |
| tts_segment_pause_ms | 250 | Pause between TTS audio segments (ms) — sent to the frontend for playback spacing. |
| max_accumulation_ms | 10000 | Force-dispatch new words for translation after this duration (ms) of continuous speech, even if VAD & stability don't fire. |
| vad_threshold | 0.5 | Voice activity detection sensitivity (0–1, higher = stricter noise filter) — sent to ElevenLabs Scribe as a query param. |
| vad_silence_threshold_secs | 1.5 | Seconds of silence before VAD commits a transcript segment — sent to ElevenLabs Scribe as a query param. |
| min_speech_duration_ms | 100 | Ignore speech segments shorter than this (ms) — sent to ElevenLabs Scribe as a query param. |
| min_silence_duration_ms | 100 | Minimum silence gap (ms) — sent to ElevenLabs Scribe as a query param. |
| flush_on_sentence_boundary | true | When enabled, flush the commit buffer at sentence boundaries (.?!;) instead of all at once. |
Tuning Guide
- Faster response: Lower commit_merge_ms (e.g., 1000–1500 ms) to dispatch translations sooner after speaker pauses.
- Fewer fragments: Increase commit_merge_ms (e.g., 3000–4000 ms) to merge more speech fragments into single translations.
- Better accuracy on fast speech: Increase stability_timeout_ms (e.g., 3000+ ms) to allow partial text more time to stabilize before translating.
- Continuous speech handling: Decrease max_accumulation_ms (e.g., 5000–7000 ms) to dispatch mid-sentence during sermons or long monologues.
- Reduce background noise: Increase vad_threshold (e.g., 0.6–0.8) to raise the noise detection bar.
- Faster VAD commits: Decrease vad_silence_threshold_secs (e.g., 0.8–1.0 s) to trigger segment commits on shorter pauses.
- Sentence-aware flushing: Enable flush_on_sentence_boundary to break naturally at .?!; punctuation instead of buffering everything.
API Endpoints
GET /admin/stt-timing
Retrieve current STT timing settings.
curl -X GET http://localhost:3001/admin/stt-timing \
-H "Cookie: jwt=<token>"
Response:
{
  "settings": {
    "commit_merge_ms": 2500,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 250,
    "max_accumulation_ms": 10000,
    "vad_threshold": 0.5,
    "vad_silence_threshold_secs": 1.5,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true
  }
}
POST /admin/stt-timing
Update one or more STT timing settings. Unspecified fields retain their current values.
curl -X POST http://localhost:3001/admin/stt-timing \
-H "Cookie: jwt=<token>" \
-H "Content-Type: application/json" \
  -d '{
    "commit_merge_ms": 1500,
    "max_accumulation_ms": 7000,
    "vad_threshold": 0.6
  }'
Response:
{
  "settings": {
    "commit_merge_ms": 1500,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 250,
    "max_accumulation_ms": 7000,
    "vad_threshold": 0.6,
    "vad_silence_threshold_secs": 1.5,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true
  }
}
Notes
- All timing values are in milliseconds unless otherwise noted (vad_silence_threshold_secs is in seconds; vad_threshold is a 0–1 ratio).
- Settings are persisted to Redis on update and survive server restarts.
- Changes apply immediately to new Scribe sessions; active sessions may take time to reflect updates.
- The vad_* and min_* parameters are sent as WebSocket query parameters to ElevenLabs Scribe and affect server-side voice activity detection.
- Three-stage pipeline: Stability timer (partial text unchanged), commit buffer (VAD fragments), and accumulation timer (continuous speech) ensure timely translation dispatch across speech patterns.
Authentication: All endpoints require a JWT cookie (COOKIE_NAME) with either is_admin: true or a role carrying the appropriate permissions. Requests return 401 Unauthorized if the token is missing or expired, and 403 Forbidden if permissions are insufficient.
API Keys
- Retrieve the status of all configured API keys (elevenlabs, anthropic, deepl, libretranslate).
- Update one or more API keys by name. Body: { elevenlabs?: string, anthropic?: string, deepl?: string, libretranslate?: string }
- Retrieve the stored Anthropic API key.
Voice Management
- Scan and list all available ElevenLabs voices with metadata (name, voice_id, category, preview_url).
- Get the list of admin-approved voice IDs that viewers can select from.
- Set the pool of allowed voice IDs; broadcasts to all connected clients. Body: { voiceIds: string[] }
Feature Flags
- Retrieve all feature flags (merged from YAML defaults + Redis overrides).
- Get a single feature flag value by name.
- Set a feature flag value and broadcast it to all connected clients.
TTS & STT Settings
- Get current TTS settings (stability, similarity_boost, style, speed, use_speaker_boost).
- Update TTS settings (partial update allowed). Body: { stability?: number, similarity_boost?: number, style?: number, speed?: number, use_speaker_boost?: boolean }
- Get STT timing settings (VAD parameters, commit merge delay, stability timeout, accumulation limits).
- Update STT timing settings (affects all active Scribe sessions). Body: { commit_merge_ms?: number, stability_timeout_ms?: number, tts_segment_pause_ms?: number, max_accumulation_ms?: number, vad_threshold?: number, vad_silence_threshold_secs?: number, min_speech_duration_ms?: number, min_silence_duration_ms?: number, flush_on_sentence_boundary?: boolean }
- Get video call STT/TTS settings (stability_ms, commit_merge_ms, translation_provider).
- Update video call settings. Body: { stability_ms?: number, commit_merge_ms?: number, translation_provider?: 'libretranslate' | 'claude' | 'deepl' | 'google' }
Languages
- Get the currently active language pair (source → target).
- Set the active language pair and broadcast it to all connected viewers. Body: { languages: [string, string] }
- Get the pool of languages viewers can choose from.
- Set the available language pool and broadcast it to all clients. Body: { languages: string[] }
Translation Provider
- Get the active translation provider and list the available options.
- Set the translation provider (deepl, claude, libretranslate, or google). Body: { provider: 'deepl' | 'claude' | 'libretranslate' | 'google' }
- Get the active Claude translation model and list the available models.
- Set the Claude model for translation (e.g. claude-opus-4, claude-sonnet-3, claude-haiku-4).
Audio Device
- Get the admin-selected audio input device (overrides the viewer's local choice).
- Set the forced audio input device and broadcast it to all connected clients. Body: { deviceId?: string, label?: string }
Voice Training
- Clone a voice from base64-encoded browser mic recordings and upload it to ElevenLabs. Body: { name: string, clips: string[] (base64), mimeType?: string }
- Clone a voice from a YouTube URL, using yt-dlp & ffmpeg to extract audio clips. Body: { name: string, youtubeUrl: string, clipCount?: number, startOffset?: number }
Content Generation
- Generate a biblical sermon snippet via Anthropic Claude in the specified language. Body: { apiKey?: string, language?: 'ru' | 'uk' | 'en' }
Monitoring & Logs
- Get hallucination detection statistics and log entries.
- Clear the hallucination log.
- Get translation history (original, translated, provider, latency metrics).
- Clear the translation log.
- Get the current broadcast TTS queue depth for monitoring.
Session History
- Retrieve all broadcast session records from PostgreSQL.
- Get detailed transcript data for a specific session (seq, original, translated, timings).
- Export session transcripts as JSON, CSV, or TXT (query param: format=json|csv|txt).
User Management
- Requires user_management permission. List all users (password hashes stripped).
- Requires user_management permission. Update a user's admin status and/or role assignments. Body: { isAdmin?: boolean, roleId?: string | null, roleIds?: string[] }
- Requires user_management permission. Reset a user's password (minimum 6 characters). Body: { password: string }
- Requires user_management permission. Delete a user (you cannot delete yourself).
Roles & Permissions
- Requires user_management permission. List all available permissions in the system.
- Requires user_management permission. List all roles and their permissions.
- Requires user_management permission. Create a new role with the specified permissions. Body: { name: string, permissions: Permission[] }
- Requires user_management permission. Update an existing role's name and permissions. Body: { name: string, permissions: Permission[] }
- Requires user_management permission. Delete a role.
SDK
Uses the official @elevenlabs/elevenlabs-js SDK (v2). The client is lazy-loaded on first use.
Speech-to-Text (Scribe v2 Realtime)
Connects via native WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime. Handles:
- VAD-based commit buffering with configurable merge window
- Stability timeout fallback for stalled VAD
- Text validation (EN/RU/UK character regex filtering)
- Partial and final transcript emission
Text-to-Speech
Uses client.textToSpeech.stream() with the eleven_multilingual_v2 model. Audio is collected into a Buffer and emitted as base64 MP3.
Voice Management
- client.voices.getAll() — fetches all voices from the account
- Admin can filter which voices are available to viewers
- Voice cloning via the IVC API (from recordings or YouTube)
Key File
backend/src/services/elevenlabs.service.ts
Provider Details
LibreTranslate
Self-hosted in Docker. No API key required by default. Provides language detection and translation via REST API.
File: backend/src/services/libretranslate.service.ts
DeepL
Premium translation API. Auto-detects free vs. paid endpoint based on the API key format.
File: backend/src/services/deepl.service.ts
Claude (Anthropic)
AI-powered translation using claude-haiku-4-5 for speed. Includes language detection and auto-flip logic.
File: backend/src/services/claude-translate.service.ts
Routing
Provider routing is handled by backend/src/services/translation.provider.ts:
- Try admin-selected primary provider
- On failure, try configured fallback provider
- LibreTranslate is always the last-resort fallback
Connection
Uses ioredis with automatic retry strategy. Falls back to in-memory/YAML defaults if Redis is unavailable.
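A sketch of such a connection with ioredis (the retry timings are illustrative):

import Redis from "ioredis";

const redis = new Redis({
  host: process.env.REDIS_HOST ?? "redis",
  port: 6379,
  // Back off linearly between attempts, capped at 5 s.
  retryStrategy: (attempt) => Math.min(attempt * 500, 5000),
});

redis.on("error", (err) => {
  // Log and keep serving YAML defaults until Redis comes back.
  console.warn("redis unavailable:", err.message);
});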
Key Patterns
| Pattern | Example | Purpose |
| --- | --- | --- |
| flag:<name> | flag:youtube_input | Feature flag boolean values |
| setting:<name> | setting:tts_settings | JSON settings objects |
Key File
backend/src/services/redis.service.ts
Local Development
Use docker-compose.local.yml for Redis and LibreTranslate only (backend/frontend run natively):
docker compose -f docker-compose.local.yml up -d
Production
Use docker-compose.yml for all services:
docker compose up -d --build
Services
| Service | Image | Port | Notes |
| --- | --- | --- | --- |
| frontend | Nginx (custom build) | 80 (exposed) | Serves the React build, proxies API/WS to the backend |
| backend | Node.js (custom build) | 3001 (internal) | Express + Socket.io server |
| redis | redis:7-alpine | 6379 (internal) | Feature flags and settings store |
| libretranslate | libretranslate/libretranslate | 5000 (internal) | Self-hosted translation engine |
Configuration
ELEVENLABS_API_KEY=sk-your-production-key
ADMIN_PASSWORD=strong-secure-password
FRONTEND_URL=https://translate.example.com
APP_ENV=prod
REDIS_PASSWORD=redis-secret
Deploy
docker compose up -d --build
Reverse Proxy
When running behind Nginx or another reverse proxy:
- Set LISTEN_PORT in .env (e.g., 8080)
- Proxy pass to localhost:8080
- Important: ensure WebSocket upgrades are forwarded for the /socket.io/ path
server {
    listen 443 ssl;
    server_name translate.example.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
Monitoring
# Check all services
docker compose ps
# View backend logs
docker compose logs -f backend
# Health check
curl http://localhost:3001/api/health