Ship live translations with confidence
A production-ready full-stack Node.js + React application for seamless EN↔RU↔UK live auto-detect translation with voice synthesis.
- Installation: Set up the project locally with Docker, Redis, and LibreTranslate in minutes.
- Architecture: Understand the STT → Translation → TTS pipeline and real-time Socket.io communication.
- Live Translation: Stream from YouTube or microphone with automatic EN/RU/UK language detection and voice output.
- Biblical Simulator: Test the full pipeline with AI-generated biblical passages in King James, Church Slavonic, or Ukrainian style.
- Voice Training: Clone custom voices from microphone recordings or YouTube videos using ElevenLabs IVC.
Prerequisites
- Node.js 20+: runtime for backend and build tools
- Docker + Docker Compose: for Redis and LibreTranslate services
- yt-dlp + ffmpeg: required for YouTube audio extraction
- ElevenLabs API Key: for speech-to-text and text-to-speech
Clone & Configure
git clone https://github.com/Pzharyuk/live-translator-node.git && cd live-translator-node
cp .env.example .env
Edit .env and set your API key and admin password:
ELEVENLABS_API_KEY=sk-your-key-here
ADMIN_PASSWORD=your-secure-password
Start Infrastructure
# Start Redis + LibreTranslate
docker compose -f docker-compose.local.yml up -d
# Wait for LibreTranslate to download language models (~500 MB)
docker logs -f $(docker ps -qf "name=libretranslate") 2>&1 | grep -i "running"
Start Backend
cd backend
npm install
npm run dev # nodemon watches for changes
Start Frontend
# In a second terminal, starting from the repository root
cd frontend
npm install
npm run dev # Vite hot-reload on localhost:5173
✓
You're all set!
Open http://localhost:5173 — log in with user / changeme and you will be redirected to /translate. Admin panel: http://localhost:5173/admin (admin password: admin123).
System Overview
Frontend: React 19 + Vite, Socket.io Client, Web Audio API
↔
Backend: Express + Socket.io, TypeScript, Service Layer
- ElevenLabs: Scribe v2 (STT), TTS Streaming, Voice Cloning
- Translation: Google Translate (Cloud API), LibreTranslate (self-hosted), DeepL (premium API), Claude / Anthropic (AI)
- Redis: Feature Flags, Settings Store
- Google Gemini: Biblical Simulator, Sermon Generation, Voice Training Text
- DeepL: Free & Pro tiers, Auto endpoint detection
Data Flow
1. Audio Input: Mic / YouTube / Simulator
2. PCM 16-bit LE @ 16kHz: sent via Socket.io chunks
3. ElevenLabs Scribe v2: WebSocket STT
4. Commit Merge Buffer: 2.5s VAD aggregation
5. Translation Provider: Google / LibreTranslate / DeepL / Claude
6. ElevenLabs TTS: Voice synthesis streaming
7. Audio Playback: Queued with 600ms pause
Key Architecture Decisions
Two-layer Language Detection
LibreTranslate's /detect endpoint returns 0-confidence for short Cyrillic phrases. The app uses script-based pre-detection (Unicode 0x0400–0x04FF = Cyrillic) combined with ElevenLabs Scribe's language_code output for reliable EN/RU/UK auto-detection.
VAD Commit Merging
Voice Activity Detection can fire aggressively on speaker breathing. Commits are buffered for 2.5 seconds before translation to merge fragments into meaningful phrases.
Feature Flag Merging
YAML config defaults are merged with Redis runtime overrides. Redis values take priority, falling back to YAML if Redis is unavailable.
API Key Hierarchy
Keys resolve in order: Runtime Cache → Redis → Config File → Empty. This allows hot-swapping keys without restarts.
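The resolution order above can be sketched as a small pure function. This is a minimal sketch, not the project's code: the `KeySources` shape and the `resolveApiKey` name are hypothetical, and it assumes that a blank string counts as "not set" at every layer.

```typescript
// Hypothetical shape: each layer may or may not hold a key for a given provider.
type KeySources = {
  runtimeCache?: string | null;
  redis?: string | null;
  configFile?: string | null;
};

// Returns the first non-empty key in priority order
// (Runtime Cache, then Redis, then Config File), or "" when none is set.
function resolveApiKey(sources: KeySources): string {
  const candidates = [sources.runtimeCache, sources.redis, sources.configFile];
  for (const candidate of candidates) {
    if (typeof candidate === "string" && candidate.trim() !== "") {
      return candidate;
    }
  }
  return "";
}
```

Because the runtime cache is consulted first, writing a new key into it (and into Redis) takes effect on the next call without a restart.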
Connection Lifecycle
- Client sends start_session with source type (mic or youtube) and optional voiceId
- Backend opens a WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime
- For YouTube: spawns yt-dlp | ffmpeg child processes to extract PCM audio
- For Microphone: awaits audio_chunk events from the frontend
Audio Streaming
Audio chunks are sent to Scribe as JSON messages:
{
  "message_type": "input_audio_chunk",
  "audio_base_64": "UklGR..." // PCM 16-bit LE, 16kHz, mono
}
Scribe Responses
| Response Type | Meaning | Action |
| partial_transcript | Live partial text (speculative) | Emitted as non-final transcript event |
| committed_transcript | VAD fired: complete phrase | Buffered for commit merge window |
Commit Merge Buffer
After receiving a committed_transcript, the backend waits 2.5 seconds (COMMIT_MERGE_MS) to collect additional commits before translating. This prevents fragmented translations from aggressive VAD.
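One way to picture the merge window is the sketch below. It is illustrative only: the `CommitMergeBuffer` class, its method names, and the injectable `schedule` function are hypothetical, and it assumes the 2.5 second window starts at the first commit and is not restarted by later ones (the text does not say which behaviour the backend uses).

```typescript
type Schedule = (fn: () => void, ms: number) => void;

const defaultSchedule: Schedule = (fn, ms) => {
  setTimeout(fn, ms);
};

class CommitMergeBuffer {
  private parts: string[] = [];
  private windowOpen = false;
  private onFlush: (merged: string) => void;
  private windowMs: number;
  private schedule: Schedule;

  constructor(
    onFlush: (merged: string) => void,
    windowMs: number = 2500, // COMMIT_MERGE_MS from the text
    schedule: Schedule = defaultSchedule,
  ) {
    this.onFlush = onFlush;
    this.windowMs = windowMs;
    this.schedule = schedule;
  }

  // Call for every committed_transcript received from Scribe.
  add(text: string): void {
    const trimmed = text.trim();
    if (trimmed === "") return;
    this.parts.push(trimmed);
    if (!this.windowOpen) {
      // Assumption: the first commit opens the window; later commits only join it.
      this.windowOpen = true;
      this.schedule(() => this.flush(), this.windowMs);
    }
  }

  // Hands the merged phrase to the translation step and resets the buffer.
  private flush(): void {
    const merged = this.parts.join(" ");
    this.parts = [];
    this.windowOpen = false;
    if (merged !== "") this.onFlush(merged);
  }
}
```

Passing the scheduler in makes the window testable without real timers; in the running backend the default `setTimeout` wrapper would be used.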
Stability Timeout
If VAD stalls (no new commits), a 3.5 second fallback timer (STABILITY_TIMEOUT_MS) fires to translate whatever new text has accumulated, preventing indefinite silence.
Text Validation
Before translation, text is validated against EN/RU/UK character regex patterns. This filters out hallucinated text from the STT model (common with silence or background noise).
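A validation step of this kind might look like the following sketch. The exact regex patterns used by the app are not given in the text, so the character classes here are assumptions: Latin letters, the Cyrillic block U+0400 to U+04FF, digits, whitespace, and common punctuation. The function name `isPlausibleTranscript` is also hypothetical.

```typescript
// Assumed allow-list: Latin, Cyrillic (U+0400-U+04FF), digits, whitespace, common punctuation.
const ALLOWED_CHARS = /^[A-Za-z\u0400-\u04FF0-9\s.,!?;:'"()\-\u2019\u00AB\u00BB]+$/;
// At least one real letter must be present, so "..." or "123" alone is rejected.
const HAS_LETTER = /[A-Za-z\u0400-\u04FF]/;

// Returns true when the text looks like EN/RU/UK speech rather than STT noise.
function isPlausibleTranscript(text: string): boolean {
  const trimmed = text.trim();
  if (trimmed.length === 0) return false;
  return HAS_LETTER.test(trimmed) && ALLOWED_CHARS.test(trimmed);
}
```

Text in other scripts or made only of symbols fails the check and would be dropped before any translation request is made.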
Provider Chain
The system supports four translation providers (Google Translate, LibreTranslate, DeepL, and Claude) with automatic fallback. Three of them are described here:
- LibreTranslate (Default): Self-hosted, no API key required. Runs in Docker alongside the app. Best for privacy and cost.
- DeepL (Premium): High-quality translations. Supports both free and paid API tiers. Auto-detects endpoint.
- Claude (AI): Anthropic's Claude for context-aware translations. Uses claude-haiku-4-5 for speed.
Fallback Logic
1. Try primary provider (admin-selected)
2. If primary fails → try configured fallback
3. If fallback fails → try LibreTranslate (last resort)
4. If all fail → emit error event
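The four steps can be sketched as an ordered provider chain plus a loop that tries each entry. This is a sketch under assumptions: `ProviderName`, `TranslateFn`, `buildProviderChain`, and `translateWithFallback` are hypothetical names, and the provider functions are supplied by the caller rather than being real client calls.

```typescript
type ProviderName = "google" | "libretranslate" | "deepl" | "claude";
type TranslateFn = (text: string, source: string, target: string) => Promise<string>;

// Steps 1-3: primary, then the configured fallback, then LibreTranslate, without duplicates.
function buildProviderChain(primary: ProviderName, fallback: ProviderName | "none"): ProviderName[] {
  const chain: ProviderName[] = [primary];
  if (fallback !== "none" && chain.indexOf(fallback) === -1) chain.push(fallback);
  if (chain.indexOf("libretranslate") === -1) chain.push("libretranslate");
  return chain;
}

// Tries each provider in order and returns the first successful translation.
async function translateWithFallback(
  chain: ProviderName[],
  providers: Record<ProviderName, TranslateFn>,
  text: string,
  source: string,
  target: string,
): Promise<string> {
  const errors: string[] = [];
  for (const name of chain) {
    try {
      return await providers[name](text, source, target);
    } catch (err) {
      errors.push(name + ": " + (err instanceof Error ? err.message : String(err)));
    }
  }
  // Step 4: the caller would turn this into an error event for the client.
  throw new Error("All translation providers failed (" + errors.join("; ") + ")");
}
```

For example, a primary of `deepl` with fallback `claude` yields the chain `deepl`, `claude`, `libretranslate`, while a primary of `libretranslate` collapses to a single entry.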
Language Detection
The app uses a two-layer auto-detection approach:
Layer 1: Script-based Pre-detection
Before calling any translation API, the backend checks Unicode character scripts:
- Cyrillic characters (Unicode 0x0400–0x04FF) → if >50% of matched letters are Cyrillic, detected as Russian
- Latin characters → detected as English
- This avoids low-confidence results from LibreTranslate's /detect endpoint on short text
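The script check described above can be sketched as follows. The function name `detectByScript` is hypothetical, and returning `null` when the text contains no letters is an assumption; the 50% threshold and the U+0400 to U+04FF range come from the text.

```typescript
// Counts Cyrillic (U+0400-U+04FF) and basic Latin letters and applies the >50% rule.
function detectByScript(text: string): "ru" | "en" | null {
  let cyrillic = 0;
  let latin = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0x0400 && code <= 0x04ff) {
      cyrillic++;
    } else if ((code >= 0x41 && code <= 0x5a) || (code >= 0x61 && code <= 0x7a)) {
      latin++;
    }
  }
  const letters = cyrillic + latin;
  if (letters === 0) return null; // assumption: no letters means no script-based guess
  return cyrillic / letters > 0.5 ? "ru" : "en";
}
```

Note that this layer cannot separate Russian from Ukrainian, since both use the same Unicode block; that distinction is left to the STT language code in Layer 2.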
Layer 2: STT Language Code
When the auto_language_detect flag is enabled, ElevenLabs Scribe returns a language_code with each transcript commit. The backend uses this to correctly route EN/RU/UK without relying solely on script detection.
Note: For LibreTranslate, both Russian and Ukrainian Cyrillic text is passed with source ru since LibreTranslate handles Ukrainian text acceptably via the Russian model. DeepL and Claude providers distinguish Ukrainian natively and handle uk as a proper source language.
Language Gating
Detected languages are checked against the admin-approved pool. If a detected language isn't in the allowed set, the translation is rejected to prevent hallucinated language outputs.
TTS Pipeline
After translation, the text is sent to ElevenLabs TTS:
const stream = await client.textToSpeech.stream(voiceId, {
  text: translatedText,
  model_id: "eleven_multilingual_v2",
  output_format: "mp3_44100_128",
  voice_settings: {
    stability: 0.5,
    similarity_boost: 0.75,
    style: 0.0,
    speed: 1.0,
    use_speaker_boost: true
  }
});
Audio Delivery
TTS audio is streamed to a Buffer, then emitted as a base64-encoded MP3 via the tts_audio Socket.io event.
Frontend Playback Queue
The frontend maintains an audio queue to prevent overlapping playback:
- Received tts_audio events are queued
- Each segment plays to completion before the next starts
- A configurable pause (600ms default) is inserted between segments
- The pause duration is controlled by tts_segment_pause_ms (adjustable in admin)
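The queue behaviour listed above can be sketched as a small state machine. This is not the frontend's actual code: `TtsPlaybackQueue`, `enqueue`, and `segmentEnded` are hypothetical names, and the `play` and `schedule` callbacks are injected so the sketch does not depend on any particular audio API.

```typescript
type Schedule = (fn: () => void, ms: number) => void;

const defaultSchedule: Schedule = (fn, ms) => {
  setTimeout(fn, ms);
};

class TtsPlaybackQueue {
  private queue: string[] = [];
  private busy = false;
  private play: (audioBase64: string) => void;
  private pauseMs: number;
  private schedule: Schedule;

  constructor(play: (audioBase64: string) => void, pauseMs: number = 600, schedule: Schedule = defaultSchedule) {
    this.play = play;
    this.pauseMs = pauseMs;
    this.schedule = schedule;
  }

  // Call from the tts_audio handler.
  enqueue(audioBase64: string): void {
    this.queue.push(audioBase64);
    if (!this.busy) this.playNext();
  }

  // Call when the current segment finishes (for example, an audio element's "ended" event).
  segmentEnded(): void {
    // The queue stays busy during the pause so a newly arrived segment cannot skip it.
    this.schedule(() => this.playNext(), this.pauseMs);
  }

  private playNext(): void {
    const next = this.queue.shift();
    if (next === undefined) {
      this.busy = false;
      return;
    }
    this.busy = true;
    this.play(next);
  }
}
```

Updating `pauseMs` when the server sends a new tts_segment_pause_ms value would be a natural extension, left out here to keep the sketch short.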
Microphone Input
- User selects "Mic" tab and chooses a TTS voice
- Browser captures audio via Web Audio API's ScriptProcessor
- PCM 16-bit LE at 16kHz sample rate sent to backend via Socket.io
- Backend pipes audio to ElevenLabs Scribe v2 Realtime WebSocket
- Language auto-detected (EN/RU/UK), text translated and synthesized
- TTS audio returned and played back with inter-segment pauses
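The Web Audio API delivers samples as 32-bit floats in the range -1 to 1, so the PCM step above implies a conversion to 16-bit integers. A minimal sketch of that conversion is shown below; the function name `floatTo16BitPcm` is hypothetical, and it assumes the audio is already at 16 kHz (resampling is not shown).

```typescript
// Converts Web Audio float samples (-1..1) to signed 16-bit PCM.
// Int16Array uses the platform byte order, which is little-endian on common hardware.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff);
  }
  return pcm;
}
```

The resulting buffer is what would be sent to the backend as one audio chunk over Socket.io.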
YouTube Input
- User pastes a YouTube URL (live stream or video)
- Backend spawns yt-dlp | ffmpeg child processes
- Audio extracted as PCM stream (16kHz, 16-bit LE, mono)
- Piped to Scribe v2, same pipeline as microphone
- Stream ends when YouTube content ends or user stops
User Interface
The user view features a dark cavern theme with:
- Waveform visualizer — Canvas-based bar chart with orange gradient and cyan tips
- Transcript display — White translated text scrolls upward with fade masks
- Partial transcript — Shown in italic orange while STT is processing
- Source tabs — Toggle between Mic and YouTube (controlled by feature flags)
How It Works
The backend uses yt-dlp and ffmpeg as child processes to extract audio from YouTube URLs:
yt-dlp (best audio) → ffmpeg (PCM 16kHz 16-bit LE mono) → Scribe v2
Supported Sources
- Live streams — Translates in real-time as the stream progresses
- Regular videos — Processes the full audio track
- Any URL supported by yt-dlp (YouTube, etc.)
Requirements
Both yt-dlp and ffmpeg must be installed and available in the system PATH. On macOS:
brew install yt-dlp ffmpeg
⚠
Feature Flag Required
YouTube input is controlled by the youtube_input feature flag. Enable it in the admin panel to show the YouTube tab in the user view.
Overview
The Biblical Transcript Simulator is an admin-only feature that generates biblical text passages using Google's Gemini API (gemini-2.5-flash), then routes them through the full translation pipeline. This provides a hands-free way to test STT → Translation → TTS without a live audio source.
Language Styles
| Language | Style | Example |
| en | King James English | "In the beginning was the Word..." |
| ru | Church Slavonic Russian | "В начале было Слово..." |
| uk | Traditional Ukrainian | "На початку було Слово..." |
Flow
- Admin selects language (EN/RU/UK)
- Backend calls Gemini 2.5 Flash with streaming
- Gemini generates 6-8 biblical passages, 3-5 sentences each
- Stream is buffered until 140+ characters AND complete sentences
- Chunks emitted with 1800ms smooth pacing between them
- Each chunk flows through the standard pipeline:
- Emitted as transcript (isFinal: true)
- Auto-translated via configured provider
- TTS synthesized and audio returned
- Frontend plays audio with standard inter-segment pause
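The buffering rule in the flow (at least 140 characters and ending on a complete sentence) can be sketched as a small chunker. The `SentenceChunker` class is hypothetical, and treating `.`, `!`, and `?` as the only sentence terminators is an assumption; the 1800 ms pacing between emitted chunks is not shown.

```typescript
class SentenceChunker {
  private buffer = "";
  private minChars: number;

  constructor(minChars: number = 140) {
    this.minChars = minChars;
  }

  // Feed streamed text; returns zero or more chunks that are ready to emit.
  push(delta: string): string[] {
    this.buffer += delta;
    const ready: string[] = [];
    for (;;) {
      const cut = this.findCut();
      if (cut === -1) break;
      ready.push(this.buffer.slice(0, cut).trim());
      this.buffer = this.buffer.slice(cut);
    }
    return ready;
  }

  // Returns whatever is left when the stream ends.
  flush(): string {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest;
  }

  // Index just past the first sentence terminator at or beyond minChars, or -1 if none yet.
  private findCut(): number {
    for (let i = Math.max(0, this.minChars - 1); i < this.buffer.length; i++) {
      const ch = this.buffer[i];
      if (ch === "." || ch === "!" || ch === "?") return i + 1;
    }
    return -1;
  }
}
```

Sentences that end before the minimum length are kept in the buffer and emitted together with the next sentence, so every chunk satisfies both conditions.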
💡
Feature Flag
Enable biblical_simulator in the admin feature flags panel. The Gemini API key is configured via the GEMINI_API_KEY environment variable or set at runtime in the admin API Keys panel.
Overview
Voice Training uses ElevenLabs' Instant Voice Cloning (IVC) API to create custom voices from audio samples. Once cloned, the voice appears in the voice selector immediately.
From Microphone
- Open the Voice Training section in the admin panel
- Click Generate Text to get an AI-generated reading passage (via Gemini) — gives the speaker natural, phonetically diverse text to read aloud
- Record multiple audio clips using your browser microphone while reading the generated text
- Provide a name for the voice
- Clips are uploaded to ElevenLabs IVC API
- Cloned voice is available for TTS immediately
- Click Preview Voice to hear the cloned voice speak a sample sentence via TTS
From YouTube
- Paste a YouTube URL in the Voice Training section
- Backend extracts N × 30-second clips via yt-dlp + ffmpeg
- Clips are uploaded to ElevenLabs IVC API
- Resulting voice is stored in your ElevenLabs account
⚠
ElevenLabs Account
Cloned voices are stored in your ElevenLabs account, not locally. Ensure your plan supports voice cloning.
Concepts
| Concept | Description |
| Active Language Pair | The current pair used for translation (e.g., EN ↔ RU, EN ↔ UK, or RU ↔ UK). Set by admin. |
| Available Languages | The pool of languages viewers can select from (if user_language_selector is enabled). |
Admin Controls
- Change the active language pair via the admin panel
- Changes broadcast to all connected clients in real-time
- Manage the available languages pool for viewer selection
Viewer Selection
When the user_language_selector feature flag is enabled, viewers can override the admin-set language pair by selecting their own preferred languages from the available pool.
Overview
Two people can video call each other through the app, each speaking their own language. The app transcribes, translates, and synthesizes speech in real-time so each participant hears the other in their language.
Feature flag: Video call is gated behind the video_translation flag. Enable it in the admin panel or set video_translation: true in your YAML config.
How It Works
- Create a room — Person A selects their language, picks a TTS voice, and clicks "Create Room". A 6-character room code is generated.
- Share the code — Person A shares the room code with Person B (copy button provided).
- Join the room — Person B enters the code, selects their language and TTS voice, and clicks "Join".
- WebRTC connection — The app establishes a peer-to-peer video connection via WebRTC (signaled through Socket.io). Video flows directly between browsers.
- Audio translation — Each participant's microphone audio is simultaneously:
- Sent to the peer via WebRTC (but muted on their end)
- Captured as PCM chunks and sent to the backend via Socket.io for STT
- Translation pipeline — Each participant has their own independent Scribe STT session. Transcribed text is translated to the other participant's language, then synthesized via ElevenLabs TTS and sent back to the peer.
- Playback — The peer hears the TTS translation instead of the raw audio. Translated transcript is displayed below the video.
Architecture
Person A (Browser)          Server                  Person B (Browser)
├─ getUserMedia             ├─ Socket.io            ├─ getUserMedia
├─ WebRTC P2P ═══video═══►  │  (signaling)  ◄═══    ├─ WebRTC P2P
│                           │                       │
├─ PCM chunks ──Socket.io─► ├─ ScribeA (STT)        │
│                           │  ↓ translate          │
│                           │  ↓ TTS ─────────────► ├─ Plays TTS
│                           │                       │
│  Plays TTS ◄───────────── ├─ ScribeB (STT) ◄───── ├─ PCM chunks
│  (remote video muted)     │  ↓ translate          │  (remote video muted)
└───────────────────────────┴───────────────────────┘
Socket Events
| Event | Direction | Purpose |
| video_create_room | C→S | Create a new room with language + voice |
| video_room_created | S→C | Returns the 6-char room code |
| video_join_room | C→S | Join an existing room |
| video_room_joined | S→C | Sent to both participants, triggers WebRTC |
| video_signal_offer/answer/ice | C↔S | WebRTC signaling relay |
| video_audio_chunk | C→S | PCM audio for STT processing |
| video_transcript | S→C | Transcript sent to the speaker |
| video_translation | S→C | Translation sent to the listener |
| video_tts_audio | S→C | TTS audio sent to the listener |
| video_leave_room | C→S | Leave the room |
| video_room_closed | S→C | Notify peer when other leaves |
Room Lifecycle
- Rooms are stored in Redis with key video_room:{code} and a 4-hour TTL
- Maximum 2 participants per room
- When one participant disconnects, the other is notified and the call ends
- Scribe sessions are automatically cleaned up on disconnect
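The room rules above can be sketched with a few helpers. Everything named here is hypothetical (`generateRoomCode`, `roomKey`, `joinRoom`, the `Room` shape), and the code alphabet is an assumption, since the text only says the code has 6 characters. The key format, the 4-hour TTL, and the two-participant limit come from the text.

```typescript
const ROOM_CODE_LENGTH = 6;
const ROOM_TTL_SECONDS = 4 * 60 * 60; // 4-hour TTL, e.g. for Redis SET ... EX
// Assumed alphabet: uppercase letters and digits without look-alikes (0/O, 1/I).
const ROOM_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

// The random source is injectable; it must return values in [0, 1).
function generateRoomCode(random: () => number = Math.random): string {
  let code = "";
  for (let i = 0; i < ROOM_CODE_LENGTH; i++) {
    code += ROOM_ALPHABET[Math.floor(random() * ROOM_ALPHABET.length)];
  }
  return code;
}

// Redis key for a room, following the video_room:{code} pattern.
function roomKey(code: string): string {
  return "video_room:" + code;
}

interface Room {
  code: string;
  participants: string[];
}

// Returns the updated room, or null when the room already has two participants.
function joinRoom(room: Room, socketId: string): Room | null {
  if (room.participants.length >= 2) return null;
  return { code: room.code, participants: room.participants.concat(socketId) };
}
```

`Math.random` is not a cryptographic source; a deployment that treats room codes as secrets would likely swap in a stronger generator.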
The Mac Audio Agent has moved to its own public repository:
github.com/Pzharyuk/live-translator-agent
It is a lightweight Node.js daemon that runs as a macOS LaunchAgent and streams microphone audio to the live-translator backend via Socket.io — eliminating the need to open a browser for the Remote Audio Source role.
Pre-shared key authentication
Any socket that emits register_audio_source must present the server's pre-shared key in the Socket.IO handshake (auth.agentPsk). This stops random clients from connecting to the backend and impersonating an agent.
- Server: set AGENT_PSK (env var), which surfaces as auth.agent_psk in application.yaml. An empty value disables enforcement and logs a warning on every registration.
- Mac daemon: add agentPsk to ~/.config/live-translator-agent/config.json (or set the AGENT_PSK env var; the env var wins).
- Browser /audio-source: paste the key into the new Agent Pre-Shared Key field; it is stored in localStorage on that device only and travels in the handshake (never in event payloads).
- Mismatch behaviour: server logs register_audio_source REJECTED ... invalid or missing PSK, emits agent_auth_error to the client, then disconnects.
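The decision the server makes for each register_audio_source attempt can be sketched as a pure function. The name `checkAgentPsk` and the result labels are hypothetical; the three outcomes (enforcement disabled, accepted, rejected) follow the behaviour listed above.

```typescript
type PskResult = "accepted" | "accepted_unenforced" | "rejected";

// serverPsk is the configured AGENT_PSK; presented is the value of auth.agentPsk from the handshake.
function checkAgentPsk(serverPsk: string, presented: unknown): PskResult {
  // An empty server key disables enforcement; the caller would log a warning here.
  if (serverPsk === "") return "accepted_unenforced";
  if (typeof presented !== "string") return "rejected";
  // Plain comparison for clarity; a constant-time comparison may be preferable in production.
  return presented === serverPsk ? "accepted" : "rejected";
}
```

On `rejected`, the handler would emit agent_auth_error and disconnect the socket, as described in the last bullet.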
Feature Flags
Feature flags control runtime behavior without redeployment. Defaults are specified in config/application.yaml under the feature_flags section. At startup, these values are read into memory. Admins can override any flag at runtime via the POST /admin/flags/:flag endpoint, and changes are persisted in Redis and broadcast to all connected clients via Socket.IO.
Storage: Flag values live in Redis under keys like flag:youtube_input. On server restart, Redis overrides are preserved; if Redis is unavailable or a flag was never overridden, the YAML default is used. All changes immediately emit a feature_flags socket event to all connected clients so the UI reflects the new state without requiring a page refresh.
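The merge of YAML defaults with Redis overrides can be sketched as below. The function name `mergeFlags` is hypothetical, and the sketch assumes Redis stores each override as the string "true" or "false" under the flag:<name> key; a `null` argument stands for "Redis unavailable".

```typescript
type Flags = Record<string, boolean>;

// redisValues maps Redis keys (e.g. "flag:youtube_input") to raw string values, or is null when Redis is down.
function mergeFlags(yamlDefaults: Flags, redisValues: Record<string, string | null> | null): Flags {
  const merged: Flags = Object.assign({}, yamlDefaults);
  if (redisValues === null) return merged; // fall back to YAML only
  for (const name of Object.keys(yamlDefaults)) {
    const raw = redisValues["flag:" + name];
    if (raw === "true") merged[name] = true;
    else if (raw === "false") merged[name] = false;
    // Any other value (missing key, null, unexpected text) keeps the YAML default.
  }
  return merged;
}
```

The merged object is what would be sent to clients in the feature_flags event after every change.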
| Flag | Default | Description |
| youtube_input | true | Enable YouTube URL input as a broadcast audio source. |
| mic_input | true | Enable microphone input for broadcast and private sessions. |
| auto_language_detect | true | Automatically detect source language; if false, use translation.source_lang from config. |
| user_language_selector | false | Allow viewers to select the translation language pair instead of admin-only control. |
| audio_device_selector | true | Show audio device selection in the UI (microphone/input device picker). |
| video_translation | true | Enable video call real-time translation feature (/video route). |
| video_voice_cloning | false | Show "Clone Voice" button in video lobby (premium feature). |
| remote_audio_source | false | Enable /audio-source route for headless remote audio relay agents. |
| agent_audio_source | false | Show connected agent audio sources section in the admin panel. |
| broadcast | false | Enable /broadcast route: public receiver page for live translation broadcast. |
| translate | false | Enable /translate route: live translator page for personal audio-to-text sessions. |
API Endpoints
GET /admin/flags
Retrieve all feature flags (merged from YAML defaults & Redis overrides).
curl -X GET http://localhost:3001/admin/flags \
  -H "Cookie: sid=<session_token>"
Response:
{
  "flags": {
    "youtube_input": true,
    "mic_input": true,
    "broadcast": false,
    "translate": true
  }
}
GET /admin/flags/:flag
Retrieve a single flag value. Returns the merged YAML + Redis value.
curl -X GET http://localhost:3001/admin/flags/broadcast \
  -H "Cookie: sid=<session_token>"
Response:
{
  "flag": "broadcast",
  "value": false
}
POST /admin/flags/:flag
Set a feature flag value. Persists to Redis, broadcasts to all clients via Socket.IO.
curl -X POST http://localhost:3001/admin/flags/broadcast \
  -H "Cookie: sid=<session_token>" \
  -H "Content-Type: application/json" \
  -d '{"value": true}'
Response:
{
  "flag": "broadcast",
  "value": true
}
Socket.IO Events
feature_flags (emit to client)
Broadcast whenever any flag changes. Clients receive the merged YAML + Redis state.
socket.on('feature_flags', (merged) => {
  console.log('Flags updated:', merged);
  // merged = { youtube_input: true, broadcast: true, ... }
});
File Structure
| File | Purpose |
| config/application.yaml | Base defaults for all environments |
| config/application-local.yaml | Local development overrides (localhost URLs) |
| config/application-prod.yaml | Production overrides (Docker service names) |
The APP_ENV environment variable (local or prod) determines which overlay file is loaded on top of the base config.
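Loading an overlay on top of the base config amounts to a recursive merge in which overlay values win. The sketch below works on already-parsed objects (YAML parsing is not shown); `mergeConfig` and `overlayFileFor` are hypothetical names, and it assumes nested sections are merged key by key while arrays and scalars are replaced whole.

```typescript
type ConfigValue = string | number | boolean | null | ConfigValue[] | { [key: string]: ConfigValue };
type ConfigObject = { [key: string]: ConfigValue };

function isConfigObject(value: ConfigValue | undefined): value is ConfigObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a new object: base values, with overlay values taking priority at every level.
function mergeConfig(base: ConfigObject, overlay: ConfigObject): ConfigObject {
  const result: ConfigObject = Object.assign({}, base);
  for (const key of Object.keys(overlay)) {
    const baseValue = result[key];
    const overlayValue = overlay[key];
    result[key] =
      isConfigObject(baseValue) && isConfigObject(overlayValue)
        ? mergeConfig(baseValue, overlayValue)
        : overlayValue;
  }
  return result;
}

// Picks the overlay file from APP_ENV; assumes anything other than "prod" means local.
function overlayFileFor(appEnv: string | undefined): string {
  return "config/application-" + (appEnv === "prod" ? "prod" : "local") + ".yaml";
}
```

With this approach an overlay only needs to list the keys that differ, such as a local Redis host, and everything else is inherited from the base file.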
Full Configuration Reference
server:
  port: 3001
  cors_origin: "http://localhost:5173"
elevenlabs:
  api_key: "${ELEVENLABS_API_KEY}"
  default_voice_id: "kxj9qk6u5PfI0ITgJwO0"
  tts_model: "eleven_multilingual_v2"
  tts_settings:
    stability: 0.5
    similarity_boost: 0.75
    style: 0.0
    speed: 1.0
    use_speaker_boost: true
  stt_model: "scribe_v2_realtime"
anthropic:
  api_key: "${ANTHROPIC_API_KEY}"
deepl:
  api_key: "${DEEPL_API_KEY}"
libretranslate:
  url: "http://libretranslate:5000"
  api_key: ""
redis:
  host: "redis"
  port: 6379
  password: ""
feature_flags:
  youtube_input: true
  mic_input: true
  auto_language_detect: true
  user_language_selector: false
  audio_device_selector: true
  video_translation: false
  video_voice_cloning: false
  broadcast: false
audio:
  sample_rate: 16000
  channels: 1
  chunk_duration_ms: 250
translation:
  source_lang: "auto"
  target_lang_en: "en"
  target_lang_ru: "ru"
  provider: "libretranslate"
  fallback: "libretranslate"
Environment Variable Interpolation
YAML values using ${VAR_NAME} syntax are automatically replaced with the corresponding environment variable at startup.
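The substitution can be sketched as a single regular-expression replace over each string value. The function name `interpolateEnv` is hypothetical, and replacing an unset variable with an empty string is an assumption about how missing values are handled.

```typescript
// Replaces every ${VAR_NAME} occurrence with the matching entry from env.
function interpolateEnv(value: string, env: Record<string, string | undefined>): string {
  return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match: string, name: string) => {
    const resolved = env[name];
    return resolved === undefined ? "" : resolved; // assumption: unset variables become ""
  });
}
```

In a Node.js backend the second argument would typically be `process.env`, applied to each string in the parsed YAML tree at startup.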
Environment Variables
Configure the application using environment variables from .env.example and application.yaml.
| Variable | Required | Default | Description |
| ELEVENLABS_API_KEY | Yes | (none) | API key for ElevenLabs text-to-speech & speech-to-text services. |
| elevenlabs.default_voice_id | Yes | kxj9qk6u5PfI0ITgJwO0 | Default ElevenLabs voice ID for TTS output. |
| elevenlabs.tts_model | No | eleven_multilingual_v2 | ElevenLabs TTS model identifier. |
| elevenlabs.stt_model | No | scribe_v2_realtime | ElevenLabs speech-to-text model (Scribe realtime endpoint). |
| elevenlabs.tts_settings.stability | No | 0.5 | TTS voice stability (0–1; higher → more consistent). |
| elevenlabs.tts_settings.similarity_boost | No | 0.75 | TTS similarity to original voice (0–1). |
| elevenlabs.tts_settings.style | No | 0.0 | TTS exaggeration of voice style (0–1). |
| elevenlabs.tts_settings.speed | No | 1.0 | TTS playback speed multiplier. |
| elevenlabs.tts_settings.use_speaker_boost | No | true | Enable ElevenLabs speaker boost for clearer output. |
| ANTHROPIC_API_KEY | No | (none) | Claude API key for sermon generation & translation (optional; Gemini preferred). |
| GOOGLE_TRANSLATE_API_KEY | No | (none) | Google Translate API key (default translation provider). |
| GEMINI_API_KEY | No | (none) | Google Gemini API key for biblical simulator & sermon generation. |
| GOOGLE_CLIENT_ID | No | (none) | OAuth 2.0 Client ID for Google Sign-In (optional). |
| GOOGLE_CLIENT_SECRET | No | (none) | OAuth 2.0 Client Secret for Google Sign-In. |
| DEEPL_API_KEY | No | (none) | DeepL API key (alternative translation provider). |
| YOUTUBE_API_KEY | No | (none) | YouTube Data API v3 key for live stream discovery. |
| YOUTUBE_CHANNEL_ID | No | (none) | Default YouTube channel ID to search for live streams. |
| libretranslate.url | No | http://libretranslate:5000 | LibreTranslate service endpoint. |
| libretranslate.api_key | No | (none) | LibreTranslate API key (if instance requires authentication). |
| LIBRETRANSLATE_API_KEY | No | (none) | Alternative env-var for LibreTranslate API key. |
| server.port | No | 3001 | Backend server port. |
| server.cors_origin | No | http://localhost:5173 | CORS origin for frontend requests. |
| FRONTEND_URL | No | http://localhost | Frontend URL for CORS & redirects in production. |
| LISTEN_PORT | No | 80 | Frontend port (used in Docker Compose). |
| database.host | No | postgres | PostgreSQL hostname. |
| database.port | No | 5432 | PostgreSQL port. |
| database.username | No | translator | PostgreSQL username. |
| database.password | Yes | (none) | PostgreSQL password. |
| DB_PASSWORD | Yes | (none) | PostgreSQL password (referenced in YAML as ${DB_PASSWORD}). |
| database.database | No | translator_db | PostgreSQL database name. |
| database.pool_size | No | 10 | PostgreSQL connection pool size. |
| redis.host | No | redis | Redis hostname. |
| redis.port | No | 6379 | Redis port. |
| redis.password | No | (none) | Redis password (optional). |
| REDIS_PASSWORD | No | (none) | Redis password environment variable. |
| auth.admin_username | No | admin | Legacy admin panel username (deprecated). |
| auth.admin_password | No | (none) | Legacy admin panel password (deprecated). |
| ADMIN_PASSWORD | No | admin123 | Legacy admin socket authentication password. |
| APP_ADMIN_USERNAME | No | admin | Admin user seeded into database on first boot. |
| APP_ADMIN_PASSWORD | No | admin123 | Admin user initial password (user must change on first login). |
| APP_USERNAME | No | user | User-facing login username. |
| APP_PASSWORD | No | changeme | User-facing login password. |
| auth.jwt_secret | Yes | (none) | JWT secret for session cookies (generate with openssl rand -hex 32). |
| JWT_SECRET | Yes | (none) | JWT secret environment variable. |
| auth.session_days | No | 30 | Session cookie expiration in days. |
| auth.agent_psk | No | (none) | Pre-shared key for remote audio agents (register_audio_source); leave empty to disable. |
| AGENT_PSK | No | (none) | Agent pre-shared key environment variable. |
| COOKIE_SECURE | No | true | Set secure flag on cookies (required for HTTPS). |
| APPLE_CLIENT_ID | No | (none) | Apple Sign In Services ID identifier. |
| APPLE_TEAM_ID | No | (none) | Apple Developer Team ID (10 characters). |
| APPLE_KEY_ID | No | (none) | Apple Sign In Key ID (10 characters from .p8 file). |
| APPLE_PRIVATE_KEY | No | (none) | Apple private key (.p8 file contents with PEM markers). |
| OIDC_ISSUER | No | (none) | OpenID Connect issuer URL (legacy Authentik; being phased out). |
| OIDC_CLIENT_ID | No | (none) | OIDC client ID. |
| OIDC_CLIENT_SECRET | No | (none) | OIDC client secret. |
| audio.sample_rate | No | 16000 | Audio sample rate in Hz (16 kHz for Scribe compatibility). |
| audio.channels | No | 1 | Number of audio channels (mono). |
| audio.chunk_duration_ms | No | 250 | Audio chunk duration for WebSocket streaming. |
| translation.source_lang | No | auto | Source language (auto-detect). |
| translation.target_lang_en | No | en | Target language when source is non-English. |
| translation.target_lang_ru | No | ru | Target language when source is non-Russian. |
| translation.provider | No | google | Primary translation provider: google, deepl, claude, or libretranslate. |
| TRANSLATION_PROVIDER | No | (none) | Translation provider override (sets boot default). |
| translation.fallback | No | libretranslate | Fallback provider when primary fails: google, deepl, claude, libretranslate, or none. |
| translation.translate_workers | No | 2 | Number of parallel translation workers (Stage 1 of TTS pipeline). |
| translation.request_timeout_ms | No | 5000 | Per-provider translation request timeout (milliseconds). |
| tts_pipeline.initial_buffer_segments | No | 1 | Translated segments to buffer before starting TTS playback. |
| tts_pipeline.low_water_hold_ms | No | 1500 | Wait time before emitting audio for segment N to ensure N+1 is queued (0 to disable). |
| APP_ENV | No | local | Application environment: local (dev) or prod (Docker). |
| feature_flags.youtube_input | No | true | Enable YouTube live stream broadcast source. |
| feature_flags.mic_input | No | true | Enable microphone input for admin broadcast. |
| feature_flags.auto_language_detect | No | true | Enable automatic language detection. |
| feature_flags.user_language_selector | No | false | Allow viewers to select language pair (admin-gated pool). |
| feature_flags.audio_device_selector | No | true | Enable audio device selection in admin panel. |
| feature_flags.video_translation | No | true | Enable video call translation feature. |
| feature_flags.video_voice_cloning | No | false | Premium feature: show Clone Voice button in /video lobby. |
| feature_flags.remote_audio_source | No | false | Enable /audio-source route for headless remote audio relay. |
| feature_flags.agent_audio_source | No | false | Show connected agent audio sources in admin panel. |
| feature_flags.broadcast | No | false | Enable /broadcast route (public receiver page). |
| feature_flags.translate | No | false | Enable /translate route (live translator page). |
TTS Settings
API Endpoints
GET /admin/tts-settings
Retrieve current TTS settings.
curl -X GET http://localhost:3001/admin/tts-settings \
  -H "Cookie: auth=<jwt-token>" \
  -H "Content-Type: application/json"
Response:
{
  "settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.0,
    "speed": 1.0,
    "use_speaker_boost": true
  }
}
POST /admin/tts-settings
Update one or more TTS settings (partial update).
curl -X POST http://localhost:3001/admin/tts-settings \
  -H "Cookie: auth=<jwt-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "stability": 0.6,
    "speed": 1.1
  }'
Response:
{
  "settings": {
    "stability": 0.6,
    "similarity_boost": 0.75,
    "style": 0.0,
    "speed": 1.1,
    "use_speaker_boost": true
  }
}
Settings Reference
| Setting | Range | Default | Description |
| stability | 0.0 – 1.0 | 0.5 | Voice stability: lower = more variable, higher = more consistent & monotone. |
| similarity_boost | 0.0 – 1.0 | 0.75 | How closely the voice matches the reference: higher = more similar to the original voice. |
| style | 0.0 – 1.0 | 0.0 | Emphasis strength: 0 = neutral, higher = more dramatic/expressive. |
| speed | 0.5 – 2.0 | 1.0 | Playback speed: 1.0 = normal, <1.0 = slower, >1.0 = faster. |
| use_speaker_boost | true or false | true | Enable speaker boost for enhanced clarity in noisy environments. |
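A partial update against these ranges can be sketched as a merge followed by validation. `TtsSettings`, `TTS_RANGES`, and `applyTtsPatch` are hypothetical names, and rejecting out-of-range values with an error (rather than clamping them) is an assumption about the server's behaviour.

```typescript
interface TtsSettings {
  stability: number;
  similarity_boost: number;
  style: number;
  speed: number;
  use_speaker_boost: boolean;
}

type NumericKey = "stability" | "similarity_boost" | "style" | "speed";

// Ranges taken from the settings reference table.
const TTS_RANGES: Record<NumericKey, [number, number]> = {
  stability: [0, 1],
  similarity_boost: [0, 1],
  style: [0, 1],
  speed: [0.5, 2],
};

// Merges the patch over the current settings; omitted fields keep their current values.
function applyTtsPatch(current: TtsSettings, patch: Partial<TtsSettings>): TtsSettings {
  const next: TtsSettings = Object.assign({}, current, patch);
  for (const key of Object.keys(TTS_RANGES) as NumericKey[]) {
    const [min, max] = TTS_RANGES[key];
    const value = next[key];
    if (typeof value !== "number" || Number.isNaN(value) || value < min || value > max) {
      throw new RangeError(key + " must be between " + min + " and " + max);
    }
  }
  return next;
}
```

Applied to the defaults with `{ stability: 0.6, speed: 1.1 }`, this produces the same settings object as the POST response shown earlier.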
Configuration File
Default values are defined in config/application.yaml under elevenlabs.tts_settings:
elevenlabs:
  api_key: "${ELEVENLABS_API_KEY}"
  default_voice_id: "kxj9qk6u5PfI0ITgJwO0"
  tts_model: "eleven_multilingual_v2"
  tts_settings:
    stability: 0.5
    similarity_boost: 0.75
    style: 0.0
    speed: 1.0
    use_speaker_boost: true
  stt_model: "scribe_v2_realtime"
Notes
- Settings are persisted to Redis (setting:tts_settings) and survive pod restarts.
- Changes apply immediately to all subsequent TTS requests; no server restart is required.
- The tts_model setting (e.g., eleven_multilingual_v2) is configured in YAML and cannot be changed via API.
- TTS output format defaults to mp3_44100_128 (MP3, 44.1 kHz, 128 kbps) for browser playback; pcm_16000 (16 kHz mono PCM) is used internally for Scribe STT loopback.
STT Timing Configuration
Speech-to-text timing settings control how long the system waits before dispatching audio chunks for translation.
These settings fine-tune the balance between responsiveness and translation accuracy. All values are dynamically
configurable via the admin API and persist across restarts.
Settings Reference
| Setting | Default | Description |
| commit_merge_ms | 2500 | Milliseconds to buffer VAD commits before translating; merges short fragments into larger chunks. |
| stability_timeout_ms | 2000 | Milliseconds to wait for stable partial text before triggering translation when text is unchanged. |
| tts_segment_pause_ms | 0 | Pause duration (ms) between TTS audio segments; sent to frontend for playback spacing. |
| max_accumulation_ms | 8000 | Maximum time to accumulate words during continuous speech before force-dispatching for translation. |
| vad_threshold | 0.5 | Voice activity detection sensitivity (0–1); higher values filter more background noise. |
| vad_silence_threshold_secs | 1.5 | Seconds of silence required before VAD commits a final transcript segment. |
| min_speech_duration_ms | 100 | Ignore speech segments shorter than this duration (milliseconds). |
| min_silence_duration_ms | 100 | Minimum silence gap required in milliseconds. |
| flush_on_sentence_boundary | true | When true, dispatch complete sentences at boundaries (.?!;) instead of waiting for longer buffers. |
| min_chars_before_dispatch | 40 | Minimum character count before a chunk is dispatched for translation; prevents tiny fragments. |
Tuning Guide
- Responsive translation: Lower commit_merge_ms (e.g., 1000) and max_accumulation_ms (e.g., 4000) to dispatch chunks more frequently. Reduces perceived latency but increases API calls.
- Accurate translation: Increase commit_merge_ms (e.g., 4000) to buffer more fragments, giving the translator context. Trade-off: slightly higher latency per segment.
- Continuous speech (sermons): The max_accumulation_ms timer ensures new words are dispatched every N milliseconds even when VAD and stability timers don't fire. Set to 8000–12000 for natural longer chunks.
- Sentence-aware dispatch: Enable flush_on_sentence_boundary to split at periods, question marks, and exclamation marks. Keeps translations more focused and prevents mid-sentence fragmentation.
- Silence sensitivity: Increase vad_silence_threshold_secs (e.g., 2.0) to require longer pauses before committing, reducing fragmentation on speaker breaths. Decrease (e.g., 0.8) for faster VAD commits.
- Noise filtering: Adjust vad_threshold (0.4–0.6) to suppress background noise. Higher values = stricter filtering but may miss quiet speech.
API Endpoints
GET /admin/stt-timing
Retrieve current STT timing settings.
curl -X GET http://localhost:3001/admin/stt-timing \
  -H "Cookie: auth=<jwt-token>"
Response:
{
  "settings": {
    "commit_merge_ms": 2500,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 0,
    "max_accumulation_ms": 8000,
    "vad_threshold": 0.5,
    "vad_silence_threshold_secs": 1.5,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true,
    "min_chars_before_dispatch": 40
  }
}
POST /admin/stt-timing
Update one or more STT timing settings. Partial updates are supported—omitted fields retain their current values.
curl -X POST http://localhost:3001/admin/stt-timing \
  -H "Cookie: auth=<jwt-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "max_accumulation_ms": 6000,
    "vad_silence_threshold_secs": 1.8,
    "flush_on_sentence_boundary": false
  }'
Response:
{
  "settings": {
    "commit_merge_ms": 2500,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 0,
    "max_accumulation_ms": 6000,
    "vad_threshold": 0.5,
    "vad_silence_threshold_secs": 1.8,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": false,
    "min_chars_before_dispatch": 40
  }
}
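The partial-update behavior shown in this request and response can be modeled as a shallow merge of the patch over the stored settings. The sketch below is not the project's actual handler: the names applySttTimingPatch and CURRENT_SETTINGS are hypothetical, and the starting values are taken from the GET response above.

```typescript
type SttTimingSettings = {
  commit_merge_ms: number;
  stability_timeout_ms: number;
  tts_segment_pause_ms: number;
  max_accumulation_ms: number;
  vad_threshold: number;
  vad_silence_threshold_secs: number;
  min_speech_duration_ms: number;
  min_silence_duration_ms: number;
  flush_on_sentence_boundary: boolean;
  min_chars_before_dispatch: number;
};

// Values copied from the GET /admin/stt-timing response above.
const CURRENT_SETTINGS: SttTimingSettings = {
  commit_merge_ms: 2500,
  stability_timeout_ms: 2000,
  tts_segment_pause_ms: 0,
  max_accumulation_ms: 8000,
  vad_threshold: 0.5,
  vad_silence_threshold_secs: 1.5,
  min_speech_duration_ms: 100,
  min_silence_duration_ms: 100,
  flush_on_sentence_boundary: true,
  min_chars_before_dispatch: 40,
};

// Hypothetical merge: fields present in the patch win, omitted fields are retained.
// A patch parsed from JSON never carries undefined values, so a plain spread is enough.
function applySttTimingPatch(
  current: SttTimingSettings,
  patch: Partial<SttTimingSettings>
): SttTimingSettings {
  return { ...current, ...patch };
}
```

Applying the three-field body from the curl example to CURRENT_SETTINGS yields the response shown above, with the other seven fields unchanged.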
Implementation Notes
- Persistence: All settings are stored in Redis under the setting:stt_timing key. Changes apply immediately to active broadcasts without requiring a restart.
- ElevenLabs Scribe integration: Settings like vad_threshold, vad_silence_threshold_secs, min_speech_duration_ms, and min_silence_duration_ms are passed directly to the Scribe v2 Realtime WebSocket as query parameters during session initialization.
- Multi-stage dispatch: Text flows through four potential dispatch mechanisms (in order): sentence boundary detection, stability timer, accumulation timer, and VAD commit buffer. The first one to fire wins; the others are cancelled to avoid duplicate translations.
- Word-count slicing: The system tracks how many words have already been dispatched and sends only new words, so it survives Scribe’s mid-utterance revisions without re-translating earlier text.
- Frontend awareness: The tts_segment_pause_ms value is emitted to the client via the stt_timing Socket.IO event, allowing the frontend to space out audio playback accordingly.
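The word-count slicing idea can be illustrated with a small counter that remembers how many words were already sent. This is a minimal sketch, not the project's code: the class name WordCountSlicer and its methods are hypothetical, and it assumes words are separated by whitespace.

```typescript
// Hypothetical illustration of word-count slicing; not the project's implementation.
class WordCountSlicer {
  private dispatchedWords = 0;

  // Returns only the words beyond those already dispatched, then advances the counter.
  next(transcript: string): string {
    const words = transcript
      .trim()
      .split(/\s+/)
      .filter((w) => w.length > 0);
    if (words.length <= this.dispatchedWords) {
      return ""; // a revision of earlier text with no new words: nothing to send
    }
    const fresh = words.slice(this.dispatchedWords);
    this.dispatchedWords = words.length;
    return fresh.join(" ");
  }

  // Call when the utterance ends and a new transcript starts from zero.
  reset(): void {
    this.dispatchedWords = 0;
  }
}
```

Because only the count is stored, a revised transcript that rewrites an earlier word does not trigger a re-send. The trade-off is that such corrections never reach the translator.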
Authentication: JWT cookie-based admin auth. Routes require either is_admin=true or valid permissions in the JWT payload. Unauthenticated requests receive 401 Not authenticated; insufficient permissions receive 403 Admin access required or 403 Insufficient permissions.
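The three outcomes described above can be expressed as one decision function. This is a sketch under assumptions: the payload shape (is_admin plus a permissions array), the function name authorizeAdminRequest, and the rule for choosing between the two 403 messages are all guesses for illustration. Only the status codes and messages come from the text.

```typescript
// Assumed JWT payload shape; the real payload may carry more fields.
type AdminJwtPayload = {
  is_admin?: boolean;
  permissions?: string[];
};

type AuthResult = { status: 200 } | { status: 401 | 403; error: string };

// payload is null when the auth cookie is missing or the token failed verification.
function authorizeAdminRequest(
  payload: AdminJwtPayload | null,
  requiredPermission?: string
): AuthResult {
  if (payload === null) {
    return { status: 401, error: "Not authenticated" };
  }
  if (payload.is_admin === true) {
    return { status: 200 };
  }
  // Assumption: routes with no named permission are admin-only.
  if (requiredPermission === undefined) {
    return { status: 403, error: "Admin access required" };
  }
  const granted = payload.permissions || [];
  if (granted.indexOf(requiredPermission) !== -1) {
    return { status: 200 };
  }
  return { status: 403, error: "Insufficient permissions" };
}
```

For example, a non-admin user whose token carries the user_management permission would pass a route that requires it and receive 403 on an admin-only route.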
API Keys Management
Retrieve status of all configured API keys (elevenlabs, anthropic, deepl, libretranslate, google, youtube).
Update one or more API keys.
Body: {
"elevenlabs"?: string,
"anthropic"?: string,
"deepl"?: string,
"libretranslate"?: string,
"google"?: string,
"youtube"?: string
}
Voice Management
Scan ElevenLabs and return all available voices with metadata (name, voice_id, category, preview_url).
Retrieve the admin-curated list of voice IDs allowed for viewers to select from.
Update the pool of allowed voice IDs; broadcasts to all connected clients in real-time.
Body: {
"voiceIds": string[]
}
TTS & STT Settings
Get current TTS voice settings (stability, similarity_boost, style, speed, use_speaker_boost).
Update TTS settings; persisted to Redis and used for all subsequent synthesis.
Body: {
"stability"?: number,
"similarity_boost"?: number,
"style"?: number,
"speed"?: number,
"use_speaker_boost"?: boolean
}
Get STT timing & VAD parameters (commit_merge_ms, stability_timeout_ms, tts_segment_pause_ms, max_accumulation_ms, vad_threshold, etc.).
Update STT timing configuration; controls speech-to-text chunking behavior and VAD sensitivity.
Body: {
"commit_merge_ms"?: number,
"stability_timeout_ms"?: number,
"tts_segment_pause_ms"?: number,
"max_accumulation_ms"?: number,
"vad_threshold"?: number,
"vad_silence_threshold_secs"?: number,
"min_speech_duration_ms"?: number,
"min_silence_duration_ms"?: number,
"flush_on_sentence_boundary"?: boolean,
"min_chars_before_dispatch"?: number
}
Get video call STT/TTS settings (stability_ms, commit_merge_ms, translation_provider).
Update video call translation pipeline settings.
Body: {
"stability_ms"?: number,
"commit_merge_ms"?: number,
"translation_provider"?: "libretranslate" | "claude" | "deepl" | "google"
}
Languages
Get the active translation language pair (e.g., ["en", "ru"]).
Set the active translation language pair; broadcasts to all viewers to update in real-time.
Body: {
"languages": [string, string]
}
Retrieve the admin-curated pool of language codes that viewers may select from.
Update the available language pool; broadcasts updated pool & active pair to all clients.
Body: {
"languages": string[]
}
Translation Provider
Get the active translation provider and list of available options (google, deepl, claude, libretranslate).
Set the active translation provider; persisted to Redis and used for all subsequent translations.
Body: {
"provider": "google" | "deepl" | "claude" | "libretranslate"
}
Get the active Claude model selection and available Claude model options.
Set the Claude model used when translation provider is 'claude'.
Body: {
"model": string
}
Audio Device
Get the admin-selected audio input device (overrides viewer's local selection).
Set the admin-enforced audio input device; broadcasts to all connected viewers in real-time.
Body: {
"deviceId"?: string,
"label"?: string
}
Feature Flags
Retrieve all feature flags (merged from config defaults & Redis overrides).
Get the value of a specific feature flag.
Set a feature flag value; broadcasts updated flags to all connected clients via Socket.IO.
Body: {
"value": boolean
}
TTS Preview
Generate TTS audio for preview; returns MP3 or PCM depending on format param.
Body: {
"text": string,
"voiceId"?: string,
"format"?: "mp3" | "pcm"
}
Voice Training & Cloning
Clone a voice from browser-recorded audio clips (base64-encoded).
Body: {
"name": string,
"clips": string[],
"mimeType"?: string
}
Clone a voice from a YouTube URL using yt-dlp & ffmpeg extraction.
Body: {
"name": string,
"youtubeUrl": string,
"clipCount"?: number,
"startOffset"?: number
}
Sermon Generation
Generate a biblical sermon snippet via Gemini Flash; returns poetic text in specified language.
Body: {
"apiKey"?: string,
"language"?: string,
"sentences"?: number
}
Hallucination Monitor
Retrieve hallucination detection statistics and logged events.
Clear the hallucination log.
Custom Fillers
Get the list of custom filler words to strip from transcripts before translation.
Update custom filler words; persisted to Redis and applied on-the-fly to in-flight broadcasts.
Body: {
"words": string[]
}
Translation Log
Retrieve translation log entries.
Clear the translation log.
YouTube
Get the configured YouTube channel ID and whether it came from environment variables.
Set the YouTube channel ID for live stream lookups.
Body: {
"channelId": string
}
Look up live streams from a YouTube channel (query param: channelId, defaults to configured ID); uses YouTube API if key is set, otherwise yt-dlp.
Broadcast Schedule
Retrieve the broadcast schedule (array of events with datetime, source, voice settings, optional translation pair).
Update the broadcast schedule; persisted to Redis.
Body: {
"events": Array<{
id: string,
title: string,
datetime: string,
description?: string,
source?: "mic" | "youtube" | "biblical" | "remote",
voiceId?: string,
durationMinutes?: number,
youtubeUrl?: string,
biblicalPrompt?: string,
agentId?: string,
agentDeviceId?: string,
skipSourceLang?: string | null,
allowedLanguages?: [string, string]
}>
}
Video Room Moderation
List all active video call rooms with participant metadata (socketId, language, displayName, connected status); strips rejoin tokens.
Force-close a video call room by code.
Queue & Monitoring
Get current broadcast TTS queue depth & Redis stream statistics.
Broadcast Sessions
Retrieve all broadcast sessions from PostgreSQL.
Get detailed session info including transcript log with timings.
Export session transcript as CSV, TXT, or JSON (query param: format=csv|txt|json).
User Management
List all users (requires permission: user_management); strips password hashes & avatar data.
Update user's admin status or role assignments (requires permission: user_management).
Body: {
"isAdmin"?: boolean,
"roleId"?: string | null,
"roleIds"?: string[]
}
Reset a user's password (requires permission: user_management); password must be ≥ 6 characters.
Body: {
"password": string
}
Delete a user (requires permission: user_management); prevents self-deletion.
Roles & Permissions
List all available permissions (requires permission: user_management).
Retrieve all roles (requires permission: user_management).
Create a new role (requires permission: user_management); name must be unique.
Body: {
"name": string,
"permissions": string[]
}
Update an existing role (requires permission: user_management).
Body: {
"name": string,
"permissions": string[]
}
Delete a role (requires permission: user_management).
Public Endpoints
Return the Gemini API key for frontend use (no auth required).
SDK
Uses the official @elevenlabs/elevenlabs-js SDK (v2). The client is lazy-loaded on first use.
Speech-to-Text (Scribe v2 Realtime)
Connects via native WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime. Handles:
- VAD-based commit buffering with configurable merge window
- Stability timeout fallback for stalled VAD
- Text validation (EN/RU/UK character regex filtering)
- Partial and final transcript emission
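Since the VAD settings are described elsewhere in this document as query parameters on the realtime WebSocket, the connection URL could be assembled roughly as follows. This is a sketch: it assumes the query keys are identical to the setting names, and the helper buildScribeUrl is hypothetical. A real session will likely need further parameters (authentication, model selection) that are not covered here.

```typescript
const SCRIBE_REALTIME_URL = "wss://api.elevenlabs.io/v1/speech-to-text/realtime";

// Assumption: query keys match the STT timing setting names one-to-one.
type ScribeVadParams = {
  vad_threshold: number;
  vad_silence_threshold_secs: number;
  min_speech_duration_ms: number;
  min_silence_duration_ms: number;
};

// Hypothetical helper: appends the VAD settings to the endpoint as a query string.
function buildScribeUrl(params: ScribeVadParams): string {
  const query = (Object.keys(params) as (keyof ScribeVadParams)[])
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(String(params[key]))}`)
    .join("&");
  return `${SCRIBE_REALTIME_URL}?${query}`;
}
```

Because these values are fixed in the URL at connection time, a changed VAD setting would presumably take effect on the next session rather than mid-stream.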
Text-to-Speech
Uses client.textToSpeech.stream() with the eleven_multilingual_v2 model. Audio is collected into a Buffer and emitted as base64 MP3.
Voice Management
- client.voices.getAll(): fetches all voices from the account
- Admin can filter which voices are available to viewers
- Voice cloning via IVC API (from recordings or YouTube)
Key File
backend/src/services/elevenlabs.service.ts
Provider Details
Google Translate
Google Cloud Translation API v2. Fast (~200ms), deterministic, and reliable. Requires GOOGLE_TRANSLATE_API_KEY with the Cloud Translation API enabled in Google Cloud Console. Ensure the API key has no HTTP referrer restrictions (server-side requests have no referrer).
File: backend/src/services/google-translate.service.ts
LibreTranslate
Self-hosted in Docker. No API key required by default. Provides language detection and translation via REST API.
File: backend/src/services/libretranslate.service.ts
DeepL
Premium translation API. Auto-detects free vs. paid endpoint based on the API key format.
File: backend/src/services/deepl.service.ts
Claude (Anthropic)
AI-powered translation using claude-haiku-4-5 for speed. Includes language detection and auto-flip logic.
File: backend/src/services/claude-translate.service.ts
Routing
Provider routing is handled by backend/src/services/translation.provider.ts:
- Try admin-selected primary provider
- On failure, try configured fallback provider
- LibreTranslate is always the last-resort fallback
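The three routing steps above can be sketched as an ordered provider chain that is tried until one succeeds. This is not the content of translation.provider.ts: the names buildProviderChain, translateWithFallback, and the TranslateFn signature are hypothetical.

```typescript
type Provider = "google" | "deepl" | "claude" | "libretranslate";

// Hypothetical signature; each real service module may expose something different.
type TranslateFn = (text: string, source: string, target: string) => Promise<string>;

// Primary first, then the configured fallback, then LibreTranslate as last resort.
function buildProviderChain(primary: Provider, fallback?: Provider): Provider[] {
  const chain: Provider[] = [primary];
  if (fallback !== undefined && chain.indexOf(fallback) === -1) {
    chain.push(fallback);
  }
  if (chain.indexOf("libretranslate") === -1) {
    chain.push("libretranslate");
  }
  return chain;
}

// Try each provider in order; rethrow the last error if all of them fail.
async function translateWithFallback(
  text: string,
  source: string,
  target: string,
  chain: Provider[],
  providers: Record<Provider, TranslateFn>
): Promise<string> {
  let lastError: unknown = new Error("No translation providers configured");
  for (const name of chain) {
    try {
      return await providers[name](text, source, target);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Deduplicating the chain avoids calling LibreTranslate twice when it is already the primary or the configured fallback.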
Connection
Uses ioredis with automatic retry strategy. Falls back to in-memory/YAML defaults if Redis is unavailable.
Key Patterns
| Pattern | Example | Purpose |
| flag:<name> | flag:youtube_input | Feature flag boolean values |
| setting:<name> | setting:tts_settings | JSON settings objects |
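The two key patterns, together with the fallback to defaults when Redis is unavailable, might look like this in code. This is a sketch: the helper names are hypothetical, and the assumption that flags are stored as the strings "true" and "false" is not confirmed by this document. The stored argument stands for whatever a Redis GET returned, with null meaning the key is missing or Redis could not be reached.

```typescript
// Key builders following the patterns in the table above.
const flagKey = (name: string): string => `flag:${name}`;
const settingKey = (name: string): string => `setting:${name}`;

// Assumption: flags are stored as the strings "true" / "false".
function resolveFlag(stored: string | null, fallback: boolean): boolean {
  if (stored === null) {
    return fallback;
  }
  return stored === "true";
}

// Settings are JSON objects; malformed or missing values fall back to the default.
function resolveSetting<T>(stored: string | null, fallback: T): T {
  if (stored === null) {
    return fallback;
  }
  try {
    return JSON.parse(stored) as T;
  } catch {
    return fallback;
  }
}
```

Keeping the fallback explicit at each call site makes it clear which default applies when the store has no value.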
Key File
backend/src/services/redis.service.ts
Local Development
Use docker-compose.local.yml for Redis and LibreTranslate only (backend/frontend run natively):
docker compose -f docker-compose.local.yml up -d
Production
Use docker-compose.yml for all services:
docker compose up -d --build
Services
| Service | Image | Port | Notes |
| frontend | node:24-alpine + Nginx | 80 (exposed) | Serves React build, proxies API/WS to backend |
| backend | node:24-alpine | 3001 (internal) | Express + Socket.io server |
| redis | redis:7-alpine | 6379 (internal) | Feature flags and settings store |
| libretranslate | libretranslate/libretranslate | 5000 (internal) | Self-hosted translation engine |
Configuration
ELEVENLABS_API_KEY=sk-your-production-key
ADMIN_PASSWORD=strong-secure-password
FRONTEND_URL=https://translate.example.com
APP_ENV=prod
REDIS_PASSWORD=redis-secret
Deploy
docker compose up -d --build
Reverse Proxy
When running behind Nginx or another reverse proxy:
- Set LISTEN_PORT in .env (e.g., 8080)
- Proxy pass to localhost:8080
- Important: Ensure WebSocket upgrades are forwarded for the /socket.io/ path
server {
    listen 443 ssl;
    server_name translate.example.com;
    location / {
        proxy_pass http://localhost:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
Monitoring
# Check all services
docker compose ps
# View backend logs
docker compose logs -f backend
# Health check
curl http://localhost:3001/api/health