Ship live translations
with confidence
A production-ready full-stack Node.js + React application for seamless EN↔RU↔UK live auto-detect translation with voice synthesis.
⚙
Installation
Set up the project locally with Docker, Redis, and LibreTranslate in minutes.
▦
Architecture
Understand the STT → Translation → TTS pipeline and real-time Socket.io communication.
▶
Live Translation
Stream from YouTube or microphone with automatic EN/RU/UK language detection and voice output.
📜
Biblical Simulator
Test the full pipeline with AI-generated biblical passages in King James, Church Slavonic, or Ukrainian style.
🎤
Voice Training
Clone custom voices from microphone recordings or YouTube videos using ElevenLabs IVC.
Prerequisites
- Node.js 20+: Runtime for backend and build tools
- Docker + Docker Compose: For Redis and LibreTranslate services
- yt-dlp + ffmpeg: Required for YouTube audio extraction
- ElevenLabs API Key: For speech-to-text and text-to-speech
Clone & Configure
git clone https://github.com/Pzharyuk/live-translator-node.git && cd live-translator-node
cp .env.example .env
Edit .env and set your API key and admin password:
ELEVENLABS_API_KEY=sk-your-key-here
ADMIN_PASSWORD=your-secure-password
Start Infrastructure
# Start Redis + LibreTranslate
docker compose -f docker-compose.local.yml up -d
# Wait for LibreTranslate to download language models (~500 MB)
docker logs -f $(docker ps -qf "name=libretranslate") 2>&1 | grep -i "running"
Start Backend
cd backend
npm install
npm run dev # nodemon watches for changes
Start Frontend
cd frontend # from the repository root, in a second terminal
npm install
npm run dev # Vite hot-reload on localhost:5173
✓
You're all set!
Open http://localhost:5173 and log in with user / changeme; you will be redirected to /translate. The admin panel is at http://localhost:5173/admin (default admin password: admin123; if you changed the admin password in .env, use that value instead).
1. Start the services
Follow the Installation guide to get Docker services, backend, and frontend running.
2. Open the Admin Panel
Navigate to http://localhost:5173/admin and enter the admin password.
3. Select a Voice
Choose a TTS voice from the dropdown. The voice list is fetched from your ElevenLabs account.
4. Test with Text
Use the free-text area in the admin panel to type a phrase. Click translate to hear the TTS output instantly.
5. Go Live
Open the user view at http://localhost:5173/translate. Select "Mic" as input, pick a voice, and click Start. Speak into your microphone and watch real-time translation appear with audio playback.
💡
Try the Biblical Simulator
For a hands-free demo, enable the biblical_simulator feature flag in the admin panel, enter a Gemini API key, select a language, and click "Generate". The system will then run generated biblical passages through the full STT → Translation → TTS pipeline.
System Overview
Frontend
React 19 + Vite
Socket.io Client
Web Audio API
↔
Backend
Express + Socket.io
TypeScript
Service Layer
ElevenLabs
Scribe v2 (STT)
TTS Streaming
Voice Cloning
Translation
Google Translate (Cloud API)
LibreTranslate (self-hosted)
DeepL (premium API)
Claude / Anthropic (AI)
Redis
Feature Flags
Settings Store
Google Gemini
Biblical Simulator
Sermon Generation
Voice Training Text
DeepL
Free & Pro tiers
Auto endpoint detection
Data Flow
1. Audio Input: Mic / YouTube / Simulator
↓
2. PCM 16-bit LE @ 16kHz: via Socket.io chunks
↓
3. ElevenLabs Scribe v2: WebSocket STT
↓
4. Commit Merge Buffer: 2.5s VAD aggregation
↓
5. Translation Provider: Google / LibreTranslate / DeepL / Claude
↓
6. ElevenLabs TTS: Voice synthesis streaming
↓
7. Audio Playback: Queued with 600ms pause
Key Architecture Decisions
Two-layer Language Detection
LibreTranslate's /detect endpoint returns 0-confidence for short Cyrillic phrases. The app uses script-based pre-detection (Unicode 0x0400–0x04FF = Cyrillic) combined with ElevenLabs Scribe's language_code output for reliable EN/RU/UK auto-detection.
VAD Commit Merging
Voice Activity Detection can fire aggressively on speaker breathing. Commits are buffered for 2.5 seconds before translation to merge fragments into meaningful phrases.
Feature Flag Merging
YAML config defaults are merged with Redis runtime overrides. Redis values take priority, falling back to YAML if Redis is unavailable.
API Key Hierarchy
Keys resolve in order: Runtime Cache → Redis → Config File → Empty. This allows hot-swapping keys without restarts.
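The resolution order above can be sketched as a small lookup chain. This is only an illustration: `KeySources` and `resolveApiKey` are hypothetical names, and the real backend may structure the lookup differently.

```typescript
// Hypothetical shape: each layer may or may not hold a key.
interface KeySources {
  runtimeCache?: string | null;
  redis?: string | null;
  configFile?: string | null;
}

// Returns the first non-empty key in priority order
// (Runtime Cache, then Redis, then Config File), or "" if none is set.
function resolveApiKey(sources: KeySources): string {
  const ordered = [sources.runtimeCache, sources.redis, sources.configFile];
  for (const candidate of ordered) {
    if (typeof candidate === "string" && candidate.trim() !== "") {
      return candidate;
    }
  }
  return "";
}
```

Because the runtime cache is consulted first, writing a new key there takes effect on the next lookup, which is what makes hot-swapping possible.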
Connection Lifecycle
- Client sends start_session with source type (mic or youtube) and optional voiceId
- Backend opens a WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime
- For YouTube: spawns yt-dlp | ffmpeg child processes to extract PCM audio
- For Microphone: awaits audio_chunk events from the frontend
Audio Streaming
Audio chunks are sent to Scribe as JSON messages:
{
  "message_type": "input_audio_chunk",
  "audio_base_64": "UklGR..." // PCM 16-bit LE, 16kHz, mono
}
Scribe Responses
| Response Type | Meaning | Action |
| partial_transcript | Live partial text (speculative) | Emitted as non-final transcript event |
| committed_transcript | VAD fired, complete phrase | Buffered for commit merge window |
Commit Merge Buffer
After receiving a committed_transcript, the backend waits 2.5 seconds (COMMIT_MERGE_MS) to collect additional commits before translating. This prevents fragmented translations from aggressive VAD.
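A minimal sketch of such a merge window follows. It assumes the window opens at the first commit and closes COMMIT_MERGE_MS later; the source does not say whether later commits extend the window. The class name and the explicit `nowMs` arguments are hypothetical, chosen so the logic can be exercised without real timers.

```typescript
// Default taken from the text above.
const COMMIT_MERGE_MS = 2500;

class CommitMergeBuffer {
  private parts: string[] = [];
  private windowStartedAt: number | null = null;
  private readonly mergeMs: number;

  constructor(mergeMs: number = COMMIT_MERGE_MS) {
    this.mergeMs = mergeMs;
  }

  // Record one committed_transcript fragment; the first one opens the window.
  push(text: string, nowMs: number): void {
    const trimmed = text.trim();
    if (trimmed === "") return;
    if (this.windowStartedAt === null) this.windowStartedAt = nowMs;
    this.parts.push(trimmed);
  }

  // Returns the merged phrase once the window has elapsed, otherwise null.
  flushIfDue(nowMs: number): string | null {
    if (this.windowStartedAt === null) return null;
    if (nowMs - this.windowStartedAt < this.mergeMs) return null;
    const merged = this.parts.join(" ");
    this.parts = [];
    this.windowStartedAt = null;
    return merged;
  }
}
```

In a running server, `flushIfDue` would typically be driven by a timer, and its non-null result handed to the translation stage.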
Stability Timeout
If VAD stalls (no new commits), a 3.5 second fallback timer (STABILITY_TIMEOUT_MS) fires to translate whatever new text has accumulated, preventing indefinite silence.
Text Validation
Before translation, text is validated against EN/RU/UK character regex patterns. This filters out hallucinated text from the STT model (common with silence or background noise).
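The exact patterns are not shown here, so the following is only a sketch of what such a filter might look like. The character classes (Latin letters, the Cyrillic block U+0400 to U+04FF, digits, and common punctuation) and the function name are assumptions.

```typescript
// Assumed character classes; the application's real patterns may differ.
const ALLOWED_CHARS = /^[A-Za-z\u0400-\u04FF0-9\s.,!?;:'"()\u2019\u00AB\u00BB\u2026-]+$/;
const HAS_LETTER = /[A-Za-z\u0400-\u04FF]/;

// True when the text contains at least one EN/RU/UK letter and
// nothing outside the allowed set.
function isPlausibleTranscript(text: string): boolean {
  const trimmed = text.trim();
  return trimmed.length > 0 && HAS_LETTER.test(trimmed) && ALLOWED_CHARS.test(trimmed);
}
```

A transcript made only of punctuation, or one containing characters from another script, would be dropped before it reaches the translation provider.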
Provider Chain
The system supports multiple translation providers with automatic fallback:
Default
LibreTranslate
Self-hosted, no API key required. Runs in Docker alongside the app. Best for privacy and cost.
Premium
DeepL
High-quality translations. Supports both free and paid API tiers. Auto-detects endpoint.
AI
Claude
Anthropic's Claude for context-aware translations. Uses claude-haiku-4-5 for speed.
Fallback Logic
1. Try primary provider (admin-selected)
2. If primary fails → try configured fallback
3. If fallback fails → try LibreTranslate (last resort)
4. If all fail → emit error event
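The four steps above could be sketched as follows. `ProviderName`, `buildProviderChain`, and `translateWithFallback` are hypothetical names, and each provider is represented by an assumed async function rather than a real API client.

```typescript
type ProviderName = "google" | "libretranslate" | "deepl" | "claude";
type TranslateFn = (text: string) => Promise<string>;

// Ordered, de-duplicated chain: primary, configured fallback, then LibreTranslate.
function buildProviderChain(primary: ProviderName, fallback: ProviderName | null): ProviderName[] {
  const chain: ProviderName[] = [primary];
  if (fallback !== null && chain.indexOf(fallback) === -1) chain.push(fallback);
  if (chain.indexOf("libretranslate") === -1) chain.push("libretranslate");
  return chain;
}

// Try each provider in order; reject only if every one fails.
async function translateWithFallback(
  text: string,
  chain: ProviderName[],
  providers: Record<ProviderName, TranslateFn>,
): Promise<{ provider: ProviderName; text: string }> {
  const errors: string[] = [];
  for (const providerName of chain) {
    try {
      return { provider: providerName, text: await providers[providerName](text) };
    } catch (err) {
      errors.push(providerName + ": " + (err instanceof Error ? err.message : String(err)));
    }
  }
  throw new Error("All translation providers failed (" + errors.join("; ") + ")");
}
```

The final rejection is where the backend would emit its error event to the client.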
Language Detection
The app uses a two-layer auto-detection approach:
Layer 1: Script-based Pre-detection
Before calling any translation API, the backend checks Unicode character scripts:
- Cyrillic characters (Unicode 0x0400–0x04FF) → if >50% of matched letters are Cyrillic, detected as Russian
- Latin characters → detected as English
- This avoids low-confidence results from LibreTranslate's /detect endpoint on short text
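The script check can be sketched in a few lines. The function name and the "unknown" result for text without letters are assumptions; the 50% threshold and the Unicode range come from the list above.

```typescript
type ScriptGuess = "ru" | "en" | "unknown";

// Count Cyrillic (U+0400 to U+04FF) versus basic Latin letters and
// pick the majority script.
function detectByScript(text: string): ScriptGuess {
  let cyrillic = 0;
  let latin = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0x0400 && code <= 0x04ff) {
      cyrillic++;
    } else if ((code >= 0x41 && code <= 0x5a) || (code >= 0x61 && code <= 0x7a)) {
      latin++;
    }
  }
  const letters = cyrillic + latin;
  if (letters === 0) return "unknown"; // assumption: no letters, no guess
  return cyrillic / letters > 0.5 ? "ru" : "en";
}
```

Note that this layer cannot tell Russian from Ukrainian, since both fall in the same Unicode block; that distinction is left to Layer 2.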
Layer 2: STT Language Code
When the auto_language_detect flag is enabled, ElevenLabs Scribe returns a language_code with each transcript commit. The backend uses this to correctly route EN/RU/UK without relying solely on script detection.
Note: For LibreTranslate, both Russian and Ukrainian Cyrillic text is passed with source ru since LibreTranslate handles Ukrainian text acceptably via the Russian model. DeepL and Claude providers distinguish Ukrainian natively and handle uk as a proper source language.
Language Gating
Detected languages are checked against the admin-approved pool. If a detected language isn't in the allowed set, the translation is rejected to prevent hallucinated language outputs.
TTS Pipeline
After translation, the text is sent to ElevenLabs TTS:
const stream = await client.textToSpeech.stream(voiceId, {
  text: translatedText,
  model_id: "eleven_multilingual_v2",
  output_format: "mp3_44100_128",
  voice_settings: {
    stability: 0.5,
    similarity_boost: 0.75,
    style: 0.0,
    speed: 1.0,
    use_speaker_boost: true
  }
});
Audio Delivery
TTS audio is streamed to a Buffer, then emitted as a base64-encoded MP3 via the tts_audio Socket.io event.
Frontend Playback Queue
The frontend maintains an audio queue to prevent overlapping playback:
- Received tts_audio events are queued
- Each segment plays to completion before the next starts
- A configurable pause (600ms default) is inserted between segments
- The pause duration is controlled by tts_segment_pause_ms (adjustable in admin)
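A queue with these properties might look like the sketch below. The class name is hypothetical, and the `play` and `sleep` functions are injected so the ordering logic stays independent of the browser; in the real frontend `play` would wrap audio element playback. As a simplification, this version pauses after every segment rather than only between two queued ones.

```typescript
type PlayFn = (base64Mp3: string) => Promise<void>;
type SleepFn = (ms: number) => Promise<void>;

class TtsPlaybackQueue {
  private queue: string[] = [];
  private draining = false;
  private readonly play: PlayFn;
  private readonly sleep: SleepFn;
  pauseMs: number; // mirrors tts_segment_pause_ms

  constructor(play: PlayFn, sleep: SleepFn, pauseMs: number = 600) {
    this.play = play;
    this.sleep = sleep;
    this.pauseMs = pauseMs;
  }

  // Called for each tts_audio event; resolves when this call's drain loop ends.
  enqueue(base64Mp3: string): Promise<void> {
    this.queue.push(base64Mp3);
    return this.drain();
  }

  // Only one drain loop runs at a time, so segments never overlap.
  private async drain(): Promise<void> {
    if (this.draining) return;
    this.draining = true;
    try {
      while (this.queue.length > 0) {
        const next = this.queue.shift() as string;
        await this.play(next);
        await this.sleep(this.pauseMs);
      }
    } finally {
      this.draining = false;
    }
  }
}
```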
Microphone Input
- User selects "Mic" tab and chooses a TTS voice
- Browser captures audio via Web Audio API's ScriptProcessor
- PCM 16-bit LE at 16kHz sample rate sent to backend via Socket.io
- Backend pipes audio to ElevenLabs Scribe v2 Realtime WebSocket
- Language auto-detected (EN/RU/UK), text translated and synthesized
- TTS audio returned and played back with inter-segment pauses
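The capture step has to turn Web Audio's 32-bit float samples into 16-bit PCM at 16 kHz. The sketch below shows one way to do that; the function names are hypothetical, and the averaging downsampler is a deliberately simple stand-in for whatever resampling the frontend really uses.

```typescript
// Convert float samples in [-1, 1] to signed 16-bit PCM.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return out;
}

// Naive downsampling by averaging each window; assumes inputRate >= targetRate.
function downsample(input: Float32Array, inputRate: number, targetRate: number = 16000): Float32Array {
  if (inputRate === targetRate) return input;
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    out[i] = sum / (end - start);
  }
  return out;
}
```

In a ScriptProcessor callback the input would come from `event.inputBuffer.getChannelData(0)`. An `Int16Array` uses the platform's byte order, which is little-endian on common hardware; a `DataView` with `setInt16(offset, value, true)` makes the byte order explicit.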
YouTube Input
- User pastes a YouTube URL (live stream or video)
- Backend spawns yt-dlp | ffmpeg child processes
- Audio extracted as PCM stream (16kHz, 16-bit LE, mono)
- Piped to Scribe v2, same pipeline as microphone
- Stream ends when YouTube content ends or user stops
User Interface
The user view features a dark cavern theme with:
- Waveform visualizer — Canvas-based bar chart with orange gradient and cyan tips
- Transcript display — White translated text scrolls upward with fade masks
- Partial transcript — Shown in italic orange while STT is processing
- Source tabs — Toggle between Mic and YouTube (controlled by feature flags)
How It Works
The backend uses yt-dlp and ffmpeg as child processes to extract audio from YouTube URLs:
yt-dlp (best audio) → ffmpeg (PCM 16kHz 16-bit LE mono) → Scribe v2
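As a rough sketch, the two child processes could be given argument lists like these. The helper names are hypothetical and the backend's actual flags are not documented here; the flags shown are standard yt-dlp and ffmpeg options for streaming audio to stdout and emitting raw signed 16-bit little-endian PCM.

```typescript
// yt-dlp: select the best audio-only format and write it to stdout.
function buildYtDlpArgs(url: string): string[] {
  return ["-f", "bestaudio", "-o", "-", "--quiet", url];
}

// ffmpeg: read from stdin, emit raw s16le PCM at the given rate and channel count.
function buildFfmpegArgs(sampleRate: number = 16000, channels: number = 1): string[] {
  return [
    "-i", "pipe:0",
    "-f", "s16le",
    "-acodec", "pcm_s16le",
    "-ac", String(channels),
    "-ar", String(sampleRate),
    "pipe:1",
  ];
}
```

With Node's `child_process.spawn`, the stdout of the yt-dlp process would be piped into the stdin of the ffmpeg process, and the ffmpeg stdout chunks forwarded to Scribe.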
Supported Sources
- Live streams — Translates in real-time as the stream progresses
- Regular videos — Processes the full audio track
- Any URL supported by yt-dlp (YouTube, etc.)
Requirements
Both yt-dlp and ffmpeg must be installed and available in the system PATH. On macOS:
brew install yt-dlp ffmpeg
⚠
Feature Flag Required
YouTube input is controlled by the youtube_input feature flag. Enable it in the admin panel to show the YouTube tab in the user view.
Overview
The Biblical Transcript Simulator is an admin-only feature that generates biblical text passages using Google's Gemini API (gemini-2.5-flash), then routes them through the full translation pipeline. This provides a hands-free way to test STT → Translation → TTS without a live audio source.
Language Styles
| Language | Style | Example |
| en | King James English | "In the beginning was the Word..." |
| ru | Church Slavonic Russian | "В начале было Слово..." |
| uk | Traditional Ukrainian | "На початку було Слово..." |
Flow
- Admin selects language (EN/RU/UK)
- Backend calls Gemini 2.5 Flash with streaming
- Gemini generates 6-8 biblical passages, 3-5 sentences each
- Stream is buffered until 140+ characters AND complete sentences
- Chunks emitted with 1800ms smooth pacing between them
- Each chunk flows through the standard pipeline:
  - Emitted as transcript (isFinal: true)
  - Auto-translated via configured provider
  - TTS synthesized and audio returned
  - Frontend plays audio with standard inter-segment pause
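The buffering rule in the flow above (at least 140 characters, ending on a complete sentence) can be sketched as a small accumulator. `SentenceChunker` is a hypothetical name, and treating ".", "!" and "?" as the only sentence terminators is an assumption. The 1800ms pacing between emitted chunks is left to the caller.

```typescript
const MIN_CHUNK_CHARS = 140;

class SentenceChunker {
  private buffer = "";
  private readonly minChars: number;

  constructor(minChars: number = MIN_CHUNK_CHARS) {
    this.minChars = minChars;
  }

  // Feed one streamed text delta; returns zero or more ready chunks.
  feed(delta: string): string[] {
    this.buffer += delta;
    const chunks: string[] = [];
    for (;;) {
      const end = this.findBoundary();
      if (end === -1) break;
      chunks.push(this.buffer.slice(0, end + 1).trim());
      this.buffer = this.buffer.slice(end + 1);
    }
    return chunks;
  }

  // Index of the first terminator that yields a chunk of at least minChars.
  private findBoundary(): number {
    for (let i = Math.max(0, this.minChars - 1); i < this.buffer.length; i++) {
      const ch = this.buffer.charAt(i);
      if (ch === "." || ch === "!" || ch === "?") return i;
    }
    return -1;
  }

  // Emit whatever remains when the stream ends.
  flush(): string {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest;
  }
}
```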
💡
Feature Flag
Enable biblical_simulator in the admin feature flags panel. The Gemini API key is configured via the GEMINI_API_KEY environment variable or set at runtime in the admin API Keys panel.
Overview
Voice Training uses ElevenLabs' Instant Voice Cloning (IVC) API to create custom voices from audio samples. Once cloned, the voice appears in the voice selector immediately.
From Microphone
- Open the Voice Training section in the admin panel
- Click Generate Text to get an AI-generated reading passage (via Gemini) — gives the speaker natural, phonetically diverse text to read aloud
- Record multiple audio clips using your browser microphone while reading the generated text
- Provide a name for the voice
- Clips are uploaded to ElevenLabs IVC API
- Cloned voice is available for TTS immediately
- Click Preview Voice to hear the cloned voice speak a sample sentence via TTS
From YouTube
- Paste a YouTube URL in the Voice Training section
- Backend extracts N × 30-second clips via yt-dlp + ffmpeg
- Clips are uploaded to ElevenLabs IVC API
- Resulting voice is stored in your ElevenLabs account
⚠
ElevenLabs Account
Cloned voices are stored in your ElevenLabs account, not locally. Ensure your plan supports voice cloning.
Concepts
| Concept | Description |
| Active Language Pair | The current pair used for translation (e.g., EN ↔ RU, EN ↔ UK, or RU ↔ UK). Set by admin. |
| Available Languages | The pool of languages viewers can select from (if user_language_selector is enabled). |
Admin Controls
- Change the active language pair via the admin panel
- Changes broadcast to all connected clients in real-time
- Manage the available languages pool for viewer selection
Viewer Selection
When the user_language_selector feature flag is enabled, viewers can override the admin-set language pair by selecting their own preferred languages from the available pool.
Overview
Two people can video call each other through the app, each speaking their own language. The app transcribes, translates, and synthesizes speech in real-time so each participant hears the other in their language.
Feature flag: Video call is gated behind the video_translation flag. Enable it in the admin panel or set video_translation: true in your YAML config.
How It Works
- Create a room — Person A selects their language, picks a TTS voice, and clicks "Create Room". A 6-character room code is generated.
- Share the code — Person A shares the room code with Person B (copy button provided).
- Join the room — Person B enters the code, selects their language and TTS voice, and clicks "Join".
- WebRTC connection — The app establishes a peer-to-peer video connection via WebRTC (signaled through Socket.io). Video flows directly between browsers.
- Audio translation — Each participant's microphone audio is simultaneously:
  - Sent to the peer via WebRTC (but muted on their end)
  - Captured as PCM chunks and sent to the backend via Socket.io for STT
- Translation pipeline — Each participant has their own independent Scribe STT session. Transcribed text is translated to the other participant's language, then synthesized via ElevenLabs TTS and sent back to the peer.
- Playback — The peer hears the TTS translation instead of the raw audio. Translated transcript is displayed below the video.
Architecture
Person A (Browser) Server Person B (Browser)
├─ getUserMedia ├─ Socket.io ├─ getUserMedia
├─ WebRTC P2P ═══video═══►│ (signaling) ◄═══ ├─ WebRTC P2P
│ │ │
├─ PCM chunks ──Socket.io─►├─ ScribeA(STT) │
│ │ ↓ translate │
│ │ ↓ TTS ───────────►├─ Plays TTS
│ │ │
│ Plays TTS ◄─────────────├─ ScribeB(STT) ◄───├─ PCM chunks
│ (remote video muted) │ ↓ translate │ (remote video muted)
└──────────────────────────┴────────────────────┘
Socket Events
| Event | Direction | Purpose |
| video_create_room | C→S | Create a new room with language + voice |
| video_room_created | S→C | Returns the 6-char room code |
| video_join_room | C→S | Join an existing room |
| video_room_joined | S→C | Sent to both participants, triggers WebRTC |
| video_signal_offer/answer/ice | C↔S | WebRTC signaling relay |
| video_audio_chunk | C→S | PCM audio for STT processing |
| video_transcript | S→C | Transcript sent to the speaker |
| video_translation | S→C | Translation sent to the listener |
| video_tts_audio | S→C | TTS audio sent to the listener |
| video_leave_room | C→S | Leave the room |
| video_room_closed | S→C | Notify peer when other leaves |
Room Lifecycle
- Rooms are stored in Redis with key video_room:{code} and a 4-hour TTL
- Maximum 2 participants per room
- When one participant disconnects, the other is notified and the call ends
- Scribe sessions are automatically cleaned up on disconnect
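The room rules above translate into a few small helpers. Everything named here is hypothetical, and the code alphabet (uppercase letters and digits without look-alikes) is an assumption, since only the 6-character length is documented.

```typescript
const ROOM_CODE_LENGTH = 6;
const ROOM_TTL_SECONDS = 4 * 60 * 60; // 4-hour TTL from the list above
const MAX_PARTICIPANTS = 2;
// Assumed alphabet: no 0/O or 1/I so codes are easy to read aloud.
const ROOM_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

// The random source is injectable so the function can be tested deterministically.
function generateRoomCode(random: () => number = Math.random): string {
  let code = "";
  for (let i = 0; i < ROOM_CODE_LENGTH; i++) {
    code += ROOM_ALPHABET.charAt(Math.floor(random() * ROOM_ALPHABET.length));
  }
  return code;
}

// Redis key in the documented video_room:{code} format.
function roomKey(code: string): string {
  return "video_room:" + code;
}

function canJoin(participantCount: number): boolean {
  return participantCount < MAX_PARTICIPANTS;
}
```

Storing the room would then be a Redis `SET` on `roomKey(code)` with an `EX` of `ROOM_TTL_SECONDS`. `Math.random` is not cryptographically secure, so a production version might prefer a secure random source.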
The Mac Audio Agent has moved to its own public repository:
github.com/Pzharyuk/live-translator-agent
It is a lightweight Node.js daemon that runs as a macOS LaunchAgent and streams microphone audio to the live-translator backend via Socket.io — eliminating the need to open a browser for the Remote Audio Source role.
Pre-shared key authentication
Any socket that emits register_audio_source must present the server's pre-shared key in the Socket.IO handshake (auth.agentPsk). This stops random clients from connecting to the backend and impersonating an agent.
- Server: set AGENT_PSK (env var), which surfaces as auth.agent_psk in application.yaml. An empty value disables enforcement and logs a warning on every registration.
- Mac daemon: add agentPsk to ~/.config/live-translator-agent/config.json (or set the AGENT_PSK env var; the env var wins).
- Browser /audio-source: paste the key into the new Agent Pre-Shared Key field; it is stored in localStorage on that device only and travels in the handshake (never in event payloads).
- Mismatch behaviour: server logs register_audio_source REJECTED ... invalid or missing PSK, emits agent_auth_error to the client, then disconnects.
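The server-side decision can be sketched as a pure function over the handshake's auth object. The names `checkAgentPsk`, `safeEqual`, and `PskResult` are hypothetical; the hand-written constant-time comparison is used only to keep the sketch dependency-free, and a Node backend could use `crypto.timingSafeEqual` instead.

```typescript
type PskResult = "ok" | "rejected" | "enforcement_disabled";

// Compare without exiting early, so timing does not reveal the first mismatch.
function safeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

// handshakeAuth stands for the auth payload sent in the Socket.IO handshake.
function checkAgentPsk(serverPsk: string, handshakeAuth: { agentPsk?: unknown }): PskResult {
  if (serverPsk === "") return "enforcement_disabled"; // server would log a warning
  const presented = handshakeAuth.agentPsk;
  if (typeof presented !== "string") return "rejected";
  return safeEqual(serverPsk, presented) ? "ok" : "rejected";
}
```

On "rejected" the server would emit agent_auth_error and disconnect the socket, as described above.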
Feature Flags
Feature flags control which UI routes and functionality are exposed to users. They are stored in Redis with YAML defaults as fallback. Changes to flags broadcast immediately to all connected clients via Socket.IO, allowing live toggling without server restart.
| Flag | Default | Description |
| youtube_input | true | Enable YouTube broadcast source (admin can stream audio from a YouTube live channel). |
| mic_input | true | Enable microphone input from admin's browser for live broadcast. |
| auto_language_detect | true | Automatically detect source language before translation; users cannot override. |
| user_language_selector | false | Allow viewers to select their own language pair from available languages pool. |
| audio_device_selector | true | Show audio device selector in the admin panel (microphone/speaker selection). |
| video_translation | true | Enable live video call translation (real-time subtitle sync during peer video calls). |
| video_voice_cloning | false | Premium feature: show Clone Voice button in video lobby for instant voice training from recordings. |
| remote_audio_source | false | Enable /audio-source route for headless remote audio relay (native agents, browser-based audio streaming). |
| agent_audio_source | false | Show connected agent audio sources section in admin panel (multi-agent management, active agent selection). |
| broadcast | false | Enable /broadcast route: public receiver page for live-streamed translated content. |
| translate | false | Enable /translate route: live translator page for personal transcription & translation sessions. |
Storage & Broadcasting
Feature flags are persisted in Redis under the flag:{flagName} key with a string value ('true' or 'false'). On startup, flags are initialized from the YAML feature_flags section; any runtime changes via POST /admin/flags/:flag overwrite the Redis value and broadcast to all connected Socket.IO clients immediately, so UI switches take effect without page reload.
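The merge of YAML defaults with Redis overrides might be sketched like this. `mergeFlags` is a hypothetical name; the sketch assumes Redis values arrive as the strings 'true' and 'false', that `null` stands for an unreachable Redis, and that flags unknown to the YAML defaults are ignored.

```typescript
type Flags = Record<string, boolean>;

// redisValues maps flag name to 'true' / 'false'; pass null when Redis is unavailable.
function mergeFlags(yamlDefaults: Flags, redisValues: Record<string, string> | null): Flags {
  const merged: Flags = {};
  for (const flagName of Object.keys(yamlDefaults)) {
    const override = redisValues !== null ? redisValues[flagName] : undefined;
    if (override === "true") {
      merged[flagName] = true;
    } else if (override === "false") {
      merged[flagName] = false;
    } else {
      merged[flagName] = yamlDefaults[flagName]; // no valid override: keep YAML default
    }
  }
  return merged;
}
```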
Admin API
GET /admin/flags
→ { "flags": { "youtube_input": true, "mic_input": true, … } }
POST /admin/flags/:flag
→ { "flag": "youtube_input", "value": false }
GET /admin/flags/:flag
→ { "flag": "youtube_input", "value": true }
File Structure
| File | Purpose |
| config/application.yaml | Base defaults for all environments |
| config/application-local.yaml | Local development overrides (localhost URLs) |
| config/application-prod.yaml | Production overrides (Docker service names) |
The APP_ENV environment variable (local or prod) determines which overlay file is loaded on top of the base config.
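One plausible way to apply an overlay is a recursive merge in which overlay values win. Whether the application merges deeply or only at the top level is not stated, so treat the sketch below, including its names, as an assumption.

```typescript
type ConfigValue = string | number | boolean | null | ConfigValue[] | { [key: string]: ConfigValue };
type ConfigObject = { [key: string]: ConfigValue };

function isConfigObject(v: ConfigValue | undefined): v is ConfigObject {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Overlay values replace base values; nested objects are merged key by key.
function applyOverlay(base: ConfigObject, overlay: ConfigObject): ConfigObject {
  const out: ConfigObject = {};
  for (const key of Object.keys(base)) out[key] = base[key];
  for (const key of Object.keys(overlay)) {
    const b = out[key];
    const o = overlay[key];
    out[key] = isConfigObject(b) && isConfigObject(o) ? applyOverlay(b, o) : o;
  }
  return out;
}

// APP_ENV selects the overlay file; anything other than "prod" falls back to local.
function overlayFileFor(appEnv: string | undefined): string {
  return "config/application-" + (appEnv === "prod" ? "prod" : "local") + ".yaml";
}
```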
Full Configuration Reference
server:
  port: 3001
  cors_origin: "http://localhost:5173"
elevenlabs:
  api_key: "${ELEVENLABS_API_KEY}"
  default_voice_id: "kxj9qk6u5PfI0ITgJwO0"
  tts_model: "eleven_multilingual_v2"
  tts_settings:
    stability: 0.5
    similarity_boost: 0.75
    style: 0.0
    speed: 1.0
    use_speaker_boost: true
  stt_model: "scribe_v2"
anthropic:
  api_key: "${ANTHROPIC_API_KEY}"
deepl:
  api_key: "${DEEPL_API_KEY}"
libretranslate:
  url: "http://libretranslate:5000"
  api_key: ""
redis:
  host: "redis"
  port: 6379
  password: ""
feature_flags:
  youtube_input: true
  mic_input: true
  auto_language_detect: true
  user_language_selector: false
  audio_device_selector: true
  video_translation: false
  video_voice_cloning: false
  broadcast: false
audio:
  sample_rate: 16000
  channels: 1
  chunk_duration_ms: 250
translation:
  source_lang: "auto"
  target_lang_en: "en"
  target_lang_ru: "ru"
  provider: "libretranslate"
  fallback: "libretranslate"
Environment Variable Interpolation
YAML values using ${VAR_NAME} syntax are automatically replaced with the corresponding environment variable at startup.
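The substitution itself can be sketched as a single regular-expression replace. `interpolateEnv` is a hypothetical name, and replacing an unset variable with an empty string is an assumption about the loader's behaviour.

```typescript
// Replace every ${VAR_NAME} occurrence with the matching environment value.
function interpolateEnv(value: string, env: Record<string, string | undefined>): string {
  return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match: string, varName: string) => {
    const resolved = env[varName];
    return resolved !== undefined ? resolved : ""; // assumption: unset becomes empty
  });
}
```

In a Node process the second argument would typically be `process.env`, applied to each string value after the YAML file is parsed.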
| Variable | Required | Default | Description |
| ELEVENLABS_API_KEY | Yes | — | ElevenLabs API key for text-to-speech & speech-to-text services. |
| ELEVENLABS_VOICE_ID | No | JBFqnCBsd6RMkjVDRZzb | Default voice ID for TTS synthesis (overridable in config.elevenlabs.default_voice_id). |
| ANTHROPIC_API_KEY | No | — | Anthropic API key for sermon generation & Claude translation provider (configurable in Admin UI). |
| GEMINI_API_KEY | No | — | Google Gemini API key for biblical simulator & sermon generation. |
| GOOGLE_TRANSLATE_API_KEY | No | — | Google Cloud Translation API key (default translation provider). |
| DEEPL_API_KEY | No | — | DeepL API key for translation provider fallback (free tier keys end with :fx). |
| YOUTUBE_API_KEY | No | — | YouTube Data API v3 key for live stream lookup & discovery. |
| YOUTUBE_CHANNEL_ID | No | — | Default YouTube channel ID to search for live streams (format: UC...). |
| APP_ENV | No | local | Application environment (local for development, prod for Docker). |
| FRONTEND_URL | No | http://localhost | Frontend URL used for CORS origin (set to your domain in production). |
| LISTEN_PORT | No | 80 | Host port the frontend listens on. |
| REDIS_PASSWORD | No | — | Redis authentication password (leave empty for no auth). |
| LIBRETRANSLATE_API_KEY | No | — | Optional API key if your LibreTranslate instance requires authentication. |
| ADMIN_PASSWORD | Yes | admin123 | Legacy socket authentication password; change in production. |
| APP_ADMIN_USERNAME | No | admin | Admin user seeded into the database on first boot. |
| APP_ADMIN_PASSWORD | No | admin123 | Initial admin password; change in production. User must reset on first login if force_password_change=true. |
| APP_USERNAME | No | user | Default user-facing login username. |
| APP_PASSWORD | No | changeme | Default user-facing login password; change in production. |
| JWT_SECRET | Yes | — | JWT secret for session cookies; generate a strong random string (e.g., openssl rand -hex 32). |
| COOKIE_SECURE | No | true | Enable secure cookies flag; set to true when serving over HTTPS. |
| AGENT_PSK | No | — | Pre-shared key for native agents registering audio sources; leave empty to disable enforcement. |
| DB_PASSWORD | Yes | — | PostgreSQL database password. |
| GOOGLE_CLIENT_ID | No | — | Google OAuth client ID (login UI shows button only when set). |
| GOOGLE_CLIENT_SECRET | No | — | Google OAuth client secret. |
| APPLE_CLIENT_ID | No | — | Apple Sign In Services ID identifier (login UI shows button only when set). |
| APPLE_TEAM_ID | No | — | Apple developer account 10-character team identifier. |
| APPLE_KEY_ID | No | — | Apple Sign In Key 10-character Key ID. |
| APPLE_PRIVATE_KEY | No | — | Apple private key file contents (PEM-encoded .p8 with -----BEGIN/END markers). |
| OIDC_ISSUER | No | — | Authentik OIDC issuer URL (legacy, being phased out); leave empty to disable. |
| OIDC_CLIENT_ID | No | — | Authentik OIDC client ID. |
| OIDC_CLIENT_SECRET | No | — | Authentik OIDC client secret. |
| elevenlabs.default_voice_id | No | kxj9qk6u5PfI0ITgJwO0 | Default ElevenLabs voice ID in application.yaml (config setting). |
| elevenlabs.tts_model | No | eleven_multilingual_v2 | ElevenLabs TTS model name. |
| elevenlabs.stt_model | No | scribe_v2_realtime | ElevenLabs speech-to-text model (scribe_v2_realtime for live streaming). |
| elevenlabs.tts_settings.stability | No | 0.5 | TTS voice stability parameter (0–1). |
| elevenlabs.tts_settings.similarity_boost | No | 0.75 | TTS similarity boost parameter (0–1). |
| elevenlabs.tts_settings.style | No | 0.0 | TTS voice style parameter (0–1). |
| elevenlabs.tts_settings.speed | No | 1.0 | TTS playback speed multiplier. |
| elevenlabs.tts_settings.use_speaker_boost | No | true | Enable TTS speaker boost for clarity. |
| server.port | No | 3001 | Backend server port. |
| server.cors_origin | No | http://localhost:5173 | CORS origin for frontend requests. |
| database.host | No | postgres | PostgreSQL hostname. |
| database.port | No | 5432 | PostgreSQL port. |
| database.username | No | translator | PostgreSQL username. |
| database.database | No | translator_db | PostgreSQL database name. |
| database.pool_size | No | 10 | PostgreSQL connection pool size. |
| redis.host | No | redis | Redis hostname. |
| redis.port | No | 6379 | Redis port. |
| auth.admin_username | No | admin | Legacy admin username. |
| auth.admin_password | No | admin123 | Legacy admin password; change in production. |
| auth.session_days | No | 30 | JWT session cookie expiration in days. |
| libretranslate.url | No | http://libretranslate:5000 | LibreTranslate service URL. |
| audio.sample_rate | No | 16000 | Audio sample rate in Hz (16kHz for Scribe). |
| audio.channels | No | 1 | Audio channel count (mono). |
| audio.chunk_duration_ms | No | 250 | Audio chunk duration in milliseconds. |
| translation.source_lang | No | auto | Default source language (auto for auto-detection). |
| translation.target_lang_en | No | en | Target language code for English output. |
| translation.target_lang_ru | No | ru | Target language code for Russian output. |
| translation.provider | No | google | Primary translation provider (google, deepl, claude, or libretranslate). |
| translation.fallback | No | libretranslate | Fallback translation provider when primary fails (or none to disable). |
| translation.translate_workers | No | 2 | Number of parallel translation workers in stage 1 of TTS pipeline. |
| translation.request_timeout_ms | No | 5000 | Per-provider translation request timeout (ms); on timeout, the fallback chain is tried. |
| tts_pipeline.initial_buffer_segments | No | 1 | Number of translated segments to buffer before starting TTS playback. |
| tts_pipeline.low_water_hold_ms | No | 1500 | Time (ms) to hold audio emission until segment N+1 arrives; set to 0 to disable. |
| feature_flags.youtube_input | No | true | Enable YouTube broadcast source in UI. |
| feature_flags.mic_input | No | true | Enable microphone input in UI. |
| feature_flags.auto_language_detect | No | true | Enable automatic language detection. |
| feature_flags.user_language_selector | No | false | Allow viewers to select target language. |
| feature_flags.audio_device_selector | No | true | Show audio device selector in UI. |
| feature_flags.video_translation | No | true | Enable video call translation feature. |
| feature_flags.video_voice_cloning | No | false | Premium feature: show Clone Voice button in video lobby. |
| feature_flags.remote_audio_source | No | false | Enable /audio-source headless remote audio relay route. |
| feature_flags.agent_audio_source | No | false | Show connected agent audio sources section in admin panel. |
| feature_flags.broadcast | No | false | Enable /broadcast public receiver page. |
| feature_flags.translate | No | false | Enable /translate live translator page. |
TTS Settings
API Endpoints
GET /admin/tts-settings
Retrieve current TTS settings.
curl -X GET http://localhost:3001/admin/tts-settings \
  -H "Cookie: auth=<jwt_token>"
POST /admin/tts-settings
Update one or more TTS settings. All fields are optional — only provided fields are changed.
curl -X POST http://localhost:3001/admin/tts-settings \
  -H "Cookie: auth=<jwt_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "stability": 0.6,
    "similarity_boost": 0.8,
    "speed": 1.1
  }'
Configuration Settings
All TTS parameters are configurable at runtime via the admin API. Settings persist in Redis and override the application.yaml defaults on restart.
| Setting | Range | Default | Description |
| stability | 0.0 – 1.0 | 0.5 | Voice consistency: higher values reduce variation between phonemes (0.0 = high variability, 1.0 = monotone). |
| similarity_boost | 0.0 – 1.0 | 0.75 | Accent preservation: higher values make the output more recognizable as the original voice. |
| style | 0.0 – 1.0 | 0.0 | Exaggeration level: 0.0 = neutral, 1.0 = highly expressive and theatrical. |
| speed | 0.5 – 2.0 | 1.0 | Playback rate multiplier: 0.5 = half speed, 2.0 = double speed. |
| use_speaker_boost | boolean | true | Enable speaker boost: improves audio quality at the cost of slightly higher latency. |
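A partial update with range checks, using the ranges from the table above, might be sketched as follows. The names are hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumption about how the endpoint behaves.

```typescript
interface TtsSettings {
  stability: number;
  similarity_boost: number;
  style: number;
  speed: number;
  use_speaker_boost: boolean;
}

type NumericKey = "stability" | "similarity_boost" | "style" | "speed";

const TTS_RANGES: Record<NumericKey, [number, number]> = {
  stability: [0, 1],
  similarity_boost: [0, 1],
  style: [0, 1],
  speed: [0.5, 2],
};
const NUMERIC_KEYS: NumericKey[] = ["stability", "similarity_boost", "style", "speed"];

// Apply only the provided fields; throw if a numeric field is out of range.
function applyTtsUpdate(current: TtsSettings, patch: Partial<TtsSettings>): TtsSettings {
  const next: TtsSettings = { ...current };
  for (const key of NUMERIC_KEYS) {
    const value = patch[key];
    if (value === undefined) continue;
    const [min, max] = TTS_RANGES[key];
    if (typeof value !== "number" || isNaN(value) || value < min || value > max) {
      throw new RangeError(key + " must be between " + min + " and " + max);
    }
    next[key] = value;
  }
  if (patch.use_speaker_boost !== undefined) next.use_speaker_boost = patch.use_speaker_boost;
  return next;
}
```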
Voice Selection
ElevenLabs voices are fetched dynamically and can be filtered to an admin-approved list:
GET /admin/voices
List all available voices from ElevenLabs.
GET /admin/available-voices
Get the admin-filtered list of allowed voice IDs (null = all voices permitted).
POST /admin/available-voices
Set the allowed voice list. Viewers can only select from these voices.
curl -X POST http://localhost:3001/admin/available-voices \
  -H "Cookie: auth=<jwt_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "voiceIds": [
      "kxj9qk6u5PfI0ITgJwO0",
      "JBFqnCBsd6RMkjVDRZzb",
      "26Z6DtHZV1L8xuNVXj2H"
    ]
  }'
Default Voice
| Configuration | Value | Source |
| default_voice_id | kxj9qk6u5PfI0ITgJwO0 | application.yaml or .env |
| tts_model | eleven_multilingual_v2 | application.yaml |
TTS Preview Endpoint
Test TTS output before broadcasting:
POST /admin/tts-preview
Generate and stream audio for a given text snippet using the current TTS settings.
curl -X POST http://localhost:3001/admin/tts-preview \
  -H "Cookie: auth=<jwt_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a test message.",
    "voiceId": "kxj9qk6u5PfI0ITgJwO0",
    "format": "mp3"
  }' \
  --output preview.mp3
Request Body Fields:
text (required): Text to synthesize
voiceId (optional): Voice ID; defaults to default_voice_id
format (optional): mp3 (default) or pcm (16 kHz linear PCM)
Response: Audio stream with Content-Type: audio/mpeg or audio/L16; rate=16000; channels=1
Related Settings
The following settings affect TTS behavior and are configured separately:
| Setting Endpoint | Purpose |
| /admin/stt-timing | Speech recognition timing: affects when transcripts are sent for translation & TTS. |
| /admin/video-settings | Video call TTS/STT tuning, separate from live broadcast settings. |
| /admin/languages | Active language pair for translation (e.g., ['en', 'ru']). |
STT Timing Configuration
Speech-to-Text (STT) timing settings control how the Scribe voice recognition engine buffers, commits, and dispatches audio transcripts for translation. These parameters directly affect latency, transcript accuracy, and the smoothness of real-time translation output.
Settings Table
| Setting | Default | Description |
| commit_merge_ms | 2500 | Time in milliseconds to buffer VAD-committed transcript fragments before merging and translating (higher → fewer, longer chunks). |
| stability_timeout_ms | 2000 | Time in milliseconds to wait for partial transcript text to remain unchanged before dispatching for translation (fallback when VAD is slow). |
| tts_segment_pause_ms | 0 | Pause in milliseconds between consecutive TTS audio segments played on the frontend (viewer-facing only, emitted via socket). |
| max_accumulation_ms | 8000 | Maximum time in milliseconds to accumulate words during continuous speech before force-dispatching for translation (prevents long delays during non-stop speaking). |
| vad_threshold | 0.5 | Voice Activity Detection noise filter strength, 0–1 (higher → stricter, fewer false positives; lower → more sensitive to quiet speech). |
| vad_silence_threshold_secs | 1.5 | Duration of silence in seconds before VAD commits the current transcript fragment (higher → longer pauses required to trigger a commit). |
| min_speech_duration_ms | 100 | Minimum duration in milliseconds of audio to recognize as speech (filters out clicks, pops, brief noise). |
| min_silence_duration_ms | 100 | Minimum gap in milliseconds between words to count as silence (helps VAD distinguish speech pauses from natural speech rhythm). |
| flush_on_sentence_boundary | true | When enabled, splits dispatch at sentence-ending punctuation (.?!;) to avoid breaking sentences across translation chunks. |
| min_chars_before_dispatch | 40 | Minimum character count before a transcript fragment is dispatched for translation (prevents tiny fragments that waste API calls). |
Tuning Guide
- For low-latency response: Decrease commit_merge_ms (e.g., 500–1000) and max_accumulation_ms (e.g., 3000–5000). Trade-off: more frequent, smaller translation chunks may feel choppy.
- For natural chunking: Increase commit_merge_ms (e.g., 3000–5000) and max_accumulation_ms (e.g., 8000–12000) to merge short pauses into complete sentences. Trade-off: slightly higher total latency, smoother TTS playback.
- For continuous speech (sermons, lectures): The max_accumulation_ms timer ensures dispatch every N seconds even when the speaker never pauses. Adjust to match your expected speaking cadence.
- For noisy environments: Increase vad_threshold (e.g., 0.6–0.8) to filter background noise and air conditioning hum. Cost: may miss quiet words.
- For quiet speakers: Decrease vad_threshold (e.g., 0.2–0.4) and min_speech_duration_ms (e.g., 50–100). Cost: may pick up room noise.
- For sentence-perfect output: Enable flush_on_sentence_boundary so each translation request contains a complete sentence. Disable only if your language rarely uses punctuation.
- To avoid tiny translation fragments: Increase min_chars_before_dispatch (e.g., 80–120). Cost: slightly longer wait for the first chunk to be translated.
API Endpoints
GET /admin/stt-timing
Retrieve current STT timing settings.
{
"settings": {
"commit_merge_ms": 2500,
"stability_timeout_ms": 2000,
"tts_segment_pause_ms": 0,
"max_accumulation_ms": 8000,
"vad_threshold": 0.5,
"vad_silence_threshold_secs": 1.5,
"min_speech_duration_ms": 100,
"min_silence_duration_ms": 100,
"flush_on_sentence_boundary": true,
"min_chars_before_dispatch": 40
}
}
POST /admin/stt-timing
Update one or more STT timing settings. Partial updates are supported — omitted fields retain their current values.
{
"commit_merge_ms": 1500,
"max_accumulation_ms": 5000,
"vad_threshold": 0.6,
"flush_on_sentence_boundary": true
}
Response: Returns the complete updated settings object (same schema as GET).
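The partial-update behavior can be sketched as a merge over the current settings. This is a minimal illustration, not the project's actual handler: the SttTimingSettings interface mirrors the settings table, while the applySttTimingUpdate helper and the clamping of vad_threshold to the 0–1 range are assumptions made for the example.

```typescript
// Shape of the STT timing settings, mirroring the Settings Table.
interface SttTimingSettings {
  commit_merge_ms: number;
  stability_timeout_ms: number;
  tts_segment_pause_ms: number;
  max_accumulation_ms: number;
  vad_threshold: number;
  vad_silence_threshold_secs: number;
  min_speech_duration_ms: number;
  min_silence_duration_ms: number;
  flush_on_sentence_boundary: boolean;
  min_chars_before_dispatch: number;
}

// Defaults as documented in the Settings Table.
const DEFAULT_STT_TIMING: SttTimingSettings = {
  commit_merge_ms: 2500,
  stability_timeout_ms: 2000,
  tts_segment_pause_ms: 0,
  max_accumulation_ms: 8000,
  vad_threshold: 0.5,
  vad_silence_threshold_secs: 1.5,
  min_speech_duration_ms: 100,
  min_silence_duration_ms: 100,
  flush_on_sentence_boundary: true,
  min_chars_before_dispatch: 40,
};

// Hypothetical helper: merge a partial POST body into the current settings.
// Fields omitted from the patch keep their current values.
function applySttTimingUpdate(
  current: SttTimingSettings,
  patch: Partial<SttTimingSettings>,
): SttTimingSettings {
  const next: SttTimingSettings = { ...current, ...patch };
  // Assumed guard (not confirmed by the source): keep vad_threshold within 0–1.
  next.vad_threshold = Math.min(1, Math.max(0, next.vad_threshold));
  return next;
}
```

Because the helper returns a new object, the previous settings stay available for sessions that were created before the update.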
Runtime Behavior
- Settings are persisted to Redis and loaded on server startup, surviving restarts.
- Changes via POST /admin/stt-timing apply immediately to all new Scribe sessions (ongoing sessions use the values active at session creation).
- Frontend viewers receive tts_segment_pause_ms via the stt_timing socket event and use it to space out TTS audio playback.
- Scribe VAD parameters (vad_threshold, vad_silence_threshold_secs, min_speech_duration_ms, min_silence_duration_ms) are sent to ElevenLabs as WebSocket query parameters on every session creation.
- The accumulation timer, stability timer, and sentence-boundary detector all run locally on the backend and depend on max_accumulation_ms, stability_timeout_ms, and flush_on_sentence_boundary respectively.
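As a rough sketch of how the VAD settings could be attached to a Scribe session URL: the endpoint below is the realtime STT URL this documentation names, but the query parameter names are assumed to match the setting names one-to-one, and buildScribeUrl is a hypothetical helper. Check the ElevenLabs realtime API reference for the exact parameter names before relying on this.

```typescript
// The four VAD-related settings that are forwarded to Scribe.
interface VadParams {
  vad_threshold: number;
  vad_silence_threshold_secs: number;
  min_speech_duration_ms: number;
  min_silence_duration_ms: number;
}

const SCRIBE_REALTIME_URL = "wss://api.elevenlabs.io/v1/speech-to-text/realtime";

// Hypothetical helper: append the VAD settings as query parameters.
// Assumption: parameter names are identical to the setting names.
function buildScribeUrl(vad: VadParams): string {
  const pairs: Array<[string, number]> = [
    ["vad_threshold", vad.vad_threshold],
    ["vad_silence_threshold_secs", vad.vad_silence_threshold_secs],
    ["min_speech_duration_ms", vad.min_speech_duration_ms],
    ["min_silence_duration_ms", vad.min_silence_duration_ms],
  ];
  const query = pairs
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`)
    .join("&");
  return `${SCRIBE_REALTIME_URL}?${query}`;
}
```

Since the URL is built once per session, a settings change only takes effect on the next session, which matches the behavior described for ongoing sessions.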
Interaction with Translation Pipeline
STT timing settings feed into the broader translation & TTS pipeline:
- Scribe emits partial_transcript and committed_transcript events over WebSocket.
- Backend applies stability, accumulation, and sentence-boundary logic to decide when to dispatch a chunk for translation.
- Dispatched text is queued to Redis Streams broadcast:pending and processed by parallel translate workers.
- Translated results flow to broadcast:translated and are picked up by the TTS synth worker, which emits both transcript text and audio.
- tts_segment_pause_ms is sent to viewers so they can space TTS playback, preventing audio gaps or overlaps.
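The dispatch decision in the second step can be pictured as a small pure function. The sketch below is illustrative only: the PendingText shape, the decideDispatch name, and especially the order in which the rules are checked are assumptions, since the source describes the timers but not their precedence.

```typescript
interface DispatchSettings {
  stability_timeout_ms: number;
  max_accumulation_ms: number;
  flush_on_sentence_boundary: boolean;
  min_chars_before_dispatch: number;
}

// Hypothetical bookkeeping for text that has not been dispatched yet.
interface PendingText {
  text: string;         // accumulated transcript text
  firstWordAt: number;  // ms timestamp when accumulation began
  lastChangeAt: number; // ms timestamp of the last partial update
}

type DispatchReason = "accumulation" | "sentence" | "stability" | null;

// Returns why the pending text should be dispatched now, or null to keep waiting.
// The rule ordering is an assumption made for this sketch.
function decideDispatch(p: PendingText, now: number, s: DispatchSettings): DispatchReason {
  const text = p.text.trim();
  if (text.length === 0) return null;
  // Force a dispatch during non-stop speech, regardless of length.
  if (now - p.firstWordAt >= s.max_accumulation_ms) return "accumulation";
  // Too short to be worth a translation call.
  if (text.length < s.min_chars_before_dispatch) return null;
  // Flush at sentence-ending punctuation (.?!;) when enabled.
  if (s.flush_on_sentence_boundary && /[.?!;]$/.test(text)) return "sentence";
  // Fallback: the partial text has stopped changing.
  if (now - p.lastChangeAt >= s.stability_timeout_ms) return "stability";
  return null;
}
```

Keeping the decision free of timers and I/O makes each tuning parameter easy to exercise in isolation with synthetic timestamps.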
Authentication: JWT cookie-based. Admin access requires either is_admin=true or permissions array with at least one permission. Some endpoints require specific permissions (noted below).
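The access rule above can be expressed as two small checks. This is a sketch under stated assumptions: the AuthUser shape is inferred from the field names mentioned here, and treating full admins as holding every permission is an assumption rather than documented behavior.

```typescript
// Assumed shape of the user claims carried by the JWT cookie.
interface AuthUser {
  is_admin: boolean;
  permissions: string[];
}

// Admin access: is_admin=true, or at least one permission.
function hasAdminAccess(user: AuthUser): boolean {
  return user.is_admin || user.permissions.length > 0;
}

// Endpoint-level check, e.g. for routes that require "user_management".
// Assumption: full admins implicitly hold every permission.
function hasPermission(user: AuthUser, permission: string): boolean {
  return user.is_admin || user.permissions.indexOf(permission) !== -1;
}
```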
API Keys Management
Retrieve status of all configured API keys (elevenlabs, anthropic, deepl, libretranslate, google, youtube).
Update one or more API keys by name.
Body: { elevenlabs?: string, anthropic?: string, deepl?: string, libretranslate?: string, google?: string, youtube?: string }
Voice Management
Scan and list all available ElevenLabs voices, noting any new voices not yet in the allowed list.
Get the list of voice IDs allowed for viewer/admin selection (null = all allowed).
Set the restricted list of voice IDs viewers can pick from; broadcasts to all clients.
Body: { voiceIds: string[] }
TTS & STT Settings
Get current TTS voice settings (stability, similarity_boost, style, speed, use_speaker_boost).
Update TTS voice settings.
Body: { stability?: number, similarity_boost?: number, style?: number, speed?: number, use_speaker_boost?: boolean }
Get STT timing settings (VAD thresholds, silence duration, min/max accumulation, dispatch thresholds).
Update STT timing settings for Scribe session control.
Body: { commit_merge_ms?: number, stability_timeout_ms?: number, tts_segment_pause_ms?: number, max_accumulation_ms?: number, vad_threshold?: number, vad_silence_threshold_secs?: number, min_speech_duration_ms?: number, min_silence_duration_ms?: number, flush_on_sentence_boundary?: boolean, min_chars_before_dispatch?: number }
Get video call STT/TTS settings (stability, commit merge delay, provider).
Update video call settings.
Body: { stability_ms?: number, commit_merge_ms?: number, translation_provider?: 'libretranslate' | 'claude' | 'deepl' | 'google' }
Languages
Get the currently active language pair (source → target).
Set the active language pair; broadcasts to all connected clients.
Body: { languages: [string, string] } // exactly 2 language codes
Get the pool of languages available for viewers to select from.
Set the language pool; broadcasts both pool & updated active pair to all clients.
Body: { languages: string[] }
Translation Provider
Get the currently active translation provider & list of available options.
Set the active translation provider (google, deepl, claude, or libretranslate).
Body: { provider: 'google' | 'deepl' | 'claude' | 'libretranslate' }
Get the currently selected Claude translation model & list of available models.
Set the Claude model for translation (when provider = claude).
Body: { model: string } // must be a valid Claude model ID
Audio Device
Get the admin-selected audio input device that overrides viewer selection.
Set the admin-enforced audio device; broadcasts to all connected clients.
Body: { deviceId?: string, label?: string }
Feature Flags
Get all feature flags merged from YAML defaults & Redis overrides.
Get a single feature flag value.
Set a feature flag & broadcast to all connected clients.
YouTube Integration
Get the configured YouTube channel ID & whether it came from environment.
Set the YouTube channel ID for live stream lookups.
Body: { channelId: string }
Find live streams from a channel using the YouTube Data API, falling back to yt-dlp.
TTS Preview
Generate audio preview for a text snippet in a given voice & format.
Body: { text: string, voiceId?: string, format?: 'mp3' | 'pcm' }
Voice Training / Voice Cloning
Clone a voice from base64-encoded browser microphone recordings.
Body: { name: string, clips: string[], mimeType?: string } // clips are base64-encoded audio blobs
Clone a voice from a YouTube URL using yt-dlp & ffmpeg to extract audio clips.
Body: { name: string, youtubeUrl: string, clipCount?: number, startOffset?: number }
Content Generation
Generate a biblical sermon snippet via Gemini Flash (majestic, poetic, multi-language support).
Body: { apiKey?: string, language?: string, sentences?: number } // language: 'en' | 'ru' | 'uk', default 3 sentences
Broadcast Schedule
Get the list of scheduled broadcast events (auto-prunes past events on broadcast start).
Set the broadcast schedule; events include optional source config (youtube URL, biblical prompt, agent ID) & language pair.
Body: { events: Array<{ id: string, title: string, datetime: string, description?: string, source?: 'mic' | 'youtube' | 'biblical' | 'remote', voiceId?: string, durationMinutes?: number, youtubeUrl?: string, biblicalPrompt?: string, agentId?: string, agentDeviceId?: string, skipSourceLang?: string | null, allowedLanguages?: [string, string] }> }
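The auto-pruning of past events might look roughly like the filter below. It is only a sketch: it assumes datetime is an ISO 8601 timestamp, treats an event as past once its start time plus durationMinutes has elapsed, and keeps entries it cannot parse. None of these choices are confirmed by the source, and pruneFinishedEvents is a hypothetical name.

```typescript
// Subset of the schedule event fields relevant to pruning.
interface ScheduledEvent {
  id: string;
  title: string;
  datetime: string; // assumed to be an ISO 8601 timestamp
  durationMinutes?: number;
}

// Hypothetical helper: drop events that have already finished.
function pruneFinishedEvents(events: ScheduledEvent[], now: Date): ScheduledEvent[] {
  return events.filter((event) => {
    const startMs = Date.parse(event.datetime);
    // Keep entries with unparseable dates rather than silently dropping them.
    if (isNaN(startMs)) return true;
    const endMs = startMs + (event.durationMinutes ?? 0) * 60000;
    return endMs > now.getTime();
  });
}
```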
Monitoring & Diagnostics
Get hallucination detection statistics & recent false-positive transcripts.
Clear the hallucination log.
Get the list of custom filler words stripped from transcripts before translation.
Set custom filler words (e.g., "uh", "um", "ээ") to strip from transcripts.
Body: { words: string[] }
Get the translation audit log (original, translated, detected language, provider, timings).
Clear the translation log.
Get real-time broadcast translation queue depth & stream stats (pending, translated, consumer lag).
Video Call Moderation
List active video call rooms with participant info (strips rejoin tokens for security).
Force-close a video call room & disconnect all participants.
Session History & Export
Get list of all broadcast sessions from PostgreSQL (with summary metadata).
Get detailed transcript for a single session (seq, timestamps, detected language, original, translated, timings).
Export session transcripts in CSV, TXT, or JSON format (format query param: csv | txt | json).
User Management
Permission Required: user_management
Get all users (password hashes & avatar data stripped).
Update a user's admin status &/or role assignments.
Body: { isAdmin?: boolean, roleId?: string | null, roleIds?: string[] }
Admin-reset a user's password (hashed with bcrypt).
Body: { password: string } // minimum 6 characters
Delete a user account (prevents self-deletion).
Role & Permission Management
Permission Required: user_management
Get the full list of available permissions.
Get all custom roles with their assigned permissions.
Create a new role with specified permissions.
Body: { name: string, permissions: string[] }
Update an existing role's name & permissions.
Body: { name: string, permissions: string[] }
Miscellaneous
Get the Anthropic (legacy Gemini reference) API key status.
Socket.IO Events
Server → Client Events
| Event |
Payload |
Description |
feature_flags |
{ [flag: string]: boolean } |
Merged feature flags from YAML defaults & Redis overrides. |
languages |
{ languages: [string, string] } |
Current active language pair (source, target). |
available_languages |
{ languages: string[] } |
Pool of languages available for viewer selection. |
stt_timing |
{ tts_segment_pause_ms: number } |
STT timing configuration for frontend audio processing. |
broadcast_status |
{ active: boolean; source: string; pauseReason?: string; skipSourceLang?: string; voiceId?: string; orphaned?: boolean } |
Current broadcast state: active/inactive, source type, pause status, and voice ID. |
broadcast_viewer_count |
{ count: number } |
Real-time count of viewers watching the broadcast. |
remote_audio_sources |
{ sources: RemoteSource[] } |
List of registered remote audio agents with device info. |
admin_audio_device |
{ deviceId: string; label: string } |
Admin-selected audio device override for all viewers. |
broadcast_transcript |
{ text: string; isFinal: boolean; skipped?: boolean } |
Incoming speech-to-text transcript chunks (partial or final). |
broadcast_translation |
{ original: string; translated: string; detectedLanguage?: string } |
Translated text ready for TTS synthesis. |
broadcast_tts_audio |
{ audio: string } |
Base64-encoded audio chunk for broadcast playback. |
broadcast_transcript_history |
{ original: string; translated: string; detectedLanguage?: string }[] |
Recent translations replayed to viewers who rejoin mid-broadcast. |
broadcast_source_status |
{ stalled: boolean; message?: string } |
Audio source watchdog: signals when remote agent audio feed stalls or resumes. |
audio_level |
{ data: number[] } |
Downsampled PCM waveform (64 samples) for real-time visualizer. |
stream_ended |
{} |
Broadcast stream has ended (YouTube/biblical/remote source finished). |
error |
{ message: string } |
Error notification (speech recognition, translation, TTS, etc.). |
tts_clear_queue |
{} |
Clear any queued TTS audio (e.g., on voice change or broadcast pause). |
broadcast_voice_changed |
{ voiceId: string } |
Voice used for TTS has changed mid-broadcast. |
translation |
{ original: string; translated: string; detectedLanguage?: string } |
Translation result in private session (viewer mode only). |
tts_audio |
{ audio: string } |
Base64-encoded TTS audio in private session (viewer mode only). |
transcript |
{ text: string; isFinal: boolean } |
STT transcript in private session (viewer mode only). |
session_started |
{ source: string } |
Private session initialized (mic or YouTube source). |
session_stopped |
{} |
Private session ended (client must restart to resume). |
agent_auth_error |
{ code: string; message: string } |
Remote agent pre-shared key validation failed; socket will disconnect. |
select_device |
{ id: string } |
Admin instructs remote agent to switch to a specific audio device. |
refresh_devices |
{} |
Admin requests remote agent to re-enumerate available audio devices. |
remote_audio_error |
{ socketId: string; deviceId: string; message: string } |
Remote agent reported an audio stream error (broadcast to all admins). |
admin_translate_result |
{ original: string; translated: string; detectedLanguage?: string; audio: string } |
Result of an instant translate&TTS test initiated by admin. |
device_select_error |
{ socketId: string; message: string } |
Failed to switch device on remote agent (device unavailable or offline). |
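On the viewer side, the tts_segment_pause_ms value from the stt_timing event can be applied when queuing broadcast_tts_audio segments. The sketch below only computes start offsets; scheduleSegments is a hypothetical helper, and how the offsets are fed to the audio player (for example Web Audio scheduling or timers) is left open.

```typescript
// Compute the start offset (ms) of each queued TTS segment, given the decoded
// duration of each segment and the configured pause between segments.
function scheduleSegments(durationsMs: number[], pauseMs: number, startAtMs: number = 0): number[] {
  const starts: number[] = [];
  let cursor = startAtMs;
  for (const duration of durationsMs) {
    starts.push(cursor);
    // The next segment begins after this one ends plus the configured pause.
    cursor += duration + pauseMs;
  }
  return starts;
}
```

With the default pause of 0, segments play back to back; a positive value inserts a fixed gap after every segment.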
Client → Server Events
| Event |
Payload |
Description |
join_broadcast |
{} |
Viewer joins the broadcast room to receive live translation & audio. |
leave_broadcast |
{} |
Viewer leaves the broadcast room. |
set_languages |
{ languages: [string, string] } |
Viewer selects a language pair (validated against available pool). |
start_session |
{ source: 'mic' | 'youtube'; voiceId?: string; youtubeUrl?: string } |
Start a private translation session (viewer mode). |
stop_session |
{} |
Stop the active private session. |
change_voice |
{ voiceId: string } |
Change TTS voice (broadcast admin only; updates all viewers in real-time). |
audio_chunk |
{ audio: string } |
Send base64-encoded PCM audio chunk (mic or remote agent). |
test_audio_chunk |
{ audio: string } |
Send audio to private session only (never routed to broadcast). |
admin_start_broadcast |
{ voiceId?: string; source: 'mic' | 'youtube' | 'remote'; youtubeUrl?: string } |
Admin initiates a broadcast from specified source. |
admin_stop_broadcast |
{} |
Admin stops the active broadcast. |
reclaim_broadcast |
{} |
Admin reconnects and reclaims an orphaned broadcast after page reload. |
broadcast_pause |
{ reason: 'prayer' | 'song' } |
Pause broadcast translation&TTS (e.g., during prayer or song). |
broadcast_resume |
{} |
Resume broadcast after pause. |
broadcast_skip_lang |
{ lang: string | null } |
Skip translation when detected language matches (e.g., skip English when human translator is speaking). |
register_audio_source |
{ agentId?: string; label: string; deviceId: string; devices?: { id: string; name: string }[]; selectedDevice?: string | null } |
Remote agent registers itself as an available audio source with device list. |
unregister_audio_source |
{} |
Remote agent disconnects and unregisters from broadcast. |
select_active_agent |
{ socketId: string } |
Admin chooses which registered agent's audio feeds the broadcast. |
select_agent_device |
{ socketId: string; deviceId: string } |
Admin selects which audio device a specific agent should use. |
refresh_devices |
{ socketId: string } |
Admin requests a remote agent to re-enumerate audio devices. |
audio_stream_error |
{ deviceId: string; message: string } |
Remote agent reports audio streaming error (broadcast to admins). |
admin_translate_test |
{ text: string; voiceId?: string; sourceLang?: string; targetLang?: string } |
Admin requests instant translate&TTS test (private response only). |
start_biblical_sim |
{ anthropicApiKey?: string; geminiApiKey?: string; language: BiblicalLanguage; voiceId?: string } |
Start biblical sermon simulator broadcast (broadcasts to all viewers). |
stop_biblical_sim |
{} |
Stop biblical simulator broadcast. |
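Payloads arriving from clients are untrusted, so events such as set_languages need validation before use. The following is a minimal sketch of such a check, assuming the pool is the list sent in available_languages; isValidLanguagePair is a hypothetical helper, and rejecting identical source and target codes is an assumption.

```typescript
// Type guard for the set_languages payload: exactly two language codes,
// both drawn from the available pool.
function isValidLanguagePair(
  payload: unknown,
  pool: string[],
): payload is { languages: [string, string] } {
  if (typeof payload !== "object" || payload === null) return false;
  const langs = (payload as { languages?: unknown }).languages;
  if (!Array.isArray(langs) || langs.length !== 2) return false;
  const [source, target] = langs;
  return (
    typeof source === "string" &&
    typeof target === "string" &&
    source !== target && // assumption: a pair must contain two different languages
    pool.indexOf(source) !== -1 &&
    pool.indexOf(target) !== -1
  );
}
```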
SDK
Uses the official @elevenlabs/elevenlabs-js SDK (v2). The client is lazy-loaded on first use.
Speech-to-Text (Scribe v2 Realtime)
Connects via native WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime. Handles:
- VAD-based commit buffering with configurable merge window
- Stability timeout fallback for stalled VAD
- Text validation (EN/RU/UK character regex filtering)
- Partial and final transcript emission
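The character-level text validation could be approximated as follows. The actual regex in elevenlabs.service.ts is not shown in this documentation, so the patterns and the isPlausibleTranscript name are assumptions: the sketch accepts Latin letters plus the Cyrillic block U+0400–U+04FF (which covers both Russian and Ukrainian letters) and a small set of punctuation.

```typescript
// At least one Latin or Cyrillic letter must be present.
const ALLOWED_LETTER = /[A-Za-z\u0400-\u04FF]/;

// Anything outside letters, digits, whitespace, and common punctuation
// (including typographic dashes and quotes) marks the text as suspicious.
const DISALLOWED_CHAR =
  /[^A-Za-z\u0400-\u04FF0-9\s.,;:!?'"()\-\u2013\u2014\u2018\u2019\u201C\u201D\u00AB\u00BB]/;

// Hypothetical filter for transcripts before they enter the translation queue.
function isPlausibleTranscript(text: string): boolean {
  const trimmed = text.trim();
  return trimmed.length > 0 && ALLOWED_LETTER.test(trimmed) && !DISALLOWED_CHAR.test(trimmed);
}
```

A filter like this is a cheap first line of defense against transcripts in unexpected scripts, which are a common symptom of STT hallucination on noise.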
Text-to-Speech
Uses client.textToSpeech.stream() with the eleven_multilingual_v2 model. Audio is collected into a Buffer and emitted as base64 MP3.
Voice Management
- client.voices.getAll(): fetches all voices from the account
- Admin can filter which voices are available to viewers
- Voice cloning via IVC API (from recordings or YouTube)
Key File
backend/src/services/elevenlabs.service.ts
Provider Details
Google Translate
Google Cloud Translation API v2. Fast (~200ms), deterministic, and reliable. Requires GOOGLE_TRANSLATE_API_KEY with the Cloud Translation API enabled in Google Cloud Console. Ensure the API key has no HTTP referrer restrictions (server-side requests have no referrer).
File: backend/src/services/google-translate.service.ts
LibreTranslate
Self-hosted in Docker. No API key required by default. Provides language detection and translation via REST API.
File: backend/src/services/libretranslate.service.ts
DeepL
Premium translation API. Auto-detects free vs. paid endpoint based on the API key format.
File: backend/src/services/deepl.service.ts
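Endpoint detection from the key format can be sketched in a few lines. DeepL API Free keys carry a ":fx" suffix and use the api-free.deepl.com host, while paid keys use api.deepl.com; whether deepl.service.ts implements the check exactly this way is an assumption, and deeplBaseUrl is a hypothetical name.

```typescript
// Pick the DeepL API host from the key format.
// Free-plan keys end with ":fx"; everything else is treated as a paid key.
function deeplBaseUrl(apiKey: string): string {
  const isFreeKey = apiKey.trim().slice(-3) === ":fx";
  return isFreeKey ? "https://api-free.deepl.com" : "https://api.deepl.com";
}
```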
Claude (Anthropic)
AI-powered translation using claude-haiku-4-5 for speed. Includes language detection and auto-flip logic.
File: backend/src/services/claude-translate.service.ts
Routing
Provider routing is handled by backend/src/services/translation.provider.ts:
- Try admin-selected primary provider
- On failure, try configured fallback provider
- LibreTranslate is always the last-resort fallback
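The routing order above can be sketched as an ordered provider chain that is tried until one call succeeds. The names ProviderName, TranslateFn, buildProviderChain, and translateWithFallback are illustrative and not taken from translation.provider.ts; the provider functions are supplied by the caller.

```typescript
type ProviderName = "google" | "deepl" | "claude" | "libretranslate";
type TranslateFn = (text: string, source: string, target: string) => Promise<string>;

// Primary first, then the configured fallback, then LibreTranslate as last resort.
// Duplicates are skipped so no provider is tried twice.
function buildProviderChain(primary: ProviderName, fallback?: ProviderName): ProviderName[] {
  const chain: ProviderName[] = [primary];
  if (fallback && chain.indexOf(fallback) === -1) chain.push(fallback);
  if (chain.indexOf("libretranslate") === -1) chain.push("libretranslate");
  return chain;
}

// Try each provider in order and return the first successful translation.
async function translateWithFallback(
  text: string,
  source: string,
  target: string,
  chain: ProviderName[],
  providers: Record<ProviderName, TranslateFn>,
): Promise<{ provider: ProviderName; translated: string }> {
  let lastError: unknown;
  for (const name of chain) {
    try {
      const translated = await providers[name](text, source, target);
      return { provider: name, translated };
    } catch (err) {
      lastError = err; // remember the failure and move on to the next provider
    }
  }
  throw lastError instanceof Error ? lastError : new Error("All translation providers failed");
}
```

Returning the provider name alongside the text makes it straightforward to record which provider actually served each request in the translation audit log.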
Connection
Uses ioredis with automatic retry strategy. Falls back to in-memory/YAML defaults if Redis is unavailable.
Key Patterns
| Pattern | Example | Purpose |
| flag:<name> | flag:youtube_input | Feature flag boolean values |
| setting:<name> | setting:tts_settings | JSON settings objects |
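A short sketch of how these key patterns and the YAML-default fallback could fit together. The flagKey, settingKey, and mergeFlags helpers are hypothetical, and storing flag values in Redis as the strings "true" and "false" is an assumption; any other value is ignored here so the YAML default wins.

```typescript
// Key builders for the two documented patterns.
const flagKey = (name: string): string => `flag:${name}`;
const settingKey = (name: string): string => `setting:${name}`;

// Merge YAML defaults with Redis overrides.
// `overrides` maps full Redis keys (e.g. "flag:youtube_input") to raw values.
// Assumption: flags are stored as the strings "true" / "false".
function mergeFlags(
  defaults: Record<string, boolean>,
  overrides: Record<string, string | null>,
): Record<string, boolean> {
  const merged: Record<string, boolean> = { ...defaults };
  for (const name of Object.keys(defaults)) {
    const raw = overrides[flagKey(name)];
    if (raw === "true" || raw === "false") merged[name] = raw === "true";
  }
  return merged;
}
```

Passing an empty overrides object reproduces the behavior when Redis is unavailable: every flag keeps its YAML default.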
Key File
backend/src/services/redis.service.ts
Local Development
Use docker-compose.local.yml for Redis and LibreTranslate only (backend/frontend run natively):
docker compose -f docker-compose.local.yml up -d
Production
Use docker-compose.yml for all services:
docker compose up -d --build
Services
| Service | Image | Port | Notes |
| frontend | node:24-alpine + Nginx | 80 (exposed) | Serves React build, proxies API/WS to backend |
| backend | node:24-alpine | 3001 (internal) | Express + Socket.io server |
| redis | redis:7-alpine | 6379 (internal) | Feature flags and settings store |
| libretranslate | libretranslate/libretranslate | 5000 (internal) | Self-hosted translation engine |
Configuration
ELEVENLABS_API_KEY=sk-your-production-key
ADMIN_PASSWORD=strong-secure-password
FRONTEND_URL=https://translate.example.com
APP_ENV=prod
REDIS_PASSWORD=redis-secret
Deploy
docker compose up -d --build
Reverse Proxy
When running behind Nginx or another reverse proxy:
- Set LISTEN_PORT in .env (e.g., 8080)
- Proxy pass to localhost:8080
- Important: Ensure WebSocket upgrades are forwarded for the /socket.io/ path
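server {
    listen 443 ssl;
    server_name translate.example.com;
    # ssl_certificate and ssl_certificate_key directives are required here
    # (omitted in this example)

    location / {
        proxy_pass http://localhost:8080;
        # HTTP/1.1 plus the Upgrade/Connection headers let Socket.io
        # upgrade to WebSocket through the proxy
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}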
Monitoring
# Check all services
docker compose ps
# View backend logs
docker compose logs -f backend
# Health check
curl http://localhost:3001/api/health
Shipped
v0.1 – v0.2 — Core Translation Engine
- Real-time STT via ElevenLabs Scribe v2 Realtime
- Multi-provider translation (LibreTranslate, DeepL, Claude)
- TTS voice synthesis with ElevenLabs
- Microphone and YouTube live input
- Admin panel with feature flags, voice management, TTS tuning
- Biblical Transcript Simulator for pipeline testing
- Instant Voice Cloning from recordings and YouTube
Shipped
v0.3 — Audio Mixer & Device Selection
Browser-side audio device scanning with support for professional mixing consoles, virtual audio devices, and audio interfaces.
- Browser-side device enumeration with permission flow
- Virtual device detection (Loopback, BlackHole, VB-Audio, Voicemeeter, OBS)
- Categorized device picker (Microphones vs Mixers / Virtual Devices)
- Admin device override broadcast to all viewers via Socket.io
- Real-time feature flag broadcasting
Shipped
v0.7 — Broadcast Service
The /translate route is now a true broadcast service. Admins start one global translation session from the admin panel and all connected viewers receive the live output simultaneously.
- Single global broadcast session (one-to-many)
- Admin "Broadcast Control" panel — Start/Stop with source + voice selection
- Microphone and YouTube source both supported for broadcast
- All translation output (transcript, translated text, TTS audio) is sent to every viewer via io.emit
- Viewer shows Waiting for broadcast to start… status when off air
- "On Air" / "Off Air" status pill visible to viewers in real-time
- Broadcast ownership tracked by admin socket ID; auto-stops on admin disconnect
- Biblical Transcript Simulator also broadcasts to all viewers
Shipped
v0.8 — Navigation, Broadcast FF & Transcript UX
Global persistent bottom navigation, feature-flag-gated route visibility, and a refined transcript reading experience.
- Persistent bottom navigation bar on all pages (/translate, /broadcast, /video, /admin)
- FF-gated nav links: Broadcast and Video Call entries only appear when their flags are enabled
- No extra socket connection: nav reads flags from the page’s existing useSocket call via props
- Nav renders a frosted dark background gradient so it never overlaps content
- /broadcast route is now public (no login required); gated inside the page by the broadcast feature flag
- broadcast feature flag added to YAML, backend config, and frontend FeatureFlags interface
- Transcript panel: newest translation is always at the top; older lines scroll down and fade out at the bottom
- Each new transcript entry animates in from above (transcriptIn keyframe)
- Removed duplicate “Video Call” button from /translate and /broadcast header bars
Shipped
v0.9 — Translation Pipeline Overhaul & Google Integration
Major improvements to translation chunking, provider support, and admin tooling.
- Google Translate as primary translation provider with automatic fallback chain
- Google Gemini 2.5 Flash for biblical simulator and sermon generation (replaces deprecated Gemini 2.0 Flash)
- Overhauled STT chunking: disabled aggressive sentence-boundary splitting, stability timer defers to accumulation during continuous speech, commit buffer defers when speaker has resumed
- Configurable sermon length (1–20 sentences) in admin UI
- Voice training: AI-generated reading text (Gemini) for mic recording sessions
- Voice training: preview playback of cloned voice after training via TTS
- Broadcast mute/unmute toggle (muted by default, replaces “Tap to enable audio” banner)
- Audio device auto-scan on page load with spinning refresh indicator
- Fixed admin Raw Server Logs auto-scroll toggle re-enabling on new messages
- Updated Claude model list: removed deprecated models, default is claude-haiku-4-5
- Docker images upgraded to Node.js 24 (Alpine)
Shipped
v0.4 — Mac Audio Agent
Lightweight Node.js daemon that captures Mac microphone audio and streams it to the backend via Socket.io — no browser required on the audio source machine.
- Runs as a macOS LaunchAgent (auto-start on login, auto-restart on crash)
- Captures 16 kHz 16-bit mono PCM via sox
- Identical chunk format and encoding to the browser client
- Registers as a named remote audio source visible in the Admin UI
- Starts/stops streaming automatically based on broadcast_status events
- One-command install script (see standalone repo)
Up Next
v0.4.1 — Direct Audio Interface Feed
Accept audio directly from professional mixing consoles and audio interfaces — extend the Mac agent to support Core Audio device selection for broadcast-quality input.
- Direct audio interface input (Core Audio / ASIO / ALSA)
- Multi-channel mixer feed support
- Low-latency audio routing (sub-100ms)
- Hardware device auto-discovery and selection
- Professional broadcast integration (NDI, Dante)
Shipped
v0.5 — Video Call Translation
WebRTC peer-to-peer video calls with real-time bidirectional translation. Two people speak different languages and hear each other translated via TTS.
- Built-in WebRTC video call with room codes
- Full-duplex translation (each person hears the other translated)
- Per-participant STT pipeline with independent Scribe sessions
- Video grid UI with local PiP and remote full-screen
- Mic/video mute controls, hang up, auto-cleanup on disconnect
- Feature-flagged behind video_translation
Shipped
v0.6 — Auth, Mobile & Voice Cloning in /video
- User-facing login page (/) with JWT cookie sessions (30-day sticky, HttpOnly)
- All app routes protected; unauthenticated users are redirected to login
- Live translator moved to /translate
- Mobile-responsive UI across Translator, Admin, and Video Call views
- FaceTime-style full-screen in-call layout on mobile with safe-area insets
- “Clone Voice” button in /video lobby, gated by video_voice_cloning feature flag
- Voice cloning modal with mic recording or YouTube URL, admin-password gated
Planned
Future
- Additional language pairs beyond EN/RU/UK
- Speaker diarization (multi-speaker detection)
- Translation memory and glossary support
- Webhooks and API for third-party integrations
- Multi-tenant deployment with user accounts