Ship live translations
with confidence
A production-ready full-stack Node.js + React application for live EN↔RU↔UK translation with automatic language detection and voice synthesis.
⚙
Installation
Set up the project locally with Docker, Redis, and LibreTranslate in minutes.
▦
Architecture
Understand the STT → Translation → TTS pipeline and real-time Socket.io communication.
▶
Live Translation
Stream from YouTube or microphone with automatic EN/RU/UK language detection and voice output.
📜
Biblical Simulator
Test the full pipeline with AI-generated biblical passages in King James, Church Slavonic, or Ukrainian style.
🎤
Voice Training
Clone custom voices from microphone recordings or YouTube videos using ElevenLabs IVC.
Prerequisites
●
Node.js 20+
Runtime for backend and build tools
●
Docker + Docker Compose
For Redis and LibreTranslate services
●
yt-dlp + ffmpeg
Required for YouTube audio extraction
●
ElevenLabs API Key
For speech-to-text and text-to-speech
Clone & Configure
git clone https://github.com/Pzharyuk/live-translator-node.git && cd live-translator-node
cp .env.example .env
Edit .env and set your API key:
ELEVENLABS_API_KEY=sk-your-key-here
ADMIN_PASSWORD=your-secure-password
Start Infrastructure
# Start Redis + LibreTranslate
docker compose -f docker-compose.local.yml up -d
# Wait for LibreTranslate to download language models (~500 MB)
docker logs -f $(docker ps -qf "name=libretranslate") 2>&1 | grep -i "running"
Start Backend
cd backend
npm install
npm run dev # nodemon watches for changes
Start Frontend
cd frontend
npm install
npm run dev # Vite hot-reload on localhost:5173
✓
You're all set!
Open http://localhost:5173 for the user view and http://localhost:5173/admin for the admin panel (default password: admin123).
1
Start the services
Follow the Installation guide to get Docker services, backend, and frontend running.
2
Open the Admin Panel
Navigate to http://localhost:5173/admin and enter the admin password.
3
Select a Voice
Choose a TTS voice from the dropdown. The voice list is fetched from your ElevenLabs account.
4
Test with Text
Use the free-text area in the admin panel to type a phrase. Click translate to hear the TTS output instantly.
5
Go Live
Open the user view at http://localhost:5173. Select "Mic" as input, pick a voice, and click Start. Speak into your microphone and watch real-time translation appear with audio playback.
💡
Try the Biblical Simulator
For a hands-free demo, enable the biblical_simulator feature flag in admin, enter an Anthropic API key, select a language, and click "Generate". The system will produce biblical passages through the full STT → Translation → TTS pipeline.
System Overview
Frontend
React 19 + Vite
Socket.io Client
Web Audio API
↔
Backend
Express + Socket.io
TypeScript
Service Layer
ElevenLabs
Scribe v2 (STT)
TTS Streaming
Voice Cloning
Translation
LibreTranslate (self-hosted)
DeepL (premium API)
Claude / Anthropic (AI)
Redis
Feature Flags
Settings Store
Anthropic
Biblical Simulator
Claude Translation
DeepL
Free & Pro tiers
Auto endpoint detection
Data Flow
1. Audio Input (Mic / YouTube / Simulator)
↓
2. PCM 16-bit LE @ 16kHz via Socket.io chunks
↓
3. ElevenLabs Scribe v2 WebSocket STT
↓
4. Commit Merge Buffer (2.5s VAD aggregation)
↓
5. Translation Provider (LibreTranslate / DeepL / Claude)
↓
6. ElevenLabs TTS (voice synthesis streaming)
↓
7. Audio Playback (queued with 600ms pause)
Key Architecture Decisions
Script-based Language Detection
LibreTranslate's /detect endpoint returns zero confidence for short Cyrillic phrases. The app therefore checks Unicode character scripts first (U+0400–U+04FF = Cyrillic) before calling the API.
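As an illustrative sketch (the function and thresholds here are not from the codebase), the pre-check can be as simple as counting characters per script:

```typescript
// Hypothetical script pre-detection: count Cyrillic vs. Latin letters
// and decide before ever calling LibreTranslate's /detect endpoint.
function detectScript(text: string): "cyrillic" | "latin" | "unknown" {
  let cyr = 0;
  let lat = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0)!;
    if (cp >= 0x0400 && cp <= 0x04ff) cyr++; // Cyrillic block
    else if ((cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a)) lat++; // A-Z, a-z
  }
  if (cyr > lat) return "cyrillic";
  if (lat > 0) return "latin";
  return "unknown";
}
```

A "cyrillic" result still needs RU-vs-UK disambiguation, but it already avoids the low-confidence API call for short phrases.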
VAD Commit Merging
Voice Activity Detection can fire aggressively on speaker breathing. Commits are buffered for 2.5 seconds before translation to merge fragments into meaningful phrases.
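A minimal sketch of this buffering pattern (the class and callback names are illustrative, not the app's actual API):

```typescript
// Collect committed phrases and flush them as one merged string after a
// quiet window (COMMIT_MERGE_MS in the real app). Each new commit
// restarts the window, so rapid VAD fragments coalesce.
class CommitMergeBuffer {
  private parts: string[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private mergeMs: number,
    private onFlush: (merged: string) => void,
  ) {}

  push(commit: string): void {
    this.parts.push(commit);
    if (this.timer) clearTimeout(this.timer); // restart the merge window
    this.timer = setTimeout(() => this.flush(), this.mergeMs);
  }

  flush(): void {
    if (this.timer) clearTimeout(this.timer);
    this.timer = null;
    if (this.parts.length === 0) return;
    const merged = this.parts.join(" ");
    this.parts = [];
    this.onFlush(merged);
  }
}
```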
Feature Flag Merging
YAML config defaults are merged with Redis runtime overrides. Redis values take priority, falling back to YAML if Redis is unavailable.
API Key Hierarchy
Keys resolve in order: Runtime Cache → Redis → Config File → Empty. This allows hot-swapping keys without restarts.
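The hierarchy can be sketched as a first-non-empty lookup chain (names are illustrative):

```typescript
// Resolve an API key through the documented order:
// runtime cache → Redis → config file → empty string.
type KeyLookup = () => string | undefined;

function resolveApiKey(sources: KeyLookup[]): string {
  for (const lookup of sources) {
    const key = lookup();
    if (key && key.length > 0) return key;
  }
  return ""; // empty: caller should treat the provider as unconfigured
}
```

Because the cache and Redis are consulted before the config file, a key changed via the Admin API takes effect without a restart.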
Connection Lifecycle
- Client sends start_session with source type (mic or youtube) and optional voiceId
- Backend opens a WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime
- For YouTube: spawns yt-dlp | ffmpeg child processes to extract PCM audio
- For Microphone: awaits audio_chunk events from the frontend
Audio Streaming
Audio chunks are sent to Scribe as JSON messages:
{
"message_type": "input_audio_chunk",
"audio_base_64": "UklGR..." // PCM 16-bit LE, 16kHz, mono
}
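For illustration, a message in that format can be built from a raw PCM Buffer in Node (the field names come from the format above; the helper name is hypothetical):

```typescript
// Wrap a PCM 16-bit LE, 16kHz, mono buffer in the Scribe input message
// shape shown above, base64-encoding the audio payload.
function buildAudioChunkMessage(pcm: Buffer): string {
  return JSON.stringify({
    message_type: "input_audio_chunk",
    audio_base_64: pcm.toString("base64"),
  });
}
```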
Scribe Responses
| Response Type | Meaning | Action |
|---|---|---|
| partial_transcript | Live partial text (speculative) | Emitted as a non-final transcript event |
| committed_transcript | VAD fired (complete phrase) | Buffered for the commit merge window |
Commit Merge Buffer
After receiving a committed_transcript, the backend waits 2.5 seconds (COMMIT_MERGE_MS) to collect additional commits before translating. This prevents fragmented translations from aggressive VAD.
Stability Timeout
If VAD stalls (no new commits), a 3.5 second fallback timer (STABILITY_TIMEOUT_MS) fires to translate whatever new text has accumulated, preventing indefinite silence.
Text Validation
Before translation, text is validated against EN/RU/UK character regex patterns. This filters out hallucinated text from the STT model (common with silence or background noise).
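A sketch of the idea, assuming a simple letter-count threshold (the real patterns and thresholds may differ):

```typescript
// Accept text only if it contains at least a few letters from the
// expected scripts: Latin (EN) or Cyrillic (RU/UK, U+0400-U+04FF,
// which covers Ukrainian-specific letters like і, ї, є).
const VALID_CHARS = /[A-Za-z\u0400-\u04FF]/g;

function looksLikeSpeech(text: string, minLetters = 2): boolean {
  const letters = text.match(VALID_CHARS) ?? [];
  return letters.length >= minLetters;
}
```

Hallucinated output from silence is often punctuation, digits, or non-target scripts, so a check like this drops it before it wastes a translation call.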
Provider Chain
The system supports three translation providers with automatic fallback:
Default
LibreTranslate
Self-hosted, no API key required. Runs in Docker alongside the app. Best for privacy and cost.
Premium
DeepL
High-quality translations. Supports both free and paid API tiers. Auto-detects endpoint.
AI
Claude
Anthropic's Claude for context-aware translations. Uses claude-haiku-4-5 for speed.
Fallback Logic
1. Try primary provider (admin-selected)
2. If primary fails → try configured fallback
3. If fallback fails → try LibreTranslate (last resort)
4. If all fail → emit error event
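The chain above can be sketched as follows (the Translator signature is an assumption, not the app's actual interface):

```typescript
// Try each provider in the documented order; the last resort is
// LibreTranslate in the real app. If every provider throws, surface
// an error for the Socket.io error event.
type Translator = (text: string) => Promise<string>;

async function translateWithFallback(
  text: string,
  primary: Translator,
  fallback: Translator,
  lastResort: Translator,
): Promise<string> {
  for (const provider of [primary, fallback, lastResort]) {
    try {
      return await provider(text);
    } catch {
      // swallow and try the next provider in the chain
    }
  }
  throw new Error("all translation providers failed");
}
```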
Language Detection
Before translation, the pipeline performs script-based language pre-detection:
- Cyrillic characters (U+0400–U+04FF) → detected as Russian or Ukrainian
- Latin characters → detected as English
- This avoids low-confidence results from LibreTranslate's /detect endpoint on short text
Language Gating
Detected languages are checked against the admin-approved pool. If a detected language isn't in the allowed set, the translation is rejected to prevent hallucinated language outputs.
TTS Pipeline
After translation, the text is sent to ElevenLabs TTS:
const stream = await client.textToSpeech.stream(voiceId, {
text: translatedText,
model_id: "eleven_multilingual_v2",
output_format: "mp3_44100_128",
voice_settings: {
stability: 0.5,
similarity_boost: 0.75,
style: 0.0,
speed: 1.0,
use_speaker_boost: true
}
});
Audio Delivery
TTS audio is streamed to a Buffer, then emitted as a base64-encoded MP3 via the tts_audio Socket.io event.
Frontend Playback Queue
The frontend maintains an audio queue to prevent overlapping playback:
- Received tts_audio events are queued
- Each segment plays to completion before the next starts
- A configurable pause (600ms default) is inserted between segments
- The pause duration is controlled by tts_segment_pause_ms (adjustable in admin)
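The queueing behavior can be sketched like this, with the actual playback function injected (names are illustrative; the real frontend would wrap an HTMLAudioElement or Web Audio playback):

```typescript
// Play queued TTS segments one at a time, waiting for each to finish
// and inserting a configurable pause (tts_segment_pause_ms) between.
class PlaybackQueue {
  private queue: string[] = [];
  private playing = false;

  constructor(
    private play: (base64Mp3: string) => Promise<void>,
    private pauseMs: number,
  ) {}

  enqueue(base64Mp3: string): void {
    this.queue.push(base64Mp3);
    if (!this.playing) void this.drain();
  }

  private async drain(): Promise<void> {
    this.playing = true;
    while (this.queue.length > 0) {
      const segment = this.queue.shift()!;
      await this.play(segment); // resolves when playback finishes
      await new Promise((r) => setTimeout(r, this.pauseMs)); // inter-segment pause
    }
    this.playing = false;
  }
}
```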
Microphone Input
- User selects the "Mic" tab and chooses a TTS voice
- Browser captures audio via the Web Audio API's ScriptProcessorNode
- PCM 16-bit LE at a 16kHz sample rate is sent to the backend via Socket.io
- Backend pipes audio to the ElevenLabs Scribe v2 Realtime WebSocket
- Language is auto-detected (EN/RU/UK); text is translated and synthesized
- TTS audio is returned and played back with inter-segment pauses
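The PCM conversion step can be illustrated with the standard float-to-16-bit mapping (a sketch; browser capture and resampling are not shown here):

```typescript
// Convert Web Audio float samples ([-1, 1]) to PCM 16-bit integers,
// clamping out-of-range values before scaling.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff; // asymmetric int16 range
  }
  return out;
}
```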
YouTube Input
- User pastes a YouTube URL (live stream or video)
- Backend spawns yt-dlp | ffmpeg child processes
- Audio extracted as PCM stream (16kHz, 16-bit LE, mono)
- Piped to Scribe v2, same pipeline as microphone
- Stream ends when YouTube content ends or user stops
User Interface
The user view features a dark cavern theme with:
- Waveform visualizer — Canvas-based bar chart with orange gradient and cyan tips
- Transcript display — White translated text scrolls upward with fade masks
- Partial transcript — Shown in italic orange while STT is processing
- Source tabs — Toggle between Mic and YouTube (controlled by feature flags)
How It Works
The backend uses yt-dlp and ffmpeg as child processes to extract audio from YouTube URLs:
yt-dlp (best audio) → ffmpeg (PCM 16kHz 16-bit LE mono) → Scribe v2
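A sketch of wiring those child processes together in Node (the yt-dlp and ffmpeg flags shown are typical choices, not necessarily the app's exact arguments):

```typescript
import { spawn } from "node:child_process";

// Pipe yt-dlp's best-audio stream into ffmpeg, which converts it to
// PCM 16-bit LE, 16kHz, mono on stdout. Returns a stop function that
// kills both children.
function extractPcm(url: string, onChunk: (pcm: Buffer) => void): () => void {
  const ytdlp = spawn("yt-dlp", ["-f", "bestaudio", "-o", "-", url]);
  const ffmpeg = spawn("ffmpeg", [
    "-i", "pipe:0",   // read container audio from yt-dlp's stdout
    "-f", "s16le",    // raw PCM 16-bit little-endian
    "-ar", "16000",   // 16 kHz sample rate
    "-ac", "1",       // mono
    "pipe:1",
  ]);
  ytdlp.stdout.pipe(ffmpeg.stdin);
  ffmpeg.stdout.on("data", onChunk);
  return () => {
    ytdlp.kill("SIGKILL");
    ffmpeg.kill("SIGKILL");
  };
}
```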
Supported Sources
- Live streams — Translates in real-time as the stream progresses
- Regular videos — Processes the full audio track
- Any URL supported by yt-dlp (YouTube, etc.)
Requirements
Both yt-dlp and ffmpeg must be installed and available in the system PATH. On macOS:
brew install yt-dlp ffmpeg
⚠
Feature Flag Required
YouTube input is controlled by the youtube_input feature flag. Enable it in the admin panel to show the YouTube tab in the user view.
Overview
The Biblical Transcript Simulator is an admin-only feature that generates biblical text passages using Anthropic's Claude API, then routes them through the full translation pipeline. This provides a hands-free way to test STT → Translation → TTS without a live audio source.
Language Styles
| Language | Style | Example |
|---|---|---|
| en | King James English | "In the beginning was the Word..." |
| ru | Church Slavonic Russian | "В начале было Слово..." |
| uk | Traditional Ukrainian | "На початку було Слово..." |
Flow
- Admin provides an Anthropic API key and selects a language
- Backend calls Claude with streaming (uses claude-sonnet-4-6)
- Claude generates 6-8 biblical passages, 3-5 sentences each
- The stream is buffered until 140+ characters AND complete sentences
- Chunks are emitted with 1800ms smooth pacing between them
- Each chunk flows through the standard pipeline:
  - Emitted as transcript (isFinal: true)
  - Auto-translated via the configured provider
  - TTS synthesized and audio returned
- Frontend plays audio with the standard inter-segment pause
💡
Feature Flag
Enable biblical_simulator in the admin feature flags panel. The Anthropic API key is provided at runtime in the UI — it's never stored in config files.
Overview
Voice Training uses ElevenLabs' Instant Voice Cloning (IVC) API to create custom voices from audio samples. Once cloned, the voice appears in the voice selector immediately.
From Microphone
- Open the Voice Training section in the admin panel
- Record multiple audio clips using your browser microphone
- Provide a name for the voice
- Clips are uploaded to ElevenLabs IVC API
- Cloned voice is available for TTS immediately
From YouTube
- Paste a YouTube URL in the Voice Training section
- Backend extracts N × 30-second clips via yt-dlp + ffmpeg
- Clips are uploaded to ElevenLabs IVC API
- Resulting voice is stored in your ElevenLabs account
⚠
ElevenLabs Account
Cloned voices are stored in your ElevenLabs account, not locally. Ensure your plan supports voice cloning.
Concepts
| Concept | Description |
|---|---|
| Active Language Pair | The current pair used for translation (e.g., EN ↔ RU, EN ↔ UK, or RU ↔ UK). Set by admin. |
| Available Languages | The pool of languages viewers can select from (if user_language_selector is enabled). |
Admin Controls
- Change the active language pair via the admin panel
- Changes broadcast to all connected clients in real-time
- Manage the available languages pool for viewer selection
Viewer Selection
When the user_language_selector feature flag is enabled, viewers can override the admin-set language pair by selecting their own preferred languages from the available pool.
Feature Flags
Feature flags control optional functionality and user-facing features. They are defined in application.yaml with default values, and can be overridden at runtime via the Admin API using Redis for persistence.
| Flag | Default | Description |
|---|---|---|
| youtube_input | true | Enable YouTube audio streaming as a session source. |
| mic_input | true | Enable microphone capture as a session source. |
| auto_language_detect | true | Automatically detect source language; if disabled, use translation.source_lang. |
| user_language_selector | false | Allow viewers to select a language pair from the available pool. |
| audio_device_selector | false | Allow viewers to select their audio input device. |
Storage & Override Behavior
Feature flags are stored in two locations:
- Config defaults (application.yaml): used on startup and as a fallback.
- Runtime overrides (Redis, keys prefixed flag:): persisted and served to clients on connection.
When a client connects, the backend merges config defaults with any Redis overrides, then emits the merged set via the feature_flags socket event. Changes made via the Admin API immediately update Redis and are broadcast to all connected clients.
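Because Redis values win, the merge reduces to a simple object spread (a sketch, not the app's actual code):

```typescript
// Merge YAML defaults with Redis runtime overrides; keys present in
// the overrides shadow the defaults, everything else falls through.
type Flags = Record<string, boolean>;

function mergeFlags(yamlDefaults: Flags, redisOverrides: Flags): Flags {
  return { ...yamlDefaults, ...redisOverrides };
}
```

The merged object is what gets emitted via the feature_flags socket event; if Redis is unavailable, passing an empty overrides object yields the YAML defaults unchanged.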
API Endpoints
GET /admin/flags
Returns all feature flags (config defaults + Redis overrides)
Response: { "flags": { "youtube_input": true, "mic_input": true, … } }
POST /admin/flags/:flag
Set a feature flag to a boolean value (persists to Redis)
Body: { "value": true }
Response: { "flag": "youtube_input", "value": true }
GET /admin/flags/:flag
Get the current value of a single feature flag
Response: { "flag": "youtube_input", "value": true }
Socket Events
feature_flags (emit on connect)
Sent to client when socket connects
Data: { "youtube_input": true, "mic_input": true, … }
File Structure
| File | Purpose |
|---|---|
| config/application.yaml | Base defaults for all environments |
| config/application-local.yaml | Local development overrides (localhost URLs) |
| config/application-prod.yaml | Production overrides (Docker service names) |
The APP_ENV environment variable (local or prod) determines which overlay file is loaded on top of the base config.
Full Configuration Reference
server:
port: 3001
cors_origin: "http://localhost:5173"
elevenlabs:
api_key: "${ELEVENLABS_API_KEY}"
default_voice_id: "kxj9qk6u5PfI0ITgJwO0"
tts_model: "eleven_multilingual_v2"
tts_settings:
stability: 0.5
similarity_boost: 0.75
style: 0.0
speed: 1.0
use_speaker_boost: true
stt_model: "scribe_v2_realtime"
anthropic:
api_key: "${ANTHROPIC_API_KEY}"
deepl:
api_key: "${DEEPL_API_KEY}"
libretranslate:
url: "http://libretranslate:5000"
api_key: ""
redis:
host: "redis"
port: 6379
password: ""
feature_flags:
youtube_input: true
mic_input: true
auto_language_detect: true
user_language_selector: false
audio_device_selector: false
audio:
sample_rate: 16000
channels: 1
chunk_duration_ms: 250
translation:
source_lang: "auto"
target_lang_en: "en"
target_lang_ru: "ru"
provider: "libretranslate"
fallback: "libretranslate"
Environment Variable Interpolation
YAML values using ${VAR_NAME} syntax are automatically replaced with the corresponding environment variable at startup.
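A sketch of such interpolation (the real loader's handling of unset variables is not documented here, so leaving them untouched is an assumption):

```typescript
// Replace ${VAR_NAME} placeholders in a config string with values from
// an environment map; unknown variables are left as-is.
function interpolateEnv(
  value: string,
  env: Record<string, string | undefined>,
): string {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (match, name) => env[name] ?? match);
}
```

At startup the loader would call this with process.env over every string value in the parsed YAML.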
| Variable | Required | Default | Description |
|---|---|---|---|
| ELEVENLABS_API_KEY | Yes | — | ElevenLabs API key for text-to-speech & speech-to-text services. |
| ELEVENLABS_VOICE_ID | No | kxj9qk6u5PfI0ITgJwO0 | Default ElevenLabs voice ID when none is specified. |
| ANTHROPIC_API_KEY | No | — | Anthropic API key for sermon generation & Claude translation provider; can also be set in Admin UI at runtime. |
| DEEPL_API_KEY | No | — | DeepL API key for translation provider; free tier keys end with :fx. |
| TRANSLATION_PROVIDER | No | libretranslate | Primary translation provider: deepl / claude / libretranslate; can be changed live in Admin UI. |
| APP_ENV | No | local | Application environment: local (dev) / prod (Docker). |
| FRONTEND_URL | No | http://localhost | Frontend URL used for CORS origin in production; set to the actual domain (e.g., https://translate.example.com). |
| LISTEN_PORT | No | 80 | Host port the frontend listens on. |
| REDIS_PASSWORD | No | — | Redis authentication password; leave empty for no authentication. |
| LIBRETRANSLATE_API_KEY | No | — | LibreTranslate API key if the instance requires authentication. |
| ADMIN_PASSWORD | No | admin123 | Admin page protection password; must be changed in production. |
| server.port | No | 3001 | Backend server port. |
| server.cors_origin | No | http://localhost:5173 | CORS origin for frontend requests. |
| elevenlabs.tts_model | No | eleven_multilingual_v2 | ElevenLabs text-to-speech model ID. |
| elevenlabs.stt_model | No | scribe_v2 | ElevenLabs speech-to-text model ID. |
| elevenlabs.tts_settings.stability | No | 0.5 | TTS voice stability parameter (0.0–1.0). |
| elevenlabs.tts_settings.similarity_boost | No | 0.75 | TTS voice similarity boost parameter (0.0–1.0). |
| elevenlabs.tts_settings.style | No | 0.0 | TTS voice style parameter (0.0–1.0). |
| elevenlabs.tts_settings.speed | No | 1.0 | TTS speech speed multiplier. |
| elevenlabs.tts_settings.use_speaker_boost | No | true | Enable ElevenLabs speaker boost for TTS. |
| redis.host | No | redis | Redis server hostname. |
| redis.port | No | 6379 | Redis server port. |
| libretranslate.url | No | http://libretranslate:5000 | LibreTranslate API endpoint URL. |
| feature_flags.youtube_input | No | true | Enable YouTube audio input source. |
| feature_flags.mic_input | No | true | Enable microphone audio input source. |
| feature_flags.auto_language_detect | No | true | Enable automatic source language detection. |
| feature_flags.user_language_selector | No | false | Allow viewers to select a language pair from the available pool. |
| feature_flags.audio_device_selector | No | false | Allow viewers to select an audio input device. |
| audio.sample_rate | No | 16000 | Audio sample rate in Hz. |
| audio.channels | No | 1 | Number of audio channels (mono). |
| audio.chunk_duration_ms | No | 250 | Duration of each audio chunk in milliseconds. |
| translation.source_lang | No | auto | Source language for translation ("auto" for detection). |
| translation.target_lang_en | No | en | Target language code for English context. |
| translation.target_lang_ru | No | ru | Target language code for Russian context. |
| translation.fallback | No | libretranslate | Fallback translation provider when primary fails: deepl / claude / libretranslate / none. |
TTS Settings
Configure ElevenLabs text-to-speech parameters. Settings are persisted in Redis and applied globally to all TTS operations.
API Endpoints
GET /admin/tts-settings
Response: { "settings": { "stability": 0.5, "similarity_boost": 0.75, "style": 0.0, "speed": 1.0, "use_speaker_boost": true } }
POST /admin/tts-settings
Request: { "stability": 0.5, "similarity_boost": 0.75, "style": 0.0, "speed": 1.0, "use_speaker_boost": true }
Response: { "settings": { … } }
Settings Reference
| Setting | Range | Default | Description |
|---|---|---|---|
| stability | 0.0 – 1.0 | 0.5 | Controls speech stability: lower values increase variability, higher values produce more consistent speech. |
| similarity_boost | 0.0 – 1.0 | 0.75 | Amplifies voice character match: higher values make the voice more recognizable, lower values allow more creative variation. |
| style | 0.0 – 1.0 | 0.0 | Exaggerates emotional expression in speech: 0 = neutral, 1 = maximum stylization. |
| speed | 0.5 – 2.0 | 1.0 | Speech rate multiplier: 1.0 = normal, <1.0 = slower, >1.0 = faster. |
| use_speaker_boost | true / false | true | Enables speaker boost preprocessing, enhancing voice clarity and presence. |
Configuration Source
TTS settings are defined in config/application.yaml under elevenlabs.tts_settings and loaded into memory at startup. Runtime changes via POST /admin/tts-settings are persisted to Redis and applied immediately to all subsequent TTS generation calls.
Related Settings
- TTS Model:
eleven_multilingual_v2 — Multilingual ElevenLabs model
- Output Format:
mp3_44100_128 — MP3 at 44.1 kHz, 128 kbps
- Default Voice:
kxj9qk6u5PfI0ITgJwO0 — Fallback voice when viewer or admin does not specify
STT Timing Settings
Configure speech-to-text timing behavior, including commit buffering, stability detection, and TTS segment pauses.
Settings Reference
| Setting | Default | Description |
|---|---|---|
| commit_merge_ms | 2500 | Buffer VAD commits for this duration (ms) before translating; merges sentence fragments into coherent segments. |
| stability_timeout_ms | 3500 | Timeout (ms) for stable partial text before an automatic translation trigger if VAD doesn't fire. |
| tts_segment_pause_ms | 600 | Pause between consecutive TTS audio segment playbacks (ms); the frontend uses this for timing. |
Tuning Guide
- Fast response: lower commit_merge_ms (e.g., 1500–2000) & stability_timeout_ms (e.g., 2500–3000) to translate shorter fragments sooner.
- Coherent segments: increase commit_merge_ms (e.g., 3000–4000) to merge more VAD commits into complete sentences.
- Smooth playback: adjust tts_segment_pause_ms (e.g., 300–1000) to control the pause duration between audio clips.
- VAD sensitivity: if VAD fires too early (short pauses), increase both timers; if too late, decrease them.
- Stability fallback: stability_timeout_ms is a safety net when VAD doesn't commit; it adds latency but ensures text is eventually translated.
API Endpoints
GET /admin/stt-timing
Retrieve current STT timing settings.
curl -H "x-admin-password: admin123" \
http://localhost:3001/admin/stt-timing
Response:
{
"settings": {
"commit_merge_ms": 2500,
"stability_timeout_ms": 3500,
"tts_segment_pause_ms": 600
}
}
POST /admin/stt-timing
Update one or more STT timing settings (partial update supported).
curl -X POST \
-H "x-admin-password: admin123" \
-H "Content-Type: application/json" \
-d '{
"commit_merge_ms": 3000,
"stability_timeout_ms": 4000,
"tts_segment_pause_ms": 800
}' \
http://localhost:3001/admin/stt-timing
Response:
{
"settings": {
"commit_merge_ms": 3000,
"stability_timeout_ms": 4000,
"tts_segment_pause_ms": 800
}
}
Note: Settings are persisted to Redis and survive server restarts. Changes are sent to connected clients via WebSocket event stt_timing.
Authentication: All endpoints require x-admin-password header matching ADMIN_PASSWORD environment variable (default: admin123).
API Keys
Retrieve status of all configured API keys (elevenlabs, anthropic, deepl, libretranslate).
Update one or more API keys.
Body: {
"elevenlabs": "string (optional)",
"anthropic": "string (optional)",
"deepl": "string (optional)",
"libretranslate": "string (optional)"
}
Retrieve the currently configured Anthropic API key.
Voice Management
Scan and list all available voices from ElevenLabs, comparing against the allowed voices whitelist.
Get the current whitelist of allowed voice IDs (null → all voices allowed, array → filtered list).
Update the allowed voices whitelist and broadcast to all connected viewers.
Body: {
"voiceIds": ["voice_id_1", "voice_id_2", ...]
}
Feature Flags
Retrieve all feature flags, merged from YAML config defaults and Redis overrides.
Get the current value of a specific feature flag.
Set a feature flag value and persist to Redis.
Body: {
"value": boolean
}
TTS & STT Settings
Retrieve current TTS settings (stability, similarity_boost, style, speed, use_speaker_boost).
Update TTS voice settings for ElevenLabs text-to-speech.
Body: {
"stability": number (optional),
"similarity_boost": number (optional),
"style": number (optional),
"speed": number (optional),
"use_speaker_boost": boolean (optional)
}
Retrieve STT timing settings (commit_merge_ms, stability_timeout_ms, tts_segment_pause_ms).
Update STT timing parameters for speech-to-text processing.
Body: {
"commit_merge_ms": number (optional),
"stability_timeout_ms": number (optional),
"tts_segment_pause_ms": number (optional)
}
Languages
Get the currently active language pair (source and target language codes).
Set the active language pair and broadcast to all connected viewers.
Body: {
"languages": ["en", "ru"]
}
Get the pool of available languages that viewers can choose from.
Update the available languages pool and broadcast to all viewers.
Body: {
"languages": ["en", "ru", "uk", ...]
}
Translation Provider
Get the currently active translation provider and list of available providers (deepl, claude, libretranslate).
Switch the active translation provider.
Body: {
"provider": "deepl" | "claude" | "libretranslate"
}
Audio Device
Get the admin-selected audio input device ID and label (overrides viewer's local selection).
Set the admin-forced audio input device and broadcast to all connected viewers.
Body: {
"deviceId": "string (optional)",
"label": "string (optional)"
}
Voice Training & Cloning
Clone a voice from browser microphone recordings (base64-encoded audio blobs).
Body: {
"name": "string (required)",
"clips": ["base64_audio_1", "base64_audio_2", ...] (required),
"mimeType": "string (optional, default: audio/webm)"
}
Clone a voice from YouTube URL using yt-dlp & ffmpeg to extract audio clips.
Body: {
"name": "string (required)",
"youtubeUrl": "string (required)",
"clipCount": number (optional, default: 3, max: 25),
"startOffset": number (optional, default: 0, in seconds)
}
Public Endpoints
Generate a biblical sermon snippet using Anthropic Claude and optionally convert to speech.
Body: {
"apiKey": "string (optional, uses configured key if omitted)",
"language": "en" | "ru" | "uk" (optional, default: "en")
}
Socket.io Events
Server → Client Events
| Event | Payload | Description |
|---|---|---|
| feature_flags | { [key: string]: boolean } | Emits merged feature flags from YAML config & Redis overrides on connection. |
| languages | { languages: string[] } | Emits the current active language pair (source & target). |
| available_languages | { languages: string[] } | Emits the pool of languages available for viewers to choose from. |
| stt_timing | { tts_segment_pause_ms: number } | Emits the pause duration between TTS audio segments (milliseconds). |
| admin_audio_device | { deviceId: string; label: string } | Emits the admin-selected audio input device; overrides the viewer's local selection. |
| session_started | { source: 'mic' \| 'youtube' \| 'biblical' } | Signals that a translation session has started from the specified source. |
| transcript | { text: string; isFinal: boolean } | Emits recognized speech text; isFinal indicates whether VAD or the stability timeout fired. |
| translation | { original: string; translated: string; detectedLanguage: string } | Emits translated text & the auto-detected source language after a final transcript. |
| tts_audio | { audio: string } | Emits a base64-encoded MP3 audio buffer of the translated text. |
| audio_level | { data: number[] } | Emits downsampled waveform data (64 samples) for real-time audio visualization. |
| stream_ended | {} | Signals that a YouTube or biblical stream has ended. |
| session_stopped | {} | Signals that the active translation session has stopped. |
| admin_translate_result | { original: string; translated: string; detectedLanguage: string; audio: string } | Returns the result of an admin test translate, including base64 audio & the detected language. |
| error | { message: string } | Emits an error message from speech recognition, translation, or TTS processing. |
Client → Server Events
| Event | Payload | Description |
|---|---|---|
| set_languages | { languages: string[] } | Viewer selects a language pair; validated against the available pool & broadcast to all clients. |
| start_session | { voiceId?: string; source: 'mic' \| 'youtube'; youtubeUrl?: string } | Starts an STT & translation session from microphone or YouTube URL; creates the Scribe WebSocket. |
| audio_chunk | { audio: string } | Sends a base64-encoded PCM audio chunk from the browser microphone to the Scribe STT session. |
| stop_session | {} | Stops the active STT session, closes the Scribe WebSocket, & terminates the YouTube stream if active. |
| admin_translate_test | { text: string; voiceId?: string; sourceLang?: string; targetLang?: string } | Admin test: translates text with optional language override & returns TTS audio. |
| start_biblical_sim | { anthropicApiKey: string; language: 'en' \| 'ru' \| 'uk'; voiceId?: string } | Starts the biblical transcript simulator using the Anthropic API; streams sentences as final transcripts. |
| stop_biblical_sim | {} | Stops the active biblical simulator stream. |
SDK
Uses the official @elevenlabs/elevenlabs-js SDK (v2). The client is lazy-loaded on first use.
Speech-to-Text (Scribe v2 Realtime)
Connects via native WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime. Handles:
- VAD-based commit buffering with configurable merge window
- Stability timeout fallback for stalled VAD
- Text validation (EN/RU/UK character regex filtering)
- Partial and final transcript emission
Text-to-Speech
Uses client.textToSpeech.stream() with the eleven_multilingual_v2 model. Audio is collected into a Buffer and emitted as base64 MP3.
Voice Management
client.voices.getAll() — fetches all voices from account
- Admin can filter which voices are available to viewers
- Voice cloning via IVC API (from recordings or YouTube)
Key File
backend/src/services/elevenlabs.service.ts
Provider Details
LibreTranslate
Self-hosted in Docker. No API key required by default. Provides language detection and translation via REST API.
File: backend/src/services/libretranslate.service.ts
DeepL
Premium translation API. Auto-detects free vs. paid endpoint based on the API key format.
File: backend/src/services/deepl.service.ts
Claude (Anthropic)
AI-powered translation using claude-haiku-4-5 for speed. Includes language detection and auto-flip logic.
File: backend/src/services/claude-translate.service.ts
Routing
Provider routing is handled by backend/src/services/translation.provider.ts:
- Try admin-selected primary provider
- On failure, try configured fallback provider
- LibreTranslate is always the last-resort fallback
Connection
Uses ioredis with automatic retry strategy. Falls back to in-memory/YAML defaults if Redis is unavailable.
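The retryStrategy option ioredis accepts is a function from the attempt count to a retry delay in milliseconds, or null to stop retrying. A sketch with illustrative backoff numbers (the app's actual values may differ):

```typescript
// Capped linear backoff for ioredis reconnects; returning null tells
// ioredis to stop, at which point the app falls back to YAML defaults.
function retryStrategy(times: number): number | null {
  if (times > 20) return null;        // give up after 20 attempts
  return Math.min(times * 100, 2000); // 100ms, 200ms, ... capped at 2s
}
```

It would be passed at client construction, e.g. new Redis({ host, port, retryStrategy }).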
Key Patterns
| Pattern | Example | Purpose |
|---|---|---|
| flag:&lt;name&gt; | flag:youtube_input | Feature flag boolean values |
| setting:&lt;name&gt; | setting:tts_settings | JSON settings objects |
Key File
backend/src/services/redis.service.ts
Local Development
Use docker-compose.local.yml for Redis and LibreTranslate only (backend/frontend run natively):
docker compose -f docker-compose.local.yml up -d
Production
Use docker-compose.yml for all services:
docker compose up -d --build
Services
| Service | Image | Port | Notes |
|---|---|---|---|
| frontend | Nginx (custom build) | 80 (exposed) | Serves the React build, proxies API/WS to backend |
| backend | Node.js (custom build) | 3001 (internal) | Express + Socket.io server |
| redis | redis:7-alpine | 6379 (internal) | Feature flags and settings store |
| libretranslate | libretranslate/libretranslate | 5000 (internal) | Self-hosted translation engine |
Configuration
ELEVENLABS_API_KEY=sk-your-production-key
ADMIN_PASSWORD=strong-secure-password
FRONTEND_URL=https://translate.example.com
APP_ENV=prod
REDIS_PASSWORD=redis-secret
Deploy
docker compose up -d --build
Reverse Proxy
When running behind Nginx or another reverse proxy:
- Set LISTEN_PORT in .env (e.g., 8080)
- Proxy pass to localhost:8080
- Important: ensure WebSocket upgrades are forwarded for the /socket.io/ path
server {
listen 443 ssl;
server_name translate.example.com;
location / {
proxy_pass http://localhost:8080;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
}
}
Monitoring
# Check all services
docker compose ps
# View backend logs
docker compose logs -f backend
# Health check
curl http://localhost:3001/api/health
Shipped
v1 — Core Translation Engine
- Real-time STT via ElevenLabs Scribe v2 Realtime
- Multi-provider translation (LibreTranslate, DeepL, Claude)
- TTS voice synthesis with ElevenLabs
- Microphone and YouTube live input
- Admin panel with feature flags, voice management, TTS tuning
- Biblical Transcript Simulator for pipeline testing
- Instant Voice Cloning from recordings and YouTube
Up Next
v2 — Video Translation
Full video translation pipeline — not just audio. Translate video content with synchronized subtitles and dubbed audio output.
- Video file upload and URL ingestion
- Synchronized subtitle generation (SRT/VTT)
- Dubbed audio track with voice-matched TTS
- Video player with real-time translated overlay
- Batch video processing queue
Up Next
v2.1 — Direct Audio Mixer Feed
Accept audio directly from professional mixing consoles and audio interfaces — bypass browser mic capture entirely for broadcast-quality input.
- Direct audio interface input (ASIO / Core Audio / ALSA)
- Multi-channel mixer feed support
- Low-latency audio routing (sub-100ms)
- Hardware device auto-discovery and selection
- Professional broadcast integration (NDI, Dante)
Planned
Future
- Additional language pairs beyond EN/RU/UK
- Speaker diarization (multi-speaker detection)
- Translation memory and glossary support
- Webhooks and API for third-party integrations
- Multi-tenant deployment with user accounts