Live Translator - Documentation

⚙

Installation

Set up the project locally with Docker, Redis, and LibreTranslate in minutes.

▦

Architecture

Understand the STT → Translation → TTS pipeline and real-time Socket.io communication.

▶

Live Translation

Stream from YouTube or microphone with automatic EN/RU/UK language detection and voice output.

📜

Biblical Simulator

Test the full pipeline with AI-generated biblical passages in King James, Church Slavonic, or Ukrainian style.

🎤

Voice Training

Clone custom voices from microphone recordings or YouTube videos using ElevenLabs IVC.

💡

Looking for examples?

Check the Quickstart guide for a complete walkthrough, or jump to the API Reference for endpoint documentation.

Prerequisites

Node.js 20+: Runtime for backend and build tools
Docker + Docker Compose: For Redis and LibreTranslate services
yt-dlp + ffmpeg: Required for YouTube audio extraction
ElevenLabs API Key: For speech-to-text and text-to-speech

Clone & Configure

Terminal

git clone https://github.com/Pzharyuk/live-translator-node.git && cd live-translator-node
cp .env.example .env

Edit .env and set your API key:

.env

ELEVENLABS_API_KEY=sk-your-key-here
ADMIN_PASSWORD=your-secure-password

Start Infrastructure

Terminal

# Start Redis + LibreTranslate
docker compose -f docker-compose.local.yml up -d

# Wait for LibreTranslate to download language models (~500 MB); press Ctrl+C once a "running" line appears
docker logs -f $(docker ps -qf "name=libretranslate") 2>&1 | grep -i "running"

Start Backend

Terminal

cd backend
npm install
npm run dev  # nodemon watches for changes

Start Frontend

Terminal

cd frontend
npm install
npm run dev  # Vite hot-reload on localhost:5173

✓

You're all set!

Open http://localhost:5173 and log in with user / changeme; you will be redirected to /translate. Admin panel: http://localhost:5173/admin (password: the ADMIN_PASSWORD value from .env, which defaults to admin123).

1

Start the services

Follow the Installation guide to get Docker services, backend, and frontend running.

2

Open the Admin Panel

Navigate to http://localhost:5173/admin and enter the admin password.

3

Select a Voice

Choose a TTS voice from the dropdown. The voice list is fetched from your ElevenLabs account.

4

Test with Text

Use the free-text area in the admin panel to type a phrase. Click translate to hear the TTS output instantly.

5

Go Live

Open the user view at http://localhost:5173/translate. Select "Mic" as input, pick a voice, and click Start. Speak into your microphone and watch real-time translation appear with audio playback.

💡

Try the Biblical Simulator

For a hands-free demo, enable the biblical_simulator feature flag in admin, enter a Gemini API key, select a language, and click "Generate". The system generates biblical passages and sends them through the translation and TTS stages of the pipeline.

System Overview

Frontend

React 19 + Vite

Socket.io Client

Web Audio API

↔

Backend

Express + Socket.io

TypeScript

Service Layer

ElevenLabs

Scribe v2 (STT)

TTS Streaming

Voice Cloning

Translation

Google Translate (Cloud API)

LibreTranslate (self-hosted)

DeepL (premium API)

Claude / Anthropic (AI)

Redis

Feature Flags

Settings Store

Google Gemini

Biblical Simulator

Sermon Generation

Voice Training Text

DeepL

Free & Pro tiers

Auto endpoint detection

Data Flow

1. Audio Input: Mic / YouTube / Simulator
2. PCM 16-bit LE @ 16 kHz: sent as Socket.io chunks
3. ElevenLabs Scribe v2: WebSocket STT
4. Commit Merge Buffer: 2.5 s VAD aggregation
5. Translation Provider: Google / LibreTranslate / DeepL / Claude
6. ElevenLabs TTS: voice synthesis streaming
7. Audio Playback: queued with a 600 ms pause

Key Architecture Decisions

Two-layer Language Detection

LibreTranslate's /detect endpoint returns zero confidence for short Cyrillic phrases. The app combines script-based pre-detection (code points U+0400–U+04FF are Cyrillic) with ElevenLabs Scribe's language_code output for reliable EN/RU/UK auto-detection.

VAD Commit Merging

Voice Activity Detection can fire aggressively on speaker breathing. Commits are buffered for 2.5 seconds before translation to merge fragments into meaningful phrases.

Feature Flag Merging

YAML config defaults are merged with Redis runtime overrides. Redis values take priority, falling back to YAML if Redis is unavailable.

API Key Hierarchy

Keys resolve in order: Runtime Cache → Redis → Config File → Empty. This allows hot-swapping keys without restarts.
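The resolution order above can be sketched as a simple first-match lookup. This is a minimal illustration only: the KeySource type, the resolveApiKey function, and the in-memory maps standing in for the runtime cache, Redis, and the config file are all assumptions, not the app's actual code.

```typescript
// Hypothetical key source: any lookup that may or may not hold a key.
type KeySource = { name: string; get: (keyName: string) => string | undefined };

// Walk the sources in priority order and return the first non-empty value.
function resolveApiKey(keyName: string, sources: KeySource[]): string {
  for (const source of sources) {
    const value = source.get(keyName);
    if (value !== undefined && value.trim() !== "") return value;
  }
  return ""; // the "Empty" end of the chain
}

// In-memory stand-ins for the three layers (assumed shapes, not the real stores).
const runtimeCache = new Map<string, string>();
const redisSnapshot = new Map<string, string>([["deepl", "key-from-redis"]]);
const configFile = new Map<string, string>([
  ["deepl", "key-from-config"],
  ["anthropic", "key-from-config"],
]);

const keySources: KeySource[] = [
  { name: "runtime-cache", get: (k) => runtimeCache.get(k) },
  { name: "redis", get: (k) => redisSnapshot.get(k) },
  { name: "config-file", get: (k) => configFile.get(k) },
];

console.log(resolveApiKey("deepl", keySources));
```

Writing a new value into the highest-priority layer changes the result of the next lookup, which is what makes hot-swapping possible without a restart.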

Connection Lifecycle

Client sends start_session with source type (mic or youtube) and optional voiceId
Backend opens a WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime
For YouTube: spawns yt-dlp | ffmpeg child processes to extract PCM audio
For Microphone: awaits audio_chunk events from the frontend

Audio Streaming

Audio chunks are sent to Scribe as JSON messages:

WebSocket Message

{
  "message_type": "input_audio_chunk",
  "audio_base_64": "UklGR..."  // PCM 16-bit LE, 16kHz, mono
}

Scribe Responses

Response Type	Meaning	Action
`partial_transcript`	Live partial text (speculative)	Emitted as non-final `transcript` event
`committed_transcript`	VAD fired — complete phrase	Buffered for commit merge window
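The table above can be read as a small routing function. The sketch below assumes each Scribe message is a JSON object with a message_type and a text field; the ScribeAction type and routeScribeMessage name are hypothetical and only illustrate the two documented cases.

```typescript
// Hypothetical result of routing one Scribe message.
type ScribeAction =
  | { kind: "emit_transcript"; text: string; isFinal: false }
  | { kind: "buffer_commit"; text: string }
  | { kind: "ignore" };

function routeScribeMessage(raw: string): ScribeAction {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { kind: "ignore" }; // not JSON, nothing to do
  }
  if (typeof parsed !== "object" || parsed === null) return { kind: "ignore" };

  // Assumed message shape: { message_type: string, text: string }.
  const msg = parsed as { message_type?: unknown; text?: unknown };
  if (typeof msg.text !== "string") return { kind: "ignore" };

  switch (msg.message_type) {
    case "partial_transcript":
      // Speculative text: forward to the client as a non-final transcript.
      return { kind: "emit_transcript", text: msg.text, isFinal: false };
    case "committed_transcript":
      // VAD fired: hold the phrase for the commit merge window.
      return { kind: "buffer_commit", text: msg.text };
    default:
      return { kind: "ignore" };
  }
}
```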

Commit Merge Buffer

After receiving a committed_transcript, the backend waits 2.5 seconds (COMMIT_MERGE_MS) to collect additional commits before translating. This prevents fragmented translations from aggressive VAD.
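A merge window of this kind might look like the sketch below. It assumes the window starts at the first commit and is not extended by later ones; the source does not say which behaviour the backend uses. CommitMergeBuffer and its method names are hypothetical.

```typescript
// Collects committed phrases and hands them over as one string when the window closes.
class CommitMergeBuffer {
  private parts: string[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;
  private readonly windowMs: number;
  private readonly onFlush: (merged: string) => void;

  constructor(windowMs: number, onFlush: (merged: string) => void) {
    this.windowMs = windowMs;
    this.onFlush = onFlush;
  }

  // Add one committed phrase; the first phrase starts the merge window.
  push(commit: string): void {
    const text = commit.trim();
    if (text === "") return;
    this.parts.push(text);
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.windowMs);
    }
  }

  // Close the window now (also called by the timer) and emit the merged text.
  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.parts.length === 0) return;
    const merged = this.parts.join(" ");
    this.parts = [];
    this.onFlush(merged);
  }
}
```

With a 2500 ms window, two commits arriving one second apart would reach the translation step as a single phrase rather than two fragments.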

Stability Timeout

If VAD stalls (no new commits), a 3.5 second fallback timer (STABILITY_TIMEOUT_MS) fires to translate whatever new text has accumulated, preventing indefinite silence.

Text Validation

Before translation, text is validated against EN/RU/UK character regex patterns. This filters out hallucinated text from the STT model (common with silence or background noise).
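The exact patterns are not given here, so the following is only a sketch of what such a filter could look like. The two regular expressions and the isPlausibleTranscript name are assumptions: the text must contain at least one Latin or Cyrillic letter and nothing outside a small set of letters, digits, whitespace, and punctuation.

```typescript
// Allowed characters: Latin letters, the Cyrillic block (covers Russian and Ukrainian),
// digits, whitespace, and common punctuation (dashes, ellipsis, guillemets via escapes).
const ALLOWED_TEXT = /^[A-Za-z\u0400-\u04FF0-9\s.,!?;:'"()\-\u2013\u2014\u2026\u00AB\u00BB]+$/;
const HAS_LETTER = /[A-Za-z\u0400-\u04FF]/;

// Returns true when the text looks like real EN/RU/UK speech rather than STT noise.
function isPlausibleTranscript(text: string): boolean {
  const trimmed = text.trim();
  return trimmed.length > 0 && HAS_LETTER.test(trimmed) && ALLOWED_TEXT.test(trimmed);
}

console.log(isPlausibleTranscript("Hello, world!"));
```

Text in an unexpected script, or output made only of punctuation or symbols, is rejected before it can reach a translation provider.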

Provider Chain

The system supports multiple translation providers with automatic fallback:

Default

LibreTranslate

Self-hosted, no API key required. Runs in Docker alongside the app. Best for privacy and cost.

Premium

DeepL

High-quality translations. Supports both free and paid API tiers. Auto-detects endpoint.

AI

Claude

Anthropic's Claude for context-aware translations. Uses claude-haiku-4-5 for speed.

Fallback Logic

Provider Resolution

1. Try primary provider (admin-selected)
2. If primary fails → try configured fallback
3. If fallback fails → try LibreTranslate (last resort)
4. If all fail → emit error event
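The four steps above could be expressed as follows. This is a sketch under assumptions: the TranslateFn signature, the provider names as plain strings, and the function names are illustrative, and the real backend emits an error event where this sketch throws.

```typescript
// Hypothetical provider signature: translate text from source to target.
type TranslateFn = (text: string, source: string, target: string) => Promise<string>;

// Build the try-order: primary, then configured fallback, then LibreTranslate.
function resolutionOrder(primary: string, fallback: string): string[] {
  const order = [primary];
  if (fallback !== "none" && order.indexOf(fallback) === -1) order.push(fallback);
  if (order.indexOf("libretranslate") === -1) order.push("libretranslate");
  return order;
}

// Try each provider in order and return the first successful translation.
async function translateWithFallback(
  text: string,
  source: string,
  target: string,
  order: string[],
  providers: Record<string, TranslateFn>,
): Promise<{ provider: string; text: string }> {
  const failures: string[] = [];
  for (const name of order) {
    const translate = providers[name];
    if (translate === undefined) {
      failures.push(`${name}: not configured`);
      continue;
    }
    try {
      return { provider: name, text: await translate(text, source, target) };
    } catch (err) {
      failures.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  // Step 4: every provider failed.
  throw new Error(`All translation providers failed (${failures.join("; ")})`);
}
```

Deduplicating the order means a setup where the primary and the fallback are both LibreTranslate makes one attempt, not three.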

Language Detection

The app uses a two-layer auto-detection approach:

Layer 1: Script-based Pre-detection

Before calling any translation API, the backend checks Unicode character scripts:

Cyrillic characters (code points U+0400–U+04FF) → if >50% of matched letters are Cyrillic, detected as Russian
Latin characters → detected as English
This avoids low-confidence results from LibreTranslate's /detect endpoint on short text
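A minimal sketch of this pre-detection step is shown below. The detectByScript name and the null result for text without letters are assumptions; the threshold and code point range follow the description above.

```typescript
type ScriptGuess = "ru" | "en" | null;

// Count Cyrillic and Latin letters and pick the majority script.
function detectByScript(text: string): ScriptGuess {
  let cyrillic = 0;
  let latin = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0x0400 && code <= 0x04ff) {
      cyrillic++; // any code point in the Cyrillic block
    } else if ((code >= 0x41 && code <= 0x5a) || (code >= 0x61 && code <= 0x7a)) {
      latin++; // A-Z or a-z
    }
  }
  const letters = cyrillic + latin;
  if (letters === 0) return null; // digits, punctuation, or empty input
  return cyrillic / letters > 0.5 ? "ru" : "en";
}

console.log(detectByScript("Привет, John"));
```

Because Russian and Ukrainian share the Cyrillic block, this layer cannot tell them apart on its own; that distinction is left to the STT language code.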

Layer 2: STT Language Code

When the auto_language_detect flag is enabled, ElevenLabs Scribe returns a language_code with each transcript commit. The backend uses this to correctly route EN/RU/UK without relying solely on script detection.

Note: For LibreTranslate, both Russian and Ukrainian Cyrillic text are passed with source ru, since LibreTranslate handles Ukrainian text acceptably via the Russian model. The DeepL and Claude providers distinguish Ukrainian natively and handle uk as a proper source language.

Language Gating

Detected languages are checked against the admin-approved pool. If a detected language isn't in the allowed set, the translation is rejected to prevent hallucinated language outputs.

TTS Pipeline

After translation, the text is sent to ElevenLabs TTS:

TypeScript

const stream = await client.textToSpeech.stream(voiceId, {
  text: translatedText,
  model_id: "eleven_multilingual_v2",
  output_format: "mp3_44100_128",
  voice_settings: {
    stability: 0.5,
    similarity_boost: 0.75,
    style: 0.0,
    speed: 1.0,
    use_speaker_boost: true
  }
});

Audio Delivery

TTS audio is streamed to a Buffer, then emitted as a base64-encoded MP3 via the tts_audio Socket.io event.

Frontend Playback Queue

The frontend maintains an audio queue to prevent overlapping playback:

Received tts_audio events are queued
Each segment plays to completion before the next starts
A configurable pause (600ms default) is inserted between segments
The pause duration is controlled by tts_segment_pause_ms (adjustable in admin)
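The queue behaviour described above can be sketched with a promise chain. Everything named here (TtsSegment, PlaybackQueue, the injected play and sleep functions) is hypothetical; in a browser the play function would typically wrap an audio element and resolve when playback ends.

```typescript
type TtsSegment = { id: number; audioBase64: string };
type PlayFn = (segment: TtsSegment) => Promise<void>;
type SleepFn = (ms: number) => Promise<void>;

const realSleep: SleepFn = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

class PlaybackQueue {
  private tail: Promise<void> = Promise.resolve();
  private active = 0; // segments queued or currently playing
  private readonly play: PlayFn;
  private readonly sleep: SleepFn;
  pauseMs: number; // would be updated from tts_segment_pause_ms

  constructor(play: PlayFn, pauseMs = 600, sleep: SleepFn = realSleep) {
    this.play = play;
    this.pauseMs = pauseMs;
    this.sleep = sleep;
  }

  // Queue a segment; it plays after everything already queued has finished.
  enqueue(segment: TtsSegment): Promise<void> {
    const needsPause = this.active > 0; // only pause between back-to-back segments
    this.active++;
    const run = this.tail.then(async () => {
      try {
        if (needsPause && this.pauseMs > 0) await this.sleep(this.pauseMs);
        await this.play(segment);
      } finally {
        this.active--;
      }
    });
    // Keep the chain alive even if one segment fails to play.
    this.tail = run.catch(() => undefined);
    return run;
  }
}
```

Injecting the sleep function keeps the pause testable without real timers, and chaining on a single tail promise guarantees segments never overlap.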

Microphone Input

User selects "Mic" tab and chooses a TTS voice
Browser captures audio via the Web Audio API's ScriptProcessorNode
PCM 16-bit LE at 16kHz sample rate sent to backend via Socket.io
Backend pipes audio to ElevenLabs Scribe v2 Realtime WebSocket
Language auto-detected (EN/RU/UK), text translated and synthesized
TTS audio returned and played back with inter-segment pauses
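The Web Audio API delivers samples as 32-bit floats in the range -1 to 1, so they have to be converted before being sent as 16-bit little-endian PCM. The function below is a common way to do that conversion, shown as a sketch; the floatTo16BitPCM name is an assumption, and resampling to 16 kHz is assumed to happen elsewhere.

```typescript
// Convert Web Audio float samples (-1..1) to signed 16-bit little-endian PCM.
function floatTo16BitPCM(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2); // 2 bytes per sample
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to the valid range so loud input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative values scale to -32768, positive values to 32767.
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, scaled, true); // true = little-endian
  }
  return buffer;
}

const pcm = floatTo16BitPCM(new Float32Array([0, 1, -1]));
console.log(pcm.byteLength);
```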

YouTube Input

User pastes a YouTube URL (live stream or video)
Backend spawns yt-dlp | ffmpeg child processes
Audio extracted as PCM stream (16kHz, 16-bit LE, mono)
Piped to Scribe v2, same pipeline as microphone
Stream ends when YouTube content ends or user stops

User Interface

The user view features a dark cavern theme with:

Waveform visualizer — Canvas-based bar chart with orange gradient and cyan tips
Transcript display — White translated text scrolls upward with fade masks
Partial transcript — Shown in italic orange while STT is processing
Source tabs — Toggle between Mic and YouTube (controlled by feature flags)

How It Works

The backend uses yt-dlp and ffmpeg as child processes to extract audio from YouTube URLs:

Pipeline

yt-dlp (best audio) → ffmpeg (PCM 16kHz 16-bit LE mono) → Scribe v2
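As a rough illustration of the two stages, the helper below builds argument lists for each process. The specific flags are an assumption about how such a pipeline is usually configured, not a copy of the backend's invocation; buildExtractionPipeline and the Command type are hypothetical names.

```typescript
type Command = { bin: string; args: string[] };

// Build the two commands: yt-dlp writes the best audio stream to stdout,
// ffmpeg reads it from stdin and writes raw PCM to stdout.
function buildExtractionPipeline(
  url: string,
  sampleRate = 16000,
  channels = 1,
): { ytDlp: Command; ffmpeg: Command } {
  return {
    ytDlp: {
      bin: "yt-dlp",
      args: ["--quiet", "--no-playlist", "-f", "bestaudio", "-o", "-", url],
    },
    ffmpeg: {
      bin: "ffmpeg",
      args: [
        "-loglevel", "error",
        "-i", "pipe:0",          // read from stdin
        "-f", "s16le",           // raw signed 16-bit little-endian output
        "-acodec", "pcm_s16le",
        "-ar", String(sampleRate),
        "-ac", String(channels),
        "pipe:1",                // write to stdout
      ],
    },
  };
}

const pipeline = buildExtractionPipeline("https://www.youtube.com/watch?v=VIDEO_ID");
console.log(pipeline.ytDlp.bin, pipeline.ytDlp.args.join(" "));
console.log(pipeline.ffmpeg.bin, pipeline.ffmpeg.args.join(" "));
```

With Node's child_process.spawn, each command can be started without a shell and the stdout of the first piped into the stdin of the second, which keeps the URL from being interpreted as shell syntax.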

Supported Sources

Live streams — Translates in real-time as the stream progresses
Regular videos — Processes the full audio track
Any URL supported by yt-dlp (YouTube, etc.)

Requirements

Both yt-dlp and ffmpeg must be installed and available in the system PATH. On macOS:

Terminal

brew install yt-dlp ffmpeg

⚠

Feature Flag Required

YouTube input is controlled by the youtube_input feature flag. Enable it in the admin panel to show the YouTube tab in the user view.

Overview

The Biblical Transcript Simulator is an admin-only feature that generates biblical text passages using Google's Gemini API (gemini-2.5-flash), then routes them through the full translation pipeline. This provides a hands-free way to test STT → Translation → TTS without a live audio source.

Language Styles

Language	Style	Example
`en`	King James English	"In the beginning was the Word..."
`ru`	Church Slavonic Russian	"В начале было Слово..."
`uk`	Traditional Ukrainian	"На початку було Слово..."

Flow

Admin selects language (EN/RU/UK)
Backend calls Gemini 2.5 Flash with streaming
Gemini generates 6-8 biblical passages, 3-5 sentences each
Stream is buffered until 140+ characters AND complete sentences
Chunks emitted with 1800ms smooth pacing between them
Each chunk flows through the standard pipeline:
- Emitted as transcript (isFinal: true)
- Auto-translated via configured provider
- TTS synthesized and audio returned
Frontend plays audio with standard inter-segment pause
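The buffering rule in the flow above (at least 140 characters and ending on a complete sentence) might be implemented along these lines. SentenceChunker is a hypothetical name, and treating ".", "!" and "?" as the only sentence terminators is an assumption.

```typescript
// Accumulates streamed text and releases chunks that are long enough
// and end on a sentence terminator.
class SentenceChunker {
  private buffer = "";
  private readonly minChars: number;

  constructor(minChars = 140) {
    this.minChars = minChars;
  }

  // Feed one streamed delta; returns zero or more ready chunks.
  push(delta: string): string[] {
    this.buffer += delta;
    const chunks: string[] = [];
    for (;;) {
      const cut = this.findCut();
      if (cut === -1) break;
      chunks.push(this.buffer.slice(0, cut).trim());
      this.buffer = this.buffer.slice(cut);
    }
    return chunks;
  }

  // Return whatever is left when the stream ends.
  flush(): string {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest;
  }

  // Index just past the first terminator at or beyond minChars, or -1.
  private findCut(): number {
    for (let i = Math.max(this.minChars - 1, 0); i < this.buffer.length; i++) {
      const ch = this.buffer[i];
      if (ch === "." || ch === "!" || ch === "?") return i + 1;
    }
    return -1;
  }
}
```

Each returned chunk would then be emitted on a timer (1800 ms apart in the flow above) so downstream translation and TTS receive text at a steady pace.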

💡

Feature Flag

Enable biblical_simulator in the admin feature flags panel. The Gemini API key is configured via the GEMINI_API_KEY environment variable or set at runtime in the admin API Keys panel.

Overview

Voice Training uses ElevenLabs' Instant Voice Cloning (IVC) API to create custom voices from audio samples. Once cloned, the voice appears in the voice selector immediately.

From Microphone

Open the Voice Training section in the admin panel
Click Generate Text to get an AI-generated reading passage (via Gemini) — gives the speaker natural, phonetically diverse text to read aloud
Record multiple audio clips using your browser microphone while reading the generated text
Provide a name for the voice
Clips are uploaded to ElevenLabs IVC API
Cloned voice is available for TTS immediately
Click Preview Voice to hear the cloned voice speak a sample sentence via TTS

From YouTube

Paste a YouTube URL in the Voice Training section
Backend extracts N × 30-second clips via yt-dlp + ffmpeg
Clips are uploaded to ElevenLabs IVC API
Resulting voice is stored in your ElevenLabs account

⚠

ElevenLabs Account

Cloned voices are stored in your ElevenLabs account, not locally. Ensure your plan supports voice cloning.

Concepts

Concept	Description
Active Language Pair	The current pair used for translation (e.g., EN ↔ RU, EN ↔ UK, or RU ↔ UK). Set by admin.
Available Languages	The pool of languages viewers can select from (if `user_language_selector` is enabled).

Admin Controls

Change the active language pair via the admin panel
Changes broadcast to all connected clients in real-time
Manage the available languages pool for viewer selection

Viewer Selection

When the user_language_selector feature flag is enabled, viewers can override the admin-set language pair by selecting their own preferred languages from the available pool.

Overview

Two people can video call each other through the app, each speaking their own language. The app transcribes, translates, and synthesizes speech in real-time so each participant hears the other in their language.

Feature flag: Video call is gated behind the video_translation flag. Enable it in the admin panel or set video_translation: true in your YAML config.

How It Works

Create a room — Person A selects their language, picks a TTS voice, and clicks "Create Room". A 6-character room code is generated.
Share the code — Person A shares the room code with Person B (copy button provided).
Join the room — Person B enters the code, selects their language and TTS voice, and clicks "Join".
WebRTC connection — The app establishes a peer-to-peer video connection via WebRTC (signaled through Socket.io). Video flows directly between browsers.
Audio translation — Each participant's microphone audio is simultaneously:
- Sent to the peer via WebRTC (but muted on their end)
- Captured as PCM chunks and sent to the backend via Socket.io for STT
Translation pipeline — Each participant has their own independent Scribe STT session. Transcribed text is translated to the other participant's language, then synthesized via ElevenLabs TTS and sent back to the peer.
Playback — The peer hears the TTS translation instead of the raw audio. Translated transcript is displayed below the video.

Architecture

Person A (Browser)         Server              Person B (Browser)
├─ getUserMedia            ├─ Socket.io         ├─ getUserMedia
├─ WebRTC P2P ═══video═══►│  (signaling)  ◄═══ ├─ WebRTC P2P
│                          │                    │
├─ PCM chunks ──Socket.io─►├─ ScribeA(STT)      │
│                          │  ↓ translate       │
│                          │  ↓ TTS ───────────►├─ Plays TTS
│                          │                    │
│  Plays TTS ◄─────────────├─ ScribeB(STT) ◄───├─ PCM chunks
│  (remote video muted)    │  ↓ translate       │  (remote video muted)
└──────────────────────────┴────────────────────┘

Socket Events

Event	Direction	Purpose
`video_create_room`	C→S	Create a new room with language + voice
`video_room_created`	S→C	Returns the 6-char room code
`video_join_room`	C→S	Join an existing room
`video_room_joined`	S→C	Sent to both participants, triggers WebRTC
`video_signal_offer/answer/ice`	C↔S	WebRTC signaling relay
`video_audio_chunk`	C→S	PCM audio for STT processing
`video_transcript`	S→C	Transcript sent to the speaker
`video_translation`	S→C	Translation sent to the listener
`video_tts_audio`	S→C	TTS audio sent to the listener
`video_leave_room`	C→S	Leave the room
`video_room_closed`	S→C	Notify peer when other leaves

Room Lifecycle

Rooms are stored in Redis with key video_room:{code} and a 4-hour TTL
Maximum 2 participants per room
When one participant disconnects, the other is notified and the call ends
Scribe sessions are automatically cleaned up on disconnect
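The room code and Redis key conventions above can be illustrated with a few helpers. The alphabet (which omits look-alike characters such as 0/O and 1/I), the function names, and the use of Math.random are all assumptions made for this sketch; a production generator would more likely use a cryptographic random source.

```typescript
// Assumed alphabet: uppercase letters and digits without 0, O, 1, I.
const ROOM_CODE_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";
const ROOM_CODE_LENGTH = 6;
const ROOM_TTL_SECONDS = 4 * 60 * 60; // 4-hour TTL from the lifecycle rules
const MAX_PARTICIPANTS = 2;

// Generate a 6-character room code; `random` is injectable for testing.
function generateRoomCode(random: () => number = Math.random): string {
  let code = "";
  for (let i = 0; i < ROOM_CODE_LENGTH; i++) {
    const index = Math.floor(random() * ROOM_CODE_ALPHABET.length);
    code += ROOM_CODE_ALPHABET.charAt(index);
  }
  return code;
}

// Redis key for a room, following the video_room:{code} convention.
function roomKey(code: string): string {
  return `video_room:${code}`;
}

// A room accepts a join only while it has fewer than two participants.
function canJoin(participantCount: number): boolean {
  return participantCount < MAX_PARTICIPANTS;
}

console.log(roomKey(generateRoomCode()), ROOM_TTL_SECONDS);
```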

The Mac Audio Agent has moved to its own public repository:

github.com/Pzharyuk/live-translator-agent

It is a lightweight Node.js daemon that runs as a macOS LaunchAgent and streams microphone audio to the live-translator backend via Socket.io — eliminating the need to open a browser for the Remote Audio Source role.

Feature Flags

Feature flags control which routes and UI sections are enabled in the application. Defaults are defined in config/application.yaml under the feature_flags section. At runtime, Redis overrides can be set via the Admin API to toggle flags without restarting the server.

Flag Registry

Flag	Default	Description
`youtube_input`	true	Allow audio input from YouTube URLs.
`mic_input`	true	Allow audio input from browser microphone.
`auto_language_detect`	true	Enable automatic source language detection during transcription.
`user_language_selector`	false	Allow viewers to select language pair from available pool.
`audio_device_selector`	true	Enable audio device selection UI in broadcast admin panel.
`video_translation`	true	Enable the `/video` peer-to-peer video call translation route.
`video_voice_cloning`	false	Premium feature — show Instant Voice Clone button in `/video` lobby.
`remote_audio_source`	false	Enable `/audio-source` route for headless remote audio relay agents.
`agent_audio_source`	false	Show connected remote audio source agents section in admin panel.
`broadcast`	false	Enable `/broadcast` public receiver page & broadcast admin controls.
`translate`	false	Enable `/translate` live translator page.

Storage & Runtime Override

Feature flags are persisted in Redis with the key prefix flag:. On server startup, config defaults are loaded. Admin API calls merge YAML defaults with Redis overrides and broadcast changes to all connected Socket.IO clients via the feature_flags event, enabling real-time UI updates without page refresh.
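The merge itself is small enough to sketch. The snippet assumes Redis stores each flag as the string "true" or "false" under a key with the flag: prefix; the storage format and the function names are assumptions for illustration.

```typescript
type Flags = Record<string, boolean>;

// Convert raw Redis entries such as { "flag:broadcast": "true" } into flags.
function parseRedisFlags(entries: Record<string, string>): Flags {
  const flags: Flags = {};
  for (const key of Object.keys(entries)) {
    if (key.indexOf("flag:") !== 0) continue; // skip unrelated keys
    flags[key.slice("flag:".length)] = entries[key] === "true";
  }
  return flags;
}

// Redis overrides win; with no overrides (Redis unavailable) defaults are returned.
function mergeFlags(yamlDefaults: Flags, redisOverrides: Flags | null): Flags {
  return { ...yamlDefaults, ...(redisOverrides ?? {}) };
}

const defaults: Flags = { youtube_input: true, broadcast: false };
const overrides = parseRedisFlags({ "flag:broadcast": "true" });
console.log(mergeFlags(defaults, overrides));
```

The merged object is what would be sent in the feature_flags event, so clients always see the effective state rather than raw overrides.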

Admin API Endpoints

# Fetch all flags (merged defaults + Redis overrides)
GET /admin/flags
→ { "flags": { "youtube_input": true, "broadcast": false, ... } }

# Get a single flag
GET /admin/flags/:flag
→ { "flag": "broadcast", "value": false }

# Set a flag at runtime (updates Redis & broadcasts to clients)
POST /admin/flags/:flag
Body: { "value": true }
→ { "flag": "broadcast", "value": true }
# All connected clients receive: event 'feature_flags' with updated merged state

File Structure

File	Purpose
`config/application.yaml`	Base defaults for all environments
`config/application-local.yaml`	Local development overrides (localhost URLs)
`config/application-prod.yaml`	Production overrides (Docker service names)

The APP_ENV environment variable (local or prod) determines which overlay file is loaded on top of the base config.
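Layering an overlay on top of the base config usually means a recursive merge in which overlay values replace base values key by key. The sketch below assumes that behaviour and assumes unknown APP_ENV values fall back to local; applyOverlay and overlayFileFor are hypothetical names, and YAML parsing is left out.

```typescript
type ConfigValue = string | number | boolean | null | ConfigValue[] | { [key: string]: ConfigValue };
type ConfigObject = { [key: string]: ConfigValue };

function isConfigObject(value: ConfigValue | undefined): value is ConfigObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively merge overlay into base; scalars and arrays in the overlay replace base values.
function applyOverlay(base: ConfigObject, overlay: ConfigObject): ConfigObject {
  const result: ConfigObject = { ...base };
  for (const key of Object.keys(overlay)) {
    const baseValue = result[key];
    const overlayValue = overlay[key];
    if (overlayValue === undefined) continue;
    result[key] =
      isConfigObject(baseValue) && isConfigObject(overlayValue)
        ? applyOverlay(baseValue, overlayValue)
        : overlayValue;
  }
  return result;
}

// Pick the overlay file for an APP_ENV value (assumed default: local).
function overlayFileFor(appEnv: string | undefined): string {
  return `config/application-${appEnv === "prod" ? "prod" : "local"}.yaml`;
}

const base: ConfigObject = { server: { port: 3001 }, redis: { host: "redis", port: 6379 } };
const localOverlay: ConfigObject = { redis: { host: "localhost" } };
console.log(overlayFileFor("local"), JSON.stringify(applyOverlay(base, localOverlay)));
```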

Full Configuration Reference

config/application.yaml

server:
  port: 3001
  cors_origin: "http://localhost:5173"

elevenlabs:
  api_key: "${ELEVENLABS_API_KEY}"
  default_voice_id: "kxj9qk6u5PfI0ITgJwO0"
  tts_model: "eleven_multilingual_v2"
  tts_settings:
    stability: 0.5
    similarity_boost: 0.75
    style: 0.0
    speed: 1.0
    use_speaker_boost: true
  stt_model: "scribe_v2"

anthropic:
  api_key: "${ANTHROPIC_API_KEY}"

deepl:
  api_key: "${DEEPL_API_KEY}"

libretranslate:
  url: "http://libretranslate:5000"
  api_key: ""

redis:
  host: "redis"
  port: 6379
  password: ""

feature_flags:
  youtube_input: true
  mic_input: true
  auto_language_detect: true
  user_language_selector: false
  audio_device_selector: true
  video_translation: false
  video_voice_cloning: false
  broadcast: false

audio:
  sample_rate: 16000
  channels: 1
  chunk_duration_ms: 250

translation:
  source_lang: "auto"
  target_lang_en: "en"
  target_lang_ru: "ru"
  provider: "libretranslate"
  fallback: "libretranslate"

Environment Variable Interpolation

YAML values using ${VAR_NAME} syntax are automatically replaced with the corresponding environment variable at startup.
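A placeholder substitution of this kind can be sketched in a few lines. The interpolateEnv name is hypothetical, and replacing an unset variable with an empty string is an assumption; the real loader may behave differently for missing variables.

```typescript
// Replace every ${VAR_NAME} in a config string with the matching environment value.
function interpolateEnv(value: string, env: Record<string, string | undefined>): string {
  return value.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_match: string, name: string) => env[name] ?? "", // assumed: unset variables become ""
  );
}

const exampleEnv = { ELEVENLABS_API_KEY: "sk-your-key-here" };
console.log(interpolateEnv("${ELEVENLABS_API_KEY}", exampleEnv));
```

Passing the environment in as an argument, rather than reading a global, keeps the function easy to test and makes it clear which variables a config file depends on.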

Variable	Required	Default	Description
`ELEVENLABS_API_KEY`	Yes	—	ElevenLabs API key for text-to-speech & speech-to-text services.
`ELEVENLABS_VOICE_ID`	No	`JBFqnCBsd6RMkjVDRZzb`	Default ElevenLabs voice ID for TTS output.
`ANTHROPIC_API_KEY`	No	—	Anthropic API key for sermon generation & Claude translation provider (optional, can be set in Admin UI).
`GEMINI_API_KEY`	No	—	Google Gemini API key for biblical simulator & sermon generation.
`GOOGLE_TRANSLATE_API_KEY`	No	—	Google Cloud Translation API key (default translation provider).
`DEEPL_API_KEY`	No	—	DeepL API key for translation (optional, used when `translation.provider=deepl`).
`YOUTUBE_API_KEY`	No	—	YouTube Data API v3 key for live stream lookup (can be set in Admin UI).
`YOUTUBE_CHANNEL_ID`	No	—	Default YouTube channel ID to search for live streams (can be changed in Admin UI).
`TRANSLATION_PROVIDER`	No	`libretranslate`	Primary translation provider: `google` | `deepl` | `claude` | `libretranslate` (can be changed live in Admin UI).
`APP_ENV`	No	`local`	Application environment: `local` (dev) or `prod` (Docker).
`FRONTEND_URL`	No	`http://localhost`	Frontend URL used for CORS origin in production (e.g., `https://translate.example.com`).
`LISTEN_PORT`	No	`80`	Host port the frontend listens on.
`REDIS_PASSWORD`	No	—	Redis authentication password (leave empty for no auth).
`LIBRETRANSLATE_API_KEY`	No	—	LibreTranslate API key if your instance requires authentication.
`ADMIN_PASSWORD`	No	`admin123`	Legacy socket auth password for admin page (must be changed in production).
`APP_ADMIN_USERNAME`	No	`admin`	Admin user seeded into database on first boot.
`APP_ADMIN_PASSWORD`	No	`admin123`	Admin user password (must be changed in production).
`APP_USERNAME`	No	`user`	User-facing login username (must be changed in production).
`APP_PASSWORD`	No	`changeme`	User-facing login password (must be changed in production).
`JWT_SECRET`	No	—	JWT secret for session cookies (generate with `openssl rand -hex 32` in production).
`COOKIE_SECURE`	No	`true`	Enable secure cookies when serving over HTTPS (Cloudflare Tunnel always uses HTTPS).
`DB_PASSWORD`	No	—	PostgreSQL database password.
`server.port`	No	`3001`	Backend API server port (application.yaml).
`server.cors_origin`	No	`http://localhost:5173`	CORS origin for frontend requests (application.yaml).
`elevenlabs.tts_model`	No	`eleven_multilingual_v2`	ElevenLabs TTS model ID (application.yaml).
`elevenlabs.stt_model`	No	`scribe_v2_realtime`	ElevenLabs STT model ID (application.yaml).
`elevenlabs.tts_settings.stability`	No	`0.5`	TTS voice stability (0–1, higher = more consistent).
`elevenlabs.tts_settings.similarity_boost`	No	`0.75`	TTS voice similarity boost (0–1).
`elevenlabs.tts_settings.style`	No	`0.0`	TTS style (0–1).
`elevenlabs.tts_settings.speed`	No	`1.0`	TTS playback speed (0.5–2.0).
`elevenlabs.tts_settings.use_speaker_boost`	No	`true`	Enable ElevenLabs speaker boost for TTS.
`libretranslate.url`	No	`http://libretranslate:5000`	LibreTranslate server URL (application.yaml).
`database.host`	No	`postgres`	PostgreSQL host (application.yaml).
`database.port`	No	`5432`	PostgreSQL port (application.yaml).
`database.username`	No	`translator`	PostgreSQL username (application.yaml).
`database.database`	No	`translator_db`	PostgreSQL database name (application.yaml).
`database.pool_size`	No	`10`	PostgreSQL connection pool size (application.yaml).
`redis.host`	No	`redis`	Redis host (application.yaml).
`redis.port`	No	`6379`	Redis port (application.yaml).
`audio.sample_rate`	No	`16000`	Audio sample rate in Hz (application.yaml).
`audio.channels`	No	`1`	Audio channels (1 = mono).
`audio.chunk_duration_ms`	No	`250`	Audio chunk duration in milliseconds.
`translation.source_lang`	No	`auto`	Source language for translation (“auto” → auto-detect).
`translation.target_lang_en`	No	`en`	Target language when user selects English.
`translation.target_lang_ru`	No	`ru`	Target language when user selects Russian.
`translation.provider`	No	`google`	Primary translation provider: `google` | `deepl` | `claude` | `libretranslate`.
`translation.fallback`	No	`libretranslate`	Fallback provider when primary fails: `google` | `deepl` | `claude` | `libretranslate` | `none`.
`translation.translate_workers`	No	`2`	Number of parallel translation workers in the TTS pipeline.
`tts_pipeline.initial_buffer_segments`	No	`2`	Number of translated segments to buffer before starting TTS playback.
`tts_pipeline.low_water_hold_ms`	No	`1500`	Low-water-mark hold time (ms) to wait for next segment before emitting current audio segment (0 = disabled).
`auth.admin_username`	No	`admin`	Admin login username (application.yaml).
`auth.admin_password`	No	`admin123`	Admin login password (application.yaml).
`auth.session_days`	No	`30`	Session expiry in days (application.yaml).
`feature_flags.youtube_input`	No	`true`	Enable YouTube audio source input.
`feature_flags.mic_input`	No	`true`	Enable microphone audio input.
`feature_flags.auto_language_detect`	No	`true`	Enable automatic language detection.
`feature_flags.user_language_selector`	No	`false`	Allow users to select target language (admin-configured by default).
`feature_flags.audio_device_selector`	No	`true`	Enable audio device selection in UI.
`feature_flags.video_translation`	No	`true`	Enable video call translation feature.
`feature_flags.video_voice_cloning`	No	`false`	Enable voice cloning in video calls (premium feature).
`feature_flags.remote_audio_source`	No	`false`	Enable `/audio-source` route for headless remote audio relay.
`feature_flags.agent_audio_source`	No	`false`	Show connected agent audio sources section in admin panel.
`feature_flags.broadcast`	No	`false`	Enable `/broadcast` route (public receiver page).
`feature_flags.translate`	No	`false`	Enable `/translate` route (live translator page).

TTS Settings

Configure ElevenLabs text-to-speech parameters and STT timing behavior via the admin API.

API Endpoints

GET /admin/tts-settings
Returns current TTS settings.

Response:
{
  "settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.0,
    "speed": 1.0,
    "use_speaker_boost": true
  }
}

---

POST /admin/tts-settings
Update one or more TTS settings (partial update).

Request body:
{
  "stability": 0.6,
  "speed": 1.1
}

Response: Updated settings object

TTS Settings Reference

Setting	Range	Default	Description
`stability`	0.0 – 1.0	0.5	Voice stability: higher values produce more consistent pronunciation, lower values add variation and emotion.
`similarity_boost`	0.0 – 1.0	0.75	Similarity to voice sample: higher values adhere more closely to the voice's characteristics.
`style`	0.0 – 1.0	0.0	Style exaggeration: adds expressiveness and emotion to the voice (ElevenLabs v2 voices only).
`speed`	0.5 – 2.0	1.0	Playback speed multiplier: 1.0 = normal, <1.0 = slower, >1.0 = faster.
`use_speaker_boost`	true | false	true	Speaker boost: improves clarity and presence (uses slightly more API credits).
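A partial update against these ranges could be validated as sketched below. Whether the server rejects or clamps out-of-range values is not stated here; this sketch assumes rejection, and the TtsSettings interface and applyTtsPatch function are hypothetical names.

```typescript
interface TtsSettings {
  stability: number;
  similarity_boost: number;
  style: number;
  speed: number;
  use_speaker_boost: boolean;
}

const TTS_DEFAULTS: TtsSettings = {
  stability: 0.5,
  similarity_boost: 0.75,
  style: 0.0,
  speed: 1.0,
  use_speaker_boost: true,
};

// Inclusive [min, max] ranges taken from the reference table.
const TTS_RANGES: Record<"stability" | "similarity_boost" | "style" | "speed", [number, number]> = {
  stability: [0, 1],
  similarity_boost: [0, 1],
  style: [0, 1],
  speed: [0.5, 2],
};

// Merge a partial update into the current settings, rejecting invalid values.
function applyTtsPatch(current: TtsSettings, patch: Partial<TtsSettings>): TtsSettings {
  const next: TtsSettings = { ...current, ...patch };
  for (const key of Object.keys(TTS_RANGES) as Array<keyof typeof TTS_RANGES>) {
    const [min, max] = TTS_RANGES[key];
    const value = next[key];
    if (typeof value !== "number" || isNaN(value) || value < min || value > max) {
      throw new RangeError(`${key} must be between ${min} and ${max}`);
    }
  }
  if (typeof next.use_speaker_boost !== "boolean") {
    throw new TypeError("use_speaker_boost must be true or false");
  }
  return next;
}

console.log(applyTtsPatch(TTS_DEFAULTS, { stability: 0.6, speed: 1.1 }));
```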

STT Timing Settings Reference

Control speech-to-text detection, buffering, and dispatch behavior.

GET /admin/stt-timing
Returns current STT timing configuration.

POST /admin/stt-timing
Update STT timing (partial update).

Request body:
{
  "commit_merge_ms": 1500,
  "stability_timeout_ms": 2000,
  "max_accumulation_ms": 8000,
  ...
}

Setting	Range	Default	Description
`commit_merge_ms`	500 – 5000	1500	Buffer VAD commits for this many milliseconds before flushing to translation, merging short speech fragments into coherent chunks.
`stability_timeout_ms`	500 – 5000	2000	Fallback timer — if partial transcript unchanged for this duration, dispatch for translation (compensates for unreliable VAD).
`tts_segment_pause_ms`	0 – 1000	0	Pause duration between consecutive audio segments on frontend playback — frontend reads this value.
`max_accumulation_ms`	3000 – 15000	8000	Force dispatch of accumulated words after this duration of continuous speech, ensuring translation happens during sermons even without VAD commits.
`vad_threshold`	0.0 – 1.0	0.5	VAD noise filter strictness — higher = stricter, lower = more permissive; sent to ElevenLabs Scribe endpoint.
`vad_silence_threshold_secs`	0.5 – 3.0	1.0	Seconds of silence before VAD triggers a commit — sent to ElevenLabs Scribe endpoint.
`min_speech_duration_ms`	50 – 500	100	Ignore speech shorter than this duration (noise suppression) — sent to ElevenLabs Scribe endpoint.
`min_silence_duration_ms`	50 – 500	100	Minimum silence gap to reset speech detection — sent to ElevenLabs Scribe endpoint.
`flush_on_sentence_boundary`	true | false	true	When enabled, split buffered commits and accumulated words at sentence boundaries (.?!) so translation receives complete sentences.
`min_chars_before_dispatch`	20 – 200	40	Minimum characters required before a chunk is dispatched for translation — prevents tiny fragments from being translated separately.

Video Call Settings Reference

Separate configuration for real-time video call STT/TTS (lower latency requirements).

GET /admin/video-settings
Returns video call settings.

POST /admin/video-settings
Update video call settings (partial update).

Request body:
{
  "stability_ms": 500,
  "commit_merge_ms": 50,
  "translation_provider": "claude"
}

Setting	Range	Default	Description
`stability_ms`	200 – 2000	500	Milliseconds to wait for stable partial text before translating in video calls (lower = faster response).
`commit_merge_ms`	0 – 500	50	Milliseconds to merge VAD commits in video calls (lower = more responsive but more API calls).
`translation_provider`	libretranslate | claude | deepl | google	claude	Translation provider used exclusively for video call sessions (does not affect broadcast or private translator).

Notes

Persistence: All TTS and STT settings are persisted to Redis. Changes apply immediately to new sessions; active sessions use settings from boot time.
VAD Parameters: vad_threshold, vad_silence_threshold_secs, min_speech_duration_ms, and min_silence_duration_ms are sent as WebSocket query parameters to the ElevenLabs Scribe endpoint. See ElevenLabs Scribe documentation for details.
Translation Dispatch Logic: Speech is dispatched for translation when any of the following fires: VAD commit (followed by commit_merge_ms buffer window), stability timer (unchanged partial for stability_timeout_ms), sentence boundary detection, or accumulation timer (max_accumulation_ms). The first to fire wins; timers are cancelled to avoid duplicate translation.
Sentence Boundary: When flush_on_sentence_boundary is enabled, chunks are split at .?! boundaries. Incomplete sentences are held until the next cycle, reducing fragmentation in translation output.
Minimum Characters: min_chars_before_dispatch prevents tiny fragments (e.g., “OK”, “Yes”) from being translated individually. The dispatcher waits until the chunk reaches this threshold.
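The sentence-boundary and minimum-character rules in these notes can be combined into one decision function. This is a sketch of the described behaviour, not the dispatcher itself: the function names are hypothetical, and a boundary is assumed to be ".", "?" or "!" followed by whitespace or the end of the text.

```typescript
// Split text into the part ending at the last sentence boundary and the rest.
function splitAtSentenceBoundary(text: string): { complete: string; remainder: string } {
  const match = text.match(/^[\s\S]*[.?!](?=\s|$)/);
  if (match === null) return { complete: "", remainder: text };
  return { complete: match[0].trim(), remainder: text.slice(match[0].length).trim() };
}

// Decide what to send for translation now and what to hold for the next cycle.
function nextDispatch(
  buffer: string,
  minChars: number,
  flushOnSentenceBoundary: boolean,
): { dispatch: string | null; held: string } {
  const text = buffer.trim();
  if (!flushOnSentenceBoundary) {
    return text.length >= minChars ? { dispatch: text, held: "" } : { dispatch: null, held: text };
  }
  const { complete, remainder } = splitAtSentenceBoundary(text);
  if (complete.length >= minChars) return { dispatch: complete, held: remainder };
  return { dispatch: null, held: text }; // too short or no complete sentence yet
}

console.log(nextDispatch("The quick brown fox jumps over the lazy dog. And then", 40, true));
```

A short reply such as "Yes." stays in the buffer until more text arrives, while a long sentence is released as soon as its closing punctuation is seen.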

STT Timing Settings

Configure speech-to-text timing parameters that control when transcribed audio is dispatched for translation.

Settings Reference

Setting	Default	Description
`commit_merge_ms`	1500	Milliseconds to buffer VAD commits before translating (merges short fragments into coherent chunks).
`stability_timeout_ms`	2000	Milliseconds to wait for stable partial text before translating (fires if text unchanged for this duration).
`tts_segment_pause_ms`	0	Pause between TTS audio segments (ms) — sent to frontend for playback timing.
`max_accumulation_ms`	8000	Maximum time to accumulate words during continuous speech before force-dispatching for translation (prevents stalled transcription during long utterances).
`vad_threshold`	0.5	Voice Activity Detection strictness (0–1, higher = stricter noise filter).
`vad_silence_threshold_secs`	1.0	Seconds of silence required before VAD commits the transcript.
`min_speech_duration_ms`	100	Ignore speech shorter than this (milliseconds).
`min_silence_duration_ms`	100	Minimum silence gap in milliseconds.
`flush_on_sentence_boundary`	`true`	When enabled, flush commit buffer at sentence boundaries (.?!) instead of all at once (improves naturalness).
`min_chars_before_dispatch`	40	Minimum characters before a chunk is dispatched for translation (prevents tiny, incomplete fragments).

Tuning Guide

Faster response: Reduce commit_merge_ms (e.g., 800–1200 ms) and stability_timeout_ms (e.g., 1000–1500 ms). Trade-off: may produce fragmented translations.
Merge short fragments: Increase commit_merge_ms (e.g., 2000–3000 ms) to wait longer for related commits before translating.
Continuous speech (sermon): Lower max_accumulation_ms (e.g., 5000–6000 ms) so that non-stop speech is force-dispatched in shorter chunks at regular intervals.
Noisy environment: Increase vad_threshold (e.g., 0.6–0.8) to reject more background noise.
Quiet speaker: Decrease vad_threshold (e.g., 0.3–0.4) to capture softer speech.
Prevent tiny fragments: Raise min_chars_before_dispatch (e.g., 60–100) to wait for more complete sentences.
Eager sentence dispatch: Enable flush_on_sentence_boundary so complete sentences are sent immediately after punctuation, rather than waiting for silence.

API Endpoints

GET /admin/stt-timing
Returns the current STT timing settings.

Response:
{
  "settings": {
    "commit_merge_ms": 1500,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 0,
    "max_accumulation_ms": 8000,
    "vad_threshold": 0.5,
    "vad_silence_threshold_secs": 1.0,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true,
    "min_chars_before_dispatch": 40
  }
}


POST /admin/stt-timing
Update one or more STT timing settings.

Request Body:
{
  "commit_merge_ms": 1200,
  "max_accumulation_ms": 6000,
  "vad_threshold": 0.6,
  "flush_on_sentence_boundary": true
}

Response:
{
  "settings": {
    "commit_merge_ms": 1200,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 0,
    "max_accumulation_ms": 6000,
    "vad_threshold": 0.6,
    "vad_silence_threshold_secs": 1.0,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true,
    "min_chars_before_dispatch": 40
  }
}
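
The request and response above suggest partial-update semantics: only the supplied keys change, and the full settings object comes back. A minimal sketch of that merge, assuming the server behaves this way. `SttTimingSettings`, `DEFAULT_STT_TIMING`, and `applySttTimingPatch` are illustrative names, and the range check on `vad_threshold` is an assumption based on the documented 0–1 range.

```typescript
interface SttTimingSettings {
  commit_merge_ms: number;
  stability_timeout_ms: number;
  tts_segment_pause_ms: number;
  max_accumulation_ms: number;
  vad_threshold: number;
  vad_silence_threshold_secs: number;
  min_speech_duration_ms: number;
  min_silence_duration_ms: number;
  flush_on_sentence_boundary: boolean;
  min_chars_before_dispatch: number;
}

// Defaults as listed in the Settings Reference table.
const DEFAULT_STT_TIMING: SttTimingSettings = {
  commit_merge_ms: 1500,
  stability_timeout_ms: 2000,
  tts_segment_pause_ms: 0,
  max_accumulation_ms: 8000,
  vad_threshold: 0.5,
  vad_silence_threshold_secs: 1.0,
  min_speech_duration_ms: 100,
  min_silence_duration_ms: 100,
  flush_on_sentence_boundary: true,
  min_chars_before_dispatch: 40,
};

// Merge a partial update into the current settings; untouched keys keep their value.
function applySttTimingPatch(
  current: SttTimingSettings,
  patch: Partial<SttTimingSettings>
): SttTimingSettings {
  const next: SttTimingSettings = { ...current, ...patch };
  if (next.vad_threshold < 0 || next.vad_threshold > 1) {
    throw new RangeError("vad_threshold must be between 0 and 1");
  }
  return next;
}
```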

Socket Events

The frontend receives STT timing settings on connection and whenever admin updates occur:

socket.on('stt_timing', (data) => {
  // data.tts_segment_pause_ms — pause to insert between TTS segments (ms)
  console.log('TTS pause:', data.tts_segment_pause_ms);
});
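
How the frontend applies the pause is not spelled out on this page. One plausible reading is a fixed gap between the end of one TTS segment and the start of the next; the sketch below computes start offsets under that assumption. `scheduleSegmentStarts` is a hypothetical helper, not part of the codebase.

```typescript
// Given segment durations (ms) and the configured pause (ms), return the
// start offset of each segment when segments play in order with a gap.
function scheduleSegmentStarts(durationsMs: number[], pauseMs: number): number[] {
  const starts: number[] = [];
  let cursor = 0;
  for (const duration of durationsMs) {
    starts.push(cursor);
    cursor += duration + pauseMs;
  }
  return starts;
}

// Three segments of 1000, 500 and 800 ms with a 250 ms pause
// start at 0, 1250 and 2000 ms.
const exampleStarts = scheduleSegmentStarts([1000, 500, 800], 250);
```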

Authentication: All endpoints require a valid JWT cookie (auth), obtained via POST /api/login. Admin access requires is_admin=true or a role with the appropriate permissions in the JWT payload.
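
The admin-access rule can be read as a simple either/or check. The sketch below assumes the decoded JWT payload carries an `is_admin` boolean and a flattened `permissions` array; the payload shape and the `hasAdminAccess` helper are assumptions for illustration, since the actual token layout is not documented here.

```typescript
interface AuthPayload {
  is_admin?: boolean;
  // Assumption: permissions from all assigned roles, flattened into one list.
  permissions?: string[];
}

// Full admins pass every check; other users need the specific permission.
function hasAdminAccess(payload: AuthPayload, requiredPermission: string): boolean {
  if (payload.is_admin === true) return true;
  const granted = payload.permissions ?? [];
  return granted.indexOf(requiredPermission) !== -1;
}
```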

API Keys

GET /admin/api-keys

Retrieve all API key names and their current setup status (configured or not).

POST /admin/api-keys

Update one or more API keys (elevenlabs, anthropic, deepl, libretranslate, google, youtube).

Body: {
  "elevenlabs": "sk_...",
  "anthropic": "sk_...",
  "deepl": "...",
  "libretranslate": "...",
  "google": "...",
  "youtube": "..."
}

YouTube Live Stream Lookup & Channel Configuration

GET /admin/youtube/channel-id

Get the configured YouTube channel ID and whether it came from environment variables.

PUT /admin/youtube/channel-id

Set the YouTube channel ID for live stream lookups.

Body: {
  "channelId": "UCxxxxxxxxxxxxxxxxxxxxxx"
}

GET /admin/youtube/live-streams

Retrieve live streams from a YouTube channel (uses YouTube API or yt-dlp fallback). Query param: ?channelId=... (optional, defaults to configured channel).

Voice Management

GET /admin/voices

Scan ElevenLabs and retrieve all available voices with metadata (name, category, preview URL).

GET /admin/available-voices

Get the list of voice IDs that viewers are allowed to choose from (null = all voices allowed).

POST /admin/available-voices

Set the list of allowed voice IDs for viewers. Broadcasts to all connected clients in real-time.

Body: {
  "voiceIds": ["kxj9qk6u5PfI0ITgJwO0", "nPczCjzI2devNBz1zQrb"]
}
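
The "null = all voices allowed" rule can be sketched as a small filter. The `Voice` shape and the `voicesForViewers` name are illustrative assumptions; only the null semantics come from the endpoint description above.

```typescript
interface Voice {
  voiceId: string;
  name: string;
}

// null means no restriction; otherwise keep only the allow-listed voice IDs.
function voicesForViewers(all: Voice[], allowedIds: string[] | null): Voice[] {
  if (allowedIds === null) return all.slice();
  const allowed = new Set(allowedIds);
  return all.filter((voice) => allowed.has(voice.voiceId));
}
```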

Feature Flags

GET /admin/flags

Retrieve all feature flags merged from YAML config defaults & Redis overrides.

GET /admin/flags/:flag

Get a single feature flag value.

POST /admin/flags/:flag

Set a feature flag value and broadcast to all connected clients.

Body: {
  "value": true
}

TTS & STT Settings

GET /admin/tts-settings

Retrieve current TTS voice settings (stability, similarity_boost, style, speed, use_speaker_boost).

POST /admin/tts-settings

Update TTS voice settings.

Body: {
  "stability": 0.5,
  "similarity_boost": 0.75,
  "style": 0.0,
  "speed": 1.0,
  "use_speaker_boost": true
}

GET /admin/stt-timing

Retrieve STT timing settings (commit merge delay, stability timeout, VAD parameters, minimum dispatch thresholds).

POST /admin/stt-timing

Update STT timing settings to control speech-to-text buffering and translation dispatch intervals.

Body: {
  "commit_merge_ms": 1500,
  "stability_timeout_ms": 2000,
  "tts_segment_pause_ms": 0,
  "max_accumulation_ms": 8000,
  "vad_threshold": 0.5,
  "vad_silence_threshold_secs": 1.0,
  "min_speech_duration_ms": 100,
  "min_silence_duration_ms": 100,
  "flush_on_sentence_boundary": true,
  "min_chars_before_dispatch": 40
}

Languages

GET /admin/languages

Get the current active language pair (source & target).

POST /admin/languages

Set the active language pair. Broadcasts update to all connected clients.

Body: {
  "languages": ["en", "ru"]
}

GET /admin/available-languages

Get the pool of languages that viewers can select from.

POST /admin/available-languages

Set the pool of available languages for viewer selection. Broadcasts to all clients.

Body: {
  "languages": ["en", "ru", "uk", "es"]
}
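
A language pair is a two-element array, and a viewer's choice has to come from the available pool. A minimal validation sketch follows; `isSelectablePair` is a hypothetical helper, and the server may apply further rules not shown here.

```typescript
// A pair is selectable when it has exactly two codes and both are in the pool.
function isSelectablePair(pair: string[], pool: string[]): boolean {
  if (pair.length !== 2) return false;
  return pair.every((code) => pool.indexOf(code) !== -1);
}
```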

Translation Provider

GET /admin/translation-provider

Get the active translation provider and list of available providers (google, deepl, claude, libretranslate).

POST /admin/translation-provider

Set the active translation provider.

Body: {
  "provider": "google"
}

GET /admin/claude-model

Get the active Claude translation model and list of available Claude models.

POST /admin/claude-model

Set the Claude translation model when using Claude as the translation provider.

Body: {
  "model": "claude-3-5-sonnet-20241022"
}

Audio Device

GET /admin/audio-device

Get the admin-selected audio input device (overrides viewer's local choice).

POST /admin/audio-device

Set the admin audio device. Broadcasts to all viewers in real-time.

Body: {
  "deviceId": "default",
  "label": "Built-in Microphone"
}

Video Call Settings

GET /admin/video-settings

Get video call STT/TTS settings (stability, commit merge, translation provider).

POST /admin/video-settings

Update video call translation settings.

Body: {
  "stability_ms": 500,
  "commit_merge_ms": 50,
  "translation_provider": "claude"
}

Content Generation

POST /admin/generate-sermon

Generate a biblical sermon snippet via Gemini Flash (used by biblical simulator).

Body: {
  "apiKey": "sk_...",
  "language": "ru",
  "sentences": 5
}

Broadcast Schedule

GET /admin/broadcast-schedule

Retrieve scheduled broadcast events.

POST /admin/broadcast-schedule

Set scheduled broadcast events (past events are auto-expired).

Body: {
  "events": [
    {
      "id": "evt-001",
      "title": "Sunday Service",
      "datetime": "2024-12-08T10:00:00Z",
      "description": "Weekly broadcast"
    }
  ]
}
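
Auto-expiry of past events could look like the filter below. This is a sketch that assumes "past" means the event's `datetime` is earlier than the current time and that unparseable dates are dropped; `dropExpiredEvents` is an illustrative name.

```typescript
interface BroadcastEvent {
  id: string;
  title: string;
  datetime: string; // ISO 8601, as in the request body above
  description?: string;
}

// Keep only events that start after `now`; invalid dates are treated as expired.
function dropExpiredEvents(events: BroadcastEvent[], now: Date): BroadcastEvent[] {
  return events.filter((event) => {
    const startsAt = Date.parse(event.datetime);
    return !Number.isNaN(startsAt) && startsAt > now.getTime();
  });
}
```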

TTS Preview

POST /admin/tts-preview

Generate and return TTS audio (MP3) for a text sample with a specified voice. Admin-only test endpoint.

Body: {
  "text": "Hello, this is a test.",
  "voiceId": "kxj9qk6u5PfI0ITgJwO0"
}

Voice Training & Instant Voice Cloning

POST /admin/voice-training/from-recording

Clone a custom voice from browser microphone recordings (base64-encoded audio blobs).

Body: {
  "name": "My Custom Voice",
  "clips": ["<base64-audio>", "<base64-audio>"],
  "mimeType": "audio/webm"
}

POST /admin/voice-training/from-youtube

Clone a voice from YouTube video audio (extracted & uploaded to ElevenLabs).

Body: {
  "name": "YouTube Voice Clone",
  "youtubeUrl": "https://www.youtube.com/watch?v=...",
  "clipCount": 3,
  "startOffset": 60
}

Monitoring & Analytics

GET /admin/hallucinations

Retrieve hallucination detection statistics and recent flagged transcripts.

DELETE /admin/hallucinations

Clear the hallucination log.

GET /admin/translation-log

Retrieve recent translation entries with timing metrics.

DELETE /admin/translation-log

Clear the translation log.

GET /admin/queue-depth

Get real-time broadcast queue depth and stream statistics for admin monitoring.

Broadcast Session History

GET /admin/sessions

Retrieve all broadcast & private sessions from the database.

GET /admin/sessions/:id

Retrieve detailed transcript data for a specific session (all translated segments with timing).

GET /admin/sessions/:id/export

Export session transcripts as CSV, JSON, or TXT. Query param: ?format=csv|json|txt (default: json).
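
A small helper for building the export path, shown as a sketch. `sessionExportPath` is a hypothetical client-side function; the route, the `format` query parameter, and the `json` default are taken from the description above.

```typescript
type ExportFormat = "csv" | "json" | "txt";

// Build the export path for a session; the format defaults to json.
function sessionExportPath(sessionId: string, format: ExportFormat = "json"): string {
  return `/admin/sessions/${encodeURIComponent(sessionId)}/export?format=${format}`;
}
```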

User Management

GET /admin/users

Retrieve all users (requires user_management permission). Password hashes are stripped before sending.

POST /admin/users/:id/role

Update user admin status and/or role assignments (requires user_management permission).

Body: {
  "isAdmin": true,
  "roleIds": ["role-1", "role-2"]
}

POST /admin/users/:id/reset-password

Force-reset a user's password (requires user_management permission). The new password must be at least 6 characters long.

Body: {
  "password": "newpassword123"
}

DELETE /admin/users/:id

Delete a user account (requires user_management permission, cannot delete self).

Roles & Permissions

GET /admin/permissions

List all available permissions (requires user_management permission).

GET /admin/roles

Retrieve all roles (requires user_management permission).

POST /admin/roles

Create a new role with selected permissions (requires user_management permission, role name must be unique).

Body: {
  "name": "Translator",
  "permissions": ["broadcast_control", "session_export"]
}

PUT /admin/roles/:id

Update an existing role's name & permissions (requires user_management permission).

Body: {
  "name": "Senior Translator",
  "permissions": ["broadcast_control", "session_export", "voice_cloning"]
}

DELETE /admin/roles/:id

Delete a role (requires user_management permission).

Socket.io Events

Server → Client Events

Event	Payload	Description
`feature_flags`	`{ [flag: string]: boolean }`	Merged feature flags from config & Redis overrides; emitted on connection & after admin updates.
`languages`	`{ languages: string[] }`	Current active source → target language pair (2-element array); emitted on connection & after admin change.
`available_languages`	`{ languages: string[] }`	Pool of languages available for viewer selection; emitted on connection & after admin updates.
`stt_timing`	`{ tts_segment_pause_ms: number }`	STT timing configuration including pause duration between TTS segments.
`broadcast_status`	`{ active: boolean; source?: string; pauseReason?: string | null; skipSourceLang?: string | null; voiceId?: string; orphaned?: boolean }`	Global broadcast state: whether on-air, current source (mic/youtube/remote/biblical), pause reason, skip language filter, voice, & orphan status.
`broadcast_viewer_count`	`{ count: number }`	Current number of viewers in the broadcast room.
`remote_audio_sources`	`{ sources: RemoteSource[] }`	List of connected remote audio agents with their device selection state.
`admin_audio_device`	`{ deviceId: string; label: string }`	Admin-forced audio input device that viewers should use (empty if not set).
`session_started`	`{ source: 'mic' | 'youtube' }`	Confirms private session start with the audio source.
`session_stopped`	`{}`	Private session has been stopped.
`transcript`	`{ text: string; isFinal: boolean }`	Live transcription from STT (partial or final).
`translation`	`{ original: string; translated: string; detectedLanguage?: string }`	Final translated text for a committed segment (private session only).
`tts_audio`	`{ audio: string }`	Base64-encoded MP3 audio chunk for a translated segment (private session).
`tts_clear_queue`	`{}`	Clear any pending TTS audio in the frontend queue (e.g. on broadcast pause).
`audio_level`	`{ data: number[] }`	Downsampled waveform levels (64 samples) for real-time audio visualization.
`error`	`{ message: string }`	Error message from the server (STT failure, translation error, etc.).
`stream_ended`	`{}`	YouTube stream or biblical simulator has finished.
`broadcast_transcript`	`{ text: string; isFinal: boolean; skipped?: boolean }`	Live STT transcript from broadcast (all viewers in broadcast room).
`broadcast_translation`	`{ original: string; translated: string; detectedLanguage?: string }`	Final translated text for broadcast (all viewers in broadcast room).
`broadcast_tts_audio`	`{ audio: string }`	Base64-encoded MP3 audio for broadcast (all viewers in broadcast room).
`broadcast_voice_changed`	`{ voiceId: string }`	TTS voice was changed during broadcast.
`admin_translate_result`	`{ original: string; translated: string; detectedLanguage?: string; audio: string }`	Result of admin instant translate & TTS test (private to admin socket).
`select_device`	`{ id: string }`	Instruction to select a specific audio device (sent to remote audio agents).
`refresh_devices`	`{}`	Request remote audio agent to refresh its device list.
`remote_audio_error`	`{ socketId: string; deviceId: string; message: string }`	Error from remote audio source (e.g. device unavailable).
`device_select_error`	`{ socketId: string; message: string }`	Admin attempt to select device on remote source failed.
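
As an example of consuming one of these payloads, the sketch below maps `broadcast_status` to a status-pill label. The interface mirrors the payload column above; `statusLabel` and the "Paused" wording are assumptions for illustration rather than the actual frontend code.

```typescript
interface BroadcastStatus {
  active: boolean;
  source?: string;
  pauseReason?: string | null;
  skipSourceLang?: string | null;
  voiceId?: string;
  orphaned?: boolean;
}

// Derive the text for an "On Air" / "Off Air" indicator from the payload.
function statusLabel(status: BroadcastStatus): string {
  if (!status.active) return "Off Air";
  if (status.pauseReason) return `Paused (${status.pauseReason})`;
  return "On Air";
}
```

A client would typically call this inside its `socket.on('broadcast_status', ...)` handler and render the returned string.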

Client → Server Events

Event	Payload	Description
`join_broadcast`	`{}`	Viewer joins the broadcast room to receive live translation & audio.
`leave_broadcast`	`{}`	Viewer leaves the broadcast room.
`set_languages`	`{ languages: string[] }`	Viewer selects a new source → target language pair (must be in available pool).
`start_session`	`{ source: 'mic' | 'youtube'; voiceId?: string; youtubeUrl?: string }`	Start a private translation session (mic or YouTube source).
`stop_session`	`{}`	Stop private translation session & clean up resources.
`change_voice`	`{ voiceId: string }`	Change TTS voice for current broadcast or private session (mid-session).
`audio_chunk`	`{ audio: string }`	Base64-encoded PCM audio from microphone (routes to broadcast or private session).
`test_audio_chunk`	`{ audio: string }`	Audio chunk for private session testing (never routes to broadcast).
`admin_start_broadcast`	`{ voiceId?: string; source: 'mic' | 'youtube' | 'remote'; youtubeUrl?: string }`	Admin starts a global broadcast session from mic, YouTube URL, or remote audio sources.
`admin_stop_broadcast`	`{}`	Admin stops the active broadcast session.
`reclaim_broadcast`	`{}`	Admin reclaims an orphaned broadcast after reconnecting (no socket ID change needed).
`broadcast_pause`	`{ reason: 'prayer' | 'song' }`	Admin pauses broadcast (stops audio input & clears TTS queue, keeps session alive).
`broadcast_resume`	`{}`	Admin resumes broadcast from pause.
`broadcast_skip_lang`	`{ lang: string | null }`	Admin sets source language to skip translation/TTS (e.g. 'en' when human translator is speaking).
`admin_translate_test`	`{ text: string; voiceId?: string; sourceLang?: string; targetLang?: string }`	Admin instant translate & TTS test (results emitted only to admin socket).
`register_audio_source`	`{ agentId?: string; label: string; deviceId: string; devices?: { id: string; name: string }[]; selectedDevice?: string | null }`	Remote audio agent registers itself as a broadcast audio source with device list.
`unregister_audio_source`	`{}`	Remote audio agent unregisters (disconnect).
`select_agent_device`	`{ socketId: string; deviceId: string }`	Admin selects which device a remote audio agent should use.
`refresh_devices`	`{ socketId: string }`	Admin requests a remote audio agent to refresh its device list.
`audio_stream_error`	`{ deviceId: string; message: string }`	Remote audio agent reports a stream error (device unavailable, permission denied, etc.).
`start_biblical_sim`	`{ anthropicApiKey?: string; geminiApiKey?: string; language: BiblicalLanguage; voiceId?: string }`	Admin starts biblical text simulator broadcast (generates & translates sermon excerpts).
`stop_biblical_sim`	`{}`	Admin stops biblical simulator broadcast.

SDK

Uses the official @elevenlabs/elevenlabs-js SDK (v2). The client is lazy-loaded on first use.

Speech-to-Text (Scribe v2 Realtime)

Connects via native WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime. Handles:

VAD-based commit buffering with configurable merge window
Stability timeout fallback for stalled VAD
Text validation (EN/RU/UK character regex filtering)
Partial and final transcript emission
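
Text validation by character set might look like the sketch below. The exact regex used by the service is not shown on this page, so the allowed set here (Latin letters, the Cyrillic block U+0400 to U+04FF that covers Russian and Ukrainian, digits, whitespace, and common punctuation) and the name `isPlausibleTranscript` are assumptions.

```typescript
// At least one Latin or Cyrillic letter must be present.
const HAS_LETTER = /[A-Za-z\u0400-\u04FF]/;
// Every character must be a letter, digit, whitespace, or common punctuation.
const ONLY_ALLOWED = /^[A-Za-z\u0400-\u04FF0-9\s.,!?;:'"()\-\u2019\u02BC]+$/;

// Reject transcripts that contain other scripts or no letters at all.
function isPlausibleTranscript(text: string): boolean {
  const trimmed = text.trim();
  return HAS_LETTER.test(trimmed) && ONLY_ALLOWED.test(trimmed);
}
```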

Text-to-Speech

Uses client.textToSpeech.stream() with the eleven_multilingual_v2 model. Audio is collected into a Buffer and emitted as base64 MP3.

Voice Management

client.voices.getAll() — fetches all voices from account
Admin can filter which voices are available to viewers
Voice cloning via IVC API (from recordings or YouTube)

Key File

backend/src/services/elevenlabs.service.ts

Provider Details

Google Translate

Google Cloud Translation API v2. Fast (~200ms), deterministic, and reliable. Requires GOOGLE_TRANSLATE_API_KEY with the Cloud Translation API enabled in Google Cloud Console. Ensure the API key has no HTTP referrer restrictions (server-side requests have no referrer).

File: backend/src/services/google-translate.service.ts

LibreTranslate

Self-hosted in Docker. No API key required by default. Provides language detection and translation via REST API.

File: backend/src/services/libretranslate.service.ts

DeepL

Premium translation API. Auto-detects free vs. paid endpoint based on the API key format.

File: backend/src/services/deepl.service.ts
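
DeepL issues free-plan keys with a `:fx` suffix and serves them from a separate host, so endpoint detection can be as small as the sketch below. Whether deepl.service.ts performs exactly this check is an assumption, and `deeplBaseUrl` is an illustrative name.

```typescript
// Free-plan keys end in ":fx" and use api-free.deepl.com; paid keys use api.deepl.com.
function deeplBaseUrl(apiKey: string): string {
  return apiKey.trim().endsWith(":fx")
    ? "https://api-free.deepl.com"
    : "https://api.deepl.com";
}
```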

Claude (Anthropic)

AI-powered translation using claude-haiku-4-5 for speed. Includes language detection and auto-flip logic.

File: backend/src/services/claude-translate.service.ts

Routing

Provider routing is handled by backend/src/services/translation.provider.ts:

Try admin-selected primary provider
On failure, try configured fallback provider
LibreTranslate is always the last-resort fallback
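
The routing order above can be sketched as a fallback chain. This is a minimal illustration, not the contents of translation.provider.ts: `providerOrder`, `translateWithFallback`, and the `TranslateFn` signature are hypothetical.

```typescript
type Provider = "google" | "deepl" | "claude" | "libretranslate";
type TranslateFn = (text: string) => Promise<string>;

// Primary first, then the configured fallback, then LibreTranslate as last resort.
function providerOrder(primary: Provider, fallback?: Provider): Provider[] {
  const order: Provider[] = [primary];
  if (fallback && order.indexOf(fallback) === -1) order.push(fallback);
  if (order.indexOf("libretranslate") === -1) order.push("libretranslate");
  return order;
}

// Try each provider in order; rethrow the last error if all of them fail.
async function translateWithFallback(
  text: string,
  order: Provider[],
  providers: Record<Provider, TranslateFn>
): Promise<string> {
  let lastError: unknown = new Error("no providers configured");
  for (const name of order) {
    try {
      return await providers[name](text);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Deduplicating the order means that choosing LibreTranslate as the primary provider does not cause it to be tried twice.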

Connection

Uses ioredis with automatic retry strategy. Falls back to in-memory/YAML defaults if Redis is unavailable.

Key Patterns

Pattern	Example	Purpose
`flag:<name>`	`flag:youtube_input`	Feature flag boolean values
`setting:<name>`	`setting:tts_settings`	JSON settings objects
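
The key patterns and the fall-back-to-defaults behaviour can be sketched together. The sketch assumes flag values are stored in Redis as the strings "true" and "false", which is not confirmed by this page; `flagKey`, `settingKey`, and `mergeFlags` are illustrative names.

```typescript
const flagKey = (name: string): string => `flag:${name}`;
const settingKey = (name: string): string => `setting:${name}`;

// Overlay Redis overrides on YAML defaults. With no overrides (for example
// when Redis is unavailable) the defaults are returned unchanged.
function mergeFlags(
  yamlDefaults: Record<string, boolean>,
  redisValues: Record<string, string | null>
): Record<string, boolean> {
  const merged: Record<string, boolean> = { ...yamlDefaults };
  for (const name of Object.keys(yamlDefaults)) {
    const raw = redisValues[flagKey(name)];
    if (raw === "true" || raw === "false") merged[name] = raw === "true";
  }
  return merged;
}
```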

Key File

backend/src/services/redis.service.ts

Local Development

Use docker-compose.local.yml for Redis and LibreTranslate only (backend/frontend run natively):

Terminal

docker compose -f docker-compose.local.yml up -d

Production

Use docker-compose.yml for all services:

Terminal

docker compose up -d --build

Services

Service	Image	Port	Notes
frontend	node:24-alpine + Nginx	80 (exposed)	Serves React build, proxies API/WS to backend
backend	node:24-alpine	3001 (internal)	Express + Socket.io server
redis	redis:7-alpine	6379 (internal)	Feature flags and settings store
libretranslate	libretranslate/libretranslate	5000 (internal)	Self-hosted translation engine

Configuration

.env (production)

ELEVENLABS_API_KEY=sk-your-production-key
ADMIN_PASSWORD=strong-secure-password
FRONTEND_URL=https://translate.example.com
APP_ENV=prod
REDIS_PASSWORD=redis-secret

Deploy

Terminal

docker compose up -d --build

Reverse Proxy

When running behind Nginx or another reverse proxy:

Set LISTEN_PORT in .env (e.g., 8080)
Proxy pass to localhost:8080
Important: Ensure WebSocket upgrades are forwarded for the /socket.io/ path

Nginx Config (example)

server {
    listen 443 ssl;
    server_name translate.example.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}

Monitoring

Terminal

# Check all services
docker compose ps

# View backend logs
docker compose logs -f backend

# Health check
curl http://localhost:3001/api/health

Shipped

v0.1 – v0.2 — Core Translation Engine

Real-time STT via ElevenLabs Scribe v2 Realtime
Multi-provider translation (LibreTranslate, DeepL, Claude)
TTS voice synthesis with ElevenLabs
Microphone and YouTube live input
Admin panel with feature flags, voice management, TTS tuning
Biblical Transcript Simulator for pipeline testing
Instant Voice Cloning from recordings and YouTube

Shipped

v0.3 — Audio Mixer & Device Selection

Browser-side audio device scanning with support for professional mixing consoles, virtual audio devices, and audio interfaces.

Browser-side device enumeration with permission flow
Virtual device detection (Loopback, BlackHole, VB-Audio, Voicemeeter, OBS)
Categorized device picker (Microphones vs Mixers / Virtual Devices)
Admin device override broadcast to all viewers via Socket.io
Real-time feature flag broadcasting

Shipped

v0.7 — Broadcast Service

The /translate route is now a true broadcast service. Admins start one global translation session from the admin panel and all connected viewers receive the live output simultaneously.

Single global broadcast session (one-to-many)
Admin "Broadcast Control" panel — Start/Stop with source + voice selection
Microphone and YouTube source both supported for broadcast
All translation output (transcript, translated text, TTS audio) is sent to every viewer via io.emit
Viewer shows a "Waiting for broadcast to start…" status when off air
"On Air" / "Off Air" status pill visible to viewers in real-time
Broadcast ownership tracked by admin socket ID; auto-stops on admin disconnect
Biblical Transcript Simulator also broadcasts to all viewers

Shipped

v0.8 — Navigation, Broadcast FF & Transcript UX

Global persistent bottom navigation, feature-flag-gated route visibility, and a refined transcript reading experience.

Persistent bottom navigation bar on all pages (/translate, /broadcast, /video, /admin)
FF-gated nav links — Broadcast and Video Call entries only appear when their flags are enabled
No extra socket connection — nav reads flags from the page’s existing useSocket call via props
Nav renders a frosted dark background gradient so it never overlaps content
/broadcast route is now public (no login required); gated inside the page by the broadcast feature flag
broadcast feature flag added to YAML, backend config, and frontend FeatureFlags interface
Transcript panel: newest translation is always at the top; older lines scroll down and fade out at the bottom
Each new transcript entry animates in from above (transcriptIn keyframe)
Removed duplicate “Video Call” button from /translate and /broadcast header bars

Shipped

v0.9 — Translation Pipeline Overhaul & Google Integration

Major improvements to translation chunking, provider support, and admin tooling.

Google Translate as primary translation provider with automatic fallback chain
Google Gemini 2.5 Flash for biblical simulator and sermon generation (replaces deprecated Gemini 2.0 Flash)
Overhauled STT chunking: disabled aggressive sentence-boundary splitting, stability timer defers to accumulation during continuous speech, commit buffer defers when speaker has resumed
Configurable sermon length (1–20 sentences) in admin UI
Voice training: AI-generated reading text (Gemini) for mic recording sessions
Voice training: preview playback of cloned voice after training via TTS
Broadcast mute/unmute toggle (muted by default, replaces “Tap to enable audio” banner)
Audio device auto-scan on page load with spinning refresh indicator
Fixed admin Raw Server Logs auto-scroll toggle re-enabling on new messages
Updated Claude model list: removed deprecated models, default is claude-haiku-4-5
Docker images upgraded to Node.js 24 (Alpine)

Shipped

v0.4 — Mac Audio Agent

Lightweight Node.js daemon that captures Mac microphone audio and streams it to the backend via Socket.io — no browser required on the audio source machine.

Runs as a macOS LaunchAgent (auto-start on login, auto-restart on crash)
Captures 16 kHz 16-bit mono PCM via sox
Identical chunk format and encoding to the browser client
Registers as a named remote audio source visible in the Admin UI
Starts/stops streaming automatically based on broadcast_status events
One-command install script (see standalone repo)

Up Next

v0.4.1 — Direct Audio Interface Feed

Accept audio directly from professional mixing consoles and audio interfaces — extend the Mac agent to support Core Audio device selection for broadcast-quality input.

Direct audio interface input (Core Audio / ASIO / ALSA)
Multi-channel mixer feed support
Low-latency audio routing (sub-100ms)
Hardware device auto-discovery and selection
Professional broadcast integration (NDI, Dante)

Shipped

v0.5 — Video Call Translation

WebRTC peer-to-peer video calls with real-time bidirectional translation. Two people speak different languages and hear each other translated via TTS.

Built-in WebRTC video call with room codes
Full-duplex translation (each person hears the other translated)
Per-participant STT pipeline with independent Scribe sessions
Video grid UI with local PiP and remote full-screen
Mic/video mute controls, hang up, auto-cleanup on disconnect
Feature-flagged behind video_translation

Shipped

v0.6 — Auth, Mobile & Voice Cloning in /video

User-facing login page (/) with JWT cookie sessions (30-day sticky, HttpOnly)
All app routes protected — redirect to login if unauthenticated
Live translator moved to /translate
Mobile-responsive UI across Translator, Admin, and Video Call views
FaceTime-style full-screen in-call layout on mobile with safe-area insets
“Clone Voice” button in /video lobby, gated by video_voice_cloning feature flag
Voice cloning modal with mic recording or YouTube URL, admin-password gated

Planned

Future

Additional language pairs beyond EN/RU/UK
Speaker diarization (multi-speaker detection)
Translation memory and glossary support
Webhooks and API for third-party integrations
Multi-tenant deployment with user accounts
