Pocket TTS — Voice Studio

ⓘ How these settings work — Pocket TTS

Voice — the reference whose prosody is cloned, grouped by persona (athena / majel / custom) plus the built-in Kyutai voices. This is the biggest lever: a deadpan reference stays deadpan no matter the sliders.

Or clone an uploaded clip — upload any audio to clone it on the fly; it overrides the dropdown. A longer, cleaner clip (~10–30 s) clones better. The ✕ clears the upload.

Temperature (0.7) — randomness of delivery. Higher = more variation in pitch and emotion; too high and the speech slurs. Lower = flatter but rock-stable.

Decode steps (1) — decoder refinement passes; 2+ smooths artifacts, but CPU cost grows roughly linearly with the step count, so expect a subtle improvement and slower generation.

EOS threshold (−4) — how eagerly it stops; −2 can clip the ending, −6 may add trailing junk.

Noise clamp (off) — caps the magnitude of sampled noise. 0 = off (no clamp); a nonzero value steadies and flattens delivery (lower = tighter). Leave off unless a take is too jittery.
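
As a rough illustration of what a noise clamp does, the sketch below draws Gaussian noise and limits each sample to the range [-clamp, +clamp], treating 0 as "off". The function name, the seed parameter, and the use of plain Gaussian noise are assumptions made for the example; how Pocket TTS samples and clamps internally is not described here.

```python
import random

def sample_noise(n, clamp=0.0, seed=0):
    """Draw n Gaussian samples; if clamp > 0, limit each to [-clamp, clamp].

    Hypothetical helper for illustration only, not the engine's own code.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if clamp <= 0:
        # 0 means "off": the noise is returned untouched.
        return noise
    return [max(-clamp, min(clamp, x)) for x in noise]

free = sample_noise(1000)             # clamp off
tight = sample_noise(1000, clamp=1.0) # extreme values are pulled in
```

A lower clamp removes more of the extreme samples, which is consistent with the "lower = tighter" behavior described above.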

Quick starts: lively temp 0.9 · stable temp 0.6 · cleaner take → steps 2.
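
The quick starts can be read as small overrides on top of the defaults listed above. A minimal sketch, assuming illustrative key names (temperature, decode_steps, eos_threshold, noise_clamp) that are not taken from the app itself:

```python
# Defaults as shown in the panel; the key names are hypothetical.
DEFAULTS = {
    "temperature": 0.7,
    "decode_steps": 1,
    "eos_threshold": -4.0,
    "noise_clamp": 0.0,  # 0 = off
}

# Each quick start changes only the setting it names.
QUICK_STARTS = {
    "lively": {"temperature": 0.9},
    "stable": {"temperature": 0.6},
    "cleaner": {"decode_steps": 2},
}

def apply_quick_start(name):
    """Return a new settings dict: the defaults plus the preset's overrides."""
    return {**DEFAULTS, **QUICK_STARTS[name]}

settings = apply_quick_start("lively")
```

Everything a preset does not mention stays at its default, so "cleaner" keeps temperature at 0.7 and only raises the step count.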

ⓘ How these settings work — Kokoro Blend

Add voice to blend — pick a preset Kokoro voice and + Add it. Add several to mix a custom voice; a single voice is used as-is.

Sign (+ / −) per row — + adds a voice's character, − subtracts it (voice sculpting). The first voice is always additive.

Weight per row — relative mix amount, auto-normalized to 100% (so 2 + 1 ≈ 67% / 33%). The live blend string below the rows is exactly what's sent to Kokoro.
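
The normalization arithmetic can be checked with a short sketch. It covers additive (+) rows only, since how subtractive rows enter the normalization is not stated here, and the function name is made up for the example.

```python
def normalize_weights(weights):
    """Scale relative weights so they sum to 1.0 (that is, 100%)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]

shares = normalize_weights([2, 1])
print([f"{s:.0%}" for s in shares])  # ['67%', '33%']
```

Only the ratio matters: weights of 4 and 2 give the same 67% / 33% split as 2 and 1.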

Speed (1.0×) — slower = more deliberate, faster = clipped.

Volume (1.0×) — output gain multiplier on the rendered audio.
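
A gain multiplier simply scales every sample. The sketch below assumes float samples in the range [-1.0, 1.0] and clips anything that would exceed full scale; whether the app clips, limits, or leaves values untouched is an assumption, and apply_gain is a hypothetical name.

```python
def apply_gain(samples, volume=1.0):
    """Multiply each sample by volume, clipping to [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, s * volume)) for s in samples]

louder = apply_gain([0.5, -0.25, 0.8], volume=2.0)
```

Under this assumption, values above 1.0× make quiet takes louder but can flatten peaks that were already near full scale.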

Language / accent (auto) — overrides the voice's built-in accent (e.g. American vs British English, or another language). auto infers it from the voice prefix.
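
To illustrate the auto behavior, the sketch below assumes the usual Kokoro naming scheme, in which the first letter of a voice name marks its language (for example af_heart for American English and bf_emma for British English). The mapping is deliberately limited to those two cases, and infer_language is a hypothetical helper rather than part of Kokoro.

```python
# Assumed prefix convention; only the two accents mentioned above are mapped.
PREFIX_TO_LANGUAGE = {
    "a": "American English",
    "b": "British English",
}

def infer_language(voice, override="auto"):
    """Return the override if one is set, else guess from the voice prefix."""
    if override != "auto":
        return override
    return PREFIX_TO_LANGUAGE.get(voice[:1], "unknown")

accent = infer_language("bf_emma")
forced = infer_language("bf_emma", override="American English")
```

An explicit choice always wins over the prefix, which matches the "overrides the voice's built-in accent" behavior described above.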

Text normalization — how the text is cleaned before speaking. normalize text is the master toggle; the other options each expand one kind of token into spoken words: URLs, emails, phone numbers, optional pluralization (e.g. "1 cat" vs "2 cats"), replace remaining symbols, and unit normalization (e.g. "5km" → "five kilometers"; off by default). Turn normalize text off to have the text read more literally.

Kokoro is ~1.8× slower than Pocket and heavier on CPU. Use it to design a voice you like, then optionally promote a clip into Pocket for speed.

ⓘ How these settings work — XTTS Clone

Voice — three ways to drive it: clone a persona reference (same voices/ tree as the Pocket tab), pick one of XTTS-v2's 58 built-in studio speakers (the Built-in group, no reference needed), or upload a clip to clone on the fly (overrides the dropdown). XTTS often sounds the most natural, at the cost of speed.

Language — XTTS-v2 is multilingual (17 languages); pick the language your text is written in for correct pronunciation. (Works in both clone and built-in modes.)

XTTS is the slowest engine on CPU — a long passage can take ~30–60s. Its deeper sampling knobs (temperature, top-k/p, penalties, speed) aren't exposed: the stock server doesn't pass them to the model.

ⓘ Text, generating & saved clips (all engines)

Text — what gets spoken; shared across all three tabs.

Generate renders the clip and plays it immediately — but nothing is saved yet (shown as · unsaved). Reset defaults restores the current tab's controls.

💾 save — writes the just-generated clip into the library at studio/<engine>/<voice>/; only saved clips appear under Saved clips.

Saved clips — grouped by engine · voice, each with a player, ➜ voice (promote the clip into a reusable reference voice under custom), and ✕ delete.
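
The "engine · voice" grouping follows directly from the studio/<engine>/<voice>/ layout. A minimal sketch of that grouping, assuming made-up folder and file names (pocket, xtts, take1.wav and so on) and a hypothetical group_clips helper:

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_clips(paths):
    """Group clip paths shaped like studio/<engine>/<voice>/<file>."""
    groups = defaultdict(list)
    for p in paths:
        parts = PurePosixPath(p).parts
        # Skip anything that does not match the expected four-part layout.
        if len(parts) == 4 and parts[0] == "studio":
            engine, voice, filename = parts[1], parts[2], parts[3]
            groups[f"{engine} · {voice}"].append(filename)
    return dict(groups)

library = group_clips([
    "studio/pocket/athena/take1.wav",
    "studio/pocket/athena/take2.wav",
    "studio/xtts/majel/intro.wav",
])
```

Each key in the result corresponds to one group heading, with that group's clips listed in the order they were found.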

Voice library — manage the cloneable reference voices (used by the Pocket & XTTS tabs): ▶ preview, ✎ rename or move to another persona, ✕ delete, and Add a voice to upload a new WAV into voices/<persona>/. Changes refresh the voice pickers immediately.