Voice — the reference whose prosody is cloned, grouped by persona (athena / majel / custom) plus the built-in Kyutai voices. This is the biggest lever: a deadpan reference stays deadpan no matter how the sliders are set.
Or clone an uploaded clip — upload any audio to clone it on the fly; it overrides the dropdown. Longer, cleaner clips (~10–30 s) clone better. The ✕ clears the upload.
Temperature (0.7) — randomness of delivery. Higher = more pitch/emotion; too high slurs. Lower = flatter but rock-stable.
Decode steps (1) — decoder refinement passes; 2 or more smooths artifacts at roughly linear CPU cost (the improvement is subtle and rendering is slower).
EOS threshold (−4) — how eagerly it stops; −2 can clip the ending, −6 may add trailing junk.
Noise clamp (off) — caps the magnitude of sampled noise. 0 = off (no clamp); any positive value steadies and flattens delivery (lower = tighter). Leave it off unless a take is too jittery.
Quick starts: lively → temp 0.9 · stable → temp 0.6 · cleaner take → steps 2.
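As a rough illustration of how Temperature and Noise clamp interact, the sketch below draws Gaussian noise, scales it by the temperature, and clips it when a clamp is set. It is a minimal Python sketch under stated assumptions, not the engine's actual sampler: the function name sample_noise, the choice to scale the standard deviation by the temperature, and the order of scaling and clamping are all hypothetical.

```python
import random


def sample_noise(n, temperature=0.7, noise_clamp=0.0, seed=None):
    """Draw n noise values, scaled by temperature and optionally clamped.

    Assumption: temperature multiplies the standard deviation of the noise.
    Assumption: noise_clamp == 0 means "off", matching the slider's default.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0) * temperature
        if noise_clamp > 0:
            # Cap the magnitude: a lower clamp gives a tighter, flatter range.
            z = max(-noise_clamp, min(noise_clamp, z))
        values.append(z)
    return values


if __name__ == "__main__":
    lively = sample_noise(1000, temperature=0.9, seed=1)
    steadied = sample_noise(1000, temperature=0.9, noise_clamp=0.5, seed=1)
    print("largest magnitude, clamp off:", max(abs(z) for z in lively))
    print("largest magnitude, clamp 0.5:", max(abs(z) for z in steadied))
```

With the clamp off, the largest magnitudes grow with the temperature; with a clamp set, no value exceeds it, which is one way to picture why a clamp steadies a jittery take.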
Text — what gets spoken; shared across all three tabs.
Generate renders the clip and plays it immediately — but nothing is saved yet (shown as · unsaved). Reset defaults restores the current tab's controls.
💾 save — writes the just-generated clip into the library at studio/<engine>/<voice>/; only saved clips appear under Saved clips.
Saved clips — grouped by engine · voice, each with a player, ➜ voice (promotes the clip into a reusable reference voice under custom), and ✕ delete.
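The engine · voice grouping follows directly from the studio/<engine>/<voice>/ layout. The Python sketch below shows one way such a listing could be derived from clip paths. The function name group_saved_clips, the folder names pocket and xtts, and the clip filenames are hypothetical; the app's real naming may differ.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_saved_clips(paths):
    """Group clip paths of the form studio/<engine>/<voice>/<file> by engine and voice."""
    groups = defaultdict(list)
    for p in paths:
        parts = PurePosixPath(p).parts
        # Skip anything that does not match the expected four-part layout.
        if len(parts) != 4 or parts[0] != "studio":
            continue
        _, engine, voice, filename = parts
        groups[f"{engine} · {voice}"].append(filename)
    return dict(groups)


if __name__ == "__main__":
    clips = [
        "studio/pocket/athena/take1.wav",  # hypothetical filenames
        "studio/pocket/athena/take2.wav",
        "studio/xtts/majel/intro.wav",
    ]
    for label, files in group_saved_clips(clips).items():
        print(label, files)
```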
Voice library — manage the cloneable reference voices (used by the Pocket & XTTS tabs): ▶ preview, ✎ rename or move to another persona, ✕ delete, and Add a voice to upload a new WAV into voices/<persona>/. Changes refresh the voice pickers immediately.
voices/<persona>/<name>.wav · WAV only (no transcoder here).
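Because there is no transcoder, an upload has to be a WAV before it can land in voices/<persona>/. The Python sketch below illustrates that check and the resulting destination path. The function name voice_destination and the case-insensitive extension test are assumptions for illustration, not the app's actual upload handler.

```python
from pathlib import PurePosixPath


def voice_destination(persona, upload_filename):
    """Return voices/<persona>/<name>.wav for a WAV upload, or raise ValueError."""
    src = PurePosixPath(upload_filename)
    # Assumption: only the file extension is checked, ignoring case.
    if src.suffix.lower() != ".wav":
        raise ValueError(f"WAV only, got {src.suffix or 'no extension'}")
    # Keep only the file name so the clip lands directly under the persona folder.
    return str(PurePosixPath("voices") / persona / src.name)


if __name__ == "__main__":
    print(voice_destination("custom", "narrator.wav"))
    try:
        voice_destination("custom", "narrator.mp3")
    except ValueError as err:
        print("rejected:", err)
```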