Voice catalog
storyvox ships two local voice families, with an optional cloud backend wired in for users who want studio-grade narration on slow devices:
- Piper (local, in-process) — single-speaker, compact, fast. Each voice is ~14–30 MB. Best for first-time installs and for fast turnaround on slow hardware.
- Kokoro (local, in-process) — multi-speaker, ~330 MB single download, ships dozens of voice profiles in one model. Higher quality, slower to synthesize.
- Azure Cognitive Services HD voices (cloud, BYOK) — opt-in, paid by you to Microsoft, never by storyvox. Studio-grade narration with offline fallback to a local voice if your key fails or the network drops. See Cloud voices below.
Local voices run in-process via the VoxSherpa-TTS :engine-lib AAR, which wraps k2-fsa/sherpa-onnx for inference. No second APK, no system-TTS handoff, no per-character billing.
Quality tiers (Piper)
Piper voices ship in three quality tiers from the upstream vits-piper-* models:
| Tier | Approx. download | Synthesis speed | Notes |
|---|---|---|---|
| low | ~14 MB | Fastest | Best for very slow devices; some artifacting on long sentences. |
| medium | ~22 MB | Mid | Default tier. Good balance. |
| high | ~28 MB | Slowest | Most natural prosody; can run below realtime on Helio P22T-class chips (see Performance modes on the wiki). |
Voice tiers are independent of speaker — pick the speaker first, then the tier you can afford.
Where voices live
Model weights are not bundled in the APK (they'd add 1+ GB). They download on demand from the
voices-v2 GitHub release on
techempower-org/VoxSherpa-TTS. Each Piper voice ships as {lang}-{voice}-{quality}.onnx plus a
.tokens.txt sidecar. Kokoro ships as kokoro-model.onnx + kokoro-voices.bin + kokoro-tokens.txt.
voices-v2 is a re-hosting of the upstream k2-fsa tarballs as flat single-file downloads — extracting .tar.bz2 archives on Tab A7 Lite-class hardware is slow enough to delay the first chapter for tens of seconds. Doing it once server-side moves that cost off the device.
Why fp32, not int8
The previous voices-v1 release used INT8-quantized models — ~3× smaller download, but a vocoder
run through INT8 dynamic quantization adds audible noise to spectral coefficients. That's the
"fuzz" symptom users heard on Samsung tablets in v0.4.x.
voices-v2 uses fp32 weights — exactly what's inside the upstream
vits-piper-*.tar.bz2 and kokoro-multi-lang-v1_1.tar.bz2 archives. Larger downloads, clean speech.
The _int8 suffix in some VoiceCatalog.kt voice IDs (e.g. piper_lessac_en_US_high_int8)
is historical — kept stable so already-installed users don't have to re-pick a voice when they
update. The on-disk files and URLs no longer involve quantization.
Picking a voice
In the app: Settings → Voice library, or tap the voice name in Settings → Voice & Playback to jump there.
The library groups voices by engine (Piper / Kokoro), language, and speaker. Each row shows:
- Star — toggle to favorite a voice. Stars persist across reinstalls and sit at the top of the list.
- Status — installed, downloading, available, or starred.
- Size — disk footprint after install.
- Quality tier — for Piper voices.
The current selection is highlighted in brass. Tapping a voice that isn't installed kicks off the download with a progress bar and notification; the voice becomes selectable when it lands.
Disk hygiene
- Long-press a voice → Delete to free disk. The currently-selected voice can't be deleted; switch first.
- The starred surface in the library shows your favorite voices at the top regardless of install status. Use it as a shortlist when you reinstall.
Catalog refresh workflow (maintainers)
# Compare upstream to voices-v2 (no changes, no auth needed)
./scripts/voices/refresh-voices-v2.sh --check-only
# Pull new tarballs, extract, upload (needs gh CLI auth with write
# access to techempower-org/VoxSherpa-TTS; uses ~5 GB temp space)
./scripts/voices/refresh-voices-v2.sh
The script:
- Lists
tts-modelsassets onk2-fsa/sherpa-onnxmatchingvits-piper-(en_US|en_GB)-*.tar.bz2(no-int8, no-fp16suffix) andkokoro-multi-lang-v1_1.tar.bz2. - Diffs against assets currently on
voices-v2. - Downloads any missing tarballs (parallel x8).
- Extracts and flattens — drops
MODEL_CARD,espeak-ng-data/, and various Kokoro lexicon/dict files (storyvox bundles its own espeak data). - Uploads everything to
voices-v2with--clobberso reruns are idempotent.
When you add a new upstream voice, also add a CatalogEntry to
core-playback/src/main/kotlin/in/jphe/storyvox/playback/voice/VoiceCatalog.kt. The script
does not edit the catalog automatically — that part is a deliberate human review.
Automated drift detection
.github/workflows/voice-catalog-check.yml runs on the 1st of every month (and on manual
dispatch). It performs the same diff as --check-only and, if there's drift, files a GitHub
issue listing new upstream voices. It does not auto-publish — refreshing voices-v2 requires
write access to techempower-org/VoxSherpa-TTS, which the default GITHUB_TOKEN doesn't have.
Why we don't auto-update the in-app catalog
VoiceCatalog.kt is compiled into the APK. To pick up new voices the user has to install a new
app version. That's intentional:
- The catalog includes language tags, quality tier hints, and recommended voices that need human curation.
- Adding voices is rare enough (months between k2-fsa releases) that a per-release cadence doesn't hurt.
- A network-fetched dynamic catalog would add a runtime failure mode and an "is the index loading?" UX state. Not worth it for the cadence.
If the cadence ever picks up, the plan is to switch to a JSON file in the same voices-v2
release that the app fetches and caches; VoiceCatalog.kt becomes a fallback. That's a v1.x
feature, not v0.4.x.
Cloud voices — Azure HD (BYOK)
For users who want studio-grade narration on slow devices, Azure Cognitive Services HD voices are wired in as an optional remote backend. Bring your own Azure key + region; storyvox never touches your billing.
To set up Azure:
- Create an Azure Speech resource (Azure portal) and copy the key + region (e.g.
eastus,westeurope). - In storyvox, open Settings → Voice & Playback → Azure.
- Paste the key, pick the region, tap Test connection. The full HD voice roster lights up.
- Pick an Azure voice from the regular voice library — it shows up grouped under "Azure HD".
Storyvox uses SSML to drive the same per-voice speed, pitch, and punctuation cadence knobs you set for local voices, with retries on transient HTTP errors. If your key fails or the network drops mid-chapter, playback falls back to your selected local voice for the remainder — it never just stops on you.
Cache eviction priority weights Azure renders higher than local renders since they cost real money: the chapter PCM cache evicts local-engine outputs first when it needs space.
Originally tracked in #85; shipped across #182–#186.
Coming soon
- VoxSherpa-TTS knob exposure (research draft) — loudness normalization, breath pause, pitch envelope as user-tunable settings.
- Per-voice lexicon override (#197) — IPA pronunciation dictionaries per voice for the names that always come out wrong.
- KittenTTS (#119) — a third in-process voice family at the smallest tier, for the slowest of slow devices.