Vox — User Manual

Contents

What's in here

1. Overview: what Vox does
2. Convert tab: voice conversion
3. Voice models: finding, loading, versioning
4. Model merging: building new instruments
5. Separate tab: standalone stem separation
6. Stem architectures: what they do differently
7. Which model for which job
8. Recipes: model chains for specific outcomes
9. f0 pitch detection: the honest version
10. Advanced settings reference
11. Credits and attribution

Overview

What Vox does

Vox is a desktop tool for two things: changing what a voice sounds like, and pulling audio apart into stems. It combines RVC voice conversion with the full UVR stem separation engine, model merging, and a model library, in a single app.

Four tabs across the top:

Convert

Separate

Train

Generate

Convert takes audio in, runs it through a voice model (or a merged pair of models), and outputs a transformed vocal. Stem separation is built into this pipeline as an optional preprocessing step.

Separate is standalone stem separation with full parameter control. No voice conversion involved. Drop audio in, pick an architecture and model, tune the parameters, get stems out.

Train and Generate require GPU hardware and are placeholder tabs on Intel Mac. They're functional on compatible systems.

Vox is built on Replay by Weights (MIT license), python-audio-separator (MIT), and the RVC voice conversion pipeline (MIT). All original authors are credited in the source LICENSE file.

Convert

Voice conversion

The Convert tab is the main workflow. Top to bottom:

Audio input

Drag an audio file into the drop zone, use the file picker, paste a YouTube URL, or record directly. Most formats work: WAV, MP3, FLAC, OGG, M4A.

Voice model selection

Pick a model from the grid. Models are organized by category (Musician, Politician, Fictional Character, etc.) and filterable. Downloaded models show file size. Models with a v2 badge use the newer 768-dimensional feature space. Models without a badge are v1 (256-dimensional).

Pipeline options

Below the model grid, four toggles control preprocessing:

Stem Only

Skip voice conversion entirely. Just output the separated vocal stem.

Pre-Stemmed

Your input is already a clean vocal with no instrumentals. Skip the separation step.

Sample Mode

Process only the first 30 seconds. Use this for quick previews before committing to a full conversion.

De Echo & Reverb

Remove echo and reverb from the vocal before conversion. Useful for recordings with room ambience.

Stem Method

The architecture tabs (Roformer, MDX-Net, Demucs, VR Arc) and model dropdown control which separation model runs on your audio before conversion. BS-Roformer-Viperx-1297 on the Roformer tab is the strongest vocal isolator available.

For best results, pitch-correct your vocal in Melodyne before bringing it into Vox. RVC handles pitch shifting, but Melodyne gives finer control over individual notes.

Pitch controls

Relative Pitch shifts the converted voice up or down in semitones. +10 to +12 is a common range for male-to-female shifts. Instrumental Pitch adjusts the backing track to match if the vocal pitch change made it sound mismatched.
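For a sense of scale: in 12-tone equal temperament, each semitone multiplies frequency by the twelfth root of two, so +12 doubles the pitch. The sketch below shows that arithmetic only. The function names are hypothetical and this is not Vox's internal code.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift of `semitones` (12-tone equal temperament)."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz: float, semitones: float) -> float:
    """Shift a fundamental frequency in Hz. An unvoiced frame (0 Hz) stays at 0."""
    return f0_hz * semitones_to_ratio(semitones)
```

A 220 Hz note shifted by +12 lands at 440 Hz; shifted by -12, at 110 Hz.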

Models

Voice models

v1 vs v2

RVC v1 models extract 256-dimensional features from HuBERT layer 9. v2 models extract 768-dimensional features from layer 12. These are not different quality tiers. They encode genuinely different information: layer 9 is more acoustic (timbre, phonetics), layer 12 is more semantic (linguistic structure, speaker identity).

v2 is generally preferred for large training datasets (40+ minutes of audio). v1 can actually outperform v2 on small datasets (~5 minutes). Vox shows a v2 badge on v2 models so you always know what you're working with.

v1 and v2 models cannot be merged together. The tensor shapes are incompatible (192×256 vs 192×768). Both models in a merge must be the same version.
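The version rule comes down to a shape comparison. Here is a minimal sketch of that check, using the feature widths quoted above. The function names are hypothetical, and the real compatibility check in Vox may inspect more than this.

```python
V1_DIM = 256  # feature width of v1 models, as stated above
V2_DIM = 768  # feature width of v2 models, as stated above


def detect_version(feature_dim: int) -> str:
    """Map a feature width to an RVC version label."""
    if feature_dim == V1_DIM:
        return "v1"
    if feature_dim == V2_DIM:
        return "v2"
    raise ValueError(f"unexpected feature dimension: {feature_dim}")


def can_merge(shape_a: tuple, shape_b: tuple) -> bool:
    """Two models can be blended only if the compared tensor shapes are identical."""
    return shape_a == shape_b
```

With the shapes from the text, `can_merge((192, 256), (192, 768))` is false, which is the red "version mismatch" case.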

File format

Models are .pth files (PyTorch checkpoints, ~55-60 MB) optionally paired with .index files (FAISS feature indices that improve timbre accuracy). Models without index files still work but produce less precise timbres.

Where to find models

Hugging Face is the primary source. Search "[name] RVC" to find community-uploaded models. AI Hub Discord has a #voice-models forum channel. Applio has a built-in downloader supporting multiple sources. Quality indicators: both .pth and .index present, training duration over 10 minutes, demo samples available.

Loading models

Drop a .pth file (and its .index if available) into the voice model drop zone, or place them in the models directory. They appear in the grid immediately.
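If you manage the models directory by hand, a quick scan tells you which models are missing an index. This sketch assumes the .index file shares the model's base name, which is an assumption for illustration: community index files are often named differently, and Vox may pair them another way.

```python
from pathlib import Path


def list_models(models_dir):
    """Return (model_name, has_index) for every .pth file in models_dir."""
    root = Path(models_dir)
    result = []
    for pth in sorted(root.glob("*.pth")):
        # Assumption: the index sits next to the model with the same base name.
        has_index = pth.with_suffix(".index").exists()
        result.append((pth.stem, has_index))
    return result
```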

Merge

Model merging as instrument design

Most people merge voice models to clone someone. That's not what this is for. Blending two voice models creates a third voice that doesn't exist anywhere. It's a new timbral instrument built from the characteristics of two sources.

Click Merge Models in the Convert tab. Model A is your currently selected model. Pick Model B from the dropdown. The compatibility check runs automatically and shows green (compatible) or red (version mismatch) with an explanation.

Blend ratio

The slider controls the interpolation between the two models. At 0% you get pure Model A. At 100%, pure Model B. The interesting space is in between.

Slider at 25% (75% A, 25% B)

Mostly Model A with a wash of B's character. Subtle timbral coloring.

Slider at 50% (50% A, 50% B)

Equal blend. Neither source dominates. Most likely to produce something genuinely new.

Slider at 75% (25% A, 75% B)

Mostly Model B. A's influence becomes textural rather than tonal.

Complementary timbres

Blend a breathy model with a bright one. The result has characteristics neither source can produce alone.
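The blend itself can be pictured as a weighted average of the two models' weights. The sketch below assumes the slider is a plain linear interpolation applied to every parameter, and represents each model as a dict of flat weight lists. Both are simplifications for illustration, not a description of Vox's merge code.

```python
def merge_weights(model_a, model_b, ratio_b):
    """Linearly interpolate two weight dicts.

    ratio_b = 0.0 returns pure Model A, 1.0 returns pure Model B.
    """
    if not 0.0 <= ratio_b <= 1.0:
        raise ValueError("ratio_b must be between 0 and 1")
    if model_a.keys() != model_b.keys():
        raise ValueError("models have different parameter names")
    merged = {}
    for name, weights_a in model_a.items():
        weights_b = model_b[name]
        if len(weights_a) != len(weights_b):
            # This is the failure a v1 + v2 merge would hit.
            raise ValueError(f"shape mismatch in {name!r}")
        merged[name] = [
            (1.0 - ratio_b) * a + ratio_b * b
            for a, b in zip(weights_a, weights_b)
        ]
    return merged
```

Because every parameter must line up one-to-one, a mismatch in shape leaves nothing meaningful to average, which is why cross-version merges are rejected up front.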

Save vs ephemeral

With Save merged model checked, the blended model is written to disk as a new .pth file. It appears in your model grid and persists across sessions. With it unchecked, the merge exists only in memory for one conversion, then is discarded. Use ephemeral mode for quick experiments, save mode when you find something worth keeping.

Merged models don't have .index files. You can't meaningfully combine two FAISS indices, and generating a new one requires the original training data. The quality trade-off is minor. The model works fine without one.

Strategies for productive merging

Match RVC versions (v1+v1 or v2+v2, never cross-version). Use models with similar pitch ranges for more coherent results. Start at 50/50 and adjust from there. Name your saved merges descriptively so you remember what went in.

Separate

Standalone stem separation

The Separate tab gives you full control over stem separation, independent of voice conversion. This is the equivalent of running UVR directly, with per-architecture parameter tuning.

Workflow

Drop audio files into the input zone. Pick an architecture tab, select a model, choose output stems (All Stems, Vocals Only, or Instrumental Only), set the output format, tune parameters if needed, hit Separate.

Output options

All Stems outputs both the vocal and instrumental (or all stems for Demucs). Vocals Only and Instrumental Only output a single file. Format options: WAV (lossless, largest), FLAC (lossless, compressed), MP3 (lossy, smallest).

Parameters

Each architecture exposes different tunable parameters. Hover over the ? icon next to any parameter for a description. The Reset to defaults link restores all parameters to their default values.

Most of the time, defaults are fine. The parameters that matter most when defaults aren't enough:

Segment Size

Larger = better quality, more RAM. If you're getting RAM errors, reduce this.

Overlap

Higher overlap reduces artifacts at segment boundaries. Costs processing time.

Aggression (VR only)

How hard the model separates. Too high introduces artifacts. Too low leaves bleed.

Shifts (Demucs only)

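Runs the model on several randomly time-shifted copies of the input and averages the results. Higher values improve quality, but processing time grows with each extra pass. 2 is a good default.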
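To see why Segment Size and Overlap trade quality against time and memory, it helps to count windows. The sketch below treats overlap as a fraction of the segment shared by neighboring windows. That unit is an assumption, since each architecture defines its own overlap scale, and the function name is hypothetical.

```python
def segment_starts(total_samples, segment_size, overlap):
    """Start offsets of the fixed-size windows used to cover the audio.

    overlap is the fraction (0 <= overlap < 1) shared by neighboring windows.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # The hop is the distance between consecutive window starts.
    hop = max(1, int(segment_size * (1.0 - overlap)))
    return list(range(0, max(total_samples, 1), hop))
```

For 1000 samples and a 400-sample segment, no overlap needs 3 windows while 50% overlap needs 5. More windows means more model passes, hence the extra processing time.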

Model Library

The Model Library panel (accessible at the bottom of the Separate tab) shows all available models grouped by architecture. Downloaded models show a delete icon. Models not yet downloaded show a download icon. Models marked "auto" are downloaded on demand when first selected.

Architectures

What the architectures actually do

You don't need to understand the architectures to use Vox. You need to know which model to pick, and the architecture is just context for why some models work better on certain material. But when your go-to model fails on a specific track, knowing the approach helps you pick a fallback.

Architecture	Strategy	Strengths
Roformer	Rotary attention over frequency bands	Best overall quality. BS-Roformer-Viperx-1297 is the single strongest vocal separator.
MDX-Net	Hybrid spectrogram network	Fast, reliable workhorse. Good ONNX performance on CPU.
Demucs	U-Net over raw waveform	The only architecture that does true multi-stem (drums, bass, vocals, other, guitar, piano).
VR Architecture	Multi-band DenseNet via STFT	The utility specialist: de-echo, de-reverb, de-noise.
MDX23C	Enhanced hybrid network	Better SDR scores than MDX-Net. Fewer model variants available.

On Intel Mac, all architectures run via CPU inference. MDX-Net with ONNX is the fastest (~30-60s per song). Roformer is the slowest (~2-3 min) but produces the best results.

Decision tree

Which model for which job

You want to separate something. Here's the model to use.

Vocal isolation

Model	Arch	Notes
BS-Roformer-Viperx-1297	Roformer	Best. Use this first, always.
Mel-Roformer-Viperx-1143	Roformer	Strong alternative. Try if BS-Roformer struggles.
Kim_Vocal_2	MDX-Net	Fastest on CPU. More forgiving on dirty sources.
MDX23C-InstVoc HQ	MDX23C	Good all-rounder.

Multi-stem (drums, bass, guitar, piano)

Model	Stems	Notes
htdemucs_ft	4 (vocals, drums, bass, other)	The standard. Bass separation is excellent (SDR ~11.9).
htdemucs_6s	6 (+guitar, piano)	Guitar and piano quality is usable but not perfect.

Noise removal

Model	Arch	Notes
Mel-Roformer-Denoise	Roformer	Best. Broadband noise.
UVR-DeNoise	VR	Good alternative. Lighter processing.

De-reverb

Model	Arch	Notes
Mel-Roformer-Dereverb	Roformer	Best for stereo reverb.
UVR-DeEcho-DeReverb	VR	Combined reverb + echo removal in one pass.

De-echo

Model	Arch	Notes
UVR-De-Echo-Aggressive	VR	For heavy echo.
UVR-De-Echo-Normal	VR	Lighter touch.

Karaoke (lead vs backing vocals)

Model	Arch	Notes
Mel-Roformer-Karaoke	Roformer	Best. Run on an isolated vocal stem, not a full mix.
UVR-MDX-NET Karaoke	MDX-Net	Faster alternative.

Recipes

Model chains for specific outcomes

These are curated pipelines. Order matters.

Clean vocal from a studio mix

BS-Roformer-Viperx-1297

Single pass. If this doesn't get you there, nothing will without manual cleanup.

Clean vocal from a dirty/noisy source

Kim_Vocal_2 → UVR-DeEcho-DeReverb → UVR-De-Echo-Aggressive → Mel-Roformer-Denoise

Kim_Vocal_2 first because it's most forgiving on dirty sources. Denoise last to catch residual artifacts. This is the community-consensus chain.

Lead vocal only (no backing harmonies)

BS-Roformer-Viperx-1297 → Mel-Roformer-Karaoke

Isolate all vocals first, then split lead from backing. Don't run the karaoke model on a full mix.

Sample rescue (old/degraded audio)

Mel-Roformer-Denoise → Mel-Roformer-Dereverb → BS-Roformer-Viperx-1297

Denoise first (remove hiss/hum), then dereverb (remove room), then separate. Reversing the order embeds noise into the separated stems.
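The separation chains above are just ordered lists of models, where each stage consumes the previous stage's output. Here is a minimal sketch of that idea. The dictionary keys and the `separate` callable are hypothetical placeholders for whatever actually runs a model, and nothing here is a Vox API.

```python
CHAINS = {
    "studio_mix": ["BS-Roformer-Viperx-1297"],
    "dirty_source": [
        "Kim_Vocal_2",
        "UVR-DeEcho-DeReverb",
        "UVR-De-Echo-Aggressive",
        "Mel-Roformer-Denoise",
    ],
    "lead_vocal": ["BS-Roformer-Viperx-1297", "Mel-Roformer-Karaoke"],
    "sample_rescue": [
        "Mel-Roformer-Denoise",
        "Mel-Roformer-Dereverb",
        "BS-Roformer-Viperx-1297",
    ],
}


def run_chain(audio, chain, separate):
    """Feed each stage's output into the next stage, in list order.

    `separate(audio, model_name)` is supplied by the caller and returns
    the processed audio for one model.
    """
    for model_name in chain:
        audio = separate(audio, model_name)
    return audio
```

Keeping the chain as data makes the order explicit, which matters because reordering the stages changes the result.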

Voice merge → convert pipeline

DAW → Melodyne → Export dry vocal → Vox: merge + convert

Always pitch-correct before conversion. Use same-version models for the merge. Enable "Pre-Stemmed" since your exported vocal is already clean.

Full pipeline: raw recording to finished vocal

Record in DAW → Melodyne → Export → Denoise (if needed) → Vox merge + convert → DAW

Skip the denoise step if your recording environment is clean.

Pitch detection

f0 method: the honest version

RVC gives you a dropdown of pitch detection algorithms. The internet has comparison tables that look definitive. In practice, you'll still trial-and-error. Here's the short version that actually helps.

Use RMVPE first. It handles polyphonic and noisy sources well and it's fast enough for offline work.

If it glitches, try Crepe (full, not tiny).

"Tiny" and "mini" variants are optimized for real-time conversion latency. If you're doing offline conversion (you are), skip them.

PM may not appear depending on your platform and hardware. Don't worry about it.

Harvest is slow and you'll rarely need it. It's there if RMVPE and Crepe both fail on a specific passage.

If every f0 method glitches on the same passage, the problem is your source audio, not the algorithm. Go back and fix the recording or the pitch correction.
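The advice above amounts to a fallback order: RMVPE, then Crepe, then Harvest, and if all three fail, stop blaming the algorithm. A sketch of that decision logic follows. The `detectors` mapping and the `is_glitchy` check are hypothetical stand-ins you would supply yourself, since judging a glitch is usually done by ear.

```python
F0_FALLBACK_ORDER = ["rmvpe", "crepe", "harvest"]


def pick_f0_method(passage, detectors, is_glitchy):
    """Return (method, contour) from the first detector that gives a clean contour.

    Returns (None, None) if every available method glitches, which is the cue
    to go back and fix the source audio.
    """
    for method in F0_FALLBACK_ORDER:
        detector = detectors.get(method)
        if detector is None:
            continue  # method not available on this platform
        contour = detector(passage)
        if not is_glitchy(contour):
            return method, contour
    return None, None
```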

Reference

Advanced settings

These live under "Advanced Settings" in the Convert tab. Most of the time, defaults are correct.

Index Ratio

How much the FAISS index influences the output (0 = none, 1 = full). Higher values make the output sound more like the training data. Lower values let the source vocal's character through. For merged models (which lack .index files), this has no effect.
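One way to picture Index Ratio is as a crossfade between the features extracted from your source vocal and the nearest matches retrieved from the index. The sketch below assumes a simple linear blend per feature value. That mirrors how the setting is described here but is an assumption about the mechanism, and the function name is hypothetical.

```python
def apply_index_ratio(source_feats, retrieved_feats, index_ratio):
    """Blend source features with index-retrieved features.

    index_ratio = 0 keeps the source features untouched;
    index_ratio = 1 uses only the retrieved (training-data) features.
    """
    if not 0.0 <= index_ratio <= 1.0:
        raise ValueError("index_ratio must be between 0 and 1")
    return [
        index_ratio * r + (1.0 - index_ratio) * s
        for s, r in zip(source_feats, retrieved_feats)
    ]
```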

Consonant Protection

Reduces artifacts on consonant sounds at lower volumes. Lower values mean stronger protection; 0.5 turns protection off. Start at the default and reduce if you hear crackling on quiet consonants.

Volume Envelope

Controls how far the output's loudness envelope is reshaped to follow the input's. At 1 (default), no scaling is applied and the converted voice keeps its own dynamics. Lower values pull the output dynamics closer to the input's.
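The behavior at the two ends of the range can be written as a per-frame gain. This sketch assumes the gain is computed from the RMS loudness of the input and output frames with a power-law mix. That formula is an assumption chosen to match the description, not a confirmed detail of Vox, and the names are hypothetical.

```python
def envelope_gain(rms_in, rms_out, rate, eps=1e-6):
    """Gain applied to one frame of converted audio.

    rate = 1 returns 1.0 (no scaling).
    rate = 0 returns rms_in / rms_out, so the output loudness follows the input.
    """
    rms_out = max(rms_out, eps)  # avoid dividing by zero on silent frames
    return (rms_in ** (1.0 - rate)) * (rms_out ** (rate - 1.0))
```

With an input frame at RMS 0.5 and an output frame at 0.25, a rate of 0 gives a gain of 2, lifting the output to the input's level. A rate of 1 leaves the frame alone.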

Device

CPU on Intel Mac. MPS on Apple Silicon. CUDA on NVIDIA GPU. Affects processing speed, not output quality.

Credits

Attribution

Vox is built on open-source work by many contributors:

Replay by Weights — the original desktop application (MIT license). Codebase mirror maintained by THE-SINDOL.

python-audio-separator by nomadkaraoke — the stem separation engine wrapping all UVR architectures (MIT license).

Ultimate Vocal Remover by Anjok07 and aufr33 — the model ecosystem and research (MIT license).

RVC by RVC-Project — the voice conversion pipeline (MIT license).

Applio by IAHispano — the actively maintained RVC implementation (MIT license).

Demucs by Alexandre Défossez — multi-stem separation (MIT license).

BS-RoFormer community weights by Roman Solovyev, Viperx, aufr33, and others.

All original copyright notices and licenses are preserved in the source repository.

macOS: right-click → Open on first launch to bypass Gatekeeper. Vox is unsigned during the testing phase.
