Vox User Manual

Voice conversion, stem separation, and model merging. By Cablewight.

What Vox does

Vox is a desktop tool for two things: changing what a voice sounds like, and pulling audio apart into stems. It combines RVC voice conversion with the full UVR stem separation engine, model merging, and a model library, in a single app.

Four tabs across the top:

Convert
Separate
Train
Generate

Convert takes audio in, runs it through a voice model (or a merged pair of models), and outputs a transformed vocal. Stem separation is built into this pipeline as an optional preprocessing step.

Separate is standalone stem separation with full parameter control. No voice conversion involved. Drop audio in, pick an architecture and model, tune the parameters, get stems out.

Train and Generate require GPU hardware and are placeholder tabs on Intel Mac. They're functional on compatible systems.

Vox is built on Replay by Weights (MIT license), python-audio-separator (MIT), and the RVC voice conversion pipeline (MIT). All original authors are credited in the source LICENSE file.

Voice conversion

The Convert tab is the main workflow. Top to bottom:

Audio input

Drag an audio file into the drop zone, use the file picker, paste a YouTube URL, or record directly. Most formats work: WAV, MP3, FLAC, OGG, M4A.

Voice model selection

Pick a model from the grid. Models are organized by category (Musician, Politician, Fictional Character, etc.) and filterable. Downloaded models show file size. Models with a v2 badge use the newer 768-dimensional feature space. Models without a badge are v1 (256-dimensional).

Pipeline options

Below the model grid, four toggles control preprocessing:

Stem Only: Skip voice conversion entirely. Just output the separated vocal stem.
Pre-Stemmed: Your input is already a clean vocal with no instrumentals. Skip the separation step.
Sample Mode: Process only the first 30 seconds. Use this for quick previews before committing to a full conversion.
De Echo & Reverb: Remove echo and reverb from the vocal before conversion. Useful for recordings with room ambience.
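
As a rough illustration of how the four toggles combine, the sketch below maps them to an ordered list of steps. The function name, the step labels, and the exact ordering are assumptions made for this example, not Vox's internal code.

```python
def plan_pipeline(stem_only=False, pre_stemmed=False,
                  sample_mode=False, de_echo=False):
    """Return the processing steps implied by the four toggles.

    Hypothetical helper: the step order is an assumption based on the
    toggle descriptions, not taken from the Vox source.
    """
    steps = []
    if sample_mode:
        # Sample Mode: only the first 30 seconds are processed.
        steps.append("trim input to first 30 seconds")
    if not pre_stemmed:
        # Pre-Stemmed skips separation because the input is already a clean vocal.
        steps.append("separate vocal stem")
    if de_echo:
        # De Echo & Reverb runs on the vocal before conversion.
        steps.append("remove echo and reverb")
    if not stem_only:
        # Stem Only stops before voice conversion.
        steps.append("convert voice")
    return steps


if __name__ == "__main__":
    for step in plan_pipeline(pre_stemmed=True, sample_mode=True):
        print(step)
```

With Pre-Stemmed and Sample Mode on, the plan reduces to a trim followed by conversion, which is the quick-preview case described above.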

Stem Method

The architecture tabs (Roformer, MDX-Net, Demucs, VR Arc) and model dropdown control which separation model runs on your audio before conversion. BS-Roformer-Viperx-1297 on the Roformer tab is the strongest vocal isolator available.

For best results, pitch-correct your vocal in Melodyne before bringing it into Vox. RVC handles pitch shifting, but Melodyne gives finer control over individual notes.

Pitch controls

Relative Pitch shifts the converted voice up or down in semitones. +10 to +12 is a common range for male-to-female shifts. Instrumental Pitch shifts the backing track, for cases where the vocal pitch change leaves the two sounding mismatched.
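
For a sense of scale, a semitone shift corresponds to a frequency ratio of 2^(n/12) in equal temperament. The sketch below is plain arithmetic, not Vox code, and the function names are made up for the example.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in equal temperament."""
    return 2.0 ** (semitones / 12.0)


def shifted_hz(frequency_hz: float, semitones: float) -> float:
    """Frequency after shifting `frequency_hz` by `semitones`."""
    return frequency_hz * semitone_ratio(semitones)


if __name__ == "__main__":
    # +12 semitones is one octave: the frequency doubles.
    print(shifted_hz(110.0, 12))
    # +10 semitones is a little under an octave.
    print(round(semitone_ratio(10), 3))
```

So the +10 to +12 range raises the fundamental by roughly 1.8x to 2x.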

Voice models

v1 vs v2

RVC v1 models extract 256-dimensional features from HuBERT layer 9. v2 models extract 768-dimensional features from layer 12. These are not different quality tiers. They encode genuinely different information: layer 9 is more acoustic (timbre, phonetics), layer 12 is more semantic (linguistic structure, speaker identity).

v2 is generally preferred for large training datasets (40+ minutes of audio). v1 can actually outperform v2 on small datasets (~5 minutes). Vox shows a v2 badge on v2 models so you always know what you're working with.

v1 and v2 models cannot be merged together. The tensor shapes are incompatible (192×256 vs 192×768). Both models in a merge must be the same version.
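
The compatibility rule can be sketched as a shape comparison. The dimensions come from the text above; the function names are hypothetical, and how Vox actually reads these shapes out of a checkpoint is not shown here.

```python
# Feature dimension -> RVC version, as described in this section.
FEATURE_DIMS = {256: "v1", 768: "v2"}


def rvc_version(feature_dim: int) -> str:
    """Map a feature dimension to an RVC version label."""
    if feature_dim not in FEATURE_DIMS:
        raise ValueError(f"unexpected feature dimension: {feature_dim}")
    return FEATURE_DIMS[feature_dim]


def can_merge(shape_a, shape_b) -> bool:
    """Two models can only be blended if the tensor shapes match exactly."""
    return tuple(shape_a) == tuple(shape_b)


if __name__ == "__main__":
    print(rvc_version(768))                   # v2
    print(can_merge((192, 256), (192, 768)))  # False: v1 vs v2
    print(can_merge((192, 768), (192, 768)))  # True: both v2
```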

File format

Models are .pth files (PyTorch checkpoints, ~55-60 MB) optionally paired with .index files (FAISS feature indices that improve timbre accuracy). Models without index files still work but produce less precise timbres.

Where to find models

Hugging Face is the primary source. Search "[name] RVC" to find community-uploaded models. AI Hub Discord has a #voice-models forum channel. Applio has a built-in downloader supporting multiple sources. Quality indicators: both .pth and .index present, training duration over 10 minutes, demo samples available.

Loading models

Drop a .pth file (and its .index if available) into the voice model drop zone, or place them in the models directory. They appear in the grid immediately.

Model merging as instrument design

Most people merge voice models to clone someone. That's not what this is for. Blending two voice models creates a third voice that doesn't exist anywhere. It's a new timbral instrument built from the characteristics of two sources.

Click Merge Models in the Convert tab. Model A is your currently selected model. Pick Model B from the dropdown. The compatibility check runs automatically and shows green (compatible) or red (version mismatch) with an explanation.

Blend ratio

The slider controls the interpolation between the two models. At 0% you get pure Model A. At 100%, pure Model B. The interesting space is in between.

Slider at 25% (75% A, 25% B): Mostly Model A with a wash of B's character. Subtle timbral coloring.
Slider at 50% (50% A, 50% B): Equal blend. Neither source dominates. Most likely to produce something genuinely new.
Slider at 75% (25% A, 75% B): Mostly Model B. A's influence becomes textural rather than tonal.
Complementary timbres: Blend a breathy model with a bright one. The result has characteristics neither source can produce alone.
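
A minimal sketch of what the slider does, assuming the merge is a plain linear interpolation of matching weights (the manual says "interpolation" but does not spell out the formula). Real checkpoints hold PyTorch tensors; plain lists of floats stand in for them here, and `blend_weights` is a hypothetical name.

```python
def blend_weights(model_a, model_b, ratio):
    """Linearly interpolate two weight dictionaries.

    ratio = 0.0 returns pure Model A, ratio = 1.0 returns pure Model B.
    Assumption: each value is a flat list of floats standing in for a tensor.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    if model_a.keys() != model_b.keys():
        raise ValueError("models do not have the same set of weights")

    merged = {}
    for name, a_values in model_a.items():
        b_values = model_b[name]
        if len(a_values) != len(b_values):
            # Mirrors the v1/v2 shape mismatch: nothing sensible to blend.
            raise ValueError(f"shape mismatch in {name!r}")
        merged[name] = [(1.0 - ratio) * a + ratio * b
                        for a, b in zip(a_values, b_values)]
    return merged


if __name__ == "__main__":
    a = {"w": [0.0, 2.0]}
    b = {"w": [4.0, 2.0]}
    print(blend_weights(a, b, 0.25))  # mostly A
```

Because every weight is blended independently, mismatched shapes leave nothing to pair up, which is why cross-version merges are rejected.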

Save vs ephemeral

With Save merged model checked, the blended model is written to disk as a new .pth file. It appears in your model grid and persists across sessions. With it unchecked, the merge exists only in memory for one conversion, then is discarded. Use ephemeral mode for quick experiments, save mode when you find something worth keeping.

Merged models don't have .index files. You can't meaningfully combine two FAISS indices, and generating a new one requires the original training data. The quality trade-off is minor. The model works fine without one.

Strategies for productive merging

Match RVC versions (v1+v1 or v2+v2, never cross-version). Use models with similar pitch ranges for more coherent results. Start at 50/50 and adjust from there. Name your saved merges descriptively so you remember what went in.

Standalone stem separation

The Separate tab gives you full control over stem separation, independent of voice conversion. This is the equivalent of running UVR directly, with per-architecture parameter tuning.

Workflow

Drop audio files into the input zone. Pick an architecture tab, select a model, choose output stems (All Stems, Vocals Only, or Instrumental Only), set the output format, tune parameters if needed, hit Separate.

Output options

All Stems outputs both the vocal and instrumental (or all stems for Demucs). Vocals Only and Instrumental Only output a single file. Format options: WAV (lossless, largest), FLAC (lossless, compressed), MP3 (lossy, smallest).

Parameters

Each architecture exposes different tunable parameters. Hover over the ? icon next to any parameter for a description. The Reset to defaults link restores all parameters to their default values.

Most of the time, defaults are fine. The parameters that matter most when defaults aren't enough:

Segment Size: Larger = better quality, more RAM. If you're getting RAM errors, reduce this.
Overlap: Higher overlap reduces artifacts at segment boundaries. Costs processing time.
Aggression (VR only): How hard the model separates. Too high introduces artifacts. Too low leaves bleed.
Shifts (Demucs only): Number of random time shifts applied to the input, with the resulting predictions averaged. Higher is better but slower. 2 is a good default.
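
To see why higher overlap costs processing time, the sketch below counts the segments needed to cover a track. It assumes overlap is expressed as a fraction of the segment length, which is a simplification: the architectures do not all define the parameter the same way, and `segment_starts` is a hypothetical helper.

```python
def segment_starts(total_samples, segment_size, overlap):
    """Start offsets of the segments needed to cover `total_samples`.

    Assumption: `overlap` is a fraction in [0, 1) of the segment length.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")

    # Distance between consecutive segment starts.
    hop = max(1, int(segment_size * (1.0 - overlap)))
    starts = []
    position = 0
    while True:
        starts.append(position)
        if position + segment_size >= total_samples:
            break  # this segment reaches the end of the audio
        position += hop
    return starts


if __name__ == "__main__":
    print(len(segment_starts(1000, 256, 0.0)))  # 4 segments
    print(len(segment_starts(1000, 256, 0.5)))  # 7 segments
```

Going from no overlap to 50% overlap nearly doubles the number of model passes for the same audio.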

Model Library

The Model Library panel (accessible at the bottom of the Separate tab) shows all available models grouped by architecture. Downloaded models show a delete icon. Models not yet downloaded show a download icon. Models marked "auto" are downloaded on demand when first selected.

What the architectures actually do

You don't need to understand the architectures to use Vox. You need to know which model to pick, and the architecture is just context for why some models work better on certain material. But when your go-to model fails on a specific track, knowing the approach helps you pick a fallback.

| Architecture | Strategy | Strengths |
|---|---|---|
| Roformer | Rotary attention over frequency bands | Best overall quality. BS-RoFormer-Viperx-1297 is the single strongest vocal separator. |
| MDX-Net | Hybrid spectrogram network | Fast, reliable workhorse. Good ONNX performance on CPU. |
| Demucs | U-Net over raw waveform | The only architecture that does true multi-stem (drums, bass, vocals, other, guitar, piano). |
| VR Architecture | Multi-band DenseNet via STFT | Best for utility tasks: de-echo, de-reverb, de-noise. The utility specialist. |
| MDX23C | Enhanced hybrid network | Better SDR scores than MDX-Net. Fewer model variants available. |

On Intel Mac, all architectures run via CPU inference. MDX-Net with ONNX is the fastest (~30-60s per song). Roformer is the slowest (~2-3 min) but produces the best results.

Which model for which job

You want to separate something. Here's the model to use.

Vocal isolation

| Model | Arch | Notes |
|---|---|---|
| BS-Roformer-Viperx-1297 | Roformer | Best. Use this first, always. |
| Mel-Roformer-Viperx-1143 | Roformer | Strong alternative. Try if BS-Roformer struggles. |
| Kim_Vocal_2 | MDX-Net | Fastest on CPU. More forgiving on dirty sources. |
| MDX23C-InstVoc HQ | MDX23C | Good all-rounder. |

Multi-stem (drums, bass, guitar, piano)

| Model | Stems | Notes |
|---|---|---|
| htdemucs_ft | 4 (vocals, drums, bass, other) | The standard. Bass separation is excellent (SDR ~11.9). |
| htdemucs_6s | 6 (+guitar, piano) | Guitar and piano quality is usable but not perfect. |

Noise removal

| Model | Arch | Notes |
|---|---|---|
| Mel-Roformer-Denoise | Roformer | Best. Broadband noise. |
| UVR-DeNoise | VR | Good alternative. Lighter processing. |

De-reverb

| Model | Arch | Notes |
|---|---|---|
| Mel-Roformer-Dereverb | Roformer | Best for stereo reverb. |
| UVR-DeEcho-DeReverb | VR | Combined reverb + echo removal in one pass. |

De-echo

| Model | Arch | Notes |
|---|---|---|
| UVR-De-Echo-Aggressive | VR | For heavy echo. |
| UVR-De-Echo-Normal | VR | Lighter touch. |

Karaoke (lead vs backing vocals)

| Model | Arch | Notes |
|---|---|---|
| Mel-Roformer-Karaoke | Roformer | Best. Run on an isolated vocal stem, not a full mix. |
| UVR-MDX-NET Karaoke | MDX-Net | Faster alternative. |
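
The recommendations above can be read as a lookup table: first choice per task, then fallbacks. The sketch below encodes them that way. The task keys and the `pick_model` helper are made up for the example, and the strings are the display names used in this manual, which may not match the model filenames on disk.

```python
# Task -> models in the order they are listed in this section.
RECOMMENDED = {
    "vocals": ["BS-Roformer-Viperx-1297", "Mel-Roformer-Viperx-1143",
               "Kim_Vocal_2", "MDX23C-InstVoc HQ"],
    "multi-stem": ["htdemucs_ft", "htdemucs_6s"],
    "denoise": ["Mel-Roformer-Denoise", "UVR-DeNoise"],
    "dereverb": ["Mel-Roformer-Dereverb", "UVR-DeEcho-DeReverb"],
    "de-echo": ["UVR-De-Echo-Aggressive", "UVR-De-Echo-Normal"],
    "karaoke": ["Mel-Roformer-Karaoke", "UVR-MDX-NET Karaoke"],
}


def pick_model(task, attempt=0):
    """Return the model to try for `task` on the given attempt (0 = first choice).

    Once the list is exhausted, the last entry is returned again.
    """
    if task not in RECOMMENDED:
        raise KeyError(f"no recommendation for task: {task}")
    candidates = RECOMMENDED[task]
    return candidates[min(attempt, len(candidates) - 1)]


if __name__ == "__main__":
    print(pick_model("vocals"))     # first choice
    print(pick_model("vocals", 1))  # fallback if the first one struggles
```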

Model chains for specific outcomes

These are curated pipelines. Order matters.

Clean vocal from a studio mix

BS-Roformer-Viperx-1297

Single pass. If this doesn't get you there, nothing will without manual cleanup.

Clean vocal from a dirty/noisy source

Kim_Vocal_2 → UVR-DeEcho-DeReverb → UVR-De-Echo-Aggressive → Mel-Roformer-Denoise

Kim_Vocal_2 first because it's most forgiving on dirty sources. Denoise last to catch residual artifacts. This is the community-consensus chain.

Lead vocal only (no backing harmonies)

BS-Roformer-Viperx-1297 → Mel-Roformer-Karaoke

Isolate all vocals first, then split lead from backing. Don't run the karaoke model on a full mix.

Sample rescue (old/degraded audio)

Mel-Roformer-Denoise → Mel-Roformer-Dereverb → BS-Roformer-Viperx-1297

Denoise first (remove hiss/hum), then dereverb (remove room), then separate. Reversing the order embeds noise into the separated stems.
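
Since order matters, a separation chain is easiest to think of as a fold: each model receives the previous model's output. The sketch below shows that shape with a stand-in `apply_model` callable. Both `run_chain` and `apply_model` are hypothetical names; a real run would invoke the separation engine at that point.

```python
# Chains from this section, in the order they should run.
DIRTY_SOURCE_CHAIN = ["Kim_Vocal_2", "UVR-DeEcho-DeReverb",
                      "UVR-De-Echo-Aggressive", "Mel-Roformer-Denoise"]
SAMPLE_RESCUE_CHAIN = ["Mel-Roformer-Denoise", "Mel-Roformer-Dereverb",
                       "BS-Roformer-Viperx-1297"]


def run_chain(audio, model_names, apply_model):
    """Feed `audio` through each model in order and return the final result.

    `apply_model(name, audio)` is a caller-supplied function (an assumption
    of this sketch) that runs one model and returns its output.
    """
    for name in model_names:
        audio = apply_model(name, audio)
    return audio


if __name__ == "__main__":
    # Stand-in that only records which models ran, in which order.
    def record(name, history):
        return history + [name]

    print(run_chain([], SAMPLE_RESCUE_CHAIN, record))
```

Swapping two entries in a chain changes what every later stage receives, which is the reason the order is fixed.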

Voice merge → convert pipeline

DAW → Melodyne → Export dry vocal → Vox: merge + convert

Always pitch-correct before conversion. Use same-version models for the merge. Enable "Pre-Stemmed" since your exported vocal is already clean.

Full pipeline: raw recording to finished vocal

Record in DAW → Melodyne → Export → Denoise (if needed) → Vox merge + convert → DAW

Skip the denoise step if your recording environment is clean.

f0 method: the honest version

RVC gives you a dropdown of pitch detection algorithms. The internet has comparison tables that look definitive. In practice, you'll still trial-and-error. Here's the short version that actually helps.

Use RMVPE first. It handles polyphonic and noisy sources well and it's fast enough for offline work.

If it glitches, try Crepe (full, not tiny).

"Tiny" and "mini" variants are optimized for real-time conversion latency. If you're doing offline conversion (you are), skip them.

PM may not appear depending on your platform and hardware. Don't worry about it.

Harvest is slow and you'll rarely need it. It's there if RMVPE and Crepe both fail on a specific passage.

If every f0 method glitches on the same passage, the problem is your source audio, not the algorithm. Go back and fix the recording or the pitch correction.
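
The advice above amounts to a fixed fallback order for offline work. A sketch of that loop, with the conversion and the "does it glitch" check passed in as stand-in callables, since both depend on your setup. `first_clean_f0`, `convert`, and `sounds_clean` are hypothetical names, and the lowercase method labels are illustrative.

```python
# Offline fallback order suggested in this section.
OFFLINE_F0_ORDER = ["rmvpe", "crepe", "harvest"]


def first_clean_f0(passage, convert, sounds_clean, methods=OFFLINE_F0_ORDER):
    """Try each f0 method in order and return the first clean result.

    `convert(passage, method)` and `sounds_clean(result)` are supplied by
    the caller (assumptions of this sketch). Returns (None, None) if every
    method glitches, which points at the source audio, not the algorithm.
    """
    for method in methods:
        result = convert(passage, method)
        if sounds_clean(result):
            return method, result
    return None, None


if __name__ == "__main__":
    # Stand-ins: pretend only Crepe handles this passage cleanly.
    def fake_convert(passage, method):
        return (passage, method)

    def fake_check(result):
        return result[1] == "crepe"

    method, _ = first_clean_f0("verse 2", fake_convert, fake_check)
    print(method)
```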

Advanced settings

These live under "Advanced Settings" in the Convert tab. Most of the time, defaults are correct.

Index Ratio: How much the FAISS index influences the output (0 = none, 1 = full). Higher values make the output sound more like the training data. Lower values let the source vocal's character through. For merged models (which lack .index files), this has no effect.
Consonant Protection: Reduces artifacts on consonant sounds at lower volumes. Lower values = more protection. 0.5 = no protection. Start at the default and reduce if you hear crackling on quiet consonants.
Volume Envelope: Scales the output volume to match the input. At 1 (default), no scaling is applied. Lower values make the output dynamics match the input more closely.
Device: CPU on Intel Mac. MPS on Apple Silicon. CUDA on NVIDIA GPU. Affects processing speed, not output quality.
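
Index Ratio and Volume Envelope are both mixing controls, and a small numeric sketch may make the endpoints easier to remember. It assumes a linear mix for the index and a power-law gain for the envelope, chosen to reproduce the behavior described above (index ratio 0 = source only, volume envelope 1 = no scaling). The function names are hypothetical and this is not Vox's code.

```python
def blend_features(source, retrieved, index_ratio):
    """Mix source features with features retrieved from the index.

    index_ratio = 0 keeps the source features; 1 uses only retrieved ones.
    Assumption: a plain linear mix over equal-length feature lists.
    """
    return [index_ratio * r + (1.0 - index_ratio) * s
            for s, r in zip(source, retrieved)]


def envelope_gain(input_rms, output_rms, volume_envelope):
    """Gain applied to the output so its loudness tracks the input.

    volume_envelope = 1 gives a gain of 1 (no scaling); 0 scales the
    output fully to the input level. Assumption: power-law interpolation.
    """
    if input_rms <= 0 or output_rms <= 0:
        raise ValueError("RMS values must be positive")
    return (input_rms / output_rms) ** (1.0 - volume_envelope)


if __name__ == "__main__":
    print(blend_features([0.0, 0.0], [1.0, 2.0], 0.75))
    print(envelope_gain(0.5, 0.25, 1.0))  # no scaling
    print(envelope_gain(0.5, 0.25, 0.0))  # output raised to input level
```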

Attribution

Vox is built on open-source work by many contributors:

Replay by Weights — the original desktop application (MIT license). Codebase mirror maintained by THE-SINDOL.

python-audio-separator by nomadkaraoke — the stem separation engine wrapping all UVR architectures (MIT license).

Ultimate Vocal Remover by Anjok07 and aufr33 — the model ecosystem and research (MIT license).

RVC by RVC-Project — the voice conversion pipeline (MIT license).

Applio by IAHispano — the actively maintained RVC implementation (MIT license).

Demucs by Alexandre Défossez — multi-stem separation (MIT license).

BS-RoFormer community weights by Roman Solovyev, Viperx, aufr33, and others.

All original copyright notices and licenses are preserved in the source repository.

macOS: right-click → Open on first launch to bypass Gatekeeper. Vox is unsigned during the testing phase.