Voice conversion, stem separation, and model merging. By Cablewight.
Vox is a desktop tool for two things: changing what a voice sounds like, and pulling audio apart into stems. It combines RVC voice conversion with the full UVR stem separation engine, model merging, and a model library, in a single app.
Four tabs across the top:
Convert takes audio in, runs it through a voice model (or a merged pair of models), and outputs a transformed vocal. Stem separation is built into this pipeline as an optional preprocessing step.
Separate is standalone stem separation with full parameter control. No voice conversion involved. Drop audio in, pick an architecture and model, tune the parameters, get stems out.
Train and Generate require GPU hardware. On Intel Macs they are placeholder tabs; on compatible systems they are fully functional.
The Convert tab is the main workflow. Top to bottom:
Drag an audio file into the drop zone, use the file picker, paste a YouTube URL, or record directly. Most formats work: WAV, MP3, FLAC, OGG, M4A.
Pick a model from the grid. Models are organized by category (Musician, Politician, Fictional Character, etc.) and filterable. Downloaded models show file size. Models with a v2 badge use the newer 768-dimensional feature space. Models without a badge are v1 (256-dimensional).
Below the model grid, four toggles control preprocessing:
The architecture tabs (Roformer, MDX-Net, Demucs, VR Arc) and model dropdown control which separation model runs on your audio before conversion. BS-Roformer-Viperx-1297 on the Roformer tab is the strongest vocal isolator available.
Relative Pitch shifts the converted voice up or down in semitones. +10 to +12 is a common range for male-to-female shifts. Instrumental Pitch shifts the backing track to match when the vocal pitch change leaves the two sounding mismatched.
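To get a feel for what a semitone value means in frequency terms, here is a minimal sketch of the standard 12-tone equal temperament relationship. It is not Vox's internal code (the source doesn't document how the shift is applied); the function names are made up for illustration.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency multiplier for a pitch shift in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def shift_hz(freq_hz: float, semitones: float) -> float:
    """Frequency that results from shifting `freq_hz` by `semitones`."""
    return freq_hz * semitones_to_ratio(semitones)


if __name__ == "__main__":
    # Print the multiplier for a few shifts, including the +10 to +12 range.
    for st in (-12, 0, 10, 12):
        print(f"{st:+d} semitones -> x{semitones_to_ratio(st):.3f}")
```

A shift of +12 doubles every frequency (one octave up), which is why the male-to-female range sits just below a full octave.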
RVC v1 models extract 256-dimensional features from HuBERT layer 9. v2 models extract 768-dimensional features from layer 12. These are not different quality tiers. They encode genuinely different information: layer 9 is more acoustic (timbre, phonetics), layer 12 is more semantic (linguistic structure, speaker identity).
v2 is generally preferred for large training datasets (40+ minutes of audio). v1 can actually outperform v2 on small datasets (~5 minutes). Vox shows a v2 badge on v2 models so you always know what you're working with.
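If you want to check a model's version yourself, the sketch below works on a checkpoint that has already been loaded into a plain dict (for example with `torch.load`). It assumes the checkpoint carries a `"version"` key and that files without one are v1, which is how RVC-Project checkpoints commonly behave; how Vox itself decides the badge is not documented here, and the function names are hypothetical.

```python
# Feature width per RVC version, as described in the text above.
FEATURE_DIM = {"v1": 256, "v2": 768}


def rvc_version(checkpoint: dict) -> str:
    """Return "v1" or "v2" for an already-loaded checkpoint dict.

    Assumption: the version is stored under a "version" key, and
    checkpoints that lack the key are treated as v1.
    """
    version = checkpoint.get("version", "v1")
    if version not in FEATURE_DIM:
        raise ValueError(f"unrecognized RVC version: {version!r}")
    return version


def feature_dim(checkpoint: dict) -> int:
    """Feature dimension the model expects (256 for v1, 768 for v2)."""
    return FEATURE_DIM[rvc_version(checkpoint)]
```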
Models are .pth files (PyTorch checkpoints, ~55-60 MB) optionally paired with .index files (FAISS feature indices that improve timbre accuracy). Models without index files still work but produce less precise timbres.
Hugging Face is the primary source. Search "[name] RVC" to find community-uploaded models. AI Hub Discord has a #voice-models forum channel. Applio has a built-in downloader supporting multiple sources. Quality indicators: both .pth and .index present, training duration over 10 minutes, demo samples available.
Drop a .pth file (and its .index if available) into the voice model drop zone, or place them in the models directory. They appear in the grid immediately.
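For a quick audit of which models in a folder have an index, something like the following works. It is a sketch under one loud assumption: that each `.index` file shares its base name with its `.pth` file. Community index files are often named differently, so treat a missing match as "check by hand", not "no index". The `find_models` name and the returned layout are invented for this example.

```python
from pathlib import Path


def find_models(models_dir: Path) -> dict[str, dict]:
    """Map each model name to its .pth path, optional .index path, and size.

    Assumption: an index belongs to a model when both files share the
    same base name (e.g. alto.pth and alto.index).
    """
    models = {}
    for pth in sorted(models_dir.glob("*.pth")):
        index = pth.with_suffix(".index")
        models[pth.stem] = {
            "pth": pth,
            "index": index if index.exists() else None,
            "size_mb": round(pth.stat().st_size / 1_000_000, 1),
        }
    return models
```

Calling `find_models(Path("models"))` returns one entry per `.pth`, with `"index"` set to `None` where no same-named index was found.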
Most people use voice models to clone someone. That's not what merging is for. Blending two voice models creates a third voice that doesn't exist anywhere. It's a new timbral instrument built from the characteristics of two sources.
Click Merge Models in the Convert tab. Model A is your currently selected model. Pick Model B from the dropdown. The compatibility check runs automatically and shows green (compatible) or red (version mismatch) with an explanation.
The slider controls the interpolation between the two models. At 0% you get pure Model A. At 100%, pure Model B. The interesting space is in between.
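One way to picture the slider is as a linear blend of the two models' weights. The sketch below assumes plain linear interpolation, which the text does not confirm, and uses flat lists of floats as stand-ins for real weight tensors; `blend_weights` is a hypothetical name.

```python
def blend_weights(model_a: dict, model_b: dict, ratio: float) -> dict:
    """Linearly interpolate two weight dicts.

    ratio = 0.0 returns model A's weights, ratio = 1.0 returns model B's.
    Weights are flat lists of floats here, standing in for tensors.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    if model_a.keys() != model_b.keys():
        raise ValueError("models do not share the same parameter names")

    merged = {}
    for name, a in model_a.items():
        b = model_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch in parameter {name!r}")
        # Each merged value sits `ratio` of the way from A to B.
        merged[name] = [(1.0 - ratio) * x + ratio * y for x, y in zip(a, b)]
    return merged
```

The shape check is also why cross-version merges can't work under this scheme: a v1 and a v2 model don't have matching parameter shapes to blend.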
With Save merged model checked, the blended model is written to disk as a new .pth file. It appears in your model grid and persists across sessions. With it unchecked, the merge exists only in memory for one conversion, then is discarded. Use ephemeral mode for quick experiments, save mode when you find something worth keeping.
Match RVC versions (v1+v1 or v2+v2, never cross-version). Use models with similar pitch ranges for more coherent results. Start at 50/50 and adjust from there. Name your saved merges descriptively so you remember what went in.
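The version rule and the naming advice are easy to script if you keep your own notes on merges. This is a sketch only: the message wording and the naming pattern are invented for illustration and are not what Vox displays or writes to disk.

```python
def check_compatibility(version_a: str, version_b: str) -> tuple[bool, str]:
    """Return (ok, explanation) for a proposed merge of two RVC models."""
    if version_a == version_b:
        return True, f"compatible: both models are {version_a}"
    return False, (
        f"version mismatch: {version_a} vs {version_b}; "
        "merge only v1+v1 or v2+v2"
    )


def merge_name(name_a: str, name_b: str, ratio_b: float) -> str:
    """Build a descriptive name recording both sources and their share.

    ratio_b is model B's share, from 0.0 to 1.0 (the slider position).
    """
    pct_b = round(ratio_b * 100)
    return f"{name_a}-{100 - pct_b}_{name_b}-{pct_b}"
```

For example, `merge_name("alto", "bass", 0.3)` gives `alto-70_bass-30`, which tells you at a glance what went in.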
The Separate tab gives you full control over stem separation, independent of voice conversion. This is the equivalent of running UVR directly, with per-architecture parameter tuning.
Drop audio files into the input zone. Pick an architecture tab, select a model, choose output stems (All Stems, Vocals Only, or Instrumental Only), set the output format, tune parameters if needed, hit Separate.
All Stems outputs both the vocal and instrumental (or all stems for Demucs). Vocals Only and Instrumental Only output a single file. Format options: WAV (lossless, largest), FLAC (lossless, compressed), MP3 (lossy, smallest).
Each architecture exposes different tunable parameters. Hover over the ? icon next to any parameter for a description. The Reset to defaults link restores all parameters to their default values.
Most of the time, defaults are fine. The parameters that matter most when defaults aren't enough:
The Model Library panel (accessible at the bottom of the Separate tab) shows all available models grouped by architecture. Downloaded models show a delete icon. Models not yet downloaded show a download icon. Models marked "auto" are downloaded on demand when first selected.
You don't need to understand the architectures to use Vox. You need to know which model to pick, and the architecture is just context for why some models work better on certain material. But when your go-to model fails on a specific track, knowing the approach helps you pick a fallback.
| Architecture | Strategy | Strengths |
|---|---|---|
| Roformer | Rotary attention over frequency bands | Best overall quality. BS-Roformer-Viperx-1297 is the single strongest vocal separator. |
| MDX-Net | Hybrid spectrogram network | Fast, reliable workhorse. Good ONNX performance on CPU. |
| Demucs | U-Net over raw waveform | The only architecture that does true multi-stem (drums, bass, vocals, other, guitar, piano). |
| VR Architecture | Multi-band DenseNet via STFT | The utility specialist: de-echo, de-reverb, de-noise. |
| MDX23C | Enhanced hybrid network | Better SDR scores than MDX-Net. Fewer model variants available. |
You want to separate something. Here's the model to use.
| Model | Arch | Notes |
|---|---|---|
| BS-Roformer-Viperx-1297 | Roformer | Best. Use this first, always. |
| Mel-Roformer-Viperx-1143 | Roformer | Strong alternative. Try if BS-Roformer struggles. |
| Kim_Vocal_2 | MDX-Net | Fastest on CPU. More forgiving on dirty sources. |
| MDX23C-InstVoc HQ | MDX23C | Good all-rounder. |
| Model | Stems | Notes |
|---|---|---|
| htdemucs_ft | 4 (vocals, drums, bass, other) | The standard. Bass separation is excellent (SDR ~11.9). |
| htdemucs_6s | 6 (+guitar, piano) | Guitar and piano quality is usable but not perfect. |
| Model | Arch | Notes |
|---|---|---|
| Mel-Roformer-Denoise | Roformer | Best. Broadband noise. |
| UVR-DeNoise | VR | Good alternative. Lighter processing. |
| Model | Arch | Notes |
|---|---|---|
| Mel-Roformer-Dereverb | Roformer | Best for stereo reverb. |
| UVR-DeEcho-DeReverb | VR | Combined reverb + echo removal in one pass. |
| Model | Arch | Notes |
|---|---|---|
| UVR-De-Echo-Aggressive | VR | For heavy echo. |
| UVR-De-Echo-Normal | VR | Lighter touch. |
| Model | Arch | Notes |
|---|---|---|
| Mel-Roformer-Karaoke | Roformer | Best. Run on an isolated vocal stem, not a full mix. |
| UVR-MDX-NET Karaoke | MDX-Net | Faster alternative. |
These are curated pipelines. Order matters.
Single pass. If this doesn't get you there, nothing will without manual cleanup.
Kim_Vocal_2 first because it's most forgiving on dirty sources. Denoise last to catch residual artifacts. This is the community-consensus chain.
Isolate all vocals first, then split lead from backing. Don't run the karaoke model on a full mix.
Denoise first (remove hiss/hum), then dereverb (remove room), then separate. Reversing the order embeds noise into the separated stems.
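The ordering rule for this chain (denoise, then dereverb, then separate) can be written down as an explicit list of stages, each feeding its output to the next. The sketch below uses stub stages that only rename the file, so it runs without any audio libraries; `run_chain`, `_tag`, and `CLEANUP_CHAIN` are hypothetical names, and real stages would call an actual separation model instead.

```python
from typing import Callable, Sequence

# A stage is a label plus a function from input path to output path.
Stage = tuple[str, Callable[[str], str]]


def run_chain(audio_path: str, stages: Sequence[Stage]) -> str:
    """Run stages in order, passing each stage's output to the next."""
    current = audio_path
    for name, stage in stages:
        current = stage(current)
        print(f"{name}: {current}")
    return current


def _tag(suffix: str) -> Callable[[str], str]:
    """Stub stage: pretend to process the file by tagging its name."""
    def stage(path: str) -> str:
        stem, dot, ext = path.rpartition(".")
        if not dot:  # no file extension present
            return f"{path}_{suffix}"
        return f"{stem}_{suffix}.{ext}"
    return stage


# Order matters: remove noise, then room reverb, then separate.
CLEANUP_CHAIN: list[Stage] = [
    ("denoise", _tag("denoised")),
    ("dereverb", _tag("dry")),
    ("separate", _tag("vocals")),
]
```

Keeping the chain as data makes it harder to reverse the order by accident, and easy to drop a stage when a recording doesn't need it.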
Always pitch-correct before conversion. Use same-version models for the merge. Enable "Pre-Stemmed" since your exported vocal is already clean.
Skip the denoise step if your recording environment is clean.
RVC gives you a dropdown of pitch detection algorithms. The internet has comparison tables that look definitive. In practice, you'll still trial-and-error. Here's the short version that actually helps.
Use RMVPE first. It handles polyphonic and noisy sources well and it's fast enough for offline work.
If it glitches, try Crepe (full, not tiny).
"Tiny" and "mini" variants are optimized for real-time conversion latency. If you're doing offline conversion (you are), skip them.
PM may not appear depending on your platform and hardware. Don't worry about it.
Harvest is slow and you'll rarely need it. It's there if RMVPE and Crepe both fail on a specific passage.
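The advice above amounts to a fallback order: RMVPE, then Crepe, then Harvest. Here is a minimal sketch of that logic, assuming you supply your own extractor functions and your own test for whether a pitch curve is usable. In practice that test is your ears, so `is_usable` is a stand-in, and every name in the block is hypothetical.

```python
from typing import Callable, Mapping, Sequence

# Preferred order from the text: RMVPE first, Crepe next, Harvest last.
FALLBACK_ORDER = ("rmvpe", "crepe", "harvest")


def extract_pitch(
    audio: Sequence[float],
    extractors: Mapping[str, Callable[[Sequence[float]], list]],
    is_usable: Callable[[list], bool],
    order: Sequence[str] = FALLBACK_ORDER,
) -> tuple[str, list]:
    """Try each algorithm in order and return the first usable result."""
    for name in order:
        extractor = extractors.get(name)
        if extractor is None:
            continue  # algorithm not available on this platform
        f0 = extractor(audio)
        if is_usable(f0):
            return name, f0
    raise RuntimeError("no pitch algorithm produced a usable result")
```

The function returns the name of the algorithm that succeeded along with its pitch curve, so you can note which one worked for a given passage.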
These live under "Advanced Settings" in the Convert tab. Most of the time, defaults are correct.
Vox is built on open-source work by many contributors:
Replay by Weights — the original desktop application (MIT license). Codebase mirror maintained by THE-SINDOL.
python-audio-separator by nomadkaraoke — the stem separation engine wrapping all UVR architectures (MIT license).
Ultimate Vocal Remover by Anjok07 and aufr33 — the model ecosystem and research (MIT license).
RVC by RVC-Project — the voice conversion pipeline (MIT license).
Applio by IAHispano — the actively maintained RVC implementation (MIT license).
Demucs by Alexandre Défossez — multi-stem separation (MIT license).
BS-RoFormer community weights by Roman Solovyev, Viperx, aufr33, and others.
All original copyright notices and licenses are preserved in the source repository.