By Emmanuel Ansah · April 2, 2026 · 5 min read

How AI Stem Separation Works

A plain-English explanation of the machine learning that separates music into individual stems.

If you've ever wondered how an app can take a mixed song and hand you back just the vocals - or just the drums - with few audible artefacts, the answer lies in a branch of AI called music source separation.

What is a "stem"?

In music production, a stem is an individual component of a mixed track. A typical song might have stems for: vocals, lead guitar, rhythm guitar, bass, drums, and keys. When an engineer mixes a song, these individual signals are combined into a single stereo file - the track you hear on streaming platforms.

Stem separation is the process of reversing that mix - pulling apart a finished song back into its constituent elements.

Why is it hard?

Mixing audio is a one-way operation. When you add two signals together, each sample of the result is a single number, with no record of how much each source contributed to it. Traditional signal processing (like filtering or phase cancellation) can crudely separate frequency bands, but instruments overlap in frequency, so the result is full of artefacts and bleed from other parts. It works for very simple cases; it fails badly on complex, professionally mixed music.
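
A tiny numerical sketch of why a mix cannot simply be "un-added". The sample values are invented for illustration (they are not real audio), and `mix` is a hypothetical helper. Two different pairs of sources sum to exactly the same mixture, so the mixture alone cannot tell you which pair produced it.

```python
# Illustrative only: the sample values are invented, not real audio.

def mix(*stems):
    """Mix stems the way a summing bus does: add them sample by sample."""
    return [sum(samples) for samples in zip(*stems)]

# Candidate decomposition A
vocals_a = [0.25, -0.5, 0.75, 0.0]
drums_a = [0.5, 0.25, -0.25, 0.5]

# Candidate decomposition B: different stems, same sum
vocals_b = [0.5, 0.0, 0.25, 0.25]
drums_b = [0.25, -0.25, 0.25, 0.25]

mix_a = mix(vocals_a, drums_a)
mix_b = mix(vocals_b, drums_b)

print(mix_a)           # the only thing a listener (or a model) gets to see
print(mix_a == mix_b)  # both decompositions explain it equally well
```

Any separation method therefore needs outside knowledge about what vocals or drums usually sound like in order to pick the plausible decomposition.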

Enter deep learning

Modern AI approaches train neural networks on large collections of songs where the original multi-track recordings are available (the public MUSDB18-HQ dataset, for example, contains 150 professional tracks with separate vocals, drums, bass and "other" stems). The model learns to predict, from the mixed audio, what each individual stem must have sounded like.
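
To make the training setup concrete, here is a minimal sketch of a single training example. It assumes a plain L1 (mean absolute error) loss on the waveform, which is only one of several losses used in practice. `toy_model` is a hypothetical placeholder that splits the mix evenly, not a real network, and the sample values are invented.

```python
# Sketch of one supervised training example. Assumptions: L1 loss,
# invented sample values, and a placeholder "model" (not a real network).

def make_training_pair(stems):
    """Input = sum of the isolated stems; target = the stems themselves."""
    mixture = [sum(samples) for samples in zip(*stems.values())]
    return mixture, stems

def l1_loss(estimates, targets):
    """Mean absolute error over every sample of every stem."""
    total, count = 0.0, 0
    for name, target in targets.items():
        for est, ref in zip(estimates[name], target):
            total += abs(est - ref)
            count += 1
    return total / count

def toy_model(mixture, names):
    """Hypothetical stand-in: gives each stem an equal share of the mix."""
    return {name: [x / len(names) for x in mixture] for name in names}

stems = {
    "vocals": [0.5, -0.25, 0.0, 0.25],
    "drums": [0.0, 0.75, -0.5, 0.25],
}

mixture, targets = make_training_pair(stems)
estimates = toy_model(mixture, list(stems))
loss = l1_loss(estimates, targets)

print(mixture)  # what the model is given
print(loss)     # how far its guess is from the true stems
```

Training repeats this over many songs, nudging the network's weights so the loss shrinks; a perfect estimate would score exactly zero.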

The best models today use a hybrid architecture. They process the audio simultaneously in two representations:

Time domain: The raw waveform, which captures transients (drum hits, plucks) precisely.
Frequency domain (STFT): A spectrogram, which captures tonal information like pitch and harmonics more efficiently.
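
The difference between the two views can be seen on a toy signal. The sketch below uses only the Python standard library and a deliberately slow direct DFT so it runs anywhere; real systems use optimised FFT routines. The frame length, hop size and test signal are arbitrary choices made for illustration.

```python
import cmath
import math

def stft_magnitudes(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: Hann-windowed frames, direct DFT per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames

# Invented test signal: a steady tone (8 cycles per 64-sample frame)
# plus one loud click at sample 100.
signal = [0.5 * math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
signal[100] += 2.0

# Time domain: the click is pinned to one exact sample.
click_at = max(range(len(signal)), key=lambda n: abs(signal[n]))

# Frequency domain: in a frame without the click, the tone is one clear peak.
spec = stft_magnitudes(signal)
first_frame = spec[0]
peak_bin = max(range(len(first_frame)), key=lambda k: first_frame[k])

print(click_at, peak_bin)
```

In the frames that contain the click, its energy is smeared across every frequency bin, which is why a transient is easier to pin down in the waveform while a sustained pitch is easier to read off the spectrogram.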

A transformer bottleneck (the same attention mechanism behind GPT) then reconciles information from both domains, letting the model understand long-range dependencies - like the way a sustained guitar note relates to the chord played two seconds earlier.
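
As a rough illustration of the attention idea, here is generic scaled dot-product cross-attention in plain Python. It is a sketch under simplifying assumptions, not the layer used by any particular separation model: there are no learned projections, no multiple heads and no positional information, and the two-dimensional "tokens" are invented.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query returns an average of the values, weighted by key match."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Invented feature vectors, one per moment in the song.
time_tokens = [[4.0, 0.0], [0.0, 4.0]]              # waveform branch
freq_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]  # spectrogram branch

# Waveform tokens ask questions; spectrogram tokens supply the answers.
fused, weights = cross_attention(time_tokens, freq_tokens, freq_tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

The second waveform token splits its attention evenly between the two matching spectrogram tokens regardless of where they sit in the sequence. That position-independent matching is what lets attention relate events that are far apart in time.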

How good are modern models?

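Separation quality is usually reported as Signal-to-Distortion Ratio (SDR), measured in decibels: the higher the score, the closer the separated stem is to the true isolated recording. Strong modern models reach scores like these on the MUSDB18 benchmark: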

Vocals: ~8.4 dB SDR (very high - clean separations with minimal leakage)
Drums: ~7.5 dB SDR
Bass: ~7.0 dB SDR

These numbers translate to stems that are usable in production contexts - good enough to sample from, remix, or use for karaoke backing tracks.
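
For intuition about the scale, the basic definition of SDR compares the energy of the true stem with the energy of the error in the estimate. The sketch below implements only that basic formula; benchmark toolkits typically compute a more elaborate windowed variant, so treat this as an approximation of the idea rather than the exact benchmark metric. The sample values are invented.

```python
import math

def sdr_db(reference, estimate):
    """Basic SDR: 10 * log10(reference energy / error energy), in dB."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_energy / error_energy)

# Invented example: the estimate is the true stem at 90% amplitude.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
example_sdr = sdr_db(reference, estimate)

# Converting a dB score back into a plain energy ratio.
ratio_at_8_4_db = 10 ** (8.4 / 10)

print(round(example_sdr, 2))
print(round(ratio_at_8_4_db, 2))
```

Under this basic definition, a score of about 8.4 dB means the true stem carries roughly seven times the energy of everything that is wrong with the estimate.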

What about songs not in the training data?

These models generalise well to unseen songs because they learn abstract musical patterns (the spectral profile of a snare, the harmonic behaviour of sung vocals) rather than memorising specific tracks. However, rare instruments, very dense mixes, or non-standard tunings can still confuse the model.

Try it yourself

Song Splitter separates your tracks in seconds on GPU infrastructure. Upload a song and see the results for free →