Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

Abstract

Neural-network-based models have significantly improved speech separation from input waveforms in scenarios such as the cocktail party. Prominent methods (e.g., frequency-domain and time-domain speech separation) usually build a regression model that recovers the ground-truth speech as closely as possible, using a masking-based design and a signal-level loss criterion (e.g., MSE or SI-SNR). We propose that a synthesis-based approach can also perform well on this problem, with great flexibility and potential. Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, which converts speech separation/enhancement tasks from a regression problem into a classification problem. Once the discrete symbol sequence has been predicted, a synthesis model that takes discrete symbols as input re-synthesizes each target speech signal. Experimental evaluation on the WSJ0-2mix and VCTK-noisy corpora in various settings shows that the proposed method can steadily synthesize separated speech of good listening quality, with almost none of the interference that is difficult to avoid in regression-based methods. In addition, with very little loss of listening quality, speaker conversion of the enhanced/separated speech can be easily realized with our method.
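To make the contrast between the two paradigms concrete, the following is a minimal sketch of the two kinds of training criterion the abstract mentions. It is illustrative only and is not the implementation used in this work: the function names `si_snr` and `symbol_cross_entropy` are hypothetical, and cross-entropy is assumed here as a typical classification loss because the abstract does not name the one actually used.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Signal-level (regression) criterion: SI-SNR in dB between two waveforms."""
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target; the residual counts as noise.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    s_target = [scale * b for b in t]
    noise = [a - s for a, s in zip(e, s_target)]
    p_signal = sum(s * s for s in s_target)
    p_noise = sum(x * x for x in noise)
    return 10.0 * math.log10((p_signal + eps) / (p_noise + eps))


def symbol_cross_entropy(logits, symbols):
    """Classification criterion (assumed): mean cross-entropy over frames.

    logits  : list of per-frame score lists, one score per discrete symbol
    symbols : list of integer target symbol ids, one per frame
    """
    total = 0.0
    for frame_logits, sym in zip(logits, symbols):
        m = max(frame_logits)
        # Numerically stable log-sum-exp for the softmax normaliser.
        log_z = m + math.log(sum(math.exp(v - m) for v in frame_logits))
        total += log_z - frame_logits[sym]
    return total / len(symbols)
```

In the regression setting the loss compares waveforms sample by sample, so any residual interference lowers the score directly. In the classification setting the loss only asks whether the right symbol was predicted for each frame, and the waveform itself is produced afterwards by the synthesis model.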

Test samples of Speech enhancement

The sound files below are raw noisy wavs and the enhanced speech produced by different methods. All the samples are drawn from the test set of VCTK-noisy.

Notice: here we use either the vctk vocoder or the ljspeech vocoder. The former corresponds to the training corpus we used (vctk-noisy), while the latter corresponds to an external single-speaker corpus and can be used to show the speaker-transfer characteristic of our method.

(1) Sample1:

Noisy Wav:

Conv-TasNet:           MTL-mimic-voicebank:              

Metricgan-voicebank:

Our discrete enhancement (with vctk vocoder / ljspeech vocoder):

(2) Sample2:

Noisy Wav:

Conv-TasNet:           MTL-mimic-voicebank: 

Metricgan-voicebank:

Our discrete enhancement (with vctk vocoder / ljspeech vocoder):

(3) Sample3:

Noisy Wav:

Conv-TasNet:           MTL-mimic-voicebank:  

Metricgan-voicebank:

Our discrete enhancement (with vctk vocoder / ljspeech vocoder):

(4) Sample4:

Noisy Wav:

Conv-TasNet:           MTL-mimic-voicebank:  

Metricgan-voicebank:

Our discrete enhancement (with vctk vocoder / ljspeech vocoder):

(5) Sample5:

Noisy Wav:

Conv-TasNet:           MTL-mimic-voicebank:  

Metricgan-voicebank:

Our discrete enhancement (with vctk vocoder / ljspeech vocoder):

(6) Sample6:

Noisy Wav:

Conv-TasNet:           MTL-mimic-voicebank:  

Metricgan-voicebank:

Our discrete enhancement (with vctk vocoder / ljspeech vocoder):

(7) Sample7:

Noisy Wav:

Conv-TasNet:           MTL-mimic-voicebank:  

Metricgan-voicebank:

Our discrete enhancement (with vctk vocoder / ljspeech vocoder):

Test samples of Speech separation

The sound files below are raw mixture wavs and the separated speech produced by different methods. We show results for the different gender combinations: Female+Female / Female+Male / Male+Male. All the samples are drawn from the test set of WSJ0-2mix.

Female + Female

(1) Sample1:

Mixture Wav:

Conv-TasNet:

Conv-DPRNN:

Our discrete Separation:

(2) Sample2:

Mixture Wav:

Conv-TasNet:

Conv-DPRNN:

Our discrete Separation:

(3) Sample3:

Mixture Wav:

Conv-TasNet:

Conv-DPRNN:

Our discrete Separation:

Female + Male

(1) Sample1:

Mixture Wav:

Conv-TasNet:

Conv-DPRNN:

Our discrete Separation:

(2) Sample2:

Mixture Wav:

Conv-TasNet:

Conv-DPRNN:

Our discrete Separation:

(3) Sample3:

Mixture Wav:

Conv-TasNet:

Conv-DPRNN:

Our discrete Separation:

Male + Male

(1) Sample1:

Mixture Wav:

Conv-TasNet:

Conv-DPRNN:

Our discrete Separation:

(2) Sample2:

Mixture Wav:

Conv-TasNet:

Conv-DPRNN:

Our discrete Separation:

(3) Sample3:

Mixture Wav:

Conv-TasNet:

Conv-DPRNN:

Our discrete Separation: