Abstract

Current state-of-the-art voice conversion (VC) tools rely on neural models trained for hundreds of hours on massive corpora. This approach yields impressive results, but it sacrifices speed, simplicity, and accessibility. In this paper we introduce a simple and fast any-to-any, non-parallel voice conversion tool that performs its task given only a small audio excerpt of the target speaker. We take a modular approach to VC, cascading an automatic speech recognition (ASR) model, used to transcribe the source speech, with a text-to-speech (TTS) model, used to generate the target speech. This design gives a straightforward pipeline, allows the use of readily available models, and opens the door to many extensions. We show that the output is intelligible and that outputs for different target speakers are distinguishable.
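For concreteness, the following is a minimal sketch of such an ASR-to-TTS cascade. The specific model choices (openai-whisper for ASR, Coqui XTTS for zero-shot TTS) and file names are illustrative assumptions, not necessarily the components used in the paper.

# Minimal sketch of the ASR -> TTS voice-conversion cascade.
# Model choices are assumptions for illustration, not the paper's exact setup.
import whisper
from TTS.api import TTS

def convert_voice(source_wav: str, target_excerpt_wav: str, output_wav: str) -> str:
    # 1) ASR: transcribe the source speaker's utterance to text.
    asr_model = whisper.load_model("base")
    text = asr_model.transcribe(source_wav)["text"].strip()

    # 2) TTS: re-synthesize the text in the target speaker's voice,
    #    conditioned only on a short reference excerpt (zero-shot cloning).
    tts_model = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts_model.tts_to_file(
        text=text,
        speaker_wav=target_excerpt_wav,
        language="en",
        file_path=output_wav,
    )
    return output_wav

# Example usage (hypothetical file names):
# convert_voice("source_speaker.wav", "target_excerpt.wav", "converted.wav")

Because the two stages communicate only through text, either model can be swapped for a newer or domain-specific one without retraining the rest of the pipeline.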

Audio examples

Five example sets (Classes 1-5) are provided. Each set contains three audio clips: the source speaker, the target speaker, and the converted output.