TTS-GUIDED TRAINING FOR ACCENT CONVERSION WITHOUT PARALLEL DATA

Speech Samples

(a) parallel-VC : A sequence-to-sequence voice conversion system trained with parallel speech data between the source and target accents. This system cannot preserve the speaker identity.
(b) BNF-AC : This is an accent conversion system that takes bottleneck features as input and generates acoustic features [1].
(c) BNF-PC-AC: This system is an extension of system (b) with a pronunciation correction model [2].
(d) TTS : The multi-speaker TTS system, which is trained with target-accented speech data.
(e) TTS-AC : Our proposed accent conversion system without parallel data.

source: The natural speech from a speaker with the source accent. We aim to change the accent of his/her speech to the target accent while preserving the speech content and his/her identity.
target: The natural speech from a different speaker with the target accent. It is used as a reference. In this work, we define American, British, and Canadian English as the target accents.

Convert Chinese Accent to the Target Accent
System	Chinese female speaker	Chinese male speaker
Source
(a) parallel-VC
(b) BNF-AC
(c) BNF-PC-AC
(d) TTS
(e) TTS-AC
Target (Reference)

Convert Indian Accent to the Target Accent
System	Indian female speaker	Indian male speaker
Source
(a) parallel-VC
(b) BNF-AC
(c) BNF-PC-AC
(d) TTS
(e) TTS-AC
Target (Reference)

References

[1] Guanlong Zhao, Sinem Sonsaat, John Levis, Evgeny Chukharev-Hudilainen, and Ricardo Gutierrez-Osuna, “Accent conversion using phonetic posteriorgrams,” in IEEE ICASSP, 2018, pp. 5314–5318.
[2] Guanlong Zhao, Shaojin Ding, and Ricardo Gutierrez-Osuna, “Converting foreign accent speech without a reference,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2367–2381, 2021.