Speech Samples


The model is evaluated on speech from two databases:
English: VCC2018 [1]
Mandarin: The speech synthesis library of average model provided by Databaker (http://www.data-baker.com).


Two systems are implemented for comparison:
iSE: the cross-lingual VC system using the i-vector based Speaker Embedding (iSE), which serves as our baseline.
SEN: the proposed jointly trained Speaker Embedding Network (SEN).


Enable a Monolingual English Speaker to Speak Mandarin
Audio samples: Sample 1 and Sample 2, each with Source, Target, Baseline iSE, and Proposed SEN recordings.

References

[1] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Speaker Odyssey, 2018.