paint-brush
Zero-shot Voice Conversion: Comparing HierSpeech++ to Other Basemodelsby@fewshot
New Story

Zero-shot Voice Conversion: Comparing HierSpeech++ to Other Basemodels

by The FewShot Prompting Publication 3mDecember 19th, 2024
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

For a fair comparison, we trained all model with the same dataset (LT460, train-clean-460 subsets of LibriTTS) without YourTTS.
featured image - Zero-shot Voice Conversion: Comparing HierSpeech++ to Other Basemodels
The FewShot Prompting Publication  HackerNoon profile picture
0-item

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5.6 Zero-shot Voice Conversion

We compared the voice style transfer performance of HierSpeech++ with other basemodels: 1) AutoVC [66], which is an autoencoder-based non-autoregressive VC model using an information bottleneck to disentangle the content and style, 2) VoiceMixer [46], which is a GAN-based parallel VC model using similarity-based information bottleneck, 3- 5) Diffusion-based models (DiffVC [64], Diff-HierVC [11], and DDDM-VC [10]), 6) YourTTS [7], VITS-based end-to-end VC models utilizing phoneme sequences to extract content information, 7) HierVST [45], hierspeech-based end-to-end VC model using hierarchical style adaptation. For a fair comparison, we trained all model with the same dataset (LT460, train-clean-460 subsets of LibriTTS) without YourTTS. We utilized the official implementation of YourTTS which was trained with an additional dataset. We also trained the model with a large-scale dataset such as LT-960, all training subsets of LibriTTS) and additional datasets to verify the effectiveness of scaling-up the dataset.


For the subjective objective, TABLE 5 demonstrates that our model significantly improves the naturalness and similarity of the converted speech in terms of nMOS and sMOS. We found that our model with a large-scale dataset showed a better naturalness than ground-truth speech.


However, the results also showed that increasing the dataset without filtering noisy data slightly decreased an audio quality in terms of nMOS and UTMOS but the similarity increased consistently according to the data scale. In addition, WER also showed better performance than the other models. Furthermore, the results of the similarity measurement show that our models perform better in terms of the EER and SECS. Moreover, the results verified that HierSpeech++ which was trained with a large-scale dataset is a much stronger zero-shot speech synthesizer.


We will include zero-shot cross-lingual voice style transfer results and additional zero-shot voice conversion results with noisy speech prompts on our demo page. We highly recommend listening to the demo samples and will release the source code of the hierarchical speech synthesizer for a strong zero-shot speech synthesizer. Furthermore, we can also upsample the audio using SpeechSR from 16 kHz to 48 kHz, which can simply improve the perceptual audio quality as described in Section 5.10.


TABLE 6: Temperature parameter search. For TTS, we have two controllable temperature, Tttv of TTV and Th of hierarchical speech synthesizer. We utilize a HierSpeech++ trained with LT-960 for this experiment and fix the random seed. We found that low temperatures improve the robustness of synthetic speech in terms of CER and WER. However, if you hope to generate diverse and expressive speech, you could choose high temperatures. The CER and WER of groundtruth is 2.31 and 4.13, respectively. UTMOS is presented with standard deviation.


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea and a Corresponding author.