ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
This is the code for ZMM-TTS, submitted to IEEE TASLP. The paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper is the first to incorporate the representations from text-based and speech-based self-supervised learning models into multilingual speech synthesis tasks. We conducted comprehensive subjective and objective evaluations through a series of experiments. Our model proved effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the efficiency of our method on two hypothetical low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.
You are welcome to try our code and pre-trained models on different languages!
- [20/01] 🔥 We released the code and models pre-trained on public datasets in six languages (English, French, German, Portuguese, Spanish and Swedish).
Samples are provided on our demo page.
ZMM-TTS requires Python>=3.8 and a reasonably recent version of PyTorch. To install ZMM-TTS and run a quick synthesis, you can run the following commands:
git clone https://github.com/nii-yamagishilab-visitors/ZMM-TTS.git
cd ZMM-TTS
pip3 install -r requirements.txt
# In addition, you may need to install these libraries for full functionality.
pip install transformers # For XLSR-53 and XPhoneBERT model support.
pip install speechbrain # For extracting speaker embeddings.
If you want to try IPA representations, you need to install Epitran.
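After installation, a quick way to confirm that the libraries mentioned above are importable is a small check like the following. This is only a sketch: the module names (`torch`, `transformers`, `speechbrain`, `epitran`) are assumed from the package names above, and the helper `missing_modules` is hypothetical, not part of this repository.

```python
import importlib.util

# Module names are assumptions based on the packages named in this section.
REQUIRED = ["torch", "transformers", "speechbrain"]
OPTIONAL = ["epitran"]  # only needed for IPA representations


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_modules(names)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

The check only looks up each module without importing it, so it runs quickly even when the heavy libraries are installed.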
Model | Modality | Lang | Training data |
---|---|---|---|
XLSR-53 | Audio | 53 | 56K hours |
ECAPA-TDNN | Audio | > 5 | 2794 hours |
XPhoneBERT | Text | 94 | 330M sentences |
In our paper, the training data included GlobalPhone, which unfortunately is not an open-source dataset.
Given the scarcity of publicly available multilingual and multispeaker databases for speech synthesis, we designed the following training database, called MM6, based on the MLS and NST Swedish databases. (The NST Swedish data no longer seems to be open for download, in which case you should request it from The Norwegian Language Bank.) If you have the GlobalPhone dataset, you can use the same training data as our paper, listed in Dataset/train_paper.txt.
Language | Gender | Speakers | Sentences | Durations (h) | Database |
---|---|---|---|---|---|
English | Female | 20 | 4000 | 13.9 | MLS |
English | Male | 20 | 4000 | 13.9 | MLS |
French | Female | 20 | 4000 | 13.9 | MLS |
French | Male | 20 | 4000 | 13.9 | MLS |
German | Female | 20 | 4000 | 13.9 | MLS |
German | Male | 20 | 4000 | 13.9 | MLS |
Portuguese | Female | 16 | 3741 | 13.0 | MLS |
Portuguese | Male | 20 | 4175 | 14.5 | MLS |
Spanish | Female | 20 | 3519 | 12.2 | MLS |
Spanish | Male | 20 | 3786 | 13.1 | MLS |
Swedish | Female | 0 | 0 | 0 | |
Swedish | Male | 20 | 4000 | 13.9 | NST |
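As a quick sanity check on the table above, the following sketch sums speakers, sentences and hours per language. The rows are copied from the table; the names `MM6_ROWS` and `totals_by_language` are introduced here for illustration only.

```python
# Rows copied from the MM6 table: (language, gender, speakers, sentences, hours).
MM6_ROWS = [
    ("English", "Female", 20, 4000, 13.9),
    ("English", "Male", 20, 4000, 13.9),
    ("French", "Female", 20, 4000, 13.9),
    ("French", "Male", 20, 4000, 13.9),
    ("German", "Female", 20, 4000, 13.9),
    ("German", "Male", 20, 4000, 13.9),
    ("Portuguese", "Female", 16, 3741, 13.0),
    ("Portuguese", "Male", 20, 4175, 14.5),
    ("Spanish", "Female", 20, 3519, 12.2),
    ("Spanish", "Male", 20, 3786, 13.1),
    ("Swedish", "Female", 0, 0, 0.0),
    ("Swedish", "Male", 20, 4000, 13.9),
]


def totals_by_language(rows):
    """Return {language: (speakers, sentences, hours)} summed over genders."""
    totals = {}
    for language, _gender, speakers, sentences, hours in rows:
        entry = totals.setdefault(language, [0, 0, 0.0])
        entry[0] += speakers
        entry[1] += sentences
        entry[2] += hours
    # Round hours to one decimal to hide floating-point noise.
    return {lang: (s, n, round(h, 1)) for lang, (s, n, h) in totals.items()}


if __name__ == "__main__":
    for language, (speakers, sentences, hours) in totals_by_language(MM6_ROWS).items():
        print(f"{language}: {speakers} speakers, {sentences} sentences, {hours} h")
```

Summed over all rows, the table gives 216 speakers, 43,221 sentences and about 150.1 hours of audio.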
You can generate the MM6 dataset with the following download and normalization scripts:
bash scripts/download.sh # Download the MLS data.
python prepare_data/creat_meta_data_mls.py # Generate speaker-, gender- and language-balanced data.
# We recommend using sv56 to normalize the MLS audio.
bash scripts/norm_wav.sh
Please contact The Norwegian Language Bank if you want to get the NST Swedish data, and extract it to Dataset/origin_data/.
Alternatively, you can simply exclude the Swedish language.
# The Swedish audio is already normalized.
python prepare_data/creat_meta_data_swe.py
MM6 is a multilingual dataset with a largely balanced mix of speakers and genders, and we encourage you to use it for other tasks as well.
After you download and normalize the audio, the Dataset folder should look like this:
|--Dataset
   |--MM6
      |--wavs #Store audio files
   |--preprocessed_data #Store preprocessed data: text, features,...
      |--MM6
         |--train.txt
You can find the wav files in Dataset/MM6/wavs/ and the meta file in Dataset/preprocessed_data/ZMM6/train.txt.
The train.txt looks like:
Name|Database|Language|Speaker|text
7756_9025_000004|MM6|English|7756|on tiptoe also i followed him and just as his hands were on the wardrobe door my hands were on his throat he was a little man and no match for me
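The pipe-separated layout shown above can be read with a few lines of Python. This is a minimal sketch, not the loader used by the training scripts: the `Utterance` type and the `parse_meta_line` and `read_meta` helpers are hypothetical names, and it is an assumption that the file may start with the `Name|Database|Language|Speaker|text` header line.

```python
from typing import List, NamedTuple

# Header as shown in the example above (assumed to possibly appear in the file).
HEADER = "Name|Database|Language|Speaker|text"


class Utterance(NamedTuple):
    name: str
    database: str
    language: str
    speaker: str
    text: str


def parse_meta_line(line: str) -> Utterance:
    """Split one metadata line into its five fields."""
    # maxsplit=4 keeps any '|' inside the transcript in the text field.
    fields = line.rstrip("\n").split("|", 4)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return Utterance(*fields)


def read_meta(path: str) -> List[Utterance]:
    """Read a metadata file, skipping blank lines and the header if present."""
    utterances = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped or stripped == HEADER:
                continue
            utterances.append(parse_meta_line(stripped))
    return utterances
```

For example, `parse_meta_line` applied to the sample line above yields an entry whose `language` is `English` and whose `speaker` is `7756`.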
- Extract discrete code indices and representations:
bash scripts/extract_discrete.sh
- Extract speaker embeddings:
bash scripts/extract_spk.sh
- Extract text sequences:
python prepare_data/extract_text_seq_from_raw_text.py
- Extract mel spectrograms:
python prepare_data/compute_mel.py
- Compute a priori alignment probabilities:
python prepare_data/compute_attention_prior.py
- Train txt2vec model:
# Using XPhoneBERT:
python txt2vec/train.py --dataset MM6 --config MM6_XphoneBERT
# Using characters (letters):
python txt2vec/train.py --dataset MM6 --config MM6_Letters
# Using IPA:
python txt2vec/train.py --dataset MM6 --config MM6_IPA
# To train a model without a language layer, use the corresponding xxx_wo config, for example:
python txt2vec/train.py --dataset MM6 --config MM6_XphoneBERT_wo
NOTE: When you use XPhoneBERT, please set needUpdate: True in model.yaml after the first quarter of the training iterations.
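Flipping the flag by hand works fine; if you prefer to script it, a sketch along these lines may help. It assumes that `needUpdate` appears as a plain `key: value` line somewhere in model.yaml (the exact position and nesting are not documented here), and the helper name `enable_need_update` is hypothetical. A regular expression is used instead of a YAML parser so that comments and formatting in the file are left untouched.

```python
import re

# Matches a line such as "  needUpdate: False  # comment" (assumed layout).
_NEED_UPDATE = re.compile(r"^([ \t]*needUpdate[ \t]*:[ \t]*)\S+(.*)$", re.MULTILINE)


def enable_need_update(yaml_text: str) -> str:
    """Return the YAML text with every needUpdate value set to True."""
    new_text, count = _NEED_UPDATE.subn(r"\g<1>True\g<2>", yaml_text)
    if count == 0:
        raise KeyError("no needUpdate entry found")
    return new_text


def enable_need_update_in_file(path: str) -> None:
    """Rewrite the file at `path` in place (path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(enable_need_update(text))
```

The function raises an error rather than silently doing nothing when the key is missing, which makes a wrong file path or a renamed option easier to notice.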
- Train vec2mel model:
python vec2mel/train.py --dataset MM6 --config MM6
To train the txt2vec and vec2mel models, we used a batch_size of 16 and trained for 1.2M steps. Training took about 3 days on one Tesla A100 GPU.
- Train vec2wav model:
python prepare_data/creat_lists.py
python vec2wav/train.py -c Config/vec2wav/vec2wav.yaml
#If you want to train a model without a language layer:
python vec2wav/train.py -c Config/vec2wav/vec2wav_wo.yaml
To train vec2wav, we used a batch_size of 16 and trained for 1M steps. Training took about 3 days on one Tesla A100 GPU.
- Train HifiGAN model:
python Vocoder_HifiGAN_Model/train.py --config Config/config_16k_mel.json
To train HifiGAN, we used a batch_size of 16 and trained for 1M steps. Training took about 3 days on one Tesla A100 GPU.
- Prepare test data:
- a. Test meta file: Dataset/MM6/test.txt.
- b. Reference speaker embeddings in Dataset/MM6/test_spk_emb/.
- Generate samples:
bash test_scripts/quick_test.sh
Alternatively, you can download our pre-trained models from Google Drive and put them in the corresponding Train_log directory. The training logs can be found in the corresponding Train_log files.
- The results can be found in the test_result files.
- Scripts for few-shot training.
- Scripts for zero-shot inference on any language.
If you use this code, these results, or the MM6 dataset in your paper, please cite our work as:
@article{gong2023zmm,
title={ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations},
author={Gong, Cheng and Wang, Xin and Cooper, Erica and Wells, Dan and Wang, Longbiao and Dang, Jianwu and Richmond, Korin and Yamagishi, Junichi},
journal={arXiv preprint arXiv:2312.14398},
year={2023}
}
- Comprehensive-Transformer-TTS, the txt2vec and vec2mel models were built on this project.
- XPhoneBERT, a Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech.
- MSMC-TTS, the vec2wav model was built on this project.
- HifiGAN, Vocoder.
- wav2vec2-codebook-indices, the scripts for extracting the discrete code index and representations.
The code in this repository is released under the BSD-3-Clause license as found in the LICENSE file.
The txt2vec, vec2mel and vec2wav subfolders are under the MIT License.
The sv56scripts subfolder is under the GPL License.