SpeechCraft

This is the official repository of the ACM Multimedia 2024 paper "SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description".

For details of the pipeline and the dataset, please refer to our Paper and Demo Page.

News

[2024-09-26]: Structured metadata (pitch, energy, speed, age, gender, emotion tone, emphasis, topic/category, and transcript) has been made available to facilitate further enhancements and augmentations of the dataset.

[2024-12-20]: The code and checkpoint of the annotation pipeline have been released.

SpeechCraft Dataset

1. Download Speech Corpus

| Language | Speech Corpus | #Duration | #Clips |
| --- | --- | --- | --- |
| ZH | Zhvoice | 799.68h | 1,020,427 |
| ZH | AISHELL-3 | 63.70h | 63,011 |
| EN | GigaSpeech-M | 739.91h | 670,070 |
| EN | LibriTTS-R | 548.88h | 352,265 |

2. Download Speech Annotation

| Language | Description | Instruction | Labels |
| --- | --- | --- | --- |
| ZH | download | download | download |
| EN | download | download | download |
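Once downloaded, the annotation files can be paired with the corresponding audio clips. The snippet below is a minimal sketch assuming the descriptions are distributed as a single JSON file mapping clip names to description strings; the file name, key layout, and audio directory are hypothetical, so check the downloaded files for the actual format.

    # Minimal sketch: pair downloaded descriptions with audio clips.
    # "speechcraft_en_description.json", the key layout, and AUDIO_ROOT are
    # hypothetical placeholders, not the guaranteed release format.
    import json
    from pathlib import Path

    AUDIO_ROOT = Path("corpus/LibriTTS-R")  # local path to the speech corpus

    with open("speechcraft_en_description.json", encoding="utf-8") as f:
        descriptions = json.load(f)

    for clip_name, text in list(descriptions.items())[:5]:
        wav_path = AUDIO_ROOT / f"{clip_name}.wav"
        print(wav_path, "->", text)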

3. Labels and Prompts

EN Version

  • --gender: Male, Female
  • --age: Child, Teenager, Youth adult, Middle-aged, Elderly
  • --pitch: low, normal, high
  • --speed: slow, normal, fast
  • --volume: low, normal, high
  • --emotion (English): Fearful, Happy, Disgusted, Sad, Surprised, Angry, Neutral
  • --emphasis: Non-label words
  • --transcript: Non-label sentence
  • --LLM Prompt:
Note: You must vividly describe the sentence’s intonation, pitch, tone, and emotion. All outputs must strictly avoid identical wording and sentence structure. There is no need to describe body language or psychological state and do not repeat the input content.
Refer to the format of the following four cases:

*Example Input - Example Output*

Now try to process the following sentences, directly output the converted sentences according to the examples without missing any labels.

ZH Version

  • --age (年龄): 儿童 (Child), 少年 (Teenager), 青年 (Young adult), 中年 (Middle-aged), 老年 (Elderly)
  • --gender (性别): 男 (Male), 女 (Female)
  • --speed (语速): 快 (fast), 中 (medium), 慢 (slow)
  • --pitch (音高): 高 (high), 中 (medium), 低 (low)
  • --volume (音量): 高 (high), 中 (medium), 低 (low)
  • --emphasis (重读): Non-label words (无标签, 字词)
  • --tone (语气): Non-label sentence (无标签, 自然语句)
  • --transcript (文本): Non-label sentence (无标签, 自然语句)
  • --LLM Prompt:
请参照以下转换案例,使用中文自然语言描述一个人按照给定风格属性,如音高、音量、年龄、性别、语调,来说文本中的话。注意,仅描述说话风格,不需要描述肢体动作或心理状态,不要重复输入的内容。
(English translation: Referring to the following conversion cases, use natural Chinese to describe a person speaking the given text according to the given style attributes, such as pitch, volume, age, gender, and tone. Note: describe only the speaking style; do not describe body language or psychological state, and do not repeat the input content.)

*Example Input - Example Output*

现在尝试处理以下句子,根据示例直接输出转换后的句子,不要遗漏任何标签。
(English translation: Now try to process the following sentences and directly output the converted sentences according to the examples, without missing any labels.)
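For rewriting, the structured labels above are serialized into the LLM prompt together with the instruction and the example pairs. The sketch below shows one plausible way to assemble the EN prompt; the field order, dict keys, and sample label values are assumptions rather than the exact format used by the pipeline.

    # Illustrative assembly of an EN rewriting prompt from structured labels.
    # Field names, ordering, and the sample labels are assumptions.
    PROMPT_HEADER = (
        "Note: You must vividly describe the sentence's intonation, pitch, tone, and emotion. "
        "All outputs must strictly avoid identical wording and sentence structure. "
        "There is no need to describe body language or psychological state and do not repeat the input content.\n"
        "Refer to the format of the following four cases:\n"
        "<example input - example output pairs go here>\n"
        "Now try to process the following sentences, directly output the converted sentences "
        "according to the examples without missing any labels.\n"
    )

    def build_prompt(labels: dict) -> str:
        """Append one clip's labels to the instruction header."""
        fields = ["gender", "age", "pitch", "speed", "volume", "emotion", "emphasis", "transcript"]
        label_line = ", ".join(f"{k}: {labels[k]}" for k in fields if k in labels)
        return PROMPT_HEADER + label_line

    print(build_prompt({
        "gender": "Female", "age": "Youth adult", "pitch": "high", "speed": "fast",
        "volume": "normal", "emotion": "Happy", "emphasis": "really",
        "transcript": "I really did not expect that.",
    }))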

4. Request Access to Emphasis Speech Dataset

Since we do not own the copyright of the original audio files, access to our regenerated version is provided under certain terms and conditions to researchers and educators who wish to use the audio for non-commercial research and/or educational purposes. To apply for AISHELL-3 and LibriTTS-R with fine-grained keyword emphasis, please fill out the EULA form at Emphasis-SpeechCraft-EULA.pdf and send the scanned copy to [email protected]. Once approved, you will be supplied with a download link. ([2024-09-26]: metadata updated.)

Please first refer to some emphasis examples provided here. We are actively working on improving methods for large-scale fine-grained data construction that align with human perception.

| Language | Speech Corpus | #Duration | #Clips |
| --- | --- | --- | --- |
| ZH | AISHELL-3-stress | 50.59h | 63,258 |
| EN | LibriTTS-R-stress | 148.78h | 75,654 |

Annotation Pipeline

Step 0 : Installation

  1. Download models for speech style recognition (a minimal loading sketch follows this installation list).

    Models from 🤗:

    llama_base_model = "baichuan-inc/Baichuan2-13B-Base"
    gender_model_path = "alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech"
    age_model_path = "audeering/wav2vec2-large-robust-24-ft-age-gender"
    asr_path = "openai/whisper-medium" / "openai/whisper-large-v3"
    

    Model from funasr (for English emotion classification):

    emotion_model = "iic/emotion2vec_base_finetuned"
    

    Prepare SECap from here (for Chinese emotion captioning).

  2. Create conda environment

    conda env create -f ./requirements.yaml
    mv ./AutomaticPipeline/models/SECap/model2.py $your_SECap_dir
    
  3. Download the LoRA checkpoint from here and place it at ./llama-ft/finetuned-llama/ for description rewriting.

    Remember to set the base LLM checkpoint path in the "base_model_name_or_path" field of ./llama-ft/finetuned-llama/adapter_config.json.
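
Before running the pipeline, it can be useful to check that the models downloaded in item 1 load correctly. Below is a minimal sketch using the transformers pipeline API; it assumes the gender model works with the standard audio-classification pipeline and only covers loading plus a single test clip, not the pipeline's actual inference code.

    # Sanity check: load a few Step 0 models and run them on one clip.
    # "example.wav" is a placeholder path; any 16 kHz mono clip will do.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium")
    gender = pipeline(
        "audio-classification",
        model="alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech",
    )

    wav = "example.wav"
    print(asr(wav)["text"])   # transcript
    print(gender(wav))        # [{'label': ..., 'score': ...}, ...]

    # English emotion classification via funasr (see the funasr docs for details):
    # from funasr import AutoModel
    # emo = AutoModel(model="iic/emotion2vec_base_finetuned")
    # print(emo.generate(wav, granularity="utterance", extract_embedding=False))

    # The audeering age/gender model and SECap require the loading code described
    # in their own model cards/repositories.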

Step 1 : Labeling with the Automatic Annotation Pipeline

  1. Get the scp file with raw scores for the audio corpus.

    cd ./AutomaticPipeline
    python AutoPipeline.py
    
  2. Get the JSON file with the classified results, which is used as input for the description rewriting (an illustrative bucketing sketch follows).

    python Clustering.py
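
Conceptually, this step turns the continuous raw scores (e.g., pitch, speed, volume) into the three-way categorical labels used in the descriptions. The sketch below illustrates one simple percentile-based bucketing; the actual statistics, speaker grouping, and thresholds are defined in Clustering.py and may differ.

    # Illustrative only: bucket raw per-clip scores into low/normal/high.
    # Clustering.py defines the real thresholds and grouping.
    import numpy as np

    def bucket(scores, labels=("low", "normal", "high")):
        lo, hi = np.percentile(scores, [33, 67])
        return [labels[0] if s < lo else labels[2] if s > hi else labels[1] for s in scores]

    raw_pitch = [92.1, 110.5, 180.3, 240.7, 130.0]  # hypothetical mean F0 values (Hz)
    print(bucket(raw_pitch))  # ['low', 'low', 'high', 'high', 'normal']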
    

Step 2 : Rewriting with the Finetuned Llama

cd ../llama-ft
python llama_infer.py
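
llama_infer.py loads the base LLM together with the finetuned LoRA adapter from Step 0 and rewrites each clip's classified labels into a natural-language description. A minimal sketch of that loading pattern with transformers and peft is shown below; the prompt content and generation settings are placeholders, and llama_infer.py remains the reference implementation.

    # Minimal sketch: base LLM + LoRA adapter via transformers/peft.
    # The prompt string and generation settings are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = "baichuan-inc/Baichuan2-13B-Base"
    tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        base, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
    )
    model = PeftModel.from_pretrained(model, "./llama-ft/finetuned-llama")

    prompt = "<classified labels for one clip go here>"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))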

Citation

Please cite our paper if you find this work useful:

@inproceedings{jin2024speechcraft,
  title={SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description},
  author={Zeyu Jin and Jia Jia and Qixin Wang and Kehan Li and Shuoyi Zhou and Songtao Zhou and Xiaoyu Qin and Zhiyong Wu},
  booktitle={ACM Multimedia 2024},
  year={2024},
  url={https://openreview.net/forum?id=rjAY1DGUWC}
}
