Modalities beyond text
In addition to text, TencentPretrain supports vision, audio, and cross-modal pre-training models. This section shows how to use TencentPretrain to pre-train and fine-tune models of different modalities.
An example of pre-training a ViT model on the CIFAR10 dataset:
python3 preprocess.py --corpus_path datasets/cifar10/train.tsv --tokenizer virtual \
--dataset_path dataset.pt --processes_num 8 --data_processor vit
python3 pretrain.py --dataset_path dataset.pt --tokenizer virtual \
--pretrained_model_path models/vit_base_patch16_224_model.bin \
--config_path models/vit/base-16-224_config.json \
--output_model_path models/cifar10_vit_base_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 2000 --save_checkpoint_steps 1000 --batch_size 32 \
--labels_num 10
vit_base_patch16_224_model.bin can be found in the Model Zoo section. --tokenizer virtual is specified because image input contains no text and therefore needs no real tokenizer.
An example of fine-tuning and running inference on the CIFAR10 dataset:
python3 finetune/run_image_classifier.py --pretrained_model_path models/vit_base_patch16_224_model.bin \
--tokenizer virtual \
--config_path models/vit/base-16-224_config.json \
--train_path datasets/cifar10/train.tsv \
--dev_path datasets/cifar10/test.tsv \
--output_model_path models/image_classifier_model.bin \
--epochs_num 3 --batch_size 64
python3 inference/run_image_classifier_infer.py --load_model_path models/image_classifier_model.bin \
--tokenizer virtual \
--config_path models/vit/base-16-224_config.json \
--test_path datasets/cifar10/test.tsv \
--prediction_path datasets/cifar10/prediction.tsv \
--labels_num 10
The CIFAR10 dataset has 10 labels, hence --labels_num 10.
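To make the --tokenizer virtual remark more concrete, the sketch below computes how many sequence positions a ViT encoder sees for one image. It assumes the standard ViT layout (non-overlapping square patches plus one class token) and reads the image size 224 and patch size 16 off the config file name base-16-224_config.json; the function name is hypothetical and not part of TencentPretrain.

```python
def vit_sequence_length(image_size: int, patch_size: int, add_cls: bool = True) -> int:
    """Number of positions a ViT encoder sees for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # Assumption: one extra position for the class token, as in standard ViT.
    return num_patches + (1 if add_cls else 0)


print(vit_sequence_length(224, 16))  # 14 * 14 patches + 1 class token = 197
```

The patches play the role that tokens play for text, which is why no vocabulary-based tokenizer is involved.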
An example of pre-training an S2T model on the LibriSpeech dataset. First, change models/special_tokens_map.json to models/xlmroberta_special_tokens_map.json in the file tencentpretrain/utils/constants.py. Then convert the data into a format that TencentPretrain can handle (a prepared 10h version is provided in the Downstream datasets section):
python3 scripts/prepare_librispeech_data.py --input_path datasets/librispeech/train-10h \
--output_path datasets/librispeech/train-10h.tsv
Then preprocess and pre-train:
python3 preprocess.py --corpus_path datasets/librispeech/train-10h.tsv \
--spm_model_path models/sentencepiece.bpe.model \
--dataset_path dataset.pt \
--processes_num 8 --data_processor s2t
python3 pretrain.py --dataset_path dataset.pt \
--spm_model_path models/sentencepiece.bpe.model \
--config_path models/s2t/small_config.json \
--output_model_path models/output_model.bin \
--accumulation_steps 8 \
--world_size 4 --gpu_ranks 0 1 2 3 \
--total_steps 100000 --save_checkpoint_steps 10000 --report_steps 100 \
--batch_size 8 --learning_rate 2e-3
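As a rough guide to how the flags in the pre-training command combine, the sketch below computes the number of samples contributing to one optimizer update. It assumes that --batch_size is a per-GPU value and that gradients are accumulated over --accumulation_steps batches before each update; the helper name is hypothetical.

```python
def effective_batch_size(batch_size: int, world_size: int, accumulation_steps: int = 1) -> int:
    """Samples per optimizer update, assuming batch_size is per GPU."""
    return batch_size * world_size * accumulation_steps


# Values from the command above: --batch_size 8, --world_size 4, --accumulation_steps 8
print(effective_batch_size(8, 4, 8))  # 256
```

Under these assumptions, running on fewer GPUs while raising --accumulation_steps proportionally keeps the effective batch size unchanged.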
To fine-tune an S2T model, pass --add_column when preparing the dataset so that the column names are written to the first line. An example of fine-tuning on the LibriSpeech dataset:
python3 scripts/prepare_librispeech_data.py --input_path datasets/librispeech/train-10h \
--output_path datasets/librispeech/train-10h.tsv \
--add_column
python3 scripts/prepare_librispeech_data.py --input_path datasets/librispeech/dev-clean \
--output_path datasets/librispeech/dev-clean.tsv \
--add_column
python3 finetune/run_speech2text.py --pretrained_model_path models/output_model.bin \
--spm_model_path models/sentencepiece.bpe.model \
--config_path models/s2t/small_config.json \
--train_path datasets/librispeech/train-10h.tsv \
--dev_path datasets/librispeech/dev-clean.tsv \
--output_model_path models/finetuned_model.bin \
--batch_size 8 --epochs_num 10 \
--learning_rate 2e-4 --report_steps 200
During inference, beam search is applied; the beam size can be adjusted with --beam_width:
python3 scripts/prepare_librispeech_data.py --input_path datasets/librispeech/test-clean \
--output_path datasets/librispeech/test-clean.tsv \
--add_column
python3 inference/run_speech2text_infer.py --load_model_path models/finetuned_model.bin \
--spm_model_path models/sentencepiece.bpe.model \
--config_path models/s2t/small_config.json \
--test_path datasets/librispeech/test-clean.tsv \
--prediction_path output.txt \
--batch_size 8 --tgt_seq_length 100 \
--beam_width 5
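For readers unfamiliar with beam search, here is a minimal generic sketch of the idea behind --beam_width. It is not the TencentPretrain implementation: the scoring callback, the toy probability table, and the token ids are all made up for illustration, and no length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple


def beam_search(
    step_log_probs: Callable[[Tuple[int, ...]], Dict[int, float]],
    bos: int,
    eos: int,
    beam_width: int,
    max_len: int,
) -> List[int]:
    """Return the highest-scoring sequence found with the given beam width."""
    # Each hypothesis is (cumulative log-probability, token sequence).
    beams = [(0.0, (bos,))]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, logp in step_log_probs(seq).items():
                candidates.append((score + logp, seq + (token,)))
        # Keep only the beam_width best extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached EOS.
    best = max(finished or beams, key=lambda c: c[0])
    return list(best[1])


# Toy model (hypothetical): next-token probabilities depend only on the last token.
BOS, EOS, A, B = 0, 1, 2, 3
TABLE = {
    BOS: {A: 0.6, B: 0.4},
    A: {EOS: 0.55, B: 0.45},
    B: {EOS: 0.9, A: 0.1},
}


def toy_step(seq: Tuple[int, ...]) -> Dict[int, float]:
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}


print(beam_search(toy_step, BOS, EOS, beam_width=1, max_len=5))  # [0, 2, 1]
print(beam_search(toy_step, BOS, EOS, beam_width=2, max_len=5))  # [0, 3, 1]
```

With a width of 1 the search is greedy and commits to the locally best first token (total probability 0.33), while a width of 2 keeps the alternative alive and finds the better sequence (0.36). Larger widths explore more hypotheses at a higher decoding cost.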
When running inference with an S2T model downloaded from Hugging Face, first change "cls_token": "<s>" to "cls_token": "</s>" in models/special_tokens_map.json, then convert the model and run inference:
python3 scripts/convert_s2t_from_huggingface_to_tencentpretrain.py --input_model_path s2t_huggingface_model.bin \
--output_model_path s2t_tencentpretrain_model.bin
python3 inference/run_speech2text_infer.py --load_model_path s2t_tencentpretrain_model.bin \
--spm_model_path models/sentencepiece.bpe.model \
--config_path models/s2t/small_config.json \
--test_path datasets/librispeech/test-clean.tsv \
--prediction_path output.txt \
--batch_size 8 --tgt_seq_length 100 \
--beam_width 5
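The cls_token change described above can be made by hand, or scripted as in the following sketch. It assumes the special tokens map is a flat JSON object with a "cls_token" key, as the quoted snippet suggests; the helper name set_cls_token is hypothetical and not part of TencentPretrain.

```python
import json


def set_cls_token(map_path: str, new_token: str = "</s>") -> dict:
    """Rewrite the cls_token entry of a special-tokens JSON file in place."""
    with open(map_path, encoding="utf-8") as f:
        tokens = json.load(f)
    tokens["cls_token"] = new_token
    with open(map_path, "w", encoding="utf-8") as f:
        json.dump(tokens, f, ensure_ascii=False, indent=2)
    return tokens


# Example call (run from the repository root):
# set_cls_token("models/special_tokens_map.json")
```

Calling set_cls_token("models/special_tokens_map.json", "<s>") afterwards restores the original value, which may matter if other models use the same file.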