Preprocess the data
usage: preprocess.py [-h] --corpus_path CORPUS_PATH
                     [--dataset_path DATASET_PATH]
                     [--tokenizer {bert,bpe,char,space,xlmroberta,image,text_image}]
                     [--vocab_path VOCAB_PATH] [--merges_path MERGES_PATH]
                     [--spm_model_path SPM_MODEL_PATH]
                     [--do_lower_case {true,false}]
                     [--vqgan_model_path VQGAN_MODEL_PATH]
                     [--vqgan_config_path VQGAN_CONFIG_PATH]
                     [--tgt_tokenizer {bert,bpe,char,space,xlmroberta}]
                     [--tgt_vocab_path TGT_VOCAB_PATH]
                     [--tgt_merges_path TGT_MERGES_PATH]
                     [--tgt_spm_model_path TGT_SPM_MODEL_PATH]
                     [--tgt_do_lower_case {true,false}]
                     [--processes_num PROCESSES_NUM]
                     [--data_processor {bert,lm,mlm,bilm,albert,mt,t5,cls,prefixlm,gsg,bart,cls_mlm,vit,vilt,clip,s2t,beit,dalle}]
                     [--docs_buffer_size DOCS_BUFFER_SIZE]
                     [--seq_length SEQ_LENGTH]
                     [--tgt_seq_length TGT_SEQ_LENGTH]
                     [--dup_factor DUP_FACTOR]
                     [--short_seq_prob SHORT_SEQ_PROB] [--full_sentences]
                     [--seed SEED] [--dynamic_masking] [--whole_word_masking]
                     [--span_masking] [--span_geo_prob SPAN_GEO_PROB]
                     [--span_max_length SPAN_MAX_LENGTH]
                     [--sentence_selection_strategy {lead,random}]
Users must preprocess the corpus before pre-training. An example of pre-processing on a single machine:
python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 --dynamic_masking --data_processor bert
The output of the pre-processing stage is dataset.pt (--dataset_path), which serves as the input to pretrain.py. If multiple machines are available, users can run preprocess.py on one machine and copy dataset.pt to the others.
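When the same pre-processing step has to be scripted for several corpora, the command shown above can be assembled programmatically. The following is a minimal sketch: the helper name `build_preprocess_command` is hypothetical and not part of TencentPretrain, and it only uses flags that appear in the usage message and the example command.

```python
import shlex

def build_preprocess_command(corpus_path, vocab_path, dataset_path="dataset.pt",
                             data_processor="bert", processes_num=8,
                             dynamic_masking=True):
    """Assemble the argument list for preprocess.py (hypothetical helper)."""
    cmd = [
        "python3", "preprocess.py",
        "--corpus_path", corpus_path,
        "--vocab_path", vocab_path,
        "--dataset_path", dataset_path,
        "--processes_num", str(processes_num),
        "--data_processor", data_processor,
    ]
    if dynamic_masking:
        # Boolean switch: present means masking happens in the pre-training stage.
        cmd.append("--dynamic_masking")
    return cmd

cmd = build_preprocess_command("corpora/book_review_bert.txt",
                               "models/google_zh_vocab.txt")
print(shlex.join(cmd))  # shell-quoted form of the same command
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root; the sketch itself only builds and prints the command.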
The format of the dataset.pt generated by the pre-processing stage must be specified with --data_processor, since different pre-training models require different data formats in the pre-training stage. TencentPretrain currently supports the formats of many pre-training models, for example:
- lm: language model
- mlm: masked language model
- cls: classification
- bilm: bi-directional language model
- bert: masked language model + next sentence prediction
- albert: masked language model + sentence order prediction
- prefixlm: prefix language model
Note that the format of the corpus (--corpus_path) must match the chosen --data_processor. More use cases can be found in Pretraining model examples.
--processes_num n specifies that n processes are used for pre-processing. More processes speed up the pre-processing stage but consume more memory.
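To make the speed and memory trade-off concrete, the sketch below shows one plausible way to divide a corpus among n workers: each worker receives a contiguous range of lines and holds its share in memory while working. This is an illustrative assumption, not necessarily how preprocess.py splits the work, and `split_line_ranges` is a hypothetical helper.

```python
def split_line_ranges(num_lines, processes_num):
    """Return one (start, end) line range per worker, covering [0, num_lines)."""
    base, extra = divmod(num_lines, processes_num)
    ranges, start = [], 0
    for i in range(processes_num):
        # The first `extra` workers take one additional line each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

ranges = split_line_ranges(num_lines=10, processes_num=4)
print(ranges)
```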
--dup_factor specifies how many times each instance is duplicated when static masking is used. Static masking is used in BERT: the masked words are determined in the pre-processing stage.
--dynamic_masking specifies that words are masked during the pre-training stage, as in RoBERTa. Dynamic masking performs better, and the output file (--dataset_path) is smaller since instances do not have to be duplicated.
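The difference between the two masking modes can be illustrated with a simplified sketch. It is not TencentPretrain's implementation: the function names are hypothetical, the 15% masking rate and the seed are assumed values, and BERT's rule of sometimes keeping or randomly replacing a selected token is omitted.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace about mask_prob of the tokens with [MASK] (simplified)."""
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

def static_masking(tokens, dup_factor, seed=7):
    """Pre-processing time: store dup_factor pre-masked copies of one instance."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(dup_factor)]

def dynamic_masking(tokens, rng):
    """Pre-training time: the stored instance is unmasked; mask it on each use."""
    return mask_tokens(tokens, rng)

tokens = "the quick brown fox jumps over the lazy dog".split()
stored_static = static_masking(tokens, dup_factor=5)  # 5 masked copies are stored
stored_dynamic = [tokens]                             # 1 unmasked copy is stored

rng = random.Random(0)
for epoch in range(3):
    # A fresh mask is drawn every time the instance is read.
    print(epoch, dynamic_masking(stored_dynamic[0], rng))
```

With static masking the stored data grows with the duplication factor, while dynamic masking stores each instance once and draws a new mask whenever it is read.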
--full_sentences allows a sample to include content from multiple documents, as in RoBERTa.
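The effect of letting samples cross document boundaries can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used by preprocess.py: both function names are hypothetical, documents are plain token lists, and any separator tokens a real implementation may insert between documents are left out.

```python
def pack_per_document(docs, seq_length):
    """Each sample draws its tokens from a single document only."""
    samples = []
    for doc in docs:
        for start in range(0, len(doc), seq_length):
            samples.append(doc[start:start + seq_length])
    return samples

def pack_full_sentences(docs, seq_length):
    """Concatenate all documents, then cut samples that may span documents."""
    stream = [tok for doc in docs for tok in doc]
    return [stream[i:i + seq_length] for i in range(0, len(stream), seq_length)]

docs = [["a1", "a2", "a3"], ["b1", "b2", "b3", "b4", "b5"], ["c1", "c2"]]
per_doc = pack_per_document(docs, seq_length=4)
full = pack_full_sentences(docs, seq_length=4)
print(len(per_doc), len(full))
```

In this toy input, packing across documents yields fewer, fuller samples because short document tails no longer become separate samples.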
--span_masking enables masking of consecutive words, as in SpanBERT. If dynamic masking is used, --span_masking should be specified in the pre-training stage; otherwise, it should be specified in the pre-processing stage.
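Span masking in the style of SpanBERT can be sketched as below. The parameters `geo_prob` and `max_length` mirror the --span_geo_prob and --span_max_length options in the usage message; the values 0.2 and 10 follow the SpanBERT paper and are assumed here, not confirmed defaults of preprocess.py. The function names and the 15% masking budget are likewise assumptions, and lengths beyond `max_length` are simply clipped in this sketch.

```python
import random

def sample_span_length(rng, geo_prob=0.2, max_length=10):
    """Draw a span length from a geometric distribution, clipped to max_length."""
    length = 1
    while length < max_length and rng.random() > geo_prob:
        length += 1
    return length

def span_mask(tokens, rng, mask_prob=0.15, geo_prob=0.2, max_length=10,
              mask_token="[MASK]"):
    """Mask contiguous spans until about mask_prob of the tokens are covered."""
    if not tokens:
        return []
    budget = max(1, round(len(tokens) * mask_prob))
    masked = set()
    while len(masked) < budget:
        # Never let a span exceed the number of positions still to be masked.
        length = min(sample_span_length(rng, geo_prob, max_length),
                     budget - len(masked))
        start = rng.randrange(0, len(tokens) - length + 1)
        masked.update(range(start, start + length))
    return [mask_token if i in masked else t for i, t in enumerate(tokens)]

rng = random.Random(0)
tokens = [f"w{i}" for i in range(40)]
print(span_mask(tokens, rng))
```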
--docs_buffer_size specifies the size of the in-memory buffer used in the pre-processing stage.
The sequence length is specified in the pre-processing stage by --seq_length. The default value is 128. When doing incremental pre-training on an existing pre-trained model, --seq_length must not exceed the maximum sequence length that the pre-trained model supports (--max_seq_length).
The vocabulary and tokenizer are also specified in the pre-processing stage. More details are given in the Tokenization and vocabulary section.