Pytorch implementation of the AAAI-2021 paper: Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders.
The code is partially referred to https://github.com/nlpyang/PreSumm.
- Python 3.6 or higher
- torch==1.1.0
- pytorch-transformers==1.1.0
- torchtext==0.4.0
- rouge==0.3.2
- tensorboardX==2.1
- nltk==3.5
- Tesla V100 32GB GPU
- CUDA 10.2
Each json file is a data list that includes chat log samples. The format of a chat log sample is shown as follows:
{"session": [
{"content": ["is", "anyone", "there", ",", "please", "?"],
"type": "c2b"},
{"content": ["i", "want", "to", "buy", "this", "skirt", ",", "but", "i", "don't", "know", "what", "size", "suits", "me"],
"type": "c2b"},
{"content": ["what's", "your", "height", "and", "weight", "?"],
"type": "b2c"},
{"content": ["165", "cm", "and", "55", "kg"],
"type": "c2b"},
{"content": ["well", ",", "size", "m", "suits", "you"],
"type": "b2c"}
],
"summary": ["the", "user", "wants", "to", "buy", "a", "skirt", ".", "size", "m", "suits", "people", "of", "165", "cm", "and", "55", "kg", "."]
}
{"session": [
{"content": ["发", "什", "么", "快", "递", "?"],
"type": "c2b"},
{"content": ["发", "顺", "丰"],
"type": "b2c"},
{"content": ["包", "邮", "吗"],
"type": "c2b"},
{"content": ["满", "300", "元", "包", "邮"],
"type": "b2c"},
{"content": ["我", "下", "单", "了", ",", "什", "么", "时", "候", "发", "货"],
"type": "c2b"},
{"content": ["明", "天"],
"type": "b2c"}
],
"summary": ["商", "品", "明", "天", "发", "顺", "丰", ",", "满", "300", "元", "包", "邮", "。"]
}
-
Download BERT checkpoints.
The pretrained BERT checkpoints can be found at:
- Chinese BERT: https://github.com/ymcui/Chinese-BERT-wwm
- English BERT: https://github.com/google-research/bert
Put BERT checkpoints into the directory bert like this:
--- bert | |--- chinese_bert | |--- config.json | |--- pytorch_model.bin | |--- vocab.txt
-
Data Processing
PYTHONPATH=. python ./src/preprocess.py -raw_path json_data -save_path bert_data -bert_temp_dir bert/chinese_bert -log_file logs/preprocess.log
-
Train
PYTHONPATH=. python ./src/train.py -data_path bert_data/taobao -log_file logs/rankae.train.log -model_path models/rankae -sep_optim -train_steps 200000
-
Validate
PYTHONPATH=. python ./src/train.py -mode validate -data_path bert_data/taobao -log_file logs/rankae.val.log -alpha 0.95 -model_path models/rankae
-
Testing
PYTHONPATH=. python ./src/train.py -mode test -data_path bert_data/taobao -test_from models/rankae/model_step_200000.pt -log_file logs/rankae.test.log -alpha 0.95
Our chat log dataset is collected from Taobao, where conversations take place between customers and merchants in the Chinese language. For the security of private information from customers, we performed the data desensitization and converted words to IDs. As a result, the data cannot be directly used in our released codes and other pre-trained models like BERT, but the dataset still provides some statistical information.
The desensitized data is available at Google Drive or Baidu Pan (extract code: 4298).
@article{Zou_Lin_Zhao_Kang_Jiang_Sun_Zhang_Huang_Liu_2021,
title={Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders},
volume={35},
url={https://ojs.aaai.org/index.php/AAAI/article/view/17724},
number={16},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Zou, Yicheng and Lin, Jun and Zhao, Lujun and Kang, Yangyang and Jiang, Zhuoren and Sun, Changlong and Zhang, Qi and Huang, Xuanjing and Liu, Xiaozhong},
year={2021},
month={May},
pages={14674-14682}
}