This is the official GitHub page for the paper:
Hussain Kanafani, Junaid Ahmed Ghauri, Sherzod Hakimov, Ralph Ewerth. 2021. Unsupervised Video Summarization via Multi-source Features. In Proceedings of the 2021 International Conference on Multimedia Retrieval (ICMR '21), August 21–24, 2021, Taipei, Taiwan. ACM, New York, NY, USA. https://doi.org/10.1145/3460426.3463597
The paper is available on:
Model architecture: Multi-Source Chunk and Stride Fusion (MCSF)
Requirements: Python 3.6

```
cd MCSF
conda create -n mcsf python=3.6
conda activate mcsf
pip install -r requirements.txt
```
Directory structure:
- /data
  - /plc_365 (Places365 features for SumMe and TVSum)
  - /splits (original and non-overlapping splits)
  - /SumMe (processed dataset, h5)
  - /TVSum (processed dataset, h5)
- /csnet (implementation of the CSNet method)
- /mcsf-places365-early-fusion
- /mcsf-places365-late-fusion
- /mcsf-places365-intermediate-fusion
- /src/evaluation (evaluation using F1-score)
- /src/visualization
- /sum-ind (implementation of the SUM-Ind method)
Structured h5 files with the video features and annotations of the SumMe and TVSum datasets are available in the "data" folder. The GoogLeNet features of the video frames were extracted by Ke Zhang and Wei-Lun Chao, and the h5 files were obtained from Kaiyang Zhou. To download the datasets:

```
wget https://zenodo.org/record/4884870/files/datasets.tar
```
The implemented models use the provided h5 files, which have the following structure:

```
/key
    /features                2D-array with shape (n_steps, feature-dimension)
    /gtscore                 1D-array with shape (n_steps), stores ground truth importance score (used for training, e.g. regression loss)
    /user_summary            2D-array with shape (num_users, n_frames), each row is a binary vector (used for test)
    /change_points           2D-array with shape (num_segments, 2), each row stores indices of a segment
    /n_frame_per_seg         1D-array with shape (num_segments), indicates number of frames in each segment
    /n_frames                number of frames in original video
    /picks                   positions of subsampled frames in original video
    /n_steps                 number of subsampled frames
    /gtsummary               1D-array with shape (n_steps), ground truth summary provided by user (used for training, e.g. maximum likelihood)
    /video_name (optional)   original video name, only available for SumMe dataset
```
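For reference, here is a minimal sketch of how such a file can be inspected with h5py. This is not part of the repository's code, and the file path is illustrative; list the keys of your own file to see the actual video entries.

```python
import h5py

# Minimal sketch for inspecting the provided h5 files (assumes h5py is installed).
# The path below is illustrative; adjust it to where you extracted the datasets.
with h5py.File("data/SumMe/eccv16_dataset_summe_google_pool5.h5", "r") as f:
    for key in f.keys():
        features = f[key]["features"][...]          # (n_steps, feature-dimension)
        gtscore = f[key]["gtscore"][...]            # (n_steps,)
        user_summary = f[key]["user_summary"][...]  # (num_users, n_frames)
        print(key, features.shape, gtscore.shape, user_summary.shape)
        break  # inspect only the first video
```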
Original videos and annotations for each dataset are also available on the authors' project webpages:
TVSum dataset: https://github.com/yalesong/tvsum
SumMe dataset: https://gyglim.github.io/me/vsum/index.html#benchmark
We used the SUM-GAN method as a starting point for our implementation.
Run main.py with the configurations specified in configs.py to train the model. In configs.py you will find the following argument parameters for training (a sketch of a matching argument parser follows the table):
| Parameter | Type | Default |
| --- | --- | --- |
| mode | string; possible values: train, test | train |
| verbose | boolean | true |
| video_type | string (summe or tvsum) | summe |
| input_size | int | 1024 |
| hidden_size | int | 500 |
| split_index | int | 0 |
| n_epochs | int | 20 |
| m | int (number of divisions used for the chunk and stride network) | 4 |
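For illustration, a minimal sketch of how these parameters could be exposed via argparse. The actual configs.py may differ; defaults follow the table above.

```python
import argparse

# Hypothetical sketch of the training configuration; the actual configs.py may differ.
def get_config():
    parser = argparse.ArgumentParser(description="MCSF training configuration")
    parser.add_argument("--mode", type=str, default="train", choices=["train", "test"])
    parser.add_argument("--verbose", type=lambda s: s.lower() in ("true", "1", "yes"),
                        default=True)
    parser.add_argument("--video_type", type=str, default="summe",
                        choices=["summe", "tvsum"])
    parser.add_argument("--input_size", type=int, default=1024)
    parser.add_argument("--hidden_size", type=int, default=500)
    parser.add_argument("--split_index", type=int, default=0)
    parser.add_argument("--n_epochs", type=int, default=20)
    parser.add_argument("--m", type=int, default=4,
                        help="number of divisions for the chunk and stride network")
    return parser.parse_args()
```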
For training the model using a single split, run:

```
python main.py --split_index N
```

where N is the index of the split. A sketch for training on all splits in sequence follows.
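A simple driver script along these lines can launch one training run per split. This is a sketch, not repository code; it assumes the standard 5-fold split setup used by the provided split files.

```python
import subprocess

# Sketch: train one model per split (assumes the usual 5 cross-validation splits).
for split_index in range(5):
    subprocess.run(
        ["python", "main.py", "--split_index", str(split_index)],
        check=True,  # stop immediately if a run fails
    )
```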
Using multiple human-generated summaries per video: CSNet and all MCSF models are evaluated by comparing, after each training epoch, the generated summary for each test video against the set of reference human summaries available for that video (see the /user_summary entry in the h5 file structure above). To run this evaluation, run the src/evaluation/evaluate.py script after specifying which config file to use: config_summe.yaml or config_tvsum.yaml. A sketch of the underlying F1-score protocol is shown below.
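For illustration, a minimal sketch of the F1-score protocol commonly used in the video summarization literature for matching a machine summary against multiple user summaries (conventionally, the maximum over users for SumMe and the mean for TVSum). This is not the repository's exact evaluation code.

```python
import numpy as np

def f1_score(pred, gt):
    """F1 between two binary frame-level summary vectors of equal length."""
    overlap = float(np.logical_and(pred, gt).sum())
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()
    recall = overlap / gt.sum()
    return 2 * precision * recall / (precision + recall)

def evaluate_summary(machine_summary, user_summaries, video_type="summe"):
    """Compare one machine summary against all user summaries for a video."""
    scores = [f1_score(machine_summary, u) for u in user_summaries]
    # Convention in the literature: max over users for SumMe, mean for TVSum.
    return max(scores) if video_type == "summe" else float(np.mean(scores))
```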
Train and test code is written in main.py. To see the detailed arguments, run python main.py -h.
To train:

```
python main.py -d datasets/eccv16_dataset_summe_google_pool5.h5 -s datasets/summe_splits.json -m summe --gpu 0 --save-dir log/summe-split0 --split-id 0 --verbose
```

To test:

```
python main.py -d datasets/eccv16_dataset_summe_google_pool5.h5 -s datasets/summe_splits.json -m summe --gpu 0 --save-dir log/summe-split0 --split-id 0 --evaluate --resume path_to_your_model.pth.tar --verbose --save-results
```
```
@inproceedings{kanafani2021MCSF,
  title={Unsupervised Video Summarization via Multi-source Features},
  author={Kanafani, Hussain and Ghauri, Junaid Ahmed and Hakimov, Sherzod and Ewerth, Ralph},
  booktitle={Proceedings of the 2021 International Conference on Multimedia Retrieval (ICMR '21)},
  year={2021},
  doi={10.1145/3460426.3463597}
}
```