2025, Issue 04

Unsupervised Video Summarization Model Based on Multi-head Concentration Mechanism


Abstract:

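Existing video summarization methods are limited in modeling long-range dependencies between frames and in parallelizing training. To address these limitations, an unsupervised video summarization model based on a multi-head concentrated attention mechanism (MH-CASUM) is proposed. The model integrates the multi-head attention mechanism into the concentrated attention model, improves the length regularization loss function, and optimizes the loss threshold used to select model parameters. It also combines the uniqueness and diversity of video frames to enrich the summary information, so that the video summarization task is performed more efficiently. The performance of MH-CASUM is evaluated on the SumMe and TVSum datasets using the F1 score, Kendall's rank correlation coefficient, and Spearman's rank correlation coefficient. The results show that the introduced multi-head attention mechanism and the improved loss-threshold method for model parameter selection significantly improve the summarization performance of MH-CASUM. Compared with CASUM, the previously best-performing unsupervised video summarization model, MH-CASUM improves the F1 score on the TVSum dataset by 0.98%, demonstrating its superiority and competitiveness in video summarization.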
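The central idea described above, multi-head self-attention whose weights are concentrated on local groups of frames followed by a length-regularized importance score, can be illustrated with the minimal NumPy sketch below. It is a simplified illustration under stated assumptions, not the authors' MH-CASUM implementation: the function names, the block size, the number of heads, the random projection matrices (stand-ins for learned weights), the sigmoid scoring layer, and the loss form |mean(s) - σ| with σ = 0.5 are all hypothetical choices. The uniqueness/diversity terms and the loss-threshold model selection are omitted.

```python
import numpy as np


def multi_head_concentrated_attention(x, num_heads=2, block_size=4, seed=0):
    """Hypothetical sketch: multi-head self-attention whose weights are confined
    to non-overlapping blocks on the main diagonal of the T x T attention matrix.

    x: (T, D) frame features. Returns (output (T, D), weights (num_heads, T, T)).
    """
    T, D = x.shape
    assert D % num_heads == 0, "feature size must split evenly across heads"
    d_head = D // num_heads
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned W_Q, W_K, W_V, W_O (assumption).
    w_q, w_k, w_v, w_o = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

    # Block-diagonal mask: frame i may attend to frame j only if both lie
    # in the same block of `block_size` consecutive frames.
    block_id = np.arange(T) // block_size
    mask = block_id[:, None] == block_id[None, :]

    def split_heads(m):
        # (T, D) -> (num_heads, T, d_head)
        return m.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, T, T) scaled dot products
    logits = np.where(mask, logits, -np.inf)             # remove off-block frame pairs
    logits -= logits.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax

    heads = weights @ v                                  # (H, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, D)      # concatenate all heads
    return concat @ w_o, weights


def length_regularization_loss(scores, sigma=0.5):
    # Assumed form: penalize the gap between the mean importance score and a
    # target ratio sigma, so the model cannot mark every frame as important.
    return abs(float(scores.mean()) - sigma)


# Usage: 10 frames with 8-dimensional features, 2 heads, blocks of 4 frames.
features = np.random.default_rng(42).standard_normal((10, 8))
out, attn = multi_head_concentrated_attention(features, num_heads=2, block_size=4)
w_s = np.random.default_rng(7).standard_normal(8) / np.sqrt(8)  # hypothetical scoring weights
scores = 1.0 / (1.0 + np.exp(-(out @ w_s)))                     # sigmoid -> importance in (0, 1)
loss = length_regularization_loss(scores, sigma=0.5)
```

All heads and all frames are computed with batched matrix products rather than step by step as in a recurrent encoder, which is why attention-based summarizers of this kind lend themselves to parallel training; the block mask only changes which frame pairs contribute to each softmax.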

Keywords: video summarization; attention mechanism; multi-head concentrated attention; unsupervised method

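Foundation: National Natural Science Foundation of China (62076077); Guangxi Major Science and Technology Project (Guike AA22068057); Guangxi Natural Science Foundation (2022GXNSFBA035644); 2022 Director's Fund Project of the Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education; Guilin University of Electronic Technology Graduate Education Innovation Program (2025YCXS243)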

Authors: LI Yujie, JIA Haonan, LING Li, ZHOU Wenkai, JIANG Zheng, DING Shuxue, TAN Benying

DOI: 10.13349/j.cnki.jdxbn.20250508.001
