# Title: MSKVS: Adaptive mean shift-based keyframe extraction for video summarization and a new objective verification approach
# Highlights:
- A novel method for video summarization based on keyframe extraction is proposed.
- Keyframes are extracted using mean shift with global frame orientation features.
- A new algorithm to remove the redundant frames of the video is proposed.
- A new verification approach to objectively evaluate the video summary quality is proposed.
- The proposed method achieves state-of-the-art performance on several datasets.
# Introduction
- (1) MSKVS does not rely on shot boundary detection for video summarization, which makes it applicable to different genres of videos such as consumer videos, surveillance videos, and other general videos, as experimentally demonstrated.
- (2) Unlike existing summarization approaches [14,17,34,38], which largely rely on color and texture content to represent video frames, the proposed GFFV descriptor is built from the dominant orientations of stable and informative keypoints extracted using DoG, making it largely invariant to scale, illumination, noise, and other external factors. Moreover, it preserves the spatial information present in the video frames, which is important for eliminating spatial redundancy between frames.
- (3) The proposed algorithm for eliminating unnecessary frames benefits from combining both the temporal and the visual information of the video, unlike existing approaches [14,17,38] that rely only on a pre-sampling technique with a fixed sampling rate, so important parts of the video are likely to be discarded, especially for videos with short-duration shots.
- (4) In our approach, the frame weighting in the density estimate varies depending on both the number of keypoints used to build the GFFV descriptor and the amount of visual information expressed in the frame. This yields a remarkable improvement in keyframe extraction performance over density estimation alone, without frame weighting.
- (5) The proposed work as a whole makes it possible to browse a video in a much shorter time, overcomes common shortcomings of many existing techniques for both video summarization and summary assessment, and achieves strong or comparable results on several datasets. The results suggest broader applicability to other frameworks dealing with video indexing, retrieval, compression, and browsing.
# MSKVS framework
1. Feature Extraction (Section 3.1) - each frame is mapped into a multi-dimensional space by a descriptor termed GFFV
1-1. Keypoint detection (Difference-of-Gaussian (DoG) of the SIFT descriptor)
In this work, DoG operates as follows:
1) build a scale-space representation by sequentially smoothing and resampling the frame Fi with a Gaussian kernel.
2) compare each pixel in the DoG images to its eight closest neighbors at the same scale, and to the nine corresponding neighboring pixels in each of the scales above and below.
If the pixel's intensity is larger or smaller than all the compared pixels, it is selected as a candidate keypoint; low-contrast keypoints are then eliminated using a Taylor expansion [18]. A minimal detection sketch follows below.
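A minimal sketch of step 1-1 using OpenCV's SIFT, whose detector is the DoG scale-space extrema search described above (OpenCV >= 4.4). The function and file names are illustrative assumptions, not the paper's code.

```python
import cv2

def detect_dog_keypoints(frame_bgr):
    """Detect DoG keypoints in one video frame and return them together
    with their local orientations (kp.angle, in degrees)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # SIFT's detector builds the Gaussian scale space, takes DoG differences,
    # checks each pixel against its 8 + 9 + 9 scale-space neighbors, and
    # rejects low-contrast extrema via the Taylor-expansion refinement.
    sift = cv2.SIFT_create(contrastThreshold=0.04)
    return sift.detect(gray, None)

# usage (hypothetical frame file):
# frame = cv2.imread("frame_0001.png")
# kps = detect_dog_keypoints(frame)
# print(len(kps), kps[0].pt, kps[0].angle)
```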
1-2. Local orientation assignment
1) a crucial factor for a successful keyframe extraction scheme
2) all the bins of the orientation histogram are concatenated and stored as one long vector,
+ together with the dominant orientation Ψi(n×m) of each sub-region histogram Hi(n×m); a descriptor sketch follows below.
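A sketch of a GFFV-like descriptor, assuming the frame is split into an n×m grid of sub-regions, each accumulating an orientation histogram Hi(n×m) over its keypoints, with the histograms and per-region dominant orientations concatenated into one vector. The grid size and bin count are illustrative, not the paper's values.

```python
import numpy as np

def gffv_descriptor(keypoints, frame_shape, grid=(4, 4), bins=36):
    """Concatenate per-sub-region orientation histograms Hi(n, m) and their
    dominant orientations into one global frame feature vector."""
    h, w = frame_shape[:2]
    hists = np.zeros((grid[0], grid[1], bins))
    for kp in keypoints:
        x, y = kp.pt
        r = min(int(y * grid[0] / h), grid[0] - 1)  # sub-region row
        c = min(int(x * grid[1] / w), grid[1] - 1)  # sub-region column
        b = int(kp.angle / 360.0 * bins) % bins     # orientation bin
        hists[r, c, b] += 1
    # dominant orientation (bin center, in degrees) of each sub-region histogram
    dominant = (hists.argmax(axis=2) + 0.5) * (360.0 / bins)
    return np.concatenate([hists.ravel(), dominant.ravel()])
```

Keeping the n×m grid structure is what preserves the spatial information highlighted in contribution (2).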
2. Elimination of Unnecessary Frames (Section 3.2) - handles both unnecessary data and redundant, similar content.
An innovative algorithm dynamically reduces the amount of data to be processed by combining the temporal and visual information of the video.
1) measure the quantitative change between every two consecutive frames (via the GFFV descriptor);
the corresponding adaptive threshold τOin (see Eq. (9)) is computed from the mean μOin (Eq. (7)) and the standard deviation σOin (Eq. (8)) of the pair-wise dissimilarity between consecutive frames over the whole video stream.
2) the dissimilarity between two frames Fi and Fj is computed as the Euclidean distance between their respective feature vectors;
if the dissimilarity is above the threshold, the corresponding consecutive frames are considered different frames holding different information.
The parameter α determines the threshold (empirical value α = 2.5). A sketch of this step follows below.
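A minimal sketch of the elimination step, assuming the adaptive threshold takes the common form τ = μ + α·σ over the consecutive-frame dissimilarities (the paper's exact Eq. (9) may differ). Comparing against the last kept frame is one reasonable reading of the procedure.

```python
import numpy as np

def eliminate_redundant_frames(gffvs, alpha=2.5):
    """gffvs: (num_frames, dim) array of GFFV descriptors.
    Returns indices of frames kept as holding new information."""
    diffs = np.linalg.norm(np.diff(gffvs, axis=0), axis=1)  # consecutive dissimilarities
    tau = diffs.mean() + alpha * diffs.std()                # adaptive threshold (assumed form)
    kept = [0]                                              # always keep the first frame
    for i in range(1, len(gffvs)):
        # compare against the last kept frame so redundant runs collapse to one frame
        if np.linalg.norm(gffvs[i] - gffvs[kept[-1]]) > tau:
            kept.append(i)
    return kept
```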
3. Keyframe Selection (Section 3.3) - a mean shift-based local mode-finding algorithm [19].
1) extract the set of keyframes OKF using mean shift (a non-parametric procedure).
1-1) each data point Vi in the u·v-dimensional space represents the extracted feature vector of frame Fi.
1-2) the probability density function of the video Oout can be estimated via kernel density estimation; a weighted Gaussian kernel is used, since it has proved to be the optimal and most efficient choice for the mean shift procedure (b denotes the learned bandwidth of the Gaussian g).
1-3) the same entropy-based singular values as in our previous work [6] are used to measure the amount of visual information:
1-3-1) perform singular value decomposition (SVD) [51] on the frame Fi,
1-3-2) derive the entropy measure Ent(Fi) from the distribution of singular values S(Fi),
1-3-3) define the frame weight wi as the product of the computed entropy Ent(Fi) and the number of extracted keypoints k: w(i) = k · Ent(Fi),
1-3-4) obtain the gradient of the density estimator; a worked form of the weighted estimate is sketched below.
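Since these notes do not reproduce the paper's equations, here is the standard weighted kernel density estimate and mean shift vector written in the paper's symbols (Vi: frame feature vector, wi: frame weight, b: bandwidth, g = -k': derivative of the Gaussian profile); the paper's exact normalization may differ.

```latex
% Weighted kernel density estimate over the frame vectors V_i with
% weights w_i, bandwidth b, and kernel profile k:
\hat{f}(V) = \frac{1}{b^{d}\sum_{i=1}^{n} w_i}\,
             \sum_{i=1}^{n} w_i\, k\!\left(\left\lVert \frac{V - V_i}{b} \right\rVert^{2}\right)

% Mean shift vector with g = -k': setting the gradient of \hat{f} to zero
% moves V to the weighted mean of its neighborhood, i.e. toward the
% nearest density mode:
m(V) = \frac{\sum_{i=1}^{n} w_i\, V_i\, g\!\left(\left\lVert \frac{V - V_i}{b} \right\rVert^{2}\right)}
            {\sum_{i=1}^{n} w_i\, g\!\left(\left\lVert \frac{V - V_i}{b} \right\rVert^{2}\right)} - V
```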
2) Keyframe finding
Given a data point Vi in the u·v-dimensional space, the mean shift procedure iteratively shifts Vi toward the nearest density mode; frames converging to the same mode yield the candidate keyframes, which are then refined as follows:
2-1) compare all the selected candidate keyframes among themselves using the Euclidean distance (Eq. (10)).
2-2) if the dissimilarity between two keyframes (Fc, Fd) is less than the threshold (Dist(Fc, Fd) < 0.3), one of the two keyframes is removed.
2-3) the final summary is produced by arranging the remaining candidate keyframes in the original temporal order. An end-to-end sketch of this step follows below.
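An end-to-end sketch of step 3: SVD-entropy frame weighting, weighted mean shift mode finding, near-duplicate removal (threshold 0.3 as in 2-2), and temporal re-ordering. The bandwidth, iteration count, and convergence tolerance are simplified assumptions, and the 0.3 threshold presumes normalized descriptors as in the paper.

```python
import numpy as np

def entropy_weight(frame_gray, num_keypoints):
    """w(i) = k * Ent(Fi): SVD-based entropy of the frame times its keypoint count."""
    s = np.linalg.svd(frame_gray.astype(float), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return num_keypoints * float(-(p * np.log2(p)).sum())

def mean_shift_keyframes(V, w, b=1.0, iters=50, merge_tol=1e-3, dup_thr=0.3):
    """V: (n, d) GFFV vectors; w: (n,) frame weights. Returns keyframe indices."""
    modes = V.astype(float)
    for _ in range(iters):
        for i in range(len(modes)):
            # weighted Gaussian kernel: each point moves to the weighted
            # mean of its neighborhood, i.e. toward a density mode
            g = w * np.exp(-np.sum((modes[i] - V) ** 2, axis=1) / (2 * b ** 2))
            modes[i] = (g[:, None] * V).sum(axis=0) / g.sum()
    # keep one representative frame per converged mode
    candidates = []
    for i in range(len(modes)):
        if not any(np.linalg.norm(modes[i] - modes[j]) < merge_tol for j in candidates):
            candidates.append(i)
    # 2-2) drop near-duplicate keyframes, 2-3) restore temporal order
    kept = []
    for i in candidates:
        if all(np.linalg.norm(V[i] - V[j]) >= dup_thr for j in kept):
            kept.append(i)
    return sorted(kept)
```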
4. Keyframe Verification Approach - a new objective way to assess summary quality via two ratios (a sketch follows below):
1) Compactness ratio - how concise the summary is relative to the original video.
2) Representativeness ratio - how well the selected keyframes represent the video's visual content.
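A sketch of the two ratios under assumed definitions (the paper's exact equations are not reproduced in these notes): compactness as the share of frames removed, representativeness as the share of original frames lying within a distance eps of some selected keyframe.

```python
import numpy as np

def compactness_ratio(num_keyframes, num_frames):
    # assumed: fraction of the original video removed by the summary
    return 1.0 - num_keyframes / num_frames

def representativeness_ratio(V_all, keyframe_idx, eps=0.3):
    # assumed: fraction of frames within distance eps of the nearest keyframe
    K = V_all[keyframe_idx]
    d = np.linalg.norm(V_all[:, None, :] - K[None, :, :], axis=2)
    return float((d.min(axis=1) <= eps).mean())
```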
# Evaluation metrics
1) Recall, Precision, and F-score: F = 2 · (R · P) / (R + P) (a small helper is sketched after this list)
2) Parameter selection
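A minimal helper for the three metrics, using the common keyframe-matching convention (matched keyframes counted against the automatic and user summaries); the argument names are illustrative.

```python
def f_score(n_matched, n_auto, n_user):
    """Precision/recall over matched keyframes between the automatic summary
    (n_auto frames) and the user summary (n_user frames)."""
    precision = n_matched / n_auto
    recall = n_matched / n_user
    return 2 * recall * precision / (recall + precision)
```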
# Dataset (ChokePoint)
# Performance comparison with state-of-the-art methods
We now extensively compare MSKVS with several well-known approaches from the literature, including:
1) Clustering-based methods (VSUMM [14], STIMO [47], Delaunay Clustering (DT) [46], and Density-Based Spatial Clustering (VSCAN) [66]);
2) Shot-based methods (Color Co-Occurrence Matrices (VISCOM) [67], Image Quality Assessment (VSQUAL) [65], Segments Summary Graphs (SSG) [55], Keypoint Based Keyframe Selection (KBKS) [56], VIdeo Summarization for ONline applications (VISON) [64], SIFT-PDH [6], MoSIFT-PDH [45], and Open Video Project storyboard (OVP) [44]);
3) Sparse dictionary-based methods (Minimum Sparse Reconstruction (MSR) [39] and Sparse Dictionary Selection (SDS) [40]);
4) Visual attention-based methods (Multiple Saliency Maps (MSM) [15], Feature Aggregation (FA) [68], Clustering-based Attention Selection (CAS) [69], and Non-Linear Weighted Fusion (NLWF) [41]).
+ We further compare MSKVS with some methods for video surveillance summarization (Localised Foreground Entropy (LFE) [48] and Surveillance Video Summarization (SVS) [49]), and others for user video summarization (Superframe Segmentation (SS) [70], Spatio-Temporal Scoring (STS) [71], and Dictionary Learning (DL) [72]).