Description
The delivered package contains the development and test sets for the MediaEval 2016 Predicting Media Interestingness Task and for the MediaEval 2017 Predicting Media Interestingness Task. For 2016, it is composed of:
- Shots and key-frames from a set of 78 Hollywood-like movie trailers of different genres
- The corresponding ground truth
- Additional low-level and mid-level features
For 2017, it is composed of:
- Shots and key-frames from a set of 103 Hollywood-like movie trailers of different genres and 4 continuous extracts of ca. 15min from full-length movies.
- The corresponding ground truth
- Additional low-level and mid-level features
All or part of the content is distributed under a CC license. The researcher commits to using the content in line with its provisions. Should any provision of the CC license and this license be irreconcilable, the CC license will prevail. The content contains the relevant credits, in accordance with the CC license. The researcher will not, under any circumstances, delete or alter such credits. The researcher will find such license here
Data
The data consists of:
- the movie shots (obtained after the manual segmentation of the trailers or excerpts). Video shots are provided as individual mp4 files, whose names follow the format:
shotstartingframe-shotendingframe.mp4.
- collections of key-frames extracted from the previous video shots (one key-frame per shot). The extracted key-frame corresponds to the frame in the middle of each video shot. Its naming format follows (a parsing sketch covering both naming formats is given after this list):
frameNb_shotstartingframe-shotendingframe.jpg.
- the corresponding movie titles (or at least the name of the video as it appears on the internet).
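For convenience, the minimal Python sketch below shows one way to parse both naming formats. The regular expressions simply mirror the formats above; the frame numbers in the example comments are hypothetical.

import re

# Naming formats described above (all values are frame numbers).
SHOT_RE = re.compile(r"^(\d+)-(\d+)\.mp4$")            # shotstartingframe-shotendingframe.mp4
KEYFRAME_RE = re.compile(r"^(\d+)_(\d+)-(\d+)\.jpg$")  # frameNb_shotstartingframe-shotendingframe.jpg

def parse_shot_name(name):
    """Return (start_frame, end_frame) from a shot file name."""
    m = SHOT_RE.match(name)
    if m is None:
        raise ValueError("unexpected shot file name: " + name)
    return int(m.group(1)), int(m.group(2))

def parse_keyframe_name(name):
    """Return (keyframe_number, start_frame, end_frame) from a key-frame file name."""
    m = KEYFRAME_RE.match(name)
    if m is None:
        raise ValueError("unexpected key-frame file name: " + name)
    return int(m.group(1)), int(m.group(2)), int(m.group(3))

# Hypothetical examples:
#   parse_shot_name("120-245.mp4")         -> (120, 245)
#   parse_keyframe_name("182_120-245.jpg") -> (182, 120, 245)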
Ground-Truth
For all these trailers and excerpts, the ground truth consists of binary annotations of each shot and key-frame as interesting/non-interesting, according to the following use scenario:
Interestingness should be assessed according to the following use case, which derives from a practical scenario at InterDigital: helping professionals to illustrate a Video on Demand (VOD) web site by selecting some interesting frames and/or video excerpts for the movies. The frames and excerpts should be suitable in terms of helping a user to decide whether he/she is interested in watching a movie.
Ground truth is provided in two separate text files, one for the shots and one for the key-frames. All data was manually annotated in terms of interestingness by human assessors. A pair-wise comparison protocol was used [1]: annotators were shown a pair of images/video shots at a time and asked to tag which of the two is more interesting to them, and the process was repeated over the whole dataset. To avoid an exhaustive comparison of all possible pairs, a boosting selection method was employed (namely the adaptive square design method [2]). The obtained annotations are finally aggregated using a Bradley-Terry-Luce (BTL) model computation [1], resulting in the final interestingness degrees of the images/video shots. The final binary decisions are then obtained by applying an empirical threshold to the resulting rankings.
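For readers unfamiliar with BTL aggregation, the sketch below shows one common way to turn a matrix of pairwise wins into per-item scores, using the standard minorization-maximization update for the Bradley-Terry model. It is only an illustration of the principle, not the organizers' exact implementation, and the win counts in the example are hypothetical.

import numpy as np

def bradley_terry_scores(wins, n_iter=100, eps=1e-9):
    # wins[i, j] = number of times item i was preferred over item j.
    # Returns one worth parameter per item (higher = more interesting).
    wins = np.asarray(wins, dtype=float)
    comparisons = wins + wins.T           # n_ij: comparisons between items i and j
    total_wins = wins.sum(axis=1)         # W_i: total wins of item i
    p = np.ones(wins.shape[0])            # initial worth parameters
    for _ in range(n_iter):
        denom = comparisons / (p[:, None] + p[None, :] + eps)
        np.fill_diagonal(denom, 0.0)
        p = total_wins / (denom.sum(axis=1) + eps)
        p = p / p.sum()                   # normalize (scores are defined up to a scale)
    return p

# Hypothetical example with 3 items (item 0 wins most of its comparisons):
wins = np.array([[0, 3, 2],
                 [1, 0, 1],
                 [2, 2, 0]])
scores = bradley_terry_scores(wins)
ranking = np.argsort(-scores)             # indices sorted from most to least interesting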
Important note: Because the adaptive square design method [2] was used to annotate the data, the number of shots, and consequently the number of key-frames, for which annotations are provided is, for each video, the largest number that can be expressed as t=s^2 (a perfect square). E.g., for a video with 55 shots in total, only 49 = 7^2 shots (and key-frames) were annotated. This explains the discrepancy one may notice between the number of shots and key-frames provided in the data and the number of shots and key-frames in the annotation files. The same discrepancy applies to the provided features, which were computed on all the data, without taking into account the limit imposed by the annotation process. Please use the annotation files as the reference where the number of shots and key-frames is concerned.
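As a concrete illustration of this rule, the number of annotated shots for a given video is simply the largest perfect square not exceeding its shot count (a minimal sketch):

import math

def annotated_count(num_shots):
    # Largest perfect square t = s^2 with t <= num_shots.
    s = math.isqrt(num_shots)
    return s * s

# annotated_count(55) -> 49 (= 7**2), matching the example above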
In both cases, the ground truth text files use the following comma-separated format (a parsing sketch is given after the field description):
- Shots: one line per shot:
videoname,shotname,[classification decision: 1(interesting) or 0(not interesting)],[interestingness level],[shot rank in video]
- Keyframes: one line per key-frame:
videoname,key-framename,[classification decision: 1(interesting) or 0(not interesting)],[interestingness level],[key-frame rank in movie]
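The sketch below shows one way these files can be read in Python. It assumes the interestingness level is a numeric value and the rank an integer; the dictionary keys and the file name in the usage comment are illustrative placeholders, so please check the released files for the exact content.

import csv

def load_ground_truth(path):
    # Each line: videoname,itemname,decision(1/0),interestingness level,rank
    records = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 5:
                continue
            video, item, decision, level, rank = row[:5]
            records.append({
                "video": video,
                "item": item,                      # shot or key-frame name
                "interesting": decision == "1",
                "level": float(level),
                "rank": int(rank),
            })
    return records

# Hypothetical usage:
#   shots = load_ground_truth("shots_groundtruth.txt")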
Features
Low-level features are also provided together with the data. We would like to thank our colleagues Yu-Gang Jiang and Baohan Xu from Fudan University, China, for making these features available for the task:
- Dense SIFT features are computed following the original work in [3], except that the local frame patches are densely sampled instead of being located with interest point detectors. A codebook of 300 codewords is used in the quantization process, with a spatial pyramid of three layers [4];
- HOG descriptors [5] are computed over densely sampled patches. Following [6], HOG descriptors in a 2x2 neighborhood are concatenated to form a descriptor of higher dimension;
- LBP (Local Binary Patterns) [7];
- GIST is computed based on the output energy of several Gabor-like filters (8 orientations and 4 scales) over a dense frame grid, as in [8];
(All the aforementioned visual features are extracted using the code provided by the authors of [6]).
- Color Histogram in HSV space;
- MFCCs computed over 32 ms time windows with 50% overlap. The cepstral vectors are concatenated with their first and second derivatives;
- fc7 layer (4096 dimensions) and prob layer (1000 dimensions) of AlexNet.
These features are provided in Matlab file format (.mat) and can therefore be loaded with the Matlab load command, e.g., load filename.mat. For more information about these features and how they are organized, please refer to the README.txt file in the released package.
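Outside Matlab, the same files can be read with SciPy; the sketch below only shows the loading call, since the exact variable names inside each file are documented in README.txt (the variable name in the last comment is a placeholder).

from scipy.io import loadmat

data = loadmat("filename.mat")   # returns a dict mapping variable names to arrays
print(sorted(data.keys()))       # inspect which variables the file contains
# features = data["features"]    # placeholder: use the actual variable name from README.txt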
Please cite the following paper in your publication if you happen to use the above features:
[10] Yu-Gang Jiang, Qi Dai, Tao Mei, Yong Rui, Shih-Fu Chang. Super Fast Event Recognition in Internet Videos. IEEE Transactions on Multimedia, vol. 17, issue 8, pp. 1-13, 2015.
Additionally, some video-level features are also provided:
- C3D features, extracted from the fc6 layer (4096 dimensions) and averaged at the segment level.
Mid-Level Features
In addition to the low-level features, mid-level features related to face detection and tracking are also provided. These features were kindly computed by the organizers of the Multimodal Person Discovery in Broadcast TV task:
- Face-related features. Face tracking-by-detection is applied within each shot, using a detector based on histogram of oriented gradients [5] and the correlation tracker proposed in [9]. The format is the following (a parsing sketch is given after the field list):
* time identifier left top right bottom
* identifier : face track identifier
* time : in seconds
* left : bounding box left boundary (image width ratio)
* top : bounding box top boundary (image height ratio)
* right : bounding box right boundary (image width ratio)
* bottom : bounding box bottom boundary (image height ratio)
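Assuming whitespace-separated fields in the order listed above, the face-track files can be parsed with a sketch such as the following (the field handling is illustrative; consult the released files for the exact layout):

def load_face_tracks(path):
    # Each line: time identifier left top right bottom (see field list above).
    tracks = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 6:
                continue
            time, identifier = float(parts[0]), parts[1]
            left, top, right, bottom = map(float, parts[2:6])
            # Coordinates are ratios of the image width/height.
            tracks.setdefault(identifier, []).append((time, left, top, right, bottom))
    return tracks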
References
[1] R. A. Bradley and M. E. Terry, Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3-4):324-345, 1952.
[2] J. Li, M. Barkowsky and P. Le Callet, Boosting Paired Comparison Methodology in Measuring Visual Discomfort for 3DTV: Performances of three different Designs. SPIE Electronic Imaging, Stereoscopic Displays and Applications, Human Factors, 8648, p. 1-12, 2013.
[3] D. Lowe, Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
[4] S. Lazebnik, C. Schmid and J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[5] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[6] J. Xiao, J. Hays, K. Ehinger, A. Oliva and A. Torralba, SUN database: Large-scale scene recognition from abbey to zoo. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[7] T. Ojala, M. Pietikainen and T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971-987, 2002.
[8] A. Oliva and A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145-175, 2001.
[9] M. Danelljan, G. Häger, F. Shahbaz Khan and M. Felsberg, Accurate Scale Estimation for Robust Visual Tracking. In Proc. of the British Machine Vision Conference (BMVC), 2014.