The delivered package contains the dataset used in [1], intended for the tasks of video memorability understanding and prediction. It is composed of:
- A list of 660 short movie excerpts extracted from 100 Hollywood-like movies;
- The corresponding ground truth for the 660 movie excerpts, i.e., for each excerpt, a long-term memorability score, its type (neutral vs. typical), and the number of annotations;
- Extracted audio and video features that were used in [1].
It is accompanied by the original movie excerpts that were used to build the ground truth and to extract the audio-visual features. Please note that these excerpts remain the property of their legitimate owners and that no license is granted on them. They are provided and may be used exclusively under Article L.122-5 3° a) of the French Code of Intellectual Property or, where applicable, under the "fair use" doctrine or its equivalent.
List of movies
The complete list of movies is provided in the file movie_list.txt.
Ground truth
The ground truth is provided in the file ground-truth.xlsx. For each video sequence, it consists of (see the loading sketch after this list):
- The corresponding movie’s title
- The start and the end times (in seconds) of the sequence (obtained after a manual segmentation of the movie)
- The sequence’s name
- Its type, neutral vs. typical (annotated 1 for neutral and 0 for typical in the .xlsx file). A neutral video is a part of a movie that contains no element that would enable someone to easily guess that the video belongs to a particular movie. The list of such undesirable elements includes, but is not limited to, recognizable famous actors, typical music, style, etc. Typical videos are simply defined as non-neutral videos. See [1] for a complete definition.
- The number of annotations collected for the sequence
- Its long-term memorability score
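As an illustration, the ground truth can be loaded with pandas; the column names below are assumptions based on the fields listed above and should be adjusted to the actual header row of ground-truth.xlsx.

    import pandas as pd

    # Load the ground truth; requires the openpyxl (or xlrd) engine for .xlsx.
    gt = pd.read_excel("ground-truth.xlsx")

    # Hypothetical column names -- adjust to the actual header row.
    for _, row in gt.iterrows():
        label = "neutral" if row["type"] == 1 else "typical"
        print(row["sequence name"], row["memorability score"], label)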
Features
The set of audio and video features used to train the model presented in [1] is provided together with the data:
- C3D features
- AudioSet embeddings
- Image captions
- SentiBank visual sentiment concepts
- Arousal and valence scores
Python scripts are also provided to read the different features along with the sequences' names. Each feature is described below.
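As a minimal sketch of how a feature might be paired with a sequence for training, assuming one NumPy file per sequence (the actual layout is defined by the provided reader scripts, which should be preferred):

    import os
    import numpy as np

    def load_feature(feature_dir, sequence_name):
        """Load one pre-extracted feature vector for a given sequence."""
        # Hypothetical layout: <feature_dir>/<sequence name>.npy
        return np.load(os.path.join(feature_dir, sequence_name + ".npy"))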
C3D
C3D is a feature for generic video analysis [2]. It is obtained by training a deep 3D convolutional network on a large annotated video dataset. We used the 4096-dimensional output of the fully connected layers of the 3D CNN as a feature vector for training and evaluating the model for memorability prediction. We used the source code and the pre-trained models from the following location: https://github.com/facebook/C3D
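As an illustration of how such a feature vector can be used, the sketch below regresses memorability scores from 4096-dimensional C3D features with a support-vector regressor; this is a stand-in learner, not necessarily the model of [1], and the arrays are placeholders for the real features and ground truth.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR

    X = np.random.rand(660, 4096)  # placeholder for the real C3D features
    y = np.random.rand(660)        # placeholder for the memorability scores

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    reg = SVR(kernel="rbf").fit(X_tr, y_tr)
    print("R^2 on held-out excerpts:", reg.score(X_te, y_te))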
AudioSet
AudioSet [3] consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. We used the code provided by Google to extract the 128-dimensional embeddings for the audio tracks of the video segments in our dataset. The ontology of the events can be found here: https://github.com/audioset/ontology
The code and models can be found here: https://github.com/tensorflow/models/tree/master/research/audioset
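The extractor produces one 128-dimensional embedding per roughly one second of audio. One simple way to obtain a fixed-size clip-level descriptor, sketched below, is to average these frame embeddings; the pooling choice is an assumption on our part, not prescribed by the extraction code.

    import numpy as np

    def pool_audio_embeddings(embeddings):
        """Average per-second 128-d embeddings into one clip descriptor."""
        embeddings = np.asarray(embeddings)
        assert embeddings.ndim == 2 and embeddings.shape[1] == 128
        return embeddings.mean(axis=0)

    clip_feature = pool_audio_embeddings(np.random.rand(10, 128))  # placeholder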
Image captions
We extracted image captions for frames sampled every second of the video segment. For each word in an image caption, we computed the word embedding using a word2vec model. We used the following model to compute the image captions: https://github.com/karpathy/neuraltalk2, which implements the work in [4]. For the word2vec model, we used Python's Natural Language Toolkit (NLTK).
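A minimal sketch of the caption-embedding step, assuming a pre-trained word2vec model in the standard binary format loaded with gensim (one possible toolkit among others; the model path is an assumption):

    import numpy as np
    from gensim.models import KeyedVectors

    # Hypothetical path to a pre-trained word2vec model in binary format.
    model = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    def caption_embedding(caption):
        """Mean word2vec vector over the in-vocabulary words of a caption."""
        vectors = [model[w] for w in caption.lower().split() if w in model]
        return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

    vec = caption_embedding("a man riding a horse on a beach")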
SentiBank
To obtain sentiment from the visual signal, we used the SentiBank visual concept detector [5]. SentiBank is a set of 1,200 trained visual concept detectors providing a mid-level representation of sentiment, associated training images acquired from Flickr, and a benchmark containing 603 photo tweets covering a diverse set of 21 topics. We picked the top-50 visual concepts with the highest confidence and, for each of these 50 visual concepts, extracted the word2vec embeddings and averaged the vectors across the video. We used the code and data available here: http://www.ee.columbia.edu/ln/dvmm/vso/download/sentibank.html
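One plausible reading of this step, sketched below: average the 1,200 detector confidences over the frames of a video, keep the 50 highest-scoring concepts, embed the words of each adjective-noun pair (e.g. "beautiful_sky"), and average the resulting vectors. The score matrix and the embed_word helper are placeholders.

    import numpy as np

    def sentibank_feature(scores, concept_names, embed_word, top_k=50):
        """scores: (n_frames, 1200) SentiBank confidences per frame."""
        mean_scores = np.asarray(scores).mean(axis=0)   # pool over frames
        top = np.argsort(mean_scores)[::-1][:top_k]     # strongest concepts
        vectors = [embed_word(w)                        # e.g. "beautiful_sky"
                   for i in top
                   for w in concept_names[i].split("_")]
        return np.mean(vectors, axis=0)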
Affect (arousal and valence)
To extract emotion-related information, we computed arousal and valence scores from the audio-visual signal. For arousal, which represents the level of excitement in the video, we used the shot-change frequency, the energy in the audio signal, and the motion activity. For valence, which represents the positive/negative emotion in the video, we used the HSV histogram of each frame in the video. We used an implementation based on Hanjalic & Xu, 2005 [6].
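Two of the low-level ingredients are easy to illustrate: short-term audio energy (one component of arousal) and a per-frame hue histogram in HSV space (one component of valence). The sketch below uses standard NumPy/OpenCV calls and does not reproduce the exact weighting scheme of Hanjalic & Xu [6].

    import cv2
    import numpy as np

    def audio_energy(samples, frame_len=1024):
        """Mean squared amplitude per non-overlapping audio frame."""
        samples = np.asarray(samples, dtype=np.float64)
        n = len(samples) // frame_len * frame_len
        return (samples[:n].reshape(-1, frame_len) ** 2).mean(axis=1)

    def hue_histogram(frame_bgr, bins=16):
        """Normalized hue histogram of one frame (OpenCV hue range is 0-179)."""
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180])
        return (hist / hist.sum()).ravel()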
References
[1] Cohendet, R., Yadati, K., Duong, N. Q., & Demarty, C.-H. (2018). Annotating, understanding, and predicting long-term video memorability. In Proceedings of the ICMR 2018 Conference, Yokohama, Japan, June 11-14, 2018.
[2] Tran, D., Bourdev, L. D., Fergus, R., Torresani, L., & Paluri, M. (2014). C3D: Generic features for video analysis. CoRR, abs/1412.0767.
[3] Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., ... & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017) (pp. 776-780). IEEE.
[4] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015).
[5] Borth, D., Ji, R., Chen, T., Breuel, T., & Chang, S.-F. (2013). Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM International Conference on Multimedia (pp. 223-232). ACM.
[6] Hanjalic, A., & Xu, L.-Q. (2005). Affective video content representation and modeling. IEEE Transactions on Multimedia, 7(1), 143-154.