The data is delivered in three different sub-packages:

  • Data used for the 2013 and earlier versions of the benchmark (old naming of the annotations)
  • Data used in the 2014 benchmark:
    • new naming of the annotations, web videos and features;
    • old naming of the annotations.

The ground truth was created from a collection of 32 movies of different genres, ranging from extremely violent to non-violent. Due to copyright issues, these movies cannot be delivered. We therefore provide the complete movie list, together with links to the DVDs used for the annotation on the Amazon website.

In 2014, 86 short web videos downloaded from YouTube and normalised to a frame rate of 25 fps were also annotated.

For all these movies and videos, the ground truth consists of segments containing violence according to the following definition:

Violent scenes are “scenes one would not let an 8-year-old child see because they contain physical violence”. This is referred to as the “subjective definition” in the following.

In addition to segments containing physical violence according to the above definition, the annotations also include, for part of the development set only (18 movies), the following high-level concepts: presence of blood, fights, presence of fire, presence of guns, presence of cold arms, car chases and gory scenes for the visual modality; presence of gunshots, explosions and screams for the audio modality. For the development set, we also include an additional definition of violence, the “objective definition”, which was used in the previous versions of the task:

- Objective definition: “physical violence or accident resulting in human injury or pain”.

Violent segments and high-level video concepts were annotated at frame level at 25 fps. Each segment or concept is therefore defined by its starting and ending frame numbers. Only segments corresponding to the targeted events were annotated, i.e., only these appear in the ground-truth files.
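Since the video annotations are frame-indexed at 25 fps, converting between frame numbers and times is a one-line operation. A minimal sketch (the function names are ours, not part of the dataset):

```python
FPS = 25  # annotation frame rate, as stated above

def frame_to_seconds(frame: int) -> float:
    """Convert a ground-truth frame number to a time in seconds."""
    return frame / FPS

def seconds_to_frame(seconds: float) -> int:
    """Convert a time in seconds to the nearest frame number."""
    return round(seconds * FPS)
```

This is useful for aligning the frame-indexed video annotations with the audio annotations, which are given in seconds.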

High-level audio concepts are defined by their starting and ending times in seconds. Contrary to the video part of the annotation, all segments of a movie appear in the ground-truth files, i.e., both segments corresponding to the targeted events and segments with no event.

All segments and concepts – audio and video – may also have additional tags, describing the events, depending on their types.

All annotations are provided in text format, one file per concept (some meaningful suffixes were used), with the following format:

starting_time ending_time additional_tags_if_any
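Given this whitespace-separated layout, a ground-truth file can be read with a few lines of code. The sketch below is our own illustration (the `Segment` class and the function name are not part of the dataset):

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float          # starting frame number (video) or time in seconds (audio)
    end: float            # ending frame number or time in seconds
    tags: list = field(default_factory=list)  # optional additional tags

def parse_annotation_file(path):
    """Parse one whitespace-separated ground-truth file into a list of segments."""
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            start, end = float(parts[0]), float(parts[1])
            segments.append(Segment(start, end, parts[2:]))
    return segments
```

The same parser works for both video files (frame numbers) and audio files (seconds), since both are plain whitespace-separated columns.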

In 2014, standard audio and video features were also included for the movies and web videos used in the 2014 benchmark.

Violent segments
Two different annotations are provided, one for each definition of violence. For a given definition, each violent segment contains only one action according to this definition, whenever possible. Cases where different actions overlap are provided as a single segment, with the additional tag ‘multiple_action_scene’.

Video concept – Presence of blood
As soon as blood is visually present in the images, it is annotated. Additional tags representing the proportion of the screen covered with blood are added. These tags take one of the following values:

  • unnoticeable: some blood pixels are present, but they cover no more than 5% of the image
  • low: blood pixels cover between 5% and 25% of the image
  • medium: blood pixels cover between 25% and 50% of the image
  • high: blood pixels cover more than 50% of the image
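These thresholds can be expressed directly in code. A hypothetical helper (the exact handling of the 5%, 25% and 50% boundaries is our reading of the list above):

```python
def blood_tag(coverage: float) -> str:
    """Map the fraction of the image covered by blood pixels (0.0-1.0)
    to the corresponding annotation tag."""
    if coverage <= 0.05:
        return "unnoticeable"
    if coverage < 0.25:
        return "low"
    if coverage < 0.50:  # half-open interval: 25% inclusive, 50% exclusive
        return "medium"
    return "high"
```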

Video concept – Fights 
Different types of fights were annotated, resulting in different tags in the files:

  • 1vs1: only two people fighting
  • small: a small group of people (the exact number was not counted; roughly fewer than 10)
  • large: a large group of people (more than 10)
  • distant attack: no actual fight, but somebody is shot or attacked at a distance (gunshot, arrow, car, etc.)

Fights may also involve a human against an animal.

Video concept – Presence of fire
As soon as fire is visually present in the images, it is annotated. It can be anything from a large fire to the flash of a gun being fired, a candle, a cigarette lighter, a cigarette, or sparks. A space shuttle taking off also generates fire, and explosions are included. When the fire is not yellow or orange, an additional tag indicates its color. If too many extra colors are visible, a ‘multicolor’ tag is used.

Video concept – Presence of firearms (guns and assimilated) 
When any type of gun or similar firearm is shown on screen, it is annotated. Guns with bayonets were annotated as guns whenever any part of the weapon is visible, even if it is only part of the bayonet.

Video concept – Presence of cold arms 
Same as for firearms, but for any kind of cold arms. Guns with bayonets were also annotated as cold arms, but only when the bayonet is visible.

Video concept – Car chases
Annotations of car chases indicate segments showing a car chase.

Video concept – Gory scenes 
Annotations of gory scenes indicate graphic images of bloodletting and/or tissue damage, including horror or war representations. As this is a subjective and difficult notion to define, some additional segments showing particularly disgusting mutants or creatures were also annotated. In such cases, additional tags describing the event/scene were added.

Audio concept – Gunshots 
Each gunshot was annotated as a single segment whenever possible, with the tag ‘gunshot’ and the corresponding starting and ending times in seconds. The tag ‘multiple_actions’ was used when several events happen together. The tag ‘(nothing)’ corresponds to segments with no event. Cannon fire was also annotated as gunshots, e.g., in Pirates of the Caribbean, or with the tag ‘canon_fire’ in Saving Private Ryan, wherever possible. The additional tag ‘multiple_actions_canon_fire’ was also used when appropriate. The tags ‘canon_fire’ and ‘multiple_actions_canon_fire’ mean that cannon fire can be heard but no gunshots, whereas the tags ‘gunshot’ and ‘multiple_actions’ may indicate that cannon fire was possibly heard in addition to gunshots.

Audio concept – Explosions 
Same format as above, with tags ‘explosion’, ‘multiple_actions’ and ‘(nothing)’. Any kind of explosion was annotated, even magic explosions.

Audio concept – Scream 
Same format as above, with tags ‘scream’, ‘multiple_actions’ and ‘(nothing)’. Anything from non-verbal screams to what we call ‘effort noise’ was annotated, as long as a human or a humanoid (e.g., a mutant in I Am Legend) is the origin of the noise. Effort noises were annotated using the tags ‘scream_effort’ or ‘multiple_actions_scream_effort’. Animal screams were not annotated, nor were screams in which one can recognize words.

About

The VSD benchmark is a collection of ground-truth files based on the extraction of violent events in movies and web videos, together with high-level audio and video concepts. It is intended to be used for assessing the quality of methods for the detection of violent scenes and/or the recognition of some high-level, violence-related concepts in movies and web videos.

The data has been produced by Technicolor for the 2012 subset, and by Fudan University and the Ho Chi Minh University of Science for the 2013 and 2014 subsets. It has been described in several publications. A detailed description of the benchmark can be found on our Data Description page. The license conditions are mentioned on the download page.

This dataset was used in the multimodal benchmark MediaEval, for the 2011, 2012, 2013 and 2014 Affect Task – Violent Scenes Detection.

 

ACKNOWLEDGEMENTS

We would like to thank the MediaEval benchmark and its organizers for their support in the creation of this dataset. We also thank the different co-organizers over all these past years, and of course our annotators.

The creation of this benchmark has also been supported, in part, by:

  • the Quaero Program
  • China’s National 973 Program (#2010CB327900)
  • China’s NSF Projects (#61201387 and #61228205)
  • the VNU-HCM Project (#B2013-26-01)
  • the Academy of Finland (grants no. 255745 and 251170)
  • UEFISCDI SCOUTER (grant no. 28DPST/30-08-2013)
  • the Austrian Science Fund (FWF): P25655
  • EU FP7-ICT-2011-9: project no. 601166 (“PHENICX”)

CITING VIOLENT SCENES DATASET

If you make use of the VSD dataset, or refer to its results, please use the following citation:

C.H. Demarty, C. Penet, M. Soleymani, G. Gravier. VSD, a public dataset for the detection of violent scenes in movies: design, annotation, analysis and evaluation. In Multimedia Tools and Applications, May 2014. (pdf)

C.H. Demarty, B. Ionescu, Y.G. Jiang, and C. Penet. Benchmarking Violent Scenes Detection in movies. In Proceedings of the 2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI), 2014. (pdf)

M. Sjöberg, B. Ionescu, Y.G. Jiang, V.L. Quang, M. Schedl and C.H. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain (2014). (pdf)

C.H. Demarty, C. Penet, G. Gravier and M. Soleymani. A benchmarking campaign for the multimodal detection of violent scenes in movies. In Proceedings of the 12th European Conference on Computer Vision – Volume Part III (ECCV’12), Andrea Fusiello, Vittorio Murino, and Rita Cucchiara (Eds.), Vol. Part III. Springer-Verlag, Berlin. (pdf)

Download

In order to get the data, you are asked to supply your name and email address. You will receive instructions on how to download the dataset at this email address. We may store the data you supplied in order to contact you later about benchmark-related matters. The data will not be used in any other way.

To download the violent scenes dataset, please send an email to vsdmanagement@interdigital.com. By doing this you irrevocably agree to any and all provisions of the license agreement on this page.

 

TERMS OF USE

VIOLENT SCENES DATASET RELEASE AGREEMENT

The scene selection you are about to download, should you agree to these Terms of Use, may not be suitable for children. Some of the scenes are taken from extremely violent movies.

In the following we will refer to the Violent Scenes Dataset as VSD dataset.

The goal of the VSD dataset is to develop new techniques, technology, and algorithms for the automatic detection of violent scenes in movies.

The VSD dataset was produced in three steps, leading to two different subsets: the 2012 subset and the 2013-2014 subset.

The 2012 subset consists of the annotations of:

  • the high level audio and video concepts and the violent scenes according to the objective definition for the following movies:
    Léon, Reservoir Dogs, Armageddon, I am Legend, Saving Private Ryan, Eragon, Harry Potter and the order of the Phoenix, Billy Elliot, Pirates of the Caribbean – the curse of the black pearl, The Sixth Sense, The Wicker Man, Midnight Express, Kill Bill, The Wizard of Oz, The Bourne Identity.
  • the violent scenes according to the objective definition for the following movies:
    Dead Poets Society, Fight Club, Independence Day.

The 2013-2014 subset consists of the annotations of:

  • the high level audio and video concepts for the following movies:
    Dead Poets Society, Fight Club, Independence Day.
  • the violent scenes according to the objective definition for the following movies:
    Fantastic Four 1, Fargo, Forrest Gump, Legally Blond, Pulp Fiction, The God Father 1, The Pianist .
  • the violent scenes according to the subjective definition for all the following movies:
    Léon, Reservoir Dogs, Armageddon, I am Legend, Saving Private Ryan, Eragon, Harry Potter and the order of the Phoenix, Billy Elliot, Pirates of the Caribbean – the curse of the black pearl, The Sixth Sense, The Wicker Man, Midnight Express, Kill Bill, The Wizard of Oz, The Bourne Identity, Dead Poets Society, Fight Club, Independence Day, Fantastic Four 1, Fargo, Forrest Gump, Legally Blond, Pulp Fiction, The God Father 1, The Pianist, 8 Mile, Braveheart, Desperado, Ghost in the Shell, Jumanji, Terminator 2, V for Vendetta.
  • the violent scenes according to the subjective definition for 86 short web videos.

The 2014 subset also contains standard audio and video features.

Technicolor has copyright and all rights of authorship on the 2012 data, whereas Fudan University and the Ho Chi Minh University of Science share the copyright and all rights of authorship on the 2013-2014 data. Technicolor is the principal distributor of the VSD dataset.

RELEASE OF THE DATASET

To advance the state-of-the-art in violent scenes detection, the VSD dataset is made available to the researcher community for scientific research only.  All other uses of the VSD dataset will be considered on a case-by-case basis. To receive a copy of the VSD dataset, the requestor must agree to observe all of these Terms of use.

CONSENT

The researcher(s) agrees to the following restrictions on the VSD dataset:

1. Redistribution: Without prior written approval from Technicolor,  the 2012 part of the VSD dataset,  in whole or in part, shall not be further distributed, published, copied, or disseminated in any way or form whatsoever, whether for profit or not. For the avoidance of any doubt, this prohibition includes further distributing, copying or disseminating to a different facility or organizational unit in the requesting university, organization, or company.

2. Modification and Non Commercial Use: Without prior written approval from Technicolor, the 2012 part of the VSD dataset, in whole or in part, may not be modified or used for commercial purposes. Modification is allowed for scientific research purposes only. It would be highly appreciated if the modified VSD dataset was shared with Technicolor, at this address: vsdmanagement@interdigital.com.

For the avoidance of doubt, commercial purposes include but are not limited to:

  • Development of commercial systems,
  • proving the efficiency of commercial systems,
  • training or testing of commercial systems,
  • using screenshots of data from the database in advertisements,
  • selling data from the database

3. Publication Requirements: In no case should the still frames or videos be used in any way that could directly or indirectly harm Technicolor, the Fudan University or the Ho Chi Minh University of Science.  Technicolor, the Fudan University or the Ho Chi Minh University of Science permit publication (paper or web-based) of the data for scientific purposes only. Any other publication without scientific and academic value is strictly prohibited.

4. Citation/Reference: All documents and papers that report on research that uses the VSD dataset must acknowledge the use of the dataset by including an appropriate citation to the following:

  • C.H. Demarty, C. Penet, M. Soleymani, G. Gravier. VSD, a public dataset for the detection of violent scenes in movies: design, annotation, analysis and evaluation. In Multimedia Tools and Applications, May 2014.
  • C.H. Demarty, B. Ionescu, Y.G. Jiang, and C.Penet. Benchmarking Violent Scenes Detection in movies. In Proceedings of the 2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI), 2014.
  • M. Sjöberg, B. Ionescu, Y.G. Jiang, V.L. Quang, M. Schedl and C.H. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain (2014)
  • C.H. Demarty, C. Penet, G. Gravier and M. Soleymani. A benchmarking campaign for the multimodal detection of violent scenes in movies. In Proceedings of the 12th European Conference on Computer Vision – Volume Part III (ECCV’12), Andrea Fusiello, Vittorio Murino, and Rita Cucchiara (Eds.), Vol. Part III. Springer-Verlag, Berlin.

5. No Warranty: THE PROVIDER OF THE DATA MAKES NO REPRESENTATIONS AND EXTENDS NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED. THERE ARE NO EXPRESS OR IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, OR THAT THE USE OF THE MATERIAL WILL NOT INFRINGE ANY PATENT, COPYRIGHT, TRADEMARK, OR OTHER PROPRIETARY RIGHTS.

The Principal Investigators can be contacted via email.

Sources

Among the 31 movies used for the benchmark in 2014, 24 are dedicated to the training step and 7 to the test step. High-level concept annotations are only provided for the first 18 movies of the training set.
Please note that the movie Kill Bill was officially removed from the training set for the Violent Scenes Detection Task due to availability issues. However, we still provide the annotations for this movie together with the VSD dataset.

OFFICIAL 2014 TRAINING SET:

Léon, Reservoir Dogs, Armageddon, I am Legend, Saving Private Ryan, Eragon, Harry Potter and the order of the Phoenix, Billy Elliot, Pirates of the Caribbean – the curse of the black pearl, The Sixth Sense, The Wicker Man, Midnight Express,  The Wizard of Oz, The Bourne Identity, Independence Day, Fight Club, Dead Poets Society, Fantastic Four 1, Fargo, Forrest Gump, Legally Blond, Pulp Fiction, The God Father 1, The Pianist

ADDITIONAL TRAINING SET:

Kill Bill

OFFICIAL 2014 TEST SET:

  • 7 Hollywood movies: 8 Mile, Braveheart, Desperado, Ghost in the Shell, Jumanji, Terminator 2, V for Vendetta.
  • 86 short web videos downloaded from YouTube

Due to copyright issues, we cannot deliver the movies themselves, only the corresponding annotations. We therefore provide the complete movie list, together with links to the DVDs used for the annotation on the Amazon website.

The 86 web videos, under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/), are included in the downloadable package.

LINKS TO THESE MOVIES ON THE AMAZON WEB SITE:

ADDITIONAL INFORMATION FOR THE YOUTUBE VIDEOS:

This part of the test set contains 86 mp4 files downloaded from YouTube and normalised to a frame rate of 25 fps using the libav utilities.

Each movie file (within the zip files) is named according to the YouTube video id, so for example 0egEFZq2Y28.mp4 has the id "0egEFZq2Y28". The original file can be accessed from the URL

https://www.youtube.com/watch?v=[ID]

substituting [ID] with the YouTube video id of the video. Note that some videos may have disappeared from YouTube since they were downloaded by the organisers.
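This mapping can be automated directly from the downloaded filenames. A small sketch (the helper name is ours):

```python
from pathlib import Path

def youtube_url(filename: str) -> str:
    """Rebuild the original YouTube URL from a downloaded mp4 filename."""
    video_id = Path(filename).stem  # e.g. "0egEFZq2Y28" from "0egEFZq2Y28.mp4"
    return f"https://www.youtube.com/watch?v={video_id}"
```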

The metadata provided by YouTube is included in a single zip file (xml-metadata.zip). There is one XML file per video, named after the YouTube video id. From it you can extract, e.g., the title, description, license (all are Creative Commons licences that allow redistribution), and original author.
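Such fields can be pulled out of the XML files with the Python standard library. Since the exact schema of the metadata files is not specified here, the sketch below searches by local tag name and ignores XML namespaces; the helper name, and the assumption that fields such as the title appear as similarly named elements, are ours:

```python
import xml.etree.ElementTree as ET

def find_field(xml_path, field):
    """Return the text of the first element whose local tag name matches
    `field`, ignoring any XML namespace prefix on the tag."""
    tree = ET.parse(xml_path)
    for elem in tree.iter():
        if elem.tag.split('}')[-1] == field:  # strip '{namespace}' prefix if present
            return elem.text
    return None  # field not found in this file
```

For example, `find_field("0egEFZq2Y28.xml", "title")` would return the video's title, assuming a title element exists in that file.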