The data is delivered in three sub-packages:
- Data used for the 2013 and earlier versions of the benchmark (old naming of the annotations);
- Data used in the 2014 benchmark:
  - new naming of the annotations, web videos and features;
  - old naming of the annotations.
The ground truth was created from a collection of 32 movies of different genres, ranging from extremely violent to non-violent movies. Due to copyright restrictions, these movies cannot be distributed. We therefore provide the complete list of movies, together with links to the Amazon pages of the DVDs used for the annotation.
In 2014, 86 short web videos, downloaded from YouTube and normalised to a frame rate of 25 fps, were also annotated.
For all these movies and videos, the ground truth consists of segments containing violence according to the following definition:
Violent scenes are “scenes one would not let an 8-year-old child see because they contain physical violence”. In the following, this is referred to as the “subjective definition”.
In addition to the segments containing physical violence according to the above definition, the annotations for part of the development set only (18 movies) also include the following high-level concepts:
- visual modality: presence of blood, fights, presence of fire, presence of guns, presence of cold arms, car chases and gory scenes;
- audio modality: presence of gunshots, explosions and screams.
For the development set, we also provide annotations according to an additional definition of violence, the “objective definition”, which was used in previous versions of the task:
- Objective definition: “physical violence or accident resulting in human injury or pain”
Violent segments and high-level video concepts were annotated at the frame level, at 25 fps. Each segment or concept is therefore defined by its starting and ending frame numbers. Only segments that correspond to the targeted events were annotated, i.e., only those segments are present in the ground-truth files.
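Because video segments are given as frame numbers at 25 fps, converting them to seconds (for example, to compare them with the audio annotations) amounts to dividing by the frame rate. The following is a minimal sketch, assuming that frame n starts at n / 25 s and that the ending frame belongs to the segment; the description does not state whether frames are numbered from 0 or 1, so this should be checked against the data. The helper name `frames_to_seconds` is hypothetical.

```python
FPS = 25  # frame rate used for the video annotations


def frames_to_seconds(start_frame: int, end_frame: int, fps: int = FPS) -> tuple[float, float]:
    """Convert an annotated [start_frame, end_frame] segment to (start_s, end_s).

    Assumption: frame n covers [n / fps, (n + 1) / fps[ and the ending frame
    is included in the segment.
    """
    if end_frame < start_frame:
        raise ValueError("ending frame precedes starting frame")
    return start_frame / fps, (end_frame + 1) / fps


# Frames 250 to 374 span 125 frames, i.e. 5 seconds.
print(frames_to_seconds(250, 374))  # (10.0, 15.0)
```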
High-level audio concepts are defined by their starting and ending times in seconds. Unlike the video annotations, the audio ground-truth files cover all segments of the movie, i.e., both the segments that correspond to the targeted events and the segments with no event.
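Since audio concepts are given in seconds while video annotations use frame numbers at 25 fps, aligning the two requires mapping a time interval to the frames it overlaps. A minimal sketch, assuming the same frame convention as for the video annotations (frame n covers [n / 25, (n + 1) / 25[ s), which the description does not state explicitly; `seconds_to_frames` is a hypothetical helper name.

```python
import math

FPS = 25  # frame rate of the video annotations


def seconds_to_frames(start_s: float, end_s: float, fps: int = FPS) -> range:
    """Return the frame indices overlapped by an audio segment [start_s, end_s].

    Assumption: frame n covers [n / fps, (n + 1) / fps[.
    """
    if end_s < start_s:
        raise ValueError("ending time precedes starting time")
    first = math.floor(start_s * fps)
    # A zero-length segment still maps to the single frame it falls in.
    last = max(first, math.ceil(end_s * fps) - 1)
    return range(first, last + 1)


frames = seconds_to_frames(10.0, 15.0)
print(frames.start, frames.stop - 1)  # 250 374
```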
All segments and concepts, audio and video alike, may also carry additional tags describing the events, depending on their type.
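All annotations are provided as text files, one file per concept (the file names carry meaningful suffixes), with the following format:
starting_time ending_time additional_tags_if_any
For the violent segments and the video concepts, the starting and ending values are frame numbers; for the audio concepts, they are times in seconds.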
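A file in this format can be read line by line. The sketch below assumes whitespace-separated fields, one segment per line and tags that contain no spaces; none of this is guaranteed by the description, so it should be checked against the actual files. The `Segment` class and `parse_annotation_file` function are hypothetical names, and the sample lines are made up for illustration, not taken from the dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float  # frame number (video concepts) or seconds (audio concepts)
    end: float
    tags: list[str] = field(default_factory=list)


def parse_annotation_file(text: str) -> list[Segment]:
    """Parse 'starting_time ending_time additional_tags_if_any' lines.

    Assumptions: whitespace-separated fields, one segment per line,
    blank lines ignored.
    """
    segments = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected at least two fields")
        start, end = float(fields[0]), float(fields[1])
        segments.append(Segment(start, end, fields[2:]))
    return segments


# Made-up sample in the audio format (times in seconds).
example = """0.00 3.52 (nothing)
3.52 4.10 gunshot
4.10 9.87 (nothing)
"""
for seg in parse_annotation_file(example):
    print(seg.start, seg.end, seg.tags)
```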
In 2014, standard audio and video features were also provided for the movies and web videos used in the 2014 benchmark.
Two different sets of violence annotations are provided, one for each definition of violence. For a given definition, each violent segment contains, whenever possible, only one action matching that definition. When different actions overlap, they are annotated as a single segment with the additional tag ‘multiple_action_scene’.
Video concept – Presence of blood
Whenever blood is visible in the images, it is annotated. Additional tags indicate the proportion of the screen covered by blood. These tags take the values unnoticeable, low, medium or high, defined as follows:
- unnoticeable: some blood pixels are present and their surface represents no more than 5% of the image
- low: surface_of_blood_pixels is between 5% and 25%
- medium: surface_of_blood_pixels in [25%, 50%[
- high: surface_of_blood_pixels > 50%
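These thresholds can be written as a small mapping function, e.g. to compare automatically estimated blood masks with the tags. This is a sketch assuming the blood surface is given as a fraction of the image in ]0, 1]; the annotators assigned the tags visually, so it illustrates the thresholds rather than the procedure used to build the ground truth. The definitions leave the value of exactly 50% uncovered; treating it as ‘high’ is an assumption. `blood_tag` is a hypothetical name.

```python
def blood_tag(blood_fraction: float) -> str:
    """Map the fraction of the image covered by blood pixels to a blood tag."""
    if not 0.0 < blood_fraction <= 1.0:
        raise ValueError("blood fraction must be in ]0, 1]")
    if blood_fraction <= 0.05:
        return "unnoticeable"  # no more than 5%
    if blood_fraction < 0.25:
        return "low"  # between 5% and 25%
    if blood_fraction < 0.50:
        return "medium"  # [25%, 50%[
    return "high"  # > 50%; exactly 50% is mapped here by assumption


print(blood_tag(0.3))  # medium
```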
Video concept – Fights
Different types of fights were annotated, each with its own tag in the file:
- 1vs1: only two people fighting
- small: for a small group of people (the number of people was not counted, but it roughly corresponds to fewer than 10)
- large: for a large group of people (> 10)
- distant attack: no real fight, but somebody is shot or attacked from a distance (gunshot, arrow, car, etc.)
Fights may also involve a human against an animal.
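To get an overview of the fight annotations of a movie, one can sum the duration of the fight segments per type. A sketch assuming the segments have already been parsed into (start_frame, end_frame, fight_type) tuples, that the ending frame is included, and that the tag strings in the files match the list above exactly (an assumption, in particular for ‘distant attack’, which contains a space). `fight_duration_by_type` is a hypothetical name and the segments in the example are made up.

```python
from collections import defaultdict

FPS = 25  # frame rate of the video annotations

# Tag strings as listed in the description; exact spelling in the files is assumed.
FIGHT_TYPES = ("1vs1", "small", "large", "distant attack")


def fight_duration_by_type(segments, fps=FPS):
    """Sum the duration in seconds of fight segments per fight type.

    `segments` is an iterable of (start_frame, end_frame, fight_type) tuples;
    the ending frame is assumed to be included in the segment.
    """
    totals = defaultdict(float)
    for start, end, fight_type in segments:
        if fight_type not in FIGHT_TYPES:
            raise ValueError(f"unknown fight type: {fight_type!r}")
        totals[fight_type] += (end - start + 1) / fps
    return dict(totals)


made_up = [(100, 149, "1vs1"), (300, 424, "large"), (500, 524, "1vs1")]
print(fight_duration_by_type(made_up))  # {'1vs1': 3.0, 'large': 5.0}
```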
Video concept – Presence of fire
Whenever fire is visible in the images, it is annotated. This ranges from a large fire to the flash coming out of a gun being fired, and also includes candles, cigarette lighters, cigarettes, sparks and the fire produced by a space shuttle taking off. Explosions are included as well. When the fire is not yellow or orange, an additional tag indicates its color; when too many different colors are visible, the tag ‘multicolor’ is used.
Video concept – Presence of firearms (guns and similar weapons)
Whenever any type of gun or similar weapon is shown on screen, it is annotated. Guns with bayonets were annotated as guns whenever any part of them is visible, even if it is only the bayonet.
Video concept – Presence of cold arms
Same as for firearms, but for any kind of cold arms. Guns with bayonets were also annotated as cold arms, but only when the bayonet is visible.
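Because a gun with a visible bayonet appears in both the firearms and the cold-arms annotations, the two files can overlap. A sketch of finding the frames annotated in both, using a standard two-pointer interval intersection, assuming each list holds (start_frame, end_frame) tuples with inclusive ends, sorted by starting frame and non-overlapping within the list (assumptions about the files). `overlapping_frames` is a hypothetical name and the segments are made up.

```python
def overlapping_frames(a, b):
    """Return the frame ranges present in both sorted lists of segments."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:  # inclusive ends: a single shared frame counts
            result.append((start, end))
        # Advance whichever segment finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


firearms = [(100, 200), (400, 450)]
cold_arms = [(150, 180), (190, 420)]
print(overlapping_frames(firearms, cold_arms))  # [(150, 180), (190, 200), (400, 420)]
```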
Video concept – Car chases
Annotations of car chases indicate segments showing a car chase.
Video concept – Gory scenes
Gory scene annotations indicate graphic images of bloodletting and/or tissue damage, including horror or war depictions. As this is also a subjective notion that is difficult to define, some additional segments showing particularly disgusting mutants or creatures were annotated; in this case, additional tags describing the event or scene were added.
Audio concept – Gunshots
Each gunshot was annotated as a single segment whenever possible, with the tag ‘gunshot’ and the corresponding starting and ending times in seconds. The tag ‘multiple_actions’ was used when several events happen together. The tag ‘(nothing)’ corresponds to segments with no event. Cannon fire was also annotated wherever possible, either as gunshots (e.g., in Pirates of the Caribbean) or with the tag ‘canon_fire’ (e.g., in Saving Private Ryan). The additional tag ‘multiple_actions_cannon_fire’ was also used when appropriate. The tags ‘canon_fire’ and ‘multiple_actions_canon_fire’ mean that cannon fire can be heard but no gunshots, whereas the tags ‘gunshot’ and ‘multiple_actions’ indicate gunshots, possibly with cannon fire heard in addition.
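The tag semantics above can be captured in a small lookup, which is useful when only actual gunshots (and not cannon-only segments) should count as positives. A sketch under the assumption that each segment carries exactly one of the listed tags; since the description spells the combined cannon tag both ‘multiple_actions_cannon_fire’ and ‘multiple_actions_canon_fire’, both spellings are accepted. `classify_gunshot_segment` is a hypothetical name.

```python
GUNSHOT_TAGS = {"gunshot", "multiple_actions"}
# Both spellings of the combined cannon tag appear in the description.
CANNON_ONLY_TAGS = {"canon_fire", "multiple_actions_canon_fire", "multiple_actions_cannon_fire"}
NO_EVENT_TAG = "(nothing)"


def classify_gunshot_segment(tag: str) -> str:
    """Interpret a tag from a gunshot annotation file.

    Returns 'gunshot' (gunshots heard, possibly with cannon fire),
    'cannon_only' (cannon fire but no gunshots) or 'no_event'.
    """
    if tag == NO_EVENT_TAG:
        return "no_event"
    if tag in GUNSHOT_TAGS:
        return "gunshot"
    if tag in CANNON_ONLY_TAGS:
        return "cannon_only"
    raise ValueError(f"unexpected tag: {tag!r}")


print(classify_gunshot_segment("canon_fire"))  # cannon_only
```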
Audio concept – Explosions
Same format as above, with the tags ‘explosion’, ‘multiple_actions’ and ‘(nothing)’. All kinds of explosions were annotated, including magic explosions.
Audio concept – Scream
Same format as above, with the tags ‘scream’, ‘multiple_actions’ and ‘(nothing)’. Everything from non-verbal screams to what we call ‘effort noise’ was annotated, as long as the noise originates from a human or a humanoid (e.g., the mutants in I Am Legend). Effort noises were annotated with the tags ‘scream_effort’ or ‘multiple_actions_scream_effort’. Neither animal screams nor screams in which words can be recognized were annotated.
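Depending on the use case, effort noises may or may not be counted as screams. A sketch of selecting scream segments, assuming they have already been parsed into (start_s, end_s, tag) tuples with one tag per segment; `scream_segments` is a hypothetical name and the sample segments are made up. Unknown tags raise an error so that spelling differences in the files are noticed rather than silently dropped.

```python
SCREAM_TAGS = {"scream", "multiple_actions"}
EFFORT_TAGS = {"scream_effort", "multiple_actions_scream_effort"}
NO_EVENT_TAG = "(nothing)"


def scream_segments(segments, include_effort=True):
    """Keep the (start_s, end_s, tag) segments that contain a scream.

    '(nothing)' segments are dropped; effort noises are kept only when
    include_effort is True.
    """
    known = SCREAM_TAGS | EFFORT_TAGS
    wanted = known if include_effort else SCREAM_TAGS
    kept = []
    for start, end, tag in segments:
        if tag == NO_EVENT_TAG:
            continue
        if tag not in known:
            raise ValueError(f"unexpected tag: {tag!r}")
        if tag in wanted:
            kept.append((start, end, tag))
    return kept


made_up = [(0.0, 2.5, "(nothing)"), (2.5, 3.1, "scream"), (3.1, 4.0, "scream_effort")]
print(scream_segments(made_up, include_effort=False))  # [(2.5, 3.1, 'scream')]
```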