The Violent Scenes Dataset (VSD) is a collection of ground-truth files based on the extraction of violent events in movies and web videos, together with high-level audio and video concepts. It is intended for assessing the quality of methods for detecting violent scenes and/or recognizing high-level, violence-related concepts in movies and web videos.

The data was produced by Technicolor for the 2012 subset and by the 

 for the 2013 and 2014 subsets. It has been described in several publications. A detailed description of the benchmark can be found on our Data Description page. The license conditions are mentioned on the Download page.

This dataset was used in the multimodal benchmark 

We would like to thank the MediaEval benchmark and its organizers for their support in the creation of this dataset. We would also like to thank the co-organizers over the years:

, co-organizers of the Affect Task in 2011 and 2012;

, co-organizers of the Affect Task in 2013 and 2014, who also participated in the definition of the violent segments;

, who joined and led the organizers team in 2014;

The creation of this benchmark has also been supported, in part, by:

China’s National 973 Program (#2010CB327900)

China’s NSF Projects (#61201387 and #61228205)

Academy of Finland funding grants no. 255745 and 251170.

UEFISCDI SCOUTER (under grant no. 28DPST/30-08-2013).

EU FP7-ICT-2011-9: project no. 601166 ("PHENICX")

If you make use of the VSD dataset, or refer to its results, please use the following citations:

, C. Penet, M. Soleymani, G. Gravier. VSD, a public dataset for the detection of violent scenes in movies: design, annotation, analysis and evaluation. In 

, B. Ionescu, Y.G. Jiang, and C. Penet. Benchmarking Violent Scenes Detection in movies. In Proceedings of the 2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI), 2014.

M. Sjöberg, B. Ionescu, Y.G. Jiang, V.L. Quang, M. Schedl and 

. The MediaEval 2014 Affect Task: Violent Scenes Detection. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain (2014).

, C. Penet, G. Gravier and M. Soleymani. A benchmarking campaign for the multimodal detection of violent scenes in movies. 

In Proceedings of the 12th international conference on Computer Vision – Volume Part III (ECCV’12),

Andrea Fusiello, Vittorio Murino, and Rita Cucchiara (Eds), Col. Part III. Springer Verlag, Berlin.

Markus Schedl, Mats Sjöberg, Ionuţ Mironică, Bogdan Ionescu, Vu Lam Quang, Yu-Gang Jiang, 

. VSD2014: A dataset for violent scenes detection in Hollywood movies and web videos. CBMI 2015

The data is delivered in 3 different sub-packages:

Data used for the 2013 and before versions of the benchmark (old naming of annotations)

new naming of the annotations, web videos and features;

The ground truth was created from a collection of 32 movies of different genres (from extremely violent movies to non-violent movies). Due to copyright issues, these movies cannot be delivered. We therefore provide the entire movie list, together with the links to the DVDs used for the annotation on the Amazon web site.

In 2014, 86 short web videos downloaded from YouTube and normalised to a frame rate of 25 fps were also annotated.

For all these movies and videos, the ground truth consists of segments containing violence according to the following definition:

“scenes one would not let an 8 year old child see because they contain physical violence”. This is what is called the “subjective definition” in the following.

In addition to segments containing physical violence according to the above definition, annotations also include the following high-level concepts, for part of the development set only (18 movies). For the visual modality: presence of blood, fights, presence of fire, presence of guns, presence of cold arms, car chases and gory scenes. For the audio modality: presence of gunshots, explosions and screams. For the development set, we also include an additional definition of violence, the “objective definition”, which was used in previous versions of the task:

“physical violence or accident resulting in human injury or pain”.

Violent segments and high-level video concepts were annotated at frame level at 25 fps. Each segment or concept is therefore defined by its starting and ending frame numbers. Only segments that correspond to the targeted events were annotated and are therefore present in the ground-truth files.

Audio concepts are defined by their starting and ending times in seconds. Contrary to what was done for the video part of the annotation, all segments of the movie can be found in the ground-truth files, i.e. both those that correspond to the targeted events and segments with no event.
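Because video annotations use frame numbers and audio annotations use seconds, comparing the two requires a common time base. The following is a minimal sketch of that conversion, assuming frame numbering starts at 0 and that a frame number maps to the time `frame / 25`; neither assumption is stated in the dataset description, and the function names are hypothetical.

```python
FPS = 25  # frame rate stated for the video annotations


def frames_to_seconds(start_frame, end_frame, fps=FPS):
    """Convert a (start, end) frame pair to a (start, end) pair in seconds.

    Assumption: frame 0 corresponds to time 0.0 s.
    """
    return start_frame / fps, end_frame / fps


def overlap_seconds(interval_a, interval_b):
    """Return the length in seconds of the overlap of two (start, end) intervals."""
    start = max(interval_a[0], interval_b[0])
    end = min(interval_a[1], interval_b[1])
    return max(0.0, end - start)


# Example: a video segment given in frames and an audio segment given in seconds.
video_segment = frames_to_seconds(250, 500)
audio_segment = (15.0, 30.0)
shared = overlap_seconds(video_segment, audio_segment)
```

With these hypothetical values, the video segment covers 10 s to 20 s, so it shares 5 s with the audio segment.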

All segments and concepts – audio and video – may also have additional tags, describing the events, depending on their types.

All annotations are provided in text format, one file per concept (some meaningful suffixes were used), with the following format:

Starting_time ending_time additional_tags_if_any
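A file in this format can be read with a few lines of code. The sketch below assumes that fields are separated by whitespace and that each additional tag is a single whitespace-free token; the `Segment` type, the function names and the sample lines are hypothetical and not part of the dataset.

```python
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    start: float            # frame number (video) or seconds (audio)
    end: float
    tags: Tuple[str, ...]   # additional tags, possibly empty


def parse_annotation_line(line: str) -> Optional[Segment]:
    """Parse one 'start end [tags...]' line; return None for blank lines."""
    fields = line.split()
    if len(fields) < 2:
        return None
    return Segment(float(fields[0]), float(fields[1]), tuple(fields[2:]))


def parse_annotations(text: str) -> List[Segment]:
    """Parse the full content of one annotation file."""
    parsed = (parse_annotation_line(line) for line in text.splitlines())
    return [segment for segment in parsed if segment is not None]


# Hypothetical sample content, for illustration only.
sample = "10.0 12.5 gunshot\n\n12.5 20.0 (nothing)\n"
segments = parse_annotations(sample)
```

Start and end values are kept as floats so that the same parser handles both frame numbers and times in seconds.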

In 2014, standard audio and video features were also included for the movies and web videos used in the 2014 benchmark.


Two different annotations are provided, one for each of the two definitions of violence. For a given definition, each violent segment contains only one action according to this definition, whenever possible. Where different actions overlap, they are given as a single segment with the additional tag ‘multiple_action_scene’.


As soon as blood is visually present in the images, it is annotated. Additional tags representing the proportion of the screen covered with blood are added. These tags are chosen among the following values: unnoticeable, low, medium, high with the following meanings:

unnoticeable: there are some blood pixels, and their surface represents no more than 5% of the image

low: surface_of_blood_pixels is between 5% and 25%

medium: surface_of_blood_pixels is in [25%, 50%)
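The thresholds above can be written as a small lookup. This is only a sketch: the definition of the ‘high’ tag is not given in the text, so the code assumes it covers 50% and above, and it assumes that an image with no blood pixels receives no tag. The function name is hypothetical.

```python
def blood_tag(fraction):
    """Map the fraction of the image covered by blood (0.0 to 1.0) to a tag.

    Returns None when no blood is present (assumed: such frames are not annotated).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    if fraction == 0.0:
        return None
    if fraction <= 0.05:      # no more than 5% of the image
        return "unnoticeable"
    if fraction < 0.25:       # between 5% and 25%
        return "low"
    if fraction < 0.50:       # in [25%, 50%)
        return "medium"
    return "high"             # assumption: 50% and above


tags = [blood_tag(f) for f in (0.02, 0.10, 0.30, 0.80)]
```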


Different types of fights were annotated, resulting in different tags in the file:

small: for a small group of people (the number of people was not counted; it roughly corresponds to fewer than 10)

large: for a large group of people (> 10)

distant attack: no real fight, but somebody is shot or attacked at a distance (gunshot, arrow, car, etc.)

A fight may also be human against animal.


As soon as fire is visually present in the images, it is annotated. It could be a big fire as well as fire coming out of a gun while shooting. It could also be a candle or a cigarette lighter, or even a cigarette, or sparks. A space shuttle taking off also generates fire. Explosions are included. When the fire is not yellow or orange, an additional tag indicates its color. If too many extra colors are visible, a ‘multicolor’ tag is used.

Video concept – Presence of firearms (guns and assimilated)


When any type of gun or similar weapon is shown on screen, it is annotated. Guns with bayonets were annotated as guns whenever any part is visible, even if that part belongs to the bayonet.


Same as for firearms but for any kind of cold arms. Guns with bayonets were also annotated as cold arms, but only when the bayonet is visible.


Annotations of car chases indicate segments showing a car chase.


Annotations of gory scenes indicate graphic images of bloodletting and/or tissue damage, including horror or war representations. As this is also a subjective notion that is difficult to define, some additional segments showing really disgusting mutants or creatures were annotated. Additional tags describing the event/scene were added in this case.


Each gunshot was annotated as a single segment whenever possible, with tag ‘gunshot’ and the corresponding starting and ending times in seconds. Tag ‘multiple_actions’ was used when several events happen together. Tag ‘(nothing)’ corresponds to segments with no event. Cannon fire was also annotated, either as gunshots (e.g., in Pirates of the Caribbean) or, wherever possible, with tag ‘canon_fire’ (in Saving Private Ryan). The additional tag ‘multiple_actions_canon_fire’ was also used when appropriate. Tags ‘canon_fire’ and ‘multiple_actions_canon_fire’ mean that cannon fire can be heard but no gunshots, whereas tags ‘gunshot’ and ‘multiple_actions’ may indicate that cannon fire was possibly heard in addition to gunshots.


Same format as above, with tags ‘explosion’, ‘multiple_actions’ and ‘(nothing)’. Any kind of explosion was annotated, even magic explosions.


Same format as above, with tags ‘scream’, ‘multiple_actions’ and ‘(nothing)’. Anything from non-verbal screams to what we call ‘effort noise’ was annotated, as long as a human or a humanoid (e.g. a mutant in I Am Legend) is the origin of the noise. Effort noises were annotated using tags ‘scream_effort’ or ‘multiple_actions_scream_effort’. Animal screams were not annotated, nor were screams in which one can recognize words.
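Since the audio ground-truth files list every segment of a movie, including those tagged ‘(nothing)’, a typical first step is to keep only the segments that carry an event. The sketch below assumes segments are already available as `(start, end, tags)` tuples with times in seconds; the function names and sample values are hypothetical.

```python
NO_EVENT_TAG = "(nothing)"  # tag used for audio segments with no event


def event_segments(segments):
    """Keep only the segments that are not tagged as containing no event."""
    return [seg for seg in segments if NO_EVENT_TAG not in seg[2]]


def event_duration(segments):
    """Total duration in seconds of the segments that contain an event."""
    return sum(end - start for start, end, _tags in event_segments(segments))


# Hypothetical audio annotation covering 31 seconds of a movie.
audio = [
    (0.0, 10.0, ("(nothing)",)),
    (10.0, 12.5, ("gunshot",)),
    (12.5, 30.0, ("(nothing)",)),
    (30.0, 31.0, ("multiple_actions_canon_fire",)),
]
kept = event_segments(audio)
total = event_duration(audio)
```

The same filter is unnecessary for the video annotations, which only contain segments for the targeted events.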

In order to get the data, you are asked to supply your name and email address. You will receive instructions on how to download the dataset via this email address. We may store the data you supplied in order to contact you later about benchmark-related matters. The data will not be used in any other way.

To download the violent scene dataset, please send an email to 

 By doing this you irrevocably agree to any and all provisions of the license agreement on this page.

The scene selection you are about to download, in case you agree with these Terms of use, may not be suitable for children. Some of the scenes are taken from extremely violent movies.

In the following we will refer to the Violent Scenes Dataset as the VSD dataset.

The goal of the VSD dataset is to develop new techniques, technology, and algorithms for the automatic detection of violent scenes in movies.

The VSD dataset was produced in three steps leading to two different subsets: the 2012 subset and the 2013-2014 subset.

The 2012 subset consists of the annotations of:

the high level audio and video concepts and the violent scenes according to the objective definition for the following movies:


Léon, Reservoir Dogs, Armageddon, I Am Legend, Saving Private Ryan, Eragon, Harry Potter and the Order of the Phoenix, Billy Elliot, Pirates of the Caribbean – the Curse of the Black Pearl, The Sixth Sense, The Wicker Man, Midnight Express, Kill Bill, The Wizard of Oz, The Bourne Identity

the violent scenes according to the objective definition for the following movies:


Dead Poets Society, Fight Club, Independence Day

The 2013-2014 subset consists of the annotations of:

the high level audio and video concepts for the following movies:


Fantastic Four 1, Fargo, Forrest Gump, Legally Blonde, Pulp Fiction, The Godfather 1, The Pianist

the violent scenes according to the subjective definition for all the following movies:


Léon, Reservoir Dogs, Armageddon, I Am Legend, Saving Private Ryan, Eragon, Harry Potter and the Order of the Phoenix, Billy Elliot, Pirates of the Caribbean – the Curse of the Black Pearl, The Sixth Sense, The Wicker Man, Midnight Express, Kill Bill, The Wizard of Oz, The Bourne Identity, Dead Poets Society, Fight Club, Independence Day, Fantastic Four 1, Fargo, Forrest Gump, Legally Blonde, Pulp Fiction, The Godfather 1, The Pianist, 8 Mile, Braveheart, Desperado, Ghost in the Shell, Jumanji, Terminator 2, V for Vendetta.

the violent scenes according to the subjective definition for 86 short web videos.

The 2014 part of the subset also contains standard audio and video features.

InterDigital has copyright and all rights of authorship on the 2012 data, whereas the Fudan University and the Ho Chi Minh University of Science share the copyright and all rights of authorship on the 2013-2014 data. InterDigital is the principal distributor of the VSD dataset.

To advance the state of the art in violent scenes detection, the VSD dataset is made available to the research community for scientific research only. All other uses of the VSD dataset will be considered on a case-by-case basis. To receive a copy of the VSD dataset, the requestor must agree to observe all of these Terms of use.

The researcher(s) agrees to the following restrictions on the VSD dataset:

: Without prior written approval from InterDigital, the 2012 part of the VSD dataset, in whole or in part, shall not be further distributed, published, copied, or disseminated in any way or form whatsoever, whether for profit or not. For the avoidance of any doubt, this prohibition includes further distributing, copying or disseminating to a different facility or organizational unit in the requesting university, organization, or company.

2. Modification and Non-Commercial Use: 

Without prior written approval from InterDigital, the 2012 part of the VSD dataset, in whole or in part, may not be modified or used for commercial purposes. For commercial use of the dataset, a specific paying license may be negotiated, please 

. Modification is allowed for scientific research purposes only. It would be highly appreciated if the modified VSD dataset was shared with InterDigital, at this address: 

For the avoidance of doubt, commercial purposes include but are not limited to:

proving the efficiency of commercial systems,

training or testing of commercial systems,

using screenshots of data from the database in advertisements,

 In no case should the still frames or videos be used in any way that could directly or indirectly harm InterDigital, the Fudan University or the Ho Chi Minh University of Science. InterDigital, the Fudan University or the Ho Chi Minh University of Science permit publication (paper or web-based) of the data for scientific purposes only. Any other publication without scientific and academic value is strictly prohibited.

 All documents and papers that report on research that uses the VSD dataset must acknowledge the use of the dataset by including an appropriate citation to the following:

, C. Penet, M. Soleymani, G. Gravier. VSD, a public dataset for the detection of violent scenes in movies: design, annotation, analysis and evaluation. In 

, B. Ionescu, Y.G. Jiang, and C. Penet. Benchmarking Violent Scenes Detection in movies. In Proceedings of the 2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI), 2014.

M. Sjöberg, B. Ionescu, Y.G. Jiang, V.L. Quang, M. Schedl and 

. The MediaEval 2014 Affect Task: Violent Scenes Detection. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain (2014)

, C. Penet, G. Gravier and M. Soleymani. A benchmarking campaign for the multimodal detection of violent scenes in movies.

In Proceedings of the 12th international conference on Computer Vision – Volume Part III (ECCV’12),

 Andrea Fusiello, Vittorio Murino, and Rita Cucchiara (Eds), Col. Part III. Springer Verlag, Berlin.

. VSD2014: A dataset for violent scenes detection in Hollywood movies and web videos. CBMI 2015.

 THE PROVIDER OF THE DATA MAKES NO REPRESENTATIONS AND EXTENDS NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED. THERE ARE NO EXPRESS OR IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, OR THAT THE USE OF THE MATERIAL WILL NOT INFRINGE ANY PATENT, COPYRIGHT, TRADEMARK, OR OTHER PROPRIETARY RIGHTS.

The Principal Investigators can be contacted via 

Among the 31 movies used for the benchmark in 2014, 24 are dedicated to the training step and 7 to the test step. High-level concept annotations are only provided for the first 18 movies of the training set.
Please note that the movie Kill Bill was officially removed from the training set for the Violent Scenes Detection Task due to availability issues. However, we are still providing the annotations for this movie together with the VSD dataset.

Léon, Reservoir Dogs, Armageddon, I Am Legend, Saving Private Ryan, Eragon, Harry Potter and the Order of the Phoenix, Billy Elliot, Pirates of the Caribbean – the Curse of the Black Pearl, The Sixth Sense, The Wicker Man, Midnight Express, The Wizard of Oz, The Bourne Identity, Independence Day, Fight Club, Dead Poets Society, Fantastic Four 1, Fargo, Forrest Gump, Legally Blonde, Pulp Fiction, The Godfather 1, The Pianist

7 Hollywood movies: 8 Mile, Braveheart, Desperado, Ghost in the Shell, Jumanji, Terminator 2, V for Vendetta.

86 short web videos downloaded from YouTube

Due to copyright issues, we cannot deliver the movies, but only the corresponding annotations. We therefore provide the entire movie list, together with the links to the DVDs used for the annotation on the Amazon web site.

The 86 web videos, under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/), are included in the downloadable package.

LINKS REFERENCING THESE MOVIES TO THE AMAZON WEB SITE:

Harry Potter and the Order of the Phoenix

Pirates of the Caribbean – the Curse of the Black Pearl

ADDITIONAL INFORMATION FOR THE YOUTUBE VIDEOS:

This part of the test set contains 86 mp4 files downloaded from YouTube, and normalised to a frame rate of 25 using the 

Each movie file (within the zip files) is named according to the YouTube video id, so for example 0egEFZq2Y28.mp4 has the id "0egEFZq2Y28". The original file can be accessed from the URL

substituting [ID] with the YouTube video id of the video. It is possible that some videos have disappeared from YouTube since downloaded by the organisers.

The metadata provided by YouTube is included in a single zip file (xml-metadata.zip). There is one XML file for each video named according to the YouTube video id. From this you can extract e.g. the title, description, license (all are Creative Commons licences that allow redistribution), and original author.
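Because both the video files and the metadata files are named after the YouTube video id, the two can be matched by file name alone. The sketch below assumes the metadata files use an `.xml` extension (e.g. `0egEFZq2Y28.xml`); the directory names and function names are hypothetical, and the XML content itself is not parsed since its schema is not described here.

```python
from pathlib import PurePosixPath


def video_id(filename):
    """Return the YouTube video id, i.e. the file name without its extension."""
    return PurePosixPath(filename).stem


def pair_videos_with_metadata(video_files, metadata_files):
    """Map each video id to (video file, metadata file or None if missing)."""
    metadata_by_id = {video_id(path): path for path in metadata_files}
    return {
        video_id(path): (path, metadata_by_id.get(video_id(path)))
        for path in video_files
    }


# Hypothetical paths after unpacking the zip files.
pairs = pair_videos_with_metadata(
    ["videos/0egEFZq2Y28.mp4"],
    ["xml-metadata/0egEFZq2Y28.xml"],
)
```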