This dataset is based on the movie “Hannah and Her Sisters” by Woody Allen, released in 1986 and available on DVD. The full movie (153,825 frames) was manually annotated by a single annotator for several types of audio and visual information. The audio annotation marks speech segments and identifies the speaker of each one (consistently with the face identification); it was performed using Audacity (http://audacity.sourceforge.net/). The visual annotation covers all shot boundaries and all identified face tracks within shots; it was performed using the ViPER-GT platform (http://viper-toolkit.sourceforge.net/).
The face ground-truth metadata contains a frame-by-frame description of all “sufficiently” visible faces, each given as a horizontal rectangular bounding box and an identifier. The annotator was given the following instructions: all poses from frontal to profile are accepted; for a face to be annotated, the corresponding bounding box must be wider than 24 pixels (the image size being 996×560); the bounding box extends vertically from the middle of the chin to the middle of the forehead and horizontally from one ear to the other, or from one ear to the tip of the nose, depending on the pose; finally, regarding occlusion, at least half of the face must be visible.
Each bounding box is also manually tagged with the identity of the person. For 53 characters, the label is the character's name, such as “Hannah”, “Elliot” and “Lee”. A further 186 unnamed persons have been uniquely identified and tagged with labels such as “Girl1” or “Boy2”. Audio speaker segments are labeled consistently with the corresponding names. Finally, in crowded scenes, groups of secondary characters are annotated within collective bounding boxes and labeled “Crowd1”, “Crowd2”, etc. In total, the dataset contains 254 distinct labels.
Given the face and shot annotations, ground-truth face tracks (“GT-tracks” for short) are defined as follows: a face track is a maximally long sequence of face bounding boxes that are consecutive in time, share the same label and belong to the same shot. There are 2,002 such tracks spread over the 245 shots of the movie. The duration of GT-tracks ranges from 1 to 500 frames, with a mean of 99.1 frames. The number of tracks appearing simultaneously in a frame ranges from 0 to 10 or more, with the highest counts occurring in the numerous gathering scenes.
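The track definition above can be illustrated with a minimal Python sketch. It is not the tooling used to build the dataset: the `FaceBox` record, its field names and the `build_gt_tracks` function are hypothetical, and the sketch assumes that each annotated box is available as a (frame, shot, label) triple and that a given label appears at most once per frame.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class FaceBox:
    """Hypothetical record for one annotated face box (geometry omitted)."""
    frame: int   # frame index in the movie
    shot: int    # index of the shot containing the frame
    label: str   # identity label, e.g. "Hannah" or "Girl1"


def build_gt_tracks(boxes):
    """Group boxes into maximal runs of consecutive frames that share
    the same shot and the same label."""
    tracks = []
    # Sort so that boxes of the same (shot, label) are adjacent and in time order.
    ordered = sorted(boxes, key=lambda b: (b.shot, b.label, b.frame))
    for _, group in groupby(ordered, key=lambda b: (b.shot, b.label)):
        current = []
        for box in group:
            # A gap in the frame numbering ends the current track.
            if current and box.frame != current[-1].frame + 1:
                tracks.append(current)
                current = []
            current.append(box)
        if current:
            tracks.append(current)
    return tracks


# Toy input: "Hannah" is missing at frame 12, so her boxes in shot 1
# split into two tracks; the box in shot 2 starts a separate track.
demo_boxes = [
    FaceBox(10, 1, "Hannah"),
    FaceBox(11, 1, "Hannah"),
    FaceBox(13, 1, "Hannah"),
    FaceBox(11, 1, "Lee"),
    FaceBox(14, 2, "Hannah"),
]
demo_tracks = build_gt_tracks(demo_boxes)
print([(t[0].label, t[0].frame, len(t)) for t in demo_tracks])
```

Under these assumptions, the duration of a track is simply the length of its list, and the number of tracks visible in a frame is the number of tracks whose frame range contains it.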
- 153,833 frames, size 996×560
- 245 shots
- 202,178 face bounding boxes
- 2,002 face tracks
- 1,518 speech segments
- 400 audio segments with non-speech human sounds (laughing, screaming, kissing, etc.)
- 254 labels: 53 named characters, 186 identified unnamed characters, 15 crowds