Automatic extraction of face tracks is a key component of systems that analyze people in audio-visual content such as TV programs and movies. Due to the lack of annotated content of this type, popular algorithms for extracting face tracks have not been fully assessed in the literature. To help fill this gap, we introduce a new dataset, based on the full audio-visual person annotation of a feature film.

Thanks to this dataset, state-of-the-art tracking metrics, such as track purity, can now be exploited to evaluate face tracks used by, e.g., automatic character naming systems. Also, thanks to consistent labeling, algorithms that aim at clustering faces or face tracks in an unsupervised fashion can benefit from this test-bed. Finally, thanks to the availability of the corresponding audio annotation, the dataset can also be used, e.g., for evaluating speaker diarization methods, and more generally for assessing multimodal people clustering or naming systems.

To obtain the data, you are asked to supply your name and email address. You will receive instructions on how to download the dataset at this email address. We may store the data you supplied in order to contact you later about benchmark-related matters. The data will not be used in any other way.

To download the Hannah dataset, please send an email to 

This work is supported by the AXES EU project.

A. Ozerov, J.-R. Vigouroux, L. Chevallier and P. Pérez. On evaluating face tracks in movies. In Proc. Int. Conf. Image Proc. (ICIP), 2013.

This dataset is based on the movie “Hannah and her sisters” by Woody Allen, released in 1986 and available on DVD. The full movie (153,825 frames) has been manually annotated by a single annotator for several types of audio and visual information. Audio annotation indicates speech segments and associated speaker identification (consistent with face identification) and was performed using Audacity. Visual annotation concerns all shot boundaries and all identified face tracks within shots. This visual annotation work was carried out using the VIPER-GT platform.

The face ground-truth metadata contains a frame-by-frame description of all “sufficiently” visible faces in the form of a horizontal, rectangular bounding box and an identifier. The annotator was given the following instructions: all poses from frontal to profile are accepted; for a face to be annotated, the corresponding bounding box should be wider than 24 pixels (the image size being 996×560); the bounding box extends vertically from the middle of the chin to the middle of the forehead and, horizontally, from one ear to the other or from one ear to the tip of the nose, depending on the pose; finally, regarding occlusion, at least half of the face must be visible.
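The two quantitative criteria above (minimum width and minimum visibility) can be expressed as a simple filter. The following is only an illustrative sketch: the class `CandidateFace`, the function `is_annotated`, and the numeric `visible_fraction` field are hypothetical names introduced here, since in the dataset the visibility decision was made by the human annotator rather than computed.

```python
from dataclasses import dataclass

MIN_BOX_WIDTH = 24          # pixels; a box must be strictly wider than this
MIN_VISIBLE_FRACTION = 0.5  # at least half of the face must be visible


@dataclass(frozen=True)
class CandidateFace:
    """Hypothetical record for a face considered for annotation."""
    width: int               # bounding-box width in pixels (frames are 996x560)
    visible_fraction: float  # assumed estimate of the visible share of the face


def is_annotated(face: CandidateFace) -> bool:
    """Return True if the face satisfies the width and occlusion criteria."""
    return (face.width > MIN_BOX_WIDTH
            and face.visible_fraction >= MIN_VISIBLE_FRACTION)
```

For instance, a 24-pixel-wide box is rejected because the width must exceed 24 pixels, while a 25-pixel-wide box with exactly half of the face visible is accepted.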

Each bounding box is also manually tagged according to the identity of the person. For 53 characters, the label is the character's name, such as “Hannah”, “Elliot” and “Lee”. A further 186 persons have been uniquely identified and tagged with labels such as “Girl1” or “Boy2”. Audio speaker segments were consistently labeled with the corresponding names. Finally, in crowded scenes, groups of secondary characters were annotated within collective bounding boxes and labeled as “Crowd1”, “Crowd2”, etc. There is a total of 254 distinct labels in the dataset.

Given face and shot annotations, ground-truth face tracks (“GT-tracks” for short) are defined as follows: a face track is a maximally long sequence of face bounding boxes that are consecutive in time, share the same label and belong to the same shot. There are 2,002 such tracks spread over the 245 shots of the movie. Duration of GT-tracks ranges from 1 to 500 frames, with a mean of 99.1 frames. The number of tracks simultaneously appearing in a frame ranges from 0 to 10 or more in the numerous gathering scenes.
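The GT-track definition above can be sketched in a few lines of code. This is a minimal illustration, not the tooling used to build the dataset: the `FaceBox` record and the `build_gt_tracks` function are hypothetical names, the actual annotation file format is not described here, and the sketch assumes at most one box per label in any given frame.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass(frozen=True)
class FaceBox:
    """Hypothetical per-frame annotation record."""
    frame: int  # frame index in the movie
    shot: int   # index of the shot containing the frame
    label: str  # identity label, e.g. "Hannah" or "Girl1"


def build_gt_tracks(boxes: List[FaceBox]) -> List[List[FaceBox]]:
    """Group boxes into maximal runs that are consecutive in time
    and share the same label and the same shot."""
    tracks: List[List[FaceBox]] = []
    ordered = sorted(boxes, key=lambda b: (b.label, b.shot, b.frame))
    for _, group in groupby(ordered, key=lambda b: (b.label, b.shot)):
        current: List[FaceBox] = []
        for box in group:
            # A gap in frame numbers closes the current track.
            if current and box.frame != current[-1].frame + 1:
                tracks.append(current)
                current = []
            current.append(box)
        if current:
            tracks.append(current)
    return tracks
```

Grouping by label and shot first means that a face visible on both sides of a shot boundary yields two separate tracks, as the definition requires, even when the frame numbers are consecutive.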

400 audio segments with non-speech human sounds (laughing, screaming, kissing, etc.)

254 labels: 53 named characters, 186 identified un-named characters, 15 crowds


InterDigital Hannah Dataset Release Agreement

The goal of the Hannah dataset is to contribute to the development and assessment of new techniques, technology, and algorithms for the detection, tracking and recognition of persons in audio-visual content. InterDigital has copyright and all rights of authorship on the dataset and is the principal distributor of the Hannah dataset.

To advance the state of the art in person detection, tracking and recognition, the Hannah dataset is made available to the research community for scientific research only. All other uses of the Hannah dataset will be considered on a case-by-case basis. To receive a copy of the Hannah dataset, the requestor must agree to observe all of these terms of use.

The researcher(s) agrees to the following restrictions on the Hannah dataset:

Without prior written approval from InterDigital, the Hannah dataset, in whole or in part, shall not be further distributed, published, copied, or disseminated in any way or form whatsoever, whether for profit or not. For the avoidance of any doubt, this prohibition includes further distributing, copying or disseminating to a different facility or organizational unit in the requesting university, organization, or company.

Without prior written approval from InterDigital, the Hannah dataset, in whole or in part, may not be modified or used for commercial purposes. Modification is allowed for scientific research purposes only. It would be highly appreciated if the modified Hannah dataset were shared with InterDigital, at this address: 

In no case should the still frames or videos be used in any way that could directly or indirectly harm InterDigital. InterDigital permits publication (paper or web-based) of the annotation data for scientific purposes only. Any other publication without scientific and academic value is strictly prohibited.

All documents and papers that report on research that uses the Hannah dataset must acknowledge the use of the dataset by including an appropriate citation to the following:

THE PROVIDER OF THE DATA MAKES NO REPRESENTATIONS AND EXTENDS NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED. THERE ARE NO EXPRESS OR IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, OR THAT THE USE OF THE MATERIAL WILL NOT INFRINGE ANY PATENT, COPYRIGHT, TRADEMARK, OR OTHER PROPRIETARY RIGHTS.

The Principal Investigators can be contacted via