Automatic extraction of face tracks is a key component of systems that analyze people in audio-visual content such as TV programs and movies. Due to the lack of annotated content of this type, popular algorithms for extracting face tracks have not been fully assessed in the literature. To help fill this gap, we introduce a new dataset based on the full audio-visual person annotation of a feature film.

Thanks to this dataset, state-of-the-art tracking metrics, such as track purity, can now be used to evaluate the face tracks produced for, e.g., automatic character naming systems. The consistent labeling also makes the dataset a test-bed for algorithms that aim at clustering faces or face tracks in an unsupervised fashion. Finally, thanks to the availability of the corresponding audio annotation, the dataset can also be used, e.g., for the evaluation of speaker diarization methods and, more generally, for assessing multimodal people clustering or naming systems.

Acknowledgments

This work is supported by the AXES EU project.

 

DATA DESCRIPTION

This dataset is based on the movie “Hannah and Her Sisters” by Woody Allen, released in 1986 and available on DVD. The full movie (153,825 frames) has been manually annotated by a single annotator for several types of audio and visual information. The audio annotation indicates speech segments and the associated speaker identity (consistent with the face identities) and was performed using Audacity (http://audacity.sourceforge.net/). The visual annotation covers all shot boundaries and all identified face tracks within shots; it was produced with the VIPER-GT tool (http://viper-toolkit.sourceforge.net/).
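As an illustration, speech segments annotated with Audacity are commonly exported as tab-separated label tracks (start time, end time, label). The snippet below is a minimal sketch of how such an export could be loaded; the file name and the assumption that the dataset's speech annotation is distributed in this exact format are illustrative, not a statement about the official release.

```python
from dataclasses import dataclass

@dataclass
class SpeechSegment:
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds
    speaker: str   # speaker label, consistent with the face labels (e.g. "Hannah")

def load_audacity_labels(path):
    """Read an Audacity label-track export: one 'start<TAB>end<TAB>label' line per segment."""
    segments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            start, end, label = line.split("\t")
            segments.append(SpeechSegment(float(start), float(end), label))
    return segments

# Example usage (file name is hypothetical):
# segments = load_audacity_labels("hannah_speech_labels.txt")
# print(len(segments), "speech segments")
```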

The face ground-truth metadata contains a frame-by-frame description of all “sufficiently” visible faces, in the form of an upright rectangular bounding box and an identifier. The annotator was given the following instructions: all poses from frontal to profile are accepted; for a face to be annotated, its bounding box must be wider than 24 pixels (the image size being 996×560); the bounding box extends vertically from the middle of the chin to the middle of the forehead and, horizontally, from one ear to the other or from one ear to the tip of the nose, depending on the pose; finally, regarding occlusion, at least half of the face had to be visible.
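The sketch below shows one way a single frame-level face annotation could be represented and checked against the width guideline above. The field names and the helper are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

FRAME_WIDTH, FRAME_HEIGHT = 996, 560  # size of the annotated frames, in pixels
MIN_FACE_WIDTH = 24                   # minimum width of an annotated face box, in pixels

@dataclass
class FaceBox:
    frame: int   # frame index within the movie
    label: str   # person identity (e.g. "Hannah", "Girl1", "Crowd1")
    x: int       # left edge of the box, in pixels
    y: int       # top edge of the box, in pixels
    w: int       # box width, in pixels
    h: int       # box height, in pixels

    def meets_width_guideline(self) -> bool:
        """True if the box satisfies the 24-pixel minimum-width annotation rule."""
        return self.w > MIN_FACE_WIDTH
```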

Each bounding box is also manually tagged with the identity of the person. For 53 characters, the label is the character's name, such as “Hannah”, “Elliot” or “Lee”. Another 186 persons have been uniquely identified and tagged with generic labels such as “Girl1” or “Boy2”. Audio speaker segments were labeled consistently with these names. Finally, in crowded scenes, groups of secondary characters were annotated within collective bounding boxes labeled “Crowd1”, “Crowd2”, etc. In total, there are 254 distinct labels in the dataset.

Given the face and shot annotations, ground-truth face tracks (“GT-tracks” for short) are defined as follows: a face track is a maximally long sequence of face bounding boxes that are consecutive in time, share the same label and belong to the same shot. There are 2,002 such tracks spread over the 245 shots of the movie. The duration of GT-tracks ranges from 1 to 500 frames, with a mean of 99.1 frames. The number of tracks simultaneously present in a frame ranges from 0 to more than 10 in the numerous gathering scenes.
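This definition translates directly into a simple grouping procedure. The following is a minimal sketch of how GT-tracks could be rebuilt from frame-level annotations, assuming annotation objects with frame and label attributes (such as the FaceBox sketch above) and a shot_of helper mapping a frame index to its shot; both are illustrative assumptions, not part of the dataset's official tooling.

```python
from collections import defaultdict

def build_gt_tracks(boxes, shot_of):
    """Group frame-level face boxes into GT-tracks.

    boxes   : iterable of annotations with .frame and .label attributes
    shot_of : function mapping a frame index to its shot index
    Returns a list of tracks, each a list of boxes that are consecutive in time,
    share the same label and belong to the same shot.
    """
    # Bucket boxes by (shot, label): a track can never cross these boundaries.
    groups = defaultdict(list)
    for box in boxes:
        groups[(shot_of(box.frame), box.label)].append(box)

    tracks = []
    for group in groups.values():
        group.sort(key=lambda b: b.frame)
        current = [group[0]]
        for box in group[1:]:
            if box.frame == current[-1].frame + 1:
                current.append(box)      # still consecutive in time: extend the track
            else:
                tracks.append(current)   # temporal gap: close the track, start a new one
                current = [box]
        tracks.append(current)
    return tracks
```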

Statistics:

  • 153,833 frames, size 996×560
  • 245 shots
  • 202,178 face bounding boxes
  • 2,002 face tracks
  • 1,518 speech segments
  • 400 audio segments with non-speech human sounds (laughing, screaming, kissing, etc.)
  • 254 labels: 53 named characters, 186 identified unnamed characters, 15 crowds

 

CITING HANNAH DATASET

A. Ozerov, J.-R. Vigouroux, L. Chevallier and P. Pérez. On evaluating face tracks in movies. In Proc. Int. Conf. Image Proc. (ICIP), 2013.