Multimodal Technologies



Multimodal technologies refer to technologies that combine features extracted from different modalities (text, audio, image, etc.). This covers a wide range of component technologies:

  • Audiovisual Speech Recognition.
  • Audiovisual Person Identification.
  • Audiovisual Event Detection.
  • Audiovisual Object or Person Tracking.
  • Biometric Identification (using face, voice, fingerprints, iris, etc.).
  • Head Pose Estimation.
  • Gesture Recognition.
  • Multimodal Information Retrieval (e.g. Video Retrieval).
  • etc.
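
A pattern common to many of these component technologies is the fusion of evidence from several modalities. As a minimal illustrative sketch (the identities, scores, and weight below are hypothetical, not taken from any particular system), score-level "late" fusion for audiovisual person identification can look like this:

```python
# Minimal sketch of late (score-level) fusion for audiovisual person
# identification: each modality assigns a score to every enrolled
# identity, and the fused score is a weighted sum. All values are
# illustrative.

def late_fusion(audio_scores, visual_scores, audio_weight=0.5):
    """Fuse per-identity scores from two modalities by weighted sum."""
    fused = {}
    for identity in audio_scores:
        fused[identity] = (audio_weight * audio_scores[identity]
                           + (1.0 - audio_weight) * visual_scores[identity])
    return fused

# Hypothetical scores: audio favors "alice", video favors "bob".
audio = {"alice": 0.7, "bob": 0.2}
visual = {"alice": 0.4, "bob": 0.9}
fused = late_fusion(audio, visual, audio_weight=0.6)
best = max(fused, key=fused.get)  # identity with the highest fused score
```

Real systems differ mainly in how the weights are learned and in whether fusion happens at the feature, score, or decision level.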



There is no generic evaluation approach for such a wide and heterogeneous range of technologies. In some cases, the evaluation paradigm is essentially the same as for the equivalent mono-modal technology (e.g. traditional IR vs. multimodal IR). For very specific applications (e.g. 3D person tracking in a particular environment), ad hoc evaluation methodologies have to be defined before the start of the evaluation campaign.
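
To illustrate why traditional IR evaluation carries over to multimodal IR: standard ranked-retrieval metrics only look at the ranked list of result identifiers and the set of relevant items, so it makes no difference whether those items are text documents or video shots. A minimal sketch of average precision (AP), the per-query quantity behind mean average precision (the document IDs below are made up):

```python
# Average precision (AP) for one query: the mean of precision values
# at each rank where a relevant item appears, divided by the total
# number of relevant items. The metric is modality-agnostic: it sees
# only IDs, so it works identically for text pages or video shots.

def average_precision(ranked_ids, relevant_ids):
    """AP of one ranked result list against a set of relevant items."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Hypothetical example: relevant items d3 and d2 appear at ranks 1 and 4.
ap = average_precision(["d3", "d1", "d7", "d2"], {"d3", "d2"})  # 0.75
```

Averaging AP over all queries of a test collection gives MAP, the headline metric of many mono-modal and multimodal retrieval campaigns.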

A good example of this is the multimodal evaluation framework [1] set up for the CHIL project (Computers in the Human Interaction Loop). Dedicated test collections (with ground-truth annotations) and specific evaluation metrics were defined to address a large range of audiovisual technologies:

  • Acoustic Speaker Identification & Segmentation
  • Acoustic Emotion Recognition
  • Acoustic Event Detection
  • Speech Activity Detection
  • Face and Head Tracking
  • Visual Person Tracking
  • Visual Speaker Identification
  • Head Pose Estimation
  • Gesture Recognition
  • Multimodal Person Identification
  • Multimodal Person Tracking
  • etc.

For a complete overview of these tasks, see the book that was published at the end of the project [2].


Ongoing Projects and Evaluation Campaigns

  • AMIDA: an EU FP7 project, follow-up of the FP6 AMI project.
  • QUAERO: a Franco-German collaborative research and development program aimed at developing multimedia and multilingual indexing and management tools.
  • TRECVid: Digital Video Retrieval evaluations at NIST.
  • ImageCLEF Campaigns: Cross-language image retrieval track within the Cross Language Evaluation Forum (CLEF).

Past Projects

  • CHIL Project: Computers in the Human Interaction Loop (IST-2002-506909).
  • AMI Project: Augmented Multi-party Interaction.
  • VACE (Video Analysis and Content Extraction): a US program including evaluations of object detection and video tracking technologies.
  • SIMILAR: European Network of Excellence on human-machine interfaces.
  • HUMAINE: Human-Machine Interaction Network on Emotion (IST-2002-507422).
  • TECHNO-VISION: a French program that included several vision-related evaluation campaigns:
    • ARGOS: evaluation campaign for video content surveillance tools
    • EPEIRES: performance evaluation of symbol recognition methods
    • ETISEO: video surveillance
    • EVALECHOCARD: medical imaging
    • IMAGEVAL: image processing technology assessment
    • IV2: biometric iris and face identification
    • MESSIDOR: methods to evaluate segmentation and indexing techniques in retinal ophthalmology
    • RIMES: evaluation campaign for handwritten document processing
    • ROBIN: evaluation of object recognition algorithms
    • TOPVISION: submarine imaging systems.
  • BioSec Project on Biometrics and Security (IST-2002-001766)

Events

  • ImageCLEF 2010: cross-language image retrieval evaluation campaign.
  • MIR 2010: ACM SIGMM International Conference on Multimedia Information Retrieval.
  • CBMI 2010: 8th International Workshop on Content-Based Multimedia Indexing.
  • CIVR 2010: ACM International Conference on Image and Video Retrieval.

Past Events

  • CLEAR (Classification of Events, Activities and Relationships) evaluations:
    • CLEAR’07 included the following tasks: Person Tracking (2D and 3D; audio-only, video-only, multimodal), Face Tracking, Vehicle Tracking, Person Identification (audio-only, video-only, multimodal), Head Pose Estimation (2D, 3D), Acoustic Event Detection and Classification.
    • CLEAR’06 included the following tasks: Person Tracking (2D and 3D; audio-only, video-only, multimodal), Face Tracking, Head Pose Estimation (2D, 3D), Person Identification (audio-only, video-only, multimodal), Acoustic Event Detection and Classification.
  • VideoRec’08: International Workshop on Video Processing and Recognition.
  • VideoRec’07: First International Workshop on Video Processing and Recognition.
  • VP4S-06: First International Workshop on Video Processing for Security.
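
The CLEAR tracking tasks listed above were scored with the CLEAR MOT metrics, notably MOTA (Multiple Object Tracking Accuracy), which aggregates misses, false positives, and identity mismatches over all frames, normalized by the number of ground-truth objects. A minimal sketch with illustrative (made-up) error counts:

```python
# MOTA = 1 - (misses + false positives + mismatches) / ground-truth objects,
# with all counts summed over every frame of the test sequence.
# A perfect tracker scores 1.0; heavy error rates can push MOTA below 0.

def mota(misses, false_positives, mismatches, ground_truth_objects):
    """Multiple Object Tracking Accuracy from aggregate error counts."""
    errors = misses + false_positives + mismatches
    return 1.0 - errors / ground_truth_objects

# Illustrative counts for a short sequence with 200 ground-truth objects.
score = mota(misses=12, false_positives=8, mismatches=2,
             ground_truth_objects=200)  # 0.89
```

Its companion metric MOTP measures localization precision of the matched objects; the two are reported together in the CLEAR publications.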

Language Resources

  • IAPR TC-12 is a free test collection for image retrieval containing still natural images with text captions in up to three different languages (English, German and Spanish).

References

  • Waibel A. and Stiefelhagen R. (Eds.) (2009). Computers in the Human Interaction Loop. Springer, London.
  • Moreau N., Mostefa D., Stiefelhagen R., Burger S. and Choukri K. (2008). "Data Collection for the CHIL CLEAR 2007 Evaluation Campaign". In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), May 2008, Marrakech, Morocco.
  • Mostefa D., Moreau N., Choukri K., Potamianos G., Chu S., Tyagi A., Casas J., Turmo J., Cristoforetti L., Tobia F., Pnevmatikakis A., Mylonakis V., Talantzis F., Burger S., Stiefelhagen R., Bernardin K. and Rochet C. (2007). "The CHIL Audiovisual Corpus for Lecture and Meeting Analysis inside Smart Rooms". In Language Resources and Evaluation, Vol. 41, No. 3, December 2007, pp. 389-407.
  • Stiefelhagen R., Bernardin K., Bowers R., Garofolo J., Mostefa D. and Soundararajan P. (2007). "The CLEAR 2006 Evaluation". In Multimodal Technologies for Perception of Humans, Lecture Notes in Computer Science, Vol. 4122, Springer, pp. 1-44.