Multimodal technologies combine features extracted from different modalities (text, audio, image, etc.). The term covers a wide range of component technologies:
- Audiovisual Speech Recognition.
- Audiovisual Person Identification.
- Audiovisual Event Detection.
- Audiovisual Object or Person Tracking.
- Biometric Identification (using face, voice, fingerprints, iris, etc.).
- Head Pose Estimation.
- Gesture Recognition.
- Multimodal Information Retrieval (e.g. Video Retrieval).
There is no generic evaluation approach for such a wide and heterogeneous range of technologies. In some cases, the evaluation paradigm is essentially the same as for the equivalent mono-modal technology (e.g. traditional IR vs. multimodal IR). For very specific applications (e.g. 3D person tracking in a particular environment), ad hoc evaluation methodologies have to be defined before the start of the evaluation campaign.
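For instance, multimodal retrieval runs are typically scored with the same ranked-retrieval measures as text IR, such as average precision and its mean over topics (MAP). A minimal sketch, with illustrative function and variable names:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against a set of relevant items."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at the rank of each relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over topics; `runs` maps topic id -> (ranked result list, relevant set)."""
    scores = [average_precision(ranking, relevant) for ranking, relevant in runs.values()]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `average_precision(["a", "b", "c"], {"a", "c"})` scores the two relevant documents at ranks 1 and 3, giving (1/1 + 2/3) / 2 ≈ 0.83.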
A good example of this is the multimodal evaluation framework set up for the CHIL project (Computers in the Human Interaction Loop). Dedicated test collections (with ground-truth annotations) and task-specific evaluation metrics were defined to address a wide range of audio-visual technologies:
- Acoustic Speaker Identification and Segmentation
- Acoustic Emotion Recognition
- Acoustic Event Detection
- Speech Activity Detection
- Face and Head Tracking
- Visual Person Tracking
- Visual Speaker Identification
- Head Pose Estimation
- Gesture Recognition
- Multimodal Person Identification
- Multimodal Person Tracking
For a complete overview of these tasks, see the book published at the end of the project (Waibel and Stiefelhagen, 2009).
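The CLEAR tracking tasks derived from CHIL, for example, were scored with the multiple-object tracking accuracy metric (MOTA), which accumulates misses, false positives and identity switches over all frames. A minimal sketch, assuming the matching of tracker hypotheses to ground-truth objects has already been done per frame (data layout here is illustrative):

```python
def mota(frames):
    """MOTA = 1 - (misses + false positives + id switches) / total ground-truth objects.

    `frames` is a list of per-frame error counts:
    (num_ground_truth, misses, false_positives, id_switches).
    """
    total_gt = sum(f[0] for f in frames)
    errors = sum(f[1] + f[2] + f[3] for f in frames)
    return 1.0 - errors / total_gt if total_gt else 0.0
```

Over two frames with 10 ground-truth objects each, one miss, one false positive and one identity switch, `mota([(10, 1, 0, 0), (10, 0, 1, 1)])` yields 1 - 3/20 = 0.85.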
A number of related projects and programs have addressed multimodal technologies:
- AMIDA: an EU FP7 project, follow-up of the FP6 AMI project.
- QUAERO: a Franco-German collaborative research and development program aimed at developing multimedia and multilingual indexing and management tools.
- TRECVid: Digital Video Retrieval evaluations at NIST.
- ImageCLEF Campaigns: Cross-language image retrieval track within the Cross Language Evaluation Forum (CLEF).
- CHIL Project: Computers in the Human Interaction Loop (IST-2002-506909).
- AMI Project: Augmented Multi-party Interaction.
- VACE (Video Analysis and Content Extraction): a US program including evaluations of object detection and video tracking technologies.
- SIMILAR: European Network of Excellence on human-machine interfaces.
- HUMAINE: Human-Machine Interaction Network on Emotion (IST-2002-507422).
- TECHNO-VISION: a French program that included several vision-related evaluation campaigns:
  - ARGOS: evaluation campaign for video content surveillance tools
  - EPEIRES: performance evaluation of symbol recognition methods
  - ETISEO: video surveillance
  - EVALECHOCARD: medical imaging
  - IMAGEVAL: image processing technology assessment
  - IV2: biometric iris and face identification
  - MESSIDOR: methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology
  - RIMES: evaluation campaign for handwritten document processing
  - ROBIN: evaluation of object recognition algorithms
  - TOPVISION: submarine imaging systems
- BioSecure Network of Excellence (IST-2002-507634)
- BioSec Project on Biometrics and Security (IST-2002-001766)
Related evaluation campaigns and conferences include:
- ImageCLEF 2010: 2010 cross-language image retrieval evaluation campaign.
- MIR 2010: ACM SIGMM International Conference on Multimedia Information Retrieval.
- CBMI’2010: 8th International Workshop on Content-Based Multimedia Indexing.
- CIVR 2010: ACM International Conference on Image and Video Retrieval.
- CLEAR (Classification of Events, Activities and Relationships) evaluations:
  - CLEAR’07 included the following tasks: Person Tracking (2D and 3D, audio-only, video-only, multimodal), Face Tracking, Vehicle Tracking, Person Identification (audio-only, video-only, multimodal), Head Pose Estimation (2D, 3D), Acoustic Event Detection and Classification.
  - CLEAR’06 included the following tasks: Person Tracking (2D and 3D, audio-only, video-only, multimodal), Face Tracking, Head Pose Estimation (2D, 3D), Person Identification (audio-only, video-only, multimodal), Acoustic Event Detection and Classification.
- Past ImageCLEF campaigns: ImageCLEF 2009, ImageCLEF 2008, ImageCLEF 2007, ImageCLEF 2006, ImageCLEF 2005, ImageCLEF 2004, ImageCLEF 2003
- Past TRECVID campaigns: TRECVID 2009, TRECVID 2008, TRECVID 2007, TRECVID 2006, TRECVID 2004, TRECVID 2003, TREC-2002 Video Track, TREC-2001 Video Track
- MLMI: Joint Workshops on Machine Learning and Multimodal Interaction: MLMI’08, MLMI’07, MLMI’06, MLMI’05, MLMI’04
- Face and Gesture Recognition (FGR) workshops: FGR2008, FGR2006, FGR2004, FGR2002, FGR2000, FGR1998, FGR1996, FGR1995
- VideoRec’08: International Workshop on Video Processing and Recognition.
- VideoRec’07: First International Workshop on Video Processing and Recognition.
- VP4S-06: First International Workshop on Video Processing for Security.
- PETS (Performance Evaluation of Tracking and Surveillance) workshops:
  - PETS’2006: surveillance of public spaces, detection of left-luggage events
  - PETS’2005: challenging detection/tracking scenes on water
  - PETS’2004: people tracking
  - PETS’2003: outdoor people tracking (football data)
  - PETS’2002: indoor people tracking (and counting) and hand posture classification
  - PETS’2001: outdoor people and vehicle tracking
  - PETS’2000: outdoor people and vehicle tracking
- CHIL Evaluation Packages (resulting from the CLEAR evaluation campaigns) are available from ELRA’s catalogue.
- TRECVID test collections are available from the LDC catalogue.
- IAPR TC-12 is a free test collection for image retrieval containing still natural images with text captions in up to three different languages (English, German and Spanish).
- Computers in the Human Interaction Loop, Alexander Waibel and Rainer Stiefelhagen (Ed.), Springer London, 2009.
- Moreau N., Mostefa D., Stiefelhagen R., Burger S. and Choukri K. (2008). "Data Collection for the CHIL CLEAR 2007 Evaluation Campaign". In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC08), May 2008, Marrakech, Morocco.
- Mostefa D., Moreau N., Choukri K., Potamianos G., Chu S., Tyagi A., Casas J., Turmo J., Cristoforetti L., Tobia F., Pnevmatikakis A., Mylonakis V., Talantzis F., Burger S., Stiefelhagen R., Bernardin K. and Rochet C. (2007). "The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms". In Language Resources and Evaluation, Vol. 41, No. 3, 16 December 2007, pp. 389-407.
- Stiefelhagen R., Bernardin K., Bowers R., Garofolo J., Mostefa D. and Soundararajan P. (2007). "The CLEAR 2006 Evaluation". In Multimodal Technologies for Perception of Humans, Lecture Notes in Computer Science, Volume 4122/2007, pp. 1-44, 2007.