Multimodal Technologies



Multimodal technologies refer to technologies that combine features extracted from different modalities (text, audio, image, etc.). This covers a wide range of component technologies:

  • Audiovisual Speech Recognition.
  • Audiovisual Person Identification.
  • Audiovisual Event Detection.
  • Audiovisual Object or Person Tracking.
  • Biometric Identification (using face, voice, fingerprints, iris, etc.).
  • Head Pose Estimation.
  • Gesture Recognition.
  • Multimodal Information Retrieval (e.g. Video Retrieval).
  • etc.
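
A pattern common to many of these component technologies is the fusion of evidence from several modalities. As a minimal illustrative sketch (the identities, scores, and weight below are hypothetical, not taken from any particular system), score-level "late" fusion for audiovisual person identification can look like this:

```python
# Minimal sketch of late (score-level) fusion for audiovisual person
# identification: each modality assigns a score to every enrolled
# identity, and the fused score is a weighted sum. All values are
# illustrative.

def late_fusion(audio_scores, visual_scores, audio_weight=0.5):
    """Fuse per-identity scores from two modalities by weighted sum."""
    fused = {}
    for identity in audio_scores:
        fused[identity] = (audio_weight * audio_scores[identity]
                           + (1.0 - audio_weight) * visual_scores[identity])
    return fused

# Hypothetical scores: audio favors "alice", video favors "bob".
audio = {"alice": 0.7, "bob": 0.2}
visual = {"alice": 0.4, "bob": 0.9}
fused = late_fusion(audio, visual, audio_weight=0.6)
best = max(fused, key=fused.get)  # identity with the highest fused score
```

Real systems differ mainly in how the weights are learned and in whether fusion happens at the feature, score, or decision level.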



There is no generic evaluation approach for such a wide and heterogeneous range of technologies. In some cases, the evaluation paradigm is essentially the same as for the equivalent mono-modal technology (e.g. traditional IR vs. multimodal IR). For very specific applications (e.g. 3D person tracking in a particular environment), ad hoc evaluation methodologies have to be defined before the start of the evaluation campaign.
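
To illustrate why traditional IR evaluation carries over to multimodal IR: standard ranked-retrieval metrics only look at the ranked list of result identifiers and the set of relevant items, so it makes no difference whether those items are text documents or video shots. A minimal sketch of average precision (AP), the per-query quantity behind mean average precision (the document IDs below are made up):

```python
# Average precision (AP) for one query: the mean of precision values
# at each rank where a relevant item appears, divided by the total
# number of relevant items. The metric is modality-agnostic: it sees
# only IDs, so it works identically for text pages or video shots.

def average_precision(ranked_ids, relevant_ids):
    """AP of one ranked result list against a set of relevant items."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Hypothetical example: relevant items d3 and d2 appear at ranks 1 and 4.
ap = average_precision(["d3", "d1", "d7", "d2"], {"d3", "d2"})  # 0.75
```

Averaging AP over all queries of a test collection gives MAP, the headline metric of many mono-modal and multimodal retrieval campaigns.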

A good example of this is the multimodal evaluation framework [1] set up for the CHIL project (Computers in the Human Interaction Loop). Dedicated test collections (with ground-truth annotations) and specific evaluation metrics were defined to address a large range of audiovisual technologies:

  • Acoustic Speaker Identification & Segmentation
  • Acoustic Emotion Recognition
  • Acoustic Event Detection
  • Speech Activity Detection
  • Face and Head Tracking
  • Visual Person Tracking
  • Visual Speaker Identification
  • Head Pose Estimation
  • Gesture Recognition
  • Multimodal Person Identification
  • Multimodal Person Tracking
  • etc.

For a complete overview of these tasks, see the book that was published at the end of the project [2].


Ongoing Projects and Evaluation Campaigns

  • AMIDA: an EU FP7 project, follow-up of the FP6 AMI project.
  • QUAERO: a Franco-German collaborative research and development program aimed at developing multimedia and multilingual indexing and management tools.
  • TRECVid: Digital Video Retrieval evaluations at NIST.
  • ImageCLEF Campaigns: Cross-language image retrieval track within the Cross Language Evaluation Forum (CLEF).

Past Projects

  • CHIL Project: Computers in the Human Interaction Loop (IST-2002-506909).
  • AMI Project: Augmented Multi-party Interaction.
  • VACE (Video Analysis and Content Extraction): a US program including evaluations of object detection and video tracking technologies.
  • SIMILAR: European Network of Excellence on human-machine interfaces.
  • HUMAINE: Human-Machine Interaction Network on Emotion (IST-2002-507422).
  • TECHNO-VISION: a French program that included several vision-related evaluation campaigns:
    • ARGOS: evaluation campaign for video content surveillance tools
    • EPEIRES: performance evaluation of symbol recognition methods
    • ETISEO: video surveillance
    • EVALECHOCARD: medical imaging
    • IMAGEVAL: image processing technology assessment
    • IV2: biometric iris and face identification
    • MESSIDOR: methods to evaluate segmentation and indexing techniques in retinal ophthalmology
    • RIMES: evaluation campaign for handwritten document processing
    • ROBIN: evaluation of object recognition algorithms
    • TOPVISION: submarine imaging systems.
  • BioSec Project on Biometrics and Security (IST-2002-001766)

Events

  • ImageCLEF 2010: cross-language image retrieval evaluation campaign.
  • MIR 2010: ACM SIGMM International Conference on Multimedia Information Retrieval.
  • CBMI 2010: 8th International Workshop on Content-Based Multimedia Indexing.
  • CIVR 2010: ACM International Conference on Image and Video Retrieval.

Past Events

  • CLEAR (Classification of Events, Activities and Relationships) evaluations:
    • CLEAR’07 included the following tasks: Person Tracking (2D and 3D; audio-only, video-only, multimodal), Face Tracking, Vehicle Tracking, Person Identification (audio-only, video-only, multimodal), Head Pose Estimation (2D, 3D), Acoustic Event Detection and Classification.
    • CLEAR’06 included the following tasks: Person Tracking (2D and 3D; audio-only, video-only, multimodal), Face Tracking, Head Pose Estimation (2D, 3D), Person Identification (audio-only, video-only, multimodal), Acoustic Event Detection and Classification.
  • VideoRec’08: International Workshop on Video Processing and Recognition.
  • VideoRec’07: First International Workshop on Video Processing and Recognition.
  • VP4S-06: First International Workshop on Video Processing for Security.
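
The CLEAR tracking tasks listed above were scored with the CLEAR MOT metrics, notably MOTA (Multiple Object Tracking Accuracy), which aggregates misses, false positives, and identity mismatches over all frames, normalized by the number of ground-truth objects. A minimal sketch with illustrative (made-up) error counts:

```python
# MOTA = 1 - (misses + false positives + mismatches) / ground-truth objects,
# with all counts summed over every frame of the test sequence.
# A perfect tracker scores 1.0; heavy error rates can push MOTA below 0.

def mota(misses, false_positives, mismatches, ground_truth_objects):
    """Multiple Object Tracking Accuracy from aggregate error counts."""
    errors = misses + false_positives + mismatches
    return 1.0 - errors / ground_truth_objects

# Illustrative counts for a short sequence with 200 ground-truth objects.
score = mota(misses=12, false_positives=8, mismatches=2,
             ground_truth_objects=200)  # 0.89
```

Its companion metric MOTP measures localization precision of the matched objects; the two are reported together in the CLEAR publications.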

Language Resources

  • IAPR TC-12 is a free test collection for image retrieval containing still natural images with text captions in up to three different languages (English, German and Spanish).

References

  • Waibel A. and Stiefelhagen R. (Eds.) (2009). Computers in the Human Interaction Loop. Springer, London.
  • Moreau N., Mostefa D., Stiefelhagen R., Burger S. and Choukri K. (2008). "Data Collection for the CHIL CLEAR 2007 Evaluation Campaign". In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), May 2008, Marrakech, Morocco.
  • Mostefa D., Moreau N., Choukri K., Potamianos G., Chu S., Tyagi A., Casas J., Turmo J., Cristoforetti L., Tobia F., Pnevmatikakis A., Mylonakis V., Talantzis F., Burger S., Stiefelhagen R., Bernardin K. and Rochet C. (2007). "The CHIL Audiovisual Corpus for Lecture and Meeting Analysis inside Smart Rooms". In Language Resources and Evaluation, Vol. 41, No. 3, December 2007, pp. 389-407.
  • Stiefelhagen R., Bernardin K., Bowers R., Garofolo J., Mostefa D. and Soundararajan P. (2007). "The CLEAR 2006 Evaluation". In Multimodal Technologies for Perception of Humans, Lecture Notes in Computer Science, Vol. 4122, Springer, pp. 1-44.