Multilingual Text Alignment

Description

Multilingual text alignment consists of identifying correspondences between text units (words, sentences, paragraphs, etc.) in parallel texts.

The main approach to alignment evaluation is to compare a system-computed alignment with a manually produced reference alignment, usually called a gold standard. Different tasks have been defined in previous evaluation exercises such as Blinker, ARCADE, and the HLT-NAACL 2003 and ACL 2005 workshops.

Measures

Alignment evaluations have generally used traditional information retrieval (IR) measures:
Precision
Recall
F-measure
Alignment Error Rate (AER), derived from the F-measure (Och and Ney, 2000)
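
As a minimal sketch of how these measures can be computed for word alignment, the following assumes that an alignment is represented as a set of (source index, target index) links, and that the gold standard distinguishes sure links S from possible links P, with S included in P, as in the AER definition of Och and Ney (2000). The function name alignment_scores and the data layout are illustrative assumptions and do not come from any evaluation package listed on this page.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# A link pairs a source-side unit index with a target-side unit index.
Link = Tuple[int, int]


def alignment_scores(
    system: Set[Link],
    sure: Set[Link],
    possible: Optional[Iterable[Link]] = None,
) -> Dict[str, float]:
    """Score a system alignment against a gold standard.

    system   -- links proposed by the aligner (A)
    sure     -- gold-standard sure links (S)
    possible -- gold-standard possible links; sure links are always
                added to this set, so P = S when it is omitted
    """
    # Enforce S being a subset of P by taking the union.
    possible_set = set(sure) | set(possible or ())

    hits_sure = len(system & sure)              # |A ∩ S|
    hits_possible = len(system & possible_set)  # |A ∩ P|

    # Precision is measured against P, recall against S.
    precision = hits_possible / len(system) if system else 0.0
    recall = hits_sure / len(sure) if sure else 0.0

    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0

    # AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
    denom = len(system) + len(sure)
    aer = 1.0 - (hits_sure + hits_possible) / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "aer": aer,
    }
```

For example, with system links {(0, 0), (1, 1), (2, 3)}, sure links {(0, 0), (1, 1), (2, 2)} and one additional possible link (2, 3), this sketch gives a precision of 1.0, a recall of 2/3, an F-measure of 0.8 and an AER of 1/6. When the gold standard contains no possible links beyond the sure ones, P equals S and AER reduces to 1 minus the F-measure.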

Projects

Past

ARCADE I (1996-1999) and ARCADE II (2003-2006), multilingual text alignment evaluation campaigns.
Blinker (1998-2001).

Events

Past

ACL 2005 workshop on "Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond".
LREC 2004 workshop on "The Amazing Utility of Parallel and Comparable Corpora".
HLT-NAACL 2003 workshop on "Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond".

Tools

Aligners

bitext2tmx
hunalign
Geometric Mapping and Alignment (GMA)
Champollion Tool Kit (CTK)

Language Resources (LRs)

ARCADE II Evaluation package
Data from HLT-NAACL 2003 workshop on parallel texts (English, Romanian, French)
The Bible: parallel biblical texts available in several languages, including Chinese, Danish, English, French, Greek and Swahili.
The MULTEXT corpora (English, French, German, Italian and Spanish) and MULTEXT-East corpora (English, Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovenian).
The ARCADE/ROMANSEVAL multilingual corpora (English, French, German, Italian, Spanish, Arabic, Chinese, Japanese, Greek, Persian, Russian)
Data from ACL 2005 workshop on Building and Using Parallel Texts (English, Inuktitut, Romanian and Hindi).

References

Och F. J. and Ney H. (2000) A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pp. 1086-1090, Saarbrücken, Germany.

ELRA

European Language Resources Association