Description
Multilingual text alignment consists of identifying correspondences between text units (e.g., words, sentences, or paragraphs) in parallel texts.
Approach
The main approach to alignment evaluation is to compare a system-computed alignment with a manually produced reference alignment, usually called a gold standard. Different tasks have been defined in previous evaluation exercises such as Blinker, ARCADE, and the HLT-NAACL 2003 and ACL 2005 workshops.
Measures
Alignment evaluations have generally used traditional information retrieval (IR) measures:
Precision
Recall
F-measure
Alignment Error Rate (AER; Och and Ney, 2000), derived from the F-measure
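The measures listed above can be sketched as follows for word alignment. This is a minimal illustration, not the scoring script of any particular campaign. It assumes an alignment is a set of (source index, target index) links and that the gold standard may distinguish "sure" links from "possible" links, as is common in the AER literature; the function name `alignment_scores` and its parameters are hypothetical. When no possible links are given, AER reduces to 1 minus the F-measure.

```python
from typing import Dict, Optional, Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


def alignment_scores(
    system: Set[Link],
    sure: Set[Link],
    possible: Optional[Set[Link]] = None,
) -> Dict[str, float]:
    """Score a system alignment against a gold standard.

    Assumption: `sure` holds unambiguous gold links and `possible` holds
    additional acceptable links. Sure links always count as possible.
    """
    possible = set(sure) if possible is None else (possible | sure)

    hits_possible = len(system & possible)  # |A ∩ P|
    hits_sure = len(system & sure)          # |A ∩ S|

    precision = hits_possible / len(system) if system else 0.0
    recall = hits_sure / len(sure) if sure else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0

    # AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
    denom = len(system) + len(sure)
    aer = 1.0 - (hits_sure + hits_possible) / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "aer": aer,
    }


# Toy example (invented data, for illustration only).
gold_sure = {(0, 0), (1, 1), (2, 2)}
gold_possible = {(2, 3)}
system_links = {(0, 0), (1, 1), (2, 3), (3, 2)}
scores = alignment_scores(system_links, gold_sure, gold_possible)
```

In this toy example, three of the four system links are at least possible (precision 0.75) and two of the three sure links are found (recall about 0.67), giving an AER of 1 - 5/7, about 0.29.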
Tools
Aligners
bitext2tmx
hunalign
Geometric Mapping and Alignment (GMA)
Champollion Tool Kit (CTK)
Language resources (LRs)
ARCADE II Evaluation package
Data from the HLT-NAACL 2003 workshop on parallel texts (English, Romanian, French)
The Bible: parallel biblical texts available in several languages, including Chinese, Danish, English, French, Greek and Swahili.
The MULTEXT corpora (English, French, German, Italian and Spanish) and MULTEXT-East corpora (English, Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovenian).
The ARCADE/ROMANSEVAL multilingual corpora (English, French, German, Italian, Spanish, Arabic, Chinese, Japanese, Greek, Persian, Russian)
Data from the ACL 2005 workshop on Building and Using Parallel Texts (English, Inuktitut, Romanian and Hindi).
References
- Och F. J. and Ney H. (2000) A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pp. 1086-1090, Saarbrücken, Germany.