Machine Translation (MT) technologies convert text from a source language (L1) into a target language (L2).
One of the most difficult tasks in Machine Translation is the evaluation of a proposed system. Natural language is inherently ambiguous, which makes objective evaluation hard: for a given source text there is usually more than one good translation.
Van Slype (1979) distinguished two kinds of evaluation. Macro evaluation, also called total evaluation, measures product quality and enables comparison of the performance of two translation systems, or of two versions of the same system. Micro evaluation, also known as detailed evaluation, assesses the improvability of the translation system.
The performance of a translation system is usually measured by the quality of the texts it produces. Since there is no single correct translation for a given text, the challenge of machine translation evaluation is to provide an objective and economical assessment. Given the difficulty of the task, translation quality assessment has historically been based on human judgement; automatic procedures, however, allow a quicker, repeatable, more objective and cheaper evaluation. Automatic MT evaluation consists of comparing the MT system output to one or more human reference translations, whereas in manual evaluation human scores are assigned according to the adequacy, fluency or informativeness of the translated text.
In automatic evaluation, the fluency and adequacy of MT output can be measured by n-gram analysis.
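As a minimal illustration of what n-gram analysis means here, the sketch below computes the clipped (modified) n-gram precision that underlies BLEU-style scoring. The full BLEU metric additionally combines several n-gram orders and a brevity penalty; the function names here are illustrative, not a standard API.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is counted
    at most as many times as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Maximum count of each n-gram over all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

hypothesis = "the cat sat on the mat".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(hypothesis, references, 1))  # 5/6, about 0.833
print(modified_precision(hypothesis, references, 2))  # 3/5 = 0.6
```

The clipping step is what prevents a degenerate output such as "the the the the" from obtaining perfect unigram precision.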
Some of the most common automatic evaluation metrics are:
| Metric | Description | Reference |
| --- | --- | --- |
| BLEU | BiLingual Evaluation Understudy, an n-gram co-occurrence scoring procedure developed at IBM | (Papineni et al., 2001) |
| NIST | A variation of BLEU used in the NIST HLT evaluations | (Doddington, 2002) |
| EvalTrans | Tool for the automatic and manual evaluation of translations | (Niessen et al., 2000) |
| GTM | General Text Matcher, based on accuracy measures such as precision, recall and the F-measure | (Turian et al., 2003) |
| mWER | Multiple-reference Word Error Rate: the edit distance between the MT output and the closest of several human reference translations, averaged over the corpus | (Niessen et al., 2000) |
| mPER | Multiple-reference Position-independent word Error Rate | (Tillmann et al., 1997) |
| METEOR | Metric for Evaluation of Translation with Explicit ORdering, based on the harmonic mean of unigram precision and recall | (Banerjee & Lavie, 2005) |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation, based on n-gram co-occurrence measures | (Lin, 2004) |
| TER | Translation Edit Rate: the number of edits needed to change the MT output into one of the references | (Snover et al., 2006) |
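The error-rate metrics above (mWER, mPER, TER) are built on word-level edit distance. The following is a minimal sentence-level sketch of multiple-reference WER; the function names and the choice to normalise by the closest reference's length are illustrative conventions, and corpus-level mWER averages these scores over all segments.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def mwer(hypothesis, references):
    """Multiple-reference WER for one segment: edit distance to the
    closest reference, normalised by that reference's length."""
    return min(edit_distance(hypothesis, r) / len(r) for r in references)

hyp = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(mwer(hyp, refs))  # one substitution against the first reference: 1/6
```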
For human evaluation, fluency and adequacy are two commonly used translation quality notions (LDC2002; White et al., 1994). Fluency refers to the degree to which the system output is well-formed according to the grammar of the target language. Adequacy refers to the degree to which the output conveys the information present in the reference translation. More recently, other measures have been tested, such as the comprehensibility of an MT-translated segment (NIST MT09), or the preference between MT translations from different systems (NIST MT08).
Evaluation Campaigns
- Swiss National Fund: Quality models and resources for the evaluation of MT (2004-2008).
- The CESTA evaluation campaigns (in French), Evalda project, French Technolangue program (2002-2006).
- The C-STAR evaluation campaigns (2001, 2002, 2003).
- EAGLES, Evaluation of Natural Language Processing Systems (1993-1995).
- 863 Evaluation, HTRDP Evaluation of Chinese Language Processing and Intelligent Human Machine Interface (1986).
Workshops, Tutorials and Special Issues
- MT Summit XI (2007), Tutorial on "Context-based evaluation of MT systems: Principles and Tools".
- AMTA 2006, Workshop on "MT Evaluation: the Black Box in the Hall of Mirrors".
- HLT Evaluation Workshop in Malta (2005).
- MT Summit VI (1997), Tutorial on "MT Evaluation: Old, New and Recycled".
- Machine Translation Vol. 8, nos. 1-2 (1993), Special Issue on Evaluation of MT Systems.
- AMTA 1992, Workshop on "MT Evaluation: Basis for Future Directions", an NSF-sponsored workshop.
Open-source Machine Translation Systems
- Apertium, an open-source machine translation platform.
- GenPar, a toolkit for research on generalized parsing.
- Joshua, an open-source decoder for parsing-based machine translation.
- Matxin, an open-source transfer machine translation engine.
- Moses, an open-source statistical machine translation system.
For further information on research, campaigns, conferences, software and data regarding statistical machine translation and its evaluation, please refer to the European Association for Machine Translation. The Machine Translation Archive also offers a repository and bibliography on machine translation.
- Lin C.-Y., Cao G., Gao J. and Nie J.-Y. (2006). An information-theoretic approach to automatic evaluation of summaries. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 463-470, New York, New York.
- Snover M., Dorr B., Schwartz R., Micciulla L., and Makhoul J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-2006), Cambridge, Massachusetts.
- Banerjee S. and Lavie A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72, Ann Arbor, Michigan. Association for Computational Linguistics.
- Turian J. P., Shen L. and Melamed I. D. (2003). Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit 2003, pages 386-393, New Orleans, Louisiana.
- Doddington G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of ARPA Workshop on Human Language Technology.
- Papineni K., Roukos S., Ward T. and Zhu W.-J. (2001). BLEU: a method for automatic evaluation of machine translation. Technical report, IBM Research Division, Thomas J. Watson Research Center.
- Niessen S., Och F. J., Leusch G. and Ney H. (2000). An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece.
- Tillmann C., Vogel S., Ney H., Zubiaga A., and Sawaf H. (1997). Accelerated DP based search for statistical translation. In Fifth European Conf. on Speech Communication and Technology, pages 2667–2670, Rhodos, Greece, September.
- White J. S., O'Connell T. A. and O'Mara F. (1994). The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, USA.
- Van Slype G. (1979). Critical study of methods for evaluating the quality of machine translation. Final report BR 19142, Brussels: Bureau Marcel van Dijk.