Machine Translation (MT) technologies convert text from a source language (L1) into a target language (L2).
One of the most difficult tasks in Machine Translation is the evaluation of a proposed system. Natural language is inherently ambiguous, which makes objective evaluation hard: for a given source text there is usually more than one good translation.
Van Slype (1979) distinguished two kinds of evaluation. Macro evaluation, also called total evaluation, measures product quality and enables comparison of the performance of two translation systems, or of two versions of the same system. Micro evaluation, also known as detailed evaluation, assesses the improvability of the translation system.
The performance of a translation system is usually measured by the quality of the texts it produces. Since there is no single correct translation for a given text, the challenge of machine translation evaluation is to provide an objective and economical assessment. Given the difficulty of the task, translation quality assessment has historically been based on human judgement; automatic procedures, however, allow a quicker, repeatable, more objective and cheaper evaluation. Automatic MT evaluation consists of comparing the MT system output to one or more human reference translations, whereas in manual evaluation human scores are assigned according to the adequacy, fluency or informativeness of the translated text.
In automatic evaluation, the fluency and adequacy of MT output can be measured by n-gram analysis.
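As a minimal illustration of what n-gram analysis means here, the sketch below computes the clipped (modified) n-gram precision that underlies BLEU-style scoring. The full BLEU metric additionally combines several n-gram orders and a brevity penalty; the function names here are illustrative, not a standard API.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is counted
    at most as many times as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Maximum count of each n-gram over all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

hypothesis = "the cat sat on the mat".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(hypothesis, references, 1))  # 5/6, about 0.833
print(modified_precision(hypothesis, references, 2))  # 3/5 = 0.6
```

The clipping step is what prevents a degenerate output such as "the the the the" from obtaining perfect unigram precision.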
Some of the most common automatic evaluation metrics are:
| Metric | Description | Reference |
| --- | --- | --- |
| BLEU | BiLingual Evaluation Understudy, an n-gram co-occurrence scoring procedure developed at IBM | (Papineni et al., 2001) |
| NIST | A variation of BLEU used in the NIST HLT evaluations | (Doddington, 2002) |
| EvalTrans | Tool for the automatic and manual evaluation of translations | (Niessen et al., 2000) |
| GTM | General Text Matcher, based on accuracy measures such as precision, recall and the F-measure | (Turian et al., 2003) |
| mWER | Multiple-reference Word Error Rate: the edit distance between the MT output and the closest of several human reference translations, averaged over the corpus | (Niessen et al., 2000) |
| mPER | Multiple-reference Position-independent word Error Rate | (Tillmann et al., 1997) |
| METEOR | Metric for Evaluation of Translation with Explicit ORdering, based on the harmonic mean of unigram precision and recall | (Banerjee & Lavie, 2005) |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation, based on n-gram co-occurrence measures | (Lin, 2004) |
| TER | Translation Edit Rate: the number of edits needed to change the MT output into one of the references | (Snover et al., 2006) |
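The error-rate metrics above (mWER, mPER, TER) are built on word-level edit distance. The following is a minimal sentence-level sketch of multiple-reference WER; the function names and the choice to normalise by the closest reference's length are illustrative conventions, and corpus-level mWER averages these scores over all segments.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def mwer(hypothesis, references):
    """Multiple-reference WER for one segment: edit distance to the
    closest reference, normalised by that reference's length."""
    return min(edit_distance(hypothesis, r) / len(r) for r in references)

hyp = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(mwer(hyp, refs))  # one substitution against the first reference: 1/6
```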
For human evaluation, fluency and adequacy are two commonly used translation quality notions (LDC2002; White et al., 1994). Fluency refers to the degree to which the system output is well-formed according to the grammar of the target language. Adequacy refers to the degree to which the output conveys the information present in the reference translation. More recently, other measures have been tested, such as the comprehensibility of an MT-translated segment (NIST MT09), or the preference between MT translations from different systems (NIST MT08).
Evaluation Campaigns
- Swiss National Fund: Quality models and resources for the evaluation of MT (2004-2008).
- The CESTA evaluation campaigns (in French), Evalda project, French Technolangue program (2002-2006).
- The C-STAR evaluation campaigns (2001, 2002, 2003).
- EAGLES, Evaluation of Natural Language Processing Systems (1993-1995).
- 863 Evaluation, HTRDP Evaluation of Chinese Language Processing and Intelligent Human Machine Interface (1986).
Workshops, Tutorials and Special Issues
- MT Summit XI (2007), Tutorial on "Context-based evaluation of MT systems: Principles and Tools".
- AMTA 2006, Workshop on "MT Evaluation: the Black Box in the Hall of Mirrors".
- HLT Evaluation Workshop in Malta (2005).
- MT Summit VI (1997), Tutorial on "MT Evaluation: Old, New and Recycled".
- Machine Translation Vol. 8, nos. 1-2 (1993), Special Issue on Evaluation of MT Systems.
- AMTA 1992, Workshop on "MT Evaluation: Basis for Future Directions", an NSF-sponsored workshop.
Open-source Machine Translation Systems
- Apertium, an open-source machine translation platform.
- GenPar, a toolkit for research on generalized parsing.
- Joshua, an open-source decoder for parsing-based machine translation.
- Matxin, an open-source transfer machine translation engine.
- Moses, an open-source statistical machine translation system.
For further information on research, campaigns, conferences, software and data regarding statistical machine translation and its evaluation, please refer to the European Association for Machine Translation. The Machine Translation Archive also offers a repository and bibliography on machine translation.
- Lin C.-Y., Cao G., Gao J. and Nie J.-Y. (2006). An information-theoretic approach to automatic evaluation of summaries. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 463-470, New York, New York.
- Snover M., Dorr B., Schwartz R., Micciulla L., and Makhoul J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-2006), Cambridge, Massachusetts.
- Banerjee S. and Lavie A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72, Ann Arbor, Michigan. Association for Computational Linguistics.
- Turian J. P., Shen L. and Melamed I. D. (2003). Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit 2003, pages 386-393, New Orleans, Louisiana.
- Doddington G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of ARPA Workshop on Human Language Technology.
- Papineni K., Roukos S., Ward T. and Zhu W.-J. (2001). BLEU: a method for automatic evaluation of machine translation. Technical report, IBM Research Division, Thomas J. Watson Research Center.
- Niessen S., Och F. J., Leusch G. and Ney H. (2000). An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece.
- Tillmann C., Vogel S., Ney H., Zubiaga A., and Sawaf H. (1997). Accelerated DP based search for statistical translation. In Fifth European Conf. on Speech Communication and Technology, pages 2667–2670, Rhodos, Greece, September.
- White J. S., O'Connell T. A. and O'Mara F. (1994). The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, USA.
- Van Slype G. (1979). Critical study of methods for evaluating the quality of machine translation. Final report BR 19142, Brussels: Bureau Marcel van Dijk.