How to evaluate?

In general, an evaluation starts with a description of

  • the object of the evaluation;
  • the classes of users;
  • the measurable attributes of the system (the evaluation criteria), together with the metrics used to quantify them (see the sketch after this list).
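
As a rough illustration, these three elements can be recorded in a small, machine-readable specification. The Python sketch below is only illustrative: the names (Criterion, EvaluationSpec) and fields are assumptions made for the example, not part of any evaluation standard.

  from dataclasses import dataclass, field
  from typing import Callable, List

  @dataclass
  class Criterion:
      """One measurable attribute of the system, together with its metric."""
      name: str                      # e.g. "accuracy"
      metric: Callable[..., float]   # how the attribute is quantified
      description: str = ""

  @dataclass
  class EvaluationSpec:
      """The three elements an evaluation description starts with."""
      object_of_evaluation: str   # e.g. "spell checker X, version 2.1"
      user_classes: List[str]     # e.g. ["translators", "students"]
      criteria: List[Criterion] = field(default_factory=list)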

The EAGLES (Expert Advisory Group on Language Engineering Standards) evaluation working group identified seven major steps for a successful evaluation (cast as a simple checklist in the sketch after this list):

  1. Define why the evaluation is being done
  2. Elaborate a task model
  3. Define top level quality characteristics
  4. Produce detailed requirements for the system under evaluation, on the basis of steps 2 and 3
  5. Devise the metrics to be applied to the system for the requirements produced under step 4
  6. Design the execution of the evaluation
  7. Execute the evaluation
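
The steps can be read as a dependency-ordered checklist: step 4 explicitly builds on steps 2 and 3, and step 5 on step 4. The Python sketch below makes that ordering explicit; the remaining dependencies simply assume the steps are taken in sequence, which is an assumption made for the example rather than part of the EAGLES text.

  # Each step maps to (description, prerequisite steps).  Only the
  # prerequisites of steps 4 and 5 come from the list above; the rest
  # assume a simple sequential order.
  EAGLES_STEPS = {
      1: ("Define why the evaluation is being done", []),
      2: ("Elaborate a task model", [1]),
      3: ("Define top-level quality characteristics", [1]),
      4: ("Produce detailed requirements for the system under evaluation", [2, 3]),
      5: ("Devise the metrics to be applied to the system", [4]),
      6: ("Design the execution of the evaluation", [5]),
      7: ("Execute the evaluation", [6]),
  }

  def next_steps(done):
      """Return the steps whose prerequisites have all been completed."""
      return [n for n, (_, deps) in EAGLES_STEPS.items()
              if n not in done and all(d in done for d in deps)]

For instance, next_steps({1}) returns [2, 3]: once the purpose of the evaluation is fixed, the task model and the quality characteristics can be developed in parallel.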

Usability evaluation

For evaluation design, the emphasis has traditionally been put on measuring how well system performance meets specific functional requirements. Usability is generally ignored because objective criteria for it are lacking.

ISO 9241, one of the ISO standards that apply to usability and ergonomics, provides the information that needs to be taken into account when specifying or evaluating usability in terms of measures of user performance and satisfaction. ISO 13407 specifies the user-centered design process needed to achieve the usability and quality-in-use goals.
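
ISO 9241-11 frames usability in terms of effectiveness, efficiency and satisfaction. The sketch below computes one common operationalization of each from a set of user test results; the concrete choices (completion rate, mean time on task, mean rating on a 1-5 scale) are illustrative assumptions, not prescribed by the standard.

  from dataclasses import dataclass
  from statistics import mean

  @dataclass
  class TaskResult:
      completed: bool        # did the user achieve the task goal?
      time_seconds: float    # time the user spent on the task
      satisfaction: float    # e.g. post-task rating on a 1-5 scale

  def usability_measures(results):
      """Summarise effectiveness, efficiency and satisfaction over task results."""
      return {
          "effectiveness": mean(1.0 if r.completed else 0.0 for r in results),
          "efficiency_seconds": mean(r.time_seconds for r in results),
          "satisfaction": mean(r.satisfaction for r in results),
      }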

The Common Industry Format for usability test reports, developed within the NIST Industry USability Reporting (IUSR) project, has been approved as an ISO standard. The document will be called: "ISO/IEC 25062 Software Engineering - Software Product Quality Requirements and Evaluation (SQuaRE) - Common Industry Format (CIF) for Usability Test Reports" (usabilitynews.com).

Comparative evaluation

Comparative evaluation is a paradigm in which a set of participants compare the results of their systems on the same data and control tasks, using metrics that have been agreed upon. Usually this evaluation is performed in a series of successive evaluation campaigns with open participation. For every campaign, the results are presented and compared at dedicated workshops, where the methods used by the participants are discussed and contrasted.
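
A minimal scoring harness for such a campaign might look as follows. The metric (simple accuracy) and the system interface (a function from test inputs to outputs) are assumptions made for the example; in a real campaign the metrics, data formats and control tasks are agreed upon by the participants.

  from typing import Callable, Dict, List, Tuple

  def accuracy(predictions: List[str], gold: List[str]) -> float:
      """Agreed-upon metric for this example: proportion of exact matches."""
      return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

  def run_campaign(systems: Dict[str, Callable[[List[str]], List[str]]],
                   test_inputs: List[str],
                   gold_outputs: List[str],
                   metric: Callable[[List[str], List[str]], float] = accuracy,
                   ) -> List[Tuple[str, float]]:
      """Score every participating system on the same data with the same metric."""
      scores = {name: metric(system(test_inputs), gold_outputs)
                for name, system in systems.items()}
      return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

The resulting ranked list is the kind of output that would be presented and discussed at the campaign workshop.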

Experience with comparative evaluation in the USA and in Europe has shown that the approach leads to significant improvements in the performance of the evaluated technologies. A frequent consequence is the production of high-quality resources: the evaluation requires the development of annotated data and test sets, since the participants need data for training and testing their systems. The availability of these language resources during campaigns also enables all researchers in a particular field to evaluate, benchmark and compare the performance of their systems.