Speech recognition, also known as automated speech recognition (ASR) or speech-to-text (STT), is a process by which a program or a system transcribes an acoustic speech signal to text.
Systems generally perform two different types of recognition: single-word and continuous speech recognition. Continuous speech is more difficult to handle because of a variety of effects such as speech rate, coarticulation, etc. Today’s state-of-the-art systems are able to transcribe unrestricted continuous speech from broadcast data with acceptable performance.
Evaluation of ASR systems is mainly performed by computing the Word Error Rate (WER), or the Character Error Rate (CER) for languages such as Chinese or Japanese. WER is derived from the Levenshtein distance (or edit distance) and measures the distance between the hypothesis transcription produced by the ASR module and the reference transcription. It is computed after the hypothesis and the reference transcriptions have been aligned by dynamic programming, the optimal alignment being the one that minimises the Levenshtein distance. Usually the costs for insertion, deletion and substitution are 3, 3 and 4 respectively. After alignment, the recognition errors are counted and WER is the total number of errors divided by the number of words in the reference. Three kinds of errors are taken into account: substitution, deletion and insertion errors.
Substitution: a reference word is replaced by another word in the best alignment between the reference and the system hypothesis.
Deletion: a reference word is not present in the system hypothesis in the best alignment.
Insertion: an extra word is present in the system hypothesis in the best alignment between the reference and the hypothesis.
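The alignment and error counting described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular scoring tool: the function names `align_counts` and `wer` are hypothetical, and the default costs (3 for insertion, 3 for deletion, 4 for substitution) are the ones quoted in the text.

```python
from typing import List, Tuple


def align_counts(ref: List[str], hyp: List[str],
                 ins_cost: int = 3, del_cost: int = 3,
                 sub_cost: int = 4) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    n, m = len(ref), len(hyp)
    # table[i][j] = (total cost, S, D, I) for aligning ref[:i] with hyp[:j]
    table = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i * del_cost, 0, i, 0)   # all reference words deleted
    for j in range(1, m + 1):
        table[0][j] = (j * ins_cost, 0, 0, j)   # all hypothesis words inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost, s, d, k = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, s, d, k)                  # correct word
            else:
                diagonal = (cost + sub_cost, s + 1, d, k)   # substitution
            cost, s, d, k = table[i - 1][j]
            deletion = (cost + del_cost, s, d + 1, k)
            cost, s, d, k = table[i][j - 1]
            insertion = (cost + ins_cost, s, d, k + 1)
            # Keep the cheapest of the three moves (ties favour the diagonal).
            table[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    _, s, d, k = table[n][m]
    return s, d, k


def wer(ref: List[str], hyp: List[str]) -> float:
    """Word error rate: (S + D + I) divided by the number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    s, d, i = align_counts(ref, hyp)
    return (s + d + i) / len(ref)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
print(align_counts(reference, hypothesis))  # (1, 1, 0): one substitution, one deletion
print(wer(reference, hypothesis))
```

Because the functions only compare list elements, passing lists of characters instead of lists of words gives a character error rate under the same assumptions.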
Although the word is the basic unit for assessing ASR systems, the same computation can be made at other granularities (phonemes, syllables, etc.). WER can be greater than 100% if the number of errors exceeds the number of words in the reference. Prior to scoring, both hypothesis and reference have to be normalised. Normalisation consists of converting the transcription into a more standardised form. This step is language dependent and applies a number of rules for transforming each token into its normalised form. For instance, numbers are spelled out, punctuation marks are removed, contractions are expanded, multiple orthographies are converted to a unique form, etc. Although WER is the main metric for assessing ASR systems, its major drawback is that all word errors are equally penalised, regardless of the importance and meaning of the word, e.g. a function word has the same importance as a named entity.
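As an illustration of the normalisation step, the sketch below applies a few toy rules for English. The rule tables (`CONTRACTIONS`, `NUMBERS`, `SPELLING_VARIANTS`) and the function name `normalise` are hypothetical and deliberately tiny; real evaluation campaigns rely on much larger, language-specific rule sets.

```python
from typing import List

# Hypothetical, deliberately small rule tables for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot"}
NUMBERS = {"2": "two", "3": "three"}
SPELLING_VARIANTS = {"colour": "color", "normalise": "normalize"}

PUNCTUATION = ".,;:!?\"()"


def normalise(text: str) -> List[str]:
    """Lower-case, strip punctuation and map each token to one standard form."""
    tokens: List[str] = []
    for raw in text.lower().replace("\u2019", "'").split():
        token = raw.strip(PUNCTUATION)            # remove surrounding punctuation
        if not token:
            continue
        token = CONTRACTIONS.get(token, token)    # expand contractions
        token = NUMBERS.get(token, token)         # spell out numbers
        token = SPELLING_VARIANTS.get(token, token)  # unify orthographies
        tokens.extend(token.split())              # an expansion may yield several words
    return tokens


print(normalise("Don't pick 2, the colour is OK!"))
# ['do', 'not', 'pick', 'two', 'the', 'color', 'is', 'ok']
```

Applying the same function to both the reference and the hypothesis before scoring keeps purely orthographic differences from being counted as recognition errors.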
The performance of ASR systems is also measured in terms of speed, by measuring the processing time and computing the real-time factor on a specific hardware configuration. This is an important factor for applications that require real-time processing or for devices that are limited in terms of memory or processor speed.
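A minimal sketch of the speed measurement, assuming the usual definition of the real-time factor as processing time divided by the duration of the audio (a value below 1 means faster than real time). The `recognize` callable stands in for any ASR system and is a placeholder, not a real API.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(recognize: Callable[[bytes], str], audio: bytes,
                audio_seconds: float) -> float:
    """Time one call to a (hypothetical) recogniser and return its real-time factor."""
    start = time.perf_counter()
    recognize(audio)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


# 30 s of processing for 60 s of audio: twice as fast as real time.
print(real_time_factor(30.0, 60.0))  # 0.5
```

Since the result depends on the machine, a reported real-time factor is only meaningful together with the hardware configuration it was measured on.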
For ASR evaluation, the criterion is recognition accuracy. One commonly used measure is the word error rate (WER) or the related word accuracy (1 − WER); WER is also used in machine translation evaluation.
The method used in the current DARPA speech recognition evaluation involves comparing the system transcription of the input speech to the reference (i.e., a transcription by a human expert), using algorithms to score agreement at the word level. Higher-level metrics such as sentence error rate or concept error rate can be applied, depending on the application.
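As a small illustration of a higher-level metric, the sketch below computes a sentence error rate, taken here to be the fraction of sentences whose hypothesis differs from the reference in at least one word. The function name is hypothetical, and both inputs are assumed to be already normalised and tokenised.

```python
from typing import List, Sequence


def sentence_error_rate(references: Sequence[List[str]],
                        hypotheses: Sequence[List[str]]) -> float:
    """Fraction of sentences containing at least one word error."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    wrong = sum(1 for ref, hyp in zip(references, hypotheses) if ref != hyp)
    return wrong / len(references)


refs = [["turn", "left"], ["stop", "here"]]
hyps = [["turn", "left"], ["stop", "there"]]
print(sentence_error_rate(refs, hyps))  # 0.5
```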
Communication style (e.g., speaker-independent, spontaneous speech, etc.), vocabulary size, language model and usage conditions are also important features which can affect the performance of a speech recognizer for a particular task.
NIST Rich Transcription evaluations:
- The Rich Transcription Spring 2009 Meeting Recognition evaluation: English meeting speech. Three evaluation tasks were supported: STT, Speaker Diarization, and Speaker Attributed STT.
- The Rich Transcription Spring 2007 Meeting Recognition evaluation: English meeting speech. Three evaluation tasks were supported: STT, MDE Speaker Diarization, and a new task, Speaker Attributed STT.
- The Rich Transcription Spring 2006 Meeting Recognition evaluation: English meeting domain speech.
- The Rich Transcription Spring 2005 Evaluation (RT-05S): English meeting domain speech.
- The Rich Transcription Fall 2004 Evaluation (RT-04F): broadcast news speech and conversational telephone speech in English, Mandarin, and Arabic.
- The Rich Transcription Spring 2004 Evaluation (RT-04S): meeting domain speech in English.
- The Rich Transcription Fall 2003 Evaluation (RT-03F): broadcast news speech and conversational telephone speech in English.
- The Rich Transcription Spring 2003 Evaluation (RT-03S): broadcast news speech and conversational telephone speech in English, Chinese and Arabic.
- The Rich Transcription 2002 Evaluation (RT-02): broadcast news speech, conversational telephone speech, and meeting room speech in English.
- TC-STAR 2007 ASR evaluation campaign: broadcast news and Parliament speeches for Chinese, English and Spanish
- TC-STAR 2006 ASR evaluation campaign: broadcast news and Parliament speeches for Chinese, English and Spanish
- TC-STAR 2005 ASR evaluation campaign: broadcast news and Parliament speeches for Chinese, English and Spanish
- The ESTER 1 evaluation campaign (2005-2007) focused on French broadcast news. There were three evaluation tasks: STT, Speaker Diarization and Named Entity Recognition.
- The ESTER 2 evaluation campaign (2008-2009) focused on French broadcast news. The same evaluation tasks as in the previous campaign were organised, together with new experimental tasks such as sentence boundary detection.
- EVALITA 2009: Connected digit recognition for Italian in clean and noisy environments
- AURORA: distributed noisy speech recognition evaluation framework for English (evaluation packages are available from ELRA’s catalogue)
- Aurora 2: a connected digit recognition task under various additive noises
- Aurora 3: in-car connected digit recognition task
- Aurora 4: continuous speech recognition
- CENSREC (Corpus and Environment for Noisy Speech RECognition): Japanese noisy speech recognition evaluation framework
- CENSREC-1 (2003): noisy speech recognition evaluation framework
- CENSREC-1-C (2006): voice activity detection under noisy conditions
- CENSREC-2 (2005): in-car connected digit recognition
- CENSREC-3 (2005): in-car isolated word recognition
- CENSREC-4 (2006): an evaluation framework for distant-talking speech under hands-free conditions, with connected digits as in CENSREC-1
Scoring tools such as SCLITE are available on the NIST speech group website.
[AURORA Project database - Subset of SpeechDat-Car - Italian database - Evaluation Package http://catalog.elra.info/product_in…]