Comparing Two ASR Models: A Word Error Rate Analysis on the Hypotheses by Whisper and Wav2vec 2.0
This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
Abstract
Automatic speech recognition (ASR) is technology that converts human speech into text. It has become increasingly accurate and popular since the 2010s, owing to increases in computing power and developments in neural network architecture. However, several challenges affect the performance of ASR models, such as the availability of labeled training data, out-of-vocabulary words and background noise. Historically, ASR models have mainly been built on hidden Markov models (HMMs) and have consisted of multiple components. More recently, end-to-end models based on neural network architectures such as transformers have become common. ASR models typically need vast amounts of labeled training data, meaning speech and its transcriptions, but self-supervised methods that require smaller amounts of labeled training data have been introduced. This thesis compares the hypotheses, meaning the outputs, of two open-source ASR models, Whisper and Wav2vec 2.0. Whisper was developed by OpenAI and released in 2022, whereas Wav2vec 2.0, a successor to Wav2vec, was developed by Meta and released in 2020. Whisper relies on large-scale training data and Wav2vec 2.0 on self-supervision. Both English and Finnish speech data were used to analyze the performance of the models. The performance of an ASR model is typically evaluated using word error rate (WER), a single numeric value obtained by dividing the sum of substitutions, deletions and insertions in the hypothesis by the total number of words in the reference transcription. The purpose of this thesis is to analyze these errors further and to find patterns and reasons that might explain them. The results showed that some phonetic aspects are difficult for ASR models. For both languages, compound words, proper nouns and inflection were difficult. For Finnish, vowel clusters and especially vowel length, the fricative phoneme /h/ and nasal phonemes produced errors.
For English, articles, conjunctions and prepositions as well as word repetition and adjacent words produced errors. There were differences in the performance of the models, Whisper being more creative and Wav2vec 2.0 being more conservative. It is possible but not guaranteed that these results can be generalized to other speech data, as they are aligned with the challenges known to affect the performance of ASR models.
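The WER definition given in the abstract can be illustrated with a minimal Python sketch. It is not the tooling used in the thesis: the names `WerResult` and `word_error_rate` are hypothetical, words are split on whitespace only, and no text normalization (casing, punctuation) is applied. The sketch aligns the hypothesis to the reference with a word-level edit distance and keeps separate counts of substitutions, deletions and insertions, since those counts are what a closer error analysis starts from.

```python
from typing import NamedTuple


class WerResult(NamedTuple):
    """Error counts from aligning a hypothesis to a reference (hypothetical helper)."""
    substitutions: int
    deletions: int
    insertions: int
    reference_length: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / N, where N is the number of reference words.
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_length


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    # Assumption: plain whitespace tokenization, no normalization.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # table[i][j] = (total errors, substitutions, deletions, insertions)
    # for the best alignment of ref[:i] with hyp[:j].
    table = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                # Matching words cost nothing.
                table[i][j] = table[i - 1][j - 1]
                continue
            e, s, d, n = table[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = table[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = table[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Tuples compare by total errors first, so min() picks a cheapest alignment.
            table[i][j] = min(substitution, deletion, insertion)

    _, s, d, n = table[-1][-1]
    return WerResult(s, d, n, len(ref))


result = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(result, result.wer)
```

In this made-up example the hypothesis substitutes "sit" for "sat" and drops the second "the", so the sketch counts one substitution and one deletion against six reference words, giving a WER of 2/6. Because insertions are counted against the reference length, WER can exceed 1 when the hypothesis is much longer than the reference.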