[AI from Scratch] Episode 201: Evaluation Metrics for Speech Generation — PESQ, STOI, and More

Recap: Tacotron

In the previous episode, we explained Tacotron, a model that converts text into speech, widely used in applications such as voice assistants and narration generation. Especially with Tacotron 2, the quality of the generated speech has significantly improved. This time, we will discuss evaluation metrics used to assess the quality of such speech generation technologies.

What Is Speech Generation Evaluation?

Evaluating speech generation involves measuring the quality and perceived naturalness of the synthesized speech. Evaluation metrics can be classified into two types: objective evaluation and subjective evaluation.

  • Objective Evaluation uses numerical metrics to evaluate speech quality, such as PESQ and STOI.
  • Subjective Evaluation involves human listeners rating the speech, with MOS (Mean Opinion Score) being a common method.

Objective Evaluation Metrics

1. PESQ (Perceptual Evaluation of Speech Quality)

PESQ is a standardized metric defined by the ITU-T (the Telecommunication Standardization Sector of the International Telecommunication Union) in Recommendation P.862 to evaluate the perceptual quality of speech. It compares the original (reference) speech with the generated speech and quantifies the perceptual difference between them.

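Raw PESQ scores range from -0.5 to 4.5, and tools often map them onto a MOS-like scale (MOS-LQO) of roughly 1 to 4.5. In both cases, higher scores indicate that the generated speech is perceptually closer to the reference. For instance, a score above 4.0 indicates very little perceptible degradation relative to the original sound. Because PESQ compares against a reference recording, it measures fidelity to that reference rather than naturalness in isolation. PESQ is used not only for speech synthesis but also for evaluating the quality of speech codecs and communication systems.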
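PESQ tools may report either a raw score or a score mapped onto the MOS scale (MOS-LQO), so it helps to know which one you are reading. The sketch below shows the logistic mapping as we understand it from ITU-T P.862.1. The function name is our own, and the coefficients should be checked against the Recommendation before being relied on.

```python
import math


def pesq_raw_to_mos_lqo(raw_score: float) -> float:
    """Map a raw PESQ score to MOS-LQO with a logistic curve.

    Assumption: coefficients as commonly cited from ITU-T P.862.1;
    verify them against the Recommendation before production use.
    """
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


# Show how a few raw scores move onto the MOS-like scale.
for raw in (-0.5, 1.0, 2.5, 4.5):
    print(f"raw PESQ {raw:+.1f} -> MOS-LQO {pesq_raw_to_mos_lqo(raw):.2f}")
```

The mapping is monotonic, so ranking systems by raw score or by MOS-LQO gives the same order. Only the absolute values differ, which matters when comparing numbers across papers or tools.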

2. STOI (Short-Time Objective Intelligibility)

STOI measures the intelligibility of speech, that is, how easily listeners can understand what is said. It is particularly useful for evaluating speech that has been degraded by noise or altered by processing such as noise suppression. STOI compares the short-time temporal envelopes of the original and the processed speech across frequency bands and averages their correlation, giving a score that typically falls between 0 and 1. The closer the score is to 1, the more intelligible the speech is expected to be.

STOI is valuable for evaluating noise suppression technologies and speech synthesis quality, especially under challenging conditions where background noise is present.
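To make the idea of short-time similarity concrete, here is a deliberately simplified sketch. It is not the STOI algorithm: real STOI works on one-third-octave bands with normalization and clipping steps that are omitted here. The sketch only correlates a single-band energy envelope over short segments, and all function names, frame sizes, and test signals are our own assumptions.

```python
import math
import random


def frame_rms(signal, frame_len):
    """RMS level of consecutive non-overlapping frames (a crude envelope)."""
    n_frames = len(signal) // frame_len
    return [
        math.sqrt(sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len)
        for i in range(n_frames)
    ]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def envelope_correlation(reference, degraded, frame_len=256, seg_frames=30):
    """Average envelope correlation over sliding segments of seg_frames frames."""
    ref_env = frame_rms(reference, frame_len)
    deg_env = frame_rms(degraded, frame_len)
    n = min(len(ref_env), len(deg_env))
    if n < seg_frames:
        raise ValueError("signals are too short for one full segment")
    scores = [
        pearson(ref_env[s:s + seg_frames], deg_env[s:s + seg_frames])
        for s in range(n - seg_frames + 1)
    ]
    return sum(scores) / len(scores)


# Hypothetical test signal: a 220 Hz tone with a 4 Hz amplitude modulation,
# sampled at 16 kHz for one second, plus Gaussian noise at two levels.
rng = random.Random(0)
sr = 16000
clean = [
    math.sin(2 * math.pi * 220 * t / sr) * (0.5 + 0.5 * math.sin(2 * math.pi * 4 * t / sr))
    for t in range(sr)
]
light = [x + rng.gauss(0.0, 0.05) for x in clean]
heavy = [x + rng.gauss(0.0, 2.0) for x in clean]

clean_score = envelope_correlation(clean, clean)
light_score = envelope_correlation(clean, light)
heavy_score = envelope_correlation(clean, heavy)
print(clean_score, light_score, heavy_score)
```

As the noise level rises, the envelope of the degraded signal follows the clean envelope less closely and the score drops. STOI builds on this same intuition in a more perceptually motivated form.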

3. LSD (Log-Spectral Distance)

LSD evaluates how much the spectral characteristics of the generated speech differ from those of the original. It is typically computed frame by frame as the root-mean-square difference, in decibels, between the log power spectra of the two signals, and then averaged over frames. Because the comparison is made per frequency bin, it captures differences across all frequency bands. A smaller LSD value indicates that the generated speech matches the original more closely, with 0 meaning identical spectra.
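The following is a minimal sketch of one common LSD formulation: for each frame, take the root-mean-square of the per-bin differences in dB between the two power spectra, then average over frames. It uses a naive DFT with no windowing so that it runs with the standard library only. In practice an FFT library and a window function would normally be used, and the function names and frame length are our own assumptions.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum of one frame (bins 0 .. N/2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def log_spectral_distance(reference, generated, frame_len=64, eps=1e-10):
    """Mean over frames of the RMS log-power-spectrum difference, in dB."""
    n_frames = min(len(reference), len(generated)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    distances = []
    for i in range(n_frames):
        ref_ps = power_spectrum(reference[i * frame_len:(i + 1) * frame_len])
        gen_ps = power_spectrum(generated[i * frame_len:(i + 1) * frame_len])
        # eps guards against log of zero in silent bins.
        sq_diffs = [
            (10.0 * math.log10((r + eps) / (g + eps))) ** 2
            for r, g in zip(ref_ps, gen_ps)
        ]
        distances.append(math.sqrt(sum(sq_diffs) / len(sq_diffs)))
    return sum(distances) / len(distances)


# Hypothetical signals: two sinusoids, and the same signal at half amplitude.
reference = [
    math.sin(2 * math.pi * 0.07 * t) + 0.5 * math.sin(2 * math.pi * 0.19 * t)
    for t in range(512)
]
quieter = [0.5 * x for x in reference]

lsd_same = log_spectral_distance(reference, reference)
lsd_quieter = log_spectral_distance(reference, quieter)
print(lsd_same, lsd_quieter)
```

Halving the amplitude quarters the power in every bin, so the second value comes out close to 10·log10(4), about 6.02 dB. This also shows that plain LSD is sensitive to overall level, which is why signals are usually level-normalized before comparison.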

Subjective Evaluation Metrics

1. MOS (Mean Opinion Score)

MOS is a subjective evaluation method in which listeners rate the quality of speech on a scale from 1 (bad) to 5 (excellent), and the ratings are averaged. A higher score reflects better quality. Because MOS captures how natural and how free from noise the speech sounds to people, it directly reflects human perception and is a key measure when making final judgments about a speech synthesis system.

Example: How MOS Is Applied

Listeners are asked to rate the synthesized speech based on criteria such as naturalness, clarity, and noise levels. This score reflects the overall listener experience and perception of quality.
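Once the ratings are collected, the MOS is simply their mean, usually reported with a confidence interval. Here is a minimal sketch, assuming a hypothetical set of ten ratings and a normal approximation for the 95% interval. With so few listeners a t-distribution would be more appropriate, and the function name is our own.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on each side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from ten listeners for one synthesized utterance.
ratings = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean makes it clear whether a difference between two systems is larger than the listener-to-listener variation.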

Challenges in Evaluating Speech Generation

1. Cost of Subjective Evaluation

Since subjective evaluation involves human listeners directly rating the speech, it requires time and resources, especially when evaluating a large dataset. Maintaining consistency and reliability across a large group of listeners can also be challenging.

2. Gap Between Objective and Subjective Evaluation

Objective metrics do not always perfectly align with subjective evaluations. For example, even if the PESQ score is high, listeners might still perceive the speech as unnatural. Therefore, combining multiple evaluation methods is crucial for a comprehensive assessment.
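One way to check how well an objective metric tracks listener judgments is to compute a rank correlation between the two across several systems. The sketch below computes Spearman's rank correlation on made-up numbers. The scores are purely illustrative, not measured results, and the function names are our own.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical scores for five systems (illustrative values only).
pesq_scores = [2.1, 2.8, 3.4, 3.9, 4.2]
mos_scores = [2.5, 3.6, 3.1, 4.4, 4.0]

rho = spearman(pesq_scores, mos_scores)
print(f"Spearman correlation between PESQ and MOS: {rho:.2f}")
```

A correlation well below 1 on your own data is a signal that the objective metric alone should not be used to rank systems, and that listening tests are still needed.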

Summary

In this episode, we discussed evaluation metrics for speech generation, from objective metrics such as PESQ, STOI, and LSD to the subjective MOS. Evaluating speech quality from multiple perspectives is essential for developing and refining speech synthesis technologies.


Preview of the Next Episode

Next time, we will explain applications of self-supervised learning. We will explore how models learn from unlabeled data and how this learning is applied in real-world applications. Stay tuned!


Annotations

  1. PESQ (Perceptual Evaluation of Speech Quality): A standardized metric for evaluating perceptual quality in speech.
  2. STOI (Short-Time Objective Intelligibility): A metric used for evaluating the intelligibility of speech, especially in noisy environments.
  3. MOS (Mean Opinion Score): A method where listeners rate speech quality on a five-point scale.
  4. LSD (Log-Spectral Distance): A metric that assesses the spectral differences in speech based on frequency components.

Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
