End-to-End Kurdish Speech Synthesis Based on Transfer Learning

Document Type : Original Article

Authors

1 Computer Science Department, Faculty of Science, Soran University

2 Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran

Abstract

A text-to-speech (TTS) system converts text into speech in a specific language. Several TTS systems generate natural-sounding speech in many languages, such as English. Speech synthesis for the Kurdish language, by contrast, has only begun to be examined. Existing preliminary research on Kurdish speech synthesis has relied on older methods and has produced low-quality speech that lacks important aspects of speech, including intonation, emphasis, and rhythm. Some approaches have been proposed to address these challenges, including concatenative systems such as unit selection, as well as statistical parametric methods. However, they require a great deal of time, effort, and domain knowledge. An additional reason for the low performance of Kurdish speech synthesizers is the absence of publicly available speech corpora, unlike English, which has many freely available corpora and audiobooks. The motivation of this paper is to create a Central Kurdish speech corpus and to generate human-like speech from Kurdish text. The paper explains how to use Tacotron 2, an end-to-end neural network architecture, together with the HiFi-GAN vocoder to produce a high-quality, realistic, human-like Kurdish voice. The work uses "text, audio" pairs comprising 10 hours of recorded audio samples, with texts collected from the Internet and textbooks. It shows how to use English character embeddings as pre-trained knowledge with Kurdish characters as input, and how to preprocess the audio samples to obtain good results. Our evaluations on various types of text show a mean opinion score of 4.1, comparable with state-of-the-art synthesizers in other languages.
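
For readers unfamiliar with the metric, a mean opinion score (MOS) is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name `mean_opinion_score`, the normal-approximation confidence interval, and the example ratings are all assumptions introduced here for illustration.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = statistics.mean(ratings)
    # Normal-approximation interval: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for illustration only, not data from the paper.
example_ratings = [4, 5, 4, 4, 3, 5, 4, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean indicates how much the score could shift with a different listener panel, which matters when comparing synthesizers whose scores differ by only a few tenths.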
