Update: I later found out that the method described below did not work as expected. Tricking Festival by simply providing it with a custom monophone transcript generates invalid .utt files, and creating the full-context labels from those .utt files gives you only the quinphone identity without any other linguistic context.
However, you can still use the script in the first part as the front-end of the TTS system (label/.utt generation using Festival). For creating .utt files for training data, I have written up a better approach here: A better way to create the full-context labels for HTS training data.
Introduction
If you are familiar with the HTS demos, you probably know about their full-context label format. A full-context label looks like this:
ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$1-3!1-1;1-3|er/C:1+0+2/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0/H:7=5^1=2|L-L%/I:7=3/J:14+8-2
The above line contains the phone identity and much of its linguistic context, including the two previous and two following phones, the position of the current phone in the current syllable, the position of the current syllable in the current word, stress, accent, and many other things. A detailed description of all these contexts is in lab_format.pdf inside the data folder of any HTS demo.
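To make the format concrete, here is a minimal Python sketch (my own illustration, not part of the HTS or Festival tools) that splits the quinphone part out of such a label. The regular expression assumes the standard HTS-demo quinphone pattern ll^l-c+r=rr at the start of the label:

import re

label = ("ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$1-3!1-1;1-3|er"
         "/C:1+0+2/D:0_0/E:content+2@1+5&1+2#0+3/F:in_1/G:0_0"
         "/H:7=5^1=2|L-L%/I:7=3/J:14+8-2")

# Quinphone: two previous phones, current phone, two following phones.
m = re.match(r"(.+?)\^(.+?)-(.+?)\+(.+?)=(.+?)@", label)
if m:
    ll, l, c, r, rr = m.groups()
    print("current phone:", c)          # er
    print("previous phones:", ll, l)    # ao, th
    print("following phones:", r, rr)   # ah, v

Everything after the @ encodes the remaining contexts (syllable, word, phrase and utterance features), in the order documented in lab_format.pdf.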
However, if you are building your own system, you may have trouble obtaining all those contexts to create such long labels. In fact, HTS can still work with much shorter full-context labels containing far less information (for example, just the quinphone identity, like ao^th-er+ah=v), but you should expect some degradation in the quality of the synthesized speech, because the decision trees will shrink. Fortunately, all of this text analysis can be done automatically by Festival. I will show all the steps in the sections below.