A better way to create the full-context labels for HTS training data

One of my previous posts described my first attempt to generate training data for an HTS system from recordings and transcripts: How to create full-context labels for your HTS system (update: not really worked); unfortunately, it did not work as expected. During eNTERFACE’14, I learned that there is a tool named EHMM in festvox that can help to build the .utt files (and, in turn, the full-context labels) easily.

To start, you can follow the steps at the following link, which gives the full instructions for building a CLUSTERGEN voice: http://festvox.org/bsv/c3170.html.

You can actually stop after having obtained the .utt files and use them for your own purposes.

In general, the necessary steps are listed below:
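As a rough sketch, the festvox build sequence from that page looks like the commands below. The names cmu, us and awb are just the usual institution/language/speaker placeholders, and the commands assume FESTVOXDIR and ESTDIR are set up as described in the linked instructions; check the page itself for the authoritative sequence.

```shell
# Create a new voice directory and set it up for CLUSTERGEN.
mkdir cmu_us_awb
cd cmu_us_awb
$FESTVOXDIR/src/clustergen/setup_cg cmu us awb

# Import the recordings, then build prompts, label them with EHMM,
# and finally build the .utt files.
./bin/get_wavs recording/*.wav
./bin/do_build build_prompts
./bin/do_build label        # this is the step that runs EHMM
./bin/do_build build_utts   # produces festival/utts/*.utt
```

Once build_utts has finished, the .utt files are in place and you can stop here instead of continuing to the full CLUSTERGEN voice build.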

Read More

ZureTTS from the eNTERFACE’14 workshop

Last month (from June 9 to July 4), I had the chance to go to Spain to work on a project named ZureTTS (which means “your text-to-speech” in Basque) during the eNTERFACE’14 workshop. This was very exciting, considering the unique opportunity to work with graduate students and researchers in speech processing from all around Europe for 4 full working weeks (and the opportunity to live in Spain for a month as well).

Basically, ZureTTS is an online platform that lets anyone obtain a speech synthesis engine with their own voice. The technology was not new, but this was an effort to coordinate multiple modules and technologies into a complete and accessible system. Users can go to the website, register for an account, record their voice with 100 pre-defined sentences in a chosen language, and then start the adaptation process. The adaptation from the average/generic voice to the personalized voice is queued up and run sequentially in the background on the main server, which may take from 1 to 10 hours. After the adaptation has finished, a notification email is sent, and users can go back to the website to synthesize speech in their own voice. There is also a web API for synthesis so that mobile apps can plug into the system in the future.

I am very happy that we could complete the working system described above in only 4 weeks. Of course, there are still many possible improvements, but the basic flow of the system is complete, and the system is also quite decoupled and extensible. You can try it out at http://aholab.ehu.es/zuretts (the server sometimes blocks you for 20 to 30 minutes; I am still not sure why).

And below is a photo of the awesome team and awesome new friends.

From left to right: Haritz Arzelus, Rubén Pérez Ramón, Carmen Magariños, Igor Jauk, Agustín Alonso, Daniel Erro (the first project leader), Jianpei Ye, Xin Wang, Martin Sulír, me and Xiaohai Tian

And below is a photo of Prof. Inma Hernaez, our second project leader and the photographer of the photo above, and me.

Prof. Inma Hernáez and me

SailAlign and the error “ReadString: String too long”

If you have used SailAlign (or HTK) to do forced alignment on a large corpus, you may have already encountered the error ReadString: String too long. This error is actually thrown by HTK, and a quick search on the Internet turns up the web page below.


The solution according to the page is:

Make changes to the pronunciation dictionary:
Replace all multiple spaces with a single space;
Replace all tabs with a single space;
Put a backslash (\) before every double quote (");
Put a backslash (\) before any dictionary entry beginning with a single quote (')

And this actually solves the problem, which is quite frustrating since the error message “String too long” gives no clue about this solution. Moreover, you will also have to make the same changes to the transcript given to SailAlign to avoid hitting the same problem with HDecode.
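The changes above can be applied mechanically. Below is a minimal sed sketch, assuming a plain-text dictionary with one entry per line; the file names are placeholders:

```shell
# Collapse runs of spaces/tabs into a single space, then escape every
# double quote and any leading single quote so that HTK's ReadString
# does not treat them as the start of a quoted string.
sed -e 's/[[:space:]]\{1,\}/ /g' \
    -e 's/"/\\"/g' \
    -e "s/^'/\\\\'/" \
    lexicon.dict > lexicon.fixed.dict
```

The same pipeline can be pointed at the transcript files fed to SailAlign, since HDecode trips over the same characters.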

I had spent so much time checking the dictionary and reducing the length of the input data to get rid of the error, only to find out that those suspects were irrelevant. Fortunately, I found the real culprit right in the transcript, and at last SailAlign runs without a hitch.

How to create full-context labels for your HTS system (update: not really worked)

Update: I later found out that the method described below did not work as expected. Tricking Festival by simply providing it with a custom monophone transcript generates invalid .utt files. Creating the full-context labels from those .utt files will then give you only quinphones without any other linguistic context.

However, you can still use the script in the first part as the front-end for the TTS system (label/.utt generation using Festival). To create .utt files for training data, I have noted down a better way here: A better way to create the full-context labels for HTS training data.
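For reference, once valid .utt files exist, the HTS demos turn them into full-context labels with Festival's dumpfeats script plus a feature list and an awk formatting script shipped in the demo's data folder. Roughly, and with paths following the HTS demo layout (they may differ on your setup):

```shell
# dumpfeats ships with Festival (festival/examples/dumpfeats);
# extra_feats.scm, label.feats and label-full.awk come from the
# scripts/ directory of an HTS demo's data folder.
dumpfeats -eval scripts/extra_feats.scm \
          -relation Segment \
          -feats scripts/label.feats \
          -output tmp/%s.feats \
          utts/*.utt

# Format the raw per-segment feature dumps into full-context labels.
for f in tmp/*.feats; do
    b=$(basename "$f" .feats)
    awk -f scripts/label-full.awk "$f" > "labels/full/$b.lab"
done
```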


If you are familiar with the HTS demos, you probably know about their full-context label format. A full-context label looks like this:

The above line contains the phone identity and much of its linguistic context, including the 2 previous and 2 following phones, the position of the current phone in the current syllable, the position of the current syllable in the current word, stress, accent, and many other things. A detailed description of all those contexts is in lab_format.pdf inside the data folder of any HTS demo.

However, if you are building your own system, you may have trouble gathering all the context needed to create such long labels. In fact, HTS can still work with much shorter full-context labels containing much less information, but you should expect some degradation in the quality of the synthesized speech due to the shrinking of the decision tree. Fortunately, all the text analysis can actually be done automatically by Festival. I will show all the steps in the sections below.
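For a single sentence, the Festival side can be exercised directly from the command line. A quick sketch, assuming Festival is installed with an English voice loaded by default (the sentence and output file name are arbitrary):

```shell
# Run Festival's full text analysis and synthesis on one sentence in
# batch mode, then save the resulting utterance structure to hello.utt,
# which the label-dumping scripts can consume later.
festival -b '(utt.save (utt.synth (Utterance Text "Hello world.")) "hello.utt")'
```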

Continue reading the detailed steps

How to configure HTS for in-training synthesis with state-level alignment labels


Utilizing state-level alignment labels allows us to copy the prosody from one speaker and use it with another speaker’s acoustic model. This can be used to improve the synthesized results by combining prosody from natural speech with phone features from an HMM-based acoustic model. Moreover, since this technique can create phone-aligned parallel sentences from different acoustic models, we can also use it to generate comparable sentences where the quality of the vocoders or of the acoustic features in the training data can be compared separately from the duration models.
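For reference, HTS state-level alignment labels follow the HTK master label file convention: each line carries a start and an end time in 100 ns units, the full-context label, and the emitting-state index in square brackets (states 2 to 6 for the usual 5-state left-to-right HMM, since HTK reserves states 1 and 7 as non-emitting). The times below are made-up placeholders and the context part is elided with "...":

```
      0  250000 x^x-sil+dh=ax@...[2]
 250000  400000 x^x-sil+dh=ax@...[3]
 400000  550000 x^x-sil+dh=ax@...[4]
 550000  700000 x^x-sil+dh=ax@...[5]
 700000  850000 x^x-sil+dh=ax@...[6]
```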

Continue reading more on the steps