I have been using HTS for a while for my research on speech synthesis. Recently, I ran into some problems when I tried to configure the HTS demo with STRAIGHT features to use 16 kHz data instead of 48 kHz. I finally figured out how to do it properly, and it is really not as easy as changing one or two settings as in the other demos without STRAIGHT, so I decided to note all the steps down here.
Normally, we have to change the makefile in the data folder and make some small modifications to the Config.pm and Training.pl files in the scripts folder (we can also change the .in files directly).
makefile for training data
Obviously, we have to change SAMPFREQ to 16000, FRAMESHIFT to 80 (if you want to keep the 5 ms shift interval) and FREQWARP to 0.42. We might also change LOWERF0 and UPPERF0 if necessary.
You may notice that this makefile does not have a FRAMELEN setting like the other demos. The frame length is actually fixed at 2048 and written directly as an MGC extraction parameter:
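As a quick check of the arithmetic, the frame shift in samples is just the sampling frequency multiplied by the shift interval in seconds. A minimal sketch (the function name is only for illustration and is not part of HTS):

```python
def frame_shift_samples(sample_rate_hz: int, shift_ms: float = 5.0) -> int:
    """Return the frame shift in samples for a shift interval given in milliseconds."""
    return round(sample_rate_hz * shift_ms / 1000.0)


print(frame_shift_samples(48000))  # 240, the FRAMESHIFT used for 48 kHz data
print(frame_shift_samples(16000))  # 80, the FRAMESHIFT needed for 16 kHz data
```

If you prefer a different shift interval, pass it as the second argument and use the result as FRAMESHIFT.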
    $(MGCEP) -a $(FREQWARP) -m $(MGCORDER) -l 2048 -e 1.0E-08 -j 0 -f 0.0 -q 3 > mgc/$${base}.mgc;
I am still not sure why a FRAMESHIFT setting is provided but FRAMELEN is not. However, since the SP feature dimension changes from 1025 to 513 once we switch to the 16 kHz settings, we also have to change this frame length to get correct MGCs from the new SP inputs. Therefore, we can introduce a FRAMELEN = 1024 setting and replace every hard-coded 2048 with $(FRAMELEN).
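The link between the two numbers is the usual one for a real-valued FFT: a frame of length N yields N/2 + 1 distinct bins from 0 Hz to the Nyquist frequency. A small sketch of that relation (the helper names are hypothetical, chosen only for this illustration):

```python
def spectrum_bins(fft_length: int) -> int:
    """Number of distinct bins (0 Hz .. Nyquist) for a real FFT of this length."""
    return fft_length // 2 + 1


def fft_length_for(bins: int) -> int:
    """Inverse relation: the FFT length that produces the given number of bins."""
    return 2 * (bins - 1)


print(spectrum_bins(2048))  # 1025 values per frame
print(spectrum_bins(1024))  # 513 values per frame
print(fft_length_for(513))  # 1024, the value to use for FRAMELEN
```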
So below is the final speech analysis configuration (I also changed lower f0 to 50Hz for male voice):
    SAMPFREQ = 16000 # sampling frequency (16kHz)
    FRAMESHIFT = 80 # frame shift in points (80 = 16000 * 0.005)
    FRAMELEN = 1024 # frame length (changed from 2048)
    FREQWARP = 0.42 # frequency warping factor
    GAMMA = 0 # pole/zero weight for mel-generalized cepstral (MGC) analysis
    MGCORDER = 49 # order of MGC analysis
    LNGAIN = 1 # use logarithmic gain rather than linear gain
    LOWERF0 = 50 # lower limit for f0 extraction (Hz)
    UPPERF0 = 280 # upper limit for f0 extraction (Hz)
And the calls to MGCEP and LPC2LSP will be changed too (to add the FRAMELEN configuration):
    $(MGCEP) -a $(FREQWARP) -m $(MGCORDER) -l $(FRAMELEN) -e 1.0E-08 -j 0 -f 0.0 -q 3 > mgc/$${base}.mgc;
    $(MGCEP) -a $(FREQWARP) -c $(GAMMA) -m $(MGCORDER) -l $(FRAMELEN) -e 1.0E-08 -j 0 -f 0.0 -q 3 -o 4 | \
    $(LPC2LSP) -m $(MGCORDER) -s $${SAMPKHZ} $${GAINOPT} -n $(FRAMELEN) -p 8 -d 1.0E-08 > mgc/$${base}.mgc;
We are not finished yet. The BAP part, although it has no configurable parameters, also needs to change. This part reduces the dimension of the AP from 1025 (for 48 kHz training data) to 26 by averaging the AP over 26 separate bands. The problem is that we only have 513 AP values per frame with 16 kHz data, so the last 3 bands, which try to read beyond bin 512, will contain wrong data. We fixed this by scaling down the bandwidth of each band so that the block copy never goes past index 512. The result is below.
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 1 -s 0 -e 0 -S 0 | $(AVERAGE) -l 1 > bap01;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 2 -s 1 -e 2 -S 0 | $(AVERAGE) -l 2 > bap02;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 2 -s 3 -e 4 -S 0 | $(AVERAGE) -l 2 > bap03;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 2 -s 5 -e 6 -S 0 | $(AVERAGE) -l 2 > bap04;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 2 -s 7 -e 8 -S 0 | $(AVERAGE) -l 2 > bap05;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 2 -s 9 -e 10 -S 0 | $(AVERAGE) -l 2 > bap06;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 3 -s 11 -e 13 -S 0 | $(AVERAGE) -l 3 > bap07;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 3 -s 14 -e 16 -S 0 | $(AVERAGE) -l 3 > bap08;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 3 -s 17 -e 19 -S 0 | $(AVERAGE) -l 3 > bap09;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 3 -s 20 -e 22 -S 0 | $(AVERAGE) -l 3 > bap10;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 4 -s 23 -e 26 -S 0 | $(AVERAGE) -l 4 > bap11;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 4 -s 27 -e 30 -S 0 | $(AVERAGE) -l 4 > bap12;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 5 -s 31 -e 35 -S 0 | $(AVERAGE) -l 5 > bap13;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 6 -s 36 -e 41 -S 0 | $(AVERAGE) -l 6 > bap14;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 7 -s 42 -e 48 -S 0 | $(AVERAGE) -l 7 > bap15;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 8 -s 49 -e 56 -S 0 | $(AVERAGE) -l 8 > bap16;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 10 -s 57 -e 66 -S 0 | $(AVERAGE) -l 10 > bap17;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 12 -s 67 -e 78 -S 0 | $(AVERAGE) -l 12 > bap18;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 14 -s 79 -e 92 -S 0 | $(AVERAGE) -l 14 > bap19;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 19 -s 93 -e 111 -S 0 | $(AVERAGE) -l 19 > bap20;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 25 -s 112 -e 136 -S 0 | $(AVERAGE) -l 25 > bap21;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 34 -s 137 -e 170 -S 0 | $(AVERAGE) -l 34 > bap22;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 49 -s 171 -e 219 -S 0 | $(AVERAGE) -l 49 > bap23;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 80 -s 220 -e 299 -S 0 | $(AVERAGE) -l 80 > bap24;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 150 -s 300 -e 449 -S 0 | $(AVERAGE) -l 150 > bap25;
    $(X2X) +af $${ap} | $(BCP) +f -n 512 -L 63 -s 450 -e 512 -S 0 | $(AVERAGE) -l 63 > bap26;
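Since it is easy to make an off-by-one mistake when editing 26 lines by hand, it may be worth checking the band table before running the extraction. The sketch below is not part of HTS: it copies the -s / -e values from the commands above into a list and verifies that the bands are contiguous, start at bin 0 and end at bin 512. The returned widths should match the -L and -l options.

```python
# (start, end) bin indices taken from the -s / -e options of the 26 BCP calls.
BANDS = [
    (0, 0), (1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 13), (14, 16),
    (17, 19), (20, 22), (23, 26), (27, 30), (31, 35), (36, 41), (42, 48),
    (49, 56), (57, 66), (67, 78), (79, 92), (93, 111), (112, 136),
    (137, 170), (171, 219), (220, 299), (300, 449), (450, 512),
]


def check_bands(bands, n_bins=513):
    """Check that the bands tile bins 0 .. n_bins-1 without gaps or overlaps.

    Returns the width of each band (the value expected for -L and -l).
    """
    expected_start = 0
    for index, (start, end) in enumerate(bands, start=1):
        if start != expected_start:
            raise ValueError(f"band {index} starts at {start}, expected {expected_start}")
        if end < start:
            raise ValueError(f"band {index} has end {end} before start {start}")
        expected_start = end + 1
    if expected_start != n_bins:
        raise ValueError(f"bands cover {expected_start} bins, expected {n_bins}")
    return [end - start + 1 for start, end in bands]


widths = check_bands(BANDS)
print(len(widths), sum(widths))  # 26 513
```

If you change the band edges, update BANDS accordingly and rerun the check; a ValueError points at the first band that leaves a gap or overlaps its neighbour.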
I think we could also modify just the last 3 bands (e.g. remove 2 and modify 1), but I am not sure about the effect on the resolution of the first few bands.
Rounding error with STRAIGHT vocoder
There is a rounding error in STRAIGHT that causes the AP and SP of some sentences to have different numbers of frames. I am still not sure whether this would cause problems in the training phase, but to make sure the data is correct, I add two lines to the makefile right after ap and sp are returned from STRAIGHT so that all the dimensions are consistent:
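    echo "lmin = min([size(sp,2), size(ap,2), length(f0)]);" >> scripts/extract.m;
    echo "ap=ap(:,1:lmin);sp=sp(:,1:lmin);f0=f0(:,1:lmin);" >> scripts/extract.m;
Here size(x,2) counts the frames (columns). length(x) would return the largest dimension instead, which is the number of frequency bins whenever an utterance has fewer frames than bins.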
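The trimming idea itself is simple, and the following plain-Python sketch shows it on toy data. It is only an illustration (the demo does this step in MATLAB): it assumes sp and ap are stored as one row per frequency bin and one column per frame, that f0 is a flat list with one value per frame, and it also includes f0 when computing the common frame count.

```python
def trim_to_common_frames(sp, ap, f0):
    """Cut sp, ap (rows = bins, columns = frames) and f0 to the shortest frame count."""
    n_frames = min(len(sp[0]), len(ap[0]), len(f0))
    sp_trimmed = [row[:n_frames] for row in sp]
    ap_trimmed = [row[:n_frames] for row in ap]
    return sp_trimmed, ap_trimmed, f0[:n_frames]


# Toy data: 2 bins; sp and f0 have 5 frames, ap has only 4.
sp = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
ap = [[1, 2, 3, 4], [5, 6, 7, 8]]
f0 = [100.0, 101.0, 102.0, 103.0, 104.0]

sp_t, ap_t, f0_t = trim_to_common_frames(sp, ap, f0)
print(len(sp_t[0]), len(ap_t[0]), len(f0_t))  # 4 4 4
```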
Config.pm
For the Config.pm file, we just need to change some variables in the speech analysis part to match the changes in the makefile: $sr to 16000, $fs to 80 and $fw to 0.42.
Training.pl
We have to change the gen_wave function at the end of the file to make sure that the AP is generated correctly. Basically, we adapt this part to match the new bands and bandwidths in the makefile.
And now you are ready to use your 16kHz data with STRAIGHT vocoder!
Dear Nguyen Quy Hy,
Thank you very much for the information on how to configure HTS for 16 kHz data. I could follow everything except the last part. You mentioned that gen_wave in Training.pl needs to be modified to make sure that the AP is generated properly. Could you please tell me what needs to be modified for the bands and bandwidths you mentioned above?
Thanks and regards
Narendra N P
You may have noticed that I changed the bandwidths of the 26 aperiodicity bands (to approximately half of those in the demo).
So when recreating the aperiodicity (AP) from the banded AP, you should also halve the bandwidths there (right inside the gen_wave function).
However, I found that STRAIGHT can still regenerate the waveform without any modification to the gen_wave function, which means the generated AP will have double the dimension of the training data. Therefore, unless you are trying to use the AP features for further processing, you can just ignore that step.
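For anyone who does want a 513-dimensional AP, the expansion can be sketched as follows. This is not the code from gen_wave: it is a plain-Python illustration that assumes each of the 26 band values is simply repeated over the bins it was averaged from, using the same band edges as in the makefile above. Check the actual gen_wave code in your copy of Training.pl before relying on it.

```python
# (start, end) bin indices of the 26 bands used for the 16 kHz BAP extraction.
BANDS = [
    (0, 0), (1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 13), (14, 16),
    (17, 19), (20, 22), (23, 26), (27, 30), (31, 35), (36, 41), (42, 48),
    (49, 56), (57, 66), (67, 78), (79, 92), (93, 111), (112, 136),
    (137, 170), (171, 219), (220, 299), (300, 449), (450, 512),
]


def expand_bap(bap_frame, bands=BANDS):
    """Expand one frame of banded AP back to one value per bin.

    Assumption: every bin in a band receives that band's averaged value.
    """
    if len(bap_frame) != len(bands):
        raise ValueError(f"expected {len(bands)} band values, got {len(bap_frame)}")
    ap_frame = []
    for value, (start, end) in zip(bap_frame, bands):
        ap_frame.extend([value] * (end - start + 1))
    return ap_frame


frame = [float(i) for i in range(26)]  # dummy band values 0.0 .. 25.0
ap_frame = expand_bap(frame)
print(len(ap_frame))  # 513
```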
Thanks for the info.
Dear Nguyen Quy Hy,
Thank you very much. I followed your guide to configure HTS for 16 kHz, but in “Make file for training data” I don’t know how to scale down the bandwidth of each band so that the block copy will not exceed 512. Could you please explain in detail how to do that?
Thanks and regards
Nguyễn Văn Thịnh