Day 16 – Finale.Part 2 (Getting things done)

This is a continuation of Part 1, so preferably read that first.

Stitching the strands together

Imagine having a bunch of spaghetti strands (uncooked or cooked, doesn't matter). Let's say each strand of spaghetti is a sequence: an audio sequence, an RNN sequence, etc. Generating an audio sequence is like producing each noodle strand individually, and at every fixed interval, you make a cut. You collect all these strands and lay them out, end to end, as one long giant strand of spag (music). But it turns out that doing this means each new sequence starts off with the same "initial value", and sometimes it repeats itself over and over again. The major problem is that the first frame of the audio starts off not knowing where it ended last time. As you can see in the image below, the starting frame does not match the final frame of the previous sequence. You'll also notice that each new beginning tends to start near 0, or near whatever the initial estimated parameters were. So what we want to do is to somehow "connect" the spaghetti strands together instead of just lining them up.


So, instead of letting the initial RNN parameters be fixed, I "force" them to the last RNN parameters of the previous sequence. What I did was to let r_0 (the RNN input state) and c_0 (the LSTM cell state) be tensors, and take the previous sequence's r_last and c_last to replace them. A crude example below. Theoretically, this should remove the audio "jinks" and the music (if you can call it that) will flow smoothly.

Instead of leaving the sequences unconnected, we let the new sequence begin where we left off the last time.
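The state hand-off looks roughly like this. This is a crude NumPy sketch, not my actual Theano code: `run_sequence` is a hypothetical stand-in for the real LSTMP forward pass, but the key idea is the same: the final states of one sequence seed the next, instead of resetting to a fixed initial value each time.

```python
import numpy as np

def run_sequence(x_seq, h0, c0):
    # toy stand-in for one LSTMP forward pass over a sequence;
    # returns the outputs plus the final hidden and cell states
    h, c = h0, c0
    outputs = []
    for x in x_seq:
        c = 0.9 * c + 0.1 * x   # placeholder for the real LSTM cell update
        h = np.tanh(c)          # placeholder for the real hidden update
        outputs.append(h)
    return np.array(outputs), h, c

hidden = 4
h, c = np.zeros(hidden), np.zeros(hidden)   # fixed initial state, used only once
chunks = np.random.randn(3, 10, hidden)     # three sequences to generate
generated = []
for chunk in chunks:
    y, h, c = run_sequence(chunk, h, c)     # r_last / c_last carried forward
    generated.append(y)
audio = np.concatenate(generated)           # one long connected strand
```

Without the carried-over `h` and `c`, each chunk would start from the same zero state, which is exactly the repeated-reset artifact described above.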

Again, we trained a new model with this method, using the same parameters, but this time with a shorter recurrent length and a longer training time. (single layer LSTMP; rnn length=4000; input size=240; hidden units=512)


The loss curve does not show any signs of irregularities, so it might be good, yeah? Finally, I used this model to generate new audio with the methods explained in Part 1, but this time with the sequence-joining method added, and…


The winner

It actually works. I couldn't believe it myself. The "music" is now free from audio hiccups! (It gets a bit better towards the end.) It's not very interesting and it needs a little polishing (maybe more layers or hidden units), but I think this is the best result I can achieve. But all things must come to an end. This is the end of the project and of the IFT6266 course. I've learned a lot from this. Coming from a guy who is not a computer science major, and who was pretty much clueless about neural networks at the first lesson, I managed to produce something very cool.

Or perhaps what I'm doing is all just pure dumb luck ¯\_(ツ)_/¯

I'll probably continue working on this if I find the time in the future. Feel free to drop me a message if you find this interesting.

Day 15 – Finale.Part 1

We are coming to the end of the course, and I will try to explain in more detail my methodology for the training and audio generation procedures. In Part 1, I will explain how I used my training inputs.



First, I present the neural network architecture which I used: an LSTM with a recurrent projection layer (we'll call it LSTMP from here on). You can find the paper describing the architecture in my previous post. A simple diagram of the LSTMP is shown below:

The first thing you'll notice is that it is not an autoencoder. The input feature size is 40 whilst the target size is 1. It is not trying to reconstruct itself, but to predict future frames. An analogy would be listening to 1 minute of audio and trying to guess the next 10 seconds. The targets t are delayed such that they do not overlap the input.

Say we have an LSTMP network with a recurrent length of 100 and an input feature size of 40. Each iteration will use 4000 input frames and 100 target frames, with the target samples delayed by 4000 frames, i.e. input X = [x_1, x_2, …, x_T] and target t = [t_1, t_2, …, t_T], where each x is a vector of size 40.


 Arrangement of inputs

Typically, in a generative RNN machine, we want to generate only the final output, i.e. y_last, from the sequence. But this time, generating the whole y = [y_1, y_2, …, y_T] is more logical. This machine will generate sequences of frames, not just the next frame. This way, the output frames will have a smooth transitional flow from frame to frame.

Using the example of an RNN length of 100 and an input feature size of 40, a typical input-target pair in terms of frame sequence numbers will look like this (numbers in {} are audio frame numbers):

X_1: {1, 101, 201, 301, …, 3901}   ->   y_1: {4001}

X_2: {2, 102, 202, 302, …, 3902}   ->   y_2: {4002}

…

X_T: {100, 200, 300, 400, …, 4000}   ->   y_T: {4100}
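The arrangement above can be sketched in a few lines of NumPy (a hypothetical illustration, not my actual training code; `arrange` is just an illustrative helper name):

```python
import numpy as np

def arrange(audio, T=100, feat=40):
    """Arrange raw samples into the strided input X and delayed target t.

    X has shape (T, feat): row i holds frames i, i+T, ..., i+(feat-1)*T
    (1-indexed), and t holds the T frames that follow the input block.
    """
    X = audio[:feat * T].reshape(feat, T).T   # (T, feat) strided frames
    t = audio[feat * T : feat * T + T]        # frames 4001..4100 for T=100
    return X, t

audio = np.arange(1, 4201)   # use the frame numbers themselves as values
X, t = arrange(audio)
# X[0] holds frames {1, 101, ..., 3901}; t[0] is frame 4001
```

Reading off the rows of `X` reproduces the table above exactly.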

The loss between the output y = [y_1, …, y_T] and the targets t = [t_1, …, t_T] is minimized using MSE.



Generating audio works the same way as training. A seed sequence is taken from the original audio once,

seed -> X_0 = [x_1, x_2, …, x_T]

and this generates an output sequence -> y_i = [y_1, y_2, …, y_T]

This output sequence is stored in an array called G, and subsequent y_i's are concatenated to G, giving the final audio which you'll hear. This goes on and on: take the input sequence, output a brand new sequence, and concatenate it with the previous output sequences. But there is another very important point which deserves its own explanation…


Input “seed” sequence update procedure

Before you go "Hey! You are using the same seed sequence over and over again!": yes, I do change the input "seed" every time I generate a new output, so each time the generated audio is new, not a repetitive cycle. First, you'll notice that the outputs are sequential -> 4001, 4002, …, 4100, etc. We can "shift in" the generated sequence into X and "push out" the first column slice from X.

For example, in round no. 2 of audio generation, frames 1, 2, 3, …, 100 are removed from the front and new frames 4001, 4002, 4003, …, 4100 are added to the back of X, and this new X will generate the next set of frames 4101, 4102, …, 4200.

X_1: {101, 201, 301, …, 3901, 4001}   ->   y_1: {4101}

X_2: {102, 202, 302, …, 3902, 4002}   ->   y_2: {4102}

…

X_T: {200, 300, 400, …, 4000, 4100}   ->   y_T: {4200}

This procedure carries on for a number of iterations, and the array G is saved as a .wav file.
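The shift-in / push-out update can be sketched like this (a hypothetical NumPy illustration; `update_seed` is an illustrative name, and rows of X follow the strided layout from the tables above):

```python
import numpy as np

def update_seed(X, y):
    # drop the oldest frame from each row of X and append the frame
    # just generated for that row (push out front, shift in back)
    return np.concatenate([X[:, 1:], y[:, None]], axis=1)

T, feat = 100, 40
# X[i, j] = frame number (i+1) + j*T, matching the input arrangement
X = 1 + np.arange(T)[:, None] + T * np.arange(feat)[None, :]
y = feat * T + 1 + np.arange(T)   # generated frames 4001..4100
X2 = update_seed(X, y)
# row 0 of X2 is now {101, 201, ..., 3901, 4001}
```

Note that the feature size stays at 40: one old frame leaves each row as one new frame enters.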

In Part 2, I will continue the discussion and explain the audio "jink" you hear, and how I went about removing it (as much as possible). *A diagram is shown below.


Day 14 – Testing out new LSTM architectures Part II


So I have attempted both methods described in the previous post, and basically, neither is significantly different from a 2-Layer LSTM network.

But first, these are the loss-iteration plots:

LSTM: Original 2-Layer LSTM

LSTMP: 2-Layer LSTM + Recurrent Projection layer

LSTMOR: 2-Layer LSTM + Output Recurrent layer

To recap, LSTMP is LSTM with a Recurrent Projection layer, which means the output from the LSTM memory cell is parameterized before the "recurrent step", like so:

[screenshot: the LSTMP projection equations]

Disregarding the initial spike at the beginning, LSTMP has a better training loss up until about the 150th iteration; after that, there is no significant difference in training loss compared to LSTM or LSTMOR. But still, it would be nice to hear what each of the architectures generates, right?

It seems that even with similar training graphs, LSTMP and LSTMOR have slightly different-sounding reproductions. LSTMOR has less of a "static noise" while LSTMP seems to have more "dynamic range". Unless you use headphones to listen for the background "hum", the audio is almost indistinguishable from the original 2-Layer LSTM.

My conclusion: Based on my personal preference, I would choose the LSTMP version.





What's next? I will try to increase the number of hidden units and the training time, and see if that helps (I doubt it) improve anything.

Day 13 – Testing out new LSTM architectures Part 1

As I was looking for solutions to the "ticking" problem and for ways to improve the quality of synthesis, I chanced upon two papers on different LSTM architectures for acoustic modeling:

  1. Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition
  2. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis

The first paper describes a deep LSTM architecture which uses an RNN projection layer to improve speech recognition performance. A diagram is shown below. Instead of a hidden-to-hidden connection, an output projection layer is added and the projected output r_(t-1) is fed back into the hidden connection in place of h_(t-1).

[diagram: deep LSTM with recurrent projection layer]

The second paper is somewhat similar, but instead of a recurrent-to-hidden connection, it uses a recurrent-to-recurrent connection. This is simply an additional vanilla RNN on top of an LSTM network. Now, instead of the output being y_t = W_h h_t + b, it is

y_t = W_h h_t + W_y y_(t-1) + b

The authors hypothesize that the additional RNN layer can smooth the transition between acoustic features at consecutive frames, which might be exactly what is needed to remove the audio skip between sequence ends.
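As a rough sketch of that recurrent output layer (hypothetical NumPy code, not the paper's implementation; W_h, W_y and b are the two weight matrices and bias from the equation above, with y_0 = 0):

```python
import numpy as np

def recurrent_output_layer(H, W_h, W_y, b):
    # y_t = W_h h_t + W_y y_{t-1} + b, computed frame by frame
    y_prev = np.zeros(b.shape)
    ys = []
    for h_t in H:                       # H: (T, hidden) LSTM outputs
        y_prev = W_h @ h_t + W_y @ y_prev + b
        ys.append(y_prev)
    return np.array(ys)                 # (T, out)

rng = np.random.default_rng(0)
T, hidden, out = 5, 8, 1
H = rng.standard_normal((T, hidden))
y = recurrent_output_layer(H,
                           rng.standard_normal((out, hidden)),   # W_h
                           rng.standard_normal((out, out)),      # W_y
                           np.zeros(out))                        # b
```

Because each y_t depends on y_(t-1), consecutive output frames are coupled, which is where the hypothesized smoothing comes from.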

[diagram: LSTM with recurrent output layer]


These two methods seem promising, and I'll be attempting both very soon!

Day 12 – 2-Layer LSTM

I have made a 2-Layer LSTM, and it is giving me better sounding audio this time!

Some key changes include the bias of the forget gate. In this paper, the author suggested a high forget gate bias of 5 to encourage long-term learning, but using 5 in my LSTM did not work at all (it's giving me saturated values; perhaps it's remembering too much?). I reduced the bias to random_uniform(0., 1.) instead, and this works well, as attempted by Alex¹.
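For reference, the initialization I ended up with looks roughly like this (a NumPy sketch; my actual code uses the framework's random_uniform, and 300 hidden units as listed below):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_units = 300
# forget-gate bias drawn uniformly from [0, 1) instead of a constant 5,
# which saturated the gate values in my runs
b_forget = rng.uniform(0.0, 1.0, size=hidden_units)
```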

Another key change I've made is that I fixed the fading audio (it was a bug where I mistakenly normalized the newly generated signals again and again, resulting in rapidly diminishing values) and the way my audio is being generated (using the last batch instead of the first).

The graphical diagram below shows the training and generating phases of my NN. It is much the same as my previous drawing, but now, instead of the first batch from the generated signal, I use the last batch. This is because, during training, the targets lead the input by 1 batch size. Therefore, it would be illogical to append the first generated batch of audio to the input, as shown in the diagram. Using the last batch makes more sense.

Later on, I improved my LSTM module by adding a second layer after the first output. This is done by adding an extra "scan" function taking the y output from the prior scan as its input. I have also changed some of the hyperparameters:

  • Batch size: 40000 frames
  • N: 40
  • input length: 100 seconds
  • hidden units: 300
  • learning rate: 0.002
  • training iterations: 128 cycles
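The second-layer stacking described above can be sketched like this (a pure-Python illustration; `scan` is a minimal stand-in for theano.scan, and the lambdas are toy step functions, not real LSTM updates):

```python
def scan(step, inputs, init):
    """Minimal stand-in for theano.scan: threads state through a sequence."""
    state, outputs = init, []
    for x in inputs:
        y, state = step(x, state)
        outputs.append(y)
    return outputs

# toy step functions standing in for the two LSTM layers
layer1 = lambda x, h: (x + h, x + h)
layer2 = lambda x, h: (2 * x + h, h + 1)

ys1 = scan(layer1, [1, 2, 3], 0)   # first "LSTM" layer
ys2 = scan(layer2, ys1, 0)         # second layer consumes the first layer's y outputs
```

The point is simply that the second scan takes the first scan's output sequence as its input sequence, which is all the stacking amounts to.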

Lastly, I trained the LSTM network again using the above parameters, and plotted the cost (T.sqr(t-y)).mean()

The training error/loss is getting very close to 0, which means the y values are not far off from the t values. I could have trained for more than 128 iterations, but this seems to be good enough… for now.

The generated audio signal graph:


This shows 2 minutes of generated audio. As you can see, the fading effect is mostly gone (still present, but not as large as in the previous post).

Day 11 – Deeper analysis of the data

After the last post, it got me thinking: the generated audio seems to be consistent, but it's way too soft. So I dug deeper and found a very interesting property of .wav files.

The fuel converter generates int16 (signed 16-bit integer) data, which means the values are integers from -32768 to 32767. However, note that for the LSTM, the tensors are all floating point numbers (float64). That means the output is also a floating point number, so when it comes time to write back the .wav audio, should we use int16 or float64?

Some of us in the class have already noted that saving the audio as a float64 gives ear-popping static audio. Well, I have found out the reason for this.

.wav files can be read natively as either 16-bit integers (-32768 to 32767) or floating point (32/64-bit?) numbers in the range -1.0 to 1.0: two very distinct data types. In our case, when we do not normalize the input, the resulting output will go past the peak amplitude of a floating point .wav (-1.0 to 1.0) and produce loud static! Therefore, to solve this problem, we normalize the int16 values to (-1.0, 1.0) at the input level before passing them into the LSTM. If we do that, then the resulting float64 output will be "proper" .wav data, no amplification will be required, and the output will not sound so static-ky.

Therefore, I suggest that anyone who attempts to use the .wav file first normalize it (dividing by 0x8000, i.e. 2^15) before passing it into the neural network. The generated output will then sound sane. I have tried doing this and, sure enough, I get an audio signal that is clear as day (no amplification like in the previous post), but some issues still need to be solved, for example that the audio starts to fade after a long sequence.
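The normalization (and the inverse, for writing the generated audio back out) can be sketched like this; `to_float` and `to_int16` are hypothetical helper names for illustration:

```python
import numpy as np

def to_float(samples_int16):
    # map int16 samples into the float .wav range [-1.0, 1.0)
    return samples_int16.astype(np.float64) / 0x8000

def to_int16(samples_float):
    # inverse mapping, clipped to the valid int16 range, for saving as .wav
    return np.clip(samples_float * 0x8000, -32768, 32767).astype(np.int16)

x = np.array([-32768, 0, 16384], dtype=np.int16)
f = to_float(x)   # [-1.0, 0.0, 0.5]
```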


Finally, here is the 3rd audio. *YouTube flagged this as a copyright infringement, so I believe YouTube is onto something here 😛


Day 10 – Another audio, better this time

Compared to the previous audio, this time I used a larger sequence size (200 seconds) while keeping the input size the same (16000 samples), which means increasing the batch size (to 200 per minibatch).

Now, with a "longer" memory, the NN is able to generate some sort of audible audio. However, the raw audio file is too soft and I had to boost the amplification using Audacity (I don't know if that is cheating or not 😐). Maybe I could add an amplification post-processing step in the code?
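That post-processing step could look something like this (a hypothetical peak-normalization sketch, mimicking what Audacity's amplify effect does; `amplify` is an illustrative name):

```python
import numpy as np

def amplify(signal, peak=0.9):
    # scale so the loudest sample sits at `peak`, like boosting in Audacity
    return signal * (peak / np.max(np.abs(signal)))

quiet = np.array([0.01, -0.02, 0.005])
loud = amplify(quiet)   # loudest sample is now 0.9 in magnitude
```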

Output signal plot: