This is a continuation of Part 1; preferably, read that first.
Stitching the strands together
Imagine having a bunch of spaghetti strands (uncooked or cooked, doesn’t matter). Let’s say each strand of spaghetti is a sequence: an audio sequence, an RNN sequence, etc… Generating audio is like producing each noodle strand individually, making a cut at every fixed interval. You collect all these strands and lay them out, end to end, as one long giant strand of spag (the music). But it turns out that doing this means each new sequence starts off with the same “initial value”, and sometimes it repeats itself over and over again. The major problem is that the first frame of each sequence starts off not knowing where the previous one ended. As you can see in the image below, the starting frame does not match the final frame of the previous sequence. You’ll also notice that each new beginning tends to start near 0, or near whatever the estimated initial state was. So what we want to do is somehow “connect” the spaghetti strands together instead of just lining them up.
So, instead of letting the initial RNN state be fixed, I “force” it to the last RNN state of the previous sequence. What I did was to make r_0 (the initial hidden state) and c_0 (the initial LSTM cell state) tensors, and replace them with the previous sequence’s r_last and c_last. A crude example is below. Theoretically, this should remove the audio “jinks” and the music (if you can call it that) will flow smoothly.
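Here is a minimal sketch of that idea, written in PyTorch as an assumption (the framework isn’t named here, and any library that exposes the LSTM state works the same way); the model, the seed frame, and the chunking scheme are hypothetical stand-ins for the actual setup. The only point that matters is that (h, c) is never reset between chunks:

```python
import torch
import torch.nn as nn

# Hypothetical model matching the sizes used in this post
INPUT_SIZE, HIDDEN_SIZE = 240, 512
lstm = nn.LSTM(INPUT_SIZE, HIDDEN_SIZE, num_layers=1, batch_first=True)
proj = nn.Linear(HIDDEN_SIZE, INPUT_SIZE)  # map hidden state to the next frame

def generate(seed, n_chunks, chunk_len):
    """Generate n_chunks sequences of chunk_len frames each, carrying state over."""
    frames = []
    # r_0 / c_0: zeros for the very first chunk only
    h = torch.zeros(1, 1, HIDDEN_SIZE)
    c = torch.zeros(1, 1, HIDDEN_SIZE)
    x = seed  # shape (1, 1, INPUT_SIZE)
    with torch.no_grad():
        for _ in range(n_chunks):
            chunk = []
            for _ in range(chunk_len):
                out, (h, c) = lstm(x, (h, c))
                x = proj(out)
                chunk.append(x)
            frames.append(torch.cat(chunk, dim=1))
            # Crucially, (h, c) is NOT reset here: the next chunk starts
            # from r_last / c_last of this one, joining the strands.
    return torch.cat(frames, dim=1)
```

With this, the generated chunks can be concatenated directly, and there is no discontinuity at the chunk boundaries because each chunk picks up exactly where the previous one left off.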
Again, we trained a new model using this method, with the same parameters, but this time with a shorter recurrent length and a longer training time (single-layer LSTMP; RNN length = 4000; input size = 240; hidden units = 512).
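For reference, here is a hedged sketch of what that configuration could look like, again assuming PyTorch, which exposes LSTMP (an LSTM with a recurrent projection) via the proj_size argument; the projection size itself is an assumption, since it isn’t stated here:

```python
import torch.nn as nn

SEQ_LEN     = 4000  # recurrent length (frames per training sequence)
INPUT_SIZE  = 240   # samples per frame
HIDDEN_SIZE = 512   # LSTM hidden units
PROJ_SIZE   = 256   # hypothetical: the post doesn't give the projection size

# Single-layer LSTMP: PyTorch implements the recurrent projection via proj_size
lstmp = nn.LSTM(INPUT_SIZE, HIDDEN_SIZE, num_layers=1,
                proj_size=PROJ_SIZE, batch_first=True)
readout = nn.Linear(PROJ_SIZE, INPUT_SIZE)  # predict the next 240-sample frame
```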
The loss curve does not show any signs of irregularities, so it might be good, yeah? Finally, I used this model to generate new audio with the methods explained in Part I, but this time with the sequence-joining method added, and…
It actually works. I couldn’t believe it myself. The “music” is now free from audio hiccups! (It gets a bit better towards the end.) It’s not very interesting and it needs a little polishing (maybe more layers or hidden units), but I think this is the best result I can achieve. But all things must come to an end: this is the end of the project and of the IFT6266 course. I’ve learned a lot from this. Coming from a guy who is not a computer science major, and who was pretty much clueless about neural networks at the first lesson, I managed to produce something very cool.
Or perhaps what I’m doing is all just pure dumb luck ¯\_(ツ)_/¯
I’ll probably continue working on this if I find the time in the future. Feel free to drop me a message if you find this interesting.