Name: Krishna Subramani
Supervisor: Prof. Preeti Rao
Dept: Electrical Engineering
1. Generative Synthesis

When you hear about audio or music synthesis, one of the classic synthesizers by Yamaha or Casio probably comes to mind. Indeed, audio synthesis is ‘synthesizing’ music by controlling parameters like the pitch (the notes being played), the loudness and the timbre (the instrument being played). Classical methods of audio synthesis use Physical Modeling: they set up a physical system that models the instrument physics involved in sound generation.

Recently, with the advent of data-driven statistical modeling and the availability of abundant computing power with GPUs, researchers have begun using Deep Learning for audio synthesis. These models primarily rely on the ability of neural networks to extract musically relevant information from large collections of available recordings. Instead of explicitly modeling the complicated instrument physics, neural networks can implicitly learn the complex factors underlying the sound. Thus, a model trained on a dataset of isolated musical notes played with different styles and loudness can generate natural-sounding audio.

A specific challenge in the context of Indian music is the continuous nature of its musical attributes. For example, the Violin in Carnatic Music from South India produces melodies with continuously varying pitch and loudness dynamics.

One class of models popularly used for audio synthesis is ‘Generative Models’ (thus Generative Synthesis). These try to model the underlying distribution on which the data points lie. Once that distribution is learnt, the model can ‘sample’ new data points from it. In the context of audio, you can think of these as models from which we can sample and thus ‘generate’ new audio!

An immediate application of our work is an automatic accompaniment system for singing voice, as depicted in Figure 1. Suppose a singer is singing a song, and you want a musical instrument to accompany them. Common accompaniments in Indian Classical Music are the harmonium in Hindustani Music and the Violin in Carnatic Music. With the singer’s melodic pitch tracked, it would be possible to ‘generate’ accompaniment for the song by giving this pitch as input to a pre-trained model. Figure 1 shows the flow to synthesize a violin accompaniment for a Carnatic song.
2. Our System

Audio can be represented by its time-domain samples. Alternatively, it can be represented by its spectrum via the Fourier Transform. Both of these representations, however, fail to take into account the inherent structure in instrumental audio. One such structure is that instrumental audio is harmonic, i.e. its frequency components are integer multiples of a fundamental frequency (the pitch). Thus, in essence, all the information you need to model the audio is the amplitudes of the harmonic frequencies and not the complete spectrum.

The magnitude spectrum can be compactly represented by a ‘Spectral Envelope’ as shown in Figure 2. It is a smooth function that gives us the amplitudes of the harmonics as a function of frequency. This is the audio representation that we use as input for our model. To synthesize audio from it, one can obtain the harmonic amplitudes by sampling this function at the harmonic frequencies, and then write the audio as a sum of sinusoids with these amplitudes and frequencies.

Our choice of generative model is a Conditional Variational Autoencoder, which is trained with this spectral envelope as input. An autoencoder is a neural network that is trained to obtain a compact lower-dimensional representation of the input data. A Variational Autoencoder constrains this lower-dimensional representation to follow a Gaussian distribution, thus allowing us to easily sample from it and synthesize audio from the network. By using this musically motivated parametric representation of audio, the neural network can learn a rich set of features that allows us to synthesize high-quality audio. Conditioning on additional variables like the pitch or loudness gives us more control over the audio we can generate from the network.
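The synthesis step described in this section (sample the spectral envelope at the harmonic frequencies, then sum sinusoids) can be illustrated with a minimal sketch. It is not our actual implementation. The function names `synthesize_harmonic` and `toy_envelope`, the sample rate, the note duration, and the exponentially decaying envelope are all assumptions made for illustration. The sketch also assumes a constant pitch, zero initial phase, and an envelope that returns linear amplitudes.

```python
import numpy as np


def toy_envelope(freq_hz):
    """Hypothetical smooth spectral envelope: linear amplitude vs. frequency (Hz).

    A real envelope would come from analysis of a recording or from the model;
    this decaying exponential is only a stand-in.
    """
    return np.exp(-np.asarray(freq_hz, dtype=float) / 2000.0)


def synthesize_harmonic(envelope, f0, sr=16000, duration=0.5):
    """Synthesize a constant-pitch note as a sum of harmonic sinusoids.

    envelope : callable mapping frequencies in Hz to linear amplitudes
    f0       : fundamental frequency (pitch) in Hz
    sr       : sample rate in Hz (assumed value)
    duration : note length in seconds (assumed value)
    """
    t = np.arange(int(sr * duration)) / sr

    # Harmonic frequencies are integer multiples of f0, kept below Nyquist.
    n_harmonics = int((sr / 2) // f0)
    freqs = f0 * np.arange(1, n_harmonics + 1)
    freqs = freqs[freqs < sr / 2]

    # Sample the envelope at the harmonic frequencies to get the amplitudes.
    amps = envelope(freqs)

    # Sum of sinusoids: one row per harmonic, then add the rows together.
    partials = amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t[None, :])
    audio = partials.sum(axis=0)
    return audio, freqs, amps


# Example: an A4 note (440 Hz) shaped by the toy envelope.
audio, freqs, amps = synthesize_harmonic(toy_envelope, f0=440.0)
```

Changing `f0` while keeping the same envelope moves the harmonics along the same smooth curve, which is why the envelope is a convenient representation to pair with pitch as a conditioning variable.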
The major advantage of our approach is the ability of the network to synthesize notes at pitches that it has not been trained on. One cannot possibly obtain data for all the pitches one wants to generate (especially on an instrument that can produce a continuously varying pitch, like the violin playing a Carnatic ornamentation). With the parametric representation we use, the network learns to ‘interpolate’ between the pitches it has been trained on, which allows it to generalize better. Other advantages are the need for less training data, smaller networks and fast training (under an hour on a laptop GPU!). We have published and presented our work at ICASSP 2020 (paper, presentation). You can listen to the audio examples on our accompanying webpage, and our code is openly available here.