Generative Models: Do not believe anything you hear and see

BLOGS

Generative Models: Do not believe anything you hear and see

By Nirmesh Shah, Senior Research Scientist at Sony Research India

8^th March 2022

Nirmesh Shah, Senior Research Scientist at Sony Research India, educates us about the wide array of possibilities using generative models and that while they have several applications in the multimedia industry, they do not come without their share of caveats.

In 1845, Edgar Allan famously wrote in one of his short stories, “Believe nothing you hear, and only one half that you see” to encourage skepticism and with an intention of teaching people not to trust gossip. However, the same quote can be extended to “Do not believe anything you hear and see” in today’s era of deep learning. The advent of huge computational powers and the rapid advancement in deep learning technologies have made it possible to generate audio and video clips that seem very real to us. Broadly, deep learning models can be classified in two categories, discriminative models and generative models.

Recent deep generative models are especially capable of synthesizing such realistic video and audio files.

The key goal of generative models is to estimate the density of the large amount of data collected in some domain and then generate new data very similar to it. This can be primarily achieved in two ways, explicit and implicit. Explicit density models define explicit density functions and estimate parameters of this function during training. On the other hand, implicit density models define a stochastic procedure that can directly generate the data. PixelCNN, pixelRNN, variational autoencoders (VAEs), energy-based generative models, flow-based generative models come under the category of explicit models. Generative adversarial networks (GANs), markov chain-based models come under the category of implicit models.

(a) VAE

(b) GAN

Figure: Schematic representation of (a) VAEs [1] and (b) GANs[2]

In VAEs, instead of directly extracting the latent features from the input data, we are trying to represent this problem in the framework of probabilistic graphical models. Basically, we are maximizing a lower bound on the log likelihood of data. On the other hand, GAN consists of two components, generator, and discriminator. The goal of the former is to generate data and the goal of the latter is to check whether data is coming from the true data distribution or model distribution (i.e., the generated sample is real or fake). Whenever a discriminator notices a difference between two distributions, the generator will adjust its parameters further to generate more realistic samples via reducing the distance between the two distributions. Both methods have their pros and cons. For instance, GANs can produce shared and very realistic data but are very hard to train. On the other hand, training, and inference of VAEs are sophisticated, but they produce blurry samples. Still, most of the approaches have overcome their limitations in the last couple of years and are now able to generate very realistic samples.

When it comes to the audio side, text-to-speech (TTS) technologies are used to synthesize any written text to different languages. Also, voice cloning, and voice conversion technologies can alter perceived speaker identity and/or a style of any source audio speech signal from any source speaker into desired target speaker. Similarly, image-to-image and video-to-video style conversion techniques are used to modify source image or video frames into desired target image or video frames to change facial characteristics, lip movements, etc. to alter any video. In core, all these technologies use deep generative models. These artificial video generation techniques are popularly known as Deep Fakes.

Since these technologies are very powerful and can easily manipulate any video or audio clips, it can have negative implications on society and can even spread hatred. Hence, I would like to conclude by emphasizing that “it would be advisable to never believe anything you see or hear on the internet!”

References:

[1] Van Den Oord, Aaron, and Oriol Vinyals. “Neural discrete representation learning.” Advances in neural information processing systems 30 (2017).

[2] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems 27 (2014).