Your computer can be the next Monet.

Photo by Birmingham Museums Trust on Unsplash

No, I am not asking you to build a robot and teach it to paint. I am talking about generative adversarial networks (GANs). GANs have revolutionized the world of AI by providing one of the cleverest ways of producing material such as speech, pictures, and music that is strikingly similar to what humans produce. These algorithms learn from their inputs, capture the most important features, and produce outputs that share the characteristics of the input provided. For example, if you feed a GAN one thousand images of the actors who played James Bond, chances are the ‘generated’ image could look as charming as Roger Moore in his prime.

Let’s dive in deep a little!

An introduction to GANs

GANs, or generative adversarial networks, are, simply put, advanced algorithms based on neural networks that can learn from data and generate entirely new data based on the features they have learnt. Grab a coffee, and let’s understand this with an example.

Jerry runs a graphics store, where a Hollywood director approaches him to build an audience graphic consisting of numerous smiling people cheering for their favourite actor. Jerry could harness the power of GANs and generate new faces from a set of images he already has. This way, the whole problem of assembling people, taking their photos, and asking their permission is replaced by a single AI algorithm. With just a little graphic simulation, the box office is one movie away from a packed Friday evening.

How does GAN work?

Let’s assume we are building a GAN for image generation. The business interest is to generate new images from a set of real photos. The system consists of a generator network and a discriminator network; as fancy as they might sound, they are actually quite simple. The generator is like one of those street-smart lawyers who tries to fabricate new evidence (here, images) from the evidence he saw at the crime scene (the training-set images). The discriminator is the jury: it decides whether a given sample is genuine (drawn from the training set) or fabricated by the generator. This real-versus-fake detection is a binary classification task, with a sigmoid function running under the hood.
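The jury’s verdict really is that simple under the hood: score the sample, squash the score through a sigmoid, and read the result as a probability. Here is a minimal sketch in NumPy; the feature vector, weights, and function names are illustrative assumptions, not part of any particular GAN library:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real-valued score into (0, 1), read as P(sample is real).
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(image_features, weights, bias):
    # A toy linear discriminator: score the features, then squash the score.
    return sigmoid(np.dot(weights, image_features) + bias)

# Illustrative numbers only.
weights = np.array([0.8, -0.5, 0.3])
features_real = np.array([1.2, 0.1, 0.9])   # pretend these came from a real photo
p_real = discriminator(features_real, weights, bias=0.1)
print(p_real)   # a probability strictly between 0 and 1
```

A real discriminator replaces the single dot product with a deep convolutional network, but the final sigmoid verdict is the same.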

An interesting thing to note here is that the discriminator estimates the probability that a sample comes from the real data rather than from the generator, and then applies a threshold probability cut-off to that estimate. This cut-off can be tuned based on the business costs of false positives and false negatives.
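Turning that probability into a hard real-or-generated call is just a threshold comparison. The 0.5 default and the stricter 0.8 cut-off below are made-up numbers for illustration:

```python
def classify(p_real, threshold=0.5):
    # Above the cut-off we call the sample "real", otherwise "generated".
    return "real" if p_real >= threshold else "generated"

# If false positives (fakes passed off as real) are costly for the business,
# raise the bar before accepting a sample as real.
print(classify(0.62))                  # "real" at the default cut-off
print(classify(0.62, threshold=0.8))   # "generated" under a stricter cut-off
```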

Running a GAN

Now that we know how GANs work, let’s understand the process of training one. In the first step, we train the discriminator and freeze the generator; that is, the generator only does a forward pass, and back-propagation through it is restricted. In effect, we are showing the jury both the actual evidence from the crime scene and the generator’s fabrications, so that it learns to tell the two apart.

The second step is training the generator while freezing the discriminator. We take the discriminator’s feedback from the first phase and use it to produce better samples than before, so as to fool the discriminator more effectively. This is essentially a tuning mechanism that makes the generator smarter at generating convincing samples.
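The two-step dance above can be sketched end to end with a toy one-dimensional GAN: the real “data” are numbers clustered near 4.0, the generator is a single scale-and-shift, and the discriminator is logistic regression on a scalar. All names, learning rates, and gradient formulas here are an illustrative hand-derived sketch; real GANs use deep networks and a framework’s automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Generator g(z) = a*z + b, discriminator D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0          # generator parameters
w, c = 0.0, 0.0          # discriminator parameters
lr, batch = 0.05, 64

for step in range(2000):
    # --- Step 1: train the discriminator, generator frozen ---
    real = rng.normal(4.0, 0.5, batch)          # "crime-scene evidence"
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b                            # the generator's forgeries
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    # Gradient ascent on log D(real) + log(1 - D(fake)); a and b untouched.
    w += lr * np.mean((1 - d_real) * real - d_fake * fake)
    c += lr * np.mean((1 - d_real) - d_fake)

    # --- Step 2: train the generator, discriminator frozen ---
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b
    d_fake = sigmoid(w * fake + c)
    # Gradient ascent on the non-saturating objective log D(fake); w, c untouched.
    a += lr * np.mean((1 - d_fake) * w * z)
    b += lr * np.mean((1 - d_fake) * w)

print(f"generator offset b = {b:.2f}")   # should drift toward the real mean of 4.0
```

“Freezing” here is nothing mysterious: each phase simply refrains from updating the other network’s parameters.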

An intelligent generator

The story so far hints that fine-tuning the generator to pick up certain ‘style-based’ learning can open new dimensions for sample creation. Think about it: a regular generator can only find the most obvious features, such as gender, age, hair length, glasses, and pose, combine them with secondary features such as skin tone and texture, and produce samples. That is not bad, but the samples are generated by a single-tone logic of combining A and B to get C.

A generator tuned to think of an image as a collection of ‘styles’, on the other hand, opens the door to a multitude of new combinations of each style, resulting in finer attention to facial features. Such a system, where each style controls effects at a particular scale, can produce a well-balanced mix of coarse, middle, and fine styles, and hence better sample images. In business terms, that means a lot of unique, life-like images with a humane touch.

Another advantage of these tuned generators is that they automatically separate inconsequential variation from high-level attributes such as facial pose, identity, and symmetry. This means we filter out the images that aren’t adding value: the ones that are mere variations of pose or identity.

We can also choose the strength at which each style is applied with respect to an ‘average face’, and fine-tune different types of noise, such as coarse noise (curling of hair) and fine noise (finer details), to produce results that are more life-like. For business, this indicates better accuracy without human supervision. To break it down: taking a base image and filling in the details using styles in an exhaustive manner is what makes this generator better.
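The “styles at different scales, relative to an average face” idea can be caricatured with two latent vectors: one supplies the coarse styles (pose, face shape), the other the fine styles (texture, hair detail), and a strength knob scales each style’s deviation from the average. Everything below — the layer count, the split point, the variable names — is an illustrative assumption, not the actual StyleGAN architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, dim = 8, 4        # pretend synthesis layers and style-vector size
w_avg = np.zeros(dim)       # stand-in for the 'average face' in style space

def mix_styles(w_coarse, w_fine, strength=1.0, split=4):
    # Early layers (coarse scales) take one style source, later layers (fine
    # scales) take the other; 'strength' scales deviation from the average face.
    per_layer = [w_coarse if i < split else w_fine for i in range(n_layers)]
    return [w_avg + strength * (w - w_avg) for w in per_layer]

w1, w2 = rng.normal(size=dim), rng.normal(size=dim)
styles = mix_styles(w1, w2, strength=0.7)
print(len(styles))   # one style vector per synthesis layer -> 8
```

Setting `strength` to 0 collapses every layer onto the average face; values in between trade variety for plausibility.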

I hope that after reading this, you won’t be surprised if the next Picasso is a black box that speaks binary and consumes Nvidia graphics cards for breakfast!

Freelance Writer; Blogger; Automotive engineer; Journalist; ex-broadcaster