# OpenAI post about Generative Models: an example of excellence in R&D

The last post of this week is a share of a blog post from the excellent organization OpenAI. OpenAI stands out for the overall quality of its work, but above all because it is completely open, and open source, about its Artificial Intelligence (AI) research and the development of its algorithms. The researchers who make it up are, almost without exception, PhDs in Computer Science, Machine/Deep Learning, AI or related scientific subjects such as Data Science and Statistics (the informed reader knows that Machine/Deep Learning are not independent fields of study with doctorates of their own, but specializations within Computer Science, so that list should not be taken too literally).

OpenAI maintains an excellent online presence, and its blog posts are highly recommended reading. Today I would like to share one such post that I encountered while doing my daily research briefing on AI subject matter. It is about generative models, and the flurry of activity these models are creating within the AI research community amounts to justified excitement. These models are demonstrating an important capacity to significantly improve machine/deep learning pipelines of all sorts, but the main excitement has revolved around Computer Vision and Natural Language Processing applications. The emergence of GANs (generative adversarial networks), for instance, has found potential in social media environments and other settings. But generative models are a broad category of models and conceptual frameworks with applicability across the fields mentioned in the first paragraph. Listing them as in the Wikipedia entry is also useful for our purposes here (with the corresponding links):

Examples of generative models include:

Any one of these links would merit a full blog post or a much more detailed and substantive research effort – indeed, many of these models are already substantial research programs in their own right – but the list gives a picture of what generative models are about.

Coming back to the OpenAI post, it too is loaded with links pointing to significant research efforts, but with specific AI and/or deep learning applications in mind. With this contribution from The Information Age I will try to do justice to some of the passages I found important and worth pursuing further.

### RESEARCH #1 – Generative Models

One of our core aspirations at OpenAI is to develop algorithms and techniques that endow computers with an understanding of our world.

It’s easy to forget just how much you know about the world: you understand that it is made up of 3D environments, objects that move, collide, interact; people who walk, talk, and think; animals who graze, fly, run, or bark; monitors that display information encoded in language about the weather, who won a basketball game, or what happened in 1970.

This tremendous amount of information is out there and to a large extent easily accessible — either in the physical world of atoms or the digital world of bits. The only tricky part is to develop models and algorithms that can analyze and understand this treasure trove of data.

Generative models are one of the most promising approaches towards this goal. To train a generative model we first collect a large amount of data in some domain (e.g., think millions of images, sentences, or sounds, etc.) and then train a model to generate data like it. The intuition behind this approach follows a famous quote from Richard Feynman:

“What I cannot create, I do not understand.”

The quote from Richard Feynman above summarizes the way generative models work spot on. It is almost as if the models were the brain of a scientist or engineer trying to make sense of the data and information arriving at their input sensors. This is the crucial bit about generative models: the model has far fewer parameters than the number of data points it must fit in order to generate or understand a particular dataset.
(…)
The trick is that the neural networks we use as generative models have a number of parameters significantly smaller than the amount of data we train them on, so the models are forced to discover and efficiently internalize the essence of the data in order to generate it.

Generative models have many short-term applications. But in the long run, they hold the potential to automatically learn the natural features of a dataset, whether categories or dimensions or something else entirely.

The characteristic that allows these models to operate this way is that they are probabilistic models of all the available data fed into them, rather than models only of the target variables conditional on the observed variables, as is the case with discriminative models (such as support vector machines, logistic and linear regression classifiers, or random forests). This allows generative models to simulate (generate) values of any variable in the model, whereas a discriminative model only allows sampling of the target variables conditional on the observed quantities. In this way more complex features of the dataset may be captured by the model (the cost inevitably falls on accuracy and on the uncertainty about whether the particular set of parameters is doing its job well, with the model assuming more than is strictly necessary…).
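
To make the distinction concrete, here is a toy sketch of my own (not code from the OpenAI post) contrasting a simple generative model, which learns a distribution over the inputs themselves and can therefore sample new data points, with a discriminative classifier, which only models the target conditional on the inputs:

```python
# A toy illustration: a generative model learns a distribution over the inputs
# and can therefore sample ("simulate") new data points, while a discriminative
# model only learns p(target | inputs) and cannot generate new inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (500, 2)),   # class 0 cluster
               rng.normal(+2.0, 1.0, (500, 2))])  # class 1 cluster
y = np.repeat([0, 1], 500)

# Generative: a Gaussian mixture models p(x) over all the data, so we can sample from it.
generative = GaussianMixture(n_components=2, random_state=0).fit(X)
new_points, _ = generative.sample(5)              # brand-new, simulated data points

# Discriminative: logistic regression models only p(y | x) for given inputs.
discriminative = LogisticRegression().fit(X, y)
class_probs = discriminative.predict_proba(X[:5]) # conditional predictions, no sampling of x

print(new_points)
print(class_probs)
```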

Research in this area is also significant from the point of view of overcoming the obstacles of the supervised learning methods behind most other models in machine learning. Generative models are generally unsupervised learning models, and as such they normally integrate well with related techniques such as reinforcement learning, helping the quest for ever more robust AI models and pointing towards a future of true Artificial General Intelligence (AGI).

### Generating images

Let’s make this more concrete with an example. Suppose we have some large collection of images, such as the 1.2 million images in the ImageNet dataset (but keep in mind that this could eventually be a large collection of images or videos from the internet or robots). If we resize each image to have width and height of 256 (as is commonly done), our dataset is one large 1,200,000x256x256x3 (about 200GB) block of pixels. Here are a few example images from this dataset:
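
As a quick sanity check on the size quoted above, the arithmetic works out as follows (assuming one byte per colour channel):

```python
# Back-of-the-envelope size of a 1,200,000 x 256 x 256 x 3 block of pixels,
# assuming one byte per colour channel.
n_images, height, width, channels = 1_200_000, 256, 256, 3
total_bytes = n_images * height * width * channels
print(f"{total_bytes / 1e9:.0f} GB")  # ~236 GB, i.e. on the order of the ~200GB quoted
```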

These images are examples of what our visual world looks like and we refer to these as “samples from the true data distribution”. We now construct our generative model which we would like to train to generate images like this from scratch. Concretely, a generative model in this case could be one large neural network that outputs images and we refer to these as “samples from the model”.

### DCGAN

One such recent model is the DCGAN network from Radford et al. (shown below). This network takes as input 100 random numbers drawn from a uniform distribution (we refer to these as a code, or latent variables, in red) and outputs an image (in this case 64x64x3 images on the right, in green). As the code is changed incrementally, the generated images do too — this shows the model has learned features to describe how the world looks, rather than just memorizing some examples.

The network (in yellow) is made up of standard convolutional neural network components, such as deconvolutional layers (reverse of convolutional layers), fully connected layers, etc.:
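
For readers who prefer code, here is a rough PyTorch sketch of a DCGAN-style generator in the spirit of Radford et al.; the layer sizes follow the commonly used reference architecture and are my assumption, not necessarily the exact network shown in the post:

```python
# A rough sketch of a DCGAN-style generator: a 100-number code goes in,
# a 64x64x3 image comes out, built from deconvolutional (transposed conv) layers.
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, code_dim=100, base_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            # code (code_dim x 1 x 1) -> 4x4 feature map
            nn.ConvTranspose2d(code_dim, base_channels * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(base_channels * 8), nn.ReLU(inplace=True),
            # 4x4 -> 8x8
            nn.ConvTranspose2d(base_channels * 8, base_channels * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 4), nn.ReLU(inplace=True),
            # 8x8 -> 16x16
            nn.ConvTranspose2d(base_channels * 4, base_channels * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 2), nn.ReLU(inplace=True),
            # 16x16 -> 32x32
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels), nn.ReLU(inplace=True),
            # 32x32 -> 64x64, 3 colour channels, values in [-1, 1]
            nn.ConvTranspose2d(base_channels, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, code):
        return self.net(code)

# 100 random numbers in, one 64x64x3 image out:
code = torch.rand(1, 100, 1, 1)     # uniform code, as described in the post
image = DCGANGenerator()(code)      # shape: (1, 3, 64, 64)
```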

(…)

DCGAN is initialized with random weights, so a random code plugged into the network would generate a completely random image. However, as you might imagine, the network has millions of parameters that we can tweak, and the goal is to find a setting of these parameters that makes samples generated from random codes look like the training data. Or to put it another way, we want the model distribution to match the true data distribution in the space of images.

(…)

### A simple explanation of Generative Adversarial Networks (GANs)

The following paragraphs in the post provide one of the most elegant, yet simple, explanations of Generative Adversarial Networks (GANs) I have come across.

### Training a generative model

Suppose that we used a newly initialized network to generate 200 images, each time starting with a different random code. The question is: how should we adjust the network’s parameters to encourage it to produce slightly more believable samples in the future? Notice that we’re not in a simple supervised setting and don’t have any explicit desired targets for our 200 generated images; we merely want them to look real.

One clever approach around this problem is to follow the Generative Adversarial Network (GAN) approach. Here we introduce a second discriminator network (usually a standard convolutional neural network) that tries to classify if an input image is real or generated. For instance, we could feed the 200 generated images and 200 real images into the discriminator and train it as a standard classifier to distinguish between the two sources. But in addition to that — and here’s the trick — we can also backpropagate through both the discriminator and the generator to find how we should change the generator’s parameters to make its 200 samples slightly more confusing for the discriminator. These two networks are therefore locked in a battle: the discriminator is trying to distinguish real images from fake images and the generator is trying to create images that make the discriminator think they are real. In the end, the generator network is outputting images that are indistinguishable from real images for the discriminator.
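
The adversarial game described above can be sketched as a single training step. The following is a schematic illustration only (the binary cross-entropy formulation, interfaces and names are my assumptions, not OpenAI's exact recipe); it assumes a generator mapping codes to images and a discriminator returning a probability that an image is real:

```python
# A schematic GAN training step illustrating the two-player game described above.
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, real_images, g_opt, d_opt, code_dim=100):
    batch = real_images.size(0)
    real_label = torch.ones(batch, 1)
    fake_label = torch.zeros(batch, 1)

    # 1) Update the discriminator: classify real images as real, generated samples as fake.
    codes = torch.rand(batch, code_dim, 1, 1)
    fake_images = generator(codes).detach()          # don't update the generator in this step
    d_loss = (F.binary_cross_entropy(discriminator(real_images), real_label)
              + F.binary_cross_entropy(discriminator(fake_images), fake_label))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Update the generator: backpropagate *through* the discriminator so the
    #    generated samples become slightly more confusing (classified as real).
    fake_images = generator(torch.rand(batch, code_dim, 1, 1))
    g_loss = F.binary_cross_entropy(discriminator(fake_images), real_label)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    return d_loss.item(), g_loss.item()
```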

(…)

This is exciting — these neural networks are learning what the visual world looks like! These models usually have only about 100 million parameters, so a network trained on ImageNet has to (lossily) compress 200GB of pixel data into 100MB of weights. This incentivizes it to discover the most salient features of the data: for example, it will likely learn that pixels nearby are likely to have the same color, or that the world is made up of horizontal or vertical edges, or blobs of different colors. Eventually, the model may discover many more complex regularities: that there are certain types of backgrounds, objects, textures, that they occur in certain likely arrangements, or that they transform in certain ways over time in videos, etc.

### More general formulation

Mathematically, we think about a dataset of examples x1,…,xn as samples from a true data distribution p(x). In the example image below, the blue region shows the part of the image space that, with a high probability (over some threshold), contains real images, and black dots indicate our data points (each is one image in our dataset). Now, our model also describes a distribution p̂θ(x) (green) that is defined implicitly by taking points from a unit Gaussian distribution (red) and mapping them through a (deterministic) neural network — our generative model (yellow).

Our network is a function with parameters θ, and tweaking these parameters will tweak the generated distribution of images. Our goal then is to find parameters θ that produce a distribution that closely matches the true data distribution (for example, by having a small KL divergence loss). Therefore, you can imagine the green distribution starting out random and then the training process iteratively changing the parameters θ to stretch and squeeze it to better match the blue distribution.
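
In symbols, the training goal sketched above can be written (in one common formulation, not necessarily the exact loss used in each model) as finding the parameters whose model distribution is closest, in KL divergence, to the true data distribution:

```latex
\hat{\theta} \;=\; \arg\min_{\theta}\; D_{\mathrm{KL}}\bigl(p(x)\,\|\,\hat{p}_{\theta}(x)\bigr),
\qquad
D_{\mathrm{KL}}\bigl(p\,\|\,\hat{p}_{\theta}\bigr) \;=\; \mathbb{E}_{x\sim p(x)}\!\left[\log\frac{p(x)}{\hat{p}_{\theta}(x)}\right]
```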

### Three approaches to generative models

Most generative models have this basic setup, but differ in the details. Here are three popular examples of generative model approaches to give you a sense of the variation:

• Autoregressive models such as PixelRNN instead train a network that models the conditional distribution of every individual pixel given previous pixels (to the left and to the top). This is similar to plugging the pixels of the image into a char-rnn, but the RNNs run both horizontally and vertically over the image instead of just a 1D sequence of characters.
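
The factorization behind this autoregressive approach is simply the chain rule of probability, with the pixels ordered left-to-right and top-to-bottom:

```latex
p(x) \;=\; \prod_{i=1}^{n} p\bigl(x_i \mid x_1, \ldots, x_{i-1}\bigr)
```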

All of these approaches have their pros and cons. For example, Variational Autoencoders allow us to perform both learning and efficient Bayesian inference in sophisticated probabilistic graphical models with latent variables (e.g. see DRAW, or Attend Infer Repeat for hints of recent relatively complex models). However, their generated samples tend to be slightly blurry. GANs currently generate the sharpest images but they are more difficult to optimize due to unstable training dynamics. PixelRNNs have a very simple and stable training process (softmax loss) and currently give the best log likelihoods (that is, plausibility of the generated data). However, they are relatively inefficient during sampling and don’t easily provide simple low-dimensional codes for images. All of these models are active areas of research and we are eager to see how they develop in the future!

I would like to point out how significant this post and its content are. It really is a treasure trove of resources and intuitive explanations of the main models/techniques currently being researched around the implementation of generative models in machine learning and AI. The following paragraphs continue this magnificent offering – the four major projects applying these models at OpenAI:

### Our recent contributions

We’re quite excited about generative models at OpenAI, and have just released four projects that advance the state of the art. For each of these contributions we are also releasing a technical report and source code.

Improving GANs (code). First, as mentioned above GANs are a very promising family of generative models because, unlike other methods, they produce very clean and sharp images and learn codes that contain valuable information about these textures. However, GANs are formulated as a game between two networks and it is important (and tricky!) to keep them in balance: for example, they can oscillate between solutions, or the generator has a tendency to collapse. In this work, Tim Salimans, Ian Goodfellow, Wojciech Zaremba and colleagues have introduced a few new techniques for making GAN training more stable. These techniques allow us to scale up GANs and obtain nice 128x128 ImageNet samples:

Our CIFAR-10 samples also look very sharp – Amazon Mechanical Turk workers can distinguish our samples from real data with an error rate of 21.3% (50% would be random guessing):

In addition to generating pretty pictures, we introduce an approach for semi-supervised learning with GANs that involves the discriminator producing an additional output indicating the label of the input. This approach allows us to obtain state of the art results on MNIST, SVHN, and CIFAR-10 in settings with very few labeled examples. On MNIST, for example, we achieve 99.14% accuracy with only 10 labeled examples per class with a fully connected neural network — a result that’s very close to the best known results with fully supervised approaches using all 60,000 labeled examples. This is very promising because labeled examples can be quite expensive to obtain in practice.
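
The semi-supervised idea can be sketched roughly as follows: the discriminator gets K+1 outputs, the usual K class labels plus an extra "generated" class. The loss composition below is my simplified reading of the approach, not the exact formulation of Salimans et al.:

```python
# Schematic semi-supervised GAN discriminator loss with a K+1-way output head.
import torch
import torch.nn.functional as F

NUM_CLASSES = 10           # e.g. the ten MNIST digit classes
FAKE_CLASS = NUM_CLASSES   # index of the extra "generated" output

def semi_supervised_d_loss(discriminator, labelled_x, labels, unlabelled_x, fake_x):
    # Supervised part: the few labelled real images should get their true class.
    supervised = F.cross_entropy(discriminator(labelled_x), labels)

    # Unsupervised part: real images (labelled or not) should not be assigned
    # to the fake class, while generated images should be.
    p_real_is_fake = F.softmax(discriminator(unlabelled_x), dim=1)[:, FAKE_CLASS]
    p_fake_is_fake = F.softmax(discriminator(fake_x), dim=1)[:, FAKE_CLASS]
    unsupervised = -(torch.log(1.0 - p_real_is_fake + 1e-8).mean()
                     + torch.log(p_fake_is_fake + 1e-8).mean())

    return supervised + unsupervised
```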

Improving VAEs (code). In this work Durk Kingma and Tim Salimans introduce a flexible and computationally scalable method for improving the accuracy of variational inference. In particular, most VAEs have so far been trained using crude approximate posteriors, where every latent variable is independent. Recent extensions have addressed this problem by conditioning each latent variable on the others before it in a chain, but this is computationally inefficient due to the introduced sequential dependencies. The core contribution of this work, termed inverse autoregressive flow (IAF), is a new approach that, unlike previous work, allows us to parallelize the computation of rich approximate posteriors, and make them almost arbitrarily flexible.
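
One way to picture an IAF step, as a sketch of the general normalizing-flow recipe and with details simplified relative to the paper: each step applies an invertible transformation whose Jacobian is triangular, so the change in log-density is cheap to compute:

```latex
z_t \;=\; \mu_t \,+\, \sigma_t \odot z_{t-1},
\qquad
\log q(z_T \mid x) \;=\; \log q(z_0 \mid x) \;-\; \sum_{t=1}^{T} \sum_{i} \log \sigma_{t,i}
```

Here μ_t and σ_t are produced by an autoregressive network applied to z_(t-1), which is what keeps the Jacobian triangular and allows the chain of transformations to be computed in parallel.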

(…)

InfoGAN (code). Peter Chen and colleagues introduce InfoGAN — an extension of GAN that learns disentangled and interpretable representations for images. A regular GAN achieves the objective of reproducing the data distribution in the model, but the layout and organization of the code space is underspecified — there are many possible solutions to mapping the unit Gaussian to images and the one we end up with might be intricate and highly entangled. The InfoGAN imposes additional structure on this space by adding new objectives that involve maximizing the mutual information between small subsets of the representation variables and the observation. This approach provides quite remarkable results.  (…)
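
Schematically, the extra structure amounts to adding a mutual-information term to the usual GAN objective (λ below is a weighting coefficient; in practice the mutual information is maximized through a variational lower bound with an auxiliary network):

```latex
\min_{G}\,\max_{D}\; V(D, G) \;-\; \lambda\, I\bigl(c;\, G(z, c)\bigr)
```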

(…)

The next two recent projects are in a reinforcement learning (RL) setting (another area of focus at OpenAI), but they both involve a generative model component.

Curiosity-driven Exploration in Deep Reinforcement Learning via Bayesian Neural Networks (code). Efficient exploration in high-dimensional and continuous spaces is presently an unsolved challenge in reinforcement learning. Without effective exploration methods our agents thrash around until they randomly stumble into rewarding situations. This is sufficient in many simple toy tasks but inadequate if we wish to apply these algorithms to complex settings with high-dimensional action spaces, as is common in robotics. In this paper, Rein Houthooft and colleagues propose VIME, a practical approach to exploration using uncertainty on generative models. VIME makes the agent self-motivated; it actively seeks out surprising state-actions. We show that VIME can improve a range of policy search methods and makes significant progress on more realistic tasks with sparse rewards (e.g. scenarios in which the agent has to learn locomotion primitives without any guidance).
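
The intrinsic-motivation idea can be summarized, roughly, as augmenting the environment reward with an information-gain bonus about the parameters θ of a Bayesian dynamics model (ξ_t denotes the history observed up to time t, and η is a trade-off coefficient); this is a sketch of the general recipe rather than the paper's exact notation:

```latex
r'(s_t, a_t, s_{t+1}) \;=\; r(s_t, a_t) \;+\; \eta\, D_{\mathrm{KL}}\bigl[\, p(\theta \mid \xi_t, a_t, s_{t+1}) \,\|\, p(\theta \mid \xi_t) \,\bigr]
```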

Finally, we would like to include a bonus fifth project: Generative Adversarial Imitation Learning (code), in which Jonathan Ho and colleagues present a new approach for imitation learning. Jonathan Ho is joining us at OpenAI as a summer intern. He did most of this work at Stanford but we include it here as a related and highly creative application of GANs to RL. The standard reinforcement learning setting usually requires one to design a reward function that describes the desired behavior of the agent. However, in practice this can sometimes involve an expensive trial-and-error process to get the details right. In contrast, in imitation learning the agent learns from example demonstrations (for example provided by teleoperation in robotics), eliminating the need to design a reward function.

Popular imitation approaches involve a two-stage pipeline: first learning a reward function, then running RL on that reward. Such a pipeline can be slow, and because it’s indirect, it is hard to guarantee that the resulting policy works well. This work shows how one can directly extract policies from data via a connection to GANs. As a result, this approach can be used to learn policies from expert demonstrations (without rewards) on hard OpenAI Gym environments, such as Ant and Humanoid.
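
The GAN connection can be sketched, roughly, as a minimax game over state-action pairs (π is the learned policy, π_E the expert, D a discriminator over state-action pairs, and H a causal-entropy regularizer; the precise formulation is in the paper):

```latex
\min_{\pi}\,\max_{D}\;
\mathbb{E}_{\pi}\bigl[\log D(s, a)\bigr]
\;+\;
\mathbb{E}_{\pi_{E}}\bigl[\log\bigl(1 - D(s, a)\bigr)\bigr]
\;-\;
\lambda\, H(\pi)
```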

### Conclusion

This blog is proud and privileged to have shared and commented on this stunning post from OpenAI. I conclude with the authors' final remarks about the possible paths of research going forward. Given what has already been done with generative models, we expect nothing less than amazing results, and we feel OpenAI is poised to surprise and impress us all…!

### Going forward

Generative models are a rapidly advancing area of research. As we continue to advance these models and scale up the training and the datasets, we can expect to eventually generate samples that depict entirely plausible images or videos. This may by itself find use in multiple applications, such as on-demand generated art, or Photoshop++ commands such as “make my smile wider”. Additional presently known applications include image denoising, inpainting, super-resolution, structured prediction, exploration in reinforcement learning, and neural network pretraining in cases where labeled data is expensive.

However, the deeper promise of this work is that, in the process of training generative models, we will endow the computer with an understanding of the world and what it is made up of.

featured image: OpenAI entry at Wikipedia