Long paper review of Attend, Infer, Repeat: Fast Scene Understanding with Generative Models

Upfront note: this paper was presented at the recent 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Today I will do a long paper review of one of those papers that deserve it. The paper is about algorithmically efficient inference in structured image models that explicitly reason about objects. The relatively long list of authors, which includes one of the leading minds in deep neural networks, Geoffrey Hinton, achieves this with a probabilistic algorithm that performs inference using a recurrent neural network.

Remarkably, the model learns by itself to choose the appropriate number of inference steps. The framework is then applied to inference in partially specified 2D models (variable-sized variational auto-encoders) and in fully specified 3D models (probabilistic renderers). These are what the field calls generative models: probabilistic models that describe how observable data, such as the pixels of an image, are generated from hidden (latent) variables, here with the mapping parameterized by a neural network.

This work sits at the cutting edge of today’s machine learning for computer vision. The subject promises to revolutionize the way machines perceive reality, bringing machine perception closer to the highly structured way human vision performs this important cognitive task. At the moment we can only imagine what the real-world applications of this kind of technology will be, but the burgeoning field of self-driving vehicles is one such application, and it is becoming a closer reality by the day. Others that may be even more important, such as improved medical imaging assistants or better street security through improved surveillance cameras, are also a possibility.

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models


We present a framework for efficient inference in structured image models that explicitly reason about objects. We achieve this by performing probabilistic inference using a recurrent neural network that attends to scene elements and processes them one at a time. Crucially, the model itself learns to choose the appropriate number of inference steps. We use this scheme to learn to perform inference in partially specified 2D models (variable-sized variational auto-encoders) and fully specified 3D models (probabilistic renderers). We show that such models learn to identify multiple objects – counting, locating and classifying the elements of a scene – without any supervision, e.g., decomposing 3D images with various numbers of objects in a single forward pass of a neural network. We further show that the networks produce accurate inferences when compared to supervised counterparts, and that their structure leads to improved generalization.


In the introductory paragraphs the authors make clear at the outset that they have no deep misgivings or negative bias about applying artificial neural networks in real-life settings. While this would seem obvious given that research in this area is precisely what they do, one should not forget that other researchers do have, or may have, reservations when it comes to the application of research. A conservative mindset on this is sometimes warranted by ethical concerns, understandably so. But we should not forget that when the evidence points to greater benefit than possible harm from applying these technologies, more is lost by missing those applications. The very first paragraph of the paper sets out the motivation explicitly:

The human percept of a visual scene is highly structured. Scenes naturally decompose into objects that are arranged in space, have visual and physical properties, and are in functional relationships with each other. Artificial systems that interpret images in this way are desirable, as accurate detection of objects and inference of their attributes is thought to be fundamental for many problems of interest. Consider a robot whose task is to clear a table after dinner. To plan its actions it will need to determine which objects are present, what classes they belong to and where each one is located on the table.

In the next few paragraphs of the introduction the authors further their argument for innovative thinking and boldness in tackling the problems of this kind of technology, describing the history of earlier frustrated attempts and of other perspectives and assumptions that floundered. For instance, it has been known for many years that it is difficult to define structured vision models that are expressive enough to capture the full complexity of a visual scene and, at the same time, amenable to inference. The common assumption was often that one could have either one or the other, but not both:

The notion of using structured models for image understanding has a long history (e.g., ‘vision as inverse graphics’ [4]), however in practice it has been difficult to define models that are: (a) expressive enough to capture the complexity of natural scenes, and (b) amenable to tractable inference. Meanwhile, advances in deep learning have shown how neural networks can be used to make sophisticated predictions from images using little interpretable structure (e.g., [10]). Here we explore the intersection of structured probabilistic models and deep networks. Prior work on deep generative methods (e.g., VAEs [9]) have been mostly unstructured, therefore despite producing impressive samples and likelihood scores their representations have lacked interpretable meaning. On the other hand, structured generative methods have largely been incompatible with deep learning, and therefore inference has been hard and slow (e.g., via MCMC, Markov Chain Monte Carlo).

This interplay between interpretable meaning and inference by deep networks is quite interesting. It somewhat equates, maybe rightly so, conceptual learning with visual consciousness. When we see, we are automatically thinking and conceptualizing about what we see. Could this be emulated by a machine, a non-conscious object? The paper hopes so:

Our proposed framework achieves scene interpretation via learned, amortized inference, and it imposes structure on its representation through appropriate partly- or fully-specified generative models, rather than supervision from labels. It is important to stress that by training generative models, the aim is not primarily to obtain good reconstructions, but to produce good representations, in other words to understand scenes. We show experimentally that by incorporating the right kinds of structures, our models produce representations that are more useful for downstream tasks than those produced by VAEs or state-of-the-art generative models such as DRAW [3].

The proposed framework crucially allows for reasoning about the complexity of a given scene (the dimensionality of its latent space). We demonstrate that via an Occam’s razor type effect, this makes it possible to discover the underlying causes of a dataset of images in an unsupervised manner. For instance, the model structure will enforce that a scene is formed by a variable number of entities that appear in different locations, but the process of learning will identify what these scene elements look like and where they appear in any given image. The framework also combines high-dimensional distributed representations with directly interpretable latent variables (e.g., affine pose). This combination makes it easier to avoid the pitfalls of models that are too unconstrained (leading to data-hungry learning) or too rigid (leading to failure via mis-specification).

It is important to remark here that generative models, by allowing structure to be imposed on a representation through partially or fully specified modeling, remove the need for supervised labels. This is clearly an improvement in terms of the cost of implementing a machine learning framework. The paper outlines the rest of its content as follows:

First, in Sec. 2 we formalize a scheme for efficient variational inference in latent spaces of variable dimensionality. The key idea is to treat inference as an iterative process, implemented as a recurrent neural network that attends to one object at a time, and learns to use an appropriate number of inference steps for each image. We call the proposed framework Attend-Infer-Repeat (AIR). End-to-end learning is enabled by recent advances in amortized variational inference, e.g., combining gradient based optimization for continuous latent variables with black-box optimization for discrete ones. Second, in Sec. 3 we show that AIR allows for learning of generative models that decompose multi-object scenes into their underlying causes, e.g., the constituent objects, in an unsupervised manner. We demonstrate these capabilities on MNIST digits (Sec. 3.1), overlapping sprites and Omniglot glyphs (appendices H and G). We show that model structure can provide an important inductive bias that is not easily learned otherwise, leading to improved generalization. Finally, in Sec. 3.2 we demonstrate how our inference framework can be used to perform inference for a 3D rendering engine with unprecedented speed, recovering the counts, identities and 3D poses of complex objects in scenes with significant occlusion in a single forward pass of a neural network, providing a scalable approach to ‘vision as inverse graphics’.

A final remark on this introduction. A latent variable model such as this one does not imply latency in the implementation of the algorithmic framework. On the contrary, the authors report unprecedented speed when performing inference for a 3D rendering engine, on scenes with complex objects and significant occlusion. That this takes only a single forward pass of a neural network makes the scalable result all the more impressive.


The details of the approach follow:

Many real-world scenes naturally decompose into objects. We therefore make the modeling assumption that the scene description is structured into groups of variables $z^i$, where each group describes the attributes of one of the objects in the scene, e.g., its type, appearance, and pose. Since the number of objects will vary from scene to scene, we assume models of the following form:

$$p_\theta(x) = \sum_{n=1}^{N} p_N(n) \int p^z_\theta(z|n)\, p^x_\theta(x|z)\, dz \qquad (1)$$

This can be interpreted as follows. We first sample the number of objects $n$ from a suitable prior (for instance a Binomial distribution) with maximum value $N$. The latent, variable length, scene descriptor $z = (z^1, z^2, \ldots, z^n)$ is then sampled from a scene model $z \sim p^z_\theta(\cdot|n)$. Finally, we render the image according to $x \sim p^x_\theta(\cdot|z)$. Since the indexing of objects is arbitrary, $p^z_\theta(\cdot)$ is exchangeable and $p^x_\theta(x|\cdot)$ is permutation invariant, and therefore the posterior over $z$ is exchangeable. The prior and likelihood terms can take different forms. We consider two scenarios: For 2D scenes (Sec. 3.1), each object is characterized in terms of a learned distributed continuous representation for its shape, and a continuous 3-dimensional variable for its pose (position and scale).

For 3D scenes (Sec. 3.2), objects are defined in terms of a categorical variable that characterizes their identity, e.g., sphere, cube or cylinder, as well as their positions and rotations. We refer to the two kinds of variables for each object $i$ in both scenarios as $z^i_{\text{what}}$ and $z^i_{\text{where}}$ respectively, bearing in mind that their meaning (e.g., position and scale in pixel space vs. position and orientation in 3D space) and their data type (continuous vs. discrete) will vary. We further assume that $z^i$ are independent under the prior, i.e., $p^z_\theta(z|n) = \prod_{i=1}^{n} p^z_\theta(z^i)$, but non-independent priors, such as a distribution over hierarchical scene graphs (e.g., [28]), can also be accommodated. Furthermore, while the number of objects is bounded as per Eq. 1, it is relatively straightforward to relax this assumption.
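To make the sampling story above concrete, here is a minimal sketch of ancestral sampling from a model of this form, in plain Python. It is my own toy illustration, not the authors' code: the names `sample_scene` and `render`, the Gaussian priors on both kinds of variable, and the one-pixel-per-object "renderer" are all assumptions chosen for brevity.

```python
import random

def sample_scene(rng, max_objects=3, p_present=0.5, what_dim=4):
    """Ancestral sampling of a scene description z = (z^1, ..., z^n)."""
    # n ~ Binomial(max_objects, p_present): count successes in N coin flips.
    n = sum(rng.random() < p_present for _ in range(max_objects))
    # Objects are independent under the prior: p(z|n) = prod_i p(z^i).
    z = [
        {
            "what": [rng.gauss(0.0, 1.0) for _ in range(what_dim)],  # appearance code
            "where": [rng.gauss(0.0, 1.0) for _ in range(3)],        # scale, shift_x, shift_y
        }
        for _ in range(n)
    ]
    return n, z

def render(z, size=8):
    """Toy stand-in for p(x|z): each object adds intensity to one pixel."""
    canvas = [[0.0] * size for _ in range(size)]
    for obj in z:
        _, shift_x, shift_y = obj["where"]
        # Squash the shifts from roughly [-3, 3] into valid pixel indices.
        col = min(size - 1, max(0, int((shift_x + 3.0) / 6.0 * size)))
        row = min(size - 1, max(0, int((shift_y + 3.0) / 6.0 * size)))
        canvas[row][col] += sum(abs(v) for v in obj["what"])
    return canvas

rng = random.Random(0)
n, z = sample_scene(rng)
x = render(z)
```

Because the toy renderer simply sums contributions, reordering the objects in `z` gives the same image up to floating-point rounding, which is the permutation invariance the quoted passage refers to.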

Figure 1: Left: A single random variable z produces the observation x (the image). The relationship between z and x is specified by a model. Inference is the task of computing likely values of z given x. Using an auto-encoding architecture, the model (red arrow) and its inference network (black arrow) can be trained end-to-end via gradient descent. Right: For most images of interest, multiple latent variables (e.g., multiple objects) give rise to the image. We propose an iterative, variable-length inference network (black arrows) that attends to one object at a time, and train it jointly with its model. The result is fast, feed-forward, interpretable scene understanding trained without supervision.


Despite their natural appeal, inference for most models in the form of Eq. 1 is intractable. We therefore employ an amortized variational approximation to the true posterior by learning a distribution $q_\phi(z, n|x)$ parameterized by $\phi$ that minimizes the divergence $\mathrm{KL}[q_\phi(z, n|x)\,\|\,p^z_\theta(z, n|x)]$. While amortized variational approximations have recently been used successfully in a variety of works [21, 9, 18] the specific form of our model poses two additional difficulties. Trans-dimensionality: As a challenging departure from classical latent space models, the size of the latent space $n$ (i.e., the number of objects) is a random variable itself, which necessitates evaluating $p_N(n|x) = \int p^z_\theta(z, n|x)\,dz$, for all $n = 1 \ldots N$. Symmetry: There are strong symmetries that arise, for instance, from alternative assignments of objects appearing in an image $x$ to latent variables $z^i$.
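For readers wondering how this KL term connects to the objective $\mathcal{L}$ that is maximized in the Multi-MNIST experiment further down, the standard variational identity is worth writing out. This is textbook variational inference rather than anything specific to the paper; I write the joint as $p_N(n)\,p^z_\theta(z|n)\,p^x_\theta(x|z)$ to match the notation of the quoted passages.

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z,n|x)}\!\left[
      \log \frac{p_N(n)\, p^z_\theta(z|n)\, p^x_\theta(x|z)}{q_\phi(z,n|x)}
    \right]}_{\mathcal{L}(\theta,\phi)}
  \;+\; \mathrm{KL}\!\left[\, q_\phi(z,n|x) \,\|\, p^z_\theta(z,n|x) \,\right]
```

Since the KL term is non-negative, $\mathcal{L}$ is a lower bound on $\log p_\theta(x)$, and since $\log p_\theta(x)$ does not depend on $\phi$, minimizing the KL divergence with respect to $\phi$ is the same as maximizing $\mathcal{L}$.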

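To simplify sequential reasoning about the number of objects, we parametrize $n$ as a variable length latent vector $z_{\text{pres}}$ using a unary code: for a given value $n$, $z_{\text{pres}}$ is the vector formed of $n$ ones followed by one zero. Note that the two representations are equivalent. The posterior takes the following form:

$$q_\phi(z, z_{\text{pres}}|x) = q_\phi(z^{n+1}_{\text{pres}} = 0 \,|\, z^{1:n}, x) \prod_{i=1}^{n} q_\phi(z^i, z^i_{\text{pres}} = 1 \,|\, x, z^{1:i-1}) \qquad (2)$$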

$q_\phi$ is implemented as a neural network that, in each step, outputs the parameters of the sampling distributions over the latent variables, e.g., the mean and standard deviation of a Gaussian distribution for continuous variables. $z_{\text{pres}}$ can be understood as an interruption variable: at each time step, if the network outputs $z_{\text{pres}} = 1$, it describes at least one more object and proceeds, but if it outputs $z_{\text{pres}} = 0$, no more objects are described, and inference terminates for that particular datapoint. Note that conditioning of $z^i | x, z^{1:i-1}$ is critical to capture dependencies between the latent variables $z^i$ in the posterior, e.g., to avoid explaining the same object twice. The specifics of the networks that achieve this depend on the particularities of the models and we will describe them in detail in Sec. 3.
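The unary code and the "interruption variable" idea are easy to see in a few lines of Python. The sketch below is only an illustration under my own assumptions: `step_fn` is a hypothetical stand-in for one step of the recurrent inference network, and `toy_step` is a deterministic dummy that "explains" list items instead of image regions.

```python
def count_to_unary(n):
    """Encode n as z_pres: n ones followed by a single zero."""
    return [1] * n + [0]

def unary_to_count(z_pres):
    """Decode z_pres by counting the ones before the first zero."""
    return z_pres.index(0)

def infer(x, step_fn, max_steps=3):
    """Run inference steps until a step reports z_pres = 0 or max_steps is hit."""
    z, z_pres = [], []
    for _ in range(max_steps):
        present, z_i = step_fn(x, z)  # each step conditions on x and z^{1:i-1}
        z_pres.append(present)
        if present == 0:
            break                     # no more objects: stop early
        z.append(z_i)
    return z, z_pres

def toy_step(x, z_so_far):
    """Hypothetical stand-in for the network: explain the next unexplained item."""
    remaining = [item for item in x if item not in z_so_far]
    if not remaining:
        return 0, None
    return 1, remaining[0]

z, z_pres = infer(["7", "2"], toy_step)
print(z, z_pres, unary_to_count(z_pres))
```

Passing the objects found so far into every step is what lets the toy avoid explaining the same item twice, mirroring the conditioning on $z^{1:i-1}$ in the quoted text. Note that if all `max_steps` steps report an object, this sketch returns a `z_pres` with no terminating zero.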

Model and Experiments

We next briefly describe the methodological approach and the experimental setup of the paper and then conclude. As always, this review is only an attempt at my own interpretation of, and views on, the content reviewed, so I highly recommend reading the paper itself to deepen your understanding. This is one of those papers where the effort is guaranteed an intellectual payoff. I would not say as much about some of the other papers already reviewed on this blog:

Details of the AIR model and networks used in the 2D experiments are shown in Fig. 2. The generative model (Fig. 2, left) draws $n \sim \text{Geom}(\rho)$ digits $\{y^i_{\text{att}}\}$, scales and shifts them according to $z^i_{\text{where}} \sim \mathcal{N}(0, \Sigma)$ using spatial transformers, and sums the results $\{y^i\}$ to form the image. Each digit is obtained by first sampling a latent code $z^i_{\text{what}}$ from the prior $z^i_{\text{what}} \sim \mathcal{N}(0, 1)$ and propagating it through a decoder network. The learnable parameters of the generative model are the parameters of this decoder network. The AIR inference network (Fig. 2, middle) produces three sets of variables for each entity at every time-step: a 1-dimensional Bernoulli variable indicating the entity’s presence, a C-dimensional distributed vector describing its class or appearance ($z^i_{\text{what}}$), and a 3-dimensional vector specifying the affine parameters of its position and scale ($z^i_{\text{where}}$). Fig. 2 (right) shows the interaction between the inference and generation networks at every time-step. The inferred pose is used to attend to a part of the image (using a spatial transformer) to produce $x^i_{\text{att}}$, which is processed to produce the inferred code $z^i_{\text{code}}$ and the reconstruction of the contents of the attention window $y^i_{\text{att}}$. The same pose information is used by the generative model to transform $y^i_{\text{att}}$ to obtain $y^i$. This contribution is only added to the canvas $y$ if $z^i_{\text{pres}}$ was inferred to be true.
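To give a feel for what "attending with a 3-dimensional pose" means, here is a small sketch of an affine attention window in plain Python. It rests on several assumptions of mine: the pose is read as (scale, shift_x, shift_y) in normalised [-1, 1] image coordinates, the function name `attend` is hypothetical, and I use nearest-neighbour sampling, whereas a real spatial transformer uses bilinear interpolation so that gradients can flow through the crop. The paper's exact parameterization may differ.

```python
import math

def attend(image, z_where, out_size):
    """Crop an out_size x out_size window from a square image (list of rows)."""
    scale, shift_x, shift_y = z_where
    size = len(image)
    window = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            # Centre of window pixel (r, c) in normalised window coordinates.
            v = (2 * r + 1) / out_size - 1
            u = (2 * c + 1) / out_size - 1
            # Affine map: window coordinates -> image coordinates.
            x = scale * u + shift_x
            y = scale * v + shift_y
            # Back to pixel indices; points outside the image read as 0.
            col = math.floor((x + 1) / 2 * size)
            rw = math.floor((y + 1) / 2 * size)
            inside = 0 <= rw < size and 0 <= col < size
            row.append(image[rw][col] if inside else 0)
        window.append(row)
    return window

image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(attend(image, (0.5, 0.5, 0.5), 2))  # bottom-right quadrant: [[10, 11], [14, 15]]
```

With scale 1 and zero shift the window covers the whole image; shrinking the scale zooms in, and the shifts move the window around. Applying the inverse mapping is, in spirit, how the generative side places a reconstructed patch back on the canvas.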

Figure 2: AIR in practice: Left: The assumed generative model. Middle: AIR inference for this model. The contents of the grey box are input to the decoder. Right: Interaction between the inference and generation networks at every time-step. In our experiments the relationship between $x^i_{\text{att}}$ and $y^i_{\text{att}}$ is modeled by a VAE, however any generative model of patches could be used (even, e.g., DRAW).

Learning with Multi-MNIST

The following describes how the model learns in a Multi-MNIST setting. The MNIST dataset is an old acquaintance of this blog, as well as of the wider deep learning developer community.

The paragraph explaining the image below is interesting for the apparent tension it describes between the model interpreting the dataset as correctly as possible and the opposing pressures placed on it. Would that be so surprising? Here is the passage:

We begin with a 50×50 dataset of multi-MNIST digits. Each image contains zero, one or two non-overlapping random MNIST digits with equal probability. The desired goal is to train a network that produces sensible explanations for each of the images. We train AIR with N = 3 on 60,000 such images from scratch, i.e., without a curriculum or any form of supervision by maximizing $\mathcal{L}$ with respect to the parameters of the inference network and the generative model. Upon completion of training we inspect the model’s inferences (see Fig. 3, left). We draw the reader’s attention to the following observations. First, the model identifies the number of digits correctly, due to the opposing pressures of (a) wanting to explain the scene, and (b) the cost that arises from instantiating an object under the prior. This is indicated by the number of attention windows in each image; we also plot the accuracy of count inference over the course of training (Fig. 3, above right). Second, it locates the digits accurately. Third, the recurrent network learns a suitable scanning policy to ensure that different time-steps account for different digits (Fig. 3, below right). Note that we did not have to specify any such policy in advance, nor did we have to build in a constraint to prevent two time-steps from explaining the same part of the image. Finally, the network learns to not use the second time-step when the image contains only a single digit, and to never use the third time-step (images contain a maximum of two digits). This allows for the inference network to stop upon encountering the first $z^i_{\text{pres}}$ equaling 0, leading to potential savings in computation during inference.
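For anyone wanting to play with a similar setting, here is a rough sketch of how such a dataset could be composed. It is not the authors' pipeline: the quoted passage does not say how digits were sized or placed, so I assume digit patches already resized to a hypothetical 20×20 (two 28×28 bounding boxes cannot be disjoint on a 50×50 canvas, since 2 × 28 > 50), and I use simple rejection sampling for the layout. The function names are my own.

```python
import random

def boxes_overlap(a, b):
    """True if two (top, left, side) squares share at least one pixel."""
    top_a, left_a, side_a = a
    top_b, left_b, side_b = b
    return (top_a < top_b + side_b and top_b < top_a + side_a
            and left_a < left_b + side_b and left_b < left_a + side_a)

def sample_layout(sides, rng, canvas=50, max_tries=1000):
    """Rejection-sample positions for all patches until none overlap."""
    for _ in range(max_tries):
        boxes = [(rng.randrange(canvas - s + 1), rng.randrange(canvas - s + 1), s)
                 for s in sides]
        if not any(boxes_overlap(a, b)
                   for i, a in enumerate(boxes) for b in boxes[i + 1:]):
            return boxes
    raise RuntimeError("no non-overlapping layout found")

def sample_example(digit_pool, rng, canvas=50):
    """Compose one image containing 0, 1 or 2 digits with equal probability."""
    n = rng.choice([0, 1, 2])
    digits = [rng.choice(digit_pool) for _ in range(n)]
    boxes = sample_layout([len(d) for d in digits], rng, canvas)
    image = [[0.0] * canvas for _ in range(canvas)]
    for patch, (top, left, side) in zip(digits, boxes):
        for r in range(side):
            for c in range(side):
                image[top + r][left + c] = patch[r][c]
    return image, n

# Placeholder "digit": a solid 20x20 square standing in for a resized MNIST crop.
placeholder = [[1.0] * 20 for _ in range(20)]
image, n = sample_example([placeholder], random.Random(0))
```

The count `n` is returned only so that count accuracy can be evaluated afterwards; in the unsupervised setting described above it would never be shown to the model during training.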

Figure 3: Multi-MNIST learning: Left above: Images from the dataset. Left below: Reconstructions at different stages of training along with a visualization of the model’s attention windows. The 1st, 2nd and 3rd time-steps are displayed using red, green and blue borders respectively. A video of this sequence is provided in the supplementary material. Above right: Count accuracy over time. The model detects the counts of digits accurately, despite having never been provided supervision. Below right: The learned scanning policy for 3 different runs of training (only differing in the random seed). We visualize empirical heatmaps of the attention windows’ positions (red, and green for the first and second time-steps respectively). As expected, the policy is random. This suggests that the policy is spatial, as opposed to identity- or size-based.

The next passage stresses the differences in performance between AIR and DAIR (Difference-AIR, a variant introduced in the paper), and shows how the DRAW (Deep Recurrent Attentive Writer) baseline learns to ignore one of the digits in the image. DAIR is the model that achieves strong generalization:

Strong Generalization

Since the model learns the concept of a digit independently of the positions or numbers of times it appears in each image, one would hope that it would be able to generalize, e.g., by demonstrating an understanding of scenes that have structural differences to training scenes. We probe this behavior with the following scenarios: (a) Extrapolation: training on images each containing 0, 1 or 2 digits and then testing on images containing 3 digits, and (b) Interpolation: training on images containing 0, 1 or 3 digits and testing on images containing 2 digits. The result of this experiment is shown in Fig. 4. An AIR model trained on up to 2 digits is effectively unable to infer the correct count when presented with an image of 3 digits. We believe this to be caused by the LSTM which learns during training never to expect more than 2 digits. AIR’s generalization performance is improved somewhat when considering the interpolation task. DAIR by contrast generalizes well in both tasks (and finds interpolation to be slightly easier than extrapolation). A closely related baseline is the Deep Recurrent Attentive Writer (DRAW, [3]), which like AIR, generates data sequentially. However, DRAW has a fixed and large number of steps (40 in our experiments). As a consequence generative steps do not correspond to easily interpretable entities, complex scenes are drawn faster and simpler ones slower. We show DRAW’s reconstructions in Fig. 4. Interestingly, DRAW learns to ignore precisely one digit in the image. See appendix for further details of these experiments.
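The two generalization protocols are easy to mix up, so here is a small sketch of how the splits and the count-accuracy metric could be expressed. The helper names and the `(image, count)` tuple layout are assumptions of mine, not something taken from the paper.

```python
def split_by_count(examples, train_counts, test_counts):
    """Partition (image, count) pairs by the number of digits they contain."""
    train = [ex for ex in examples if ex[1] in train_counts]
    test = [ex for ex in examples if ex[1] in test_counts]
    return train, test

def count_accuracy(predicted, true):
    """Fraction of images whose inferred object count matches the true count."""
    if len(predicted) != len(true) or not true:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == t for p, t in zip(predicted, true)) / len(true)

# Dummy examples: image placeholders paired with digit counts 0..3.
examples = [("img%d" % i, i % 4) for i in range(8)]

# (a) Extrapolation: train on 0, 1 or 2 digits, test on 3.
extra_train, extra_test = split_by_count(examples, {0, 1, 2}, {3})
# (b) Interpolation: train on 0, 1 or 3 digits, test on 2.
inter_train, inter_test = split_by_count(examples, {0, 1, 3}, {2})
```

In both cases the test count never appears in training, which is what makes this a test of structural generalization rather than of ordinary held-out accuracy.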

Figure 4: Strong generalization: Left: Reconstructions of images with 3 digits made by DAIR trained on 0, 1 or 2 digits, as well as a comparison with DRAW. Right: Variational lower bound, and generalizing / interpolating count accuracy. DAIR out-performs both DRAW and AIR at this task.

Conclusion and discussion

This is a remarkable paper that demonstrates the current cutting edge of deep neural networks applied to machine vision problems. I strongly recommend that interested readers check the full paper, the related work it mentions and the appendices, which follow the list of references.

For now let us conclude with the authors’ own discussion of their results, where we see confirmed both the achievements and the shortcomings of the approach, with a special note on the increasing relevance of model building in the design of improved deep learning architectures and frameworks. As the attentive follower would have expected:

Discussion In this paper our aim has been to learn unsupervised models that are good at scene understanding, in addition to scene reconstruction. We presented several principled models that learn to count, locate, classify and reconstruct the elements of a scene, and do so in a fraction of a second at test-time. The main ingredients are (a) building in meaning using appropriate structure, (b) amortized inference that is attentive, iterative and variable-length, and (c) end-to-end learning. We demonstrated that model structure can provide an important inductive bias that gives rise to interpretable representations that are not easily learned otherwise. We also showed that even for sophisticated models or renderers, fast inference is possible. We do not claim to have found an ideal model for all images; many challenges remain, e.g., the difficulty of working with the reconstruction loss and that of designing models rich enough to capture all natural factors of variability. Learning in AIR is most successful when the variance of the gradients is low and the likelihood is well suited to the data. It will be of interest to examine the scaling of variance with the number of objects and alternative likelihoods. It is straightforward to extend the framework to semi- or fully-supervised settings. Furthermore, the framework admits a plug-and-play approach where existing state-of-the-art detectors, classifiers and renderers are used as sub-components of an AIR inference network. We plan to investigate these lines of research in future work.

Text images and featured image: from the paper itself

