Deep Convolutional Neural Networks with a Mathematical Model

The resurgence of interest in Artificial Intelligence (AI) appears to be here to stay, and to be reaching ever deeper and broader parts of our societies. The governments of the United Kingdom and the United States have gone so far as to convene parliamentary select committees and congressional hearings devoted to the subject.

But what we hear and read is just the very tip of the iceberg of what this is really all about. A better way to a proper understanding of what is going on is to examine, with rigor and attention, the technologies and developments around AI. Those developments center on deep artificial neural networks and what they can achieve today, in a time of massive access to cheap, abundant and varied kinds of data, combined with impressive advances in computational power and hardware architectures.

What is often lacking here, though, is an appreciation of the fact that there is not yet a robust theoretical, mathematical formalism explaining what a deep neural network (DNN) actually does. In the proper spirit of science, everything should have an explanation, and serious scientists do not settle for magical or supernatural powers, whatever due respect they may pay to religious or cultural beliefs.

After this long introduction I want to pass quickly to a review of a paper, by a researcher at the University of Southern California, that tries precisely to deal with the complexity and difficulty of achieving a mathematical understanding and modelling of DNNs, something that is currently lacking but that could dramatically improve what this technology can accomplish in the future:

Understanding Convolutional Neural Networks with A Mathematical Model

Abstract:

This work attempts to address two fundamental questions about the structure of the convolutional neural networks (CNN): 1) why a non-linear activation function is essential at the filter output of every convolutional layer? 2) what is the advantage of the two-layer cascade system over the one-layer system? A mathematical model called the “REctified-COrrelations on a Sphere” (RECOS) is proposed to answer these two questions. After the CNN training process, the converged filter weights define a set of anchor vectors in the RECOS model. Anchor vectors represent the frequently occurring patterns (or the spectral components). The necessity of rectification is explained using the RECOS model. Then, the behavior of a two-layer RECOS system is analyzed and compared with its one-layer counterpart. The LeNet-5 and the MNIST dataset are used to illustrate discussion points. Finally, the RECOS model is generalized to a multi-layer system with the AlexNet as an example.

To note, the introduction is crystal clear and easy to understand as far as the goals of the author are concerned (the numbers in square brackets refer to the ordering in the references section):

Although deep learning often outperforms classical pattern recognition methods experimentally, a mathematical theory to explain its behavior and performance is nevertheless lacking. Without a solid understanding of deep learning, we can only have a set of empirical rules and intuitions, which is not sufficient to advance the scientific knowledge profoundly. There has been a large amount of efforts devoted to the understanding of CNNs from various angles. Examples include scattering networks [2, 3, 4], tensor analysis [5], generative modeling [6], relevance propagation [7], Taylor decomposition [8], etc. Another popular topic along this line is on the visualization of filter responses at various layers [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. 

This research attempts to address two fundamental questions about CNNs: 1) Why a non-linear activation operation is needed at the filter output of every convolutional layer? 2) What is the advantage of the cascade of two layers in comparison with a single layer? These two questions are related to each other. The convolutional operation is a linear one. If the non-linear operation between every two convolutional layers is removed, the cascade of two linear systems is equivalent to a single linear system. Then, we can simply go with one linear system and the necessity of a multi-layer network architecture is not obvious. Although one may argue that a multi-layer network has a multi-resolution representation capability, this is a well-known fact and has been extensively studied before. Examples include the Gaussian and the wavelet pyramids. There must be something deeper than the multi-resolution property in the CNN architecture due to the adoption of the non-linear activation unit.
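The collapse argument in the quoted passage can be checked numerically. Below is a minimal sketch in pure Python (all matrix values are invented toys, not taken from the paper): two linear layers with weight matrices A and B reduce to the single matrix C = AB, while inserting a ReLU between them breaks the equivalence.

```python
# Minimal pure-Python sketch (toy values, no libraries): a cascade of two
# linear layers collapses into one linear system, while a ReLU between
# them does not.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, x):
    # multiply matrix M (a list of rows) by vector x
    return [sum(m * t for m, t in zip(row, x)) for row in M]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1.0, 2.0], [-3.0, 0.5]]   # layer-1 weights (toy)
B = [[0.5, -1.0], [2.0, 1.0]]   # layer-2 weights (toy)
x = [1.0, 2.0]                  # input vector

# Two linear layers in cascade: z = B^T (A^T x) ...
z_cascade = matvec(transpose(B), matvec(transpose(A), x))
# ... equal a single linear layer z = C^T x with C = AB.
z_single = matvec(transpose(matmul(A, B)), x)

# With a ReLU in between, the two-layer system is no longer linear.
relu = lambda v: [max(0.0, t) for t in v]
z_nonlinear = matvec(transpose(B), relu(matvec(transpose(A), x)))

print(z_cascade, z_single, z_nonlinear)
```

The first two outputs are identical while the third differs, which is exactly why a multi-layer architecture without non-linear activations would be no more expressive than a single layer.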

Indeed the analysis of convolutional neural networks (CNNs) is extremely challenging: each convolution is itself a linear operation, but the layers are combined through non-linear activations, and it is this non-linear property of multi-layer artificial neural networks (ANNs) that makes multi-resolution representations necessary. The author's proposal is to view CNNs as performing what he calls “REctified COrrelations on a Sphere” (RECOS), a naming that seems to hint at the interesting and vibrant mathematical field of algebraic topology, with the implication of a proper vector formalism:

A set of anchor vectors is selected for each RECOS model to capture and represent frequently occurring patterns. For an input vector, we compute its correlation with each anchor vector to measure their similarity. All negative correlations are rectified to zero in the RECOS module, and the necessity of rectification is explained. Anchor vectors are called weights in the CNN literature. In the CNN training, weights are first initialized and then adjusted by backpropagation to minimize a cost function. Here, we adopt a different name to emphasize its role in representing clustered input data in the RECOS model. After neuron modeling, we examine two-layer neural networks, where the first layer consists of either one or multiple neurons while the second layer contains only one neuron. We conduct a mathematical analysis on the behavior of the cascaded RECOS systems. This analysis sheds light on the advantage of deeper networks.
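As a concrete illustration of the quoted description, here is a minimal sketch of a single RECOS unit in pure Python (the anchor vectors and the input are invented toy values, not taken from the paper): the input is projected onto the unit sphere, correlated with each anchor vector, and negative correlations are rectified to zero.

```python
# Toy sketch of one RECOS unit: rectified correlations on the unit sphere.

def normalize(v):
    # project a vector onto the unit sphere
    norm = sum(t * t for t in v) ** 0.5
    return [t / norm for t in v]

def recos_unit(x, anchors):
    x = normalize(x)
    outputs = []
    for a in anchors:
        # correlation (cosine similarity) with the anchor vector
        corr = sum(ai * xi for ai, xi in zip(normalize(a), x))
        outputs.append(max(0.0, corr))   # rectification: clip negatives to 0
    return outputs

anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # toy anchor vectors
print(recos_unit([3.0, 4.0], anchors))
```

In a trained CNN these anchor vectors would be the converged filter weights; here they are arbitrary, and the third one, being anti-correlated with the input, is rectified to zero.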

The Model

A note before the full description of the paper: this site has posted extensively on convolutional neural networks recently. One of those posts also concerned a novel way of interpreting, with a formal scientific model, what a deep artificial neural network might be doing; in that case the modelling was a physical interpretation that explicitly ruled out a mathematical one. My point with this post is that, instead of ruling out a possible alternative interpretation, we should first consider its merits, and only afterwards, on strong empirical grounds, refute it or integrate what can be integrated, building a more robust model.

This research was carried out within the empirical framework of the MNIST dataset, the well-known repository of handwritten digits created by Yann LeCun and collaborators. LeCun, who now leads AI research at Facebook as its chief AI scientist, is a prominent figure in deep neural network research:

The MNIST dataset is formed by ten handwritten digits (0, 1, …, 9). All digits are size-normalized and centered in an image of size 32 by 32. The dataset has a training set of 60,000 examples and a test set of 10,000 examples. The LeNet-5 is the latest CNN designed by LeCun et al. [27] for handwritten and machine-printed character recognition.


Each convolutional layer is specified by its filter weights which are determined in the training stage by an iterative update process. That is, they are first initialized and then adjusted by backpropagation to minimize a cost function. All weights are then fixed in the testing stage. These weights play the role of “system memory”. In this work, we adopt a different name for filter weights to emphasize their role in the testing stage. We call them “anchor vectors” since they serve as reference signals (or visual patterns) for each input patch of test images. It is well-known that signal convolution can also be viewed as signal correlation or projection. For an input image patch, we compute its correlation with each anchor vector to measure their similarity. Clearly, the projection onto a set of anchor vectors offers a spectral decomposition of an input.
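The projection view in the quoted passage can be made concrete with a tiny pure-Python example (a made-up orthonormal pair of anchor vectors standing in for trained filter weights): the correlations with the anchors are spectral coefficients, and for an orthonormal set they reconstruct the input patch exactly.

```python
# Toy illustration: projection onto a set of (orthonormal) anchor vectors
# as a spectral decomposition. The anchor vectors are a made-up 2-D basis.
import math

s = 1 / math.sqrt(2)
anchors = [[s, s], [s, -s]]          # orthonormal "anchor vectors"
patch = [3.0, 1.0]                   # toy input patch

# correlation with each anchor vector = spectral coefficient
coeffs = [sum(a * p for a, p in zip(anchor, patch)) for anchor in anchors]

# for an orthonormal set, the coefficients reconstruct the patch exactly
recon = [sum(c * anchor[i] for c, anchor in zip(coeffs, anchors))
         for i in range(len(patch))]
print(coeffs, recon)
```

Real CNN filters are of course neither orthonormal nor complete, so the decomposition is lossy there; the example only shows the sense in which correlation with anchor vectors acts as a spectral decomposition.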

What follows are interesting discussions of the appropriate mathematical modelling and framework for understanding CNNs, questioning the advantages, or otherwise, of cascading layers and shedding light on the advantage of deeper networks:

The LeNet-5 is essentially a neural network with two convolutional layers since the compound operations of convolution/sampling/nonlinear clipping are viewed as one complete layer in the modern neural network literature. The input to the first layer of the LeNet-5 is a purely spatial signal while the input to the second layer of the LeNet-5 is a hybrid spectral-spatial signal consisting of spatial signals from 6 spectral bands. The spatial coverage of a filter with respect to the source image is called its receptive field. The receptive fields of the first and the second layers of the LeNet-5 are 5 × 5 and 13 × 13, respectively. For each spatial location in the 13 × 13 receptive field, it may be covered by one, two or four layer-one filters as shown in Fig. 6. In the following, we conduct a mathematical analysis on the behavior of the cascaded systems. This analysis sheds light on the advantage of deeper networks. In the following discussion, we begin with the cascade of one layer-1 RECOS unit and one layer-2 RECOS unit, and then generalize it to the cascade of multiple layer-1 RECOS units to one layer-2 RECOS unit. For simplicity, the means of all inputs are assumed to be zero. (…)

From One-to-One Cascading to Many-to-One Cascading

One-to-One Cascade.

We define two anchor matrices:

                   A = [a_1, …, a_k, …, a_K],    B = [b_1, …, b_l, …, b_L]                         (11)

whose columns are the anchor vectors a_k and b_l of the two individual RECOS units. Clearly, A ∈ R^{N×K} and B ∈ R^{K×L}. To make the analysis tractable, we begin with the correlation analysis and will take the nonlinear rectification effect into account at the end. For the correlation part, let y = A^T x and z = B^T y. Then, we have

                                                         z = B^T A^T x = C^T x,    C ≡ AB.                         (12)

Clearly, C ∈ R^{N×L}, with its (n, l)-th element equal to:

                                                        C(n, l) = α_n^T b_l,                                        (13)

The meaning of α_n can be visualized in Fig. 7 of the paper. Mathematically, we decompose

                                                         x = ∑_{n=1}^{N} x_n e_n                                      (14)

where e_n ∈ R^N is the n-th coordinate-unit-vector. Then,

                                                         α_n^T = e_n^T A.                                               (15)

Since α_n captures the position information of the anchor vectors in A, it is called the anchor-position vector.
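Eqs. (11)-(15) can be checked numerically. Here is a quick pure-Python verification with invented toy matrices: C = AB, and each element C(n, l) is indeed the correlation between the anchor-position vector α_n (the n-th row of A, since α_n^T = e_n^T A) and the anchor vector b_l (the l-th column of B).

```python
# Toy numerical check: C = AB, and C(n, l) = α_n^T b_l where α_n is the
# n-th row of A and b_l is the l-th column of B.

A = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]]   # A in R^{N x K}, N=3, K=2
B = [[2.0, 0.0], [1.0, -3.0]]               # B in R^{K x L}, L=2

N, K, L = len(A), len(B), len(B[0])
C = [[sum(A[n][k] * B[k][l] for k in range(K)) for l in range(L)]
     for n in range(N)]                     # C = AB, in R^{N x L}

for n in range(N):
    alpha_n = A[n]                          # anchor-position vector α_n
    for l in range(L):
        b_l = [B[k][l] for k in range(K)]   # anchor vector b_l
        assert C[n][l] == sum(a * b for a, b in zip(alpha_n, b_l))
print(C)
```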

 

This one-to-one analysis then generalizes (via Eqs. (16) and (17) of the paper, which appeared here as images) to the many-to-one case:

 

Many-to-One Cascade

It is straightforward to generalize the one-to-one cascaded case to the many-to-one cascaded case. The correlation of the first-layer RECOS units can be written as

                                                         Y = A^T X                                                          (18)

where

                                                  Y = [y_1, …, y_P],        X = [x_1, …, x_P],           (19)

                                                           z = B^T ỹ,                                                  (20)

where z ∈ R^L, B ∈ R^{PK×L}, and ỹ = (y_1^T, …, y_P^T)^T ∈ R^{PK} is formed by the cascade of the P output vectors of the first-layer RECOS units. Finally, the correlation of the compound effect of Eqs. (18) and (20) can be written as

                                                       z = C^T x,    C ≡ AB,                                               (21)

Anchor matrix A extracts representative patterns in different regions while anchor matrix B is used to stitch these spatially-dependent representative patterns to form larger representative patterns. For example, consider a large lawn composed by multiple grass patches. Suppose that the grass patterns can be captured by an anchor vector in A. Then, an anchor vector in B will provide a superposition rule to stitch these spatially distributed anchor vectors of grass to form a larger lawn.
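The many-to-one stitching of Eqs. (18)-(21) can be sketched in pure Python as well (toy values throughout, and rectification omitted, as in the paper's correlation analysis): P first-layer units share the anchor matrix A, their outputs are stacked into ỹ, and B ∈ R^{PK×L} combines them into a single second-layer output.

```python
# Toy sketch of the many-to-one cascade: P first-layer RECOS units share
# anchor matrix A; their outputs are stacked and combined by B.

def matvec_T(M, x):
    # compute M^T x for M given as a list of rows
    return [sum(M[i][j] * x[i] for i in range(len(M)))
            for j in range(len(M[0]))]

A = [[1.0, 0.0], [0.0, 1.0]]                 # shared layer-1 anchors, N=K=2
X = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]    # P=3 input patches x_p

y_tilde = []
for x_p in X:                                # first layer: y_p = A^T x_p
    y_tilde.extend(matvec_T(A, x_p))         # stack into ỹ in R^{PK}

B = [[1.0], [0.0], [0.0], [1.0], [2.0], [2.0]]  # layer-2 anchors, PK x L (L=1)
z = matvec_T(B, y_tilde)                     # z = B^T ỹ
print(z)
```

The single column of B plays the stitching role described above: it weights each patch-level response and superposes them into one larger-scale response, like assembling grass patches into a lawn.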

Conclusion

The author further considers the implications of the model: a comparison of one-layer and two-layer systems, the role of fully-connected layers, and design rules for multi-layer CNNs, concluding with a comprehensive set of open problems:

In this work, a RECOS model was adopted to explain the role of the nonlinear clipping function in CNNs, and a simple matrix analysis was used to explain the advantage of the two-layer RECOS model over the single-layer RECOS model. The proposed RECOS mathematical model is centered on the selection of anchor vectors. CNNs do offer a very powerful tool for image processing and understanding. There are however a few open problems remaining in CNN interpretability and wider applications. Some of them are listed below:

1. Application-Specific CNN Architecture

2. Robustness to Input Variations

3. Weakly Supervised Learning

Bodytext Image: MNIST For ML Beginners

Featured Image:  pdf in IMEC 2006
