{"title": "A Probabilistic Framework for Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2558, "page_last": 2566, "abstract": "We develop a probabilistic framework for deep learning based on the Deep Rendering Mixture Model (DRMM), a new generative probabilistic model that explicitly captures variations in data due to latent task nuisance variables. We demonstrate that max-sum inference in the DRMM yields an algorithm that exactly reproduces the operations in deep convolutional neural networks (DCNs), providing a first-principles derivation. Our framework provides new insights into the successes and shortcomings of DCNs as well as a principled route to their improvement. DRMM training via the Expectation-Maximization (EM) algorithm is a powerful alternative to DCN back-propagation, and initial training results are promising. Classification based on the DRMM and other variants outperforms DCNs in supervised digit classification, training 2-3x faster while achieving similar accuracy. Moreover, the DRMM is applicable to semi-supervised and unsupervised learning tasks, achieving results that are state-of-the-art in several categories on the MNIST benchmark and comparable to state of the art on the CIFAR10 benchmark.", "full_text": "A Probabilistic Framework for Deep Learning\n\nAnkit B. Patel\n\nBaylor College of Medicine, Rice University\n\nankitp@bcm.edu, abp4@rice.edu\n\nTan Nguyen\nRice University\nmn15@rice.edu\n\nRichard G. Baraniuk\n\nRice University\nrichb@rice.edu\n\nAbstract\n\nWe develop a probabilistic framework for deep learning based on the Deep Rendering\nMixture Model (DRMM), a new generative probabilistic model that explicitly\ncaptures variations in data due to latent task nuisance variables. We demonstrate\nthat max-sum inference in the DRMM yields an algorithm that exactly reproduces\nthe operations in deep convolutional neural networks (DCNs), providing a first-principles\nderivation. 
Our framework provides new insights into the successes and\nshortcomings of DCNs as well as a principled route to their improvement. DRMM\ntraining via the Expectation-Maximization (EM) algorithm is a powerful alternative\nto DCN back-propagation, and initial training results are promising. Classi\ufb01cation\nbased on the DRMM and other variants outperforms DCNs in supervised digit\nclassi\ufb01cation, training 2-3\u21e5 faster while achieving similar accuracy. Moreover, the\nDRMM is applicable to semi-supervised and unsupervised learning tasks, achiev-\ning results that are state-of-the-art in several categories on the MNIST benchmark\nand comparable to state of the art on the CIFAR10 benchmark.\n\nIntroduction\n\n1\nHumans are adept at a wide array of complicated sensory inference tasks, from recognizing objects\nin an image to understanding phonemes in a speech signal, despite signi\ufb01cant variations such as\nthe position, orientation, and scale of objects and the pronunciation, pitch, and volume of speech.\nIndeed, the main challenge in many sensory perception tasks in vision, speech, and natural language\nprocessing is a high amount of such nuisance variation. Nuisance variations complicate perception\nby turning otherwise simple statistical inference problems with a small number of variables (e.g.,\nclass label) into much higher-dimensional problems. The key challenge in developing an inference\nalgorithm is then how to factor out all of the nuisance variation in the input. Over the past few decades,\na vast literature that approaches this problem from myriad different perspectives has developed, but\nthe most dif\ufb01cult inference problems have remained out of reach.\nRecently, a new breed of machine learning algorithms have emerged for high-nuisance inference\ntasks, achieving super-human performance in many cases. 
A prime example of such an architecture\nis the deep convolutional neural network (DCN), which has seen great success in tasks like visual\nobject recognition and localization, speech recognition and part-of-speech recognition.\nThe success of deep learning systems is impressive, but a fundamental question remains: Why do they\nwork? Intuitions abound to explain their success. Some explanations focus on properties of feature\ninvariance and selectivity developed over multiple layers, while others credit raw computational\npower and the amount of available training data. However, beyond these intuitions, a coherent\ntheoretical framework for understanding, analyzing, and synthesizing deep learning architectures has\nremained elusive.\nIn this paper, we develop a new theoretical framework that provides insights into both the successes\nand shortcomings of deep learning systems, as well as a principled route to their design and improve-\nment. Our framework is based on a generative probabilistic model that explicitly captures variation\ndue to latent nuisance variables. The Rendering Mixture Model (RMM) explicitly models nuisance\nvariation through a rendering function that combines task target variables (e.g., object class in an\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fobject recognition) with a collection of task nuisance variables (e.g., pose). The Deep Rendering\nMixture Model (DRMM) extends the RMM in a hierarchical fashion by rendering via a product of\naf\ufb01ne nuisance transformations across multiple levels of abstraction. The graphical structures of the\nRMM and DRMM enable ef\ufb01cient inference via message passing (e.g., using the max-sum/product\nalgorithm) and training via the expectation-maximization (EM) algorithm. A key element of our\nframework is the relaxation of the RMM/DRMM generative model to a discriminative one in order to\noptimize the bias-variance tradeoff. 
Below, we demonstrate that the computations involved in joint\nMAP inference in the relaxed DRMM coincide exactly with those in a DCN.\nThe intimate connection between the DRMM and DCNs provides a range of new insights into how\nand why they work and do not work. While our theory and methods apply to a wide range of different\ninference tasks (including, for example, classi\ufb01cation, estimation, regression, etc.) that feature a\nnumber of task-irrelevant nuisance variables (including, for example, object and speech recognition),\nfor concreteness of exposition, we will focus below on the classi\ufb01cation problem underlying visual\nobject recognition. The proofs of several results appear in the Appendix.\n\n2 Related Work\nTheories of Deep Learning. Our theoretical work shares similar goals with several others such\nas the i-Theory [1] (one of the early inspirations for this work), Nuisance Management [24], the\nScattering Transform [6], and the simple sparse network proposed by Arora et al. [2].\nHierarchical Generative Models. The DRMM is closely related to several hierarchical models,\nincluding the Deep Mixture of Factor Analyzers [27] and the Deep Gaussian Mixture Model [29].\nLike the above models, the DRMM attempts to employ parameter sharing, capture the notion of\nnuisance transformations explicitly, learn selectivity/invariance, and promote sparsity. However,\nthe key features that distinguish the DRMM approach from others are: (i) The DRMM explicitly\nmodels nuisance variation across multiple levels of abstraction via a product of af\ufb01ne transformations.\nThis factorized linear structure serves dual purposes: it enables (ii) tractable inference (via the max-\nsum/product algorithm), and (iii) it serves as a regularizer to prevent over\ufb01tting by an exponential\nreduction in the number of parameters. 
Critically, (iv) inference is not performed for a single variable\nof interest but instead for the full global con\ufb01guration of nuisance variables. This is justi\ufb01ed in low-\nnoise settings. And most importantly, (v) we can derive the structure of DCNs precisely, endowing\nDCN operations such as the convolution, recti\ufb01ed linear unit, and spatial max-pooling with principled\nprobabilistic interpretations. Independently from our work, Soatto et al. [24] also focus strongly on\nnuisance management as the key challenge in de\ufb01ning good scene representations. However, their\nwork considers max-pooling and ReLU as approximations to a marginalized likelihood, whereas our\nwork interprets those operations differently, in terms of max-sum inference in a speci\ufb01c probabilistic\ngenerative model. The work on the number of linear regions in DCNs [14] is complementary to our\nown, in that it sheds light on the complexity of functions that a DCN can compute. Both approaches\ncould be combined to answer questions such as: How many templates are required for accurate\ndiscrimination? How many samples are needed for learning? We plan to pursue these questions in\nfuture work.\nSemi-Supervised Neural Networks. Recent work in neural networks designed for semi-supervised\nlearning (few labeled data, lots of unlabeled data) has seen the resurgence of generative-like ap-\nproaches, such as Ladder Networks [17], Stacked What-Where Autoencoders (SWWAE) [31] and\nmany others. These network architectures augment the usual task loss with one or more regularization\nterm, typically including an image reconstruction error, and train jointly. 
A key difference with our\nDRMM-based approach is that these networks do not arise from a proper probabilistic density and as\nsuch they must resort to learning the bottom-up recognition and top-down reconstruction weights\nseparately, and they cannot keep track of uncertainty.\n\n3 The Deep Rendering Mixture Model: Capturing Nuisance Variation\nAlthough we focus on the DRMM in this paper, we define and explore several other interesting\nvariants, including the Deep Rendering Factor Model (DRFM) and the Evolutionary DRMM\n(E-DRMM), both of which are discussed in more detail in [16] and the Appendix. The E-DRMM\nis particularly important, since its max-sum inference algorithm yields a decision tree of the type\nemployed in a random decision forest classifier [5].\n\n[Figure 1 panels: (A) Rendering Mixture Model and Rendering Factor Model; (B) Deep Rendering Mixture Model; (C) Deep Sparse Path Model.]\n\nFigure 1: Graphical model depiction of (A) the Shallow Rendering Models and (B) the DRMM. All\ndependence on pixel location x has been suppressed for clarity. (C) The Sparse Sum-over-Paths\nformulation of the DRMM. A rendering path contributes only if it is active (green arrows).\n\n3.1 The (Shallow) Rendering Mixture Model\nThe RMM is a generative probabilistic model for images that explicitly models the relationship\nbetween images I of the same object c subject to nuisance $g \\in \\mathcal{G}$, where $\\mathcal{G}$ is the set of all nuisances\n(see Fig. 1A for the graphical model depiction).\n\n$$c \\sim \\mathrm{Cat}(\\{\\pi_c\\}_{c \\in \\mathcal{C}}), \\quad g \\sim \\mathrm{Cat}(\\{\\pi_g\\}_{g \\in \\mathcal{G}}), \\quad a \\sim \\mathrm{Bern}(\\{\\pi_a\\}_{a \\in \\mathcal{A}}), \\quad I = a\\,\\mu_{cg} + \\text{noise}. \\tag{1}$$\n\nHere, $\\mu_{cg}$ is a template that is a function of the class c and the nuisance g. The switching variable\n$a \\in \\mathcal{A} = \\{\\mathrm{ON}, \\mathrm{OFF}\\}$ determines whether or not to render the template at a particular patch; a\nsparsity prior on a thus encourages each patch to have a few causes. 
The noise distribution is from the\nexponential family, but without loss of generality we illustrate below using Gaussian noise $\\mathcal{N}(0, \\sigma^2 \\mathbf{1})$.\nWe assume that the noise is i.i.d. as a function of pixel location x and that the class and nuisance\nvariables are independently distributed according to categorical distributions. (Independence is\nmerely a convenience for the development; in practice, g can depend on c.) Finally, since the world is\nspatially varying and an image can contain a number of different objects, it is natural to break the\nimage up into a number of patches, each centered on a single pixel x. The RMM described in (1)\nthen applies at the patch level, where c, g, and a depend on pixel/patch location x. We will omit the\ndependence on x when it is clear from context.\nInference in the Shallow RMM Yields One Layer of a DCN. We now connect the RMM with the\ncomputations in one layer of a deep convolutional network (DCN). To perform object recognition\nwith the RMM, we must marginalize out the nuisance variables g and a. Maximizing the log-posterior\nover $g \\in \\mathcal{G}$ and $a \\in \\mathcal{A}$ and then choosing the most likely class yields the max-sum classifier\n\n$$\\hat{c}(I) = \\operatorname*{argmax}_{c \\in \\mathcal{C}} \\max_{g \\in \\mathcal{G}} \\max_{a \\in \\mathcal{A}} \\; \\ln p(I|c, g, a) + \\ln p(c, g, a) \\tag{2}$$\n\nthat computes the most likely global configuration of target and nuisance variables for the image.\nAssuming that Gaussian noise is added to the template, the image is normalized so that $\\|I\\|_2 = 1$,\nand c, g are uniformly distributed, (2) becomes\n\n$$\\hat{c}(I) \\equiv \\operatorname*{argmax}_{c \\in \\mathcal{C}} \\max_{g \\in \\mathcal{G}} \\max_{a \\in \\mathcal{A}} \\; a\\,(\\langle w_{cg} | I \\rangle + b_{cg}) + b_a = \\operatorname*{argmax}_{c \\in \\mathcal{C}} \\max_{g \\in \\mathcal{G}} \\; \\mathrm{ReLU}(\\langle w_{cg} | I \\rangle + b_{cg}) + b_0, \\tag{3}$$\n\nwhere $\\mathrm{ReLU}(u) \\equiv (u)_+ = \\max\\{u, 0\\}$ is the soft-thresholding operation performed by the\nrectified linear units in modern DCNs. Here we have reparameterized the RMM model from the\nmoment parameters $\\theta \\equiv \\{\\sigma^2, \\mu_{cg}, \\pi_a\\}$ to the natural parameters\n$\\eta(\\theta) \\equiv \\{ w_{cg} \\equiv \\frac{1}{\\sigma^2}\\mu_{cg},\\; b_{cg} \\equiv -\\frac{1}{2\\sigma^2}\\|\\mu_{cg}\\|_2^2,\\; b_a \\equiv \\ln p(a) = \\ln \\pi_a,\\; b_0 \\equiv \\ln\\big(\\frac{p(a=1)}{p(a=0)}\\big) \\}$.\nThe relationships $\\eta(\\theta)$ are referred to as the generative parameter constraints.\nWe now demonstrate that the sequence of operations in the max-sum classifier in (3) coincides exactly\nwith the operations involved in one layer of a DCN: image normalization, linear template matching,\nthresholding, and max pooling. First, the image is normalized (by assumption). Second, the image is\nfiltered with a set of noise-scaled rendered templates $w_{cg}$. If we assume translational invariance in\nthe RMM, then the rendered templates $w_{cg}$ yield a convolutional layer in a DCN [10] (see Appendix\nLemma A.2). Third, the resulting activations (log-probabilities of the hypotheses) are passed through\na pooling layer; if g is a translational nuisance, then taking the maximum over g corresponds to max\npooling in a DCN. Fourth, since the switching variables are latent (unobserved), we max-marginalize\nover them during classification. This leads to the ReLU operation (see Appendix Proposition A.3).\n3.2 The Deep Rendering Mixture Model: Capturing Levels of Abstraction\nMarginalizing over the nuisance $g \\in \\mathcal{G}$ in the RMM is intractable for modern datasets, since $\\mathcal{G}$ will\ncontain all configurations of the high-dimensional nuisance variables g. In response, we extend the\nRMM into a hierarchical Deep Rendering Mixture Model (DRMM) by factorizing g into a number of\ndifferent nuisance variables $g^{(1)}, g^{(2)}, \\ldots, g^{(L)}$ at different levels of abstraction. The DRMM image\ngeneration process starts at the highest level of abstraction ($\\ell = L$), with the random choice of the\nobject class $c^{(L)}$ and overall nuisance $g^{(L)}$. It is then followed by random choices of the lower-level\ndetails $g^{(\\ell)}$ (we absorb the switching variable a into g for brevity), progressively rendering more\nconcrete information level-by-level ($\\ell \\to \\ell - 1$), until the process finally culminates in a fully rendered\nD-dimensional image I ($\\ell = 0$). Generation in the DRMM takes the form:\n\n$$c^{(L)} \\sim \\mathrm{Cat}(\\{\\pi_{c^{(L)}}\\}), \\quad g^{(\\ell)} \\sim \\mathrm{Cat}(\\{\\pi_{g^{(\\ell)}}\\}) \\quad \\forall \\ell \\in [L] \\tag{4}$$\n$$\\mu_{c^{(L)}g} \\equiv \\Lambda_g \\mu_{c^{(L)}} \\equiv \\Lambda^{(1)}_{g^{(1)}} \\Lambda^{(2)}_{g^{(2)}} \\cdots \\Lambda^{(L-1)}_{g^{(L-1)}} \\Lambda^{(L)}_{g^{(L)}} \\mu_{c^{(L)}} \\tag{5}$$\n$$I \\sim \\mathcal{N}(\\mu_{c^{(L)}g}, \\sigma^2 \\mathbf{1}_D), \\tag{6}$$\n\nwhere the latent variables, parameters, and helper variables are defined in full detail in Appendix B.\nThe DRMM is a deep Gaussian Mixture Model (GMM) with special constraints on the latent variables.\nHere, $c^{(L)} \\in \\mathcal{C}^L$ and $g^{(\\ell)} \\in \\mathcal{G}^\\ell$, where $\\mathcal{C}^L$ is the set of target-relevant nuisance variables, and $\\mathcal{G}^\\ell$ is the\nset of all target-irrelevant nuisance variables at level $\\ell$. The rendering path is defined as the sequence\n$(c^{(L)}, g^{(L)}, \\ldots, g^{(\\ell)}, \\ldots, g^{(1)})$ from the root (overall class) down to the individual pixels at $\\ell = 0$.\n$\\mu_{c^{(L)}g}$ is the template used to render the image, and $\\Lambda_g \\equiv \\prod_\\ell \\Lambda_{g^{(\\ell)}}$ represents the sequence of local\nnuisance transformations that partially render finer-scale details as we move from abstract to concrete.\nNote that each $\\Lambda^{(\\ell)}_{g^{(\\ell)}}$ is an affine transformation with a bias term $\\alpha^{(\\ell)}_{g^{(\\ell)}}$ that we have suppressed for\nclarity. Fig. 1B illustrates the corresponding graphical model. As before, we have suppressed the\ndependence of $g^{(\\ell)}$ on the pixel location $x^{(\\ell)}$ at level $\\ell$ of the hierarchy.\nSum-Over-Paths Formulation of the DRMM. We can rewrite the DRMM generation process\nby expanding out the matrix multiplications into scalar products. This yields an interesting new\nperspective on the DRMM, as each pixel intensity $I_x = \\sum_p \\lambda^{(L)}_p a^{(L)}_p \\cdots \\lambda^{(1)}_p a^{(1)}_p$ is the sum, over all\nactive paths to that pixel, of the product of weights along that path. A rendering path p is active iff\nevery switch on the path is active, i.e., $\\prod_\\ell a^{(\\ell)}_p = 1$. 
While exponentially many possible rendering\npaths exist, only a very small fraction, controlled by the sparsity of a, are active. Fig. 1C depicts the\nsum-over-paths formulation graphically.\nRecursive and Nonnegative Forms. We can rewrite the DRMM into a recursive form as\n$z^{(\\ell)} = \\Lambda^{(\\ell+1)}_{g^{(\\ell+1)}} z^{(\\ell+1)}$, where $z^{(L)} \\equiv \\mu_{c^{(L)}}$ and $z^{(0)} \\equiv I$. We refer to the helper latent variables $z^{(\\ell)}$ as\nintermediate rendered templates. We also define the Nonnegative DRMM (NN-DRMM) as a DRMM\nwith an extra nonnegativity constraint on the intermediate rendered templates, $z^{(\\ell)} \\geq 0 \\;\\; \\forall \\ell \\in [L]$.\nThe latter is enforced in training via the use of a ReLU operation in the top-down reconstruction\nphase of inference. Throughout the rest of the paper, we will focus on the NN-DRMM, leaving the\nunconstrained DRMM for future work. For brevity, we will drop the NN prefix.\nFactor Model. We also define and explore a variant of the DRMM in which the top-level latent\nvariable is Gaussian, $z^{(L+1)} \\sim \\mathcal{N}(0, \\mathbf{1}_d) \\in \\mathbb{R}^d$, and the recursive generation process is otherwise\nidentical to the DRMM: $z^{(\\ell)} = \\Lambda^{(\\ell+1)}_{g^{(\\ell+1)}} z^{(\\ell+1)}$, where $g^{(L+1)} \\equiv c^{(L)}$. We call this the Deep Rendering\nFactor Model (DRFM). The DRFM is closely related to the Spike-and-Slab Sparse Coding model\n[22]. Below we explore some training results, but we leave most of the exploration for future work\n(see Fig. 3 in Appendix C for the architecture of the RFM, the shallow version of the DRFM).\nNumber of Free Parameters. Compared to the shallow RMM, which has $D\\,|\\mathcal{C}^L| \\prod_\\ell |\\mathcal{G}^\\ell|$ parameters,\nthe DRMM has only $\\sum_\\ell |\\mathcal{G}^{\\ell+1}|\\, D_\\ell D_{\\ell+1}$ parameters, an exponential reduction in the number of free\nparameters. (Here $\\mathcal{G}^{L+1} \\equiv \\mathcal{C}^L$ and $D_\\ell$ is the number of units in the $\\ell$-th layer, with $D_0 \\equiv D$.) This\nenables efficient inference, learning, and better generalization. Note that we have assumed dense\n(fully connected) $\\Lambda_g$\u2019s here; if we impose more structure (e.g. 
translation invariance), the number of\nparameters will be further reduced.\nBottom-Up Inference. As in the shallow RMM, given an input image I the DRMM classifier infers\nthe most likely global configuration $\\{c^{(L)}, g^{(\\ell)}\\}$, $\\ell = 0, 1, \\ldots, L$ by executing the max-sum/product\nmessage passing algorithm in two stages: (i) bottom-up (from fine-to-coarse) to infer the overall class\nlabel $\\hat{c}^{(L)}$ and (ii) top-down (from coarse-to-fine) to infer the latent variables $\\hat{g}^{(\\ell)}$ at all intermediate\nlevels $\\ell$. First, we will focus on the fine-to-coarse pass since it leads directly to DCNs.\nUsing (3), the fine-to-coarse NN-DRMM inference algorithm for inferring the most likely category\n$\\hat{c}^{(L)}$ is given by\n\n$$\\operatorname*{argmax}_{c^{(L)} \\in \\mathcal{C}} \\max_{g \\in \\mathcal{G}} \\mu_{c^{(L)}g}^{T} I = \\operatorname*{argmax}_{c^{(L)} \\in \\mathcal{C}} \\max_{g \\in \\mathcal{G}} \\mu_{c^{(L)}}^{T} \\prod_{\\ell=L}^{1} \\Lambda_{g^{(\\ell)}}^{T} I$$\n$$= \\operatorname*{argmax}_{c^{(L)} \\in \\mathcal{C}} \\mu_{c^{(L)}}^{T} \\max_{g^{(L)} \\in \\mathcal{G}^L} \\Lambda_{g^{(L)}}^{T} \\cdots \\underbrace{\\max_{g^{(1)} \\in \\mathcal{G}^1} \\Lambda_{g^{(1)}}^{T} | I}_{\\equiv I^{(1)}} = \\cdots \\equiv \\operatorname*{argmax}_{c^{(L)} \\in \\mathcal{C}} \\mu_{c^{(L)}}^{T} I^{(L)}. \\tag{7}$$\n\nHere, we have assumed the bias terms $\\alpha_{g^{(\\ell)}} = 0$. In the second line, we used the max-product\nalgorithm (distributivity of max over products, i.e., for a > 0, max{ab, ac} = a max{b, c}). See\nAppendix B for full details. This enables us to rewrite (7) recursively:\n\n$$I^{(\\ell+1)} \\equiv \\max_{g^{(\\ell+1)} \\in \\mathcal{G}^{\\ell+1}} \\underbrace{(\\Lambda_{g^{(\\ell+1)}})^{T}}_{\\equiv W^{(\\ell+1)}} I^{(\\ell)} = \\mathrm{MaxPool}(\\mathrm{ReLU}(\\mathrm{Conv}(I^{(\\ell)}))), \\tag{8}$$\n\nwhere $I^{(\\ell)}$ denotes the output feature maps of layer $\\ell$, $I^{(0)} \\equiv I$, and $W^{(\\ell)}$ are the filters/weights for layer $\\ell$.\nComparing to (3), we see that the $\\ell$-th iteration of (7) and (8) corresponds to feedforward propagation\nin the $\\ell$-th layer of a DCN. Thus a DCN\u2019s operation has a probabilistic interpretation as fine-to-coarse\ninference of the most probable configuration in the DRMM.\nTop-Down Inference. 
A unique contribution of our generative model-based approach is that we have\na principled derivation of a top-down inference algorithm for the NN-DRMM (Appendix B). The\nresulting algorithm amounts to a simple top-down reconstruction term $\\hat{I}_n = \\Lambda_{\\hat{g}_n} \\mu_{\\hat{c}^{(L)}_n}$.\nDiscriminative Relaxations: From Generative to Discriminative Classifiers. We have constructed\na correspondence between the DRMM and DCNs, but the mapping is not yet complete.\nIn particular, recall the generative constraints on the weights and biases. DCNs do not have such\nconstraints; their weights and biases are free parameters. As a result, when faced with training data\nthat violates the DRMM\u2019s underlying assumptions, the DCN will have more freedom to compensate.\nIn order to complete our mapping from the DRMM to DCNs, we relax these parameter constraints,\nallowing the weights and biases to be free and independent parameters. We refer to this process as a\ndiscriminative relaxation of a generative classifier ([15, 4]; see Appendix D for details).\n3.3 Learning the Deep Rendering Model via the Expectation-Maximization (EM) Algorithm\nWe describe how to learn the DRMM parameters from training data via the hard EM algorithm in\nAlgorithm 1. The DRMM E-step consists of bottom-up and top-down (reconstruction) E-steps at\neach layer $\\ell$ in the model. The $\\gamma_{ncg} \\equiv p(c, g|I_n; \\theta)$ are the responsibilities, where for brevity we have\nabsorbed a into g. The DRMM M-step consists of M-steps for each layer $\\ell$ in the model. The per-layer\nM-step in turn consists of a responsibility-weighted regression, where $\\mathrm{GLS}(y_n \\sim x_n)$ denotes the\nsolution to a generalized least squares regression problem that predicts targets $y_n$ from predictors\n$x_n$ and is closely related to the SVD. The Iverson bracket is defined as $[\\![b]\\!] \\equiv 1$ if expression b is\ntrue and 0 otherwise. There are several interesting and useful features of the EM algorithm. 
First,\nwe note that it is a derivative-free alternative to the back propagation algorithm for training that is\nboth intuitive and potentially much faster (provided a good implementation for the GLS problem).\nSecond, it is easily parallelized over layers, since the M-step updates each layer separately (model\nparallelism). Moreover, it can be extended to a batch version so that at each iteration the model is\nsimultaneously updated using separate subsets of the data (data parallelism). This will enable training\nto be distributed easily across multiple machines. In this vein, our EM algorithm shares several\nfeatures with the ADMM-based Bregman iteration algorithm in [28]. However, the motivation there is\nfrom an optimization perspective and so the resulting training algorithm is not derived from a proper\nprobabilistic density. Third, it is far more interpretable via its connections to (deep) sparse coding\nand to the hard EM algorithm for GMMs. The sum-over-paths formulation makes it particularly clear\nthat the mixture components are paths (from root to pixels) in the DRMM.\n\nAlgorithm 1 Hard EM and EG Algorithms for the DRMM\n\nE-step: $\\hat{c}_n, \\hat{g}_n = \\operatorname*{argmax}_{c,g} \\gamma_{ncg}$\n\nM-step: $\\hat{\\Lambda}_{g^{(\\ell)}} = \\mathrm{GLS}\\big( I_n^{(\\ell-1)} \\sim \\hat{z}_n^{(\\ell)} \\;|\\; g^{(\\ell)} = \\hat{g}_n^{(\\ell)} \\big) \\quad \\forall g^{(\\ell)}$\n\nG-step: $\\Delta \\hat{\\Lambda}_{g^{(\\ell)}} \\propto \\nabla_{\\Lambda_{g^{(\\ell)}}} \\ell_{DRMM}(\\theta)$\n\nG-step. For the training results in this paper, we use the Generalized EM algorithm wherein we\nreplace the M-step with a gradient descent based G-step (see Algorithm 1). This is useful for\ncomparison with backpropagation-based training and for ease of implementation.\nFlexibility and Extensibility. Since we can choose different priors/types for the nuisances g, the\nlarger DRMM family could be useful for modeling a wider range of inputs, including scenes, speech\nand text. 
The EM algorithm can then be used to train the whole system end-to-end on different\nsources/modalities of labeled and unlabeled data. Moreover, the capability to sample from the model\nallows us to probe what is captured by the DRMM, providing us with principled ways to improve the\nmodel. And \ufb01nally, in order to properly account for noise/uncertainty, it is possible in principle to\nextend this algorithm into a soft EM algorithm. We leave these interesting extensions for future work.\n\n3.4 New Insights into Deep Convnets\nDCNs are Message Passing Networks. The convolution, Max-Pooling and ReLu operations in a\nDCN correspond to max-sum/product inference in a DRMM. Note that by \u201cmax-sum-product\u201d we\nmean a novel combination of max-sum and max-product as described in more detail in the proofs in\nthe Appendix. Thus, we see that architectures and layer types commonly used in today\u2019s DCNs can\nbe derived from precise probabilistic assumptions that entirely determine their structure. The DRMM\ntherefore uni\ufb01es two perspectives \u2014 neural network and probabilistic inference (see Table 2 in the\nAppendix for details).\nShortcomings of DCNs. DCNs perform poorly in categorizing transparent objects [20]. This\nmight be explained by the fact that transparent objects generate pixels that have multiple sources,\ncon\ufb02icting with the DRMM sparsity prior on a, which encourages few sources. DCNs also fail to\nclassify slender and man-made objects [20]. This is because of the locality imposed by the locally-\nconnected/convolutional layers, or equivalently, the small size of the template \u00b5c(L)g in the DRMM.\nAs a result, DCNs fail to model long-range correlations.\nClass Appearance Models and Activity Maximization. The DRMM enables us to understand how\ntrained DCNs distill and store knowledge from past experiences in their parameters. 
Specifically, the\nDRMM generates rendered templates $\\mu_{c^{(L)}g}$ via a mixture of products of affine transformations, thus\nimplying that class appearance models in DCNs are stored in a similar factorized-mixture form over\nmultiple levels of abstraction. As a result, it is the product of all the filters/weights over all layers\nthat yields meaningful images of objects (Eq. 6). We can also shed new light on another approach\nto understanding DCN memories that proceeds by searching for input images that maximize the\nactivity of a particular class unit (say, class of cats) [23], a technique we call activity maximization.\nResults from activity maximization on a high performance DCN trained on 15 million images are\nshown in Fig. 1 of [23]. The resulting images reveal much about how DCNs store memories. Using\nthe DRMM, the solution $I^*_{c^{(L)}}$ of the activity maximization for class $c^{(L)}$ can be derived as the sum\nof individual activity-maximizing patches $I^*_{P_i}$, each of which is a function of the learned DRMM\nparameters (see Appendix E). In particular,\n$I^*_{c^{(L)}} \\equiv \\sum_{P_i \\in \\mathcal{P}} I^*_{P_i}(c^{(L)}, g^*_{P_i}) \\propto \\sum_{P_i \\in \\mathcal{P}} \\mu(c^{(L)}, g^*_{P_i})$.\n\n[Figure 2 axis labels: Accuracy Rate (vertical), Layer (horizontal).]\n\nFigure 2: Information about latent nuisance variables at each layer (Left), training results from EG\nfor RFM (Middle) and DRFM (Right) on MNIST, as compared to DCNs of the same configuration.\n\nThis implies that $I^*_{c^{(L)}}$ contains multiple appearances of the same object but in various poses. Each\nactivity-maximizing patch has its own pose $g^*_{P_i}$, consistent with Fig. 1 of [23] and our own extensive\nexperiments with AlexNet, VGGNet, and GoogLeNet (data not shown). Such images provide strong\nconfirmational evidence that the underlying model is a mixture over nuisance parameters, as predicted\nby the DRMM.\nUnsupervised Learning of Latent Task Nuisances. 
A key goal of representation learning is to\ndisentangle the factors of variation that contribute to an image\u2019s appearance. Given our formulation of\nthe DRMM, it is clear that DCNs are discriminative classi\ufb01ers that capture these factors of variation\nwith latent nuisance variables g. As such, the theory presented here makes a clear prediction that for\na DCN, supervised learning of task targets will lead to unsupervised learning of latent task nuisance\nvariables. From the perspective of manifold learning, this means that the architecture of DCNs is\ndesigned to learn and disentangle the intrinsic dimensions of the data manifolds.\nIn order to test this prediction, we trained a DCN to classify synthetically rendered images of\nnaturalistic objects, such as cars and cats, with variation in factors such as location, pose, and lighting.\nAfter training, we probed the layers of the trained DCN to quantify how much linearly decodable\ninformation exists about the task target c(L) and latent nuisance variables g. Fig. 2 (Left) shows that\nthe trained DCN possesses signi\ufb01cant information about latent factors of variation and, furthermore,\nthe more nuisance variables, the more layers are required to disentangle the factors. This is strong\nevidence that depth is necessary and that the amount of depth required increases with the complexity\nof the class models and the nuisance variations.\n\n4 Experimental Results\nWe evaluate the DRMM and DRFM\u2019s performance on the MNIST dataset, a standard digit classi\ufb01ca-\ntion benchmark with a training set of 60,000 28 \u21e5 28 labeled images and a test set of 10,000 labeled\nimages. We also evaluate the DRMM\u2019s performance on CIFAR10, a dataset of natural objects which\ninclude a training set of 50,000 32 \u21e5 32 labeled images and a test set of 10,000 labeled images. In all\nexperiments, we use a full E-step that has a bottom-up phase and a principled top-down reconstruction\nphase. 
In order to approximate the class posterior in the DRMM, we include a Kullback-Leibler\ndivergence term between the inferred posterior p(c|I) and the true prior p(c) as a regularizer [9].\nWe also replace the M-step in the EM algorithm of Algorithm 1 by a G-step where we update\nthe model parameters via gradient descent. This variant of EM is known as the Generalized EM\nalgorithm [3], and here we refer to it as EG. All DRMM experiments were done with the NN-DRMM.\nConfigurations of our models and the corresponding DCNs are provided in Appendix I.\nSupervised Training. Supervised training results are shown in Table 3 in the Appendix. Shallow\nRFM: The 1-layer RFM (RFM sup) yields similar performance to a Convnet of the same configuration\n(1.21% vs. 1.30% test error). Also, as predicted by the theory of generative vs. discriminative\nclassifiers, EG training converges 2-3\u00d7 faster than a DCN (18 vs. 40 epochs to reach 1.5% test error,\nFig. 2, Middle). Deep RFM: An initial implementation of the 2-layer DRFM\nEG algorithm converges 2-3\u00d7 faster than a DCN of the same configuration, while achieving a\nsimilar asymptotic test error (Fig. 2, Right). Also, for completeness, we compare supervised training\nfor a 5-layer DRMM with a corresponding DCN, and they show comparable accuracy (0.89% vs.\n0.81%, Table 3).\nUnsupervised Training. We train the RFM and the 5-layer DRMM unsupervised with NU images,\nfollowed by an end-to-end re-training of the whole model (unsup-pretr) using NL labeled images. The\nresults and comparison to the SWWAE model are shown in Table 1. 
The DRMM model outperforms\nthe SWWAE model in both scenarios. (Filters and reconstructed images from the RFM are available\nin Appendix 4.)\n\nTable 1: Comparison of test error rates (%) between the best DRMM variants and other best published\nresults on the MNIST dataset for the semi-supervised setting (taken from [31]) with NU = 60K\nunlabeled images, of which NL \u2208 {100, 600, 1K, 3K} are labeled.\n\nModel | NL = 100 | NL = 600 | NL = 1K | NL = 3K\nConvnet [10] | 22.98 | 7.86 | 6.45 | 3.35\nMTC [18] | 12.03 | 5.13 | 3.64 | 2.57\nPL-DAE [11] | 10.49 | 5.03 | 3.46 | 2.69\nWTA-AE [13] | - | 2.37 | 1.92 | -\nSWWAE dropout [31] | 8.71 \u00b1 0.34 | 3.31 \u00b1 0.40 | 2.83 \u00b1 0.10 | 2.10 \u00b1 0.22\nM1+TSVM [8] | 11.82 \u00b1 0.25 | 5.72 | 4.24 | 3.49\nM1+M2 [8] | 3.33 \u00b1 0.14 | 2.59 \u00b1 0.05 | 2.40 \u00b1 0.02 | 2.18 \u00b1 0.04\nSkip Deep Generative Model [12] | 1.32 | - | - | -\nLadderNetwork [17] | 1.06 \u00b1 0.37 | - | 0.84 \u00b1 0.08 | -\nAuxiliary Deep Generative Model [12] | 0.96 | - | - | -\ncatGAN [25] | 1.39 \u00b1 0.28 | - | - | -\nImprovedGAN [21] | 0.93 \u00b1 0.065 | - | - | -\nRFM | 14.47 | 5.61 | 4.67 | 2.96\nDRMM 2-layer semi-sup | 11.81 | 3.73 | 2.88 | 1.72\nDRMM 5-layer semi-sup | 3.50 | 1.56 | 1.67 | 0.91\nDRMM 5-layer semi-sup NN+KL | 0.57 | - | - | -\nSWWAE unsup-pretr [31] | - | 9.80 | 6.135 | 4.41\nRFM unsup-pretr | 16.2 | 5.65 | 4.64 | 2.95\nDRMM 5-layer unsup-pretr | 12.03 | 3.61 | 2.73 | 1.68\n\nSemi-Supervised Training. For semi-supervised training, we use a randomly chosen subset of\nNL = 100, 600, 1K, and 3K labeled images and NU = 60K unlabeled images from the training\nand validation set. Results are shown in Table 1 for an RFM, a 2-layer DRMM, and a 5-layer DRMM,\nwith comparisons to related work. The DRMMs perform comparably to state-of-the-art models.\nSpecifically, the 5-layer DRMM yields the best results when NL = 3K and NL = 600 and the\nsecond-best result when NL = 1K. 
We also show the training results of a 9-layer DRMM on\nCIFAR10 in Table 4 in Appendix H. On CIFAR10, the DRMM yields results comparable to those of the\nbest semi-supervised methods. For more results and comparison with other works, see Appendix H.\n\n5 Conclusions\nUnderstanding successful deep vision architectures is important for improving performance and\nsolving harder tasks. In this paper, we have introduced a new family of hierarchical generative\nmodels, whose inference algorithms for two different models reproduce deep convnets and decision\ntrees, respectively. Our initial implementation of the DRMM EG algorithm outperforms DCN\nback-propagation in both supervised and unsupervised classification tasks and achieves comparable or\nstate-of-the-art performance on several semi-supervised classification tasks, with no architectural\nhyperparameter tuning.\nAcknowledgments. Thanks to Xaq Pitkow and Ben Poole for helpful feedback. ABP and RGB\nwere supported by IARPA via DoI/IBC contract D16PC00003.^1 RGB was supported by NSF\nCCF-1527501, AFOSR FA9550-14-1-0088, ARO W911NF-15-1-0316, and ONR N00014-12-1-0579.\nTN was supported by an NSF Graduate Research Fellowship and NSF IGERT Training Grant\n(DGE-1250104).\n\n^1 The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes\nnotwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of\nthe authors and should not be interpreted as necessarily representing the official policies or endorsements, either\nexpressed or implied, of IARPA, DoI/IBC, or the U.S. Government.\n\nReferences\n[1] F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio. Magic materials: a theory of deep hierarchical architectures for learning sensory representations. MIT CBCL Technical Report, 2013.\n[2] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. arXiv preprint arXiv:1310.6343, 2013.\n[3] C. M. Bishop. Pattern Recognition and Machine Learning, volume 4. Springer New York, 2006.\n[4] C. M. Bishop, J. Lasserre, et al. Generative or discriminative? getting the best of both worlds. Bayesian Statistics, 8:3\u201324, 2007.\n[5] L. Breiman. Random forests. Machine learning, 45(1):5\u201332, 2001.\n[6] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872\u20131886, 2013.\n[7] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.\n[8] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581\u20133589, 2014.\n[9] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.\n[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n[11] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 2013.\n[12] L. Maal\u00f8e, C. K. S\u00f8nderby, S. K. S\u00f8nderby, and O. Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.\n[13] A. Makhzani and B. J. Frey. Winner-take-all autoencoders. In Advances in Neural Information Processing Systems, pages 2773\u20132781, 2015.\n[14] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924\u20132932, 2014.\n[15] A. Ng and M. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. Advances in neural information processing systems, 14:841, 2002.\n[16] A. B. Patel, T. Nguyen, and R. G. Baraniuk. A probabilistic theory of deep learning. arXiv preprint arXiv:1504.00641, 2015.\n[17] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3532\u20133540, 2015.\n[18] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems, pages 2294\u20132302, 2011.\n[19] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 833\u2013840, 2011.\n[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211\u2013252, 2015.\n[21] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016.\n[22] A.-S. Sheikh, J. A. Shelton, and J. L\u00fccke. A truncated em approach for spike-and-slab sparse coding. Journal of Machine Learning Research, 15(1):2653\u20132687, 2014.\n[23] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.\n[24] S. Soatto and A. Chiuso. Visual representations: Defining properties and deep approximations. In International Conference on Learning Representations, 2016.\n[25] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.\n[26] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.\n[27] Y. Tang, R. Salakhutdinov, and G. Hinton. Deep mixtures of factor analysers. arXiv preprint arXiv:1206.4635, 2012.\n[28] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein. Training neural networks without gradients: A scalable admm approach. arXiv preprint arXiv:1605.02026, 2016.\n[29] A. van den Oord and B. Schrauwen. Factoring variations in natural images with deep gaussian mixture models. In Advances in Neural Information Processing Systems, pages 3518\u20133526, 2014.\n[30] V. N. Vapnik and V. Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998.\n[31] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked what-where autoencoders. arXiv preprint arXiv:1506.02351, 2016.", "award": [], "sourceid": 1326, "authors": [{"given_name": "Ankit", "family_name": "Patel", "institution": "Baylor College of Medicine and Rice University"}, {"given_name": "Minh", "family_name": "Nguyen", "institution": "Rice University"}, {"given_name": "Richard", "family_name": "Baraniuk", "institution": "Rice University"}]}