{"title": "Top-Down Regularization of Deep Belief Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1878, "page_last": 1886, "abstract": "Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results.", "full_text": "Top-Down Regularization of Deep Belief Networks\n\nHanlin Goh\u2217, Nicolas Thome, Matthieu Cord\n\nLaboratoire d\u2019Informatique de Paris 6\n\nUPMC \u2013 Sorbonne Universit\u00b4es, Paris, France\n{Firstname.Lastname}@lip6.fr\n\nJoo-Hwee Lim\u2020\n\njoohwee@i2r.a-star.edu.sg\n\nInstitute for Infocomm Research\n\nA*STAR, Singapore\n\nAbstract\n\nDesigning a principled and effective algorithm for learning deep architectures is a\nchallenging problem. The current approach involves two training phases: a fully\nunsupervised learning followed by a strongly discriminative optimization. We\nsuggest a deep learning strategy that bridges the gap between the two phases, re-\nsulting in a three-phase learning procedure. We propose to implement the scheme\nusing a method to regularize deep belief networks with top-down information. 
The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results.\n\n1 Introduction\n\nDeep architectures have strong representational power due to their hierarchical structures. They are capable of encoding highly varying functions and capturing complex relationships and high-level abstractions among high-dimensional data [1]. Traditionally, the multilayer perceptron is used to optimize such hierarchical models based on a discriminative criterion that models P(y|x) using error-backpropagating gradient descent [2, 3]. However, when the architecture is deep, it is challenging to train the entire network through supervised learning due to the large number of parameters, the non-convex optimization problem and the dilution of the error signal through the layers. This optimization may even lead to worse performance than that of shallower networks [4].\nRecent developments in unsupervised feature learning and deep learning algorithms have made it possible to learn deep feature hierarchies. Deep learning, in its current form, typically involves two consecutive learning phases. The first phase greedily learns unsupervised modules layer-by-layer from the bottom-up [1, 5]. Some common criteria for unsupervised learning include the maximum likelihood that models P(x) [1] and the input reconstruction error of vector x [5\u20137]. This is subsequently followed by a supervised phase that fine-tunes the network using a supervised, usually discriminative algorithm, such as supervised error backpropagation. 
The unsupervised learning phase initializes the parameters without taking into account the ultimate task of interest, such as classification. The second phase assumes the entire burden of modifying the model to fit the task.\nIn this work, we propose a gradual transition from the fully-unsupervised learning to the highly-discriminative optimization. This is done by adding an intermediate training phase between the two existing deep learning phases, which enhances the unsupervised representation by incorporating top-down information. To realize this notion, we introduce a new global (non-greedy) optimization that regularizes the deep belief network (DBN) from the top-down. We retain the same gradient descent procedure of updating the parameters of the DBN as the unsupervised learning phase. The new regularization method and deep learning strategy are applied to handwritten digit recognition and dictionary learning for object recognition, with competitive empirical results.\n\n\u2217Hanlin Goh is also with the Institute for Infocomm Research, A*STAR, Singapore and the Image and Pervasive Access Lab, CNRS UMI 2955, Singapore \u2013 France.\n\u2020Joo-Hwee Lim is also with the Image and Pervasive Access Lab, CNRS UMI 2955, Singapore \u2013 France.\n\n2 Related Work\n\nRestricted Boltzmann Machines. A restricted Boltzmann machine (RBM) [8] is a bipartite Markov random field with an input layer x \u2208 R^I and a latent layer z \u2208 R^J (see Figure 1). The layers are connected by undirected weights W \u2208 R^{I\u00d7J}. Each unit also receives input from a bias parameter b_j or c_i. 
The joint configuration of binary states {x, z} has an energy given by:\n\nE(x, z) = \u2212z^\u22a4 W x \u2212 b^\u22a4 z \u2212 c^\u22a4 x.   (1)\n\nThe probability assigned to x is given by:\n\nP(x) = (1/Z) \u2211_z exp(\u2212E(x, z)),   Z = \u2211_x \u2211_z exp(\u2212E(x, z)),   (2)\n\nwhere Z is known as the partition function, which normalizes P(x) to a valid distribution. The units in a layer are conditionally independent with distributions given by logistic functions:\n\nP(z|x) = \u220f_j P(z_j|x),   P(z_j|x) = 1/(1 + exp(\u2212w_j^\u22a4 x \u2212 b_j)),   (3)\nP(x|z) = \u220f_i P(x_i|z),   P(x_i|z) = 1/(1 + exp(\u2212w_i z \u2212 c_i)).   (4)\n\nFigure 1: Structure of the RBM.\n\nThis enables the model to be sampled via alternating Gibbs sampling between the two layers. To estimate the maximum likelihood of the data distribution P(x), the RBM is trained by taking the gradient of the log probability of the input data with respect to the parameters:\n\n\u2202 log P(x) / \u2202w_ij \u2248 \u27e8x_i z_j\u27e9_0 \u2212 \u27e8x_i z_j\u27e9_N,   (5)\n\nwhere \u27e8\u00b7\u27e9_t denotes the expectation under the distribution at the t-th sampling of the Markov chain. The first term samples the data distribution at t = 0, while the second term approximates the equilibrium distribution at t = \u221e using the contrastive divergence method [9] by using a small and finite number of sampling steps N to obtain a distribution of reconstructed states at t = N. RBMs have also been regularized to produce sparse representations [10, 11].\n\nSupervised Restricted Boltzmann Machines. To introduce class labels to the RBM, a one-hot coded output vector y \u2208 R^C is defined, where y_c = 1 iff c is the class index. Another set of weights V \u2208 R^{C\u00d7J} connects y with z. 
The two vectors are concatenated to form a new input vector [x, y] for the RBM, which is linked to z through [W^\u22a4, V^\u22a4], as shown in Figure 2. This supervised RBM models the joint distribution P(x, y). The energy function of this model can be extended to\n\nE(x, y, z) = \u2212z^\u22a4 W x \u2212 z^\u22a4 V y \u2212 b^\u22a4 z \u2212 c^\u22a4 x \u2212 d^\u22a4 y.   (6)\n\nThe conditional distribution of the concatenated vector is now:\n\nP(x, y|z) = P(x|z) P(y|z) = \u220f_i P(x_i|z) \u220f_c P(y_c|z),   (7)\n\nwhere P(x_i|z) is given in Equation 4 and the outputs y_c may either be logistic units or softmax units. The RBM may again be trained using the contrastive divergence algorithm [9] to approximate the maximum likelihood of the joint distribution.\n\nFigure 2: A supervised RBM jointly models inputs and outputs. Biases are omitted for simplicity.\n\nDuring inference, only x is given and y is set at a neutral value, which makes this part of the RBM \u2018noisy\u2019. The objective is to use x to \u2018denoise\u2019 y and obtain the prediction. This can be done by several iterations of alternating Gibbs sampling. If the number of classes is huge, the number of input units needs to be huge to maintain a high signal-to-noise ratio. Larochelle and Bengio [12] suggested coupling this generative model P(x, y) with a discriminative model P(y|x), which can help alleviate this issue. However, if the objective is to train a deep network, then with every new layer, the previous V has to be discarded and retrained.\nIt may also not be desirable to use a discriminative criterion directly from the outputs, especially in the initial layers of the network.\n\nDeep Belief Networks. 
Deep belief networks (DBN) [1] are probabilistic graphical models made up of a hierarchy of stochastic latent variables. Being universal approximators [13], they have been applied to a variety of problems such as image and video recognition [1, 14] and dimension reduction [15]. They follow a two-phase training strategy of unsupervised greedy pre-training followed by supervised fine-tuning.\nFor unsupervised pre-training, a stack of RBMs is trained greedily from the bottom-up, with the latent activations of each layer used as the inputs for the next RBM. Each new layer RBM models the data distribution P(x), such that when higher-level layers are sufficiently large, the variational bound on the likelihood always improves [1]. A popular method for supervised fine-tuning backpropagates the error given by P(y|x) to update the network\u2019s parameters. It has been shown to perform well when initialized by first learning a model of input data using unsupervised pre-training [15].\nAn alternative supervised method is a generative model that implements a supervised RBM (Figure 2) that models P(x, y) at the top layer. For training, the network employs the up-down back-fitting algorithm [1]. The algorithm is initialized by untying the network\u2019s recognition and generative weights. First, a stochastic bottom-up pass is performed and the generative weights are adjusted to be good at reconstructing the layer below. Next, a few iterations of alternating sampling using the respective conditional probabilities are done at the top-level supervised RBM between the concatenated vector and the latent layer. Using contrastive divergence, the RBM is updated by fitting to its posterior distribution. 
Finally, a stochastic top-down pass adjusts bottom-up recognition weights to reconstruct the activations of the layer above.\nIn this work, we extend the existing DBN training strategy by having an additional supervised training phase before the discriminative error backpropagation. A top-down regularization of the network\u2019s parameters is proposed. The network is optimized globally so that the inputs gradually map to the output through the layers. We also retain the simple method of using gradient descent to update the weights of the RBMs and retain the same convention for generative RBM learning.\n\n3 Top-Down RBM Regularization: The Building Block\n\nWe regularize RBM learning with targets obtained by sampling from higher-level representations.\n\nGeneric Cross-Entropy Regularization. The aim is to construct a top-down regularized building block for deep networks, instead of combining the optimization criteria directly [12], which is done for the supervised RBM model (Figure 2). To give control over individual elements in the latent vector, one way to manipulate the representations is to point-wise bias the activations for each latent variable j [11]. Given a training dataset D_train, a regularizer based on the cross-entropy loss can be defined to penalize the difference between the latent vector z and a target vector \u1e91:\n\nL_{RBM+reg}(D_train) = \u2212\u2211_{k=1}^{|D_train|} log P(x_k) \u2212 \u03b1 \u2211_{k=1}^{|D_train|} \u2211_{j=1}^{J} log P(\u1e91_jk | z_jk).   (8)\n\nThe update rule of the cross-entropy-regularized RBM can be modified to:\n\n\u2206w_ij \u221d \u27e8x_i s_j\u27e9_0 \u2212 \u27e8x_i z_j\u27e9_N,   (9)\n\nwhere\n\ns_j = (1 \u2212 \u03bb) z_j + \u03bb \u1e91_j   (10)\n\nis the merger of the latent and target activations used to update the parameters. Here, the influences of \u1e91_j and z_j are regulated by parameter \u03bb. If \u03bb = 0 or when the activations match (i.e. z_j = \u1e91_j), then the parameter update is exactly that of the original contrastive divergence learning algorithm.\n\nBuilding Block. The same principle of regularizing the latent activations can be used to combine signals from the bottom-up and top-down. This forms the building block for optimizing a DBN with top-down regularization. The basic building block is a three-layer structure consisting of three consecutive layers: the previous z_{l\u22121} \u2208 R^I, current z_l \u2208 R^J and next z_{l+1} \u2208 R^H layers. The layers are connected by two sets of weight parameters W_{l\u22121} and W_l to the previous and next layers respectively. For the current layer z_l, the bottom-up representations z_{l,l\u22121} are sampled from the previous layer z_{l\u22121} through weighted connections W_{l\u22121} with:\n\nP(z_{l,l\u22121,j} | z_{l\u22121}; W_{l\u22121}) = 1/(1 + exp(\u2212w_{l\u22121,j}^\u22a4 z_{l\u22121} \u2212 b_{l,j})),   (11)\n\nwhere the two terms in the subscripts of a sampled representation z_{dest,src} refer to the destination (dest) and source (src) layers respectively. Meanwhile, sampling from the next layer z_{l+1} via weights W_l drives the top-down representations z_{l,l+1}:\n\nP(z_{l,l+1,j} | z_{l+1}; W_l) = 1/(1 + exp(\u2212w_{l,j} z_{l+1} \u2212 c_{l,j})).   (12)\n\nThe objective is to learn the RBM parameters W_{l\u22121} that map from the previous layer z_{l\u22121} to the current latent layer z_{l,l\u22121}, by maximizing the likelihood of the previous layer P(z_{l\u22121}) while considering the top-down samples z_{l,l+1} from the next layer z_{l+1} as target representations. 
The loss function for a network with L layers can be broken down as:\n\nL_{DBN+topdown} = \u2211_{l=2}^{L} L_{l,RBM+topdown},   (13)\n\nwhere the cross-entropy regularized loss function for layer l is\n\nL_{l,RBM+topdown} = \u2212\u2211_{k=1}^{|D_train|} log P(z_{l\u22121,k}) \u2212 \u03b1 \u2211_{k=1}^{|D_train|} \u2211_{j=1}^{J} log P(z_{l,l+1,jk} | z_{l,l\u22121,jk}).   (14)\n\nThis results in the following gradient descent:\n\n\u2206w_{l\u22121,ij} = \u03b5 (\u27e8z_{l\u22121,l\u22122,i} s_{l,j}\u27e9_0 \u2212 \u27e8z_{l\u22121,l,i} z_{l,l\u22121,j}\u27e9_N),   (15)\n\nwhere\n\ns_{l,jk} = (1 \u2212 \u03bb_l) z_{l,l\u22121,jk} + \u03bb_l z_{l,l+1,jk}   (16)\n\nis the merged representation from the bottom-up signal (first term) and the top-down signal (second term) (see Figure 3), weighted by hyperparameter \u03bb_l. The bias towards one source of signal can be adjusted by selecting an appropriate \u03bb_l. Additionally, the alternating Gibbs sampling, necessary for the contrastive divergence updates, is performed from the unbiased bottom-up samples using Equation 11 and a symmetric decoder:\n\nP(z_{l\u22121,l,i} = 1 | z_{l,l\u22121}; W_{l\u22121}) = 1/(1 + exp(\u2212w_{l\u22121,i} z_{l,l\u22121} \u2212 c_{l\u22121,i})).   (17)\n\nFigure 3: The basic building block learns a bottom-up latent representation regularized by top-down signals. Bottom-up z_{l,l\u22121} and top-down z_{l,l+1} latent activations are sampled from z_{l\u22121} and z_{l+1} respectively. 
They are merged to get the modified activations s_l used for parameter updates. Reconstructions independently driven from the input signals form the Gibbs sampling Markov chain.\n\n4 Globally-Optimized Deep Belief Networks\n\nForward-Backward Learning Strategy. In the DBN, RBMs are stacked from the bottom-up in a greedy layer-wise manner, with each new layer modeling the posterior distribution of the previous layer. Similarly, regularized building blocks can also be used to construct the regularized DBN (Figure 4). The network, as illustrated in Figure 4(a), comprises a total of L \u2212 1 RBMs. The network can be trained with a forward and backward strategy (Figure 4(b)). It integrates top-down regularization with contrastive divergence learning, which is given by alternating Gibbs sampling between the layers (Figure 4(c)).\n\nFigure 4: Constructing a top-down regularized deep belief network (DBN). All the restricted Boltzmann machines (RBM) that make up the network are concurrently optimized. (a) Top-down regularized deep belief network: the building blocks are connected layer-wise, and both bottom-up and top-down activations are used for training the network. (b) Forward and backward passes for top-down regularization: activations for the top-down regularization are obtained by sampling and merging the forward pass and the backward pass. (c) Alternating Gibbs sampling chains for contrastive divergence learning: from the activations of the forward pass, the reconstructions can be obtained by performing alternating Gibbs sampling with the previous layer.\n\nIn the forward pass, given the input features, each layer z_l is sampled from the bottom-up, based on the representation of the previous layer z_{l\u22121} (Equation 11). The top-level vector z_L is activated with the softmax function. 
Upon reaching the output layer, the backward pass begins. The activations z_L are combined with the output labels y to produce s_L given by\n\ns_{L,ck} = (1 \u2212 \u03bb_L) z_{L,L\u22121,ck} + \u03bb_L y_ck.   (18)\n\nThe merged activations s_l (Equation 16), besides being used for parameter updates, have a second role of activating the lower layer z_{l\u22121} from the top-down:\n\nP(z_{l\u22121,l,j} | s_l; W_{l\u22121}) = 1/(1 + exp(\u2212w_{l\u22121,j} s_l \u2212 c_{l\u22121,j})).   (19)\n\nThis is repeated until the second layer is reached (l = 2) and s_2 is computed.\n\nTop-down sampling encourages the class-based invariance of the bottom-up representations. However, sampling from the top-down, with the output vector y as the only source, will result in only one activation pattern per class. This is undesirable, especially for the bottom layers, which should have representations more heavily influenced by bottom-up data. By merging the top-down representations with the bottom-up ones, the representations will encode both instance-based variations and class-based variations. In the last layer, we typically set \u03bb_L to 1, so that the final RBM given by W_{L\u22121} learns to map to the class labels y. Backward activation of z_{L\u22121,L} is a class-based invariant representation obtained from y and used to regularize W_{L\u22122}. All other backward activations from this point onwards are based on the merged representation from instance- and class-based representations.\n\nThree-Phase Learning Procedure. After greedy learning has modeled P(x) and the top-down regularized forward-backward learning has been executed, the eventual goal of the network is to be able to give a prediction of P(y|x). 
This suggests that the network can adopt a three-phase strategy for training, whereby the parameters learned in one phase initialize the next, as follows:\n\n\u2022 Phase 1 \u2013 Unsupervised Greedy. The network is constructed by greedily learning a new unsupervised RBM on top of the existing network. To enhance the representations, various regularizations, such as sparsity [10], can be applied. The stacking process is repeated for L \u2212 2 RBMs, until layer L \u2212 1 is added to the network.\n\u2022 Phase 2 \u2013 Supervised Regularized. This phase begins by connecting layer L \u2212 1 to a final layer, which is activated by the softmax activation function for a classification problem. Using the one-hot coded output vector y \u2208 R^C as its target activations and setting \u03bb_L to 1, the RBM is learned as an associative memory with the following update:\n\n\u2206w_{L\u22121,ic} \u221d \u27e8z_{L\u22121,L\u22122,i} y_c\u27e9_0 \u2212 \u27e8z_{L\u22121,L,i} z_{L,L\u22121,c}\u27e9_N.   (20)\n\nThis final RBM, together with the other RBMs learned from Phase 1, forms the initialization for the top-down regularized forward-backward learning algorithm. This phase is used to fine-tune the network using generative learning, and binds the layers together by aligning all the parameters of the network with the outputs.\n\u2022 Phase 3 \u2013 Supervised Discriminative. Finally, the supervised error backpropagation algorithm is used to improve class discrimination in the representations. Backpropagation can also be described in two passes. In the forward pass, each layer is activated from the bottom-up to obtain the class predictions. The classification error is then computed based on the ground truth, and the backward pass performs gradient descent on the parameters by backpropagating the errors through the layers from the top-down.\n\nFrom Phase 1 to Phase 2, the form of the parameter update rule based on gradient descent does not change. 
The only difference is that top-down signals are also taken into account. Essentially, the two phases are performing a variant of the contrastive divergence algorithm. Meanwhile, from Phase 2 to Phase 3, the inputs to the phases (x and y) do not change, while the optimization function is modified from performing regularization to being completely discriminative.\n\n5 Empirical Evaluation\n\nIn this work, the proposed deep learning strategy and top-down regularization method were evaluated and analyzed using the MNIST handwritten digit dataset [16] and the Caltech-101 object recognition dataset [17].\n\n5.1 MNIST Handwritten Digit Recognition\n\nThe MNIST dataset contains images of handwritten digits. The task is to recognize a digit from 0 to 9 given a 28 \u00d7 28 pixel image. The dataset is split into 60,000 images used for training and 10,000 test images. Many different methods have used this dataset to evaluate classification performance, specifically the DBN [1]. The basic version of this dataset, with neither preprocessing nor enhancements, was used for the evaluation. A five-layer DBN was set up to have the same topology as evaluated in [1]. The numbers of units in each layer, from the first to the last layer, were 784, 500, 500, 2000 and 10, in that order. Five architectural setups were tested:\n\n1. Stacked RBMs with up-down learning (original DBN reported in [1]),\n2. Stacked RBMs with forward-backward learning and backpropagation,\n3. Stacked sparse RBMs [11] with forward-backward learning and backpropagation,\n4. Stacked sparse RBMs [11] with backpropagation, and\n5. Forward-backward learning from random weights.\n\nIn phases 1 and 2, we followed the evaluation procedure of Hinton et al. [1] by initially using 44,000 training and 10,000 validation images to train the network before retraining it with the full training set. 
In phase 3, sets of 50,000 and 10,000 images were used as the initial training and validation sets. After model selection, the network was retrained on the training set of 60,000 images.\nTo simplify the parameterization for the forward-backward learning in phase 2, the top-down modulation parameters \u03bb_l across the layers were controlled by a single parameter \u03b3 using the function:\n\n\u03bb_l = |l \u2212 1|^\u03b3 / (|l \u2212 1|^\u03b3 + |L \u2212 l|^\u03b3),   (21)\n\nwhere \u03b3 > 0. The top-down influence for a layer l thus depends on its relative position in the network. The function assigns \u03bb_l such that the layers nearer to the input will have stronger influences from the input, while the layers near the output will be biased towards the output. This distance-based modulation of their influences enables a gradual mapping between the input and output layers.\nOur best performance was obtained using setup 3, which got an error rate of 0.91% on the test set. Figure 5 shows the 91 wrongly classified test examples for this setup. When initialized with the conventional RBMs but fine-tuned with forward-backward learning and error backpropagation, the error rate was 0.98%. As a comparison, the conventional DBN obtained an error rate of 1.25%. Directly optimizing the network from random weights produced an error of 1.61%, which is still fairly decent, considering that the network was optimized globally from scratch. For each setup, the intermediate results for each training phase are reported in Table 1.\nOverall, the results achieved are very competitive for methods with the same complexity that rely on neither convolution nor image distortions and normalization. A variant of the DBN, which focused on learning nonlinear transformations of the feature space for nearest neighbor classification [18], had an error rate of 1.0%. 
The deep convex net [19], which utilized more complex convex-optimized modules as building blocks but did not perform fine-tuning on a global network level, got an error rate of 0.83%. At the time of writing, the best performing model on the dataset gave an error rate of 0.23% and used a heavy architecture of a committee of 35 deep convolutional neural nets with elastic distortions and image normalization [20].\nFrom Table 1, we can observe that each of the three learning phases helped to improve the overall performance of the networks. The forward-backward algorithm outperforms the up-down learning of the original DBN. Using sparse RBMs [11] and backpropagation, it was possible to further improve the recognition performance. The forward-backward learning was effective as a bridge between the other two phases, with an improvement of 0.17% over the setup without phase 2. The method was effective even as a standalone algorithm, demonstrating its potential by learning from randomly initialized weights.\n\nTable 1: Results on MNIST after various phases of the training process.\n\nSetup | Phase 1 learning algorithm | Phase 2 learning algorithm* | Phase 1 error rate | Phase 2 error rate | Phase 3 error rate\nDeep belief network (reported in [1])\n1 | RBMs | Up-down | 2.49% | 1.25% | \u2013\nProposed top-down regularized deep belief network\n2 | RBMs | Forward-backward | 2.49% | 1.14% | 0.98%\n3 | Sparse RBMs | Forward-backward | 2.14% | 1.06% | 0.91%\n4 | Sparse RBMs | \u2013 | 2.14% | \u2013 | 1.08%\n5 | Random weights | Forward-backward | \u2013 | 1.61% | \u2013\n*Phase 3 runs the error backpropagation algorithm whenever employed.\n\nFigure 5: The 91 wrongly classified test examples from the MNIST dataset.\n\n5.2 Caltech-101 Object Recognition\n\nThe Caltech-101 dataset [17] is one of the most popular datasets for object recognition evaluation. It contains 9,144 images belonging to 101 object categories and one background class. 
The images were first resized while retaining their original aspect ratios, such that the longer spatial dimension was at most 300 pixels. SIFT descriptors [21] were extracted from densely sampled patches of 16 \u00d7 16 at 4 pixel intervals. The SIFT descriptors were \u2113_1-normalized by constraining each descriptor vector to sum to a maximum of one, resulting in a quasi-binary feature. Additionally, SIFT descriptors from a spatial neighborhood of 2 \u00d7 2 were concatenated to form a macrofeature [22].\nA DBN setup was used to learn a dictionary to map local macrofeatures to a mid-level representation. Two layers of RBMs were stacked to model the macrofeatures. Both RBMs were regularized with population and lifetime sparseness during training [23]. First, a single RBM, which had 1024 latent variables, was trained from macrofeatures. A set of 200,000 randomly selected macrofeatures was used for training this first layer. The resulting representations of the first RBM were then concatenated within each spatial neighborhood of 2 \u00d7 2. The second RBM modeled this spatially aggregated representation into a higher-level representation. Another set of 200,000 randomly selected spatially aggregated representations was used for training this RBM.\nThe higher-level RBM representation was associated with the image label. For each experimental trial, a set of 30 training examples per class (totaling 3060) was randomly selected for supervised learning. The forward-backward learning algorithm was used to regularize the learning while fine-tuning the network. Finally, error backpropagation was performed to further optimize the dictionary. From these representations, max-pooling within spatial regions defined by a spatial pyramid was employed [22, 24] to obtain a single vector representing the whole image. It is also possible to employ more advanced pooling schemes [25]. 
A linear SVM classifier was then trained, using the same train-test split from the previous supervised learning phase.\nTable 2 shows the average class-wise classification accuracy, averaged across 102 classes and 10 experimental trials. The results demonstrate a consistent improvement moving from Phase 1 to Phase 3. The final accuracy obtained was 79.7%. This outperforms all existing dictionary learning methods based on a single image descriptor, with a 0.8% improvement over the previous state-of-the-art results [23, 28]. As a comparison, other existing reported dictionary learning methods that encode SIFT-based local descriptors are also included in Table 2.\n\nTable 2: Classification accuracy on Caltech-101.\n\nMethod / Training phase | Accuracy\nProposed top-down regularized DBN\nPhase 1: Unsupervised stacking | 72.8%\nPhase 2: Top-down regularization | 78.2%\nPhase 3: Error backpropagation | 79.7%\nSparse coding & max-pooling [22] | 73.4%\nExtended HMAX [26] | 76.3%\nConvolutional RBM [27] | 77.8%\nUnsupervised & supervised RBM [23] | 78.9%\nGated Convolutional RBM [28] | 78.9%\n\n6 Conclusion\n\nWe proposed the notion of deep learning by gradually transitioning from being fully unsupervised to strongly discriminative. This is achieved through the introduction of an intermediate phase between the unsupervised and supervised learning phases. This notion is implemented by incorporating top-down information into DBNs through regularization. The method is easily integrated into the intermediate learning phase based on simple building blocks. It can be performed to complement greedy layer-wise unsupervised learning and discriminative optimization using error backpropagation. Empirical evaluation shows that the method leads to competitive results for handwritten digit recognition and object recognition datasets.\n\nReferences\n[1] G. E. Hinton, S. Osindero, and Y.-W. 
Teh, \u201cA fast learning algorithm for deep belief networks,\u201d Neural Computation, vol. 18, no. 7, pp. 1527\u20131554, 2006.\n[2] Y. LeCun, \u201cUne proc\u00e9dure d\u2019apprentissage pour r\u00e9seau \u00e0 seuil asym\u00e9trique (a learning scheme for asymmetric threshold networks),\u201d in Cognitiva 85, 1985.\n[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, \u201cLearning representations by back-propagating errors,\u201d Nature, vol. 323, pp. 533\u2013536, October 1986.\n[4] Y. Bengio, \u201cLearning deep architectures for AI,\u201d Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1\u2013127, 2009.\n[5] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, \u201cGreedy layer-wise training of deep networks,\u201d in NIPS, 2006.\n[6] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, \u201cEfficient learning of sparse representations with an energy-based model,\u201d in NIPS, 2006.\n[7] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, \u201cExtracting and composing robust features with denoising autoencoders,\u201d in ICML, 2008.\n[8] P. Smolensky, \u201cInformation processing in dynamical systems: Foundations of harmony theory,\u201d in Parallel Distributed Processing: Volume 1: Foundations, ch. 6, pp. 194\u2013281, MIT Press, 1986.\n[9] G. E. Hinton, \u201cTraining products of experts by minimizing contrastive divergence,\u201d Neural Computation, vol. 14, no. 8, pp. 1771\u20131800, 2002.\n[10] H. Lee, C. Ekanadham, and A. Ng, \u201cSparse deep belief net model for visual area V2,\u201d in NIPS, 2008.\n[11] H. Goh, N. Thome, and M. Cord, \u201cBiasing restricted Boltzmann machines to manipulate latent selectivity and sparsity,\u201d in NIPS Workshop, 2010.\n[12] H. Larochelle and Y. Bengio, \u201cClassification using discriminative restricted Boltzmann machines,\u201d in ICML, 2008.\n[13] N. Le Roux and Y. 
Bengio, \u201cRepresentational power of restricted Boltzmann machines and deep belief networks,\u201d Neural Computation, vol. 20, pp. 1631\u20131649, June 2008.\n[14] I. Sutskever and G. E. Hinton, \u201cLearning multilevel distributed representations for high-dimensional sequences,\u201d in AISTATS, 2007.\n[15] G. E. Hinton and R. Salakhutdinov, \u201cReducing the dimensionality of data with neural networks,\u201d Science, vol. 313, pp. 504\u2013507, 2006.\n[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, \u201cGradient-based learning applied to document recognition,\u201d Proceedings of the IEEE, vol. 86, pp. 2278\u20132324, November 1998.\n[17] L. Fei-Fei, R. Fergus, and P. Perona, \u201cLearning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,\u201d in CVPR Workshop, 2004.\n[18] R. Salakhutdinov and G. E. Hinton, \u201cLearning a nonlinear embedding by preserving class neighbourhood structure,\u201d in AISTATS, 2007.\n[19] L. Deng and D. Yu, \u201cDeep convex net: A scalable architecture for speech pattern classification,\u201d in Interspeech, 2011.\n[20] D. C. Cire\u015fan, U. Meier, and J. Schmidhuber, \u201cMulti-column deep neural networks for image classification,\u201d in CVPR, 2012.\n[21] D. Lowe, \u201cObject recognition from local scale-invariant features,\u201d in ICCV, 1999.\n[22] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce, \u201cLearning mid-level features for recognition,\u201d in CVPR, 2010.\n[23] H. Goh, N. Thome, M. Cord, and J.-H. Lim, \u201cUnsupervised and supervised visual codes with restricted Boltzmann machines,\u201d in ECCV, 2012.\n[24] S. Lazebnik, C. Schmid, and J. Ponce, \u201cBeyond bags of features: Spatial pyramid matching for recognizing natural scene categories,\u201d in CVPR, 2006.\n[25] S. Avila, N. Thome, M. Cord, E. Valle, and A. 
Ara\u00fajo, \u201cPooling in image representation: The visual codeword point of view,\u201d Computer Vision and Image Understanding, pp. 453\u2013465, May 2013.\n[26] C. Theriault, N. Thome, and M. Cord, \u201cExtended coding and pooling in the HMAX model,\u201d IEEE Transactions on Image Processing, 2013.\n[27] K. Sohn, D. Y. Jung, H. Lee, and A. Hero III, \u201cEfficient learning of sparse, distributed, convolutional feature representations for object recognition,\u201d in ICCV, 2011.\n[28] K. Sohn, G. Zhou, C. Lee, and H. Lee, \u201cLearning and selecting features jointly with point-wise gated Boltzmann machines,\u201d in ICML, 2013.\n", "award": [], "sourceid": 953, "authors": [{"given_name": "Hanlin", "family_name": "Goh", "institution": "LIP6/UPMC"}, {"given_name": "Nicolas", "family_name": "Thome", "institution": "University Pierre & Marie Curie and CNRS (UMR 7606)"}, {"given_name": "Matthieu", "family_name": "Cord", "institution": "University Pierre & Marie Curie and CNRS (UMR 7606)"}, {"given_name": "Joo-Hwee", "family_name": "Lim", "institution": "Institute for Infocomm Research, Singapore"}]}