{"title": "Adaptive dropout for training deep neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3084, "page_last": 3092, "abstract": "Recently, it was shown that by dropping out hidden activities with a probability of 0.5, deep neural networks can perform very well. We describe a model in which a binary belief network is overlaid on a neural network and is used to decrease the information content of its hidden units by selectively setting activities to zero. This \"dropout network\" can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found our method can be used to achieve lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our model achieves 5.8% error on the NORB test set, which is better than state-of-the-art results obtained using convolutional architectures.", "full_text": "Adaptive dropout for training deep neural networks\n\nLei Jimmy Ba Brendan Frey\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of Toronto\n\njimmy, frey@psi.utoronto.ca\n\nAbstract\n\nRecently, it was shown that deep neural networks can perform very well if the\nactivities of hidden units are regularized during learning, e.g., by randomly drop-\nping out 50% of their activities. We describe a method called \u2018standout\u2019 in which\na binary belief network is overlaid on a neural network and is used to regularize\nits hidden units by selectively setting activities to zero. 
This \u2018adaptive dropout\nnetwork\u2019 can be trained jointly with the neural network by approximately com-\nputing local expectations of binary dropout variables, computing derivatives using\nback-propagation, and using stochastic gradient descent.\nInterestingly, experi-\nments show that the learnt dropout network parameters recapitulate the neural\nnetwork parameters, suggesting that a good dropout network regularizes activities\naccording to magnitude. When evaluated on the MNIST and NORB datasets, we\nfound that our method achieves lower classi\ufb01cation error rates than other feature\nlearning methods, including standard dropout, denoising auto-encoders, and re-\nstricted Boltzmann machines. For example, our method achieves 0.80% and 5.8%\nerrors on the MNIST and NORB test sets, which is better than state-of-the-art\nresults obtained using feature learning methods, including those that use convolu-\ntional architectures.\n\n1 Introduction\nFor decades, deep networks with broad hidden layers and full connectivity could not be trained to\nproduce useful results, because of over\ufb01tting, slow convergence and other issues. One approach\nthat has proven to be successful for unsupervised learning of both probabilistic generative models\nand auto-encoders is to train a deep network layer by layer in a greedy fashion [7]. Each layer of\nconnections is learnt using contrastive divergence in a restricted Boltzmann machine (RBM) [6] or\nbackpropagation through a one-layer auto-encoder [1], and then the hidden activities are used to\ntrain the next layer. When the parameters of a deep network are initialized in this way, further \ufb01ne\ntuning can be used to improve the model, e.g., for classi\ufb01cation [2]. The unsupervised, pre-training\nstage is a crucial component for achieving competitive overall performance on classi\ufb01cation tasks,\ne.g., Coates et al. 
[4] have achieved improved classi\ufb01cation rates by using different unsupervised\nlearning algorithms.\nRecently, a technique called dropout was shown to signi\ufb01cantly improve the performance of deep\nneural networks on various tasks [8], including vision problems [10]. Dropout randomly sets hidden\nunit activities to zero with a probability of 0.5 during training. Each training example can thus\nbe viewed as providing gradients for a different, randomly sampled architecture, so that the \ufb01nal\nneural network ef\ufb01ciently represents a huge ensemble of neural networks, with good generalization\ncapability. Experimental results on several tasks show that dropout frequently and signi\ufb01cantly\nimproves the classi\ufb01cation performance of deep architectures. Injecting noise for the purpose of\nregularization has been studied previously, but in the context of adding noise to the inputs [3],[21]\nand to network components [16].\nUnfortunately, when dropout is used to discriminatively train a deep fully connected neural network\non input with high variation, e.g., in viewpoint and angle, little bene\ufb01t is achieved (section 5.5),\nunless spatial structure is built in.\nIn this paper, we describe a generalization of dropout, where the dropout probability for each\nhidden variable is computed using a binary belief network that shares parameters with the deep\nnetwork. Our method works well both for unsupervised and supervised learning of deep networks.\nWe present results on the MNIST and NORB datasets showing that our \u2018standout\u2019 technique can\nlearn better feature detectors for handwritten digit and object recognition tasks. 
Interestingly, we\nalso \ufb01nd that our method enables the successful training of deep auto-encoders from scratch, i.e.,\nwithout layer-by-layer pre-training.\n2 The model\nThe original dropout technique [8] uses a constant probability for omitting a unit, so a natural ques-\ntion we considered is whether it may help to let this probability be different for different hidden\nunits. In particular, there may be hidden units that can individually make con\ufb01dent predictions for\nthe presence or absence of an important feature or combination of features. Dropout will ignore this\ncon\ufb01dence and drop the unit out 50% of the time. Viewed another way, suppose after dropout is\napplied, it is found that several hidden units are highly correlated in the pre-dropout activities. They\ncould be combined into a single hidden unit with a lower dropout probability, freeing up hidden\nunits for other purposes.\nWe denote the activity of unit j in a deep neural network by aj and assume that its inputs are\n{ai : i < j}. In dropout, aj is randomly set to zero with probability 0.5. Let mj be a binary variable\nthat is used to mask the activity aj, so that its value is\n\n$$a_j = m_j\\, g\\Big(\\sum_{i:i<j} w_{j,i}\\, a_i\\Big), \\tag{1}$$\n\nwhere wj,i is the weight from unit i to unit j and g(\u00b7) is the activation function and a0 = 1 accounts\nfor biases. Whereas in standard dropout, mj is Bernoulli with probability 0.5, here we use an\nadaptive dropout probability that depends on input activities:\n\n$$P(m_j = 1 \\mid \\{a_i : i < j\\}) = f\\Big(\\sum_{i:i<j} \\pi_{j,i}\\, a_i\\Big), \\tag{2}$$\n\nwhere \u03c0j,i is the weight from unit i to unit j in the standout network or the adaptive dropout network;\nf (\u00b7) is a sigmoidal function, f : R \u2192 [0, 1]. 
We use the logistic function, f (z) = 1/(1 + exp(\u2212z)).\nThe standout network is an adaptive dropout network that can be viewed as a binary belief net-\nwork that overlays the neural network and stochastically adapts its architecture, depending on the\ninput. Unlike a traditional belief network, the distribution over the output variable is not obtained\nby marginalizing over the hidden mask variables. Instead, the distribution over the hidden mask\nvariables should be viewed as specifying a Bayesian posterior distribution over models. Traditional\nBayesian inference generates a posterior distribution that does not depend on the input at test time,\nwhereas the posterior distribution described here does depend on the test input. At \ufb01rst, this may\nseem inappropriate. However, if we could exactly compute the Bayesian posterior distribution over\nneural networks (parameters and architectures), we would \ufb01nd strong correlations between compo-\nnents, such as the connectivity and weight magnitudes in one layer and the connectivity and weight\nmagnitudes in the next layer. The standout network described above can be viewed as approximately\ntaking into account these dependencies through the use of a parametric family of distributions.\nThe standout method described here can be simpli\ufb01ed to obtain other dropout techniques. The\noriginal dropout method is obtained by clamping \u03c0j,i = 0 for 0 \u2264 i < j. 
Another interesting\nsetting is obtained by clamping \u03c0j,i = 0 for 1 \u2264 i < j, but learning the input-independent dropout\nparameter \u03c0j,0 for each unit aj.\nAs in standard dropout, to process an input at test time, the stochastic feedforward process is replaced\nby taking the expectation of equation 1:\n\n$$E[a_j] = f\\Big(\\sum_{i:i<j} \\pi_{j,i}\\, a_i\\Big)\\, g\\Big(\\sum_{i:i<j} w_{j,i}\\, a_i\\Big). \\tag{3}$$\n\nWe found that this method provides results very similar to those obtained by randomly simulating the stochastic\nprocess and computing the expected output of the neural network.\n3 Learning\nFor a speci\ufb01c con\ufb01guration m of the mask variables, let L(m, w) denote the likelihood of a training\nset or a minibatch, where w is the set of neural network parameters. It may include a prior as well.\nThe dependence of L on the input and output has been suppressed for notational simplicity. Given\nthe current dropout parameters, \u03c0, the standout network acts like a binary belief network that gen-\nerates a distribution over the mask variables for the training set or minibatch, denoted P (m|\u03c0, w).\nAgain, we have suppressed the dependence on the input to the neural network. As described above,\nthis distribution should not be viewed as the distribution over hidden variables in a latent variable\nmodel, but as an approximation to a Bayesian posterior distribution over model architectures.\nThe goal is to adjust \u03c0 and w to make P (m|\u03c0, w) close to the true posterior over architectures\nas given by L(m, w), while also adjusting L(m, w) so as to maximize the data likelihood w.r.t. w.\nSince both the approximate posterior P (m|\u03c0, w) and the likelihood L(m, w) depend on the neural\nnetwork parameters, we use a crude approximation that we found works well in practice. 
If the\napproximate posterior were as close as possible to the true posterior, then the derivative of the free\nenergy F (P, L) w.r.t. P would be zero and we can ignore terms of the form \u2202P/\u2202w. So, we adjust\nthe neural network parameters using the approximate derivative,\n\n$$-\\sum_{m} P(m \\mid \\pi, w)\\, \\frac{\\partial}{\\partial w} \\log L(m, w), \\tag{4}$$\n\nwhich can be computed by sampling from P (m|\u03c0, w).\nFor a given setting of the neural network parameters, the standout network can in principle be ad-\njusted to be closer to the Bayesian posterior by following the derivative of the free energy F (P, L)\nw.r.t. \u03c0. This is dif\ufb01cult in practice, so we use an approximation where we assume the approximate\nposterior is correct and sample a con\ufb01guration of m from it. Then, for each hidden unit, we consider\nmj = 0 and mj = 1 and determine the partial contribution to the free energy. The standout network\nparameters are adjusted for that hidden unit so as to decrease the partial contribution to the free\nenergy. Namely, the standout network updates are obtained by sampling the mask variables using\nthe current standout network, performing forward propagation in the neural network, and comput-\ning the data likelihood. The mask variables are sequentially perturbed by combining the standout\nnetwork probability for the mask variable with the data likelihood under the neural network, using\na partial forward propagation. The resulting mask variables are used as complete data for updating\nthe standout network.\nThe above learning technique is approximate, but works well in practice and achieves models that\noutperform standard dropout and other feature learning techniques, as described below.\n
Algorithm 1: Standout learning algorithm: alg1 and alg2\n\nNotation: H(\u00b7) is the Heaviside step function;\nInput: w, \u03c0, \u03b1, \u03b2\nalg1: initialize w, \u03c0 randomly; alg2: initialize w randomly, set \u03c0 = w;\nwhile not stopping criteria do\n    for hidden unit j = 1, 2, ... do\n        P (mj = 1|{ai : i < j}) = f(\u03b1 \u2211_{i:i<j} \u03c0j,i ai + \u03b2);\n        mj \u223c P (mj = 1|{ai : i < j});\n        aj = mj g(\u2211_{i:i<j} wj,i ai);\n    end\n    Update neural network parameter w using \u2202/\u2202w log L(m, w);\n    /* alg1 */\n    for hidden unit j = 1, 2, ... do\n        tj = H(L(m, w|mj = 1) \u2212 L(m, w|mj = 0));\n    end\n    Update standout network \u03c0 using target t;\n    /* alg2 */\n    Update standout network \u03c0 using \u03c0 \u2190 w;\nend\n\n3.1 Stochastic adaptive mixtures of local experts\nA neural network of N hidden units can be viewed as 2^N possible models given the standout mask\nM. Each of the 2^N models acts like a separate \u201cexpert\u201d network that performs well for a subset\nof the input space. Training all 2^N models separately can easily over-\ufb01t to the data, but weight\nsharing among the models can prevent over-\ufb01tting. Therefore, the standout network, much like a\ngating network, also produces a distributed representation that stochastically chooses which expert to\nturn on for a given input. 
This means 2^N models are chosen by N binary numbers in this distributed\nrepresentation.\n\nFigure 1: Weights from hidden units that are least likely to be dropped out, for examples from each\nof the 10 classes, for (top) auto-encoder and (bottom) discriminative neural networks trained using\nstandout.\n\nFigure 2: First layer standout network \ufb01lters and neural network \ufb01lters learnt from MNIST data\nusing our method.\n\nThe standout network partitions the input space into different regions that are suitable for each\nexpert. We can visualize the effect of the standout network by showing the units that output high\nstandout probability for one class but not others. The standout network learns that some hidden units\nare important for one class and tends to keep those. 
These hidden units are then more likely to be\ndropped out when the input comes from a different class.\n4 Exploratory experiments\nHere, we study different aspects of our method using MNIST digits (see below for more details).\nWe trained a shallow one-hidden-layer auto-encoder on MNIST using the approximate learning\nalgorithm. We can visualize the effect of the standout network by showing the units that output low\ndropout probability for one class but not others. The standout network learns that some hidden units\nare important for one class and tends to keep those. These hidden units are more likely to be dropped\nwhen the input comes from a different class (see \ufb01gure 1).\nThe \ufb01rst layer \ufb01lters of both the standout network and the neural network are shown in \ufb01gure 2.\nWe noticed that the weights in the two networks are very similar. Since the learning algorithm for\nadjusting the dropout parameters is computationally burdensome (see above), we considered tying\nthe parameters w and \u03c0. To account for different scales and shifts, we set \u03c0 = \u03b1w + \u03b2, where \u03b1 and\n\u03b2 are learnt.\nConcretely, we found empirically that the standout network parameters trained in this way are quite\nsimilar (although not identical) to the neural network parameters, up to an af\ufb01ne transformation.\nThis motivated our second algorithm, alg2 in Algorithm 1, where the neural network parameters\nare trained as described in section 3, but the standout parameters are set to an af\ufb01ne trans-\nformation of the neural network parameters with hyper-parameters \u03b1 and \u03b2. These hyper-\nparameters are determined as explained below. We found that this technique works very well in\npractice, for the MNIST and NORB datasets (see below). 
For example, for unsupervised learning\non MNIST using the architecture described below, we obtained 153 errors for tied parameters and\n158 errors for separately learnt parameters. 
This tied parameter learning algorithm is used for the\nexperiments in the rest of the paper. In the above description of our method, we mentioned two\nhyper-parameters that need to be considered: the scale parameter \u03b1 and the bias parameter \u03b2. Here\nwe explore the choice of these parameters, by presenting some experimental results obtained by\ntraining a dropout model as described below using MNIST handwritten digit images.\n\u03b1 controls the sensitivity of the dropout function to the weighted sum of inputs that is used to\ndetermine the hidden activity. In particular, \u03b1 scales the weighted sum of the activities from the\nlayer before. In contrast, the bias \u03b2 shifts the dropout probability to be high or low and ultimately\ncontrols the sparsity of the hidden unit activities. A model with a more negative \u03b2 will have most of\nits hidden activities concentrated near zero.\nFigure 3(a) illustrates how choices of \u03b1 and \u03b2 change the dependence of the dropout probability on\nthe input. It shows a histogram of hidden unit activities after training networks with different \u03b1\u2019s\nand \u03b2\u2019s on MNIST images.\n\nFigure 3: (a) Histogram of hidden unit activities for various choices of hyper-parameters using the lo-\ngistic dropout function, including those con\ufb01gurations that are equivalent to dropout and no dropout-\nbased regularization (AE). (b) Histograms of hidden unit activities for various dropout functions\n(panel title: various standout functions f (\u00b7)).\n\nWe also consider different forms of the dropout function other than the logistic function, as shown\nin \ufb01gure 3(b). The effect of different functional forms can be observed in the histogram of the\nactivities after training on the MNIST images. 
The logistic dropout function creates a sparse\ndistribution of activation values, whereas the functions such as f (z) = 1\u2212 4(1\u2212 \u03c3(z))\u03c3(z) produce\na multi-modal distribution over the activation values.\n5 Experimental results\nWe consider both unsupervised learning and discriminative learning tasks, and compare results ob-\ntained using standout to those obtained using restricted Boltzmann machines (RBMs) and auto-\nencoders trained using dropout, for unsupervised feature learning tasks. We also investigate clas-\nsi\ufb01cation performance by applying standout during discriminative training using the MNIST and\nNORB [11] datasets.\nIn our experiments, we have made a few engineering choices that are consistent with previous publi-\ncations in the area, so that our results are comparable to the literature. We used ReLU units, a linear\nmomentum schedule, and an exponentially decaying learning rate (c.f. Nair et al. 2009[13]; Hin-\nton et al. 2012 [8]). In addition, we used cross-validation to search over the learning rate (0.0001,\n0.0003, 0.001, 0.003, 0.01, 0.03) and the values of alpha and beta (-2, -1.5, -1, -.5, 0, .5, 1, 1.5, 2)\nand for the NORB dataset, the number of hidden units (1000, 2000, 4000, 6000).\n\n5.1 Datasets\nThe MNIST handwritten digit dataset is generally considered as a well-studied problem, which\noffers the ability to ensure that new algorithms produce sensible results when compared to the many\nother techniques that have been benchmarked. It consists of ten classes of handwritten digits, ranging\nfrom 0 to 9. There are, in total, 60,000 training images and 10,000 test images. Each image is 28\u00d728\npixels in size. Following the common convention, we randomly separate the original training set\ninto 50,000 training cases and 10,000 cases used for validating the choice of hyper-parameters. 
We\nconcatenate all the pixels in an image in a raster scan fashion to create a 784-dimensional vector.\nThe task is to predict the 10 class labels from the 784-dimensional input vector.\nThe small NORB normalized-uniform dataset contains 24,300 training examples and 24,300 test\nexamples. It consists of 50 different objects from \ufb01ve different classes: cars, trucks, planes, animals,\nand humans. Each data point is represented by a stereo image pair of size 96\u00d796 pixels. The training\nand test sets use different object instances, and images are created under different lighting conditions,\nelevations and azimuths. Performing well on NORB requires a learning algorithm to learn\nfeatures that generalize to the test set and to handle a large input dimension. This makes\nNORB signi\ufb01cantly more challenging than the MNIST dataset. The objects in the NORB dataset\nare 3D and appear under different out-of-plane rotations, and so on. Therefore, the models trained on NORB\nhave to learn and store implicit representations of 3D structure, lighting and so on. We formulate\nthe data vector following Snoek et al. [17] by down-sampling from 96 \u00d7 96 to 32 \u00d7 32, so that the\n\ufb01nal training data vector has 2048 dimensions. To normalize the contrast, we subtract the mean from each data point and divide\nby the standard deviation along each input dimension, both computed across the whole training set.\nThe goal is to predict the \ufb01ve class labels for the previously unseen 24,300 test examples.\nThe training set is separated into 20,000 for training and 4,300 for validation.\n\n5.2 Nonlinearity for feedforward network\nWe used the ReLU [13] activation function for all of the results reported here, both on unsupervised\nand discriminative tasks. The ReLU function can be written as g(x) = max(0, x). We found that\nits use signi\ufb01cantly speeds up training by up to 10-fold, compared to the commonly used logistic\nactivation function. 
The speed-up we observed can be explained in two ways. First, computations\nare saved when using max instead of the exponential function. Second, ReLUs do not suffer from\nthe vanishing gradient problem that logistic functions have for very large inputs.\n\n5.3 Momentum\nWe optimized the model parameters using stochastic gradient descent with the Nesterov momentum\ntechnique [19], which can effectively speed up learning when applied to large models compared to\nstandard momentum. When using Nesterov momentum, the cost function J and derivatives \u2202J/\u2202\u03b8 are\nevaluated at \u03b8 + v_k, where v_k = \u03b3 v_{k\u22121} + \u03b7 \u2202J/\u2202\u03b8 is the velocity and \u03b8 is the model parameter. \u03b3 < 1\nis the momentum coef\ufb01cient and \u03b7 is the learning rate. Nesterov momentum takes into account\nthe velocity in parameter space when computing updates. Therefore, it further reduces oscillations\ncompared to standard momentum.\nWe schedule the momentum coef\ufb01cient \u03b3 to further speed up the learning process. \u03b3 starts at 0.5 in\nthe \ufb01rst epoch and increases linearly to 0.99. The momentum stays at 0.99 during the major portion\nof learning and then is linearly ramped down to 0.5 during the end of learning.\n\n5.4 Computation time\nWe used the publicly available gnumpy library [20] to implement our models. The models mentioned\nin this work are trained on a single Nvidia GTX 580 GPU. As in Algorithm 1, the \ufb01rst algorithm\nis relatively slow, since the number of computations is O(n^2) where n is the number of hidden units.\nThe second algorithm is much faster and takes O(kn) time, where k is the number of con\ufb01gurations\nof the hyper-parameters \u03b1 and \u03b2 that are searched over. 
In particular, for a 784-1000-784 auto-\nencoder model with mini-batches of size 100 and 50,000 training cases on a GTX 580 GPU, learning\ntakes 1.66 seconds per epoch for standard dropout and 1.73 seconds for our second algorithm.\nThe computational cost of the improved representations produced by our algorithm is that a hyper-\nparameter search is needed. We note that some other recently developed dropout-related methods,\nsuch as maxout, also involve an additional computational factor.\n\n5.5 Unsupervised feature learning\nHaving good features is crucial for obtaining competitive performance in classi\ufb01cation and other\nhigh level tasks. Learning algorithms that can take advantage of unlabeled data are appealing due\nto the increasing amount of unlabeled data. Furthermore, on more challenging datasets, such as NORB,\na fully connected discriminative neural network trained from scratch tends to perform poorly, even\nwith the help of dropout. (We trained a two-hidden-layer neural network on NORB to obtain 13%\nerror rate and saw no improvement by using dropout.) Such disappointing performance motivated us\nto investigate unsupervised feature learning and pre-training strategies with our new method. Below,\nwe show that our method can extract useful features in a self-taught fashion. The features extracted\nusing our method not only outperform other common feature learning methods, but our method is\nalso quite computationally ef\ufb01cient compared to techniques like sparse coding.\nWe use the following procedures for feature learning. We \ufb01rst extract the features using one of the\nunsupervised learning algorithms in \ufb01gure 4. The usefulness of the extracted features is then\nevaluated by training a linear classi\ufb01er to predict the object class from the extracted features. 
This\nprocess is similar to that employed in other feature learning research [14].\nWe trained a number of architectures on MNIST, including standard auto-encoders, dropout auto-\nencoders and standout auto-encoders. As described previously, we compute the expected value of\neach hidden activity and use that as the feature when training a classi\ufb01er. We also examined RBMs,\nwhere we used the soft probability for each hidden unit as a feature. Different classi\ufb01ers can be used\nand give similar performance; we used a linear SVM because it is fast and straightforward to apply.\nHowever, on a subset of problems we tried logistic classi\ufb01ers and they achieved indistinguishable\nclassi\ufb01cation rates.\n\n(a) MNIST\nmethod | arch. | act. func. | err.\nraw pixel | 784 | - | 7.2%\nRBM | 784-1000, weight decay | \u03c3(\u00b7) | 1.81%\nDAE | 784-1000-784 | ReLU(\u00b7) | 1.95%\ndropout AE | 784-1000-784, 50% hidden dropout | ReLU(\u00b7) | 1.70%\nstandout AE | 784-1000-784, standout | ReLU(\u00b7) | 1.53%\n\n(b) NORB\nmethod | arch. | act. func. | err.\nraw pixel | 8976 | - | 23.6%\nRBM | 2048-4000, weight decay | \u03c3(\u00b7) | 10.6%\nDAE | 2048-4000-2048 | ReLU(\u00b7) | 9.5%\ndropout AE | 2048-4000-2048, 50% hidden dropout | ReLU(\u00b7) | 10.1%\ndropout AE * | 2048-4000-2048, 22% hidden dropout | ReLU(\u00b7) | 8.9%\nstandout AE | 2048-4000-2048, standout | ReLU(\u00b7) | 7.3%\n\nFigure 4: Performance of unsupervised feature learning methods. The dropout probability in the\ndropout AE * was optimized using [18]\n\nResults for the different architectures and learning methods are compared in table 4(a). The auto-\nencoder trained using our proposed technique with \u03b1 = 1 and \u03b2 = 0 performed the best on MNIST.\nWe performed extensive experiments on the NORB dataset with larger models. The hyper-\nparameters used for the best result are \u03b1 = 1 and \u03b2 = 1. 
Overall, we observed similar trends\nto the ones we observed for MNIST. 
Our standout method consistently performs better than other\nmethods, as shown in table 4(b).\n\n5.6 Discussion\nThe proposed standout method was able to outperform other feature learning methods in both\ndatasets with a noticeable margin. The stochasticity introduced by the standout network success-\nfully removes hidden units that are unnecessary for good performance and that hinder performance.\nBy inspecting the weights from auto-encoders regularized by dropout and standout, we \ufb01nd that the\nstandout auto-encoder weights are sharper than those learnt using dropout, which may be consistent\nwith the improved performance on classi\ufb01cation tasks.\nThe effect of the number of hidden units was studied us-\ning networks with sizes 500, 1000, 1500, and up to 4500.\nFigure 5 shows that all algorithms generally perform bet-\nter by increasing the number of hidden units. One notable\ntrend for dropout regularization is that it achieves sig-\nni\ufb01cantly better performance with large numbers of hid-\nden units since all units have equal chance to be omitted.\nIn comparison, standout can achieve similar performance\nwith only half as many hidden units, because highly use-\nful hidden units will be kept more often while only the\nless effective units will be dropped.\nOne question is whether it is the stochasticity of the standout network that helps, or just a dif-\nferent nonlinearity obtained by the expected activity in equation 3. To address this, we trained a\ndeterministic auto-encoder with hidden activation functions given by equation 3. The result of this\n\u2018deterministic standout method\u2019 is shown in \ufb01gure 5 and it performs quite poorly.\nIt is believed that sparse features can help improve the performance of linear classi\ufb01ers. We found\nthat auto-encoders trained using ReLU units and standout produce sparse features. 
We wondered\nwhether training a sparse auto-encoder with a sparsity level matching the one obtained by our method\nwould yield similar performance. We applied an L1 penalty on the hidden units and trained an\nauto-encoder to match the sparsity obtained by our method (\ufb01gure 4). The \ufb01nal features extracted\nusing the sparse auto-encoder achieved 10.2% error on NORB, which is signi\ufb01cantly worse than\nour method. Further gains can be achieved by tuning hyper-parameters, but the hyper-parameters\nfor our method are easier to tune and, as shown above, have little effect on the \ufb01nal performance.\nMoreover, the sparse features learnt using standout are also computationally ef\ufb01cient compared\nto more sophisticated encoding algorithms, e.g., [5]. To \ufb01nd the code for data points with more\nthan 4000 dimensions and 4000 dictionary elements, the sparse coding algorithm quickly becomes\nimpractical.\n\nFigure 5: Classi\ufb01cation error rate as a function of number of hidden units on NORB. [Plot: x-axis, number of hidden units (500 to 4500); y-axis, test error rate (%); curves: DAE, Dropout AE, Deterministic standout AE, Standout AE.]\n\n(a) MNIST \ufb01ne-tuned\nmethod | error rate\nRBM + FT | 1.24%\nDAE + FT | 1.3%\nshallow dropout AE + FT | 1.10%\ndeep dropout AE + FT | 0.89%\nstandout shallow AE + FT | 1.06%\nstandout deep AE + FT | 0.80%\n\n(b) NORB \ufb01ne-tuned\nmethod | error rate\nDBN [15] | 8.3%\nDBM [15] | 7.2%\nthird order RBM [12] | 6.5%\ndropout shallow AE + FT | 7.5%\ndropout deep AE + FT | 7.0%\nstandout shallow AE + FT | 6.2%\nstandout deep AE + FT | 5.8%\n\nFigure 6: Performance of \ufb01ne-tuned classi\ufb01ers, where FT is \ufb01ne-tuning\n\nSurprisingly, a shallow network with standout regularization (table 4(b)) outperforms some of the\nmuch larger and deeper networks shown. Some of those deeper models have three or four times\nmore parameters than the shallow network we trained here. 
This particular result shows that a simpler\nmodel trained using our regularization technique can achieve higher performance compared to other,\nmore complicated methods.\n\n5.7 Discriminative learning\nIn deep learning, a common practice is to use the encoder weights learnt by an unsupervised learning\nmethod to initialize the early layers of a multilayer discriminative model. The backpropagation\nalgorithm is then used to learn the weights for the last hidden layer and also \ufb01ne tune the weights\nin the layers before. This procedure is often referred to as discriminative \ufb01ne tuning. We initialized\nneural networks using the models described above. The regularization method that we used for\nunsupervised learning (RBM, dropout, standout) is also used for corresponding discriminative \ufb01ne\ntuning. For example, if a neural network is initialized using an auto-encoder trained with standout,\nthe neural network will also be \ufb01ne tuned using standout for all its hidden units, with the same\nstandout function and hyper-parameters as the auto-encoder.\nDuring discriminative \ufb01ne tuning, we hold the weights \ufb01xed for all layers except the last one for the\n\ufb01rst 10 epochs, and then the weights are updated jointly after that. As found by previous authors,\nwe \ufb01nd that classi\ufb01cation performance is usually improved by the use of discriminative \ufb01ne tuning.\nImpressively, we found that a two-hidden-layer neural network with 1000 ReLU units in its \ufb01rst\nand second hidden layers trained with standout is able to achieve 80 errors on MNIST data after\n\ufb01ne tuning (error rate of 0.80%). This performance is better than the current best non-convolutional\nresult [8] and the training procedure is simpler. On the NORB dataset, we similarly achieved 6.2%\nerror rate by \ufb01ne tuning the simple shallow auto-encoder from table 4(b). 
Furthermore, a two-hidden-layer\nneural network with 4000 ReLU units in both hidden layers that is pre-trained using\nstandout achieved a 5.8% error rate after fine tuning. It is worth mentioning that a small weight decay\nof 0.0005 is applied to this network during fine-tuning to further prevent overfitting. This network outperforms\nother models that do not exploit spatial structure. As far as we know, this result is better than\nany previously published results without distortion or jitter. It even outperforms carefully designed\nconvolutional neural networks found in [9].\nFigure 6 reports the classification error rates obtained by different models, including state-of-the-art\ndeep networks.\n\n6 Conclusions\nOur results demonstrate that the proposed use of standout networks can significantly improve\nperformance of feature-learning methods. Further, our results provide additional support for the\n\u2018regularization by noise\u2019 hypothesis that has been used to regularize other deep architectures, including\nRBMs and denoising auto-encoders, and in dropout.\nAn obvious missing piece in this research is a good theoretical understanding of why the standout\nnetwork provides better regularization compared to the fixed dropout probability of 0.5. While we\nhave motivated our approach as one of approximating the Bayesian posterior, further theoretical\njustifications are needed.\n\nReferences\n[1] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep\nnetworks. Advances in neural information processing systems, 19:153, 2007.\n\n[2] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends\u00ae in Machine\nLearning, 2(1):1\u2013127, 2009.\n\n[3] C.M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural computation,\n7(1):108\u2013116, 1995.\n\n[4] A. Coates and A.Y. Ng. The importance of encoding versus training with sparse coding and\nvector quantization. 
In International Conference on Machine Learning, volume 8, page 10,\n2011.\n\n[5] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In International\nConference on Machine Learning, 2010.\n\n[6] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural\ncomputation, 18(7):1527\u20131554, 2006.\n\n[7] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks.\nScience, 313(5786):504\u2013507, 2006.\n\n[8] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving\nneural networks by preventing co-adaptation of feature detectors. arXiv preprint\narXiv:1207.0580, 2012.\n\n[9] Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. What is the best\nmulti-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International\nConference on, pages 2146\u20132153. IEEE, 2009.\n\n[10] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional\nneural networks. Advances in Neural Information Processing Systems, 25, 2012.\n\n[11] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition\nwith invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR\n2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II\u201397.\nIEEE, 2004.\n\n[12] V. Nair and G. Hinton. 3D object recognition with deep belief nets. Advances in Neural\nInformation Processing Systems, 22:1339\u20131347, 2009.\n\n[13] V. Nair and G.E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc.\n27th International Conference on Machine Learning, pages 807\u2013814. Omnipress Madison,\nWI, 2010.\n\n[14] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit\ninvariance during feature extraction. 
In Proceedings of the Twenty-eighth International\nConference on Machine Learning (ICML11), 2011.\n\n[15] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep Boltzmann machines.\nIn International Conference on Artificial Intelligence and Statistics. Citeseer, 2010.\n\n[16] J. Sietsma and R.J.F. Dow. Creating artificial neural networks that generalize. Neural Networks,\n4(1):67\u201379, 1991.\n\n[17] Jasper Snoek, Ryan P Adams, and Hugo Larochelle. Nonparametric guidance of autoencoder\nrepresentations using label information. Journal of Machine Learning Research, 13:2567\u20132588,\n2012.\n\n[18] Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical Bayesian optimization of machine\nlearning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960\u20132968,\n2012.\n\n[19] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of\ninitialization and momentum in deep learning.\n\n[20] Tijmen Tieleman. Gnumpy: an easy way to use GPU boards in Python. Department of Computer\nScience, University of Toronto, 2010.\n\n[21] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol. Stacked denoising autoencoders:\nLearning useful representations in a deep network with a local denoising criterion. The\nJournal of Machine Learning Research, 11:3371\u20133408, 2010.\n", "award": [], "sourceid": 1409, "authors": [{"given_name": "Jimmy", "family_name": "Ba", "institution": "University of Toronto"}, {"given_name": "Brendan", "family_name": "Frey", "institution": "University of Toronto"}]}