{"title": "Understanding Dropout", "book": "Advances in Neural Information Processing Systems", "page_first": 2814, "page_last": 2822, "abstract": "Dropout is a relatively new algorithm for training neural networks which relies on stochastically \u201cdropping out\u201d neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. We also show in simple cases how dropout performs stochastic gradient descent on a regularized error function.", "full_text": "Understanding Dropout\n\nPierre Baldi\n\nDepartment of Computer Science\nUniversity of California, Irvine\n\nIrvine, CA 92697\n\npfbaldi@uci.edu\n\nPeter Sadowski\n\nDepartment of Computer Science\nUniversity of California, Irvine\n\nIrvine, CA 92697\n\npjsadows@ics.uci.edu\n\nAbstract\n\nDropout is a relatively new algorithm for training neural networks which relies\non stochastically \u201cdropping out\u201d neurons during training in order to avoid the\nco-adaptation of feature detectors. We introduce a general formalism for studying\ndropout on either units or connections, with arbitrary probability values, and\nuse it to analyze the averaging and regularizing properties of dropout in both linear\nand non-linear networks. For deep neural networks, the averaging properties\nof dropout are characterized by three recursive equations, including the approximation\nof expectations by normalized weighted geometric means. 
We provide\nestimates and bounds for these approximations and corroborate the results with\nsimulations. Among other results, we also show how dropout performs stochastic\ngradient descent on a regularized error function.\n\n1 Introduction\n\nDropout is an algorithm for training neural networks that was described at NIPS 2012 [7]. In its\nsimplest form, during training, at each example presentation, feature detectors are deleted with\nprobability q = 1 \u2212 p = 0.5 and the remaining weights are trained by backpropagation. All weights\nare shared across all example presentations. During prediction, the weights are divided by two.\nThe main motivation behind the algorithm is to prevent the co-adaptation of feature detectors, or\noverfitting, by forcing neurons to be robust and rely on population behavior, rather than on the\nactivity of other specific units. In [7], dropout is reported to achieve state-of-the-art performance on\nseveral benchmark datasets. It is also noted that for a single logistic unit dropout performs a kind of\n\u201cgeometric averaging\u201d over the ensemble of possible subnetworks, and conjectured that something\nsimilar may occur also in multilayer networks leading to the view that dropout may be an economical\napproximation to training and using a very large ensemble of networks.\n\nIn spite of the impressive results that have been reported, little is known about dropout from a\ntheoretical standpoint, in particular about its averaging, regularization, and convergence properties.\nLikewise little is known about the importance of using q = 0.5, whether different values of q can\nbe used including different values for different layers or different units, and whether dropout can be\napplied to the connections rather than the units. 
Here we address these questions.\n\n2 Dropout in Linear Networks\n\nIt is instructive to first look at some of the properties of dropout in linear networks, since these can\nbe studied exactly in the most general setting of a multilayer feedforward network described by an\nunderlying acyclic graph. The activity in unit $i$ of layer $h$ can be expressed as:\n\n$$S_i^h(I) = \\sum_{l < h} \\sum_j w_{ij}^{hl} S_j^l \\quad \\text{with} \\quad S_j^0 = I_j \\tag{1}$$\n\nwhere the variables $w$ denote the weights and $I$ the input vector. Dropout applied to the units can be\nexpressed in the form\n\n$$S_i^h = \\sum_{l < h} \\sum_j w_{ij}^{hl} \\delta_j^l S_j^l \\quad \\text{with} \\quad S_j^0 = I_j \\tag{2}$$\n\nwhere $\\delta_j^l$ is a gating 0-1 Bernoulli variable, with $P(\\delta_j^l = 1) = p_j^l$. Throughout this paper we assume\nthat the variables $\\delta_j^l$ are independent of each other, independent of the weights, and independent of\nthe activity of the units. Similarly, dropout applied to the connections leads to the random variables\n\n$$S_i^h = \\sum_{l < h} \\sum_j \\delta_{ij}^{hl} w_{ij}^{hl} S_j^l \\quad \\text{with} \\quad S_j^0 = I_j \\tag{3}$$\n\nFor brevity in the rest of this paper, we focus exclusively on dropout applied to the units, but all the\nresults remain true for the case of dropout applied to the connections with minor adjustments.\n\nFor a fixed input vector, the expectation of the activity of all the units, taken over all possible\nrealizations of the gating variables hence all possible subnetworks, is given by:\n\n$$E(S_i^h) = \\sum_{l < h} \\sum_j w_{ij}^{hl} p_j^l E(S_j^l) \\quad \\text{for } h > 0 \\tag{4}$$\n\nwith $E(S_j^0) = I_j$ in the input layer. In short, the ensemble average can easily be computed by\nfeedforward propagation in the original network, simply replacing the weights $w_{ij}^{hl}$ by $w_{ij}^{hl} p_j^l$.\n\n3 Dropout in Neural Networks\n\n3.1 Dropout in Shallow Neural Networks\n\nConsider first a single logistic unit with $n$ inputs $O = \\sigma(S) = 1/(1 + ce^{-\\lambda S})$ and $S = \\sum_{j=1}^n w_j I_j$.\nTo achieve the greatest level of generality, we assume that the unit produces different outputs\n$O_1, \\ldots, O_m$, corresponding to different sums $S_1, \\ldots, S_m$ with different probabilities $P_1, \\ldots, P_m$\n($\\sum_i P_i = 1$). In the most relevant case, these outputs and these sums are associated with the\n$m = 2^n$ possible subnetworks of the unit. The probabilities $P_1, \\ldots, P_m$ could be generated, for\ninstance, by using Bernoulli gating variables, although this is not necessary for this derivation. It is\nuseful to define the following four quantities: the mean $E = \\sum_i P_i O_i$; the mean of the complements\n$E' = \\sum_i P_i (1 - O_i) = 1 - E$; the weighted geometric mean (WGM) $G = \\prod_i O_i^{P_i}$; and the\nweighted geometric mean of the complements $G' = \\prod_i (1 - O_i)^{P_i}$. We also define the normalized\nweighted geometric mean $NWGM = G/(G + G')$. We can now prove the key averaging theorem\nfor logistic functions:\n\n$$NWGM(O_1, \\ldots, O_m) = \\frac{1}{1 + ce^{-\\lambda E(S)}} = \\sigma(E(S)) \\tag{5}$$\n\nTo prove this result, we write\n\n$$NWGM(O_1, \\ldots, O_m) = \\frac{1}{1 + \\frac{\\prod_i (1 - O_i)^{P_i}}{\\prod_i O_i^{P_i}}} = \\frac{1}{1 + \\frac{\\prod_i (1 - \\sigma(S_i))^{P_i}}{\\prod_i \\sigma(S_i)^{P_i}}} \\tag{6}$$\n\nThe logistic function satisfies the identity $[1 - \\sigma(x)]/\\sigma(x) = ce^{-\\lambda x}$ and thus\n\n$$NWGM(O_1, \\ldots, O_m) = \\frac{1}{1 + \\prod_i [ce^{-\\lambda S_i}]^{P_i}} = \\frac{1}{1 + ce^{-\\lambda \\sum_i P_i S_i}} = \\sigma(E(S)) \\tag{7}$$\n\nThus in the case of Bernoulli gating variables, we can compute the NWGM over all possible\ndropout configurations by simple forward propagation: $NWGM = \\sigma(\\sum_{j=1}^n w_j p_j I_j)$. A similar\nresult is true also for normalized exponential transfer functions. Finally, one can also show that\nthe only class of functions $f$ that satisfy $NWGM(f) = f(E)$ are the constant functions and the\nlogistic functions [1].\n\n3.2 Dropout in Deep Neural Networks\n\nWe can now deal with the most interesting case of deep feedforward networks of sigmoidal units (see footnote 1),\ndescribed by a set of equations of the form\n\n$$O_i^h = \\sigma(S_i^h) = \\sigma\\Big(\\sum_{l < h} \\sum_j w_{ij}^{hl} O_j^l\\Big) \\quad \\text{with} \\quad O_j^0 = I_j \\tag{8}$$\n\nwhere $O_i^h$ is the output of unit $i$ in layer $h$. Dropout on the units can be described by\n\n$$O_i^h = \\sigma(S_i^h) = \\sigma\\Big(\\sum_{l < h} \\sum_j w_{ij}^{hl} \\delta_j^l O_j^l\\Big) \\quad \\text{with} \\quad O_j^0 = I_j \\tag{9}$$\n\nusing the Bernoulli selector variables $\\delta_j^l$. For each sigmoidal unit\n\n$$NWGM(O_i^h) = \\frac{\\prod_N (O_i^h)^{P(N)}}{\\prod_N (O_i^h)^{P(N)} + \\prod_N (1 - O_i^h)^{P(N)}} \\tag{10}$$\n\nwhere $N$ ranges over all possible subnetworks. Assume for now that the NWGM provides a\ngood approximation to the expectation (this point will be analyzed in the next section). Then the\naveraging properties of dropout are described by the following three recursive equations. First the\napproximation of means by NWGMs:\n\n$$E(O_i^h) \\approx NWGM(O_i^h) \\tag{11}$$\n\nSecond, using the result of the previous section, the propagation of expectation symbols:\n\n$$NWGM(O_i^h) = \\sigma_i^h\\big[E(S_i^h)\\big] \\tag{12}$$\n\nAnd third, using the linearity of the expectation with respect to sums, and to products of independent\nrandom variables:\n\n$$E(S_i^h) = \\sum_{l < h} \\sum_j w_{ij}^{hl} p_j^l E(O_j^l) \\tag{13}$$\n\nEquations 11, 12, and 13 are the fundamental equations explaining the averaging properties of the\ndropout procedure. The only approximation is of course Equation 11 which is analyzed in the next\nsection. 
If the network contains linear units, then Equation 11 is not necessary for those units and\ntheir average can be computed exactly. In the case of regression with linear units in the top layers,\nthis allows one to shave off one layer of approximations. The same is true in binary classification\nby requiring the output layer to compute directly the NWGM of the ensemble rather than the\nexpectation. It can be shown that for any error function that is convex up (\u222a), the error of the mean,\nweighted geometric mean, and normalized weighted geometric mean of an ensemble is always less\nthan the expected error of the models [1].\n\nEquation 11 is exact if and only if the numbers $O_i^h$ are identical over all possible subnetworks $N$.\nThus it is useful to measure the consistency $C(O_i^h, I)$ of neuron $i$ in layer $h$ for input $I$ by using\nthe variance $Var[O_i^h(I)]$ taken over all subnetworks $N$ and their distribution when the input $I$ is\nfixed. The larger the variance is, the less consistent the neuron is, and the worse we can expect\nthe approximation in Equation 11 to be. Note that for a random variable $O$ in [0,1] the variance\ncannot exceed 1/4 anyway. This is because\n$Var(O) = E(O^2) - (E(O))^2 \\leq E(O) - (E(O))^2 = E(O)(1 - E(O)) \\leq 1/4$.\nThis measure can also be averaged over a training set or a test set.\n\nFootnote 1: Given the results of the previous sections, the network can also include linear units or normalized\nexponential units.\n\n4 The Dropout Approximation\n\nGiven a set of numbers $O_1, \\ldots, O_m$ between 0 and 1, with probabilities $P_1, \\ldots, P_m$ (corresponding\nto the outputs of a sigmoidal neuron for a fixed input and different subnetworks), we are primarily\ninterested in the approximation of $E$ by NWGM. The NWGM provides a good approximation\nbecause we show below that to a first order of approximation: $E \\approx NWGM$ and $E \\approx G$. 
Furthermore,\nthere are formulae in the literature for bounding the error $E - G$ in terms of the consistency\n(e.g. the Cartwright and Field inequality [6]). However, one can suspect that the NWGM provides\neven a better approximation to $E$ than the geometric mean. For instance, if the numbers $O_i$ satisfy\n$0 < O_i \\leq 0.5$ (consistently low), then\n\n$$\\frac{G}{G'} \\leq \\frac{E}{E'} \\quad \\text{and therefore} \\quad G \\leq \\frac{G}{G + G'} \\leq E \\tag{14}$$\n\nThis is proven by applying Jensen\u2019s inequality to the function $\\ln x - \\ln(1 - x)$ for $x \\in (0, 0.5]$. It is\nalso known as the Ky Fan inequality [2, 8, 9].\n\nTo get even better results, one must consider a second order approximation. For this, we write\n$O_i = 0.5 + \\epsilon_i$ with $0 \\leq |\\epsilon_i| \\leq 0.5$. Thus we have $E(O) = 0.5 + E(\\epsilon)$ and $Var(O) = Var(\\epsilon)$.\nUsing a Taylor expansion:\n\n$$G = \\frac{1}{2} \\prod_i \\sum_{n=0}^{\\infty} \\binom{p_i}{n} (2\\epsilon_i)^n = \\frac{1}{2} \\Big[ 1 + \\sum_i p_i 2\\epsilon_i + \\sum_i \\frac{p_i(p_i - 1)}{2} (2\\epsilon_i)^2 + \\sum_{i < j} 4 p_i p_j \\epsilon_i \\epsilon_j + R_3(\\epsilon_i) \\Big] \\tag{15}$$\n\nwhere $R_3(\\epsilon_i)$ is the remainder and\n\n$$R_3(\\epsilon_i) = \\binom{p_i}{3} \\frac{(2\\epsilon_i)^3}{(1 + u_i)^{3 - p_i}} \\tag{16}$$\n\nwhere $|u_i| \\leq 2|\\epsilon_i|$. Expanding the product gives\n\n$$G = \\frac{1}{2} + \\sum_i p_i \\epsilon_i + \\Big(\\sum_i p_i \\epsilon_i\\Big)^2 - \\sum_i p_i \\epsilon_i^2 + R_3(\\epsilon) = \\frac{1}{2} + E(\\epsilon) - Var(\\epsilon) + R_3(\\epsilon) = E(O) - Var(O) + R_3(\\epsilon) \\tag{17}$$\n\nBy symmetry, we have\n\n$$G' = \\prod_i (1 - O_i)^{p_i} = 1 - E(O) - Var(O) + R_3(\\epsilon) \\tag{18}$$\n\nwhere $R_3(\\epsilon)$ is the higher order remainder. 
Neglecting the remainder and writing $E = E(O)$ and\n$V = Var(O)$ we have\n\n$$\\frac{G}{G + G'} \\approx \\frac{E - V}{1 - 2V} \\quad \\text{and} \\quad \\frac{G'}{G + G'} \\approx \\frac{1 - E - V}{1 - 2V} \\tag{19}$$\n\nThus, to a second order, the differences between the mean and the geometric mean and the normalized\ngeometric means satisfy\n\n$$E - G \\approx V \\quad \\text{and} \\quad E - \\frac{G}{G + G'} \\approx \\frac{V(1 - 2E)}{1 - 2V} \\tag{20}$$\n\nand\n\n$$1 - E - G' \\approx V \\quad \\text{and} \\quad (1 - E) - \\frac{G'}{G + G'} \\approx \\frac{V(2E - 1)}{1 - 2V} \\tag{21}$$\n\nFinally it is easy to check that the factor $(1 - 2E)/(1 - 2V)$ is always less than or equal to 1. In addition\nwe always have $V \\leq E(1 - E)$, with equality achieved only for 0-1 Bernoulli variables. Thus\n\n$$\\Big| E - \\frac{G}{G + G'} \\Big| \\approx \\frac{V |1 - 2E|}{1 - 2V} \\leq \\frac{E(1 - E)|1 - 2E|}{1 - 2V} \\leq 2E(1 - E)|1 - 2E| \\tag{22}$$\n\nThe first inequality is optimal in the sense that it is attained in the case of a Bernoulli variable\nwith expectation $E$ and, intuitively, the second inequality shows that the approximation error is\nalways small, regardless of whether $E$ is close to 0, 0.5, or 1. In short, the NWGM provides a\nvery good approximation to $E$, better than the geometric mean $G$. The property is always true to\na second order of approximation and it is exact when the activities are consistently low, or when\n$NWGM \\leq E$, since the latter implies $G \\leq NWGM \\leq E$. Several additional properties of the\ndropout approximation, including the extension to rectified linear units and other transfer functions,\nare studied in [1].\n\n5 Dropout Dynamics\n\nDropout performs gradient descent on-line with respect to both the training examples and the\nensemble of all possible subnetworks. 
As such, and with the appropriately decreasing learning rates,\nit is almost surely convergent like other forms of stochastic gradient descent [11, 4, 5]. To further\nunderstand the properties of dropout, it is again instructive to look at the properties of the gradient\nin the linear case.\n\n5.1 Single Linear Unit\n\nIn the case of a single linear unit, consider the two error functions $E_{ENS}$ and $E_D$ associated with\nthe ensemble of all possible subnetworks and the network with dropout. For a single input $I$, these\nare defined by:\n\n$$E_{ENS} = \\frac{1}{2}(t - O_{ENS})^2 = \\frac{1}{2}\\Big(t - \\sum_{i=1}^n p_i w_i I_i\\Big)^2 \\tag{23}$$\n\n$$E_D = \\frac{1}{2}(t - O_D)^2 = \\frac{1}{2}\\Big(t - \\sum_{i=1}^n \\delta_i w_i I_i\\Big)^2 \\tag{24}$$\n\nWe use a single training input $I$ for notational simplicity, otherwise the errors of each training\nexample can be combined additively. The learning gradient is given by\n\n$$\\frac{\\partial E_{ENS}}{\\partial w_i} = -(t - O_{ENS}) \\frac{\\partial O_{ENS}}{\\partial w_i} = -(t - O_{ENS}) p_i I_i \\tag{25}$$\n\n$$\\frac{\\partial E_D}{\\partial w_i} = -(t - O_D) \\frac{\\partial O_D}{\\partial w_i} = -(t - O_D) \\delta_i I_i = -t \\delta_i I_i + w_i \\delta_i^2 I_i^2 + \\sum_{j \\neq i} w_j \\delta_i \\delta_j I_i I_j \\tag{26}$$\n\nThe dropout gradient is a random variable and we can take its expectation. A short calculation yields\n\n$$E\\Big(\\frac{\\partial E_D}{\\partial w_i}\\Big) = \\frac{\\partial E_{ENS}}{\\partial w_i} + w_i p_i (1 - p_i) I_i^2 = \\frac{\\partial E_{ENS}}{\\partial w_i} + w_i I_i^2 Var(\\delta_i) \\tag{27}$$\n\nThus, remarkably, in this case the expectation of the gradient with dropout is the gradient of the\nregularized ensemble error\n\n$$E = E_{ENS} + \\frac{1}{2} \\sum_{i=1}^n w_i^2 I_i^2 Var(\\delta_i) \\tag{28}$$\n\nThe regularization term is the usual weight decay or Gaussian prior term based on the square of the\nweights to prevent overfitting. 
Dropout provides immediately the magnitude of the regularization\nterm which is adaptively scaled by the inputs and by the variance of the dropout variables. Note that\n$p_i = 0.5$ is the value that provides the highest level of regularization.\n\n5.2 Single Sigmoidal Unit\n\nThe previous result generalizes to a sigmoidal unit $O = \\sigma(S) = 1/(1 + ce^{-\\lambda S})$ trained to minimize\nthe relative entropy error $E = -(t \\log O + (1 - t) \\log(1 - O))$. In this case,\n\n$$\\frac{\\partial E_D}{\\partial w_i} = -\\lambda (t - O) \\frac{\\partial S}{\\partial w_i} = -\\lambda (t - O) \\delta_i I_i \\tag{29}$$\n\nThe terms $O$ and $I_i$ are not independent but using a Taylor expansion with the NWGM approximation\ngives\n\n$$E\\Big(\\frac{\\partial E_D}{\\partial w_i}\\Big) \\approx \\frac{\\partial E_{ENS}}{\\partial w_i} + \\lambda \\sigma'(S_{ENS}) w_i I_i^2 Var(\\delta_i) \\tag{30}$$\n\nwith $S_{ENS} = \\sum_j w_j p_j I_j$. Thus, as in the linear case, the expectation of the dropout gradient is\napproximately the gradient of the ensemble network regularized by weight decay terms with the proper\nadaptive coefficients. A similar analysis can also be carried out for a set of normalized exponential\nunits and for deeper networks [1].\n\n5.3 Learning Phases and Sparse Coding\n\nDuring dropout learning, we can expect three learning phases: (1) At the beginning of learning, when\nthe weights are typically small and random, the total input to each unit is close to 0 for all the units\nand the consistency is high: the output of the units remains roughly constant across subnetworks\n(and equal to 0.5 with $c = 1$). (2) As learning progresses, activities tend to move towards 0 or 1\nand the consistency decreases, i.e. for a given input the variance of the units across subnetworks\nincreases. 
(3) As the stochastic gradient learning procedure converges, the consistency of the units\nconverges to a stable value.\n\nFinally, for simplicity, assume that dropout is applied only in layer $h$ where the units have an output\nof the form $O_i^h = \\sigma(S_i^h)$ and $S_i^h = \\sum_{l < h} w_{ij}^{hl} \\delta_j^l O_j^l$. For a fixed input, $O_j^l$ is a constant since dropout\nis not applied to layer $l$. Thus\n\n$$Var(S_i^h) = \\sum_{l < h} (w_{ij}^{hl})^2 (O_j^l)^2 p_j^l (1 - p_j^l) \\tag{31}$$\n\nunder the usual assumption that the selector variables $\\delta_j^l$ are independent of each other. Thus\n$Var(S_i^h)$ depends on three factors. Everything else being equal, it is reduced by: (1) Small weights,\nwhich go together with the regularizing effect of dropout; (2) Small activities, which shows that\ndropout is not symmetric with respect to small or large activities. Overall, dropout tends to favor\nsmall activities and thus sparse coding; and (3) Small (close to 0) or large (close to 1) values of the\ndropout probabilities $p_j^l$. Thus values $p_j^l = 0.5$ maximize the regularization effect but may also lead\nto slower convergence to the consistent state. Additional results and simulations are given in [1].\n\n6 Simulation Results\n\nWe use Monte Carlo simulation to partially investigate the approximation framework embodied by\nthe three fundamental dropout equations 11, 12, and 13, the accuracy of the second-order approximation\nand bounds in Equations 20 and 22, and the dynamics of dropout learning. We experiment\nwith an MNIST classifier of four hidden layers (784-1200-1200-1200-1200-10) that replicates the\nresults in [7] using the Pylearn2 and Theano software libraries [12, 3]. The network is trained with\na dropout probability of 0.8 in the input, and 0.5 in the four hidden layers. For fixed weights and\na fixed input, 10,000 Monte Carlo simulations are used to estimate the distribution of activity $O$\nin each neuron. 
Let $O^*$ be the activation under the deterministic setting with the weights scaled\nappropriately.\n\nThe left column of Figure 1 confirms empirically that the second-order approximation in Equation\n20 and the bound in Equation 22 are accurate. The right column of Figure 1 shows the difference between\nthe true ensemble average $E(O)$ and the prediction-time neuron activity $O^*$. This difference\ngrows very slowly in the higher layers, and only for active neurons.\n\nFigure 1: Left: The difference $E(O) - NWGM(O)$, its second-order approximation in Equation\n20, and the bound from Equation 22, plotted for four hidden layers and a typical fixed input. Right:\nThe difference between the true ensemble average $E(O)$ and the final neuron prediction $O^*$.\n\nNext, we examine the neuron consistency during dropout training. Figure 2a shows the three phases\nof learning for a typical neuron. In Figure 2b, we observe that the consistency does not decline in\nhigher layers of the network.\n\nOne clue into how this happens is the distribution of neuron activity. As noted in [10] and section 5\nabove, dropout training results in sparse activity in the hidden layers (Figure 3). This increases the\nconsistency of neurons in the next layer.\n\n(a) The three phases of learning. For a particular\ninput, a typical active neuron (red) starts out\nwith low variance, experiences a large increase in\nvariance during learning, and eventually settles to\nsome steady constant value. In contrast, a typical\ninactive neuron (blue) quickly learns to stay silent.\nShown are the mean with 5th and 95th percentiles.\n\n(b) Consistency does not noticeably decline in the upper\nlayers. 
Shown here are the mean Std(O) for active\nneurons (0.1 < O after training) in each layer, along\nwith the 5th and 95th percentiles.\n\nFigure 2\n\nFigure 3: In every hidden layer of a dropout trained network, the distribution of neuron activations\n$O^*$ is sparse and not symmetric. These histograms were totalled over a set of 100 random inputs.\n\nReferences\n\n[1] P. Baldi and P. Sadowski. The Dropout Learning Algorithm. Artificial Intelligence, 2014. In press.\n\n[2] E. F. Beckenbach and R. Bellman. Inequalities. Springer-Verlag Berlin, 1965.\n\n[3] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian,\nD. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In\nProceedings of the Python for Scientific Computing Conference (SciPy), Austin, TX, June\n2010. Oral Presentation.\n\n[4] L. Bottou. Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning\nand Neural Networks. Cambridge University Press, Cambridge, UK, 1998.\n\n[5] L. Bottou. Stochastic learning. In O. Bousquet and U. von Luxburg, editors, Advanced Lectures\non Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176, pages 146\u2013168.\nSpringer Verlag, Berlin, 2004.\n\n[6] D. Cartwright and M. Field. A refinement of the arithmetic mean-geometric mean inequality.\nProceedings of the American Mathematical Society, pages 36\u201338, 1978.\n\n[7] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural\nnetworks by preventing co-adaptation of feature detectors. http://arxiv.org/abs/1207.0580,\n2012.\n\n[8] E. Neuman and J. S\u00e1ndor. On the Ky Fan inequality and related inequalities I. Mathematical\nInequalities and Applications, 5:49\u201356, 2002.\n\n[9] E. Neuman and J. S\u00e1ndor. On the Ky Fan inequality and related inequalities II. Bulletin of the\nAustralian Mathematical Society, 72(1):87\u2013108, 2005.\n\n[10] S. Nitish. Improving Neural Networks with Dropout. PhD thesis, University of Toronto,\nToronto, Canada, 2013.\n\n[11] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales\nand some applications. Optimizing methods in statistics, pages 233\u2013257, 1971.\n\n[12] D. Warde-Farley, I. Goodfellow, P. Lamblin, G. Desjardins, F. Bastien, and Y. Bengio.\npylearn2. 2011. http://deeplearning.net/software/pylearn2.\n", "award": [], "sourceid": 1291, "authors": [{"given_name": "Pierre", "family_name": "Baldi", "institution": "UC Irvine"}, {"given_name": "Peter", "family_name": "Sadowski", "institution": "UC Irvine"}]}