{"title": "Learning the Number of Neurons in Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2270, "page_last": 2278, "abstract": "Nowadays, the number of layers and of neurons in each layer of a deep network are typically set manually. While very deep and wide networks have proven effective in general, they come at a high memory and computation cost, thus making them impractical for constrained platforms. These networks, however, are known to have many redundant parameters, and could thus, in principle, be replaced by more compact architectures. In this paper, we introduce an approach to automatically determining the number of neurons in each layer of a deep network during learning. To this end, we propose to make use of a group sparsity regularizer on the parameters of the network, where each group is defined to act on a single neuron. Starting from an overcomplete network, we show that our approach can reduce the number of parameters by up to 80\\% while retaining or even improving the network accuracy.", "full_text": "Learning the Number of Neurons in Deep Networks\n\nJose M. Alvarez\u2217\nData61 @ CSIRO\n\nCanberra, ACT 2601, Australia\n\njose.alvarez@data61.csiro.au\n\nMathieu Salzmann\n\nCVLab, EPFL\n\nCH-1015 Lausanne, Switzerland\nmathieu.salzmann@epfl.ch\n\nAbstract\n\nNowadays, the number of layers and of neurons in each layer of a deep network\nare typically set manually. While very deep and wide networks have proven\neffective in general, they come at a high memory and computation cost, thus\nmaking them impractical for constrained platforms. These networks, however,\nare known to have many redundant parameters, and could thus, in principle, be\nreplaced by more compact architectures. In this paper, we introduce an approach to\nautomatically determining the number of neurons in each layer of a deep network\nduring learning. 
To this end, we propose to make use of a group sparsity regularizer\non the parameters of the network, where each group is de\ufb01ned to act on a single\nneuron. Starting from an overcomplete network, we show that our approach can\nreduce the number of parameters by up to 80% while retaining or even improving\nthe network accuracy.\n\n1\n\nIntroduction\n\nThanks to the growing availability of large-scale datasets and computation power, Deep Learning\nhas recently generated a quasi-revolution in many \ufb01elds, such as Computer Vision and Natural\nLanguage Processing. Despite this progress, designing a deep architecture for a new task essentially\nremains a dark art. It involves de\ufb01ning the number of layers and of neurons in each layer, which,\ntogether, determine the number of parameters, or complexity, of the model, and which are typically\nset manually by trial and error.\nA recent trend to avoid this issue consists of building very deep [Simonyan and Zisserman, 2014] or\nultra deep [He et al., 2015] networks, which have proven more expressive. This, however, comes at\na signi\ufb01cant cost in terms of memory requirement and speed, which may prevent the deployment\nof such networks on constrained platforms at test time and complicate the learning process due to\nexploding or vanishing gradients.\nAutomatic model selection has nonetheless been studied in the past, using both constructive and\ndestructive approaches. Starting from a shallow architecture, constructive methods work by in-\ncrementally incorporating additional parameters [Bello, 1992] or, more recently, layers to the net-\nwork [Simonyan and Zisserman, 2014]. The main drawback of this approach stems from the fact\nthat shallow networks are less expressive than deep ones, and may thus provide poor initialization\nwhen adding new layers. 
By contrast, destructive techniques exploit the fact that very deep models include a significant number of redundant parameters [Denil et al., 2013, Cheng et al., 2015], and thus, given an initial deep network, aim at reducing it while keeping its representation power. Originally, this was achieved by removing the parameters [LeCun et al., 1990, Hassibi et al., 1993] or the neurons [Mozer and Smolensky, 1988, Ji et al., 1990, Reed, 1993] that have little influence on the output. While effective, this requires analyzing every parameter/neuron independently, e.g., via the network Hessian, and thus does not scale well to large architectures. Therefore, recent trends in network reduction have focused on training shallow or thin networks to mimic the behavior of large, deep ones [Hinton et al., 2014, Romero et al., 2015]. This approach, however, acts as a post-processing step, and thus requires being able to successfully train an initial deep network.

*http://www.josemalvarez.net.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In this paper, we introduce an approach to automatically selecting the number of neurons in each layer of a deep architecture simultaneously as we learn the network. Specifically, our method does not require training an initial network as a pre-processing step. Instead, we introduce a group sparsity regularizer on the parameters of the network, where each group is defined to act on the parameters of one neuron. Setting these parameters to zero therefore amounts to canceling the influence of a particular neuron and thus removing it entirely. 
As a consequence, our approach does not depend on\nthe success of learning a redundant network to later reduce its parameters, but instead jointly learns\nthe number of relevant neurons in each layer and the parameters of these neurons.\nWe demonstrate the effectiveness of our approach on several network architectures and using several\nimage recognition datasets. Our experiments demonstrate that our method can reduce the number of\nparameters by up to 80% compared to the complete network. Furthermore, this reduction comes at no\nloss in recognition accuracy; it even typically yields an improvement over the complete network. In\nshort, our approach not only lets us automatically perform model selection, but it also yields networks\nthat, at test time, are more effective, faster and require less memory.\n\n2 Related work\n\nModel selection for deep architectures, or more precisely determining the best number of parameters,\nsuch as the number of layers and of neurons in each layer, has not yet been widely studied. Currently,\nthis is mostly achieved by manually tuning these hyper-parameters using validation data, or by relying\non very deep networks [Simonyan and Zisserman, 2014, He et al., 2015], which have proven effective\nin many scenarios. These large networks, however, come at the cost of high memory footprint and\nlow speed at test time. Furthermore, it is well-known that most of the parameters in such networks\nare redundant [Denil et al., 2013, Cheng et al., 2015], and thus more compact architectures could do\nas good a job as the very deep ones.\nWhile sparse, some literature on model selection for deep learning nonetheless exists. In particular, a\nforerunner approach was presented in [Ash, 1989] to dynamically add nodes to an existing architecture.\nSimilarly, [Bello, 1992] introduced a constructive method that incrementally grows a network by\nadding new neurons. 
More recently, a similar constructive strategy was successfully employed\nby [Simonyan and Zisserman, 2014], where their \ufb01nal very deep network was built by adding new\nlayers to an initial shallower architecture. The constructive approach, however, has a drawback:\nShallow networks are known not to handle non-linearities as effectively as deeper ones [Montufar\net al., 2014]. Therefore, the initial, shallow architectures may easily get trapped in bad optima, and\nthus provide poor initialization for the constructive steps.\nIn contrast with constructive methods, destructive approaches to model selection start with an initial\ndeep network, and aim at reducing it while keeping its behavior unchanged. This trend was started\nby [LeCun et al., 1990, Hassibi et al., 1993] to cancel out individual parameters, and by [Mozer\nand Smolensky, 1988, Ji et al., 1990, Reed, 1993], and more recently [Liu et al., 2015], when it\ncomes to removing entire neurons. The core idea of these methods consists of studying the saliency\nof individual parameters or neurons and remove those that have little in\ufb02uence on the output of\nthe network. Analyzing individual parameters/neurons, however, quickly becomes computationally\nexpensive for large networks, particularly when the procedure involves computing the network\nHessian and is repeated multiple times over the learning process. As a consequence, these techniques\nhave no longer been pursued in the current large-scale era. Instead, the more recent take on the\ndestructive approach consists of learning a shallower or thinner network that mimics the behavior\nof an initial deep one [Hinton et al., 2014, Romero et al., 2015], which ultimately also reduces the\nnumber of parameters of the initial network. The main motivation of these works, however, was not\ntruly model selection, but rather building a more compact network.\nAs a matter of fact, designing compact models also is an active research focus in deep learning. 
In particular, in the context of Convolutional Neural Networks (CNNs), several works have proposed to decompose the filters of a pre-trained network into low-rank filters, thus reducing the number of parameters [Jaderberg et al., 2014b, Denton et al., 2014, Gong et al., 2014]. However, this approach, similarly to some destructive methods mentioned above, acts as a post-processing step, and thus requires being able to successfully train an initial deep network. Note that, in a more general context, it has been shown that a two-step procedure is typically outperformed by one-step, direct training [Srivastava et al., 2015]. Such a direct approach has been employed by [Weigend et al., 1991] and [Collins and Kohli, 2014], who have developed regularizers that favor eliminating some of the parameters of the network, thus leading to lower memory requirement. The regularizers are minimized simultaneously as the network is learned, and thus no pre-training is required. However, they act on individual parameters. Therefore, similarly to [Jaderberg et al., 2014b, Denton et al., 2014] and to other parameter regularization techniques [Krogh and Hertz, 1992, Bartlett, 1996], these methods do not perform model selection; the number of layers and neurons per layer is determined manually and won't be affected by learning.

By contrast, in this paper, we introduce an approach to automatically determine the number of neurons in each layer of a deep network. To this end, we design a regularizer-based formulation and therefore do not rely on any pre-training. In other words, our approach performs model selection and produces a compact network in a single, coherent learning framework. To the best of our knowledge, only three works have studied similar group sparsity regularizers for deep networks. However, [Zhou et al., 2016] focuses on the last fully-connected layer to obtain a compact model, and [S. 
Tu, 2014] and [Murray and Chiang, 2015] only considered small networks. Our approach scales to datasets and architectures two orders of magnitude larger than in these last two works, with minimal (and tractable) training overhead. Furthermore, these three methods define a single global regularizer. By contrast, we work in a per-layer fashion, which we found more effective at reducing the number of neurons by large factors without an accuracy drop.

3 Deep Model Selection

We now introduce our approach to automatically determining the number of neurons in each layer of a deep network while learning the network parameters. To this end, we describe our framework for a general deep network, and discuss specific architectures in the experiments section.

A general deep network can be described as a succession of L layers performing linear operations on their input, intertwined with non-linearities, such as Rectified Linear Units (ReLU) or sigmoids, and, potentially, pooling operations. Each layer l consists of N_l neurons, each of which is encoded by parameters \theta^n_l = [w^n_l, b^n_l], where w^n_l is a linear operator acting on the layer's input and b^n_l is a bias. Altogether, these parameters form the parameter set \Theta = \{\theta_l\}_{1 \le l \le L}, with \theta_l = \{\theta^n_l\}_{1 \le n \le N_l}. Given an input signal x, such as an image, the output of the network can be written as \hat{y} = f(x, \Theta), where f(\cdot) encodes the succession of linear, non-linear and pooling operations.

Given a training set consisting of N input-output pairs \{(x_i, y_i)\}_{1 \le i \le N}, learning the parameters of the network can be expressed as solving an optimization problem of the form

\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i, \Theta)) + r(\Theta) ,   (1)

where \ell(\cdot) is a loss that compares the network prediction with the ground-truth output, such as the logistic loss for classification or the square loss for regression, and r(\cdot) is a regularizer acting on the network parameters. Popular choices for such a regularizer include weight decay, i.e., r(\cdot) is the (squared) \ell_2-norm, or sparsity-inducing norms, e.g., the \ell_1-norm.

Recall that our goal here is to automatically determine the number of neurons in each layer of the network. We propose to do this by starting from an overcomplete network and canceling the influence of some of its neurons. Note that none of the standard regularizers mentioned above achieves this goal: the former favors small parameter values, and the latter tends to cancel out individual parameters, but not complete neurons. In fact, a neuron is encoded by a group of parameters, and our goal therefore translates to making entire groups go to zero. To achieve this, we make use of the notion of group sparsity [Yuan and Lin, 2007]. 
In particular, we write our regularizer as

r(\Theta) = \sum_{l=1}^{L} \lambda_l \sqrt{P_l} \sum_{n=1}^{N_l} \|\theta^n_l\|_2 ,   (2)

where, without loss of generality, we assume that the parameters of each neuron in layer l are grouped in a vector of size P_l, and where \lambda_l sets the influence of the penalty. Note that, in the general case, this weight can be different for each layer l. In practice, however, we found it most effective to have two different weights: a relatively small one for the first few layers, and a larger weight for the remaining ones. This effectively prevents killing too many neurons in the first few layers, and thus retains enough information for the remaining ones.

While group sparsity lets us effectively remove some of the neurons, exploiting standard regularizers on the individual parameters has proven effective in the past for generalization purposes [Bartlett, 1996, Krogh and Hertz, 1992, Theodoridis, 2015, Collins and Kohli, 2014]. To further leverage this idea within our automatic model selection approach, we propose to exploit the sparse group Lasso idea of [Simon et al., 2013]. This lets us write our regularizer as

r(\Theta) = \sum_{l=1}^{L} \left( (1 - \alpha) \lambda_l \sqrt{P_l} \sum_{n=1}^{N_l} \|\theta^n_l\|_2 + \alpha \lambda_l \|\theta_l\|_1 \right) ,   (3)

where \alpha \in [0, 1] sets the relative influence of both terms. Note that \alpha = 0 brings us back to the regularizer of Eq. 2. In practice, we experimented with both \alpha = 0 and \alpha = 0.5.

To solve Problem (1) with the regularizer defined by either Eq. 2 or Eq. 3, we follow a proximal gradient descent approach [Parikh and Boyd, 2014]. In our context, proximal gradient descent can be thought of as iteratively taking a gradient step of size t with respect to the loss \sum_{i=1}^{N} \ell(y_i, f(x_i, \Theta)) only, and, from the resulting solution, applying the proximal operator of the regularizer. In our case, since the groups are non-overlapping, we can apply the proximal operator to each group independently. Specifically, for a single group, this translates to updating the parameters as

\tilde{\theta}^n_l = \operatorname{argmin}_{\theta^n_l} \frac{1}{2t} \|\theta^n_l - \hat{\theta}^n_l\|_2^2 + r(\Theta) ,   (4)

where \hat{\theta}^n_l is the solution obtained from the loss-based gradient step. Following the derivations of [Simon et al., 2013], and focusing on the regularizer of Eq. 3, of which Eq. 2 is a special case, this problem has a closed-form solution given by

\tilde{\theta}^n_l = \left( 1 - \frac{t (1 - \alpha) \lambda_l \sqrt{P_l}}{\| S(\hat{\theta}^n_l, t \alpha \lambda_l) \|_2} \right)_+ S(\hat{\theta}^n_l, t \alpha \lambda_l) ,   (5)

where (\cdot)_+ denotes taking the maximum between the argument and 0, and S(\cdot) is the soft-thresholding operator defined elementwise as

(S(z, \tau))_j = \operatorname{sign}(z_j) (|z_j| - \tau)_+ .   (6)

The learning algorithm therefore proceeds by iteratively taking a gradient step based on the loss only, and updating the variables of all the groups according to Eq. 5. In practice, we follow a stochastic gradient descent approach and work with mini-batches. In this setting, we apply the proximal operator at the end of each epoch and run the algorithm for a fixed number of epochs.

When learning terminates, the parameters of some of the neurons will have gone to zero. We can thus remove these neurons entirely, since they have no effect on the output. 
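To make the update concrete, here is a minimal NumPy sketch of the per-group proximal step of Eqs. (5) and (6). This is an illustration under our own naming (soft_threshold, prox_sparse_group_lasso), not the authors' released implementation.

```python
import numpy as np

def soft_threshold(z, tau):
    # Eq. (6): (S(z, tau))_j = sign(z_j) * (|z_j| - tau)_+
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_sparse_group_lasso(theta_hat, t, lam, alpha):
    """Closed-form proximal update of Eq. (5) for one neuron's parameter group.

    theta_hat: the group's parameters after the loss-only gradient step,
    t: gradient step size, lam: layer weight lambda_l, alpha in [0, 1]
    (alpha = 0 recovers the pure group-sparsity regularizer of Eq. (2)).
    """
    P_l = theta_hat.size
    s = soft_threshold(theta_hat, t * alpha * lam)
    norm = np.linalg.norm(s)          # ||S(theta_hat, t*alpha*lam)||_2
    if norm == 0.0:                   # whole group already zeroed out
        return np.zeros_like(theta_hat)
    shrink = max(0.0, 1.0 - t * (1.0 - alpha) * lam * np.sqrt(P_l) / norm)
    return shrink * s                 # shrink == 0 removes the entire neuron
```

With a large enough lambda_l the shrinkage factor hits zero and the whole group, i.e., the neuron, is zeroed out in a single update; applying this operator to every group at the end of each epoch matches the training loop described above.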
Furthermore, when considering fully-connected layers, the neurons acting on the output of zeroed-out neurons of the previous layer also become useless, and can thus be removed. Ultimately, removing all these neurons yields a more compact architecture than the original, overcomplete one.

4 Experiments

In this section, we demonstrate the ability of our method to automatically determine the number of neurons on the task of large-scale classification. To this end, we study three different architectures and analyze the behavior of our method on three different datasets, with a particular focus on parameter reduction. Below, we first describe our experimental setup and then discuss our results.

4.1 Experimental setup

Datasets: For our experiments, we used two large-scale image classification datasets, ImageNet [Russakovsky et al., 2015] and Places2-401 [Zhou et al., 2015]. Furthermore, we conducted additional experiments on the character recognition dataset of [Jaderberg et al., 2014a]. ImageNet contains over 15 million labeled images split into 22,000 categories. We used the ILSVRC-2012 [Russakovsky et al., 2015] subset consisting of 1000 categories, with 1.2 million training images and 50,000 validation images. Places2-401 [Zhou et al., 2015] is a large-scale dataset specifically created for high-level visual understanding tasks. It consists of more than 10 million images with 401 unique scene categories. The training set comprises between 5,000 and 30,000 images per category. Finally, the ICDAR character recognition dataset of [Jaderberg et al., 2014a] consists of 185,639 training and 5,198 test samples split into 36 categories. 
The training samples depict characters collected from\ntext observed in a number of scenes and from synthesized datasets, while the test set comes from the\nICDAR2003 training set after removing all non-alphanumeric characters.\nArchitectures: For ImageNet and Places2-401, our architectures are based on the VGG-B network\n(BNet) [Simonyan and Zisserman, 2014] and on DecomposeMe8 (Dec8) [Alvarez and Petersson,\n2016]. BNet consists of 10 convolutional layers followed by three fully-connected layers. In our\nexperiments, we removed the \ufb01rst two fully-connected layers. As will be shown in our results, while\nthis reduces the number of parameters, it maintains the accuracy of the original network. Below, we\nrefer to this modi\ufb01ed architecture as BNetC. Following the idea of low-rank \ufb01lters, Dec8 consists of\n16 convolutional layers with 1D kernels, effectively modeling 8 2D convolutional layers. For ICDAR,\nwe used an architecture similar to the one of [Jaderberg et al., 2014b]. The original architecture\nconsists of three convolutional layers with a maxout layer [Goodfellow et al., 2013] after each\nconvolution, followed by one fully-connected layer. [Jaderberg et al., 2014b] \ufb01rst trained this network\nand then decomposed each 2D convolution into 2 1D kernels. Here, instead, we directly start with 6\n1D convolutional layers. Furthermore, we replaced the maxout layers with max-pooling. As shown\nbelow, this architecture, referred to as Dec3, yields similar results as the original one, referred to as\nMaxOut.\nImplementation details: For the comparison to be fair, all models including the baselines were\ntrained from scratch on the same computer using the same random seed and the same framework.\nMore speci\ufb01cally, for ImageNet and Places2-401, we used the torch-7 multi-gpu framework [Collobert\net al., 2011] on a Dual Xeon 8-core E5-2650 with 128GB of RAM using three Kepler Tesla K20m\nGPUs in parallel. 
All models were trained for a total of 55 epochs with 12,000 batches per epoch and a batch size of 48 and 180 for BNet and Dec8, respectively. These variations in batch size were mainly due to the memory and runtime limitations of BNet. The learning rate was set to an initial value of 0.01 and then multiplied by 0.1. Data augmentation was done through random crops and random horizontal flips with probability 0.5. For ICDAR, we trained each network on a single Tesla K20m GPU for a total of 45 epochs with a batch size of 256 and 1,000 iterations per epoch. In this case, the learning rate was set to an initial value of 0.1 and multiplied by 0.1 in the second, seventh and fifteenth epochs. We used a momentum of 0.9. In terms of hyper-parameters, for large-scale classification, we used λ_l = 0.102 for the first three layers and λ_l = 0.255 for the remaining ones. For ICDAR, we used λ_l = 5.1 for the first layer and λ_l = 10.2 for the remaining ones.

Evaluation: We measure classification performance as the top-1 accuracy using the center crop, referred to as Top-1. We compare the results of our approach with those obtained by training the same architectures, but without our model selection technique. We also provide the results of additional, standard architectures. Furthermore, since our approach can determine the number of neurons per layer, we also computed results with our method starting from different numbers of neurons, referred to as M below, in the overcomplete network. 
In addition to accuracy, we also report, for the convolutional layers, the percentage of neurons set to 0 by our approach (neurons), the corresponding percentage of zero-valued parameters (group param), the total percentage of 0 parameters (total param), which additionally includes the parameters set to 0 in non-completely zeroed-out neurons, and the total percentage of zero-valued parameters induced by the zeroed-out neurons (total induced), which additionally counts the neurons in each layer, including the last fully-connected layer, that have been rendered useless by the zeroed-out neurons of the previous layer.

4.2 Results

Below, we report our results on ImageNet and ICDAR. The results on Places-2 are provided as supplementary material.

ImageNet: We start by discussing our results on ImageNet. For this experiment, we used BNetC and Dec8, both with the group sparsity (GS) regularizer of Eq. 2. Furthermore, in the case of

Table 1: Top-1 accuracy results for several state-of-the-art architectures and our method on ImageNet.

Model                            Top-1 acc. (%)
BNet                             62.5
BNetC                            61.1
ResNet50 (a) [He et al., 2015]   67.3
Dec8                             64.8
Dec8-640                         66.9
Dec8-768                         68.1
Ours-BNetC-GS                    62.7
Ours-Dec8-GS                     64.8
Ours-Dec8-640SGL                 67.5
Ours-Dec8-640GS                  68.6
Ours-Dec8-768GS                  68.0

(a) Trained over 55 epochs using a batch size of 128 on two TitanX with code publicly available.

BNetC on ImageNet (in %)      GS
neurons                    12.70
group param                13.59
total param                13.59
total induced              27.38
accuracy gap                 1.6

Figure 1: Parameter reduction on ImageNet using BNetC. (Left) Comparison of the number of neurons per layer of the original network with that obtained using our approach. 
(Right) Percentage of zeroed-out neurons and parameters, and accuracy gap between our network and the original one. Note that we outperform the original network while requiring much fewer parameters.

Dec8, we evaluated two additional versions that, instead of the 512 neurons per layer of the original architecture, have M = 640 and M = 768 neurons per layer, respectively. Finally, in the case of M = 640, we further evaluated both the group sparsity regularizer of Eq. 2 and the sparse group Lasso (SGL) regularizer of Eq. 3 with α = 0.5. Table 1 compares the top-1 accuracy of our approach with that of the original architectures and of other baselines. Note that, with the exception of Dec8-768, all our methods yield an improvement over the original network, with up to 1.6% difference for BNetC and 2.45% for Dec8-640. As an additional baseline, we also evaluated the naive approach consisting of reducing each layer in the model by a constant factor of 25%. The corresponding two instances, Dec8-25% and Dec8-640-25%, yield 64.5% and 65.8% accuracy, respectively.

More importantly, in Figure 1 and Figure 2, we report the relative saving obtained with our approach in terms of percentage of zeroed-out neurons/parameters for BNetC and Dec8, respectively. For BNetC, in Figure 1, our approach reduces the number of neurons by over 12%, while improving its generalization ability, as indicated by the accuracy gap in the bottom row of the table. As can be seen from the bar-plot, the reduction in the number of neurons is spread all over the layers, with the largest difference in the last layer. 
As a direct consequence, the number of neurons in the subsequent fully\nconnected layer is signi\ufb01cantly reduced, leading to 27% reduction in the total number of parameters.\nFor Dec8, in Figure 2, we can see that, when considering the original architecture with 512 neurons\nper layer, our approach only yields a small reduction in parameter numbers with minimal gain in\nperformance. However, when we increase the initial number of neurons in each layer, the bene\ufb01ts\nof our approach become more signi\ufb01cant. For M = 640, when using the group sparsity regularizer,\nwe see a reduction of the number of parameters of more than 19%, with improved generalization\nability. The reduction is even larger, 23%, when using the sparse group Lasso regularizer. In the case\nof M = 768, we managed to remove 26% of the neurons, which translates to 48% of the parameters.\nWhile, here, the accuracy is slightly lower than that of the initial network, it is in fact higher than that\nof the original Dec8 network, as can be seen in Table 1.\nInterestingly, during learning, we also noticed a signi\ufb01cant reduction in the training-validation\naccuracy gap when applying our regularization technique. For instance, for Dec8-768, which zeroes\nout 48.2% of the parameters, we found the training-validation gap to be 28.5% smaller than in the\noriginal network (from 14% to 10%). We believe that this indicates that networks trained using our\napproach have better generalization ability, even if they have fewer parameters. A similar phenomenon\nwas also observed for the other architectures used in our experiments.\nWe now analyze the sensitivity of our method with respect to \u03bbl (see Eq. (2)). 
To this end, we considered Dec8-768GS and varied the value of the parameter in the range λ_l ∈ [0.051, 0.51]. More specifically, we considered 20 different pairs of values, (λ_1, λ_2), with the former applied to the first three layers and the latter to the remaining ones. The details of this experiment are reported in the supplementary material. Altogether, we only observed small variations in validation accuracy (std of 0.33%) and in the number of zeroed-out neurons (std of 1.1%).

Dec8 on ImageNet (in %)

                Dec8 (GS)   Dec8-640 (GS)   Dec8-640 (SGL)   Dec8-768 (GS)
neurons              3.39           10.08            12.42           26.83
group param          2.46           12.48            13.69           31.53
total param          2.46           12.48            22.72           31.63
total induced        2.82           19.11            23.33           48.28
accuracy gap         0.01            2.45             0.94           -0.02

Figure 2: Parameter reduction using Dec8 on ImageNet. Note that we significantly reduce the number of parameters and, in almost all cases, improve recognition accuracy over the original network.

Dec3 on ICDAR (in %)

                  SGL      GS
neurons         38.64   55.11
group param     32.57   66.48
total param     72.41   66.48
total induced   72.08   80.45
accuracy gap     1.24    1.38

Top-1 acc. on ICDAR

MaxOutDec (a)        91.3%
MaxOut (b)           89.8%
MaxPool2Dneurons     83.8%
Dec3 (baseline)      89.3%
Ours-Dec3-SGL        89.9%
Ours-Dec3-GS         90.1%

(a) Results from Jaderberg et al. [2014a] using a MaxOut layer instead of Max-Pooling and decompositions as a post-processing step.
(b) Results from Jaderberg et al. [2014a].

Figure 3: Experimental results on ICDAR using Dec3. Note that our approach reduces the number of parameters by 72% while improving the accuracy of the original network.

ICDAR: Finally, we evaluate our approach on a smaller dataset where architectures have not yet been heavily tuned. For this dataset, we used the Dec3 architecture, where the last two layers initially contain 512 neurons. Our goal here is to obtain an optimal architecture for this dataset. 
Figure 3\nsummarizes our results using GS and SGL regularization and compares them to state-of-the-art\nbaselines. From the comparison between MaxPool2Dneurons and Dec3, we can see that learning 1D\n\ufb01lters leads to better performance than an equivalent network with 2D kernels. More importantly, our\nalgorithm reduces by up to 80% the number of parameters, while further improving the performance\nof the original network. We believe that these results evidence that our algorithm effectively performs\nautomatic model selection for a given (classi\ufb01cation) task.\n\n4.3 Bene\ufb01ts at test time\n\nWe now discuss the bene\ufb01ts of our algorithm at test time. For simplicity, our implementation does not\nremove neurons during training. However, these neurons can be effectively removed after training,\nthus yielding a smaller network to deploy at test time. Not only does this entail bene\ufb01ts in terms of\nmemory requirement, as illustrated above when looking at the reduction in number of parameters,\nbut it also leads to speedups compared to the complete network. To demonstrate this, in Table 2,\nwe report the relative runtime speedups obtained by removing the zeroed-out neurons. For BNet\nand Dec8, these speedups were obtained using ImageNet, while Dec3 was tested on ICDAR. Note\nthat signi\ufb01cant speedups can be achieved, depending on the architecture. For instance, using BNetC,\nwe achieve a speedup of up to 13% on ImageNet, while with Dec3 on ICDAR the speedup reaches\nalmost 50%. The right-hand side of Table 2 shows the relative memory saving of our networks. These\nnumbers were computed from the actual memory requirements in MB of the networks. In terms of\nparameters, for ImageNet, Dec8-768 yields a 46% reduction, while Dec3 increases this saving to\nmore than 80%. When looking at the actual features computed in each layer of the network, we reach\na 10% memory saving for Dec8-768 and a 25% saving for Dec3. 
We believe that these numbers clearly evidence the benefits of our approach in terms of speed and memory footprint at test time. Note also that, once the models are trained, additional parameters can be pruned by applying, at the level of individual parameters, ℓ1 regularization and a threshold [Liu et al., 2015]. On ImageNet, with our Dec8-768GS model and the ℓ1 weight set to 0.0001 as in [Liu et al., 2015], this method yields 1.34M zero-valued parameters, compared to 7.74M for our approach, i.e., an 82% relative reduction in the number of individual parameters for our approach.

Table 2: Gain in runtime (actual clock time) and memory requirement of our reduced networks. Note that, for some configurations, our final networks achieve a speedup of close to 50%. Similarly, we achieve memory savings of up to 82% in terms of parameters, and up to 25% in terms of the features computed by the network. The runtimes were obtained using a single Tesla K20m, and memory was estimated using RGB images of size 224 × 224 for Ours-BNetC, Ours-Dec8-640GS and Ours-Dec8-768GS, and gray-level images of size 32 × 32 for Ours-Dec3-GS.

                        Relative speed-up (%)            Relative memory savings,
                        Batch size                       batch size 1 (%)
    Model               1        2        8        16    Params    Features
    Ours-BNetC-GS       10.04    8.97     13.01    13.69 12.06     18.54
    Ours-Dec8-640GS     -0.1     5.44     3.91     4.37  26.51     2.13
    Ours-Dec8-768GS     15.29    17.11    15.99    15.62 46.73     10.00
    Ours-Dec3-GS        35.62    43.07    44.40    49.63 82.35     25.00

5 Conclusions

We have introduced an approach to automatically determining the number of neurons in each layer of a deep network. To this end, we have proposed to rely on a group sparsity regularizer, which has allowed us to jointly learn the number of neurons and the parameter values in a single, coherent framework.
Not only does our approach estimate the number of neurons, it also yields a more compact architecture than the initial overcomplete network, thus saving both memory and computation at test time. Our experiments have demonstrated the benefits of our method, as well as its generalizability to different architectures. One current limitation of our approach is that the number of layers in the network remains fixed. To address this, in the future, we intend to study architectures where each layer can potentially be bypassed entirely, thus ultimately canceling out its influence. Furthermore, we plan to evaluate the behavior of our approach on other types of problems, such as regression networks and autoencoders.

Acknowledgments

The authors thank John Taylor and Tim Ho for helpful discussions and their continuous support through using the CSIRO high-performance computing facilities. The authors also thank NVIDIA for generous hardware donations.

References

J. M. Alvarez and L. Petersson. DecomposeMe: Simplifying ConvNets for end-to-end learning. CoRR, abs/1606.05426, 2016.

T. Ash. Dynamic node creation in backpropagation networks. Connection Science, 1(4):365-375, 1989.

P. L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.

M. G. Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural Networks, 3(6):864-875, Nov 1992.

Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S.-F. Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2015.

M. D. Collins and P. Kohli. Memory bounded deep convolutional networks. CoRR, abs/1412.1442, 2014.

R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning.
In BigLearn, NIPS Workshop, 2011.

M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predicting parameters in deep learning. CoRR, abs/1306.0543, 2013.

E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.

Y. Gong, L. Liu, M. Yang, and L. D. Bourdev. Compressing deep convolutional networks using vector quantization. CoRR, abs/1412.6115, 2014.

I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.

B. Hassibi, D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. In ICNN, 1993.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv, 2014.

M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014a.

M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014b.

C. Ji, R. R. Snapp, and D. Psaltis. Generalizing smoothness constraints from discrete samples. Neural Computation, 2(2):188-197, June 1990.

A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In NIPS, 1992.

Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In NIPS, 1990.

B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Penksy. Sparse convolutional neural networks. In CVPR, 2015.

G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.

M. Mozer and P. Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In NIPS, 1988.

K. Murray and D. Chiang.
Auto-sizing neural networks: With applications to n-gram language models. CoRR, abs/1508.05051, 2015.

N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, January 2014.

R. Reed. Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):740-747, Sep 1993.

A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.

S. Tu, Y. Xue, X. Zhang, X. Huang, and J. Wang. Learning block group sparse representation combined with convolutional neural networks for RGB-D object recognition. Journal of Fiber Bioengineering and Informatics, 7(4):603, 2014.

N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.

S. Theodoridis. Machine Learning: A Bayesian and Optimization Perspective. Elsevier, 2015.

A. S. Weigend, D. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In NIPS, 1991.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49-67, 2007.

B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for deep scene understanding. 2015.

H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact CNNs.
In ECCV, 2016.