{"title": "Learning Infinite RBMs with Frank-Wolfe", "book": "Advances in Neural Information Processing Systems", "page_first": 3063, "page_last": 3071, "abstract": "In this work, we propose an infinite restricted Boltzmann machine (RBM), whose maximum likelihood estimation (MLE) corresponds to a constrained convex optimization.  We consider the Frank-Wolfe algorithm to solve the program, which provides a sparse solution that can be interpreted as inserting a hidden unit at each iteration, so that the optimization process takes the form of a sequence of finite models of increasing complexity.  As a side benefit, this can be used to easily and efficiently identify an appropriate number of hidden units during the optimization. The resulting model can also be used as an initialization for typical state-of-the-art RBM training algorithms such as contrastive divergence, leading to models with consistently higher test likelihood than random initialization.", "full_text": "Learning Infinite RBMs with Frank-Wolfe\n\nWei Ping\u2217, Qiang Liu\u2020, Alexander Ihler\u2217\n\n\u2217Computer Science, UC Irvine\n{wping,ihler}@ics.uci.edu\n\n\u2020Computer Science, Dartmouth College\nqliu@cs.dartmouth.edu\n\nAbstract\n\nIn this work, we propose an infinite restricted Boltzmann machine (RBM), whose maximum likelihood estimation (MLE) corresponds to a constrained convex optimization. We consider the Frank-Wolfe algorithm to solve the program, which provides a sparse solution that can be interpreted as inserting a hidden unit at each iteration, so that the optimization process takes the form of a sequence of finite models of increasing complexity. 
As a side benefit, this can be used to easily and efficiently identify an appropriate number of hidden units during the optimization. The resulting model can also be used as an initialization for typical state-of-the-art RBM training algorithms such as contrastive divergence, leading to models with consistently higher test likelihood than random initialization.\n\n1 Introduction\n\nRestricted Boltzmann machines (RBMs) are two-layer latent variable models that use a layer of hidden units h to model the distribution of visible units v [Smolensky, 1986, Hinton, 2002]. RBMs have been widely used to capture complex distributions in numerous application domains, including image modeling [Krizhevsky et al., 2010], human motion capture [Taylor et al., 2006] and collaborative filtering [Salakhutdinov et al., 2007], and are also widely used as building blocks for deep generative models, such as deep belief networks [Hinton et al., 2006] and deep Boltzmann machines [Salakhutdinov and Hinton, 2009]. Due to the intractability of the likelihood function, RBMs are usually learned using the contrastive divergence (CD) algorithm [Hinton, 2002, Tieleman, 2008], which approximates the gradient of the likelihood using a Gibbs sampler.\n\nOne practical problem when using an RBM is that we need to decide the size of the hidden layer (number of hidden units) before performing learning, and it can be challenging to decide what the optimal size is. One simple heuristic is to search for the \u201cbest\u201d number of hidden units using cross validation or test likelihood within a pre-defined candidate set. 
Unfortunately, this is extremely time consuming; it involves running a full training algorithm (e.g., CD) for each possible size, and thus we can only search over a relatively small set of sizes using this approach.\n\nIn addition, because the log-likelihood of the RBM is highly non-convex, its performance is sensitive to the initialization of the learning algorithm. Although random initializations (to relatively small values) are routinely used in practice with algorithms like CD, it would be valuable to explore more robust algorithms that are less sensitive to the initialization, as well as smarter initialization strategies to obtain better results.\n\nIn this work, we propose a fast, greedy algorithm for training RBMs by inserting one hidden unit at each iteration. Our algorithm provides an efficient way to determine the size of the hidden layer in an adaptive fashion, and can also be used as an initialization for a full CD-like learning algorithm. Our method is based on constructing a convex relaxation of the RBM that is parameterized by a distribution over the weights of the hidden units, for which the training problem can be framed as a convex functional optimization and solved using an efficient Frank-Wolfe algorithm [Frank and Wolfe, 1956, Jaggi, 2013] that effectively adds one hidden unit at each iteration by solving a relatively fast inner loop optimization.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nRelated Work Our contributions connect to a number of different themes of existing work within machine learning and optimization. Here we give a brief discussion of prior related work.\n\nThere have been a number of works on convex relaxations of latent variable models in functional space, which are related to the gradient boosting method [Friedman, 2001]. In supervised learning, Bengio et al. 
[2005] propose a convex neural network in which the number of hidden units is unbounded and can be learned, and Bach [2014] analyzes the appealing theoretical properties of such a model. For clustering problems, several works on convex functional relaxation have also been proposed [e.g., Nowozin and Bakir, 2008, Bradley and Bagnell, 2009]. Other forms of convex relaxation have also been developed for two-layer latent variable models [e.g., Aslan et al., 2013].\n\nThere has also been considerable work on extending directed/hierarchical models into \u201cinfinite\u201d models such that the dimensionality of the latent space can be automatically inferred during learning. Most of these methods are Bayesian nonparametric models, and a brief overview can be found in Orbanz and Teh [2011]. A few directions have been explored for undirected models, particularly RBMs. Welling et al. [2002] propose a boosting algorithm in the feature space of the model; a new feature is added into the RBM at each boosting iteration, instead of a new hidden unit. Nair and Hinton [2010] conceptually tie the weights of an infinite number of binary hidden units, and connect these sigmoid units with noisy rectified linear units (ReLUs). Recently, C\u00f4t\u00e9 and Larochelle [2015] extend an ordered RBM model with an infinite number of hidden units, and Nalisnick and Ravi [2015] use the same technique for word embedding. The ordered RBM is sensitive to the ordering of its hidden units and can be viewed as a mixture of RBMs. In contrast, our model incorporates regular RBMs as a special case, and enables model selection for standard RBMs.\n\nThe Frank-Wolfe method [Frank and Wolfe, 1956] (a.k.a. conditional gradient) is a classical algorithm for solving constrained convex optimization problems. 
It has recently received much attention because it unifies a large variety of sparse greedy methods [Jaggi, 2013], including boosting algorithms [e.g., Beygelzimer et al., 2015], learning with dual structured SVM [Lacoste-Julien et al., 2013] and marginal inference using MAP in graphical models [e.g., Belanger et al., 2013, Krishnan et al., 2015].\n\nVerbeek et al. [2003] proposed a greedy learning algorithm for Gaussian mixture models, which inserts a new component at each step and resembles our algorithm in its procedure. As one benefit, it provides a better initialization for EM than random initialization. Likas et al. [2003] investigate greedy initialization in k-means clustering.\n\n2 Background\n\nA restricted Boltzmann machine (RBM) is an undirected graphical model that defines a joint distribution over the vectors of visible units v \u2208 {0, 1}^{|v|\u00d71} and hidden units h \u2208 {0, 1}^{|h|\u00d71},\n\np(v, h \\mid \\theta) = \\frac{1}{Z(\\theta)} \\exp\\left(v^\\top W h + b^\\top v\\right), \\qquad Z(\\theta) = \\sum_{v} \\sum_{h} \\exp\\left(v^\\top W h + b^\\top v\\right), \\qquad (1)\n\nwhere |v| and |h| are the dimensions of v and h respectively, and \u03b8 := {W, b} are the model parameters including the pairwise interaction term W \u2208 R^{|v|\u00d7|h|} and the bias term b \u2208 R^{|v|\u00d71} for the visible units. Here we drop the bias term for the hidden units h, since it can be achieved by introducing a dummy visible unit whose value is always one. 
The partition function Z(\u03b8) serves to normalize the probability to sum to one, and is typically intractable to calculate exactly.\n\nBecause RBMs have a bipartite structure, the conditional distributions p(v|h; \u03b8) and p(h|v; \u03b8) are fully factorized and can be calculated in closed form,\n\np(h \\mid v, \\theta) = \\prod_{i=1}^{|h|} p(h_i \\mid v), \\quad \\text{with } p(h_i = 1 \\mid v) = \\sigma\\left(v^\\top W_{\\bullet i}\\right),\np(v \\mid h, \\theta) = \\prod_{j=1}^{|v|} p(v_j \\mid h), \\quad \\text{with } p(v_j = 1 \\mid h) = \\sigma\\left(W_{j \\bullet} h + b_j\\right), \\qquad (2)\n\nwhere \u03c3(u) = 1/(1 + exp(\u2212u)) is the logistic function, and W_{\u2022i} and W_{j\u2022} are the i-th column and j-th row of W respectively. Eq. (2) allows us to derive an efficient blocked Gibbs sampler that iteratively alternates between drawing v and h.\n\nThe marginal log-likelihood of the RBM is\n\n\\log p(v \\mid \\theta) = \\sum_{i=1}^{|h|} \\log\\left(1 + \\exp(w_i^\\top v)\\right) + b^\\top v - \\log Z(\\theta), \\qquad (3)\n\nwhere w_i := W_{\u2022i} is the i-th column of W and corresponds to the weights connected to the i-th hidden unit. Because we assume each hidden unit h_i takes values in {0, 1}, we get the softplus function log(1 + exp(w_i^\u22a4 v)) when we marginalize h_i. This form shows that the (marginal) free energy of the RBM is a sum of a linear term b^\u22a4 v and a set of softplus functions with different weights w_i; this provides a foundation for our development.\n\nGiven a dataset {v_n}_{n=1}^N, the gradient of the log-likelihood for each data point v_n is\n\n\\frac{\\partial \\log p(v_n \\mid \\theta)}{\\partial W} = E_{p(h \\mid v_n; \\theta)}\\left[v_n h^\\top\\right] - E_{p(v, h \\mid \\theta)}\\left[v h^\\top\\right] = v_n (\\mu_n)^\\top - E_{p(v, h \\mid \\theta)}\\left[v h^\\top\\right], \\qquad (4)\n\nwhere \u03bc_n = \u03c3(W^\u22a4 v_n) and the logistic function \u03c3 is applied in an element-wise manner. The positive part of the gradient can be calculated exactly, since the conditional distribution p(h|v_n) is fully factorized. 
The negative part arises from the derivatives of the log-partition function and is intractable. Stochastic optimization algorithms, such as CD [Hinton, 2002] and persistent CD [Tieleman, 2008], are popular methods to approximate the intractable expectation using Gibbs sampling.\n\n3 RBM with Infinite Hidden Units\n\nIn this section, we first generalize the RBM model defined in Eq. (3) to a model with an infinite number of hidden units, which can also be viewed as a convex relaxation of the RBM in functional space. Then, we describe the learning algorithm.\n\n3.1 Model Definition\n\nOur general model is motivated by Eq. (3), in which the first term can be treated as an empirical average (scaled by |h|) of the softplus function log(1 + exp(w^\u22a4 v)) under an empirical distribution over the weights {w_i}. To extend this, we define a general distribution q(w) over the weight w, and replace the empirical averaging with the expectation under q(w); this gives the following generalization of an RBM with an infinite (possibly uncountable) number of hidden units,\n\n\\log p(v \\mid q, \\vartheta) = \\alpha E_{q(w)}\\left[\\log(1 + \\exp(w^\\top v))\\right] + b^\\top v - \\log Z(q, \\vartheta),\nZ(q, \\vartheta) = \\sum_{v} \\exp\\left(\\alpha E_{q(w)}\\left[\\log(1 + \\exp(w^\\top v))\\right] + b^\\top v\\right), \\qquad (5)\n\nwhere \u03d1 := {b, \u03b1} and \u03b1 > 0 is a temperature parameter which controls the \u201ceffective number\u201d of hidden units in the model, and E_{q(w)}[f(w)] := \u222b_w q(w) f(w) dw. Note that q(w) is assumed to be properly normalized, i.e., \u222b_w q(w) dw = 1. 
Intuitively, (5) defines a semi-parametric model whose log probability is a sum of a linear bias term parameterized by b, and a nonlinear term parameterized by the weight distribution q(w) and by \u03b1, which controls the magnitude of the nonlinear term. This model can be regarded as a convex relaxation of the regular RBM, as shown in the following result.\n\nProposition 3.1. The model in Eq. (5) includes the standard RBM (3) as a special case by constraining q(w) = \\frac{1}{|h|} \\sum_{i=1}^{|h|} I(w = w_i) and \u03b1 = |h|. Moreover, the log-likelihood of the model is concave w.r.t. the function q(w), \u03b1 and b respectively, and is jointly concave in q(w) and b.\n\nWe should point out that the parameter \u03b1 plays a special role in this model: we reduce to the standard RBM only when \u03b1 equals the number |h| of particles in q(w) = \\frac{1}{|h|} \\sum_{i=1}^{|h|} I(w = w_i), and would otherwise get a fractional RBM. The fractional RBM leads to a more challenging inference problem than a standard RBM, since the standard Gibbs sampler is no longer directly applicable. We discuss this point further in Section 3.3.\n\nGiven a dataset {v_n}_{n=1}^N, we learn the parameters q and \u03d1 using a penalized maximum likelihood estimator (MLE) that involves a convex functional optimization:\n\n\\arg\\max_{q \\in M, \\vartheta} \\left\\{ L(q, \\vartheta) \\equiv \\frac{1}{N} \\sum_{n=1}^{N} \\log p(v_n \\mid q, \\vartheta) - \\frac{\\lambda}{2} E_{q(w)}[\\|w\\|^2] \\right\\}, \\qquad (6)\n\nwhere M is the set of valid distributions and we introduce a functional L2 norm regularization E_{q(w)}[||w||^2] to penalize the likelihood for large values of w. Alternatively, we could equivalently optimize the likelihood on M_C = {q | q(w) \u2265 0 and \u222b_{||w||_2 \u2264 C} q(w) dw = 1}, which restricts the probability mass to a 2-norm ball ||w||_2 \u2264 C.\n\n3.2 Learning Infinite RBMs with Frank-Wolfe\n\nIt is challenging to directly solve the optimization in Eq. 
(6) by standard gradient descent methods, because it involves optimizing the function q(w) with infinite dimensions. Instead, we propose to solve it using the Frank-Wolfe algorithm [Jaggi, 2013], which is projection-free and provides a sparse solution.\n\nAssume we already have q_t at iteration t; then Frank-Wolfe finds q_{t+1} by maximizing the linearization of the objective function:\n\nq_{t+1} \\leftarrow (1 - \\beta_{t+1}) q_t + \\beta_{t+1} r_{t+1}, \\quad \\text{where } r_{t+1} \\leftarrow \\arg\\max_{q \\in M} \\langle q, \\nabla_q L(q_t, \\vartheta_t) \\rangle, \\qquad (7)\n\nwhere \u03b2_{t+1} \u2208 [0, 1] is a step size parameter, and the convex combination step guarantees the new q_{t+1} remains a distribution after the update. A typical step size is \u03b2_t = 1/t, in which case we have q_t(w) = \\frac{1}{t} \\sum_{s=1}^{t} r_s(w), that is, q_t equals the average of all the earlier solutions obtained by the linear program.\n\nTo apply Frank-Wolfe to solve our problem, we need to calculate the functional gradient \u2207_q L(q_t, \u03d1_t) in Eq. (7). We can show that (see details in Appendix),\n\n\\nabla_q L(q_t, \\vartheta_t) = -\\frac{\\lambda}{2} \\|w\\|^2 + \\alpha_t \\left[ \\frac{1}{N} \\sum_{n=1}^{N} \\log(1 + \\exp(w^\\top v_n)) - \\sum_{v} p(v \\mid q_t, \\vartheta_t) \\log(1 + \\exp(w^\\top v)) \\right],\n\nwhere p(v | q_t, \u03d1_t) is the distribution parametrized by the weight density q_t(w) and parameter \u03d1_t,\n\np(v \\mid q_t, \\vartheta_t) = \\frac{\\exp\\left(\\alpha_t E_{q_t(w)}[\\log(1 + \\exp(w^\\top v))] + b_t^\\top v\\right)}{Z(q_t, \\vartheta_t)}. \\qquad (8)\n\nThe (functional) linear program in Eq. 
(7) is equivalent to an optimization over the weight vector w:\n\n\\max_{q \\in M} \\langle q, \\nabla_q L(q_t, \\vartheta_t) \\rangle = \\max_{q \\in M} E_{q(w)}[\\nabla_q L(q_t, \\vartheta_t)] = -\\min_{w} \\left\\{ \\frac{\\lambda}{2} \\|w\\|^2 + \\sum_{v} p(v \\mid q_t, \\vartheta_t) \\log(1 + \\exp(w^\\top v)) - \\frac{1}{N} \\sum_{n=1}^{N} \\log(1 + \\exp(w^\\top v_n)) \\right\\}. \\qquad (9)\n\nThe gradient of the objective (9) is\n\n\\nabla_w = \\lambda w + E_{p(v \\mid q_t, \\vartheta_t)}\\left[\\sigma(w^\\top v) \\cdot v\\right] - \\frac{1}{N} \\sum_{n=1}^{N} \\sigma(w^\\top v_n) \\cdot v_n,\n\nwhere the expectation over p(v | q_t, \u03d1_t) can be intractable to calculate, and one may use stochastic optimization and draw samples using MCMC. Note that the last two terms in the gradient enforce an intuitive moment matching condition: the optimal w introduces a set of \u201cimportance weights\u201d \u03c3(w^\u22a4 v) that adjust the empirical data and the previous model, such that their moments match with each other.\n\nNow, suppose w\u2217_t is the optimum of Eq. (9) at iteration t; the item r_t(w) we added can be shown to be the delta over w\u2217_t, that is, r_t(w) = I(w = w\u2217_t); in addition, we have q_t(w) = \\frac{1}{t} \\sum_{s=1}^{t} I(w = w\u2217_s) when the step size is taken to be \u03b2_t = 1/t. Therefore, this Frank-Wolfe update can be naturally interpreted as greedily inserting a hidden unit into the current model p(v | q_t, \u03d1_t). In particular, if we update the temperature parameter as \u03b1_t \u2190 t, according to Proposition 3.1, we can directly transform our model p(v | q_t, \u03d1_t) to a regular RBM after each Frank-Wolfe step, which enables the convenient blocked Gibbs sampling for inference.\n\nAlgorithm 1 Frank-Wolfe Learning Algorithm\nInput: training data {v_n}_{n=1}^N; step size \u03b7; regularization \u03bb.\nOutput: sparse solution q\u2217(w), and \u03d1\u2217\nInitialize q_0(w) = I(w = w\u2032) at random w\u2032; b_0 = 0; \u03b1_0 = 1;\nfor t = 1 : T [or, stopping criterion] do\n    Draw sample {v_s}_{s=1}^S from p(v | q_{t\u22121}, \u03d1_{t\u22121});\n    w\u2217_t = argmin_w { (\u03bb/2) ||w||^2 + (1/S) \u2211_{s=1}^S log(1 + exp(w^\u22a4 v_s)) \u2212 (1/N) \u2211_{n=1}^N log(1 + exp(w^\u22a4 v_n)) };\n    Update q_t(w) \u2190 (1 \u2212 1/t) \u00b7 q_{t\u22121}(w) + (1/t) \u00b7 I(w = w\u2217_t);\n    Update \u03b1_t \u2190 t (optional: gradient ascent);\n    Set b_t = b_{t\u22121};\n    repeat\n        Draw a mini-batch of samples {v_m}_{m=1}^M from p(v | q_t, \u03d1_t)\n        Update b_t \u2190 b_t + \u03b7 \u00b7 ( (1/N) \u2211_{n=1}^N v_n \u2212 (1/M) \u2211_{m=1}^M v_m )\n    until\nend for\nReturn q\u2217(w) = q_t(w); \u03d1\u2217 = {b_t, \u03b1_t};\n\nCompared with the (regularized) MLE of the standard RBM (e.g. in Eq. (4)), the optimization in Eq. 
(9) has the following nice properties: (1) The current model p(v | q_t, \u03d1_t) does not depend on w, which means we can draw enough samples from p(v | q_t, \u03d1_t) at each iteration t, and reuse them during the optimization of w. (2) The objective function in Eq. (9) can be evaluated explicitly given a set of samples, and hence efficient off-the-shelf optimization tools such as L-BFGS can be used to solve the optimization very efficiently. (3) Each iteration of our method involves far fewer parameters (only the weights for a single hidden unit, which is |v| \u00d7 1, are updated instead of the full |v| \u00d7 |h| weight matrix), and hence defines a series of easier problems that can be less sensitive to initialization. 
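Properties (1) and (2) can be made concrete with a small sketch. It evaluates a sample-based version of the objective in Eq. (9) and its gradient, reusing one fixed set of model samples for every evaluation. This is only an illustration under stated assumptions: the function names (softplus, objective_and_grad, fit_new_unit) are hypothetical, the data are plain Python lists of 0/1 values, and plain gradient descent with a hand-picked step size stands in for the L-BFGS solver mentioned in the text.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))


def objective_and_grad(w, model_samples, data, lam):
    # Sample-based Eq. (9):
    #   lam/2 ||w||^2 + mean_s softplus(w.v_s) - mean_n softplus(w.v_n)
    # together with its gradient
    #   lam*w + mean_s sigmoid(w.v_s) v_s - mean_n sigmoid(w.v_n) v_n.
    value = 0.5 * lam * dot(w, w)
    grad = [lam * wi for wi in w]
    for batch, sign in ((model_samples, 1.0), (data, -1.0)):
        scale = sign / len(batch)
        for v in batch:
            a = dot(w, v)
            value += scale * softplus(a)
            s = scale * sigmoid(a)
            for j, vj in enumerate(v):
                grad[j] += s * vj
    return value, grad


def fit_new_unit(model_samples, data, lam=0.1, step=0.5, iters=200):
    # Assumption: fixed-step gradient descent replaces L-BFGS for brevity.
    # The model samples are drawn once by the caller and reused in every step.
    w = [0.0] * len(data[0])
    for _ in range(iters):
        _, grad = objective_and_grad(w, model_samples, data, lam)
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w
```

In a full implementation the returned vector would play the role of the new hidden unit inserted at iteration t, and model_samples would come from a Gibbs sampler run on the current RBM; both of those steps are outside this sketch.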
We note that a similar greedy learning strategy has been successfully applied for learning mixture models [Verbeek et al., 2003], in which one greedily inserts a component at each step, and that this approach can provide better initialization for EM optimization than using multiple random initializations.\n\nOnce we obtain q_{t+1}, we can update the bias parameter b_t by gradient ascent, using\n\n\\nabla_b L(q_{t+1}, \\vartheta_t) = \\frac{1}{N} \\sum_{n=1}^{N} v_n - \\sum_{v} p(v \\mid q_{t+1}, \\vartheta_t) v. \\qquad (10)\n\nOne can further optimize \u03b1_t by gradient ascent,\u00b9 but we find simply updating \u03b1_t \u2190 t is more efficient and works well in practice. We summarize our Frank-Wolfe learning algorithm in Algorithm 1.\n\nAdding hidden units to an RBM. Besides initializing q(w) to be a delta function at some random w\u2032 and learning the model from scratch, one can also adapt Algorithm 1 to incrementally add hidden units into an existing RBM in Eq. (3) (e.g., one that has been learned by CD). According to Proposition 3.1, one can simply initialize q_t(w) = \\frac{1}{|h|} \\sum_{i=1}^{|h|} I(w = w_i), \u03b1_t = |h|, and continue the Frank-Wolfe iterations at t = |h| + 1.\n\nRemoving hidden units. Since the hidden units are added in a greedy manner, one may want to remove an old hidden unit during the Frank-Wolfe learning, provided it is bad with respect to our objective Eq. (9) after more hidden units have been added. A variant of Frank-Wolfe with away-steps [Gu\u00e9lat and Marcotte, 1986] fits this requirement and can be directly applied. As shown by Clarkson [2010], it can improve the sparsity of the final solution (i.e., fewer hidden units in the learned model).\n\n3.3 MCMC Inference for Fractional RBMs\n\nAs we point out in Section 3.1, we need to take \u03b1 equal to the number of particles in q(w) (that is, \u03b1_t \u2190 t in Algorithm 1) in order to have our model reduce to the standard RBM. 
If \u03b1 takes a more general real number, we obtain a more general fractional RBM model, for which inference is more challenging because the standard block Gibbs sampler is not directly applicable. In practice, we find that setting \u03b1_t \u2190 t to correspond to a regular RBM seems to give the best performance, but for completeness, we discuss the fractional RBM in more detail in this section, and propose a Metropolis-Hastings algorithm to draw samples from the fractional RBM. We believe that this fractional RBM framework provides an avenue for further improvements in future work.\n\n\u00b9 See the Appendix for the definition of \u2207_\u03b1 L(q_t, \u03d1_t).\n\nTo frame the problem, let us assume \u03b1 q(w) = \u2211_i c_i \u00b7 I(w = w_i), where c_i is a general real number; the corresponding model is\n\n\\log p(v \\mid q, \\vartheta) = \\sum_{i} c_i \\log(1 + \\exp(w_i^\\top v)) + b^\\top v - \\log Z(q, \\vartheta), \\qquad (11)\n\nwhich differs from the standard RBM in (3) because each softplus function is multiplied by c_i. Nevertheless, one may push the c_i into the softplus function, and obtain a standard RBM that forms an approximation of (11):\n\n\\log \\tilde{p}(v \\mid q, \\vartheta) = \\sum_{i} \\log(1 + \\exp(c_i \\cdot w_i^\\top v)) + b^\\top v - \\log \\tilde{Z}(q, \\vartheta). \\qquad (12)\n\nThis approximation can be justified by considering the special case when the magnitude of the weights w is very large, so that the softplus function essentially reduces to a ReLU function, that is, log(1 + exp(w_i^\u22a4 v)) \u2248 max(0, w_i^\u22a4 v). In this case, (11) and (12) become equivalent because c_i max(0, x) = max(0, c_i x). More concretely, we can guarantee the following bound:\n\nProposition 3.2. For any 0 < c_i \u2264 1, we have\n\n\\frac{1}{2^{1 - c_i}} \\left(1 + \\exp(c_i \\cdot w_i^\\top v)\\right) \\leq \\left(1 + \\exp(w_i^\\top v)\\right)^{c_i} \\leq 1 + \\exp(c_i \\cdot w_i^\\top v).\n\nThe proof can be found in the Appendix. 
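The bound in Proposition 3.2 is easy to probe numerically. The sketch below works in log space, writing a for the activation of unit i on a configuration v, and checks both inequalities on a grid of exponents c in (0, 1] and activations a in [-20, 20]. The helper names (softplus, bound_terms, check_bound) and the grid are illustrative assumptions; a grid check is a sanity test, not a substitute for the proof.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bound_terms(c, a):
    # Logarithms of the three quantities in Proposition 3.2:
    #   lower  = log( 2^(c-1) * (1 + exp(c*a)) )
    #   middle = log( (1 + exp(a))^c )
    #   upper  = log( 1 + exp(c*a) )
    lower = (c - 1.0) * math.log(2.0) + softplus(c * a)
    middle = c * softplus(a)
    upper = softplus(c * a)
    return lower, middle, upper


def check_bound(cs, activations, tol=1e-12):
    # True if lower <= middle <= upper (up to rounding) at every grid point.
    for c in cs:
        for a in activations:
            lower, middle, upper = bound_terms(c, a)
            if not (lower <= middle + tol and middle <= upper + tol):
                return False
    return True


cs = [0.05 * k for k in range(1, 21)]            # 0.05, 0.10, ..., 1.0
activations = [0.5 * k for k in range(-40, 41)]  # -20.0, ..., 20.0
print(check_bound(cs, activations))
```

The lower bound is tight at a = 0 and the upper bound becomes tight as the activation grows large in magnitude, which matches the ReLU argument given before the proposition.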
Note that we apply the bound when c_i > 1 by splitting c_i into the sum of its integer part and fractional remainder, and apply the bound to the fractional part.\n\nTherefore, the fractional RBM (11) can be well approximated by the standard RBM (12), and this can be leveraged to design an inference algorithm for (11). As one example, we can use the Gibbs update of (12) as a proposal for a Metropolis-Hastings update for (11). To be specific, given a configuration v, we perform a Gibbs update in the RBM p\u0303(v | q, \u03d1) to get v\u2032, and accept it with probability min(1, A(v \u2192 v\u2032)),\n\nA(v \\to v') = \\frac{p(v') \\tilde{T}(v' \\to v)}{p(v) \\tilde{T}(v \\to v')},\n\nwhere T\u0303(v \u2192 v\u2032) is the Gibbs transition of the RBM p\u0303(v | q, \u03d1). Because the acceptance probability of a Gibbs sampler equals one, we have \\frac{\\tilde{p}(v) \\tilde{T}(v \\to v')}{\\tilde{p}(v') \\tilde{T}(v' \\to v)} = 1. This gives\n\nA(v \\to v') = \\frac{p(v') \\tilde{p}(v)}{p(v) \\tilde{p}(v')} = \\frac{\\prod_i (1 + \\exp(w_i^\\top v'))^{c_i} \\cdot \\prod_i (1 + \\exp(c_i \\cdot w_i^\\top v))}{\\prod_i (1 + \\exp(w_i^\\top v))^{c_i} \\cdot \\prod_i (1 + \\exp(c_i \\cdot w_i^\\top v'))}.\n\n4 Experiments\n\nIn this section, we test the performance of our Frank-Wolfe (FW) learning algorithm on two datasets: MNIST [LeCun et al., 1998] and Caltech101 Silhouettes [Marlin et al., 2010]. The MNIST handwritten digits database contains 60,000 images in the training set and 10,000 test set images, where each image v_n includes 28 \u00d7 28 pixels and is associated with a digit label y_n. We binarize the grayscale images by thresholding the pixels at 127, and randomly select 10,000 images from training as the validation set. 
The Caltech101 Silhouettes dataset [Marlin et al., 2010] has 8,671 images with 28 \u00d7 28 binary pixels, where each image represents an object silhouette and has a class label (101 classes overall). The dataset is divided into three subsets: 4,100 examples for training, 2,264 for validation and 2,307 for testing.\n\nFigure 1: (a) MNIST; (b) Caltech101 Silhouettes. Average test log-likelihood on the two datasets as we increase the number of hidden units. We can see that FW can correctly identify an appropriate hidden layer size with high test log-likelihood (marked by the green dashed line). In addition, CD initialized by FW gives higher test likelihood than random initialization for the same number of hidden units. Best viewed in color.\n\nTraining algorithms We train RBMs with the CD-10 algorithm.\u00b2 A fixed learning rate is selected from the set {0.05, 0.02, 0.01, 0.005} using the validation set, and the mini-batch size is selected from the set {10, 20, 50, 100, 200}. We use 200 epochs for training on MNIST and 400 epochs on Caltech101. Early stopping is applied by monitoring the difference of average log-likelihood between training and validation data, so that the intractable log-partition function is cancelled [Hinton, 2010]. We train RBMs with {20, 50, 100, 200, 300, 400, 500, 600, 700} hidden units. We incrementally train an RBM model with the Frank-Wolfe (FW) algorithm (Algorithm 1). A fixed step size \u03b7 is selected from the set {0.05, 0.02, 0.01, 0.005} using the validation data, and a regularization strength \u03bb is selected from the set {1, 0.5, 0.1, 0.05, 0.01}. We set T = 700 in Algorithm 1, and use the same early stopping criterion as CD. 
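The early-stopping quantity described above can be sketched as follows: because log Z is the same constant for every data point, the gap between the average training and validation log-likelihoods can be computed from the unnormalized log-probability of Eq. (3) alone. The function names and the list-of-lists parameter layout (one weight vector per hidden unit) are assumptions made for illustration; how large a gap should trigger stopping is left to the caller, since the text does not give a threshold.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def unnormalized_log_prob(v, weights, bias):
    # Eq. (3) without the -log Z term: sum_i softplus(w_i . v) + b . v.
    # weights is a list of per-hidden-unit weight vectors w_i (an assumption).
    total = sum(b * x for b, x in zip(bias, v))
    for w in weights:
        total += softplus(sum(wj * x for wj, x in zip(w, v)))
    return total


def likelihood_gap(train, valid, weights, bias):
    # Average training log-likelihood minus average validation log-likelihood.
    # log Z appears once in each average and cancels in the difference.
    avg_train = sum(unnormalized_log_prob(v, weights, bias) for v in train) / len(train)
    avg_valid = sum(unnormalized_log_prob(v, weights, bias) for v in valid) / len(valid)
    return avg_train - avg_valid
```

A training loop could evaluate likelihood_gap after each epoch (or after each batch of inserted hidden units) and stop once the gap grows sharply.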
We randomly initialize the CD algorithm 5 times and select the best one on the validation set; meanwhile, we also initialize CD by the model learned from Frank-Wolfe.\n\nTest likelihood To evaluate the test likelihood of the learned models, we estimate the partition function using annealed importance sampling (AIS) [Salakhutdinov and Murray, 2008]. The temperature parameter is selected following the standard guidance: first 500 temperatures spaced uniformly from 0 to 0.5, then 4,000 spaced uniformly from 0.5 to 0.9, and 10,000 spaced uniformly from 0.9 to 1.0; this gives a total of 14,500 intermediate distributions. We summarize the averaged test log-likelihood of MNIST and Caltech101 Silhouettes in Figure 1, where we report the result averaged over 500 AIS runs in all experiments, with the error bars indicating three standard deviations of the estimates.\n\nWe evaluate the test likelihood of the model in FW after adding every 20 hidden units. We perform early stopping when the gap of average log-likelihood between training and validation data increases substantially. As shown in Figure 1, this procedure selects 460 hidden units on MNIST (as indicated by the green dashed lines), and 550 hidden units on Caltech101; purely for illustration purposes, we continue FW in the experiment until reaching T = 700 hidden units. We can see that the identified number of hidden units roughly corresponds to the maximum of the test log-likelihood of all three algorithms, suggesting that FW can identify the appropriate number of hidden units during the optimization.\n\nWe also use the model learned by FW as an initialization for CD (the blue lines in Figure 2), and find it consistently performs better than the best result of CD with 5 random initializations. In our implementation, the running time of the FW procedure is at most twice that of CD for the same number of hidden units. 
Therefore, FW-initialized CD provides a practical strategy for learning RBMs: it requires approximately three times the computation time of a single run of CD, while simultaneously identifying the proper number of hidden units and obtaining better test likelihood.\n\n\u00b2 CD-k refers to using a k-step Gibbs sampler to approximate the gradient of the log-partition function.\n\n[Figure 1 plots: x-axis: Number of hidden units; y-axis: Avg. test log-likelihood; curves: FW, CD (rand init.), CD (FW init.).]\n\nFigure 2: (a) MNIST; (b) Caltech101 Silhouettes. Classification error when using the learned hidden representations as features.\n\nClassification The performance of our method is further evaluated using discriminant image classification tasks. We take the hidden units\u2019 activation vectors E_{p(h|v_n)}[h] generated by the three algorithms in Figure 1 and use them as the features in a multi-class logistic regression on the class labels y_n in MNIST and Caltech101. From Figure 2, we find that our basic FW tends to be worse than the fully trained CD (best in 5 random initializations) when only small numbers of hidden units are added, but outperforms CD when more hidden units (about 450 in both cases) are added. Meanwhile, the CD initialized by FW outperforms CD using the best of 5 random initializations.\n\n5 Conclusion\n\nIn this work, we propose a convex relaxation of the restricted Boltzmann machine with an infinite number of hidden units, whose MLE corresponds to a constrained convex program in a function space. We solve the program using Frank-Wolfe, which provides a sparse greedy solution that can be interpreted as inserting a single hidden unit at each iteration. 
Our new method allows us to easily identify the appropriate number of hidden units during the progress of learning, and can provide an advanced initialization strategy for other state-of-the-art training methods such as CD to achieve higher test likelihood than random initialization.\n\nAcknowledgements\n\nThis work is sponsored in part by NSF grants IIS-1254071 and CCF-1331915. It is also funded in part by the United States Air Force under Contract No. FA8750-14-C-0011 under the DARPA PPAML program.\n\nAppendix\n\nDerivation of gradients The functional gradient of L(q, \u03d1) w.r.t. the density function q(w) is\n\n\\nabla_q L(q, \\vartheta) = -\\frac{\\lambda}{2} \\|w\\|^2 + \\alpha \\left[ \\frac{1}{N} \\sum_{n=1}^{N} \\log(1 + \\exp(w^\\top v_n)) - \\frac{\\sum_{v} \\exp\\left(\\alpha E_{q(w)}[\\log(1 + \\exp(w^\\top v))] + b^\\top v\\right) \\cdot \\log(1 + \\exp(w^\\top v))}{Z(q, \\vartheta)} \\right]\n= -\\frac{\\lambda}{2} \\|w\\|^2 + \\alpha \\left[ \\frac{1}{N} \\sum_{n=1}^{N} \\log(1 + \\exp(w^\\top v_n)) - \\sum_{v} p(v \\mid q, \\vartheta) \\log(1 + \\exp(w^\\top v)) \\right].\n\nThe gradient of L(q, \u03d1) w.r.t. the temperature parameter \u03b1 is\n\n\\nabla_\\alpha L(q, \\vartheta) = \\frac{1}{N} \\sum_{n=1}^{N} E_{q(w)}\\left[\\log(1 + \\exp(w^\\top v_n))\\right] - \\sum_{v} p(v \\mid q, \\vartheta) E_{q(w)}\\left[\\log(1 + \\exp(w^\\top v))\\right].\n\n[Figure 2 plots: x-axis: Number of hidden units; y-axis: Test error (%); curves: FW, CD (rand init.), CD (FW init.).]\n\nProof of Proposition 3.2\n\nProof. For any 0 < c \u2264 1 and nonnegative x_1, x_2, we have the following classical inequalities,\n\n\\sum_{k} x_k \\leq \\Big( \\sum_{k} x_k^c \\Big)^{1/c}, \\quad \\text{and} \\quad \\Big( \\frac{1}{2} \\sum_{k} x_k^c \\Big)^{1/c} \\leq \\frac{1}{2} \\sum_{k} x_k.\n\nLet x_1 = 1 and x_2 = exp(w_i^\u22a4 v); the proposition is a direct result of the above two inequalities.\n\nReferences\n\n\u00d6. Aslan, H. Cheng, X. Zhang, and D. Schuurmans. 
Convex two-layer modeling. In NIPS, 2013.\nF. Bach. Breaking the curse of dimensionality with convex neural networks. arXiv:1412.8690, 2014.\nD. Belanger, D. Sheldon, and A. McCallum. Marginal inference in MRFs using Frank-Wolfe. In NIPS Workshop on Greedy Optimization, Frank-Wolfe and Friends, 2013.\nY. Bengio, N. L. Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In NIPS, 2005.\nA. Beygelzimer, E. Hazan, S. Kale, and H. Luo. Online gradient boosting. In NIPS, 2015.\nD. M. Bradley and J. A. Bagnell. Convex coding. In UAI, 2009.\nK. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 2010.\nM.-A. C\u00f4t\u00e9 and H. Larochelle. An infinite restricted Boltzmann machine. Neural Computation, 2015.\nM. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.\nJ. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 2001.\nJ. Gu\u00e9lat and P. Marcotte. Some comments on Wolfe\u2019s \u2018away step\u2019. Mathematical Programming, 1986.\nG. Hinton. A practical guide to training restricted Boltzmann machines. UTML TR, 2010.\nG. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.\nG. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.\nM. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.\nR. G. Krishnan, S. Lacoste-Julien, and D. Sontag. Barrier Frank-Wolfe for marginal inference. In NIPS, 2015.\nA. Krizhevsky, G. E. Hinton, et al. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS, 2010.\nS. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2013.\nY. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.\nA. Likas, N. Vlassis, and J. J. Verbeek. The global k-means clustering algorithm. Pattern recognition, 2003.\nB. M. Marlin, K. Swersky, B. Chen, and N. D. Freitas. Inductive principles for restricted Boltzmann machine learning. In AISTATS, 2010.\nV. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.\nE. Nalisnick and S. Ravi. Infinite dimensional word embeddings. arXiv:1511.05392, 2015.\nS. Nowozin and G. Bakir. A decoupled approach to exemplar-based unsupervised learning. In ICML, 2008.\nP. Orbanz and Y. W. Teh. Bayesian nonparametric models. In Encyclopedia of Machine Learning. 2011.\nR. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.\nR. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In ICML, 2008.\nR. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, 2007.\nP. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.\nG. W. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. In NIPS, 2006.\nT. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.\nJ. J. Verbeek, N. Vlassis, and B. Kr\u00f6se. Efficient greedy learning of Gaussian mixture models. Neural Computation, 2003.\nM. Welling, R. S. Zemel, and G. E. Hinton. Self supervised boosting. In NIPS, 2002.\n", "award": [], "sourceid": 1523, "authors": [{"given_name": "Wei", "family_name": "Ping", "institution": "UC Irvine"}, {"given_name": "Qiang", "family_name": "Liu", "institution": "Dartmouth College"}, {"given_name": "Alexander", "family_name": "Ihler", "institution": "UC Irvine"}]}