{"title": "Sparse Additive Text Models with Low Rank Background", "book": "Advances in Neural Information Processing Systems", "page_first": 172, "page_last": 180, "abstract": "The sparse additive model for text modeling involves the sum-of-exp computing, with consuming costs for large scales. Moreover, the assumption of equal background across all classes/topics may be too strong. This paper extends to propose sparse additive model with low rank background (SAM-LRB), and simple yet efficient estimation. Particularly, by employing a double majorization bound, we approximate the log-likelihood into a quadratic lower-bound with the sum-of-exp terms absent. The constraints of low rank and sparsity are then simply embodied by nuclear norm and $\\ell_1$-norm regularizers. Interestingly, we find that the optimization task in this manner can be transformed into the same form as that in Robust PCA. Consequently, parameters of supervised SAM-LRB can be efficiently learned using an existing algorithm for Robust PCA based on accelerated proximal gradient. Besides the supervised case, we extend SAM-LRB to also favor unsupervised and multifaceted scenarios. Experiments on real world data demonstrate the effectiveness and efficiency of SAM-LRB, showing state-of-the-art performances.", "full_text": "Sparse Additive Text Models with Low Rank\n\nBackground\n\nLei Shi\n\nBaidu.com, Inc.\n\nP.R. China\n\nshilei06@baidu.om\n\nAbstract\n\nThe sparse additive model for text modeling involves the sum-of-exp computing,\nwhose cost is consuming for large scales. Moreover, the assumption of equal back-\nground across all classes/topics may be too strong. This paper extends to propose\nsparse additive model with low rank background (SAM-LRB) and obtains sim-\nple yet ef\ufb01cient estimation. Particularly, employing a double majorization bound,\nwe approximate log-likelihood into a quadratic lower-bound without the log-sum-\nexp terms. 
The constraints of low rank and sparsity are then simply embodied by nuclear norm and ℓ1-norm regularizers. Interestingly, we find that the optimization task of SAM-LRB can be transformed into the same form as in Robust PCA. Consequently, parameters of supervised SAM-LRB can be efficiently learned using an existing algorithm for Robust PCA based on accelerated proximal gradient. Besides the supervised case, we extend SAM-LRB to unsupervised and multifaceted scenarios. Experiments on three real-world datasets demonstrate the effectiveness and efficiency of SAM-LRB, compared with several state-of-the-art models.

1 Introduction

Generative models of text have gained large popularity in analyzing large collections of documents [3, 4, 17]. This type of model relies overwhelmingly on the Dirichlet-Multinomial conjugate pair, perhaps mainly because its formulation and estimation are straightforward and efficient. However, the ease of parameter estimation may come at a cost: unnecessarily over-complicated latent structures and a lack of robustness to limited training data. Several efforts have emerged to seek alternative formulations, for instance the correlated topic models [13, 19].

Recently in [10], the authors listed three main problems with Dirichlet-Multinomial generative models, namely inference cost, overparameterization, and lack of sparsity. Motivated by these, a Sparse Additive GEnerative model (SAGE) was proposed in [10] as an alternative choice of generative model. Its core idea is that the lexical distribution in log-space is obtained by adding sparse deviation vectors to the background distribution. Successfully applying SAGE, the work in [14] discovers geographical topics in the Twitter stream, and [25] detects communities in computational linguistics.

However, SAGE still suffers from two problems.
First, the likelihood and estimation involve sum-of-exponential computations due to the soft-max generative nature, which is time-consuming at large scale. Second, SAGE assumes one single background vector across all classes/topics, or equivalently, there is one background vector for each class/topic but all background vectors are constrained to be equal. This assumption might be too strong in some applications, e.g., when many synonyms vary their distributions across different classes/topics.

Motivated to solve the second problem, we propose to use a low rank constrained background. However, directly assigning the low rank assumption in the log-space is difficult. We instead approximate the data log-likelihood of the sparse additive model by a quadratic lower-bound based on the double majorization bound in [6], so that the costly log-sum-exponential computation, i.e., the first problem of SAGE, is avoided. We then formulate the proposed SAM-LRB model and derive its learning algorithm. The main contributions of this paper can be summarized as four-fold:

• Propose to use a low rank background to relax the equality-constrained setting in SAGE.
• Approximate the data log-likelihood of the sparse additive model by a quadratic lower-bound based on the double majorization bound in [6], so that the costly log-sum-exponential computation is avoided.
• Formulate the constrained optimization problem into Lagrangian relaxations, leading to a form exactly the same as in Robust PCA [28].
Consequently, SAM-LRB can be efficiently learned by employing the accelerated proximal gradient algorithm for Robust PCA [20].
• Extend SAM-LRB to supervised classification, unsupervised topic modeling and multifaceted modeling; conduct experimental comparisons on real data to validate SAM-LRB.

2 Supervised Sparse Additive Model with Low Rank Background

2.1 Supervised Sparse Additive Model

As in SAGE [10], the core idea of our model is that the lexical distribution in log-space comes from adding additional vectors to the background distribution. Particularly, we are given D documents over a vocabulary of M words. For each document d ∈ [1, D], let y_d ∈ [1, K] denote the class label in the current supervised scenario, c_d ∈ R_+^M denote the vector of term counts, and C_d = \sum_w c_{dw} the total term count. We assume each class k ∈ [1, K] has two vectors b_k, s_k ∈ R^M, denoting the background and additive distributions in log-space, respectively. Then the generative distribution for each word w in a document d with label y_d takes a soft-max form:

p(w|y_d) = p(w|y_d, b_{y_d}, s_{y_d}) = \frac{\exp(b_{y_d w} + s_{y_d w})}{\sum_{i=1}^M \exp(b_{y_d i} + s_{y_d i})}.    (1)

Given Θ = {B, S} with B = [b_1, ..., b_K] and S = [s_1, ..., s_K], the log-likelihood of data X is:

L = \log p(X|\Theta) = \sum_{k=1}^K \sum_{d: y_d = k} L(d, k),    L(d, k) = c_d^\top (b_k + s_k) - C_d \log \sum_{i=1}^M \exp(b_{ki} + s_{ki}).    (2)

Similarly, a testing document d is classified into class ŷ(d) according to ŷ(d) = \arg\max_k L(d, k). In SAGE [10], the authors further assumed that the background vectors across all classes are the same, i.e., b_k = b for all k, and each additive vector s_k is sparse. Although intuitive, the background equality assumption may be too strong for real applications.
For instance, to express the same or a similar meaning, different classes of documents may choose different terms from a tuple of synonyms. In this case, SAGE would tend to include these terms in the sparse additive part, instead of in the background. Taking Fig. 1 as an illustrative example, the log-space distribution (left) is the sum of the low-rank background B (middle) and the sparse S (right). Applying SAGE to this type of data, the equality-constrained background B would fail to capture the low-rank structure, and/or the additive part S would not be sparse, so that there may be risks of over-fitting or under-fitting.

Moreover, since there exist sum-of-exponential terms in Eq. (2) and thus also in its derivatives, the computing cost becomes huge when the vocabulary size M is large. As a result, although performing well in [10, 14, 25], SAGE might still suffer from problems of over-constraining and inefficiency.

Figure 1: Low rank background. Left to right illustrates the log-space distribution, the background B, and the sparse S, respectively. Rows index terms, and columns index classes.

Figure 2: Lower-bound optimization. Left to right shows the trajectories of the lower-bound, α, and ξ, respectively.

2.2 Supervised Sparse Additive Model with Low Rank Background

Motivated to avoid the inefficient computation due to sum-of-exp, we adopt the double majorization lower-bound of L [6], so that L is well approximated by a bound that is quadratic w.r.t. B and S. Further based on this lower-bound, we proceed to assume the background B across classes is low-rank, in contrast to the equality constraint in SAGE. An optimization algorithm is proposed based on proximal gradient.

2.2.1 Double Majorization Quadratic Lower Bound

In the literature, there have been several efforts on efficiently computing the sum-of-exp term involved in soft-max [5, 15, 6].
For instance, based on the convexity of the logarithm, one can obtain a bound -\log\sum_i \exp(x_i) \ge -\phi \sum_i \exp(x_i) + \log\phi + 1 for any \phi \in R_+, namely the lb-log-cvx bound. Moreover, via upper-bounding the Hessian matrix, one can obtain the following local quadratic approximation for any \xi_i \in R, shortly named lb-quad-loc:

-\log \sum_{i=1}^M \exp(x_i) \ge \frac{1}{4}\Big[\frac{1}{M}\big(\sum_i x_i - \sum_i \xi_i\big)^2 - \sum_i (x_i - \xi_i)^2\Big] - \frac{\sum_i (x_i - \xi_i)\exp(\xi_i)}{\sum_i \exp(\xi_i)} - \log\sum_i \exp(\xi_i).

In [6], Bouchard proposed the following quadratic lower-bound by double majorization (lb-quad-dm) and demonstrated its better approximation compared with the previous two:

-\log \sum_{i=1}^M \exp(x_i) \ge -\alpha - \frac{1}{2}\sum_{i=1}^M \big\{ x_i - \alpha - \xi_i + f(\xi_i)[(x_i - \alpha)^2 - \xi_i^2] + 2\log[\exp(\xi_i) + 1] \big\},    (3)

with \alpha \in R and \xi \in R_+^M being auxiliary (variational) variables, and f(\xi) = \frac{1}{2\xi} \cdot \frac{\exp(\xi)-1}{\exp(\xi)+1}. This bound is closely related to the bound proposed by Jaakkola and Jordan [15].

Employing Eq. (3), we obtain a lower-bound L_{lb} \le L to the data log-likelihood in Eq. (2):

L_{lb} = \sum_{k=1}^K \big[ -(b_k + s_k)^\top A_k (b_k + s_k) - \beta_k^\top (b_k + s_k) - \gamma_k \big],    (4)

with

A_k = \frac{1}{2}\tilde{C}_k \, \mathrm{diag}[f(\xi_k)], \quad \beta_k = \tilde{C}_k\big(\tfrac{1}{2}\mathbf{1} - \alpha_k f(\xi_k)\big) - \sum_{d: y_d=k} c_d, \quad \tilde{C}_k = \sum_{d: y_d=k} C_d,

\gamma_k = \tilde{C}_k\Big( \alpha_k - \frac{1}{2}\sum_{i=1}^M \big[ \alpha_k + \xi_{ki} - f(\xi_{ki})(\alpha_k^2 - \xi_{ki}^2) - 2\log(\exp(\xi_{ki}) + 1) \big] \Big).

For each class k, the two variational variables, \alpha_k \in R and \xi_k \in R_+^M, can be updated iteratively as below for a better approximated lower-bound.
\alpha_k = \frac{1}{\sum_{i=1}^M f(\xi_{ki})}\Big[\frac{M}{2} - 1 + \sum_{i=1}^M (b_{ki} + s_{ki}) f(\xi_{ki})\Big], \qquad \xi_k = \mathrm{abs}(b_k + s_k - \alpha_k).    (5)

Therein, abs(·) denotes the elementwise absolute value operator.

One example of the trajectories during optimizing this lower-bound is illustrated in Fig. 2. Particularly, the left panel shows the lower-bound converges quickly to the ground truth, usually within 5 rounds in our experience. The values of the three lower-bounds with randomly sampled variational variables are also sorted and plotted. One can find that lb-quad-dm approximates better or comparably well even with a random initialization. Please see [6] for more comparisons.

2.2.2 Supervised SAM-LRB Model and Optimization by Proximal Gradient

Rather than optimizing the data log-likelihood in Eq. (2) as in SAGE, we turn to optimize its lower-bound in Eq. (4), which is convenient for further assigning the low-rank constraint on B and the sparsity constraint on S. Concretely, our target is formulated as a constrained optimization task:

\max_{B,S} L_{lb}, with L_{lb} specified in Eq. (4),
s.t. B = [b_1, ..., b_K] is low rank, S = [s_1, ..., s_K] is sparse.    (6)

Concerning the two constraints, we call the above the supervised Sparse Additive Model with Low-Rank Background, or supervised SAM-LRB for short. Although both assumptions could be tackled by formulating a fully generative model, assigning appropriate priors, and delivering inference in a Bayesian manner similar to [8], we choose the constrained optimization form for not only a clearer expression but also a simpler and more efficient algorithm.

In the literature, there have been several efforts considering both low rank and sparse constraints similar to Eq. (6), most of which make use of proximal gradient [2, 7].
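For intuition, the two proximal operators such proximal-gradient methods rely on, singular value thresholding for the nuclear norm and entrywise soft thresholding for the ℓ1-norm, can be sketched as follows. This is a minimal numpy illustration of the standard operators, not the implementation used in the paper; the function names and toy data are ours:

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of tau*|X|_1: shrink every entry toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Proximal operator of tau*||X||_*: soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Toy data: a rank-1 "background" plus a few large sparse deviations.
rng = np.random.default_rng(0)
ground_truth = np.outer(rng.normal(size=50), rng.normal(size=20))
deviations = np.zeros((50, 20))
deviations[rng.integers(0, 50, 30), rng.integers(0, 20, 30)] = 5.0
observed = ground_truth + deviations

low_rank_part = svd_threshold(observed, tau=1.0)   # estimate biased toward low rank
sparse_part = soft_threshold(observed, tau=1.0)    # estimate biased toward sparsity
```

Roughly speaking, solvers such as the APG-RPCA algorithm of [20] apply these two operators alternately to residuals of the objective, with a continuation scheme that gradually decreases the thresholds.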
Papers [20, 28] studied this problem under the name of Robust Principal Component Analysis (RPCA), aiming to decouple an observed matrix into the sum of a low rank matrix and a sparse matrix. Closely related to RPCA, our scenario in Eq. (6) can be regarded as a weighted RPCA formulation, with the weights controlled by the variational variables. In [24], the authors proposed an efficient algorithm for problems that constrain a matrix to be both low rank and sparse simultaneously.

Following these existing works, we adopt the nuclear norm to implement the low rank constraint, and the ℓ1-norm for the sparsity constraint. Setting the partial derivative of L_{lb} w.r.t. \lambda_k = (b_k + s_k) to zero, the maximum of L_{lb} is achieved at \lambda_k^* = -\frac{1}{2} A_k^{-1} \beta_k. Since A_k is positive definite and diagonal, the optimal solution \lambda_k^* is well-posed and can be efficiently computed. Simultaneously considering the equality \lambda_k = (b_k + s_k), the low rank constraint on B and the sparsity constraint on S, one can rewrite Eq. (6) into the following Lagrangian form:

\min_{B,S} \frac{1}{2} \|\Lambda^* - B - S\|_F^2 + \mu(\|B\|_* + \nu |S|_1), with \Lambda^* = [\lambda_1^*, ..., \lambda_K^*],    (7)

where \|\cdot\|_F, \|\cdot\|_* and |\cdot|_1 denote the Frobenius norm, nuclear norm and ℓ1-norm, respectively. The Frobenius norm term concerns the accuracy of decoupling \Lambda^* into B and S. The Lagrange multipliers μ and ν control the strengths of the low rank constraint and the sparsity constraint, respectively.

Interestingly, Eq. (7) is exactly the same as the objective of RPCA [20, 28]. Paper [20] proposed an algorithm for RPCA based on accelerated proximal gradient (APG-RPCA), showing its advantages in efficiency and stability over (plain) proximal gradient. We choose it, i.e., Algorithm 2 in [20], for seeking solutions to Eq. (7).
The computations involved in APG-RPCA include SVD decompositions and absolute-value thresholding; interested readers are referred to [20] for more details. The augmented Lagrangian and alternating direction methods [9, 29] could be considered as alternatives.

Data: Term counts and labels {c_d, C_d, y_d}_{d=1}^D of D docs and K classes, sparse threshold ν ≈ 0.05
Result: Log-space distributions: low-rank B and sparse S
Initialization: randomly initialize parameters {B, S} and variational variables {α_k, ξ_k}_k;
while not converged do
    if optimizing variational variables then iteratively update {α_k, ξ_k}_k according to Eq. (5);
    for k = 1, ..., K do calculate A_k and β_k by Eq. (4), and λ_k^* = -(1/2) A_k^{-1} β_k;
    B, S ← APG-RPCA(Λ*, ν) by Algorithm 2 in [20], with Λ* = [λ_1^*, ..., λ_K^*];
end

Algorithm 1: Supervised SAM-LRB learning algorithm

Consequently, the supervised SAM-LRB algorithm is specified in Algorithm 1. Therein, one can choose to either fix or update the variational variables {α_k, ξ_k}_k. If they are fixed, Algorithm 1 has only one outer iteration with no need to check convergence. Compared with the supervised SAGE learning algorithm in Sec. 3 of [10], our supervised SAM-LRB algorithm not only avoids computing sums of exponentials, saving computing cost, but is also optimized simply and efficiently by proximal gradient instead of the Newton updating used in SAGE. Moreover, with a Laplacian-Exponential prior on S for sparseness, SAGE updates the conjugate posteriors and needs to employ a “warm start” technique to avoid being trapped in early stages under inappropriate initializations, whereas SAM-LRB does not have this risk.
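To make Algorithm 1 concrete, the following is a minimal, self-contained numpy sketch of one run on toy counts. It implements the variational updates of Eq. (5) and the bound coefficients of Eq. (4) (we take A_k = (1/2)·C̃_k·diag f(ξ_k), the scaling consistent with λ*_k = −(1/2)A_k⁻¹β_k), and stands in for the APG-RPCA solver of [20] with a few plain alternating proximal steps; all variable names, thresholds, and the toy data are ours:

```python
import numpy as np

def f(xi):
    # f(xi) = (1/(2*xi)) * (exp(xi)-1)/(exp(xi)+1) from Eq. (3); its limit at xi=0 is 1/4.
    xi = np.maximum(xi, 1e-8)
    return (np.exp(xi) - 1.0) / ((np.exp(xi) + 1.0) * 2.0 * xi)

rng = np.random.default_rng(1)
K, M = 4, 30                                              # classes, vocabulary size
c_tilde = rng.integers(1, 20, size=(M, K)).astype(float)  # summed term counts per class
C_tilde = c_tilde.sum(axis=0)                             # summed total counts per class

Lam = np.zeros((M, K))                                    # current estimate of B + S
alpha = np.zeros(K)

for _ in range(5):                                        # outer loop of Algorithm 1
    # variational updates, Eq. (5)
    xi = np.abs(Lam - alpha)
    F = f(xi)
    alpha = (M / 2.0 - 1.0 + (Lam * F).sum(axis=0)) / F.sum(axis=0)
    xi = np.abs(Lam - alpha)
    F = f(xi)
    # quadratic-bound coefficients of Eq. (4) and the per-class maximizer
    A_diag = 0.5 * C_tilde * F                            # diagonal of A_k (assumed 1/2 scaling)
    beta = C_tilde * (0.5 - alpha * F) - c_tilde
    Lam_star = -0.5 * beta / A_diag                       # lambda*_k = -(1/2) A_k^{-1} beta_k
    # stand-in for APG-RPCA: alternate the two proximal steps to split Lam* into B + S
    B = np.zeros_like(Lam_star)
    S = np.zeros_like(Lam_star)
    for _ in range(20):
        U, s, Vt = np.linalg.svd(Lam_star - S, full_matrices=False)
        B = U @ np.diag(np.maximum(s - 1.0, 0.0)) @ Vt
        S = np.sign(Lam_star - B) * np.maximum(np.abs(Lam_star - B) - 0.05, 0.0)
    Lam = B + S
```

Because Eq. (3) holds for any α and any ξ ≥ 0, the quadratic surrogate evaluated at the returned Λ never exceeds the true −log-sum-exp term, which gives a cheap correctness check on the implementation.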
Additionally, since the evolution from SAGE to SAM-LRB is two-fold, i.e., the low rank background assumption and the convex relaxation, we find that adopting the convex relaxation alone also helps SAGE during optimization.

3 Extensions

Analogous to [10], our SAM-LRB formulation can also be extended to the unsupervised topic modeling scenario with latent variables, and to the scenario with multifaceted class labels.

3.1 Extension 1: Unsupervised Latent Variable Model

We consider how to incorporate SAM-LRB in a latent variable model for unsupervised text modeling. Following topic models, there is one latent vector of topic proportions per document and one latent discrete variable per term. That is, each document d is endowed with a vector of topic proportions \theta_d \sim \mathrm{Dirichlet}(\rho), and each term w in this document is associated with a latent topic label z_w^{(d)} \sim \mathrm{Multinomial}(\theta_d). Then the probability distribution for w is

p(w | z_w^{(d)}, B, S) \propto \exp\big( b_{z_w^{(d)} w} + s_{z_w^{(d)} w} \big),    (8)

which only replaces the known class label y_d in Eq. (1) with the unknown topic label z_w^{(d)}. We can combine the mean field variational inference for latent Dirichlet allocation (LDA) [4] with the lower-bound treatment in Eq. (4), leading to the following unsupervised lower-bound

L_{lb} = \sum_{k=1}^K \big[ -(b_k + s_k)^\top A_k (b_k + s_k) - \beta_k^\top (b_k + s_k) - \gamma_k \big] + \sum_d \big[ \langle \log p(\theta_d|\rho) \rangle - \langle \log Q(\theta_d) \rangle \big] + \sum_d \sum_w \big[ \langle \log p(z_w^{(d)}|\theta_d) \rangle - \langle \log Q(z_w^{(d)}) \rangle \big],    (9)

where A_k and \gamma_k take the same form as in Eq. (4), and \beta_k = \tilde{C}_k(\tfrac{1}{2}\mathbf{1} - \alpha_k f(\xi_k)) - \tilde{c}_k, in which each w-th item of \tilde{c}_k is \tilde{c}_{kw} = \sum_d Q(k|d,w) \, c_{dw}, i.e.,
the expected count of term w in topic k, and \tilde{C}_k = \sum_w \tilde{c}_{kw} is the topic's expected total count over all words.

This unsupervised SAM-LRB model formulates a topic model with a low rank background and sparse deviations, which is learned via EM iterations. The E-step updating the posteriors Q(\theta_d) and Q(z_w^{(d)}) is identical to standard LDA. Once {A_k, \beta_k} are computed as above, the M-step updating {B, S} and the variational variables {\alpha_k, \xi_k}_k remains the same as in the supervised case in Algorithm 1.

3.2 Extension 2: Multifaceted Modeling

We consider how SAM-LRB can be used to combine multiple facets (multi-dimensional class labels), i.e., combining per-word latent topics and document labels and pursuing a structural view of labels and topics. In the literature, multifaceted generative models have been studied in [1, 21, 23]; they incorporate latent switching variables that determine whether each term is generated from a topic or from a document label. Topic-label interactions can also be included to capture the distributions of words at the intersections. However, in this kind of model, the number of parameters becomes very large for a large vocabulary, many topics, and many labels. In [10], SAGE needs no switching variables and shows the advantage of model sparsity in multifaceted modeling. More recently, [14] employs SAGE and discovers meaningful geographical topics in the Twitter stream.

Applying SAM-LRB to the multifaceted scenario, we still assume the multifaceted variations are composed of a low rank background and sparse deviations. Particularly, for each topic k ∈ [1, K], we have the topic background b_k^{(T)} and sparse deviation s_k^{(T)}; for each label j ∈ [1, J], the label background b_j^{(L)} and sparse deviation s_j^{(L)}; and for each topic-label interaction pair (k, j), only the sparse deviation s_{kj}^{(I)}. Again, the background distributions B^{(T)} = [b_1^{(T)}, ..., b_K^{(T)}] and B^{(L)} = [b_1^{(L)}, ..., b_J^{(L)}] are assumed to be of low rank to capture each single view's distribution similarity.

Then for a single term w, given the latent topic z_w^{(d)} and the class label y_d, its generative probability is obtained by summing the background and sparse components together:

p(w | z_w^{(d)}, y_d, \Theta) \propto \exp\big( b^{(T)}_{z_w^{(d)} w} + s^{(T)}_{z_w^{(d)} w} + b^{(L)}_{y_d w} + s^{(L)}_{y_d w} + s^{(I)}_{z_w^{(d)} y_d w} \big),    (10)

with parameters \Theta = \{B^{(T)}, S^{(T)}, B^{(L)}, S^{(L)}, S^{(I)}\}. The log-likelihood's lower-bound involves the sum over all topic-label pairs:

L_{lb} = \sum_{k=1}^K \sum_{j=1}^J \big[ -\lambda_{kj}^\top A_{kj} \lambda_{kj} - \beta_{kj}^\top \lambda_{kj} - \gamma_{kj} \big] + \sum_d \big[ \langle \log p(\theta_d|\rho) \rangle - \langle \log Q(\theta_d) \rangle \big] + \sum_d \sum_w \big[ \langle \log p(z_w^{(d)}|\theta_d) \rangle - \langle \log Q(z_w^{(d)}) \rangle \big],    (11)

with \lambda_{kj} \triangleq b_k^{(T)} + s_k^{(T)} + b_j^{(L)} + s_j^{(L)} + s_{kj}^{(I)}.

In the quadratic form, the values of A_{kj}, \beta_{kj} and \gamma_{kj} are straightforward combinations of Eq. (4) and Eq. (9), i.e., weighted by both the observed labels and the posteriors of latent topics. Details are omitted here due to the space limit. The second row remains the same as in Eq. (9) and standard LDA.

During the iterative estimation, every iteration includes the following steps:

• Estimate the posteriors Q(z_w^{(d)}) and Q(\theta_d);
• With (B^{(T)}, S^{(T)}, S^{(I)}) fixed, solve a quadratic program over \Lambda^{*(L)}, which approximates the sum of B^{(L)} and S^{(L)}. Put \Lambda^{*(L)} into Algorithm 1 to update B^{(L)} and S^{(L)};
• With (B^{(L)}, S^{(L)}, S^{(I)}) fixed, solve a quadratic program over \Lambda^{*(T)}, which approximates the sum of B^{(T)} and S^{(T)}.
Put \Lambda^{*(T)} into Algorithm 1 to update B^{(T)} and S^{(T)};
• With (B^{(T)}, S^{(T)}, B^{(L)}, S^{(L)}) fixed, update S^{(I)} by proximal gradient.

4 Experimental Results

In order to test SAM-LRB in different scenarios, this section considers experiments under three tasks, namely supervised document classification, unsupervised topic modeling, and multifaceted modeling and classification.

4.1 Document Classification

We first test our SAM-LRB model in the supervised document modeling scenario and evaluate the classification accuracy. Particularly, supervised SAM-LRB is compared with the Dirichlet-Multinomial model and SAGE. The precision of the Dirichlet prior in the Dirichlet-Multinomial model is updated by Newton optimization [22]. The nonparametric Jeffreys prior [12] is adopted in SAGE as a parameter-free sparse prior. Concerning the variational variables {α_i, ξ_i}_i in the quadratic lower-bound of SAM-LRB, both cases of fixing them and updating them are considered.

We consider the benchmark 20Newsgroups data¹, and aim to classify unlabelled newsgroup postings into 20 newsgroups. No stopword filtering is performed, and we randomly pick a vocabulary of 55,000 terms. In order to test the robustness, we vary the proportion of training data. After 5 independent runs by each algorithm, the classification accuracies on testing data are plotted in Fig. 3 as box-plots, where the lateral axis varies the training data proportion.

Figure 3: Classification accuracy on 20Newsgroups data. The proportion of training data varies in {10%, 30%, 50%}.

¹ Following [10], we use the training/testing sets from http://people.csail.mit.edu/jrennie/20Newsgroups/

One can find that SAGE outperforms the Dirichlet-Multinomial model especially in the case of limited training data, which is consistent with the observations in [10].
Moreover, with random and fixed variational variables, the SAM-LRB model performs even better or at least comparably well. If the variational variables are updated to tighten the lower-bound, the performance of SAM-LRB is substantially the best, with a 10%~20% relative improvement over SAGE. Table 1 also reports the average computing time of SAGE and SAM-LRB. We can see that, by avoiding the log-sum-exp calculation, SAM-LRB (fixed) runs more than 6 times faster than SAGE, while SAM-LRB (optimized) pays extra for updating the variational variables.

Table 1: Comparison of average time costs per iteration (in minutes).

method:              SAGE | SAM-LRB (fixed) | SAM-LRB (optimized)
time cost (minutes): 3.8  | 0.6             | 3.3

4.2 Unsupervised Topic Modeling

We now apply our unsupervised SAM-LRB model to the benchmark NIPS data². Following the same preprocessing and evaluation as in [10, 26], we have a training set of 1986 documents with 237,691 terms, and a testing set of 498 documents with 57,427 terms.

For consistency, SAM-LRB is still compared with the Dirichlet-Multinomial model (variational LDA with a symmetric Dirichlet prior) and SAGE. For all these unsupervised models, the number of latent topics is varied from 10 to 25 and then to 50. After unsupervised training, performance is evaluated by perplexity, the smaller the better. The performances of 5 independent runs by each method are illustrated in Fig. 4, again as box-plots.

Figure 4: Perplexity results on NIPS data.

As shown, SAGE performs worse than LDA when the number of topics is small, perhaps mainly due to its strong equality assumption on the background. In contrast, SAM-LRB performs better than both LDA and SAGE in most cases. One exception happens when the topic number equals 50, where SAM-LRB (fixed) performs slightly worse than SAGE, mainly caused by inappropriate fixed values of the variational variables.
If updated instead, SAM-LRB (optimized) promisingly performs the best.

4.3 Multifaceted Modeling

We then proceed to test multifaceted modeling by SAM-LRB. As in [10], we choose a publicly-available dataset of political blogs describing the 2008 U.S. presidential election³ [11]. Out of the total of 6 political blogs, three are from the right and three are from the left. There are 20,827 documents and a vocabulary size of 8284. Using four blogs for training, our task is to predict the ideological perspective of the two unlabeled blogs.

On this task, Ahmed and Xing [1] used a multiview LDA model to achieve accuracy within 65.0% ~ 69.1% depending on different topic number settings. Also, a support vector machine provides a comparable accuracy of 69%, while supervised LDA [3] performs undesirably on this task. In [10], SAGE is repeated 5 times for each of multiple topic numbers, and achieves its best median result of 69.6% at K = 30. Using SAM-LRB (optimized), the median results out of 5 runs for each topic number are shown in Table 2. Interestingly, SAM-LRB provides a similar state-of-the-art result, while achieving it at K = 20. The different preferences on topic numbers between SAGE and SAM-LRB may mainly come from their different assumptions on the background lexical distributions.

Table 2: Classification accuracy on political blogs data by SAM-LRB (optimized).

# topics (K):                         10   | 20   | 30   | 40   | 50
accuracy (%), median out of 5 runs:   67.3 | 69.8 | 69.1 | 68.3 | 68.1

² http://www.cs.nyu.edu/~roweis/data.html
³ http://sailing.cs.cmu.edu/socialmedia/blog2008.html

5 Concluding Remarks

This paper studies the sparse additive model for document modeling. By employing the double majorization technique, we approximate the log-sum-exponential term involved in the data log-likelihood by a quadratic lower-bound.
With the help of this lower-bound, we are able to conveniently relax the equality constraint on the background log-space distribution of SAGE [10] into a low-rank constraint, leading to our SAM-LRB model. Then, after the constrained optimization is transformed into the form of RPCA's objective function, an algorithm based on accelerated proximal gradient is adopted to learn SAM-LRB. The model specification and learning algorithm are simple yet effective. Besides the supervised version, extensions of SAM-LRB to unsupervised and multifaceted scenarios are investigated. Experimental results demonstrate the effectiveness and efficiency of SAM-LRB compared with Dirichlet-Multinomial and SAGE.

Several perspectives may deserve investigation in the future. First, the accelerated proximal gradient updating needs to compute SVD decompositions, which can be costly for very large scale data. In this case, more efficient optimization handling the nuclear norm and ℓ1-norm is expected, with the semidefinite relaxation technique in [16] being one possible choice. Second, this paper uses a constrained optimization formulation, while a Bayesian treatment via adding conjugate priors to complete the generative model, similar to [8], is an alternative choice. Moreover, we may also adopt nonconjugate priors and employ nonconjugate variational inference [27]. Last but not least, discriminative learning with large margins [18, 30] might also be equipped for robust classification. Since the nonzero elements of the sparse S in SAM-LRB can be regarded as selected features, one may include them among the discriminative features, rather than using only topical distributions [3]. Additionally, the augmented Lagrangian and alternating direction methods [9, 29] could also be considered as alternatives to the proximal gradient optimization.

References

[1] A. Ahmed and E. P. Xing.
Staying informed: supervised and semi-supervised multi-view topical analysis of ideological perspective. In Proc. EMNLP, pages 1140–1150, 2010.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[3] D. Blei and J. McAuliffe. Supervised topic models. In Advances in NIPS, pages 121–128, 2008.

[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.

[5] D. Bohning. Multinomial logistic regression algorithm. Annals of Inst. of Stat. Math., 44:197–200, 1992.

[6] G. Bouchard. Efficient bounds for the softmax function, applications to inference in hybrid models. In Workshop for Approximate Bayesian Inference in Continuous/Hybrid Systems at NIPS'07, 2007.

[7] X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719–752, 2012.

[8] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. IEEE Trans. Image Processing, 20(12):3419–3430, 2011.

[9] J. Eckstein. Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. Technical report, RUTCOR Research Report RRR 32-2012, 2012.

[10] J. Eisenstein, A. Ahmed, and E. P. Xing. Sparse additive generative models of text. In Proc. ICML, 2011.

[11] J. Eisenstein and E. P. Xing. The CMU 2008 political blog corpus. Technical report, Carnegie Mellon University, School of Computer Science, Machine Learning Department, 2010.

[12] M. A. T. Figueiredo. Adaptive sparseness using Jeffreys prior. In Advances in NIPS, pages 679–704, 2002.

[13] M. R. Gormley, M. Dredze, B. Van Durme, and J. Eisner. Shared components topic models. In Proc.
NAACL-HLT, pages 783–792, 2012.

[14] L. Hong, A. Ahmed, S. Gurumurthy, A. J. Smola, and K. Tsioutsiouliklis. Discovering geographical topics in the twitter stream. In Proc. WWW, pages 769–778, 2012.

[15] T. Jaakkola and M. I. Jordan. A variational approach to Bayesian logistic regression problems and their extensions. In Proc. AISTATS, 1996.

[16] M. Jaggi and M. Sulovský. A simple algorithm for nuclear norm regularized problems. In Proc. ICML, pages 471–478, 2010.

[17] Y. Jiang and A. Saxena. Discovering different types of topics: Factored topic models. In Proc. IJCAI, 2013.

[18] A. Joulin, F. Bach, and J. Ponce. Efficient optimization for discriminative latent class models. In Advances in NIPS, pages 1045–1053, 2010.

[19] J. D. Lafferty and M. D. Blei. Correlated topic models. In Advances in NIPS, pages 147–155, 2006.

[20] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma. Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. Technical report, UIUC Technical Report UILU-ENG-09-2214, August 2009.

[21] Q. Mei, X. Ling, M. Wondra, H. Su, and C. X. Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proc. WWW, 2007.

[22] T. P. Minka. Estimating a Dirichlet distribution. Technical report, Massachusetts Institute of Technology, 2003.

[23] M. Paul and R. Girju. A two-dimensional topic-aspect model for discovering multi-faceted topics. In Proc. AAAI, 2010.

[24] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. In Proc. ICML, pages 1351–1358, 2012.

[25] Y. Sim, N. A. Smith, and D. A. Smith. Discovering factions in the computational linguistics community. In ACL Workshop on Rediscovering 50 Years of Discoveries, 2012.

[26] C. Wang and D. Blei. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process.
In Advances in NIPS, pages 1982–1989, 2009.

[27] C. Wang and D. M. Blei. Variational inference in nonconjugate models. To appear in JMLR.

[28] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in NIPS, pages 2080–2088, 2009.

[29] J. Yang and X. Yuan. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Math. Comp., 82:301–329, 2013.

[30] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: maximum margin supervised topic models. JMLR, 13:2237–2278, 2012.
", "award": [], "sourceid": 162, "authors": [{"given_name": "Lei", "family_name": "Shi", "institution": "Baidu"}]}