{"title": "Bayesian models for Large-scale Hierarchical Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 2411, "page_last": 2419, "abstract": "A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. An even greater challenge is to do so in a manner that is computationally feasible for the large scale problems usually encountered in practice. This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents; thereby encouraging classes nearby in the hierarchy to share similar model parameters. We present new, efficient variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle large-scale problems with hundreds of thousands of dimensions and tens of thousands of classes. We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach, and shows a significant performance advantage over the other state-of-the-art hierarchical methods.", "full_text": "Bayesian models for Large-scale Hierarchical Classification\n\nSiddharth Gopal\n\nYiming Yang\n\nsgopal1@andrew.cmu.edu yiming@cs.cmu.edu\n\nCarnegie Mellon University\n\nBing Bai\n\nAlexandru Niculescu-Mizil\n\n{bing,alex}@nec-labs.com\n\nNEC Laboratories America, Princeton\n\nAbstract\n\nA challenging problem in hierarchical classification is to leverage the hierarchical\nrelations among classes for improving classification performance. An even\ngreater challenge is to do so in a manner that is computationally feasible for large\nscale problems. 
This paper proposes a set of Bayesian methods to model hierarchical\ndependencies among class labels using multivariate logistic regression.\nSpecifically, the parent-child relationships are modeled by placing a hierarchical\nprior over the children nodes centered around the parameters of their parents;\nthereby encouraging classes nearby in the hierarchy to share similar model parameters.\nWe present variational algorithms for tractable posterior inference in these\nmodels, and provide a parallel implementation that can comfortably handle large-scale\nproblems with hundreds of thousands of dimensions and tens of thousands\nof classes. We run a comparative evaluation on multiple large-scale benchmark\ndatasets that highlights the scalability of our approach and shows improved performance\nover the other state-of-the-art hierarchical methods.\n\n1 Introduction\nWith the tremendous growth of data, providing a multi-granularity conceptual view using hierarchical\nclassification (HC) has become increasingly important. The large taxonomies for web page\ncategorization at the Yahoo! Directory and the Open Directory Project, and the International Patent\nTaxonomy are examples of widely used hierarchies. The large hierarchical structures present both\nchallenges and opportunities for statistical classification research. Instead of focusing on individual\nclasses in isolation, we need to address joint training and inference based on the hierarchical\ndependencies among the classes. 
Moreover, this has to be done in a computationally efficient and\nscalable manner, as many real world HC problems are characterized by large taxonomies and high\ndimensionality.\nIn this paper, we investigate a Bayesian framework for leveraging the hierarchical class structure.\nThe Bayesian framework is a natural fit for this problem as it can seamlessly capture the idea that\nthe models at the lower levels of the hierarchy are specializations of models at the ancestor nodes.\nWe define a hierarchical Bayesian model where the prior distribution for the parameters at a node\nis a Gaussian centered at the parameters of the parent node. This prior encourages the parameters\nof nodes that are close in the hierarchy to be similar, thereby enabling propagation of information\nacross the hierarchical structure and leading to inductive transfer (sharing statistical strength) among\nthe models corresponding to the different nodes. The strength of the Gaussian prior, and hence the\namount of information sharing between nodes, is controlled by its covariance parameter, which is\nalso learned from the data. Modelling the covariance structures gives us the flexibility to incorporate\ndifferent ways of sharing information in the hierarchy. For example, consider a hierarchical organization\nof all animals with two sub-topics mammals and birds. By placing feature-specific variances,\nthe model can learn that the sub-topic parameters are more similar along common features like\n\u2018eyes\u2019, \u2018claw\u2019 and less similar in other sub-topic-specific features like \u2018feathers\u2019, \u2018tail\u2019 etc. As another\nexample, the model can incorporate children-specific covariances that allow some sub-topic\nparameters to be less similar to their parent and some to be more similar; e.g., the sub-topic whales is\nquite distinct from its parent mammals compared to its siblings felines and primates. 
Formulating such\nconstraints in non-Bayesian large-margin approaches is not as easy, and to our knowledge has not\nbeen done before in the context of hierarchical classification. Other advantages of a fully Bayesian treatment\nare that there is no reliance on cross-validation, the outputs have a probabilistic interpretation,\nand it is easy to incorporate prior domain knowledge.\nOur approach is similar to the correlated Multinomial logit [18] (corrMNL) in taking a\nBayesian approach to model the hierarchical class structure, but improves over it in two significant\naspects: scalability and setting hyperparameters. Firstly, CorrMNL uses slower MCMC sampling\nfor inference, making it difficult to scale to problems with more than a few hundred features and a\nfew hundred nodes in the hierarchy. By modelling the problem as a Hierarchical Bayesian Logistic\nRegression (HBLR), we are able to vastly improve the scalability by 1) developing variational\nmethods for faster inference, 2) introducing even faster algorithms (partial MAP) to approximate\nthe variational inference at an insignificant cost in classification accuracy, and 3) parallelizing the\ninference. The approximate variational inference (1 plus 2) reduces the computation time by nearly\nthree orders of magnitude (750x) over MCMC, and the parallel implementation in a Hadoop cluster [4]\nfurther improves the time almost linearly in the number of processors. These enabled us to comfortably\nconduct joint posterior inference for hierarchical logistic regression models with tens of\nthousands of categories and hundreds of thousands of features.\nSecondly, a difficulty with the Bayesian approaches, which has been largely side-stepped in [18], is\nthat, when expressed in full generality, they leave many hyperparameters open to subjective input\nfrom the user. 
Typically, these hyperparameters need to be set carefully as they control the amount\nof regularization in the model, and traditional techniques such as Empirical Bayes or cross-validation\nencounter difficulties in achieving this. For instance, Empirical Bayes requires the maximization of\nmarginal likelihood which is difficult to compute in hierarchical logistic models [9] in general, and\ncross-validation requires reducing the number of free parameters for computational reasons, potentially\nlosing the flexibility to capture the desired phenomena. In contrast, we propose a principled\nway to set the hyperparameters directly from data using an approximation to the observed Fisher\nInformation Matrix. Our proposed technique can be easily used to set a large number of hyperparameters\nwithout losing model tractability and flexibility.\nTo evaluate the proposed techniques we run a comprehensive empirical study on several large scale\nhierarchical classification problems. The results show that our approach is able to leverage the\nclass hierarchy and obtain a significant performance boost over leading non-Bayesian hierarchical\nclassification methods, as well as consistently outperform flat methods that do not use the hierarchy\ninformation.\nOther Related Work: Most of the previous work in HC has primarily used large-margin\ndiscriminative methods. Some of the early works in HC [10, 14] use the hierarchical structure to\ndecompose the classification problem into sub-problems recursively along the hierarchy and allocate\na classifier at each node. The hierarchy is used to partition the training data into node-specific subsets\nand classifiers at each node are trained independently without using the hierarchy any further. Many\napproaches have been proposed to better utilize the hierarchical structure. 
For instance, in [22, 1],\nthe output of the lower-level classifiers was used as additional features for the instance at the top-level\nclassifiers. Smoothing the estimated parameters in naive Bayes classifiers along each path from\nthe root to a leaf node has been tried in [17]. [20, 6] proposed large-margin discriminative methods\nwhere the discriminant function at each node takes the contributions from all nodes along the path\nto the root node, and the parameters are jointly learned to minimize a global loss over the hierarchy.\nRecently, enforcing orthogonality constraints between parent and children classifiers was shown to\nachieve state-of-the-art performance [23].\n2 The Hierarchical Bayesian Logistic Regression (HBLR) Framework\nDefine a hierarchy as a set of nodes Y = {1, 2, ...} with the parent relationship \u03c0 : Y \u2192 Y where\n\u03c0(y) is the parent of node y \u2208 Y. Let D = {(x_i, t_i)}_{i=1}^N denote the training data, where x_i \u2208 R^d is\nan instance and t_i \u2208 T is a label; T \u2282 Y is the set of leaf nodes in the hierarchy labeled from 1\nto |T|. We assume that each instance is assigned to one of the leaf nodes in the hierarchy. Let C_y be\nthe set of all children of y.\nFor each node y \u2208 Y, we associate a parameter vector w_y which has a Gaussian prior. We set the\nmean of the prior to the parameter of the parent node, w_{\u03c0(y)}. Different constraints on the covariance\nmatrix of the prior correspond to different ways of propagating information across the hierarchy. In\nwhat follows, we consider three alternative ways to model the covariance matrix, which we call the M1,\nM2 and M3 variants of HBLR. In the M1 variant all the siblings share the same spherical covariance\nmatrix. 
Formally, the generative model for M1 is\n\nM1:\n$$w_{\\mathrm{root}} \\sim \\mathcal{N}(w_0, \\Sigma_0), \\qquad \\alpha_{\\mathrm{root}} \\sim \\Gamma(a_0, b_0)$$\n$$w_y \\mid w_{\\pi(y)}, \\Sigma_{\\pi(y)} \\sim \\mathcal{N}(w_{\\pi(y)}, \\Sigma_{\\pi(y)}) \\;\\; \\forall y, \\qquad \\alpha_y \\sim \\Gamma(a_y, b_y) \\;\\; \\forall y \\notin T$$\n$$t \\mid x \\sim \\mathrm{Multinomial}(p_1(x), p_2(x), \\ldots, p_{|T|}(x)) \\;\\; \\forall (x, t) \\in D$$\n$$p_i(x) = \\exp(w_i^\\top x) \\Big/ \\sum_{t' \\in T} \\exp(w_{t'}^\\top x) \\tag{1}$$\n\nThe parameters of the root node are drawn using user-specified parameters w_0, \u03a3_0, a_0, b_0. Each non-leaf\nnode y \u2209 T has its own \u03b1_y drawn from a Gamma with the shape and inverse-scale parameters\nspecified by a_y and b_y. Each w_y is drawn from the Normal with mean w_{\u03c0(y)} and covariance matrix\n\u03a3_{\u03c0(y)} = \u03b1_{\u03c0(y)}^{-1} I. The class labels are drawn from a Multinomial whose parameters are a soft-max\ntransformation of the w_y's from the leaf nodes. This model leverages the class hierarchy information\nby encouraging the parameters of closely related nodes (parents, children and siblings) to be more\nsimilar to each other than those of distant ones in the hierarchy. Moreover, by using different inverse\nvariance parameters \u03b1_y for each node, the model has the flexibility to adapt the degree of similarity\nbetween the parameters (i.e. parent and children nodes) on a per-family basis. For instance it can\nlearn that sibling nodes which are higher in the hierarchy (e.g. mammals and birds) are generally\nless similar compared to sibling nodes lower in the hierarchy (e.g. chimps and orangutans).\nAlthough this model is equivalent to the corrMNL proposed in [18], the hierarchical logistic regression\nformulation is different from corrMNL and has a distinct advantage that the parameters\ncan be decoupled. As we shall see in Section 3, this enables the use of scalable and parallelizable\nvariational inference algorithms. 
In contrast, in corrMNL the soft-max parameters are modeled as\na sum of contributions along the path from a leaf to the root node. This introduces two layers of\ndependencies between the parameters in the corrMNL model (inside the normalization constant as\nwell as along the path from leaves to root node), which makes it less amenable to efficient variational\ninference. Even if one were to develop a variational approach for the corrMNL parameterization, it\nwould be slower and not efficient for parallelization.\nAlthough the M1 approach is reasonable, one may argue that it would be beneficial to allow the diagonal\nelements of the covariance matrix \u03a3_{\u03c0(y)} to be feature-specific instead of uniform. In our previous\nexample with sub-topics mammals and birds, we may want w_mammals, w_birds to be commonly\nclose to their parent in some dimensions (e.g., in some common features like \u2018eyes\u2019, \u2018breathe\u2019 and\n\u2018claw\u2019) but not in other dimensions (e.g., in bird-specific features like \u2018feathers\u2019 or \u2018beak\u2019). We\naccommodate this by replacing the prior \u03b1_y with $\\alpha_y^{(i)}$ for every feature (i). This form of setting the\nprior is referred to as Automatic Relevance Determination (ARD) and forms the basis of several works\nsuch as Sparse Bayesian Learning [19], Relevance Vector Machines [3], etc. For the HC problem,\nwe define the M2 variant of the HBLR approach as:\n\nM2:\n$$w_y \\mid w_{\\pi(y)}, \\Sigma_{\\pi(y)} \\sim \\mathcal{N}(w_{\\pi(y)}, \\Sigma_{\\pi(y)}) \\;\\; \\forall y$$\n$$\\alpha_y^{(i)} \\sim \\Gamma(a_y^{(i)}, b_y^{(i)}) \\;\\; i = 1..d, \\; \\forall y \\notin T$$\n$$\\text{where } \\Sigma_{\\pi(y)}^{-1} = \\mathrm{diag}(\\alpha_{\\pi(y)}^{(1)}, \\alpha_{\\pi(y)}^{(2)}, \\ldots, \\alpha_{\\pi(y)}^{(d)})$$\n\nYet another extension of the M1 model would be to allow each node to have its own covariance\nmatrix for the Gaussian prior over w_y, not shared with its siblings. 
This enables the model to learn\nhow much the individual children nodes differ from the parent node. For example, consider topic\nmammals and its two sub-topics whales and carnivores; the sub-topic whales is very distinct from a\ntypical mammal and is more of an \u2018outlier\u2019 topic. Such mismatches are very typical in hierarchies,\nespecially in cases where there is not enough training data and an entire subtree of topics is collapsed\nas a single node. M3 aims to cope with such differences.\n\nM3:\n$$w_y \\mid w_{\\pi(y)}, \\Sigma_y \\sim \\mathcal{N}(w_{\\pi(y)}, \\Sigma_y) \\;\\; \\forall y$$\n$$\\alpha_y \\sim \\Gamma(a_y, b_y) \\;\\; \\forall y \\notin T$$\n\nNote that the only difference between M3 and M1 is that M3 uses \u03a3_y = \u03b1_y^{-1} I instead of \u03a3_{\u03c0(y)} in\nthe prior for w_y. In our experiments we found that M3 consistently outperformed the other variants,\nsuggesting that such effects are important to model in HC. Although it would be natural to extend\nM3 by placing ARD priors instead of the uniform \u03b1_y, we do not expect to see better performance\ndue to the difficulty in learning a large number of parameters. Preliminary experiments confirmed\nour suspicions so we did not explore this direction further.\n3 Inference for HBLR\nIn this section, we present the inference method for M2, which is the harder case. The procedure can be easily\nextended to M1 and M3\u00b9. The posterior of M2 is given by\n\n$$p(W, \\alpha \\mid D) \\propto p(D \\mid W, \\alpha)\\, p(W, \\alpha)$$\n$$\\propto \\prod_{(x,t) \\in D} \\frac{\\exp(w_t^\\top x)}{\\sum_{t' \\in T} \\exp(w_{t'}^\\top x)} \\; \\prod_{y \\in Y \\setminus T} \\prod_{i=1}^{d} p(\\alpha_y^{(i)} \\mid a_y^{(i)}, b_y^{(i)}) \\; \\prod_{y \\in Y} p(w_y \\mid w_{\\pi(y)}, \\Sigma_{\\pi(y)}) \\tag{2}$$\n\nA closed-form solution for the posterior is not possible due to the non-conjugacy between the logistic\nlikelihood and the Gaussian prior; we therefore resort to variational methods to compute the\nposterior. 
However, variational methods are themselves computationally intractable in high-dimensional\nscenarios because they require a matrix inversion, which is computationally intensive.\nTherefore, we explore much faster approximation schemes such as partial MAP inference which are\nhighly scalable. Finally, we show the resulting approximate variational inference procedure can be\nparallelized in a map-reduce framework to tackle large-scale problems that would be impossible to\nsolve on a single processor.\n3.1 Variational Inference\nStarting with a simple factored form for the posterior, we seek the distribution q that is closest\nin KL divergence to the true posterior p. We use independent Gaussian q(w_y) and Gamma q(\u03b1_y)\nposterior distributions for w_y and \u03b1_y per node as the factored representation:\n\n$$q(W, \\alpha) = \\prod_{y \\in Y \\setminus T} q(\\alpha_y) \\prod_{y \\in Y} q(w_y) \\propto \\prod_{y \\in Y \\setminus T} \\prod_{i=1}^{d} \\Gamma(\\cdot \\mid \\tau_y^{(i)}, \\upsilon_y^{(i)}) \\prod_{y \\in Y} \\mathcal{N}(\\cdot \\mid \\mu_y, \\Psi_y)$$\n\nIn order to tackle the non-conjugacy inside p(D|W, \u03b1) in (2), we use the upper bound on the\nsoft-max log-normalization constant proposed by [5], which gives a lower bound on the likelihood. For any \u03b2 \u2208 R, \u03be_k \u2208 [0, \u221e),\n\n$$\\log\\Big(\\sum_k e^{g_k}\\Big) \\le \\beta + \\sum_k \\left[ \\frac{g_k - \\beta - \\xi_k}{2} + \\lambda(\\xi_k)\\big((g_k - \\beta)^2 - \\xi_k^2\\big) + \\log(1 + e^{\\xi_k}) \\right], \\qquad \\text{where } \\lambda(\\xi) = \\frac{1}{2\\xi}\\left(\\frac{1}{1 + e^{-\\xi}} - \\frac{1}{2}\\right)$$\n\nand \u03b2, \u03be_k are variational parameters which we can optimize to get the tightest possible bound.\nFor every (x, y) we introduce variational parameters \u03b2_x and \u03be_xy. We now derive an EM algorithm\nthat computes the posterior in the E-step and optimizes the variational parameters in the M-step.\nVariational E-Step The local variational parameters are fixed, and the posterior for a parameter is\ncomputed by matching the log-likelihood of the posterior with the expectation of the log-likelihood\nunder the rest of the parameters. The parameters are updated as\u00b9,\n\n$$\\Psi_y^{-1} = I(y \\in T) \\sum_{(x,t) \\in D} 2\\lambda(\\xi_{xy})\\, x x^\\top + \\mathrm{diag}\\Big(\\frac{\\tau_{\\pi(y)}}{\\upsilon_{\\pi(y)}}\\Big) + |C_y|\\, \\mathrm{diag}\\Big(\\frac{\\tau_y}{\\upsilon_y}\\Big) \\tag{3}$$\n$$\\mu_y = \\Psi_y \\left( I(y \\in T) \\sum_{(x,t) \\in D} \\Big(I(t = y) - \\frac{1}{2} + 2\\lambda(\\xi_{xy})\\beta_x\\Big) x + \\mathrm{diag}\\Big(\\frac{\\tau_{\\pi(y)}}{\\upsilon_{\\pi(y)}}\\Big)\\mu_{\\pi(y)} + \\mathrm{diag}\\Big(\\frac{\\tau_y}{\\upsilon_y}\\Big) \\sum_{c \\in C_y} \\mu_c \\right)$$\n$$\\upsilon_y^{(i)} = b_y^{(i)} + \\sum_{c \\in C_y} \\Big( \\Psi_y^{(i,i)} + \\Psi_c^{(i,i)} + (\\mu_y^{(i)} - \\mu_c^{(i)})^2 \\Big) \\quad \\text{and} \\quad \\tau_y^{(i)} = a_y^{(i)} + \\frac{|C_y|}{2} \\tag{4}$$\n\nVariational M-Step We keep the parameters of the posterior distribution fixed and optimize the\nvariational parameters \u03be_xy, \u03b2_x. Refer to [5] for detailed M-step derivations,\n\n$$\\xi_{xy}^2 = x^\\top \\mathrm{diag}\\Big(\\frac{\\tau_y}{\\upsilon_y}\\Big) x + (\\beta_x - \\mu_y^\\top x)^2, \\qquad \\beta_x = \\Big( \\tfrac{1}{2}\\big(\\tfrac{1}{2}|T| - 1\\big) + \\sum_{y \\in T} \\lambda(\\xi_{xy})\\, \\mu_y^\\top x \\Big) \\Big/ \\sum_{y \\in T} \\lambda(\\xi_{xy})$$\n\nClass-label Prediction After computing the posterior, one way to compute the probability of a target\nclass label given a test instance is to simply plug in the posterior mean for prediction. A more\nprincipled way would be to compute the predictive distribution of the target class label l given the\ntest instance,\n\n$$p(l \\mid x) = \\int p(l, W \\mid x)\\, dW \\approx \\int p(l \\mid W, x)\\, q(W)\\, dW \\tag{5}$$\n\nThe above integral cannot be computed in closed form and people have often resorted to probit\napproximations [16]. We take an alternative route by calculating the joint posterior p(l, W|x) by\nvariational approximations. We assume the following factored form for the predictive distribution,\n\n$$\\tilde{q}(l, W) = \\prod_{y \\in T} \\tilde{q}(w_y)\\, \\tilde{q}(l_y) \\equiv \\prod_{y \\in T} \\mathcal{N}(\\cdot \\mid \\tilde{\\mu}_y, \\tilde{\\Psi}_y)\\, \\mathrm{Bern}(\\cdot \\mid \\tilde{p}_y)$$\n\nThe posterior can be calculated as before, by introducing variational parameters $\\tilde{\\xi}_{xy}$, $\\tilde{\\beta}_x$ and matching\nthe log likelihoods. Substituting $\\tilde{q}(l, W)$ in (5), we see that the predictive distribution is given\nby $\\tilde{q}(l)$ and the target class label is given by $\\arg\\max_{y \\in T} \\tilde{p}_y$.\n\n\u00b9 Complete derivations are presented in the extended version located at http://www.cs.cmu.edu/~sgopal1.\n\n3.2 Partial MAP Inference\nIn most applications, the requirement for a matrix inversion in step (3) could be demanding. In such\nscenarios, we split the inference into two stages, first calculating the posterior of w_y using the MAP\nsolution, and second calculating the posterior of \u03b1_y. In the first stage, we find the MAP estimate\n$w_y^{map}$ and then use a Laplace approximation to approximate the posterior using a separate Normal\ndistribution for each dimension, thereby leading to a diagonal covariance matrix. Note that due to\nthe Laplace approximation, $w_y^{map}$ and the posterior mean \u03bc_y coincide.\n\n$$\\mu = w^{map} = \\arg\\max_W \\sum_{y \\in T} \\Big[ -\\frac{1}{2} (w_y - w_{\\pi(y)})^\\top \\mathrm{diag}\\Big(\\frac{\\tau_{\\pi(y)}}{\\upsilon_{\\pi(y)}}\\Big) (w_y - w_{\\pi(y)}) \\Big] + \\log p(D \\mid W, \\alpha) \\tag{6}$$\n$$(\\Psi_y^{(i,i)})^{-1} = \\sum_{(x,t) \\in D_y} x^{(i)}\\, p_{xy}(1 - p_{xy})\\, x^{(i)}$$\n\nwhere p_xy is the probability that training instance x is labeled as y. The arg max in (6) can be\ncomputed for all \u03bc_y at the same time using optimization techniques like LBFGS [13]. For the\nsecond stage, parameters \u03c4_y and \u03c5_y are updated using (4). 
Full MAP inference is also possible by\nperforming an alternating maximization between w_y, \u03b1_y but we do not recommend it as there is no\ngain in scalability compared to partial MAP inference and it loses the posterior distribution of \u03b1_y.\n3.3 Parallelization\nFor large hierarchies, it might be impractical to learn the parameters of all classes, or even store\nthem in memory, on a single machine. We therefore devise a parallel memory-efficient implementation\nscheme for our partial MAP inference. There are 4 sets of parameters that are updated:\n{\u03bc_y, \u03a8_y, \u03c4_y, \u03c5_y}. The \u03a8_y, \u03c4_y, \u03c5_y can be updated in parallel for each node using (3),(4).\nFor \u03bc, the optimization step in (6) is not easy to parallelize since the w\u2019s are coupled together inside\nthe soft-max function. To make it parallelizable we replace the soft-max function in (1) with multiple\nbinary logistic functions (one for each terminal node), which removes the coupling of parameters\ninside the log-normalization constant. The optimization can now be done in parallel by making the\nfollowing observations. Firstly, note that the optimization problem in (6) is a concave maximization,\ntherefore any order of updating the variables reaches the same unique maximum. Secondly, note\nthat the interactions between the w_y\u2019s are only through the parent and child nodes. By fixing the\nparameters of the parent and children, the parameter w_y of a node can be optimized independently\nof the rest of the hierarchy. One simple way to parallelize is to traverse the hierarchy level by level,\noptimize the parameters at each level in parallel, and iterate until convergence. 
A better way that\nachieves a larger degree of parallelization is to iteratively optimize the odd and even levels: if we fix\nthe parameters at the odd levels, the parameters of parents and the children of all nodes at even levels\nare fixed, and the w_y\u2019s at all even levels can be optimized in parallel. The same goes for optimizing\nthe odd level parameters. To aid convergence we interleave the \u03bc, \u03a8 updates with the \u03c4, \u03c5 updates\nand warm-start with the previous value of \u03bc_y. In practice, for the larger hierarchies we observed\nspeedups linear in the number of processors. Note that the convergence follows from viewing this\nprocedure as block co-ordinate ascent on a concave differentiable function [15].\nWe tested our parallelization framework on a cluster running map-reduce based Hadoop 20.2 with\n64 worker nodes with 8 cores and 16GB RAM each. We used the Accumulo 1.4 key-value store for\nfast retrieve-update of the w_y's. On this hardware, our experiments on the largest dataset with 15358\nclass labels and 347256 features took just 38 minutes. Although the map-reduce framework is not\na requirement, it is a ubiquitous paradigm in distributed computing and having an implementation\ncompatible with it is a definite advantage.\n\nTable 1: Dataset Statistics\n\nDataset | #Training | #Testing | #Class-Labels | #Leaf-labels | Depth | #Features\nCLEF | 10000 | 1006 | 87 | 63 | 4 | 89\nNEWS20 | 11260 | 7505 | 27 | 20 | 3 | 53975\nLSHTC-small | 4463 | 1858 | 1563 | 1139 | 6 | 51033\nLSHTC-large | 93805 | 34905 | 15358 | 12294 | 6 | 347256\nIPC | 46324 | 28926 | 552 | 451 | 4 | 541869\n\n4 Setting prior parameters\nThe w_0, \u03a3_0 represent the overall mean and covariance structure for the w_y. We set w_0 = 0 and \u03a3_0 =\nI because of their minimal effect on the rest of the parameters. The $a_y^{(i)}, b_y^{(i)}$ are variance components\nsuch that $b_y^{(i)} / a_y^{(i)}$ represents the expected variance of the $w_y^{(i)}$. Typically, choosing these parameters is\ndifficult before seeing the data. The traditional way to overcome this is to learn {a_y, b_y} from the\ndata using Empirical Bayes. Unfortunately, in our proposed model, one cannot do this as each\n{a_y, b_y} is associated with a single \u03b1_y. Generally, we need more than one sample value to learn the\nprior parameters effectively [7].\nWe therefore resort to a data-dependent way of setting these parameters by using an approximation\nto the observed Fisher Information matrix. We first derive it for a simpler model and then extend it\nto a hierarchy. Consider the following binary logistic model with unknown w and let the Fisher\nInformation matrix be $I$ and the observed Fisher Information be $\\hat{I}$:\n\n$$Y \\mid x \\sim \\mathrm{Bernoulli}\\Big(\\frac{\\exp(w^\\top x)}{1 + \\exp(w^\\top x)}\\Big); \\qquad I = E\\big[p(x)(1 - p(x))\\, x x^\\top\\big], \\qquad \\hat{I} = \\sum_{(x,t) \\in D} \\hat{p}(x)(1 - \\hat{p}(x))\\, x x^\\top$$\n\nIt is well known that $I^{-1}$ is the asymptotic covariance of the maximum likelihood estimator of w, so a reasonable\nguess for the covariance of a Gaussian prior over w could be the observed $\\hat{I}^{-1}$ from a dataset\nD. The problem with $\\hat{I}^{-1}$ is that we do not have a good estimate $\\hat{p}(x)$ for a given x as we have\nexactly one sample for a given x, i.e. each instance x is labeled exactly once with certainty, therefore\n$\\hat{p}(x)(1 - \\hat{p}(x))$ will always be zero. Therefore we approximate $\\hat{p}(x)$ as the sample prior probability\nindependent of x, i.e. $\\hat{p}(x) = \\hat{p} = \\sum_{(x,t) \\in D} t / |D|$. Now, the prior on the covariance of w_y can be set\nsuch that the expected covariance is $\\hat{I}^{-1}$. To extend this to HC, we need to handle multiple classes,\nwhich can be done by estimating $\\hat{I}(y)^{-1}$ for each y \u2208 T, as well as handle multiple levels, which can\nbe done by recursively setting a_y, b_y as follows,\n\n$$(a_y^{(i)}, b_y^{(i)}) = \\begin{cases} \\big(\\sum_{c \\in C_y} a_c^{(i)}, \\; \\sum_{c \\in C_y} b_c^{(i)}\\big) & \\text{if } y \\notin T \\\\\\\\ \\big(1, \\; [\\hat{I}(y)^{-1}]^{(i,i)}\\big) & \\text{if } y \\in T \\end{cases}$$\n\nwhere $\\hat{I}(y)$ is the observed Fisher Information matrix for class label y. This way of setting the priors\nis similar to the method proposed in [12]; the key differences are in approximating p(x)(1 \u2212 p(x))\nfrom the data rather than using p(x) = 1/2, and the extension to handle multiple classes as well as hierarchies.\nWe also tried other popular strategies such as setting improper gamma priors \u0393(\u03b5, \u03b5), \u03b5 \u2192 0, widely\nused in many ARD works (which is equivalent to using type-2 ML for the \u03b1\u2019s if one uses variational\nmethods [2]) and Empirical Bayes using a single a and b (as well as other Empirical Bayes variants).\nNeither worked well: the former was too sensitive to the value of \u03b5, which is in agreement\nwith the observations made by [11], and the latter constrained the model by using a single a and b.\nWe do not discuss this any further due to lack of space.\n5 Experimental Results\nThroughout our experiments, we used 4 popular benchmark datasets (Table 1) with the recommended\ntrain-test splits: CLEF [8], NEWS20\u00b2, LSHTC-{small,large}\u00b3, IPC\u2074.\nFirst, to evaluate the speed advantage of the variational inference, we compare the full variational\n{M1,M2,M3}-var and partial MAP {M1,M2,M3}-map inference\u2075 for the three variants of HBLR to\nthe MCMC sampling based inference of CorrMNL [18]. 
For CorrMNL, we used the implementation\nas provided by the authors\u2076. We performed sampling for 2500 iterations with 1000 for burn-in.\nRe-starts with different initialization values gave the same results for both MCMC and variational\nmethods. All models were run on a single CPU without parallelization. We used the small CLEF [8]\ndataset in order to be able to run the CorrMNL model in reasonable time. The results are presented\nin Table 2. For an informative comparison, we also included the results of {M1,M2,M3}-flat, our\nproposed approach using a flat hierarchy. With regard to scalability, partial MAP inference is\nthe most scalable method, being nearly three orders of magnitude faster (750x) than CorrMNL. Full variational\ninference, although less scalable as it requires O(d^3) matrix inversions in the feature space, is still\nan order of magnitude faster (20x) than CorrMNL. In terms of performance, we see that the partial\nMAP inference for the HBLR has only a small loss in performance compared to the full variational\ninference while having similar training time to the flat approach that does not model the hierarchy\n({M1,M2,M3}-flat).\n\n\u00b2 http://people.csail.mit.edu/jrennie/20Newsgroups/\n\u00b3 http://lshtc.iit.demokritos.gr/\n\u2074 http://www.wipo.int/classifications/ipc/en/support/\n\u2075 Code available at http://www.cs.cmu.edu/~sgopal1\n\u2076 http://www.ics.uci.edu/ babaks/Site/Codes.html\n\nTable 2: Comparison with CorrMNL: Macro-F1 and Micro-F1 on the CLEF dataset\n\nMethod | Macro-F1 | Micro-F1 | Time (mins)\nCorrMNL | 55.59 | 81.10 | 2279\nM1-var | 56.67 | 81.21 | 79\nM2-var | 51.23 | 79.92 | 81\nM3-var | 59.67 | 81.61 | 80\nM1-map | 55.53 | 80.88 | 3\nM2-map | 54.76 | 80.25 | 3\nM3-map | 59.65 | 81.41 | 3\nM1-flat | 52.13 | 79.82 | 3\nM2-flat | 48.78 | 77.83 | 3\nM3-flat | 55.23 | 80.52 | 3\n\nFigure 1: Micro-F1 (left) & Macro-F1 (right) on the CLEF dataset with limited number of training examples.\n\nNext, we compare the performance of HBLR to several other competing approaches:\n1. 
Hierarchical Baselines: We selected 3 representative hierarchical methods that have been shown to\nhave state-of-the-art performance: Hierarchical SVM [6] (HSVM), a large-margin discriminative\nmethod with a path-dependent discriminant function; Orthogonal Transfer [23] (OT), a method enforcing\northogonality constraints between the parent node and its children; and Top-down Classification\n[14] (TD), top-down decision making using binary SVMs trained at each node.\n2. Flat Baselines: Typical flat approaches which do not make use of the hierarchy. We tested One-versus-rest\nBinary Logistic Regressions (BLR), Multiclass Logistic Regression (MLR), One-versus-rest\nBinary SVMs (BSVM), and Multiclass SVM (MSVM) [21].\nFor all competing approaches, we tune the regularization parameter using 5-fold CV with a range\nof values from 10^\u22125 to 10^5. For the HBLR models, we used partial MAP inference because full\nvariational inference is not scalable to high dimensions. The IPC and LSHTC-large are very large datasets so\nwe are unable to test any method other than our parallel implementation of HBLR, and BLR, BSVM\nwhich can be trivially parallelized. Although TD can be parallelized we did not pursue this since\nTD did not achieve competitive performance on the other datasets. Parallelizing the other methods\nis not obvious and has not been discussed in previous literature to the best of our knowledge.\nTable 3 summarizes the results obtained by the different methods. The performance was measured using\nthe standard macro-F1 and micro-F1 measures [14]. The significance tests are performed using a\nsign test for Micro-F1 and a Wilcoxon rank test on the Macro-F1 scores. For every data collection,\neach method is compared to the best performing method on that dataset. The null hypothesis is that\nthere is no significant difference between the two systems being compared; the alternative is that\nthe best performing method is better. 
Among M1, M2 and M3, the performance of M3 seems to be consistently better than M1, followed by M2. Although M2 is more expressive than M1, the benefit of a better model seems to be offset by the difficulty in learning a large number of parameters.\nComparing to the other hierarchical baselines, M3 achieves significantly higher performance on all datasets, showing that the Bayesian approach is able to leverage the information provided in the class hierarchy. Among the baselines, we find that the average performance of HSVM is higher than that of TD and OT. This can be partially explained by noting that both OT and TD are greedy top-down classification methods and any error made in the top-level classifications propagates down to the leaf node, in contrast to HSVM, which uses an exhaustive search over all labels. However, the results of OT do not seem to support the conclusions in [23]. We hypothesize two reasons: firstly, the orthogonality condition which is assumed in OT does not hold in general; secondly, unlike [23], we use cross-validation to set the underlying regularization parameters rather than setting them arbitrarily to 1 (which was used in [23]).\n\n[Figure 1 panels: Micro-F1 (left) and Macro-F1 (right) versus # Training Examples per Class (1 to 5); curves for BLR, MLR, BSVM, MSVM, M3-map.]\n\n                       {M1,M2,M3}-map          Hierarchical methods    Flat methods\n                       M1      M2      M3      HSVM    OT      TD      BLR     MLR     BSVM    MSVM\nCLEF         Macro-f1  55.53\u2020  54.76\u2020  59.65   57.23*  37.12\u2020  32.32\u2020  53.26\u2020  54.76\u2020  48.59\u2020  54.33\u2020\n             Micro-f1  80.88*  80.25*  81.41   79.72\u2020  73.84\u2020  70.11\u2020  79.92\u2020  80.52\u2020  77.53\u2020  80.02\u2020\nNEWS20       Macro-f1  81.54   80.91*  81.69   80.04\u2020  81.20   80.86*  82.17   81.82   82.32   81.73\n             Micro-f1  82.24*  81.54*  82.56*  80.79*  81.98*  81.20\u2020  82.97   82.56*  83.10   82.47*\nLSHTC-small  Macro-f1  28.81\u2020  25.81\u2020  30.81   21.95\u2020  19.45\u2020  20.01\u2020  28.12\u2020  28.38*  28.62*  28.34*\n             Micro-f1  45.48   43.31\u2020  46.03   39.66\u2020  37.12\u2020  38.48\u2020  44.94\u2020  45.20   45.21*  45.62\nLSHTC-large  Macro-f1  28.32*  24.93\u2020  28.76   -       -       -       27.91*  -       27.89*  -\n             Micro-f1  43.98   43.11\u2020  44.05   -       -       -       43.98   -       44.03   -\nIPC          Macro-f1  50.43\u2020  47.45\u2020  51.06   -       -       -       48.29\u2020  -       45.71\u2020  -\n             Micro-f1  55.80*  54.22\u2020  56.02   -       -       -       55.03\u2020  -       53.12\u2020  -\n\nTable 3: Macro-F1 and Micro-F1 on the five datasets. Bold-faced numbers indicate the best performing method. The results of the significance tests are denoted * for a p-value less than 5% and \u2020 for a p-value less than 1%.\n\nSurprisingly, the hierarchical baselines (HSVM, TD and OT) experience a very large drop in performance on LSHTC-small when compared to the flat baselines, indicating that the hierarchy information actually misled these methods rather than helping them. In contrast, M3 is consistently better than the flat baselines on all datasets except NEWS20. In particular, M3 performs significantly better on the largest datasets, especially in Macro-F1, showing that even very large class hierarchies can convey very useful information, and highlighting the importance of having a scalable, parallelizable hierarchical classification algorithm.\nTo further establish the importance of modeling the hierarchy, we test our approach under scenarios where the number of training examples is limited. We expect the hierarchy to be most useful in such cases as it enables sharing of information between class parameters. To verify this, we progressively increased the number of training examples per class-label on the CLEF dataset and compared M3-map with the other best performing methods. Figure 1 reports the results of M3-map, MLR, BSVM, MSVM averaged over 20 runs. 
The results show that M3-map is significantly better than the other methods, especially when the number of examples is small. For instance, when there is exactly one training example per class, M3-map achieves a 10% higher Micro-F1 and a 2% higher Macro-F1 than the next best method. We repeated the same experiments on the NEWS20 dataset but did not find improved performance even with limited training examples, suggesting that the hierarchical methods are not able to leverage the hierarchical structure of NEWS20.\n6 Conclusion\nIn this paper, we presented the HBLR approach to hierarchical classification, focusing on scalable ways to leverage hierarchical dependencies among classes in a joint framework. Using a Gaussian prior with informative mean and covariance matrices, along with fast variational methods, and a practical way to set hyperparameters, HBLR significantly outperformed other popular HC methods on multiple benchmark datasets. We hope this study provides useful insights into how hierarchical relationships can be successfully leveraged in large-scale HC. In future, we would like to adapt this approach to equivalent non-Bayesian large-margin discriminative counterparts.\nACKNOWLEDGMENTS: This work is supported, in part, by the NEC Laboratories America, Princeton under \u2018NEC Labs Data Management University Awards\u2019 and the National Science Foundation (NSF) under grant IIS 1216282. A major part of the work was accomplished while the first author was interning at NEC Labs, Princeton.\n\nReferences\n[1] P.N. Bennett and N. Nguyen. Refined experts: improving classification in large taxonomies. In SIGIR, 2009.\n[2] C.M. Bishop. Pattern recognition and machine learning.\n[3] C.M. Bishop and M.E. Tipping. Bayesian regression and classification. 2003.\n[4] D. Borthakur. The hadoop distributed file system: Architecture and design. Hadoop Project Website, 11:21, 2007.\n[5] G. 
Bouchard. Efficient bounds for the softmax function. 2007.\n[6] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In CIKM, pages 78\u201387. ACM, 2004.\n[7] G. Casella. Empirical bayes method - a tutorial. Technical report.\n[8] I. Dimitrovski, D. Kocev, L. Suzana, and S. D\u017eeroski. Hierchical annotation of medical images. In IMIS, 2008.\n[9] C.B. Do, C.S. Foo, and A.Y. Ng. Efficient multiple hyperparameter learning for log-linear models. In Neural Information Processing Systems, volume 21, 2007.\n[10] S. Dumais and H. Chen. Hierarchical classification of web content. In SIGIR, 2000.\n[11] A. Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis.\n[12] R.E. Kass and R. Natarajan. A default conjugate prior for variance components in generalized linear mixed models. Bayesian Analysis, 2006.\n[13] D.C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1):503\u2013528, 1989.\n[14] T.Y. Liu, Y. Yang, H. Wan, H.J. Zeng, Z. Chen, and W.Y. Ma. Support vector machines classification with a very large-scale taxonomy. ACM SIGKDD, pages 36\u201343, 2005.\n[15] Z.Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7\u201335, 1992.\n[16] D.J.C. MacKay. The evidence framework applied to classification networks. Neural computation, 1992.\n[17] A. McCallum, R. Rosenfeld, T. Mitchell, and A.Y. Ng. Improving text classification by shrinkage in a hierarchy of classes. In ICML, pages 359\u2013367, 1998.\n[18] B. Shahbaba and R.M. Neal. Improving classification when a class hierarchy is available using a hierarchy-based prior. Bayesian Analysis, 2(1):221\u2013238, 2007.\n[19] M.E. Tipping. Sparse bayesian learning and the relevance vector machine. 
JMLR, 1:211\u2013244, 2001.\n[20] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6(2):1453, 2006.\n[21] J. Weston and C. Watkins. Multi-class support vector machines. Technical report, 1998.\n[22] G.R. Xue, D. Xing, Q. Yang, and Y. Yu. Deep classification in large-scale text hierarchies. In SIGIR, pages 619\u2013626. ACM, 2008.\n[23] D. Zhou, L. Xiao, and M. Wu. Hierarchical classification via orthogonal transfer. Technical report, MSR-TR-2011-54, 2011.", "award": [], "sourceid": 1165, "authors": [{"given_name": "Siddharth", "family_name": "Gopal", "institution": null}, {"given_name": "Yiming", "family_name": "Yang", "institution": null}, {"given_name": "Bing", "family_name": "Bai", "institution": null}, {"given_name": "Alexandru", "family_name": "Niculescu-mizil", "institution": null}]}