{"title": "Sorting out typicality with the inverse moment matrix SOS polynomial", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 198, "abstract": "We study a surprising phenomenon related to the representation of a cloud of data points using polynomials. We start with the previously unnoticed empirical observation that, given a collection (a cloud) of data points, the sublevel sets of a certain distinguished polynomial capture the shape of the cloud very accurately. This distinguished polynomial is a sum-of-squares (SOS) derived in a simple manner from the inverse of the empirical moment matrix. In fact, this SOS polynomial is directly related to orthogonal polynomials and the Christoffel function. This allows to generalize and interpret extremality properties of orthogonal polynomials and to provide a mathematical rationale for the observed phenomenon. Among diverse potential applications, we illustrate the relevance of our results on a network intrusion detection task for which we obtain performances similar to existing dedicated methods reported in the literature.", "full_text": "Sorting out typicality with the inverse moment matrix\n\nSOS polynomial\n\nJean-Bernard Lasserre\nLAAS-CNRS & IMT\nUniversit\u00e9 de Toulouse\n31400 Toulouse, France\n\nlasserre@laas.fr\n\nEdouard Pauwels\n\nIRIT & IMT\n\nUniversit\u00e9 Toulouse 3 Paul Sabatier\n\n31400 Toulouse, France\n\nedouard.pauwels@irit.fr\n\nAbstract\n\nWe study a surprising phenomenon related to the representation of a cloud of data\npoints using polynomials. We start with the previously unnoticed empirical obser-\nvation that, given a collection (a cloud) of data points, the sublevel sets of a certain\ndistinguished polynomial capture the shape of the cloud very accurately. This\ndistinguished polynomial is a sum-of-squares (SOS) derived in a simple manner\nfrom the inverse of the empirical moment matrix. 
In fact, this SOS polynomial is\ndirectly related to orthogonal polynomials and the Christoffel function. This allows\nto generalize and interpret extremality properties of orthogonal polynomials and to\nprovide a mathematical rationale for the observed phenomenon. Among diverse\npotential applications, we illustrate the relevance of our results on a network intru-\nsion detection task for which we obtain performances similar to existing dedicated\nmethods reported in the literature.\n\n1\n\nIntroduction\n\nCapturing and summarizing the global shape of a cloud of points is at the heart of many data\nprocessing applications such as novelty detection, outlier detection as well as related unsupervised\nlearning tasks such as clustering and density estimation. One of the main dif\ufb01culties is to account\nfor potentially complicated shapes in multidimensional spaces, or equivalently to account for non\nstandard dependence relations between variables. Such relations become critical in applications, for\nexample in fraud detection where a fraudulent action may be the dishonest combination of several\nactions, each of them being reasonable when considered on their own.\nAccounting for complicated shapes is also related to computational geometry and nonlinear algebra\napplications, for example integral computation [11] and reconstruction of sets from moments data\n[6, 7, 12]. Some of these problems have connections and potential applications in machine learning.\nThe work presented in this paper brings together ideas from both disciplines, leading to a method\nwhich allows to encode in a simple manner the global shape and spatial concentration of points within\na cloud.\nWe start with a surprising (and apparently unnoticed) empirical observation. Given a collection of\npoints, one may build up a distinguished sum-of-squares (SOS) polynomial whose coef\ufb01cients (or\nGram matrix) is the inverse of the empirical moment matrix (see Section 3). 
Its degree depends on\nhow many moments are considered, a choice left to the user. Remarkably, its sublevel sets capture\nmuch of the global shape of the cloud, as illustrated in Figure 1. This phenomenon is not incidental, as\nillustrated by many additional examples in Appendix A. To the best of our knowledge, this observation\nhas remained unnoticed, and the purpose of this paper is to report this empirical finding to the machine\nlearning community and to provide first elements toward a mathematical understanding as well as\npotential machine learning applications.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Left: 1000 points in R2 and the level sets of the corresponding inverse moment matrix SOS\npolynomial Q\u00b5,d (d = 4). The level set (p+d choose d), which corresponds to the average value of Q\u00b5,d, is\nrepresented in red. Right: 1040 points in R2 with size and color proportional to the value of the inverse\nmoment matrix SOS polynomial Q\u00b5,d (d = 8).\n\nThe proposed method is based on the computation of the coefficients of a very specific polynomial\nwhich depends solely on the empirical moments associated with the data points. From a practical\nperspective, this can be done via a single pass through the data, or even in an online fashion via\na sequence of efficient Woodbury updates. Furthermore, the computational cost of evaluating the\npolynomial does not depend on the number of data points, which is a crucial difference with existing\nnonparametric methods such as nearest neighbors or kernel based methods [3]. On the other hand,\nthis computation requires the inversion of a matrix whose size depends on the dimension of the\nproblem (see Section 3). 
Therefore, the proposed framework is suited for moderate dimensions and\npotentially very large number of observations.\nIn Section 4 we \ufb01rst describe an af\ufb01ne invariance result which suggests that the distinguished SOS\npolynomial captures very intrinsic properties of clouds of points. In a second step, we provide a\nmathematical interpretation that supports our empirical \ufb01ndings based on connections with orthogonal\npolynomials [5]. We propose a generalization of a well known extremality result for orthogonal\nunivariate polynomials on the real line (or the complex plane) [16, Theorem 3.1.2]. As a consequence,\nthe distinguished SOS polynomial of interest in this paper is understood as the unique optimal\nsolution of a convex optimization problem: minimizing an average value over a structured set of\npositive polynomials. In addition, we revisit [16, Theorem 3.5.6] about the Christoffel function.\nThe mathematics behind provide a simple and intuitive explanation for the phenomenon that we\nempirically observed.\nFinally, in Section 5 we perform numerical experiments on KDD cup network intrusion dataset\n[13]. Evaluation of the distinguished SOS polynomial provides a score that we use as a measure of\noutlyingness to detect network intrusions (assuming that they correspond to outlier observations).\nWe refer the reader to [3] for a discussion of available methods for this task. For the sake of a\nfair comparison we have reproduced the experiments performed in [18] for the same dataset. We\nreport results similar to (and sometimes better than) those described in [18] which suggests that the\nmethod is comparable to other dedicated approaches for network intrusion detection, including robust\nestimation and Mahalanobis distance [8, 10], mixture models [14] and recurrent neural networks\n[18].\n\n2 Multivariate polynomials, moments and sums of squares\n\nNotations: We \ufb01x the ambient dimension to be p throughout the text. 
For example, we will\nmanipulate vectors in Rp as well as p-variate polynomials with real coefficients. We denote by X a\nset of p variables X1, . . . , Xp which we will use in mathematical expressions defining polynomials.\nWe identify monomials from the canonical basis of p-variate polynomials with their exponents in\nNp: we associate to \u03b1 = (\u03b1_i)_{i=1...p} \u2208 Np the monomial X^\u03b1 := X_1^{\u03b1_1} X_2^{\u03b1_2} . . . X_p^{\u03b1_p}, whose degree is\ndeg(\u03b1) := \u2211_{i=1}^p \u03b1_i. We use the expressions <gl and \u2264gl to denote the graded lexicographic order,\na well ordering over p-variate monomials. This amounts to, first, using the canonical order on the\ndegree and, second, breaking ties between monomials of the same degree using the lexicographic order with\nX1 = a, X2 = b . . . For example, the monomials in two variables X1, X2 of degree less than or equal to\n3, listed in this order, are: 1, X1, X2, X1^2, X1X2, X2^2, X1^3, X1^2 X2, X1 X2^2, X2^3.\nWe denote by N^p_d the set {\u03b1 \u2208 Np; deg(\u03b1) \u2264 d} ordered by \u2264gl. R[X] denotes the set of p-variate\npolynomials: linear combinations of monomials with real coefficients. The degree of a polynomial is\nthe highest of the degrees of its monomials with nonzero coefficients (footnote 1). We use the same notation,\ndeg(\u00b7), to denote the degree of a polynomial or of an element of Np. For d \u2208 N, Rd[X] denotes\nthe set of p-variate polynomials of degree less than or equal to d. We set s(d) = (p+d choose d), the number of\nmonomials of degree less than or equal to d. We will denote by vd(X) the vector of monomials of degree\nless than or equal to d sorted by \u2264gl. We let vd(X) := (X^\u03b1)_{\u03b1 \u2208 N^p_d} \u2208 Rd[X]^{s(d)}. 
With this notation,\nwe can write a polynomial P \u2208 Rd[X] as P(X) = \u27e8p, vd(X)\u27e9 for some real vector of\ncoefficients p = (p_\u03b1)_{\u03b1 \u2208 N^p_d} \u2208 R^{s(d)} ordered using \u2264gl. Given x = (x_i)_{i=1...p} \u2208 Rp, P(x) denotes\nthe evaluation of P with the assignments X1 = x1, X2 = x2, . . . , Xp = xp. Given a Borel probability\nmeasure \u00b5 and \u03b1 \u2208 Np, y\u03b1(\u00b5) denotes the moment \u03b1 of \u00b5: y\u03b1(\u00b5) = \u222b_{Rp} x^\u03b1 d\u00b5(x). Throughout the\npaper, we will only consider measures all of whose moments are finite.\nMoment matrix: Given a Borel probability measure \u00b5 on Rp, the moment matrix of \u00b5, Md(\u00b5), is a\nmatrix indexed by monomials of degree at most d ordered by \u2264gl. For \u03b1, \u03b2 \u2208 N^p_d, the corresponding\nentry in Md(\u00b5) is defined by Md(\u00b5)_{\u03b1,\u03b2} := y_{\u03b1+\u03b2}(\u00b5), the moment \u03b1 + \u03b2 of \u00b5. When p = 2, letting\ny\u03b1 = y\u03b1(\u00b5) for \u03b1 \u2208 N^2_4, we have\n\nM2(\u00b5):\n\n         1     X1    X2    X1^2  X1X2  X2^2\n1        1     y10   y01   y20   y11   y02\nX1       y10   y20   y11   y30   y21   y12\nX2       y01   y11   y02   y21   y12   y03\nX1^2     y20   y30   y21   y40   y31   y22\nX1X2     y11   y21   y12   y31   y22   y13\nX2^2     y02   y12   y03   y22   y13   y04\n\nMd(\u00b5) is positive semidefinite for all d \u2208 N. Indeed, for any p \u2208 R^{s(d)}, let P \u2208 Rd[X] be the\npolynomial with vector of coefficients p; we have p^T Md(\u00b5) p = \u222b_{Rp} P(x)^2 d\u00b5(x) \u2265 0. Furthermore,\nwe have the identity Md(\u00b5) = \u222b_{Rp} vd(x) vd(x)^T d\u00b5(x), where the integral is understood elementwise.\nSum of squares (SOS): We denote by \u03a3[X] \u2282 R[X] (resp. \u03a3d[X] \u2282 Rd[X]) the set of polynomials\n(resp. polynomials of degree at most d) which can be written as a sum of squares of polynomials.\nLet P \u2208 R2m[X] for some m \u2208 N, then P belongs to \u03a32m[X] if there exists a finite J \u2282 N and\na family of polynomials Pj \u2208 Rm[X], j \u2208 J, such that P = \u2211_{j\u2208J} Pj^2. It is obvious that sum\nof squares polynomials are always nonnegative. A further interesting property is that this class of\npolynomials is connected with positive semidefiniteness. Indeed, P belongs to \u03a32m[X] if and only if\n\n\u2203 Q \u2208 R^{s(m)\u00d7s(m)}, Q \u2ab0 0, P(x) = vm(x)^T Q vm(x), \u2200x \u2208 Rp.   (1)\n\nAs a consequence, every positive semidefinite matrix Q \u2208 R^{s(m)\u00d7s(m)} defines a polynomial in\n\u03a32m[X] by using the representation in (1).\n\nFootnote 1: For the null polynomial, we use the convention that its degree is 0 and it is \u2264gl smaller than all other\nmonomials.\n\n3 Empirical observations on the inverse moment matrix SOS polynomial\n\nThe inverse moment-matrix SOS polynomial is associated to a measure \u00b5 which satisfies the following.\nAssumption 1 \u00b5 is a Borel probability measure on Rp with all its moments finite and Md(\u00b5) is\npositive definite for a given d \u2208 N.\nDefinition 1 Let \u00b5, d satisfy Assumption 1. We call the SOS polynomial Q\u00b5,d \u2208 \u03a32d[X] defined by\nthe application:\n\nx \u21a6 Q\u00b5,d(x) := vd(x)^T Md(\u00b5)^{\u22121} vd(x),   x \u2208 Rp,   (2)\n\nthe inverse moment-matrix SOS polynomial of degree 2d associated to \u00b5.\nActually, the connection to orthogonal polynomials will show that the reciprocal function x \u21a6 Q\u00b5,d(x)^{\u22121}\nis what is called the Christoffel function in the literature [16, 5] (see also Section 4).\nIn the remainder of this section, we focus on the situation when \u00b5 corresponds to an empirical\nmeasure over n points in Rp which are fixed. So let x1, . . . , xn \u2208 Rp be a fixed set of points and let\n\u00b5 := (1/n) \u2211_{i=1}^n \u03b4_{xi}, where \u03b4x corresponds to the Dirac measure at x. In such a case the polynomial\nQ\u00b5,d in (2) is determined only by the empirical moments up to degree 2d of our collection of points.\nNote that we also require that Md(\u00b5) \u227b 0. In other words, the points x1, . . . , xn do not belong to\nan algebraic set defined by a polynomial of degree less than or equal to d. We first describe empirical\nproperties of the inverse moment matrix SOS polynomial in this context of empirical measures. A\nmathematical intuition and further properties behind these observations are developed in Section 4.\n\n3.1 Sublevel sets\n\nThe starting point of our investigations is the following phenomenon which to the best of our\nknowledge has remained unnoticed in the literature. For the sake of clarity and simplicity we provide\nan illustration in the plane. Consider the following experiment in R2 for a fixed d \u2208 N: represent on\nthe same graphic the cloud of points {xi}_{i=1...n} and the sublevel sets of the SOS polynomial Q\u00b5,d in\nR2 (equivalently, the superlevel sets of the Christoffel function). This is illustrated in the left panel\nof Figure 1. The collection of points consists of 500 simulations of two different Gaussians and the\nvalue of d is 4. The striking feature of this plot is that the level sets capture the global shape of the\ncloud of points quite accurately. In particular, the level set {x : Q\u00b5,d(x) \u2264 (p+d choose d)} captures most of\nthe points. We could reproduce very similar observations on different shapes with various numbers of\npoints in R2 and degrees d (see Appendix A).\n\n3.2 Measuring outlyingness\n\nAn additional remark in a similar line is that Q\u00b5,d tends to take higher values on points which are\nisolated from other points. 
Indeed, in the left panel of Figure 1, the value of the polynomial tends to\nbe larger on the boundary of the cloud. This extends to situations where the collection of points\ncorresponds to a shape with a high density of points plus a few additional outliers. We reproduce a\nsimilar experiment in the right panel of Figure 1. In this example, 1000 points are sampled close to a\nring shape and 40 additional points are sampled uniformly on a larger square. We do not represent\nthe sublevel sets of Q\u00b5,d here. Instead, the color and size of the points are taken proportional to\nthe value of Q\u00b5,d, with d = 8.\nFirst, the results confirm the observation of the previous paragraph: points that fall close to the ring\nshape tend to be smaller and points on the boundary of the ring shape are larger. Second, there is a\nclear increase in the size of the points that are relatively far away from the ring shape. This highlights\nthe fact that Q\u00b5,d tends to take higher values in less populated areas of the space.\n\n3.3 Relation to maximum likelihood estimation\n\nIf we fix d = 1, we recover the maximum likelihood estimation for the Gaussian, up to a constant\nadditive factor. To see this, set \u00b5 = (1/n) \u2211_{i=1}^n xi and S = (1/n) \u2211_{i=1}^n xi xi^T. With this notation, we have\nthe following block representation of the moment matrix,\n\nMd(\u00b5) = [ 1   \u00b5^T ]\n        [ \u00b5   S   ],\n\nMd(\u00b5)^{\u22121} = [ 1 + \u00b5^T V^{\u22121} \u00b5    \u2212\u00b5^T V^{\u22121} ]\n             [ \u2212V^{\u22121} \u00b5            V^{\u22121}     ],\n\nwhere V = S \u2212 \u00b5\u00b5^T is the empirical covariance matrix and the expression for the inverse is given by\nthe Schur complement. In this case, we have Q\u00b5,1(x) = 1 + (x \u2212 \u00b5)^T V^{\u22121} (x \u2212 \u00b5) for all x \u2208 Rp. We\nrecognize the quadratic form that appears in the density function of the multivariate Gaussian with\nparameters estimated by maximum likelihood. 
This suggests a connection between the inverse moment matrix\nSOS polynomial and maximum likelihood estimation. Unfortunately, this connection is difficult\nto generalize to higher values of d. We therefore do not pursue the idea of interpreting the empirical\nobservations of this section through the prism of maximum likelihood estimation, and leave it for\nfurther research. Instead, we propose an alternative view in Section 4.\n\n3.4 Computational aspects\n\nRecall that s(d) = (p+d choose d) is the number of p-variate monomials of degree up to d. The computation\nof Q\u00b5,d requires O(n s(d)^2) operations for the computation of the moment matrix and O(s(d)^3)\noperations for the matrix inversion. The evaluation of Q\u00b5,d requires O(s(d)^2) operations.\nEstimating the coefficients of Q\u00b5,d has a computational cost that depends only linearly on the number\nof points n. The cost of evaluating Q\u00b5,d is constant with respect to the number of points n. This is\nan important contrast with kernel based or distance based methods (such as nearest neighbors and\none class SVM) for density estimation or outlier detection since they usually require at least O(n^2)\noperations for the evaluation of the model [3]. Moreover, this is well suited for online settings where\ninverse moment matrix computation can be done using rank one Woodbury updates [15, Section\n2.7.1].\nThe dependence on the dimension p is of the order of p^d for a fixed d. Similarly, the dependence on d\nis of the order of d^p for a fixed dimension p, and the joint dependence is exponential. Furthermore,\nMd(\u00b5) has a Hankel structure which is known to produce ill conditioned matrices. This suggests\nthat the direct computation and evaluation of Q\u00b5,d will mostly make sense for moderate dimensions\nand degree d. 
In our experiments, for large d, the evaluation of Q\u00b5,d remains quite stable, but the\ninversion leads to numerical errors for higher values of d (around 20).\n\n4 Invariance and interpretation through orthogonal polynomials\n\nThe purpose of this section is to provide a mathematical rationale that explains the empirical\nobservations made in Section 3. All the proofs are postponed to Appendix B. We fix a Borel probability\nmeasure \u00b5 on Rp which satisfies Assumption 1. Note that Md(\u00b5) is always positive definite if \u00b5\nis not supported on the zero set of a polynomial of degree at most d. Under Assumption 1, Md(\u00b5)\ninduces an inner product on R^{s(d)} and by extension on Rd[X] (see Section 2). This inner product is\ndenoted by \u27e8\u00b7,\u00b7\u27e9\u00b5 and satisfies, for any polynomials P, Q \u2208 Rd[X] with coefficients p, q \u2208 R^{s(d)},\n\n\u27e8P, Q\u27e9\u00b5 := \u27e8p, Md(\u00b5)q\u27e9_{R^{s(d)}} = \u222b_{Rp} P(x)Q(x) d\u00b5(x).\n\nWe will also use the canonical inner product over Rd[X] which we write \u27e8P, Q\u27e9_{Rd[X]} := \u27e8p, q\u27e9_{R^{s(d)}}\nfor any polynomials P, Q \u2208 Rd[X] with coefficients p, q \u2208 R^{s(d)}. We will omit the subscripts for\nthis canonical inner product and use \u27e8\u00b7,\u00b7\u27e9 for both products.\n\n4.1 Affine invariance\nIt is worth noticing that the mapping x \u21a6 Q\u00b5,d(x) does not depend on the particular choice of vd(X)\nas a basis of Rd[X]: any other basis would lead to the same mapping. This leads to the result that\nQ\u00b5,d captures affine invariant properties of \u00b5.\nLemma 1 Let \u00b5 satisfy Assumption 1 and A \u2208 R^{p\u00d7p}, b \u2208 Rp define an invertible affine mapping on\nRp, A : x \u2192 Ax+b. 
Then, the push forward measure, defined by \u02dc\u00b5(S) = \u00b5(A^{\u22121}(S)) for all Borel sets\nS \u2282 Rp, satisfies Assumption 1 (with the same d as \u00b5) and for all x \u2208 Rp, Q\u00b5,d(x) = Q\u02dc\u00b5,d(Ax + b).\n\nLemma 1 is probably better understood when \u00b5 = (1/n) \u2211_{i=1}^n \u03b4_{xi} as in Section 3. In this case, we\nhave \u02dc\u00b5 = (1/n) \u2211_{i=1}^n \u03b4_{Axi+b} and Lemma 1 asserts that the level sets of Q\u02dc\u00b5,d are simply the images of\nthose of Q\u00b5,d under the affine transformation x \u21a6 Ax + b. This is illustrated in Appendix D.\n\n4.2 Connection with orthogonal polynomials\nWe define a classical [16, 5] family of orthonormal polynomials, {P\u03b1}_{\u03b1\u2208N^p_d}, ordered according to \u2264gl,\nwhich satisfies for all \u03b1 \u2208 N^p_d\n\n\u27e8P\u03b1, X^\u03b2\u27e9 = 0 if \u03b1 <gl \u03b2,  \u27e8P\u03b1, P\u03b1\u27e9\u00b5 = 1,  \u27e8P\u03b1, X^\u03b2\u27e9\u00b5 = 0 if \u03b2 <gl \u03b1,  \u27e8P\u03b1, X^\u03b1\u27e9\u00b5 > 0.   (3)\n\nIt follows from (3) that \u27e8P\u03b1, P\u03b2\u27e9\u00b5 = 0 if \u03b1 \u2260 \u03b2. Existence and uniqueness of such a family is\nguaranteed by the Gram-Schmidt orthonormalization process following the \u2264gl order, and by the\npositivity of the moment matrix, see for instance [5, Theorem 3.1.11]. There exist determinantal\nformulae [9], and more precise descriptions can be made for measures which have additional geometric\nproperties, see [5] for many examples.\nLet Dd(\u00b5) be the lower triangular matrix whose rows are the coefficients of the polynomials P\u03b1\ndefined in (3) ordered by \u2264gl. It can be shown that Dd(\u00b5) = Ld(\u00b5)^{\u2212T}, where Ld(\u00b5) is the Cholesky\nfactorization of Md(\u00b5). Furthermore, there is a direct relation with the inverse moment matrix as\nMd(\u00b5)^{\u22121} = Dd(\u00b5)^T Dd(\u00b5) [9, Proof of Theorem 3.1]. 
This has the following consequence.\n\nLemma 2 Let \u00b5 satisfy Assumption 1, then Q\u00b5,d = \u2211_{\u03b1\u2208N^p_d} P\u03b1^2, where the family {P\u03b1}_{\u03b1\u2208N^p_d} is\ndefined by (3), and \u222b_{Rp} Q\u00b5,d(x) d\u00b5(x) = s(d).\n\nThat is, Q\u00b5,d is a very specific and distinguished SOS polynomial, the sum of squares of the\northonormal basis elements {P\u03b1}_{\u03b1\u2208N^p_d} of Rd[X] (w.r.t. \u00b5). Furthermore, the average value of Q\u00b5,d\nwith respect to \u00b5 is s(d), which corresponds to the red level set in the left panel of Figure 1.\n\n4.3 A variational formulation for the inverse moment matrix SOS polynomial\nIn this section, we show that the family of polynomials {P\u03b1}_{\u03b1\u2208N^p_d} defined in (3) is the unique\nsolution (up to a multiplicative constant) of a convex optimization problem over polynomials. This\nfact combined with Lemma 2 provides a mathematical rationale for the empirical observations\noutlined in Section 3. Consider the following optimization problem.\n\nmin_{Q\u03b1, \u03b8\u03b1, \u03b1\u2208N^p_d}  (1/2) \u222b_{Rp} \u2211_{\u03b1\u2208N^p_d} Q\u03b1(x)^2 d\u00b5(x)   (4)\ns.t.  q\u03b1\u03b1 \u2265 exp(\u03b8\u03b1),  q\u03b1\u03b2 = 0,  \u03b1, \u03b2 \u2208 N^p_d, \u03b1 <gl \u03b2,  \u2211_{\u03b1\u2208N^p_d} \u03b8\u03b1 = 0,\n\nwhere Q\u03b1(x) = \u2211_{\u03b2\u2208N^p_d} q\u03b1\u03b2 x^\u03b2 is a polynomial and \u03b8\u03b1 is a real variable for each \u03b1 \u2208 N^p_d. We first\ncomment on problem (4). Let P = \u2211_{\u03b1\u2208N^p_d} Q\u03b1^2 be the SOS polynomial appearing in the objective\nfunction of (4). The objective of (4) simply involves the average value of P with respect to \u00b5. Let\nSd \u2282 \u03a32d[X] be the set of such SOS polynomials P which have a sum of squares decomposition\nsatisfying the constraints of (4) (for some arbitrary value of the real variables {\u03b8\u03b1}_{\u03b1\u2208N^p_d}). With this\nnotation, problem (4) has the simple formulation min_{P\u2208Sd} (1/2) \u222b P d\u00b5.\nBased on this formulation, problem (4) can be interpreted as balancing two antagonistic targets: on one\nhand, the minimization of the average value of the SOS polynomial P with respect to \u00b5; on the other\nhand, the avoidance of the trivial polynomial, enforced by the constraint that P \u2208 Sd. The constraint\nP \u2208 Sd is simple and natural. It ensures that P is a sum of squares of polynomials {Q\u03b1}_{\u03b1\u2208N^p_d}, where\nthe leading term of each Q\u03b1 (according to the ordering \u2264gl) is q\u03b1\u03b1 x^\u03b1 with q\u03b1\u03b1 > 0 (and hence\ndoes not vanish). Conversely, using Cholesky factorization, for any SOS polynomial Q of degree 2d\nwhose coefficient matrix (see equation (1)) is positive definite, there exists a > 0 such that aQ \u2208 Sd.\nThis suggests that Sd is a quite general class of nonvanishing SOS polynomials. The following result,\nwhich gives a relation between Q\u00b5,d and solutions of (4), uses a generalization of [16, Theorem 3.1.2]\nto several orthogonal polynomials of several variables.\n\nTheorem 1 : Under Assumption 1, problem (4) is a convex optimization problem with a unique\noptimal solution (Q\u2217\u03b1, \u03b8\u2217\u03b1), which satisfies Q\u2217\u03b1 = \u221a\u03bb P\u03b1, \u03b1 \u2208 N^p_d, for some \u03bb > 0. In particular,\nthe distinguished SOS polynomial Q\u00b5,d = \u2211_{\u03b1\u2208N^p_d} P\u03b1^2 = (1/\u03bb) \u2211_{\u03b1\u2208N^p_d} (Q\u2217\u03b1)^2 is (part of) the unique\noptimal solution of (4).\n\nTheorem 1 states that up to the scaling factor \u03bb, the distinguished SOS polynomial Q\u00b5,d is the\nunique optimal solution of problem (4). 
A detailed proof is provided in Appendix B and\nwe only sketch the main ideas here. First, it is remarkable that for each fixed \u03b1 \u2208 N^p_d (and\nagain up to a scaling factor) the polynomial P\u03b1 is the unique optimal solution of the problem:\n\nmin_Q { \u222b Q^2 d\u00b5 : Q \u2208 Rd[X], Q(x) = x^\u03b1 + \u2211_{\u03b2 <gl \u03b1} q\u03b2 x^\u03b2 }.\n\nThis fact is well-known in the\nunivariate case [16, Theorem 3.1.2] but does not seem to have been exploited in the literature, at\nleast for purposes similar to ours. So intuitively, P\u03b1^2 should be as close to 0 as possible on the support\nof \u00b5. Problem (4) has similar properties and the constraint on the vector of weights \u03b8 enforces\nthat, at an optimal solution, the contribution of \u222b (Q\u2217\u03b1)^2 d\u00b5 to the overall sum in the criterion is the\nsame for all \u03b1. Using Lemma 2 yields (up to a multiplicative constant) the polynomial Q\u00b5,d. Other\nconstraints on \u03b8 would yield different weighted sums of the squares P\u03b1^2. This will be a subject of\nfurther investigations.\nTo sum up, Theorem 1 provides a rationale for our observations. Indeed when solving (4), intuitively,\nQ\u00b5,d should be close to 0 on average while remaining in a class of nonvanishing SOS polynomials.\n\n4.4 Christoffel function and outlier detection\n\nThe following result from [5, Theorem 3.5.6] draws a direct connection between Q\u00b5,d and the\nChristoffel function (the right hand side of (5)).\nTheorem 2 ([5]) Let Assumption 1 hold and let z \u2208 Rp be fixed, arbitrary. Then\n\nQ\u00b5,d(z)^{\u22121} = min_{P\u2208Rd[X]} { \u222b_{Rp} P(x)^2 d\u00b5(x) : P(z) = 1 }.   (5)\n\nTheorem 2 provides a mathematical rationale for the use of Q\u00b5,d for outlier or novelty detection\npurposes. Indeed, from Lemma 2 and equation (3), we have Q\u00b5,d \u2265 1 on Rp. 
Furthermore, the\nsolution of the minimization problem in (5) satisfies P(z)^2 = 1 and \u00b5({x \u2208 Rp : P(x)^2 \u2264 1}) \u2265\n1 \u2212 Q\u00b5,d(z)^{\u22121} (by Markov\u2019s inequality). Hence, for high values of Q\u00b5,d(z), the sublevel set\n{x \u2208 Rp : P(x)^2 \u2264 1} contains most of the mass of \u00b5 while P(z)^2 = 1. An illustration of this\ndiscussion is given in Appendix E. Again the result of Theorem 2 does not seem to have been\ninterpreted for purposes similar to ours.\n\n5 Experiments on network intrusion datasets\n\nIn addition to having its own mathematical interest, Theorem 1 can be exploited for various purposes.\nFor instance, the sublevel sets of Q\u00b5,d, and in particular {x \u2208 Rp : Q\u00b5,d(x) \u2264 (p+d choose d)}, can be used\nto encode a cloud of points in a simple and compact form. However in this section we focus on\nanother potential application in anomaly detection.\nEmpirical findings described in Section 3 suggest that the polynomial Q\u00b5,d can be used to detect\noutliers in a collection of real vectors (with \u00b5 the empirical measure). This is backed up by the results\npresented in Section 4. We illustrate these properties on a real world example. We choose the KDD\ncup 99 network intrusion dataset [13] consisting of network connection data, labeled as normal traffic\nor network intrusions. We follow [19] and [18] and construct five datasets consisting of labeled\nvectors in R3 with the following properties\n\nDataset                 http     smtp     ftp-data  ftp     others\nNumber of examples      567498   95156    30464     4091    5858\nProportion of attacks   0.004    0.0003   0.023     0.077   0.016\n\nThe details on the datasets construction are available in [19, 18] and reproduced in Appendix C.\nThe main idea is to compute an outlyingness score (independent of the label) and compare outliers\npredicted by the score and network intrusion labels. 
The underlying assumption is that network\nintrusions correspond to infrequent abnormal behaviors and could be considered as outliers.\nWe reproduce the same experiment as in [18, Section 5.4] using the value of Q\u00b5,d from De\ufb01nition 1\nas an outlyingness score (with d = 3). The authors of [18] have compared different methods in the\nsame experimental setting: robust estimation and Mahalanobis distance [8, 10], mixture models [14]\nand recurrent neural networks. The results are gathered in [18, Figure 7]. In the left panel of Figure 2\nwe represent the same performance measure for our approach: we \ufb01rst compute the value of Q\u00b5,d\nfor each datapoint and use it as an outlyingness score. We then display the proportion of correctly\nidenti\ufb01ed outliers, with score above a given threshold, as a function of the proportion of examples\nwith score above the threshold (for different values of the threshold). The main comments are as\nfollows.\n\n7\n\n\fFigure 2: Left: reproduction of the results described in [18] with the evaluation of Q\u00b5,d as an\noutlyingness score (d = 3). Right: precision-recall curves for different values of d (dataset \u201cothers\u201d).\n\u2022 The inverse moment matrix SOS polynomial does detect network intrusions with varying perfor-\nmances on the \ufb01ve datasets.\n\u2022 Except for the \u201cftp-data dataset\u201d, the global shape of these curves are very similar to results reported\nin [18, Figure 7] indicating that the proposed approach is comparable to other dedicated methods for\nintrusion detection in these four datasets.\nIn a second experiment, we investigate the effect of changing the value of d on the performances.\nWe focus on the \u201cothers\u201d dataset because it is the most heterogeneous. We adopt a slightly different\nmeasure of performance and use precision recall (see for example [4]) to measure performances\nin identifying network intrusions (the higher the curve, the better). 
We call the area under such\ncurves the AUPR. The right panel of Figure 2 represents these results. First, the case d = 1, which\ncorresponds to the vanilla Mahalanobis distance as outlined in Section 3.3, performs poorly.\nSecond, overall performance increases rapidly with d, then decreases and stabilizes.\nThis suggests that d can be used as a tuning parameter to control the \u201ccomplexity\u201d of Q\u00b5,d. Indeed,\n2d is the degree of the polynomial Q\u00b5,d and it is expected that more complex models will identify\nmore diverse classes of examples as outliers. In our case, this means flagging regular traffic as\noutliers even though it does not correspond to intrusions. In general, a good heuristic for\ntuning d is to investigate performance on a well-specified task in a preliminary experiment.\n\n6 Future work\nAn important question is the asymptotic regime when d \u2192 \u221e. The current state of knowledge suggests\nthat, up to a correct scaling, the limit of the Christoffel functions (when known to exist) involves an\nedge effect term, related to the support of the measure, and the density of \u00b5 with respect to Lebesgue\nmeasure; see for example [2] for the Euclidean ball. It also suggests connections with the notion of\nequilibrium measure in potential theory [17, 1, 7]. Generalization and interpretation of these results\nin our context will be investigated in future work.\nEven though good approximations are obtained with low degree (at least in dimension 2 or 3), the\napproach involves the inversion of large ill-conditioned Hankel matrices, which considerably reduces\nits applicability for higher degrees and dimensions. A promising research line is to develop\napproximation procedures and advanced optimization and algebra tools so that the approach could scale\ncomputationally to higher dimensions and degrees.\nFinally, we did not touch the question of statistical accuracy.
In the context of empirical processes, this\nwill be very relevant to understand further potential applications in machine learning and reduce the\ngap between the abstract orthogonal polynomial theory and practical machine learning applications.\n\nAcknowledgments\n\nThis work was partly supported by project ERC-ADG TAMING 666981, ERC-Advanced Grant of\nthe European Research Council and grant number FA9550-15-1-0500 from the Air Force Office of\nScientific Research, Air Force Material Command.\n\n[Figure 2 plots. Left panel: axes \u201c% top outlyingness score\u201d and \u201c% correctly identified outliers\u201d, one curve per dataset (http, smtp, ftp_data, ftp, others). Right panel: axes \u201cRecall\u201d and \u201cPrecision\u201d, legend d (AUPR): 1 (0.08), 2 (0.18), 3 (0.18), 4 (0.16), 5 (0.15), 6 (0.13).]\n\nReferences\n[1] R. J. Berman (2009). Bergman kernels for weighted polynomials and weighted equilibrium measures of C^n. Indiana University Mathematics Journal, 58(4):1921\u20131946.\n\n[2] L. Bos, B. Della Vecchia and G. Mastroianni (1998). On the asymptotics of Christoffel functions for centrally symmetric weights functions on the ball in R^n. Rendiconti del Circolo Matematico di Palermo, 52:277\u2013290.\n\n[3] V. Chandola, A. Banerjee and V. Kumar (2009). Anomaly detection: A survey. ACM computing surveys (CSUR) 41(3):15.\n\n[4] J. Davis and M. Goadrich (2006). The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning (pp. 233-240). ACM.\n\n[5] C.F. Dunkl and Y. Xu (2001). Orthogonal polynomials of several variables. Cambridge University Press. MR1827871.\n\n[6] G.H. Golub, P. Milanfar and J. Varah (1999). A stable numerical method for inverting shape from moments. SIAM Journal on Scientific Computing 21(4):1222\u20131243.\n\n[7] B. Gustafsson, M. Putinar, E. Saff and N. Stylianopoulos (2009). Bergman polynomials on an archipelago: estimates, zeros and shape reconstruction.
Advances in Mathematics 222(4):1405\u20131460.\n\n[8] A.S. Hadi (1994). A modification of a method for the detection of outliers in multivariate samples. Journal of the Royal Statistical Society. Series B (Methodological), 56(2):393-396.\n\n[9] J.W. Helton, J.B. Lasserre and M. Putinar (2008). Measures with zeros in the inverse of their moment matrix. The Annals of Probability, 36(4):1453-1471.\n\n[10] E.M. Knorr, R.T. Ng and R.H. Zamar (2001). Robust space transformations for distance-based operations. Proceedings of the international conference on Knowledge discovery and data mining (pp. 126-135). ACM.\n\n[11] J.B. Lasserre (2015). Level Sets and NonGaussian Integrals of Positively Homogeneous Functions. International Game Theory Review, 17(01):1540001.\n\n[12] J.B. Lasserre and M. Putinar (2015). Algebraic-exponential Data Recovery from Moments. Discrete & Computational Geometry, 54(4):993-1012.\n\n[13] M. Lichman (2013). UCI Machine Learning Repository, http://archive.ics.uci.edu/ml University of California, Irvine, School of Information and Computer Sciences.\n\n[14] J.J. Oliver, R.A. Baxter and C.S. Wallace (1996). Unsupervised learning using MML. Proceedings of the International Conference on Machine Learning (pp. 364-372).\n\n[15] W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery (2007). Numerical Recipes: The Art of Scientific Computing (3rd Edition). Cambridge University Press.\n\n[16] G. Szeg\u00f6 (1974). Orthogonal polynomials. In Colloquium publications, AMS, (23), fourth edition.\n\n[17] V. Totik (2000). Asymptotics for Christoffel functions for general measures on the real line. Journal d\u2019Analyse Math\u00e9matique, 81(1):283-303.\n\n[18] G. Williams, R. Baxter, H. He, S. Hawkins and L. Gu (2002). A Comparative Study of RNN for Outlier Detection in Data Mining. IEEE International Conference on Data Mining (p. 709). IEEE Computer Society.\n\n[19] K. Yamanishi, J.I. Takeuchi, G.
Williams and P. Milne (2004). On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 8(3):275-300.", "award": [], "sourceid": 145, "authors": [{"given_name": "Edouard", "family_name": "Pauwels", "institution": "IRIT"}, {"given_name": "Jean", "family_name": "Lasserre", "institution": "LAAS-CNRS"}]}