{"title": "Entropy Estimations Using Correlated Symmetric Stable Random Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 3176, "page_last": 3184, "abstract": "Methods for efficiently estimating the Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Our experiments confirm that this method is able to substantially better approximate the Shannon entropy compared to the prior state-of-the-art.", "full_text": "Entropy Estimations Using Correlated Symmetric\n\nStable Random Projections\n\nDepartment of Statistical Science\n\nDepartment of Statistics and Biostatistics\n\nPing Li\n\nCornell University\nIthaca, NY 14853\npingli@cornell.edu\n\nCun-Hui Zhang\n\nRutgers University\n\nNew Brunswick, NJ 08901\nczhang@stat.rutgers.edu\n\nAbstract\n\nMethods for ef\ufb01ciently estimating Shannon entropy of data streams have impor-\ntant applications in learning, data mining, and network anomaly detections (e.g.,\nthe DDoS attacks). For nonnegative data streams, the method of Compressed\nCounting (CC) [11, 13] based on maximally-skewed stable random projections\ncan provide accurate estimates of the Shannon entropy using small storage. 
However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the moment estimator we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to approximate the Shannon entropy well using small storage.

1 Introduction
Computing the Shannon entropy of massive data has important applications in neural computation [17], graph estimation [5], query log analysis in Web search [14], network anomaly detection [21], etc. (See the NIPS 2003 workshop on entropy estimation, www.menem.com/~ilya/pages/NIPS03.) In modern applications, as massive datasets are often generated in a streaming fashion, entropy estimation in data streams has become a challenging and interesting problem.

1.1 Data Streams
Massive data generated in a streaming fashion are difficult to transmit and store [15], as the processing is often done on the fly in one pass of the data. The problem of "scaling up for high dimensional data and high speed data streams" is among the "ten challenging problems in data mining research" [20]. Mining data streams at petabyte scale has become an important research area [1], as network data can easily reach that scale [20].
In the standard turnstile model [15], a data stream is a vector A_t of length D, where D = 2^64 or even D = 2^128 is possible in network applications, e.g., (a pair of) IP addresses + port numbers.
At time t, an input element a_t = (i_t, I_t), i_t ∈ [1, D], updates A_t by a linear rule:

    A_t[i_t] = A_{t-1}[i_t] + I_t,    (1)

where I_t is the increment/decrement of the packet size at time t. For network traffic, normally A_t[i] ≥ 0, which is called the strict turnstile model and suffices for describing certain natural phenomena. On the other hand, the general turnstile model (which allows A_t[i] < 0) is often used for comparing two streams, e.g., in network OD (origin-destination) flow analysis [21].
An important task is to compute the α-th frequency moment F_(α) and the Shannon entropy H:

    F_(α) = \sum_{i=1}^D |A_t[i]|^α,    H = -\sum_{i=1}^D \frac{|A_t[i]|}{F_(1)} \log \frac{|A_t[i]|}{F_(1)},    (2)

The exact computation of these summary statistics is not feasible because to do so one has to store the entire vector A_t of length D, as the entries are time-varying. Also, many applications (such as anomaly detection in network traffic) require computing the summary statistics in real time.

1.2 Network Measurement, Monitoring, and Anomaly Detection
Network traffic is a typical example of high-rate data streams. Industries are now prepared to move to 100 Gbit/second or Terabit/second Ethernet. Effective and reliable real-time measurement of network traffic is crucial for anomaly detection and network diagnosis; one such measurement metric is the Shannon entropy [4, 8, 19, 2, 9, 21]. Exact entropy measurement in real time on high-speed links is, however, computationally prohibitive.
The Distributed Denial of Service (DDoS) attack is a representative example of network anomalies. A DDoS attack attempts to make computers unavailable to intended users, either by forcing users to reset the computers or by exhausting the resources of service-hosting sites.
For example, hackers may maliciously saturate the victim machines by sending many external communication requests. DDoS attacks typically target sites such as banks, credit card payment gateways, or military sites. A DDoS attack normally changes the statistical distribution of network traffic, which can be reliably captured by abnormal variations in the measurements of Shannon entropy [4]. See Figure 1 for an illustration.

Figure 1: This plot is reproduced from a DARPA conference [4]. One can view the x-axis as a surrogate for time. The y-axis is the measured Shannon entropy, which exhibited a sudden sharp change at the time when an attack occurred.

Apparently, the entropy measurements do not have to be "perfect" for detecting attacks. It is, however, crucial that the algorithms be computationally efficient (i.e., real-time and one-pass) at low memory cost, because the traffic data generated by large high-speed networks are enormous and transient.

1.3 Symmetric Stable Random Projections and Entropy Estimation Using Moments
It turns out that, for 0 < α ≤ 2, one can use stable random projections to compute F_(α) efficiently, because the turnstile model (1) is a linear model and the random projection operation is also linear (i.e., vector-matrix multiplication) [7]. Conceptually, we multiply the data stream vector A_t ∈ R^D by a random matrix R ∈ R^{D×k}, resulting in a vector X = A_t × R ∈ R^k with entries

    x_j = [A_t × R]_j = \sum_{i=1}^D r_{ij} A_t[i],    j = 1, 2, ..., k,

where r_{ij} ~ S(α, 1) is a symmetric α-stable random variable with unit scale [3, 22]: E(e^{\sqrt{-1}\, t\, r_{ij}}) = e^{-|t|^α}. The standard normal (or Cauchy) distribution is a special case with α = 2 (or α = 1).
In data stream computations, the matrix R is not materialized. The standard procedure is to (re)generate entries of R on demand [7] using pseudo-random numbers [16].
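A minimal sketch of this on-demand regeneration and of the incremental sketch update, shown for the Cauchy case α = 1 (where a unit-scale stable sample is simply tan(u) for u uniform on (-π/2, π/2)). The per-(i, j) string seeding is our own illustrative device, not the pseudo-random scheme of [16]:

```python
import math
import random

def r_entry(i, j, seed=0):
    """Regenerate the projection entry r_ij ~ S(1, 1) (standard Cauchy) on demand.
    R is never materialized: the same (i, j, seed) always yields the same value."""
    u = random.Random(f"{seed}:{i}:{j}").uniform(-math.pi / 2, math.pi / 2)
    return math.tan(u)

def process_stream(stream, k, seed=0):
    """Maintain the sketch X = A_t x R incrementally: each arriving element
    (i_t, I_t) touches only the k sketch coordinates, never the vector A_t."""
    X = [0.0] * k
    for i_t, I_t in stream:
        for j in range(k):
            X[j] += I_t * r_entry(i_t, j, seed)
    return X
```

Because the update is linear, the final X equals the projection of the final vector A_t, regardless of how the increments arrive or in what order.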
Thus, we only need to store X ∈ R^k. When a stream element a_t = (i_t, I_t) arrives, one updates the entries of X:

    x_j ← x_j + I_t r_{i_t j},    j = 1, 2, ..., k.    (3)

By the property of stable distributions, the samples x_j, j = 1 to k, are also i.i.d. stable:

    x_j = \sum_{i=1}^D r_{ij} A_t[i] ~ S\left(α,\ F_(α) = \sum_{i=1}^D |A_t[i]|^α\right),    j = 1, 2, ..., k.    (4)

Therefore, the task boils down to estimating the scale parameter from k i.i.d. stable samples. Because the Shannon entropy is essentially the derivative of the frequency moment at α = 1, the popular approach is to approximate the Shannon entropy by the Tsallis entropy [18]:

    T_α = \frac{1}{α - 1}\left(1 - \frac{F_(α)}{F_(1)^α}\right),    (5)

which approaches the Shannon entropy H as α → 1. [21] used a slight variant of (5), but the difference is not essential.¹ In their approach, F_(α) and F_(1) are first estimated separately from two independent sets of samples. The estimated moments are then plugged into (5) to estimate the Shannon entropy H. Immediately, we can see the problem here: the variance of the estimated T_(α) might be proportional to 1/(α-1)² = 1/∆². (Recall var(cX) = c² var(X).)
One question is how to choose α (i.e., ∆). [6] proposed a conservative criterion by choosing α according to the worst-case bias |H - T_α|. One can verify that ∆ = |1 - α| < 10^{-7} is likely in [6].

¹[21] used (F_(1+∆) - F_(1-∆))/(2∆) and estimated the two frequency moments independently. The subtle difference between the finite difference approximations is not essential. It is the correlation that plays the crucial role.
In other words, the required sample size could be O(10^{14}). In practice, [21] exploited the bias-variance tradeoff, but they still had to use an excessive number of samples, e.g., 10^6. In comparison, using our proposed approach, it appears that 100 ~ 1000 samples might be sufficient.

1.4 Our Proposal
We have made two key contributions. Firstly, instead of estimating F_(α) and F_(1) separately using two independent sets of samples, we make them highly positively correlated. Intuitively, if the two consistent estimators, denoted by \hat F_(α) and \hat F_(1) respectively, are highly positively correlated, then possibly their ratio \hat F_(α)/\hat F_(1)^α can be close to 1 with small variance. Ideally, if Var(\hat F_(α)/\hat F_(1)^α) = O(∆²), the variance of the estimated Tsallis entropy \hat T_α = \frac{1}{α-1}\left(1 - \frac{\hat F_(α)}{\hat F_(1)^α}\right) will be essentially independent of ∆.
It turns out that finding an estimator with Var(\hat F_(α)/\hat F_(1)^α) = O(∆²) was not straightforward. It is known that around α = 1, the geometric mean estimator [10] is nearly statistically optimal. Interestingly, our analysis and simulations show that using the geometric mean estimator, we can essentially only achieve Var(\hat F_(α)/\hat F_(1)^α) = O(∆), which, albeit a large improvement, is not sufficiently small to cancel the O(1/∆²) term. Therefore, our second key component is a new estimator of T_α using a moment estimator which does not have (or barely has) finite variance.
Even though such an estimator is not good for estimating the single moment compared to the geometric mean, due to the high correlation, the ratio \hat F_(α)/\hat F_(1)^α is still very well-behaved and its variance is essentially O(∆²), as shown in our theoretical analysis and experiments.

1.5 Compressed Counting (CC) for Nonnegative Data Streams
The recent work [13] on Compressed Counting (CC) [11] provides an ideal solution to the problem of entropy estimation in nonnegative data streams. Basically, for nonnegative data streams, i.e., A_t[i] ≥ 0 at all times and all locations, we can compute the first moment easily, because

    F_(1) = \sum_{i=1}^D |A_t[i]| = \sum_{i=1}^D A_t[i] = \sum_{s=0}^t I_s,    (6)

where I_s is the increment/decrement at time s. In other words, we just need a single counter to accumulate all the increments I_s. This observation led to the conjecture that estimating F_(α) should also be easy if α ≈ 1, which consequently led to the development of Compressed Counting, which uses maximally-skewed stable random projections instead of symmetric stable projections. The most recent work on CC [13] provided a new moment estimator that achieves variance ∝ O(∆²).
Unfortunately, for general data streams where entries can be negative, we have to resort to symmetric stable random projections. Fundamentally, the reason that skewed projections work well on nonnegative data streams is that the data themselves are skewed. However, when we compare two streams, the data become more or less symmetric and hence we must use symmetric projections.

1.6 Why Compare the Difference of Two Streams?
In machine learning research and practice, people routinely use the difference between feature vectors.
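As a one-line illustration of the observation in (6): for a strict turnstile (nonnegative) stream, the first moment never requires the vector A_t at all.

```python
def first_moment_counter(stream):
    """Strict turnstile model: F_(1) equals the running sum of all increments
    I_s (Eq. 6), so a single counter replaces the whole length-D vector A_t."""
    counter = 0
    for _, increment in stream:
        counter += increment
    return counter
```

This is exactly the property that breaks for general streams: with negative entries, sum(A_t) no longer equals sum(|A_t|), which is why symmetric projections become necessary.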
[21] used the difference between data streams with a slightly different motivation. The goal of [21] is to measure the entropies of all OD (origin-destination) pairs in a network, because entropy measurements are crucial for detecting anomalous events such as DDoS attacks and network failures. They argued that the change of entropy of the traffic distribution may be invisible (i.e., too small to be detected) in the traditional volume matrix even during the time when an attack occurs. Instead, they proposed to measure the entropy from a number of locations across the network, i.e., by examining the entropy of every OD flow in the network. By a similar argument, a DDoS attack may be invisible in terms of the traffic volume change if the attack is launched outside the network.
While [21] successfully demonstrated that measuring the Shannon entropy of OD flows is effective for detecting anomalous events, at that time they did not have the tools for efficiently estimating the entropy. Using symmetric stable random projections and independent samples, they needed a large number of samples (e.g., 10^6) because their variance blows up at the rate of O(1/∆²). For anomaly detection, reducing the sample size (k) is crucial because k determines the storage and estimation speed; and it is often required to detect the events in real time. In addition, the pseudo-random numbers have to be (re)generated on the fly, at a cost proportional to k.

2 Our Proposed Algorithm
Recall that a data stream is a long vector A_t[i], i = 1 to D. At time t, an incoming element a_t = (i_t, I_t) updates one entry: A_t[i_t] ← A_{t-1}[i_t] + I_t. Conceptually, we generate a random matrix R ∈ R^{D×k} whose entries are sampled from a stable distribution and multiply it with A_t: X = A_t × R.
The matrix multiplication is linear and can be conducted incrementally as the new stream elements arrive. R is not materialized; its entries are regenerated on demand using pseudo-random numbers, as is the standard practice in data stream computations [7]. Our method does not require A_t[i] ≥ 0 and hence it can handle the difference between two streams (e.g., the OD flows).

2.1 The Symmetric Stable Law
Our work utilizes the symmetric stable distribution. We adopt the standard approach [3] to sample from the stable law S(α, 1) with index α and unit scale. We generate two independent random variables, w ~ exp(1) and u ~ uniform(-π/2, π/2), and feed them to a nonlinear transformation:

    Z_(α) = g(w, u, α) = \frac{\sin(α u)}{(\cos u)^{1/α}} \left[\frac{\cos(u - α u)}{w}\right]^{(1-α)/α} ~ S(α, 1),    (7)

to obtain a sample from S(α, 1). An important property is that, for -1 < γ < α, the moment exists: E|Z|^γ = (2/π) Γ(1 - γ/α) Γ(γ) \sin(γπ/2). For convenience, we define

    G(α, γ) = E|g(w, u, α)|^γ = \frac{2}{π} Γ(1 - γ/α) Γ(γ) \sin(γπ/2).    (8)

2.2 Our Recommended Estimator
Conceptually, we have two matrices of i.i.d. random numbers:

    w_{ij} ~ exp(1),    u_{ij} ~ uniform(-π/2, π/2),    i = 1, 2, ..., D,    j = 1, 2, ..., k.    (9)

As new stream elements arrive, we incrementally maintain two sets of samples, i.e., for j = 1 to k,

    x_j = \sum_{i=1}^D A_t[i] g(w_{ij}, u_{ij}, 1),    y_j = \sum_{i=1}^D A_t[i] g(w_{ij}, u_{ij}, α).    (10)

Note that x_j and y_j are highly correlated because they are generated using the same random numbers (with different α).
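The transformation (7) and the correlated construction (10) can be sketched directly; this is a static (non-streaming) version for clarity, with our own function and seed names:

```python
import math
import random

def g(w, u, alpha):
    """Chambers-Mallows-Stuck transformation (Eq. 7): maps w ~ exp(1) and
    u ~ uniform(-pi/2, pi/2) to a sample from the symmetric stable law S(alpha, 1)."""
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)) * \
           (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha)

def correlated_pair(A, alpha, k, seed=0):
    """Build the correlated samples (x_j, y_j) of Eq. 10: the SAME (w_ij, u_ij)
    drive both the alpha = 1 projection x_j and the alpha = 1 + Delta projection y_j."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(k):
        x = y = 0.0
        for a in A:
            w = rng.expovariate(1.0)
            u = rng.uniform(-math.pi / 2, math.pi / 2)
            x += a * g(w, u, 1.0)
            y += a * g(w, u, alpha)
        xs.append(x)
        ys.append(y)
    return xs, ys
```

At α = 1 the transformation collapses to g(w, u, 1) = tan(u), so each x_j is an ordinary Cauchy projection.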
However, x_i and y_j are independent if i ≠ j. Our recommended estimator of the Tsallis entropy T_α is

    \hat T_{α,0.5} = \frac{1}{α - 1}\left(1 - \left[\frac{\sqrt{π}}{Γ\left(1 - \frac{1}{2α}\right)} \cdot \frac{\sum_{j=1}^k \sqrt{|y_j|}}{\sum_{j=1}^k \sqrt{|x_j|}}\right]^{2α}\right),    (11)

where α = 1 + ∆ > 1, and the meaning of 0.5 will soon be clear. When ∆ is sufficiently small, the estimated Tsallis entropy will be sufficiently close to the Shannon entropy. A nice property is that its variance is free of 1/∆ or 1/∆² terms. While it is intuitively clear that it is beneficial to make x_j and y_j highly correlated for the sake of reducing the variance, it might not be as intuitive why \hat T_{α,0.5} (11) is a good estimator for the entropy. We will explain why the obvious geometric mean estimator [10] is not sufficient for entropy estimation.

3 The Geometric Mean Estimator
For estimating F_(α), the geometric mean estimator [10] is close to being statistically optimal (efficiency ≈ 80%) at α ≈ 1. Thus, it was our first attempt to test the following estimator of the Tsallis entropy:

    \hat T_{α,gm} = \frac{1}{α - 1}\left(1 - \frac{\hat F_{(α),gm}}{\hat F_{(1),gm}^α}\right),  where  \hat F_{(α),gm} = \frac{\prod_{j=1}^k |y_j|^{α/k}}{G^k(α, α/k)},  \hat F_{(1),gm} = \frac{\prod_{j=1}^k |x_j|^{1/k}}{G^k(1, 1/k)},    (12)

where G() is defined in (8).
After simplification, we obtain:

    \hat T_{α,gm} = \frac{1}{α - 1}\left(1 - \prod_{j=1}^k \left[\left|\frac{y_j}{x_j}\right|^{α/k} \frac{G^α(1, 1/k)}{G(α, α/k)}\right]\right).    (13)

3.1 Theoretical Analysis
The theoretical analysis of \hat T_{α,gm}, however, turns out to be difficult, as it requires computing

    E\left[\left|\frac{y_j}{x_j}\right|^{sα/k}\right] = E\left[\left|\frac{\sum_{i=1}^D A_t[i] g(w_{ij}, u_{ij}, α)}{\sum_{i=1}^D A_t[i] g(w_{ij}, u_{ij}, 1)}\right|^{sα/k}\right],    s = 1, 2,

where g() is defined in (7). We first provide the following Lemma:

Lemma 1 Let w ~ exp(1) and u ~ uniform(-π/2, π/2) be two independent variables. Let α = 1 + ∆ > 1, for small ∆ > 0.
Then, for γ > -1,

    E\left|\frac{g(w, u, α)}{g(w, u, 1)}\right|^γ = 1 - 0.5772γ∆ + 0.5772γ∆² - 1.6386γ∆³ + 1.6822γ²∆² + O(γ∆⁴) + O(γ²∆³).    (14)

Note that we need to keep the higher-order terms in order to prove Lemma 2, which shows the properties of the geometric mean estimator when D = 1 (i.e., a stream with only one element).

Lemma 2 If D = 1, then

    E\left(\hat T_{α,gm}\right) = \frac{∆}{k}\left(\frac{π²}{2} - 2.0935\right) + O\left(\frac{∆}{k²}\right) + O(∆³),    (15)

    Var\left(\hat T_{α,gm}\right) = \frac{3.3645∆}{k} + 1.0614∆² + O\left(\frac{∆²}{k}\right) + O\left(\frac{1}{k²}\right).    (16)

When D = 1, we know T_α = H = 0. In this case, the geometric mean estimator \hat T_{α,gm} is asymptotically unbiased with variance essentially free of 1/∆, which is very encouraging.
Will this result in Lemma 2 extend to general D? The answer is no, even for D = 2, i.e.,

    \frac{y_j}{x_j} = \frac{A_t[1] g(w_{1j}, u_{1j}, α) + A_t[2] g(w_{2j}, u_{2j}, α)}{A_t[1] g(w_{1j}, u_{1j}, 1) + A_t[2] g(w_{2j}, u_{2j}, 1)}.

Because g() is symmetric, it is possible that the denominator A_t[1] g(w_{1j}, u_{1j}, 1) + A_t[2] g(w_{2j}, u_{2j}, 1) is very small while the numerator A_t[1] g(w_{1j}, u_{1j}, α) + A_t[2] g(w_{2j}, u_{2j}, α) is not too small. In other words, there will be more variations when D > 1. In fact, our experiments in Sec. 3.2 and the theoretical analysis of a more general estimator in Sec.
4 both reveal that the variance of \hat T_{α,gm} is essentially O(1/∆), which is of course still a substantial improvement over the previous O(1/∆²) solution.

3.2 Experiments on the Geometric Mean Estimator (Correlated vs. Independent Samples)
We present some experimental results for evaluating \hat T_{α,gm}, to demonstrate that (i) using correlation does substantially reduce the variance and hence the required sample size, and (ii) the variance (or MSE, the mean square error) of \hat T_{α,gm} is roughly O(1/∆).
We follow [13] in using static data to evaluate the accuracy of the estimators. The projected vector X = A_t × R is the same at the end of the stream, regardless of whether it is computed at once (i.e., static) or incrementally (i.e., dynamic). Following [13], we selected 4 word vectors from a chunk of Web crawl data. For example, the entries of the vector "REVIEW" are the numbers of occurrences of the word "REVIEW" in each document. We group these 4 vectors into 2 pairs, "THIS-HAVE" and "DO-REVIEW", and we estimate the Shannon entropies of the two resultant difference vectors.
Figure 2 presents the mean square errors (MSE) of the estimated Shannon entropy, i.e., E(\hat T_α - H)², normalized by the truth (H²). The left panels contain the results using independent sampling (i.e., the prior work [21]) and the geometric mean estimator. The middle panels contain the results using correlated sampling (i.e., this paper) and the geometric mean estimator (12). The right panels multiply the results of the middle panels by ∆ to illustrate that the variance of the geometric mean estimator for entropy, \hat T_{α,gm}, is essentially O(1/∆).
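The correlated-versus-independent comparison of this section can be reproduced in miniature. A self-contained Monte-Carlo sketch (the test vector, seed, sample size, and trial count are our own arbitrary choices, and the γ = 0.5 ratio estimator of (11) is used for both sampling schemes):

```python
import math
import random

def g(w, u, a):
    """Chambers-Mallows-Stuck sample from S(a, 1) (Eq. 7)."""
    return (math.sin(a * u) / math.cos(u) ** (1 / a)) * \
           (math.cos(u - a * u) / w) ** ((1 - a) / a)

def entropy_estimate(A, alpha, k, rng, correlated):
    """gamma = 0.5 ratio estimator (Eq. 11), from correlated or independent samples."""
    sx = sy = 0.0
    for _ in range(k):
        x = y = 0.0
        for a in A:
            w, u = rng.expovariate(1.0), rng.uniform(-math.pi / 2, math.pi / 2)
            x += a * g(w, u, 1.0)
            if correlated:
                y += a * g(w, u, alpha)        # reuse (w, u): correlated channel
            else:
                w2, u2 = rng.expovariate(1.0), rng.uniform(-math.pi / 2, math.pi / 2)
                y += a * g(w2, u2, alpha)      # fresh randomness: independent channel
        sx += math.sqrt(abs(x))
        sy += math.sqrt(abs(y))
    c = math.sqrt(math.pi) / math.gamma(1 - 1 / (2 * alpha))
    return (1 - (c * sy / sx) ** (2 * alpha)) / (alpha - 1)

A = [3, -1, 4, -1, 5, -9, 2, -6]       # a small "difference of two streams" vector
H = -sum(abs(a) / 31 * math.log(abs(a) / 31) for a in A)  # true Shannon entropy
rng = random.Random(0)
mse = {True: 0.0, False: 0.0}
for corr in (True, False):
    for _ in range(20):
        mse[corr] += (entropy_estimate(A, 1.05, 200, rng, corr) - H) ** 2 / 20
```

With these settings the correlated scheme is typically orders of magnitude more accurate, mirroring the gap between the left and middle panels of Figure 2.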
See more experiments in Figure 3.

Figure 2: Two pairs of word vectors were selected. We conducted symmetric random projections using both independent sampling (left panels, as in [21]) and correlated sampling (middle panels, as in our proposal). The Tsallis entropy (of the difference vector) is estimated using the geometric mean estimator (12) with three sample sizes k = 10, 100, and 1000. The normalized mean square errors (MSE: E|\hat T_{α,gm} - H|²/H²) verify that correlated sampling reduces the errors substantially.

4 The General Estimator
Since the geometric mean estimator could not satisfactorily solve the entropy estimation problem, we resort to estimators which behave dramatically differently from the geometric mean. Our recommended estimator \hat T_{α,0.5} in (11) is a special case (for γ = 0.5) of a more general family of estimators [12], parameterized by γ ∈ (0, 1):

    \hat T_{α,γ} = \frac{1}{α - 1}\left(1 - \frac{\hat F_{(α),γ}}{\hat F_{(1),γ}^α}\right),  \hat F_{(α),γ} = \left(\frac{\sum_{j=1}^k |y_j|^γ}{k\, G(α, γ)}\right)^{α/γ},  \hat F_{(1),γ} = \left(\frac{\sum_{j=1}^k |x_j|^γ}{k\, G(1, γ)}\right)^{1/γ},

which, after simplification, becomes

    \hat T_{α,γ} = \frac{1}{α - 1}\left(1 - \left[\frac{\sum_{j=1}^k |y_j|^γ}{\sum_{j=1}^k |x_j|^γ} \cdot \frac{G(1, γ)}{G(α, γ)}\right]^{α/γ}\right).    (17)

Recall G(α, γ) is defined in (8), and G(1, 0.5)/G(α, 0.5) = \sqrt{π}/Γ(1 - \frac{1}{2α}).
To better understand \hat F_{(α),γ}, recall that if Z ~ S(α, 1), then E|Z|^γ = G(α, γ) < ∞ if -1 < γ < α. Therefore, \frac{\sum_{j=1}^k |y_j|^γ}{k\, G(α, γ)} is an unbiased estimate of F_{(α)}^{γ/α}. To recover F_{(α)}, we need to apply the power α/γ operation.
Thus, it is clear that, as long as 0 < γ < 1, \hat F_{(α),γ} is a consistent estimator of F_{(α)} and E(\hat F_{(α),γ}) is finite. In particular, the variance of \hat F_{(α),γ} is bounded if 0 < γ < 0.5:

    E\left(\hat F_{(α),γ}\right) = F_{(α)} + O\left(\frac{1}{k}\right),    Var\left(\hat F_{(α),γ}\right) = \frac{F_{(α)}²}{k} \frac{α²}{γ²} \frac{G(α, 2γ) - G²(α, γ)}{G²(α, γ)} + O\left(\frac{1}{k²}\right).

The variance is unbounded if γ = 0.5 and α = 1, because G(1, 1) = ∞ (Γ(0) = ∞). Interestingly, when γ → 0 and α = 1, the asymptotic variance reaches its minimum.
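The moment constant (8) and the general family (17) are short enough to state in code; this is a direct transcription (function names are ours), and γ = 0.5 recovers the recommended estimator (11):

```python
import math

def G(alpha, gamma):
    """G(alpha, gamma) = E|Z|^gamma for Z ~ S(alpha, 1), valid for -1 < gamma < alpha (Eq. 8)."""
    return (2.0 / math.pi) * math.gamma(1.0 - gamma / alpha) * \
           math.gamma(gamma) * math.sin(gamma * math.pi / 2.0)

def tsallis_estimator_general(xs, ys, alpha, gamma=0.5):
    """General family of Tsallis-entropy estimators (Eq. 17), parameterized by gamma,
    from the correlated Cauchy samples x_j and alpha-stable samples y_j."""
    ratio = (sum(abs(y) ** gamma for y in ys) / sum(abs(x) ** gamma for x in xs)) * \
            (G(1.0, gamma) / G(alpha, gamma))
    return (1.0 - ratio ** (alpha / gamma)) / (alpha - 1.0)
```

Note that the k in (17) cancels inside the ratio, and at γ = 0.5 the constant simplifies: G(1, 0.5) = √2, so G(1, 0.5)/G(α, 0.5) = √π/Γ(1 - 1/(2α)) as stated above.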
In fact, when γ → 0, \hat F_{(α),γ} converges to the geometric mean estimator \hat F_{(α),gm}. A variant of \hat F_{(α),γ} was discussed in [12].

4.1 Theoretical Analysis
Based on Lemma 3 and Lemma 4 (whose proof is fairly technical), we know that the variance of the general estimator is essentially Var(\hat T_{α,γ}) = O(∆^{2γ-1}/k), for fixed γ ∈ (0, 1/2). In other words, when γ is close to 0, the variance of the entropy estimator is essentially on the order of O(1/(k∆)), while when γ is close to 1/2, the variance is essentially O(1/k), as desired.

Lemma 3 For any fixed γ ∈ (0, 1),

    Var\left(\hat T_{α,γ}\right) = \frac{O(1/∆²)}{k} E\left(|x_1|^γ - |y_1|^γ\right)² + O\left(\frac{1}{k²}\right).

Lemma 4 Let 0 < ∆ ≤ 1/2 and α = 1 + ∆. Let γ ∈ (0, 1/2) and let m be a positive integer no smaller than 1/γ. Then there exists a universal constant M such that

    E\left(|x_1|^γ - |y_1|^γ\right)² ≤ M F_{(1)}^{2γ} ∆^{1+2γ-1/m} m² \left\{m + \tilde H_{2m}^2 + (1 - 2γ)^{-2}\right\} / (1 - 2γ),

where \tilde H_{2m} = \left(\sum_{i=1}^D \frac{|A_t[i]|}{F_{(1)}} \left(\log \frac{|A_t[i]|}{F_{(1)}}\right)^{2m}\right)^{1/(2m)}.

We should clarify that our theoretical analysis applies only to fixed γ ∈ (0, 1/2). When γ = 0.5, the estimator \hat T_{α,0.5} is still well-behaved, except that we are unable to precisely analyze this case. Also, since we do not compute the exact constant, it is possible that for some carefully chosen (data-dependent) α, \hat T_{α,γ} with γ < 0.5 may exhibit smaller variance than \hat T_{α,0.5}.
We recommend \hat T_{α,0.5} for convenience because it essentially frees practitioners from carefully choosing α.

4.2 Experimental Results
Figure 3 presents some empirical results testing the general estimator \hat T_{α,γ} (17), using more word vector pairs (including the same 2 pairs as in Figure 2). We can see that when γ = 0.5, the (normalized) MSEs become flat (as desired) as ∆ = α - 1 → 0. When γ > 1/2, the MSEs increase although the curves remain flat. When γ < 1/2, the MSEs blow up as ∆ → 0. Note that, when γ < 1/2, it is possible to achieve smaller MSEs if we carefully choose α.
How many samples (k) are needed? If the goal is to estimate the Shannon entropy within a few percent of the true value, then k = 100 ~ 1000 should be sufficient, because \sqrt{MSE}/H < 0.1 when k ≥ 100, as shown in Figure 3.

5 Conclusion
Entropy estimation is an important task in machine learning, data mining, network measurement, anomaly detection, neural computation, etc. In modern applications, the data are often generated in a streaming fashion and many operations on the streams can only be conducted in one pass of the data. It has been a challenging problem to estimate the Shannon entropy of data streams.
The prior work [21] achieved some success in entropy estimation using symmetric stable random projections. However, even after aggressively exploiting the bias-variance tradeoff, they still needed a large number of samples, e.g., 10^6, which is prohibitive in both time and space, especially considering that in streaming applications the pseudo-random numbers have to be regenerated on the fly, at a cost directly proportional to the sample size.
In our approach, we approximate the Shannon entropy using two highly correlated estimates of the frequency moments.
The positive correlation can substantially reduce the variance of the Shannon entropy estimate. However, finding the appropriate estimator of the frequency moment is another challenging task. We successfully found such an estimator and showed that the variance of the resulting Shannon entropy estimate is very small. Experimental results demonstrate that about 100 ~ 1000 samples should be sufficient for achieving high accuracy.

Acknowledgement
The research of Ping Li is partially supported by NSF-IIS-1249316, NSF-DMS-0808864, NSF-SES-1131848, and ONR-YIP-N000140910911. The research of Cun-Hui Zhang is partially supported by NSF-DMS-0906420, NSF-DMS-1106753, NSF-DMS-1209014, and NSA-H98230-11-1-0205.

Figure 3: The first two rows show the normalized MSEs for the same two vector pairs used in Figure 2, for estimating the Shannon entropy using the general estimator \hat T_{α,γ} with γ = 0.3, 0.4, 0.5, 0.6, 0.7. For the rest of the rows, the leftmost panels are the results of using independent samples (i.e., the prior work [21]) and the geometric mean estimator. The second column of panels shows the results of using correlated samples and the geometric mean estimator. The right three columns of panels are for the proposed general estimator \hat T_{α,γ} with γ = 0.3, 0.5, 0.7. We recommend γ = 0.5.
\u03b3 = 0.5\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000THIS\u2212HAVE : Corr. \u03b3 = 0.6\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000THIS\u2212HAVE : Corr. \u03b3 = 0.7\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000DO\u2212REVIEW : Corr. \u03b3 = 0.3\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000DO\u2212REVIEW : Corr. \u03b3 = 0.4\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000DO\u2212REVIEW : Corr. \u03b3 = 0.5\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000DO\u2212REVIEW : Corr. \u03b3 = 0.6\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000DO\u2212REVIEW : Corr. \u03b3 = 0.7\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10100k = 1000UNITED\u2212STATES : GM + Indep.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10k = 100k = 1000UNITED\u2212STATES : GM + Corr.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000UNITED\u2212STATES : Corr. \u03b3 = 0.3\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000UNITED\u2212STATES : Corr. 
\u03b3 = 0.5\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000UNITED\u2212STATES : Corr. \u03b3 = 0.7\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10100k = 1000A\u2212THE : GM + Indep.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10k = 100k = 1000A\u2212THE : GM + Corr.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000A\u2212THE : Corr. \u03b3 = 0.3\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000A\u2212THE : Corr. \u03b3 = 0.5\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000A\u2212THE : Corr. \u03b3 = 0.7\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10100k = 1000FOOD\u2212LOVE : GM + Indep.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10k = 100k = 1000FOOD\u2212LOVE : GM + Corr.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000FOOD\u2212LOVE : Corr. \u03b3 = 0.3\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000FOOD\u2212LOVE : Corr. \u03b3 = 0.5\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000FOOD\u2212LOVE : Corr. 
\u03b3 = 0.7\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10100k = 1000DATA\u2212PAPER : GM + Indep.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10k = 100k = 1000DATA\u2212PAPER : GM + Corr.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000DATA\u2212PAPER : Corr. \u03b3 = 0.3\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000DATA\u2212PAPER : Corr. \u03b3 = 0.5\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000DATA\u2212PAPER : Corr. \u03b3 = 0.7\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10100k = 1000NEWS\u2212WASHINGTON : GM + Indep.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10k = 100k = 1000NEWS\u2212WASHINGTON : GM + Corr.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000NEWS\u2212WASHINGTON : Corr. \u03b3 = 0.3\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000NEWS\u2212WASHINGTON : Corr. \u03b3 = 0.5\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000NEWS\u2212WASHINGTON : Corr. 
\u03b3 = 0.7\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10100k = 1000MACHINE\u2212LEARN : GM + Indep.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u22122100102104106k = 10k = 100k = 1000MACHINE\u2212LEARN : GM + Corr.\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000MACHINE\u2212LEARN : Corr. \u03b3 = 0.3\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000MACHINE\u2212LEARN : Corr. \u03b3 = 0.5\u2206 = \u03b1 \u2212 1Normalized MSE10\u2212510\u2212410\u2212310\u2212210\u2212110010\u2212410\u2212310\u2212210\u22121100k = 10k = 100k = 1000MACHINE\u2212LEARN : Corr. \u03b3 = 0.7\u2206 = \u03b1 \u2212 1Normalized MSE\fReferences\n[1] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues\n\nin data stream systems. In PODS, pages 1\u201316, Madison, WI, 2002.\n\n[2] Daniela Brauckhoff, Bernhard Tellenbach, Arno Wagner, Martin May, and Anukool Lakhina. Impact of\n\npacket sampling on anomaly detection metrics. In IMC, pages 159\u2013164, Rio de Janeriro, Brazil, 2006.\n\n[3] John M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables.\n\nJournal of the American Statistical Association, 71(354):340\u2013344, 1976.\n\n[4] Laura Feinstein, Dan Schnackenberg, Ravindra Balupari, and Darrell Kindred. Statistical approaches to\nDDoS attack detection and response. In DARPA Information Survivability Conference and Exposition,\npages 303\u2013314, 2003.\n\n[5] Anupam Gupta, John D. Lafferty, Han Liu, Larry A. Wasserman, and Min Xu. Forest density estimation.\n\nIn COLT, pages 394\u2013406, Haifa, Israel, 2010.\n\n[6] Nicholas J. A. Harvey, Jelani Nelson, and Krzysztof Onak. 
Streaming algorithms for estimating entropy.\n\nIn ITW, 2008.\n\n[7] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation.\n\nJournal of ACM, 53(3):307\u2013323, 2006.\n\n[8] Anukool Lakhina, Mark Crovella, and Christophe Diot. Mining anomalies using traf\ufb01c feature distribu-\n\ntions. In SIGCOMM, pages 217\u2013228, Philadelphia, PA, 2005.\n\n[9] Ashwin Lall, Vyas Sekar, Mitsunori Ogihara, Jun Xu, and Hui Zhang. Data streaming algorithms for\n\nestimating entropy of network traf\ufb01c. In SIGMETRICS, pages 145\u2013156, Saint Malo, France, 2006.\n\n[10] Ping Li. Estimators and tail bounds for dimension reduction in l\u03b1 (0 < \u03b1 \u2264 2) using stable random\n\nprojections. In SODA, pages 10 \u2013 19, San Francisco, CA, 2008.\n[11] Ping Li. Compressed counting. In SODA, New York, NY, 2009.\n[12] Ping Li and Trevor J. Hastie. A uni\ufb01ed near-optimal estimator for dimension reduction in l\u03b1 (0 < \u03b1 \u2264 2)\n\nusing stable random projections. In NIPS, Vancouver, BC, Canada, 2007.\n\n[13] Ping Li and Cun-Hui Zhang. A new algorithm for compressed counting with applications in shannon\n\nentropy estimation in dynamic data. In COLT, 2011.\n\n[14] Qiaozhu Mei and Kenneth Church. Entropy of search logs: How hard is search? with personalization?\n\nwith backoff? In WSDM, pages 45 \u2013 54, Palo Alto, CA, 2008.\n\n[15] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical\n\nComputer Science, 1:117\u2013236, 2 2005.\n\n[16] Noam Nisan. Pseudorandom generators for space-bounded computations. In Proceedings of the twenty-\n\nsecond annual ACM symposium on Theory of computing, STOC, pages 204\u2013212, 1990.\n\n[17] Liam Paninski. Estimation of entropy and mutual information. Neural Comput., 15(6):1191\u20131253, 2003.\n[18] Constantino Tsallis. Possible generalization of boltzmann-gibbs statistics. 
Journal of Statistical Physics,\n\n52:479\u2013487, 1988.\n\n[19] Kuai Xu, Zhi-Li Zhang, and Supratik Bhattacharyya. Pro\ufb01ling internet backbone traf\ufb01c: behavior models\nand applications. In SIGCOMM \u201905: Proceedings of the 2005 conference on Applications, technologies,\narchitectures, and protocols for computer communications, pages 169\u2013180, Philadelphia, Pennsylvania,\nUSA, 2005.\n\n[20] Qiang Yang and Xindong Wu. 10 challeng problems in data mining research. International Journal of\n\nInformation Technology and Decision Making, 5(4):597\u2013604, 2006.\n\n[21] Haiquan Zhao, Ashwin Lall, Mitsunori Ogihara, Oliver Spatscheck, Jia Wang, and Jun Xu. A data stream-\n\ning algorithm for estimating entropies of od \ufb02ows. In IMC, San Diego, CA, 2007.\n\n[22] Vladimir M. Zolotarev. One-dimensional Stable Distributions. American Mathematical Society, Provi-\n\ndence, RI, 1986.\n\n9\n\n\f", "award": [], "sourceid": 1456, "authors": [{"given_name": "Ping", "family_name": "Li", "institution": null}, {"given_name": "Cun-hui", "family_name": "Zhang", "institution": null}]}