{"title": "Tight Bounds on Profile Redundancy and Distinguishability", "book": "Advances in Neural Information Processing Systems", "page_first": 3257, "page_last": 3265, "abstract": null, "full_text": "Tight Bounds on Pro\ufb01le Redundancy and Distinguishability\n\nJayadev Acharya\n\nECE, UCSD\n\njacharya@ucsd.edu\n\nHirakendu Das\n\nYahoo!\n\nhdas@yahoo-inc.com\n\nAlon Orlitsky\n\nECE & CSE, UCSD\nalon@ucsd.edu\n\nAbstract\n\nThe minimax KL-divergence of any distribution from all distributions in a collection P has several\npractical implications. In compression, it is called redundancy and represents the least additional\nnumber of bits over the entropy needed to encode the output of any distribution in P. In online es-\ntimation and learning, it is the lowest expected log-loss regret when guessing a sequence of random\nvalues generated by a distribution in P. In hypothesis testing, it upper bounds the largest number of\ndistinguishable distributions in P. Motivated by problems ranging from population estimation to text\nclassi\ufb01cation and speech recognition, several machine-learning and information-theory researchers\nhave recently considered label-invariant observations and properties induced by i.i.d. distributions.\nA suf\ufb01cient statistic for all these properties is the data\u2019s pro\ufb01le, the multiset of the number of times\neach data element appears. Improving on a sequence of previous works, we show that the redun-\ndancy of the collection of distributions induced over pro\ufb01les by length-n i.i.d. sequences is between\n0.3 \u00b7 n1/3 and n1/3 log2 n, in particular, establishing its exact growth power.\n\nIntroduction\n\n1\nInformation theory, machine learning, and statistics, are closely related disciplines. One of their main intersection\nareas is the con\ufb02uence of universal compression, online learning, and hypothesis testing. We consider two concepts in\nthis overlap. 
The first is the minimax KL divergence, a fundamental measure of, among other things, how difficult distributions are to compress, predict, and classify. The second is profiles, a relatively new approach to compression, classification, and property testing over large alphabets. Improving on several previous results, we determine the exact growth power of the KL-divergence minimax of profiles of i.i.d. distributions over any alphabet.

1.1 Minimax KL divergence

As is well known in information theory, the expected number of bits required to compress data X generated according to a known distribution P is the distribution's entropy, H(P) = E_P log 1/P(X), and is achieved by encoding X using roughly log 1/P(X) bits. However, in many applications P is unknown, except that it belongs to a known collection P of distributions, for example the collection of all i.i.d., or all Markov, distributions. This uncertainty typically raises the number of bits above the entropy and is studied in universal compression [9, 13]. Any encoding corresponds to some distribution Q over the encoded symbols. Hence the increase in the expected number of bits used to encode the output of P is E_P log 1/Q(X) − H(P) = D(P||Q), the KL divergence between P and Q. Typically one is interested in the highest increase for any distribution P ∈ P, and finds the encoding that minimizes it. The resulting quantity, called the (expected) redundancy of P, e.g., [8, Chap. 13], is therefore the KL minimax

R(P) := min_Q max_{P∈P} D(P||Q).

The same quantity arises in online learning, e.g., [5, Ch. 9], where the probabilities of random elements X_1, ..., X_n are sequentially estimated. One of the most popular measures of the performance of an estimator Q is the per-symbol log loss (1/n) ∑_{i=1}^n log 1/Q(X_i | X^{i−1}).
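As a concrete illustration (ours, not the paper's), the KL minimax can be approximated numerically for a toy collection. The sketch below grid-searches over candidate coding distributions Q for the hypothetical collection {Ber(0.1), Ber(0.9)}, chosen purely for illustration:

```python
import math

def kl(p, q):
    """KL divergence D(Ber(p) || Ber(q)) in bits."""
    terms = ((p, q), (1 - p, 1 - q))
    return sum(a * math.log2(a / b) for a, b in terms if a > 0)

def redundancy(collection, grid_size=10000):
    """Approximate R(P) = min_Q max_{P in P} D(P||Q) over Bernoulli Q."""
    best = float("inf")
    for i in range(1, grid_size):
        q = i / grid_size
        worst = max(kl(p, q) for p in collection)  # worst distribution in P for this Q
        best = min(best, worst)                    # best encoder found so far
    return best

# Two symmetric coins: by symmetry the optimal Q is Ber(1/2).
r = redundancy([0.1, 0.9])
```

The computed value matches D(Ber(0.1)||Ber(0.5)) ≈ 0.531 bits; for richer collections this same minimax is what R(P) denotes.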
As in compression, for an underlying distribution P ∈ P, the expected log loss is E_P log 1/Q(X), and the log-loss regret is E_P log 1/Q(X) − H(P) = D(P||Q). The maximal expected regret over all distributions in P, minimized over all estimators Q, is again the KL minimax, namely, redundancy.

In statistics, redundancy arises in multiple hypothesis testing. Consider the largest number of distributions that can be distinguished from their observations. For example, the largest number of topics distinguishable based on text of a given length. Let P be a collection of distributions over a support set X. As in [18], a sub-collection S ⊆ P of the distributions is ε-distinguishable if there is a mapping f : X → S such that if X is generated by a distribution S ∈ S, then P(f(X) ≠ S) ≤ ε. Let M(P, ε) be the largest number of ε-distinguishable distributions in P, and let h(ε) be the binary entropy function. In Section 4 we show that for all P,

(1 − ε) log M(P, ε) ≤ R(P) + h(ε),   (1)

and in many cases, like the one considered here, the inequality is close to equality.

Redundancy has many other connections to data compression [27, 28], the minimum-description-length principle [3, 16, 17], sequential prediction [21], and gambling [20]. Because of the fundamental nature of R(P), and since tight bounds on it often reveal the structure of P, the value of R(P) has been studied extensively in all three communities, e.g., the above references as well as [29, 37] and a related minimax in [6].

1.2 Redundancy of i.i.d. distributions

The most extensively studied collections are independent, identically distributed (i.i.d.). For example, for the collection I^n_k of length-n i.i.d.
distributions over alphabets of size k, a string of works [7, 10, 11, 28, 33, 35, 36] determined the redundancy up to a diminishing additive term,

R(I^n_k) = ((k − 1)/2) log n + C_k + o(1),   (2)

where the constant C_k was determined exactly in terms of k. For compression this shows that the extra number of bits per symbol required to encode an i.i.d. sequence when the underlying distribution is unknown diminishes to zero as (k − 1) log n/(2n). For online learning this shows that these distributions can be learned (or approximated), and that this approximation can be achieved at the above rate. In hypothesis testing this shows that there are roughly n^{(k−1)/2} distinguishable i.i.d. distributions of alphabet size k and length n.

Unfortunately, while R(I^n_k) increases only logarithmically in the sequence length n, it grows linearly in the alphabet size k. For sufficiently large k, this value even exceeds n itself, showing that general distributions over large alphabets cannot be compressed or learned at a uniform rate over all alphabet sizes, and as the alphabet size increases, progressively larger lengths are needed to achieve a given redundancy, learning rate, or test error.

1.3 Patterns

Partly motivated by redundancy's fast increase with the alphabet size, a new approach was recently proposed to address compression, estimation, classification, and property testing over large alphabets.

The pattern [25] of a sequence represents the relative order in which its symbols appear. For example, the pattern of abracadabra is 12314151231. A natural method to compress a sequence over a large alphabet is to compress its pattern as well as the dictionary that maps the order to the original symbols. For example, for abracadabra, 1 → a, 2 → b, 3 → r, 4 → c, 5 → d. It can be shown [15, 26] that for all i.i.d.
distributions, over any alphabet, even infinitely large, as the sequence length increases, essentially all the entropy lies in the pattern, and practically none is in the dictionary. Hence [25] focused on the redundancy of compressing patterns. They showed, e.g., Subsection 1.5, that although, as in (2), i.i.d. sequences over large alphabets have arbitrarily high per-symbol redundancy, and although as above patterns contain essentially all the information of long sequences, the per-symbol redundancy of patterns diminishes to zero at a uniform rate independent of the alphabet size.

In online learning, patterns correspond to estimating the probabilities of each observed symbol, and of all unseen ones combined. For example, after observing the sequence dad, with pattern 121, we estimate the probabilities of 1, 2, and 3. The probability we assign to 1 is that of d, the probability we assign to 2 is that of a, and the probability we assign to 3 is the probability of all remaining letters combined. The aforementioned results imply that while distributions over large alphabets cannot be learned with uniformly diminishing per-symbol log loss, if we estimate the probability of each seen element but combine the probabilities of all unseen ones, then the per-symbol log loss diminishes to zero uniformly, regardless of the alphabet size.

1.4 Profiles

Improving on existing pattern-redundancy bounds seems easier to accomplish via profiles. Since we consider i.i.d. distributions, the order of the elements in a pattern does not affect its probability. For example, for every distribution P, P(112) = P(121). It is easy to see that the probability of a pattern is determined by the fingerprint [4] or profile [25] of the pattern, the multiset of the numbers of appearances of the symbols in the pattern. For example, the profile of the pattern 121 is {1, 2}, and all patterns with this profile, 112, 121, and 122, have the same probability under any distribution P. Similarly, the profile of 1213 is {1, 1, 2}, and all patterns with this profile, 1123, 1213, 1231, 1223, 1232, and 1233, have the same probability under any distribution.

It is easy to see that since all patterns of a given profile have the same probability, the ratio between the actual and estimated probability of a profile is the same as this ratio for each of its patterns. Hence pattern redundancy is the same as profile redundancy [25]. Therefore from now on we consider only profile redundancy, and begin by defining it more formally.

The multiplicity µ(a) of a symbol a in a sequence is the number of times it appears. The profile ϕ(x) of a sequence x is the multiset of multiplicities of all symbols appearing in it [24, 25]. For example, the sequence ababcde has multiplicities µ(a) = µ(b) = 2, µ(c) = µ(d) = µ(e) = 1, and profile {1, 1, 1, 2, 2}. The prevalence ϕ_µ of a multiplicity µ is the number of elements with multiplicity µ.

Let Φ_n denote the collection of all profiles of length-n sequences. For example, for sequences of length one there is a single element appearing once, hence Φ_1 = {{1}}; for length two, either one element appears twice, or each of two elements appears once, hence Φ_2 = {{2}, {1, 1}}; similarly Φ_3 = {{3}, {2, 1}, {1, 1, 1}}, etc.

We consider the distributions induced on Φ_n by all discrete i.i.d. distributions over any alphabet. The probability that an i.i.d. distribution P generates an n-element sequence x is P(x) := ∏_{i=1}^n P(x_i). The probability of a profile ϕ ∈ Φ_n is the sum of the probabilities of all sequences with this profile, P(ϕ) := ∑_{x:ϕ(x)=ϕ} P(x). For example, if P is B(2/3) over h and t, then for n = 3, P({3}) = P(hhh) + P(ttt) = 1/3, P({2, 1}) = P(hht) + P(hth) + P(thh) + P(tth) + P(tht) + P(htt) = 2/3, and P({1, 1, 1}) = 0, as this P is binary, hence at most two symbols can appear. On the other hand, if P is a roll of a fair die, then P({3}) = 1/36, P({2, 1}) = 5/12, and P({1, 1, 1}) = 5/9. We let I^n_Φ := {P(ϕ) : P is a discrete i.i.d. distribution} be the collection of all distributions on Φ_n induced by any discrete i.i.d. distribution over any alphabet, possibly even infinite.

It is easy to see that any relabeling of the elements of an i.i.d. distribution leaves the profile distribution unchanged, for example, if instead of h and t above, we have a distribution over 0's and 1's. Furthermore, profiles are sufficient statistics for every label-invariant property. While many theoretical properties of profiles are known, even calculating the profile probability for a given distribution and profile seems hard [23, 38] in general.

Profile redundancy arises in at least two other machine-learning applications, closeness testing and classification. In closeness testing [4], we try to determine whether two sequences are generated by the same or different distributions. In classification, we try to assign a test sequence to one of two training sequences. Joint profiles and quantities related to profile redundancy are used to construct competitive closeness tests and classifiers that perform almost as well as the best possible [1, 2].

Profiles also arise in statistics, in estimating symmetric or label-invariant properties of i.i.d. distributions ([34] and references therein), for example the support size, entropy, moments, or number of heavy hitters.
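The coin and die examples above can be verified by brute force. The following sketch (our illustration, not from the paper) enumerates all length-n sequences of a given i.i.d. distribution, computes each sequence's profile, and accumulates the probabilities:

```python
from collections import Counter
from itertools import product

def profile(seq):
    """Profile of a sequence: sorted multiset of multiplicities, e.g. 'aba' -> (1, 2)."""
    return tuple(sorted(Counter(seq).values()))

def profile_probs(dist, n):
    """P(phi) for every profile phi of length-n sequences; dist maps symbol -> prob."""
    probs = Counter()
    for seq in product(dist, repeat=n):
        p = 1.0
        for s in seq:
            p *= dist[s]
        probs[profile(seq)] += p   # all sequences with the same profile pool together
    return dict(probs)

# B(2/3) over h and t, and a fair die, both with n = 3, as in the text.
coin = profile_probs({"h": 2 / 3, "t": 1 / 3}, 3)
die = profile_probs({i: 1 / 6 for i in range(6)}, 3)
```

This brute-force route is exponential in n; the hardness remark above concerns computing these probabilities efficiently in general.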
All these properties depend only on the multiset of probability values of the distribution. For example, the entropy of the distribution p(heads) = .6, p(tails) = .4 depends only on the probability multiset {.6, .4}. For all these properties, profiles are a sufficient statistic.

1.5 Previous results

As patterns and profiles have the same redundancy, we describe the results for profiles. Instead of the expected redundancy R(I^n_Φ) that reflects the increase in the expected number of bits, [25] bounded the more stringent but closely-related worst-case redundancy, R̂(I^n_Φ), reflecting the increase in the worst-case number of bits, namely over all sequences. Using bounds [19] on the partition function, they showed that

Ω(n^{1/3}) ≤ R̂(I^n_Φ) ≤ π √(2/3) · n^{1/2}.

These bounds do not involve the alphabet size, hence show that unlike the sequences themselves, patterns (whose redundancy equals that of profiles), though containing essentially all the information of the sequence, can be compressed and learned with per-symbol redundancy and log loss diminishing as n^{−1/2}, uniformly over all alphabet sizes.

Note however that by contrast with i.i.d. distributions, where the redundancy (2) was determined up to a diminishing additive constant, here not even the power was known. Consequently several papers considered improvements of these bounds, mostly for expected redundancy, the minimax KL divergence. Since expected redundancy is at most the worst-case redundancy, the upper bound applies also for expected redundancy.
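The n^{1/2} upper bound reflects a counting fact: a profile of a length-n sequence is a multiset of positive multiplicities summing to n, i.e., an integer partition of n, and by the Hardy-Ramanujan bound [19] the partition number p(n) satisfies log p(n) = Θ(√n). A small sketch (ours) computes p(n) by dynamic programming and checks the bound numerically:

```python
import math

def partitions(n):
    """Number of integer partitions of n, i.e. the number of profiles in Phi_n."""
    p = [1] + [0] * n
    for part in range(1, n + 1):          # allow parts of size `part`
        for total in range(part, n + 1):
            p[total] += p[total - part]   # standard partition-counting recurrence
    return p[n]

# Hardy-Ramanujan: log2 p(n) is at most roughly pi * sqrt(2n/3) * log2(e),
# so any profile of a length-n sequence is describable in O(sqrt(n)) bits.
n = 100
bits = math.log2(partitions(n))
bound = math.pi * math.sqrt(2 * n / 3) * math.log2(math.e)
```

Assigning nearly uniform probability to the at most p(n) profiles is what yields a worst-case code of O(√n) bits.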
Subsequently, [31] described a partial proof outline that could potentially show the following tighter upper bound on expected redundancy, and [14] proved the following lower bound, strengthening one in [32]:

1.84 (n / log n)^{1/3} ≤ R(I^n_Φ) ≤ n^{0.4}.   (3)

1.6 New results

In Theorem 15 we use error-correcting codes to exhibit a larger class of distinguishable distributions in I^n_Φ than was known before, thereby removing the log n factor from the lower bound in (3). In Theorem 11 we demonstrate a small number of distributions such that every distribution in I^n_Φ is within a small KL divergence of one of them, thereby reducing the upper bound to have the same power as the lower bound. Combining these results we obtain

0.3 · n^{1/3} ≤ (1 − ε) log M(I^n_Φ, ε) ≤ R(I^n_Φ) ≤ n^{1/3} log^2 n.   (4)

These results close the power gap between the upper and lower bounds that existed in the literature. They show that when a pattern is compressed or a sequence is estimated (with all unseen elements combined into "new"), the per-symbol redundancy and log loss decrease to 0 uniformly over all distributions faster than log^2 n / n^{2/3}, a rate that is optimal up to a log^2 n factor. They also show that for length-n profiles, the redundancy R(I^n_Φ) is essentially the logarithm log M(I^n_Φ, ε) of the number of distinguishable distributions.

1.7 Outline

In the next section we describe properties of Poisson sampling and redundancy that will be used later in the paper. In Section 3 we establish the upper bound and in Section 4, the lower bound. Most of the proofs are provided in the Appendix.

2 Preliminaries

We describe some techniques and results used in the proofs.

2.1 Poisson sampling

When a distribution is sampled i.i.d. exactly n times, the multiplicities are dependent, complicating the analysis of many properties.
A standard approach [22] to overcome the dependence is to sample the distribution a random poi(n) number of times, where poi(n) denotes the Poisson distribution with parameter n, resulting in sequences of random length close to n. We let poi(λ, µ) := e^{−λ} λ^µ / µ! denote the probability that a poi(λ) random variable attains the value µ.

The following basic properties of Poisson sampling help simplify the analysis and relate it to fixed-length sampling.

Lemma 1. If a discrete i.i.d. distribution is sampled poi(n) times then: (1) the numbers of appearances of different symbols are independent; (2) a symbol with probability p appears poi(np) times; (3) for any fixed n_0, conditioned on the length poi(n) ≥ n_0, the first n_0 elements are distributed identically to sampling P exactly n_0 times.

We now express profile probabilities and redundancy under Poisson sampling. As we saw, the probability of a profile is determined by just the multiset of probability values, and the symbol labels are irrelevant. For convenience, we assume that the distribution is over the positive integers, and we replace the distribution parameters {p_i} by the Poisson parameters {np_i}. For a distribution P = {p_1, p_2, ...}, let λ_i := np_i, and Λ = {λ_1, λ_2, ...}. The profile generated by this distribution is a multiset ϕ = {µ_1, µ_2, ...}, where each µ_i is generated independently according to poi(λ_i). The probability that Λ generates ϕ is [1, 25]

Λ(ϕ) = (1 / ∏_{µ=0}^∞ ϕ_µ!) ∑_σ ∏_i poi(λ_{σ(i)}, µ_i),   (5)

where the summation is over all permutations σ of the support set. For example, for Λ = {λ_1, λ_2, λ_3}, the profile ϕ = {2, 2, 3} can be generated by specifying which element appears three times. This is reflected by the ϕ_2! in the denominator, and each of the repeated terms in the numerator is counted only once.

Similar to I^n_Φ, we use I^{poi(n)}_Φ to denote the class of distributions induced on Φ* := Φ_0 ∪ Φ_1 ∪ Φ_2 ∪ ... when sequences of length poi(n) are generated i.i.d. It is easy to see that a distribution in I^{poi(n)}_Φ is a collection of λ_i's summing to n. The redundancy R(I^{poi(n)}_Φ) and ε-distinguishability M(I^{poi(n)}_Φ, ε) are defined as before. The following lemma shows that bounding R(I^{poi(n)}_Φ) and M(I^{poi(n)}_Φ, ε) is sufficient to bound R(I^n_Φ) and M(I^n_Φ, ε).

Lemma 2. For any fixed ε > 0,

(1 − o(1)) R(I^{n−√(n log n)}_Φ) ≤ R(I^{poi(n)}_Φ)   and   M(I^{poi(n)}_Φ, ε) ≤ M(I^{n+√(n log n)}_Φ, 2ε).

Proof sketch. It is easy to show that R(I^n_Φ) and M(I^n_Φ, ε) are non-decreasing in n. Combining this with the fact that the probability that poi(n) is less than n − √(n log n) or greater than n + √(n log n) goes to 0 yields the bounds. □

Finally, the next lemma, proved in the Appendix, provides a simple formula for cross expectations of Poisson distributions.

Lemma 3. For any λ_0, λ_1, λ_2 > 0,

E_{µ∼poi(λ_1)} [poi(λ_2, µ) / poi(λ_0, µ)] = exp((λ_1 − λ_0)(λ_2 − λ_0) / λ_0).

2.2 Redundancy

We state some basic properties of redundancy.

For a distribution P over A and a function f : A → B, let f(P) be the distribution over B that assigns to b ∈ B the probability P(f^{−1}(b)). Similarly, for a collection P of distributions over A, let f(P) = {f(P) : P ∈ P}.
The convexity of KL divergence shows that D(f(P) || f(Q)) ≤ D(P||Q), and can be used to show

Lemma 4 (Function Redundancy). R(f(P)) ≤ R(P).

For a collection P of distributions over A × B, let P_A and P_B be the collections of marginal distributions over A and B, respectively. In general, R(P) can be larger or smaller than R(P_A) + R(P_B). However, when P consists of product distributions, namely P(a, b) = P_A(a) · P_B(b), the redundancy of the product is at most the sum of the marginal redundancies. The proof is given in the Appendix.

Lemma 5 (Redundancy of products). If P is a collection of product distributions over A × B, then

R(P) ≤ R(P_A) + R(P_B).

For a prefix-free code C : A → {0, 1}*, let E_P[|C|] be the expected length of C under distribution P. Redundancy is the extra number of bits above the entropy needed to encode the output of any distribution in P. Hence,

Lemma 6. For every prefix-free code C, R(P) ≤ max_{P∈P} E_P[|C|].

Lemma 7 (Redundancy of unions). If P_1, ..., P_T are distribution collections, then

R(∪_{1≤i≤T} P_i) ≤ max_{1≤i≤T} R(P_i) + log T.

3 Upper bound

A distribution Λ ∈ I^{poi(n)}_Φ is a multiset of λ's adding to n. For any such distribution, let

Λ_low := {λ ∈ Λ : λ ≤ n^{1/3}},   Λ_med := {λ ∈ Λ : n^{1/3} < λ ≤ n^{2/3}},   Λ_high := {λ ∈ Λ : λ > n^{2/3}},

and let ϕ_low, ϕ_med, ϕ_high denote the corresponding profile each subset generates. Then ϕ = ϕ_low ∪ ϕ_med ∪ ϕ_high. Let I_{ϕlow} = {Λ_low : Λ ∈ I^{poi(n)}_Φ} be the collection of all Λ_low.
Note that n is implicit here and in the rest of the paper. A distribution in I_{ϕlow} is a multiset of λ's such that each is ≤ n^{1/3} and they sum either to n or to at most n − n^{1/3}. I_{ϕmed} and I_{ϕhigh} are defined similarly.

ϕ is determined by the triple (ϕ_low, ϕ_med, ϕ_high), and by Poisson sampling, ϕ_low, ϕ_med, and ϕ_high are independent. Hence by Lemmas 4 and 5,

R(I^n_Φ) ≤ R(I_{(ϕlow, ϕmed, ϕhigh)}) ≤ R(I_{ϕlow}) + R(I_{ϕmed}) + R(I_{ϕhigh}).

In Subsection 3.1 we show that R(I_{ϕlow}) < 4n^{1/3} log n and R(I_{ϕhigh}) < 2n^{1/3} log n. In Subsection 3.2 we show that R(I_{ϕmed}) < (1/2) n^{1/3} log^2 n. In the next two subsections we elaborate on this overview and sketch some proof details.

3.1 Bounds on R(I_{ϕlow}) and R(I_{ϕhigh})

Elias codes [12] are prefix-free codes that encode a positive integer n using at most log n + log(log n + 1) + 1 bits. We use Elias codes to design explicit coding schemes for distributions in I_{ϕlow} and I_{ϕhigh}, and prove the following result.

Lemma 8. R(I_{ϕlow}) < 4n^{1/3} log n, and R(I_{ϕhigh}) < 2n^{1/3} log n.

Proof. Any distribution Λ_high ∈ I_{ϕhigh} consists of λ's that are > n^{2/3} and add to ≤ n. Hence |Λ_high| < n^{1/3}, and so is the number of multiplicities in ϕ_high. Each multiplicity is a poi(λ) random variable, and is encoded separately using an Elias code. For example, the profile {100, 100, 200, 250, 500} is encoded by coding the sequence 100, 100, 200, 250, 500, all using the Elias scheme. For λ > 10, the number of bits needed to encode a poi(λ) random variable using Elias codes can be shown to be at most 2 log λ. The expected code length is therefore at most n^{1/3} · 2 log n. Applying Lemma 6 gives R(I_{ϕhigh}) < 2n^{1/3} log n.

A distribution Λ_low ∈ I_{ϕlow} consists of λ's that are < n^{1/3} and sum to at most n.
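For concreteness, here is one Elias-style self-delimiting integer code, the gamma code, which uses 2⌊log₂ m⌋ + 1 bits per integer m; this is our stand-in sketch, not necessarily the exact variant of [12] used in the proof (the delta variant attains the log m + O(log log m) length quoted above):

```python
def elias_gamma_encode(m):
    """Prefix-free code for m >= 1: (len(bin(m)) - 1) zeros, then bin(m) itself."""
    b = bin(m)[2:]                       # binary expansion, leading 1 included
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits):
    """Inverse: count leading zeros z, then read the next z + 1 bits as the value."""
    z = 0
    while bits[z] == "0":
        z += 1
    return int(bits[z:2 * z + 1], 2)

# A profile such as {100, 100, 200, 250, 500} becomes one self-delimiting
# bitstream, decodable one multiplicity at a time.
stream = "".join(elias_gamma_encode(m) for m in (100, 100, 200, 250, 500))
```

Because each codeword announces its own length, no separators are needed, which is exactly what makes per-multiplicity coding of a profile prefix-free.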
We encode distinct multiplicities along with their prevalences, using two integers for each distinct multiplicity. For example, ϕ = {1, 1, 1, 1, 1, 2, 2, 2, 5} is coded as 1, 5, 2, 3, 5, 1. Using Poisson tail bounds, we bound the largest multiplicity in ϕ_low, and use arguments similar to those for I_{ϕhigh} to obtain R(I_{ϕlow}) < 4n^{1/3} log n. □

3.2 Bound on R(I_{ϕmed})

We partition the interval (n^{1/3}, n^{2/3}] into B = n^{1/3} bins. For each distribution in I_{ϕmed}, we divide the λ's in it according to these bins. We show that within each interval, there is a uniform distribution such that the KL divergence between the underlying distribution and the induced uniform distribution is small. We then show that the number of uniform distributions needed is at most exp(n^{1/3} log n). We now expand on these ideas and bound R(I_{ϕmed}).

We partition I_{ϕmed} into T ≤ exp(n^{1/3} log n) classes, upper bound the redundancy of each class, and then invoke Lemma 7 to obtain an upper bound on R(I_{ϕmed}). A distribution Λ = {λ_1, λ_2, ..., λ_r} ∈ I_{ϕmed} is such that λ_i ∈ (n^{1/3}, n^{2/3}] and ∑_{i=1}^r λ_i ≤ n.

Consider any partition of (n^{1/3}, n^{2/3}] into B := n^{1/3} consecutive intervals I_1, I_2, ..., I_B of lengths ∆_1, ∆_2, ..., ∆_B. For each distribution Λ ∈ I_{ϕmed}, let Λ_j := {λ_{j,l} : l = 1, 2, ..., m_j} := {λ : λ ∈ Λ ∩ I_j} be the set of elements of Λ in I_j, where m_j := m_j(Λ) := |Λ_j| is the number of elements of Λ in I_j. Let

τ(Λ) := (m_1, m_2, ..., m_B)

be the B-tuple of the counts of λ's in each interval.

For example, if n = 1000, then n^{1/3} = 10 and n^{2/3} = 100.
For simplicity, we choose B = 3 instead of n^{1/3}, and ∆_1 = 10, ∆_2 = 30, ∆_3 = 50, so the intervals are I_1 = (10, 20], I_2 = (20, 50], I_3 = (50, 100]. Suppose Λ = {12, 15, 25, 35, 32, 43, 46, 73}. Then Λ_1 = {12, 15}, Λ_2 = {25, 35, 32, 43, 46}, Λ_3 = {73}, and τ(Λ) = (m_1, m_2, m_3) = (2, 5, 1).

We partition I_{ϕmed} so that two distributions Λ and Λ' are in the same class if and only if τ(Λ) = τ(Λ'). Thus each class of distributions is characterized by a B-tuple of integers τ = (m_1, m_2, ..., m_B); let I_τ denote this class. Let T := T(∆) be the set of all possible different τ (such that I_τ is non-empty), and T = |T| be the number of classes.

We first bound T. Observe that for any Λ ∈ I_{ϕmed} and any j, we have m_j < n^{2/3}, since otherwise ∑_{λ∈Λ} λ > m_j · n^{1/3} ≥ n. So each m_j in τ can take at most n^{2/3} < n values, and hence T < (n^{2/3})^B < n^{n^{1/3}} = exp(n^{1/3} log n).

For any choice of ∆, let λ⁻_j := n^{1/3} + ∑_{i=1}^{j−1} ∆_i be the left end-point of the interval I_j, for j = 1, 2, ..., B. We upper bound the redundancy R(I_τ) of any particular class τ = (m_1, m_2, ..., m_B) in the following result.

Lemma 9. For all choices of ∆ = (∆_1, ..., ∆_B), and all classes I_τ such that τ = (m_1, ..., m_B) ∈ T(∆),

R(I_τ) ≤ ∑_{j=1}^B m_j ∆_j² / λ⁻_j.

Proof sketch. For any choice of ∆ and τ = (m_1, ..., m_B) ∈ T(∆), we exhibit a distribution Λ* ∈ I_τ such that for all Λ ∈ I_τ, D(Λ||Λ*) ≤ ∑_{j=1}^B m_j ∆_j² / λ*_j. Recall that for Λ ∈ I_τ, Λ_j is the set of elements of Λ in I_j. Let ϕ_j be the profile generated by Λ_j. Then ϕ_med = ϕ_1 ∪ ... ∪ ϕ_B.
The distribution Λ* is chosen to be of the form {λ*_1 × m_1, λ*_2 × m_2, ..., λ*_B × m_B}, i.e., each Λ*_j is uniform. The result follows from Lemma 3, and the details are in the Appendix. □

We now prove that R(I_{ϕmed}) < (1/2) n^{1/3} log^2 n. By Lemma 7 it suffices to bound R(I_τ). From Lemma 9 it follows that the choice of ∆ determines the bound on R(I_τ). A solution to the following optimization problem yields a bound:

min_∆ max_τ ∑_{j=1}^B m_j ∆_j² / λ⁻_j,   subject to   ∑_{j=1}^B m_j λ⁻_j ≤ n.

Instead of minimizing over all partitions, we choose the end-points of the intervals as a geometric series and bound the resulting expression. The left end-point of I_1 is λ⁻_1 = n^{1/3}, and we let λ⁻_{j+1} = λ⁻_j (1 + c). The constant c is chosen to ensure that λ⁻_1 (1 + c)^B = n^{1/3} (1 + c)^{n^{1/3}} = n^{2/3}, the right end-point of I_B. This yields c < 2 log(1 + c) = 2 log(n^{1/3}) / n^{1/3}. Now ∆_j = λ⁻_{j+1} − λ⁻_j = c λ⁻_j, so ∆_j² / λ⁻_j = c² λ⁻_j. This translates the objective function into the constraint, and is in fact the optimal choice of intervals for the optimization problem (details omitted). Using this, for any τ = (m_1, ..., m_B) ∈ T(∆),

∑_{j=1}^B m_j ∆_j² / λ⁻_j = c² ∑_{j=1}^B m_j λ⁻_j ≤ c² n < (2 log(n^{1/3}) / n^{1/3})² n = (4/9) n^{1/3} log^2 n.

This, along with Lemma 7, gives the following corollary for sufficiently large n.

Corollary 10. For large n, R(I_{ϕmed}) < (1/2) · n^{1/3} log^2 n.

Combining Lemma 8 with this result yields

Theorem 11.
For sufficiently large n,

R(I^n_Φ) ≤ n^{1/3} log^2 n.

4 Lower bound

We use error-correcting codes to construct a collection of 2^{0.3 n^{1/3}} distinguishable distributions, improving the bound in [14, 31] by a logarithmic factor.

The convexity of KL divergence can be used to show

Lemma 12. Let P and Q be distributions on A, and suppose A_1 ⊂ A is such that P(A_1) ≥ 1 − ε > 1/2 and Q(A_1) ≤ δ < 1/2. Then D(P||Q) ≥ (1 − ε) log(1/δ) − h(ε).

We use this result to show that (1 − ε) log M(P, ε) ≤ R(P) + h(ε). Recall that for P over A, M := M(P, ε) is the largest number of ε-distinguishable distributions in P. Let P_1, P_2, ..., P_M in P and a partition A_1, A_2, ..., A_M of A be such that P_j(A_j) ≥ 1 − ε. Let Q_0 be the distribution such that R(P) = sup_{P∈P} D(P||Q_0). Since ∑_{j=1}^M Q_0(A_j) = 1, Q_0(A_m) < 1/M for some m ∈ {1, ..., M}. Also, P_m(A_m) ≥ 1 − ε. Plugging P = P_m, Q = Q_0, A_1 = A_m, and δ = 1/M into Lemma 12,

R(P) ≥ D(P_m||Q_0) ≥ (1 − ε) log M(P, ε) − h(ε).

We now describe the class of distinguishable distributions. Fix C > 0. Let λ*_i := C i², K := ⌊(3n/C)^{1/3}⌋, and S := {λ*_i : 1 ≤ i ≤ K}. K is chosen so that the sum of the elements of S is at most n. For a binary string x = x_1 x_2 ... x_K, let

Λ_x := {λ*_i : x_i = 1} ∪ {n − ∑_i λ*_i x_i}.

The distribution contains λ*_i whenever x_i = 1, and the last element ensures that the elements add up to n.
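The construction can be checked numerically. In the sketch below, the value n = 10^6 and the test strings are ours, while C = 60 follows the choice made for Theorem 15; each binary string x yields a multiset Λx of Poisson means summing exactly to n:

```python
import math

def lambda_x(x, n, C=60):
    """Multiset of Poisson means for binary string x, per the construction above."""
    lam = [C * (i + 1) ** 2 for i in range(len(x)) if x[i] == "1"]
    lam.append(n - sum(lam))          # final element pads the total up to n
    return lam

n = 10**6
C = 60
K = math.floor((3 * n / C) ** (1 / 3))   # K chosen so sum of C * i^2 stays <= n
full = lambda_x("1" * K, n, C)           # worst case: every lambda*_i included
```

Even with all K means selected, the padding element stays positive, so every codeword of a length-K code gives a valid distribution in I^{poi(n)}_Φ.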
A binary code of length $k$ and minimum distance $d_{min}$ is a collection of binary strings of length $k$ such that the Hamming distance between any two strings is at least $d_{min}$. The size of the code is the number of elements (codewords) in it. The following shows the existence of codes with a specified minimum distance and size.

Lemma 13 ([30]). Let $\frac{1}{2} > \alpha > 0$. There exists a code with $d_{min} \ge \alpha k$ and size $\ge 2^{k(1-h(\alpha)-o(1))}$.

Let $\mathcal{C}$ be a code satisfying Lemma 13 for $k = K$ and let $\mathcal{L} = \{\Lambda_c : c \in \mathcal{C}\}$ be the set of distributions generated by using the strings in $\mathcal{C}$. The following result shows that the distributions in $\mathcal{L}$ are distinguishable; it is proved in the Appendix.

Lemma 14. The set $\mathcal{L}$ is $\frac{2e^{-C/4}}{\alpha}$-distinguishable.

Plugging $\alpha = 5 \times 10^{-5}$ and $C = 60$ into Lemma 13 and Equation (1) yields

Theorem 15. For sufficiently large $n$,

$$0.3 \cdot n^{1/3} \le R(\mathcal{I}^n_\Phi).$$

Acknowledgments

The authors thank Ashkan Jafarpour and Ananda Theertha Suresh for many helpful discussions.

References

[1] J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, and S. Pan. Competitive closeness testing. Journal of Machine Learning Research - Proceedings Track, 19:47–68, 2011.

[2] J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, S. Pan, and A. T. Suresh. Competitive classification and closeness testing. Journal of Machine Learning Research - Proceedings Track, 23:22.1–22.18, 2012.

[3] A. R. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, 1998.

[4] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In Annual Symposium on Foundations of Computer Science, page 259, 2000.

[5] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[6] K. Chaudhuri and A. McGregor. Finding metric structure in information theoretic clustering.
In Conference on Learning Theory, pages 391–402, 2008.

[7] T. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, January 1991.

[8] T. Cover and J. Thomas. Elements of Information Theory, 2nd Ed. Wiley Interscience, 2006.

[9] L. Davisson. Universal noiseless coding. IEEE Transactions on Information Theory, 19(6):783–795, November 1973.

[10] L. D. Davisson, R. J. McEliece, M. B. Pursley, and M. S. Wallace. Efficient universal noiseless source codes. IEEE Transactions on Information Theory, 27(3):269–279, 1981.

[11] M. Drmota and W. Szpankowski. Precise minimax redundancy and regret. IEEE Transactions on Information Theory, 50(11):2686–2707, 2004.

[12] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, March 1975.

[13] B. M. Fitingof. Optimal coding in the case of unknown and changing message statistics. Problems of Information Transmission, 2(2):1–7, 1966.

[14] A. Garivier. A lower-bound for the maximin redundancy in pattern coding. Entropy, 11(4):634–642, 2009.

[15] G. M. Gemelos and T. Weissman. On the entropy rate of pattern processes. IEEE Transactions on Information Theory, 52(9):3994–4007, 2006.

[16] P. Grünwald. A tutorial introduction to the minimum description length principle. CoRR, math.ST/0406077, 2004.

[17] P. Grünwald, J. S. Jones, J. de Winter, and É. Smith. Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. Journal of Machine Learning Research - Proceedings Track, 19:397–420, 2011.

[18] P. D. Grünwald. The Minimum Description Length Principle. The MIT Press, 2007.

[19] G. Hardy and S. Ramanujan. Asymptotic formulae in combinatory analysis. Proceedings of the London Mathematical Society, 17(2):75–115, 1918.

[20] J. Kelly. A new interpretation of information rate.
IEEE Transactions on Information Theory, 2(3):185–189, 1956.

[21] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6):2124–2147, October 1998.

[22] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

[23] A. Orlitsky, S. Pan, Sajama, N. Santhanam, and K. Viswanathan. Pattern maximum likelihood: computation and experiments. In preparation, 2012.

[24] A. Orlitsky, N. Santhanam, K. Viswanathan, and J. Zhang. On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004.

[25] A. Orlitsky, N. Santhanam, and J. Zhang. Universal compression of memoryless sources over unknown alphabets. IEEE Transactions on Information Theory, 50(7):1469–1481, July 2004.

[26] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang. Limit results on pattern entropy. IEEE Transactions on Information Theory, 52(7):2954–2964, 2006.

[27] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629–636, July 1984.

[28] J. Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47, January 1996.

[29] J. Rissanen, T. P. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2):315–323, 1992.

[30] R. M. Roth. Introduction to Coding Theory. Cambridge University Press, 2006.

[31] G. Shamir. A new upper bound on the redundancy of unknown alphabets. In Proceedings of The 38th Annual Conference on Information Sciences and Systems, Princeton, New Jersey, 2004.

[32] G. Shamir. Universal lossless compression with unknown alphabets—the average case. IEEE Transactions on Information Theory, 52(11):4915–4944, November 2006.

[33] W.
Szpankowski. On asymptotics of certain recurrences arising in universal coding. Problems of Information Transmission, 34(2):142–146, 1998.

[34] P. Valiant. Testing symmetric properties of distributions. PhD thesis, Cambridge, MA, USA, 2008. AAI0821026.

[35] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.

[36] Q. Xie and A. Barron. Asymptotic minimax regret for data compression, gambling and prediction. IEEE Transactions on Information Theory, 46(2):431–445, March 2000.

[37] B. Yu and T. P. Speed. A rate of convergence result for a universal d-semifaithful code. IEEE Transactions on Information Theory, 39(3):813–820, 1993.

[38] J. Zhang. Universal Compression and Probability Estimation with Unknown Alphabets. PhD thesis, UCSD, 2005.