{"title": "Bayesian entropy estimation for binary spike train data using parametric prior knowledge", "book": "Advances in Neural Information Processing Systems", "page_first": 1700, "page_last": 1708, "abstract": "Shannon's entropy is a basic quantity in information theory, and a  fundamental building block for the analysis of neural codes.   Estimating the entropy of a discrete distribution from samples is an important and difficult problem that has received considerable   attention in statistics and theoretical neuroscience.  However,  neural responses have characteristic statistical structure that   generic entropy estimators fail to exploit.  For example, existing  Bayesian entropy estimators make the naive assumption that all spike   words are equally likely a priori, which makes for an  inefficient allocation of prior probability mass in cases where   spikes are sparse.  Here we develop Bayesian estimators for the  entropy of binary spike trains using priors designed to flexibly   exploit the statistical structure of simultaneously-recorded spike  responses.  We define two prior distributions over spike words using   mixtures of Dirichlet distributions centered on simple parametric  models.  The parametric model captures high-level statistical   features of the data, such as the average spike count in a spike  word, which allows the posterior over entropy to concentrate more   rapidly than with standard estimators (e.g., in cases where the  probability of spiking differs strongly from 0.5). Conversely, the   Dirichlet distributions assign prior mass to distributions far from  the parametric model, ensuring consistent estimates for arbitrary   distributions.  We devise a compact representation of the data and  prior that allow for computationally efficient implementations of   Bayesian least squares and empirical Bayes entropy estimators with  large numbers of neurons.  
We apply these estimators to simulated and real neural data and show that they substantially outperform traditional methods.", "full_text": "Bayesian entropy estimation for binary spike train data using parametric prior knowledge\n\nEvan Archer^{1,3}, Il Memming Park^{1,2,3}, Jonathan W. Pillow^{1,2,3}\n\n1. Center for Perceptual Systems, 2. Dept. of Psychology,\n\n3. Division of Statistics & Scientific Computation\n\n{memming@austin., earcher@, pillow@mail.} utexas.edu\n\nThe University of Texas at Austin\n\nAbstract\n\nShannon\u2019s entropy is a basic quantity in information theory, and a fundamental\nbuilding block for the analysis of neural codes. Estimating the entropy of a discrete\ndistribution from samples is an important and difficult problem that has received\nconsiderable attention in statistics and theoretical neuroscience. However,\nneural responses have characteristic statistical structure that generic entropy estimators\nfail to exploit. For example, existing Bayesian entropy estimators make\nthe naive assumption that all spike words are equally likely a priori, which makes\nfor an inefficient allocation of prior probability mass in cases where spikes are\nsparse. Here we develop Bayesian estimators for the entropy of binary spike trains\nusing priors designed to flexibly exploit the statistical structure of simultaneously-recorded\nspike responses. We define two prior distributions over spike words using\nmixtures of Dirichlet distributions centered on simple parametric models. The\nparametric model captures high-level statistical features of the data, such as the\naverage spike count in a spike word, which allows the posterior over entropy to\nconcentrate more rapidly than with standard estimators (e.g., in cases where the\nprobability of spiking differs strongly from 0.5). 
Conversely, the Dirichlet distributions\nassign prior mass to distributions far from the parametric model, ensuring\nconsistent estimates for arbitrary distributions. We devise a compact representation\nof the data and prior that allows for computationally efficient implementations\nof Bayesian least squares and empirical Bayes entropy estimators with large numbers\nof neurons. We apply these estimators to simulated and real neural data and\nshow that they substantially outperform traditional methods.\n\nIntroduction\n\nInformation-theoretic quantities are popular tools in neuroscience, where they are used to study\nneural codes whose representation or function is unknown. Neuronal signals take the form of fast\n(\u223c 1 ms) spikes which are frequently modeled as discrete, binary events. While the spiking response\nof even a single neuron has been the focus of intense research, modern experimental techniques make\nit possible to study the simultaneous activity of hundreds of neurons. At a given time, the response\nof a population of n neurons may be represented by a binary vector of length n, where each entry\nrepresents the presence (1) or absence (0) of a spike. We refer to such a vector as a spike word.\nFor n much greater than 30, the space of 2^n spike words becomes so large that effective modeling\nand analysis of neural data, with their high dimensionality and relatively low sample size, presents\na significant computational and theoretical challenge.\nWe study the problem of estimating the discrete entropy of spike word distributions. This is a difficult\nproblem when the sample size is much less than 2^n, the number of spike words. 
Entropy\nestimation in general is a well-studied problem with a literature spanning statistics, physics, neuroscience,\necology, and engineering, among others [1\u20137]. We introduce a novel Bayesian estimator\nwhich, by incorporating simple a priori information about spike trains via a carefully-chosen prior,\ncan estimate entropy with remarkable accuracy from few samples. Moreover, we exploit the structure\nof spike trains to compute efficiently on the full space of 2^n spike words.\n\nFigure 1: Illustrated example of binarized spike responses for n = 3 neurons and corresponding\nword distribution. (A) The spike responses of n = 3 simultaneously-recorded neurons (green,\norange, and purple). Time is discretized into bins of size \u2206t. A single spike word is a 3 \u00d7 1 binary\nvector whose entries are 1 or 0 corresponding to whether the neuron spiked or not within the time\nbin. (B) We model spike words as drawn iid from the word distribution \u03c0, a probability distribution\nsupported on the A = 2^n unique binary words. Here we show a schematic \u03c0 for the data of panel\n(A). The spike words (x-axis) occur with varying probability (blue).\n\nWe begin by briefly reviewing entropy estimation in general. In Section 2 we discuss the statistics\nof spike trains and emphasize a statistic, called the synchrony distribution, which we employ in\nour model. In Section 3 we introduce two novel estimators, the Dirichlet\u2013Bernoulli (DBer) and\nDirichlet\u2013Synchrony (DSyn) entropy estimators, and discuss their properties and computation. We\ncompare \u0124_DBer and \u0124_DSyn to other entropy estimation techniques in simulation and on neural data,\nand show that \u0124_DBer drastically outperforms other popular techniques when applied to real neural\ndata. 
Finally, we apply our estimators to study synergy across time of a single neuron.\n\n1 Entropy Estimation\nLet x := {x_k}_{k=1}^{N} be spike words drawn iid from an unknown word distribution \u03c0 := {\u03c0_i}_{i=1}^{A}.\nThere are A = 2^n unique words for a population of n neurons, which we index by {1, 2, . . . , A}.\nEach sampled word x_k is a binary vector of length n, where x_{ki} records the presence or absence of\na spike from the ith neuron. We wish to estimate the entropy of \u03c0,\n\n$$H(\\pi) = -\\sum_{k=1}^{A} \\pi_k \\log \\pi_k,$$ (1)\n\nwhere \u03c0_k > 0 denotes the probability of observing the kth word.\nA naive method for estimating H is to first estimate \u03c0 using the count of observed words n_k =\n\u2211_{i=1}^{N} 1{x_i = k} for each word k. This yields the empirical distribution \u03c0\u0302, where \u03c0\u0302_k = n_k/N. Evaluating\neq. 1 on this estimate yields the \u201cplugin\u201d estimator,\n\n$$\\hat{H}_{\\mathrm{plugin}} = -\\sum_{i=1}^{A} \\hat{\\pi}_i \\log \\hat{\\pi}_i,$$ (2)\n\nwhich is also the maximum-likelihood estimator under the multinomial likelihood. Although consistent\nand straightforward to compute, \u0124_plugin is in general severely biased when N \u226a A.\nIndeed, all entropy estimators are biased when N \u226a A [8]. As a result, many techniques for bias-correction\nhave been proposed in the literature [6, 9\u201318]. Here, we extend the Bayesian approach\nof [19], focusing in particular on the problem of entropy estimation for simultaneously-recorded\nneurons.\nIn a Bayesian paradigm, rather than attempting to directly compute and remove the bias for a given\nestimator, we instead choose a prior distribution over the space of discrete distributions. 
Nemenman et al. showed Dirichlet priors to be highly informative about the entropy, and thus a poor choice of prior for\nBayesian entropy estimation [19]. To rectify this problem, they introduced the Nemenman\u2013Shafee\u2013Bialek\n(NSB) estimator, which uses a mixture of Dirichlet distributions to obtain an approximately\nflat prior over H. As a prior on \u03c0, however, the NSB prior is agnostic about application: all symbols\nhave the same marginal probability under the prior, an assumption that may not hold when the\nsymbols correspond to binary spike words.\n\nFigure 2: Sparsity structure of spike word distribution illustrated using the synchrony distribution.\n(A) The empirical synchrony distribution of 8 simultaneously-recorded retinal ganglion cells (blue).\nThe cells were recorded for 20 minutes and binned with \u2206t = 2 ms bins. Spike words are overwhelmingly\nsparse, with w_0 by far the most common word. For comparison, we show the prior empirical\nsynchrony distribution, computed from 10^6 samples from the NSB prior (\u03c0 \u223c Dir(\u03b1, . . . , \u03b1),\nwith p(\u03b1) \u221d A\u03c8_1(A\u03b1 + 1) \u2212 \u03c8_1(\u03b1 + 1), and \u03c8_1 the trigamma function) (red). The empirical synchrony\ndistribution shown is averaged across samples. (B) The synchrony distribution of an Ising\nmodel (blue) compared to its best binomial fit (red). The Ising model parameters were set randomly\nby drawing the entries of the matrix J and vector h iid from N(0, 1). A binomial distribution cannot\naccurately capture the observed synchrony distribution.\n\n2 Spike Statistics and the Synchrony Distribution\n\nWe discretize neural signals by binning multi-neuron spike trains in time, as illustrated in Fig. 1. At a\ntime t, then, the spike response of a population of n neurons is a binary vector w = (b_1, b_2, . . . 
, b_n),\nwhere b_i \u2208 {0, 1} corresponds to the event that the ith neuron spikes within the time window\n(t, t + \u2206t). We let w_k be that word such that k = \u2211_{i=0}^{n\u22121} b_i 2^i. There are a total of A = 2^n possible\nwords.\nFor a sufficiently small bin size \u2206t, spike words are likely to be sparse, and so our strategy will be\nto choose priors that place high prior probability on sparse words. To quantify sparsity we use the\nsynchrony distribution: the distribution of population spike counts across all words. In Fig. 2 we\ncompare the empirical synchrony distribution for a population of 8 simultaneously-recorded retinal\nganglion cells (RGCs) with the prior synchrony distribution under the NSB model. For real data, the\nsynchrony distribution is asymmetric and sparse, concentrating around words with few simultaneous\nspikes. No more than 4 synchronous spikes are observed in the data. In contrast, under the NSB\nmodel all words are equally likely, and the prior synchrony distribution is symmetric and centered\naround 4.\nThese deviations in the synchrony distribution are noteworthy: beyond quantifying sparseness, the\nsynchrony distribution provides a surprisingly rich characterization of a neural population. Despite\nits simplicity, the synchrony distribution carries information about the higher-order correlation structure\nof a population [20,21]. It uniquely specifies distributions \u03c0 for which the probability of a word\nw_k depends only on its spike count [k] = [w_k] := \u2211_i b_i. Equivalently: all words with spike count\nk, E_k = {w : [w] = k}, have identical probability \u03b2_k of occurring. For such a \u03c0, the synchrony\ndistribution \u00b5 is given by,\n\n$$\\mu_k = \\sum_{w_i \\in E_k} \\pi_i = \\binom{n}{k} \\beta_k.$$ (3)\n\nDifferent neural models correspond to different synchrony distributions. Consider an independent\nBernoulli spiking model. Under this model, the number of spikes in a word w is distributed binomially,\n[w] \u223c Bin(n, p), where p is a uniform spike probability across neurons. The probability of a\nword w_k is given by,\n\n$$P(w_k \\mid p) = \\beta_{[k]} = p^{[k]} (1 - p)^{n - [k]},$$ (4)\n\nwhile the probability of observing a word with i spikes is,\n\n$$P(E_i \\mid p) = \\binom{n}{i} \\beta_i.$$ (5)\n\n3 Entropy Estimation with Parametric Prior Knowledge\n\nAlthough a synchrony distribution may capture our prior knowledge about the structure of spike\npatterns, our goal is not to estimate the synchrony distribution itself. Rather, we use it to inform a\nprior on the space of discrete distributions, the (2^n \u2212 1)-dimensional simplex. Our strategy is to use a\nsynchrony distribution G as the base measure of a Dirichlet distribution. We construct a hierarchical\nmodel where \u03c0 is drawn from a mixture of Dir(\u03b1G), and counts n of spike train observations are multinomial\ngiven \u03c0 (see Fig. 3A). Exploiting the conjugacy of Dirichlet and multinomial, and the convenient\nsymmetries of both the Dirichlet distribution and G, we obtain a computationally efficient Bayes\nleast squares estimator for entropy. Finally, we discuss using empirical estimates of the synchrony\ndistribution \u00b5 as a base measure.\n\n3.1 Dirichlet\u2013Bernoulli entropy estimator\n\nWe model spike word counts n as drawn iid multinomial given the spike word distribution \u03c0. We\nplace a mixture-of-Dirichlets prior on \u03c0, which in general takes the form,\n\n$$n \\sim \\mathrm{Mult}(\\pi),$$ (6)\n$$\\pi \\sim \\mathrm{Dir}(\\underbrace{\\alpha_1, \\alpha_2, \\ldots, \\alpha_A}_{2^n}),$$ (7)\n$$\\vec{\\alpha} := (\\alpha_1, \\alpha_2, \\ldots, \\alpha_A) \\sim P(\\vec{\\alpha}),$$ (8)\n\nwhere \u03b1_i > 0 are concentration parameters, and P(\u03b1\u20d7) is a prior distribution of our choosing. Due\nto the conjugacy of Dirichlet and multinomial, the posterior distribution given observations and \u03b1\u20d7 is\n\u03c0|n, \u03b1\u20d7 \u223c Dir(\u03b1_1 + n_1, . . . , \u03b1_A + n_A), where n_i is the number of observations for the i-th spiking\npattern. The expected entropy under Dir(\u03b1\u20d7) is given by [22],\n\n$$E[H(\\pi) \\mid \\vec{\\alpha}] = \\psi_0(\\kappa + 1) - \\sum_{i=1}^{A} \\frac{\\alpha_i}{\\kappa} \\psi_0(\\alpha_i + 1),$$ (9)\n\nwhere \u03c8_0 is the digamma function, and \u03ba = \u2211_{i=1}^{A} \u03b1_i. The posterior expected entropy has the same form, with each \u03b1_i replaced by \u03b1_i + n_i.\n\nFor large A, \u03b1\u20d7 is too large to select arbitrarily, and so in practice we center the Dirichlet around\na simple, parametric base measure G [23]. We rewrite the vector of concentration parameters as\n\u03b1\u20d7 \u2261 \u03b1G, where G = Bernoulli(p) is a Bernoulli distribution with spike rate p and \u03b1 > 0 is a scalar.\nThe general prior of eq. 7 then takes the form,\n\n$$\\pi \\sim \\mathrm{Dir}(\\alpha G) \\equiv \\mathrm{Dir}(\\alpha g_1, \\alpha g_2, \\ldots, \\alpha g_A),$$ (10)\n\nwhere each g_k is the probability of the kth word under the base measure, satisfying \u2211_k g_k = 1.\nWe illustrate the dependency structure of this model schematically in Fig. 3. 
Intuitively, the base\nmeasure incorporates the structure of G into the prior by shifting the Dirichlet\u2019s mean. With a\nbase measure G the prior mean satisfies E[\u03c0|p] = G|p. Under the NSB model, G is the uniform\ndistribution; thus, when p = 0.5 the Bernoulli base measure G is uniform over words and corresponds exactly to the NSB model. Since\nin practice choosing a base measure is equivalent to selecting distinct values of the concentration\nparameters \u03b1_i, the posterior mean of entropy under this model has the same form as eq. 9, simply\nreplacing \u03b1_k = \u03b1g_k. Given hyper-prior distributions P(\u03b1) and P(p), we obtain the Bayes least\nsquares estimate, the posterior mean of entropy under our model,\n\n$$\\hat{H}_{\\mathrm{DBer}} = E[H \\mid x] = \\iint E[H \\mid \\alpha, p, x] \\, P(\\alpha, p \\mid x) \\, d\\alpha \\, dp.$$ (11)\n\nWe refer to eq. 11 as the Dirichlet\u2013Bernoulli (DBer) entropy estimator, \u0124_DBer. Thanks to the closed-form\nexpression for the conditional mean eq. 9 and the convenient symmetries of the Bernoulli\ndistribution, the estimator is fast to compute using a 2D numerical integral over the hyperparameters\n\u03b1 and p.\n\n3.1.1 Hyper-priors on \u03b1 and p\n\nPrevious work on Bayesian entropy estimation has focused on Dirichlet priors with scalar, constant\nconcentration parameters \u03b1_i = \u03b1. Nemenman et al. [19] noted that these fixed-\u03b1 priors yield poor\nestimators for entropy, because p(H|\u03b1) is highly concentrated around its mean. To address this\nproblem, [19] proposed a Dirichlet mixture prior on \u03c0,\n\n$$P(\\pi) = \\int P_{\\mathrm{Dir}}(\\pi \\mid \\alpha) \\, P(\\alpha) \\, d\\alpha,$$ (12)\n\nwhere the hyper-prior, P(\u03b1) \u221d (d/d\u03b1) E[H(\u03c0)|\u03b1], assures an approximately flat prior distribution over\nH. We adopt the same strategy here, choosing the prior,\n\n$$P(\\alpha) \\propto \\frac{d}{d\\alpha} E[H(\\pi) \\mid \\alpha, p] = \\psi_1(\\alpha + 1) - \\sum_{i=0}^{n} \\binom{n}{i} \\beta_i^2 \\, \\psi_1(\\alpha \\beta_i + 1),$$ (13)\n\nwhere \u03c8_1 is the trigamma function.\nEntropy estimates are less sensitive to the choice of prior on p. Although we experimented with\nseveral priors on p, in all examples we found that the evidence for p was highly concentrated around\np\u0302 = (1/(Nn)) \u2211_{ij} x_{ij}, the maximum (Bernoulli) likelihood estimate for p. In practice, we found that an\nempirical Bayes procedure, fitting p\u0302 from data first and then using the fixed p\u0302 to perform the integral\neq. 11, performed indistinguishably from a P(p) uniform on [0, 1].\n\nFigure 3: Model schematic and intuition for Dirichlet\u2013Bernoulli entropy estimation. (A) Graphical\nmodel for Dirichlet\u2013Bernoulli entropy estimation. The Bernoulli base measure G depends on the\nspike rate parameter p. In turn, G acts as the mean of a Dirichlet prior over \u03c0. The scalar Dirichlet\nconcentration parameter \u03b1 determines the variability of the prior around the base measure. (B) The\nset of possible spike words for n = 4 neurons. Although easy to enumerate for this small special\ncase, the number of words increases exponentially with n. In order to compute with this large set,\nwe assume a prior distribution with a simple equivalence class structure: a priori, all words with the\nsame number of synchronous spikes (outlined in blue) occur with equal probability. We then need\nonly n parameters, the synchrony distribution of eq. 3, to specify the distribution. (C) We center a\nDirichlet distribution on a model of the synchrony distribution. The symmetries of the count and\nDirichlet distributions allow us to compute without explicitly representing all A words.\n\n3.1.2 Computation\n\nFor large n, the 2^n values of \u03b1_i render the sum of eq. 9 potentially intractable to compute.\nWe sidestep this exponential scaling of terms by exploiting the redundancy of Bernoulli and binomial\ndistributions. Doing so, we are able to compute eq. 9 without explicitly representing the 2^n values\nof \u03b1_i.\nUnder the Bernoulli model, each element g_k of the base measure takes the value \u03b2_{[k]} (eq. 4). Further,\nthere are $\\binom{n}{i}$ words with i spikes, all sharing the same concentration parameter \u03b1\u03b2_i, so that \u03ba = \u2211_{i=0}^{n} \u03b1 $\\binom{n}{i}$ \u03b2_i = \u03b1. Applied\nto eq. 9, we have,\n\n$$E[H(\\pi) \\mid \\alpha, p] = \\psi_0(\\alpha + 1) - \\sum_{i=0}^{n} \\binom{n}{i} \\beta_i \\, \\psi_0(\\alpha \\beta_i + 1).$$\n\nFor the posterior, the sum takes the same form, except that \u03ba = N + \u03b1, where N = \u2211_k n_k is the total number of observed words, and the mean is given by,\n\n$$\\begin{aligned} E[H(\\pi) \\mid \\alpha, p, x] &= \\psi_0(N + \\alpha + 1) - \\sum_{i=1}^{A} \\frac{n_i + \\alpha \\beta_{[i]}}{N + \\alpha} \\, \\psi_0(n_i + \\alpha \\beta_{[i]} + 1) \\\\ &= \\psi_0(N + \\alpha + 1) - \\sum_{i \\in I} \\frac{n_i + \\alpha \\beta_{[i]}}{N + \\alpha} \\, \\psi_0(n_i + \\alpha \\beta_{[i]} + 1) - \\alpha \\sum_{i=0}^{n} \\frac{\\left( \\binom{n}{i} - \\tilde{n}_i \\right) \\beta_i}{N + \\alpha} \\, \\psi_0(\\alpha \\beta_i + 1), \\end{aligned}$$ (14)\n\nwhere I = {k : n_k > 0} is the set of observed words, and \u00f1_i is the number of distinct observed words with\ni spikes (i.e., the number of observed members of the equivalence class E_i). Note that eq. 
14 is much more computationally\ntractable than the mathematically equivalent form given immediately above it. Thus, careful\nbookkeeping allows us to efficiently evaluate eq. 9 and, in turn, eq. 11.\u00b9\n\n\u00b9 For large n, the binomial coefficient of eq. 14 may be difficult to compute. By writing it in terms of the\nBernoulli probability eq. 5, it may be computed using the Normal approximation to the Binomial.\n\n3.2 Empirical Synchrony Distribution as a Base Measure\n\nWhile the Bernoulli base measure captures the sparsity structure of multi-neuron recordings, it also\nimposes unrealistic independence assumptions. In general, the synchrony distribution can capture\ncorrelation structure that cannot be represented by a Bernoulli model. For example, in Fig. 2B, a\nmaximum likelihood Bernoulli fit fails to capture the sparsity structure of a simulated Ising model.\nWe might address this by choosing a more flexible parametric base measure. However, since the\ndimensionality of \u00b5 scales only linearly with the number of neurons, the empirical synchrony distribution\n(ESD),\n\n$$\\hat{\\mu}_i = \\frac{1}{N} \\sum_{j=1}^{N} \\mathbf{1}\\{[x_j] = i\\},$$ (15)\n\nconverges quickly even when the sample size is inadequate for estimating the full \u03c0.\nThis suggests an empirical Bayes procedure where we use the ESD as a base measure (take G = \u00b5\u0302)\nfor entropy estimation. Computation proceeds exactly as in Section 3.1.2 with the Bernoulli base\nmeasure replaced by the ESD, by setting \u03b2_i = \u00b5\u0302_i / $\\binom{n}{i}$ so that g_k = \u03b2_{[k]}. The resulting Dirichlet\u2013Synchrony\n(DSyn) estimator may incorporate more varied sparsity and correlation structures into its\nprior than \u0124_DBer (see Fig. 4), although it depends on an estimate of the synchrony distribution.\n\n4 Simulations and Comparisons\n\nFigure 4: Convergence of \u0124_DBer, \u0124_DSyn, \u0124_NSB, \u0124_BUB, and \u0124_plugin as a function of sample size for\ntwo simulated examples of 30 neurons. Binary word data are drawn from two specified synchrony\ndistributions (insets). Error bars indicate variability of the estimator over independent samples (\u00b11\nstandard deviation). (A) Data drawn from a bimodal synchrony distribution with peaks at 0 spikes\nand 10 spikes (\u00b5_i = e^{\u22122i} + (1/10) e^{\u22124(i \u2212 2n/3)^2}). (B) Data generated from a power-law synchrony\ndistribution (\u00b5_i \u221d i^{\u22123}).\n\nFigure 5: Convergence of \u0124_DBer, \u0124_DSyn, \u0124_NSB, \u0124_BUB, and \u0124_plugin as a function of sample size for\n27 simultaneously-recorded retinal ganglion cells (RGC). The two figures show the same RGC data\nbinned and binarized at \u2206t = 1 ms (A) and 10 ms (B). The error bars, axes, and color scheme are as\nin Fig. 4. While all estimators improve upon the performance of \u0124_plugin, \u0124_DSyn and \u0124_DBer both show\nexcellent performance for very low sample sizes (tens of samples). (inset) The empirical synchrony\ndistribution estimated from 120 minutes of data.\n\nWe compared \u0124_DBer and \u0124_DSyn to the Nemenman\u2013Shafee\u2013Bialek (NSB) [19] and Best Upper Bound\n(BUB) entropy estimators [8] for several simulated and real neural datasets. For \u0124_DSyn, we regularized\nthe estimated ESD by adding a pseudo-count of 1/K, where K is the number of unique words\nobserved in the sample. In Fig. 4 we simulated data from two distinct synchrony distribution models.\nAs is expected, among all estimators, \u0124_DSyn converges the fastest with increasing sample size\nN. 
The \u0124_DBer estimator converges more slowly, as the Bernoulli base measure is not capable of\ncapturing the correlation structure of the simulated synchrony distributions.\nIn Fig. 5, we show convergence performance on increasing subsamples of 27 simultaneously-recorded retinal ganglion\ncells. Again, \u0124_DBer and \u0124_DSyn show excellent performance. Although the true word distribution is\nnot described by a synchrony distribution, the ESD proves an excellent regularizer for the space of\ndistributions, even for very small sample sizes.\n\n5 Application: Quantification of Temporal Dependence\n\nWe can gain insight into the coding of a single neural time-series by quantifying the amount of information\na single time bin contains about another. The correlation function (Fig. 6A) is the statistic\nmost widely used for this purpose. However, correlation cannot capture higher-order dependencies.\nIn neuroscience, mutual information is used to quantify higher-order temporal structure [24]. A related\nquantity, the delayed mutual information (dMI), provides an indication of instantaneous dependence:\ndMI(s) = I(X_t; X_{t+s}), where X_t is a binned spike train, and I(X; Y) = H(X) \u2212 H(X|Y)\ndenotes the mutual information. However, this quantity ignores any temporal dependences in\nthe intervening times: X_{t+1}, . . . , X_{t+s\u22121}. An alternative approach allows us to consider such\ndependences: the \u201cblock mutual information\u201d \u03bd(s) = I(X_t; X_{t+1:t+s}) \u2212 I(X_t; X_{t+1:t+s\u22121})\n(Fig. 6B,C,D).\nThe relationship between \u03bd(s) and dMI(s) provides insight about the information contained in the\nrecent history of the signal. If each time bin is conditionally independent given X_t, then \u03bd(s) =\ndMI(s). In contrast, if \u03bd(s) < dMI(s), instantaneous dependence is partially explained by history.\nFinally, \u03bd(s) > dMI(s) implies that the joint distribution of X_t, X_{t+1}, . . . , X_{t+s} contains more\ninformation about X_t than the joint distribution of X_t and X_{t+s} alone. We use the \u0124_DBer entropy\nestimator to compute mutual information (by computing H(X) and H(X|Y)) accurately for \u223c 15\nbins of history. Surprisingly, individual retinal ganglion cells code synergistically in time (Fig. 6D).\n\nFigure 6: Quantifying temporal dependence of RGC coding using \u0124_DBer. (A) The auto-correlation\nfunction of a single retinal ganglion neuron. Correlation does not capture the full temporal dependence.\nWe bin with \u2206t = 1 ms bins. (B) Schematic definition of time delayed mutual information\n(dMI), and block mutual information. The information gain of the sth bin is \u03bd(s) =\nI(X_t; X_{t+1:t+s}) \u2212 I(X_t; X_{t+1:t+s\u22121}). (C) Block mutual information estimate as a function of\ngrowing block size. Note that the estimate is monotonically increasing, as expected, since adding\nnew bins can only increase the mutual information. (D) Information gain per bin assuming temporal\nindependence (dMI), and with difference between block mutual informations (\u03bd(s)). We observe\nsynergy for the time bins in the 5 to 10 ms range.\n\n6 Conclusions\n\nWe introduced two novel Bayesian entropy estimators, \u0124_DBer and \u0124_DSyn. 
These estimators use a\nhierarchical mixture-of-Dirichlets prior with a base measure designed to integrate a priori knowledge\nabout spike trains into the model. By choosing base measures with convenient symmetries,\nwe simultaneously sidestepped potentially intractable computations in the high-dimensional space\nof spike words. It remains to be seen whether these symmetries, as exemplified in the structure of\nthe synchrony distribution, are applicable across a wide range of neural data. Nevertheless, we\nshowed several examples in which these estimators, especially \u0124_DSyn, perform exceptionally well in\napplication to neural data. A MATLAB implementation of the estimators will be made available at\nhttps://github.com/pillowlab/CDMentropy.\n\nAcknowledgments\nWe thank E. J. Chichilnisky, A. M. Litke, A. Sher and J. Shlens for retinal data. This work was\nsupported by a Sloan Research Fellowship, McKnight Scholar\u2019s Award, and NSF CAREER Award\nIIS-1150186 (JP).\n\nReferences\n\n[1] K. H. Schindler, M. Palus, M. Vejmelka, and J. Bhattacharya. Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441:1\u201346, 2007.\n[2] A. R\u00e9nyi. On measures of dependence. Acta Mathematica Hungarica, 10(3-4):441\u2013451, 1959.\n[3] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. Information Theory, IEEE Transactions on, 14(3):462\u2013467, 1968.\n[4] A. Chao and T. Shen. Nonparametric estimation of Shannon\u2019s index of diversity when there are unseen species in sample. Environmental and Ecological Statistics, 10(4):429\u2013443, 2003.\n[5] P. Grassberger. Estimating the information content of symbol sequences and efficient codes. 
Information Theory, IEEE Transactions on, 35(3):669\u2013675, 1989.\n[6] S. Ma. Calculation of entropy from data of motion. Journal of Statistical Physics, 26(2):221\u2013240, 1981.\n[7] S. Panzeri, R. Senatore, M. A. Montemurro, and R. S. Petersen. Correcting for the sampling bias problem in spike train information measures. J Neurophysiol, 98(3):1064\u20131072, Sep 2007.\n[8] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191\u20131253, 2003.\n[9] W. Bialek, F. Rieke, R. R. de Ruyter van Steveninck, and D. Warland. Reading a neural code. Science, 252:1854\u20131857, 1991.\n[10] R. Strong, S. Koberle, R. de Ruyter van Steveninck, and W. Bialek. Entropy and information in neural spike trains. Physical Review Letters, 80:197\u2013202, 1998.\n[11] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191\u20131253, 2003.\n[12] R. Barbieri, L. Frank, D. Nguyen, M. Quirk, V. Solo, M. Wilson, and E. Brown. Dynamic analyses of information encoding in neural ensembles. Neural Computation, 16:277\u2013307, 2004.\n[13] M. Kennel, J. Shlens, H. Abarbanel, and E. Chichilnisky. Estimating entropy rates with Bayesian confidence intervals. Neural Computation, 17:1531\u20131576, 2005.\n[14] J. Victor. Approaches to information-theoretic analysis of neural activity. Biological theory, 1(3):302\u2013316, 2006.\n[15] J. Shlens, M. B. Kennel, H. D. I. Abarbanel, and E. J. Chichilnisky. Estimating information rates with confidence intervals in neural spike trains. Neural Computation, 19(7):1683\u20131719, Jul 2007.\n[16] V. Q. Vu, B. Yu, and R. E. Kass. Coverage-adjusted entropy estimation. Statistics in medicine, 26(21):4039\u20134060, 2007.\n[17] V. Q. Vu, B. Yu, and R. E. Kass. Information in the nonstationary case. Neural Computation, 21(3):688\u2013703, 2009, http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2008.01-08-700. 
PMID: 18928371.\n[18] E. Archer, I. M. Park, and J. Pillow. Bayesian estimation of discrete entropy with mixtures of stick-breaking priors. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2024\u20132032. MIT Press, Cambridge, MA, 2012.\n[19] I. Nemenman, F. Shafee, and W. Bialek. Entropy and inference, revisited. In Advances in Neural Information Processing Systems 14, pages 471\u2013478. MIT Press, Cambridge, MA, 2002.\n[20] M. Okun, P. Yger, S. L. Marguet, F. Gerard-Mercier, A. Benucci, S. Katzner, L. Busse, M. Carandini, and K. D. Harris. Population rate dynamics and multineuron firing patterns in sensory cortex. The Journal of Neuroscience, 32(48):17108\u201317119, 2012, http://www.jneurosci.org/content/32/48/17108.full.pdf+html.\n[21] G. Tka\u010dik, O. Marre, T. Mora, D. Amodei, M. J. Berry II, and W. Bialek. The simplest maximum entropy model for collective behavior in a neural network. Journal of Statistical Mechanics: Theory and Experiment, 2013(03):P03011, 2013.\n[22] D. Wolpert and D. Wolf. Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52(6):6841\u20136854, 1995.\n[23] I. M. Park, E. Archer, K. Latimer, and J. W. Pillow. Universal models for binary spike patterns using centered Dirichlet processes. In Advances in Neural Information Processing Systems (NIPS), 2013.\n[24] S. Panzeri, A. Treves, S. Schultz, and E. Rolls. On decoding the responses of a population of neurons from short time windows. Neural Computation, 11:1553\u20131577, 1999.\n", "award": [], "sourceid": 861, "authors": [{"given_name": "Evan", "family_name": "Archer", "institution": "UT Austin"}, {"given_name": "Il Memming", "family_name": "Park", "institution": "UT Austin"}, {"given_name": "Jonathan", "family_name": "Pillow", "institution": "UT Austin"}]}