Part of Advances in Neural Information Processing Systems 18 (NIPS 2005)

*Kristina Klinkner, Cosma Shalizi, Marcelo Camperi*

Most nervous systems encode information about stimuli in the responding activity of large neuronal networks. This activity often manifests itself as dynamically coordinated sequences of action potentials. Since multiple electrode recordings are now a standard tool in neuroscience research, it is important to have a measure of such network-wide behavioral coordination and information sharing, applicable to multiple neural spike train data. We propose a new statistic, informational coherence, which measures how much better one unit can be predicted by knowing the dynamical state of another. We argue informational coherence is a measure of association and shared information which is superior to traditional pairwise measures of synchronization and correlation. To find the dynamical states, we use a recently-introduced algorithm which reconstructs effective state spaces from stochastic time series. We then extend the pairwise measure to a multivariate analysis of the network by estimating the network multi-information. We illustrate our method by testing it on a detailed model of the transition from gamma to beta rhythms. Much of the most important information in neural systems is shared over multiple neurons or cortical areas, in such forms as population codes and distributed representations [1]. On behavioral time scales, neural information is stored in temporal patterns of activity as opposed to static markers; therefore, as information is shared between neurons or brain regions, it is physically instantiated as coordination between entire sequences of neural spikes. Furthermore, neural systems and regions of the brain often require coordinated neural activity to perform important functions; acting in concert requires multiple neurons or cortical areas to share information [2]. Thus, if we want to measure the dynamic network-wide behavior of neurons and test hypotheses about them, we need reliable, practical methods to detect and quantify behavioral coordination and the associated information sharing across multiple neural units. These would be especially useful in testing ideas about how particular forms of coordination relate to distributed coding (e.g., that of [3]). Current techniques to analyze relations among spike trains handle only pairs of neurons, so we further need a method which is extendible to analyze the coordination in the network, system, or region as a whole. Here we propose a new measure of behavioral coordination and information sharing, informational coherence, based on the notion of dynamical state. Section 1 argues that coordinated behavior in neural systems is often not captured by exist-

ing measures of synchronization or correlation, and that something sensitive to nonlinear, stochastic, predictive relationships is needed. Section 2 defines informational coherence as the (normalized) mutual information between the dynamical states of two systems and explains how looking at the states, rather than just observables, fulfills the needs laid out in Section 1. Since we rarely know the right states a prori, Section 2.1 briefly describes how we reconstruct effective state spaces from data. Section 2.2 gives some details about how we calculate the informational coherence and approximate the global information stored in the network. Section 3 applies our method to a model system (a biophysically detailed conductance-based model) comparing our results to those of more familiar second-order statistics. In the interest of space, we omit proofs and a full discussion of the existing literature, giving only minimal references here; proofs and references will appear in a longer paper now in preparation.

1

Synchrony or Coherence?

Most hypotheses which involve the idea that information sharing is reflected in coordinated activity across neural units invoke a very specific notion of coordinated activity, namely strict synchrony: the units should be doing exactly the same thing (e.g., spiking) at exactly the same time. Investigators then measure coordination by measuring how close the units come to being strictly synchronized (e.g., variance in spike times). From an informational point of view, there is no reason to favor strict synchrony over other kinds of coordination. One neuron consistently spiking 50 ms after another is just as informative a relationship as two simultaneously spiking, but such stable phase relations are missed by strict-synchrony approaches. Indeed, whatever the exact nature of the neural code, it uses temporally extended patterns of activity, and so information sharing should be reflected in coordination of those patterns, rather than just the instantaneous activity. There are three common ways of going beyond strict synchrony: cross-correlation and related second-order statistics, mutual information, and topological generalized synchrony. The cross-correlation function (the normalized covariance function; this includes, for present purposes, the joint peristimulus time histogram [2]), is one of the most widespread measures of synchronization. It can be efficiently calculated from observable series; it handles statistical as well as deterministic relationships between processes; by incorporating variable lags, it reduces the problem of phase locking. Fourier transformation of the covariance function X Y (h) yields the cross-spectrum FX Y ( ), which in turn gives the 2 spectral coherence cX Y ( ) = FX Y ( )/FX ( )FY ( ), a normalized correlation between the Fourier components of X and Y . Integrated over frequencies, the spectral coherence measures, essentially, the degree of linear cross-predictability of the two series. ([4] applies spectral coherence to coordinated neural activity.) However, such second-order statistics only handle linear relationships. Since neural processes are known to be strongly nonlinear, there is little reason to think these statistics adequately measure coordination and synchrony in neural systems. Mutual information is attractive because it handles both nonlinear and stochastic relationships and has a very natural and appealing interpretation. Unfortunately, it often seems to fail in practice, being disappointingly small even between signals which are known to be tightly coupled [5]. The major reason is that the neural codes use distinct patterns of activity over time, rather than many different instantaneous actions, and the usual approach misses these extended patterns. Consider two neurons, one of which drives the other to spike 50 ms after it does, the driving neuron spiking once every 500 ms. These are very tightly coordinated, but whether the first neuron spiked at time t conveys little information about what the second neuron is doing at t -- it's not spiking, but it's not spiking most of the time anyway. Mutual information calculated from the direct observations conflates the

"no spike" of the second neuron preparing to fire with its just-sitting-around "no spike". Here, mutual information could find the coordination if we used a 50 ms lag, but that won't work in general. Take two rate-coding neurons with base-line firing rates of 1 Hz, and suppose that a stimulus excites one to 10 Hz and suppresses the other to 0.1 Hz. The spiking rates thus share a lot of information, but whether the one neuron spiked at t is uninformative about what the other neuron did then, and lagging won't help. Generalized synchrony is based on the idea of establishing relationships between the states of the various units. "State" here is taken in the sense of physics, dynamics and control theory: the state at time t is a variable which fixes the distribution of observables at all times t, rendering the past of the system irrelevant [6]. Knowing the state allows us to predict, as well as possible, how the system will evolve, and how it will respond to external forces [7]. Two coupled systems are said to exhibit generalized synchrony if the state of one system is given by a mapping from the state of the other. Applications to data employ statespace reconstruction [8]: if the state x X evolves according to smooth, d-dimensional deterministic dynamics, and we observe a generic function y = f (x), then the space Y of time-delay vectors [y (t), y (t - ), ...y (t - (k - 1) )] is diffeomorphic to X if k > 2d, for generic choices of lag . The various versions of generalized synchrony differ on how, precisely, to quantify the mappings between reconstructed state spaces, but they all appear to be empirically equivalent to one another and to notions of phase synchronization based on Hilbert transforms [5]. Thus all of these measures accommodate nonlinear relationships, and are potentially very flexible. Unfortunately, there is essentially no reason to believe that neural systems have deterministic dynamics at experimentally-accessible levels of detail, much less that there are deterministic relationships among such states for different units. What we want, then, but none of these alternatives provides, is a quantity which measures predictive relationships among states, but allows those relationships to be nonlinear and stochastic. The next section introduces just such a measure, which we call "informational coherence".

2

States and Informational Coherence

There are alternatives to calculating the "surface" mutual information between the sequences of observations themselves (which, as described, fails to capture coordination). If we know that the units are phase oscillators, or rate coders, we can estimate their instantaneous phase or rate and, by calculating the mutual information between those variables, see how coordinated the units' patterns of activity are. However, phases and rates do not exhaust the repertoire of neural patterns and a more general, common scheme is desirable. The most general notion of "pattern of activity" is simply that of the dynamical state of the system, in the sense mentioned above. We now formalize this. Assuming the usual notation for Shannon information [9], the information content of a state variable X is H [X ] and the mutual information between X and Y is I [X ; Y ]. As is well-known, I [X ; Y ] min H [X ], H [Y ]. We use this to normalize the mutual state information to a 0 - 1 scale, and this is the informational coherence (IC). (X, Y ) = I [X ; Y ] , with 0/0 = 0 . min H [X ], H [Y ] (1)

can be interpreted as follows. I [X ; Y ] is the Kullback-Leibler divergence between the joint distribution of X and Y , and the product of their marginal distributions [9], indicating the error involved in ignoring the dependence between X and Y . The mutual information between predictive, dynamical states thus gauges the error involved in assuming the two systems are independent, i.e., how much predictions could improve by taking into account the dependence. Hence it measures the amount of dynamically-relevant information shared

between the two systems. simply normalizes this value, and indicates the degree to which two systems have coordinated patterns of behavior (cf. [10], although this only uses directly observable quantities). 2.1 Reconstruction and Estimation of Effective State Spaces

As mentioned, the state space of a deterministic dynamical system can be reconstructed from a sequence of observations. This is the main tool of experimental nonlinear dynamics [8]; but the assumption of determinism is crucial and false, for almost any interesting neural system. While classical state-space reconstruction won't work on stochastic processes, such processes do have state-space representations [11], and, in the special case of discretevalued, discrete-time series, there are ways to reconstruct the state space. Here we use the CSSR algorithm, introduced in [12] (code available at http://bactra.org/CSSR). This produces causal state models, which are stochastic automata capable of statistically-optimal nonlinear prediction; the state of the machine is a minimal sufficient statistic for the future of the observable process[13].1 The basic idea is to form a set of states which should be (1) Markovian, (2) sufficient statistics for the next observable, and (3) have deterministic transitions (in the automata-theory sense). The algorithm begins with a minimal, one-state, IID model, and checks whether these properties hold, by means of hypothesis tests. If they fail, the model is modified, generally but not always by adding more states, and the new model is checked again. Each state of the model corresponds to a distinct distribution over future events, i.e., to a statistical pattern of behavior. Under mild conditions, which do not involve prior knowledge of the state space, CSSR converges in probability to the unique causal state model of the data-generating process [12]. In practice, CSSR is quite fast (linear in the data size), and generalizes at least as well as training hidden Markov models with the EM algorithm and using cross-validation for selection, the standard heuristic [12]. One advantage of the causal state approach (which it shares with classical state-space reconstruction) is that state estimation is greatly simplified. In the general case of nonlinear state estimation, it is necessary to know not just the form of the stochastic dynamics in the state space and the observation function, but also their precise parametric values and the distribution of observation and driving noises. Estimating the state from the observable time series then becomes a computationally-intensive application of Bayes's Rule [17]. Due to the way causal states are built as statistics of the data, with probability 1 there is a finite time, t, at which the causal state at time t is certain. This is not just with some degree of belief or confidence: because of the way the states are constructed, it is impossible for the process to be in any other state at that time. Once the causal state has been established, it can be updated recursively, i.e., the causal state at time t + 1 is an explicit function of the causal state at time t and the observation at t + 1. The causal state model can be automatically converted, therefore, into a finite-state transducer which reads in an observation time series and outputs the corresponding series of states [18, 13]. (Our implementation of CSSR filters its training data automatically.) The result is a new time series of states, from which all non-predictive components have been filtered out. 2.2 Estimating the Coherence

Our algorithm for estimating the matrix of informational coherences is as follows. For each unit, we reconstruct the causal state model, and filter the observable time series to produce a series of causal states. Then, for each pair of neurons, we construct a joint histogram of 1 Causal state models have the same expressive power as observable operator models [14] or predictive state representations [7], and greater power than variable-length Markov models [15, 16].

a

b

Figure 1: Rastergrams of neuronal spike-times in the network. Excitatory, pyramidal neurons (numbers 1 to 1000) are shown in green, inhibitory interneurons (numbers 1001 to 1300) in red. During the first 10 seconds (a), the current connections among the pyramidal cells are suppressed and a gamma rhythm emerges (left). At t = 10s, those connections become active, leading to a beta rhythm (b, right).

the state distribution, estimate the mutual information between the states, and normalize by the single-unit state informations. This gives a symmetric matrix of values. Even if two systems are independent, their estimated IC will, on average, be positive, because, while they should have zero mutual information, the empirical estimate of mutual information is non-negative. Thus, the significance of IC values must be assessed against the null hypothesis of system independence. The easiest way to do so is to take the reconstructed state models for the two systems and run them forward, independently of one another, to generate a large number of simulated state sequences; from these calculate values of the IC. This procedure will approximate the sampling distribution of the IC under a null model which preserves the dynamics of each system, but not their interaction. We can then find p-values as usual. We omit them here to save space. 2.3 Approximating the Network Multi-Information

There is broad agreement [2] that analyses of networks should not just be an analysis of pairs of neurons, averaged over pairs. Ideally, an analysis of information sharing in a network would look at the over-all structure of statistical dependence between the various units, reflected in the complete joint probability distribution P of the states. This would then allow us, for instance, to calculate the n-fold multi-information, I [X1 , X2 , . . . Xn ] D(P ||Q), the Kullback-Leibler divergence between the joint distribution P and the product of marginal distributions Q, analogous to the pairwise mutual information [19]. Calculated over the predictive states, the multi-information would give the total amount of shared dynamical information in the system. Just as we normalized the mutual information I [X1 , X2 ] by its maximum possible value, min H [X1 ], H [X2 ], we normalize the multiinformation by its maximum, which is the smallest sum of n - 1 marginal entropies: i I [X1 ; X2 ; . . . Xn ] min H [Xn ] k =k

Unfortunately, P is a distribution over a very high dimensional space and so, hard to estimate well without strong parametric constraints. We thus consider approximations. The lowest-order approximation treats all the units as independent; this is the distribution Q. One step up are tree distributions, where the global distribution is a function of the joint distributions of pairs of units. Not every pair of units needs to enter into such a distribution,

though every unit must be part of some pair. Graphically, a tree distribution corresponds to a spanning tree, with edges linking units whose interactions enter into the global probability, and conversely spanning trees determine tree distributions. Writing ET for the set of pairs (i, j ) and abbreviating X1 = x1 , X2 = x2 , . . . Xn = xn by X = x, one has n ( T (Xi = xi , Xj = xj ) i T (X = x) = T (Xi = xi ) (2) T (Xi = xi )T (Xj = xj ) =1 i,j )ET

where the marginal distributions T (Xi ) and the pair distributions T (Xi , Xj ) are estimated by the empirical marginal and pair distributions. We must now pick edges ET so that T best approximates the true global distribution P . A natural approach is to minimize D(P ||T ), the divergence between P and its tree approximation. Chow and Liu [20] showed that the maximum-weight spanning tree gives the divergence-minimizing distribution, taking an edge's weight to be the mutual information between the variables it links. There are three advantages to using the Chow-Liu approximation. (1) Estimating T from empirical probabilities gives a consistent maximum likelihood estimator of the ideal ChowLiu tree [20], with reasonable rates of convergence, so T can be reliably known even if P cannot. (2) There are efficient algorithms for constructing maximum-weight spanning trees, such as Prim's algorithm [21, sec. 23.2], which runs in time O(n2 + n log n). Thus, the approximation is computationally tractable. (3) The KL divergence of the Chow-Liu distribution from Q gives a lower bound on the network multi-information; that bound is just the sum of the mutual informations along the edges in the tree: ( I [Xi ; Xj ] (3) I [X1 ; X2 ; . . . Xn ] D(T ||Q) = i,j )ET

Even if we knew P exactly, Eq. 3 would be useful as an alternative to calculating D(P ||Q) directly, evaluating log P (x)/Q(x) for all the exponentially-many configurations x. It is natural to seek higher-order approximations to P , e.g., using three-way interactions not decomposable into pairwise interactions [22, 19]. But it is hard to do so effectively, because finding the optimal approximation to P when such interactions are allowed is NP [23], and analytical formulas like Eq. 3 generally do not exist [19]. We therefore confine ourselves to the Chow-Liu approximation here.

3

Example: A Model of Gamma and Beta Rhythms

We use simulated data as a test case, instead of empirical multiple electrode recordings, which allows us to try the method on a system of over 1000 neurons and compare the measure against expected results. The model, taken from [24], was originally designed to study episodes of gamma (3080Hz) and beta (1230Hz) oscillations in the mammalian nervous system, which often occur successively with a spontaneous transition between them. More concretely, the rhythms studied were those displayed by in vitro hippocampal (CA1) slice preparations and by in vivo neocortical EEGs. The model contains two neuron populations: excitatory (AMPA) pyramidal neurons and inhibitory (GABAA ) interneurons, defined by conductance-based Hodgkin-Huxley-style equations. Simulations were carried out in a network of 1000 pyramidal cells and 300 interneurons. Each cell was modeled as a one-compartment neuron with all-to-all coupling, endowed with the basic sodium and potassium spiking currents, an external applied current, and some Gaussian input noise. The first 10 seconds of the simulation correspond to the gamma rhythm, in which only a group of neurons is made to spike via a linearly increasing applied current. The beta rhythm

Do not remove: This comment is monitored to verify that the site is working properly