{"title": "Robust learning of low-dimensional dynamics from large neural ensembles", "book": "Advances in Neural Information Processing Systems", "page_first": 2391, "page_last": 2399, "abstract": "Recordings from large populations of neurons make it possible to search for hypothesized low-dimensional dynamics. Finding these dynamics requires models that take into account biophysical constraints and can be fit efficiently and robustly. Here, we present an approach to dimensionality reduction for neural data that is convex, does not make strong assumptions about dynamics, does not require averaging over many trials and is extensible to more complex statistical models that combine local and global influences. The results can be combined with spectral methods to learn dynamical systems models. The basic method can be seen as an extension of PCA to the exponential family using nuclear norm minimization. We evaluate the effectiveness of this method using an exact decomposition of the Bregman divergence that is analogous to variance explained for PCA. We show on model data that the parameters of latent linear dynamical systems can be recovered, and that even if the dynamics are not stationary we can still recover the true latent subspace. We also demonstrate an extension of nuclear norm minimization that can separate sparse local connections from global latent dynamics. Finally, we demonstrate improved prediction on real neural data from monkey motor cortex compared to fitting linear dynamical models without nuclear norm smoothing.", "full_text": "Robust learning of low-dimensional dynamics from\n\nlarge neural ensembles\n\nDavid Pfau\n\nLiam Paninski\n\nEftychios A. 
Pnevmatikakis\n\nCenter for Theoretical Neuroscience\n\nDepartment of Statistics\n\nGrossman Center for the Statistics of Mind\n\nColumbia University, New York, NY\n\npfau@neurotheory.columbia.edu\n\n{eftychios,liam}@stat.columbia.edu\n\nAbstract\n\nRecordings from large populations of neurons make it possible to search for hy-\npothesized low-dimensional dynamics. Finding these dynamics requires models\nthat take into account biophysical constraints and can be \ufb01t ef\ufb01ciently and ro-\nbustly. Here, we present an approach to dimensionality reduction for neural data\nthat is convex, does not make strong assumptions about dynamics, does not require\naveraging over many trials and is extensible to more complex statistical models\nthat combine local and global in\ufb02uences. The results can be combined with spec-\ntral methods to learn dynamical systems models. The basic method extends PCA\nto the exponential family using nuclear norm minimization. We evaluate the effec-\ntiveness of this method using an exact decomposition of the Bregman divergence\nthat is analogous to variance explained for PCA. We show on model data that\nthe parameters of latent linear dynamical systems can be recovered, and that even\nif the dynamics are not stationary we can still recover the true latent subspace.\nWe also demonstrate an extension of nuclear norm minimization that can separate\nsparse local connections from global latent dynamics. Finally, we demonstrate\nimproved prediction on real neural data from monkey motor cortex compared to\n\ufb01tting linear dynamical models without nuclear norm smoothing.\n\n1 Introduction\nProgress in neural recording technology has made it possible to record spikes from ever larger pop-\nulations of neurons [1]. Analysis of these large populations suggests that much of the activity can\nbe explained by simple population-level dynamics [2]. 
Typically, this low-dimensional activity is\nextracted by principal component analysis (PCA) [3, 4, 5], but in recent years a number of exten-\nsions have been introduced in the neuroscience literature, including jPCA [6] and demixed principal\ncomponent analysis (dPCA) [7]. A downside of these methods is that they do not treat either the\ndiscrete nature of spike data or the positivity of \ufb01ring rates in a statistically principled way. Standard\npractice smooths the data substantially or averages it over many trials, losing information about \ufb01ne\ntemporal structure and inter-trial variability.\nOne alternative is to \ufb01t a more complex statistical model directly from spike data, where temporal\ndependencies are attributed to latent low dimensional dynamics [8, 9]. Such models can account for\nthe discreteness of spikes by using point-process models for the observations, and can incorporate\ntemporal dependencies into the latent state model. State space models can include complex inter-\nactions such as switching linear dynamics [10] and direct coupling between neurons [11]. These\nmethods have drawbacks too: they are typically \ufb01t by approximate EM [12] or other methods that\nare prone to local minima, the number of latent dimensions is typically chosen ahead of time, and a\ncertain class of possible dynamics must be chosen before doing dimensionality reduction.\nIn this paper we attempt to combine the computational tractability of PCA and related methods with\nthe statistical richness of state space models. Our approach is convex and based on recent advances\nin system identi\ufb01cation using nuclear norm minimization [13, 14, 15], a convex relaxation of matrix\nrank minimization. 
Compared to recent work on spectral methods for \ufb01tting state space models\n[16], our method more easily generalizes to handle different nonlinearities, non-Gaussian, non-\nlinear, and non-stationary latent dynamics, and direct connections between observed neurons. When\napplied to model data, we \ufb01nd that: (1) low-dimensional subspaces can be accurately recovered,\neven when the dynamics are unknown and nonstationary (2) standard spectral methods can robustly\nrecover the parameters of state space models when applied to data projected into the recovered\nsubspace (3) the confounding effects of common input for inferring sparse synaptic connectivity can\nbe ameliorated by accounting for low-dimensional dynamics. In applications to real data we \ufb01nd\ncomparable performance to models trained by EM with less computational overhead, particularly as\nthe number of latent dimensions grows.\nThe paper is organized as follows. In section 2 we introduce the class of models we aim to \ufb01t,\nwhich we call low-dimensional generalized linear models (LD-GLM). In section 3 we present a\nconvex formulation of the parameter learning problem for these models, as well as a generalization\nof variance explained to LD-GLMs used for evaluating results. In section 4 we show how to \ufb01t these\nmodels using the alternating direction method of multipliers (ADMM). In section 5 we present\nresults on real and arti\ufb01cial neural datasets. We discuss the results and future directions in section 6.\n\n2 Low dimensional generalized linear models\nOur model is closely related to the generalized linear model (GLM) framework for neural data [17].\nUnlike the standard GLM, where the inputs driving the neurons are observed, we assume that the\ndriving activity is unobserved, but lies on some low dimensional subspace. This can be a useful\nway of capturing spontaneous activity, or accounting for strong correlations in large populations of\nneurons. 
Thus, instead of \ufb01tting a linear receptive \ufb01eld, the goal of learning in low-dimensional\nGLMs is to accurately recover the latent subspace of activity.\nLet xt \u2208 Rm be the value of the dynamics at time t. To turn this into spiking activity, we project\nthis into the space of neurons: yt = Cxt + b is a vector in Rn, n \u226b m, where each dimension of yt\ncorresponds to one neuron. C \u2208 Rn\u00d7m denotes the subspace of the neural population and b \u2208 Rn\nthe bias vector for all the neurons. As yt can take on negative values, we cannot use this directly as\na \ufb01ring rate, and so we pass each element of yt through some convex and log-concave increasing\npoint-wise nonlinearity f : R \u2192 R+. Popular choices for nonlinearities include f (x) = exp(x) and\nf (x) = log(1 + exp(x)). To account for biophysical effects such as refractory periods, bursting, and\ndirect synaptic connections, we include a linear dependence on spike history before the nonlinearity.\nThe \ufb01ring rate f (yt) is used as the rate for some point process \u03be such as a Poisson process to generate\na vector of spike counts st for all neurons at that time:\n\ny_t = C x_t + \\sum_{\\tau=1}^{k} D_\\tau s_{t-\\tau} + b \\qquad (1)\n\ns_t \\sim \\xi(f(y_t)) \\qquad (2)\n\nMuch of this paper is focused on estimating yt, which is the natural parameter for the Poisson\ndistribution in the case f (\u00b7) = exp(\u00b7), and so we refer to yt as the natural rate to avoid confusion\nwith the actual rate f (yt). We will see that our approach works with any point process with a\nlog-concave likelihood, not only Poisson processes.\nWe can extend this simple model by adding dynamics to the low-dimensional latent state, including\ninput-driven dynamics. In this case the model is closely related to the common input model used\nin neuroscience [11], the difference being that the observed input is added to xt rather than being\ndirectly mapped to yt. 
The case without history terms and with linear Gaussian dynamics is a well-\nstudied state space model for neural data, usually \ufb01t by EM [19, 12, 20], though a consistent spectral\nmethod has been derived [16] for the case f (\u00b7) = exp(\u00b7). Unlike these methods, our approach\nlargely decouples the problem of dimensionality reduction and learning dynamics: even in the case\nof nonstationary, non-Gaussian dynamics where A, B and Cov[\u03b5] change over time, we can still\nrobustly recover the latent subspace spanned by xt.\n\n3 Learning\n3.1 Nuclear norm minimization\n\nIn the case that the spike history terms D1:k are zero, the natural rate at time t is yt = Cxt + b, so all\nyt are elements of some m-dimensional af\ufb01ne space given by the span of the columns of C offset by\nb. Ideally, our estimate of y1:T would trade off between making the dimension of this af\ufb01ne space\nas low as possible and the likelihood of y1:T as high as possible. Let Y = [y1, . . . , yT ] be the n \u00d7 T\nmatrix of natural rates and let \\mathcal{A}(\u00b7) be the row mean centering operator \\mathcal{A}(Y) = Y - \\frac{1}{T} Y 1_T 1_T^T.\nThen rank(\\mathcal{A}(Y)) = m. Ideally we would minimize \\lambda nT \\, \\mathrm{rank}(\\mathcal{A}(Y)) - \\sum_{t=1}^{T} \\log p(s_t|y_t), where\n\u03bb controls how much we trade off between a simple solution and the likelihood of the data; however,\ngeneral rank minimization is a hard nonconvex problem. Instead we replace the matrix rank with\nits convex envelope: the sum of singular values or nuclear norm \u2016 \u00b7 \u2016\u2217 [13], which can be seen as\nthe analogue of the \u21131 norm for vector sparsity. Our problem then becomes:\n\n\\min_{Y} \\; \\lambda \\sqrt{nT} \\, \\|\\mathcal{A}(Y)\\|_* - \\sum_{t=1}^{T} \\log p(s_t|y_t) \\qquad (3)\n\nSince the log likelihood scales linearly with the size of the data, and the singular values scale with\nthe square root of the size, we also add a factor of \\sqrt{nT} in front of the nuclear norm term. In the\nexamples in this paper, we assume spikes are drawn from a Poisson distribution:\n\n\\log p(s_t|y_t) = \\sum_{i=1}^{n} \\left[ s_{it} \\log f(y_{it}) - f(y_{it}) - \\log s_{it}! \\right] \\qquad (4)\n\nHowever, this method can be used with any point process with a log-concave likelihood. This can be\nviewed as a convex formulation of exponential family PCA [21, 22] which does not \ufb01x the number\nof principal components ahead of time.\n\n3.2 Stable principal component pursuit\n\nThe model above is appropriate for cases where the spike history terms D\u03c4 are zero, that is the\nobserved data can entirely be described by some low-dimensional global dynamics. In real data\nneurons exhibit history-dependent behavior like bursting and refractory periods. Moreover if the\nrecorded neurons are close to each other some may have direct synaptic connections. In this case\nD\u03c4 may have full column rank, so from Eq. 1 it is clear that yt is no longer restricted to a low-\ndimensional af\ufb01ne space. In most practical cases we expect D\u03c4 to be sparse, since most neurons are\nnot connected to one another. In this case the natural rates matrix combines a low-rank term and a\nsparse term, and we can minimize a convex function that trades off between the rank of one term via\nthe nuclear norm, the sparsity of another via the \u21131 norm, and the data log likelihood:\n\n\\min_{Y, D_{1:k}, L} \\; \\lambda \\sqrt{nT} \\, \\|\\mathcal{A}(L)\\|_* + \\gamma \\frac{T}{n} \\sum_{\\tau=1}^{k} \\|D_\\tau\\|_1 - \\sum_{t=1}^{T} \\log p(s_t|y_t) \\qquad (5)\n\ns.t. Y = L + \\sum_{\\tau=1}^{k} D_\\tau S_\\tau, with S_\\tau = [0_{n,\\tau}, s_1, \\ldots, s_{T-\\tau}],\n\nwhere 0_{n,\\tau} is a matrix of zeros of size n \u00d7 \u03c4, used to account for boundary effects. This is an exten-\nsion of stable principal component pursuit [23], which separates sparse and low-rank components\nof a noise-corrupted matrix. Again to ensure that every term in the objective function of Eq. 
5 has\nroughly the same scaling O(nT ), we have multiplied each \u21131 norm by T /n. One can also consider\nthe use of a group sparsity penalty where each group collects a speci\ufb01c synaptic weight across all\nthe k time lags.\n\n3.3 Evaluation through Bregman divergence decomposition\n\nWe need a way to evaluate the model on held out data, without assuming a particular form for the\ndynamics. As we recover a subspace spanned by the columns of Y rather than a single parameter,\nthis presents a challenge. One option is to compute the marginal likelihood of the data integrated\nover the entire subspace, but this is computationally dif\ufb01cult. For the case of PCA, we can project\nthe held out data onto a subspace spanned by principal components and compute what fraction of\ntotal variance is explained by this subspace. We extend this approach beyond the linear Gaussian\ncase by use of a generalized Pythagorean theorem.\nFor any exponential family with natural parameters \u03b8, link function g, function F such that\n\u2207F = g^\u22121 and suf\ufb01cient statistic T , the log likelihood can be written as DF [\u03b8||g(T (x))] \u2212 h(x),\nwhere D\u00b7[\u00b7||\u00b7] is a Bregman divergence [24]: DF [x||y] = F (x) \u2212 F (y) \u2212 (x \u2212 y)^T \u2207F (y). Intu-\nitively, the Bregman divergence between x and y is the difference between the value of F (x) and\nthe value of the best linear approximation around y. Bregman divergences obey a generalization\nof the Pythagorean theorem: for any af\ufb01ne set \u2126 and points x \u2209 \u2126 and y \u2208 \u2126, it follows that\nDF [x||y] = DF [x||\u03a0\u2126(x)] + DF [\u03a0\u2126(x)||y] where \u03a0\u2126(x) = arg min_{\u03c9\u2208\u2126} DF [x||\u03c9] is the projec-\ntion of x onto \u2126. In the case of squared error this is just a linear projection, and for the case of GLM\nlog likelihoods this is equivalent to maximum likelihood estimation when the natural parameters are\nrestricted to \u2126.\nGiven a matrix of natural rates recovered from training data, we compute the fraction of Bregman\ndivergence explained by a sequence of subspaces as follows. Let ui be the ith singular vector of\nthe recovered natural rates. Let b be the mean natural rate, and let y_t^{(q)} be the maximum likelihood\nnatural rates restricted to the space spanned by u1, . . . , uq:\n\ny_t^{(q)} = \\sum_{i=1}^{q} u_i v_{it}^{(q)} + \\sum_{\\tau=1}^{k} D_\\tau s_{t-\\tau} + b\n\nv_t^{(q)} = \\arg\\max_{v} \\log p\\left( s_t \\,\\middle|\\, \\sum_{i=1}^{q} u_i v_{it} + \\sum_{\\tau=1}^{k} D_\\tau s_{t-\\tau} + b \\right) \\qquad (6)\n\nHere v_t^{(q)} is the projection of y_t^{(q)} onto the singular vectors. Then the divergence from the mean\nexplained by the qth dimension is given by\n\n\\frac{\\sum_t D_F\\left[ y_t^{(q-1)} \\,\\|\\, y_t^{(q)} \\right]}{\\sum_t D_F\\left[ y_t^{(0)} \\,\\|\\, g(s_t) \\right]} \\qquad (7)\n\nwhere y_t^{(0)} is the bias b plus the spike history terms. The sum of divergences explained over all q is\nequal to one by virtue of the generalized Pythagorean theorem. For Gaussian noise g(x) = x and\nF (x) = \\frac{1}{2}\\|x\\|^2 and this is exactly the variance explained by each principal component, while for\nPoisson noise g(x) = log(x) and F (x) = \\sum_i \\exp(x_i). This decomposition is only exact if f = g^{-1}\nin Eq. 4, that is, if the nonlinearity is exponential. 
However, for other nonlinearities this may still be\na useful approximation, and gives us a principled way of evaluating the goodness of \ufb01t of a learned\nsubspace.\n\n4 Algorithms\nMinimizing Eq. 3 and Eq. 5 is dif\ufb01cult, because the nuclear and \u21131 norms are not differentiable\neverywhere. By using the alternating direction method of multipliers (ADMM), we can turn these\nproblems into a sequence of tractable subproblems [25]. While not always the fastest method for\nsolving a particular problem, we use it for its simplicity and generality. We describe the algorithm\nbelow, with more details in the supplemental materials.\n\n4.1 Nuclear norm minimization\n\nTo \ufb01nd the optimal Y we alternate between minimizing an augmented Lagrangian with respect to Y ,\nminimizing with respect to an auxiliary variable Z, and performing gradient ascent on a Lagrange\nmultiplier \u039b. The augmented Lagrangian is\n\nL_\\rho(Y, Z, \\Lambda) = \\lambda \\sqrt{nT} \\, \\|Z\\|_* - \\sum_t \\log p(s_t|y_t) + \\langle \\Lambda, \\mathcal{A}(Y) - Z \\rangle + \\frac{\\rho}{2} \\|\\mathcal{A}(Y) - Z\\|_F^2 \\qquad (8)\n\nwhich is a smooth function of Y and can be minimized by Newton\u2019s method. The gradient and\nHessian of L\u03c1 with respect to Y at iteration k are\n\n\\nabla_Y L_\\rho = -\\nabla_Y \\sum_t \\log p(s_t|y_t) + \\rho \\mathcal{A}(Y) - \\mathcal{A}^T(\\rho Z_k - \\Lambda_k) \\qquad (9)\n\n\\nabla_Y^2 L_\\rho = -\\nabla_Y^2 \\sum_t \\log p(s_t|y_t) + \\rho I_{nT} - \\frac{\\rho}{T} (1_T \\otimes I_n)(1_T \\otimes I_n)^T \\qquad (10)\n\nwhere \u2297 is the Kronecker product. Note that the \ufb01rst two terms of the Hessian are diagonal and\nthe third is low-rank, so the Newton step can be computed in O(nT ) time by using the Woodbury\nmatrix inversion lemma.\nThe minimum of Eq. 
8 with respect to Z is given exactly by singular value thresholding:\n\nZ_{k+1} = U S_{\\lambda \\sqrt{nT}/\\rho}(\\Sigma) V^T, \\qquad (11)\n\nwhere U \u03a3V^T is the singular value decomposition of A(Yk+1) + \u039bk/\u03c1, and St(\u00b7) is the (pointwise)\nsoft thresholding operator St(x) = sgn(x)max(0,|x| \u2212 t). Finally, the update to \u039b is a simple\ngradient ascent step: \u039bk+1 = \u039bk + \u03c1(A(Yk+1) \u2212 Zk+1) where \u03c1 is a step size that can be chosen.\n\n4.2 Stable principal component pursuit\n\nTo extend ADMM to the problem in Eq. 5 we only need to add one extra step, taking the minimum\nover the connectivity matrices with the other parameters held \ufb01xed. To simplify the notation, we\ngroup the connectivity matrices into a single matrix D = (D1, . . . , Dk), and stack the different\ntime-shifted matrices of spike histories on top of one another to form a single spike history matrix\nH. The objective then becomes\n\n\\min_{Y, D} \\; \\lambda \\sqrt{nT} \\, \\|\\mathcal{A}(Y - DH)\\|_* + \\gamma \\frac{T}{n} \\|D\\|_1 - \\sum_t \\log p(s_t|y_t) \\qquad (12)\n\nwhere we have substituted Y \u2212 DH for the variable L, and the augmented Lagrangian is\n\nL_\\rho(Y, Z, D, \\Lambda) = \\lambda \\sqrt{nT} \\, \\|Z\\|_* + \\gamma \\frac{T}{n} \\|D\\|_1 - \\sum_t \\log p(s_t|y_t) + \\langle \\Lambda, \\mathcal{A}(Y - DH) - Z \\rangle + \\frac{\\rho}{2} \\|\\mathcal{A}(Y - DH) - Z\\|_F^2 \\qquad (13)\n\nThe updates for \u039b and Z are almost unchanged, except that A(Y ) becomes A(Y \u2212 DH). Likewise\nfor Y the only change is one additional term in the gradient:\n\n\\nabla_Y L_\\rho = -\\nabla_Y \\sum_t \\log p(s_t|y_t) + \\rho \\mathcal{A}(Y) - \\mathcal{A}^T(\\rho Z + \\rho \\mathcal{A}(DH) - \\Lambda) \\qquad (14)\n\nMinimizing over D requires solving:\n\n\\arg\\min_{D} \\; \\gamma \\frac{T}{n} \\|D\\|_1 + \\frac{\\rho}{2} \\|\\mathcal{A}(DH) + Z - \\mathcal{A}(Y) - \\Lambda/\\rho\\|_F^2 \\qquad (15)\n\nThis objective has the same form as LASSO regression. 
We solve this using ADMM as well, but\nany method for LASSO regression can be substituted.\n\n5 Experiments\nWe demonstrate our method on a number of arti\ufb01cial datasets and one real dataset. First, we show\nin the absence of spike history terms that the true low dimensional subspace can be recovered in\nthe limit of large data, even when the dynamics are nonstationary. Second, we show that spectral\nmethods can accurately recover the transition matrix when dynamics are linear. Third, we show\nthat local connectivity can be separated from low-dimensional common input. Lastly, we show that\nnuclear-norm penalized subspace recovery leads to improved prediction on real neural data recorded\nfrom macaque motor cortex.\nModel data was generated with 8 latent dimensions and 200 neurons, without any external input. For\nlinear dynamical systems, the transition matrix was sampled from a Gaussian distribution, and the\neigenvalues rescaled so the magnitude fell between .9 and .99 and the angle between \u00b1\u03c0/10, yielding\nslow and stable dynamics. The linear projection C was a random Gaussian matrix with standard\ndeviation 1/3, and the biases bi were sampled from N (\u22124, 1), which we found gave reasonable\n\ufb01ring rates with nonlinearity f (x) = log(1 + exp(x)). To investigate the variance of our estimates,\nwe generated multiple trials of data with the same parameters but different innovations.\n\nFigure 1: Recovering low-dimensional subspaces from nonstationary model data. While the subspace remains\nthe same, the dynamics switch between 5 different linear systems. Left top: one dimension of the latent\ntrajectory, switching from one set of dynamics to another (red line). Left middle: \ufb01ring rates of a subset of\nneurons during the same switch. Left bottom: covariance between spike counts for different neurons during\neach epoch of linear dynamics. Right top: Angle between the true subspace and top principal components\ndirectly from spike data, from natural rates recovered by nuclear norm minimization, and from the true natural\nrates. Right bottom: fraction of Bregman divergence explained by the top 1, 5 or 10 dimensions from nuclear\nnorm minimization. Dotted lines are variance explained by the same number of principal components. For\n\u03bb < 0.1 the divergence explained by a given number of dimensions exceeds the variance explained by the\nsame number of PCs.\n\nWe \ufb01rst sought to show that we could accurately recover the subspace in which the dynamics take\nplace even when those dynamics are not stationary. We split each trial into 5 epochs and in each\nepoch resampled the transition matrix A and set the covariance of innovations \u03b5t to QQ^T where Q\nis a random Gaussian matrix. We performed nuclear norm minimization on data generated from\nthis model, varying the smoothing parameter \u03bb from 10^\u22123 to 10, and compared the subspace angle\nbetween the top 8 principal components and the true matrix C. We repeated this over 10 trials to\ncompute the variance of our estimator. We found that when smoothing was optimized the recovered\nsubspace was signi\ufb01cantly closer to the true subspace than the top principal components taken di-\nrectly from spike data. Increasing the amount of data from 1000 to 10000 time bins signi\ufb01cantly\nreduced the average subspace angle at the optimal \u03bb. The top PCs of the true natural rates Y , while\nnot spanning exactly the same space as C due to differences between the mean column and true bias\nb, were still closer to the true subspace than the result of nuclear norm minimization.\nWe also computed the fraction of Bregman divergence explained by the sequence of spaces spanned\nby successive principal components, solving Eq. 
6 by Newton\u2019s method. We did not \ufb01nd a clear\ndrop at the true dimensionality of the subspace, but we did \ufb01nd that a larger share of the divergence\ncould be explained by the top dimensions than by PCA directly on spikes. Results are presented in\nFig. 1.\nTo show that the parameters of a latent dynamical system can be recovered, we investigated the\nperformance of spectral methods on model data with linear Gaussian latent dynamics. As the model\nis a linear dynamical system with GLM output, we call this a GLM-LDS model. After estimating\nnatural rates by nuclear norm minimization with \u03bb = 0.01 on 10 trials of 10000 time bins with\nunit-variance innovations \u03b5t, we \ufb01t the transition matrix A by subspace identi\ufb01cation (SSID) [26].\nThe transition matrix is only identi\ufb01able up to a change of coordinates, so we evaluated our \ufb01t by\ncomparing the eigenvalues of the true and estimated A. Results are presented in Fig. 2. As expected,\nSSID directly on spikes led to biased estimates of the transition. By contrast, SSID on the output of\nnuclear norm minimization had little bias, and seemed to perform almost as well as SSID directly\non the true natural rates. We found that other methods for \ufb01tting linear dynamical systems from the\nestimated natural rates were biased, as was SSID on the result of nuclear norm minimization without\nmean-centering (see the supplementary material for more details).\n\n[Figure 1 plot labels. Right top: Subspace Angle versus \u03bb, with curves for T = 1000 and T = 10000 from Spikes, NN and True Y. Right bottom: Divergence Explained versus \u03bb, with curves for 1 Dim, 5 Dim and 10 Dim.]\n\nFigure 2: Recovered eigenvalues for the transition matrix of a linear dynamical system from model neural data.\nBlack: true eigenvalues. Red: recovered eigenvalues. (2a) Eigenvalues recovered from the true natural rates.\n(2b) Eigenvalues recovered from subspace identi\ufb01cation directly on spike counts. (2c) Eigenvalues recovered\nfrom subspace identi\ufb01cation on the natural rates estimated by nuclear norm minimization.\n\nWe incorporated spike history terms into our model data to see whether local connectivity and global\ndynamics could be separated. Our model network consisted of 50 neurons, randomly connected with\n95% sparsity, and synaptic weights sampled from a unit variance Gaussian. Data were sampled from\n10000 time bins. The parameters \u03bb and \u03b3 were both varied from 10^\u221210 to 10^4. We found that we\ncould recover synaptic weights with an r^2 up to 0.4 on this data by combining both a nuclear norm and\n\u21131 penalty, compared to at most 0.25 for an \u21131 penalty alone, or 0.33 for a nuclear norm penalty alone.\nSomewhat surprisingly, at the extreme of either no nuclear norm penalty or a dominant nuclear norm\npenalty, increasing the \u21131 penalty never improved estimation. This suggests that in a regime with\nstrong common inputs, some kind of correction is necessary not only for sparse penalties to achieve\noptimal performance, but to achieve any improvement over maximum likelihood. It is also of interest\nthat the peak in r^2 is near a sharp transition to total sparsity.\n\nFigure 3: Connectivity matrices recovered by SPCP on model data. Left: r^2 between true and recovered\nsynaptic weights across a range of parameters. The position in parameter space of the data to the right is\nhighlighted by the stars. Axes are on a log scale. 
Right: scatter plot of true versus recovered synaptic weights,\nillustrating the effect of the nuclear norm term.\n\n[Figure 2 plot labels: axes Imaginary Component and Real Component; three panels with legends True and Best Empirical Estimate, True and SSID, True and NN+SSID. Figure 3 plot labels: r^2 for Synaptic Weights over \u03bb and \u03b3; Synaptic Weights, True versus Recovered, for Optimal, Small \u03bb and Large \u03bb.]\n\nFinally, we demonstrated the utility of our method on real recordings from a large population of\nneurons. The data consists of 125 well-isolated units from a multi-electrode recording in macaque\nmotor cortex while the animal was performing a pinball task in two dimensions. Previous studies on\nthis data [27] have shown that information about arm velocity can be reliably decoded. As the elec-\ntrodes are spaced far apart, we do not expect any direct connections between the units, and so leave\nout the \u21131 penalty term from the objective. We used 800 seconds of data binned every 100 ms for\ntraining and 200 seconds for testing. We \ufb01t linear dynamical systems by subspace identi\ufb01cation as in\nFig. 2, but as we did not have access to a \u201ctrue\u201d linear dynamical system for comparison, we evalu-\nated our model \ufb01ts by approximating the held out log likelihood by Laplace-Gaussian \ufb01ltering [28].\nWe also \ufb01t the GLM-LDS model by running ran-\ndomly initialized EM for 50 iterations for models\nwith up to 30 latent dimensions (beyond which 
train-\ning was prohibitively slow). We found that a strong\nnuclear norm penalty improved prediction by several\nhundred bits per second, and that fewer dimensions\nwere needed for optimal prediction as the nuclear\nnorm penalty was increased. The best \ufb01t models pre-\ndicted held out data nearly as well as models trained\nvia EM, even though nuclear norm minimization is\nnot directly maximizing the likelihood of a linear dy-\nnamical system.\n\nFigure 4: Log likelihood of held out motor cortex\ndata versus number of latent dimensions for dif-\nferent latent linear dynamical systems. Prediction\nimproves as \u03bb increases, until it is comparable to\nEM.\n\n6 Discussion\nThe method presented here has a number of straight-\nforward extensions. If the dimensionality of the la-\ntent state is greater than the dimensionality of the\ndata, for instance when there are long-range history\ndependencies in a small population of neurons, we\nwould extend the natural rate matrix Y so that each\ncolumn contains multiple time steps of data. Y is then a block-Hankel matrix. Constructing the\nblock-Hankel matrix is also a linear operation, so the objective is still convex and can be ef\ufb01ciently\nminimized [15]. If there are also observed inputs ut then the term inside the nuclear norm should\nalso include a projection orthogonal to the row space of the inputs. This could enable joint learning\nof dynamics and receptive \ufb01elds for small populations of neurons with high dimensional inputs.\nOur model data results on connectivity inference have important implications for practitioners work-\ning with highly correlated data. GLM models with sparsity penalties have been used to infer connec-\ntivity in real neural networks [29], and in most cases these networks are only partially observed and\nhave large amounts of common input. 
We offer one promising route to removing the confounding\nin\ufb02uence of unobserved correlated inputs, which explicitly models the common input rather than\nconditioning on it [30].\nIt remains an open question what kinds of dynamics can be learned from the recovered natural\nparameters. In this paper we have focused on linear systems, but nuclear norm minimization could\njust as easily be combined with spectral methods for switching linear systems and general nonlinear\nsystems. We believe that the techniques presented here offer a powerful, extensible and robust\nframework for extracting structure from neural activity.\n\nAcknowledgments\n\nThanks to Zhang Liu, Michael C. Grant, Lars Buesing and Maneesh Sahani for helpful discussions,\nand Nicho Hatsopoulos for providing data. This research was generously supported by an NSF\nCAREER grant.\n\nReferences\n[1] I. H. Stevenson and K. P. Kording, \u201cHow advances in neural recording affect data analysis,\u201d Nature\n\nneuroscience, vol. 14, no. 2, pp. 139\u2013142, 2011.\n\n[2] M. Okun, P. Yger, S. L. Marguet, F. Gerard-Mercier, A. Benucci, S. Katzner, L. Busse, M. Carandini, and\nK. D. Harris, \u201cPopulation rate dynamics and multineuron \ufb01ring patterns in sensory cortex,\u201d The Journal\nof Neuroscience, vol. 32, no. 48, pp. 17108\u201317119, 2012.\n\n[3] K. L. Briggman, H. D. I. Abarbanel, and W. B. Kristan, \u201cOptical imaging of neuronal populations during\n\ndecision-making,\u201d Science, vol. 307, no. 5711, pp. 896\u2013901, 2005.\n\n[4] C. K. Machens, R. Romo, and C. D. Brody, \u201cFunctional, but not anatomical, separation of \u201cwhat\u201d and\n\n\u201cwhen\u201d in prefrontal cortex,\u201d The Journal of Neuroscience, vol. 30, no. 1, pp. 350\u2013360, 2010.\n\n[5] M. Stopfer, V. Jayaraman, and G. Laurent, \u201cIntensity versus identity coding in an olfactory system,\u201d\n\nNeuron, vol. 39, no. 6, pp. 
991\u20131004, 2003.\n\n[Figure 4 plot: \u201cPrediction of Held out Data from GLM\u2212LDS\u201d; x-axis: Number of Latent Dimensions (0\u201350); y-axis: Log Likelihood (bits/s); curves: \u03bb = 1.00e\u221204, \u03bb = 1.00e\u221203, \u03bb = 1.00e\u221202, \u03bb = 3.16e\u221202, EM.]\n\n[6] M. M. Churchland, J. P. Cunningham, M. T. Kaufman, J. D. Foster, P. Nuyujukian, S. I. Ryu, and K. V. Shenoy, \u201cNeural population dynamics during reaching,\u201d Nature, 2012.\n\n[7] W. Brendel, R. Romo, and C. K. Machens, \u201cDemixed principal component analysis,\u201d Advances in Neural Information Processing Systems, vol. 24, pp. 1\u20139, 2011.\n\n[8] L. Paninski, Y. Ahmadian, D. G. Ferreira, S. Koyama, K. R. Rad, M. Vidne, J. Vogelstein, and W. Wu, \u201cA new look at state-space models for neural data,\u201d Journal of Computational Neuroscience, vol. 29, no. 1-2, pp. 107\u2013126, 2010.\n\n[9] B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani, \u201cGaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity,\u201d Journal of Neurophysiology, vol. 102, no. 1, pp. 614\u2013635, 2009.\n\n[10] B. Petreska, B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani, \u201cDynamical segmentation of single trials from population neural data,\u201d Advances in Neural Information Processing Systems, vol. 24, 2011.\n\n[11] J. E. Kulkarni and L. Paninski, \u201cCommon-input models for multiple neural spike-train data,\u201d Network: Computation in Neural Systems, vol. 18, no. 4, pp. 375\u2013407, 2007.\n\n[12] A. Smith and E. Brown, \u201cEstimating a state-space model from point process observations,\u201d Neural Computation, vol. 15, no. 5, pp. 965\u2013991, 2003.\n\n[13] M. Fazel, H. Hindi, and S. P. Boyd, \u201cA rank minimization heuristic with application to minimum order system approximation,\u201d Proceedings of the American Control Conference, vol. 6, pp. 
4734\u20134739, 2001.\n\n[14] Z. Liu and L. Vandenberghe, \u201cInterior-point method for nuclear norm approximation with application to system identification,\u201d SIAM Journal on Matrix Analysis and Applications, vol. 31, pp. 1235\u20131256, 2009.\n\n[15] Z. Liu, A. Hansson, and L. Vandenberghe, \u201cNuclear norm system identification with missing inputs and outputs,\u201d Systems & Control Letters, vol. 62, no. 8, pp. 605\u2013612, 2013.\n\n[16] L. Buesing, J. Macke, and M. Sahani, \u201cSpectral learning of linear dynamics from generalised-linear observations with application to neural population data,\u201d Advances in Neural Information Processing Systems, vol. 25, 2012.\n\n[17] L. Paninski, J. Pillow, and E. Simoncelli, \u201cMaximum likelihood estimation of a stochastic integrate-and-fire neural encoding model,\u201d Neural Computation, vol. 16, no. 12, pp. 2533\u20132561, 2004.\n\n[18] E. Chornoboy, L. Schramm, and A. Karr, \u201cMaximum likelihood identification of neural point process systems,\u201d Biological Cybernetics, vol. 59, no. 4-5, pp. 265\u2013275, 1988.\n\n[19] J. Macke, J. Cunningham, M. Byron, K. Shenoy, and M. Sahani, \u201cEmpirical models of spiking in neural populations,\u201d Advances in Neural Information Processing Systems, vol. 24, 2011.\n\n[20] M. Collins, S. Dasgupta, and R. E. Schapire, \u201cA generalization of principal component analysis to the exponential family,\u201d Advances in Neural Information Processing Systems, vol. 14, 2001.\n\n[21] V. Solo and S. A. Pasha, \u201cPoint-process principal components analysis via geometric optimization,\u201d Neural Computation, vol. 25, no. 1, pp. 101\u2013122, 2013.\n\n[22] Z. Zhou, X. Li, J. Wright, E. Candes, and Y. Ma, \u201cStable principal component pursuit,\u201d Proceedings of the IEEE International Symposium on Information Theory, pp. 1518\u20131522, 2010.\n\n[23] A. Banerjee, S. Merugu, I. S. Dhillon, and J. 
Ghosh, \u201cClustering with Bregman divergences,\u201d The Journal of Machine Learning Research, vol. 6, pp. 1705\u20131749, 2005.\n\n[24] S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, \u201cDistributed optimization and statistical learning via the alternating direction method of multipliers,\u201d Foundations and Trends\u00ae in Machine Learning, vol. 3, no. 1, pp. 1\u2013122, 2011.\n\n[25] P. Van Overschee and B. De Moor, \u201cSubspace identification for linear systems: theory, implementation, applications,\u201d 1996.\n\n[26] V. Lawhern, W. Wu, N. Hatsopoulos, and L. Paninski, \u201cPopulation decoding of motor cortical activity using a generalized linear model with hidden states,\u201d Journal of Neuroscience Methods, vol. 189, no. 2, pp. 267\u2013280, 2010.\n\n[27] S. Koyama, L. Castellanos P\u00e9rez-Bolde, C. R. Shalizi, and R. E. Kass, \u201cApproximate methods for state-space models,\u201d Journal of the American Statistical Association, vol. 105, no. 489, pp. 170\u2013180, 2010.\n\n[28] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. Chichilnisky, and E. P. Simoncelli, \u201cSpatio-temporal correlations and visual signalling in a complete neuronal population,\u201d Nature, vol. 454, no. 7207, pp. 995\u2013999, 2008.\n\n[29] M. Harrison, \u201cConditional inference for learning the network structure of cortical microcircuits,\u201d in 2012 Joint Statistical Meeting, (San Diego, CA), 2012.", "award": [], "sourceid": 1136, "authors": [{"given_name": "David", "family_name": "Pfau", "institution": "Columbia University"}, {"given_name": "Eftychios", "family_name": "Pnevmatikakis", "institution": "Columbia University"}, {"given_name": "Liam", "family_name": "Paninski", "institution": "Columbia University"}]}