{"title": "Approximate Gaussian process inference for the drift function in stochastic differential equations", "book": "Advances in Neural Information Processing Systems", "page_first": 2040, "page_last": 2048, "abstract": "We introduce a nonparametric approach for estimating drift functions in systems of stochastic differential equations from incomplete observations of the state vector. Using a Gaussian process prior over the drift as a function of the state vector, we develop an approximate EM algorithm to deal with the unobserved, latent dynamics between observations. The posterior over states is approximated by a piecewise linearized process and the MAP estimation of the drift is facilitated by a sparse Gaussian process regression.", "full_text": "Approximate Gaussian process inference for the drift\n\nof stochastic differential equations\n\nAndreas Ruttor\n\nPhilipp Batz\n\nComputer Science, TU Berlin\n\nComputer Science, TU Berlin\n\nandreas.ruttor@tu-berlin.de\n\nphilipp.batz@tu-berlin.de\n\nManfred Opper\n\nComputer Science, TU Berlin\n\nmanfred.opper@tu-berlin.de\n\nAbstract\n\nWe introduce a nonparametric approach for estimating drift functions in systems\nof stochastic differential equations from sparse observations of the state vector.\nUsing a Gaussian process prior over the drift as a function of the state vector, we\ndevelop an approximate EM algorithm to deal with the unobserved, latent dynam-\nics between observations. The posterior over states is approximated by a piecewise\nlinearized process of the Ornstein-Uhlenbeck type and the MAP estimation of the\ndrift is facilitated by a sparse Gaussian process regression.\n\n1\n\nIntroduction\n\nGaussian process (GP) inference methods have been successfully applied to models for dynamical\nsystems, see e.g. [1\u20133]. Usually, these studies have dealt with discrete time dynamics, where one\nuses a GP prior for modeling transition function and the measurement function of the system. 
On the other hand, many dynamical systems in the physical world evolve in continuous time, and their noisy dynamics is described naturally in terms of stochastic differential equations (SDEs). SDEs have also attracted considerable interest in the NIPS community in recent years [4-7]. So far most inference approaches have dealt with the posterior prediction of state variables between observations (smoothing) and the estimation of parameters contained in the drift function, which governs the deterministic part of the microscopic time evolution. Since the drift is usually a nonlinear function of the state vector, a nonparametric estimation using Gaussian process priors is a natural choice when a large amount of data is available. A recent result of [8, 9] presented an important step in this direction: the authors have shown that GPs are a conjugate family to SDE likelihoods. In fact, if an entire path of dense observations of the state dynamics is available, the posterior process over the drift is exactly a GP. Unfortunately, this simplicity is lost when observations are not dense but separated by larger time intervals. In [8] this sparse, incomplete observation case has been treated by a Gibbs sampler, which alternates between sampling complete state paths of the SDE and creating GP samples for the drift. A first nontrivial problem is the sampling of SDE paths conditioned on observations. A second is that the densely sampled hidden paths are equivalent to a large number of imputed observations, for which the matrix inversions required by the GP posterior predictions can become computationally costly. 
It was shown in [8] that, in the univariate case, for GP priors whose precision operators (the inverses of the covariance kernels) are differential operators, efficient predictions can be realized in terms of the solutions of differential equations.

In this paper, we develop an alternative approximate expectation maximization (EM) method for inference from sparse observations, which is faster than the sampling approach and can also be applied to arbitrary kernels and multivariate SDEs. In the E-step we approximate expectations over state paths by those of a locally fitted Ornstein-Uhlenbeck model. The M-step for computing the maximum posterior GP prediction of the drift depends on a continuum of function values and is thus approximated by a sparse GP.

The paper is organized as follows. Section 2 introduces stochastic differential equations and section 3 discusses GP based inference for completely observed paths. In section 4 our approximate EM algorithm is derived, and its performance is demonstrated on a variety of SDEs in section 6. Section 7 presents a discussion.

2 Stochastic differential equations

We consider continuous-time Markov processes of the diffusion type, where the dynamics of a d-dimensional state vector X_t \in R^d is given by the stochastic differential equation (SDE)

dX_t = f(X_t) dt + D^{1/2} dW.   (1)

The vector function f(x) = (f^1(x), ..., f^d(x)) defines the deterministic drift and W is a Wiener process, which models additive white noise. D is the diffusion matrix, which we assume to be independent of x. We will not attempt a rigorous treatment of probability measures over continuous time paths here, but will mostly assume for our derivations that the process can be approximated by a discrete time process X_t in the Euler-Maruyama discretization [10], where the times t \in G are on a regular grid G = {0, \Delta t, 2\Delta t, . .
.} and where \Delta t is some small microscopic time. The discretized process is given by

X_{t+\Delta t} - X_t = f(X_t) \Delta t + D^{1/2} \sqrt{\Delta t}\, \epsilon_t,   (2)

where \epsilon_t \sim N(0, I) is a sequence of i.i.d. Gaussian noise vectors. We will usually take the limit \Delta t \to 0 only in expressions where (Riemann) sums are over nonrandom quantities, i.e. where expectations over paths have been carried out and can be replaced by ordinary integrals.

3 Bayesian Inference for dense observations

Suppose we observe a path of n d-dimensional observations X_{0:T} = (X_t)_{t \in G} over the time interval [0, T]. Since for \Delta t \to 0 the transition probabilities of the process are Gaussian,

p_f(X_{0:T}|f) \propto \exp[ -\frac{1}{2\Delta t} \sum_t ||X_{t+\Delta t} - X_t - f(X_t)\Delta t||^2 ],   (3)

the probability density for the path with a given drift function f \doteq (f(X_t))_{t \in G} at these observations can be written as the product

p_f(X_{0:T}|f) = p_0(X_{0:T}) L(X_{0:T}|f),   (4)

where

p_0(X_{0:T}) \propto \exp[ -\frac{1}{2\Delta t} \sum_t ||X_{t+\Delta t} - X_t||^2 ]   (5)

is the measure over paths without drift, i.e. a discretized version of the Wiener measure, and a term which we will call likelihood in the following,

L(X_{0:T}|f) = \exp[ -\frac{1}{2} \sum_t ||f(X_t)||^2 \Delta t + \sum_t \langle f(X_t), X_{t+\Delta t} - X_t \rangle ].   (6)

Here we have introduced the inner product \langle u, v \rangle \doteq u^T D^{-1} v and the corresponding squared norm ||u||^2 \doteq u^T D^{-1} u to avoid cluttered notation.

To attempt a nonparametric Bayesian estimate of the drift function f(x), we note that the exponent in (6) contains the drift f at most quadratically. Hence it becomes clear that a conjugate prior to the drift for this model is given by a Gaussian process, i.e. we assume for each component f \sim P_0(f) = GP(0, K), where K is a kernel [11], a fact which was recently observed in [8]. 
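Since the exponent in (6) is at most quadratic in f, the discretized log-likelihood is cheap to evaluate for any candidate drift. A minimal numerical sketch (numpy; the double-well drift 4(x - x^3) from section 6 and all grid settings here are illustrative choices, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Euler-Maruyama path (2) for the 1-d double-well drift, with D = 1.
f = lambda x: 4.0 * (x - x**3)
dt, M, D = 0.01, 20_000, 1.0
X = np.empty(M)
X[0] = 0.0
for t in range(M - 1):
    X[t + 1] = X[t] + f(X[t]) * dt + np.sqrt(D * dt) * rng.standard_normal()

def log_L(X, drift, dt, D=1.0):
    # Discretized log-likelihood (6): -1/2 sum_t ||f(X_t)||^2 dt + sum_t <f(X_t), dX_t>,
    # with <u, v> = u v / D in one dimension.
    fx, dX = drift(X[:-1]), np.diff(X)
    return -0.5 * np.sum(fx**2 / D) * dt + np.sum(fx * dX / D)
```

On a long simulated path the true drift scores a far higher log-likelihood than, say, the zero drift; this quadratic-in-f structure is exactly what makes the GP prior conjugate.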
Figure 1: The left figure shows a snippet of the double well sample path in black and observations as red dots. The right picture displays the estimated drift function for the double well model after initialization, where the red line denotes the true drift function and the black line the mean function with corresponding 95%-confidence bounds (twice the standard deviation) in blue. One can clearly see that the larger distance between the consecutive points leads to a wrong prediction.

We denote probabilities over the drift f by upper case symbols in order to avoid confusion with path probabilities. Although a more general model is possible, we will restrict ourselves to the case where the GP priors over the components f^j(x), j = 1, ..., d of the drift are independent (with usually different kernels) and we assume that we have a diagonal diffusion matrix D = diag(\sigma_1^2, ..., \sigma_d^2). In this case, the GP posteriors of f^j(x) are independent, too, and we can estimate drift components independently by ordinary GP regression. We define data vectors d_j = ((X^j_{t+\Delta t} - X^j_t)/\Delta t)^T_{t \in G \setminus \{T\}}, the kernel matrix K_j = (K^j(X_s, X_t))_{s,t \in G}, and the test vector k_j(x) = (K^j(x, X_t))^T_{t \in G}. Then a standard calculation [11] shows that the posterior process over drift functions f has posterior mean and variance at an arbitrary point x given by

\bar{f}^j(x) = k_j(x)^T ( K_j + \frac{\sigma_j^2}{\Delta t} I )^{-1} d_j,   \sigma^2_{f^j}(x) = K^j(x, x) - k_j(x)^T ( K_j + \frac{\sigma_j^2}{\Delta t} I )^{-1} k_j(x).   (7)

Note that \sigma_j^2/\Delta t plays the role of the variance of the observation noise in the standard regression case. In practice, the number of observations can be quite large for a fine time discretization, and a fast computation of (7) could become infeasible. 
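In the dense-observation regime, (7) is ordinary GP regression with targets (X_{t+\Delta t} - X_t)/\Delta t and noise variance \sigma^2/\Delta t. A sketch (the RBF kernel and all numerical settings are illustrative choices, not the paper's exact experimental setup):

```python
import numpy as np

rng = np.random.default_rng(1)

# Densely observed path of the double-well SDE dx = 4(x - x^3) dt + dW.
f_true = lambda x: 4.0 * (x - x**3)
dt, M, sigma2 = 0.01, 2000, 1.0
X = np.empty(M)
X[0] = 1.0
for t in range(M - 1):
    X[t + 1] = X[t] + f_true(X[t]) * dt + np.sqrt(sigma2 * dt) * rng.standard_normal()

def rbf(a, b, ell=0.5):
    # Squared-exponential kernel matrix between 1-d point sets a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

# GP regression (7): targets d_t = (X_{t+dt} - X_t)/dt, noise variance sigma^2/dt.
d = np.diff(X) / dt
Xtr = X[:-1]
K = rbf(Xtr, Xtr)
alpha = np.linalg.solve(K + (sigma2 / dt) * np.eye(M - 1), d)
xs = np.linspace(-1.5, 1.5, 7)
f_mean = rbf(xs, Xtr) @ alpha  # posterior mean of the drift at test points
```

The posterior mean tracks the drift well in regions the path actually visits; far from the data the zero prior mean takes over.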
A possible way out of this problem, as suggested by [8], could be a restriction to kernels for which the inverse kernel, the precision operator, is a differential operator. A well known machine learning approach, which is based on a sparse Gaussian process approximation, applies to arbitrary kernels and generalizes easily to multivariate SDEs. We have resorted specifically to the optimal Kullback-Leibler sparsity [1, 12], where the likelihood term of a GP model is replaced by another effective likelihood, which depends only on a smaller set of variables f_s.

4 MAP Inference for sparse observations

The simple GP regression approach outlined in the previous section cannot be applied when observations are sparse in time. In this setting, we assume that n observations y_k \doteq X_{\tau_k}, k = 1, ..., n are obtained at (for simplicity) regular intervals \tau_k = k\tau, where \tau \gg \Delta t is much larger than the microscopic time scale. In this case, a discretization of (6), where the sum over the microscopic grid t \in G is replaced by a sum over macroscopic times \tau_k and \Delta t by \tau, would correspond to a discrete time dynamical model of the form (1), again with \Delta t replaced by \tau. But this discretization would give a bad approximation to the true SDE dynamics. The estimator of the drift would give some (approximate) estimate of the mean of the transition kernel over macroscopic times \tau. However, this usually does not give a good approximation of the original drift. 
This can be seen in figure 1, where the red line corresponds to the true drift (of the so called double-well model [4]) and the black line to its prediction based on observations with \tau = 0.2 and the naive estimation method.

To deal with this problem, we treat the process X_t for times t between consecutive observations, k\tau < t < (k+1)\tau, as a latent unobserved random variable with a posterior path measure given by

p(X_{0:T}|y, f) \propto p(X_{0:T}|f) \prod_{k=1}^{n} \delta(y_k - X_{k\tau}),   (8)

where y is the collection of observations y_k and \delta(\cdot) denotes the Dirac distribution encoding the fact that the process is known perfectly at times \tau_k. Our goal is to use an EM algorithm to compute the maximum posterior (MAP) prediction for the drift function f(x). Unfortunately, exact posterior expectations are intractable and one needs to work with suitable approximations.

4.1 Approximate EM algorithm

The EM algorithm cycles between two steps:

1. In the E-step, we compute the expected negative logarithm of the complete data likelihood

L(f, q) = -E_q[\ln L(X_{0:T}|f)],   (9)

where q denotes a measure over paths which approximates the intractable posterior p(X_{0:T}|y, f_old) for the previous estimate f_old of the drift.

2. In the M-step, we recompute the drift function as

f_new = arg min_f ( L(f, q) - \ln P_0(f) ).   (10)

To compute the expectation in the E-step, we use (6) and take the limit \Delta t \to 0 at the end, when expectations have been computed. 
As f(x) is a time-independent function, this yields

-E_q[\ln L(X_{0:T}|f)] = \lim_{\Delta t \to 0} \frac{1}{2} \sum_t E_q[ ||f(X_t)||^2 \Delta t - 2\langle f(X_t), X_{t+\Delta t} - X_t \rangle ]
= \frac{1}{2} \int_0^T E_q[ ||f(X_t)||^2 - 2\langle f(X_t), g_t(X_t) \rangle ] dt
= \frac{1}{2} \int ||f(x)||^2 A(x) dx - \int \langle f(x), y(x) \rangle dx.   (11)

Here q_t(x) is the marginal density of X_t computed from the approximate posterior path measure q. We have also defined the corresponding approximate posterior drift

g_t(x) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} E_q[X_{t+\Delta t} - X_t | X_t = x],   (12)

as well as the functions

A(x) = \int_0^T q_t(x) dt   and   y(x) = \int_0^T g_t(x) q_t(x) dt.   (13)

There are two main problems for a practical realization of this EM algorithm:

1. We need to find tractable path measures q, which lead to good approximations for marginal densities and posterior drifts given arbitrary prior drift functions f(x).

2. The M-step requires a functional optimization, because (11) shows that L(f, q) - \ln P_0(f) is actually a functional of f(x), i.e. it contains a continuum of values f(x), where x \in R^d.

4.2 Linear drift approximation: The Ornstein-Uhlenbeck bridge

For given drift f(\cdot) and times t \in I_k in the interval I_k = [k\tau; (k+1)\tau] between two consecutive observations, the exact posterior marginal p_t(x) equals the density of X_t = x conditioned on the fact that X_{k\tau} = y_k and X_{(k+1)\tau} = y_{k+1}. This can be expressed by the transition densities of the homogeneous Markov diffusion process with drift f(x). We denote this quantity by p_s(X_{t+s}|X_t), the density of the random variable X_{t+s} at time t + s conditioned on X_t at time t. 
Using the Markov property, this yields the representation

p_t(x) \propto p_{(k+1)\tau - t}(y_{k+1}|x)\, p_{t - k\tau}(x|y_k)   for t \in I_k.   (14)

As functions of t and x, the second factor fulfills a forward Fokker-Planck equation and the first one a Kolmogorov backward equation [13]. Both are partial differential equations. Since exact computations are not feasible for general drift functions, we approximate the transition density p_s(x|x_k) in each interval I_k by that of a process where the drift f(x) is replaced by its local linearization

f(x) \approx f_{ou}(x, t) = f(x_k) - \Gamma_k (x - x_k)   with   \Gamma_k = -\nabla f(x_k).   (15)

This is equivalent to assuming that for t \in I_k the dynamics is approximated by the homogeneous Ornstein-Uhlenbeck process [13]

dX_t = [f(y_k) - \Gamma_k (X_t - y_k)] dt + D^{1/2} dW,   (16)

which is also used to build computationally efficient hierarchical models [14, 15], as in this case the marginal posterior can be calculated analytically. 
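Concretely, the transition mean and covariance of the linearized process (16) can be computed with a single block matrix exponential. A sketch of that computation (using scipy's `expm`; we write the block with -\Gamma_k in the upper left, the sign convention for which the covariance stays finite for stable \Gamma_k, and check the result against the scalar OU variance D(1 - e^{-2\Gamma s})/(2\Gamma)):

```python
import numpy as np
from scipy.linalg import expm

def ou_transition(Gamma, D, f_yk, yk, s):
    """Mean and covariance of the linearized OU process (16) after time s.

    Mean: alpha + exp(-Gamma s)(y_k - alpha), with alpha = y_k + Gamma^{-1} f(y_k).
    Covariance S_s = A_s B_s^{-1} from a block matrix exponential.
    """
    d = Gamma.shape[0]
    alpha = yk + np.linalg.solve(Gamma, f_yk)
    mean = alpha + expm(-Gamma * s) @ (yk - alpha)
    # Block trick: [A_s; B_s] = expm(s [[-Gamma, D], [0, Gamma^T]]) [0; I],
    # so that A_s B_s^{-1} = int_0^s exp(-Gamma u) D exp(-Gamma^T u) du.
    blk = np.block([[-Gamma, D], [np.zeros((d, d)), Gamma.T]])
    AB = expm(s * blk) @ np.vstack([np.zeros((d, d)), np.eye(d)])
    A_s, B_s = AB[:d], AB[d:]
    S_s = A_s @ np.linalg.inv(B_s)
    return mean, S_s
```

In one dimension with Gamma = [[2.0]] and D = [[1.0]] the returned covariance matches the textbook OU formula, which is a convenient sanity check for the multivariate code.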
Here the transition density is a multivariate Gaussian

q_s^{(k)}(x|y) = N( x | \alpha_k + e^{-\Gamma_k s}(y - \alpha_k); S_s ),   (17)

where \alpha_k = y_k + \Gamma_k^{-1} f(y_k) is the stationary mean and the variance S_s = A_s B_s^{-1} is calculated using the matrix exponential

[A_s; B_s] = \exp( [ -\Gamma_k, D; 0, \Gamma_k^T ] s ) [0; I].   (18)

Then we obtain the Gaussian approximation q_t^{(k)}(x) = N(x | m(t); C(t)) of the marginal posterior for t \in I_k by multiplying the two transition densities, where

C(t) = ( e^{-\Gamma_k^T (t_{k+1} - t)} S^{-1}_{t_{k+1} - t} e^{-\Gamma_k (t_{k+1} - t)} + S^{-1}_{t - t_k} )^{-1}   (20)

and

m(t) = C(t)\, e^{-\Gamma_k^T (t_{k+1} - t)} S^{-1}_{t_{k+1} - t} ( y_{k+1} - \alpha_k + e^{-\Gamma_k (t_{k+1} - t)} \alpha_k ) + C(t)\, S^{-1}_{t - t_k} ( \alpha_k + e^{-\Gamma_k (t - t_k)} (y_k - \alpha_k) ).   (21)

By inspecting mean and variance we see that the distribution is equivalent to a bridge between the points X = y_k and X = y_{k+1} and collapses to point masses at these points.

Within this approximation, we can estimate parameters such as the diffusion D using the approximate evidence

p(y|f) \approx p_{ou}(y) = p(x_1) \prod_{k=1}^{n-1} q_\tau^{(k)}(y_{k+1}|y_k).   (19)

Finally, in this approximation we obtain for the posterior drift

g_t(x) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} E[X_{t+\Delta t} - X_t | X_t = x, X_\tau = y_{k+1}]
= f(y_k) - \Gamma_k (x - y_k) + D e^{-\Gamma_k^T (t_{k+1} - t)} S^{-1}_{t_{k+1} - t} ( y_{k+1} - \alpha_k - e^{-\Gamma_k (t_{k+1} - t)} (x - \alpha_k) ),

as shown in appendix A in the supplementary material.

4.3 Sparse M-step approximation

To cope with the functional optimization, we resort to a sparse approximation for replacing the infinite set f by a
sparse set f_s. Here the GP posterior (for each component of the drift) is replaced by one that is closest in the KL sense. Following appendix B in the supplementary material, we find that in the sparse approximation the likelihood (11) is replaced by

L_s(f, q) = \frac{1}{2} \int ||E_0[f(x)|f_s]||^2 A(x) dx - \int \langle E_0[f(x)|f_s], y(x) \rangle dx,   (22)

where the conditional expectation is over the GP prior. In order to avoid cluttered notation, it should be noted that in the following results for a component f^j, the quantities \Lambda_s, f_s, k_s, K_s^{-1}, y(x), \sigma^2 depend on the component j, similar to (7), but A(x) does not.

This is easily computed as

E_0[f(x)|f_s] = k_s^T(x) K_s^{-1} f_s.

Hence

L_s(f, q) = \frac{1}{2} f_s^T \Lambda_s f_s - f_s^T d_s

with

\Lambda_s = \frac{1}{\sigma^2} K_s^{-1} \{ \int k_s(x) A(x) k_s^T(x) dx \} K_s^{-1},   d_s = \frac{1}{\sigma^2} K_s^{-1} \int k_s(x) y(x) dx.   (23)

With these results, the approximate MAP estimate is

\bar{f}_s(x) = k_s^T(x) (I + \Lambda_s K_s)^{-1} d_s.   (24)

The integrals over x in (23) can be computed analytically for many kernels of interest such as polynomial and RBF ones. However, we have done this for 1-dimensional models only. For higher dimensions, we found it more efficient to treat both the time integration in (13) and the x integrals by sampling, where time points t are drawn uniformly at random and x points from the multivariate Gaussian q_t(x).

A related expression for the variance, \sigma_s^2(x) = K(x, x) - k_s^T(x) (I + \Lambda_s K_s)^{-1} \Lambda_s k_s(x), can only be viewed as a crude estimate, because it does not include the impact of the GP fluctuations on the path probabilities.

5 A crude estimate of an approximation error

Unfortunately, there is no guarantee that this approximation to the EM algorithm will always increase the exact likelihood p(y|f). 
Here, we will develop a crude estimate of how p(y|f) differs from the Ornstein-Uhlenbeck approximation (19) to lowest order in the difference \delta f(X_t, t) \doteq f(X_t) - f_{ou}(X_t, t) between the drift function and its approximation.

Our estimate is based on the exact expression

p(y|f) = \int dp_0(X_{0:T})\, e^{\ln L(X_{0:T}|f)} \prod_{k=1}^{n} \delta(y_k - X_{k\tau}),   (25)

where the Wiener measure p_0 is defined in (5) and the likelihood L(X_{0:T}|f) in (6). The Ornstein-Uhlenbeck approximation (19) can be expressed in a similar way: we just have to replace L(X_{0:T}|f) by a functional L_{ou}(X_{0:T}|f), which in turn is obtained by replacing f(X_t) with the linearized drift f_{ou}(X_t, t) in (6). The difference in free energies (negative log evidences) can be expressed exactly by an expectation over the posterior OU processes and then expanded (similar to a cumulant expansion) in a Taylor series in \Delta L = -\ln(L/L_{ou}). The first two terms are given by

\Delta F \doteq -\{ \ln p(y|f) - \ln p_{ou}(y) \} = -\ln E_q[ e^{-\Delta L} ] \approx E_q[\Delta L] - \frac{1}{2} Var_q[\Delta L] \pm ...   (26)

The computation of the first term is similar to (11) and requires only the marginal q_t and the posterior g_t. The second term contains the posterior variance and requires two-time covariances of the OU process. We concentrate on the first term, which we further expand in the difference \delta f(X_t, t). This yields

\Delta F \approx E_q[\Delta L] \approx \int_0^T E_q[ \langle \delta f(X_t, t), f_{ou}(X_t, t) - g_t(X_t) \rangle ] dt.   (27)

This expression could be evaluated in order to estimate the influence of nonlinear parts of the drift on the approximation error.

6 Experiments

In all experiments, we used different versions of the following general kernel, which is a linear combination of an RBF and a polynomial kernel,

K(x_1, x_2) = c\, \sigma_{RBF} \exp( -\frac{(x_1 - x_2)^T (x_1 - x_2)}{2 l_{RBF}^2} ) + (1 - c)(1 + x_1^T x_2)^p,   (28)

where the hyperparameters \sigma_{RBF} and l_{RBF} denote the variance and length scale of the RBF kernel and p denotes the order of the polynomial kernel.

Also, we determined the sparse points for the GP algorithm in each case by first constructing a histogram over the observations and then selecting the midpoints of all histogram bins which contained at least a certain number b_min of observations. In our experiments, we chose b_min = 5.

Figure 2: The figures show the estimated drift functions for the double well model (left) and the periodic diffusion model (right) after completion of the EM algorithm. Again, the black and blue lines denote mean and 95%-confidence bounds, while the red lines indicate the true drift functions.

6.1 One-dimensional toy models

First we test our algorithm on two toy data sets: the double well model with dynamics given by the SDE

dx = 4(x - x^3) dt + dW   (29)

and a diffusion model driven by a periodic drift,

dx = \sin(x) dt + dW.   (30)

For both models, we simulated a path of size M = 10^5 on a regular grid with width \Delta t = 0.01 from the corresponding SDE and kept every 20th sample point as observation, resulting in N = 5000 data points. 
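This data-generation step is a direct application of the Euler-Maruyama scheme (2); a sketch (with a smaller M than in the paper to keep the example fast, and an arbitrary seed):

```python
import numpy as np

def simulate_sde(f, x0, dt, M, D=1.0, seed=0):
    """Euler-Maruyama path: X_{t+dt} = X_t + f(X_t) dt + sqrt(D dt) eps_t."""
    rng = np.random.default_rng(seed)
    X = np.empty(M)
    X[0] = x0
    for t in range(M - 1):
        X[t + 1] = X[t] + f(X[t]) * dt + np.sqrt(D * dt) * rng.standard_normal()
    return X

double_well = lambda x: 4.0 * (x - x**3)
path = simulate_sde(double_well, x0=0.0, dt=0.01, M=10_000)
obs = path[::20]  # keep every 20th point, i.e. observation spacing tau = 0.2
```

The subsampled `obs` plays the role of the sparse observations y_k fed to the EM algorithm, while the full `path` corresponds to the latent dynamics.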
We initialized the EM algorithm by running the sparse GP for the observation points without any imputation, and subsequently computed the expectation operators by analytically evaluating the expressions on the same time grid as the simulated path and summing over the time steps. An alternative initialization strategy, which consists of generating a full trajectory of the same size as the original path using Brownian bridge sampling between observations, did not bring any noticeable performance improvements. Since we cannot guarantee that the likelihood increases in every iteration due to the approximation in the E-step, we resort to a simple heuristic and assume convergence once L stabilizes up to some minor fluctuation. In our experiments convergence was typically attained after a few (< 10) iterations. For the double well model we used an equal weighting c = 0.5 between kernels with hyperparameters \sigma_RBF = 1, l_RBF = 0.5 and p = 5, whereas for the periodic model we used an RBF kernel (c = 1) with the same values for \sigma_RBF and l_RBF.

6.2 Application to a real data set

As an example of a real world data set, we used the NGRIP ice core data (provided by the Niels Bohr Institute in Copenhagen, http://www.iceandclimate.nbi.ku.dk/data/), which provides an undisturbed ice core record containing climatic information stretching back into the last glacial. Specifically, this data set, as shown in figure 3, contains 4918 observations of oxygen isotope concentration \delta 18O over a time period from the present to roughly 1.23 \cdot 10^5 years into the past. Since there are generally fewer isotopes in ice formed under cold conditions, the isotope concentration can be regarded as an indicator of past temperatures.

Recent research [16] suggests modeling the rapid paleoclimatic changes exhibited in the data set by a simple dynamical system with a polynomial drift function of order p = 3 as a canonical model which allows for bistability. 
This corresponds to a metastable state at higher temperatures close to marginal stability and a stable state at low values, which is consistent with other research on this data set, linking a stable state of oxygen isotopes to a baseline temperature and a region at higher values corresponding to the occurrence of rapid temperature spikes. For this particular problem we first tried to determine the diffusion constant \sigma of the data. To this end, we estimated the likelihood of the data set for 40 fixed values of \sigma in an interval from 0.3 to 11.5 by running the EM algorithm with a polynomial kernel (c = 0) of order p = 3 for each value in turn. The resulting drift function with the highest likelihood is shown in figure 3. The result seems to confirm the existence of a metastable state of oxygen isotope concentration and a stable state at lower values.

Figure 3: The figure on the left displays the NGRIP data set, while the picture on the right shows the estimated drift in black with corresponding 95%-confidence bounds denoting twice the standard deviation in blue for the optimal diffusion value \hat{\sigma} = 2.9.

Figure 4: The left figure shows the empirical density for the two-dimensional model, together with the vector fields of the actual drift function given in blue and the estimated drift given in red. The right picture shows a snippet from the full sample in black together with the first 20 observations denoted by red dots.

6.3 Two-dimensional toy model

As an example of a two-dimensional system, we simulated from a process with the following SDE:

dx = ( x(1 - x^2 - y^2) - y ) dt + dW_1,   (31)
dy = ( y(1 - x^2 - y^2) + y ) dt + dW_2.   (32)

For this model we simulated a path of size M = 10^6 on a regular grid with width \Delta t = 0.002 from the corresponding SDE and kept every 100th sample point as observation, resulting in N = 10^4 data points. 
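All experiments instantiate the combined kernel (28) with different settings of c and p. A vectorized sketch (rows of the inputs are data points; the defaults mirror the double-well settings c = 0.5, \sigma_RBF = 1, l_RBF = 0.5, p = 5):

```python
import numpy as np

def kernel(x1, x2, c=0.5, sigma_rbf=1.0, l_rbf=0.5, p=5):
    """Mixed kernel (28): c * RBF + (1 - c) * polynomial of order p."""
    x1, x2 = np.atleast_2d(x1), np.atleast_2d(x2)  # shape (n, d), (m, d)
    sq = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    rbf = sigma_rbf * np.exp(-sq / (2.0 * l_rbf**2))
    poly = (1.0 + x1 @ x2.T) ** p
    return c * rbf + (1.0 - c) * poly
```

With c = 1 this reduces to the RBF kernel used for the periodic model, and with c = 0 to the polynomial kernel used for the NGRIP and two-dimensional experiments.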
In the inference shown in figure 4 we used a polynomial kernel (c = 0) of order p = 4.

7 Discussion

It would be interesting to replace the ad hoc local linear approximation of the posterior drift by a more flexible time-dependent Gaussian model. This could be optimized in a variational EM approximation by minimizing a free energy in the E-step, which contains the Kullback-Leibler divergence between the linear and true processes. Such a method could be extended to noisy observations and to the case where some components of the state vector are not observed. Finally, this method could be turned into a variational Bayesian approximation, where one optimizes posteriors over both drifts and state paths. The path probabilities are then influenced by the uncertainties in the drift estimation, which would lead to more realistic predictions of error bars.

Acknowledgments This work was supported by the European Community's Seventh Framework Programme (FP7, 2007-2013) under the grant agreement 270327 (CompLACS).

References

[1] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. JMLR W&CP, 5:567-574, 2009.

[2] Marc Deisenroth and Shakir Mohamed. Expectation propagation in Gaussian process dynamical systems. In Advances in Neural Information Processing Systems 25, pages 2618-2626, 2012.

[3] Jonathan Ko and Dieter Fox. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots, 27(1):75-90, July 2009.

[4] Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, and John Shawe-Taylor. Variational inference for diffusion processes. In Advances in Neural Information Processing Systems 20, pages 17-24. MIT Press, Cambridge, MA, 2008.

[5] José Bento Ayres Pereira, Morteza Ibrahimi, and Andrea Montanari. Learning networks of stochastic differential equations. In Advances in Neural Information Processing Systems 23, pages 172-180, 2010.

[6] Danilo J. Rezende, Daan Wierstra, and Wulfram Gerstner. Variational learning for recurrent spiking networks. In Advances in Neural Information Processing Systems 24, pages 136-144, 2011.

[7] Simon Lyons, Amos Storkey, and Simo Sarkka. The coloured noise expansion and parameter estimation of diffusion processes. In Advances in Neural Information Processing Systems 25, pages 1961-1969, 2012.

[8] Omiros Papaspiliopoulos, Yvo Pokern, Gareth O. Roberts, and Andrew M. Stuart. Nonparametric estimation of diffusions: a differential equations approach. Biometrika, 99(3):511-531, 2012.

[9] Yvo Pokern, Andrew M. Stuart, and J.H. van Zanten. Posterior consistency via precision operators for Bayesian nonparametric drift estimation in SDEs. Stochastic Processes and their Applications, 123(2):603-628, 2013.

[10] P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer, New York, corrected edition, June 2011.

[11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[12] Lehel Csató, Manfred Opper, and Ole Winther. TAP Gibbs free energy, belief propagation and sparsity. In Advances in Neural Information Processing Systems 14, pages 657-663. MIT Press, 2002.

[13] C. W. Gardiner. Handbook of Stochastic Methods. Springer, Berlin, second edition, 1996.

[14] Manfred Opper, Andreas Ruttor, and Guido Sanguinetti. Approximate inference in continuous time Gaussian-jump processes. In Advances in Neural Information Processing Systems 23, pages 1831-1839, 2010.

[15] Florian Stimberg, Manfred Opper, and Andreas Ruttor. Bayesian inference for change points in dynamical systems with reusable states: a Chinese restaurant process approach. JMLR W&CP, 22:1117-1124, 2012.

[16] Frank Kwasniok. Analysis and modelling of glacial climate transitions using simple dynamical systems. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1991), 2013.", "award": [], "sourceid": 1024, "authors": [{"given_name": "Andreas", "family_name": "Ruttor", "institution": "TU Berlin"}, {"given_name": "Philipp", "family_name": "Batz", "institution": "TU Berlin"}, {"given_name": "Manfred", "family_name": "Opper", "institution": "TU Berlin"}]}