{"title": "Graphical Time Warping for Joint Alignment of Multiple Curves", "book": "Advances in Neural Information Processing Systems", "page_first": 3648, "page_last": 3656, "abstract": "Dynamic time warping (DTW) is a fundamental technique in time series analysis for comparing one curve to another using a flexible time-warping function. However, it was designed to compare a single pair of curves. In many applications, such as in metabolomics and image series analysis, alignment is simultaneously needed for multiple pairs. Because the underlying warping functions are often related, independent application of DTW to each pair is a sub-optimal solution. Yet, it is largely unknown how to efficiently conduct a joint alignment with all warping functions simultaneously considered, since any given warping function is constrained by the others and dynamic programming cannot be applied. In this paper, we show that the joint alignment problem can be transformed into a network flow problem and thus can be exactly and efficiently solved by the max flow algorithm, with a guarantee of global optimality. We name the proposed approach graphical time warping (GTW), emphasizing the graphical nature of the solution and that the dependency structure of the warping functions can be represented by a graph. Modifications of DTW, such as windowing and weighting, are readily derivable within GTW. We also discuss optimal tuning of parameters and hyperparameters in GTW. We illustrate the power of GTW using both synthetic data and a real case study of an astrocyte calcium movie.", "full_text": "Graphical Time Warping for Joint Alignment of\n\nMultiple Curves\n\nYizhi Wang\nVirginia Tech\nyzwang@vt.edu\n\nDavid J. 
Miller\n\nKira Poskanzer\n\nPennsylvania State University\n\nUniversity of California, San Francisco\n\ndjmiller@engr.psu.edu\n\nKira.Poskanzer@ucsf.edu\n\nYue Wang\nVirginia Tech\n\nyuewang@vt.edu\n\nLin Tian\n\nUniversity of California, Davis\n\nlintian@ucdavis.edu\n\nGuoqiang Yu\nVirginia Tech\nyug@vt.edu\n\nAbstract\n\nDynamic time warping (DTW) is a fundamental technique in time series analysis\nfor comparing one curve to another using a \ufb02exible time-warping function. How-\never, it was designed to compare a single pair of curves. In many applications,\nsuch as in metabolomics and image series analysis, alignment is simultaneously\nneeded for multiple pairs. Because the underlying warping functions are often\nrelated, independent application of DTW to each pair is a sub-optimal solution.\nYet, it is largely unknown how to ef\ufb01ciently conduct a joint alignment with all\nwarping functions simultaneously considered, since any given warping function\nis constrained by the others and dynamic programming cannot be applied. In\nthis paper, we show that the joint alignment problem can be transformed into a\nnetwork \ufb02ow problem and thus can be exactly and ef\ufb01ciently solved by the max\n\ufb02ow algorithm, with a guarantee of global optimality. We name the proposed\napproach graphical time warping (GTW), emphasizing the graphical nature of\nthe solution and that the dependency structure of the warping functions can be\nrepresented by a graph. Modi\ufb01cations of DTW, such as windowing and weighting,\nare readily derivable within GTW. We also discuss optimal tuning of parameters\nand hyperparameters in GTW. We illustrate the power of GTW using both synthetic\ndata and a real case study of an astrocyte calcium movie.\n\n1\n\nIntroduction\n\nTime series, such as neural recordings, economic observations and biological imaging movies, are\nubiquitous, containing rich information about the temporal patterns of physical quantities under\ncertain conditions. 
Comparison of time series lies at the heart of many scienti\ufb01c questions. Due to\nthe time distortions, direct comparison of time series using e.g. Euclidean distance is problematic.\nDynamic time warping (DTW) is a powerful and popular technique for time series comparison\nusing \ufb02exible warping functions. DTW has been successful for various tasks, including querying,\nclassi\ufb01cation, and clustering [1, 2, 3]. Although DTW is a mature approach, signi\ufb01cant improvements\nhave been proposed over the years, such as derivative DTW [4], weighted DTW [5], curve pairs with\nmultiple dimensions [6], and extensions for large scale data mining [7].\nHowever, DTW and all its variants consider the alignment of a single pair of time series, while in\nmany applications we encounter the task of aligning multiple pairs simultaneously. One might apply\nDTW or its variants to each pair separately. However, very often, this is suboptimal because it ignores\nthe dependency structure between the multiple warping functions. For example, when analyzing time\nlapse imaging data [8], we can consider the data as a collection of time series indexed by pixel. One\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: (a) Each node is a warping path between two curves xn and yn. Neighboring paths are\nassumed to be similar (A and B) while non-neighboring ones may be quite different (A and C). (b)\nDTW can be represented as a shortest path problem in a directed graph. Each edge originating from\nnode (k1, k2) has a weight given by the dissimilarity (e.g. Euclidean distance) between xn(k1) and\nyn(k2). The path distance between the purple and green paths is de\ufb01ned as the area of the shaded\nparts. (c) Primal and dual graphs. The purple and gold edges are two in\ufb01nite capacity reverse edges\nfor the dual and primal graphs, respectively. Only two such edges are drawn for clarity. 
The dashed\nline shows the auxiliary edges used for transforming the primal graph to the dual graph, which are\nremoved afterwards. (d) Flow chart for GTW. The corresponding \ufb01gure for each step is annotated.\n\npotential task is to compute the warping function associated with every pixel with respect to a given\nreference time series, with the ultimate goal of identifying signal propagation patterns among pixels.\nAlthough different pixels may have different warping functions, we expect that the functions are\nmore similar between adjacent pixels than between distant pixels. That is, we expect a certain degree\nof smoothness among spatially adjacent warping functions. Another example is pro\ufb01le alignment\nfor liquid chromatography-mass spectrometry (LC-MS) data, which is used to measure expression\nlevels of small biomolecules such as proteins and metabolites. Each pro\ufb01le can be considered as a\ntime series indexed by the retention time [9]. Typically, all pro\ufb01les in the data set must be aligned to\na reference pro\ufb01le. Because the LC-MS data measures similar things against a common reference\npro\ufb01le, we expect similar warping functions for all pro\ufb01les.\nTo the best of our knowledge, there is no existing approach that fundamentally generalizes DTW\nto jointly model multiple warping functions and align multiple curves, while retaining these advan-\ntageous properties of DTW: (1) computational ef\ufb01ciency and (2) a guarantee of global optimality.\nAs we will discuss below, most existing efforts reuse DTW multiple times in a heuristic way. In-\nterestingly, the necessity for and the challenge of a joint and integrated modeling approach come\nprecisely from the two factors that contribute to the wide use of DTW. On one hand, the power of\nDTW is due to its \ufb02exibility in allowing a broad range of warping functions. 
As is well known in\nmachine learning, an unavoidable consequence of flexibility is the problem of overfitting [10], and\nhence the estimated warping function is often unreliable. This problem becomes severe when the\nobserved time series are very noisy, which is often the case, rather than the exception, for multiple\ncurve alignment. On the other hand, the solution to DTW is extremely efficient and global optimality\n(with respect to the DTW objective function) is guaranteed, through the application of dynamic\nprogramming [11]. Unfortunately, when we consider joint modeling of multiple warping functions,\ndynamic programming is no longer applicable due to interactions between the different warping\nfunctions.\nThe computational burden of such joint modeling seems prohibitive, and the feasibility of obtaining\nthe global optimum is far from obvious, because each warping function is coupled to all the rest,\neither directly or indirectly. Thus, we were fortunate to find that the joint modeling can be solved\nvery efficiently, with global optimality ensured.\nIn this paper, we develop Graphical Time Warping (GTW) to jointly model multiple time warping\nfunctions in a unified framework. Given a set of warping functions {Pn, n = 1, . . . , N} to be\noptimized, a generic form of GTW can be expressed as follows:\n\n$$\\min_{\\{P_n, n=1,\\dots,N\\}} \\sum_{n=1}^{N} \\mathrm{DTW\\_cost}(P_n) + \\kappa \\sum_{E(m,n) \\in G_{struct}} \\mathrm{dissimilarity\\_cost}(P_m, P_n), \\qquad (1)$$\n\nwhere Pn is subject to the same constraints as in conventional DTW such as boundary conditions,\ncontinuity, and monotonicity [12]. Gstruct is a graph encoding the dependency structure among\nthe warping functions. 
Each node in the graph represents one warping function, indexed by n,\nand E(m, n) \u2208 Gstruct denotes that there is an edge between nodes m and n in Gstruct, whose\ncorresponding warping functions are expected to be similar, as encoded in the second term of the\ncost (1). DT W _cost is the conventional DTW path \ufb01nding cost and dissimilarity_cost ensures\nthe neighboring warping functions are similar. The graph Gstruct can be de\ufb01ned by users or induced\nfrom other sources, which provides great \ufb02exibility for encoding various types of problems. For\nexample, to analyze time series imaging data, the graph can be induced by the pixel grid so that\nedges exist only between spatially neighboring pixels. Alternatively, when aligning multiple LC-MS\npro\ufb01les, the graph is fully connected, such that each pro\ufb01le has an edge with all other pro\ufb01les.\nSince a warping function is a path in a two-dimensional grid from a given source to a given sink\n(as in Fig.1b), we propose to use the area bounded by two paths as the dissimilarity cost between\nthem. Later, we will show how the optimization problem in Equation (1) equipped with this speci\ufb01c\ndissimilarity cost can be transformed into a network \ufb02ow problem and solved by the max \ufb02ow\nalgorithm [13, 14].\nAs previously discussed, most DTW improvements have focused on the alignment of a single pair\nof curves. There are some heuristic efforts that deal with alignment of multiple curves. Chudova\njointly performed clustering and time warping using a mixture model [15]; this assumes curves from\nthe same cluster are generated by a single model. This is a suboptimal, restrictive \u201csurrogate\u201d for\ncapturing the relationships between curves, and does not capture relationships as (user-)speci\ufb01ed\nby a graph. Tsai et al. 
applied an MCMC strategy to align multiple LC-MS pro\ufb01les with a single\nprior distribution imposed on all warping functions [9], but the approach is time-consuming and\nno \ufb01nite-time convergence to the global optimum is guaranteed. Similarly, algorithms for aligning\nmultiple DNA sequences are based on \ufb01rst clustering the sequences and then progressively aligning\nthem [16, 17]. Most critically, all existing approaches assume special dependency structures, e.g. all\nnodes (curves) are equally dependent, and do not promise a globally optimal solution, while GTW\nworks with any given dependency structure and \ufb01nds the globally optimal solution ef\ufb01ciently.\nInterestingly, the max \ufb02ow algorithm has long been suggested as an alternative to DTW [13] by\nresearchers in the network \ufb02ow community. As an example, Uchida extended DTW to the non-\nMarkovian case and solved it by the max \ufb02ow algorithm [18]. Max \ufb02ow formulations have also been\ndeveloped to solve image segmentation [14], stereo matching [19] and shape matching problems\n[20]. But researchers in the time series analysis community have paid little attention to the max\n\ufb02ow approach, perhaps because dynamic programming is much more ef\ufb01cient than the max \ufb02ow\nalgorithm and is suf\ufb01cient for conventional DTW problems.\n\n2 Problem Formulation\nThe task is to jointly align N pairs of curves (xn, yn), 1 \u2264 n \u2264 N. For the sake of clarity, but\nwithout loss of generality, we assume all curves have the same length K and each curve is indexed by\nan integer from 1 to N. To rigorously formulate the problem, we have the following de\ufb01nitions.\nDe\ufb01nition 1 \u2013 valid warping function. 
A valid warping function for the nth pair of curves is\na set of integer pairs Pn = {(kn,x, kn,y)} such that the following conditions are satisfied: (a)\nboundary conditions: (1, 1) \u2208 Pn and (K, K) \u2208 Pn; (b) continuity and monotonicity conditions: if\n(kn,x, kn,y) \u2208 Pn, then (kn,x \u2212 1, kn,y) \u2208 Pn or (kn,x, kn,y \u2212 1) \u2208 Pn or (kn,x \u2212 1, kn,y \u2212 1) \u2208 Pn.\nDefinition 2 \u2013 alignment cost. For any given valid warping function Pn and its corresponding pair\nof curves (xn, yn), the associated alignment cost is defined as follows:\n\n$$\\mathrm{cost}(P_n) = \\sum_{(k_1,k_2) \\in P_n} g(x_n[k_1] - y_n[k_2]), \\qquad (2)$$\n\nwhere g(\u00b7) is any nonnegative function.\n\nFigure 2: (a) GTW graph for two neighboring pairs. Only two (bidirectional) edges (green) are drawn\nfor clarity. The orange background represents the (single pair) primal graphs. The blue foreground\nrepresents the dual graphs. (b) A neighborhood structure used for simulation. In the center is a 10\nby 10 grid for 100 pairs, with e.g. a close spatial neighborhood defined around each grid point. The\nwarping paths for the three blue squares are shown. The short red and green lines indicate when time\nshifts occur. They are at different positions along the three paths. The warping paths for spatially\nclose pairs should be similar.\n\nDefinition 3 \u2013 neighboring warping functions. Suppose the dependency structure for a set of N\nvalid warping functions is given by the graph Gstruct = {Vs, Es}, where Vs is the set of nodes,\nwith each node corresponding to a warping function, and Es is the set of undirected edges between\nnodes. If there is an edge between the mth and nth nodes, we call Pm and Pn neighbors, denoted by\n(m, n) \u2208 Neib.\nDefinition 4 \u2013 distance between two valid warping functions. 
We define the distance between two\nvalid warping functions dist(Pm, Pn) as the area of the region bounded by the two paths as shown in\nFig.1b.\nWhen we jointly align multiple pairs of curves, our goal is to minimize both the overall alignment\ncost and the distance between neighboring warping functions. Mathematically, denoting by VP the set\nof valid warping functions and by \u03ba1 the hyperparameter, we want to solve the following optimization\nproblem:\n\n$$\\min_{P} f(P) = \\min_{P = \\{P_n \\in V_P \\mid 1 \\le n \\le N\\}} \\sum_{n=1}^{N} \\mathrm{cost}(P_n) + \\kappa_1 \\sum_{(m,n) \\in Neib} \\mathrm{dist}(P_m, P_n) \\qquad (3)$$\n\n3 Methods\n\nIn this section, we first construct a graph based on Equation (3); then we prove that Equation (3) can\nbe solved via the well-known max flow problem in the graph; finally we provide a practical algorithm.\n\n3.1 Graph construction\n\nDefinition 5 \u2013 directed planar graph for a single pair of curves. For each pair of curves, consistent\nwith the cost function (2), there is an induced directed planar graph [21], Gn := {Vn, En}, 1 \u2264 n \u2264\nN, where Vn and En are the nodes and directed edges, respectively. An example is shown in Fig.1b.\nDefinition 6 \u2013 dual graph. Define G'n := {V'n, E'n} as the dual graph of the directed planar graph\nGn, where nodes V'n are all faces of Gn, and for each e \u2208 En, we have a new edge e' \u2208 E'n\nconnecting the faces from the right side of e to the left side. This edge is directed (with positive\ndirection by convention). The edge weights are the same as for the primal graph Gn. An example is\nshown in Fig.1c.\nIn contrast to conventional dual graph theory, one critical innovation here is that besides the positive\nedge we add in one more edge with reverse direction in the dual graph corresponding to each edge in\nthe primal graph. The weight for the reversed edge is set to infinity. 
This design is critical: otherwise,\nas demonstrated in Fig.3c, we cannot get an equivalent simpler problem.\nDefinition 7 \u2013 GTW graph. The GTW graph Ggtw := {Vgtw, Egtw} is defined as the integrated\ngraph of all dual graphs {G'n | 1 \u2264 n \u2264 N} with the integration guided by the neighborhood\nof warping functions, such that Vgtw = {V'n | 1 \u2264 n \u2264 N} and Egtw = {E'n | 1 \u2264 n \u2264 N} \u222a\n{(V'm,i, V'n,i) | (m, n) \u2208 Neib}. All newly introduced edges (V'm,i, V'n,i) are bi-directional with\ncapacity \u03ba2 (whereas all other edges have capacity proportional to the distance between two curves,\nmeasured at a pair of time points, i.e. g(xn(k1) \u2212 yn(k2))). An example is shown in Fig.2a.\n\n3.2 Equivalent problem\n\nWe claim that the GTW problem as stated in Equation (3) is equivalent to the minimum cut problem\non the GTW graph Ggtw if we set \u03ba2 = 2\u03ba1.\n\n3.3 Proof of equivalence\n\nFor brevity, proofs of lemmas omitted here can be found in the supplementary material.\nDefinition 8 \u2013 labeling of graph. L is a labeling of graph G if it assigns each node in G a binary label.\nL can induce a cut set C = {(i, j) | L(i) \u2260 L(j), (i, j) \u2208 EG}. The corresponding cut (or flow) is\n$\\mathrm{cut}(L) = \\mathrm{cut}(C) = \\sum_{(i,j) \\in C} \\mathrm{weight}(i, j)$, where weight(i, j) is the weight on the edge between\nnodes i and j.\nBased on its construction, a labeling L for the graph Ggtw can be written as L = {Ln | 1 \u2264 n \u2264 N},\nwhere Ln is a labeling for the dual graph G'n. 
So we can express the minimum cut problem for the\ngraph Ggtw as:\n\n$$\\min_{L} g(L) = \\min_{L := \\{L_n \\mid 1 \\le n \\le N\\}} \\sum_{n=1}^{N} \\mathrm{cut}(L_n) + \\kappa_2 \\sum_{(m,n) \\in Neib} \\mathrm{cut}(L_m, L_n), \\qquad (4)$$\n\nwhere cut(Ln) is the cut of all edges for G'n and cut(Lm, Ln) is the number of the cut edges between\ntwo neighboring dual graphs G'm and G'n.\nDenote Lmf as the labeling induced by applying the max flow algorithm on Ggtw, where for\neach node v, Lmf(v) = 0 if distres(v, s) < \u221e and Lmf(v) = 1 if distres(v, s) = \u221e, where\ndistres(i, j) is the distance between nodes i and j on the residual graph Gext,res given by the\nmaximum flow algorithm and s and t are the source and sink nodes of Ggtw, respectively. Denote\nS = {v | Lmf(v) = 0} and T = {v | Lmf(v) = 1}. We further denote Lmf,n as the component\ncorresponding to G'n. Similarly, Sn and Tn are subsets of S and T, respectively. Obviously, by\nthe max-flow min-cut theorem, the resulting cut set Cmf has the smallest cut. Cmf,n is the cut set\nrestricted to the graph G'n.\nLemma 1 Given labeling Lmf,n \u2208 Lmf, Sn forms a single connected area within graph G'n. That\nis, \u2200i \u2208 Sn, there is a path with nodes {i, j, k, . . . , s} \u2282 Sn from i to s. Similarly, Tn also forms a\nsingle connected area. In other words, after applying the max flow algorithm on Ggtw, members of\ngroup Sn do not completely surround members of group Tn, or vice versa.\nDefinition 9 \u2013 directed cut set. Cut set C is a directed cut set if \u2200(i, j) \u2208 C, either i \u2208 S and j \u2208 T\nor cap(i, j) = \u221e, i \u2208 T and j \u2208 S. 
As will be seen, this definition ensures that the cut set includes\nonly the edges e' that correspond to edges in the primal graph Gn, instead of the reverse edges\nintroduced when building the dual graph G'n, which give the wrong path direction.\nLemma 2 Cmf,n is a directed cut set.\nFrom Lemmas 1 and 2, we can build the link between the first term of f (Equation (3)) and g (Equation\n(4)).\nLemma 3 For any directed cut set Cn, 1 \u2264 n \u2264 N for Ggtw, there is a valid warping function\nPn, 1 \u2264 n \u2264 N for Gn, 1 \u2264 n \u2264 N so that cut(Cn) = cost(Pn), and vice versa.\nLemma 4 For two neighboring pairs (xm, ym) and (xn, yn), dist(Pn, Pm) = 0.5|Cm,n|, where we\ndenote Cm,n := {(V'm,i, V'n,i) | V'm,i \u2208 S, V'n,i \u2208 T or V'm,i \u2208 T, V'n,i \u2208 S}.\nLemma 4 states that the distance between two paths in the primal graph (Fig.1b) is the same as the\nnumber of neighborhood cuts between those two pairs, up to a constant scaling factor.\nLemma 5 Let P be a set of valid warping functions for {Gn | 1 \u2264 n \u2264 N} and let L be the labeling\nin Ggtw that corresponds to directed cuts. If \u03ba2 = 2\u03ba1, given P, we can find a corresponding L with\nf(P) = g(L) and given L, we can find a corresponding P so that g(L) = f(P).\nProof. First we show each P gives an L. As each path Pn can be transformed to a directed cut Cn\n(Lemma 3), which by definition is also a cut, it gives a valid labeling Ln and cost(Pn) = cut(Ln)\nby definition. And dist(Pm, Pn) = 0.5 \u00d7 cut(Lm, Ln) by Lemma 4. Then, with \u03ba2 = 2\u03ba1, we\nfind L = {Ln | 1 \u2264 n \u2264 N} such that f(P) = g(L). Conversely, given L = {Ln | 1 \u2264 n \u2264 N}\ncorresponding to directed cuts Cn, Cn can be transformed back to a valid path Pn with the same cost\n(Lemma 3). 
For the cut between Lm and Ln, we still have cut(Lm, Ln) = 2 \u00d7 dist(Pm, Pn) using\nLemma 4. Thus we find a set P = {Pn | 1 \u2264 n \u2264 N} with the same cost as L.\nTheorem 1 If Lmf is a labeling induced by the maximum flow algorithm on Ggtw, then the\ncorresponding P minimizes f(P).\nProof. Assume the max flow algorithm gives us a labeling L, which corresponds to path P and by\nLemma 5 the relationship f(P) = g(L) holds. Here f is the primal cost function and g is the dual\ncost function. Assume we have another labeling L' \u2260 L and it corresponds to another path P'; then\nalso by Lemma 5 f(P') = g(L') holds. Suppose path P' is better than path P, i.e. f(P') < f(P).\nThis implies g(L') < g(L), which contradicts the fact that L, being the labeling from the max\nflow algorithm, attains the minimum cut. Thus, there is no better path in terms of f() than that associated with the result of the\nmax flow algorithm.\nFrom Theorem 1 we know that after the max flow algorithm and labeling finish, we can get a\nsingle path Pn for each pair (xn, yn), which solves the primal form optimization problem. Since the\nlabeling sometimes may not be unique, different labelings may have the same cut. Correspondingly,\ndifferent paths in the primal graph may have the same (jointly minimum) cost.\nCorollary 1 If \u03ba1 = \u03ba2 = 0, L that minimizes g(L) corresponds to the P = {Pn | 1 \u2264 n \u2264 N}\nwhere Pn is the solution of the single pair DTW problem for (xn, yn).\n\n3.4 Flowchart of GTW algorithm\n\nOnce the equivalence is established, a practical algorithm is readily available, as shown in the\nflowchart of Fig.1d. Assuming the hyperparameter (\u03ba1) is fixed, one first constructs a primal graph\nseparately for each alignment task, then converts each primal graph to its dual form, and finally adds\nin edges to the set of dual graphs to obtain the GTW graph. 
Once we get the GTW graph, we can\napply any maximum flow algorithm to the graph, leading to the minimum cut set Cmf. For each sub\ncut-set Cmf,n corresponding to the nth dual graph G'n, we convert the cut edges back to edges in\nthe primal graph Gn. The resulting edges will be connected as a warping path and hence lead to a\nwarping function. The set of resultant warping functions is the solution to our GTW problem. A\nworking example is given in the Supplement.\nNote also that, as indicated in Fig.1d, this algorithm can be iteratively applied, with parameter (and\nhyperparameter) re-estimation performed at each iteration. The primary parameter is the noise\nvariance (which can easily be generalized to a separate noise variance parameter for each pair of\ncurves, when appropriate). In addition to the major hyperparameter \u03ba1 in Equation (3), we may use\nother hyperparameters to incorporate prior knowledge such as favoring a diagonal warping direction,\nwhich actually results in an extension of DTW even for a single pair of curves. In the Supplement,\nwe show that the hyperparameters can be tuned, along with parameters, via either cross validation\nor a procedure approximately consistent with maximum likelihood estimation. In addition, as a heuristic rule of\nthumb, we can choose \u03ba1 = a\u03c3^2, where \u03c3^2 is the noise variance and a \u2208 (1, 10).\n\n4 Experimental results\n\nWe used synthetic and real data to compare the performance of GTW and DTW. For the synthetic\ndata, we evaluate the performance by the estimation error for the warping path Pn. For real data, we\nexamine the spatial delay pattern relative to a reference curve. We also illustrate the impact of the\ncapacity of the reverse edges. More experiments can be found in the Supplement.\n\nFigure 3: (a) The curves before (blue, xn) and after (red, yn) warping in the simulation. The green\ndashed squares indicate where the warping occurs. 
(b) Performance comparison of GTW and DTW\nfor 100 simulations under different additive noise variances. Both cases include the off-diagonal\nweights \u03b2 (see section 4.1). Error bars indicate standard deviation. (c) The impact of reverse capacity.\nLeft: a pair of curves from an astrocyte imaging movie. Only times 81 to 100 are shown. The right\nthree figures are the warping paths with different reverse capacities. Pos_cap is the capacity for\ncorresponding edges from the primal graph. Red dashed circles indicate where the DTW constraints\nare violated. (d) Estimated propagation patterns on the astrocyte image. Left: original movie from\ntimes 6 to 8. The yellow dot is the position of the reference curve. Middle: the delay pattern of pixels\nrelative to the reference curve, estimated by GTW. Right: results for DTW.\n\n4.1 Experiment on synthetic data\n\nWe generated N = 100 pairs of curves (xn, yn). Each pair is linked by a warping function Wn so\nthat yn = Wn(xn). Curve xn is a time series composed of low pass filtered Gaussian noise and yn is\ngenerated by applying Wn on xn (Fig.3a). Noise is also added to both xn and yn. In this simulation\nthe pairs are in a 10 \u00d7 10 four-connected grid; thus the ground-truth warping paths for neighboring\npairs are similar (Fig.2b). The warping path of the pair at location (1, 1) has a one time-point shift\nfrom 21 to 30 and another one from 71 to 80. The pair at location (10, 10) has a one time-point shift\nfrom 30 to 39 and another from 62 to 71. The warping functions for pairs between these locations are\nsmoothly interpolated.\nWe ran the simulation 100 times and added uncorrelated Gaussian noise to xn and yn. All hyperparameters\nwere initialized to 0; the noise variance was initialized to 0.01. In addition, the distance\nof the path from the diagonal line was penalized via a hyperparameter \u03b2 = \u221ad/\u03c3^2, where d is the\ndistance of a point in the path to the diagonal. 
When the parameter and hyperparameter changes were\nall less than 0.001, we stopped the algorithm. Convergence usually occurred within 10 iterations. The\nestimated path was compared with the ground truth one and we define the normalized error as\n\n$$err_{norm} = \\frac{1}{(K-1)N} \\sum_{k=1}^{K-1} \\sum_{n=1}^{N} \\left| S_{\\hat{P}_n}(k, k+1) - S_{P_n}(k, k+1) \\right| \\qquad (5)$$\n\nHere $S_{\\hat{P}_n}(k, k+1)$ is the area under the path $\\hat{P}_n$ between times k and k + 1.\nGTW improves the accuracy in estimating warping functions. As shown in Fig.3b, GTW outperforms\nDTW even when the noise level is small or moderate. Moreover, while DTW degrades with\nincreasing noise, GTW maintains a much smaller change in its normalized error for increasing noise.\n\nInfinite capacity reverse edges are critical. In Fig.3c we illustrate the importance of introducing\ninfinite capacity reverse edges when we construct the dual graph G'n for each primal graph Gn. This\nensures the cut found by the maximum flow algorithm is a directed cut, which is linked to a path\nin the primal graph that satisfies the constraints of DTW. If the reverse edge is not added, the max\nflow algorithm acts as if there is a reverse edge with zero weight. Alternatively, we can add in a\nreverse edge with the same weight as for the positive direction. However, in both cases as shown in\nthe right two subplots of Fig.3c, DTW\u2019s monotonicity and continuity constraints are violated almost\neverywhere, since what we obtain by max flow in this case is no longer a directed cut and the path in\nthe primal graph is no longer a valid warping function.\n\n4.2 Application to time-lapse astrocyte calcium imaging data\n\nWe applied GTW to estimate the propagation patterns of astrocyte calcium fluorescent imaging data\n[22, 8]. 
The movie was obtained from a neuro-astrocyte co-cultured Down syndrome cell line. It\ncontains 100 time points and rich types of propagation are observed during the time course. Here we\nfocused on a selected region. The movie between time instants 6 and 8 is shown in the left column of\nFig.3d. At time 6, the activity occurs at the center part and it spreads out over the subsequent time\npoints. At time 8, the active area is the largest. Since the movie was taken while the cells were under\ndrug treatment conditions, the properties of these calcium waves are important features of interest.\nHere we focused on one segmented area and identi\ufb01ed the propagation pattern within it. We extracted\nthe curve for one pixel as the reference curve x (Fig.3c, left) and all other pixels are yn. So now\nx1 = x2 = \u00b7\u00b7\u00b7 = xN = x, which is a special case of GTW. All parameters and hyperparameters\nwere initialized in the same way as previously and both methods included an off-diagonal cost \u03b2.\nFrom the estimated warping path, we extracted the delay relative to the reference curve, which is\nde\ufb01ned as the largest discrepancy from the diagonal line at a given time point (Fig.3d, middle and\nright columns). GTW gives cleaner patterns of delay compared to DTW, which produces noisier\nresults.\n\n5 Discussion\n\nWhile GTW can be applied to time series data analysis tasks like classi\ufb01cation and clustering to\nobtain a smoothed distance measure, it could be even more powerful for mining the relationships\nbetween warping functions. Their differences could be classi\ufb01ed or clustered, and explained by other\nfeatures (or factors) for those curve pairs. This may bring further insights and interpretability to the\nsolution. As a two-layer network for time series, GTW is a general framework for analyzing the\npattern of warping functions. First, the time series can be \ufb02exibly organized into pairs with DTW\nconstraints. 
One curve can participate in multiple pairings and even play different roles (either as\na reference or as a test curve). Partial matching, direction preference and weighting of DTW can\nbe readily incorporated. In addition, GTW allows the test curve and the reference curve to have\ndifferent lengths. Second, the construction of graphs from pairs adds another layer of \ufb02exibility.\nFor spatio-temporal data or video analysis, physical locations or pixels naturally guide the choice\nof graph edges. Otherwise, we can avoid using a fully connected graph by utilizing any auxiliary\ninformation on each pair of curves to build the graph. For example, features related to each subject\n(e.g., clinical features) can be used to enforce a sparse graph structure.\n\n6 Conclusion\n\nIn this paper, we developed graphical time warping (GTW) to impose a \ufb02exible dependency structure\namong warping functions to jointly align multiple pairs of curves. After formulating the original\ncost function, the single pair time warping term is transformed into its dual form and pairwise costs\nare added. We proved the equivalence of this dual form and the primal form by the properties of\nthe dual-directed graph as well as the speci\ufb01c structure of the primal single pair shortest path graph.\nWindowing, partial matching, direction, and off-diagonal costs can also be incorporated in the model,\nwhich makes GTW \ufb02exible for various applications of time warping. Iterative unsupervised parameter\nestimation and inference by max \ufb02ow are shown to be effective and ef\ufb01cient in our experiments.\nSimulation results and a case study of astrocyte propagation demonstrate the effectiveness of our\napproach.\n\n8\n\n\fReferences\n[1] D. J. Berndt and J. Clifford, \u201cUsing Dynamic Time Warping to Find Patterns in Time Series,\u201d in Proceedings\nof the 3rd International Conference on Knowledge Discovery and Data Mining, AAAIWS\u201994, (Seattle,\nWA), pp. 
359\u2013370, AAAI Press, 1994.\n\n[2] T.-c. Fu, \u201cA review on time series data mining,\u201d Engineering Applications of Artificial Intelligence, vol. 24,\npp. 164\u2013181, Feb. 2011.\n\n[3] T. Warren Liao, \u201cClustering of time series data\u2014a survey,\u201d Pattern Recognition, vol. 38, pp. 1857\u20131874,\nNov. 2005.\n\n[4] E. Keogh and M. Pazzani, \u201cDerivative Dynamic Time Warping,\u201d in Proceedings of the 2001 SIAM\nInternational Conference on Data Mining, pp. 1\u201311, Society for Industrial and Applied\nMathematics, Apr. 2001.\n\n[5] Y.-S. Jeong, M. K. Jeong, and O. A. Omitaomu, \u201cWeighted dynamic time warping for time series\nclassification,\u201d Pattern Recognition, vol. 44, pp. 2231\u20132240, Sept. 2011.\n\n[6] M. Shokoohi-Yekta, J. Wang, and E. Keogh, \u201cOn the Non-Trivial Generalization of Dynamic Time Warping\nto the Multi-Dimensional Case,\u201d in Proceedings of the 2015 SIAM International Conference on Data\nMining, pp. 289\u2013297, Society for Industrial and Applied Mathematics, June 2015.\n\n[7] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh,\n\u201cSearching and mining trillions of time series subsequences under dynamic time warping,\u201d in Proceedings\nof the 18th ACM SIGKDD, pp. 262\u2013270, ACM, 2012.\n\n[8] A. Volterra, N. Liaudet, and I. Savtchouk, \u201cAstrocyte Ca2+ signalling: an unexpected complexity,\u201d Nat Rev\nNeurosci, vol. 15, pp. 327\u2013335, May 2014.\n\n[9] T.-H. Tsai, M. G. Tadesse, C. Di Poto, L. K. Pannell, Y. Mechref, Y. Wang, and H. W. Ressom, \u201cMulti-profile\nBayesian alignment model for LC-MS data analysis with integration of internal standards,\u201d\nBioinformatics, vol. 29, pp. 2774\u20132780, Nov. 2013.\n\n[10] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning, vol. 1. Springer Series in\nStatistics, Springer, Berlin, 2001.\n\n[11] P. F. Felzenszwalb and R. 
Zabih, \u201cDynamic Programming and Graph Algorithms in Computer Vision,\u201d\nIEEE Trans. Pattern Anal. Mach. Intell., vol. 33, pp. 721\u2013740, Apr. 2011.\n\n[12] E. Keogh and C. A. Ratanamahatana, \u201cExact indexing of dynamic time warping,\u201d Knowl Inf Syst, vol. 7,\npp. 358\u2013386, May 2004.\n\n[13] R. Ahuja, T. Magnanti, and J. Orlin, Network Flows: Theory, Algorithms, and Applications. Prentice Hall,\nFeb. 1993.\n\n[14] Y. Boykov and V. Kolmogorov, \u201cAn experimental comparison of min-cut/max-flow algorithms for\nenergy minimization in vision,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26,\npp. 1124\u20131137, Sept. 2004.\n\n[15] D. Chudova, S. Gaffney, and P. Smyth, \u201cProbabilistic models for joint clustering and time-warping\nof multidimensional curves,\u201d in Proceedings of the Nineteenth conference on Uncertainty in Artificial\nIntelligence, pp. 134\u2013141, Morgan Kaufmann Publishers Inc., 2002.\n\n[16] P. Hogeweg and B. Hesper, \u201cThe alignment of sets of sequences and the construction of phyletic trees: an\nintegrated method,\u201d Journal of molecular evolution, vol. 20, no. 2, pp. 175\u2013186, 1984.\n\n[17] F. Sievers, A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert,\nJ. S\u00f6ding, and others, \u201cFast, scalable generation of high-quality protein multiple sequence alignments\nusing Clustal Omega,\u201d Molecular systems biology, vol. 7, no. 1, p. 539, 2011.\n\n[18] S. Uchida, M. Fukutomi, K. Ogawara, and Y. Feng, \u201cNon-Markovian dynamic time warping,\u201d in 2012 21st\nInternational Conference on Pattern Recognition (ICPR), pp. 2294\u20132297, Nov. 2012.\n\n[19] H. Ishikawa and D. Geiger, \u201cOcclusions, discontinuities, and epipolar lines in stereo,\u201d in Computer Vision\n\u2014 ECCV\u201998, Lecture Notes in Computer Science, pp. 232\u2013248, Springer Berlin Heidelberg, June 1998.\nDOI: 10.1007/BFb0055670.\n\n[20] F. R. 
Schmidt, E. T\u00f6ppe, D. Cremers, and Y. Boykov, \u201cEfficient Shape Matching Via Graph Cuts,\u201d in\nEnergy Minimization Methods in Computer Vision and Pattern Recognition, no. 4679 in Lecture Notes in\nComputer Science, pp. 39\u201354, Springer Berlin Heidelberg, Aug. 2007.\n\n[21] B. Korte and J. Vygen, Combinatorial optimization, vol. 2. Springer, 2012.\n\n[22] Y. Wang, G. Shi, D. J. Miller, Y. Wang, G. Broussard, Y. Wang, L. Tian, and G. Yu, \u201cFASP: A machine\nlearning approach to functional astrocyte phenotyping from time-lapse calcium imaging data,\u201d in 2016\nIEEE 13th International Symposium on Biomedical Imaging (ISBI), pp. 351\u2013354, Apr. 2016.\n", "award": [], "sourceid": 1815, "authors": [{"given_name": "Yizhi", "family_name": "Wang", "institution": "Virginia Tech"}, {"given_name": "David", "family_name": "Miller", "institution": "The Pennsylvania State University"}, {"given_name": "Kira", "family_name": "Poskanzer", "institution": "University of California"}, {"given_name": "Yue", "family_name": "Wang", "institution": "Virginia Tech"}, {"given_name": "Lin", "family_name": "Tian", "institution": "The University of California"}, {"given_name": "Guoqiang", "family_name": "Yu", "institution": "Virginia Tech"}]}