{"title": "Kernel Hyperalignment", "book": "Advances in Neural Information Processing Systems", "page_first": 1790, "page_last": 1798, "abstract": "We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We conducted experiments using real-world, multi-subject fMRI data.", "full_text": "Kernel Hyperalignment\n\nAlexander Lorbert & Peter J. Ramadge\n\nDepartment of Electrical Engineering\n\nPrinceton University\n\nAbstract\n\nWe offer a regularized, kernel extension of the multi-set, orthogonal Procrustes\nproblem, or hyperalignment. Our new method, called Kernel Hyperalignment,\nexpands the scope of hyperalignment to include nonlinear measures of similar-\nity and enables the alignment of multiple datasets with a large number of base\nfeatures. With direct application to fMRI data analysis, kernel hyperalignment is\nwell-suited for multi-subject alignment of large ROIs, including the entire cortex.\nWe report experiments using real-world, multi-subject fMRI data.\n\n1\n\nIntroduction\n\nOne of the goals of multi-set data analysis is forming qualitative comparisons between datasets. To\nthe extent that we can control and design experiments to facilitate these comparisons, we must \ufb01rst\nask whether the data are aligned. In its simplest form, the primary question of interest is whether\ncorresponding features among the datasets measure the same quantity. 
If yes, we say the data are\naligned; if not, we must first perform an alignment of the data.\nThe alignment problem is crucial to multi-subject fMRI data analysis, which is the motivation for\nthis work. An appreciable amount of effort is devoted to designing experiments that maintain the\nfocus of a subject. This is to ensure temporal alignment across subjects for a common stimulus.\nHowever, with each subject exhibiting his/her own unique spatial response patterns, there is a need\nfor spatial alignment. Specifically, we want between-subject correspondence of voxel j at TR i\n(Time of Repetition). The typical approach taken is anatomical alignment [20] whereby anatomical\nlandmarks are used to anchor spatial commonality across subjects. In linear algebra parlance,\nanatomical alignment is an affine transformation with 9 degrees of freedom.\nRecently, Haxby et al. [9] proposed Hyperalignment, a function-based alignment procedure. Instead\nof a 9-parameter transformation, a higher-order, orthogonal transformation is derived from voxel\ntime-series data. The underlying assumption of hyperalignment is that, for a fixed stimulus, a\nsubject\u2019s time-series data will possess a common geometry. Accordingly, the role of alignment is to\nfind isometric transformations of the per-subject trajectories traced out in voxel space so that the\ntransformed time-series best match each other. Using their method, the authors were able to achieve\na between-subject classification accuracy on par with, and even greater than, within-subject\naccuracy.\nSuppose that subject data are recorded in matrices $X_{1:m} \\in \\mathbb{R}^{t \\times n}$. This could be data from an\nexperiment involving m subjects, t TRs, and n voxels. We are interested in extending the regularized\nhyperalignment problem\n\n$$\\text{minimize} \\;\\; \\sum_{i<j} \\|X_i R_i - X_j R_j\\|_F^2 \\quad \\text{subject to} \\;\\; R_k^T A_k R_k = I, \\;\\; k = 1, 2, \\ldots, m, \\tag{1}$$\n\nwhere matrices $A_{1:m} \\in \\mathbb{R}^{n \\times n}$ are symmetric and positive definite. In general, the above problem\nmanifests itself in many application areas. For example, when $A_k = I$ we have hyperalignment or\na multi-set orthogonal Procrustes problem, commonly used in shape analysis [6, 7]. When $A_k = X_k^T X_k$,\n(1) represents a form of multi-set Canonical Correlation Analysis (CCA) [12, 13, 8].\nThe success of hyperalignment engenders numerous questions and in this work we address two of\nthem. First, is hyperalignment scalable? In [9], the authors consider a subset of ventral temporal\ncortex (VT), using hundreds of voxels. The relatively low voxel count alleviates a huge computational\ncost and storage burden. However, the current method for solving (1) is infeasible when considering\nmany or all voxels, and therefore limits the scope of hyperalignment to a local alignment procedure.\nFor example, if n = 50,000 voxels, then storing the n \u00d7 n matrix for one subject requires over 18\ngigabytes of memory. Moreover, computing a full SVD for a matrix this size is a tall order.\nCoupled with scalability, we also ask whether we can include new features of our subjects\u2019 data.\nFor example, we may want to augment the input data with the associated second-order mixtures,\ni.e., n voxels become $\\binom{n}{1} + \\binom{n}{2} = n(n+1)/2$ features. Again, for a reasonably-sized voxel count,\nrunning hyperalignment is infeasible.\nAddressing scalability and feature extension results in the main contribution of kernel\nhyperalignment. The inclusion of a large feature space motivates the use of kernel methods. Additionally,\nnumerous optimization problems that use the kernel trick possess global optimizers spanned by the\nmapped examples. This is guaranteed by the Representer Theorem [14, 18]. Therefore, the two\nseparate issues of scalability and feature extension are merged into a single problem through the use of\nkernel methods.\n
With kernel hyperalignment, the bottleneck shifts from voxel count to the number\nof TRs times subjects (or, more generally, from the number of original input features to the number of examples).\nThe problem we address in this paper is the alignment of multiple datasets in the same and extended\nfeature space. Multi-set data analysis by means of kernel methods has already been considered in\nthe framework of CCA [16, 1]. Our approach deviates from [1] and [15] because we focus on\nalignment and never leave feature space until training and testing. We use the kernel trick as a means\nof navigating through a high-dimensional orthogonal group. Our CCA variant is more constrained,\nand each dataset is assigned the same kernel, supplying us with a richer, single reproducing kernel\nHilbert space (RKHS) over a collection of m smaller and distinct ones. Allowing for subject-specific\nkernels leads to the difficult problem of selecting them, a significantly harder problem than\nselecting a single kernel. In this respect, we assume a single kernel can provide the sought-after linearity\nused for comparing multiple datasets.\nThe paper is organized as follows: in \u00a72 we review regularized hyperalignment, or the regularized\nmulti-set orthogonal Procrustes problem. Next, in \u00a73 we formulate its kernel variant, and in \u00a74 we\ndiscuss classification with aligned data. We provide experimental results in \u00a75, and we conclude in\n\u00a76. All proofs are supplied in the Supplementary Material.\n\n2 Hyperalignment\n\nThe hyperalignment problem of (1) is equivalent to [7]:\n\n$$\\text{minimize} \\;\\; \\sum_{i=1}^{m} \\|X_i R_i - Y\\|_F^2 \\quad \\text{subject to} \\;\\; Y = \\frac{1}{m} \\sum_{j=1}^{m} X_j R_j \\;\\; \\text{and} \\;\\; R_k^T A_k R_k = I \\;\\; \\text{for} \\;\\; k = 1, \\ldots, m. \\tag{2}$$\n\nThe matrix Y is the image centroid and serves as the catalyst for computing a solution: for dataset\ni, fix a centroid and solve for $R_i$.\n
This process cycles over all datasets for a specified number of\nrounds, or until approximate convergence is reached (see Algorithm 1). The dynamic centroid Y\ncan be a sample mean or a leave-one-out (LOO) mean. Regardless of type, the last round should use\nthe fixed sample mean provided by the penultimate round. We can set $Q_k = A_k^{1/2} R_k$, using the\nsymmetric, positive definite square root (see footnote 1), yielding the key operation\n\n$$\\text{minimize} \\;\\; \\|X_k A_k^{-1/2} Q_k - Y\\|_F^2 \\quad \\text{subject to} \\;\\; Q_k^T Q_k = I. \\tag{3}$$\n\nThe above is the familiar orthogonal Procrustes problem [19] and is solved using the SVD of\n$A_k^{-1/2} X_k^T Y$.\n\nFootnote 1: In practice, we would use the Cholesky factorization of $A_k$. However, in deriving the kernel\nhyperalignment procedure it is necessary to familiarize the reader with this approach.\n\n3 Kernel Hyperalignment\n\nThe previous section dealt with alignment based on the original data. In the context of optimization,\nthe alignment problem of (1) is indifferent to both data generation and data recording. There are,\nhowever, implicit assumptions about these two processes. The data are generated according to a\ncommon input signal, and each of the m datasets represents a specific view of this signal. In other\nwords, the matrices $X_{1:m}$ have row correspondence. The alignment problem of (1) seeks column\ncorrespondence through a linear mapping of the original features.\nIn fMRI, the m views are manifested by m subjects experiencing a common, synchronous stimulus.\nEach data matrix records fMRI time-series data: the rows are indexed by a TR and the columns are\nindexed by a voxel. There are t TRs and n voxels per subject, i.e., $X_k \\in \\mathbb{R}^{t \\times n}$. The synchrony of the\nstimulus ensures row correspondence. Hyperalignment can be posed as the minimization problem\nof (2) with $A_k = I$. Voxel (column) correspondence is then achieved via an orthogonal constraint\nplaced on each of the linear mappings.\n
The orthogonal constraint present in hyperalignment follows\na subject-independent isometry assumption. We can view the time-series data of each subject as a\ntrajectory in $\\mathbb{R}^n$. For a fixed stimulus this trajectory is [approximately] identical, up to a\nrotation-reflection, across subjects.\nAs stated above, we are assuming equivalence of the per-view information in its original form, but\nwe are not assuming that this information can be related through a linear mapping. Now suppose\nthere is a common set of N features, derived from each n-dimensional example, that does allow\nfor a linear relationship between views. Alternatively, there may be derivative features of interest\nthat lead to better alignment via a linear mapping. For example, it is conceivable that second-order\ndata, i.e., pairwise mixtures of the original data, obey a linear construct and may be a preferred\nfeature set for alignment. In general, we wish to formulate an alignment technique for this new\nfeature set. Rather than limit expression of the data to the n given coordinates, we consider an\nN-coordinate representation, where N may be much greater than n.\nLet $X_i \\in \\mathbb{R}^{t \\times n}$ have $i'$-th row $[x^i_{i'}]^T$ with $x^i_{i'} \\in \\mathbb{R}^n$. We introduce the row-based mapping of $X_i$:\n\n$$\\Phi(X_i) = \\begin{pmatrix} \\phi_1(x^i_1) & \\phi_2(x^i_1) & \\cdots & \\phi_N(x^i_1) \\\\ \\vdots & \\vdots & & \\vdots \\\\ \\phi_1(x^i_t) & \\phi_2(x^i_t) & \\cdots & \\phi_N(x^i_t) \\end{pmatrix} \\in \\mathbb{R}^{t \\times N}. \\tag{4}$$\n\nThe N functions $\\phi_{1:N} : \\mathbb{R}^n \\to \\mathbb{R}$ are used to derive N features from the original data. For matrix\n$X_i \\in \\mathbb{R}^{t \\times n}$ let $\\Phi_i = \\Phi(X_i)$. In general, for $X_i \\in \\mathbb{R}^{t \\times n}$ and $X_j \\in \\mathbb{R}^{s \\times n}$, we define the Gram\nmatrix $K_{ij} \\triangleq \\Phi_i \\Phi_j^T \\in \\mathbb{R}^{t \\times s}$. We also write $K_i \\triangleq K_{ii} = \\Phi_i \\Phi_i^T$. We assume that there is an\nappropriate positive definite kernel, $\\hat{k} : \\mathbb{R}^n \\times \\mathbb{R}^n \\to \\mathbb{R}$, so that we can leverage the kernel trick\n[2, 10] and obtain the $i'j'$-th element of $K_{ij}$ via\n\n$$(K_{ij})_{i'j'} = \\hat{k}(x^i_{i'}, x^j_{j'}). \\tag{5}$$\n\nUsing the feature map $\\Phi(\\cdot)$, we form the regularized Kernel Hyperalignment problem:\n\n$$\\text{minimize} \\;\\; \\sum_{i<j} \\|\\Phi(X_i) R_i - \\Phi(X_j) R_j\\|_F^2 \\quad \\text{subject to} \\;\\; R_k^T A_k R_k = I \\;\\; \\text{for} \\;\\; k = 1, \\ldots, m. \\tag{6}$$\n\nThe latent variables are $R_{1:m} \\in \\mathbb{R}^{N \\times N}$ and we are given symmetric, positive definite matrices\n$A_{1:m} \\in \\mathbb{R}^{N \\times N}$. Although different than the original hyperalignment problem, obtaining a solution\nto (6) is accomplished in the same way: fix a centroid and find the individual linear maps. To this\nend, the key operation involves solving\n\n$$\\arg\\min_{R^T A_i R = I} \\|\\Phi_i R - \\Psi\\|_F^2 \\quad \\text{or} \\quad \\arg\\min_{Q^T Q = I} \\|\\Phi_i A_i^{-1/2} Q - \\Psi\\|_F^2, \\tag{7}$$\n\nwhere $\\Phi_i = \\Phi(X_i)$, $i \\geq 1$, is the current, individual dataset under consideration and\n$\\Psi = \\frac{1}{|\\mathcal{A}|} \\sum_{j \\in \\mathcal{A}} \\Phi_j \\hat{R}_j$ is a centroid based on the current estimates of $R_{1:m}$, denoted $\\hat{R}_{1:m}$. The index set\n$\\mathcal{A} \\subseteq \\{1, \\ldots, m\\}$ determines how the estimated centroid is calculated (sample or LOO mean).\nThe difficulty of (7) lies in the size of N. Any of the well-known kernels correspond to an N so large\nthat direct computation is generally impractical. For example, if using second-order interactions as\nthe feature set, the number of unknowns in kernel hyperalignment is $O(mn^4)$ in contrast to $O(mn^2)$\nunknowns for hyperalignment.\n
Nevertheless, the minimization problem of (7) places us in familiar\nterritory of solving an orthogonal Procrustes problem.\nSince we are now in feature space, the matrix $A_i$ poses a problem unless we confine it to a specific\nform. For example, if $A_i$ is random, finding $A_i^{-1/2}$ would be infeasible for large N. Additionally,\nthe constraint $R_i^T A_i R_i = I$ would lack any intuition. Therefore, we restrict $A_i = \\alpha I + \\beta \\Phi_i^T \\Phi_i$\nwith $\\alpha > 0$ and $\\beta \\geq 0$. As with regularized hyperalignment [22], when $(\\alpha, \\beta) = (1, 0)$ we obtain\nhyperalignment and when $(\\alpha, \\beta) \\approx (0, 1)$ we obtain a form of CCA.\nLet $K_i$ have eigen-decomposition $V_i \\Lambda_i V_i^T$, where $\\Lambda_i = \\mathrm{diag}\\{\\lambda_{i1}, \\ldots, \\lambda_{it}\\}$, or $\\mathrm{diag}_j\\{\\lambda_{ij}\\}$ for\nshort. We introduce two symmetric, positive definite matrices:\n\n$$B_i = V_i \\, \\mathrm{diag}_j\\left\\{ \\frac{1}{\\sqrt{\\alpha + \\beta \\lambda_{ij}}} \\right\\} V_i^T \\quad \\text{and} \\quad C_i = V_i \\, \\mathrm{diag}_j\\left\\{ \\frac{1}{\\lambda_{ij}} \\left( \\frac{1}{\\sqrt{\\alpha + \\beta \\lambda_{ij}}} - \\frac{1}{\\sqrt{\\alpha}} \\right) \\right\\} V_i^T.$$\n\nLemma 3.1. For $A_i = \\alpha I + \\beta \\Phi_i^T \\Phi_i$ we have $A_i^{-1/2} = \\frac{1}{\\sqrt{\\alpha}} I + \\Phi_i^T C_i \\Phi_i$ and $\\Phi_i A_i^{-1/2} = B_i \\Phi_i$.\n\nWe can use Lemma 3.1 to transform (7) into\n\n$$\\arg\\min_{Q^T Q = I} \\|B_i \\Phi_i Q - \\Psi\\|_F^2 \\quad \\text{or} \\quad \\arg\\max_{Q^T Q = I} \\mathrm{tr}\\left( Q^T \\Phi_i^T B_i \\left[ \\frac{1}{|\\mathcal{A}|} \\sum_{j \\in \\mathcal{A}} B_j \\Phi_j \\hat{Q}_j \\right] \\right), \\tag{8}$$\n\nwhere $\\hat{Q}_j$ is the current estimate of $Q_j$. Solving for the matrix Q is still well beyond practical\ncomputation. The following lemma is the gateway for managing this problem.\nLemma 3.2.\n
If $\\tilde{U} \\in \\mathrm{St}(N, d)$ and $\\tilde{G} \\in O(d)$, then $\\tilde{Q} = I_N - \\tilde{U}(I_d - \\tilde{G})\\tilde{U}^T \\in O(N)$ (see footnote 2).\n\nFootnote 2: $\\mathrm{St}(N, d) \\triangleq \\{Z : Z \\in \\mathbb{R}^{N \\times d}, \\; Z^T Z = I_d\\}$ is the (N, d) Stiefel manifold ($N \\geq d$), and\n$O(N) \\triangleq \\{Z : Z \\in \\mathbb{R}^{N \\times N}, \\; Z^T Z = I_N\\}$ is the orthogonal group of N \u00d7 N matrices.\n\nFamiliar applications of the above lemma include the identity matrix ($\\tilde{G} = I_d$) and Householder\nreflections ($\\tilde{G} = -I_d$). If $\\tilde{G}$ is block diagonal with 2 \u00d7 2 blocks of Givens rotations, then the\ncolumns of $\\tilde{U}$, taken two at a time, are the two-dimensional planes of rotation [7]. We therefore\nrefer to $\\tilde{U}$ as the plane support matrix.\nLemma 3.2 can be interpreted as a lifting mechanism for identity deviations. The difference $I_d - \\tilde{G}$\nrepresents an $O(d)$ deviation from identity. Applying $\\tilde{U}(I_d - \\tilde{G})\\tilde{U}^T = I_N - \\tilde{Q}$ \u201clifts\u201d this difference\nto an $O(N)$ deviation from identity. Reversing directions, we can also utilize Lemma 3.2 for\ncompressing $O(N)$. From $I_N - \\tilde{Q} = \\tilde{U}(I_d - \\tilde{G})\\tilde{U}^T$, the rank of the deviation, $I_N - \\tilde{Q}$, is upper\nbounded by d, producing a subset of $O(N)$.\nMotivated by Lemma 3.2 we impose\n\n$$Q_i = I_N - U(I - G_i)U^T, \\tag{9}$$\n\nwhere $U \\in \\mathrm{St}(N, r)$, $G_i \\in O(r)$, and $1 \\leq r \\leq N$. Ideally, we want r small to benefit from a\nreduced dimension. As is typically the case when using kernel methods, leveraging the Representer\nTheorem shifts the dimensionality of the problem from the feature cardinality to the number of\nexamples, i.e., r = mt. We pool all of the data, forming the mt \u00d7 N matrix\n\n$$\\Phi_0 = \\left[ \\Phi_1^T \\;\\; \\Phi_2^T \\;\\; \\cdots \\;\\; \\Phi_m^T \\right]^T, \\tag{10}$$\n\nand set $U = \\Phi_0^T K_0^{-1/2} \\in \\mathbb{R}^{N \\times r}$ with $K_0 = \\Phi_0 \\Phi_0^T$ assumed positive definite. As long as $r \\leq N$,\nthe orthogonality constraint is met because $(\\Phi_0^T K_0^{-1/2})^T (\\Phi_0^T K_0^{-1/2}) = K_0^{-1/2} K_0 K_0^{-1/2} = I_r$.\nTheorem 3.3 (Hyperalignment Representer Theorem). Within the set of global minimizers of (6)\nthere exists a solution $\\{R_1^\\star, \\ldots, R_m^\\star\\} = \\{A_1^{-1/2} Q_1^\\star, \\ldots, A_m^{-1/2} Q_m^\\star\\}$ that admits a representation\n$Q_i^\\star = I_N - U(I - G_i^\\star)U^T$, where $U = \\Phi_0^T K_0^{-1/2}$ and $G_i^\\star \\in O(mt)$ (i = 1, . . . , m).\n\nAlgorithm 1: Regularized Hyperalignment\n  Input: $X_{1:m} \\in \\mathbb{R}^{t \\times n}$, $A_{1:m} \\in \\mathbb{R}^{n \\times n}$\n  Output: $R_{1:m} \\in \\mathbb{R}^{n \\times n}$\n  Initialize $Q_{1:m}$ as identity (n \u00d7 n)\n  Set $\\tilde{X}_i \\leftarrow X_i A_i^{-1/2}$ for i = 1, . . . , m\n  foreach round do\n    foreach subject/view i do\n      $\\mathcal{A} \\leftarrow \\{1, 2, \\ldots, m\\}$ (sample mean) or $\\{1, 2, \\ldots, m\\} \\setminus \\{i\\}$ (LOO mean)\n      $Y \\leftarrow \\frac{1}{|\\mathcal{A}|} \\sum_{j \\in \\mathcal{A}} \\tilde{X}_j Q_j$\n      $[\\bar{U}, \\bar{\\Sigma}, \\bar{V}] \\leftarrow \\mathrm{SVD}(\\tilde{X}_i^T Y)$\n      $Q_i \\leftarrow \\bar{U} \\bar{V}^T$\n    end\n  end\n  foreach subject/view i do\n    $R_i \\leftarrow A_i^{-1/2} Q_i$\n  end\n\nAlgorithm 2: Regularized Kernel Hyperalignment\n  Input: $\\hat{k}(\\cdot, \\cdot)$, $\\alpha$, $\\beta$, $X_{1:m} \\in \\mathbb{R}^{t \\times n}$\n  Output: $R_{1:m}$, linear maps in feature space\n  Initialize feature maps $\\Phi_1, \\ldots, \\Phi_m \\in \\mathbb{R}^{t \\times N}$\n  Initialize plane support $\\Phi_0 = [\\Phi_1^T \\;\\; \\Phi_2^T \\;\\; \\cdots \\;\\; \\Phi_m^T]^T$\n  Initialize $G_{1:m} \\in \\mathbb{R}^{r \\times r}$ as identity (r = mt)\n  foreach round do\n    foreach subject/view i do\n      $\\mathcal{A} \\leftarrow \\{1, 2, \\ldots, m\\}$ (sample mean) or $\\{1, 2, \\ldots, m\\} \\setminus \\{i\\}$ (LOO mean)\n      $Y \\leftarrow \\frac{1}{|\\mathcal{A}|} \\sum_{j \\in \\mathcal{A}} \\tilde{B}_j G_j$\n      $[\\bar{U}, \\bar{\\Sigma}, \\bar{V}] \\leftarrow \\mathrm{SVD}(\\tilde{B}_i^T Y)$\n      $G_i \\leftarrow \\bar{U} \\bar{V}^T$\n    end\n  end\n  foreach subject/view i do\n    $Q_i \\leftarrow I - \\Phi_0^T K_0^{-1/2} (I_r - G_i) K_0^{-1/2} \\Phi_0$\n    $R_i \\leftarrow A_i^{-1/2} Q_i$\n  end\n\nWhen mt is large enough so that evaluating an SVD of numerous mt \u00d7 mt matrices is prohibitive,\nwe can first perform PCA-like reduction. Let $K_0$ have eigen-decomposition $V_0 \\Lambda_0 V_0^T$, where the\nnonnegative diagonal entries of $\\Lambda_0$ are sorted in decreasing order. We set $\\Phi_{0'} = V_{0'}^T \\Phi_0$, where\n$V_{0'}$ is formed by the first r columns of $V_0$, and then use $U = \\Phi_{0'}^T K_{0'}^{-1/2}$. In general, rather\nthan compute Q according to (7), involving $N(N-1)/2 = O(N^2)$ degrees of freedom (when N is\nfinite), we end up with $r(r-1)/2 = O(r^2)$ degrees of freedom via the kernel trick.\nLet $\\tilde{B}_i = B_i K_{i0} K_0^{-1/2} \\in \\mathbb{R}^{t \\times r}$. We reduce (8) in terms of $G_i$ and obtain (Supplementary Material)\n\n$$G_i = \\arg\\max_{G \\in O(r)} \\mathrm{tr}\\left( G^T \\tilde{B}_i^T \\left[ \\frac{1}{|\\mathcal{A}|} \\sum_{j \\in \\mathcal{A}} \\tilde{B}_j \\hat{G}_j \\right] \\right), \\tag{11}$$\n\nwhere $\\hat{G}_j$ is the current estimate of $G_j$. Equation (11) is the classical orthogonal Procrustes\nproblem. If $\\bar{U} \\bar{\\Sigma} \\bar{V}^T$ is the SVD of $\\tilde{B}_i^T \\left[ \\frac{1}{|\\mathcal{A}|} \\sum_{j \\in \\mathcal{A}} \\tilde{B}_j \\hat{G}_j \\right]$, then a maximizer is given by $\\bar{U} \\bar{V}^T$ [7].\nThe kernel hyperalignment procedure is given in Algorithm 2. Using the approach taken in this\nsection also leads to an efficient solution of the standard orthogonal Procrustes problem for $n \\geq 2t$\n(Supplementary Material).\n
In turn, this leads to an efficient iterative solution for the hyperalignment\nproblem when n is large.\n\n4 Alignment Assessment\n\nAn alignment procedure is not subject to the typical train-and-test paradigm. The lack of spatial\ncorrespondence demands an align-train-test approach. We assume these three sets have within-subject\n(or within-view) alignment. With all other parameters fixed, if the aligned test error is\nsmaller than the unaligned test error, there is strong evidence suggesting that alignment was the\nunderlying cause.\nKernel hyperalignment returns linear transformations $R_{1:m}$ that act on data living in feature space.\nIn general, we cannot directly train and test in the feature space due to its large size. We can,\nhowever, learn from relational data. For example, we can compute distances between examples\nand, subsequently, produce nearest neighbor classifiers. Assume $(\\alpha, \\beta) = (1, 0)$, i.e., the $R_{1:m}$\nare orthogonal. If $x_1 \\in \\mathbb{R}^n$ is a view-i example and $x_2 \\in \\mathbb{R}^n$ is a view-j example, the respective\npre-aligned and post-aligned squared distances between the two examples are given by\n\n$$\\|\\Phi(x_1^T) - \\Phi(x_2^T)\\|_F^2 = \\hat{k}(x_1, x_1) + \\hat{k}(x_2, x_2) - 2\\hat{k}(x_1, x_2) \\tag{12}$$\n$$\\|\\Phi(x_1^T) R_i - \\Phi(x_2^T) R_j\\|_F^2 = \\hat{k}(x_1, x_1) + \\hat{k}(x_2, x_2) - 2\\Phi(x_1^T) R_i R_j^T \\Phi(x_2^T)^T. \\tag{13}$$\n\nThe cross-term in (13) has not been expanded for a simple reason: it is too messy. We realized early\non that the alignment and training phase would be replete with lengthy expansions and, consequently,\nsought to simplify matters with a computer science solution. Both binary and unary operations in\nfeature space can be accomplished with a simple class. Our Phi class stores expressions of the\nfollowing forms:\n\n$$\\underbrace{\\sum_{k=1}^{K} M_k \\Phi(X_{\\overrightarrow{a}(k)})}_{\\text{Type 1}} \\qquad \\underbrace{\\sum_{k=1}^{K} \\Phi(X_{\\overleftarrow{a}(k)})^T M_k}_{\\text{Type 2}} \\qquad \\underbrace{b I_N + \\sum_{k=1}^{K} \\Phi(X_{\\overleftarrow{a}(k)})^T M_k \\Phi(X_{\\overrightarrow{a}(k)})}_{\\text{Type 3}}. \\tag{14}$$\n\nEach class instance stores matrices $M_{1:K}$, scalar b, right address vector $\\overrightarrow{a}$, and left address vector $\\overleftarrow{a}$.\nThe address vectors are pointers to the input data. This allows for faster manipulation and smaller\nmemory allocation. Addition and subtraction require a common type. If types match, then the M\nmatrices must be checked for compatible sizes. Multiplication is performed for types 1 with 2, 1\nwith 3, 2 with 1, 3 with 2, and 3 with 3. The first of these cases, for example, produces a numeric\nresult via the kernel trick. We also define scalar multiplication and division for all types and matrix\nmultiplication for types 1 and 2. A transpose operator applies for all types and maps type 1 to 2,\n2 to 1, and 3 to 3. More advanced operations, such as powers and inverses, are also possible. Our\nimplementation was done in Matlab.\nThe construction of the Phi class allows us to stay in feature space and avoid lengthy expansions. In\nturn, this facilitates implementing a richer set of SVM classifiers. Let $X_{\\bar{1}}, \\ldots, X_{\\bar{m}} \\in \\mathbb{R}^{s \\times n}$ be our\ntraining data with feature representation $\\Phi_{\\bar{\\imath}} = \\Phi(X_{\\bar{\\imath}}) \\in \\mathbb{R}^{s \\times N}$. Recall that kernel hyperalignment\nseeks to align in feature space. Before alignment we might have considered $K_{\\bar{\\imath}\\bar{\\jmath}} = \\Phi_{\\bar{\\imath}} \\Phi_{\\bar{\\jmath}}^T$; we now\nconsider the Gram matrix $(\\Phi_{\\bar{\\imath}} R_i)(\\Phi_{\\bar{\\jmath}} R_j)^T = \\Phi_{\\bar{\\imath}} R_i R_j^T \\Phi_{\\bar{\\jmath}}^T$. If every row of $X_{\\bar{\\imath}}$ has a corresponding\nlabel, we can train an SVM with\n\n$$K_{\\bar{A}} = \\begin{pmatrix} \\Phi_{\\bar{1}} R_1 \\\\ \\vdots \\\\ \\Phi_{\\bar{m}} R_m \\end{pmatrix} \\begin{pmatrix} \\Phi_{\\bar{1}} R_1 \\\\ \\vdots \\\\ \\Phi_{\\bar{m}} R_m \\end{pmatrix}^T = \\begin{pmatrix} \\Phi_{\\bar{1}} R_1 R_1^T \\Phi_{\\bar{1}}^T & \\Phi_{\\bar{1}} R_1 R_2^T \\Phi_{\\bar{2}}^T & \\cdots & \\Phi_{\\bar{1}} R_1 R_m^T \\Phi_{\\bar{m}}^T \\\\ \\Phi_{\\bar{2}} R_2 R_1^T \\Phi_{\\bar{1}}^T & \\Phi_{\\bar{2}} R_2 R_2^T \\Phi_{\\bar{2}}^T & & \\vdots \\\\ \\vdots & & \\ddots & \\vdots \\\\ \\Phi_{\\bar{m}} R_m R_1^T \\Phi_{\\bar{1}}^T & \\cdots & \\cdots & \\Phi_{\\bar{m}} R_m R_m^T \\Phi_{\\bar{m}}^T \\end{pmatrix}, \\tag{15}$$\n\nwhere $K_{\\bar{A}} = K_{\\bar{A}}^T \\in \\mathbb{R}^{ms \\times ms}$ denotes the aligned kernel matrix. The unaligned kernel matrix, $K_{\\bar{U}}$,\nis also an m \u00d7 m block matrix with ij-th block $K_{\\bar{\\imath}\\bar{\\jmath}}$.\nUsing the dual formulation of an SVM, a classifier can be constructed from the relational data\nexhibited among the examples [4]. Similar to a k-nearest neighbor classifier relying on pairwise\ndistances, an SVM relies on the kernel matrix. The kernel matrix is a matrix of inner products and\nis therefore linear. This enables us to assess a partition-based alignment.\nIn fMRI, we perform two alignments, one for each hemisphere. Together, the alignments produce two\naligned kernel matrices, which we sum and then input into an SVM. Thus, linearity provides us the\nmeans to handle finer partitions by simply summing the aligned kernel matrices.\n\nTable 1: Seven-label classification using movie-based alignment. Below is the cross-validated,\nbetween-subject classification accuracy (within-subject in brackets) with (\u03b1, \u03b2) = (1, 0). Four\nhundred TRs per subject were used for the alignment.\n
Chance = 1/7 \u2248 14.29%.\n\n            Ventral Temporal                      Entire Cortex\n            2,997 voxels/hemisphere               133,590 voxels/hemisphere\nKernel      Anatomical         Kernel Hyp.        Anatomical         Kernel Hyp.\nLinear      35.71% [42.68%]    48.57% [42.68%]    34.64% [26.79%]    36.25% [26.79%]\nQuadratic   35.00% [43.32%]    50.36% [42.32%]    36.07% [25.54%]    36.43% [25.54%]\nGaussian    36.25% [43.39%]    48.57% [43.39%]    36.07% [26.07%]    36.43% [26.07%]\nSigmoid     35.89% [43.21%]    48.21% [43.21%]    35.00% [26.79%]    36.25% [26.79%]\n\n5 Experiments\n\nThe data used in this section consisted of fMRI time-series data from 10 subjects who viewed a\nmovie and also engaged in a block-design visualization experiment [17]. Each subject saw Raiders\nof the Lost Ark (1981) lasting a total of 2213 TRs. In the visualization experiment, subjects were\nshown images belonging to a specific class for 16 TRs followed by 10 TRs of rest. The 7 classes\nwere: (1) female face, (2) male face, (3) monkey, (4) house, (5) chair, (6) shoe and (7) dog. There\nwere 8 runs total, and each run had every image class represented once.\nWe assess alignment by classification accuracy. To provide the same number of voxels per ROI for\nall subjects, we first performed anatomical alignment. We then selected a contiguous block of 400\nTRs from the movie data to serve as the per-subject input of the kernel hyperalignment. Next, we\nextracted labeled examples from the visualization experiment by taking an offset time average of\neach 16-TR class exposure. An offset of 6 seconds factored in the hemodynamic response. This\nproduced 560 labeled examples: 10 subjects \u00d7 8 runs/subject \u00d7 7 examples/run.\nKernel hyperalignment allows us to (a) use nonlinear measures of similarity, and (b) consider more\nvoxels for the alignment. Consequently, we (a) experiment with a variety of kernels, and (b) do not\nneed to pre-select or screen voxels as was done in [9]; we include them all.\n
Table 1 features results\nfrom a 7-label classification experiment. Recall that a linear kernel reduces to hyperalignment. We\nclassified using a multi-label \u03bd-SVM [3]. We used the first 400 TRs from each subject\u2019s movie data,\nand aligned each hemisphere separately. The kernel functions are supplied in the Supplementary\nMaterial. As observed in [9] and repeated here, hyperalignment leads to increased between-subject\naccuracy and outperforms within-subject accuracy. Thus, we are extracting more common structure\nacross subjects. Whereas employing Algorithm 1 for 2,997 voxels is feasible (and slow), 133,590\nvoxels is not feasible at all.\nTo complete the picture, we plot the effects of regularization. Figure 1 displays the cross-validated,\nbetween-subject classification accuracy for varying (\u03b1, \u03b2) where \u03b1 = 1\u2212\u03b2. This traces out a route\nfrom CCA (\u03b1 \u2248 0) to hyperalignment (\u03b1 = 1). When compared to the alignments in [9], our voxel\ncounts are orders of magnitude larger. For our four chosen kernels, hyperalignment (\u03b1 = 1) presents\nitself as the option with near-greatest accuracy.\nOur results support the robustness of hyperalignment and imply that voxel selection may be a crucial\npre-processing step when dealing with the whole volume. More voxels mean more noisy voxels,\nand hyperalignment does not distinguish itself from anatomical alignment when the entire cortex is\nconsidered. We can visualize this phenomenon with Multidimensional Scaling (MDS) [21].\nMDS takes as input all of the pairwise distances between subjects (the previous section discussed\ndistance calculations). Figure 2 depicts the optimal Euclidean representation of our 10 subjects\nbefore and after kernel hyperalignment ((\u03b1, \u03b2) = (1, 0)) with respect to the first 400 TRs of the movie\ndata. Focusing on VT, kernel hyperalignment manages to cluster 7 of the 10 subjects.\n
However,\nwhen we shift to the entire cortex, we see that anatomical alignment has already succeeded in a\nsimilar clustering. Kernel hyperalignment manages to group the subjects closer together, and manifests\nitself as a re-centering.\n\nFigure 1: Cross-validated between-subject classification accuracy (7 labels) as a function of the\nregularization parameter, \u03b1 = 1\u2212\u03b2, for various kernels after alignment. The solid curves are for\nVentral Temporal and the dashed curves are for the entire cortex. Chance = 1/7 \u2248 14.29%.\n\nFigure 2: Visualizing alignment with MDS. Each locus pair approximates the normalized\nrelationship among the 10 subjects in 2D, before (left) and after (right) applying kernel hyperalignment.\nCentroids are translated to the origin and numbers correspond to individual subjects.\n\n6 Conclusion\n\nWe have extended hyperalignment in both scale and feature space. Kernel hyperalignment can\nhandle a large number of original features and incorporate nonlinear measures of similarity. We have\nalso shown how to use the linear maps (applied in feature space) for post-alignment classification.\nIn the setting of fMRI, we have demonstrated successful alignment with a variety of kernels. Kernel\nhyperalignment achieved better between-subject classification than anatomical alignment for VT.\nThere was no noticeable difference when we considered the entire cortex. Nevertheless, kernel\nhyperalignment proved robust and did not degrade with increasing voxel count.\nWe envision a fruitful path for kernel hyperalignment. Empirically, we have noticed a tradeoff\nbetween feature cardinality and classification accuracy, motivating the need for intelligent feature\nselection within our established framework.\n
Although we have limited our focus to fMRI data\nanalysis, kernel hyperalignment can be applied to other research areas which rely on multi-set Procrustes\nproblems.\n\n[Figure 1 (plots): one panel each for the Linear, Quadratic, Gaussian, and Sigmoid kernels; x-axis \u03b1 (= 1\u2212\u03b2), y-axis BSC Accuracy.]\n[Figure 2 (plots): panels for Ventral Temporal and Entire Cortex with the Linear and Gaussian kernels; points are numbered by subject, 1 to 10.]\n\nReferences\n[1] F.R. Bach and M.I. Jordan. Kernel independent component analysis. The Journal of Machine Learning Research, 3:1\u201348, 2003.\n[2] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.\n[3] C.C. Chang and C.J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1\u201327:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.\n[4] P.H. Chen, C.J. Lin, and B. Sch\u00f6lkopf. A tutorial on \u03bd-support vector machines. Applied Stochastic Models in Business and Industry, 21(2):111\u2013136, 2005.\n[5] A. Edelman, T.A. Arias, and S.T. Smith. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 1998.\n[6] C. Goodall. Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society. Series B (Methodological), pages 285\u2013339, 1991.\n[7] J.C. Gower and G.B. Dijksterhuis. Procrustes Problems, volume 30. Oxford University Press, USA, 2004.\n[8] D.R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639\u20132664, 2004.\n[9] J.V. Haxby, J.S. Guntupalli, A.C. Connolly, Y.O.\n
Halchenko, B.R. Conroy, M.I. Gobbini, M. Hanke, and P.J. Ramadge. A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2):404\u2013416, 2011.\n[10] T. Hofmann, B. Sch\u00f6lkopf, and A.J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171\u20131220, 2008.\n[11] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1990.\n[12] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321\u2013377, 1936.\n[13] J.R. Kettenring. Canonical analysis of several sets of variables. Biometrika, 58(3):433, 1971.\n[14] G.S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495\u2013502, 1970.\n[15] M. Kuss and T. Graepel. The geometry of kernel canonical correlation analysis. Technical report, Max Planck Institute, 2003.\n[16] P.L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(5):365\u2013378, 2000.\n[17] M.R. Sabuncu, B.D. Singer, B. Conroy, R.E. Bryan, P.J. Ramadge, and J.V. Haxby. Function based inter-subject alignment of human cortical anatomy. Cerebral Cortex, 2009.\n[18] B. Sch\u00f6lkopf, R. Herbrich, and A. Smola. A generalized representer theorem. In Computational learning theory, pages 416\u2013426. Springer, 2001.\n[19] P.H. Sch\u00f6nemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1\u201310, March 1966.\n[20] J. Talairach and P. Tournoux. Co-planar stereotaxic atlas of the human brain: 3-dimensional proportional system: an approach to cerebral imaging. Thieme, 1988.\n[21] J.B. Tenenbaum, V. De Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319\u20132323, 2000.\n[22] H. Xu, A. Lorbert, P. J. Ramadge, J. S. 
Guntupalli, and J. V. Haxby. Regularized hyperalignment of multi-set fMRI data. Proceedings of the 2012 IEEE Signal Processing Workshop, Ann Arbor, Michigan, 2012.", "award": [], "sourceid": 884, "authors": [{"given_name": "Alexander", "family_name": "Lorbert", "institution": null}, {"given_name": "Peter", "family_name": "Ramadge", "institution": null}]}