{"title": "Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization", "book": "Advances in Neural Information Processing Systems", "page_first": 2409, "page_last": 2417, "abstract": "We propose a new weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach we develop a generalization of the Max-Path search algorithm, which allows us to efficiently search over a structured space of multiple spatio-temporal paths, while also allowing to incorporate context information into the model. Instead of using spatial annotations, in the form of bounding boxes, to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, we show how our model can produce top-down saliency maps conditioned on the classification label and localized latent paths.", "full_text": "Action is in the Eye of the Beholder: Eye-gaze Driven\n\nModel for Spatio-Temporal Action Localization\n\nNataliya Shapovalova\u2217\n\nMichalis Raptis\u2020\n\u2020Comcast\n\n\u2217Simon Fraser University\n{nshapova,mori}@cs.sfu.ca mraptis@cable.comcast.com lsigal@disneyresearch.com\n\n\u2021Disney Research\n\nLeonid Sigal\u2021\n\nGreg Mori\u2217\n\nAbstract\n\nWe propose a weakly-supervised structured learning approach for recognition and\nspatio-temporal localization of actions in video. 
As part of the proposed approach,\nwe develop a generalization of the Max-Path search algorithm which allows us to\nef\ufb01ciently search over a structured space of multiple spatio-temporal paths while\nalso incorporating context information into the model. Instead of using spatial\nannotations in the form of bounding boxes to guide the latent model during train-\ning, we utilize human gaze data in the form of a weak supervisory signal. This is\nachieved by incorporating eye gaze, along with the classi\ufb01cation, into the struc-\ntured loss within the latent SVM learning framework. Experiments on a chal-\nlenging benchmark dataset, UCF-Sports, show that our model is more accurate,\nin terms of classi\ufb01cation, and achieves state-of-the-art results in localization. In\naddition, our model can produce top-down saliency maps conditioned on the clas-\nsi\ufb01cation label and localized latent paths.\n\n1\n\nIntroduction\n\nStructured prediction models for action recognition and localization are emerging as prominent al-\nternatives to more traditional holistic bag-of-words (BoW) representations. The obvious advantage\nof such models is the ability to localize, spatially and temporally, an action (and actors) in poten-\ntially long and complex scenes with multiple subjects. Early alternatives [3, 7, 14, 27] address this\nchallenge using sub-volume search, however, this implicitly assumes that the action and actor(s) are\nstatic within the frame. More recently, [9] and [18, 19] propose \ufb01gure-centric approaches that can\ntrack an actor by searching over the space of spatio-temporal paths in video [19] and by incorporat-\ning person detection into the formulation [9]. However, all successful localization methods, to date,\nrequire spatial annotations in the form of partial poses [13], bounding boxes [9, 19] or pixel level\nsegmentations [7] for learning. 
Obtaining such annotations is both time consuming and unnatural;\noften it is not easy for a human to decide which spatio-temporal segment corresponds to an action.\n\nOne alternative is to proceed in a purely unsupervised manner and try to mine for most discriminant\nportions of the video for classi\ufb01cation [2]. However, this often results in over\ufb01tting due to the rela-\ntively small and constrained nature of the datasets, as discriminant portions of the video, in training,\nmay correspond to regions of background and be unrelated to the motion of interest (e.g., grass may\nbe highly discriminative for \u201ckicking\u201d action because in the training set most instances come from\nsoccer, but clearly \u201ckicking\u201d can occur on nearly any surface). Bottom-up perceptual saliency, com-\nputed from eye-gaze of observers (obtained using an eye tracker), has recently been introduced as\nanother promising alternative to annotation and supervision [11, 21]. It has been shown that tradi-\ntional BoW models computed over the salient regions of the video result in superior performance,\ncompared to dense sampling of descriptors. However, this comes at expense of losing ability to\nlocalize actions. Bottom-up saliency models usually respond to numerous unrelated low-level stim-\nuli [25](e.g., textured cluttered backgrounds, large motion gradients from subjects irrelevant to the\naction, etc.) which often fall outside the region of the action (and can confuse classi\ufb01ers).\n\n1\n\n\fIn this paper we posit that a good spatio-temporal model for action recognition and localization\nshould have three key properties: (1) be \ufb01gure-centric, to allow for subject and/or camera motion,\n(2) discriminative, to facilitate classi\ufb01cation and localization, and (3) perceptually semantic, to mit-\nigate over\ufb01tting to accidental statistical regularities in a training set. 
To avoid reliance on spatial\nannotation of actors we utilize human gaze data (collected by having observers view correspond-\ning videos [11]) as weak supervision in learning1. Note that such weak annotation is more natural,\neffortless (from the point of view of an annotator) and can be done in real-time. By design, gaze\ngives perceptually semantic interest regions, however, while semantic, gaze, much like bottom-up\nsaliency, is not necessarily discriminative. Fig. 1(b) shows that while for some (typically fast) ac-\ntions like \u201cdiving\u201d, gaze may be well aligned with the actor and hence discriminative, for others, like\n\u201cgolf\u201d and \u201chorse riding\u201d, gaze may either drift to salient but non discriminant regions (the ball), or\nsimply fall on background regions that are prominent or of intrinsic aesthetic value to the observer.\nTo deal with complexities of the search and ambiguities in the weak-supervision, given by gaze, we\nformulate our model in a max-margin framework where we attempt to infer latent smooth spatio-\ntemporal path(s) through the video that simultaneously maximize classi\ufb01cation accuracy and pass\nthrough regions of high gaze concentration. During learning, this objective is encouraged in the\nlatent Structural SVM [26] formulation through a real-valued loss that penalizes misclassi\ufb01cation\nand, for correctly classi\ufb01ed instances, misalignment with salient regions induced by the gaze. In\naddition to classi\ufb01cation and localization, we show that our model can provide top-down action-\nspeci\ufb01c saliency by predicting distribution over gaze conditioned on the action label and inferred\nspatio-temporal path. 
Having less (annotation) information available at training time, our model\nshows state-of-the art classi\ufb01cation and localization accuracy on the UCF-Sports dataset and is the\n\ufb01rst, to our knowledge, to propose top-down saliency for action classi\ufb01cation task.\n\n2 Related works\n\nAction recognition: The literature on vision-based action recognition is too vast. Here we focus\non the most relevant approaches and point the reader to recent surveys [20, 24] for a more complete\noverview. The most prominent action recognition models to date utilize visual BoW representa-\ntions [10, 22] and extensions [8, 15]. Such holistic models have proven to be surprisingly good at\nrecognition, but are, by design, incapable of spatial or temporal localization of actions.\nSaliency and eye gaze: Work in cognitive science suggests that control inputs to the attention mech-\nanism can be grouped into two categories: stimulus-driven (bottom-up) and goal-driven (top-down)\n[4]. Recent work in action recognition [11, 21] look at bottom-up saliency as a way to sparsify\ndescriptors and to bias BoW representations towards more salient portions of the video. In [11] and\n[21] multiple subjects were tasked with viewing videos while their gaze was recorded. A saliency\nmodel is then trained to predict the gaze and is used to either prune or weight the descriptors. How-\never, the proposed saliency-based sampling is purely bottom-up, and still lacks ability to localize\nactions in either space or time2. In contrast, our model is designed with spatio-temporal localiza-\ntion in mind and uses gaze data as weak supervision during learning. In [16] and [17] authors use\n\u201cobjectness\u201d saliency operator and person detector as weak supervision respectively, however, in\nboth cases the saliency is bottom-up and task independent. 
The top-down discriminative saliency,\nbased on distribution of gaze, in our approach, allows our model to focus on perceptually salient re-\ngions that are also discriminative. Similar in spirit, in [5] gaze and action labels are simultaneously\ninferred in ego-centric action recognition setting. While conceptually similar, the model in [5] is\nsigni\ufb01cantly different both in terms of formulation and use. The model [5] is generative and relies\non existence of object detectors.\nSub-volume search: Spatio-temporal localization of actions is a dif\ufb01cult task, largely due to the\ncomputational complexity of search involved. One way to alleviate this computational complexity\nis to model the action as an axis aligned rectangular 3D volume. This allows spatio-temporal search\nto be formulated ef\ufb01ciently using convolutions in the Fourier [3] or Clifford Fourier [14] domain. In\n[28] an ef\ufb01cient spatio-temporal branch-and-bound approach was proposed as alternative. However,\nthe assumption of single \ufb01xed axis aligned volumetric representation is limiting and only applicable\n\n1We assume no gaze data is available for test videos.\n2Similar observations have been made in object detection domain [25], where purely bottom-up saliency\n\nhas been shown to produce responses on textured portions of the background, outside of object of interest.\n\n2\n\n\f(a)\n\n(b)\n\nFigure 1: Graphical model representation is illustrated in (a). Term \u03c6(x, h) captures information\nabout context (all the video excluding regions de\ufb01ned by latent variables h); terms \u03c8(x, hi) capture\ninformation about latent regions. Inferred latent regions should be discriminative and match high\ndensity regions of eye gaze data. In (b) ground truth eye gaze density, computed from \ufb01xations of\nmultiple subjects, is overlaid over images from sequences of 3 different action classes (see Sect. 1).\n\nfor well de\ufb01ned and relatively static actions. 
In [7] an extension to multiple sub-volumes that model\nparts of the action is proposed and amounts to a spatio-temporal part-based (pictorial structure)\nmodel. While part-based model of [7] allows for greater \ufb02exibility, the remaining axis-aligned\nnature of part sub-volumes is still largely appropriate for recognition in scenarios where camera and\nsubject are relatively static. This constraint is slightly relaxed in [12] where a part-based model built\non dense trajectory clustering is proposed. However, [12] relies on sophisticated pre-processing\nwhich requires building long feature trajectories over time, which is dif\ufb01cult to do for fast motions\nor less textured regions.\nMost closely related approaches to our work come from [9, 18, 19]. In [18] Tran and Yuan show\nthat a rectangular axis-aligned volume constraint can be relaxed by ef\ufb01ciently searching over the\nspace of smooth paths within the spatio-temporal volume. The resulting Max-Path algorithm is\napplied to object tracking in video. In [19] this approach is further extended by incorporating Max-\nPath inference into a max-margin structured output learning framework, resulting in an approach\ncapable of localizing actions. We generalize Max-Path idea by allowing multiple smooth paths and\ncontext within a latent max-margin structured output learning. In addition, our model is trained to\nsimultaneously localize and classify actions. Alternatively, [9] uses latent SVM to jointly detect\nan actor and recognize actions. In practice, [9] relies on human detection for both inference and\nlearning and only sub-set of frames can be localized due to the choice of the features (HOG3D).\nSimilarly, [2] relies on person detection and distributed partial pose representation, in the form of\nposelets, to build a spatio-temporal graph for action recognition and localization. We want to stress\nthat [2, 9, 18, 19] require bounding box annotations for actors in learning. 
In contrast, we focus on\nweaker and more natural source of data \u2013 gaze, to formulate our learning criteria.\n\n3 Recognizing and Localizing Actions in Videos\n\nOur goal is to learn a model that can jointly localize and classify human actions in video. This prob-\nlem is often tackled in the same manner as object recognition and localization in images. However,\nextension to a temporal domain comes with many challenges. The core challenges we address are:\n(i) dealing with a motion of the actor within the frame, resulting from camera or actor\u2019s own motion\nin the world; (ii) complexity of the resulting spatio-temporal search, that needs to search over the\nspace of temporal paths; (iii) ability to model coarse temporal progression of the action and action\ncontext, and (iv) learning in absence of direct annotations for actor(s) position within the frame.\nTo this end, we propose a model that has the ability to localize temporally and spatially discrimi-\nnative regions of the video and encode the context in which these regions occur. The output of the\nmodel indicates the absence or presence of a particular action in the video sequence while simulta-\nneously extracting the most discriminative and perceptually salient spatio-temporal video regions.\nDuring the training phase, the selection of these regions is implicitly driven by eye gaze \ufb01xations\ncollected by a sample of viewers. As a consequence, our model is able to perform top-down video\nsaliency detection conditioned on the performed action and localized action region.\n\n3\n\n\f1 Model Formulation\nGiven a set of video sequences {x1, . . . , xn} \u2282 X and their associated labels {y1, . . . , yn}, with\nyi \u2208 {\u22121, 1}, our purpose is to learn a mapping f : X \u2192 {\u22121, 1}. Additionally, we introduce aux-\niliary latent variables {h1, . . . , hn}, where hi = {hi1, . . . 
, hiK} and hik \u2208 \u2205 \u222a {(lj, tj, rj, bj)}_{j=Ts}^{Te} denotes the left, top, right and bottom coordinates of a spatio-temporal path of bounding boxes defined from frame Ts up to frame Te. The latent variables h specify the spatio-temporal regions selected by our model. Our function is then defined as y\u2217_x(w) = f(x; w), where\n\n(y\u2217_x(w), h\u2217_x(w)) = argmax_{(y,h) \u2208 {\u22121,1} \u00d7 H} F(x, y, h; w),    F(x, y, h; w) = w^T \u03a8(x, y, h),    (1)\n\nw is a parameter of the model, and \u03a8(x, y, h) \u2208 R^d is a joint feature map. Video sequences in which the action of interest is absent are treated as zero vectors in the Hilbert space induced by the feature map \u03a8, similar to [1]. The feature map of videos in which the action of interest is present is decomposed into two components: (a) the latent regions and (b) the context regions. As a consequence, the scoring function is written:\n\nF(x, y = 1, h; w) = w^T \u03a8(x, y = 1, h) = w_0^T \u03c6(x, h) + \u2211_{k=1}^{K} w_k^T \u03c8(x, hk) + b,    (2)\n\nwhere K is the number of latent regions of the action model and b is the bias term. A graphical representation of the model is illustrated in Fig. 1(a).\nLatent regions potential w_k^T \u03c8(x, hk): This potential function measures the compatibility of the latent spatio-temporal region hk with the action model. More specifically, \u03c8(x, hk) returns the sum of normalized BoW histograms extracted from the bounding box defined by the latent variable hk = {(lj, tj, rj, bj)}_{j=Ts}^{Te} at each corresponding frame.\n\nContext potential w_0^T \u03c6(x, h): We define context as the entire video sequence excluding the latent regions; our aim is to capture any information that is not directly produced by the appearance and motion of the actor. 
The characteristics of the context are encoded in \u03c6(x, h) as a sum of normalized BoW histograms at each frame of the video, excluding the regions indicated by the latent variables h.\nMany recently proposed action recognition scoring functions [9, 12, 16] include the response of a global BoW statistical representation of the entire video. While such formulations are simpler, since the response of the global representation is independent of the selection of the latent variables, they are also somewhat unsatisfactory from the modeling point of view. First, the visual information that corresponds to the latent region of interest is implicitly counted twice. Second, it is impossible to decouple and analyze the importance of foreground and contextual information separately.\n\n2 Inference\nGiven the model parameters w and an unseen video x, our goal is to infer the binary action label y\u2217 as well as the location of the latent regions h\u2217 (Eq. 1). The scoring function for the case of y = \u22121 is equal to zero due to the trivial zero vector feature map (Sect. 1). However, estimating the optimal value of the scoring function for the case of y = 1 involves maximization over the latent variables. The search space over even a single (non-smooth) spatio-temporal path of variable size bounding boxes in a video sequence of width M, height N and length T is exponential: O((MN)^{2T}). Therefore, we restrict the search space by introducing a number of assumptions. We constrain the search space to smooth spatio-temporal paths\u00b3 of fixed size bounding boxes [18]. These constraints allow the inference of the optimal latent variables for a single region using dynamic programming, similarly to the Max-Path algorithm proposed by Tran and Yuan [18].\nAlgorithm 1 summarizes the process of dynamic programming considering both the context and the latent region contributions. The time and space complexity of this algorithm is O(MNT). 
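The single-region dynamic program described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `max_cpath_forward`, the array layout (T, M, N), and the square neighbourhood of half-width `radius` used as the smoothness constraint are all assumptions made for the example. It returns only the globally best path rather than the per-frame records S(t) and L(t).

```python
import numpy as np

def max_cpath_forward(R, Q0, Q1, radius=1):
    # R:  shape (T,), context response of each frame when no box is placed.
    # Q0: shape (T, M, N), context response excluding a box at (u, v).
    # Q1: shape (T, M, N), latent region response of a box at (u, v).
    # radius is a hypothetical smoothness parameter (neighbourhood half-width).
    T, M, N = Q1.shape
    total_context = float(R.sum())
    # Change in the total score when a box is placed at (u, v) in frame t.
    gain = Q0 + Q1 - R[:, None, None]
    S = np.full((T, M, N), -np.inf)            # best score of a path ending at (t, u, v)
    P = np.full((T, M, N, 2), -1, dtype=int)   # predecessor position, -1 marks a path start
    best_score, best_end = -np.inf, None
    for t in range(T):
        for u in range(M):
            for v in range(N):
                prev_score, prev_pos = -np.inf, None
                if t > 0:
                    u_lo, v_lo = max(0, u - radius), max(0, v - radius)
                    window = S[t - 1, u_lo:u + radius + 1, v_lo:v + radius + 1]
                    du, dv = np.unravel_index(np.argmax(window), window.shape)
                    prev_score = window[du, dv]
                    prev_pos = (u_lo + du, v_lo + dv)
                if prev_score > total_context:
                    # Extending the best neighbouring path beats starting a new one.
                    S[t, u, v] = prev_score + gain[t, u, v]
                    P[t, u, v] = prev_pos
                else:
                    # Start a new path at this frame.
                    S[t, u, v] = total_context + gain[t, u, v]
                if S[t, u, v] > best_score:
                    best_score, best_end = S[t, u, v], (t, u, v)
    # Trace the best path back to its first frame.
    path = []
    t, u, v = best_end
    while True:
        path.append((int(t), int(u), int(v)))
        pu, pv = P[t, u, v]
        if pu < 0:
            break
        t, u, v = t - 1, pu, pv
    return float(best_score), path[::-1]
```

The explicit triple loop keeps the correspondence with the recursion visible; each cell is visited once, so the cost is linear in the video volume for a fixed neighbourhood size.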
However, without introducing further constraints on the latent variables, the extension of this forward message passing procedure to multiple latent regions results in an algorithm that is exponential in the number of regions, because of the implicit dependency of the latent variables through the context term. Incorporating temporal ordering constraints between the K latent regions leads to a polynomial time algorithm. More specifically, the optimal scoring function can be inferred by enumerating all potential end locations of each latent region and executing Algorithm 1 independently at each interval in O(MNT^K). For the special case of K = 2, we derive a forward/backward message process that remains linear in the size of the video volume: O(MNT); see the summary in Algorithm 2. In our experimental validation a model with 2 latent regions proved to be sufficiently expressive.\n\nFootnote 3: The feasible positions of the bounding box in a frame are constrained by its location in the previous frame.\n\nAlgorithm 1 MaxCPath: Inference of Single Latent Region with Context\n1: Input: R(t): the context local response without the presence of a bounding box, Q0(u, v, t): the context local response excluding the bounding box at location (u, v), Q1(u, v, t): the latent region local response\n2: Output: S(t): score of the best path till frame t, L(t): end point of the best path till t, P(u, v, t): the best path record for tracing back\n3: Initialize S\u2217 = \u2212inf, S(u, v, 0) = \u2212inf \u2200u, v, l\u2217 = null\n4: for t \u2190 1 to T do // Forward Process; Backward Process: t \u2190 T to 1\n5:   for each (u, v) \u2208 [1..M] \u00d7 [1..N] do\n6:     (u0, v0) \u2190 argmax_{(u\u2032, v\u2032) \u2208 Nb(u, v)} S(u\u2032, v\u2032, t \u2212 1)\n7:     if S(u0, v0, t \u2212 1) > \u2211_{i=1}^{T} R(i) then\n8:       S(u, v, t) \u2190 S(u0, v0, t \u2212 1) + Q0(u, v, t) + Q1(u, v, t) \u2212 R(t)\n9:       P(u, v, t) \u2190 (u0, v0, t \u2212 1)\n10:     else\n11:       S(u, v, t) \u2190 Q0(u, v, t) + Q1(u, v, t) + \u2211_{i=1}^{T} R(i) \u2212 R(t)\n12:     end if\n13:     if S(u, v, t) > S\u2217 then\n14:       S\u2217 \u2190 S(u, v, t) and l\u2217 \u2190 (u, v, t)\n15:     end if\n16:   end for\n17:   S(t) \u2190 S\u2217 and L(t) \u2190 l\u2217\n18: end for\n\nAlgorithm 2 Inference: Two Latent Regions with Context\n1: Input: R(t): the context local response without the presence of a bounding box, Q0(u, v, t): the context local response excluding the bounding box at location (u, v), Q1(u, v, t): the latent region local response of the first latent region, Q2(u, v, t): the latent region local response of the second latent region\n2: Output: S\u2217: the maximum score of the inference, h1, h2: first and second latent regions\n3: Initialize S\u2217 = \u2212inf, t\u2217 = null\n4: (S1, L1, P1) \u2190 MaxCPath-Forward(R, Q0, Q1)\n5: (S2, L2, P2) \u2190 MaxCPath-Backward(R, Q0, Q2)\n6: for t \u2190 1 to T \u2212 1 do\n7:   S \u2190 S1(t) + S2(t + 1) \u2212 \u2211_{i=1}^{T} R(i)\n8:   if S > S\u2217 then\n9:     S\u2217 \u2190 S and t\u2217 \u2190 t\n10:   end if\n11: end for\n12: h1 \u2190 traceBackward(P1, L1(t\u2217))\n13: h2 \u2190 traceForward(P2, L2(t\u2217 + 1))\n\n3 Learning Framework\nIdentifying the spatio-temporal regions of the video sequences that will enable our model to detect human action is a challenging optimization problem. While the introduction of latent variables in discriminative models [6, 9, 12, 13, 23, 26] is natural for many applications (e.g., modeling body parts) and has also offered excellent performance [6], it has also led to training formulations with non-convex functions. In our training formulation we adopt large-margin latent structured output learning [26]; however, we also introduce a loss function that weakly supervises the selection of latent variables based on human gaze information. Our training set of videos {x1, . . . 
, xn} along with their action labels {y1, . . . , yn} contains 2D fixation points (sampled at a much higher frequency than the video frame rate) of 16 subjects observing the videos [11]. We transform these measurements using kernel density estimation with a Gaussian kernel (with bandwidth set to the visual angle span of 2\u00b0) into a probability density function of gaze gi = {g^1_i, . . . , g^Ti_i} at each frame of video xi. Following the Latent Structural SVM formulation [26], our learning takes the following form:\n\nmin_{w, \u03be \u2265 0}  (1/2) ||w||^2 + C \u2211_{i=1}^{n} \u03bei\n\ns.t.  max_{h\u2032i \u2208 H} w^T \u03a8(xi, yi, h\u2032i) \u2212 w^T \u03a8(xi, \u0177i, \u0125i) \u2265 \u2206(yi, gi, \u0177i, \u0125i) \u2212 \u03bei,    \u2200\u0177i \u2208 {\u22121, 1}, \u2200\u0125i \u2208 H,    (3)\n\nwhere \u2206(yi, gi, \u0177i, \u0125i) \u2265 0 is an asymmetric loss function encoding the cost of an incorrect action label prediction but also of mislocalization of the eye gaze. The loss function is defined as follows:\n\n\u2206(yi, gi, \u0177i, \u0125i) = 1 \u2212 (1/K) \u2211_{k=1}^{K} \u03b4(gi, \u0125ik)    if yi = \u0177i = 1,\n\u2206(yi, gi, \u0177i, \u0125i) = 1 \u2212 (1/2)(yi \u0177i + 1)    otherwise.    (4)\n\n\u03b4(gi, \u0125ik) indicates the minimum, over frames, of the overlap between \u0125ik and the given eye gaze map gi:\n\n\u03b4(gi, \u0125ik) = min_j \u03b4p(b^j_ik, g^j_i),    Ts,k \u2264 j \u2264 Te,k,    (5)\n\n\u03b4p(b^j_ik, g^j_i) = 1    if \u2211_{b^j_ik} g^j_i \u2265 r,\n\u03b4p(b^j_ik, g^j_i) = (1/r) \u2211_{b^j_ik} g^j_i    otherwise,    0 < r < 1,    (6)\n\nwhere b^j_ik is the bounding box at frame j of the k-th latent region in video xi. 
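As a rough illustration of Eqs. 4 to 6, the gaze-driven loss can be written in a few lines of Python. This is a sketch under stated assumptions rather than the authors' code: gaze maps are taken to be per-frame 2D grids (nested lists) that sum to 1, boxes are (left, top, right, bottom) with half-open pixel ranges, each path is a non-empty list of (frame, box) pairs, and the default r = 0.5 as well as all function names are hypothetical.

```python
def box_gaze_mass(gaze_map, box):
    # Sum of gaze density inside the box (left, top, right, bottom), half-open ranges.
    left, top, right, bottom = box
    return sum(sum(row[left:right]) for row in gaze_map[top:bottom])

def delta_p(gaze_map, box, r):
    # Eq. 6: full credit once the box encloses at least r of the gaze mass.
    mass = box_gaze_mass(gaze_map, box)
    return 1.0 if mass >= r else mass / r

def delta(gaze_maps, path, r):
    # Eq. 5: worst per-frame overlap along one latent path of (frame, box) pairs.
    return min(delta_p(gaze_maps[j], box, r) for j, box in path)

def gaze_loss(y, y_hat, gaze_maps, paths_hat, r=0.5):
    # Eq. 4: gaze misalignment for correctly predicted positives, 0/1 label loss otherwise.
    if y == 1 and y_hat == 1:
        overlap = sum(delta(gaze_maps, p, r) for p in paths_hat) / len(paths_hat)
        return 1.0 - overlap
    return 1.0 - 0.5 * (y * y_hat + 1)
```

With a uniform gaze map, a box covering a quarter of the frame encloses 0.25 of the mass, so with r = 0.5 it receives half credit and a single-path positive prediction incurs a loss of 0.5.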
The parameter r\nregulates the minimum amount of eye gaze \u201cmass\u201d that should be enclosed by each bounding box.\nThe loss function can be easily incorporated in Algorithm 1 during the loss-augmented inference.\n\n4 Gaze Prediction\n\nOur model is based on the core assumption that a subset of perceptually salient regions of a video,\nencoded by the gaze map, share discriminative idiosyncrasies useful for human action classi\ufb01cation.\nThe loss function dictating the learning process enables the model\u2019s parameter (i.e , w) to encode\nthis notion into our model4. Assuming our assumption holds in practice, we can use selected latent\nregions for prediction of top-down saliency within the latent region. We do so by regressing the\namount of eye gaze (probability density map over gaze) on a \ufb01xed grid, inside each bounding box\nof the latent regions, by conditioning on low level features that construct the feature map \u03c8i and the\naction label. In this way the latent regions select consistent salient portions of videos using top-down\nknowledge about the action, and image content modulates the saliency prediction within that region.\nGiven the training data gaze g and the corresponding inferred latent variables h, we learn a linear\nregression, per action class, that maps augmented feature representation of the extracted bounding\nboxes, of each latent region, to a coarse description of the corresponding gaze distribution. Each\nbounding box is divided into a 4 \u00d7 4 grid and a BoW representation for each cell is computed;\naugmented feature is constructed by concatenating these histograms. Similarly, the human gaze is\nsummarized by a 16 dimension vector by accumulating gaze density at each cell over a 4 \u00d7 4 grid.\nFor visualization, we further smooth the predictions to obtain a continuous and smooth gaze density\nover the latent regions. We \ufb01nd our top-down saliency predictions to be quite good (see Sect. 
5) in most cases, which experimentally validates our initial model assumption.\n\nFootnote 4: Parameter r of the loss (Sect. 3) modulates importance of gaze localization within the latent region.\n\n5 Experiments\n\nWe evaluate our model on the UCF-Sports dataset presented in [14]. The dataset contains 150 videos extracted from broadcast television channels and includes 10 different action classes. The dataset includes annotation of action classes as well as bounding boxes around the person of interest (which we ignore for training but use to measure localization performance). We follow the evaluation setup defined by Lan et al. [9] and split the dataset into 103 training and 47 test samples. We employ the eye gaze data made available by Mathe and Sminchisescu [11]. The data captures eye movements of 16 subjects while they were watching the video clips from the dataset. The eye gaze data are represented with a probability density function (Sect. 4).\nData representation: We extract HoG, HoF, and HoMB descriptors [12] on a dense spatio-temporal grid and at 4 different scales. These descriptors are clustered into 3 vocabularies of sizes 500, 500 and 300, respectively. For the baseline experiments, we use an \u21131-normalized histogram representation. For the potentials described in Sect. 1, we represent latent regions/context with the sum of per-frame normalized histograms. Per-frame normalization, as opposed to global normalization over the spatio-temporal region, allows us to aggregate scores iteratively in Algorithm 1.\nBaselines: We compare our model to several baseline methods. All our baselines are trained with a linear SVM, to make them comparable to our linear model, and use the same feature representation as described above. We report performance of three baselines: (1) Global BoW, where the video is represented with just one histogram and all the temporal-spatial structure is discarded. (2) BoW with spatial split (SS), where the video is divided by a 2 \u00d7 2 spatial grid into parts in order to capture spatial structure. (3) BoW with temporal split (TS), where the video is divided into 2 consecutive temporal segments. This setup allows the capture of the basic temporal structure of human action.\nOur model: We evaluate three different variants of our model, which we call Region, Region+Global, and Region+Context. Region: includes only the latent regions, the potentials \u03c8 from our scoring function in Eq. 1, and ignores the context features \u03c6. Region+Global: the context potential \u03c6 is replaced with a Global BoW, as in our first baseline. Region+Context: represents our full model from Eq. 1. We test all our models with one and two latent regions.\n\nTable 1: Action classification and localization results. Our model significantly outperforms the baselines and most of the state-of-the-art results (see text for discussion). \u2217Note that the average localization score is calculated based only on three classes reported in [19].\n\nBaselines\nModel | Accuracy | Localization\nGlobal BoW | 64.29 | N/A\nBoW with SS | 65.95 | N/A\nBoW with TS | 69.64 | N/A\n\nOur Model (# of latent regions)\nModel | Accuracy, K = 1 | Accuracy, K = 2 | Localization, K = 1 | Localization, K = 2\nRegion | 77.98 | 82.14 | 26.4 | 20.8\nRegion+Context | 77.62 | 81.31 | 32.3 | 29.3\nRegion+Global | 76.79 | 80.71 | 29.6 | 30.4\n\nState-of-the-art\nModel | Accuracy | Localization\nLan et al. [9] | 73.1 | 27.8\nTran and Yuan [19] | N/A | 54.3\u2217\nShapovalova et al. [16] | 75.3 | N/A\nRaptis et al. [12] | 79.4 | N/A\n\nAction classification and localization: Results of action classification are summarized in Table 1. We train a model for each action separately in a standard one-vs-all framework. Table 1 shows that all our models outperform the BoW baselines and the results of Lan et al. [9] and Shapovalova et al. [16]. The Region and Region+Context models with two latent regions demonstrate superior performance compared to Raptis et al. [12]. 
Our model with 1 latent region performs slightly worse than the model of Raptis et al. [12]; note, however, that [12] used a non-linear SVM with a \u03c7\u00b2 kernel and 4 regions, while we work with a linear SVM only. Further, we can clearly see that having 2 latent regions is beneficial, and improves the classification performance by roughly 4%. The addition of Global BoW marginally decreases the performance, due to, we believe, over-counting of image evidence and hence overfitting. Context does not improve classification, but does improve localization.\nWe perform action localization by following the evaluation procedure of [9, 19] and estimate how well the inferred latent regions capture the human\u2075 performing the action. Given a video, for each frame we compute the overlap score between the latent region and the ground truth bounding box of the human. The overlap O(b^j_k, b^j_gt) is defined by the \u201cintersection-over-union\u201d metric between the inferred and ground truth bounding boxes. The total localization score per video is computed as an average of the overlap scores of the frames: (1/T) \u2211_{j=1}^{T} O(b^j_k, b^j_gt). Note that, since our latent regions may not span the entire video, instead of dividing by the number of frames T, we divide by the total length of the inferred latent regions. To be consistent with the literature [9, 19], we calculate the localization score of each test video given its ground truth action label.\n\nFootnote 5: Note that by definition the task of localizing a human is unnatural for our model since it captures perceptually salient fixed sized discriminative regions for action classification, not human localization. This unfavorably biases localization results against our model; see Fig. 3 for visual comparison between annotated person regions and our inferred discriminative salient latent regions.\n\nFootnote 6: It is worth mentioning that [19] and [9] have regions detected at different subsets of frames; thus in terms of localization, these methods are not directly comparable.\n\nTable 2: Average amount of gaze (left): Table shows fraction of ground truth gaze captured by the latent region(s) on test videos; context improves the performance. Top-down saliency prediction (right): \u03c7\u00b2 distance and normalized cross-correlation between predicted and ground-truth gaze densities.\n\nAverage amount of gaze (left)\nRegion\n\nRegion+Context\nK = 1 K = 2 K = 1 K = 2\n60.6\n63.8\n\n47.6\n\n68.5\n\nAve.\n\nTop-down saliency prediction (right)\nMethod | Region, K = 1: Corr. | Region, K = 1: \u03c7\u00b2 | Region+Context, K = 1: Corr. | Region+Context, K = 1: \u03c7\u00b2\nOurs | 0.36 | 1.64 | 0.36 | 1.55\n[11] | 0.44 | 1.43 | 0.46 | 1.31\n\nFigure 2: Precision-Recall curves for localization: We compare our model (Region+Context with K=1 latent region) to the method from [18] and [19].\n\nFigure 3: Localization and gaze prediction: First row: ground truth gaze and person bounding box, second row: predicted gaze and extent of the latent region in the frame.\n\nTable 1 reports average localization scores\u2076. It is clear that our model with Context achieves considerably better localization than without (Region), especially with two latent regions. This can be explained by the fact that in UCF-Sports the background tends to be discriminative for classification; hence without proper context a latent region is likely to drift to the background (which reduces the localization score). The context term in our model captures the background and leaves the latent regions free to select perceptually salient regions of the video. Numerically, our full model (Region+Context) outperforms the model of Lan et al. [9] (despite [9] having person detections and actor annotations at training). 
We cannot compare our average performance to Tran and Yuan [19] since their approach\nis evaluated on only 3 action classes out of 10, but we provide their numbers in Table 1 for reference.\nWe build Precision-Recall (PR) curves for our model (Region+Context) and the results reported in [19]\nto better evaluate our method with respect to [19] (see Fig. 2). We refer to [19] for the experimental setup\nand evaluate the PR curves at \u03c3 = 0.2. For the 3 classes in [19], our model performs considerably\nbetter for the \u201cdiving\u201d action, similarly for \u201chorse-riding\u201d, and marginally worse for \u201crunning\u201d.\nGaze localization and prediction: Since our model is driven by eye-gaze, we also measure how\nmuch gaze our latent regions can actually capture on the test set and whether we can predict eye-gaze\nsaliency maps for the inferred latent regions. Evaluation of the gaze localization is performed\nin a similar fashion to the evaluation of action localization described earlier. We estimate the amount\nof gaze that falls into each bounding box of the latent region, and then average the gaze amount\nover the length of all the latent regions of the video. Thus, each video has a gaze localization score\ns_g \u2208 [0, 1]. Table 2 (left) summarizes the average gaze localization for different variants of our model.\nNotably, we are able to capture around 60% of gaze by latent regions when modeling context.\nWe estimate gaze saliency as described in Sect. 4. Qualitative results of the gaze prediction are\nillustrated in Fig. 3. For quantitative comparison, we compute the normalized cross-correlation and \u03c7\u00b2\ndistance between predicted and ground truth gaze; see Table 2 (right). We also evaluate the performance\nof bottom-up gaze prediction [11] within the inferred latent regions. The better results of the bottom-up\napproach can be explained by the superior low-level features used for learning [11]. 
Still, we can observe\nthat for both approaches the full model (Region+Context) is more consistent with gaze prediction.\n6 Conclusion\nWe propose a novel weakly-supervised structured learning approach for recognition and spatio-temporal\nlocalization of actions in video. A special case of our model with two temporally ordered\npaths and context can be solved in linear time complexity. In addition, our approach does not require\nactor annotations for training. Instead, we rely on gaze data for weak supervision by incorporating it\ninto our structured loss. Further, we show how our model can be used to predict top-down saliency\nin the form of gaze density maps. In the future, we plan to explore the benefits of searching over\nregion scale and focus on more complex spatio-temporal relationships between latent regions.\n\n[Figure 2 plots: precision (vertical axis) versus recall (horizontal axis) for the Diving, Running, and Horse-riding classes; curves: Tran & Yuan (2011), Tran & Yuan (2012), Our model]\n\nReferences\n\n[1] M. Blaschko and C. Lampert. Learning to localize objects with structured output regression. In ECCV, 2008.\n[2] C. Chen and K. Grauman. Efficient activity detection with max-subgraph search. In CVPR, 2012.\n[3] K. G. Derpanis, M. Sizintsev, K. Cannons, and R. P. Wildes. Efficient action spotting based on a spacetime oriented structure representation. In CVPR, 2010.\n[4] D. V. Essen, B. Olshausen, C. Anderson, and J. Gallant. Pattern recognition, attention, and information bottlenecks in the primate visual system. SPIE Conference on Visual Information Processing: From Neurons to Chips, 1991.\n[5] A. Fathi, Y. Li, and J. M. Rehg. Learning to recognize daily actions using gaze. In ECCV, 2012.\n[6] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. 
IEEE PAMI, 2010.\n[7] Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In ICCV, 2007.\n[8] A. Kovashka and K. Grauman. Learning a Hierarchy of Discriminative Space-Time Neighborhood Features for Human Action Recognition. In CVPR, 2010.\n[9] T. Lan, Y. Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. In ICCV, 2011.\n[10] I. Laptev. On space-time interest points. IJCV, 64, 2005.\n[11] S. Mathe and C. Sminchisescu. Dynamic eye movement datasets and learnt saliency models for visual action recognition. In ECCV, 2012.\n[12] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discriminative action parts from mid-level video representations. In CVPR, 2012.\n[13] M. Raptis and L. Sigal. Poselet key-framing: A model for human activity recognition. In CVPR, 2013.\n[14] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.\n[15] M. Ryoo and J. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In ICCV, 2009.\n[16] N. Shapovalova, A. Vahdat, K. Cannons, T. Lan, and G. Mori. Similarity constrained latent support vector machine: An application to weakly supervised action classification. In ECCV, 2012.\n[17] P. Siva and T. Xiang. Weakly supervised action detection. In BMVC, 2011.\n[18] D. Tran and J. Yuan. Optimal spatio-temporal path discovery for video event detection. In CVPR, 2011.\n[19] D. Tran and J. Yuan. Max-margin structured output regression for spatio-temporal action localization. In NIPS, 2012.\n[20] P. Turaga, R. Chellappa, V. Subrahmanian, and O. Udrea. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1473\u20131488, 2008.\n[21] E. Vig, M. Dorr, and D. Cox. 
Space-variant descriptor sampling for action recognition based on saliency and eye movements. In ECCV, 2012.\n[22] H. Wang, M. M. Ullah, A. Kl\u00e4ser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.\n[23] Y. Wang and G. Mori. Hidden part models for human action recognition: Probabilistic vs. max-margin. IEEE PAMI, 2010.\n[24] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224\u2013241, 2011.\n[25] J. Yang and M.-H. Yang. Top-down visual saliency via joint CRF and dictionary learning. In CVPR, 2012.\n[26] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.\n[27] J. Yuan, Z. Liu, and Y. Wu. Discriminative subvolume search for efficient action detection. In CVPR, 2009.\n[28] J. Yuan, Z. Liu, and Y. Wu. Discriminative video pattern search for efficient action detection. IEEE PAMI, 33(9), 2011.\n", "award": [], "sourceid": 1139, "authors": [{"given_name": "Nataliya", "family_name": "Shapovalova", "institution": "Simon Fraser University"}, {"given_name": "Michalis", "family_name": "Raptis", "institution": "Disney Research"}, {"given_name": "Leonid", "family_name": "Sigal", "institution": "Disney Research"}, {"given_name": "Greg", "family_name": "Mori", "institution": "Simon Fraser University"}]}