{"title": "Modeling Saccadic Targeting in Visual Search", "book": "Advances in Neural Information Processing Systems", "page_first": 830, "page_last": 836, "abstract": null, "full_text": "Modeling Saccadic Targeting in Visual Search \n\nRajesh P. N. Rao \nComputer Science Department \nUniversity of Rochester \nRochester, NY 14627 \nrao@cs.rochester.edu \n\nGregory J. Zelinsky \nCenter for Visual Science \nUniversity of Rochester \nRochester, NY 14627 \ngreg@cvs.rochester.edu \n\nMary M. Hayhoe \nCenter for Visual Science \nUniversity of Rochester \nRochester, NY 14627 \nmary@cvs.rochester.edu \n\nDana H. Ballard \nComputer Science Department \nUniversity of Rochester \nRochester, NY 14627 \ndana@cs.rochester.edu \n\nAbstract \n\nVisual cognition depends critically on the ability to make rapid eye movements known as saccades that orient the fovea over targets of interest in a visual scene. Saccades are known to be ballistic: the pattern of muscle activation for foveating a prespecified target location is computed prior to the movement and visual feedback is precluded. Despite these distinctive properties, there has been no general model of the saccadic targeting strategy employed by the human visual system during visual search in natural scenes. This paper proposes a model for saccadic targeting that uses iconic scene representations derived from oriented spatial filters at multiple scales. Visual search proceeds in a coarse-to-fine fashion with the largest scale filter responses being compared first. The model was empirically tested by comparing its performance with actual eye movement data from human subjects in a natural visual search task; preliminary results indicate substantial agreement between eye movements predicted by the model and those recorded from human subjects. \n\n1 INTRODUCTION \n\nHuman vision relies extensively on the ability to make saccadic eye movements. 
These rapid eye movements, which are made at the rate of about three per second, orient the high-acuity foveal region of the eye over targets of interest in a visual scene. The high velocity of saccades, reaching up to 700\u00b0 per second for large movements, serves to minimize the time in flight; most of the time is spent fixating the chosen targets. \n\n(a) \n\n(b) \n\nFigure 1: Eye Movements in Visual Search. (a) shows the typical pattern of multiple saccades (shown here for two different subjects) elicited during the course of searching for the object composed of the fork and knife. The initial fixation point is denoted by '+'. (b) depicts a summary of such movements over many experiments as a function of the six possible locations of a target object on the table. \n\nThe objective of saccades is currently best understood for reading text [13] where the eyes fixate almost every word, sometimes skipping over small function words. In general scenes, however, the purpose of saccades is much more difficult to analyze. It was originally suggested that the movements and their resultant fixations formed a visual-motor memory (or \"scan-paths\") of objects [11] but subsequent work has suggested that the role of saccades is more tightly coupled to the momentary problem solving strategy being employed by the subject. In chess, it has been shown that saccades are used to assess the current situation on the board in the course of making a decision to move, but the exact information that is being represented is not yet known [5]. In a task involving the copying of a model block pattern located on a board, fixations have been shown to be used in accessing crucial information for different stages of the copying task [2]. In natural language processing, there has been recent evidence that fixations reflect the instantaneous parsing of a spoken sentence [18]. 
However, none of the above work addresses the important question of what possible computational mechanisms underlie saccadic targeting. \n\nThe complexity of the targeting problem can be illustrated by the saccades employed by subjects to solve a natural visual search task. In this task, subjects are given a 1 second preview of a single object on a table and then instructed to determine, in the shortest possible amount of time, whether the previewed object is among a group of one to five objects on the same table in a subsequent view. The typical eye movements elicited are shown in Figure 1 (a). Rather than a single movement to the remembered target, several saccades are typical, with each successive saccade moving closer to the goal object (Figure 1 (b)). \n\nThe purpose of this paper is to describe a mechanism for programming saccades that can approximately model the saccadic targeting method used by human subjects. Previous models of human visual search have focused on simple search tasks involving elementary features such as horizontal/vertical bars of possibly different color [1,4,8] or have relied exclusively on bottom-up input-driven saliency criteria for generating scan-paths [10, 19]. The proposed model achieves targeting in arbitrary visual scenes by using bottom-up scene representations in conjunction with previously memorized top-down object representations; both of these representations are iconic, based on oriented spatial filters at multiple scales. \n\nOne of the difficult aspects of modeling saccadic targeting is that saccades are ballistic, i.e., their final location is computed prior to making the movement and the movement trajectory is uninterrupted by incoming visual signals. Furthermore, owing to the structure of the retina, the central 1.5\u00b0 of the visual field is represented with a resolution that is almost 100 times greater than that of the periphery. 
We resolve these issues by positing that the targeting computation proceeds sequentially with coarse resolution information being used in the computation of target coordinates prior to fine resolution information. The method is compared to actual eye movements made by human subjects in the visual search task described above; the eye movements predicted by the model are shown to be in close agreement with observed human eye movements. \n\nFigure 2: Multiscale Natural Basis Functions. The 10 oriented spatial filters used in our model to generate iconic scene representations, shown here at three octave-separated scales. These filters resemble the receptive field profiles of cells in the primate visual cortex [20] and have been shown to approximate the dominant eigenvectors of natural image distributions as obtained from principal component analysis [7,17]. \n\n2 ICONIC REPRESENTATIONS \n\nThe current implementation of our model uses a set of non-orthogonal basis functions as given by a zeroth order Gaussian $G_0$ and nine of its oriented derivatives as follows [6]: \n\n$$G_n^{\\theta_n}, \\quad n = 1, 2, 3, \\quad \\theta_n = 0, \\ldots, m\\pi/(n+1), \\quad m = 1, \\ldots, n \\qquad (1)$$ \n\nwhere $n$ denotes the order of the filter and $\\theta_n$ refers to the preferred orientation of the filter (Figure 2). The response of an image patch $I$ centered at $(x_0, y_0)$ to a particular basis filter $G_i^{\\theta_j}$ can be obtained by convolving the image patch with the filter: \n\n$$r_{i,j}(x_0, y_0) = \\iint G_i^{\\theta_j}(x_0 - x, y_0 - y)\\, I(x, y)\\, dx\\, dy \\qquad (2)$$ \n\nThe iconic representation for the local image patch centered at $(x_0, y_0)$ is formed by combining into a high-dimensional vector the responses from the ten basis filters at different scales: \n\n$$\\mathbf{r}_s(x_0, y_0) = [r_{i,j,s}(x_0, y_0)] \\qquad (3)$$ \n\nwhere $i = 0, 1, 2, 3$ denotes the order of the filter, $j = 1, \\ldots, i + 1$ denotes the different filters per order, and $s = s_{\\min}, \\ldots, s_{\\max}$ denotes the different scales as given by the levels of a Gaussian image pyramid. \n\nThe use of multiple scales is crucial to the visual search model (see Section 3). In particular, the larger the number of scales, the greater the perspicuity of the representation as depicted in Figure 3. A multiscale representation also allows interpolation strategies for scale invariance. The high dimensionality of the vectors makes them remarkably robust to noise due to the orthogonality inherent in high-dimensional spaces: given any vector, most of the other vectors in the space tend to be relatively uncorrelated with the given vector. The iconic representations can also be made invariant to rotations in the image plane (for a fixed scale) without additional convolutions by exploiting the property of steerability [6]. Rotations about an image plane axis are handled by storing feature vectors from different views. We refer the interested reader to [14] for more details regarding the above properties. \n\n3 THE VISUAL SEARCH MODEL \n\nOur model for visual search is derived from a model for vision that we previously proposed in [14]. This model decomposes visual behaviors into sequences of two visual routines, one for identifying the visual image near the fovea (the \"what\" routine), and another for locating a stored prototype on the retina (the \"where\" routine). \n\n(a) \n\n(b) \n\nFigure 3: The Effect of Scale. The distribution of distances (in terms of correlations) between the response vector for a selected model point in the dining table scene and all other points in the scene is shown for single scale response vectors (a) and multiple scale vectors (b). 
Using responses from multiple scales (five in this case) results in greater perspicuity and a sharper peak near 0.0; only one point (the model point) had a correlation greater than 0.94 in the multiple scale case (b) whereas 936 candidate points fell in this category in the single scale case (a). \n\nThe visual search model assumes the existence of three independent processes running concurrently: (a) a targeting process (similar to the \"where\" routine of [14]) that computes the next location to be fixated; (b) an oculomotor process that accepts target locations and executes a saccade to foveate that location (see [16] for more details); and (c) a decision process that models the cortico-cortical dynamics of the V1 \u2194 V2 \u2194 V4 \u2194 IT pathway related to the identification of objects in the fovea (see [15] for more details). \n\nHere, we focus on the saccadic targeting process. Objects of interest to the current search task are assumed to be represented by a set of previously memorized iconic feature vectors $\\mathbf{r}_s^{mp}$ where $s$ denotes the scale of the filters. The targeting algorithm computes the next location to be foveated as follows: \n\n1. Initialize the routine by setting the current scale of analysis $k$ to the largest scale, i.e., $k = \\max$; set $S_m(x, y) = 0$ for all $(x, y)$. \n\n2. Compute the current saliency image $S_m$ as \n\n$$S_m(x, y) = \\sum_{s=k}^{\\max} \\| \\mathbf{r}_s(x, y) - \\mathbf{r}_s^{mp} \\|^2 \\qquad (4)$$ \n\n3. Find the location to be foveated by using the following weighted population averaging (or soft max) scheme: \n\n$$(\\hat{x}, \\hat{y}) = \\sum_{(x,y)} F(S_m(x, y))\\, (x, y) \\qquad (5)$$ \n\nwhere $F$ is an interpolation function. For the experiments, we chose: \n\n$$F(S_m(x, y)) = \\frac{e^{-S_m(x,y)/\\lambda(k)}}{\\sum_{(x,y)} e^{-S_m(x,y)/\\lambda(k)}} \\qquad (6)$$ \n\nThis choice is attractive since it allows an interpretation of our algorithm as computing maximum likelihood estimates (cf. [12]) of target locations. In the above, $\\lambda(k)$ is decreased with $k$. \n\n4. 
Iterate steps (2) and (3) above with $k = \\max - 1, \\max - 2, \\ldots$ until either the target object has been foveated or the number of scales has been exhausted. \n\nFigure 4 illustrates the above targeting procedure. The case where multiple model vectors are used per object proceeds in an analogous manner with the target location being averaged over all the model vectors. \n\n(a) \n\n(b) \n\n(c) \n\n(d) \n\nFigure 4: Illustration of Saccadic Targeting. The saliency image after the inclusion of the largest (a), intermediate (b), and smallest scale (c) as given by image distances to the prototype (the fork and knife); the lightest points are the closest matches. (d) shows the predicted eye movements as determined by the weighted population averaging scheme (for comparison, saccades from a human subject are given by the dotted arrows). \n\n4 EXPERIMENTAL RESULTS AND DISCUSSION \n\nEye movements from four human subjects were recorded for the search task described in Section 1 for three different scenes (dining table, work bench, and a crib) using an SRI Dual Purkinje Eyetracker. The model was implemented on a pipeline image processor, the Datacube MV200, which can compute convolutions at frame rate (30/sec). Figure 5 compares the model's performance to the human data. As the results show, there is remarkably good correspondence between the eye movements observed in human subjects and those generated by the model on the same data sets. The model has only one important parameter: the scaling function used to rate the peaks in the saliency map. In the development of the algorithm, this was adjusted to achieve an approximate fit to the human data. \n\nOur model relies crucially on the existence of a coarse-to-fine matching mechanism. 
The main benefit of a coarse-to-fine strategy is that it allows continuous execution of the decision/oculomotor processes, thereby increasing the probability of an early match. Coarse-to-fine strategies have enjoyed recent popularity in computer vision with the advent of image pyramids in tasks such as motion detection [3]. Although these methods show that considerable speedup can be achieved by decreasing the size of the window of analysis as resolution increases, our preliminary experiments suggest that this might be an inappropriate strategy for visual search: limiting search to a small window centered on the coarse location estimate obtained from a larger scale often resulted in significant errors since the targets frequently lay outside the search window. A possible solution is to adaptively select the size of the search window based on the current scene but this would require additional computational machinery. \n\n[Figure 5 panels: First, Second, and Third Saccades, Human (left) and Model (right); horizontal axis: Endpoint Error] \n\nFigure 5: Experimental Results. The graphs compare the distribution of endpoint errors (in terms of frequency histograms) for three consecutive saccades as predicted by the model for 180 trials (on the right) and as observed with four human subjects for 676 trials (left). Each of the trials contained search scenes with one to five objects, one of the objects being the previewed model. \n\nA key question that remains is the source of sequential application of the filters in the human visual system. A possible source is the variation in resolution of the retina. Since very high resolution information is available only at the fovea, and since this resolution falls off with distance, fine spatial scales may be ineffective purely because the fixation point is distant from the target. However, our preliminary experiments with modeling the variation in retinal resolution suggest that this is probably not the sole cause. The variations at middle distances from the fovea are too small to explain the dramatic improvement in target location experienced with the second saccade. Thus, there are two remaining possibilities: (a) the resolution fall-off in the cortex is different from the retinal variation in a way that supports the data, or (b) the cortical machinery is set up to match the larger scales first. In the latter case, the observed data would result from the fact that the oculomotor system is ready to move before all the scales can be matched, and thus the eyes move to the current best target position. This interpretation of the data is appealing in two respects. 
First, it reflects a long history of observations on the priority of large scale channels [9], and second, it reflects current thinking about eye movement programming suggesting that fixation times are approximately constant and that the eyes are moved as soon as they can be during the course of visual problem solving. The above questions can, however, be definitively answered only through additional testing of human subjects followed by subsequent modeling. We expect our saccadic targeting model to play a crucial role in this process. \n\nAcknowledgments \n\nThis research was supported by NIH/PHS research grants 1-P41-RR09283 and 1-R24-RR06853-02, and by NSF research grants IRI-9406481 and IRI-8903582. \n\nReferences \n\n[1] Subutai Ahmad and Stephen Omohundro. Efficient visual search: A connectionist solution. In Proceedings of the 13th Annual Conference of the Cognitive Science Society. Chicago, 1991. \n\n[2] Dana H. Ballard, Mary M. Hayhoe, and Polly K. Pook. Deictic codes for the embodiment of cognition. Technical Report 95.1, National Resource Laboratory for the study of Brain and Behavior, University of Rochester, January 1995. \n\n[3] P.J. Burt. Attention mechanisms for vision in a dynamic world. In ICPR, pages 977-987, 1988. \n\n[4] David Chapman. Vision, Instruction, and Action. PhD thesis, MIT Artificial Intelligence Laboratory, 1990. (Technical Report 1204). \n\n[5] W.G. Chase and H.A. Simon. Perception in chess. Cognitive Psychology, 4:55-81, 1973. \n\n[6] William T. Freeman and Edward H. Adelson. The design and use of steerable filters. IEEE PAMI, 13(9):891-906, September 1991. \n\n[7] Peter J.B. Hancock, Roland J. Baddeley, and Leslie S. Smith. The principal components of natural images. Network, 3:61-70, 1992. \n\n[8] Michael C. Mozer. The perception of multiple objects: A connectionist approach. Cambridge, MA: MIT Press, 1991. 
\n\n[9] D. Navon. Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9:353-383, 1977. \n\n[10] Ernst Niebur and Christof Koch. Control of selective visual attention: Modeling the \"where\" pathway. This volume, 1996. \n\n[11] D. Noton and L. Stark. Scanpaths in saccadic eye movements while viewing and recognizing patterns. Vision Research, 11:929-942, 1971. \n\n[12] Steven J. Nowlan. Maximum likelihood competitive learning. In Advances in Neural Information Processing Systems 2, pages 574-582. Morgan Kaufmann, 1990. \n\n[13] J.K. O'Regan. Eye movements and reading. In E. Kowler, editor, Eye Movements and Their Role in Visual and Cognitive Processes, pages 455-477. New York: Elsevier, 1990. \n\n[14] Rajesh P.N. Rao and Dana H. Ballard. An active vision architecture based on iconic representations. Artificial Intelligence (Special Issue on Vision), 78:461-505, 1995. \n\n[15] Rajesh P.N. Rao and Dana H. Ballard. Dynamic model of visual memory predicts neural response properties in the visual cortex. Technical Report 95.4, National Resource Laboratory for the study of Brain and Behavior, Computer Sci. Dept., University of Rochester, November 1995. \n\n[16] Rajesh P.N. Rao and Dana H. Ballard. Learning saccadic eye movements using multiscale spatial filters. In G. Tesauro, D.S. Touretzky, and T.K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 893-900. Cambridge, MA: MIT Press, 1995. \n\n[17] Rajesh P.N. Rao and Dana H. Ballard. Natural basis functions and topographic memory for face recognition. In Proc. of IJCAI, pages 10-17, 1995. \n\n[18] M. Tanenhaus, M. Spivey-Knowlton, K. Eberhard, and J. Sedivy. Integration of visual and linguistic information in spoken language comprehension. To appear in Science, 1995. \n\n[19] Keiji Yamada and Garrison W. Cottrell. A model of scan paths applied to face recognition. In Proc. 17th Annual Conf. 
of the Cognitive Science Society, 1995. \n\n[20] R.A. Young. The Gaussian derivative theory of spatial vision: Analysis of cortical cell receptive field line-weighting profiles. General Motors Research Publication GMR-4920, 1985. \n", "award": [], "sourceid": 1128, "authors": [{"given_name": "Rajesh", "family_name": "Rao", "institution": null}, {"given_name": "Gregory", "family_name": "Zelinsky", "institution": null}, {"given_name": "Mary", "family_name": "Hayhoe", "institution": null}, {"given_name": "Dana", "family_name": "Ballard", "institution": null}]}