{"title": "A self-organizing multiple-view representation of 3D objects", "book": "Advances in Neural Information Processing Systems", "page_first": 274, "page_last": 281, "abstract": "", "full_text": "274 Weinshall, Edelman and B\u00fclthoff\n\nA self-organizing multiple-view representation of 3D objects\n\nDaphna Weinshall\nCenter for Biological Information Processing\nMIT E25-201\nCambridge, MA 02139\n\nShimon Edelman\nCenter for Biological Information Processing\nMIT E25-201\nCambridge, MA 02139\n\nHeinrich H. B\u00fclthoff\nDept. of Cognitive and Linguistic Sciences\nBrown University\nProvidence, RI 02912\n\nABSTRACT\n\nWe demonstrate the ability of a two-layer network of thresholded summation units to support representation of 3D objects in which several distinct 2D views are stored for each object. Using unsupervised Hebbian relaxation, the network learned to recognize ten objects from different viewpoints. The training process led to the emergence of compact representations of the specific input views. When tested on novel views of the same objects, the network exhibited a substantial generalization capability. In simulated psychophysical experiments, the network's behavior was qualitatively similar to that of human subjects.\n\n1 Background\n\nModel-based object recognition involves, by definition, a comparison between the input image and models of different objects that are internal to the recognition system. The form in which these models are best stored depends on the kind of information available in the input, and on the trade-off between the amount of memory allocated for the storage and the degree of sophistication required of the recognition process.\n\nIn computer vision, a distinction can be made between representation schemes that use 3D object-centered coordinate systems and schemes that store viewpoint-specific information such as 2D views of objects. 
In principle, storing enough 2D views would allow the system to use simple recognition techniques such as template matching. If only a few views of each object are remembered, the system must have the capability to normalize the appearance of an input object, by carrying out appropriate geometrical transformations, before it can be directly compared to the stored representations.\n\nWhat representation strategy is employed by the human visual system? The notion that objects are represented in viewpoint-dependent fashion is supported by the finding that commonplace objects are more readily recognized from certain so-called canonical vantage points than from other, random viewpoints (Palmer et al. 1981). Specifically, canonical views are identified more quickly (and more accurately) than others, with response times decreasing monotonically with increasing subjective goodness.\u00b9\n\nThe monotonic increase in the recognition latency with misorientation of the object relative to a canonical view prompts the interpretation of the recognition process in terms of a mechanism related to mental rotation. In the classical mental rotation task (see Shepard & Cooper 1982), the subject is required to decide whether two simultaneously presented images are two views of the same 3D object. The average latency of correct response in this task is linearly dependent on the difference in the 3D attitude of the object in the two images. This dependence is commonly accounted for by postulating a process that attempts to rotate the 3D shapes perceived in the two images into congruence before making the identity decision. The rotation process is sometimes claimed to be analog, in the sense that the representation of the object appears to pass through intermediate orientation stages as the rotation progresses (Shepard & Cooper 1982). 
\n\nPsychological findings seem to support the involvement of some kind of mental rotation in recognition by demonstrating the dependence of recognition latency for an unfamiliar view of an object on the distance to its closest familiar view. There is, however, an important qualification. Practice with specific objects appears to cause this strategy to be abandoned in favor of a more memory-intensive, less time-consuming direct comparison strategy. Under direct comparison, many views of the objects are stored and recognition proceeds in essentially constant time, provided that the presented views are sufficiently close to one of the stored views (Tarr & Pinker 1989, Edelman et al. 1989).\n\nFrom the preceding outline, it appears that a faithful model of object representation in the human visual system should provide both for the ability to \"rotate\" 3D objects and for the fast direct-comparison strategy that supersedes mental rotation for highly familiar objects. Surprisingly, it turns out that mental rotation in recognition can be replicated by a self-organizing memory-intensive model based on direct comparison. The rest of the present paper describes such a model, called CLF (conjunctions of localized features; see Edelman & Weinshall 1989).\n\n1 Canonical views of objects can be reliably identified in subjective judgement as well as in recognition tasks. For example, when asked to form a mental image of an object, people usually imagine it as seen from a canonical perspective.\n\n[Figure 1 diagram; labels: INPUT (feature) LAYER F, representation of V1, REPRESENTATION LAYER, FOOTPRINT of object O1]\n\nFigure 1: The network consists of two layers, F (input, or feature, layer) and R (representation layer). Only a small part of the projections from F to R are shown. 
The network encodes input patterns by making units in the R-layer respond selectively to conjunctions of features localized in the F-layer. The curve connecting the representations of the different views of the same object in the R-layer symbolizes the association that builds up between these views as a result of practice.\n\n2 The model\n\nThe structure of the model appears in Figure 1 (see Edelman & Weinshall 1989 for details). The first (input, or feature) layer of the network is a feature map. In our experiments, vertices of wire-frame objects served as the input features. Every unit in the (feature) F-layer is connected to all units in the second (representation) R-layer. The initial strength of a \"vertical\" (V) connection between an F-unit and an R-unit decreases monotonically with the \"horizontal\" distance between the units, according to an inverse square law (which may be considered the first approximation to a Gaussian distribution). In our simulations the size of the F-layer was 64 x 64 units and the size of the R-layer was 16 x 16 units. Let $(x, y)$ be the coordinates of an F-unit and $(i, j)$ the coordinates of an R-unit. The initial weight between these two units is $w_{xyij}|_{t=0} = (\\sigma [1 + (x - 4i)^2 + (y - 4j)^2])^{-1}$, where $\\sigma = 50$ and $(4i, 4j)$ is the point in the F-layer that is directly \"above\" the R-unit $(i, j)$.\n\nThe R-units in the representation layer are connected among themselves by lateral (L) connections, whose initial strength is zero. Whereas the V-connections form the representations of individual views of an object, the L-connections form associations among different views of the same object.\n\n2.1 Operation\n\nDuring training, the input to the model is a sequence of appearances of an object, encoded by the 2D locations of concrete sensory features (vertices) rather than a list 
At the first presentation of a stimulus several representation \nunits are active, all with different strengths (due to the initial distribution of vertical \nconnection strengths). \n\n2.1.1 Winner Take All \n\nWe employ a simple winner-take-all (WTA) mechanism to identify for each view \nof the input object a few most active R-units, which subsequently are recruited to \nrepresent that view. The WTA mechanism works as follows. The net activities \nof the R-units are uniformly thresholded. Initially, the threshold is high enough to \nensure that all activity in the R-Iayer is suppressed. The threshold is then gradually \ndecreased, by a fixed (multiplicative) amount, until some activity appears in the \nR-layer. If the decrease rate of the threshold is slow enough, only a few units will \nremain active at the end of the WTA process. In our implementation, the decrease \nrate was 0.95. In most cases, only one winner emerged. \n\nNote that although the WTA can be obtained by a simple computation, we prefer \nthe stepwise algorithm above because it has a natural interpretation in biological \nterms. Such an interpretation requires postulating two mechanisms that operate in \nparallel. The first mechanism, which looks at the activity of the R-Iayer, may be \nthought as a high fan-in OR gate. The second mechanism, which performs uniform \nadjustable thresholding on all the R-units, is similar to a global bias. Together, they \nresemble feedback-regulated global arousal networks that are thought to be present, \ne.g., in the medulla and in the limbic system of the brain (Kandel & Schwartz 1985).2 \n\n2.1.2 Adjustment of weights and thresholds \n\nIn the next stage, two changes of weights and thresholds occur that make the \ncurrently active R-units (the winners of the WTA stage) selectively responsive to \nthe present view of the input object. 
First, there is an enhancement of the V-connections from the active (input) F-units to the active R-units (the winners). At the same time, the thresholds of the active R-units are raised, so that at the presentation of a different input these units will be less likely to respond and to be recruited anew. We employ Hebbian relaxation to enhance the V-connections from the input layer to the active R-unit (or units). The connection strength $v_{ab}$ from F-unit $a$ to R-unit $b = (i, j)$ changes by\n\n[equation (1)]\n\nwhere $A_{ij}$ is the activation of the R-unit $(i, j)$ after WTA, $v_{max}$ is an upper bound on a connection strength and $\\alpha$ is a parameter controlling the rate of convergence. The threshold of a winner R-unit is increased by\n\n[equation (2)]\n\nwhere $\\delta < 1$. This rule keeps the thresholded activity level of the unit growing while the unit becomes more input specific. As a result, the unit encodes the spatial structure of a specific view, responding selectively to that view after only a few (two or three) presentations.\n\n2 The relationship of this approach to other WTA algorithms is discussed in Edelman & Weinshall 1989.\n\n2.1.3 Between-views association\n\nThe principle by which specific views of the same object are grouped is that of temporal association. New views of the object appear in a natural order, corresponding to their succession during an arbitrary rotation of the object. The lateral (L) connections in the representation layer are modified by a time-delay Hebbian relaxation. 
L-connection $w_{bc}$ between R-units $b = (i, j)$ and $c = (l, m)$ that represent successive views is enhanced in proportion to the closeness of their peak activations in time, up to a certain time difference $K$:\n\n[equation (3)]\n\nThe strength of the association between two views is made proportional to a coefficient, $AM(b, c)$, that measures the strength of the apparent motion effect that would ensue if the two views were presented in succession to a human subject (see Edelman & Weinshall 1989).\n\n2.1.4 Multiple-view representation\n\nThe appearance of a new object is explicitly signalled to the network, so that two different objects do not become associated by this mechanism. The parameter $\\gamma_k$ decreases with $|k|$ so that the association is stronger for units whose activation is closer in time. In this manner, a footprint of temporally associated view-specific representations is formed in the second layer for each object. Together, the view-specific representations form a distributed multiple-view representation of the object.\n\n3 Testing the model\n\nWe have subjected the CLF network to simulated experiments, modeled after the experiments of Edelman et al. (1989). Some of the results of the real and simulated experiments appear in Figures 2 and 3. In the experiments, each of ten novel 3D wire-frame objects served in turn as target. The task was to distinguish between the target and the other nine, non-target, objects. The network was first trained on a set of projections of the target's vertices from 16 evenly spaced viewpoints. After learning the target using Hebbian relaxation as described above, the network was tested on a sequence of inputs, half of which consisted of familiar views of the target, and half of views of other, not necessarily familiar, objects.\n\nThe presentation of an input to the F-layer activated units in the representation layer. The activation then spread to other R-units via the L-connections. After a fixed number of lateral activation cycles, we correlated the resulting pattern of activity with footprints of objects learned so far. The object whose footprint yielded the highest correlation was recognized by definition. In the beginning of the testing stage, this correlation, which served as an analog of response time,\u00b3 exhibited strong dependence on object orientation, replicating the effect of mental rotation in recognition. During testing, successive activation of R-units through association strengthened the L-connection between them, leading to an obliteration of the linear structure of R-unit sequences responsible for mental rotation effects.\n\n3 The justification for this use of correlation appears in Edelman & Weinshall 1989.\n\n[Figure 2 plots: two panels; horizontal axis: session]\n\nFigure 2: Performance of five human subjects (left panel) and of the CLF model (right panel). The variation of the performance measure (for human subjects, response time RT; for the model, correlation CORR between the input and a stored representation) over different views of an object serves as an estimate of the strength of the canonical views phenomenon. In both human subjects and the model, practice appears to reduce the strength of this phenomenon.\n\n[Figure 3 plots: two panels; horizontal axis: D, dist. from best view (deg)]\n\nFigure 3: Another comparison of human performance (left panel) with that of the CLF model (right panel). Define the best view for each object as the view with the shortest RT (highest CORR). If recognition involves rotation to the best (canonical) view, RT or CORR should depend monotonically on D = D(target, view), the distance between the best view and the actually shown view. (The decrease in RT or CORR at D = 180\u00b0 is due to the fact that for the wire-frame objects used in the experiments the view diametrically opposite the best one is also easily recognized.) For both human subjects and the model, the dependence is clear for the first session of the experiment (upper curves), but disappears with practice (second session, lower curves).\n\n3.1 Generalization to novel views\n\nThe usefulness of a recognition scheme based on multiple-view representation depends on its ability to classify correctly novel views of familiar objects. To assess the generalization ability of the CLF network, we have tested it on views obtained by rotating the objects away from learned views by as much as 23\u00b0 (see Figure 4). The classification rate was better than chance for the entire range of rotation. For rotations of up to 4\u00b0 it was close to perfect, decreasing to 30% at 23\u00b0 (chance level was 10% because we have used ten objects). One may compare this result with the finding (Rock & DiVita 1987) that people have difficulties in recognizing or imagining wire-frame objects in a novel orientation that differs by more than 30\u00b0 from a familiar one.\n\n[Figure 4 plot; horizontal axis: Distance from learned position (deg)]\n\nFigure 4: Performance of the network on novel orientations of familiar objects (mean of 10 objects, bars denote the variance).\n\nWe note that blurring the input prior to its application to the F-layer can significantly extend the generalization ability of the CLF model. Performing autoassociation on a dot pattern blurred with a Gaussian is computationally equivalent to correlating the input with a set of templates, realized as Gaussian receptive fields. This, in turn, appears to be related to interpolation with Radial Basis Functions (Moody & Darken 1989, Poggio & Girosi 1989, Poggio & Edelman 1989).\n\n4 Summary\n\nWe have described a two-layer network of thresholded summation units which is capable of developing multiple-view representations of 3D objects in an unsupervised fashion, using fast Hebbian learning. Using this network to model the performance of human subjects on similar stimuli, we replicated psychophysical experiments that investigated the phenomena of canonical views and mental rotation. The model's performance closely parallels that of the human subjects, even though the network has no a priori mechanism for \"rotating\" object representations. In the model, a semblance of rotation is created by progressive activation of object footprints (chains of representation units created through association during training). Practice causes the footprints to lose their linear structure through the creation of secondary association links between random representation units, leading to the disappearance of orientation effects. Our results may indicate that a different interpretation of findings that are usually taken to signify mental rotation is possible. The footprints formed in the representation layer in our model provide a hint as to what the substrate upon which the mental rotation phenomena are based may look like.\n\nReferences\n\n[1] S. Edelman, H. B\u00fclthoff, and D. Weinshall. Stimulus familiarity determines recognition strategy for novel 3D objects. MIT A.I. Memo No. 1138, 1989.\n\n[2] S. Edelman and D. Weinshall. 
A self-organizing multiple-view representation of 3D objects. MIT A.I. Memo No. 1146, 1989.\n\n[3] E. R. Kandel and J. H. Schwartz. Principles of neural science. Elsevier, 1985.\n\n[4] J. Moody and C. Darken. Fast learning in networks of locally tuned processing units. Neural Computation, 1:281-289, 1989.\n\n[5] S. Palmer, E. Rosch, and P. Chase. Canonical perspective and the perception of objects. In J. Long and A. Baddeley, eds, Attn. & Perf. IX, 135-151. Erlbaum, 1981.\n\n[6] T. Poggio and S. Edelman. A network that learns to recognize 3D objects. Nature, 1989, in press.\n\n[7] T. Poggio and F. Girosi. A theory of networks for approximation and learning. MIT A.I. Memo No. 1140, 1989.\n\n[8] I. Rock and J. DiVita. A case of viewer-centered object perception. Cognitive Psychology, 19:280-293, 1987.\n\n[9] R. N. Shepard and L. A. Cooper. Mental images and their transformations. MIT Press, 1982.\n\n[10] M. Tarr and S. Pinker. Mental rotation and orientation-dependence in shape recognition. Cognitive Psychology, 21, 1989.\n", "award": [], "sourceid": 240, "authors": [{"given_name": "Daphna", "family_name": "Weinshall", "institution": null}, {"given_name": "Shimon", "family_name": "Edelman", "institution": null}, {"given_name": "Heinrich", "family_name": "B\u00fclthoff", "institution": null}]}