Part of Advances in Neural Information Processing Systems 8 (NIPS 1995)
Jonathan Marshall, Richard Alley, Robert Hubbard
Visual occlusion events constitute a major source of depth information. This paper presents a self-organizing neural network that learns to detect, represent, and predict the visibility and invisibility relationships that arise during occlusion events, after a period of exposure to motion sequences containing occlusion and disocclusion events. The network develops two parallel opponent channels or "chains" of lateral excitatory connections for every resolvable motion trajectory. One channel, the "On" chain or "visible" chain, is activated when a moving stimulus is visible. The other channel, the "Off" chain or "invisible" chain, carries a persistent, amodal representation that predicts the motion of a formerly visible stimulus that becomes invisible due to occlusion. The learning rule uses disinhibition from the On chain to trigger learning in the Off chain. The On and Off chain neurons can learn separate associations with object depth or(cid:173) dering. The results are closely related to the recent discovery (Assad & Maunsell, 1995) of neurons in macaque monkey posterior parietal cortex that respond selectively to inferred motion of invisible stimuli.
1
INTRODUCTION: LEARNING ABOUT OCCLUSION EVENTS
Visual occlusion events constitute a major source of depth information. Yet lit(cid:173) tle is known about the neural mechanisms by which visual systems use occlusion events to infer the depth relations among visual objects. What is the structure of such mechanisms? Some possible answers to this question are revealed through an analysis of learning rules that can cause such mechanisms to self-organize.
Evidence from psychophysics (Kaplan, 1969; Nakayama & Shimojo, 1992; Nakayama, Shimojo, & Silverman, 1989; Shimojo, Silverman, & Nakayama, 1988, 1989; Yonas, Craton, & Thompson, 1987) and neurophysiology (Assad & Maunsell, 1995; Frost, 1993) suggests that the process of determining relative depth from occlusion events operates at an early stage of visual processing. Mar(cid:173) shall (1991) describes evidence that suggests that the same early processing mech(cid:173) anisms maintain a representation of temporarily occluded objects for some amount
Learning to Predict Visibility and Invisibility from Occlusion Events
817
of time after they have disappeared behind an occluder, and that these represen(cid:173) tations of invisible objects interact with other object representations, in much the same manner as do representations of visible objects. The evidence includes the phenomena of kinetic subjective contours (Kellman & Cohen, 1984), motion viewed through a slit (Parks' Camel) (Parks, 1965) , illusory occlusion (Ramachandran, In(cid:173) ada, & Kiama, 1986) , and interocular occlusion sequencing (Shimojo, Silverman, & Nakayama, 1988). 2 PERCEPTION OF OCCLUSION AND
DISOCCLUSION EVENTS: AN ANALYSIS
The neural network model exploits the visual changes that occur at occlusion bound(cid:173) aries to form a mechanism for detecting and representing object visibility/invisibility information. The set of learning rules used in this model is an extended version of one that has been used before to describe the formation of neural mechanisms for a variety of other visual processing functions (Hubbard & Marshall, 1994; Mar(cid:173) shall, 1989, 1990ac, 1991, 1992; Martin & Marshall, 1993).
Our analysis is derived from the following visual predictivity principle, which
may be postulated as a fundamental principle of neural organization in visual sys(cid:173) tems: Visual systems represent the world in terms of predictions of its appearance, and they reorganize themselves to generate better predictions. To maximize the cor(cid:173) rectness and completeness of its predictions, a visual system would need to predict the motions and visibility/invisibility of all objects in a scene. Among other things, it would need to predict the disappearance of an object moving behind an occluder and the reappearance of an object emerging from behind an occluder.
A consequence of this postulate is that occluded objects must, at some level,
continue to be represented even though they are invisible. Moreover, the repre(cid:173) sentation of an object must distinguish whether the object is visible or invisible; otherwise, the visual system could not determine whether its representations predict visibility or invisibility, which would contravene the predictivity principle. Thus, simple single-channel prediction schemes like the one described by Marshall (1989, 1990a) are inadequate to represent occlusion and disocclusion events. 3 A MODEL FOR GROUNDED LEARNING TO
PREDICT VISIBILITY AND INVISIBILITY
The initial structure of the Visible/Invisible network model is given in Figure 1A. The network self-organizes in response to a training regime containing many input sequences representing motion with and without occlusion and disocclusion events. After a period of self-organization, the specific connections that a neuron receives (Figure 1B) determine whether it responds to visible or invisible objects. A neuron that responds to visible objects would have strong bottom-up input connections, and it would also have strong time-delayed lateral excitatory input connections. A neuron that responds selectively to invisible objects would not have strong bottom(cid:173) up connections, but it would have strong lateral excitatory input connections. These lateral inputs would transmit to the neuron evidence that a previously visible object existed. The neurons that respond to invisible objects must operate in a way that allows lateral input excitation alone to activate the neurons supraliminally, in the absence of bottom-up input excitation from actual visible objects. 4 SIMULATION OF A SIMPLIFIED NETWORK 4.1
INITIAL NETWORK STRUCTURE
The simulated network, shown in Figure 2, describes a simplified one(cid:173) dimensional subnetwork (Marshall & Alley, 1993) of the more general two(cid:173) dimensional network. Layer 1 is restricted to a set of motion-sensitive neurons corresponding to one rightward motion trajectory.
The L+ connections in the simulation have a signal transmission latency of one time unit. Restricting the lateral connections to a single time delay and to a single direction limits the simulation to representing a single speed and direction of motion; these results are therefore preliminary. This restriction reduced the number of connections and made the simulation much faster.
818
J. A. MARSHALL, R. K. ALLEY, R. S. HUBBARD