{"title": "Visual Sequence Learning in Hierarchical Prediction Networks and Primate Visual Cortex", "book": "Advances in Neural Information Processing Systems", "page_first": 2662, "page_last": 2673, "abstract": "In this paper we developed a computational hierarchical network model to understand the spatiotemporal sequence learning effects observed in the primate visual cortex. The model is a hierarchical recurrent neural model that learns to predict video sequences using the incoming video signals as teaching signals. The model performs fast feedforward analysis using a deep convolutional neural network with sparse convolution and feedback synthesis using a stack of LSTM modules. The network learns a representational hierarchy by minimizing its prediction errors of the incoming signals at each level of the hierarchy. We found that recurrent feedback in this network leads to the development of semantic clusters of global movement patterns in the population codes of the units at the lower levels of the hierarchy. These representations facilitate the learning of relationships among movement patterns, yielding state-of-the-art performance in long-range video sequence predictions on benchmark datasets. 
Without further tuning, this model automatically exhibits the neurophysiological correlates of visual sequence memories that we observed in the early visual cortex of awake monkeys, suggesting that the principle of self-supervised prediction learning might be relevant to understanding the cortical mechanisms of representational learning.", "full_text": "Visual Sequence Learning in Hierarchical Prediction Networks and Primate Visual Cortex\n\nJielin Qiu1, Ge Huang2, Tai Sing Lee1,2\n\n1 Computer Science Department\n\n{jielinq,taislee}@andrew.cmu.edu\n\n2 Neuroscience Institute\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nAbstract\n\nIn this paper we developed a computational hierarchical network model to understand\nthe spatiotemporal sequence learning effects observed in the primate visual\ncortex. The model is a hierarchical recurrent neural model that learns to predict\nvideo sequences using the incoming video signals as teaching signals. The model\nperforms fast feedforward analysis using a deep convolutional neural network with\nsparse convolution and feedback synthesis using a stack of LSTM modules. The\nnetwork learns a representational hierarchy by minimizing its prediction errors\nof the incoming signals at each level of the hierarchy. We found that recurrent\nfeedback in this network leads to the development of semantic clusters of global\nmovement patterns in the population codes of the units at the lower levels of the\nhierarchy. These representations facilitate the learning of relationships among movement\npatterns, yielding state-of-the-art performance in long-range video sequence\npredictions on benchmark datasets. 
Without further tuning, this model automatically\nexhibits the neurophysiological correlates of visual sequence memories that\nwe observed in the early visual cortex of awake monkeys, suggesting that the principle\nof self-supervised prediction learning might be relevant to understanding the\ncortical mechanisms of representational learning.\n\n1 Introduction\n\nWhile the hippocampus is known to play a critical role in encoding episodic memories, the storage of\nthese memories might ultimately rest in the sensory areas of the neocortex [1]. Indeed, a number of\nneurophysiological studies suggest that neurons throughout the hierarchical visual cortex, including\nthose in the early visual areas such as V1 and V2, might be encoding memories of object images [2]\nand of visual sequences in cell assemblies [3,4,5,6,7]. These memories, together with the generic\nstatistical priors encoded in receptive fields and connectivity of neurons, serve as internal models of\nthe world for predicting incoming visual experiences. However, it is not clear why early visual cortex\nneeds to be involved in the encoding of spatiotemporal memories and what computational roles it\nmight play.\nIn this paper, we explored a class of computational models based on predictive self-supervised\nlearning for understanding some neurophysiological learning phenomena observed in the early visual\ncortex. This class of models uses the incoming visual signals as teaching signals to train neural\nnetworks using backpropagation [8,9,10,11,12,13]. Recently, a number of hierarchical recurrent neural\nnetwork models based on this principle, notably PredNet [14] and PredRNN++ [15], have been\ndeveloped for video prediction with state-of-the-art performance. PredNet in particular was inspired\nby the predictive coding theory in neuroscience [16,17,13,18,19] and is a legitimate cortical model\nat a functional level. 
It learns an LSTM (long short-term memory) model at each level to predict the\nerrors made in an earlier level of the hierarchical network. It has been demonstrated to be effective\nin explaining the predictive suppression phenomena in the inferotemporal cortex [63]. However,\nPredNet only builds a hierarchical representation of errors, where the model of a higher layer learns\nto predict the prediction errors of the lower layer. It does not build a feature hierarchy. Thus, its\nability for long-range video prediction is rather limited. PredRNN++ does build a feature hierarchy,\nbut the generation of prediction is based on an auto-encoder-like feedforward network, albeit with local\nrecurrence within each layer. It does not explicitly model the recurrent feedback architecture of the\ncortex. Nor does it claim any neural plausibility.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWe propose the Hierarchical Prediction Network (HPNet) as an alternative functional model for the\nvisual cortex incorporating additional neural constraints. It draws on good features from both models.\nIt learns a feature hierarchy while using recurrent feedback to provide top-down synthesis of the\nexpectation at each level. In this paper, we will first demonstrate HPNet\u2019s effectiveness in video\nlearning and prediction, with performance superior to PredNet and comparable to PredRNN++,\nwhich is a state-of-the-art computer vision deep learning model for video prediction. Then we\nwill present novel findings from a neurophysiological experiment that demonstrates that the early visual\ncortex exhibited similar sensitivity to memories of global movement patterns, and that HPNet\ncan automatically account for the neurophysiological observations without further tuning. 
These\nfindings suggest that predictive self-supervised learning might be a relevant principle for learning\nrepresentational hierarchies in the visual cortex.\n\n2 Related works\n\nOur model integrates ideas from predictive coding models [21,17,14] and associative coding or\ninteractive activation models [22,23]. It is therefore also related to the classical ideas of analysis\nby synthesis [21,64], the counter stream model [62], and the hierarchical spatiotemporal memory\nmodel (HTM) [61]. In contrast to the earlier models that use a feedback path which synthesizes the\nexpectation using a linear transform, prediction is synthesized in HPNet (as well as PredNet) by an\nLSTM circuit at each level under feedback gating from higher levels.\nPredictive self-supervised learning has long been advocated as a plausible strategy the brain uses\nto learn internal representations [8]. Recently, thanks to the development of deep learning technology,\nself-supervised learning in computer vision [24,12,25,11,26,27] and video prediction\n[9,28,29,30,10,31,32] have become active areas of research. The large variety of models can\nbe roughly grouped into several categories: autoencoders, DCNN, hierarchy of LSTMs, adversarial\nnetworks [33,14,34,15,35], as well as variational autoencoders [36,37]. The state-of-the-art hierarchical\nmodel for video prediction at the time of writing is PredRNN++ [15]. Like PredNet and HPNet,\nPredRNN++ [15] consists of a stack of LSTM modules, but operates in a feedforward auto-encoder\narchitecture to generate the next video frame. It offers state-of-the-art performance for benchmark\nperformance evaluation, with documented comparisons to other approaches.\nBoth PredNet and HPNet provide recurrent feedback to the early layers of the network that are\nanalogous to early visual areas (V1, V2 and V4). 
HPNet in particular predicts that higher order\nsemantic information such as global movement pattern information might transform the population\ncodes in the early visual cortex, resulting in sensitivity to memory of global movement patterns in early\nvisual cortical neurons. Neurons in the inferotemporal cortex (IT) of monkeys are known to exhibit\nsensitivity to memories of predictable familiar image sequences relative to novel sequences [52,53,54],\nand some sensitivity to memories of grating sequences has been reported in V1 [3,4,5,6,7]. In\nthis study, we present novel neurophysiological findings demonstrating that early visual cortical\n(V2) neurons in awake monkeys also exhibit sensitivity to memories of natural movies of large\nspatial extent in the form of response suppression to familiar or predicted movies, consistent with the\nbehaviors of model neurons in HPNet.\n\n3 Hierarchical Prediction Network\n\n3.1 Cortical Modules\n\nHPNet is composed of a stack of Cortical Modules (CM). Each CM can be considered as a visual\narea along the ventral stream of the primate visual system, such as V1, V2, V4 and IT. We used four\nCortical Modules in our experiment. The network contains a fast feedforward path, instantiated by a\ndeep convolutional neural network (DCNN) that learns a representational hierarchy of features of\nsuccessive complexity, and a feedback path that mediates the synthesis of a prediction via a Long\nShort-Term Memory (LSTM) module at each level. The prediction is compared against the input\nsignal from the feedforward path at that level, and the prediction error is used to modulate the LSTMs\nat the same level as well as the level above.\n\nFigure 1: (a) Two successive layers of Cortical Modules in our hierarchical network. The input I_1 at\nthe bottom level is a spatiotemporal block of video frames. The * notation means a convolution along\nthat path. 2 \u2191 indicates an up-sampling or expansion operation. 
2 \u2193 means down-sampling or reduction\nin resolution. \u2296 indicates a comparator or subtraction operation; (b) The DCNN analysis path is\nimplemented in a sparsified convolution scheme to speed up bottom-up processing; (c) Detailed\nstructure of the standard LSTM used. C_t is the internal state, and H_t is the output. X is the external\ninput, which contains multiple sources in our model. (d) Frame-by-frame scheme; (e) Block-by-frame\nscheme; and (f) Block-by-block scheme, where the left and right parts indicate output and input,\nrespectively, with the middle indicating a 2D or 3D convolutional LSTM.\n\nFigure 1a shows two cascaded CMs. The feedforward path performs convolution (indicated by *) on\nthe input spatiotemporal block I_l with a kernel to produce R_l, where l indicates the CM level. R_l\nis then down-sampled to provide the input I_{l+1} for CM_{l+1} for another round of convolution along\nthe feedforward path. I_{l+1} also goes into LSTM_{l+1} (the LSTM in CM_{l+1}). At each level CM_l, the\nbottom-up input I_l is compared with the prediction P_l, which is generated from the interpretation\noutput H_l of LSTM_l. The prediction error signal is transformed by a convolution into E_l, which is\nfed back to both LSTM_l and LSTM_{l+1} to influence their generation of new interpretations H_l and\nH_{l+1}. To make the timing relationship clear in Algorithm 1, we use k to index a spatiotemporal\nblock in a block sequence, which is extracted from the video input sequence x_t with a stride s that\ncould vary from 1 to d, where d is the number of video frames contained in a block. 
At the bottom\ninput level, I_1^k = (x_{ks}, .., x_{ks+d}). LSTM_l at step k integrates the bottom-up feature input R_{l\u22121}^k, the\ntop-down feedback of the higher CM\u2019s LSTM\u2019s output H_{l+1}^k, and the prediction errors E_{l\u22121}^k and\nE_l^{k\u22121} to generate new hypothesis output H_l^k, which is then transformed into a new prediction P_l^k\n(see Algorithm 1 in Supplementary Information (SI)).\n\n3.2 Sparse Convolution, Spatiotemporal Blocks and 3D convolutional LSTM\n\nThe feedforward DCNN path in Figure 1a runs much faster if the input to each convolution layer\nis made sparse, as shown in [44]. A scheme has been proposed by [45,46,44] to speed up video\nprocessing in convolutional neural net by transmitting the first frame in its entirety, but for the\nsubsequent frames only the frame difference between them is transmitted. Hence, convolution can be\nperformed efficiently on the difference signals \u2206I_l^k = I_l^k \u2212 I_l^{k\u22121} between two consecutive blocks.\nIn the next layer, the resulting \u2206R_l^k is added back to the representation of the last time block R_l^{k\u22121}\nto reconstruct the actual higher order feature representation of the input signals R_l^k. This allows the\nnetwork to recover and maintain a full higher order representation R for computation in the next\nlayer while enjoying the benefit of fast computation on sparse data. The same set of sparse kernels\nwas used for processing both the first full frame and the subsequent temporal-difference frames.\n\nVisual neurons\u2019 receptive fields are spatiotemporal 3D kernels, rather than 2D. 
Therefore, we use\nspatiotemporal blocks of the input video sequence as input to the neurons in our neural network. The\nblock could slide in time with a temporal stride s as small as one frame or as large as the length of the\nspatiotemporal block d. To process 3D data, a 3D convolutional LSTM is used [20,35]. The details\nof our 3D convolutional LSTM are specified in the Supplementary Information (SI).\n\n3.3 Training and Loss Function\n\nThe entire network is trained by minimizing a loss function, which is the weighted sum of the L2\nprediction errors of all the Cortical Modules (CM),\n\nL_loss = \u03a3_k \u03bb_k \u03a3_l (\u03bb_l / n_l) \u03a3_{n_l} (I_l^k \u2212 P_l^k)^2    (1)\n\nwhere k indexes the spatiotemporal block sequence, l the CM level, and n_l the number of units\nin that level; \u03bb_k and \u03bb_l are weighting factors for time step and CM level, respectively. I_l^k is the kth\nspatiotemporal block input to the CM at level l, and P_l^k is the prediction at that level, following the\nvariables\u2019 notations above as well as in Figure 1.\n\nI_l^k = MaxPool(ReLU(R_{l\u22121}^k)) if l > 1;  x_t if l = 1\nP_l^k = ReLU(conv(H_l^k)) if l > 1;  SATLU(ReLU(conv(H_l^k))) if l = 1    (2)\n\nH_l^k = 3DconvLSTM(H_l^{k\u22121}, E_l^{k\u22121}, MaxPool(ReLU(R_{l\u22121}^k, E_{l\u22121}^k)), upsample(H_{l+1}^k))    (3)\n\n\u2206R_l^k = spconv(I_l^k \u2212 I_l^{k\u22121}),  E_l^k = spconv(I_l^k \u2212 P_l^k),  R_l^k = R_l^{k\u22121} + \u2206R_l^k    (4)\n\nwhere x_t is the video input sequence, H_l^k is the output of the LSTM, SATLU is a saturating non-linearity\nset at the maximum pixel value (SATLU(x; p_max) := min(p_max, x)), and spconv\nindicates sparse convolution. The algorithm is shown in Algorithm 1 in the Supplementary Material.\nFor hyperparameter tuning, we adapt PredNet\u2019s approach, performing a large grid search in hyperparameter\nspace. 
We did not try to find the best possible set of parameters, only a set of parameters that\nbeats the state of the art in video prediction. We did not tune our network or other networks in our\nsimulation of the neurophysiological experiments.\n\n4 Experimental Results\n\nIn this section, to establish computational competency in video prediction, we first evaluate the\nperformance of our model in video prediction using two benchmark datasets. We will then evaluate\nthe latent representations of the hierarchical network and compare the behaviors of the model units\nin the network with behaviors of the neurons in the visual cortex in a video sequence learning\nexperiment.\n\n4.1 Competency of the Model in Long-Range Video Sequence Prediction\n\nWe tested the network with two datasets which were also used as benchmark datasets in PredNet and\nPredRNN++: (1) synthetic sequences of the Moving-MNIST database (footnote 1) and (2) the KTH real-world\nhuman movement database (footnote 2).\n\n1 http://yann.lecun.com/exdb/mnist/\n2 http://www.nada.kth.se/cvap/actions/\n\nThe Moving-MNIST dataset contains synthetic video sequences with two handwritten digits bouncing\ninside a frame of 64\u00d764 pixels. Each sequence is 40 frames long, and its starting position, and hence\nthe speed and direction of the movements, are chosen uniformly at random in [3,4), as in [15]. This\nextraction process is repeated 15000 times, resulting in a training set of 10000 sequences, a validation\nset of 2000 sequences, and a testing set of 3000 sequences.\nThe KTH video database [49] contains 2391 real-world sequences of six human actions: walking,\njogging, running, boxing, hand waving, and hand clapping, performed by 25 subjects in four different\nscenarios. We divided video clips across all 6 action categories into a training set of 108717 sequences\n(persons #1-16) and a test set of 4086 sequences (persons #17-25) as was done in [15], except we\nextracted 40-frame sequences. 
We center-cropped each frame to a 120\u00d7120 square and then resized\nit to the input frame size of 64\u00d764.\nWe compared HPNet\u2019s video prediction performance, particularly for long-term prediction, with\nPredNet and PredRNN++. Because these two models work on a frame-to-frame basis to predict the\nnext frame based on all the existing frames, we tested two versions of our network for comparison:\n(1) Frame-to-Frame (F-F), where we set the spatiotemporal block size of our data to one frame (i.e.\nd = 1) and used 2D convLSTM instead of 3D convLSTM to predict the next frame based on the\ncurrent and past input frames; (2) Block-to-Block (B-B), our default model using a spatiotemporal\nblock as the data unit, where the next spatiotemporal block (d = 5, s = 5) was predicted from the current\nspatiotemporal block.\nWe trained all four networks using 40-frame sequences extracted from the two databases in the same\nway as described in [14,15]. We then compared their performance in predicting the next 20 frames\nwhen only the first 20 frames were given. The test sequences were drawn from the same dataset\nbut were not in the training set. To predict future frames by dead-reckoning when input was no longer\navailable, PredNet and PredRNN++ simply took the predicted frame and fed it into the network as the\ninput frame to generate the prediction of the next time step. All models tested have four modules (layers).\nBoth versions of our models and PredNet used the same number of feature channels in each layer,\noptimized by grid search, i.e. (16,32,64,128) for the Moving-MNIST dataset, and (24,48,96,192)\nfor the KTH dataset. For PredRNN++, we used the same architecture and feature channel numbers\nprovided by [15]. All kernel sizes are either 3\u00d73 (for F-F) or 3\u00d73\u00d73 (for B-B) for all four models.\nThe input image frame\u2019s spatial resolution is 64\u00d764. The models were trained and tested on GeForce\nGTX TITAN X GPUs. 
We evaluated the prediction performance based on two quantitative indices:\nMean-Squared Error (MSE) and the Structural Similarity Index Measure (SSIM) [48] of the last 20\nframes between the predicted frames and the actual frames. The values of SSIM range from -1 to 1,\nwith a larger value indicating greater similarity between the predicted frames and the actual frames.\n\nFigure 2: Left panel: Video prediction results on the Moving-MNIST dataset, where the first row to\nlast row are ground truth (GT), results from two different versions of HPNet (block-to-block (B-B),\nframe-to-frame (F-F)), PredNet, and PredRNN++, respectively. k=1 to k=19 are predicted frames of\nthe models when the input frames were available. k=21 to k=39 are the \"dead-reckoning\" predicted\nframes of the model when there is no input. Right panel: Comparison of the prediction results of the\nfour models for the Moving-MNIST dataset on the last 20 frames in structural similarity measures\n(SSIM).\n\nFigure 2 and Table 1 show the performance of the four models on the Moving-MNIST dataset. Figure\n3 and Table 2 show the performance of the four models on the KTH dataset. In the examples shown\nin Figure 2 and Figure 3, each test sequence has 40 frames but we show the results every two frames.\nActual input was provided only for the first 20 frames, where each frame or block of frames was\npredicted based on the previous frame or previous block of input. The predictions for the last 20\nframes (frame 21 to frame 40) were dead-reckoning predictions. The left panels of both figures show\nexamples of the predicted sequences generated by the four models. The right panels compare the\nperformance of the four models during the last 20 dead-reckoning frames.\nFor both the synthetic and real-world datasets, HPNet using the Block-Block scheme consistently\nyields the best performance. 
HPNet in the Frame-Frame scheme performs better than PredNet, suggesting\nthat a feature hierarchy is better than a prediction error hierarchy for long-range prediction. However,\nHPNet in the Frame-Frame scheme does not perform as well as PredRNN++ on long-range video\nprediction. The superiority of PredRNN++ in this case is likely because its LSTM at each level\nis boosted to have access to information from all layers below, rather than from just that layer as\nin PredNet and HPNet. Hence, it took longer to train but can potentially encode richer movement\npatterns in its memories for long-term prediction.\n\nFigure 3: Left panel: Video prediction results on the KTH dataset, where the first row to last\nrow are ground truth (GT), results from block-to-block (B-B), frame-to-frame (F-F), PredNet, and\nPredRNN++, respectively, same format as Figure 2. Right panel: Comparison of the prediction\nresults of the four models for the KTH dataset on the last 20 frames in structural similarity measures\n(SSIM).\n\nHPNet might have compensated by using 3D data\nblocks with 3D convolutional LSTM to achieve better performance. 
Given that every area in the\nvisual cortex has some recurrent connections to many lower areas (levels) in addition to the adjacent\nlevel in the hierarchy, it would be reasonable to incorporate that feature of PredRNN++ to see if\nfurther improvement can be obtained.\n\nTable 1: Comparison results of different methods on the Moving-MNIST dataset for the long time\nprediction experiment.\n\nMethod | SSIM | MSE\nOurs (B-B) | 0.915 | 65.2\nCM+ConvLSTM (F-F) | 0.692 | 89.5\nPredNet [14] | 0.658 | 101.2\nPredRNN++ [15] | 0.872 | 69.4\n\nTable 2: Comparison results of different methods on the KTH dataset for the long time prediction\nexperiment.\n\nMethod | SSIM | MSE\nOurs (B-B) | 0.882 | 80.3\nCM+ConvLSTM (F-F) | 0.701 | 103.4\nPredNet [14] | 0.656 | 108.9\nPredRNN++ [15] | 0.865 | 86.7\n\nIt should be noted that PredNet, because of the sparse nature of the prediction errors, is very fast\nto train (8 hours in our cluster); HPNet in the frame-frame scheme took 9.3 hours, while PredRNN++\nand HPNet (B-B) took 10.6 hours and 11.8 hours to train, respectively. Sparsifying the feedforward\ncomputation in HPNet alone reduces the training time of HPNet by 13%. One might expect some additional\nsaving if the LSTM\u2019s representations are also sparsified.\n\n4.2 Evaluation of the Latent Representation in the Hierarchy\n\nTo understand how the recurrent feedback might have changed the hierarchical representation of\nHPNet, we trained the HPNet networks in the Block-to-Block (B-B) scheme with variable numbers of\nmodules. First, we found that adding cortical modules tends to improve performance. 
Second, when\nwe used t-SNE [50] to visualize the representation R in the different modules of the networks for the\nlast 20 dead-reckoning predicted frames of 600 testing sequences belonging to the six movements in\nthe KTH dataset, we found that adding higher modules leads to the formation of more distinct clusters\nof global movement patterns in the representation units of the lower modules (Figure 4a versus Figure\n4d&e; Figure 4c versus Figure 4e&h). Better encoding of these global movement patterns in the\npopulation codes of the earlier modules manifests in the improvement of their accuracy of decoding\nthe six movement patterns.\nTable 3 compares the accuracy of decoding the six movement patterns based on the representations\nat four different layers (modules) of HPNet, PredNet and PredRNN++. Chance accuracy is 16%.\nDecoding accuracy of PredNet is close to chance because it lacks hierarchical feature representations.\nDecoding accuracy of PredRNN++ peaks at layers (modules) 2 and 3 at about 30%. For HPNet,\nsemantic clustering and decoding accuracy improve progressively as one moves up the hierarchy,\nfrom 26% in the first layer (module) to 63% for the top module of the 4-module network. Thus, the\nbetter performance of HPNet in long-range video predictions might be attributed to its ability to\nlearn semantically meaningful hierarchical spatiotemporal feature representations and movement-to-movement\nrelationships (see also [51]).\n\nFigure 4: Visualization of R representational units of the different modules in (a) a one-module\nnetwork; (b)-(c) a two-module network; (d)-(f) a three-module network; and (g)-(j) a four-module\nnetwork. 
\"Module 2_1\" means Module 1 in a two-module network.\n\nTable 3: Models\u2019 decoding results of six movement classes in the KTH dataset based on representations\nin the different layers of the network (mean decoding accuracy).\n\nModel | Layer 1 | Layer 2 | Layer 3 | Layer 4\nHPNet | 0.26 | 0.41 | 0.57 | 0.63\nPredNet | 0.19 | 0.18 | 0.16 | 0.16\nPredRNN++ | 0.19 | 0.30 | 0.28 | 0.18\n\n4.3 Visual Sequence Learning in the Visual Cortex\n\nThe recurrent feedback in HPNet allows the representational (R) units in even the lower Cortical\nModules to develop sensitivity to global movement and image patterns, despite these units\u2019 very\nlocalized receptive fields (Figure 4). Assuming HPNet is a plausible cortical model at least at a\nfunctional level, it predicts that neurons in the early visual areas such as V1 and V2 would exhibit\nsensitivity to the memory of global movement patterns.\nTo test this prediction, we performed a video learning neurophysiological experiment on V2 neurons\nin two awake behaving monkeys with Gray-Matter semi-chronic multielectrode arrays (SC32 and\nSC96) implanted over their V1 operculum (see footnote 3). Six experiments were carried out. Each lasted over 7\ndays, with daily recording sessions. In each daily session, we presented a set of 20 movie clips, each\n800 ms long, to the monkey, 20-25 times a day, so that over time, this set of movies became familiar\n(and predictable) to the monkey. This set is called the Predicted set or Familiar set. Every day,\nwe also tested another set of 20 movie clips that were different daily. These sets are called the\nUnpredicted sets or Novel sets. 
Both sets of movies (through an aperture of 8\u00b0 in diameter over\nall the receptive fields of recorded neurons) were presented daily, one clip per trial, at the same\nlocation on the computer monitor relative to the red spot the monkeys fixated on during each trial.\nWith this experimental paradigm, we can compute and compare the daily temporal responses (PSTH\nor peri-stimulus time histogram) of all the neurons across all the movies in the Predicted set and in the\nUnpredicted set to monitor the development of sensitivity to memory of the familiar or predicted\nmovies.\n\n3 All experimental procedures were approved by Carnegie Mellon University\u2019s Institutional Animal Care and Use\nCommittee, in compliance with the guidelines set forth in the United States Public Health Service Guide for the\nCare and Use of Laboratory Animals.\n\nSince we were averaging across many neurons (more than 30 per session) with different feature preferences\nor tuning properties over 20 different movies, the average PSTH responses (across neurons and\nmovies) were expected to be the same for the Predicted set and the daily Unpredicted set. Indeed,\nwe found the averaged PSTHs for the Predicted set and the Unpredicted set to be the same for the\nfirst two or three days (Figure 5b (top row)), but they started to bifurcate at the later part of their\nresponses in subsequent days (Figure 5b (bottom row)) with suppression for the Predicted Movies.\nSimilar predictive suppression effects have been observed in IT neurons, and were considered as\na biomarker for sequence memory. What is novel and unanticipated about our finding is that V2\nneurons\u2019 receptive fields are very local, yet this \"memory effect\" depends on the presence of the\nglobal context of the entire movie \u2013 reducing the movie aperture from 8\u00b0 to 3\u00b0, barely larger than the\nreceptive fields of the individual neurons, would abolish the predictive suppression effects. 
Thus,\nthese predictive suppression effects were not due to adaptation of local receptive field features, but\nreflected a sensitivity to the memory of the global context of the movies or movement. Such memories\nare likely mediated by horizontal or recurrent feedback mechanisms or both. To better assess the\nevolution of the memory effect of the neuronal populations, we computed the predictive suppression\nindex for each individual neuron as (P \u2212 U)/(P + U), where P and U are the daily average spike\ncounts of the neuron in the later part of responses for the Predicted set and the Unpredicted set,\nrespectively. Figure 5a traces the development of this predictive suppression effect in one experiment,\nshowing that most neurons exhibit predictive suppression on average after 3 days of exposure\nto the movies.\n\nFigure 5: (a) The development of the prediction suppression effect across days in one experiment.\nEach dot is the prediction suppression index of a neuron. Color indicates whether the effect was\nsignificant or not (red - significant, blue - insignificant, green - significant in the opposite way) based\non t-test with p < 0.05 as statistical significance threshold. (b) Averaged temporal responses of the\nV2 neurons (averaged across 3 experiments) of one monkey to the Predicted set and the Unpredicted sets\nin the first two days (top row), and their averaged responses from day 5 to day 12 to the Predicted\nset and the Unpredicted sets (bottom row), exhibiting significant prediction suppression. (c)-(h) Module 2\u2019s normalized\naveraged population responses of the three types of units to the Predicted set and the Unpredicted set\nbefore (c)-(e) and after (f)-(h) training.\n\nWe performed a similar experiment on our network, pre-trained with the KTH dataset. We randomly\nextracted 20 sequences from the BAIR dataset [65], resized the sequence length to 40 frames and\nframe size to 64\u00d764. 
We separated the 20 sequences into two sets \u2013 the Predicted set and the\nUnpredicted set. We averaged the responses to the two movie sets respectively of each type of\nneurons in the network (E (prediction error units), P (prediction units), and R (representation units))\nin each CM within the center 8\u00d78 hypercolumns. Before familiarity training, the responses of each\ntype of neurons are indeed the same for both movie sets (top row). After the network was trained with\nthe Predicted set for 2000 epochs, a prediction suppression effect can be observed in all three types\nof neurons (Prediction error neurons E, Representation neurons R, Prediction neurons P) in all the\nmodules in the hierarchy, with the higher modules showing a stronger effect (see the Supplementary\nMaterial). Figure 5 (c)-(h) show the effects in CM2, corresponding roughly to V2 in the hierarchy.\nIt is not surprising that the prediction error neurons E would decrease their responses as the network\nlearns to predict the familiar movies better. However, it is rather interesting to find that the representation\nneurons R and the prediction neurons P also exhibit prediction suppression, even though these\nneurons represent features rather than prediction errors. The precise reasons remain to be determined,\nbut the fact that all neuron types in the model exhibited the prediction suppression effect might explain\nwhy the prediction suppression effects were commonly observed in most of the randomly sampled\nneurons in the visual cortex. We also performed the same experiment on PredNet and PredRNN++\nand found that their corresponding R neurons would also exhibit a predictive suppression effect to a\ncertain extent, but much smaller in magnitude. Thus, we can only claim that the neurophysiological\nfinding is consistent with this genre of self-supervised predictive learning models, though HPNet\nmight be a better approximation. 
Further experiments and analysis are needed to obtain a better model\napproximation of the cortical mechanisms. Given that PredRNN++ has no explicit feedback between\nLSTMs in the hierarchy, the fact that a small predictive suppression effect can be observed in\nthe R units of its layer 2 suggests that at least part of the suppression effect is mediated by horizontal\nrecurrent connections and is then further enhanced by feedback.\nAn important question is whether the observed video prediction suppression effect in macaques might\narise from the same mechanism underlying the static image familiarity suppression effects that\nhave long been observed in the inferotemporal cortex [53,55,56] and recently in V2 [2] of macaque\nmonkeys. The experimental paradigms are similar to the video prediction experiment we presented,\nexcept that static images rather than videos were used. As in our video prediction experiments,\nafter several days of exposure training, neural responses to the familiar images were\nsignificantly suppressed relative to the novel images in the later part of their responses in both IT\n[53,55,56] and V2 [2], with temporal response profiles (PSTH) very similar to the ones shown\nin Figure 5b. It is important to note that neurons in monkey V1 and V2 have very small receptive\nfields, yet they show familiarity suppression effects to object images much larger than\ntheir classical receptive fields; in contrast, there is no difference between the averaged\nresponses to familiar and novel images when the stimuli are presented in an aperture slightly larger than the\nreceptive fields of the neurons [2]. Together with the observation that the latency of the familiarity\nsuppression effect was 100 ms after stimulus onset, these findings implicate the encoding of global\nimage memories in local recurrent circuits within each visual area. 
We found that HPNet also\nexhibited the static image familiarity suppression effect when tested with static images, suggesting that the\nstatic image familiarity suppression effect and the prediction suppression effect can both arise from the\nsame network mechanisms. This allows us to investigate whether other deep networks trained for\nImageNet image classification also exhibit familiarity suppression effects. We evaluated several\nsupervised-learning models: standard VGG [66] and a recurrent network called the deep predictive\ncoding network (PCN) [67]. PCN, also inspired by predictive coding theory, used prediction error\nminimization to derive a recurrent network architecture to be trained by backpropagation, but did not\ninclude prediction error minimization in its objective function. Remarkably, it can achieve, with only 9 layers, ImageNet\nclassification performance comparable to a 100+-layer ResNet [68]. Nevertheless, neither VGG\u2019s nor\nPCN\u2019s neurons exhibited the static image familiarity effect. We also experimented with\na recurrent network [69] that our group developed with biologically inspired top-down and horizontal\nrecurrent connections between different stages of VGG16. While this network yields improved\nrobustness against noise and occlusion in image classification, it also does not produce the static\nimage familiarity effect. These findings suggest that having prediction error minimization in the\nobjective function might be an important factor for the emergence of prediction suppression or image\nfamiliarity suppression effects.\n\n5 Conclusion\n\nIn this paper, we report a novel neurophysiological finding suggesting that repeated exposure can induce\nencoding of global video memory in the early visual cortex of primates. We\nshow that a class of self-supervised prediction learning models also exhibits a similar neurophysiological\nphenomenon. 
In developing the proposed hierarchical prediction network (HPNet), we found,\nfirst, that sparse coding can considerably speed up computation and learning; second, that processing videos\nby spatiotemporal blocks rather than frame by frame as other models do allows the LSTMs to learn relationships\nbetween movement patterns; and third, that having a feature hierarchy allows explicit encoding of\nincreasingly complex movement patterns, yielding significant improvement in long-range video\nprediction. We found that the recurrent feedback produced better semantic clustering of movement\npatterns in the early levels of the hierarchical representations of HPNet, resulting in better movement\npattern decoding and action recognition, even though the network was not trained to do that. To\ncompare with representations in the visual cortex in finer detail, HPNet will need to\nbe trained with more complex and naturalistic movies. Taken together, the findings in this paper\nsuggest the relevance of a class of predictive learning models for understanding the principles and\nmechanisms of learning and computation in the visual cortex.\n\nAcknowledgments\n\nThis work is supported by the National Science Foundation (1816568). We thank Yimeng Zhang, Mert\nInan, Harold Rockwell, Siming Yan, Maureen Kelly and David Pane for their technical assistance.\n\nReferences\n\n[1] James L. McClelland & Bruce L. McNaughton. (1999) Complementary learning systems 1 why there are\ncomplementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of\nconnectionist models of learning and memory.\n[2] Huang, G., Ramachandran, S., Lee, T. S., & Olson, C. R. (2018) Neural correlate of visual familiarity in\nmacaque area V2. The Journal of neuroscience: the official journal of the Society for Neuroscience.\n[3] Yao, H., Shi, L., Han, F., Gao, H., & Dan, Y. 
(2007) Rapid learning in cortical coding of visual scenes.\nNature Neuroscience, 10:772\u2013778.\n[4] Han, F., Caporale, N., & Dan, Y. (2008) Reverberation of recent visual experience in spontaneous cortical\nwaves. Neuron, 60:321\u2013327.\n[5] Xu, S., Jiang, W., Poo, M.-M., & Dan, Y. (2012) Activity recall in visual cortical ensemble. In Nature\nNeuroscience.\n[6] Cooke, S. F. & Bear, M. F. (2014) How the mechanisms of long-term synaptic potentiation and depression\nserve experience-dependent plasticity in primary visual cortex. Philosophical transactions of the Royal Society\nof London. Series B, Biological sciences, 369 1633:20130284.\n[7] Cooke, S. F. & Bear, M. F. (2015) Visual recognition memory: a view from v1. Current opinion in\nneurobiology, 35: 57\u201365.\n[8] Elman, J. L. (1990) Finding structure in time. Cognitive Science, 14:179\u2013211.\n[9] Mathieu, M., Couprie, C., & LeCun, Y. (2015) Deep multiscale video prediction beyond mean square error.\nCoRR, abs/1511.05440.\n[10] Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., & Lee, H. (2017) Learning to generate long-term future\nvia hierarchical prediction. In ICML.\n[11] Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015) Unsupervised learning of video representations\nusing lstms. In ICML.\n[12] O\u2019Reilly, R. C., Wyatte, D., & Rohrlich, J. (2014) Learning through time in the thalamocortical loops.\n[13] Lee, T. S. (2015) The visual system\u2019s internal models of the world. Proceedings of the IEEE, 103:1359\u20131378.\n[14] Lotter, W., Kreiman, G., & Cox, D. D. (2016) Deep predictive coding networks for video prediction and\nunsupervised learning. CoRR, abs/1605.08104.\n[15] Wang, Y., Gao, Z., Long, M., Wang, J., & Yu, P. S. (2018) Predrnn++: Towards a resolution of the\ndeep-in-time dilemma in spatiotemporal predictive learning. In ICML.\n[16] Mumford, D. (1991) On the computational architecture of the neocortex. Biological Cybernetics,\n65:135\u2013145.\n[17] Rao, R. P. N. 
& Ballard, D. H. (1999) Predictive coding in the visual cortex: a functional interpretation of\nsome extraclassical receptive-field effects. Nature neuroscience, 2 1:79\u201387.\n[18] Dijkstra, N., Zeidman, P., Ondobaka, S., van Gerven, M. A. J., & Friston, K. J. (2017) Distinct top-down\nand bottom-up brain connectivity during visual perception and imagery. In Scientific Reports.\n[19] Friston, K. J. (2018) Does predictive coding have a future? Nature Neuroscience, 21:1019\u20131021.\n[20] Choy, C. B., Xu, D., Gwak, J., Chen, K., & Savarese, S. (2016) 3d-r2n2: A unified approach for single and\nmulti-view 3d object reconstruction. In ECCV.\n[21] Mumford, D. (1992) On the computational architecture of the neocortex. ii. the role of cortico-cortical\nloops. Biological cybernetics, 66 3:241\u201351.\n[22] McClelland, J. L. & Rumelhart, D. E. (1985) Distributed memory and the representation of general and\nspecific information. Journal of experimental psychology. General, 114 2:159\u201397.\n[23] Grossberg, S. (1987) Competitive learning: From interactive activation to adaptive resonance. Cognitive\nScience, 11:23\u201363.\n[24] Palm, R. B. (2012) Prediction as a candidate for learning deep hierarchical models of data.\n[25] Goroshin, R., Mathieu, M., & LeCun, Y. (2015) Learning to linearize under uncertainty. In NIPS.\n[26] Patraucean, V., Handa, A., & Cipolla, R. (2015) Spatio-temporal video autoencoder with differentiable\nmemory. CoRR, abs/1511.06309.\n[27] Vondrick, C., Pirsiavash, H., & Torralba, A. (2016) Generating videos with scene dynamics. In NIPS.\n[28] Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., & Kavukcuoglu,\nK. (2017) Video pixel networks. In ICML.\n[29] Xu, Z., Wang, Y., Long, M., & Wang, J. (2018) Predcnn: Predictive learning with cascade convolutions. In\nIJCAI.\n[30] Oh, J., Guo, X., Lee, H., Lewis, R. L., & Singh, S. P. 
(2015) Action-conditional video prediction using deep\nnetworks in atari games. In NIPS.\n[31] Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., & Levine, S. (2018) Stochastic adversarial video\nprediction. CoRR, abs/1804.01523.\n[32] Wichers, N., Villegas, R., Erhan, D., & Lee, H. (2018) Hierarchical long-term video prediction without\nsupervision. In ICML.\n[33] Finn, C., Goodfellow, I. J., & Levine, S. (2016) Unsupervised learning for physical interaction through\nvideo prediction. In NIPS.\n[34] Wang, Y., Long, M., Wang, J., Gao, Z., & Yu, P. S. (2017) Predrnn: Recurrent neural networks for predictive\nlearning using spatiotemporal lstms. In NIPS.\n[35] Wang, Y., Jiang, L., Yang, M.-H., Li, L.-J., Long, M., & Fei-Fei, L. (2019) Eidetic 3d lstm: A model for\nvideo prediction and beyond. In ICLR.\n[36] Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., & Levine, S. (2017) Stochastic variational video\nprediction. CoRR, abs/1710.11252.\n[37] Denton, E. L. & Fergus, R. (2018) Stochastic video generation with a learned prior. In ICML.\n[38] Ullman, S. (1995) Sequential seeking and counter streams: a computational model for bidirectional flow in\nthe visual cortex. Cerebral Cortex, 5:1:1\u20131.\n[39] Lee, T. S. & Mumford, D. (2003) Hierarchical bayesian inference in the visual cortex. Journal of the Optical\nSociety of America. A, Optics, image science, and vision, 20 7:1434\u201348.\n[40] Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995) The helmholtz machine. Neural Computation,\n7:889\u2013904.\n[41] Kersten, D. J. & Yuille, A. L. (2003) Bayesian models of object perception. Current opinion in neurobiology,\n13 2:150\u20138.\n[42] Nayebi, A., Bear, D., Kubilius, J., Kar, K., Ganguli, S., Sussillo, D., DiCarlo, J. J., & Yamins, D. L. K.\n(2018) Task-driven convolutional recurrent models of the visual system. CoRR, abs/1807.00053.\n[43] Wen, H., Han, K., Shi, J., Zhang, Y., Culurciello, E., & Liu, Z. 
(2018) Deep predictive coding network for\nobject recognition. In ICML.\n[44] Pan, B., Lin, W., Fang, X., Huang, C., Zhou, B., & Lu, C. (2018) Recurrent residual module for fast\ninference in videos. CoRR, abs/1802.09723.\n[45] Liu, X., Pool, J., Han, S., & Dally, W. J. (2017) Efficient sparse-winograd convolutional neural networks.\nCoRR, abs/1802.06367.\n[46] Dave, A., Russakovsky, O., & Ramanan, D. (2017) Predictive-corrective networks for action detection. 2017\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2067\u20132076.\n[47] Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., & Woo, W.-C. (2015) Convolutional lstm\nnetwork: A machine learning approach for precipitation nowcasting. In NIPS.\n[48] Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004) Image quality assessment: from error\nvisibility to structural similarity. IEEE Transactions on Image Processing, 13:600\u2013612.\n[49] Schuldt, C., Laptev, I., & Caputo, B. (2004) Recognizing human actions: a local SVM approach. International Conference on Pattern\nRecognition, 2004. ICPR 2004., 3:32\u201336 Vol.3.\n[50] van der Maaten, L. & Hinton, G. E. (2008) Visualizing data using t-sne.\n[51] Kheradpisheh, S. R., Ganjtabesh, M., Thorpe, S. J., & Masquelier, T. (2018) Stdp-based spiking deep\nconvolutional neural networks for object recognition. Neural networks: the official journal of the International\nNeural Network Society, 99:56\u201367.\n[52] Meyer, T. & Olson, C. R. (2011) Statistical learning of visual transitions in monkey inferotemporal cortex.\nProceedings of the National Academy of Sciences of the United States of America, 108 48:19401\u20136.\n[53] Meyer, T., Walker, C., Cho, R. Y., & Olson, C. R. (2014) Image familiarization sharpens response dynamics\nof neurons in inferotemporal cortex. Nature Neuroscience, 17:1388\u20131394.\n[54] Ramachandran, S., Meyer, T., & Olson, C. R. 
(2017) Prediction suppression and surprise enhancement in\nmonkey inferotemporal cortex. Journal of neurophysiology, 118 1:374\u2013382.\n[55] Freedman, D. J. & Assad, J. A. (2006) Experience-dependent representation of visual categories in parietal\ncortex. Nature, 443:85\u201388.\n[56] Mruczek, R. E. B. & Sheinberg, D. L. (2007) Context familiarity enhances target processing by inferior\ntemporal cortex neurons. The Journal of neuroscience: the official journal of the Society for Neuroscience, 27\n32:8533\u201345.\n[58] Yao, H., Shi, L., Han, F., Gao, H., & Dan, Y. (2007) Rapid learning in cortical coding of visual scenes.\nNature Neuroscience, 10:772\u2013778.\n[59] Han, F., Caporale, N., & Dan, Y. (2008) Reverberation of recent visual experience in spontaneous cortical\nwaves. Neuron, 60:321\u2013327.\n[60] Xu, S., Jiang, W., Poo, M.-M., & Dan, Y. (2012) Activity recall in visual cortical ensemble. In Nature\nNeuroscience.\n[61] Hawkins, J. & George, D. (2006) Hierarchical temporal memory concepts, theory, and terminology.\n[62] Ullman, S. (1995) Sequential seeking and counter streams: a computational model for bidirectional flow in\nthe visual cortex. Cerebral Cortex, 5:1:1\u20131.\n[63] Lotter, W., Kreiman, G., & Cox, D. (2018) A neural network trained to predict future video frames mimics\ncritical properties of biological neuronal responses and perception. In NIPS.\n[64] Lee, T. S. & Mumford, D. (2003) Hierarchical bayesian inference in the visual cortex. Journal of the\nOptical Society of America. A, Optics, image science, and vision, 20 7:1434\u201348.\n[65] Ebert, F., Finn, C., Lee, A., & Levine, S. (2017) Self-supervised visual planning with temporal skip\nconnections. In Conference on Robot Learning (CoRL).\n[66] Simonyan, K. & Zisserman, A. 
(2014) Very deep convolutional networks for large-scale image recognition.\narXiv:1409.1556.\n[67] Han, K., Wen, H., Fu, D., Culurciello, E., & Liu, Z. (2018) Deep predictive coding network with local\nrecurrent processing for object recognition. In Proceedings of the 32nd International Conference on\nNeural Information Processing Systems (NIPS 2018), pp. 9221\u20139233. Montreal, Canada.\n[68] He, K., Zhang, X., Ren, S., & Sun, J. (2016) Deep residual learning for image recognition. In Proceedings\nof the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770\u2013778.\n[69] Yan, S., Fang, X., Xiao, B., Rockwell, H., Zhang, Y., & Lee, T. S. (2019) Recurrent feedback improves\nfeedforward representations in deep neural networks. arXiv.\n", "award": [], "sourceid": 1532, "authors": []}