{"title": "Tagger: Deep Unsupervised Perceptual Grouping", "book": "Advances in Neural Information Processing Systems", "page_first": 4484, "page_last": 4492, "abstract": "We present a framework for efficient perceptual inference that explicitly reasons about the segmentation of its inputs and features.  Rather than being trained for any specific segmentation, our framework learns the grouping process in an unsupervised manner or alongside any supervised task. We enable a neural network to group the representations of different objects in an iterative manner through a differentiable mechanism.  We achieve very fast convergence by allowing the system to amortize the joint iterative inference of the groupings and their representations.  In contrast to many other recently proposed methods for addressing multi-object scenes, our system does not assume the inputs to be images and can therefore directly handle other modalities. We evaluate our method on multi-digit classification of very cluttered images that require texture segmentation. Remarkably our method achieves improved classification performance over convolutional networks despite being fully connected, by making use of the grouping mechanism. Furthermore, we observe that our system greatly improves upon the semi-supervised result of a baseline Ladder network on our dataset. These results are evidence that grouping is a powerful tool that can help to improve sample efficiency.", "full_text": "Tagger: Deep Unsupervised Perceptual Grouping\n\nKlaus Greff*, Antti Rasmus, Mathias Berglund, Tele Hotloo Hao,\n\nJ\u00fcrgen Schmidhuber*, Harri Valpola\n\nThe Curious AI Company {antti,mathias,hotloo,harri}@cai.fi\n\n*IDSIA {klaus,juergen}@idsia.ch\n\nAbstract\n\nWe present a framework for ef\ufb01cient perceptual inference that explicitly reasons\nabout the segmentation of its inputs and features. 
Rather than being trained for\nany specific segmentation, our framework learns the grouping process in an unsupervised\nmanner or alongside any supervised task. We enable a neural network to\ngroup the representations of different objects in an iterative manner through a differentiable\nmechanism. We achieve very fast convergence by allowing the system\nto amortize the joint iterative inference of the groupings and their representations.\nIn contrast to many other recently proposed methods for addressing multi-object\nscenes, our system does not assume the inputs to be images and can therefore directly\nhandle other modalities. We evaluate our method on multi-digit classification\nof very cluttered images that require texture segmentation. Remarkably, our method\nachieves improved classification performance over convolutional networks despite\nbeing fully connected, by making use of the grouping mechanism. Furthermore,\nwe observe that our system greatly improves upon the semi-supervised result of a\nbaseline Ladder network on our dataset. These results are evidence that grouping\nis a powerful tool that can help to improve sample efficiency.\n\n1 Introduction\n\nHumans naturally perceive the world as being structured into different\nobjects, their properties and relation to each other. This phenomenon,\nwhich we refer to as perceptual grouping, is also known as amodal\nperception in psychology. It occurs effortlessly and includes a segmentation\nof the visual input, such as that shown in Figure 1. This\ngrouping also applies analogously to other modalities, for example\nin solving the cocktail party problem (audio) or when separating the\nsensation of a grasped object from the sensation of fingers touching\neach other (tactile). Even more abstract features such as object class,\ncolor, position, and velocity are naturally grouped together with the\ninputs to form coherent objects. 
This rich structure is crucial for many\nreal-world tasks such as manipulating objects or driving a car, where\nawareness of different objects and their features is required.\nIn this paper, we introduce a framework for learning ef\ufb01cient itera-\ntive inference of such perceptual grouping which we call iTerative\nAmortized Grouping (TAG). This framework entails a mechanism for\niteratively splitting the inputs and internal representations into several\ndifferent groups. We make no assumptions about the structure of this\nsegmentation and rather train the model end-to-end to discover which\nare the relevant features and how to perform the splitting.\n\nFigure 1: An example of per-\nceptual grouping for vision.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 2: Left: Three iterations of the TAG system which learns by denoising its input using several\ngroups (shown in color). Right: Detailed view of a single iteration on the TextureMNIST1 dataset.\nPlease refer to the supplementary material for further details.\n\nBy using an auxiliary denoising task we train the system to directly amortize the posterior inference\nof the object features and their grouping. Because our framework does not make any assumptions\nabout the structure of the data, it is completely domain agnostic and applicable to any type of data.\nThe TAG framework works completely unsupervised, but can also be combined with supervised\nlearning for classi\ufb01cation or segmentation.\n\n2\n\nIterative Amortized Grouping (TAG)1\n\nGrouping. Our goal is to enable neural networks to split inputs and internal representations into\ncoherent groups. We de\ufb01ne a group to be a collection of inputs and internal representations that are\nprocessed together, but (largely) independent of each other. 
By processing each group separately, the\nnetwork can make use of invariant distributed features without the risk of interference and ambiguities,\nwhich might arise when processing everything in one clump. We make no assumptions about the\ncorrespondence between objects and groups. If the network can process several objects in one group\nwithout unwanted interference, then the network is free to do so. The \u201ccorrect\u201d grouping is often\ndynamic, ambiguous and task dependent. So rather than training it as a separate task, we allow the\nnetwork to split the processing of the inputs, and let it learn how to best use this ability for any given\nproblem. To make the task of instance segmentation easy, we keep the groups symmetric in the sense\nthat each group is processed by the same underlying model.\n\nAmortized Iterative Inference. We want our model to reason not only about the group assignments\nbut also about the representation of each group. This amounts to inference over two sets of variables:\nthe latent group assignments and the individual group representations, a formulation very similar to\nmixture models, for which exact inference is typically intractable. For these models it is a common\napproach to approximate the inference in an iterative manner by alternating between (re-)estimation\nof these two sets (e.g., EM-like methods [4]). The intuition is that given the grouping, inferring the\nobject features becomes easy, and vice versa. We employ a similar strategy by allowing our network\nto iteratively refine its estimates of the group assignments as well as the object representations.\nRather than deriving and then running an inference algorithm, we train a parametric mapping to arrive\nat the end result of inference as efficiently as possible [9]. 
This is known as amortized inference [31],\nand it is used, for instance, in variational autoencoders where the encoder learns to amortize the\nposterior inference required by the generative model represented by the decoder. Here we instead\napply the framework of denoising autoencoders [6, 15, 34] which are trained to reconstruct original\ninputs x from corrupted versions \u02dcx. This encourages the network to implement useful amortized\nposterior inference without ever having to specify or even know the underlying generative model\nwhose inference is implicitly learned.\n\n1 Note: This section only provides a short and high-level overview of the TAG framework and Tagger. For\na more detailed description please refer to the supplementary material or the extended version of this paper:\nhttps://arxiv.org/abs/1606.06724\n\n[Figure 2 diagram: the parametric mapping unrolled over iterations 1 to 3, with input \u02dcx, group estimates z and m, modelling error \u03b4z, likelihood ratio L(m), and outputs q_1(x), q_2(x), q_3(x).]\n\nData: x, K, T, \u03c3, v, W_h, W_u, \u0398\nResult: z^T, m^T, C\nbegin Initialization:\n    \u02dcx \u2190 x + N(0, \u03c3^2 I);\n    m^0 \u2190 softmax(N(0, I));\n    z^0 \u2190 E[x];\nend\nfor i = 0 ... T \u2212 1 do\n    for k = 1 ... K do\n        \u02dcz_k \u2190 N(\u02dcx; z_k^i, (v + \u03c3^2) I);\n        \u03b4z_k^i \u2190 (\u02dcx \u2212 z_k^i) m_k^i \u02dcz_k;\n        L(m_k^i) \u2190 \u02dcz_k / \u2211_h \u02dcz_h;\n        h_k^i \u2190 f(W_h [z_k^i, m_k^i, \u03b4z_k^i, L(m_k^i)]);\n        [z_k^{i+1}, m_k^{i+1}] \u2190 W_u Ladder(h_k^i, \u0398);\n    end\n    m^{i+1} \u2190 softmax(m^{i+1});\n    q_{i+1}(x) \u2190 \u2211_{k=1}^K N(x; z_k^{i+1}, vI) m_k^{i+1};\nend\nC \u2190 \u2212\u2211_{i=1}^T log q_i(x);\n\nFigure 3: An example of how Tagger\nwould use a 3-layer-deep Ladder Network\nas its parametric mapping to perform its\niteration i + 1. 
Note the optional class prediction\noutput y_g^i for classification tasks.\nSee supplementary material for details.\n\nAlgorithm 1: Pseudocode for running Tagger on a single\nreal-valued example x. For details and a binary-input\nversion please refer to supplementary material.\n\nPutting it together. By using the negative log likelihood C(x) = \u2212\u2211_i log q_i(x) as a cost function,\nwe train our system to compute an approximation q_i(x) of the true denoising posterior p(x|\u02dcx) at\neach iteration i. An overview of the whole system is given in Figure 2. For each input element x_j we\nintroduce K latent binary variables g_{k,j} that take a value of 1 if this element is generated by group\nk. This way inference is split into K groups, and we can write the approximate posterior in vector\nnotation as follows:\n\n    q_i(x) = \u2211_k q_i(x|g_k) q_i(g_k) = \u2211_k N(x; z_k^i, vI) m_k^i ,    (1)\n\nwhere we model the group reconstruction q_i(x|g_k) as a Gaussian with mean z_k^i and variance v, and\nthe group assignment posterior q_i(g_k) as a categorical distribution m_k^i.\nThe trainable part of the TAG framework is given by a parametric mapping that operates independently\non each group k and is used to compute both z_k^i and m_k^i (which is afterwards normalized using an\nelementwise softmax over the groups). This parametric mapping is usually implemented by a neural\nnetwork and the whole system is trained end-to-end using standard backpropagation through time.\nThe input to the network for the next iteration consists of the vectors z_k^i and m_k^i along with two\nadditional quantities: the remaining modelling error \u03b4z_k^i and the group assignment likelihood ratio\nL(m_k^i), which carry information about how the estimates can be improved:\n\n    \u03b4z_k^i \u221d \u2202C(\u02dcx) / \u2202z_k^i    and    L(m_k^i) \u221d q_i(\u02dcx|g_k) / \u2211_h q_i(\u02dcx|g_h)\n\nNote that they are derived from the corrupted input \u02dcx, to make sure we don\u2019t leak information about\nthe clean input x into the system.\n\nTagger. For this paper we chose the Ladder network [19] as the parametric mapping because its\nstructure reflects the computations required for posterior inference in hierarchical latent variable\nmodels. This means that the network should be well equipped to handle the hierarchical structure one\nmight expect to find in many domains. We call this Ladder network wrapped in the TAG framework\nTagger. This is illustrated in Figure 3 and the corresponding pseudocode can be found in Algorithm 1.\n\n3 Experiments and results\n\nWe explore the properties and evaluate the performance of Tagger both in fully unsupervised settings\nand in semi-supervised tasks in two datasets2. Although both datasets consist of images and grouping\nis intuitively similar to image segmentation, there is no prior in the Tagger model for images: our\nresults (unlike the ConvNet baseline) generalize even if we permute all the pixels.\n\nShapes. We use the simple Shapes dataset [21] to examine the basic properties of our system. It\nconsists of 60,000 (train) + 10,000 (test) binary images of size 20x20. 
Each image contains three\nrandomly chosen shapes (\u25b3, \u25bd, or \u25a1) composed together at random positions with possible overlap.\n\nTextured MNIST. We generated a two-object supervised dataset (TextureMNIST2) by sequentially\nstacking two textured 28x28 MNIST-digits, shifted two pixels left and up, and right and down,\nrespectively, on top of a background texture. The textures for the digits and background are different\nrandomly shifted samples from a bank of 20 sinusoidal textures with different frequencies and\norientations. Some examples from this dataset are presented in the left column of Figure 4b. We use\na 50k training set, 10k validation set, and 10k test set to report the results. We also use a textured\nsingle-digit version (TextureMNIST1) without a shift to isolate the impact of texturing from multiple\nobjects.\n\n3.1 Training and evaluation\n\nWe train Tagger in an unsupervised manner by only showing the network the raw input example\nx, not ground truth masks or any class labels, using 4 groups and 3 iterations. We average the cost\nover iterations and use ADAM [14] for optimization. On the Shapes dataset we trained for 100\nepochs with a bit-flip probability of 0.2, and on the TextureMNIST dataset for 200 epochs with a\ncorruption-noise standard deviation of 0.2. The models reported in this paper took approximately 3\nand 11 hours in wall clock time on a single Nvidia Titan X GPU for the Shapes and TextureMNIST2\ndatasets, respectively.\nWe evaluate the trained models using two metrics: first, the denoising cost on the validation set, and\nsecond, the quality of the segmentation into objects as measured by the adjusted mutual information (AMI) score\n[35], ignoring the background and overlap regions in the Shapes dataset (consistent with Greff et al.\n[8]). Evaluations of the AMI score and classification results in semi-supervised tasks were performed\nusing uncorrupted input. 
The system has no restrictions regarding the number of groups and iterations\nused for training and evaluation. The results improved in terms of both denoising cost and AMI score\nwhen iterating further, so we used 5 iterations for testing. Even though the system was trained with 4\ngroups and 3 shapes per training example, we could evaluate it with, for example, 2 groups\nand 3 shapes, or 4 groups and 4 shapes.\n\n3.2 Unsupervised Perceptual Grouping\n\nTable 1 shows the median performance of Tagger on the Shapes dataset over 20 seeds. Tagger is able\nto achieve very fast convergence, as shown in Table 1a. Through iterations, the network improves its\ndenoising performance by grouping different objects into different groups. Compared to Greff et al.\n[8], Tagger performs significantly better in terms of AMI score (see Table 1b). We found that for\nthis dataset using LayerNorm [1] instead of BatchNorm [13] greatly improves the results, as seen in\nTable 1.\nFigure 4a and Figure 4b qualitatively show the learned unsupervised groupings for the Shapes and\ntextured MNIST datasets. Tagger uses its TAG mechanism slightly differently for the two datasets.\nFor Shapes, z_g represents filled-in objects and masks m_g show which part of the object is actually\nvisible. For textured MNIST, z_g represents the textures while masks m_g capture texture segments.\nIn the case of the same digit or two identical shapes, Tagger can segment them into separate groups,\nand hence performs instance segmentation. We used 4 groups for training even though there are only\n3 objects in the Shapes dataset and 3 segments in the TextureMNIST2 dataset. 
The excess group is\nleft empty by the trained system but its presence seems to speed up the learning process.\n\n2 The datasets and a Theano [33] reference implementation of Tagger are available at http://github.com/CuriousAI/tagger\n\nFigure 4 (a) Results for Shapes dataset. Left column: 7 examples from the test set along with\ntheir resulting groupings in descending AMI score order and 3 hand-picked examples (A, B, and C)\nto demonstrate generalization. A: Testing 2-group model on 3 object data. B: Testing a 4-group model\ntrained with 3-object data on 4 objects. C: Testing a 4-group model trained with 3-object data on 2 objects.\nRight column: Illustration of the inference process over iterations for four color-coded groups; m_k and z_k.\n\nFigure 4 (b) Results for the TextureMNIST2 dataset. Left column: 7 examples from the test set along with\ntheir resulting groupings in descending AMI score order and 3 hand-picked examples (D, E1, E2).\nD: An example from the TextureMNIST1 dataset. E1-2: A hand-picked example from TextureMNIST2.\nE1 demonstrates typical inference, and E2 demonstrates how the system is able to estimate the input\nwhen a certain group (topmost digit 4) is removed. Right column: Illustration of the inference process\nover iterations for four color-coded groups; m_k and z_k.\n\n                  Iter 1   Iter 2   Iter 3   Iter 4   Iter 5\nDenoising cost    0.094    0.068    0.063    0.063    0.063\nAMI               0.58     0.73     0.77     0.79     0.79\nDenoising cost*   0.100    0.069    0.057    0.054    0.054\nAMI*              0.70     0.90     0.95     0.96     0.97\n\n(a) Convergence of Tagger over iterative inference\n\nMethod     AMI\nRC [8]     0.61 \u00b1 0.005\nTagger     0.79 \u00b1 0.034\nTagger*    0.97 \u00b1 0.009\n\n(b) Method comparison\n\nTable 1: Table (a) shows how quickly the algorithm evaluation converges over inference iterations\nwith the Shapes dataset. 
Table (b) compares segmentation quality to previous work on the Shapes\ndataset. The AMI score is defined in the range from 0 (guessing) to 1 (perfect match). Results\nmarked with a star (*) use LayerNorm [1] instead of BatchNorm.\n\nThe hand-picked examples A-C in Figure 4a illustrate the robustness of the system when the number\nof objects changes in the evaluation dataset or when evaluation is performed using fewer groups.\nExample E is particularly interesting; E2 demonstrates how we can remove the topmost digit from\nthe normally evaluated scene E1 and let the system fill in the digit below and the background. We do\nthis by setting the corresponding group assignment values m_g to a large negative number just\nbefore the final softmax over groups in the last iteration.\nTo solve the textured two-digit MNIST task, the system has to combine texture cues with high-level\nshape information. The system first infers the background texture and mask, which are finalized\non the first iteration. Then the second iteration typically fixes the texture used for the topmost digit,\nwhile subsequent iterations clarify the occluded digit and its texture. This demonstrates the need for\niterative inference of the grouping.\n\n3.3 Classification\n\nTo investigate the role of grouping for the task of classification, we evaluate Tagger against four\nbaseline models on the textured MNIST task. As our first baseline we use a fully connected network\n(FC) with ReLU activations and BatchNorm [13] after each layer. Our second baseline is a ConvNet\n(Conv) based on Model C from [30], which has close to state-of-the-art results on CIFAR-10. We\nremoved dropout, added BatchNorm after each layer and replaced the final pooling by a fully\nconnected layer to improve its performance for the task. 
Furthermore, we compare with a fully\nconnected Ladder [19] (FC Ladder) network.\nAll models use a softmax output and are trained with 50,000 samples to minimize the categorical cross\nentropy error. In case there are two different digits in the image (most examples in the TextureMNIST2\ndataset), the target is p = 0.5 for both classes. We evaluate the models based on classi\ufb01cation errors,\nwhich we compute based on the two highest predicted classes (top 2) for the two-digit case.\nFor Tagger, we \ufb01rst train the system in an unsupervised phase for 150 epochs and then add two\nfresh randomly initialized layers on top and continue training the entire system end to end using the\nsum of unsupervised and supervised cost terms for 50 epochs. Furthermore, the topmost layer has a\nper-group softmax activation that includes an added \u2019no class\u2019 neuron for groups that do not contain\nany digit. The \ufb01nal classi\ufb01cation is then performed by summing the softmax output over all groups\nfor the true 10 classes and renormalizing it.\nAs shown in Table 2, Tagger performs signi\ufb01cantly better than all the fully connected baseline models\non both variants, but the improvement is more pronounced for the two-digit case. This result is\nexpected because for cases with multi-object overlap, grouping becomes more important. Moreover, it\ncon\ufb01rms the hypothesis that grouping can help classi\ufb01cation and is particularly bene\ufb01cial for complex\ninputs. Remarkably, Tagger is on par with the convolutional baseline for the TexturedMNIST1 dataset\nand even outperforms it in the two-digit case, despite being fully connected itself. 
We hypothesize\nthat one reason for this result is that grouping allows for the construction of efficient invariant features\nalready in the low layers without losing information about the assignment of features to objects.\nConvolutional networks solve this problem to some degree by grouping features locally through the\nuse of receptive fields, but that strategy is expensive and can break down in cases of heavy overlap.\n\nDataset                             Method             Error 50k     Error 1k      Model details\nTextureMNIST1 (chance level: 90%)   FC MLP             31.1 \u00b1 2.2    89.0 \u00b1 0.2    2000-2000-2000 / 1000-1000\n                                    FC Ladder           7.2 \u00b1 0.1    30.5 \u00b1 0.5    3000-2000-1000-500-250\n                                    FC Tagger (ours)    4.0 \u00b1 0.3    10.5 \u00b1 0.9    3000-2000-1000-500-250\n                                    ConvNet             3.9 \u00b1 0.3    52.4 \u00b1 5.3    based on Model C [30]\nTextureMNIST2 (chance level: 80%)   FC MLP             55.2 \u00b1 1.0    79.4 \u00b1 0.3    2000-2000-2000 / 1000-1000\n                                    FC Ladder          41.1 \u00b1 0.2    68.5 \u00b1 0.2    3000-2000-1000-500-250\n                                    FC Tagger (ours)    7.9 \u00b1 0.3    24.9 \u00b1 1.8    3000-2000-1000-500-250\n                                    ConvNet            12.6 \u00b1 0.4    79.1 \u00b1 0.8    based on Model C [30]\n\nTable 2: Test-set classification errors in % for both textured MNIST datasets. We report mean and\nsample standard deviation over 5 runs. FC = Fully Connected, MLP = Multi Layer Perceptron.\n\n3.4 Semi-Supervised Learning\n\nThe TAG framework does not rely on labels and is therefore directly usable in a semi-supervised\ncontext. For semi-supervised learning, the Ladder [19] is arguably one of the strongest baselines with\nSOTA results on 1,000 MNIST and 60,000 permutation invariant MNIST classification. We follow\nthe common practice of using 1,000 labeled samples and 49,000 unlabeled samples for training\nTagger and the Ladder baselines. 
For completeness, we also report results of the convolutional\n(ConvNet) and fully-connected (FC) baselines trained fully supervised on only 1,000 samples.\nFrom Table 2, it is obvious that all the fully supervised methods fail on this task with 1,000 labels.\nThe best baseline result is achieved by the FC Ladder, which reaches 30.5 % error for one digit but\n68.5 % for TextureMNIST2. For both datasets, Tagger achieves by far the lowest error rates: 10.5 %\nand 24.9 %, respectively. Again, this difference is ampli\ufb01ed for the two-digit case, where Tagger\nwith 1,000 labels even outperforms the Ladder baseline with all 50k labels. This result matches our\nintuition that grouping can often segment even objects of an unknown class and thus help select the\nrelevant features for learning. This is particularly important in semi-supervised learning where the\ninability to self-classify unlabeled samples can mean that the network fails to learn from them at all.\nTo put these results in context, we performed informal tests with \ufb01ve human subjects. The subjects\nimproved signi\ufb01cantly over training for a few days but there were also signi\ufb01cant individual dif-\nferences. The task turned out to be quite dif\ufb01cult and strenuous, with the best performing subjects\nscoring around 10 % error for TextureMNIST1 and 30 % error for TextureMNIST2.\n\n4 Related work\n\nAttention models have recently become very popular, and similar to perceptual grouping they help\nin dealing with complex structured inputs. These approaches are not, however, mutually exclusive\nand can bene\ufb01t from each other. Overt attention models [28, 5] control a window (fovea) to focus on\nrelevant parts of the inputs. Two of their limitations are that they are mostly tailored to the visual\ndomain and are usually only suited to objects that are roughly the same shape as the window. 
But\ntheir ability to limit the \ufb01eld of view can help to reduce the complexity of the target problem and thus\nalso help segmentation. Soft attention mechanisms [26, 3, 40] on the other hand use some form of\ntop-down feedback to suppress inputs that are irrelevant for a given task. These mechanisms have\nrecently gained popularity, \ufb01rst in machine translation [2] and then for many other problems such as\nimage caption generation [39]. Because they re-weigh all the inputs based on their relevance, they\ncould bene\ufb01t from a perceptual grouping process that can re\ufb01ne the precise boundaries of attention.\nOur work is primarily built upon a line of research based on the concept that the brain uses syn-\nchronization of neuronal \ufb01ring to bind object representations together. This view was introduced by\n[37] and has inspired many early works on oscillations in neural networks (see the survey [36] for a\nsummary). Simulating the oscillations explicitly is costly and does not mesh well with modern neural\nnetwork architectures (but see [17]). Rather, complex values have been used to model oscillating\nactivations using the phase as soft tags for synchronization [18, 20]. In our model, we further abstract\nthem by using discretized synchronization slots (our groups). It is most similar to the models of\nWersing et al. [38], Hyv\u00e4rinen & Perki\u00f6 [12] and Greff et al. [8]. However, our work is the \ufb01rst to\ncombine this with denoising autoencoders in an end-to-end trainable fashion.\n\n7\n\n\fAnother closely related line of research [23, 22] has focused on multi-causal modeling of the inputs.\nMany of the works in that area [16, 32, 29, 11] build upon Restricted Boltzmann Machines. Each\ninput is modeled as a mixture model with a separate latent variable for each object. Because exact\ninference is intractable, these models approximate the posterior with some form of expectation\nmaximization [4] or sampling procedure. 
Our assumptions are very similar to those of these approaches, but\nwe allow the model to learn the amortized inference directly (more in line with Goodfellow et al. [7]).\nSince recurrent neural networks (RNNs) are general purpose computers, they can in principle\nimplement arbitrary computable types of temporary variable binding [25, 26], unsupervised\nsegmentation [24], and internal [26] and external attention [28]. For example, an RNN with fast weights [26]\ncan rapidly associate or bind the patterns to which the RNN currently attends. Similar approaches\neven allow for metalearning [27], that is, learning a learning algorithm. Hochreiter et al. [10], for\nexample, learned fast online learning algorithms for the class of all quadratic functions of two variables.\nUnsupervised segmentation could therefore in principle be learned by any RNN as a by-product\nof data compression or any other given task. That does not, however, imply that every RNN will,\nthrough learning, easily discover and implement this tool. From that perspective, TAG can be seen as\na way of helping an RNN to quickly learn and efficiently implement a grouping mechanism.\n\n5 Conclusion\n\nIn this paper, we have argued that the ability to group input elements and internal representations\nis a powerful tool that can improve a system\u2019s ability to handle complex multi-object inputs. We\nhave introduced the TAG framework, which enables a network to directly learn the grouping and\nthe corresponding amortized iterative inference in an unsupervised manner. The resulting iterative\ninference is very efficient and converges within five iterations. We have demonstrated the benefits\nof this mechanism for a heavily cluttered classification task, in which our fully connected Tagger\neven significantly outperformed a state-of-the-art convolutional network. 
More impressively, we have\nshown that our mechanism can greatly improve semi-supervised learning, exceeding conventional\nLadder networks by a large margin. Our method makes minimal assumptions about the data and can\nbe applied to any modality. With TAG, we have barely scratched the surface of a comprehensive\nintegrated grouping mechanism, but we already see significant advantages. We believe grouping to\nbe crucial to human perception and are convinced that it will help to scale neural networks to even\nmore complex tasks in the future.\n\nAcknowledgments\n\nThe authors wish to acknowledge useful discussions with Theofanis Karaletsos, Jaakko S\u00e4rel\u00e4, Tapani\nRaiko, and S\u00f8ren Kaae S\u00f8nderby. They further acknowledge Rinu Boney, Timo Haanp\u00e4\u00e4 and the rest\nof the Curious AI Company team for their support, computational infrastructure, and human testing.\nThis research was supported by the EU project \u201cINPUT\u201d (H2020-ICT-2015 grant no. 687795).\n\nReferences\n[1] Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv:1607.06450 [cs, stat], July 2016.\n[2] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n[3] Deco, G. Biased competition mechanisms for visual attention in a multimodular neurodynamical system. In Emergent Neural Computational Architectures Based on Neuroscience, pp. 114\u2013126. Springer, 2001.\n[4] Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1\u201338, 1977.\n[5] Eslami, S. M., Heess, N., Weber, T., Tassa, Y., Kavukcuoglu, K., and Hinton, G. E. Attend, infer, repeat: Fast scene understanding with generative models. arXiv preprint arXiv:1603.08575, 2016.\n[6] Gallinari, P., LeCun, Y., Thiria, S., and Fogelman-Soulie, F. 
M\u00e9moires associatives distribu\u00e9es: Une comparaison (distributed associative memories: A comparison). In Cesta-Afcet, 1987.\n[7] Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.\n[8] Greff, K., Srivastava, R. K., and Schmidhuber, J. Binding via reconstruction clustering. arXiv:1511.06418 [cs], November 2015.\n[9] Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 399\u2013406, 2010.\n[10] Hochreiter, S., Younger, A. S., and Conwell, P. R. Learning to learn using gradient descent. In Proc. International Conference on Artificial Neural Networks, pp. 87\u201394. Springer, 2001.\n[11] Huang, J. and Murphy, K. Efficient inference in occlusion-aware generative models of images. arXiv preprint arXiv:1511.06362, 2015.\n[12] Hyv\u00e4rinen, A. and Perki\u00f6, J. Learning to segment any random vector. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pp. 4167\u20134172. IEEE, 2006.\n[13] Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n[14] Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.\n[15] Le Cun, Y. Mod\u00e8les Connexionnistes de L\u2019apprentissage. PhD thesis, Paris 6, 1987.\n[16] Le Roux, N., Heess, N., Shotton, J., and Winn, J. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3):593\u2013650, 2011.\n[17] Meier, M., Haschke, R., and Ritter, H. J. Perceptual grouping through competition in coupled oscillator networks. Neurocomputing, 141:76\u201383, 2014.\n[18] Rao, R. A., Cecchi, G., Peck, C. C., and Kozloski, J. R. 
Unsupervised segmentation with dynamical units. Neural Networks, IEEE Transactions on, 19(1):168\u2013182, 2008.\n[19] Rasmus, A., Berglund, M., Honkala, M., Valpola, H., and Raiko, T. Semi-supervised learning with ladder networks. In NIPS, pp. 3532\u20133540, 2015.\n[20] Reichert, D. P. and Serre, T. Neuronal synchrony in complex-valued deep networks. arXiv:1312.6115 [cs, q-bio, stat], December 2013.\n[21] Reichert, D. P., Series, P., and Storkey, A. J. A hierarchical generative model of recurrent object-based attention in the visual cortex. In ICANN, pp. 18\u201325. Springer, 2011.\n[22] Ross, D. A. and Zemel, R. S. Learning parts-based representations of data. The Journal of Machine Learning Research, 7:2369\u20132397, 2006.\n[23] Saund, E. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1):51\u201371, 1995.\n[24] Schmidhuber, J. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234\u2013242, 1992.\n[25] Schmidhuber, J. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131\u2013139, 1992.\n[26] Schmidhuber, J. Reducing the ratio between learning complexity and number of time varying variables in fully recurrent nets. In ICANN\u201993, pp. 460\u2013463. Springer, 1993.\n[27] Schmidhuber, J. A \u2018self-referential\u2019 weight matrix. In ICANN\u201993, pp. 446\u2013450. Springer, 1993.\n[28] Schmidhuber, J. and Huber, R. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(01n02):125\u2013134, 1991.\n[29] Sohn, K., Zhou, G., Lee, C., and Lee, H. Learning and selecting features jointly with point-wise gated Boltzmann machines. In Proceedings of The 30th International Conference on Machine Learning, pp. 217\u2013225, 2013.\n[30] Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. 
Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.\n[31] Srikumar, V., Kundu, G., and Roth, D. On amortizing inference cost for structured prediction. In EMNLP-CoNLL \u201912, pp. 1114\u20131124, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.\n[32] Tang, Y., Salakhutdinov, R., and Hinton, G. Robust Boltzmann machines for recognition and denoising. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2264\u20132271. IEEE, 2012.\n[33] Team, The Theano Development. Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688 [cs], May 2016.\n[34] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096\u20131103. ACM, 2008.\n[35] Vinh, N. X., Epps, J., and Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. JMLR, 11:2837\u20132854, 2010.\n[36] von der Malsburg, C. Binding in models of perception and brain function. Current Opinion in Neurobiology, 5(4):520\u2013526, 1995.\n[37] von der Malsburg, C. The Correlation Theory of Brain Function. Departmental technical report, MPI, 1981.\n[38] Wersing, H., Steil, J. J., and Ritter, H. A competitive-layer model for feature binding and sensory segmentation. Neural Computation, 13(2):357\u2013387, 2001.\n[39] Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.\n[40] Yli-Krekola, A., S\u00e4rel\u00e4, J., and Valpola, H. Selective attention improves learning. In Artificial Neural Networks\u2013ICANN 2009, pp. 285\u2013294. 
Springer, 2009.\n", "award": [], "sourceid": 2226, "authors": [{"given_name": "Klaus", "family_name": "Greff", "institution": "IDSIA"}, {"given_name": "Antti", "family_name": "Rasmus", "institution": "The Curious AI Company"}, {"given_name": "Mathias", "family_name": "Berglund", "institution": "The Curious AI Company"}, {"given_name": "Tele", "family_name": "Hao", "institution": "The Curious AI Company"}, {"given_name": "Harri", "family_name": "Valpola", "institution": "The Curious AI Company"}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": "IDSIA"}]}