{"title": "Can Active Memory Replace Attention?", "book": "Advances in Neural Information Processing Systems", "page_first": 3781, "page_last": 3789, "abstract": "Several mechanisms to focus attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years. Attention has improved image classification, image captioning, speech recognition, generative models, and learning algorithmic tasks, but it had probably the largest impact on neural machine translation.  Recently, similar improvements have been obtained using  alternative mechanisms that do not focus on a single part of a memory but operate on all of it in parallel, in a uniform way. Such mechanism, which we call active memory, improved over attention in algorithmic tasks, image processing, and in generative modelling.  So far, however, active memory has not improved over attention for most natural language processing tasks, in particular for machine translation. We analyze this shortcoming in this paper and propose an extended model of active memory that matches existing attention models on neural machine translation and generalizes better to longer sentences. We investigate this model and explain why previous active memory models did not succeed. Finally, we discuss when active memory brings most benefits and where attention can be a better choice.", "full_text": "Can Active Memory Replace Attention?\n\n\u0141ukasz Kaiser\nGoogle Brain\n\nlukaszkaiser@google.com\n\nSamy Bengio\nGoogle Brain\n\nbengio@google.com\n\nAbstract\n\nSeveral mechanisms to focus attention of a neural network on selected parts of its\ninput or memory have been used successfully in deep learning models in recent\nyears. 
Attention has improved image classi\ufb01cation, image captioning, speech\nrecognition, generative models, and learning algorithmic tasks, but it had probably\nthe largest impact on neural machine translation.\nRecently, similar improvements have been obtained using alternative mechanisms\nthat do not focus on a single part of a memory but operate on all of it in parallel,\nin a uniform way. Such mechanism, which we call active memory, improved over\nattention in algorithmic tasks, image processing, and in generative modelling.\nSo far, however, active memory has not improved over attention for most natural\nlanguage processing tasks, in particular for machine translation. We analyze this\nshortcoming in this paper and propose an extended model of active memory that\nmatches existing attention models on neural machine translation and generalizes\nbetter to longer sentences. We investigate this model and explain why previous\nactive memory models did not succeed. Finally, we discuss when active memory\nbrings most bene\ufb01ts and where attention can be a better choice.\n\n1\n\nIntroduction\n\nRecent successes of deep neural networks have spanned many domains, from computer vision [1] to\nspeech recognition [2] and many other tasks. In particular, sequence-to-sequence recurrent neural\nnetworks (RNNs) with long short-term memory (LSTM) cells [3] have proven especially successful\nat natural language processing (NLP) tasks, including machine translation [4, 5, 6].\nThe basic sequence-to-sequence architecture for machine translation is composed of an RNN encoder\nwhich reads the source sentence one token at a time and transforms it into a \ufb01xed-sized state vector.\nThis is followed by an RNN decoder, which generates the target sentence, one token at a time, from\nthe state vector. 
While a pure sequence-to-sequence recurrent neural network can already obtain good\ntranslation results [4, 6], it suffers from the fact that the whole sentence to be translated needs to be\nencoded into a single fixed-size vector. This clearly manifests itself in the degradation of translation\nquality on longer sentences (see Figure 6) and hurts even more when there is less training data [7].\nIn [5], a successful mechanism to overcome this problem was presented: a neural model of attention.\nIn a sequence-to-sequence model with attention, one retains the outputs of all steps of the encoder\nand concatenates them to a memory tensor. At each step of the decoder, a probability distribution\nover this memory is computed and used to estimate a weighted average encoder representation to be\nused as input to the next decoder step. The decoder can hence focus on different parts of the encoder\nrepresentation while producing tokens. Figure 1 illustrates a single step of this process.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Attention model. The state vector is used to compute a probability distribution over memory.\nWeighted average of memory elements, with focus on one of them, is used to compute the new state.\n\nThe attention mechanism has proven useful well beyond the machine translation task. Image models\ncan benefit from attention too; for instance, image captioning models can focus on the relevant parts\nof the image when describing it [8]; generative models for images yield especially good results with\nattention, as was demonstrated by the DRAW model [9], where the network focuses on a part of the\nimage to produce at a given time. 
Another interesting use-case for the attention mechanism is the\nNeural Turing Machine [10], which can learn basic algorithms and generalize beyond the length of\nthe training instances.\nWhile the attention mechanism is very successful, one important limitation is built into its definition.\nSince the attention mask is computed using a Softmax, it by definition tries to focus on a single\nelement of the memory it is attending to. In the extreme case, also known as hard attention [8], one of\nthe memory elements is selected and the selection is trained using the REINFORCE algorithm (since\nthis is not differentiable) [11]. It is easy to demonstrate that this restriction can make some tasks\nalmost unlearnable for an attention model. For example, consider the task of adding two decimal\nnumbers, presented one after another like this:\n\nInput:  1 2 5 0 + 2 3 1 5\nOutput: 3 5 6 5\n\nA recurrent neural network can have the carry-over in its state and could learn to shift its attention to\nsubsequent digits. But that is only possible if there are two attention heads, attending to the first and\nto the second number. If only a single attention mechanism is present, the model will have a hard\ntime learning this task and will not generalize properly, as was demonstrated in [12, 13].\nA solution to this problem, already proposed in the recent literature (for instance, the Neural GPU\nfrom [12]), is to allow the model to access and change all its memory at each decoding step. We\nwill call this mechanism an active memory. While it might seem more expensive than attention\nmodels, it is actually not, since the attention mechanism needs to compute an attention score for all\nits memory as well in order to focus on the most appropriate part. The approximate complexity of an\nattention mechanism is therefore the same as the complexity of the active memory. 
In practice, we\nget step-times of around 1.7 seconds for an active memory model, the Extended Neural GPU introduced\nbelow, and 1.2 seconds for a comparable model with an attention mechanism. But active memory can\npotentially make parallel computations on the whole memory, as depicted in Figure 2.\n\nFigure 2: Active memory model. The whole memory takes part in the computation at every step.\nEach element of memory is active and changes in a uniform way, e.g., using a convolution.\n\nActive memory is a natural choice for image models as they usually operate on a canvas. And indeed,\nrecent works have shown that actively updating the canvas that will be used to produce the final results\ncan be beneficial. Residual networks [14], currently the best performing models on the ImageNet task,\nfall into this category. In [15] it was shown that the weights of different layers of a residual network\ncan be tied (so it becomes recurrent), without degrading performance. Other models that operate on\nthe whole canvas at each step were presented in [16, 17]. Both of these models are generative and\nshow very good performance, yielding better results than the original DRAW model. Thus, the active\nmemory approach seems to be a better choice for image models.\nBut what about non-image models? The Neural GPUs [12] demonstrated that active memory yields\nsuperior results on algorithmic tasks. But can it be applied to real-world problems? In particular,\nthe original attention model brought great success to natural language processing, especially to neural\nmachine translation. Can active memory be applied to this task on a large scale?\nWe answer this question positively, by presenting an extension of the Neural GPU model that yields\ngood results for neural machine translation. This model allows us to investigate in depth a number of\nquestions about the relationship between attention and active memory. 
We clarify why the previous\nactive memory model did not succeed on machine translation by showing how it is related to the\ninherent dependencies in the target distributions, and we study a few variants of the model that show\nhow a recurrent structure on the output side is necessary to obtain good results.\n\n2 Active Memory Models\n\nIn the previous section, we used the term active memory broadly, referring to any model where every\npart of the memory undergoes active change at every step. This is in contrast to attention models\nwhere only a small part of the memory changes at every step, or where the memory remains constant.\nThe exact implementation of an active change of the memory might vary from model to model. In\nthe present paper, we will focus on the most common ways of implementing this change, all of which rely\non the convolution operator.\nThe convolution acts on a kernel bank and a 3-dimensional tensor. Our kernel banks are 4-dimensional\ntensors of shape [k_w, k_h, m, m], i.e., they contain k_w \u00b7 k_h \u00b7 m^2 parameters, where k_w and k_h are\nthe kernel width and height. A kernel bank U can be convolved with a 3-dimensional tensor s of shape\n[w, h, m], which results in the tensor U \u2217 s of the same shape as s, defined by:\n\n$$(U \\ast s)[x, y, i] = \\sum_{u=\\lfloor -k_w/2 \\rfloor}^{\\lfloor k_w/2 \\rfloor} \\; \\sum_{v=\\lfloor -k_h/2 \\rfloor}^{\\lfloor k_h/2 \\rfloor} \\; \\sum_{c=1}^{m} s[x + u, y + v, c] \\cdot U[u, v, c, i].$$\n\nIn the equation above the index x + u (or y + v) might sometimes be negative or larger than the size of s, and\nin such cases we assume the value is 0. This corresponds to the standard convolution operator used\nin many deep learning toolkits, with zero padding on both sides and stride 1. 
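As an illustration of the definition above, the following plain-Python sketch computes the convolution of a kernel bank with a memory tensor using explicit loops. It is not the authors' implementation: the nested-list layout, the names conv2d_same and identity_kernel, and the centered offsets (which assume an odd kernel width and height) are assumptions made for readability, and a real model would call an optimized library operator instead.

```python
def conv2d_same(U, s):
    # U: kernel bank, nested lists indexed as U[a][b][c][i], shape [kw][kh][m][m].
    # s: memory tensor, nested lists indexed as s[x][y][c], shape [w][h][m].
    # Assumption: kw and kh are odd, so the offsets are centered on (x, y).
    kw, kh = len(U), len(U[0])
    w, h, m = len(s), len(s[0]), len(s[0][0])
    out = [[[0.0] * m for _ in range(h)] for _ in range(w)]
    for x in range(w):
        for y in range(h):
            for a in range(kw):
                for b in range(kh):
                    xs = x + a - kw // 2
                    ys = y + b - kh // 2
                    if not (0 <= xs < w and 0 <= ys < h):
                        continue  # out-of-range reads count as 0 (zero padding)
                    for c in range(m):
                        for i in range(m):
                            out[x][y][i] += s[xs][ys][c] * U[a][b][c][i]
    return out


def identity_kernel(kw, kh, m):
    # Kernel bank whose center tap is the m x m identity; all other taps are zero.
    U = [[[[0.0] * m for _ in range(m)] for _ in range(kh)] for _ in range(kw)]
    for c in range(m):
        U[kw // 2][kh // 2][c][c] = 1.0
    return U


# Tiny usage example: a [2][2][2] memory convolved with a 3 x 3 identity kernel.
s_demo = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
out_demo = conv2d_same(identity_kernel(3, 3, 2), s_demo)
```

With the identity kernel, out_demo equals s_demo, which gives a quick sanity check of the indexing and of the zero padding at the borders.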
Using the standard\noperator has the advantage that it is heavily optimized and can directly benefit from any new work\n(e.g., [18]) on optimizing convolutions.\nGiven a memory tensor s, an active memory model will produce the next memory s' by using a\nnumber of convolutions on s and combining them. In the most basic setting, a residual active memory\nmodel will be defined as:\n\n$$s' = s + U \\ast s,$$\n\ni.e., it will only add to an already existing state.\nWhile residual models have been successful in image analysis [14] and generation [16], they might\nsuffer from the vanishing gradient problem in the same way as recurrent neural networks do. Therefore,\nin the same spirit in which LSTM gates [3] and GRU gates [19] improve over pure RNNs, one can\nintroduce convolutional LSTM and GRU operators. Let us focus on the convolutional GRU, which\nwe define in the same way as in [12], namely:\n\n$$\\mathrm{CGRU}(s) = u \\odot s + (1 - u) \\odot \\tanh(U \\ast (r \\odot s) + B), \\quad \\text{where} \\quad u = \\sigma(U' \\ast s + B') \\quad \\text{and} \\quad r = \\sigma(U'' \\ast s + B''). \\qquad (1)$$\n\nFigure 3: Neural GPU with 2 layers and width w = 3 unfolded in time.\n\nAs a baseline for our investigation of active memory models, we will use the Neural GPU model from\n[12], depicted in Figure 3, and defined as follows. The given sequence i = (i_1, . . . , i_n) of n discrete\nsymbols from {0, . . . , I} is first embedded into the tensor s_0 by concatenating the vectors obtained\nfrom an embedding lookup of the input symbols into its first column. More precisely, we create the\nstarting tensor s_0 of shape [w, n, m] by using an embedding matrix E of shape [I, m] and setting\ns_0[0, k, :] = E[i_k] (in Python notation) for all k = 1 . . . n (here i_1, . . . , i_n is the input). All other\nelements of s_0 are set to 0. 
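Equation (1) and the embedding of the input into the starting tensor can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' TensorFlow code: the nested-list tensors, the helper names (conv, zip_map, gate, cgru, build_s0), the parameter dictionary keys, the per-map bias vectors, and the centered kernel offsets (odd kernel sizes) are all hypothetical choices made to keep the example self-contained. Positions are 0-based here, whereas the text counts k from 1, and every symbol is assumed to be a valid row index of E.

```python
import math


def conv(U, s):
    # Same-size convolution with zero padding and stride 1.
    # U[a][b][c][i] has shape [kw][kh][m][m]; s[x][y][c] has shape [w][h][m].
    kw, kh = len(U), len(U[0])
    w, h, m = len(s), len(s[0]), len(s[0][0])
    out = [[[0.0] * m for _ in range(h)] for _ in range(w)]
    for x in range(w):
        for y in range(h):
            for a in range(kw):
                for b in range(kh):
                    xs, ys = x + a - kw // 2, y + b - kh // 2
                    if 0 <= xs < w and 0 <= ys < h:
                        for c in range(m):
                            for i in range(m):
                                out[x][y][i] += s[xs][ys][c] * U[a][b][c][i]
    return out


def zip_map(f, *tensors):
    # Apply f elementwise across equally shaped [w][h][m] tensors.
    return [[[f(*vals) for vals in zip(*cells)]
             for cells in zip(*rows)]
            for rows in zip(*tensors)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(U, s, B, act):
    # act(U * s + B), with the bias vector B (length m) shared by all positions.
    return [[[act(v + B[i]) for i, v in enumerate(cell)] for cell in row]
            for row in conv(U, s)]


def cgru(s, p):
    # p holds kernel banks U, U1, U2 and biases B, B1, B2
    # (U1 and U2 stand in for the primed and double-primed banks of Equation (1)).
    u = gate(p['U1'], s, p['B1'], sigmoid)          # update gate
    r = gate(p['U2'], s, p['B2'], sigmoid)          # reset gate
    rs = zip_map(lambda a, b: a * b, r, s)          # elementwise r * s
    cand = gate(p['U'], rs, p['B'], math.tanh)      # candidate state
    return zip_map(lambda ui, si, ci: ui * si + (1.0 - ui) * ci, u, s, cand)


def build_s0(tokens, E, w):
    # Put the embedding of the k-th input symbol at s0[0][k]; everything else is 0.
    n, m = len(tokens), len(E[0])
    s0 = [[[0.0] * m for _ in range(n)] for _ in range(w)]
    for k, tok in enumerate(tokens):
        s0[0][k] = list(E[tok])
    return s0
```

With all kernel banks and biases set to zero, the update gate is 0.5 everywhere and the candidate state is 0, so one CGRU step simply halves the state; this makes for an easy sanity check before plugging in learned parameters.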
Then, we apply l different CGRU gates in turn for n steps to produce the\nfinal tensor s_fin:\n\n$$s_{t+1} = \\mathrm{CGRU}_l(\\mathrm{CGRU}_{l-1}(\\ldots \\mathrm{CGRU}_1(s_t) \\ldots)) \\quad \\text{and} \\quad s_{\\mathrm{fin}} = s_n.$$\n\nThe result of a Neural GPU is produced by multiplying each item in the first column of s_fin by an\noutput matrix O to obtain the logits l_k = O s_fin[0, k, :] and then selecting the largest one: o_k =\nargmax(l_k). During training we use the standard loss function, i.e., we compute a Softmax over the\nlogits l_k and use the negative log probability of the target as the loss.\n\n2.1 The Markovian Neural GPU\n\nThe baseline Neural GPU model yields very poor results on neural machine translation: its per-word\nperplexity on WMT (see footnote 1) does not go below 30 (good models on this task go below 4), and its BLEU\nscores are also very bad (below 5, while good models are higher than 20). Which part of the model is\nresponsible for such bad results?\nIt turns out that the main culprit is the output generator. As one can see in Figure 3 above, every\noutput symbol is generated independently of all other output symbols, conditioned only on the state\ns_fin. This is fine for learning purely deterministic functions, like the toy tasks the Neural GPU was\ndesigned for. But it does not work for harder real-world problems, where there could be multiple\npossible outputs for each input.\nThe most basic way to mitigate this problem is to make every output symbol depend on the previous\noutput. This only changes the output generation, not the state, so the definition of the model is the\nsame as above until s_fin. 
The result is then obtained by multiplying by an output matrix O each item\nfrom the first column of s_fin concatenated with the embedding of the previous output generated by\nanother embedding matrix E':\n\n$$l_k = O \\, \\mathrm{concat}(s_{\\mathrm{fin}}[0, k, :], \\; E' o_{k-1}).$$\n\nFor k = 0 we use a special symbol o_{k-1} = GO and, to get the output, we select o_k = argmax(l_k).\nDuring training we use the standard loss function, i.e., we compute a Softmax over the logits l_k and\nuse the negative log probability of the target as the loss. Also, as is standard in recurrent networks [4],\nwe use teacher forcing, i.e., during training we provide the true output label as o_{k-1} instead of using\nthe previous output generated by the model. This means that the loss incurred from generating o_k\ndoes not directly influence the value of o_{k-1}. We depict this model in Figure 4.\n\n2.2 The Extended Neural GPU\n\nThe Markovian Neural GPU yields much better results on neural machine translation than the baseline\nmodel: its per-word perplexity reaches about 12 and its BLEU scores improve a bit. But these results\nare still far from those achieved by models with attention.\n\nFootnote 1: See Section 3 for more details on the experimental setting.\n\nFigure 4: Markovian Neural GPU. Each output o_k is conditionally dependent on the final tensor\ns_fin = s_n and the previous output symbol o_{k-1}.\n\nFigure 5: Extended Neural GPU with active memory decoder. See the text below for definition.\n\nCould it be that the Markovian dependence of the outputs is too weak for this problem, that a full\nrecurrent dependence of the state is needed for good performance? 
We test this by extending the\nbaseline model with an active memory decoder, as depicted in Figure 5.\nThe definition of the Extended Neural GPU follows the baseline model until s_fin = s_n. We consider\ns_n as the starting point for the active memory decoder, i.e., we set d_0 = s_n. In the active memory\ndecoder we will also use a separate output tape tensor p of the same shape as d_0, i.e., p is of shape\n[w, n, m]. We start with p_0 set to all 0 and define the decoder states by\n\n$$d_{t+1} = \\mathrm{CGRU}^d_l(\\mathrm{CGRU}^d_{l-1}(\\ldots \\mathrm{CGRU}^d_1(d_t, p_t) \\ldots, p_t), p_t),$$\n\nwhere CGRU^d is defined just like CGRU in Equation (1) but with additional input as highlighted\nbelow in bold:\n\n$$\\mathrm{CGRU}^d(s, p) = u \\odot s + (1 - u) \\odot \\tanh(U \\ast (r \\odot s) + \\mathbf{W \\ast p} + B), \\quad \\text{where} \\quad u = \\sigma(U' \\ast s + \\mathbf{W' \\ast p} + B') \\quad \\text{and} \\quad r = \\sigma(U'' \\ast s + \\mathbf{W'' \\ast p} + B''). \\qquad (2)$$\n\nWe generate the k-th output by multiplying the k-th vector in the first column of d_k by the output\nmatrix O, i.e., l_k = O d_k[0, k, :]. We then select o_k = argmax(l_k). The symbol o_k is then embedded\nback into a dense representation using another embedding matrix E' and we put it into the k-th place\non the output tape p, i.e., we define\n\n$$p_{k+1} = p_k \\quad \\text{with} \\quad p_k[0, k, :] \\leftarrow E' o_k.$$\n\nIn this way, we accumulate (embedded) outputs step-by-step on the output tape p. Each step p_t has\naccess to all outputs produced in all steps before t.\nAgain, it is important to note that during training we use teacher forcing, i.e., we provide the true\noutput labels for o_k instead of using the outputs generated by the model.\n\n2.3 Related Models\n\nA convolutional architecture has already been used to obtain good results in word-level neural\nmachine translation in [20] and more recently in [21]. 
These models use a standard RNN on top of\nthe convolution to generate the output and avoid the output dependence problem in this way. But\nthe state of this RNN has a fixed size, and in the first one the sentence representation generated by\nthe convolutional network is also a fixed-size vector. Therefore, while superficially similar to active\nmemory, these models are more similar to fixed-size memory models. The first one suffers from all\nthe limitations of sequence-to-sequence models without attention [4, 6] that we discussed before.\nAnother recently introduced model, the Grid LSTM [22], might look less related to active memory,\nas it does not use convolutions at all. But in fact it is to a large extent an active memory model: the\nmemory is on the diagonal of the grid of the running LSTM cells. The Reencoder architecture for\nneural machine translation introduced in that paper is therefore related to the Extended Neural GPU.\nBut it differs in a number of ways. For one, the input is provided step-wise, so the network cannot\nstart processing the whole input in parallel, as in our model. The diagonal memory changes in size\nand the model is a 3-dimensional grid, which might not be necessary for language processing. The\nReencoder also does not use convolutions, and convolutions are crucial for performance. The experiments from\n[22] are only performed on a very small dataset of 44K short sentences. This is almost 1000 times\nsmaller than the dataset we are experimenting with and makes it unclear whether Grid LSTMs can be\napplied to large-scale real-world tasks.\nIn image processing, in addition to the captioning [8] and generative models [16, 17] that we\nmentioned before, there are several other active memory models. 
They use convolutional LSTMs, an\narchitecture similar to CGRU, and have recently been used for weather prediction [23] and image\ncompression [24], in both cases surpassing the state of the art.\n\n3 Experiments\n\nSince all components of our models (defined above) are differentiable, we can train them using any\nstochastic gradient descent optimizer. For the results presented in this paper we used the Adam\noptimizer [25] with \u03b5 = 10^-4 and the gradient norm clipped to 1. The number of layers was set to\nl = 2, the width of the state tensors was constant at w = 4, the number of maps was m = 512, and\nthe width and height of the convolution kernels were always k_w = k_h = 3 (see footnote 2).\nAs our main test, we train the models discussed above and a baseline attention model on the WMT\u201914\nEnglish-French translation task. This is the same task that was used to introduce attention [5], but,\nto avoid the problem with the UNK token, we spell out each word that is not in the vocabulary. More\nprecisely, we use a 32K vocabulary that includes all characters and the most common words, and\nevery word that is not in the vocabulary is spelled out letter-by-letter. We also include a special SPACE\nsymbol, which is used to mark spaces between characters (we assume spaces between words). We\ntrain without any data filtering on the WMT\u201914 corpus and test on the WMT\u201914 test set (newstest\u201914).\nAs a baseline, we use a GRU model with attention that is almost identical to the original one from\n[5], except that it has 2 layers of GRU cells, each with 1024 units. Tokens from the vocabulary are\nembedded into vectors of size 512, and attention is put on the top layer. This model is identical to the\none in [7], except that it uses GRU cells instead of LSTM cells. It has about 120M parameters, while\nour Extended Neural GPU model has about 110M parameters. 
Better results have been reported on\nthis task with attention models with more parameters, but we aim at a baseline similar in size to the\nactive memory model we are using.\nWhen decoding from the Extended Neural GPU model, one has to provide the expected size of the\noutput, as it determines the size of the memory. We test all sizes between input size and double the\ninput size using a greedy decoder and pick the result with smallest log-perplexity (highest likelihood).\nThis is expensive, so we only use a very basic beam-search with beam of size 2 and no length\nnormalization. It is possible to reduce the cost by predicting the output length: we tried a basic\nestimator based just on input sentence length and it decreased the BLEU score by 0.3. Better training\nand decoding could remove the need to predict output length, but we leave this for future work.\n\nFootnote 2: Our model was implemented using TensorFlow [26]. Its code is available as open source at https://github.com/tensorflow/models/tree/master/neural_gpu/.\n\n| Model | Perplexity (log) | BLEU |\n|---|---|---|\n| Neural GPU | 30.1 (3.5) | < 5 |\n| Markovian Neural GPU | 11.8 (2.5) | < 5 |\n| Extended Neural GPU | 3.3 (1.19) | 29.6 |\n| GRU+Attention | 3.4 (1.22) | 26.4 |\n\nTable 1: Results on the WMT English-to-French translation task. We provide the average per-word\nperplexity (and its logarithm in parentheses) and the BLEU score. Perplexity is computed on the test\nset with the ground truth provided, so it does not depend on the decoder.\n\nFor the baseline model, we use a full beam-search decoder with beam of size 12, length normalization\nand an attention coverage penalty in the decoder. This is a basic penalty that pushes the decoder to\nattend to all words in the source sentence. We experimented with more elaborate methods following\n[27] but they did not improve our results. The parameters for length normalization and coverage penalty\nare tuned on the development set (newstest\u201913). 
The final BLEU scores and per-word perplexities for\nthese different models are presented in Table 1. Worse models have higher variance of their BLEU\nscores, so we only write < 5 for these models.\nOne can see from Table 1 that an active memory model can indeed match an attention model on\nthe machine translation task, even with slightly fewer parameters. It is interesting to note that the\nactive memory model does not need the length normalization that is necessary for the attention model\n(especially when rare words are spelled out). We conjecture that active memory inherently generalizes better\nfrom shorter examples and makes decoding easier, which is welcome news, since tuning decoders is a large\nproblem in sequence-to-sequence models.\nIn addition to the summary results from Table 1, we analyzed the performance of the models on\nsentences of different lengths. This was the key problem solved by the attention mechanism, so it is\nworth asking if active memory solves it as well. In Figure 6 we plot the BLEU scores on the test set\nfor sentences in each length bucket, bucketing by 10, i.e., for lengths (0, 10], (10, 20] and so on. We\nplot the curves for the Extended Neural GPU model, the long baseline GRU model with attention,\nand, for comparison, we add the numbers for a non-attention model from Figure 2 of [5]. (Note\nthat these numbers are for a model that uses different tokenization, so they are not fully comparable,\nbut they still provide context.)\nAs can be seen, our active memory model is less sensitive to sentence length than the attention\nbaseline. It indeed solves the problem that the attention mechanism was designed to solve.\n\nParsing.\nIn addition to the main large-scale translation task, we tested the Extended Neural GPU\non English constituency parsing, the same task as in [7]. We only used the standard WSJ dataset for\ntraining. It is small by neural network standards, as it contains only 40K sentences. 
We trained the\nExtended Neural GPU with the same settings as above, only with m = 256 (instead of m = 512)\nand dropout of 30% in each step. During decoding, we selected well-bracketed outputs with the right\nnumber of POS-tags from all lengths considered. Evaluated with the standard EVALB tool on the\nstandard WSJ 23 test set, we got an F1 score of 85.1. This is lower than the 88.3 reported in [7], but we did not\nuse any of their optimizations (no early stopping, no POS-tag substitution, no special tuning). Since a\npure sequence-to-sequence model has an F1 score well below 70, this shows that the Extended Neural\nGPU is versatile and can learn and generalize well even on small datasets.\n\n4 Discussion\n\nTo better understand the main shortcoming of previous active memory models, let us look at the\naverage log-perplexities of the different models in Table 1. A pure Neural GPU model yields\n3.5, a Markovian one yields 2.5, and only a model with full dependence, trained with teacher forcing,\nachieves 1.3. The recurrent dependence in generating the output distribution turns out to be the key\nto achieving good performance.\n\nFigure 6: BLEU score (the higher the better) vs source sentence length. Curves: Extended Neural GPU, GRU+Attention, No Attention.\n\nWe find it illuminating that the issue of dependencies in the output distribution can be disentangled\nfrom the particularities of the model or model class. In earlier works, such dependence (and training\nwith teacher forcing) was always used in LSTM and GRU models, but very rarely in other kinds of\nmodels. 
We show that it can be beneficial to consider this issue separately from the model architecture.\nIt allowed us to create the Extended Neural GPU, and this way of thinking might also prove fruitful for\nother classes of models.\nWhen the issue of recurrent output dependencies is addressed, as we do in the Extended Neural GPU,\nan active memory model can indeed match or exceed attention models on a large-scale real-world\ntask. Does this mean we can always replace attention by active memory?\nThe answer could be yes for the case of soft attention. Its cost is approximately the same as active\nmemory, it performs much worse on some tasks like learning algorithms, and, with the introduction\nof the Extended Neural GPU, we do not know of a task where it performs clearly better.\nStill, an attention mask is a very natural concept, and it is probable that some tasks can benefit from\na selector that focuses on single items by definition. This is especially obvious for hard attention:\nit can be used over large memories with potentially much less computational cost than an active\nmemory, so it might be indispensable for devising long-term memory mechanisms. Luckily, active\nmemory and attention are not exclusive, and we look forward to investigating models that combine\nthese mechanisms.\n\nReferences\n\n[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.\n\n[2] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 20(1):30\u201342, 2012.\n\n[3] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n\n[4] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. 
In Advances in Neural Information Processing Systems, pages 3104\u20133112, 2014.\n\n[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.\n\n[6] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.\n\n[7] Oriol Vinyals, \u0141ukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2015.\n\n[8] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.\n\n[9] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.\n\n[10] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014.\n\n[11] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229\u2013256, 1992.\n\n[12] \u0141ukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016.\n\n[13] A. Joulin and T. Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems (NIPS), 2015.\n\n[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.\n\n[15] Qianli Liao and Tomaso Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. 
CoRR, abs/1604.03640, 2016.\n\n[16] Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization in deep generative models. CoRR, abs/1603.05106, 2016.\n\n[17] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. CoRR, abs/1604.08772, 2016.\n\n[18] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. CoRR, abs/1509.09308, 2015.\n\n[19] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014.\n\n[20] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings EMNLP 2013, pages 1700\u20131709, 2013.\n\n[21] Fandong Meng, Zhengdong Lu, Mingxuan Wang, Hang Li, Wenbin Jiang, and Qun Liu. Encoding source language with convolutional neural network for machine translation. In ACL, pages 20\u201330, 2015.\n\n[22] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. In International Conference on Learning Representations, 2016.\n\n[23] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai kin Wong, and Wang chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, 2015.\n\n[24] George Toderici, Sean M. O\u2019Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. Variable rate image compression with recurrent neural networks. In International Conference on Learning Representations, 2016.\n\n[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
CoRR, abs/1412.6980,\n\n2014.\n\n[26] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado,\nAndy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey\nIrving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg,\nDan Man\u00e9, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens,\nBenoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda\nVi\u00e9gas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.\nTensor\ufb02ow: Large-scale machine learning on heterogeneous distributed systems, 2015.\n\n[27] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling coverage for neural machine\n\ntranslation. CoRR, abs/1601.04811, 2016.\n\n9\n\n\f", "award": [], "sourceid": 1871, "authors": [{"given_name": "\u0141ukasz", "family_name": "Kaiser", "institution": "Google Brain"}, {"given_name": "Samy", "family_name": "Bengio", "institution": "Google Brain"}]}