{"title": "On Multiplicative Integration with Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2856, "page_last": 2864, "abstract": "We introduce a general simple structural design called \u201cMultiplicative Integration\u201d (MI) to improve recurrent neural networks (RNNs). MI changes the way of how the information flow gets integrated in the computational building block of an RNN, while introducing almost no extra parameters. The new structure can be easily embedded into many popular RNN models, including LSTMs and GRUs. We empirically analyze its learning behaviour and conduct evaluations on several tasks using different RNN models. Our experimental results demonstrate that Multiplicative Integration can provide a substantial performance boost over many of the existing RNN models.", "full_text": "On Multiplicative Integration with\n\nRecurrent Neural Networks\n\nYuhuai Wu1,\u2217, Saizheng Zhang2,\u2217, Ying Zhang2, Yoshua Bengio2,4 and Ruslan Salakhutdinov3,4\n\n1University of Toronto, 2MILA, Universit\u00e9 de Montr\u00e9al, 3Carnegie Mellon University, 4CIFAR\nywu@cs.toronto.edu,2{firstname.lastname}@umontreal.ca,rsalakhu@cs.cmu.edu\n\nAbstract\n\nWe introduce a general and simple structural design called \u201cMultiplicative Integra-\ntion\u201d (MI) to improve recurrent neural networks (RNNs). MI changes the way in\nwhich information from difference sources \ufb02ows and is integrated in the compu-\ntational building block of an RNN, while introducing almost no extra parameters.\nThe new structure can be easily embedded into many popular RNN models, includ-\ning LSTMs and GRUs. We empirically analyze its learning behaviour and conduct\nevaluations on several tasks using different RNN models. 
Our experimental results demonstrate that Multiplicative Integration can provide a substantial performance boost over many of the existing RNN models.\n\n1 Introduction\n\nRecently there has been a resurgence of new structural designs for recurrent neural networks (RNNs) [1, 2, 3]. Most of these designs are derived from popular structures including vanilla RNNs, Long Short Term Memory networks (LSTMs) [4] and Gated Recurrent Units (GRUs) [5]. Despite their varying characteristics, most of them share a common computational building block, described by the following equation:\n\n\u03c6(Wx + Uz + b),  (1)\n\nwhere x \u2208 Rn and z \u2208 Rm are state vectors coming from different information sources, W \u2208 Rd\u00d7n and U \u2208 Rd\u00d7m are state-to-state transition matrices, and b is a bias vector. This computational building block serves as a combinator for integrating the information flow from x and z by a sum operation \u201c+\u201d, followed by a nonlinearity \u03c6. We refer to it as the additive building block. Additive building blocks are widely implemented in various state computations in RNNs (e.g. hidden state computations for vanilla-RNNs, gate/cell computations of LSTMs and GRUs).\nIn this work, we propose an alternative design for constructing the computational building block by changing the procedure of information integration. Specifically, instead of utilizing the sum operation \u201c+\u201d, we propose to use the Hadamard product \u201c\u2299\u201d to fuse Wx and Uz:\n\n\u03c6(Wx \u2299 Uz + b)  (2)\n\nThis modification changes the RNN from first order to second order [6], while introducing no extra parameters. We call this kind of information integration design a form of Multiplicative Integration. The effect of multiplication naturally results in a gating type structure, in which Wx and Uz are the gates of each other. 
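For concreteness, the additive block of Eq. (1) and the multiplicative block of Eq. (2) can be sketched in a few lines of NumPy; the dimensions and the tanh nonlinearity here are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 4, 3, 5                      # output, input, and state dimensions
W = rng.normal(size=(d, n))            # maps the input-side source x
U = rng.normal(size=(d, m))            # maps the state-side source z
b = np.zeros(d)                        # bias vector
x, z = rng.normal(size=n), rng.normal(size=m)

def additive_block(x, z):
    """Eq. (1): phi(Wx + Uz + b), here with phi = tanh."""
    return np.tanh(W @ x + U @ z + b)

def mi_block(x, z):
    """Eq. (2): phi(Wx * Uz + b), '*' being the elementwise (Hadamard) product."""
    return np.tanh((W @ x) * (U @ z) + b)

print(additive_block(x, z).shape, mi_block(x, z).shape)
```

The only change is the fusion operator; every dimension and parameter count stays exactly the same.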
More speci\ufb01cally, one can think of the state-to-state computation\nUz (where for example z represents the previous state) as dynamically rescaled by Wx (where\nfor example x represents the input). Such rescaling does not exist in the additive building block, in\nwhich Uz is independent of x. This relatively simple modi\ufb01cation brings about advantages over the\nadditive building block as it alters RNN\u2019s gradient properties, which we discuss in detail in the next\nsection, as well as verify through extensive experiments.\n\n\u2217Equal contribution.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fIn the following sections, we \ufb01rst introduce a general formulation of Multiplicative Integration. We\nthen compare it to the additive building block on several sequence learning tasks, including character\nlevel language modelling, speech recognition, large scale sentence representation learning using a\nSkip-Thought model, and teaching a machine to read and comprehend for a question answering\ntask. The experimental results (together with several existing state-of-the-art models) show that\nvarious RNN structures (including vanilla RNNs, LSTMs, and GRUs) equipped with Multiplicative\nIntegration provide better generalization and easier optimization. Its main advantages include: (1) it\nenjoys better gradient properties due to the gating effect. Most of the hidden units are non-saturated;\n(2) the general formulation of Multiplicative Integration naturally includes the regular additive\nbuilding block as a special case, and introduces almost no extra parameters compared to the additive\nbuilding block; and (3) it is a drop-in replacement for the additive building block in most of the\npopular RNN models, including LSTMs and GRUs. It can also be combined with other RNN training\ntechniques such as Recurrent Batch Normalization [7]. 
We further discuss its relationship to existing models, including Hidden Markov Models (HMMs) [8], second order RNNs [6] and Multiplicative RNNs [9].\n\n2 Structure Description and Analysis\n2.1 General Formulation of Multiplicative Integration\n\nThe key idea behind Multiplicative Integration is to integrate the different information flows Wx and Uz by the Hadamard product \u201c\u2299\u201d. A more general formulation of Multiplicative Integration includes two more bias vectors \u03b21 and \u03b22 added to Wx and Uz:\n\n\u03c6((Wx + \u03b21) \u2299 (Uz + \u03b22) + b)  (3)\n\nwhere \u03b21, \u03b22 \u2208 Rd are bias vectors. Notice that such a formulation contains the first order terms of an additive building block, i.e., \u03b21 \u2299 Uz + \u03b22 \u2299 Wx. In order to make the Multiplicative Integration more flexible, we introduce another bias vector \u03b1 \u2208 Rd to gate2 the term Wx \u2299 Uz, obtaining the following formulation:\n\n\u03c6(\u03b1 \u2299 Wx \u2299 Uz + \u03b21 \u2299 Uz + \u03b22 \u2299 Wx + b)  (4)\n\nNote that the number of parameters of the Multiplicative Integration is about the same as that of the additive building block, since the number of new parameters (\u03b1, \u03b21 and \u03b22) is negligible compared to the total number of parameters. Also, Multiplicative Integration can be easily extended to LSTMs and GRUs3, which adopt additive building blocks for computing gates and output states; one can directly replace them with the Multiplicative Integration. More generally, in any kind of structure where k information flows (k \u2265 2) are involved (e.g. residual networks [10]), one can implement pairwise Multiplicative Integration for integrating all k information sources.\n\n2.2 Gradient Properties\nThe Multiplicative Integration has different gradient properties compared to the additive building block. 
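The claim that Eq. (4) includes the additive building block of Eq. (1) as a special case can be checked numerically; a minimal NumPy sketch, with sizes and the tanh nonlinearity as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 4, 3, 5
W, U = rng.normal(size=(d, n)), rng.normal(size=(d, m))
b = np.zeros(d)
x, z = rng.normal(size=n), rng.normal(size=m)

def mi_general(x, z, alpha, beta1, beta2):
    """Eq. (4): phi(alpha*Wx*Uz + beta1*Uz + beta2*Wx + b), '*' elementwise."""
    wx, uz = W @ x, U @ z
    return np.tanh(alpha * wx * uz + beta1 * uz + beta2 * wx + b)

# With alpha = 0 and beta1 = beta2 = 1, Eq. (4) reduces to the additive
# building block phi(Wx + Uz + b) of Eq. (1).
additive = np.tanh(W @ x + U @ z + b)
degenerate = mi_general(x, z, alpha=np.zeros(d), beta1=np.ones(d), beta2=np.ones(d))
assert np.allclose(additive, degenerate)
```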
For clarity of presentation, we first look at a vanilla-RNN and an RNN with Multiplicative Integration embedded, referred to as MI-RNN. That is, ht = \u03c6(Wxt + Uht\u22121 + b) versus ht = \u03c6(Wxt \u2299 Uht\u22121 + b). In a vanilla-RNN, the gradient \u2202ht/\u2202ht\u2212n can be computed as follows:\n\n\u2202ht/\u2202ht\u2212n = \u220f_{k=t\u2212n+1}^{t} U^T diag(\u03c6\u2032_k),  (5)\n\nwhere \u03c6\u2032_k = \u03c6\u2032(Wxk + Uhk\u22121 + b). The equation above shows that the gradient flow through time heavily depends on the hidden-to-hidden matrix U, but W and xk appear to play a limited role: they only come in through the derivative \u03c6\u2032_k, mixed with Uhk\u22121. On the other hand, the gradient \u2202ht/\u2202ht\u2212n of an MI-RNN is4:\n\n\u2202ht/\u2202ht\u2212n = \u220f_{k=t\u2212n+1}^{t} U^T diag(Wxk) diag(\u03c6\u2032_k),  (6)\n\nwhere \u03c6\u2032_k = \u03c6\u2032(Wxk \u2299 Uhk\u22121 + b). By looking at the gradient, we see that the matrix W and the current input xk are directly involved in the gradient computation by gating the matrix U, and are hence more capable of altering the updates of the learning system. As we show in our experiments, with Wxk directly gating the gradient, the vanishing/exploding gradient problem is alleviated: Wxk dynamically rescales U, making gradient propagation easier compared to regular RNNs.\n\n2If \u03b1 = 0, the Multiplicative Integration will degenerate to the vanilla additive building block.\n3See exact formulations in the Appendix.\n4Here we adopt the simplest formulation of Multiplicative Integration for illustration. In the more general case (Eq. 4), diag(Wxk) in Eq. 6 will become diag(\u03b1 \u2299 Wxk + \u03b21).\n
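The products in Eqs. (5) and (6) can be accumulated numerically to see where diag(Wxk) enters; a sketch, assuming random Gaussian weights, a tanh nonlinearity, and writing each factor as diag(φ′)·diag(Wx)·U (the transpose of the convention above):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
W = rng.normal(scale=0.5, size=(d, d))
U = rng.normal(scale=0.5, size=(d, d))
b = np.zeros(d)
xs = rng.normal(size=(10, d))          # a short input sequence

def jacobian_product(mi):
    """Accumulate prod_k dh_k/dh_{k-1} along the sequence.

    Additive RNN (Eq. 5): each factor is diag(phi'_k) U.
    MI-RNN (Eq. 6):       each factor is diag(phi'_k) diag(W x_k) U,
    so W and x_k gate the recurrent matrix U directly.
    """
    h, J = np.zeros(d), np.eye(d)
    for x in xs:
        wx, uh = W @ x, U @ h
        pre = wx * uh + b if mi else wx + uh + b
        phi_prime = 1.0 - np.tanh(pre) ** 2       # derivative of tanh
        gate = np.diag(wx) if mi else np.eye(d)
        J = np.diag(phi_prime) @ gate @ U @ J
        h = np.tanh(pre)
    return np.linalg.norm(J)

print("additive:", jacobian_product(False), "MI:", jacobian_product(True))
```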
For LSTMs and GRUs with Multiplicative Integration, the gradient propagation properties are more complicated. But in principle, the benefits of the gating effect also persist in these models.\n3 Experiments\nIn all of our experiments, we use the general form of Multiplicative Integration (Eq. 4) for any hidden state/gate computations, unless otherwise specified.\n3.1 Exploratory Experiments\nTo further understand the functionality of Multiplicative Integration, we take a simple RNN for illustration, and perform several exploratory experiments on the character level language modeling task using the Penn-Treebank dataset [11], following the data partition in [12]. The length of the training sequence is 50. All models have a single hidden layer of size 2048, and we use the Adam optimization algorithm [13] with learning rate 1e\u22124. Weights are initialized to samples drawn from uniform[\u22120.02, 0.02]. Performance is evaluated by the bits-per-character (BPC) metric, which is log2 of perplexity.\n3.1.1 Gradient Properties\nTo analyze the gradient flow of the model, we divide the gradient in Eq. 6 into two parts: (1) the gated matrix product U^T diag(Wxk), and (2) the derivative of the nonlinearity \u03c6\u2032. We separately analyze the properties of each term compared to the additive building block. We first focus on the gating effect brought by diag(Wxk). In order to separate out the effect of the nonlinearity, we chose \u03c6 to be the identity map; hence both vanilla-RNN and MI-RNN reduce to linear models, referred to as lin-RNN and lin-MI-RNN.\nFor each model we monitor the log-L2-norm of the gradient log ||\u2202C/\u2202ht||2 (averaged over the training set) after every training epoch, where ht is the hidden state at time step t, and C is the negative log-likelihood of the single character prediction at the final time step (t = 50). Figure 
1 shows the evolution of the gradient norms for small t, i.e., 0, 5, 10, as they better reflect the gradient propagation behaviour. Observe that the norms of lin-MI-RNN (orange) increase rapidly and soon exceed the corresponding norms of lin-RNN by a large margin. The norms of lin-RNN stay close to zero (\u2248 10\u22124) and their changes over time are almost negligible. This observation implies that, with the help of the diag(Wxk) term, the gradient vanishing of lin-MI-RNN is alleviated compared to lin-RNN. The final test BPC (bits-per-character) of lin-MI-RNN is 1.48, which is comparable to a vanilla-RNN with a stabilizing regularizer [14], while lin-RNN performs rather poorly, achieving a test BPC of over 2.\nNext we look into the nonlinearity \u03c6. We chose \u03c6 = tanh for both vanilla-RNN and MI-RNN. Figure 1 (c) and (d) show a comparison of histograms of hidden activations over all time steps on the validation set after training. Interestingly, in (c) for vanilla-RNN, most activations are saturated with values around \u00b11, whereas in (d) for MI-RNN, most activations are non-saturated with values around 0. This has a direct consequence for gradient propagation: non-saturated activations imply that diag(\u03c6\u2032_k) \u2248 1 for \u03c6 = tanh, which can help gradients propagate, whereas saturated activations imply that diag(\u03c6\u2032_k) \u2248 0, resulting in vanishing gradients.\n3.1.2 Scaling Problem\nWhen adding two numbers at different orders of magnitude, the smaller one might be negligible in the sum. However, when multiplying two numbers, the value of the product depends on both, regardless of their scales. This principle also applies when comparing Multiplicative Integration to the additive building block. In this experiment, we test whether Multiplicative Integration is more robust to the scales of weight values. 
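The point about scales can be made with two constant vectors: doubling a small input-side term barely moves a sum dominated by the recurrent term, but it doubles the product. A tiny illustration (the magnitudes 1e-3 and 1.0 are made up for the example):

```python
import numpy as np

wx = np.full(4, 1e-3)   # small input-side pre-activation, e.g. from a one-hot x
uh = np.full(4, 1.0)    # much larger recurrent pre-activation

# Relative change of the fused pre-activation when wx is doubled:
rel_add = np.abs((2 * wx + uh) - (wx + uh)) / np.abs(wx + uh)  # roughly 0.1%
rel_mi = np.abs((2 * wx) * uh - wx * uh) / np.abs(wx * uh)     # exactly 100%

print(rel_add[0], rel_mi[0])
```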
Following the same models as in Section 3.1.1, we first calculated the norms of Wxk and Uhk\u22121 for both vanilla-RNN and MI-RNN for different k after training. We found that in both structures, Wxk is a lot smaller than Uhk\u22121 in magnitude. This might be due to the fact that xk is a one-hot vector, making the number of updates for (columns of) W smaller than that for U. As a result, in vanilla-RNN, the pre-activation term Wxk + Uhk\u22121 is largely controlled by the value of Uhk\u22121, while Wxk becomes rather small. In MI-RNN, on the other hand, the pre-activation term Wxk \u2299 Uhk\u22121 still depends on the values of both Wxk and Uhk\u22121, due to the multiplication.\n\nFigure 1: (a) Curves of log-L2-norm of gradients for lin-RNN (blue) and lin-MI-RNN (orange). The time step t varies over {1, 5, 10}. (b) Validation BPC curves for vanilla-RNN, MI-RNN-simple using Eq. 2, and MI-RNN-general using Eq. 4. (c) Histogram of vanilla-RNN\u2019s hidden activations over the validation set; most activations are saturated. (d) Histogram of MI-RNN\u2019s hidden activations over the validation set; most activations are not saturated.\n\nWe next tried different initializations of W and U to test their sensitivities to the scaling. For each model, we fix the initialization of U to uniform[\u22120.02, 0.02] and initialize W to uniform[\u2212rW, rW] where rW varies in {0.02, 0.1, 0.3, 0.6}. Table 1, top left panel, shows the results. As we increase the scale of W, the performance of the vanilla-RNN improves, suggesting that the model is able to better utilize the input information. On the other hand, MI-RNN is much more robust to different initializations, where the scaling has almost no effect on the final performance.\n\n3.1.3 On Different Choices of the Formulation\nIn our third experiment, we evaluated the performance of different computational building blocks, namely Eq. 1 (vanilla-RNN), Eq. 
2 (MI-RNN-simple) and Eq. 4 (MI-RNN-general)5. From the validation curves in Figure 1 (b), we see that both MI-RNN-simple and MI-RNN-general yield much better performance compared to vanilla-RNN, and MI-RNN-general has a faster convergence speed compared to MI-RNN-simple. We also compared our results to the previously published models in Table 1, bottom left panel, where MI-RNN-general achieves a test BPC of 1.39, which is to our knowledge the best result for RNNs on this task without complex gating/cell mechanisms.\n\n3.2 Character Level Language Modeling\nIn addition to the Penn-Treebank dataset, we also perform character level language modeling on two larger datasets: text86 and Hutter Challenge Wikipedia7. Both of them contain 100M characters from Wikipedia, while text8 has an alphabet size of 27 and Hutter Challenge Wikipedia has an alphabet size of 205. For both datasets, we follow the training protocols in [12] and [1] respectively. We use Adam for optimization with the starting learning rate grid-searched in {0.002, 0.001, 0.0005}. If the validation BPC (bits-per-character) does not decrease for 2 epochs, we halve the learning rate.\nWe implemented Multiplicative Integration on both vanilla-RNN and LSTM, referred to as MI-RNN and MI-LSTM. 
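The exact MI-LSTM and MI-GRU formulations are given in the paper's appendix; as a rough sketch of the drop-in idea, here is a hypothetical MI-GRU step in which every additive pre-activation of a standard GRU is replaced by an Eq. (4) block. All sizes, initializations, and the gating placement below are assumptions of this sketch, not the authors' exact formulation:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 4  # hidden and input sizes (illustrative)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_block():
    """Parameters of one MI building block (Eq. 4)."""
    return {"W": rng.normal(scale=0.1, size=(d, n)),
            "U": rng.normal(scale=0.1, size=(d, d)),
            "alpha": np.ones(d), "beta1": np.ones(d),
            "beta2": np.ones(d), "b": np.zeros(d)}

def mi(p, x, h):
    """Eq. (4) pre-activation: alpha*Wx*Uh + beta1*Uh + beta2*Wx + b."""
    wx, uh = p["W"] @ x, p["U"] @ h
    return p["alpha"] * wx * uh + p["beta1"] * uh + p["beta2"] * wx + p["b"]

blocks = {name: make_block() for name in ("r", "z", "c")}

def mi_gru_step(x, h):
    r = sigmoid(mi(blocks["r"], x, h))        # reset gate
    z = sigmoid(mi(blocks["z"], x, h))        # update gate
    c = np.tanh(mi(blocks["c"], x, r * h))    # candidate state
    return (1 - z) * h + z * c

h = np.zeros(d)
for x in rng.normal(size=(3, n)):
    h = mi_gru_step(x, h)
print(h.shape)
```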
The results for the text8 dataset are shown in Table 1, bottom middle panel. All five models, including some of the previously published models, have the same number of parameters (\u22484M).\n\n5We perform hyper-parameter search for the initialization of {\u03b1, \u03b21, \u03b22, b} in MI-RNN-general.\n6http://mattmahoney.net/dc/textdata\n7http://prize.hutter1.net/\n\nrW                          0.02    0.1     0.3     0.6     std\nRNN                         1.69    1.65    1.57    1.54    0.06\nMI-RNN                      1.39    1.40    1.40    1.41    0.008\n\nWSJ Corpus                  CER     WER\nDRNN+CTCbeamsearch [15]     10.0    14.1\nEncoder-Decoder [16]        6.4     9.3\nLSTM+CTCbeamsearch [17]     9.2     8.7\nEesen [18]                  -       7.3\nLSTM+CTC+WFST (ours)        6.5     8.7\nMI-LSTM+CTC+WFST (ours)     6.0     8.2\n\nPenn-Treebank               BPC\nRNN [12]                    1.42\nHF-MRNN [12]                1.41\nRNN+stabilization [14]      1.48\nMI-RNN (ours)               1.39\nlinear MI-RNN (ours)        1.48\n\ntext8                       BPC\nRNN+smoothReLu [19]         1.55\nHF-MRNN [12]                1.54\nMI-RNN (ours)               1.52\nLSTM (ours)                 1.51\nMI-LSTM (ours)              1.44\n\nHutter Wikipedia            BPC\nstacked-LSTM [20]           1.67\nGF-LSTM [1]                 1.58\ngrid-LSTM [2]               1.47\nMI-LSTM (ours)              1.44\n\nTable 1: Top left: test BPCs and the standard deviation of models with different scales of weight initializations. Top right: test CERs and WERs on the WSJ corpus. Bottom left: test BPCs on the character level Penn-Treebank dataset. Bottom middle: test BPCs on the character level text8 dataset. Bottom right: test BPCs on the character level Hutter Prize Wikipedia dataset.\n
For RNNs without complex gating/cell mechanisms (the first three rows), our MI-RNN (with {\u03b1, \u03b21, \u03b22, b} initialized as {2, 0.5, 0.5, 0}) performs the best, and our MI-LSTM (with {\u03b1, \u03b21, \u03b22, b} initialized as {1, 0.5, 0.5, 0}) outperforms all other models by a large margin8.\nOn the Hutter Challenge Wikipedia dataset, we compare our MI-LSTM (single layer with 2048 units, \u224817M parameters, with {\u03b1, \u03b21, \u03b22, b} initialized as {1, 1, 1, 0}) to the previous stacked LSTM (7 layers, \u224827M) [20], GF-LSTM (5 layers, \u224820M) [1], and grid-LSTM (6 layers, \u224817M) [2]. Table 1, bottom right panel, shows the results. Despite its simple structure compared to the sophisticated connection designs in GF-LSTM and grid-LSTM, our MI-LSTM outperforms all other models and achieves a new state-of-the-art on this task.\n\n3.3 Speech Recognition\n\nWe next evaluate our models on the Wall Street Journal (WSJ) corpus (available as LDC corpus LDC93S6B and LDC94S13B), where we use the full 81 hour set \u201csi284\u201d for training, set \u201cdev93\u201d for validation and set \u201ceval92\u201d for test. We follow the same data preparation process and model setting as in [18], and we use 59 characters as the targets for the acoustic modelling. Decoding is done with CTC [21] based weighted finite-state transducers (WFSTs) [22], as proposed by [18].\nOur model (referred to as MI-LSTM+CTC+WFST) consists of 4 bidirectional MI-LSTM layers, each with 320 units for each direction. CTC is performed on top to resolve the alignment issue in speech transcription. For comparison, we also train a baseline model (referred to as LSTM+CTC+WFST) of the same size but using vanilla LSTM. Adam with learning rate 0.0001 is used for optimization, and Gaussian weight noise with zero mean and 0.05 standard deviation is injected for regularization. 
We evaluate our models on the character error rate (CER) without a language model and on the word error rate (WER) with an extended trigram language model.\nTable 1, top right panel, shows that MI-LSTM+CTC+WFST achieves quite good results on both CER and WER compared to recent works, and it gives a clear improvement over the baseline model. Note that we did not conduct a careful hyper-parameter search on this task, hence one could potentially obtain better results with better decoding schemes and regularization techniques.\n\n3.4 Learning Skip-Thought Vectors\n\nNext, we evaluate our Multiplicative Integration on the Skip-Thought model of [23]. Skip-Thought is an encoder-decoder model that attempts to learn generic, distributed sentence representations. The model produces sentence representations that are robust and perform well in practice, achieving excellent results across many different NLP tasks. The model was trained on the BookCorpus dataset, which consists of 11,038 books with 74,004,228 sentences. Not surprisingly, a single pass through\n\n8[7] reports better results, but they use much larger models (\u224816M) which are not directly comparable.\n\nSemantic-Relatedness        r       \u03c1       MSE\nuni-skip [23]               0.8477  0.7780  0.2872\nbi-skip [23]                0.8405  0.7696  0.2995\ncombine-skip [23]           0.8584  0.7916  0.2687\nuni-skip (ours)             0.8436  0.7735  0.2946\nMI-uni-skip (ours)          0.8588  0.7952  0.2679\n\nParaphrase detection        Acc     F1\nuni-skip [23]               73.0    81.9\nbi-skip [23]                71.2    81.2\ncombine-skip [23]           73.0    82.0\nuni-skip (ours)             74.0    81.9\nMI-uni-skip (ours)          74.0    82.1\n\nClassification              MR      CR      SUBJ    MPQA\nuni-skip [23]               75.5    79.3    92.1    86.9\nbi-skip [23]                73.9    77.9    92.5    83.3\ncombine-skip [23]           76.5    80.1    93.6    87.1\nuni-skip (ours)             75.9    80.1    93.0    87.0\nMI-uni-skip (ours)          77.9    82.3    93.3    88.1\n\n
Attentive Reader                    Val. Err.\nLSTM [7]                            0.5033\nBN-LSTM [7]                         0.4951\nBN-everywhere [7]                   0.5000\nLSTM (ours)                         0.5053\nMI-LSTM (ours)                      0.4721\nMI-LSTM+BN (ours)                   0.4685\nMI-LSTM+BN-everywhere (ours)        0.4644\n\nTable 2: Top left: skip-thought+MI on the Semantic-Relatedness task. Top right: skip-thought+MI on the Paraphrase Detection task. Bottom left: skip-thought+MI on four different classification tasks. Bottom right: Multiplicative Integration (with batch normalization) on the Teaching Machines to Read and Comprehend task.\n\nthe training data can take up to a week on a high-end GPU (as reported in [23]). Such training speed largely limits one's ability to perform a careful hyper-parameter search. However, with Multiplicative Integration, not only is the training time shortened by a factor of two, but the final performance is also significantly improved.\nWe exactly follow the authors\u2019 Theano implementation of the skip-thought model9: the encoder and decoder are single-layer GRUs with a hidden-layer size of 2400; all recurrent matrices adopt orthogonal initialization, while non-recurrent weights are initialized from a uniform distribution. Adam is used for optimization. We implemented Multiplicative Integration only for the encoder GRU (embedding MI into the decoder did not provide any substantial gains). We refer to our model as MI-uni-skip, with {\u03b1, \u03b21, \u03b22, b} initialized as {1, 1, 1, 0}. We also train a baseline model of the same size, referred to as uni-skip (ours), which essentially reproduces the original model of [23].\nDuring the course of training, we evaluated the skip-thought vectors on the semantic relatedness task, using the SICK dataset, every 2500 updates for both MI-uni-skip and the baseline model (each iteration processes a mini-batch of size 64). The results are shown in Figure 2a. Note that MI-uni-skip significantly outperforms the baseline, not only in terms of speed of convergence, but also in terms of final performance. 
At around 125k updates, MI-uni-skip already exceeds the best performance achieved by the baseline, which takes about twice the number of updates.\nWe also evaluated both models after one week of training, reporting the best results on six of the eight tasks evaluated in [23]: the semantic relatedness task on the SICK dataset, the paraphrase detection task on the Microsoft Research Paraphrase Corpus, and four classification benchmarks: movie review sentiment (MR), customer product reviews (CR), subjectivity/objectivity classification (SUBJ), and opinion polarity (MPQA). We also compared our results with the results reported for three models in the original skip-thought paper: uni-skip, bi-skip, and combine-skip. Uni-skip is the same model as our baseline, bi-skip is a bidirectional model of the same size, and combine-skip takes the concatenation of the vectors from uni-skip and bi-skip to form a 4800-dimensional vector for task evaluation. Table 2 shows that MI-uni-skip dominates across all the tasks. Not only does it achieve higher performance than the baseline model, but in many cases it also outperforms the combine-skip model, which has twice the number of dimensions. Clearly, Multiplicative Integration provides a faster and better way to train a large-scale Skip-Thought model.\n\n9https://github.com/ryankiros/skip-thoughts\n\nFigure 2: (a) MSE curves of uni-skip (ours) and MI-uni-skip (ours) on the semantic relatedness task on the SICK dataset. MI-uni-skip significantly outperforms the baseline uni-skip. (b) Validation error curves for the attentive reader models. There is a clear margin between models with and without MI.\n\n3.5 Teaching Machines to Read and Comprehend\n\nIn our last experiment, we show that the use of Multiplicative Integration can be combined with other techniques for training RNNs, and that the advantages of using MI still persist. Recently, [7] introduced Recurrent Batch-Normalization. They evaluated their proposed technique on a unidirectional Attentive Reader Model [24] for the question answering task using the CNN corpus10. To test our approach, we evaluated the following four models: 1. a vanilla LSTM attentive reader model with a single hidden layer of size 240 (same as [7]) as our baseline, referred to as LSTM (ours); 2. a Multiplicative Integration LSTM with a single hidden layer of size 240, referred to as MI-LSTM; 3. MI-LSTM with Batch-Norm, referred to as MI-LSTM+BN; 4. MI-LSTM with Batch-Norm everywhere (as detailed in [7]), referred to as MI-LSTM+BN-everywhere. We compared our models to the results reported in [7] (referred to as LSTM, BN-LSTM and BN-LSTM everywhere)11.\nFor all MI models, {\u03b1, \u03b21, \u03b22, b} were initialized to {1, 1, 1, 0}. We follow the experimental protocol of [7]12 and use exactly the same settings as theirs, except that we remove gradient clipping for the MI-LSTMs. Figure 2b shows validation curves of the baseline (LSTM), MI-LSTM, BN-LSTM, and MI-LSTM+BN, and the final validation errors of all models are reported in Table 2, bottom right panel. Clearly, using Multiplicative Integration results in improved model performance regardless of whether Batch-Norm is used. However, the combination of MI and Batch-Norm provides the best performance and the fastest speed of convergence. This shows the general applicability of Multiplicative Integration when combining it with other optimization techniques.\n\n4 Relationship to Previous Models\n4.1 Relationship to Hidden Markov Models\n\nOne can show that under certain constraints, MI-RNN is effectively implementing the forward algorithm of the Hidden Markov Model (HMM). A direct mapping can be constructed as follows (see [25] for a similar derivation). Let U \u2208 Rm\u00d7m be the state transition probability matrix with Uij = Pr[ht+1 = i|ht = j], and let W \u2208 Rm\u00d7n be the observation probability matrix with Wij = Pr[xt = j|ht = i]. 
When xt is a one-hot vector (e.g., in many of the language modelling tasks), multiplying it by W effectively chooses a column of the observation matrix. Namely, if the jth entry of xt is one, then (Wxt)i = Pr[xt = j|ht = i], i.e., Wxt is the vector of emission probabilities of the observed symbol under each hidden state. Let h0 = Pr[h0] be the initial state distribution and let {ht}t\u22651 be the alpha values in the forward algorithm of the HMM, i.e., ht = Pr[x1, ..., xt, ht]. Then Uht = Pr[x1, ..., xt, ht+1]. Thus ht+1 = Wxt+1 \u2299 Uht = Pr[xt+1|ht+1] \u00b7 Pr[x1, ..., xt, ht+1] = Pr[x1, ..., xt+1, ht+1]. To exactly implement the forward algorithm using Multiplicative Integration, the matrices W and U have to be probability matrices, and xt needs to be a one-hot vector. The function \u03c6 needs to be linear, and we drop all the bias terms. Therefore, an RNN with Multiplicative Integration can be seen as a nonlinear extension of HMMs. The extra freedom in parameter values and the nonlinearity make the model more flexible compared to HMMs.\n\n4.2 Relations to Second Order RNNs and Multiplicative RNNs\n\nMI-RNN is related to the second order RNN [6] and the multiplicative RNN (MRNN) [9]. 
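The correspondence with the HMM forward algorithm in Section 4.1 can be verified numerically: with stochastic W and U, one-hot inputs, a linear φ, and no biases, the MI update reproduces the forward recursion. A small NumPy check, with sizes and random parameters as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, T = 3, 4, 6                   # hidden states, observation symbols, length

A = rng.random((m, m))
A /= A.sum(axis=0)                  # A[i, j] = Pr[h_{t+1} = i | h_t = j]
B = rng.random((m, n))
B /= B.sum(axis=1, keepdims=True)   # B[i, j] = Pr[x_t = j | h_t = i]
pi = np.full(m, 1.0 / m)            # initial state distribution
obs = rng.integers(0, n, size=T)    # a random observation sequence

# MI update with linear phi and no biases: h <- Wx * Uh (elementwise),
# where x is one-hot, so Wx is just the column B[:, o].
h = pi.copy()
for o in obs:
    h = B[:, o] * (A @ h)

# Explicit HMM forward recursion for alpha_t(i) = Pr[x_1..x_t, h_t = i]:
alpha = pi.copy()
for o in obs:
    alpha = np.array([B[i, o] * sum(A[i, j] * alpha[j] for j in range(m))
                      for i in range(m)])

print(np.allclose(h, alpha), h.sum())  # h.sum() is Pr[x_1..x_T]
```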
10Note that [7] used a truncated version of the original dataset in order to save computation.\n11Learning curves and the final result numbers were obtained by email correspondence with the authors of [7].\n12https://github.com/cooijmanstim/recurrent-batch-normalization.git.\n\nWe first describe the similarities with these two models.\nThe second order RNN involves a second order term st in a vanilla-RNN, where the ith element st,i is computed by the bilinear form st,i = xt^T T^(i) ht\u22121, with T^(i) \u2208 R^{n\u00d7m} (1 \u2264 i \u2264 m) the ith slice of a tensor T \u2208 R^{m\u00d7n\u00d7m}. Multiplicative Integration also involves a second order term st = \u03b1 \u2299 Wxt \u2299 Uht\u22121, but in our case st,i = \u03b1i(wi \u00b7 xt)(ui \u00b7 ht\u22121) = xt^T (\u03b1i wi \u2297 ui) ht\u22121, where wi and ui are the ith rows of W and U, and \u03b1i is the ith element of \u03b1. Note that the outer product \u03b1i wi \u2297 ui is a rank-1 matrix. The Multiplicative RNN is also a second order RNN, but one which approximates T by the tensor decomposition \u2211i xt^(i) T^(i) = P diag(Vxt) Q. For MI-RNN, we can also think of the second order term as a tensor decomposition: \u03b1 \u2299 Wxt \u2299 Uht\u22121 = U(xt) ht\u22121 = [diag(\u03b1) diag(Wxt) U] ht\u22121.\nThere are however several differences that make MI a favourable model: (1) Simpler parametrization: MI uses a rank-1 approximation compared to the second order RNN, and a diagonal approximation compared to the Multiplicative RNN. Moreover, MI-RNN shares parameters across the first and second order terms, whereas the other two models do not. 
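The rank-1 structure of MI's second order term, and the resulting parameter savings over a full tensor T ∈ R^{m×n×m}, can be checked directly (the sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 6, 8                       # input and state dimensions (illustrative)
W = rng.normal(size=(m, n))
U = rng.normal(size=(m, m))
alpha = rng.normal(size=m)
x, h = rng.normal(size=n), rng.normal(size=m)

# MI's second order term st = alpha * Wx * Uh (elementwise) ...
st = alpha * (W @ x) * (U @ h)

# ... has, per coordinate i, the rank-1 "tensor slice" alpha_i (w_i outer u_i):
for i in range(m):
    slice_i = alpha[i] * np.outer(W[i], U[i])   # shape (n, m), rank 1
    assert np.isclose(st[i], x @ slice_i @ h)

# Parameter counts for the second order terms alone:
full_tensor = m * n * m          # general second order RNN: T in R^{m x n x m}
mi_count = m * n + m * m + m     # W, U, alpha -- and W, U are shared
                                 # with the first order terms
print(full_tensor, mi_count)
```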
As a result, the number of parameters is largely reduced, which makes our model more practical for large scale problems while helping to avoid overfitting. (2) Easier optimization: in tensor decomposition methods, the product of three different (low-rank) matrices generally makes optimization hard [9]. However, the optimization problem becomes easier in MI, as discussed in Sections 2 and 3. (3) General structural design vs. vanilla-RNN design: Multiplicative Integration can be easily embedded in many other RNN structures, e.g. LSTMs and GRUs, whereas the second order RNN and MRNN present a very specific design for modifying vanilla-RNNs.\nMoreover, we also compared MI-RNN\u2019s performance to the previously published HF-MRNN results (a Multiplicative RNN trained by the Hessian-free method) in Table 1, bottom left and bottom middle panels, on the Penn-Treebank and text8 datasets. One can see that MI-RNN outperforms HF-MRNN on both tasks.\n\n4.3 General Multiplicative Integration\n\nMultiplicative Integration can be viewed as a general way of combining information flows from two different sources. In particular, [26] proposed the ladder network, which achieves promising results on semi-supervised learning. In their model, the lateral connections and the backward connections are combined via a \u201ccombinator\u201d function based on a Hadamard product. The performance degrades severely without this product, as empirically shown by [27]. [28] explored neural embedding approaches in knowledge bases by formulating relations as bilinear and/or linear mapping functions, and compared a variety of embedding models on the link prediction task. Surprisingly, the best result among all bilinear functions is achieved by the simple weighted Hadamard product. 
They further carefully compare the multiplicative and additive interactions and show that the multiplicative interaction dominates the additive one.

5 Conclusion

In this paper we proposed Multiplicative Integration (MI), a simple structural change that uses the Hadamard product to combine information flow in recurrent neural networks. MI can be easily integrated into many popular RNN models, including LSTMs and GRUs, while introducing almost no extra parameters. Indeed, implementing MI requires almost no extra work beyond implementing the underlying RNN models. We also show that MI achieves state-of-the-art performance on four different tasks and 11 datasets of varying sizes and scales. We believe that Multiplicative Integration can become a default building block for training various types of RNN models.

Acknowledgments

The authors acknowledge the following agencies for funding and support: NSERC, Canada Research Chairs, CIFAR, Calcul Quebec, Compute Canada, Disney Research and ONR Grant N000141310721. The authors thank the developers of Theano [29] and Keras [30], and also thank Jimmy Ba for many thought-provoking discussions.

References
[1] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. arXiv preprint arXiv:1502.02367, 2015.
[2] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.
[3] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan Salakhutdinov, and Yoshua Bengio. Architectural complexity measures of recurrent neural networks. arXiv preprint arXiv:1602.08210, 2016.
[4] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[5] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio.
Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[6] Mark W Goudreau, C Lee Giles, Srimat T Chakradhar, and D Chen. First-order versus second-order single-layer recurrent neural networks. IEEE Transactions on Neural Networks, 5(3):511–513, 1994.
[7] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.
[8] LE Baum and JA Eagon. An inequality with application to statistical estimation for probabilistic functions of markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73:360–363, 1967.
[9] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[11] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313–330, 1993.
[12] Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, and Stefan Kombrink. Subword language modeling with neural networks. Preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf), 2012.
[13] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] David Krueger and Roland Memisevic. Regularizing rnns by stabilizing activations. arXiv preprint arXiv:1511.08400, 2015.
[15] Awni Y Hannun, Andrew L Maas, Daniel Jurafsky, and Andrew Y Ng. First-pass large vocabulary continuous speech recognition using bi-directional recurrent dnns.
arXiv preprint arXiv:1408.2873, 2014.\n[16] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end\n\nattention-based large vocabulary speech recognition. arXiv preprint arXiv:1508.04395, 2015.\n\n[17] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks.\nIn Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1764\u20131772,\n2014.\n\n[18] Yajie Miao, Mohammad Gowayyed, and Florian Metze. Eesen: End-to-end speech recognition using deep\n\nrnn models and wfst-based decoding. arXiv preprint arXiv:1507.08240, 2015.\n\n[19] Marius Pachitariu and Maneesh Sahani. Regularization and nonlinearities for neural language models:\n\nwhen are they needed? arXiv preprint arXiv:1301.5650, 2013.\n\n[20] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.\n[21] Alex Graves, Santiago Fern\u00e1ndez, Faustino Gomez, and J\u00fcrgen Schmidhuber. Connectionist temporal\nclassi\ufb01cation: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the\n23rd international conference on Machine learning, pages 369\u2013376. ACM, 2006.\n\n[22] Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted \ufb01nite-state transducers in speech recogni-\n\ntion. Computer Speech & Language, 16(1):69\u201388, 2002.\n\n[23] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba,\nand Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages\n3276\u20133284, 2015.\n\n[24] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman,\nand Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information\nProcessing Systems, pages 1684\u20131692, 2015.\n\n[25] T. Wessels and C. W. Omlin. Re\ufb01ning hidden markov models with recurrent neural networks. 
In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), volume 2, pages 271–276, 2000.
[26] Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, and Tapani Raiko. Semi-supervised learning with ladder networks. arXiv preprint arXiv:1507.02672, 2015.
[27] Mohammad Pezeshki, Linxi Fan, Philemon Brakel, Aaron Courville, and Yoshua Bengio. Deconstructing the ladder network architecture. arXiv preprint arXiv:1511.06430, 2015.
[28] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
[29] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, et al. Theano: A python framework for fast computation of mathematical expressions, 2016.
[30] François Chollet. Keras. GitHub repository: https://github.com/fchollet/keras, 2015.