{"title": "Full-Capacity Unitary Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4880, "page_last": 4888, "abstract": "Recurrent neural networks are powerful models for processing sequential data, but they are generally plagued by vanishing and exploding gradient problems. Unitary recurrent neural networks (uRNNs), which use unitary recurrence matrices, have recently been proposed as a means to avoid these issues. However, in previous experiments, the recurrence matrices were restricted to be a product of parameterized unitary matrices, and an open question remains: when does such a parameterization fail to represent all unitary matrices, and how does this restricted representational capacity limit what can be learned? To address this question, we propose full-capacity uRNNs that optimize their recurrence matrix over all unitary matrices, leading to significantly improved performance over uRNNs that use a restricted-capacity recurrence matrix. Our contribution consists of two main components. First, we provide a theoretical argument to determine if a unitary parameterization has restricted capacity. Using this argument, we show that a recently proposed unitary parameterization has restricted capacity for hidden state dimension greater than 7. Second,we show how a complete, full-capacity unitary recurrence matrix can be optimized over the differentiable manifold of unitary matrices. The resulting multiplicative gradient step is very simple and does not require gradient clipping or learning rate adaptation. We confirm the utility of our claims by empirically evaluating our new full-capacity uRNNs on both synthetic and natural data, achieving superior performance compared to both LSTMs and the original restricted-capacity uRNNs.", "full_text": "Full-Capacity Unitary Recurrent Neural Networks\n\nScott Wisdom1\u2217, Thomas Powers1\u2217, John R. 
Hershey2, Jonathan Le Roux2, and Les Atlas1\n\n1 Department of Electrical Engineering, University of Washington\n\n{swisdom, tcpowers, atlas}@uw.edu\n\n2 Mitsubishi Electric Research Laboratories (MERL)\n\n{hershey, leroux}@merl.com\n\nAbstract\n\nRecurrent neural networks are powerful models for processing sequential data,\nbut they are generally plagued by vanishing and exploding gradient problems.\nUnitary recurrent neural networks (uRNNs), which use unitary recurrence matri-\nces, have recently been proposed as a means to avoid these issues. However, in\nprevious experiments, the recurrence matrices were restricted to be a product of\nparameterized unitary matrices, and an open question remains: when does such a\nparameterization fail to represent all unitary matrices, and how does this restricted\nrepresentational capacity limit what can be learned? To address this question,\nwe propose full-capacity uRNNs that optimize their recurrence matrix over all\nunitary matrices, leading to signi\ufb01cantly improved performance over uRNNs that\nuse a restricted-capacity recurrence matrix. Our contribution consists of two main\ncomponents. First, we provide a theoretical argument to determine if a unitary\nparameterization has restricted capacity. Using this argument, we show that a\nrecently proposed unitary parameterization has restricted capacity for hidden state\ndimension greater than 7. Second, we show how a complete, full-capacity unitary\nrecurrence matrix can be optimized over the differentiable manifold of unitary\nmatrices. The resulting multiplicative gradient step is very simple and does not\nrequire gradient clipping or learning rate adaptation. 
We con\ufb01rm the utility of our\nclaims by empirically evaluating our new full-capacity uRNNs on both synthetic\nand natural data, achieving superior performance compared to both LSTMs and\nthe original restricted-capacity uRNNs.\n\n1\n\nIntroduction\n\nDeep feed-forward and recurrent neural networks have been shown to be remarkably effective in a\nwide variety of problems. A primary dif\ufb01culty in training using gradient-based methods has been\nthe so-called vanishing or exploding gradient problem, in which the instability of the gradients over\nmultiple layers can impede learning [1, 2]. This problem is particularly keen for recurrent networks,\nsince the repeated use of the recurrent weight matrix can magnify any instability.\nThis problem has been addressed in the past by various means, including gradient clipping [3],\nusing orthogonal matrices for initialization of the recurrence matrix [4, 5], or by using pioneering\narchitectures such as long short-term memory (LSTM) recurrent networks [6] or gated recurrent\nunits [7]. Recently, several innovative architectures have been introduced to improve information\n\ufb02ow in a network: residual networks, which directly pass information from previous layers up in\na feed-forward network [8], and attention networks, which allow a recurrent network to access\npast activations [9]. The idea of using a unitary recurrent weight matrix was introduced so that the\ngradients are inherently stable and do not vanish or explode [10]. The resulting unitary recurrent\n\n\u2217Equal contribution\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fneural network (uRNN) is complex-valued and uses a complex form of the recti\ufb01ed linear activation\nfunction. 
However, this idea was investigated using, as we show, a potentially restricted form of\nunitary matrices.\nThe two main components of our contribution can be summarized as follows:\n1) We provide a theoretical argument to determine the smallest dimension N for which any parame-\nterization of the unitary recurrence matrix does not cover the entire set of all unitary matrices. The\nargument relies on counting real-valued parameters and using Sard\u2019s theorem to show that the smooth\nmap from these parameters to the unitary manifold is not onto. Thus, we can show that a previously\nproposed parameterization [10] cannot represent all unitary matrices larger than 7 \u00d7 7. Hence, such a\nparameterization results in what we refer to as a restricted-capacity unitary recurrence matrix.\n2) To overcome the limitations of restricted-capacity parameterizations, we propose a new method for\nstochastic gradient descent for training the unitary recurrence matrix, which constrains the gradient to\nlie on the differentiable manifold of unitary matrices. This approach allows us to directly optimize a\ncomplete, or full-capacity, unitary matrix. Neither restricted-capacity nor full-capacity unitary matrix\noptimization requires gradient clipping. Furthermore, full-capacity optimization still achieves good\nresults without adaptation of the learning rate during training.\nTo test the limitations of a restricted-capacity representation and to con\ufb01rm that our full-capacity\nuRNN does have practical implications, we test restricted-capacity and full-capacity uRNNs on\nboth synthetic and natural data tasks. These tasks include synthetic system identi\ufb01cation, long-term\nmemorization, frame-to-frame prediction of speech spectra, and pixel-by-pixel classi\ufb01cation of\nhandwritten digits. 
Our proposed full-capacity uRNNs generally achieve equivalent or superior\nperformance on synthetic and natural data compared to both LSTMs [6] and the original restricted-\ncapacity uRNNs [10].\nIn the next section, we give an overview of unitary recurrent neural networks. Section 3 presents\nour \ufb01rst contribution: the theoretical argument to determine if any unitary parameterization has\nrestricted capacity. Section 4 describes our second contribution, where we show how to optimize a\nfull-capacity unitary matrix. We con\ufb01rm our results with simulated and natural data in Section 5 and\npresent our conclusions in Section 6.\n\n2 Unitary recurrent neural networks\n\nThe uRNN proposed by Arjovsky et al. [10] consists of the following nonlinear dynamical system\nthat has real- or complex-valued inputs xt of dimension M, complex-valued hidden states ht of\ndimension N, and real- or complex-valued outputs yt of dimension L:\n\nht = \u03c3b(Wht\u22121 + Vxt),\nyt = Uht + c,\n\n(1)\n\nwhere yt = Re{Uht + c} if the outputs yt are real-valued. The element-wise nonlinearity \u03c3b is\n\n[\u03c3b(z)]i = (|zi| + bi) zi/|zi|, if |zi| + bi > 0; 0, otherwise.\n\n(2)\n\nNote that this non-linearity consists of a soft-thresholding of the magnitude using the bias vector\nb. Hard-thresholding would instead set the output of \u03c3b to zi if |zi| + bi > 0. The parameters of the uRNN\nare as follows: W \u2208 U(N), unitary hidden state transition matrix; V \u2208 CN\u00d7M, input-to-hidden\ntransformation; b \u2208 RN, nonlinearity bias; U \u2208 CL\u00d7N, hidden-to-output transformation; and\nc \u2208 CL, output bias.\nArjovsky et al. [10] propose the following parameterization of the unitary matrix W:\n\nWu(\u03b8u) = D3R2F^\u22121D2PR1FD1,\n\n(3)\n\nwhere D are diagonal unitary matrices, R are Householder re\ufb02ection matrices [11], F is a discrete\nFourier transform (DFT) matrix, and P is a permutation matrix. 
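Before detailing this parameterization further, the recurrence (1) and soft-thresholding nonlinearity (2) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' Theano implementation; the random unitary W below (obtained via a QR decomposition) stands in for the parameterized recurrence matrix purely for demonstration.

```python
import numpy as np

def modrelu(z, b):
    """Nonlinearity (2): scale each complex entry's magnitude by
    max(|z_i| + b_i, 0) while preserving its phase."""
    mag = np.abs(z)
    scale = np.maximum(mag + b, 0.0) / (mag + 1e-12)  # guard against |z_i| = 0
    return scale * z

def urnn_step(h_prev, x, W, V, b):
    """One step of the hidden-state recurrence (1): h_t = sigma_b(W h_{t-1} + V x_t)."""
    return modrelu(W @ h_prev + V @ x, b)

# Demonstration with a random unitary W (QR of a random complex matrix).
rng = np.random.default_rng(0)
N, M = 8, 3
W, _ = np.linalg.qr(rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N)))
V = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
b = -0.1 * np.ones(N)                 # negative bias, as in Section 5.1.1
h = np.zeros(N, dtype=complex)        # zero initial state
h = urnn_step(h, rng.standard_normal(M), W, V, b)
assert np.allclose(W.conj().T @ W, np.eye(N))  # W is unitary
```

Entries whose biased magnitude |zi| + bi is non-positive are zeroed, which is the soft-thresholding behavior described above.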
The resulting matrix Wu is unitary\nbecause all its component matrices are unitary. This decomposition is ef\ufb01cient because diagonal,\nre\ufb02ection, and permutation matrices are O(N) to compute, and DFTs can be computed ef\ufb01ciently in\nO(N log N) time using the fast Fourier transform (FFT). The parameter vector \u03b8u consists of 7N\nreal-valued parameters: N parameters for each of the 3 diagonal matrices, where Di,i = ej\u03b8i, and 2N\nparameters for each of the 2 Householder re\ufb02ection matrices, which are the real and imaginary values of\nthe complex re\ufb02ection vectors ui: Ri = I \u2212 2 uiui^H / \u27e8ui, ui\u27e9.\n\n3 Estimating the representation capacity of structured unitary matrices\n\nIn this section, we state and prove a theorem that can be used to determine when any particular unitary\nparameterization does not have capacity to represent all unitary matrices. As an application of this\ntheorem, we show that the parameterization (3) does not have the capacity to cover all N \u00d7 N unitary\nmatrices for N > 7. First, we establish an upper bound on the number of real-valued parameters\nrequired to represent any N \u00d7 N unitary matrix. Then, we state and prove our theorem.\nLemma 3.1 The set of all unitary matrices is a manifold of dimension N^2.\n\nProof: The set of all unitary matrices is the well-known unitary Lie group U(N) [12, \u00a73.4]. A Lie\ngroup identi\ufb01es group elements with points on a differentiable manifold [12, \u00a72.2]. The dimension\nof the manifold is equal to the dimension of the Lie algebra u, which is a vector space that is the\ntangent space at the identity element [12, \u00a74.5]. For U(N), the Lie algebra consists of all skew-\nHermitian matrices A [12, \u00a75.4]. A skew-Hermitian matrix is any A \u2208 CN\u00d7N such that A = \u2212AH,\nwhere (\u00b7)H is the conjugate transpose. To determine the dimension of U(N), we can determine the\ndimension of u. 
Because of the skew-Hermitian constraint, the diagonal elements of A are purely\nimaginary, which corresponds to N real-valued parameters. Also, since Ai,j = \u2212A\u2217j,i, the upper and\nlower triangular parts of A are parameterized by N(N \u2212 1)/2 complex numbers, which corresponds to\nan additional N^2 \u2212 N real parameters. Thus, U(N) is a manifold of dimension N^2.\nTheorem 3.2 If a family of N \u00d7 N unitary matrices is parameterized by P real-valued parameters\nfor P < N^2, then it cannot contain all N \u00d7 N unitary matrices.\nProof: We consider a family of unitary matrices that is parameterized by P real-valued parameters\nthrough a smooth map g : P(P) \u2192 U(N^2) from the space of parameters P(P) to the space of all\nunitary matrices U(N^2). The space P(P) of parameters is considered as a P-dimensional manifold,\nwhile the space U(N^2) of all unitary matrices is an N^2-dimensional manifold according to Lemma 3.1.\nThen, if P < N^2, Sard\u2019s theorem [13] implies that the image g(P) of g is of measure zero in U(N^2),\nand in particular g is not onto. Since g is not onto, there must exist a unitary matrix W \u2208 U(N^2)\nfor which there is no corresponding input P \u2208 P(P) such that W = g(P). Thus, if P is such that\nP < N^2, the manifold P(P) cannot represent all unitary matrices in U(N^2).\nWe now apply Theorem 3.2 to the parameterization (3). Note that the parameterization (3) has\nP = 7N real-valued parameters. If we solve for N in 7N < N^2, we get N > 7. Thus, the\nparameterization (3) cannot represent all unitary matrices for dimension N > 7.\n\n4 Optimizing full-capacity unitary matrices on the Stiefel manifold\n\nIn this section, we show how to get around the limitations of restricted-capacity parameterizations\nand directly optimize a full-capacity unitary matrix. We consider the Stiefel manifold of all N \u00d7 N\ncomplex-valued matrices whose columns are N orthonormal vectors in CN [14]. 
Mathematically,\nthe Stiefel manifold is de\ufb01ned as\n\nVN(CN) = {W \u2208 CN\u00d7N : WHW = IN\u00d7N}.\n\n(4)\n\nFor any W \u2208 VN(CN), any matrix Z in the tangent space TWVN(CN) of the Stiefel manifold\nsatis\ufb01es ZHW + WHZ = 0 [14]. The Stiefel manifold becomes a Riemannian manifold when\nits tangent space is equipped with an inner product. Tagare [14] suggests using the canonical inner\nproduct, given by\n\n\u27e8Z1, Z2\u27e9c = tr(Z1H(I \u2212 (1/2)WWH)Z2).\n\n(5)\n\nUnder this canonical inner product on the tangent space, the gradient in the Stiefel manifold of the loss\nfunction f with respect to the matrix W is AW, where A = GHW \u2212 WHG is a skew-Hermitian\nmatrix and G, with Gi,j = \u03b4f/\u03b4Wi,j, is the usual gradient of the loss function f with respect to the matrix\nW [14]. Using these facts, Tagare [14] suggests a descent curve along the Stiefel manifold at training\niteration k given by the matrix product of the Cayley transformation of A(k) with the current solution\nW(k):\n\nY(k)(\u03bb) = (I + (\u03bb/2)A(k))^\u22121 (I \u2212 (\u03bb/2)A(k)) W(k),\n\n(6)\n\nwhere \u03bb is a learning rate and A(k) = G(k)HW(k) \u2212 W(k)HG(k). Gradient descent proceeds by\nperforming updates W(k+1) = Y(k)(\u03bb). Tagare [14] suggests an Armijo-Wolfe search along the\ncurve to adapt \u03bb, but such a procedure would be expensive for neural network optimization since it\nrequires multiple evaluations of the forward model and gradients. We found that simply using a \ufb01xed\nlearning rate \u03bb often works well. Also, RMSprop-style scaling of the gradient G(k) by a running\naverage of the previous gradients\u2019 norms [15] before applying the multiplicative step (6) can improve\nconvergence. 
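As a concrete check, the multiplicative step (6) can be sketched in NumPy (an illustrative sketch, not the authors' Theano code). Because A is skew-Hermitian, its Cayley transform is unitary, so the update maps a unitary W to a unitary W for any gradient G, with no clipping or re-orthogonalization:

```python
import numpy as np

def stiefel_update(W, G, lr):
    """One multiplicative gradient step along the Stiefel manifold, eq. (6):
    W_next = (I + lr/2 A)^(-1) (I - lr/2 A) W, where A = G^H W - W^H G is
    skew-Hermitian and G is the ordinary gradient of the loss w.r.t. W."""
    A = G.conj().T @ W - W.conj().T @ G          # A = -A^H (skew-Hermitian)
    I = np.eye(W.shape[0])
    # solve() applies the inverse Cayley factor; this N x N solve is the only
    # substantial extra cost beyond the usual forward/backward passes.
    return np.linalg.solve(I + (lr / 2) * A, (I - (lr / 2) * A) @ W)

# Demonstration: unitarity is preserved for an arbitrary gradient G.
rng = np.random.default_rng(0)
N = 6
W, _ = np.linalg.qr(rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N)))
G = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
W_next = stiefel_update(W, G, lr=0.001)
assert np.allclose(W_next.conj().T @ W_next, np.eye(N))
```

In practice the fixed learning rate and optional RMSprop-style scaling of G described above would be applied before calling such an update.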
The only additional substantial computation required beyond the forward and backward\npasses of the network is the N \u00d7 N matrix inverse in (6).\n\n5 Experiments\n\nAll models are implemented in Theano [16], based on the implementation of restricted-capacity\nuRNNs by [10], available from https://github.com/amarshah/complex_RNN. All code to\nreplicate our results is available from https://github.com/stwisdom/urnn. All models use\nRMSprop [15] for optimization, except that full-capacity uRNNs optimize their recurrence matrices\nwith a \ufb01xed learning rate using the update step (6) and optional RMSprop-style gradient normalization.\n\n5.1 Synthetic data\n\nFirst, we compare the performance of full-capacity uRNNs to restricted-capacity uRNNs and LSTMs\non two tasks with synthetic data. The \ufb01rst task is synthetic system identi\ufb01cation, where a uRNN must\nlearn the dynamics of a target uRNN given only samples of the target uRNN\u2019s inputs and outputs.\nThe second task is the copy memory problem, in which the network must recall a sequence of data\nafter a long period of time.\n\n5.1.1 System identi\ufb01cation\n\nFor the task of system identi\ufb01cation, we consider the problem of learning the dynamics of a nonlinear\ndynamical system that has the form (1), given a dataset of inputs and outputs of the system. We will\ndraw a true system Wsys randomly from either a constrained set Wu of restricted-capacity unitary\nmatrices using the parameterization Wu(\u03b8u) in (3) or from a wider set Wg of restricted-capacity\nunitary matrices that are guaranteed to lie outside Wu. We sample from Wg by taking a matrix\nproduct of two unitary matrices drawn from Wu.\nWe use a sequence length of T = 150, and we set the input dimension M and output dimension L\nboth equal to the hidden state dimension N. 
The input-to-hidden transformation V and hidden-to-output\ntransformation U are both set to identity, the output bias c is set to 0, the initial state is set\nto 0, and the hidden bias b is drawn from a uniform distribution in the range [\u22120.11, \u22120.09]. The\nhidden bias has a mean of \u22120.1 to ensure stability of the system outputs. Inputs are generated by\nsampling T-length i.i.d. sequences of zero-mean circular complex-valued Gaussians of dimension N\nwith diagonal unit covariance. The outputs are created by running the system (1) forward on the\ninputs.\nWe compare a restricted-capacity uRNN using the parameterization from (3) and a full-capacity\nuRNN using Stiefel manifold optimization with no gradient normalization as described in Section 4.\nWe choose hidden state dimensions N to test critical points predicted by our arguments in Section 3\nof Wu(\u03b8u) in (3): N \u2208 {4, 6, 7, 8, 16}. These dimensions are chosen to test below, at, and above the\ncritical dimension of 7.\nFor all experiments, the number of training, validation, and test sequences are 20000, 1000, and\n1000, respectively. Mean-squared error (MSE) is used as the loss function. The learning rate is 0.001\nwith a batch size of 50 for all experiments. Both models use the same matrix drawn from Wu as\ninitialization. To isolate the effect of unitary recurrence matrix capacity, we only optimize W, setting\nall other parameters to true oracle values. For each method, we report the best test loss over 100\nepochs and over 6 random initializations for the optimization.\nThe results are shown in Table 1. \u201cWsys init.\u201d refers to the initialization of the true system unitary\nmatrix Wsys, which is sampled from either the restricted-capacity set Wu or the wider set Wg.\nTable 1: Results for system identi\ufb01cation in terms of best normalized MSE. 
Wu is the set of\nrestricted-capacity unitary matrices from (3), and Wg is a wider set of unitary matrices.\n\nWsys init.  Capacity    N = 4     N = 6     N = 7     N = 8     N = 16\nWu          Restricted  4.81e-1   6.75e-3   3.53e-1   3.51e-1   7.30e-1\nWu          Full        1.28e-1   3.03e-1   2.16e-1   5.04e-2   1.28e-1\nWg          Restricted  3.21e-4   3.36e-1   3.36e-1   2.69e-1   7.60e-1\nWg          Full        8.72e-2   3.86e-1   2.62e-1   7.22e-2   1.00e-6\n\nNotice that for N < 7, the restricted-capacity uRNN achieves comparable or better performance than\nthe full-capacity uRNN. At N = 7, the restricted-capacity and full-capacity uRNNs achieve relatively\ncomparable performance, with the full-capacity uRNN achieving slightly lower error. For N > 7, the\nfull-capacity uRNN always achieves better performance versus the restricted-capacity uRNN. This\nresult con\ufb01rms our theoretical arguments that the restricted-capacity parameterization in (3) lacks the\ncapacity to model all matrices in the unitary group for N > 7 and indicates the advantage of using a\nfull-capacity unitary recurrence matrix.\n\n5.1.2 Copy memory problem\n\nThe experimental setup follows the copy memory problem from [10], which itself was based on\nthe experiment from [6]. We consider alternative hidden state dimensions and extend the sequence\nlengths to T = 1000 and T = 2000, which are longer than the maximum length of T = 750\nconsidered in previous literature.\nIn this task, the data is a vector of length T + 20 and consists of elements from 10 categories. The\nvector begins with a sequence of 10 symbols sampled uniformly from categories 1 to 8. The next\nT \u2212 1 elements of the vector are the ninth \u2018blank\u2019 category, followed by an element from the tenth\ncategory, the \u2018delimiter\u2019. The remaining ten elements are \u2018blank\u2019. 
The task is to output T + 10 blank\ncharacters followed by the sequence from the beginning of the vector. We use average cross entropy\nas the training loss function. The baseline solution outputs the blank category for T + 10 time steps\nand then guesses a random symbol uniformly from the \ufb01rst eight categories. This baseline has an\nexpected average cross entropy of 10 log(8)/(T + 20).\n\nFigure 1: Results of the copy memory problem with sequence lengths of 1000 (left) and 2000 (right).\nThe full-capacity uRNN converges quickly to a perfect solution, while the LSTM and restricted-\ncapacity uRNN with approximately the same number of parameters are unable to improve past the\nbaseline naive solution.\n\nThe full-capacity uRNN uses a hidden state size of N = 128 with no gradient normalization. To\nmatch the number of parameters (\u2248 22k), we use N = 470 for the restricted-capacity uRNN, and\nN = 68 for the LSTM. The training set size is 100000 and the test set size is 10000. The results\nof the T = 1000 experiment can be found on the left half of Figure 1. The full-capacity uRNN\nconverges to a solution with zero average cross entropy after about 2000 training iterations, whereas\nthe restricted-capacity uRNN settles to the baseline solution of 0.020. The results of the T = 2000\nexperiment can be found on the right half of Figure 1. The full-capacity uRNN hovers around the\nbaseline solution for about 5000 training iterations, after which it drops down to zero average cross\nentropy. The restricted-capacity uRNN again settles down to the baseline solution of 0.010. These results\ndemonstrate that the full-capacity uRNN is very effective for problems requiring very long memory.\n\n5.2 Speech data\n\nWe now apply restricted-capacity and full-capacity uRNNs to real-world speech data and compare\ntheir performance to LSTMs. The main task we consider is predicting the log-magnitude of future\nframes of a short-time Fourier transform (STFT). 
The STFT is a commonly used feature domain\nfor speech enhancement, and is de\ufb01ned as the Fourier transform of short windowed frames of the\ntime series. In the STFT domain, a real-valued audio signal is represented as a complex-valued\nF \u00d7 T matrix composed of T frames that are each composed of F = Nwin/2 + 1 frequency bins,\nwhere Nwin is the duration of the time-domain frame. Most speech processing algorithms use the\nlog-magnitude of the complex STFT values and reconstruct the processed audio signal using the\nphase of the original observations.\nThe frame prediction task is as follows: given all the log-magnitudes of STFT frames up to time t,\npredict the log-magnitude of the STFT frame at time t + 1. We use the TIMIT dataset [17]. According\nto common practice [18], we use a training set with 3690 utterances from 462 speakers, a validation\nset of 400 utterances, and an evaluation set of 192 utterances. Training, validation, and evaluation sets\nhave distinct speakers. Results are reported on the evaluation set using the network parameters\nthat perform best on the validation set in terms of the loss function over three training trials. All\nTIMIT audio is resampled to 8 kHz. The STFT uses a Hann analysis window of 256 samples (32\nmilliseconds) and a window hop of 128 samples (16 milliseconds).\nThe LSTM requires gradient clipping during optimization, while the restricted-capacity and full-\ncapacity uRNNs do not. The hidden state dimensions N of the LSTM are chosen to match the\nnumber of parameters of the full-capacity uRNN. For the restricted-capacity uRNN, we run models\nthat match either N or the number of parameters. For the LSTM and restricted-capacity uRNNs, we\nuse RMSprop [15] with a learning rate of 0.001, momentum 0.9, and averaging parameter 0.1. 
For\nthe full-capacity uRNN, we also use RMSprop to optimize all network parameters, except for the\nrecurrence matrix, for which we use stochastic gradient descent along the Stiefel manifold using the\nupdate (6) with a \ufb01xed learning rate of 0.001 and no gradient normalization.\n\nTable 2: Log-magnitude STFT prediction results on speech data, evaluated using objective and\nperceptual metrics (see text for description).\n\nModel                     N    # parameters  Valid. MSE  Eval. MSE  SegSNR (dB)  STOI  PESQ\nLSTM                      84   \u224883k          18.02       18.32      1.95         0.77  1.99\nRestricted-capacity uRNN  128  \u224867k          15.03       15.78      3.30         0.83  2.36\nRestricted-capacity uRNN  158  \u224883k          15.06       14.87      3.32         0.83  2.33\nFull-capacity uRNN        128  \u224883k          14.78       15.24      3.57         0.84  2.40\nLSTM                      120  \u2248135k         16.59       16.98      2.32         0.79  2.14\nRestricted-capacity uRNN  192  \u2248101k         15.20       15.17      3.31         0.83  2.35\nRestricted-capacity uRNN  256  \u2248135k         15.27       15.63      3.31         0.83  2.36\nFull-capacity uRNN        192  \u2248135k         14.56       14.66      3.76         0.84  2.42\nLSTM                      158  \u2248200k         15.49       15.80      2.92         0.81  2.24\nRestricted-capacity uRNN  378  \u2248200k         15.78       16.14      3.16         0.83  2.35\nFull-capacity uRNN        256  \u2248200k         14.41       14.45      3.75         0.84  2.38\n\nResults are shown in Table 2, and Figure 2 shows example predictions of the three types of networks.\nResults in Table 2 are given in terms of the mean-squared error (MSE) loss function and several metrics\ncomputed on the time-domain signals, which are reconstructed from the predicted log-magnitude\nand the original phase of the STFT.\n\nFigure 2: Ground truth and one-frame-ahead predictions of a spectrogram for an example utterance.\nFor each model, hidden state dimension N is chosen for the best validation MSE. Notice that the\nfull-capacity uRNN achieves the best detail in its predictions.\n\n
These time-domain metrics are segmental signal-to-noise ratio\n(SegSNR), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality\n(PESQ). SegSNR, computed using [19], uses a voice activity detector to avoid measuring SNR in\nsilent frames. STOI is designed to correlate well with human intelligibility of speech, and takes on\nvalues between 0 and 1, with a higher score indicating higher intelligibility [20]. PESQ is the ITU-T\nstandard for telephone voice quality testing [21, 22], and is a popular perceptual quality metric for\nspeech enhancement [23]. PESQ ranges from 1 (bad quality) to 4.5 (no distortion).\nNote that full-capacity uRNNs generally perform better than restricted-capacity uRNNs with the\nsame number of parameters, and both types of uRNN signi\ufb01cantly outperform LSTMs.\n\n5.3 Pixel-by-pixel MNIST\n\nAs another challenging long-term memory task with natural data, we test the performance of LSTMs\nand uRNNs on pixel-by-pixel MNIST and permuted pixel-by-pixel MNIST, \ufb01rst proposed by [5]\nand used by [10] to test restricted-capacity uRNNs. For permuted pixel-by-pixel MNIST, the pixels\nare shuf\ufb02ed, thereby creating some non-local dependencies between pixels in an image. Since the\nMNIST images are 28 \u00d7 28 pixels, resulting pixel-by-pixel sequences are T = 784 elements long.\nWe use 5000 of the 60000 training examples as a validation set to perform early stopping with a\npatience of 5. The loss function is cross-entropy. Weights with the best validation loss are used to\nprocess the evaluation set. The full-capacity uRNN uses RMSprop-style gradient normalization.\n\nTable 3: Results for unpermuted and permuted pixel-by-pixel MNIST. 
Classi\ufb01cation accuracies are\nreported for trained model weights that achieve the best validation loss.\n\nModel\n\nN # parameters Validation accurary Evaluation accuracy\n\n128\nd LSTM\ne\nLSTM\n256\nt\nu\nm\nRestricted-capacity uRNN 512\nr\ne\n116\nFull-capacity uRNN\np\nn\n512\nFull-capacity uRNN\nU\n128\nd LSTM\n256\nLSTM\nRestricted-capacity uRNN 512\n116\nFull-capacity uRNN\nFull-capacity uRNN\n512\n\ne\nt\nu\nm\nr\ne\nP\n\n\u2248 68k\n\u2248270k\n\u2248 16k\n\u2248 16k\n\u2248270k\n\u2248 68k\n\u2248270k\n\u2248 16k\n\u2248 16k\n\u2248270k\n\n7\n\n98.1\n98.5\n97.9\n92.7\n97.5\n91.7\n92.1\n94.2\n92.2\n94.7\n\n97.8\n98.2\n97.5\n92.8\n96.9\n91.3\n91.7\n93.3\n92.1\n94.1\n\n\fFigure 3: Learning curves for unpermuted pixel-by-pixel MNIST (top panel) and permuted pixel-by-\npixel MNIST (bottom panel).\n\nLearning curves are shown in Figure 3, and a summary of classi\ufb01cation accuracies is shown in Table\n3. For the unpermuted task, the LSTM with N = 256 achieves the best evaluation accuracy of\n98.2%. For the permuted task, the full-capacity uRNN with N = 512 achieves the best evaluation\naccuracy of 94.1%, which is state-of-the-art on this task. Both uRNNs outperform LSTMs on the\npermuted case, achieving their best performance after fewer traing epochs and using an equal or lesser\nnumber of trainable parameters. This performance difference suggests that LSTMs are only able\nto model local dependencies, while uRNNs have superior long-term memory capabilities. Despite\nnot representing all unitary matrices, the restricted-capacity uRNN with N = 512 still achieves\nimpressive test accuracy of 93.3% with only 1/16 of the trainable parameters, outperforming the\nfull-capacity uRNN with N = 116 that matches number of parameters. 
This result suggests that\nfurther exploration into the potential trade-off between hidden state dimension N and capacity of\nunitary parameterizations is necessary.\n\n6 Conclusion\n\nUnitary recurrent matrices prove to be an effective means of addressing the vanishing and exploding\ngradient problems. We provided a theoretical argument to quantify the capacity of constrained\nunitary matrices. We also described a method for directly optimizing a full-capacity unitary matrix\nby constraining the gradient to lie in the differentiable manifold of unitary matrices. The effect of\nrestricting the capacity of the unitary weight matrix was tested on system identi\ufb01cation and memory\ntasks, in which full-capacity unitary recurrent neural networks (uRNNs) outperformed restricted-\ncapacity uRNNs from [10] as well as LSTMs. Full-capacity uRNNs also outperformed restricted-\ncapacity uRNNs on log-magnitude STFT prediction of natural speech signals and classi\ufb01cation\nof permuted pixel-by-pixel images of handwritten digits, and both types of uRNN signi\ufb01cantly\noutperformed LSTMs. In future work, we plan to explore more general forms of restricted-capacity\nunitary matrices, including constructions based on products of elementary unitary matrices such as\nHouseholder operators or Givens operators.\nAcknowledgments: We thank an anonymous reviewer for suggesting improvements to our proof in\nSection 3 and Vamsi Potluru for helpful discussions. Scott Wisdom and Thomas Powers were funded\nby U.S. ONR contract number N00014-12-G-0078, delivery orders 13 and 24. Les Atlas was funded\nby U.S. ARO grant W911NF-15-1-0450.\n\n8\n\n\fReferences\n\n[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is dif\ufb01cult.\n\nIEEE Transactions on Neural Networks, 5(2):157\u2013166, 1994.\n\n[2] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. 
Gradient \ufb02ow in recurrent nets: the dif\ufb01culty of\nlearning long-term dependencies. In S. C. Kremer and J. F. Kolen, eds, A \ufb01eld guide to dynamical recurrent\nneural networks. IEEE Press, 2001.\n\n[3] R. Pascanu, T. Mikolov, and Y. Bengio. On the dif\ufb01culty of training Recurrent Neural Networks.\n\narXiv:1211.5063, Nov. 2012.\n\n[4] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in\n\ndeep linear neural networks. arXiv:1312.6120, Dec. 2013.\n\n[5] Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of recti\ufb01ed linear units.\n\narXiv:1504.00941, Apr. 2015.\n\n[6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\n[7] K. Cho, B. van Merri\u00ebnboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation:\n\nEncoder-decoder approaches. arXiv:1409.1259, 2014.\n\n[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385,\n\nDec. 2015.\n\n[9] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in\n\nNeural Information Processing Systems (NIPS), pp. 2204\u20132212, 2014.\n\n[10] M. Arjovsky, A. Shah, and Y. Bengio. Unitary Evolution Recurrent Neural Networks. In International\n\nConference on Machine Learning (ICML), Jun. 2016.\n\n[11] A. S. Householder. Unitary triangularization of a nonsymmetric matrix. Journal of the ACM, 5(4):339\u2013342,\n\n1958.\n\n[12] R. Gilmore. Lie groups, physics, and geometry: an introduction for physicists, engineers and chemists.\n\nCambridge University Press, 2008.\n\n[13] A. Sard. The measure of the critical values of differentiable maps. Bulletin of the American Mathematical\n\nSociety, 48(12):883\u2013890, 1942.\n\n[14] H. D. Tagare. Notes on optimization on Stiefel manifolds. Technical report, Yale University, 2011.\n[15] T. Tieleman and G. Hinton. 
Lecture 6.5\u2014RmsProp: Divide the gradient by a running average of its recent\n\nmagnitude, 2012. COURSERA: Neural Networks for Machine Learning.\n\n[16] Theano Development Team. Theano: A Python framework for fast computation of mathematical expres-\n\nsions. arXiv: 1605.02688, May 2016.\n\n[17] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett. DARPA TIMIT acoustic-phonetic\ncontinous speech corpus. Technical Report NISTIR 4930, National Institute of Standards and Technology,\n1993.\n\n[18] A. K. Halberstadt. Heterogeneous acoustic measurements and multiple classi\ufb01ers for speech recognition.\n\nPhD thesis, Massachusetts Institute of Technology, 1998.\n\n[19] M. Brookes. VOICEBOX: Speech processing toolbox for MATLAB, 2002.\n\nhttp://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html.\n\n[Online]. Available:\n\n[20] C. Taal, R. Hendriks, R. Heusdens, and J. Jensen. An algorithm for intelligibility prediction of time-\nfrequency weighted noisy speech. IEEE Trans. on Audio, Speech, and Language Processing, 19(7):2125\u2013\n2136, Sep. 2011.\n\n[21] A. Rix, J. Beerends, M. Hollier, and A. Hekstra. Perceptual evaluation of speech quality (PESQ)-a new\nmethod for speech quality assessment of telephone networks and codecs. In Proc. ICASSP, vol. 2, pp.\n749\u2013752, 2001.\n\n[22] ITU-T P.862. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech\n\nquality assessment of narrow-band telephone networks and speech codecs, 2000.\n\n[23] P. C. Loizou. Speech Enhancement: Theory and Practice. CRC Press, Boca Raton, FL, Jun. 
2007.\n", "award": [], "sourceid": 2471, "authors": [{"given_name": "Scott", "family_name": "Wisdom", "institution": "University of Washington"}, {"given_name": "Thomas", "family_name": "Powers", "institution": "University of Washington"}, {"given_name": "John", "family_name": "Hershey", "institution": "MERL"}, {"given_name": "Jonathan", "family_name": "Le Roux", "institution": "Mitsubishi Electric Research Laboratories (MERL)"}, {"given_name": "Les", "family_name": "Atlas", "institution": "University of Washington"}]}