{"title": "RNADE: The real-valued neural autoregressive density-estimator", "book": "Advances in Neural Information Processing Systems", "page_first": 2175, "page_last": 2183, "abstract": "We introduce RNADE, a new model for joint density estimation of real-valued vectors. Our model calculates the density of a datapoint as the product of one-dimensional conditionals modeled using mixture density networks with shared parameters. RNADE learns a distributed representation of the data, while having a tractable expression for the calculation of densities. A tractable likelihood allows direct comparison with other methods and training by standard gradient-based optimizers. We compare the performance of RNADE on several datasets of heterogeneous and perceptual data, finding it outperforms mixture models in all but one case.", "full_text": "RNADE: The real-valued neural autoregressive\n\ndensity-estimator\n\nBenigno Uria and Iain Murray\n\nSchool of Informatics\nUniversity of Edinburgh\n\n{b.uria,i.murray}@ed.ac.uk\n\nHugo Larochelle\n\nD\u00b4epartement d\u2019informatique\n\nUniversit\u00b4e de Sherbrooke\n\nhugo.larochelle@usherbrooke.ca\n\nAbstract\n\nWe introduce RNADE, a new model for joint density estimation of real-valued\nvectors. Our model calculates the density of a datapoint as the product of one-\ndimensional conditionals modeled using mixture density networks with shared\nparameters. RNADE learns a distributed representation of the data, while having\na tractable expression for the calculation of densities. A tractable likelihood\nallows direct comparison with other methods and training by standard gradient-\nbased optimizers. We compare the performance of RNADE on several datasets of\nheterogeneous and perceptual data, \ufb01nding it outperforms mixture models in all\nbut one case.\n\n1\n\nIntroduction\n\nProbabilistic approaches to machine learning involve modeling the probability distributions over large\ncollections of variables. 
The number of parameters required to describe a general discrete distribution\ngrows exponentially in its dimensionality, so some structure or regularity must be imposed, often\nthrough graphical models [e.g. 1]. Graphical models are also used to describe probability densities\nover collections of real-valued variables.\nOften parts of a task-speci\ufb01c probabilistic model are hard to specify, and are learned from data using\ngeneric models. For example, the natural probabilistic approach to image restoration tasks (such as\ndenoising, deblurring, inpainting) requires a multivariate distribution over uncorrupted patches of\npixels. It has long been appreciated that large classes of densities can be estimated consistently by\nkernel density estimation [2], and a large mixture of Gaussians can closely represent any density. In\npractice, a parametric mixture of Gaussians seems to \ufb01t the distribution over patches of pixels and\nobtains state-of-the-art restorations [3]. It may not be possible to \ufb01t small image patches signi\ufb01cantly\nbetter, but alternative models could further test this claim. Moreover, competitive alternatives to\nmixture models might improve performance in other applications that have insuf\ufb01cient training data\nto \ufb01t mixture models well.\nRestricted Boltzmann Machines (RBMs), which are undirected graphical models, \ufb01t samples of\nbinary vectors from a range of sources better than mixture models [4, 5]. One explanation is that\nRBMs form a distributed representation: many hidden units are active when explaining an observation,\nwhich is a better match to most real data than a single mixture component. Another explanation is\nthat RBMs are mixture models, but the number of components is exponential in the number of hidden\nunits. Parameter tying among components allows these more \ufb02exible models to generalize better\nfrom small numbers of examples. 
There are two practical difficulties with RBMs: the likelihood of\nthe model must be approximated, and samples can only be drawn from the model approximately\nby Gibbs sampling. The Neural Autoregressive Distribution Estimator (NADE) overcomes these\ndifficulties [5]. NADE is a directed graphical model, or feed-forward neural network, initially derived\nas an approximation to an RBM, but then fitted as a model in its own right.\n\nIn this work we introduce the Real-valued Neural Autoregressive Density Estimator (RNADE), an extension\nof NADE. An autoregressive model expresses the density of a vector as an ordered product of\none-dimensional distributions, each conditioned on the values of previous dimensions in the (perhaps\narbitrary) ordering. We use the parameter sharing previously introduced by NADE, combined with\nmixture density networks [6], an existing flexible approach to modeling real-valued distributions with\nneural networks. By construction, the density of a test point under RNADE is cheap to compute,\nunlike RBM-based models. The neural network structure provides a flexible way to alter the mean\nand variance of a mixture component depending on context, potentially modeling non-linear or\nheteroscedastic data with fewer components than unconstrained mixture models.\n\n2 Background: Autoregressive models\n\nBoth NADE [5] and our RNADE model are based on the chain rule (or product rule), which factorizes\nany distribution over a vector of variables into a product of terms: $p(x) = \\prod_{d=1}^{D} p(x_d \\mid x_{<d})$,\nwhere x<d denotes all attributes preceding xd in a fixed arbitrary ordering of the attributes. This\nfactorization corresponds to a Bayesian network where every variable is a parent of all variables after\nit. As this model assumes no conditional independences, it says nothing about the distribution in\nitself. However, the (perhaps arbitrary) ordering we choose will matter if the form of the conditionals\nis constrained. 
If we assume tractable parametric forms for each of the conditional distributions, then\nthe joint distribution can be computed for any vector, and the parameters of the model can be locally\nfitted to a penalized maximum likelihood objective using any gradient-based optimizer.\nFor binary data, each conditional distribution can be modeled with logistic regression; the resulting\nmodel is called a fully visible sigmoid belief network (FVSBN) [7]. Neural networks can also be used for each\nbinary prediction task [8]. The neural autoregressive distribution estimator (NADE) also uses neural\nnetworks for each conditional, but with parameter sharing inspired by a mean-field approximation to\nRestricted Boltzmann Machines [5]. In detail, each conditional is given by a feed-forward neural\nnetwork with one hidden layer, $h_d \\in \\mathbb{R}^H$:\n\n$p(x_d = 1 \\mid x_{<d}) = \\mathrm{sigm}(v_d^\\top h_d + b_d)$ where $h_d = \\mathrm{sigm}(W_{\\cdot,<d} x_{<d} + c)$, (1)\n\nwhere $v_d \\in \\mathbb{R}^H$, $b_d \\in \\mathbb{R}$, $c \\in \\mathbb{R}^H$, and $W \\in \\mathbb{R}^{H \\times (D-1)}$ are neural network parameters, and sigm\nrepresents the logistic sigmoid function $1/(1 + e^{-x})$.\nThe weights between the inputs and the hidden units for each neural network are tied: $W_{\\cdot,<d}$ is\nthe first $d-1$ columns of a shared weight matrix $W$. This parameter sharing reduces the total\nnumber of parameters from quadratic in the number of input dimensions to linear, lessening the\nneed for regularisation. Computing the probability of a datapoint can also be done in time linear in\ndimensionality, O(DH), by sharing the computation when calculating the hidden activation of each\nneural network ($a_d = W_{\\cdot,<d} x_{<d} + c$):\n\n$a_1 = c$, $a_{d+1} = a_d + x_d W_{\\cdot,d}$. (2)\n\nWhen approximating Restricted Boltzmann Machines, the output weights $\\{v_d\\}$ in (1) were originally\ntied to the input weights $W$. 
Untying these weights gave better statistical performance on a range of\ntasks, with negligible extra computational cost [5].\nNADE has recently been extended to count data [9]. The possibility of extending generic neural\nautoregressive models to continuous data has been mentioned [8, 10], but has not been previously\nexplored to our knowledge. An autoregressive mixture of experts with scale mixture model experts has\nbeen developed as part of a sophisticated multi-resolution model specifically for natural images [11].\nIn more general work, Gaussian processes have been used to model the conditional distributions of a\nfully visible Bayesian network [12]. However, these \u2018Gaussian process networks\u2019 cannot deal with\nmultimodal conditional distributions or with large datasets (currently, around $10^4$ points would require\nfurther approximation). In the next section we propose a more flexible and scalable approach.\n\n3 Real-valued neural autoregressive density estimators\n\nThe original derivation of NADE suggests deriving a real-valued version from a mean-field approximation\nto the conditionals of a Gaussian-RBM. However, we discarded this approach because the\nlimitations of the Gaussian-RBM are well documented [13, 14]: its isotropic conditional noise model\ndoes not give competitive density estimates. Approximating a more capable RBM model, such as the\nmean-covariance RBM [15] or the spike-and-slab RBM [16], might be a fruitful future direction.\nThe main characteristic of NADE is the tying of its input-to-hidden weights. The output layer was\n\u2018untied\u2019 from the approximation to the RBM to give the model greater flexibility. Taking this idea\nfurther, we add more parameters to NADE to represent each one-dimensional conditional distribution\nwith a mixture of Gaussians instead of a Bernoulli distribution. 
That is, the outputs are mixture\ndensity networks [6], with a shared hidden layer, using the same parameter tying as NADE.\nThus, our Real-valued Neural Autoregressive Density-Estimator or RNADE model represents the\nprobability density of a vector as:\n\n$p(x) = \\prod_{d=1}^{D} p(x_d \\mid x_{<d})$ with $p(x_d \\mid x_{<d}) = p_M(x_d \\mid \\theta_d)$, (3)\n\nwhere $p_M$ is a mixture of Gaussians with parameters $\\theta_d$. The mixture model parameters are calculated\nusing a neural network with all of the preceding dimensions, $x_{<d}$, as inputs. We now give the details.\nRNADE computes the same hidden unit activations, $a_d$, as before using (2). As discussed by Bengio\n[10], as an RNADE (or a NADE) with sigmoidal units progresses across the input dimensions\n$d \\in \\{1, \\ldots, D\\}$, its hidden units will tend to become more and more saturated, due to their input\nbeing a weighted sum of an increasing number of inputs. Bengio proposed alleviating this effect by\nrescaling the hidden units\u2019 activation by a free factor $\\rho_d$ at each step, making the hidden unit values\n\n$h_d = \\mathrm{sigm}(\\rho_d a_d)$. (4)\n\nLearning these extra rescaling parameters worked slightly better, and all of our experiments use them.\nPrevious work on neural networks with real-valued outputs has found that rectified linear units can\nwork better than sigmoidal non-linearities [17]. 
The hidden values for rectified linear units are:\n\n$h_d = \\begin{cases} \\rho_d a_d & \\text{if } \\rho_d a_d > 0 \\\\ 0 & \\text{otherwise.} \\end{cases}$ (5)\n\nIn preliminary experiments we found that these hidden units worked better than sigmoidal units in\nRNADE, and used them throughout (except for an example result with sigmoidal units in Table 2).\nFinally, the mixture of Gaussians parameters for the d-th conditional, $\\theta_d = \\{\\alpha_d, \\mu_d, \\sigma_d\\}$, are set by:\n\nK mixing fractions, $\\alpha_d = \\mathrm{softmax}(V_d^{\\alpha\\top} h_d + b_d^{\\alpha})$, (6)\nK component means, $\\mu_d = V_d^{\\mu\\top} h_d + b_d^{\\mu}$, (7)\nK component standard deviations, $\\sigma_d = \\exp(V_d^{\\sigma\\top} h_d + b_d^{\\sigma})$, (8)\n\nwhere free parameters $V_d^{\\alpha}$, $V_d^{\\mu}$, $V_d^{\\sigma}$ are H\u00d7K matrices, and $b_d^{\\alpha}$, $b_d^{\\mu}$, $b_d^{\\sigma}$ are vectors of size K. The\nsoftmax [18] ensures the mixing fractions are positive and sum to one; the exponential ensures the\nstandard deviations are positive.\nFitting an RNADE can be done using gradient ascent on the model\u2019s likelihood given a training set of\nexamples. We used minibatch stochastic gradient ascent in all our experiments. In those RNADE\nmodels with MoG conditionals, we multiplied the gradient of each component mean by its standard\ndeviation (for a Gaussian, Newton\u2019s method multiplies the gradient by its variance, but empirically\nmultiplying by the standard deviation worked better). This gradient scaling makes tight components\nmove more slowly than broad ones, a heuristic that we found allows the use of higher learning rates.\nVariants: Using a mixture of Gaussians to represent the conditional distributions in RNADE is an\narbitrary parametric choice. Given several components, the mixture model can represent a rich set\nof skewed and multimodal distributions with different tail behaviors. 
However, other choices could\nbe appropriate in particular circumstances. For example, work on natural images often uses scale\nmixtures, where components share a common mean. Conditional distributions of perceptual data\nare often assumed to be Laplacian [e.g. 19]. We call our main variant with mixtures of Gaussians\nRNADE-MoG, but also experiment with mixtures of Laplacian outputs, RNADE-MoL.\n\nTable 1: Average test-set log-likelihood per datapoint for 4 different models on five UCI datasets.\nPerformances not in bold can be shown to be significantly worse than at least one of the results in\nbold as per a paired t-test on the ten mean-likelihoods, with significance level 0.05.\n\nDataset | dim | size | Gaussian | MFA | FVBN | RNADE-MoG | RNADE-MoL\nRed wine | 11 | 1599 | \u221213.18 | \u221210.19 | \u221211.03 | \u22129.36 | \u22129.46\nWhite wine | 11 | 4898 | \u221213.20 | \u221210.73 | \u221210.52 | \u221210.23 | \u221210.38\nParkinsons | 15 | 5875 | \u221210.85 | \u22121.99 | \u22120.71 | \u22120.90 | \u22122.63\nIonosphere | 32 | 351 | \u221241.24 | \u221217.55 | \u221226.55 | \u22122.50 | \u22125.87\nBoston housing | 10 | 506 | \u221211.37 | \u22124.54 | \u22123.41 | \u22120.64 | \u22124.04\n\n4 Experiments\n\nWe compared RNADE to mixtures of Gaussians (MoG) and factor analyzers (MFA), which are\nsurprisingly strong baselines in some tasks [20, 21]. Given the known poor performance of discrete\nmixtures [4, 5], we limited our experiments to modeling continuous attributes. However it would be\neasy to include both discrete and continuous variables in a NADE-like architecture.\n\n4.1 Low-dimensional data\n\nWe first considered five UCI datasets [22], previously used to study the performance of other density\nestimators [23, 20]. 
These datasets have relatively low dimensionality, with between 10 and 32\nattributes, but have hard thresholds and non-linear dependencies that may make it dif\ufb01cult to \ufb01t\nmixtures of Gaussians or factor analyzers.\nFollowing Tang et al. [20], we eliminated discrete-valued attributes and an attribute from every pair\nwith a Pearson correlation coef\ufb01cient greater than 0.98. Each dimension of the data was normalized\nby subtracting its training subset sample mean and dividing by its standard deviation. All results are\nreported on the normalized data.\nAs baselines we \ufb01tted full-covariance Gaussians and mixtures of factor analysers. To measure the\nperformance of the different models, we calculated their log-likelihood on held-out test data. Because\nthese datasets are small, we used 10-folds, with 90% of the data for training, and 10% for testing.\nWe chose the hyperparameter values for each model by doing per-fold cross-validation; using a ninth\nof the training data as validation data. Once the hyperparameter values had been chosen, we trained\neach model using all the training data (including the validation data) and measured its performance\non the 10% of held-out testing data. In order to avoid over\ufb01tting, we stopped the training after\nreaching a training likelihood higher than the one obtained on the best validation-wise iteration of the\ncorresponding validation run. Early stopping is crucial to avoid over\ufb01tting the RNADE models. It\nalso improves the results of the MFAs, but to a lesser degree.\nThe MFA models were trained using the EM algorithm [24, 25], the number of components and\nfactors were crossvalidated. The number of factors was chosen from even numbers from 2 . . . D,\nwhere selecting D gives a mixture of Gaussians. The number of components was chosen among all\neven numbers from 2 . . . 
50 (crossvalidation always selected fewer than 50 components).\nRNADE-MoG and RNADE-MoL models were \ufb01tted using minibatch stochastic gradient descent,\nusing minibatches of size 100, for 500 epochs, each epoch comprising 10 minibatches. For each\nexperiment, the number of hidden units (50), the non-linear activation-function of the hidden units\n(RLU), and the form of the conditionals were \ufb01xed. Three hyperparameters were crossvalidated\nusing grid-search: the number of components on each one-dimensional conditional was chosen from\nthe set {2, 5, 10, 20}; the weight-decay (used only to regularize the input to hidden weights) from\nthe set {2.0, 1.0, 0.1, 0.01, 0.001, 0}; and the learning rate from the set {0.1, 0.05, 0.025, 0.0125}.\nLearning-rates were decreased linearly to reach 0 after the last epoch.\nWe also trained fully-visible Bayesian networks (FVBN), an autoregressive model where each one-\ndimensional conditional is modelled by a separate mixture density network using no parameter tying.\n\n4\n\n\fFigure 1: Top: 15 8x8 patches from the BSDS test set. Center: 15 samples from Zoran and Weiss\u2019s\nMoG model with 200 components. Bottom: 15 samples from an RNADE with 512 hidden units and\n10 output components per dimension. All data and samples were drawn randomly.\n\nThe same cross-validation procedure and hyperparameters as for RNADE training were used. The\nbest validationwise MDN for each one-dimensional conditional was chosen.\nThe results are shown in Table 1. Autoregressive methods obtained statistical performances superior\nto mixture models on all datasets. An RNADE with mixture of Gaussian conditionals was among the\nstatistically signi\ufb01cant group of best models on all datasets. 
Unfortunately we could not reproduce\nthe data-folds used by previous work, however, our improvements are larger than those demonstrated\nby a deep mixture of factor analyzers over standard MFA [20].\n\n4.2 Natural image patches\n\nWe also measured the ability of RNADE to model small patches of natural images. Following the\nrecent work of Zoran and Weiss [3], we use 8-by-8-pixel patches of monochrome natural images,\nobtained from the BSDS300 dataset [26] (Figure 1 gives examples).\nPixels in this dataset can take a \ufb01nite number of brightness values ranging from 0 to 255. Modeling\ndiscretized data using a real-valued distribution can lead to arbitrarily high density values, by locating\nnarrow high density spike on each of the possible discrete values. In order to avoid this \u2018cheating\u2019\nsolution, we added noise uniformly distributed between 0 and 1 to the value of each pixel. We then\ndivided by 256, making each pixel take a value in the range [0, 1].\nIn previous experiments, Zoran and Weiss [3] subtracted the mean pixel value from each patch,\nreducing the dimensionality of the data by one: the value of any pixel could be perfectly predicted\nas minus the sum of all other pixel values. However, the original study still used a mixture of full-\ncovariance 64-dimensional Gaussians. Such a model could obtain arbitrarily high model likelihoods,\nso unfortunately the likelihoods reported in previous work on this dataset [3, 20] are dif\ufb01cult to\ninterpret. In our preliminary experiment using RNADE, we observed that if we model the 64-\ndimensional data, the 64th pixel is always predicted by a very thin spike centered at its true value.\nThe ability of RNADE to capture this spurious dependency is reassuring, but we wouldn\u2019t want our\nresults to be dominated by it. Recent work by Zoran and Weiss [21], projects the data on the leading\n63 eigenvectors of each component, when measuring the model likelihood [27]. 
For comparison\namongst a range of methods, we advocate simply discarding the 64th (bottom-right) pixel.\nWe trained our model using patches drawn randomly from 180 images in the training subset of\nBSDS300. A validation dataset containing 1,000 random patches from the remaining 20 images in the\ntraining subset was used for early-stopping when training RNADE. We evaluated each model\nby its log-likelihood on one million patches drawn randomly from the\ntest subset, which is composed of 100 images not present in the training subset. Given the larger\nscale of this dataset, hyperparameters of the RNADE and MoG models were chosen manually using\nthe performance of preliminary runs on the validation data, rather than by an extensive search.\nThe RNADE model had 512 rectified-linear hidden units and a mixture of 20 one-dimensional\nGaussian components per output. Training was done by minibatch gradient descent, with 25 datapoints\nper minibatch, for a total of 200 epochs, each comprising 1,000 minibatches. The learning-rate was\nscheduled to start at 0.001 and linearly decreased to reach 0 after the last epoch. Gradient momentum\nwith momentum factor 0.9 was used, but initiated at the beginning of the second epoch. A weight\ndecay rate of 0.001 was applied to the input-to-hidden weight matrix only. Again, we found that\nmultiplying the gradient of the mean output parameters by the standard deviation improves results.\nRNADE training was early stopped but didn\u2019t show signs of overfitting. We produced a further run\nwith 1024 hidden units for 400 epochs, with still no signs of overfitting; even larger models might\nperform better.\n\nTable 2: Average per-example log-likelihood of several mixture of Gaussian and RNADE models,\nwith mixture of Gaussian (MoG) or mixture of Laplace (MoL) conditionals, on 8-by-8 patches of\nnatural images. These results are measured in nats and were calculated using one million patches.\nStandard errors due to the finite test sample size are lower than 0.1 in every case. K gives the number\nof one-dimensional components for each conditional in RNADE, and the number of full-covariance\ncomponents for MoG.\n\nModel | Training LogL | Test LogL\nMoG K = 200 (Z&W) | 161.9 | 152.8\nMoG K = 100 | 152.8 | 144.7\nMoG K = 200 | 159.3 | 150.4\nMoG K = 300 | 159.3 | 150.4\nRNADE-MoG K = 5 | 158.0 | 149.1\nRNADE-MoG K = 10 | 160.0 | 151.0\nRNADE-MoG K = 20 | 158.6 | 149.7\nRNADE-MoL K = 5 | 150.2 | 141.5\nRNADE-MoL K = 10 | 149.7 | 141.1\nRNADE-MoL K = 20 | 150.1 | 141.5\nRNADE-MoG K = 10 (sigmoid h. units) | 155.1 | 146.4\nRNADE-MoG K = 10 (1024 units, 400 epochs) | 161.1 | 152.1\n\nThe MoG model was trained using minibatch EM, for 1,000 iterations. At each iteration 20,000\nrandomly sampled datapoints were used in an EM update. A step was taken from the previous mixture\nmodel towards the parameters resulting from the M-step: $\\theta_t = (1 - \\eta)\\theta_{t-1} + \\eta\\theta_{EM}$, where the\nstep size ($\\eta$) was scheduled to start at 0.1 and linearly decreased to reach 0 after the last update. The\ntraining of the MoG was also early-stopped and also showed no signs of overfitting.\nThe results are shown in Table 2. We compare RNADE with a mixture of Gaussians model trained\non 63 pixels, and with a MoG trained by Zoran and Weiss (downloaded from Daniel Zoran\u2019s website)\nfrom which we removed the 64th row and column of each covariance matrix. The best RNADE test\nlog-likelihood is, on average, 0.7 nats per patch lower than Zoran and Weiss\u2019s MoG, which had a\ndifferent training procedure than our mixture of Gaussians.\nFigure 1 shows a few examples from the test set, and samples from the MoG and RNADE models.\nSome of the samples from RNADE are unnaturally noisy, with pixel values outside the legal range\n(see fourth sample from the right in Figure 1). 
If we constrain the pixel values to a unit range, by\nrejection sampling or otherwise, these artifacts go away. Limiting the output range of the model\nwould also improve test likelihood scores slightly, but not by much: log-likelihood does not strongly\npenalize models for putting a small fraction of probability mass on \u2018junk\u2019 images.\nAll of the results in this section were obtained by fitting the pixels in a raster-scan order. Perhaps\nsurprisingly, but consistent with previous results on NADE [5] and by Frey [28], randomizing\nthe order of the pixels made little difference to these results. The difference in performance was\ncomparable to the differences between multiple runs with the same pixel ordering.\n\n4.3 Speech acoustics\n\nWe also measured the ability of RNADE to model small patches of speech spectrograms, extracted\nfrom the TIMIT dataset [29]. The patches contained 11 frames of 20 filter-banks plus energy, totaling\n231 dimensions per datapoint. This filter-bank encoding is common in speech recognition, and is\nbetter for visualization than the more frequently used MFCC features. A good generative model of\nspeech could be used, for example, in denoising or speech detection tasks.\nWe fitted the models using the standard TIMIT training subset, and compared RNADE with a MoG\nby measuring their log-likelihood on the complete TIMIT core-test dataset.\n\nTable 3: Log-likelihood of several MoG and RNADE models on the core-test set of TIMIT measured\nin nats. Standard errors due to the finite test sample size are lower than 0.3 nats in every case. 
RNADE\nobtained a higher (better) log-likelihood.\n\nModel | Training LogL | Test LogL\nMoG N = 50 | 111.6 | 110.4\nMoG N = 100 | 113.4 | 112.0\nMoG N = 200 | 113.9 | 112.5\nMoG N = 300 | 114.1 | 112.5\nRNADE-MoG K = 10 | 125.9 | 123.9\nRNADE-MoG K = 20 | 126.7 | 124.5\nRNADE-MoL K = 10 | 120.3 | 118.0\nRNADE-MoL K = 20 | 122.2 | 119.8\n\nFigure 2: Top: 15 datapoints from the TIMIT core-test set. Center: 15 samples from a MoG model\nwith 200 components. Bottom: 15 samples from an RNADE with 1024 hidden units and output\ncomponents per dimension. On each plot, time is shown on the horizontal axis, the bottom row\ndisplays the energy feature, while the others display the filter bank features (in ascending frequency\norder from the bottom). All data and samples were drawn randomly.\n\nThe RNADE model has 1024 rectified-linear hidden units and a mixture of 20 one-dimensional\nGaussian components per output. Given the larger scale of this dataset, hyperparameter choices were\nagain made manually using validation data, and the same minibatch training procedures for RNADE\nand MoG were used as for natural image patches.\nThe results are shown in Table 3. RNADE obtained, on average, 10 nats more per test example\nthan a mixture of Gaussians. In Figure 2 a few examples from the test set, and samples from the\nMoG and RNADE models are shown. In contrast with the log-likelihood measure, there are no\nmarked differences between the samples from each model. Both sets of samples look like blurred\nspectrograms, but RNADE seems to capture sharper formant structures (peaks of energy at the lower\nfrequency bands characteristic of vowel sounds).\n\n5 Discussion\n\nMixture Density Networks (MDNs) [6] are a flexible conditional model of probability densities\nthat can capture skewed, heavy-tailed, and multi-modal distributions. In principle, MDNs can be\napplied to multi-dimensional data. 
However, the number of parameters that the network has to output\ngrows quadratically with the number of targets, unless the targets are assumed independent. RNADE\nexploits an autoregressive framework to apply practical, one-dimensional MDNs to unsupervised\ndensity estimation.\n\nFigure 3: Comparison of Mixture of Gaussian (MoG) and Mixture of Laplace (MoL) conditionals.\n(a) Example test patch. (b) Density of p(x1) under RNADE-MoG (dashed-red) and RNADE-MoL\n(solid-blue), both with K = 10. RNADE-MoL closely matches a histogram of brightness values from\npatches in the test-set (green). The vertical line indicates the value in (a). (c) Log-density of the\ndistributions in (b). (d) Log-density of MoG and MoL conditionals of pixel 19 in (a). (e) Log-density\nof MoG and MoL conditionals of pixel 37 in (a). (f) Difference in predictive log-density between\nMoG and MoL conditionals for each pixel, averaged over 10,000 test patches.\n\nTo specify an RNADE we needed to set the parametric form for the output distribution of each\nMDN. A sufficiently large mixture of Gaussians can closely represent any density, but it is hard to\nlearn the conditional densities found in some problems with this representation. The marginal for\nthe brightness of a pixel in natural image patches is heavy tailed, closer to a Laplace distribution\nthan Gaussian. Therefore, RNADE-MoG must fit predictions of the first pixel, p(x1), with several\nGaussians of different widths, that coincidentally have zero mean. This solution can be difficult to\nfit, and RNADE with a mixture of Laplace outputs predicted the first pixel of image patches better\nthan with a mixture of Gaussians (Figure 3b and c). However, later pixels were predicted better\nwith Gaussian outputs (Figure 3f); the mixture of Laplace model is not suitable for predicting with\nlarge contexts. For image patches, a scale mixture can work well [11], and could be explored within\nour framework. 
However for general applications, scale mixtures within RNADE would be too\nrestrictive (e.g., p(x1) would be zero-mean and unimodal). More \ufb02exible one-dimensional forms\nmay aid RNADE to generalize better for different context sizes and across a range of applications.\nOne of the main drawbacks of RNADE, and of neural networks in general, is the need to decide\nthe value of several training hyperparameters. The gradient descent learning rate can be adjusted\nautomatically using, for example, the techniques developed by Schaul et al. [30]. Also, methods for\nchoosing hyperparameters more ef\ufb01ciently than grid search have been recently developed [31, 32].\nThese, and several other recent improvements in the neural network \ufb01eld, like dropouts [33], should\nbe directly applicable to RNADE, and possibly obtain even better performance than shown in this\nwork. RNADE makes it relatively straight-forward to translate advances in the neural-network \ufb01eld\ninto better density estimators, or at least into new estimators with different inductive biases.\nIn summary, we have presented RNADE, a novel \u2018black-box\u2019 density estimator. Both likelihood\ncomputation time and the number of parameters scale linearly with the dataset dimensionality.\nGeneralization across a range of tasks, representing arbitrary feature vectors, image patches, and\nauditory spectrograms is excellent. Performance on image patches was close to a recently reported\nstate-of-the-art mixture model [3], and RNADE outperformed mixture models on all other datasets\nconsidered.\n\nAcknowledgments\n\nWe thank John Bridle, Steve Renals, Amos Storkey, and Daniel Zoran for useful interactions.\n\nReferences\n[1] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.\n[2] T. Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18\n\n(1):179\u2013189, 1966.\n\n[3] D. Zoran and Y. Weiss. 
From learning models of natural image patches to whole image restoration. In\n\nInternational Conference on Computer Vision, pages 479\u2013486. IEEE, 2011.\n\n[4] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of\n\nthe 25th International Conference on Machine learning, pages 872\u2013879. Omnipress, 2008.\n\n[5] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. Journal of Machine\n\nLearning Research W&CP, 15:29\u201337, 2011.\n\n[6] C. M. Bishop. Mixture density networks. Technical Report NCRG 4288, Neural Computing Research\n\nGroup, Aston University, Birmingham, 1994.\n\n[7] B. J. Frey, G. E. Hinton, and P. Dayan. Does the wake-sleep algorithm produce good density estimators?\n\nIn Advances in Neural Information Processing Systems 8, pages 661\u2013670. MIT Press, 1996.\n\n[8] Y. Bengio and S. Bengio. Modeling high-dimensional discrete data with multi-layer neural networks.\n\nAdvances in Neural Information Processing Systems, 12:400\u2013406, 2000.\n\n[9] H. Larochelle and S. Lauly. A neural autoregressive topic model. In Advances in Neural Information\n\nProcessing Systems 25, 2012.\n\n[10] Y. Bengio. Discussion of the neural autoregressive distribution estimator. Journal of Machine Learning\n\nResearch W&CP, 15:38\u201339, 2011.\n\n[11] L. Theis, R. Hosseini, and M. Bethge. Mixtures of conditional Gaussian scale mixtures applied to\n\nmultiscale image representations. PLoS ONE, 7(7), 2012. doi: 10.1371/journal.pone.0039857.\n\n[12] N. Friedman and I. Nachman. Gaussian process networks. 
In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 211\u2013219. Morgan Kaufmann Publishers Inc., 2000.\n[13] I. Murray and R. Salakhutdinov. Evaluating probabilities under high-dimensional latent variable models. In Advances in Neural Information Processing Systems 21, pages 1137\u20131144, 2009.\n[14] L. Theis, S. Gerwinn, F. Sinz, and M. Bethge. In all likelihood, deep belief is not enough. Journal of Machine Learning Research, 12:3071\u20133096, 2011.\n[15] M. A. Ranzato and G. E. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In Computer Vision and Pattern Recognition, pages 2551\u20132558. IEEE, 2010.\n[16] A. Courville, J. Bergstra, and Y. Bengio. A spike and slab restricted Boltzmann machine. Journal of Machine Learning Research W&CP, 15, 2011.\n[17] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807\u2013814. Omnipress, 2010.\n[18] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neuro-computing: algorithms, architectures and applications, pages 227\u2013236. Springer-Verlag, 1989.\n[19] T. Robinson. SHORTEN: simple lossless and near-lossless waveform compression. Technical Report CUED/F-INFENG/TR.156, Engineering Department, Cambridge University, 1994.\n[20] Y. Tang, R. Salakhutdinov, and G. Hinton. Deep mixtures of factor analysers. In Proceedings of the 29th International Conference on Machine Learning, pages 505\u2013512. Omnipress, 2012.\n[21] D. Zoran and Y. Weiss. Natural images, Gaussian mixtures and dead leaves. Advances in Neural Information Processing Systems, 25:1745\u20131753, 2012.\n[22] K. Bache and M. Lichman. UCI machine learning repository, 2013. 
http://archive.ics.uci.edu/ml.\n[23] R. Silva, C. Blundell, and Y. W. Teh. Mixed cumulative distribution networks. Journal of Machine Learning Research W&CP, 15:670\u2013678, 2011.\n[24] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1996.\n[25] J. Verbeek. Mixture of factor analyzers Matlab implementation, 2005. http://lear.inrialpes.fr/ verbeek/code/.\n[26] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In International Conference on Computer Vision, volume 2, pages 416\u2013423. IEEE, July 2001.\n[27] D. Zoran. Personal communication, 2013.\n[28] B. Frey. Graphical models for machine learning and digital communication. MIT Press, 1998.\n[29] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue. Timit acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, 10(5):0, 1993.\n[30] T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In Proceedings of the 30th International Conference on Machine Learning, 2013.\n[31] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13:281\u2013305, 2012.\n[32] J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960\u20132968, 2012.\n[33] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. 
Arxiv preprint arXiv:1207.0580, 2012.", "award": [], "sourceid": 1068, "authors": [{"given_name": "Benigno", "family_name": "Uria", "institution": "University of Edinburgh"}, {"given_name": "Iain", "family_name": "Murray", "institution": "University of Edinburgh"}, {"given_name": "Hugo", "family_name": "Larochelle", "institution": "Universit\u00e9 de Sherbrooke (Quebec)"}]}