{"title": "Does the Wake-sleep Algorithm Produce Good Density Estimators?", "book": "Advances in Neural Information Processing Systems", "page_first": 661, "page_last": 667, "abstract": null, "full_text": "Does the Wake-sleep Algorithm Produce Good Density Estimators? \n\nBrendan J. Frey, Geoffrey E. Hinton \nDepartment of Computer Science \nUniversity of Toronto \nToronto, ON M5S 1A4, Canada \n{frey, hinton}@cs.toronto.edu \n\nPeter Dayan \nDepartment of Brain and Cognitive Sciences \nMassachusetts Institute of Technology \nCambridge, MA 02139, USA \ndayan@ai.mit.edu \n\nAbstract \n\nThe wake-sleep algorithm (Hinton, Dayan, Frey and Neal 1995) is a relatively efficient method of fitting a multilayer stochastic generative model to high-dimensional data. In addition to the top-down connections in the generative model, it makes use of bottom-up connections for approximating the probability distribution over the hidden units given the data, and it trains these bottom-up connections using a simple delta rule. We use a variety of synthetic and real data sets to compare the performance of the wake-sleep algorithm with Monte Carlo and mean field methods for fitting the same generative model and also compare it with other models that are less powerful but easier to fit. \n\n1  INTRODUCTION \nNeural networks are often used as bottom-up recognition devices that transform input vectors into representations of those vectors in one or more hidden layers. But multilayer networks of stochastic neurons can also be used as top-down generative models that produce patterns with complicated correlational structure in the bottom visible layer. In this paper we consider generative models composed of layers of stochastic binary logistic units. \n\nGiven a generative model parameterized by top-down weights, there is an obvious way to perform unsupervised learning. 
The generative weights are adjusted to maximize the probability that the visible vectors generated by the model would match the observed data. Unfortunately, to compute the derivatives of the log probability of a visible vector, $d$, with respect to the generative weights, $\\theta$, it is necessary to consider all possible ways in which $d$ could be generated. For each possible binary representation $\\alpha$ in the hidden units the derivative needs to be weighted by the posterior probability of $\\alpha$ given $d$ and $\\theta$: \n\n$$P(\\alpha \\mid d, \\theta) = \\frac{P(\\alpha \\mid \\theta)\\, P(d \\mid \\alpha, \\theta)}{\\sum_{\\beta} P(\\beta \\mid \\theta)\\, P(d \\mid \\beta, \\theta)}. \\tag{1}$$ \n\nIt is intractable to compute $P(\\alpha \\mid d, \\theta)$, so instead of minimizing $-\\log P(d \\mid \\theta)$, we minimize an easily computed upper bound on this quantity that depends on some additional parameters, $\\phi$: \n\n$$-\\log P(d \\mid \\theta) \\le F(d \\mid \\theta, \\phi) = -\\sum_{\\alpha} Q(\\alpha \\mid d, \\phi) \\log P(\\alpha, d \\mid \\theta) + \\sum_{\\alpha} Q(\\alpha \\mid d, \\phi) \\log Q(\\alpha \\mid d, \\phi). \\tag{2}$$ \n\n$F(d \\mid \\theta, \\phi)$ is a Helmholtz free energy and is equal to $-\\log P(d \\mid \\theta)$ when the distribution $Q(\\cdot \\mid d, \\phi)$ is the same as the posterior distribution $P(\\cdot \\mid d, \\theta)$. Otherwise, $F(d \\mid \\theta, \\phi)$ exceeds $-\\log P(d \\mid \\theta)$ by the asymmetric divergence: \n\n$$D = \\sum_{\\alpha} Q(\\alpha \\mid d, \\phi) \\log \\frac{Q(\\alpha \\mid d, \\phi)}{P(\\alpha \\mid d, \\theta)}. \\tag{3}$$ \n\nWe restrict $Q(\\cdot \\mid d, \\phi)$ to be a product distribution within each layer that is conditional on the binary states in the layer below and we can therefore compute it efficiently using a bottom-up recognition network. We call a model that uses bottom-up connections to minimize the bound in equation 2 in this way a Helmholtz machine (Dayan, Hinton, Neal and Zemel 1995). The recognition weights $\\phi$ take the binary activities in one layer and stochastically produce binary activities in the layer above using a logistic function. 
So, for a given visible vector, the recognition weights may produce many different representations in the hidden layers, but we can get an unbiased sample from the distribution $Q(\\cdot \\mid d, \\phi)$ in a single bottom-up pass through the recognition network. \n\nThe highly restricted form of $Q(\\cdot \\mid d, \\phi)$ means that even if we use the optimal recognition weights, the gap between $F(d \\mid \\theta, \\phi)$ and $-\\log P(d \\mid \\theta)$ is large for some generative models. However, when $F(d \\mid \\theta, \\phi)$ is minimized with respect to the generative weights, these models will generally be avoided. \n\n$F(d \\mid \\theta, \\phi)$ can be viewed as the expected number of bits required to communicate a visible vector to a receiver. First we use the recognition model to get a sample from the distribution $Q(\\cdot \\mid d, \\phi)$. Then, starting at the top layer, we communicate the activities in each layer using the top-down expectations generated from the already communicated activities in the layer above. It can be shown that the number of bits required for communicating the state of each binary unit is $s_k \\log(q_k/p_k) + (1-s_k)\\log[(1-q_k)/(1-p_k)]$, where $p_k$ is the top-down probability that $s_k$ is on and $q_k$ is the bottom-up probability that $s_k$ is on. \n\nThere is a very simple on-line algorithm that minimizes $F(d \\mid \\theta, \\phi)$ with respect to the generative weights. We simply use the recognition network to generate a sample from the distribution $Q(\\cdot \\mid d, \\phi)$ and then we increment each top-down weight $\\theta_{kj}$ by $\\epsilon s_k (s_j - p_j)$, where $\\theta_{kj}$ connects unit $k$ to unit $j$. It is much more difficult to exactly follow the gradient of $F(d \\mid \\theta, \\phi)$ with respect to the recognition weights, but there is a simple approximate method (Hinton, Dayan, Frey and Neal 1995). 
We generate a stochastic sample from the generative model and then we increment each bottom-up weight $\\phi_{ij}$ by $\\epsilon s_i (s_j - q_j)$ to increase the log probability that the recognition weights would produce the correct activities in the layer above. This way of fitting a Helmholtz machine is called the \"wake-sleep\" algorithm and the purpose of this paper is to assess how effective it is at performing high-dimensional density estimation on a variety of synthetically constructed data sets and two real-world ones. We compare it with other methods of fitting the same type of generative model and also with simpler models for which there are efficient fitting algorithms. \n\n2  COMPETITORS \nWe compare the wake-sleep algorithm with six other density estimation methods. All data units are binary and can take on values $d_k = 1$ (on) and $d_k = 0$ (off). \n\nGzip. Gzip (Gailly, 1993) is a practical compression method based on Lempel-Ziv coding. This sequential data compression technique encodes future segments of data by transmitting codewords that consist of a pointer into a buffer of recent past output together with the length of the segment being coded. Gzip's performance is measured by subtracting the length of the compressed training set from the length of the compressed training set plus a subset of the test set. Taking all disjoint test subsets into account gives an overall test set code cost. Since we are interested in estimating the expected performance on one test case, to get a tight lower bound on gzip's performance, the subset size should be kept as small as possible in order to prevent gzip from using early test data to compress later test data. \n\nBase Rate Model. Each visible unit $k$ is assumed to be independent of the others with a probability $p_k$ of being on. The probability of vector $d$ is $p(d) = \\prod_k p_k^{d_k} (1 - p_k)^{1 - d_k}$. 
The arithmetic mean of unit $k$'s activity is used to estimate $p_k$, except that, in order to avoid serious overfitting, one extra on and one extra off case are included in the estimate. \n\nBinary Mixture Model. This method is a hierarchical extension of the base rate model which uses more than one set of base rates. Each set is called a component. Component $j$ has probability $\\pi_j$ and awards each visible unit $k$ a probability $p_{jk}$ of being on. The net probability of $d$ is $p(d) = \\sum_j \\pi_j \\prod_k p_{jk}^{d_k} (1 - p_{jk})^{1 - d_k}$. For a given training datum, we consider the component identity to be a missing value which must be filled in before the parameters can be adjusted. To accomplish this, we use the expectation maximization algorithm (Dempster, Laird and Rubin 1977) to maximize the log-likelihood of the training set, using the same method as above to avoid serious overfitting. \n\nGibbs Machine (GM). This machine uses the same generative model as the Helmholtz machine, but employs a Monte Carlo method called Gibbs sampling to find the posterior in equation 1 (Neal, 1992). Unlike the Helmholtz machine it does not require a separate recognition model and with sufficiently prolonged sampling it inverts the generative model perfectly. Each hidden unit is sampled in fixed order from a probability distribution conditional on the states of the other hidden and visible units. To reduce the time required to approach equilibrium, the network is annealed during sampling. \n\nMean Field Method (MF). Instead of using a separate recognition model to approximate the posterior in equation 1, we can assume that the distribution over hidden units is factorial for a given visible vector. Obtaining a good approximation to the posterior is then a matter of minimizing free energy with respect to the mean activities. In our experiments, we use the on-line mean field learning algorithm due to Saul, Jaakkola, and Jordan (1996). 
\n\nFully Visible Belief Network (FVBN). This method is a special case of the Helmholtz machine where the top-down network is fully connected and there are no hidden units. No recognition model is needed since there is no posterior to be approximated. \n\n3  DATA SETS \nThe performances of these methods were compared on five synthetic data sets and two real ones. The synthetic data sets had matched complexities: the generative models that produced them had 100 visible units and between 1000 and 2500 parameters. A data set with 100,000 examples was generated from each model and then partitioned into 10,000 for training, 10,000 for validation and 80,000 for testing. For tractable cases, each data set entropy was approximated by the negative log-likelihood of the training set under its generative model. These entropies are approximate lower bounds on the performance. \n\nThe first synthetic data set was generated by a mixture model with 20 components. Each component is a vector of 100 base rates for the 100 visible units. To make the data more realistic, we arranged for there to be many different components whose base rates are all extreme (near 0 or 1), representing well-defined clusters, and a few components with most base rates near 0.5, representing much broader clusters. For component $j$, we selected base rate $p_{jk}$ from a beta distribution with mean $\\mu_j$ and variance $\\mu_j(1-\\mu_j)/40$ (we chose this variance to keep the entropy of visible units low for $\\mu_j$ near 0 or 1, representing well-defined clusters). Then, as often as not, we randomly replaced each $p_{jk}$ with $1-p_{jk}$ to make each component different (without doing this, all components would favor all units off). In order to obtain many well-defined clusters, the component means $\\mu_j$ were themselves sampled from a beta distribution with mean 0.1 and variance 0.02. 
\n\nThe next two synthetic data sets were produced using sigmoidal belief networks (Neal 1992) which are just the generative parts of binary stochastic Helmholtz machines. These networks had full connectivity between layers, one with a 20=>100 architecture and one with a 5=>10=>15=>20=>100 architecture. The biases were set to 0 and the weights were sampled uniformly from [-2,2), a range chosen to keep the networks from being deterministic. \n\nThe final two synthetic data sets were produced using Markov random fields. These networks had full bidirectional connections between layers. One had a 10<=>20<=>100 architecture, and the other was a concatenation of ten independent 10<=>10 fields. The biases were set to 0 and the weights were sampled from the set {-4, 0, 4} with probabilities {0.4, 0.4, 0.2}. To find data sets with high-order structure, versions of these networks were sampled until data sets were found for which the base rate method performed badly. \n\nWe also compiled two versions of a data set to which the wake-sleep algorithm has previously been applied (Hinton et al. 1995). These data consist of normalized and quantized 8x8 binary images of handwritten digits made available by the US Postal Service Office of Advanced Technology. The first version consists of a total of 13,000 images partitioned as 6000 for training, 2000 for validation and 5000 for testing. The second version consists of pairs of 8x8 images (i.e., 128 visible units) made by concatenating vectors from each of the above data sets with those from a random reordering of the respective data set. \n\n4  TRAINING DETAILS \nThe exact log-likelihoods for the base rate and mixture models can be computed, because these methods have no or few hidden variables. For the other methods, computing the exact log-likelihood is usually intractable. 
However, these methods provide an approximate upper bound on the negative log-likelihood in the form of a coding cost or Helmholtz free energy, and results are therefore presented as coding costs in bits. \n\nBecause gzip performed poorly on the synthetic tasks, we did not break up the test and validation sets into subsets. On the digit tasks, we broke the validation and test sets up to make subsets of 100 visible vectors. Since the \"-9\" gzip option did not improve performance significantly, we used the default configuration. \n\nTo obtain fair results, we tried to automate the model selection process subject to the constraint of obtaining results in a reasonable amount of time. For the mixture model, the Gibbs machine, the mean field method, and the Helmholtz machine, a single learning run was performed with each of four different architectures using performance on a validation set to avoid wasted effort. Performance on the validation set was computed every five epochs, and if two successive validation performances were not better than the previous one by more than 0.2%, learning was terminated. The network corresponding to the best validation performance was selected for test set analysis. Although it would be desirable to explore a wide range of architectures, it would be computationally ruinous. The architectures used are given in tables 3 and 4 in the appendix. \n\nThe Gibbs machine was annealed from an initial temperature of 5.0. Between each sweep of the network, during which each hidden unit was sampled once, the temperature was multiplied by 0.9227 so that after 20 sweeps the temperature was 1.0. Then, the generative weights were updated using the delta rule. To bound the datum probability, the network is annealed as above and then 40 sweeps at unity temperature are performed while summing the probability over one-nearest-neighbor configurations, checking for overlap. 
\n\nA learning rate of 0.01 was used for the Gibbs machine, the mean field method, the Helmholtz machine, and the fully visible belief network. For each of these methods, this value was found to be roughly the largest possible learning rate that safely avoided oscillations. \n\n[Figure 1: plot with vertical-axis ticks from -10 to 70 and horizontal axis \"Tasks\" (Mixture 20=>100, BN 20=>100, BN 5=>10=>15=>20=>100, MRF 10<=>20<=>100, MRF 10 x (10<=>10), Single digits, Digit pairs). Legend: Gzip, Base rate model, Mixture model, Gibbs machine, Mean field method, Fully visible belief network, Entropy.] \n\nFigure 1.  Compression performance relative to the Helmholtz machine. Lines connecting the data points are for visualization only, since there is no meaningful interpolant. \n\n5  RESULTS \nThe learning times and the validation performances are given in tables 3 and 4 of the appendix. Test set appraisals and total learning times are given in table 1 for the synthetic tasks and in table 2 for the digit tasks. Because there were relatively many training cases in each simulation, the validation procedure serves to provide timing information more than to prevent overfitting. Gzip and the base rate model were very fast, followed by the fully visible belief network, the mixture model, the Helmholtz machine, the mean field method, and finally the Gibbs machine. Test set appraisals are summarized by compression performance relative to the Helmholtz machine in figure 1 above. Greater compression sizes correspond to lower test set likelihoods and imply worse density estimation. When available, the data set entropies indicate how close to optimum each method comes. 
\n\nThe Helmholtz machine yields a much lower cost compared to gzip and base rates on all tasks. Compared to the mixture model, it gives a lower cost on both BN tasks and the MRF 10 x (10<=>10) task. The latter case shows that the Helmholtz machine was able to take advantage of the independence of the ten concatenated input segments, whereas the mixture method was not. Simply to represent a problem where there are only two distinct clusters present in each of the ten segments, the mixture model would require $2^{10}$ components. Results on the two BN tasks indicate the Helmholtz machine is better able to model multiple simultaneous causes than the mixture method, which requires that only one component (cause) is active at a time. On the other hand, compared to the mixture model, the Helmholtz machine performs poorly on the Mixture 20=>100 task. It is not able to learn that only one cause should be active at a time. This problem can be avoided by hard-wiring softmax groups into the Helmholtz machine. On the five synthetic tasks, the Helmholtz machine performs about the same as or better than the Gibbs machine, and runs two orders of magnitude faster. (The Gibbs machine was too slow to run on the digit tasks.) While the quality of density estimation produced by the mean field method is indistinguishable from the Helmholtz machine, the latter runs an order of magnitude faster than the mean field algorithm we used. The fully visible belief network performs significantly better than the Helmholtz machine on the two digit tasks and significantly worse on two of the synthetic tasks. It is trained roughly two orders of magnitude faster than the Helmholtz machine. \n\nTable 1.  Test set cost (bits) and total training time (hrs) for the synthetic tasks. 
\n\nColumns give the model used to produce the synthetic data; each cell gives cost (bits) / time (hrs). \n\nMethod | Mixture 20=>100 | BN 20=>100 | BN 5=>10=>15=>20=>100 | MRF 10<=>20<=>100 | MRF 10 x (10<=>10) \nEntropy | 36.5 | 63.5 | unknown | 19.2 | 36.8 \ngzip | 61.4 / 0 | 98.0 / 0 | 92.1 / 0 | 35.6 / 0 | 59.9 / 0 \nBase rates | 96.6 / 0 | 80.7 / 0 | 69.2 / 0 | 42.2 / 0 | 68.1 / 0 \nMixture | 36.7 / 0 | 74.0 / 0 | 62.6 / 1 | 19.3 / 1 | 49.6 / 1 \nGM | 44.1 / 131 | 63.9 / 240 | 58.1 / 251 | 26.1 / 195 | 40.3 / 145 \nMF | 42.2 / 68 | 64.7 / 80 | 58.4 / 68 | 19.3 / 75 | 38.7 / 89 \nHM | 42.7 / 8 | 65.2 / 3 | 58.5 / 4 | 19.4 / 2 | 38.6 / 4 \nFVBN | 50.9 / 0 | 67.8 / 0 | 60.6 / 0 | 19.8 / 0 | 38.2 / 0 \n\nTable 2.  Test set cost (bits) and training time (hrs) for the digit tasks. \n\nEach cell gives cost (bits) / time (hrs). \n\nMethod | Single digits | Digit pairs \ngzip | 44.3 / 0 | 89.2 / 0 \nBase rates | 59.2 / 0 | 118.4 / 0 \nMixture | 37.5 / 0 | 92.7 / 1 \nMF | 39.5 / 38 | 80.7 / 104 \nHM | 39.1 / 2 | 80.4 / 7 \nFVBN | 35.9 / 0 | 72.9 / 0 \n\n6  CONCLUSIONS \nIf we were given a new data set and asked to leave our research biases aside and do efficient density estimation, how would we proceed? Evidently it would not be worth trying gzip and the base rate model. We'd first try the fully visible belief network and the mixture model, since these are fast and sometimes give good estimates. Hoping to extract extra higher-order structure, we would then proceed to use the Helmholtz machine or the mean field method (keeping in mind that our implementation of the Helmholtz machine is considerably faster than Saul et al.'s implementation of the mean field method). Because it is so slow, we would avoid using the Gibbs machine unless the data set was very small. \n\nAcknowledgments \nWe greatly appreciate the mean field software provided by Tommi Jaakkola and Lawrence Saul. We thank members of the Neural Network Research Group at the University of Toronto for helpful advice. 
The financial support from ITRC, IRIS, and NSERC is appreciated. \n\nReferences \nDayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. 1995. The Helmholtz machine. Neural Computation 7, 889-904. \n\nDempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, Series B 39, 1-38. \n\nGailly, J. 1993. gzip program for unix. \n\nHinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. 1995. The wake-sleep algorithm for unsupervised neural networks. Science 268, 1158-1161. \n\nNeal, R. M. 1992. Connectionist learning of belief networks. Artificial Intelligence 56, 71-113. \n\nSaul, L. K., Jaakkola, T., and Jordan, M. I. 1996. Mean field theory for sigmoid belief networks. Submitted to Journal of Artificial Intelligence. \n\nAppendix \n\nThe average validation set cost per example and the associated learning time for each simulation are listed in tables 3 and 4. Architectures judged to be optimal according to validation performance are indicated by \"*\" and were used to produce the test results given in the body of this paper. \n\nTable 3.  Validation set cost (bits) and learning time (min) for the synthetic tasks. 
\n\nColumns give the model used to produce the synthetic data; each cell gives cost (bits) / time (min). \n\nMethod | Mixture 20=>100 | BN 20=>100 | BN 5=>10=>15=>20=>100 | MRF 10<=>20<=>100 | MRF 10 x (10<=>10) \ngzip | 61.6 / 0 | 98.1 / 0 | 92.3 / 0 | 35.6 / 0 | 60.0 / 0 \nBase rates | 96.7 / 0 | 80.7 / 0 | 69.4 / 0 | 42.1 / 0 | 68.1 / 0 \nMixture 20=>100 | 44.6 / 3 | 75.6 / 3 | 63.9 / 4 | 19.2* / 3 | 54.8 / 5 \nMixture 40=>100 | 36.8* / 5 | 74.8 / 5 | 63.2 / 7 | 19.2 / 7 | 52.4 / 15 \nMixture 60=>100 | 36.8 / 7 | 74.4 / 7 | 62.9 / 8 | 19.2 / 8 | 51.0 / 17 \nMixture 100=>100 | 37.0 / 14 | 74.0* / 12 | 62.7* / 13 | 19.3 / 12 | 49.6* / 22 \nGM 20=>100 | 50.6 / 1187 | 63.9* / 1639 | 58.1* / 2084 | 26.1* / 934 | 40.3* / 1425 \nGM 50=>100 | 68.8 / 2328 | 80.4 / 3481 | 76.4 / 5234 | 49.2 / 6472 | 56.5 / 3472 \nGM 10=>20=>100 | 44.1* / 872 | 66.4 / 1771 | 59.8 / 3084 | 28.0 / 767 | 42.3 / 1033 \nGM 20=>50=>100 | 52.7 / 3476 | 91.3 / 7504 | 88.0 / 4647 | 55.3 / 3529 | 63.5 / 2781 \nMF 20=>100 | 49.5 / 518 | 64.6 / 427 | 58.4* / 497 | 19.4 / 862 | 39.2 / 471 \nMF 50=>100 | 49.9 / 1644 | 64.8 / 1945 | 58.6 / 1465 | 20.4 / 1264 | 38.7* / 2427 \nMF 10=>20=>100 | 46.0 / 306 | 64.6* / 658 | 58.5 / 543 | 19.3* / 569 | 38.9 / 882 \nMF 20=>50=>100 | 42.1* / 1623 | 65.0 / 1798 | 58.6 / 1553 | 19.3 / 1778 | 38.8 / 1575 \nHM 20=>100 | 50.0 / 41 | 65.2 / 28 | 58.8 / 41 | 19.7 / 15 | 38.6* / 30 \nHM 50=>100 | 50.7 / 81 | 65.5 / 66 | 59.4 / 78 | 20.2 / 27 | 38.9 / 46 \nHM 10=>20=>100 | 43.4 / 32 | 65.1* / 38 | 58.5* / 45 | 19.4* / 21 | 38.9 / 46 \nHM 20=>50=>100 | 42.6* / 308 | 67.2 / 69 | 59.2 / 93 | 19.5 / 64 | 39.4 / 102 \nFVBN | 51.0 / 7 | 67.8 / 7 | 60.7 / 6 | 19.8 / 8 | 38.3 / 6 \n\nTable 4.  Validation set cost (bits) and learning time (min) for the digit tasks. 
\n\nEach cell gives cost (bits) / time (min). \n\nSingle digits \nMethod | Cost / time \ngzip | 44.2 / 0 \nBase rates | 59.0 / 0 \nMixture 16=>64 | 43.2 / 1 \nMixture 32=>64 | 40.0 / 4 \nMixture 64=>64 | 38.0 / 5 \nMixture 128=>64 | 37.1* / 6 \nMF 16=>24=>64 | 39.9 / 341 \nMF 24=>32=>64 | 39.1* / 845 \nMF 12=>16=>24=>64 | 39.8 / 475 \nMF 16=>24=>32=>64 | 39.1 / 603 \nHM 16=>24=>64 | 39.7 / 24 \nHM 24=>32=>64 | 39.4 / 34 \nHM 12=>16=>24=>64 | 40.4 / 16 \nHM 16=>24=>32=>64 | 38.9* / 52 \nFVBN | 35.8 / 1 \n\nDigit pairs \nMethod | Cost / time \ngzip | 88.8 / 1 \nBase rates | 117.9 / 0 \nMixture 32=>128 | 96.9 / 6 \nMixture 64=>128 | 93.8 / 8 \nMixture 128=>128 | 92.4* / 14 \nMixture 256=>128 | 92.8 / 27 \nMF 16=>24=>32=>128 | 82.7 / 1335 \nMF 16=>32=>64=>128 | 81.2 / 1441 \nMF 12=>16=>24=>32=>128 | 82.8 / 896 \nMF 12=>16=>32=>64=>128 | 80.1* / 2586 \nHM 16=>24=>32=>128 | 83.8 / 76 \nHM 16=>32=>64=>128 | 80.1* / 138 \nHM 12=>16=>24=>32=>128 | 84.6 / 74 \nHM 12=>16=>32=>64=>128 | 80.1 / 135 \nFVBN | 72.5 / 7 \n", "award": [], "sourceid": 1153, "authors": [{"given_name": "Brendan", "family_name": "Frey", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}