{"title": "Modeling Dyadic Data with Binary Latent Factors", "book": "Advances in Neural Information Processing Systems", "page_first": 977, "page_last": 984, "abstract": null, "full_text": "Modeling Dyadic Data with Binary Latent Factors\n\nEdward Meeds\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nZoubin Ghahramani\n\nDepartment of Engineering\n\nCambridge University\n\newm@cs.toronto.edu\n\nzoubin@eng.cam.ac.uk\n\nRadford Neal\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nSam Roweis\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nradford@cs.toronto.edu\n\nroweis@cs.toronto.edu\n\nAbstract\n\nWe introduce binary matrix factorization, a novel model for unsupervised ma-\ntrix decomposition. The decomposition is learned by \ufb01tting a non-parametric\nBayesian probabilistic model with binary latent variables to a matrix of dyadic\ndata. Unlike bi-clustering models, which assign each row or column to a single\ncluster based on a categorical hidden feature, our binary feature model re\ufb02ects the\nprior belief that items and attributes can be associated with more than one latent\ncluster at a time. We provide simple learning and inference rules for this new\nmodel and show how to extend it to an in\ufb01nite model in which the number of\nfeatures is not a priori \ufb01xed but is allowed to grow with the size of the data.\n\n1 Distributed representations for dyadic data\n\nOne of the major goals of probabilistic unsupervised learning is to discover underlying or hidden\nstructure in a dataset by using latent variables to describe a complex data generation process. In this\npaper we focus on dyadic data: our domains have two \ufb01nite sets of objects/entities and observa-\ntions are made on dyads (pairs with one element from each set). Examples include sparse matrices\nof movie-viewer ratings, word-document counts or product-customer purchases. 
A simple way to capture structure in this kind of data is to do “bi-clustering” (possibly using mixture models) by grouping the rows and (independently or simultaneously) the columns [6, 13, 9]. The modelling assumption in such a case is that movies come in K types and viewers in L types, and that knowing the type of movie and the type of viewer is sufficient to predict the response. Clustering or mixture models are quite restrictive: their major disadvantage is that they do not admit a componential or distributed representation, because items cannot simultaneously belong to several classes. (A movie, for example, might be explained as coming from a cluster of “dramas” or “comedies”; a viewer as a “single male” or as a “young mother”.) We might instead prefer a model (e.g. [10, 5]) in which objects can be assigned to multiple latent clusters: a movie might be a drama and have won an Oscar and have subtitles; a viewer might be single and female and a university graduate. Inference in such models falls under the broad area of factorial learning (e.g. [7, 1, 3, 12]), in which multiple interacting latent causes explain each observed datum.

In this paper, we assume that both data items (rows) and attributes (columns) have this kind of componential structure: each item (row) has associated with it an unobserved vector of K binary features; similarly, each attribute (column) has a hidden vector of L binary features. Knowing the features of the item and the features of the attribute is sufficient to generate (before noise) the response at that location in the matrix. In effect, we are factorizing a real-valued data (response) matrix X into (a distribution defined by) the product UWVᵀ, where U and V are binary feature matrices and W is a real-valued weight matrix. Below, we develop this binary matrix factorization (BMF) model using Bayesian non-parametric priors over the number and values of the unobserved binary features and the unknown weights.

2 BMF model description

Figure 1: (A) The graphical model representation of the linear-Gaussian BMF model. The concentration parameter and Beta weights for the columns of X are represented by the symbols λ and ρ. (B) BMF shown pictorially.

Binary matrix factorization is a model of an I × J dyadic data matrix X with exchangeable rows and columns. The entries of X can be real-valued, binary, or categorical; BMF models suitable for each type are described below. Associated with each row is a latent binary feature vector u_i; similarly, each column has an unobserved binary vector v_j. The primary parameters are represented by a matrix W of interaction weights. X is generated by a fixed observation process f(·) applied (elementwise) to the linear inner product of the features and weights, which is the “factorization” or approximation of the data:

X | U, V, W ~ f(UWVᵀ, Θ)   (1)

where Θ are extra parameters specific to the model variant. Three possible parametric forms for the noise (observation) distribution f are: Gaussian, with mean UWVᵀ and covariance (1/θ)I; logistic, with mean 1/(1 + exp(-UWVᵀ)); and Poisson, with mean (and variance) UWVᵀ. Other parametric forms are also possible. For illustrative purposes, we will use the linear-Gaussian model throughout this paper; this can be thought of as a two-sided version of the linear-Gaussian model found in [5].

To complete the description of the model, we need to specify prior distributions over the feature matrices U, V and the weights W. We adopt the same priors over binary matrices as previously described in [5].
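As a concrete illustration of the linear-Gaussian variant, the generative process of equation (1) can be sketched in a few lines of numpy. This is a minimal sketch: the dimensions, the Bernoulli rate, and the precision value are arbitrary choices of ours, and U and V are drawn i.i.d. Bernoulli here rather than from the Beta-Bernoulli priors described below.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, L = 30, 20, 4, 3      # rows, columns, row features, column features

U = (rng.random((I, K)) < 0.3).astype(float)   # binary row features (I x K)
V = (rng.random((J, L)) < 0.3).astype(float)   # binary column features (J x L)
W = rng.normal(size=(K, L))                    # real-valued interaction weights
theta = 4.0                                    # output precision

# equation (1), Gaussian variant: X ~ Normal(U W V', (1/theta) I)
X = U @ W @ V.T + rng.normal(scale=1.0 / np.sqrt(theta), size=(I, J))
```

Synthetic data generated this way is a useful correctness check for the samplers described next, since the generating U, V, and W are known.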
For finite sized matrices U with I rows and K columns, we generate a bias π_k independently for each column k using a Beta prior (denoted B) and then, conditioned on this bias, generate the entries in column k independently from a Bernoulli with mean π_k:

α | a_α, b_α ~ G(a_α, b_α)
π_k | α, K ~ B(α/K, β)
U | π ~ ∏_{i=1..I} ∏_{k=1..K} π_k^{u_ik} (1 - π_k)^{1 - u_ik} = ∏_{k=1..K} π_k^{m_k} (1 - π_k)^{I - m_k}   (2)

where m_k = Σ_i u_ik. The hyperprior on the concentration α is a Gamma distribution (denoted G), whose shape and scale hyperparameters control the expected fraction of zeros/ones in the matrix. The biases π are easily integrated out, which creates dependencies between the rows, although they remain exchangeable. The resulting prior depends only on the number m_k of active features in each column. An identical prior is used on V, with J rows and L columns, but with a different concentration prior λ. The variable β was set to 1 for all experiments.

The appropriate prior distribution over weights depends on the observation distribution f(·). For the linear-Gaussian variant, a convenient prior on W is a matrix normal with prior mean W₀ and covariance (1/φ)I. The scale φ of the weights and the output precision θ (if needed) have Gamma hyperpriors:

W | W₀, φ ~ N(W₀, (1/φ)I)
φ | a_φ, b_φ ~ G(a_φ, b_φ)
θ | a_θ, b_θ ~ G(a_θ, b_θ)

In certain cases, when the prior on the weights is conjugate to the output distribution model f, the weights may be analytically integrated out, expressing the marginal distribution of the data X | U, V only in terms of the binary features. This is true, for example, when we place a Gaussian prior on the weights and use a linear-Gaussian output process.

Remarkably, the Beta-Bernoulli prior distribution over U (and similarly V) can easily be extended to the case where K → ∞, creating a distribution over binary matrices with a fixed number I of exchangeable rows and a potentially infinite number of columns (although the expected number of columns which are not entirely zero remains finite). Such a distribution, the Indian Buffet Process (IBP), was described by [5] and is analogous to the Dirichlet process and the associated Chinese restaurant process (CRP) [11]. Fortunately, as we will see, inference with this infinite prior is not only tractable, but is also nearly as efficient as the finite version.

3 Inference of features and parameters

As with many other complex hierarchical Bayesian models, exact inference of the latent variables U and V in the BMF model is intractable (i.e. there is no efficient way to sample exactly from the posterior nor to compute its exact marginals). However, as with many other non-parametric Bayesian models, we can employ Markov chain Monte Carlo (MCMC) methods to create an iterative procedure which, if run for sufficiently long, will produce correct posterior samples.

3.1 Finite binary latent feature matrices

The posterior distribution of a single entry in U (or V) given all other model parameters is proportional to the product of the conditional prior and the data likelihood. The conditional prior comes from integrating out the biases π in the Beta-Bernoulli model and is proportional to the number of active entries in other rows of the same column plus a term for new activations. Gibbs sampling for single entries of U (or V) can be done using the following updates:

P(u_ik = 1 | U_{-ik}, V, W, X) = C (α/K + m_{-i,k}) P(X | U_{-ik}, u_ik = 1, V, W)
P(u_ik = 0 | U_{-ik}, V, W, X) = C (β + (I - 1) - m_{-i,k}) P(X | U_{-ik}, u_ik = 0, V, W)   (3)

where m_{-i,k} = Σ_{h≠i} u_hk, U_{-ik} excludes entry ik, and C is a normalizing constant. (Conditioning on α, K and θ is implicit.) When conditioning on W, we only need to calculate the ratio of likelihoods corresponding to row i. (Note that this is not the case when the weights are integrated out.) This ratio is a simple function of the model's predictions x̂⁺_ij (the entry of UWVᵀ computed with u_ik = 1) and x̂⁻_ij (the same entry with u_ik = 0). In the linear-Gaussian case:

P(u_ik = 1 | U_{-ik}, V, W, X) / P(u_ik = 0 | U_{-ik}, V, W, X)
  = [(α/K + m_{-i,k}) / (β + (I - 1) - m_{-i,k})] exp( -(θ/2) Σ_j [ (x_ij - x̂⁺_ij)² - (x_ij - x̂⁻_ij)² ] )

In the linear-Gaussian case, we can also easily derive analogous Gibbs sampling updates for the weights W and the hyperparameters.
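A minimal numpy sketch of this single-entry Gibbs update, conditioning on W and using only the row-i likelihood ratio, is given below. The function signature and the log-odds clipping constant are our own choices; the prior odds and likelihood ratio follow equation (3).

```python
import numpy as np

# Gibbs update for one entry u_ik of U in the finite linear-Gaussian model,
# conditioning on the weights W (equation (3) combined with the likelihood ratio).
def resample_entry(X, U, V, W, i, k, alpha, beta, theta, rng):
    I, K = U.shape
    m = U[:, k].sum() - U[i, k]            # m_{-i,k}: active entries in column k, excluding row i
    log_prior_odds = np.log(alpha / K + m) - np.log(beta + (I - 1) - m)

    U[i, k] = 1.0
    r1 = X[i] - U[i] @ W @ V.T             # row-i residuals with u_ik = 1
    U[i, k] = 0.0
    r0 = X[i] - U[i] @ W @ V.T             # row-i residuals with u_ik = 0
    log_like_odds = -0.5 * theta * (np.sum(r1 ** 2) - np.sum(r0 ** 2))

    log_odds = np.clip(log_prior_odds + log_like_odds, -50.0, 50.0)
    p1 = 1.0 / (1.0 + np.exp(-log_odds))   # P(u_ik = 1 | everything else)
    U[i, k] = float(rng.random() < p1)
    return p1
```

A full Gibbs sweep would apply this update to every entry of U and (symmetrically) of V, interleaved with draws of W and the hyperparameters.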
To simplify the presentation, we consider a “vectorized” representation of our variables. Let x be an IJ column vector taken column-wise from X, w be a KL column vector taken column-wise from W, and A be an IJ × KL binary matrix which is the Kronecker product V ⊗ U. (In “Matlab notation”, x = X(:), w = W(:) and A = kron(V, U).) In this notation, the data distribution is written as: x | A, w, θ ~ N(Aw, (1/θ)I). Given values for U and V, samples can be drawn for w, φ, and θ using the following posterior distributions (where conditioning on w₀, φ, θ, a_φ, b_φ, a_θ, b_θ is implicit):

w | x, A ~ N( (θAᵀA + φI)⁻¹ (θAᵀx + φw₀), (θAᵀA + φI)⁻¹ )   (4)
φ | w ~ G( a_φ + KL/2, b_φ + (1/2)(w - w₀)ᵀ(w - w₀) )   (5)
θ | x, A, w ~ G( a_θ + IJ/2, b_θ + (1/2)(x - Aw)ᵀ(x - Aw) )   (6)

Note that we do not have to explicitly compute the matrix A. For computing the posterior of linear-Gaussian weights, the matrix AᵀA can be computed as AᵀA = kron(VᵀV, UᵀU). Similarly, the expression Aᵀx is constructed by computing UᵀXV and taking the elements column-wise.

3.2 Infinite binary latent feature matrices

One of the most elegant aspects of non-parametric Bayesian modeling is the ability to use a prior which allows a countably infinite number of latent features. The number of instantiated features is automatically adjusted during inference and depends on the amount of data and how many features it supports. Remarkably, we can do MCMC sampling using such infinite priors with essentially no computational penalty over the finite case. To derive these updates (e.g. for row i of the matrix U), it is useful to partition the columns of U into two sets: let set A be the columns with at least one non-zero entry in rows other than i, and let set B be all other columns, including the set of columns where the only non-zero entries are found in row i and the countably infinite number of all-zero columns.

Sampling values for elements in row i of set A given everything else is straightforward, and involves Gibbs updates almost identical to those in the finite case handled by equations (2) and (3); as K → ∞, for k in set A we get:

P(u_ik = 1 | U_{-ik}, V, W) = C m_{-i,k} P(X | U_{-ik}, u_ik = 1, V, W)
P(u_ik = 0 | U_{-ik}, V, W) = C (β + I - 1 - m_{-i,k}) P(X | U_{-ik}, u_ik = 0, V, W)

When sampling new values for set B, the columns are exchangeable, and so we are really only interested in the number of entries k*_B in set B which will be turned on in row i. Sampling the number of entries set to 1 can be done with Metropolis-Hastings updates. Let J(k*_B | k_B) = Poisson(k*_B | α/(β + I - 1)) be the proposal distribution for a move which replaces the current k_B active entries with k*_B active entries in set B. The reverse proposal is J(k_B | k*_B).
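A sketch of this Metropolis-Hastings move for set B follows. The helper marginal_likelihood, which returns log P(X | U, V) with the weights integrated out (the conjugate case), is assumed here rather than implemented; because the Poisson proposal matches the prior term on the number of active set-B entries, the acceptance test reduces to a ratio of marginal likelihoods.

```python
import numpy as np

# One Metropolis-Hastings move on the number of set-B features for row i.
# marginal_likelihood(X, U, V) is assumed to return log P(X | U, V) with the
# weights integrated out (the conjugate case); it is not implemented here.
def sample_set_B(X, U, V, i, alpha, beta, rng, marginal_likelihood):
    I = U.shape[0]
    # columns whose entries outside row i are all zero (the finite part of set B)
    singleton = (U.sum(axis=0) - U[i]) == 0
    k_B = int(U[i, singleton].sum())                 # current number of active set-B entries
    k_B_star = rng.poisson(alpha / (beta + I - 1))   # proposal matches the prior term

    # replace the k_B singleton columns of row i with k_B_star fresh ones
    U_star = np.delete(U, np.flatnonzero(singleton & (U[i] == 1)), axis=1)
    new_cols = np.zeros((I, k_B_star))
    new_cols[i] = 1.0
    U_star = np.hstack([U_star, new_cols])

    # acceptance reduces to a ratio of marginal likelihoods
    log_r = marginal_likelihood(X, U_star, V) - marginal_likelihood(X, U, V)
    if np.log(rng.random()) < min(0.0, log_r):
        return U_star, k_B_star
    return U, k_B
```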
The acceptance probability is min(1, a_{k_B → k*_B}), where a_{k_B → k*_B} is

a_{k_B → k*_B} = [P(k*_B | X) J(k_B | k*_B)] / [P(k_B | X) J(k*_B | k_B)]
  = [P(X | k*_B) Poisson(k*_B | α/(β + I - 1)) J(k_B | k*_B)] / [P(X | k_B) Poisson(k_B | α/(β + I - 1)) J(k*_B | k_B)]
  = P(X | k*_B) / P(X | k_B)   (7)

This assumes a conjugate situation in which the weights W are explicitly integrated out of the model to compute the marginal likelihood P(X | k*_B). In the non-conjugate case, a more complicated proposal is required. Instead of proposing k*_B, we jointly propose k*_B and associated feature parameters w*_B from their prior distributions. In the linear-Gaussian model, where w*_B is a set of weights for features in set B, the proposal distribution is:

J(k*_B, w*_B | k_B, w_B) = Poisson(k*_B | α/(β + I - 1)) Normal(w*_B | k*_B, φ)   (8)

We actually need to sample only the finite portion of w*_B where u_ik = 1. As in the conjugate case, the acceptance ratio reduces to the ratio of data likelihoods:

a_{k_B, w_B → k*_B, w*_B} = P(X | k*_B, w*_B) / P(X | k_B, w_B)

3.3 Faster mixing transition proposals

The Gibbs updates described above for the entries of U, V and W are the simplest moves we could make in a Markov chain Monte Carlo inference procedure for the BMF model. However, these limited local updates may result in extremely slow mixing. In practice, we often implement larger moves in indicator space using, for example, Metropolis-Hastings proposals on multiple features for row i simultaneously. For example, we can propose new values for several columns in row i of matrix U by sampling feature values independently from their conditional priors. To compute the reverse proposal, we imagine forgetting the current configuration of those features for row i and compute the probability under the conditional prior of proposing the current configuration. The acceptance probability of such a proposal is (the minimum of unity and) the ratio of likelihoods between the new proposed configuration and the current configuration.

Split-merge moves may also be useful for efficiently sampling from the posterior distribution of the binary feature matrices. Jain and Neal [8] describe split-merge algorithms for Dirichlet process mixture models with non-conjugate component distributions. We have developed and implemented similar split-merge proposals for binary matrices with IBP priors. Due to space limitations, we present here only a sketch of the procedure. Two nonzero entries in U are selected uniformly at random. If they are in the same column, we propose splitting that column; if they are in different columns, we propose merging their columns.
The key difference between this algorithm and the Jain and Neal algorithm is that the binary features are not constrained to sum to unity in each row. Our split-merge algorithm also performs restricted Gibbs scans on columns of U to increase the acceptance probability.

3.4 Predictions

A major reason for building generative models of data is to be able to impute missing data values given some observations. In the linear-Gaussian model, the predictive distribution at each iteration of the Markov chain is a Gaussian distribution. The interaction weights can be analytically integrated out at each iteration, also resulting in a Gaussian posterior, removing sampling noise contributed by having the weights explicitly represented. Computing the exact predictive distribution, however, conditional only on the model hyperparameters, is analytically intractable: it requires integrating over all binary matrices U and V, and all other nuisance parameters (e.g., the weights and precisions). Instead, we integrate over these parameters implicitly by averaging predictive distributions from many MCMC iterations. This posterior, which is conditional only on the observed data and hyperparameters, is a highly complex, potentially multimodal, and non-linear function of the observed variables. By averaging predictive distributions, our algorithm implicitly integrates over U and V. In our experiments, we show samples from the posteriors of U and V to help explain what the model is doing, but we stress that the posterior may have significant mass on many possible binary matrices. The number of features and their degrees of overlap will vary over MCMC iterations. Such variation will depend, for example, on the current values of α and λ (higher values will result in more features) and on the precision values (higher weight precision results in less variation in weights).

4 Experiments

4.1 Modified “bars” problem

A toy problem commonly used to illustrate additive feature or multiple cause models is the bars problem ([2, 12, 1]). Vertical and horizontal bars are combined in some way to generate data samples. The goal of the illustration is to show recovery of the latent structure in the form of bars. We have modified the typical usage of bars to accommodate the linear-Gaussian BMF with infinite features. Data consist of I vectors of size 8² = 64, where each vector can be reshaped into a square image. The generation process is as follows: since V has the same number of rows as the dimension of the images, V is fixed to be a set of vertical and horizontal bars (when reshaped into an image). U is sampled from the IBP, and the global precisions θ and φ are set to 1/2. The weights W are sampled from zero-mean Gaussians. Model estimates of U and V were initialized from an IBP prior.

In Figure 2 we demonstrate the performance of the linear-Gaussian BMF on the bars data. We train the BMF with 200 training examples of the type shown in the top row of Figure 2. Some examples have their bottom halves labeled missing and are shown in the Figure with constant grey values. To handle this, we resample their values at each iteration of the Markov chain. The bottom row shows the expected reconstruction using MCMC samples of U, V, and W. Despite the relatively high noise levels in the data, the model is able to capture the complex relationships between bars and weights. The reconstruction of vertical bars is very good. The reconstruction of horizontal bars is good as well, considering that the model has no information regarding the existence of horizontal bars on the bottom half.

Figure 2: Bars reconstruction. (A) Bars randomly sampled from the complete dataset. The bottom half of these bars were removed and labeled missing during learning. (B) Noise-free versions of the same data. (C) The initial reconstruction. The missing values have been set to their expected value, 0, to highlight the missing region. (D) The average MCMC reconstruction of the entire image. (E) Based solely on the information in the top half of the original data, these are the noise-free nearest neighbours in pixel space.

By examining the features captured by the model, we can understand the performance just described. In Figure 3 we show the generating, or true, values of V and WVᵀ used to generate the data, along with one sample of those features from the Markov chain. Because the model is generated by adding multiple WVᵀ basis images, as shown on the right of Figure 3, multiple bars are used in each image. This is reflected in the captured features. The learned WVᵀ are fairly similar to the generating WVᵀ, but the former are composed of overlapping bar structure (learned V).

Figure 3: Bars features. The top row shows the generating values of V and WVᵀ; the second row shows a sample of V and WVᵀ from the Markov chain. WVᵀ can be thought of as a set of basis images which can be added together with binary coefficients (U) to create images.

4.2 Digits

In Section 2 we briefly stated that BMF can be applied to data models other than the linear-Gaussian model. We demonstrate this with a logistic BMF applied to binarized images of handwritten digits. We train logistic BMF with 100 examples each of digits 1, 2, and 3 from the USPS dataset. In the first five rows of Figure 4 we again illustrate the ability of BMF to impute missing data values. The top row shows all 16 samples from the dataset which had their bottom halves labeled missing. Missing values are filled in at each iteration of the Markov chain. In the third and fourth rows we show the mean and mode (P(x_ij = 1) > 0.5) of the BMF reconstruction. In the bottom row we have shown the nearest neighbours, in pixel space, to the training examples based only on the top halves of the original digits.

In the last three rows of Figure 4 we show the features captured by the model. In row F, we show the average image of the data which have each feature in U on. It is clear that some row features have distinct digit forms and others are overlapping. In row G, the basis images WVᵀ are shown. By adjusting the features that are non-zero in each row of U, images are composed by adding basis images together. Finally, in row H we show V. These pixel features mask out different regions in pixel space, which are weighted together to create the basis images. Note that there are K features in rows F and G, and L features in row H.

Figure 4: Digits reconstruction. (A) Digits randomly sampled from the complete dataset. The bottom half of these digits were removed and labeled missing during learning. (B) The data shown to the algorithm. The top half is the original data value. (C) The mean of the reconstruction for the bottom halves. (D) The mode reconstruction of the bottom halves. (E) The nearest neighbours of the original data are shown in the bottom half, and were found based solely on the information from the top halves of the images. (F) The average of all digits for each U feature. (G) The feature WVᵀ reshaped in the form of digits. By adding these features together, which the U features do, reconstruction of the digits is possible. (H) V reshaped into the form of digits. The first image represents a bias feature.

4.3 Gene expression data

Gene expression data is able to exhibit multiple and overlapping clusters simultaneously; finding models for such complex data is an interesting and active research area ([10], [13]). The plaid model [10], originally introduced for analysis of gene expression data, can be thought of as a non-Bayesian special case of our model in which the matrix W is diagonal and the number of binary features is fixed. Our goal in this experiment is merely to illustrate qualitatively the ability of BMF to find multiple clusters in gene expression data, some of which are overlapping, others non-overlapping. The data in this experiment consist of rows corresponding to genes and columns corresponding to patients; the patients suffer from one of two types of acute Leukemia [4]. In Figure 5 we show the factorization produced by the final state in the Markov chain. The rows and columns of the data and its expected reconstruction are ordered such that contiguous regions in X were observable. Some of the many feature pairings are highlighted. The BMF clusters consist of broad, overlapping clusters and small, non-overlapping clusters. One of the interesting possibilities of using BMF to model gene expression data would be to fix certain columns of U or V with knowledge gained from experiments or literature, and to allow the model to add new features that help explain the data in more detail.

5 Conclusion

We have introduced a new model, binary matrix factorization, for unsupervised decomposition of dyadic data matrices. BMF makes use of non-parametric Bayesian methods to simultaneously discover binary distributed representations of both rows and columns of dyadic data.
The model explains each row and column entity using a componential code composed of multiple binary latent features, along with a set of parameters describing how the features interact to create the observed responses at each position in the matrix. BMF is based on a hierarchical Bayesian model and can be naturally extended to make use of a prior distribution which permits an infinite number of features, at very little extra computational cost. We have given MCMC algorithms for posterior inference of both the binary factors and the interaction parameters conditioned on some observed data, and demonstrated the model's ability to capture overlapping structure and model complex joint distributions on a variety of data. BMF is fundamentally different from bi-clustering algorithms because of its distributed latent representation, and from factorial models with continuous latent variables which interact linearly to produce the observations. This allows a much richer latent structure, which we believe makes BMF useful for many applications beyond the ones we outlined in this paper.

Figure 5: Gene expression results. (A) The top-left is X sorted according to contiguous features in the final U and V in the Markov chain. The bottom-left is Vᵀ and the top-right is U. The bottom-right is W. (B) The same as (A), but showing the expected value of X, X̂ = UWVᵀ. We have highlighted regions that have both u_ik and v_jl on. For clarity, we have only shown the (at most) two largest contiguous regions for each feature pair.

References

[1] P. Dayan and R. S. Zemel. Competition and multiple cause models. Neural Computation, 7(3), 1995.
[2] P. Foldiak. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 1990.
[3] Z. Ghahramani. Factorial learning and the EM algorithm. In NIPS, volume 7. MIT Press, 1995.
[4] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 1999.
[5] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, volume 18. MIT Press, 2005.
[6] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67, 1972.
[7] G. Hinton and R. S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NIPS, volume 6. Morgan Kaufmann, 1994.
[8] S. Jain and R. M. Neal. Splitting and merging for a nonconjugate Dirichlet process mixture model. To appear in Bayesian Analysis.
[9] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 2006.
[10] L. Lazzeroni and A. Owen. Plaid models for gene expression data. Statistica Sinica, 12, 2002.
[11] J. Pitman. Combinatorial stochastic processes. Lecture Notes for St. Flour Course, 2002.
[12] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 1994.
[13] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown. Clustering methods for the analysis of DNA microarray data. Technical report, Department of Statistics, Stanford University, 1999.
", "award": [], "sourceid": 3033, "authors": [{"given_name": "Edward", "family_name": "Meeds", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Radford", "family_name": "Neal", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}]}