{"title": "Matrix Completion From any Given Set of Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 1781, "page_last": 1787, "abstract": "In the matrix completion problem the aim is to recover an unknown real matrix from a subset of its entries. This problem comes up in many application areas, and has received a great deal of attention in the context of the Netflix Prize. A central approach to this problem is to output a matrix of lowest possible complexity (e.g. rank or trace norm) that agrees with the partially specified matrix. The performance of this approach under the assumption that the revealed entries are sampled randomly has received considerable attention. In practice, often the set of revealed entries is not chosen at random and these results do not apply. We are therefore left with no guarantees on the performance of the algorithm we are using. We present a means to obtain performance guarantees with respect to any set of initial observations. The first step remains the same: find a matrix of lowest possible complexity that agrees with the partially specified matrix. We give a new way to interpret the output of this algorithm by next finding a probability distribution over the non-revealed entries with respect to which a bound on the generalization error can be proven. The more complex the set of revealed entries according to a certain measure, the better the bound on the generalization error.", "full_text": "Matrix Completion From any Given Set of Observations\n\nTroy Lee\nNanyang Technological University and\nCentre for Quantum Technologies\ntroyjlee@gmail.com\n\nAdi Shraibman\nDepartment of Computer Science\nTel Aviv-Yaffo Academic College\nadi.shribman@gmail.com\n\nAbstract\n\nIn the matrix completion problem the aim is to recover an unknown real matrix\nfrom a subset of its entries. 
This problem comes up in many application areas,\nand has received a great deal of attention in the context of the Netflix Prize.\nA central approach to this problem is to output a matrix of lowest possible\ncomplexity (e.g. rank or trace norm) that agrees with the partially specified\nmatrix. The performance of this approach under the assumption that the\nrevealed entries are sampled randomly has received considerable attention (e.g.\n[1, 2, 3, 4, 5, 6, 7, 8]). In practice, often the set of revealed entries is not chosen\nat random and these results do not apply. We are therefore left with no guarantees\non the performance of the algorithm we are using.\nWe present a means to obtain performance guarantees with respect to any set of\ninitial observations. The first step remains the same: find a matrix of lowest\npossible complexity that agrees with the partially specified matrix. We give a new way\nto interpret the output of this algorithm by next finding a probability distribution\nover the non-revealed entries with respect to which a bound on the generalization\nerror can be proven. The more complex the set of revealed entries according to a\ncertain measure, the better the bound on the generalization error.\n\n1 Introduction\n\nIn the matrix completion problem we observe a subset of the entries of a target matrix Y, and our aim\nis to retrieve the rest of the matrix. Obviously some restriction on the target matrix Y is unavoidable\nas otherwise it is impossible to retrieve even one missing entry; usually, it is assumed that Y is\ngenerated in a way so as to have low complexity according to a measure such as matrix rank.\nA common scheme for the matrix completion problem is to select a matrix X that minimizes some\ncombination of the complexity of X and the distance between X and Y on the observed part. In\nparticular, one can demand that X agrees with Y on the observed initial sample (i.e. 
the distance\nbetween X and Y on the observed part is zero). This general algorithm is described in Figure 1, and\nwe refer to it as Alg1. It outputs a matrix with minimal complexity that agrees with Y on the initial\nsample S. The complexity measure can be rank, or a norm that serves as an efficiently computable\nproxy for the rank, such as the trace norm or \u03b32 norm. When we wish to mention which complexity\nmeasure is used we write it explicitly, e.g. Alg1(\u03b32). Our framework applies to any norm\nsatisfying a few simple conditions described in the sequel.\nThe performance of Alg1 under the assumption that the initial subset is picked at random is well\nunderstood [1, 2, 3, 4, 5, 6, 7, 8]. This line of research can be divided into two parts. One line\nof research [5, 6, 4] studies conditions under which Alg1(Tr) retrieves the matrix exactly\n(footnote 1: there are other papers studying exact matrix completion, e.g. [7]). They\ndefine what they call an incoherence property, which quantifies how spread out the singular vectors of Y\nare. The exact definition of the incoherence property varies in different results. It is then proved that\nif there are enough samples relative to the rank of Y and its incoherence property, then Alg1(Tr)\nretrieves the matrix Y exactly with high probability, assuming the samples are chosen uniformly at\nrandom. Note that in this line of research the trace norm is used as the complexity measure in the\nalgorithm. It is not clear how to prove similar results with the \u03b32 norm.\nCandes and Recht [5] observed that it is impossible to reconstruct a matrix that has only one entry\nequal to 1 and zeros everywhere else, unless most of its entries are observed. Thus, exact matrix\ncompletion must assume some special property of the target matrix Y. In a second line of research,\ngeneral results are proved regarding the performance of Alg1. 
These results are weaker in that they\ndo not prove exact recovery, but rather bounds on the distance between the output matrix X and\nY. However, these results apply to every matrix Y, can be generalized to non-uniform probability\ndistributions, and also apply when the complexity measure is the \u03b32 norm. These results take the\nfollowing form:\nTheorem 1 ([2]) Let Y be an n \u00d7 n real matrix, and P a probability distribution on pairs (i, j) \u2208\n[n]^2. Choose a sample S of |S| > n log n entries according to P. Then, with probability at least\n$1 - 2^{-n/2}$ over the sample selection, the following holds:\n\n$$\\sum_{i,j} P_{ij} |X_{ij} - Y_{ij}| \\le c\\,\\gamma_2(X) \\sqrt{\\frac{n}{|S|}} ,$$\n\nwhere X is the output of the algorithm with sample S, and c is a universal constant.\n\nIn practice, the assumption that the sample is random is not always valid. Sometimes the subset we\nsee reflects our partial knowledge, which is not random at all. What can we say about the output\nof the algorithm in this case? The analysis of random samples does not help us here, because these\nproofs do not reveal the structure that makes generalization possible. In order to answer this question\nwe need to understand what properties of a sample enable generalization.\nA first step in this direction was taken in [9] where the initial subset was chosen deterministically\nas the set of edges of a good expander (more generally, a good sparsifier). 
Deterministic guarantees\nwere proved for the algorithm in this case that resemble the guarantees proved for random sampling.\nFor example:\n\nTheorem 2 [9] Let S be the set of edges of a d-regular graph with second eigenvalue bound \u03bb\n(footnote 2: the eigenvalues are eigenvalues of the adjacency matrix of the graph).\nFor every n \u00d7 n real matrix Y, if X is the output of Alg1 with initial subset S, then\n\n$$\\frac{1}{n^2} \\sum_{i,j} (X_{ij} - Y_{ij})^2 \\le c\\,\\gamma_2(Y)^2 \\frac{\\lambda}{d} ,$$\n\nwhere c is a small universal constant.\n\nRecall that d-regular graphs with \u03bb = O(\u221ad) can be constructed in linear time using e.g. the\nwell-known LPS Ramanujan graphs [10].\nThis theorem was also generalized to bound the error with respect to any probability distribution.\nInstead of expanders, sparsifiers were used to select the entries to observe for this result.\nTheorem 3 [9] Let P be a probability distribution on pairs (i, j) \u2208 [n]^2, and d > 1. There is an\nefficiently constructible set S \u2282 [n]^2 of size at most dn, such that for every n \u00d7 n real target matrix\nY, if X is the output of our algorithm with initial subset S, then\n\n$$\\sum_{i,j} P_{ij} (X_{ij} - Y_{ij})^2 \\le c\\,\\gamma_2(Y)^2 \\frac{1}{\\sqrt{d}} .$$\n\nThe results in [9] still do not answer the practical question of how to reconstruct a matrix from an\narbitrary sample. In this paper we continue the work started in [9], and give a simple and general\nanswer to this second question.\nWe extend the results of [9] in several ways:\n\n1. We upper bound the generalization error of Alg1 given any set of initial observations. This\nbound depends on properties of the set of observed entries.\n\n2. We show there is a probability distribution outside of the observed entries such that the\ngeneralization error under this distribution is bounded in terms of the complexity of the\nobserved entries, under a certain complexity measure.\n\n3. 
The results hold not only for \u03b32 but also for the trace norm, and in fact any norm satisfying\na few basic properties.\n\n2 Preliminaries\n\nHere we introduce some of the matrix notation and norms that we will be using. For matrices A, B\nof the same size, let A \u25e6 B denote the Hadamard or entrywise product of A and B. For an m-by-n\nmatrix A with m \u2265 n let $\\sigma_1(A) \\ge \\cdots \\ge \\sigma_n(A)$ denote the singular values of A. The trace norm,\ndenoted $\\|A\\|_{tr}$, is the $\\ell_1$ norm of the vector of singular values, and the Frobenius norm, denoted\n$\\|A\\|_F$, is the $\\ell_2$ norm of the vector of singular values.\nAs the rank of a matrix is equal to the number of non-zero singular values, it follows from the\nCauchy-Schwarz inequality that\n\n$$\\frac{\\|A\\|_{tr}^2}{\\|A\\|_F^2} \\le \\mathrm{rk}(A) . \\qquad (1)$$\n\nThis inequality motivates the use of the trace norm as a proxy for rank in rank minimization\nproblems. A problem with the bound of (1) as a complexity measure is that it is not monotone: the\nbound can be larger on a submatrix of A than on A itself. As taking the Hadamard product of a\nmatrix with a rank one matrix does not increase its rank, a way to fix this problem is to consider\ninstead:\n\n$$\\max_{u,v :\\, \\|u\\| = \\|v\\| = 1} \\frac{\\|A \\circ v u^T\\|_{tr}^2}{\\|A \\circ v u^T\\|_F^2} \\le \\mathrm{rk}(A) .$$\n\nWhen A is a sign matrix, this bound simplifies nicely, for then $\\|A \\circ v u^T\\|_F = \\|u\\| \\|v\\| = 1$, and\nwe are left with\n\n$$\\max_{u,v :\\, \\|u\\| = \\|v\\| = 1} \\|A \\circ v u^T\\|_{tr}^2 \\le \\mathrm{rk}(A) .$$\n\nThis motivates the definition of the \u03b32 norm.\nDefinition 4 Let A be an n-by-n matrix. 
Then\n\n$$\\gamma_2(A) = \\max_{u,v :\\, \\|u\\| = \\|v\\| = 1} \\|A \\circ v u^T\\|_{tr} .$$\n\nWe will also make use of the dual norms of the trace and \u03b32 norms. Recall that in general for a norm\n\u03a6(A) the dual norm \u03a6\u2217 is defined as\n\n$$\\Phi^*(A) = \\max_{B} \\frac{\\langle A, B \\rangle}{\\Phi(B)} .$$\n\nNotice that this means that\n\n$$\\langle A, B \\rangle \\le \\Phi^*(A) \\Phi(B) . \\qquad (2)$$\n\nThe dual of the trace norm is $\\| \\cdot \\|$, the operator norm from $\\ell_2$ to $\\ell_2$, also known as the spectral norm.\nThe dual of the \u03b32 norm looks as follows.\nDefinition 5\n\n$$\\gamma_2^*(A) = \\min_{X,Y :\\, X^T Y = A} \\frac{1}{2} \\left( \\|X\\|_F^2 + \\|Y\\|_F^2 \\right) = \\min_{X,Y :\\, X^T Y = A} \\|X\\|_F \\|Y\\|_F ,$$\n\nwhere the min is taken over X, Y with orthogonal columns.\n\nFinally, we will make use of the approximate \u03b32 norm. This is the minimum of the \u03b32 norm over all\nmatrices which approximate the target matrix in some sense. The particular version we will need is\ndenoted $\\gamma_2^{0,\\infty}$ and is defined as follows.\nDefinition 6 Let $S \\in \\{0,1\\}^{m \\times n}$ be a boolean matrix. Let $\\bar{S}$ denote the complement of S, that is\n$\\bar{S} = J - S$ where J is the all ones matrix. Then\n\n$$\\gamma_2^{0,\\infty}(S) = \\min_{T} \\{ \\gamma_2(T) : T \\circ S \\ge S, \\; T \\circ \\bar{S} = 0 \\} .$$\n\nIn words, $\\gamma_2^{0,\\infty}(S)$ is the minimum \u03b32 norm of a matrix T which is 0 whenever S is zero, and at\nleast 1 whenever S is 1. This can be thought of as a \u201cone-sided error\u201d version of the more familiar\n$\\gamma_2^{\\infty}$ norm of a sign matrix, which is the minimum \u03b32 norm of a matrix which agrees in sign with the\ntarget matrix and has all entries of magnitude at least 1. 
The $\\gamma_2^{\\infty}$ bound is also known to be equal to\nthe margin complexity [11].\n\n3 The algorithm\n\nLet S \u2282 [m] \u00d7 [n] be a subset of entries, representing our partial knowledge. We can always run\nAlg1 and get an output matrix X. What we need in order to make intelligent use of X is a way to\nmeasure the distance between X and Y. Our first observation is that although Y is not known, it\nis possible to bound the distance between X and Y. This result is stated in the following theorem\nwhich generalizes Theorems (2) and (4) of [9] (see footnote 3):\nTheorem 7 Fix a set of entries S \u2282 [m] \u00d7 [n]. Let P be a probability distribution on pairs (i, j) \u2208\n[m] \u00d7 [n], such that there exists a real matrix Q satisfying\n\n1. $Q_{ij} = 0$ when $(i, j) \\notin S$.\n2. $\\gamma_2^*(P - Q) \\le \\Lambda$.\n\nThen for every m \u00d7 n real target matrix Y, if X is the output of our algorithm with initial subset S,\nit holds that\n\n$$\\sum_{i,j} P_{ij} (X_{ij} - Y_{ij})^2 \\le 4 \\Lambda \\gamma_2(Y)^2 .$$\n\nTheorem 7 says that $\\gamma_2^*(P - Q)$ determines, at least to some extent, the expected distance between\nX and Y with respect to P.\nThis gives us a way to measure the quality of the output of Alg1 for any set S of initial observations.\nNamely, we can do the following:\n\n1. Choose a probability distribution P on the entries of the matrix.\n2. Find a real matrix Q such that $Q_{ij} = 0$ when $(i, j) \\notin S$, and $\\gamma_2^*(P - Q)$ is minimal.\n3. Output the minimal value \u039b.\n\nWe then know, using Theorem 7, that the expected square distance between X and Y can be bounded\nin terms of \u039b and the complexity of Y.\nObviously, the choice of P makes a big difference. For example, if the set of initial observations is\ncontained in a submatrix we cannot expect X to be close to Y outside this submatrix. 
In such cases\nit makes sense to restrict P to the submatrix containing S.\nOne approach to find a distribution for which we can expect to be close on the unseen entries is to\noptimize over probability distributions P such that Theorem 7 gives the best bound. Since $\\gamma_2^*$ can\nbe expressed as the optimum of a semidefinite program, we can find in polynomial time a probability\ndistribution P and a weight function Q on S such that $\\gamma_2^*(P - Q)$ is minimized. Thus, instead of\ntrying different parameters, we can find a probability distribution for which we can prove optimal\nguarantees using Theorem 7. The second algorithm we suggest does exactly that. We refer to this\nalgorithm as Alg2, or Alg2(CC) if we wish to state the complexity measure that is used.\n\nFootnote 3: Here we state the result for \u03b32. See Section 4 for the corresponding result for the trace norm as well.\n\n1. Input: a subset S \u2282 [n]^2 and the value of Y on S.\n2. Output: a matrix X of smallest possible CC(X) under the condition that\n$X_{ij} = Y_{ij}$ for all (i, j) \u2208 S.\n\nFigure 1: Algorithm Alg1(CC)\n\nFor Alg2(\u03b32), we do the following: Minimize $\\gamma_2^*(P - Q)$ over all m \u00d7 n matrices Q and P such\nthat:\n\n1. $Q_{ij} = 0$ for $(i, j) \\notin S$.\n2. $P_{ij} = 0$ for $(i, j) \\in S$.\n3. $\\sum_{i,j} P_{ij} = 1$.\n\nGlobally, our algorithm for matrix completion therefore works in two phases. We first use Alg1 to\nget an output matrix X, and then use Alg2 in order to find optimal guarantees regarding the distance\nbetween X and Y. The generalization error bounds for this algorithm are proved in Section 4.\n\n3.1 Using a general norm\n\nIn our description of Alg2 above we have used the norm \u03b32. The same idea works for any norm \u03a6\nsatisfying the property \u03a6(A \u25e6 A) \u2264 \u03a6(A)^2. 
Moreover, if the dual norm can be computed efficiently\nvia a linear or semidefinite program, then the optimal distribution P for the bound can be found\nefficiently as well.\nFor example, for the trace norm the algorithm becomes: Given the sample S, run Alg1($\\|\\cdot\\|_{tr}$) and\nget an output matrix X. The second part of the algorithm is: Minimize $\\|P - Q\\|$ over all m \u00d7 n\nmatrices Q and P such that:\n\n1. $Q_{ij} = 0$ for $(i, j) \\notin S$.\n2. $P_{ij} = 0$ for $(i, j) \\in S$.\n3. $\\sum_{i,j} P_{ij} = 1$.\n\nDenote by \u039b the optimal value of the above program, and by P the optimal probability distribution.\nThen analogously to Theorem 7, we have\n\n$$\\sum_{i,j} P_{ij} (X_{ij} - Y_{ij})^2 \\le 4 \\Lambda \\|Y\\|_{tr}^2 .$$\n\nBoth of these results will follow from a more general theorem which we show in the next section.\n\n4 Generalization bounds\n\nHere we show a more general theorem which will imply Theorem 7.\nTheorem 8 Let \u03a6 be a norm and \u03a6\u2217 its dual norm. Suppose that \u03a6(A \u25e6 A) \u2264 \u03a6(A)^2 for any matrix\nA.\nFix a set of indices S \u2282 [m] \u00d7 [n]. Let P be a probability distribution on pairs (i, j) \u2208 [m] \u00d7 [n],\nsuch that there exists a real matrix Q satisfying\n\n1. $Q_{ij} = 0$ when $(i, j) \\notin S$.\n2. $\\Phi^*(P - Q) \\le \\Lambda$.\n\nThen for every m \u00d7 n real target matrix Y, if X is the output of algorithm Alg1(\u03a6) with initial\nsubset S, it holds that\n\n$$\\sum_{i,j} P_{ij} (X_{ij} - Y_{ij})^2 \\le 4 \\Phi(Y)^2 \\Lambda .$$\n\nProof Let R be the matrix where $R_{ij} = (X_{ij} - Y_{ij})^2$. By assumption \u03a6\u2217(P \u2212 Q) \u2264 \u039b, thus by (2)\n\n$$\\langle P - Q, R \\rangle \\le \\Lambda \\Phi(R) .$$\n\nNow let us focus on \u03a6(R). 
As R = (X \u2212 Y) \u25e6 (X \u2212 Y), by the assumption on \u03a6 we have\n\n$$\\Phi(R) \\le \\Phi(X - Y)^2 \\le (\\Phi(X) + \\Phi(Y))^2 .$$\n\nNow by definition of Alg1(\u03a6) we have \u03a6(X) \u2264 \u03a6(Y), thus \u03a6(R) \u2264 4\u03a6(Y)^2. Also, by definition\nof the algorithm $R_{ij} = 0$ for (i, j) \u2208 S, and $Q_{ij}$ equals zero outside of S, which implies that\n$\\sum_{i,j} Q_{ij} R_{ij} = 0$. We conclude that\n\n$$\\sum_{i,j} P_{ij} (X_{ij} - Y_{ij})^2 = \\langle P, R \\rangle = \\langle P - Q, R \\rangle \\le \\Lambda \\Phi(R) \\le 4 \\Lambda \\Phi(Y)^2 .$$\n\nBoth the trace norm and \u03b32 norm satisfy the condition of the theorem as they are multiplicative under\ntensor product.\n\n5 Analyzing the error bound\n\nWe now look more closely at the minimal value of the parameter \u039b from Theorem 7. The optimal\nvalue of \u039b depends only on the set of observed indices S. For a set of indices S \u2282 [m] \u00d7 [n] let $\\bar{S}$\nbe its complement.\nGiven samples S we want to find P, Q so as to minimize $\\gamma_2^*(P - Q)$ such that P is a probability\ndistribution over $\\bar{S}$ and Q has support in S. We can express this as a semidefinite program\n\n$$\\Lambda = \\min_{\\alpha, P, Q} \\; \\frac{1}{2} \\mathrm{Tr}(\\alpha) \\quad \\text{subject to} \\quad \\alpha - (\\hat{P} - \\hat{Q}) \\succeq 0, \\quad P \\ge 0, \\quad \\langle P, \\bar{S} \\rangle = 1, \\quad Q \\circ S = Q .$$\n\nHere\n\n$$\\hat{P} = \\begin{bmatrix} 0 & P \\\\ P^T & 0 \\end{bmatrix}$$\n\nis the \u201cbipartite\u201d version of P, and similarly for $\\hat{Q}$.\nTaking the dual of this program we find\n\n$$\\frac{1}{\\Lambda} = \\min_{A} \\; \\gamma_2(A) \\quad \\text{subject to} \\quad A \\ge \\bar{S}, \\quad A \\circ \\bar{S} = A .$$\n\nIn words, this says that $1/\\Lambda$ is equal to the minimum \u03b32 norm of a matrix that is zero on all entries\nin S and at least 1 on all entries in $\\bar{S}$. Thus $\\Lambda = 1/\\gamma_2^{0,\\infty}(\\bar{S})$ (recall Definition 6). This says that the\nmore complex the set of unobserved entries $\\bar{S}$ according to the measure $\\gamma_2^{0,\\infty}$, the smaller the value\nof \u039b. Note that in particular, if we consider the sign matrix $\\bar{S} - S$, then $\\gamma_2^{0,\\infty}(\\bar{S}) \\ge (\\gamma_2^{\\infty}(\\bar{S} - S) - 1)/2$,\nso it is lower bounded in terms of the margin complexity of $S - \\bar{S}$.\n\nReferences\n[1] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Neural Information Processing Systems, 2005.\n[2] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In 18th Annual Conference on Computational Learning Theory (COLT), pages 545\u2013560, 2005.\n[3] R. Foygel and N. Srebro. Concentration-based guarantees for low-rank matrix reconstruction. Technical report, arXiv, 2011.\n[4] E. J. Candes and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053\u20132080, 2010.\n[5] E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717\u2013772, 2009.\n[6] B. Recht. A simpler approach to matrix completion. Technical report, arXiv, 2009.\n[7] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11:2057\u20132078, 2010.\n[8] V. Koltchinskii, A. B. Tsybakov, and K. Lounici. Nuclear norm penalization and optimal rates for noisy low rank matrix completion. Technical report, arXiv, 2010.\n[9] E. Heiman, G. Schechtman, and A. Shraibman. Deterministic algorithms for matrix completion. Random Structures and Algorithms, 2013.\n[10] A. Lubotzky, R. Phillips, and P. Sarnak. Ramanujan graphs. Combinatorica, 8:261\u2013277, 1988.\n[11] N. Linial, S. Mendelson, G. Schechtman, and A. Shraibman. Complexity measures of sign matrices. 
Combinatorica, 27(4):439\u2013463, 2007.", "award": [], "sourceid": 901, "authors": [{"given_name": "Troy", "family_name": "Lee", "institution": "Centre for Quantum Technologies"}, {"given_name": "Adi", "family_name": "Shraibman", "institution": "Weizmann Institute of Science"}]}