{"title": "Noise-Tolerant Life-Long Matrix Completion via Adaptive Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 2955, "page_last": 2963, "abstract": "We study the problem of recovering an incomplete $m\\times n$ matrix of rank $r$ with columns arriving online over time. This is known as the problem of life-long matrix completion, and is widely applied to recommendation system, computer vision, system identification, etc. The challenge is to design provable algorithms tolerant to a large amount of noises, with small sample complexity. In this work, we give algorithms achieving strong guarantee under two realistic noise models. In bounded deterministic noise, an adversary can add any bounded yet unstructured noise to each column. For this problem, we present an algorithm that returns a matrix of a small error, with sample complexity almost as small as the best prior results in the noiseless case. For sparse random noise, where the corrupted columns are sparse and drawn randomly, we give an algorithm that exactly recovers an $\\mu_0$-incoherent matrix by probability at least $1-\\delta$ with sample complexity as small as $O(\\mu_0rn\\log(r/\\delta))$. This result advances the state-of-the-art work and matches the lower bound in a worst case. We also study the scenario where the hidden matrix lies on a mixture of subspaces and show that the sample complexity can be even smaller. Our proposed algorithms perform well experimentally in both synthetic and real-world datasets.", "full_text": "Noise-Tolerant Life-Long Matrix Completion via\n\nAdaptive Sampling\n\nMaria-Florina Balcan\n\nMachine Learning Department\n\nCarnegie Mellon University, USA\n\nninamf@cs.cmu.edu\n\nHongyang Zhang\n\nMachine Learning Department\n\nCarnegie Mellon University, USA\n\nhongyanz@cs.cmu.edu\n\nAbstract\n\nWe study the problem of recovering an incomplete m \u00d7 n matrix of rank r with\ncolumns arriving online over time. 
This is known as the problem of life-long matrix completion, and is widely applied in recommendation systems, computer vision, system identification, etc. The challenge is to design provable algorithms tolerant to large amounts of noise, with small sample complexity. In this work, we give algorithms achieving strong guarantees under two realistic noise models. Under bounded deterministic noise, an adversary can add any bounded yet unstructured noise to each column. For this problem, we present an algorithm that returns a matrix with small error, with sample complexity almost as small as the best prior results in the noiseless case. For sparse random noise, where the corrupted columns are sparse and drawn randomly, we give an algorithm that exactly recovers a \u00b50-incoherent matrix with probability at least 1 \u2212 \u03b4 with sample complexity as small as O(\u00b50rn log(r/\u03b4)). This result advances the state of the art and matches the lower bound in the worst case. We also study the scenario where the hidden matrix lies in a mixture of subspaces and show that the sample complexity can be even smaller. Our proposed algorithms perform well experimentally on both synthetic and real-world datasets.\n\n1 Introduction\n\nLife-long learning is an emerging object of study in machine learning, statistics, and many other domains [2, 11]. In machine learning, the study of such a framework has led to significant advances in learning systems that continually learn many tasks over time and improve their ability to learn as they do so, like humans [15]. A natural approach to achieving this goal is to exploit information from previously-learned tasks under the belief that some commonalities exist across the tasks [2, 24]. The focus of this work is to apply this idea of life-long learning to the matrix completion problem. 
That is, given columns of a matrix that arrive online over time with missing entries, how can we approximately/exactly recover the underlying matrix by exploiting the low-rank commonality across the columns?\nOur study is motivated by several promising applications where life-long matrix completion is applicable. In recommendation systems, a column of the hidden matrix consists of the ratings given by multiple users to a specific movie/news item; the news items or movies are updated online over time, but usually only a few ratings are submitted by those users. In computer vision, inferring camera motion from a sequence of online arriving images with missing pixels has received significant attention in recent years, known as the structure-from-motion problem; recovering those missing pixels from the partial measurements is an important preprocessing step. Other examples where our technique is applicable include system identification, multi-class learning, global positioning of sensors, etc.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nDespite the large number of applications of life-long matrix completion, many fundamental questions remain unresolved. One of the long-standing challenges is designing noise-tolerant, life-long algorithms that can recover the unknown target matrix with small error. In the absence of noise, this problem is already non-trivial because the overall low-rank structure is unavailable in each round. The problem is even more challenging in the presence of noise, where an adversary can add any bounded yet unstructured noise to the observations and the error propagates as the algorithm proceeds. This is known as bounded deterministic noise. Another noise model that has received great attention is sparse random noise, where the noise is sparse compared to the number of columns and is drawn i.i.d. 
from a non-degenerate distribution.\nOur Contributions: This paper tackles the problem of noise-tolerant, life-long matrix completion and advances the state-of-the-art results under two realistic noise models.\n\u2022 Under bounded deterministic noise, we design and analyze an algorithm that is robust to noise, with only a small output error (see Figure 3). The sample complexity is almost as small as the best prior results in the noiseless case, provided that the noise level is small.\n\u2022 Under sparse random noise, we give a sample complexity that guarantees exact recovery of the hidden matrix with high probability. The sample complexity advances the state-of-the-art results (see Figure 3) and matches the lower bound in the worst case of this scenario.\n\u2022 We extend our result on sparse random noise to the setting where the columns of the hidden matrix lie in a mixture of subspaces, and show that a smaller sample complexity suffices to exactly recover the hidden matrix in this more benign setting.\n\u2022 We also show that our proposed algorithms perform well experimentally on both synthetic and real-world datasets.\n\n2 Preliminaries\n\nBefore proceeding, we define some notation and clarify the problem setup in this section.\nNotation: We use bold capital letters to represent matrices, bold lower-case letters to represent vectors, and lower-case letters to represent scalars. Specifically, we denote by M \u2208 Rm\u00d7n the noisy observation matrix in hindsight. We denote by L the underlying clean matrix, and by E the noise. We frequently use M:t \u2208 Rm\u00d71 to indicate the t-th column of matrix M, and similarly Mt: \u2208 R1\u00d7n the t-th row. For any set of indices \u2126, M\u2126: \u2208 R|\u2126|\u00d7n represents the subsampling of the rows of M at coordinates \u2126. Without confusion, denote by U the column space spanned by the matrix L. 
Denote by $\\tilde{U}$ the noisy version of U, i.e., the subspace corrupted by the noise, and by $\\hat{U}$ our estimated subspace. The superscript k of $\\tilde{U}^k$ means that $\\tilde{U}^k$ has k columns in the current round. $P_U$ is frequently used to represent the orthogonal projection operator onto subspace U. We use \u03b8(a, b) to denote the angle between vectors a and b. For a vector u and a subspace V, define $\\theta(u, V) = \\min_{v \\in V} \\theta(u, v)$. We define the angle between two subspaces U and V as $\\theta(U, V) = \\max_{u \\in U} \\theta(u, V)$. For norms, denote by $\\|v\\|_2$ the vector $\\ell_2$ norm of v. For a matrix, $\\|M\\|_F^2 = \\sum_{ij} M_{ij}^2$ and $\\|M\\|_{\\infty,2} = \\max_i \\|M_{i:}\\|_2$, i.e., the maximum vector $\\ell_2$ norm across rows. The operator norm is induced by the matrix Frobenius norm, defined as $\\|P\\| = \\max_{\\|M\\|_F \\le 1} \\|PM\\|_F$. If P can be represented as a matrix, $\\|P\\|$ also denotes the maximum singular value.\n\n2.1 Problem Setup\n\nIn the setting of life-long matrix completion, we assume that each column of the underlying matrix L is normalized1 and arrives online over time. We are not allowed to get access to the next column until we perform the completion for the current one. This is in sharp contrast to the offline setting, where all columns come at once and so we can immediately exploit the low-rank structure to do the completion. In hindsight, we assume the underlying matrix is of rank r. This assumption enables us to represent L as L = US, where U is the dictionary (a.k.a. basis matrix) of size m \u00d7 r with each column representing a latent metafeature, and S is a matrix of size r \u00d7 n containing the weights of the linear combination for each column L:t. 
The overall subspace structure is captured by U, and the finer grouping structure, e.g., a mixture of multiple subspaces, is captured by the sparsity of S. Our goal is to approximately/exactly recover the subspace U and the matrix L from a small fraction of the entries, possibly corrupted by noise, while these entries can be selected sequentially in a feedback-driven way.\nNoise Models: We study two types of realistic noise models, one of which is deterministic noise. In this setting, we assume that the $\\ell_2$ norm of the noise on each column is bounded by \u03b5_noise. Beyond that, no other assumptions are made on the nature of the noise. The challenge under this noise model is to design an online algorithm limiting the possible error propagation during the completion procedure. Another noise model we study is sparse random noise, where we assume that the noise vectors are drawn i.i.d. from any non-degenerate distribution. Additionally, we assume the noise is sparse, i.e., only a few columns of L are corrupted by noise. Our goal is to exactly recover the underlying matrix L with sample complexity as small as possible.\n1Without loss of generality, we assume $\\|L_{:t}\\|_2 = 1$ for all t, although our result can be easily extended to the general case.\nIncoherence: Apart from the sample budget and noise level, another quantity governing the difficulty of the completion problem is the coherence parameter on the row/column space. Intuitively, completion should perform better when the information spreads evenly throughout the matrix. To quantify this, for a subspace U of dimension r in $\\mathbb{R}^m$, we define\n\n$$\\mu(U) = \\frac{m}{r} \\max_{i \\in [m]} \\|P_U e_i\\|_2^2, \\qquad (1)$$\n\nwhere $e_i$ is the i-th column of the identity matrix. Indeed, without a bound on (1) there is an identifiability issue in the matrix completion problem [7, 8, 27]. As an extreme example, let L be a matrix with only one non-zero entry. 
Such a matrix cannot be exactly recovered unless we see the non-zero element. As in\n[19], to mitigate the issue, in this paper we assume incoherence \u00b50 = \u00b5(U) on the column space of\nthe underlying matrix. This is in contrast to the classical results of Cand\u00e8s et al. [7, 8], in which one\nrequires incoherence \u00b50 = max{\u00b5(U), \u00b5(V)} on both the column and the row subspaces.\nSampling Model: Instead of sampling the entries passively by uniform distribution, our sampling\noracle allows adaptively measuring entries in each round. Speci\ufb01cally, for any arriving column we\nare allowed to have two types of sampling phases: we can either uniformly take the samples of the\nentries, as the passive sampling oracle, or choose to request all entries of the column in an adaptive\nmanner. This is a natural extension of the classical passive sampling scheme with wide applications.\nFor example, in network tomography, a network operator is interested in inferring latencies between\nhosts while injecting few packets into the network. The operator is in control of the network, thus\ncan adaptively sample the matrix of pair-wise latencies. In particular, the operator can request full\ncolumns of the matrix by measuring one host to all others. 
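The coherence parameter defined in (1) is easy to compute directly from an orthonormal basis of the subspace. A minimal NumPy sketch (our own illustration, not code from the paper):

```python
import numpy as np

def coherence(U):
    """Coherence mu(U) of the column span of U, following Eq. (1):
    mu(U) = (m / r) * max_i ||P_U e_i||_2^2, where P_U is the
    orthogonal projector onto span(U)."""
    m, r = U.shape
    # Orthonormalize so that P_U = Q Q^T.
    Q, _ = np.linalg.qr(U)
    # ||P_U e_i||_2^2 equals the squared l2 norm of the i-th row of Q.
    row_norms_sq = np.sum(Q ** 2, axis=1)
    return (m / r) * np.max(row_norms_sq)
```

As a sanity check, a subspace spanned by standard basis vectors is maximally coherent (mu = m/r), while a generic random subspace has coherence close to the lower bound of 1.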
In gene expression analysis, we are interested in recovering a matrix of expression levels for various genes across a number of conditions. High-throughput microarrays provide expression levels of all genes of interest across operating conditions, corresponding to revealing entire columns of the matrix.\n\n3 Main Results\n\nIn this section, we formalize our life-long matrix completion algorithm, develop our main theoretical contributions, and compare our results with the prior work.\n\n3.1 Bounded Deterministic Noise\n\nTo proceed, our algorithm streams the columns of the noisy M into memory and iteratively updates the estimate for the column space of L. In particular, the algorithm maintains an estimate $\\hat{U}$ of the subspace U, and when processing an arriving column M:t, requests only a few entries of M:t and a few rows of $\\hat{U}$ to estimate the distance between L:t and U. If the value of the estimator is greater than a given threshold \u03b7k, the algorithm requests the remaining entries of M:t and adds the new direction M:t to the subspace estimate; otherwise, it finds a best approximation of M:t by a linear combination of the columns of $\\hat{U}$. The pseudocode of the procedure is displayed in Algorithm 1. We note that our algorithm is similar to the algorithm of [19] for the problem of offline matrix completion without noise. However, our setting, with the presence of noise (which might conceivably propagate through the course of the algorithm), makes our analysis significantly more subtle.\nThe key ingredient of the algorithm is to estimate the distance between the noiseless column L:t and the clean subspace $U^k$ from only a few noisy measurements. To estimate this quantity, we downsample both M:t and $\\hat{U}^k$ to $M_{\\Omega t}$ and $\\hat{U}^k_{\\Omega:}$, respectively. We then project $M_{\\Omega t}$ onto the subspace spanned by $\\hat{U}^k_{\\Omega:}$ and use the projection residual $\\|M_{\\Omega t} - P_{\\hat{U}^k_{\\Omega:}} M_{\\Omega t}\\|_2$ as our estimator. 
A subtle and critical aspect of the algorithm is the choice of the threshold \u03b7k for this estimator. In the noiseless setting, we can simply set \u03b7k = 0 if the number of samples |\u2126| is large enough, on the order of O(\u00b50r log2 r), because O(\u00b50r log2 r) noiseless measurements already contain enough information for testing whether a specific column lies in a given subspace [19]. In the noisy setting, however, the challenge is that both M:t and $\\hat{U}^k$ are corrupted by noise, and the error propagates as the algorithm proceeds.\n\nAlgorithm 1 Noise-Tolerant Life-Long Matrix Completion under Bounded Deterministic Noise\nInput: Columns of matrices arriving over time.\nInitialize: Let the basis matrix $\\hat{U}^0 = \\emptyset$ and k := 0. Randomly draw entries \u2126 \u2282 [m] of size d uniformly with replacement.\n1: For t from 1 to n, do\n2: (a) If $\\|M_{\\Omega t} - P_{\\hat{U}^k_{\\Omega:}} M_{\\Omega t}\\|_2 > \\eta_k$, then\n3: i. Fully measure M:t and add it to the basis matrix $\\hat{U}^k$. Orthogonalize $\\hat{U}^k$.\n4: ii. Randomly draw entries \u2126 \u2282 [m] of size d uniformly with replacement.\n5: iii. k := k + 1.\n6: (b) Otherwise, $\\hat{M}_{:t} := \\hat{U}^k (\\hat{U}^k_{\\Omega:})^{\\dagger} M_{\\Omega t}$.\n7: End For\nOutput: Estimated range space $\\hat{U}^K$ and the underlying matrix estimate $\\hat{M}$ with columns $\\hat{M}_{:t}$.\n\nThus, instead of always setting the threshold to 0, our theory suggests setting \u03b7k proportional to \u221a\u03b5_noise. Indeed, the threshold \u03b7k balances the trade-off between the estimation error and the sample complexity: a) if \u03b7k is too large, most of the columns are represented by the noisy dictionary and therefore the error propagates too quickly; b) in contrast, if \u03b7k is too small, we observe too many columns in full and so the sample complexity increases. 
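For intuition, the streaming loop of Algorithm 1 can be sketched in a few lines of NumPy. The threshold schedule `eta` is left as a parameter (the paper sets it proportional to the noise level); the names and structure here are our own simplification, not the authors' implementation:

```python
import numpy as np

def lifelong_complete(columns, d, eta, rng):
    """Stream columns, test each one against the current basis on d random
    coordinates, and either complete it from the samples or measure it in
    full and grow the basis (cf. Algorithm 1)."""
    m = len(columns[0])
    U = np.zeros((m, 0))                      # current orthonormal basis estimate
    omega = rng.choice(m, size=d, replace=True)
    recovered = []
    for x in columns:                         # x plays the role of M_{:t}
        U_om, x_om = U[omega, :], x[omega]
        # projection residual on the sampled coordinates
        resid = x_om - U_om @ (np.linalg.pinv(U_om) @ x_om)
        if np.linalg.norm(resid) > eta(U.shape[1]):
            # fully measure the column, add it to the basis, re-orthogonalize
            U = np.linalg.qr(np.column_stack([U, x]))[0]
            omega = rng.choice(m, size=d, replace=True)
            recovered.append(x.copy())
        else:
            # represent the column with the current basis from the samples
            recovered.append(U @ (np.linalg.pinv(U_om) @ x_om))
    return U, np.column_stack(recovered)
```

In the noiseless case, running this sketch with a tiny threshold recovers a low-rank matrix exactly from d samples per column plus a few fully observed columns.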
Our goal in this paper is to capture this trade-off, providing a global upper bound on the estimation error over the life-long arriving columns while keeping the sample complexity as small as possible.\n\n3.1.1 Recovery Guarantee\n\nOur analysis leads to the following guarantee on the performance of Algorithm 1.\nTheorem 1 (Robust Recovery under Deterministic Noise). Let r be the rank of the underlying matrix L with \u00b50-incoherent column space. Suppose that the $\\ell_2$ norm of the noise in each column is upper bounded by \u03b5_noise. Set the parameters $d \\ge c(\\mu_0 r + mk\\epsilon_{\\text{noise}}) \\log^2(2n/\\delta)$ and $\\eta_k = C\\sqrt{dk\\epsilon_{\\text{noise}}/m}$ for global constants c and C. Then with probability at least $1 - \\delta$, Algorithm 1 outputs $\\hat{U}^K$ with $K \\le r$ and outputs $\\hat{M}$ with $\\ell_2$ error $\\|\\hat{M}_{:t} - L_{:t}\\|_2 \\le O\\big(\\frac{m}{\\sqrt{d}} \\sqrt{k\\epsilon_{\\text{noise}}}\\big)$ uniformly for all t (by our proof, the hidden constant factor is 9), where $k \\le r$ is the number of base vectors when processing the t-th column.\nProof Sketch. We first show that our estimated subspace in each round is accurate. The key ingredient of our proof is a result relating the angle between the underlying subspace and the noisy one. Intuitively, the column space spanned by the noisy dictionary cannot be too far from the underlying subspace if the noise level is small. This is true only if the angle between each newly added vector and the column space of the current dictionary is large, as shown by the following lemma.\nLemma 2. Let $U^k = \\mathrm{span}\\{u_1, u_2, ..., u_k\\}$ and $\\tilde{U}^k = \\mathrm{span}\\{\\tilde{u}_1, \\tilde{u}_2, ..., \\tilde{u}_k\\}$ be two subspaces such that $\\theta(u_i, \\tilde{u}_i) \\le \\epsilon_{\\text{noise}}$ for all $i \\in [k]$. Let $\\gamma_k = \\sqrt{20k\\epsilon_{\\text{noise}}}$ and $\\theta(\\tilde{u}_i, \\tilde{U}^{i-1}) \\ge \\gamma_i$ for $i = 2, ..., k$. Then $\\theta(U^k, \\tilde{U}^k) \\le \\gamma_k/2$.\nWe then prove the correctness of our test in Step 2. Lemma 2 guarantees that the underlying subspace $U^k$ and our estimated one $\\tilde{U}^k$ cannot be too distinct. So, by the algorithm, projecting any vector onto the subspace spanned by $\\tilde{U}^k$ does not make too many mistakes, i.e., $\\theta(M_{:t}, \\tilde{U}^k) \\approx \\theta(M_{:t}, U^k)$. On the other hand, by a standard concentration argument, our test statistic $\\|M_{\\Omega t} - P_{\\tilde{U}^k_{\\Omega:}} M_{\\Omega t}\\|_2$ is close to $\\sqrt{d/m}\\,\\|M_{:t} - P_{\\tilde{U}^k} M_{:t}\\|_2$. Note that the latter term is determined by the angle $\\theta(M_{:t}, \\tilde{U}^k)$. Therefore, our test statistic in Step 2 is indeed an effective measure of $\\theta(M_{:t}, \\tilde{U}^k)$, or of $\\theta(L_{:t}, \\tilde{U}^k)$ since $L_{:t} \\approx M_{:t}$, as proven by the following novel result.\nLemma 3. Let $\\epsilon_k = 2\\gamma_k$, $\\gamma_k = \\sqrt{20k\\epsilon_{\\text{noise}}}$, and $k \\le r$. Suppose that we observe a set of coordinates $\\Omega \\subset [m]$ of size d uniformly at random with replacement, where $d \\ge c_0(\\mu_0 r + mk\\epsilon_{\\text{noise}}) \\log^2(2/\\delta)$. If $\\theta(L_{:t}, \\tilde{U}^k) \\le \\epsilon_k$, then with probability at least $1 - 4\\delta$ we have $\\|M_{\\Omega t} - P_{\\tilde{U}^k_{\\Omega:}} M_{\\Omega t}\\|_2 \\le C\\sqrt{dk\\epsilon_{\\text{noise}}/m}$. Conversely, if $\\theta(L_{:t}, \\tilde{U}^k) \\ge c\\epsilon_k$, then with probability at least $1 - 4\\delta$ we have $\\|M_{\\Omega t} - P_{\\tilde{U}^k_{\\Omega:}} M_{\\Omega t}\\|_2 \\ge C\\sqrt{dk\\epsilon_{\\text{noise}}/m}$, where $c_0$, c, and C are absolute constants.\nFinally, as both our dictionary and our statistic are accurate, the output error cannot be too large. A simple deduction with a union bound over all columns leads to Theorem 1.\nTheorem 1 implies a result in the noiseless setting as \u03b5_noise goes to zero. Indeed, with the sample size growing in the order of O(\u00b50nr log2 n), Algorithm 1 outputs a solution that is exact with probability at least $1 - 1/n^{10}$. 
To the best of our knowledge, this is the best sample complexity in the existing literature for noiseless matrix completion without additional side information [19, 22]. For the noisy setting, Algorithm 1 enjoys the same sample complexity O(\u00b50nr log2 n) as the noiseless case, if \u03b5_noise \u2264 \u0398(\u00b50r/(mk)). In addition, Algorithm 1 inherits the benefits of the adaptive sampling scheme. The vast majority of results in the passive sampling scenario require both row and column incoherence for exact/robust recovery [22]. In contrast, via adaptive sampling we can relax the incoherence assumption on the row space of the underlying matrix, and our results are therefore more broadly applicable.\nWe compare our result with several related lines of research. While many online matrix completion algorithms have been proposed recently, they either lack solid theoretical guarantees [17] or require strong assumptions on the streaming data [19, 21, 13, 18]. Specifically, Krishnamurthy et al. [18] proposed an algorithm that requires column subset selection in the noisy case, which might be impractical in the online setting as we cannot measure columns that do not arrive. Focusing on a similar online matrix completion problem, Lois et al. [21] assumed that a) there is a good initial estimate for the column space; b) the column space changes slowly; c) the base vectors of the column space are dense; d) the support of the measurements changes by at least a certain amount. In contrast, our assumptions are much simpler and more realistic.\nWe mention another related line of research: matched subspace detection. The goal of matched subspace detection is to decide whether an incomplete signal/vector lies within a given subspace [5, 4]. It is highly related to the procedure of our algorithm in each round, where we aim at determining whether an arriving vector belongs to a given subspace based on partial and noisy observations. 
Prior work targeting this problem formalizes the task as a hypothesis testing problem, assuming a specific random distribution on the noise, e.g., Gaussian, and choosing \u03b7k by fixing the probability of false alarm in the hypothesis test [5, 23]. Compared with this, our result makes no assumption on the noise structure/distribution.\n\n3.2 Sparse Random Noise\n\nIn this section, we discuss life-long matrix completion under a simpler noise model but with a stronger recovery guarantee. We assume that the noise is sparse, meaning that the total number of noisy columns is small compared to the total number of columns n. The noisy columns may arrive at any time, and each noisy column is assumed to be drawn i.i.d. from a non-degenerate distribution. Our goal is to exactly recover the underlying matrix and identify the noise with high probability.\nWe use an algorithm similar to Algorithm 1 to attack the problem, with \u03b7k = 0. The challenge is that here we frequently add noise vectors to the dictionary, and so we need to distinguish the noise from the clean columns and remove it from the dictionary at the end of the algorithm. To resolve this issue, we additionally record the support of the representation coefficients in each round when we represent the arriving vector by a linear combination of the columns in the dictionary matrix. On one hand, the noise vectors in the dictionary fail to represent any column, because they are random. So if the representation coefficient corresponding to a column in the dictionary is always 0, it is convincing to identify that column as noise. On the other hand, to avoid recognizing a true base vector as noise, we make a mild assumption that the underlying column space is identifiable. Typically, that means that for each direction in the underlying subspace, there are at least two clean data points having non-zero projection on that direction. 
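The coefficient-support bookkeeping described above admits a simple post-processing sketch: a random column that was added to the dictionary is never needed to represent a clean arrival, so its coefficient stays (numerically) zero. The function name and tolerance below are our own hypothetical choices:

```python
import numpy as np

def identify_noise_columns(dictionary, later_columns, tol=1e-8):
    """Flag dictionary columns whose least-squares representation
    coefficient is ~0 for every later arrival; by the argument above,
    these are the randomly drawn noise columns."""
    D = np.column_stack(dictionary)           # columns added during the stream
    used = np.zeros(D.shape[1], dtype=bool)
    for x in later_columns:
        coef, *_ = np.linalg.lstsq(D, x, rcond=None)
        used |= np.abs(coef) > tol
    return [i for i in range(D.shape[1]) if not used[i]]
```

When the dictionary has full column rank, the least-squares coefficients of a clean column are exactly its representation in the clean directions, so the noise column's coefficient vanishes.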
We argue that this assumption is indispensable, since without it there is an identifiability issue between the clean data and the noise. As an extreme example, we cannot identify the black point in Figure 1 as clean data or as noise if we make no assumption on the underlying subspace. To mitigate the problem, we assume that for each $i \\in [r]$ and a subspace $U^r$ with orthonormal basis, there are at least two columns $L_{:a_i}$ and $L_{:b_i}$ of L such that $[U^r]_{:i}^T L_{:a_i} \\neq 0$ and $[U^r]_{:i}^T L_{:b_i} \\neq 0$. The detailed algorithm can be found in the supplementary material.\n\n3.2.1 Upper Bound\n\nWe now provide upper and lower bounds on the sample complexity of the above algorithm for exact recovery of the underlying matrix. Our upper bound matches the lower bound up to a constant factor. We then analyze a more benign setting, namely, data lying in a mixture of low-rank subspaces with dimensionality \u03c4 \u226a r.\n\nTable 1: Comparisons of our sample complexity with the best prior results in the noise-free setting.\nPassive Sampling: complexity $O(\\mu_0 nr \\log^2(n/\\delta))$ [22]; lower bound $O(\\mu_0 nr \\log(n/\\delta))$ [10].\nAdaptive Sampling: complexity $O(\\mu_0 nr \\log^2(r/\\delta))$ [19] and $O(\\mu_0 nr \\log(r/\\delta))$ (Ours); lower bound $O(\\mu_0 nr \\log(r/\\delta))$ (Ours).\n\nOur analysis leads to the following guarantee on the performance of the above algorithm. The proof is in the supplementary material.\nTheorem 4 (Exact Recovery under Random Noise). Let r be the rank of the underlying matrix L with \u00b50-incoherent column space. Suppose that the noise $E_{s_0}$ of size m \u00d7 s0 is drawn from any non-degenerate distribution, and that the underlying subspace $U^r$ is identifiable. Then our algorithm exactly recovers the underlying matrix L, the column space $U^r$, and the outlier $E_{s_0}$ with probability at least 1 \u2212 \u03b4, provided that d \u2265 c\u00b50r log(r/\u03b4) and s0 \u2264 d \u2212 r \u2212 1. The total sample complexity is thus c\u00b50rn log(r/\u03b4), where c is a universal constant.\nTheorem 4 implies an immediate result in the noise-free setting as \u03b5_noise goes to zero. In particular, O(\u00b50nr log(r/\u03b4)) measurements are sufficient so that our algorithm outputs a solution that is exact with probability at least 1 \u2212 \u03b4. This sample complexity improves over the existing results of $O(\\mu_0 nr \\log^2(n/\\delta))$ [22] and $O(\\mu_0 nr^{3/2} \\log(r/\\delta))$ [18], and over the $O(\\mu_0 nr \\log^2(r/\\delta))$ of Theorem 1 when \u03b5_noise = 0. Indeed, our sample complexity O(\u00b50nr log(r/\u03b4)) matches the lower bound, as shown by Theorem 5 (see Table 1 for comparisons of sample complexity). We note another paper, by Gittens [14], which showed that the Nystr\u00f6m method recovers a positive-semidefinite matrix of rank r from uniformly sampling O(\u00b50r log(r/\u03b4)) columns. While this result matches our sample complexity, the assumptions of positive-semidefiniteness and of subsampling the columns are impractical in the online setting.\n\nFigure 1: Identifiability. (a) Identifiable Subspace; (b) Unidentifiable Subspace.\n\nWe compare Theorem 4 with prior methods on decomposing an incomplete matrix as the sum of a low-rank term and a column-sparse term. Probably one of the best known algorithms is Robust PCA via Outlier Pursuit [25, 28, 27, 26]. Outlier Pursuit converts this problem to a convex program:\n\n$$\\min_{L,E} \\|L\\|_* + \\lambda \\|E\\|_{2,1}, \\quad \\text{s.t. } P_\\Omega M = P_\\Omega(L + E), \\qquad (2)$$\n\nwhere $\\|\\cdot\\|_*$ captures the low-rankness of the underlying subspace and $\\|\\cdot\\|_{2,1}$ captures the column-sparsity of the noise. 
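Both regularizers in the convex program (2) admit closed-form proximal operators, which is what makes Outlier Pursuit-style solvers practical. A sketch of the two operators (our own illustration, independent of the paper's algorithm):

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def col_shrink(X, tau):
    """Column-wise shrinkage: proximal operator of tau * ||.||_{2,1},
    the sum of column l2 norms, which promotes column-sparsity."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return X * scale
```

An alternating scheme (e.g., ADMM) would apply `svt` to the low-rank block and `col_shrink` to the column-sparse block at every iteration.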
Recent papers on Outlier Pursuit [26] prove that the solution to (2) exactly recovers the underlying subspace, provided that $d \\ge c_1 \\mu_0^2 r^2 \\log^3 n$ and $s_0 \\le c_2 d^4 n / (\\mu_0^5 r^5 m^3 \\log^6 n)$ for constants $c_1$ and $c_2$. Our result clearly outperforms the existing result in terms of the sample complexity d, while our dependence on s0 is not always better (although it is better in some cases) when n is large. Note that while Outlier Pursuit loads all columns simultaneously and so can exploit the global low-rank structure, our algorithm is online and therefore cannot tolerate too much noise.\n\n3.2.2 Lower Bound\n\nWe now establish a lower bound on the sample complexity. Our lower bound shows that in our adaptive sampling setting, one needs at least \u2126(\u00b50rn log(r/\u03b4)) samples in order to uniquely identify a certain matrix in the worst case. This lower bound matches our analysis of the upper bound in Section 3.2.1.\nTheorem 5 (Lower Bound on Sample Complexity). Let 0 < \u03b4 < 1/2, and \u2126 \u223c Uniform(d) be the index set of the row sampling \u2286 [m]. Suppose that $U^r$ is \u00b50-incoherent. If the total number of samples dn < c\u00b50rn log(r/\u03b4) for a constant c, then with probability at least 1 \u2212 \u03b4, there is an example of M such that under the sampling model of Section 2.1 (i.e., when a column arrives the choices are either (a) randomly sample or (b) view the entire column), there exist infinitely many matrices L\u2032 of rank r obeying the \u00b50-incoherence condition on the column space such that $L'_{\\Omega:} = L_{\\Omega:}$.\nThe proof can be found in the supplementary material. We mention several lower bounds on the sample complexity for passive matrix completion. The first is the paper of Cand\u00e8s and Tao [10], which gives a lower bound of \u2126(\u00b50nr log(n/\u03b4)) if the matrix has both incoherent rows and columns. Taking a weaker assumption, Krishnamurthy and Singh [18, 19] showed that if the row space is coherent, any passive sampling scheme followed by any recovery algorithm must take \u2126(mn) measurements. In contrast, Theorem 5 demonstrates that in the absence of row-space incoherence, exact recovery of the matrix is possible with only \u2126(\u00b50nr log(r/\u03b4)) samples, if the sampling scheme is adaptive.\n\nFigure 2: Subspace structure. (a) Single Subspace; (b) Mixture of Subspaces.\n\n3.2.3 Extension to Mixture of Subspaces\n\nTheorem 5 gives a lower bound on the sample complexity in the worst case. In this section, we explore the possibility of further reducing the sample complexity with more complex common structure. We assume that the underlying subspace is a mixture of h independent subspaces3 [20], each of which is of dimension at most \u03c4 \u226a r. Such an assumption naturally models settings in which there are h different categories of movies/news while they share a certain commonality across categories. We can view this setting as a network with two layers: the first layer captures the overall subspace with r metafeatures; the second layer is an output layer, consisting of metafeatures each of which is a linear combination of only \u03c4 metafeatures in the first layer. See Figure 2 for a visualization. Our argument shows that the sparse connections between the two layers significantly improve the sample complexity.\nAlgorithmically, given a new column, we uniformly sample $\\tilde{O}(\\tau \\log r)$ entries as our observations. 
We try to represent those elements by\na sparse linear combination of only \u03c4 columns in the basis matrix,\nwhose rows are truncated to those sampled indices; If we fail, we measure the column in full, add that\ncolumn into the dictionary, and repeat the procedure for the next arriving column. See supplementary\nmaterial for the detailed algorithm.\nRegarding computational considerations, learning a \u03c4-sparse representation of a given vector w.r.t.\na known dictionary can be done in polynomial time if the dictionary matrix satis\ufb01es the restricted\nisometry property [9], or trivially if \u03c4 is a constant [2]. This can be done by applying (cid:96)1 minimization\nor brute-force algorithm, respectively. Indeed, many real datasets match the constant-\u03c4 assumption,\ne.g., face image [6] (each person lies on a subspace of dimension \u03c4 = 9), 3D motion trajectory [12]\n(each object lies on a subspace of dimension \u03c4 = 4), handwritten digits [16] (each script lies on a\nsubspace of dimension \u03c4 = 12), etc. So our algorithm is applicable for all these settings.\nTheoretically, the following theorem provides a strong guarantee for our algorithm. The proof can be\nfound in the supplementary material.\nTheorem 6 (Mixture of Subspaces). Let r be the rank of the underlying matrix L. Suppose that the\ncolumns of L lie on a mixture of identi\ufb01able and independent subspaces, each of which is of dimension\nat most \u03c4. Denote by \u00b5\u03c4 the maximal incoherence over all \u03c4-combinations of L. Let the noise model\nbe that of Theorem 4. Then our algorithm exactly recovers the underlying matrix L, the column space\nUr, and the outlier Es0 with probability at least 1 \u2212 \u03b4, provided that d \u2265 c\u00b5\u03c4 \u03c4 2 log (r/\u03b4) for some\nglobal constant c and s0 \u2264 d \u2212 \u03c4 \u2212 1. 
The total sample complexity is thus cµτ τ²n log(r/δ).
As a concrete example, if the incoherence parameter µτ is a global constant and the dimension τ of each subspace is far smaller than r, the sample complexity O(µτ nτ² log(r/δ)) is significantly better than the complexity O(µ0nr log(r/δ)) for the single-subspace structure in Theorem 4. This shows that the sparse connections between the two layers improve the sample complexity.

4 Experimental Results

Bounded Deterministic Noise: We verify the estimated error of our algorithm in Theorem 1 under bounded deterministic noise. Our synthetic data are generated as follows. We construct 5 base vectors {u_i}_{i=1}^5 ⊂ R^100 by sampling their entries from N(0, 1). The underlying matrix L is then generated by

L = [u_1 1_200^T, (Σ_{i=1}^2 u_i) 1_200^T, (Σ_{i=1}^3 u_i) 1_200^T, (Σ_{i=1}^4 u_i) 1_200^T, (Σ_{i=1}^5 u_i) 1_{1,200}^T] ∈ R^{100×2,000},

each column of which is normalized to unit ℓ2 norm. Finally, we add bounded yet unstructured noise to each column, with noise level ε_noise = 0.6. We randomly pick 20% of the entries to be unobserved. The left figure in Figure 3 shows the comparison between our estimated error⁴ and the true error of our algorithm. The result demonstrates that, empirically, our estimated error successfully predicts the trend of the true algorithmic error.

Figure 3: Left Figure: Approximate recovery under bounded deterministic noise with estimated error. Right Two Figures: Exact recovery under sparse random noise with varying rank and sample size. White Region: Nuclear norm minimization (passive sampling) succeeds. White and Gray Regions: Our algorithm (adaptive sampling) succeeds. Black Region: Our algorithm fails. It shows that the success region of our algorithm strictly contains that of the passive sampling method.

³ h linear subspaces are independent if the dimensionality of their sum is equal to the sum of their dimensions.
⁴ The estimated error is up to a constant factor.

Sparse Random Noise: We then verify the exact recoverability of our algorithm under sparse random noise. The synthetic data are generated as follows. We construct the underlying matrix L = XY as a product of m × r and r × n i.i.d. N(0, 1) matrices. The sparse random noise is drawn from the standard Gaussian distribution such that s0 ≤ d − r − 1. For each problem size (50 × 500 and 100 × 1,000), we test different rank ratios r/m and measurement ratios d/m. Each experiment is run 10 times. We say that the algorithm succeeds if ‖L̂ − L‖_F ≤ 10⁻⁶, rank(L̂) = r, and the recovered support of the noise is exact for at least one experiment. The right two figures in Figure 3 plot the fraction of correct recoveries: white denotes perfect recovery by the nuclear norm minimization approach (2); white+gray represents perfect recovery by our algorithm; black indicates failure of both methods. The plots show that the success region of our algorithm strictly contains that of the prior approach. Moreover, the phase transition of our algorithm is nearly a linear function of r and d. This is consistent with our prediction d = Ω(µ0r log(r/δ)) when δ is small, e.g., poly(1/n).

Mixture of Subspaces: To test the performance of our algorithm on a mixture of subspaces, we conduct an experiment on the Hopkins 155 dataset. The Hopkins 155 database is composed of 155 matrices/tasks, each of which consists of multiple data points drawn from two or three motion objects. The trajectory of each object lies in a subspace. We input the data matrix to our algorithm with varying sample sizes.
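Returning to the synthetic sparse-random-noise setup above, the data generation and the success criterion can be written out as a short sketch. The helper names `make_synthetic` and `succeeded` are ours, and corrupting whole randomly chosen columns with Gaussian noise is our reading of the model, not the paper's exact generator; the support-recovery check is omitted for brevity.

```python
import numpy as np

def make_synthetic(m, n, r, s0, seed=0):
    """Rank-r ground truth L = XY plus s0 randomly corrupted columns."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, r))
    Y = rng.standard_normal((r, n))
    L = X @ Y                                    # rank-r underlying matrix
    E = np.zeros((m, n))
    bad = rng.choice(n, size=s0, replace=False)  # random corrupted columns
    E[:, bad] = rng.standard_normal((m, s0))     # Gaussian column noise
    return L, E, bad

def succeeded(L_hat, L, r, tol=1e-6):
    """Success criterion from the experiments:
    Frobenius error below tol and exact rank recovery."""
    return (np.linalg.norm(L_hat - L) <= tol     # ||L_hat - L||_F <= 1e-6
            and np.linalg.matrix_rank(L_hat) == r)
```

Sweeping `r/m` and `d/m` over a grid and recording `succeeded` over repeated trials produces phase-transition plots of the kind shown in Figure 3.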
Table 2 records the average relative error ‖L̂ − L‖_F/‖L‖_F over 10 trials for the first five tasks in the dataset. It shows that our algorithm is able to recover the target matrix with high accuracy. Another experiment comparing the sample complexity of a single subspace vs. a mixture of subspaces can be found in the supplementary material.

Table 2: Life-long Matrix Completion on the first 5 tasks in the Hopkins 155 database.

#Task  Motion Number  d = 0.8m    d = 0.85m   d = 0.9m    d = 0.95m
#1     2              2.6 × 10⁻³  9.4 × 10⁻³  3.4 × 10⁻³  6.0 × 10⁻³
#2     3              1.9 × 10⁻³  5.9 × 10⁻³  2.4 × 10⁻³  4.4 × 10⁻³
#3     2              7.2 × 10⁻⁴  6.3 × 10⁻³  2.8 × 10⁻³  4.8 × 10⁻³
#4     2              7.1 × 10⁻³  1.5 × 10⁻³  6.1 × 10⁻³  6.8 × 10⁻³
#5     2              8.7 × 10⁻³  1.2 × 10⁻³  3.1 × 10⁻³  5.8 × 10⁻³

5 Conclusions
In this paper, we study life-long matrix completion, which aims to recover an m × n matrix of rank r online, under two realistic noise models — bounded deterministic noise and sparse random noise. Our result advances the state-of-the-art work and matches the lower bound under sparse random noise. In a more benign setting where the columns of the underlying matrix lie on a mixture of subspaces, we show that a smaller sample complexity suffices to exactly recover the target matrix. It would be interesting to extend our results to other realistic noise models, including the random classification noise and malicious noise previously studied in the context of supervised classification [1, 3].
Acknowledgements.
This work was supported in part by grants NSF CCF-1535967, NSF CCF-1422910, NSF CCF-1451177, a Sloan Fellowship, and a Microsoft Research Fellowship.

References

[1] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. In ACM Symposium on Theory of Computing, pages 449–458. ACM, 2014.
[2] M.-F. Balcan, A. Blum, and S. Vempala. Efficient representations for life-long learning and autoencoding. In Annual Conference on Learning Theory, 2015.
[3] M.-F. F. Balcan and V. Feldman. Statistical active learning algorithms. In Advances in Neural Information Processing Systems, pages 1295–1303, 2013.
[4] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information. In Annual Allerton Conference on Communication, Control, and Computing, pages 704–711, 2010.
[5] L. Balzano, B. Recht, and R. Nowak. High-dimensional matched subspace detection when data are missing. In IEEE International Symposium on Information Theory, pages 1638–1642, 2010.
[6] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218–233, 2003.
[7] E. J. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[8] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[9] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information.
IEEE Transactions on Information Theory, 52(2):489–509, 2006.
[10] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[11] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI Conference on Artificial Intelligence, 2010.
[12] J. Costeira and T. Kanade. A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3):159–179, 1998.
[13] C. Dhanjal, R. Gaudel, and S. Clémencon. Online matrix completion through nuclear norm regularisation. In SIAM International Conference on Data Mining, pages 623–631, 2014.
[14] A. Gittens. The spectral norm error of the naïve Nyström extension. arXiv preprint arXiv:1110.5305, 2011.
[15] A. Gopnik, A. N. Meltzoff, and P. K. Kuhl. How Babies Think: The Science of Childhood. Phoenix, 2001.
[16] T. Hastie and P. Y. Simard. Metrics and models for handwritten character recognition. Statistical Science, pages 54–65, 1998.
[17] R. Kennedy, C. J. Taylor, and L. Balzano. Online completion of ill-conditioned low-rank matrices. In IEEE Global Conference on Signal and Information Processing, pages 507–511, 2014.
[18] A. Krishnamurthy and A. Singh. Low-rank matrix and tensor completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 836–844, 2013.
[19] A. Krishnamurthy and A. Singh. On the power of adaptivity in matrix completion and approximation. arXiv preprint arXiv:1407.3619, 2014.
[20] G. Lerman and T. Zhang. ℓp-recovery of the most significant subspace among multiple subspaces with outliers. Constructive Approximation, 40(3):329–385, 2014.
[21] B. Lois and N. Vaswani. Online matrix completion and online robust PCA.
In IEEE International Symposium on Information Theory, pages 1826–1830, 2015.
[22] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
[23] L. L. Scharf and B. Friedlander. Matched subspace detectors. IEEE Transactions on Signal Processing, 42(8):2146–2157, 1994.
[24] M. K. Warmuth and D. Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(10):2287–2320, 2008.
[25] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.
[26] H. Zhang, Z. Lin, and C. Zhang. Completing low-rank matrices with corrupted samples from few coefficients in general basis. IEEE Transactions on Information Theory, 62(8):4748–4768, 2016.
[27] H. Zhang, Z. Lin, C. Zhang, and E. Chang. Exact recoverability of robust PCA via outlier pursuit with tight recovery bounds. In AAAI Conference on Artificial Intelligence, pages 3143–3149, 2015.
[28] H. Zhang, Z. Lin, C. Zhang, and J. Gao. Relations among some low rank subspace recovery models. Neural Computation, 27:1915–1950, 2015.