{"title": "More data speeds up training time in learning halfspaces over sparse vectors", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 153, "abstract": "The increased availability of data in recent years led several authors to ask whether it is possible to use data as  a {\\em computational} resource. That is, if more data is available, beyond the sample complexity limit, is it possible to use the extra examples to speed up the computation time required to perform the learning task?  We give the first positive answer to this question for a {\\em natural supervised learning problem} --- we consider agnostic PAC learning of halfspaces over $3$-sparse vectors in $\\{-1,1,0\\}^n$. This class is inefficiently learnable using $O\\left(n/\\epsilon^2\\right)$ examples. Our main contribution is a novel, non-cryptographic, methodology for establishing computational-statistical gaps, which allows us to show that, under a widely believed assumption that refuting random $\\mathrm{3CNF}$ formulas is hard, efficiently learning this class using $O\\left(n/\\epsilon^2\\right)$ examples is impossible. We further show that under stronger hardness assumptions, even $O\\left(n^{1.499}/\\epsilon^2\\right)$ examples do not suffice.  On the other hand, we show a new algorithm that learns this class efficiently using $\\tilde{\\Omega}\\left(n^2/\\epsilon^2\\right)$ examples. 
This formally establishes the tradeoff between sample and computational complexity for a natural supervised learning problem.", "full_text": "More data speeds up training time in learning halfspaces over sparse vectors\n\nAmit Daniely\nDepartment of Mathematics\nThe Hebrew University\nJerusalem, Israel\n\nNati Linial\nSchool of CS and Eng.\nThe Hebrew University\nJerusalem, Israel\n\nShai Shalev-Shwartz\nSchool of CS and Eng.\nThe Hebrew University\nJerusalem, Israel\n\nAbstract\n\nThe increased availability of data in recent years has led several authors to ask\nwhether it is possible to use data as a computational resource. That is, if more\ndata is available, beyond the sample complexity limit, is it possible to use the\nextra examples to speed up the computation time required to perform the learning\ntask?\nWe give the first positive answer to this question for a natural supervised learning\nproblem \u2014 we consider agnostic PAC learning of halfspaces over 3-sparse vectors\nin $\\{-1,1,0\\}^n$. This class is inefficiently learnable using $O\\left(n/\\epsilon^2\\right)$ examples.\nOur main contribution is a novel, non-cryptographic, methodology for establishing\ncomputational-statistical gaps, which allows us to show that, under a widely\nbelieved assumption that refuting random 3CNF formulas is hard, it is impossible\nto efficiently learn this class using only $O\\left(n/\\epsilon^2\\right)$ examples. We further show that\nunder stronger hardness assumptions, even $O\\left(n^{1.499}/\\epsilon^2\\right)$ examples do not suffice.\nOn the other hand, we show a new algorithm that learns this class efficiently\nusing $\\tilde{\\Omega}\\left(n^2/\\epsilon^2\\right)$ examples. This formally establishes the tradeoff between sample\nand computational complexity for a natural supervised learning problem.\n\n1 Introduction\n\nIn the modern digital period, we are facing a rapid growth of available datasets in science and\ntechnology. 
In most computing tasks (e.g. storing and searching in such datasets), large datasets\nare a burden and require more computation. However, for learning tasks the situation is radically\ndifferent. A simple observation is that more data can never hinder you from performing a task. If\nyou have more data than you need, just ignore it!\nA basic question is how to learn from \u201cbig data\u201d. The statistical learning literature classically studies\nquestions like \u201chow much data is needed to perform a learning task?\u201d or \u201chow does accuracy improve\nas the amount of data grows?\u201d. In the modern \u201cdata revolution era\u201d, it is often the case that the\namount of data available far exceeds the information theoretic requirements. We can wonder whether\nthis seemingly redundant data can be used for other purposes. An intriguing question in this vein,\nstudied recently by several researchers ([Decatur et al., 1998, Servedio, 2000, Shalev-Shwartz et al.,\n2012, Berthet and Rigollet, 2013, Chandrasekaran and Jordan, 2013]), is the following:\n\nQuestion 1: Are there any learning tasks in which more data, beyond the information\ntheoretic barrier, can provably be leveraged to speed up computation\ntime?\n\nThe main contributions of this work are:\n\n\u2022 Conditioning on the hardness of refuting random 3CNF formulas, we give the first example\nof a natural supervised learning problem for which the answer to Question 1 is positive.\n\u2022 To prove this, we present a novel technique to establish computational-statistical tradeoffs\nin supervised learning problems. 
To the best of our knowledge, this is the first such result\nthat is not based on cryptographic primitives.\n\nAdditional contributions are non-trivial efficient algorithms for learning halfspaces over 2-sparse\nand 3-sparse vectors using $\\tilde{O}\\left(n/\\epsilon^2\\right)$ and $\\tilde{O}\\left(n^2/\\epsilon^2\\right)$ examples respectively.\n\nThe natural learning problem we consider is the task of learning the class of halfspaces over k-sparse\nvectors. Here, the instance space is the space of k-sparse vectors,\n\n$C_{n,k} = \\left\\{x \\in \\{-1,1,0\\}^n \\;:\\; |\\{i : x_i \\neq 0\\}| \\le k\\right\\}$,\n\nand the hypothesis class is halfspaces over k-sparse vectors, namely\n\n$H_{n,k} = \\left\\{h_{w,b} : C_{n,k} \\to \\{\\pm 1\\} \\;:\\; h_{w,b}(x) = \\mathrm{sign}(\\langle w, x\\rangle + b),\\ w \\in \\mathbb{R}^n,\\ b \\in \\mathbb{R}\\right\\}$,\n\nwhere $\\langle\\cdot,\\cdot\\rangle$ denotes the standard inner product in Rn.\nWe consider the standard setting of agnostic PAC learning, which models the realistic scenario\nwhere the labels are not necessarily fully determined by some hypothesis from Hn,k. Note that in\nthe realizable case, i.e. when some hypothesis from Hn,k has zero error, the problem of learning\nhalfspaces is easy even over Rn.\nIn addition, we allow improper learning (a.k.a. representation independent learning), namely, the\nlearning algorithm is not restricted to output a hypothesis from Hn,k, but only should output a\nhypothesis whose error is not much larger than the error of the best hypothesis in Hn,k. This gives the\nlearner a lot of flexibility in choosing an appropriate representation of the problem. This additional\nfreedom to the learner makes it much harder to prove lower bounds in this model. Concretely, it is\nnot clear how to use standard reductions from NP hard problems in order to establish lower bounds\nfor improper learning (moreover, Applebaum et al. 
[2008] give evidence that such simple reductions\ndo not exist).\nThe classes Hn,k and similar classes have been studied by several authors (e.g. Long and Servedio\n[2013]). They naturally arise in learning scenarios in which the set of all possible features is very\nlarge, but each example has only a small number of active features. For example:\n\n\u2022 Predicting an advertisement based on a search query: Here, the possible features of each\ninstance are all English words, whereas the active features are only the set of words given\nin the query.\n\u2022 Learning Preferences [Hazan et al., 2012]: Here, we have n players. A ranking of the\nplayers is a permutation \u03c3 : [n] \u2192 [n] (think of \u03c3(i) as the rank of the i\u2019th player). Each\nranking induces a preference h\u03c3 over the ordered pairs, such that h\u03c3(i, j) = 1 iff i is ranked\nhigher than j. Namely,\n\n$h_\\sigma(i,j) = \\begin{cases} 1 & \\sigma(i) > \\sigma(j) \\\\ -1 & \\sigma(i) < \\sigma(j) \\end{cases}$\n\nThe objective here is to learn the class, Pn, of all possible preferences. The problem of\nlearning preferences is related to the problem of learning Hn,2: if we associate each pair\n(i, j) with the vector in Cn,2 whose i\u2019th coordinate is 1 and whose j\u2019th coordinate is \u22121,\nit is not hard to see that Pn \u2282 Hn,2: for every \u03c3, h\u03c3 = hw,0 for the vector w \u2208 Rn,\ngiven by wi = \u03c3(i). Therefore, every upper bound for Hn,2 implies an upper bound for\nPn, while every lower bound for Pn implies a lower bound for Hn,2. Since VC(Pn) = n\nand VC(Hn,2) = n + 1, the information theoretic barrier to learn these classes is $\\Theta\\left(n/\\epsilon^2\\right)$.\nIn Hazan et al. [2012] it was shown that Pn can be efficiently learnt using $O\\left(n\\log^3(n)/\\epsilon^2\\right)$\nexamples. In section 4, we extend this result to Hn,2.\n\nWe will show a positive answer to Question 1 for the class Hn,3. 
To do so, we show\u00b9 the following:\n\n1. Ignoring computational issues, it is possible to learn the class Hn,3 using $O\\left(n/\\epsilon^2\\right)$ examples.\n2. It is also possible to efficiently learn Hn,3 if we are provided with a larger training set (of\nsize $\\tilde{\\Omega}\\left(n^2/\\epsilon^2\\right)$). This is formalized in Theorem 4.1.\n3. It is impossible to efficiently learn Hn,3, if we are only provided with a training set of\nsize $O\\left(n/\\epsilon^2\\right)$ under Feige\u2019s assumption regarding the hardness of refuting random 3CNF\nformulas [Feige, 2002]. Furthermore, for every \u03b1 \u2208 [0, 0.5), it is impossible to learn\nefficiently with a training set of size $O\\left(n^{1+\\alpha}/\\epsilon^2\\right)$ under a stronger hardness assumption. This\nis formalized in Theorem 3.1.\n\n\u00b9 In fact, similar results hold for every constant k \u2265 3. Indeed, since Hn,3 \u2282 Hn,k for every k \u2265 3, it is\ntrivial that item 3 holds for every k \u2265 3. The upper bound given in item 1 holds for every k. For item 2,\n\nA graphical illustration of our main results is given below:\n\n[Figure: runtime (vertical axis, with marks $2^{O(n)}$, $> \\mathrm{poly}(n)$ and $n^{O(1)}$) as a function of the number of examples (horizontal axis, with marks $n$, $n^{1.5}$ and $n^2$).]\n\nThe proof of item 1 above is easy \u2013 simply note that Hn,3 has VC dimension n + 1.\nItem 2 is proved in section 4, relying on the results of Hazan et al. [2012]. We note, however, that\na weaker result, that still suffices for answering Question 1 in the affirmative, can be proven using\na naive improper learning algorithm. In particular, we show below how to learn Hn,3 efficiently\nwith a sample of $\\Omega\\left(n^3/\\epsilon^2\\right)$ examples. The idea is to replace the class Hn,3 with the class $\\{\\pm 1\\}^{C_{n,3}}$\ncontaining all functions from Cn,3 to {\u00b11}. 
Clearly, this class contains Hn,3.\nIn addition, we can efficiently find a function f that minimizes the empirical training error over a training set S\nas follows: For every x \u2208 Cn,3, if x does not appear at all in the training set we will set f (x)\narbitrarily to 1. Otherwise, we will set f (x) to be the majority of the labels in the training set\nthat correspond to x. Finally, note that the VC dimension of $\\{\\pm 1\\}^{C_{n,3}}$ is $O(n^3)$ (since\n$|C_{n,3}| = O(n^3)$). Hence, standard generalization results (e.g. Vapnik [1995]) imply that a training set\nsize of $\\Omega\\left(n^3/\\epsilon^2\\right)$ suffices for learning this class.\n\nItem 3 is shown in section 3 by presenting a novel technique for establishing statistical-computational\ntradeoffs.\nThe class Hn,2. Our main result gives a positive answer to Question 1 for the task of improperly\nlearning Hn,k for k \u2265 3. A natural question is what happens for k = 2 and k = 1. Since\nVC(Hn,1) = VC(Hn,2) = n + 1, the information theoretic barrier for learning these classes is\n$\\Theta\\left(n/\\epsilon^2\\right)$. In section 4, we prove that Hn,2 (and, consequently, Hn,1 \u2282 Hn,2) can be learnt using\n$O\\left(n\\log^3(n)/\\epsilon^2\\right)$ examples, indicating that significant computational-statistical tradeoffs start to\nmanifest themselves only for k \u2265 3.\n\n1.1 Previous approaches, difficulties, and our techniques\n\n[Decatur et al., 1998] and [Servedio, 2000] gave positive answers to Question 1 in the realizable\nPAC learning model. Under cryptographic assumptions, they showed that there exist binary learning\nproblems, in which more data can provably be used to speed up training time. [Shalev-Shwartz\net al., 2012] showed a similar result for the agnostic PAC learning model. In all of these papers, the\nmain idea is to construct a hypothesis class based on a one-way function. 
However, the constructed\nclasses are of a very synthetic nature, and are of almost no practical interest. This is mainly due\nto the construction technique which is based on one-way functions. In this work, instead of using\ncryptographic assumptions, we rely on the hardness of refuting random 3CNF formulas. The simplicity\nand flexibility of 3CNF formulas enable us to derive lower bounds for natural classes such\nas halfspaces.\n\n\u00b9 (continued) it is not hard to show that Hn,k can be learnt using a sample of $\\Omega\\left(n^k/\\epsilon^2\\right)$ examples by a naive improper learning\nalgorithm, similar to the algorithm we describe in this section for k = 3.\n\nRecently, [Berthet and Rigollet, 2013] gave a positive answer to Question 1 in the context of unsupervised\nlearning. Concretely, they studied the problem of sparse PCA, namely, finding a sparse vector\nthat maximizes the variance of unlabeled data. Conditioning on the hardness of the planted\nclique problem, they gave a positive answer to Question 1 for sparse PCA. Our work, as well as\nthe previous work of Decatur et al. [1998], Servedio [2000], Shalev-Shwartz et al. [2012], studies\nQuestion 1 in the supervised learning setup. We emphasize that unsupervised learning problems\nare radically different from supervised learning problems in the context of deriving lower bounds.\nThe main reason for the difference is that in supervised learning problems, the learner is allowed\nto employ improper learning, which gives it a lot of power in choosing an adequate representation\nof the data. For example, the upper bound we have derived for the class of sparse halfspaces\nswitched from representing hypotheses as halfspaces to representation of hypotheses as tables over\nCn,3, which made the learning problem easy from the computational perspective. The crux of the\ndifficulty in constructing lower bounds is due to this freedom of the learner in choosing a convenient\nrepresentation. 
This difficulty does not arise in the problem of sparse PCA detection, since there\nthe learner must output a good sparse vector. Therefore, it is not clear whether the approach given\nin [Berthet and Rigollet, 2013] can be used to establish computational-statistical gaps in supervised\nlearning problems.\n\n2 Background and notation\n\nFor a hypothesis class $H \\subset \\{\\pm 1\\}^X$ and a set Y \u2282 X, we define the restriction of H to Y by\n$H|_Y = \\{h|_Y : h \\in H\\}$. We denote by J = Jn the all-ones n \u00d7 n matrix. We denote the j\u2019th vector\nin the standard basis of Rn by ej.\n\n2.1 Learning Algorithms\n\nFor h : Cn,3 \u2192 {\u00b11} and a distribution D on Cn,3 \u00d7 {\u00b11} we denote the error of h w.r.t. D\nby $\\mathrm{Err}_D(h) = \\Pr_{(x,y)\\sim D}(h(x) \\neq y)$. For $H \\subset \\{\\pm 1\\}^{C_{n,3}}$ we denote the error of H w.r.t. D by\n$\\mathrm{Err}_D(H) = \\min_{h \\in H} \\mathrm{Err}_D(h)$. For a sample $S \\in (C_{n,3} \\times \\{\\pm 1\\})^m$ we denote by $\\mathrm{Err}_S(h)$ (resp.\n$\\mathrm{Err}_S(H)$) the error of h (resp. H) w.r.t. the empirical distribution induced by the sample S.\nA learning algorithm, L, receives a sample $S \\in (C_{n,3} \\times \\{\\pm 1\\})^m$ and returns a hypothesis L(S) :\nCn,3 \u2192 {\u00b11}. We say that L learns Hn,3 using m(n, \u03b5) examples if,\u00b2 for every distribution D on\nCn,3 \u00d7 {\u00b11} and a sample S of more than m(n, \u03b5) i.i.d. examples drawn from D,\n\n$\\Pr_S\\left(\\mathrm{Err}_D(L(S)) > \\mathrm{Err}_D(H_{n,3}) + \\epsilon\\right) < \\frac{1}{10}$\n\nThe algorithm L is efficient if it runs in polynomial time in the sample size and returns a hypothesis\nthat can be evaluated in polynomial time.\n\n2.2 Refuting random 3SAT formulas\n\nWe frequently view a boolean assignment to variables x1, . . . , xn as a vector in Rn. It is convenient,\ntherefore, to assume that boolean variables take values in {\u00b11} and to denote negation by \u201c \u2212 \u201d\n(instead of the usual \u201c\u00ac\u201d). 
An n-variables 3CNF clause is a boolean formula of the form\n\n$C(x) = (-1)^{j_1} x_{i_1} \\vee (-1)^{j_2} x_{i_2} \\vee (-1)^{j_3} x_{i_3}, \\quad x \\in \\{\\pm 1\\}^n$\n\nAn n-variables 3CNF formula is a boolean formula of the form\n\n$\\phi(x) = \\wedge_{i=1}^m C_i(x)$,\n\nwhere every Ci is a 3CNF clause. Define the value, Val(\u03c6), of \u03c6 as the maximal fraction of clauses\nthat can be simultaneously satisfied. If Val(\u03c6) = 1, we say that \u03c6 is satisfiable. By 3CNFn,m we\ndenote the set of 3CNF formulas with n variables and m clauses.\n\n\u00b2 For simplicity, we require the algorithm to succeed with probability of at least 9/10. This can be easily\namplified to probability of at least 1 \u2212 \u03b4, as in the usual definition of agnostic PAC learning, while increasing\nthe sample complexity by a factor of log(1/\u03b4).\n\nRefuting random 3CNF formulas has been studied extensively (see e.g. a special issue of TCS\nDubois et al. [2001]). It is known that for large enough \u2206 (\u2206 = 6 will suffice) a random formula in\n3CNFn,\u2206n is not satisfiable with probability 1 \u2212 o(1). Moreover, for every 0 \u2264 \u03b5 < 1/4, and a large\nenough \u2206 = \u2206(\u03b5), the value of a random formula in 3CNFn,\u2206n is \u2264 1 \u2212 \u03b5 with probability 1 \u2212 o(1).\nThe problem of refuting random 3CNF concerns efficient algorithms that provide a proof that a\nrandom 3CNF is not satisfiable, or far from being satisfiable. This can be thought of as a game\nbetween an adversary and an algorithm. The adversary should produce a 3CNF-formula. It can\neither produce a satisfiable formula, or, produce a formula uniformly at random. The algorithm\nshould identify whether the produced formula is random or satisfiable.\nFormally, let \u2206 : N \u2192 N and 0 \u2264 \u03b5 < 1/4. 
We say that an efficient algorithm, A, \u03b5-refutes random\n3CNF with ratio \u2206 if its input is $\\phi \\in 3\\mathrm{CNF}_{n,n\\Delta(n)}$, its output is either \u201ctypical\u201d or \u201cexceptional\u201d\nand it satisfies:\n\n\u2022 Soundness: If Val(\u03c6) \u2265 1 \u2212 \u03b5, then\n\n$\\Pr_{\\text{Rand. coins of } A}\\left(A(\\phi) = \\text{\u201cexceptional\u201d}\\right) \\ge \\frac{3}{4}$\n\n\u2022 Completeness: For every n,\n\n$\\Pr_{\\text{Rand. coins of } A,\\ \\phi \\sim \\mathrm{Uni}(3\\mathrm{CNF}_{n,n\\Delta(n)})}\\left(A(\\phi) = \\text{\u201ctypical\u201d}\\right) \\ge 1 - o(1)$\n\nBy a standard repetition argument, the probability of 3/4 can be amplified to $1 - 2^{-n}$, while efficiency\nis preserved. Thus, given such an (amplified) algorithm, if A(\u03c6) = \u201ctypical\u201d, then with confidence\nof $1 - 2^{-n}$ we know that Val(\u03c6) < 1 \u2212 \u03b5. Since for random $\\phi \\in 3\\mathrm{CNF}_{n,n\\Delta(n)}$, A(\u03c6) = \u201ctypical\u201d\nwith probability 1 \u2212 o(1), such an algorithm provides, for most 3CNF formulas, a proof that their\nvalue is less than 1 \u2212 \u03b5.\nNote that an algorithm that \u03b5-refutes random 3CNF with ratio \u2206 also \u03b5\u2032-refutes random 3CNF with\nratio \u2206 for every 0 \u2264 \u03b5\u2032 \u2264 \u03b5. Thus, the task of refuting random 3CNF\u2019s gets easier as \u03b5 gets smaller.\nMost of the research concerns the case \u03b5 = 0. Here, it is not hard to see that the task is getting easier\nas \u2206 grows. The best known algorithm [Feige and Ofek, 2007] 0-refutes random 3CNF with ratio\n$\\Delta(n) = \\Omega(\\sqrt{n})$. In Feige [2002] it was conjectured that for constant \u2206 no efficient algorithm can\nprovide a proof that a random 3CNF is not satisfiable:\nConjecture 2.1 (R3SAT hardness assumption \u2013 [Feige, 2002]). 
For every \u03b5 > 0 and for every large\nenough integer \u2206 > \u22060(\u03b5) there exists no efficient algorithm that \u03b5-refutes random 3CNF formulas\nwith ratio \u2206.\nIn fact, for all we know, the following conjecture may be true for every 0 \u2264 \u00b5 \u2264 0.5.\nConjecture 2.2 (\u00b5-R3SAT hardness assumption). For every \u03b5 > 0 and for every integer \u2206 > \u22060(\u03b5)\nthere exists no efficient algorithm that \u03b5-refutes random 3CNF with ratio $\\Delta \\cdot n^{\\mu}$.\n\nNote that Feige\u2019s conjecture is equivalent to the 0-R3SAT hardness assumption.\n\n3 Lower bounds for learning Hn,3\n\nTheorem 3.1 (main). Let 0 \u2264 \u00b5 \u2264 0.5. If the \u00b5-R3SAT hardness assumption (conjecture 2.2) is\ntrue, then there exists no efficient learning algorithm that learns the class Hn,3 using $O\\left(n^{1+\\mu}/\\epsilon^2\\right)$\nexamples.\n\nIn the proof of Theorem 3.1 we rely on the validity of a conjecture, similar to conjecture 2.2 for\n3-variables majority formulas. Following an argument from [Feige, 2002] (Theorem 3.2) the validity\nof the conjecture on which we rely for majority formulas follows from the validity of conjecture 2.2.\n\nDefine\n\n$\\forall (x_1, x_2, x_3) \\in \\{\\pm 1\\}^3, \\quad \\mathrm{MAJ}(x_1, x_2, x_3) := \\mathrm{sign}(x_1 + x_2 + x_3)$\n\nAn n-variables 3MAJ clause is a boolean formula of the form\n\n$C(x) = \\mathrm{MAJ}\\left((-1)^{j_1} x_{i_1}, (-1)^{j_2} x_{i_2}, (-1)^{j_3} x_{i_3}\\right), \\quad x \\in \\{\\pm 1\\}^n$\n\nAn n-variables 3MAJ formula is a boolean formula of the form\n\n$\\phi(x) = \\wedge_{i=1}^m C_i(x)$\n\nwhere the Ci\u2019s are 3MAJ clauses. By 3MAJn,m we denote the set of 3MAJ formulas with n variables\nand m clauses.\nTheorem 3.2 ([Feige, 2002]). Let 0 \u2264 \u00b5 \u2264 0.5. 
If the \u00b5-R3SAT hardness assumption is true, then\nfor every \u03b5 > 0 and for every large enough integer \u2206 > \u22060(\u03b5) there exists no efficient algorithm\nwith the following properties.\n\n\u2022 Its input is $\\phi \\in 3\\mathrm{MAJ}_{n,\\Delta n^{1+\\mu}}$, and its output is either \u201ctypical\u201d or \u201cexceptional\u201d.\n\u2022 If Val(\u03c6) \u2265 3/4 \u2212 \u03b5, then\n\n$\\Pr_{\\text{Rand. coins of } A}\\left(A(\\phi) = \\text{\u201cexceptional\u201d}\\right) \\ge \\frac{3}{4}$\n\n\u2022 For every n,\n\n$\\Pr_{\\text{Rand. coins of } A,\\ \\phi \\sim \\mathrm{Uni}(3\\mathrm{MAJ}_{n,\\Delta n^{1+\\mu}})}\\left(A(\\phi) = \\text{\u201ctypical\u201d}\\right) \\ge 1 - o(1)$\n\nNext, we prove Theorem 3.1.\nIn fact, we will prove a slightly stronger result. Namely, define\nthe subclass $H^d_{n,3} \\subset H_{n,3}$ of homogeneous halfspaces with binary weights, given by\n$H^d_{n,3} = \\{h_{w,0} : w \\in \\{\\pm 1\\}^n\\}$. As we show, under the \u00b5-R3SAT hardness assumption, it is impossible to\nefficiently learn this subclass using only $O\\left(n^{1+\\mu}/\\epsilon^2\\right)$ examples.\n\nProof idea: We will reduce the task of refuting random 3MAJ formulas with linear number of\nclauses to the task of (improperly) learning $H^d_{n,3}$ with linear number of samples. The first step will be\nto construct a transformation that associates every 3MAJ clause with two examples in Cn,3 \u00d7 {\u00b11},\nand every assignment with a hypothesis in $H^d_{n,3}$. As we will show, the hypothesis corresponding to\nan assignment \u03c8 is correct on the two examples corresponding to a clause C if and only if \u03c8 satisfies\nC. With that interpretation at hand, every 3MAJ formula \u03c6 can be thought of as a distribution D\u03c6\non Cn,3 \u00d7 {\u00b11}, which is the empirical distribution induced by \u03c6\u2019s clauses. It holds furthermore\nthat $\\mathrm{Err}_{D_\\phi}(H^d_{n,3}) = 1 - \\mathrm{Val}(\\phi)$.\nSuppose now that we are given an efficient learning algorithm for $H^d_{n,3}$, that uses $\\kappa\\, n/\\epsilon^2$ examples,\nfor some \u03ba > 0. 
To construct an efficient algorithm for refuting 3MAJ-formulas, we simply feed\nthe learning algorithm with $\\kappa\\, n/0.01^2$ examples drawn from D\u03c6 and answer \u201cexceptional\u201d if the error\nof the hypothesis returned by the algorithm is small. If \u03c6 is (almost) satisfiable, the algorithm is\nguaranteed to return a hypothesis with a small error. On the other hand, if \u03c6 is far from being\nsatisfiable, $\\mathrm{Err}_{D_\\phi}(H^d_{n,3})$ is large. If the learning algorithm is proper, then it must return a hypothesis\nfrom $H^d_{n,3}$ and therefore it would necessarily return a hypothesis with a large error. This argument\ncan be used to show that, unless NP = RP, learning $H^d_{n,3}$ with a proper efficient algorithm is\nimpossible. However, here we want to rule out improper algorithms as well.\nThe crux of the construction is that if \u03c6 is random, no algorithm (even improper and even inefficient)\ncan return a hypothesis with a small error. The reason for that is that since the sample provided\nto the algorithm consists of only $\\kappa\\, n/0.01^2$ samples, the algorithm won\u2019t see most of \u03c6\u2019s clauses,\nand, consequently, the produced hypothesis h will be independent of them. Since these clauses are\nrandom, h is likely to err on about half of them, so that $\\mathrm{Err}_{D_\\phi}(h)$ will be close to half!\nTo summarize, we constructed an efficient algorithm with the following properties: if \u03c6 is almost\nsatisfiable, the algorithm will return a hypothesis with a small error, and then we will declare\n\u201cexceptional\u201d, while for random \u03c6, the algorithm will return a hypothesis with a large error, and we will\ndeclare \u201ctypical\u201d.\n\nOur construction crucially relies on the restriction to learning algorithms with a small sample\ncomplexity. 
Indeed, if the learning algorithm obtains more than $n^{1+\\mu}$ examples, then it will see most\nof \u03c6\u2019s clauses, and therefore it might succeed in \u201clearning\u201d even when the source of the formula is\nrandom. Therefore, we will declare \u201cexceptional\u201d even when the source is random.\n\nProof. (of theorem 3.1) Assume by way of contradiction that the \u00b5-R3SAT hardness assumption is\ntrue and yet there exists an efficient learning algorithm that learns the class Hn,3 using $O\\left(n^{1+\\mu}/\\epsilon^2\\right)$\nexamples. Setting \u03b5 = 1/100, we conclude that there exist an efficient algorithm L and a constant\n\u03ba > 0 such that L, given a sample S of more than $\\kappa \\cdot n^{1+\\mu}$ examples drawn from a distribution D on\nCn,3 \u00d7 {\u00b11}, returns a classifier L(S) : Cn,3 \u2192 {\u00b11} such that\n\n\u2022 L(S) can be evaluated efficiently.\n\u2022 W.p. \u2265 3/4 over the choice of S, $\\mathrm{Err}_D(L(S)) \\le \\mathrm{Err}_D(H_{n,3}) + \\frac{1}{100}$.\n\nFix \u2206 large enough such that \u2206 > 100\u03ba and the conclusion of Theorem 3.2 holds with \u03b5 = 1/100. We\nwill construct an algorithm, A, contradicting Theorem 3.2. On input $\\phi \\in 3\\mathrm{MAJ}_{n,\\Delta n^{1+\\mu}}$ consisting\nof the 3MAJ clauses $C_1, \\ldots, C_{\\Delta n^{1+\\mu}}$, the algorithm A proceeds as follows\n\n1. Generate a sample S consisting of $\\Delta n^{1+\\mu}$ examples as follows. For every clause,\n$C_k = \\mathrm{MAJ}\\left((-1)^{j_1} x_{i_1}, (-1)^{j_2} x_{i_2}, (-1)^{j_3} x_{i_3}\\right)$, generate an example (xk, yk) \u2208 Cn,3 \u00d7 {\u00b11} by\nchoosing b \u2208 {\u00b11} at random and letting\n\n$(x_k, y_k) = b \\cdot \\left(\\sum_{l=1}^{3} (-1)^{j_l} e_{i_l},\\ 1\\right) \\in C_{n,3} \\times \\{\\pm 1\\}$.\n\nFor example, if n = 6, the clause is MAJ(\u2212x2, x3, x6) and b = \u22121, we generate the\nexample\n\n((0, 1, \u22121, 0, 0, \u22121), \u22121)\n\n2. 
Choose a sample S1 consisting of $\\Delta n^{1+\\mu}/100 \\ge \\kappa \\cdot n^{1+\\mu}$ examples by choosing at random\n(with repetitions) examples from S.\n\n3. Let h = L(S1). If $\\mathrm{Err}_S(h) \\le \\frac{3}{8}$, return \u201cexceptional\u201d. Otherwise, return \u201ctypical\u201d.\n\nWe claim that A contradicts Theorem 3.2. Clearly, A runs in polynomial time. It remains to show\nthat\n\n\u2022 If Val(\u03c6) \u2265 3/4 \u2212 1/100, then\n\n$\\Pr_{\\text{Rand. coins of } A}\\left(A(\\phi) = \\text{\u201cexceptional\u201d}\\right) \\ge \\frac{3}{4}$\n\n\u2022 For every n,\n\n$\\Pr_{\\text{Rand. coins of } A,\\ \\phi \\sim \\mathrm{Uni}(3\\mathrm{MAJ}_{n,\\Delta n^{1+\\mu}})}\\left(A(\\phi) = \\text{\u201ctypical\u201d}\\right) \\ge 1 - o(1)$\n\nAssume first that $\\phi \\in 3\\mathrm{MAJ}_{n,\\Delta n^{1+\\mu}}$ is chosen at random. Given the sample S1, the sample S2 :=\nS \\ S1 is a sample of |S2| i.i.d. examples which are independent from the sample S1, and hence also\nfrom h = L(S1). Moreover, for every example (xk, yk) \u2208 S2, yk is a Bernoulli random variable\nwith parameter 1/2 which is independent of xk. To see that, note that an example whose instance is xk\ncan be generated by exactly two clauses \u2013 one corresponds to yk = 1, while the other corresponds\nto yk = \u22121 (e.g., the instance (1, \u22121, 0, 1) can be generated from the clause MAJ(x1, \u2212x2, x4) and\nb = 1 or the clause MAJ(\u2212x1, x2, \u2212x4) and b = \u22121). Thus, given the instance xk, the probability\nthat yk = 1 is 1/2, independent of xk.\n\nIt follows that $\\mathrm{Err}_{S_2}(h)$ is an average of at least $\\left(1 - \\frac{1}{100}\\right)\\Delta n^{1+\\mu}$ independent Bernoulli random\nvariables. By Chernoff\u2019s bound, with probability \u2265 1 \u2212 o(1), $\\mathrm{Err}_{S_2}(h) > \\frac{1}{2} - \\frac{1}{100}$. Thus,\n\n$\\mathrm{Err}_S(h) \\ge \\left(1 - \\frac{1}{100}\\right) \\mathrm{Err}_{S_2}(h) \\ge \\left(1 - \\frac{1}{100}\\right) \\cdot \\left(\\frac{1}{2} - \\frac{1}{100}\\right) > \\frac{3}{8}$\n\nand the algorithm will output \u201ctypical\u201d.\nAssume now that Val(\u03c6) \u2265 3/4 \u2212 1/100 and let $\\psi \\in \\{\\pm 1\\}^n$ be an assignment that indicates that. Let\n\u03a8 \u2208 Hn,3 be the hypothesis $\\Psi(x) = \\mathrm{sign}(\\langle\\psi, x\\rangle)$. It can be easily checked that \u03a8(xk) = yk if and\nonly if \u03c8 satisfies Ck. Since Val(\u03c6) \u2265 3/4 \u2212 1/100, it follows that\n\n$\\mathrm{Err}_S(\\Psi) \\le \\frac{1}{4} + \\frac{1}{100}$.\n\nThus,\n\n$\\mathrm{Err}_S(H_{n,3}) \\le \\frac{1}{4} + \\frac{1}{100}$.\n\nBy the choice of L, with probability \u2265 1 \u2212 1/4 = 3/4,\n\n$\\mathrm{Err}_S(h) \\le \\frac{1}{4} + \\frac{1}{100} + \\frac{1}{100} < \\frac{3}{8}$\n\nand the algorithm will return \u201cexceptional\u201d.\n\n4 Upper bounds for learning Hn,2 and Hn,3\n\nThe following theorem derives upper bounds for learning Hn,2 and Hn,3. Its proof relies on results\nfrom Hazan et al. [2012] about learning \u03b2-decomposable matrices, and due to the lack of space is\ngiven in the appendix.\nTheorem 4.1.\n\n\u2022 There exists an efficient algorithm that learns Hn,2 using $O\\left(n\\log^3(n)/\\epsilon^2\\right)$ examples\n\u2022 There exists an efficient algorithm that learns Hn,3 using $O\\left(n^2\\log^3(n)/\\epsilon^2\\right)$ examples\n\n5 Discussion\n\nWe formally established a computational-sample complexity tradeoff for the task of (agnostically\nand improperly) PAC learning of halfspaces over 3-sparse vectors. Our proof of the lower bound\nrelies on a novel, non-cryptographic, technique for establishing such tradeoffs. We also derive a new\nnon-trivial upper bound for this task.\nOpen questions. An obvious open question is to close the gap between the lower and upper bounds.\nWe conjecture that Hn,3 can be learnt efficiently using a sample of $\\tilde{O}\\left(n^{1.5}/\\epsilon^2\\right)$ examples. Also, we\nbelieve that our new proof technique can be used for establishing computational-sample complexity\ntradeoffs for other natural learning problems.\n\nAcknowledgements: Amit Daniely is a recipient of the Google Europe Fellowship in Learning\nTheory, and this research is supported in part by this Google Fellowship. Nati Linial is supported\nby grants from ISF, BSF and I-Core. Shai Shalev-Shwartz is supported by the Israeli Science Foundation\ngrant number 590-10.\n\nReferences\n\nBenny Applebaum, Boaz Barak, and David Xiao. On basing lower-bounds for learning on worst-case\nassumptions. In Foundations of Computer Science, 2008. FOCS\u201908. IEEE 49th Annual IEEE\nSymposium on, pages 211\u2013220. IEEE, 2008.\n\nQuentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal\ncomponent detection. In COLT, 2013.\n\nNicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line\nlearning algorithms. IEEE Transactions on Information Theory, 50:2050\u20132057, 2001.\n\nVenkat Chandrasekaran and Michael I. Jordan. Computational and statistical tradeoffs via convex\nrelaxation. Proceedings of the National Academy of Sciences, 2013.\n\nS. Decatur, O. Goldreich, and D. Ron. Computational sample complexity. SIAM Journal on Computing,\n29, 1998.\n\nO. Dubois, R. Monasson, B. Selman, and R. Zecchina (Guest Editors). Phase Transitions in Combinatorial\nProblems. Theoretical Computer Science, Volume 265, Numbers 1-2, 2001.\n\nU. Feige. Relations between average case complexity and approximation complexity. In STOC,\npages 534\u2013543, 2002.\n\nUriel Feige and Eran Ofek. Easily refutable subformulas of large random 3cnf formulas. Theory of\nComputing, 3(1):25\u201343, 2007.\n\nE. Hazan, S. Kale, and S. Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. In\nCOLT, 2012.\n\nP. Long and R. Servedio. 
Low-weight halfspaces for sparse boolean vectors. In ITCS, 2013.\n\nR. Servedio. Computational sample complexity and attribute-efficient learning. J. of Comput. Syst.\nSci., 60(1):161\u2013178, 2000.\n\nShai Shalev-Shwartz, Ohad Shamir, and Eran Tromer. Using more data to speed-up training time.\nIn AISTATS, 2012.\n\nV.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.\n", "award": [], "sourceid": 143, "authors": [{"given_name": "Amit", "family_name": "Daniely", "institution": "Hebrew University"}, {"given_name": "Nati", "family_name": "Linial", "institution": "The Hebrew University"}, {"given_name": "Shai", "family_name": "Shalev-Shwartz", "institution": "The Hebrew University"}]}