{"title": "Learning Bound for Parameter Transfer Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2721, "page_last": 2729, "abstract": "We consider a transfer-learning problem by using the parameter transfer approach, where a suitable parameter of feature mapping is learned through one task and applied to another objective task. Then, we introduce the notion of the local stability of parametric feature mapping and parameter transfer learnability, and thereby derive a learning bound for parameter transfer algorithms. As an application of parameter transfer learning, we discuss the performance of sparse coding in self-taught learning. Although self-taught learning algorithms with plentiful unlabeled data often show excellent empirical performance, their theoretical analysis has not been studied. In this paper, we also provide the first theoretical learning bound for self-taught learning.", "full_text": "Learning Bound for Parameter Transfer Learning\n\nWataru Kumagai\n\nFaculty of Engineering\nKanagawa University\n\nkumagai@kanagawa-u.ac.jp\n\nAbstract\n\nWe consider a transfer-learning problem by using the parameter transfer approach, where a suitable parameter of feature mapping is learned through one task and applied to another objective task. Then, we introduce the notion of the local stability and parameter transfer learnability of parametric feature mapping, and thereby derive a learning bound for parameter transfer algorithms. As an application of parameter transfer learning, we discuss the performance of sparse coding in self-taught learning. Although self-taught learning algorithms with plentiful unlabeled data often show excellent empirical performance, their theoretical analysis has not been studied. 
In this paper, we also provide the first theoretical learning bound for self-taught learning.\n\n1 Introduction\n\nIn traditional machine learning, it is assumed that data are identically drawn from a single distribution. However, this assumption does not always hold in real-world applications. Therefore, it would be significant to develop methods capable of incorporating samples drawn from different distributions. In this case, transfer learning provides a general way to accommodate these situations. In transfer learning, besides the availability of relatively few samples related to an objective task, abundant samples in other domains, which are not necessarily drawn from an identical distribution, are available. Then, transfer learning aims at extracting some useful knowledge from data in other domains and applying the knowledge to improve the performance of the objective task. In accordance with the kind of knowledge that is transferred, approaches to solving transfer-learning problems can be classified into cases such as instance transfer, feature representation transfer, and parameter transfer (Pan and Yang (2010)). In this paper, we consider the parameter transfer approach, where some kind of parametric model is supposed and the transferred knowledge is encoded into parameters. Since the parameter transfer approach typically requires many samples to accurately learn a suitable parameter, unsupervised methods are often utilized for the learning process. In particular, transfer learning from unlabeled data for predictive tasks is known as self-taught learning (Raina et al. (2007)), where a joint generative model is not assumed to underlie unlabeled samples even though the unlabeled samples should be indicative of a structure that would subsequently be helpful in predicting tasks. In recent years, self-taught learning has been intensively studied, encouraged by the development of strong unsupervised methods. 
Furthermore, sparsity-based methods such as sparse coding or sparse neural networks have often been used in empirical studies of self-taught learning.\nAlthough many algorithms based on the parameter transfer approach have empirically demonstrated impressive performance in self-taught learning, some fundamental problems remain. First, the theoretical aspects of the parameter transfer approach have not been studied, and in particular, no learning bound has been obtained. Second, although it is believed that a large amount of unlabeled data helps to improve the performance of the objective task in self-taught learning, it has not been sufficiently clarified how many samples are required. Third, although sparsity-based methods are typically employed in self-taught learning, it is unknown how the sparsity works to guarantee the performance of self-taught learning.\nThe aim of the research presented in this paper is to shed light on the above problems. We first consider a general model of parametric feature mapping in the parameter transfer approach. Then, we newly formulate the local stability of parametric feature mapping and the parameter transfer learnability for this mapping, and provide a theoretical learning bound for parameter transfer learning algorithms based on the notions. Next, we consider the stability of sparse coding. Then we discuss the parameter transfer learnability by dictionary learning under the sparse model. Applying the learning bound for parameter transfer learning algorithms, we provide a learning bound of the sparse coding algorithm in self-taught learning.\nThis paper is organized as follows. In the remainder of this section, we refer to some related studies. In Section 2, we formulate the stability and the parameter transfer learnability of the parametric feature mapping. Then, we present a learning bound for parameter transfer learning. 
In Section 3, we show the stability of the sparse coding under perturbation of the dictionaries. Then, by imposing sparsity assumptions on samples and by considering dictionary learning, we derive the parameter transfer learnability for sparse coding. In particular, a learning bound is obtained for sparse coding in the setting of self-taught learning. In Section 4, we conclude the paper.\n\n1.1 Related Works\n\nApproaches to transfer learning can be classified into some cases based on the kind of knowledge being transferred (Pan and Yang (2010)). In this paper, we consider the parameter transfer approach. This approach can be applied to various notable algorithms such as sparse coding, multiple kernel learning, and deep learning since the dictionary, weights on kernels, and weights on the neural network are regarded as parameters, respectively. Then, those parameters are typically trained or tuned on samples that are not necessarily drawn from a target region. In the parameter transfer setting, a number of samples in the source region are often needed to accurately estimate the parameter to be transferred. Thus, it is desirable to be able to use unlabeled samples in the source region.\nSelf-taught learning corresponds to the case where only unlabeled samples are given in the source region while labeled samples are available in the target domain. In this sense, self-taught learning is compatible with the parameter transfer approach. Actually, in Raina et al. (2007) where self-taught learning was first introduced, the sparse coding-based method is employed and the parameter transfer approach is already used regarding the dictionary learnt from images as the parameter to be transferred. Although self-taught learning has been studied in various contexts (Dai et al. (2008); Lee et al. (2009); Wang et al. (2013); Zhu et al. (2013)), its theoretical aspects have not been sufficiently analyzed. 
One of the main results in this paper is to provide the first theoretical learning bound in self-taught learning with the parameter transfer approach. We note that our setting differs from the environment-based setting (Baxter (2000), Maurer (2009)), where a distribution on distributions on labeled samples, known as an environment, is assumed. In our formulation, the existence of the environment is not assumed and labeled data in the source region are not required.\nSelf-taught learning algorithms are often based on sparse coding. In the seminal paper by Raina et al. (2007), they already proposed an algorithm that learns a dictionary in the source region and transfers it to the target region. They also showed the effectiveness of the sparse coding-based method. Moreover, since remarkable progress has been made in unsupervised learning based on sparse neural networks (Coates et al. (2011), Le (2013)), unlabeled samples of the source domain in self-taught learning are often preprocessed by sparsity-based methods. Recently, a sparse coding-based generalization bound was studied (Mehta and Gray (2013); Maurer et al. (2012)) and the analysis in Section 3.1 is based on (Mehta and Gray (2013)).\n\n2 Learning Bound for Parameter Transfer Learning\n\n2.1 Problem Setting of Parameter Transfer Learning\n\nWe formulate parameter transfer learning in this subsection. We first briefly introduce notations and terminology in transfer learning (Pan and Yang (2010)). Let $\\mathcal{X}$ and $\\mathcal{Y}$ be a sample space and a label space, respectively. We refer to a pair of $\\mathcal{Z} := \\mathcal{X} \\times \\mathcal{Y}$ and a joint distribution $P(x, y)$ on $\\mathcal{Z}$ as a region. Then, a domain comprises a pair consisting of a sample space $\\mathcal{X}$ and a marginal probability $P(x)$ on $\\mathcal{X}$, and a task consists of a pair containing a label set $\\mathcal{Y}$ and a conditional distribution $P(y|x)$. In addition, let $\\mathcal{H} = \\{h : \\mathcal{X} \\to \\mathcal{Y}\\}$ be a hypothesis space and $\\ell : \\mathcal{Y} \\times \\mathcal{Y} \\to \\mathbb{R}_{\\ge 0}$ represent a loss function. 
Then, the expected risk and the empirical risk are defined by $R(h) := \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, h(x))]$ and $\\hat{R}_n(h) := \\frac{1}{n} \\sum_{j=1}^{n} \\ell(y_j, h(x_j))$, respectively. In the setting of transfer learning, besides samples from a region of interest known as a target region, it is assumed that samples from another region known as a source region are also available. We distinguish between the target and source regions by adding a subscript $T$ or $S$ to each notation introduced above (e.g., $P_T$, $R_S$). Then, the homogeneous setting (i.e., $\\mathcal{X}_S = \\mathcal{X}_T$) is not assumed in general, and thus, the heterogeneous setting (i.e., $\\mathcal{X}_S \\neq \\mathcal{X}_T$) can be treated. We note that self-taught learning, which is treated in Section 3, corresponds to the case when the label space $\\mathcal{Y}_S$ in the source region is the set of a single element.\nWe consider the parameter transfer approach, where the knowledge to be transferred is encoded into a parameter. The parameter transfer approach aims to learn a hypothesis with low expected risk for the target task by obtaining some knowledge about an effective parameter in the source region and transferring it to the target region. In this paper, we suppose that there are parametric models on both the source and target regions and that their parameter spaces are partly shared. Then, our strategy is to learn an effective parameter in the source region and then transfer a part of the parameter to the target region. We describe the formulation in the following. In the target region, we assume that $\\mathcal{Y}_T \\subset \\mathbb{R}$ and there is a parametric feature mapping $\\psi_\\theta : \\mathcal{X}_T \\to \\mathbb{R}^m$ on the target domain such that each hypothesis $h_{T,\\theta,w} : \\mathcal{X}_T \\to \\mathcal{Y}_T$ is represented by\n\n$$h_{T,\\theta,w}(x) := \\langle w, \\psi_\\theta(x) \\rangle \\qquad (1)$$\n\nwith parameters $\\theta \\in \\Theta$ and $w \\in \\mathcal{W}_T$, where $\\Theta$ is a subset of a normed space with a norm $\\|\\cdot\\|$ and $\\mathcal{W}_T$ is a subset of $\\mathbb{R}^m$. 
Then the hypothesis set in the target region is parameterized as\n\n$$\\mathcal{H}_T = \\{h_{T,\\theta,w} \\mid \\theta \\in \\Theta, w \\in \\mathcal{W}_T\\}.$$\n\nIn the following, we simply denote $R_T(h_{T,\\theta,w})$ and $\\hat{R}_T(h_{T,\\theta,w})$ by $R_T(\\theta, w)$ and $\\hat{R}_T(\\theta, w)$, respectively. In the source region, we suppose that there exists some kind of parametric model such as a sample distribution $P_{S,\\theta,w}$ or a hypothesis $h_{S,\\theta,w}$ with parameters $\\theta \\in \\Theta$ and $w \\in \\mathcal{W}_S$, and a part $\\Theta$ of the parameter space is shared with the target region. Then, let $\\theta_S^* \\in \\Theta$ and $w_S^* \\in \\mathcal{W}_S$ be parameters that are supposed to be effective in the source region (e.g., the true parameter of the sample distribution, the parameter of the optimal hypothesis with respect to the expected risk $R_S$); however, explicit assumptions are not imposed on the parameters. Then, the parameter transfer algorithm treated in this paper is described as follows. Let $N$- and $n$-samples be available in the source and target regions, respectively. First, a parameter transfer algorithm outputs the estimator $\\hat{\\theta}_N \\in \\Theta$ of $\\theta_S^*$ by using $N$-samples. Next, for the parameter\n\n$$w_T^* := \\operatorname*{argmin}_{w \\in \\mathcal{W}_T} R_T(\\theta_S^*, w)$$\n\nin the target region, the algorithm outputs its estimator\n\n$$\\hat{w}_{N,n} := \\operatorname*{argmin}_{w \\in \\mathcal{W}_T} \\hat{R}_{T,n}(\\hat{\\theta}_N, w) + \\rho r(w)$$\n\nby using $n$-samples, where $r(w)$ is a 1-strongly convex function with respect to $\\|\\cdot\\|_2$ and $\\rho > 0$.\nIf the source region relates to the target region in some sense, the effective parameter $\\theta_S^*$ in the source region is expected to also be useful for the target task. In the next subsection, we regard $R_T(\\theta_S^*, w_T^*)$ as the baseline of predictive performance and derive a learning bound.\n\n2.2 Learning Bound Based on Stability and Learnability\n\nWe newly introduce the local stability and the parameter transfer learnability as below. These notions are essential to derive a learning bound in Theorem 1.\nDefinition 1 (Local Stability). A parametric feature mapping $\\psi_\\theta$ is said to be locally stable if there exist $\\epsilon_\\theta : \\mathcal{X} \\to \\mathbb{R}_{>0}$ for each $\\theta \\in \\Theta$ and $L_\\psi > 0$ such that for $\\theta' \\in \\Theta$\n\n$$\\|\\theta - \\theta'\\| \\le \\epsilon_\\theta(x) \\implies \\|\\psi_\\theta(x) - \\psi_{\\theta'}(x)\\|_2 \\le L_\\psi \\|\\theta - \\theta'\\|.$$\n\nWe term $\\epsilon_\\theta(x)$ the permissible radius of perturbation for $\\theta$ at $x$. For samples $X_n = \\{x_1, \\ldots, x_n\\}$, we denote $\\epsilon_\\theta(X_n) := \\min_{j \\in [n]} \\epsilon_\\theta(x_j)$, where $[n] := \\{1, \\ldots, n\\}$ for a positive integer $n$. 
Next, we formulate the parameter transfer learnability based on the local stability.\nDefinition 2 (Parameter Transfer Learnability). Suppose that $N$-samples in the source domain and $n$-samples $X_n$ in the target domain are available. Let a parametric feature mapping $\\{\\psi_\\theta\\}_{\\theta \\in \\Theta}$ be locally stable. For $\\bar{\\delta} \\in [0, 1)$, $\\{\\psi_\\theta\\}_{\\theta \\in \\Theta}$ is said to be parameter transfer learnable with probability $1 - \\bar{\\delta}$ if there exists an algorithm that depends only on $N$-samples in the source domain such that the output $\\hat{\\theta}_N$ of the algorithm satisfies\n\n$$\\Pr\\left[ \\|\\hat{\\theta}_N - \\theta_S^*\\| \\le \\epsilon_{\\theta_S^*}(X_n) \\right] \\ge 1 - \\bar{\\delta}.$$\n\nIn the following, we assume that the parametric feature mapping is bounded as $\\|\\psi_\\theta(x)\\|_2 \\le R_\\psi$ for arbitrary $x \\in \\mathcal{X}$ and $\\theta \\in \\Theta$, and that linear predictors are also bounded as $\\|w\\|_2 \\le R_{\\mathcal{W}}$ for any $w \\in \\mathcal{W}$. In addition, we suppose that a loss function $\\ell(\\cdot, \\cdot)$ is $L_\\ell$-Lipschitz and convex with respect to the second variable. We denote $R_r := \\sup_{w \\in \\mathcal{W}} |r(w)|$. 
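As a rough illustration of the second stage of the algorithm in Section 2.1, the following Python sketch computes the regularized estimator of the target weights for fixed features. It is a minimal sketch under stated assumptions, not the author's implementation: the loss is taken to be the logistic loss with labels in {-1, +1} (1-Lipschitz and convex in its second argument), the regularizer is r(w) = ||w||_2^2 / 2 (1-strongly convex with respect to the 2-norm), the constraint set W_T is ignored, and the features are assumed to have been produced already by the transferred feature mapping. All function names are hypothetical. The value of rho follows the choice stated in Theorem 1.

```python
import math

def rho_from_theorem1(lip_loss, r_psi, r_reg, n, delta):
    # rho = L_loss * R_psi * sqrt(8 * (32 + log(2 / delta)) / (R_r * n))
    return lip_loss * r_psi * math.sqrt(8.0 * (32.0 + math.log(2.0 / delta)) / (r_reg * n))

def objective_gradient(w, feats, labels, rho):
    # Gradient of (1/n) * sum_j log(1 + exp(-y_j <w, f_j>)) + rho * ||w||^2 / 2.
    n = len(feats)
    grad = [rho * wi for wi in w]
    for f, y in zip(feats, labels):
        s = y * sum(wi * fi for wi, fi in zip(w, f))
        c = -y / (1.0 + math.exp(s)) / n  # derivative of the logistic loss
        for i, fi in enumerate(f):
            grad[i] += c * fi
    return grad

def fit_target_weights(feats, labels, rho, steps=500):
    # Plain gradient descent; the objective is smooth and rho-strongly convex.
    w = [0.0] * len(feats[0])
    # Upper bound on the curvature: the logistic loss has second derivative <= 1/4.
    smooth = 0.25 * max(sum(fi * fi for fi in f) for f in feats) + rho
    for _ in range(steps):
        g = objective_gradient(w, feats, labels, rho)
        w = [wi - gi / smooth for wi, gi in zip(w, g)]
    return w

# Toy target sample: features bounded by R_psi = 1, labels in {-1, +1}.
feats = [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.0], [-0.5, -0.5]]
labels = [1.0, 1.0, -1.0, -1.0]
# R_r = 0.5 corresponds to r(w) = ||w||^2 / 2 on the unit ball (R_W = 1).
rho = rho_from_theorem1(1.0, 1.0, 0.5, len(feats), 0.05)
w_hat = fit_target_weights(feats, labels, rho)
```

Because rho shrinks like 1/sqrt(n), the regularization becomes weaker as more target samples are available, which is the trade-off the bound in Theorem 1 balances.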
Then, the following learning bound is obtained, where the strong convexity of the regularization term $\\rho r(w)$ is essential.\nTheorem 1 (Learning Bound). Suppose that the parametric feature mapping $\\psi_\\theta$ is locally stable and an estimator $\\hat{\\theta}_N$ learned in the source region satisfies the parameter transfer learnability with probability $1 - \\bar{\\delta}$. When $\\rho = L_\\ell R_\\psi \\sqrt{\\frac{8(32 + \\log(2/\\delta))}{R_r n}}$, the following inequality holds with probability $1 - (\\delta + 2\\bar{\\delta})$:\n\n$$\\begin{aligned} R_T(\\hat{\\theta}_N, \\hat{w}_{N,n}) - R_T(\\theta_S^*, w_T^*) \\le{} & L_\\ell R_\\psi \\left( R_{\\mathcal{W}} \\sqrt{2 \\log(2/\\delta)} + 2 \\sqrt{2 R_r (32 + \\log(2/\\delta))} \\right) \\frac{1}{\\sqrt{n}} \\\\ & + L_\\ell L_\\psi R_{\\mathcal{W}} \\|\\hat{\\theta}_N - \\theta_S^*\\| \\\\ & + L_\\ell \\sqrt{L_\\psi R_{\\mathcal{W}} R_\\psi} \\left( \\frac{R_r}{2(32 + \\log(2/\\delta))} \\right)^{1/4} n^{1/4} \\sqrt{\\|\\hat{\\theta}_N - \\theta_S^*\\|}. \\end{aligned} \\qquad (2)$$\n\nIf the estimation error $\\|\\hat{\\theta}_N - \\theta_S^*\\|$ can be evaluated in terms of the number $N$ of samples, Theorem 1 clarifies which term is dominant and, in particular, how large the number of samples in the source domain must be relative to the number of samples in the target domain.\n\n2.3 Proof of Learning Bound\n\nWe prove Theorem 1 in this subsection. In this proof, we omit the subscript $T$ for simplicity. In addition, we denote $\\theta_S^*$ simply by $\\theta^*$. 
We set\n\n$$\\hat{w}_n^* := \\operatorname*{argmin}_{w \\in \\mathcal{W}} \\frac{1}{n} \\sum_{j=1}^{n} \\ell(y_j, \\langle w, \\psi_{\\theta^*}(x_j) \\rangle) + \\rho r(w).$$\n\nThen, we have\n\n$$\\begin{aligned} R(\\hat{\\theta}_N, \\hat{w}_{N,n}) - R(\\theta^*, w^*) ={} & \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_{N,n}, \\psi_{\\hat{\\theta}_N}(x) \\rangle)] - \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_{N,n}, \\psi_{\\theta^*}(x) \\rangle)] \\\\ & + \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_{N,n}, \\psi_{\\theta^*}(x) \\rangle)] - \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_n^*, \\psi_{\\theta^*}(x) \\rangle)] \\\\ & + \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_n^*, \\psi_{\\theta^*}(x) \\rangle)] - \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle w^*, \\psi_{\\theta^*}(x) \\rangle)]. \\end{aligned} \\qquad (3)$$\n\nIn the following, we bound the three parts of (3). First, we have the following inequality with probability $1 - (\\delta/2 + \\bar{\\delta})$:\n\n$$\\begin{aligned} & \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_{N,n}, \\psi_{\\hat{\\theta}_N}(x) \\rangle)] - \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_{N,n}, \\psi_{\\theta^*}(x) \\rangle)] \\\\ & \\quad \\le L_\\ell R_{\\mathcal{W}} \\, \\mathbb{E}_{(x,y) \\sim P}\\left[ \\|\\psi_{\\hat{\\theta}_N}(x) - \\psi_{\\theta^*}(x)\\|_2 \\right] \\\\ & \\quad \\le L_\\ell R_{\\mathcal{W}} \\frac{1}{n} \\sum_{j=1}^{n} \\|\\psi_{\\hat{\\theta}_N}(x_j) - \\psi_{\\theta^*}(x_j)\\|_2 + L_\\ell R_{\\mathcal{W}} R_\\psi \\sqrt{\\frac{2 \\log(2/\\delta)}{n}} \\\\ & \\quad \\le L_\\ell L_\\psi R_{\\mathcal{W}} \\|\\hat{\\theta}_N - \\theta^*\\| + L_\\ell R_{\\mathcal{W}} R_\\psi \\sqrt{\\frac{2 \\log(2/\\delta)}{n}}, \\end{aligned}$$\n\nwhere we used Hoeffding\u2019s inequality in the second inequality, and the local stability and parameter transfer learnability in the last inequality. 
Second, we have the following inequality with probability $1 - \\bar{\\delta}$:\n\n$$\\begin{aligned} & \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_{N,n}, \\psi_{\\theta^*}(x) \\rangle)] - \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_n^*, \\psi_{\\theta^*}(x) \\rangle)] \\\\ & \\quad \\le L_\\ell \\, \\mathbb{E}_{(x,y) \\sim P}\\left[ |\\langle \\hat{w}_{N,n}, \\psi_{\\theta^*}(x) \\rangle - \\langle \\hat{w}_n^*, \\psi_{\\theta^*}(x) \\rangle| \\right] \\\\ & \\quad \\le L_\\ell R_\\psi \\|\\hat{w}_{N,n} - \\hat{w}_n^*\\|_2 \\\\ & \\quad \\le L_\\ell R_\\psi \\sqrt{\\frac{2 L_\\ell L_\\psi R_{\\mathcal{W}}}{\\rho} \\|\\hat{\\theta}_N - \\theta^*\\|}, \\end{aligned} \\qquad (4)$$\n\nwhere the last inequality is derived by the strong convexity of the regularizer $\\rho r(w)$ in the Appendix.\nThird, the following holds by Theorem 1 of Sridharan et al. (2009) with probability $1 - \\delta/2$:\n\n$$\\begin{aligned} & \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_n^*, \\psi_{\\theta^*}(x) \\rangle)] - \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle w^*, \\psi_{\\theta^*}(x) \\rangle)] \\\\ & \\quad = \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle \\hat{w}_n^*, \\psi_{\\theta^*}(x) \\rangle) + \\rho r(\\hat{w}_n^*)] - \\mathbb{E}_{(x,y) \\sim P}[\\ell(y, \\langle w^*, \\psi_{\\theta^*}(x) \\rangle) + \\rho r(w^*)] + \\rho (r(w^*) - r(\\hat{w}_n^*)) \\\\ & \\quad \\le \\frac{8 L_\\ell^2 R_\\psi^2 (32 + \\log(2/\\delta))}{\\rho n} + \\rho R_r. \\end{aligned}$$\n\nThus, when $\\rho = L_\\ell R_\\psi \\sqrt{\\frac{8(32 + \\log(2/\\delta))}{R_r n}}$, we have (2) with probability $1 - (\\delta + 2\\bar{\\delta})$.\n\n3 Stability and Learnability in Sparse Coding\n\nIn this section, we consider the sparse coding in self-taught learning, where the source region essentially consists of the sample space $\\mathcal{X}_S$ without the label space $\\mathcal{Y}_S$. We assume that the sample spaces in both regions are $\\mathbb{R}^d$. 
Then, the sparse coding method treated here consists of a two-stage procedure, where a dictionary is learnt on the source region, and then a sparse coding with the learnt dictionary is used for a predictive task in the target region.\nFirst, we show that sparse coding satisfies the local stability in Section 3.1 and next explain that appropriate dictionary learning algorithms satisfy the parameter transfer learnability in Section 3.4. As a consequence of Theorem 1, we obtain the learning bound of self-taught learning algorithms based on sparse coding. We note that the results in this section are useful independent of transfer learning.\nWe here summarize the notations used in this section. Let $\\|\\cdot\\|_p$ be the $p$-norm on $\\mathbb{R}^d$. We define $\\mathrm{supp}(a) := \\{i \\in [m] \\mid a_i \\neq 0\\}$ for $a \\in \\mathbb{R}^m$. We denote the number of elements of a set $S$ by $|S|$. When a vector $a$ satisfies $\\|a\\|_0 = |\\mathrm{supp}(a)| \\le k$, $a$ is said to be $k$-sparse. We denote the ball with radius $R$ centered at 0 by $B_{\\mathbb{R}^d}(R) := \\{x \\in \\mathbb{R}^d \\mid \\|x\\|_2 \\le R\\}$. We set $\\mathcal{D} := \\{D = [d_1, \\ldots, d_m] \\in B_{\\mathbb{R}^d}(1)^m \\mid \\|d_j\\|_2 = 1 \\ (j = 1, \\ldots, m)\\}$ and call each $D \\in \\mathcal{D}$ a dictionary with size $m$.\nDefinition 3 (Induced matrix norm). For an arbitrary matrix $E = [e_1, \\ldots, e_m] \\in \\mathbb{R}^{d \\times m}$, the induced matrix norm$^{1)}$ is defined by $\\|E\\|_{1,2} := \\max_{i \\in [m]} \\|e_i\\|_2$.\nWe adopt $\\|\\cdot\\|_{1,2}$ to measure the difference of dictionaries since it is typically used in the framework of dictionary learning. We note that $\\|D - \\tilde{D}\\|_{1,2} \\le 2$ holds for arbitrary dictionaries $D, \\tilde{D} \\in \\mathcal{D}$.\n\n3.1 Local Stability of Sparse Representation\n\nWe show the local stability of sparse representation under a sparse model. 
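The induced norm of Definition 3 is simple to compute, and a short Python sketch may make the remark about the bound of 2 concrete. This is only an illustration under assumed conventions: a dictionary is stored as a list of its columns (atoms), and the helper names are hypothetical.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def normalize(v):
    # Rescale an atom to unit 2-norm, as required of every column of a dictionary.
    s = l2_norm(v)
    return [a / s for a in v]

def norm_1_2(columns):
    # ||E||_{1,2}: the largest 2-norm among the columns of E.
    return max(l2_norm(c) for c in columns)

def dictionary_distance(cols_a, cols_b):
    # ||A - B||_{1,2} for two dictionaries with the same shape.
    diff = [[a - b for a, b in zip(ca, cb)] for ca, cb in zip(cols_a, cols_b)]
    return norm_1_2(diff)

def matvec(columns, v):
    # E v written as a weighted sum of the columns of E.
    d = len(columns[0])
    return [sum(vj * col[i] for vj, col in zip(v, columns)) for i in range(d)]

# Two unit-norm dictionaries whose first atoms point in opposite directions.
dict_a = [normalize([3.0, 4.0]), [1.0, 0.0]]
dict_b = [[-a for a in dict_a[0]], [0.0, 1.0]]
distance = dictionary_distance(dict_a, dict_b)  # the upper bound 2 is attained here
```

Since every atom has unit norm, the triangle inequality caps the distance between two dictionaries at 2, and the example attains this cap by flipping one atom.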
A sparse representation with dictionary parameter $D$ of a sample $x \\in \\mathbb{R}^d$ is expressed as follows:\n\n$$\\varphi_D(x) := \\operatorname*{argmin}_{z \\in \\mathbb{R}^m} \\frac{1}{2} \\|x - Dz\\|_2^2 + \\lambda \\|z\\|_1,$$\n\nwhere $\\lambda > 0$ is a regularization parameter. This situation corresponds to the case where $\\theta = D$ and $\\psi_\\theta = \\varphi_D$ in the setting of Section 2.1. We prepare some notions to state the stability of the sparse representation. The following margin and incoherence were introduced by Mehta and Gray (2013).\n1) In general, the $(p, q)$-induced norm for $p, q \\ge 1$ is defined by $\\|E\\|_{p,q} := \\sup_{v \\in \\mathbb{R}^m, \\|v\\|_p = 1} \\|Ev\\|_q$. Then, $\\|\\cdot\\|_{1,2}$ in this general definition coincides with that in Definition 3 by Lemma 17 of Vainsencher et al. (2011).\nDefinition 4 ($k$-margin). Given a dictionary $D = [d_1, \\ldots, d_m] \\in \\mathcal{D}$ and a point $x \\in \\mathbb{R}^d$, the $k$-margin of $D$ on $x$ is\n\n$$\\mathcal{M}_{k,D}(x) := \\max_{\\mathcal{I} \\subset [m], |\\mathcal{I}| = m - k} \\min_{j \\in \\mathcal{I}} \\left\\{ \\lambda - |\\langle d_j, x - D\\varphi_D(x) \\rangle| \\right\\}.$$\n\nDefinition 5 ($\\mu$-incoherence). A dictionary matrix $D = [d_1, \\ldots, d_m] \\in \\mathcal{D}$ is termed $\\mu$-incoherent if $|\\langle d_i, d_j \\rangle| \\le \\mu/\\sqrt{d}$ for all $i \\neq j$.\nThen, the following theorem is obtained.\nTheorem 2 (Sparse Coding Stability). Let $D \\in \\mathcal{D}$ be $\\mu$-incoherent and $\\|D - \\tilde{D}\\|_{1,2} \\le \\lambda$. 
When\n\n$$\\|D - \\tilde{D}\\|_{1,2} \\le \\epsilon_{k,D}(x) := \\frac{\\mathcal{M}_{k,D}(x)^2 \\lambda}{64 \\max\\{1, \\|x\\|\\}^4}, \\qquad (5)$$\n\nthe following stability bound holds:\n\n$$\\|\\varphi_D(x) - \\varphi_{\\tilde{D}}(x)\\|_2 \\le \\frac{4 \\|x\\|^2 \\sqrt{k}}{(1 - \\mu k/\\sqrt{d}) \\lambda} \\|D - \\tilde{D}\\|_{1,2}.$$\n\nFrom Theorem 2, $\\epsilon_{k,D}(x)$ becomes the permissible radius of perturbation in Definition 1.\nHere, we refer to the relation with the sparse coding stability (Theorem 4) of Mehta and Gray (2013), who measured the difference of dictionaries by $\\|\\cdot\\|_{2,2}$ instead of $\\|\\cdot\\|_{1,2}$ and gave the permissible radius of perturbation as $\\mathcal{M}_{k,D}(x)^2 \\lambda$ except for a constant factor. Applying the simple inequality $\\|E\\|_{2,2} \\le \\sqrt{m} \\|E\\|_{1,2}$ for $E \\in \\mathbb{R}^{d \\times m}$, we can obtain a variant of the sparse coding stability with the norm $\\|\\cdot\\|_{1,2}$. However, then the dictionary size $m$ affects the permissible radius of perturbation and the stability bound of the sparse coding stability. On the other hand, the factor of $m$ does not appear in Theorem 2, and thus, the result is effective even for a large $m$. In addition, whereas $\\|x\\| \\le 1$ is assumed in Mehta and Gray (2013), Theorem 2 does not assume that $\\|x\\| \\le 1$ and clarifies the dependency on the norm $\\|x\\|$.\nIn existing studies related to sparse coding, the sparse representation $\\varphi_D(x)$ is modified as $\\varphi_D(x) \\otimes x$ (Mairal et al. (2009)) or $\\varphi_D(x) \\otimes (x - D\\varphi_D(x))$ (Raina et al. (2007)) where $\\otimes$ is the tensor product. 
By the stability of sparse representation (Theorem 2), it can be shown that such modified representations also have local stability.\n\n3.2 Sparse Modeling and Margin Bound\n\nIn this subsection, we assume a sparse structure for samples $x \\in \\mathbb{R}^d$ and specify a lower bound for the $k$-margin used in (5). The result obtained in this section plays an essential role in showing the parameter transfer learnability in Section 3.4.\nAssumption 1 (Model). There exists a dictionary matrix $D^*$ such that every sample $x$ is independently generated by a representation $a$ and noise $\\xi$ as\n\n$$x = D^* a + \\xi.$$\n\nMoreover, we impose the following three assumptions on the above model.\nAssumption 2 (Dictionary). The dictionary matrix $D^* = [d_1, \\ldots, d_m] \\in \\mathcal{D}$ is $\\mu$-incoherent.\nAssumption 3 (Representation). The representation $a$ is a random variable that is $k$-sparse (i.e., $\\|a\\|_0 \\le k$) and the non-zero entries are lower bounded by $C > 0$ (i.e., $a_i \\neq 0$ satisfy $|a_i| \\ge C$).\nAssumption 4 (Noise). The noise $\\xi$ is independent across coordinates and sub-Gaussian with parameter $\\sigma/\\sqrt{d}$ on each component.\nWe note that the assumptions do not require the representation $a$ or noise $\\xi$ to be identically distributed while those components are independent. This is essential because samples in the source and target domains cannot be assumed to be identically distributed in transfer learning.\nTheorem 3 (Margin Bound). Let $0 < t < 1$. 
We set\n\n$$\\delta_{t,\\lambda} := \\frac{2\\sigma}{(1-t)\\sqrt{d}\\lambda} \\exp\\left( -\\frac{(1-t)^2 d \\lambda^2}{8\\sigma^2} \\right) + \\frac{2\\sigma m}{\\sqrt{d}\\lambda} \\exp\\left( -\\frac{d\\lambda^2}{8\\sigma^2} \\right) + \\frac{4\\sigma k}{C \\sqrt{d(1 - \\mu k/\\sqrt{d})}} \\exp\\left( -\\frac{C^2 d (1 - \\mu k/\\sqrt{d})}{8\\sigma^2} \\right) + \\frac{8\\sigma(d-k)}{\\sqrt{d}\\lambda} \\exp\\left( -\\frac{d\\lambda^2}{32\\sigma^2} \\right). \\qquad (6)$$\n\nWe suppose that $d \\ge \\left\\{ \\left( 1 + \\frac{6}{1-t} \\right) \\mu k \\right\\}^2$ and $\\lambda = d^{-\\tau}$ for arbitrary $1/4 \\le \\tau \\le 1/2$. Under Assumptions 1-4, the following inequality holds with probability at least $1 - \\delta_{t,\\lambda}$:\n\n$$\\mathcal{M}_{k,D^*}(x) \\ge t\\lambda. \\qquad (7)$$\n\nWe refer to the regularization parameter $\\lambda$. An appropriate reflection of the sparsity of samples requires the regularization parameter $\\lambda$ to be set suitably. According to Theorem 4 of Zhao and Yu (2006)$^{2)}$, when samples follow the sparse model as in Assumptions 1-4 and $\\lambda \\cong d^{-\\tau}$ for $1/4 \\le \\tau \\le 1/2$, the representation $\\varphi_D(x)$ reconstructs the true sparse representation $a$ of sample $x$ with a small error. In particular, when $\\tau = 1/4$ (i.e., $\\lambda \\cong d^{-1/4}$) in Theorem 3, the failure probability $\\delta_{t,\\lambda} \\cong e^{-\\sqrt{d}}$ on the margin is guaranteed to become sub-exponentially small with respect to dimension $d$ and is negligible for the high-dimensional case. On the other hand, the typical choice $\\tau = 1/2$ (i.e., $\\lambda \\cong d^{-1/2}$) does not provide a useful result because $\\delta_{t,\\lambda}$ is not small at all.\n\n3.3 Proof of Margin Bound\n\nWe give a sketch of proof of Theorem 3. 
We denote the first term, the second term and the sum of the third and fourth terms of (6) by $\\delta_1$, $\\delta_2$ and $\\delta_3$, respectively. From Assumptions 1 and 3, a sample is represented as $x = D^* a + \\xi$ and $\\|a\\|_0 \\le k$. Without loss of generality, we assume that the first $m - k$ components of $a$ are 0 and the last $k$ components are not 0. Since\n\n$$\\mathcal{M}_{k,D^*}(x) \\ge \\min_{1 \\le j \\le m-k} \\lambda - \\langle d_j, x - D^* \\varphi_D(x) \\rangle = \\min_{1 \\le j \\le m-k} \\lambda - \\langle d_j, \\xi \\rangle - \\langle D^{*\\top} d_j, a - \\varphi_D(x) \\rangle,$$\n\nit is enough to show that the following holds for an arbitrary $1 \\le j \\le m - k$ to prove Theorem 3:\n\n$$\\Pr\\left[ \\langle d_j, \\xi \\rangle + \\langle D^{*\\top} d_j, a - \\varphi_D(x) \\rangle > (1 - t)\\lambda \\right] \\le \\delta_{t,\\lambda}. \\qquad (8)$$\n\nThen, (8) follows from the following inequalities:\n\n$$\\Pr\\left[ \\langle d_j, \\xi \\rangle > \\frac{1-t}{2} \\lambda \\right] \\le \\delta_1, \\qquad (9)$$\n\n$$\\Pr\\left[ \\langle D^{*\\top} d_j, a - \\varphi_D(x) \\rangle > \\frac{1-t}{2} \\lambda \\right] \\le \\delta_2 + \\delta_3. \\qquad (10)$$\n\nThe inequality (9) holds since $\\|d_j\\| = 1$ by the definition and Assumption 4. Thus, all we have to do is to show (10). We have\n\n$$\\begin{aligned} \\langle D^{*\\top} d_j, a - \\varphi_D(x) \\rangle & = \\langle [\\langle d_1, d_j \\rangle, \\ldots, \\langle d_m, d_j \\rangle]^\\top, a - \\varphi_D(x) \\rangle \\\\ & = \\langle (1_{\\mathrm{supp}(a - \\varphi_D(x))} \\circ [\\langle d_1, d_j \\rangle, \\ldots, \\langle d_m, d_j \\rangle])^\\top, a - \\varphi_D(x) \\rangle \\\\ & \\le \\|1_{\\mathrm{supp}(a - \\varphi_D(x))} \\circ [\\langle d_1, d_j \\rangle, \\ldots, \\langle d_m, d_j \\rangle]\\|_2 \\|a - \\varphi_D(x)\\|_2, \\end{aligned} \\qquad (11)$$\n\nwhere $u \\circ v$ is the Hadamard product (i.e., 
component-wise product) between $u$ and $v$, and $1_A$ for a set $A \\subset [m]$ is a vector whose $i$-th component is 1 if $i \\in A$ and 0 otherwise.\nApplying Theorem 4 of Zhao and Yu (2006) and using the condition for $\\lambda$, the following holds with probability $1 - \\delta_3$:\n\n$$\\mathrm{supp}(a) = \\mathrm{supp}(\\varphi_D(x)). \\qquad (12)$$\n\n2) Theorem 4 of Zhao and Yu (2006) is stated for Gaussian noise. However, it can be easily generalized to sub-Gaussian noise as in Assumption 4. Our setting corresponds to the case in which $c_1 = 1/2$, $c_2 = 1$, $c_3 = (\\log \\kappa + \\log \\log d)/\\log d$ for some $\\kappa > 1$ (i.e., $e^{d^{c_3}} \\cong d^\\kappa$) and $c_4 = c$ in Theorem 4 of Zhao and Yu (2006). Note that our regularization parameter $\\lambda$ corresponds to $\\lambda_d/d$ in (Zhao and Yu (2006)).\nMoreover, under (12), the following holds with probability $1 - \\delta_2$ by modifying Corollary 1 of Negahban et al. (2009) and using the condition for $\\lambda$:\n\n$$\\|a - \\varphi_D(x)\\|_2 \\le \\frac{6\\sqrt{k}\\lambda}{1 - \\mu k/\\sqrt{d}}. \\qquad (13)$$\n\nThus, if both of (12) and (13) hold, the right hand side of (11) is bounded as follows:\n\n$$\\|1_{\\mathrm{supp}(a - \\varphi_D(x))} \\circ [\\langle d_1, d_j \\rangle, \\ldots, \\langle d_m, d_j \\rangle]\\|_2 \\|a - \\varphi_D(x)\\|_2 \\le \\sqrt{|\\mathrm{supp}(a - \\varphi_D(x))|} \\, \\frac{\\mu}{\\sqrt{d}} \\cdot \\frac{6\\sqrt{k}\\lambda}{1 - \\mu k/\\sqrt{d}} = \\frac{6\\mu k}{\\sqrt{d} - \\mu k} \\lambda \\le \\frac{1-t}{2} \\lambda,$$\n\nwhere we used Assumption 2 in the first inequality, (12) and Assumption 3 in the equality and the condition for $d$ in the last inequality. 
From the above discussion, the left hand side of (10) is bounded by the sum of the probability $\\delta_3$ that (12) does not hold and the probability $\\delta_2$ that (12) holds but (13) does not hold.\n\n3.4 Transfer Learnability for Dictionary Learning\n\nWhen the true dictionary $D^*$ exists as in Assumption 1, we show that the output $\\hat{D}_N$ of a suitable dictionary learning algorithm from $N$ unlabeled samples satisfies the parameter transfer learnability for the sparse coding $\\phi_D$. Then, Theorem 1 guarantees the learning bound in self-taught learning since the discussion in this section does not assume the label space in the source region. This situation corresponds to the case where $\\theta_S^* = D^*$, $\\hat{\\theta}_N = \\hat{D}_N$ and $\\| \\cdot \\| = \\| \\cdot \\|_{1,2}$ in Section 2.1.\n\nWe show that an appropriate dictionary learning algorithm satisfies the parameter transfer learnability for the sparse coding $\\phi_D$ by focusing on the permissible radius of perturbation in (5) under some assumptions. When Assumptions 1-4 hold and $\\lambda = d^{-\\tau}$ for $1/4 \\le \\tau \\le 1/2$, the margin bound (7) for $x \\in \\mathcal{X}$ holds with probability $1 - \\delta_{t,\\lambda}$, and thus, we have\n\n$$\\epsilon_{k,D^*}(x) \\ge \\frac{t^2 \\lambda^3}{64 \\max\\{1, \\|x\\|\\}^4} = \\Theta(d^{-3\\tau}).$$\n\nThus, if a dictionary learning algorithm outputs the estimator $\\hat{D}_N$ such that\n\n$$\\|\\hat{D}_N - D^*\\|_{1,2} \\le O(d^{-3\\tau}) \\qquad (14)$$\n\nwith probability $1 - \\delta_N$, the estimator $\\hat{D}_N$ of $D^*$ satisfies the parameter transfer learnability for the sparse coding $\\phi_D$ with probability $\\bar{\\delta} = \\delta_N + n \\delta_{t,\\lambda}$. Then, by the local stability of the sparse representation and the parameter transfer learnability of such a dictionary learning algorithm, Theorem 1 guarantees that sparse coding in self-taught learning satisfies the learning bound in (2).\n\nWe note that Theorem 1 can apply to any dictionary learning algorithm as long as (14) is satisfied. For example, Arora et al. (2015) show that, when $k = O(\\sqrt{d} / \\log d)$, $m = O(d)$, and Assumptions 1-4 and some additional conditions hold, their dictionary learning algorithm outputs $\\hat{D}_N$ which satisfies\n\n$$\\|\\hat{D}_N - D^*\\|_{1,2} = O(d^{-M})$$\n\nwith probability $1 - d^{-M'}$ for arbitrarily large $M, M'$ as long as $N$ is sufficiently large.\n\n4 Conclusion\n\nWe derived a learning bound (Theorem 1) for a parameter transfer learning problem based on the local stability and parameter transfer learnability, which are newly introduced in this paper. Then, applying it to a sparse coding-based algorithm under a sparse model (Assumptions 1-4), we obtained the first theoretical guarantee of a learning bound in self-taught learning. Although we only consider sparse coding, the framework of parameter transfer learning includes other promising algorithms such as multiple kernel learning and deep neural networks, and thus, our results are expected to be useful for analyzing the theoretical performance of these algorithms. Finally, we note that our learning bound can be applied to settings other than self-taught learning because Theorem 1 includes the case in which labeled samples are available in the source region.\n\nReferences\n\n[1] S. Arora, R. Ge, T. Ma, and A. Moitra (2015) \u201cSimple, efficient, and neural algorithms for sparse coding,\u201d arXiv preprint arXiv:1503.00778.\n\n[2] J. Baxter (2000) \u201cA model of inductive bias learning,\u201d J. Artif. Intell. Res.(JAIR), Vol. 
12, p. 3.\n\n[3] A. Coates, A. Y. Ng, and H. Lee (2011) \u201cAn analysis of single-layer networks in unsupervised feature learning,\u201d in International conference on artificial intelligence and statistics, pp. 215\u2013223.\n\n[4] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu (2008) \u201cSelf-taught clustering,\u201d in Proceedings of the 25th international conference on Machine learning, pp. 200\u2013207, ACM.\n\n[5] Q. V. Le (2013) \u201cBuilding high-level features using large scale unsupervised learning,\u201d in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8595\u20138598, IEEE.\n\n[6] H. Lee, R. Raina, A. Teichman, and A. Y. Ng (2009) \u201cExponential Family Sparse Coding with Application to Self-taught Learning,\u201d in IJCAI, Vol. 9, pp. 1113\u20131119, Citeseer.\n\n[7] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach (2009) \u201cSupervised dictionary learning,\u201d in Advances in neural information processing systems, pp. 1033\u20131040.\n\n[8] A. Maurer (2009) \u201cTransfer bounds for linear feature learning,\u201d Machine learning, Vol. 75, pp. 327\u2013350.\n\n[9] A. Maurer, M. Pontil, and B. Romera-Paredes (2012) \u201cSparse coding for multitask and transfer learning,\u201d arXiv preprint arXiv:1209.0738.\n\n[10] N. Mehta and A. G. Gray (2013) \u201cSparsity-based generalization bounds for predictive sparse coding,\u201d in Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 36\u201344.\n\n[11] S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar (2009) \u201cA unified framework for high-dimensional analysis of M-estimators with decomposable regularizers,\u201d in Advances in Neural Information Processing Systems, pp. 1348\u20131356.\n\n[12] S. J. Pan and Q. Yang (2010) \u201cA survey on transfer learning,\u201d Knowledge and Data Engineering, IEEE Transactions on, Vol. 22, pp. 1345\u20131359.\n\n[13] R. Raina, A. 
Battle, H. Lee, B. Packer, and A. Y. Ng (2007) \u201cSelf-taught learning: transfer learning from unlabeled data,\u201d in Proceedings of the 24th international conference on Machine learning, pp. 759\u2013766, ACM.\n\n[14] K. Sridharan, S. Shalev-Shwartz, and N. Srebro (2009) \u201cFast rates for regularized objectives,\u201d in Advances in Neural Information Processing Systems, pp. 1545\u20131552.\n\n[15] D. Vainsencher, S. Mannor, and A. M. Bruckstein (2011) \u201cThe sample complexity of dictionary learning,\u201d The Journal of Machine Learning Research, Vol. 12, pp. 3259\u20133281.\n\n[16] H. Wang, F. Nie, and H. Huang (2013) \u201cRobust and discriminative self-taught learning,\u201d in Proceedings of The 30th International Conference on Machine Learning, pp. 298\u2013306.\n\n[17] P. Zhao and B. Yu (2006) \u201cOn model selection consistency of Lasso,\u201d The Journal of Machine Learning Research, Vol. 7, pp. 2541\u20132563.\n\n[18] X. Zhu, Z. Huang, Y. Yang, H. T. Shen, C. Xu, and J. Luo (2013) \u201cSelf-taught dimensionality reduction on the high-dimensional small-sized data,\u201d Pattern Recognition, Vol. 46, pp. 215\u2013229.", "award": [], "sourceid": 1391, "authors": [{"given_name": "Wataru", "family_name": "Kumagai", "institution": "Kanagawa University"}]}