Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Lorenzo Rosasco, Andrea Caponnetto, Ernesto De Vito, Francesca Odone, Umberto De Giovannini*

Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless, until now there was no formal evidence either that learning from examples could be seen as an inverse problem or that theoretical results in learning theory could be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem into the language of regularization theory and show that consistency results and an optimal regularization parameter choice can be derived by discretizing the corresponding inverse problem.

1 Introduction

The main goal of learning from examples is to infer an estimator, given a finite sample of data drawn according to a fixed but unknown probabilistic input-output relation. The desired property of the selected estimator is to perform well on new data, i.e. it should generalize. The fundamental works of Vapnik and further developments [16], [8], [5] show that the key to obtaining a meaningful solution to the above problem is to control the complexity of the solution space. Interestingly, as noted by [12], [8], [2], this is the idea underlying regularization techniques for ill-posed inverse problems [15], [7]. In that context, to avoid undesired oscillating behavior of the solution, we have to restrict the solution space.

Not surprisingly, the form of the algorithms proposed in both theories is strikingly similar. Nevertheless, a careful analysis shows that a rigorous connection between learning and regularization for inverse problems is not straightforward. In this paper we consider the square loss and show that the problem of learning can be translated into a convenient inverse problem, for which consistency results can be derived in a general setting. When a generic loss is considered the analysis immediately becomes more complicated. Some previous works on this subject considered the special case in which the elements of the input space are fixed and not probabilistically drawn [11], [9]. Some weaker results in the same spirit as those presented in this paper can be found in [13], where, however, the connection with inverse problems is not discussed. Finally, our analysis is close to the idea of stochastic inverse problems discussed in [16]. The plan of the paper follows. After recalling the main concepts and notation of learning and of inverse problems, in section 4 we develop a formal connection between the two theories. In section 5 the main results are stated and discussed. Finally, in section 6 we conclude with some remarks and open problems.

2 Learning from examples

We briefly recall some basic concepts of learning theory [16], [8]. In the framework of learning there are two sets of variables: the input space X, a compact subset of Rⁿ, and the output space Y, a compact subset of R. The relation between the input x ∈ X and the output y ∈ Y is described by a probability distribution ρ(x, y) = ν(x)ρ(y|x) on X × Y. The distribution ρ is known only through a sample z = (x, y) = ((x₁, y₁), . . . , (x_ℓ, y_ℓ)), called the training set, drawn i.i.d. according to ρ. The goal of learning is, given the sample z, to find a function f_z : X → R such that f_z(x) is an estimate of the output y when the new input x is given. The function f_z is called the estimator, and the rule that, given a sample z, provides us with f_z is called the learning algorithm.

Given a measurable function f : X → R, the ability of f to describe the distribution ρ is measured by its expected risk, defined as

```
I[f] = \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y).
```

The regression function

```
g(x) = \int_Y y \, d\rho(y|x),
```

is the minimizer of the expected risk over the set of all measurable functions, and it always exists since Y is compact. Usually the regression function cannot be reconstructed exactly, since we are given only a finite, possibly small, set of examples z. To overcome this problem, in the regularized least squares algorithm a hypothesis space H is fixed and, given λ > 0, an estimator f_z^λ is defined as the solution of the regularized least squares problem,

```
\min_{f \in H} \Big\{ \frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 + \lambda \|f\|_H^2 \Big\}.   (1)
```
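As a concrete illustration, problem (1) can be solved in practice through the representer theorem: the minimizer has the form f_z^λ(s) = Σ_i c_i K(s, x_i) with c = (K + ℓλI)⁻¹y, where K is the kernel matrix on the sample. The following sketch is ours, not part of the paper, and assumes a Gaussian kernel and a toy sinusoidal relation; all names are illustrative.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=0.2):
    # K(x, t) = exp(-(x - t)^2 / (2 sigma^2)), a continuous kernel on a compact X
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def rls_fit(x, y, lam, sigma=0.2):
    """Regularized least squares: by the representer theorem, the minimizer of
    (1/l) sum_i (f(x_i) - y_i)^2 + lam ||f||_H^2 over the RKHS H is
    f(s) = sum_i c_i K(s, x_i) with c = (K + l*lam*I)^{-1} y."""
    l = len(x)
    K = gaussian_kernel(x, x, sigma)
    c = np.linalg.solve(K + l * lam * np.eye(l), y)
    return lambda s: gaussian_kernel(np.atleast_1d(s), x, sigma) @ c

# toy sample from a fixed input-output relation: y = sin(2 pi x) + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)
f = rls_fit(x, y, lam=1e-3)
```

Setting the penalty to ℓλ in the linear system is what makes the solution match the 1/ℓ-normalized empirical risk in (1).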

The regularization parameter λ has to be chosen depending on the available data, λ = λ(ℓ, z), in such a way that, for every ε > 0,

```
\lim_{\ell \to +\infty} \mathbf{P}\Big[ I[f_z^{\lambda(\ell,z)}] - \inf_{f \in H} I[f] \ge \varepsilon \Big] = 0.   (2)
```

We note that in general inf_{f∈H} I[f] is larger than I[g] and represents a sort of irreducible error associated with the choice of the space H. The above convergence in probability is usually called consistency of the algorithm [16], [14].

3 Ill-Posed Inverse Problems and Regularization

In this section we give a very brief account of linear inverse problems and regularization theory [15], [7]. Let H and K be two Hilbert spaces and A : H → K a linear bounded operator. Consider the equation Af = g   (3)

where g, g_δ ∈ K and ‖g − g_δ‖_K ≤ δ. Here g represents the exact, unknown data and g_δ the available, noisy data. Finding the function f satisfying the above equation, given A and g_δ, is the linear inverse problem associated to Eq. (3). The above problem is, in general, ill-posed: the solution may not exist, may not be unique, or may not depend continuously on the data. Uniqueness can be restored by introducing the Moore-Penrose generalized solution f† = A†g, defined as the minimum norm solution of the problem

```
\min_{f \in H} \| A f - g \|_K^2 .   (4)
```

However, the operator A† is usually not bounded, so, in order to ensure a continuous dependence of the solution on the data, the following Tikhonov regularization scheme can be considered¹

```
\min_{f \in H} \big\{ \| A f - g_\delta \|_K^2 + \lambda \| f \|_H^2 \big\},   (5)
```

whose unique minimizer is given by

```
f_\delta^\lambda = (A^* A + \lambda I)^{-1} A^* g_\delta ,   (6)
```

where A^* denotes the adjoint of A.
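In finite dimensions, formula (6) is plain ridge-regularized least squares, and one can check directly that (AᵀA + λI)⁻¹Aᵀg_δ is the minimizer of the Tikhonov functional (5). The sketch below is our own illustration with random matrices (all names are assumptions, not from the paper); it verifies the first-order optimality condition.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))   # a bounded linear operator H = R^5 -> K = R^20
g_delta = rng.standard_normal(20)  # noisy data
lam = 0.1

# regularized solution (6): f_lam = (A^T A + lam I)^{-1} A^T g_delta
f_lam = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ g_delta)

# first-order optimality of the Tikhonov functional (5):
# the gradient 2 A^T (A f - g_delta) + 2 lam f must vanish at f_lam
grad = 2 * A.T @ (A @ f_lam - g_delta) + 2 * lam * f_lam
```

Since the functional (5) is strictly convex for λ > 0, a vanishing gradient certifies that f_lam is its unique minimizer.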

A crucial step in the above algorithm is the choice of the regularization parameter λ = λ(δ, g_δ), as a function of the noise level δ and the data g_δ, in such a way that

```
\lim_{\delta \to 0} \| f_\delta^{\lambda(\delta, g_\delta)} - f^\dagger \|_H = 0 ,   (7)
```

that is, the regularized solution f_δ^{λ(δ,g_δ)} converges to the generalized solution f† = A†g (which exists if and only if Pg ∈ Range(A), where P is the projection on the closure of the range of A; in that case Af† = Pg) when the noise δ goes to zero.

The similarity between the regularized least squares algorithm (1) and Tikhonov regularization (5) is apparent. However, several difficulties emerge. First, to treat the problem of learning in the setting of ill-posed inverse problems, we have to define a direct problem by means of a suitable operator A. Second, in the context of learning, the nature of the noise δ is not clear. Finally, we have to clarify the relation between consistency (2) and the kind of convergence expressed by (7). In the following sections we show a possible way to tackle these problems.

4 Learning as an Inverse Problem

We can now show how the problem of learning can be rephrased in a framework close to the one presented in the previous section. We assume that the hypothesis space H is a reproducing kernel Hilbert space [1] with a continuous kernel K : X × X → R. If x ∈ X, we let K_x(s) = K(s, x) and, if ν is the marginal distribution of ρ on X, we define the bounded linear operator A : H → L²(X, ν) as (Af)(x) = ⟨f, K_x⟩_H = f(x),

```
¹In the framework of inverse problems many other regularization procedures have been introduced [7]. For simplicity we treat only Tikhonov regularization.
```

that is, A is the canonical injection of H into L²(X, ν). In particular, for all f ∈ H, the expected risk becomes

I[f] = ‖Af − g‖²_{L²(X,ν)} + I[g],

where g is the regression function [2]. The above equation clarifies that, if the expected risk admits a minimizer f_H on the hypothesis space H, then f_H is exactly the generalized solution² f† = A†g of the problem

Af = g.   (8)

Moreover, given a training set z = (x, y), we get a discretized version A_x : H → E^ℓ of A, that is, (A_x f)_i = ⟨f, K_{x_i}⟩_H = f(x_i), where E^ℓ = R^ℓ is the finite dimensional euclidean space endowed with the scalar product

```
\langle y, y' \rangle_E = \frac{1}{\ell} \sum_{i=1}^{\ell} y_i y'_i .
```

It is straightforward to check that

```
\frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 = \| A_x f - y \|_E^2 ,
```

so that the estimator f_z^λ given by the regularized least squares algorithm is the regularized solution of the discrete problem A_x f = y.   (9) At this point it is useful to remark on two facts. First, in learning from examples we are not interested in finding an approximation of the generalized solution of the discretized problem (9); rather, we want to find a stable approximation of the solution of the exact problem (8) (compare with [9]). Second, we notice that in learning theory the consistency property (2) involves the control of the quantity

```
I[f_z^\lambda] - \inf_{f \in H} I[f] = \| A f_z^\lambda - g \|_{L^2(X,\nu)}^2 - \inf_{f \in H} \| A f - g \|_{L^2(X,\nu)}^2 .   (10)
```

If P is the projection on the closure of the range of A, the definition of P gives

```
I[f_z^\lambda] - \inf_{f \in H} I[f] = \| A f_z^\lambda - P g \|_{L^2(X,\nu)}^2   (11)
```

(the above equality strongly depends on the fact that the loss function is the square loss). In the inverse problem setting, the square root of the above quantity is called the residue of the solution f_z^λ. Hence consistency is controlled by the residue of the estimator, instead of by the reconstruction error ‖f_z^λ − f†‖_H (as in inverse problems). In particular, consistency is a weaker condition than the one required by (7) and does not require the existence of the generalized solution f_H.
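Concretely, with the 1/ℓ-weighted inner product on E^ℓ, the adjoint of the sampling operator is A_x^*v = (1/ℓ) Σ_i v_i K_{x_i}; for a kernel with an explicit finite-dimensional feature map this reads A_x^* = (1/ℓ)Φᵀ, so the regularized solution of the discrete problem (9) coincides with ordinary ridge regression. The following check is our own sketch, assuming a linear kernel on R^d with random illustrative data.

```python
import numpy as np

rng = np.random.default_rng(2)
l, d = 30, 3
Phi = rng.standard_normal((l, d))  # rows phi(x_i): A_x f = Phi @ w for f = <w, phi(.)>
y = rng.standard_normal(l)
lam = 0.05

# with <u, v>_E = (1/l) sum_i u_i v_i, the adjoint is A_x^* = (1/l) Phi^T, hence
# f_z^lam = (A_x^* A_x + lam I)^{-1} A_x^* y = (Phi^T Phi / l + lam I)^{-1} Phi^T y / l
w_op = np.linalg.solve(Phi.T @ Phi / l + lam * np.eye(d), Phi.T @ y / l)

# the same estimator written as standard ridge regression with penalty l * lam
w_ridge = np.linalg.solve(Phi.T @ Phi + l * lam * np.eye(d), Phi.T @ y)
```

The two solves agree because multiplying the normal equations by ℓ turns the 1/ℓ-normalized empirical risk into the usual sum-of-squares form.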

5 Regularization, Stochastic Noise and Consistency

To apply the framework of ill-posed inverse problems of Section 3 to the formulation of learning proposed above, we note that the operator A_x in the discretized problem (9) differs from the operator A in the exact problem (8), so a measure of the difference between A_x and A is required. Moreover, the noisy data y ∈ E^ℓ and the exact data g ∈ L²(X, ν) belong to different spaces, so that the notion of noise has to be modified. Given the above premise, our derivation of consistency results is developed in two steps: we first bound the residue of the solution by means of a measure of the noise due to discretization, and then we show a possible way to give a probabilistic evaluation of the noise previously introduced.

```
²The fact that f_H is the minimal norm solution of (4) is ensured by the assumption that the support of the measure ν is X, since in this case the operator A is injective.
```

5.1 Bounding the Residue of the Regularized Solution

We recall that the regularized solutions of problems (9) and (8) are given by

```
f_z^\lambda = (A_x^* A_x + \lambda I)^{-1} A_x^* y , \qquad
f^\lambda = (A^* A + \lambda I)^{-1} A^* g .
```

The above equations show that f_z^λ and f^λ depend only on A_x^* A_x and A^* A, which are operators from H into H, and on A_x^* y and A^* g, which are elements of H, so that the space E^ℓ disappears. This observation suggests that the noise levels could be ‖A_x^* A_x − A^* A‖_{L(H)} and ‖A_x^* y − A^* g‖_H, where ‖·‖_{L(H)} is the uniform operator norm. To this purpose, for every δ = (δ₁, δ₂) ∈ R²₊ we define the collection of training sets

U_δ = { z ∈ (X × Y)^ℓ | ‖A_x^* y − A^* g‖_H ≤ δ₁, ‖A_x^* A_x − A^* A‖_{L(H)} ≤ δ₂, ℓ ∈ N },

and we let M = sup{ |y| : y ∈ Y }. The next theorem is the central result of the paper.

Theorem 1 If λ > 0, the following inequalities hold:

```
1. for any training set z \in U_\delta,
   \| A f_z^\lambda - P g \|_{L^2(X,\nu)} - \| A f^\lambda - P g \|_{L^2(X,\nu)}
   \le \frac{M \delta_2}{4 \lambda} + \frac{\delta_1}{2 \sqrt{\lambda}} ;

2. if P g \in \mathrm{Range}(A), for any training set z \in U_\delta,
   \| f_z^\lambda - f^\dagger \|_H - \| f^\lambda - f^\dagger \|_H
   \le \frac{M \delta_2}{4 \lambda^{3/2}} + \frac{\delta_1}{2 \lambda} .
```

Moreover, if we choose λ = λ(δ, z) in such a way that

```
\lim_{\delta \to 0} \sup_{z \in U_\delta} \lambda(\delta, z) = 0 , \qquad
\lim_{\delta \to 0} \sup_{z \in U_\delta} \frac{\delta_1}{\sqrt{\lambda(\delta, z)}} = 0 , \qquad
\lim_{\delta \to 0} \sup_{z \in U_\delta} \frac{\delta_2}{\lambda(\delta, z)} = 0 ,   (12)
```

then

```
\lim_{\delta \to 0} \sup_{z \in U_\delta} \| A f_z^{\lambda(\delta, z)} - P g \|_{L^2(X,\nu)} = 0 .   (13)
```
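For instance, if the two noise levels coincide, δ₁ = δ₂ = δ, the a priori choice λ(δ) = √δ satisfies all three conditions in (12):

```latex
\lambda(\delta) = \sqrt{\delta} \longrightarrow 0 , \qquad
\frac{\delta_1}{\sqrt{\lambda(\delta)}} = \delta^{3/4} \longrightarrow 0 , \qquad
\frac{\delta_2}{\lambda(\delta)} = \delta^{1/2} \longrightarrow 0
\qquad \text{as } \delta \to 0 .
```

More generally, any choice λ(δ) = δ^s with 0 < s < 1 works, trading off how fast the two ratios vanish.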

We omit the complete proof and refer to [3]. Briefly, the idea is to note that

```
\| A f_z^\lambda - P g \|_{L^2(X,\nu)} - \| A f^\lambda - P g \|_{L^2(X,\nu)}
\le \| A f_z^\lambda - A f^\lambda \|_{L^2(X,\nu)}
= \| (A^* A)^{1/2} (f_z^\lambda - f^\lambda) \|_H ,
```

where the last equality follows from the polar decomposition of the operator A. Moreover, a simple algebraic computation gives

```
f_z^\lambda - f^\lambda
= (A^* A + \lambda I)^{-1} (A^* A - A_x^* A_x)(A_x^* A_x + \lambda I)^{-1} A_x^* y
+ (A^* A + \lambda I)^{-1} (A_x^* y - A^* g) ,
```

in which the relevant quantities for the definition of the noise appear. The first item in the above theorem quantifies the difference between the residues of the regularized solutions of the exact and discretized problems in terms of the noise level δ = (δ₁, δ₂). As mentioned before, this is exactly the kind of result needed to derive consistency. On the other hand, the last part of the theorem gives sufficient conditions on the parameter λ to ensure convergence of the residue to zero as the noise level decreases. The above results were obtained by introducing the collection U_δ of training sets compatible with a certain noise level δ. It remains to quantify the noise level corresponding to a training set of cardinality ℓ. This will be achieved in a probabilistic setting in the next section.
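The decomposition of f_z^λ − f^λ above can be verified in one line by adding and subtracting A_x^*A_x + λI inside a bracket:

```latex
f_z^\lambda - f^\lambda
= (A^*A + \lambda I)^{-1}\big[\underbrace{(A^*A - A_x^*A_x) + (A_x^*A_x + \lambda I)}_{=\,A^*A + \lambda I}\,(A_x^*A_x + \lambda I)^{-1} A_x^* y - A^* g\big]
= (A^*A + \lambda I)^{-1}(A^*A - A_x^*A_x)(A_x^*A_x + \lambda I)^{-1} A_x^* y
+ (A^*A + \lambda I)^{-1}(A_x^* y - A^* g) .
```

The first term is controlled by the operator noise ‖A^*A − A_x^*A_x‖_{L(H)} and the second by the data noise ‖A_x^*y − A^*g‖_H.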

5.2 Stochastic Evaluation of the Noise

In this section we estimate the discretization noise δ = (δ₁, δ₂).

Theorem 2 Let δ₁, δ₂ > 0 and κ = sup_{x∈X} √K(x, x); then

```
\mathbf{P}\Big[ \| A^* g - A_x^* y \|_H \le \frac{\kappa M}{\sqrt{\ell}} + \delta_1 \,, \;
\| A^* A - A_x^* A_x \|_{L(H)} \le \frac{\kappa^2}{\sqrt{\ell}} + \delta_2 \Big]
\ge 1 - e^{-\ell \delta_1^2 / (2 \kappa^2 M^2)} - e^{-\ell \delta_2^2 / (2 \kappa^4)} .   (14)
```

The proof is given in [3]; it is based on McDiarmid's inequality [10] applied to the random variables F(z) = ‖A_x^* y − A^* g‖_H and G(z) = ‖A_x^* A_x − A^* A‖_{L(H)}. Other estimates of the noise can be given using, for example, union bounds and Hoeffding's inequality. However, rather than providing a tight analysis, our concern was to find a natural, explicit and easy to prove estimate of δ.
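The 1/√ℓ decay of the discretization noise predicted by Theorem 2 is easy to visualize numerically. The toy construction below is entirely ours (not from the paper): a linear feature map φ(x) = (1, x) on X = [0, 1] with ν uniform and y = g(x) = x, so that A^*g = E[y φ(x)] and A^*A = E[φ(x)φ(x)ᵀ] have closed forms; the Euclidean norm on coefficients stands in for the H-norm under the assumption that the two features are taken as coordinates.

```python
import numpy as np

rng = np.random.default_rng(3)

# feature map phi(x) = (1, x) on X = [0, 1], nu uniform, y = g(x) = x
A_star_g = np.array([1 / 2, 1 / 3])              # E[y phi(x)] = (E[x], E[x^2])
A_star_A = np.array([[1, 1 / 2], [1 / 2, 1 / 3]])  # E[phi(x) phi(x)^T]

def noise_levels(l):
    """Empirical counterparts A_x^* y = (1/l) sum_i y_i phi(x_i) and
    A_x^* A_x = (1/l) sum_i phi(x_i) phi(x_i)^T, compared to their expectations."""
    x = rng.uniform(0, 1, l)
    phi = np.stack([np.ones(l), x], axis=1)
    y = x
    d1 = np.linalg.norm(phi.T @ y / l - A_star_g)            # plays the role of delta_1
    d2 = np.linalg.norm(phi.T @ phi / l - A_star_A, ord=2)   # plays the role of delta_2
    return d1, d2

small = noise_levels(100)
large = noise_levels(100_000)
```

Increasing ℓ by a factor of 1000 shrinks both noise levels by roughly √1000, in line with the exponents in (14).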

5.3 Consistency and Regularization Parameter Choice

Combining Theorems 1 and 2, we easily derive the following corollary.

Corollary 1 Given 0 < η < 1, with probability greater than 1 − η,

```
\| A f_z^\lambda - P g \|_{L^2(X,\nu)} - \| A f^\lambda - P g \|_{L^2(X,\nu)}
\le M \kappa \Big( \frac{\kappa}{4 \lambda} + \frac{1}{2 \sqrt{\lambda}} \Big)
\frac{1 + \sqrt{2 \log(2/\eta)}}{\sqrt{\ell}} .   (15)
```
