{"title": "Probabilistic Methods for Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 349, "page_last": 355, "abstract": null, "full_text": "Probabilistic methods for Support Vector Machines\n\nPeter Sollich\n\nDepartment of Mathematics, King's College London, Strand, London WC2R 2LS, U.K. Email: peter.sollich@kcl.ac.uk\n\nAbstract\n\nI describe a framework for interpreting Support Vector Machines (SVMs) as maximum a posteriori (MAP) solutions to inference problems with Gaussian Process priors. This can provide intuitive guidelines for choosing a 'good' SVM kernel. It can also assign (by evidence maximization) optimal values to parameters such as the noise level $C$ which cannot be determined unambiguously from properties of the MAP solution alone (such as cross-validation error). I illustrate this using a simple approximate expression for the SVM evidence. Once $C$ has been determined, error bars on SVM predictions can also be obtained.\n\n1  Support Vector Machines: A probabilistic framework\n\nSupport Vector Machines (SVMs) have recently been the subject of intense research activity within the neural networks community; for tutorial introductions and overviews of recent developments see [1, 2, 3]. One of the open questions that remains is how to set the 'tunable' parameters of an SVM algorithm: While methods for choosing the width of the kernel function and the noise parameter $C$ (which controls how closely the training data are fitted) have been proposed [4, 5] (see also, very recently, [6]), the effect of the overall shape of the kernel function remains imperfectly understood [1]. Error bars (class probabilities) for SVM predictions, which are important for safety-critical applications, for example, are also difficult to obtain. 
\nIn this paper I suggest that a probabilistic interpretation of SVMs could be used to tackle these problems. It shows that the SVM kernel defines a prior over functions on the input space, avoiding the need to think in terms of high-dimensional feature spaces. It also allows one to define quantities such as the evidence (likelihood) for a set of hyperparameters ($C$, kernel amplitude $K_0$ etc). I give a simple approximation to the evidence which can then be maximized to set such hyperparameters. The evidence is sensitive to the values of $C$ and $K_0$ individually, in contrast to properties (such as cross-validation error) of the deterministic solution, which only depends on the product $C K_0$. It can therefore be used to assign an unambiguous value to $C$, from which error bars can be derived.\n\nI focus on two-class classification problems. Suppose we are given a set $D$ of $n$ training examples $(x_i, y_i)$ with binary outputs $y_i = \\pm 1$ corresponding to the two classes. The basic SVM idea is to map the inputs $x$ onto vectors $\\phi(x)$ in some high-dimensional feature space; ideally, in this feature space, the problem should be linearly separable. Suppose first that this is true. Among all decision hyperplanes $w \\cdot \\phi(x) + b = 0$ which separate the training examples (i.e. which obey $y_i(w \\cdot \\phi(x_i) + b) > 0$ for all $x_i \\in D_x$, $D_x$ being the set of training inputs), the SVM solution is chosen as the one with the largest margin, i.e. the largest minimal distance from any of the training examples. Equivalently, one specifies the margin to be one and minimizes the squared length of the weight vector $\\|w\\|^2$ [1], subject to the constraint that $y_i(w \\cdot \\phi(x_i) + b) \\geq 1$ for all $i$. 
If the problem is not linearly separable, 'slack variables' $\\xi_i \\geq 0$ are introduced which measure how much the margin constraints are violated; one writes $y_i(w \\cdot \\phi(x_i) + b) \\geq 1 - \\xi_i$. To control the amount of slack allowed, a penalty term $C \\sum_i \\xi_i$ is then added to the objective function $\\frac{1}{2}\\|w\\|^2$, with a penalty coefficient $C$. Training examples with $y_i(w \\cdot \\phi(x_i) + b) \\geq 1$ (and hence $\\xi_i = 0$) incur no penalty; all others contribute $C[1 - y_i(w \\cdot \\phi(x_i) + b)]$ each. This gives the SVM optimization problem: Find $w$ and $b$ to minimize\n\n$$\\frac{1}{2}\\|w\\|^2 + C \\sum_i l(y_i[w \\cdot \\phi(x_i) + b]) \\qquad (1)$$\n\nwhere $l(z)$ is the (shifted) 'hinge loss', $l(z) = (1-z)\\Theta(1-z)$.\n\nTo interpret SVMs probabilistically, one can regard (1) as defining a (negative) log-posterior probability for the parameters $w$ and $b$ of the SVM, given a training set $D$. The first term gives the prior $Q(w,b) \\propto \\exp(-\\frac{1}{2}\\|w\\|^2 - \\frac{1}{2} b^2 B^{-2})$. This is a Gaussian prior on $w$; the components of $w$ are uncorrelated with each other and have unit variance. I have chosen a Gaussian prior on $b$ with variance $B^2$; the flat prior implied by (1) can be recovered$^1$ by letting $B \\to \\infty$. Because only the 'latent variable' values $\\theta(x) = w \\cdot \\phi(x) + b$, rather than $w$ and $b$ individually, appear in the second, data dependent term of (1), it makes sense to express the prior directly as a distribution over these. The $\\theta(x)$ have a joint Gaussian distribution because the components of $w$ do, with covariances given by $\\langle \\theta(x)\\theta(x') \\rangle = \\langle (\\phi(x) \\cdot w)(w \\cdot \\phi(x')) \\rangle + B^2 = \\phi(x) \\cdot \\phi(x') + B^2$. The SVM prior is therefore simply a Gaussian process (GP) over the functions $\\theta$, with covariance function $K(x,x') = \\phi(x) \\cdot \\phi(x') + B^2$ (and zero mean). This correspondence between SVMs and GPs has been noted by a number of authors, e.g. 
[6, 7, 8, 9, 10].\n\n$^1$ In the probabilistic setting, it actually makes more sense to keep $B$ finite (and small); for $B \\to \\infty$, only training sets with all $y_i$ equal have nonzero probability.\n\nThe second term in (1) becomes a (negative) log-likelihood if we define the probability of obtaining output $y$ for a given $x$ (and $\\theta$) as\n\n$$Q(y = \\pm 1 | x, \\theta) = \\kappa(C) \\exp[-C\\, l(y\\theta(x))] \\qquad (2)$$\n\nWe set $\\kappa(C) = 1/[1 + \\exp(-2C)]$ to ensure that the probabilities for $y = \\pm 1$ never add up to a value larger than one. The likelihood for the complete data set is then $Q(D|\\theta) = \\prod_i Q(y_i|x_i, \\theta) Q(x_i)$, with some input distribution $Q(x)$ which remains essentially arbitrary at this point. However, this likelihood function is not normalized, because\n\n$$\\nu(\\theta(x)) = Q(1|x, \\theta) + Q(-1|x, \\theta) = \\kappa(C)\\{\\exp[-C\\, l(\\theta(x))] + \\exp[-C\\, l(-\\theta(x))]\\} < 1$$\n\nexcept when $|\\theta(x)| = 1$. To remedy this, I write the actual probability model as\n\n$$P(D, \\theta) = Q(D|\\theta) Q(\\theta) / N(D). \\qquad (3)$$\n\nIts posterior probability $P(\\theta|D) \\propto Q(D|\\theta) Q(\\theta)$ is independent of the normalization factor $N(D)$; by construction, the MAP value of $\\theta$ is therefore the SVM solution. The simplest choice of $N(D)$ which normalizes $P(D, \\theta)$ is $D$-independent:\n\n$$N = N_n = \\int d\\theta\\, Q(\\theta) N^n(\\theta), \\qquad N(\\theta) = \\int dx\\, Q(x) \\nu(\\theta(x)). \\qquad (4)$$\n\nConceptually, this corresponds to the following procedure of sampling from $P(D, \\theta)$: First, sample $\\theta$ from the GP prior $Q(\\theta)$. Then, for each data point, sample $x$ from $Q(x)$. Assign outputs $y = \\pm 1$ with probability $Q(y|x, \\theta)$, respectively; with the remaining probability $1 - \\nu(\\theta(x))$ (the 'don't know' class probability in [11]), restart the whole process by sampling a new $\\theta$. 
Because $\\nu(\\theta(x))$ is smallest$^2$ inside the 'gap' $|\\theta(x)| < 1$, functions $\\theta$ with many values in this gap are less likely to 'survive' until a dataset of the required size $n$ is built up. This is reflected in an $n$-dependent factor in the (effective) prior, which follows from (3,4) as $P(\\theta) \\propto Q(\\theta) N^n(\\theta)$. Correspondingly, in the likelihood\n\n$$P(y|x, \\theta) = Q(y|x, \\theta) / \\nu(\\theta(x)), \\qquad P(x|\\theta) \\propto Q(x) \\nu(\\theta(x)) \\qquad (5)$$\n\n(which now is normalized over $y = \\pm 1$), the input density is influenced by the function $\\theta$ itself; it is reduced in the 'uncertainty gaps' $|\\theta(x)| < 1$.\n\nTo summarize, eqs. (2-5) define a probabilistic data generation model whose MAP solution $\\theta^* = \\mathrm{argmax}\\, P(\\theta|D)$ for a given data set $D$ is identical to a standard SVM. The effective prior $P(\\theta)$ is a GP prior modified by a data set size-dependent factor; the likelihood (5) defines not just a conditional output distribution, but also an input distribution (relative to some arbitrary $Q(x)$). All relevant properties of the feature space are encoded in the underlying GP prior $Q(\\theta)$, with covariance matrix equal to the kernel $K(x, x')$. The log-posterior of the model\n\n$$\\ln P(\\theta|D) = -\\frac{1}{2} \\int dx\\, dx'\\, \\theta(x) K^{-1}(x, x') \\theta(x') - C \\sum_i l(y_i \\theta(x_i)) + \\mathrm{const} \\qquad (6)$$\n\nis just a transformation of (1) from $w$ and $b$ to $\\theta$. By differentiating w.r.t. the $\\theta(x)$ for non-training inputs, one sees that its maximum is of the standard form $\\theta^*(x) = \\sum_i \\alpha_i y_i K(x, x_i)$; for $y_i \\theta^*(x_i) > 1$, $< 1$, and $= 1$ one has $\\alpha_i = 0$, $\\alpha_i = C$ and $\\alpha_i \\in [0, C]$ respectively. I will call the training inputs $x_i$ in the last group marginal; they form a subset of all support vectors (the $x_i$ with $\\alpha_i > 0$). The sparseness of the SVM solution (often the number of support vectors is $\\ll n$) comes from the fact that the hinge loss $l(z)$ is constant for $z > 1$. 
This contrasts with other uses of GP models for classification (see e.g. [12]), where instead of the likelihood (2) a sigmoidal (often logistic) 'transfer function' with nonzero gradient everywhere is used. Moreover, in the noise free limit, the sigmoidal transfer function becomes a step function, and the MAP values $\\theta^*$ will tend to the trivial solution $\\theta^*(x) = 0$. This illuminates from an alternative point of view why the margin (the 'shift' in the hinge loss) is important for SVMs.\n\n$^2$ This is true for $C > \\ln 2$. For smaller $C$, $\\nu(\\theta(x))$ is actually higher in the gap, and the model makes less intuitive sense.\n\nWithin the probabilistic framework, the main effect of the kernel in SVM classification is to change the properties of the underlying GP prior $Q(\\theta)$ in $P(\\theta) \\propto Q(\\theta) N^n(\\theta)$. Fig. 1 illustrates this with samples from $Q(\\theta)$ for three different types of kernels. The effect of the kernel on smoothness of decision boundaries, and typical sizes of decision regions and 'uncertainty gaps' between them, can clearly be seen. When prior knowledge about these properties of the target is available, the probabilistic framework can therefore provide intuition for a suitable choice of kernel. Note that the samples in Fig. 1 are from $Q(\\theta)$, rather than from the effective prior $P(\\theta)$. One finds, however, that the $n$-dependent factor $N^n(\\theta)$ does not change the properties of the prior qualitatively$^3$.\n\nFigure 1: Samples from SVM priors; the input space is the unit square $[0,1]^2$. 3d plots are samples $\\theta(x)$ from the underlying Gaussian process prior $Q(\\theta)$. 2d greyscale plots represent the output distributions obtained when $\\theta(x)$ is used in the likelihood model (5) with $C = 2$; the greyscale indicates the probability of $y = 1$ (black: 0, white: 1). (a,b) Exponential (Ornstein-Uhlenbeck) kernel/covariance function $K_0 \\exp(-|x - x'|/l)$, giving rough $\\theta(x)$ and decision boundaries. Length scale $l = 0.1$, $K_0 = 10$. (c) Same with $K_0 = 1$, i.e. with a reduced amplitude of $\\theta(x)$; note how, in a sample from the prior corresponding to this new kernel, the grey 'uncertainty gaps' (given roughly by $|\\theta(x)| < 1$) between regions of definite outputs (black/white) have widened. (d,e) As first row, but with squared exponential (RBF) kernel $K_0 \\exp[-(x - x')^2/(2l^2)]$, yielding smooth $\\theta(x)$ and decision boundaries. (f) Changing $l$ to 0.05 (while holding $K_0$ fixed at 10) and taking a new sample shows how this parameter sets the typical length scale for decision regions. (g,h) Polynomial kernel $(1 + x \\cdot x')^p$, with $p = 5$; (i) $p = 10$. The absence of a clear length scale and the widely differing magnitudes of $\\theta(x)$ in the bottom left ($x = [0,0]$) and top right ($x = [1,1]$) corners of the square make this kernel less plausible from a probabilistic point of view.\n\n2  Evidence and error bars\n\nBeyond providing intuition about SVM kernels, the probabilistic framework discussed above also makes it possible to apply Bayesian methods to SVMs. For example, one can define the evidence, i.e. the likelihood of the data $D$, given the model as specified by the hyperparameters $C$ and (some parameters defining) $K(x, x')$. It follows from (3) as\n\n$$P(D) = Q(D)/N_n, \\qquad Q(D) = \\int d\\theta\\, Q(D|\\theta) Q(\\theta). \\qquad (7)$$\n\nThe factor $Q(D)$ is the 'naive' evidence derived from the unnormalized likelihood model; the correction factor $N_n$ ensures that $P(D)$ is normalized over all data sets. 
This is crucial in order to guarantee that optimization of the (log) evidence gives optimal hyperparameter values at least on average (M Opper, private communication). Clearly, $P(D)$ will in general depend on $C$ and $K(x, x')$ separately. The actual SVM solution, on the other hand, i.e. the MAP values $\\theta^*$, can be seen from (6) to depend on the product $C K(x, x')$ only. Properties of the deterministically trained SVM alone (such as test or cross-validation error) cannot therefore be used to determine $C$ and the resulting class probabilities (5) unambiguously.\n\nI now outline how a simple approximation to the naive evidence can be derived. $Q(D)$ is given by an integral over all $\\theta(x)$, with the log integrand being (6) up to an additive constant. After integrating out the Gaussian distributed $\\theta(x)$ with $x \\notin D_x$, an intractable integral over the $\\theta(x_i)$ remains. However, progress can be made by expanding the log integrand around its maximum $\\theta^*(x_i)$. For all non-marginal training inputs this is equivalent to Laplace's approximation: the first terms in the expansion are quadratic in the deviations from the maximum and give simple Gaussian integrals. For the remaining $\\theta(x_i)$, the leading terms in the log integrand vary linearly near the maximum. Couplings between these $\\theta(x_i)$ only appear at the next (quadratic) order; discarding these terms as subleading, the integral factorizes over the $\\theta(x_i)$ and can be evaluated. The end result of this calculation is:\n\n$$\\ln Q(D) \\approx -\\frac{1}{2} \\sum_i y_i \\alpha_i \\theta^*(x_i) - C \\sum_i l(y_i \\theta^*(x_i)) - n \\ln(1 + e^{-2C}) - \\frac{1}{2} \\ln \\det(L_m K_m) \\qquad (8)$$\n\nThe first three terms represent the maximum of the log integrand, $\\ln Q(D|\\theta^*)$; the last one comes from the integration over the fluctuations of the $\\theta(x)$. 
Note that it only contains information about the marginal training inputs: $K_m$ is the corresponding submatrix of $K(x, x')$, and $L_m$ is a diagonal matrix with entries $2\\pi[\\alpha_i(C - \\alpha_i)/C]^2$. Given the sparseness of the SVM solution, these matrices should be reasonably small, making their determinants amenable to numerical computation or estimation [12]. Eq. (8) diverges when $\\alpha_i \\to 0$ or $\\to C$ for one of the marginal training inputs; the approximation of retaining only linear terms in the log integrand then breaks down. I therefore adopt the simple heuristic of replacing $\\det(L_m K_m)$ by $\\det(I + L_m K_m)$, which prevents these spurious singularities ($I$ is the identity matrix). This choice also keeps the evidence continuous when training inputs move in or out of the set of marginal inputs as hyperparameters are varied.\n\n$^3$ Quantitative changes arise because function values with $|\\theta(x)| < 1$ are 'discouraged' for large $n$; this tends to increase the size of the decision regions and narrow the uncertainty gaps. I have verified this by comparing samples from $Q(\\theta)$ and $P(\\theta)$.\n\nFig. 2 shows a simple application of the evidence estimate (8). For a given data set, the evidence $P(D)$ was evaluated$^4$ as a function of $C$. The kernel amplitude $K_0$ was varied simultaneously such that $C K_0$ and hence the SVM solution itself remained unchanged. Because the data set was generated artificially from the probability model (3), the 'true' value of $C = 2$ was known; in spite of the rather crude approximation for $Q(D)$, the maximum of the full evidence $P(D)$ identifies $C \\approx 1.8$ quite close to the truth. The approximate class probability prediction $P(y = 1|x, D)$ for this value of $C$ is also plotted in Fig. 2; it overestimates the noise in the target somewhat. Note that $P(y|x, D)$ was obtained simply by inserting the MAP values $\\theta^*(x)$ into (5). In a proper Bayesian treatment, an average over the posterior distribution $P(\\theta|D)$ should of course be taken; I leave this for future work.\n\nFigure 2: Toy example of evidence maximization. Left: Target 'latent' function $\\theta(x)$ (solid line). An SVM with RBF kernel $K(x, x') = K_0 \\exp[-(x - x')^2/(2l^2)]$, $l = 0.05$, $C K_0 = 2.5$ was trained (dashed line) on $n = 50$ training examples (circles). Keeping $C K_0$ constant, the evidence $P(D)$ (top right) was then evaluated as a function of $C$ using (7,8). Note how the normalization factor $N_n$ shifts the maximum of $P(D)$ towards larger values of $C$ than in the naive evidence $Q(D)$. Bottom right: Class probability $P(y = 1|x)$ for the target (solid), and prediction at the evidence maximum $C \\approx 1.8$ (dashed). The target was generated from (3) with $C = 2$.\n\n$^4$ The normalization factor $N_n$ was estimated, for the assumed uniform input density $Q(x)$ of the example, by sampling from the GP prior $Q(\\theta)$. If $Q(x)$ is unknown, the empirical training input distribution can be used as a proxy, and one samples instead from a multivariate Gaussian for the $\\theta(x_i)$ with covariance matrix $K(x_i, x_j)$. This gave very similar values of $\\ln N_n$ in the example, even when only a subset of 30 training inputs was used.\n\nIn summary, I have described a probabilistic framework for SVM classification. 
It gives an intuitive understanding of the effect of the kernel, which determines a Gaussian process prior. More importantly, it also allows a properly normalized evidence to be defined; from this, optimal values of hyperparameters such as the noise parameter $C$, and corresponding error bars, can be derived. Future work will have to include more comprehensive experimental tests of the simple Laplace-type estimate of the (naive) evidence $Q(D)$ that I have given, and comparison with other approaches. These include variational methods; very recent experiments with a Gaussian approximation for the posterior $P(\\theta|D)$, for example, seem promising [6]. Further improvement should be possible by dropping the restriction to a 'factor-analysed' covariance form [6]. (One easily shows that the optimal Gaussian covariance matrix is $(D + K^{-1})^{-1}$, parameterized only by a diagonal matrix $D$.) It will also be interesting to compare the Laplace and Gaussian variational results for the evidence with those from the 'cavity field' approach of [10].\n\nAcknowledgements\n\nIt is a pleasure to thank Tommi Jaakkola, Manfred Opper, Matthias Seeger, Chris Williams and Ole Winther for interesting comments and discussions, and the Royal Society for financial support through a Dorothy Hodgkin Research Fellowship.\n\nReferences\n\n[1] C J C Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.\n\n[2] A J Smola and B Scholkopf. A tutorial on support vector regression. 1998. NeuroCOLT Technical Report TR-1998-030; available from http://svm.first.gmd.de/.\n\n[3] B Scholkopf, C Burges, and A J Smola. Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA, 1998. 
\n\n[4] B Scholkopf, P Bartlett, A Smola, and R Williamson. Shrinking the tube: a new support vector regression algorithm. In NIPS 11.\n\n[5] N Cristianini, C Campbell, and J Shawe-Taylor. Dynamically adapting kernels in support vector machines. In NIPS 11.\n\n[6] M Seeger. Bayesian model selection for Support Vector machines, Gaussian processes and other kernel classifiers. Submitted to NIPS 12.\n\n[7] G Wahba. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Technical Report 984, University of Wisconsin, 1997.\n\n[8] T S Jaakkola and D Haussler. Probabilistic kernel regression models. In Proceedings of The 7th International Workshop on Artificial Intelligence and Statistics. To appear.\n\n[9] A J Smola, B Scholkopf, and K R Muller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.\n\n[10] M Opper and O Winther. Gaussian process classification and SVM: Mean field results and leave-one-out estimator. In Advances in Large Margin Classifiers. MIT Press. To appear.\n\n[11] P Sollich. Probabilistic interpretation and Bayesian methods for Support Vector Machines. Submitted to ICANN 99.\n\n[12] C K I Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M I Jordan, editor, Learning and Inference in Graphical Models, pages 599-621. Kluwer Academic, 1998.\n", "award": [], "sourceid": 1711, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}]}