{"title": "Learning Model Bias", "book": "Advances in Neural Information Processing Systems", "page_first": 169, "page_last": 175, "abstract": null, "full_text": "Learning Model Bias\n\nJonathan Baxter\n\nDepartment of Computer Science\n\nRoyal Holloway College, University of London\n\njon@dcs.rhbnc.ac.uk\n\nAbstract\n\nIn this paper the problem of learning appropriate domain-specific bias is addressed. It is shown that this can be achieved by learning many related tasks from the same domain, and a theorem is given bounding the number of tasks that must be learnt. A corollary of the theorem is that if the tasks are known to possess a common internal representation or preprocessing then the number of examples required per task for good generalisation when learning n tasks simultaneously scales like O(a + b/n), where O(a) is a bound on the minimum number of examples required to learn a single task, and O(a + b) is a bound on the number of examples required to learn each task independently. An experiment providing strong qualitative support for the theoretical results is reported.\n\n1 Introduction\n\nIt has been argued (see [6]) that the main problem in machine learning is the biasing of a learner's hypothesis space sufficiently well to ensure good generalisation from a small number of examples. Once suitable biases have been found the actual learning task is relatively trivial. Existing methods of bias generally require the input of a human expert in the form of heuristics, hints [1], domain knowledge, etc. Such methods are clearly limited by the accuracy and reliability of the expert's knowledge and also by the extent to which that knowledge can be transferred to the learner. 
Here I attempt to solve some of these problems by introducing a method for automatically learning the bias.\n\nThe central idea is that in many learning problems the learner is typically embedded within an environment or domain of related learning tasks and that the bias appropriate for a single task is likely to be appropriate for other tasks within the same environment. A simple example is the problem of handwritten character recognition. A preprocessing stage that identifies and removes any (small) rotations, dilations and translations of an image of a character will be advantageous for recognising all characters. If the set of all individual character recognition problems is viewed as an environment of learning tasks, this preprocessor represents a bias that is appropriate to all tasks in the environment. It is likely that there are many other currently unknown biases that are also appropriate for this environment. We would like to be able to learn these automatically.\n\nBias that is appropriate for all tasks must be learnt by sampling from many tasks. If only a single task is learnt then the bias extracted is likely to be specific to that task. For example, if a network is constructed as in figure 1 and the output nodes are simultaneously trained on many similar problems, then the hidden layers are more likely to be useful in learning a novel problem of the same type than if only a single problem is learnt. In the rest of this paper I develop a general theory of bias learning based upon the idea of learning multiple related tasks. 
The theory shows that a learner's generalisation performance can be greatly improved by learning related tasks and that if sufficiently many tasks are learnt the learner's bias can be extracted and used to learn novel tasks.\n\nOther authors who have empirically investigated the idea of learning multiple related tasks include [5] and [8].\n\n2 Learning Bias\n\nFor the sake of argument I consider learning problems that amount to minimizing the mean squared error of a function h over some training set D. A more general formulation based on statistical decision theory is given in [3]. Thus, it is assumed that the learner receives a training set of (possibly noisy) input-output pairs $D = \\{(x_1, y_1), \\ldots, (x_m, y_m)\\}$, drawn according to a probability distribution P on $X \\times Y$ (X being the input space and Y being the output space) and searches through its hypothesis space $\\mathcal{H}$ for a function $h: X \\to Y$ minimizing the empirical error,\n\n$$\\hat{E}(h, D) = \\frac{1}{m} \\sum_{i=1}^{m} (h(x_i) - y_i)^2. \\quad (1)$$\n\nThe true error or generalisation error of h is the expected error under P:\n\n$$E(h, P) = \\int_{X \\times Y} (h(x) - y)^2 \\, dP(x, y). \\quad (2)$$\n\nThe hope of course is that an h with a small empirical error on a large enough training set will also have a small true error, i.e. it will generalise well.\n\nI model the environment of the learner as a pair $(\\mathcal{P}, Q)$ where $\\mathcal{P} = \\{P\\}$ is a set of learning tasks and Q is a probability measure on $\\mathcal{P}$. The learner is now supplied not with a single hypothesis space $\\mathcal{H}$ but with a hypothesis space family $\\mathbb{H} = \\{\\mathcal{H}\\}$. Each $\\mathcal{H} \\in \\mathbb{H}$ represents a different bias the learner has about the environment. 
For example, one $\\mathcal{H}$ may contain functions that are very smooth, whereas another $\\mathcal{H}$ might contain more wiggly functions. Which hypothesis space is best will depend on the kinds of functions in the environment. To determine the best $\\mathcal{H} \\in \\mathbb{H}$ for $(\\mathcal{P}, Q)$, we provide the learner not with a single training set D but with n such training sets $D_1, \\ldots, D_n$. Each $D_i$ is generated by first sampling from $\\mathcal{P}$ according to Q to give $P_i$ and then sampling m times from $X \\times Y$ according to $P_i$ to give $D_i = \\{(x_{i1}, y_{i1}), \\ldots, (x_{im}, y_{im})\\}$. The learner searches for the hypothesis space $\\mathcal{H} \\in \\mathbb{H}$ with minimal empirical error on $D_1, \\ldots, D_n$, where this is defined by\n\n$$\\hat{E}^*(\\mathcal{H}, D_1, \\ldots, D_n) = \\frac{1}{n} \\sum_{i=1}^{n} \\inf_{h \\in \\mathcal{H}} \\hat{E}(h, D_i). \\quad (3)$$\n\nFigure 1: Net for learning multiple tasks. Input $x_{ij}$ from training set $D_i$ is propagated forwards through the internal representation f and then only through the output network $g_i$. The error $[g_i(f(x_{ij})) - y_{ij}]^2$ is similarly backpropagated only through the output network $g_i$ and then f. Weight updates are performed after all training sets $D_1, \\ldots, D_n$ have been presented.\n\nThe hypothesis space $\\mathcal{H}$ with smallest empirical error is the one that is best able to learn the n data sets on average.\n\nThere are two ways of measuring the true error of a bias learner. The first is how well it generalises on the n tasks $P_1, \\ldots, P_n$ used to generate the training sets. Assuming that in the process of minimising (3) the learner generates n functions $h_1$, ... 
, $h_n \\in \\mathcal{H}$ with minimal empirical error on their respective training sets (see footnote 1), the learner's true error is measured by:\n\n$$E^n(h_1, \\ldots, h_n, P_1, \\ldots, P_n) = \\frac{1}{n} \\sum_{i=1}^{n} E(h_i, P_i). \\quad (4)$$\n\nNote that in this case the learner's empirical error is given by $\\hat{E}^n(h_1, \\ldots, h_n, D_1, \\ldots, D_n) = \\frac{1}{n} \\sum_{i=1}^{n} \\hat{E}(h_i, D_i)$. The second way of measuring the generalisation error of a bias learner is to determine how good $\\mathcal{H}$ is for learning novel tasks drawn from the environment $(\\mathcal{P}, Q)$:\n\n$$E^*(\\mathcal{H}, Q) = \\int_{\\mathcal{P}} \\inf_{h \\in \\mathcal{H}} E(h, P) \\, dQ(P). \\quad (5)$$\n\nA learner that has found an $\\mathcal{H}$ with a small value of (5) can be said to have learnt to learn the tasks in $\\mathcal{P}$ in general. To state the bounds ensuring these two types of generalisation a few more definitions must be introduced.\n\nDefinition 1 Let $\\mathbb{H} = \\{\\mathcal{H}\\}$ be a hypothesis space family. Let $\\mathbb{H}_\\sigma = \\{h \\in \\mathcal{H}: \\mathcal{H} \\in \\mathbb{H}\\}$. For any $h: X \\to Y$, define a map $h: X \\times Y \\to [0,1]$ by $h(x, y) = (h(x) - y)^2$. Note the abuse of notation: h stands for two different functions depending on its argument. Given a sequence of n functions $\\vec{h} = (h_1, \\ldots, h_n)$ let $\\vec{h}: (X \\times Y)^n \\to [0,1]$ be the function $(x_1, y_1, \\ldots, x_n, y_n) \\mapsto \\frac{1}{n} \\sum_{i=1}^{n} h_i(x_i, y_i)$. Let $\\mathcal{H}^n$ be the set of all such functions where the $h_i$ are all chosen from $\\mathcal{H}$. Let $\\mathbb{H}^n = \\{\\mathcal{H}^n: \\mathcal{H} \\in \\mathbb{H}\\}$. For each $\\mathcal{H} \\in \\mathbb{H}$ define $\\mathcal{H}^*: \\mathcal{P} \\to [0,1]$ by $\\mathcal{H}^*(P) = \\inf_{h \\in \\mathcal{H}} E(h, P)$ and let $\\mathbb{H}^* = \\{\\mathcal{H}^*: \\mathcal{H} \\in \\mathbb{H}\\}$.\n\nFootnote 1: This assumes the infimum in (3) is attained.\n\nDefinition 2 Given a set of functions $\\mathcal{H}$ from any space Z to [0, 1], and any probability measure P on Z, define the pseudo-metric $d_P$ on $\\mathcal{H}$ by\n\n$$d_P(h, h') = \\int_Z |h(z) - h'(z)| \\, dP(z).$$\n\nDenote the size of the smallest $\\varepsilon$-cover of $(\\mathcal{H}, d_P)$ by $\\mathcal{N}(\\varepsilon, \\mathcal{H}, d_P)$. 
Define the $\\varepsilon$-capacity of $\\mathcal{H}$ by\n\n$$C(\\varepsilon, \\mathcal{H}) = \\sup_P \\mathcal{N}(\\varepsilon, \\mathcal{H}, d_P)$$\n\nwhere the supremum is over all discrete probability measures P on Z.\n\nDefinition 2 will be used to define the $\\varepsilon$-capacity of spaces such as $\\mathbb{H}^*$ and $[\\mathbb{H}^n]_\\sigma$, where from definition 1 the latter is $[\\mathbb{H}^n]_\\sigma = \\{\\vec{h} \\in \\mathcal{H}^n: \\mathcal{H} \\in \\mathbb{H}\\}$.\n\nThe following theorem bounds the number of tasks and examples per task required to ensure that the hypothesis space learnt by a bias learner will, with high probability, contain good solutions to novel tasks in the same environment (see footnote 2).\n\nTheorem 1 Let the n training sets $D_1, \\ldots, D_n$ be generated by sampling n times from the environment $\\mathcal{P}$ according to Q to give $P_1, \\ldots, P_n$, and then sampling m times from each $P_i$ to generate $D_i$. Let $\\mathbb{H} = \\{\\mathcal{H}\\}$ be a hypothesis space family and suppose a learner chooses $\\mathcal{H} \\in \\mathbb{H}$ minimizing (3) on $D_1, \\ldots, D_n$. For all $\\varepsilon > 0$ and $0 < \\delta < 1$, if\n\nn \n\nand m \n\nthen \n\nThe bound on m in theorem 1 is also the number of examples required per task to ensure generalisation of the first kind mentioned above. That is, it is the number of examples required in each data set $D_i$ to ensure good generalisation on average across all n tasks when using the hypothesis space family $\\mathbb{H}$. If we let $m(\\mathbb{H}, n, \\varepsilon, \\delta)$ be the number of examples required per task to ensure that\n\n$$\\Pr\\{D_1, \\ldots, D_n: |\\hat{E}^n(h_1, \\ldots, h_n, D_1, \\ldots, D_n) - E^n(h_1, \\ldots, h_n, P_1, \\ldots, P_n)| > \\varepsilon\\} < \\delta,$$\n\nwhere all $h_i \\in \\mathcal{H}$ for some fixed $\\mathcal{H} \\in \\mathbb{H}$, then\n\n$$G(\\mathbb{H}, n, \\varepsilon, \\delta) = \\frac{m(\\mathbb{H}, 1, \\varepsilon, \\delta)}{m(\\mathbb{H}, n, \\varepsilon, \\delta)}$$\n\nrepresents the advantage in learning n tasks as opposed to one task (the ordinary learning scenario). Call $G(\\mathbb{H}, n, \\varepsilon, \\delta)$ the n-task gain of $\\mathbb{H}$. Using the fact [3] that\n\n$$C(\\varepsilon, \\mathbb{H}_\\sigma) \\le C(\\varepsilon, [\\mathbb{H}^n]_\\sigma) \\le C(\\varepsilon, \\mathbb{H}_\\sigma)^n,$$\n\nand the formula for m from theorem 1, we have,\n\n$$1 \\le G(\\mathbb{H}, n, \\varepsilon, \\delta) \\le n.$$\n\nFootnote 2: The bounds in theorem 1 can be improved to $O(1/\\varepsilon)$ if all $\\mathcal{H} \\in \\mathbb{H}$ are convex and the error is the squared loss [7].\n\nThus, at least in the worst case analysis here, learning n tasks in the same environment can result in anything from no gain at all to an n-fold reduction in the number of examples required per task. In the next section a very intuitive analysis of the conditions leading to the extreme values of $G(\\mathbb{H}, n, \\varepsilon, \\delta)$ is given for the situation where an internal representation is being learnt for the environment. I will also say more about the bound on the number of tasks (n) in theorem 1.\n\n3 Learning Internal Representations with Neural Networks\n\nIn figure 1 n tasks are being learnt using a common representation f. In this case $[\\mathbb{H}^n]_\\sigma$ is the set of all possible networks formed by choosing the weights in the representation and output networks. $\\mathbb{H}_\\sigma$ is the same space with a single output node. If the n tasks were learnt independently (i.e. without a common representation) then each task would use its own copy of $\\mathbb{H}_\\sigma$, i.e. we wouldn't be forcing the tasks to all use the same representation.\n\nLet $W_R$ be the total number of weights in the representation network and $W_O$ be the number of weights in an individual output network. Suppose also that all the nodes in each network are Lipschitz bounded (see footnote 3). Then it can be shown [3] that $\\ln C(\\varepsilon, [\\mathbb{H}^n]_\\sigma) = O\\left(\\left(W_O + \\frac{W_R}{n}\\right) \\ln \\frac{1}{\\varepsilon}\\right)$ and $\\ln C(\\varepsilon, \\mathbb{H}^*) = O\\left(W_R \\ln \\frac{1}{\\varepsilon}\\right)$. 
Substituting these bounds into theorem 1 shows that to generalise well on average on n tasks using a common representation requires $m = O\\left(\\frac{1}{\\varepsilon^2}\\left[\\left(W_O + \\frac{W_R}{n}\\right) \\ln \\frac{1}{\\varepsilon} + \\frac{1}{n} \\ln \\frac{1}{\\delta}\\right]\\right) = O\\left(a + \\frac{b}{n}\\right)$ examples of each task. In addition, if $n = O\\left(\\frac{1}{\\varepsilon^2} W_R \\ln \\frac{1}{\\varepsilon}\\right)$ then with high probability the resulting representation will be good for learning novel tasks from the same environment. Note that this bound is very large. However it results from a worst-case analysis and so is highly likely to be beaten in practice. This is certainly borne out by the experiment in the next section.\n\nThe learning gain $G(\\mathbb{H}, n, \\varepsilon)$ satisfies $G(\\mathbb{H}, n, \\varepsilon) \\approx \\frac{W_O + W_R}{W_O + W_R/n}$. Thus, if $W_R \\gg W_O$, $G \\approx n$, while if $W_O \\gg W_R$ then $G \\approx 1$. This is perfectly intuitive: when $W_O \\gg W_R$ the representation network is hardly doing any work, most of the power of the network is in the output networks and hence the tasks are effectively being learnt independently. However, if $W_R \\gg W_O$ then the representation network dominates; there is very little extra learning to be done for the individual tasks once the representation is known, and so each example from every task is providing full information to the representation network. Hence the gain of n.\n\nNote that once a representation has been learnt the sampling burden for learning a novel task will be reduced to $m = O\\left(\\frac{1}{\\varepsilon^2}\\left[W_O \\ln \\frac{1}{\\varepsilon} + \\ln \\frac{1}{\\delta}\\right]\\right)$ because only the output network has to be learnt. 
If this theory applies to human learning then the fact that we are able to learn words, faces, characters, etc. with relatively few examples (a single example in the case of faces) indicates that our \"output networks\" are very small, and, given our large ignorance concerning an appropriate representation, the representation network for learning in these domains would have to be large, so we would expect to see an n-task gain of nearly n for learning within these domains.\n\nFootnote 3: A node $a: \\mathbb{R}^p \\to \\mathbb{R}$ is Lipschitz bounded if there exists a constant c such that $|a(x) - a(x')| < c\\|x - x'\\|$ for all $x, x' \\in \\mathbb{R}^p$. Note that this rules out threshold nodes, but sigmoid squashing functions are okay as long as the weights are bounded.\n\n4 Experiment: Learning Symmetric Boolean Functions\n\nIn this section the results of an experiment are reported in which a neural network was trained to learn symmetric (see footnote 4) Boolean functions. The network was the same as the one in figure 1 except that the output networks $g_i$ had no hidden layers. The input space $X = \\{0, 1\\}^{10}$ was restricted to include only those inputs with between one and four ones. The functions in the environment of the network consisted of all possible symmetric Boolean functions over the input space, except the trivial \"constant 0\" and \"constant 1\" functions. Training sets $D_1, \\ldots, D_n$ were generated by first choosing n functions (with replacement) uniformly from the fourteen possible, and then choosing m input vectors by choosing a random number between 1 and 4 and placing that many 1's at random in the input vector. The training sets were learnt by minimising the empirical error (3) using the backpropagation algorithm as outlined in figure 1. 
Separate simulations were performed with n ranging from 1 to 21 in steps of four and m ranging from 1 to 171 in steps of 10. Further details of the experimental procedure may be found in [3], chapter 4.\n\nOnce the network had successfully learnt the n training sets its generalisation ability was tested on all n functions used to generate the training set. In this case the generalisation error (equation (4)) could be computed exactly by calculating the network's output (for all n functions) for each of the 385 input vectors. The generalisation error as a function of n and m is plotted in figure 2 for two independent sets of simulations. Both simulations support the theoretical result that the number of examples m required for good generalisation decreases with increasing n (cf. theorem 1). For training sets $D_1, \\ldots, D_n$ that led to a generalisation error of less than 0.01, the representation network f was extracted and tested for its true error, where this is defined as in equation (5) (the hypothesis space $\\mathcal{H}$ is the set of all networks formed by attaching any output network to the fixed representation network f). Although there is insufficient space to show the representation error here (see [3] for the details), it was found that the representation error monotonically decreased with the number of tasks learnt, verifying the theoretical conclusions.\n\nFigure 2: Learning surfaces for two independent simulations.\n\nThe representation's output for all inputs is shown in figure 3 for sample sizes (n, m) = (1, 131), (5, 31) and (13, 31). All outputs corresponding to inputs from the same category (i.e. the same number of ones) are labelled with the same symbol. 
The network in the n = 1 case generalised perfectly but the resulting representation does not capture the symmetry in the environment and also does not distinguish the inputs with 2, 3 and 4 \"1's\" (because the function learnt didn't), showing that learning a single function is not sufficient to learn an appropriate representation. By n = 5 the representation's behaviour has improved (the inputs with differing numbers of 1's are now well separated, but they are still spread around a lot) and by n = 13 it is perfect. As well as reducing the sampling burden for the n tasks in the training set, a representation learnt on sufficiently many tasks should be good for learning novel tasks and should greatly reduce the number of examples required for new tasks. This too was experimentally verified although there is insufficient space to present the results here (see [3]).\n\nFigure 3: Plots of the output of a representation generated from the indicated (n, m) sample.\n\nFootnote 4: A symmetric Boolean function is one that is invariant under interchange of its inputs, or equivalently, one that only depends on the number of \"1's\" in its input (e.g. parity).\n\n5 Conclusion\n\nI have introduced a formal model of bias learning and shown that (under mild restrictions) a learner can sample sufficiently many times from sufficiently many tasks to learn bias that is appropriate for the entire environment. In addition, the number of examples required per task to learn n tasks simultaneously was shown to be upper bounded by O(a + b/n) for appropriate environments. 
See [2] for an analysis of bias learning within an information theoretic framework which leads to an exact a + b/n-type bound.\n\nReferences\n\n[1] Y. S. Abu-Mostafa. Learning from Hints in Neural Networks. Journal of Complexity, 6:192-198, 1989.\n\n[2] J. Baxter. A Bayesian Model of Bias Learning. Submitted to COLT 1996, 1995.\n\n[3] J. Baxter. Learning Internal Representations. PhD thesis, Department of Mathematics and Statistics, The Flinders University of South Australia, 1995. Draft copy in Neuroprose Archive under \"/pub/neuroprose/Thesis/baxter.thesis.ps.Z\".\n\n[4] J. Baxter. Learning Internal Representations. In Proceedings of the Eighth International Conference on Computational Learning Theory, Santa Cruz, California, 1995. ACM Press.\n\n[5] R. Caruana. Learning Many Related Tasks at the Same Time with Backpropagation. In Advances in Neural Information Processing 5, 1993.\n\n[6] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Comput., 4:1-58, 1992.\n\n[7] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Sample Complexity of Agnostic Learning with Squared Loss. In preparation, 1995.\n\n[8] T. M. Mitchell and S. Thrun. Learning One More Thing. Technical Report CMU-CS-94-184, CMU, 1994.\n", "award": [], "sourceid": 1112, "authors": [{"given_name": "Jonathan", "family_name": "Baxter", "institution": null}]}